First and Second Order Dynamics in a Hierarchical
SOM system for Action Recognition
Zahra Gharaee, Peter Gärdenfors and Magnus Johnsson
Lund University Cognitive Science,
Helgonavägen 3, 221 00 Lund, Sweden
Human recognition of the actions of other humans is very efficient and is based
on patterns of movements. Our theoretical starting point is that the dynamics of
the joint movements is important to action categorization. On the basis of this
theory, we present a novel action recognition system that employs a hierarchy
of Self-Organizing Maps together with a custom supervised neural network that
learns to categorize actions. The system preprocesses the input from a Kinect
like 3D camera to exploit the information not only about joint positions, but
also their first and second order dynamics. We evaluate our system in two
experiments with publicly available data sets, and compare its performance to
the performance with less sophisticated preprocessing of the input. The results
show that including the dynamics of the actions improves the performance. We
also apply an attention mechanism that focuses on the parts of the body that
are the most involved in performing the actions.
Keywords: Self-Organizing Maps, Conceptual Spaces, Neural Networks,
Action Recognition, Hierarchical Models, Attention, Dynamics
Preprint submitted to Elsevier May 23, 2017
1. Introduction
The success of human-robot interaction depends on the development of ro-
bust methods that enable robots to recognize and predict goals and intentions of
other agents. Humans do this, to a large extent, by interpreting and categoriz-
ing the actions they perceive. Hence, it is central to develop methods for action
categorization that can be employed in robotic systems. This involves an analy-
sis of on-going events from visual data captured by cameras to track movements
of humans and to use this analysis to identify actions. One crucial question is
to know what kind of information should be extracted from observations for an
artificial action recognition system.
Our ambition is to develop an action categorization method that, at large,
works like the human system. We present a theory of action categorization due to Gärdenfors and Warglien (2012) (see also Gärdenfors (2007) and Gärdenfors (2014)) that builds on Gärdenfors's (2000) theory of conceptual spaces. The central idea is that actions are represented by the underlying force patterns. Such patterns can be derived from the second order dynamics of the input data. We present experimental data on how humans categorize actions that support the model. A goal of this article is to show that if the dynamics of actions is considered, the performance of our Self-Organizing Map (SOM) (Kohonen, 1988) based action recognition system can be improved when categorizing actions based on 3D camera input.
The architecture of our action recognition system is composed of a hierarchy
of three neural network layers. These layers have been implemented in different
versions. The first layer consists of a SOM, which is used to represent prepro-
cessed input frames (e.g. posture frames) from input sequences and to extract
their motion patterns. This means that the SOM reduces the data dimensional-
ity of the input and the actions in this layer are represented as activity patterns
over time.
The second layer of the architecture consists of a second SOM. It receives
the superimposed activities in the first layer for complete actions. The super-
imposition of all the activity in the first layer SOM provides a mechanism that
makes the system time invariant. This is because similar movements carried out
at different speed elicit similar sequences of activity in the first layer SOM. Thus
the second layer SOM represents and clusters complete actions. The third layer
consists of a custom made supervised neural network that labels the different
clusters in the second layer SOM with the corresponding action.
We have previously studied the ability of SOMs to learn discriminable rep-
resentations of actions (Buonamente et al., 2015), and we have developed a
hierarchical SOM based action recognition architecture. This architecture has
previously been tested using video input from human actions in a study that
also included a behavioural comparison between the architecture and humans
(Buonamente et al., 2016), and using extracted joint positions from a Kinect
like 3D camera as input with good results (Gharaee et al., 2017a).
This article presents results that suggest that the performance of our ac-
tion recognition architecture can be improved by exploiting not only the joint
positions extracted from a Kinect like 3D camera, but also simultaneously the
information present in their first and second order dynamics.
Apart from analysing the dynamics of the data, we implement an attention
mechanism that is inspired by how human attention works. We model attention
by reducing the input data to those parts of the body that contribute the most in
performing the various actions. Adding such an attention mechanism improves
the performance of the system.
The rest of the paper is organized as follows: First we present the theoretical
background from cognitive science in section 2. The action recognition archi-
tecture is described in detail in section 3. Section 4 presents two experiments
to evaluate the performance of the architecture, employing new kinds of preprocessing that provide dynamic information as additional input. Section 5 concludes the paper.
2. Theoretical Background
When investigating action recognition in the context of human-robot inter-
action, it should first be mentioned that human languages contain two types
of verbs describing actions (Levin and Rappaport Hovav, 2005; Warglien et al.,
2012). The first type is manner verbs that describe how an action is performed.
In English, some examples are run, swipe, wave, push, and punch. The sec-
ond type is result verbs that describe the result of actions. In English, some
examples are move, heat, clean, enter, and reach.
In the context of robotics, research has focused on how result verbs can be
modelled (e.g. Cangelosi et al. (2008), Kalkan et al. (2014), Lallee et al. (2010)
and Demiris and Khadhouri (2006)). However, when it comes to human-robot
interaction, the robot should also be able to recognize human actions by the
manner they are performed. This is often called recognition of biological motion
(Hemeren, 2008). Recognising manner actions is important in particular if the
robot is supposed to model the intentions of a human. In the literature, there
are some systems for categorizing human actions described by manner verbs, e.g.
Giese and Lappe (2002) and Giese et al. (2008). However, these systems have
not been developed with the aim of supporting human-robot interaction. Our
aim in this article is to present a system that recognises a set of manner actions.
Our future aim is, however, to integrate this with a system for recognising result verbs that can be used in linguistic interactions between a human and a robot (see Cangelosi et al. (2008) and Mealier et al. (2016) for examples of such linguistic interactions).
Results from the cognitive sciences indicate that the human brain performs
a substantial information reduction when categorizing human manner actions.
Johansson (1973) has shown that the kinematics of a movement contain suf-
ficient information to identify the underlying dynamic patterns. He attached
light bulbs to the joints of actors who were dressed in black and moved in a
black room. The actors were filmed performing actions such as walking, run-
ning, and dancing. Watching the films - in which only the dots of light could be
seen - subjects recognized the actions within tenths of a second. Further exper-
iments by Runesson and Frykholm (1983), see also (Runesson, 1994), show that
subjects extract subtle details of the actions performed, such as the gender of
the person walking or the weight of objects lifted (where the objects themselves
cannot be seen).
One lesson to learn from the experiments by Johansson and his followers is
that the kinematics of a movement contains sufficient information to identify
the underlying dynamic force patterns. Runesson (1994) claims that people can
directly perceive the forces that control different kinds of motion. He formulates
the following thesis:
Kinematic specification of dynamics: The kinematics of a movement contains
sufficient information to identify the underlying dynamic force patterns.
From this perspective, the information that the senses - primarily vision -
receive about the movements of an object or individual is sufficient for the brain
to extract, with great precision, the underlying forces. Furthermore, the process
is automatic: one cannot help but perceive the forces.
Given these results from perceptual psychology, the central problem for
human-robot interaction now becomes how to construct a model of action recog-
nition that can be implemented in a robotic system. One idea for such a model
comes from Marr and Vaina (1982) and Vaina (1983), who extend Marr and
Nishihara’s (1978) cylinder models of objects to an analysis of actions. In Marr’s
and Vaina’s model, an action is described via differential equations for move-
ments of the body parts of, for example, a walking human. What we find useful
in this model is that a cylinder figure can be described as a vector with a lim-
ited number of dimensions. Each cylinder can be described by two dimensions:
length and radius. Each joining point in the figure can be described by a small
number of coordinates for point of contact and angle of joining cylinder. This
means that, at a particular moment, the entire figure can be written as a (hier-
archical) vector of a fairly small number of dimensions. An action then consists
of a sequence of such vectors. In this way, the model involves a considerable
reduction of dimensionality in comparison to the original visual data. Further
reduction of dimensionality is achieved in a skeleton model.
It is clear that, using Newtonian mechanics, one can derive the differential
equations from the forces applied to the legs, arms, and other moving parts
of the body. For example, the pattern of forces involved in the movements of
a person running is different from the pattern of forces of a person walking;
likewise, the pattern of forces for saluting is different from the pattern of forces
for throwing (Vaina and Bennour, 1985).
The human cognitive apparatus is not exactly evolved for Newtonian me-
chanics. Nevertheless, Gärdenfors (2007) (see also Warglien et al. (2012) and Gärdenfors (2014)) proposed that the brain extracts the forces that lie behind
different kinds of movements and other actions:
Representation of actions: An action is represented by the pattern of forces
that generates it.
We speak of a pattern of forces since, for bodily motions, several body parts
are involved; and thus, several force vectors are interacting (by analogy with
Marr’s and Vaina’s differential equations). Support for this hypothesis will be
presented below. One can represent these patterns of forces in principally the
same way as the patterns of shapes described in Gärdenfors (2014), section
6.3. In analogy with shapes, force patterns also have meronomic structure. For
example, a dog with short legs moves in a different way than a dog with long legs.
This representation fits well into the general format of conceptual spaces
presented by Gärdenfors (2000, 2007). In order to identify the structure of the
action domain, similarities between actions should be investigated. This can be
accomplished by basically the same methods used for investigating similarities
between objects. Just as there, the dynamic properties of actions can be judged
with respect to similarities: for example, walking is more similar to running
than to waving. Very little is known about the geometric structure of the action
domain, except for a few recent studies that we will present below. We assume
that the notion of betweenness is meaningful in the action domain, allowing us
to formulate the following thesis in analogy to the thesis about properties (see
Gärdenfors (2000, 2007) and Gärdenfors and Warglien (2012)):
Thesis about action concepts: An action concept is represented as a convex
region in the action domain.
One may interpret convexity here as the assumption that, given two actions
in the region of an action concept, any linear morph between those actions will
fall under the same concept.
One way to support the analogy between how objects and how actions are
represented in conceptual space is to establish that action concepts share a sim-
ilar structure with object categories (Hemeren, 2008, p. 25). Indeed, there
are strong reasons to believe that actions exhibit many of the prototype effects
that Rosch (1975) presented for object categories. In a series of psychological
experiments, Hemeren (2008) showed that action categories show a similar hi-
erarchical structure and have similar typicality effects to object concepts. He
demonstrated a strong inverse correlation between judgements of most typical
actions and reaction time in a word/action verification task.
Empirical support for the thesis about action concepts as regards body move-
ments can also be found in Giese and Lappe (2002). Using Johansson’s (1973)
patch-light technique, they started from video recordings of natural actions such
as walking, running, limping, and marching. By creating linear combinations
of the dot positions in the videos, they then made films that were morphs of
the recorded actions. Subjects watched the morphed videos and were asked to
classify them as instances of walking, running, limping, or marching, as well
as to judge the naturalness of the actions. Giese and Lappe did not explicitly
address the question of whether the actions recognized form convex regions in
the force domain. However, their data clearly support this thesis.
Another example of data that can be used to study force patterns comes
from Wang et al. (2004). They collected data from the walking patterns of
humans under different conditions. Using the methods of Giese et al. (2008),
these patterns can be used to calculate the similarity of the different gaits.
A third example is Malt et al. (2014) who studied how subjects named
the actions shown in 36 video clips of different types of walking, running and
jumping. The subjects were native speakers of English, Spanish, Dutch, and
Japanese. The most commonly produced verb for each clip in each language
was calculated. This generated a number of verbs also including several subcat-
egories. Another group of subjects, again native speakers of the four languages,
judged the physical similarity of the actions in the video clips. Based on these
judgements a two-dimensional multidimensional scaling solution was calculated.
The verbs from the first group were then mapped onto this solution. Figure 4 in
Malt et al. (2014) shows the results for the most common English action word
for each video clip. The results support the thesis that regions corresponding
to the names are convex.
The action recognition system presented in this article has similarities with
these models in the sense that actions are represented as sequences of vectors,
and it categorizes the actions on the basis of their similarities. In our system,
similarity is modelled as closeness in SOMs. We next turn to a description of
the architecture of the system.
3. Hierarchical SOM Architecture for Action Recognition
Our action recognition system consists of a three layered neural network
architecture (Fig. 1). The first layer consists of a SOM that develops an ordered
representation of preprocessed input. The second layer consists of a second SOM
that receives, as input, the superimposed sequence of activity elicited during an
action in the first layer SOM. Thus the second layer SOM develops an ordered
representation of the activity traces in the first layer SOM that correspond to
different actions. Finally, the third layer is a custom supervised neural network
that associates activity representing action labels to the activity in the second
layer SOM.
3.1. Preprocessing
From a computational point of view, there are many challenges that make
the action recognition task difficult to imitate artificially. For example, the act-
ing individuals differ in height, weight and bodily proportions. Other important
Figure 1: The architecture of the action recognition system, consisting of three layers where
the first and second layers are SOMs, and the third layer is a custom supervised neural
network. The input is a combination of joint positions extracted from the output of a Kinect
like depth camera. 9
Figure 2: Different body orientations; turned to the left (a), front direction (b), turned to
the right (c) and the joints used to calculate the egocentric coordinate system (d).
issues to be addressed are the impact of the camera’s viewing angle and distance
from the actor and the performance speed of the actions. In brief, categoriza-
tions of actions ought to be invariant under distance, viewing angle, size of the
actor, lighting conditions and temporal variations.
3.1.1. Ego-Centered Transformation
To make the action recognition system invariant to different orientations of the action performing agents, a coordinate transformation into an ego-centered coordinate system (Fig. 2), located in the central joint (the stomach in Fig. 5), is applied to the extracted joint positions. To build this new coordinate system, the three joints named Right Hip, Left Hip and Stomach are used.
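As an illustration of this step, the following is a minimal sketch of how such an ego-centered transformation could be computed. The axis conventions (lateral axis from left hip to right hip, up direction from the hip centre towards the stomach) and the joint indices are assumptions made for the example; the text only states which three joints are used.

```python
import numpy as np

def ego_transform(joints, stomach_idx, right_hip_idx, left_hip_idx):
    """Express all joint positions in an ego-centered frame anchored at the stomach.

    joints: (20, 3) array of joint positions in camera coordinates.
    The axis conventions below are illustrative assumptions.
    """
    joints = np.asarray(joints, dtype=float)
    origin = joints[stomach_idx]
    # Lateral axis: from the left hip to the right hip.
    x_axis = joints[right_hip_idx] - joints[left_hip_idx]
    x_axis /= np.linalg.norm(x_axis)
    # Rough "up" direction: from the hip centre towards the stomach.
    hip_centre = 0.5 * (joints[right_hip_idx] + joints[left_hip_idx])
    up = origin - hip_centre
    # Forward axis: perpendicular to the lateral and up directions.
    z_axis = np.cross(x_axis, up)
    z_axis /= np.linalg.norm(z_axis)
    # Re-orthogonalized up axis completes the right-handed basis.
    y_axis = np.cross(z_axis, x_axis)
    R = np.stack([x_axis, y_axis, z_axis])   # rows are the new basis vectors
    return (joints - origin) @ R.T           # coordinates in the ego-centered frame
```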
3.1.2. Scaling
The other preprocessing mechanism applied to the joint positions is the
scaling transformation to make the system invariant to the distance from the
depth camera. By scaling to a standard size, the representations of the agent always have a fixed size even if the action performers are at different distances from the depth camera (Fig. 3).
Figure 3: Different skeleton sizes due to different distances from the depth camera.
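A corresponding sketch of the scaling step is shown below. Using the torso length (stomach to neck) as the reference measure and a unit target length are illustrative assumptions; the text only states that skeletons are rescaled to a fixed standard size.

```python
import numpy as np

def scale_skeleton(joints, neck_idx, stomach_idx, target_length=1.0):
    """Rescale an ego-centered skeleton to a standard size.

    The reference measure (torso length) and the target value are
    illustrative assumptions.
    """
    joints = np.asarray(joints, dtype=float)
    torso_length = np.linalg.norm(joints[neck_idx] - joints[stomach_idx])
    return joints * (target_length / torso_length)
```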
3.1.3. Attention
As humans, we learn how to control our attention when performing various tasks (Shariatpanahi and Ahmadabadi, 2007). Utilizing an attention mechanism can, together with other factors (e.g., state estimation), improve our performance in many other tasks (Gharaee et al., 2014). One of the strongest factors that attract our attention is movement. Here, we therefore assume that when actions are performed, the observer pays attention to the body parts that are most involved in the actions. Thus, in the experiments we have applied attention mechanisms that direct the attention towards the body parts that move the most. For example, when the agent is clapping hands, the attention is
focused on the arms and no or very little attention is directed towards the rest
of the agent’s body.
Our attention mechanisms are simulated by only extracting some of the
joint coordinates. These coordinates are determined by how much they are
involved in performing the actions. These mechanisms reduce the number of input dimensions entering the first-layer SOM and help the system to recognize actions by focusing on the more relevant parts of the body while
ignoring the less relevant parts. This procedure increases the accuracy of the
system. The attention mechanism applied to the system will be described in
the experiment section below.
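A minimal sketch of such a movement-based attention mechanism is given below. It scores each joint by its summed frame-to-frame displacement over a sequence and keeps the top-scoring joints; this particular scoring rule, and applying it automatically per sequence, are assumptions for illustration, since in the experiments the attended joints were selected per action category.

```python
import numpy as np

def most_moving_joints(frames, n_selected):
    """Return the indices of the joints that move the most in one sequence.

    frames: (T, J, 3) array of preprocessed joint positions for one action.
    """
    frames = np.asarray(frames, dtype=float)
    displacement = np.linalg.norm(np.diff(frames, axis=0), axis=2)  # (T-1, J)
    total_motion = displacement.sum(axis=0)                         # (J,)
    return np.argsort(total_motion)[-n_selected:]

# The selected joints form the reduced input to the first-layer SOM, e.g.:
# selected = most_moving_joints(frames, 4)
# reduced_input = frames[:, selected, :].reshape(len(frames), -1)
```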
3.1.4. Dynamics
By using postures composed of 3D joint positions as input to our architecture, a high performance of our action recognition system has been obtained (Gharaee et al., 2017a). In addition to their 3D positions, the body joints have velocity and acceleration, which can be modeled as the first and second order dynamics. The first order dynamics (velocity) determines the speed and direction of a joint's movement, while the acceleration, via Newton's second law, determines the direction of the force vector applied to the joint during the action.
In this study, the first and second order joint dynamics have been extracted
and used together with the 3D positions. This has been done by calculating the
differences between consecutive sets of joint positions, and between consecutive
sets of first order joint dynamics in turn. This has enabled us to investigate
how the inclusion of the dynamics contributes to the performance of our action
recognition system.
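The following sketch shows how the merged input described above could be assembled, with velocity and acceleration approximated by consecutive frame differences. Trimming the first frames so that the three streams have equal length, and concatenating them per frame, are implementation assumptions.

```python
import numpy as np

def merge_dynamics(frames):
    """Combine joint positions with their first and second order dynamics.

    frames: (T, J, 3) array of joint positions for one action sequence.
    Returns one flat input vector per remaining frame.
    """
    frames = np.asarray(frames, dtype=float)
    velocity = np.diff(frames, n=1, axis=0)       # first order dynamics
    acceleration = np.diff(frames, n=2, axis=0)   # second order dynamics
    n = len(acceleration)
    # Align the streams on the last n frames and concatenate per frame.
    merged = np.concatenate([frames[-n:], velocity[-n:], acceleration], axis=1)
    return merged.reshape(n, -1)
```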
3.2. The First and Second Layer SOMs
The first two layers of our architecture consist of SOMs. The SOMs self-organize into dimensionality-reduced and discretized topology-preserving representations of their respective input spaces. Due to the topology-preserving property, nearby parts of the trained SOM respond to similar input patterns. This
is reminiscent of the cortical maps found in mammalian brains. The topology-
preserving property of SOMs is a consequence of the use of a neighbourhood
function during the adaptation of the neuron responses, which means the adap-
tation strength of a neuron is a decreasing function of the distance to the most
activated neuron in the SOM. This also provides the SOM, and in the extension
our action recognition system, with the ability to generalize learning to novel
inputs, because similar inputs elicit similar activity in the SOM. Thus similar
actions composed of similar sequences of agent postures and dynamics will elicit
similar activity trajectories in the first layer SOM, and similar activity trajecto-
ries in the first layer SOM, and thus similar actions, will be represented nearby
in the second layer SOM. Since a movement performed at different speeds will
elicit activity along the same trajectory in the first layer SOM when the input
consists of a stream of sets of joint positions, as well as when dynamics is added
(as long as the action’s internal dynamic relations are preserved), our action
recognition system also achieves time invariance.
The SOM consists of an $I \times J$ grid of neurons with a fixed number of neurons and a fixed topology. Each neuron $n_{ij}$ is associated with a weight vector $w_{ij} \in \mathbb{R}^n$ with the same dimensionality as the input vectors. All the elements of the weight vectors are initialized by real numbers randomly selected from a uniform distribution between 0 and 1, after which all the weight vectors are normalized, i.e. turned into unit vectors.
At time $t$ each neuron $n_{ij}$ receives the input vector $x(t) \in \mathbb{R}^n$. The net input $s_{ij}(t)$ at time $t$ is calculated using the Euclidean metric:
$$s_{ij}(t) = ||x(t) - w_{ij}(t)|| \qquad (1)$$
The activity $y_{ij}(t)$ at time $t$ is calculated by using the exponential function:
$$y_{ij}(t) = e^{-s_{ij}(t)/\sigma} \qquad (2)$$
The parameter $\sigma$ is the exponential factor, set to $10^6$, and $0 \le i < I$, $0 \le j < J$, $i, j \in \mathbb{N}$. The role of the exponential function is to normalize and increase the contrast between highly activated and less activated areas.
The neuron $c$ with the strongest activation is selected:
$$c = \arg\max_{ij} y_{ij}(t) \qquad (3)$$
The weights $w_{ijk}$ are adapted by
$$w_{ijk}(t+1) = w_{ijk}(t) + \alpha(t) G_{ijc}(t) [x_k(t) - w_{ijk}(t)] \qquad (4)$$
Figure 4: Ordered Vector Representation Process. The figure shows the patterns in the first
layer SOM elicited by the same action performed at different rates. When the action is
performed slowly (right) there are more activations, a higher number of activated neurons
and the same neurons may be activated repeatedly (the darker and larger arrows), than when
the action is performed quickly (left). In both cases the activations are on the same path
in the first layer SOM. The ordered vector representation process creates a representation
of the path (lower) designed to be independent of the performance rate, thus achieving time
invariance for the system.
The term $0 \le \alpha(t) \le 1$ is the adaptation strength, with $\alpha(t) \to 0$ when $t \to \infty$. The neighbourhood function $G_{ijc}(t) = e^{-||r_c - r_{ij}||^2 / 2\sigma^2(t)}$ is a Gaussian function decreasing with time, and $r_c \in \mathbb{R}^2$ and $r_{ij} \in \mathbb{R}^2$ are the location vectors of neurons $c$ and $n_{ij}$ respectively. All weight vectors $w_{ij}(t)$ are normalized after each adaptation step.
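A compact sketch of one adaptation step of such a SOM, following Eqs. (1)-(4), is given below. The decay schedules for $\alpha(t)$ and $\sigma(t)$ are not specified in the text, so fixed placeholder values are used, and storing the weights as an (I, J, n) array is an implementation assumption.

```python
import numpy as np

def som_step(weights, x, alpha=0.1, sigma_nb=2.0, sigma_exp=1e6):
    """One adaptation step of the first- or second-layer SOM (Eqs. 1-4).

    weights: (I, J, n) array of normalized weight vectors, updated in place;
    x: input vector of dimension n.  alpha and sigma_nb stand in for the
    time-dependent schedules alpha(t) and sigma(t), which are assumptions here.
    """
    I, J, _ = weights.shape
    s = np.linalg.norm(x - weights, axis=2)                 # Eq. 1: net input
    y = np.exp(-s / sigma_exp)                              # Eq. 2: activity
    c = np.unravel_index(np.argmax(y), (I, J))              # Eq. 3: winning neuron
    rows, cols = np.indices((I, J))
    dist_sq = (rows - c[0]) ** 2 + (cols - c[1]) ** 2
    G = np.exp(-dist_sq / (2.0 * sigma_nb ** 2))            # Gaussian neighbourhood
    weights += alpha * G[..., None] * (x - weights)         # Eq. 4: weight update
    weights /= np.linalg.norm(weights, axis=2, keepdims=True)  # renormalize
    return y, c
```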
3.3. Ordered Vector Representation
Ordered vector representation of the activity traces unfolding in the first-
layer SOM during an action is a way to transform an activity pattern over time
into a spatial representation which will then be represented in the second-layer
SOM. In addition, as mentioned briefly above, it provides the system with time
invariance. This means that if an action is carried out several times, but at
different rates, the activity trace will progress along the same path in the first-
layer SOM. If this path is coded in a consistent way, the system achieves a time
invariant encoding of actions. If instead the activity patterns in the first-layer
SOM were transformed into spatial representations by, for example, using the
sequence of centres of activity in the first-layer SOM, time invariance would not
be achieved. A very quick performance of an action would then yield a small
number of centres of activity along the path in the first-layer SOM, whereas a
very slow performance would yield a larger number of centres of activity, some
of them sequentially repeated in the same neurons (see Fig. 4).
Strictly speaking, time invariance is only achieved when using sets of joint
positions as input to the system, because then the first-layer SOM develops
a topology preserving representation of postures. When using first or second
order dynamics this is no longer the case, unless the action’s internal dynamic
relations are preserved, because then an action’s representation in the first-order
SOM will vary depending on the dynamics. On the other hand, some discriminable information is added for the system. For discrimination performance, the benefits of adding the dynamics to the input seem to outweigh the drawbacks.
The ordered vector representation process used in this study works as follows.
The length $\ell_j$ of the activity trace of an action $j$ is calculated by:
$$\ell_j = \sum_{i=1}^{N-1} ||P_{i+1} - P_i||_2 \qquad (5)$$
The parameter $N$ is the total number of centres of activity for the action sequence $j$, and $P_i$ is the $i$th centre of activity in the same action sequence.
Suitable lengths of segments to divide the activity trace for action sequence $j$ in the first-layer SOM are calculated by:
$$d_j = \ell_j / N_{Max} \qquad (6)$$
The parameter $N_{Max}$ is the number of centres of activity of the longest path in the first-layer SOM elicited by the $M$ actions in the training set. Each activity trace in the first-layer SOM, generated by an action, is divided into segments of length $d_j$, and the coordinates of the borders of these segments, in the order they appear from the start to the end of the activity trace, are composed into a vector used as input to the second-layer SOM.
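As a sketch of the ordered vector representation, the function below measures the trace length (Eq. 5), derives the segment length (Eq. 6), and reads off the segment borders in order along the trace. Collapsing repeated centres of activity and interpolating border coordinates along the polyline are implementation assumptions.

```python
import numpy as np

def ordered_vector(centres, n_max):
    """Resample an activity trace into a fixed-length ordered vector (Eqs. 5-6).

    centres: (N, 2) array of consecutive centres of activity of one action in
    the first-layer SOM; n_max: number of centres of the longest training trace.
    """
    centres = np.asarray(centres, dtype=float)
    # Collapse consecutive repetitions of the same centre (slow performances).
    keep = np.concatenate([[True], np.any(np.diff(centres, axis=0) != 0, axis=1)])
    centres = centres[keep]
    steps = np.linalg.norm(np.diff(centres, axis=0), axis=1)
    length = steps.sum()                      # Eq. 5: length of the trace
    d = length / n_max                        # Eq. 6: segment length
    cumulative = np.concatenate([[0.0], np.cumsum(steps)])
    # Border positions along the trace, from start to end.
    targets = np.clip(np.arange(n_max + 1) * d, 0.0, cumulative[-1])
    border_x = np.interp(targets, cumulative, centres[:, 0])
    border_y = np.interp(targets, cumulative, centres[:, 1])
    return np.stack([border_x, border_y], axis=1).ravel()   # second-layer input
```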
3.4. Output Layer
The output layer consists of an $I \times J$ grid with a fixed number of neurons and a fixed topology. Each neuron $n_{ij}$ is associated with a weight vector $w_{ij} \in \mathbb{R}^n$. All the elements of the weight vector are initialized by real numbers randomly selected from a uniform distribution between 0 and 1, after which the weight vector is normalized, i.e. turned into a unit vector.
At time $t$ each neuron $n_{ij}$ receives an input vector $x(t) \in \mathbb{R}^n$.
The activity $y_{ij}(t)$ at time $t$ in the neuron $n_{ij}$ is calculated using the standard cosine metric:
$$y_{ij}(t) = \frac{x(t) \cdot w_{ij}(t)}{||x(t)||\,||w_{ij}(t)||} \qquad (7)$$
During the learning phase the weights $w_{ijl}$ are adapted by
$$w_{ijl}(t+1) = w_{ijl}(t) + \beta [y_{ij}(t) - d_{ij}(t)] \qquad (8)$$
The parameter $\beta$ is the adaptation strength and $d_{ij}(t)$ is the desired activity for the neuron $n_{ij}$ at time $t$.
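For illustration, a minimal sketch of the output layer's forward pass (Eq. 7) and of reading out an action label is shown below. Storing the weights as one row per output neuron and mapping the most active neuron directly to an action name are assumptions made for the example; in the experiments the output layer has one neuron per action class.

```python
import numpy as np

def output_activity(weights, x):
    """Eq. 7: cosine similarity between the input and each output neuron.

    weights: (n_neurons, n) weight matrix of the output layer;
    x: ordered vector representation derived from the second-layer SOM.
    """
    dots = weights @ x
    norms = np.linalg.norm(weights, axis=1) * np.linalg.norm(x)
    return dots / norms

def classify(weights, x, action_names):
    """Return the action whose output neuron is most active (illustrative)."""
    return action_names[int(np.argmax(output_activity(weights, x)))]
```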
4. Experiments
We have evaluated the performance of our recognition architecture in two
separate experiments using publicly available data from the repository MSR
Action Recognition Datasets and Codes (Wan, accessed 2015). Each experiment
uses 10 different actions (in total 20 different actions) performed by 10 different
subjects in 2 to 3 different events. The action data are composed of sequences
Figure 5: The 3D information of the joints of the skeleton extracted from the 3D camera.
of sets of joint positions obtained by a depth camera, similar to a Kinect sensor.
Each action sample is composed of a sequence of frames where each frame
contains 20 joint positions expressed in 3D Cartesian coordinates as shown in
Fig. 5. The sequences composing the action samples vary in length. We have
used the Ikaros framework (Balkenius et al., 2010) to design and implement our experiments.
4.1. Experiment 1
In the first experiment, we have selected a set of actions from MSR Action3D
Dataset containing 276 samples of 10 different actions performed by 10 different
subjects, each in 2 to 3 different events. The actions of the first experiment can be
described as: 1. High Arm Wave, 2. Horizontal Arm Wave, 3. Using Hammer,
4. Hand Catch, 5. Forward Punch, 6. High Throw, 7. Draw X, 8. Draw Tick,
9. Draw Circle, 10. Tennis Swing. The first subset of actions was split into
a training set containing 80% of the action instances randomly selected from
Figure 6: The attention mechanism in the first experiment is obtained by setting the focus
of attention to the left arm which is involved in performing all of the actions.
the original dataset and a test set containing the remaining 20% of the action instances.
The attention mechanism applied in this experiment is set to focus the attention on the arm that is mainly involved in performing all of the actions, that is, the left arm for this experiment (see Fig. 6). The action
recognition architecture was trained with randomly selected instances from the
training set in two phases, the first training the first-layer 30 × 30 neuron SOM, and the second training the second-layer 35 × 35 neuron SOM together with the output layer containing 10 neurons.
In this experiment we continued the study of Gharaee et al. (2017a) in order
to investigate the effect of including dynamics on the action recognition performance of the system. We achieved a recognition accuracy of 83% for the postures (see Gharaee et al. (2017a)). To examine the first and second order dynamics separately, we first used the first order dynamics (velocity) as the input to our system and obtained a recognition accuracy of 75%, and then the second order dynamics (acceleration) as the input to our system, where
Figure 7: Classification results of all actions when using as input only the joint positions,
only the joint velocities (first derivatives), only the joint accelerations (second derivatives)
or their combination (merged). Results with the training set during training (uppermost).
Results for the fully trained system with the test set (lowermost). As can be seen, the best
result was achieved when using the combined (merged) input.
the recognition accuracy was 52%. In the third step we merged the postures with the first and second order dynamics (position, velocity and acceleration) as the input to the system (merged system). The categorization results show that 87% of all test sequences were correctly categorized. The
results of using position, velocity or acceleration as input data one by one and
then all together are depicted in Fig. 7. As the figure shows, when we used
the combination of inputs in the merged system, 6 out of 10 actions were 100%
correctly categorized, see Fig. 8.
Fig. 9 shows how the performance can be improved by using as input a
combination of joint positions, joint velocities (first order dynamics) and joint accelerations (second order dynamics) compared to when each of these kinds
of input is used alone.
The results for the systems using only joint velocities and joint accelerations
are not as good as the result for the system using only joint positions. One
Figure 8: Classification results of the action recognition system when receiving the combined
input of joint positions and their first and second order dynamics. Results with the training
set during training (uppermost). Results for the fully trained system per action with the test
set (lowermost).
Figure 9: Comparison of the classification performance, per action, of the action recognition
system when using as input only the joint positions, only the joint velocities (first derivatives),
only the joint accelerations (second derivatives) or their combination (merged).
possible explanation for this is the low quality of the input data. The algorithm
that extracts the skeleton data from the camera input is often not delivering
biologically realistic results. The errors in joint positions that occur in the data
set generated by the skeleton algorithm are magnified when the first derivatives
are calculated for the joint velocities and doubly magnified when the second
derivatives are calculated for the accelerations. We therefore believe that if our system were to be tested on a dataset with smaller errors, then the velocity and
acceleration systems would perform better.
4.2. Experiment 2
In the second experiment we used the rest of the actions in the MSR Action 3D
Dataset. These actions are as follows: 1. Hand Clap, 2. Two Hand Wave, 3.
Side Boxing, 4. Forward Bend, 5. Forward Kick, 6. Side Kick, 7. Jogging, 8.
Tennis Serve, 9. Golf Swing, 10. Pick up and Throw. This second set of actions
was split into a training set containing 75% of the action instances randomly
selected from the original dataset and a test set containing the remaining 25%
of the action instances.
The action recognition architecture had the same settings for both experi-
ments so the first layer SOM contained 30 ×30 neurons, and the second layer
SOM contained 35 ×35 neurons and the output layer contained 10 neurons.
The attention mechanism used in this experiment is not as simple as in the first experiment because of the nature of the actions of the second subset. These actions involve more varying parts of the body, including the arms as well as the legs. Therefore, extracting the body part that should form the focus of attention is not simple. The attention focus is determined, for each separate action, by a separate selection of the body parts that move the most, which is inspired by human behaviour when observing the performed actions.
For example, the action Forward Bend mainly involves the base part of the
body which is composed of joints named Head, Neck, Torso and Stomach, see
Fig. 5, so the attention is focused on the base part, which includes the mentioned
joints. Another example is the action Jogging which involves arms and legs so
the attention is focused on the joints Left Ankle, Left Wrist, Right Ankle and
Right Wrist. For the actions named Hand Clap, Two Hand Wave and Side
Boxing, Tennis Serve, Golf Swing and Pick up and Throw, the attention is
focused on the joints Left Elbow, Left Wrist, Right Elbow and Right Wrist.
Finally for the actions Forward Kick and Side Kick the attention is focused on
the joints Left Knee, Left Ankle, Right Knee and Right Ankle.
The attention mechanism significantly improves the performance of the sys-
tem. Without using attention the system could only obtain an accuracy around
70%. Therefore it is important to highlight the contribution of attention in
developing the SOM architecture into a more accurate and optimal system.
In the second experiment we first used only posture data as input to the
architecture and reached a performance of 86% correct recognitions of actions.
In the next step we used merged input, i.e. posture data together with its first
order dynamics (position and velocity). The categorization results show that 90% of all test sequences were correctly categorized. As can be seen in Fig. 10,
5 of the 10 actions are 100% correctly categorized and the rest also have a very
high accuracy.
Fig. 11 shows a comparison of the performance when using only posture
data (no dynamics) and when adding the dynamics to the system. Though the accuracy for the actions Two Hand Wave and Side Boxing is reduced when the dynamics is added to the system, the accuracy for the actions Forward Kick, Golf Swing and Pick Up and Throw is significantly improved. So in total the
recognition performance was improved by adding the dynamics to the system.
The classification results of the two experiments can be compared by looking
at the trained second layer SOMs shown in Fig. 12 (corresponding to the merged
system of the first experiment) and 13 (corresponding to the merged system of
the second experiment). As shown in these figures, for many of the actions there
are wide activated areas of neurons. This is especially the case in the first experiment (see Fig. 12). This reflects the fact that the actions are performed in
multiple styles. This is why an action can be represented in multiple regions
in the second layer SOM. This effect can make the classification more difficult
Figure 10: Classification results of the action recognition system when receiving the combined
input of joint positions and their first order dynamics (merged system). Results with the
training set during training (uppermost). Results for the fully trained system, per action,
with the test set (lowermost).
Figure 11: Comparison of the classification performance, per action, of the action recognition
system when using as input only the joint positions and the combination of the joint positions
with their first order dynamics (merged).
Figure 12: Activations by the training set in the trained second layer SOM in the first
experiment. The map is divided into 9 regions, starting from the bottom left square (region 1) and ending at the upper right square (region 9). As can be seen, the activations are spread
out over several regions for many actions.
Figure 13: Activations by the training set in the trained second layer SOM in the second
experiment. The map is divided into 9 regions, starting from the bottom left square (region 1) and ending in the upper right square (region 9). As can be seen, the activations are spread
out over several regions for some actions, but not for others (such as Two Hand Wave, Bend,
Forward Kick and Jogging) which are contained in single regions.
Figure 14: Activations by the test set in the trained second layer SOM in the first experiment.
The map is divided into 9 regions, starting from the bottom left square (region 1) and ending in the upper right square (region 9). As can be seen, the activations are spread out over several
regions for many actions.
due to overlapping of the activated areas of different actions. This makes it
important to use a sufficient number of samples of each action performance
style to train the system properly and to improve the accuracy. Thus many
actions form several sub-clusters in the second layer SOMs, which is depicted
in Fig. 12 and Fig. 13 for the training data and in Fig. 14 and Fig. 15 for the
test data of the first and the second experiment respectively. In these figures we
see the neurons activated by each performed action sample. It can be observed
that for several actions there are activated neurons in more than one region.
To understand this better, the percentage of activations by an action in
each region has been calculated (Fig. 16) for the training data of the second
experiment. By comparing the percentage of activated areas belonging to each
action we can see that the actions named Two Hand Wave, Bend, Forward Kick
and Jogging activated neurons belonging to only one of the 9 regions. This
means that these actions can be considered to form only one cluster. For the
other actions the representations are spread out in several regions.
The performance accuracy we have obtained in these experiments can be
Figure 15: Activations by the test set in the trained second layer SOM in the second experi-
ment. The map is divided into 9 regions, starting from the bottom left square (region 1) and ending in the upper right square (region 9). As can be seen, the activations are spread out over several regions for some actions, but not for others (such as Two Hand Wave, Bend, Forward Kick and Jogging), which are contained in single regions. This is similar to the training set in the second experiment.
compared to other relevant studies on action recognition. In the literature one finds several action recognition systems that are validated and tested on the MSR Action 3D dataset (Wan, accessed 2015). Our results show a significant accuracy improvement over the 74.7% of the state of the art system introduced in Li et al. (2010). Even though there is a difference in the way the data
set is divided, we can show that our hierarchical SOM architecture outperforms
many of the systems tested on the MSR Action 3D data set (such as the sys-
tems introduced in Oreifej et al. (2013), Wang et al. (2012b), Vieira et al. (2012),
Wang et al. (2012a), Xia et al. (2012) and Xia and Aggarwal (2013)).
Among the systems using self-organizing maps for action recognition, we can refer to Huang and Wu (2010), a different SOM based system for human action recognition in which a different data set of 2D contours is used as the input data. Although our system and the data set we use are quite different from those used in Huang and Wu (2010), our hierarchical
Figure 16: The percentage of activations in each region in the second layer SOM for each
action. The values of this table are calculated only for the training data set used in the
second experiment as a sample to indicate that the better accuracy in the second experiment
might be due to how the second layer SOM is formed.
Figure 17: The results of the experiments using MSR Action 3D data with the Hierarchical
SOM architecture divided into the results when using postures only and when using postures
together with the dynamics.
SOM architecture outperforms this system too. In Buonamente et al. (2016), a three layer hierarchical SOM architecture is also used for action recognition. It has some similarities with the hierarchical SOM architecture presented in this study, apart from differences in units such as the preprocessing and the ordered vector representation. The system introduced in Buonamente et al. (2016) was evaluated on a different dataset of 2D contours of actions (obtained from the INRIA 4D repository), in which the system was trained on the actions performed by one actor (Andreas) and then tested on the actions of a different actor (Helena), which resulted in a performance accuracy of 53%. We achieved a significant improvement over this result with our hierarchical SOM architecture, as depicted in Fig. 17.
In our work on action recognition we have also implemented a version of the hierarchical action recognition architecture that works in real time, receiving input online from a Kinect sensor, with very good results. This system is presented in Gharaee et al. (2016). We have also conducted another experiment, in Gharaee et al. (2017b), on the recognition of actions involving objects, in which the architecture is extended to also perform object detection. In future work we plan to present our suggested solution for the segmentation of actions.
5. Conclusion
In this article we have presented a system for action recognition based on
Self-Organizing Maps (SOMs). The architecture of the system is inspired by
findings concerning human action perception, in particular those of Johansson (1973), and a model of action categories from Gärdenfors and Warglien (2012).
The first and second layers in the architecture consist of SOMs. The third layer
is a custom made supervised neural network.
We evaluated the ability of the architecture to categorize actions in the
experiments based on input sequences of 3D joint positions obtained by a depth-
camera similar to a Kinect sensor. Before entering the first-layer SOM, the input
went through a preprocessing stage with scaling and coordinate transformation
into an ego-centric framework, as well as an attention process which reduces the
input to only contain the most moving joints. In addition, the first and second
order dynamics were calculated and used as additional input to the original joint positions.
The primary goal of the architecture is to categorize human actions by ex-
tracting the available information in the kinematics of performed actions. As
in prototype theory, the categorization in our system is based on similarities
of actions, and similarity is modelled in terms of distances in SOMs. In this
sense, our categorization model can be seen as an implementation of the con-
ceptual space model of actions presented in Gärdenfors (2007) and Gärdenfors
and Warglien (2012).
Although categorization based on the first and second order dynamics has
turned out to be slightly worse than when sequences of 3D joint positions are
used, we believe that this derives from the limited quality of the dataset. We have
also noticed that the correctly categorized actions in these three different cases
do not completely overlap. This has been successfully exploited in the ex-
periment presented in this article, by combining them all to achieve a better
performance of the architecture.
Another reason for focusing on the first order dynamics (implemented as
the difference between subsequent frames) is that it is a way of modelling a
part of the human attention mechanism. By focusing on the largest changes of
position between two frames, that is, the highest velocity, the human tendency to
attend to movement is captured. We believe that attention plays an important
role in selecting which information is most relevant in the process of action
categorization, and our experiment is a way of testing this hypothesis. The
hypothesis should, however, be tested with further data sets in order to be
better evaluated. In the future, we intend to perform such tests with datasets of higher quality that also contain new types of actions.
An important aspect of the architecture proposed in this article is its gen-
eralizability. A model of action categorization based on patterns of forces is
presented in (Gärdenfors, 2007) and (Gärdenfors and Warglien, 2012). The
extended architecture presented in this article takes into account forces by con-
sidering the second order dynamics (corresponding to sequences of joint accel-
erations), and, as has been shown, improves the performance. We also think
that it is likely that the second order dynamics contains information that could
be used to implement automatized action segmentation. We will explore this in
the future. The data we have tested come from human actions. The generality
of the architecture allows it to be applied to other forms of motion involving
animals and artefacts. This is another area for future work.
Balkenius, C., Morén, J., Johansson, B., Johnsson, M., 2010. Ikaros: Building
cognitive models for robots. Advanced Engineering Informatics 24, 40–48.
Buonamente, M., Dindo, H., Johnsson, M., 2015. Discriminating and simulating
actions with the associative self-organizing map. Connection Science 27, 118–
Buonamente, M., Dindo, H., Johnsson, M., 2016. Hierarchies of self-
organizing maps for action recognition. Cognitive Systems Research DOI:
Cangelosi, A., Metta, G., Sagerer, G., Nolfi, S., Nehaniv, C., Fischer, K., Tani,
J., Belpaeme, T., Sandini, G., Nori, F., Fadiga, L., Wrede, B., Rohlfing,
K., Tuci, E., Dautenhahn, K., Saunders, J., Zeschel, A., 2008. The italk
project: Integration and transfer of action and language knowledge in robots,
in: Proceedings of Third ACM/IEEE International Conference on Human
Robot Interaction 2, pp. 167–179.
Demiris, Y., Khadhouri, B., 2006. Hierarchical attentive multiple models for
execution and recognition of actions. Robotics and Autonomous System 54,
Gärdenfors, P., 2000. Conceptual Spaces: The Geometry of Thought. Cambridge, Massachusetts: The MIT Press.
Gärdenfors, P., 2007. Representing actions and functional properties in conceptual spaces, in: Body, Language and Mind. Mouton de Gruyter, Berlin. volume 1, pp. 167–195.
Gärdenfors, P., 2014. Geometry of Meaning: Semantics Based on Conceptual Spaces. Cambridge, Massachusetts: The MIT Press.
Gärdenfors, P., Warglien, M., 2012. Using conceptual spaces to model actions and events. Journal of Semantics 29, 487–519.
Gharaee, Z., Fatehi, A., Mirian, M.S., Ahmadabadi, M.N., 2014. Attention
control learning in the decision space using state estimation. International
Journal of Systems Science (IJSS) , 1–16DOI: 10.1080/00207721.2014.945982.
Gharaee, Z., Gärdenfors, P., Johnsson, M., 2016. Action recognition online with
hierarchical self-organizing maps, in: Proceedings of the 12th International
Conference on Signal Image Technology and Internet Based Systems(SITIS).
Gharaee, Z., Gärdenfors, P., Johnsson, M., 2017a. Hierarchical self-organizing
maps system for action classification, in: Proceedings of the International
Conference on Agents and Artificial Intelligence (ICAART).
Gharaee, Z., Gärdenfors, P., Johnsson, M., 2017b. Online recognition of ac-
tions involving objects, in: Proceedings of the International Conference on
Biologically Inspired Cognitive Architecture (BICA).
Giese, M., Thornton, I., Edelman, S., 2008. Metrics of the perception of body
movement. Journal of Vision 8, 1–18.
Giese, M.A., Lappe, M., 2002. Measurement of generalization fields for the
recognition of biological motion. Vision Research 42, 1847–1858.
Hemeren, P.E., 2008. Mind in Action. Ph.D. thesis. Lund University Cognitive
Science. Lund University Cognitive Studies 140.
Huang, W., Wu, Q.J., 2010. Human action recognition based on self organiz-
ing map, in: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE
International Conference on, IEEE. pp. 2130–2133.
Johansson, G., 1973. Visual perception of biological motion and a model for its
analysis. Perception & Psychophysics 14, 201–211.
Kalkan, S., Dag, N., Yürüten, O., Borghi, A.M., Sahin, E., 2014. Verb concepts from affordances. To appear.
Kohonen, T., 1988. Self-Organization and Associative Memory. Springer Verlag.
Lallee, S., Madden, C., Hoen, M., Dominey, P.F., 2010. Linking language with
embodied and teleological representations of action for humanoid cognition.
Frontiers in Neurorobotics 4. Doi:10.3389/fnbot.2010.00008.
Levin, B., Rappaport Hovav, M., 2005. Argument Realization. Cambridge:
Cambridge University Press.
Li, W., Zhang, Z., Liu, Z., 2010. Action recognition based on a bag of 3d points,
in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010
IEEE Computer Society Conference on, IEEE. pp. 9–14.
Malt, B., Ameel, E., Imai, M., Gennari, S., Saji, N.M., Majid, A., 2014. Human
locomotion in languages: Constraints on moving and meaning. Journal of
Memory and Language 14, 107–123.
Marr, D., Nishihara, K.H., 1978. Representation and recognition of the spatial
organization of three-dimensional shapes. Proceedings of the Royal Society
in London, B 200, 269–294.
Marr, D., Vaina, L., 1982. Representation and recognition of the movements of
shapes. Proceedings of the Royal Society in London, B 214, 501–524.
Mealier, A.L., Pointeau, G., Gärdenfors, P., Dominey, P.F., 2016. Construals of
meaning: The role of attention in robotic language production. Interaction
Studies 17, 48–76.
Oreifej, O., Liu, Z., Redmond, W., 2013. Hon4d: Histogram of oriented 4d
normals for activity recognition from depth sequences. Computer Vision and
Pattern Recognition .
Rosch, E., 1975. Cognitive representations of semantic categories. Journal of
Experimental Psychology: General 104, 192–233.
Runesson, S., 1994. Perception of biological motion: The KSD-principle and the
implications of a distal versus proximal approach, in: Perceiving Events and
Objects. Hillsdale, NJ, pp. 383–405.
Runesson, S., Frykholm, G., 1983. Kinematic specification of dynamics as an in-
formational basis for person and action perception. expectation, gender recog-
nition, and deceptive intention. Journal of Experimental Psychology: General
112, 585–615.
Shariatpanahi, H.F., Ahmadabadi, M.N., 2007. Biologically inspired framework
for learning and abstract representation of attention control. Attention in
cognitive systems, theories and systems from an interdisciplinary viewpoint
4840, 307–324.
Vaina, L., 1983. From shapes and movements to objects and actions. Synthese
54, 3–36.
Vaina, L., Bennour, Y., 1985. A computational approach to visual recognition
of arm movement. Perceptual and Motor Skills 60, 203–228.
Vieira, A., Nascimento, E., Oliveira, G., Liu, Z., Campos, M., 2012. Stop:
Space-time occupancy patterns for 3d action recognition from depth map
sequences, in: Progress in Pattern Recognition, Image Analysis, Computer
Vision, and Applications (CIARP), pp. 252–259. DOI: 10.1007/978-3-642-
Wan, Y.W., accessed 2015. MSR action recognition datasets and
codes. URL:
Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y., 2012a. Robust 3d action
recognition with random occupancy patterns, in: Computer Vision-ECCV,
pp. 872–885.
Wang, J., Liu, Z., Wu, Y., Yuan, J., 2012b. Mining actionlet ensemble for
action recognition with depth cameras, in: Computer Vision and Pattern
Recognition (CVPR) 2012 IEEE Conference on, pp. 1290–1297.
Wang, W., Crompton, R.H., Carey, T.S., Günther, M.M., Li, Y., Savage, R., Sellers, W.I., 2004. Comparison of inverse-dynamics musculo-skeletal models of AL 288-1 Australopithecus afarensis and KNM-WT 15000 Homo ergaster to
modern humans, with implications for the evolution of bipedalism. Journal
of Human Evolution 47, 453–478.
Warglien, M., Gärdenfors, P., Westera, M., 2012. Event structure, conceptual
spaces and the semantics of verbs. Theoretical Linguistics 38, 159–193.
Xia, L., Aggarwal, J., 2013. Spatio-temporal depth cuboid similarity feature for
activity recognition using depth camera, in: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR.2013.365.
Xia, L., Chen, C.C., Aggarwal, J., 2012. View invariant human action recogni-
tion using histograms of 3d joints, in: IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 20–27.
Full-text available
In the modern era of technology, monitoring and controlling abnormal human activity is essentially required as these activities may harm society through physical harm to a human being, or by spreading hate crimes on the World Wide Web. Although many authors have contributed to address this problem, a desired solution that may work in a real-time scenario has yet to be achieved. Recently, deep learning models have gained attraction as processing power for a large volume of data. However, there is little work based on deep learning models for detecting abnormal human activity classification that has been done till now. In the proposed framework, a deep-learning method has been used to detect abnormal human activity by combining a convolutional neural network (CNN), a Recurrent Neural Network (RNN), and an attention module for attending the specific spatiotemporal characteristics from unprocessed video streams. This proposed architecture can accurately classify an aberrant human activity with its special category after processing the video. The proposed architecture's analytical results show an accuracy of 96.94%, 98.95%, and 62.04% with UCF50, UCF110, and UCF crime datasets, which is compared with the results of state-of-the-art algorithms (SOTA).
Nowadays, digital surveillance devices are widely implemented to collect massive volumes of data indefinitely, necessitating human monitoring to identify various activities. The requirement for smarter surveillance in this era is for artificial intelligence and computer vision technology to automatically identify normal and aberrant actions. We present a long short-term memory (LSTM)-based attention mechanism based on a pre-train convolutional neural network (CNN) that focuses on the most important characteristics in the source frame to distinguish human activities in videos. We employ the DenseNet layers to extract the prominent spatial features from frames. We input these characteristics into an LSTM to learn temporal features in video; after that, an attention technique is used to enhance performance and calculate more high-level selected activity-related patterns. The presented system was evaluated on UCF11 datasets and achieved recognition rates of 97.90%, demonstrating a substantial advancement over the current state-of-the-art (SOTA) method.
ADAS is a vision and sensor-based system that aids the driver in understanding the immediate surroundings and navigate around it in a semi-autonomous manner constantly, using different computer vision methods like object detection, depth estimation, image segmentation, image classification, etc. 3D scene understanding plays a major role in a driver assistance system, and monocular depth estimation, which is a pixel-level regression task for estimating the distance of objects from a single camera without camera calibration as performed in stere-vision-based depth sensing, is an important task in 3D scene understanding. ADAS also involves features like lane-departure warnings and lane-keeping assistance, which continuously requires it to predict the lane lines of a road, and that falls under the grouped task of lane detection. To solve the problem of lane detection and depth estimation, we initialize the proposed work with the simple, traditional computer vision approaches and then moved to some supervised, unsupervised, and self-supervised approaches involving deep neural networks. The overall aim of this research is to analyze, compare, and contrast various deep learning approaches that hold the potential for integration into autonomous vehicles and can improve the performance and safety of advanced driving assistance systems. We discuss all the methods and identify their shortcomings.
Full-text available
Machine comprehension of visual information from images and videos by neural networks suffers from two limitations: (1) the computational and inference gap in vision and language to accurately determine which object a given agent acts on and then to represent it by language, and (2) the shortcoming in stability and generalization of the classifier trained by a single, monolithic neural network. To address these limitations, we propose MoE-VRD, a novel approach to visual relationship detection via a mixture of experts. MoE-VRD recognizes language triplets in the form of a < subject,predicate,object > tuple to extract the relationship between subject, predicate, and object from visual processing. Since detecting a relationship between a subject (acting) and the object(s) (being acted upon) requires that the action be recognized, we base our network on recent work in visual relationship detection. To address the limitations associated with single monolithic networks, our mixture of experts is based on multiple small models, whose outputs are aggregated. That is, each expert in MoE-VRD is a visual relationship learner capable of detecting and tagging objects. MoE-VRD employs an ensemble of networks while preserving the complexity and computational cost of the original underlying visual relationship model by applying a sparsely-gated mixture of experts, which allows for conditional computation and a significant gain in neural network capacity. We show that the conditional computation capabilities and massive ability to scale the mixture-of-experts leads to an approach to the visual relationship detection problem which outperforms the state-of-the-art.
Full-text available
The application driven technology wireless sensor networks (WSNs) are developed substantially in the last decades. The technology has drawn the attention for application in the scientific as well as in industrial domains. The networks use multifunctional and cheap sensor nodes. The application of the networks ranges from military to the civilian application such as battlefield monitoring, environment monitoring and patient monitoring. The network goal is to collect the data from different environmental phenomenon in an unsupervised manner from unknown and hash environment using the resource constrained sensor nodes. The construction of the sensor nodes used in the network and the distributed nature of the network infrastructure is susceptible to various types of attacks. In order to assure the functional operation of WSNs and collecting the meaningful data from the network, detecting the anomalous node and mechanisms to secure the networks are vital. In this research paper, we have used machine learning based decision tree algorithm to determine the anomalous sensor node to provide security to the WSNs. The decision tree has the capability to deal with categorical and numerical data. The simulation work was carried out in python and the result shows the accurate detection of the anomalous node. In future, the hybrid approach combining two algorithms will be employed to further performance improvement of the model.
Facial expression recognition is an intriguing research area that has been explored and utilized in a wide range of applications such as health, security, and human-computer interactions. The ability to recognize facial expressions accurately is crucial for human-computer interactions. However, most of the facial expression analysis techniques have so far paid little or no concern to users’ data privacy. To overcome this concern, in this paper, we incorporated Federated Learning (FL) as a privacy-preserving machine learning approach in the field of facial expression recognition to develop a shared model without exposing personal information. The individual models are trained on the different client devices where the data is stored. In this work, a lightweight Convolutional Neural Network (CNN) model called the MobileNet architecture is utilised to detect expressions from facial images. To evaluate the model, two publicly available datasets are used and several experiments are conducted. The result shows that the proposed privacy-preserving Federated-MobileNet approach could recognize facial expressions with considerable accuracy compared to the general approaches.
Full-text available
We present an online system for real time recognition of actions involving objects working in online mode. The system merges two streams of information processing running in parallel. One is carried out by a hierarchical self-organizing map (SOM) system that recognizes the performed actions by analysing the spatial trajectories of the agent’s movements. It consists of two layers of SOMs and a custom made supervised neural network. The activation sequences in the first layer SOM represent the sequences of significant postures of the agent during the performance of actions. These activation sequences are subsequently recoded and clustered in the second layer SOM, and then labeled by the activity in the third layer custom made supervised neural network. The second information processing stream is carried out by a second system that determines which object among several in the agent’s vicinity the action is applied to. This is achieved by applying a proximity measure. The presented method combines the two information processing streams to determine what action the agent performed and on what object. The action recognition system has been tested with excellent performance.
Conference Paper
Full-text available
We present a novel action recognition system that is able to learn how to recognize and classify actions. Our system employs a three-layered neural network hierarchy consisting of two self-organizing maps together with a supervised neural network for labeling the actions. The system is equipped with a module that pre- processes the 3D input data before the first layer, and a module that transforms the activity elicited over time in the first layer SOM into an ordered vector representation before the second layer, thus achieving a time invariant representation. We have evaluated our system in an experiment consisting of ten different actions selected from a publicly available data set with encouraging result.
Conference Paper
Full-text available
We present a hierarchical self-organizing map based system for online recognition of human actions. We have made a first evaluation of our system by training it on two different sets of recorded human actions, one set containing manner actions and one set containing result actions, and then tested it by letting a human performer carry out the actions online in real time in front of the system’s 3D-camera. The system successfully recognized more than 94% of the manner actions and most of the result actions carried out by the human performer.
Full-text available
In robotics research with language-based interaction, simplifications are made, such that a given event can be described in a unique manner, where there is a direct mapping between event representations and sentences that can describe these events. However, common experience tells us that the same physical event can be described in multiple ways, depending on the perspective of the speaker. The current research develops methods for representing events from multiple perspectives, and for choosing the perspective that will be used for generating a linguistic construal, based on attentional processes in the system. The multiple perspectives are based on the principle that events can be considered in terms of the force driving the event, and the result obtained from the event, based on the theory of Godenfors. In addition, within these perspectives a further refinement can be made with respect to the agent, object, and recipient perspectives. We develop a system for generating appropriate construals of meaning, and demonstrate how this can be used in a realistic dialogic interaction between a behaving robot and a human interlocutor.
Full-text available
We propose a system able to represent others' actions as well as to internally simulate their likely continuation from a partial observation. The approach presented here is the first step towards a more ambitious goal of endowing an artificial agent with the ability to recognise and predict others' intentions. Our approach is based on the associative self-organising map, a variant of the self-organising map capable of learning to associate its activity with different inputs over time, where inputs are processed observations of others' actions. We have evaluated our system in two different experimental scenarios obtaining promising results: the system demonstrated an ability to learn discriminable representations of actions, to recognise novel input, and to simulate the likely continuation of partially seen actions.
Full-text available
Actions and events are central to a semantics of natural language. In this article, we present a cognitively based model of these notions. After giving a general presentation of the theory of conceptual spaces, we explain how the analysis of perceptual concepts can be extended to actions and events. First, we argue that action space can be analyzed in the same way as, for example, colour space or shape space. Our hypothesis is that the categorization of actions depends, to a large extent, on the perception of forces. In line with this, we describe an action as a pattern of forces. We identify an action category as a convex region of action space. We review some indirect evidence for this representation. Second, we represent an event as an interaction between a force vector and a result vector. Typically an agent performs an action—that is, exerts a force—that changes the properties of the patient. Such a model of events is suitable for an analysis of the semantics of verbs. We compare the model to other related attempts from cognitive semantics.
We propose a hierarchical neural architecture able to recognise observed human actions. Each layer in the architecture represents increasingly complex human activity features. The first layer consists of a SOM which performs dimensionality reduction and clustering of the feature space. It represents the dynamics of the stream of posture frames in action sequences as activity trajectories over time. The second layer in the hierarchy consists of another SOM which clusters the activity trajectories of the first-layer SOM and learns to represent action prototypes. The third - and last - layer of the hierarchy consists of a neural network that learns to label action prototypes of the second-layer SOM and is independent - to certain extent - of the camera’s angle and relative distance to the actor. The experiments were carried out with encouraging results with action movies taken from the INRIA 4D repository. In terms of representational accuracy, measured as the recognition rate over the training set, the architecture exhibits 100% accuracy indicating that actions with overlapping patterns of activity can be correctly discriminated. On the other hand, the architecture exhibits 53% recognition rate when presented with the same actions interpreted and performed by a different actor. Experiments on actions captured from different view points revealed a robustness of our system to camera rotation. Indeed, recognition accuracy was comparable to the single viewpoint case. To further assess the performance of the system we have also devised a behavioral experiments in which humans were asked to recognize the same set of actions, captured from different points of view. Results form such a behavioral study let us argue that our architecture is a good candidate as cognitive model of human action recognition, as architectural results are comparable to those observed in humans.
We study the problem of action recognition from depth sequences captured by depth cameras, where noise and occlusion are common problems because they are captured with a single commodity camera. In order to deal with these issues, we extract semi-local features called random occupancy pattern (ROP) features, which employ a novel sampling scheme that effectively explores an extremely large sampling space. We also utilize a sparse coding approach to robustly encode these features. The proposed approach does not require careful parameter tuning. Its training is very fast due to the use of the high-dimensional integral image, and it is robust to the occlusions. Our technique is evaluated on two datasets captured by commodity depth cameras: an action dataset and a hand gesture dataset. Our classification results are superior to those obtained by the state of the art approaches on both datasets.
Conference Paper
This paper presents Space-Time Occupancy Patterns (STOP), a new visual representation for 3D action recognition from sequences of depth maps. In this new representation, space and time axes are divided into multiple segments to define a 4D grid for each depth map sequence. The advantage of STOP is that it preserves spatial and temporal contextual information between space-time cells while being flexible enough to accommodate intra-action variations. Our visual representation is validated with experiments on a public 3D human action dataset. For the challenging cross-subject test, we significantly improved the recognition accuracy from the previously reported 74.7% to 84.8%. Furthermore, we present an automatic segmentation and time alignment method for online recognition of depth sequences.