First and Second Order Dynamics in a Hierarchical
SOM system for Action Recognition
Zahra Gharaee, Peter Gärdenfors and Magnus Johnsson
Lund University Cognitive Science,
Helgonavägen 3, 221 00 Lund, Sweden
Human recognition of the actions of other humans is very efficient and is based
on patterns of movements. Our theoretical starting point is that the dynamics of
the joint movements is important to action categorization. On the basis of this
theory, we present a novel action recognition system that employs a hierarchy
of Self-Organizing Maps together with a custom supervised neural network that
learns to categorize actions. The system preprocesses the input from a Kinect-like
3D camera to exploit the information not only about joint positions, but
also about their first and second order dynamics. We evaluate our system in two
experiments with publicly available data sets, and compare its performance to
that obtained with less sophisticated preprocessing of the input. The results
show that including the dynamics of the actions improves the performance. We
also apply an attention mechanism that focuses on the parts of the body that
are most involved in performing the actions.
Keywords: Self-Organizing Maps, Conceptual Spaces, Neural Networks,
Action Recognition, Hierarchical Models, Attention, Dynamics
Email address: email@example.com,
firstname.lastname@example.org, email@example.com (Zahra Gharaee, Peter
Gärdenfors and Magnus Johnsson)
Preprint submitted to Elsevier May 23, 2017
The success of human-robot interaction depends on the development of ro-
bust methods that enable robots to recognize and predict goals and intentions of
other agents. Humans do this, to a large extent, by interpreting and categoriz-
ing the actions they perceive. Hence, it is central to develop methods for action
categorization that can be employed in robotic systems. This involves an analy-
sis of on-going events from visual data captured by cameras to track movements
of humans and to use this analysis to identify actions. One crucial question is
what kind of information should be extracted from the observations by an
artificial action recognition system.
Our ambition is to develop an action categorization method that, by and large,
works like the human system. We build on the theory of action categorization
of Gärdenfors and Warglien (2012) (see also Gärdenfors (2007) and Gärdenfors
(2014)), which in turn builds on Gärdenfors's (2000) theory of conceptual spaces.
The central idea is that actions are represented by their underlying force patterns.
Such patterns can be derived from the second order dynamics of the input data.
We present experimental data on how humans categorize actions that support
this model. A goal of this article is to show that the performance of our
Self-Organizing Maps (SOMs) (Kohonen, 1988) based action recognition system
can be improved by taking the dynamics of actions into account when categorizing
actions from 3D camera input.
The architecture of our action recognition system is composed of a hierarchy
of three neural network layers. These layers have been implemented in different
versions. The first layer consists of a SOM, which is used to represent preprocessed
input frames (e.g. posture frames) from input sequences and to extract
their motion patterns. This means that the SOM reduces the dimensionality
of the input data, and the actions in this layer are represented as activity patterns.
The second layer of the architecture consists of a second SOM. It receives
the superimposed activities in the first layer for complete actions. The superimposition
of all the activity in the first layer SOM provides a mechanism that
makes the system time invariant, because similar movements carried out
at different speeds elicit similar sequences of activity in the first layer SOM. Thus
the second layer SOM represents and clusters complete actions. The third layer
consists of a custom-made supervised neural network that labels the different
clusters in the second layer SOM with the corresponding action.
We have previously studied the ability of SOMs to learn discriminable representations
of actions (Buonamente et al., 2015), and we have developed a
hierarchical SOM-based action recognition architecture. This architecture has
previously been tested using video input from human actions in a study that
also included a behavioural comparison between the architecture and humans
(Buonamente et al., 2016), and using extracted joint positions from a Kinect-like
3D camera as input, with good results (Gharaee et al., 2017a).
This article presents results suggesting that the performance of our action
recognition architecture can be improved by exploiting not only the joint
positions extracted from a Kinect-like 3D camera, but also, simultaneously, the
information present in their first and second order dynamics.
Apart from analysing the dynamics of the data, we implement an attention
mechanism that is inspired by how human attention works. We model attention
by reducing the input data to those parts of the body that contribute the most to
performing the various actions. Adding such an attention mechanism improves
the performance of the system.
The rest of the paper is organized as follows. First we present the theoretical
background from cognitive science in section 2. The action recognition architecture
is described in detail in section 3. Section 4 presents two experiments
evaluating the performance of the architecture when new kinds of preprocessing
provide dynamic information as additional input. Section 5 concludes the paper.
2. Theoretical Background
When investigating action recognition in the context of human-robot inter-
action, it should ﬁrst be mentioned that human languages contain two types
of verbs describing actions (Levin and Rappaport Hovav, 2005; Warglien et al.,
2012). The ﬁrst type is manner verbs that describe how an action is performed.
In English, some examples are run, swipe, wave, push, and punch. The sec-
ond type is result verbs that describe the result of actions. In English, some
examples are move, heat, clean, enter, and reach.
In the context of robotics, research has focused on how result verbs can be
modelled (e.g. Cangelosi et al. (2008), Kalkan et al. (2014), Lallee et al. (2010)
and Demiris and Khadhouri (2006)). However, when it comes to human-robot
interaction, the robot should also be able to recognize human actions by the
manner they are performed. This is often called recognition of biological motion
(Hemeren, 2008). Recognising manner actions is particularly important if the
robot is supposed to model the intentions of a human. In the literature, there
are some systems for categorizing human actions described by manner verbs, e.g.
Giese and Lappe (2002) and Giese et al. (2008). However, these systems have
not been developed with the aim of supporting human-robot interaction. Our
aim in this article is to present a system that recognises a set of manner actions.
Our future aim is, however, to integrate this with a system for recognising result
verbs that can be used in linguistic interactions between a human and a robot
(see Cangelosi et al. (2008) and Mealier et al. (2016) for examples of such linguistic
interactions).
Results from the cognitive sciences indicate that the human brain performs
a substantial information reduction when categorizing human manner actions.
Johansson (1973) has shown that the kinematics of a movement contain suf-
ﬁcient information to identify the underlying dynamic patterns. He attached
light bulbs to the joints of actors who were dressed in black and moved in a
black room. The actors were ﬁlmed performing actions such as walking, run-
ning, and dancing. Watching the ﬁlms - in which only the dots of light could be
seen - subjects recognized the actions within tenths of a second. Further exper-
iments by Runesson and Frykholm (1983), see also (Runesson, 1994), show that
subjects extract subtle details of the actions performed, such as the gender of
the person walking or the weight of objects lifted (where the objects themselves
cannot be seen).
One lesson to learn from the experiments by Johansson and his followers is
that the kinematics of a movement contains suﬃcient information to identify
the underlying dynamic force patterns. Runesson (1994) claims that people can
directly perceive the forces that control diﬀerent kinds of motion. He formulates
the following thesis:
Kinematic speciﬁcation of dynamics: The kinematics of a movement contains
suﬃcient information to identify the underlying dynamic force patterns.
From this perspective, the information that the senses - primarily vision -
receive about the movements of an object or individual is suﬃcient for the brain
to extract, with great precision, the underlying forces. Furthermore, the process
is automatic: one cannot help but perceive the forces.
Given these results from perceptual psychology, the central problem for
human-robot interaction now becomes how to construct a model of action recog-
nition that can be implemented in a robotic system. One idea for such a model
comes from Marr and Vaina (1982) and Vaina (1983), who extend Marr and
Nishihara’s (1978) cylinder models of objects to an analysis of actions. In Marr’s
and Vaina’s model, an action is described via diﬀerential equations for move-
ments of the body parts of, for example, a walking human. What we ﬁnd useful
in this model is that a cylinder ﬁgure can be described as a vector with a lim-
ited number of dimensions. Each cylinder can be described by two dimensions:
length and radius. Each joining point in the figure can be described by a small
number of coordinates for the point of contact and the angle of the joining cylinder. This
means that, at a particular moment, the entire ﬁgure can be written as a (hier-
archical) vector of a fairly small number of dimensions. An action then consists
of a sequence of such vectors. In this way, the model involves a considerable
reduction of dimensionality in comparison to the original visual data. Further
reduction of dimensionality is achieved in a skeleton model.
It is clear that, using Newtonian mechanics, one can derive the diﬀerential
equations from the forces applied to the legs, arms, and other moving parts
of the body. For example, the pattern of forces involved in the movements of
a person running is diﬀerent from the pattern of forces of a person walking;
likewise, the pattern of forces for saluting is diﬀerent from the pattern of forces
for throwing (Vaina and Bennour, 1985).
The human cognitive apparatus is not exactly evolved for Newtonian mechanics.
Nevertheless, Gärdenfors (2007) (see also Warglien et al. (2012) and
Gärdenfors (2014)) proposed that the brain extracts the forces that lie behind
different kinds of movements and other actions:
Representation of actions: An action is represented by the pattern of forces
that generates it.
We speak of a pattern of forces since, for bodily motions, several body parts
are involved; and thus, several force vectors are interacting (by analogy with
Marr’s and Vaina’s diﬀerential equations). Support for this hypothesis will be
presented below. One can represent these patterns of forces in principally the
same way as the patterns of shapes described in Gärdenfors (2014), section
6.3. In analogy with shapes, force patterns also have a meronomic structure. For
example, a dog with short legs moves in a different way than a dog with long legs.
This representation fits well into the general format of conceptual spaces
presented by Gärdenfors (2000, 2007). In order to identify the structure of the
action domain, similarities between actions should be investigated. This can be
accomplished by basically the same methods used for investigating similarities
between objects. Just as there, the dynamic properties of actions can be judged
with respect to similarities: for example, walking is more similar to running
than to waving. Very little is known about the geometric structure of the action
domain, except for a few recent studies that we will present below. We assume
that the notion of betweenness is meaningful in the action domain, allowing us
to formulate the following thesis in analogy to the thesis about properties (see
Gärdenfors (2000, 2007) and Gärdenfors and Warglien (2012)):
Thesis about action concepts: An action concept is represented as a convex
region in the action domain.
Convexity may here be interpreted as the assumption that, given two actions
in the region of an action concept, any linear morph between those actions will
fall under the same concept.
One way to support the analogy between how objects and how actions are
represented in conceptual space is to establish that action concepts share a sim-
ilar structure with object categories (Hemeren, 2008, p. 25). Indeed, there
are strong reasons to believe that actions exhibit many of the prototype eﬀects
that Rosch (1975) presented for object categories. In a series of psychological
experiments, Hemeren (2008) showed that action categories show a similar hi-
erarchical structure and have similar typicality eﬀects to object concepts. He
demonstrated a strong inverse correlation between judgements of most typical
actions and reaction time in a word/action veriﬁcation task.
Empirical support for the thesis about action concepts as regards body move-
ments can also be found in Giese and Lappe (2002). Using Johansson’s (1973)
patch-light technique, they started from video recordings of natural actions such
as walking, running, limping, and marching. By creating linear combinations
of the dot positions in the videos, they then made ﬁlms that were morphs of
the recorded actions. Subjects watched the morphed videos and were asked to
classify them as instances of walking, running, limping, or marching, as well
as to judge the naturalness of the actions. Giese and Lappe did not explicitly
address the question of whether the actions recognized form convex regions in
the force domain. However, their data clearly support this thesis.
Another example of data that can be used to study force patterns comes
from Wang et al. (2004). They collected data from the walking patterns of
humans under diﬀerent conditions. Using the methods of Giese et al. (2008),
these patterns can be used to calculate the similarity of the diﬀerent gaits.
A third example is Malt et al. (2014) who studied how subjects named
the actions shown in 36 video clips of diﬀerent types of walking, running and
jumping. The subjects were native speakers of English, Spanish, Dutch, and
Japanese. The most commonly produced verb for each clip in each language
was calculated. This generated a number of verbs, including several subcategories.
Another group of subjects, again native speakers of the four languages,
judged the physical similarity of the actions in the video clips. Based on these
judgements a two-dimensional multidimensional scaling solution was calculated.
The verbs from the ﬁrst group were then mapped onto this solution. Figure 4 in
Malt et al. (2014) shows the results for the most common English action word
for each video clip. The results support the thesis that regions corresponding
to the names are convex.
The action recognition system presented in this article has similarities with
these models in the sense that actions are represented as sequences of vectors,
and it categorizes the actions on the basis of their similarities. In our system,
similarity is modelled as closeness in SOMs. We next turn to a description of
the architecture of the system.
3. Hierarchical SOM Architecture for Action Recognition
Our action recognition system consists of a three layered neural network
architecture (Fig. 1). The ﬁrst layer consists of a SOM that develops an ordered
representation of preprocessed input. The second layer consists of a second SOM
that receives, as input, the superimposed sequence of activity elicited during an
action in the ﬁrst layer SOM. Thus the second layer SOM develops an ordered
representation of the activity traces in the ﬁrst layer SOM that correspond to
diﬀerent actions. Finally, the third layer is a custom supervised neural network
that associates activity representing action labels to the activity in the second
layer SOM.
From a computational point of view, there are many challenges that make
the action recognition task difficult to reproduce artificially. For example, the acting
individuals differ in height, weight and bodily proportions. Other important
Figure 1: The architecture of the action recognition system, consisting of three layers where
the ﬁrst and second layers are SOMs, and the third layer is a custom supervised neural
network. The input is a combination of joint positions extracted from the output of a Kinect-like
depth camera.
Figure 2: Diﬀerent body orientations; turned to the left (a), front direction (b), turned to
the right (c) and the joints used to calculate the egocentric coordinate system (d).
issues to be addressed are the impact of the camera’s viewing angle and distance
from the actor and the performance speed of the actions. In brief, categoriza-
tions of actions ought to be invariant under distance, viewing angle, size of the
actor, lighting conditions and temporal variations.
3.1.1. Ego-Centered Transformation
To make the action recognition system invariant to different orientations of
the acting agent, a coordinate transformation into an ego-centered coordinate
system (Fig. 2), located at the central (stomach) joint shown in Fig. 5, is
applied to the extracted joint positions. This new coordinate system is built
from three joints: Right Hip, Left Hip and Stomach.
The other preprocessing mechanism applied to the joint positions is a
scaling transformation that makes the system invariant to the distance from the
depth camera. Due to the scaling to a standard size, the representation of the
agent will always have a fixed size even if the action performers are at different
distances from the depth camera (Fig. 3).
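The two preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the joint indices, the exact axis conventions and the use of hip width as the scale factor are our assumptions.

```python
import math

# Hypothetical joint indices; the dataset's actual layout may differ.
STOMACH, RIGHT_HIP, LEFT_HIP = 0, 1, 2

def sub(a, b): return [x - y for x, y in zip(a, b)]
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]
def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]]

def egocentric(joints):
    """Express all joints in a body-centred frame anchored at the stomach.

    The x axis runs from the left hip to the right hip, the z axis is the
    normal of the hip plane, and y completes a right-handed frame.  All
    coordinates are divided by the hip width, so the skeleton keeps a
    fixed size regardless of its distance from the camera."""
    o = joints[STOMACH]
    hip_vec = sub(joints[RIGHT_HIP], joints[LEFT_HIP])
    x_axis = unit(hip_vec)
    mid = [(a + b) / 2 for a, b in zip(joints[RIGHT_HIP], joints[LEFT_HIP])]
    up = sub(o, mid)                  # stomach sits above the hip line
    z_axis = unit(cross(x_axis, up))  # hip-plane normal
    y_axis = cross(z_axis, x_axis)
    scale = math.sqrt(dot(hip_vec, hip_vec))  # hip width as body scale
    return [[dot(sub(p, o), ax) / scale for ax in (x_axis, y_axis, z_axis)]
            for p in joints]
```

Because every coordinate is expressed relative to the body's own axes and scale, the same posture yields the same vector whether the actor is turned away from the camera or stands further from it.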
Figure 3: Diﬀerent skeleton sizes due to diﬀerent distances from the depth camera.
As humans, we learn how to control our attention when performing tasks
(Shariatpanahi and Ahmadabadi, 2007). Utilizing an attention mechanism can,
together with other factors (e.g. state estimation), improve our performance in
many other tasks (Gharaee et al., 2014). One of the strongest factors that
attract our attention is movement. Here, we therefore assume that when
actions are performed, the observer pays attention to the body parts that are
most involved in the actions. Thus, in the experiments we have applied attention
mechanisms that direct the attention towards the body parts that move the
most. For example, when the agent is clapping hands, the attention is
focused on the arms and no or very little attention is directed towards the rest
of the agent’s body.
Our attention mechanisms are simulated by extracting only some of the
joint coordinates, selected according to how much the corresponding joints are
involved in performing the actions. These mechanisms reduce the number of
input dimensions that enter the first-layer SOM and help the system to
recognize actions by focusing on the more relevant parts of the body while
ignoring the less relevant parts. This procedure increases the accuracy of the
system. The attention mechanism applied to the system will be described in
the experiment section below.
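As a sketch, attention as joint selection can be implemented by ranking joints by their accumulated displacement over an action sample and keeping the top k. The functions below are our illustration, not the paper's code.

```python
import math

def most_moving_joints(frames, k):
    """frames: list of frames, each a list of (x, y, z) joint positions.
    Returns the indices of the k joints with the largest accumulated
    displacement over the whole action sample."""
    n_joints = len(frames[0])
    travel = [0.0] * n_joints
    for prev, cur in zip(frames, frames[1:]):
        for j in range(n_joints):
            travel[j] += math.dist(prev[j], cur[j])
    return sorted(range(n_joints), key=lambda j: travel[j], reverse=True)[:k]

def attend(frames, keep):
    """Reduce each frame to the attended joints only."""
    return [[frame[j] for j in keep] for frame in frames]
```

Applying `attend` with the selected indices shrinks the input dimensionality of the first-layer SOM exactly as described above.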
By using postures composed of 3D joint positions as input to our architecture,
a high performance of our action recognition system has been obtained
(Gharaee et al., 2017a). In addition to their 3D positions, the body joints have
velocities and accelerations, which can be modelled as first and second order
dynamics. The first order dynamics (velocity) determine the speed and direction
of the joints' movements, while the second order dynamics (acceleration)
determine, via Newton's second law, the direction of the force vector applied to
each joint during the action.
In this study, the first and second order joint dynamics have been extracted
and used together with the 3D positions. This has been done by calculating the
differences between consecutive sets of joint positions, and between consecutive
sets of first order joint dynamics in turn. This has enabled us to investigate
how the inclusion of the dynamics contributes to the performance of our action
recognition system.
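The finite-difference computation just described can be sketched as follows (our illustration; scaling by the frame rate is omitted, so the differences equal velocity and acceleration only up to a constant factor):

```python
def with_dynamics(frames):
    """frames: list of flat coordinate vectors, one per time step.
    Returns frames augmented with first differences (velocity) and
    second differences (acceleration), aligned on the same time step."""
    vel = [[c - p for c, p in zip(cur, prev)]
           for prev, cur in zip(frames, frames[1:])]
    acc = [[c - p for c, p in zip(cur, prev)]
           for prev, cur in zip(vel, vel[1:])]
    # The first two frames have no acceleration estimate, so drop them.
    return [pos + v + a for pos, v, a in zip(frames[2:], vel[1:], acc)]
```

Each resulting frame concatenates position, velocity and acceleration, which is the merged input used in the experiments below.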
3.2. The First and Second Layer SOMs
The first two layers of our architecture consist of SOMs. The SOMs self-organize
into dimensionality-reduced and discretized topology-preserving representations
of their respective input spaces. Due to the topology-preserving property,
nearby parts of the trained SOM respond to similar input patterns. This
is reminiscent of the cortical maps found in mammalian brains. The topology-preserving
property of SOMs is a consequence of the use of a neighbourhood
function during the adaptation of the neuron responses, which means that the
adaptation strength of a neuron is a decreasing function of its distance to the most
activated neuron in the SOM. This also provides the SOM, and by extension
our action recognition system, with the ability to generalize learning to novel
inputs, because similar inputs elicit similar activity in the SOM. Thus similar
actions composed of similar sequences of agent postures and dynamics will elicit
similar activity trajectories in the ﬁrst layer SOM, and similar activity trajecto-
ries in the ﬁrst layer SOM, and thus similar actions, will be represented nearby
in the second layer SOM. Since a movement performed at diﬀerent speeds will
elicit activity along the same trajectory in the ﬁrst layer SOM when the input
consists of a stream of sets of joint positions, as well as when dynamics is added
(as long as the action’s internal dynamic relations are preserved), our action
recognition system also achieves time invariance.
The SOM consists of an I × J grid of neurons with a fixed number of neurons
and a fixed topology. Each neuron nij is associated with a weight vector wij ∈
R^n with the same dimensionality as the input vectors. All the elements of the
weight vectors are initialized with real numbers randomly selected from a uniform
distribution between 0 and 1, after which all the weight vectors are normalized,
i.e. turned into unit vectors.
At time t each neuron nij receives the input vector x(t) ∈ R^n. The net input
sij(t) at time t is calculated using the Euclidean metric:

sij(t) = ||x(t) − wij(t)|| (1)

The activity yij(t) at time t is calculated by using the exponential function:

yij(t) = e^(−sij(t)/σ) (2)

The parameter σ is the exponential factor, set to 10^6, and 0 ≤ i < I, 0 ≤ j < J,
i, j ∈ N. The role of the exponential function is to normalize and increase
the contrast between highly activated and less activated areas.
The neuron c with the strongest activation is selected:

c = argmax_ij yij(t) (3)

The weights wijk are adapted by

wijk(t + 1) = wijk(t) + α(t)Gijc(t)[xk(t) − wijk(t)] (4)
Figure 4: Ordered Vector Representation Process. The figure shows the patterns in the first-layer
SOM elicited by the same action performed at different rates. When the action is
performed slowly (right) there are more activations, with a higher number of activated neurons,
and the same neurons may be activated repeatedly (the darker and larger arrows), compared to when
the action is performed quickly (left). In both cases the activations lie on the same path
in the first-layer SOM. The ordered vector representation process creates a representation
of the path (lower) designed to be independent of the performance rate, thus achieving time
invariance for the system.
The term 0 ≤ α(t) ≤ 1 is the adaptation strength, with α(t) → 0 as t → ∞.
The neighbourhood function Gijc(t) = e^(−||rc − rij||²/2σ²(t)) is a Gaussian
function decreasing with time, where rc ∈ R² and rij ∈ R² are the location
vectors of neurons c and nij respectively. All weight vectors wij(t) are
normalized after each adaptation step.
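Equations (1)-(4) can be condensed into the following sketch of a single training step. It is an illustration of the update rule, not the authors' implementation: the learning-rate and neighbourhood schedules are left out, and the exponential activation constant is omitted.

```python
import math
import random

class SOM:
    def __init__(self, rows, cols, dim, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        # weights drawn uniformly from [0, 1) and normalized to unit length
        self.w = [[self._unit([rng.random() for _ in range(dim)])
                   for _ in range(cols)] for _ in range(rows)]

    @staticmethod
    def _unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    def winner(self, x):
        """Neuron with the smallest Euclidean net input s_ij (Eq. 1)."""
        return min(((i, j) for i in range(self.rows) for j in range(self.cols)),
                   key=lambda ij: math.dist(x, self.w[ij[0]][ij[1]]))

    def train_step(self, x, alpha, sigma):
        """One adaptation step (Eq. 4); sigma is the neighbourhood width."""
        ci, cj = self.winner(x)
        for i in range(self.rows):
            for j in range(self.cols):
                # Gaussian neighbourhood centred on the winner
                g = math.exp(-((i - ci) ** 2 + (j - cj) ** 2)
                             / (2 * sigma ** 2))
                self.w[i][j] = self._unit(
                    [wk + alpha * g * (xk - wk)
                     for wk, xk in zip(self.w[i][j], x)])
        return ci, cj
```

Repeated presentation of an input pulls the winner's weight vector toward it, while the Gaussian neighbourhood drags nearby neurons along; this is what produces the topology-preserving map.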
3.3. Ordered Vector Representation
Ordered vector representation of the activity traces unfolding in the ﬁrst-
layer SOM during an action is a way to transform an activity pattern over time
into a spatial representation which will then be represented in the second-layer
SOM. In addition, as mentioned shortly above, it provides the system with time
invariance. This means that if an action is carried out several times, but at
diﬀerent rates, the activity trace will progress along the same path in the ﬁrst-
layer SOM. If this path is coded in a consistent way, the system achieves a time
invariant encoding of actions. If instead the activity patterns in the ﬁrst-layer
SOM were transformed into spatial representations by, for example, using the
sequence of centres of activity in the first-layer SOM, time invariance would not
be achieved. A very quick performance of an action would then yield a small
number of centres of activity along the path in the ﬁrst-layer SOM, whereas a
very slow performance would yield a larger number of centres of activity, some
of them sequentially repeated in the same neurons (see Fig. 4).
Strictly speaking, time invariance is only achieved when using sets of joint
positions as input to the system, because then the ﬁrst-layer SOM develops
a topology preserving representation of postures. When using first or second
order dynamics this is no longer the case, unless the action's internal dynamic
relations are preserved, because an action's representation in the first-layer
SOM will then vary depending on the dynamics. On the other hand, some
discriminable information is added for the system. For discrimination performance,
the benefits of adding the dynamics to the input seem to outweigh the drawbacks
to some extent.
The ordered vector representation process used in this study works as follows.
The length Δj of the activity trace of action sequence j is calculated by:

Δj = Σ(i=1..N−1) ||Pi+1 − Pi|| (5)

The parameter N is the total number of centres of activity for the action
sequence j and Pi is the ith centre of activity in the same action sequence.
Suitable lengths of the segments dividing the activity trace of action sequence
j in the first-layer SOM are calculated by:

dj = Δj/NMax (6)

The parameter NMax is given by the longest path in the first-layer SOM elicited by
the M actions in the training set. Each activity trace in the first-layer SOM,
generated by an action, is divided into segments of length dj, and the coordinates
of the borders of these segments, in the order they appear from the start to the
end of the activity trace, are composed into a vector used as input to the second-layer
SOM.
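The segmentation in Eqs. (5) and (6) amounts to resampling each trace into the same number of equally spaced border points. The sketch below follows our reading of the procedure, namely that every trace ends up with NMax segments:

```python
import math

def ordered_vector(centres, n_max):
    """centres: the sequence of centres of activity (2D points) of one
    action's trace in the first-layer SOM.  Divides the trace into
    n_max equal-length segments and returns the segment borders,
    concatenated in order, as one fixed-size vector."""
    # cumulative arc length along the trace; Delta_j is cum[-1] (Eq. 5)
    cum = [0.0]
    for p, q in zip(centres, centres[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    step = cum[-1] / n_max        # segment length d_j (Eq. 6)
    vec, seg = [], 0
    for k in range(n_max + 1):
        target = min(k * step, cum[-1])
        # advance to the polyline segment containing the target length
        while seg < len(cum) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0 else (target - cum[seg]) / span
        p, q = centres[seg], centres[seg + 1]
        vec += [p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])]
    return vec
```

A fast and a slow performance of the same movement produce different numbers of centres along the same path, but the resampled border vector is the same, which is the time invariance discussed above.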
3.4. Output Layer
The output layer consists of an I × J grid of neurons with a fixed number of
neurons and a fixed topology. Each neuron nij is associated with a weight vector
wij ∈ R^n. All the elements of the weight vectors are initialized with real
numbers randomly selected from a uniform distribution between 0 and 1, after
which the weight vectors are normalized, i.e. turned into unit vectors.
At time teach neuron nij receives an input vector x(t)∈Rn.
The activity yij(t) at time t in the neuron nij is calculated using the standard
cosine metric:

yij(t) = x(t) · wij(t) / (||x(t)|| ||wij(t)||) (7)
During the learning phase the weights wijl are adapted by

wijl(t + 1) = wijl(t) + β[yij(t) − dij(t)] (8)

The parameter β is the adaptation strength and dij(t) is the desired activity
for the neuron nij at time t.
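A sketch of the output layer follows. Note that we apply the update in the error-reducing direction and include the input element, as in the standard delta rule; this sign and input factor are our reading, since Eq. (8) as printed would leave the update independent of the index l.

```python
import math
import random

class OutputLayer:
    """One neuron per action label.  Activity is the cosine similarity
    between the input vector and the neuron's weight vector (Eq. 7)."""
    def __init__(self, n_classes, dim, beta=0.1, seed=0):
        rng = random.Random(seed)
        self.beta = beta
        self.w = [[rng.random() for _ in range(dim)]
                  for _ in range(n_classes)]

    @staticmethod
    def _cos(a, b):
        na = math.sqrt(sum(v * v for v in a))
        nb = math.sqrt(sum(v * v for v in b))
        return sum(p * q for p, q in zip(a, b)) / (na * nb)

    def activity(self, x):
        return [self._cos(x, w) for w in self.w]

    def train_step(self, x, label):
        """Delta-rule style update toward the desired activity d_ij,
        which is 1 for the correct label and 0 otherwise."""
        y = self.activity(x)
        for c in range(len(self.w)):
            d = 1.0 if c == label else 0.0
            err = y[c] - d
            self.w[c] = [wl - self.beta * err * xl
                         for wl, xl in zip(self.w[c], x)]
```

After training, the label of the most active neuron is taken as the recognized action.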
4. Experiments

We have evaluated the performance of our recognition architecture in two
separate experiments using publicly available data from the repository MSR
Action Recognition Datasets and Codes (Wan, accessed 2015). Each experiment
uses 10 diﬀerent actions (in total 20 diﬀerent actions) performed by 10 diﬀerent
subjects in 2 to 3 different events. The action data are composed of sequences
Figure 5: The 3D information of the joints of the skeleton extracted from the 3D camera.
of sets of joint positions obtained by a depth camera, similar to a Kinect sensor.
Each action sample is composed of a sequence of frames where each frame
contains 20 joint positions expressed in 3D Cartesian coordinates as shown in
Fig. 5. The sequences composing the action samples vary in length. We have
used the Ikaros framework (Balkenius et al., 2010) to design and implement the
system.
4.1. Experiment 1
In the ﬁrst experiment, we have selected a set of actions from MSR Action3D
Dataset containing 276 samples of 10 diﬀerent actions performed by 10 diﬀerent
subjects, each in 2 to 3 different events. The actions of the first experiment are:
1. High Arm Wave, 2. Horizontal Arm Wave, 3. Using Hammer,
4. Hand Catch, 5. Forward Punch, 6. High Throw, 7. Draw X, 8. Draw Tick,
9. Draw Circle, 10. Tennis Swing. The first subset of actions was split into
a training set containing 80% of the action instances randomly selected from
Figure 6: The attention mechanism in the ﬁrst experiment is obtained by setting the focus
of attention to the left arm which is involved in performing all of the actions.
the original dataset and a test set containing the remaining 20% of the action
instances.
The attention mechanism applied in this experiment is set to focus the attention
on the arm that is mainly involved in performing all of the actions, that is,
the left arm in this experiment, see Fig. 6. The action recognition architecture
was trained with randomly selected instances from the training set in two phases:
the first phase trains the first-layer 30 × 30 neuron SOM, and the second trains
the second-layer 35 × 35 neuron SOM together with the output layer containing
10 neurons.
In this experiment we continued the study of Gharaee et al. (2017a) in order
to investigate the effect of including dynamics on the action recognition
performance of the system. Using only the postures, we achieved a recognition
accuracy of 83% (see Gharaee et al. (2017a)). To assess the first and second
order dynamics, we first applied the first order dynamics (velocity) as the input
to our system and obtained a recognition accuracy of 75%, and then the second
order dynamics (acceleration) as the input, where the recognition accuracy was
52%. In the third step we merged the postures together with the first and second
order dynamics (position, velocity and acceleration) as the input to the system
(merged system). The categorization results show that 87% of all test sequences
are correctly categorized. The results of using position, velocity or acceleration
as input data one by one and then all together are depicted in Fig. 7. As the
figure shows, when we used the combination of inputs in the merged system,
6 out of 10 actions were 100% correctly categorized, see Fig. 8.

Figure 7: Classification results of all actions when using as input only the joint positions,
only the joint velocities (first derivatives), only the joint accelerations (second derivatives)
or their combination (merged). Results with the training set during training (uppermost).
Results for the fully trained system with the test set (lowermost). As can be seen, the best
result was achieved when using the combined (merged) input.
Fig. 9 shows how the performance can be improved by using as input a
combination of joint positions, joint velocities (first order dynamics) and joint
accelerations (second order dynamics), compared to when each of these kinds
of input is used alone.
The results for the systems using only joint velocities and only joint accelerations
are not as good as the result for the system using only joint positions. One
Figure 8: Classification results of the action recognition system when receiving the combined
input of joint positions and their first and second order dynamics. Results with the training
set during training (uppermost). Results for the fully trained system per action with the test
set (lowermost).
Figure 9: Comparison of the classiﬁcation performance, per action, of the action recognition
system when using as input only the joint positions, only the joint velocities (ﬁrst derivatives),
only the joint accelerations (second derivatives) or their combination (merged).
possible explanation for this is the low quality of the input data. The algorithm
that extracts the skeleton data from the camera input often does not deliver
biologically realistic results. The errors in joint positions that occur in the data
set generated by the skeleton algorithm are magnified when the first derivatives
are calculated for the joint velocities, and doubly magnified when the second
derivatives are calculated for the accelerations. We therefore believe that if our
system were tested on a dataset with smaller errors, the velocity and
acceleration systems would perform better.
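This magnification can be quantified: for independent position noise of standard deviation σ, the first difference has standard deviation √2·σ and the second difference √6·σ. The small simulation below illustrates this for a stationary joint, where any measured velocity or acceleration is pure noise; the noise level is an arbitrary illustrative value:

```python
import random
import statistics

random.seed(0)

SIGMA = 0.01  # assumed position noise (standard deviation)

# A stationary joint: the true position is constant, so the measured
# velocities and accelerations consist entirely of amplified noise.
positions = [random.gauss(0.0, SIGMA) for _ in range(100_000)]
velocities = [b - a for a, b in zip(positions, positions[1:])]
accelerations = [b - a for a, b in zip(velocities, velocities[1:])]

print(statistics.stdev(positions))      # ~ sigma
print(statistics.stdev(velocities))     # ~ sqrt(2) * sigma
print(statistics.stdev(accelerations))  # ~ sqrt(6) * sigma
```

The noise on the accelerations is thus roughly two and a half times larger than on the positions, which is consistent with the weaker results of the velocity-only and acceleration-only systems.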
4.2. Experiment 2
In the second experiment we used the remaining actions in the MSR Action 3D
Dataset: 1. Hand Clap, 2. Two Hand Wave, 3. Side Boxing, 4. Forward Bend,
5. Forward Kick, 6. Side Kick, 7. Jogging, 8. Tennis Serve, 9. Golf Swing,
10. Pick up and Throw. This second set of actions
was split into a training set containing 75% of the action instances randomly
selected from the original dataset and a test set containing the remaining 25%
of the action instances.
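Such a split can be implemented as a seeded random partition; the sketch below is an illustration, not the exact procedure used:

```python
import random

def split_dataset(instances, train_fraction=0.75, seed=0):
    """Randomly split a list of action instances into a training set
    (75% by default) and a test set with the remaining instances."""
    rng = random.Random(seed)
    shuffled = instances[:]          # leave the caller's list intact
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Seeding the generator makes the partition reproducible across runs, which matters when comparing the position-only and merged systems on the same split.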
The action recognition architecture had the same settings in both experiments:
the first layer SOM contained 30×30 neurons, the second layer
SOM contained 35×35 neurons, and the output layer contained 10 neurons.
The attention mechanism used in this experiment is not as simple as in the
first experiment because of the nature of the actions in the second subset. These
actions involve more varied parts of the body, including the arms as well as
the legs, so extracting the body part that should form the focus of
attention is less straightforward. The attention focus is therefore determined,
for each separate action, by a separate selection of the most moving body parts,
which is inspired by human behaviour when observing performed actions.
For example, the action Forward Bend mainly involves the base part of the
body which is composed of joints named Head, Neck, Torso and Stomach, see
Fig. 5, so the attention is focused on the base part, which includes the mentioned
joints. Another example is the action Jogging which involves arms and legs so
the attention is focused on the joints Left Ankle, Left Wrist, Right Ankle and
Right Wrist. For the actions Hand Clap, Two Hand Wave, Side Boxing,
Tennis Serve, Golf Swing and Pick up and Throw, the attention is
focused on the joints Left Elbow, Left Wrist, Right Elbow and Right Wrist.
Finally for the actions Forward Kick and Side Kick the attention is focused on
the joints Left Knee, Left Ankle, Right Knee and Right Ankle.
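One way to automate this selection is to rank joints by their total displacement over an action sequence and keep the top few. The sketch below illustrates the idea; the displacement metric and the data layout are choices made for exposition, not the system's exact procedure:

```python
def most_moving_joints(frames, joint_names, k=4):
    """Rank joints by total Euclidean displacement across an action
    sequence and return the k most moving ones as the attention focus.

    frames: list of frames; each frame maps joint name -> (x, y, z).
    """
    displacement = {name: 0.0 for name in joint_names}
    for f0, f1 in zip(frames, frames[1:]):
        for name in joint_names:
            displacement[name] += sum(
                (b - a) ** 2 for a, b in zip(f0[name], f1[name])
            ) ** 0.5
    ranked = sorted(joint_names, key=lambda n: displacement[n], reverse=True)
    return ranked[:k]
```

For a Forward Kick sequence, for example, the knees and ankles accumulate far more displacement than the torso joints and would be returned as the focus of attention.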
The attention mechanism significantly improves the performance of the system:
without attention, the system only obtained an accuracy of around 70%.
It is therefore important to highlight the contribution of attention in
making the SOM architecture a more accurate system.
In the second experiment we first used only posture data as input to the
architecture and reached a performance of 86% correctly recognized actions.
In the next step we used merged input, i.e. posture data together with its first
order dynamics (position and velocity). The categorization results show that
90% of all test sequences were correctly categorized. As can be seen in Fig. 10,
5 of the 10 actions are 100% correctly categorized and the rest also have a very
high accuracy.
Fig. 11 shows a comparison of the performance when using only posture
data (no dynamics) and when adding the dynamics to the system. Though the
accuracy for the actions Two Hand Wave and Side Boxing is reduced when the
dynamics is added, the accuracy for the actions Forward Kick,
Golf Swing and Pick Up and Throw is significantly improved. So in total the
recognition performance was improved by adding the dynamics to the system.
The classiﬁcation results of the two experiments can be compared by looking
at the trained second layer SOMs shown in Fig. 12 (corresponding to the merged
system of the ﬁrst experiment) and 13 (corresponding to the merged system of
the second experiment). As shown in these figures, for many of the actions there
are wide activated areas of neurons. This is especially the case in the first
experiment (see Fig. 12). This reflects the fact that the actions are performed in
multiple styles, which is why an action can be represented in multiple regions
in the second layer SOM. This effect can make the classification more difficult
Figure 10: Classiﬁcation results of the action recognition system when receiving the combined
input of joint positions and their ﬁrst order dynamics (merged system). Results with the
training set during training (uppermost). Results for the fully trained system, per action,
with the test set (lowermost).
Figure 11: Comparison of the classiﬁcation performance, per action, of the action recognition
system when using as input only the joint positions and the combination of the joint positions
with their ﬁrst order dynamics (merged).
Figure 12: Activations by the training set in the trained second layer SOM in the first
experiment. The map is divided into 9 regions, starting from the bottom left square (region
1) and ending in the upper right square (region 9). As can be seen, the activations are spread
out over several regions for many actions.
Figure 13: Activations by the training set in the trained second layer SOM in the second
experiment. The map is divided into 9 regions, starting from the bottom left square (region
1) and ending in the upper right square (region 9). As can be seen, the activations are spread
out over several regions for some actions, but not for others (such as Two Hand Wave, Bend,
Forward Kick and Jogging), which are contained in single regions.
Figure 14: Activations by the test set in the trained second layer SOM in the first experiment.
The map is divided into 9 regions, starting from the bottom left square (region 1) and ending in
the upper right square (region 9). As can be seen, the activations are spread out over several
regions for many actions.
due to overlapping of the activated areas of different actions. This makes it
important to use a sufficient number of samples of each action performance
style to train the system properly and to improve the accuracy. Thus many
actions form several sub-clusters in the second layer SOMs, as depicted
in Fig. 12 and Fig. 13 for the training data and in Fig. 14 and Fig. 15 for the
test data of the first and the second experiment respectively. In these figures we
see the neurons activated by each performed action sample. It can be observed
that for several actions there are activated neurons in more than one region.
To understand this better, the percentage of activations by an action in
each region has been calculated (Fig. 16) for the training data of the second
experiment. By comparing the percentage of activated areas belonging to each
action, we can see that the actions Two Hand Wave, Bend, Forward Kick
and Jogging activated neurons belonging to only one of the 9 regions. This
means that these actions can be considered to form a single cluster. For the
other actions the representations are spread out over several regions.
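The percentages can be computed by mapping each winning neuron of the 35×35 map onto the 3×3 grid of regions. The sketch below follows the region numbering of the figures, from the bottom left square (region 1) to the upper right square (region 9); the exact boundary handling is our illustrative choice:

```python
def region_of(row, col, size=35, grid=3):
    """Map a neuron coordinate on a size x size SOM to a region number
    1..grid*grid, counting from the bottom left square (region 1) to the
    upper right square (region 9). Rows are counted from the bottom."""
    r = min(row * grid // size, grid - 1)
    c = min(col * grid // size, grid - 1)
    return r * grid + c + 1

def region_percentages(activations, size=35, grid=3):
    """Percentage of activations falling in each region for one action.
    activations: list of (row, col) winner coordinates."""
    counts = {}
    for row, col in activations:
        reg = region_of(row, col, size, grid)
        counts[reg] = counts.get(reg, 0) + 1
    total = len(activations)
    return {reg: 100.0 * n / total for reg, n in counts.items()}
```

An action whose dictionary contains a single key then forms a single cluster, as observed for Two Hand Wave, Bend, Forward Kick and Jogging.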
The performance accuracy we have obtained in these experiments can be
Figure 15: Activations by the test set in the trained second layer SOM in the second experiment.
The map is divided into 9 regions, starting from the bottom left square (region 1) and
ending in the upper right square (region 9). As can be seen, the activations are spread out over
several regions for some actions, but not for others (such as Two Hand Wave, Bend, Forward
Kick and Jogging), which are contained in single regions. This is similar to the training
set in the second experiment.
compared to other relevant studies on action recognition. In the literature
one finds several action recognition systems that are validated and tested on
the MSR Action 3D dataset (Wan (accessed 2015)). Our results show a significant
accuracy improvement over the 74.7% of the state of the art system introduced
in Li et al. (2010). Even though there is a difference in the way the data
set is divided, we can show that our hierarchical SOM architecture outperforms
many of the systems tested on the MSR Action 3D data set (such as the systems
introduced in Oreifej et al. (2013), Wang et al. (2012b), Vieira et al. (2012),
Wang et al. (2012a), Xia et al. (2012) and Xia and Aggarwal (2013)).
Among the systems using self-organizing maps for action recognition,
we can refer to Huang and Wu (2010), a different SOM-based system
for human action recognition in which a different data set of 2D contours is
used as the input data. Although our system and the data set we apply are
quite different from those used in Huang and Wu (2010), our hierarchical
Figure 16: The percentage of activations in each region of the second layer SOM for each
action. The values in this table are calculated only for the training data set used in the
second experiment, as a sample to indicate that the better accuracy in the second experiment
might be due to how the second layer SOM is formed.
Figure 17: The results of the experiments using MSR Action 3D data with the Hierarchical
SOM architecture divided into the results when using postures only and when using postures
together with the dynamics.
SOM architecture outperforms this system too. In Buonamente et al. (2016),
a three layer hierarchical SOM architecture is also used for the task of action
recognition; it has some similarities with the hierarchical SOM architecture
presented in this study, besides differences in units such as the preprocessing
and the ordered vector representation. The system introduced in Buonamente et al.
(2016) was evaluated on a different dataset of 2D contours of actions (obtained
from the INRIA 4D repository), in which the system was trained on the actions
performed by one actor (Andreas) and then tested on the actions of a different
actor (Helena), resulting in a performance accuracy of 53%. Our hierarchical
SOM architecture achieves a significant improvement over this result,
as depicted in Fig. 17.
In our work on action recognition we have also implemented a version of the
hierarchical action recognition architecture that works in real time, receiving
input online from a Kinect sensor with very good results. This system is presented
in Gharaee et al. (2016). We have also made another experiment, in Gharaee
et al. (2017b), on the recognition of actions involving objects, in which the
architecture is extended to perform the object detection process too. In our future
experiments we plan to present our suggested solution for the segmentation of
actions.
In this article we have presented a system for action recognition based on
Self-Organizing Maps (SOMs). The architecture of the system is inspired by
findings concerning human action perception, in particular those of Johansson
(1973), and by a model of action categories from Gärdenfors and Warglien (2012).
The first and second layers in the architecture consist of SOMs. The third layer
is a custom-made supervised neural network.
We evaluated the ability of the architecture to categorize actions in the
experiments based on input sequences of 3D joint positions obtained by a depth
camera similar to a Kinect sensor. Before entering the first-layer SOM, the input
went through a preprocessing stage with scaling and coordinate transformation
into an ego-centric framework, as well as an attention process which reduces the
input to only contain the most moving joints. In addition, the first and second
order dynamics were calculated and used as additional input together with the
original joint positions.
The primary goal of the architecture is to categorize human actions by
extracting the available information in the kinematics of performed actions. As
in prototype theory, the categorization in our system is based on similarities
of actions, and similarity is modelled in terms of distances in SOMs. In this
sense, our categorization model can be seen as an implementation of the
conceptual space model of actions presented in Gärdenfors (2007) and Gärdenfors
and Warglien (2012).
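Similarity as distance in a SOM can be made concrete through the standard best-matching-unit computation: inputs mapped to nearby units are treated as similar. The minimal sketch below uses squared Euclidean distance and is only the lookup step, not the full training procedure:

```python
def best_matching_unit(som_weights, x):
    """Return the (row, col) of the neuron whose weight vector is
    closest to the input x, using squared Euclidean distance.

    som_weights: 2D grid (list of rows) of weight vectors.
    """
    best, best_dist = None, float("inf")
    for r, row in enumerate(som_weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_dist:
                best, best_dist = (r, c), d
    return best
```

Two action frames that activate the same or neighbouring units are thereby judged similar, which is what the second layer SOM exploits when clustering whole action sequences.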
Although categorization based on the first and second order dynamics alone has
turned out to be slightly worse than when sequences of 3D joint positions are
used, we believe that this derives from the limited quality of the dataset. We have
also noticed that the correctly categorized actions in these three different cases
do not completely overlap. This has been successfully exploited in the
experiments presented in this article, by combining them all to achieve a better
performance of the architecture.
Another reason for focusing on the ﬁrst order dynamics (implemented as
the diﬀerence between subsequent frames) is that it is a way of modelling a
part of the human attention mechanism. By focusing on the largest changes of
position between two frames, that is, the highest velocity, the human tendency to
attend to movement is captured. We believe that attention plays an important
role in selecting which information is most relevant in the process of action
categorization, and our experiment is a way of testing this hypothesis. The
hypothesis should, however, be tested with further data sets in order to be
better evaluated. In the future, we intend to perform such tests with datasets of
higher quality that also contain new types of actions.
An important aspect of the architecture proposed in this article is its
generalizability. A model of action categorization based on patterns of forces is
presented in Gärdenfors (2007) and Gärdenfors and Warglien (2012). The
extended architecture presented in this article takes forces into account by
considering the second order dynamics (corresponding to sequences of joint
accelerations) and, as has been shown, this improves the performance. We also
think it likely that the second order dynamics contains information that could
be used to implement automatized action segmentation; we will explore this in
the future. The data we have tested come from human actions, but the generality
of the architecture allows it to be applied to other forms of motion involving
animals and artefacts. This is another area for future work.
Balkenius, C., Morén, J., Johansson, B., Johnsson, M., 2010. Ikaros: Building
cognitive models for robots. Advanced Engineering Informatics 24, 40–48.
Buonamente, M., Dindo, H., Johnsson, M., 2015. Discriminating and simulating
actions with the associative self-organizing map. Connection Science 27, 118–
Buonamente, M., Dindo, H., Johnsson, M., 2016. Hierarchies of self-
organizing maps for action recognition. Cognitive Systems Research DOI:
Cangelosi, A., Metta, G., Sagerer, G., Nolﬁ, S., Nehaniv, C., Fischer, K., Tani,
J., Belpaeme, T., Sandini, G., Nori, F., Fadiga, L., Wrede, B., Rohlﬁng,
K., Tuci, E., Dautenhahn, K., Saunders, J., Zeschel, A., 2008. The italk
project: Integration and transfer of action and language knowledge in robots,
in: Proceedings of Third ACM/IEEE International Conference on Human
Robot Interaction 2, pp. 167–179.
Demiris, Y., Khadhouri, B., 2006. Hierarchical attentive multiple models for
execution and recognition of actions. Robotics and Autonomous Systems 54,
Gärdenfors, P., 2000. Conceptual Spaces: The Geometry of Thought. Cambridge,
Massachusetts: The MIT Press.
Gärdenfors, P., 2007. Representing actions and functional properties in conceptual
spaces, in: Body, Language and Mind. Mouton de Gruyter, Berlin.
volume 1, pp. 167–195.
Gärdenfors, P., 2014. Geometry of Meaning: Semantics Based on Conceptual
Spaces. Cambridge, Massachusetts: The MIT Press.
Gärdenfors, P., Warglien, M., 2012. Using conceptual spaces to model actions
and events. Journal of Semantics 29, 487–519.
Gharaee, Z., Fatehi, A., Mirian, M.S., Ahmadabadi, M.N., 2014. Attention
control learning in the decision space using state estimation. International
Journal of Systems Science (IJSS), 1–16. DOI: 10.1080/00207721.2014.945982.
Gharaee, Z., Gärdenfors, P., Johnsson, M., 2016. Action recognition online with
hierarchical self-organizing maps, in: Proceedings of the 12th International
Conference on Signal Image Technology and Internet Based Systems (SITIS).
Gharaee, Z., Gärdenfors, P., Johnsson, M., 2017a. Hierarchical self-organizing
maps system for action classification, in: Proceedings of the International
Conference on Agents and Artificial Intelligence (ICAART).
Gharaee, Z., Gärdenfors, P., Johnsson, M., 2017b. Online recognition of actions
involving objects, in: Proceedings of the International Conference on
Biologically Inspired Cognitive Architecture (BICA).
Giese, M., Thornton, I., Edelman, S., 2008. Metrics of the perception of body
movement. Journal of Vision 8, 1–18.
Giese, M.A., Lappe, M., 2002. Measurement of generalization ﬁelds for the
recognition of biological motion. Vision Research 42, 1847–1858.
Hemeren, P.E., 2008. Mind in Action. Ph.D. thesis. Lund University Cognitive
Science. Lund University Cognitive Studies 140.
Huang, W., Wu, Q.J., 2010. Human action recognition based on self organiz-
ing map, in: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE
International Conference on, IEEE. pp. 2130–2133.
Johansson, G., 1973. Visual perception of biological motion and a model for its
analysis. Perception & Psychophysics 14, 201–211.
Kalkan, S., Dag, N., Yürüten, O., Borghi, A.M., Sahin, E., 2014. Verb concepts
from affordances. to appear in .
Kohonen, T., 1988. Self-Organization and Associative Memory. Springer Verlag.
Lallee, S., Madden, C., Hoen, M., Dominey, P.F., 2010. Linking language with
embodied and teleological representations of action for humanoid cognition.
Frontiers in Neurorobotics 4. DOI: 10.3389/fnbot.2010.00008.
Levin, B., Rappaport Hovav, M., 2005. Argument Realization. Cambridge:
Cambridge University Press.
Li, W., Zhang, Z., Liu, Z., 2010. Action recognition based on a bag of 3d points,
in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010
IEEE Computer Society Conference on, IEEE. pp. 9–14.
Malt, B., Ameel, E., Imai, M., Gennari, S., Saji, N.M., Majid, A., 2014. Human
locomotion in languages: Constraints on moving and meaning. Journal of
Memory and Language 14, 107–123.
Marr, D., Nishihara, K.H., 1978. Representation and recognition of the spatial
organization of three-dimensional shapes. Proceedings of the Royal Society
in London, B 200, 269–294.
Marr, D., Vaina, L., 1982. Representation and recognition of the movements of
shapes. Proceedings of the Royal Society in London, B 214, 501–524.
Mealier, A.L., Pointeau, G., G¨ardenfors, P., Dominey, P.F., 2016. Construals of
meaning: The role of attention in robotic language production. Interaction
Studies 17, 48 – 76.
Oreifej, O., Liu, Z., Redmond, W., 2013. Hon4d: Histogram of oriented 4d
normals for activity recognition from depth sequences. Computer Vision and
Pattern Recognition .
Rosch, E., 1975. Cognitive representations of semantic categories. Journal of
Experimental Psychology: General 104, 192–233.
Runesson, S., 1994. Perception of biological motion: The KSD-principle and the
implications of a distal versus proximal approach, in: Perceiving Events and
Objects. Hillsdale, NJ, pp. 383–405.
Runesson, S., Frykholm, G., 1983. Kinematic specification of dynamics as an
informational basis for person and action perception: Expectation, gender
recognition, and deceptive intention. Journal of Experimental Psychology: General
Shariatpanahi, H.F., Ahmadabadi, M.N., 2007. Biologically inspired framework
for learning and abstract representation of attention control. Attention in
cognitive systems, theories and systems from an interdisciplinary viewpoint
4840, 307 – 324.
Vaina, L., 1983. From shapes and movements to objects and actions. Synthese
Vaina, L., Bennour, Y., 1985. A computational approach to visual recognition
of arm movement. Perceptual and Motor Skills 60, 203–228.
Vieira, A., Nascimento, E., Oliveira, G., Liu, Z., Campos, M., 2012. Stop:
Space-time occupancy patterns for 3d action recognition from depth map
sequences, in: Progress in Pattern Recognition, Image Analysis, Computer
Vision, and Applications (CIARP), pp. 252–259. DOI: 10.1007/978-3-642-
Wan, Y.W., accessed 2015. MSR action recognition datasets and
codes. URL: http://research.microsoft.com/en-us/um/people/zliu/
Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y., 2012a. Robust 3d action
recognition with random occupancy patterns, in: Computer Vision-ECCV,
Wang, J., Liu, Z., Wu, Y., Yuan, J., 2012b. Mining actionlet ensemble for
action recognition with depth cameras, in: Computer Vision and Pattern
Recognition (CVPR) 2012 IEEE Conference on, pp. 1290–1297.
Wang, W., Crompton, R.H., Carey, T.S., Günther, M.M., Li, Y., Savage, R.,
Sellers, W.I., 2004. Comparison of inverse-dynamics musculo-skeletal models
of AL 288-1 Australopithecus afarensis and KNM-WT 15000 Homo ergaster to
modern humans, with implications for the evolution of bipedalism. Journal
of Human Evolution 47, 453–478.
Warglien, M., Gärdenfors, P., Westera, M., 2012. Event structure, conceptual
spaces and the semantics of verbs. Theoretical Linguistics 38, 159–193.
Xia, L., Aggarwal, J., 2013. Spatio-temporal depth cuboid similarity feature for
activity recognition using depth camera, in: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR.2013.365.
Xia, L., Chen, C.C., Aggarwal, J., 2012. View invariant human action recogni-
tion using histograms of 3d joints, in: IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 20–27.