
Dance Performance Evaluation Using Hidden Markov Models

Article · May 2016
DOI: 10.1002/cav.1715
Abstract
This paper presents a hidden Markov model-based system for real-time gesture recognition and performance evaluation. The system decodes performed gestures and, at the end of each recognized gesture, outputs a likelihood value that is transformed into a score. This score is used to evaluate a performance by comparing it to a reference performance. For the learning procedure, a set of relational features was extracted from data recorded with a high-precision motion capture system and used to train hidden Markov models. At runtime, a low-cost sensor (Microsoft Kinect) captures the learner's movements. An intermediate model adaptation step is therefore required to recognize gestures captured by this low-cost sensor. We present one application of this gesture evaluation system in the context of learning the basics of a traditional dance. The estimated log-likelihood makes it possible to give the learner feedback in the form of a score related to his or her performance.
Sohaib Laraba
TCTS Lab, Numediart Institute,
University of Mons, Belgium
sohaib.laraba@umons.ac.be
Joëlle Tilmanne
TCTS Lab, Numediart Institute,
University of Mons, Belgium
joelle.tilmanne@umons.ac.be
Keywords: gesture recognition, Hidden Markov
Models, interactive systems, Maximum Like-
lihood Linear Regression, performance evalua-
tion
1 Introduction
Efficient human-computer interaction is playing an increasingly important
role, and gesture recognition is one of the key steps towards it. Gesture
recognition systems have been successfully developed with different methods
such as Hidden Markov Models (HMM), neural networks, finite-state machines
(FSM) and template matching [1]. In most cases, the data used for training
and the data to be recognized come from the same capture system. Ideally,
the data is captured by a high-precision motion capture system. Such systems
provide very precise data at frame rates of up to 200 fps. However, they are
very expensive and hence not suited to home use. In order to build an
interactive dance learning system using motion capture technologies, we need
to design a dance performance evaluation module. We present here an approach
that makes it possible both to train robust models on high-precision data
and to recognize gestures recorded in real time by a low-cost system that is
cheaper and easier to obtain.
We are particularly interested in computing the similarity between the
performed gesture and the reference gestures stored in the database and used
for the training phase. As illustrated later, the similarity score is useful
for providing feedback to the user, who can then improve his or her
performance. Gestures are considered to be multidimensional temporal curves
representing relational features extracted from geometric relationships
between different joints of the user's skeleton. This is inspired by
Müller's work on the analysis of motion data [2].
Our system is based on Hidden Markov Models with an additional model
adaptation step using the Maximum Likelihood Linear Regression (MLLR)
procedure, as discussed in [3]. This step is essential in order to obtain a
sensor-dependent gesture recognition system. For decoding, the Viterbi
algorithm is used; it yields a log-likelihood value that is transformed into
a percentage score presented to the user.
This paper is structured as follows. First, we present a summary of related
work. We then describe the algorithm used for recognition, with numerical
results on real data from the second version of the Microsoft Kinect and
from a high-precision motion capture system (Qualisys¹). Next, we show how
the log-likelihood is transformed into a percentage score related to the
performance. Finally, we present a typical use case for learning the basics
of a traditional dance.
2 Related Works
Full-body gesture recognition has been performed with different approaches,
each relying on different motion features. We first review recent feature
extractors used to represent a motion sequence, then briefly summarize
gesture classification approaches. Finally, we describe some techniques for
gesture evaluation and scoring.
2.1 Feature extraction
Different feature extraction methods based on the skeletal information of
the body have been proposed. In some works, such as [4], 3D angular values,
together with their first and second derivatives (velocity and
acceleration), were used to model stylistic gait sequences. In [5], a set of
features is computed by calculating the Euclidean distance between every
pair of 3D joints in the current frame, and the distances between the joints
in the current frame and those in the previous frame. To capture the overall
dynamics of body movement, similar distances are computed between the
current frame and a neutral pose. The neutral pose is computed by averaging
the initial skeletons of all action sequences. Each individual feature value
was clustered into 5 groups via k-means and replaced with a 5-bit binary
vector. Müller et al. [6] introduced geometric features, a class of Boolean
features expressing geometric relations between certain body points of a
pose, for example whether the right foot lies in front of or behind the
plane spanned by the left foot, the left hip joint and the center of the
hip. Such geometric features are robust to spatial variations and allow the
identification of logically corresponding events in similar motions. Other
methods learn a dictionary of key poses and represent an action sequence in
terms of these key poses. Ofli et al. [7] used a histogram of motion words
(HMW), where a set of 3D locations representing the most informative joints
is clustered into K poses (or motion words) using k-means. An action
sequence is represented by counting the number of detected motion words.

¹ Qualisys Motion Capture System: http://www.qualisys.com/
2.2 Gesture classification
To deal with the temporal warping that affects motion sequences and to
ensure equal lengths, Dynamic Time Warping (DTW) has been used in many
works, such as [6], [8] and [9]. Classification is then performed with a
k-nearest-neighbor classifier. Finite-state machines (FSM) have been
employed efficiently to model human gestures [10] and were combined with
Support Vector Machines in [11]. Many works [4] [12] [13] have used Hidden
Markov Models for modeling body motion time series. Four real-time decoding
algorithms based on HMMs were presented in [4] for stylistic gait
recognition and following. These methods all rely on the Viterbi algorithm
for decoding, but each uses a different approach. The algorithms are
evaluated on their ability to recover the progression over time in real
time. Bevilacqua et al. [12] developed a learning strategy based on a single
recorded example. Their system outputs the time progression index and the
likelihood values that are used for decoding. A major limitation of this
technique is the large number of states, which can hinder real-time
computation. Our system is based on Hidden Markov Models with a fixed number
of states, trained on a larger number of samples in order to model the
variability of the gestures. Thanks to their topology, HMMs capture both the
temporal and the stylistic variability of the motion. Our approach is
described in section 3. With an adaptation technique, it yields a
sensor-dependent gesture recognition system, which means that the data used
for decoding can differ from the data used for training (for example,
because it comes from a different sensor). For decoding, a standard Viterbi
algorithm is used.
2.3 Performance Evaluation
A score that evaluates the performance can be very helpful for learning and
improving dance steps. To evaluate a dancer's performance, Kitsikidis et al.
[14] computed distances between the knees and ankles of the learner and
those of a reference dancer and attributed a final score based on these
distances. However, this method does not deal with the temporal warping
between the gestures of the reference dancer and those of the learner. In
[6], Müller et al. dealt with this issue by using Dynamic Time Warping (DTW)
and creating a binary template of the reference gesture; similarity is then
measured by computing a distance between the template and the gesture to
compare. Bloom et al. [15] sum occurrences of true positives, false
negatives and false positives along an event timeline to produce an F1
score. Chan et al. [16] proposed a dance training system based on motion
capture and virtual reality technologies. The student's motions are captured
by a high-precision motion capture system while he or she tries to imitate a
teacher's movements. Their system computes the differences between the
sequences of the student and the teacher using DTW and Euclidean distance,
and provides a score for each joint and an overall score for the whole
performance. The features used to compute these distances are positions,
velocities or angles. However, the data used by their system comes from a
high-precision motion capture system, which is very expensive and needs a
special setup, making it difficult to deploy at home. In our work, a gesture
is evaluated using HMMs. When decoding a gesture with the Viterbi algorithm,
we output the log-likelihood, which is interpreted and translated into a
percentage score. The approach is detailed in section 3.5.
3 Gesture Modeling
As is usual in machine learning, a gesture recognition system includes two
procedures, learning and decoding. As mentioned previously, our system is
developed under the constraint that the data used for decoding differs from
the data used for learning. Hence, an additional adaptation procedure is
required.
The HMMs used for gesture recognition were trained using HTK (Hidden Markov
Model Toolkit [17]). The training dataset was recorded using a
high-precision motion capture system (Qualisys). An expert in a traditional
dance from the south of Belgium (Walloon dance) was recorded performing
different basic steps: Maclotte Base, Passepied Base, Passepied Fleuret and
Back step. The expert was also recorded using a low-cost device (Microsoft
Kinect V2) in order to select the features that are not strongly affected by
the change of motion capture system. The recorded sequences consist of
series of forward and backward steps, and the dancer does not have to turn.
This kind of gesture made our study easier because the Kinect requires a
specific setup in order to work correctly: the user needs to face the Kinect
at all times in order to be tracked well. The Qualisys system records motion
at a frame rate of 177 frames per second (fps). Each frame contains the 3D
positions of 68 markers placed on the body, that is, 204 values per frame.
The motion data was filtered down to 30 fps and the 68 marker positions were
used to form a skeleton of 20 joints corresponding to the articulations of
the human body. Only eleven joints were selected for the subsequent feature
extraction steps, excluding the arm joints, which are not important in the
performed gestures. Seven relational features, inspired by Müller [2] and
representing geometric relationships between joints, were also used:
• Distance between the right ankle joint and the plane defined by the
  pelvis, left hip and left ankle joints.
• Distance between the left ankle and the plane fixed at the right ankle and
  normal to the vector (right hip, left hip).
• Angle between the vectors (right knee, right hip) and (right knee, right
  ankle).
• Angle between the vectors (left knee, left hip) and (left knee, left
  ankle).
• Angle between the vectors (neck, pelvis) and (right hip, right knee), i.e.
  the angle between the right leg and the spine.
• Angle between the vectors (neck, pelvis) and (left hip, left knee), i.e.
  the angle between the left leg and the spine.
• Angle between the vector (neck, pelvis) and the vector perpendicular to
  the ground.

Figure 1: Relational features describing geometric relations between body
points of a pose, indicated by red and black markers [2].
Figure 1, taken from [2], illustrates the main idea of these relational
features. The respective features in this figure express whether (a) the
right foot lies in front of or behind the body, (b) the left hand is
reaching out to the front of the body or not, and (c) the left hand is
raised above neck height or not. We also used 36 features representing the
distances between each pair of joints of the lower part of the skeleton
during an action. In summary, a frame is described by 76 dimensions. The
recorded sequences were annotated manually with the four classes (dance
steps) cited previously.
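As a minimal sketch of how such features could be computed (this is not the
authors' implementation), the following Python code implements the three
kinds of geometric quantities described above: a point-to-plane distance, an
angle between two joint vectors and the pairwise joint distances. The helper
names, the use of a signed distance and the choice of nine lower-body joints
(which gives 9·8/2 = 36 pairs) are assumptions made for illustration. Under
the further assumption that the normalized 3D positions of the eleven
selected joints are part of the frame vector, the dimensions add up as
33 + 7 + 36 = 76.

```python
import math
from itertools import combinations

def vec(a, b):
    """Vector from joint a to joint b."""
    return tuple(q - p for p, q in zip(a, b))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def norm(u):
    return math.sqrt(dot(u, u))

def point_plane_distance(point, a, b, c):
    """Signed distance from `point` to the plane through joints a, b and c."""
    normal = cross(vec(a, b), vec(a, c))
    return dot(vec(a, point), normal) / norm(normal)

def angle_between(u, v):
    """Angle in radians between vectors u and v."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding errors

def pairwise_distances(joints):
    """Euclidean distance between every pair of joints."""
    return [norm(vec(p, q)) for p, q in combinations(joints, 2)]

# Toy check with made-up coordinates (not data from the paper).
lower_body = [(float(i), 0.0, 0.0) for i in range(9)]  # 9 hypothetical joints
n_dims = 11 * 3 + 7 + len(pairwise_distances(lower_body))
print(n_dims)  # 76
```

Whether the distances in the original system are signed or absolute is not
stated in the text; taking the absolute value of `point_plane_distance`
would give the unsigned variant.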
3.1 Learning
The learning procedure followed in our system is illustrated in Figure 2.
The extracted features are used to train left-to-right Hidden Markov Models
with no skip transitions for five classes: the four steps cited previously
plus one class representing the pauses between gestures. A left-to-right
model with no skip transitions is a basic model in which the only possible
transitions at each frame are either to stay in the same state or to move to
the next state. Each model in our system consists of 11 states; this number
was selected empirically as it gives the highest recognition rate.

Figure 2: 11-state left-to-right Hidden Markov Model for the learning
procedure.

In addition to the number of states, two probability measures must be
defined: the transition probabilities $t_{i,j}$ between two states
$(s_i, s_j)$ and the probability density functions $e_i$ of the observations
in each state $s_i$. The probability density functions (pdfs) can be modeled
either by a mixture of Gaussians or by a single Gaussian. In our approach we
used a single Gaussian, as illustrated in Figure 2. The database used to
train our models contains about 8000 frames annotated as 114 steps. 70% of
the database was used for the training phase and the remaining 30% for
decoding. We ran 10 different training batches for cross-validation.
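The topology described above can be sketched as a transition matrix. The
function name, the initial self-transition probability of 0.5 and the
self-looping last state are assumptions made for illustration; HTK models
additionally have non-emitting entry and exit states, and the probabilities
are re-estimated during training.

```python
def left_to_right_transitions(n_states, stay_prob=0.5):
    """Transition matrix of a left-to-right HMM with no skip transitions."""
    trans = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            trans[i][i] = stay_prob            # stay in the same state
            trans[i][i + 1] = 1.0 - stay_prob  # move to the next state
        else:
            trans[i][i] = 1.0                  # last state only loops on itself
    return trans

# 11 states, as selected empirically in the text.
A = left_to_right_transitions(11)
```

Every entry outside the diagonal and the first superdiagonal is zero, which
is what rules out backward and skip transitions.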
3.2 Adaptation
The idea of adaptation is inspired by work in the field of speech
recognition [18] [19], where a speaker-independent system is adapted to
improve the recognition of a new speaker. To avoid needing a database
containing a huge amount of data for every speaker, the system can be
trained on one or a few speakers for whom sufficient data is available. For
a new speaker, a few samples of his or her data can be enough to adapt the
models to this speaker and thus create a speaker-dependent speech
recognition system. In the present work, we apply a similar approach and
create a "sensor-dependent" gesture recognition and evaluation system based
on an adaptation procedure [3]. This is done by using some samples of the
same gestures captured by the low-cost sensor that will be used for
real-time decoding. This process makes it possible to have both clean,
highly precise data to analyze and a system that can recognize gestures from
a different sensor.

Figure 3: Illustration of the mean-only MLLR procedure.

One approach to adaptation is Maximum Likelihood Linear Regression (MLLR).
MLLR estimates the parameters of an adapted model by computing a linear
transformation of the parameters of a given speaker-independent model that
maximizes the likelihood of the adaptation observations. In our system we
used the mean-only MLLR method implemented in the HTK toolkit, in which a
new adapted mean vector $\hat{\mu}$ is calculated. The idea of this method
is to shift the mean parameters of the Gaussians in order to obtain updated
models that fit the new data, as illustrated in Figure 3. Because only
single Gaussians are used to model the probability density functions (pdfs),
a few samples of adaptation data can be enough for an efficient adaptation.
This data is captured by the Microsoft Kinect V2 at a frame rate of about
30 fps. It contains about 1000 frames (about 17% of the size of the training
dataset) annotated with the four classes (steps) of the dance.
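For reference, the mean-only transform can be written in the standard form
used in [18] and in the HTK documentation. Here $n$ is the dimension of the
feature vector, $\mu$ is the mean of a Gaussian in the original model, and
$W$ is an $n \times (n+1)$ matrix estimated so as to maximize the likelihood
of the adaptation data. How many transforms are estimated and how they are
shared between Gaussians is not stated in the text, so a single shared $W$
is shown only as an assumption.

```latex
\hat{\mu} = W\,\xi = A\,\mu + b,
\qquad
\xi = \begin{bmatrix} 1 & \mu_1 & \cdots & \mu_n \end{bmatrix}^{\top},
\qquad
W = \begin{bmatrix} b & A \end{bmatrix}
```

The covariances are left unchanged in the mean-only variant, which keeps the
number of parameters to estimate small relative to the available adaptation
data.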
3.3 Decoding
Decoding is the process of finding the most likely sequence of hidden states
corresponding to a new sequence of observations, given the parameters of the
model. This is performed with a standard dynamic programming algorithm, the
Viterbi algorithm.
A major issue when using Viterbi for state decoding is that the whole
sequence must be known in advance; consequently, it cannot be used for
real-time recognition. In our case this is not a problem, because we want to
give the user a final feedback at the end of his or her performance.

Figure 4: Effect of the selected feature sets and of the number of states on
the recognition accuracy.
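The decoding step can be sketched as follows. This is a generic
log-domain Viterbi implementation, not the HTK or MotionMachine code; the
function name and the input layout (per-frame log emission densities
computed beforehand, for example from the single Gaussians of each state)
are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def viterbi(log_init, log_trans, log_emit):
    """Return the best state path and its log-likelihood.

    log_init[j]     : log probability of starting in state j
    log_trans[i][j] : log probability of moving from state i to state j
    log_emit[t][j]  : log density of frame t under state j
    """
    n_states = len(log_init)
    delta = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_delta, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at frame t.
            best_i = max(range(n_states),
                         key=lambda i: delta[i] + log_trans[i][j])
            new_delta.append(delta[best_i] + log_trans[best_i][j]
                             + log_emit[t][j])
            ptr.append(best_i)
        delta = new_delta
        backpointers.append(ptr)
    # Backtracking needs the complete sequence, hence the offline use.
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]
```

The returned log-likelihood is that of the single best path, which is why it
is an approximation of the full likelihood of the sequence given the model.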
3.4 Recognition Results
The first challenge for the recognition process was to select the number of
states that gives the highest recognition accuracy, and to select the most
important features among those cited previously. To this end, we trained our
models on the high-precision data using, in turn, each single feature
category (3D normalized joint positions (pos), relative motions, as
distances between joint pairs (rm), and the seven relational features (rf))
and then a combination of these features. The number of states was varied
from 4 to 12. Ten different training and decoding batches were used for
cross-validation. Figure 4 shows the resulting accuracy for each case. We
observe that in most cases combinations of features give a higher accuracy
than a single category, except that with 9, 10 or 11 states the seven
relational features alone gave a higher accuracy than all other feature
sets. The highest accuracy with relational features alone was 96.17%
(±3.07), obtained with 10 states. For the combination of feature sets, the
accuracy was 94.5% (±1.50) with 11 states. We selected these two cases for
the next steps.

Figure 5: Recognition accuracy using Kinect data, then Qualisys data, for
both training and test.
In fact, several features were tested, and these three sets are the ones
that gave a high accuracy for both Kinect and Qualisys data. Figure 5 shows,
for both selected cases, the recognition accuracy using only Kinect data for
training and test, then using only Qualisys data. Recognition accuracy was
higher than 87% in both cases when using only Kinect data, and higher than
94% when using only Qualisys data. This shows that the selected features are
not easily influenced by the change of motion capture system.
The second challenge of our recognition process was to decode gestures from
data captured by a sensor different from the one used for training the
original models, in our case Kinect V2 data. We applied an adaptation
procedure using Maximum Likelihood Linear Regression (MLLR) and compared the
recognition results before and after adaptation in the two configurations
selected at the previous step. The results are shown in Figure 6. For the
first case (relational features, 10 states) we do not notice a big
difference: the recognition accuracy was about 63.9% both before and after
adaptation. For the second case (combination of the three feature sets, 11
states) the difference is very clear: the accuracy was 74.31% (±3.03) before
adaptation and 81.03% (±3.44) after adaptation. Even with no adaptation, the
combination of features gives a higher accuracy than the relational features
alone. This may be explained by the insufficient number of features in the
first case (only 7 features) and by the fact that these features are not
robust across sensors; in other words, they are more affected by the change
of sensor. For the next step, we adopted the models of the second case,
where all features are selected and the HMM has 11 states.

Figure 6: Recognition accuracy before and after adaptation for the selected
feature sets and HMM topologies.
3.5 Performance Evaluation
During decoding, the Viterbi algorithm gives an approximation of the
likelihood of the gesture to be recognized given the model. The output is
actually a log-likelihood value, which decreases as the length (number of
frames) of the sequence increases. We therefore normalize the log-likelihood
in time by dividing it by the length of the sequence. The resulting value is
not in a bounded range and hence cannot be interpreted by the user. Our goal
is to give the user a percentage score comparing his or her performance to a
reference represented by the step models. The user can interpret this
percentage score as an evaluation of his or her performance: good (higher
than 75%), medium (between 50% and 70%) or bad (less than 50%). In order to
obtain this score, we map the resulting normalized log-likelihood ($L$)
through the following function:
$$
\text{score} =
\begin{cases}
0, & \text{if } L < a \\
1, & \text{if } L > b \\
\dfrac{L - a}{b - a}, & \text{otherwise}
\end{cases}
\qquad (1)
$$

Figure 7: Score function used for mapping the normalized log-likelihood to
the score.
The function is illustrated in Figure 7. It yields a score that evaluates
the performance of the user.
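A minimal Python sketch of this mapping is given below. The function names
are illustrative, and the threshold values used in the usage example are
hypothetical placeholders, since the values of a and b used in the system
are not reported.

```python
def normalized_log_likelihood(log_likelihood, n_frames):
    """Time-normalize a Viterbi log-likelihood by the sequence length."""
    return log_likelihood / n_frames

def score(L, a, b):
    """Map a normalized log-likelihood L to [0, 1] as in equation (1).

    a and b are the lower and upper thresholds, with a < b.
    """
    if L < a:
        return 0.0
    if L > b:
        return 1.0
    return (L - a) / (b - a)

# Hypothetical thresholds and log-likelihood, for illustration only.
a, b = -50.0, -20.0
L = normalized_log_likelihood(-3500.0, 100)  # -35.0
percentage = 100.0 * score(L, a, b)
print(percentage)  # 50.0
```

Multiplying by 100 gives the percentage shown to the learner; values of L
outside [a, b] saturate at 0% or 100%.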
The thresholds $a$ and $b$ are determined empirically from the
log-likelihood values obtained when decoding reference gestures. We compared
our approach to that of Kitsikidis et al. [14] by evaluating an expert and a
non-expert of the Walloon dance performing two different steps (Maclotte
Base, MB., and Passepied Base, PB.). We also evaluated the performance of
the Backward step (Back.). The results are presented in Table 1.
Table 1: Comparison between expert and non-expert performances of the
Walloon dance (scores in %).

Performer    Step    HMM     Kitsikidis
expert       MB.     96.06   78.18
expert       PB.     99.32   94.87
expert       Back.   87.63   69.41
non-expert   MB.     87.63   54.73
non-expert   PB.     48.48   66.26
non-expert   Back.   56.40   68.54
The average scores of the expert with our method were 96.06% for Maclotte
Base, 99.32% for Passepied Base and 87.63% for the Backward step. Such high
scores are expected because the models were trained on the expert's data.
With the Kitsikidis method the scores were lower: 78.78% for the Maclotte
Base step, 94.87% for the Passepied Base step and 69.41% for the Backward
step.
Figure 8: Overall architecture of the communi-
cation between MotionMachine and
the game.
The expert of this dance commented on the student's performance and provided
an evaluation of each performed gesture. This evaluation is summarized in
Table 2.
Table 2: Evaluation of the student's performance by the expert.

               MB.    PB.    Back.
Expert Eval.   Good   Bad    Medium
According to the expert, the student performs the Maclotte step well, the
Passepied needs to be improved and the performance of the Backward step is
acceptable. Based on these comments, we can see that our method confirms the
expert evaluation: the student obtained a score of 87.63% for the Maclotte
Base step and 48.48% for the Passepied Base step. The Kitsikidis method did
not match the expert's evaluation of the student: the scores estimated for
the two steps (Maclotte Base and Passepied Base) were 54.73% and 66.23%
respectively.
4 Application
The evaluation system presented in this paper has been used for learning the
basic steps of the traditional Walloon dance, a dance from the southern
region of Belgium, in a serious-game-like environment. This serious game was
developed within the framework of the European project i-Treasures². The
overall architecture of the module is presented in Figure 8. The Viterbi
decoding and the scoring function were implemented within the MotionMachine
framework³. This implementation, which uses the models previously trained
with HTK, is the dance step evaluation module used within the game.

² The i-Treasures project: http://i-treasures.eu/

At runtime, the learner's movements are captured by the Kinect V2 sensor
through a module implemented within the MotionMachine framework. These
movements are sent to the actual game rendering framework for dance
learning, which runs on Unity⁴. Figure 9 presents an overview of the Unity
game interface. An avatar of the expert is shown in the top right of the
figure and the avatar of the learner is shown in the middle of the scene.
The learner imitates the expert's moves, then the developed module decodes
his or her movements and sends a score value that is displayed on screen. If
the learner's performance is good enough (a score > 50%), the game presents
the next exercise to be learned; otherwise, the same exercise is presented
again and the learner has to try again until he or she gets the moves right.

Figure 9: Overview of the game interface.
5 Conclusion and future works
In this paper we have presented an approach for gesture recognition and
evaluation based on Hidden Markov Models (HMMs). This approach makes it
possible, on the one hand, to use a high-precision motion capture system to
capture reference gestures for better analysis and modeling and, on the
other hand, to use a low-cost system for decoding and evaluation thanks to
an adaptation procedure. The system outputs log-likelihood values that are
translated into a score to evaluate the learner's gestures. Results showed
that the adaptation procedure is important when dealing with different types
of data. In addition, the proposed gesture evaluation approach agrees with
the expert evaluation. This system has been used to evaluate a traditional
dance learner's performance in a game-like environment.
In this work, a limited set of gestures from two subjects was used for
recognition. Future work will involve increasing the number of gestures,
training and testing with a larger number of subjects, testing the algorithm
with kinds of gestures other than traditional dances, and evaluating the
game with more learners.

³ MotionMachine: http://www.numediart.org/motionmachine/
⁴ Unity 3D: https://unity3d.com/
Acknowledgements
This work has been supported by the European Union (FP7-ICT-2011-9) under
grant agreement no. 600676 (i-Treasures project).
References
[1] Sushmita Mitra and Tinku Acharya. Ges-
ture recognition: A survey. Systems,
Man, and Cybernetics, Part C: Applica-
tions and Reviews, IEEE Transactions on,
37(3):311–324, 2007.
[2] Meinard Müller. Information retrieval for music and motion, volume 2.
Springer, 2007.
[3] Sohaib Laraba, Joëlle Tilmanne, and Thierry Dutoit. Adaptation procedure
for HMM-based sensor-dependent gesture recognition. In Proceedings of the
8th ACM SIGGRAPH Conference on Motion in Games, pages 17–22. ACM, 2015.
[4] Thierry Ravet, Joëlle Tilmanne, and Nicolas d'Alessandro. Hidden Markov
model based real-time motion recognition and following. In Proceedings of
the 2014 International Workshop on Movement and Computing, page 82. ACM,
2014.
[5] Chris Ellis, Syed Zain Masood, Marshall F
Tappen, Joseph J Laviola Jr, and Rahul
Sukthankar. Exploring the trade-off be-
tween accuracy and observational latency
in action recognition. International Jour-
nal of Computer Vision, 101(3):420–436,
2013.
[6] Meinard Müller, Tido Röder, and Michael Clausen. Efficient content-based
retrieval of motion capture data. In ACM Transactions on Graphics (TOG),
volume 24, pages 677–685. ACM, 2005.
[7] Ferda Ofli, Rizwan Chaudhry, Gregorij Kurillo, René Vidal, and Ruzena
Bajcsy. Sequence of the most informative joints (SMIJ): A new representation
for human skeletal action recognition. Journal of Visual Communication and
Image Representation, 25(1):24–38, 2014.
[8] Jaron Blackburn and Eraldo Ribeiro. Hu-
man motion recognition using isomap and
dynamic time warping. In Human motion–
understanding, modeling, capture and an-
imation, pages 285–298. Springer, 2007.
[9] Raviteja Vemulapalli, Felipe Arrate, and
Rama Chellappa. Human action recogni-
tion by representing 3d skeletons as points
in a lie group. In Proceedings of the IEEE
Conference on Computer Vision and Pat-
tern Recognition, pages 588–595, 2014.
[10] Pengyu Hong, Matthew Turk, and
Thomas S Huang. Gesture modeling and
recognition using finite state machines. In
Automatic face and gesture recognition,
2000. proceedings. fourth ieee interna-
tional conference on, pages 410–415.
IEEE, 2000.
[11] Raphael W de Bettio, André HC Silva, Tales Heimfarth, André P Freire,
and Alex GC de Sá. Model and implementation of body movement recognition
using support vector machines and finite state machines with cartesian
coordinates input for gesture-based interaction. Journal of Computer Science
& Technology, 13, 2013.
[12] Frédéric Bevilacqua, Bruno Zamborlin, Anthony Sypniewski, Norbert
Schnell, Fabrice Guédy, and Nicolas Rasamimanana. Continuous realtime
gesture following and recognition. In Gesture in embodied communication and
human-computer interaction, pages 73–84. Springer, 2009.
[13] Di Wu and Ling Shao. Leveraging hi-
erarchical parametric networks for skele-
tal joints based action segmentation and
recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pat-
tern Recognition, pages 724–731, 2014.
[14] Alexandros Kitsikidis, Kosmas Dim-
itropoulos, Erdal Yilmaz, Stella Douka,
and Nikos Grammalidis. Multi-sensor
technology and fuzzy logic for dancers
motion analysis and performance evalua-
tion within a 3d virtual environment. In
Universal Access in Human-Computer
Interaction. Design and Development
Methods for Universal Access, pages
379–390. Springer, 2014.
[15] Victoria Bloom, Dimitrios Makris, and
Vasileios Argyriou. G3d: A gaming ac-
tion dataset and real time action recogni-
tion evaluation framework. In Computer
Vision and Pattern Recognition Workshops
(CVPRW), 2012 IEEE Computer Society
Conference on, pages 7–12. IEEE, 2012.
[16] Jacky CP Chan, Howard Leung, Jeff KT
Tang, and Taku Komura. A virtual re-
ality dance training system using motion
capture technology. Learning Technolo-
gies, IEEE Transactions on, 4(2):187–195,
2011.
[17] S. Young, D. Kershaw, J. Odell, D. Olla-
son, V. Valtchev, and P. Woodland. The
HTK Book version 3.0. Cambridge Uni-
versity Press, 2000.
[18] Christopher J Leggetter and Philip C
Woodland. Maximum likelihood linear re-
gression for speaker adaptation of continu-
ous density hidden markov models. Com-
puter Speech & Language, 9(2):171–185,
1995.
[19] CJ Leggetter and PC Woodland. Flexible
speaker adaptation using maximum likeli-
hood linear regression. In Proc. ARPA Spo-
ken Language Technology Workshop, vol-
ume 9, pages 110–115. Citeseer, 1995.