
Providing Semantic Annotation for the CMU Grand Challenge Dataset


Kristina Yordanova, Frank Krüger, Thomas Kirste
March 22, 2018
Providing ground truth is essential for activity recognition for three reasons: to
apply methods of supervised learning, to provide context information for knowledge-
based methods, and to quantify the recognition performance. Semantic annotation
extends simple symbolic labelling by assigning semantic meaning to the label, en-
abling further reasoning. In this paper we present a novel approach to semantic
annotation by means of plan operators. We provide a step-by-step description of
the workflow for manually creating the ground truth annotation. To validate our approach,
we create a semantic annotation of the CMU grand challenge dataset, which
is often cited but, due to missing and incomplete annotation, almost never used.
We evaluate the quality of the annotation by calculating the interrater reliability
between two annotators who labelled the dataset. The results show almost perfect
overlap (Cohen’s κ of 0.8) between the annotators. The produced annotation
is publicly available, to enable further usage of the CMU grand challenge dataset.
1 Introduction
The annotation of sensor datasets describing human behaviour is an important part of
the activity and plan recognition process. It provides a target label for each observa-
tion in the cases where supervised learning is applied. It also serves as a ground truth
for evaluating the performance of the activity or plan estimation procedure by comparing
the values estimated by the model with the annotated values. Finally, it provides
the context information needed for developing knowledge-based activity recognition
systems. In this paper we present a model-based approach to semantic annotation of
human behaviour based on the annotation process proposed in [7]. There, the labels
assigned to the data provide an underlying semantic structure that contains information
about the actions, goals, and plans being executed. This semantic structure is represented
in the form of a model of the behaviour’s state in terms of a collection of state
variables. Actions are then defined as effects that change the state of the model. This
form of annotation provides structured knowledge of the concepts in the data being
annotated and enables the reasoning over underlying behaviour changes, their causal
relations, and contextual dependencies. Such annotation is important for evaluating
plan recognition approaches that aim not only to recognise the goal of the plan, but
also the subgoals and actions being executed. Furthermore, the model-based semantic
annotation is important for evaluating the performance of any approach that aims at
recognising the underlying actions’ context. Finally, the annotation will be beneficial
for approaches that strive to learn models of human behaviour.
The contribution of this paper is threefold: first, we introduce a novel approach to
semantic annotation by means of precondition and effect rules; second, we describe
a step by step workflow to create such annotation; and finally, we provide a semantic
annotation for three types of recipes from the CMU grand challenge dataset.
The paper is structured as follows. In Section 2, we discuss the types of annotation
available in the literature and outline how our approach differs from them. Section 3
describes the proposed approach. In Section 4, we discuss how to improve
the quality of the annotation by training the annotators, while Section 5 illustrates
the approach by re-annotating the Carnegie Mellon University Multi-Modal Activity
Database (CMU-MMAC). The new annotation will also be made publicly available at
the authors’ website. In Section 6, we evaluate our approach by calculating the inter-
rater reliability between different annotators. Finally, the paper concludes with a short
discussion of the approach.
2 Annotation of Human Behaviour
In the context of human behaviour recognition we distinguish between three different
types of annotation. The first is the annotation of activities where a textual description
(or label) is assigned to the executed action [22, 11, 14, 13]. More formally, the objective
is to manually assign a label $l_i$ to each time step of a time series. This is often
done by analysing a separately recorded video log of the executed activities. These
labels are usually called ground truth, as they provide a symbolic representation of the
true sequence of activities. However, for the finite set $L = \{l_1, \dots, l_n\}$ of labels there
is usually no further information besides the equality relation. Annotations such as
take-baking pan provide a textual description of the executed task but
do not contain an underlying semantic structure. There is usually no formal set of constraints
that restricts the structure of the label sequences. Typically, nothing prevents
an annotator from producing causally impossible sequences like “put fork to drawer”, “close drawer”,
“take knife from drawer”. This is also the most common type of annotation of human
behaviour, partially because even the assignment of non-semantic labels to the data is
a difficult, time consuming, and error prone task [22].
The second type of annotation is the plan annotation. It can be divided into goal
labelling and plan labelling [6]. The goal labelling is the annotation of each plan with a
label of the goal that is achieved [1, 5]. In contrast, plan labelling provides annotation
not only of the goal, but also of the actions constituting the plan, and of any subgoals
occurring in the plan [3]. The latter is, however, a time-consuming and error-prone
process [6], which explains why the only attempts at such plan annotation have been made
for tasks executed on a computer (e.g. executing plans in an email program [3]).
This is also reflected in activity and plan recognition approaches such as [19, 15] that
use only synthesised observations, and thus synthesised annotation, to recognise the
human actions and goals.
The third type of annotation is the semantic annotation [18]. The term comes from
the field of the semantic web, where it is described as the process and the resulting
annotation or metadata consisting of aligning a resource or a part of it with a description
of some of its properties and characteristics with respect to a formal conceptual
model or ontology [2]. The concept is later adopted in the field of human behaviour
annotation, where it describes the annotating of human behaviour with labels that have
an underlying semantic structure represented in the form of concepts, properties, and
relations between these concepts [20, 9]. We call this type of semantic structure an
algebraic representation in accordance with the definition provided in [12]. There, an
algebraic representation is one where the state of the system is modelled in terms of
combinations of operations required to achieve that state.
In contrast to the algebraic representation, a model-based representation
provides a model of the system’s state in terms of a collection of state variables.
The individual operations are then defined in terms of their effects on the state
of the model [12]. To our knowledge, there have been no attempts to represent the
semantic structure of human behaviour annotation in the form of model-based repre-
sentation. In the next sections we present an approach to semantic annotation of human
behaviour where the underlying semantic structure uses a model-based representation.
This representation allows us to provide not only a semantic meaning to the labels, but
also to produce plan labels and to reason about the plan’s causal correctness. Furthermore,
it gives the state of the world corresponding to each label and allows us to track
how it changes during the plan execution.
3 A novel approach to Annotating Human Behaviour
In this section, we present a model-based semantic annotation approach that strives
to overcome the drawbacks of the approaches outlined in the previous section. Our
approach combines the characteristics of the state-of-the-art approaches and, in addition,
relies on model-based instead of algebraic knowledge representation. The activity
datasets that will potentially benefit from this approach are those
describing goal-oriented behaviour. Typical activity recognition experiments such as
the CMU-MMAC [10] can be regarded as goal oriented. In them, the participants
are instructed to fulfil a task such as food preparation. To ensure comparability of
different repetitions, identical experimental setup is chosen for each trial. As a result,
the action sequence executed by the participants can be regarded as a plan, leading from
the same initial state (as chosen by the experimenter) to a set of goal states (given in
the experiment instruction). In the domain of automated planning and scheduling, plan
sequences are generated from domain models, where actions are defined by means of
preconditions and effects. A plan is then a sequence of actions generated by grounding
the action schemas of the domain leading from an initial state to the goal state. In
contrast, in our semantic annotation approach, we manually create plans that reflect
the participants’ actions, and define a planning domain, which describes the causal
connections of the actions to the state of the world. Below we describe the proposed
annotation process, including the definition of the label set L, the label semantics, the
manual annotation procedure, and the validation procedure. We illustrate the process
with examples from the kitchen domain.
Step one: Action and entity dictionary definition In the first step a dictionary of
actions and entities is created. The actions have a name representing the action class,
and a description of the action class that distinguishes it from the remaining classes.
The dictionary also contains the set of all entities observed during the experiment. The
dictionary is manually created by domain experts by analysing the video log, which
is typically recorded during the experiment. The results of the dictionary definition
are the set of action classes and the set of entities manipulated during action execution
(see Table 1). To allow annotators to distinguish between different actions, each action
name is accompanied by its definition. If we look at action $a_1$ (take), its definition is to
grab an object. During the execution of take, the location of the object changes from
the initial location to the hand of the person. The action consists of moving the arm to
the object, grabbing the object and finally moving the arm back to the body.
Table 1: Result of step 1: A dictionary of actions and entities.
Step two: Definition of action relations In the second step, the action relations have
to be defined. For each action, the number and roles of the involved objects are defined. In
the case of take, for example, an object and a location, where the object is taken from, are
defined. In addition, for each object, possible roles have to be identified. A pot, for
example, can be taken, filled, washed, and stirred. The result of this step is the finite
set of labels $L = \{l_1 = \tilde{a}_1^1, l_2 = \tilde{a}_1^2, \dots, l_k = \tilde{a}_n^m\}$, where $\tilde{a}$ defines the syntax of the
action relation $a$ to be used for the annotation process (see Table 2).
Table 2: Result of step 2: The table lists the type signature and each possible instantiation
for the set of actions identified in the previous step.
$a_1$              take (what: takeable, from: location)
$\tilde{a}_1^1$    take (knife, drawer)
$\tilde{a}_1^2$    take (knife, board)
$a_2$              put (what: takeable, to: location)
$\tilde{a}_2^1$    put (knife, drawer)
$\tilde{a}_2^2$    put (knife, board)
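The instantiation in step two can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `ENTITIES` and `SIGNATURES` and the function `ground_labels` are hypothetical names, the entity lists contain just the entries shown in Table 2, and the hyphenated label format follows the example "take-bowl-cupboard" used later in the evaluation.

```python
from itertools import product

# Hypothetical typed entity dictionary (only the entities shown in Table 2).
ENTITIES = {
    "takeable": ["knife"],
    "location": ["drawer", "board"],
}

# Type signatures as in Table 2: action name -> ordered (role, type) pairs.
SIGNATURES = {
    "take": [("what", "takeable"), ("from", "location")],
    "put": [("what", "takeable"), ("to", "location")],
}


def ground_labels(signatures, entities):
    """Enumerate every instantiation of every action signature."""
    labels = []
    for action, params in signatures.items():
        # One list of candidate entities per parameter, in signature order.
        domains = [entities[type_name] for _role, type_name in params]
        for args in product(*domains):
            labels.append(action + "-" + "-".join(args))
    return labels


LABELS = ground_labels(SIGNATURES, ENTITIES)
print(LABELS)
```

In practice not every type-correct combination is meaningful, so a domain expert would still prune or extend the generated set.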
Step three: Definition of state properties As described above, we use a model-
based approach (according to [12]) to semantic annotation. We, therefore, have to
define the state space by means of state properties. In the third step, a set of state
properties is defined as a function of a tuple of entities to an entity of the domain. The
state space is then defined by each combination of possible mappings of entity tuples.
Finally, the subset of mappings that holds in the initial state (start of the experiment)
has to be marked (see Table 3).
Table 3: Result of step 3: A list of functions with type signatures and their instantiations.
A * in the last column means that the defined function holds in the initial state.
$f_1$      is-at (what: takeable) → location
$f_1^1$    is-at (knife) ↦ drawer         *
$f_1^2$    is-at (knife) ↦ board
$f_2$      objects taken () → number
$f_2^1$    objects taken () ↦ 0           *
$f_2^2$    objects taken () ↦ 1
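One possible way to hold the state properties of Table 3 in code is a mapping from a grounded function to its current value. The sketch below is an assumption about representation, not part of the described workflow; the names `DOMAINS`, `INITIAL_STATE`, `state_space_size`, and `is_valid_state` are hypothetical, and only the two functions from Table 3 are included.

```python
# Each grounded function is keyed by (function name, argument tuple).
# DOMAINS lists the values each function may take (the rows of Table 3).
DOMAINS = {
    ("is-at", ("knife",)): ["drawer", "board"],
    ("objects-taken", ()): [0, 1],
}

# The rows marked with * in Table 3 form the initial state.
INITIAL_STATE = {
    ("is-at", ("knife",)): "drawer",
    ("objects-taken", ()): 0,
}


def state_space_size(domains):
    """Number of states: one value choice per grounded function."""
    size = 1
    for values in domains.values():
        size *= len(values)
    return size


def is_valid_state(state, domains):
    """A state assigns exactly one allowed value to every grounded function."""
    return state.keys() == domains.keys() and all(
        state[key] in domains[key] for key in state
    )


print(state_space_size(DOMAINS), is_valid_state(INITIAL_STATE, DOMAINS))
```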
Step four: Definition of preconditions and effects The objective of the fourth step is to
define the semantics of the actions. Using the type signatures defined in the previous
steps, action schemas are defined in terms of preconditions and effects. As explained
above, we regard the participants’ action sequences as plans. Here we describe them
using the Planning Domain Definition Language (PDDL), known from the domain of
automated planning and scheduling. The preconditions and effects for the individual action
schemas are formulated by domain experts. A take action, for example, requires an object
to be taken, that the maximal number of held objects is not exceeded, and, in case the location is
a container that can be opened and closed, that the container is open. Effects of the take action
are that the location of the object is changed from the original location to the hand and,
if the object to be taken is dirty, the hands become dirty too (see Figure 1).
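(:action take
  ; ?what is the object being taken, ?from is its current location
  :parameters (?what - takeable ?from - loc)
  :precondition (and
    (= (is-at ?what) ?from)      ; the object is at the given location
    (not (= ?from hands)))       ; the object is not already in the hands
  :effect (and
    (assign (is-at ?what) hands) ; the object is now in the hands
    ; a dirty object makes the hands dirty
    (when (not (is-clean ?what)) (not (is-clean hands)))))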
Figure 1: Extract of the action schema for the take action, encoding preconditions and
effects in PDDL.
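To make the semantics of Figure 1 concrete, the same precondition and effect can be mirrored in Python. This is a sketch for illustration only: the state layout (two dictionaries `is-at` and `is-clean`) and the function names `can_take` and `take` are assumptions, and only the parts of the schema shown in the extract are modelled.

```python
def can_take(state, what, source):
    """Precondition of Figure 1: the object is at `source`, and `source` is not the hands."""
    return state["is-at"].get(what) == source and source != "hands"


def take(state, what, source):
    """Return the successor state of take(what, source); the input state is not modified."""
    if not can_take(state, what, source):
        raise ValueError(f"take({what}, {source}) is not applicable")
    successor = {
        "is-at": dict(state["is-at"]),
        "is-clean": dict(state["is-clean"]),
    }
    # Effect: the object is now in the hands.
    successor["is-at"][what] = "hands"
    # Conditional effect: a dirty object makes the hands dirty.
    if not state["is-clean"][what]:
        successor["is-clean"]["hands"] = False
    return successor


before = {
    "is-at": {"knife": "drawer"},
    "is-clean": {"knife": False, "hands": True},
}
after = take(before, "knife", "drawer")
print(after)
```

Returning a new state instead of mutating the old one keeps every intermediate state available, which is what allows the state of the world to be tracked per label.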
Step five: Manual annotation Once the dictionary of labels is defined, the manual
annotation can be performed. We use the ELAN annotation tool [23] for this step. Here
an annotator has to assign labels from the defined label set to the video sequence. The
ELAN annotation tool allows to synchronise several video files and to show them in
Step six: Plan validation Since the label sequence produced in the previous step
consists of plan operators, the complete sequence can be interpreted as a plan, leading
from an initial to a goal state. The objective of the sixth step is to check the causal
validity of the label sequence with respect to the planning domain created in step four.
A plan validator (such as VAL [16]) can be used for this task. If the label
sequence does not fulfill the causal constraints of the planning domain, two possible
reasons exist: Either the planning domain does not correctly reproduce the constraints
of the experimental setting or the label sequence is incorrect. In case of an incorrect
label sequence, step five (manual annotation) has to be repeated to correct the detected
problems. In case of an incorrect domain, either the preconditions defined in step four
have to be relaxed or the effects have to be revised.
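The idea behind plan validation can be illustrated with a toy validator. The sketch below is not VAL and does not read PDDL; it hard-codes four hypothetical actions (take, put, open, close) with simplified preconditions chosen for this example, and replays a label sequence until a precondition fails. The names `apply_label` and `validate` and the state layout are assumptions.

```python
import copy


def apply_label(state, label):
    """Return the successor state, or None if the label is not applicable."""
    action, *args = label
    successor = copy.deepcopy(state)
    if action in ("open", "close") and len(args) == 1:
        container = args[0]
        target = action == "open"
        # Only known containers can change, and only to a different state.
        if state["open"].get(container) in (None, target):
            return None
        successor["open"][container] = target
        return successor
    if action in ("take", "put") and len(args) == 2:
        obj, loc = args
        # Locations that are not containers are always accessible.
        accessible = state["open"].get(loc, True)
        source = loc if action == "take" else "hands"
        destination = "hands" if action == "take" else loc
        if loc == "hands" or not accessible or state["is-at"].get(obj) != source:
            return None
        successor["is-at"][obj] = destination
        return successor
    return None


def validate(initial_state, plan):
    """Return (True, None) for a valid plan, else (False, index of the failing label)."""
    state = initial_state
    for index, label in enumerate(plan):
        state = apply_label(state, label)
        if state is None:
            return False, index
    return True, None


INITIAL = {
    "is-at": {"fork": "hands", "knife": "drawer"},
    "open": {"drawer": True},
}
IMPOSSIBLE = [("put", "fork", "drawer"), ("close", "drawer"), ("take", "knife", "drawer")]
FEASIBLE = [("put", "fork", "drawer"), ("take", "knife", "drawer"), ("close", "drawer")]
print(validate(INITIAL, IMPOSSIBLE), validate(INITIAL, FEASIBLE))
```

The first sequence is the one from Section 2 that a purely symbolic annotation cannot reject: the knife is taken from a drawer that has just been closed. The reported index tells the annotator which label to revisit.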
The proposed process has three results: 1) the label sequence, 2) the semantic
structure of the labels, and 3) a planning domain, describing the causal relations of the
4 Improving the quality of annotation
It is often the case that two annotators provided with the same codebook produce
annotations with a low overlap [4]. This can be explained by the high behaviour
variability and by differing interpretations of human behaviour¹. To reduce the
effect of such discrepancies between annotators, the literature suggests training the
annotators which leads to an increase in the interrater reliability [4]. We adopt this
approach and conduct a training phase with the annotators. It involves the following
steps: 1. the domain expert meets with the annotators and discusses the elements of
the dictionary and their presence in an example video log; 2. the annotators separately
annotate the same video log; 3. the annotators compare the two annotations, discuss
the differences and decide on a new consolidated annotation of the video log; 4. the
annotators repeat steps 2 and 3 for the next video log. In a study conducted in [4], only
about 13% of the studies reported the size of the training involved. It was, however,
concluded that high intensity training produces significantly better results than low
intensity training or no training. For that reason we continued the training until the
annotators felt comfortable annotating without external help (28% of the data in the
concrete annotation scenario).
We applied the proposed annotation process together with the training phase on the
CMU Multi-Modal Activity Database [11]. In the next section we outline the CMU-
MMAC and the annotation we created by applying our approach to this dataset.
¹ For example, in the action “take object”, one can interpret the beginning of the action as the point at
which the protagonist starts reaching for the object, or the point at which the hand is already holding the
object. This deviation in the interpretation reduces the overlap of labels produced by different annotators.
The Carnegie Mellon University Multi-Modal Activity Database (CMU-MMAC) provides
a dataset of kitchen activities [11]. Several subjects were recorded by multiple
sensors (including cameras, accelerometers, and RFIDs) while performing food preparation
tasks. A literature review revealed that only a few researchers have ever used this
dataset. In [8] the activities of twelve subjects were directly reconstructed from the
video by use of computer vision. In [21] the cameras and the IMU data were used
for temporal classification of seven subjects. We believe there are two reasons why this
publicly available dataset has not been used more widely in the literature. The first is that activity
recognition in the kitchen domain is a very challenging task and the second is that the
provided annotation is neither complete nor provides enough information to efficiently
train classifiers. In the following section, we briefly describe our annotation for the
5.1 Overview of the CMU-MMAC
The CMU-MMAC consists of five sub datasets (Brownie, Sandwich, Eggs, Salad,
Pizza). Each of them contains recorded sensor data from one food preparation task.
The dataset contains data from 55 subjects, where each of them participated in several
sub experiments. While executing the assigned task, the subjects were recorded with
five cameras and multiple sensors. While the cameras can be used for computer vision
based activity recognition [8], the resulting video log is also the base for the dataset
annotation. An annotated label sequence for 16 subjects can be downloaded from the
CMU-MMAC website². Although it follows a grammatical structure of verbs and objects,
the label sequence still lacks semantics which, if present, would allow
context information such as object locations and relations between actions
and entities to be derived. In the following section, we discuss the annotation of three of the five
datasets (Brownie, Sandwich, and Eggs)³. Later, we provide a detailed evaluation of
the produced annotation.
6 Evaluation
6.1 Experimental Setup
In order to evaluate the proposed annotation process and the quality of the resulting an-
notation, we conducted the following experiments: 1. Two domain experts reviewed a
subset from the video logs for the Brownie, Eggs, and Sandwich datasets and identified
the action classes, entities, action relations, state properties, and precondition-effect
rules. 2. Two annotators (Annotator A and Annotator B) independently annotated the
three datasets (Brownie, Eggs, and Sandwich). 3. The same two annotators discussed
the differences in the annotation after each annotated video log for the first $n$ videos of
each dataset and prepared a consolidated annotation for 28% of the sequences in the
³ The annotation can be downloaded from
⁴ $n$ is 12 for the Brownie, 7 for the Eggs, and 6 for the Sandwich dataset.
Based on these annotated sequences, we examined the following hypotheses: (H1) Fol-
lowing the proposed annotation process provides a high quality annotation. (H2) Train-
ing the annotators improves the quality of the annotation.
To test H1, we calculated the interrater reliability between Annotator A and Anno-
tator B for all video logs in the three datasets (90 video logs). To test H2, we investi-
gated whether the interrater reliability increases with the training of the annotators. The
interrater reliability was calculated for the ground labels, not for the classes (in other
words, we calculate the overlapping between the whole label “take-bowl-cupboard”
and not only for the action class “take”). The interrater reliability was calculated in
terms of agreement ($IR_a$), Cohen’s κ ($IR_\kappa$), and Krippendorff’s α ($IR_\alpha$). We chose
the above measures as they are the most frequently used measures for interrater reliability
as reported in [4].
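Two of the measures named above can be computed directly from two label sequences. The sketch below assumes that both annotations have been sampled at the same time steps, so that position i in both lists refers to the same instant; this sampling scheme, the function names, and the example labels (other than "take-bowl-cupboard") are assumptions made for illustration.

```python
from collections import Counter


def agreement(labels_a, labels_b):
    """Fraction of time steps on which both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of each annotator's label frequencies.
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical per-time-step annotations that differ only in one action boundary.
annotator_a = ["take-bowl-cupboard"] * 4 + ["put-bowl-counter"] * 4 + ["walk"] * 2
annotator_b = ["take-bowl-cupboard"] * 3 + ["put-bowl-counter"] * 5 + ["walk"] * 2
print(agreement(annotator_a, annotator_b), cohens_kappa(annotator_a, annotator_b))
```

Because the comparison uses whole ground labels, two annotators who agree on the action class but disagree on an object or location count as disagreeing.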
6.2 Results
6.2.1 Semantic annotation for the CMU-MMAC
To define the label set, two domain experts reviewed a subset from the video logs and
identified 13 action classes (11 for the Brownie, 12 for the Eggs, and 12 for the Sandwich).
Table 4 shows the action classes for the three datasets.
Table 4: Action classes for the three datasets.
Dataset    Action classes
Brownie    open, close, take, put, walk, turn on, fill, clean, stir, shake, other
Eggs       open, close, take, put, walk, turn on, fill, clean, stir, shake, other, turn off
Sandwich   open, close, take, put, walk, turn on, fill, clean, stir, shake, other, cut
The action definitions created in this step later enable different annotators to choose the same
label for identical actions. In this step the domain experts also identified the entities (30 for the
Sandwich dataset, 44 for the Brownies, and 43 for the Eggs). From these dictionaries,
in step two, a discussion about the type signature and possible instantiations took place
(119 unique labels were identified for the Sandwich dataset, 187 for the Brownies, and
179 for the Eggs; see Table 2 for examples). Step three, the definition of state properties,
revealed 13 state properties (see Table 3). The next three steps were executed by two
annotators until all datasets were annotated without gaps and all annotation sequences
were shown to be valid plans.
The resulting annotation consists of 90 action sequences. Interestingly, while anno-
tating, we noticed that the experimenter changed the settings during the experiments’
recording. In all sub-experiments it can be seen that, before recording subject 28, some
objects were relocated to different cupboards. Our annotation is publicly available to
enable other researchers to address the activity recognition problem with the CMU-
MMAC dataset. The complete annotation can be downloaded from [24].
6.2.2 Results for H1
To address H1 we computed the interrater reliability between Annotator A and Annotator B.
The results can be seen in Figure 2. The annotators reached
a median agreement of 0.84 for the Brownie, 0.81 for the Eggs, and 0.84 for the Sandwich.
Similarly, the Cohen’s κ and the Krippendorff’s α had a median of 0.83 for the
Brownie, 0.80 for the Eggs, and 0.83 for the Sandwich. A Cohen’s κ between 0.41 and
0.60 means moderate agreement, between 0.61 and 0.80 means substantial agreement,
and of 0.81 or above indicates almost perfect agreement [17]. Similarly, data that
has a Krippendorff’s α above 0.80 is considered reliable enough to draw conclusions. In
other words, the average interrater reliability between the two annotators is between
substantial and almost perfect. This also indicates that the proposed annotation process
not only provides semantic annotation, it also ensures that the annotators produce high
quality annotation. Consequently, hypothesis H1 was accepted. Figure 3 shows the
classes annotated by Annotators A and B and the places where they differ. It can be
seen that the difference is mainly caused by slight shifts in the start and end time of the
actions. This indicates that the problematic places during annotation of fine-grained
actions are in determining the start and end of the action.
Figure 2: The median interrater reliability for the three datasets (in terms of Cohen’s κ)
and the deviation from this median.
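The interpretation bands cited from [17] can be written as a small lookup. The three upper bands are the ones quoted in the text; the bands below 0.41 ("poor", "slight", "fair") come from the same scale by Landis and Koch but are not listed above, and assigning values that fall strictly between 0.80 and 0.81 to the upper band is an assumption of this sketch. The function name `interpret_kappa` is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a Cohen's kappa value to the agreement bands of Landis and Koch."""
    if kappa < 0.0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper_bound, name in bands:
        if kappa <= upper_bound:
            return name
    return "almost perfect"


# Median Cohen's kappa values reported above for the three datasets.
medians = {"Brownie": 0.83, "Eggs": 0.80, "Sandwich": 0.83}
for dataset, kappa in medians.items():
    print(dataset, interpret_kappa(kappa))
```

Applied to the reported medians, the lookup places the Eggs dataset at the top of the substantial band and the other two in the almost perfect band, which matches the statement that the reliability lies between substantial and almost perfect.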
6.2.3 Results for H2
To test H2, we investigated whether the training had an impact on the interrater reliability.
We calculated the difference in the interrater reliability between each newly annotated
video log and the previous one during the training phase. The “Brownie” dataset has a
mean positive difference of about 2% while “Sandwich” has a mean difference of about
10%. This means that on average there was an improvement of 2% (respectively 10%)
in the interrater reliability during the training phase. On the other hand, the “Eggs”
dataset shows a negative difference of 1%, which indicates that on average no improvement
in interrater reliability was observed during the training phase. Negative
differences between some datasets were also observed. These indicate a decrease in the
interrater reliability after training was performed (with a maximum of about 2%). Such a
decrease can be explained by encountering new situations in the dataset or by
a different interpretation of a given action. However, a decrease of 2% does not significantly
reduce the quality of the annotation. Figure 4 illustrates the interrater agreement
for the datasets selected for the training phase. The orange line shows a linear model
that was fitted to predict the interrater reliability from the dataset number. It can be seen
that the effect of the training phase was not negative for all datasets. For two datasets
(Brownie and Sandwich), an increasing trend can be seen. To better understand the
change in the interrater reliability, we look into the agreement ($IR_a$) between the annotators
for the first 6 annotations of the “Sandwich” dataset (Figure 4). The interrater
reliability between the first and the second annotated video increases by 23%. The
same applies to the interrater reliability between the second and the third annotated
video. At that point the interrater reliability has reached about 81% overlap (Cohen’s
κ of 0.8), which indicates almost perfect overlap. After that, there is a mean
difference of about 1%. On average the overlap between the two annotators stays
around 80% (or a mean Cohen’s κ of 0.78) even after the training phase. This indicates
that the learning phase improves the agreement between annotators, and thus the quality of
the produced annotation (hence we accept H2). The results, however, show that one
needs a relatively small training phase to produce results with almost perfect overlap
between annotators⁵. This contradicts the assumption that we need high intensity
training to produce high quality annotation (as suggested in [4]). It also shows that
using our approach for semantic annotation ensures a high quality annotation without
the need for intensive training of the annotators.
Figure 3: Comparison between the annotation of a video log of Annotator A (bottom)
and Annotator B (top) from the “Brownie” dataset for subject 9. The different colours
indicate different action classes. The plot in the middle illustrates the differences between
both annotators (in black).
Figure 4: Learning curve for the first n videos. The points illustrate the interrater
reliability for one dataset. The points are connected to improve readability. The
orange line illustrates the increase of reliability due to learning.
⁵ For the “Sandwich” dataset, the annotators needed to produce consolidated annotation for the first two
videos before they reached overlapping of about 80%, for the “Brownie” and the “Eggs” they needed only
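The two quantities used in the H2 analysis, the difference between consecutive video logs and the slope of a fitted line, can be sketched as follows. The input values are illustrative placeholders and are NOT the measurements behind Figure 4; the function names are hypothetical, and an ordinary least-squares fit is assumed for the linear model.

```python
def consecutive_differences(values):
    """Change in reliability from each video log to the next one."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def fit_line(values):
    """Least-squares line through (1, values[0]), (2, values[1]), ...

    Returns (slope, intercept); needs at least two values.
    """
    n = len(values)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Illustrative reliabilities for four consecutive training videos (not real data).
example = [0.40, 0.60, 0.80, 0.80]
differences = consecutive_differences(example)
mean_difference = sum(differences) / len(differences)
slope, intercept = fit_line(example)
print(differences, mean_difference, slope, intercept)
```

A positive slope corresponds to the increasing trend described for the Brownie and Sandwich datasets, while a mean difference near zero corresponds to the plateau reached after the first few videos.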
7 Conclusion
In this work, we presented a novel approach to manual semantic annotation. The ap-
proach allows the usage of a rich label set that includes semantic meaning and relations
between actions, entities and context information. Additionally, we provide a state
space that evolves during the execution of the annotated plan sequences. In contrast to
typical annotation processes, our annotation approach allows further reasoning about
the state of the world by interpreting the annotated label sequence as grounded plan
operators. It is, for example, easy to infer the location of objects involved, without any
explicit statement about the objects’ location.
To validate our approach, we annotated the “Brownie”, “Eggs”, and “Sandwich”
trials from the CMU-MMAC dataset. In the original annotation only 16 out of 90
sequences are annotated. We now provide a uniform annotation for all 90 sequences
including a semantic meaning of the labels. To enable other researchers to participate
in the CMU grand challenge, we make the complete annotation publicly available.
Furthermore, we evaluated the quality of the produced annotation by comparing the
annotation of two annotators. The results showed that the annotators were able to
produce labelled sequences with almost perfect overlap (Cohen’s κ of about 0.8).
This shows that the approach provides high quality semantic annotation, which
the ubiquitous computing community can use to further the research in activity, plan,
and context recognition.
8 Acknowledgments
We would like to thank the students who annotated the dataset. This work is par-
tially funded by the German Research Foundation (YO 226/1-1). The video data was
obtained from and the data collection was funded in part by the
National Science Foundation (EEEC-0540865).
References
[1] D. W. Albrecht, I. Zukerman, and A. E. Nicholson. Bayesian models for key-
hole plan recognition in an adventure game. User Modeling and User-Adapted
Interaction, 8(1-2):5–47, 1998.
[2] P. Andrews, I. Zaihrayeu, and J. Pane. A classification of semantic annotation
systems. Semant. web, 3(3):223–248, August 2012.
[3] M. Bauer. Acquisition of user preferences for plan recognition. In Proc. of Int.
Conf. on User Modeling, pages 105–112, 1996.
[4] P. S. Bayerl and K. I. Paul. What determines inter-coder agreement in manual
annotations? a meta-analytic investigation. Comput. Linguist., 37(4):699–725,
December 2011.
[5] N. Blaylock and J. Allen. Statistical goal parameter recognition. In Int. Conf. on
Automated Planning and Scheduling, pages 297–304, June 2004.
[6] N. Blaylock and J. Allen. Hierarchical goal recognition. In Plan, activity, and
intent recognition, pages 3–32. Elsevier, Amsterdam, 2014.
[7] blinded. The entry is removed due to double blinded reviewing., 2017.
[8] E. Z. Borzeshi, O. P. Concha, R. Y. Da Xu, and M. Piccardi. Joint action segmentation
and classification by an extended hidden Markov model. IEEE Signal
Process. Lett., 20(12):1207–1210, 2013.
[9] H.-S. Chung, J.-M. Kim, Y.-C. Byun, and S.-Y. Byun. Retrieving and explor-
ing ontology-based human motion sequences. In Computational Science and Its
Applications, volume 3482, pages 788–797. Springer Berlin Heidelberg, 2005.
[10] F de la Torre, J. Hodgins, J. Montano, S. Valcarcel, R. Forcada, and J. Macey.
Guide to the carnegie mellon university multimodal activity database. Technical
Report CMU-RI-TR-08-22, Robotics Institute, Carnegie Mellon University, July
[11] F. de la Torre, J. K. Hodgins, J. Montano, and S. Valcarcel. Detailed human data
acquisition of kitchen activities: the CMU-Multimodal Activity Database. In
Workshop on Developing Shared Home Behavior Datasets to Advance HCI and
Ubiquitous Computing Research, 2009.
[12] R. Denney. A comparison of the model-based & algebraic styles of specifica-
tion as a basis for test specification. SIGSOFT Softw. Eng. Notes, 21(5):60–64,
September 1996.
[13] M. Donnelly, T. Magherini, C. Nugent, F. Cruciani, and C. Paggetti. Annotating
sensor data to identify activities of daily living. In Toward Useful Services for
Elderly and People with Disabilities, volume 6719, pages 41–48. Springer Berlin
Heidelberg, 2011.
[14] J. Hamm, B. Stone, M. Belkin, and S. Dennis. Automatic annotation of daily ac-
tivity from smartphone-based multisensory streams. In Mobile Computing, Appli-
cations, and Services, volume 110, pages 328–342. Springer Berlin Heidelberg,
[15] L. M. Hiatt, A. M. Harrison, and J. G. Trafton. Accommodating human variability
in human-robot teams through theory of mind. In Proc. Int. J. Conf. Artificial
Intelligence, pages 2066–2071, Barcelona, Spain, 2011.
[16] R. Howey, D. Long, and M. Fox. Val: automatic plan validation, continuous
effects and mixed initiative planning using pddl. In IEEE Int. Conf. on Tools with
Artificial Intelligence, pages 294–301, Nov 2004.
[17] G. G. Koch J. R. Landis. The measurement of observer agreement for categorical
data. Biometrics, 33(1):159–174, 1977.
[18] A. Kiryakov, B. Popov, D. Ognyanoff, D. Manov, A. Kirilov, and M. Goranov. Se-
mantic annotation, indexing, and retrieval. In The Semantic Web - ISWC, volume
2870, pages 484–499. Springer, 2003.
[19] M. Ramirez and H. Geffner. Goal recognition over pomdps: Inferring the in-
tention of a pomdp agent. In Proc. of Int. Joint Conf. on Artificial Intelligence,
volume 3, pages 2009–2014, Barcelona, Spain, 2011.
[20] S. Saad, D. De Beul, S. Mahmoudi, and P. Manneback. An ontology for video
human movement representation based on benesh notation. In Int. Conf. on Mul-
timedia Computing and Systems, pages 77–82, 2012.
[21] E. H. Spriggs, F. de la Torre, and M. Hebert. Temporal segmentation and activ-
ity classification from first-person sensing. In IEEE Computer Society Conf. On
Computer Vision and Pattern Recognition Workshops, pages 17–24. IEEE, 2009.
[22] T. L. M. van Kasteren and B. J. A. Kr¨
ose. A sensing and annotation system for
recording datasets in multiple homes. In Proc. of Ann. Conf. on Human Factors
and Computing Systems, pages 4763–4766, Boston, USA, April 2009.
[23] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes. ELAN: a
professional framework for multimodality research. In Proc. Int. Conf. Language
Resources and Evaluation, pages 1556–1559, 2006.
[24] K. Yordanova, F. Kr¨
uger, and T. Kirste. Semantic annotation for the CMU-
MMAC Dataset. University Library, University of Rostock, 2018. http://purl.uni-