Providing Semantic Annotation for the CMU
Grand Challenge Dataset
Kristina Yordanova, Frank Krüger, Thomas Kirste
March 22, 2018
Abstract
Providing ground truth is essential for activity recognition for three reasons: to
apply methods of supervised learning, to provide context information for knowledge-
based methods, and to quantify the recognition performance. Semantic annotation
extends simple symbolic labelling by assigning semantic meaning to the label, en-
abling further reasoning. In this paper we present a novel approach to semantic
annotation by means of plan operators. We provide a step-by-step description of
the workflow for manually creating the ground truth annotation. To validate our ap-
proach we create semantic annotation of the CMU grand challenge dataset, which
is often cited but, due to missing and incomplete annotation, almost never used.
We evaluate the quality of the annotation by calculating the interrater reliability
between two annotators who labelled the dataset. The results show almost perfect
overlapping (Cohen's κ of 0.8) between the annotators. The produced annotation
is publicly available to enable further usage of the CMU grand challenge dataset.
1 Introduction
The annotation of sensor datasets describing human behaviour is an important part of
the activity and plan recognition process. It provides a target label for each observa-
tion in the cases where supervised learning is applied. It also serves as a ground truth
for evaluating the performance of the activity or plan estimation procedure by compar-
ing the values estimated by the model with the annotated values. Finally, it provides
the context information needed for developing knowledge-based activity recognition
systems. In this paper we present a model-based approach to semantic annotation of
human behaviour based on the annotation process proposed in [7]. There, the labels
assigned to the data provide an underlying semantic structure that contains information
about the actions, goals, and plans being executed. This semantic structure is repre-
sented in the form of a model of the behaviour's state in terms of a collection of state
variables. Actions are then defined in terms of effects that change the state of the model. This
form of annotation provides structured knowledge of the concepts in the data being
annotated and enables the reasoning over underlying behaviour changes, their causal
relations, and contextual dependencies. Such annotation is important for evaluating
plan recognition approaches that aim not only to recognise the goal of the plan, but
also the subgoals and actions being executed. Furthermore, the model-based semantic
annotation is important for evaluating the performance of any approach that aims at
recognising the underlying actions’ context. Finally, the annotation will be beneficial
for approaches that strive to learn models of human behaviour.
The contribution of this paper is threefold: first, we introduce a novel approach to
semantic annotation by means of precondition and effect rules; second, we describe
a step-by-step workflow to create such annotation; and finally, we provide a semantic
annotation for three types of recipes from the CMU grand challenge dataset.
The paper is structured as follows. In Section 2, we discuss the types of annotation
available in the literature and outline how our approach differs from them. Sec-
tion 3 describes the proposed approach. In Section 4, we discuss how to improve
the quality of the annotation by training the annotators, while Section 5 illustrates
the approach by re-annotating the Carnegie Mellon University Multi-Modal Activity
Database (CMU-MMAC). The new annotation will also be made publicly available at
the authors’ website. In Section 6, we evaluate our approach by calculating the inter-
rater reliability between different annotators. Finally, the paper concludes with a short
discussion of the approach.
2 Annotation of Human Behaviour
In the context of human behaviour recognition we distinguish between three different
types of annotation. The first is the annotation of activities where a textual description
(or label) is assigned to the executed action [22, 11, 14, 13]. More formally, the ob-
jective is to manually assign a label l_i to each time step of a time series. This is often
done by analysing a separately recorded video log of the executed activities. These
labels are usually called ground truth, as they provide a symbolic representation of the
true sequence of activities. However, for the finite set L = {l_1, ..., l_n} of labels there
is usually no further information besides the equality relation. Annotations such as
take-baking pan provide a textual description of the executed task but do
not contain an underlying semantic structure. There is usually no formal set of con-
straints that restrict the structure of the label sequences. Typically, nothing prevents
an annotator from producing sequences like “put fork to drawer”, “close drawer”,
“take knife from drawer”, in which an object is taken from a drawer that has just been
closed. This is also the most common type of annotation of human
behaviour, partially because even the assignment of non-semantic labels to the data is
a difficult, time consuming, and error prone task [22].
The second type of annotation is the plan annotation. It can be divided into goal
labelling and plan labelling [6]. The goal labelling is the annotation of each plan with a
label of the goal that is achieved [1, 5]. In contrast, plan labelling provides annotation
not only of the goal, but also of the actions constituting the plan, and of any subgoals
occurring in the plan [3]. The latter is, however, a time consuming and error prone
process [6], which explains why the only attempts at such plan annotation have been
made for tasks executed on a computer (e.g. executing plans in an email program [3]).
This is also reflected in activity and plan recognition approaches such as [19, 15] that
use only synthesised observations, and thus synthesised annotation, to recognise the
human actions and goals.
The third type of annotation is the semantic annotation [18]. The term comes from
the field of the semantic web, where it is described as the process, and the resulting
annotation or metadata, of aligning a resource or a part of it with a descrip-
tion of some of its properties and characteristics with respect to a formal conceptual
model or ontology [2]. The concept was later adopted in the field of human behaviour
annotation, where it describes the annotation of human behaviour with labels that have
an underlying semantic structure represented in the form of concepts, properties, and
relations between these concepts [20, 9]. We call this type of semantic structure an
algebraic representation in accordance with the definition provided in [12]. There, an
algebraic representation is one where the state of the system is modelled in terms of
combinations of operations required to achieve that state.
In contrast to the algebraic representation, a model-based representation
provides a model of the system’s state in terms of a collection of state vari-
ables. Then, the individual operations are defined in terms of their effects on the state
of the model [12]. To our knowledge, there have been no attempts to represent the
semantic structure of human behaviour annotation in the form of model-based repre-
sentation. In the next sections we present an approach to semantic annotation of human
behaviour where the underlying semantic structure uses a model-based representation.
This representation allows us to provide not only a semantic meaning to the labels, but
also to produce plan labels and to reason about the plan’s causal correctness. Further-
more, it gives the state of the world corresponding to each label and allows us to track
how it changes during the plan execution.
3 A novel approach to Annotating Human Behaviour
In this section, we present a model-based semantic annotation approach that strives
to overcome the drawbacks of the approaches outlined in the previous section. Our
approach combines the characteristics of state-of-the-art approaches and in addition
relies on a model-based instead of an algebraic knowledge representation. The
activity datasets that will potentially benefit from this approach are those
describing goal-oriented behaviour. Typical activity recognition experiments such as
the CMU-MMAC [10] can be regarded as goal oriented. In them, the participants
are instructed to fulfil a task such as food preparation. To ensure comparability of
different repetitions, an identical experimental setup is chosen for each trial. As a result,
the action sequence executed by the participants can be regarded as a plan, leading from
the same initial state (as chosen by the experimenter) to a set of goal states (given in
the experiment instruction). In the domain of automated planning and scheduling, plan
sequences are generated from domain models, where actions are defined by means of
preconditions and effects. A plan is then a sequence of actions generated by grounding
the action schemas of the domain leading from an initial state to the goal state. In
contrast, in our semantic annotation approach, we manually create plans that reflect
the participants’ actions, and define a planning domain, which describes the causal
connections of the actions to the state of the world. Below we describe the proposed
annotation process, including the definition of the label set L, the label semantics, the
manual annotation procedure, and the validation procedure. We illustrate the process
with examples from the kitchen domain.
Step one: Action and entity dictionary definition In the first step a dictionary of
actions and entities is created. The actions have a name representing the action class,
and a description of the action class that distinguishes it from the remaining classes.
The dictionary also contains the set of all entities observed during the experiment. The
dictionary is manually created by domain experts by analysing the video log, which
is typically recorded during the experiment. The results of the dictionary definition
are the set of action classes and the set of entities manipulated during action execution
(see Table 1).

Table 1: Result of step 1: A dictionary of actions and entities.

    actions              entities
    a_1  take            e_1  knife
    a_2  put             e_2  drawer
    a_3  walk            e_3  counter
    ...                  ...
    a_n  stir            e_m  pepper

To allow annotators to distinguish between different actions, each action
name is accompanied by its definition. If we look at action a_1 (take), its definition is to
grab an object. During the execution of take, the location of the object changes from
the initial location to the hand of the person. The action consists of moving the arm to
the object, grabbing the object and finally moving the arm back to the body.
Step two: Definition of action relations In the second step, the action relations have
to be defined. For each action, the number and role of involved objects is defined. In
case of take, for example, an object and a location, where the object is taken from, are
defined. In addition, for each object, possible roles have to be identified. A pot, for
example, can be taken, filled, washed, and stirred. The result of this step is the finite
set of labels L = {l_1 = ã_1^1, l_2 = ã_1^2, ..., l_k = ã_n^m}, where ã defines the syntax of the
action relation a to be used for the annotation process (see Table 2).
Table 2: Result of step 2: The table lists the type signature and each possible instantia-
tion for the set of actions identified in the previous step.

    a_1    take (what: takeable, from: location)
    a_1^1  take (knife, drawer)
    a_1^2  take (knife, board)
    ...
    a_2    put (what: takeable, to: location)
    a_2^1  put (knife, drawer)
    a_2^2  put (knife, board)
    ...
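To connect Table 2 with the PDDL encoding used later (see Figure 1), the type signatures can be read as a typed domain vocabulary. The following fragment is a minimal sketch of such a declaration; the domain name, requirement flags, and constants are illustrative choices of ours, not the verbatim domain file published with the annotation:

    ;; Illustrative sketch of a typed vocabulary for the signatures in Table 2.
    ;; Names follow Figure 1 and Tables 1-3; not the published domain file.
    (define (domain kitchen-annotation)
      (:requirements :typing :fluents :conditional-effects)
      (:types takeable loc - object)        ; things that can be taken; locations
      (:constants
        knife pot - takeable                ; entities from the dictionary (Table 1)
        drawer board counter hands - loc))  ; the hands are modelled as a location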
Step three: Definition of state properties As described above, we use a model-
based approach (according to [12]) to semantic annotation. We, therefore, have to
define the state space by means of state properties. In the third step, a set of state
properties is defined, where each property is a function mapping a tuple of entities to an
entity of the domain. The state space is then defined by all possible combinations of
these entity-tuple mappings.
Finally, the subset of mappings that holds in the initial state (start of the experiment)
has to be marked (see Table 3).
Table 3: Result of step 3: A list of functions with type signatures and their instan-
tiations. A * in the last column means that the given mapping holds in the initial
state.

    f_1    is-at (what: takeable) → location
    f_1^1  is-at (knife) → drawer      *
    f_1^2  is-at (knife) → board
    ...
    f_2    objects-taken () → number
    f_2^1  objects-taken () → 0        *
    f_2^2  objects-taken () → 1
    ...
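In the same illustrative PDDL sketch, the state properties of Table 3 become typed fluents, and the rows marked with * translate into the initial state of a problem file. Again, the exact names are our assumption:

    ;; State properties from Table 3 as PDDL fluents (illustrative sketch).
    (:functions
      (is-at ?what - takeable) - loc   ; f_1: the current location of a takeable
      (objects-taken) - number)        ; f_2: how many objects are currently held

    ;; The rows marked * in Table 3 form the initial state of the problem:
    ;; (:init (= (is-at knife) drawer)
    ;;        (= (objects-taken) 0))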
Step four: Definition of preconditions and effects The objective of the fourth step is to
define the semantics of the actions. Using the type signatures defined in the previous
step, action schemes are defined in terms of preconditions and effects. As explained
above, we regard the participants’ action sequences as plans. Here we describe them by
use of the Planning Domain Definition Language (PDDL), known from the domain of
automated planning and scheduling. The preconditions and effects for the single action
schemes are formulated by domain experts. A take action, for example, requires that there
is an object to be taken, that the maximal number of held objects is not exceeded, and, in
case the location is a container that can be opened and closed, that the container is open.
The effects of the take action
are that the location of the object is changed from the original location to the hand and
if the object to be taken is dirty, the hands become dirty too (see Figure 1).
(:action take
:parameters (?what - takeable ?from - loc)
:precondition (and
(= (is-at ?what) ?from)
(not (= ?from hands)))
:effect (and
(assign (is-at ?what) hands)
(when (not (is-clean ?what)) (not (is-clean hands)))))
Figure 1: Extract of the action scheme for the take action, encoding its preconditions and
effects in PDDL.
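A complementary put action can be encoded analogously. The following scheme is our sketch mirroring Figure 1, not a quotation from the published domain; in particular, the bookkeeping of the objects-taken counter assumes that the full take scheme increases it:

    ;; A possible put action mirroring the take scheme of Figure 1 (our sketch).
    (:action put
      :parameters (?what - takeable ?to - loc)
      :precondition (and
        (= (is-at ?what) hands)       ; the object must currently be held
        (not (= ?to hands)))
      :effect (and
        (assign (is-at ?what) ?to)    ; the object is released at the target
        (decrease (objects-taken) 1)))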
Step five: Manual annotation Once the dictionary of labels is defined, the manual
annotation can be performed. We use the ELAN annotation tool [23] for this step. Here
an annotator has to assign labels from the defined label set to the video sequence. The
ELAN annotation tool makes it possible to synchronise several video files and to show
them in parallel.
Step six: Plan validation Since the label sequence produced in the previous step
consists of plan operators, the complete sequence can be interpreted as a plan, lead-
ing from an initial to a goal state. Objective of the sixth step is to check the causal
validity of the label sequence with respect to the planning domain created in the pre-
vious step. A plan validator (such as VAL [16]) can be used for this task. If the label
sequence does not fulfil the causal constraints of the planning domain, two possible
reasons exist: Either the planning domain does not correctly reproduce the constraints
of the experimental setting or the label sequence is incorrect. In case of an incorrect
label sequence, step five (manual annotation) has to be repeated to correct the detected
problems. In case of an incorrect domain, either the preconditions defined in step four
have to be relaxed or the effects have to be revised.
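To make the validation step concrete, the following is a hypothetical problem file and plan fragment of the kind a validator such as VAL pairs with the planning domain; the problem name, initial state, and goal are illustrative:

    ;; Hypothetical problem file for one trial (names are illustrative).
    (define (problem trial-subject-09)
      (:domain kitchen-annotation)
      (:init (= (is-at knife) drawer)   ; the * mappings of Table 3
             (= (objects-taken) 0))
      (:goal (= (is-at knife) board)))

    ;; The annotated label sequence, read as the plan to be validated:
    ;; (take knife drawer)
    ;; (put knife board)

If the sequence violates a precondition, the validator reports the failing action, which points either to a mislabelled segment (repeat step five) or to an overly strict domain (revise step four).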
The proposed process has three results: 1) the label sequence, 2) the semantic
structure of the labels, and 3) a planning domain, describing the causal relations of the
labels.
4 Improving the quality of annotation
It is often the case that two annotators provided with the same codebook produce
annotations with a low overlap [4]. This can be explained by the high variability of
behaviour and by differing interpretations of human behaviour¹. To reduce the
effect of such discrepancies between annotators, the literature suggests training the
annotators, which leads to an increase in the interrater reliability [4]. We adopt this
approach and conduct a training phase with the annotators. It involves the following
steps: 1. the domain expert meets with the annotators and discusses the elements of
the dictionary and their presence in an example video log; 2. the annotators separately
annotate the same video log; 3. the annotators compare the two annotations, discuss
the differences and decide on a new consolidated annotation of the video log; 4. the
annotators repeat steps 2 and 3 for the next video log. In the meta-analysis reported
in [4], only about 13% of the surveyed studies stated the amount of training involved.
It was, however, concluded that high intensity training produces significantly better
results than low intensity training or no training. For that reason we performed training
until the annotators felt comfortable annotating without external help (28% of the data
in the concrete annotation scenario).

¹For example, in the action “take object”, one can interpret the beginning of the action as the point at
which the protagonist starts reaching for the object, or the point at which the hand already holds the
object. Such deviations in interpretation reduce the overlap between the labels produced by different
annotators.
We applied the proposed annotation process together with the training phase to the
CMU Multi-Modal Activity Database [11]. In the next section we outline the CMU-
MMAC and the annotation we created by applying our approach to this dataset.
5 The CMU-MMAC
The Carnegie Mellon University Multi-Modal Activity Database (CMU-MMAC) pro-
vides a dataset of kitchen activities [11]. Several subjects were recorded by multiple
sensors (including cameras, accelerometers, and RFIDs) while performing food prepa-
ration tasks. A literature review revealed that only a few researchers have ever used this
dataset. In [8] the activities of twelve subjects were directly reconstructed from the
video by use of computer vision. In [21] the cameras and the IMU data were used
for temporal classification of seven subjects. We believe that there are two reasons why
this publicly available dataset has not been used further in the literature. The first is that
activity recognition in the kitchen domain is a very challenging task, and the second is
that the provided annotation is neither complete nor informative enough to efficiently
train classifiers. In the following section, we briefly describe our annotation for the
CMU-MMAC.
5.1 Overview of the CMU-MMAC
The CMU-MMAC consists of five sub datasets (Brownie, Sandwich, Eggs, Salad,
Pizza). Each of them contains recorded sensor data from one food preparation task.
The dataset contains data from 55 subjects, where each of them participates in several
sub-experiments. While executing the assigned task, the subjects were recorded with
five cameras and multiple sensors. While the cameras can be used for computer vision
based activity recognition [8], the resulting video log is also the base for the dataset
annotation. An annotated label sequence for 16 subjects can be downloaded from the
CMU-MMAC website². Albeit following a grammatical structure of verbs and ob-
jects, the label sequence still lacks semantics which, if present, would allow the
derivation of context information such as object locations and relations between actions
and entities. In the following section, we discuss the annotation of three of the five
datasets (Brownie, Sandwich, and Eggs)³. Later, we provide a detailed evaluation of
the produced annotation.
6 Evaluation
6.1 Experimental Setup
In order to evaluate the proposed annotation process and the quality of the resulting an-
notation, we conducted the following experiments: 1. Two domain experts reviewed a
subset from the video logs for the Brownie, Eggs, and Sandwich datasets and identified
the action classes, entities, action relations, state properties, and precondition-effect
rules. 2. Two annotators (Annotator A and Annotator B) independently annotated the
three datasets (Brownie, Eggs, and Sandwich). 3. The same two annotators discussed
the differences in the annotation after each annotated video log for the first n videos of
each dataset and prepared a consolidated annotation for 28% of the sequences in the
datasets⁴.

²http://www.cs.cmu.edu/~espriggs/cmu-mmac/annotations/
³The annotation can be downloaded from http://purl.uni-rostock.de/rosdok/id00000163
⁴n is 12 for the Brownie, 7 for the Eggs, and 6 for the Sandwich dataset.
Based on these annotated sequences, we examined the following hypotheses: (H1) Fol-
lowing the proposed annotation process provides a high quality annotation. (H2) Train-
ing the annotators improves the quality of the annotation.
To test H1, we calculated the interrater reliability between Annotator A and Anno-
tator B for all video logs in the three datasets (90 video logs). To test H2, we investi-
gated whether the interrater reliability increases with the training of the annotators. The
interrater reliability was calculated for the grounded labels, not for the action classes (in
other words, we calculate the overlap for the whole label “take-bowl-cupboard”
and not only for the action class “take”). The interrater reliability was calculated in
terms of agreement (IR_a), Cohen’s κ (IR_κ), and Krippendorff’s α (IR_α). We chose
the above measures as they are the most frequently used measures for interrater relia-
bility as reported in [4].
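For reference, Cohen's κ compares the observed agreement p_o between the two annotators (the proportion of time steps that receive the same label) with the agreement p_e expected by chance from the annotators' label distributions:

    \kappa = \frac{p_o - p_e}{1 - p_e}

so that κ = 0 corresponds to chance-level agreement and κ = 1 to perfect agreement.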
6.2 Results
6.2.1 Semantic annotation for the CMU-MMAC
To define the label set, two domain experts reviewed a subset from the video logs and
identified 13 action classes (11 for the Brownie, 12 for the Eggs, and 12 for the Sand-
wich). Table 4 shows the action classes for the three datasets.

Table 4: Action classes for the three datasets.

    Dataset    Action classes
    Brownie    open, close, take, put, walk, turn on, fill, clean, stir, shake, other
    Eggs       open, close, take, put, walk, turn on, fill, clean, stir, shake, other, turn off
    Sandwich   open, close, take, put, walk, turn on, fill, clean, stir, shake, other, cut

The action definitions created in this step later enable different annotators to choose the same label for iden-
tical actions. In this step the domain experts also identified the entities (30 for the
Sandwich dataset, 44 for the Brownies, and 43 for the Eggs). From these dictionaries,
in step two, a discussion about the type signatures and possible instantiations took place
(119 unique labels were identified for the Sandwich dataset, 187 for the Brownies, and
179 for the Eggs; see Table 2 for examples). Step three, the definition of state properties,
revealed 13 state properties (see Table 3). The next three steps were executed by two
annotators until all datasets were annotated without gaps and all annotation sequences
were shown to be valid plans.
The resulting annotation consists of 90 action sequences. Interestingly, while anno-
tating, we noticed that the experimenter changed the settings during the experiments’
recording. In all sub-experiments it can be seen that, before the recording of subject 28,
some objects were relocated to different cupboards. Our annotation is publicly available to
enable other researchers to address the activity recognition problem with the CMU-
MMAC dataset. The complete annotation can be downloaded from [24].
6.2.2 Results for H1
To address H1 we computed the interrater reliability between Annotator A and Anno-
tator B. The results can be seen in Figure 2: the annotators reached a median agreement
of 0.84 for the Brownie, 0.81 for the Eggs, and 0.84 for the Sandwich dataset. Similarly,
Cohen’s κ and Krippendorff’s α had a median of 0.83 for the Brownie, 0.80 for the
Eggs, and 0.83 for the Sandwich dataset. A Cohen’s κ between 0.41 and 0.60 indicates
moderate agreement, between 0.61 and 0.80 substantial agreement, and above 0.81
almost perfect agreement [17]. Similarly, data with a Krippendorff’s α above 0.80 is
considered reliable enough to draw conclusions from.

Figure 2: The median interrater reliability for the three datasets (in terms of Cohen’s κ)
and the deviation from this median.

In
other words, the average interrater reliability between the two annotators is between
substantial and almost perfect. This also indicates that the proposed annotation process
not only provides semantic annotation, it also ensures that the annotators produce high
quality annotation. Consequently, hypothesis H1 was accepted. Figure 3 shows the
classes annotated by Annotators A and B and the places where they differ. It can be
seen that the differences are mainly caused by slight shifts in the start and end times of
the actions. This indicates that the problematic part of annotating fine-grained actions
is determining the start and end of each action.
6.2.3 Results for H2
To test H2, we investigated whether the training had an impact on the interrater reliability.
We calculated the difference in the interrater reliability between each newly annotated
video log and the previous one during the training phase. The “Brownie” dataset has a
mean positive difference of about 2%, while the “Sandwich” dataset has a mean difference of about
10%. This means that on average there was an improvement of 2% (respectively 10%)
in the interrater reliability during the training phase.

Figure 3: Comparison between the annotation of a video log of Annotator A (bottom)
and Annotator B (top) from the “Brownie” dataset for subject 9. The different colours
indicate different action classes. The plot in the middle illustrates the differences
between the two annotators (in black).

On the other hand, the “Eggs”
dataset shows a negative difference of 1%, which indicates that on average no im-
provement in interrater reliability was observed during the training phase. Negative
differences between consecutive video logs were also observed within some datasets.
These indicate a decrease in the interrater reliability after a training step (of at most
about 2%). Such a decrease can be explained by the annotators encountering new
situations in the dataset or interpreting a given action differently. However, a decrease
of 2% does not significantly reduce the quality of the annotation. Figure 4 illustrates
the interrater agreement
for the datasets selected for the training phase. The orange line shows a linear model
that was fitted to predict the interrater reliability from the dataset number. It can be seen
that the effect of the training phase was not negative for any of the datasets. For two
datasets (Brownie and Sandwich), an increasing trend can be seen. To better understand
the change in the interrater reliability, we look into the agreement (IR_a) between the
annotators for the first six annotations of the “Sandwich” dataset (Figure 4). The
interrater reliability between the first and the second annotated video increases by 23%.
The same applies to the interrater reliability between the second and the third annotated
video. At that point the interrater reliability has reached about 81% overlapping
(Cohen’s κ of 0.8), which indicates almost perfect overlapping. After that, there is a
mean difference of about 1%. On average the overlapping between the two annotators
stays around 80% (or a mean Cohen’s κ of 0.78) even after the training phase. This
indicates that the learning phase improves the agreement between annotators, and thus
the quality of
the produced annotation (hence we accept H2).

Figure 4: Learning curve for the first n videos. The points illustrate the interrater
reliability for one dataset. The points are connected to increase perceivability. The
orange line illustrates the increase of reliability due to learning.

The results, however, show that one needs a relatively small training phase to produce
results with almost perfect overlapping between annotators⁵. This contradicts the
assumption that we need high intensity
training to produce high quality annotation (as suggested in [4]). It also shows that
using our approach for semantic annotation ensures a high quality annotation without
the need for intensive training of the annotators.

⁵For the “Sandwich” dataset, the annotators needed to produce consolidated annotation for the first two
videos before they reached an overlapping of about 80%; for the “Brownie” and the “Eggs” they needed
only one.
7 Conclusion
In this work, we presented a novel approach to manual semantic annotation. The ap-
proach allows the usage of a rich label set that includes semantic meaning and relations
between actions, entities and context information. Additionally, we provide a state
space that evolves during the execution of the annotated plan sequences. In contrast to
typical annotation processes, our annotation approach allows further reasoning about
the state of the world by interpreting the annotated label sequence as grounded plan
operators. It is, for example, easy to infer the locations of the objects involved, without
any explicit statement about them in the labels.
To validate our approach, we annotated the “Brownie”, “Eggs”, and “Sandwich”
trials from the CMU-MMAC dataset. In the original annotation only 16 out of 90
sequences are annotated. We now provide a uniform annotation for all 90 sequences
including a semantic meaning of the labels. To enable other researchers to participate
in the CMU grand challenge, we make the complete annotation publicly available.
Furthermore, we evaluated the quality of the produced annotation by comparing the
annotation of two annotators. The results showed that the annotators were able to
produce labelled sequences with almost perfect overlapping (Cohen’s κ of about 0.8).
This shows that the approach provides high quality semantic annotation, which
the ubiquitous computing community can use to further the research in activity, plan,
and context recognition.
8 Acknowledgments
We would like to thank the students who annotated the dataset. This work is par-
tially funded by the German Research Foundation (YO 226/1-1). The video data was
obtained from kitchen.cs.cmu.edu and the data collection was funded in part by the
National Science Foundation (EEEC-0540865).
References
[1] D. W. Albrecht, I. Zukerman, and A. E. Nicholson. Bayesian models for key-
hole plan recognition in an adventure game. User Modeling and User-Adapted
Interaction, 8(1-2):5–47, 1998.
[2] P. Andrews, I. Zaihrayeu, and J. Pane. A classification of semantic annotation
systems. Semant. web, 3(3):223–248, August 2012.
[3] M. Bauer. Acquisition of user preferences for plan recognition. In Proc. of Int.
Conf. on User Modeling, pages 105–112, 1996.
[4] P. S. Bayerl and K. I. Paul. What determines inter-coder agreement in manual
annotations? a meta-analytic investigation. Comput. Linguist., 37(4):699–725,
December 2011.
[5] N. Blaylock and J. Allen. Statistical goal parameter recognition. In Int. Conf. on
Automated Planning and Scheduling, pages 297–304, June 2004.
[6] N. Blaylock and J. Allen. Hierarchical goal recognition. In Plan, activity, and
intent recognition, pages 3–32. Elsevier, Amsterdam, 2014.
[7] Blinded. Entry removed due to double-blind reviewing, 2017.
[8] E. Z. Borzeshi, O. P. Concha, R. Y. Da Xu, and M. Piccardi. Joint action seg-
mentation and classification by an extended hidden Markov model. IEEE Signal
Process. Lett., 20(12):1207–1210, 2013.
[9] H.-S. Chung, J.-M. Kim, Y.-C. Byun, and S.-Y. Byun. Retrieving and explor-
ing ontology-based human motion sequences. In Computational Science and Its
Applications, volume 3482, pages 788–797. Springer Berlin Heidelberg, 2005.
[10] F. de la Torre, J. Hodgins, J. Montano, S. Valcarcel, R. Forcada, and J. Macey.
Guide to the carnegie mellon university multimodal activity database. Technical
Report CMU-RI-TR-08-22, Robotics Institute, Carnegie Mellon University, July
2009.
[11] F. de la Torre, J. K. Hodgins, J. Montano, and S. Valcarcel. Detailed human data
acquisition of kitchen activities: the CMU-Multimodal Activity Database. In
Workshop on Developing Shared Home Behavior Datasets to Advance HCI and
Ubiquitous Computing Research, 2009.
[12] R. Denney. A comparison of the model-based & algebraic styles of specifica-
tion as a basis for test specification. SIGSOFT Softw. Eng. Notes, 21(5):60–64,
September 1996.
[13] M. Donnelly, T. Magherini, C. Nugent, F. Cruciani, and C. Paggetti. Annotating
sensor data to identify activities of daily living. In Toward Useful Services for
Elderly and People with Disabilities, volume 6719, pages 41–48. Springer Berlin
Heidelberg, 2011.
[14] J. Hamm, B. Stone, M. Belkin, and S. Dennis. Automatic annotation of daily ac-
tivity from smartphone-based multisensory streams. In Mobile Computing, Appli-
cations, and Services, volume 110, pages 328–342. Springer Berlin Heidelberg,
2013.
[15] L. M. Hiatt, A. M. Harrison, and J. G. Trafton. Accommodating human variability
in human-robot teams through theory of mind. In Proc. Int. J. Conf. Artificial
Intelligence, pages 2066–2071, Barcelona, Spain, 2011.
[16] R. Howey, D. Long, and M. Fox. VAL: automatic plan validation, continuous
effects and mixed initiative planning using PDDL. In IEEE Int. Conf. on Tools with
Artificial Intelligence, pages 294–301, Nov 2004.
[17] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical
data. Biometrics, 33(1):159–174, 1977.
[18] A. Kiryakov, B. Popov, D. Ognyanoff, D. Manov, A. Kirilov, and M. Goranov. Se-
mantic annotation, indexing, and retrieval. In The Semantic Web - ISWC, volume
2870, pages 484–499. Springer, 2003.
[19] M. Ramirez and H. Geffner. Goal recognition over POMDPs: Inferring the in-
tention of a POMDP agent. In Proc. of Int. Joint Conf. on Artificial Intelligence,
volume 3, pages 2009–2014, Barcelona, Spain, 2011.
[20] S. Saad, D. De Beul, S. Mahmoudi, and P. Manneback. An ontology for video
human movement representation based on benesh notation. In Int. Conf. on Mul-
timedia Computing and Systems, pages 77–82, 2012.
[21] E. H. Spriggs, F. de la Torre, and M. Hebert. Temporal segmentation and activ-
ity classification from first-person sensing. In IEEE Computer Society Conf. On
Computer Vision and Pattern Recognition Workshops, pages 17–24. IEEE, 2009.
[22] T. L. M. van Kasteren and B. J. A. Kröse. A sensing and annotation system for
recording datasets in multiple homes. In Proc. of Ann. Conf. on Human Factors
and Computing Systems, pages 4763–4766, Boston, USA, April 2009.
[23] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes. ELAN: a
professional framework for multimodality research. In Proc. Int. Conf. Language
Resources and Evaluation, pages 1556–1559, 2006.
[24] K. Yordanova, F. Krüger, and T. Kirste. Semantic annotation for the CMU-MMAC
Dataset. University Library, University of Rostock, 2018.
http://purl.uni-rostock.de/rosdok/id00000163.
Article
Temporal segmentation of human motion into actions is central to the understanding and building of computational models of human motion and activity recognition. Several issues contribute to the challenge of temporal segmentation and classification of human motion. These include the large variability in the temporal scale and periodicity of human actions, the complexity of representing articulated motion, and the exponential nature of all possible movement combinations. We provide initial results from investigating two distinct problems -classification of the overall task being performed, and the more difficult problem of classifying individual frames over time into specific actions. We explore first-person sensing through a wearable camera and inertial measurement units (IMUs) for temporally segmenting human motion into actions and performing activity classification in the context of cooking and recipe preparation in a natural environment. We present baseline results for supervised and unsupervised temporal segmentation, and recipe recognition in the CMU-multimodal activity database (CMU-MMAC).