
TextToHBM: A Generalised Approach to Learning Models of Human Behaviour for Activity Recognition from Textual Instructions


Abstract

There are various knowledge-based activity recognition approaches that rely on manual definition of rules to describe user behaviour. These rules are later used to generate computational models of human behaviour that are able to reason about the user behaviour based on sensor observations. One problem with these approaches is that the manual rule definition is a time-consuming and error-prone process. To address this problem, in this paper we outline an approach that learns the model structure from textual sources and later optimises it based on observations. The approach includes extracting the model elements and generating rules from textual instructions. It then learns the optimal model structure based on observations in the form of manually created plans and sensor data. The learned model can then be used to recognise the behaviour of users during their daily activities. We illustrate the approach with an example from the cooking domain.
Kristina Yordanova
University of Rostock
18059 Rostock
Some activity recognition (AR) approaches utilise human behaviour models (HBM) in
the form of rules. These rules are used to generate probabilistic models with which the
system can infer the user actions and goals [9, 17, 12]. Such types of models are also
known as computational state space models (CSSM) [12]. They treat activity recog-
nition as a plan recognition problem, where given an initial state, a set of possible
actions, and a set of observations, the executed actions and the user goals have to be
recognised [17]. These approaches rely on prior knowledge to obtain the context infor-
mation needed for building the user actions and the problem domain. The prior knowl-
edge is provided by a domain expert or by the model designer. This knowledge is then
used to manually build a CSSM. Manual modelling is, however, time consuming
and error prone [15].
To address this problem, different works propose learning the models from sensor
data [27]. One problem these approaches face is that sensor data is expensive [20].
Furthermore, sensors are sometimes unable to capture fine-grained activities [6], so
such activities might not be learned.
To reduce the need for domain experts and / or sensor data, one can substitute them
with textual data [16]. More precisely, one can utilise the knowledge encoded in textual
instructions to learn the model structure. Textual instructions specify tasks for achiev-
ing a given goal without explicitly stating all the required steps [4]. On the one hand,
this makes them a challenging source for learning a model [4]. On the other hand, they
are usually written in imperative form, have a simple sentence structure, and are highly
organised. Compared to rich texts, this makes them a better source for identifying the
sequence of actions needed for reaching the goal [26].
According to [2], to learn a model of human behaviour from textual instructions,
the system has to: 1. extract the actions’ semantics from the text, 2. learn the model
semantics through language grounding, and 3. translate it into a computational
model of human behaviour for planning problems. To address the problem
of learning models of human behaviour for AR, we extend the steps proposed by [2].
We add a fourth step, 4. learning the domain ontology that is used to abstract and / or
specialise the model. We also replace the target of step 3 (models for planning problems)
with computational models for activity recognition, as they are
able to reason about the human behaviour based on noisy or ambiguous observations
[9, 17].
The contribution of this paper is twofold: (1) we present an approach for learning
HBM from textual instructions. In contrast to existing approaches for language
grounding, our approach learns a complex domain ontology that is later used to gener-
alise or specialise the model; (2) it is the first attempt at learning CSSMs for activity
recognition from textual instructions. In the following we outline our approach for
learning HBM for AR and illustrate it with an example from the kitchen domain. This
work is based on the extended abstract in [23].
Related work
There are various approaches to learning models of human behaviour from textual in-
structions: through grammatical patterns that are used to map the sentence to a machine
understandable model of the sentence [26, 2]; through machine learning techniques
[18, 5]; or through reinforcement learning approaches that learn language by interact-
ing with an external environment [2, 3].
Models learned through model grounding have been used for plan generation [13,
2], for learning the optimal sequence of instruction execution [4], for learning nav-
igational directions [5], and for interpreting human instructions for robots to follow
them [10, 19]. To our knowledge, any attempts to apply language grounding to learn-
ing models for AR rely on identifying objects from textual data and do not build a
computational model of human behaviour [20]. This, however, suggests that models
learned from text could be used for AR tasks.
Existing approaches that learn human behaviour from text make simplifying as-
sumptions about the problem, making them unsuitable for more general AR problems.
More precisely, the preconditions and effects are learned through explicit causal relations
that are grammatically expressed in the text [13, 18]. They, however, either rely
on initial manual definition to learn these relations [2], or on grammatical patterns and
rich texts with complex sentence structure [13]. They do not address the problem of
discovering causal relations between sentences, but assume that all causal relations are
expressed within the sentence [19]. They also do not identify implicit relations.
However, to find causal relations in instructions without a training phase, one has to
rely on alternative methods, such as time series analysis [21]. Moreover, the initial state
is manually defined and there are only a few works that identify possible goals based
on the textual instructions [26, 1]. This limits the approaches to a predefined problem
and does not allow the reasoning about different situations and goals. This is, however,
an important requirement for any assistive system that relies on activity recognition.
Furthermore, they rely on a manually defined ontology, or do not use one. However,
one needs an ontology to deal with model generalisation problems and as a means for
expressing the semantic relations between model elements.
Moreover, there have been no previous attempts at learning CSSMs from textual
instructions. Existing CSSM approaches rely on the manual definition of rules to build the
preconditions and effects of the models. For example, [9] use the cognitive architecture
ACT-R, while other approaches rely on a PDDL¹-like notation to describe the possible
actions [17, 12]. In that sense, our work is the first attempt at learning CSSMs from
textual instructions.
In this work we represent the rules in a PDDL-like notation in the form described
in [24].
Identifying text elements of interest
To extract the text elements that describe the user behaviour, the user actions and their
relations to other entities in the environment have to be identified. This is achieved
through assigning each word in a text the corresponding part of speech (POS) tag. Furthermore,
the dependencies between text elements are identified through a dependency
parser. To identify the human actions, the verbs from the POS-tagged text are extracted.
We are interested in present tense verbs, as textual instructions are usually written in
present tense, imperative form. We also extract any nouns that are direct objects to the
actions. These will be the objects in the environment with which the human can interact.
Furthermore, we extract any nouns that are in conjunction with the identified objects.
These depend on the same actions as the objects they are in conjunction with.
Figure 1 gives an example of a sentence and the
identified elements based on the POS-tags and dependencies.
Figure 1: Elements of a sentence necessary for the model learning. In the example sentence
“Take the clean knife from the counter.”, take is the action, knife is the object (dobj),
and counter is the location (prep_from).
¹ Planning Domain Definition Language
Moreover, any preposition relations such as in, on, at, etc. between the objects and
other elements in the text are identified. These provide spatial or directional information
about the action of interest. For example, in the sentence “Put the apple on the
table.” our action is put, while the object on which the action is executed is apple. The
action is executed in the location table identified through the on preposition. Finally,
we extract “states” from the text. The state of an object is the adjectival modifier or
the nominal subject of an object. As in textual instructions the object is often omitted
(e.g. “Simmer (the sauce) until thickened.”), we also investigate the relation between
an action and past tense verbs or adjectives that do not belong to an adjectival modifier
or to a nominal subject, but that might still describe this relation. The states give us
information about the state of the environment before and after an action is executed.
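The extraction rules described in this section can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the `Token` structure, the hand-written parse of the Figure 1 sentence, and the function name `extract_elements` are all introduced here and are not the output or API of any particular parser. Conjunctions and omitted objects are not handled.

```python
from typing import NamedTuple

class Token(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # POS tag (Penn Treebank style is assumed)
    head: int  # index of the governing token, 0 for the root
    dep: str   # dependency label, e.g. "dobj" or "prep_from"

# Hand-written parse of the Figure 1 sentence (an assumption, not parser output).
SENTENCE = [
    Token(1, "Take", "VB", 0, "root"),
    Token(2, "the", "DT", 4, "det"),
    Token(3, "clean", "JJ", 4, "amod"),
    Token(4, "knife", "NN", 1, "dobj"),
    Token(5, "the", "DT", 6, "det"),
    Token(6, "counter", "NN", 1, "prep_from"),
]

def extract_elements(tokens):
    """Collect each action with its objects, locations and object states."""
    by_idx = {t.idx: t for t in tokens}
    # Present tense / imperative verbs are treated as actions.
    actions = {
        t.idx: {"action": t.word.lower(), "objects": [], "locations": [], "states": []}
        for t in tokens if t.pos in ("VB", "VBP")
    }
    for t in tokens:
        if t.dep == "dobj" and t.head in actions:
            actions[t.head]["objects"].append(t.word)
        elif t.dep.startswith("prep_") and t.head in actions:
            preposition = t.dep[len("prep_"):]
            actions[t.head]["locations"].append((preposition, t.word))
    for t in tokens:
        # An adjectival modifier of a direct object is recorded as that object's state.
        if t.dep == "amod":
            head = by_idx[t.head]
            if head.dep == "dobj" and head.head in actions:
                actions[head.head]["states"].append((head.word, t.word))
    return list(actions.values())

elements = extract_elements(SENTENCE)
```

For the example sentence this yields one action, take, with the object knife, the location counter (reached through from), and the state clean attached to knife.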
Extracting causal relations from textual instructions
To identify causal relations between the actions, and between states and actions, we
use an approach proposed in [21]. It transforms every word of interest in the text into
a time series and then applies time series analysis to identify any causal relations
between the series. More precisely, each sentence is treated as a time stamp in the time
series. Then, for each word of interest, the number of times it appears in the
sentence is counted and stored as an element of the time series with the same index as the
sentence index. We generate time series for all actions and for all states that change
an object. To discover causal relations based on the time series, we apply the Granger
causality test. It is a statistical test for determining whether one time series is useful for
forecasting another. More precisely, Granger testing performs a statistical significance
test for one time series “causing” the other time series with different time lags using
auto-regression [8]. The causality relationship is based on two principles. The first is
that the cause happens prior to the effect, while the second states that the cause has
unique information about the future values of its effect. Given two time series
$x_t$ and $y_t$, we can test whether $x_t$ Granger causes $y_t$ with a maximum time lag $p$. To do
that, we estimate the regression
$y_t = a_0 + a_1 y_{t-1} + \dots + a_p y_{t-p} + b_1 x_{t-1} + \dots + b_p x_{t-p}$.
An F-test is then used to determine whether the lagged $x$ terms are significant.
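For reference, the F-test compares the regression above with a restricted regression in which all lagged $x$ terms are dropped. The form below is the standard textbook statistic and is not taken from [21]; the symbols $RSS_r$, $RSS_u$ (residual sums of squares of the restricted and unrestricted regressions) and $N$ (the number of time points used in the regression) are introduced here for illustration.

```latex
% Restricted model (null hypothesis H_0: b_1 = \dots = b_p = 0)
y_t = a_0 + a_1 y_{t-1} + \dots + a_p y_{t-p}

% F statistic comparing restricted and unrestricted fits
F = \frac{(RSS_r - RSS_u) / p}{RSS_u / (N - 2p - 1)}
\;\sim\; F(p,\; N - 2p - 1) \quad \text{under } H_0
```

A large value of $F$ (a small p-value) indicates that the lagged $x$ terms improve the prediction of $y$, in which case $x_t$ is said to Granger cause $y_t$.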
Figure 2 shows the procedure of converting text elements into time series and using
them to discover causal relations.
Figure 2: The procedure for discovering causal relations. The instruction “cook a carrot soup”
(“Take the knife from the counter. Cut the carrots. Put the knife on the counter. Take the pot
from the counter. Put the pot on the stove. Put the carrots in the pot. Turn on the stove. Take
the wooden spoon from the counter. Put the wooden spoon in the pot.”) is converted into the
time series X (take) and Y (put). The regression
$y_t = a_0 + a_1 y_{t-1} + \dots + a_p y_{t-p} + b_1 x_{t-1} + \dots + b_p x_{t-p}$ is estimated. If the lagged
$x$ terms are significant for the prediction of $y$, then $x_t$ Granger causes $y_t$, that is,
“take” causes “put”.
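The conversion of instructions into time series can be illustrated with the sentences from Figure 2. This is a minimal sketch, assuming that words are matched by their surface form (no lemmatisation); the function names `word_series` and `lagged_rows` are hypothetical helpers introduced here. The second helper only arranges the data for the regression; fitting it and running the F-test can be done with any least-squares routine.

```python
import re

INSTRUCTIONS = [
    "Take the knife from the counter.",
    "Cut the carrots.",
    "Put the knife on the counter.",
    "Take the pot from the counter.",
    "Put the pot on the stove.",
    "Put the carrots in the pot.",
    "Turn on the stove.",
    "Take the wooden spoon from the counter.",
    "Put the wooden spoon in the pot.",
]

def word_series(sentences, word):
    """One entry per sentence: how often `word` occurs in that sentence."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [len(pattern.findall(sentence)) for sentence in sentences]

def lagged_rows(x, y, p):
    """Rows (y_t, [1, y_{t-1}..y_{t-p}, x_{t-1}..x_{t-p}]) for the regression."""
    rows = []
    for t in range(p, len(y)):
        lagged_y = [y[t - i] for i in range(1, p + 1)]
        lagged_x = [x[t - i] for i in range(1, p + 1)]
        rows.append((y[t], [1] + lagged_y + lagged_x))
    return rows

take = word_series(INSTRUCTIONS, "take")  # series X in Figure 2
put = word_series(INSTRUCTIONS, "put")    # series Y in Figure 2
rows = lagged_rows(take, put, p=1)
```

With a maximum lag of one, each row pairs the count of “put” in a sentence with the counts of “put” and “take” in the preceding sentence.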
Building the domain ontology
The domain ontology is divided into an argument ontology and an action ontology. The argument
ontology describes the objects, locations, and any elements in the environment that are
taken as arguments in the actions. The action ontology represents the actions with their
arguments and abstraction levels.
To learn the argument ontology, a semantic lexicon (e.g. WordNet [14]) is used to
build the initial ontology. As the initial ontology does not contain some types that unify
arguments applied to the same action, the ontology has to be extended. To do that, the
prepositions with which actions are connected to indirect objects are also extracted
(e.g. in, on, etc.). They are then added to the argument ontology as parents of the
arguments they connect. In that manner, the locational properties of the arguments
are described (e.g. water has the property to be in something). During the learning
of the action templates and their preconditions, additional parent types are added to
describe objects used in actions that have the same preconditions. Furthermore, types
that are not present in the initial ontology, but whose objects are used only in a specific
action, are combined in a common parent type. Figure 3 (left) shows an example of
an argument ontology. To learn the action ontology, the process proposed in [24] is
adapted for learning from textual data. Based on the argument ontology, the actions are
abstracted by replacing the concrete arguments with their corresponding types from
an upper abstraction level. In that manner, the uppermost level will represent the most
abstract form of the action. For example, the sentence “Put the apple on the table.” will
yield the concrete action put apple table, and the abstract action put object location.
Figure 3 (right) shows an example of an action ontology. This representation is used as
a basis for the precondition-effect rules that describe the actions.
Figure 3: Argument ontology (left): objects identified through POS-tagging and dependencies
(blue); hierarchy identified through WordNet (black); types identified through
the relations of objects to prepositions (red); types identified based on similar preconditions
(green); types identified through action abstraction (yellow). Action ontology
(right): abstract representation of an action (uppermost layer); abstract representation
of action take (middle layer); concrete instances of action take (bottom layer).
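The abstraction of actions through the argument ontology can be sketched as follows. The type hierarchy in `PARENT` is a hypothetical fragment written for this illustration, not the ontology learned in the paper, and the function names are likewise introduced here.

```python
# Hypothetical fragment of an argument ontology: child type -> parent type.
PARENT = {
    "apple": "food",
    "food": "object",
    "knife": "tool",
    "tool": "object",
    "table": "location",
    "counter": "location",
}

def lift(term):
    """Replace a term by its parent type; root types are returned unchanged."""
    return PARENT.get(term, term)

def abstract_action(action):
    """Lift every argument of an action by one level, keeping the action name."""
    name, *args = action
    return (name, *(lift(a) for a in args))

def abstraction_levels(action):
    """All levels from the concrete action up to its most abstract form."""
    chain = [tuple(action)]
    while True:
        lifted = abstract_action(chain[-1])
        if lifted == chain[-1]:
            return chain
        chain.append(lifted)

levels = abstraction_levels(("put", "apple", "table"))
```

Here the concrete action put apple table is lifted step by step until no argument has a parent left, which gives put object location as the uppermost level.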
Generating precondition-effect rules
The next step in the process is the generation of precondition-effect rules that describe
the actions and the way they change the world. The basis for the rules is the action
ontology. Each abstract action from the ontology is taken and converted to an action
template that has the form shown in Figure 4. Basically, the action name is the first
part of the abstract entity put object location, while the two parameters are the second
and the third part of the entity. Furthermore, the default predicate (executed-action) is
added to both the precondition and the effect, whereas in the precondition it is negated.
(:action put
  :parameters (?o - object ?to - location)
  :precondition (and
    (not (executed-put ?o ?to)))
  :effect (and
    (executed-put ?o ?to)))
Figure 4: Example of an action template put in the PDDL notation.
Now the causal relations extracted from the text are used to extend the actions.
The execution of each action that was identified to cause another action is added as a
precondition to the second action. For example, to execute the action put, the action
take has to take place. That means that the predicate executed-take ?o has to be added
to the precondition of the action put. Furthermore, any states that cause the action are
also added in the precondition. For example, imagine the example sentence is extended
in the following manner: “If the apple is ripe, put the apple on the table.” In that case
the state ripe causes the action put. For that reason the predicate (state-ripe) will also
be added to the precondition. This procedure is repeated for all available actions. The
result is a set of candidate rules that describe a given behaviour.
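The way causal relations extend an action template can be sketched as a small text generator. This is an illustration only: the function `action_template` and its parameters are hypothetical names introduced here, and the output follows the layout of Figure 4 rather than any particular planner's input format.

```python
def action_template(name, params, caused_by_actions=(), caused_by_states=()):
    """Render a PDDL-like action template as text.

    params: (variable, type) pairs, e.g. [("?o", "object")].
    caused_by_actions: (action name, argument variables) pairs that cause this action.
    caused_by_states: names of states that cause this action.
    """
    args = " ".join(var for var, _ in params)
    typed = " ".join(f"{var} - {typ}" for var, typ in params)
    done = f"(executed-{name} {args})"
    # Default predicate: negated in the precondition, asserted in the effect.
    preconditions = [f"(not {done})"]
    # Each causing action and state becomes an additional precondition.
    preconditions += [f"(executed-{a} {' '.join(vs)})" for a, vs in caused_by_actions]
    preconditions += [f"(state-{s})" for s in caused_by_states]
    lines = [
        f"(:action {name}",
        f"  :parameters ({typed})",
        "  :precondition (and",
        *[f"    {p}" for p in preconditions],
        "  )",
        "  :effect (and",
        f"    {done}",
        "  )",
        ")",
    ]
    return "\n".join(lines)

put = action_template(
    "put",
    [("?o", "object"), ("?to", "location")],
    caused_by_actions=[("take", ["?o"])],
    caused_by_states=["ripe"],
)
print(put)
```

For the running example, the template for put gains the preconditions (executed-take ?o) and (state-ripe) in addition to the negated default predicate.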
As it is possible that some of the rules contradict each other, a refinement step is
added. This is done by converting the argument ontology to the corresponding PDDL
format to represent the type hierarchy. The initial and goal states are then generated
by assigning different combinations of truth values to the set of predicates. Different
combinations of initial-goal state pairs are generated from the sets of initial and goal
states. Later, an initial-goal state pair as well as the rules and the type hierarchy are fed
to a planner and any predicates that prevent the reaching of the goal are removed from
the preconditions. This results in a set of candidate models from which the optimal
model will be selected.
Learning the optimal model structure
As the model will be applied to activity recognition tasks, it is important to learn
a model structure that optimises the probability of selecting the correct action². To
achieve that, two steps are followed (see Figure 5). First, the model is optimised based
on its ability to explain existing plans describing the user behaviour. This approach is
similar to the methods for model learning through observations [3, 7]. Here, the observations
are provided in the form of manually produced plans. The plans are obtained by
asking different persons to provide a plan based on a textual description of a given task.
Models that are not able to predict the plan receive no reward. From the remaining set
of models, those which maximise the probability of executing the plan above a given
threshold are selected. The probability is calculated based on Formula 1.
$$p(a_{1:T} \mid M) = \prod_{t=1}^{T} p(a_t \mid M) \qquad (1)$$
$$p(a_t \mid M) = \begin{cases} 1 - p_{stop}, & \text{same action} \\ p_{stop} \times \dfrac{\exp\left(\sum_{k} \lambda_k f_k(a_t)\right)}{\sum_{a} \exp\left(\sum_{k} \lambda_k f_k(a)\right)}, & \text{new action} \end{cases} \qquad (2)$$
$M$ is the model used to explain the plan, $a_t$ is the action executed at time $t$, $k$ ranges over a set
of features, $f_k$ is an action selection heuristic, and $\lambda_k$ is its weight. The action selection
heuristics are goal distance, landmarks, cognitive heuristics, etc. [24].
Figure 5: Learning the optimal model for a given situation based on observations. A candidate
model $M_n$ is applied to a set of plans (e.g. 1. (take knife counter), 2. (cut carrots),
3. (put knife counter), 4. (take pot counter)); if $p(a_{1:T} \mid M_n)$ is below a threshold value T,
the model is removed from the candidate models. A model $M_l$ that explains the plans is
applied to the sensor-based AR problem, and the optimal model is selected based on
$p(a_{1:T} \mid M_l)$.
² In our case, that is the actual action executed by the user.
After selecting the set of most promising models, they are further optimised. This
is done by testing their ability to recognise activities and goals based on sensor ob-
servations. As a base for this step, the validation steps from the development process
proposed in [24] are used. Formula 1 is once again used to select the model that best
explains the observations.
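The use of Formula 1 for filtering candidate models can be sketched as below. This is a sketch under stated assumptions, not the implementation used in the paper: the new-action term is assumed to be a normalised exponential (softmax) over the weighted heuristic values, the first action of a plan is treated as a new action, and the candidate actions per step, the single goal-distance feature, and all names and numbers are hypothetical. Log probabilities are used because the raw values can become extremely small, as the numbers in Figure 5 suggest.

```python
import math

def new_action_probs(candidates, features, weights):
    """Normalised exp(sum_k lambda_k * f_k(a)) over the candidate actions."""
    scores = {
        a: math.exp(sum(w * f(a) for f, w in zip(features, weights)))
        for a in candidates
    }
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

def plan_log_probability(plan, candidates_per_step, features, weights, p_stop):
    """log p(a_{1:T} | M): sum of log p(a_t | M); expects 0 < p_stop < 1."""
    log_p, previous = 0.0, None
    for action, candidates in zip(plan, candidates_per_step):
        if action == previous:
            p = 1.0 - p_stop  # the current action continues
        else:
            probs = new_action_probs(candidates, features, weights)
            p = p_stop * probs.get(action, 0.0)  # 0 if the model forbids the action
        if p == 0.0:
            return float("-inf")  # the model cannot explain the plan
        log_p += math.log(p)
        previous = action
    return log_p

# Hypothetical goal distances standing in for one action selection heuristic.
GOAL_DISTANCE = {"take": 3, "open": 4, "cut": 2, "put": 1, "wash": 5}
features = [lambda a: -GOAL_DISTANCE[a]]
weights = [1.0]

plan = ["take", "cut", "put"]
candidates_per_step = [["take", "open"], ["cut", "put"], ["put", "take", "wash"]]
log_p = plan_log_probability(plan, candidates_per_step, features, weights, p_stop=0.9)
accepted = log_p > math.log(1e-6)  # hypothetical threshold
```

A model for which the function returns negative infinity cannot produce the plan at all and would be discarded; the remaining models are compared against the threshold.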
TextToHBM: an Example
To illustrate the approach, we take as an example an experiment description of a person
who is cooking a carrot soup. A description of the experiment can be found in
[12] and the sensor dataset itself in [11]. This could be considered a simplified example,
as the textual instruction contains an explicit description of each execution step.
Table 1 shows an excerpt of the instructions provided for executing the experiment.
First, all actions in the dataset are identified³. For the carrot soup example, 15 actions
were identified. Furthermore, all arguments are identified. For this example, 19 arguments
were identified, one of which was incorrectly labeled as a noun (the verb “wash”).
Five of the arguments serve as locations (e.g. “counter”, “stove” etc.) describing places
where actions are executed. The rest are objects upon which the action is executed (e.g.
“water”, “plate”, etc). Moreover, 7 prepositions were discovered that describe location,
direction or means by which an action is achieved (e.g. “in”, “from”, “with”). No states
were discovered in this example. This is due to the oversimplified sentence structure
that follows the pattern “action direct-object(s) location(s)”.
1 Take the knife from the counter.
2 Cut the carrots.
3 Put the knife on the counter.
4 Take the pot from the counter.
5 Put the pot on the stove.
6 Put the carrots in the pot.
7 Turn on the stove.
8 Take the wooden spoon from the counter.
9 Put the wooden spoon in the pot.
10 Cook for 10 minutes.
11 Turn off the stove.
12 Open the cupboard.
13 Take a plate from the cupboard.
14 Take a glass from the cupboard.
15 Put the plate and the glass on the counter.
Table 1: Excerpt from instruction describing the cooking of a carrot soup.
³ This can be done with the help of a parser that POS-tags the text. Later, all present tense verbs are extracted,
as they usually describe an action that is executed.
The next step is to represent each action as a time series. More precisely, each
element in the time series is represented with a number. This number indicates the
number of occurrences of the given action in the current sentence. In that manner, each
of the words (or pairs of words) of interest is assigned a time series. This allows the
utilisation of time series analysis for the discovery of implicit causal relations in textual
instructions. The resulting time series can be downloaded from [22].
Figure 6 shows the causal relations between actions discovered for cooking a carrot
soup.
Figure 6: Causal relations discovered for cooking a carrot soup. Black indicates rela-
tions discovered by a human annotator, green: discovered with the proposed approach.
After identifying the causal relations, the argument ontology is learned. This is done
by feeding the identified nouns to WordNet in order to build the initial ontology. Then,
based on the relations described through prepositions, similar causal relations, and ab-
straction in the action ontology, the argument ontology is refined and new relations are
identified. Figure 7 shows the resulting ontology and the steps for building it. Similarly
to the argument ontology, the action ontology is based on the identified actions, the ob-
jects they are executed on and the indirect objects or locations where they are executed.
Figure 7: Left: Learning the argument ontology. Step 1 (blue): objects identified
through POS-tagging and dependencies; step 2 (black): hierarchy identified through
WordNet; step 3 (red): types identified through the relations of objects to prepositions;
step 4 (green): types identified based on similar preconditions; step 5 (yellow): types
identified through action abstraction. Right: Learning the action ontology: (uppermost
layer): abstract representation; (middle layer): abstract representation of action “take”;
(bottommost layer): concrete representation of action “take”.
Each abstraction level of the action ontology is based on the corresponding abstraction
level in the argument ontology.
In the next step, based on the action ontology and the identified causal relations, the
precondition-effect rules are built. In this example, the rules are built based on 17 predicates.
As there were no states discovered, the predicates indicate whether an action is
executed or not (e.g. “(executed-wash ?f - wash-obj)”). Based on these rules, 20 action
templates were constructed. There are more templates than action classes because the
same action class has different preconditions or effects in different situations. Figure 8
shows the generated precondition-effect rule for the action “close”. It can be seen that
apart from the typical PDDL action notation, there are two additional slots: “:duration”
and “:observation”. These are later used for performing activity recognition. There, the
actions have durations and are observed through sensor observations. These slots allow
linking the behaviour model to the underlying action duration distribution and the
expected type of observations.
(:action close
  :parameters (?c - area)
  :duration (closeDuration)
  :precondition (and
    (not (executed-close ?c))
    (executed-open ?c))
  :effect (and
    (executed-close ?c)
    (not (executed-open ?c)))
  :observation (setActivity (activity-id close)))
Figure 8: The action template close in the PDDL notation.
Figure 9: Partially extended state space graph of the model.
After defining the action rules, the next step is to generate the initial and goal states.
The initial and goal states represent different combinations of truth values over all
ground predicates. In our example we have 204 ground predicates which means that
we have 204! possible combinations. To reduce this number, we utilise some prior
knowledge. We assume that none of the objects is taken, no doors or cupboards are
open, and no devices are turned on. This means that the predicates (executed-close),
(executed-put), (executed-turn-off) are set to true. We also assume that apart from these,
no other actions have been executed at the beginning of the problem. This leaves us with
only one initial state. Similarly, for the goal state we assume that the actions “cook”,
“drink”, “eat”, and “wash” (applied to the different objects) have to be executed and for
the rest we do not care. As in the experiment we conducted to collect the sensor data,
different persons chose to wash different objects, we generate different goal conditions.
There, the predicates indicating the execution of the actions “cook”, “drink”, and “eat”
are always set to true, and the truth value of the predicates describing the execution of
the “wash” actions varies. As we have 7 objects on which “wash” can be executed, this
means we have 5040 combinations of goal conditions.
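The size of the goal space can be checked with a short calculation. The sketch below is only a consistency check under two readings of “combinations”: 5040 equals 7!, the number of orderings of the seven washable objects, whereas assigning an independent truth value to each of seven predicates gives 2^7 = 128 assignments. The text does not state which counting is intended, so both are shown; the variable names are introduced here for illustration.

```python
import math
from itertools import product

WASHABLE = 7  # number of objects on which "wash" can be executed, as stated in the text

orderings = math.factorial(WASHABLE)  # 7! orderings of the seven objects
assignments = 2 ** WASHABLE           # independent true/false choice per predicate

# Explicit enumeration of the truth assignments over the seven "wash" predicates.
goal_conditions = list(product([False, True], repeat=WASHABLE))
```

The same distinction applies to the 204 ground predicates of the initial state: 204! counts orderings, while 2^204 counts truth assignments.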
Having defined the initial state and the goals, we use a state-of-the-art planner to
identify any problems in the models that prevent reaching the goal state. For example,
the causal relation “fill causes take” was discovered, which means that in the precon-
dition of “take” the predicate (executed-fill) has to be true. However, this prevents the
model from reaching the goal state as no objects can be taken unless the action “fill”
is executed. For that reason we remove this condition from the precondition of “take”.
This procedure is repeated for all models until all problems preventing the models from
reaching a goal state are removed.
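The detection of preconditions that block the goal can be sketched with a simple reachability check. This is not the planner used in the paper: the sketch ignores negative preconditions and delete effects (a relaxation), only reports candidate preconditions rather than deciding which one to remove, and the three actions are a hypothetical fragment loosely following the “fill causes take” case.

```python
def reachable(init, actions):
    """Facts reachable from `init` when delete effects are ignored."""
    facts = set(init)
    changed = True
    while changed:
        changed = False
        for preconditions, effects in actions.values():
            if preconditions <= facts and not effects <= facts:
                facts |= effects
                changed = True
    return facts

def blocking_preconditions(init, actions):
    """(action, predicate) pairs that can never be satisfied from `init`."""
    facts = reachable(init, actions)
    return {
        (name, predicate)
        for name, (preconditions, _) in actions.items()
        for predicate in preconditions - facts
    }

# Hypothetical actions: name -> (positive preconditions, positive effects).
ACTIONS = {
    "take": ({"executed-fill"}, {"executed-take"}),
    "fill": ({"executed-take"}, {"executed-fill"}),
    "cook": ({"executed-fill"}, {"executed-cook"}),
}
GOAL = {"executed-cook"}

before = blocking_preconditions(set(), ACTIONS)

# Remove the learned precondition that the text identifies as the problem.
refined = dict(ACTIONS)
refined["take"] = (ACTIONS["take"][0] - {"executed-fill"}, ACTIONS["take"][1])

after = blocking_preconditions(set(), refined)
goal_reachable = GOAL <= reachable(set(), refined)
```

Before the refinement no action is applicable, because take and fill wait for each other; after removing (executed-fill) from the precondition of take, the goal becomes reachable.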
The result of this step is a set of causally correct models that contain all execution
sequences from the initial state to the possible goal states. If the model can be fully
extended it can be seen as a directed graph where the nodes are states and the transitions
are actions. The graph starts with the initial state where the probability of this state is
one in the case of only one state, otherwise there is a probability distribution over all
initial states. The transitions from one state to another also have probabilities, which are
defined based on an action selection strategy such as the distance to the goal, or how often
the action is executed etc. Figure 9 shows an example of such a graph where the dots are
the states and the connections between them are the actions. The graph starts with one
initial state and then follows different paths to the goal states (red dots at the bottom of
the graph). In this case the graph was only partially explored with iterative deepening
depth-first search due to the large state space. The red line shows the sequence of actions
the user actually executed.
As we now have 5040 candidate models, we use a set of plans describing the execution
sequences in the experiment we conducted. In this example the plans are generated
from the annotation of the video logs recorded during the experiment. This step reduces
the model to one goal condition where apart from “cook”, “eat”, and “drink”, “wash
glass”, “wash carrot”, and “wash plate” have to be executed. For the remaining predicates,
we do not care about their truth value⁴.
The resulting model has a very large branching factor⁵. This reduces the probability
of selecting the actual action being executed, especially in the case of noisy or ambiguous
observations. For that reason an optimisation step is applied. Using the model and
sensor observations describing the execution sequences of different users preparing a
carrot soup, the transition probabilities are then adjusted and the observed sequences
receive higher probability than those that are not observed. In that manner, more typical
behaviour is more likely to be observed, but at the same time less probable behaviour is
not completely removed, so that in case of new observations, the model can be further
In this work we proposed an approach that automatically extracts knowledge from
textual instructions and learns models of human behaviour that are later used for activity
recognition. One problem the approach faces is the discovery of causal relations.
As textual instructions such as recipes are usually written in an informal manner, their
sentences often lack the direct objects. This in turn makes it difficult to discover the
objects on which an action is executed and also reduces the ability of the approach to
discover causal relations. For that reason, we believe that the approach could benefit
from anaphora resolution techniques in order to add the missing direct objects to
the sentence.
Another problem is the generation of initial and goal states. As can be seen from
the example, even with simplifying assumptions, there is a very large number of possible
combinations. In that sense, the approach could benefit from automated techniques
that discover improbable initial and goal conditions in the text. This could potentially
reduce the number of candidate models, thus reducing the computational effort required
to evaluate the applicability of the models to the problem at hand.
⁴ This one condition generates a set of possible goal states that can be reached. In other words, our model
now contains one initial state and several goal states in which the goal condition is satisfied.
⁵ This is the maximum number of actions executable from a given state, or in other words the maximum
number of connections that leave a dot in Figure 9.
Finally, the approach could potentially benefit from utilising multiple textual instructions
to learn the candidate models. This will allow the generation of richer models
that contain more contextual information and that are not tailored for only one specific
Current Results and Future Work
In previous work we proposed an approach for extracting causal relations from textual
instructions through time series analysis [21]. We applied the approach to 20 textual
instructions (cooking recipes, manuals, and experiment instructions). The results showed
that the approach is able to identify causal relations in simple texts with short sentences.
We compared the approach to one based on grammatical patterns and discovered that
the latter was able to detect only a very small number of relations. We used the extracted
relations as a basis for building precondition-effect rules [25]. In [25] we were able to
learn a causal model describing the activities from the carrot soup preparation dataset
and to compare it to a manually built model described in [24]. The results showed that
our approach is able to extract precondition-effect rules and to explain the plans
corresponding to the video logs from the dataset. They also showed, however, that the
probability of executing the action described in the plan is very low given the model.
It should also be noted that the initial and goal states of the model were manually
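The idea of treating words as time series can be sketched as follows. This is a deliberately simplified illustration and not the method of [21]: each word becomes a binary series over sentence positions, and a candidate relation is scored by how often the second word follows the first within a small lag. The function names, the `max_lag` parameter, and the example sentences are assumptions introduced here.

```python
def word_series(sentences, word):
    """Binary series: 1 if `word` occurs in the sentence at that position."""
    return [1 if word in s.lower().split() else 0 for s in sentences]


def lagged_support(cause, effect, max_lag=2):
    """Count occurrences of `cause` followed by `effect` within `max_lag` steps."""
    return sum(
        1
        for t, present in enumerate(cause)
        if present and any(effect[t + 1 : t + 1 + max_lag])
    )


# Hypothetical instruction text (not from the evaluated corpus).
sentences = [
    "Wash the carrot",
    "Cut the carrot",
    "Put the carrot in the pot",
    "Cook the soup",
]

wash = word_series(sentences, "wash")
cut = word_series(sentences, "cut")

forward = lagged_support(wash, cut)   # how often "wash" precedes "cut"
backward = lagged_support(cut, wash)  # how often "cut" precedes "wash"
print(forward, backward)
```

A raw count like this only captures temporal precedence. A real analysis would apply a statistical test over the series, for example a Granger causality test [8].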
In the future we will concentrate on optimising the textual instructions by applying
anaphora resolution techniques. We will also investigate techniques for reducing the
set of possible initial and goal states before the optimisation step. Furthermore, we will
investigate inverse reinforcement learning methods for optimising the model structure
from sparse observations. Finally, we plan to apply the approach to different domains
(such as physiotherapy and assistance of workers) in order to test its applicability to
various problems from the domain of daily activities.
If successful, the proposed approach will reduce the need for expert knowledge by
replacing it with the domain knowledge encoded in textual instructions. This, in turn, will
reduce the time and resources needed for developing computational models of human
behaviour for activity recognition. It will also be the first attempt at actually applying
CSSMs learned from textual data to an activity recognition problem.
This work is funded by the German Research Foundation (DFG) within the context of
the project TextToHBM, grant number YO 226/1-1.
[1] Monica Babeş-Vroman, James MacGlashan, Ruoyuan Gao, Kevin Winner,
Richard Adjogah, Marie desJardins, Michael Littman, and Smaranda Muresan.
Learning to interpret natural language instructions. In Proceedings of the Second
Workshop on Semantic Interpretation in an Actionable Context, SIAC ’12, pages
1–6, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[2] S. R. K. Branavan, Nate Kushman, Tao Lei, and Regina Barzilay. Learning high-
level planning from text. In Proc. of the Annual Meeting of the Association for
Computational Linguistics, pages 126–135, 2012.
[3] S. R. K. Branavan, David Silver, and Regina Barzilay. Learning to win by reading
manuals in a monte-carlo framework. In Proc. of the Annual Meeting of the
Association for Computational Linguistics, pages 268–277, 2011.
[4] S. R. K. Branavan, Luke S. Zettlemoyer, and Regina Barzilay. Reading between
the lines: Learning to map high-level instructions to commands. In Proceedings of
the 48th Annual Meeting of the Association for Computational Linguistics, ACL
’10, pages 1268–1277, Stroudsburg, PA, USA, 2010. Association for
Computational Linguistics.
[5] David L. Chen and Raymond J. Mooney. Learning to interpret natural language
navigation instructions from observations. In Proc. of the AAAI Conference on
Artificial Intelligence, pages 859–865, August 2011.
[6] Liming Chen, J. Hoey, C.D. Nugent, D.J. Cook, and Zhiwen Yu. Sensor-based
activity recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part
C: Applications and Reviews, 42(6):790–808, November 2012.
[7] Dan Goldwasser and Dan Roth. Learning from natural instructions. Machine
Learning, 94(2):205–232, 2014.
[8] C. W. J. Granger. Investigating Causal Relations by Econometric Models and
Cross-spectral Methods. Econometrica, 37(3):424–438, August 1969.
[9] Laura M. Hiatt, Anthony M. Harrison, and J. Gregory Trafton. Accommodating
human variability in human-robot teams through theory of mind. In Proc. of the
Int. Joint Conference on Artificial Intelligence, pages 2066–2071, 2011.
[10] Thomas Kollar, Stefanie Tellex, Deb Roy, and Nicholas Roy. Grounding verbs
of motion in natural language commands to robots. In Oussama Khatib, Vijay
Kumar, and Gaurav Sukhatme, editors, Experimental Robotics, volume 79 of
Springer Tracts in Advanced Robotics, pages 31–47. Springer Berlin Heidelberg,
[11] Frank Krüger, Albert Hein, Kristina Yordanova, and Thomas Kirste. Recognising
the actions during cooking task (cooking task dataset). University Library,
University of Rostock, 2015.
[12] Frank Krüger, Martin Nyolt, Kristina Yordanova, Albert Hein, and Thomas
Kirste. Computational state space models for activity and intention recognition.
A feasibility study. PLoS ONE, 9(11):e109381, November 2014.
[13] Xiaochen Li, Wenji Mao, Daniel Zeng, and Fei-Yue Wang. Automatic construction
of domain theory for attack planning. In Int. Conference on Intelligence and
Security Informatics, pages 65–70, May 2010.
[14] George A. Miller. WordNet: A lexical database for English. Commun. ACM,
38(11):39–41, November 1995.
[15] Tuan A Nguyen, Subbarao Kambhampati, and Minh Do. Synthesizing robust
plans under incomplete domain models. In Advances in Neural Information
Processing Systems, pages 2472–2480. Curran Associates, Inc., 2013.
[16] Matthai Philipose, Kenneth P. Fishkin, Mike Perkowitz, Donald J. Patterson,
Dieter Fox, Henry Kautz, and Dirk Hahnel. Inferring activities from interactions
with objects. IEEE Pervasive Computing, 3(4):50–57, October 2004.
[17] Miquel Ramirez and Hector Geffner. Goal recognition over POMDPs: Inferring
the intention of a POMDP agent. In Proc. of the Int. Joint Conference on Artificial
Intelligence, volume 3, pages 2009–2014, Barcelona, Spain, 2011.
[18] Avirup Sil and Alexander Yates. Extracting STRIPS representations of actions and events. In
Recent Advances in Natural Language Processing, pages 1–8, September 2011.
[19] M. Tenorth, D. Nyga, and M. Beetz. Understanding and executing instructions
for everyday manipulation tasks from the world wide web. In Int. Conference on
Robotics and Automation, pages 1486–1491, May 2010.
[20] Juan Ye, Graeme Stevenson, and Simon Dobson. USMART: An unsupervised
semantic mining activity recognition technique. ACM Transactions on Interactive
Intelligent Systems, 4(4):16:1–16:27, November 2014.
[21] Kristina Yordanova. Discovering causal relations in textual instructions. In Recent
Advances in Natural Language Processing, pages 714–720, September 2015.
[22] Kristina Yordanova. Time series from textual instructions for causal relations
discovery (causal relations dataset). University Library, University of Rostock,
[23] Kristina Yordanova. From textual instructions to sensor-based recognition of user
behaviour. In Companion Publication of the 21st International Conference on
Intelligent User Interfaces, IUI ’16 Companion, pages 67–73, New York, NY,
USA, 2016. ACM.
[24] Kristina Yordanova and Thomas Kirste. A process for systematic development
of symbolic models for activity recognition. ACM Transactions on Interactive
Intelligent Systems, 5(4), December 2015.
[25] Kristina Yordanova and Thomas Kirste. Learning models of human behaviour
from textual instructions. In Proceedings of the 8th International Conference on
Agents and Artificial Intelligence (ICAART 2016), pages 415–422, Rome, Italy,
February 2016.
[26] Ziqi Zhang, Philip Webster, Victoria Uren, Andrea Varga, and Fabio Ciravegna.
Automatically extracting procedural knowledge from instructional texts using
natural language processing. In Proc. of the Int. Conference on Language
Resources and Evaluation, May 2012.
[27] Hankz Hankui Zhuo and Subbarao Kambhampati. Action-model acquisition from
noisy plan traces. In Proc. of the Int. Joint Conference on Artificial Intelligence,
pages 2444–2450, 2013.