Extracting Planning Operators from
Instructional Texts for Behaviour Interpretation
Kristina Yordanova
University of Rostock, 18059 Rostock, Germany
Abstract
Recent attempts at behaviour understanding through language grounding have shown that it is possible to automatically generate planning models from instructional texts. One drawback of these approaches is that they either do not make use of the semantic structure behind the model elements identified in the text, or they manually incorporate a collection of concepts with semantic relationships between them. To use such models for behaviour understanding, however, the system should also have knowledge of the semantic structure and context behind the planning operators. To address this problem, we propose an approach that automatically generates planning operators from textual instructions. The approach is able to identify various hierarchical, spatial, directional, and causal relations between the model elements. This allows incorporating context knowledge beyond the actions being executed. We evaluated the approach in terms of correctness of the identified elements, model search complexity, model coverage, and similarity to handcrafted models. The results showed that the approach is able to generate models that explain actual task executions and that the models are comparable to handcrafted models.
1 Introduction
Libraries of plans combined with observations are often used for behaviour understanding [18, 12]. Such approaches rely on PDDL-like notations to generate a library of plans and to reason about the agent's actions, plans, and goals based on observations. Models describing plan recognition problems for behaviour understanding are typically developed manually [18, 2]. Manual modelling is, however, time consuming and error prone and often requires domain expertise [16]. To reduce the need for domain experts and the time required for building the model, one can substitute them with textual data [17]. As [23] propose, one can utilise the knowledge encoded in instructional texts, such as manuals, recipes, and howto articles, to learn the model structure. Such texts specify tasks for achieving a given goal without explicitly stating all the required steps. On the one hand, this makes them a challenging source for learning a model [5]. On the other hand, they are written in imperative form, have a simple sentence structure, and are highly organised. Compared to rich texts, this makes them a better source for identifying the sequence of actions needed for reaching the goal [28].
According to [4], to learn a model for planning problems from textual instructions, the system has to: 1. extract the actions' semantics from the text; 2. learn the model semantics through language grounding; and 3. translate it into a computational model for planning problems. In this work we add 4. the learning of a situation model as a requirement for learning the model structure. As the name suggests, the situation model provides context information about the situation [24]. It is a collection of concepts with semantic relations between them. In that sense, the situation model plays the role of a common knowledge base shared between different entities. We also add 5. the need to extract implicit causal relations from the texts, as explicit relations are rarely found in this type of text.
In previous work we proposed an approach for extracting domain knowledge and generating situation models from textual instructions, based on which simple planning operators can be built [26]. We extend our previous work by proposing a mechanism for generating rich models from instructional texts and by providing a detailed description of the methodology. Further, we show first empirical results indicating that the approach is able to generate planning operators that capture the behaviour of the user. To evaluate the approach, we examine the correctness of the identified elements, the complexity of the search space, the model coverage, and the similarity to handcrafted models.
The work is structured as follows. Section 2 provides the state of the art in language grounding for behaviour understanding; Section 3 provides a formal description of the proposed approach; Section 4 contains the empirical evaluation of our approach. The work concludes with a discussion of future work (Section 5).
2 Related work
The goal of grounded language acquisition is to learn linguistic analysis from a
situated context [22]. This could be done in different ways: through grammatical
patterns that are used to map the sentence to a machine understandable model
of the sentence [13, 28, 4]; through machine learning techniques [19, 6, 3, 8, 11];
or through reinforcement learning approaches that learn language by interacting
with the environment [4, 5, 22, 1, 8, 11]. Models learned through language
grounding have been used for plan generation [13, 4, 14], for learning the optimal
sequence of instruction execution [5], for learning navigational directions [22, 6],
and for interpreting human instructions for robots to follow them [11, 20].
All of the above approaches have two drawbacks. The first problem is the way in which the preconditions and effects of the planning operators are identified. They are learned through explicit causal relations that are grammatically expressed in the text [13, 19]. The existing approaches either rely on an initial manual definition to learn these relations [4], or on grammatical patterns and rich texts with complex sentence structure [13]. In contrast, textual instructions usually have a simple sentence structure, and grammatical patterns are rarely discovered [25]. The existing approaches also do not address the problem of discovering causal relations between sentences, but assume that all causal relations are within the sentence [20]. In instructional texts, however, the elements representing cause and effect are usually found in different sentences [25].
The second problem is that existing approaches either rely on a manually defined situation model [19, 4, 8], or do not use one at all [13, 5, 28, 22]. Still, one needs a situation model to deal with model generalisation and as a means of expressing the semantic relations between model elements. What is more, the manual definition is time consuming and often requires domain experts. [14] propose dealing with model generalisation by clustering similar actions together. We propose an alternative solution in which we exploit the semantic structure of the knowledge present in the text and in language taxonomies.
In previous work, we addressed these two problems by proposing an approach
for automatic generation of situation models for planning problems [26]. In this
work, we extend the approach to generate rich planning operators and we show
first empirical evidence that it is possible to reason about human behaviour
based on the generated models. The method adapts an approach proposed by
[25] to use time series analysis to identify the causal relations between text
elements. We use it to discover implicit causal relations between actions. We
also make use of existing language taxonomies and word dependencies to identify
hierarchical, spatial and directional relations, as well as relations identifying the
means through which an action is accomplished. The situation model is then
used to generate planning operators.
3 Approach
3.1 Identifying elements of interest
The first step in generating the model is to identify the elements of interest in
the text. We consider a text $X$ to be a sequence of sentences $S = \{s_1, s_2, ..., s_n\}$. Each sentence $s$ is represented by a sequence of words $W_s = \{w_{1s}, w_{2s}, ..., w_{ms}\}$, where each word has a tag $t_w$ describing its part of speech (POS). A text contains different types of words. We are interested in verbs $v \in V$, $V \subset W$, as they describe the actions that can be executed in the environment. The set of actions $E \subset V$ consists of verbs in their infinitive form or in present tense, as textual instructions are usually written in imperative form with a missing agent. We are also interested in nouns $n \in N$, $N \subset W$ that are related to the verb. One type of noun is the direct (accusative) object of the verb, $d \in D$, $D \subset N$. These nouns give us the elements of the world with which the agent is interacting (in other words, the objects on which the action is executed). We denote the relation between $d$ and $e$ as $dobj(e, d)$. Here a relation $r$ is a function applied to two words $a$ and $b$. We denote this as $r(a, b)$. Note that $r(a, b) \neq r(b, a)$. An example of such a relation can be seen in Fig. 1, where "knife" is the direct object of "take".
Apart from the direct objects, we are also interested in any indirect objects $i \in I$, $I \subset N$ of the action, namely any nouns that are connected to the action through a preposition. These nouns give us spatial, locational, or directional information about the action being executed, or the means through which the action is executed (e.g. an action is executed "with" the help of an object). More formally, an indirect object $i_p \in I$ of an action $e$ is the noun connected to $e$ through a preposition $p$. We denote the relation between $i_p$ and $e$ as $p(e, i_p)$. For example, in Fig. 1 "counter" is the indirect object of "take" and its relation is denoted as from(take, counter). We define the set $O := D \cup I$ of all relevant objects as the union of all unique direct and indirect objects in a text.
The last type of element is the object's property. A property $c \in C$, $C \subset W$ of an object $o$ is a word that has one of the following relations with the object: $amod(c, o)$, denoting the adjectival modifier, or $nsubj(c, o)$, denoting the nominal subject. We denote such a relation as $property(c, o)$. For example, in Fig. 1, "clean" is the property of "knife". As the object is often omitted in instructions (e.g. "Simmer (the sauce) until thickened."), we also investigate the relation between an action and past tense verbs or adjectives that do not belong to an adjectival modifier or to a nominal subject, but that might still describe this relation.
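As an illustration of this extraction step, the following minimal sketch uses spaCy's dependency labels (dobj, pobj, amod, nsubj) instead of the Stanford parser used in this work; the function extract_elements and the exact label checks are illustrative assumptions, not the paper's implementation.

# Sketch: extract actions, direct/indirect objects, and properties from an
# instruction sentence. Assumes spaCy's English model; the paper itself uses
# the Stanford parser, so labels and tags may differ slightly.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_elements(sentence):
    doc = nlp(sentence)
    actions, relations = [], []
    for tok in doc:
        # imperative instructions: base-form verbs (tag VB) are the actions
        if tok.pos_ == "VERB" and tok.tag_ == "VB":
            actions.append(tok.lemma_)
        # direct object of a verb: dobj(e, d)
        if tok.dep_ == "dobj" and tok.head.pos_ == "VERB":
            relations.append(("dobj", tok.head.lemma_, tok.lemma_))
        # indirect object: noun reached from the verb via a preposition, p(e, i_p)
        if tok.dep_ == "pobj" and tok.head.dep_ == "prep" and tok.head.head.pos_ == "VERB":
            relations.append((tok.head.text, tok.head.head.lemma_, tok.lemma_))
        # property as adjectival modifier: amod(c, o)
        if tok.dep_ == "amod":
            relations.append(("property", tok.lemma_, tok.head.lemma_))
        # property as nominal subject of an adjective: nsubj(c, o)
        if tok.dep_ == "nsubj" and tok.head.pos_ == "ADJ":
            relations.append(("property", tok.head.lemma_, tok.lemma_))
    return actions, relations

print(extract_elements("Take the clean knife from the counter."))
# expected (parser-dependent): (['take'], [('property', 'clean', 'knife'),
#   ('dobj', 'take', 'knife'), ('from', 'take', 'counter')])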
3.2 Building the initial situation model
Given the set of objects $O$, the goal is to build the initial structure of the situation model. It consists of words describing the elements of a situation and of the relations between these elements. If we think of the words as nodes and of the relations as edges, we can represent the situation model as a graph.

Definition 1 (Situation model) A situation model $G := (W, R)$ is a graph consisting of nodes represented through words $W$ and of edges represented through relations $R$, where for two words $a, b \in W$, there exists a relation $r \in R$ such that $r(a, b)$.

[Figure 1 annotation: "Take the clean knife from the counter." with POS tags VB DT JJ NN IN DT NN; dependencies amod(clean, knife), dobj(take, knife), prep_from(take, counter); element types Action ("take"), Property ("clean"), Object ("knife"), Indirect object from ("counter"); and the corresponding PDDL operator:]

(:action take
 :parameters (?o - object ?l - surface)
 :precondition (and
   (<= (number-executed-take ?o ?l) nExecuted)
   (is-utensil ?o)
   (is-from ?l)
   (clean ?o)
   (executed-put ?o ?l))
 :effect (and
   (increase (number-executed-take ?o ?l) 1)
   (not (executed-put ?o ?l))
   (executed-take ?o ?l))
)

Figure 1: Elements of a sentence necessary for the model generation and the corresponding PDDL operator. Each sentence is assigned part of speech tags and the dependencies are annotated. Based on them, the relevant elements are identified. Later, PDDL operators are generated from the identified elements.
The initial structure of the situation model is represented through a taxonomy that contains the objects $O$ and their abstracted meaning on different levels of abstraction. To do that, a language taxonomy $L$ containing hyperonymy relations between the words of the language is used (this is the is-a relation between words). For example, the relation isa(knife, tool) indicates that the concrete object "knife" is of type "tool". To build the initial situation model, we start with the set $O$ as the leaves of the taxonomy and for each object $o \in O$ we recursively search for its hyperonyms. This results in a hierarchy where the bottommost layer consists of the elements in $O$ and the uppermost layer contains the most abstract word, that is, the least common parent of all $o \in O$. Here the least common parent $lcp(a, b)$ of two words $a$ and $b$ is the parent on the lowest level in the taxonomy that contains both $a$ and $b$ as children. The initial situation model is then $G_{init} := (W_{init}, R_{init})$ with $W_{init} = O \cup hyperonyms(O, L)$ and $R_{init} := isa(W_{init})$, where $O$ is the set of objects and $L$ is a language taxonomy. Furthermore, for every two objects $o_i, o_j \in O$, there exists $l \in L$ such that $l = lcp(o_i, o_j)$. Note that here we use a function $hyperonyms(O, L)$, which returns all hyperonyms of $O$ found in $L$. The abstraction hierarchy is later used to generalise or specialise the action templates in a planning model.
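The taxonomy construction can be sketched with NLTK's WordNet interface (WordNet is the taxonomy used in the evaluation in Section 4); selecting synsets(...)[0] mirrors the most-frequent-sense choice described there, and build_taxonomy is an illustrative name.

# Sketch: build the initial situation model G_init as a set of isa edges
# from each object in O up to the WordNet root. Assumes NLTK with the
# WordNet corpus downloaded (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def build_taxonomy(objects):
    """Return isa(child, parent) edges covering all hyperonyms of the objects."""
    edges = set()
    for obj in objects:
        synsets = wn.synsets(obj, pos=wn.NOUN)
        if not synsets:
            continue
        node = synsets[0]  # most frequently used sense, as in Section 4
        while node.hypernyms():
            parent = node.hypernyms()[0]
            edges.add(("isa", node.name(), parent.name()))
            node = parent
    return edges

# the least common parent lcp(a, b) corresponds to the lowest common hypernym
knife = wn.synsets("knife", pos=wn.NOUN)[0]
counter = wn.synsets("counter", pos=wn.NOUN)[0]
print(knife.lowest_common_hypernyms(counter))  # e.g. [Synset('artifact.n.01')]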
3.3 Extending the situation model
As the initial situation model contains only the abstraction hierarchy of the identified objects, we extend it by first adding the list of all actions and properties to the situation model and then adding the relations between actions and indirect objects, actions and direct objects, and properties and objects to the graph. We define the extended situation model as $G_{ext} := (W_{ext}, R_{ext})$, such that $W_{ext} := W_{init} \cup E \cup C$ and $R_{ext} := R_{init} \cup dobj(E, O) \cup p(E, O) \cup property(C, O)$, where $E$ is the set of actions, $O$ is the set of objects, $C$ is the set of properties, and $dobj(E, O)$ and $p(E, O)$ are the direct and indirect relations, respectively, between object and action, while $property(C, O)$ is the property-object relation. On the one hand, this step is performed to enrich the semantic structure of the model. On the other hand, it provides the basis for the planning operators, as the arguments of an operator are represented by all objects that are related to the action.
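Continuing the sketches above, the extension step amounts to adding the extracted actions, properties, and relations as further nodes and edges; networkx is used here purely for illustration.

# Sketch: extend the situation model graph G_init with the dobj, preposition,
# and property relations extracted from the text.
import networkx as nx

def extend_model(isa_edges, relations):
    g = nx.DiGraph()
    for _, child, parent in isa_edges:   # taxonomy from build_taxonomy()
        g.add_edge(child, parent, relation="isa")
    for rel, a, b in relations:          # output of extract_elements()
        g.add_edge(a, b, relation=rel)   # e.g. from(take, counter)
    return g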
3.4 Adding implicit causal relations
The last step is extending the situation model with causal relations. They build
up the preconditions and effects in a planning operator. There are two types of
predicates that describe the preconditions and effects. The first type is described
through the identified properties (e.g. the condition that the knife has the
property "clean") and through the indirect object relations (e.g. the counter has the role "from"). The second type of precondition is based on the assumption that a certain action has to be executed to enable the execution of another action. We call this "predictive causality" [7, p. 254] and the corresponding relations "predictive causal relations" or "implicit causal relations".
To discover implicit causal relations between actions in the text, we consider
two cases: (1) relations between two actions in the text; (2) relations between
two action-object pairs in the text. We consider the first case as there are actions
that are not related to a specific direct or indirect object but that are still causally related to other actions. We consider the second case because applying one action to an object can cause the execution of another action on the same object. We denote predictive causal relations by $q \in Q$, $Q \subset R$. To discover causal relations between actions, we adapt the algorithm proposed by [25], which makes use of time series analysis. We start by representing each unique action (or each action-object tuple) in a text as a time series. Each element in the series represents the number of occurrences of the action in the corresponding sentence. We then make use of the Granger causality test, a statistical test for determining whether one time series is useful for forecasting another. It performs a statistical significance test for one time series "causing" the other time series with different time lags using auto-regression [9]. Given two time series $x_t$ and $y_t$, we can test whether $x_t$ Granger causes $y_t$ with a maximum time lag of $p$. To do that, we estimate the regression $y_t = a_0 + a_1 y_{t-1} + \ldots + a_p y_{t-p} + b_1 x_{t-1} + \ldots + b_p x_{t-p}$. An F-test is then used to determine whether the lagged $x$ terms are significant¹. For example, we generate time series for the words "take" and "put"; if the Granger test concludes that the lagged time series for "take" significantly improves the forecast of the "put" time series, we conclude that "take" causes "put".
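A sketch of this test with statsmodels, assuming the per-sentence count series described above; the maximum lag, the 0.05 significance threshold, and the helper name granger_causes are illustrative choices rather than the paper's exact settings.

# Sketch: test whether the action behind x_counts "Granger-causes" the action
# behind y_counts, where both are per-sentence occurrence counts.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def granger_causes(x_counts, y_counts, maxlag=2, alpha=0.05):
    # statsmodels expects a two-column array and tests whether the second
    # column Granger-causes the first
    data = np.column_stack([y_counts, x_counts])
    results = grangercausalitytests(data, maxlag=maxlag, verbose=False)
    # significant if the F-test rejects "x does not cause y" for some lag
    return any(res[0]["ssr_ftest"][1] < alpha for res in results.values())

# hypothetical per-sentence counts of "take" and "put" in one instruction text
take = [1, 0, 2, 0, 1, 1, 0, 2, 0, 1, 0, 1]
put  = [0, 1, 0, 2, 1, 0, 1, 0, 2, 0, 1, 0]
if granger_causes(take, put):
    print("causes(take, put)")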
Now that we have identified the implicit causal relations between actions, we add them to the situation model. The final situation model is $G_{fin} := (W_{fin}, R_{fin})$ such that $W_{fin} := W_{ext}$ and $R_{fin} := R_{ext} \cup Q$, where $Q$ is the set of discovered causal relations, $W_{ext}$ is the set of words, and $R_{ext}$ is the set of relations in the extended situation model.
3.5 Generating planning operators
The next step is to generate operators based on the situation model. An operator $a := (e, Z, Pr, Fp, Ef, Fe)$ is a tuple, where $e$ is the name of the operator; $Z$ is the set of arguments with which the operator can be parameterised; $Pr, Ef \subset P$ are the sets of precondition and effect predicates, respectively; and $Fp, Fe \subset F$ are the sets of precondition and effect functions, respectively. The predicates $P$ are boolean functions that provide statements about the model world state. In contrast to predicates, functions provide higher-order statements about the model world (e.g. increasing the value of a function).

¹Note that regression usually reflects correlation. Granger, however, argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. As the question of "true causality" is philosophical, the Granger causality test assumes that one thing preceding another can be used as evidence of causation.
Algorithm 1 Generating planning operators from the situation model
Require: E, C, R, O — actions, properties, relations, objects from G_fin
Require: n — number of times an action can be executed
Require: A := ∅ — empty set of operators
 1: for e in E do  ▷ for each action in E
 2:   (name_ae, Z_ae, Pr_ae, Fp_ae, Ef_ae, Fe_ae) ← initialise()
 3:   name_ae ← e
 4:   for o in O do
 5:     if ∃ r := relation(e, o), r ∈ R_dobj ∪ R_p then  ▷ add arguments
 6:       Z_ae ← add.argument(Z_ae, o)
 7:     end if
 8:     if ∃ r := p(e, o), r ∈ R_p then  ▷ add predicates from indirect object relations
 9:       Pr_ae ← add.predicate(Pr_ae, property-p(o))
10:     end if
11:   end for
12:   Fp_ae ← add.function(Fp_ae, (number-executed-e(Z) < n))  ▷ default precondition function
13:   for z in Z_ae do  ▷ add property predicates
14:     for c in C do
15:       if ∃ r := property(c, z), r ∈ R then
16:         Pr_ae ← add.predicate(Pr_ae, property-c(z))
17:       end if
18:     end for
19:   end for
20:   for y in E, y ≠ e do
21:     if ∃ r := causes(y, e), r ∈ Q, Q ⊂ R then  ▷ add causal predicates to precondition
22:       Pr_ae ← add.predicate(Pr_ae, executed(y))
23:     end if
24:     for w in E, w ≠ e, w ≠ y do  ▷ remove transitive actions in the precondition
25:       if ∃ u := cyclic(y, e) ∧ ∃ l := cyclic(w, e) ∧ ∃ t := cyclic(y, w), u, l, t ∈ R then
26:         tmp ← get.weakest(e, y, w)  ▷ identify the weakest transitive action
27:         if tmp ≠ e then
28:           Pr_ae ← remove.predicate(Pr_ae, executed(tmp))
29:         end if
30:       end if
31:     end for
32:     if ∃ r := cyclic(y, e), r ∈ R then  ▷ add predicate for cyclic actions in the effects
33:       Ef_ae ← add.predicate(Ef_ae, ¬executed(y))
34:     end if
35:   end for
36:   Fe_ae ← add.function(Fe_ae, (number-executed-e(Z) + 1))  ▷ increase value of precondition function by 1
37:   Ef_ae ← add.predicate(Ef_ae, executed(e))  ▷ mark action as executed
38:   A ← add.op(A, (name_ae, Z_ae, Pr_ae, Fp_ae, Ef_ae, Fe_ae))
39: end for
40: return unique(A)  ▷ return all unique operators
Algorithm 1 shows the procedure for generating the operators from the situation model. We take the name $e$ from the set of actions $E$ in the situation model. Then, for each action $e$, we take the set of arguments $Z$ from the objects $O$ in the situation model that have object-verb relations to the action. The set of precondition predicates $Pr$ is generated from the set of actions that have an implicit causal relation to $e$, and from the set of identified properties related to the action or its arguments. The set of effects consists of marking the action as executed, increasing the value of the precondition function, and negating the execution of another action if the two are cyclic. Cyclic actions are actions that negate each other's effects: $a, b \in E$ are cyclic if $causes(a, b)$ and $causes(b, a)$. We denote them as $cyclic(a, b)$. For example, the execution of "put the apple on the table" negates the effect of the action "take the apple". For two operators $a$ and $b$ with a cyclic relation, we have to negate the effects of $a$ after executing $b$ and vice versa, otherwise it will not be possible to execute these actions again. Another problem that arises is transitive causal relations. We say that three actions $a$, $b$, and $c$ are transitive if for $a, b, c \in E$ it holds that $cyclic(a, b)$, $cyclic(b, c)$, and $cyclic(a, c)$. The problem here is that the preconditions and effects of these actions block the execution of at least one of the transitive actions. It no longer suffices to just negate the effects of the cyclic actions, as there is a third action influencing the execution of the two remaining actions. To solve this problem, we follow an approach similar to the one proposed in [21]. We identify any transitive relations an action has and then remove the weakest relation, ending up with only cyclic relations. We find the weakest relation by calculating the frequency of appearance of the relations in the text and removing the one with the lowest frequency. An example operator can be seen in Fig. 1.
The target language is the Planning Domain Definition Language (PDDL), which represents the operators through abstracted action templates. To generate the templates, we replace the operator's arguments with the corresponding hyperonym on level $m$ of the abstraction hierarchy and then remove any repeating abstracted operators. In that manner we control the model specificity: using hyperonyms on a higher abstraction level produces more general models, while using those on a lower abstraction level produces more specific models.
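The abstraction step can be sketched as follows. Since the text does not specify whether level m is counted from the root or from the leaves, this sketch counts from the root; abstract_argument is an illustrative helper.

# Sketch: abstract an operator argument to its hyperonym at level m of the
# hierarchy, so operators differing only in concrete objects collapse into
# a single action template.
from nltk.corpus import wordnet as wn

def abstract_argument(obj, m):
    synsets = wn.synsets(obj, pos=wn.NOUN)
    if not synsets:
        return obj
    path = synsets[0].hypernym_paths()[0]      # root ... object
    return path[min(m, len(path) - 1)].lemma_names()[0]

# arguments that share a level-2 hyperonym yield the same template
templates = {("take", abstract_argument(o, 2)) for o in ["knife", "fork", "bowl"]}
print(templates)  # the exact hyperonyms depend on WordNet's paths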
The planning model $M$ is a tuple $(P, F, L, A, Z, x_0, g)$, where $P$ is a set of predicates, $F$ is a set of functions, $L$ is the language taxonomy (or abstraction hierarchy) from the situation model, $A$ is a set of actions, $Z$ is a set of arguments, $x_0$ is the initial state, and $g$ is the set of goals. The predicates and functions build up the model states $x \in X$, where $X$ is the model state space; each state represents a unique combination of the values of all predicates and functions. The initial state $x_0$ is the state of the world before any action has been executed. To generate $x_0$, we set all the predicates identifying the execution of a cyclic action to true, add all identified properties, and set all functions to their initial values. Furthermore, we perform an analysis based on the action order in the text. We check whether the preconditions of the first action that requires enabling are initially enabled. In case some predicates cannot be enabled based on the original action order, they are set to true in the initial state description. The rest of the predicates are set to false. The goal states $g \subseteq X$ represent all the predicates that have to hold for the goal to be reached. We generate the goal states by requiring that each type of action $a \in A$ has to be executed at least once. The generated operators often have contradicting preconditions and effects. To address this problem, we use a strategy where all ground operators that have impossible preconditions, given the initial state, are removed [10]. The same applies to predicates and functions that are used only in impossible actions. This strategy removes all impossible candidate operators and predicates and returns a model that is causally correct.
[Figure 2: Median number of elements and relations incorporated in the situation models extracted from 20 instructional texts (actions, causal relations, direct object-action relations, hierarchy, indirect object relations, objects, and properties).]
4 Evaluation
To evaluate the approach, we used 20 instructional texts, from which we gener-
ated planning models. We used an extended version of PDDL [10] as a target
format for the planning models. The instructions included cooking recipes (3 instructions), texts from coffee and washing machine manuals (4 instructions), texts from wikiHow² (3 instructions), descriptions of the tasks performed in the CMU kitchen dataset³,⁴ (3 instructions), and descriptions of student exercises
for minimally invasive surgery (7 instructions).
The instructions had between 7 and 111 sentences, with a mean of 26.5 sentences and a mean sentence length between 5 and 16 words. To obtain the part of speech tags and the dependencies between words, we used the Stanford NLP parser. As state-of-the-art parsers have been shown to perform poorly at identifying events in instructional texts, we used a postprocessing step, as proposed in [27], to improve the tagging accuracy. We used WordNet [15], a taxonomy of the English language, to obtain the hyperonyms of the identified objects. As some words have different meanings, we took the most frequently used meaning for each object. To generate the PDDL action templates, we used an abstraction level of 2.
Figure 2 shows statistics for the elements extracted from the texts and incorporated in the situation models. The average number of identified actions in the instructional texts was 17.3 and the average number of objects was 10.2. A relatively small number of properties was discovered (mean of 3.55), with more properties discovered in cooking recipes, which use more unstructured language with longer sentences (maximum of 18 properties). An average of 7.3 causal relations was discovered per text, with more causal relations when the texts had more sentences (maximum of 24 relations). This is to be expected, as the time series analysis performs better with longer series. The opposite was observed when discovering semantic relations (i.e. the relations between objects, properties, and actions within the sentences): texts with longer sentences but fewer of them tended to have more semantic relations. Finally, the generated abstraction hierarchy (i.e. the hyperonyms) had between 3 and 8 levels, with an average of 5 levels.
²https://www.wikihow.com/Main-Page
³http://kitchen.cs.cmu.edu/
⁴These descriptions were generated based on the behaviour observed in the video logs.
[Figure 3: Difference between automatically discovered elements and manually discovered elements (panels: Actions, Objects, Properties) in the 20 instructional texts. Green indicates that the human annotator discovered more elements, red means that our approach discovered false positives, while yellow indicates the same number of elements.]
Correctness of the identified elements: To evaluate whether the approach is able to correctly identify objects, actions, and properties from texts, we asked a human annotator to manually identify these elements in the texts⁵. Figure 3 graphically shows the number of discovered elements and the distance between the number of manually and automatically discovered elements. It can be seen that in the majority of the cases both discovered the same number of elements⁶. Interestingly enough, our approach tended to discover more objects than the human annotator, producing false positives. This can be explained by the fact that it identified abstract concepts such as "level" or "time" as objects, while the human annotator considered only physical objects.
⁵Note that we did not compare the identified causal relations. This is because implicit causal relations are a subject of interpretation. For that reason, we consider the relations correctly identified if the model is able to explain the given plan.
⁶In the cases where the number of elements was the same for both the human annotator and our tool, the discovered elements were the same in both cases.

[Figure 4: Median number of action templates, operators, predicates, and functions (left). Median branching factor and states (right).]

Complexity of the model: Figure 4 (left) shows the median number of generated action templates and the resulting number of grounded operators, predicates, and functions after the pruning phase. A minimum of 7 and a maximum of 57 action templates were generated based on the situation model, with a mean of 19.85 templates. The templates resulted in models with 71.5
operators on average, 9.4 predicates, and 64.35 functions. We also applied iterative deepening depth-first search to analyse the state space complexity and the branching factor of the resulting models. We limited the search depth to 5, as some of the models had state spaces of hundreds of millions of states. Figure 4 (right) shows the maximum and median branching factors as well as the number of discovered states at a search depth of 5. The branching factor tells us how many states are reachable from any given state in the model. A high branching factor indicates that the probability of selecting the actually observed action will be low. The models with a small number of operators generated as few as 327 states, while some models had as many as 10 million states at search level 5 (with the search being incomplete at this level). This was also reflected in the branching factor, where a maximum branching factor of 449 was observed. On average, however, the number of reachable states from any given state was 56. This is still a very high number, but one with which plan recognition would still be feasible, especially in the presence of unambiguous observations.
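As a sketch of such a probe, assuming a hypothetical model interface with applicable(state) and apply(state, action) (the paper does not specify one), a plain depth-limited search over unique states can report the same statistics; it stands in here for the iterative deepening search.

# Sketch: count distinct states reachable within a depth limit and record the
# branching factor (number of applicable ground operators) of each visited
# state. States are assumed to be hashable.
def probe_state_space(model, initial_state, max_depth=5):
    seen = {initial_state}
    branching = []
    def dfs(state, depth):
        actions = model.applicable(state)        # hypothetical interface
        branching.append(len(actions))
        if depth == max_depth:
            return
        for action in actions:
            successor = model.apply(state, action)
            if successor not in seen:
                seen.add(successor)
                dfs(successor, depth + 1)
    dfs(initial_state, 0)
    return len(seen), max(branching), sum(branching) / len(branching)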
Model coverage: To evaluate whether a generated model is actually able to explain human behaviour, we used the CMU kitchen dataset. We analysed 15 video logs from the "brownies" dataset and, based on the observed execution sequences, we manually generated 15 plans. We then tested a model generated from a text describing the "brownies" dataset. The text was written based on the behaviour observed in the first video log. We expected that the model would be able to better explain the plan corresponding to the first log.
To evaluate the model, we first checked whether the model is able to explain the plans at all (i.e. whether the observed execution sequences are part of the model). The results showed that the model was able to explain all of the 15 plans. We then calculated the final log likelihood of the model. This likelihood tells us how well the model fits the provided observation sequence (in our case, the plan). The final log likelihood is calculated based on the cumulative probability of the actions observed in the plan, given a model $M$. This approach is similar to model learning through observations [8]. Figure 5 shows the final log likelihood of the model when explaining the 15 plans and its relation to the length of the executed plan. We fitted a linear model to the results (in blue). It showed that the likelihood of the model (i.e. how well it fits the given plan) is linearly proportional to the length of the plan.
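Assuming the same hypothetical applicable/apply interface as above and a uniform choice among the actions executable in the current state (one plausible reading of the cumulative action probability; the exact distribution is not spelled out here), the final log likelihood can be sketched as:

# Sketch: final log likelihood of a plan under a model, with each observed
# action assumed uniformly likely among the actions applicable in its state.
import math

def plan_log_likelihood(model, initial_state, plan):
    state, loglik = initial_state, 0.0
    for action in plan:
        candidates = model.applicable(state)  # hypothetical interface
        if action not in candidates:
            return float("-inf")              # the model cannot explain the plan
        loglik += math.log(1.0 / len(candidates))
        state = model.apply(state, action)
    return loglik

Under this reading, each step contributes roughly −log b for a branching factor of b, which is consistent with the observed linear relation between likelihood and plan length.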
Figure 5: Negative log likelihood of the model, given a certain plan.
Metrics           | PDDL_g       | PDDL_h1        | PDDL_h2       | PDDL_h3
operators         | 421          | 1854           | 1461          | 257
predicates        | 10           | 1853           | 424           | 48
functions         | 329          | 0              | 0             | 41
min/mean/max br.  | 1/231.19/421 | 1848/1848/1854 | 82/117.28/290 | 5/30.82/55
states (depth 5)  | 10 000 227   | 10 000 162     | 10 000 009    | 1 785 896

Table 1: Comparison between PDDL_g, PDDL_h1, PDDL_h2, and PDDL_h3.
This stands to show that the model was able to explain all plans in a similar manner (i.e. it was not overfitted to the first plan).
Similarity to handcrafted models: To investigate how a generated model compares to a handcrafted model, we asked experts to develop 3 PDDL models for the "brownies" experiment. We call the generated model PDDL_g. We compared PDDL_g to PDDL_h1, PDDL_h2, and PDDL_h3, each of which had increasing complexity in terms of constraints and domain knowledge. PDDL_h3 was overfitted to explain only the sequences in the "brownies" experiment. Table 1 shows the comparison between the handcrafted models and PDDL_g. It can be seen that the more complex the constraints and context knowledge, the more specific the model becomes and the more the search space complexity decreases. In terms of operators and predicates, PDDL_g performed most similarly to the overfitted PDDL_h3, with 1.6 times more operators than PDDL_h3. PDDL_g had a branching factor twice as high as that of PDDL_h2, but still 8 times smaller than that of PDDL_h1. This stands to show that the generated model is comparable to handcrafted models that do not encode implicit common sense knowledge or knowledge used by the system designer to reduce the model state space.
5 Conclusion and Future Work
In this work we showed first empirical results from an approach that generates PDDL models for behaviour understanding from instructional texts. The results showed that the approach is able to identify most of the relevant model elements from textual narratives; in that sense, it performed comparably to a human annotator. The approach was also able to generate a model that can explain the actual execution sequences observed in the video logs of the "brownies" dataset from the CMU kitchen activities. Finally, comparing the generated model with handcrafted models showed that it has better parameters and encodes more context knowledge than a simple handcrafted model, but is unable to capture the "common sense" knowledge that is encoded in overfitted handcrafted models. In the future, we plan to address this problem by introducing an additional learning phase, in which the generated model is further adjusted based on observations of already executed plans.
References
[1] M. Babeş-Vroman, J. MacGlashan, R. Gao, K. Winner, R. Adjogah,
M. desJardins, M. Littman, and S. Muresan. Learning to interpret nat-
ural language instructions. In Proc. Workshop on Semantic Interpretation
in an Actionable Context, pages 1–6, Stroudsburg, PA, USA, 2012.
[2] C. Baker, R. Saxe, and J. Tenenbaum. Action understanding as inverse
planning. Cognition, 113(3):329–349, 2009.
[3] L. Benotti, T. Lau, and M. Villalba. Interpreting natural language instruc-
tions using language, vision, and behavior. ACM Trans. Interact. Intell.
Syst., 4(3):13:1–13:22, Aug 2014.
[4] S. Branavan, N. Kushman, T. Lei, and R. Barzilay. Learning high-level
planning from text. In Proc. Ann. Meeting of Assoc. for Computational
Linguistics, pages 126–135, Stroudsburg, PA, USA, 2012.
[5] S. Branavan, L. Zettlemoyer, and R. Barzilay. Reading between the lines:
Learning to map high-level instructions to commands. In Proc. Ann. Meet-
ing of Assoc. for Computational Linguistics, pages 1268–1277, Stroudsburg,
PA, USA, 2010.
[6] D. Chen and R. Mooney. Learning to interpret natural language navi-
gation instructions from observations. In Proc. AAAI Conf. on Artificial
Intelligence, pages 859–865, Aug 2011.
[7] F. Diebold, K. Witman, D. Hanseman, L. Lysne, and T. Moore. Elements
of Forecasting. Cengage Learning, second edition, 2000.
[8] D. Goldwasser and D. Roth. Learning from natural instructions. Machine
Learning, 94(2):205–232, 2014.
[9] C. Granger. Investigating Causal Relations by Econometric Models and
Cross-spectral Methods. Econometrica, 37(3):424–438, Aug 1969.
[10] T. Kirste and F. Krüger. CCBM-a tool for activity recognition using com-
putational causal behavior models. Technical Report CS-01-12, Institut für
Informatik, Universität Rostock, Rostock, Germany, May 2012.
[11] T. Kollar, S. Tellex, D. Roy, and N. Roy. Grounding verbs of motion in nat-
ural language commands to robots. In Experimental Robotics, volume 79,
pages 31–47. Springer Berlin Heidelberg, 2014.
[12] F. Krüger, M. Nyolt, K. Yordanova, A. Hein, and T. Kirste. Computational
state space models for activity and intention recognition. a feasibility study.
PLoS ONE, 9(11):e109381, Nov 2014.
[13] X. Li, W. Mao, D. Zeng, and F.-Y. Wang. Automatic construction of
domain theory for attack planning. In IEEE Int. Conf. on Intelligence and
Security Informatics, pages 65–70, May 2010.
[14] A. Lindsay, J. Read, J. Ferreira, T. Hayton, J. Porteous, and P. Gregory.
Framer: Planning models from natural language action descriptions. In
Int. Conf. on Automated Planning and Scheduling, 2017.
[15] G. Miller. Wordnet: A lexical database for english. Commun. ACM,
38(11):39–41, Nov 1995.
[16] T. A. Nguyen, S. Kambhampati, and M. Do. Synthesizing robust plans
under incomplete domain models. In Advances in Neural Information Pro-
cessing Systems 26, pages 2472–2480. Curran Associates, Inc., 2013.
[17] M. Philipose, K. Fishkin, M. Perkowitz, D. Patterson, D. Fox, H. Kautz,
and D. Hahnel. Inferring activities from interactions with objects. IEEE
Pervasive Computing, 3(4):50–57, Oct 2004.
[18] M. Ramirez and H. Geffner. Goal recognition over pomdps: Inferring the
intention of a pomdp agent. In Proc. Int. J. Conf. on Artificial Intelligence,
volume 3 of IJCAI’11, pages 2009–2014, Barcelona, Spain, 2011.
[19] A. Sil and A. Yates. Extracting strips representations of actions and events.
In Recent Advances in Natural Language Processing, pages 1–8, Hissar,
Bulgaria, Sep 2011.
[20] M. Tenorth, D. Nyga, and M. Beetz. Understanding and executing instruc-
tions for everyday manipulation tasks from the world wide web. In IEEE
Int. Conf. on Robotics and Automation, pages 1486–1491, May 2010.
[21] M. Veloso, A. Perez, and J. Carbonell. Nonlinear planning with parallel
resource allocation. In Proc. DARPA Workshop of Innovative Approaches
to Planning, Scheduling and Control, Nov 1990.
[22] A. Vogel and D. Jurafsky. Learning to follow navigational directions. In
Proc. Ann. Meeting of Assoc. for Computational Linguistics, pages 806–
814, Stroudsburg, PA, USA, 2010.
[23] B. Webber, N. Badler, B. Eugenio, C. Geib, L. Levison, and M. Moore.
Instructions, intentions and expectations. Artificial Intelligence, 73(1):253
– 269, 1995.
[24] J. Ye, S. Dobson, and S. McKeever. Review: Situation identification
techniques in pervasive computing: A review. Pervasive Mob. Comput.,
8(1):36–66, Feb 2012.
[25] K. Yordanova. Discovering causal relations in textual instructions. In
Recent Advances in Natural Language Processing, pages 714–720, Hissar,
Bulgaria, Sep 2015.
[26] K. Yordanova. Automatic generation of situation models for plan recog-
nition problems. In Proceedings of the International Conference Recent
Advances in Natural Language Processing, pages 823–830, Varna, Bulgaria,
September 2017. INCOMA Ltd.
[27] K. Yordanova. A simple model for improving the performance of the stan-
ford parser for action detection in textual instructions. In Proceedings of the
International Conference Recent Advances in Natural Language Processing,
pages 831–838, Varna, Bulgaria, September 2017. INCOMA Ltd.
[28] Z. Zhang, P. Webster, V. Uren, A. Varga, and F. Ciravegna. Automati-
cally extracting procedural knowledge from instructional texts using nat-
ural language processing. In Proc. Int. Conf. on Language Resources and
Evaluation, pages 520–527, Istanbul, Turkey, May 2012.