StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

Zhengxiang Shi1, Qiang Zhang2, Aldo Lipani1
1University College London
2Zhejiang University
zhengxiang.shi.19@ucl.ac.uk, qiang.zhang.cs@zju.edu.cn, aldo.lipani@ucl.ac.uk
Abstract
Inferring spatial relations in natural language is a crucial abil-
ity an intelligent system should possess. The bAbI dataset
tries to capture tasks relevant to this domain (tasks 17 and 19).
However, these tasks have several limitations. Most impor-
tantly, they are limited to fixed expressions, they are limited
in the number of reasoning steps required to solve them, and
they fail to test the robustness of models to input that contains
irrelevant or redundant information. In this paper, we present
a new Question-Answering dataset called StepGame for ro-
bust multi-hop spatial reasoning in texts. Our experiments
demonstrate that state-of-the-art models on the bAbI dataset
struggle on the StepGame dataset. Moreover, we propose a
Tensor-Product based Memory-Augmented Neural Network
(TP-MANN) specialized for spatial reasoning tasks. Experi-
mental results on both datasets show that our model outper-
forms all the baselines with superior generalization and ro-
bustness performance.
1 Introduction
Neural networks have been successful in a wide array of per-
ceptual tasks, but it is often stated that they are incapable
of solving tasks that require higher-level reasoning (Ding
et al. 2020). Since spatial reasoning is ubiquitous in many
scenarios such as autonomous navigation (Vogel and Juraf-
sky 2010), situated dialog (Kruijff et al. 2007), and robotic
manipulation (Yang, Lan, and Narasimhan 2020; Landsiedel
et al. 2017), grounding spatial references in texts is essential
for effective human-machine communication through natu-
ral language. Navigation tasks require agents to reason about
their relative position to objects and how these relations
change as they move through the environment (Chen et al.
2019). If we want to develop conversational systems able to
assist users in solving tasks where spatial references are in-
volved, we need to make them able to understand and reason
about spatial references in natural language. Such ability can
help conversational systems to successfully follow instruc-
tions and understand spatial descriptions. However, despite
its tremendous applicability, reasoning over spatial relations
remains a challenging task for existing conversational sys-
tems.
Copyright © 2022, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Earlier works in spatial reasoning focused on spatial in-
struction understanding in a synthetic environment (Bisk
et al. 2018; Tan and Bansal 2018; Janner, Narasimhan, and
Barzilay 2018) or in a simulated world with spatial infor-
mation annotation in texts (Pustejovsky et al. 2015), spatial
relation extractions across entities (Petruck and Ellsworth
2018) and visual observations (Anderson et al. 2018; Chen
et al. 2019). However, few of the existing datasets are de-
signed to evaluate models’ inference over spatial informa-
tion in texts. A spatial relational inference task often requires
a conversational system to infer the spatial relation between
two items given a description of a scene. For example, imagine
a user asking a conversational system to recognize the
location of an entity based on the description of other enti-
ties in a scene. To do so, the conversational system needs to
be able to reason about the location of the various entities in
the scene using only textual information.
BAbI (Weston et al. 2016) is the most relevant dataset
for this task. It contains 20 synthetic question answering
(QA) tasks to test a variety of reasoning abilities in texts,
like deduction, co-reference, and counting. In particular, the
positional reasoning task (no. 17) and the path finding task
(no. 19) are designed to evaluate models’ spatial reasoning
ability. These two tasks are arguably the most challenging
ones (van Aken et al. 2019). The state-of-the-art model on
the bAbI dataset (Le, Tran, and Venkatesh 2020) almost perfectly
solves these two spatial reasoning tasks. However, in
this paper, we demonstrate that such good performance is
attributable to issues with the bAbI dataset rather than the
model inference ability.
We find four major issues with bAbI’s tasks 17 and 19: (1)
There is a data leakage between the train and test sets; that is,
most of the test set samples appear in the training set. Hence,
the evaluation results on the test set cannot truly reflect mod-
els’ reasoning ability; (2) Named entities are fixed and only
four relations are considered. Each text sample always con-
tains the same four named entities in the training, valida-
tion, and test sets. This further biases the learning models
towards these four entities. When named entities in the test
set are replaced by unseen entities or the number of such
entities increases, the model performance decreases dramat-
ically (Chen et al. 2020a). Also, relations such as top-left,
top-right, lower-left, lower-right are not taken into consider-
ation; (3) Learning models are required to reason only over
one or two sentences in the text descriptions, making such
tasks relatively simple. Palm, Paquet, and Winther (2018)
pointed out that multi-hop reasoning is not necessary for the
bAbI dataset since models only need a single step to solve
all the tasks; and (4) It is a synthetic dataset with a limited
diversity of spatial relation descriptions. It thus cannot truly
reveal the models’ ability in understanding textual space de-
scriptions.
In this paper, we propose a new dataset called StepGame
to tackle the above-mentioned issues and a novel Tensor
Product-based Memory-Augmented Neural Network archi-
tecture (TP-MANN) for multi-hop spatial reasoning in texts.
The StepGame dataset is based on crowdsourced descrip-
tions of 8 potential spatial relations between 2 entities. These
descriptions are then used as templates when generating the
dataset. To increase the diversity of these templates, crowd-
workers were asked to diversify their expressions. This was
done in order to ensure that the crowdsourced templates
cover most of the natural ways relations between two entities
can be described in text. The StepGame dataset is character-
ized by a combinatorial growth in the number of possible
descriptions of scenes, named stories, as the number of described
relations between two entities increases. This combinatorial
growth reduces the chances of leaking stories from
the training to the validation and test sets. Moreover, we use
a large number of named entities and require multi-hop rea-
soning to answer questions about two entities mentioned in
the stories. Experimental results show that existing models
(1) fail to achieve a performance on the StepGame dataset
similar to that achieved on the bAbI dataset, and (2) suffer
from a large performance drop as the number of required
reasoning steps increases.
The TP-MANN architecture is based on tensor product
representations (Smolensky 1990) that are used in a recur-
rent memory module to store, update or delete the relation
information among entities inferred from stories. This recur-
rent architecture provides three key benefits: (1) it enables
the model to make inferences based on the stored memory;
(2) it allows multi-hop reasoning and it is robust to noise;
and (3) the number of parameters remains unchanged as the
number of recurrent layers in the memory module increases.
Experimental results on the StepGame dataset show that our
model achieves state-of-the-art performance with a substan-
tial improvement, and demonstrates a better generalization
ability to more complex stories. Finally, we also conduct
some analysis of our recurrent structure and demonstrate its
importance for multi-hop reasoning.
2 Related Work and Background
2.1 Related Work
Reasoning Datasets. The role of language in spatial rea-
soning has been investigated since the 1980s (Pustejovsky
1989; Gershman and Tenenbaum 2015; Tversky 2019), and
reasoning about spatial relations has been studied in several
contexts, such as 2D and 3D navigation (Bisk et al.
2018; Tan and Bansal 2018; Janner, Narasimhan, and Barzi-
lay 2018; Yang, Lan, and Narasimhan 2020), and robotic
manipulation (Landsiedel et al. 2017). However, few of the
datasets used in these works are used to evaluate systems’
spatial reasoning ability in texts.
The bAbI (Weston et al. 2016) dataset consists of several
QA tasks. Solving these tasks requires logical reasoning steps
and cannot be done by simple word matching. Of particular
interest to this paper are tasks 17 and 19. Task 17 is
about positional reasoning while task 19 is about path find-
ing. These two tasks can be used to evaluate the spatial infer-
ence ability of learning models. However, the bAbI dataset
has several issues as mentioned above: the data leakage, the
fixed named entities and expressions, and the lack of a need
to perform multi-hop reasoning. Another relevant dataset is
SpartQA (Mirzaee et al. 2021), which is designed for spatial
reasoning over texts but only requires limited multi-hop
reasoning compared to StepGame.
Multi-Hop QA Datasets. The multi-hop QA tasks re-
quire reasoning over multiple pieces of evidence and fo-
cus on leveraging the connections between entities to in-
fer a requested property of a set of them. Commonly-
used multi-hop QA datasets are HotpotQA (Yang et al.
2018), ComplexWebQuestions (Talmor and Berant 2018),
and QAngaroo (Welbl, Stenetorp, and Riedel 2018). The
proposed StepGame dataset is different from these datasets.
The StepGame dataset focuses on spatial reasoning, which
requires machine learning models to infer the spatial rela-
tions among the described entities. Moreover, multi-hop QA
datasets usually require no more than two reasoning steps,
while the StepGame dataset can require as many as 10 rea-
soning steps.
Reasoning Models. There are three types of reasoning
models: memory-augmented neural networks, graph neu-
ral networks, and transformer-based networks. Works of
the first type augment neural networks with external mem-
ory, such as End to End Memory Networks (Sukhbaatar
et al. 2015), Differentiable Neural Computer (Graves et al.
2016), and Gated End-to-End Memory Networks (Liu and
Perez 2017). These models have shown remarkable abili-
ties in tackling difficult computational and reasoning tasks.
Works of the second type use graph structure to incorporate a
stronger relational inductive bias (Battaglia et al. 2018). San-
toro et al. (2017) introduced Relational Networks (RN) and
demonstrated strong relational reasoning capabilities with a
shallow architecture by modelling binary relations between
entity pairs. Palm, Paquet, and Winther (2018) proposed a
graph representation of objects and modelled multi-hop relational
reasoning using a message passing mechanism. Works
of the third type use transformers. Although transformers
have been proven successful in many NLP tasks, they still
struggle with reasoning tasks. van Aken et al. (2019) an-
alyzed the performance of BERT (Devlin et al. 2019) on
bAbI’s tasks and demonstrated that most of BERT’s errors
come from tasks 17 and 19, which require spatial reasoning.
Meanwhile, Dehghani et al. (2019) demonstrated that
standard transformers cannot perform as well as memory-
augmented networks on the bAbI dataset. Moreover, it is
important to note that most of the errors of their proposed
Universal Transformer also come from task 17 and task 19
of the bAbI dataset, which matches our observations on
other transformer-based models. Therefore, spatial reason-
ing tasks are arguably the most challenging tasks in the bAbI
dataset.
Tensor Product Representation. The Tensor Prod-
uct Representation (TPR) (Smolensky 1990; Schlag,
Munkhdalai, and Schmidhuber 2021) is a technique for en-
coding symbolic structural information and modelling sym-
bolic reasoning in vector spaces by learning to deconstruct
natural language statements into combinatorial representa-
tions (Chen et al. 2020b). TPR has been used for tasks that
require deductive reasoning abilities and it is able to repre-
sent entire problem statements to solve math questions in
natural language (Chen et al. 2020b) and generate natural
language captions from images (Huang et al. 2018).
Schlag and Schmidhuber (2018) proposed a gradient-
based RNN with third-order TPR (TPR-RNN), which cre-
ates a vector space embedding of complex symbolic struc-
tures by tensor products and stores these learned represen-
tations into a third-order TPR-like memory. Self-Attentive
Associative Memory (STM) (Le, Tran, and Venkatesh 2020)
utilizes a second-order item memory and a third-order TPR-
like relational memory to simulate the hippocampus, achiev-
ing state-of-the-art performance on the bAbI dataset. De-
spite a gain in the performance on bAbI compared to TPR-
RNN, STM takes a longer time to converge in practice.
Recently, Schlag, Munkhdalai, and Schmidhuber (2021)
compared a concatenated memory $M \in \mathbb{R}^{2d \times d}$ with a
third-order memory $M \in \mathbb{R}^{d^2 \times d}$, and experimental results
indicate a drop in performance when a concatenated memory is
used. However, neither STM nor TPR-RNN processes information
at the paragraph level or allows later modifications
after the first information is stored, as done in our
model. Both STM and TPR-RNN use an RNN-like archi-
tecture where each sentence in a paragraph is stored recur-
rently. This may result in a long-term dependency prob-
lem (Vaswani et al. 2017) where necessary information
would not interact with each other. To solve this issue, an ex-
plicit mechanism to update relational information between
entities at the end of each story is introduced in our model.
2.2 Background
The Tensor Product Representation (TPR) is a method to
create a vector space embedding of complex symbolic struc-
tures by tensor product. Such representation can be con-
structed as follows:
$$M = \sum_i f_i \otimes r_i = \sum_i f_i r_i^{\top} = \sum_i (f \otimes r)_{ii}, \tag{1}$$
where $M$ is the TPR, $f = (f_1, \dots, f_n)$ is a set of $n$ filler
vectors and $r = (r_1, \dots, r_n)$ is a set of $n$ role vectors. For
each role-filler vector pair, which can be considered as an
entity-relation pair, we bind (or store) them into $M$ by
performing their outer product. Then, given an unbinding role
vector $u_i$ associated with the filler vector $f_i$, $f_i$ can be
recovered by performing:
$$M u_i = \Big[ \sum_j f_j \otimes r_j \Big] u_i = \sum_j \alpha_{ij} f_j \propto f_i, \tag{2}$$
where $\alpha_{ij} \neq 0$ if and only if $i = j$. It can be proven that
the recovery is perfect if the role vectors are orthogonal to
each other. In our model, TPR-like binding and unbinding
methods are used to store information in, and retrieve it from,
the TPR $M$, which we will call memory.
Figure 1: An example of the generation of a StepGame sample
with $k = 4$ (Step 1, Step 2, Step 3).
Story:
1. J is below B.
2. B and G is side by side with B to the left and G to the right.
3. D and G are parallel, and D is on the top of G.
4. F is diagonally to the bottom right of J.
Question: What is the relation of the G to the J?
Answer: Top-right
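The binding and unbinding operations of Equations 1 and 2 can be illustrated numerically. The following is a minimal NumPy sketch rather than the authors' implementation: the sizes, the random fillers, and the choice of orthonormal role vectors (taken from a QR decomposition, with each unbinding vector set equal to its role vector) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_f, d_r = 3, 4, 5  # illustrative sizes (assumed, not from the paper)

# Filler vectors f_i, one per row.
fillers = rng.normal(size=(n, d_f))

# Orthonormal role vectors r_i: take n columns of an orthogonal matrix.
q, _ = np.linalg.qr(rng.normal(size=(d_r, d_r)))
roles = q[:, :n].T  # shape (n, d_r); rows are mutually orthonormal

# Binding (Eq. 1): M = sum_i f_i r_i^T, a (d_f, d_r) matrix.
M = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# Unbinding (Eq. 2): with u_i = r_i, the product M u_i returns f_i.
recovered = np.stack([M @ u for u in roles])

print(np.allclose(recovered, fillers))
```

With orthonormal roles every cross term vanishes, so each filler is recovered up to floating-point error. With merely linearly independent roles, the unbinding vectors would instead have to come from the dual basis.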
3 The StepGame Dataset
To design a benchmark dataset that explicitly tests mod-
els’ spatial reasoning ability and tackle the above mentioned
problems, we build a new dataset named StepGame inspired
by the spatial reasoning tasks in the bAbI dataset (Weston
et al. 2016). The StepGame is a contextual QA dataset,
where the system is required to interpret a story about sev-
eral entities expressed in natural language and answer a
question about the relative position of two of those entities.
Although this reasoning task is trivial for humans, equipping
current NLU models with such spatial ability remains
a challenge. Also, to increase the complexity of this dataset
we model several forms of distracting noise. Such noise
aims to make the task more difficult and to force machine learning
models that are trained on this dataset to be more robust
in their inference process.
3.1 Template Collection
The aim of this crowdsourcing task is to find out all pos-
sible ways we can describe the positional relationship be-
tween two entities. The crowdworkers from Amazon Me-
chanical Turk were provided with an image visually describ-
ing the spatial relations of two entities and a request to de-
scribe these entities’ relation. This crowdsourcing task was
performed in multiple runs. In the first run, we provided
crowdworkers with an image and two entities (e.g., A and
B) and they were asked to describe their positional relation.
From the data collected in this round, we then manually re-
moved bad answers, and showed the remaining good ones
as positive examples to crowdworkers in the next run. How-
ever, crowdworkers were instructed to avoid repeating them
as an answer to our request. We repeated this process until
no new templates could be collected. In total, after perform-
ing a manual generalization where templates discovered for
a relation were translated to the other relations, we collected
23 templates for left and right relations, 27 templates for top
and down relations, and 26 templates for top-left, top-right,
down-left, and down-right relations.
3.2 Data Generation
The task defined by the StepGame dataset is composed of
several story-question pairs written in natural language. In
its basic form, the story describes a set of $k$ spatial relations
among $k+1$ entities, and it is structured as a list of $k$ sentences,
each talking about 2 entities. There are $k$ relations and
$k+1$ entities because they define a chain-like shape. The
question requests the relative position of two entities among
the $k+1$ ones mentioned in the story. An answer is associated
with each story-question pair. This answer can take 9 possible
values: top-left, top-right, top, left, overlap, right, down-left,
down-right, and down, each representing a relative position.
The number of edges between the two entities in the question
($\leq k$) determines the number of hops a model has to
perform in order to get to the correct answer.
To generate a story, we follow three steps, as depicted in
Figure 1. Given a value $k$ and a set of entities $E$:
Step 1. We generate a sequence of entities by sampling
a set of $k+1$ unique entities from $E$. Then, for each pair
of adjacent entities in the sequence, a spatial relation is
sampled, giving $k$ spatial relations in total.
These spatial relations can take any of the 8 possible values:
top, down, left, right, top-left, top-right, down-left, and
down-right. Because the sampling is unconstrained, entities
can overlap with each other. This step results in a sequence
of linked entities that from now on we will call a chain.
Step 2. Two of the chain's entities are then selected at random
to be used in the question.
Step 3. From the chain generated in Step 1, we translate
the $k$ relations into $k$ sentence descriptions in natural language.
Each description is based on a randomly sampled
crowdsourced template. We then shuffle these $k$ sentences to
avoid potential distributional biases. These shuffled $k$ sentence
descriptions are called a story. From the entities selected
in Step 2, we then generate a question, also in natural
language. Finally, using the chain and the selected entities,
we infer the answer to each story-question pair.
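The three generation steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released generator: the unit-step grid coordinates, the rule that the answer is the sign of the coordinate difference, the single sentence template, and the names `OFFSETS`, `infer_answer`, and `generate_sample` are all hypothetical.

```python
import random
import string

# Unit offset (dx, dy) of the first entity relative to the second (assumed encoding).
OFFSETS = {
    "top": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0),
    "top-left": (-1, 1), "top-right": (1, 1),
    "down-left": (-1, -1), "down-right": (1, -1),
}
LABELS = {offset: name for name, offset in OFFSETS.items()}
LABELS[(0, 0)] = "overlap"


def sign(x):
    return (x > 0) - (x < 0)


def infer_answer(pos, a, b):
    """Relation of entity a to entity b from their grid coordinates."""
    dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
    return LABELS[(sign(dx), sign(dy))]


def generate_sample(k, entities, rng):
    # Step 1: sample k+1 unique entities and k relations along the chain.
    chain = rng.sample(entities, k + 1)
    relations = [rng.choice(list(OFFSETS)) for _ in range(k)]
    pos = {chain[0]: (0, 0)}
    for prev, cur, rel in zip(chain, chain[1:], relations):
        dx, dy = OFFSETS[rel]
        pos[cur] = (pos[prev][0] + dx, pos[prev][1] + dy)
    # Step 2: pick the two question entities at random.
    a, b = rng.sample(chain, 2)
    # Step 3: verbalise with one hypothetical template, then shuffle.
    story = [f"{cur} is to the {rel} of {prev}."
             for prev, cur, rel in zip(chain, chain[1:], relations)]
    rng.shuffle(story)
    question = f"What is the relation of the {a} to the {b}?"
    return story, question, infer_answer(pos, a, b)


story, question, answer = generate_sample(
    4, list(string.ascii_uppercase), random.Random(0))
```

Because relations are sampled independently, two entities can end up on the same grid cell, which is why the label set contains `overlap` in addition to the 8 sampled relations.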
Given this generation process we can quickly calculate
the complexity of the task before using the templates. This
is possible because entities can overlap. Given $k$ relations,
$k+1$ entities sampled from $E$ in any order, 8 possible
relations between pairs of entities with 2 ways of describing
them (e.g., A is on the left of B or B is on the right of
A), a random order of the $k$ sentences in the story, and a
question about 2 entities with 2 ways of describing it, the
number of examples that we can generate is equal to:
$$(k+1)! \binom{|E|}{k+1} \cdot 16^{k} \cdot \frac{k!}{2} \cdot 2 \binom{k+1}{2}. \tag{3}$$
The complexity of the dataset grows exponentially with $k$.
The StepGame dataset uses $|E| = 26$. For $k = 1$ we have
10,400 possible samples, for $k = 2$ we have more than
23 million samples, and so on. The sample complexity of
the problem guarantees that when generating the dataset the
probability of leaking samples from the training set to the
test set diminishes with the increase of $k$. Please note that
these calculations do not include templates. If we were to
consider the templates as well, the number of variations of
the StepGame would be even larger.
Figure 2 (panels: Original, Irrelevant Noise, Disconnected Noise,
Supporting Noise): On the left-hand side we have the original chain.
Orange entities are those targeted by the question. Beside it,
we show the same chain with the addition of noise. In green
we represent irrelevant, disconnected and supporting entities.
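The sample counts quoted above can be checked numerically. The sketch below reads Equation 3 as $(k+1)!\binom{|E|}{k+1} \cdot 16^{k} \cdot \frac{k!}{2} \cdot 2\binom{k+1}{2}$; the function name `num_samples` is an illustrative choice.

```python
from math import comb, factorial


def num_samples(k, num_entities=26):
    """Number of distinct samples for a given k, before templates (Eq. 3)."""
    # Ordered choice of k+1 entities out of |E|.
    entities = factorial(k + 1) * comb(num_entities, k + 1)
    # 8 relations, each with 2 ways of describing it, for each of k sentences.
    relations = 16 ** k
    # (k!/2) sentence orders times 2*C(k+1, 2) questions; the factors of 2 cancel.
    orders_and_questions = factorial(k) * comb(k + 1, 2)
    return entities * relations * orders_and_questions


for k in (1, 2):
    print(k, num_samples(k))
```

Under this reading the function returns 10,400 for $k = 1$ and 23,961,600 for $k = 2$, consistent with the figures given in the text.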
3.3 Distracting Noise
To make the StepGame more challenging we also include
noisy examples in the test set. We assume that when mod-
els trained on the non-noisy dataset make mistakes on the
noisy test set, these models have failed to learn how to in-
fer spatial relations. We generate three kinds of distracting
noise: disconnected, irrelevant, and supporting. Examples
of all kinds of noise are provided in Figure 2. The irrel-
evant noise extends the original chain by branching it out
with new entities and relations. The disconnected noise adds
to the original chain a new independent chain with new enti-
ties and relations. The supporting noise adds to the original
chain new entities and relations that may provide alternative
reasoning paths. We only add supporting noise into chains
with more than 2 entities. All kinds of noise have no im-
pact on the correct answer. The type and amount of noise
added to each chain are randomly determined. The detailed
statistics for each type of distracting noise are provided in
the Appendix.
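Two of the three noise types can be sketched on a story represented as (head, relation, tail) triples. This is a hedged illustration only: the triple representation, the helper names, and the way new entities are drawn from a pool are assumptions, and supporting noise is omitted because it additionally requires the new relations to be geometrically consistent with the original chain.

```python
import random

RELATIONS = ["top", "down", "left", "right",
             "top-left", "top-right", "down-left", "down-right"]


def story_entities(triples):
    """All entities mentioned in a list of (head, relation, tail) triples."""
    return sorted({e for head, _, tail in triples for e in (head, tail)})


def add_irrelevant(triples, unused, rng):
    """Branch one new entity off an entity that is already in the story."""
    anchor = rng.choice(story_entities(triples))
    new = unused.pop()  # consumes an entity name from the unused pool
    return triples + [(new, rng.choice(RELATIONS), anchor)]


def add_disconnected(triples, unused, rng):
    """Add a separate relation between two entities absent from the story."""
    a, b = unused.pop(), unused.pop()
    return triples + [(a, rng.choice(RELATIONS), b)]


rng = random.Random(0)
chain = [("J", "down", "B"), ("G", "right", "B")]  # original chain
pool = ["X", "Y", "Z"]                             # entities not in the chain
noisy = add_disconnected(add_irrelevant(chain, pool, rng), pool, rng)
```

In both cases the added entities are leaves or lie in a separate component, so the path between the two question entities, and hence the answer, is unchanged.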
4 The TP-MANN Model
In this section we introduce the proposed TP-MANN model,
as shown in Figure 3. The model comprises three major
components: a question and story encoder, a recurrent mem-
ory module, and a relation decoder. The encoder learns to
represent entities and relations for each sentence in a story.
The recurrent memory module learns to store entity-relation
pair representations into the memory independently. It also
updates the entity-relation pair representations based on the
current memory and stores the inferred information. The decoder
learns to represent the question and, using the information
stored in the memory, recurrently infers the spatial
relation of the two entities mentioned in the question.
It has also been shown that learned representations in the
TPR-like memory could be orthogonal (Schlag and Schmidhuber
2018). We use an example to illustrate the inspiration
behind this architecture. When a person goes back to her
hometown and sees an old tree, her happy childhood memory
of playing with her friends under that tree might be recalled.
However, this memory may not be recalled unless triggered by
the appearance of the old tree. In our model, unbinding vectors
in the decoder module play the role of the old tree in the
example, where unbinding vectors are learned based on the target
questions. The decoder module unbinds relevant memories given
a question via a recurrent mechanism. Moreover, although memories
are stored separately, there are integration processes
in brains that retrieve information via a recursive mechanism.
This allows episodes in memories to interact with each
other (Kumaran and McClelland 2012; Schapiro et al. 2017;
Koster et al. 2018).
Figure 3: The TP-MANN architecture, with its Encoder, Recurrent
Memory, and Decoder modules. PE stands for positional encoder, the
box below the symbol $E$ represents a feed-forward neural network,
the operator boxes represent the outer-product and inner-product
operators, and LN represents a layer normalization. The operator and
LN boxes implement the formulae as presented in Section 4. Lines
indicate the flow of information. Those without an arrow indicate
which symbols are taken as input and are output by their box.
Encoder. The input of the encoder is a story and a question.
We are given an input story $S = (s_1, \dots, s_m)$ with $m$ sentences
and a question $q$, both described by words in a vocabulary
$V$. Each sentence $s_i = (w_1, \dots, w_n)$ is mapped to learnable
embeddings $(\mathbf{w}_1, \dots, \mathbf{w}_n)$. Then, a positional encoding
(PE) is applied to each word embedding and the results are averaged
together, $\mathbf{s}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{w}_j \cdot p_j$, where $\{p_1, \dots, p_n\}$ are learnable
position vectors, and $\cdot$ is the element-wise product. This
operation defines $\mathbf{S} \in \mathbb{R}^{m \times d}$, where each row of $\mathbf{S}$ represents
an encoded sentence and $d$ is the dimension of a word
embedding. The input question is converted to a vector
$\mathbf{q} \in \mathbb{R}^{d}$ in the same way. For each sentence of the story in
$\mathbf{S}$, we learn entity and relation representations as:
$$E_i = f_{e_i}(\mathbf{S}), \quad i = 1, 2, \tag{4}$$
$$R_j = f_{r_j}(\mathbf{S}), \quad j = 1, 2, 3, \tag{5}$$
where $f_{e_i}$ are feed-forward neural networks that output entity
representations $E_i \in \mathbb{R}^{m \times d_e}$ and $f_{r_j}$ are feed-forward
neural networks that output relation representations
$R_j \in \mathbb{R}^{m \times d_r}$. Finally, we define three search keys $K$ as:
$$K_1 = E_1 \otimes R_1, \tag{6}$$
$$K_2 = E_1 \otimes R_2, \tag{7}$$
$$K_3 = E_2 \otimes R_3, \tag{8}$$
where $K_1, K_2, K_3 \in \mathbb{R}^{m \times d_e \times d_r}$. Keys will be used to
manipulate the memory in the next module and retrieve potential
existing associations for each entity-relation pair.
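The shapes involved in the encoder can be made concrete with a short NumPy sketch. It is not the authors' code: the sizes, the random initialisation, and the single tanh layer standing in for the feed-forward networks $f_{e_i}$ and $f_{r_j}$ are assumptions, and the outer product is taken per sentence so that each key has shape $m \times d_e \times d_r$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, d_e, d_r, vocab = 4, 6, 16, 8, 5, 30  # illustrative sizes (assumed)

embedding = rng.normal(size=(vocab, d))          # word embeddings
position = rng.normal(size=(n, d))               # position vectors p_j
story_ids = rng.integers(0, vocab, size=(m, n))  # m sentences of n token ids

# s_i = (1/n) * sum_j w_j * p_j (element-wise product, then average).
S = (embedding[story_ids] * position).mean(axis=1)  # (m, d)


def feed_forward(x, weight):
    """Single tanh layer standing in for f_e / f_r (assumed form)."""
    return np.tanh(x @ weight)


E = [feed_forward(S, rng.normal(size=(d, d_e))) for _ in range(2)]  # E_1, E_2
R = [feed_forward(S, rng.normal(size=(d, d_r))) for _ in range(3)]  # R_1..R_3


def outer(entity, relation):
    """Per-sentence outer product: (m, d_e) x (m, d_r) -> (m, d_e, d_r)."""
    return np.einsum("me,mr->mer", entity, relation)


# Eqs. 6-8: the three search keys.
K1, K2, K3 = outer(E[0], R[0]), outer(E[0], R[1]), outer(E[1], R[2])
```

Each slice `K1[i]` is the outer product of the entity and relation vectors of sentence `i`, which is what allows a key to address one entity-relation slot of the memory.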
Recurrent Memory Module. To allow stored pieces of information
to interact with each other, we use a recurrent architecture
with $T$ recurrent layers to update the TPR-like memory
representation $M \in \mathbb{R}^{d_e \times d_r \times d_e}$, where $M$ contains trainable
parameters. Through this recurrent architecture, existing
episodes stored in memory can interact with new inferences
to generate new episodes. Unlike many models, such as the
Transformer (Vaswani et al. 2017) and graph-based
models (Kipf and Welling 2017; Velickovic et al. 2018),
where adding more layers leads to a larger number
of trainable parameters, our model does not increase the
number of trainable parameters as the number of recurrent
layers increases.
At each layer $t$, given the keys $K$, we extract pseudo-entities
$P$ for each sentence in $\mathbf{S}$. In the first layer ($t = 0$),
since there is no previous information in memory
$M_0$, the model just converts each sentence in $\mathbf{S}$ into an
episode and stores it in memory ($M_1$). Then, at the later layers
($t > 0$), pseudo-entities $P$ build bridges between episodes
in the current memory $M_t$ and allow them to interact with
potential entity-relation associations:
$$P_{jt} = K_j \odot M_t, \quad j = 1, 2, 3, \tag{9}$$
where $P_{jt} \in \mathbb{R}^{m \times d_e}$ and $\odot$ denotes the inner product over
the shared $d_e \times d_r$ dimensions. We then construct the memory
episodes that need to be updated or removed. This is done
after the first storage at $t = 0$ so that all story information
is already available in $M$. These old episodes, $O_{jt} \in
\mathbb{R}^{d_e \times d_r \times d_e}$, will be updated or removed to avoid memory
conflicts that may occur when receiving new information:
$$O_{jt} = K_j \otimes P_{jt}, \quad j = 1, 2, 3. \tag{10}$$
Afterwards, new episodes, $N_1$, $N_{2t}$ and $N_3 \in \mathbb{R}^{d_e \times d_r \times d_e}$,
will be added into the memory:
$$N_1 = K_1 \otimes E_2, \tag{11}$$
$$N_{2t} = K_2 \otimes P_{1t}, \tag{12}$$
$$N_3 = K_3 \otimes E_1. \tag{13}$$
Then we apply this change to the memory by removing (subtracting)
old episodes and adding the new ones to the now
dated memory $M_t$:
$$M_{t+1} = \mathrm{LN}(M_t + N_1 + N_{2t} + N_3 - O_{1t} - O_{2t} - O_{3t}), \tag{14}$$
where LN is a layer normalization.
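One way to read Equations 9 to 14 is as a single parameter-free step function applied $T$ times. The NumPy sketch below follows that reading under explicit assumptions: the products that build episodes are summed over the $m$ sentences (as the stated shapes suggest), the layer normalization is a parameter-free normalisation over the whole tensor, the initial memory is zero, and random tensors stand in for the encoder outputs.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Parameter-free normalisation over the whole tensor (assumed form)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def bind(keys, values):
    """Outer product of (m, d_e, d_r) keys with (m, d_e) values, summed over sentences."""
    return np.einsum("mer,mf->erf", keys, values)


def memory_step(M, K, E1, E2):
    K1, K2, K3 = K
    # Eq. 9: contract each key with the memory over the (d_e, d_r) axes.
    P = [np.einsum("mer,erf->mf", Kj, M) for Kj in K]
    # Eq. 10: old episodes to be removed.
    O = [bind(Kj, Pj) for Kj, Pj in zip(K, P)]
    # Eqs. 11-13: new episodes to be added.
    N1, N2, N3 = bind(K1, E2), bind(K2, P[0]), bind(K3, E1)
    # Eq. 14: subtract the old episodes, add the new ones, normalise.
    return layer_norm(M + N1 + N2 + N3 - O[0] - O[1] - O[2])


rng = np.random.default_rng(0)
m, d_e, d_r, T = 4, 8, 5, 3  # illustrative sizes (assumed)
E1, E2 = rng.normal(size=(m, d_e)), rng.normal(size=(m, d_e))
K = [rng.normal(size=(m, d_e, d_r)) for _ in range(3)]  # stand-ins for K_1..K_3

M = np.zeros((d_e, d_r, d_e))  # empty initial memory M_0 (assumed)
for _ in range(T):
    M = memory_step(M, K, E1, E2)
```

Since `memory_step` has no layer-specific weights, increasing `T` reuses the same computation, which mirrors the claim that the parameter count does not grow with the number of recurrent layers. Starting from an empty memory, the first step reduces to storing $N_1 + N_3$, matching the description of the first layer.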
Decoder. The prediction is computed based on the constructed
memory $M$ at the last layer and a question vector $\mathbf{q}$.
To do this we follow the same procedure designed by Schlag
and Schmidhuber (2018):
$$U_j = f^{u}_{j}(\mathbf{q}), \quad j = 1, 2, 3, 4, \tag{15}$$
where $f^{u}_{1}$ is a feed-forward neural network that outputs a
$d_e$-dimensional unbinding vector, and $f^{u}_{2}, f^{u}_{3}, f^{u}_{4}$ are
feed-forward neural networks that output $d_r$-dimensional unbinding
vectors. Then, the information stored in $M$ will be
retrieved in a recurrent way based on unbinding vectors
learned from the question:
$$I_1 = \mathrm{LN}(M_T \cdot U_1) \cdot U_2, \tag{16}$$
$$I_2 = \mathrm{LN}(M_T \cdot I_1) \cdot U_3, \tag{17}$$
$$I_3 = \mathrm{LN}(M_T \cdot I_2) \cdot U_4, \tag{18}$$
$$\hat{v} = \mathrm{softmax}\Big(W_o \cdot \sum_{j=1}^{3} I_j\Big). \tag{19}$$
A linear projection with trainable parameters $W_o \in \mathbb{R}^{|V| \times d_e}$
and a softmax function are used to map the extracted information
into $\hat{v} \in \mathbb{R}^{|V|}$. Hence, the decoder module outputs a
probability distribution over the terms of the vocabulary $V$.
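The recurrent retrieval of Equations 16 to 19 can be sketched as three chained unbinding steps followed by a projection. This is an illustrative NumPy sketch, not the released model: which axes of the memory each unbinding vector contracts, the parameter-free layer normalization, and the random stand-ins for the final memory, the unbinding vectors, and $W_o$ are all assumptions.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Parameter-free normalisation over the whole tensor (assumed form)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()


def unbind(M, entity_vec, relation_vec):
    """LN(M . entity_vec) . relation_vec: (d_e, d_r, d_e) -> (d_e,)."""
    by_entity = np.einsum("erf,e->rf", M, entity_vec)  # (d_r, d_e)
    return np.einsum("rf,r->f", layer_norm(by_entity), relation_vec)


def decode(M, U, W_o):
    U1, U2, U3, U4 = U
    I1 = unbind(M, U1, U2)  # Eq. 16
    I2 = unbind(M, I1, U3)  # Eq. 17: the previous result acts as the entity query
    I3 = unbind(M, I2, U4)  # Eq. 18
    return softmax(W_o @ (I1 + I2 + I3))  # Eq. 19


rng = np.random.default_rng(0)
d_e, d_r, vocab = 8, 5, 30  # illustrative sizes (assumed)
M_T = rng.normal(size=(d_e, d_r, d_e))  # stand-in for the final memory
# Stand-ins for the outputs of f^u_1..f^u_4 applied to the question vector.
U = [rng.normal(size=d_e)] + [rng.normal(size=d_r) for _ in range(3)]
W_o = rng.normal(size=(vocab, d_e))
v_hat = decode(M_T, U, W_o)
```

Each retrieved vector is $d_e$-dimensional and is fed back as the entity-side query of the next step, which is what lets the decoder follow a chain of associations stored in the memory.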
5 Experiments and Results
In this section we aim to address the following research
questions: (RQ1) What is the degree of data leakage in the
datasets? (RQ2) How does our model behave with respect
to state-of-the-art NLU models in spatial reasoning tasks?
(RQ3) How do these models behave when tested on exam-
ples more challenging than those used for training? (RQ4)
What is the effect of the number of recurrent-layers in the
recurrent memory module? Before answering these ques-
tions, we first present the material and baselines used in
our experiments. The software and data are available at:
https://github.com/ZhengxiangShi/StepGame
5.1 Material and Baselines
In the following experiments we will use two datasets, the
bAbI dataset and the StepGame dataset. For the bAbI dataset
we only focus on task 17 and task 19 and use the original
train and test splits made of 10 000 samples for the train-
ing set and 1 000 for the validation and test sets. For the
StepGame dataset, we generate a training set made of samples
with $k$ varying from 1 to 5 in steps of 1, and a test set with
$k$ varying from 1 to 10. Moreover, the test set will also contain
distracting noise. The final dataset consists of, for each
value of $k$, 10 000 training samples, 1 000 validation samples,
and 10 000 test samples.
| Model | Task 17 | Task 19 | Mean |
|---|---|---|---|
| RN | 97.33±1.55 | 98.63±1.79 | 97.98 |
| RRN | 97.80±2.34 | 49.80±5.76 | 73.80 |
| STM | 97.80±1.06 | 99.98±0.05 | 98.89 |
| UT | 98.60±3.40 | 93.90±7.30 | 96.25 |
| TPR-RNN | 97.55±1.99 | 99.95±0.06 | 98.75 |
| Ours | 99.88±0.10 | 99.98±0.04 | 99.93 |

Table 1: Test accuracy on tasks 17 and 19 of the bAbI
dataset: Mean±Std over 5 runs.
We compare our model against five baselines: Recur-
rent Relational Networks (RRN) (Palm, Paquet, and Winther
2018), Relational Network (RN) (Santoro et al. 2017), TPR-
RNN (Schlag and Schmidhuber 2018), Self-attentive Asso-
ciative Memory (STM) (Le, Tran, and Venkatesh 2020), and
Universal Transformer (UT) (Dehghani et al. 2019). Each
model is trained and validated on each dataset independently
following the hyper-parameter ranges and procedures pro-
vided in their original papers. All training details, including
those for our model, are reported in the Appendix.
5.2 Training-Test Leakage
To answer RQ1 we have calculated the degree of data leakage
present in the bAbI and StepGame datasets. For task
17, we counted how many samples in the test set appear also
in the training set: 23.2% of the test samples are also in the
training set. For task 19, for each sample we extracted the
relevant sentences in the stories (i.e., those sentences neces-
sary to answer the question correctly) and questions. Then
we counted how many such pairs in the test set appear in
the training set: 80.2% of the pairs overlap with pairs in the
training set. For the StepGame dataset, for each sample we
extracted the sentences in the stories and questions. The sen-
tences in the story are sorted in lexicographical order. Then
we counted how many such pairs in the test set appear also
in the training set before adding distracting noise and using
the templates: 1.09% of the pairs overlap with triples in the
training set. However, such overlap is all produced by the
samples with k= 1, which due to their limited number have
a higher chance of being included in the test set. If we re-
move those examples, the overlap between training and test
sets drops to 0%.
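The overlap measurement described in this section can be sketched as follows. This is a minimal illustration, not the authors' script: the `Sample` layout, the function names `canonical_key` and `leakage_rate`, and the toy data are all assumptions made for the example. Each sample is reduced to an order-independent key (story sentences sorted lexicographically, plus the question), and the leakage rate is the fraction of test keys that also occur in the training set.

```python
from typing import Iterable, List, Tuple

# Assumed layout for illustration: (story sentences, question).
Sample = Tuple[List[str], str]


def canonical_key(sample: Sample) -> Tuple[Tuple[str, ...], str]:
    """Order-independent key: sorted story sentences plus the question."""
    story, question = sample
    return (tuple(sorted(story)), question)


def leakage_rate(train: Iterable[Sample], test: Iterable[Sample]) -> float:
    """Fraction of test samples whose key also occurs in the training set."""
    train_keys = {canonical_key(s) for s in train}
    test_list = list(test)
    if not test_list:
        return 0.0
    hits = sum(canonical_key(s) in train_keys for s in test_list)
    return hits / len(test_list)


# Toy data (hypothetical, not drawn from bAbI or StepGame).
train = [
    (["A is left of B.", "B is above C."], "Where is A relative to C?"),
    (["D is right of E."], "Where is D relative to E?"),
]
test = [
    # Same story as the first training sample, sentences reordered.
    (["B is above C.", "A is left of B."], "Where is A relative to C?"),
    (["F is below G."], "Where is F relative to G?"),
]

print(f"{leakage_rate(train, test):.1%}")  # 50.0%
```

Sorting the sentences before comparison means that two stories differing only in sentence order count as the same sample, which matches the lexicographical sorting step described above.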
5.3 Spatial Inference
To answer RQ2 and judge the spatial inference ability of our
model and the baselines we train them on the bAbI and the
StepGame datasets and compare them by measuring their
test accuracy.
In Table 1 we present the results of our model and the
baselines on tasks 17 and 19 of the bAbI dataset. The
performance of our model is slightly better than that of
the best baseline. However, due to the issues of the bAbI
dataset, these results are not enough to firmly answer RQ2.
In Table 2 we present the results for the StepGame dataset.
In this dataset, the training set is without noise but the
test set is with distracting noise. In the table we break
down the performance of the trained models across k. In the
last column we report the average performance across k. Our
model outperforms all the baseline models. Compared to
Table 1, the decreased accuracy in Table 2 demonstrates the
difficulty of spatial reasoning with distracting noise. It
is not surprising that the performance of all five baseline
models decreases when k increases, that is, when the number
of required inference hops increases. We also report test
accuracy on test sets without distracting noise in the
Appendix.
Model                                   k=1         k=2         k=3         k=4         k=5         Mean
RN (Santoro et al. 2017)                22.64±0.25  17.08±1.41  15.08±2.58  12.84±2.27  11.52±1.73  15.83
RRN (Palm, Paquet, and Winther 2018)    24.05±4.48  19.98±4.68  16.03±2.89  13.22±2.51  12.31±2.16  17.12
UT (Dehghani et al. 2019)               45.11±4.16  28.36±4.50  17.41±2.18  14.07±2.87  13.45±1.35  23.68
STM (Le, Tran, and Venkatesh 2020)      53.42±3.73  35.96±4.45  23.03±1.83  18.45±1.87  15.14±1.56  29.20
TPR-RNN (Schlag and Schmidhuber 2018)   70.29±3.03  46.03±2.24  36.14±2.66  26.82±2.64  24.77±2.75  40.81
Ours                                    85.77±3.18  60.31±2.23  50.18±2.65  37.45±4.21  31.25±3.38  52.99
Table 2: Test accuracy on the StepGame dataset: Mean±Std over 5 runs.
Model                                   k=6         k=7         k=8         k=9         k=10        Mean
RN (Santoro et al. 2017)                11.12±0.96  11.53±0.70  11.21±0.98  11.13±1.00  11.34±0.87  11.27
RRN (Palm, Paquet, and Winther 2018)    11.62±0.80  11.40±0.76  11.83±0.75  11.22±0.86  11.69±1.40  11.56
UT (Dehghani et al. 2019)               12.73±2.37  12.11±1.52  11.40±0.92  11.41±0.96  11.74±1.07  11.88
STM (Le, Tran, and Venkatesh 2020)      13.80±1.95  12.63±1.69  11.54±1.61  11.30±1.13  11.77±0.93  12.21
TPR-RNN (Schlag and Schmidhuber 2018)   22.25±3.12  19.88±2.80  15.45±2.98  13.01±2.28  12.65±2.71  16.65
Ours                                    28.53±3.59  26.45±2.95  23.67±2.78  22.52±2.36  21.46±1.72  24.53
Table 3: Test accuracy on StepGame for larger ks (only on the test set). Mean±Std over 5 runs.
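As a quick sanity check on how the last column of Table 2 relates to the per-k columns, the short sketch below recomputes it as the unweighted mean over k = 1 to 5 and checks that each model's accuracy falls as k grows. The per-k values are copied from Table 2; the variable names are illustrative only, and treating the Mean column as an unweighted average is an assumption that the numbers happen to support.

```python
from statistics import mean

# Per-k mean test accuracies (%) copied from Table 2, for k = 1..5.
table2 = {
    "RN": [22.64, 17.08, 15.08, 12.84, 11.52],
    "RRN": [24.05, 19.98, 16.03, 13.22, 12.31],
    "UT": [45.11, 28.36, 17.41, 14.07, 13.45],
    "STM": [53.42, 35.96, 23.03, 18.45, 15.14],
    "TPR-RNN": [70.29, 46.03, 36.14, 26.82, 24.77],
    "Ours": [85.77, 60.31, 50.18, 37.45, 31.25],
}

# Mean column as reported in Table 2.
reported_mean = {
    "RN": 15.83, "RRN": 17.12, "UT": 23.68,
    "STM": 29.20, "TPR-RNN": 40.81, "Ours": 52.99,
}

for model, accs in table2.items():
    avg = mean(accs)
    # True when accuracy strictly decreases from each k to the next.
    decreasing = all(a > b for a, b in zip(accs, accs[1:]))
    print(f"{model:8s} mean={avg:.2f} "
          f"reported={reported_mean[model]:.2f} decreasing_in_k={decreasing}")
```

Every recomputed mean agrees with the reported one to within rounding, and every row is strictly decreasing in k.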
5.4 Systematic Generalization
To answer RQ3 we generate new StepGame test sets with
k ∈ {6, 7, 8, 9, 10} with distracting noise. We then test
all the models jointly trained on the StepGame training set
with k ∈ {1, 2, 3, 4, 5} as in Section 5.3. We can consider
this experiment as a zero-shot learning setting for larger
values of k.
In Table 3 we present the performance of the different
models on this generalization task. Not surprisingly, the
performance of all models tends to degrade as we increase
k, with RN, RRN, UT and STM levelling off at roughly 11-12%
for k ≥ 8. These four models fail to generalize to the test
sets with higher k values, while our model is more robust
and outperforms the baseline models by a large margin. This
demonstrates the better generalization ability of our model,
which performs well on longer stories never seen during
training.
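To put a number on the margin over the baselines, the sketch below takes the mean accuracies from Table 3 and, for each k, reports the strongest baseline and the gap between it and our model in percentage points. The function name `margins_over_best_baseline` is hypothetical, and the standard deviations in Table 3 are ignored here.

```python
from typing import Dict, List, Tuple

ks = [6, 7, 8, 9, 10]

# Mean test accuracies (%) copied from Table 3.
baselines: Dict[str, List[float]] = {
    "RN": [11.12, 11.53, 11.21, 11.13, 11.34],
    "RRN": [11.62, 11.40, 11.83, 11.22, 11.69],
    "UT": [12.73, 12.11, 11.40, 11.41, 11.74],
    "STM": [13.80, 12.63, 11.54, 11.30, 11.77],
    "TPR-RNN": [22.25, 19.88, 15.45, 13.01, 12.65],
}
ours = [28.53, 26.45, 23.67, 22.52, 21.46]


def margins_over_best_baseline(
    ours: List[float], baselines: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """For each k index, return (best baseline name, ours minus best baseline)."""
    result = []
    for i, acc in enumerate(ours):
        name, best = max(
            ((m, accs[i]) for m, accs in baselines.items()),
            key=lambda pair: pair[1],
        )
        result.append((name, round(acc - best, 2)))
    return result


for k, (name, gap) in zip(ks, margins_over_best_baseline(ours, baselines)):
    print(f"k={k}: best baseline {name}, margin {gap:.2f} points")
```

On these means the strongest baseline at every k is TPR-RNN, and the margin ranges from 6.28 points at k=6 to 9.51 points at k=9.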
5.5 Inference Analysis
To answer RQ4, we conduct an analysis of the
hyper-parameter T, the number of recurrent layers in our
model. We jointly train TP-MANN on the StepGame dataset
with k between 1 and 5 and with T between 1 and 6, and
report the test accuracy broken down by k. These results
are shown in the left-hand panel of Figure 4. The test sets
with higher k benefit more from a higher number of
recurrent layers than those with lower k, indicating that
recurrent layers are critical for multi-hop reasoning. We
also analyze how the recurrent layer structure affects
systematic generalization. To do this we also test on a
StepGame test set with k between 6 and 10 with noise. These
ks are larger than the largest k used during training.
These results are shown in the right-hand panel of
Figure 4. Here we see that as T increases, the performance
of the model improves. This analysis further corroborates
that our recurrent structure supports multi-hop inference.
It is worth noting that the number of trainable parameters
in our model remains unchanged as T increases.
Interestingly, we find that the number of recurrent layers
needed to solve the task is less than the length of the
stories k, suggesting that the inference process may happen
in parallel.
Figure 4: Analysis of TP-MANN's number of recurrent layers
(T). The x-axis is the T with which the model has been
trained. Each line represents a different value of k of the
StepGame dataset.
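The remark that the parameter count stays fixed as T grows follows from reusing the same weights at every recurrent step. The sketch below illustrates only that weight-sharing idea with a toy update h <- tanh(W h + b). It is not the TP-MANN update rule, and the class name `SharedRecurrentBlock`, the dimension, and the initialization are assumptions made for the example.

```python
import math
import random
from typing import List

Vector = List[float]


class SharedRecurrentBlock:
    """Toy recurrent block: the same W and b are reused at every step."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                  for _ in range(dim)]
        self.b = [0.0] * dim

    def num_parameters(self) -> int:
        """Number of trainable scalars; does not depend on T."""
        return sum(len(row) for row in self.W) + len(self.b)

    def step(self, h: Vector) -> Vector:
        """One recurrent step: h <- tanh(W h + b)."""
        return [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(self.W, self.b)
        ]

    def forward(self, h: Vector, T: int) -> Vector:
        """Apply the shared step T times."""
        for _ in range(T):
            h = self.step(h)
        return h


block = SharedRecurrentBlock(dim=4)
h0 = [1.0, 0.0, -1.0, 0.5]
for T in range(1, 7):
    out = block.forward(h0, T)
    print(f"T={T} params={block.num_parameters()} "
          f"out={[round(v, 3) for v in out]}")
```

The output changes with T while `num_parameters()` stays at 20, since deeper recurrence adds computation but no new weights.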
6 Conclusion
In this paper, we proposed a new dataset named StepGame
that requires a robust multi-hop spatial reasoning ability to
be solved and mitigates the issues observed in the bAbI
dataset. Then, we introduced TP-MANN, a tensor product-
based memory-augmented neural network architecture that
achieves state-of-the-art performance on both datasets. Fur-
ther analysis also demonstrated the importance of a recurrent
memory module for multi-hop reasoning.
References
Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.;
Sünderhauf, N.; Reid, I. D.; Gould, S.; and van den Hengel,
A. 2018. Vision-and-Language Navigation: Interpreting
Visually-Grounded Navigation Instructions in Real Environ-
ments. In 2018 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA,
June 18-22, 2018, 3674–3683. IEEE Computer Society.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-
Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.;
Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational
inductive biases, deep learning, and graph networks. arXiv
preprint arXiv:1806.01261.
Bisk, Y.; Shih, K. J.; Choi, Y.; and Marcu, D. 2018. Learning
Interpretable Spatial Operations in a Rich 3D Blocks World.
In Proceedings of the Thirty-Second AAAI Conference on
Artificial Intelligence, (AAAI-18), the 30th innovative Ap-
plications of Artificial Intelligence (IAAI-18), and the 8th
AAAI Symposium on Educational Advances in Artificial In-
telligence (EAAI-18), New Orleans, Louisiana, USA, Febru-
ary 2-7, 2018. AAAI Press.
Chen, C.-H.; Fu, Y.-F.; Cheng, H.-H.; and Lin, S.-D. 2020a.
Unseen Filler Generalization In Attention-based Natural
Language Reasoning Models. In 2020 IEEE Second In-
ternational Conference on Cognitive Machine Intelligence
(CogMI), 42–51. IEEE.
Chen, H.; Suhr, A.; Misra, D.; Snavely, N.; and Artzi, Y.
2019. TOUCHDOWN: Natural Language Navigation and
Spatial Reasoning in Visual Street Environments. In IEEE
Conference on Computer Vision and Pattern Recognition,
CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Com-
puter Vision Foundation / IEEE.
Chen, K.; Huang, Q.; Palangi, H.; Smolensky, P.; Forbus,
K. D.; and Gao, J. 2020b. Mapping natural-language prob-
lems to formal-language solutions using structured neural
representations. In Proceedings of the 37th International
Conference on Machine Learning, ICML 2020, 13-18 July
2020, Virtual Event, Proceedings of Machine Learning Re-
search. PMLR.
Dehghani, M.; Gouws, S.; Vinyals, O.; Uszkoreit, J.; and
Kaiser, L. 2019. Universal Transformers. In 7th Interna-
tional Conference on Learning Representations, ICLR 2019,
New Orleans, LA, USA, May 6-9, 2019.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019.
BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers). Association for
Computational Linguistics.
Ding, D.; Hill, F.; Santoro, A.; and Botvinick, M. 2020.
Object-based attention for spatio-temporal reasoning: Out-
performing neuro-symbolic models with flexible distributed
architectures. arXiv preprint arXiv:2012.08508.
Gershman, S.; and Tenenbaum, J. B. 2015. Phrase similar-
ity in humans and machines. In Proceedings of the 37th
Annual Meeting of the Cognitive Science Society, CogSci
2015, Pasadena, California, USA, July 22-25, 2015. cogni-
tivesciencesociety.org.
Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka,
I.; Grabska-Barwińska, A.; Colmenarejo, S. G.; Grefenstette,
E.; Ramalho, T.; Agapiou, J.; et al. 2016. Hybrid
computing using a neural network with dynamic external
memory. Nature.
Huang, Q.; Smolensky, P.; He, X.; Deng, L.; and Wu, D.
2018. Tensor Product Generation Networks for Deep NLP
Modeling. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume
1 (Long Papers). New Orleans, Louisiana: Association for
Computational Linguistics.
Janner, M.; Narasimhan, K.; and Barzilay, R. 2018. Repre-
sentation Learning for Grounded Spatial Reasoning. Trans-
actions of the Association for Computational Linguistics.
Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Clas-
sification with Graph Convolutional Networks. In 5th In-
ternational Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings.
Koster, R.; Chadwick, M. J.; Chen, Y.; Berron, D.; Banino,
A.; Düzel, E.; Hassabis, D.; and Kumaran, D. 2018. Big-loop
recurrence within the hippocampal system supports
integration of information across episodes. Neuron, 99(6):
1342–1354.
Kruijff, G.-J. M.; Zender, H.; Jensfelt, P.; and Christensen,
H. I. 2007. Situated dialogue and spatial organization: What,
where... and why? International Journal of Advanced
Robotic Systems.
Kumaran, D.; and McClelland, J. L. 2012. Generalization
through the recurrent interaction of episodic memories: a
model of the hippocampal system. Psychological review.
Landsiedel, C.; Rieser, V.; Walter, M.; and Wollherr, D.
2017. A review of spatial reasoning and interaction for real-
world robotics. Advanced Robotics.
Le, H.; Tran, T.; and Venkatesh, S. 2020. Self-Attentive As-
sociative Memory. In Proceedings of the 37th International
Conference on Machine Learning, ICML 2020, 13-18 July
2020, Virtual Event, volume 119 of Proceedings of Machine
Learning Research, 5682–5691. PMLR.
Liu, F.; and Perez, J. 2017. Gated End-to-End Memory Net-
works. In Proceedings of the 15th Conference of the Euro-
pean Chapter of the Association for Computational Linguis-
tics: Volume 1, Long Papers. Valencia, Spain: Association
for Computational Linguistics.
Mirzaee, R.; Faghihi, H. R.; Ning, Q.; and Kordjamshidi, P.
2021. SPARTQA: A Textual Question Answering Bench-
mark for Spatial Reasoning. In Proceedings of the 2021
Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Tech-
nologies, 4582–4598.
Palm, R. B.; Paquet, U.; and Winther, O. 2018. Recur-
rent Relational Networks. In Bengio, S.; Wallach, H. M.;
Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Gar-
nett, R., eds., Advances in Neural Information Processing
Systems 31: Annual Conference on Neural Information
Processing Systems 2018, NeurIPS 2018, December 3-8, 2018,
Montréal, Canada, 3372–3382.
Petruck, M. R. L.; and Ellsworth, M. J. 2018. Represent-
ing Spatial Relations in FrameNet. In Proceedings of the
First International Workshop on Spatial Language Under-
standing, 41–45. New Orleans: Association for Computa-
tional Linguistics.
Pustejovsky, J. 1989. Language and Spatial Cognition. Com-
putational Linguistics, 15(3).
Pustejovsky, J.; Kordjamshidi, P.; Moens, M.-F.; Levine, A.;
Dworman, S.; and Yocum, Z. 2015. SemEval-2015 Task 8:
SpaceEval. In Proceedings of the 9th International Work-
shop on Semantic Evaluation (SemEval 2015). Denver, Col-
orado: Association for Computational Linguistics.
Santoro, A.; Raposo, D.; Barrett, D. G. T.; Malinowski, M.;
Pascanu, R.; Battaglia, P. W.; and Lillicrap, T. 2017. A
simple neural network module for relational reasoning. In
Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.;
Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds.,
Advances in Neural Information Processing Systems 30: An-
nual Conference on Neural Information Processing Systems
2017, December 4-9, 2017, Long Beach, CA, USA, 4967–
4976.
Schapiro, A. C.; Turk-Browne, N. B.; Botvinick, M. M.;
and Norman, K. A. 2017. Complementary learning systems
within the hippocampus: a neural network modelling ap-
proach to reconciling episodic memory with statistical learn-
ing. Philosophical Transactions of the Royal Society B: Bi-
ological Sciences.
Schlag, I.; Munkhdalai, T.; and Schmidhuber, J. 2021.
Learning Associative Inference Using Fast Weight Memory.
In 9th International Conference on Learning Representa-
tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
Schlag, I.; and Schmidhuber, J. 2018. Learning to Reason
with Third Order Tensor Products. In Bengio, S.; Wallach,
H. M.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and
Garnett, R., eds., Advances in Neural Information Process-
ing Systems 31: Annual Conference on Neural Information
Processing Systems 2018, NeurIPS 2018, December 3-8,
2018, Montréal, Canada, 10003–10014.
Smolensky, P. 1990. Tensor product variable binding and the
representation of symbolic structures in connectionist sys-
tems. Artificial intelligence, 46(1-2): 159–216.
Sukhbaatar, S.; Szlam, A.; Weston, J.; and Fergus, R. 2015.
End-To-End Memory Networks. In Advances in Neural In-
formation Processing Systems 28: Annual Conference on
Neural Information Processing Systems 2015, December 7-
12, 2015, Montreal, Quebec, Canada, 2440–2448.
Talmor, A.; and Berant, J. 2018. The Web as a Knowledge-
Base for Answering Complex Questions. In Walker, M. A.;
Ji, H.; and Stent, A., eds., Proceedings of the 2018 Confer-
ence of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies,
NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-
6, 2018, Volume 1 (Long Papers), 641–651. Association for
Computational Linguistics.
Tan, H.; and Bansal, M. 2018. Source-Target Inference
Models for Spatial Instruction Understanding. In McIl-
raith, S. A.; and Weinberger, K. Q., eds., Proceedings of the
Thirty-Second AAAI Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Artificial In-
telligence (IAAI-18), and the 8th AAAI Symposium on Edu-
cational Advances in Artificial Intelligence (EAAI-18), New
Orleans, Louisiana, USA, February 2-7, 2018, 5504–5511.
AAAI Press.
Tversky, B. 2019. Mind in motion: How action shapes
thought. Hachette UK.
van Aken, B.; Winter, B.; Löser, A.; and Gers, F. A. 2019.
How Does BERT Answer Questions?: A Layer-Wise Anal-
ysis of Transformer Representations. In Proceedings of
the 28th ACM International Conference on Information and
Knowledge Management, CIKM.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. At-
tention is All you Need. In Advances in Neural Information
Processing Systems 30: Annual Conference on Neural Infor-
mation Processing Systems.
Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò,
P.; and Bengio, Y. 2018. Graph Attention Networks. In
6th International Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018,
Conference Track Proceedings. OpenReview.net.
Vogel, A.; and Jurafsky, D. 2010. Learning to Follow Nav-
igational Directions. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics,
806–814. Uppsala, Sweden: Association for Computational
Linguistics.
Welbl, J.; Stenetorp, P.; and Riedel, S. 2018. Constructing
datasets for multi-hop reading comprehension across docu-
ments. Transactions of the Association for Computational
Linguistics, 6: 287–302.
Weston, J.; Bordes, A.; Chopra, S.; and Mikolov, T. 2016.
Towards AI-Complete Question Answering: A Set of Pre-
requisite Toy Tasks. In Bengio, Y.; and LeCun, Y., eds.,
4th International Conference on Learning Representations,
ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Confer-
ence Track Proceedings.
Yang, T.-Y.; Lan, A.; and Narasimhan, K. 2020. Robust and
Interpretable Grounding of Spatial References with Relation
Networks. In Findings of the Association for Computational
Linguistics: EMNLP 2020, 1908–1923. Online: Association
for Computational Linguistics.
Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W. W.;
Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA:
A Dataset for Diverse, Explainable Multi-hop Question An-
swering. In Proceedings of the 2018 Conference on Em-
pirical Methods in Natural Language Processing, Brussels,
Belgium, October 31 - November 4, 2018, 2369–2380. As-
sociation for Computational Linguistics.