StepGame: A New Benchmark for
Robust Multi-Hop Spatial Reasoning in Texts
Zhengxiang Shi1, Qiang Zhang2, Aldo Lipani1
1University College London
2Zhejiang University
zhengxiang.shi.19@ucl.ac.uk, qiang.zhang.cs@zju.edu.cn, aldo.lipani@ucl.ac.uk
Abstract
Inferring spatial relations in natural language is a crucial abil-
ity an intelligent system should possess. The bAbI dataset
tries to capture tasks relevant to this domain (task 17 and 19).
However, these tasks have several limitations. Most impor-
tantly, they are limited to fixed expressions, they are limited
in the number of reasoning steps required to solve them, and
they fail to test the robustness of models to input that contains
irrelevant or redundant information. In this paper, we present
a new Question-Answering dataset called StepGame for ro-
bust multi-hop spatial reasoning in texts. Our experiments
demonstrate that state-of-the-art models on the bAbI dataset
struggle on the StepGame dataset. Moreover, we propose a
Tensor-Product based Memory-Augmented Neural Network
(TP-MANN) specialized for spatial reasoning tasks. Experi-
mental results on both datasets show that our model outper-
forms all the baselines with superior generalization and ro-
bustness performance.
1 Introduction
Neural networks have been successful in a wide array of per-
ceptual tasks, but it is often stated that they are incapable
of solving tasks that require higher-level reasoning (Ding
et al. 2020). Since spatial reasoning is ubiquitous in many
scenarios such as autonomous navigation (Vogel and Juraf-
sky 2010), situated dialog (Kruijff et al. 2007), and robotic
manipulation (Yang, Lan, and Narasimhan 2020; Landsiedel
et al. 2017), grounding spatial references in texts is essential
for effective human-machine communication through natu-
ral language. Navigation tasks require agents to reason about
their relative position to objects and how these relations
change as they move through the environment (Chen et al.
2019). If we want to develop conversational systems able to
assist users in solving tasks where spatial references are in-
volved, we need to make them able to understand and reason
about spatial references in natural language. Such ability can
help conversational systems to successfully follow instruc-
tions and understand spatial descriptions. However, despite
its tremendous applicability, reasoning over spatial relations
remains a challenging task for existing conversational sys-
tems.
Earlier works in spatial reasoning focused on spatial in-
struction understanding in a synthetic environment (Bisk
et al. 2018; Tan and Bansal 2018; Janner, Narasimhan, and
Barzilay 2018) or in a simulated world with spatial infor-
mation annotation in texts (Pustejovsky et al. 2015), spatial
relation extractions across entities (Petruck and Ellsworth
2018) and visual observations (Anderson et al. 2018; Chen
et al. 2019). However, few of the existing datasets are de-
signed to evaluate models’ inference over spatial informa-
tion in texts. A spatial relational inference task often requires
a conversational system to infer the spatial relation between
two items given a description of a scene. For example, imag-
ine a user asking a conversational system to recognize the
location of an entity based on the description of other enti-
ties in a scene. To do so, the conversational system needs to
be able to reason about the location of the various entities in
the scene using only textual information.
The bAbI dataset (Weston et al. 2016) is the most relevant
for this task. It contains 20 synthetic question answering
(QA) tasks to test a variety of reasoning abilities in texts,
like deduction, co-reference, and counting. In particular, the
positional reasoning task (no. 17) and the path finding task
(no. 19) are designed to evaluate models’ spatial reasoning
ability. These two tasks are arguably the most challenging
ones (van Aken et al. 2019). The state-of-the-art model on
the bAbI dataset (Le, Tran, and Venkatesh 2020) almost per-
fectly solves these two spatial reasoning tasks. However, in
this paper, we demonstrate that such good performance is
attributable to issues with the bAbI dataset rather than the
model inference ability.
We find four major issues with bAbI’s tasks 17 and 19: (1)
There is a data leakage between the train and test sets; that is,
most of the test set samples appear in the training set. Hence,
the evaluation results on the test set cannot truly reflect mod-
els’ reasoning ability; (2) Named entities are fixed and only
four relations are considered. Each text sample always con-
tains the same four named entities in the training, valida-
tion, and test sets. This further biases the learning models
towards these four entities. When named entities in the test
set are replaced by unseen entities or the number of such
entities increases, the model performance decreases dramat-
ically (Chen et al. 2020a). Also, relations such as top-left,
top-right, lower-left, lower-right are not taken into consider-
ation; (3) Learning models are required to reason only over
one or two sentences in the text descriptions, making such
tasks relatively simple. Palm, Paquet, and Winther (2018)
pointed out that multi-hop reasoning is not necessary for the
bAbI dataset since models only need a single step to solve
all the tasks, and; (4) It is a synthetic dataset with a limited
diversity of spatial relation descriptions. It thus cannot truly
reveal a model's ability to understand textual descriptions
of space.
In this paper, we propose a new dataset called StepGame
to tackle the above-mentioned issues and a novel Tensor
Product-based Memory-Augmented Neural Network archi-
tecture (TP-MANN) for multi-hop spatial reasoning in texts.
The StepGame dataset is based on crowdsourced descrip-
tions of 8 potential spatial relations between 2 entities. These
descriptions are then used as templates when generating the
dataset. To increase the diversity of these templates, crowd-
workers were asked to diversify their expressions. This was
done in order to ensure that the crowdsourced templates
cover most of the natural ways relations between two entities
can be described in text. The StepGame dataset is character-
ized by a combinatorial growth in the number of possible
descriptions of scenes, named stories, as the number of de-
scribed relations between two entities increases. This com-
binatorial growth reduces the chances to leak stories from
the training to the validation and test sets. Moreover, we use
a large number of named entities and require multi-hop rea-
soning to answer questions about two entities mentioned in
the stories. Experimental results show that existing models
(1) fail to achieve a performance on the StepGame dataset
similar to that achieved on the bAbI dataset, and (2) suffer
from a large performance drop as the number of required
reasoning steps increases.
The TP-MANN architecture is based on tensor product
representations (Smolensky 1990) that are used in a recur-
rent memory module to store, update or delete the relation
information among entities inferred from stories. This recur-
rent architecture provides three key benefits: (1) it enables
the model to make inferences based on the stored memory;
(2) it allows multi-hop reasoning and it is robust to noise,
and; (3) the number of parameters remains unchanged as the
number of recurrent layers in the memory module increases.
Experimental results on the StepGame dataset show that our
model achieves state-of-the-art performance with a substan-
tial improvement, and demonstrates a better generalization
ability to more complex stories. Finally, we also conduct
some analysis of our recurrent structure and demonstrate its
importance for multi-hop reasoning.
2 Related Work and Background
2.1 Related Work
Reasoning Datasets. The role of language in spatial rea-
soning has been investigated since the 1980s (Pustejovsky
1989; Gershman and Tenenbaum 2015; Tversky 2019), and
reasoning about spatial relations has been studied in sev-
eral contexts such as 2D and 3D navigation (Bisk et al.
2018; Tan and Bansal 2018; Janner, Narasimhan, and Barzi-
lay 2018; Yang, Lan, and Narasimhan 2020), and robotic
manipulation (Landsiedel et al. 2017). However, few of the
datasets used in these works are used to evaluate systems’
spatial reasoning ability in texts.
The bAbI (Weston et al. 2016) dataset consists of several
QA tasks. Solving these tasks requires logical reasoning
steps and cannot be done by simple word matching. Of partic-
ular interest to this paper are tasks 17 and 19. Task 17 is
about positional reasoning while task 19 is about path find-
ing. These two tasks can be used to evaluate the spatial infer-
ence ability of learning models. However, the bAbI dataset
has several issues as mentioned above: the data leakage, the
fixed named entities and expressions, and the lack of a need
to perform multi-hop reasoning. Another relevant dataset is
SpartQA (Mirzaee et al. 2021), which is designed for spatial
reasoning over texts but only requires limited multi-hop
reasoning compared to StepGame.
Multi-Hop QA Datasets. The multi-hop QA tasks re-
quire reasoning over multiple pieces of evidence and fo-
cus on leveraging the connections between entities to in-
fer a requested property of a set of them. Commonly-
used multi-hop QA datasets are HotpotQA (Yang et al.
2018), ComplexWebQuestions (Talmor and Berant 2018),
and QAngaroo (Welbl, Stenetorp, and Riedel 2018). The
proposed StepGame dataset is different from these datasets.
The StepGame dataset focuses on spatial reasoning, which
requires machine learning models to infer the spatial rela-
tions among the described entities. Moreover, multi-hop QA
datasets usually require no more than two reasoning steps,
while the StepGame dataset can require as many as 10 rea-
soning steps.
Reasoning Models. There are three types of reasoning
models: memory-augmented neural networks, graph neu-
ral networks, and transformer-based networks. Works of
the first type augment neural networks with external mem-
ory, such as End to End Memory Networks (Sukhbaatar
et al. 2015), Differential Neural Computer (Graves et al.
2016), and Gated End-to-End Memory Networks (Liu and
Perez 2017). These models have shown remarkable abili-
ties in tackling difficult computational and reasoning tasks.
Works of the second type use graph structure to incorporate a
stronger relational inductive bias (Battaglia et al. 2018). San-
toro et al. (2017) introduced Relational Networks (RN) and
demonstrated strong relational reasoning capabilities with a
shallow architecture by modelling binary relations between
entity pairs. Palm, Paquet, and Winther (2018) proposed a
graph representation of objects and modelled multi-hop rela-
tional reasoning using a message passing mechanism. Works
of the third type use transformers. Although transformers
have been proven successful in many NLP tasks, they still
struggle with reasoning tasks. van Aken et al. (2019) an-
alyzed the performance of BERT (Devlin et al. 2019) on
bAbI’s tasks and demonstrated that most of BERT’s errors
come from tasks 17 and 19, which require spatial reason-
ing. Meanwhile, Dehghani et al. (2019) demonstrated that
standard transformers cannot perform as well as memory-
augmented networks on the bAbI dataset. Moreover, it is
important to note that most of the errors of their proposed
Universal Transformer also come from tasks 17 and 19
of the bAbI dataset, which matches our observations on
other transformer-based models. Therefore, spatial reason-
ing tasks are arguably the most challenging tasks in the bAbI
dataset.
Tensor Product Representation. The Tensor Prod-
uct Representation (TPR) (Smolensky 1990; Schlag,
Munkhdalai, and Schmidhuber 2021) is a technique for en-
coding symbolic structural information and modelling sym-
bolic reasoning in vector spaces by learning to deconstruct
natural language statements into combinatorial representa-
tions (Chen et al. 2020b). TPR has been used for tasks that
require deductive reasoning abilities and it is able to repre-
sent entire problem statements to solve math questions in
natural language (Chen et al. 2020b) and generate natural
language captions from images (Huang et al. 2018).
Schlag and Schmidhuber (2018) proposed a gradient-
based RNN with third-order TPR (TPR-RNN), which cre-
ates a vector space embedding of complex symbolic struc-
tures by tensor products and stores these learned represen-
tations into a third-order TPR-like memory. Self-Attentive
Associative Memory (STM) (Le, Tran, and Venkatesh 2020)
utilizes a second-order item memory and a third-order TPR-
like relational memory to simulate the hippocampus, achiev-
ing state-of-the-art performance on the bAbI dataset. De-
spite a gain in the performance on bAbI compared to TPR-
RNN, STM takes a longer time to converge in practice.
Recently, Schlag, Munkhdalai, and Schmidhuber (2021)
compared a concatenated memory $M \in \mathbb{R}^{2d \times d}$ with a third-order memory $M \in \mathbb{R}^{d^2 \times d}$, and experimental results indicate a drop in performance when a concatenated memory is used. However, neither STM nor TPR-RNN processes information at the paragraph level or allows later modifications after the first information is stored, as done in our
model. Both STM and TPR-RNN use an RNN-like archi-
tecture where each sentence in a paragraph is stored recur-
rently. This may result in a long-term dependency problem (Vaswani et al. 2017) where necessary pieces of information would not interact with each other. To solve this issue, an ex-
plicit mechanism to update relational information between
entities at the end of each story is introduced in our model.
2.2 Background
The Tensor Product Representation (TPR) is a method to create a vector space embedding of complex symbolic structures via the tensor product. Such a representation can be constructed as follows:

$$M = \sum_i f_i \otimes r_i = \sum_i f_i r_i^\top, \quad (1)$$

where $M$ is the TPR, $f = (f_1, \ldots, f_n)$ is a set of $n$ filler vectors, and $r = (r_1, \ldots, r_n)$ is a set of $n$ role vectors. For each role-filler vector pair, which can be considered as an entity-relation pair, we bind (or store) them into $M$ by performing their outer product. Then, given an unbinding role vector $u_i$, associated with the filler vector $f_i$, $f_i$ can be recovered by performing:

$$M u_i = \Big[\sum_j f_j \otimes r_j\Big] u_i = \sum_j \alpha_{ji} f_j \propto f_i, \quad (2)$$
Figure 1: An example of the generation of a StepGame sample with k = 4. Story: (1) J is below B. (2) B and G is side by side with B to the left and G to the right. (3) D and G are parallel, and D is on the top of G. (4) F is diagonally to the bottom right of J. Question: What is the relation of the G to the J? Answer: Top-right.
where $\alpha_{ji} \neq 0$ if and only if $j = i$. It can be proven that the recovery is perfect if the role vectors are orthogonal to each other. In our model, TPR-like binding and unbinding methods are used to store information in, and retrieve information from, the TPR $M$, which we will call memory.
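As a concrete illustration of Equations 1 and 2, the following is a minimal NumPy sketch of binding and unbinding; the dimensions and the explicit construction of orthonormal role vectors are illustrative assumptions of ours, not the model's actual configuration.

```python
# A minimal sketch of TPR binding (Eq. 1) and unbinding (Eq. 2);
# sizes are illustrative (assumption), and roles are made orthonormal
# so that recovery is exact.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # vector dimension (illustrative)
fillers = rng.standard_normal((3, d))     # filler vectors f_1, f_2, f_3
# Rows of an orthogonal matrix give orthonormal role vectors r_1, r_2, r_3.
roles = np.linalg.qr(rng.standard_normal((d, d)))[0][:3]

# Binding (Eq. 1): M = sum_i f_i (outer product) r_i.
M = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# Unbinding (Eq. 2): using u = r_2 recovers f_2 exactly, since the
# contributions of the other role-filler pairs vanish (r_j . r_2 = 0).
u = roles[1]
assert np.allclose(M @ u, fillers[1])
```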
3 The StepGame Dataset
To design a benchmark dataset that explicitly tests mod-
els’ spatial reasoning ability and tackle the above mentioned
problems, we build a new dataset named StepGame inspired
by the spatial reasoning tasks in the bAbI dataset (Weston
et al. 2016). The StepGame is a contextual QA dataset,
where the system is required to interpret a story about sev-
eral entities expressed in natural language and answer a
question about the relative position of two of those entities.
Although this reasoning task is trivial for humans, equipping current NLU models with such a spatial ability remains a challenge. Also, to increase the complexity of this dataset we model several forms of distracting noise. Such noise aims to make the task more difficult and force machine learning models trained on this dataset to be more robust in their inference process.
3.1 Template Collection
The aim of this crowdsourcing task is to find out all pos-
sible ways we can describe the positional relationship be-
tween two entities. The crowdworkers from Amazon Me-
chanical Turk were provided with an image visually describ-
ing the spatial relations of two entities and a request to de-
scribe these entities’ relation. This crowdsourcing task was
performed in multiple runs. In the first run, we provided
crowdworkers with an image and two entities (e.g., A and
B) and they were asked to describe their positional relation.
From the data collected in this round, we then manually removed bad answers and showed the remaining good ones as positive examples to crowdworkers in the next run; however, crowdworkers were instructed to avoid repeating them as an answer to our request. We repeated this process until
no new templates could be collected. In total, after perform-
ing a manual generalization where templates discovered for
a relation were translated to the other relations, we collected
23 templates for left and right relations, 27 templates for top
and down relations, and 26 templates for top-left, top-right,
down-left, and down-right relations.
3.2 Data Generation
The task defined by the StepGame dataset is composed of several story-question pairs written in natural language. In its basic form, the story describes a set of $k$ spatial relations among $k+1$ entities, and it is structured as a list of $k$ sentences, each talking about 2 entities. The relations are $k$ and the entities $k+1$ because they define a chain-like shape. The question requests the relative position of two entities among the $k+1$ mentioned in the story. To each story-question pair an answer is associated. This answer can take 9 possible values: top-left, top-right, top, left, overlap, right, down-left, down-right, and down, each representing a relative position. The number of edges between the two entities in the question ($\leq k$) determines the number of hops a model has to perform in order to get to the correct answer.
To generate a story, we follow three steps, as depicted in Figure 1. Given a value $k$ and a set of entities $E$:

Step 1. We generate a sequence of entities by sampling a set of $k+1$ unique entities from $E$. Then, for each pair of adjacent entities in the sequence, a spatial relation is sampled, giving $k$ relations in total. These spatial relations can take any of the 8 possible values: top, down, left, right, top-left, top-right, down-left, and down-right. Because the sampling is unconstrained, entities can overlap with each other. This step results in a sequence of linked entities that from now on we will call a chain.
Step 2. Two of the chain’s entities are then selected at ran-
dom to be used in the question.
Step 3. From the chain generated in Step 1, we translate the $k$ relations into $k$ sentence descriptions in natural language. Each description is based on a randomly sampled crowdsourced template. We then shuffle these $k$ sentences to avoid potential distributional biases. These shuffled $k$ sentence descriptions are called a story. From the entities selected in Step 2, we then generate a question, also in natural language. Finally, using the chain and the selected entities, we infer the answer to each story-question pair.
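As a sketch of how these steps might look in code, the following Python snippet makes several simplifying assumptions of ours: single-letter entities, one fixed phrase in place of a randomly sampled crowdsourced template, and a unit-offset grid that we introduce to infer the answer by walking the chain.

```python
# A minimal sketch of the three-step generation (our simplification, not
# the released generator). "h is r of t" means pos[h] = pos[t] + OFFSET[r].
import random
import string

OFFSET = {"top": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0),
          "top-left": (-1, 1), "top-right": (1, 1),
          "down-left": (-1, -1), "down-right": (1, -1)}
LABEL = {v: k for k, v in OFFSET.items()}
LABEL[(0, 0)] = "overlap"

def sign(x):
    return (x > 0) - (x < 0)

def generate_sample(k, entities=string.ascii_uppercase):
    # Step 1: sample k+1 unique entities and one relation per adjacent pair.
    names = random.sample(entities, k + 1)
    chain = [(names[i], random.choice(list(OFFSET)), names[i + 1])
             for i in range(k)]
    # Step 2: select two chain entities at random for the question.
    a, b = random.sample(names, 2)
    # Step 3: verbalize each relation (a fixed phrase stands in for a
    # crowdsourced template) and shuffle the sentences into a story.
    story = [f"{h} is {r} of {t}." for h, r, t in chain]
    random.shuffle(story)
    question = f"What is the relation of the {a} to the {b}?"
    # Infer the answer by walking the chain backwards from the last entity.
    pos = {names[-1]: (0, 0)}
    for h, r, t in reversed(chain):
        pos[h] = (pos[t][0] + OFFSET[r][0], pos[t][1] + OFFSET[r][1])
    delta = (sign(pos[a][0] - pos[b][0]), sign(pos[a][1] - pos[b][1]))
    return story, question, LABEL[delta]
```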
Given this generation process we can quickly calculate the complexity of the task before using the templates. This is possible because entities can overlap. Given $k$ relations, $k+1$ entities sampled from $E$ in any order ($\bullet$), 8 possible relations between pairs of entities with 2 ways of describing them ($\bullet$), e.g., A is on the left of B or B is on the right of A, a random order of the $k$ sentences in the story ($\bullet$), and a question about 2 entities with 2 ways of describing it ($\bullet$), the number of examples that we can generate is equal to:

$$(k+1)!\binom{|E|}{k+1} \cdot 16^k \cdot \frac{k!}{2} \cdot 2\binom{k+1}{2}. \quad (3)$$

The complexity of the dataset grows exponentially with $k$. The StepGame dataset uses $|E| = 26$. For $k = 1$ we have 10,400 possible samples, for $k = 2$ we have more than 23 million samples, and so on. The sample complexity of the problem guarantees that when generating the dataset the probability of leaking samples from the training set to the test set diminishes with the increase of $k$.
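As a quick numerical check of Equation 3, the following snippet reproduces the sample counts given above for $|E| = 26$.

```python
# Numerical check of Eq. 3 for |E| = 26 (integer arithmetic; the single
# division by 2 is applied last to stay exact).
from math import comb, factorial

def num_samples(k, num_entities=26):
    total = (factorial(k + 1) * comb(num_entities, k + 1)   # entity order
             * 16**k                                        # relations x phrasing
             * factorial(k) * 2 * comb(k + 1, 2))           # story order, question
    return total // 2

print(num_samples(1))  # 10400
print(num_samples(2))  # 23961600, i.e. more than 23 million
```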
Please note that these calculations do not include templates. If we were to consider also the templates, the number of variations of the StepGame would be even larger.

Figure 2: On the left-hand side we have the original chain. Orange entities are those targeted by the question. Beside it, we show the same chain with the addition of noise. In green we represent the irrelevant, disconnected, and supporting entities.
3.3 Distracting Noise
To make the StepGame more challenging we also include
noisy examples in the test set. We assume that when mod-
els trained on the non-noisy dataset make mistakes on the
noisy test set, these models have failed to learn how to in-
fer spatial relations. We generate three kinds of distracting
noise: disconnected, irrelevant, and supporting. Examples
of all kinds of noise are provided in Figure 2. The irrel-
evant noise extends the original chain by branching it out
with new entities and relations. The disconnected noise adds
to the original chain a new independent chain with new enti-
ties and relations. The supporting noise adds to the original
chain new entities and relations that may provide alternative
reasoning paths. We only add supporting noise into chains
with more than 2 entities. All kinds of noise have no im-
pact on the correct answer. The type and amount of noise
added to each chain is randomly determined. The detailed
statistics for each type of distracting noise are provided in
the Appendix.
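To make the distinction between branching and independent chains concrete, here is a minimal sketch of ours (not the exact released procedure) for adding irrelevant and disconnected noise to a chain of (head, relation, tail) triples; supporting noise, which adds alternative reasoning paths, is omitted for brevity.

```python
# Noise sketches (our illustration). Neither function touches the relations
# among the original entities, so the correct answer is unaffected.
import random
import string

RELATIONS = ["top", "down", "left", "right",
             "top-left", "top-right", "down-left", "down-right"]

def add_irrelevant_noise(chain, entities=string.ascii_uppercase):
    # Branch the chain out: link one new entity to an existing chain entity.
    used = {e for h, _, t in chain for e in (h, t)}
    new = random.choice([e for e in entities if e not in used])
    anchor = random.choice(sorted(used))
    return chain + [(new, random.choice(RELATIONS), anchor)]

def add_disconnected_noise(chain, entities=string.ascii_uppercase):
    # Append an independent chain sharing no entity with the original.
    used = {e for h, _, t in chain for e in (h, t)}
    a, b = random.sample([e for e in entities if e not in used], 2)
    return chain + [(a, random.choice(RELATIONS), b)]
```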
4 The TP-MANN Model
In this section we introduce the proposed TP-MANN model,
as shown in Figure 3. The model comprises three major
components: a question and story encoder, a recurrent mem-
ory module, and a relation decoder. The encoder learns to
represent entities and relations for each sentence in a story.
The recurrent memory module learns to store entity-relation
pair representations into the memory independently. It also
updates the entity-relation pair representations based on the
current memory and stores the inferred information. The de-
coder learns to represent the question and, using the information stored in the memory, recurrently infers the spatial relation of the two entities mentioned in the question.
It has also been shown that learned representations in the TPR-like memory can be orthogonal (Schlag and Schmidhuber 2018). We use an example to illustrate the inspiration behind this architecture. A person may experience that, when she goes back to her hometown and sees an old tree, her happy childhood memory of playing with her friends under that tree might be recalled. However, this memory may
Figure 3: The TP-MANN architecture. PE stands for positional encoder, the sign in the box below the symbol E represents a feed-forward neural network, the ⊗ sign represents the outer-product operator, the • sign represents the inner-product operator, and LN represents a layer normalization. The ⊗, •, and LN boxes implement the formulae as presented in Section 4. Lines indicate the flow of information. Those without an arrow indicate which symbols are taken as input and are output by their box.
not be recalled unless triggered by the sight of the old tree. In our model, unbinding vectors in the decoder module play the role of the old tree in the example: the unbinding vectors are learned based on the target questions, and the decoder module unbinds relevant memories given a question via a recurrent mechanism. Moreover, although memories are stored separately, there are integration processes in the brain that retrieve information via a recursive mechanism. This allows episodes in memories to interact with each other (Kumaran and McClelland 2012; Schapiro et al. 2017; Koster et al. 2018).
Encoder. The input of the encoder is a story and a question. Given an input story $S = (s_1, \ldots, s_m)$ with $m$ sentences and a question $q$, both described by words in a vocabulary $V$, each sentence $s_i = (w_1, \ldots, w_n)$ is mapped to learnable embeddings $(w^*_1, \ldots, w^*_n)$. Then, a positional encoding (PE) is applied to each word embedding and the results are averaged together: $s^*_i = \frac{1}{n} \sum_{j=1}^{n} w^*_j \cdot p_j$, where $\{p_1, \ldots, p_n\}$ are learnable position vectors and $\cdot$ is the element-wise product. This operation defines $S^* \in \mathbb{R}^{m \times d}$, where each row of $S^*$ represents an encoded sentence and $d$ is the dimension of a word embedding. The input question is converted to a vector $q^* \in \mathbb{R}^{d}$ in the same way. For each sentence of the story in $S^*$, we learn entity and relation representations as:

$$E_i = f_{e_i}(S^*), \quad i = 1, 2, \quad (4)$$
$$R_j = f_{r_j}(S^*), \quad j = 1, 2, 3, \quad (5)$$

where the $f_{e_i}$ are feed-forward neural networks that output entity representations $E_i \in \mathbb{R}^{m \times d_e}$ and the $f_{r_j}$ are feed-forward neural networks that output relation representations $R_j \in \mathbb{R}^{m \times d_r}$. Finally, we define three search keys $K$ as:

$$K_1 = E_1 \otimes R_1, \quad (6)$$
$$K_2 = E_1 \otimes R_2, \quad (7)$$
$$K_3 = E_2 \otimes R_3, \quad (8)$$

where $K_1, K_2, K_3 \in \mathbb{R}^{m \times d_e \times d_r}$. The keys will be used to manipulate the memory in the next module and retrieve potentially existing associations for each entity-relation pair.
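To fix the tensor shapes, here is a minimal PyTorch sketch of the encoder; treating the feed-forward maps $f_{e_i}$ and $f_{r_j}$ as single linear layers is our implementation guess.

```python
# Encoder sketch (shapes follow the text; single linear layers for f_e, f_r
# are an assumption). Input: a story as an (m, n) tensor of word ids.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, d, d_e, d_r, max_len):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.pos = nn.Parameter(torch.randn(max_len, d))  # learnable p_j
        self.f_e = nn.ModuleList([nn.Linear(d, d_e) for _ in range(2)])
        self.f_r = nn.ModuleList([nn.Linear(d, d_r) for _ in range(3)])

    def forward(self, story):
        m, n = story.shape
        w = self.emb(story)                    # (m, n, d) word embeddings
        s = (w * self.pos[:n]).mean(dim=1)     # s*_i = (1/n) sum_j w*_j . p_j
        E = [f(s) for f in self.f_e]           # entity reps, each (m, d_e)
        R = [f(s) for f in self.f_r]           # relation reps, each (m, d_r)
        # Keys (Eqs. 6-8): one outer product per sentence, each (m, d_e, d_r).
        K1 = torch.einsum('me,mr->mer', E[0], R[0])
        K2 = torch.einsum('me,mr->mer', E[0], R[1])
        K3 = torch.einsum('me,mr->mer', E[1], R[2])
        return (K1, K2, K3), E
```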
Recurrent Memory Module. To allow stored information to interact, we use a recurrent architecture with $T$ recurrent layers to update the TPR-like memory representation $M \in \mathbb{R}^{d_e \times d_r \times d_e}$, where $M$ contains trainable parameters. Through this recurrent architecture, existing episodes stored in memory can interact with new inferences to generate new episodes. Different from many models like the Transformer (Vaswani et al. 2017) and graph-based models (Kipf and Welling 2017; Velickovic et al. 2018), where adding more layers leads to a larger number of trainable parameters, our model does not increase the number of trainable parameters as the number of recurrent layers increases.

At each layer $t$, given the keys $K$, we extract pseudo-entities $P$ for each sentence in $S^*$. In the first layer ($t = 0$), since there is no previous information in memory $M_0$, the model just converts each sentence in $S^*$ into an episode and stores it in the memory ($M_1$). Then, at later layers ($t > 0$), the pseudo-entities $P$ build bridges between episodes in the current memory $M_t$ and allow them to interact with potential entity-relation associations:

$$P_{jt} = K_j \cdot M_t, \quad j = 1, 2, 3, \quad (9)$$

where $P_{jt} \in \mathbb{R}^{m \times d_e}$. We then construct the memory episodes to be updated or removed. This is done after the first storage at $t = 0$, so that all story information is already available in $M$. These old episodes, $O_{jt} \in \mathbb{R}^{d_e \times d_r \times d_e}$, will be updated or removed to avoid memory conflicts that may occur when receiving new information:

$$O_{jt} = K_j \otimes P_{jt}, \quad j = 1, 2, 3. \quad (10)$$

Afterwards, new episodes $N_1$, $N_{2t}$, and $N_3 \in \mathbb{R}^{d_e \times d_r \times d_e}$ will be added to the memory:

$$N_1 = K_1 \otimes E_2, \quad (11)$$
$$N_{2t} = K_2 \otimes P_{1t}, \quad (12)$$
$$N_3 = K_3 \otimes E_1. \quad (13)$$

Then we apply this change to the memory by removing (subtracting) the old episodes and adding the new ones to the now dated memory $M_t$:

$$M_{t+1} = \mathrm{LN}(M_t + N_1 + N_{2t} + N_3 - O_{1t} - O_{2t} - O_{3t}), \quad (14)$$

where LN is a layer normalization.
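The following PyTorch sketch shows one update step (Equations 9 to 14); batching the contractions over the $m$ sentences and applying LN over the full memory tensor are our assumptions.

```python
# One recurrent memory update (Eqs. 9-14); einsum letters: m sentences,
# e/f entity dims (d_e), r relation dim (d_r).
import torch
import torch.nn.functional as F

def memory_step(M, K1, K2, K3, E1, E2):
    # M: (d_e, d_r, d_e); each K_j: (m, d_e, d_r); each E_i: (m, d_e).
    # Eq. 9: pseudo-entities by contracting keys with the current memory.
    P1, P2, P3 = (torch.einsum('mer,erf->mf', K, M) for K in (K1, K2, K3))
    # Eq. 10: old episodes to subtract, binding keys with pseudo-entities.
    O1, O2, O3 = (torch.einsum('mer,mf->erf', K, P)
                  for K, P in ((K1, P1), (K2, P2), (K3, P3)))
    # Eqs. 11-13: new episodes to add.
    N1 = torch.einsum('mer,mf->erf', K1, E2)
    N2 = torch.einsum('mer,mf->erf', K2, P1)
    N3 = torch.einsum('mer,mf->erf', K3, E1)
    # Eq. 14: apply the change and normalize.
    M_next = M + N1 + N2 + N3 - O1 - O2 - O3
    return F.layer_norm(M_next, M_next.shape)
```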
Decoder. The prediction is computed based on the memory $M$ constructed at the last layer and a question vector $q$. To do this we follow the same procedure designed by Schlag and Schmidhuber (2018):

$$U_j = f_{u_j}(q), \quad j = 1, 2, 3, 4, \quad (15)$$

where $f_{u_1}$ is a feed-forward neural network that outputs a $d_e$-dimensional unbinding vector, and $f_{u_2}$, $f_{u_3}$, $f_{u_4}$ are feed-forward neural networks that output $d_r$-dimensional unbinding vectors. Then, the information stored in $M$ is retrieved in a recurrent way based on the unbinding vectors learned from the question:

$$I_1 = \mathrm{LN}(M_T \cdot U_1) \cdot U_2, \quad (16)$$
$$I_2 = \mathrm{LN}(M_T \cdot I_1) \cdot U_3, \quad (17)$$
$$I_3 = \mathrm{LN}(M_T \cdot I_2) \cdot U_4, \quad (18)$$
$$\hat{v} = \mathrm{softmax}\Big(W_o \cdot \sum_{j=1}^{3} I_j\Big). \quad (19)$$

A linear projection with trainable parameters $W_o \in \mathbb{R}^{|V| \times d_e}$ and a softmax function are used to map the extracted information into $\hat{v} \in \mathbb{R}^{|V|}$. Hence, the decoder module outputs a probability distribution over the terms of the vocabulary $V$.
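A minimal sketch of this recurrent unbinding (Equations 15 to 19); single linear layers for the $f_{u_j}$ maps and contraction over the memory's last entity axis are our assumptions.

```python
# Decoder sketch: unbind the final memory three times with question-derived
# vectors, then project the summed read-outs onto the vocabulary.
import torch
import torch.nn.functional as F

def decode(M, q, f_u, W_o):
    # M: (d_e, d_r, d_e) final memory; q: (d,) question vector;
    # f_u: 4 linear maps (the first outputs d_e dims, the rest d_r dims);
    # W_o: (|V|, d_e) output projection.
    U1, U2, U3, U4 = (f(q) for f in f_u)

    def read(entity_key, rel_key):                      # LN(M . u) . u'
        A = torch.einsum('erf,f->er', M, entity_key)    # (d_e, d_r)
        return torch.einsum('er,r->e', F.layer_norm(A, A.shape), rel_key)

    I1 = read(U1, U2)                                   # Eq. 16
    I2 = read(I1, U3)                                   # Eq. 17
    I3 = read(I2, U4)                                   # Eq. 18
    return F.softmax(W_o @ (I1 + I2 + I3), dim=0)       # Eq. 19
```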
5 Experiments and Results
In this section we aim to address the following research
questions: (RQ1) What is the degree of data leakage in the
datasets? (RQ2) How does our model behave with respect
to state-of-the-art NLU models in spatial reasoning tasks?
(RQ3) How do these models behave when tested on exam-
ples more challenging than those used for training? (RQ4)
What is the effect of the number of recurrent-layers in the
recurrent memory module? Before answering these ques-
tions, we first present the material and baselines used in
our experiments. The software and data are available at:
https://github.com/ZhengxiangShi/StepGame
5.1 Material and Baselines
In the following experiments we will use two datasets, the
bAbI dataset and the StepGame dataset. For the bAbI dataset
we only focus on task 17 and task 19 and use the original
train and test splits made of 10 000 samples for the train-
ing set and 1 000 for the validation and test sets. For the
StepGame dataset, we generate a training set made of samples with $k$ varying from 1 to 5 in steps of 1, and a test set with $k$ varying from 1 to 10. Moreover, the test set also contains distracting noise. The final dataset consists of, for each $k$ value, 10 000 training samples, 1 000 validation samples, and 10 000 test samples.
Model      Task 17        Task 19        Mean
RN         97.33±1.55     98.63±1.79     97.98
RRN        97.80±2.34     49.80±5.76     73.80
STM        97.80±1.06     99.98±0.05     98.89
UT         98.60±3.40     93.90±7.30     96.25
TPR-RNN    97.55±1.99     99.95±0.06     98.75
Ours       99.88±0.10     99.98±0.04     99.93

Table 1: Test accuracy on tasks 17 and 19 of the bAbI dataset: Mean±Std over 5 runs.
We compare our model against five baselines: Recur-
rent Relational Networks (RRN) (Palm, Paquet, and Winther
2018), Relational Network (RN) (Santoro et al. 2017), TPR-
RNN (Schlag and Schmidhuber 2018), Self-attentive Asso-
ciative Memory (STM) (Le, Tran, and Venkatesh 2020), and
Universal Transformer (UT) (Dehghani et al. 2019). Each
model is trained and validated on each dataset independently
following the hyper-parameter ranges and procedures pro-
vided in their original papers. All training details, including
those for our model, are reported in the Appendix.
5.2 Training-Test Leakage
To answer RQ1 we have calculated the degree of data leak-
age present in the bAbI and StepGame datasets. For task 17, we counted how many samples in the test set also appear in the training set: 23.2% of the test samples are also in the
training set. For task 19, for each sample we extracted the
relevant sentences in the stories (i.e., those sentences neces-
sary to answer the question correctly) and questions. Then
we counted how many such pairs in the test set appear in
the training set: 80.2% of the pairs overlap with pairs in the
training set. For the StepGame dataset, for each sample we
extracted the sentences in the stories and questions. The sen-
tences in the story are sorted in lexicographical order. Then
we counted how many such pairs in the test set appear also
in the training set before adding distracting noise and using
the templates: 1.09% of the pairs overlap with pairs in the training set. However, such overlap is entirely produced by the
samples with k= 1, which due to their limited number have
a higher chance of being included in the test set. If we re-
move those examples, the overlap between training and test
sets drops to 0%.
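The StepGame overlap check described above can be sketched as follows, with each sample canonicalized as its lexicographically sorted story sentences paired with its question.

```python
# Overlap between test and training samples, as described in the text.
def overlap_rate(train_pairs, test_pairs):
    # Each element: (list_of_story_sentences, question_string).
    seen = {(tuple(sorted(s)), q) for s, q in train_pairs}
    hits = sum((tuple(sorted(s)), q) in seen for s, q in test_pairs)
    return hits / len(test_pairs)
```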
5.3 Spatial Inference
To answer RQ2 and judge the spatial inference ability of our
model and the baselines we train them on the bAbI and the
StepGame datasets and compare them by measuring their
test accuracy.
In Table 1 we present the results of our model and the
baselines on the task 17 and 19 of the bAbI dataset. The per-
formance of our model is slightly better than the best base-
line. However, due to the issues of the bAbI dataset, these
results are not enough to firmly answer RQ2.
In Table 2 we present the results for the StepGame dataset.
In this dataset, the training set contains no noise, while the test set contains distracting noise. In the table we break down the per-
formance of the trained models across k. In the last column
Model                                    k=1         k=2         k=3         k=4         k=5         Mean
RN (Santoro et al. 2017)                 22.64±0.25  17.08±1.41  15.08±2.58  12.84±2.27  11.52±1.73  15.83
RRN (Palm, Paquet, and Winther 2018)     24.05±4.48  19.98±4.68  16.03±2.89  13.22±2.51  12.31±2.16  17.12
UT (Dehghani et al. 2019)                45.11±4.16  28.36±4.50  17.41±2.18  14.07±2.87  13.45±1.35  23.68
STM (Le, Tran, and Venkatesh 2020)       53.42±3.73  35.96±4.45  23.03±1.83  18.45±1.87  15.14±1.56  29.20
TPR-RNN (Schlag and Schmidhuber 2018)    70.29±3.03  46.03±2.24  36.14±2.66  26.82±2.64  24.77±2.75  40.81
Ours                                     85.77±3.18  60.31±2.23  50.18±2.65  37.45±4.21  31.25±3.38  52.99

Table 2: Test accuracy on the StepGame dataset: Mean±Std over 5 runs.
Model                                    k=6         k=7         k=8         k=9         k=10        Mean
RN (Santoro et al. 2017)                 11.12±0.96  11.53±0.70  11.21±0.98  11.13±1.00  11.34±0.87  11.27
RRN (Palm, Paquet, and Winther 2018)     11.62±0.80  11.40±0.76  11.83±0.75  11.22±0.86  11.69±1.40  11.56
UT (Dehghani et al. 2019)                12.73±2.37  12.11±1.52  11.40±0.92  11.41±0.96  11.74±1.07  11.88
STM (Le, Tran, and Venkatesh 2020)       13.80±1.95  12.63±1.69  11.54±1.61  11.30±1.13  11.77±0.93  12.21
TPR-RNN (Schlag and Schmidhuber 2018)    22.25±3.12  19.88±2.80  15.45±2.98  13.01±2.28  12.65±2.71  16.65
Ours                                     28.53±3.59  26.45±2.95  23.67±2.78  22.52±2.36  21.46±1.72  24.53

Table 3: Test accuracy on StepGame for larger k (only on the test set): Mean±Std over 5 runs.
we report the average performance across k. Our model out-
performs all the baseline models. Compared to Table 1, the
decreased accuracy in Table 2 demonstrates the difficulty of
spatial reasoning with distracting noise. It is not surprising
that the performance of all five baseline models decreases when $k$ increases, that is, when the number of required inference hops increases. We also report test accuracy on test
sets without distracting noise in the Appendix.
5.4 Systematic Generalization
To answer RQ3 we generate new StepGame test sets with $k \in \{6, 7, 8, 9, 10\}$ and distracting noise. We then test all the models jointly trained on the StepGame training set with $k \in \{1, 2, 3, 4, 5\}$ as in Section 5.3. We can consider this experiment as a zero-shot learning setting for larger $k$.
In Table 3 we present the performance of the different models on this generalization task. Not surprisingly, the performance of all models degrades monotonically as we increase $k$. RN, RRN, UT, and STM fail to generalize to the test sets with higher $k$ values, while our model is more robust and outperforms the baseline models by a large margin. This
demonstrates the better generalization ability of our model,
which performs well on longer stories never seen during
training.
5.5 Inference Analysis
To answer RQ4, we conduct an analysis of the hyper-parameter $T$, the number of recurrent layers in our model. We jointly train TP-MANN on the StepGame dataset with $k$ between 1 and 5 and with $T$ between 1 and 6, and report the test accuracy broken down for each value of $k$. These results are shown in the left-hand side figure of Figure 4. The test sets with higher $k$ benefit more from a higher number of recurrent layers than those with lower $k$, indicating that recurrent layers are critical for multi-hop reasoning. We also analyze how the recurrent layer structure affects systematic generalization. To do this we also test on a StepGame test set with $k$ between 6 and 10, with noise. These $k$ values are larger
Figure 4: Analysis of TP-MANN's number of recurrent layers ($T$). The x-axis is the $T$ with which the model has been trained. Each line represents a different value of $k$ of the StepGame dataset.
than the largest $k$ used during training. These results are shown in the right-hand side figure of Figure 4. Here we see that as $T$ increases, the performance of the model improves. This analysis further corroborates that our recurrent structure supports multi-hop inference. It is worth noting that the number of trainable parameters in our model remains unchanged as $T$ increases. Interestingly, we find that the number of recurrent layers needed to solve the task is smaller than the length of the stories $k$, suggesting that the inference process may happen in parallel.
6 Conclusion
In this paper, we proposed a new dataset named StepGame
that requires a robust multi-hop spatial reasoning ability to
be solved and mitigates the issues observed in the bAbI
dataset. Then, we introduced TP-MANN, a tensor product-
based memory-augmented neural network architecture that
achieves state-of-the-art performance on both datasets. Fur-
ther analysis also demonstrated the importance of a recurrent
memory module for multi-hop reasoning.
References
Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.;
Sünderhauf, N.; Reid, I. D.; Gould, S.; and van den Hen-
gel, A. 2018. Vision-and-Language Navigation: Interpreting
Visually-Grounded Navigation Instructions in Real Environ-
ments. In 2018 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA,
June 18-22, 2018, 3674–3683. IEEE Computer Society.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-
Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.;
Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational
inductive biases, deep learning, and graph networks. arXiv
preprint arXiv:1806.01261.
Bisk, Y.; Shih, K. J.; Choi, Y.; and Marcu, D. 2018. Learning
Interpretable Spatial Operations in a Rich 3D Blocks World.
In Proceedings of the Thirty-Second AAAI Conference on
Artificial Intelligence, (AAAI-18), the 30th innovative Ap-
plications of Artificial Intelligence (IAAI-18), and the 8th
AAAI Symposium on Educational Advances in Artificial In-
telligence (EAAI-18), New Orleans, Louisiana, USA, Febru-
ary 2-7, 2018. AAAI Press.
Chen, C.-H.; Fu, Y.-F.; Cheng, H.-H.; and Lin, S.-D. 2020a.
Unseen Filler Generalization In Attention-based Natural
Language Reasoning Models. In 2020 IEEE Second In-
ternational Conference on Cognitive Machine Intelligence
(CogMI), 42–51. IEEE.
Chen, H.; Suhr, A.; Misra, D.; Snavely, N.; and Artzi, Y.
2019. TOUCHDOWN: Natural Language Navigation and
Spatial Reasoning in Visual Street Environments. In IEEE
Conference on Computer Vision and Pattern Recognition,
CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Com-
puter Vision Foundation / IEEE.
Chen, K.; Huang, Q.; Palangi, H.; Smolensky, P.; Forbus,
K. D.; and Gao, J. 2020b. Mapping natural-language prob-
lems to formal-language solutions using structured neural
representations. In Proceedings of the 37th International
Conference on Machine Learning, ICML 2020, 13-18 July
2020, Virtual Event, Proceedings of Machine Learning Re-
search. PMLR.
Dehghani, M.; Gouws, S.; Vinyals, O.; Uszkoreit, J.; and
Kaiser, L. 2019. Universal Transformers. In 7th Interna-
tional Conference on Learning Representations, ICLR 2019,
New Orleans, LA, USA, May 6-9, 2019.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019.
BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers). Association for
Computational Linguistics.
Ding, D.; Hill, F.; Santoro, A.; and Botvinick, M. 2020.
Object-based attention for spatio-temporal reasoning: Out-
performing neuro-symbolic models with flexible distributed
architectures. arXiv preprint arXiv:2012.08508.
Gershman, S.; and Tenenbaum, J. B. 2015. Phrase similar-
ity in humans and machines. In Proceedings of the 37th
Annual Meeting of the Cognitive Science Society, CogSci
2015, Pasadena, California, USA, July 22-25, 2015. cogni-
tivesciencesociety.org.
Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka,
I.; Grabska-Barwińska, A.; Colmenarejo, S. G.; Grefen-
stette, E.; Ramalho, T.; Agapiou, J.; et al. 2016. Hybrid
computing using a neural network with dynamic external
memory. Nature.
Huang, Q.; Smolensky, P.; He, X.; Deng, L.; and Wu, D.
2018. Tensor Product Generation Networks for Deep NLP
Modeling. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume
1 (Long Papers). New Orleans, Louisiana: Association for
Computational Linguistics.
Janner, M.; Narasimhan, K.; and Barzilay, R. 2018. Repre-
sentation Learning for Grounded Spatial Reasoning. Trans-
actions of the Association for Computational Linguistics.
Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Clas-
sification with Graph Convolutional Networks. In 5th In-
ternational Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings.
Koster, R.; Chadwick, M. J.; Chen, Y.; Berron, D.; Banino,
A.; Düzel, E.; Hassabis, D.; and Kumaran, D. 2018. Big-
loop recurrence within the hippocampal system supports in-
tegration of information across episodes. Neuron, 99(6):
1342–1354.
Kruijff, G.-J. M.; Zender, H.; Jensfelt, P.; and Christensen,
H. I. 2007. Situated dialogue and spatial organization: What,
where. . . and why? International Journal of Advanced
Robotic Systems.
Kumaran, D.; and McClelland, J. L. 2012. Generalization
through the recurrent interaction of episodic memories: a
model of the hippocampal system. Psychological review.
Landsiedel, C.; Rieser, V.; Walter, M.; and Wollherr, D.
2017. A review of spatial reasoning and interaction for real-
world robotics. Advanced Robotics.
Le, H.; Tran, T.; and Venkatesh, S. 2020. Self-Attentive As-
sociative Memory. In Proceedings of the 37th International
Conference on Machine Learning, ICML 2020, 13-18 July
2020, Virtual Event, volume 119 of Proceedings of Machine
Learning Research, 5682–5691. PMLR.
Liu, F.; and Perez, J. 2017. Gated End-to-End Memory Net-
works. In Proceedings of the 15th Conference of the Euro-
pean Chapter of the Association for Computational Linguis-
tics: Volume 1, Long Papers. Valencia, Spain: Association
for Computational Linguistics.
Mirzaee, R.; Faghihi, H. R.; Ning, Q.; and Kordjamshidi, P.
2021. SPARTQA: A Textual Question Answering Bench-
mark for Spatial Reasoning. In Proceedings of the 2021
Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Tech-
nologies, 4582–4598.
Palm, R. B.; Paquet, U.; and Winther, O. 2018. Recur-
rent Relational Networks. In Bengio, S.; Wallach, H. M.;
Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Gar-
nett, R., eds., Advances in Neural Information Processing
Systems 31: Annual Conference on Neural Information Pro-
cessing Systems 2018, NeurIPS 2018, December 3-8, 2018,
Montréal, Canada, 3372–3382.
Petruck, M. R. L.; and Ellsworth, M. J. 2018. Represent-
ing Spatial Relations in FrameNet. In Proceedings of the
First International Workshop on Spatial Language Under-
standing, 41–45. New Orleans: Association for Computa-
tional Linguistics.
Pustejovsky, J. 1989. Language and Spatial Cognition. Com-
putational Linguistics, 15(3).
Pustejovsky, J.; Kordjamshidi, P.; Moens, M.-F.; Levine, A.;
Dworman, S.; and Yocum, Z. 2015. SemEval-2015 Task 8:
SpaceEval. In Proceedings of the 9th International Work-
shop on Semantic Evaluation (SemEval 2015). Denver, Col-
orado: Association for Computational Linguistics.
Santoro, A.; Raposo, D.; Barrett, D. G. T.; Malinowski, M.;
Pascanu, R.; Battaglia, P. W.; and Lillicrap, T. 2017. A
simple neural network module for relational reasoning. In
Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.;
Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds.,
Advances in Neural Information Processing Systems 30: An-
nual Conference on Neural Information Processing Systems
2017, December 4-9, 2017, Long Beach, CA, USA, 4967–
4976.
Schapiro, A. C.; Turk-Browne, N. B.; Botvinick, M. M.;
and Norman, K. A. 2017. Complementary learning systems
within the hippocampus: a neural network modelling ap-
proach to reconciling episodic memory with statistical learn-
ing. Philosophical Transactions of the Royal Society B: Bi-
ological Sciences.
Schlag, I.; Munkhdalai, T.; and Schmidhuber, J. 2021.
Learning Associative Inference Using Fast Weight Memory.
In 9th International Conference on Learning Representa-
tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
Schlag, I.; and Schmidhuber, J. 2018. Learning to Reason
with Third Order Tensor Products. In Bengio, S.; Wallach,
H. M.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and
Garnett, R., eds., Advances in Neural Information Process-
ing Systems 31: Annual Conference on Neural Information
Processing Systems 2018, NeurIPS 2018, December 3-8,
2018, Montréal, Canada, 10003–10014.
Smolensky, P. 1990. Tensor product variable binding and the
representation of symbolic structures in connectionist sys-
tems. Artificial intelligence, 46(1-2): 159–216.
Sukhbaatar, S.; Szlam, A.; Weston, J.; and Fergus, R. 2015.
End-To-End Memory Networks. In Advances in Neural In-
formation Processing Systems 28: Annual Conference on
Neural Information Processing Systems 2015, December 7-
12, 2015, Montreal, Quebec, Canada, 2440–2448.
Talmor, A.; and Berant, J. 2018. The Web as a Knowledge-
Base for Answering Complex Questions. In Walker, M. A.;
Ji, H.; and Stent, A., eds., Proceedings of the 2018 Confer-
ence of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies,
NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-
6, 2018, Volume 1 (Long Papers), 641–651. Association for
Computational Linguistics.
Tan, H.; and Bansal, M. 2018. Source-Target Inference
Models for Spatial Instruction Understanding. In McIl-
raith, S. A.; and Weinberger, K. Q., eds., Proceedings of the
Thirty-Second AAAI Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Artificial In-
telligence (IAAI-18), and the 8th AAAI Symposium on Edu-
cational Advances in Artificial Intelligence (EAAI-18), New
Orleans, Louisiana, USA, February 2-7, 2018, 5504–5511.
AAAI Press.
Tversky, B. 2019. Mind in motion: How action shapes
thought. Hachette UK.
van Aken, B.; Winter, B.; Löser, A.; and Gers, F. A. 2019.
How Does BERT Answer Questions?: A Layer-Wise Anal-
ysis of Transformer Representations. In Proceedings of
the 28th ACM International Conference on Information and
Knowledge Management, CIKM.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. At-
tention is All you Need. In Advances in Neural Information
Processing Systems 30: Annual Conference on Neural Infor-
mation Processing Systems.
Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò,
P.; and Bengio, Y. 2018. Graph Attention Networks. In
6th International Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018,
Conference Track Proceedings. OpenReview.net.
Vogel, A.; and Jurafsky, D. 2010. Learning to Follow Nav-
igational Directions. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics,
806–814. Uppsala, Sweden: Association for Computational
Linguistics.
Welbl, J.; Stenetorp, P.; and Riedel, S. 2018. Constructing
datasets for multi-hop reading comprehension across docu-
ments. Transactions of the Association for Computational
Linguistics, 6: 287–302.
Weston, J.; Bordes, A.; Chopra, S.; and Mikolov, T. 2016.
Towards AI-Complete Question Answering: A Set of Pre-
requisite Toy Tasks. In Bengio, Y.; and LeCun, Y., eds.,
4th International Conference on Learning Representations,
ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Confer-
ence Track Proceedings.
Yang, T.-Y.; Lan, A.; and Narasimhan, K. 2020. Robust and
Interpretable Grounding of Spatial References with Relation
Networks. In Findings of the Association for Computational
Linguistics: EMNLP 2020, 1908–1923. Online: Association
for Computational Linguistics.
Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W. W.;
Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA:
A Dataset for Diverse, Explainable Multi-hop Question An-
swering. In Proceedings of the 2018 Conference on Em-
pirical Methods in Natural Language Processing, Brussels,
Belgium, October 31 - November 4, 2018, 2369–2380. As-
sociation for Computational Linguistics.