StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

Abstract

Inferring spatial relations in natural language is a crucial ability an intelligent system should possess. The bAbI dataset tries to capture tasks relevant to this domain (tasks 17 and 19). However, these tasks have several limitations. Most importantly, they are limited to fixed expressions, they are limited in the number of reasoning steps required to solve them, and they fail to test the robustness of models to input that contains irrelevant or redundant information. In this paper, we present a new Question-Answering dataset called StepGame for robust multi-hop spatial reasoning in texts. Our experiments demonstrate that state-of-the-art models on the bAbI dataset struggle on the StepGame dataset. Moreover, we propose a Tensor-Product based Memory-Augmented Neural Network (TP-MANN) specialized for spatial reasoning tasks. Experimental results on both datasets show that our model outperforms all the baselines with superior generalization and robustness performance.
Zhengxiang Shi (1), Qiang Zhang (2), Aldo Lipani (1)
(1) University College London
(2) Zhejiang University
zhengxiang.shi.19@ucl.ac.uk, qiang.zhang.cs@zju.edu.cn, aldo.lipani@ucl.ac.uk
1 Introduction
Neural networks have been successful in a wide array of per-
ceptual tasks, but it is often stated that they are incapable
of solving tasks that require higher-level reasoning (Ding
et al. 2020). Since spatial reasoning is ubiquitous in many
scenarios such as autonomous navigation (Vogel and Juraf-
sky 2010), situated dialog (Kruijff et al. 2007), and robotic
manipulation (Yang, Lan, and Narasimhan 2020; Landsiedel
et al. 2017), grounding spatial references in texts is essential
for effective human-machine communication through natu-
ral language. Navigation tasks require agents to reason about
their relative position to objects and how these relations
change as they move through the environment (Chen et al.
2019). If we want to develop conversational systems able to
assist users in solving tasks where spatial references are in-
volved, we need to make them able to understand and reason
about spatial references in natural language. Such ability can
help conversational systems to successfully follow instruc-
tions and understand spatial descriptions. However, despite
its tremendous applicability, reasoning over spatial relations
remains a challenging task for existing conversational sys-
tems.
Copyright © 2022, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Earlier works in spatial reasoning focused on spatial in-
struction understanding in a synthetic environment (Bisk
et al. 2018; Tan and Bansal 2018; Janner, Narasimhan, and
Barzilay 2018) or in a simulated world with spatial infor-
mation annotation in texts (Pustejovsky et al. 2015), spatial
relation extractions across entities (Petruck and Ellsworth
2018) and visual observations (Anderson et al. 2018; Chen
et al. 2019). However, few of the existing datasets are designed
to evaluate models’ inference over spatial information in texts.
A spatial relational inference task often requires a conversational
system to infer the spatial relation between two items given a
description of a scene. For example, imagine a user asking a
conversational system to recognize the location of an entity based
on the description of other entities in a scene. To do so, the
conversational system needs to
be able to reason about the location of the various entities in
the scene using only textual information.
The bAbI dataset (Weston et al. 2016) is the most relevant dataset
for this task. It contains 20 synthetic question answering
(QA) tasks to test a variety of reasoning abilities in texts,
like deduction, co-reference, and counting. In particular, the
positional reasoning task (no. 17) and the path finding task
(no. 19) are designed to evaluate models’ spatial reasoning
ability. These two tasks are arguably the most challenging
ones (van Aken et al. 2019). The state-of-the-art model on
the bAbI dataset (Le, Tran, and Venkatesh 2020) solves these
two spatial reasoning tasks almost perfectly. However, in
this paper, we demonstrate that such good performance is
attributable to issues with the bAbI dataset rather than to the
model’s inference ability.
We find four major issues with bAbI’s tasks 17 and 19: (1)
There is data leakage between the training and test sets; that is,
most of the test set samples appear in the training set. Hence,
the evaluation results on the test set cannot truly reflect mod-
els’ reasoning ability; (2) Named entities are fixed and only
four relations are considered. Each text sample always con-
tains the same four named entities in the training, valida-
tion, and test sets. This further biases the learning models
towards these four entities. When named entities in the test
set are replaced by unseen entities or the number of such
entities increases, the model performance decreases dramat-
ically (Chen et al. 2020a). Also, relations such as top-left,
top-right, lower-left, lower-right are not taken into consider-
ation; (3) Learning models are required to reason only over
one or two sentences in the text descriptions, making such
tasks relatively simple. Palm, Paquet, and Winther (2018)
pointed out that multi-hop reasoning is not necessary for the
bAbI dataset since models only need a single step to solve
all the tasks; and (4) It is a synthetic dataset with a limited
diversity of spatial relation descriptions. It thus cannot truly
reveal the models’ ability in understanding textual space de-
scriptions.
In this paper, we propose a new dataset called StepGame
to tackle the above-mentioned issues and a novel Tensor
Product-based Memory-Augmented Neural Network archi-
tecture (TP-MANN) for multi-hop spatial reasoning in texts.
The StepGame dataset is based on crowdsourced descrip-
tions of 8 potential spatial relations between 2 entities. These
descriptions are then used as templates when generating the
dataset. To increase the diversity of these templates, crowd-
workers were asked to diversify their expressions. This was
done in order to ensure that the crowdsourced templates
cover most of the natural ways relations between two entities
can be described in text. The StepGame dataset is characterized
by a combinatorial growth in the number of possible descriptions
of scenes, named stories, as the number of described relations
between two entities increases. This combinatorial growth reduces
the chance of leaking stories from the training set to the
validation and test sets. Moreover, we use
a large number of named entities and require multi-hop rea-
soning to answer questions about two entities mentioned in
the stories. Experimental results show that existing models
(1) fail to achieve a performance on the StepGame dataset
similar to that achieved on the bAbI dataset, and (2) suffer
from a large performance drop as the number of required
reasoning steps increases.
The TP-MANN architecture is based on tensor product
representations (Smolensky 1990) that are used in a recur-
rent memory module to store, update or delete the relation
information among entities inferred from stories. This recur-
rent architecture provides three key benefits: (1) it enables
the model to make inferences based on the stored memory;
(2) it allows multi-hop reasoning and is robust to noise;
and (3) the number of parameters remains unchanged as the
number of recurrent layers in the memory module increases.
Experimental results on the StepGame dataset show that our
model achieves state-of-the-art performance with a substan-
tial improvement, and demonstrates a better generalization
ability to more complex stories. Finally, we also conduct
some analysis of our recurrent structure and demonstrate its
importance for multi-hop reasoning.
2 Related Work and Background
2.1 Related Work
Reasoning Datasets. The role of language in spatial rea-
soning has been investigated since the 1980s (Pustejovsky
1989; Gershman and Tenenbaum 2015; Tversky 2019), and
reasoning about spatial relations has been studied in several
contexts such as 2D and 3D navigation (Bisk et al.
2018; Tan and Bansal 2018; Janner, Narasimhan, and Barzi-
lay 2018; Yang, Lan, and Narasimhan 2020), and robotic
manipulation (Landsiedel et al. 2017). However, few of the
datasets used in these works are used to evaluate systems’
spatial reasoning ability in texts.
The bAbI (Weston et al. 2016) dataset consists of several
QA tasks. Solving these tasks requires logical reasoning steps;
simple word matching is not enough. Of particular
interest to this paper are tasks 17 and 19. Task 17 is
about positional reasoning while task 19 is about path find-
ing. These two tasks can be used to evaluate the spatial infer-
ence ability of learning models. However, the bAbI dataset
has several issues as mentioned above: the data leakage, the
fixed named entities and expressions, and the lack of a need
to perform multi-hop reasoning. Another relevant dataset is
SpartQA (Mirzaee et al. 2021), which is designed for spatial
reasoning over texts but requires only limited multi-hop
reasoning compared to StepGame.
Multi-Hop QA Datasets. The multi-hop QA tasks re-
quire reasoning over multiple pieces of evidence and fo-
cus on leveraging the connections between entities to in-
fer a requested property of a set of them. Commonly-
used multi-hop QA datasets are HotpotQA (Yang et al.
2018), ComplexWebQuestions (Talmor and Berant 2018),
and QAngaroo (Welbl, Stenetorp, and Riedel 2018). The
proposed StepGame dataset is different from these datasets.
The StepGame dataset focuses on spatial reasoning, which
requires machine learning models to infer the spatial rela-
tions among the described entities. Moreover, multi-hop QA
datasets usually require no more than two reasoning steps,
while the StepGame dataset can require as many as 10 rea-
soning steps.
Reasoning Models. There are three types of reasoning
models: memory-augmented neural networks, graph neu-
ral networks, and transformer-based networks. Works of
the first type augment neural networks with external memory,
such as End-to-End Memory Networks (Sukhbaatar
et al. 2015), the Differentiable Neural Computer (Graves et al.
2016), and Gated End-to-End Memory Networks (Liu and
Perez 2017). These models have shown remarkable abili-
ties in tackling difficult computational and reasoning tasks.
Works of the second type use graph structure to incorporate a
stronger relational inductive bias (Battaglia et al. 2018). San-
toro et al. (2017) introduced Relational Networks (RN) and
demonstrated strong relational reasoning capabilities with a
shallow architecture by modelling binary relations between
entity pairs. Palm, Paquet, and Winther (2018) proposed a
graph representation of objects and modelled multi-hop relational
reasoning using a message passing mechanism. Works
of the third type use transformers. Although transformers
have been proven successful in many NLP tasks, they still
struggle with reasoning tasks. van Aken et al. (2019) an-
alyzed the performance of BERT (Devlin et al. 2019) on
bAbI’s tasks and demonstrated that most of BERT’s errors
come from tasks 17 and 19, which require spatial reasoning.
Meanwhile, Dehghani et al. (2019) demonstrated that
standard transformers cannot perform as well as memory-augmented
networks on the bAbI dataset. Moreover, it is
important to note that most of the errors of their proposed
Universal Transformer also come from task 17 and task 19
of the bAbI dataset, which matches our observations on
other transformer-based models. Therefore, spatial reason-
ing tasks are arguably the most challenging tasks in the bAbI
dataset.
Tensor Product Representation. The Tensor Prod-
uct Representation (TPR) (Smolensky 1990; Schlag,
Munkhdalai, and Schmidhuber 2021) is a technique for en-
coding symbolic structural information and modelling sym-
bolic reasoning in vector spaces by learning to deconstruct
natural language statements into combinatorial representa-
tions (Chen et al. 2020b). TPR has been used for tasks that
require deductive reasoning abilities and it is able to repre-
sent entire problem statements to solve math questions in
natural language (Chen et al. 2020b) and generate natural
language captions from images (Huang et al. 2018).
Schlag and Schmidhuber (2018) proposed a gradient-
based RNN with third-order TPR (TPR-RNN), which cre-
ates a vector space embedding of complex symbolic struc-
tures by tensor products and stores these learned represen-
tations into a third-order TPR-like memory. Self-Attentive
Associative Memory (STM) (Le, Tran, and Venkatesh 2020)
utilizes a second-order item memory and a third-order TPR-
like relational memory to simulate the hippocampus, achiev-
ing state-of-the-art performance on the bAbI dataset. De-
spite a gain in the performance on bAbI compared to TPR-
RNN, STM takes a longer time to converge in practice.
Recently, Schlag, Munkhdalai, and Schmidhuber (2021)
compared a concatenated memory $M \in \mathbb{R}^{2d \times d}$ with a
third-order memory $M \in \mathbb{R}^{d^2 \times d}$, and experimental results
indicate a drop in performance when a concatenated memory is
used. However, neither STM nor TPR-RNN processes information
at the paragraph level or allows later modifications
after the information is first stored, as our model does.
Both STM and TPR-RNN use an RNN-like architecture
where each sentence in a paragraph is stored recurrently.
This may result in a long-term dependency problem
(Vaswani et al. 2017), where necessary pieces of information
fail to interact with each other. To solve this issue, our model
introduces an explicit mechanism to update relational
information between entities at the end of each story.
2.2 Background
The Tensor Product Representation (TPR) is a method to
create a vector space embedding of complex symbolic structures
by tensor product. Such a representation can be constructed
as follows:

$$M = \sum_i f_i \otimes r_i = \sum_i f_i r_i^\top = \sum_i (f \otimes r)_{ii}, \tag{1}$$

where $M$ is the TPR, $f = (f_1, \ldots, f_n)$ is a set of $n$ filler
vectors and $r = (r_1, \ldots, r_n)$ is a set of $n$ role vectors. For
each role-filler vector pair, which can be considered as an
entity-relation pair, we bind (or store) them into $M$ by performing
their outer product. Then, given an unbinding role
vector $u_i$ associated with the filler vector $f_i$, $f_i$ can be
recovered by performing:

$$M u_i = \Big[\sum_j f_j \otimes r_j\Big] u_i = \sum_j \alpha_{ij} f_j \propto f_i, \tag{2}$$

where $\alpha_{ij} \neq 0$ if and only if $i = j$. It can be proven that
the recovery is perfect if the role vectors are orthogonal to
each other. In our model, TPR-like binding and unbinding
methods are used to store information in, and retrieve it from,
the TPR $M$, which we will call memory.

Figure 1: An example of the generation of a StepGame sample
with $k = 4$, shown in three steps (Step 1, Step 2, Step 3)
over the entities D, G, B, J, and F.
Story:
1. J is below B.
2. B and G is side by side with B to the left and G to the right.
3. D and G are parallel, and D is on the top of G.
4. F is diagonally to the bottom right of J.
Question: What is the relation of the G to the J?
Answer: Top-right
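The binding and unbinding operations of Equations (1) and (2) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sizes, the random fillers, and the use of QR decomposition to obtain orthonormal role vectors are assumptions made only for illustration. With orthonormal roles, the unbinding vector for a filler is simply its own role vector, and recovery is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_f, d_r = 3, 4, 5  # illustrative: number of pairs, filler size, role size

# Random filler vectors f_i, one per row.
fillers = rng.normal(size=(n, d_f))

# Orthonormal role vectors r_i (assumption): n columns of an orthogonal
# matrix obtained from a QR decomposition.
q_mat, _ = np.linalg.qr(rng.normal(size=(d_r, d_r)))
roles = q_mat[:, :n].T  # shape (n, d_r), rows are orthonormal

# Binding (Equation 1): M = sum_i f_i r_i^T, shape (d_f, d_r).
M = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# Unbinding (Equation 2): with orthonormal roles, u_i = r_i, so that
# M u_i = sum_j (r_j . u_i) f_j = f_i.
recovered = M @ roles[1]

print(np.allclose(recovered, fillers[1]))  # True
```

If the role vectors were only approximately orthogonal, the product would return the target filler plus a small mixture of the other fillers, which is the sense in which recovery is perfect only for orthogonal roles.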
3 The StepGame Dataset
To design a benchmark dataset that explicitly tests models’
spatial reasoning ability and tackles the above-mentioned
problems, we build a new dataset named StepGame, inspired
by the spatial reasoning tasks in the bAbI dataset (Weston
et al. 2016). StepGame is a contextual QA dataset,
where the system is required to interpret a story about several
entities expressed in natural language and answer a
question about the relative position of two of those entities.
Although this reasoning task is trivial for humans, equipping
current NLU models with such spatial ability remains
a challenge. Also, to increase the complexity of this dataset,
we model several forms of distracting noise. Such noise
aims to make the task more difficult and to force machine
learning models that are trained on this dataset to be more robust
in their inference process.
3.1 Template Collection
The aim of this crowdsourcing task is to find out all pos-
sible ways we can describe the positional relationship be-
tween two entities. The crowdworkers from Amazon Me-
chanical Turk were provided with an image visually describ-
ing the spatial relations of two entities and a request to de-
scribe these entities’ relation. This crowdsourcing task was
performed in multiple runs. In the first run, we provided
crowdworkers with an image and two entities (e.g., A and
B) and they were asked to describe their positional relation.
From the data collected in this round, we then manually re-
moved bad answers, and showed the remaining good ones
as positive examples to crowdworkers in the next run. How-
ever, crowdworkers were instructed to avoid repeating them
as an answer to our request. We repeated this process until
no new templates could be collected. In total, after perform-
ing a manual generalization where templates discovered for
a relation were translated to the other relations, we collected
23 templates for left and right relations, 27 templates for top
and down relations, and 26 templates for top-left, top-right,
down-left, and down-right relations.
3.2 Data Generation
The task defined by the StepGame dataset is composed of
several story-question pairs written in natural language. In
its basic form, the story describes a set of $k$ spatial relations
among $k + 1$ entities, and it is structured as a list of $k$
sentences, each talking about 2 entities. There are $k$ relations
and $k + 1$ entities because they define a chain-like shape. The
question requests the relative position of two entities among
the $k + 1$ ones mentioned in the story. An answer is associated
with each story-question pair. This answer can take 9 possible
values: top-left, top-right, top, left, overlap, right, down-left,
down-right, and down, each representing a relative position.
The number of edges between the two entities in the question
($\le k$) determines the number of hops a model has to
perform in order to get to the correct answer.
To generate a story, we follow three steps, as depicted in
Figure 1. Given a value $k$ and a set of entities $E$:
Step 1. We generate a sequence of entities by sampling
a set of $k + 1$ unique entities from $E$. Then $k$ spatial relations
are sampled, one for each pair of adjacent entities in the sequence.
These spatial relations can take any of the 8 possible values:
top, down, left, right, top-left, top-right, down-left, and
down-right. Because the sampling is unconstrained, entities
can overlap with each other. This step results in a sequence
of linked entities that from now on we will call a chain.
Step 2. Two of the chain’s entities are then selected at random
to be used in the question.
Step 3. From the chain generated in Step 1, we translate
the $k$ relations into $k$ sentence descriptions in natural language.
Each description is based on a randomly sampled
crowdsourced template. We then shuffle these $k$ sentences to
avoid potential distributional biases. These $k$ shuffled sentence
descriptions are called a story. From the entities selected
in Step 2, we then generate a question, also in natural
language. Finally, using the chain and the selected entities,
we infer the answer to each story-question pair.
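The three generation steps can be sketched in a few lines of Python. This is a hedged illustration rather than the released generator: the single `TEMPLATE` string is a hypothetical stand-in for the crowdsourced templates, the helper names `label` and `generate` are invented here, and inferring the answer from the sign of the coordinate difference is an assumption about how the chain is resolved.

```python
import random

# Unit offsets of the 8 relations: position of the first entity
# relative to the second one.
RELATIONS = {
    "top": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0),
    "top-left": (-1, 1), "top-right": (1, 1),
    "down-left": (-1, -1), "down-right": (1, -1),
}
# Hypothetical stand-in for the crowdsourced templates.
TEMPLATE = "{a} is to the {rel} of {b}."


def label(dx, dy):
    """Map a coordinate difference to one of the 9 answer labels."""
    vert = "top" if dy > 0 else "down" if dy < 0 else ""
    horiz = "left" if dx < 0 else "right" if dx > 0 else ""
    if vert and horiz:
        return f"{vert}-{horiz}"
    return vert or horiz or "overlap"


def generate(k, entities, rng):
    # Step 1: sample k + 1 unique entities and one relation per adjacent pair.
    chain = rng.sample(entities, k + 1)
    rels = [rng.choice(list(RELATIONS)) for _ in range(k)]
    pos = {chain[0]: (0, 0)}
    for prev, cur, rel in zip(chain, chain[1:], rels):
        dx, dy = RELATIONS[rel]
        pos[cur] = (pos[prev][0] + dx, pos[prev][1] + dy)  # cur is `rel` of prev
    # Step 2: pick the two entities used in the question.
    a, b = rng.sample(chain, 2)
    # Step 3: verbalise the relations, shuffle them, and infer the answer.
    story = [TEMPLATE.format(a=cur, rel=rel, b=prev)
             for prev, cur, rel in zip(chain, chain[1:], rels)]
    rng.shuffle(story)
    question = f"What is the relation of the {a} to the {b}?"
    answer = label(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])
    return story, question, answer


story, question, answer = generate(4, list("ABCDEFGHIJ"), random.Random(0))
```

For the sample in Figure 1, J is below B and G is to the right of B, so G sits at offset (1, 1) from J and `label(1, 1)` gives `top-right`. Because offsets accumulate freely, two entities can end up at the same coordinates, which is where the `overlap` label comes from.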
Given this generation process we can quickly calculate
the complexity of the task before using the templates. This
is possible because entities can overlap. Given $k$ relations,
$k + 1$ entities sampled from $E$ in any order, 8 possible
relations between pairs of entities with 2 ways of describing
them (e.g., A is on the left of B or B is on the right of
A), a random order of the $k$ sentences in the story, and a
question about 2 entities with 2 ways of describing it, the
number of examples that we can generate is equal to:

$$(k+1)! \binom{|E|}{k+1} \cdot 16^k \cdot \frac{k!}{2} \cdot 2 \binom{k+1}{2}. \tag{3}$$

The complexity of the dataset grows exponentially with $k$.
The StepGame dataset uses $|E| = 26$. For $k = 1$ we have
10,400 possible samples, for $k = 2$ we have more than
23 million samples, and so on. The sample complexity of
the problem guarantees that when generating the dataset the
probability of leaking samples from the training set to the
test set diminishes as $k$ increases. Please note that
these calculations do not include templates. If we were to
consider the templates as well, the number of variations of
StepGame would be even larger.

Figure 2: On the left-hand side we have the original chain.
Orange entities are those targeted by the question. Beside it,
we show the same chain with the addition of noise (panels:
Original, Irrelevant Noise, Disconnected Noise, Supporting
Noise). In green we represent irrelevant, disconnected and
supporting entities.
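The two counts quoted above can be checked numerically. The sketch below assumes that Equation (3) reads (k+1)! · C(|E|, k+1) · 16^k · k!/2 · 2 · C(k+1, 2); this reading reproduces both figures given in the text, but the exact placement of the factor 1/2 is an assumption. The function name `num_examples` is introduced here for illustration.

```python
from math import comb, factorial


def num_examples(k, num_entities=26):
    """Number of template-free examples under the assumed reading of Eq. (3)."""
    ordered_entities = factorial(k + 1) * comb(num_entities, k + 1)
    relation_descriptions = 16 ** k   # 8 relations x 2 ways of describing each
    sentence_orders = factorial(k)    # numerator of the k!/2 term
    questions = 2 * comb(k + 1, 2)    # 2 entities, 2 ways of asking
    # The division by 2 is exact because `questions` is always even.
    return (ordered_entities * relation_descriptions
            * sentence_orders * questions) // 2


print(num_examples(1))  # 10400
print(num_examples(2))  # 23961600
```

The count for k = 2 is already about 2,300 times the count for k = 1, which is why overlap between independently sampled training and test sets becomes unlikely beyond the smallest k.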
3.3 Distracting Noise
To make the StepGame more challenging we also include
noisy examples in the test set. We assume that when mod-
els trained on the non-noisy dataset make mistakes on the
noisy test set, these models have failed to learn how to in-
fer spatial relations. We generate three kinds of distracting
noise: disconnected, irrelevant, and supporting. Examples
of all kinds of noise are provided in Figure 2. The irrel-
evant noise extends the original chain by branching it out
with new entities and relations. The disconnected noise adds
to the original chain a new independent chain with new enti-
ties and relations. The supporting noise adds to the original
chain new entities and relations that may provide alternative
reasoning paths. We only add supporting noise into chains
with more than 2 entities. None of these kinds of noise affects
the correct answer. The type and amount of noise
added to each chain are randomly determined. The detailed
statistics for each type of distracting noise are provided in
the Appendix.
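Two of the three kinds of noise can be sketched on a chain stored as (head, relation, tail) triples. This is only an illustration under stated assumptions: the triple representation and the function names `add_irrelevant` and `add_disconnected` are hypothetical, the actual sampling statistics are those reported in the Appendix, and supporting noise is omitted because it must stay geometrically consistent with the original chain.

```python
import random

RELATIONS = ["top", "down", "left", "right",
             "top-left", "top-right", "down-left", "down-right"]


def add_irrelevant(edges, chain_entities, new_entities, rng):
    """Branch one unused entity off an entity that is already in the chain.

    Note: pops the entity it uses from `new_entities`.
    """
    anchor = rng.choice(chain_entities)
    newcomer = new_entities.pop()
    return edges + [(newcomer, rng.choice(RELATIONS), anchor)]


def add_disconnected(edges, new_entities, rng, length=1):
    """Add an independent chain built only from unused entities.

    Note: pops `length + 1` entities from `new_entities`.
    """
    extra = [new_entities.pop() for _ in range(length + 1)]
    links = [(b, rng.choice(RELATIONS), a) for a, b in zip(extra, extra[1:])]
    return edges + links


rng = random.Random(0)
chain = ["J", "B", "G"]
edges = [("J", "down", "B"), ("G", "right", "B")]
unused = ["X", "Y", "Z"]
noisy = add_irrelevant(edges, chain, unused, rng)
noisy = add_disconnected(noisy, unused, rng)
```

In both cases the new entities are either leaves hanging off the chain or part of a separate chain, so the relative position of the two question entities is untouched, in line with the statement that noise has no impact on the correct answer.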
4 The TP-MANN Model
In this section we introduce the proposed TP-MANN model,
as shown in Figure 3. The model comprises three major
components: a question and story encoder, a recurrent mem-
ory module, and a relation decoder. The encoder learns to
represent entities and relations for each sentence in a story.
The recurrent memory module learns to store entity-relation
pair representations into the memory independently. It also
updates the entity-relation pair representations based on the
current memory and stores the inferred information. The decoder
learns to represent the question and, using the information
stored in the memory, recurrently infers the spatial
relation of the two entities mentioned in the question.
It has also been shown that learned representations in the
TPR-like memory can be orthogonal (Schlag and Schmidhuber
2018). We use an example to illustrate the inspiration
behind this architecture. When a person goes back to her
hometown and sees an old tree, her happy childhood memory
of playing with her friends under that tree may be recalled.
However, this memory may not be recalled unless triggered by
the appearance of the old tree. In our model, unbinding vectors
in the decoder module play the role of the old tree in the
example: unbinding vectors are learned based on the target
questions. The decoder module unbinds relevant memories given
a question via a recurrent mechanism. Moreover, although
memories are stored separately, there are integration processes
in brains that retrieve information via a recursive mechanism.
This allows episodes in memories to interact with each
other (Kumaran and McClelland 2012; Schapiro et al. 2017;
Koster et al. 2018).

Figure 3: The TP-MANN architecture, consisting of an encoder,
a recurrent memory, and a decoder. PE stands for positional
encoder and LN for layer normalization; the remaining boxes
denote a feed-forward neural network, the outer-product
operator, and the inner-product operator, and implement the
formulae presented in Section 4. Lines indicate the flow of
information. Those without an arrow indicate which symbols
are taken as input and are output by their box.
Encoder. The input of the encoder is a story and a question.
Consider an input story $S = (s_1, \ldots, s_m)$ with $m$ sentences
and a question $q$, both described by words in a vocabulary
$V$. Each sentence $s_i = (w_1, \ldots, w_n)$ is mapped to learnable
embeddings $(w'_1, \ldots, w'_n)$. Then, a positional encoding
(PE) is applied to each word embedding, and the results are
averaged together: $s'_i = \frac{1}{n} \sum_{j=1}^{n} w'_j \cdot p_j$,
where $\{p_1, \ldots, p_n\}$ are learnable position vectors and
$\cdot$ is the element-wise product. This
operation defines $S \in \mathbb{R}^{m \times d}$, where each row of $S$
represents an encoded sentence and $d$ is the dimension of a word
embedding. We convert the input question to a vector
$q \in \mathbb{R}^{d}$ in the same way. For each sentence of the story in
$S$, we learn entity and relation representations as:

$$E_i = f_{e_i}(S), \quad i = 1, 2, \tag{4}$$
$$R_j = f_{r_j}(S), \quad j = 1, 2, 3, \tag{5}$$

where $f_{e_i}$ are feed-forward neural networks that output entity
representations $E_i \in \mathbb{R}^{m \times d_e}$ and $f_{r_j}$ are feed-forward
neural networks that output relation representations
$R_j \in \mathbb{R}^{m \times d_r}$. Finally, we define three search keys $K$ as:

$$K_1 = E_1 \otimes R_1, \tag{6}$$
$$K_2 = E_1 \otimes R_2, \tag{7}$$
$$K_3 = E_2 \otimes R_3, \tag{8}$$

where $K_1, K_2, K_3 \in \mathbb{R}^{m \times d_e \times d_r}$. Keys will be used to
manipulate the memory in the next module and retrieve potential
existing associations for each entity-relation pair.
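The shapes involved in the encoder can be traced with a small NumPy sketch. It is an illustration under assumptions, not the authors' code: all sizes are arbitrary, random arrays stand in for learned embeddings and position vectors, every sentence is assumed to have exactly n words (no padding), and the helper `linear` is a hypothetical one-layer stand-in for the feed-forward networks.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: sentences, words per sentence, embedding, entity, relation.
m, n, d, d_e, d_r = 4, 6, 8, 5, 3

word_emb = rng.normal(size=(m, n, d))  # stand-in for learnable word embeddings
pos_vec = rng.normal(size=(n, d))      # stand-in for learnable position vectors

# Sentence encoding: element-wise product with the position vectors,
# then the mean over the n words of each sentence.
S = (word_emb * pos_vec).mean(axis=1)  # shape (m, d)


def linear(d_in, d_out):
    """Hypothetical one-layer stand-in for a feed-forward network."""
    W = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)
    return lambda x: np.tanh(x @ W)


E1, E2 = (linear(d, d_e)(S) for _ in range(2))      # Equation 4, each (m, d_e)
R1, R2, R3 = (linear(d, d_r)(S) for _ in range(3))  # Equation 5, each (m, d_r)


def outer(E, R):
    """Per-sentence outer product, as in Equations 6 to 8."""
    return np.einsum("me,mr->mer", E, R)


K1, K2, K3 = outer(E1, R1), outer(E1, R2), outer(E2, R3)
print(S.shape, K1.shape)  # (4, 8) (4, 5, 3)
```

Each key therefore holds one entity-relation outer product per sentence, which is what later allows the memory to be queried sentence by sentence.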
Recurrent Memory Module. To allow stored pieces of information
to interact with each other, we use a recurrent architecture
with $T$ recurrent layers to update the TPR-like memory
representation $M \in \mathbb{R}^{d_e \times d_r \times d_e}$, where $M$ contains
trainable parameters. Through this recurrent architecture,
existing episodes stored in memory can interact with new
inferences to generate new episodes. Unlike many models, such
as the Transformer (Vaswani et al. 2017) and graph-based
models (Kipf and Welling 2017; Velickovic et al. 2018),
where adding more layers leads to a larger number of
trainable parameters, our model does not increase the
number of trainable parameters as the number of recurrent
layers increases.
At each layer $t$, given the keys $K$, we extract pseudo-entities
$P$ for each sentence in $S$. In the first layer ($t = 0$),
since there is no previous information in memory
$M_0$, the model just converts each sentence in $S$ into an
episode and stores it in memory ($M_1$). Then, at the later layers
($t > 0$), pseudo-entities $P$ build bridges between episodes
in the current memory $M_t$ and allow them to interact with
potential entity-relation associations:

$$P_{jt} = K_j \cdot M_t, \quad j = 1, 2, 3, \tag{9}$$

where $P_{jt} \in \mathbb{R}^{m \times d_e}$ is obtained by an inner product
between each key and the memory. We then construct the memory
episodes that need to be updated or removed. This is done
after the first storage at $t = 0$ so that all story information
is already available in $M$. These old episodes,
$O_{jt} \in \mathbb{R}^{d_e \times d_r \times d_e}$, will be updated or removed to avoid
memory conflicts that may occur when receiving new information:

$$O_{jt} = K_j \otimes P_{jt}, \quad j = 1, 2, 3. \tag{10}$$

Afterwards, new episodes $N_1$, $N_{2t}$, and
$N_3 \in \mathbb{R}^{d_e \times d_r \times d_e}$ will be added into the memory:

$$N_1 = K_1 \otimes E_2, \tag{11}$$
$$N_{2t} = K_2 \otimes P_{1t}, \tag{12}$$
$$N_3 = K_3 \otimes E_1. \tag{13}$$

Then we apply this change to the now outdated memory $M_t$ by
removing (subtracting) the old episodes and adding the new ones:

$$M_{t+1} = \mathrm{LN}(M_t + N_1 + N_{2t} + N_3 - O_{1t} - O_{2t} - O_{3t}), \tag{14}$$

where LN is a layer normalization.
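One way to read the update rule of Equations (9) to (14) is as a pair of tensor contractions, sketched below in NumPy. Several points are assumptions rather than facts from the paper: the memory starts from zeros, the outer products are summed over the m sentences (which the stated memory shape requires), layer normalization acts on the last axis without learned gain or bias, and random arrays stand in for the encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_e, d_r, T = 4, 5, 3, 3  # illustrative sizes and number of recurrent layers

# Random stand-ins for the encoder outputs.
E1, E2 = rng.normal(size=(2, m, d_e))
K1, K2, K3 = rng.normal(size=(3, m, d_e, d_r))


def layer_norm(x, eps=1e-5):
    # Assumption: normalise over the last axis, no learnable gain or bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def read(K, M):
    """Equation 9: contract each key with the memory, result (m, d_e)."""
    return np.einsum("mer,erf->mf", K, M)


def write(K, V):
    """Outer product of keys and values, summed over sentences."""
    return np.einsum("mer,mf->erf", K, V)


M = np.zeros((d_e, d_r, d_e))  # assumption: start from an empty memory
for t in range(T):
    P1, P2, P3 = read(K1, M), read(K2, M), read(K3, M)
    old = write(K1, P1) + write(K2, P2) + write(K3, P3)  # O_1t + O_2t + O_3t
    new = write(K1, E2) + write(K2, P1) + write(K3, E1)  # N_1 + N_2t + N_3
    M = layer_norm(M + new - old)                        # Equation 14

print(M.shape)  # (5, 3, 5)
```

With an empty memory at t = 0, every read returns zeros, so the first pass only stores N_1 and N_3, which matches the description of the first layer. The loop reuses the same keys at every layer, so increasing T adds computation but no parameters.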
Decoder. The prediction is computed based on the constructed
memory $M$ at the last layer and a question vector $q$.
To do this we follow the same procedure designed by Schlag
and Schmidhuber (2018):

$$U_j = f^u_j(q), \quad j = 1, 2, 3, 4, \tag{15}$$

where $f^u_1$ is a feed-forward neural network that outputs a
$d_e$-dimensional unbinding vector, and $f^u_2, f^u_3, f^u_4$ are
feed-forward neural networks that output $d_r$-dimensional
unbinding vectors. Then, the information stored in $M$ is
retrieved in a recurrent way based on unbinding vectors
learned from the question:

$$I_1 = \mathrm{LN}(M_T \cdot U_1) \cdot U_2, \tag{16}$$
$$I_2 = \mathrm{LN}(M_T \cdot I_1) \cdot U_3, \tag{17}$$
$$I_3 = \mathrm{LN}(M_T \cdot I_2) \cdot U_4, \tag{18}$$
$$\hat{v} = \mathrm{softmax}\Big(W_o \cdot \sum_{j=1}^{3} I_j\Big). \tag{19}$$

A linear projection with trainable parameters
$W_o \in \mathbb{R}^{|V| \times d_e}$ and a softmax function are used to map
the extracted information into $\hat{v} \in \mathbb{R}^{|V|}$. Hence, the
decoder module outputs a probability distribution over the
terms of the vocabulary $V$.
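The recurrent read-out of Equations (15) to (19) can be sketched as three chained "hops" through the memory. As before, this is an illustration under assumptions: random arrays replace the trained memory, question vector, and networks, single linear maps stand in for the four unbinding networks, the vocabulary size is arbitrary, and layer normalization is taken over the last axis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_e, d_r, vocab = 8, 5, 3, 20  # illustrative sizes

M_T = rng.normal(size=(d_e, d_r, d_e))  # stand-in for the final memory
q = rng.normal(size=d)                  # stand-in for the encoded question


def layer_norm(x, eps=1e-5):
    # Assumption: normalise over the last axis, no learnable gain or bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


# Hypothetical linear stand-ins for the unbinding networks (Equation 15).
U1 = q @ rng.normal(size=(d, d_e))
U2, U3, U4 = (q @ rng.normal(size=(d, d_r)) for _ in range(3))


def hop(query, unbind):
    """Contract the memory with an entity-sized query, normalise,
    then contract with a relation-sized unbinding vector."""
    partial = layer_norm(np.einsum("erf,e->rf", M_T, query))
    return np.einsum("rf,r->f", partial, unbind)


I1 = hop(U1, U2)  # Equation 16
I2 = hop(I1, U3)  # Equation 17
I3 = hop(I2, U4)  # Equation 18

W_o = rng.normal(size=(vocab, d_e))
logits = W_o @ (I1 + I2 + I3)
v_hat = np.exp(logits - logits.max())   # numerically stable softmax
v_hat /= v_hat.sum()                    # Equation 19

print(v_hat.shape)  # (20,)
```

Each hop returns a vector of size d_e, so its output can be fed back in as the query of the next hop; this is what lets the decoder follow a chain of stored associations rather than a single one.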
5 Experiments and Results
In this section we aim to address the following research
questions: (RQ1) What is the degree of data leakage in the
datasets? (RQ2) How does our model behave with respect
to state-of-the-art NLU models in spatial reasoning tasks?
(RQ3) How do these models behave when tested on exam-
ples more challenging than those used for training? (RQ4)
What is the effect of the number of recurrent-layers in the
recurrent memory module? Before answering these ques-
tions, we first present the material and baselines used in
our experiments. The software and data are available at:
https://github.com/ZhengxiangShi/StepGame
5.1 Material and Baselines
In the following experiments we will use two datasets, the
bAbI dataset and the StepGame dataset. For the bAbI dataset
we only focus on task 17 and task 19 and use the original
train and test splits made of 10 000 samples for the train-
ing set and 1 000 for the validation and test sets. For the
StepGame dataset, we generate a training set made of samples
with $k$ varying from 1 to 5 in steps of 1, and a test set with
$k$ varying from 1 to 10. Moreover, the test set also contains
distracting noise. The final dataset consists of, for each
value of $k$, 10 000 training samples, 1 000 validation samples,
and 10 000 test samples.
Model      Task 17       Task 19       Mean
RN         97.33±1.55    98.63±1.79    97.98
RRN        97.80±2.34    49.80±5.76    73.80
STM        97.80±1.06    99.98±0.05    98.89
UT         98.60±3.40    93.90±7.30    96.25
TPR-RNN    97.55±1.99    99.95±0.06    98.75
Ours       99.88±0.10    99.98±0.04    99.93
Table 1: Test accuracy (%) on tasks 17 and 19 of the bAbI
dataset: Mean±Std over 5 runs.
We compare our model against five baselines: Recur-
rent Relational Networks (RRN) (Palm, Paquet, and Winther
2018), Relational Network (RN) (Santoro et al. 2017), TPR-
RNN (Schlag and Schmidhuber 2018), Self-attentive Asso-
ciative Memory (STM) (Le, Tran, and Venkatesh 2020), and
Universal Transformer (UT) (Dehghani et al. 2019). Each
model is trained and validated on each dataset independently
following the hyper-parameter ranges and procedures pro-
vided in their original papers. All training details, including
those for our model, are reported in the Appendix.
5.2 Training-Test Leakage
To answer RQ1 we have calculated the degree of data leakage
present in the bAbI and StepGame datasets. For task
17, we counted how many samples in the test set appear also
in the training set: 23.2% of the test samples are also in the
training set. For task 19, for each sample we extracted the
relevant sentences in the stories (i.e., those sentences neces-
sary to answer the question correctly) and questions. Then
we counted how many such pairs in the test set appear in
the training set: 80.2% of the pairs overlap with pairs in the
training set. For the StepGame dataset, for each sample we
extracted the sentences in the stories and questions. The sen-
tences in the story are sorted in lexicographical order. Then
we counted how many such pairs in the test set appear also
in the training set before adding distracting noise and using
the templates: 1.09% of the pairs overlap with triples in the
training set. However, such overlap is all produced by the
samples with k= 1, which due to their limited number have
a higher chance of being included in the test set. If we re-
move those examples, the overlap between training and test
sets drops to 0%.
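The overlap count described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the sample format (a list of story sentences plus a question string), the helper names canonical_key and overlap_rate, and the toy sentences are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical sample format: (story sentences, question).
Sample = Tuple[List[str], str]


def canonical_key(sample: Sample) -> Tuple[Tuple[str, ...], str]:
    """Build an order-independent key for a (story, question) pair."""
    story, question = sample
    # Sort sentences lexicographically so that the same facts presented
    # in a different order map to the same key.
    return (tuple(sorted(s.strip() for s in story)), question.strip())


def overlap_rate(train: Iterable[Sample], test: Iterable[Sample]) -> float:
    """Percentage of test samples whose key also occurs in the training set."""
    train_keys = {canonical_key(s) for s in train}
    test_list = list(test)
    if not test_list:
        return 0.0
    hits = sum(canonical_key(s) in train_keys for s in test_list)
    return 100.0 * hits / len(test_list)


# Toy data (invented for illustration only).
train = [
    (["A is left of B.", "B is above C."], "What is the relation of A to C?"),
    (["D is below E."], "What is the relation of D to E?"),
]
test = [
    # Same facts as the first training sample, in a different order.
    (["B is above C.", "A is left of B."], "What is the relation of A to C?"),
    (["F is right of G."], "What is the relation of F to G?"),
]

print(f"{overlap_rate(train, test):.1f}% of test samples also appear in training")
```

Sorting the sentences before comparison is what makes two stories with identical facts but different sentence order count as the same sample.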
5.3 Spatial Inference
To answer RQ2 and judge the spatial inference ability of our
model and the baselines we train them on the bAbI and the
StepGame datasets and compare them by measuring their
test accuracy.
In Table 1 we present the results of our model and the
baselines on tasks 17 and 19 of the bAbI dataset. The
performance of our model is slightly better than that of the
best baseline. However, due to the issues of the bAbI dataset,
these results are not enough to firmly answer RQ2.
In Table 2 we present the results for the StepGame dataset.
In this dataset, the training set is without noise but the test
set contains distracting noise. In the table we break down the
performance of the trained models across k. In the last column
we report the average performance across k. Our model
outperforms all the baseline models. Compared to Table 1, the
decreased accuracy in Table 2 demonstrates the difficulty of
spatial reasoning with distracting noise. It is not surprising
that the performance of all five baseline models decreases
when k increases, that is, when the number of required
inference hops increases. We also report test accuracy on test
sets without distracting noise in the Appendix.
Model                                      k=1           k=2           k=3           k=4           k=5           Mean
RN (Santoro et al. 2017)                   22.64±0.25    17.08±1.41    15.08±2.58    12.84±2.27    11.52±1.73    15.83
RRN (Palm, Paquet, and Winther 2018)       24.05±4.48    19.98±4.68    16.03±2.89    13.22±2.51    12.31±2.16    17.12
UT (Dehghani et al. 2019)                  45.11±4.16    28.36±4.50    17.41±2.18    14.07±2.87    13.45±1.35    23.68
STM (Le, Tran, and Venkatesh 2020)         53.42±3.73    35.96±4.45    23.03±1.83    18.45±1.87    15.14±1.56    29.20
TPR-RNN (Schlag and Schmidhuber 2018)      70.29±3.03    46.03±2.24    36.14±2.66    26.82±2.64    24.77±2.75    40.81
Ours                                       85.77±3.18    60.31±2.23    50.18±2.65    37.45±4.21    31.25±3.38    52.99
Table 2: Test accuracy on the StepGame dataset: Mean±Std over 5 runs.
Model                                      k=6           k=7           k=8           k=9           k=10          Mean
RN (Santoro et al. 2017)                   11.12±0.96    11.53±0.70    11.21±0.98    11.13±1.00    11.34±0.87    11.27
RRN (Palm, Paquet, and Winther 2018)       11.62±0.80    11.40±0.76    11.83±0.75    11.22±0.86    11.69±1.40    11.56
UT (Dehghani et al. 2019)                  12.73±2.37    12.11±1.52    11.40±0.92    11.41±0.96    11.74±1.07    11.88
STM (Le, Tran, and Venkatesh 2020)         13.80±1.95    12.63±1.69    11.54±1.61    11.30±1.13    11.77±0.93    12.21
TPR-RNN (Schlag and Schmidhuber 2018)      22.25±3.12    19.88±2.80    15.45±2.98    13.01±2.28    12.65±2.71    16.65
Ours                                       28.53±3.59    26.45±2.95    23.67±2.78    22.52±2.36    21.46±1.72    24.53
Table 3: Test accuracy on StepGame for larger ks (only on the test set). Mean±Std over 5 runs.
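As a quick arithmetic check, the Mean column of Table 2 can be reproduced as the unweighted average of the per-k accuracies. The sketch below uses the values reported in Table 2 for our model and for TPR-RNN; the helper name mean_over_k is hypothetical, and the assumption that the column is an unweighted mean rounded to two decimals is ours, although it matches the reported numbers.

```python
from statistics import mean

# Per-k mean test accuracies (in %) copied from Table 2.
table2_ours = {1: 85.77, 2: 60.31, 3: 50.18, 4: 37.45, 5: 31.25}
table2_tpr_rnn = {1: 70.29, 2: 46.03, 3: 36.14, 4: 26.82, 5: 24.77}


def mean_over_k(acc_by_k):
    """Unweighted average accuracy across all k values, rounded to 2 decimals."""
    return round(mean(acc_by_k.values()), 2)


ours = mean_over_k(table2_ours)        # 52.99, as in the Mean column
tpr_rnn = mean_over_k(table2_tpr_rnn)  # 40.81, as in the Mean column
print(f"Ours: {ours}, TPR-RNN: {tpr_rnn}, gap: {ours - tpr_rnn:.2f} points")
```

Because every k has the same number of test samples, the unweighted mean over k equals the overall accuracy on the pooled test set.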
5.4 Systematic Generalization
To answer RQ3 we generate new StepGame test sets with
k ∈ {6, 7, 8, 9, 10} with distracting noise. We then test all
the models jointly trained on the StepGame training set with
k ∈ {1, 2, 3, 4, 5} as in Section 5.3. We can consider this
experiment as a zero-shot learning setting for larger ks.
In Table 3 we present the performance of different models
on this generalization task. Not surprisingly, the performance
of all models generally degrades as we increase k, with the
weakest baselines fluctuating around 11% for every k. RN, RRN,
UT and STM fail to generalize to the test sets with higher k
values, while our model is more robust and outperforms the
baseline models by a large margin. This demonstrates the better
generalization ability of our model, which performs well on
longer stories never seen during training.
5.5 Inference Analysis
To answer RQ4, we conduct an analysis of the hyper-parameter
T, the number of recurrent layers in our model.
We jointly train TP-MANN on the StepGame dataset with k
between 1 and 5 and with T between 1 and 6, and report the
test accuracy broken down by each value of k. These results are
shown in the left-hand panel of Figure 4. The test sets with
higher k benefit more from a higher number of recurrent layers
than those with lower k, indicating that recurrent layers are
critical for multi-hop reasoning. We also analyze how the
recurrent layer structure affects systematic generalization.
To do this we also test on a StepGame test set with k between
6 and 10 with noise. These ks are larger than the largest k
used during training. These results are shown in the right-hand
panel of Figure 4. Here we see that as T increases, the
performance of the model improves. This analysis further
corroborates that our recurrent structure supports multi-hop
inference. It is worth noting that the number of trainable
parameters in our model remains unchanged as T increases.
Interestingly, we find that the number of recurrent layers
needed to solve the task is less than the length of the stories
k, suggesting that the inference process may happen in
parallel.
Figure 4: Analysis of TP-MANN's number of recurrent layers
(T). The x-axis is the value of T with which the model has been
trained. Each line represents a different value of k of the
StepGame dataset.
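The observation that the parameter count does not grow with T is what one would expect from a weight-tied recurrence, where the same update is applied T times. The sketch below illustrates only that general idea. The residual tanh update, the dimension d = 8, and the names make_params, step, and run are assumptions for illustration and are not the update rule of TP-MANN.

```python
import math
import random


def make_params(d, seed=0):
    """Create one shared weight matrix W (d x d) and bias b (d)."""
    rng = random.Random(seed)
    return {
        "W": [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d)],
        "b": [0.0] * d,
    }


def count_params(params):
    """Total number of trainable scalars in W and b."""
    return sum(len(row) for row in params["W"]) + len(params["b"])


def step(state, params):
    """One residual update: state + tanh(W @ state + b)."""
    d = len(state)
    pre = [
        sum(params["W"][i][j] * state[j] for j in range(d)) + params["b"][i]
        for i in range(d)
    ]
    return [s + math.tanh(p) for s, p in zip(state, pre)]


def run(state, params, T):
    """Apply the same update T times; the parameters are reused at every step."""
    for _ in range(T):
        state = step(state, params)
    return state


d = 8
params = make_params(d)
state = [1.0] * d

for T in (1, 3, 6):
    out = run(state, params, T)
    # The parameter count depends only on d, not on T.
    print(f"T={T}: output size {len(out)}, parameters {count_params(params)}")
```

With tied weights, increasing T adds computation but no parameters, so differences in accuracy across T cannot be attributed to a larger model.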
6 Conclusion
In this paper, we proposed a new dataset named StepGame
that requires a robust multi-hop spatial reasoning ability to
be solved and mitigates the issues observed in the bAbI
dataset. Then, we introduced TP-MANN, a tensor product-
based memory-augmented neural network architecture that
achieves state-of-the-art performance on both datasets. Fur-
ther analysis also demonstrated the importance of a recurrent
memory module for multi-hop reasoning.
References
Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.;
Sünderhauf, N.; Reid, I. D.; Gould, S.; and van den Hengel,
A. 2018. Vision-and-Language Navigation: Interpreting
Visually-Grounded Navigation Instructions in Real Environ-
ments. In 2018 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA,
June 18-22, 2018, 3674–3683. IEEE Computer Society.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-
Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.;
Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational
inductive biases, deep learning, and graph networks. arXiv
preprint arXiv:1806.01261.
Bisk, Y.; Shih, K. J.; Choi, Y.; and Marcu, D. 2018. Learning
Interpretable Spatial Operations in a Rich 3D Blocks World.
In Proceedings of the Thirty-Second AAAI Conference on
Artificial Intelligence, (AAAI-18), the 30th innovative Ap-
plications of Artificial Intelligence (IAAI-18), and the 8th
AAAI Symposium on Educational Advances in Artificial In-
telligence (EAAI-18), New Orleans, Louisiana, USA, Febru-
ary 2-7, 2018. AAAI Press.
Chen, C.-H.; Fu, Y.-F.; Cheng, H.-H.; and Lin, S.-D. 2020a.
Unseen Filler Generalization In Attention-based Natural
Language Reasoning Models. In 2020 IEEE Second In-
ternational Conference on Cognitive Machine Intelligence
(CogMI), 42–51. IEEE.
Chen, H.; Suhr, A.; Misra, D.; Snavely, N.; and Artzi, Y.
2019. TOUCHDOWN: Natural Language Navigation and
Spatial Reasoning in Visual Street Environments. In IEEE
Conference on Computer Vision and Pattern Recognition,
CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Com-
puter Vision Foundation / IEEE.
Chen, K.; Huang, Q.; Palangi, H.; Smolensky, P.; Forbus,
K. D.; and Gao, J. 2020b. Mapping natural-language prob-
lems to formal-language solutions using structured neural
representations. In Proceedings of the 37th International
Conference on Machine Learning, ICML 2020, 13-18 July
2020, Virtual Event, Proceedings of Machine Learning Re-
search. PMLR.
Dehghani, M.; Gouws, S.; Vinyals, O.; Uszkoreit, J.; and
Kaiser, L. 2019. Universal Transformers. In 7th Interna-
tional Conference on Learning Representations, ICLR 2019,
New Orleans, LA, USA, May 6-9, 2019.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019.
BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers). Association for
Computational Linguistics.
Ding, D.; Hill, F.; Santoro, A.; and Botvinick, M. 2020.
Object-based attention for spatio-temporal reasoning: Out-
performing neuro-symbolic models with flexible distributed
architectures. arXiv preprint arXiv:2012.08508.
Gershman, S.; and Tenenbaum, J. B. 2015. Phrase similar-
ity in humans and machines. In Proceedings of the 37th
Annual Meeting of the Cognitive Science Society, CogSci
2015, Pasadena, California, USA, July 22-25, 2015. cogni-
tivesciencesociety.org.
Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka,
I.; Grabska-Barwińska, A.; Colmenarejo, S. G.; Grefenstette,
E.; Ramalho, T.; Agapiou, J.; et al. 2016. Hybrid
computing using a neural network with dynamic external
memory. Nature.
Huang, Q.; Smolensky, P.; He, X.; Deng, L.; and Wu, D.
2018. Tensor Product Generation Networks for Deep NLP
Modeling. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume
1 (Long Papers). New Orleans, Louisiana: Association for
Computational Linguistics.
Janner, M.; Narasimhan, K.; and Barzilay, R. 2018. Repre-
sentation Learning for Grounded Spatial Reasoning. Trans-
actions of the Association for Computational Linguistics.
Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Clas-
sification with Graph Convolutional Networks. In 5th In-
ternational Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings.
Koster, R.; Chadwick, M. J.; Chen, Y.; Berron, D.; Banino,
A.; Düzel, E.; Hassabis, D.; and Kumaran, D. 2018. Big-loop
recurrence within the hippocampal system supports
integration of information across episodes. Neuron, 99(6):
1342–1354.
Kruijff, G.-J. M.; Zender, H.; Jensfelt, P.; and Christensen,
H. I. 2007. Situated dialogue and spatial organization: What,
where. . . and why? International Journal of Advanced
Robotic Systems.
Kumaran, D.; and McClelland, J. L. 2012. Generalization
through the recurrent interaction of episodic memories: a
model of the hippocampal system. Psychological review.
Landsiedel, C.; Rieser, V.; Walter, M.; and Wollherr, D.
2017. A review of spatial reasoning and interaction for real-
world robotics. Advanced Robotics.
Le, H.; Tran, T.; and Venkatesh, S. 2020. Self-Attentive As-
sociative Memory. In Proceedings of the 37th International
Conference on Machine Learning, ICML 2020, 13-18 July
2020, Virtual Event, volume 119 of Proceedings of Machine
Learning Research, 5682–5691. PMLR.
Liu, F.; and Perez, J. 2017. Gated End-to-End Memory Net-
works. In Proceedings of the 15th Conference of the Euro-
pean Chapter of the Association for Computational Linguis-
tics: Volume 1, Long Papers. Valencia, Spain: Association
for Computational Linguistics.
Mirzaee, R.; Faghihi, H. R.; Ning, Q.; and Kordjamshidi, P.
2021. SPARTQA: A Textual Question Answering Bench-
mark for Spatial Reasoning. In Proceedings of the 2021
Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Tech-
nologies, 4582–4598.
Palm, R. B.; Paquet, U.; and Winther, O. 2018. Recur-
rent Relational Networks. In Bengio, S.; Wallach, H. M.;
Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Gar-
nett, R., eds., Advances in Neural Information Processing
Systems 31: Annual Conference on Neural Information Pro-
cessing Systems 2018, NeurIPS 2018, December 3-8, 2018,
Montréal, Canada, 3372–3382.
Petruck, M. R. L.; and Ellsworth, M. J. 2018. Represent-
ing Spatial Relations in FrameNet. In Proceedings of the
First International Workshop on Spatial Language Under-
standing, 41–45. New Orleans: Association for Computa-
tional Linguistics.
Pustejovsky, J. 1989. Language and Spatial Cognition. Com-
putational Linguistics, 15(3).
Pustejovsky, J.; Kordjamshidi, P.; Moens, M.-F.; Levine, A.;
Dworman, S.; and Yocum, Z. 2015. SemEval-2015 Task 8:
SpaceEval. In Proceedings of the 9th International Work-
shop on Semantic Evaluation (SemEval 2015). Denver, Col-
orado: Association for Computational Linguistics.
Santoro, A.; Raposo, D.; Barrett, D. G. T.; Malinowski, M.;
Pascanu, R.; Battaglia, P. W.; and Lillicrap, T. 2017. A
simple neural network module for relational reasoning. In
Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.;
Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds.,
Advances in Neural Information Processing Systems 30: An-
nual Conference on Neural Information Processing Systems
2017, December 4-9, 2017, Long Beach, CA, USA, 4967–
4976.
Schapiro, A. C.; Turk-Browne, N. B.; Botvinick, M. M.;
and Norman, K. A. 2017. Complementary learning systems
within the hippocampus: a neural network modelling ap-
proach to reconciling episodic memory with statistical learn-
ing. Philosophical Transactions of the Royal Society B: Bi-
ological Sciences.
Schlag, I.; Munkhdalai, T.; and Schmidhuber, J. 2021.
Learning Associative Inference Using Fast Weight Memory.
In 9th International Conference on Learning Representa-
tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
Schlag, I.; and Schmidhuber, J. 2018. Learning to Reason
with Third Order Tensor Products. In Bengio, S.; Wallach,
H. M.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and
Garnett, R., eds., Advances in Neural Information Process-
ing Systems 31: Annual Conference on Neural Information
Processing Systems 2018, NeurIPS 2018, December 3-8,
2018, Montréal, Canada, 10003–10014.
Smolensky, P. 1990. Tensor product variable binding and the
representation of symbolic structures in connectionist sys-
tems. Artificial intelligence, 46(1-2): 159–216.
Sukhbaatar, S.; Szlam, A.; Weston, J.; and Fergus, R. 2015.
End-To-End Memory Networks. In Advances in Neural In-
formation Processing Systems 28: Annual Conference on
Neural Information Processing Systems 2015, December 7-
12, 2015, Montreal, Quebec, Canada, 2440–2448.
Talmor, A.; and Berant, J. 2018. The Web as a Knowledge-
Base for Answering Complex Questions. In Walker, M. A.;
Ji, H.; and Stent, A., eds., Proceedings of the 2018 Confer-
ence of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies,
NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-
6, 2018, Volume 1 (Long Papers), 641–651. Association for
Computational Linguistics.
Tan, H.; and Bansal, M. 2018. Source-Target Inference
Models for Spatial Instruction Understanding. In McIl-
raith, S. A.; and Weinberger, K. Q., eds., Proceedings of the
Thirty-Second AAAI Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Artificial In-
telligence (IAAI-18), and the 8th AAAI Symposium on Edu-
cational Advances in Artificial Intelligence (EAAI-18), New
Orleans, Louisiana, USA, February 2-7, 2018, 5504–5511.
AAAI Press.
Tversky, B. 2019. Mind in motion: How action shapes
thought. Hachette UK.
van Aken, B.; Winter, B.; Löser, A.; and Gers, F. A. 2019.
How Does BERT Answer Questions?: A Layer-Wise Anal-
ysis of Transformer Representations. In Proceedings of
the 28th ACM International Conference on Information and
Knowledge Management, CIKM.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. At-
tention is All you Need. In Advances in Neural Information
Processing Systems 30: Annual Conference on Neural Infor-
mation Processing Systems.
Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò,
P.; and Bengio, Y. 2018. Graph Attention Networks. In
6th International Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018,
Conference Track Proceedings. OpenReview.net.
Vogel, A.; and Jurafsky, D. 2010. Learning to Follow Nav-
igational Directions. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics,
806–814. Uppsala, Sweden: Association for Computational
Linguistics.
Welbl, J.; Stenetorp, P.; and Riedel, S. 2018. Constructing
datasets for multi-hop reading comprehension across docu-
ments. Transactions of the Association for Computational
Linguistics, 6: 287–302.
Weston, J.; Bordes, A.; Chopra, S.; and Mikolov, T. 2016.
Towards AI-Complete Question Answering: A Set of Pre-
requisite Toy Tasks. In Bengio, Y.; and LeCun, Y., eds.,
4th International Conference on Learning Representations,
ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Confer-
ence Track Proceedings.
Yang, T.-Y.; Lan, A.; and Narasimhan, K. 2020. Robust and
Interpretable Grounding of Spatial References with Relation
Networks. In Findings of the Association for Computational
Linguistics: EMNLP 2020, 1908–1923. Online: Association
for Computational Linguistics.
Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W. W.;
Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA:
A Dataset for Diverse, Explainable Multi-hop Question An-
swering. In Proceedings of the 2018 Conference on Em-
pirical Methods in Natural Language Processing, Brussels,
Belgium, October 31 - November 4, 2018, 2369–2380. As-
sociation for Computational Linguistics.
... Traditional approaches in LLMs mainly rely on free-form prompting in a single call to LLMs for facilitating spatial reasoning. However, these methods have demonstrated notable limitations, particularly on challenging datasets like StepGame (Shi et al., 2022) and SparQA (Mirzaee and Kordjamshidi, 2022), which require multi-step planning. In these scenarios, LLMs often struggle with maintaining coherence, frequently hallucinating or losing sight of the original objectives, resulting in inaccurate and unreliable outputs. ...
... The two test benchmark datasets are taken to help evaluate the effectivenss of our strategies comprehensively: StepGame (Shi et al., 2022), and SparQA (Mirzaee and Kordjamshidi, 2022). The following provides a detailed account of the two datasets. ...
... StepGame (Shi et al., 2022) is a synthetic spatial question answering dataset featuring Finding Relations questions that require between 1 to 10 reasoning steps to answer. It employs eight spatial relations (top, down, left, right, top-left, top-right, down-left, and down-right) for story generation. ...
Preprint
Full-text available
Large Language Models (LLMs) have demonstrated impressive capabilities across various tasks. However, LLMs often struggle with spatial reasoning which is one essential part of reasoning and inference and requires understanding complex relationships between objects in space. This paper proposes a novel neural-symbolic framework that enhances LLMs' spatial reasoning abilities. We evaluate our approach on two benchmark datasets: StepGame and SparQA, implementing three distinct strategies: (1) ASP (Answer Set Programming)-based symbolic reasoning, (2) LLM + ASP pipeline using DSPy, and (3) Fact + Logical rules. Our experiments demonstrate significant improvements over the baseline prompting methods, with accuracy increases of 40-50% on StepGame} dataset and 3-13% on the more complex SparQA dataset. The "LLM + ASP" pipeline achieves particularly strong results on the tasks of Finding Relations (FR) and Finding Block (FB) questions, though performance varies across different question types. The impressive results suggest that while neural-symbolic approaches offer promising directions for enhancing spatial reasoning in LLMs, their effectiveness depends heavily on the specific task characteristics and implementation strategies. We propose an integrated, simple yet effective set of strategies using a neural-symbolic pipeline to boost spatial reasoning abilities in LLMs. This pipeline and its strategies demonstrate strong and broader applicability to other reasoning domains in LLMs, such as temporal reasoning, deductive inference etc.
... Exploration of spatial reasoning in large transformer architectures was conducted by [11] in which they evalute ChatGPT on two spatial reasoning tasks: SpartQA [61] and StepGame [83]. These datasets consist of story-question pairs written in natural language about relations of + 1 ( ≤ 10) entities, finding that off the shelf approaches like ...
... To assess the spatial reasoning capabilities of our system, we developed a specialized benchmark focused on positional reasoning from textual inputs. This aligns with existing datasets which require inferring spatial relationships between entities from text [61,83,90]. Unlike [61], which assess tasks like finding blocks, selecting objects, or answering yes/no questions, our benchmark exclusively targets the precise understanding of spatial positions. ...
... Unlike [61], which assess tasks like finding blocks, selecting objects, or answering yes/no questions, our benchmark exclusively targets the precise understanding of spatial positions. [83,90] involve presenting longer form narratives with distractors to infer spatial relationships between objects, whereas we focus on the systemâĂŹs ability to propose accurate and logical candidate locations within a 3D space. ...
Preprint
Generative artificial intelligence has shown promise in prompting virtual worlds into existence, yet little attention has been given to understanding how this process unfolds as social interaction. We present Social Conjurer, a framework for AI-augmented dynamic 3D scene co-creation, where multiple users collaboratively build and modify virtual worlds in real-time. Through an expanded set of interactions, including social and tool-based engagements as well as spatial reasoning, our framework facilitates the creation of rich, diverse virtual environments. Findings from a preliminary user study (N=12) provide insight into the user experience of this approach, how social contexts shape the prompting of spatial environments, and perspective on social applications of prompt-based 3D co-creation. In addition to highlighting the potential of AI-supported multi-user world creation and offering new pathways for AI-augmented creative processes in VR, this article presents a set of implications for designing human-centered interfaces that incorporate AI models into 3D content generation.
... Spatial Reasoning on Multi-Modal Vision-Text. There has been a body of work on text-only spatial reasoning with the advancement of LLMs (Yamada et al., 2024), within the context of relative spatial relation recognition (Mirzaee et al., 2021;Shi et al., 2022), natural language navigation (Yamada et al., 2024), and planning (Momennejad et al., 2023) (see Appendix A for a more complete overview). ...
... Spatial reasoning has been investigated with the advancement of LLMs (Yamada et al., 2024). Various benchmarks have been proposed to evaluate models' spatial reasoning abilities, including relative spatial relation recognition (Weston et al., 2016;Mirzaee et al., 2021;Shi et al., 2022), natural language navigation (Yamada et al., 2024), and planning (Momennejad et al., 2023). Mirzaee and Kordjamshidi (2022) suggest that introducing synthetic data of spatial reasoning when pre-training helps to improve the spatial awareness of the model. ...
... Many researchers have adopted this technique as a memory component in their methods. (Schlag and Schmidhuber, 2018;Shi et al., 2022;Li et al., 2023a;Chen et al., 2020;Schlag et al., 2021). Recent research (Geva et al., 2021) finds that the transformer feed-forward layers are likely to be a key-value memory. ...
... We focus on the decoderonly autoregressive models and do not include encoder-decoder structure models, as the autoregressive structures are the mainstream architecture nowadays (OpenAI, 2023;Touvron et al., 2023). Further, as stated by , the weight matrix in some models like OPT-13B (Zhang et al., 2022) is not invertible. However, such an issue can be relieved by adding a term βI to the Eq. ...
... Cohn and Hernandez-Orallo (2023) which investigated a number of spatial reasoning problems include some limited instances of relational composition, but not exhaustively. Other work investigating the spatial reasoning abilities of LLMs typically which revolved around especially constructed benchmarks such as StepGame (Li, Hogg, and Cohn 2024;Shi, Zhang, and Lipani 2022) can also be regarded as testing compositional reasoning, but not in a methodical or exhaustive manner. StepGame aims to test an LLM's ability to correctly determine the qualitative direction relationship between two objects, given a set direction relations between a larger set of objects, and between 1 and 10 reasoning steps are required to correctly determine the result. ...
Preprint
Full-text available
Qualitative Spatial Reasoning is a well explored area of Knowledge Representation and Reasoning and has multiple applications ranging from Geographical Information Systems to Robotics and Computer Vision. Recently, many claims have been made for the reasoning capabilities of Large Language Models (LLMs). Here, we investigate the extent to which a set of representative LLMs can perform classical qualitative spatial reasoning tasks on the mereotopological Region Connection Calculus, RCC-8. We conduct three pairs of experiments (reconstruction of composition tables, alignment to human composition preferences, conceptual neighbourhood reconstruction) using state-of-the-art LLMs; in each pair one experiment uses eponymous relations and one, anonymous relations (to test the extent to which the LLM relies on knowledge about the relation names obtained during training). All instances are repeated 30 times to measure the stochasticity of the LLMs.
... Early work focused on extracting spatial information from natural language (Hois & Kutz, 2011;Kordjamshidi et al., 2011). More recent efforts emphasize improving multi-hop spatial reasoning (Li et al., 2024b), especially in complex scenarios like 2D visual scenes (Shi et al., 2022). Key methods include pretraining on synthetic datasets to better capture spatial patterns (Mirzaee et al., 2021), and using in-context learning to generalize spatial reasoning across tasks, such as transforming spatial data into logical forms or visualizing reasoning traces . ...
Preprint
Vision language models (VLMs) have demonstrated impressive performance across a wide range of downstream tasks. However, their proficiency in spatial reasoning remains limited, despite its crucial role in tasks involving navigation and interaction with physical environments. Specifically, much of the spatial reasoning in these tasks occurs in two-dimensional (2D) environments, and our evaluation reveals that state-of-the-art VLMs frequently generate implausible and incorrect responses to composite spatial reasoning problems, including simple pathfinding tasks that humans can solve effortlessly at a glance. To address this, we explore an effective approach to enhance 2D spatial reasoning within VLMs by training the model on basic spatial capabilities. We begin by disentangling the key components of 2D spatial reasoning: direction comprehension, distance estimation, and localization. Our central hypothesis is that mastering these basic spatial capabilities can significantly enhance a model's performance on composite spatial tasks requiring advanced spatial understanding and combinatorial problem-solving. To investigate this hypothesis, we introduce Sparkle, a framework that fine-tunes VLMs on these three basic spatial capabilities by synthetic data generation and targeted supervision to form an instruction dataset for each capability. Our experiments demonstrate that VLMs fine-tuned with Sparkle achieve significant performance gains, not only in the basic tasks themselves but also in generalizing to composite and out-of-distribution spatial reasoning tasks (e.g., improving from 13.5% to 40.0% on the shortest path problem). These findings underscore the effectiveness of mastering basic spatial capabilities in enhancing composite spatial problem-solving, offering insights for improving VLMs' spatial reasoning capabilities.
... There are several synthetic corpora available for spatial reasoning tasks (Weston et al. 2015;Clark et al. 2019;Shi, Zhang, and Lipani 2022). The bAbI dataset is a set of question/answer pairs in natural language text designed to train and test a machine's text interpretation and reasoning capabilities. ...
Book
Full-text available
Advancements in large language models (LLMs) have led to notable successes across various natural language processing tasks, including but not limited to text generation, question answering, and text summarization. This chapter explores how LLMs perform in tackling language-based qualitative spatial reasoning tasks, with a focus on two key areas: their performance in addressing spatial reasoning queries directly, and their performance in named entity recognition (NER) tasks for extracting subject-predicate-object triples as inputs to logical reasoning engines. While the results indicate (expected) limitations in application to both tasks, we argue that the integration of the relative strengths of both machine-learning (ML) and knowledge representation (KR) approaches offers the most promise for future advances in big spatial data and GeoAI.
Article
Full-text available
Spatial reasoning, a fundamental aspect of human intelligence, is essential for machine learning models to understand and interpret object relationships. It is crucial for numerous real-world applications, ranging from autonomous navigation to urban planning. The lack of comprehensive datasets limits the development and evaluation of models that can effectively handle spatial reasoning tasks. Existing datasets often contain complex spatial reasoning problems with overlapping spatial relationships, making it challenging to diagnose specific aspects that a model struggles with. We address this gap by introducing a new dataset of linear layouts. This dataset is systematically designed to exhibit a range of spatial relations and complexity levels. Analyzing spatial reasoning through linear layout generation offers a more structured and manageable approach to understanding how models learn and interpret spatial relationships. Linear layout generation has broad applicability and is of fundamental importance in design and optimization. To benchmark dataset, we develop LinLayCNN, a generic data-driven method that applies shallow, one-dimensional convolutional neural network (CNN), to generate linear layouts in an iterative process. Experimental results reveal that LinLayCNN can effectively solve fundamental spatial challenges even with the relatively small size of the training set. It is capable of precise object placement, making it a robust tool for linear layout generation. Current layout generation methods focus on domain-specific solutions and often fail to maintain the precision needed for technical domains, such as accurate sizing, and object counting. They also require a substantial amount of data to function effectively. LinLayCNN overcame these issues. This study further clarifies CNNs’ capabilities in spatial reasoning, highlight their potential to advance the field of layout generation. 
As a result, our approach establishes a clear benchmark for evaluating spatial reasoning and aids in development of models that can more effectively understand and reason about space.
Article
Full-text available
Heretofore, neural networks with external memory are restricted to single memory with lossy representations of memory interactions. A rich representation of relationships between memory pieces urges a high-order and segregated relational memory. In this paper, we propose to separate the storage of individual experiences (item memory) and their occurring relationships (relational memory). The idea is implemented through a novel Self-attentive Associative Memory (SAM) operator. Found upon outer product, SAM forms a set of associative memories that represent the hypothetical high-order relationships between arbitrary pairs of memory elements, through which a rela-tional memory is constructed from an item memory. The two memories are wired into a single sequential model capable of both memorization and relational reasoning. We achieve competitive results with our proposed two-memory model in a diversity of machine learning tasks, from challenging synthetic problems to practical testbeds such as geometry, graph, reinforcement learning, and question answering.
Article
Recent evidence challenges the widely held view that the hippocampus is specialized for episodic memory, by demonstrating that it also underpins the integration of information across experiences. Contemporary computational theories propose that these two contrasting functions can be accomplished by big-loop recurrence, whereby the output of the system is recirculated back into the hippocampus. We use ultra-high-resolution fMRI to provide support for this hypothesis, by showing that retrieved information is presented as a new input on the superficial entorhinal cortex, driven by functional connectivity between the deep and superficial entorhinal layers. Further, the magnitude of this laminar connectivity correlated with inferential performance, demonstrating its importance for behavior. Our findings offer a novel perspective on information processing within the hippocampus and support a unifying framework in which the hippocampus captures higher-order structure across experiences, by creating a dynamic memory space from separate episodic codes for individual experiences.
Article
Answering complex questions is a time-consuming activity for humans that requires reasoning and integration of information. Recent work on reading comprehension made headway in answering simple questions, but tackling complex questions is still an ongoing research challenge. Conversely, semantic parsers have been successful at handling compositionality, but only when the information resides in a target knowledge-base. In this paper, we present a novel framework for answering broad and complex questions, assuming answering simple questions is possible using a search engine and a reading comprehension model. We propose to decompose complex questions into a sequence of simple questions, and compute the final answer from the sequence of answers. To illustrate the viability of our approach, we create a new dataset of complex questions, ComplexWebQuestions, and present a model that decomposes questions and interacts with the web to compute an answer. We empirically demonstrate that question decomposition improves performance from 20.8 precision@1 to 27.5 precision@1 on this new dataset.