Changing perspective: Local alignment of reference frames in dialogue
Simon Dobnik and Christine Howes
Centre for Language Technology
University of Gothenburg, Sweden
John D. Kelleher
School of Computing
Dublin Institute of Technology, Ireland
In this paper we examine how people
negotiate, interpret and repair the frame
of reference (FoR) in free dialogues dis-
cussing spatial scenes. We describe a pilot
study in which participants are given dif-
ferent perspectives of the same scene and
asked to locate several objects that are only
shown on one of their pictures. This task
requires participants to coordinate on FoR
in order to identify the missing objects.
Preliminary results indicate that conversa-
tional participants align locally on FoR but
do not converge on a global frame of refer-
ence. Misunderstandings lead to clarifica-
tion sequences in which participants shift
the FoR. These findings have implications
for situated dialogue systems.
1 Introduction
Directional spatial descriptions such as “to the left
of green cup” or “in front of the blue one” require
the specification of a frame of reference (FoR) in
which the spatial regions “left” and “front” are
projected, for example “from where I stand” or
“from Katie’s point of view”. The spatial refer-
ence frame can be modelled as a set of three or-
thogonal axes fixed at some origin (the location of
the landmark object) and oriented in a direction
determined by the viewpoint (Maillat, 2003).
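This axis-based model can be sketched in a few lines of Python. The 2-D geometry, function names and sign conventions below are our own simplification for illustration, not the formulation of Maillat (2003):

```python
import math

def make_for(origin, viewpoint):
    """Build a 2-D frame of reference: axes anchored at the landmark's
    location (origin) and oriented by the viewpoint, so that 'front'
    points from the viewpoint towards the landmark."""
    fx = origin[0] - viewpoint[0]
    fy = origin[1] - viewpoint[1]
    norm = math.hypot(fx, fy)
    front = (fx / norm, fy / norm)   # 'front/back' axis
    left = (-front[1], front[0])     # 'left/right' axis, 90 degrees CCW
    return {"origin": origin, "front": front, "left": left}

def direction(frame, target):
    """Project the target onto the frame's axes and return the dominant
    projective term ('left', 'right', 'front' or 'back')."""
    dx = target[0] - frame["origin"][0]
    dy = target[1] - frame["origin"][1]
    f = dx * frame["front"][0] + dy * frame["front"][1]
    l = dx * frame["left"][0] + dy * frame["left"][1]
    if abs(l) >= abs(f):
        return "left" if l > 0 else "right"
    return "front" if f > 0 else "back"
```

Changing the `viewpoint` argument while keeping the same `origin` is exactly the FoR shift discussed in this paper: the same target can be "left" from one viewpoint and "right" from another.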
A good grasp of spatial language is crucial
for interactive embodied situated agents or robots
which will engage in conversations involving such
descriptions. These agents have to build represen-
tations of their perceptual environment and con-
nect their interpretations to shared meanings in
the common ground (Clark, 1996) through inter-
action with their human dialogue partners. There
are two main challenges surrounding the computa-
tional modelling of FoR. Firstly, there are several
ways in which the viewpoint may be assigned. If
the FoR is assigned by the reference object of the
description itself (“green cup” in the first exam-
ple above) then we talk about an intrinsic reference
frame (after Levinson (2003)). Alternatively, the
viewpoint can be any conversational participant or
object in the scene that has an identifiable front
and back in which case we talk about a relative
FoR. Finally, one can also refer to the location
of objects where the viewpoint is external to the
scene, for example, as a superimposed grid struc-
ture on a table top with cells such as A1 and B4. In
this case it is an extrinsic reference frame. There
are a number of factors that affect the choice of
FoR, including: task (Tversky, 1991), personal
style (Levelt, 1982), arrangement of the scene
and the position of the agent (Taylor and Tver-
sky, 1996; Carlson-Radvansky and Logan, 1997;
Kelleher and Costello, 2009; Li et al., 2011), the
presence of a social partner (Duran et al., 2011),
the communicative role and knowledge of infor-
mation (Schober, 1995). The second challenge for
computational modelling is that the viewpoint may
not be overtly specified and must be recovered
from the linguistic or perceptual context. Such un-
derspecification may lead to situations where con-
versational partners fail to accommodate the same
FoR leading to miscommunication.
Psycholinguistic research suggests that inter-
locutors in a dialogue align their utterances at sev-
eral levels of representation (Pickering and Gar-
rod, 2004), including their spatial representations
(Watson et al., 2004). However, as with syntac-
tic priming (Branigan et al., 2000), the evidence
comes from controlled experiments with a confed-
erate and single prime-target pairs of pictures, and
this leaves open the question of how well such ef-
fects scale up to longer unconfined free dialogues.
In the case of syntactic priming, corpus studies
suggest that interlocutors actually diverge syntac-
tically in free dialogue (Healey et al., 2014).
Semantic coordination has been studied using
the Maze Game (Garrod and Anderson, 1987), a
task in which interlocutors must produce location
descriptions, which can be figurative or abstract.
Evidence suggests that dyads converge on more
abstract representations, although this is not ex-
plicitly negotiated. Additionally, the introduction
of clarification requests decreases convergence,
suggesting that mutual understanding, and how
misunderstandings are resolved is key to shifts in
description types (Mills and Healey, 2006). How-
ever, both participants see the maze from the same
perspective, in contrast to our egocentric, embod-
ied perceptions of everyday scenes.
We are interested in how participants align their
spatial representations in free dialogue when they
perceive a scene from different perspectives. If the
interactive alignment model is correct, although
participants may start using different FoRs (us-
ing e.g. an egocentric perspective (Keysar, 2007)),
they should converge on a particular FoR over the
course of the dialogue. We are also concerned with
how they identify if a misalignment has occurred,
and the strategies they use to get back on track in
dialogues describing spatial scenes.
In contrast to several previous studies, this pa-
per investigates the coordination of FoR between
two conversational participants over an ongoing
dialogue. Our hypotheses are that (i) there is no
baseline preference for a specific FoR; (ii) partic-
ipants will align on spatial descriptions over the
course of the dialogue; (iii) sequences of misun-
derstanding will prompt the use of different FoRs.
2 Method
We describe below our pilot experimental set-up
in which participants were required to discuss a
visual scene in order to identify objects that were
missing from one another’s views of the scene.
2.1 Task
Using 3D modelling software (Google SketchUp)
we designed a virtual scene depicting a table
with several mugs of different colours and shapes
placed on it. As shown in Figure 1, the scene in-
cludes three people on different sides of the table.
The people standing at the opposite side of the ta-
ble were the avatars of the participants (the man =
P1 and the woman = P2), and a third person at the
side of the table was described to the participants
as an observer “Katie”.
Each participant was shown the scene from their
avatar’s point of view (see Figures 2 and 3), and in-
formed that some of the objects on the table were
missing from their picture, but visible to their part-
ner. Their joint task was to discover the missing
objects from each person’s point of view and mark
them on the printed sheet of the scene provided.
The objects that were hidden from each participant
are marked with their ID in Figure 1.
Figure 1: A virtual scene with two dialogue partners and an
observer Katie. Objects labelled with a participant ID were
removed in that person’s view of the scene.
2.2 Procedure
Each participant was seated at their own com-
puter and the participants were separated by a
screen so that they could not see each other or
each other’s computer screens. They could only
communicate using an online text based chat tool
(Dialogue Experimental Toolkit, DiET, (Healey et
al., 2003)). The DiET chat tool resembles com-
mon online messaging applications, with a chat
window in which participants view the unfolding
dialogue and a typing window in which partici-
pants can type and correct their contributions be-
fore sending them to their interlocutor. The server
records each key press and associated timing data.
In addition to the chat interface each participant
saw a static image of the scene from their view, as
shown in Figure 2, which shows the scene from
P1’s view and Figure 3, which shows the same
scene from P2’s view.
2.3 Participants
In the pilot study reported here, we have recorded
two dialogues. Both dialogues were conducted in
English but the native language of the first pair was
Swedish while the second pair were native British
English speakers. Participants were instructed that
Figure 2: The table scene as seen by Participant 1.
Figure 3: The table scene as seen by Participant 2.
they should chat to each other until they found the
missing objects or for 30 minutes. The first dyad
took approximately 30 minutes to find the objects
and produced 157 turns in total. The second dyad
(native English speakers) discussed the task for a
little over an hour, during which they produced
441 turns. Following completion of the task
participants were debriefed about the nature of the experiment.
2.4 Data annotation
The turns were annotated manually for the follow-
ing features: (i) does a turn (T) contain a spa-
tial description; (ii) the viewpoint of the FoR that
the spatial description uses (P1, P2, Katie, ob-
ject, extrinsic); a turn may contain several spatial
descriptions with different FoR in which case all
were marked; (iii) whether a turn contains a topo-
logical spatial description such as “near” or “at”
which do not require a specification of FoR; and
(iv) whether the FoR is explicitly referred to by
the description, for example “on my left”.
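The four annotation features could be represented by a per-turn record such as the following (a hypothetical sketch; the field and class names are ours, not those of any annotation tool used in the study):

```python
from dataclasses import dataclass, field

@dataclass
class TurnAnnotation:
    turn_id: int
    speaker: str                    # "P1" or "P2"
    text: str
    spatial: bool = False           # (i) contains a spatial description
    fors: list = field(default_factory=list)  # (ii) viewpoints: p1, p2, katie, object, extrinsic
    topological: bool = False       # (iii) e.g. "near", "at": no FoR required
    explicit: bool = False          # (iv) FoR overtly named, e.g. "on my left"

# T25 from the excerpt, annotated spatial, relative-katie, topological:
t25 = TurnAnnotation(25, "P1",
    "ok, is your white one closer to katie than the yellow and blue?",
    spatial=True, fors=["katie"], topological=True)
```

A turn with several spatial descriptions simply carries several entries in `fors`, which is why the per-FoR counts in Table 1 need not sum to the number of spatial turns.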
20 P1: from her right I see yell, white, blue red
spatial, relative-katie, explicit
21 and the white has a funny thing around the top
22 P2: then you probably miss the white i see
23 P1: and is between yel and bl but furhter away from
spatial, relative-katie, explicit, topological
24 P2: because i see a normal mug too, right next to the
yellow one, on the left
spatial, relative-katie, topological
25 P1: ok, is your white one closer to katie than the yellow
and blue?
spatial, relative-katie, topological
26 P2: yes
27 closest to me, from right to left:
spatial, relative-p2, topological
28 P1: ok, got it
29 P2: white mug, white thing with funny top, red mug,
yellow mug (the same as katies)
The example also shows that topological spatial
descriptions can be used in two ways. They can
feature in explicit definitions of FoR as “away” in
T23, be independent as “right next to” in T24 and
“closest to me” in T27 or sometimes they may be
ambiguous between the two as “closer to Katie”
in T25. In addition to referring to proximity, topo-
logical spatial descriptions also draw attention to
a particular part of the scene that dialogue partici-
pants should focus on to locate the objects and to
a particular FoR that has already been accommo-
dated, in this case relative to Katie. Strictly speak-
ing, this is not an explicit expression of a FoR but
is used to add additional salience to it.
2.5 Dialogue Acts and entropy
We tagged both conversations with a dialogue
act (DA) tagger trained on the NPS Chat Corpus
(Forsyth and Martell, 2007) using utterance words
as features as described in Chapter 6 of (Bird et al.,
2009) but using Support Vector Machines rather
than a Naive Bayes classifier (F-score 0.83 tested
on 10% held-out data). Of the 15 dialogue act tags
used, the most frequent classifications of turns in
our corpus are (in decreasing frequency) Statement,
Accept, yAnswer, ynQuestion and whQuestion. In
parallel to DA tagging we also
marked turns that introduced a change in the FoR
assignment. Turns with no projective spatial de-
scription and hence no FoR annotation are marked
as no-change. We process the dialogues by intro-
ducing a moving window of 5 turns and for each
window we calculate the entropy of DA assign-
ments and the entropy of FoR changes.
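The moving-window entropy computation can be sketched as follows. This is a minimal version under our own naming; the normalisation by the maximum observable entropy follows the description of Figure 5:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def windowed_entropy(labels, width=5):
    """Entropy of each moving window of `width` consecutive turns,
    normalised by the maximum observable entropy log2(width)."""
    return [entropy(labels[i:i + width]) / log2(width)
            for i in range(len(labels) - width + 1)]
```

Applied to the sequence of DA tags and, separately, to the sequence of FoR changes, this yields the two curves compared in Figure 5: a window of five identical labels gives entropy 0, and a window of five distinct labels gives the normalised maximum of 1.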
3 Results and Discussion
3.1 Overall usage of FoR
Table 1 summarises the number of turns that use
each FoR in the dialogues. The data shows that
the majority of FoR is assigned relative to dialogue
participants (P1: 36%, P2: 27% and Speaker:
33%, Addressee: 29%, all values relative to the
turns containing a spatial description). Extrinsic
FoR is also quite common (25%) followed by the
FoR relative to Katie (6%). In 10% of turns con-
taining a spatial description the FoR could not be
determined, most likely because a turn contained
only a topological spatial description. Topologi-
cal spatial descriptions are used in 18% of spatial
turns. Note that since one turn may contain more
than one spatial description, these counts do not
add up to the total number of turns containing a
spatial description.
Category Turns Proportion
Turns in total 598 1.0000
Contains a spatial description 245 0.4097
FoR=P1 88 0.3592
FoR=P2 66 0.2694
FoR=speaker 81 0.3306
FoR=addressee 72 0.2939
FoR=Katie 15 0.0612
FoR=extrinsic 61 0.2490
FoR=unknown 26 0.1061
Topological description 44 0.1796
Table 1: Overall usage of FoR
In our data there are no uses of the intrinsic
reference frame relative to the landmark object.
This may be because the objects in this study were
mugs and they are used as both target and land-
mark objects in descriptions. Although they may
have identifiable fronts and backs and are hence
able to set the orientation of the FoR, they are not
salient enough to attract the assignment of FoR
relative to the presence of the participants. This
observation is orthogonal to the observation made
in earlier work where the visual salience proper-
ties of the dialogue partners and the landmark ob-
ject were reversed compared to this scene (Dob-
nik et al., 2014). Note, however, that we anno-
tate descriptions such as “one directly in from of
you” (D(ialogue) 1, T146) as relative FoR to P1,
although this could also be analysed as an intrin-
sic FoR. We opt for the relative interpretation on
the grounds that otherwise important information
about which contextual features attract the assign-
ment of FoR would be lost. In our system there
is therefore no objectively intrinsic FoR but FoR
assigned to different contextually present entities.
3.2 Local alignment of FoR
Figure 4 shows the uses of FoR over the length of
the entire D1 and the same length of utterances
of D2. The plots show that although there is no
global preference for a particular entity to assign
the FoR, there are local alignments of FoR that
stretch over several turns, visible as runs of red
(P1) and green (P2)
shapes. This supports the findings in earlier work
(Watson et al., 2004; Dobnik et al., 2014) that par-
ticipants tend to align to FoR over several turns.
Partial auto-correlations on each binary FoR
variable in Figure 4 (P1, P2, Katie and Extrinsic)
confirm this. Each correlates positively with itself
(p<0.05) at 1–3 turns lag, confirming that the
use of a particular FoR makes reuse of that FoR
more likely. Cross-correlations between the vari-
ables show no such pattern.
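The intuition behind these correlations can be illustrated with a plain (non-partial) lag autocorrelation over a binary FoR series; the study itself used partial autocorrelations, for which a statistics package would be needed, so this is only a sketch with our own naming:

```python
def lag_correlation(series, lag):
    """Pearson correlation between a (binary) series and itself shifted
    by `lag` turns; plain autocorrelation, not the partial form used in
    the analysis. Assumes lag >= 1 and a non-constant series."""
    x, y = series[:-lag], series[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5
```

A positive value at lags 1–3, as found for each FoR variable, means that using a FoR in one turn makes its reuse in the next few turns more likely.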
The graph also shows that the alignment is per-
sistent to a different degree at different parts of
both dialogues. For example, in D1 the partici-
pants align considerably in the first part of the di-
alogue up to turn 75, first relative to Katie, then
to P2 and finally to P1. After approximately T115
both FoR relative to P1 and P2 appear to be used
interchangeably in a threaded manner as well as
the use of the extrinsic perspective. In D2 the situ-
ation is reversed. The participants thread the usage
of the FoR in the first part of the dialogue but con-
verge to segments with a single FoR shortly before
T100 where they both prefer the extrinsic FoR and
also FoR relative to P1. We will discuss these seg-
ments further in Section 3.4.
Overall, the data show that the use of FoR is
not random and that different patterns of FoR as-
signment and coordination are present at different
segments of the dialogue. In order to understand
how FoR is assigned we therefore have to examine
these segments separately.
3.3 Explicitness of FoR
With an increase in (local) alignment, as discussed
above, we might expect that there is less neces-
sity for dialogue participants to describe the FoR
overtly after local alignment has been established.
Explicitness of FoR is therefore indicated in Fig-
ure 4: stars indicate that the FoR is described
explicitly whereas triangles indicate that it is not.
However, contrary to our expectation that the FoR
would only be described explicitly at the begin-
ning of a cluster of aligned FoR turns, it appears
Figure 4: The assignment of FoR over the length of Dialogue 1 (top) and Dialogue 2 (bottom)
that the FoR is explicitly described every couple
of utterances even if the participants align as in the
first half of D1. This may be because participants
are engaged in a task where the potential for ref-
erential ambiguity is high and precision is critical
for successful completion of the task.
Note also that in D2 at around turn 100 there are
clusters of turns where extrinsic FoR was used but
this was not referred to explicitly. This is because
participants in this dialogue previously agreed on
a 2-dimensional coordinate system involving let-
ters and numbers that they superimposed over the
surface of the table. Referring to a region “A2”
does not require stating “of the table” and hence a
lack of explicitness in their FoRs.
3.4 Changing FoR
One of the main consequences of the local, and
not global alignment of FoR, as shown in Figure 4
is that there are several shifts in FoR as the dia-
logue progresses. Below we outline some possible
reasons for this, with illustrative examples taken
from the dialogues. Due to the sparsity of data in
our pilot study, these observations are necessarily
qualitative, but they point the way towards some
interesting future work.
(i) The scene is better describable from an-
other perspective. Due to the nature of the task
and the scene, it is not possible to generate a
unique and successfully identifiable referring ex-
pression without leading to miscommunication.
In D1 we can observe that the dialogue partners
take neutral Katie’s viewpoint over several turns.
In fact, they explicitly negotiate that they should
take this FoR: T13 “shall we take it from katies
point of view?”. However, in T25 P1 says “ok,
is your white one closer to katie than the yellow
and blue?” which prompts P2 to switch FoR to
themselves “closest to me, from right to left:”. The
change appears to be initiated by the fact that the
participants have just discovered a missing white
mug but a precise reference is made ambiguous
because of another white distractor mug nearby.
P2 explicitly changes the FoR because a descrip-
tion can be made more precise from their perspec-
tive: from Katie’s perspective both white mugs are
arranged in a line at her front. Interestingly, in T35
P1 uses the same game strategy and switches the
FoR to theirs saying “closest to me, from left to
right red, blue, white, red” and the conversation
continues using that FoR for a while, until turn 63.
The example also shows that participants align in
terms of conversational games for the purposes of
identifying the current object and that the nature of
dialogue game also affects the assignment of FoR.
(ii) Current dialogue game. The nature of the
task seems to naturally lead to a series of different
dialogue games, from describing the whole scene
to zooming in on a particular area when a poten-
tial mismatch is identified. In this case, since the
scene in focus is only a part of the overall picture
it is less likely that an identifiable reference to
a particular object will fail as there will be fewer
distractors. As a result a single FoR can be used
over a stretch of the conversation and participants
are likely to align. There is less need for explicit
perspective marking. See for example D1,T20-29
in the previous dialogue listing which corresponds
to a cluster in Figure 4. Another cluster in Figure 4
starts at D1,T42 and is shown below. P2 identifies
an empty space in their view which they assume is
not empty for P1 and this becomes a region of fo-
cus. Since this region is more visually accessible
to P1, and since they are the information giver they opt
for P1’s FoR (“away from you” in T42 and T43).
As shown in Figure 4 this is a dominant FoR for
this stretch of dialogue.
42 P2: there is an empty space on the table on the second
row away from you
relative-p1, explicit, topological
43 between the red and white mug (from left to right)
44 P1: I have one thing there, a white funny top
45 P2: ok, i’ll mark it.
46 P1: and the red one is slightly close to you
relative-p2, explicit, topological
47 is that right?
48 to my left from that red mug there is a yellow mug
relative-p1, explicit, topological
49 P2: hm...
Conversely, when looking for single objects that
may be located anywhere on the entire table, for
example, the speaker focuses on one object only
that may be in a different part of the table than the
one referred to in the previous utterance. There is
no spatial continuum in the way the scene is pro-
cessed and there may be several distracting objects
that may lead to misunderstanding. Therefore,
each description must be made more precise, both
in the explicit definition of the FoR and through
taking the perspective from which the reference
is most identifiable. An example of this can be
found towards the end of D1, before turn 115 (cf.
Figure 4) where the participants decide to enumer-
ate the mugs of each colour that they can see, P1
leads the enumeration and describes the loca-
tion of each object. However, the example also
shows effects of continuity that is created by per-
ceptual and discourse salience of objects, i.e. the
way the scene is processed visually and the way
it is described. In T117 “your left hand” is a good
landmark which attracts the FoR to P2 in the fol-
lowing spatial utterance in T119 but in T120 the
FoR switches to P1 and in T121 back to P2. Turns
T131-T136 show a similar object enumerating sit-
uation where FoR changes in every turn and is
also explicitly marked.
115 P1: my red ones are two in my first row (one of them
close to katie)
relative-p1, explicit
116 P2: i mean there is a chance we both see a white that
the other one is missing..
117 P1: one just next to your left hand
relative-p2, explicit
118 P2: yes
119 P1: and one on the third row from you slightly to your
relative-p2, explicit
120 P2: is it directly behind the red mug on your left?
relative-p1, explicit
121 P1: no, much closer to you
relative-p2, topological
131 P1: and the blue ones are one on the second row from
you, to the right from you
relative-p2, explicit
132 one slightly to my left
relative-p1, explicit
133 and one in front of katie in the first row
relative-katie, explicit
134 P2: yes, that’s the same
135 P1: and the yellow are on between us to your far right
136 and one quite close to the corner on your left and katies
relative-p2, relative-katie, explicit
A switch between dialogue games tends to
come with a switch of FoR. For example, in the
following segment of D2, P1’s FoR is selected
initially to describe a row of cups closest to P1
and starting from their left to right (T14-T17).
However, at T18 P1 initiates clarification. As
P2 is the information giver in this case, the FoR is
switched to theirs. Interestingly, the participants
also switch the axis along which they enumerate
objects (T21): starting at P2 and proceeding to P1,
thus consistent from P2’s perspective. At T26 a
new clarification game is started and FoR changes
to both P1 and P2, and at T32, after the partic-
ipants exit both clarification games, P1 resumes
the original game enumerating objects row-by-row
and hence FoR is adjusted back to P1 accordingly.
14 P1: On my first row. I have from the left (your right):
one red, handle turned to you but I can see it. A blue
cup next. Handle turned to my right. A white with
handle turned to right. Then a red with handle turned
to my left.
relative-p1, explicit
15 P2: first row = row nearest you?
relative-p1, explicit
16 P1: Yes.
17 P2: ok then i think we found a cup of yours that i can’t
see: the red with the handle to your left (the last one
you mention)
relative-p1, explicit
18 P1: Okay, that would make sense. Maybe it is blocked
by the other cups in front or something?
19 P2: yeh, i have a blue one and a white one, either of
which could be blocking it
20 P1: Yes, I think I see those.
21 It looks almost like a diagonal line to me. From a red
cup really close to you on your left, then a white, then
the blue, then this missing red.
22 P2: blue with the handle to my left and white with the
handle to my rigth/towards me a bit
relative-p2, explicit
26 P1: You know this white one you just mentioned. Is it
a takeaway cup?
27 Because I think I know which cup that is but I don’t see
the handle.
28 P2: no, i was referring to the white handled cup to the
right of the blue cup in the second row from you. its
handle faces... south east from my perspective
relative-p1, relative-p2, explicit
29 the second row of cups from your end
relative-p1, explicit
32 P1: Shall we take my next row? Which is actually just
a styrofoam cup. It’s kinda marooned between the two
relative-p1, explicit
(iii) Miscommunication and repair. We have
already shown in the previous section that in line
with (Mills and Healey, 2006), clarification trig-
gers a change in FoR, with the explanation that
clarification triggered a change of roles between
the information giver and information receiver as
well as introducing a different perceptual focus
on the scene. However, during repair one would
also expect that participants describe FoR explic-
itly more often. In the following example from
D1, P1 is not sure about the location P2 is referring
to. In T148 P2 explicitly describes the cup that can
be found at that location using a double specification
of FoR. The information giver is thus providing more
information than necessary to ensure precision.
146 P2: so you see that yellow cup to be right on teh cor-
147 P1: Yes
148 A yellow cup, on my right your left, with the handle
facing east to me, west to you.
relative-p1, relative-p2, explicit
149 P2: ok, from my perspective, there is at least a cup-
sized gap between the edge of the table and the yellow
relative-p2, explicit
150 P1: Yes, I can say that too
As we have already seen, participants also use
other strategies to reduce miscommunication, for
example by enumerating objects that can be seen
at any time of the conversation. From D1:
69 P1: so now I have 17 including the ones I’ve marked,
how many do you have?
100 P2: so then again, it looks like we see everything we
101 P1: yes, you still just got 17?
102 P2: yes
(iv) Explicit strategies. Participants also devise
strategies for processing the scene to find the miss-
ing objects. In (D1, T13) participants agree to use
Katie’s perspective as a reference. In (D2, T51 and
following) they negotiate to split the table into a
grid of 16 sub-areas where they label the columns
with letters and rows with numbers. They nego-
tiate the coordinates so that column labels A-D
go from left to right and row numbers go from
top to bottom relative to P2’s view of the table.
Hence, although they devise an extrinsic FoR with
areas that they can refer to with coordinates they
are forced to combine it with a FoR relative to P2
and therefore they create a more complicated sys-
tem that involves two viewpoints. Interestingly,
P1 clearly marked the axis labels on their printed
sheet of the scene, which P2 did not, probably
because the coordinate system was more difficult
from P1’s viewpoint. The negotiation of the coor-
dinate system requires a lot of effort and involves
referring to objects in the scene when negotiating
where to start the lettering and numbering and how
to place the lines for the grid. The participants fin-
ish the negotiation in T165, 114 turns later. How-
ever, although participants of D1 and D2 both ne-
gotiate on some reference perspective they do not
use it exclusively as shown in Figure 4. One hy-
pothesis that follows from these observations is
that participants would use the reference (combin-
ing relative-katie and extrinsic) FoR in turns that
involve greater information precision, that is those
under repair as demonstrated in T119 of D2. Here
the participants are negotiating where to draw the
lines that would delimit different areas of the grid.
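The negotiated grid can be sketched as a lookup from cell labels to table regions. The coordinate convention, table dimensions and function name below are our assumptions for illustration, not the dyad's actual scheme:

```python
# Sketch of the dyad's negotiated extrinsic grid: columns A-D run
# left-to-right and rows 1-4 top-to-bottom from P2's viewpoint, so
# the extrinsic labels are parasitic on a FoR relative to P2.
def cell_region(label, table_width=1.0, table_depth=1.0):
    """Map a cell label such as 'A2' to an (x0, y0, x1, y1) rectangle
    in P2-centred table coordinates (x rightwards, y away from P2)."""
    col = ord(label[0].upper()) - ord("A")   # A..D -> 0..3
    row = int(label[1:]) - 1                 # 1..4 -> 0..3
    w, d = table_width / 4, table_depth / 4
    # row 1 is 'top' for P2, i.e. the far side of the table in this sketch
    y0 = table_depth - (row + 1) * d
    return (col * w, y0, (col + 1) * w, y0 + d)
```

The design point the example makes concrete is that even a nominally viewpoint-free coordinate system smuggles in a viewpoint: the mapping from labels to regions is only defined once P2's perspective fixes the direction of the axes.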
Figure 5: The entropy of DA tags and FoR assignment calculated for each moving window of 5 turns. Both dialogues are
combined into a single sequence and D2 starts at T158. Entropies were normalised by the maximum observable entropy in the window.
105 P2: so, 2 could be in line with a can you see a blue cup,
that is behind the A1 red cup?
110 P1: Yes. For me the blue cup is in front of the red cup.
But yes.
relative-?, explicit
111 It has a handle that perhaps you can’t see.
112 Since it is pointing south east for me.
relative-p1, explicit
113 P2: what do you mean by “in front of”
114 P1: Hmm
115 P2: closer to me or closer to you?
116 P1: Closer to you
relative-p1, explicit
117 P2: ok yep
118 P1: Okay
119 P2: i cna just see the handle almost pointing to A1
The excerpt shows that FoR itself may be open
for repair. In T110, P1 corrects what P2 said in T105. P2’s
description contains FoR relative to P1, but P1
mistakenly takes a FoR relative to the landmark
“the red cup” (i.e. intrinsic FoR). It is likely that
this is because the red cup is very salient for P1
and allows P1 to project their orientation to the
cup (the orientation of the FoR is not set by its
handle). This is the only example where intrinsic
FoR is used in the corpus and since it is repaired
we do not count it as such. In T116 P1 comes to
an agreement with P2.
3.5 FoR assignment over conversation
The preceding analysis of dialogue shows that
FoR assignment is dependent on the type of com-
municative act or conversational game that partic-
ipants are engaged in. The changes in perspec-
tive are dependent on factors that are involved in
that particular game, for example the structure and
other perceptual properties of the scene, the partic-
ipants’ focusing on the scene, their conversational
role and availability of knowledge, the accommo-
dated information so far, etc. To test whether the
FoR assignment could be predicted only from the
general dialogue structure we compared the en-
tropy of the Dialogue Act tags with the entropy
of the changes in FoR. As shown in Figure 5 there
are subsections of the dialogue where the variabil-
ity of DAs coincides with the variability of the
FoR (i.e. where the entropy is high) but this is
not a global pattern (Spearman’s correlation rho =
0.36, p=0.383). There are also no significant
cross-correlations between the variables at differ-
ent time lags. In conclusion, at least from our pilot
data, we cannot predict the FoR from the general
structure of conversational games at the level of
DAs. This also means that there is no global align-
ment of FoR assignment and that this is shaped by
individual perceptual and discourse factors that are
part of the game.
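The Spearman correlation used above is the Pearson correlation computed over ranks. A minimal sketch (our own implementation, with average ranks for ties; in practice a statistics package would also supply the p-value):

```python
def rank(values):
    """Average ranks (1-based), with tied values sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Applied to the two windowed entropy sequences (DA tags vs. FoR changes), a rho of 0.36 with p = 0.383 indicates no reliable monotone relationship, which is the basis for the conclusion that FoR cannot be predicted from DA structure alone.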
4 Conclusions and future work
We have described data from a pilot study which shows how dialogue participants negotiate the FoR over several turns and what strategies they use. The data support hypothesis (i): there is no general preference for a FoR in dialogue; rather, the choice is related to the communicative acts of a particular dialogue game. Examining more dialogues would allow us to design an ontology of such games with their associated strategies which could be modelled computationally. Hypothesis (ii), that participants align over the entire dialogue, is not supported. Rather, we see evidence for local alignment. Hypothesis (iii) is also not supported: while misunderstanding may be associated with the use of different FoRs, there are also other dialogue games where different FoRs are used, for example when locating unconnected objects over the entire scene.
We are currently extending our corpus to more dialogues, which will allow more reliable quantitative analyses. In particular, we are interested in considering additional perceptual and discourse features (rather than just DAs) that would allow us to automatically identify dialogue games with particular assignments of FoR and therefore apply the model
Communication is typically considered to be guided by principles of co- operation, requiring the consideration of the communication partner's men- tal states for its success. Miscommunication, in turn, is considered a prod- uct of noise and random error. I argue that communication proceeds in a relatively egocentric manner, with addressees routinely interpreting what speakers say from their own perspective, and speakers disambiguating their utterances with little consideration to the mental states of their addressees. Speakers also tend to overestimate how eective they are, believing that their message is understood more often than it really is. Together, these findings suggest a systematic cause for miscommunication. 1. Communication and miscommunication Most people, most of the time, think that what they say is pretty clear. Ambiguity is not routinely noted when people normally communicate. In contrast, linguists and psychologists who study the use of language notice potential ambiguity everywhere. The newspaper is a goldmine for unintended meanings, as in this recent classified ad: ''Bedroom furniture—Triple dresser with mirror, armoire, one night stand.'' Stu- dents of language know that even if you write ''one nightstand,'' the text will not be devoid of ambiguity because every text can have more than one meaning. Even a simple statement such as ''This chocolate is wonder- ful'' is ambiguous because it could be a statement of fact, an oer, a re- quest for more, and so on. Despite such ubiquitous ambiguity, there are two reasons why people may not be confused. They use context for dis- ambiguation, and they assume that the writer or speaker is a cooperative agent (Grice 1975). With both powerful tools, language users take a lin- guistic system that has a huge potential to fail, and use it successfully.