The Whole Truth and Nothing But the Truth: Faithful and Controllable Dialogue Response Generation with Dataflow Transduction and Constrained Decoding

Hao Fang, Anusha Balakrishnan, Harsh Jhamtani,
John Bufe, Jean Crawford, Jayant Krishnamurthy,
Adam Pauls, Jason Eisner, Jacob Andreas, Dan Klein
Microsoft Semantic Machines
Abstract
In a real-world dialogue system, generated re-
sponses must satisfy several interlocking con-
straints: being informative, truthful, and easy
to control. The two predominant paradigms in
language generation—neural language model-
ing and rule-based generation—both struggle
to satisfy these constraints. Even the best neu-
ral models are prone to hallucination and omis-
sion of information, while existing formalisms
for rule-based generation make it difficult to
write grammars that are both flexible and flu-
ent. We describe a hybrid architecture for di-
alogue response generation that combines the
strengths of both approaches. This architec-
ture has two components. First, a rule-based
content selection model defined using a new
formal framework called dataflow transduc-
tion, which uses declarative rules to transduce
a dialogue agent’s computations (represented
as dataflow graphs) into context-free gram-
mars representing the space of contextually
acceptable responses. Second, a constrained
decoding procedure that uses these grammars
to constrain the output of a neural language
model, which selects fluent utterances. The
resulting system outperforms both rule-based
and learned approaches in human evaluations
of fluency, relevance, and truthfulness.
1 Introduction
In a task-oriented dialogue system, response gen-
eration is naturally posed as a conditional lan-
guage modeling problem: dialogue agents must
produce a contextually appropriate natural lan-
guage string conditioned on the history of user and
agent utterances. But unlike many language gen-
eration problems, a good dialogue response gener-
ation model is not just a model of typical human
utterances in context. Instead, effective dialogue
agents must balance fluent generation with a set of
much stricter constraints.
Equal contribution.
[Figure 1 content: dialogue turns and dataflow graph fragments]
User: How many events are on my calendar today?
    List([Event(…), …])
Agent: You have three events.
User: Can you schedule a meeting with Sarah
    attendee=queryPerson(name=“Tara Smith”))
Agent: OK, I’ve booked it.
Agent: OK, I’ve booked a meeting with Tara
Smith at 2pm today. (d)
    Date(2022, 1, 3)
Figure 1: Interaction between a user and a dialogue
agent. Once the user’s request is translated into an
agent action—expressible as a program or dataflow
graph (a)—the agent must generate a response. Agent
responses might simply state the result of the agent’s
action (b–c), but often should describe both the ac-
tion and the result, e.g., to help users identify when the
agent has misunderstood their request (d). In all cases,
these responses must be truthful and produced by rules
that are straightforward for system designers to revise.
Consider the dialogue shown in Fig. 1. In the
first turn of this dialogue, the user makes a request,
the dialogue agent correctly translates it into a
computation, here represented as a dataflow graph
(Fig. 1a), and then accurately describes this com-
putation’s return value (Fig. 1b). But in the second
step, the dialogue agent makes a mistake: perhaps
because of a speech recognition error, it creates a
meeting with Tara Smith rather than Sarah Smith.
Simply describing the result of its action (Fig. 1c)
might cause a user to incorrectly conclude that
their request was completed successfully. Seeing
this error, a system designer might wish to ensure
that the dialogue agent instead echoes back to the
user the details of the agent’s action (Fig. 1d). This
example highlights several of the challenges
central to building real-world dialogue response
generation systems.
arXiv:2209.07800v1 [cs.CL] 16 Sep 2022
First, response generation is not simply a prob-
lem of describing the result of a computation in
natural language. In some cases, response gen-
erators may also usefully describe the prove-
nance of that result—the computation itself and
its intermediate values. In many human-to-
human conversational contexts, a response as de-
tailed and redundant as Fig. 1d would be over-
informative, violating Grice’s maxim of quan-
tity (1975). But Gricean cooperative speakers may
sometimes deem it useful to convey information
about their mental state. For a speaker that is prone
to mistakes, such as an AI agent, revealing its own
understanding and/or reasoning as in Fig. 1d can
increase user trust when its computation was ap-
propriate, and can provide an opportunity for cor-
rection when it was not.
Second, dialogue response generation systems
must guarantee truthfulness: as the primary
source of information about the action that a di-
alogue agent took, a response generator that de-
scribes even a small fraction of these computations
incorrectly can produce disastrous results. Impor-
tantly, truthful utterances might be low-probability
under a domain-general language model (LM),
particularly when they reflect errors in language
understanding (as in Fig. 1c–d).
Finally, response generation systems must sup-
port declarative specification of agent behavior.
When confusing or infelicitous responses are
discovered, it should be possible to easily and
precisely modify them without changing the dialogue
agent’s behavior in other contexts.
How might we build a generation system with
all these properties? In recent years, the main
focus of academic dialogue research has been on
“end-to-end” learned models for response
generation, especially neural sequence-to-sequence
models (Vinyals and Le, 2015; Zhang et al., 2020b).
But while such models excel at producing fluent
and coherent output, research continues to find
that they struggle to maintain faithfulness
(Wiseman et al., 2017; Maynez et al., 2020).
Perhaps more fundamentally, because the behavior of
such systems is encoded implicitly in their training
data, designing a user experience requires system
builders to write and edit a large number of
training examples whose final effect may be
difficult to predict.
As a result, many dialogue systems in the real
world remain rule-based: system builders hand-
write rules (e.g., in the form of a synchronous
grammar) for transforming dialogue states into
text, and these rules are applied directly during de-
ployment. But such rule-based systems are also
notoriously difficult to build and maintain (Walker
et al., 2002; Reiter, 2022). They require designers
to anticipate every low-level question about sur-
face realization (does your meeting about “prod-
uct” sound more or less natural than your “prod-
uct” meeting?), and to encode these in the same
grammar that is responsible for enforcing high-
level properties like truthfulness.
Given the many strengths of modern LMs, is
there a way to leverage them while preserving the
numerous other properties that dialogue response
generation systems must satisfy? In this paper, we
describe a hybrid approach to building dialogue
systems that combines the advantages of end-to-
end and rule-based approaches. This approach has
two components:
• A dataflow transduction procedure, based on
  a new formalism that uses declarative rules to
  map a computation (represented as a dataflow
  graph) into a context-free grammar (CFG)
  that defines the space of all responses
  allowed for the given computation. This formal
  framework makes it possible to write rules
  to precisely and truthfully describe both data
  and its provenance, while performing
  supplementary computation where needed to
  produce informative responses.
• A constrained decoding procedure that
  intersects a CFG with a neural LM, making it
  possible to decompose language generation into
  a content selection model (implemented by
  the grammar) and a separate fluency model
  (implemented by an LM).
Together, dataflow transduction and constrained
decoding make it possible to build a faithful
generation system capable of describing a complex,
open-ended space of tasks. Using a subset of
SMCalFlow dialogues (Semantic Machines et al.,
2020) and only 187 declarative rules, our hybrid
system is consistently rated as more truthful,
relevant, and fluent than either a rule-based or
end-to-end neural system.¹
¹ Code, data, and trained models will be released.
2 Problem Formulation
We study the problem of response generation for
task-oriented dialogue. A dialogue, like the one in
Fig. 1, consists of a sequence of turns, each
consisting of a user utterance $x_i$, one or more
actions $a_i$, and an agent response $y_i$. The job
of a dialogue agent is to predict an appropriate
action and response from a dialogue history, i.e.,
mapping from
$(x_1, a_1, y_1, x_2, a_2, y_2, \ldots, x_n) \mapsto (a_n, y_n)$.
A common approach to building dialogue
agents decomposes this prediction process into
several steps. First, a language understanding
module maps from a user utterance (and possi-
bly other components of the dialogue history) to
a meaning representation (e.g., a structured user
intent, API request or other executable program).
This meaning representation is then evaluated,
producing actions $a$; these are finally passed to
a response generation module that produces an
agent utterance $y$.
The focus of this paper is the response genera-
tor. We assume that we have a pre-specified lan-
guage understanding module that maps from con-
versation histories to computations, in the form
of short programs, which are then executed to
produce actions a. As described by Semantic
Machines et al. (2020), these computations may
equivalently be viewed as dataflow graphs in
which each node is labeled with a function, con-
structor, or primitive value, as well as a return
value once the node is executed. We additionally
assume access to a dataset of dialogues containing
gold-standard user and agent utterances. Given a
language understanding module and a dataset of
dialogues, we aim to implement a response gen-
erator that, when applied to a dataflow graph, sat-
isfies the three properties outlined in §1: descrip-
tion of data and its provenance, guaranteed truth-
fulness, and declarative specification.
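As a rough illustration of the graph representation assumed here, the following sketch models a dataflow node that carries a label, its input nodes, and a return value that is filled in on execution. The class name `Node`, the `evaluate` helper, and the toy calendar functions are hypothetical and are not taken from SMCalFlow.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Node:
    """One dataflow node: a function/constructor/primitive label plus inputs."""
    label: str
    inputs: List["Node"] = field(default_factory=list)
    value: Optional[Any] = None  # return value, set once the node is executed


def evaluate(node: Node, functions: Dict[str, Callable[..., Any]]) -> Any:
    """Execute a node bottom-up and cache its return value on the node."""
    if node.value is None:
        args = [evaluate(child, functions) for child in node.inputs]
        node.value = functions[node.label](*args)
    return node.value


# Toy function library (hypothetical): a fixed calendar keyed by date string.
CALENDAR = {"2022-01-03": ["Standup", "Lunch", "Review"]}
FUNCTIONS = {
    "today": lambda: "2022-01-03",
    "findEventsOnDate": lambda date: CALENDAR.get(date, []),
    "size": lambda items: len(items),
}

graph = Node("size", [Node("findEventsOnDate", [Node("today")])])
result = evaluate(graph, FUNCTIONS)
```

After evaluation, every node in `graph` holds both its label and its return value, which is the information a response generator needs in order to describe a result together with its provenance.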
Our response generation system is built from
two pieces: (1) a dataflow transduction procedure
for transducing dataflow graphs into CFGs, and
(2) a constrained decoding procedure for intersect-
ing a CFG with a neural LM, described in §3 and
§4, respectively. Hybrid generation systems of this
kind have a long history in natural language
generation (Langkilde and Knight, 1998). Our aim in
this paper is to show the benefits of a new gen-
eration paradigm based on dataflow transduction,
and offer new rule-writing formalisms and decod-
ing algorithms tailored to modern language mod-
els in this setting. The combination of declarative
rules for content selection and learned models for
fluency offers a powerful framework for building
dialogue systems that are ready for the real world.
3 Dataflow Transduction
Given a dataflow graph $G$ (e.g., Fig. 1a) rooted at
a node $v_{\mathrm{root}}$ (the return value of the
program represented by the dataflow graph), our task
is to generate a string that describes
$v_{\mathrm{root}}$ and its provenance. To achieve
this, we propose a new formal framework for
generation based on dataflow
transduction. At a high level, the formalism
uses declarative rules that describe how to trans-
form a dataflow graph into a graph-specific gram-
mar (specifically a quasi-synchronous context-
free grammar, or QCFG) that defines the space
of allowed responses. These rules walk along the
graph, introduce new computations (dataflow sub-
graphs) as needed, and add rules to the grammar.
Formally, a dataflow transducer $\mathcal{S}$ is
defined by a 4-tuple
$(\mathcal{T}, \Sigma, \mathcal{R}, t_{\mathrm{start}})$,
where $\mathcal{T}$ is a set of nonterminal
symbols,² $\Sigma$ is the set of terminal symbols
(word types), $\mathcal{R}$ is a set of dataflow
transduction rules (see §3.1), and
$t_{\mathrm{start}}$ is the start nonterminal. When
applied to $G$, the dataflow transducer expands the
graph, yielding a new graph $\bar{G}$, and
produces a QCFG.
A QCFG (Smith and Eisner, 2006) is a specialized
CFG whose nonterminals include alignments to the
nodes $V(\bar{G})$ of $\bar{G}$. Where an ordinary
CFG might specify ways to generate an NP (noun
phrase) or a DATE, a QCFG would specify ways
to generate an NP or DATE that describes the result
and provenance of $v$, for each appropriately typed
node $v \in V(\bar{G})$. A QCFG resulting from
dataflow transduction is a 4-tuple
$(\mathcal{T} \times V(\bar{G}), \Sigma, \mathcal{P}, t_{\mathrm{start}})$,
where $\mathcal{T} \times V(\bar{G})$ is the QCFG’s
set of nonterminal symbols and $\mathcal{P}$ is its
set of productions. A QCFG production has the form
    $\alpha \to \beta_1 \beta_2 \cdots \beta_N$
where the left-hand side
$\alpha = (t, v) \in \mathcal{T} \times V(\bar{G})$ is
a QCFG nonterminal, and each $\beta_i$ can be either
a nonterminal $(t_i, v_i)$ or a terminal in $\Sigma$.
The $v_i$ of a right-hand-side nonterminal $\beta_i$
may have appeared in the original $G$, or may have
been added to $\bar{G}$ by the dataflow transducer.
These production rules
then derive a set of strings as in an ordinary CFG.
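To make the definition concrete, the sketch below stores a small QCFG as a Python dictionary whose keys are nonterminals $(t, v)$ and whose values are lists of right-hand sides, and then enumerates every string the grammar derives. The type aliases, the `derive` function, and the toy productions are illustrative assumptions; exhaustive enumeration is only practical for tiny acyclic grammars like this one.

```python
from itertools import product
from typing import Dict, List, Tuple, Union

Nonterminal = Tuple[str, str]      # (type t, node id v), e.g. ("S", "v1")
Symbol = Union[Nonterminal, str]   # a nonterminal or a terminal string
Grammar = Dict[Nonterminal, List[List[Symbol]]]


def derive(grammar: Grammar, symbol: Symbol) -> List[str]:
    """Enumerate every string derivable from `symbol` (acyclic grammars only)."""
    if isinstance(symbol, str):    # terminals derive only themselves
        return [symbol]
    results = []
    for rhs in grammar[symbol]:
        parts = [derive(grammar, s) for s in rhs]
        for combo in product(*parts):
            results.append(" ".join(combo))
    return results


# Toy productions; node ids v1..v3 are placeholders for graph nodes.
qcfg: Grammar = {
    ("S", "v1"): [["I found", ("LEX", "v3"), "event", ("PP", "v2"), "."]],
    ("LEX", "v3"): [["one"], ["1"]],
    ("PP", "v2"): [["tomorrow"], ["on Thursday"]],
}
responses = derive(qcfg, ("S", "v1"))
```

Because each nonterminal is paired with a node id, two nonterminals of the same type (say, two PP descriptions) can have different productions when they describe different nodes.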
² In practice, nonterminal types often correspond to
syntactic categories such as NP (noun phrase) or to
semantic categories such as EVENT. This is up to the
designer.
Head: S
Body:
    match computation:
        case findEventsOnDate(date):
            num = size(computation)
            event = head(computation)
            return {"num": num, "event": event, "date": date}
Response Template:
    I found {LEX <num>} event {PP <date>}.
    It’s {EVENT <event>}.
Figure 2: A dataflow transduction rule with head S, a
body (expressed in Python), and a response template
(which queries the dictionary returned by the body).
3.1 Dataflow Transduction Rules
Each dataflow transduction rule can be applied to
a node $v \in \bar{G}$ (if $v$ has appropriate
properties) to create a single QCFG production
$(t, v) \to \cdots$ that could be used to describe
$v$. Since several dataflow transduction rules might
apply to $v$, we may get multiple QCFG productions
that could produce alternative descriptions of $v$.
Such a rule has three components: (1) a head,
namely the nonterminal $t \in \mathcal{T}$; (2) a body, which
is a piece of code that determines whether the rule
can apply to v, and which may look up or add
nodes that are related to v; and (3) a response
template, which specifies the right-hand side of
the QCFG production in terms of the related nodes
that were identified by the body. (The related
nodes will then be transduced into QCFG produc-
tions of their own.) An example is shown in Fig. 2.
Rule Head. This nonterminal characterizes the
type of node that the transduction rule is able to
describe and the type of description that it will
produce.² When a rule with head $t$ is successfully
applied to the node $v$, the resulting QCFG
production has left-hand side $(t, v)$.
Rule Body. A rule body declares the condition
under which the rule can be applied by examining
the dataflow graph $\bar{G}_v$ rooted at $v$. It can
contain executable logic that identifies additional
computation nodes that will be recursively
described.³ For example, the rule body in Fig. 2
checks whether $\bar{G}_v$ has the form
findEventsOnDate(date). If so, it binds the variable
date accordingly, and introduces new nodes into
$\bar{G}$, bound to the variables num and event,
which compute the number of events and the first
event. All three of these variables will be
referenced in the response template.
³ Note that the nodes added by the body may represent
further computations on existing nodes of $\bar{G}_v$
or may be completely disjoint from the existing
nodes.
Response Template. The response template is
a sequence of terminals and nonterminals that the
QCFG rule should produce. This can be used
to copy information verbatim from the dataflow
graph (e.g., for simple values like strings and
numbers), as well as to create more complex
descriptions. Every QCFG nonterminal $\beta_i$ in
the template specifies another node $v_i$ that needs
to be described as well as a dataflow nonterminal
$t_i$; that node can be recursively described by any
rule with head $t_i$. In our template syntax, the
QCFG nonterminal $(\mathrm{EVENT}, v_4)$ would be
constructed using the notation {EVENT <event>},
where the variable event is bound to the node $v_4$.
The syntax is illustrated in Fig. 2, where the
response template constructs three QCFG nonterminals
with types LEX, PP, and EVENT, respectively.
We note here that (1) transduction rules are se-
lected via their head nonterminal but also condi-
tion on the computation graph through their body;
and (2) all QCFG nonterminals are grounded in
concrete computations. Together, this provides a
means to ensure truthfulness when generating re-
sponses from this system.
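The relationship between a response template and a QCFG right-hand side can be sketched as follows. This is a minimal illustration that assumes slots are written exactly as {TYPE <variable>}; the regular expression, the `instantiate` function, and the node ids in `bindings` are hypothetical and do not reproduce the authors' implementation.

```python
import re
from typing import Dict, List, Tuple, Union

Symbol = Union[Tuple[str, str], str]

# Matches template slots of the form {TYPE <variable>}.
SLOT = re.compile(r"\{(\w+) <(\w+)>\}")


def instantiate(template: str, bindings: Dict[str, str]) -> List[Symbol]:
    """Turn a response template into a QCFG right-hand side.

    `bindings` maps template variables to the node ids chosen by the rule
    body, so every nonterminal in the result is grounded in a graph node.
    """
    rhs: List[Symbol] = []
    pos = 0
    for m in SLOT.finditer(template):
        text = template[pos:m.start()].strip()
        if text:
            rhs.append(text)                              # literal terminal text
        rhs.append((m.group(1), bindings[m.group(2)]))    # nonterminal (t, v)
        pos = m.end()
    tail = template[pos:].strip()
    if tail:
        rhs.append(tail)
    return rhs


template = "I found {LEX <num>} event {PP <date>}. It's {EVENT <event>}."
rhs = instantiate(template, {"num": "v3", "date": "v2", "event": "v4"})
```

A slot whose variable was not bound by the rule body raises a `KeyError` in this sketch, which reflects the property noted above: a template can only mention nodes that the body actually looked up or created.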
3.2 Dataflow Transduction Procedure
Given a dataflow transducer $\mathcal{S}$ and a
dataflow graph $G$ rooted at node
$v_{\mathrm{root}}$, we can transduce the graph into
a QCFG as follows. The system starts by creating
QCFG productions that can expand the start
nonterminal $(t_{\mathrm{start}}, v_{\mathrm{root}})$.
For each transduction rule in $\mathcal{R}$ whose
head is $t_{\mathrm{start}}$, it executes
the body, which checks any additional conditions
for whether the rule can be applied to vroot, binds
variables, and uses the response template to create
a QCFG production. For any new nonterminals
that appear on the right-hand-sides of these pro-
ductions, the system then recursively creates fur-
ther QCFG productions, in the same way, that can
expand those nonterminals. This recursive process
continues until productions have been created for
every nonterminal that appears in the QCFG. The
resulting QCFG compactly represents a combina-
torial space of possible responses.
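The recursive procedure described above can be sketched as a worklist algorithm. Everything below is a simplified assumption made for illustration: the graph is a plain dictionary keyed by node id, each rule is a (head, body) pair, and each body directly returns a right-hand side (or None when the rule does not apply) rather than going through a separate response template.

```python
from typing import Callable, Dict, List, Optional, Tuple, Union

Nonterminal = Tuple[str, str]
Symbol = Union[Nonterminal, str]
Rule = Tuple[str, Callable[[str, dict], Optional[List[Symbol]]]]


def transduce(rules: List[Rule], graph: dict, root: str, start: str):
    """Create QCFG productions for every nonterminal reachable from (start, root)."""
    productions: Dict[Nonterminal, List[List[Symbol]]] = {}
    agenda = [(start, root)]
    while agenda:
        nonterminal = agenda.pop()
        if nonterminal in productions:      # already expanded
            continue
        head, node = nonterminal
        productions[nonterminal] = []
        for rule_head, body in rules:
            if rule_head != head:
                continue
            rhs = body(node, graph)         # the body may add nodes to `graph`
            if rhs is None:
                continue
            productions[nonterminal].append(rhs)
            agenda.extend(s for s in rhs if isinstance(s, tuple))
    return productions


def describe_events(node, graph):
    if graph[node]["fn"] != "findEventsOnDate":
        return None
    date = graph[node]["args"][0]
    num = node + ".size"                    # new node introduced by this rule
    graph[num] = {"fn": "size", "args": [node],
                  "value": len(graph[node]["value"])}
    return ["I found", ("LEX", num), "event", ("PP", date), "."]


def lex_value(node, graph):
    return [str(graph[node]["value"])]


def date_pp(node, graph):
    return ["tomorrow"] if graph[node]["fn"] == "tomorrow" else None


graph = {
    "v2": {"fn": "tomorrow", "args": [], "value": "2022-09-14"},
    "v1": {"fn": "findEventsOnDate", "args": ["v2"],
           "value": ["Show and Tell"]},
}
rules = [("S", describe_events), ("LEX", lex_value), ("PP", date_pp)]
productions = transduce(rules, graph, "v1", "S")
```

Note that the graph is expanded as a side effect: after the call, it contains the extra `v1.size` node that the first rule introduced, mirroring the expansion of the original graph into a larger one during transduction.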
4 Constrained Decoding
In this section, we describe how to integrate the
formal framework above with a general LM to per-
form response generation, as illustrated in Fig. 3.
[Figure 3 content: user utterance, transducer rules,
QCFG productions, and candidate responses]
User utterance: Do I have any meetings tomorrow?
Transducer rule template: {UH <answer>}, {S <query>}
Transducer rule:
    match computation:
        case findEventsOnDate(date):
            num = size(computation)
            event = head(computation)
            return {...}
    I found {LEX <num>} event {PP <date>}.
    It’s {EVENT <event>}.
QCFG productions:
    (S, v0) → (UH, v0) , (S, v1)
    (UH, v0) → Yes
    (S, v1) → I found (LEX, v3) event (PP, v2) . It’s (EVENT, v4) .
Candidate responses:
    Yes, I found one event on Sept 14, 2022. It’s “Show and Tell”.
    Yes, I found one event on Thursday. It’s “Show and Tell” from 11:00 am to 11:30 am.
Node labels and values shown: v1, v2, 1, “Show and Tell”, Event(…) (v4)
Figure 3: The hybrid response generation approach using dataflow transduction and constrained decoding. Given
a computation nonEmpty(findEventsOnDate(tomorrow())) for the user utterance “Do I have any meetings
tomorrow”, we first derive QCFG productions by applying the dataflow transducer to the dataflow graph $G$ using
the procedure described in §3.2. This procedure also expands the dataflow graph into $\bar{G}$: for example, the nodes
v3 and v4 were added by the third transducer rule. Then we extract candidate responses from a LM, constrained
by the QCFG. The varying descriptions of the date v2 and the event v4 are permitted because the QCFG can
choose different productions while expanding the (PP, v2) and (EVENT, v4) nonterminals. (Those productions and
the transducer rules that created them are not shown in the figure. The nodes added by those transducer rules and
used by those productions are also not shown, except for v5.)
Given a derived QCFG of the kind described in
§3.2, we perform constrained decoding as in Shin
et al. (2021) and Roy et al. (2022), generating
response candidates from a pretrained LM.
The QCFG resulting from dataflow transduction
implicitly represents a set of possible derivation
trees and the agent responses they yield. As long
as transduction rules faithfully describe the nodes
they apply to, every derivation in this set will cor-
respond to a truthful agent utterance. But these
utterances may not always be grammatical or nat-
ural. For example, the response template in Fig. 2
may be realized as “I found 2 event on Monday”
since the rule body does not check whether the
value of num is 1. Similarly, the response template
    {EVENT <event>} starts on {DATE <date>}.
may be realized as “The event starting on Monday
starts on Monday” if the grammar permits iden-
tifying events by their dates. With carefully engi-
neered and highly specialized rules (e.g., using ex-
tremely fine-grained nonterminal types), it would
be possible to ensure that the responses are always
fluent and even that there is always a single possi-
ble outcome from the top-down search procedure.
However, this would usually require a much more
complicated set of rules, which creates a burden
for system development and maintenance.
Our proposed approach uses a large-scale
pretrained (and preferably fine-tuned) language
model to serve as a fluency model. One option
is to use the LM to re-rank all strings that can be
produced by the QCFG, but that would be very
computationally expensive. Instead, we follow
Shin et al. (2021) and Roy et al. (2022), who de-
code sentences from a given LM under the con-
straint that they must be valid under a given CFG.
This constrained decoding method uses Earley’s
algorithm (1970) to incrementally parse the sen-
tence as it is generated and determine the set of
words that could grammatically serve as the next
token. In contrast to these prior papers, which
used a static CFG, we derive a new CFG each time
the dialogue agent needs to generate a response,
by applying the dataflow transducer to the current
dataflow graph.
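The idea of restricting each decoding step to grammatical continuations can be sketched as follows. Note the substitutions: instead of Earley's algorithm, this toy tracks a set of symbol stacks, which only works for grammars without left recursion; it uses greedy search rather than beam search; and `toy_score` is a hand-written stand-in for LM scores. The grammar, the `EOS` marker, and all function names are assumptions made for illustration.

```python
EOS = "</s>"  # hypothetical end-of-sequence marker


def expand(stacks, grammar):
    """Rewrite leading nonterminals until each stack starts with a terminal."""
    done, todo = set(), list(stacks)
    while todo:
        stack = todo.pop()
        if stack and stack[0] in grammar:
            for rhs in grammar[stack[0]]:
                todo.append(tuple(rhs) + stack[1:])
        else:
            done.add(stack)
    return done


def advance(stacks, token, grammar):
    """Consume `token` on every stack that predicts it; drop the others."""
    return expand({s[1:] for s in stacks if s and s[0] == token}, grammar)


def greedy_decode(grammar, start, score, max_len=50):
    stacks = expand({(start,)}, grammar)
    output = []
    for _ in range(max_len):
        options = {s[0] for s in stacks if s}   # grammatical next tokens
        if () in stacks:                        # a complete parse exists
            options.add(EOS)
        if not options:
            break
        best = max(sorted(options), key=lambda tok: score(output, tok))
        if best == EOS:
            break
        output.append(best)
        stacks = advance(stacks, best, grammar)
    return output


# Toy grammar: the content ("two") is fixed, the surface form is not.
grammar = {
    "S": [["I", "found", "LEX", "EVENTS", "."]],
    "LEX": [["two"]],
    "EVENTS": [["event"], ["events"]],
}


def toy_score(prefix, token):
    """Stand-in for LM scores: prefer number agreement with the previous word."""
    singular = bool(prefix) and prefix[-1] == "one"
    if token == "event":
        return 1.0 if singular else -1.0
    if token == "events":
        return -1.0 if singular else 1.0
    return 0.0


decoded = greedy_decode(grammar, "S", toy_score)
```

The grammar guarantees that only the count licensed by the graph can be mentioned, while the scorer picks the fluent surface form, which is the division of labor between content selection and fluency described above.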
5 Experiments
To evaluate this approach, we conducted a set of
experiments on the SMCalFlow dataset (Semantic
Machines et al., 2020). The data and evaluation
metrics are described in §5.1. In §5.2, we present
the main evaluation results, followed by an abla-
tion study in §5.3.
5.1 Data and Evaluation Metrics
SMCalFlow is a large-scale task-oriented dialogue
dataset, in which each user utterance is annotated
with a correct dataflow program (i.e., computa-
tion) and a “gold” response that would be desirable
for the agent to produce. We use the v2.0 release
processed by Platanios et al. (2021). We focus on
a subset of SMCalFlow involving calendar event
queries. This subset contains 8938 training
examples and 1041 validation examples. We found that
187 transduction rules, written by some of us in a
matter of hours, were sufficient to cover all gold
system responses in these examples.⁴
Automatic Metrics. For automatic evaluation,
we use several reference-based metrics, including
BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin,
2004), and BERTScore-F1 (Zhang et al., 2020a),
computed using the GEM-metrics tool.⁵ Following
the recommendation in Zhang et al. (2020a),
we use the re-scaled version of BERTScore, which
is easier to interpret. We additionally consider
exact match scores, i.e., R@K, which measure
whether one of the top $K$ response candidates
exactly matches the reference. Both R@1 and R@5
scores are reported. We lowercase all the strings
and remove any extra spaces while computing the
exact match between two strings.
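The exact-match metric described above can be written in a few lines. This is a sketch under the assumption that "extra spaces" means leading, trailing, and repeated whitespace; the function names and the toy candidates are hypothetical.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def recall_at_k(candidates: Sequence[Sequence[str]],
                references: Sequence[str], k: int) -> float:
    """Fraction of examples whose top-k candidates contain the reference."""
    hits = 0
    for cands, ref in zip(candidates, references):
        target = normalize(ref)
        if any(normalize(c) == target for c in cands[:k]):
            hits += 1
    return hits / len(references)


# Toy data: ranked candidate lists (best first) and one reference each.
candidates = [
    ["You have three events.", "You have 3 events."],
    ["I found  one event.", "I found 1 event."],
]
references = ["You have 3 events.", "i found one event."]
r_at_1 = recall_at_k(candidates, references, k=1)
r_at_5 = recall_at_k(candidates, references, k=5)
```

In this toy data the first example only matches at rank 2, so it counts toward R@5 but not R@1.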
Human Evaluation. It is well known that popular
automatic evaluation metrics may not always
reflect the true quality of the generated responses
(Celikyilmaz et al., 2021). Thus, we further carry
out human evaluation on 297 examples randomly
sampled from the validation data. Specifically,
for each generated response, we collect human
judgments on three questions: grammaticality
(“has the virtual assistant made any grammar
errors?”), relevance (“has the virtual assistant
misunderstood the user’s request?”), and
truthfulness (“has the virtual assistant provided
any incorrect information as judged using the
database and timestamp?”). Three judgments are
collected for each question, and we report the
percentage of examples where “no” is the
majority-voted answer. Higher percentages are
better. Crowdworkers are recruited from Amazon
Mechanical Turk with qualification requirements
such as having a work approval rate higher than 80%
and having performed a minimum of 100 annotations.
They are paid at the rate of 0.15 cents per
judgment. The inter-annotator agreements for the
three questions are around 90%, 80%, and 75%,
respectively, as measured by the percentage of
examples where all three workers choose the same
answer.
⁴ Some of our rule bodies chose to expand the dataflow
graph by calling functions, so we also had to implement those
functions. In an end-to-end dialogue system, most of those
functions would already have been implemented to support
agent actions, not just natural language responses.
⁵ metrics
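The two statistics reported for the human evaluation can be computed as sketched below. The handling of ties is an assumption: an example counts toward the score only when "no" receives a strict majority of the votes. The function names and the toy judgments are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_no_rate(judgments: Sequence[Sequence[str]]) -> float:
    """Fraction of examples whose majority-voted answer is "no"."""
    count = 0
    for votes in judgments:
        answer, n = Counter(votes).most_common(1)[0]
        if answer == "no" and n > len(votes) / 2:   # strict majority only
            count += 1
    return count / len(judgments)


def full_agreement_rate(judgments: Sequence[Sequence[str]]) -> float:
    """Fraction of examples where every worker chose the same answer."""
    return sum(len(set(votes)) == 1 for votes in judgments) / len(judgments)


# Toy data: three judgments per example for one question.
judgments: List[List[str]] = [
    ["no", "no", "yes"],
    ["no", "no", "no"],
    ["yes", "no", "yes"],
    ["yes", "no", "yes but understandable"],   # three-way tie
]
score = majority_no_rate(judgments)
agreement = full_agreement_rate(judgments)
```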
5.2 Main Results
Our main evaluation results on SMCalFlow are
shown in Table 1. The first baseline we considered
is QCFG random sampling, which is similar to the
top-down search process described in §3.2 except
that at each step we randomly choose an applica-
ble transduction rule. The other baseline is un-
constrained LM decoding without using dataflow
transduction. We also report results on the gold
agent responses serving as the upper bound.
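The random sampling baseline can be approximated by the sketch below, which picks a production uniformly at random at each nonterminal. One simplification is assumed: the sketch samples from an already constructed grammar, whereas the baseline described above chooses among applicable transduction rules during top-down search. The grammar and function name are illustrative.

```python
import random
from typing import Dict, List, Tuple, Union

Nonterminal = Tuple[str, str]
Symbol = Union[Nonterminal, str]


def sample(grammar: Dict[Nonterminal, List[List[Symbol]]],
           symbol: Symbol, rng: random.Random) -> List[str]:
    """Expand `symbol` top-down, choosing productions uniformly at random."""
    if symbol not in grammar:          # terminal
        return [symbol]
    words: List[str] = []
    for child in rng.choice(grammar[symbol]):
        words.extend(sample(grammar, child, rng))
    return words


qcfg = {
    ("S", "v1"): [["I found", ("LEX", "v3"), "event", ("PP", "v2"), "."]],
    ("LEX", "v3"): [["one"], ["1"]],
    ("PP", "v2"): [["tomorrow"], ["on Thursday"]],
}
sampled = " ".join(sample(qcfg, ("S", "v1"), random.Random(0)))
```

Every sampled string is licensed by the grammar, but nothing steers the choice toward the most natural wording, which is why such a baseline can be truthful without being fluent.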
For both unconstrained and constrained decod-
ing, we prompt the LM with a string representation
of the computation graph (i.e., the lispress format
released in SMCalFlow v2.0), followed by its
execution result rendered as a JSON string. We use
beam search with a beam size $K = 5$. The LM is
initialized from CodeT5-base (Wang et al., 2021)
and fine-tuned on all training examples.
As expected, the QCFG random sampling base-
line struggles on all the automatic metrics, since
the dataflow transduction rules are written with
an emphasis on truthfulness rather than fluency,
which is reflected in the grammaticality score from
the human evaluation as well. However, the truth-
fulness score is as high as 92.3%, indicating the
generated responses are rarely incorrect. We ob-
serve that the generated responses sometimes are
generic and do not contain relevant information for
the user request, which partially contributes to the
high truthfulness score. This is also reflected in
the relevance score, which is the lowest among all
compared approaches.
                             Automatic Metrics                    Human Evaluation
System                       BLEU  ROUGE  BERTSc.  R@1   R@5      Grammatical  Relevant  Truthful
QCFG Random Sampling         .35   .58    .50      .02   .06      .623         .909      .923
Unconstrained Decoding       .77   .86    .86      .48   .66      .990         .940      .798
QCFG-Constrained Decoding    .79   .87    .85      .56   .79      .993         .973      .916
Gold                         1.0   1.0    1.0      1.0   1.0      .990         .980      .923
Table 1: Evaluation results on SMCalFlow. Automatic metrics are calculated against the gold responses on the full
validation set. Human evaluation is conducted on 297 randomly sampled validation examples.
In contrast, unconstrained decoding without
dataflow transduction achieves impressive scores
on automatic evaluation. Human evaluation also
suggests that the generated responses are
grammatically correct and relevant to the user’s
request in most cases. However, unconstrained
decoding scores low on the truthfulness dimension,
making false statements in about one-fifth of the
generated responses. An error rate this high is
usually unacceptable in real-world applications,
and it is consistent with findings in prior work
on factual errors from neural LMs (Wiseman et al.,
2017; Maynez et al., 2020).
Compared with unconstrained decoding, our
proposed QCFG-constrained decoding achieves
significantly better scores on exact match,
relevance and truthfulness, while maintaining
similar scores on BLEU, ROUGE, BERTScore and
grammaticality. In particular, human evaluation
results indicate that the quality of generated
responses is very close to that of the gold
responses.
Since even the gold responses did not achieve
100% on human evaluation scores, we manually
inspect those problematic examples. There are
4 examples for which the majority-voted answer
to the grammar question is “yes but understand-
able”, and others are all rated as not containing
any grammar errors. For the relevance question,
4 examples are due to arguably bad data and 2
examples receive tied votes. For the truthfulness
question, 9 examples are due to arguably bad data,
8 examples are due to crowd worker mistakes,
and 6 examples receive tied votes.
5.3 Ablation Study
We carry out an ablation study on SMCalFlow to
analyze how the amount of fine-tuning data and
the context used in the input sequence impact the
quality of the generated responses for both un-
constrained and QCFG-constrained LM decoding.
Results are summarized in Table 2.
Decoding  BLEU  ROUGE  BERTSc.  R@1   R@5
1. LM without fine-tuning
✗         .45   .03    .29      .00   .00
✓         .38   .22    .07      .02   .02
2. LM fine-tuned on 3% training data
✗         .67   .82    .80      .27   .39
✓         .74   .83    .82      .40   .61
3. LM fine-tuned on full training data
✗         .77   .86    .86      .48   .66
✓         .79   .87    .85      .56   .79
4. LM input without execution results
✗         .58   .70    .72      .27   .41
✓         .78   .85    .84      .53   .77
5. LM input with user utterance
✗         .77   .88    .87      .46   .66
✓         .78   .85    .85      .55   .79
Table 2: SMCalFlow ablation results, varying the
amount of fine-tuning data (groups 1–3) and the
context used in the input sequence (groups 4–5).
✗ and ✓ in the first column denote unconstrained
and QCFG-constrained decoding, respectively.
Impact of fine-tuning: Without fine-tuning the
LM, neither unconstrained nor constrained
decoding works well. This is likely due to the
mismatch between the pre-training tasks and the
response generation task. However, after fine-
tuning on only 3% randomly sampled training
data, both approaches achieve significantly bet-
ter scores, with larger gains on QCFG-constrained
decoding. This suggests that QCFG-constrained
decoding is much more data-efficient in the low-
data regime. Moreover, using 3% of the training
data, QCFG-constrained decoding is on par with
the unconstrained decoding when it uses 100%
of the training data, indicating that several ex-
pert hours spent on creating dataflow transduction
rules can dramatically reduce the cost of collect-
ing training data. Finally, we observe that increas-
ing the data from 3% to 100% can yield further
improvements, although the gains are diminish-
ing and QCFG-constrained decoding benefits less
from this. Note that despite the fact that the gaps
in BLEU, ROUGE and BERTScore between un-
constrained and QCFG-constrained decoding are
tiny in the full training data setting, as we observed
in Table 1, the unconstrained decoding approach
still performs poorly on the truthfulness evalua-
tion. Thus, it is unlikely that we can solve the
truthfulness issue for the unconstrained decoding
method by simply scaling up the training data. In
other words, using QCFG-constrained decoding is
a much more effective way to achieve faithful re-
sponse generation.
Impact of context: Results in groups 3–5 in Ta-
ble 2 all use the full training data to fine-tune the
LM. The difference is in the context used in the
input sequence to the LM. For group 3, the input
sequence is the computation concatenated with the
execution result, which is the same setup used in
§5.2. For group 4, we omit the execution result,
whereas for group 5, we add the user utterance
(prefixed to the computation). Comparing group
3 and group 4, we observe that omitting execu-
tion results significantly harms the performance of
unconstrained decoding; we believe that uncon-
strained decoding heavily relies on copying tokens
from execution results into the agent response. In
contrast, the dataflow transduction rules can exe-
cute the computation internally, so the proposed
QCFG-constrained decoding does not necessarily
need to see the execution results in the LM input
sequence, allowing a shorter LM prompt. Comparing
group 3 and group 5, we see that adding the user
utterance to the prompt does not bring any additional
benefit to either unconstrained or QCFG-constrained
decoding. This is expected because
the user utterance is redundant with the computation,
which is also in the prompt.
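The three input configurations compared above (groups 3–5) can be sketched as a single helper that assembles the LM input sequence. This is only an illustration: the function name `build_lm_input`, the separator string, and the example computation, result, and utterance are assumptions, and the paper's actual serialization format may differ.

```python
from typing import Optional


def build_lm_input(
    computation: str,
    execution_result: Optional[str] = None,
    user_utterance: Optional[str] = None,
    separator: str = " | ",  # hypothetical separator, not from the paper
) -> str:
    """Assemble the LM input from the available context pieces."""
    parts = []
    if user_utterance is not None:
        # Group 5: the user utterance is prefixed to the computation.
        parts.append(user_utterance)
    parts.append(computation)
    if execution_result is not None:
        # Groups 3 and 5: the execution result follows the computation.
        parts.append(execution_result)
    return separator.join(parts)


# Hypothetical serialized computation and execution result.
computation = "findEvents(subject='planning')"
result = "[Event(start='2pm')]"
utterance = "When is my planning meeting?"

group3_input = build_lm_input(computation, execution_result=result)
group4_input = build_lm_input(computation)
group5_input = build_lm_input(computation, execution_result=result, user_utterance=utterance)
```

The group 4 input is the shortest of the three, which is the prompt-length saving mentioned above for QCFG-constrained decoding.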
6 Related Work
One line of response generation research focuses
on generating fluent and coherent responses di-
rectly from user utterances without any interme-
diate structured representation. This paradigm is
mostly used for chatbots, as in early rule-based
systems (Weizenbaum, 1966; Wallace, 2009), neural
conversation models (Vinyals and Le, 2015;
Shang et al., 2015; Sordoni et al., 2015; Li et al.,
2017; Serban et al., 2016), and recent large-scale
pretrained LMs like DialoGPT (Zhang et al.,
2020b) and GPT-3 (Brown et al., 2020).
Another line focuses on generating text from
structured data, with applications beyond dialogue
response generation. For example, the
WebNLG challenge (Gardent et al., 2017) involves
generating natural language descriptions from relation
tuples, and Lebret et al. (2016) generate a biography
from a structured “infobox” record. Many
recent dialogue response generation tasks adopt
dialogue-act-based meaning representations,
including the MultiWOZ dataset (Budzianowski
et al., 2018), the Schema-Guided dialogue dataset
(Rastogi et al., 2020), and the E2E NLG challenge
(Dusek et al., 2020). In contrast, our response
generation task uses computations as the input, which
do not directly encode the dialogue acts of the
responses. This is a more challenging task, as the
system needs to perform extra reasoning to obtain
the derived information. In this sense, our task is
similar to the ones in CoSQL (Yu et al., 2019) and
Logic2Text (Chen et al., 2020), but the computations
used in our task are semantically richer.
Constrained decoding techniques for neural
LMs have been developed for text generation with
different types of constraints (Balakrishnan et al.,
2019; Dathathri et al., 2020; Lu et al., 2021, 2022).
Shin et al. (2021) develop a constrained decoding
approach for semantic parsing by restricting the
LM output at each step according to a given grammar.
In contrast, the grammar production rules in
our case are derived dynamically for each input.
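The idea of restricting the LM output at each step can be illustrated with a toy sketch. As a simplifying assumption, the per-input grammar is replaced here by a small finite set of token sequences, and the neural LM by a hand-written scoring function. A real system would instead query a parser over the derived grammar for the valid next tokens. All names below (`allowed_next_tokens`, `constrained_greedy_decode`, `toy_lm_score`) are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

EOS = "<eos>"


def allowed_next_tokens(prefix: Sequence[str], language: Set[Tuple[str, ...]]) -> Set[str]:
    """Tokens that extend `prefix` toward at least one string in `language`."""
    n = len(prefix)
    allowed = set()
    for sentence in language:
        if tuple(sentence[:n]) == tuple(prefix):
            # A fully matched sentence may end here; otherwise continue it.
            allowed.add(sentence[n] if n < len(sentence) else EOS)
    return allowed


def constrained_greedy_decode(
    score: Callable[[Sequence[str], str], float],
    language: Set[Tuple[str, ...]],
    max_len: int = 20,
) -> List[str]:
    """Greedy decoding in which every step only considers allowed tokens."""
    prefix: List[str] = []
    for _ in range(max_len):
        candidates = allowed_next_tokens(prefix, language)
        if not candidates:
            break
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda tok: score(prefix, tok))
        if best == EOS:
            break
        prefix.append(best)
    return prefix


def toy_lm_score(prefix: Sequence[str], token: str) -> float:
    """Stand-in for an LM: it prefers '3pm', a value the input does not license."""
    preferences = {"3pm": 2.0, "2pm": 1.0, "meeting": 0.5}
    return preferences.get(token, 0.0)


# The acceptable responses for one particular input, derived per input.
language = {
    ("the", "meeting", "starts", "at", "2pm"),
    ("it", "starts", "at", "2pm"),
}
response = constrained_greedy_decode(toy_lm_score, language)
```

Although the toy scorer prefers the unsupported token `3pm`, that token is never among the candidates, so the decoded response stays within the set of acceptable responses.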
7 Conclusion
In this paper, we have described a hybrid approach
for building dialogue response generation systems
that combines the advantages of both end-to-end
and rule-based approaches. This approach intro-
duces a new formalism for transducing dataflow
graphs into a QCFG, which can then be used in a
constrained decoding procedure that intersects the
QCFG with a neural LM. This formal framework
makes it possible to write rules to precisely and
truthfully describe both data and its provenance.
Experiments show that this hybrid approach
outperforms its unconstrained-decoding neural LM
counterpart in both automatic evaluation and
human evaluation, especially in the truthfulness
dimension. Moreover, using 3% of the training
data, the constrained decoding approach is on par
with the unconstrained decoding approach when
it uses 100% of the training data, indicating that
several expert hours spent on authoring rules can
dramatically reduce the cost of collecting training
data.
Acknowledgments
We would like to thank Ben Van Durme, Baoling
Peng, Subhro Roy, Richard Shin, and Patrick Xia
for valuable discussions and feedback on this paper.
References
Anusha Balakrishnan, Jinfeng Rao, Kartikeya Upasani,
Michael White, and Rajen Subba. 2019. Con-
strained decoding for neural NLG from composi-
tional representations in task-oriented dialogue. In
Proceedings of the 57th Conference of the Associa-
tion for Computational Linguistics, ACL 2019, Flo-
rence, Italy, July 28- August 2, 2019, Volume 1:
Long Papers, pages 831–844. Association for Com-
putational Linguistics.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-
Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey
Wu, Clemens Winter, Chris Hesse, Mark Chen,
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam Mc-
Candlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot
learners. In Advances in Neural Information Pro-
cessing Systems, volume 33, pages 1877–1901. Cur-
ran Associates, Inc.
Pawel Budzianowski, Tsung-Hsien Wen, Bo-Hsiang
Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ra-
madan, and Milica Gasic. 2018. MultiWOZ a
large-scale multi-domain wizard-of-oz dataset for
task-oriented dialogue modelling. In Proc. Confer-
ence on Empirical Methods in Natural Language
Processing (EMNLP).
Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao.
2021. Evaluation of text generation: A survey.
arXiv:2006.14799v2 [cs.CL].
Zhiyu Chen, Wenhu Chen, Hanwen Zha, Xiyou
Zhou, Yunkai Zhang, Sairam Sundaresan, and
William Yang Wang. 2020. Logic2Text: High-
fidelity natural language generation from logical
forms. In Findings of the Association for Computa-
tional Linguistics: EMNLP 2020, pages 2096–2111,
Online. Association for Computational Linguistics.
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane
Hung, Eric Frank, Piero Molino, Jason Yosinski, and
Rosanne Liu. 2020. Plug and play language models:
A simple approach to controlled text generation. In
8th International Conference on Learning Represen-
tations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020.
Ondrej Dusek, Jekaterina Novikova, and Verena Rieser.
2020. Evaluating the state-of-the-art of end-to-end
natural language generation: The E2E NLG challenge.
Comput. Speech Lang., 59:123–156.
Jay Earley. 1970. An efficient context-free parsing algorithm.
Communications of the ACM, 13(2):94–
Claire Gardent, Anastasia Shimorina, Shashi Narayan,
and Laura Perez-Beltrachini. 2017. The WebNLG
challenge: Generating text from RDF data. In Pro-
ceedings of the 10th International Conference on
Natural Language Generation, INLG 2017, Santi-
ago de Compostela, Spain, September 4-7, 2017,
pages 124–133. Association for Computational Linguistics.
Paul Grice. 1975. Logic and conversation. In Syntax
and semantics, volume 3, pages 41–58. Academic Press.
Irene Langkilde and Kevin Knight. 1998. Generation
that exploits corpus-based statistical knowledge. In
COLING 1998 Volume 1: The 17th International
Conference on Computational Linguistics.
Rémi Lebret, David Grangier, and Michael Auli. 2016.
Neural text generation from structured data with ap-
plication to the biography domain. In Proceed-
ings of the 2016 Conference on Empirical Meth-
ods in Natural Language Processing, EMNLP 2016,
Austin, Texas, USA, November 1-4, 2016, pages
1203–1213. The Association for Computational Linguistics.
Jiwei Li, Will Monroe, Alan Ritter, Jianfeng Gao,
Michel Galley, and Dan Jurafsky. 2017. Deep rein-
forcement learning for dialogue generation. In Proc.
Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 1192–1202.
Chin-Yew Lin. 2004. Rouge: A package for auto-
matic evaluation of summaries. In Text summariza-
tion branches out, pages 74–81.
Ximing Lu, Sean Welleck, Peter West, Liwei Jiang,
Jungo Kasai, Daniel Khashabi, Ronan Le Bras,
Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A.
Smith, and Yejin Choi. 2022. NeuroLogic A*esque
decoding: Constrained text generation with looka-
head heuristics. In Proceedings of the 2022 Confer-
ence of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Lan-
guage Technologies, pages 780–799, Seattle, United
States. Association for Computational Linguistics.
Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras,
Chandra Bhagavatula, and Yejin Choi. 2021. Neu-
roLogic decoding: (un)supervised neural text gener-
ation with predicate logic constraints. In Proceed-
ings of the 2021 Conference of the North Ameri-
can Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
4288–4299, Online. Association for Computational Linguistics.
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and
Ryan McDonald. 2020. On faithfulness and factu-
ality in abstractive summarization. In Proceedings
of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 1906–1919, On-
line. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. BLEU: a method for automatic
evaluation of machine translation. In Proceedings of
the 40th annual meeting of the Association for Com-
putational Linguistics ACL 2002.
Emmanouil Antonios Platanios, Adam Pauls, Subhro
Roy, Yuchen Zhang, Alex Kyte, Alan Guo, Sam
Thomson, Jayant Krishnamurthy, Jason Wolfe, Ja-
cob Andreas, and Dan Klein. 2021. Value-agnostic
conversational semantic parsing. In Proceedings
of the 59th Annual Meeting of the Association for
Computational Linguistics, Online. Association for
Computational Linguistics.
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara,
Raghav Gupta, and Pranav Khaitan. 2020. Towards
scalable multi-domain conversational agents: The
schema-guided dialogue dataset. In Proceedings
of the AAAI Conference on Artificial Intelligence,
pages 8689–8696.
Ehud Reiter. 2022. What are the problems with rule-
based NLG?
01/26/problems-with-rule-based- nlg/.
Subhro Roy, Sam Thomson, Tongfei Chen, Richard
Shin, Adam Pauls, Jason Eisner, and Benjamin
Van Durme. 2022. BenchCLAMP: A benchmark for
evaluating language models on semantic parsing.
Semantic Machines, Jacob Andreas, John Bufe, David
Burkett, Charles Chen, Josh Clausman, Jean Craw-
ford, Kate Crim, Jordan DeLoach, Leah Dorner, Ja-
son Eisner, Hao Fang, Alan Guo, David Hall, Kristin
Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Sm-
riti Jha, Dan Klein, Jayant Krishnamurthy, Theo
Lanman, Percy Liang, Christopher H. Lin, Ilya
Lintsbakh, Andy McGovern, Aleksandr Nisnevich,
Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth,
Subhro Roy, Jesse Rusak, Beth Short, Div Slomin,
Ben Snyder, Stephon Striplin, Yu Su, Zachary
Tellman, Sam Thomson, Andrei Vorobev, Izabela
Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang,
and Alexander Zotov. 2020. Task-oriented dialogue
as dataflow synthesis. Transactions of the
Association for Computational Linguistics, 8:556–571.
Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio,
Aaron Courville, and Joelle Pineau. 2016. Build-
ing end-to-end dialogue systems using generative hi-
erarchical neural network models. In Proc. AAAI
Conf. on Artificial Intelligence (AAAI).
Lifeng Shang, Zhengdong Lu, and Hang Li. 2015.
Neural responding machine for short-text conversa-
tion. In Proc. Annual Meeting of the Association
for Computational Linguistics (ACL), pages 1577–
Richard Shin, Christopher H. Lin, Sam Thomson,
Charles Chen, Subhro Roy, Emmanouil Antonios
Platanios, Adam Pauls, Dan Klein, Jason Eisner, and
Benjamin Van Durme. 2021. Constrained language
models yield few-shot semantic parsers. In Proceed-
ings of the 2021 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2021, Vir-
tual Event / Punta Cana, Dominican Republic, 7-11
November, 2021, pages 7699–7715. Association for
Computational Linguistics.
David Smith and Jason Eisner. 2006. Quasi-
synchronous grammars: Alignment by soft projec-
tion of syntactic dependencies. In Proceedings on
the Workshop on Statistical Machine Translation,
pages 23–30, New York City. Association for Com-
putational Linguistics.
Alessandro Sordoni, Michel Galley, Michael Auli,
Chris Brockett, Yangfeng Ji, Margaret Mitchell,
Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015.
A neural network approach to context-sensitive gen-
eration of conversational responses. In Proc. Conf.
of the North American Chapter of the Association
for Computational Linguistics (NAACL), pages 196–
Oriol Vinyals and Quoc Le. 2015. A neural conversation
model. In Proc. ICML Deep Learning Workshop.
Marilyn A. Walker, Owen C. Rambow, and Monica Ro-
gati. 2002. Training a sentence planner for spoken
dialogue using boosting. Computer Speech &
Language, 16(3):409–433. Spoken Language Generation.
Richard S. Wallace. 2009. The Anatomy of A.L.I.C.E.,
chapter Parsing the Turing Test. Springer, Dordrecht.
Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H.
Hoi. 2021. CodeT5: Identifier-aware unified pre-
trained encoder-decoder models for code under-
standing and generation. In Proceedings of the 2021
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 8696–8708, Online and
Punta Cana, Dominican Republic. Association for
Computational Linguistics.
Joseph Weizenbaum. 1966. ELIZA a computer program
for the study of natural language communication
between man and machine. Commun. ACM,
Sam Wiseman, Stuart Shieber, and Alexander Rush.
2017. Challenges in data-to-document generation.
In Proceedings of the 2017 Conference on Empiri-
cal Methods in Natural Language Processing, pages
2253–2263, Copenhagen, Denmark. Association for
Computational Linguistics.
Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue,
Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze
Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga,
Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan
Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vin-
cent Zhang, Caiming Xiong, Richard Socher, Wal-
ter Lasecki, and Dragomir Radev. 2019. CoSQL: A
conversational text-to-SQL challenge towards cross-
domain natural language interfaces to databases. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 1962–
1979, Hong Kong, China. Association for Computa-
tional Linguistics.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q.
Weinberger, and Yoav Artzi. 2020a. BERTScore:
Evaluating text generation with BERT. In 8th Inter-
national Conference on Learning Representations,
ICLR 2020.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun
Chen, Chris Brockett, Xiang Gao, Jianfeng Gao,
Jingjing Liu, and Bill Dolan. 2020b. DialoGPT:
Large-scale generative pre-training for conversational
response generation. In Proceedings of the
58th Annual Meeting of the Association for Compu-
tational Linguistics: System Demonstrations, pages
270–278, Online. Association for Computational Linguistics.