Maieutic Prompting: Logically Consistent Reasoning with
Recursive Explanations
Jaehun Jung† Lianhui Qin† Sean Welleck†‡
Faeze Brahman‡ Chandra Bhagavatula‡ Ronan Le Bras‡ Yejin Choi†‡
†Paul G. Allen School of Computer Science & Engineering, University of Washington
‡Allen Institute for Artificial Intelligence
hoony123@cs.washington.edu
Abstract
Despite their impressive capabilities, large pre-
trained language models (LMs) struggle with
consistent reasoning; recently, prompting LMs
to generate explanations that self-guide the in-
ference has emerged as a promising direction
to amend this. However, these approaches
are fundamentally bounded by the correctness
of explanations, which themselves are often
noisy and inconsistent. In this work, we de-
velop MAIEUTIC PROMPTING, which infers
a correct answer to a question even from
the noisy and inconsistent generations of an LM.
MAIEUTIC PROMPTING induces a tree of ex-
planations abductively (e.g. X is true, be-
cause . . . ) and recursively, then frames the in-
ference as a satisfiability problem over these
explanations and their logical relations. We
test MAIEUTIC PROMPTING for true/false QA
on three challenging benchmarks that require
complex commonsense reasoning. MAIEUTIC PROMPTING achieves up to 20% better
accuracy than state-of-the-art prompting meth-
ods, and as a fully unsupervised approach, per-
forms competitively with supervised models.
We also show that MAIEUTIC PROMPTING im-
proves robustness in inference while providing
interpretable rationales.
1 Introduction
Following the remarkable success of few-shot
prompting powered by large language models (e.g.
Brown et al.,2020), recent studies on prompting
methods suggest that LMs’ reasoning capability
can be further promoted by generating a sequence of explanations for a given problem, prior to infer-
ring the answer (Wei et al.,2022;Wang et al.,2022;
Liu et al.,2021). The so-called explanation-based
prompting helps an LM better elicit its knowledge
and reason by leveraging its own generated expla-
nations - whether it be a commonsense knowledge
statement (Liu et al.,2021), a solution for a math
word problem (Wei et al.,2022), or a scratchpad
[Figure 1 (diagram): an explanation-based prompt (e.g. "Q: Captain Kirk is part of Star Wars? A: Captain Kirk is a character in Star Trek. Therefore, the answer is False. ... Q: At least one mayor is not male?" with output "A: There are female mayors. Therefore, the answer is True.") and three error types: Type I (41%), e.g. "Smoke is not the source of fire?" answered "Smoke is a result of fire. Therefore, the statement is False."; Type II (33%), e.g. both "One is a number that comes before zero?" and "One is a number that comes after zero?" answered True; Type III (35%), e.g. "Butterflies fly with 3 wings?" answered False because "Butterflies have 4 wings", while "Butterflies have 4 wings?" is itself answered False because "Butterflies have 2 wings on each side of their body".]
Figure 1: Logical errors in explanation-based prompt-
ing: (1) explanation does not logically lead to the an-
swer, (2) model is invariant to negation, and (3) falsi-
fies its own explanation. We prompt 175B GPT-3 with
100 questions sampled from Talmor et al. (2021).
representing intermediate steps of program execu-
tion (Nye et al.,2021a).
Explanation-based prompting is intuitively mo-
tivated by the steps of reasoning humans typically
employ while solving a problem (Hausmann and
VanLehn,2007). However, we find that this in-
tuition is faulty in practice, as model-generated
explanations are often logically inconsistent and
unreliable. For example, we manually inspected
100 samples from a QA task (Figure 1) and found
that for a considerable number of cases, (1) the
generated explanation does not logically lead to
the inferred answer, (2) the model infers the same
label for a statement and its negation (Kassner and
Schütze,2020), and (3) the model falsifies its own
generated explanation. These findings raise funda-
mental questions on the role of explanations in LM
reasoning: If the explanation is correct - is there
a guarantee that the LM will infer a label that is
consistent with the explanation? And if the expla-
nation is wrong - is there a way to make use of
even the wrong explanation in inferring the correct answer?
[Figure 2 (diagram): for Q "War cannot have a tie?", the LM abductively generates E_T ("True, because in a context of war, there's always a victor and a loser.") and E_F (width-wise spanning), and recursively generates E_TF ("False, because there can be cases where the loser is not clear.") from E_T (depth-wise spanning). Entail/contradict relations and weights, e.g. w(E_F) = 0.92, w(E_TF) = 0.98, w(E_T→Q) = 1.00, w(E_F→¬Q) = 1.00, are fed to a MAX-SAT solver, which assigns E_T: False, E_F: True, E_TF: True, Q: False.]
Figure 2: An overview of MAIEUTIC PROMPTING. Given a question Q, we generate a maieutic tree consisting of abductive and recursive explanations, define the relations between them, and employ MAX-SAT to find the best truth-value assignments to the explanations and Q.
To this end, we propose MAIEUTIC PROMPTING, a novel few-shot inference method that in-
fers a correct answer by enumerating a structure
of explanations - possibly noisy and contradictory
- and resolving them with a symbolic inference
algorithm. Inspired by the maieutic method¹ of Socrates, MAIEUTIC PROMPTING induces the LM
to generate abductive explanations for diverse hy-
potheses with deep recursive reasoning, and collec-
tively identifies and eliminates contradicting candi-
dates, resulting in consistent answers.
Figure 2 shows the overview of MAIEUTIC
PROMPTING. First, we prompt the LM to abduc-
tively (Peirce,1974) rationalize both possible an-
swers, True and False, rather than generating a
single explanation and then connecting it to one
of the answer choices. Moreover, we do not ex-
pect the 1-hop explanations to be always correct;
thus, we further validate the LM’s confidence in its
explanations by recursively prompting the model
with its own generation as the question. Our gener-
ation process derives a tree structure of generated
propositions, where each proposition establishes a logical ground for the correctness of another.
To infer the answer for the original question, we
quantify the strength of the LM’s belief in each
proposition and the logical relationships between
propositions in the maieutic tree. We then employ
the weighted MAX-SAT (Battiti,2009) solver to
collectively infer the truth-values of all the propo-
sitions (including the original question) that best
satisfy the set of observed relations. This way, we
symbolically induce the subset of generations that
makes the most probable and consistent inference.
¹The maieutic method brings out definitions implicit in the interlocutor's beliefs, ... is a method of hypothesis elimination, steadily identifying and eliminating those that lead to contradictions (Vlastos, 1991).
Our proposed method can run completely unsu-
pervised with any few-shot promptable LM (e.g.
GPT-3; Brown et al.,2020).
Our experimental results show that the perfor-
mance of MAIEUTIC PROMPTING exceeds that of
all the few-shot prompting baselines (e.g. Chain of
Thought; Wei et al.,2022) in three commonsense
reasoning and fact verification benchmarks. Using
a small NLI model to infer the relations between
propositions, MAIEUTIC PROMPTING performs
up to 20% better than other prompting methods,
and performs on par or even better than fine-tuned
models. Further analyses show that MAIEUTIC
PROMPTING is robust to perturbations in both the
questions and prompts, and offers an interpretable
interface to understand the rationale behind the
model’s inference.
2 Problem Setup and Background
Our goal is to infer whether a given statement Q makes sense, i.e. to infer the truth value A of Q. Conventionally, this can be done by prompting a language model with one of the following two methods:
Standard Prompting
Let Q be a statement we want to infer the truth value of (i.e. either True or False). In standard k-shot prompting, the model-inferred answer $\hat{A}$ is defined as:

$$\hat{A} = \operatorname*{argmax}_{A \in \{T,F\}} p_{\mathrm{LM}}(A \mid Q, C)$$

where $C = \{(q_1, a_1), \cdots, (q_k, a_k)\}$ denotes the few-shot examples for in-context learning.
Explanation-based Prompting
In explanation-based prompting, the inference process is factorized into two steps:

$$\hat{A} = \operatorname*{argmax}_{A \in \{T,F\}} \int_{E} p_{\mathrm{LM}}(A \mid Q, E, C)\, p_{\mathrm{LM}}(E \mid Q, C)$$

Here, E denotes the explanation generated prior to inferring the answer label, and $C = \{(q_1, e_1, a_1), \cdots, (q_k, e_k, a_k)\}$ includes k examples of questions, explanations and answers. Since marginalizing over all E is intractable, prior works resort to a sampling-based approximation:

$$\hat{A} = \operatorname*{argmax}_{A \in \{T,F\}} p_{\mathrm{LM}}(A \mid Q, E, C), \quad \text{where } E \sim p_{\mathrm{LM}}(E \mid Q, C)$$
Also, sampling multiple explanations then aggre-
gating the inference results could help in increas-
ing the diversity of the generated knowledge (West
et al.,2021) and reducing the impact of erroneous
explanations (Wang et al.,2022).
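To make the two formulations above concrete, here is a minimal Python sketch (not the authors' implementation). The helpers answer_prob (returning the LM's probabilities for "True" and "False" as the next answer) and sample_explanation (returning one sampled completion) are hypothetical wrappers around whatever LM API is available, e.g. a GPT-3-style completion endpoint with log-probabilities; the prompt format only loosely mirrors Figure 1.

```python
from typing import Callable, Dict, List, Tuple

AnswerProbs = Dict[str, float]  # {"True": p_true, "False": p_false}


def standard_prompting(q: str,
                       examples: List[Tuple[str, str]],
                       answer_prob: Callable[[str], AnswerProbs]) -> str:
    """Standard k-shot prompting: A_hat = argmax_A p_LM(A | Q, C)."""
    context = "".join(f"Q: {qi}\nA: {ai}\n\n" for qi, ai in examples)
    probs = answer_prob(context + f"Q: {q}\nA:")
    return max(probs, key=probs.get)


def explanation_based_prompting(q: str,
                                examples: List[Tuple[str, str, str]],
                                sample_explanation: Callable[[str], str],
                                answer_prob: Callable[[str], AnswerProbs]) -> str:
    """Sampling-based approximation: first E ~ p_LM(E | Q, C),
    then A_hat = argmax_A p_LM(A | Q, E, C)."""
    context = "".join(
        f"Q: {qi}\nA: {ei} Therefore, the answer is {ai}.\n\n"
        for qi, ei, ai in examples)
    e = sample_explanation(context + f"Q: {q}\nA:")
    probs = answer_prob(context + f"Q: {q}\nA: {e} Therefore, the answer is")
    return max(probs, key=probs.get)
```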
3 Maieutic Prompting
In this section, we introduce MAIEUTIC PROMPT-
ING, which performs inference over a maieutic tree
of generated explanations. First, we introduce logi-
cal integrity, a key concept that is used to determine
the reliability of propositions.
Language models often generate logically incon-
sistent propositions; for instance, in Figure 1, the
model infers True when prompted with either “One
is a number that comes before zero.” or “One is
a number that comes after zero.”. In this sense, p(True | Q) does not provide a reliable value to determine whether Q is true or not. We formalize this idea as logical integrity: a proposition Q is logically integral when the LM consistently infers the truth value of Q and ¬Q (i.e. Q as True and ¬Q as False, or vice versa).
Formally, we define a boolean function integral(E) as follows²:

1. $\operatorname*{argmax}_{A \in \{T,F\}} p_{\mathrm{LM}}(A \mid E, C) = T$ and $\operatorname*{argmax}_{A \in \{T,F\}} p_{\mathrm{LM}}(A \mid \neg E, C) = F$

2. $\operatorname*{argmax}_{A \in \{T,F\}} p_{\mathrm{LM}}(A \mid E, C) = F$ and $\operatorname*{argmax}_{A \in \{T,F\}} p_{\mathrm{LM}}(A \mid \neg E, C) = T$

$$\mathrm{integral}(E) = \mathbb{1}\{\text{condition 1 or 2 is satisfied}\}$$
²Given E, ¬E can be automatically generated simply by inserting a prefix (e.g. It is wrong to say that), or by prompting the LM to negate the given sentence.
A statement is considered to be logically integral
/ True when condition 1 is met, and logically inte-
gral / False when condition 2 is met. Intuitively,
the truth values of logically integral propositions
are more credible than non-integral ones in that
the LM is guaranteed to be logically consistent with their negated counterparts. For example, “One is a number that comes before zero.” in Figure 1 would not be logically integral, as the model assigns the same truth value to both Q and ¬Q.
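Given such a wrapper, the integrity check is cheap. The sketch below reuses the hypothetical answer_prob helper from the sketch in Section 2; negate follows the prefix-insertion heuristic of footnote 2 and is likewise only an illustrative stand-in.

```python
def negate(statement: str) -> str:
    """Footnote 2: automatically form not-E, e.g. by inserting a prefix."""
    return "It is wrong to say that " + statement[0].lower() + statement[1:]


def integral(e: str, context: str,
             answer_prob: Callable[[str], AnswerProbs]) -> bool:
    """A proposition is logically integral iff the LM assigns opposite
    truth values to E and not-E (condition 1 or condition 2 above)."""
    probs_e = answer_prob(context + f"Q: {e}\nA:")
    probs_not_e = answer_prob(context + f"Q: {negate(e)}\nA:")
    return max(probs_e, key=probs_e.get) != max(probs_not_e, key=probs_not_e.get)
```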
For the rest of this section, we first search for
logically integral propositions by constructing the
maieutic tree (Section 3.1); we then quantify the
relations between the propositions (Section 3.2),
on which basis we infer the final consistent answer
(Section 3.3).
3.1 Maieutic Tree Generation
3.1.1 Abductive Explanation Generation
Given a question, we require the LM to post-hoc ra-
tionalize both True and False labels. This abductive
explanation generation has several advantages over
an ad-hoc approach that first generates an explana-
tion, then predicts the label. First, in the ad-hoc
setting, the model is required to not only stay rel-
evant to the given question, but also to generate a
discriminative explanation that helps in choosing
one label over the other. Abductive generation, on the contrary, encourages the model to consider different possible answers rather than discriminate between them, which often reveals an explanation that other-
wise would not have been generated. Second, the
model is able to incorporate label information into
its generation. Intuitively, this may help elicit more
specific explanations and mitigate the issue of a
bland and generic generation which does not help
the inference, a well-known weakness in LM-based
conditional generation (Adiwardana et al.,2020).
Concretely, we define a function abductive which takes the statement Q as input and outputs a tuple of two abductive explanations, generated with True and False given as the answer, respectively:

$$\mathrm{abductive}(Q) = (E_T, E_F), \quad \text{where } E_{A \in \{T,F\}} \sim p_{\mathrm{LM}}(E \mid Q, A, C)$$
Figure 2 shows a concrete example of generating E_T given Q. With Q, we prompt the model to rationalize True as the answer: “War cannot have a tie? True, because”, which is then completed by the LM with an explanation: “In a context of war, there’s always a victor and a loser.”.
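The abductive generator itself is a thin wrapper around two label-conditioned prompts. Again, sample_explanation is the hypothetical completion helper from the Section 2 sketch and the prompt format is only illustrative.

```python
def abductive(q: str, context: str,
              sample_explanation: Callable[[str], str]) -> Tuple[str, str]:
    """abductive(Q) = (E_T, E_F): rationalize each answer label in turn."""
    e_true = sample_explanation(context + f"Q: {q} True, because")
    e_false = sample_explanation(context + f"Q: {q} False, because")
    return e_true, e_false
```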
[Figure 3 (diagram): for Q "If you travel west far enough from the west coast, you will reach the east coast?", depth-1 generation produces E_T "The Earth is round and if you travel in any direction long enough, you will eventually return to where you started." (integral(E_T) = 1) and E_F "You cannot reach the east coast by going west." (integral(E_F) = 0). Depth-2 generation from E_F produces E_FT "You can reach the east coast by going west by traveling around the world." (integral(E_FT) = 1) and E_FF "If you travel in a specific straight line, you will eventually reach the other side." (integral(E_FF) = 0); the non-integral branch E_FF is pruned.]
Figure 3: Illustrative example of maieutic tree generation, with the max tree depth set to 2. For visual clarity, we generate only 1 E_T and 1 E_F per question and omit the width-wise spanning of knowledge.
3.1.2 Depth-wise Knowledge Spanning
As substantiated in Figure 1, LM-generated expla-
nations are noisy and inaccurate by nature. Prior
works indirectly compensate for the untrustworthy
generations by independently sampling multiple
generations and then aggregating them at the answer
level (e.g., through majority voting; Wang et al.,
2022), but it is questionable whether the voting-
based aggregation is indeed capable of filtering out
only the incorrect explanations.
To systematically address this issue, we harness
the LM itself to validate its own generations - by
recursively prompting the LM with the generated
explanations. As Figure 2 shows, this corresponds to a depth-wise spanning of knowledge that induces a maieutic tree, a multi-depth structure of generated propositions and relations between them. Let S_i denote the set of nodes at depth i in the maieutic tree T. Each node in S_i is an explanation for an answer label (True or False), recursively generated given its parent node as the question:

$$S_i \subseteq \bigcup_{l \in \{T,F\}^{i-1}} \{E_{lT}, E_{lF}\}, \qquad (E_{lT}, E_{lF}) = \mathrm{abductive}(E_l)$$
For instance, in Figure 2, “There can be cases where the loser is not clear.” (E_TF) is generated by prompting the LM with its parent node “In a context of war, there’s always a victor and a loser.” (E_T) and False, i.e. $E_{TF} \sim p_{\mathrm{LM}}(\cdot \mid E_T, F, C)$. Note that T is a full tree when the equality holds for all depths.
In practice, we sample multiple explanations with the same Q and A through nucleus sampling (Holtzman et al., 2019). This corresponds to the width-wise spanning of knowledge, enhancing the diversity and coverage of generated explanations.
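Width-wise spanning then amounts to repeating the abductive call with a stochastic decoder; the sketch below assumes nucleus sampling is configured inside the hypothetical sample_explanation helper (Section 4 uses p = 0.7 and three samples per label at depth 1).

```python
def abductive_width(q: str, context: str,
                    sample_explanation: Callable[[str], str],
                    n: int = 3) -> Tuple[List[str], List[str]]:
    """Sample n (E_T, E_F) pairs for the same Q to diversify the tree."""
    pairs = [abductive(q, context, sample_explanation) for _ in range(n)]
    return [t for t, _ in pairs], [f for _, f in pairs]
```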
3.1.3 When to Stop Generating
Generating a full tree could be computationally expensive, as the number of generations grows exponentially with the maximum tree depth. Therefore,
in each branch, we stop generating further once we
reach a logically integral proposition; intuitively,
this aligns with our goal of finding generations for
which the LM consistently infers a particular truth
value.
Figure 3 illustrates an example of maieutic tree generation where the maximum depth of the tree is set to 2. For visual clarity, we only generate one explanation per Q and A. Given Q, we first generate E_T and E_F, then validate whether each of these explanations is logically integral. Since E_T is logically integral, we stop generating in this branch, but continue generating from E_F, which is not logically integral. After reaching the maximum depth, we prune the branches leading to leaf nodes that are still not logically integral. This procedure
guarantees one simple constraint over the maieu-
tic trees: keep only the generations that lead to a
logically integral proposition. We provide a formal
description of the generation process in Appendix
A.
3.2 Defining the Relations
Now that we have generated the maieutic tree, we
seek to define the relations between propositions
and quantify their strength into scalar weights. For
illustration, assume that an LM has generated the
following E_F for the given Q:

Q: Captain Kirk is part of Star Wars?
A: False, because Captain Kirk is a character in Star Trek.
The generation can be logically interpreted as fol-
lows: (1) the LM believes that Captain Kirk is a
character in Star Trek, (2) the LM believes that
the proposition Captain Kirk is a character in Star
Trek can be a reason to deny that Captain Kirk is
part of Star Wars. Accordingly, we define belief
and consistency to represent the two dimensions of
the logical relationship.
Belief    w_E corresponds to the LM's belief that the proposition E is true (and therefore, ¬E is false). To quantify belief, we prompt the LM with E and ¬E respectively as a question, and compare the probability assigned to True:

$$w_E := \frac{p_{\mathrm{LM}}(T \mid E, C)}{p_{\mathrm{LM}}(T \mid E, C) + p_{\mathrm{LM}}(T \mid \neg E, C)}$$
Note that calculating this does not require any ad-
ditional prompting, as we already gained access to
these values while checking for the logical integrity
of each proposition.
Consistency    w_{E,Q,A} corresponds to the consistency of the generated E with the given Q and A. Intuitively, if the LM is logically consistent, the likelihood of E generated given an answer (e.g. the likelihood of E_F being generated given False) should be larger than its likelihood given the opposite answer (e.g. E_F being generated given True). Following this intuition, we compute the consistency as:

$$w_{E,Q,A} := \frac{p_{\mathrm{LM}}(E \mid Q, A, C)}{p_{\mathrm{LM}}(E \mid Q, A, C) + p_{\mathrm{LM}}(E \mid Q, \neg A, C)}$$
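Both weights can be computed from LM probabilities that are already available during tree generation. As a sketch, assume two hypothetical helpers: prob_true(statement) returning p_LM(True | statement, C), and seq_logprob(prompt, continuation) returning the log-probability the LM assigns to a continuation.

```python
import math


def belief(e: str, prob_true: Callable[[str], float]) -> float:
    """w_E = p_LM(T|E,C) / (p_LM(T|E,C) + p_LM(T|not E,C))."""
    p_e, p_not_e = prob_true(e), prob_true(negate(e))
    return p_e / (p_e + p_not_e)


def consistency(e: str, q: str, a: str,
                seq_logprob: Callable[[str, str], float]) -> float:
    """w_{E,Q,A} = p_LM(E|Q,A,C) / (p_LM(E|Q,A,C) + p_LM(E|Q,not A,C))."""
    other = "False" if a == "True" else "True"
    p_a = math.exp(seq_logprob(f"Q: {q} {a}, because", e))
    p_not_a = math.exp(seq_logprob(f"Q: {q} {other}, because", e))
    return p_a / (p_a + p_not_a)
```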
3.3 Inference
The two types of relations formulate a set of unary and binary logical constraints, based on which we assign the truth values to all nodes in the maieutic tree T, and in consequence, infer the answer to the original question. First, we represent the set of beliefs C_blf as a set of unary constraints. For each leaf node E in T,

$$c_{\mathrm{blf}} = \begin{cases} E & \text{if } E \text{ is logically integral / True} \\ \neg E & \text{if } E \text{ is logically integral / False} \end{cases}$$

Note that all the leaf nodes in T are logically integral, hence we can count on the credibility of belief for these nodes. We now define the set of all belief constraints C_blf as:

$$C_{\mathrm{blf}} = \{\, c_{\mathrm{blf}} \text{ for all } E \in \mathrm{leaf}(T) \,\}$$
For example, the nodes E_F and E_TF in Figure 2 would have a belief constraint in C_blf.

Likewise, for consistency, we define C_con as the set of binary constraints using logical implication. For each edge (E_l, E_lA) in T,

$$c_{\mathrm{con}} = \begin{cases} E_{lA} \rightarrow E_l & \text{if } A = \mathrm{True} \\ E_{lA} \rightarrow \neg E_l & \text{if } A = \mathrm{False} \end{cases}$$

$$C_{\mathrm{con}} = \{\, c_{\mathrm{con}} \text{ for all } (E_l, E_{lA}) \in \mathrm{edge}(T) \,\}$$
Our objective is to assign the truth values for all Es and the root node Q in T, such that we maximize

$$\sum_{c \in C_{\mathrm{blf}} \cup\, C_{\mathrm{con}}} w_c \cdot \mathbb{1}\{c = \mathrm{True}\},$$

which is the sum of the weights of the satisfied constraints.
This problem is naturally formulated as weighted
MAX-SAT, which can be algorithmically solved
using an off-the-shelf solver. Specifically, we use the RC2 solver (Morgado et al., 2014) to find the assignments for the Es and Q that max-satisfy the logical constraints in $C_{\mathrm{blf}} \cup C_{\mathrm{con}}$.
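The MAX-SAT step is easy to reproduce with the python-sat package, which ships the RC2 solver named above. The following is an illustrative sketch rather than the authors' code: propositions are mapped to integer variables, each implication becomes a two-literal clause, and the real-valued weights (the example values loosely follow Figure 2) are scaled to integers because WCNF expects integer weights.

```python
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

# Soft constraints as (clause over proposition names, weight);
# a leading "-" marks negation.
soft = [
    (["EF"], 0.92),            # belief: EF is logically integral / True
    (["ETF"], 0.98),           # belief: ETF is logically integral / True
    (["-ET", "Q"], 1.00),      # consistency: ET -> Q
    (["-EF", "-Q"], 1.00),     # consistency: EF -> not Q
    (["-ETF", "-ET"], 1.00),   # verifier: ETF contradicts ET, i.e. ETF -> not ET
]

var_ids = {}

def lit(name: str) -> int:
    """Map a (possibly negated) proposition name to a signed integer literal."""
    base = name.lstrip("-")
    var_ids.setdefault(base, len(var_ids) + 1)
    return -var_ids[base] if name.startswith("-") else var_ids[base]

wcnf = WCNF()
for clause, weight in soft:
    wcnf.append([lit(x) for x in clause], weight=int(round(weight * 100)))

with RC2(wcnf) as solver:
    model = set(solver.compute())   # positive literal => proposition assigned True

assignment = {name: (vid in model) for name, vid in var_ids.items()}
print(assignment)   # e.g. {'EF': True, 'ETF': True, 'ET': False, 'Q': False}
```

The value recovered for Q is the final answer; belief constraints are unit clauses and consistency or NLI constraints are two-literal implications, so the encoding stays compact.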
3.4 Auxiliary Verifier
One limitation of the consistency definition in Sec-
tion 3.2 is that it only considers the relationship
between a parent node and a child node. Since the
definition builds upon the likelihood of each gen-
eration from an LM, we cannot take into account the relationships across branches, e.g. E_T and E_F in Figure 3. This motivates us to introduce a small NLI model as an auxiliary verifier, which can infer the relationship between an arbitrary pair of nodes in T. Following previous works (Minervini and Riedel, 2018; Wang et al., 2019), we convert the NLI labels into logical relations as follows:

$$\mathrm{Entail}(E_1, E_2) : E_1 \rightarrow E_2, \qquad \mathrm{Contradict}(E_1, E_2) : E_1 \rightarrow \neg E_2$$
For all pairs of nodes $(E_1, E_2) \in \mathrm{node}(T)^2$, $E_1 \neq E_2$, we obtain either $E_1 \rightarrow E_2$ or $E_1 \rightarrow \neg E_2$ if E_1 entails or contradicts E_2. For NLI-based clauses, we fix the weights to 1.³ While the objective function stays the same, C_con is now replaced with C_NLI, a set of clauses induced by the verifier model.
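The verifier can be sketched with the Hugging Face transformers library. The checkpoint and preprocessing below are assumptions (the paper only specifies an off-the-shelf RoBERTa fine-tuned on MNLI); the returned (clause, weight) tuples are in the same format as the MAX-SAT sketch in Section 3.3.

```python
from itertools import permutations

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; any MNLI-finetuned RoBERTa would play the same role.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")


def nli_label(premise: str, hypothesis: str) -> str:
    """Predict ENTAILMENT / NEUTRAL / CONTRADICTION for a sentence pair."""
    inputs = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = nli(**inputs).logits
    return nli.config.id2label[logits.argmax(dim=-1).item()]


def nli_clauses(propositions: dict) -> list:
    """Entail(E1, E2): E1 -> E2 becomes (-E1 or E2); Contradict(E1, E2):
    E1 -> not E2 becomes (-E1 or -E2); all NLI clauses get weight 1."""
    clauses = []
    for (n1, e1), (n2, e2) in permutations(propositions.items(), 2):
        label = nli_label(e1, e2)
        if label == "ENTAILMENT":
            clauses.append((["-" + n1, n2], 1))
        elif label == "CONTRADICTION":
            clauses.append((["-" + n1, "-" + n2], 1))
    return clauses
```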
³We also tried using the label probability assigned by the NLI model as the weight, but fixing it to 1 yielded better results.
                                          Com2Sense               CSQA 2.0         CREAK
Model                                  dev    test  pairwise    dev    test     dev    test  contrast
Supervised
RoBERTa-large (Liu et al., 2019)       62.8   59.4    33.3       -      -      80.6   80.3    61.5
T5-large (Raffel et al., 2020)         62.8   60.6    41.8      53.8   54.6      -      -       -
T5-3B (Raffel et al., 2020)            73.2    -       -         -     60.2    85.6   85.1    70.0
UnifiedQA-3B (Khashabi et al., 2020)   75.1   71.3    51.3       -      -        -      -       -
T5-11B (Raffel et al., 2020)           77.2    -       -        68.5   67.8    89.5     -     75.2
Unicorn-11B (Lourie et al., 2021)       -      -       -        69.9   70.2      -      -       -
Prompting
Standard                               58.1    -       -        54.1    -      60.3     -     55.2
Chain of Thought (Wei et al., 2022)    61.6    -       -        59.6    -      64.8     -     59.4
Self Consistency (Wang et al., 2022)   61.4    -       -        60.8    -      70.5     -     64.8
GKP (Liu et al., 2021)                 61.8    -       -        59.7    -      75.4     -     68.2
MAIEUTIC PROMPTING (Ours)              72.5   75.0    68.7      69.5   68.3    85.2   85.3    77.4
Table 1: Experimental results of MAIEUTIC PROMPTING and baseline methods on three benchmark datasets. We
differentiate supervised baselines (upper section) from prompting methods (lower section), and bold the best num-
bers for each section. MAIEUTIC PROMPTING with GPT-3 outperforms all prompting baselines with the same
model, while being competitive against billion-scale supervised LMs.
4 Experiments
Datasets
We evaluate MAIEUTIC PROMPTING
on three commonsense reasoning and fact verifica-
tion benchmarks in binary QA format: Com2Sense
(Singh et al.,2021), CSQA 2.0 (Talmor et al.,
2021), CREAK (Onoe et al.,2021). Com2Sense
and CSQA 2.0 consist of adversarial commonsense
questions generated to mislead a proxy model.
CREAK tests for a combination of commonsense
reasoning and accurate fact retrieval, consisting of
long-tail questions such as “Harry Potter can teach
how to fly on a broomstick?”. Despite their simple
format, these datasets require both a substantial
amount of knowledge and robust reasoning, which
make them challenging even for the billion-scale
fine-tuned LMs (Table 1).
Baselines
We compare our method with both the
few-shot prompting methods and supervised mod-
els. For few-shot prompting we consider both the
standard prompting and explanation-based prompt-
ing methods, including Chain of Thought (Wei
et al.,2022), Self-Consistency (Wang et al.,2022)
and Generated Knowledge Prompting (GKP) (Liu
et al.,2021). For supervised models, we consider
the strong baselines used for the respective dataset,
such as fine-tuned RoBERTa (Liu et al.,2019), T5
(Raffel et al.,2020), UnifiedQA (Khashabi et al.,
2020) and Unicorn (Lourie et al.,2021).
Configuration Details
For all prompting meth-
ods, we use the same set of 6 demonstration exam-
ples and the same version of GPT-3 (text-davinci-
001) as the LM. In maieutic tree generation, we
set the maximum depth to 2. For depth 1, we use nucleus sampling (p = 0.7) (Holtzman et al., 2019) to generate 3 E_Ts and 3 E_Fs from Q. For depth 2, we use greedy decoding to generate 1 E_T and 1 E_F from each parent node. This constrains the generated tree to have at most 18 nodes excluding the original Q.⁴
In Section 4.3, we conduct an ablation
study on this depth-adaptive decoding scheme and
analyze the effect of the tree size. For the main ex-
periments, we use the off-the-shelf RoBERTa (Liu
et al.,2019) fine-tuned on MNLI (Williams et al.,
2018) as a verifier model.
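For reference, these settings can be collected into a single configuration object; the field names below are ours, not the authors', and the values simply restate the paragraph above.

```python
# Hypothetical configuration mirroring the reported experimental settings.
MAIEUTIC_CONFIG = {
    "lm": "text-davinci-001",        # GPT-3 version shared by all prompting methods
    "num_demonstrations": 6,         # same 6 few-shot examples for every method
    "max_tree_depth": 2,
    "depth1_decoding": {"strategy": "nucleus", "top_p": 0.7,
                        "num_E_T": 3, "num_E_F": 3},
    "depth2_decoding": {"strategy": "greedy", "num_E_T": 1, "num_E_F": 1},
    "max_nodes_excluding_Q": 18,
    "verifier": "RoBERTa fine-tuned on MNLI",
}
```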
4.1 Benchmark Performance
Table 1 presents the overall evaluation results of MAIEUTIC PROMPTING along with the prompting and supervised baselines. MAIEUTIC PROMPTING
significantly outperforms all prompting methods
across all benchmarks. Notably, GKP and Self Con-
sistency ensembled more 1-hop explanations than
the maximal size of the maieutic tree; our supe-
rior performance compared to these methods con-
firms the sample efficiency of depth-wise knowl-
edge spanning. Moreover, MAIEUTIC PROMPT-
ING is the only prompting method that performs
better than even the smallest supervised baseline
(RoBERTa-large) in Com2Sense and CREAK. In
fact, MAIEUTIC PROMPTING allows us to use an
⁴Both GKP and Self Consistency employ an ensemble strategy, generating N different samples of explanations and then aggregating their answers. For a fair comparison with ours, we set N = 20 for both methods, generating more explanations than the maximal possible size of the maieutic tree in our setting.
[Figure 4 (bar chart): mean and standard deviation of Com2Sense dev accuracy for Standard, Chain of Thought, Self Consistency, and MAIEUTIC PROMPTING under different few-shot examples (means 57.9 to 72.3) and different example orders (means 57.8 to 71.7).]
Figure 4: Robustness of prompting methods under dif-
ferent few-shot examples / different order of exam-
ples. We compare the mean and standard deviation of
Com2Sense dev set accuracy.
off-the-shelf LM to achieve comparable perfor-
mance to a large fine-tuned LM by simply plugging
in our inference algorithm.
Although explanation-based prompting methods
do improve the model’s accuracy compared to stan-
dard prompting, the gap gets smaller in a more
challenging benchmark. For instance, while the
gap between standard prompting and GKP is sub-
stantial in CREAK, the gap reduces down to only
around 5% in both Com2Sense and CSQA 2.0. Un-
like these baselines, MAIEUTIC PROMPTING con-
sistently improves the performance by more than
10% over standard baseline across all benchmarks.
In the challenging CSQA 2.0, MAIEUTIC PROMPTING achieves similar performance to fine-tuned
Unicorn, a pre-trained model specialized for com-
monsense related tasks.
4.2 Robustness Analysis
We perform additional analyses to understand the
working of our method under semantic perturba-
tions and different prompt formats.
Robustness to semantic perturbations
In addi-
tion to the standard accuracy, we report two addi-
tional metrics called pairwise accuracy and con-
trast set accuracy in Table 1. In Com2Sense test set
and CREAK contrast set, each question is paired
with its complementary counterpart, of which the
surface form is similar but the answer should be the
opposite (e.g. “Barack Obama only has daughters.”
vs “Barack Obama has no daughter.”). In pairwise
accuracy, a model should get both sentences correct
to get the pair correct. Since fine-tuned models are
not exposed to the complementary sentences during training, those that rely on surface-form heuristics perform worse in these metrics compared to stan-
dard accuracy. In these metrics, the gap between
MAIEUTIC PROMPTING and baselines widens sub-
stantially, indicating the robustness of our method
Model Com2Sense Dev Acc.
Non-abductive generation 68.4
All greedy decoding 67.2
All nucleus sampling 72.0
Likelihood-based consistency 65.6
Maieutic Prompting 72.5
Table 2: Ablation study of MAIEUTIC PROMPTING
with different variants. The best configuration is
with abductive generation, depth-adaptive decoding
and verifier-based consistency.
Dimension 1 2 3 5 10
Depth 61.3 72.5 72.4 - -
Width 62.4 66.5 72.5 71.5 72.1
Table 3: Performance of MAIEUTIC PROMPTING on
Com2Sense with different maieutic tree sizes.
against semantic perturbations.
Robustness to different prompts
Prior works
revealed that prompting performance could be sen-
sitive to few-shot examples and their order (Lu
et al.,2021b;Zhao et al.,2021). We investigate
whether this holds true for MAIEUTIC PROMPTING, as shown in Figure 4. We compare different
prompting methods run with 3 different sets of few-
shot examples (left), and 5 different permutations
of the few-shot examples (right). In both settings,
while Self Consistency and MAIEUTIC PROMPTING are much more stable than the other two, our method has slightly less variance.
4.3 Ablation Study
We ablate different components of MAIEUTIC
PROMPTING to investigate their respective contri-
butions as shown in Table 2.
Generation
First, we consider MAIEUTIC PROMPTING without abductive generation - while all the other stages stay the same, we gen-
erate each explanation without providing an an-
swer label, i.e. in an identical fashion to Chain
of Thought. In this setting, the performance of
MAIEUTIC PROMPTING in Com2Sense degrades
about 4%, alluding to the importance of abduc-
tive generation in eliciting the latent knowledge
from LM. Next, we ablate the depth-adaptive de-
coding mechanism (Section 3.1.2), by applying
either greedy decoding or nucleus sampling for all
depths of the maieutic tree. All greedy decoding
restrains width-wise spanning of knowledge, hence
[Figure 5 (stacked bar chart): for Set 1 (correctly inferred) and Set 2 (incorrectly inferred) examples, the proportion of All / Mixed / None ratings on the Relevant, Grammatical, Factual, and Helpful criteria.]
Figure 5: Human evaluation results. We separately
evaluate 50 correctly inferred examples (Set 1) and 50
wrongly inferred examples (Set 2). To minimize sub-
jectivity, we use a strict 3-level scale, where annota-
tors choose All only when all the statements in the true Es are desirable (e.g. grammatical) on their own, Mixed when at least one E is undesirable, and None if none of
them are desirable.
leads to a large degradation in performance. All nucleus sampling performs comparably to our best configuration, although the stochas-
tic decoding produces slightly more errors in the
explanations.
Consistency
We ablate the NLI-based clauses
and replace them with the original
Ccon
discussed
in Section 3.2. With the LM-likelihood based
clauses, the accuracy reduces by about 7%, but
still prevails over the prompting baselines in Ta-
ble 1. The result clearly shows that the verifier
model indeed benefits the inference process, pro-
viding more accurate relations between generated
explanations. Nonetheless, our method performs
competently even without access to this verifier
model.
Effect of tree size
We also investigate how the
size of the maieutic tree influences the performance.
In Table 3, we present the performance of MAIEUTIC PROMPTING on the Com2Sense dev set with var-
ious values of maximal depth and width. In both
dimensions, the accuracy saturates after a certain
threshold. We attribute this to (1) the topic drift
in generation which intensifies as the depth grows,
(2) larger overlaps in generated knowledge as we
sample more explanations width-wise.
4.4 Human Evaluation
We qualitatively analyze actual inference results
of MAIEUTIC PROMPTING through human evalua-
tion. For each sample, we first retrieve true
E
s(the
set of generated
E
s that are inferred to be True by
MAI EU TIC P ROMPTI NG). We then evaluate them
over the four criteria from Liu et al. (2021): (1)
Grammar: whether each explanation is grammati-
cally correct, (2) Relevance: whether the explana-
tion is topically relevant to the question, (3) Fac-
tuality: whether each explanation states facts, and
(4) Helpfulness: whether the explanation explicitly
leads to one of the answer labels. Six NLP ex-
perts scored a total of 100 examples sampled from
CSQA 2.0 dev set, of which 50 were answered
correctly (Set 1) and 50 were answered wrongly
by the model (Set 2). The average Krippendorff’s
alpha (Krippendorff,2007) was 0.64, indicating a
substantial inter-annotator agreement.
Figure 5 presents the evaluation results. For both correct and incorrect sets, over 99% of the true Es are grammatically perfect, and most of them provide evidence relevant to the question.⁵ Surprisingly, the LM often generates both factual and helpful explanations even when its answer is different from the ground truth: over 40% of the true Es for incorrectly answered examples are perfectly factual, and over 23% of them are completely helpful in correctly answering the question. We find that in
many of these cases, the questions did not have a
clear-cut answer; as exemplified in Figure 6, the ex-
planations generated and validated by MAIEUTIC
PROMPTING are compelling enough as an alterna-
tive to the ground-truth answer.
5 Related Work
Numerous prior works have leveraged natural lan-
guage explanations (NLEs) to promote model rea-
soning, either by training a model to explain (Ra-
jani et al.,2019;Camburu et al.,2018;Chen et al.,
2022; Wiegreffe and Marasović, 2021), generat-
ing unsupervised answers to pre-defined queries
or collecting distantly supervised rationales us-
ing LMs (Shwartz et al.,2020;Brahman et al.,
2021). Incorporated with the large-scale LMs ca-
pable of in-context learning (Brown et al.,2020;
Chowdhery et al.,2022), these efforts have led
to explanation-based prompting (Wei et al.,2022;
⁵In MAIEUTIC PROMPTING, it is natural that some of the true Es are not directly related to Q, but they still contribute to the inference by validating other Es.
[Figure 6 (diagram): two maieutic trees with their generated explanations, integral / truth-value labels, and MAX-SAT assignments. Left: Q "War cannot have a tie.", with generated explanations such as "In the context of a war, there is always a victor and a loser." and "The Korean War ended in a military armistice, meaning that the war ended in a draw and neither side could claim victory."; inferred answer False, ground truth False. Right: Q "In football, the top division almost always contains the same clubs.", with explanations about promotion and relegation between the Premier League and the Championship; inferred answer False, ground truth True.]
Figure 6: Examples of MAIEUTIC PROMPTING. We present a case where MAIEUTIC PROMPTING correctly infers
the ground-truth answer (above), and a case where the inferred answer is different from the ground-truth. Even in
the latter case, the generated explanations make sense and logically lead to the inferred answer. We provide more
examples in Appendix B.
Wang et al.,2022;Liu et al.,2021;Lampinen et al.,
2022). MAIEUTIC PROMPTING builds upon
these works, rethinking the role of NLEs in LM-
based inference.
Despite their success, recent observations reveal
that LM-generated explanations are unreliable, as
they often lack logical consistency and are not fac-
tually grounded (Ye and Durrett,2022;Kassner
and Schütze,2020). These findings are closely re-
lated to the broader limitations of generative LMs,
which assign high probability to unlikely sentences
(Welleck et al.,2020;Holtzman et al.,2021) and
are sensitive to semantic perturbations (Elazar et al.,
2021). MAIEUTIC PROMPTING overcomes these
limitations by avoiding the use of explanations “as-
is”, and inferring the answer based on the relation-
ships between the explanations.
Another line of relevant work harnesses NLEs
to improve model interpretability. A mainstream
approach in this direction is to train a model that
explains its inference post-hoc or in parallel with
the answer (Camburu et al.,2018;Narang et al.,
2020;Jacovi et al.,2021). Unlike these works, the
explanations in our work are designed to be intrin-
sic (Du et al.,2019); the explanations themselves
explicitly participate in the inference.
Our work also relates to the recent thread of
works that apply symbolic methods on top of LMs
to improve their consistency. The symbolic meth-
ods take the form of either a lexical constraint on
sequence decoding (Lu et al.,2021a), or an aux-
iliary symbolic module for the generation to be
consistent with the world model (Nye et al.,2021b)
and performing discrete operations (Chen et al.,
2019; Cobbe et al., 2021). Other works explore training a model that simulates the symbolic reasoning
process, such as logical transformation (Bostrom
et al.,2021) and consistent generation of beliefs
(Kassner et al.,2021;Dalvi et al.,2022). However,
these models require a curated set of human anno-
tations, which limits their application to specific
benchmarks and domains. MAIEUTIC PROMPTING
generalizes the broad idea of these neuro-symbolic
approaches in an unsupervised setup, employing
a MAX-SAT algorithm to symbolically determine
the true subset from a noisy pool of neural genera-
tions.
6 Conclusion
In this work, we suggest MAIEUTIC PROMPTING,
a novel few-shot inference method inspired by the
Socratic way of conversation. We systematically
generate a tree of explanations that bear logical
relations between each other, then assign the truth values to the explanations so as to max-satisfy these re-
lations. Empirical results on multiple benchmarks
demonstrate both the competitiveness and robust-
ness of MAIEUTIC PROMPTING compared to di-
verse baselines. Qualitative analyses show that our
method also provides intrinsic interpretations over
its inference.
References
Daniel Adiwardana, Minh-Thang Luong, David R So,
Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang,
Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu,
et al. 2020. Towards a human-like open-domain
chatbot. arXiv preprint arXiv:2001.09977.
Roberto Battiti. 2009. Maximum satisfiability problem,
pages 2035–2041. Springer US, Boston, MA.
Kaj Bostrom, Xinyu Zhao, Swarat Chaudhuri, and
Greg Durrett. 2021. Flexible generation of natural
language deductions. In Proceedings of the 2021
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 6266–6278.
Faeze Brahman, Vered Shwartz, Rachel Rudinger, and
Yejin Choi. 2021. Learning to rationalize for non-
monotonic reasoning with distant supervision. In
Thirty-Fifth AAAI Conference on Artificial Intelli-
gence, AAAI 2021, Thirty-Third Conference on In-
novative Applications of Artificial Intelligence, IAAI
2021, The Eleventh Symposium on Educational Ad-
vances in Artificial Intelligence, EAAI 2021, Vir-
tual Event, February 2-9, 2021, pages 12592–12601.
AAAI Press.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-
Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu,
Clemens Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei.
2020. Language models are few-shot learners. In
Advances in Neural Information Processing Systems,
volume 33, pages 1877–1901. Curran Associates,
Inc.
Oana-Maria Camburu, Tim Rocktäschel, Thomas
Lukasiewicz, and Phil Blunsom. 2018. e-snli: Nat-
ural language inference with natural language expla-
nations. Advances in Neural Information Process-
ing Systems, 31.
Howard Chen, Jacqueline He, Karthik Narasimhan,
and Danqi Chen. 2022. Can rationalization improve
robustness? arXiv preprint arXiv:2204.11790.
Xinyun Chen, Chen Liang, Adams Wei Yu, Denny
Zhou, Dawn Song, and Quoc V Le. 2019. Neural
symbolic reader: Scalable integration of distributed
and symbolic representations for reading compre-
hension. In International Conference on Learning
Representations.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton,
Sebastian Gehrmann, et al. 2022. Palm: Scaling
language modeling with pathways. arXiv preprint
arXiv:2204.02311.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavar-
ian, Jacob Hilton, Reiichiro Nakano, Christopher
Hesse, and John Schulman. 2021. Training veri-
fiers to solve math word problems. arXiv preprint
arXiv:2110.14168.
Bhavana Dalvi, Oyvind Tafjord, and Peter Clark.
2022. Towards teachable reasoning systems. arXiv
preprint arXiv:2204.13074.
Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Tech-
niques for interpretable machine learning. Commu-
nications of the ACM, 63(1):68–77.
Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhi-
lasha Ravichander, Eduard Hovy, Hinrich Schütze,
and Yoav Goldberg. 2021. Measuring and im-
proving consistency in pretrained language models.
Transactions of the Association for Computational
Linguistics, 9:1012–1031.
Robert G. M. Hausmann and Kurt VanLehn. 2007. Ex-
plaining self-explaining: A contrast between content
and generation. In AIED.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and
Yejin Choi. 2019. The curious case of neural text de-
generation. In International Conference on Learn-
ing Representations.
Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi,
and Luke Zettlemoyer. 2021. Surface form compe-
tition: Why the highest probability answer isn’t al-
ways right. In Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Process-
ing, pages 7038–7051.
Alon Jacovi, Swabha Swayamdipta, Shauli Ravfogel,
Yanai Elazar, Yejin Choi, and Yoav Goldberg. 2021.
Contrastive explanations for model interpretability.
In Proceedings of the 2021 Conference on Empiri-
cal Methods in Natural Language Processing, pages
1597–1611.
Nora Kassner and Hinrich Schütze. 2020. Negated and
misprimed probes for pretrained language models:
Birds can talk, but cannot fly. In Proceedings of the
58th Annual Meeting of the Association for Compu-
tational Linguistics, pages 7811–7818, Online. As-
sociation for Computational Linguistics.
Nora Kassner, Oyvind Tafjord, Hinrich Schütze, and
Peter Clark. 2021. Beliefbank: Adding memory to a
pre-trained language model for a systematic notion
of belief. In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing,
pages 8849–8861.
Daniel Khashabi, Sewon Min, Tushar Khot, Ashish
Sabharwal, Oyvind Tafjord, Peter Clark, and Han-
naneh Hajishirzi. 2020. Unifiedqa: Crossing for-
mat boundaries with a single qa system. In Find-
ings of the Association for Computational Linguis-
tics: EMNLP 2020, pages 1896–1907.
Klaus Krippendorff. 2007. Computing krippendorff’s
alpha-reliability. annenberg school for communica-
tion departmental paper 43.
Andrew K Lampinen, Ishita Dasgupta, Stephanie CY
Chan, Kory Matthewson, Michael Henry Tessler,
Antonia Creswell, James L McClelland, Jane X
Wang, and Felix Hill. 2022. Can language models
learn from explanations in context? arXiv preprint
arXiv:2204.02329.
Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Pe-
ter West, Ronan Le Bras, Yejin Choi, and Hannaneh
Hajishirzi. 2021. Generated knowledge prompt-
ing for commonsense reasoning. arXiv preprint
arXiv:2110.08387.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized bert pretraining ap-
proach. arXiv preprint arXiv:1907.11692.
Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula,
and Yejin Choi. 2021. Unicorn on rainbow: A uni-
versal commonsense reasoning model on a new mul-
titask benchmark. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 35, pages
13480–13488.
Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras,
Chandra Bhagavatula, and Yejin Choi. 2021a. Neu-
rologic decoding:(un) supervised neural text genera-
tion with predicate logic constraints. In Proceedings
of the 2021 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, pages 4288–4299.
Yao Lu, Max Bartolo, Alastair Moore, Sebastian
Riedel, and Pontus Stenetorp. 2021b. Fantastically
ordered prompts and where to find them: Overcom-
ing few-shot prompt order sensitivity. arXiv preprint
arXiv:2104.08786.
Pasquale Minervini and Sebastian Riedel. 2018. Ad-
versarially regularising neural nli models to inte-
grate logical background knowledge. arXiv preprint
arXiv:1808.08609.
António Morgado, Carmine Dodaro, and Joao
Marques-Silva. 2014. Core-guided maxsat with soft
cardinality constraints. In International Conference
on Principles and Practice of Constraint Program-
ming, pages 564–573. Springer.
Sharan Narang, Colin Raffel, Katherine Lee, Adam
Roberts, Noah Fiedel, and Karishma Malkan. 2020.
Wt5?! training text-to-text models to explain their
predictions. arXiv preprint arXiv:2004.14546.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari,
Henryk Michalewski, Jacob Austin, David Bieber,
David Dohan, Aitor Lewkowycz, Maarten Bosma,
David Luan, et al. 2021a. Show your work: Scratch-
pads for intermediate computation with language
models. arXiv preprint arXiv:2112.00114.
Maxwell Nye, Michael Tessler, Josh Tenenbaum, and
Brenden M Lake. 2021b. Improving coherence and
consistency in neural sequence models with dual-
system, neuro-symbolic reasoning. Advances in
Neural Information Processing Systems, 34.
Yasumasa Onoe, Michael J.Q. Zhang, Eunsol Choi, and
Greg Durrett. 2021. Creak: A dataset for common-
sense reasoning over entity knowledge. OpenRe-
view.
Charles Sanders Peirce. 1974. Collected papers of
charles sanders peirce, volume 5. Harvard Univer-
sity Press.
Colin Raffel, Noam Shazeer, Adam Roberts, Kather-
ine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. 2020. Exploring
the limits of transfer learning with a unified text-to-
text transformer.Journal of Machine Learning Re-
search, 21(140):1–67.
Nazneen Fatema Rajani, Bryan McCann, Caiming
Xiong, and Richard Socher. 2019. Explain your-
self! leveraging language models for commonsense
reasoning. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
pages 4932–4942.
Vered Shwartz, Peter West, Ronan Le Bras, Chandra
Bhagavatula, and Yejin Choi. 2020. Unsupervised
commonsense question answering with self-talk. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 4615–4629.
Shikhar Singh, Nuan Wen, Yu Hou, Pegah Alipoormo-
labashi, Te-lin Wu, Xuezhe Ma, and Nanyun Peng.
2021. COM2SENSE: A commonsense reasoning
benchmark with complementary sentences. In Find-
ings of the Association for Computational Linguis-
tics: ACL-IJCNLP 2021, pages 883–898, Online.
Association for Computational Linguistics.
Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bha-
gavatula, Yoav Goldberg, Yejin Choi, and Jonathan
Berant. 2021. CommonsenseQA 2.0: Exposing the
limits of AI through gamification. In Thirty-fifth
Conference on Neural Information Processing Sys-
tems.
Gregory Vlastos. 1991. Socrates, ironist and moral
philosopher, volume 50. Cornell University Press.
Haohan Wang, Da Sun, and Eric P Xing. 2019. What if
we simply swap the two text fragments? a straight-
forward yet effective way to test the robustness of
methods to confounding signals in nature language
inference tasks. In Proceedings of the AAAI Confer-
ence on Artificial Intelligence, pages 7136–7143.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le,
Ed Chi, and Denny Zhou. 2022. Self-consistency
improves chain of thought reasoning in language
models. arXiv preprint arXiv:2203.11171.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022.
Chain of thought prompting elicits reasoning in large
language models. arXiv preprint arXiv:2201.11903.
Sean Welleck, Ilia Kulikov, Jaedeok Kim,
Richard Yuanzhe Pang, and Kyunghyun Cho.
2020. Consistency of a recurrent language model
with respect to incomplete decoding. In Proceed-
ings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages
5553–5568.
Peter West, Chandra Bhagavatula, Jack Hessel, Jena D
Hwang, Liwei Jiang, Ronan Le Bras, Ximing
Lu, Sean Welleck, and Yejin Choi. 2021. Sym-
bolic knowledge distillation: from general language
models to commonsense models. arXiv preprint
arXiv:2110.07178.
Sarah Wiegreffe and Ana Marasović. 2021. Teach me
to explain: A review of datasets for explainable nlp.
arXiv preprint arXiv:2102.12060.
Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen-
tence understanding through inference. In Proceed-
ings of the 2018 Conference of the North American
Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, Volume 1
(Long Papers), pages 1112–1122.
Xi Ye and Greg Durrett. 2022. The unreliability of
explanations in few-shot in-context learning. arXiv
preprint arxiv:2205.03401.
Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and
Sameer Singh. 2021. Calibrate before use: Improv-
ing few-shot performance of language models. In In-
ternational Conference on Machine Learning, pages
12697–12706. PMLR.
A Tree Generation Algorithm
Algorithm 1 Maieutic tree generation
Input: Question Q, Max tree depth D
Output: Maieutic tree T
T ← init(Q)                              // initialize the tree with Q; S_0 = {Q}
for d ∈ {1, · · · , D} do                 // generate nodes depth by depth
    S_d ← ∅
    for E ∈ S_{d-1} do
        if integral(E) = 0 then           // expand only non-integral nodes (Section 3.1.3)
            S_d ← S_d ∪ abductive(E)
        end if
    end for
    T.add(S_d)
end for
V ← {E ∈ leaf(T) : integral(E) = 0}       // set of non-integral leaf nodes
while V ≠ ∅ do
    T.remove(V)                           // prune the non-integral leaf nodes
    V ← {E ∈ leaf(T) : integral(E) = 0}
end while
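For readers who prefer code, the following is a rough Python transcription of Algorithm 1 (a sketch, not the authors' implementation); is_integral and gen_abductive stand for the hypothetical LM-backed helpers sketched in Section 3.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    text: str                        # the proposition (Q or a generated explanation)
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def generate_maieutic_tree(q: str, max_depth: int,
                           is_integral: Callable[[str], bool],
                           gen_abductive: Callable[[str], Tuple[str, str]]) -> Node:
    """Expand non-integral nodes up to max_depth, then repeatedly prune
    leaves that are still not logically integral (Algorithm 1)."""
    root = Node(q)
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            if not is_integral(node.text):        # stop expanding integral nodes
                e_true, e_false = gen_abductive(node.text)
                node.children = [Node(e_true, node), Node(e_false, node)]
                next_frontier.extend(node.children)
        frontier = next_frontier

    def leaves(n: Node) -> List[Node]:
        return [n] if not n.children else [l for c in n.children for l in leaves(c)]

    while True:
        bad = [l for l in leaves(root)
               if l is not root and not is_integral(l.text)]
        if not bad:
            break
        for leaf in bad:
            leaf.parent.children.remove(leaf)
    return root
```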
B Inference Examples
[Figure 7 (diagram): maieutic tree for Q "If you travel west far enough from the west coast, you will reach the east coast.", with generated explanations (e.g. "The world is round and if you continue to travel in a straight line, you will eventually reach the other side.", "The Earth is round."), their integral / truth-value labels, and the MAX-SAT assignment. Inferred answer: True; ground truth: True.]
Figure 7: Example of correct inference by MAIEUTIC PROMPTING. We show the generated maieutic tree along with the truth value assigned to each proposition.
[Figure 8 (diagram): maieutic tree for Q "The Earth is a planet that is made primarily of air and helium.", with generated explanations (e.g. "The earth is composed of gas, rocks and metal.", "Air only makes up a small fraction of the Earth's mass."), their integral / truth-value labels, and the MAX-SAT assignment. Inferred answer: False; ground truth: False.]
Figure 8: (continued) Example of correct inference by MAIEUTIC PROMPTING.
[Figure 9 (diagram): maieutic tree for Q "Everyone is capable of moving lower in the food chain.", with generated explanations (e.g. "Every living being is capable of getting energy from lower in the food chain.", "Some people are not able to digest complex proteins, and therefore, need to eat food that is lower in the food chain."), their integral / truth-value labels, and the MAX-SAT assignment. Inferred answer: True; ground truth: False.]
Figure 9: Example of incorrect inference by MAIEUTIC PROMPTING.
[Figure 10 (diagram): maieutic tree for Q "A city will always have transient traffic.", with generated explanations (e.g. "People in a city are always coming and going.", "A city will have both transient and non-transient traffic."), their integral / truth-value labels, and the MAX-SAT assignment. Inferred answer: False; ground truth: True.]
Figure 10: (continued) Example of incorrect inference by MAIEUTIC PROMPTING.