Deductive Verification of Chain-of-Thought Reasoning
Zhan Ling1∗ Yunhao Fang1∗ Xuanlin Li1 Zhiao Huang1 Mingu Lee2 Roland Memisevic2 Hao Su1
1UC San Diego, 2Qualcomm AI Research†
Abstract
Large Language Models (LLMs) significantly benefit from Chain-of-Thought (CoT)
prompting in performing various reasoning tasks. While CoT allows models
to produce more comprehensive reasoning processes, its emphasis on intermediate
reasoning steps can inadvertently introduce hallucinations and accumulated errors,
thereby limiting models' ability to solve complex reasoning tasks. Inspired by how
humans engage in careful and meticulous deductive logical reasoning processes
to solve tasks, we seek to enable language models to perform explicit and rigorous
deductive reasoning, and also to ensure the trustworthiness of their reasoning
process through self-verification. However, directly verifying the validity of an
entire deductive reasoning process is challenging, even with advanced models
like ChatGPT. In light of this, we propose to decompose a reasoning verification
process into a series of step-by-step subprocesses, each receiving only its necessary
context and premises. To facilitate this procedure, we propose Natural
Program, a natural language-based deductive reasoning format. Our approach
enables models to generate precise reasoning steps where subsequent steps are
more rigorously grounded on prior steps. It also empowers language models to
carry out reasoning self-verification in a step-by-step manner. By integrating this
verification process into each deductive reasoning stage, we significantly enhance
the rigor and trustworthiness of generated reasoning steps. Along the way, we also
improve answer correctness on complex reasoning tasks. Code will be released
at https://github.com/lz1oceani/verify_cot.
1 Introduction
The transformative power of large language models, enhanced by Chain-of-Thought (CoT)
prompting [47, 19, 55, 39], has significantly reshaped the landscape of information processing
[13, 24, 46, 52, 12, 51, 21, 27], fostering enhanced abilities across a myriad of disciplines
and sectors. While CoT allows models to produce more comprehensive reasoning processes, its
emphasis on intermediate reasoning steps can inadvertently introduce hallucinations [4, 28, 14, 18]
and accumulated errors [4, 48, 1], thereby limiting models' ability to produce cogent reasoning
processes.
In fact, the pursuit of reliable reasoning is not a contemporary novelty; indeed, it is an intellectual
endeavor that traces its roots back to Aristotle's ancient Greece. Motivated by the desire
to establish a rigorous reasoning process, Aristotle introduced in his "Organon" principles of logic,
in particular syllogism, a form of logical argument that applies deductive reasoning to arrive at
a conclusion based on two or more propositions assumed to be true. In disciplines where rigorous
reasoning is critical, such as judicial reasoning and mathematical problem solving, documents must be
written in a formal language with a logical structure to ensure the validity of the reasoning process.
∗Equal contribution
†Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Preprint. Under review.
arXiv:2306.03872v1 [cs.CL] 6 Jun 2023
[Figure 1 example: a GSM8K question ("There are 53 maple trees currently in the park. Park workers will plant maple trees today. When the workers are finished there will be 64 maple trees in the park. How many maple trees did the workers plant today?"); its question-related premises #1-#4; reasoning step "#5. (by #2 #1) Step 1: Calculate the number of maple trees the workers will plant. Number of maple trees the workers will plant: 64 - 53 = 11"; final step "#6. (by #4 #5) Final Step: ... the answer to this question is the workers planted 11 maple trees today"; and the single-step verification prompt ending with "Double-check the reasoning process, let's analyze its correctness, and end with "yes" or "no"."]
Figure 1: An overview of our proposed deductive reasoning and verification process. In response
to an input question, LLMs generate deductive reasoning chains using the Natural Program format
(bottom 3 boxes), a natural language-based deductive reasoning approach. The Natural Program
format allows individual reasoning steps (an example in purple) and their corresponding minimal
set of premises (an example in orange) to be easily extracted. This streamlined extraction process
facilitates the step-by-step decomposition and verification of deductive reasoning (top-right box).
[Figure 2 example: a GSM8K question about Avery buying a 3-piece place setting (dinner plate $6.00, bowl $5.00, salad plate $4.00) for 12 people, with three sampled sets of question-related premises and reasoning chains. Two chains contain a grounding error or a reasoning error (e.g., computing "Total number of place settings: 12 * 3 = 36" and answering $540.00), while the correct chain computes 12 place settings needed and answers $180.00.]
Figure 2: Through our Natural Program-based deductive reasoning verification approach, we identify
and eliminate reasoning chains that contain errors in reasoning and grounding (we define a grounding
error as utilizing information that is not present in the cited premises). By alleviating such errors,
we significantly enhance the rigor, trustworthiness, and interpretability of the generated reasoning
outputs.
We yearn for this sequence of reliable knowledge when answering questions. Our goal is to develop
language models that can propose potential solutions through reasoning in logical structures. Simultaneously,
we aim to establish a verifier capable of accurately assessing the validity of these
reasoning processes. Despite recent significant explorations in the field, such as [45]'s emphasis
on self-consistency and [25, 5]'s innovative use of code to represent the reasoning process, these
approaches still exhibit considerable limitations. For example, consistency and reliability are not
inherently correlated; as for program code, it is not powerful enough to represent many kinds
of reasoning processes, e.g., in the presence of quantifiers ("for all", "there exists") or nuances of
natural language (moral reasoning, "likely", ...).
We propose leveraging the power of natural language to achieve the deductive reasoning emphasized in
ancient Greek logic, introducing a "natural program". This involves retaining natural language for its
inherent power and avoiding the need for extensive retraining with large datasets. A natural program
represents a rigorous reasoning sequence, akin to a computer program. We expect implementations
of the idea to have two properties: 1) that natural programs are generated with minimal effort from an
existing language model capable of CoT reasoning, preferably through in-context learning; 2) that
the natural program can be easily verified for reliability in the reasoning process.
Through a step-by-step investigation, we discovered that large language models have the potential
to meet our expectations. Naïve CoT prompts like "Let us think step by step." have many flaws, and
entrusting the entire verification process to a large model like ChatGPT can still lead to significant
error rates. However, we found that if the reasoning process is very short and based only on
necessary premises and contexts, the verification by existing large language models is already quite
reliable. Therefore, our approach is to design prompts that induce CoT processes comprised of
rigorous premises/conditions and conclusions with statement labels, so that verification can be done by
gradually isolating very few statements within the long thought chain. Experimentally, we found that
most reasoning that passed the verification was rigorous, and nearly all that did not pass had elements
of imprecision in the reasoning process, even if they occasionally arrived at correct answers.
It is worth emphasizing that we are not looking for a method to merely maximize the correctness rate
of final answers; instead, we aspire to generate a cogent reasoning process, which is more aligned
with the spirit of judicial reasoning. When combined with sampling-based methods, our method can
identify low-probability but rigorous reasoning processes. When repeated sampling fails to yield a
rigorous reasoning process, we can output "unknown" to prevent hallucinations that mislead users.
We demonstrate the efficacy of our Natural Program-based verification approach across a range
of arithmetic and commonsense datasets on publicly available models like OpenAI's GPT-3.5-turbo
(175B). Our key contributions are as follows:
1. We propose a novel framework for rigorous deductive reasoning by introducing a "Natural
Program" format (Fig. 1), which is suitable for verification and can be generated through just in-context
learning;
2. We show that reliable self-verification of long deductive reasoning processes written in our Natural
Program format can be achieved through step-by-step subprocesses that only cover the necessary context
and premises;
3. Experimentally, we demonstrate the superiority of our framework in improving the rigor, trustworthiness,
and interpretability of LLM-generated reasoning steps and answers (Fig. 2).
2 Related work
Reasoning with large language models. Recent large language models (LLMs) [3, 8, 53, 44, 35,
16, 9, 34] have shown incredible ability in solving complex reasoning tasks. Instead of letting LLMs
directly generate final answers as output, prior work has shown that by encouraging step-by-step
reasoning through proper prompting, such as Chain-of-Thought (CoT) prompting [47] and many
others [19, 55, 54, 41, 45, 56, 23, 50], LLMs exhibit significantly better performance across diverse
reasoning tasks. To further improve the step-by-step reasoning process, some recent studies have
investigated leveraging external solvers such as program interpreters [36, 5, 25], training and calling
external reasoning modules [11], or performing explicit search to generate deductive steps [2, 43].
In parallel to these works, we do not rely on external modules and algorithms; we directly leverage
the in-context learning ability of LLMs to generate more precise and rigorous deductive reasoning
steps.
Large language models as verifiers. Using language models to evaluate model generations has been
a long-standing idea [20, 33, 37, 4]. As LLMs exhibit impressive capabilities across diverse tasks, it
becomes natural to use LLMs as evaluation and verification tools. For example, [10, 11, 31]
fine-tune LLMs to verify solutions and intermediate steps. LLMs aligned with RLHF [30, 29, 45]
have also been employed to compare different model generations. In addition, recent works like
[40, 49, 26, 6] leverage prompt designs to allow LLMs to self-verify, self-refine, and self-debug
without the need for fine-tuning. However, these works do not focus on the rigor and trustworthiness
Question: Jerry is rolling a six-sided die. How much more likely is it (expressed as a percentage)
that he rolls a number greater than 3 than that he rolls two even numbers in a row?
Ground Truth: There are 3 numbers greater than 3 on the die, so the chances of rolling one
of them are 3/6 = 50%. The chance of rolling one even number is 50%, so the chance of rolling
two in a row is 50% * 50% = 25%. The difference between these two probabilities is 50% - 25%
= 25%. #### 25
Output: There are two possible outcomes for rolling a number greater than 3 out of 6 total
outcomes, so the probability is 2/6 or 1/3. For rolling two even numbers in a row, there are
three possible outcomes (2-4, 4-6, 6-2) out of 36 total outcomes, so the probability is 3/36 or
1/12. To find the difference in probability, we subtract the probability of rolling two even
numbers in a row from the probability of rolling a number greater than 3: 1/3 - 1/12 = 1/4.
This means that it is 25% more likely to roll a number greater than 3 than to roll two even
numbers in a row.
Table 1: An example question from GSM8K with a generated CoT reasoning path from GPT-3.5-turbo,
where the output provides a wrong reasoning chain with the correct answer.
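As a sanity check on the arithmetic in Table 1's ground truth, the two probabilities and their difference can be computed exactly with Python's `fractions` module:

```python
from fractions import Fraction

# Probability of rolling a number greater than 3 on a six-sided die: {4, 5, 6}.
p_gt3 = Fraction(3, 6)
# Probability of rolling two even numbers in a row: (3/6) * (3/6).
p_two_even = Fraction(3, 6) * Fraction(3, 6)

difference = p_gt3 - p_two_even
print(difference)         # 1/4
print(float(difference))  # 0.25, i.e. 25 percentage points
```

Note that the model's output in Table 1 reaches 1/4 too, but only by coincidence of its two miscounted probabilities.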
of the deductive reasoning processes at every reasoning step. In this work, we propose a natural
language-based deductive reasoning format that allows LLMs to self-verify every intermediate step
of a deductive reasoning process, thereby improving the rigor and trustworthiness of reasoning.
3 Motivation and Problem Formulation
A reasoning-based question-answering (QA) task can be defined as a tuple (Q, C, O, A) [32], where
Q is the target question; C is the context of the question, such as the necessary background for answering it;
O = (o1, o2, · · · , oK) are optional answer choices if Q is a K-way multiple-choice problem;
and A is the ground-truth answer. Given Q and C as inputs, large language models (LLMs) [3, 8, 44]
generate a sequence of tokens T = (t1, t2, · · · , tn) to answer the question. Recent works like
Chain-of-Thought (CoT) [47, 19] leverage prompt engineering in the context C to encourage models to
generate the intermediate reasoning process in T, which benefits LLM performance across diverse
reasoning tasks. In this case, T consists of a set of m intermediate reasoning steps, which we denote
as S = (s1, s2, · · · , sm). Each step si can be represented by a subsequence of the generated tokens
(tli, · · · , tri) ⊆ T. In much prior work, a generated solution is "correct" if and only if the predicted final
answer in sm matches the ground truth A, which we call answer correct(ness).
We observe that for all cases where LLMs produce erroneous final answers, there exists at least one
mistake among the intermediate reasoning steps S. Moreover, even when the final answer is correct,
there might still exist some mistakes among S. This phenomenon, as illustrated in Tab. 1, occurs for
all LLMs we tested, including state-of-the-art models such as ChatGPT and GPT-4 [30]. Since later
reasoning steps are conditioned on prior reasoning steps, these mistakes often initiate a snowball
effect, causing subsequent mistakes to compound. This significantly diminishes the likelihood of
correct problem solving and impedes progress towards achieving human-level complex reasoning.
Therefore, in this work, we place significant emphasis on ensuring the validity of every reasoning
step, not just the correctness of the final answer. In particular, we focus on the validity of deductive
reasoning, an essential component of a logical reasoning process. In deductive reasoning, we are
given a (premise, conclusion) pair, and we are interested in determining whether the conclusion
follows from the premises. In the context of reasoning-based QA tasks, for each reasoning step si,
we define its deductive validity V(si) as a binary variable. A reasoning step is deductively valid
(V(si) = 1) if and only if si can be logically deduced from its corresponding premises pi, which
consist of the context C, the question Q, and all the previous reasoning steps sj (j < i). Then, we can
also define the deductive validity of the entire reasoning chain S as V(S) = ∧_{i=1}^m V(si). Compared
to evaluating answer correctness, which can be accomplished by simple functions such as exact string
match, evaluating deductive validity is a lot more challenging. Thanks to recent progress on
LLMs, which demonstrate impressive in-context learning capabilities across diverse scenarios, we
propose to use LLMs to examine reasoning chains and predict deductive validity.
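These definitions are straightforward to operationalize once per-step validity judgments are available. A minimal sketch, where the list of step validities stands in for LLM-predicted values of V(si):

```python
def chain_validity(step_validities):
    """V(S) = AND over i of V(s_i): a chain is deductively valid
    only if every intermediate step is valid."""
    return all(step_validities)

def answer_correct(predicted, ground_truth):
    """Answer correctness: exact string match on the final answer."""
    return predicted.strip() == ground_truth.strip()

# A chain with one invalid step is invalid even if its final answer matches.
print(chain_validity([True, True, False]))  # False
print(answer_correct("25", "25"))           # True
```

This makes the gap explicit: `answer_correct` can be True while `chain_validity` is False, as in Table 1.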
Prompting    Reasoning Correctness   GSM8K   AQuA   MATH   AddSub   Date   Last Letters
Zero-shot    Correct                 0.98    0.96   1.00   0.98     0.98   1.00
             Wrong                   0.04    0.06   0.04   0.02     0.04   0.04
             (Average)               0.51    0.51   0.52   0.50     0.51   0.52
Two-shot     Correct                 0.98    0.96   1.00   0.92     1.00   0.96
             Wrong                   0.02    0.04   0.00   0.06     0.26   0.06
             (Average)               0.50    0.50   0.50   0.49     0.63   0.51

Table 2: Zero-shot and two-shot reasoning chain verification accuracy for GPT-3.5-turbo (ChatGPT),
where an entire reasoning chain is verified at once. The two-shot prompt we used is presented in
Appendix C.1. To generate verification inputs, for each dataset, we perform Chain-of-Thought (CoT)
prompting and randomly sample 50 reasoning chains that are valid and 50 reasoning chains that
exhibit mistakes. We observe that when given an entire reasoning process, where the deductive graphs
for all reasoning steps are entangled together, it is challenging even for strong language models like
ChatGPT to verify its validity.
4 Deductively Verifiable Chain-of-Thought Reasoning
In this section, we introduce our specific approaches to performing deductive verification of reasoning
chains. Specifically, we first introduce our motivation and method for decomposing a deductive verification
process into a series of step-by-step processes, each receiving only the contexts and premises that
are necessary. Then, we propose Natural Program, a natural language-based deductive reasoning
format, to facilitate local step-by-step verification. Finally, we show that by integrating deductive
verification with unanimity-plurality voting, we can improve the trustworthiness of reasoning processes
along with final answers. An overview of our approach is illustrated in Fig. 1 and Fig. 2.
4.1 Decomposition of Deductive Verification Process
Given a reasoning chain S = (s1, s2, · · · , sm), a straightforward idea for verifying its deductive validity is
to ask LLMs to examine the entire reasoning chain at once. To assess the effectiveness of this approach,
we conduct a preliminary experiment: for a dataset problem and its reasoning chain S generated
by ChatGPT, we prompt ChatGPT with "Do you think the above reasoning process is
correct? Let's think step by step" so that it outputs whether there exists any mistake
among the reasoning steps in S. However, as demonstrated in Tab. 2, the verification accuracy is around 50%
for most datasets, and ChatGPT struggles to find mistaken reasoning. Notably, it persistently
outputs "Correct" for most reasoning chain queries, regardless of their actual validity.
We conjecture that this phenomenon is caused by the abundance of irrelevant premises for each
reasoning step. Recall that the premises pi for a reasoning step si consist of the question Q, the
question context C, and the prior reasoning steps s_{<i} = {sj : j < i}. For Q and C, we can
further extract and decompose Q ∪ C into a set of "question-related premises"
QC = {qc1, qc2, · · · , qcm}, where qci is a premise or condition inferred from Q ∪ C. Then, it is
often the case that most elements of pi = QC ∪ s_{<i} are irrelevant to the validity of si, leading
to erroneous verifications from language models. A very recent work [38] also observes a similar
phenomenon where LLMs are easily distracted by irrelevant context.
Hence, we propose decomposing the reasoning chain verification process into a series of step-by-step
processes, where each step only considers the premises that are necessary. The overall validity
of the reasoning chain, denoted as V(S) = ∧_{i=1}^m V(si), can be naturally decomposed into the individual
step validities V(si). However, achieving such a decomposition is highly challenging without imposing
constraints on the format of reasoning chains. Additionally, for each si ∈ S, we aim to ensure that
it explicitly lists the minimal subset of premises p̄i ⊆ pi required for deductive reasoning, to avoid
potential ambiguities during verification. This motivates us to introduce a natural language-based
deductive reasoning format in Section 4.2.
4.2 Natural Program Deductive Reasoning Format
As previously mentioned in Sec. 4.1, we desire LLMs to output deductive reasoning processes that
can be easily verified by the models themselves, specifically by listing out the minimal set of necessary
premises p̄i at each reasoning step si. To accomplish this goal, we propose to leverage the power of natural
language, which is capable of rigorously representing a large variety of reasoning processes and can
be generated with minimal effort. In particular, we introduce Natural Program, a novel deductive
reasoning format for LLMs. More formally, Natural Program consists of the following components:
• An instruction for models to extract the question-related premises QC. We use the following
instruction: "First, let's write down all the statements and relationships
in the question with labels".
• A numbered list of question-related premises, each prefixed with "#{premise_number}".
• An instruction for models to generate the reasoning chain S based on the question-related
premises QC. We use the following instruction: "Next, let's answer the question
step by step with reference to the question and reasoning process".
• A list of prefixed reasoning steps si. The prefix has the following format:
#{number} (by {list_of_premises_used}). Here "number" equals |QC| + i, and
"list_of_premises_used" consists of the numbers of the smallest subset of premises among
QC ∪ s_{<i} that are used for the deductive reasoning of si. In addition, for the last reasoning
step sm, we ensure that it (1) includes a special tag "Final Step"; (2) refers to the premise
number of the target question to be answered; and (3) explicitly gives the final answer to the
question.
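Because every premise and reasoning step carries a "#{number}" label and each step cites its premises with "(by ...)", the minimal premise set of a step can be recovered mechanically. A sketch of such a parser, assuming the format above (the regular expression is our illustration, not the authors' released code):

```python
import re

# Matches a Natural Program step like "#5. (by #2 #1) Step 1: ...".
STEP_RE = re.compile(r"#(\d+)\.\s*\(by\s+((?:#\d+\s*)+)\)\s*(.*)", re.DOTALL)

def parse_step(step_text):
    """Extract (step_number, cited_premise_numbers, body) from one
    Natural Program reasoning step, or None if it does not match."""
    m = STEP_RE.match(step_text.strip())
    if m is None:
        return None
    number = int(m.group(1))
    cited = [int(x) for x in re.findall(r"#(\d+)", m.group(2))]
    return number, cited, m.group(3).strip()

step = "#5. (by #2 #1) Step 1: Number of maple trees: 64 - 53 = 11"
print(parse_step(step))  # (5, [2, 1], 'Step 1: Number of maple trees: 64 - 53 = 11')
```

The cited numbers give exactly the minimal subset p̄i that a single-step verification query needs.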
To encourage language models to reason in the Natural Program format, we have designed one-shot
prompts for different datasets, which are shown in Appendix C.2. Given that an LLM's reasoning
outputs follow the Natural Program format, we can then verify the deductive validity
of a single reasoning step si through an instruction that consists of (1) the full descriptions of the
premises used for the reasoning of si; (2) the full description of si; and (3) an instruction for validity
verification, such as "Double-check the reasoning process, let's analyze its
correctness, and end with "yes" or "no"." Note that throughout this verification process,
we only retain the minimal necessary premises and context for si, thereby avoiding distraction by
irrelevant context and significantly improving the effectiveness of validation. Additionally, we
employ a one-shot prompt for this verification process, which we find very helpful for improving the
verification accuracy. The prompt is shown in Appendix C.3.
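Concretely, a single-step verification query can be assembled from just the cited premises and the step body. A hypothetical sketch, where `premises` maps label numbers to their full text (a structure we assume for illustration) and the template follows the instruction quoted above:

```python
def build_verification_prompt(premises, cited_numbers, step_body):
    """Assemble a single-step verification query containing only the
    minimal necessary premises, as described in the text."""
    context = " ".join(premises[n] for n in sorted(cited_numbers))
    return (
        f'Here is some information: "{context}"\n'
        f'Based on the given information, here is a reasoning process: '
        f'"{step_body}"\n'
        'Double-check the reasoning process, let\'s analyze its '
        'correctness, and end with "yes" or "no".'
    )

premises = {
    1: "There are 53 maple trees currently in the park.",
    2: "After the workers plant trees, there will be 64 maple trees in the park.",
}
print(build_verification_prompt(
    premises, [2, 1],
    "Number of maple trees the workers will plant: 64 - 53 = 11"))
```

Feeding only this minimal context to the LLM is what avoids the irrelevant-premise distraction discussed in Sec. 4.1.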
Figure 1 provides an overview of the complete Natural Program-based deductive reasoning and
verification process. By using the Natural Program approach, we demonstrate that LLMs are
capable of performing explicit, rigorous, and coherent deductive reasoning. Furthermore, Natural
Program enables LLMs to self-verify their reasoning processes more effectively, enhancing the
reliability and trustworthiness of the generated responses.
4.3 Integrating Deductive Verification with Unanimity-Plurality Voting
Given that we can effectively verify a deductive reasoning process, we can naturally integrate
verification with LLM sequence generation strategies to enhance the trustworthiness of both the
intermediate reasoning steps and the final answers. In this work, we propose Unanimity-Plurality
Voting, a two-phase sequence generation strategy described as follows. First, similar to prior work
like [45], we sample k reasoning chain candidates along with their final answers. In the unanimity
phase, we perform deductive validation on each reasoning chain. Recall that a chain S is valid (i.e.,
V(S) = 1) if and only if all of its intermediate reasoning steps are valid (i.e., ∀i, V(si) = 1). For
each intermediate reasoning step si, we perform majority voting over k' sampled single-step validity
predictions to determine its final validity V(si). We then retain only the verified chain candidates
{S : V(S) = 1}. In the plurality phase, we conduct a majority-based vote among the verified
chain candidates to determine the final answer. This voting process ensures that the final answer is
selected based on a consensus among the trustworthy reasoning chains.
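The two-phase procedure can be sketched as follows; here `step_votes` stands in for the k' sampled single-step validity predictions, which in the actual pipeline come from LLM verification calls:

```python
from collections import Counter

def unanimity_plurality_vote(chains, step_votes, answers):
    """Unanimity-Plurality Voting.  chains[i] is a list of step ids;
    step_votes[step_id] holds k' sampled boolean validity predictions
    for that step; answers[i] is chain i's final answer.  Returns the
    plurality answer among verified chains, or None ("unknown") if no
    chain survives verification."""
    def step_valid(step_id):
        votes = step_votes[step_id]
        return sum(votes) * 2 > len(votes)  # majority over k' samples

    # Unanimity phase: a chain survives only if every step is valid.
    survivors = [answers[i] for i, chain in enumerate(chains)
                 if all(step_valid(s) for s in chain)]
    if not survivors:
        return None
    # Plurality phase: majority vote over surviving chains' answers.
    return Counter(survivors).most_common(1)[0][0]

# k = 3 sampled chains; the second fails step verification.
chains = [["a1", "a2"], ["b1"], ["c1"]]
step_votes = {"a1": [True, True, True], "a2": [True, False, True],
              "b1": [False, False, True], "c1": [True, True, True]}
print(unanimity_plurality_vote(chains, step_votes, ["11", "12", "11"]))  # 11
```

Returning None when no chain survives corresponds to outputting "unknown" rather than risking a hallucinated answer, as discussed in the introduction.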
5 Experiments
In this section, we perform evaluations to demonstrate the effectiveness of our Natural Program-based
deductive reasoning verification approach over diverse reasoning datasets. Firstly, we show that
our deductive verification process leads to substantial improvements in the rigor and reliability of
Verification Method          Reasoning Correctness   GSM8K   AQuA   MATH   AddSub   Date   Last Letters   Overall
CoT (Two-shot)               Correct                 98%     96%    100%   92%      100%   96%            97%
                             Wrong                   2%      4%     0%     6%       26%    6%             7%
                             (Average)               50%     50%    50%    49%      63%    51%            52%
Natural Program (One-shot)   Correct                 84%     72%    70%    95%      90%    96%            85%
                             Wrong                   84%     62%    76%    40%      56%    6%             54%
                             (Average)               84%     67%    73%    68%      73%    51%            69%

Table 3: Comparison of deductive verification accuracy of reasoning chains for GPT-3.5-turbo (ChatGPT).
We compare two approaches: (1) verifying entire reasoning chains generated by Chain-of-Thought
prompting; (2) verifying reasoning chains generated in the Natural Program format with
step-by-step decomposition. In the latter case, when we verify each reasoning step si, we only keep
the necessary subset of premises p̄i ⊆ pi. To calculate verification accuracy, for each dataset, we
randomly sample 50 reasoning chains that are deductively valid and 50 reasoning chains exhibiting
incorrect reasoning.
                                               Arithmetic                          Commonsense
Methods                                        GSM8K    AQuA     MATH*    AddSub   Date     Last Letters
CoT + Voting                                   87.62%   53.18%   35.93%   92.36%   69.97%   81.60%
Faithful CoT + Voting                          75.80%   53.50%   31.78%¹  88.35%¹  73.50%   -
Ours (Natural Program (NP), No Verification)   87.05%   70.34%   36.75%   93.67%   72.49%   92.98%
Ours (NP + Deductive Verification + UPV)       86.01%   69.49%   36.48%   93.54%   71.45%   92.60%

Table 4: Final answer accuracy comparison on GPT-3.5-turbo (ChatGPT). All approaches generate
k = 10 reasoning chains for each problem before performing majority voting or reasoning chain
filtering with our deductive verification approach.
reasoning chains. Subsequently, we examine the impact of deductive verification on the accuracy
of final answers. Our findings reveal that by adopting our Natural Program reasoning format without
verification, we improve answer correctness on challenging benchmarks. Further applying deductive
verification leads to slight reductions in final answer accuracy. One reason for this phenomenon is
that the verification process effectively identifies and eliminates flawed reasoning chains that still
produce correct answers.
5.1 Experimental Setup
Benchmarks. We evaluate the deductive verification accuracy and the answer correctness of reasoning
chains over a diverse set of reasoning tasks: arithmetic reasoning, symbol manipulation, and date
understanding. For arithmetic reasoning, we utilize the following benchmarks: 1) AddSub [17];
2) GSM8K [10]; 3) MATH [15]; 4) AQuA [22]. Among these benchmarks, the AddSub and GSM8K
datasets involve middle school-level multi-step calculations to arrive at a single number as the final
answer. The MATH dataset presents more challenging problems that require expressing the answer
as a mathematical expression in LaTeX format. These problems involve concepts from linear algebra,
algebra, geometry, calculus, statistics, and number theory. AQuA also features similarly challenging
problems, except that questions are in a multiple-choice format. For symbol manipulation, we use
Last Letter Concatenation [47], where the model is tasked with concatenating the last letters of all the
words provided in the question. For date understanding, we use the corresponding task from BIG-bench [42].
Deductive verification evaluation setup. For each of the above benchmarks, we select 100 reasoning
chains, where 50 of them are deductively valid and 50 of them exhibit reasoning mistakes. The
ground-truth deductive validity of each reasoning chain is determined by human annotators.
¹ Most results for Faithful CoT are from their official repository https://github.com/veronica320/FaithfulCOT,
except MATH and AddSub due to their unavailability. For these two datasets, we use our
implementation and the same prompt as for the math word problems in their paper. The prompt for Last Letters is
not available, so we leave it blank.
Question: In March it rained 0.81 inches. It rained 0.35 inches less in April than in March.
How much did it rain in April?
Ground Truth: It rained 0.81 - 0.35 = 0.46 inches.
GPT-3.5-turbo (ChatGPT):
Question-related premises:
#1. It rained 0.81 inches in March.
#2. It rained 0.35 inches less in April than in March.
#3. How much did it rain in April?
The first reasoning step:
#4. (by #2) Step 1: Calculate how much it rained in April by subtracting the difference of 0.35
inches from the amount in March.
Amount of rain in April: 0.81 - 0.35 = 0.46 inches
Table 5: For this reasoning chain candidate of a GSM8K problem, ChatGPT generates the correct
final answer but provides incorrect premise numbers for grounding the first reasoning step. In "Step
1", the correct premise numbers should be #1 and #2. Our deductive reasoning verification approach
effectively identifies these reasoning errors, enhancing the rigor and trustworthiness of the reasoning
process. It is worth noting that removing a correct answer from the candidate reasoning chains has a
slightly negative impact on the overall final answer correctness.
Answer extraction. To extract answers from reasoning solutions, we first split the text on answer-prefix patterns such as "answer is" or "option is". Then, using problem-type-specific regular expressions, we extract the final answer. To extract the validity results from deductive verification processes, we keep only the last sentence of the model response and extract the validity answer with regular expressions that match attitude words, e.g., "yes" or "no". Sometimes, language models do not provide a direct answer and instead output phrases like "not applicable" at the end of the response. In such cases, we treat the model's answer as "yes". Please refer to Appendix B for more details.
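As an illustration, the validity-extraction step could be sketched as follows; the sentence-splitting heuristic, function name, and default-to-"yes" fallback mirror the description above but are our own assumptions, not the released implementation:

```python
import re

def extract_validity(response: str) -> str:
    """Extract a "yes"/"no" validity verdict from a verification response.

    Only the last sentence is inspected; if no attitude word is found
    (e.g. the model answers "not applicable"), default to "yes".
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    last = sentences[-1] if sentences else ""
    match = re.search(r"\b(yes|no)\b", last, flags=re.IGNORECASE)
    return match.group(1).lower() if match else "yes"
```

For example, a response ending in "Not applicable." contains no standalone attitude word, so the sketch falls back to "yes", matching the behavior described above.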
Model and Hyperparameters. We conduct our main experiments with GPT-3.5-turbo (ChatGPT) [30]. We also present results for the LLaMA model family [44] in Appendix A, where we find the deductive verification accuracy to be worse than that of larger models, even after fine-tuning. For ChatGPT, we use a generation temperature of T = 0.7. For Unanimity-Plurality Voting, we set k = 10 and k′ = 3 by default. We use one-shot prompting for both reasoning chain generation and deductive verification (except for reasoning chain generation on the date understanding task, where we use two-shot). See Appendix C.2 and Appendix C.3 for more details.
5.2 Comparison of Deductive Verification Accuracy
We compare the verification accuracy of reasoning chains using two methods: (1) verifying the entire reasoning chain at once (as described in Section 4.1) without utilizing the Natural Program, and (2) our Natural Program-based verification approach with step-by-step decomposition. The results, presented in Table 3, indicate that our approach achieves significantly higher reasoning verification accuracy across most datasets. It effectively identifies erroneous reasoning in faulty chains while maintaining a low rate of false positives for valid chains. However, we observe that our approach's effectiveness is limited on the "Last Letters" task. We hypothesize that this is due to the task's nature, where each subsequent reasoning step is conditioned on all previous steps, which increases the dependency among premises and makes verification more challenging.
5.3 Impact of Natural Program and Deductive Verification on Final Answer Correctness
We then investigate the impact of our Natural Program reasoning format and our deductive verification process on final answer correctness. We conduct two experiments: (1) for each problem, we instruct language models to generate k = 10 reasoning chain candidates in the Natural Program (NP) format and perform simple majority voting on final answers, without using deductive verification to filter out reasoning chain candidates; (2) we apply our deductive verification approach to filter out invalid reasoning chain candidates, applying Unanimity-Plurality Voting (UPV) along the way to determine the final answer. As a reference, we also report the performance of Chain-of-Thought (CoT) [47] and
Premise Context # Shots Reasoning Correctness GSM8K AQuA MATH AddSub Date Last Letters Average
Full Premises 1
Correct 64% 54% 58% 95% 26% 96% 66%
Wrong 56% 68% 56% 24% 76% 5% 48%
(Average) 60% 61% 57% 60% 51% 51% 57%
Minimal Premises 0
Correct 84% 78% 90% 96% 90% 12% 75%
Wrong 26% 12% 28% 20% 20% 80% 31%
(Average) 55% 45% 59% 58% 55% 46% 53%
Minimal Premises 1
Correct 84% 72% 70% 95% 90% 96% 85%
Wrong 84% 62% 76% 40% 56% 6% 54%
(Average) 84% 67% 73% 68% 73% 51% 69%
Table 6: Ablation study on the impact of (1) premise context and (2) zero-shot vs. few-shot scenarios on deductive verification accuracy using our Natural Program-based approach with step-by-step reasoning chain decomposition. To verify each reasoning step si, we either use the full premises pi = Q ∪ C ∪ S≤j, or use the minimal subset of premises p̄i ⊆ pi necessary, as outlined in Sec. 4.1. The one-shot prompt we used is shown in Appendix C.3. For each dataset, we randomly sample 50 reasoning chains that are deductively valid and 50 reasoning steps exhibiting incorrect reasoning.
Answer Correctness k′ = 1 k′ = 3 k′ = 5 k′ = 10
Correct 86% 90% 90% 92%
Wrong 38% 38% 38% 40%
Table 7: Ablation of different values of k′ on the verification accuracy of reasoning chains using our Unanimity-Plurality Voting strategy. Experiments are performed on AddSub using GPT-3.5-turbo (ChatGPT).
Faithful CoT [25]. For these baselines, we perform simple answer-based majority voting with k = 10 for a fair comparison.
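For reference, answer-based majority voting over k sampled chains can be sketched as follows; this is a minimal illustration, and the abstention handling and tie-breaking are our assumptions:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer among k sampled reasoning chains.

    Abstentions ("No answer!") are ignored unless every chain abstained.
    Ties are broken by first occurrence, since collections.Counter preserves
    insertion order among equally common elements.
    """
    filtered = [a for a in answers if a != "No answer!"]
    if not filtered:
        return "No answer!"
    return Counter(filtered).most_common(1)[0][0]
```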
Results are presented in Tab. 4. While our main goal is to improve the trustworthiness and reliability of deductive reasoning, we find that prompting language models to reason in our Natural Program format achieves final answer accuracy on par with or better than the baselines across many reasoning tasks.
Upon further applying our deductive veriﬁcation approach to ﬁlter out invalid reasoning chains, we
observe a slight decrease in ﬁnal answer accuracy. One major contributing factor to this decrease
is the ﬁltering out of reasoning chain candidates that provide correct answers but exhibit incorrect
reasoning. We illustrate an example in Table 5, where ChatGPT generates the correct ﬁnal answer
but assigns incorrect premise numbers to support the ﬁrst reasoning step. We note that in many such
cases, our approach effectively identiﬁes these reasoning errors, thereby enhancing the rigor and
reliability of the language models’ reasoning processes, albeit with a slight negative impact on the
overall ﬁnal answer correctness.
5.4 Ablation Study
In addition, we perform several ablation studies to gain further insight into the design of our deductive verification approach. In Tab. 6, we compare two different approaches to verifying a single reasoning step si ∈ S following our Natural Program format. The first approach uses all premises pi = Q ∪ C ∪ S≤j for verification regardless of their relevance to si, potentially introducing irrelevant context. The second approach follows our design in Sec. 4.1 and includes only the necessary context and premises p̄i ⊆ pi. We observe that removing irrelevant premises significantly improves reasoning chain verification accuracy on many datasets, highlighting the importance of this technique.
We also ablate our Unanimity-Plurality Voting strategy by investigating the impact of different k′. Recall that k′ determines the number of votes used to produce the validity prediction for a single reasoning step. Results are shown in Tab. 7. We observe that increasing k′ generally enhances reasoning validation accuracy, though at the expense of more compute.
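A minimal sketch of how such a two-level voting scheme could be assembled is given below; the function names, data layout, and the fallback when no chain survives verification are our assumptions, not the paper's exact implementation:

```python
from collections import Counter

def step_is_valid(votes):
    """A single step passes if the majority of its k' validity votes are "yes"."""
    return Counter(votes).most_common(1)[0][0] == "yes"

def unanimity_plurality_vote(chains):
    """chains: list of (final_answer, step_votes), where step_votes holds one
    k'-sized vote list per reasoning step of the chain.

    Unanimity: keep only chains whose every step passes its majority vote.
    Plurality: majority-vote the final answers of the surviving chains.
    """
    valid_answers = [ans for ans, steps in chains
                     if all(step_is_valid(v) for v in steps)]
    if not valid_answers:  # assumption: fall back to all answers if none survive
        valid_answers = [ans for ans, _ in chains]
    return Counter(valid_answers).most_common(1)[0][0]
```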
Question: Melanie had 10 quarters and 17 pennies in her bank. Her dad gave her 27 pennies and her mother gave her 19 pennies. How many pennies does Melanie have now?
Ground Truth: Melanie has 17 + 27 + 19 = 63 pennies.
ChatGPT's reasoning step:
#5. (by #1) Step 1: Calculate the number of pennies Melanie had initially.
Number of pennies in 10 quarters: 10 * 25 = 250
Number of pennies initially: 250 + 17 = 267
Table 8: An example question with ambiguous wordings. The term "pennies" in this question can
be interpreted as either a type of coin or a unit of currency. In this particular question, "pennies" is
treated as a type of coin. However, the initial reasoning step by ChatGPT mistakenly treats "pennies"
as a unit of currency, resulting in the conversion of all Melanie’s money into "pennies" (highlighted
in red). Consequently, all subsequent reasoning steps follow this ﬂawed logic, leading to an incorrect
reasoning trace. Our deductive veriﬁcation is not yet able to detect such errors.
6 Limitations
While we have demonstrated the effectiveness of Natural Programbased deductive reasoning veriﬁ
cation to enhance the trustworthiness and interpretability of reasoning steps and ﬁnal answers, it is
important to acknowledge that our approach has limitations. In this section, we analyze a common
source of failure cases to gain deeper insights into the behaviors of our approach. The failure case, as
shown in Tab. 8, involves the ambiguous interpretation of the term “pennies,” which can be understood
as either a type of coin or a unit of currency depending on the context. The ground truth answer
interprets “pennies” as coins, while ChatGPT interprets it as a unit of currency. In this case, our deductive verification process is unable to detect such a misinterpretation. Contextual ambiguities like this are common in real-world scenarios, highlighting a current limitation of our approach.
7 Conclusion
In this paper, we aim to enable Large Language Models (LLMs) to perform explicit and rigorous deductive reasoning while ensuring the trustworthiness of their reasoning processes through self-verification. To this end, we have proposed a novel framework based on "Natural Program", a natural language-based deductive reasoning format that facilitates reasoning verification and can be easily generated through in-context learning. Within this framework, we decompose the verification of complex reasoning chains into step-by-step subprocesses that focus solely on the necessary context and premises, allowing us to significantly enhance the accuracy of verification. Additionally, we introduce a Unanimity-Plurality Voting strategy to further improve verification accuracy. Experimentally, we demonstrate the superiority of our framework in improving the rigor, trustworthiness, and interpretability of reasoning steps and answers.
Broader Impact. While our deductive veriﬁcation approach can mitigate hallucinations and reasoning
errors of Large Language Models (LLMs), it does not completely eliminate these phenomena.
LLMs can still produce harmful and biased content, make incorrect claims, and produce wrongful
advice. This issue becomes particularly signiﬁcant when LLMs engage in complex reasoning chains,
increasing the risk of misleading users. Consequently, it is still crucial for users to exercise great
caution when interacting with, deploying, or developing LLMbased applications.
Acknowledgements
We would like to express our sincere gratitude to Tongzhou Mu and Caiwei Xiao from UC San
Diego, Kairong Luo from Tsinghua University, and Pulkit Madan, Reza Pourreza, Sunny Panchal,
and Apratim Bhattacharyya from Qualcomm for their valuable discussions and feedback.
References
[1]
Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and Jackie Chi Kit Cheung. Why exposure bias
matters: An imitation learning perspective of error accumulation in language generation. arXiv preprint
arXiv:2204.01171, 2022.
[2]
Kaj Bostrom, Zayne Sprague, Swarat Chaudhuri, and Greg Durrett. Natural language deduction through
search over statement compositions. arXiv preprint arXiv:2201.06028, 2022.
[3]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are fewshot learners.
Advances in neural information processing systems, 33:1877–1901, 2020.
[4]
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar,
Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artiﬁcial general intelligence: Early
experiments with gpt4. arXiv preprint arXiv:2303.12712, 2023.
[5]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting:
Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588,
2022.
[6]
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023.
[7]
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
[8]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language
modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[9]
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
[10]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training veriﬁers to solve math word
problems. arXiv preprint arXiv:2110.14168, 2021.
[11]
Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint
arXiv:2208.14271, 2022.
[12]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
[13]
Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier
Bousquet, and Denny Zhou. Compositional semantic parsing with large language models. arXiv preprint
arXiv:2209.15003, 2022.
[14]
Nuno M Guerreiro, Duarte Alves, Jonas Waldendorf, Barry Haddow, Alexandra Birch, Pierre Colombo,
and André FT Martins. Hallucinations in large multilingual translation models. arXiv preprint
arXiv:2303.16104, 2023.
[15]
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint
arXiv:2103.03874, 2021.
[16]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training computeoptimal
large language models. arXiv preprint arXiv:2203.15556, 2022.
[17]
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve
arithmetic word problems with verb categorization. In EMNLP, pages 523–533, 2014.
[18]
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea
Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing
Surveys, 55(12):1–38, 2023.
[19]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
[20]
Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. Learning to automatically solve
algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 271–281, Baltimore, Maryland, June 2014. Association for
Computational Linguistics.
[21]
Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell,
James McClelland, Jane Wang, and Felix Hill. Can language models learn from explanations in context?
In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 537–563, Abu Dhabi,
United Arab Emirates, December 2022. Association for Computational Linguistics.
[22]
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation:
Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.
[23]
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
[24]
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, KaiWei Chang, SongChun Zhu, Oyvind Tafjord, Peter
Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question
answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
[25]
Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and
Chris CallisonBurch. Faithful chainofthought reasoning. arXiv preprint arXiv:2301.13379, 2023.
[26]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
[27]
Ana Marasović, Iz Beltagy, Doug Downey, and Matthew E. Peters. Few-shot self-rationalization with natural language prompts, 2022.
[28]
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in
abstractive summarization. arXiv preprint arXiv:2005.00661, 2020.
[29] OpenAI. Gpt4 technical report, 2023.
[30]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with
human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[31]
Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi
Faltings. Reﬁner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904,
2023.
[32]
Danilo Ribeiro, Shen Wang, Xiaofei Ma, Henry Zhu, Rui Dong, Deguang Kong, Juliette Burger, Anjelica
Ramos, William Wang, Zhiheng Huang, et al. Street: A multitask structured reasoning and explanation
benchmark. arXiv preprint arXiv:2302.06729, 2023.
[33]
Subhro Roy and Dan Roth. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413,
2016.
[34]
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine
Chafﬁn, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zeroshot
task generalization. arXiv preprint arXiv:2110.08207, 2021.
[35]
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili ´
c, Daniel Hesslow, Roman
Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176bparameter
openaccess multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
[36]
Timo Schick, Jane DwivediYu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola
Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv
preprint arXiv:2302.04761, 2023.
[37]
Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. Generate & rank:
A multitask framework for math word problems. arXiv preprint arXiv:2109.03034, 2021.
[38]
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli,
and Denny Zhou. Large language models can be easily distracted by irrelevant context. arXiv preprint
arXiv:2302.00093, 2023.
[39]
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022.
[40]
Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: An autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023.
[41]
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting GPT-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022.
[42]
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[43]
Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. Proofwriter: Generating implications, proofs, and abductive statements over natural language. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 3621–3634. Association for Computational Linguistics, 2021.
[44]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[45]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[46]
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv
preprint arXiv:2206.07682, 2022.
[47] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain
of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[48]
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text
generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.
[49]
Yixuan Weng, Minjun Zhu, Shizhu He, Kang Liu, and Jun Zhao. Large language models are reasoners with self-verification. arXiv preprint arXiv:2212.09561, 2022.
[50]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React:
Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[51]
Eric Zelikman, Jesse Mu, Noah D Goodman, and Yuhuai Tony Wu. Star: Self-taught reasoner bootstrapping reasoning with reasoning, 2022.
[52]
Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
[53]
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pretrained transformer language models.
arXiv preprint arXiv:2205.01068, 2022.
[54]
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large
language models. arXiv preprint arXiv:2210.03493, 2022.
[55]
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
[56]
Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, and Hanie Sedghi. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066, 2022.
A Deductive Verification with Vicuna Models
We further explore the efficacy of deductive verification for open-source models. We select two popular models: Vicuna-7B and Vicuna-13B [7]. These models are fine-tuned versions of LLaMA-7B and LLaMA-13B [44] trained on the ShareGPT data3. We use the same Natural Program-based one-shot verification method as in the main paper. Results are shown in the first and third rows of Table 9. We observe that for the original Vicuna models without fine-tuning, Vicuna-7B exhibits poor performance in deductive verification and fails to find reasoning mistakes, while the larger Vicuna-13B exhibits better verification accuracy.
Models Reasoning Correctness GSM8K AQuA MATH AddSub Date Last Letters Overall
Vicuna-7B
Correct 80% 86% 96% 98% 96% 80% 89%
Wrong 14% 22% 16% 6% 20% 34% 19%
(Average) 47% 54% 56% 52% 58% 57% 54%
Vicuna-7B
(fine-tuned)
Correct 68% 48% 46% 76% 46% 32% 53%
Wrong 72% 86% 54% 60% 72% 68% 69%
(Average) 70% 67% 50% 68% 61% 50% 61%
Vicuna-13B
Correct 86% 82% 92% 96% 72% 74% 84%
Wrong 32% 36% 20% 20% 34% 30% 29%
(Average) 59% 59% 56% 58% 53% 52% 57%
Vicuna-13B
(fine-tuned)
Correct 74% 50% 56% 86% 72% 12% 58%
Wrong 72% 76% 72% 68% 62% 96% 74%
(Average) 73% 63% 64% 77% 67% 54% 66%
ChatGPT
(GPT-3.5-turbo)
Correct 84% 72% 70% 95% 90% 96% 85%
Wrong 84% 62% 76% 40% 56% 6% 54%
(Average) 84% 67% 73% 68% 73% 51% 69%
Table 9: One-shot deductive verification accuracy of Vicuna-7B and Vicuna-13B. The models are evaluated with or without fine-tuning on our deductive verification dataset. For each dataset, we randomly sample 50 reasoning chains that are deductively valid and 50 reasoning steps exhibiting incorrect reasoning.
We therefore conduct an additional experiment to investigate whether the verification accuracy of Vicuna models can be improved by fine-tuning. To this end, we generate a deductive verification dataset consisting of 2000 reasoning steps evenly distributed between correct and incorrect categories. We generate this dataset automatically with GPT-3.5-turbo, since it exhibits very high accuracy on single-step verification. We first use GPT-3.5-turbo to generate solutions for problems in GSM8K's training set. We then execute step-by-step deductive verification on these solutions using GPT-3.5-turbo. For solutions that result in correct final answers, we retain the reasoning steps that pass deductive verification. For solutions that yield incorrect final answers, we retain the reasoning steps that fail deductive verification. After constructing the dataset, we fine-tune the Vicuna models on the verifications of the 2000 reasoning steps. Models were fine-tuned on 4 A100-80GB GPUs over 3 epochs. Training parameters are shown in Table 10.
As shown in Tab. 9, we observe that fine-tuning with our dataset enhances the deductive verification accuracy of Vicuna models not only on the dataset from which the training data was constructed (GSM8K), but also on many other datasets. However, the accuracy is still worse than that of non-finetuned GPT-3.5, which suggests that model capacity has a significant impact on deductive verification capabilities.
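The labeling rule used to construct this verification dataset can be sketched as follows; the data layout and function name are illustrative, not taken from the released code:

```python
def label_verification_examples(solutions):
    """Build (reasoning_step, validity_label) pairs from verified solutions.

    solutions: list of dicts with keys "answer_correct" (bool) and
    "steps": a list of (step_text, passed_verification) pairs.
    Steps from correct solutions are kept only if they passed verification
    (labeled valid); steps from incorrect solutions are kept only if they
    failed verification (labeled invalid). All other steps are discarded.
    """
    dataset = []
    for sol in solutions:
        for step, passed in sol["steps"]:
            if sol["answer_correct"] and passed:
                dataset.append((step, "valid"))
            elif not sol["answer_correct"] and not passed:
                dataset.append((step, "invalid"))
    return dataset
```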
B More Details on Answer Extraction
In this section, we describe our process for extracting the final answer from language models' responses. The process begins by selecting the last three non-empty lines. These lines are then processed through the following pipeline:
3 https://github.com/domeccleston/sharegpt
Hyperparameters Value
Optimizer AdamW
Learning rate 1×10−5
Weight decay 0.00
Num epochs 3
Batch size 64
Learning rate schedule Linear
Table 10: Hyperparameters for ﬁnetuning Vicuna models with our deductive veriﬁcation dataset.
1. First, we use a list of regular expressions to identify "No-Answer" patterns within the text, such as "we cannot answer (this|the) question". This helps us ascertain whether the model provides a conclusive answer. If any such pattern appears in the text, we mark "No answer!" as the final answer. If we do not detect these patterns, we proceed to the next steps for extracting the final answer.
2. Second, if any "Answer-Split" patterns are found in the text, we divide the text into several blocks using the identified pattern. The last block of text is then used for extracting the answer.
3. Last, we use the regular expressions outlined in Tab. 11 to scan the remaining text for possible final answers. If multiple matches are found for a pattern, we select the first match as the final answer. If no pattern matches are found in the remaining text, we default the final response to "No answer!".
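The three stages above can be sketched as follows; the pattern lists are abbreviated to a few representatives, and the exact strings and number regex are assumptions rather than the paper's full implementation:

```python
import re

# Abbreviated pattern lists (illustrative, not the paper's full lists).
NO_ANSWER_PATTERNS = [r"we cannot answer (this|the) question", r"we cannot determine"]
ANSWER_SPLIT_PATTERNS = ["answer is", "final answer:"]
NUMBER_RE = r"(-?\d[\d,\. ]*)"  # assumed numeric-answer pattern

def extract_final_answer(response: str) -> str:
    """Three-stage answer extraction: no-answer detection, answer-split, regex scan."""
    lines = [l for l in response.splitlines() if l.strip()]
    text = "\n".join(lines[-3:])  # keep the last three non-empty lines
    # Stage 1: "No-Answer" patterns mean the model abstained.
    if any(re.search(p, text, re.IGNORECASE) for p in NO_ANSWER_PATTERNS):
        return "No answer!"
    # Stage 2: keep only the last block after any "Answer-Split" pattern.
    for pattern in ANSWER_SPLIT_PATTERNS:
        if pattern in text:
            text = text.split(pattern)[-1]
    # Stage 3: first regex match wins; otherwise abstain.
    match = re.search(NUMBER_RE, text)
    return match.group(1).strip(" .,") if match else "No answer!"
```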
“No-Answer” Patterns: "we cannot provide an answer to this question with (this|the) given information", "we cannot answer (this|the) question", "we cannot determine", "we can’t determine", "we do not have enough information to answer (this|the) question", "we do not have enough information to provide a definitive answer to (this|the) question", "the answer(.*?)is unknown", "answer is not listed among the answer choices".
“Answer-Split” Patterns: "answer is", "final answer:", "answer to the question is", "answer to this question is", "concatenated letters are", "concatenate the letters", "The answer of ".
Answer Type Regular Expression
Number (-?\d[\d,\. ]*)
Fractional number (-?\(\d+\/\d+\)\/\d+|-?\d+\/\d+)
Date (\d\d\/\d\d\/\d\d\d\d)
Yes or No (?:Yes|No|yes|no|NO|YES)
Table 11: Regular expressions for extracting the final answers of different kinds of questions.
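The PDF extraction appears to have dropped the "|" and "-" characters from these patterns; the snippet below exercises one plausible reconstruction (a best-effort guess, not necessarily the exact patterns used):

```python
import re

# Plausible reconstructions of Tab. 11's patterns; the alternation bars and
# minus signs are assumptions, since they were lost during PDF extraction.
PATTERNS = {
    "number": r"(-?\d[\d,\. ]*)",
    "fraction": r"(-?\(\d+/\d+\)/\d+|-?\d+/\d+)",
    "date": r"(\d\d/\d\d/\d\d\d\d)",
    "yes_no": r"(?:Yes|No|yes|no|NO|YES)",
}

def first_match(kind: str, text: str):
    """Return the first substring matching the given pattern, or None."""
    m = re.search(PATTERNS[kind], text)
    return m.group(0) if m else None
```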
C Prompts
C.1 Prompt for Direct Reasoning Chain Verification Without Natural Program Format
For the results in Tab. 2 of the main paper, we use "Do you think the above reasoning process is correct? Let's think step by step." as the zero-shot prompt to verify an entire reasoning chain at once. We also design a two-shot prompt for reasoning chain verification, shown in Tab. 12, which covers one correct reasoning chain and one incorrect reasoning chain.
C.2 Prompts for Reasoning Chain Generation in the Natural Program Format
To instruct models to generate reasoning chains in the Natural Program format that facilitates step-by-step deductive verification, we have designed four distinct prompts to address different types of problems. These include:
1. Math word problems, as illustrated in Tab. 13, covering the GSM8K, MATH, and AddSub datasets.
2. Math word problems with multiple-choice options, illustrated in Tab. 14, covering the AQuA dataset.
3. Date-related problems, illustrated in Tab. 15, covering the Date dataset.
4. Last Letters problems, illustrated in Tab. 16, covering the Last Letters dataset.
C.3 Prompt for Deductive Verification Following Natural Program Format and Step-by-Step Decomposition
We have designed a general one-shot prompt for the deductive verification of a single reasoning step on different datasets, as shown in Tab. 17. This prompt instructs language models to determine the deductive validity of each reasoning step, as illustrated in Sec. 4.2 and the top-right box of Fig. 1 of the main paper.
D More Deductive Verification Examples
In this section, we present more deductive verification examples using our Natural Program-based approach on single reasoning steps.
In Tab. 18, we demonstrate that the language model (ChatGPT) not only successfully identifies ungrounded information, but also identifies logical errors within the given solutions.
In Tab. 19, we illustrate a case where the language model fails to detect ungrounded premise numbers, mistakenly assuming that these numbers can be derived from grounded ones.
Lastly, in Tab. 20, we illustrate a case where the language model is sometimes unable to correctly identify grounded numbers.
Here is a math question and its solution:
"Question:
Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops
did Jason give to Denny?
Answer:
Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 + 12 = 34.
The answer is 34."
You are a math teacher. Do you think the reasoning process is correct?
Let’s check the "Answer" step by step, and then end with "yes" or "no".
Answer:
The problem statement is clear, but the reasoning process has a major error in it.
Let’s go through it step by step:
1. "Jason started with 20 lollipops." This is correct according to the problem statement.
2. "Then he had 12 after giving some to Denny." This is also correct as per the problem.
3. "So he gave Denny 20 + 12 = 34. The answer is 34." This is incorrect. The correct reasoning process should be subtracting the number of lollipops he ended up with from the number he started with. That would be 20 - 12 = 8 lollipops.
So, Jason gave Denny 8 lollipops, not 34. No, the reasoning process is incorrect.
Here is a math question and its solution:
"Question:
Olivia has $23. She bought ﬁve bagels for $3 each. How much money does she have left?
Answer:
Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left.
23 - 15 is 8. The answer is 8."
You are a math teacher. Do you think the reasoning process is correct?
Let’s check the "Answer" step by step, and then end with "yes" or "no".
Answer:
Let’s go through this reasoning process step by step.
1. "Olivia had 23 dollars." This is the given starting amount of money that Olivia has. This step is correct.
2. "5 bagels for 3 dollars each will be 5 x 3 = 15 dollars." This is correct as well. Olivia spends 3 dollars
for each bagel, and she buys 5 of them. So the total cost of the bagels is 3 dollars/bagel x 5 bagels = 15 dollars.
3. "So she has 23 - 15 dollars left." This statement is also correct. After purchasing the bagels, Olivia will
have her initial amount of money minus the cost of the bagels.
4. "23 - 15 is 8. The answer is 8." This step is correct as well. 23 dollars (Olivia’s initial amount) - 15 dollars
(the cost of the bagels) = 8 dollars. This means that Olivia has 8 dollars left after buying the bagels.
So overall, Olivia has 8 dollars left after buying five bagels for 3 dollars each. Yes, the reasoning process
is correct.
Table 12: Two-shot prompt for direct reasoning chain verification without Natural Program format.
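Prompts like the one in Table 12 instruct the model to end its verification with "yes" or "no", so downstream code must extract that final verdict from free-form text. A minimal sketch of such a parser (the helper name `parse_verdict` is our own illustration, not from the paper):

```python
import re

def parse_verdict(response: str) -> bool:
    """Extract the final yes/no verdict from a free-form verification response.

    The prompt asks the model to end with "yes" or "no", so we take the
    last standalone occurrence of either word as the verdict.
    """
    matches = re.findall(r"\b(yes|no)\b", response.lower())
    if not matches:
        raise ValueError("no yes/no verdict found in response")
    return matches[-1] == "yes"
```

Taking the last occurrence rather than the first matters here, because the model's step-by-step analysis may mention "yes" or "no" before stating its final conclusion.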
Question:
Alice buys 5 apples. Each apple costs her $2. She gets them for 50% off. After she bought
them, she decides she doesn’t like 2 of them and sells them to her friends for $1. How much money
was she out?
Answer:
First, let’s write down all the statements and relationships in the question with labels.
#1. Alice gets 5 apples.
#2. Each apple costs $2, but Alice gets apples for 50% off.
#3. Alice sells 2 of 5 apples for 1.
#4. How much money was Alice out?
Next, let’s answer the question step by step with reference to the question and reasoning process:
#5. (by #2) Step 1: Calculate the price per apple after a 50% discount.
Original price per apple: $2
Discount: 50%
Price per apple after discount: $2 * (1 - 0.50) = $2 * 0.50 = $1
#6. (by #1 #5) Step 2: Calculate Alice’s cost of the 5 apples after a 50% discount.
Price per apple after discount: $1
Total apples: 5
Alice’s total cost of 5 apples: $1 * 5 = $5
#7. (by #3) Step 3: Calculate the money Alice gets back from selling 2 apples.
Money that Alice gets back from selling 2 apples: $1
#8. (by #6 #7) Step 4: Calculate the total amount of money Alice is out.
Alice’s total cost of 5 apples: $5
Money that Alice gets back from selling 2 apples: $1
Money that Alice is out: $5 - $1 = $4
#9. (by #4 #8) The original question is #4.
How much money was Alice out? We do not miss information on the rewritten labels. So the
answer to this question is Alice is out $4.
Please answer the following question by following the above template.
Table 13: One-shot Natural Program prompt for reasoning chain generation on math word problems.
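The labeled premises and "(by #i #j)" references in the Natural Program format are what make per-step verification possible: each step can be handed to the verifier together with only the premises it cites. A rough sketch of how such an answer could be split into steps (the function `parse_steps` and its return convention are our own illustration):

```python
import re

def parse_steps(answer: str) -> dict[int, tuple[list[int], str]]:
    """Split a Natural-Program-style answer into labeled statements.

    Maps each label to (premise_labels, text), where premise_labels are
    the "(by #i #j)" references, so that each step can later be verified
    against only the premises it actually cites.
    """
    # A step header is "#<label>." optionally followed by "(by #i #j ...)".
    header = re.compile(r"^#(\d+)\.\s*(?:\(by((?:\s*#\d+)+)\))?", re.M)
    matches = list(header.finditer(answer))
    steps = {}
    for i, m in enumerate(matches):
        refs = [int(r) for r in re.findall(r"#(\d+)", m.group(2) or "")]
        end = matches[i + 1].start() if i + 1 < len(matches) else len(answer)
        steps[int(m.group(1))] = (refs, answer[m.end():end].strip())
    return steps
```

Given the parsed steps, a verifier can assemble the "Here is some information: ..." context for step #8, say, from the texts of steps #6 and #7 alone.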
Question:
Alice buys 5 apples. Each apple costs her $2.
She gets them for 50% off. After she bought them, she decides she doesn’t like 2 of them and sells them
to her friends for $1.
How much money was she out?
Answer Choices:
A) 4
B) 2
C) None of the above
Answer:
First, let’s write down all the statements and relationships in the question with labels.
#1. Alice gets 5 apples.
#2. Each apple costs $2, but Alice gets apples for 50% off.
#3. Alice sells 2 of 5 apples for 1.
#4. How much money was Alice out?
Next, let’s answer the question step by step with reference to the question and reasoning process:
#5. (by #2) Step 1: Calculate the price per apple after a 50% discount.
Original price per apple: $2
Discount: 50%
Price per apple after discount: $2 * (1 - 0.50) = $2 * 0.50 = $1
#6. (by #1 #5) Step 2: Calculate Alice’s cost of the 5 apples after a 50% discount.
Price per apple after discount: $1
Total apples: 5
Alice’s total cost of 5 apples: $1 * 5 = $5
#7. (by #3) Step 3: Calculate the money Alice gets back from selling 2 apples.
Money that Alice gets back from selling 2 apples: $1
#8. (by #6 #7) Step 4: Calculate the total amount of money Alice is out.
Alice’s total cost of 5 apples: $5
Money that Alice gets back from selling 2 apples: $1
Money that Alice is out: $5 - $1 = $4
#9. (by #4 #8) The original question is #4. How much money was Alice out? We do not miss
information on the rewritten labels. So the answer to this question is Alice is out $4. Among
all the answer choices, the best option is A) 4.
Please answer the following question by following the above template.
Table 14: One-shot Natural Program prompt for reasoning chain generation on math word problems
with multiple choice.
Let’s work on date computation.
Question:
Yesterday is 02/22/2012. What is the date one week from today? Please answer in format
MM/DD/YYYY.
Answer:
First, let’s write down all the statements and relationships in the question with labels.
#1. Yesterday is 02/22/2012.
#2. What is the date one week from today? Please answer in format MM/DD/YYYY.
Next, let’s answer the question step by step with reference to the question and reasoning process:
#3. (by #1) Step 1: Calculate the date today.
The date of yesterday is 02/22/2012.
The date of today is 02/23/2012.
#4. (by #2 #3) Step 2: Calculate the date one week from today and write the answer in
MM/DD/YYYY.
Because 2012 is a leap year and February in 2012 has 29 days, the date one week from today is
03/01/2012.
Therefore, the date one week from today is 03/01/2012.
Question:
Today is 3/2 and Bob is watching NBA 2007. What is the date one week before today?
Please answer in format MM/DD/YYYY.
Answer:
First, let’s write down all the statements and relationships in the question with labels.
#1. Today is 3/2.
#2. Bob is watching NBA 2007 today.
#3. What is the date one week before today? Please answer in format MM/DD/YYYY.
Next, let’s answer the question step by step with reference to the question and reasoning process:
#4. (by #1 #2) Step 1: Calculate the date today.
Today is 3/2 and Bob is watching NBA 2007 today. Thus, today is 03/02/2007.
#5. (by #3 #4) Step 2: Calculate the date one week before today and write the answer
in MM/DD/YYYY.
Because 2007 is not a leap year and February in 2007 has 28 days, the date one week before today
is 02/23/2007.
Therefore, the date one week before today is 02/23/2007.
Please answer the following question by following the above template.
Table 15: Two-shot Natural Program prompt for reasoning chain generation on the Date dataset.
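The date arithmetic in Table 15, including the leap-year boundary, can be cross-checked mechanically with Python's standard `datetime` module:

```python
from datetime import date, timedelta

# Example 1: yesterday is 02/22/2012, so today is 02/23/2012; adding one
# week crosses the leap day, since February 2012 has 29 days.
today = date(2012, 2, 22) + timedelta(days=1)
print((today + timedelta(weeks=1)).strftime("%m/%d/%Y"))  # 03/01/2012

# Example 2: today is 03/02/2007; February 2007 has 28 days.
print((date(2007, 3, 2) - timedelta(weeks=1)).strftime("%m/%d/%Y"))  # 02/23/2007
```

Such a programmatic check is only available for tasks like date computation with an executable ground truth; the verification prompts in this paper target the general case where no such oracle exists.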
Let’s work on last letter concatenation.
Question:
Take the last letters of each word in "Apple Is" and concatenate them.
Answer:
First, let’s write down all the statements and relationships in the question with labels.
#1. We need to take the last letter of the word "Apple".
#2. We need to take the last letter of the word "Is".
#3. We need to concatenate the last letters of the words sequentially.
Next, let’s answer the question step by step with the complete reference to the question and reasoning process:
#4. (by #1, #3) Step 1: Take the last letter of the word "Apple" and concatenate it.
The last letter of the word "Apple" is "e".
The concatenated last letters are "e".
#5. (by #2, #3, #4) Step 2: Take the last letter of the word "Is" and concatenate it.
The last letter of the word "Is" is "s".
The concatenated last letters are "es".
#6. (by #3, #5) Step 3: Provide the final answer of the concatenated letters.
Therefore, the answer is "es".
Please answer the following question by following the above template.
Table 16: One-shot Natural Program prompt for reasoning chain generation on the Last Letters
dataset.
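The Last Letters task itself is trivially checkable in code, which makes it a convenient testbed for reasoning-chain verification; the prompt's two steps amount to:

```python
def last_letter_concatenation(phrase: str) -> str:
    """Concatenate the last letter of each word, mirroring the prompt's steps."""
    return "".join(word[-1] for word in phrase.split())

print(last_letter_concatenation("Apple Is"))  # es
```

The interest of the task lies not in computing the answer but in whether the model's step-by-step reasoning toward it is deductively valid.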
Here is some information:
“1. Alice gets 2 apples.
2. Alice gets twice as many oranges as bananas.
3. Original price per apple: $2
Discount: 50%
Price per apple after discount: $2 * (1 - 0.50) = $2 * 0.50 = $1"
Based on the given information, here is a reasoning process:
"Calculate Alice’s cost of the 5 apples after a 50% discount.
Price per apple after discount: $1
Apples be eaten: 3
Oranges be eaten: 6
Total apples: 5
Total oranges: x
Total bananas: 2x
Alice’s total number of fruits: 5 + x + 2x = 5 + 4x
Alice’s total cost of 5 apples: $1 * (5 - 3) = $2”
Double-check the reasoning process, let’s analyze its correctness, and end with "yes" or "no".
Answer:
Let’s think step by step.
Grounding check:
Purpose: Find references for numbers in the reasoning process sequentially
Rule: Any ungrounded number makes the check fail
"x" is a variable, we do not care about variables.
"y" is a variable, we do not care about variables.
"5" is a quantity from the given information that represents the number of apples.
"50%" is a quantity from the given information that represents the discount.
"$1" is a quantity from the given information that represents the price per apple after the discount.
"3" is a quantity not from the given information, it is ungrounded.
"6" is a quantity not from the given information, it is ungrounded.
"$2" is a quantity from the given information that represents the original price per apple.
Result: Based on the check, there are ungrounded numbers, so the grounding check fails.
Reasoning check:
Purpose: Check whether the reasoning logic correctly answers the question
Rule: Any logic error makes the check fail
To calculate the total cost of apples after a 50% discount, we should multiply the number of apples
by the discounted price. But to answer the question, we do not need to calculate the total number
of fruit Alice gets.
Result: Based on the check, there are some logic mistakes, so the reasoning check fails.
Calculation check:
Purpose: Check whether the calculation process is consistent
Rule: Any inconsistent calculation makes the check fail
calculation1:
equation: $1 * (5 - 3), answer: $2
(5 - 3) = 2
$1 * 2 = $2 is consistent with the answer, so the calculation is correct.
calculation2:
equation: 5 + x + 2x, answer: 5 + 4x
x + 2x = 3x
5 + 3x is inconsistent with the answer, so the calculation is incorrect.
Result: Based on the check, the calculation process is inconsistent, so the calculation check fails.
Check results: Grounding check fails, Reasoning check fails, Calculation check fails.
Rule: Any failed check makes the reasoning incorrect.
So the answer is "no".
Table 17: One-shot prompt for deductive verification of a single reasoning step, following our Natural
Program format and step-by-step reasoning chain decomposition.
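The grounding check in Table 17 asks the model to trace every number in a reasoning step back to its premises. A crude textual approximation of this check (our own sketch; it also flags legitimately derived quantities, which is one reason the paper delegates the judgment to the language model):

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def ungrounded_numbers(premises: str, step: str) -> list[str]:
    """Return numbers in a reasoning step that never appear in its premises.

    A purely textual proxy for the grounding check: any numeric literal
    in the step that is absent from the premise text is reported.
    """
    grounded = set(NUMBER.findall(premises))
    return [n for n in NUMBER.findall(step) if n not in grounded]
```

On Table 18's example this correctly flags the fabricated "2", but on Table 19's example it cannot tell a hallucinated "240" from a correctly derived intermediate result, which is exactly the distinction the model-based check must make.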
Input:
Here is some information:
"For every 25 telephone calls he makes, he gets one person to come into the dealership."
Based on the given information, here is a reasoning process:
"Calculate the number of people that would come into the dealership for 15 car sales.
Number of people needed to sell 15 cars: 15 * 2 = 30"
Double-check the reasoning process, let’s analyze its correctness, and end with "yes" or "no".
Answer:
Let’s think step by step without any assumptions.
Output:
Grounding check:
Purpose: Find references for numbers in the reasoning process sequentially
Rule: Any ungrounded number makes the check fail
"15" is a quantity from the question that represents the number of car sales.
"2" is a quantity not from the given information, it is ungrounded.
"25" is a quantity from the given information that represents the number of telephone calls
per one person to come into the dealership.
Result: Based on the check, there are ungrounded numbers, so the grounding check fails.
Reasoning check:
Purpose: Check whether the reasoning logic correctly answers the question
Rule: Any logic error makes the check fail
It is not clear how the number of people coming into the dealership relates to the number of car sales.
The given information only tells us the ratio of telephone calls to people coming into the dealership,
but it does not provide any information about the number of telephone calls made for each car sale.
Therefore, the reasoning process is not correct.
Result: Based on the check, there is a logic mistake, so the reasoning check fails.
Calculation check:
Purpose: Check whether the calculation process is consistent
Rule: Any inconsistent calculation makes the check fail
calculation1:
equation: 15 * 2, answer: 30
15 * 2 = 30 is consistent with the answer, so the calculation is correct.
Result: Based on the check, the calculation process is consistent, but the reasoning is incorrect,
so the calculation check fails.
Check results: Grounding check fails, Reasoning check fails, Calculation check fails.
Rule: Any failed check makes the reasoning incorrect.
So the answer is "no".
Table 18: Successful case: our deductive verification approach successfully discovers ungrounded
information and reasoning mistakes.
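The calculation check in these verification outputs re-evaluates each extracted "equation, answer" pair. That particular sub-check can also be automated; below is a sketch using a small AST-based evaluator (the helper names are our own assumptions; the paper performs this check via the language model itself):

```python
import ast
import operator
import re

# Map AST operator nodes to their arithmetic functions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a small arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

def calculation_check(equation: str, answer: str) -> bool:
    """Check that an extracted 'equation, answer' pair is arithmetically consistent."""
    strip = lambda s: re.sub(r"[^\d.+\-*/() ]", "", s)  # drop $ signs and other units
    return abs(safe_eval(strip(equation)) - safe_eval(strip(answer))) < 1e-9
```

Note that, as Table 18 shows, a calculation can be internally consistent ("15 * 2 = 30") while the reasoning that produced it is still invalid, so this check alone is never sufficient.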
Input:
Here is some information:
"How many hours would it take for 5 people to paint a whole house?"
Based on the given information, here is a reasoning process:
"Calculate the total number of hours it would take for 5 people to paint a whole house.
Total hours for one person to paint a whole house: 240 hours
Total number of people: 5
Total hours for 5 people to paint a whole house: 240 hours / 5 = 48 hours"
Double-check the reasoning process, let’s analyze its correctness, and end with "yes" or "no".
Answer:
Let’s think step by step without any assumptions.
Output:
Grounding check:
Purpose: Find references for numbers in the reasoning process sequentially
Rule: Any ungrounded number makes the check fail
"240" is a quantity not from the given information, it is ungrounded.
"5" is a quantity from the given information that represents the number of people.
"48" is a quantity calculated from the ungrounded number and grounded number in the reasoning process.
Result: Based on the check, there is one ungrounded number, but it is calculated from another grounded
number, so the grounding check passes.
Reasoning check:
Purpose: Check whether the reasoning logic correctly answers the question
Rule: Any logic error makes the check fail
The reasoning process correctly calculates the total number of hours for 5 people to paint a whole house by
dividing the total hours for one person to paint a whole house by 5. Therefore, the reasoning check passes.
Calculation check:
Purpose: Check whether the calculation process is consistent
Rule: Any inconsistent calculation makes the check fail
calculation1:
equation: 240 / 5, answer: 48
240 / 5 = 48 is consistent with the answer, so the calculation is correct.
Result: Based on the check, the calculation process is consistent, so the calculation check passes.
Check results: Grounding check passes, Reasoning check passes, Calculation check passes.
Rule: All checks passing makes the reasoning correct.
So the answer is "yes".
Table 19: Failure case: our deductive verification process fails to find the ungrounded information in
the reasoning step. The number 240 in the reasoning step is ungrounded, but the model states that it
can be calculated from grounded numbers.
Input:
Here is some information:
"Calculate the number of doughnuts Derrick makes in a day.
Number of doughnuts in a dozen: 12
Number of doughnuts in 10 dozens: 10 * 12 = 120
Calculate the revenue from selling one doughnut.
Price per doughnut: $2"
Based on the given information, here is a reasoning process:
"Calculate the revenue from selling all the doughnuts in a day.
Number of doughnuts in a day: 120
Revenue from selling all the doughnuts in a day: $2 * 120 = $240"
Double-check the reasoning process, let’s analyze its correctness, and end with "yes" or "no".
Answer:
Let’s think step by step without any assumptions.
Output:
Grounding check:
Purpose: Find references for numbers in the reasoning process sequentially
Rule: Any ungrounded number makes the check fail
"12" is a quantity from the given information that represents the number of doughnuts in a dozen.
"10" is a quantity not from the given information, it is ungrounded.
"$2" is a quantity from the given information that represents the price per doughnut.
"120" is a quantity not from the given information, it is ungrounded.
Result: Based on the check, there are ungrounded numbers, so the grounding check fails.
Reasoning check:
Purpose: Check whether the reasoning logic correctly answers the question
Rule: Any logic error makes the check fail
The reasoning process is correct. We have the number of doughnuts Derrick makes in a day,
and we have the price per doughnut.
Multiplying these two yields the revenue from selling all the doughnuts in a day.
Result: Based on the check, the reasoning logic is correct.
Calculation check:
Purpose: Check whether the calculation process is consistent
Rule: Any inconsistent calculation makes the check fail
calculation1:
equation: 10 * 12, answer: 120
10 * 12 = 120 is consistent with the answer, so the calculation is correct.
calculation2:
equation: $2 * 120, answer: $240
$2 * 120 = $240 is consistent with the answer, so the calculation is correct.
Result: Based on the check, the calculation process is consistent, so the calculation check passes.
Check results: Grounding check fails, Reasoning check passes, Calculation check passes.
Rule: Any failed check makes the reasoning incorrect.
So the answer is "no".
Table 20: Failure case: our deductive verification process sometimes treats grounded information as
if it were ungrounded. The number 120 is provided in the given information, but the model states
that it is ungrounded.
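Across Tables 17–20, a reasoning chain is only as trustworthy as its weakest step: the verification rule "Any failed check makes the reasoning incorrect" applies per step, and a chain should be accepted only when every step passes every check. A minimal sketch of that aggregation (the data layout is our own assumption):

```python
def chain_is_valid(step_verdicts: dict[int, tuple[bool, bool, bool]]) -> bool:
    """Accept a reasoning chain only if every step passes all three checks.

    step_verdicts maps a step label to its (grounding, reasoning,
    calculation) check results; any single failed check invalidates
    the whole chain.
    """
    return all(all(checks) for checks in step_verdicts.values())
```

This conjunctive rule is what makes the failure cases above consequential: a single false "fail" verdict (Table 20) rejects a correct chain, while a single false "pass" verdict (Table 19) can let an ungrounded chain through.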