
Feedback enhances the positive effects and reduces the negative effects of multiple-choice testing

Abstract

Multiple-choice tests are used frequently in higher education without much consideration of the impact this form of assessment has on learning. Multiple-choice testing enhances retention of the material tested (the testing effect); however, unlike other tests, multiple-choice can also be detrimental because it exposes students to misinformation in the form of lures. The selection of lures can lead students to acquire false knowledge (Roediger & Marsh, 2005). The present research investigated whether feedback could be used to boost the positive effects and reduce the negative effects of multiple-choice testing. Subjects studied passages and then received a multiple-choice test with immediate feedback, delayed feedback, or no feedback. In comparison with the no-feedback condition, both immediate and delayed feedback increased the proportion of correct responses and reduced the proportion of intrusions (i.e., lure responses from the initial multiple-choice test) on a delayed cued recall test. Educators should provide feedback when using multiple-choice tests.
Copyright 2008 Psychonomic Society, Inc.
The multiple-choice test is a staple of higher education
because it provides an efficient and effective measure of
student learning (McKeachie, 1999). The popularity of
this highly objective testing format has increased over
the years, partly due to improvements in technology that
make grading multiple-choice tests quick and easy. The
multiple-choice test is also highly reliable across scorers,
unlike essay tests. For these reasons and others (Frederik-
sen, 1984), many educators consider the multiple-choice
format an optimal method of testing.
Although tests are primarily used as means of assess-
ment, they also affect the knowledge they measure. Taking
a test generally improves retention of the material tested—a
result commonly referred to as the testing effect (for a review,
see Roediger & Karpicke, 2006a). Multiple-choice tests
generally enhance learning as measured on later tests; how-
ever, the multiple-choice test presents a unique situation
because it exposes students to erroneous information in
the form of lure items. By endorsing (or even reading) lure
items during the course of taking a multiple-choice test,
students may acquire incorrect knowledge (see, e.g., But-
ler, Marsh, Goode, & Roediger, 2006; Roediger & Marsh,
2005). As a result, the value of using multiple-choice test-
ing as a learning tool will be enhanced to the extent that the
positive effects (increased retention) can be maximized and
the negative effects (the acquisition of misinformation) can
be minimized. Providing feedback after a multiple-choice
test may promote optimal learning by helping students to
maintain correct responses and correct errors (for a review,
see Bangert-Drowns, Kulik, Kulik, & Morgan, 1991).
The present research sought to identify the circumstances
under which multiple-choice testing is most beneficial
to learning. More specifically, we examined two
factors that may influence the positive and negative ef-
fects of taking a multiple-choice test: the amount of study
prior to the test and the number of lures on the multiple-
choice test. Considering the first factor, students vary
considerably in their preparation for a test. Obviously, the
lack of prior study will result in poor performance on the
multiple-choice test, decreasing the positive effects of test-
ing, because students must answer questions correctly in
order to benefit from testing. Roediger and Marsh (2005;
see too Butler et al., 2006) showed this effect, but they
also showed that the negative effects of taking a multiple-
choice test were greater when students had studied less,
which also makes sense. When students know little and
guess, they select a lure, and then (if they are not cor-
rected with feedback) they may believe that they made a
correct choice and provide the answer on a later test. The
other factor of interest is the number of lures on the test.
Instructors often prefer to use multiple lures (three or four
lures in addition to the correct answer is typical) in order
to drive down the probability of guessing correctly. How-
ever, increasing the number of lures produces the same
negative effect of acquiring false knowledge, because stu-
dents are exposed to more erroneous information.
Instructors vary greatly in whether they give feedback
on multiple-choice tests. Some do so as a matter of course,
but others protect their test banks and do not give students
feedback unless they make an appointment to see their
exam in the instructor’s office. In the present experiment,
we were interested in whether providing feedback would
influence the magnitude of positive testing effects (we
predicted they would) and at the same time overcome the
Andrew C. Butler and Henry L. Roediger III
Washington University, St. Louis, Missouri
Memory & Cognition
2008, 36 (3), 604-616
doi: 10.3758/MC.36.3.604
A. C. Butler, butler@wustl.edu
Schooler, Foster, and Loftus (1988) found an impairing
effect of endorsing a lure even when the endorsed lure
was not included on the final test, indicating that the nega-
tive effects of committing an error on a multiple-choice
test are not completely due to a bias for maintaining the
same response. In essence, the persistence of incorrect re-
sponses indicates that students are acquiring false knowl-
edge through multiple-choice testing—an outcome that is
especially troubling given the power of testing to enhance
retention (of incorrect facts, in this case).
A primary determinant of the magnitude of these negative
effects is the level of performance on the multiple-choice
test: As students commit more errors, the opportunities for
acquiring false knowledge grow. One factor that influences
the level of performance is test difficulty. Although test
difficulty can be operationalized in many ways, a simple
and systematic method for manipulating test difficulty is
varying the number of multiple-choice alternatives. For ex-
ample, Roediger and Marsh (2005) had subjects read prose
passages and then take a multiple-choice test that contained
equal numbers of two-, four-, and six-alternative questions.
As the number of alternatives on the initial multiple-choice
test increased, the proportion of correct responses on the
multiple-choice test decreased. Then, after a delay, subjects
took a comprehensive cued recall test. Increasing the num-
ber of lures on the multiple-choice test led to a decrease in
the proportion of correct responses and an increase in the
proportion of lures produced on the later cued recall tests.
As was noted previously, Roediger and Marsh showed that
the amount of prior study affected performance on an initial
multiple-choice test and, as a result, on the final cued recall
test as well. When students were not given the opportunity
to read the passages and took the initial test “cold,” they
performed worse on both the initial multiple-choice test and
the final cued recall test than they did when they studied
the material.
Feedback Boosts Retention and Corrects Errors
One potential method for increasing the benefits of
testing and reducing the negative effects of exposing stu-
dents to misinformation is to provide feedback after test-
ing. Feedback allows students to correct errors (Bangert-
Drowns, Kulik, Kulik, & Morgan, 1991) and maintain
correct responses (Butler, Karpicke, & Roediger, in press),
resulting in superior performance on a subsequent test in
comparison with no feedback (McDaniel & Fisher, 1991).
The type of feedback provided can range from a simple
indication of whether the response is correct or incorrect
(see, e.g., Schroth, 1977) to an elaborate explanation of
why a certain response is correct (e.g., Tait, Hartley, &
Anderson, 1973), to a full re-presentation of the original
study materials that allows students to determine the ac-
curacy of their responses (Agarwal, Karpicke, Kang, Roe-
diger, & McDermott, in press). Perhaps the most critical
piece of information in the feedback message is the cor-
rect response, which permits students to both evaluate the
accuracy of their knowledge and encode the correct re-
sponse, if necessary. Consequently, providing the correct
response is more effective than simply indicating whether
the response is correct or incorrect (e.g., Gilman, 1969;
negative effects of such tests (an issue in more doubt). In
addition, we were interested in whether the timing of the
feedback would have differential effects. Before describ-
ing our experiment, we will briefly summarize previous
research relevant to this study.
Testing Benefits Retention
The act of retrieving information from memory serves
to modify the memory trace and increase the probabil-
ity of future retrieval success (see, e.g., Carrier & Pash-
ler, 1992; McDaniel & Masson, 1985; Tulving, 1967;
Wheeler & Roediger, 1992). Because of the mnemonic
benefit conferred by retrieval, many researchers have
argued that tests should be used as learning tools in the
classroom (e.g., Bangert-Drowns, Kulik, & Kulik, 1991;
Foos & Fisher, 1988; Glover, 1989; Jones, 1923–1924;
Roediger & Karpicke, 2006b). Indeed, recent research
suggests that testing produces long-lasting benefits for
retention of complex, educationally relevant materials
(Butler & Roediger, 2007; McDaniel, Anderson, Derbish,
& Morrisette, 2007).
With respect to multiple-choice tests in particular, pre-
vious research has shown that taking an initial multiple-
choice test leads to superior performance on a subsequent
test in comparison with not taking an initial test, regardless
of whether the final test format is multiple-choice (see,
e.g., Duchastel & Nungester, 1982; McDaniel et al., 2007)
or cued recall (e.g., Butler & Roediger, 2007; McDaniel
& Masson, 1985; Roediger & Marsh, 2005). For example,
Butler, Karpicke, and Roediger (2007) had students study
prose passages and then take an initial multiple-choice
test that covered the material in the passages. On a subse-
quent cued recall test, subjects produced a higher propor-
tion of correct responses for previously tested items than
for a subset of the items that were not tested on the initial
multiple-choice test.
Exposing Students to Misinformation
Although testing generally enhances retention of the
material, studies that utilize multiple-choice tests have also
revealed negative consequences of exposing students to
incorrect information. Taking a multiple-choice test leads
subjects to assign higher “truth” values to false statements
that appeared on the earlier multiple-choice test than to
novel false statements (Toppino & Luipersbeck, 1993).
Similarly, research has shown that exposure to incorrect
spellings (Brown, 1988; Jacoby & Hollingshead, 1990)
or false facts embedded within a passage (Marsh, Meade,
& Roediger, 2003) can interfere with memory for cor-
rect spellings and facts, respectively. In addition, expo-
sure to incorrect information can have a negative effect
on subsequent test performance, even when the exposure
occurs after the initial test (see, e.g., Brown, Schilling,
& Hockensmith, 1999). However, the most detrimental
effect of multiple-choice testing probably occurs when
students endorse a lure, believing it to be the correct re-
sponse. After selecting a lure on an initial multiple-choice
test, students tend to produce that lure when prompted
with the same question on a subsequent cued recall test
(Butler et al., 2006; Roediger & Marsh, 2005). Moreover,
determined number of responses, generally resulting in
the production of a large amount of incorrect informa-
tion (see, e.g., Roediger & Payne, 1985). In contrast, free
report instructions allow subjects to volunteer or with-
hold responses, which often leads to enhanced memory
accuracy in comparison with forced report (e.g., Koriat &
Goldsmith, 1994, 1996). Thus, a manipulation of report
option will help to gauge what people really know and
what they will report on a test.
The Present Research
The present research examined the effects of three
variables on learning from a multiple-choice test in hopes
of finding situations that maximize positive effects of
multiple-choice testing while minimizing the negative ef-
fects. Students were randomly assigned to one of three
initial study conditions: no exposure to the material (no
study), a brief reading of the material (study), or a brief
reading of the material combined with a rereading of the
key sentences (restudy). The no-study and study condi-
tions were similar to those employed in previous research
on this topic (e.g., Roediger & Marsh, 2005). The re-
study condition was designed to boost performance on
the multiple-choice test above that of the study condition.
The rereading of the key sentences was intended to be
analogous to students reading through their notes or re-
turning to the parts of the passage that they highlighted.
Next, all subjects took a multiple-choice test with equal
numbers of two-, four-, and six-alternative questions. For
each response on the multiple-choice test, they received
no feedback, immediate feedback, or delayed feedback.
An additional subset of items was never tested to serve
as a baseline for comparison with the testing conditions.
Finally, after a 1-week delay, subjects returned for a
comprehensive cued recall test. This final test used Kor-
iat and Goldsmith’s (1996) procedure in which a forced
report phase that required guessing was followed by a
free report phase in which the responses were judged for
correctness.
On the basis of the testing-effect literature reviewed
previously, we predicted an overall benefit in performance
on the final cued recall test for items tested on the initial
multiple-choice test relative to items not initially tested.
However, we expected the magnitude of this testing ef-
fect to be determined by the amount of prior study and
the number of multiple-choice alternatives. More spe-
cifically, a greater amount of prior study should lead to
better performance on the initial multiple-choice test, re-
sulting in a higher proportion of correct responses and a
lower proportion of intrusions (lure responses from the
initial multiple-choice test) on the final cued recall test.
Similarly, fewer multiple-choice alternatives should lead
to better performance on the initial multiple-choice test,
resulting in a higher proportion of correct responses and
lower proportion of intrusions on the final cued recall test.
Thus, performance on the initial multiple-choice test was
expected to play a large role in determining performance
on the subsequent cued recall test when feedback was not
provided. We expected feedback to have positive effects
on both correct and incorrect responses (perhaps elimi-
Pashler, Cepeda, Wixted, & Rohrer, 2005; for a meta-
analysis see Bangert-Drowns, Kulik, Kulik, & Morgan,
1991). The feedback message may also include other in-
formation, such as a re-presentation of the question and/
or the student’s prior response, both of which help to re-
establish the original context and permit the student to
fully process the feedback. Such contextual reinstatement
is especially critical when feedback is given after a delay.
Another important consideration is when to deliver the
feedback to the student. In contrast with the general con-
sensus about the types of feedback that work best, there is
substantial disagreement about the optimal timing of feed-
back (for a review, see Kulik & Kulik, 1988). Motivated
by behavioral theories of reinforcement, some research-
ers have argued that feedback should be given as soon
as possible after an error in order to eliminate incorrect
responses (see, e.g., Skinner, 1954), a position supported
by numerous studies that have conceptualized feedback as
reinforcement (e.g., Angell, 1949; Bourne, 1957; Paige,
1966; Sullivan, Schutz, & Baker, 1971). However, others
have contended that feedback is functionally different from
reinforcement and that delayed feedback is more effective
because it gives errors a chance to dissipate, making the
process of learning the correct response easier (e.g., Kul-
havy, 1977; Kulhavy & Anderson, 1972; Kulhavy & Stock,
1989). Indeed, many studies have found delayed feedback
to be more beneficial to the learning and retention of infor-
mation than immediate feedback (e.g., Brackbill, Bravos,
& Starr, 1962; Butler et al., 2007; Sturges, 1969; Surber &
Anderson, 1975). For the most part, these disparate results
have yet to be reconciled, feeding the debate about the op-
timal timing of feedback.
Report Option: Responding
and Belief in Correctness
In most testing situations in the classroom, students are
not penalized for guessing on multiple-choice tests; rather,
the instructor simply calculates the proportion of answers
that are correct. A final interest in the present research
was trying to assess both student responding and student
belief in what answers were actually correct. If students
acquire information from a multiple-choice test (whether
correct or incorrect), do they believe that the information
is correct? When students retrieve information on a test,
they can assess various aspects of knowledge to deter-
mine its accuracy. In most metamemory situations, such
monitoring processes lead to relatively accurate memory
reports, because people control whether or not to report
the information retrieved. However, educational testing is
one area in which forced report dominates. On most class-
room tests, the potential for full or partial credit exists, and
there is often no penalty for guessing. As a result, students
are encouraged to answer every question regardless of the
perceived accuracy of the candidate response, making it
hard to ascertain whether or not they actually believe in
the correctness of any given response.
An interesting way to assess students’ belief in the cor-
rectness of their knowledge is to manipulate the report
option on the final test. Forced report instructions require
subjects to respond to every question or to produce a pre-
Design
The experiment used a 3 (amount of prior study: no study, study,
restudy) × 3 (number of multiple-choice alternatives: two, four,
six) × 3 (feedback condition: no feedback, immediate feedback,
delayed feedback) × 2 (report option: forced report and free report)
mixed design. In addition, the experiment included a control condi-
tion in which no multiple-choice test was given on some material
(no test). This condition could not be crossed with the number of
multiple-choice alternatives factor and is therefore not fully incor-
porated into the main design (see the counterbalancing section). The
number of multiple-choice alternatives and the feedback-condition
variables were manipulated within subjects and between materi-
als. The amount-of-prior-study variable was manipulated between
subjects. Report option was manipulated within subjects during the
cued recall test on all items.
Materials and Counterbalancing
Stimuli consisted of a set of 12 prose passages covering a variety
of historical topics (e.g., the Khmer Rouge). The passages were de-
veloped using information obtained from two online encyclopedias
(www.encyclopedia.com and www.en.wikipedia.org). Each passage
contained approximately 400 words arranged into four paragraphs.
Four facts were identified in each passage, with each fact corre-
sponding to one of the four paragraphs. The “key sentences” that
subjects in the restudy condition were given to reread consisted of
the sentences from the passage that contained these facts. A ques-
tion was designed to test each fact from the passage using a fill-in-
the-blank format. The correct response to each question (henceforth
referred to as the target) consisted of a short phrase between one
and three words in length. For example, Many of the leaders of the
Khmer Rouge were educated in _____ (target: France). For the pur-
poses of the multiple-choice test, five plausible lures were developed
for each question for the six-alternative condition (the correct an-
swer plus five lures). Two lures were randomly removed to create the
four-alternative condition, and four lures were randomly removed to
create the two-alternative condition. No lure or target appeared as a
potential answer to another question.
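The item-construction rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_alternatives`, the placeholder lure strings, and the fixed random seed are hypothetical and were not part of the original materials. The sketch also shuffles the options, whereas in the experiment the position of the target was counterbalanced rather than randomized.

```python
import random

def build_alternatives(target, lures, n_alternatives, rng):
    """Return n_alternatives options: the target plus a random subset of lures.

    Assumes `lures` holds the five lures written for the six-alternative
    version of a question; the smaller versions drop lures at random.
    """
    if n_alternatives not in (2, 4, 6):
        raise ValueError("n_alternatives must be 2, 4, or 6")
    # Keeping k - 1 lures is the same as removing 0, 2, or 4 of the five.
    kept = rng.sample(lures, n_alternatives - 1)
    options = kept + [target]
    rng.shuffle(options)  # assumption: the study counterbalanced position instead
    return options

rng = random.Random(0)  # seed chosen only to make the sketch reproducible
lures = ["Lure A", "Lure B", "Lure C", "Lure D", "Lure E"]  # placeholders
for k in (2, 4, 6):
    print(k, build_alternatives("France", lures, k, rng))
```

Each call returns exactly one copy of the target and no repeated lures, which mirrors the constraint that no lure or target served as an option for another question only if the lure pools themselves are kept disjoint across questions.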
The experimental materials were counterbalanced in several ways.
First, across subjects, each passage was used in each condition an
equal number of times. In order to accomplish this, the materials were
divided into four sets of three passages. The four sets were then ro-
tated through the three feedback conditions (no feedback, immediate
feedback, delayed feedback) and the no-test control condition to cre-
ate four versions of the multiple-choice test. Then, within each ver-
sion, the three passages in each test condition were rotated through
the number of multiple-choice alternative conditions (two, four, six)
to create a total of 12 versions of the multiple-choice test. Second,
for each of the 12 versions of the multiple-choice test, the target ap-
peared equally in each possible position in comparison with the lures
across the items within each multiple-choice alternatives condition.
For example, in the four-alternative condition, the target appeared
three times in the first, second, third, and fourth positions.
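One way to picture the rotation scheme just described is the following Python sketch, which crosses four rotations of the passage sets through the conditions with three rotations of the number of alternatives. The names `make_versions`, `CONDITIONS`, and the passage labels P1 to P12 are hypothetical, and the particular rotation order is an assumption; the sketch only shows that such a scheme yields 12 versions with balanced assignment.

```python
from itertools import product

CONDITIONS = ["no feedback", "immediate feedback", "delayed feedback", "no test"]
ALTERNATIVES = [2, 4, 6]

def make_versions(passage_sets):
    """Build 12 test versions from four sets of three passages each."""
    versions = []
    for set_shift, alt_shift in product(range(4), range(3)):
        version = {}
        for set_index, passages in enumerate(passage_sets):
            # Rotate each passage set through the four conditions.
            condition = CONDITIONS[(set_index + set_shift) % 4]
            for position, passage in enumerate(passages):
                if condition == "no test":
                    n_alt = None  # untested passages have no alternatives
                else:
                    # Rotate the passages in a set through 2, 4, and 6 alternatives.
                    n_alt = ALTERNATIVES[(position + alt_shift) % 3]
                version[passage] = (condition, n_alt)
        versions.append(version)
    return versions

# Hypothetical labels: set 1 = P1-P3, set 2 = P4-P6, and so on.
passage_sets = [[f"P{3 * s + i + 1}" for i in range(3)] for s in range(4)]
versions = make_versions(passage_sets)
print(len(versions))
```

Under this scheme every passage appears once in each combination of feedback condition and number of alternatives, and three times in the no-test condition, across the 12 versions.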
Procedure
The entire experiment was conducted on PCs using E-Prime soft-
ware (Schneider, Eschman, & Zuccolotto, 2002) and involved two
sessions that were spaced 1 week apart.
Session 1. Subjects began with a study phase in which treatment
was given according to a randomly assigned condition: no study,
study, or restudy. Those in the no-study condition did not read
the passages and skipped ahead to the filler task (playing a Pac-
Man video game for 5 min) that separated the study phase from
the multiple-choice test. Each subject in the study condition read
through all 12 passages in a different order, which was randomly
determined by the computer at the start of the session. The passages
were presented one half at a time (approximately 200 words), with
each half appearing on the screen for 30 sec. Subjects in the restudy
condition read through the full set of passages in the same manner
nating the negative effects of multiple-choice testing). We
anticipated that feedback would allow students to correct
their errors, leading to a reduction of the proportion of
intrusions produced on the cued recall test, and that feed-
back would help to maintain correct responses made on
the initial multiple-choice test.
A primary purpose of the present experiment was to
explore any novel interactions among the three variables
of interest. Although each of the variables included has
been investigated in previous research, no study has ma-
nipulated all three within a single experiment. As was
described previously, the amount of prior study and the
number of multiple-choice alternatives variables were ex-
pected to have separate and additive effects on cued recall
test performance in the absence of feedback. However, it
was less clear whether (and how) the pattern of cued recall
test performance produced by these two variables would
be altered when feedback was provided. There were at
least two potential hypotheses about how feedback would
interact with the amount of prior study and the number of
multiple-choice alternatives variables. One possible out-
come was that the provision of feedback would increase
the proportion of correct responses and reduce the propor-
tion of intrusions, but would leave the overall pattern of
effects observed in the no-feedback condition intact (e.g.,
a smaller increase in production of lures on the final test
as a function of number of alternatives on the prior test).
Another possible outcome would be for feedback to com-
pletely eliminate the effects of prior study and the number
of multiple-choice alternatives on final cued recall, bring-
ing performance in all conditions up to the same level.
Similarly, several hypotheses could be generated about
the optimal timing of feedback. However, we predicted
that delayed feedback would lead to superior performance
in comparison with immediate feedback because of the
added benefits of allowing the incorrect response to dissi-
pate (see, e.g., Kulhavy & Anderson, 1972) and providing
a spaced presentation of the material (see Dempster, 1989)
in the case of delayed feedback.
We also manipulated report option on the final cued
recall test. We expected that free report would reduce
the overall proportion of intrusions in comparison with
forced report. However, students’ ultimate success at re-
stricting their report to correct responses hinges upon
their ability to differentiate between correct and incor-
rect responses. If students cannot effectively make such
a distinction, then free report may result in the reduction
of both correct and incorrect responses in comparison
with forced report. Thus, manipulating report option per-
mits us to examine students’ metamemorial knowledge
of their responding.
METHOD
Subjects
Seventy-two undergraduate psychology students at Washington
University in St. Louis participated for course credit or pay ($20).
They were treated in accordance with the “Ethical Principles of Psy-
chologists and Code of Conduct” (American Psychological Associa-
tion, 2002).
Initial Multiple-Choice Test
Table 1 displays the proportion of correct responses on
the initial multiple-choice test as a function of the number
of multiple-choice alternatives and the amount of prior
study (the data are collapsed across feedback condition
because the manipulation had not yet been introduced).
The data were analyzed by way of a 3 (amount of prior
study) × 3 (number of multiple-choice alternatives)
repeated measures ANOVA. This analysis revealed two
significant main effects, both of which were expected. First,
there was a main effect of prior study [F(2, 69) = 66.25,
MSe = 0.035, $\eta_p^2$ = .66] in which the restudy condition
produced a higher proportion of correct responses than
did the study condition [t(46) = 13.68, SED = 0.026,
d = 1.29, $p_{rep}$ = 1.00 ($p_{rep}$ is an estimate of the
probability of replicating the direction of an effect; see Killeen,
2005)], which in turn was higher than the no-study
condition [t(46) = 5.99, SED = 0.031, d = 1.20, $p_{rep}$ = 1.00].
Second, there was a main effect of the number of
multiple-choice alternatives in which the proportion of correct
responses decreased as the number of alternatives increased.
Both the linear [F(1, 69) = 73.76, MSe = 0.022, $\eta_p^2$ = .52]
and quadratic [F(1, 69) = 7.05, MSe = 0.018, $\eta_p^2$ = .09]
effects were significant. No other effects approached
significance. Our interest centered on how multiple-choice
performance would affect the cued recall test that was
given a week later.
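As a consistency check on the effect sizes reported above, partial eta squared can be recovered from an F ratio and its degrees of freedom with the standard identity F × df_effect / (F × df_effect + df_error). The sketch below applies it to the three F tests in this section; the function name `partial_eta_squared` is an illustrative choice, and the check is an addition for the reader, not part of the original analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# (label, F, df_effect, df_error, reported partial eta squared)
reported = [
    ("prior study", 66.25, 2, 69, 0.66),
    ("alternatives, linear", 73.76, 1, 69, 0.52),
    ("alternatives, quadratic", 7.05, 1, 69, 0.09),
]

for label, f_value, df_effect, df_error, eta in reported:
    recomputed = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{label}: reported {eta:.2f}, recomputed {recomputed:.2f}")
```

The recomputed values agree with the reported ones to two decimal places.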
Final Cued Recall Test: Forced Report
The data from the forced report phase of the cued re-
call test were analyzed first. There were three potential
outcomes for each item in this phase: a correct response
(correct), the production of a lure from the prior multiple-
choice test (intrusion), or an incorrect response that had
not appeared previously as a lure (incorrect other).1
Correct responses. The upper portion of Table 2
shows the proportion of correct responses as a function of
the amount of prior study and feedback condition, with the
no-test condition included for comparison purposes. The
data are collapsed across the number of multiple-choice
alternatives because this variable did not interact with any
other variable of interest, as reported below (see the
Appendix for the full data). The data were analyzed with a 3
(amount of prior study) × 3 (number of multiple-choice
alternatives) × 3 (feedback condition) repeated measures
ANOVA. First, a main effect of feedback condition was
observed [F(2, 138) = 66.79, MSe = 0.051, $\eta_p^2$ = .49] in
which delayed feedback led to a higher proportion of cor-
as did those in the study condition. However, after reading the pas-
sages, they engaged in the filler task for 2 min and then reread the
key sentences that contained information on the tests from each pas-
sage. The key sentences were grouped by passage, and each group of
sentences was displayed for 30 sec. Again, the computer randomly
determined a different presentation order for each subject.
Prior to taking the multiple-choice test, all subjects engaged in a
filler task (Pac-Man) for 5 min. After the filler task, they received
instructions about the multiple-choice test (those in the no-study
condition were told that it was a general knowledge test). The test
was self-paced and consisted of 42 questions. The first 6 questions
were always filler items in order to ensure that subjects did not have
information from the last passage in working memory. Subjects in
the study and restudy conditions were told that these items were
practice questions. The remaining 36 questions corresponded to the
passages in the three feedback conditions (no feedback, immediate
feedback, delayed feedback). With the exception of the filler
questions, which were always presented first, the computer randomized
the presentation of the questions so that each subject received a dif-
ferent order. Each question was presented at the top of the screen
with the alternatives listed below. The position of the target in com-
parison with the lures was counterbalanced as described previously.
Subjects were instructed to press the button corresponding to the
position of the correct answer (e.g., press 1 for the alternative in
position 1). The position number preceded each alternative to fa-
cilitate responding. Feedback was presented for 10 sec either im-
mediately after the response (immediate feedback) or at the end of
the test (delayed feedback). In order to equate the amount of time
spent on each question, a message (“Please wait for the next ques-
tion to load”) was displayed for 10 sec after items in the no-feedback
condition. Feedback consisted of an indication of the accuracy of the
response (correct–incorrect), a re-presentation of the question, the
response selected, and the correct response. After students finished
the multiple-choice test, they were reminded of the second session
and dismissed.
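The item counts for the initial test follow from the materials: 12 passages with four facts each, split into four sets of three passages, one set per feedback condition plus one untested set. The short sketch below works through that arithmetic; the constant names are illustrative only.

```python
N_PASSAGES = 12
FACTS_PER_PASSAGE = 4
PASSAGES_PER_SET = 3
FEEDBACK_CONDITIONS = 3  # no feedback, immediate feedback, delayed feedback
FILLER_ITEMS = 6

total_questions = N_PASSAGES * FACTS_PER_PASSAGE
tested = FEEDBACK_CONDITIONS * PASSAGES_PER_SET * FACTS_PER_PASSAGE
untested = total_questions - tested           # items from the no-test passages
initial_test_length = FILLER_ITEMS + tested   # length of the multiple-choice test

print(total_questions, tested, untested, initial_test_length)  # 48 36 12 42
```

The result matches the 42-question test described above: 6 filler items followed by 36 critical items, with 12 further items held back as the untested baseline.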
Session 2. One week after the first session, subjects returned to
take a final, comprehensive cued recall test that incorporated a pro-
cedure adopted from Koriat and Goldsmith (1994, 1996) in which
a forced report phase is followed by a free report phase. The ques-
tions on the final cued recall test were exactly the same as those on
the initial multiple-choice test (with the addition of the subset of
untested items). Similar to the multiple-choice test, each fill-in-the-
blank question on the cued recall test was presented at the top of the
screen (e.g., Many of the leaders of the Khmer Rouge were educated
in _____). However, instead of choosing from a list of alternatives,
subjects had to produce the response from memory and type it in
by using the keyboard. The test was self-paced and consisted of two
phases. In the forced report phase, subjects were given instructions
to provide a response to every question, even if they had to guess.
After each response, they were asked to rate their confidence in the
response on a scale of 0 to 100. After answering all questions, stu-
dents proceeded to the free report stage, in which they were given
the opportunity to go back through their responses from the forced
report phase and to decide whether to keep or omit each response.
They were shown the question and their response, but not their con-
fidence estimate. The stated goal was to keep as many correct and
omit as many incorrect answers as possible. After the second session
was complete, subjects were debriefed and dismissed.
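As an illustration only, the two-phase scoring described above can be sketched in a few lines of Python. The `Response` record, the outcome labels, and the example responses are hypothetical and are not taken from the experiment's software or data. The sketch also assumes that free report proportions are computed over all questions rather than over kept responses only, which is consistent with the free report values in Tables 2 and 3 being lower than the forced report values.

```python
from dataclasses import dataclass


@dataclass
class Response:
    outcome: str     # "correct", "intrusion" (a lure from the initial test), or "other"
    confidence: int  # 0-100 rating given in the forced report phase
    kept: bool       # free report decision: True = keep, False = omit


def proportion(responses, outcome, free_report=False):
    """Proportion of all questions answered with `outcome`.

    Under free report only kept responses count, but the denominator stays
    the total number of questions (an assumption of this sketch).
    """
    hits = [r for r in responses
            if r.outcome == outcome and (r.kept or not free_report)]
    return len(hits) / len(responses)


# Made-up responses for one hypothetical subject (not data from the study).
responses = [
    Response("correct", 90, True),
    Response("correct", 30, False),    # a correct answer withheld in free report
    Response("intrusion", 60, True),
    Response("intrusion", 20, False),  # an intrusion successfully omitted
    Response("other", 10, False),
]

forced_correct = proportion(responses, "correct")
free_correct = proportion(responses, "correct", free_report=True)
forced_intrusions = proportion(responses, "intrusion")
free_intrusions = proportion(responses, "intrusion", free_report=True)
print(forced_correct, free_correct, forced_intrusions, free_intrusions)
```

In this toy case, withholding lowers both the proportion correct and the proportion of intrusions, which is the trade-off the free report phase is designed to expose.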
RESULTS
All results deemed significant were reliable at the .05
level of confidence unless otherwise noted. Pairwise com-
parisons were Bonferroni corrected to the .05 level. In the
analysis of repeated measures, a Greenhouse–Geisser cor-
rection was used for violations of the sphericity assump-
tion (Geisser & Greenhouse, 1958).
Table 1
Proportion Correct on the Initial Multiple-Choice Test As a
Function of the Number of Multiple-Choice Alternatives
and the Amount-of-Prior-Study Condition

                     Number of Alternatives
Study Condition      Two     Four    Six     M
No study             .56     .34     .28     .39
Study                .69     .56     .47     .57
Restudy              .84     .72     .70     .75
M                    .69     .54     .48     .57
Intrusions. The upper portion of Table 3 displays the
proportion of intrusions made on the cued recall test as
a function of the amount-of-prior-study and feedback
conditions, with the no-test condition included for com-
parison purposes (the data are again collapsed across
the number of multiple-choice alternatives because this
variable did not interact with any other variable of in-
terest, as reported below; see the Appendix). The data
were again analyzed by a 3 (amount of prior study) × 3
(number of multiple-choice alternatives) × 3 (feedback
condition) repeated measures ANOVA. Several significant
effects emerged. First, there was a main effect of
feedback condition [F(2,138) = 22.62, MSe = 0.031,
ηp² = .25] in which no feedback produced a significantly
higher proportion of intrusions than did the immediate
feedback [t(71) = 5.90, SEM = 0.018, d = .65, p_rep =
.99] and delayed feedback [t(71) = 4.65, SEM = 0.019,
d = .81, p_rep = 1.00]. Second, a main effect of prior study
was observed [F(2,69) = 3.07, MSe = 0.034, ηp² = .08]
in which the no-study condition led to the production of
more lures than did the restudy condition [t(46) = 2.45,
SEM = 0.016, d = .35, p_rep = .95]. Third, there was a
linear trend in the number of multiple-choice alternatives
[F(1,69) = 5.30, MSe = 0.039, ηp² = .07] in which a
greater number of alternatives led to a higher proportion
of intrusions, as shown in the Appendix. Finally, there
was an interaction between prior study and feedback
condition [F(4,138) = 5.88, MSe = 0.031, ηp² = .15].
Greater amount of prior study decreased the proportion
of intrusions in the no-feedback condition (as in Roe-
diger & Marsh, 2005), but the amount of prior study
was neutralized by feedback, which reduced the num-
ber of intrusions. No other effects reached significance.
As before, the main analysis did not include the no-test
condition, but an additional t test confirmed that the
no-feedback condition produced a higher proportion of
intrusions than did the no-test control condition in which
no lures had been shown [t(71) = 4.49, SEM = 0.019,
d = .65, p_rep = .99]. Feedback on the multiple-choice
test reduced lure intrusions to this baseline level.
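The partial eta squared values reported for the intrusion analysis can be recovered from the F ratios and their degrees of freedom with the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The Python sketch below is an illustrative consistency check, not part of the original analysis; the function name and the effect labels are ours.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Standard identity relating partial eta squared to an F ratio."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (effect label, F, df_effect, df_error, reported partial eta squared)
reported = [
    ("feedback condition", 22.62, 2, 138, 0.25),
    ("prior study", 3.07, 2, 69, 0.08),
    ("linear trend in number of alternatives", 5.30, 1, 69, 0.07),
    ("prior study x feedback condition", 5.88, 4, 138, 0.15),
]

for label, f_value, df_effect, df_error, eta in reported:
    computed = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{label}: computed {computed:.2f}, reported {eta:.2f}")
```

Each computed value rounds to the reported effect size.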
rect responses than did immediate feedback [t(71) = 4.79,
SEM = 0.022, d = .56, p_rep = 1.00] and immediate feedback
was higher than no feedback [t(71) = 6.40, SEM =
0.022, d = .69, p_rep = 1.00]. Second, there was a marginally
significant main effect of prior study [F(2,69) = 2.92,
MSe = 0.197, p = .06]. Pairwise comparisons showed
only one significant difference: The restudy condition
led to a higher proportion of correct responses than did
the no-study condition [t(46) = 3.02, SEM = 0.035, d =
.75, p_rep = .98]. Finally, there was also a significant
interaction between the feedback condition and prior study
[F(4,138) = 5.77, MSe = 0.051, ηp² = .14]. The proportion
of correct responses increased substantially as the
amount of prior study increased (i.e., going from no study
to study to restudy) in the no-feedback condition, but not
in the delayed-feedback condition. In fact, in the delayed-
feedback condition, it did not matter whether students
had previously studied the material at all; performance
was roughly the same in all three of the study conditions.
Finally, the proportion of correct responses differed only
slightly as a function of the number of multiple-choice
alternatives (see the Appendix). There was a numerical
trend in the no-feedback condition in which the proportion
of correct responses decreased as the number of multiple-
choice alternatives increased. However, the linear trend
did not reach significance when tested with an ANOVA
conducted on the data from the no-feedback condition alone
[F(1,69) = 2.60, MSe = 0.048, p = .11]. Nevertheless,
the numerical trend was in the right direction, and there
may have been insufficient power to detect the effect
(observed power = .36). Other research making this comparison
has found effects of roughly the same size as that seen
in the present study (e.g., Roediger & Marsh, 2005). No
other effects approached significance. Although the main
analyses did not include the no-test control condition, an
additional t test revealed that the no-feedback condition
(which received a test) produced a significantly greater
proportion of correct responses than did the no-test condition
[t(71) = 9.11, SEM = 0.018, d = .97, p_rep = 1.00],
showing the basic testing effect.
Table 2
Proportion Correct on the Final Cued Recall Test
As a Function of Amount-of-Prior-Study and Feedback
Conditions (Including the No-Test Condition)

                                 Feedback Condition
Study              No        No          Immediate   Delayed
Condition          Test      Feedback    Feedback    Feedback    M
Forced Report
  No study         .10       .18         .42         .57         .32
  Study            .11       .33         .43         .54         .35
  Restudy          .22       .41         .50         .57         .43
  M                .14       .31         .45         .56         .37
Free Report
  No study         .04       .12         .39         .52         .27
  Study            .06       .28         .38         .50         .31
  Restudy          .16       .35         .46         .52         .37
  M                .09       .25         .41         .51         .32

Note. Data have been collapsed across the number of initial
multiple-choice alternatives.
Table 3
Proportion Intrusions on the Final Cued Recall Test
As a Function of Amount-of-Prior-Study and Feedback
Conditions (Including the No-Test Condition)

                                 Feedback Condition
Study              No        No          Immediate   Delayed
Condition          Test      Feedback    Feedback    Feedback    M
Forced Report
  No study         .17       .32         .14         .13         .19
  Study            .16       .25         .18         .13         .18
  Restudy          .15       .16         .14         .15         .15
  M                .16       .24         .15         .14         .17
Free Report
  No study         .05       .19         .07         .07         .10
  Study            .06       .15         .08         .07         .09
  Restudy          .05       .08         .09         .07         .07
  M                .05       .14         .08         .07         .09

Note. Data have been collapsed across the number of initial
multiple-choice alternatives.
confidence in each response on a scale of 0–100, in which
“0” indicated no confidence (i.e., a pure guess) and “100”
indicated that the response was definitely correct. Of interest
was whether any of the manipulated variables would
influence the subjects’ ability to assess the accuracy of
their knowledge. Feedback on the multiple-choice test in-
creased the subjects’ ability to assess the accuracy of their
responses on the cued recall test, as indicated by the abso-
lute correspondence between proportion correct and con-
fidence estimate. The overall mean proportions correct for
the delayed-feedback (M = .56) and immediate-feedback
(M = .49) conditions were almost identical to the mean
confidence estimate (means of 55 and 46, respectively).
In contrast, the overall mean proportion correct in the
no-test (M = .14) and no-feedback (M = .31) conditions was
lower than the mean confidence estimate (means of 23
and 39, respectively), indicating overconfidence. The re-
lationship between performance and confidence did not
differ as a function of any of the other variables.
Mean confidence estimates were also computed for the
intrusions produced on the forced cued recall test. Numerically,
there was little difference between the confidence estimates
in the no-feedback (M = 42), immediate-feedback
(M = 38), and delayed-feedback (M = 38) conditions.
However, subjects assigned greater confidence to intrusions
produced in these three conditions (overall M = 39)
than to those in the no-test condition (M = 26) [t(71) =
4.11, SEM = 2.98, d = .56, p_rep = .99]. Thus, prior exposure
to lure items on the multiple-choice test seemed to
increase confidence in the intrusion responses.
Conditional Analyses⁴
Conditional analyses were conducted to investigate
the relationship between performance on the initial
multiple-choice test and on the final cued recall test. Of
interest was how response outcome on the multiple-choice
test (correct–incorrect) influenced the production of correct
responses on the final cued recall test as a function
of the amount-of-prior-study and testing conditions. One
question was whether the overall pattern of results obtained
in the main analyses (e.g., the superiority of delayed feed-
back) would hold for both items that were initially correct
and those that were incorrect. For the purposes of these
conditional analyses, the data were again collapsed across
the number of multiple-choice alternatives.
Table 4 displays the proportion of correct responses
on the cued recall test for items that were correctly and
incorrectly answered on the initial multiple-choice test
as a function of prior-study and feedback conditions. For
items that were answered correctly on the multiple-choice
test, delayed feedback led to a higher proportion of cor-
rect responses on the cued recall test than did immedi-
ate feedback, which in turn produced a higher proportion
than did no feedback. For the most part, increases in the
amount of prior study had little effect on the maintenance
of correct responses from multiple choice to final cued
recall. The only exception was the no-study–no-feedback
condition, which produced a much lower proportion of
correct responses than did the other prior-study conditions
that did not receive feedback. For items that were initially
Final Cued Recall Test: Free Report²
In the free report phase, subjects had the option of keep-
ing or omitting each response they made during the forced
report phase. The subsequent analysis focuses on the re-
sponses that subjects chose to keep in order to examine
the extent to which students believed that the information
learned on the multiple-choice test was true. The free report
phase follows the forced report phase and could be affected
by the earlier phase, so the data and analysis presented
below should be interpreted with this influence in mind.
Correct responses. The lower portion of Table 2 displays
the proportion of correct responses kept during the
free report phase as a function of prior study and feedback
condition (including the no-test condition for comparison).³
A 3 (amount of prior study) × 3 (feedback condition)
repeated measures ANOVA was conducted, and it revealed
the same effects as did the forced report analysis: a main
effect of feedback condition [F(2,138) = 77.634, MSe =
0.017, ηp² = .53], a marginally significant main effect of
prior study [F(2,69) = 2.762, MSe = 0.065, p = .07], and
an interaction between the prior study and feedback conditions
[F(4,138) = 6.053, MSe = 0.0167, ηp² = .15]. This
interaction appears to be driven by the no-feedback condi-
tion in the same manner as that in the forced report data.
The only difference between the forced and free results is
that the proportion correct in each cell has decreased by
between 3% and 6% in free report due to response with-
holding. When report option (forced, free) was entered into
the analysis as a within-subjects variable, a main effect of
report option did emerge [F(1,69) = 82.67, MSe = 0.004,
ηp² = .55], indicating that free report instructions led to a
reduction in the proportion of correct responses. Report
option did not interact with any of the other factors.
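The size of the withholding cost can be checked directly against Table 2. The Python sketch below is an illustrative check with variable names of our choosing. It subtracts each free report cell from the corresponding forced report cell and expresses the drop in percentage points.

```python
# Cell means from Table 2, in column order:
# no test, no feedback, immediate feedback, delayed feedback.
forced = {
    "No study": [0.10, 0.18, 0.42, 0.57],
    "Study":    [0.11, 0.33, 0.43, 0.54],
    "Restudy":  [0.22, 0.41, 0.50, 0.57],
}
free = {
    "No study": [0.04, 0.12, 0.39, 0.52],
    "Study":    [0.06, 0.28, 0.38, 0.50],
    "Restudy":  [0.16, 0.35, 0.46, 0.52],
}

# Drop in proportion correct from forced to free report, in percentage points.
drops = {
    row: [round(100 * (a - b)) for a, b in zip(forced[row], free[row])]
    for row in forced
}
all_drops = [d for row in drops.values() for d in row]

for row, values in drops.items():
    print(row, values)
print("smallest drop:", min(all_drops), "largest drop:", max(all_drops))
```

Every cell falls by 3 to 6 percentage points, in line with the range stated above.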
Intrusions. The lower portion of Table 3 displays the
proportion of intrusions kept during the free report phase
as a function of the prior-study and feedback conditions
(including the no-test condition). A 3 (amount of prior
study) × 3 (feedback condition) repeated measures ANOVA
revealed a main effect of feedback condition [F(2,138) =
16.65, MSe = 0.007, ηp² = .19] and an interaction between
prior study and feedback condition [F(4,138) = 4.352,
MSe = 0.007, ηp² = .11]. This interaction probably represents
the differential effect of the amount of prior study on
the no-feedback condition. A greater amount of prior study
decreased the proportion of intrusions in the no-feedback
condition, but had no effect on any of the other feedback
conditions. Overall, students managed to reduce the pro-
portion of intrusions relative to the forced report phase,
and the magnitude of this reduction differed slightly across
the feedback conditions. When report option (forced, free)
was entered into the analysis, a main effect of report option
emerged [F(1,69) = 229.90, MSe = 0.005, ηp² = .77],
indicating that subjects were able to reduce the proportion of
intrusions in free report as opposed to forced report. Report
option did not interact with any of the other factors.
Final Cued Recall: Confidence
Performance on the final cued recall test was compared
with the confidence estimate given by subjects during the
forced report phase. The subjects were asked to rate their
feedback was not given, the pattern of performance on the
cued recall test was largely determined by performance on
the initial multiple-choice test. Greater amounts of prior
study led to a higher proportion of correct responses in
cued recall, but the number of multiple-choice alternatives
did not have a significant effect on correct responses. Less
prior study and a greater number of initial multiple-choice
alternatives resulted in a higher proportion of intrusions,
as in prior research (Butler et al., 2006; Roediger & Marsh,
2005). Third, the initial predictions regarding the effect of
feedback on performance were substantiated: Feedback
on the multiple-choice test increased the proportion of
correct responses on the final cued recall test, whereas
the proportion of intrusions was sharply reduced. Delayed
feedback led to a higher proportion of correct responses
than did immediate feedback, but both feedback schedules
were equally effective at reducing the amount of misinfor-
mation acquired. Finally, when given the option of free
report, subjects succeeded in reducing the proportion of
intrusions reported; however, they also eliminated many
correct responses. These results are discussed further in
the subsequent sections, focusing first on the learning and
retention of correct responses and then on the acquisition
of misinformation. After placing the results in the context
of other studies, we will conclude by discussing the impli-
cations of this research for educational practice.
The Learning and Retention
of Correct Responses
We first consider performance in the no-feedback con-
dition, which was conceptually the most similar to previ-
ous studies (Butler et al., 2006; Roediger & Marsh, 2005).
Just as in the initial multiple-choice test, the proportion
of correct responses on the cued recall test was influ-
enced by both the amount of prior study and the num-
ber of multiple-choice alternatives. A greater amount of
prior study led to a higher proportion of correct responses,
replicating Roediger and Marsh (2005), who included a
study–no study manipulation. The restudy condition in
our experiment extends their finding by showing that in-
creasing the amount of prior study (by selective restudy-
ing of facts) can enhance recall. Increasing the number of
multiple-choice alternatives led to a lower proportion of
correct responses on the cued recall test, but not signifi-
cantly so. Previous studies have found significant effects
in which a greater number of alternatives led to a lower
proportion of correct responses on a subsequent recall test
(e.g., Butler et al., 2006; Roediger & Marsh, 2005; but see
Whitten & Leonard, 1980). However, these effects tend to
be relatively small in size, presumably because as more
lures are included, the plausibility of each additional lure
decreases. Nevertheless, the numerical trend was in the
predicted direction in the present study, suggesting that
greater power may have been needed to detect this effect.
When compared with the no-test condition, the no-
feedback condition also shows the mnemonic benefit of
taking a prior multiple-choice test. Retrieval of the cor-
rect response on the multiple-choice test helped students to
learn and retain that response, which is no surprise because
the testing effect is generally quite robust (see Roediger
answered incorrectly, a similar overall pattern emerged
among the testing conditions: Delayed feedback produced
the highest proportion of correct responses, followed by
immediate feedback and no feedback. The amount of prior
study did not have clear overall effects on performance.
However, the magnitude of the differences between the
testing conditions appears to be attenuated as the amount
of prior study increases.
Thus, the superiority of delayed feedback in compari-
son with immediate feedback that emerged in the main
analyses held for both items that were initially correct and
those that were incorrect on the multiple-choice test. The
benefit of providing feedback (either delayed or immedi-
ate) as opposed to not providing feedback also held for
both sets of items. The impact of increasing the amount
of prior study was less clear, but this is probably because
prior study increases correct performance and decreases
errors on the initial test. Of course, conditional analy-
ses are always subject to item-selection artifacts, and
the results presented previously should be interpreted
with this caution in mind. Still, delayed feedback on the
multiple-choice test provided consistently better perfor-
mance on the final cued recall test whether or not the
multiple-choice item was correctly answered.
DISCUSSION
The present experiment investigated the predictions
that feedback on a multiple-choice test would enhance
the testing effect for items answered correctly and reduce
or eliminate negative effects on items answered incor-
rectly, as assessed on a cued recall test given a week later.
Our findings confirmed these hypotheses. We replicated
several previous findings within a single experiment and
found novel interactions between two of the three variables
being investigated. First, a testing effect emerged on the
final cued recall test: Students performed better on items
that were tested on the prior multiple-choice test than on
items not initially tested, regardless of the amount of prior
study or feedback. Second, when items were tested and
Table 4
Proportion of Correct Responses on the Final Cued Recall Test
for Items That Were Correctly and Incorrectly Answered on the
Initial Multiple-Choice (MC) Test As a Function of
Amount-of-Prior-Study Condition and Feedback Condition

                          Feedback Condition
Study              No          Immediate   Delayed
Condition          Feedback    Feedback    Feedback    M
MC–Correct
  No Study         .35         .53         .61         .50
  Study            .56         .52         .64         .57
  Restudy          .52         .57         .62         .57
  Mean             .50         .55         .63         .56
MC–Incorrect
  No Study         .08         .35         .53         .31
  Study            .07         .27         .41         .24
  Restudy          .12         .30         .35         .25
  Mean             .08         .31         .46         .28

Note. All data are for performance under forced report instructions.
Accordingly, delayed feedback is more effective because
it allows incorrect responses to dissipate, making the cor-
rect response easier to learn. A second theory posits that
delayed feedback leads to better subsequent recall because
it provides an additional spaced presentation of the mate-
rial. The superiority of spaced (or distributed) study in
comparison with massed study for enhancing the retention
of verbal material has been well established (see Cepeda,
Pashler, Vul, Wixted, & Rohrer, 2006, for a review). If a
correct response to a test question is considered a study
trial, then immediate feedback represents a massed study
trial, and delayed feedback represents a spaced study trial.
Thus, delayed feedback should be superior to immediate
feedback for initially correct responses, but equally effec-
tive for initially incorrect responses for which both feed-
back timings would represent a spaced study trial (see,
e.g., McConnell, Hunt, & Smith, 2006). However, delayed
feedback led to a higher proportion of correct responses
in the present study regardless of whether the initial re-
sponse was correct or incorrect. Importantly, these two
theories are not mutually exclusive, because the spaced
presentation and interference-perseveration theories focus
on the effect of feedback timing after correct and incor-
rect responses, respectively. Therefore, a combination of
the two theories provides a comprehensive account of the
present results.
As a final consideration, feedback also helped stu-
dents to better gauge the accuracy of their responses on
the final cued recall test. Students’ ability to differentiate
between correct and incorrect responses was explored
through the absolute correspondence between the confi-
dence estimates and the proportion of correct responses.
This analysis revealed that subjects were almost perfectly
calibrated for items in the delayed-feedback condition and
only slightly overconfident in the immediate-feedback
condition. However, students were highly overconfident
in the no-test and no-feedback conditions. Thus, testing
with feedback also helps students to judge better what
they know and what they do not know.
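The calibration pattern can be made concrete by subtracting the percentage correct from the mean confidence estimate in each condition. The Python sketch below is illustrative only. It uses the forced report marginal means from Table 2 and the mean confidence estimates reported in the Results, and the "bias" index (confidence minus percentage correct) is our own simple summary rather than a measure reported in the original analysis.

```python
# Forced report marginal means (proportion correct) from Table 2.
proportion_correct = {
    "no test": 0.14,
    "no feedback": 0.31,
    "immediate feedback": 0.45,
    "delayed feedback": 0.56,
}
# Mean confidence estimates (0-100 scale) reported in the Results.
mean_confidence = {
    "no test": 23,
    "no feedback": 39,
    "immediate feedback": 46,
    "delayed feedback": 55,
}

# Positive values indicate overconfidence; values near zero indicate
# good absolute calibration.
bias = {
    condition: mean_confidence[condition] - round(100 * proportion_correct[condition])
    for condition in proportion_correct
}

for condition, value in bias.items():
    print(f"{condition}: {value:+d} points")
```

On this index, the no-test and no-feedback conditions are overconfident by 8 to 9 points, whereas both feedback conditions fall within 1 point of perfect calibration.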
The Acquisition of Misinformation
Focusing first on the no-feedback condition, the pro-
portion of intrusions produced on the forced report phase
of the cued recall test was heavily influenced by both the
prior study and number of alternatives variables. Decreas-
ing the amount of prior study and/or increasing the num-
ber of alternatives led to a higher proportion of intrusions,
replicating previous research (Butler et al., 2006; Roediger
& Marsh, 2005). The overall pattern of intrusions closely
resembled the incorrect response data from the multiple-
choice test, suggesting that performance on the multiple-
choice test mediated the influence of these two variables
on the acquisition of misinformation. This conclusion is
bolstered by the fact that 75% of the intrusions produced in
the no-feedback condition were lures that had been (incor-
rectly) selected on the initial multiple-choice test (the other
25% were initially correct responses that were subsequently
switched to lures on the cued recall test; see Butler et al.,
in press). Interestingly, the amount of misinformation ac-
quired varied widely among the different conditions: Over
& Karpicke, 2006a). However, a more stringent way of
assessing the benefits of testing is to take into account any
negative effect that occurs as a result of multiple-choice
testing. When both correct responses and intrusions are
considered, there is usually still a net benefit of testing
(see, e.g., Roediger & Marsh, 2005). Thus, it is surprising
to find that in the no-study–no-feedback condition of the
present experiment, the proportion of intrusions produced
(M = .32) was substantially greater than the proportion of
correct responses (M = .18). Although the no-study–no-test
condition produced fewer correct responses (M = .10),
it also produced fewer (spontaneous) intrusions (M = .17).
In other words, if students had not studied the material, they
would have been worse off if they were tested than if not
tested. Although a net benefit of prior testing was obtained
in the study and restudy conditions, this negative effect
of testing in this one condition is particularly important,
because it indicates that there is a point at which multiple-
choice testing ceases to be beneficial to students.
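One way to make this comparison explicit is to subtract the proportion of intrusions from the proportion of correct responses in each cell. The Python sketch below does so for the no-test and no-feedback conditions, using the forced report values in Tables 2 and 3. Treating the simple difference as a "net" score is an assumption made here for illustration; it is not an index computed in the original analysis.

```python
# Forced report proportions from Table 2 (correct) and Table 3 (intrusions).
correct = {
    "No study": {"no test": 0.10, "no feedback": 0.18},
    "Study":    {"no test": 0.11, "no feedback": 0.33},
    "Restudy":  {"no test": 0.22, "no feedback": 0.41},
}
intrusions = {
    "No study": {"no test": 0.17, "no feedback": 0.32},
    "Study":    {"no test": 0.16, "no feedback": 0.25},
    "Restudy":  {"no test": 0.15, "no feedback": 0.16},
}

# Hypothetical net score: correct responses minus intrusions.
net = {
    study: {
        condition: round(correct[study][condition] - intrusions[study][condition], 2)
        for condition in correct[study]
    }
    for study in correct
}

for study, scores in net.items():
    change = round(scores["no feedback"] - scores["no test"], 2)
    print(f"{study}: no test {scores['no test']:+.2f}, "
          f"no feedback {scores['no feedback']:+.2f}, change {change:+.2f}")
```

Under this assumption, testing without feedback lowers the net score only in the no-study condition and raises it in the study and restudy conditions.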
The testing effect observed in comparing the no-test and
no-feedback conditions was enhanced by the provision of
feedback. Both immediate and delayed feedback led to
large gains in the proportion of correct responses on the
cued recall test in comparison with the no-feedback con-
dition. The added benefit of feedback is likely due to the
correction of errors (Butterfield & Metcalfe, 2001; Pashler
et al., 2005) and the maintenance of correct responses that
otherwise might have been forgotten or switched to an at-
tractive alternative (Butler et al., in press). Greater amounts
of prior study led to a small increase in the proportion of
correct responses in the immediate-feedback condition,
but had no effect on the delayed-feedback condition (although
the interaction was not statistically significant;
p = .11). Assuming the differential effect exists, a possible
explanation derives from the interference-perseveration
theory (Kulhavy, 1977; Kulhavy & Anderson, 1972; Kul-
havy & Stock, 1989). According to this theory, immediate
feedback produces response competition when the correct
response is presented immediately after an incorrect re-
sponse is made. With a greater amount of prior study, fewer
incorrect responses are made; therefore, the potential for
response competition to occur should decrease. In contrast,
a delay in the presentation of feedback may allow incorrect
responses to dissipate, making the correct response easier
to learn and negating the impact of any differences in the
number of incorrect responses made.
In accordance with this theory, the timing of feedback
also had a large influence on the learning and retention of
correct responses. Delayed feedback led to a higher proportion
of correct responses (overall M = .56) than did immediate
feedback (overall M = .45). The superiority of delayed
feedback in the present results can best be explained
by invoking two different—but compatible—theories.
First, as was just noted, the interference-perseveration the-
ory offers an explanation for why delayed feedback might
also benefit initially incorrect responses (Kulhavy, 1977;
Kulhavy & Anderson, 1972; Kulhavy & Stock, 1989). As
explained above, this theory revolves around the idea that
immediate feedback produces competition between the
incorrect response and the presented correct response.
and omit any responses they believed to be incorrect, they
succeeded in reducing the proportion of intrusions, more
than halving the number of lure items reported. However,
this reduction in the proportion of intrusions was roughly
equivalent across all the conditions, and the same overall
pattern of effects that was obtained under forced report re-
mained. That is, the no-feedback condition still produced
the highest proportion of intrusions in comparison with
the other conditions, and this proportion increased as the
amount of prior study decreased. Remarkably, even under
free report, subjects in the no-study–no-feedback condition
decided to keep a large proportion of intrusions (M =
.19). Overall, these results indicate that students strongly
believed in the veracity of misinformation acquired dur-
ing the multiple-choice test.
Implications for Educational Practice
The present experiment demonstrates that students
acquire both correct and incorrect information from
multiple-choice tests. Taking a multiple-choice test
leads to a substantial benefit in retention of correct responses,
but the exposure to misinformation in the form
of multiple-choice lure items can lead to the intrusion of
these lures on a subsequent test, especially when the lure is
(incorrectly) endorsed on the initial multiple-choice test.
The magnitude of these positive and negative effects is
greater with little prior study and with increasing numbers
of lures on the multiple-choice test. If the material has not
been sufficiently studied prior to taking a multiple-choice
test, or a test is made more difficult by increasing the
number of alternatives, students acquire a greater amount
of misinformation. Although these two factors have been
emphasized in the present study, any factor that negatively
affects performance on a multiple-choice test (e.g., test
anxiety, time restrictions, increasing the attractiveness of
lures, etc.) will probably have the same effect.
A pragmatic solution to the possible negative effects of
multiple-choice tests is to ensure that students always re-
ceive feedback after testing. Feedback enhances the posi-
tive effects of taking a test and helps students correct their
errors, thereby reducing the acquisition of misinformation.
The latter outcome is especially important when the same
questions and alternatives from a first test are reused on a
later test, because the production of misinformation often
increases the chance that it will be produced again on a
later test (Roediger, Jacoby, & McDermott, 1996; Roe-
diger, Wheeler, & Rajaram, 1993). One positive aspect of
the present results is that feedback need not be given im-
mediately; a delay in the presentation of feedback seems
to be beneficial to learning. Of course, in our conditions,
what we are calling delayed feedback is what many in-
structors who cannot use computerized testing would see
as immediate feedback; students in our delayed feedback
condition were actually given feedback soon after taking
the test (but not immediately after answering each item).
Further research will be needed to determine if feedback
may cease to be effective if it is delayed for too long. For
example, in many classroom settings, feedback on a test
is provided a week or two after the test is given, in order
to permit time for grading the test. Would feedback under
a third of the responses in the no-study–six-alternative
condition were intrusions (M = .38), whereas only a rather
small proportion in the restudy–two-alternative condition
were intrusions (M = .14). Note that the latter proportion
was no greater than the overall mean in the no-test
(M = .16) and feedback conditions (immediate feedback,
M = .15; delayed feedback, M = .14). Theoretically,
intrusions on the cued recall test likely result from lure
responses blocking previously learned correct responses (or
causing them to be unlearned), similar to the retroactive
interference created in the misinformation paradigm (Loftus,
Miller, & Burns, 1978) or in the classic A–B, A–D
interference paradigm (McGeoch, 1932). Alternatively, the
correct response may never have been learned, and people
may just have guessed and learned the wrong response as
a result. However, research suggests that errors resulting
from faulty reasoning are much more likely to persist than
are guesses (Huelser & Marsh, 2006).
When feedback was provided after the multiple-choice
test, a very different pattern of results emerged. First, the
overall amount of misinformation acquired was sharply
reduced. Second, the effects observed in the no-feedback
condition were neutralized: Neither the amount of prior
study nor the number of multiple-choice alternatives had
an influence on the proportion of intrusions produced.
Armed with knowledge about whether their response was
correct or incorrect (as well as the correct answer), sub-
jects were able to correct many of their errors and refrain
from producing the lures on the cued recall test. This re-
sult fits nicely with previous investigations that have dem-
onstrated the error-correcting function of feedback (e.g.,
Butterfield & Metcalfe, 2001; Pashler et al., 2005). Fur-
thermore, the timing of the feedback did not seem to mat-
ter: Immediate and delayed feedback were equally effec-
tive at reducing the amount of misinformation acquired. It
might be argued that multiple-choice tests were harmful
even with feedback, because a small but nontrivial propor-
tion of intrusions was still produced (overall M = .15).
However, we believe that this inference is erroneous; if
the proportion of intrusions spontaneously produced in
the no-test condition (M = .16) is used as a baseline, then
it is clear that taking a multiple-choice test with feedback
is no more harmful than not taking a multiple-choice test
at all.
Another goal of the experiment was to investigate the
extent to which students believe that the misinformation
acquired from the multiple-choice test is true. Students’
confidence estimates for intrusions indicated roughly the
same level of confidence regardless of whether or not they
had received feedback. This result indicates that feedback
works in an all-or-none manner: If students do not success-
fully correct the error, then feedback does not diminish the
potency of the misinformation. The only obvious differ-
ence in confidence estimates for intrusions was between
the no-test condition (M = 26) and the other three testing
conditions (overall M = 39), indicating that prior exposure
to lures led to greater confidence in intrusions in com-
parison with intrusions that were spontaneously produced
(presumably due to the familiarity of the lures). When stu-
dents were allowed to revisit their forced report responses
The negative suggestion effect: Pondering incorrect alternatives may
be hazardous to your knowledge. Journal of Educational Psychology,
91, 756-764.
Butler, A. C., Karpicke, J. D., & Roediger, H. L., III (2007). The ef-
fect of type and timing of feedback on learning from multiple-choice
tests. Journal of Experimental Psychology: Applied, 13, 273-281.
Butler, A. C., Karpicke, J. D., & Roediger, H. L., III (in press). Cor-
recting a metacognitive error: Feedback enhances retention of low
confidence correct responses. Journal of Experimental Psychology:
Learning, Memory, & Cognition.
Butler, A. C., Marsh, E. J., Goode, M. K., & Roediger, H. L., III
(2006). When additional multiple-choice lures aid versus hinder later
memory. Applied Cognitive Psychology, 20, 941-956.
Butler, A. C., & Roediger, H. L., III (2007). Testing improves long-
term retention in a simulated classroom setting. European Journal of
Cognitive Psychology, 19, 514-527.
Butterfield, B., & Metcalfe, J. (2001). Errors committed with high
confidence are hypercorrected. Journal of Experimental Psychology:
Learning, Memory, & Cognition, 27, 1491-1494.
Carrier, M., & Pashler, H. (1992). The influence of retrieval on reten-
tion. Memory & Cognition, 20, 633-642.
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D.
(2006). Distributed practice in verbal recall tasks: A review and quan-
titative synthesis. Psychological Bulletin, 132, 354-380.
Dempster, F. N. (1989). Spacing effects and their implications for the-
ory and practice. Educational Psychology Review, 1, 309-330.
Duchastel, P. C., & Nungester, R. J. (1982). Testing effects mea-
sured with alternate test forms. Journal of Educational Research, 75,
309-313.
Foos, P. W., & Fisher, R. P. (1988). Using tests as learning opportuni-
ties. Journal of Educational Psychology, 80, 179-183.
Frederiksen, N. (1984). The real test bias: Influences of testing on
teaching and learning. American Psychologist, 39, 193-202.
Geisser, S., & Greenhouse, S. W. (1958). An extension of Box’s results
on the use of F distribution in multivariate analysis. Annals of Math-
ematical Statistics, 29, 885-891.
Gilman, D. A. (1969). Comparison of several feedback methods for
correcting errors by computer-assisted instruction. Journal of Educa-
tional Psychology, 60, 503-508.
Glover, J. A. (1989). The “testing” phenomenon: Not gone but nearly
forgotten. Journal of Educational Psychology, 81, 392-399.
Huelser, B. J., & Marsh, E. J. (2006, November). Does guessing on a
multiple-choice test affect later cued recall? Poster session presented
at the annual meeting of the Psychonomic Society, Houston, TX.
Jacoby, L. L., & Hollingshead, A. (1990). Reading student essays may
be hazardous to your spelling: Effects of reading incorrectly and cor-
rectly spelled words. Canadian Journal of Psychology, 44, 345-358.
Jones, H. E. (1923–1924). The effects of examination on the perfor-
mance of learning. Archives of Psychology, 10, 1-70.
Killeen, P. R. (2005). An alternative to null-hypothesis significance
tests. Psychological Science, 16, 345-353.
Koriat, A., & Goldsmith, M. (1994). Memory in naturalistic and labo-
ratory contexts: Distinguishing the accuracy-oriented and quantity-
oriented approaches to memory assessment. Journal of Experimental
Psychology: General, 123, 297-315.
Koriat, A., & Goldsmith, M. (1996). Monitoring and control processes
in strategic regulation of memory accuracy. Psychological Review,
103, 490-517.
Kulhavy, R. W. (1977). Feedback in written instruction. Review of Edu-
cational Research, 47, 211-232.
Kulhavy, R. W., & Anderson, R. C. (1972). Delay-retention effect
with multiple-choice tests. Journal of Educational Psychology, 63,
505-512.
Kulhavy, R. W., & Stock, W. A. (1989). Feedback in written instruc-
tion: The place of response certitude. Educational Psychology Review,
1, 279-308.
Kulik, J. A., & Kulik, C. C. (1988). Timing of feedback and verbal
learning. Review of Educational Research, 58, 79-97.
Loftus, E. F., Miller, D. G., & Burns, H. J. (1978). Semantic integra-
tion of verbal material into a visual memory. Journal of Experimental
Psychology: Human Learning & Memory, 4, 19-31.
Marsh, E. J., Meade, M. L., & Roediger, H. L., III (2003). Learning
facts from fiction. Journal of Memory & Language, 49, 519-536.
these conditions lose its effectiveness or be even more
effective?
One final consideration involves the type of to-be-
learned information used in the present study. Although
our test questions focused on relatively basic factual infor-
mation, the results of the present study probably extend to
more complex conceptual information as well. The criti-
cal mechanism for promoting the retention of informa-
tion is the successful retrieval of that information. Thus,
if a test leads students to successfully retrieve conceptual
information, then the retention of that conceptual infor-
mation will be enhanced. Indeed, there is some evidence
to suggest that testing on conceptual information leads to
even bigger testing effects (Wildman & McDaniel, 2007).
Marsh, Roediger, Bjork, and Bjork (2007) reported that
both positive and negative effects of multiple-choice test-
ing were apparent in complex materials when no feed-
back was given. We expect that effects of feedback would
diminish errors in these complex materials—as with our
current factual materials—but testing this conjecture must
await future research.
In summary, the present research highlights the impor-
tance for educators to provide (briefly) delayed feedback
following multiple-choice tests in order to maintain correct
answers and to correct erroneous answers. Even students
who have not studied the material thoroughly will ben-
efit, at least for the information that is tested. Our results
provide further evidence for the importance of judicious
testing in order to enhance educational performance.
AUTHOR NOTE
We thank Patrick Flanagan for his help in collecting data. This re-
search was supported with a Collaborative Activity Award from the
James S. McDonnell Foundation and Grant R305H030339 from the
Institute of Education Sciences. Correspondence may be addressed to
A. C. Butler, Department of Psychology, Campus Box 1125, Washington
University, One Brookings Drive, St. Louis, MO 63139-4899 (e-mail:
butler@wustl.edu).
REFERENCES
Agarwal, P. K., Karpicke, J. D., Kang, S. H. K., Roediger, H. L., III,
& McDermott, K. B. (in press). Examining the testing effect with
open- and closed-book tests. Applied Cognitive Psychology.
American Psychological Association (2002). Ethical principles of
psychologists and code of conduct. Retrieved June 1, 2003, from
www.apa.org/ethics/code2002.html.
Angell, G. W. (1949). The effect of immediate knowledge of quiz re-
sults on final examination scores in freshman chemistry. Journal of
Educational Research, 42, 391-394.
Bangert-Drowns, R. L., Kulik, C. C., Kulik, J. A., & Morgan, M.
(1991). The instructional effect of feedback in test-like events. Review
of Educational Research, 61, 213-238.
Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C. C. (1991). Effects
of frequent classroom testing. Journal of Educational Research, 85,
89-99.
Bourne, L. E., Jr. (1957). Effect of information feedback and task
complexity on the identification of concepts. Journal of Experimental
Psychology, 54, 201-207.
Brackbill, Y., Bravos, A., & Starr, R. H. (1962). Delay-improved
retention of a difficult task. Journal of Comparative & Physiological
Psychology, 55, 947-952.
Brown, A. S. (1988). Experiencing misspellings and spelling perfor-
mance: Why wrong isn’t right. Journal of Educational Psychology,
80, 488-494.
Brown, A. S., Schilling, H. E. H., & Hockensmith, M. L. (1999).
inforcement contingencies. American Educational Research Journal,
8, 135-141.
Surber, J. R., & Anderson, R. C. (1975). Delay-retention effect in
natural classroom settings. Journal of Educational Psychology, 67,
170-173.
Tait, K., Hartley, J. R., & Anderson, R. C. (1973). Feedback proce-
dures in computer-assisted arithmetic instruction. British Journal of
Educational Psychology, 43, 161-171.
Toppino, T. C., & Luipersbeck, S. M. (1993). Generality of the negative
suggestion effect in objective tests. Journal of Educational Psychol-
ogy, 86, 357-362.
Tulving, E. (1967). The effects of presentation and recall of material in
free-recall learning. Journal of Verbal Learning & Verbal Behavior,
6, 175-184.
Wheeler, M. A., & Roediger, H. L., III. (1992). Disparate effects of
repeated testing: Reconciling Ballard’s (1913) and Bartlett’s (1932)
results. Psychological Science, 3, 240-245.
Whitten, W. B., & Leonard, J. M. (1980). Learning from tests: Facili-
tation of delayed recall by initial recognition alternatives. Journal of
Experimental Psychology: Human Learning & Memory, 6, 127-134.
Wildman, K., & McDaniel, M. A. (2007, August). Test-enhanced learn-
ing for facts and concepts. Poster session presented at the annual meet-
ing of the American Psychological Association, San Francisco.
NOTES
1. Performance in a given response category is dependent on per-
formance in the other two response categories (e.g., the proportion of
incorrect other responses is necessarily determined by the proportion
correct and intrusions). The analysis of incorrect other responses is not
reported because the measure was not of primary interest and would
be redundant with other measures, but the data can be found in the Ap-
pendix. We include the Appendix to provide a complete summary of the
forced recall results.
2. As in the forced report phase, there were three possible outcomes
for each response: correct, intrusion, or incorrect other. However, unlike
the forced report phase, these various response outcomes are potentially
independent of each other because a response could also be omitted. The
analyses of incorrect other responses are not presented because subjects
kept very few of these responses (overall M = .10) and there were no
significant differences between any of the conditions. The proportions
reported for the free report performance were computed by dividing the
number of items kept by the total number of items in that condition of the
experiment (i.e., not the total number of items that were kept).
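The choice of denominator described in Note 2 can be illustrated with a minimal sketch. The function name and the item counts below are hypothetical and are not taken from the experiment; the sketch only shows that each free report proportion is computed over all items in a condition rather than over the items that were kept.

```python
def free_report_proportion(n_kept_in_category, n_items_in_condition):
    """Proportion of ALL items in a condition that were kept and fall in one category."""
    if n_items_in_condition <= 0:
        raise ValueError("a condition must contain at least one item")
    return n_kept_in_category / n_items_in_condition


# Hypothetical counts for one subject in one condition (illustration only).
n_items = 12
kept = {"correct": 6, "intrusion": 2, "incorrect_other": 1}

# Denominator is the total number of items in the condition (12), as in Note 2.
proportions = {k: free_report_proportion(v, n_items) for k, v in kept.items()}

# Because responses may be omitted, the three proportions need not sum to 1.
omitted = 1 - sum(proportions.values())

# The alternative that Note 2 rules out: dividing by the number of items kept.
correct_among_kept = kept["correct"] / sum(kept.values())
```

With these made-up counts, the proportion correct is 6/12 under the method in Note 2, whereas dividing by the nine kept items would give 6/9, a different quantity that reflects output accuracy rather than the amount recalled.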
3. The number of multiple-choice alternatives factor was initially in-
cluded in all of the free report analyses. However, this factor did not
produce any main effects and did not interact with any of the other fac-
tors. Thus, all the free report data were collapsed across the number of
multiple-choice alternatives for the sake of brevity.
4. The conditional analyses were conducted on the aggregated data
(i.e., across subjects) rather than the alternative method of computing
conditionalized means for each individual subject. This method was used
in order to avoid the problem of how to replace or estimate a mean for
individual subjects when they did not produce any observations in one of
the conditionalized cells (e.g., “correct on final cued recall given incor-
rect on initial multiple choice”). Because we used this form of analysis,
inferential statistics could not be computed on the data. They should be
considered descriptive.
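The difference between the two methods mentioned in Note 4 can be sketched as follows. The record format, function names, and data are hypothetical illustrations, not the analysis code or data used in the study. The sketch shows why per-subject conditional means can be undefined when a subject contributes no observations to a cell, whereas the pooled (aggregated) proportion is still defined.

```python
def aggregated_conditional(records, given, outcome):
    """P(final == outcome | initial == given), pooled across all records."""
    pool = [r for r in records if r["initial"] == given]
    if not pool:
        return None  # the conditional cell is empty
    return sum(r["final"] == outcome for r in pool) / len(pool)


def per_subject_conditionals(records, given, outcome):
    """The same conditional proportion, computed separately for each subject."""
    by_subject = {}
    for r in records:
        by_subject.setdefault(r["subject"], []).append(r)
    return {s: aggregated_conditional(rs, given, outcome)
            for s, rs in by_subject.items()}


# Hypothetical responses: subject 2 made no errors on the initial test.
records = [
    {"subject": 1, "initial": "incorrect", "final": "correct"},
    {"subject": 1, "initial": "incorrect", "final": "intrusion"},
    {"subject": 1, "initial": "correct", "final": "correct"},
    {"subject": 2, "initial": "correct", "final": "correct"},
    {"subject": 2, "initial": "correct", "final": "correct"},
]

pooled = aggregated_conditional(records, "incorrect", "correct")
per_subject = per_subject_conditionals(records, "incorrect", "correct")
```

Here the pooled value is defined (one of two initially incorrect items was later correct), but subject 2 has no value for the cell. Pooling avoids having to replace or estimate that missing mean, at the cost of losing the subject-level variability needed for inferential statistics.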
Marsh, E. J., Roediger, H. L., III, Bjork, R. A., & Bjork, E. L. (2007).
The memorial consequences of multiple-choice testing. Psychonomic
Bulletin & Review, 14, 194-199.
McConnell, M. D., Hunt, R. R., & Smith, R. E. (2006, May). Differ-
ential effects of immediate and delayed feedback on test performance.
Paper presented at the annual meeting of the Midwestern Psychologi-
cal Society, Chicago, IL.
McDaniel, M. A., Anderson, J. L., Derbish, M. H., & Morrisette, N.
(2007). Testing the testing effect in the classroom. European Journal
of Cognitive Psychology, 19, 494-513.
McDaniel, M. A., & Fisher, R. P. (1991). Test and test feedback
as learning sources. Contemporary Educational Psychology, 16,
192-201.
McDaniel, M. A., & Masson, M. E. J. (1985). Altering memory rep-
resentations through retrieval. Journal of Experimental Psychology:
Learning, Memory, & Cognition, 11, 371-385.
McGeoch, J. A. (1932). Forgetting and the law of disuse. Psychological
Review, 39, 352-370.
McKeachie, W. J. (1999). Teaching tips: Strategies, research, and the-
ory for college and university teachers (10th ed.). Boston: Houghton
Mifflin.
Paige, D. D. (1966). Learning while testing. Journal of Educational
Research, 59, 276-277.
Pashler, H., Cepeda, N. J., Wixted, J. T., & Rohrer, D. (2005). When
does feedback facilitate learning of words? Journal of Experimental
Psychology: Learning, Memory, & Cognition, 31, 3-8.
Roediger, H. L., III, Jacoby, J. D., & McDermott, K. B. (1996).
Misinformation effects in recall: Creating false memories through
repeated retrieval. Journal of Memory & Language, 35, 300-318.
Roediger, H. L., III, & Karpicke, J. D. (2006a). The power of testing
memory: Basic research and implications for educational practice.
Perspectives on Psychological Science, 1, 181-210.
Roediger, H. L., III, & Karpicke, J. D. (2006b). Test-enhanced learn-
ing: Taking memory tests improves long-term retention. Psychologi-
cal Science, 17, 249-255.
Roediger, H. L., III, & Marsh, E. J. (2005). The positive and nega-
tive consequences of multiple-choice testing. Journal of Experimental
Psychology: Learning, Memory, & Cognition, 31, 1155-1159.
Roediger, H. L., III, & Payne, D. G. (1985). Recall criterion does not
affect recall level or hypermnesia: A puzzle for generate/recognize
theories. Memory & Cognition, 13, 1-7.
Roediger, H. L., III, Wheeler, M. A., & Rajaram, S. (1993). Remem-
bering, knowing and reconstructing the past. In D. L. Medin (Ed.),
The psychology of learning and motivation: Advances in research and
theory (Vol. 30, pp. 97-134). New York: Academic Press.
Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime ref-
erence guide. Pittsburgh: Psychology Software Tools, Inc.
Schooler, J. W., Foster, R. A., & Loftus, E. F. (1988). Some deleteri-
ous consequences of the act of recollection. Memory & Cognition,
16, 243-251.
Schroth, M. L. (1977). Effects of informative feedback and active train-
ing upon concept attainment. Psychological Reports, 40, 647-653.
Skinner, B. F. (1954). The science of learning and the art of teaching.
Harvard Educational Review, 24, 86-97.
Sturges, P. T. (1969). Verbal retention as a function of the informa-
tiveness and delay of informative feedback. Journal of Educational
Psychology, 60, 11-14.
Sullivan, H. J., Schutz, R. E., & Baker, R. L. (1971). Effects of re-
APPENDIX
Performance on the Cued Recall Test Under Forced Report Instructions As a
Function of Amount-of-Prior-Study Condition, the Number of Initial Multiple-Choice
Alternatives, and Feedback Condition (Excluding the No-Test Condition)
                           Feedback Condition
Study                      No Feedback         Immediate Feedback  Delayed Feedback
Condition DV               Two   Four  Six     Two   Four  Six     Two   Four  Six
No study  Correct          .21   .17   .16     .36   .43   .48     .52   .57   .60
          Intrusions       .23   .34   .38     .16   .14   .14     .10   .14   .14
          Incorrect other  .56   .49   .46     .48   .43   .38     .38   .29   .26
Study     Correct          .40   .29   .31     .41   .43   .45     .47   .56   .58
          Intrusions       .20   .28   .27     .16   .19   .20     .13   .10   .16
          Incorrect other  .40   .43   .42     .43   .39   .35     .40   .34   .26
Restudy   Correct          .43   .42   .38     .51   .50   .49     .57   .56   .56
          Intrusions       .14   .15   .21     .13   .17   .14     .15   .16   .16
          Incorrect other  .43   .43   .41     .36   .33   .37     .28   .28   .28
Mean      Correct          .35   .29   .28     .43   .45   .47     .52   .56   .58
          Intrusions       .19   .26   .29     .15   .17   .16     .13   .13   .15
          Incorrect other  .46   .45   .43     .42   .38   .37     .35   .30   .27
Note—The first dependent variable (DV) is the proportion of targets correctly recalled, the second is the proportion
of intrusions, and the third is the proportion of incorrect other responses.
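Because every forced report response is scored as correct, an intrusion, or incorrect other, the three proportions in each column of the Appendix should sum to 1, apart from rounding. The following minimal sketch checks this for the rows labeled Mean; the variable names and the rounding tolerance are our own assumptions for illustration, and the values are copied from the table above.

```python
# Mean rows of the Appendix, in column order:
# no feedback (two, four, six), immediate (two, four, six), delayed (two, four, six).
mean_rows = {
    "correct":         [.35, .29, .28, .43, .45, .47, .52, .56, .58],
    "intrusions":      [.19, .26, .29, .15, .17, .16, .13, .13, .15],
    "incorrect_other": [.46, .45, .43, .42, .38, .37, .35, .30, .27],
}

# Sum the three response categories within each column.
column_sums = [
    round(c + i + o, 2)
    for c, i, o in zip(mean_rows["correct"],
                       mean_rows["intrusions"],
                       mean_rows["incorrect_other"])
]

# Each entry is rounded to two decimals, so allow a small tolerance (assumed).
TOLERANCE = 0.011
all_consistent = all(abs(s - 1.0) <= TOLERANCE for s in column_sums)
```

The same check can be applied to the other rows of the table. It also makes concrete why a separate analysis of incorrect other responses would be redundant under forced report: that proportion is fixed once the other two are known.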
(Manuscript received June 28, 2007;
revision accepted for publication October 29, 2007.)
... Retrieval practice with feedback is ordinarily superior to retrieval practice without feedback. But even retrieval trials without feedback are seen to be more beneficial to subsequent performance than additional study trials (e.g., Butler & Roediger, 2008;Potts & Shanks, 2014). In terms of theory, this last observation is important because it precludes new learning that would likely result from additional intact learning trials. ...
... Considerable attention in the testing effect literature has been directed at the role of feedback during the practice tests (e.g., Butler et al., 2006;Butler & Roediger, 2008;McDaniel & Fisher, 1991;Vojdanoska et al., 2010). It is widely accepted that feedback enhances the benefits of testing by allowing the participant to correct errors that were made during retrieval, thereby reducing the likelihood of the error from being repeated during a later test (Bangert-Drowns et al., 1991). ...
... The benefit with respect to future retrieval provided by practice tests without feedback (e.g., Butler & Roediger, 2008;Miller, 1982;Potts & Shanks, 2014) is fundamentally problematic for associative theories of learning. Although testing effects have been observed in both associative and instrumental preparations, associative theories (e.g., Rescorla & Wagner, 1972) consistently predict that CS-alone trials (i.e., practice tests) following initial learning (i.e., CS-US pairings) constitute extinction trials and should be detrimental to the responding based on the initially learned CS-US relationship. ...
Article
Full-text available
Taking a test of previously studied material has been shown to improve long-term subsequent test performance in a large variety of well controlled experiments with both human and nonhuman subjects. This phenomenon is called the testing effect. The promise that this benefit has for the field of education has biased research efforts to focus on applied instances of the testing effect relative to efforts to provide detailed accounts of the effect. Moreover, the phenomenon and its theoretical implications have gone largely unacknowledged in the basic associative learning literature, which historically and currently focuses primarily on the role of information processing at the time of acquisition while ignoring the role of processing at the time of testing. Learning is still widely considered to be something that happens during initial training, prior to testing, and tests are viewed as merely assessments of learning. However, the additional processing that occurs during testing has been shown to be relevant for future performance. The present review offers an introduction to the historical development, application, and modern issues regarding the role of testing as a learning opportunity (i.e., the testing effect). We conclude that the testing effect is seen to be sufficiently robust across tasks and parameters to serve as a compelling challenge for theories of learning to address. Our hope is that this review will inspire new research, particularly with nonhuman subjects, aimed at identifying the basic underlying mechanisms which are engaged during retrieval processes and will fuel new thinking about the learning-performance distinction. (PsycInfo Database Record (c) 2022 APA, all rights reserved).
... Feedback contains information for learners to monitor or regulate their learning in future situations. It can trigger confirmation, addition, overwriting, tuning or reconstruction of information in memory (Butler & Winne, 1995) and thus should amplify the direct testing effect by providing opportunities for deeper processing and disclose errors (Butler & Roediger, 2008;Schwieren et al., 2017). However, Adesope et al. (2017) found no significant differences between testing effects with or without feedback in their meta-analysis. ...
... It is possible that the effectiveness of the feedback is related to students' perception of the testing situation as meaningful and worth engaging in (Jönsson & Panadero, 2018;Kornell, 2014). Hence, differences in the meaningfulness of settings (e.g., laboratory setting with no relations to personal performance in contrast to a mandatory course setting where students prepare and post-process test content) might be an explanation for the ambiguous role of feedback on practice tests in the testing literature (Adesope et al., 2017;Butler & Roediger, 2008;Rowland, 2014). Summarizing existing studies in this area, such a setting in the context of a lecture is characterized by the following features (Barenberg & Dutke, 2021;Cogliano et al., 2019;Enders et al., 2021;Fernandez & Jamet, 2016;Händel et al., 2020;McDaniel et al., 2007McDaniel et al., , 2012 In addition, students should continue to process the content of the tests and not merely be interested in gaining insight into the difficulty and structure of the exam (Dutke et al., 2010). ...
Article
Full-text available
Two studies investigated the testing effects on performance and on metacognitive judgment accuracy in authentic learning settings. Across two educational psychology courses, undergraduate students had the opportunity to voluntarily participate in four different practice tests during the term—without feedback in Study 1 (N = 201 students) or with individual corrective feedback in Study 2 (N = 111 students). Across studies in real classroom settings with and without feedback, regression analyses indicated that a higher number of taken practice tests were related to higher performance in the final course exam and to two scores of metacognitive judgment accuracy (absolute accuracy and sensitivity). However, students’ preparation and post-processing practice tests, their perceived usefulness of tests for monitoring one’s performance, and metacognitive specificity differed depending on whether students received feedback or not. Overall, the studies convey considerable evidence on how participation in practice tests is related not only to performance but also to monitoring accuracy in authentic learning settings.
... Traditionally the presence of testing effect is examined by comparing the final learning achievements of two groups-in which test-enhanced learning was applied and in which other strategies were administered (restudying, no treatment, concept mapping, etc.) [8], [18], [19], [28], etc. Numerous explanations of the positive effect of this technique, which are sometimes overlapping, are suggested, among them, several most spread theories should be noted. Additional exposure theory suggests that learning and retention are enhanced by often exposure to the target material, testing with corrective feedback is more beneficial than without it [29]. The immediate feedback feature of quizzes is paid special attention as it shows the result immediately, helps to achieve the aim, increases learner engagement, and may promote student's autonomy in the educational process [29], [30]. ...
... Additional exposure theory suggests that learning and retention are enhanced by often exposure to the target material, testing with corrective feedback is more beneficial than without it [29]. The immediate feedback feature of quizzes is paid special attention as it shows the result immediately, helps to achieve the aim, increases learner engagement, and may promote student's autonomy in the educational process [29], [30]. Retrieval effort theory states that the more difficult tests, the more demanding retrieval processes that boost long-term memory [19], failure in prior tests motivates more efforts to retrieve target information in further tests [31]. ...
Article
Full-text available
The research studied the effectiveness of a three-stage Seamless Learning Model with Enhanced Web-Quizzing based on the intensive use of sets of quizzes created in different web-based quiz generators. The findings revealed that continuous testing using diagnostic, formative, benchmark and summative quizzes, administered at the three stages - familiarisation, formation and assessment, applying such features of seamless learning as learning in various contexts, ubiquitous access to digital learning resources and quizzes, combination of teacher-guided learning, self-directed and collaborative learning, and switching between various learning activities can be an efficient teaching technique in higher education having positive effect on academic performance, motivation, learners’ approaches to studying and course engagement. The testing effect was investigated on the summative tests taken at the end of the three different academic courses. The results showed that online quizzes applied primarily as learning tools with the emphasis on information retrieval and retention resulted in the higher achievements of the experimental group students. The survey conducted with Biggs’s Revised Two-Factor Study Process Questionnaire (R-SPQ-2F) showed the changes in the learners’ motives and approaches to studying revealed in students’ active participation in in-class and out-of-class activities, mastering of the material through understanding rather than mechanical memorizing, the search for additional information and increased attendance. The engagement of the experimental group students in the new quiz-enhanced settings was examined focusing on emotional, skills, participation and performance engagement aspects.
... The second explanation pertains to the nature of multiple-choice questions by which respondents are required to process response options and therefore are presented with erroneous information. Roediger and Marsh (2005) showed that multiple-choice testing may lead participants to answer later criterial tests with false information, but feedback following multiple-choice questions might reduce these negative effects and increase beneficial effects (Butler & Roediger, 2008). ...
Article
Full-text available
Proponents of the testing effect claim that answering questions about the learning content benefits retention more than does additional restudying—even without corrective feedback. In educational contexts, evidence for this claim is scarce and points toward differential effects for different question formats: Benefits emerged for short-answer questions but not for multiple-choice questions. The present study implemented an experimentally controlled, minimal intervention design in five sessions of an existing lecture. In each session, participants reviewed lecture content by answering short-answer questions, multiple-choice questions, or reading summarising statements. An unannounced test measured the retention of learning content. Bayesian analyses revealed a positive testing effect for short-answer questions that was strongest for difficult practice questions. Analyses also provided evidence for the absence of a testing effect for multiple-choice questions. These results suggest that short-answer testing is more beneficial than multiple-choice testing in a higher education context when feedback is not provided.
... The differentiation in stimulation encourages trainees to pay more attention to positive feedback, with little attention to negative feedback that only indicates a response is not recommended. In this way, the message of recommended behavioral responses is emphasized instead of misinformation [9,11]. ...
Article
Full-text available
Various attempts and approaches have been made to teach individuals about the knowledge of best practice for earthquake emergencies. Among them, Immersive Virtual Reality Serious Games (IVR SGs) have been suggested as an effective tool for emergency training. The notion of IVR SGs is consistent with the concept of problem-based gaming (PBG), where trainees interact with games in a loop of forming a playing strategy, applying the strategy, observing consequences, and making reflection. PBG triggers reflection-on-action, enabling trainees to reform perceptions and establish knowledge after making a response to a scenario. However, in the literature of PBG, little effort has been made for trainees to reflect while they are making a response (i.e., reflection-in-action) in a scenario. In addition, trainees do not have the possibility to adjust their responses and reshape their behaviors according to their reflection-in-action. In order to overcome these limitations, this study proposes a game mechanism, which integrates spiral narratives with immediate feedback, to underpin reflection-in-action and reflective redo in PBG. An IVR SG training system suited to earthquake emergency training was developed, incorporating the proposed game mechanism. A controlled experiment with 99 university students and staff was conducted. Participants were divided into three groups, with three interventions tested: a spiral narrated IVR SG, a linear narrated IVR SG, and a leaflet. Both narrated IVR SGs were effective in terms of immediate knowledge gain and self-efficacy improvement. However, challenges and opportunities for future research have been suggested.
... Third, lab studies have shown beneficial effects of feedback when practicing multiple-choice questions, but this effect can also avert the negative effects that could arise because of the exposure to incorrect information in the form of lures or distractors (Butler & Roediger, 2008;Roediger & Marsh, 2005). Therefore, research is needed to investigate whether feedback adds to the unmediated effects of multiple-choice practice tests in educational settings. ...
Article
Full-text available
Background Retrieval practice promotes retention of learned information more than restudying the information. However, benefits of multiple-choice testing over restudying in real-world educational contexts and the role of practically relevant moderators such as feedback and learners’ ability to retrieve tested content from memory (i.e., retrievability) are still underexplored. Objective The present research examines the benefits of multiple-choice questions with an experimental design that maximizes internal validity, while investigating the role of feedback and retrievability in an authentic educational setting of a university psychology course. Method After course sessions, students answered multiple-choice questions or restudied course content and afterward could choose to revisit learning content and obtain feedback in a self-regulated way. Results Participants on average obtained corrective feedback for 9% of practiced items when practicing course content. In the criterial test, practicing retrieval was not superior to reading summarizing statements in general but a testing effect emerged for questions that targeted information that participants could easily retrieve from memory. Conclusion Feedback was rarely sought. However, even without feedback, participants profited from multiple-choice questions that targeted easily retrievable information. Teaching Implications Caution is advised when employing multiple-choice testing in self-regulated learning environments in which students are required to actively obtain feedback.
Article
Frequent repetition of vocabulary is essential for effective language learning. To increase exposure to learning content, this work explores the integration of vocabulary tasks into the smartphone authentication process. We present the design and initial user experience evaluation of twelve prototypes, which explored three learning tasks and four common authentication types. In a three-week within-subject field study, we compared the most promising concept, implemented as a mobile language learning (MLL) application, to two baselines. We designed a novel (1) UnlockApp that presents a vocabulary task with each authentication event, nudging users towards short, frequent learning sessions. We compared it with a (2) NotificationApp that displays vocabulary tasks in a push notification in the status bar, which is always visible but requires learning to be user-initiated, and a (3) StandardApp that requires users to actively start in-app learning. Our study is the first to directly compare these embedding concepts for MLL, showing that integrating vocabulary learning into everyday smartphone interactions via UnlockApp and NotificationApp increases the number of answers. However, users show individual subjective preferences. Based on our results, we discuss the trade-off between higher content exposure and disturbance, and the related challenges and opportunities of embedding learning seamlessly into everyday mobile interactions.
Article
Feedback is an essential construct for many theories of learning and instruction, and an understanding of the conditions for effective feedback should facilitate both theoretical development and instructional practice. In an early review of feedback effects in written instruction, Kulhavy (1977) proposed that feedback’s chief instructional significance is to correct errors. This error-correcting action was thought to be a function of presentation timing, response certainty, and whether students could merely copy answers from feedback without having to generate their own. The present meta-analysis reviewed 58 effect sizes from 40 reports. Feedback effects were found to vary with control for presearch availability, type of feedback, use of pretests, and type of instruction and could be quite large under optimal conditions. Mediated intentional feedback for retrieval and application of specific knowledge appears to stimulate the correction of erroneous responses in situations where its mindful (Salomon & Globerson, 1987) reception is encouraged.
Article
High school students studied a brief history text, then took either a short-answer test or a multiple-choice test on the material, or they completed a study habits questionnaire serving as a filler task. Two weeks later, all students completed a retention test composed equally of short-answer and multiple-choice questions. Initial testing greatly enhanced later retention of the material. Retention was greatest on those items cast in the same test format as seen previously (a test practice effect). However, initial testing also enhanced retention on those items cast in the alternate format (a consolidation effect). This latter factor was interpreted as playing a more substantial role than practice in enhancing retention through testing.
Article
Informative feedback procedures were compared with active training in concept attainment. 60 first-, second-, and third-grade children were presented with conjunctive concepts and assigned at random to each of 12 groups. The stimuli were drawings on cards representing the concepts. A 4 × 3 factorial design was employed which involved 3 combinations of feedback plus active training and 3 levels of task complexity. The results showed the informative feedback groups (right-wrong, nothing-wrong and right-nothing) discovered the concepts faster than the active training groups. Over all 3 levels of task difficulty the right-wrong condition yielded significantly better performance than the other feedback conditions and, in addition, nothing-wrong resulted in a faster rate of learning than right-nothing.
Article
The educational effects of frequent classroom testing have been studied and discussed since the early part of this century. Testing advocates have suggested that more frequent classroom testing stimulates practice and review, gives students more opportunities for feedback on their work, and has a positive influence on student study time. Reviewers of relevant research and evaluation literature, however, have expressed uncertainty about whether such benefits are actually realized in classrooms. The present review distinguishes research on frequent classroom testing from research in two related areas, research on adjunct questions and research on mastery testing, and provides results from a meta-analysis of findings on frequency of classroom testing. The meta-analysis showed that students who took at least one test during a 15-week term scored about one half of a standard deviation higher on criterion examinations than did students who took no tests. Better criterion performance was associated with more frequent testing, but the amount of improvement in achievement diminished as the number of tests increased.
Article
Two experiments examined the influence of test taking and feedback in promoting learning. Participants were shown a list of trivia facts during an incidental learning task. Some facts were later tested (plus feedback provided), whereas other facts were not presented for further processing. Tested facts were better recalled on a final criterion test than untested facts, showing the beneficial effects of testing. Tested facts were also better recalled than facts that were presented for additional study (Experiment 1). Although testing plus feedback enhanced learning, there were no effects of whether the participants were required simply to repeat the feedback or elaborate it.