The Efficacy of Written Corrective Feedback:
A Critique of a Meta-analysis
John Truscott
National Tsing Hua University, Taiwan
Abstract
An influential meta-analysis on the effectiveness of written error correction (Kang & Han,
2015) concluded that the practice is valuable for language classes. This paper critically
examines the meta-analysis and challenges its conclusion. The average effect size of the 21
included studies is unimpressive, even if taken at face value. The three studies that carry most
of the weight for the favorable conclusion are all essentially the same experiment, an
experiment which is too narrow and specialized to support any general conclusions on the
value of correction and the findings of which are challenged by other research. Two other
studies obtained moderately large effect sizes: In one of them the effect disappeared within
two weeks; the other was not a study of second language learning. For all the remaining
studies modest or weak results were reported. Some obtained negative effects which were
reported in the meta-analysis as positive effects. Others relied on inappropriate comparisons.
Two relevant studies that found correction ineffective or harmful were inappropriately
excluded from the meta-analysis. Several of the included studies fail to meet the authors' inclusion
criteria and should not have been used. The paper also examines some issues that arise in a
meta-analysis on this topic and offers suggestions for future work.
keywords: written error correction, meta-analysis, effect size, inclusion criteria, control group
The effectiveness of error correction for improving learners’ writing skills is an
important issue for language teachers and so, not surprisingly, has inspired a great deal of
research. An important tool for understanding this body of research is meta-analysis (Cohen,
1992; Lipsey & Wilson, 2001; Norris & Ortega, 2000; Rosenthal, 1991). Its value lies in its
ability to bring the results of different studies together and place them in a common form,
effect size, so they can be compared and averaged. A large number of meta-analyses have
been done on error correction research, reaching a variety of conclusions (see Plonsky &
Brown, 2015; Truscott, 2016).
The primary meta-analysis dealing specifically with written correction is that of Kang
and Han (2015), published in the Modern Language Journal. Those authors looked at 21
studies, concluding that writing instructors can take their findings as a favorable message
about the effectiveness of written correction. Because of the importance of this conclusion for
teaching practice, critical analysis is necessary, and providing it is the purpose of this paper. I will
suggest that the authors’ positive conclusion is unwarranted. In the process, I will consider
various issues arising in the use of meta-analysis and in original research on this topic. I will
not attempt here to provide an alternative meta-analysis, a project that would be extremely
ambitious and would go well beyond the goals of the paper.
The Meta-analysis
The effect size measure that has been most commonly used in this area is Cohen’s d,
which is the difference between the means of two groups divided by their pooled standard
deviation. So if the mean score of a group that received correction is one standard deviation
better than the mean of a group that did not, the effect size is 1.00. Kang and Han (2015) used
Hedge’s g, which follows the same principle but is more conservative and adjusts for small
sample sizes. Interpretation of effect sizes, both d and g, is based on the following
benchmarks (Plonsky & Oswald, 2014):
large effect: 1.00
medium effect: .70
small effect: .40
Negative effect sizes indicate that the comparison group outperformed the experimental
group; i.e., they point to harmful effects of the treatment.
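To make these definitions concrete, the following minimal sketch (in Python, with hypothetical group statistics) computes d and Hedges' g using the standard pooled-standard-deviation formula and the usual approximate small-sample correction; it is an illustration of the formulas, not a reconstruction of Kang and Han's actual procedure.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    # Difference between the group means divided by the pooled standard deviation
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    # Cohen's d scaled by the approximate Hedges small-sample correction factor
    d = cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c)
    return d * (1 - 3 / (4 * (n_t + n_c) - 9))

# Hypothetical example: the corrected group scores one pooled SD above the
# uncorrected group, so d = 1.00; g comes out slightly smaller.
print(cohens_d(80, 70, 10, 10, 15, 15))  # 1.0
print(hedges_g(80, 70, 10, 10, 15, 15))  # roughly 0.97
```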
Table 1 lists the studies that were included in the meta-analysis along with the
reported effect size for each (from Appendix C of Kang & Han, 2015). I have arranged them
in descending order and divided them in terms of their relation to the above benchmarks. One
complication is that Kang and Han’s table is apparently missing one of the studies they used,
as it lists them from 1 to 21 but does not include a number 16. I will briefly return to the
missing study (apparently Sheen, 2010) below.
Study | Effect size (g)
Bitchener (2008) | 1.482
Bitchener & Knoch (2008) | 1.375
Bitchener & Knoch (2010b) | 1.161
Shintani & Ellis (2013) | .902
van Beuningen et al. (2008) | .888
Bitchener & Knoch (2010a) | .642
Hartshorn et al. (2010) | .607
Sheen et al. (2009) | .570
Chandler (2003), Study 1 | .496
Fazio (2001) | .481
Evans et al. (2011) | .473
Sun (2013) | .472
Ellis et al. (2008) | .430
Kepner (1991) | .383
Jhowry (2010) | .341
Mubarak (2013) | .245
Sheen (2007) | .104
Bitchener et al. (2005) | .103
Semke (1980) | .089
Truscott & Hsu (2008) | .068
Table 1. Effect sizes reported by Kang and Han (2015)
Some of the numbers can be challenged, as can Kang and Han’s (2015) decisions
about which studies and which numbers to include. But before getting into these detailed
points, it is worthwhile to consider the findings at face value.
First, the overall effect size the authors reported was .54, meaning that the effects of
correction were slightly closer to the “small” benchmark than to the “medium” benchmark. It
is appropriate to ask whether a finding like this can justify a favorable recommendation to
teachers. Returning to the table, the first thing to note is that it is dominated by effects that
range from unimpressive to essentially non-existent. Of the 20 effect sizes shown, 15 did not
reach the benchmark for medium effect, nearly all of them falling well short of it; 7 of these
obtained effects that fall short even of the “small” benchmark, again well short in most cases.
Thus, any favorable conclusions drawn from the meta-analysis necessarily depend on the 5
studies (1/4 of the sample) that reported better results, especially the 3 lying above the “large”
benchmark. These will therefore be considered in more detail in the two following sections,
after which I will turn more briefly to the remaining studies.
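Readers who wish to check these counts can do so directly from Table 1; the short sketch below (my own tally, not part of the meta-analysis) simply compares the listed values with the Plonsky and Oswald (2014) benchmarks.

```python
# Effect sizes as listed in Table 1 (from Appendix C of Kang & Han, 2015)
effect_sizes = [1.482, 1.375, 1.161, 0.902, 0.888, 0.642, 0.607, 0.570,
                0.496, 0.481, 0.473, 0.472, 0.430, 0.383, 0.341, 0.245,
                0.104, 0.103, 0.089, 0.068]

MEDIUM, SMALL, LARGE = 0.70, 0.40, 1.00

print(len(effect_sizes))                      # 20 effect sizes shown
print(sum(g < MEDIUM for g in effect_sizes))  # 15 fall short of the "medium" benchmark
print(sum(g < SMALL for g in effect_sizes))   # 7 fall short even of the "small" benchmark
print(sum(g >= LARGE for g in effect_sizes))  # 3 lie at or above the "large" benchmark
```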
The Three Studies that Obtained Large Effect Sizes
First, the sample contains three studies that yielded large effect sizes. All three came
from the research of Bitchener and Knoch (Bitchener, 2008; Bitchener & Knoch, 2008,
2010b). By my calculation, if these and the fourth member of this group (Bitchener & Knoch,
2010a) were removed, the overall effect size would fall below the “small” benchmark. Thus,
favorable conclusions about the value of correction rest on this research, and its limitations
are therefore limitations on any such conclusions.
The first limitation is that the experiments reported in the four papers are virtually
identical. The work can reasonably be seen as one original study with replications applying it
to different groups.¹ The inclusion in a meta-analysis of each as a distinct study is not wrong,
but we have to recognize that the three large effect sizes, with their very substantial influence
on the overall effect size, reflect a narrow base.
What then is special about these studies? Why did they obtain findings so much
stronger than others? The main answer is readily apparent: The researchers deliberately
selected as the target of correction a single, very simple error type: the use of a for first
mention of a noun and the for subsequent mentions (“I read a book today; the book was about
linguistics”). The writing tasks were then designed to support the focus on this one aspect of
grammar. The testing consisted of those same tasks. It was done in the same context, by the
same researcher; in other words, the learners were being repeatedly reminded of the
corrections they had received. This treatment should be expected not only to keep the
information fresh but, perhaps more importantly, to lead the corrected students to pay greater
attention to that particular grammar point during the testing, potentially introducing a
significant bias.
Under these conditions it is hardly surprising that strong results were obtained. The
question is what such results tell us about the value of correcting errors in writing classes.
They tell us, possibly, that if we select one very simple point to correct and design writing
assignments to support correction of that one point, then afterward the students will probably
write more accurately on that point when they are doing writing tasks that are built around it
and the context encourages special attention to it. These are limitations of the Bitchener and
Knoch studies and therefore limitations on any favorable conclusions drawn from the meta-
analysis.
Even this is probably too optimistic an assessment of this research, though, as
questions can be raised about the more general impact of the treatment on learners’ ability to
write accurately. A general problem with the teaching of grammar points is that it can easily
result in over-application of what has been taught, leading to increased errors (e.g.
Lightbown, 1983; Pica, 1983; Weinert, 1987). The grammar principle that is the focus of the
Bitchener and Knoch studies predominates in the tasks/tests that were used (because the tasks
were designed for that purpose) but its role is considerably smaller in English article usage in
general. Learners who need to be corrected for failure to follow it in the carefully selected
contexts used in these studies are presumably learners who would have difficulty judging
when other factors overrule it. Thus, the treatment might well encourage them to make
mistakes by applying it where it is not appropriate (cf. Ellis et al., 2008, Note 2). These
studies have not been concerned with such negative influences on learning, picking out only
the positive effects.
¹ I also find it difficult to judge the extent to which Bitchener (2008), Bitchener and Knoch (2008), and Bitchener and Knoch (2010a) are distinct (non-overlapping) studies.
Ekiert and di Gennaro (2019) obtained evidence that harmful effects do occur. Their
conceptual replication of Bitchener and Knoch (2010a) looked at other uses of English
articles, in addition to those targeted by the correction, and found that the correction groups’
scores on these other uses were consistently below those of the control group, with effect
sizes (g) ranging from -.01 to -1.06; in other words, the correction appeared to harm other
aspects of article use. Limitations of narrow studies like the Bitchener and Knoch research
are also suggested by the findings of Mubarak (2013), in which comprehensive correction
resulted in negligible gains in the general accuracy of English article use, despite extensive
correction of article errors. Similarly, Pashazadeh and Marefat (2010), in an uncontrolled
study, targeted “the entire article system of English” and found that substantial gains on an
immediate posttest became negligible four weeks later and negative after an additional four-
week delay.
Other findings raise doubts about the value of correction even for the specific article
uses on which Bitchener and Knoch found dramatic improvements. Shintani and Ellis (2013),
studying the first-mention function of the English indefinite article, found that direct
correction had no effect on accurate use and that metalinguistic feedback had only immediate
effects, disappearing within two weeks. Ellis, Sheen, Murakami, and Takashima (2008),
looking at both first and subsequent mention, obtained more favorable results, based entirely
on the rather puzzling finding that the performance of one correction group improved
dramatically during a two-week period of no treatment. Additional challenges to the
Bitchener and Knoch findings come from Sheen, Wright, and Moldawa (2009), which I will
consider below.
More recently, Ekiert and di Gennaro (2019) found that while their correction groups
did improve on the targeted uses, the control group improved more: The effect sizes are all
negative, ranging from -.15 to -.55. The correction, in other words, was ineffective and
possibly harmful, even for the very simple target error on which it focused. The limitations
on these findings are that the study sacrificed some validity by using a form of story retelling instead of a more realistic task and, especially, by combining results from this test and an error correction task into a single measure. On the other hand, these choices seem most likely to
benefit the correction groups, as the measure is more open to the explicit knowledge that the
treatment probably produced.
Thus, the three studies that yielded large effect sizes, and carry most of the burden in
any favorable conclusions from the meta-analysis, actually have little to tell us about the
general value of correction, and may not be informative even about its value in the particular
narrow context in which it was used in those studies. They have little to offer to teachers who
are deciding whether to correct in their classes.
The Two Cases of Moderately Large Effect Sizes
While favorable conclusions from the meta-analysis rest mainly on the large effect
sizes of the Bitchener and Knoch research, two other studies yielded effect sizes not very far
below the “large” benchmark. They also require a closer look here.
For Shintani and Ellis (2013), first, the relatively good effect size shown in the table
(g = .902) is quite misleading. It appears in the meta-analysis because Kang and Han (2015)
used only immediate posttests, excluding data from follow-up testing. While this decision can
certainly be defended, the effect in this case is to hide the main finding of a study. The second
posttest of Shintani and Ellis, given just two weeks after the treatment, found no significant
advantages for corrected learners and yielded a small g. The authors concluded that “the
effect was not durable” (p. 286). When its most important finding is recognized, this turns out
to be a study that found correction ineffective. It should also be noted, again, that this was a
very narrowly-focused study, looking specifically at the first-mention use of English a.
Interestingly, Ellis, Sheen, Murakami, and Takashima (2008) showed the opposite
pattern to that found by Shintani and Ellis (2013), with weak results on the immediate
measure and very strong results on the delayed test. Thus, if delayed posttest results are used
the effect size for this study becomes far greater than the .430 listed in the table. Strong
results here are perhaps not surprising, as this study had the same narrow focus as the
Bitchener studies and so most of the comments above apply here as well. Control issues also
appear to be present in this experiment: (a) the description of participants suggests that the
classes used as experimental groups consisted of superior students; (b) during the course of
the study these students were taking a reading class while the control group’s class was on
oral communication; and (c) substantial differences were found on the pretest, favoring the
experimental groups. So, again, it is probably not surprising to see strong effects. But this
leaves the mystery of why weak results immediately after the treatment turned into
outstanding results after a two-week period of no treatment.
The other moderately large effect size came from van Beuningen, de Jong, and
Kuiken (2008). Given Kang and Han’s (2015) inclusion criteria, as well as generally accepted
thinking in the area, this study should not have been included in the meta-analysis. According
to the authors’ description of their participants, around 20% were native speakers of the target
language, Dutch, and “most students were born in The Netherlands” but “many of them only
started learning Dutch in school (i.e. at age four)” [emphasis added], meaning about 10 years
prior to the study. In other words the authors’ description of their participants suggests that
this was not a study of second language learning.
This point is further clarified in van Beuningen, de Jong, and Kuiken (2012). The
2008 paper is in fact a report of the pilot study that was done for this main experiment. In the
later paper the authors make it clear that they were not concerned with the distinction
between L1 and L2 learners. They included students whose writing was considered weak,
without regard to the language background of those students (see especially their Note 1).
Their concession to the L1-L2 distinction was to reanalyze their data without the students
who came from families that used only Dutch at home, with the result that the findings did
not change. But this additional analysis still included an unknown and probably quite large
number of students who could not reasonably be classified as L2 learners: those for whom
Dutch was one of the home languages, those who had significant very early exposure to
Dutch outside the home, and those who had acquired native or near-native knowledge
through school experience, starting at age 4.
Kang and Han (2015) excluded this main study on the grounds that the effect size it
produced was an outlier, five standard deviations above the overall average of their sample.
This conclusion appears to reflect a misreading of the results. van Beuningen et al. (2012)
used as their measures both a revision task, in which learners used the corrections they were
given to revise their assignment, and new writing tasks. Kang and Han’s stated policy was,
appropriately, to use only data from new writings. But the extreme effect size they obtained
for this study could only have come from the revision data; the new writings showed only
moderate gains (for both measures, see Table 3 of van Beuningen et al., 2012; also Tables 4
and 5). So it appears that the decision to exclude the study as an outlier was a mistake. But
while it should not have been excluded for this reason, its exclusion was nonetheless
appropriate, as the authors’ description of their participants makes it clear that this was not a
study of second language learning.
The Remaining Studies
The conclusion to this point is that none of the five studies for which substantial effect
sizes were reported actually provides any meaningful support for the use of error correction in
second language writing instruction. I turn now to the remaining studies.
First, the number reported in Table 1 for Fazio (2001) is incorrect. The effect size is
listed as .481, indicating a small positive effect. But in fact the study found a negative effect;
i.e., the performance of the correction groups was poorer than that of the no-correction
(commentaries) group. The number .481 appears to have come from a confusion between two
groups. Fazio used both native and non-native speakers, reporting their results separately.
The meta-analysis should, of course, use the non-natives (Fazio’s Table 1), but Kang and Han
appear to have used the results for the native speakers (Table 2). In any case, the number for
this study should be negative.
A similar problem arises, but in a somewhat more confused form, with Jhowry
(2010). Kang and Han (2015) list the effect size as .341. But in Jhowry’s main analysis,
presented in her Table 2, the posttest score for the control group is higher than that of the
correction group, so the g should be negative. This score represents the total number of
correct uses of the forms divided by the total number of uses (p. 27). To confuse things, the
charts in Jhowry’s Appendix C portray the results of another measure, error rates, and here
the correction group is noticeably better than the control group, though no specific numbers
are reported. The author tentatively attributed this finding to the adoption of an avoidance
strategy by corrected students; i.e., they made fewer errors with the target forms because they
limited their use of those forms. I have doubts about the inclusion of this study, due to
vagueness in Jhowry’s description of the treatment and the scoring, along with the
uncertainty created by contrasts between the main analysis and the error rates. But if its
results are to be included the effect size has to be negative.
One requirement for inclusion of a study in the meta-analysis was that it had to use a
group that received no error feedback. The inclusion of Chandler (2003) represents a
deviation from this policy, as Kang and Han (2015) more or less concede (Note 4), because
the group identified as the control group did receive such feedback, differing from the
experimental group only in not being required to put it to use until after the study. Hartshorn
et al. (2010) and Evans et al. (2011) also lacked the necessary no-correction group. These
studies compared the authors’ novel version of correction, “dynamic written corrective
feedback”, to “traditional process writing instruction”, using a comparison group which
received “a wide variety of feedback on the linguistic accuracy of what they produced”
(Hartshorn et al., p. 95). Neither study should be a part of the meta-analysis. Sun (2013) is a
borderline case. The control group received comments like “Pay attention to conjugation of
pl/sing. verbs”, raising doubts about whether the study should have been included.
Control issues appear in a different form with Sheen (2007) and Sheen, Wright, and
Moldawa (2009). The issue here is the proper identification of control groups. The treatment
given the correction groups involved not only correction but also a reading and writing task
designed to present the target forms to the learners and give them practice in using them. The
group that was labeled “control” did not receive this treatment. The comparison made was
thus the combination of the task + correction vs. the absence of both. For the purpose of
determining the effect of correction, this is not a legitimate comparison – the control group
should have been given the same task, just without corrective feedback on it. The implication
is that Sheen (2007) had no valid comparison group. The same is true for Sheen (2010),
which can be identified as the missing study in Table 1, based on Kang and Han’s (2015)
Appendix B. These studies did not meet the requirements for inclusion in the meta-analysis.
The closely related study of Sheen, Wright, and Moldawa (2009) is more interesting.
In addition to the groups used by Sheen (2007), they included a group which was given the
reading and writing task but received no feedback on their writing. This group, labeled the
“writing practice” group, provides a valid comparison for measuring the effects of correction:
It is the actual control group of this study. Interestingly, this point seems to have been
recognized by Ellis, Sheen, Murakami, and Takashima (2008) in their closely related
experiment. They used the same correction groups as well as the “writing practice” group but
properly identified the latter as the control group and did without the condition that was
labeled “control” by Sheen (2007) and Sheen, Wright, and Moldawa. For Sheen, Wright, and
Moldawa, then, it should be clear that the calculation of the effect size should use the scores
of the “writing practice” group and not the “control” group. This could be done in two ways.
If effect sizes for the two correction groups, focused and unfocused, are averaged, the
resulting g is a tiny positive number, far below the reported .570. But it is more enlightening
to separate the correction groups, as the focused portion of the experiment is another study on
first use vs. subsequent use of English articles and therefore of only limited interest. The
unfocused group (which was actually a “somewhat focused” group) is the more interesting of
the two. It yields negative effect sizes; i.e., the correction appears to have been harmful.
A different sort of control problem comes up with Bitchener, Young, and Cameron
(2005). During the study one of the two correction groups received 20 hours per week of
language instruction and the other 10 hours, while the comparison group had only 4. The
negligible effect size produced by this study thus came from a comparison that was biased in
favor of corrected students.
I will conclude this section with a brief summary of the studies with effect sizes
below the “medium” benchmark but above the “small” benchmark; i.e. those that fall in the
middle ground between the top five and the seven that obtained very weak results. Bitchener
and Knoch (2010a), first, is another instance of the narrowly focused work considered above,
with little to offer writing instructors. Ellis et al. (2008), with its large effect size on a delayed
posttest, falls into the same category.² Hartshorn et al. (2010), Chandler (2003), and Evans et
al. (2011) lacked a no-correction group and should have been excluded. Sun (2013), with an
effect size only slightly above the “small” benchmark, is a marginal member of this group.
Fazio (2001) actually yields a negative effect size. That for Sheen et al. (2009) is at best a
tiny positive number or, more appropriately, a negative number, depending on how the
calculation is done. Altogether, there is nothing in these studies to support a favorable view
of correction, and two of them argue strongly against such a view.
² Ellis et al. (2008), in contrast to Sheen et al. (2009), did not report results for the comparison between the unfocused group and the control group on the errors targeted for the former, data that would give it more general relevance.

Exclusion of Relevant Studies

Kang and Han (2015) stated that a study was to be excluded if "the effects of feedback could not be isolated from that of other treatments such as conferences (e.g., Polio, Fleck, & Leder, 1998; Sheppard, 1992)" (p. 4). It is unclear, though, what things should count as "other treatments". Conferences in which the teacher discusses the feedback with students, as in Sheppard (1992), seem to me an integral part of the feedback process as it is commonly done in writing classes, not an extra factor contaminating the data. If discussion of feedback is taken as a separate treatment, one might ask why revision following feedback, which is commonly treated as an independent variable in this research area, does not also qualify as an
“other treatment”. To the best of my knowledge no one (including Kang and Han) thinks that
a study should be excluded because students rewrote their work after receiving feedback.
There is no apparent reason why conferences should be treated differently. The same question
can be raised regarding Hartshorn et al.’s (2010) practice of requiring students to maintain “a
comprehensive inventory of the errors they produce along with the written context in which
they are produced” (p. 88). Why was this not classified as an additional, contaminating
treatment?
It is also unclear how the ban on studies using conferences was applied. Mubarak’s
(2013) treatment included discussion with individual students while they were writing plus
whole-class comments on errors and peer discussion of the errors. This study, included in the
meta-analysis, would appear to fit the criteria for exclusion. For Hartshorn et al. (2010),
“Classroom discussions and activities were centered on the most frequent types of errors
being produced by the students in their daily writing” (pp. 94-95). A similar approach appears
to have been taken in the closely related study of Evans et al. (2011). It is not clear why these
studies were not excluded.
Beyond the lack of clarity and the apparent inconsistency in its application, an
exclusion policy like this one seems counterproductive. A meta-analysis should serve the
interests of the field, meaning in this case that it should provide teachers with information
they can use in deciding to correct or not to correct in their classes. This decision needs to be
based on the effects of correction within genuine teaching contexts, not on what it does in
isolation from the practices that naturally and properly accompany it. Sheppard (1992)
provides such information, as do Polio, Fleck, and Leder (1998). The information in these
cases is that correction, in the realistic teaching contexts used in these studies, is harmful
(Sheppard) or has no effect at all (Polio, Fleck, & Leder). This information should be
included in a meta-analysis that is concerned with the effectiveness of error correction,
particularly one that offers advice to writing instructors.
Discussion
In this section I will first summarize the above critique. As many of the problems
stem more from the nature of meta-analysis than from this particular use of it, I will then
extend the critique to meta-analysis in general. The final sub-section will turn to some issues
in the research itself, focusing on the use of controls.
Summary of the Critique
Table 2 summarizes the main points raised above regarding the various studies
included by Kang and Han (2015). They point to a number of adjustments that I believe
should be made if we are to understand the literature that Kang and Han set out to synthesize.
Most importantly, the studies that narrowly focused on first use vs. subsequent use of
English articles should be treated as a separate category. This category includes Bitchener
(2008) and Bitchener and Knoch (2008, 2010a, 2010b), as well as some of the findings (not
all) from the related studies of Ellis et al. (2008) and Sheen, Wright and Moldawa (2009), and
perhaps Shintani and Ellis (2013) and Ekiert and di Gennaro (2019). Note that I am not
suggesting that this type of research is entirely pointless or that the findings should be
discarded. A teacher might be interested in teaching this particular function of a and the and
want to see relevant evidence on its effectiveness. In terms of the case against correction, this
research might be seen as a pursuit of the “special, hypothetical circumstances under which
correction might not be a bad idea” (Truscott, 1999, p. 121). The essential point here is that
its (very large) limitations must be recognized.
Study | Effect size (g) | Comments
Bitchener (2008) | 1.482 | targeted only one, very simple feature; tasks/tests were designed specifically for that feature; corrected students likely paid more attention to it during the testing; did not look at possible harmful effects; results are challenged by later research
Bitchener & Knoch (2008) | 1.375 | essentially the same study as Bitchener (2008), with same limitations
Bitchener & Knoch (2010b) | 1.161 | essentially the same study as Bitchener (2008), with same limitations
Shintani & Ellis (2013) | .902 | this number comes from the immediate posttest; a test 2 weeks later showed small effects; "the effect was not durable"
van Beuningen et al. (2008) | .888 | not a study of L2 learning; should be excluded
Bitchener & Knoch (2010a) | .642 | essentially the same study as Bitchener (2008), with same limitations
Hartshorn et al. (2010) | .607 | lacked a no-correction group; should be excluded
Sheen et al. (2009) | .570 | apparently based on an invalid comparison; the appropriate comparison yields a tiny g; one correction group's treatment was of the B&K type; the other group ("unfocused") yields a negative g
Chandler (2003), Study 1 | .496 | lacked a no-correction group; should be excluded
Fazio (2001) | .481 | no-correction group outperformed both correction groups; this number should be negative
Evans et al. (2011) | .473 | lacked a no-correction group; should be excluded
Sun (2013) | .472 | comparison group received feedback like "Pay attention to conjugation of pl/sing. verbs"; its inclusion is questionable
Ellis et al. (2008) | .430 | small immediate effect with large delayed effect; same narrow focus as Bitchener (2008) with similar limitations
Kepner (1991) | .383 |
Jhowry (2010) | .341 | this should be a negative g; description of treatment and scoring is limited; inclusion is questionable
Mubarak (2013) | .245 | two measures of general accuracy yielded conflicting results; measures of tense and article accuracy gave weak results
Sheen (2007) | .104 | represents the combined effect of correction and a treatment designed to demonstrate the target use and provide practice with it; should be excluded (but the failure of correction despite the bias is noteworthy)
Bitchener et al. (2005) | .103 | comes from a comparison that was biased in favor of corrected learners; should be excluded (but the failure of correction despite the bias is noteworthy)
Semke (1980) | .089 |
Truscott & Hsu (2008) | .068 | one-shot treatment, not intended to directly test the effectiveness of correction
Table 2. Comments on the studies included by Kang and Han (2015)
I suggest that a meta-analysis on written error correction should also include the
findings of a number of additional studies, beginning with Sheppard (1992) and Polio, Fleck,
and Leder (1998). Several studies not mentioned by Kang and Han (2015) should also be
considered, some of them appearing after the meta-analysis was completed, others
unpublished or appearing in relatively obscure sources. My tentative list of candidates
includes the following: Nakazawa (2006), Baldwin (2008), Muñoz (2011), Khanlarzadeh and
Nemati (2016), and Bonilla López et al. (2018). Also interesting is Nicolás–Conesa,
Manchón, and Cerezo (2019), though the use of a one-shot treatment limits its
meaningfulness. Two other potentially useful studies (Robb, Ross, & Shortreed, 1986; Karim
& Nassaji, 2018) did not provide the necessary information for effect size calculation and so
cannot be included in a meta-analysis, but their findings should nonetheless be recognized. In
this group of ten additional studies, all except one (Bonilla López et al., 2018) obtained
results that were quite weak and in some cases negative, which is to say the correction
appeared to be harmful.
I will not attempt the very large task of providing an alternative meta-analysis here,
one that would incorporate these studies and, in principle, deal with all the often difficult
issues raised above. My goal in this paper is to examine one influential meta-analysis,
showing that its findings cannot support the claim that written error correction is effective,
and to raise a number of issues that come up in a meta-analysis on this topic and in the
research that provides the data for it.
Some General Limitations of Meta-analysis
While meta-analysis is a valuable tool, its limitations must be recognized. Subjective
judgments are difficult if not impossible to avoid in establishing and applying inclusion
criteria and in deciding what number(s) to use when a study includes multiple measures. This
subjectivity can lead to a variety of different conclusions, even if we set aside the (very
significant) issue of reviewer bias (see Truscott, 2016).
Another concern is that the judgments, no matter how fairly and properly they are
made, can have the effect of hiding important information. Kang and Han’s (2015) decision
to use only immediate posttests was reasonable, but the consequence is that some important
information gets lost. Results from delayed tests are more meaningful than those from
immediate tests, and the two are often quite different. The striking contrasts found by Ellis et
al. (2008) and by Shintani and Ellis (2013) are cases in point (but are by no means the only
examples). The effect size reported for a given study cannot be understood without reference
to longer-term effects.
This is not the only way that important findings can get lost in a meta-analysis. A
single study will often include various measures, and decisions have to be made about how to
deal with them. Kang and Han (2015) adopted the reasonable policy of using only one
number per study, but this policy, again, results in the loss of interesting information. Sheen
et al. (2009) provides an example. The study looked at both the effects of narrow focus on a
specific use of English articles (first vs. subsequent mention) and a type of correction that is
potentially of more general value. Each by itself can provide interesting information (one
more so than the other), as can the comparison between them. Averaging them together has
the effect of removing this information in favor of a single number that is essentially
meaningless. For this case the solution is, again, to treat the two types of information as
separate targets of meta-analysis.
The striking inconsistencies in the findings of Mubarak (2013) provide another
example. General accuracy was measured in two ways: error rates showed strong effects (at
least on the immediate posttest) while error-free T-units showed weak results. The author also
measured specific accuracy, in terms of error-free T-units, on tense and article use, which
were perhaps the most common errors in the study. The results were very weak, raising the
crucial (but unaddressed) question of what error types were responsible for the strong effects
found for general accuracy on the error-rate measure. These inconsistencies and open issues
need to be recognized and considered. The reliance on a single number, in order to meet the
requirements of the meta-analysis, is not a good way to understand the results of this study.
Control Issues in Error Correction Research
Issues repeatedly arose in the above discussion regarding the nature and use of control
groups – issues for both researchers and reviewers. One can ask, first, whether experimental
and control groups are truly comparable in some of the studies, a necessity if any meaningful
conclusions are to be drawn from the comparison. An example of the problem is Bitchener,
Young and Cameron (2005), in which the control group received only a small fraction of the
overall language instruction received by the correction groups during the study. The
questions raised above about the groups used by Ellis, Sheen, Murakami, and Takashima
(2008) point to other examples of possible problems. The use of control conditions by Sheen
(2007, 2010) and Sheen, Wright, and Moldawa (2009) is simply wrong.
Another crucial issue is whether the comparison group used in a study was a genuine
no-correction group, a condition generally considered necessary in this research area. Should
a meta-analysis include a study like Sun (2013), in which the comparison group received some fairly specific comments on errors of language form at the end of their assignments? Kang
and Han’s (2015) defense of their decision to include Chandler (2003), despite the extensive
error correction provided to the comparison group, seems to me untenable. Hartshorn et al.
(2010) and Evans et al. (2011), both included in the meta-analysis, were not interested in
using a no-correction group; their goal was to show that one type of error correction is
superior to another.
The limitations resulting from the lack of a genuine no-correction group, or the use of
a questionable one, have to be recognized by both authors and reviewers. Findings like those
of Hartshorn et al. (2010) and Evans et al. (2011), for example, may have significant
implications for teachers who are already committed to correcting their learners’ errors and
are now looking for the best way to do it, but they have nothing to offer to a teacher who is
undecided about whether to correct or not to correct. It is important to avoid the fallacy of
citing such experiments as evidence on this more fundamental question.
One more issue regarding control groups should be considered. Truscott (2003)
argued that the absence of correction is not inherently demotivating and can in fact have
positive effects on students’ attitudes. But if the conditions of an experiment encourage the
uncorrected students in the intuitive belief that correction would benefit them, this might be
expected to negatively influence their performance. They could be led to feel that they are
being denied valuable help which other students are receiving – that they are being cheated.
So when favorable results are reported for corrected groups relative to a no-correction group,
we need to ask exactly how the latter was treated and how this treatment might have
influenced students’ attitudes and therefore their performance. This question becomes
especially significant when positive results obtained in a study owe much of their strength to
declines in the control group’s performance over the course of the study, as in the cases of
van Beuningen et al. (2008) and Bonilla López et al. (2018).
The issue here is the validity of the comparisons that are made in the experiments.
The ultimate pedagogical issue behind this research is whether teachers should correct in
their classes. A teacher who decides not to is presumably one who feels that teaching without
correction is a good idea, and who will show this belief to the students and possibly explain
to them the reasons for it. Thus if we want to apply the findings of a study to real teaching,
the researcher should try to create these conditions for the control groups. At the very least,
researchers should be careful to avoid encouraging in uncorrected students a negative view of
the treatment they are receiving. To the best of my knowledge, this point has never been
addressed in any research reports, so its significance is an open question.
Conclusion
Kang and Han (2015) conclude with a statement that their findings provide “a clear
message for L2 writing instructors, that written corrective feedback can improve the
grammatical accuracy of student writing” (p. 12). I have argued here that their findings in fact
offer no basis for such a message. The argument took the form of a detailed critique rather
than an alternative meta-analysis, because the former is, I believe, the most effective way to
make the point. It also has to be recognized that any meta-analysis on this topic inevitably
incorporates many specific assumptions on potentially contentious questions, questions that
need to be explicitly addressed in their own right.
While meta-analysis can be a valuable tool, it has important limitations. Its results are
inevitably influenced, if not shaped, by subjective judgments that the reviewer has to make.
Its requirements can tie the reviewer’s hands, potentially resulting in questionable inclusion
of some information and questionable exclusion of other information. These factors readily
lead to debatable and somewhat simplistic summaries of the research findings in the field. It
should come as no surprise then that meta-analyses of error correction research have reached
wildly differing conclusions (see Plonsky & Brown, 2015; Truscott, 2016). The moral is that
the numbers that come out of a meta-analysis and the conclusions offered by its authors
should never be taken simply at face value; they must be subjected to critical analysis, of the
sort I have offered here.
References
Baldwin, C.A. (2008). To correct or not to correct: Error correction in L2 writing
instruction. MSc thesis, Aston University, UK.
Bitchener, J. (2008). Evidence in support of written corrective feedback. Journal of Second
Language Writing, 17, 102-118.
Bitchener, J., & Knoch, U. (2008). The value of written corrective feedback for migrant and
international students. Language Teaching Research, 12, 406-431.
Bitchener, J., & Knoch, U. (2010a). The contribution of written corrective feedback to
language development: A ten month investigation. Applied Linguistics, 31, 193-214.
Bitchener, J., & Knoch, U. (2010b). Raising the linguistic accuracy level of advanced L2
writers with written corrective feedback. Journal of Second Language Writing, 19,
207-217.
Bitchener, J., Young, S., & Cameron, D. (2005). The effect of different types of feedback on
ESL student writing. Journal of Second Language Writing, 14, 191-205.
Bonilla López, M., van Steendam, E., Speelman, D., & Buyse, K. (2018). The differential
effects of comprehensive feedback forms in the second language writing class.
Language Learning, 68, 813–850.
Chandler, J. (2003). The efficacy of various kinds of error feedback for improvement in the
accuracy and fluency of L2 student writing. Journal of Second Language Writing, 12,
267–296.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Ekiert, M., & di Gennaro, K. (2019, online). Focused written corrective feedback and
linguistic target mastery: Conceptual replication of Bitchener and Knoch (2010).
Language Teaching.
Ellis, R., Sheen, Y., Murakami, M., & Takashima, H. (2008). The effects of focused and
unfocused written corrective feedback in an English as a foreign language context.
System, 36, 353-371.
Evans, N.W., Hartshorn, K.J., & Strong–Krause, D. (2011). The efficacy of dynamic written
corrective feedback for university-matriculated ESL learners. System, 39, 229–239.
Fazio, L.L. (2001). The effect of corrections and commentaries on the journal writing
accuracy of minority- and majority language students. Journal of Second Language
Writing, 10, 235–249.
Hartshorn, K.J., Evans, N.W., Merrill, P.F., Sudweeks, R.R., Strong–Krause, D., & Anderson,
N.J. (2010). Effects of dynamic corrective feedback on ESL writing accuracy. TESOL
Quarterly, 44, 84–109.
Jhowry, K. (2010). Does the provision of an intensive and highly focused indirect corrective
feedback lead to accuracy? MA thesis, University of North Texas, Denton, TX.
Kang, E., & Han, Z. (2015). The efficacy of written corrective feedback in improving L2
written accuracy: A meta-analysis. Modern Language Journal, 99, 1-18.
Karim, K., & Nassaji, H. (2018, online). The revision and transfer effects of direct and
indirect comprehensive corrective feedback on ESL students’ writing. Language
Teaching Research.
Kepner, C.G. (1991). An experiment in the relationship of types of written feedback to the
development of second-language writing skills. Modern Language Journal, 75, 305-
313.
Khanlarzadeh, M., & Nemati, M. (2016). The effect of written corrective feedback on
grammatical accuracy of EFL students: An improvement over previous unfocused
designs. Iranian Journal of Language Teaching Research, 4(2), 55-68.
Lightbown, P.M. (1983). Exploring relationships between developmental and instructional
sequences in L2 acquisition. In H.W. Seliger & M.H. Long (Eds.), Classroom
oriented research in second language acquisition. Rowley, MA: Newbury.
Lipsey, M.W., & Wilson, D.B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Mubarak, M. (2013). Corrective feedback in L2 writing: A study of practices and
effectiveness in the Bahrain context. Doctoral dissertation, The University of
Sheffield, Sheffield, UK.
Muñoz, C. (2011). The effects of two methods of error correction on L2 writing: The case of
acquisition of the Spanish preterite and imperfect. Doctoral dissertation, Purdue
University, West Lafayette, Indiana.
Nakazawa, K. (2006). Efficacy and effects of various types of teacher feedback on student
writing in Japanese. Doctoral dissertation, Purdue University, West Lafayette,
Indiana.
Nicolás–Conesa, F., Manchón, R.M., & Cerezo, L. (2019, online). The effect of unfocused
direct and indirect written corrective feedback on rewritten texts and new texts:
Looking into feedback for accuracy and feedback for acquisition. Modern Language
Journal.
Norris, J.M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50, 417–528.
Pashazadeh, A., & Marefat, H. (2010). The long-term effect of selective written grammar
feedback on EFL learners’ acquisition of articles. Pazhuhesh-e Zabanha-ye Khareji,
No. 56, Special Issue, English, 49-67.
Pica, T. (1983). Adult acquisition of English as a second language under different conditions
of exposure. Language Learning, 33, 465-497.
Plonsky, L., & Brown, D. (2015). Domain definition and search techniques in meta-analyses
of L2 research (Or why 18 meta-analyses of feedback have different results). Second
Language Research, 31, 267–278.
Plonsky, L., & Oswald, F.L. (2014). How big is ‘big’? Interpreting effect sizes in L2
research. Language Learning, 64, 878–912.
Polio, C., Fleck, C., & Leder, N. (1998). "If I only had more time": ESL learners' changes in
linguistic accuracy on essay revisions. Journal of Second Language Writing, 7, 43–
68.
Robb, T., Ross, S., & Shortreed, I. (1986). Salience of feedback on error and its effect on
EFL writing quality. TESOL Quarterly, 20, 83-95.
Rosenthal, R. (1991). Meta-analytic procedures for social research (Rev. ed.). Newbury
Park, CA: Sage.
Semke, H.M. (1980). The comparative effects of four methods of treating free-writing
assignments on the second language skills and attitudes of students in college level
first year German. Doctoral dissertation, University of Minnesota.
Sheen, Y. (2007). The effect of focused written corrective feedback and language aptitude on
ESL learners’ acquisition of articles. TESOL Quarterly, 41, 255-283.
Sheen, Y. (2010). Differential effects of oral and written corrective feedback in the ESL
classroom. Studies in Second Language Acquisition, 32, 201–234.
Sheen, Y., Wright D., & Moldawa, A. (2009). Differential effects of focused and unfocused
written correction on the accurate use of grammatical forms by adult ESL learners.
System, 37, 556-569.
Sheppard, K. (1992). Two feedback types: Do they make a difference? RELC Journal, 23,
103–110.
Shintani, N., & Ellis, R. (2013). The comparative effect of direct written corrective feedback
and metalinguistic explanation on learners’ explicit and implicit knowledge of the
English indefinite article. Journal of Second Language Writing, 22, 286–306.
Sun, S. (2013). Written corrective feedback: Effects of focused and unfocused grammar
correction on the case acquisition in L2 German. Doctoral dissertation, University of
Kansas, Lawrence, KS.
Truscott, J. (1999). The case for "The case against grammar correction in L2 writing classes": A response to Ferris. Journal of Second Language Writing, 8, 111–122.
Truscott, J. (2003). Students in the correction-free writing class. In H.-C. Liou, J. Katchen, &
H. Wang (Eds.), Lingua Tsing Hua: A 20th Anniversary Commemorative Anthology
(pp. 265-276). Taipei: Crane.
Truscott, J. (2007). The effect of error correction on learners’ ability to write accurately.
Journal of Second Language Writing, 16, 255–272.
Truscott, J. (2016). The effectiveness of error correction: Why do meta-analytic reviews
produce such different answers? In Y-N. Leung (Ed.), Epoch making in English
teaching and learning: Evolution, innovation, revolution. Taipei: Crane.
Truscott, J., & Hsu, A.Y.-p. (2008). Error correction, revision, and learning. Journal of
Second Language Writing, 17, 292–305.
van Beuningen, C.G., de Jong, N.H., & Kuiken, F. (2008). The effect of direct and indirect
corrective feedback on L2 learners’ written accuracy. ITL International Journal of
Applied Linguistics, 156, 279–296.
van Beuningen, C.G., de Jong, N.H., & Kuiken, F. (2012). Evidence on the effectiveness of
comprehensive error correction in second language writing. Language Learning, 62,
1–41.
Weinert, R. (1987). Processes in classroom second language development: The acquisition of
negation in German. In R. Ellis (Ed.), Second language acquisition in context (pp. 83-
99). Englewood Cliffs, NJ: Prentice-Hall.