https://www.escienceediting.org Copyright © 2022 Korean Council of Science Editors
This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
pISSN 2288-8063
eISSN 2288-7474
Received: July 23, 2021
Accepted: November 8, 2021
Correspondence to Kevin Heintz
content@wordvice.com
ORCID
Kevin Heintz
https://orcid.org/0000-0002-8964-2573
Younghoon Roh
https://orcid.org/0000-0002-1008-5699
Jonghwan Lee
https://orcid.org/0000-0002-4660-6046
Original Article
Sci Ed 2022;9(1):37-45
https://doi.org/10.6087/kcse.261
Comparing the accuracy and effectiveness
of Wordvice AI Proofreader to two
automated editing tools and human editors
Kevin Heintz1, Younghoon Roh2, Jonghwan Lee2
1Department of Research & Development, Wordvice Editing Service, Des Moines, IA, USA; 2Department of Research &
Development, Wordvice Editing Service, Seoul, Korea
Abstract
Purpose: Wordvice AI Proofreader is a recently developed web-based artificial intelligence-driven text processor that provides real-time automated proofreading and editing of user-input text. This study aims to compare its accuracy and effectiveness to expert proofreading by human editors and to two other popular proofreading applications, the automated writing analysis tools of Google Docs and Microsoft Word. Because this tool was primarily designed for use by academic authors to proofread their manuscript drafts, the comparison of its efficacy to that of other tools was intended to establish its usefulness for these authors.
Methods: We performed a comparative analysis of proofreading completed by the Wordvice
AI Proofreader, by experienced human academic editors, and by two other popular proofread-
ing applications. The number of errors accurately reported and the overall usefulness of the vocabulary suggestions were measured using the Generalized Language Evaluation Understanding (GLEU) metric and open dataset comparisons.
Results: In the majority of texts analyzed, the Wordvice AI Proofreader achieved performance
levels at or near that of the human editors, identifying similar errors and offering comparable
suggestions in the majority of sample passages. The Wordvice AI Proofreader also had higher
performance and greater consistency than that of the other two proofreading applications evalu-
ated.
Conclusion: We found that the overall functionality of the Wordvice artificial intelligence proof-
reading tool is comparable to that of a human proofreader and equal or superior to that of two
other programs with built-in automated writing evaluation proofreaders used by tens of millions
of users: Google Docs and Microsoft Word.
Keywords
Artificial intelligence; Natural language processing; English proofreading; Writing assistant;
Human editing
Introduction
Background/rationale:
The use of English in all areas of aca-
demic publishing and the need for nearly all non-native Eng-
lish-speaking researchers to compose research studies in Eng-
lish have created difficulties for non-native English speakers
worldwide attempting to publish their work in international
journals. Faced with the time-consuming process of self-edit-
ing before submission to journals, many researchers are now
using Automated Writing Analysis tools to edit their work
and enhance their academic writing development [1,2]. These
include grammatical error correction (GEC) programs that
automatically identify and correct objective errors in text en-
tered by the user. At the time of this study, the most popular GEC tools are branded automated English proofreading programs
that include Grammarly [3], Ginger Grammar Checker [4],
and Hemingway Editor [5], all of which were developed using
natural language processing (NLP) techniques; NLP is a type
of artificial intelligence (AI) technology that allows computers
to interpret and understand text in the same way a human does.
Although these AI writing and proofreading programs continue to grow in popularity, reviews of their overall effectiveness are inconsistent. Studies similar to the present one have analyzed the effectiveness of NLP text editors and their potential to approach the level of revision provided by expert human proofreaders [6-8]. At least one 2016 article [9] evaluates popular GEC tools and comes to the terse conclusion that "grammar checkers do not work." The jury thus appears to be out on the overall usefulness of modern GEC programs in correcting writing.
However, Napoles et al. [10] propose applying the General-
ized Language Evaluation Understanding (GLEU) metric, a
variant of the Bilingual Evaluation Understudy (BLEU) algo-
rithm that "accounts for both the source and the reference text," to establish a ground truth ranking that is rooted in judgements by human editors. Similarly, the present study applies a GLEU metric to more precisely compare the accuracy of these automated proofreading tools with that of revision by
human editors. While the practical application of many of
these programs is evidenced by their success in the market-
place of writing and proofreading aids, gaps remain in how
accurate and consistent certain AI proofreading programs are
in correcting grammatical and spelling errors.
Objectives:
This study aimed to analyze the effectiveness of the Word-
vice AI Proofreader [11], a web-based AI-driven text proces-
sor that provides real-time automated proofreading and edit-
ing of user-input text. We also compared its effectiveness to
expert proofreading by human editors and two other popular
writing tools with proofreading and grammar checking appli-
cations, Google Docs [12] and Microsoft (MS) Word [13].
Methods
Ethics statement:
This is not a human subject study. There-
fore, neither approval by the institutional review board nor
obtaining informed consent is required.
Study design:
This was a comparative study using a qualitative open-dataset comparison and a quantitative GLEU metric.
Setting:
The Wordvice AI Proofreader tool was measured in
terms of its ability to identify and correct objective errors, and
it was evaluated by comparing its performance to that of ex-
perienced human proofreaders and to two other commercial
AI writing assistant tools with proofreading features (MS Word and Google Docs) in June 2021. By combining the ap-
plication of a quantitative GLEU metric with a qualitative
open-dataset comparison, this study compared the effective-
ness of the Wordvice AI Proofreader with that of other editing
methods, both in the correction of “objective errors” (gram-
mar, punctuation, and spelling) and in the identification and
correction of more “subjective” stylistic issues (including weak
academic language and terms).
Data sources
Open datasets
The performance of the Wordvice AI Proofreader was mea-
sured using the JHU FLuency-Extended GUG (JFLEG) open
dataset 1 [14], a dataset developed by researchers at Johns
Hopkins University and consisting of a total of 1,501 sentenc-
es, 800 of which were used to comprise Dataset 1 in the ex-
periment (https://github.com/keisks/jfleg). The JFLEG data
consists of sentence pairs, showing the input text and the re-
sults of proofreading by professional editors. These datasets
assess improvements in sentence fluency (style revisions),
rather than recording all objective error corrections. Accord-
ing to Sakaguchi et al. [15], unnatural sentences can result
when the annotator collects only the minimum revision data
within a range of error types, and letting the annotator re-
phrase or rewrite a given sentence can result in more compre-
hensible and natural sentences. Thus, the JFLEG data was ap-
plied with the aim of assessing improvements in textual fluen-
cy rather than simple grammatical correction.
Because many research authors using automated writing as-
sistant tools are English as a second language writers, the
proofread data was based on sentences written by non-native
English speakers. This was designed to create a more accurate
sample pool for likely users of the AI Proofreader. “Proofread
data” refers to data that has been corrected by professional na-
tive speakers with master’s and doctoral degrees in the aca-
demic domain. The data were constructed in pairs: sentence
before receiving proofreading and sentence after receiving
proofreading.
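For illustration, the following sketch (not the authors' code) shows one way such sentence pairs can be loaded from the public JFLEG repository; the dev.src and dev.ref0 file names and the local directory are assumed from the repository layout rather than taken from this study.

```python
# Minimal sketch of loading JFLEG before/after sentence pairs (assumed file layout).
from pathlib import Path

jfleg_dev = Path("jfleg/dev")  # hypothetical local checkout of github.com/keisks/jfleg
sources = (jfleg_dev / "dev.src").read_text(encoding="utf-8").splitlines()
references = (jfleg_dev / "dev.ref0").read_text(encoding="utf-8").splitlines()

# Each pair holds the unedited input sentence and one professionally corrected version,
# mirroring the before/after structure described above.
pairs = list(zip(sources, references))
print(len(pairs))
print(pairs[0])
```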
Measurement (evaluation metrics)
Error type comparison
A qualitative comparison was performed on T1, P1, P2, and
P3 for categories including stylistic improvement (fluency,
vocabulary) and objective errors (determiner/article correc-
tion, spell correction). Table 2 illustrates these details for each
writing correction method (human proofreading, Wordvice
AI, MS Word, and Google Docs).
A GLEU metric [16] was used to evaluate the performance
of all proofreading types (T1, P1, P2, and P3). GLEU is an indicator based on the BLEU metric [17]; it measures the overlap between ground truth sentences and predicted sentences in terms of n-grams, assigning higher scores when longer word sequences match. To calculate the GLEU score, we record
all sub-sequences of 1, 2, 3, or 4 tokens in a given predicted
and ground truth sentence. We then compute a recall (Equa-
tion 1), which is the ratio of the number of matching n-grams
to the number of total n-grams in the ground truth sentence;
we also compute a precision (Equation 2), which is the ratio
of the number of matching n-grams to the number of total n-
grams in the predicted sequence [18]. The Python NLTK library (https://www.nltk.org/_modules/nltk/translate/gleu_score.html) was used to calculate GLEU.
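The two equations referenced above, reconstructed here from the verbal definitions in the text, are:

$$\text{recall} = \frac{\text{number of matching } n\text{-grams}}{\text{number of } n\text{-grams in the ground truth sentence}} \qquad (1)$$

$$\text{precision} = \frac{\text{number of matching } n\text{-grams}}{\text{number of } n\text{-grams in the predicted sentence}} \qquad (2)$$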
The GLEU score is then simply the minimum of recall and
precision. The GLEU score always falls between 0 (no matches) and 1 (complete match). As with the BLEU metric,
the higher the GLEU score, the higher the percentage of iden-
tified and corrected errors and issues captured by the proof-
reading tool. These are expressed as a percentage of the total
revisions applied in the ground truth model (human-edited
text), including objective errors and stylistic issues. The closer
to the ground truth editing results, the higher the perfor-
mance score and the better the editing quality.
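To make the calculation concrete, the sketch below (not the authors' implementation) computes the minimum of n-gram recall and precision directly and through the NLTK function cited above; the sample sentence pair is hypothetical.

```python
# Minimal sketch of the GLEU calculation described above.
from collections import Counter
from nltk.translate.gleu_score import sentence_gleu

def ngram_counts(tokens, max_n=4):
    """Count all contiguous sub-sequences of 1 to max_n tokens."""
    return Counter(
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )

def simple_gleu(ground_truth, predicted, max_n=4):
    """Minimum of n-gram recall (Equation 1) and precision (Equation 2)."""
    truth = ngram_counts(ground_truth, max_n)
    pred = ngram_counts(predicted, max_n)
    matching = sum((truth & pred).values())    # overlapping n-grams
    recall = matching / sum(truth.values())    # Equation 1
    precision = matching / sum(pred.values())  # Equation 2
    return min(recall, precision)

# Hypothetical pair: t1 = human-edited ground truth, p1 = automated tool output.
t1 = "Second , Menzies points out that Chinese ships used distinctive anchors .".split()
p1 = "Second , Menzies points that Chinese ships used distinctive anchors .".split()

print(simple_gleu(t1, p1))                            # direct computation
print(sentence_gleu([t1], p1, min_len=1, max_len=4))  # NLTK call; should match the direct computation
```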
The sample data used in the experiment consisted of 1,245
sentences (i.e., 1,245 pairs of sentences were assessed both be-
fore and after proofreading), and these sentences were derived
from eight academic domains: arts and humanities, biosci-
ences, business and economics, computer science and mathe-
matics, engineering and technology, medicine, physical sci-
ences, and social sciences. Table 1 summarizes the number of
sentences applied from each academic domain (Dataset 2).
GLEU-derived datasets
The GLEU metric was used to create four datasets of comparison. The first dataset (Dataset 3), GLEU 1 (T1, P1), compares the correctness of the output sentence text of the Wordvice AI Proofreader ("predicted sentence," P1) with that of human proofreaders ("ground truth sentence," T1). The second dataset (Dataset 4), GLEU 2 (T1, P2), compares the correctness of the Wordvice AI Proofreader's predicted sentence (P1). The third dataset (Dataset 5), GLEU 3 (T1, P2), compares the correctness of MS Word's predicted sentence (P2). The fourth dataset (Dataset 6), GLEU 4 (T1, P3), compares the correctness of Google Docs' predicted sentence (P3).
Statistical methods:
Descriptive statistics were applied for comparison between the target program and the other editing tools.
Table 1. Summary of experiment dataset
Subject area No. of sentences
Arts and humanities 57
Biosciences 54
Business and economics 58
Computer science and mathematics 60
Engineering and technology 52
Medicine 53
Physical sciences 55
Social sciences 56
JFLEG 800
Total 1,245
JFLEG, JHU FLuency-Extended GUG.
Table 2. Comparison of the corrections and improvements of the sentences before correction, the sentences after correction of the comparative methods, and
the sentences after the correction by Wordvice AI Proofreader
Correction method
Stylistic improvement Objective errors
Fluency improvement Vocabulary improvement Determiner/article correction Spelling correction
Human editing Yes Yes Yes Yes
Wordvice AI Proofreader Yes Intermediate Yes Yes
Google Docs Yes No Intermediate Yes
Microsoft Word No No Intermediate Yes
Results
Quantitative results based on GLEU
Comparison of all automated writing evaluation
proofreaders
Table 3 shows the average GLEU score in terms of percentag-
es of corrections made by the Wordvice AI Proofreader and
other automated proofreading tools as compared to the
ground truth sentences. As an average of total corrections
made, the Wordvice AI Proofreader had the highest performance of the Automated Writing Analysis proofreading tools,
performing 77% of the corrections applied by the human edi-
tor.
Based on the dataset of 1,245 sentences used in the experiment, the proofreading performance of Wordvice AI exceeded that of the Google Docs proofreader by a maximum of 11.2%P and a minimum of 3.0%P. Additionally, the GLEU score of the Wordvice AI-revised text was up to 13.0%P higher than that of the sentences before proofreading.
Analysis of variance was used to determine the statistical
significance of the values. Comparisons made between Word-
vice AI, Google Docs, and MS Word proofreading tools re-
vealed a statistically significant difference in proofreading performance (analysis of variance, P < 0.05) (Table 4).
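For readers who wish to reproduce this test, the sketch below (not the authors' code) runs a one-way ANOVA over the per-subject-area percentages in Table 3 using SciPy; three groups of eight observations give the between-groups and within-groups degrees of freedom (2 and 21) shown in Table 4, and the resulting F statistic and P value should closely match those reported there.

```python
# Minimal sketch of the one-way ANOVA summarized in Table 4,
# using the per-subject-area correction percentages from Table 3.
from scipy import stats

wordvice    = [78.5, 75.7, 79.4, 74.5, 74.1, 80.5, 78.3, 78.1]
google_docs = [73.2, 68.5, 68.2, 71.5, 67.5, 73.5, 73.4, 71.5]
ms_word     = [65.1, 64.9, 67.1, 66.5, 65.8, 62.9, 66.5, 68.5]

f_stat, p_value = stats.f_oneway(wordvice, google_docs, ms_word)
print(f"F = {f_stat:.2f}, P = {p_value:.2e}")  # expected to be significant at P < 0.05
```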
Comparison of Wordvice AI Proofreader and Google
Docs proofreading tool
The Google Docs proofreader scored second in total corrections. Our comparative method confirmed that the deviation in Wordvice AI's performance was smaller than that in the performance of the Google Docs and MS Word proofreaders.
The proofreading performance of Wordvice AI (with a
variation of 5.4%) was more consistent in terms of percentage
of errors corrected compared to MS Word (with a variation of
5.6%), but was slightly less consistent than the Google Docs
proofreader (with a variation of 5%).
Comparison of Wordvice AI Proofreader and MS Word
proofreading tool
We compared the AI Proofreader's performance in each academic subject area to that of Google Docs and MS Word, as listed in the Methods section (Tables 3, 5, 6). In each of the eight subject areas, the Wordvice AI Proofreader showed the highest proofreading performance, by total percentage of ground truth sentence corrections applied, at 79.4%. When compared using the GLEU method, MS Word applied the least amount of revision of the three proofreading tools, and its output remained closest to the original source text. Table 3 shows the comparison between the performance of
Table 3. Percentage of appropriate corrections of all automated proofreaders compared to ground truth sentence (100% correct)
Subject area Original sentence (%) Wordvice AI Proofreader (%) Google Docs (%) Microsoft Word (%)
Arts and humanities 61.5 78.5 73.2 65.1
Biosciences 62.6 75.7 68.5 64.9
Business and economics 66.5 79.4 68.2 67.1
Computer science and mathematics 65.1 74.5 71.5 66.5
Engineering and technology 64.3 74.1 67.5 65.8
Medicine 61.5 80.5 73.5 62.9
Physical sciences 67.8 78.3 73.4 66.5
Social sciences 65.8 78.1 71.5 68.5
Average 64.4 77.4 70.9 65.9
Table 4. One-way analysis of variance results of proofreading performance analysis between Automated Writing Analysis tools
Source of variation Sum of squares Degrees of freedom Mean squares F P-value F crit
Between groups 0.052960333 2 0.026480167 54.78317838 4.65E-09 3.466800112
Within groups 0.010150625 21 0.000483363
Total 0.063110958 23
the Wordvice AI Proofreader and Google Docs.
The Wordvice AI Proofreader exhibited higher performance than MS Word in every subject area. As illustrated in Table 6, the Wordvice AI Proofreader outperformed the MS Word proofreader by 17.6%P in the subject area of medicine and by 8.1%P in computer science and mathematics. It also exhibited an 11.4%P total average performance advantage over MS Word across all subject areas.
Qualitative results
Qualitative results were derived from an open dataset by ap-
plying a set of error category criteria (Table 2). These criteria
are applied to the input sentences before proofreading, input
sentences proofread by MS Word and Google Docs, and sen-
tences proofread by Wordvice AI.
Criteria 1. Fluency improvement (stylistic improvement)
The Wordvice AI Proofreader improved sentence fluency by
editing awkward expressions, similar to revision applied in
documents edited by editing experts (“human editing”). In
Table 7, “point” was used to indicate how different editing ap-
plications can interpret the intended or “correct” meaning of
words that have multiple potential meanings. In the original
sentence instance, “point” means pointing a finger or posi-
tioning something in a particular direction. However, “point
out” means indicating the problem, and thus the original term
“point” was changed to “point out” by human editing. Because
our study considers the sentence revised by human editing as
100% correct, this result accurately conveys the intended
meaning of the sentence—here, “point out” is more appropri-
ate than "point."
Google Docs applied the same correction, changing “points”
to “points out.” However, it did not correct the misspelling
“scond,” the intended meaning of which human editing rec-
ognized as "second." In contrast, Wordvice AI corrected both of
these errors perfectly, following the human editor’s revisions.
MS Word did not detect or correct either error in this sen-
tence.
Criteria 2. Vocabulary improvement (stylistic
improvement)
Wordvice AI Proofreader applied appropriate terminology to
convey sentence meaning in the same manner as the human
editor. Human editing removed the unnecessary definite arti-
cle “the” from the phrase “the most countries” to capture the
intended meaning of “an unspecified majority”; it also
Table 5. Comparison of Wordvice AI Proofreader performance to Google Docs proofreader by academic subject area
Subject area Wordvice AI Proofreader (%) Google Docs (%) Difference (%)
Arts and humanities 78.5 72.2 6.3
Biosciences 75.7 70.5 5.2
Business and economics 79.4 69.8 9.6
Computer science and mathematics 74.5 71.5 3.0
Engineering and technology 74.1 68.5 5.6
Medicine 80.5 73.5 7.0
Physical sciences 77.2 73.4 3.8
Social sciences 78.1 72.5 5.6
Table 6. Comparison of Wordvice AI Proofreader performance to Microsoft Word’s proofreader by academic subject area
Subject area Wordvice AI (%) Microsoft Word (%) Difference (%)
Arts and humanities 78.5 65.1 13.4
Biosciences 75.7 64.9 10.8
Business and economics 79.4 67.1 12.3
Computer science and mathematics 74.5 66.5 8.0
Engineering and technology 74.1 65.8 8.3
Medicine 80.5 62.9 17.6
Physical sciences 77.2 66.5 10.7
Social sciences 78.1 68.5 9.6
changed the phrase “functioning of the public transport” to
“public transport” to reduce wordiness (Table 8).
Similarly, Wordvice AI improved the clarity of the sentence by removing the unnecessary article "the" from the abovementioned phrase. In addition, Wordvice AI improved clarity by inserting a comma and the word "completely," neither of which revisions was made by human editing. Neither Google Docs nor MS Word performed these revisions.
Criteria 3. Determiner/article correction (objective errors)
In the grammar assessment, Wordvice AI exhibited the same
level of performance as human editing. Table 9 shows that the
objective errors identified and corrected by Wordvice AI were
the same as those corrected by human editing. A comma is
required before the phrase “in other words” to convey the
correct meaning, but the comma is omitted in the original.
Both the human edit and Wordvice AI edit detected the error
and added a comma appropriately.
Additionally, the definite article “the” should be deleted
from the original sentence because it is unnecessary in this
usage, and both the human edit and Wordvice AI edit per-
formed this revision correctly. Finally, because the human
body is not composed of one bone, but multiple bones, the
Table 7. Comparative sentence example evaluating fluency improvement
Fluency improvement Sentence
Original (source text) Scond, Menzied points that chinese ships in the 1400s used very distinctive anchors that were round stones with a hole in the
middle.
Human editing Second, Menzied points out that chinese ships in the 1400s used very distinctive anchors that were round stones with a hole in
the middle.
Wordvice AI Proofreader Second, Menzied points out that chinese ships in the 1400s used very distinctive anchors that were round stones with a hole in
the middle.
Google Doc Scond, Menzied points out that chinese ships in the 1400s used very distinctive anchors that were round stones with a hole in
the middle.
Microsoft Word Second, Menzies points those Chinese ships in the 1400s used very distinctive anchors that were round stones with a hole in the
middle.
Text marked in red denotes incorrect alterations to the input text; text marked in blue denotes correct alterations to the input text.
Table 8. Comparative sentence example evaluating vocabulary improvement
Vocabulary improvement Sentence
Original (source text) Unfortunately in the most of the countries the functioning of the public transport is not perfecty organised.
Human editing Unfortunately in most countries, public transport is not perfectly organised.
Wordvice AI Proofreader Unfortunately, in most countries, the functioning of public transport is not completely organized.
Google Doc Unfortunately in most of the countries the functioning of the public transport is not perfectly organised.
Microsoft Word Unfortunately, in most of the countries the functioning of the public transport is not perfectly organized.
Text marked in red denotes incorrect alterations to the input text; text marked in blue denotes correct alterations to the input text; text marked in pink denotes a
style edit to improve clarity or meaning.
Table 9. Comparative sentence example evaluating determiner and article correction
Determiner/article correction Sentence
Original (source text) He said in other words that the more flouride may create damage in human body, specifically the bone.
Human editing He said, in other words, that the more fluoride may create damage to the human body, specifically the bones.
Wordvice AI Proofreader He said, in other words, that the more fluoride may create damage to the human body, specifically the bones.
Google Doc He said in other words that the more fluoride may create damage in the human body, specifically the bone.
Microsoft Word He said in other words that the more fluoride may create damage in human body, specifically the bone.
Text marked in red denotes incorrect alterations to the input text; text marked in blue denotes correct alterations to the input text.
term “bone” should be revised to “bones.” Both Wordvice AI
and the human editor recognized this error and corrected it
appropriately. However, Google Docs and MS Word did not
detect or correct these errors.
Criteria 4. Spelling correction (objective errors)
The ability to recognize and correct misspellings was exhibit-
ed not only by Wordvice AI, but also by all the other proof-
reading methods we compared (Table 10). In this original
sentence, the misspelled word “becasue” should be revised to
"because," and the misspelled word "abd" should be revised to
“and.” Each of the proofreading tools accurately recognized
the corresponding spelling mistakes and corrected them.
Discussion
Key results: In terms of the accurately revised text, as evaluat-
ed by the GLEU metric, Wordvice AI exhibited the highest
proofreading score compared to the other proofreading appli-
cations, identifying and correcting 77% of the human editor-
corrected text. The Wordvice AI Proofreader scored an aver-
age of 12.8%P higher than both Google Docs and MS Word in
terms of total errors corrected. The proofreading performance
of Wordvice AI (variation of 5.4%) was more consistent in
terms of percentage of errors corrected compared to MS
Word (variation of 5.6%) but was slightly less consistent than
Google Docs (variation of 5%). These results indicate that
Wordvice AI Proofreader is more thorough than these other
two proofreading tools in terms of the percentage of errors
identified, though it does not edit stylistic or subjective issues
as extensively as the human editor.
Additionally, Wordvice AI Proofreader exhibited consistent
levels of proofreading among all academic subject areas evalu-
ated in the GLEU comparison. Variability in editing perfor-
mance among these subject areas was also relatively small,
with only a 6.4%P difference between the lowest and highest
average editing applied compared to the human proofreader.
Both Google Docs and MS Word exhibited similar degrees of
variability in performance throughout all subject areas.The
highest percentage of appropriate corrections recorded for
these automated writing evaluation proofreaders (Google
Docs: medicine 73.5%) was still lower than Wordvice Proof-
reader’s lowest average (medicine 80.5%).
Interpretation:
The Wordvice AI Proofreader identifies and
corrects writing and language errors in any main academic
domain. This tool could be especially useful for researchers
writing manuscripts to check the accuracy of their writing in
English before submitting their draft to a professional proof-
reader, who can provide additional stylistic editing. NLP applications like the Wordvice AI Proofreader may exhibit greater accuracy in correcting objective errors than more widely used applications like MS Word and Google Docs because the input text is derived primarily from academic writing
samples. Similar AI proofreaders trained on academic texts
(such as Trinka and Ginger) may also prove more useful for
research authors than general proofreading tools such as
Grammarly, Hemingway Editor, and Ginger, among others.
Suggestion of further studies: By training the software with
more sample texts, the Wordvice AI Proofreader could poten-
tially exhibit performance and accuracy levels even closer to
those of human editors. However, due to the current output
limits of NLP and AI, human editing by professional editors
remains the most comprehensive and effective form of text
revision, especially for authors of academic documents, which
require the understanding of jargon and natural expressions
in English.
Conclusion: In most of the texts analyzed, the Wordvice AI
Proofreader performed at or near the level of the human edi-
Table 10. Comparative sentence example evaluating spelling correction
Spelling correction Sentence
Original (source text) Lastly, for the economic reason, it is not beneficial becasue the cost of the equipment abd staff required to control fires is
very expensive.
Human editing Lastly, for economic reasons, it is not beneficial because the cost of the equipment and staff required to control fires is very
expensive.
Wordvice AI Proofreader Lastly, for economic reasons, it is not beneficial because the cost of the equipment and staff required to control fires is very
expensive.
Google Doc Lastly, for economic reasons, it is not beneficial because the cost of the equipment and staff required to control fires is very
expensive.
Microsoft Word Lastly, for the economic reason, it is not beneficial because the cost of the equipment and staff required to control fires is
very expensive.
Text marked in red denotes incorrect alterations to the input text; text marked in blue denotes correct alterations to the input text.
tor, identifying similar errors and offering comparable sugges-
tions in the majority of sample passages. The AI Proofreader
also had higher performance and greater consistency than the
other two proofreading applications evaluated. When used
alongside professional editing and proofreading to ensure
natural expressions and flow, Wordvice AI Proofreader has
the potential to improve manuscript writing efficiency and
help users to communicate more effectively with the global
scientific community.
Conflict of Interest
The authors are employees of Wordvice. Except for that, no
potential conflict of interest relevant to this article was report-
ed.
Funding
The authors received no financial support for this study.
Data Availability
Dataset file is available from the Harvard Dataverse at: https://
doi.org/10.7910/DVN/KZ1MYX
Dataset 1. Eight hundred sentence pairs out of 1,501 from JHU FLuency-Ex-
tended GUG (JFLEG) open dataset, which were used for assessing improve-
ments in textual fluency (https://github.com/keisks/jfleg).
Dataset 2. Four hundred forty-five sentences from eight academic domains,
derived from Wordvice’s academic document data: arts and humanities, bio-
sciences, business and economics, computer science and mathematics, engi-
neering and technology, medicine, physical sciences, and social sciences.
Dataset 3. One thousand two hundred forty-five sentences composed of 800 JFLEG data and 445 academic sentence data edited by human editing experts.
Dataset 4. One thousand two hundred forty-five sentences composed of 800 JFLEG data and 445 academic sentence data edited by Wordvice AI.
Dataset 5. One thousand two hundred forty-five sentences composed of 800 JFLEG data and 445 academic sentence data edited by MS Word.
Dataset 6. One thousand two hundred forty-five sentences composed of 800 JFLEG data and 445 academic sentence data edited by Google Docs.
References
1. Warschauer M, Ware P. Automated writing evaluation: de-
fining the classroom research agenda. Lang Teach Res
2006;10:157-80. https://doi.org/10.1191/1362168806lr190oa
2. Daudaravicius V, Banchs RE, Volodina E, Napoles C. A
report on the automatic evaluation of scientific writing
shared task. Paper presented at: Proceedings of the 11th
Workshop on Innovative Use of NLP for Building Educa-
tional Applications; 2016 Jun; San Diego, CA, USA. p. 53-
62.
3. Grammarly [Internet]. San Francisco, CA: Grammarly; 2021
[cited 2021 Aug 20]. Available from: https://www.grammarly.
com/
4. Ginger Grammar Checker [Internet]. Lexington, KY: Ginger
Software; 2021 [cited 2021 Aug 22]. Available from: https://
www.gingersoftware.com/grammarcheck
5. Hemingway Editor [Internet]. Durham, NC: 38 Long LLC; 2021
[cited 2021 Aug 22]. Available from: https://hemingwayapp.
com/
6. Leacock C, Chodorow M, Gamon M, Tetreault J. Auto-
mated grammatical error detection for language learners
[Internet]. Williston, VT: Morgan & Claypool Publishers;
2010 [cited 2021 Aug 22]. Available from: https://doi.
org/10.2200/S00275ED1V01Y201006HLT009
7. Montgomery DJ, Karlan GR, Coutinho M. The effective-
ness of word processor spell checker programs to produce
target words for misspellings generated by students with
learning disabilities. J Spec Ed Tech 2001;16:27-42. https://
doi.org/10.1177/016264340101600202
8. Dale R, Viethen J. The automated writing assistance land-
scape in 2021. Nat Lang Eng 2021;27:511-8. https://doi.
org/10.1017/S1351324921000164
9. Perelman L. Grammar checkers do not work. WLN J Writ
Cent Scholarsh 2016;40:11-9.
10. Napoles C, Sakaguchi K, Post M, Tetreault J. Ground truth
for grammatical error correction metrics. Paper presented
at: Proceedings of the 53rd Annual Meeting of the Associ-
ation for Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language Processing;
2015 Jul 26-31; Beijing, China. p. 588-93.
11. Wordvice AI Proofreader [Internet]. Seoul: Wordvice; 2021
[cited 2021 Aug 23]. Available from: https://wordvice.ai
12. Google Docs [Internet]. Mountain View, CA: Alphabet; 2021
[cited 2021 Aug 22]. Available from: https://www.google.
com/docs/about/
13. Microsoft Word [Internet]. Redmond, WA: Microsoft; 2021
[cited 2021 Aug 22]. Available from: https://www.microsoft.
com/microsoft-365/word
14. Napoles C, Sakaguchi K, Tetreault J. JFLEG: a fluency cor-
pus and benchmark for grammatical error correction.
arXiv:1702.04066 [cs.CL] [Preprint]. 2017 [cited 2021 Aug
22]. Available from: https://arxiv.org/pdf/1702.04066.pdf
15. Sakaguchi K, Napoles C, Post M, Tetreault J. Reassessing the
goals of grammatical error correction: fluency instead of
grammaticality. Trans Assoc Comput Linguist 2016;4:169-
82. https://doi.org/10.1162/tacl_a_00091
16. Mutton A, Dras M, Wan S, Dale R. GLEU: automatic evalu-
ation of sentence-level fluency. Paper presented at: Proceed-
ings of the 45th Annual Meeting of the Association of
Computational Linguistics; 2007 Jun; Prague, Czech Re-
public. p. 344-51.
17. Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for
automatic evaluation of machine translation. Paper presented
at: Proceedings of the 40th Annual Meeting on Association
for Computational Linguistics; 2002 Jul; Philadelphia, PA,
USA. p. 311-8. https://doi.org/10.3115/1073083.1073135
18. Wu Y, Schuster M, Chen Z, et al. Google’s neural machine
translation system: bridging the gap between human and
machine translation. arXiv:1609.08144v2 [cs.CL] [Preprint].
2016 [cited 2021 Aug 19]. Available from: https://arxiv.org/
pdf/1609.08144v2.pdf