Accuracy pecking order – How 30 AI detectors stack up in detecting generative artificial
intelligence content in university English L1 and English L2 student essays
Keywords
Accuracy; AI; AI detectors; AI-generated and human-written content; artificial intelligence; English L1 and English L2; false positive rates; student essays; true negative rates.

Abstract
This study set out to evaluate the accuracy of 30 AI detectors in identifying generative artificial intelligence (GenAI)-generated and human-written content in university English L1 and English L2 student essays. 40 student essays were divided into four essay sets of English L1 and English L2 and two undergraduate modules: a second-year module and a third-year module. There are ten essays in each essay set. The 30 AI detectors comprised freely available detectors and non-premium versions of online AI detectors. Employing a critical studies approach to artificial intelligence, the study had three research questions. It focused on and calculated the accuracy, false positive rates (FPRs), and true negative rates (TNRs) of all 30 AI detectors for all essays in each of the four sets to determine the accuracy of each AI detector to identify the GenAI content of each essay. It also used confusion matrices to determine the specificity of best- and worst-performing AI detectors. Some of the results of this study are worth mentioning. Firstly, only two AI detectors, Copyleaks and Undetectable AI, managed to correctly detect all of the essay sets of the two English language categories (English L1 and English L2) as human written. As a result, these two AI detectors jointly shared the first spot in terms of the GenAI detection accuracy ranking. Secondly, nine of the 30 AI detectors completely misidentified all the essays in each of the four essay sets of the two language categories in both modules. Thus, they collectively shared the last spot. Thirdly, the remaining 19 AI detectors both correctly and incorrectly classified the four essay sets in varying degrees without any bias to any essay set of the two English language categories. Fourthly, none of the 30 AI detectors tended to have a bias toward a specific English language category in classifying the four essay sets. Lastly, the results of the current study suggest that the bulk of the currently available AI detectors, especially the currently available free-to-use AI detectors, are not fit for purpose.
Article Info
Received 9 March 2024
Received in revised form 28 March 2024
Accepted 3 April 2024
Available online 10 April 2024
DOI: https://doi.org/10.37074/jalt.2024.7.1.33
Content Available at: Journal of Applied Learning & Teaching, Vol.7 No.1 (2024)
JALT: http://journals.sfu.ca/jalt/index.php/jalt/index
ISSN: 2591-801X
Correspondence
Chaka Chaka, Professor, University of South Africa, Pretoria, South Africa
chakachaka8@gmail.com
Introduction
In academia, plagiarism and generative artificial intelligence
(GenAI)-generated content are two different things. For
instance, a student does not need a GenAI tool to plagiarise,
but they need a GenAI tool to generate GenAI content.
Notably, plagiarism predates the advent of GenAI content
generation, especially as the latter is heralded by GenAI
language models such as ChatGPT. As such, the possibility of
plagiarism is always there with or without the use of GenAI
tools, but GenAI-generated content is almost impossible to
generate without using GenAI tools such as ChatGPT as its
catalysts. With the launch of ChatGPT and the other related
GenAI-powered chatbots, the quest for detecting GenAI-
generated content in university student writing, in particular,
has become unavoidable. What is even more pressing is
the quest for dierentiating between GenAI-generated
and human-written content in student writing in higher
education (HE). In the HE arena, universities and academics
have always prided themselves in being the guardians
and protectors of original and authentic academic writing
in all disciplines. This guardianship and protectorship has
often come under the banner of academic integrity (see
Anthology White Paper, 2023; Blau et al., 2020; Gamage et
al., 2020; Perkins, 2023; Sullivan et al., 2023; Uzun, 2023).
It is no exaggeration to assert that academic integrity,
guardianship and protectorship in HE almost borders on
a frenzy due to, mainly, though not exclusively, pressure
points brought by GenAI-powered chatbots like ChatGPT.
In this frenzied scrambling, GenAI-generated content and
plagiarism feature as proxies for academic dishonesty.
However, viewing academic integrity through the prism
of its nemesis, like academic dishonesty that comprises
GenAI-generated content and plagiarism, is simplistic and
supercial. This conception of academic integrity has to do
with the practice of text- or content-matching that chimes
with plagiarism-detection software programmes in which
plagiarism and GenAI-generated content, are deemed
a twin threat to academic integrity (cf. Blau et al., 2020;
Gamage et al., 2020; Ifelebuegu, 2023; Rudolph et al., 2023;
Sobaih, 2024). As Gamage et al. (2020) contend, this view
of academic integrity overlooks other elements of academic
dishonesty or other violations of academic integrity (see Blau
et al., 2020). In addition to GenAI-generated content and
plagiarism, examples of elements of academic dishonesty
or violations of academic integrity include fraudulence,
falsication, fabrication, facilitation, cheating, ghost-writing
(Blau et al., 2020), contract cheating, and collusion (Gamage
et al., 2020). Of course, some of these elements or violations
may overlap: fraudulence with falsification and fabrication,
ghost-writing with contract cheating, and facilitation
with collusion (cf. Blau et al., 2020; Gamage et al., 2020).
Additionally, both cheating and fraudulence can be used
as overarching terms for academic dishonesty. Therefore,
reducing academic dishonesty to GenAI-generated content
and plagiarism alone tends to obscure its other facets, such
as the ones furnished here.
With the surge of GenAI-generated content and plagiarism
being a threat to academic integrity in HE, several AI content
detectors have been released, while existing traditional
plagiarism detection tools have upgraded their oerings to
include AI content detection features (see Anil et al., 2023;
Chaka, 2023a, 2024; Bisi et al., 2023; Dergaa et al., 2023;
Ladha et al., 2023; Uzun, 2023; Wiggers, 2023; Weber-Wul
et al., 2023). The cardinal function of AI content detectors is
to do exactly what they are designed to do: detect GenAI-
generated content in dierent types of academic and
scholarly writing. To this eect, there have been studies
that have tested the eectiveness or reliability of AI content
detectors in detecting GenAI-generated content in academic
writing, or in distinguishing between GenAI-generated and
human-written content in academic writing. These studies
have tested dierent types of AI content detectors that
include single AI content detectors (see Habibzadeh, 2023;
Perkins et al., 2024; Subramaniam, 2023), two AI content
detectors (see Bisi et al., 2023; Desaire et al., 2023; Ibrahim,
2023), three AI content detectors (see Cingillioglu, 2023;
Elali & Rachid, 2023; Gao et al., 2023; Homolak, 2023; Ladha
et al., 2023; Wee & Reimer, 2023), four AI content detectors
(Abani et al., 2023; Alexander et al., 2023; Anil et al., 2023),
and multiple AI content detectors (Chaka, 2023a; Odri &
Yoon, 2023; Santra & Majhi, 2023; Walters, 2023) (see Chaka,
2024).
Most crucially, there is one study that has discovered that AI
detectors tend to be biased against non-English language
speakers (Liang et al., 2023; Mathewson, 2023; Shane, 2023;
cf. Adamson, 2023; Gillham, 2024). This nding resonates,
in a dierent but related scenario, with the view that some
studies have established that currently available automatic
speech recognition technologies poorly detect, if any, and
discriminate against the English spoken by Black people,
especially African American Language (AAL), thereby
exposing their racial bias and demographic discrimination
against this type of English (Martin & Wright, 2023).
Linguistic and racial biases are but two of the instances of
bias that GenAI models, and not just AI detection models,
have to contend with in their everyday deployment. Other
instances of bias GenAI models have to grapple with are
cultural, ideological, political, temporal, and confirmation
biases (see Ferrara, 2023). Thus, in addition to simply
detecting GenAI-generated content, or distinguishing it
from its human-written counterpart, these biases are some
of the pressing challenges that these models have to wrestle
with on an ongoing basis.
Against this background, the current study set out to:
evaluate the accuracy of 30 AI detectors in
dierentiating between GenAI-generated and
human-written content in university English L1
and English L2 student essays for two dierent
undergraduate modules;
establish whether these 30 AI detectors will classify
these four sets of student essays dierentially based
on their English L1 and English L2 categories; and
discover which language category within these
four sets of student essays is assigned more false
positives.
On this basis, the overarching purpose of this study is to
contribute to the ongoing debate about the effectiveness
(accuracy, precision, and reliability) of AI content detectors
in distinguishing between GenAI-generated and human-
written content in the essays produced by English L1 and
English L2 students. The student essays in this study were
written by English L1 and English L2 students who registered
for a second-year undergraduate module and a third-year
undergraduate module oered by an English department at
a university in South Africa in 2018, 2020, and 2022.
Given the points highlighted above, this study seeks to
answer the following research questions (RQs):
RQ1: What is the accuracy of the 30 AI detectors
in dierentiating between GenAI-generated and
human-written content in university English L1
and English L2 student essays for two dierent
undergraduate modules?
RQ2: Do these 30 AI detectors classify these four
sets of student essays dierentially based on their
English L1 and English L2 categories or not?
RQ3: Which language category within these four
sets of student essays is assigned more false
positives by these AI detectors?
Critical studies approach to AI
In a surreal world, AI, algorithms, and machine learning
would be devoid of any bias: racial, demographic, gender,
sexuality, disability, and training data bias (see Lindgren,
2023; also see AIContentfy team, 2023; Chaka, 2022; Ferrara,
2023; Wu et al., 2023). In real-world contexts, though,
that is not the case. This rings true for AI detectors. Their
ecacy is largely determined by, among other things, their
training data, their algorithms, and their computing prowess
(AIContentfy team, 2023). All of this, together with the types
of bias mentioned and those stated earlier, leads to AI
detectors having shortcomings and deficiencies. As such,
they end up not being as effective and efficient as they are
made out to be or as they often claim to be. This is where a
critical studies approach to AI comes in. This approach draws
on some of the ideas propounded by Chaka (2022), Couldry
and Mejias (2019), Lindgren (2023), Mohamed et al. (2020),
and Ricaurte (2019), who adopt a critically driven approach to
dealing with and studying technology, algorithms, data,
and datacation. Importantly, it draws on Lindgren’s (2023)
notion of critical studies of AI.
In this paper, in particular, the critical studies approach to AI
entails recognising that AI detectors are not 100% efficient
and effective: they have limitations, deficiencies, and biases.
This is so notwithstanding the accuracy percentage claims
that these models may arrogate to themselves on their
landing pages. This approach also acknowledges that AI
detectors are constrained by contextual factors such as
domains, algorithms, training data, performance, robustness,
and adversarial testing. The latter refers to how well an AI
detector performs when tested with an adversarial input like
edited or paraphrased content (see Captain Words, 2024;
Wu et al., 2023) or such as single spacing (Cai & Cui, 2023).
This latter aspect highlights the fact that AI detectors can
be tricked by manipulating or reworking input content (see
Chaka, 2023a; Lee, 2023). This is one of the limitations AI
detectors have, which is recognised by the critical studies
approach to AI as framed here. Finally, this approach
contends that the limitations and deciencies of AI detectors
should not be reduced to technologism alone: they are also
a reection of their designers, architecture, or otherwise.
Related literature
This related literature section is unconventional in that it
selectively deals with a few studies that have a bearing on
the current study. To this end, it wants to foreground a few
points. First, save for Liang et al.’s (2023) study, there is a
paucity of studies that have tested how currently available
AI detectors tend to be biased against non-native English
writers/students vis-à-vis native English writers/students.
Secondly, as pointed out briey earlier, since the release of
ChatGPT and the other related GenAI-powered chatbots,
several AI detectors have been designed and launched,
which are intended to detect GenAI-generated content or
distinguish between GenAI-generated and human-written
content. In keeping with this attempt to detect GenAI-
generated content, existing traditional plagiarism detection
software programmes have been upgraded to accommodate
AI detection tools in their oerings (see Anil et al., 2023; Bisi
et al., 2023; Chaka, 2023a, 2024; Dergaa et al., 2023; Ladha
et al., 2023; Uzun, 2023; Wiggers, 2023; Weber-Wul et al.,
2023). Again, as stated earlier, some studies have evaluated
the eectiveness of single AI detectors (Habibzadeh, 2023;
Subramaniam, 2023), two AI detectors (Desaire et al., 2023;
Ibrahim, 2023), three AI detectors (Cingillioglu, 2023; Elali
& Rachid, 2023; Wee & Reimer, 2023), four AI detectors
(Alexander et al., 2023; Anil et al., 2023), and multiple AI
detectors (Chaka, 2023a, 2024; Odri & Yoon, 2023; Walters,
2023).
In the midst of so many and varied studies that have been
conducted in the aftermath of ChatGPT’s launch, I will, in
this section, briey discuss a select few studies that have
explored or tested the eectiveness of multiple AI detectors
in detecting GenAI-generated content or distinguish
between GenAI-generated from human-written content in
given subject areas. Elsewhere, Chaka (2024) conducted a
review of studies that tested the eectiveness of dierent
AI detectors in distinguishing between GenAI-generated
and human-written content in dierent subject areas. It is
also worth mentioning that some of the studies that have
investigated the eectiveness of multiple AI detectors, in
this regard, are preprints like Webber-Wul (2023) and
Wu et al. (2023). Others are AI detectors’ in-house studies
such as AIContentfy Team, 2023; Captain Words, 2024). The
rst study that has some bearing on the present study is
Liang et al.’s (2023) study. This study set out to evaluate
the eectiveness of seven AI detectors in detecting GenAI-
generated text in a dataset of 91 human-written Test of
English as a Foreign Language (TOEFL) essays and in a dataset
of 88 U.S. 8th-grade essays extracted from the Hewlett
Foundations’ Automated Student Assessment Prize (ASAP).
The rst dataset was sourced from a Chinese educational
forum. The seven AI detectors employed to evaluate these
two essay datasets were ZeroGPT, GPTZero, Crossplag,
OpenAI, Sapling, Quillbot, and Originality. These detectors
detected and classied the U.S. 8th-grade essay dataset
almost accurately. Nonetheless, they misidentified more
than half of the TOEFL essay dataset as generated by GenAI,
with a mean false positive rate (FPR) of 61.22%. In addition,
these AI detectors accorded the misidentified TOEFL essays
a very low perplexity due to the limited linguistic variability
of these essays, which was easily predictable. But, after
ChatGPT was employed to improve the linguistic expressions
of the TOEFL essays to those of a native English speaker,
their misidentication by the said AI detectors decreased,
with their mean FPR concomitantly decreasing to 11.77%,
and their perplexity signicantly improving as well.
Since the publication of Liang et al.’s (2023) study, there
have been, in varying degrees, some comments about it
(see Mathewson, 2023; Shane, 2023) and some reactions to
it (see Adamson, 2023; Gillham, 2024). Among the reactions,
Adamson’s (2023) is the most interesting one as it shows
how Liang et al.'s (2023) study seems to have ruffled
the veneer of AI detectors' effectiveness in detecting
GenAI-generated text in student-written essays without
being linguistically biased. To this effect, a Turnitin test
was subsequently conducted to detect GenAI-generated
text in three datasets of ASAP, ICNALE, and PELIC that
comprised L1 English (ASAP = 2,481 and ICNALE = 400) and
L2 English (ICNALE = 2,222 and PELIC = 4,000). The results
of this test showed that for documents with a minimum
300-word threshold, the dierence in the false positive
rate (FPR) between L1 English essays and L2 English essays
was fractional and, thus, was not statistically signicant.
This proved that the paper asserts that Turnitin, as an AI
detector, did not evince any statistically signicant bias
against the two sets of English language essays. Moreover,
the paper avers that even though each essay set’s FPR was
marginally higher than Turnitin’s overall target of 0.01 (1%),
none of the two essay sets’ FPR was signicantly dierent
from this overall target. In contrast, the paper argues that
for documents whose content was below the minimum 300-
word threshold, there was a signicant dierence in the
FPR between L1 English essays and L2 English essays. This
dierence was greater than Turnitin’s 0.01 overall target. On
this basis, the paper concludes that this nding conrms that
AI detectors need longer essay samples for them to detect
GenAI-generated content accurately and for them to be able
to avoid producing a high rate of false positives (Adamson,
2023). An overall FPR target of 1% means that 10 human-
produced student essays are likely to be misclassied as
false positives in every 1,000 university essay scripts. This
number is still concerning given those students who might
be aected by this misclassication (see Anderson, 2023).
It is worth mentioning that Turnitin is not among the seven
AI detectors tested by Liang et al. (2023). Despite this,
there is no gainsaying that this resultant Turnitin test bears
testimony to the rue that Liang et al.’s (2023) study has
caused to the AI detection ecosystem, not only Turnitin
but that of the other AI detectors as well. The other point
to emphasise is that Liang et al.’s (2023) study has an
element of a critical studies approach to AI. This element
has to do with the way the study approached the seven AI
detectors from a critical standpoint by highlighting their
linguistic detection bias in dealing with native English
speakers versus non-native English speakers in their written
English. Moreover, this criticality element is related to the
two adversarial prompts the study inputted into ChatGPT
to write the two datasets dierently with a view of tricking
the seven AI detectors. It is when one applies this type of
critical perspective which is grounded on relevant raw data
to GenAI in general, and to AI detectors in particular, that
one gets the owners and designers of AI detectors’ attention
as is the case with Adamson’s (2023) paper. Without that
criticality, nothing is likely to happen.
Among the studies that have evaluated multiple AI detectors
in other subject areas than English is Odri and Yoon’s (2023)
study. This study had three objectives, which were to:
evaluate 11 AI detectors’ performance on a wholly GenAI-
generated text, test AI detection-evading methods, and
evaluate how eective these AI detection-evading methods
were on previously tested AI detectors. It hypothesised that
the 11 AI detectors to be tested were not all equally eective
in identifying GenAI-generated text and that some of the
evasion methods could render the GenAI-generated text
almost undetectable. The GenAI text was generated from
ChatGPT-4 and was tested on 11 AI detectors: Originality,
ZeroGPT, Writer, Copyleaks, Crossplag, GPTZero, Sapling,
Content at Scale, Corrector, Writefull, and Quill. The text was
tested before applying AI detection evasion techniques and
after applying them. The AI detection evasion techniques
employed included: improving command messages
(prompts) in ChatGPT, adding minor grammatical errors
(e.g., a comma deletion), paraphrasing, and substituting
Latin letters with their Cyrillic equivalents. The GenAI text
was manipulated six times to produce its slightly modified
versions using the aforesaid evasion techniques in ChatGPT.
The study also tested a scientic text produced by a human
(Sir John Charnley) in 1960 (Odri & Yoon, 2023). One
plausible reason that can be extrapolated from the study
about the use of this text is that it is freely available online.
The other plausible reason is that the text predates the
advent of GenAI models, particularly ChatGPT, by 62 years.
Therefore, in 1960, there was no way any text could have
been generated by GenAI models.
For the initial, unaltered GenAI text generated by ChatGPT,
seven of the 11 AI detectors identied it as written mainly
by humans. This is how these AI detectors fared in this text:
GPTZero = human, Writer = 100% human, Quill = human,
Content at Scale = 85% human, Copyleaks = 59.9% human,
Corrector = 0.02% AI, and ZeroGPT = 25.8% AI. The more
this text was slightly modied in sustained degrees (one
modication after another as mentioned above), the more the
11 AI detectors misclassied it as human-written. Regarding
the human-written text, only one of the 11 AI detectors
(Originality) was able to correctly detect it as having 0% AI. It
is important to mention that despite this correct detection,
Originality is one of the four AI detectors that misidentified
the final modified version of the GenAI-generated text as
having 0% AI content (Odri & Yoon, 2023). Like Liang et
al.’s (2023) study discussed above, the relevance of Odri
and Yoon’s (2023) study is that it has elements of a critical
studies approach to AI. Its use of adversarial attacks in the
form of prompt attacks is an example of an adversarial input
that I earlier referred to as one of the contextual factors that
degrades the ecacy of AI detectors (also see Anderson,
2023; Chaka, 2023a, 2024; Krishna et al., 2023; Sadasivan et
al., 2023). From a critical perspective, prompt attacks expose
the limitations and deciencies of AI detectors.
Materials and methods
This study followed an exploratory research design, with
the primary objective of exploring a given area, aspect, or
phenomenon that has not been extensively researched. By
its nature, exploratory research can tentatively analyse a new
emerging topic, or suggest new ideas (Swedberg, 2020; see
Makri & Neely, 2021). Testing the accuracy and effectiveness
of AI detectors in identifying GenAI-generated and human-
written content, or in distinguishing between these content
types is still a relatively new area in many disciplines (see
Chaka, 2023a, 2023b).
Data collection
The data collection process for this study comprised three
stages. The rst stage entailed selecting student (human)
essay samples. These essays consisted of four datasets of
university English L1 and English L2 student essays. They
were selected from a pool of essays that had been submitted
as assignment responses for two undergraduate modules
oered by an English department at an open-distance and
e-learning university in South Africa. The modules were
second and third-year, major modules. Each dataset had
ten essays. The two sets of essays for a second-year major
module were submitted in 2018 (second semester), 2020
(first and second semesters), and 2022 (first and second
semesters). The submission details of the ten essays in the
English L1 essay set were as follows: 2018 first semester (n
= 1), 2020 first semester (n = 4), 2020 second semester (n =
3), 2022 first semester (n = 1), and 2022 second semester (n
= 1). The corresponding English L2 essay set for the second-
year module consisted of the following essays in relation to
their years and semesters of submission: 2020 first semester
(n = 3), 2022 first semester (n = 1), and 2022 second semester
(n = 6). Both sets of essays (English L1 and English L2) for
a third-year, major module, each with ten essays,
were submitted in the first semester of 2020.
As is evident from the points presented above, the four
datasets used in this study together had 40 essays. The
essays were randomly selected from assignment scripts
that served as either dummy or moderation scripts that are
generally emailed to module team members by module
primary lecturers. It is from this pool of essays that the
current student essays were selected for this study. These
essays were categorised as English L1 and English L2 based
on whether the students who wrote them had identied
English as their home language (English L1) or had
identied a dierent language other than English as their
home language (English L2) in their module registration
information. All the selected essays for the four datasets
were copied from their original PDF les and pasted into
an MS Word le without changing anything. Thereafter, two
MS Word les, English L1 and English L2 essay sets, were
compiled for the two modules. The ten English L1 essays for
the second-year module had a total word count of 4,465,
with a mean word count of 446.5; their counterpart English
L2 essays had a total word count of 4,322, with a mean word
count of 432.2. The total word count of the ten English L1
essays for the third-year module was 4,504, with a mean
word count of 450.4. Their corresponding English L2 essays
had a total word count of 4,404, with 440.4 as their mean
word count. The essay selection and compiling process
took place between 18 December 2023 and 20 December
2023. Before the study was conducted, ethical clearance was
secured, and the certicate number of this ethical clearance
is Ref #: 2021_RPSC_050.
The second stage in the data collection process involved
choosing free, publicly available online AI detectors. This
process happened between 21 December 2023 and 28
December 2023, during which many online AI detectors
were identified. After trialling some of them, 30 AI detectors
were chosen for use in this study (see Table 1). Then, from 02
January 2024 to 20 February 2024, the third stage occurred.
Each essay from the four datasets was submitted to each of
the 30 AI detectors for GenAI-generated content scanning.
The test scores for each essay scan were copied and
transferred to relevant tables, each of which was labelled
English L1 and English L2 for each of the two modules, with
each AI detector’s name used as a caption for each table.
However, to avoid having 30 individual tables, two tables
were merged into one (see Table 1).
Table 1: Names of 30 AI detectors and their accuracy ranking.
Data analysis
After the scan results for each of the relevantly labelled
tables had been captured under the English L1 and English
L2 categories for each of the two modules, the GenAI and
human content probability scores (as percentages) and their
accompanying statements as yielded by each AI detector,
were entered in an MS Word le. The GenAI and human
content probability scores for each set of English L1 and
English L2 essays were calculated and summed. The sum
for each set was averaged to get the mean score. This
procedure was done for all essay datasets whose AI detector
scans yielded GenAI and human content probability
scores. For those essay datasets whose AI detector scans
yielded only statements, those statements were captured
accordingly in a tabular form. The mean scores of all the
scan scores for all AI detectors were compared in each
language category. Additionally, false positives (human-
written essays misclassied as GenAI-generated) and true
negatives (correctly detected human-written essays) for
each AI detector were calculated with a view to getting false
positive rates (FPRs) and true negative rates (TNRs) within
each AI detector and between all AI detectors. The accuracy,
specicity, and negative predictive value (NPV) of AI
detectors whose test results were a direct opposite of each
other were measured using confusion matrices (see Captain
Words, 2024; Colquhoun, 2014; Gillham, 2024; Weber-Wul
et al., 2023; Wu et al., 2023) and compared with those of its
counterparts.
Results
The GenAI test scores that were yielded by scanning
each of the 30 AI detectors were compiled in a table (see
Table 2). These test results were captured in the manner
in which each AI detector displayed them without any
modication. An example of such results is shown in Table
2. The exception is the phrasing about the colour red and
the colour purple provided for GLTR AI test results. But even
for this AI detector, this phrasing was formulated in keeping
with how this AI detector itself explains its colour-coded
scan scores. Where each AI detector’s scan scores made it
possible, the GenAI and human content probability scores
for each set of English L1 and English L2 essays, together
with their respective means, were calculated (see Tables 2
and 3). As is evident from Table 2, various GenAI and human
content probability scores, expressed in percentages and
percentage points, have been displayed as generated by
Writer’s and ZeroGPT’s scan scores (raw data) for each of
the ten essays for each of the two sets of essays for English
L1 and English L2. These two AI detectors are used here for
illustrative purposes since the scan scores of each of the 30 AI
detectors cannot be displayed for lack of space. For example,
Writer detected eight essays and seven essays for L1 and L2,
respectively, under the 2nd-year module, as having 100%
human-generated content. For the 3rd-year module, Writer
classied six essays and ve essays for L1 and L2, apiece,
as containing 100% human-generated content. In contrast,
under the 2nd-year module, ZeroGPT classied nine essays
and none as containing 0% AI GPT content for L1 and L2
respectively. It, then, identied four essays for L1 and eight
essays for L2 under the 3rd-year module, as having 0% AI
GPT content.
Table 2: An example of how scan/test results were captured.
In terms of false positives, Writer had two false positives
and three false positives for the 2nd-year module’s L1 and
L2 essay sets, respectively. The rst set collectively had 5%
AI content, with an average false positive percentage of
2.5% AI content, while the second set contained 18% AI
content, with an average false positive percentage of 6% AI
content. With regard to the 3rd-year module, the L1 essay
set consisted of four false positives that contained an overall
AI content of 74%. Collectively, they had an average false
positive percentage of 18.5% AI content. Its counterpart
L2 essay set had five false positives, whose aggregate AI
content was 53%. Its average false positive percentage was
10.6% AI content.
For ZeroGPT, the 2nd-year module’s L1 and L2 essay sets
had one false positive and no false positive, respectively.
The rst set contained 6.88% AI content, which was also its
average false positive percentage. The second set had 0%
AI content and 0% AI content as its average false positive
percentage. ZeroGPT’s 3rd-year module’s L1 and L2 essay
sets had six false positives and two false positives, respectively. The
first set had an aggregate AI content of 77.82%, with 12.97%
as its average false positive percentage for its AI content.
By contrast, the second essay set contained an overall AI
content of 23.39%, with 11.695% being its average false
positive percentage for its AI content (see Table 3).
Table 3: How the AI and human content probability scores
and means were calculated.
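As an illustration of this aggregation, the following minimal Python sketch reproduces the figures reported above for ZeroGPT's 2nd-year L1 essay set; the per-essay scores and the rule that any non-zero score counts as a misclassification are assumptions made for the example.

# Minimal sketch (hypothetical per-essay scores): aggregating one essay set's
# GenAI content scores. All essays are human-written, so any essay flagged
# with GenAI content counts as a false positive.
ai_scores = [0.0, 0.0, 6.88, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # % AI content per essay

false_positive_scores = [s for s in ai_scores if s > 0]  # assumption: any non-zero score is a misclassification
false_positives = len(false_positive_scores)
true_negatives = len(ai_scores) - false_positives

aggregate_ai_content = sum(false_positive_scores)
average_fp_percentage = aggregate_ai_content / false_positives if false_positives else 0.0

print(false_positives, true_negatives)               # 1, 9
print(aggregate_ai_content, average_fp_percentage)   # 6.88, 6.88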
Since the raw false positives and their corresponding
average false positive percentages as discussed above are
not a reliable measure of the accuracy of AI detectors, false
positive rates (FPRs), true negative rates (TNRs), and the
accuracy of the scan scores of the 30 AI detectors for the
four sets of essays were calculated (see Captain Words, 2024;
Colquhoun, 2014; Gillham, 2024; Weber-Wulff et al., 2023;
Wu et al., 2023; also see Table 3). In particular, the FPRs, the
TNRs, the accuracy, and the specificity of the AI detectors
whose scan scores were direct opposites of each other, were
chosen and calculated for comparative analysis. Included
in the 30 AI detectors are the AI detectors that correctly
classied all ten essays in each of the four essay sets (two
sets for English L1 and two sets for English L2), which were
tested by the 30 AI detectors. They also encompassed the AI
detectors that completely misclassied all ten essays in each
of these four essay sets. In this context, two AI detectors,
Copyleaks and Undetectable AI, correctly classied all
ten essays in each of the four essay sets (see Table 4).
Contrariwise, nine AI detectors completely misclassied all
ten essays in each of these four essay sets. These nine AI
detectors were AI Content Checker, AI-Detector, AI Detector,
Detecting-AI.com, GLTR, GPT-2 Output Detector Demo,
IvyPanda GPT Essay Checker, RewriteGuru’s AI Detector, and
SEO (see Table 5).
Table 4: How Copyleaks and Undetectable AI correctly
detected all the essay sets in both English language
categories of the two modules.
The three measures: the FPR (false positive rate), accuracy,
and the TNR (true negative rate) were manually calculated
based on the scan scores of the said AI detectors. The FPR
was calculated using the formula, FPR = incorrectly detected
AI essays/all human-written essays, or FP/(FP + TN), where
FP and TN stand for false positives and true negatives,
respectively. This is related to each essay set (see Table 3).
In the same breath, accuracy was calculated by utilising
the formula, accuracy = correctly detected essays/all essays,
or (TP + TN)/(TP + TN + FP + FN). In this case, TP and FN
stand for true positives and false negatives. For its part, the
TNR was calculated through this formula: TNR = correctly
detected human-written essays/all human-written essays,
or TN/(TN + FP) (see Table 3). For example, Table 4 depicts
the FPR, the accuracy, and the TNR of each of the L1 and L2
essay sets of both the 2nd-year module and the 3rd-year
module for Writer and ZeroGPT. On one hand, for the 2nd-
year module’s L1 and L2, Writer had the following sets of
scores for each of these two English language categories:
FPR = 0.2, Accuracy = 0.8, and TNR = 0.8; and FPR = 0.3,
Accuracy = 0.7, and TNR = 0.7. Its 3rd-year module's L1 and
L2 scores for these three measures were as follows: FPR =
0.4, Accuracy = 0.6, and TNR = 0.6; and FPR = 0.5, Accuracy
= 0.5, and TNR = 0.5.
Table 5: How the nine AI detectors incorrectly detected all the
essay sets in both English language categories of the two modules.
On the other hand, ZeroGPT had the following score sets
for its 2nd-year module's L1 and L2: FPR = 0.1, Accuracy =
0.9, and TNR = 0.9; and FPR = 0.0, Accuracy = 1, and TNR
= 1. And its score sets for the 3rd-year module's L1 and
L2 were as follows: FPR = 0.6, Accuracy = 0.4, and TNR =
0.4; and FPR = 0.2, Accuracy = 0.8, and TNR = 0.8. With the
exception of two essay sets (the 2nd-year module's L1 for
Writer and the 3rd-year module's L2 for ZeroGPT), the two
AI detectors had varying scores for these three measures in
their other essay sets for these two modules. Suffice it to say
that ZeroGPT correctly classified one essay set for the 2nd-
year module's L2, while it incorrectly identified this module's
L1 by one percentage point. Therefore, ZeroGPT performed
better between the two AI detectors.
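Expressed in code, the three formulas above amount to the short Python sketch below. The counts in the example correspond to Writer's 2nd-year L1 essay set (two false positives out of ten human-written essays); because no GenAI-generated essays were included in any set, TP and FN are always zero.

# Minimal sketch of the three measures defined above for one essay set.
def rates(fp: int, tn: int, tp: int = 0, fn: int = 0) -> dict:
    return {
        "FPR": fp / (fp + tn),                        # incorrectly flagged / all human-written essays
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),  # correctly detected essays / all essays
        "TNR": tn / (tn + fp),                        # correctly detected human-written / all human-written
    }

print(rates(fp=2, tn=8))  # {'FPR': 0.2, 'Accuracy': 0.8, 'TNR': 0.8}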
The points discussed above lead to
the calculation of the FPRs, the TNRs, the accuracy, and the
specificity of the two AI detectors that correctly identified all
the essay sets and of the nine AI detectors that incorrectly
identified all the essay sets. Specificity is the function of
TNR: it is about the proportion of correct/true negative
cases correctly classied as such by an AI detector (see
Elkhatat et al., 2023). In the context of the present study,
this relates to the proportion of student-written essays
correctly recognised by any of the 30 AI detectors out of
ten student-written essays in each of the four essay sets. To
calculate these four measures in the two sets of AI detectors
mentioned above, an online confusion matrix calculator
was used. This calculator was ideal for computing these
measures. As said earlier, for Copyleaks, Undetectable AI,
and the other nine AI detectors, the scores are as portrayed
in Table 6.
Table 6: FPRs, TNRs, the accuracy, and the specificity of
Copyleaks and Undetectable AI (top half) and of the other
nine AI detectors (bottom half) for English L1 and English
L2 essay sets as measured by a confusion matrix calculator.
As depicted in the top half of this table, the scores for the
FPR, the negative predictive value (NPV) (which is also an
equivalent of a true negative rate (TNR)), accuracy, and
specicity for both Copyleaks and Undetectable AI were as
follows: FPR = 0, NPV = 1, accuracy = 1, and specificity =
1. The acronym, NAN (not a number), or sometimes, NaN,
denotes the measures whose scores could not be computed
as they were not relevant for the purpose at hand. As was
highlighted concerning Table 4 earlier, Copyleaks and
Undetectable AI had these scores because they correctly
identied all of the essay sets which consisted of the two
English language categories. Inversely, as exhibited in the
bottom half of Table 6, the nine AI detectors mentioned
above, collectively had the score set, FPR = 1, accuracy =
0, and specicity = 0, since all of them misidentied all the
essay sets of the two English language categories for both
modules. Here, too, NAN signies the measures whose
scores could not be captured as they were not relevant.
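For readers who want to reproduce these confusion-matrix figures without the online calculator, the following Python sketch (an illustration, not the tool used in the study) computes the same measures. With only human-written essays in a set, TP and FN are zero, so measures such as sensitivity and precision have zero denominators and return NaN, corresponding to the NAN entries described above.

# Minimal sketch: confusion-matrix measures for an all-human-written essay set.
import math

def confusion_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    def safe_div(num, den):
        return num / den if den else math.nan  # NaN where the measure is undefined
    return {
        "FPR": safe_div(fp, fp + tn),
        "Specificity (TNR)": safe_div(tn, tn + fp),
        "NPV": safe_div(tn, tn + fn),
        "Accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "Sensitivity": safe_div(tp, tp + fn),
        "Precision": safe_div(tp, tp + fp),
    }

# Copyleaks / Undetectable AI on one essay set: all ten essays correctly classified.
print(confusion_measures(tp=0, fp=0, tn=10, fn=0))
# The nine worst-performing detectors: all ten essays misclassified.
print(confusion_measures(tp=0, fp=10, tn=0, fn=0))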
All the 30 AI detectors were ranked for their accuracy in
detecting if the four sets of essays (two sets of English L1
essays, n = 20; and two sets of English L2 essays, n = 20) were
GenAI-generated or human-written. The accuracy and TNR
scores of each AI detector were used to rank the accuracy
of the 30 AI detectors (for relevant examples, see Tables 3
and 4). Based on these composite scores, many AI detectors
shared joint spots when they were ranked for accuracy. For
instance, two AI detectors, Copyleaks and Undetectable AI,
jointly shared the rst spot. They were followed by Hive
Moderation and Scribbr, AI Content Detector and Plagiarism
Detector, and Dupli Checker and Grammarly, which, as
pairs, jointly shared the second, third, and fourth spots,
respectively. ZeroGPT and Detect Bard, each notched the
fifth and sixth places, while AI Checker Tool and AI Contentfy
jointly occupied the seventh position. Writer followed in
the eighth spot, while Rank Wizard AI and Sapling
jointly took up the ninth position.
The spots ranging from ten to 15 were, each, occupied by
dierent AI detectors, with GPTZero at the tenth spot and
QuillBot AI Detector at the 15th place. The 16th and last spot
was collectively shared by the nine AI detectors mentioned
earlier.
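The ranking logic can be illustrated with the minimal Python sketch below. The composite used here (mean accuracy and mean TNR across the four essay sets, with ties sharing a spot) is an assumption for illustration; the scores are taken from figures reported in this paper, but only five detectors are included, so the spot numbers differ from the full ranking above.

# Minimal sketch (illustrative composite): ranking detectors by their mean
# accuracy and TNR across the four essay sets, with ties sharing a spot.
from itertools import groupby

detector_scores = {  # detector -> (mean accuracy, mean TNR) across the four essay sets
    "Copyleaks": (1.0, 1.0),
    "Undetectable AI": (1.0, 1.0),
    "ZeroGPT": (0.775, 0.775),
    "Writer": (0.65, 0.65),
    "GLTR": (0.0, 0.0),
}

ordered = sorted(detector_scores.items(), key=lambda kv: kv[1], reverse=True)
spot = 1
for score, group in groupby(ordered, key=lambda kv: kv[1]):
    names = ", ".join(name for name, _ in group)
    print(f"Spot {spot}: {names} (mean accuracy = {score[0]}, mean TNR = {score[1]})")
    spot += 1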
Discussion
The results presented above are discussed in this section in
response to the three research questions for this study.
The accuracy of 30 AI content detectors
As highlighted in the preceding section, of the 30 free,
publicly available online AI detectors, only two of them,
Copyleaks and Undetectable AI, were able to correctly
identify all the essay sets of the two English language
categories (English L1 and English L2) as human written.
These two AI detectors also had the highest accuracy and
TNR scores for all these essay sets, when their scores were
manually calculated. Moreover, they did so even when their
specicity and NPV was computed using a confusion matrix
calculator. However, their scores in all these four measures
diametrically contrasted with those of the nine AI detectors,
whose scores in these measures, especially for accuracy and
specicity, were zero (0%). Their FPR score of one (100%)
was the polar opposite of the FPR score of zero (0%) for
Copyleaks and Undetectable AI. In this sense, the nine AI
detectors misidentied all four essay sets of the two English
language categories. The rest of the other AI detectors
had varying accuracy, FPR, and TNR scores. As such, they
classied these four essay sets of English L1 and English L2
in varying degrees of accuracy, FPRs, and TNRs (see Figure
1).
In some of the previous studies conducted on the ecacy
of AI detectors, Copyleaks has been the best-performing AI
detector or, at least one of the best-performing AI detectors.
One such study is Walters’ (2023) study. This study tested
the eectiveness of 16 AI detectors in identifying GenAI-
generated and human-written content in three sets of rst-
year, undergraduate composition essays. The three sets
of essays comprised 42 essays generated by ChatGPT-3.5,
42 essays created by ChatGPT-4, and 42 essays written by
students. The last set was chosen from a college’s English
110 (First-Year Composition) essays, which had been
submitted during the 2014-2015 academic year. In this
study, both Copyleaks and Turnitin had the highest accuracy
rate, followed by Originality. Sapling and Content at Scale
had the lowest accuracy rate among the 16 AI detectors. In
the current study, Sapling and Content at Scale, occupied
the 9th and 13th spots respectively.
Another study is Chaka’s (2023a), which evaluated the
accuracy of ve AI detectors in detecting GenAI-generated
content in 21 applied English language studies responses
generated by three GenAI chabots: ChatGPT (n = 6),
YouChat (n = 7), and Chatsonic (n = 8). The ve AI detectors
were GPTZero, OpenAI Text Classier, Writer, Copyleaks, and
GLTR. All the twenty-one English responses were submitted
to the ve AI detectors for scanning. The ChatGPT-generated
responses were translated into German, French, Spanish,
Southern Sotho, and isiZulu by using Google Translate. They
were, then, submitted to GPTZero for scanning. The German,
French and Spanish translated versions were inputted into
Copyleaks for scanning. In this sense, this study utilised
machine translation as an adversarial attack, which is a
strategy that is related to a critical studies approach to AI
as I had argued in the relevant section above. In all the
dierent versions of the twenty-one responses, Copyleaks
was the most accurate of the ve AI detectors (see Chaka,
2023a). Similarly, in a literature and integrative hybrid
review conducted by Chaka (2024), which reviewed 17
peer-reviewed journal articles, Copyleaks was one of the
best-performing AI detectors in one of the four articles in
which OpenAI Text Classier, and Crossplag, Grammarly also
topped in each of the other three articles. But, overall, in all
the 17 reviewed articles, Crossplag was the best-performing
AI detector, followed by Copyleaks.
Figure 1: A graphic representation of the 30 AI detectors
based on their accuracy, FPR, and TNR scores.
In Odri and Yoon’s (2023) study, though, which as discussed
earlier, tested 11 AI detectors and employed adversarial
attacks, especially evasion techniques (e.g., improving
command messages (prompts) in ChatGPT, adding minor
grammatical errors, paraphrasing, and substituting Latin
letters with their Cyrillic equivalents), as part of a critical
studies approach to AI, Originality outperformed all the other
AI detectors in correctly identifying the human-written
text. It, nonetheless, misclassified the final version of the
AI-generated text. However, it was the AI detector that was
most resistant to adversarial attacks compared to the other
AI detectors (see Odri & Yoon, 2023).
Dierential classication of the four sets of student
essays and a language category assigned more false
positives
As pointed out in the preceding section, both Copyleaks
and Undetectable AI classied all the four sets of English
L1 and English L2 student essays similarly and correctly by
assigning the same scores for the three measures: accuracy,
FPR, and TNR, to all of them. Additionally, both did so for
their specicity scores for all the four essay sets. Likewise,
the nine AI detectors allotted the same scores for their
respective measures for the four essay sets. Even the rest
of the other AI detectors, which had varying scores for
these measures, did not have scores specically skewed
toward one English language category in each of the four
essay sets. In fact, even in the cases where one AI detector
had lower scores for essays within a given essay set of a
particular language category, it had higher scores for essays
within another essay set of a dierent language category.
In instances where a particular AI detector scored the essay
sets of the one language category in a given module (e.g.,
the English L1 essay sets for both the 2nd-year module and
the 3rd-year module) higher than the essay sets of the other
language category in the same module, the dierences in the
scores of essay sets of these dierent language categories
were not substantial. Or, if the scores were higher, they were
not consistent for the essay sets of one language category
(e.g., English L1) to the exclusion of the essay sets of the
other language category (e.g., English L2) (see Table 3). So,
in the present study, the AI detectors that correctly classied
the student essay sets did so for both English L1 and English
L2. In a similar vein, those AI detectors that misclassied the
student essay sets did so for both of these English language
categories. Moreover, no language category was assigned
more false positives for its essay sets than those of the other
language category. This means that the 30 AI detectors
were not language category-biased or language category-
sensitive when assigning false positives to and when
classifying the essays belonging to the four essay sets. In
the current study, therefore, there is no evidence suggesting
that the AI detectors that were tested were consistently
and invariably biased towards or against any of the student
essay sets of the two English language categories.
In contrast, though, and as stated earlier, Liang et al.’s (2023)
study found that the AI detectors that they evaluated tended
to be biased against non-English language speakers’ essays
(also see Mathewson, 2023; Shane, 2023; cf. Adamson,
2023; Gillham, 2024). While this is the case, the results of
the current study, nonetheless, do not nullify or invalidate
those of Liang et al.’s (2023) study, as it did not use the
same data sets as the ones used by that study. Instead, the
present study's results are different from those of Liang et
al.’s (2023) study.
Implications and recommendations
This study has implications for detecting GenAI content
in student essays and for dierentiating between GenAI-
generated and human-written content in student essays in
higher education. Firstly, detecting GenAI in student essays
or distinguishing between GenAI-generated and human-
written content in such essays is not simply a matter of
displaying AI and human content probability scores (or
percentages) and the statements accompanying them as
most, if not all, AI content detectors currently tend to do.
Neither is it a matter of making self-serving claims about
high AI detection accuracy rates, as is the case with 28
(93%) of the 30 AI detectors tested in this study. This means
that the AI detection accuracy claims made by dierent AI
detection tools on their respective landing pages should be
taken with a pinch of salt. As demonstrated in this study, such
claims hardly live up to their stated expectations. Again, as
shown by the results of this study, of these 28 AI detectors
that did not perform as expected, nine of them completely
misclassied all the human-written essays, while the
remaining 19 misclassied these essays in varying degrees.
Any AI content probability percentage or percentage
point, however negligible it may be, that is attributed to
a student essay which has no GenAI content at all, inicts
immeasurable reputational damage to that essay and to the
student who produced it. This means that if this particular
essay was meant for assessment purposes, then, the student
concerned would be unfairly accused of gross academic
dishonesty they would not have committed. Given all of this,
it is advisable for academics and for universities to which
these academics belong, to exercise extreme caution when
utilising any AI content detection tool for detecting GenAI
content in their students’ academic essays. The reason for
having to be extra cautious is that most of the current AI
detectors demonstrate a high degree of inaccuracy and
unreliability. Importantly, it is very risky to employ one AI
content detector and take its scan results as a nal verdict
for any given human-written text.
Secondly, the reliance of the current AI detectors on perplexity
and burstiness for determining and predicting the presence
or absence of GenAI content in human-written student
essays results in these detectors consistently misclassifying
such essays. This is one of the reasons why they keep on
misclassifying student writing that has low perplexity and
burstiness, such as that of non-English native speakers,
as containing GenAI content portions, even when that is
not the case. Repetitive word sequences and predictable
lexical and syntactic parsing, as assumed by perplexity and
burstiness, might work as indicators of the presence or the
absence of GenAI content within the surreal world of GenAI
driven by large language models. Nevertheless, in a real-
world and human environment in which university students
produce dierent forms of academic writing, informed by
their diverse English language backgrounds and in response
to assignment questions, perplexity and burstiness serve
as weak, if not misplaced, indicators of the presence or the
absence of GenAI content in student writing. The types of
essays used in the current study serve as a case in point
that detecting GenAI-generated content or distinguishing
between it and its human-written counterpart is not merely
a matter of English L1 writing versus English L2 writing.
Human-produced writing cannot be reduced to robotic
writing powered and aided by machine learning and
GenAI large language models. Therefore, it is prudent for
AI detection tools to have language training data sets that
reect the diverse, multi-dialectal, poly-racial, and pluri-
ethnic speakers of a given language, in various global or
geographical settings, for them to be able to capture the
nuances of such a language. This is more so for a language
such as English that has these types of speakers across the
globe.
Conclusion
The current study had three research questions (RQs) and
three corresponding objectives as stated earlier. Only two
of the 30 tested, free-to-use, AI detectors, Copyleaks and
Undetectable AI, did manage to correctly detect all of the
student essay sets of the two English language categories
(English L1 and English L2) as human-written. Nine of these
30 AI detectors (AI Content Checker, AI-Detector, AI Detector,
Detecting-AI.com, GLTR, GPT-2 Output Detector Demo,
IvyPanda GPT Essay Checker, RewriteGuru’s AI Detector, and
SEO) did the opposite: they misidentied all the essays in
each of the four essay sets of the two language categories
in both the 2nd-year module and the 3rd-year module. The
remaining 19 AI detectors both correctly and incorrectly
classied the four essay sets in varying degrees without any
bias to any essay set of the two English language categories.
Therefore, Copyleaks and Undetectable AI were, jointly,
the top-most accurate AI detectors that ranked first in this
study, while the nine AI detectors were the most inaccurate,
which collectively ranked last in the pecking order. Of the
other 19 AI detectors, ten of them held joint positions, with
the remaining nine notching individual accuracy slots in the
ranking.
All 30 AI detectors did not assign dierential classication
to the four essay sets according to the English language
categories to which they belonged. That is, they displayed
no specic bias toward language categories in classifying
or misclassifying the four essay sets. The same applies to
the false positives they accorded to these essay sets. If only
two AI detectors out of 30 can accurately detect all the
student essay sets across the two language categories, and
nine AI detectors can do the complete opposite, with the
remaining AI detectors yielding variable accuracy scores for
the same sets of essays in the two language categories as
is the case in this study, then, university students and the
universities to which they belong are in trouble concerning
the presence or absence of GenAI content in student essays.
Moreover, the results of the current study demonstrate that
detecting GenAI-generated content or distinguishing it
from its human-written counterpart is not simply a matter
of perplexity and burstiness, or of English L1 writing versus
English L2 writing. Human-produced writing is very complex
and nuanced and thus cannot be reduced to measures of
high or low perplexity and burstiness. This applies to both
English L1 and English L2 writers, depending on their writing
prociency. On this basis, the present study suggests that
the bulk of the currently available AI detectors are not t
for its purpose, even when the input content, such as the
essays used in this study, is not manipulated through any
adversarial attacks. The implication of this study, therefore,
is that relying on one or even a few AI detection tools for
identifying GenAI content in student essays is a risky move.
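For readers unfamiliar with these terms, the sketch below shows one common way perplexity and burstiness are operationalised, namely sentence-level perplexity under a small causal language model, with burstiness proxied by its variability; definitions vary across tools, and this is not the metric used by any detector tested here. It assumes the torch and transformers libraries and the publicly available gpt2 checkpoint.

# One common (illustrative) operationalisation of perplexity and burstiness,
# not the metric of any detector tested in this study.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())

def burstiness(sentences):
    # Proxy: standard deviation of per-sentence perplexity.
    ppls = [perplexity(s) for s in sentences]
    mean = sum(ppls) / len(ppls)
    return (sum((p - mean) ** 2 for p in ppls) / len(ppls)) ** 0.5

sample = [
    "The results were inconclusive.",
    "However, the committee, having deliberated at length, chose to defer its decision.",
]
print([round(perplexity(s)) for s in sample], round(burstiness(sample), 1))

Because such scores depend on the scoring model, text length, and genre, fluent or formulaic human writing can easily fall on the machine-like side of a fixed threshold, which is precisely the reduction cautioned against above.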
References
Abani, S., Volk, H. A., De Decker, S., Fenn, J., Rusbridge, C.,
Charalambous, M., & Nessler, J. N. (2023). ChatGPT and
scientific papers in veterinary neurology: Is the genie out of
the bottle? Frontiers in Veterinary Science, 10(1272755), 1-7.
https://doi.org/10.3389/fvets.2023.1272755
Adamson, D. (2023). New research: Turnitin’s AI detector
shows no statistically significant bias against English language
learners. https://www.turnitin.com/blog/new-research-
turnitin-s-ai-detector-shows-no-statistically-significant-
bias-against-english-language-learners
AIContentfy Team. (2023). Evaluating the effectiveness of
AI detectors: Case studies and metrics. https://aicontentfy.
com/en/blog/evaluating-of-ai-detectors-case-studies-and-
metrics
Alexander, K., Savvidou, C., & Alexander, C. (2023). Who
wrote this essay? Detecting AI-generated writing in second
language education in higher education. The Journal of
Teaching English with Technology, 23(20), 25-43. https://doi.
org/10.56297/BUKA4060/XHLD5365
Anderson, C. (2023). The false promise of AI writing detectors.
https://www.linkedin.com/pulse/false-promise-ai-writing-
detectors-carol-anderson
Anil, A., Saravanan, A., Singh, S., Shamim, M. A., Tiwari, K.,
Lal, H., …Sah, R. (2023). Are paid tools worth the cost? A
prospective cross-over study to find the right tool for
plagiarism detection. Heliyon, 9(9), e19194, 1-11. https://
doi.org/10.1016/j.heliyon.2023.e19194
Anthology White Paper. (2023). AI, academic integrity, and
authentic assessment: An ethical path forward for education.
https://www.anthology.com/sites/default/files/2023-09/
White%20Paper-AI%20Academic%20Integrity%20and%20
Authentic%20Assessment-An%20Ethical%20Path%20
Forward%20for%20Education-v2_09-23_0.pdf
Bisi, T., Risser, A., Clavert, P., Migaud, H., & Dartus, J. (2023).
What is the rate of text generated by artificial intelligence
over a year of publication in orthopedics and traumatology:
Surgery and research? Analysis of 425 articles before versus
after the launch of ChatGPT in November 2022. Orthopaedics
and Traumatology: Surgery and Research, 109(8), 103694.
https://doi.org/10.1016/j.otsr.2023.103694
Blau, I., Goldberg, S., Friedman, A., & Eshet-Alkalai, Y. (2020).
Violation of digital and analog academic integrity through
the eyes of faculty members and students: Do institutional
role and technology change ethical perspectives? Journal of
Computing in Higher Education, 33(1), 157-187. https://doi.
org/10.1007/s12528-020-09260-0
Cai, S., & Cui, W. (2023). Evade ChatGPT detectors via a single
space. https://arxiv.org/pdf/2307.02599.pdf
Captain Words. (2024). Testing AI detection tools – our
methodology. https://captainwords.com/ai-detection-tools-
test-methodology/
Chaka, C. (2022). Digital marginalization, data marginalization,
and algorithmic exclusions: A critical southern decolonial
approach to datafication, algorithms, and digital citizenship
from the Souths. Journal of e-Learning and Knowledge Society,
18(3), 83-95. https://doi.org/10.20368/1971-8829/1135678
Chaka, C. (2023a). Detecting AI content in responses
generated by ChatGPT, YouChat, and Chatsonic: The
case of ve AI content detection tools. Journal of Applied
Learning & Teaching, 6(2), 94-104. https://doi.org/10.37074/
jalt.2023.6.2.12
Chaka, C. (2023b). Generative AI chatbots - ChatGPT versus
YouChat versus Chatsonic: Use cases of selected areas of
applied English language studies. International Journal of
Learning, Teaching and Educational Research, 22(6), 1-19.
https://doi.org/10.26803/ijlter.22.6.1
Chaka, C. (2024). Reviewing the performance of AI detection
tools in differentiating between AI-generated and human-
written texts: A literature and integrative hybrid review.
Journal of Applied Learning & Teaching, 7(1), 1-12. https://
doi.org/10.37074/jalt.2024.7.1.14
Cingillioglu, I. (2023). Detecting AI-generated essays: The
ChatGPT challenge. The International Journal of Information
and Learning Technology, 40(3), 259-268. https://doi.
org/10.1108/IJILT-03-2023-0043
Colquhoun, D. (2014). An investigation of the false discovery
rate and the misinterpretation of p-values. Royal Society
Open Science, 1(140216), 1-16. https://doi.org/10.1098/
rsos.140216
Couldry, N., & Mejias, U. A. (2019a). Data colonialism:
Rethinking big data’s relation to the contemporary
subject. Television & New Media, 20(4), 336-349. https://doi.
org/10.1177/1527476418796632
Dergaa, I., Chamari, K., Zmijewski, P., & Saad, H. B. (2023).
From human writing to artificial intelligence generated text:
Examining the prospects and potential threats of ChatGPT
in academic writing. Biology of Sport, 40(2), 615-622. https://
doi.org/10.5114/biolsport.2023.125623
Desaire, H. A., Chua, A. E., Isom, M., Jarosova, R., & Hua,
D. (2023). Distinguishing academic science writing from
humans or ChatGPT with over 99% accuracy using off-the-
shelf machine learning tools. Cell Reports Physical Science,
4(6), 1-2. https://doi.org/10.1016/j.xcrp.2023.101426
Elali, F. R., & Rachid, L. N. (2023). AI-generated research
paper fabrication and plagiarism in the scientific community.
Patterns, 4, 1-4. https://doi.org/10.1016/j.patter.2023.100706
Elkhatat, A. M., Elsaid, K., & Almeer, S. (2023). Evaluating
the ecacy of AI content detection tools in dierentiating
between human and AI generated text. International
Journal for Educational Integrity, 19(17), 1-16. https://doi.
org/10.1007/s40979-023-00140-5
Ferrara, E. (2023). Should ChatGPT be biased? Challenges
and risks of bias in large language models. https://arxiv.org/
abs/2304.03738
Gamage, K. A. A., De Silva, E. K., & Gunawardhana, N.
(2020). Online delivery and assessment during COVID-19:
Safeguarding academic integrity. Education Sciences,
10(301), 1-24. https://doi.org/10.3390/educsci10110301
Gao, C. A., Howard, F. M., Markov, N. S., Dyer, E. C., Ramesh, S.,
Luo, Y., & Pearson, A. T. (2023). Comparing scientific abstracts
generated by ChatGPT to real abstracts with detectors and
blinded human reviewers. NPJ Digital Medicine, 6(75), 1-5.
https://doi.org/10.1038/s41746-023-00819-6
Gillham, J. (2024). Native English speakers? https://originality.
ai/blog/are-gpt-detectors-biased-against-non-native-
english-speakers
Habibzadeh, F. (2023). GPTZero performance in identifying
articial intelligence-generated medical texts: A preliminary
study. Journal of Korean Medical Science, 38(38), e319.
https://doi.org/10.3346/jkms.2023.38.e319
Homolak, J. (2023). Exploring the adoption of ChatGPT in
academic publishing: Insights and lessons for scientific
writing. Croatian Medical Journal, 64(3), 205-207. https://
doi.org/10.3325/cmj.2023.64.205
Ibrahim, K. (2023). Using AI based detectors to control AI
assisted plagiarism in ESL writing: “The terminator versus the
machines”. Language Testing in Asia, 13(46), 1-28. https://
doi.org/10.1186/s40468-023-00260-2
Ifelebuegu, A. (2023). Rethinking online assessment
strategies: Authenticity versus AI chatbot intervention.
Journal of Applied Learning and Teaching, 6(2), 385-392.
https://doi.org/10.37074/jalt.2023.6.2.2
Krishna, K., Song, Y., Karpinska, M., Wieting, J., & Iyyer,
M. (2023). Paraphrasing evades detectors of AI-generated
text, but retrieval is an effective defense. https://arxiv.org/
abs/2303.13408
Ladha, N., Yadav, K., & Rathore, P. (2023). AI-generated
content detectors: Boon or bane for scientific writing.
Indian Journal of Science and Technology, 16(39), 3435-3439.
https://doi.org/10.17485/IJST/v16i39.1632
Lee, D. (2023). How hard can it be? Testing the reliability
of AI detection tools. https://www.researchgate.net/
prole/Daniel-Lee-95/publication/374170650_How_hard_
can_it_be_Testing_the_reliability_of_AI_detection_tools/
links/6512b65237d0df2448edc358/How-hard-can-it-be-
Testing-the-reliability-of-AI-detection-tools.pdf
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J.
(2023). GPT detectors are biased against non-native
English writers. Patterns, 4(7), 1-4. https://doi.org/10.1016/j.
patter.2023.100779
Lindgren, S. (2023). Introducing critical studies of artificial
intelligence. In S. Lindgren (Ed.), Handbook of critical studies
of artificial intelligence (pp. 1-19). Cheltenham: Edward Elgar
Publishing. http://dx.doi.org/10.4337/9781803928562.00005
Makri, C., & Neely, A. (2021). Grounded theory: A guide for
exploratory studies in management research. International
Journal of Qualitative Methods, 20, 1-14. https://doi.
org/10.1177/16094069211013654
Martin, J. L., & Wright, K. E. (2023). Bias in automatic speech
recognition: The case of African American language. Applied
Linguistics, 44(4), 613-630. https://doi.org/10.1093/applin/
amac066
Mathewson, T. G. (2023). AI detection tools falsely accuse
international students of cheating. The Markup. https://
themarkup.org/machine-learning/2023/08/14/ai-
detection-tools-falsely-accuse-international-students-of-
cheating
Mohamed, S., Png, M.-T., & Isaac, W. (2020). Decolonial AI:
Decolonial theory as sociotechnical foresight in artificial
intelligence. Philosophy & Technology, 33, 659-684. https://
doi.org/10.1007/s13347-020-00405-8
Odri, G. A., & Yoon, D. J. Y. (2023). Detecting generative
artificial intelligence in scientific articles: Evasion techniques
and implications for scientific integrity. Orthopaedics &
Traumatology: Surgery & Research, 109(8), 103706. https://
doi.org/10.1016/j.otsr.2023.103706
Perkins, M. (2023). Academic integrity considerations of AI
large language models in the post-pandemic era: ChatGPT
and beyond. Journal of University Teaching & Learning
Practice, 20(2). https://doi.org/10.53761/1.20.02.07
Perkins, M., Roe, J., Postma, D., McGaughran, J., & Hickerson,
D. (2024). Detection of GPT-4 generated text in higher
education: Combining academic judgement and software
to identify generative AI tool misuse. Journal of Academic
Ethics, 22, 89-113. https://doi.org/10.1007/s10805-023-
09492-6
Ricaurte, P. (2019). Data epistemologies, the coloniality of
power, and resistance. Television & New Media, 20(4), 350-
365. https://doi.org/10.1177/1527476419831640
Rudolph, J., Tan, S., & Tan, S. (2023). ChatGPT: Bullshit spewer
or the end of traditional assessments in higher education?
Journal of Applied Learning and Teaching, 6(1), 342-363.
https://doi.org/10.37074/jalt.2023.6.1.9
Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W.,
& Feizi, S. (2023). Can AI-generated text be reliably detected?
https://arxiv.org/abs/2303.11156
Santra, P. P., & Majhi, D. (2023). Scholarly communication
and machine-generated text: Is it finally AI vs AI in plagiarism
detection? Journal of Information and Knowledge, 60(3),
175-183. https://doi.org/10.17821/srels/2023/v60i3/171028
Shane, J. (2023). Don’t use AI detectors for anything
important. AI Weirdness. https://www.aiweirdness.com/dont-
use-ai-detectors-for-anything-important/
Sobaih, A. E. E. (2024). Ethical concerns for using artificial
intelligence chatbots in research and publication: Evidences
from Saudi Arabia. Journal of Applied Learning & Teaching,
7(1), 1-11. http://journals.sfu.ca/jalt/index.php/jalt/index
Subramaniam, R. (2023). Identifying text classification failures
in multilingual AI-generated content. International Journal
of Articial Intelligence and Applications (IJAIA), 14(5), 57-63.
https://doi.org/10.5121/ijaia.2023.14505
Sullivan, M., Kelly, A., & McLaughlan, P. (2023). ChatGPT in
higher education: Considerations for academic integrity and
student learning. Journal of Applied Learning & Teaching,
6(1), 31-40. https://journals.sfu.ca/jalt/index.php/jalt/article/
view/731
Swedberg, R. (2020). Exploratory research. In C. Elman, J.
Gerring, & J. Mahoney (Eds.), The production of knowledge:
Enhancing progress in social science (pp. 17- 41). Cambridge:
Cambridge University Press.
Uzun, L. (2023). ChatGPT and academic integrity concerns:
Detecting articial intelligence generated content.
Language Education & Technology, 3(1), 45-54. https://
www.researchgate.net/publication/370299956_ChatGPT_
and_Academic_Integrity_Concerns_Detecting_Artificial_
Intelligence_Generated_Content
Walters, W. H. (2023). The effectiveness of software designed
to detect AI-generated writing: A comparison of 16 AI text
detectors. Open Information Science, 7(20220158), 1-24.
https://doi.org/10.1515/opis-2022-0158
Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S.,
Foltýnek, T., Guerrero-Dib, J., Popoola, O., Waddington,
L. (2023). Testing of detection tools for AI-generated text.
https://doi.org/10.48550/arxiv.2306.15666
Wee, H. B., & Reimer, J. D. (2023). Non-English academics
face inequality via AI-generated essays and countermeasure
tools. BioScience, 73, 476-478. https://doi.org/10.1093/
biosci/biad034
Wiggers, K. (2023). Most sites claiming to catch AI-written
text fail spectacularly. TechCrunch. https://techcrunch.
com/2023/02/16/most-sites-claiming-to-catch-ai-written-
text-fail-spectacularly/
Wu, J., Yang, S., Zhan, R., Yuan, Y., Wong, D. F., & Chao,
L. S. (2023). A survey on LLM-generated text detection:
Necessity, methods, and future directions. https://arxiv.org/
pdf/2310.14724.pdf