Research Article
William H. Walters*
The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors
https://doi.org/10.1515/opis-2022-0158
received August 01, 2023; accepted September 15, 2023
Abstract: This study evaluates the accuracy of 16 publicly available AI text detectors in discriminating between AI-generated and human-generated writing. The evaluated documents include 42 undergraduate essays generated by ChatGPT-3.5, 42 generated by ChatGPT-4, and 42 written by students in a first-year composition course without the use of AI. Each detector's performance was assessed with regard to its overall accuracy, its accuracy with each type of document, its decisiveness (the relative number of uncertain responses), the number of false positives (human-generated papers designated as AI by the detector), and the number of false negatives (AI-generated papers designated as human). Three detectors – Copyleaks, TurnItIn, and Originality.ai – have high accuracy with all three sets of documents. Although most of the other 13 detectors can distinguish between GPT-3.5 papers and human-generated papers with reasonably high accuracy, they are generally ineffective at distinguishing between GPT-4 papers and those written by undergraduate students. Overall, the detectors that require registration and payment are only slightly more accurate than the others.
Keywords: AI content detector, AI writing detector, artificial intelligence, chatbot, generative AI
1 Introduction
1.1 Generative AI and AI Text Detectors
Despite the great potential of generative artificial intelligence, the use of AI raises problems in situations where performance goals are meant to signal progress toward learning goals – where the completion of a written paper, for instance, is valuable not as an end in itself but as a mechanism for helping students learn how to plan, complete, and edit their written work (Dweck, 1986). Many authors have expressed concern that students are submitting papers generated by ChatGPT and other AI tools as their own original work, thereby attaining the performance goal but bypassing the learning goal. This has implications for teaching, learning, and academic integrity (e.g., Lund et al., 2023; Marche, 2022). Moreover, students' use of AI is widespread and likely to increase. In a recent survey of 1,000 US university students, 43% reported that they had used ChatGPT or a similar AI tool. Twenty-two percent of all respondents had used AI "to help complete [their] assignments or exams," and 32% planned to use or continue using AI in their academic work (Welding, 2023). The problem may have become more serious since the release of ChatGPT-4 in March 2023 (OpenAI, 2023a,c).
AI text detectors provide qualitative or quantitative assessments of the likelihood that a particular document was AI generated. They can therefore help instructors determine whether students have used AI to

* Corresponding author: William H. Walters, Mary Alice & Tom O'Malley Library, Manhattan College, 4513 Manhattan College Parkway, Riverdale, NY 10471, USA, e-mail: william.walters@manhattan.edu
Open Information Science 2023; 7: 20220158
Open Access. © 2023 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0
International License.
complete their academic work. They can also help students determine whether a particular paper is likely to trigger allegations of academic misconduct. Many AI detectors work by breaking the text down into tokens (words or other common sequences of characters) and predicting the probability that a particular token will be followed by the next in the sequence. The texts most likely to be identified as AI generated are those with high predictability and low perplexity – those with relatively few of the random elements and idiosyncrasies that people tend to use in their writing and speech. Some AI text detectors employ other methods (Crothers, Japkowicz, & Viktor, 2023), but methods based on perplexity and related concepts are most often used by the detectors available to the general public.
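To make the perplexity approach concrete, the sketch below scores a passage with a small open language model (GPT-2, via the Hugging Face transformers library). It is illustrative only: the detectors evaluated in this study use proprietary models, features, and thresholds, and the cutoff value shown here is an arbitrary assumption rather than any detector's actual rule.

```python
# Illustrative perplexity scoring with GPT-2; not the method of any
# particular commercial detector.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2 (lower = more predictable)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss over the sequence.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

sample = "The industrial revolution transformed patterns of work and daily life."
ppl = perplexity(sample)
# Hypothetical cutoff: highly predictable (low-perplexity) text is flagged as AI.
label = "AI" if ppl < 40.0 else "human"
print(label, round(ppl, 1))
```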
1.2 Previous Evaluations of AI Text Detectors
Quite a few websites and blogs claim to evaluate the accuracy of various AI text detectors (e.g., Abdullahi, 2023; Andrews, 2023; Aw, 2023; Caulfield, 2023; Cemper, 2023; Compilatio.net, 2023; Demers, 2023; Deziel, 2023; Gewirtz, 2023; Ivanov, 2023; Singh, 2023; van Oijen, 2023; Wiggers, 2023; Winston.ai, 2023). Unfortunately, each has significant limitations or biases. Fourteen problems can be readily identified:
1. The authors or their sponsors have a clear conict of interest; they provide AI detection software or accept
extensive advertising from providers.
2. The assessment has a strong subjective component, often conflating accuracy with other factors such as convenience or ease of use.
3. The assessment uses a number of different procedures that are not applied systematically to every detector.
4. The tests are performed on just a small number of documents.
5. The report does not specify how the documents were generated or acquired.
6. AI-generated text is evaluated but human-generated text is not. The assessment can therefore detect false
negatives but not false positives.
7. The documents evaluated are not typical of those submitted by students undertaking academic work.
8. The human-generated documents are taken from sources (such as websites) that are potentially available
to AIs as sources of training documents.
9. The human-generated documents are written by the investigators themselves. This introduces the potential for conscious or unconscious bias.
10. The assessment does not consider a representative set of detectors. It may exclude the newest or most widely used detectors, or it may compare one effective detector with several ineffective ones.
11. The assessment includes only those detectors that do not require registration or payment.
12. The report does not specify which versions of the detectors were used, or how they were used (e.g., which
test options were chosen, and whether the software evaluated entire documents or just portions of them).
13. The report does not mention the specific responses provided by the software or how those responses were coded as "AI generated," "human generated," or "uncertain."
14. The results are presented inconsistently, with detailed results for some detectors or documents but not for
others.
At least one website presents a more careful assessment (Gillham, 2023). Moreover, recent scholarly
investigations have avoided most of the problems mentioned here. Since the release of GPT-3.5, more than
a dozen studies have included evaluations of the English-language AI text detectors currently in general use
(Aremu, 2023; Cingillioglu, 2023; Desaire, Chua, Isom, Jarosova, & Hua, 2023; Gao et al., 2023; Guo et al., 2023;
Khalil & Er, 2023; Krishna, Song, Karpinska, Wieting, & Iyyer, 2023; Liang, Yuksekgonul, Mao, Wu, & Zou, 2023;
Pegoraro, Kumari, Fereidooni, & Sadeghi, 2023; Perkins, Roe, Postma, McGaughran, & Hickerson, 2023; Wang,
Liu, Xie, & Li, 2023; Weber-Wulff et al., 2023; Yan, Fauss, Hao, & Cui, 2023). Tables 1 and 2 summarize the results of the analyses most similar to this investigation. That is, the tables exclude evaluations of detectors not currently available to the public (e.g., Desaire et al., 2023; Guo et al., 2023; Yan et al., 2023), studies of texts created by nonnative writers of English (Liang et al., 2023), evaluations of computer code and related materials (Wang et al., 2023), analyses in which the AI-generated papers were modified before being submitted to the detectors (Anderson et al., 2023; Krishna et al., 2023; Sadasivan, Kumar, Balasubramanian, Wang, & Feizi, 2023; Weber-Wulff et al., 2023), and reports in which the detectors were not identified by name (Dalalah & Dalalah, 2023).
Table 1: Percentage of ChatGPT texts correctly identified as AI in previous studies^a

Seventeen analyses are included, with the number of evaluated documents in parentheses: Aremu, 2023 (4); Cingillioglu, 2023 (75); Desaire et al., 2023 (120); Gao et al., 2023 (50); Guo et al., 2023 (27k); Khalil & Er, 2023 (50); Krishna et al., 2023^b; Krishna et al., 2023^c; Liang et al., 2023^d (31); Liang et al., 2023^e (145); Pegoraro et al., 2023 (7k); Perkins et al., 2023 (22); Wang et al., 2023^c (15k); Wang et al., 2023^b (25k); Weber-Wulff et al., 2023 (18); Weber-Wulff et al., 2023^f (18); and Yan et al., 2023 (800). Most of the analyses evaluated ChatGPT-3.5 output; at least two used GPT-3 and at least one (Perkins et al., 2023) used GPT-4. Each detector was evaluated in only a subset of the analyses; the reported percentages are as follows:

ChatGPT: 92
Checker AI: 13
Compilatio: 89; 92
Content at Scale: Low; 38; 0; 0
Copyleaks: 97; 23
Crossplag: Low; 58; 37; 89; 89
DetectGPT: 27; 67; 18; 66; 63; 56; 75
Draft and Goal: 24
GLTR: High; 32
GPT-2/RoBERTa: 92; High; High; 7; 60; 79; 94; 94; 100
GPTZero: High; 96; 7; 100; 14; 27; 44; 17; 78; 86
Grover: 43
Hello-SimpleAI: 47
Hugging Face: 11
OpenAI: Low; 96; 30; 41; 58; 41; 32; 99; 74; 50; 61
Originality.ai: 42; 59; 8
Perplexity: 44
PlagiarismCheck: 33; 47
Quill.org: 58; 57
RankGen: 1
RoBERTa-QA: High; 68; 67
Sapling: Low; 74; 68
TurnItIn: 91; 94; 97
Winston AI: 94; 94
Writefull: 22; 28; 53
Writer: 7; 23; 17; 44; 53
ZeroGPT: High; 100; 31; 46; 83; 83

^a Includes only those analyses that evaluated unmodified ChatGPT output. ^b Wikipedia-type articles. ^c Responses to short questions. ^d College admissions essays. ^e Abstracts of scientific papers. ^f Half credit was assigned for responses that were neither clearly correct nor clearly incorrect.
Table 2: Percentage of human-generated texts correctly identified as human in previous studies

Twelve analyses are included, with the number of evaluated documents in parentheses: Aremu, 2023 (24); Cingillioglu, 2023 (75); Desaire et al., 2023 (60); Gao et al., 2023 (50); Guo et al., 2023 (59k); Liang et al., 2023^a (88); Pegoraro et al., 2023 (6k); Wang et al., 2023^b (15k); Wang et al., 2023^c (25k); Weber-Wulff et al., 2023 (9); Weber-Wulff et al., 2023^d (9); and Yan et al., 2023 (800). Each detector was evaluated in only a subset of the analyses; the reported percentages are as follows:

Checker AI: 95
Compilatio: 89; 94
Content at Scale: 100; 80; 100; 100
Copyleaks: 93; 92
Crossplag: 100; 88; 100; 100
DetectGPT: 80; 94; 65; 100; 100
Draft and Goal: 91
GLTR: High; 98
GPT-2/RoBERTa: 97; High; High; 96; 6; 11; 100; 100; 100
GPTZero: High; 96; 100; 94; 98; 97; 67; 67
Grover: 91
Hello-SimpleAI: 98
Hugging Face: 63
OpenAI: High; 97; 91; 92; 37; 39; 100; 100
Originality.ai: 99; 95
Perplexity: 98
PlagiarismCheck: 78; 89
Quill.org: 91
RoBERTa-QA: High; 95; 65
Sapling: High; 95
TurnItIn: 100; 100
Winston AI: 78; 83
Writefull: 99; 100; 100
Writer: 95; 96; 93; 100; 100
ZeroGPT: High; 100; 92; 100; 100

^a Essays by middle school students. ^b Responses to short questions. ^c Wikipedia-type articles. ^d Half credit was assigned for responses that were neither clearly correct nor clearly incorrect.
Together, Tables 1 and 2 suggest that GPT-2/RoBERTa, TurnItIn, and ZeroGPT are the most consistently accurate detectors. Overall, however, the results for the 27 detectors are not consistent across the 29 analyses. There are at least three reasons for this. First, three different versions of ChatGPT were used to generate the AI documents. Most of the investigations used GPT-3.5, but at least two used GPT-3 and at least one used GPT-4. Second, the documents themselves are of various types. Seventeen analyses evaluated undergraduate essays or responses to short, straightforward questions, but the others used a variety of texts including abstracts of scientific papers (Gao et al., 2023; Liang et al., 2023), college admissions essays (Liang et al., 2023), essays by middle school students (Liang et al., 2023), examination papers (Yan et al., 2023), overview articles in scientific journals (Desaire et al., 2023), and Wikipedia-type articles (Krishna et al., 2023; Wang et al., 2023). Finally, each research team interpreted the detector output differently, adopting either rigorous or lenient standards for the identification of AI- and human-generated text. This at least partly explains why some detectors performed well in certain studies but not nearly as well in others.
2 Methods
This study evaluates the accuracy of 16 publicly available AI text detectors using three sets of documents: 42
undergraduate essays generated by ChatGPT-3.5, 42 generated by ChatGPT-4, and 42 written without the use of
AI by students in a rst-year composition course. Each detectors performance was assessed with regard to its
overall accuracy across all 126 documents, its accuracy when tested against each of the three sets of docu-
ments, its decisiveness (the relative number of uncertain responses), the number of false positives (human-
generated papers designated as AI by the detector), and the number of false negatives (AI-generated papers
designated as human by the detector). The analysis involved four steps:
1. Prepare the three sets of documents.
2. Select the 16 AI text detectors to include in the study.
3. Use each detector to evaluate each of the 126 documents, coding the responses as AI,human,oruncertain.
4. Evaluate the accuracy of each detector its eectiveness in identifying AI-generated and human-gener-
ated text.
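The tallying logic of step 4 might look like the sketch below, assuming each detector's coded responses are stored as (true label, response) pairs. The data structure and the example values are hypothetical, not the study's actual records.

```python
# Tally accuracy, false positives, false negatives, and uncertain responses
# from coded (true_label, response) pairs. Labels follow the study's coding:
# "AI", "human", or "uncertain".
from collections import Counter

def score_detector(results):
    tally = Counter()
    for true_label, response in results:
        if response == "uncertain":
            tally["uncertain"] += 1
        elif response == true_label:
            tally["correct"] += 1
        elif true_label == "human":   # human paper labeled AI
            tally["false positive"] += 1
        else:                         # AI paper labeled human
            tally["false negative"] += 1
    n = len(results)
    return {key: round(100 * count / n, 1) for key, count in tally.items()}

# Hypothetical responses for one detector across four documents:
print(score_detector([("AI", "AI"), ("human", "human"),
                      ("human", "AI"), ("AI", "uncertain")]))
# -> {'correct': 50.0, 'false positive': 25.0, 'uncertain': 25.0}
```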
2.1 Preparing the 126 Documents
GPT-3.5 and GPT-4 were each used to generate 42 short papers (literature reviews) of the kind typically expected of students in first-year composition courses at US universities. The 42 paper topics cover the social sciences, the natural sciences, and the humanities (Appendix 1). A new chat/conversation was initiated for each paper topic, and each topic was embedded within a ChatGPT prompt of the type recommended by Atlas (2023). The same introductory text was used in each case: "I want you to act as an academic researcher. Your task is to write a paper of approximately 2000 words with parenthetical citations and a bibliography that includes at least 5 scholarly resources such as journal articles and scholarly books. The paper should respond to this question: '[paper topic].'" Because the ChatGPT response field is limited in length, the system's initial response to each prompt was never a complete paper. An additional prompt of "Please continue" was used, sometimes more than once, to get ChatGPT to continue the text exactly where it had left off.¹ All the AI texts were generated in the first week of April 2023.

¹ If "Please continue" was entered near the end of the paper, ChatGPT sometimes provided supplementary text that was not fully integrated into the main body of the paper, presumably on the assumption that the original response was unsatisfactory or inadequate. For this study, any text that followed the bibliography was not regarded as part of the paper generated by ChatGPT and was therefore excluded from the analysis.
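The study used the ChatGPT web interface; the sketch below shows how a comparable procedure might be scripted against the OpenAI API. The client interface, model name, and fixed number of continuation rounds are assumptions, and a real run would stop once the bibliography appeared rather than after a set count.

```python
# Rough sketch of the generation procedure via the OpenAI API (the study
# itself used the ChatGPT web interface). Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "I want you to act as an academic researcher. Your task is to write a "
    "paper of approximately 2000 words with parenthetical citations and a "
    "bibliography that includes at least 5 scholarly resources such as "
    "journal articles and scholarly books. The paper should respond to this "
    "question: '{topic}'"
)

def generate_paper(topic: str, model: str = "gpt-4", rounds: int = 3) -> str:
    messages = [{"role": "user", "content": PROMPT.format(topic=topic)}]
    paper = ""
    for _ in range(rounds):  # "Please continue" until the paper is complete
        reply = client.chat.completions.create(model=model, messages=messages)
        chunk = reply.choices[0].message.content
        paper += chunk
        messages += [{"role": "assistant", "content": chunk},
                     {"role": "user", "content": "Please continue"}]
    return paper

print(generate_paper("Why was Stonehenge built?")[:200])
```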
Eectiveness of Software Designed to Detect AI-Generated Writing 5
The 42 human-generated documents were taken from a set of 178 papers submitted by Manhattan College English 110 (First Year Composition) students during the 2014–2015 academic year. The use of papers from 2014 to 2015, before the widespread availability of AI tools such as ChatGPT, ensures that these papers were created without the use of AI. Although the English 110 papers do not cover the exact same topics as the AI-generated papers, they are quite similar; they cover topics such as gun control, racism in the US education system, policy responses to climate change, robotic warfare, family structure in traditional folk tales, e-cigarettes and public health, the ethical implications of the death penalty, concussion in the National Hockey League, and 3D printing technology. Stratified random sampling was used to select a set of papers with the same broad subject representation as the ChatGPT documents: 25 papers in the social sciences, 9 in the natural sciences, and 8 in the humanities.
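A minimal sketch of the stratified sampling step appears below. The per-stratum counts of the 178 candidate papers are not reported in the study, so the candidate pools are hypothetical placeholders; only the quotas (25/9/8) come from the text.

```python
# Stratified random sampling of 42 human-written papers to match the broad
# subject mix of the ChatGPT documents. Pool sizes are hypothetical.
import random

random.seed(2023)  # the study reports no seed; fixed here for repeatability

pools = {
    "social sciences":  [f"ss_{i:03d}" for i in range(101)],
    "natural sciences": [f"ns_{i:03d}" for i in range(40)],
    "humanities":       [f"hu_{i:03d}" for i in range(37)],
}
quotas = {"social sciences": 25, "natural sciences": 9, "humanities": 8}

sample = {area: random.sample(papers, quotas[area])
          for area, papers in pools.items()}
assert sum(len(chosen) for chosen in sample.values()) == 42
```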
2.2 Selecting the 16 AI Text Detectors
Although dozens of AI text detectors are available online, just 10 appear on two or more of five recent "best AI text detector" lists (Abdullahi, 2023; Caulfield, 2023; Ganesh, 2023; Somoye, 2023; Wiggers, 2023): Content at Scale (2023), Copyleaks (2023), Crossplag (2023), GPT Radar (2023), GPTZero (2023), OpenAI (2023b),² Originality.ai (2023), Sapling (2023), Writer (2023), and ZeroGPT (2023). This study evaluates those 10 AI text detectors, along with TurnItIn and 5 others (Table 3).
TurnItIn (2023) was added to the study due to its widespread availability at colleges and universities in the United States and elsewhere. Instructors at institutions with subscriptions to the TurnItIn plagiarism detector also have access to the AI text detector, unless their universities have chosen not to make it available.³
The five other AI text detectors included in the study – ContentDetector.ai (2023), Grammica (2023), IvyPanda (2023), Scribbr (2023), and SEO.ai (2023) – are promoted widely online, do not require registration or payment, and do not appear on any of the five "best detector" lists. Arguably, these detectors are typical of the tools students might use to conduct a quick check of their papers for evidence of AI involvement. A Google search for "free AI text detector" was conducted, and the first five detectors that met the criteria (and that worked reliably for the set of 126 documents) were included in the study. Some of them are clearly intended for students who want to use AI without getting caught, and the IvyPanda site includes advertisements for a paper-writing service ("Our experts can complete a task on any subject based on your instructions without any AI! To ensure that your paper is 100% human-written and plagiarism-free, place an order here.").
2.3 Evaluating the Documents and Coding the Responses
Each of the 126 documents was stripped of any introductory material (e.g., course and author information), tables, figures, and lists of works cited, then entered into each of the 16 AI text detectors in plain-text format. Documents longer than the maximum allowable length (Table 3) were truncated. The detector tests were conducted from June 25 through July 12, 2023.

² The OpenAI text detector was discontinued on July 20, 2023, due to low accuracy. As of September 2023, it is no longer available online.
³ As one reviewer pointed out, the results for TurnItIn may be biased if (a) some students' papers were submitted to TurnItIn for plagiarism checking and (b) those papers were subsequently used to train the TurnItIn AI text detector. In that case, the TurnItIn detector might be expected to perform especially well with this particular set of human-generated papers. This is not likely to be a major problem, however, since very few Manhattan College instructors used TurnItIn in 2014–2015. A related issue is that students may have submitted their papers to the iThenticate plagiarism checker, which is associated with TurnItIn and uses the same set of texts. Unfortunately, we do not know the extent to which this might have occurred.
Table 3: Characteristics of the 16 AI text detectors

Content at Scale – Payment: not required. Limits on use: none. Input: text box. Min. length: 4 wds. Max. length: 25,000 chars. Longer docs.: truncates.

ContentDetector.ai – Payment: not required. Limits on use: none. Input: text box. Min. length: 2 wds. Max. length: 15,000 wds. Longer docs.: will not process.

Copyleaks^a – Payment: free up to 45,000 wds. per day; thereafter, when billed monthly, $0.28 to $0.44 per thousand wds. Limits on use: without registration, 6,250 wds. per day; with registration but without payment, 45,000 wds. per day; with registration and payment, depends on amount paid. Input: text box (free); text box or upload (subscribers). Min. length: 150 chars. Max. length: 25,000 chars. (free); 500,000 wds. (subscribers). Longer docs.: will not process.

Crossplag – Payment: free, but registration is required for full functionality. Limits on use: none. Input: text box. Min. length: 2 wds. Max. length: 3,000 wds. Longer docs.: truncates.

Grammica – Payment: not required. Limits on use: none. Input: text box. Min. length: 2 wds. Max. length: 380 wds. Longer docs.: truncates.

GPT Radar – Payment: free up to 2,500 wds. per day; thereafter, $0.02 per 125 wds. Limits on use: depends on amount paid. Input: text box. Min. length: 75 wds. Max. length: 1,400 wds. (lower than the stated limit). Longer docs.: will not process.

GPTZero^b – Payment: Classic, not required; Educator (more effective), $9.99 per month; Pro (most effective), $19.99 per month. Limits on use: Classic, limits not stated; Educator, 1 million wds. per month; Pro, 2 million wds. per month. Input: text box or upload. Min. length: 250 chars. Max. length: Classic, 5,000 chars.; Educator, 50,000 chars.; Pro, 50,000 chars. Longer docs.: text box will not process; upload truncates.

IvyPanda – Payment: free, but registration is required. Limits on use: none. Input: text box. Min. length: 2 wds. Max. length: 4,500 chars. Longer docs.: truncates.

OpenAI – Payment: free, but registration is required. Limits on use: none. Input: text box. Min. length: 1,000 chars. Max. length: 3,000 wds. Longer docs.: will not process.

Originality.ai^c – Payment: $0.01 per 100 wds. Limits on use: depends on amount paid. Input: text box. Min. length: 50 wds. Max. length: 10,000 wds. Longer docs.: will not process.

Sapling – Payment: free version has limited functionality; subscription, $25 per month, but the system may offer a free 1-month trial. Limits on use: none. Input: text box. Min. length: 150 chars. Max. length: 2,000 chars. (free); 8,000 chars. (paid). Longer docs.: truncates.

Scribbr – Payment: not required. Limits on use: none. Input: text box. Min. length: 25 wds. Max. length: 500 wds. Longer docs.: will not process.

SEO.ai – Payment: not required. Limits on use: none. Input: text box. Min. length: 2 wds. Max. length: 5,000 chars. Longer docs.: truncates.

TurnItIn – Payment: institutional subscription required. Limits on use: none. Input: upload. Min. length: 20 wds. Max. length: 800 pages. Longer docs.: will not process.

Writer – Payment: not required. Limits on use: none. Input: text box. Min. length: 2 wds. Max. length: 1,500 chars. Longer docs.: will not process.

ZeroGPT – Payment: not required. Limits on use: none. Input: text box or upload. Min. length: 2 wds. Max. length: 50,000 chars. Longer docs.: will not process.

^a Free interface: https://copyleaks.com/ai-content-detector; subscriber interface: https://app.copyleaks.com/dashboard/v1/account/new-scan. ^b This study presents the Pro results; the Educator results are identical except that one GPT-3.5 paper classified as uncertain by Educator is classified as AI by Pro. ^c This study uses detection model 1.4 rather than 1.1.
As Appendix 2 reveals, each detector's output is unique. The responses used by the detectors to characterize the documents vary in five important respects:
1. whether they include descriptive text, numeric values, or both
2. whether the wording of the text is formal or casual
3. whether the assessments suggest a high degree of confidence ("this text is AI generated") or greater ambiguity ("parts of the text may show evidence of AI involvement")
4. whether the numeric scores represent the proportion of the text that is AI generated, the detector's level of confidence in the result, or something else
5. whether there are just a few possible responses or many.
Each of the 2,016 responses was coded as "AI generated," "human generated," or "uncertain." ("AI generated" indicates that a significant portion of the text – not necessarily all of it – is likely to be AI generated.) For responses that included both descriptive text and a numeric component, the descriptive text (e.g., "likely AI generated") was regarded as definitive. For the strictly numeric results provided by Grammica, Originality.ai, Sapling, Scribbr, and TurnItIn, each response was categorized as AI, human, or uncertain based on three factors: the meaning of the numeric value, the natural breaks in the frequency distribution, and the general principle that roughly twice as many responses should be included in the AI category as in the human category. Although just one individual coded the responses, the distinctions among the AI, uncertain, and human categories were generally quite clear. (Appendix 2 shows the responses generated by the AI text detectors and the number of times each response was given.) The only difficulty occurred with Sapling, for which the breaks in the frequency distribution were not always pronounced. Overall, the classifications used here are very similar to those adopted by Weber-Wulff et al. (2023).
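For the strictly numeric detectors, the coding step might look like the sketch below. The cut points are hypothetical: the study derived its categories from the meaning of each detector's scores and the natural breaks in each frequency distribution, not from fixed thresholds.

```python
# Hypothetical coding of a numeric "probability of AI" score into the three
# response categories used in this study.
def code_response(ai_score: float, ai_cut: float = 0.70,
                  human_cut: float = 0.30) -> str:
    if ai_score >= ai_cut:
        return "AI"
    if ai_score <= human_cut:
        return "human"
    return "uncertain"

print([code_response(s) for s in (0.95, 0.50, 0.05)])
# -> ['AI', 'uncertain', 'human']
```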
3 Results and Discussion
3.1 Accuracy of the 16 AI Text Detectors
Two of the 16 detectors, Copyleaks and TurnItIn, correctly identified the AI- or human-generated status of all 126 documents, with no incorrect or uncertain responses. As noted in Section 2.2, however, it is possible that TurnItIn performs especially well with the human-generated papers used in this particular analysis. A third detector, Originality.ai, performed nearly as well, correctly assessing the status of all but two documents – human-generated papers that it could not classify with certainty (Table 4 and Figure 1).
Among the other 13 detectors, overall accuracy ranges from 63 to 88%. The distribution of percentage correct follows a smooth progression, with just three distinct groups: the top 3 detectors, the next 11, and the bottom 2 – Sapling and ContentDetector.ai.
All the detectors except Content at Scale and ContentDetector.ai are able to identify the GPT-3.5 documents as AI generated at least 86% of the time, and seven perform flawlessly with this particular set of documents (Figure 2). Likewise, all but three – ZeroGPT, SEO.ai, and Sapling – are effective at identifying human-generated text (Figure 3). However, only the top three detectors can correctly classify GPT-4 documents with greater than 83% accuracy; the rest tend to classify those documents as human or uncertain (Figure 4). Arguably, this is the most important distinction between the top 3 detectors and the other 13.
3.2 Correlates of Accuracy
As noted in Section 2.2, 10 of the 16 detectors were initially identified through online "best detector" lists. Overall, the detectors that appear on these lists are only marginally more accurate than the others – 81% correct versus 77%. For the set of all detectors other than TurnItIn, there is no meaningful correlation between the accuracy of a detector and its appearance on the "best detector" lists; Kendall's tau-b = 0.08.
Table 4: Percentage of documents for which each detector gave correct or incorrect responses^a

Detector | All papers (% correct / incorrect / uncertain) | AI papers (% correct / incorrect) | GPT-3.5 papers (% correct / incorrect) | GPT-4 papers (% correct / incorrect) | Human papers (% correct / incorrect)
Copyleaks^b | 100 / 0 / 0 | 100 / 0 | 100 / 0 | 100 / 0 | 100 / 0
TurnItIn | 100 / 0 / 0 | 100 / 0 | 100 / 0 | 100 / 0 | 100 / 0
Originality.ai^b | 98 / 0 / 2 | 100 / 0 | 100 / 0 | 100 / 0 | 95 / 0
Scribbr | 88 / 11 / 1 | 85 / 15 | 100 / 0 | 69 / 31 | 95 / 2
ZeroGPT^b | 87 / 1 / 12 | 92 / 0 | 100 / 0 | 83 / 0 | 79 / 2
Grammica | 86 / 11 / 3 | 81 / 17 | 100 / 0 | 62 / 33 | 95 / 0
GPTZero^b | 81 / 4 / 15 | 77 / 5 | 98 / 0 | 57 / 10 | 88 / 2
Crossplag^b | 80 / 20 / 0 | 77 / 23 | 86 / 14 | 69 / 31 | 86 / 14
OpenAI^b | 78 / 6 / 17 | 69 / 8 | 98 / 2 | 40 / 14 | 95 / 0
IvyPanda | 77 / 0 / 23 | 71 / 0 | 100 / 0 | 43 / 0 | 88 / 0
GPT Radar^b | 76 / 24 / 0 | 64 / 36 | 98 / 2 | 31 / 69 | 100 / 0
SEO.ai | 72 / 4 / 24 | 92 / 0 | 100 / 0 | 83 / 0 | 33 / 12
Content at Scale^b | 71 / 13 / 15 | 63 / 15 | 74 / 2 | 52 / 29 | 88 / 10
Writer^b | 71 / 29 / 0 | 64 / 36 | 88 / 12 | 40 / 60 | 86 / 14
Sapling^b | 65 / 7 / 28 | 63 / 11 | 93 / 0 | 33 / 21 | 69 / 0
ContentDetector.ai | 63 / 10 / 27 | 45 / 14 | 83 / 0 | 7 / 29 | 100 / 0
Avg. percentage | 81 / 9 / 10 | 78 / 11 | 95 / 2 | 61 / 20 | 87 / 4
Standard deviation | 12 / 9 / 11 | 16 / 12 | 8 / 4 | 28 / 22 | 17 / 5
Median percentage | 79 / 7 / 8 | 77 / 10 | 99 / 0 | 60 / 18 | 92 / 0

^a In each case, the percentage uncertain is the percentage neither correct nor incorrect. ^b Appears on at least two of the "best AI text detector" websites.
In general, the accuracy of a detector is only modestly associated with its paid or free status. While all three of the most accurate detectors require registration and payment for full functionality, the three others that require payment – GPTZero, GPT Radar, and Sapling – have just average or below-average accuracy. Among the six detectors that require a subscription, the average accuracy is 87%; among the others, it is 77%. Overall, the correlation between the accuracy of a detector and its paid or free status is weak; Kendall's tau-b = 0.29.
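Both rank correlations can be computed with SciPy's kendalltau function, which implements tau-b and therefore handles the ties that arise with a binary variable. The sketch below pairs the overall accuracy figures from Table 4 (TurnItIn excluded, as in the text) with an indicator of appearance on the "best detector" lists; it is illustrative, not the study's actual computation.

```python
# Kendall's tau-b between detector accuracy and "best detector" list status,
# reconstructed from Table 4 (TurnItIn excluded). Illustrative only.
from scipy.stats import kendalltau

accuracy = [100, 98, 88, 87, 86, 81, 80, 78, 77, 76, 72, 71, 71, 65, 63]
listed   = [  1,  1,  0,  1,  0,  1,  1,  1,  0,  1,  0,  1,  1,  1,  0]

tau, p_value = kendalltau(accuracy, listed)  # tau-b by default
print(round(tau, 2), round(p_value, 3))
```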
3.3 Key Similarities and Differences Among the 16 AI Text Detectors
Table 5 highlights the characteristics that set each detector apart from the others. Copyleaks, TurnItIn, and
Originality.ai are similar in many respects. Likewise, ZeroGPT and GPTZero are much the same, as are Sapling
and ContentDetector.ai.
The three accuracy columns in Table 5 are based not just on percentage correct, but on percentage incorrect and the ratio of correct to incorrect responses. For example, GPTZero has a high accuracy designation while Crossplag does not, but this cannot be attributed to the one-point difference in their accuracy rates. Instead, it reflects the fact that GPTZero has a lower rate of incorrect responses. When the type of document is unclear, GPTZero generally gives a response of uncertain. In contrast, Crossplag is more likely to label AI text as human and vice versa.
Figure 1: Percentage of all 126 documents for which each detector gave correct, uncertain, or incorrect responses.
As described in Section 3.1, many detectors are effective at identifying GPT-3.5 text but ineffective at identifying GPT-4 text. This same result can be seen when percentage incorrect is taken into account. In particular, four detectors have excellent performance with regard to GPT-3.5 but very poor performance with regard to GPT-4. GPT Radar is perhaps the best example of this, with correct responses for 98% of the GPT-3.5 documents but for just 31% of the GPT-4 documents – worse than might be expected due to chance alone.
The decisiveness column represents the percentage of documents for which each detector gave responses
of AI or human rather than uncertain. The high decisiveness label was assigned to detectors with uncertainty
rates lower than 4% and the low label to those with uncertainty rates higher than 22%.
The false positives column identifies the detectors that are especially likely to respond AI when evaluating papers written by humans. The four detectors labeled many each have false positive rates of 10–14%. In contrast, the other detectors each have no more than a single false positive within the set of 42 human-generated documents.
Likewise, the false negatives column identifies the detectors that are especially likely to respond human for papers that were actually produced by an AI. Crossplag, GPT Radar, and Writer each have false negative rates of 23–36%, while the other detectors have a maximum rate of 17% and a mean of 6.5%.
The many false positives for SEO.ai and Content at Scale reflect their general tendency to declare that text is AI rather than human. Likewise, the many false negatives for GPT Radar reflect its tendency to label text as human rather than AI. The situation is different for Crossplag and Writer, however. Those two detectors have many false positives and many false negatives due to a combination of relative inaccuracy and high decisiveness. Overall, the more accurate detectors tend to be more decisive – the correlation between percentage correct and percentage uncertain is −0.68 – but Crossplag and Writer are exceptions to that general relationship.
Figure 2: Percentage of the 42 GPT-3.5 documents for which each detector gave correct, uncertain, or incorrect responses.
4 Conclusion
4.1 Main Findings
The results of this study support three main conclusions:
1. Three AI text detectors – Copyleaks, TurnItIn, and Originality.ai – have very high accuracy with all three sets of documents examined for this study: GPT-3.5 papers, GPT-4 papers, and human-generated papers.
2. Most of the other detectors can distinguish between GPT-3.5 papers and human-generated papers with reasonably high accuracy. However, most are ineffective at distinguishing between GPT-4 papers and papers written by students.
3. In general, a detector's free or paid status is not a good indicator of its accuracy, nor is its appearance on the "best AI text detector" lists considered here.
Several recent articles in the popular press have asserted that AI-generated text is almost impossible to identify (Heikkilä, 2023; Maruccia, 2023; Mujezinovic, 2023; Wiggers, 2023; Williams, 2023), and it is true that most detectors perform poorly with GPT-4 documents. However, these results also suggest that technological improvements in publicly available AI text generators are matched very quickly by improvements in the capabilities of the best AI text detectors. The release of GPT-4 in March 2023 may have given AI users a temporary ability to pass off AI text as human-authored – but less than 4 months later, the three most effective AI text detectors perform just as well with GPT-4 documents as with GPT-3.5 documents.
Figure 3: Percentage of the 42 human-generated documents for which each detector gave correct, uncertain, or incorrect responses.
Figure 4: Percentage of the 42 GPT-4 documents for which each detector gave correct, uncertain, or incorrect responses.
Table 5: Effectiveness of the 16 AI text detectors

Detector | Overall accuracy | Accuracy, GPT-3.5 | Accuracy, GPT-4 | Decisiveness | False positives | False negatives
Copyleaks^a | V. high | V. high | V. high | High | — | —
TurnItIn | V. high | V. high | V. high | High | — | —
Originality.ai^a | V. high | V. high | V. high | High | — | —
Scribbr | High | V. high | — | High | — | —
ZeroGPT^a | High | V. high | — | — | — | —
Grammica | High | V. high | Low | High | — | —
GPTZero^a | High | V. high | — | — | — | —
Crossplag^a | — | — | — | High | Many | Many
OpenAI^a | — | V. high | Low | — | — | —
IvyPanda | — | V. high | Low | Low | — | —
GPT Radar^a | — | V. high | Low | High | — | Many
SEO.ai | — | V. high | — | Low | Many | —
Content at Scale^a | — | — | Low | — | Many | —
Writer^a | — | — | Low | High | Many | Many
Sapling^a | Low | — | Low | Low | — | —
ContentDetector.ai | Low | — | Low | Low | — | —

^a Appears on at least two of the "best AI text detector" websites.
4.2 Previous Research and New Results
Previous research suggests that TurnItIn, ZeroGPT, and GPT-2/RoBERTa are among the more accurate AI text detectors (Tables 1 and 2). These results support those earlier findings with regard to TurnItIn and ZeroGPT. Of the top three detectors identified in this investigation, TurnItIn achieved very high accuracy in all five previous evaluations. Copyleaks, included in four earlier analyses, performed very well in three of them. The prior results for Originality.ai are mixed, suggesting that it classifies human-generated documents accurately but has difficulty with AI-generated text. In this analysis, no such difficulty can be seen (Tables 4 and 5). As noted in Section 1.2, previous studies have used a wide range of methods that do not always generate comparable results. Consequently, comparative analyses such as this are especially important.
4.3 Implications
Many authors have called for the modification of traditional undergraduate essays and written assignments in ways that circumvent the capabilities of generative AI (e.g., Baidoo-Anu & Owusu Ansah, 2023; Golinkoff & Wilson, 2023; Marche, 2022; Rigolino, 2023; Tate, 2023). At the most superficial level, this involves changes in assessment methods – a greater reliance on in-class exams and interactive presentations, for instance. At a deeper level, it involves a greater emphasis on the kinds of capabilities that are unique to humans, such as the generation and refinement of ideas rather than texts. AI tools can also be incorporated into teaching, helping students learn how to edit, how to evaluate subtle differences in style and content, how to determine whether an assertion is supported by evidence, and how to use AI effectively. Even in circumstances where the use of AI is accepted or required, however, there is still a need to determine the extent of AI involvement.
When students are not expected to use AI, false positives can lead to unwarranted accusations of misconduct while false negatives may allow violations of academic integrity to go undetected. For this reason, the detectors with high false positive or false negative rates (Table 5) should be avoided. If we also exclude the detectors that are generally ineffective in detecting GPT-4 text, just a few detectors – essentially, the top three – remain as viable candidates for use in the academic environment.
Local and individual factors are likely to influence the ways in which AI text detectors are used and perceived. Some faculty may be inclined to accept their results uncritically, without further investigation or consideration of the context. At the same time, other faculty may reject the use of detectors in favor of less systematic, intuitive judgments. It is probably best to adopt a moderate approach – to consider the results provided by AI text detectors, to account for other evidence as well, and to acknowledge that some detectors are far more effective (or ineffective) than others. Assessments of students' work should also consider the specific parts of the text for which AI involvement was detected. Fortunately, 10 of the detectors evaluated here – all but Crossplag, Grammica, OpenAI, Scribbr, SEO.ai, and Writer – provide separate assessments or scores for particular phrases, sentences, or paragraphs within each document.
4.4 Limitations and Further Research
Because this investigation used student papers that could potentially have been used to train the TurnItIn detector, TurnItIn may be especially accurate for the particular human-generated texts evaluated here. As noted in Section 2.2, however, this is unlikely to have had a major impact on the results. More generally, this analysis is based on a set of 126 undergraduate composition papers (literature reviews), so the results may not be generalizable to other kinds of documents. The most significant limitation of the study, however, is that it does not account for the fact that users of ChatGPT are likely to paraphrase or otherwise modify AI-generated texts rather than simply submitting them, unaltered, as their own academic work (Welding, 2023). It is important to know how well these detectors perform with unaltered ChatGPT text, but a more realistic assessment would also evaluate their effectiveness in identifying documents that have been generated by AI, then modified by users.
This is just the second study to evaluate the effectiveness of publicly available AI text detectors in identifying documents generated by ChatGPT-4. (Perkins et al., 2023, was the first.) Additional analyses of GPT-4 documents are needed. Moreover, this investigation and other recent studies suggest several questions for further research:
1. How well do AI text detectors evaluate documents that are partly AI generated and partly human generated? Are the assessments provided by the detectors (e.g., "30% AI") accurate, and does their accuracy vary with the proportion of AI-generated text?
2. What paraphrasing strategies are most effective at thwarting AI text detectors? For instance, is it better to replace words with less common synonyms, to change the order of clauses, or to introduce idiosyncratic phrases? Several studies have shown that paraphrasing can alter AI-generated texts to make them less susceptible to detection (Anderson et al., 2023; Krishna et al., 2023; Sadasivan et al., 2023; Weber-Wulff et al., 2023), but none have evaluated the effectiveness of the various paraphrasing techniques.
3. How do students actually modify AI-generated or AI-assisted texts when completing their assignments? Are those modifications effective at rendering AI involvement undetectable?
Finally, there is a need to investigate potential biases in the performance of AI text detectors. Liang et al. (2023) have demonstrated that the texts written by nonnative speakers of English are far more likely than those of native speakers to generate false positive responses. It would be helpful to know whether this bias is widespread or whether it is restricted to particular types of authors or documents.
Acknowledgments: I am grateful for the comments of Esther Isabelle Wilder and two anonymous referees.
Funding information: No funding was involved.
Conflict of interest: The author states no conflict of interest.
Data availability statement: The texts generated by GPT-3.5 and GPT-4 in response to the 42 prompts are
available from the author on request, as are the results (responses) generated by the 16 AI text detectors for
each of the 126 documents.
References
Abdullahi, A. (2023, May 5). Top 10 AI detector tools for 2023. eWeek. https://www.eweek.com/artificial-intelligence/ai-detector-software/.
Allison, N. (2023, Mar. 16). 250 + interesting research paper topics for 2022. MyPerfectWords. https://myperfectwords.com/blog/research-
paper-guide/research-paper-topics.
Anderson, N., Belavy, D. L., Perle, S. M., Hendricks, S., Hespanhol, L., Verhagen, E., & Memon, A. R. (2023). AI did not write this manuscript,
or did it? Can we trick the AI text detector into generated texts? The potential future of ChatGPT and AI in sports & exercise medicine
manuscript generation. BMJ Open Sport & Exercise Medicine,9(1), article e001568. doi: 10.1136/bmjsem-2023-001568.
Andrews, E. (2023). Comparing AI detection tools: One instructor's experience. Academic Honesty and Integrity. https://tilt.colostate.edu/comparing-ai-detection-tools-one-instructors-experience/.
Aremu, T. (2023, June 7). Unlocking Pandora's box: Unveiling the elusive realm of AI text detection. Rochester, NY: SSRN. doi: 10.2139/ssrn.4470719.
Atlas, S. (2023). Chatbot prompting: A guide for students, educators, and an AI-augmented workforce. https://www.researchgate.net/
publication/367464129_Chatbot_Prompting_A_guide_for_students_educators_and_an_AI-augmented_workforce.
Aw, B. (2023, July 23). 12 best AI detectors in 2023: Results from 180 tests. https://brendanaw.com/best-ai-detector.
Baidoo-Anu, D., & Owusu Ansah, L. (2023, Jan. 25). Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Rochester, NY: SSRN. doi: 10.2139/ssrn.4337484.
Caulfield, J. (2023, June 2). Best AI detector: Free & premium tools compared. Scribbr. https://www.scribbr.com/ai-tools/best-ai-detector/.
Cemper, C. C. (2023, Jan. 29). 13 AI content detection tools tested and AI watermarks. LinkResearchTools. https://www.linkresearchtools.
com/blog/ai-content-detector-tools/.
Cingillioglu, I. (2023). Detecting AI-generated essays: The ChatGPT challenge. International Journal of Information and Learning Technology, 40(3), 259–268. doi: 10.1108/IJILT-03-2023-0043.
Compilatio.net. (2023, Feb. 16). Comparison of the best AI detectors in 2023. https://www.compilatio.net/en/blog/best-ai-detectors.
Content at Scale. (2023). AI detector for ChatGPT, GPT4, bard & more. https://contentatscale.ai/ai-content-detector/.
ContentDetector.ai. (2023). AI content detector ChatGPT plagiarism checker. https://contentdetector.ai/.
Copyleaks. (2023). AI content detector. https://copyleaks.com/ai-content-detector.
Crossplag. (2023). AI content detector. https://crossplag.com/ai-content-detector/.
Crothers, E. N., Japkowicz, N., & Viktor, H. L. (2023, July 18). Machine-generated text: A comprehensive survey of threat models and detection methods. IEEE Access, 11, 70977–71002. doi: 10.1109/ACCESS.2023.3294090.
Dalalah, D., & Dalalah, O. M. A. (2023). The false positives and false negatives of generative AI detection tools in education and academic
research: The case of ChatGPT. International Journal of Management Education,21(2), article 100822. doi: 10.1016/j.ijme.2023.100822.
Demers, T. (2023, Apr. 25). 16 of the best AI and ChatGPT content detectors compared. Search Engine Land. https://searchengineland.
com/ai-chatgpt-content-detectors-395957.
Desaire, H., Chua, A. E., Isom, M., Jarosova, R., & Hua, D. (2023). Distinguishing academic science writing from humans or ChatGPT with over 99% accuracy using off-the-shelf machine learning tools. Cell Reports Physical Science, 4(6), article 101426. doi: 10.1016/j.xcrp.2023.101426.
Deziel, M. (2023, Feb. 19). We pitted ChatGPT against tools for detecting AI-written text, and the results are troubling. The Conversation.
https://theconversation.com/we-pitted-chatgpt-against-tools-for-detecting-ai-written-text-and-the-results-are-troubling-199774.
Dweck, C. S. (1986). Motivational processes affecting learning. American Psychologist, 41(10), 1040–1048. doi: 10.1037/0003-066X.41.10.1040.
Ganesh, S. (2023, June 12). Explore these top 5 AI detector tools to detect AI-generated content. Analytics Insight. https://www.
analyticsinsight.net/explore-these-top-5-ai-detector-tools-to-detect-ai-generated-content/.
Gao, C. A., Howard, F. M., Markov, N. S., Dyer, E. C., Ramesh, S., Luo, Y., & Pearson, A. T. (2023). Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digital Medicine, 6, article 75. doi: 10.1038/s41746-023-00819-6.
Gewirtz, D. (2023, Jan. 13). Can AI detectors save us from ChatGPT? I tried 3 online tools to find out. ZDNET Tech Today. https://www.zdnet.com/article/can-ai-detectors-save-us-from-chatgpt-i-tried-3-online-tools-to-find-out/.
Gillham, J. (2023). AI content detector accuracy review + open source dataset and research tool. Originality.ai. https://originality.ai/blog/
ai-content-detection-accuracy.
Golinkoff, R. M., & Wilson, J. (2023, Feb. 2). ChatGPT is a wake-up call to revamp how we teach writing. Philadelphia Inquirer. https://www.inquirer.com/opinion/commentary/chatgpt-ban-ai-education-writing-critical-thinking-20230202.html.
GPT Radar. (2023). Detect AI generated text in a click. https://gptradar.com/.
GPTZero. (2023). More than an AI detector. Preserve what's human. https://gptzero.me/.
Grammica. (2023). AI detector. https://grammica.com/ai-detector.
Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., … Wu, Y. (2023, Jan. 18). How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2301.07597.
Heikkilä, M. (2023, Feb. 7). Why detecting AI-generated text is so difficult (and what to do about it). MIT Technology Review. https://www.technologyreview.com/2023/02/07/1067928/why-detecting-ai-generated-text-is-so-difficult-and-what-to-do-about-it/.
Ivanov, V. (2023, June 23). Which is the best AI content detector? https://trickmenot.ai/which-is-the-best-ai-content-detector/.
IvyPanda. (2023). GPT essay checker for students. https://ivypanda.com/gpt-essay-checker.
Kearney, V. (2022, Oct. 26). 100 technology topics for research papers. Owlcation. https://owlcation.com/academia/100-Technology-
Topics-for-Research-Paper.
Khalil, M., & Er, E. (2023, Feb. 8). Will ChatGPT get you caught? Rethinking of plagiarism detection. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2302.04335.
Krishna, K., Song, Y., Karpinska, M., Wieting, J., & Iyyer, M. (2023, Mar. 23). Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2303.13408.
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023, July 10). GPT detectors are biased against non-native English writers. Ithaca, NY:
Cornell University. doi: 10.48550/arXiv.2304.02819.
Lund, B. D., Wang, T., Mannuru, N. R., Nie, B., Shimray, S., & Wang, Z. (2023). ChatGPT and a new academic reality: Artificial Intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology, 74(5), 570–581. doi: 10.1002/asi.24750.
Marche, S. (2022, Dec. 6). The college essay is dead. Nobody is prepared for how AI will transform academia. The Atlantic. https://www.
theatlantic.com/technology/archive/2022/12/chatgpt-ai-writing-college-student-essays/672371/.
Maruccia, A. (2023, Mar. 22). Reliable detection of AI-generated text is impossible, a new study says. TechSpot. https://www.techspot.
com/news/98031-reliable-detection-ai-generated-text-impossible-new-study.html.
Mujezinovic, D. (2023, May 11). AI content detectors don't work, and that's a big problem. MUO: Make Use Of. https://www.makeuseof.com/ai-content-detectors-dont-work/.
OpenAI. (2023a). GPT-4 is OpenAI's most advanced system, producing safer and more useful responses. https://openai.com/gpt-4.
OpenAI. (2023b). New AI classifier for indicating AI-written text. https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.
OpenAI. (2023c, Mar. 27). GPT-4 technical report. https://paperswithcode.com/paper/gpt-4-technical-report-1.
Originality.ai. (2023). Most accurate AI content checker & plagiarism checker for content marketers. https://originality.ai/.
Paperell.net. (2023). 200 best research paper topics for 2023 +examples. https://paperell.net/blog/best-research-paper-topics.
Pegoraro, A., Kumari, K., Fereidooni, H., & Sadeghi, A.-R. (2023, Apr. 5). To ChatGPT, or not to ChatGPT: That is the question! Ithaca, NY:
Cornell University. doi: 10.48550/arXiv.2304.01487.
Perkins, M., Roe, J., Postma, D., McGaughran, J., & Hickerson, D. (2023, May 29). Game of tones: Faculty detection of GPT-4 generated content
in university assessments. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2305.18081.
Rigolino, R. E. (2023, Jan. 31). With ChatGPT, we're all editors now. Inside Higher Ed. https://www.insidehighered.com/views/2023/01/31/chatgpt-we-must-teach-students-be-editors-opinion.
Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2023, June 28). Can AI-generated text be reliably detected? Ithaca, NY:
Cornell University. doi: 10.48550/arXiv.2303.11156.
Sapling. (2023). AI detector. https://sapling.ai/ai-content-detector.
Sarikas, C. (2020, Jan. 25). 113 great research paper topics. PrepScholar. https://blog.prepscholar.com/good-research-paper-topics.
Scribbr. (2023). Free AI detector. https://www.scribbr.com/ai-detector/.
SEO.ai. (2023). AI content detector. https://seo.ai/detector.
Singh, A. (2023, July 24). 12 best AI content detectors of 2023 (accurate data). DemandSage. https://www.demandsage.com/ai-content-
detectors/.
Somoye, F. L. (2023, June 12). ChatGPT detectors in 2023. PC Guide. https://www.pcguide.com/apps/chat-gpt-detectors/.
Tate, J. (2023, Feb. 5). Socrates never wrote a term paper. Wall Street Journal. 281, A15. https://www.wsj.com/articles/socrates-never-
wrote-a-term-paper-education-teaching-learning-college-ai-chatgpt-lecturing-students-11675613853.
TurnItIn. (2023). Empower students to do their best, original work. https://www.turnitin.com/.
van Oijen, V. (2023, Mar. 31). AI-generated text detectors: Do they work? SURF Communities: AI in Education. https://communities.surf.nl/
en/ai-in-education/article/ai-generated-text-detectors-do-they-work.
Walters, W. H., Sheehan, S. E., Handfield, A. E., López-Fitzsimmons, B. M., Markgren, S., & Paradise, L. (2020). A multi-method information literacy assessment program: Foundation and early results. Portal: Libraries and the Academy, 20(1), 101–135. doi: 10.1353/pla.2020.0006.
Wang, J., Liu, S., Xie, X., & Li, Y. (2023, Apr. 11). Evaluating AIGC detectors on code content. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.
2304.05193.
Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S., Foltýnek, T., Guerrero-Dib, J., Popoola, O., … Waddington, L. (2023, July 10). Testing of detection tools for AI-generated text. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2306.15666.
Welding, L. (2023, Mar. 27). Half of college students say using AI on schoolwork is cheating or plagiarism. BestColleges. https://www.
bestcolleges.com/research/college-students-ai-tools-survey/.
Wiggers, K. (2023, Feb. 16). Most sites claiming to catch AI-written text fail spectacularly. TechCrunch. https://techcrunch.com/2023/02/16/
most-sites-claiming-to-catch-ai-written-text-fail-spectacularly/.
Williams, R. (2023, July 7). AI-text detection tools are really easy to fool. MIT Technology Review. https://www.technologyreview.com/2023/
07/07/1075982/ai-text-detection-tools-are-really-easy-to-fool/.
Winston.ai. (2023, Feb. 14). Best AI detectors in 2023 compared. https://gowinston.ai/best-ai-detector/.
Writer. (2023). AI content detector. https://writer.com/ai-content-detector/.
Yan, D., Fauss, M., Hao, J., & Cui, W. (2023). Detection of AI-generated essays in writing assessments. Psychological Test and Assessment Modeling, 65(1), 125–144. https://www.psychologie-aktuell.com/fileadmin/Redaktion/Journale/ptam_2023-1/PTAM__1-2023_5_kor.pdf.
ZeroGPT. (2023). GPT-4, ChatGPT & AI detector by ZeroGPT: detect OpenAI text. https://www.zerogpt.com/.
Appendix 1
Topics of the ChatGPT Papers
Although most of the paper topics were suggested by personal experience with students and their written work (Walters et al., 2020), about two dozen websites were consulted for additional ideas. Topics 24, 33, and 42 are similar to those suggested by Paperell.net (2023). Topics 19, 22, and 37 are similar to those suggested by Sarikas (2020), Allison (2023), and Kearney (2022), respectively. Topics 1–8 are in the humanities, 9–33 in the social sciences, and 34–42 in the natural sciences:
1. Why was Stonehenge built? What are the most likely explanations, and what evidence supports or
challenges each of them?
2. What were the causes of the Second Boer War (1899 to 1902)? What did the British Empire, the South
African Republic, and the Orange Free State each hope to achieve?
3. What major nineteenth-century literary works received initially negative reviews but are now regarded as
key contributions to literature? What accounts for the changing opinions of these works?
4. What studies best demonstrate how quantitative methods can be applied to the analysis of English-
language literary works?
5. When did unicorns first appear in literature? How has the depiction of unicorns and their characteristics changed over time?
6. Will languages other than English gain importance over time as languages of scientific discourse?
7. What accounts for the dominance of American and British songwriters and musicians in twentieth- and twenty-first-century popular music? Why did no other countries' artists have a similar impact?
8. What are the historical origins of the religious concept of purgatory? Who put forth the concept of
purgatory? Was it accepted initially? When and how did it assume its place within Catholic theology?
9. Among retired Americans and those approaching retirement, are there distinct types of migration or geographic mobility (distinct groups of migrants)? What are the distinctive characteristics of each type or group?
10. What were the unintended effects of China's one-child policy? How have the Chinese government and the Chinese people responded to them?
11. How have ride-sharing services such as Uber and Lyft influenced overall employment in the taxi and ride-sharing industry? How have they influenced wages?
12. In the present-day United States, what are the most effective strategies by which wealthy individuals can minimize their income tax payments?
13. What are the long-term economic and political impacts of the global shortages of copper, lithium, nickel,
and cobalt?
14. What is the best way to determine the impact of Brexit on the UK economy?
15. Why did the US government rst institute minimum wage laws? What were they hoping to achieve?
16. Among American college students, to what extent do self-reported assessments of ability represent self-efficacy rather than ability?
17. Can synchronous demonstrations, delivered online, be just as effective as in-person lab instruction for undergraduate biology courses?
18. Can the educational success of US charter schools at the high school (secondary) level be attributed to
factors other than the socioeconomic characteristics of their students?
19. Do students who get free meals in grades P–5 do better academically than students of similar backgrounds who do not get free meals?
20. Is there evidence to support the idea that high school math teachers who struggled with math can be more effective than those for whom math came easily?
21. To what extent are university students' evaluations of their instructors related to the difficulty of the course? What is the best way to overcome any bias related to the link between teaching evaluations and course difficulty?
22. What are the advantages and disadvantages of taking a "gap year" of employment or volunteer work between high school and college, for individuals and for society?
23. Are there systematic differences in the organizational leadership styles of men and women? To what extent are they unique to either women or men?
24. Who were the most successful businesswomen of the twentieth century?
25. Internationally, how have Patrick S. Atiyah's "Accidents, Compensation and the Law" and "The Damages Lottery" influenced legal education, practice, and theory?
26. What are the military missions or situations for which aerial drones have proven most successful? In what areas do they have the greatest unmet potential?
27. What occupations are most likely to disappear entirely over the next 20 years?
28. In the United States, what safety-related innovations (devices, policies, or procedures) were once mandated by law or regulation but later abandoned? Why were they abandoned? On what grounds should safety-related innovations be evaluated?
29. Do the fans at a football stadium influence the outcome of the game? Can we isolate the impact of the fans' behavior from the impact of having home-field advantage (and more fans in the stadium)? [Both GPT-3.5 and GPT-4 interpreted this question in terms of association football (soccer) rather than American football.]
30. Across nations, what is the influence of gun control legislation on rates of gun-related homicide, suicide, and accidental death? What factors make these comparisons potentially difficult?
31. Are adolescents who play violent video games especially likely to commit acts of violence? Do violent video games have other negative (or positive) psychological effects?
32. In terms of recruiting, training, and managing personnel, what are the most effective methods of preventing police violence against the public (police brutality)?
33. What percentage of political assassination attempts are successful? What evidence can be used to address this question?
34. At the individual level, what is the impact of professional dental care on morbidity and mortality risk?
35. How harmful are e-cigarettes to the health of those who use them, relative to conventional cigarettes?
36. To what extent do sleep disorders influence the productivity of the American labor force?
37. Can cloning or similar methods be used to bring back extinct plant species? Extinct animal species?
38. What strategies have proven most effective as methods of stabilizing and increasing the orangutan population?
39. To what extent can global climate change be attributed to ruminant grazing and dairy farming?
40. What is the best way to gauge the environmental impact of a large-scale switch to electric vehicles for private passenger transportation in the United States? Account for the impact of the vehicles themselves as well as the need to generate electricity from sources such as natural gas, coal, nuclear, wind, and hydropower.
41. Which island nations and coastal nations will be most affected by climate change? What steps are they taking to prepare?
42. How are molten salt reactors different from conventional nuclear fission reactors? What are their unique advantages and disadvantages? In what ways are they more or less safe than conventional fission reactors?
Appendix 2
Responses Provided by the AI Text Detectors
The values labeled n indicate the number of documents in each response category across all three document types: GPT-3.5, GPT-4, and human-generated.
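For readers who want to see how tallies of this kind can be reproduced, the Python sketch below shows one plausible way to pool per-document detector output into the three response categories used throughout this appendix. The records, response labels, and label-to-category mapping shown here are invented for illustration; they are not the study's actual data files or code.

```python
# Minimal sketch: tallying detector responses into the AI / uncertain /
# human categories used in this appendix. All records and mappings below
# are illustrative assumptions, not the study's actual data.
from collections import Counter

# One record per analyzed document: (detector, response label, true source).
# In the study, each detector saw 126 documents: 42 GPT-3.5, 42 GPT-4,
# and 42 human-written essays.
results = [
    ("Copyleaks", "AI text detected", "gpt-4"),
    ("Copyleaks", "No AI-related alerts", "human"),
    ("GPTZero", "Your text may include parts written by AI", "gpt-3.5"),
    # ... one tuple for every remaining (detector, document) pair ...
]

# Map each detector-specific response label to one of the three categories.
category = {
    "AI text detected": "AI",
    "No AI-related alerts": "human",
    "Your text may include parts written by AI": "uncertain",
    # ... labels for the other detectors ...
}

# Count documents per (detector, category), pooling all three document
# types, as the tables in this appendix do.
tallies = Counter((det, category[label]) for det, label, _src in results)
for (det, cat), n in sorted(tallies.items()):
    print(f"{det}: responses counted as {cat}: n = {n}")
```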
Content at Scale

Responses counted as AI (n = 57)
  Highly likely to be AI generated! (10–29% human) (n = 8)
  Likely to be AI generated! (33–44% human) (n = 17)
  Likely both AI and human! (60–79% human) (n = 32)
Responses counted as uncertain (n = 19)
  Unclear if it is AI content! (45–58% human) (n = 19)
Responses counted as human (n = 50)
  Highly likely to be human! (80–100% human) (n = 50)

Note: The descriptive text appears to be based primarily on the detector's confidence in the assessment, while the numeric results appear to reflect the percentage of the text that is AI.
ContentDetector.ai

Responses counted as AI (n = 38)
  Likely AI content (How artificial is your content: 67–82%) (n = 38)
Responses counted as uncertain (n = 34)
  Unclear (How artificial is your content: 50–67%) (n = 34)
Responses counted as human (n = 54)
  Likely human content (How artificial is your content: 16–50%) (n = 54)

Note: These results indicate the detector's confidence in the assessment, not the percentage of the text that is AI.
Copyleaks

Responses counted as AI (n = 84)
  Suspected cheating: AI text detected. Very high. We are unable to verify that the text was written by a human (n = 84)
Responses counted as uncertain (n = 0)
  (None)
Responses counted as human (n = 42)
  (No AI-related alerts associated with the text) (n = 42)

Note: Copyleaks provides an overall descriptive assessment for the entire document along with statements such as "93.3% probability for human" or "94.8% probability for AI" for particular parts of the document. Those numeric values indicate the detector's confidence in the assessment, not the percentage of the text that is AI. Moreover, the percentages reported by Copyleaks are not actual probabilities, since "30% probability for human" does not mean "70% probability for AI." It simply means "This text is probably human generated, and our confidence in that assessment is 30 on a scale from 1 to 100."
Crossplag

Responses counted as AI (n = 71)
  This text is mainly written by an AI (No % score) (n = 8)
  This text is mainly written by an AI (67–100% AI) (n = 59)
  This text is co-written by both a human and an AI (50% AI) (n = 4)
Responses counted as uncertain (n = 0)
  (None)
Responses counted as human (n = 55)
  This text is mainly written by a human (0–6% AI) (n = 55)

Note: These results indicate the detector's confidence in the assessment, not the percentage of the text that is AI.
GPT Radar

Responses counted as AI (n = 54)
  Likely AI generated (52–84% accuracy) (n = 54)
Responses counted as uncertain (n = 0)
  (None)
Responses counted as human (n = 72)
  Likely human generated (57–83% accuracy) (n = 72)

Note: These results appear to indicate the detector's confidence in the assessment, not the percentage of the text that is AI.
GPTZero

Responses counted as AI (n = 66)
  Your text is likely to be written entirely by AI (n = 58)
  Your text is has [sic] a moderate likelihood of being written by AI (n = 8)
Responses counted as uncertain (n = 19)
  Your text may include parts written by AI (n = 19)
Responses counted as human (n = 41)
  Your text is most likely human written but there are some sentences with low perplexities (n = 9)
  Your text is likely to be written entirely by a human (n = 32)
Grammica

Responses counted as AI (n = 68)
  100% AI (n = 49)
  91–99% AI (n = 10)
  81–88% AI (n = 3)
  50–62% AI (n = 5)
  39% AI (n = 1)
Responses counted as uncertain (n = 4)
  25–29% AI (n = 3)
  17% AI (n = 1)
Responses counted as human (n = 54)
  1–8% AI (n = 11)
  0% AI (n = 43)

Note: These results indicate the percentage of the text that is AI, not the detector's confidence in the assessment.
IvyPanda

Responses counted as AI (n = 60)
  High risk (n = 52)
  Relatively high risk (n = 8)
Responses counted as uncertain (n = 29)
  Medium risk (n = 29)
Responses counted as human (n = 37)
  Relatively low risk (n = 37)

Note: These results indicate the detector's confidence in the assessment, not the percentage of the text that is AI.
OpenAI

Responses counted as AI (n = 58)
  Likely AI generated (n = 32)
  Possibly AI generated (n = 26)
Responses counted as uncertain (n = 21)
  Unclear if it is AI generated (n = 21)
Responses counted as human (n = 47)
  Unlikely AI generated (n = 5)
  Very unlikely AI generated (n = 42)
Originality.ai

Responses counted as AI (n = 84)
  100% AI (n = 80)
  98–99% AI (n = 3)
  70% AI (n = 1)
Responses counted as uncertain (n = 2)
  33–34% AI (n = 2)
Responses counted as human (n = 40)
  15–25% AI (n = 4)
  0–7% AI (n = 36)

Note: These results indicate the detector's confidence in the assessment, not the percentage of the text that is AI.
Sapling

Responses counted as AI (n = 53)
  97–100% AI (n = 36)
  81–94% AI (n = 7)
  73–79% AI (n = 10)
Responses counted as uncertain (n = 35)
  61–68% AI (n = 8)
  52–58% AI (n = 7)
  40–49% AI (n = 11)
  30–38% AI (n = 8)
  Unexpected error (n = 1)
Responses counted as human (n = 38)
  20–29% AI (n = 8)
  10–19% AI (n = 11)
  3–9% AI (n = 7)
  0% AI (n = 12)

Note: These results indicate the detector's confidence in the assessment, not the percentage of the text that is AI. The "Unexpected error" message persisted even after repeated attempts to conduct the analysis.
Scribbr

Responses counted as AI (n = 72)
  100% AI (n = 51)
  93–99% AI (n = 10)
  81–86% AI (n = 2)
  55–73% AI (n = 5)
  45% AI (n = 1)
  31–36% AI (n = 3)
Responses counted as uncertain (n = 1)
  25% AI (n = 1)
Responses counted as human (n = 53)
  1–7% AI (n = 10)
  0% AI (n = 43)

Note: These results indicate the percentage of the text that is AI, not the detector's confidence in the assessment.
SEO.ai

Responses counted as AI (n = 82)
  Your text appears AI generated (Probability for AI is 71–100%) (n = 82)
Responses counted as uncertain (n = 30)
  Your text appears uncertain to determine (Probability for AI is 45–70%) (n = 30)
Responses counted as human (n = 14)
  Your text appears human made (Probability for AI is 1–37%) (n = 14)

Note: These results indicate the detector's confidence in the assessment, not the percentage of the text that is AI.
TurnItIn

Responses counted as AI (n = 84)
  100% AI (n = 83)
  84% AI (n = 1)
Responses counted as uncertain (n = 0)
  (None)
Responses counted as human (n = 42)
  0% AI (n = 42)

Note: These results indicate the percentage of the text that is AI, not the detector's confidence in the assessment.
Writer

Responses counted as AI (n = 60)
  You should edit your text until there's less detectable AI content (0–90% human-generated content) (n = 60)
Responses counted as uncertain (n = 0)
  (None)
Responses counted as human (n = 66)
  Looking great! (92–94% human-generated content) (n = 10)
  Fantastic! (96–100% human-generated content) (n = 56)
ZeroGPT

Responses counted as AI (n = 78)
  Your file content is AI/GPT generated (63–100% AI) (n = 70)
  Your file content is most likely AI/GPT generated (61–85% AI) (n = 4)
  Your file content is likely generated by AI/GPT (55% AI) (n = 1)
  Most of your file content is AI/GPT generated (32–61% AI) (n = 3)
Responses counted as uncertain (n = 15)
  Your file content contains mixed signals, with some parts generated by AI/GPT (36–48% AI) (n = 4)
  Your file content is likely human written, may include parts generated by AI/GPT (13–50% AI) (n = 5)
  Your file content is most likely human written, may include parts generated by AI/GPT (24–33% AI) (n = 6)
Responses counted as human (n = 33)
  Your file content is most likely human written (13–27% AI) (n = 11)
  Your file content is human written (0–17% AI) (n = 22)

Note: ZeroGPT provides both numeric values (which indicate the percentage of the text that is AI) and text descriptions (which appear to reflect the numeric values as well as the detector's confidence in the assessment). The text descriptions do not always correspond to specific numeric values.
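Because the tables above pool all three document types, they cannot by themselves separate false positives from false negatives; that requires the per-document-type breakdown used in the body of the article. The sketch below shows how the evaluation measures described there (overall accuracy, decisiveness, false positives, and false negatives) could be computed once per-type counts are available. The example counts and the exact formula used for decisiveness are illustrative assumptions, not the study's published figures or code.

```python
# Minimal sketch: deriving the article's evaluation measures from
# per-document-type category counts. The "example" counts below are
# invented for illustration; they are not any detector's actual results.
def detector_metrics(counts):
    """counts[doc_type][category] -> number of documents, where doc_type
    is 'gpt35', 'gpt4', or 'human' and category is 'AI', 'uncertain',
    or 'human'."""
    total = sum(sum(by_cat.values()) for by_cat in counts.values())
    correct = (counts["gpt35"]["AI"] + counts["gpt4"]["AI"]
               + counts["human"]["human"])
    uncertain = sum(by_cat["uncertain"] for by_cat in counts.values())
    return {
        "overall accuracy": correct / total,
        # Decisiveness: share of documents given a definite verdict (one
        # plausible reading of "relative number of uncertain responses").
        "decisiveness": 1 - uncertain / total,
        "false positives": counts["human"]["AI"],       # human papers called AI
        "false negatives": (counts["gpt35"]["human"]
                            + counts["gpt4"]["human"]),  # AI papers called human
    }

example = {
    "gpt35": {"AI": 40, "uncertain": 1, "human": 1},
    "gpt4":  {"AI": 30, "uncertain": 5, "human": 7},
    "human": {"AI": 2,  "uncertain": 3, "human": 37},
}
print(detector_metrics(example))
```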
Since its maiden release into the public domain on November 30, 2022, ChatGPT garnered more than one million subscribers within a week. The generative AI tool ⎼ChatGPT took the world by surprise with it sophisticated capacity to carry out remarkably complex tasks. The extraordinary abilities of ChatGPT to perform complex tasks within the field of education has caused mixed feelings among educators, as this advancement in AI seems to revolutionize existing educational praxis. This is an exploratory study that synthesizes recent extant literature to offer some potential benefits and drawbacks of ChatGPT in promoting teaching and learning. Benefits of ChatGPT include but are not limited to promotion of personalized and interactive learning, generating prompts for formative assessment activities that provide ongoing feedback to inform teaching and learning etc. The paper also highlights some inherent limitations in the ChatGPT such as generating wrong information, biases in data training, which may augment existing biases, privacy issues etc. The study offers recommendations on how ChatGPT could be leveraged to maximize teaching and learning. Policy makers, researchers, educators and technology experts could work together and start conversations on how these evolving generative AI tools could be used safely and constructively to improve education and support students’ learning.