Br J Educ Technol. 2024;55:90–112.
wileyonlinelibrary.com/journal/bjet
Received: 9 March 2023 | Accepted: 22 July 2023
DOI: 10.1111/bjet.13370
REVIEW
Practical and ethical challenges of large
language models in education: A systematic
scoping review
Lixiang Yan | Lele Sha | Linxuan Zhao | Yuheng Li |
Roberto Martinez-Maldonado | Guanliang Chen | Xinyu Li |
Yueqiao Jin | Dragan Gašević
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits
use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial
purposes.
© 2023 The Authors. British Journal of Educational Technology published by John Wiley & Sons Ltd on behalf of British
Educational Research Association.
Centre for Learning Analytics at Monash,
Faculty of Information Technology, Monash
University, Clayton, Victoria, Australia
Correspondence
Lixiang Yan, Centre for Learning Analytics at Monash, Faculty of Information Technology, Monash University, 20 Exhibition Walk, Clayton, VIC 3800, Australia.
Email: lixiang.yan@monash.edu
Funding information
Australian Research Council, Grant/Award Number: DP210100060 and DP220101209; Jacobs Foundation; Defense Advanced Research Projects Agency, Grant/Award Number: HR0011-22-2-0047
Abstract
Educational technology innovations leveraging large
language models (LLMs) have shown the potential
to automate the laborious process of generating
and analysing textual content. While various innova-
tions have been developed to automate a range of
educational tasks (eg, question generation, feedback
provision, and essay grading), there are concerns re-
garding the practicality and ethicality of these innova-
tions. Such concerns may hinder future research and
the adoption of LLMs- based innovations in authentic
educational contexts. To address this, we conducted
a systematic scoping review of 118 peer- reviewed
papers published since 2017 to pinpoint the current
state of research on using LLMs to automate and
support educational tasks. The findings revealed 53
use cases for LLMs in automating education tasks,
categorised into nine main categories: profiling/label-
ling, detection, grading, teaching support, prediction,
knowledge representation, feedback, content gen-
eration, and recommendation. Additionally, we also
identified several practical and ethical challenges,
including low technological readiness, lack of rep-
licability and transparency and insufficient privacy
and beneficence considerations. The findings were
summarised into three recommendations for future
studies, including updating existing innovations with
state- of- the- art models (eg, GPT- 3/4), embracing
the initiative of open- sourcing models/systems, and
adopting a human- centred approach throughout the
developmental process. As the intersection of AI and
education is continuously evolving, the findings of this
study can serve as an essential reference point for
researchers, allowing them to leverage the strengths,
learn from the limitations, and uncover potential re-
search opportunities enabled by ChatGPT and other
generative AI models.
KEYWORDS
artificial intelligence, BERT, ChatGPT, education, GPT- 3, large
language models, pre- trained language models, systematic
scoping review
Practitioner notes
What is currently known about this topic
Generating and analysing text- based content are time- consuming and laborious
tasks.
Large language models are capable of efficiently analysing an unprecedented
amount of textual content and completing complex natural language processing
and generation tasks.
Large language models have been increasingly used to develop educational tech-
nologies that aim to automate the generation and analysis of textual content, such
as automated question generation and essay scoring.
What this paper adds
A comprehensive list of different educational tasks that could potentially benefit
from LLMs- based innovations through automation.
A structured assessment of the practicality and ethicality of existing LLMs- based
innovations from seven important aspects using established frameworks.
Three recommendations that could potentially support future studies to develop
LLMs- based innovations that are practical and ethical to implement in authentic
educational contexts.
Implications for practice and/or policy
Updating existing innovations with state- of- the- art models may further reduce the
amount of manual effort required for adapting existing models to different educa-
tional tasks.
The reporting standards of empirical research that aims to develop educational
technologies using large language models need to be improved.
Adopting a human- centred approach throughout the developmental process could
contribute to resolving the practical and ethical challenges of large language mod-
els in education.
92
|
YAN et al.
INTRODUCTION
Advancements in generative artificial intelligence (AI) and large language models (LLMs)
have fuelled the development of many educational technology innovations that aim to auto-
mate the often time- consuming and laborious tasks of generating and analysing textual con-
tent (eg, generating open- ended questions and analysing student feedback surveys) (Kasneci
et al., 2023; Leiker et al., 2023; Wollny et al., 2021). LLMs are generative artificial intelligence
models that have been trained on an extensive amount of text data, capable of generating
human- like text content based on natural language inputs. Specifically, these LLMs, such
as Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018)
and Generative Pre- trained Transformer (GPT) (Brown et al., 2020), utilise deep learning
and self- attention mechanisms (Vaswani et al., 2017) to selectively attend to the different
parts of input texts, depending on the focus of the current tasks, allowing the model to learn
complex patterns and relationships among textual contents, such as their semantic, con-
textual, and syntactic relationships (Liu et al., 2023; Min et al., 2021). As several LLMs (eg,
GPT- 3 and Codex) have been pre- trained on massive amounts of data across multiple disci-
plines, they are capable of completing natural language processing tasks with little (few- shot
learning) or no additional training (zero- shot learning) (Brown et al., 2020; Wu et al., 2023).
This could lower the technological barriers to LLMs- based innovations as researchers and
practitioners can develop new educational technologies by fine- tuning LLMs on specific
educational tasks without starting from scratch (Caines et al., 2023; Sridhar et al., 2023).
The recent release of ChatGPT, an LLMs- based generative AI chatbot that requires only
natural language prompts without additional model training or fine- tuning (OpenAI, 2023),
has further lowered the barrier for individuals without technological background to leverage
the generative powers of LLMs.
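To make this lowered barrier concrete, the following minimal sketch illustrates a zero-shot classification call of the kind described above, using the open-source Hugging Face transformers library; the model name, example forum post and candidate labels are illustrative assumptions rather than materials drawn from the reviewed studies.

```python
# Minimal zero-shot sketch (assumes the Hugging Face `transformers` library;
# the model, example post and labels are hypothetical, not from the reviewed studies).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

post = "I still don't understand how backpropagation updates the weights."
result = classifier(post, candidate_labels=["confusion", "urgency", "off-topic"])

# The highest-scoring label is the model's best guess, obtained without any
# task-specific training data or fine-tuning.
print(result["labels"][0], round(result["scores"][0], 2))
```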
Although educational research that leverages LLMs to develop technological innovations
for automating educational tasks is yet to achieve its full potential (ie, most works have fo-
cused on improving model performances (Kurdi et al., 2020; Ramesh & Sanampudi, 2022)),
a growing body of literature hints at how different stakeholders could potentially benefit
from such innovations. Specifically, these innovations could potentially play a vital role in
addressing teachers' high levels of stress and burnout by reducing their heavy workloads
by automating punctual, time- consuming tasks (Carroll et al., 2022) such as question gen-
eration (Bulut & Yildirim- Erbasli, 2022; Kurdi et al., 2020; Oleny, 2023), feedback provision
(Cavalcanti et al., 2021; Nye et al., 2023), scoring essays (Ramesh & Sanampudi, 2022) and
short answers (Zeng et al., 2023). These innovations could also potentially benefit both stu-
dents and institutions by improving the efficiency of often tedious administrative processes
such as learning resource recommendation, course recommendation and student feedback
evaluation (Sridhar et al., 2023; Wollny et al., 2021; Zawacki- Richter et al., 2019).
Despite the growing empirical evidence of LLMs' potential in automating a wide range of
educational tasks, none of the existing work has systematically reviewed the practical and
ethical challenges of these LLMs- based innovations. Understanding these challenges is es-
sential for developing responsible technologies as LLMs- based innovations (eg, ChatGPT)
could contain human- like biases based on the existing ethical and moral norms of society,
such as inheriting biased and toxic knowledge (eg, gender and racial biases) when trained
on unfiltered internet text data (Schramowski et al., 2022). Prior systematic reviews have
focused on investigating these issues related to one specific application scenario of LLMs-
based innovations (eg, question generation, essay scoring, chatbots or automated feedback)
(Cavalcanti et al., 2021; Kurdi et al., 2020; Ramesh & Sanampudi, 2022; Wollny et al., 2021).
The practical and ethical challenges of LLMs in automating different types of educational
tasks remain unclear. Understanding these challenges is essential for translating research
findings into educational technologies that stakeholders (eg, students, teachers, and institu-
tions) can use in authentic teaching and learning practices (Adams et al., 2021).
The current study is the first systematic scoping review that aimed to address this gap by
reviewing the current state of research on using LLMs to automate educational tasks and
identify the practical and ethical challenges of adopting these LLMs- based innovations in
authentic educational contexts. A total of 118 peer- reviewed publications from four prom-
inent databases were included in this review following the Preferred Reporting Items for
Systematic reviews and Meta- Analyses (PRISMA) (Page et al., 2021) protocol. An inductive
thematic analysis was conducted to extract details regarding the different types of educa-
tional tasks, stakeholders, LLMs, and machine learning tasks investigated in prior literature.
The practicality of LLMs- based innovations was assessed through the lens of technological
readiness, model performance, and model replicability. Lastly, the ethicality of these innova-
tions was assessed by investigating system transparency, privacy, equality and beneficence.
The contribution of this paper to the educational technology community is threefold: (1)
we systematically summarise a comprehensive list of 53 different educational tasks that
could potentially benefit from LLMs- based innovations through automation, (2) we present
a structured assessment of the practicality and ethicality of existing LLMs- based innova-
tions based on seven important aspects using established frameworks (eg, the transpar-
ency index (Chaudhry et al., 2022)), and (3) we propose three recommendations that could
potentially support future studies to develop LLMs- based innovations that are practical and
ethical to implement in authentic educational contexts. As the intersection of LLMs and ed-
ucation is continuously evolving, the findings of this systematic scoping review can serve
as an essential reference point for researchers, allowing them to leverage the strengths,
learn from the limitations, and uncover potential opportunities of novel LLMs in supporting
educational research and practice. Specifically, emerging works should carefully consider
the practical and ethical challenges identified in this study while exploring the research op-
portunities enabled by ChatGPT and other generative AI models.
BACKGROUND
In this section, we first establish the definitions of the key terminologies, specifically the
definitions of practicality and ethicality in the context of educational technology. We then
provide an overview of prior systematic reviews on LLMs in education. Finally, we present
the research questions based on the gaps identified in the existing literature.
Practicality
Several theoretical frameworks have been proposed regarding the practicality of integrat-
ing technological innovations in educational settings. For example, Ertmer's (1999) first-
and second- order barriers to change focused on the external conditions of the educational
system (eg, infrastructure readiness) and teachers' internal states (eg, personal beliefs).
Becker (2000) further suggested that for technological innovations to have actual benefits
in supporting pedagogical practices, these innovations should be convenient to access,
support constructivist pedagogical beliefs, be adaptable to changes in the curriculum, and
be compatible with teachers' level of knowledge and skills. These factors were also presented
in an earlier framework of the practicality index (Doyle & Ponder, 1977), which summarised
three critical components for integrating educational technologies, including the degree of
adoption feasibility, the cost and benefit ratio and the alignment with existing practices and
beliefs. Based on these prior theoretical frameworks and considering the recentness of
LLMs- based innovations (which only emerged in the past five years), the practical chal-
lenges of LLMs- based innovations in automating educational tasks can be assessed from
three primary perspectives. First, evaluating the technological readiness of these innova-
tions is essential for determining whether there is empirical evidence to support successful
integration and operation in authentic educational contexts. Second, assessing the model
performance could contribute valuable insights into the cost and benefits of adopting these
innovations, such as comparing the benefits of automation with the costs of inaccurate pre-
dictions. Finally, understanding whether these innovations are methodologically replicable
could be important for future studies to investigate their alignment with different educational
contexts and stakeholders. We elaborated on the evaluation items for each challenge in
Section “Data analysis”.
Ethicality
Ethical AI is a prevalent topic of discussion in multiple communities, such as learning ana-
lytics, AI in education, educational data mining, and educational technology communities
(Adams et al., 2021; Pardo & Siemens, 2014). There are ongoing debates regarding AI
ethics in education with a mixture of focuses on algorithmic and human ethics among edu-
cational data mining and AI in education communities (Holmes & Porayska- Pomsta, 2022).
As such debates continue, it is difficult to identify an established definition of ethical AI
from these fields. In contrast, ethicality has already been thoroughly investigated and addressed
in a field closely related to AI in education, namely, learning analytics (Pardo &
Siemens, 2014; Selwyn, 2019). Drawing on the established definition of ethicality from the
field of learning analytics (Pardo & Siemens, 2014), the ethicality of LLMs- based innovations
can thus be defined as the systematisation of appropriate and inappropriate functionali-
ties and outcomes of these innovations, as determined by all stakeholders (eg, students,
teachers, parents and institutions). For example, Khosravi et al. (2022) explained that the
ethicality of AI- powered educational technology systems needs to involve the considera-
tion of accountability, explainability, fairness, interpretability and safety of these systems.
These different domains of ethical AI are all closely related and can be addressed by con-
sidering system transparency. Transparency is a subset of ethical AI that involves making
all information, decisions, decision- making processes, and assumptions available to stake-
holders, which in turn enhances their comprehension of the AI systems and related outputs
(Chaudhry et al., 2022). Additionally, for LLMs- based innovations, Weidinger et al. (2021)
suggested six types of ethical risks, including (1) discrimination, exclusion, and toxicity, (2)
information hazards, (3) misinformation harms, (4) malicious uses, (5) human- computer in-
teraction harms and (6) automation, access and environmental harms. These risks can be
further aggregated into three fundamental ethical issues: privacy concerns regard-
ing educational stakeholders' personal data, equality concerns regarding the accessibility
of stakeholders with different backgrounds, and beneficence concerns about the poten-
tial harms and negative impacts that LLMs- based innovations may have on stakeholders
(Ferguson et al., 2016). These three fundamental ethical issues were considered in the
analysis of the reviewed literature. Further details are available in Section “Data analysis”.
Related work
Prior systematic reviews have focused primarily on reviewing a specific application scenario
(eg, question generation, automated feedback, chatbots and essay scoring) of natural lan-
guage processing and LLMs. For example, Kurdi et al. (2020) have systematically reviewed
empirical studies that aimed to tackle the problem of automatic question generation in ed-
ucational domains. They comprehensively summarised the different generation methods,
generation tasks, and evaluation methods presented in prior literature. In particular, LLMs
could potentially benefit the semantic- based approaches for generating meaningful ques-
tions that are closely related to the source contents. Likewise, Cavalcanti et al. (2021) have
systematically reviewed different automated feedback systems regarding their impacts on
improving students' learning performances and reducing teachers' workloads. Despite half
of their reviewed studies showing no evidence of reducing teachers' workloads, as these au-
tomated feedback systems were mostly rule- based and required extensive manual efforts,
they identified that using natural language generation techniques could further enhance
such systems' generalisability and potentially reduce manual workloads. On the other hand,
Wollny et al. (2021) have systematically reviewed areas of education where chatbots have
already been applied. They concluded that there is still much to be done for chatbots to
achieve their full potential, such as making them more adaptable to different educational
contexts. A systematic review has also investigated the various automated essay scoring
systems (Ramesh & Sanampudi, 2022). The findings have revealed multiple limitations of
the existing systems based on traditional machine learning (eg, regression and random for-
est) and deep learning algorithms (eg, LSTM and BERT). In sum, these previous systematic
reviews have identified room for improvement that can be potentially addressed using state-
of- the- art LLMs (eg, GPT- 3 or Codex). However, none of the prior systematic reviews has
investigated the practical and ethical issues related to LLMs- based innovations in education
generally rather than particularly (eg, limited to a specific task).
The recent hype around one of the latest LLMs- based innovations, ChatGPT, has inten-
sified the discussion about the practical and ethical challenges related to using LLMs in
education. For example, in a position paper, Kasneci et al. (2023) provided an overview of
some existing LLMs research and proposed several practical opportunities and challenges
of LLMs from students' and teachers' perspectives. Likewise, Rudolph et al. (2023) also
provided an overview of the potential impacts, challenges, and opportunities that ChatGPT
might have on future educational practices. Although these studies have not systematically
reviewed the existing educational literature on LLMs, their arguments resonated with some
of the pressing issues around LLMs and ethical AI, such as data privacy, bias, and risks.
On the other hand, Sallam (2023) systematically reviewed the implications and limitations of
ChatGPT in healthcare education and identified potential utility around personalisation and
automation. However, it is worth noting that most papers reviewed in Sallam's study were
either editorials, commentaries, or preprints. This lack of peer- reviewed empirical studies on
ChatGPT is understandable as it was only released in late 2022 (OpenAI, 2023).
None of the existing work has systematically reviewed the peer- reviewed literature on prior
LLMs- based innovations. Such investigations could provide more reliable and empirically-
based evidence regarding the potential opportunities and challenges of LLMs in educational
practices. Thus, the current study aimed to address this gap in the literature by conducting a
systematic scoping review of prior educational research on LLMs. Specifically, the following
research questions were investigated to guide this review:
RQ1: What is the current state of research on using LLMs to automate educational tasks,
specifically through the lens of educational tasks, stakeholders, LLMs and machine- learning
tasks1?
RQ2: What are the practical challenges of LLMs in automating educational tasks, spe-
cifically through the lens of technological readiness, model performance, and model
replicability?
RQ3: What are the ethical challenges of LLMs in automating educational tasks, specifi-
cally through the lens of system transparency, privacy, equality and beneficence?
METHODS
A systematic scoping review was conducted in this study as this method has been frequently
used in emerging and rapidly evolving research areas to scope a body of literature and iden-
tify the key concepts, methods, evidence, and challenges (Munn et al., 2018). Consequently,
the quality of the included studies was often not assessed, as the aim is to provide a broader
picture of an emerging field.
Review procedures
We followed the PRISMA (Page et al., 2021) protocol to conduct the current systematic
scoping review of LLMs. We searched four reputable bibliographic databases, including
Scopus, ACM Digital Library, IEEE Xplore and Web of Science, to find high- quality peer-
reviewed publications. Additional searches were conducted through Google Scholar and
Education Resources Information Center (ERIC) to identify peer- reviewed publications that
have yet to be indexed by these databases, either recently published or not indexed (eg,
Journal of Educational Data Mining; prior to 2020). Our initial search query for the title,
abstract, and keywords included terms such as “large language model”, “pre*trained lan-
guage model”, “GPT- *”, “BERT”, “education”, “student*” and “teacher*”. A publication year
constraint was also applied to restrict the search to studies published since 2017, specifi-
cally from 01/01/2017 to 12/31/2022, as the foundational architecture (Transformer) of LLMs
was formally released in 2017 (Vaswani et al., 2017). Only peer- reviewed publications were
considered to enhance the scientific credibility of this review. The initial database search
was conducted by two researchers independently. Any discrepancies between the search
results were resolved through further discussion or consulting the librarian for guidance.
Two researchers independently reviewed the titles and abstracts of eligible articles
based on five predetermined inclusion and exclusion criteria. First, we included studies
that used large or pre- trained language models directly or built on top of such models,
and excluded studies that used general machine- learning or deep- learning models with
unspecified usage of LLMs. Second, we included empirical studies with detailed methodol-
ogies, such as a detailed description of the LLMs and research procedures, and excluded
review, opinion and scoping works. Third, we only included full- length peer- reviewed pa-
pers and excluded short, workshop, and poster papers that were less than six and eight
pages for double- and single- column layouts, respectively. Additionally, we included stud-
ies that used LLMs for the purpose of automating educational tasks (eg, essay grading and
question generation), and excluded studies that merely used LLMs as part of the analysis
without educational implications. Finally, we only included studies that were published in
English (both the abstract and the main text) and excluded studies that were published in
other languages. Any conflicting decisions were resolved through further discussion be-
tween the two researchers or consulting with a third researcher to achieve a consensus.
FIGURE 1 Systematic scoping review process following the PRISMA protocol.
The database search initially yielded 854 publications, with 191 duplicates removed, resulting
in 663 publications for the title and abstract screening (see Figure 1). After the title
and abstract screening, 197 articles were included for the full- text review with an interrater
reliability (Cohen's kappa) of 0.75, indicating substantial agreement between the reviewers
during the title and abstract screening. A total of 118 articles were selected for data ex-
traction after the full- text review with an interrater reliability (Cohen's kappa) of 0.73, indicat-
ing substantial agreement between the reviewers during the full- text review. Out of the initial
197 articles, 79 were excluded for various reasons, including not full paper (n = 41), lack
of educational automation (n = 17), lack of pre- trained or LLMs (n = 12), merely using pre-
trained or LLMs as part of the analysis (n = 3), non- English paper (n = 2) and non- empirical
paper (n = 2).
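For illustration, the inter-rater agreement statistic reported above (Cohen's kappa) can be computed as in the minimal sketch below, which uses scikit-learn; the two reviewers' include/exclude decisions are hypothetical placeholders rather than the actual screening records of this review.

```python
# Minimal sketch of an inter-rater agreement check with Cohen's kappa using
# scikit-learn; the two reviewers' decisions below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

reviewer_1 = ["include", "exclude", "include", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "include", "exclude", "exclude", "exclude"]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(round(kappa, 2))  # values above 0.61 are conventionally read as substantial agreement
```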
Data analysis
For the first research question (RQ1), we conducted an inductive thematic analysis to extract
information regarding the current state of research on using LLMs to automate educational
tasks. Specifically, we extracted four primary types of contextual information from each
included paper: educational tasks, stakeholders, LLMs and machine- learning tasks. This
contextual information would provide a holistic view of the existing research and inform
researchers and practitioners regarding the viable directions to explore with the state- of-
the- art LLMs (eg, GPT- 3.5 and Codex). A total of seven data extraction items were devel-
oped to address the second and third research questions. These items directly correspond to
the definitions of practicality (RQ2: Items 1–3) and ethicality (RQ3:
Items 4–7) established in the background section. The following list elaborates on the final
set of items along with the corresponding guiding questions. For the thematic analysis and
the extraction items, two researchers independently coded 20 random samples of the included studies.
Any conflicts were resolved through further discussion or consulting a third researcher. After
reaching a Cohen's kappa of more than 0.80 (indicating almost perfect agreement), each re-
searcher coded half of the remaining 98 studies (49 studies each) and cross- checked each
other's work. The database of the studies included in this review and the extracted data for
each item are available in the supplementary document.
1. Technology readiness: What levels of technology readiness are the LLMs- based
innovations at? We adopted the assessment tool from the Australian government,
namely the Australian Department of Defence's Technology Readiness Levels (TRL)
(Defence Science and Technology Group, 2021), which has been used to assess
the maturity of educational technologies in a prior systematic literature review (Yan et al., 2022). There
are nine different technological readiness levels: Basic Research (TRL- 1), Applied
Research (TRL- 2), Critical Function or Proof of Concept Established (TRL- 3), Lab
Testing/Validation of Alpha Prototype Component/Process (TRL- 4), Laboratory Testing
of Integrated/Semi- Integrated System (TRL- 5), Prototype System Verified (TRL- 6),
Integrated Pilot System Demonstrated (TRL- 7), System Incorporated in Commercial
Design (TRL- 8), and System Proven and Ready for Full Commercial Deployment
(TRL- 9), further explained in the Results section.
2. Performance: How accurately and reliably can the LLMs- based innovations complete the
designated educational tasks? For example, what are the model performance scores for
classification (eg, AUC and F1 scores), generation (eg, BLEU score), and prediction tasks
(eg, RMSE and Pearson's correlation)?
3. Replicability: Can other researchers or practitioners replicate the LLMs- based innova-
tions without additional support from the original authors? This item evaluates whether the
paper provided sufficient details about the LLMs (eg, open- sourced algorithms) and the
dataset (eg, open- source data).
4. Transparency: What tiers of transparency index (Chaudhry et al., 2022) are the LLMs-
based innovations at? The transparency index proposed three tiers of transparency, in-
cluding transparent to AI researchers and practitioners (Tier 1), transparent to educational
technology experts and enthusiasts (Tier 2), and transparent to educators and parents
(Tier 3). The tier of transparency increases as educational stakeholders become fully in-
volved in developing and evaluating the AI system. These tiers were further elaborated on
in the Results section.
5. Privacy: Has the paper mentioned or considered privacy issues of their innovations? This
item explores potential issues related to informed consent, transparent data collection,
individuals' control over personal data, and unintended surveillance (Ferguson et al., 2016;
Tsai et al., 2020).
6. Equality: Has the paper mentioned or considered equal access to their innovations? This
item explores potential issues related to limited access for students from low- income back-
grounds or rural areas and the linguistic limitation of the innovations, such as their capabil-
ity to analyse different languages (Ferguson et al., 2016).
7. Beneficence: Has the paper mentioned or considered potential issues that violate the
ethical principle of beneficence? Such violations may include the risks associated with
labelling and profiling students, inadequate usage of machine- generated content for as-
sessments, and algorithmic biases (Ferguson et al., 2016; Zawacki- Richter et al., 2019).
RESULTS
The current state – RQ1
We identified nine different categories of educational tasks that prior studies have attempted
to automate using LLMs (as shown in Table 1). Prior studies have used LLMs to automate
the profiling and labelling of 17 types of education- related contents and concepts (eg, forum
posts, student sentiment and discipline similarity), the detection of six latent constructs (eg,
confusion and urgency), the grading of five types of assessments (eg, short answer ques-
tions and essays), the development of five types of teaching support (eg, conversation agent
and intelligent question- answering), the prediction of five types of student- orientated metrics
(eg, dropout and engagement), the construction of four types of knowledge representa-
tions (eg, knowledge graph and entity recognition), the provision of four different forms of
feedback (eg, real- time and post- hoc feedback), the generation of four types of content (eg,
MCQs and open- ended questions) and the delivery of three types of recommendations (eg,
resource and course). Of the 118 reviewed studies, 85 studies aimed to automate educa-
tional tasks related to teachers (eg, question grading and generation), 54 studies targeted
student- related activities (eg, feedback and resource recommendation), 20 studies focused
on supporting institutional practices (eg, course recommendations and discipline planning),
and 14 studies empowered researchers with automated methods to investigate latent con-
structs (eg, student confusion) and capture verbal data (eg, speech recognition).
We identified five categories of LLMs used in prior studies to automate educational tasks.
BERT and its variations (eg, RoBERTa, DistilBERT, multilingual BERT, LaBSE, EstBERT,
and Sentence- BERT) were the predominant models, used in 109 reviewed studies.
However, they often required manual effort for fine- tuning (n = 90). GPT- 2 and GPT- 3 have
been used in five and three studies, respectively. Specifically, GPT- 2 and GPT- 3 have per-
formed better than BERT- based models in content generation and evaluation tasks, such
as generating university math problems (Drori et al., 2022) and evaluating the quality of
student- generated short answer questions (Moore et al., 2022). OpenAI's Codex has been
used in two prior studies, specifically for code generation tasks. T5 has also been used in
two prior studies for classification and generation purposes. In terms of machine- learning
tasks, 74 studies used LLMs to perform classification tasks. Generation and prediction tasks
were investigated in 24 and 23 prior studies, respectively. In sum, LLMs- based innovations
have already been used to automate a range of educational tasks, but most of these innova-
tions were developed on older models, such as BERT and GPT- 2. Although state- of- the- art
models, such as GPT- 3, have been available for over two years (Brown et al., 2020), they
have yet to be widely applied to automate educational tasks. A potential reason for this lack
of adoption could be these models' commercial and closed- source nature, increasing the
financial burdens of developing and operating educational technology innovations on top of
such models.
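In contrast to zero-shot prompting, most of the reviewed BERT-based innovations rely on task-specific fine-tuning. The minimal sketch below outlines that workflow with the Hugging Face transformers and datasets libraries; the two labelled forum posts are hypothetical placeholders, and a realistic application would require a far larger labelled dataset and a held-out evaluation split.

```python
# Minimal sketch of the fine-tuning workflow used by most reviewed BERT-based
# innovations (assumes the Hugging Face `transformers` and `datasets` libraries;
# the texts and labels are hypothetical placeholders, not data from the studies).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["How do I submit assignment 2?", "This lecture was really confusing."]
labels = [0, 1]  # eg, 0 = logistics, 1 = confusion

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = Dataset.from_dict({"text": texts, "label": labels}).map(
    lambda row: tokenizer(row["text"], truncation=True,
                          padding="max_length", max_length=64)
)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
# Labelled examples and compute are required at this step; prompting a state-of-
# the-art model skips it, at the cost of relying on commercial, closed-source APIs.
trainer.train()
```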
Practical challenges— RQ2
Technology readiness
According to the Technology Readiness Level scale (Defence Science and Technology
Group, 2021), the LLMs- based innovations are still in the early development and testing
stage. Over three- quarters of the LLMs studies (n = 89) are in the applied research stage
(TRL- 2), which aims to experiment with the capability of LLMs in automating different ed-
ucational tasks by developing different models and combining LLMs with other machine-
learning and deep- learning techniques (eg, RCNN (Shang et al., 2022)). Thirteen studies
have established a proof of concept and demonstrated the feasibility of using LLMs-
based innovations to automate certain processes of educational tasks (TRL- 3). Nine
TABLE 1 Educational tasks in LLMs research.
Profiling and labelling: Forum post classification, dialogue act classification, classification of learning designs, review sentiment analysis, topic modelling, pedagogical classification of MOOCs, collaborative problem-solving modelling, paraphrase quality, speech tagging, labelling educational content with knowledge components, key sentence and keyword extraction, reflective writing analysis, multimodal representational thinking, discipline similarity, concept classification, cognitive level classification, essay arguments segmentation
Detection: Semantic analyses, detecting off-task messages, confusion detection, urgency detection, conversational intent detection, teachers' behaviour detection
Assessment and grading: Formative and summative assessment grading, short answer grading, essay grading, subjective question grading, student self-explanation
Teaching support: Classroom teaching, learning community support, online learning conversation agent, intelligent question-answering, teacher activity recognition
Prediction: Student performance prediction, student dropout prediction, emotional and cognitive engagement detection, growth and development indicators for college students, at-risk student identification
Knowledge representation: Knowledge graph construction, knowledge entity recognition, knowledge tracing, cause-effect relation extraction
Feedback: Real-time feedback, post-hoc feedback, aggregated feedback, feedback on feedback (peer-review comments)
Content generation: MCQs generation, open-ended question generation, code generation, reply (natural language) generation
Recommendation: English reference selection and recommendation, resource recommendation, course recommendation
studies have developed functional prototypes and conducted preliminary validation under
controlled laboratory settings (TRL- 4), often involving stakeholders (eg, students and
teachers) to test and evaluate the output of their innovations. Only seven studies have
taken a further step and conducted validation studies in authentic learning environments,
with most functional components integrated into the educational tasks (TRL- 5), such as
an intelligent virtual standard patient for medical students training (Song et al., 2022) and
an intelligent chatbot for university admission (Nguyen et al., 2021). Yet, none of the ex-
isting LLMs- based innovations has been verified through successful operations (TRL- 6).
Together, these findings suggest that although existing LLMs- based innovations can be used
to automate certain educational tasks, they have yet to show evidence regarding im-
provements to teaching, learning and administrative processes in authentic educational
practices.
Performance
The performance of LLMs- based innovations varies across different machine- learning and
educational tasks. For classification tasks, LLMs- based innovations have shown high per-
formance for simple educational tasks, such as modelling the topics from a list of program-
ming assignments (best F1 = 0.95) (Fonseca et al., 2020), analysing the sentiment of student
feedback (best F1 = 0.94) (Truong et al., 2020), constructing subject knowledge graph from
teaching materials (best F1 = 0.94) (Su & Zhang, 2020) and classifying educational forum
posts (Sha, Raković, Lin, et al., 2022) (best F1 = 0.92). However, the classification perfor-
mance of LLMs- based innovations decreases for other educational tasks. For example,
the F1 scores for detecting student confusion in the course forum (Geller et al., 2021) and
students' off- task messages in game- based collaborative learning (Carpenter et al., 2020)
are around 0.77 and 0.67, respectively. Likewise, the F1 score for classifying short- answer
responses varies between 0.61 and 0.82, with lower performance on out- of- sample ques-
tions (best F1 = 0.61) (Condor et al., 2021). Similar performances were also observed in clas-
sifying students' argumentative essays (best F1 = 0.66) (Ghosh et al., 2020).
For prediction tasks, LLMs- based innovations have demonstrated reliable performance
compared to ground truth or human raters. For example, LLMs- based innovations have
achieved high scores of quadratic weighted kappa (QWK) in essay scoring, specifically for
off- topic (QWK = 0.80), gibberish (QWK = 0.80), and paraphrased answers (QWK = 0.94),
indicating substantial to almost perfect agreements with human raters (Doewes &
Pechenizkiy, 2021). Similar performances on essay scoring have been observed in several
other studies (eg, 0.80 QWK in (Beseiso et al., 2021) and 0.81 QWK in (Sharma et al., 2021)).
Likewise, LLMs- based innovations' performances on automatic short- answer grading were
also highly correlated with human ratings (Pearson's correlations between 0.75 and 0.82)
(Ahmed et al., 2022; Sawatzki et al., 2022).
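For illustration, the two agreement metrics reported in this paragraph (quadratic weighted kappa and Pearson's correlation) can be computed as in the sketch below, using scikit-learn and SciPy; the human and model scores are hypothetical placeholders rather than data from the reviewed studies.

```python
# Illustrative computation of quadratic weighted kappa (QWK) and Pearson's r
# between human and model-assigned essay scores; the score vectors are
# hypothetical placeholders, not data from the reviewed studies.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 5, 4, 3, 1, 4]
model_scores = [3, 4, 3, 5, 4, 2, 1, 4]

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
r, _ = pearsonr(human_scores, model_scores)
print(f"QWK = {qwk:.2f}, Pearson's r = {r:.2f}")
```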
Regarding generation tasks, LLMs- based innovations demonstrated high performance
across different educational tasks. For example, LLMs- based innovations have achieved
an F1 score of 0.92 for generating MCQs with single- word answers (Kumar et al., 2022).
Educational technologies developed by fine- tuning Codex also demonstrated the capa-
bility of resolving 81% of the advanced mathematics problems (Drori et al., 2022). Text
summaries generated using BERT showed no significant differences from student-
generated summaries and could not be differentiated by graduate students (Merine &
Purkayastha, 2022). Similarly, BERT- generated doctor- patient dialogues were also found to
be indistinguishable from actual doctor- patient dialogues, which can be used to create vir-
tual standard patients for medical students' diagnosis practice training (Song et al., 2022).
Additionally, for introductory programming courses, the state- of- the- art LLM Codex
could generate sensible and novel exercises for students along with an appropriate sample
solution (around three out of four times) and accurate code explanation (67% accuracy)
(Sarsa et al., 2022).
In sum, although the classification performance of LLMs- based innovations on com-
plex educational tasks is far from suitable for practical adoption, LLMs- based innova-
tions have already shown high performance on several relatively simple classification
tasks and could potentially be deployed to generate meaningful insights for
teachers and institutions, such as helping them navigate large volumes of
student feedback and course reviews. Likewise, LLMs- based innovations' prediction and
generation performance reveals a promising future of automating the gener-
ation of educational content and the initial grading of student assessments. However,
ethical issues must be considered for such implementations, which we cover in the
findings for RQ3.
Replicability
Most reviewed studies (n = 107) have not disclosed sufficient details about their methodolo-
gies for other researchers and practitioners to replicate their proposed LLMs- based innova-
tions. Among these studies, 12 studies have open- sourced the original code for developing
the innovations but failed to open- source the data they used. In contrast, 20 studies have
open- sourced the data they used but failed to release the actual code. Around two- thirds
of the reviewed studies (n = 75) have failed to release both the original code and the data
they used, leaving only 11 studies publicly available for other researchers and practitioners
to replicate without needing to contact the original authors. This lack of replicability could
become a vital barrier to adoption, as 87 out of the 107 non- replicable studies required
fine- tuning the LLMs to achieve the reported performance. This replication issue also limits
others from further evaluating the generalisability of the proposed LLMs- based innovations
in other datasets, constraining potential practical utilities.
Ethical challenges— RQ3
Transparency
Based on the transparency index and the three tiers of transparency (Chaudhry et al., 2022),
most of the reviewed studies reached at most Tier 1 (n = 109), which is merely considered
transparent to AI researchers and practitioners. Although these studies reported details
regarding their machine learning models (eg, optimisation and hyperparameters), such in-
formation is unlikely to be interpretable and considered transparent for individuals without
a strong background in machine learning. The remaining nine studies reached at most
Tier 2, as they often involved some form of human- in- the- loop elements. Specifically,
making the LLMs innovations available for student evaluation has been found in three stud-
ies (Merine & Purkayastha, 2022; Nguyen et al., 2021; Song et al., 2022). Such evalua-
tions often involved students differentiating AI- generated from human- generated content
(Merine & Purkayastha, 2022; Song et al., 2022) and assessing student satisfaction with AI-
generated responses (Nguyen et al., 2021). Likewise, two studies have involved experts in
evaluating specific features of the content generated by the LLMs- based innovations, such
as informativeness (Maheen et al., 2022) and cognitive level (Moore et al., 2022). Surveys
have been used to evaluate students' experience with LLMs- based innovations from multiple
perspectives, such as the quality and difficulty of AI- generated questions (Drori et al., 2022;
Li & Xing, 2021) and potential learning benefits of the systems (Jayaraman & Black, 2022).
Finally, semi- structured interviews have been conducted to understand students' percep-
tion of the LLM system after using the system in authentic computer- supported collabora-
tive learning activities (Zheng et al., 2022). Although these nine studies had some elements
of human- in- the- loop, stakeholders were often involved in a post- hoc evaluation manner
instead of throughout the development process, and thus had limited knowledge regard-
ing the operating principles and potential weaknesses of the systems. Consequently, none of
the existing LLMs- based innovations can be considered as being at Tier 3, which describes
an AI system that is considered transparent for educational stakeholders (eg, students,
teachers, and parents).
Privacy
The privacy issues related to LLMs- based innovations were rarely attended to or investi-
gated in the reviewed studies. Specifically, for studies that have fine- tuned LLMs with textual
data collected from students, none of these studies has explicitly explained their consenting
strategies (eg, whether students acknowledge the collection and intended usage of their
data) and data protection measures (eg, data anonymisation and sanitisation). This lack of
attention to privacy issues is particularly concerning as LLMs- based innovations work with
stakeholders' natural languages that may contain personal and sensitive information re-
garding their private lives and identities (Brown et al., 2022). It is possible that stakeholders
might not be aware of their textual data (eg, forum posts or conversations) on digital plat-
forms (eg, MOOCs and LMS) being used in LLMs- based innovations for different purposes
of automation (eg, automated reply and training chatbots) as the consenting process is
often embedded into the enrolment or sign-up process of these platforms (Tsai & Gasevic, 2017).
This process can hardly be considered informed consent. Consequently, if stakeholders
shared their personal information on these platforms in natural language (eg, sharing phone
numbers and addresses with group members via digital forums), such information could
be used as training data for fine- tuning LLMs. This usage could potentially expose private
information as LLMs are incapable of understanding the context and sensitivity of text,
and thus, could return stakeholders' personal information based on semantic relationships
(Brown et al., 2022).
Equality
Although most of the studies (n = 95) used LLMs that only apply to English content, we
also identified application scenarios of LLMs in automating educational tasks in 12 other
languages. Specifically, 19 studies used LLMs that can be applied to Chinese content.
Ten prior studies used LLMs for Vietnamese (n = 3), Spanish (n = 3), Italian (n = 2), and
German (n = 2) content. Additionally, seven studies applied LLMs to Croatian, Indonesian,
Japanese, Romanian, Russian, Swedish, and Hindi content. While the dominance of
English- based innovations remains a concerning equality issue, the availability of inno-
vations that support a variety of other languages, specifically in societies outside of western, educated,
industrialised, rich and democratic (WEIRD) contexts (eg, Indonesia and Vietnam), may
be a promising sign that LLMs- based innovations could have global impacts and
level out such equality issues in the future. However, the financial burdens from adopting the
state- of- the- art models (eg, OpenAI's GPT- 3 and Codex) could potentially exacerbate the
equality issues, making the best- performing innovations only accessible and affordable to
WEIRD societies.
Beneficence
A total of seven studies have discussed potential issues related to the violation of the
ethical principle of beneficence. For example, one study has discussed the potential risk
of adopting underperforming models, which could negatively affect students' learning
experiences (Li & Xing, 2021). Such issues could be minimised by deferring decisions
made by such models (Schneider et al., 2022) and labelling the AI- generated content
with a warning message (eg, teachers' manual revision is mandatory before determining
the actual correctness) (Angelone et al., 2022). Apart from issues with adopting inac-
curate models, two studies have suggested that potential bias and discrimination issues
may occur if adopting a model that is accurate but unfair (Merine & Purkayastha, 2022;
Sha et al., 2021). This issue is particularly concerning as most existing studies focused
solely on developing an accurate model. Only nine reviewed studies released infor-
mation regarding the descriptive data of different sample groups, such as gender and
ethnicity (eg, Pugh et al., 2021). Two studies have proposed potential approaches that
could address such fairness issues. Specifically, using sampling strategies, such as bal-
ancing demographic distribution, has been found to be an effective approach for improving
both model fairness and accuracy (Sha, Li, Gasevic, & Chen, 2022; Sha, Raković, Das,
et al., 2022). These approaches are essential for ensuring that LLMs- based innovations
will not perpetuate problematic and systematic biases (eg, gender biases), especially as
the best- performing LLMs are often black- boxed with little interpretability, traceability and
justification of the results (Wu, 2022).
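To make the rebalancing idea concrete, the minimal sketch below upsamples each demographic group to the size of the largest group before model training, using pandas; the data frame and its columns are hypothetical placeholders rather than data from the cited studies, and other sampling strategies are equally possible.

```python
# Minimal sketch of demographic rebalancing by upsampling: each group is
# resampled (with replacement) to the size of the largest group before
# fine-tuning; the data frame and columns are hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({
    "text":   ["post a", "post b", "post c", "post d", "post e"],
    "label":  [1, 0, 1, 0, 1],
    "gender": ["f", "f", "m", "f", "f"],
})

target_n = df["gender"].value_counts().max()
balanced = pd.concat(
    group.sample(n=target_n, replace=True, random_state=0)
    for _, group in df.groupby("gender")
).reset_index(drop=True)

print(balanced["gender"].value_counts())  # each group now has the same number of rows
```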
DISCUSSION
Main findings
The current study systematically reviewed 118 peer- reviewed empirical studies that used
LLMs to automate educational tasks. For the first research question (RQ1), we illustrated
the current state of educational research on LLMs. Specifically, we identified 53 types
of application scenarios of LLMs in automating educational tasks, summarised into nine
general categories, including profiling and labelling, detection, assessment and grading,
teaching support, prediction, knowledge representation, feedback, content generation and
recommendation. While some of these categories resonate with the utilities proposed in
prior positioning works (eg, feedback, content generation and recommendation) (Kasneci
et al., 2023; Rudolph et al., 2023), novel directions such as using LLMs to automate the
creation of knowledge graphs and entities further indicated the potential of LLMs- based inno-
vations in supporting institutional practices (eg, creating knowledge- based search engines
across multiple disciplines). These identified directions could benefit from the state- of- the-
art LLMs (eg, GPT- 3 and Codex) as most of the reviewed studies (92%) focused on using
BERT- based models, which often required manual effort for fine- tuning. In contrast, the state-
of- the- art LLMs could potentially achieve similar performance with a zero- shot approach
(Bang et al., 2023). While the majority of the reviewed studies (63%) focused on using LLMs
to automate classification tasks, future studies could aim to tackle the
automation of prediction and generation tasks with more capable LLMs (Sallam, 2023).
Likewise, although supporting teachers is the primary focus (72%) of the existing LLMs-
based innovations, students and institutions could also benefit from such innovations as
novel utilities could continue to emerge from the educational technology literature. Together,
the findings of the first research question could offer educational researchers ideas
for exploring the potential of state- of- the- art LLMs in augmenting educational practices;
specifically, the identified 53 types of application scenarios may all be worth re- exploring in
light of ChatGPT and other powerful generative AI models (Kasneci et al., 2023).
Regarding the second research question (RQ2), we identified several practical challenges
that need to be addressed for LLMs- based innovations to have actual educational benefits.
The development and educational research on LLMs- based innovations are still in the early
stages. Most of the innovations demonstrated a low level of technology readiness, where
the innovations have yet to be fully integrated and validated in authentic educational con-
texts. This finding resonates with previous systematic reviews on related educational tech-
nologies, such as reviews on automated question generation (Kurdi et al., 2020), feedback
provision (Cavalcanti et al., 2021), essay scoring (Ramesh & Sanampudi, 2022), and chatbot
systems (Wollny et al., 2021). There is a pressing need for in- the- wild studies that provide
LLMs- based innovations directly to educational stakeholders for supporting actual educa-
tional tasks instead of testing on different datasets or in laboratory settings. Such authentic
studies could also validate whether the existing innovations can achieve the reported high
model performance in real- life scenarios, specifically in prediction and generation tasks,
instead of being limited to prior datasets. This validation process is vital for preventing inad-
equate usage, such as adopting a subject- specific prediction model for unintended subjects.
Researchers need to carefully examine the extent of generalisability of their innovations
and inform the limitations to stakeholders (Gašević et al., 2016). However, addressing such
needs could be difficult considering the current literature's poor replicability, which raises the barriers for others to adopt LLMs-based innovations in authentic educational contexts or to validate them with different samples. Similar replication issues have also been identified in other
areas of educational technology research (Yan et al., 2022).
For the third research question (RQ3), we identified several ethical challenges regarding
LLMs- based innovations. In particular, most of the existing LLMs- based innovations (92%)
were only transparent to AI researchers and practitioners (Tier 1), with only nine studies that
can be considered transparent to educational technology experts and enthusiasts (Tier 2).
The primary reason behind this low transparency can be attributed to the lack of human- in-
the- loop components in prior studies. This finding resonates with the call for explainable and
human- centred AI, which stresses the vital role of stakeholders in developing meaningful
and impactful educational technology (Khosravi et al., 2022; Yang et al., 2021). Involving
stakeholders during the development and evaluation of LLMs- based innovations is essen-
tial for addressing both practical and ethical issues. For example, as the current findings revealed, LLMs-based innovations are subject to data privacy issues, yet such issues were rarely mentioned or investigated in the literature (Merine & Purkayastha, 2022), possibly because stakeholders had little voice in prior research. The concerning issues around beneficence also demand the involvement of stakeholders, as their perspectives are
vital for shaping the future directions of LLMs- based innovations, such as how responsible
decisions can be made with these AI systems (Schneider et al., 2022). Likewise, the equality
issue regarding the financial burdens that may occur when adopting innovations that lever-
age commercial LLMs (eg, GPT- 3 and Codex) can also be further studied with institutional
stakeholders.
Implications
The current findings have several implications for education research and practice with
LLMs, which we have summarised into three recommendations that aim to support future studies in developing practical and ethical innovations that deliver actual benefits to educational stakeholders. First, the wide range of application scenarios of LLMs-based innova-
tions can further benefit from the improvements in the capability of LLMs. Updating existing
innovations with state-of-the-art LLMs may further reduce the amount of manual effort required for fine-tuning while achieving similar performance (Bang et al., 2023). Considering the
53 identified use cases of LLMs in education, there are multiple research trajectories that
could foster the development of practical educational technologies. These avenues have
the potential to address some of the pressing challenges that plague the global education
system. Particularly, the use cases involving teaching support, assessment and grading,
feedback, and content generation categories (Table 1) could act as catalysts for the de-
velopment of educational technologies that could alleviate teachers' workload and mental
stress by automating the laborious tasks associated with creating, evaluating, and providing
feedback for student assessments (Carroll et al., 2022). Similarly, further exploration of the
use cases in profiling and labelling, detection, prediction, and recommendation could lead to
the development of educational technologies that can deliver personalised learning support
for each student across various disciplines (Wollny et al., 2021). Such improvements could
enhance the overall well- being of teachers and increase students' learning opportunities,
thereby contributing to the achievement of SDG 4 by 2030 (Boeren, 2019). Nonetheless, re-
searchers should also be mindful of the potential financial and resource burdens that could
be imposed on educational stakeholders when innovating with the commercial LLMs (eg,
GPT- 3/4 and ChatGPT).
The unrivalled natural language generation capabilities exhibited by ChatGPT and other
cutting- edge LLMs (eg, LLaMA and PaLM 2) might also inspire future studies to delve
into a broader spectrum of research directions. These include comparisons between the
quality of student- generated and ChatGPT- generated writings (Li et al., 2023) and eval-
uating these LLMs' capability to tackle educational assessments (Gilson et al., 2023).
Such explorations would not only unveil the potential of LLMs and generative AI models in
educational content generation and evaluation tasks but also expose the possible threats
that these models pose to academic integrity, a pervasive issue across the education
sector (Kasneci et al., 2023). Intriguingly, leveraging the use cases of LLMs in tasks
such as creating knowledge representation (Zheng et al., 2023) and classifying cognitive
levels (Liu et al., 2022) could potentially facilitate the transition from outcome- focused to
process- focused assessments. Here, LLMs and generative AI models could be employed
for learning assessments in a manner similar to learning analytics (Gašević et al., 2022).
Consequently, future studies may begin to explore methods of addressing the potential
threats of LLMs with LLMs- based solutions.
For LLMs- based innovations to achieve a high level of technology readiness and perfor-
mance, the current reporting standards must be improved. Future studies should support
the initiative of open- sourcing their models/systems when possible and provide sufficient
details about the test datasets, which are essential for others to replicate and validate ex-
isting innovations across different contexts, preventing the potential pitfall of another repli-
cation crisis (Maxwell et al., 2015). This initiative is particularly vital in the era of generative
AI models as most of these models, especially the commercial ones (eg, ChatGPT and the
GPT series), are proprietary. Thus, when using these LLMs for augmenting educational
practices, such as scoring student essays (Doewes & Pechenizkiy, 2021), providing real-
time feedback (Zheng et al., 2022) or generating questions for learning activities (Sarsa
et al., 2022), researchers need to be systematic and transparent about the reporting of
the model usage and prompts (Wu, 2022). For example, when using the ChatGPT API for
question generation at scale, researchers should at least report the exact models, prompts,
and model temperature used in the process, as different models may differ in their ability to
generate accurate and reliable content and the prompts are essential for others to replicate
the same or similar results (Kasneci et al., 2023).
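To make such reporting concrete, the sketch below illustrates how the exact model identifier, prompt, and temperature of each generation call could be logged alongside the output so that they can later be reported and shared. It is a minimal, hypothetical example: it assumes the OpenAI Python client (openai>=1.0), and the helper, prompt, and log file names are illustrative rather than drawn from any reviewed study.

```python
# Minimal sketch of reproducible LLM usage for question generation.
# Assumes the OpenAI Python client (openai>=1.0); the helper, prompt,
# and log file names are hypothetical, not from any reviewed study.
import json
import time

from openai import OpenAI

client = OpenAI()


def generate_question(topic: str, model: str = "gpt-4",
                      temperature: float = 0.2,
                      log_path: str = "generation_log.jsonl") -> str:
    prompt = f"Write one multiple-choice question about {topic} with four options."
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    question = response.choices[0].message.content
    # Record everything needed to report and replicate this call.
    record = {
        "timestamp": time.time(),
        "model": response.model,  # exact model version returned by the API
        "temperature": temperature,
        "prompt": prompt,
        "output": question,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return question
```

Sharing such a log, together with the test dataset details, would allow other researchers to rerun the same prompts against the same (or newer) models and compare outputs directly.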
Apart from the aforementioned technical and methodological details, researchers and
educational policymakers should also consider the potential wider impacts of LLMs- based
solutions on different stakeholders. For example, in terms of detection and academic in-
tegrity, some institutions have rapidly adopted AI- detection tools that claim to have high
accuracy and a low false positive rate. Yet, as disclosed in a recent report by Turnitin, a
company whose AI- detection function has been utilised on more than 38.5 million student
submissions, the real- world performance of their solution resulted in a significantly higher
occurrence of false positives compared to their laboratory findings (Chechitelli, 2023). Such
negligence can be devastating for students who have been falsely accused of academic
misconduct, as well as for educators who must handle the repercussions. This example reinforces the importance of conducting rigorous scientific studies with key stakeholders when adopting any LLMs-based solutions that have direct or indirect impacts on students, educa-
tors, and other stakeholders. Likewise, the reporting of such studies should also adhere to
high standards, incorporating both methodological specifics and detailed data descriptions.
These details are especially pertinent when considering the diverse cultural backgrounds of
students and the fact that most LLMs are primarily trained on English datasets, which could
potentially introduce biases towards non- native English students (Liang et al., 2023).
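To illustrate the scale of the false-positive concern raised above, the following back-of-the-envelope calculation uses purely hypothetical rates: the submission count echoes the figure cited above, but the shares and rates are assumptions, not figures reported by Turnitin or any study.

```python
# Back-of-the-envelope illustration with hypothetical rates; the submission
# count echoes the figure cited above, but the shares and rates are assumed.
submissions = 38_500_000      # approximate number of screened submissions
human_written_share = 0.90    # assume 90% of submissions are fully human-written
false_positive_rate = 0.01    # assume a 1% false positive rate in the wild

falsely_flagged = submissions * human_written_share * false_positive_rate
print(f"Expected falsely flagged submissions: {falsely_flagged:,.0f}")
# With these assumptions, roughly 346,500 human-written submissions would be
# incorrectly flagged as AI-generated.
```

Even under these illustrative assumptions, hundreds of thousands of human-written submissions would be incorrectly flagged, underlining why real-world false positive rates need to be validated with stakeholders before deployment.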
Adopting a human-centred approach when developing and evaluating LLMs-based innovations is essential for ensuring these innovations remain ethical in practice, especially as ethical principles alone may not guarantee ethical AI due to their top-down nature (eg, being developed by regulatory bodies) (Mittelstadt, 2019). Future studies need to consider the ethical
issues that may arise from their specific application scenarios and actively involve stake-
holders to identify and address such issues. Specifically, LLM- based innovations should aim
to reach at least Tier 3 in the transparency index and TRL- 7 in technology readiness. This
involves a fully functional system being integrated into authentic learning environments and
validated by students and educators in terms of its practicality and ethical considerations.
For any decisions made by the LLM- based innovations, the relevant stakeholders should
be informed about how the decision was reached, as well as the potential risks and biases
involved. For instance, when students receive an assessment that has been automatically
graded, these grades should be accompanied by a warning message indicating that they
have been graded by LLMs and AI (Angelone et al., 2022). Students should also have the
opportunity to consult their teacher regarding any concerns.
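As a minimal sketch of such a disclosure (the data structure, field names, and wording are hypothetical, not drawn from any reviewed system), an automatically graded result could carry provenance information and a standard notice before being shown to the student:

```python
# Minimal sketch of attaching provenance and a disclosure notice to an
# automatically graded result; field names and wording are hypothetical.
from dataclasses import dataclass


@dataclass
class AutoGradedResult:
    student_id: str
    score: float
    model_name: str    # which LLM produced the grade
    confidence: float  # model-reported or calibrated confidence, if available
    notice: str = (
        "This grade was generated automatically by an AI model and may contain "
        "errors or biases. Please contact your teacher if you have any concerns."
    )


result = AutoGradedResult(student_id="s-001", score=7.5,
                          model_name="example-llm", confidence=0.82)
print(f"Score: {result.score}\n{result.notice}")
```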
The active involvement of stakeholders should also extend beyond the education sector, involving policymakers and industry partners in establishing guidelines for adopting LLMs-based innovations in learning and teaching practices, as such adoptions could have broader implications for society beyond the education sector. For example, human-AI collaboration might become an essential skill for students to succeed in the job market as AI solutions become an integral component of productivity in the industrial sector (Wang et al., 2020). Therefore, institutions that prohibit AI tools could inadvertently place their students at a disadvantage compared to institutions that proactively welcome such changes; such proactive engagement could be achieved by consistently refining institutional policies on the use of LLMs and generative AI solutions, based on stakeholder feedback and empirical evidence.
Limitations
The current findings should be interpreted with several limitations in mind. First, although
we assessed the practicality and ethicality of LLMs- based innovations with seven different
items, there could be other aspects of these multi- dimensional concepts that we omitted.
Nevertheless, these assessment items were chosen directly from the corresponding defi-
nitions and related to the pressing issues in the literature (Adams et al., 2021; Weidinger
et al., 2021). Second, we only included English publications, which could have biased our
findings regarding the availability of LLMs- based innovations among different countries.
Third, as we strictly followed the PRISMA protocol and only included peer-reviewed publications, we may have omitted emerging works published in open-access archives. These studies may contain interesting findings regarding the latest LLMs (eg,
ChatGPT). Additionally, this review focused on the potential of LLMs- based innovations in
automating educational tasks, and thus, other pressing issues, such as the potential threat
to academic integrity, were outside of the scope of this systematic scoping review. We briefly
touched on these pressing issues in the implications and illustrated the importance of the
current findings in supporting future educational studies to address these issues. Moreover,
since this study is a systematic scoping review, we did not assess the quality of the included
studies, and thus, the findings, particularly the performance metrics extracted from the reviewed studies, may need further evaluation. The goal of this study is to provide an overview of the different educational tasks that can be augmented by LLMs and generative AI models, which can serve as a reference point for future studies to build on with the state-of-the-art models (eg, ChatGPT and PaLM 2). Furthermore, the transparency index
that we adopted for RQ3 did not consider the transparency to students, which could be an
important direction for future human-centred AI studies. It is also pertinent to mention that a number of recent workshop and preliminary papers, while contributing to this field, were not incorporated in this scoping review due to time constraints (Caines et al., 2023; Leiker et al., 2023; Ma et al., 2023). Their exclusion limits the breadth of this study, given the rapid pace of scholarly advancement in this area.
CONCLUSION
In this study, we systematically reviewed the current state of educational research on LLMs and
identified several practical and ethical challenges that need to be addressed in order for LLMs-
based innovations to become beneficial and impactful. Based on the findings, we proposed
three recommendations for future studies, including updating existing innovations with state-
of- the- art models, embracing the initiative of open- sourcing models/systems, and adopting a
human- centred approach throughout the developmental process. These recommendations
could potentially support future studies in developing practical and ethical innovations that can be implemented in authentic contexts to automate a wide range of educational tasks.
ACKNOWLEDGEMENTS
This research was funded partially by the Australian Government through the Australian
Research Council (project number DP210100060 and DP220101209). Roberto Martinez-
Maldonado's research is partly funded by Jacobs Foundation. This research was also funded
partially by the Jacobs Foundation (CELLA 2 CERES). This material is in part based on research
sponsored by Defense Advanced Research Projects Agency (DARPA) under agreement num-
ber HR0011- 22- 2- 0047. The U.S. Government is authorised to reproduce and distribute reprints
for Governmental purposes notwithstanding any copyright notation thereon. The views and
conclusions contained herein are those of the authors and should not be interpreted as neces-
sarily representing the official policies or endorsements, either expressed or implied, of DARPA
or the U.S. Government. Open access publishing facilitated by Monash University, as part of
the Wiley - Monash University agreement via the Council of Australian University Librarians.
FUNDING INFORMATION
This research was at least in part funded by the Australian Research Council (DP210100060;
DP220101209), Jacobs Foundation (Research Fellowship; CELLA 2 CERES), and Defense
Advanced Research Project Agency (HR0011- 22- 2- 0047).
CONFLICT OF INTEREST STATEMENT
The authors have declared no conflicts of interest.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in the supplementary
document.
ORCID
Lixiang Yan https://orcid.org/0000-0003-3818-045X
Linxuan Zhao https://orcid.org/0000-0001-5564-0185
Yuheng Li https://orcid.org/0000-0002-5971-8469
Guanliang Chen https://orcid.org/0000-0002-8236-3133
ENDNOTE
1 Such as classication, prediction, clustering, etc.
REFERENCES
Adams, C., Pente, P., Lemermeyer, G., & Rockwell, G. (2021). Artificial intelligence ethics guidelines for K-12 education: A review of the global landscape. In Artificial Intelligence in Education: 22nd International Conference, AIED 2021, Utrecht, The Netherlands, June 14–18, 2021, Proceedings, Part II (pp. 24–28). Springer.
Ahmed, A., Joorabchi, A., & Hayes, M. J. (2022). On the application of sentence transformers to automatic short
answer grading in blended assessment. In 2022 33rd Irish Signals and Systems Conference (ISSC) (pp.
1– 6). IEEE.
Angelone, A. M., Galassi, A., & Vittorini, P. (2022). Improved automated classification of sentences in data science exercises. In Methodologies and Intelligent Systems for Technology Enhanced Learning, 11th International Conference 11 (pp. 12–21). Springer.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., Do, Q. V., Xu,
Y., & Fung, P. (2023). A multitask, multilingual, multimodal evaluation of chatGPT on reasoning, hallucination,
and interactivity. arXiv preprint arXiv:2302.04023.
Becker, H. J. (2000). Findings from the teaching, learning, and computing survey. Education Policy Analysis
Archives, 8, 51.
Beseiso, M., Alzubi, O. A., & Rashaideh, H. (2021). A novel automated essay scoring approach for reliable higher educational assessments. Journal of Computing in Higher Education, 33, 727–746.
Boeren, E. (2019). Understanding sustainable development goal (SDG) 4 on “quality education” from micro, meso and macro perspectives. International Review of Education, 65, 277–294.
Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022). What does it mean for a language model to preserve privacy? In 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 2280–2292). Association for Computing Machinery.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Bulut, O., & Yildirim-Erbasli, S. N. (2022). Automatic story and item generation for reading comprehension assessments with transformers. International Journal of Assessment Tools in Education, 9, 72–87.
Caines, A., Benedetto, L., Taslimipoor, S., Davis, C., Gao, Y., Andersen, O., Yuan, Z., Elliott, M., Moore,
R., Bryant, C., Rei, M., Mullooly, A., Nicholls, D., & Buttery, P. (2023). On the application of large
language models for language teaching and assessment technology. In AIED Workshops, in press.
CEUR- WS.org
Carpenter, D., Emerson, A., Mott, B. W., Saleh, A., Glazewski, K. D., Hmelo-Silver, C. E., & Lester, J. C. (2020). Detecting off-task behavior from student dialogue in game-based collaborative learning. In Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6–10, 2020, Proceedings, Part I 21 (pp. 55–66). Springer.
Carroll, A., Forrest, K., Sanders-O'Connor, E., Flynn, L., Bower, J. M., Fynes-Clinton, S., York, A., & Ziaei, M. (2022). Teacher stress and burnout in Australia: Examining the role of intrapersonal and environmental factors. Social Psychology of Education, 25, 441–469.
Cavalcanti, A. P., Barbosa, A., Carvalho, R., Freitas, F., Tsai, Y.-S., Gašević, D., & Mello, R. F. (2021). Automatic feedback in online learning environments: A systematic literature review. Computers and Education: Artificial Intelligence, 2, 100027.
Chaudhry, M. A., Cukurova, M., & Luckin, R. (2022). A transparency index framework for AI in education. In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners' and Doctoral Consortium: 23rd International Conference, AIED 2022, Durham, UK, July 27–31, 2022, Proceedings, Part II (pp. 195–198). Springer.
Chechitelli, A. (2023). AI writing detection update from Turnitin's chief product officer. https://www.turnitin.com/blog/ai-writing-detection-update-from-turnitins-chief-product-officer
Condor, A., Litster, M., & Pardos, Z. (2021). Automatic short answer grading with sbert on out- of- sample ques-
tions. International Educational Data Mining Society.
Defence Science and Technology Group. (2021). Technology readiness levels definitions and descriptions. https://www.dst.defence.gov.au/sites/default/files/basic_pages/documents/TRL%20Explanations_1.pdf
Devlin, J., Chang, M.- W., Lee, K., & Toutanova, K. (2018). Bert: Pre- training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805.
Doewes, A., & Pechenizkiy, M. (2021). On the limitations of human- computer agreement in automated essay
scoring. International Educational Data Mining Society.
Doyle, W., & Ponder, G. A. (1977). The practicality ethic in teacher decision- making. Interchange, 8, 1– 12.
Drori, I., Zhang, S., Shuttleworth, R., Tang, L., Lu, A., Ke, E., Liu, K., Chen, L., Tran, S., Cheng, N., Wang, R., Singh, N., Patti, T. L., Lynch, J., Shporer, A., Verma, N., Wu, E., & Strang, G. (2022). A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences, 119, e2123433119.
Ertmer, P. A. (1999). Addressing first- and second- order barriers to change: Strategies for technology integration.
Educational Technology Research and Development, 47, 47– 61.
Ferguson, R., Hoel, T., Scheffel, M., & Drachsler, H. (2016). Guest editorial: Ethics and privacy in learning analyt-
ics. Journal of Learning Analytics, 3, 5– 15.
Fonseca, S. C., Pereira, F. D., Oliveira, E. H., Oliveira, D. B., Carvalho, L. S., & Cristea, A. I. (2020). Automatic subject-based contextualisation of programming assignment lists. International Educational Data Mining Society.
Gašević, D., Dawson, S., Rogers, T., & Gasevic, D. (2016). Learning analytics should not promote one size fits all:
The effects of instructional conditions in predicting academic success. The Internet and Higher Education,
28, 68– 84.
Gašević, D., Greiff, S., & Shaffer, D. W. (2022). Towards strengthening links between learning analytics and assessment: Challenges and potentials of a promising new bond. Computers in Human Behavior, 134, 107304. https://www.sciencedirect.com/science/article/pii/S0747563222001261
Geller, S. A., Gal, K., Segal, A., Sripathi, K., Kim, H. G., Facciotti, M. T., Igo, M., Hoernle, N., & Karger, D. (2021). New methods for confusion detection in course forums: Student, teacher, and machine. IEEE Transactions on Learning Technologies, 14, 665–679.
Ghosh, D., Klebanov, B. B., & Song, Y. (2020). An exploratory study of argumentative writing by young students: A transformer-based approach. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 145–150). Association for Computational Linguistics.
Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi, L., Taylor, R. A., & Chartash, D. (2023). How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Medical Education, 9, e45312.
Holmes, W., & Porayska- Pomsta, K. (2022). The ethics of artificial intelligence in education: Practices, chal-
lenges, and debates. Taylor & Francis.
Jayaraman, J., & Black, J. (2022). Effectiveness of an intelligent question answering system for teaching financial literacy: A pilot study. In Innovations in Learning and Technology for the Workplace and Higher Education: Proceedings of ‘The Learning Ideas Conference’ 2021 (pp. 133–140). Springer.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., … Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.
Khosravi, H., Shum, S. B., Chen, G., Conati, C., Tsai, Y.-S., Kay, J., Knight, S., Martinez-Maldonado, R., Sadiq, S., & Gašević, D. (2022). Explainable artificial intelligence in education. Computers and Education: Artificial Intelligence, 3, 100074.
Kumar, N., Mali, R., Ratnam, A., Kurpad, V., & Magapu, H. (2022). Identification and addressal of knowledge gaps in students. In 2022 3rd International Conference for Emerging Technology (INCET) (pp. 1–6). IEEE.
Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30, 121–204.
Leiker, D., Finnigan, S., Gyllen, A. R., & Cukurova, M. (2023). Prototyping the use of large language models (LLMs) for adult learning content creation at scale. In AIED Workshops, in press. CEUR-WS.org
Li, C., & Xing, W. (2021). Natural language generation using deep learning to support mooc learners. International
Journal of Artificial Intelligence in Education, 31, 186– 214.
Li, Y., Sha, L., Yan, L., Lin, J., Raković, M., Galbraith, K., Lyons, K., Gašević, D., & Chen, G. (2023). Can large language models write reflectively. Computers and Education: Artificial Intelligence, 4, 100140.
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. arXiv preprint arXiv:2304.02819.
Liu, S., Liu, S., Liu, Z., Peng, X., & Yang, Z. (2022). Automated detection of emotional and cognitive engagement
in mooc discussions to predict learning achievement. Computers & Education, 181, 104461.
Liu, Z., He, X., Liu, L., Liu, T., & Zhai, X. (2023). Context matters: A strategy to pre-train language model for science education. arXiv preprint arXiv:2301.12031.
Ma, Q., Wu, S., & Koedinger, K. (2023) Is llm the better programming partner? AIED Workshops, in press.
CEUR- WS.org
Maheen, F., Asif, M., Ahmad, H., Ahmad, S., Alturise, F., Asiry, O., & Ghadi, Y. Y. (2022). Automatic computer
science domain multiple- choice questions generation based on informative sentences. PeerJ Computer
Science, 8, e1010.
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does
“failure to replicate” really mean? American Psychologist, 70, 487– 498.
Merine, R., & Purkayastha, S. (2022). Risks and benefits of AI-generated text summarization for expert level content in graduate health informatics. In 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI) (pp. 567–574). IEEE.
Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen, T. H., Sainz, O., Agirre, E., Heinz, I., & Roth, D. (2021).
Recent advances in natural language processing via large pre- trained language models: A survey. arXiv
preprint arXiv:2111.01243.
Mittelstadt, B. (2019). Principles alone cannot guarantee ethical AI. Nature Machine Intelligence, 1, 501– 507.
Moore, S., Nguyen, H. A., Bier, N., Domadia, T., & Stamper, J. (2022). Assessing the quality of student- generated
short answer questions using GPT- 3. In Educating for a New Future: Making Sense of Technology- Enhanced
Learning Adoption: 17th European Conference on Technology Enhanced Learning, EC- TEL 2022, Toulouse,
France, September 12– 16, 2022, Proceedings (pp. 243– 257). Springer.
Munn, Z., Peters, M. D., Stern, C., Tufanaru, C., McArthur, A., & Aromataris, E. (2018). Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Medical Research Methodology, 18, 1–7.
Nguyen, T. T., Le, A. D., Hoang, H. T., & Nguyen, T. (2021). Neu- chatbot: Chatbot for admission of national eco-
nomics university. Computers and Education: Artificial Intelligence, 2, 100036.
Nye, B., Mee, D., & Core, M. G. (2023) Generative large language models for dialog- based tutoring: An early
consideration of opportunities and concerns. In AIED Workshops, in press. CEUR- WS.org
Oleny, A. (2023). Generating multiple choice questions from a textbook: LLMs match human performance on most metrics. In AIED Workshops, in press. CEUR-WS.org
OpenAI. (2023). Introducing chatGPT. https://openai.com/blog/chatgpt
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., … Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. International Journal of Surgery, 88, 105906.
Pardo, A., & Siemens, G. (2014). Ethical and privacy principles for learning analytics. British Journal of Educational Technology, 45, 438–450.
Pugh, S. L., Subburaj, S. K., Rao, A. R., Stewart, A. E., Andrews- Todd, J., & D'Mello, S. K. (2021). Say what?
Automatic modeling of collaborative problem solving skills from student speech in the wild. International
Educational Data Mining Society.
Ramesh, D., & Sanampudi, S. K. (2022). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55, 2495–2527.
Rudolph, J., Tan, S., & Tan, S. (2023). ChatGPT: Bullshit spewer or the end of traditional assessments in higher
education? Journal of Applied Learning and Teaching, 6, 342– 363.
Sallam, M. (2023). The utility of ChatGPT as an example of large language models in healthcare education, research and practice: Systematic review on the future perspectives and potential limitations. medRxiv, 2023–02.
Sarsa, S., Denny, P., Hellas, A., & Leinonen, J. (2022). Automatic generation of programming exercises and code
explanations using large language models. In Proceedings of the 2022 ACM Conference on International
Computing Education Research- Volume 1 (pp. 27– 43). Association for Computing Machinery.
Sawatzki, J., Schlippe, T., & Benner- Wickner, M. (2022). Deep learning techniques for automatic short answer
grading: Predicting scores for English and German answers. In Artificial Intelligence in Education: Emerging
Technologies, Models and Applications: Proceedings of 2021 2nd International Conference on Artificial
Intelligence in Education Technology (pp. 65– 75). Springer.
Schneider, J., Richner, R., & Riser, M. (2022). Towards trustworthy autograding of short, multi- lingual, multi- type
answers. International Journal of Artificial Intelligence in Education, 33, 88– 118.
Schramowski, P., Turan, C., Andersen, N., Rothkopf, C. A., & Kersting, K. (2022). Large pre- trained language
models contain human- like biases of what is right and wrong to do. Nature Machine Intelligence, 4, 258– 268.
Selwyn, N. (2019). What's the problem with learning analytics? Journal of Learning Analytics, 6, 11–19.
Sha, L., Li, Y., Gasevic, D., & Chen, G. (2022). Bigger data or fairer data? Augmenting BERT via active sampling for educational text classification. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 1275–1285). International Committee on Computational Linguistics.
Sha, L., Raković, M., Das, A., Gašević, D., & Chen, G. (2022). Leveraging class balancing techniques to alle-
viate algorithmic bias for predictive tasks in education. IEEE Transactions on Learning Technologies, 15,
481– 492.
Sha, L., Raković, M., Lin, J., Guan, Q., Whitelock-Wainwright, A., Gašević, D., & Chen, G. (2022). Is the latest the greatest? A comparative study of automatic approaches for classifying educational forum posts. IEEE Transactions on Learning Technologies, 16, 339–352.
Sha, L., Rakovic, M., Whitelock-Wainwright, A., Carroll, D., Yew, V. M., Gasevic, D., & Chen, G. (2021). Assessing algorithmic fairness in automatic classifiers of educational forum posts. In Artificial Intelligence in Education: 22nd International Conference, AIED 2021, Utrecht, The Netherlands, June 14–18, 2021, Proceedings, Part I 22 (pp. 381–394). Springer.
Shang, J., Huang, J., Zeng, S., Zhang, J., & Wang, H. (2022). Representation and extraction of physics knowledge based on knowledge graph and embedding-combined text classification for cooperative learning. In 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD) (pp. 1053–1058). IEEE.
Sharma, A., Kabra, A., & Kapoor, R. (2021). Feature enhanced capsule networks for robust automatic essay scoring. In Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part V 21 (pp. 365–380). Springer.
Song, W., Hou, X., Li, S., Chen, C., Gao, D., Sun, Y., Hou, J., & Hao, A. (2022). An intelligent virtual standard pa-
tient for medical students training based on oral knowledge graph. IEEE Transactions on Multimedia, 1– 14.
Sridhar, P., Doyle, A., Agarwal, A., Bogart, C., Savelka, J., & Sakr, M. (2023). Harnessing LLMs in curricular design: Using GPT-4 to support authoring of learning objectives. In AIED Workshops, in press. CEUR-WS.org
Su, Y., & Zhang, Y. (2020). Automatic construction of subject knowledge graph based on educational big data.
In Proceedings of the 2020 The 3rd International Conference on Big Data and Education (pp. 30– 36).
Association for Computing Machinery.
Truong, T.-L., Le, H.-L., & Le-Dang, T.-P. (2020). Sentiment analysis implementing BERT-based pre-trained language model for Vietnamese. In 2020 7th NAFOSTED Conference on Information and Computer Science (NICS) (pp. 362–367). IEEE.
Tsai, Y.-S., & Gasevic, D. (2017). Learning analytics in higher education – Challenges and policies: A review of eight learning analytics policies. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference (pp. 233–242). Association for Computing Machinery.
Tsai, Y.-S., Whitelock-Wainwright, A., & Gašević, D. (2020). The privacy paradox and its implications for learning analytics. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge (pp. 230–239). Association for Computing Machinery.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017).
Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998– 6008.
Wang, D., Churchill, E., Maes, P., Fan, X., Shneiderman, B., Shi, Y., & Wang, Q. (2020). From human- human col-
laboration to human- ai collaboration: Designing AI systems that can work together with people. In Extended
Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1– 6). Association for
Computing Machinery.
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., … Gabriel, I. (2021). Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.
Wollny, S., Schneider, J., Di Mitri, D., Weidlich, J., Rittberger, M., & Drachsler, H. (2021). Are we there yet?— A
systematic literature review on chatbots in education. Frontiers in Artificial Intelligence, 4, 654924.
Wu, J. (2022). Analysis and evaluation of the impact of integrating mental health education into the teaching of university civics courses in the context of artificial intelligence. Wireless Communications and Mobile Computing, 2022, 1–11.
Wu, X., He, X., Li, T., Liu, N., & Zhai, X. (2023). Matching exemplar as next sentence prediction (mensp): Zero- shot
prompt learning for automatic scoring in science education. arXiv preprint arXiv:2301.08771.
Yan, L., Zhao, L., Gasevic, D., & Martinez-Maldonado, R. (2022). Scalability, sustainability, and ethicality of multimodal learning analytics. In LAK22: 12th International Learning Analytics and Knowledge Conference (pp. 13–23). Association for Computing Machinery.
Yang, S. J., Ogata, H., Matsui, T., & Chen, N.- S. (2021). Human- centered artificial intelligence in education:
Seeing the invisible through the visible. Computers and Education: Artificial Intelligence, 2, 100008.
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education – Where are the educators? International Journal of Educational Technology in Higher Education, 16, 1–27.
Zeng, Z., Gašević, D., & Chen, G. (2023). On the effectiveness of curriculum learning in educational text scoring. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12), 14602–14610. https://doi.org/10.1609/aaai.v37i12.26707
Zheng, L., Niu, J., Long, M., & Fan, Y. (2023). An automatic knowledge graph construction approach to promoting collaborative knowledge building, group performance, social interaction and socially shared regulation in CSCL. British Journal of Educational Technology, 54, 686–711.
Zheng, L., Niu, J., & Zhong, L. (2022). Effects of a learning analytics - based real- time feedback approach on
knowledge elaboration, knowledge convergence, interactive relationships and group performance in CSCL.
British Journal of Educational Technology, 53, 130– 149.
SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section
at the end of this article.
How to cite this article: Yan, L., Sha, L., Zhao, L., Li, Y., Martinez- Maldonado, R.,
Chen, G., Li, X., Jin, Y., & Gašević, D. (2024). Practical and ethical challenges of large
language models in education: A systematic scoping review. British Journal of
Educational Technology, 55, 90–112. ht tp s: //do i . o r g /10 .1111 / b j et.13 370
... While all of these factors are important, designing a GenAI chatbot that adopts a learner-centered approach and aligns with the course curriculum may be particularly relevant for promoting transparency. A recent systematic review by Yan et al. (2023) found that most GenAI tools (92%) currently employed to support learning are comprehensible only to AI experts. Educators, students, and other key stakeholders often lack the necessary insight into how these tools function. ...
... Finally, there was no difference in self-efficacy between the three conditions in either study. The findings suggest that developing learner-centered GenAI chatbots can help address the transparency and usability gaps noted in prior research (Yan et al., 2023). Increased enjoyment is critical for sustained use and long-term adoption of AI in education. ...
Article
Full-text available
Generative artificial intelligence (GenAI) has emerged as a transformative tool in education, offering scalable individualized learning. However, there is a lack of theoretically informed and methodologically rigorous research on how GenAI can effectively augment learning. This manuscript addresses this gap by investigating the potential of a theory-informed GenAI chatbot, ChatTutor—designed for students—to facilitate generative sense-making by leveraging principles from generative learning theory. The study aimed to (1) extend theory by proposing how GenAI can support generative sense-making and (2) test its effects on conceptual knowledge, self-efficacy, and trust post-use and on conceptual knowledge, enjoyment, and behavioral intentions in a delayed follow-up. We conducted two between-subjects design experiments. Study 1 was a pre-registered experiment with 175 university students in an authentic cognitive psychology course, comparing ChatTutor with a generic GenAI system (ChatGPT) and a teaching-as-usual control. Study 2 replicated the design with 234 high school students, comparing ChatTutor to ChatGPT, and a re-study control. Results showed that ChatTutor significantly enhanced trust and enjoyment in both studies, and behavioral intentions in Study 1, compared to ChatGPT. It also improved conceptual knowledge over ChatGPT immediately in both studies and at follow-up in Study 2. ChatTutor outperformed the control condition in conceptual knowledge, although this difference was only significant in the follow-up of Study 1 and the immediate test of Study 2. Differences in self-efficacy were not significant. The research underscores the importance of integrating human-centered design and educational psychology theories into GenAI applications and offers directions for future research and practice.
... First, as summarised in Appendix B, the majority of prior reviews synthesised empirical research published before 2020. While these reviews provided foundational knowledge, they largely predate the surge of recent advancements in DL techniques and the emergence of large language models (LLMs; Yan et al. 2024). ...
Article
Full-text available
Background The growth of online education has provided flexibility and access to a wide range of courses. However, the self‐paced and often isolated nature of these courses has been associated with increased dropout and failure rates. Researchers employed machine learning approaches to identify at‐risk students, but multiple issues have not been addressed concerning the definition of at‐risk students, as well as the strengths and limitations of different machine learning models to predict at‐risk students. Objectives This systematic review aims to provide a comprehensive overview of the past 10‐year research focusing on applying machine learning techniques for predicting at‐risk students (i.e., failure, dropouts) in online learning environments. Methods Studies were extracted from the ACM Digital Library, IEEE Xplore Digital Library, Web of Science, ERIC, ProQuest, and EBSCO. A total of 161 studies published from 2014 to 2024 were included in the review. Results and Conclusions Findings revealed (1) four primary at‐risk definitions outlined in the reviewed studies, each focusing on specific stages of student engagement and performance in a course; (2) most studies relied on student behavioural engagement and academic factors as at‐risk predictors; (3) the adoption of deep learning and ensemble deep learning networks has significantly increased in the past 5 years, often outperforming classical machine learning models. While studies in which classical machine learning excelled often relied on the ensemble methodology and smaller sample sizes; (4) current machine learning practice evaluated by a list of criteria showed concerns regarding reproducibility, generalisability, and interpretability.
... However, in higher education, personalized feedback delivery is often absent or delayed due to the large and diverse student population [6]. Recently, advances in Natural Language Processing (NLP) utilizing novel deep learning approaches and state-of-the-art language models (e.g., BERT, RoBERTa, GPT, Llama, etc.) are a promising direction in automating feedback provision [15,29]. Previous research has employed zero-and few-shot generative AI techniques to generate formative feedback [5,10,22,26]. ...
Preprint
Full-text available
The scarcity of high-quality labeled data often hinders the effective use of automated formative feedback in education. While analytic rubrics offer a reliable framework for automated grading, training robust models still requires hundreds of expert-labeled responses, an expensive and time-consuming process. This paper proposes a methodology for generating diverse, rubric-aligned synthetic student responses using large language models (LLMs) guided by knowledge profiles and representative examples. We introduce two profile-based generation strategies, straightforward and error-informed, and evaluate them compared to a dataset (N = 585) of authentic open-ended logical-proof responses from a Discrete Mathematics course. We analyze the diversity and realism of the generated datasets using embedding-based distance metrics and PCA and assess their utility for training automated grading models. Our results show that synthetic responses are less diverse than authentic ones, and models trained solely on generated data perform worse than those trained on real data. However, combining small authentic datasets with generated data significantly improves model performance, suggesting it is an effective augmentation strategy in low-resource educational settings.
... Beyond general concerns about accuracy [4,5], the following three crucial issues have emerged. (1) Systemic limitations, including algorithmic biases that may misinterpret the writing practices of non-native speakers, thereby disadvantaging EFL learners; and reliance on standardized algorithms that produce generic feedback, often ignoring individual stylistic or contextual needs [6][7][8][9]. (2) Pedagogical risks, as relying too heavily on AI can erode students' *Corresponding author: Qianshan Chen, School of Foreign Language Studies, Hangzhou Dianzi University, China. Email: qianshan@hdu.edu.cn ...
Article
Full-text available
Collaboration between humans and artificial intelligence (AI) has the power to transform education, but research has yet to fully address how students engage cognitively with AI-powered feedback. Although studies suggest that AI can improve writing, few explore how students perceive and interact with AI tools. To fill this gap, the present study investigated Chinese university students' perceptions and experiences of using AI-powered English writing feedback tools: automated writing evaluation, generative AI, and corpora. Two hundred and ten student reflective journals were subjected to qualitative thematic analysis using NVivo software, along with analysis of classroom observations and students' writing. The students evaluated AI-powered feedback tools in three dimensions: content quality, delivery method, and overall effectiveness. They felt that these tools provided better grammar correction, instant feedback delivery, and an enhanced user experience, but challenges included vague explanations, limited emotional connection, and risks of overreliance. Based on these insights, this study introduces the Student-Teacher-AI Collaboration Model for feedback writing, which is designed to enhance collaboration between students, teachers, and AI in foreign language education. The findings have practical implications for the integration of AI tools into writing instruction and will inform policymaking in the rapidly evolving educational field. Received: 6 March 2025 | Revised: 6 May 2025 | Accepted: 30 May 2025 Conflicts of Interest The author declares that she has no conflicts of interest to this work. Data Availability Statement The data that support this work are available upon reasonable request to the corresponding author. Author Contribution Statement Qianshan Chen: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Project administration, Funding acquisition.
... Recent advancements in generative AI (GenAI) (Chiu, 2024), particularly the development of Large Language Models (LLMs) like ChatGPT (Malfatti, 2025), have dramatically lowered the barrier for individuals, regardless of their technological expertise, to access and utilize the sophisticated capabilities of AI for generating high-quality, context-aware content (Meyer et al., 2023). These advancements have propelled generative AI to the forefront of educational discourse, capturing both enthusiasm and skepticism, with educators, policymakers, and researchers expressing growing apprehension over the impact of LLMs on students' learning habits (Deng et al., 2024;Yan et al., 2024;Jeon and Lee, 2023). While these tools have the potential to enhance productivity and streamline learning processes, the concern remains that students may begin to depend too heavily on AI-driven assistance (Ouyang and Lo et al., 2024). ...
Article
Full-text available
AI integration in professional coursework has gained attention in higher education, offering potential improvements in learning and performance. However; the usage rate, accuracy, and patterns of AI tools across disciplines remain unclear. The impact of AI on students’ learning and long-term outcomes still needs further investigation. This study investigates the role of AI-driven learning tools in enhancing professional coursework across various academic disciplines, including Engineering, Natural Sciences, Humanities, Social Sciences, and Arts. Using a multi-stage empirical approach, we evaluate student performance and engagement with AI tools, exploring both the positive outcomes and limitations in each discipline. Our findings show that 1,472 students across different regions and academic fields frequently use AI-driven learning tools in their professional coursework, with generally positive feedback. The accuracy of AI-generated responses to 1,200 questions is commendable, but the tools still demonstrate limitations in areas such as flexibility, contextual adaptation, analytical reasoning, and conceptual synthesis in certain disciplines. We segment 800 representative students based on AI engagement patterns, revealing distinct usage behaviors and preferences across four groups, offering further insights into how AI tools are utilized. Furthermore, the results from assessing the impact of AI-driven learning on the academic performance of 400 students show that AI significantly enhances performance in technical fields like Engineering and Natural Sciences. However, instructors in these fields express concerns about students’ over-reliance on AI, which may hinder independent thinking and creative problem-solving. Despite these concerns, our long-term academic outcome predictions indicate that AI-driven learning tools continue to provide substantial benefits to students.
... While the accessibility and immediacy of such AI-assisted feedback are clear advantages over potentially delayed traditional teacher feedback [37], its specific impact on the development of English as a Foreign Language (EFL) writing proficiency requires more nuanced investigation. Much existing research either examines broader AI instructional systems [4,51] or explores student perceptions [23], often lacking direct empirical comparison between the effects of these readily available AI feedback tools and traditional teacher feedback methods within authentic classroom settings. Furthermore, there is a specific gap in understanding how interaction with AI feedback focused predominantly on form and style influences not only the final writing product's quality (across different dimensions like content and organization, not just language use) but also the students' writing process, particularly their revision behaviors [39]. ...
Article
Full-text available
This study investigated the impact of generative AI-assisted writing feedback, specifically using Grammarly, on English as a Foreign Language (EFL) learners' writing proficiency, revision practices, and writing quality. Sixty postgraduate EFL students were randomly assigned to either an experimental group using Grammarly as an AI-enhanced feedback system or a control group receiving traditional instruction. Pre-and post-tests assessed overall writing proficiency, and semi-structured interviews were conducted with 25 students from the experimental group to gather qualitative data. Quantitative results indicated that the experimental group achieved significantly higher post-test scores than the control group. Additionally, positive correlations were found between the use of AI features, including generative AI, grammar, and vocabulary tools, and increased revision frequency, as well as improvements in content, organization, and cohesion. Thematic analysis of the qualitative data revealed that AI-assisted feedback promoted student engagement by reducing frustration, boosting confidence, and facilitating a greater sense of accomplishment. Participants emphasized the need to strategically integrate AI tools with traditional instruction, to foster critical thinking, and for AI tools to prioritize natural language fluency as well as to provide detailed feedback beyond basic grammar and mechanics. These results suggest that generative AI-assisted writing feedback, when used with pedagogical awareness, can effectively enhance EFL learners' writing skills and support active participation in the writing process. The study also highlights that critical thinking skills are essential and that students should avoid overreliance on AI tools.
Chapter
This chapter reflects on the impact of artificial intelligence (AI) on the educational sector by considering several implications for creating personalised learning and the infrastructure of academic institutions. Hence, this chapter aims to understand the role of artificial intelligence in helping and transforming the educational framework in a more creative environment. The use of artificial intelligence in educational fields facilitates the creation of new innovative processes that are useful for improving creativity among students and educators. In fact, by exploiting artificial intelligence, it is possible to help students develop their ideas and help educators create study plans based on innovative projects enriched by immersive experiences. This chapter, based on an overview of Artificial Intelligence in Education, aims to identify the main elements of the transition from the traditional education system to a new educational model based on immersive technologies. In particular, the chapter seeks to specify how immersive technologies that exploit artificial intelligence can improve students’ learning abilities and generate a dynamic environment for both students and educators. Finally, the chapter explores the challenges and opportunities accompanying AI integration into higher education, such as data privacy, the digital divide, and concerns about the authenticity of AI-generated content.
Chapter
AI in Society provides an interdisciplinary corpus for understanding artificial intelligence (AI) as a global phenomenon that transcends geographical and disciplinary boundaries. Edited by a consortium of experts hailing from diverse academic traditions and regions, the 11 edited and curated sections provide a holistic view of AI’s societal impact. Critically, the work goes beyond the often Eurocentric or U.S.-centric perspectives that dominate the discourse, offering nuanced analyses that encompass the implications of AI for a range of regions of the world. Taken together, the sections of this work seek to move beyond the state of the art in three specific respects. First, they venture decisively beyond existing research efforts to develop a comprehensive account and framework for the rapidly growing importance of AI in virtually all sectors of society. Going beyond a mere mapping exercise, the curated sections assess opportunities, critically discuss risks, and offer solutions to the manifold challenges AI harbors in various societal contexts, from individual labor to global business, law and governance, and interpersonal relationships. Second, the work tackles specific societal and regulatory challenges triggered by the advent of AI and, more specifically, large generative AI models and foundation models, such as ChatGPT or GPT-4, which have so far received limited attention in the literature, particularly in monographs or edited volumes. Third, the novelty of the project is underscored by its decidedly interdisciplinary perspective: each section, whether covering Conflict; Culture, Art, and Knowledge Work; Relationships; or Personhood—among others—will draw on various strands of knowledge and research, crossing disciplinary boundaries and uniting perspectives most appropriate for the context at hand.
Book
Full-text available
Background: Medical education is integrating technology, including artificial intelligence, but ethical concerns remain. Objective: This study aims to enhance the role of ethics in medicine, focusing on the integration of human intelligence, particularly social and emotional intelligence, for ethical decision-making in medical education. Methods: The study uses a scoping review design to examine and explore key approaches to ethical considerations in using technology for medical education. Results: The data synthesis process identifies themes concerning behavior changes, technology acceptance, digital distraction, and AI mobile learning. Technology integration in medical education has led to significant advancements in assessment, learning, and professional development. Conclusions: Using Technology for Medical Education has enhanced learning outcomes, provided creative teaching methods, and increased resource access. The digitalization of medical education has led to the development of clinical skills and increased access to resources.
Article
Full-text available
GPT detectors frequently misclassify non-native English writing as AI-generated, raising concerns about fairness and robustness. Addressing the biases in these detectors is crucial to prevent the marginalization of non-native English speakers in evaluative and educational settings and to create a more equitable digital landscape.
Article
Full-text available
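As a rough illustration of the disparity analysis described in the abstract above, the sketch below (not from the study) compares a detector's false-positive rate on human-written texts from native and non-native English writers; `detect_ai_probability` is a hypothetical stand-in for whatever detector is being audited.

```python
# Hypothetical fairness check for an AI-text detector: compare how often
# human-written texts from two writer groups are wrongly flagged as AI-generated.
from typing import Callable, List


def false_positive_rate(texts: List[str],
                        detect_ai_probability: Callable[[str], float],
                        threshold: float = 0.5) -> float:
    """Share of human-written texts wrongly flagged as AI-generated."""
    flagged = sum(1 for t in texts if detect_ai_probability(t) >= threshold)
    return flagged / len(texts)


def audit_detector(native_texts, non_native_texts, detect_ai_probability):
    fpr_native = false_positive_rate(native_texts, detect_ai_probability)
    fpr_non_native = false_positive_rate(non_native_texts, detect_ai_probability)
    print(f"FPR (native writers):     {fpr_native:.2%}")
    print(f"FPR (non-native writers): {fpr_non_native:.2%}")
    print(f"Disparity:                {fpr_non_native - fpr_native:+.2%}")
```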
Automatic Text Scoring (ATS) is a widely investigated task in education. Existing approaches have often stressed the architectural design of an ATS model while neglecting its training process. Considering the difficulty of this task, we argued that the performance of an ATS model could be boosted by carefully selecting data of varying complexities during training. Therefore, we investigated the effectiveness of curriculum learning (CL) in scoring educational text. Specifically, we designed two types of difficulty measurers: (i) pre-defined, calculated by measuring a sample's readability, length, number of grammatical errors, or number of unique words; and (ii) automatic, calculated based on whether a model in a training epoch can accurately score the samples. These measurers were tested in both the easy-to-hard and hard-to-easy training paradigms. Through extensive evaluations on two widely used datasets (one for short-answer scoring and the other for long-essay scoring), we demonstrated that (a) CL could indeed boost the performance of state-of-the-art ATS models, with a maximum improvement of up to 4.5%, though most improvements were achieved when assessing short and easy answers; and (b) the pre-defined measurer based on the number of grammatical errors in a text sample tended to outperform the other difficulty measurers across training paradigms.
Chapter
Full-text available
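The following Python sketch illustrates the general idea of a pre-defined difficulty measurer and an easy-to-hard curriculum described in the abstract above; it is a simplified, assumption-laden example rather than the authors' implementation, and the length and unique-word measurers shown are only two of the signals the abstract mentions.

```python
# Sketch of curriculum learning for automatic text scoring: order training
# samples by a pre-defined difficulty measurer and grow the pool each stage.


def difficulty_by_length(text: str) -> float:
    # Longer responses are treated as harder.
    return len(text.split())


def difficulty_by_unique_words(text: str) -> float:
    # Higher lexical diversity is treated as harder.
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)


def easy_to_hard(samples, measurer=difficulty_by_length):
    """Order (text, score) training pairs from easiest to hardest."""
    return sorted(samples, key=lambda pair: measurer(pair[0]))


train = [("Short answer.", 2.0),
         ("A much longer and more elaborate essay response ...", 4.5)]

for fraction in (0.25, 0.5, 1.0):            # easy-to-hard schedule
    ordered = easy_to_hard(train)
    subset = ordered[: max(1, int(len(ordered) * fraction))]
    # model.fit(subset)  # placeholder for one training stage of the ATS model
```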
Developing natural language processing (NLP) models to automatically score students' written responses to science problems is critical for science education. However, collecting sufficient student responses and labeling them for training or fine-tuning NLP models is time-consuming and costly. Recent studies suggest that large-scale pre-trained language models (PLMs) can be adapted to downstream tasks without fine-tuning by using prompts. However, no research has employed such a prompt-based approach in science education. Since students' written responses are expressed in natural language, framing the scoring procedure as a next-sentence-prediction task using prompts can skip the costly fine-tuning stage. In this study, we developed a zero-shot approach to automatically score student responses via Matching exemplars as Next Sentence Prediction (MeNSP). This approach employs no training samples. We first applied MeNSP to score three assessment tasks of scientific argumentation and found machine-human scoring agreements with Cohen's Kappa ranging from 0.30 to 0.57 and F1 scores ranging from 0.54 to 0.81. To improve scoring performance, we extended our research to the few-shot setting, either randomly selecting labeled student responses at each grading level or manually constructing responses to fine-tune the models. We found that one task's performance improved with more samples (Cohen's Kappa from 0.30 to 0.38, F1 score from 0.54 to 0.59); for the two other tasks, scoring performance did not improve. We also found that randomly selected few-shot samples performed better than the human-expert-crafted ones. This study suggests that MeNSP can yield referable automatic scoring for student-written responses while significantly reducing the cost of model training, and the method can benefit low-stakes classroom assessment practices in science education. Future research should further explore the applicability of MeNSP to different types of assessment tasks in science education and further improve model performance. Our code is available at https://github.com/JacksonWuxs/MeNSP. Keywords: Prompt Learning; Pre-trained Language Model; Written Response; Automatic Scoring; Natural Language Processing
Article
Full-text available
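A minimal sketch of the matching-exemplars idea is shown below, assuming an off-the-shelf bert-base-uncased next-sentence-prediction head and hypothetical exemplar responses per score level; the authors' actual implementation is in the repository linked in the abstract.

```python
# Sketch: score a student response by the exemplar it most plausibly "continues",
# using BERT's next-sentence-prediction (NSP) head with no fine-tuning.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

exemplars = {  # hypothetical exemplar responses, one per grading level
    0: "The claim is not supported by any evidence.",
    1: "The claim is supported by one piece of evidence without reasoning.",
    2: "The claim is supported by evidence and linked to it through reasoning.",
}


def mensp_score(student_response: str) -> int:
    best_score, best_prob = 0, -1.0
    for score, exemplar in exemplars.items():
        inputs = tokenizer(exemplar, student_response, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        prob_is_next = torch.softmax(logits, dim=-1)[0, 0].item()  # label 0 = "is next"
        if prob_is_next > best_prob:
            best_score, best_prob = score, prob_is_next
    return best_score
```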
Generative large language models (LLMs) demonstrate impressive results in different writing tasks and have already attracted much attention from researchers and practitioners. However, there is limited research investigating the capability of generative LLMs for reflective writing. To this end, in the present study, we extensively reviewed the existing literature and selected nine representative prompting strategies for ChatGPT, the chatbot based on state-of-the-art generative LLMs, to generate a diverse set of reflective responses, which were combined with student-written reflections. These responses were then evaluated by experienced teaching staff following a theory-aligned assessment rubric designed to evaluate student-generated reflections in several university-level pharmacy courses. Furthermore, we explored the extent to which deep learning classification methods can automatically differentiate between reflective responses written by students and those generated by ChatGPT. To this end, we harnessed BERT, a state-of-the-art deep learning classifier, and compared its performance to that of human evaluators and the AI content detector by OpenAI. Following extensive experimentation, we found that (i) ChatGPT may be capable of generating high-quality reflective responses for writing assignments administered across different pharmacy courses; (ii) the quality of the automatically generated reflective responses was higher than that of student-written reflections on all six assessment criteria; and (iii) a domain-specific BERT-based classifier could effectively differentiate between student-written and ChatGPT-generated reflections, greatly surpassing (up to 38% higher across four accuracy metrics) the classification performed by experienced teaching staff and a general-domain classifier, even when the testing prompts were not known at the time of model training.
Preprint
Full-text available
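For readers unfamiliar with such classifiers, the snippet below sketches how a binary BERT-based classifier of the kind described in the abstract above could be set up with the Hugging Face transformers library; the checkpoint, toy labels, and single training step are illustrative assumptions rather than the study's configuration.

```python
# Compact sketch of a binary BERT classifier: student-written (0) vs.
# ChatGPT-generated (1) reflections. Data loading, hyperparameter tuning,
# and evaluation are omitted.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["My key takeaway from this placement was ...",
         "Reflecting on this experience, I realised that ..."]
labels = torch.tensor([0, 1])  # toy labels for illustration only

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
```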
An artificial intelligence (AI)-based conversational large language model (LLM), ChatGPT, was launched in November 2022. Despite the wide array of potential applications of LLMs in healthcare education, research and practice, several valid concerns have been raised. The current systematic review aimed to investigate the possible utility of ChatGPT and to highlight its limitations in healthcare education, research and practice. Following the PRISMA guidelines, a systematic search was conducted to retrieve English-language records in PubMed/MEDLINE and Google Scholar under the term "ChatGPT". Eligibility criteria included published research or preprints of any type that discussed ChatGPT in the context of healthcare education, research and practice. A total of 280 records were identified and, following full screening, 60 records were eligible for inclusion. Benefits/applications of ChatGPT were cited in 51/60 (85.0%) records, the most common being its utility in scientific writing, followed by benefits in healthcare research (efficient analysis of massive datasets, code generation and rapid, concise literature reviews, as well as utility in drug discovery and development). Benefits in healthcare practice included cost saving, documentation, personalized medicine and improved health literacy. Concerns/possible risks of ChatGPT use were expressed in 58/60 (96.7%) records, the most common being ethical issues, including the risk of bias, plagiarism, copyright issues, transparency issues, legal issues, lack of originality, incorrect responses, limited knowledge and inaccurate citations. Despite the promising applications of ChatGPT, which could result in paradigm shifts in healthcare education, research and practice, this application should be embraced with extreme caution. Specific applications of ChatGPT in health education include its promising utility in personalized learning tools and a shift towards more focus on critical thinking and problem-based learning. In healthcare practice, ChatGPT can be valuable for streamlining workflows and refining personalized medicine. In scientific research, the benefits include saving time to focus on experimental design and enhancing research equity and versatility.
Article
Full-text available
Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and performance relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P
Article
Full-text available
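The evaluation described in the abstract above amounts to posing each multiple-choice item to the model and computing accuracy. The sketch below shows one possible API-based version; the study queried ChatGPT interactively, so the OpenAI client, model name, and answer-parsing heuristic here are assumptions for illustration only.

```python
# Sketch of an MCQ evaluation loop: prompt a chat model with each item and
# score the first letter of its reply against the keyed answer.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

questions = [  # hypothetical items in (stem, options, correct_letter) form
    ("Which vitamin deficiency causes scurvy?",
     {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"},
     "C"),
]

correct = 0
for stem, options, answer in questions:
    option_text = "\n".join(f"{k}. {v}" for k, v in options.items())
    prompt = f"{stem}\n{option_text}\nAnswer with a single letter."
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    if reply and reply[0].upper() == answer:
        correct += 1

print(f"Accuracy: {correct / len(questions):.1%}")
```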
Large language models represent a significant advancement in the field of AI. The underlying technology is key to further innovations and, despite critical views and even bans within some communities and regions, large language models are here to stay. This commentary presents the potential benefits and challenges of educational applications of large language models from student and teacher perspectives. We briefly discuss the current state of large language models and their applications. We then highlight how these models can be used to create educational content, improve student engagement and interaction, and personalize learning experiences. With regard to challenges, we argue that large language models in education require teachers and learners to develop the sets of competencies and literacies necessary to understand both the technology and its limitations, as well as the unexpected brittleness of such systems. In addition, a clear strategy within educational systems and a clear pedagogical approach, with a strong focus on critical thinking and strategies for fact checking, are required to integrate large language models into learning settings and teaching curricula and to take full advantage of them. Other challenges, such as potential bias in the output, the need for continuous human oversight, and the potential for misuse, are not unique to the application of AI in education. We believe that, if handled sensibly, these challenges can offer insights and opportunities in education scenarios for acquainting students early on with the potential societal biases, critical issues, and risks of AI applications. We conclude with recommendations for how to address these challenges and ensure that such models are used in a responsible and ethical manner in education.
Article
Full-text available