Br J Educ Technol. 2024;55:90–112. wileyonlinelibrary.com/journal/bjet
Received: 9 March 2023 | Accepted: 22 July 2023
DOI: 10.1111/bjet.13370
REVIEW
Practical and ethical challenges of large language models in education: A systematic scoping review
Lixiang Yan | Lele Sha | Linxuan Zhao | Yuheng Li | Roberto Martinez-Maldonado | Guanliang Chen | Xinyu Li | Yueqiao Jin | Dragan Gašević
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
© 2023 The Authors. British Journal of Educational Technology published by John Wiley & Sons Ltd on behalf of British Educational Research Association.
Centre for Learning Analytics at Monash, Faculty of Information Technology, Monash University, Clayton, Victoria, Australia

Correspondence
Lixiang Yan, Centre for Learning Analytics at Monash, Faculty of Information Technology, Monash University, 20 Exhibition Walk, Clayton, VIC 3800, Australia.
Email: lixiang.yan@monash.edu

Funding information
Australian Research Council, Grant/Award Number: DP210100060 and DP220101209; Jacobs Foundation; Defense Advanced Research Projects Agency, Grant/Award Number: HR0011-22-2-0047
Abstract
Educational technology innovations leveraging large language models (LLMs) have shown the potential to automate the laborious process of generating and analysing textual content. While various innovations have been developed to automate a range of educational tasks (eg, question generation, feedback provision, and essay grading), there are concerns regarding the practicality and ethicality of these innovations. Such concerns may hinder future research and the adoption of LLMs-based innovations in authentic educational contexts. To address this, we conducted a systematic scoping review of 118 peer-reviewed papers published since 2017 to pinpoint the current state of research on using LLMs to automate and support educational tasks. The findings revealed 53 use cases for LLMs in automating education tasks, categorised into nine main categories: profiling/labelling, detection, grading, teaching support, prediction, knowledge representation, feedback, content generation, and recommendation. We also identified several practical and ethical challenges, including low technological readiness, lack of replicability and transparency, and insufficient privacy and beneficence considerations. The findings were summarised into three recommendations for future studies: updating existing innovations with state-of-the-art models (eg, GPT-3/4), embracing the initiative of open-sourcing models/systems, and adopting a human-centred approach throughout the developmental process. As the intersection of AI and education is continuously evolving, the findings of this study can serve as an essential reference point for researchers, allowing them to leverage the strengths, learn from the limitations, and uncover potential research opportunities enabled by ChatGPT and other generative AI models.
KEYWORDS
artificial intelligence, BERT, ChatGPT, education, GPT-3, large language models, pre-trained language models, systematic scoping review
Practitioner notes
What is currently known about this topic
• Generating and analysing text-based content are time-consuming and laborious tasks.
• Large language models are capable of efficiently analysing an unprecedented amount of textual content and completing complex natural language processing and generation tasks.
• Large language models have been increasingly used to develop educational technologies that aim to automate the generation and analysis of textual content, such as automated question generation and essay scoring.
What this paper adds
• A comprehensive list of different educational tasks that could potentially benefit from LLMs-based innovations through automation.
• A structured assessment of the practicality and ethicality of existing LLMs-based innovations from seven important aspects using established frameworks.
• Three recommendations that could potentially support future studies to develop LLMs-based innovations that are practical and ethical to implement in authentic educational contexts.
Implications for practice and/or policy
• Updating existing innovations with state-of-the-art models may further reduce the amount of manual effort required for adapting existing models to different educational tasks.
• The reporting standards of empirical research that aims to develop educational technologies using large language models need to be improved.
• Adopting a human-centred approach throughout the developmental process could contribute to resolving the practical and ethical challenges of large language models in education.
INTRODUCTION
Advancements in generative artificial intelligence (AI) and large language models (LLMs) have fuelled the development of many educational technology innovations that aim to automate the often time-consuming and laborious tasks of generating and analysing textual content (eg, generating open-ended questions and analysing student feedback surveys) (Kasneci et al., 2023; Leiker et al., 2023; Wollny et al., 2021). LLMs are generative artificial intelligence models that have been trained on an extensive amount of text data and are capable of generating human-like text content based on natural language inputs. Specifically, these LLMs, such as Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and Generative Pre-trained Transformer (GPT) (Brown et al., 2020), utilise deep learning and self-attention mechanisms (Vaswani et al., 2017) to selectively attend to different parts of the input text, depending on the focus of the current task, allowing the model to learn complex patterns and relationships among textual contents, such as their semantic, contextual, and syntactic relationships (Liu et al., 2023; Min et al., 2021). As several LLMs (eg, GPT-3 and Codex) have been pre-trained on massive amounts of data across multiple disciplines, they are capable of completing natural language processing tasks with little (few-shot learning) or no additional training (zero-shot learning) (Brown et al., 2020; Wu et al., 2023). This could lower the technological barriers to LLMs-based innovations, as researchers and practitioners can develop new educational technologies by fine-tuning LLMs on specific educational tasks without starting from scratch (Caines et al., 2023; Sridhar et al., 2023). The recent release of ChatGPT, an LLMs-based generative AI chatbot that requires only natural language prompts without additional model training or fine-tuning (OpenAI, 2023), has further lowered the barrier for individuals without a technological background to leverage the generative powers of LLMs.
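The scaled dot-product self-attention described above can be illustrated with a minimal, self-contained sketch. This toy example is ours, not drawn from the reviewed studies: real LLMs apply learned linear projections to obtain queries, keys and values, use many attention heads, and operate on embeddings with hundreds of dimensions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention for a single head.

    queries/keys/values: one vector (list of floats) per token.
    Returns one attended output vector per token.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # How strongly this token attends to every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy two-dimensional "token embeddings"; in a Transformer, Q, K and
# V would be separate learned projections of these rather than the raw vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Because the attention weights form a convex combination, each output vector stays within the range of the value vectors; the pre-trained models surveyed in this review stack many such layers rather than computing attention in isolation.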
Although educational research that leverages LLMs to develop technological innovations for automating educational tasks is yet to achieve its full potential (ie, most works have focused on improving model performance (Kurdi et al., 2020; Ramesh & Sanampudi, 2022)), a growing body of literature hints at how different stakeholders could potentially benefit from such innovations. Specifically, these innovations could play a vital role in addressing teachers' high levels of stress and burnout by reducing their heavy workloads through the automation of routine, time-consuming tasks (Carroll et al., 2022) such as question generation (Bulut & Yildirim-Erbasli, 2022; Kurdi et al., 2020; Olney, 2023), feedback provision (Cavalcanti et al., 2021; Nye et al., 2023), and the scoring of essays (Ramesh & Sanampudi, 2022) and short answers (Zeng et al., 2023). These innovations could also benefit both students and institutions by improving the efficiency of often tedious administrative processes, such as learning resource recommendation, course recommendation and student feedback evaluation (Sridhar et al., 2023; Wollny et al., 2021; Zawacki-Richter et al., 2019). Despite the growing empirical evidence of LLMs' potential in automating a wide range of educational tasks, no existing work has systematically reviewed the practical and ethical challenges of these LLMs-based innovations. Understanding these challenges is essential for developing responsible technologies, as LLMs-based innovations (eg, ChatGPT) could contain human-like biases based on the existing ethical and moral norms of society, such as inheriting biased and toxic knowledge (eg, gender and racial biases) when trained on unfiltered internet text data (Schramowski et al., 2022). Prior systematic reviews have focused on investigating these issues in one specific application scenario of LLMs-based innovations (eg, question generation, essay scoring, chatbots or automated feedback) (Cavalcanti et al., 2021; Kurdi et al., 2020; Ramesh & Sanampudi, 2022; Wollny et al., 2021). The practical and ethical challenges of LLMs in automating different types of educational tasks remain unclear. Understanding these challenges is essential for translating research findings into educational technologies that stakeholders (eg, students, teachers, and institutions) can use in authentic teaching and learning practices (Adams et al., 2021).
The current study is the first systematic scoping review that aims to address this gap by reviewing the current state of research on using LLMs to automate educational tasks and identifying the practical and ethical challenges of adopting these LLMs-based innovations in authentic educational contexts. A total of 118 peer-reviewed publications from four prominent databases were included in this review following the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) (Page et al., 2021) protocol. An inductive thematic analysis was conducted to extract details regarding the different types of educational tasks, stakeholders, LLMs, and machine learning tasks investigated in prior literature. The practicality of LLMs-based innovations was assessed through the lens of technological readiness, model performance, and model replicability. Lastly, the ethicality of these innovations was assessed by investigating system transparency, privacy, equality and beneficence. The contribution of this paper to the educational technology community is threefold: (1) we systematically summarise a comprehensive list of 53 different educational tasks that could potentially benefit from LLMs-based innovations through automation, (2) we present a structured assessment of the practicality and ethicality of existing LLMs-based innovations based on seven important aspects using established frameworks (eg, the transparency index (Chaudhry et al., 2022)), and (3) we propose three recommendations that could support future studies in developing LLMs-based innovations that are practical and ethical to implement in authentic educational contexts. As the intersection of LLMs and education is continuously evolving, the findings of this systematic scoping review can serve as an essential reference point for researchers, allowing them to leverage the strengths, learn from the limitations, and uncover potential opportunities of novel LLMs in supporting educational research and practice. Specifically, emerging works should carefully consider the practical and ethical challenges identified in this study while exploring the research opportunities enabled by ChatGPT and other generative AI models.
BACKGROUND
In this section, we first establish the definitions of the key terminologies, specifically the definitions of practicality and ethicality in the context of educational technology. We then provide an overview of prior systematic reviews on LLMs in education. Finally, we present the research questions based on the gaps identified in the existing literature.
Practicality
Several theoretical frameworks have been proposed regarding the practicality of integrating technological innovations in educational settings. For example, Ertmer's (1999) first- and second-order barriers to change focused on the external conditions of the educational system (eg, infrastructure readiness) and teachers' internal states (eg, personal beliefs). Becker (2000) further suggested that for technological innovations to have actual benefits in supporting pedagogical practices, these innovations should be convenient to access, support constructivist pedagogical beliefs, be adaptable to changes in the curriculum, and be compatible with teachers' level of knowledge and skills. These factors were also present in an earlier framework, the practicality index (Doyle & Ponder, 1977), which summarised three critical components for integrating educational technologies: the degree of adoption feasibility, the cost-benefit ratio and the alignment with existing practices and beliefs. Based on these prior theoretical frameworks, and considering the recentness of LLMs-based innovations (which only emerged in the past five years), the practical challenges of LLMs-based innovations in automating educational tasks can be assessed from three primary perspectives. First, evaluating the technological readiness of these innovations is essential for determining whether there is empirical evidence to support successful integration and operation in authentic educational contexts. Second, assessing the model performance could contribute valuable insights into the costs and benefits of adopting these innovations, such as comparing the benefits of automation with the costs of inaccurate predictions. Finally, understanding whether these innovations are methodologically replicable could be important for future studies to investigate their alignment with different educational contexts and stakeholders. We elaborate on the evaluation items for each challenge in Section “Data analysis”.
Ethicality
Ethical AI is a prevalent topic of discussion in multiple communities, such as the learning analytics, AI in education, educational data mining, and educational technology communities (Adams et al., 2021; Pardo & Siemens, 2014). There are ongoing debates regarding AI ethics in education, with a mixture of focuses on algorithmic and human ethics among the educational data mining and AI in education communities (Holmes & Porayska-Pomsta, 2022). As such debates continue, it is difficult to identify an established definition of ethical AI from these fields. However, ethicality has already been thoroughly investigated and addressed in a field closely related to AI in education, namely, the field of learning analytics (Pardo & Siemens, 2014; Selwyn, 2019). Drawing on the established definition of ethicality from the field of learning analytics (Pardo & Siemens, 2014), the ethicality of LLMs-based innovations can thus be defined as the systematisation of appropriate and inappropriate functionalities and outcomes of these innovations, as determined by all stakeholders (eg, students, teachers, parents and institutions). For example, Khosravi et al. (2022) explained that the ethicality of AI-powered educational technology systems needs to involve the consideration of accountability, explainability, fairness, interpretability and safety of these systems. These different domains of ethical AI are all closely related and can be addressed by considering system transparency. Transparency is a subset of ethical AI that involves making all information, decisions, decision-making processes, and assumptions available to stakeholders, which in turn enhances their comprehension of the AI systems and related outputs (Chaudhry et al., 2022). Additionally, for LLMs-based innovations, Weidinger et al. (2021) suggested six types of ethical risks: (1) discrimination, exclusion, and toxicity, (2) information hazards, (3) misinformation harms, (4) malicious uses, (5) human-computer interaction harms and (6) automation, access and environmental harms. These risks can be further aggregated into three fundamental ethical issues: privacy concerns regarding educational stakeholders' personal data, equality concerns regarding accessibility for stakeholders with different backgrounds, and beneficence concerns about the potential harms and negative impacts that LLMs-based innovations may have on stakeholders (Ferguson et al., 2016). These three fundamental ethical issues were considered in the analysis of the reviewed literature. Further details are available in Section “Data analysis”.
Related work
Prior systematic reviews have focused primarily on reviewing a specific application scenario (eg, question generation, automated feedback, chatbots and essay scoring) of natural language processing and LLMs. For example, Kurdi et al. (2020) systematically reviewed empirical studies that aimed to tackle the problem of automatic question generation in educational domains. They comprehensively summarised the different generation methods, generation tasks, and evaluation methods presented in prior literature. In particular, LLMs could potentially benefit the semantic-based approaches for generating meaningful questions that are closely related to the source contents. Likewise, Cavalcanti et al. (2021) systematically reviewed different automated feedback systems regarding their impacts on improving students' learning performance and reducing teachers' workloads. Although half of their reviewed studies showed no evidence of reducing teachers' workloads, as these automated feedback systems were mostly rule-based and required extensive manual effort, they identified that using natural language generation techniques could further enhance such systems' generalisability and potentially reduce manual workloads. On the other hand, Wollny et al. (2021) systematically reviewed areas of education where chatbots have already been applied. They concluded that there is still much to be done for chatbots to achieve their full potential, such as making them more adaptable to different educational contexts. A systematic review has also investigated the various automated essay scoring systems (Ramesh & Sanampudi, 2022). The findings revealed multiple limitations of the existing systems based on traditional machine learning (eg, regression and random forest) and deep learning algorithms (eg, LSTM and BERT). In sum, these previous systematic reviews have identified room for improvement that could potentially be addressed using state-of-the-art LLMs (eg, GPT-3 or Codex). However, none of the prior systematic reviews has investigated the practical and ethical issues related to LLMs-based innovations in education in general, rather than within a specific application scenario (eg, a single task).
The recent hype around one of the latest LLMs-based innovations, ChatGPT, has intensified the discussion about the practical and ethical challenges related to using LLMs in education. For example, in a position paper, Kasneci et al. (2023) provided an overview of some existing LLMs research and proposed several practical opportunities and challenges of LLMs from students' and teachers' perspectives. Likewise, Rudolph et al. (2023) provided an overview of the potential impacts, challenges, and opportunities that ChatGPT might have on future educational practices. Although these studies have not systematically reviewed the existing educational literature on LLMs, their arguments resonated with some of the pressing issues around LLMs and ethical AI, such as data privacy, bias, and risks. On the other hand, Sallam (2023) systematically reviewed the implications and limitations of ChatGPT in healthcare education and identified potential utility around personalisation and automation. However, it is worth noting that most papers reviewed in Sallam's study were either editorials, commentaries, or preprints. This lack of peer-reviewed empirical studies on ChatGPT is understandable, as it was only released in late 2022 (OpenAI, 2023). None of the existing work has systematically reviewed the peer-reviewed literature on prior LLMs-based innovations. Such investigations could provide more reliable and empirically based evidence regarding the potential opportunities and challenges of LLMs in educational practices. Thus, the current study aimed to address this gap in the literature by conducting a systematic scoping review of prior educational research on LLMs. Specifically, the following research questions guided this review:
• RQ1: What is the current state of research on using LLMs to automate educational tasks, specifically through the lens of educational tasks, stakeholders, LLMs and machine-learning tasks?
• RQ2: What are the practical challenges of LLMs in automating educational tasks, specifically through the lens of technological readiness, model performance, and model replicability?
• RQ3: What are the ethical challenges of LLMs in automating educational tasks, specifically through the lens of system transparency, privacy, equality and beneficence?
METHODS
A systematic scoping review was conducted in this study, as this method has been frequently used in emerging and rapidly evolving research areas to scope a body of literature and identify the key concepts, methods, evidence, and challenges (Munn et al., 2018). Consistent with this method, the quality of the included studies was not assessed, as the aim is to provide a broader picture of an emerging field.
Review procedures
We followed the PRISMA (Page et al., 2021) protocol to conduct the current systematic
scoping review of LLMs. We searched four reputable bibliographic databases, including
Scopus, ACM Digital Library, IEEE Xplore and Web of Science, to find high- quality peer-
reviewed publications. Additional searches were conducted through Google Scholar and
Education Resources Information Center (ERIC) to identify peer- reviewed publications that
have yet to be indexed by these databases, either recently published or not indexed (eg,
Journal of Educational Data Mining; prior to 2020). Our initial search query for the title,
abstract, and keywords included terms such as “large language model”, “pre*trained lan-
guage model”, “GPT- *”, “BERT”, “education”, “student*” and “teacher*”. A publication year
constraint was also applied to restrict the search to studies published since 2017, specifi-
cally from 01/01/2017 to 12/31/2022, as the foundational architecture (Transformer) of LLMs
was formally released in 2017 (Vaswani et al., 2017). Only peer- reviewed publications were
considered to enhance the scientific credibility of this review. The initial database search
was conducted by two researchers independently. Any discrepancies between the search
results were resolved through further discussion or consulting the librarian for guidance.
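The boolean structure of such a search can be sketched as follows. This is our illustrative reconstruction of how the listed terms combine, not the authors' exact search string, which would have been adapted to each database's query syntax (eg, Scopus' TITLE-ABS-KEY wrapper).

```python
# Illustrative reconstruction of the boolean search query built from the
# terms listed above; the exact string submitted to each database differed.
model_terms = ['"large language model"', '"pre*trained language model"',
               '"GPT-*"', '"BERT"']
education_terms = ['"education"', '"student*"', '"teacher*"']

# Terms within each group are OR-ed; the two groups are AND-ed, so every
# hit must mention at least one model term and one education term.
query = "({}) AND ({})".format(" OR ".join(model_terms),
                               " OR ".join(education_terms))
```

Combining synonym groups with OR and concept groups with AND is the standard structure for systematic-review search strings; the wildcard terms (eg, "student*") rely on each database's truncation operator.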
Two researchers independently reviewed the titles and abstracts of eligible articles based on five predetermined inclusion and exclusion criteria. First, we included studies that used large or pre-trained language models directly or built on top of such models, and excluded studies that used general machine-learning or deep-learning models with unspecified usage of LLMs. Second, we included empirical studies with detailed methodologies, such as a detailed description of the LLMs and research procedures, and excluded review, opinion and scoping works. Third, we only included full-length peer-reviewed papers and excluded short, workshop, and poster papers that were less than six and eight pages for double- and single-column layouts, respectively. Additionally, we included studies that used LLMs for the purpose of automating educational tasks (eg, essay grading and question generation), and excluded studies that merely used LLMs as part of the analysis without educational implications. Finally, we only included studies that were published in English (both the abstract and the main text) and excluded studies that were published in other languages. Any conflicting decisions were resolved through further discussion between the two researchers or consulting with a third researcher to achieve a consensus.
The database search initially yielded 854 publications, with 191 duplicates removed, resulting in 663 publications for the title and abstract screening (see Figure 1). After the title and abstract screening, 197 articles were included for the full-text review, with an interrater reliability (Cohen's kappa) of 0.75, indicating substantial agreement between the reviewers during the title and abstract screening. A total of 118 articles were selected for data extraction after the full-text review, with an interrater reliability (Cohen's kappa) of 0.73, indicating substantial agreement between the reviewers during the full-text review. Out of the initial 197 articles, 79 were excluded for various reasons, including not being a full paper (n = 41), lack of educational automation (n = 17), lack of pre-trained or LLMs (n = 12), merely using pre-trained or LLMs as part of the analysis (n = 3), non-English paper (n = 2) and non-empirical paper (n = 2).
FIGURE 1 Systematic scoping review process following the PRISMA protocol.
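The interrater agreement statistic reported above, Cohen's kappa, corrects raw percentage agreement for the agreement expected by chance. A minimal sketch of the computation follows; the include/exclude decisions shown are hypothetical, not the actual screening data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions on the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two reviewers' decisions on ten hypothetical abstracts (one disagreement).
a = ["inc", "inc", "exc", "exc", "inc", "exc", "inc", "exc", "exc", "inc"]
b = ["inc", "inc", "exc", "exc", "inc", "exc", "exc", "exc", "exc", "inc"]
kappa = cohens_kappa(a, b)
```

Here the raw agreement is 0.9, but because chance agreement is 0.5 the kappa is 0.8, which sits at the top of the "substantial agreement" band referenced in the text.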
Data analysis
For the first research question (RQ1), we conducted an inductive thematic analysis to extract information regarding the current state of research on using LLMs to automate educational tasks. Specifically, we extracted four primary types of contextual information from each included paper: educational tasks, stakeholders, LLMs and machine-learning tasks. This contextual information would provide a holistic view of the existing research and inform researchers and practitioners regarding the viable directions to explore with the state-of-the-art LLMs (eg, GPT-3.5 and Codex). A total of seven data extraction items were developed to address the second and third research questions. These items were developed as they are directly related to the definitions of practicality (RQ2: Items 1–3) and ethicality (RQ3: Items 4–7), as defined in the background section. The following list elaborates on the final set of items along with the corresponding guiding questions. For the thematic analysis and the seven items, two researchers independently coded 20 random samples of the included studies. Any conflicts were resolved through further discussion or consulting a third researcher. After reaching a Cohen's kappa of more than 0.80 (indicating almost perfect agreement), each researcher coded half of the remaining 98 studies (49 studies each) and cross-checked each other's work. The database of the studies included in this review and the extracted data for each item are available in the supplementary document.
1. Technology readiness: What levels of technology readiness are the LLMs-based innovations at? We adopted the assessment tool from the Australian government, namely the Australian Department of Defence's Technology Readiness Levels (TRL) (Defence Science and Technology Group, 2021), which has been used to assess the maturity of educational technologies in a prior systematic review (Yan et al., 2022). There are nine different technological readiness levels: Basic Research (TRL-1), Applied Research (TRL-2), Critical Function or Proof of Concept Established (TRL-3), Lab Testing/Validation of Alpha Prototype Component/Process (TRL-4), Laboratory Testing of Integrated/Semi-Integrated System (TRL-5), Prototype System Verified (TRL-6), Integrated Pilot System Demonstrated (TRL-7), System Incorporated in Commercial Design (TRL-8), and System Proven and Ready for Full Commercial Deployment (TRL-9), further explained in the Results section.
2. Performance: How accurately and reliably can the LLMs-based innovations complete the designated educational tasks? For example, what are the model performance scores for classification (eg, AUC and F1 scores), generation (eg, BLEU score), and prediction tasks (eg, RMSE and Pearson's correlation)?
3. Replicability: Can other researchers or practitioners replicate the LLMs-based innovations without additional support from the original authors? This item evaluates whether the paper provided sufficient details about the LLMs (eg, open-sourced algorithms) and the dataset (eg, open-source data).
4. Transparency: What tiers of the transparency index (Chaudhry et al., 2022) are the LLMs-based innovations at? The transparency index proposed three tiers of transparency: transparent to AI researchers and practitioners (Tier 1), transparent to educational technology experts and enthusiasts (Tier 2), and transparent to educators and parents (Tier 3). The tier of transparency increases as educational stakeholders become fully involved in developing and evaluating the AI system. These tiers are further elaborated on in the Results section.
5. Privacy: Has the paper mentioned or considered privacy issues of their innovations? This item explores potential issues related to informed consent, transparent data collection, individuals' control over personal data, and unintended surveillance (Ferguson et al., 2016; Tsai et al., 2020).
6. Equality: Has the paper mentioned or considered equal access to their innovations? This item explores potential issues related to limited access for students from low-income backgrounds or rural areas and the linguistic limitations of the innovations, such as their capability to analyse different languages (Ferguson et al., 2016).
7. Beneficence: Has the paper mentioned or considered potential issues that violate the ethical principle of beneficence? Such violations may include the risks associated with labelling and profiling students, inadequate usage of machine-generated content for assessments, and algorithmic biases (Ferguson et al., 2016; Zawacki-Richter et al., 2019).
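Two of the performance metrics named in Item 2 can be sketched in a few lines. The labels and predictions below are hypothetical and serve only to illustrate what the scores extracted from the reviewed papers measure.

```python
import math

def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def rmse(y_true, y_pred):
    """Root mean squared error for a prediction (regression) task."""
    return math.sqrt(sum((t - p) ** 2
                         for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical classifier output (eg, detecting confusion in forum posts)...
labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds  = [1, 0, 0, 1, 0, 1, 1, 0]
f1 = f1_score(labels, preds)
# ...and a hypothetical grade-prediction output on a 0-100 scale.
err = rmse([72.0, 85.0, 90.0], [70.0, 88.0, 87.0])
```

F1 summarises classification quality on the positive class, whereas RMSE expresses prediction error in the units of the target variable, which is why the two are extracted separately for classification and prediction studies.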
RESULTS
The current state—RQ1
We identified nine different categories of educational tasks that prior studies have attempted to automate using LLMs (as shown in Table 1). Prior studies have used LLMs to automate the profiling and labelling of 17 types of education-related content and concepts (eg, forum posts, student sentiment and discipline similarity), the detection of six latent constructs (eg, confusion and urgency), the grading of five types of assessments (eg, short answer questions and essays), the development of five types of teaching support (eg, conversation agents and intelligent question-answering), the prediction of five types of student-orientated metrics (eg, dropout and engagement), the construction of four types of knowledge representations (eg, knowledge graphs and entity recognition), the provision of four different forms of feedback (eg, real-time and post-hoc feedback), the generation of four types of content (eg, MCQs and open-ended questions) and the delivery of three types of recommendations (eg, resource and course). Of the 118 reviewed studies, 85 studies aimed to automate educational tasks related to teachers (eg, question grading and generation), 54 studies targeted student-related activities (eg, feedback and resource recommendation), 20 studies focused on supporting institutional practices (eg, course recommendations and discipline planning), and 14 studies empowered researchers with automated methods to investigate latent constructs (eg, student confusion) and capture verbal data (eg, speech recognition).
We identified five categories of LLMs used in prior studies to automate educational tasks. BERT and its variants (eg, RoBERTa, DistilBERT, multilingual BERT, LaBSE, EstBERT, and Sentence-BERT) were the most prevalent, used in 109 reviewed studies, although they often required manual effort for fine-tuning (n = 90). GPT-2 and GPT-3 were used in five and three studies, respectively. Specifically, GPT-2 and GPT-3 performed better than BERT-based models in content generation and evaluation tasks, such as generating university math problems (Drori et al., 2022) and evaluating the quality of student-generated short answer questions (Moore et al., 2022). OpenAI's Codex was used in two prior studies, specifically for code generation tasks. T5 was also used in two prior studies for classification and generation purposes. In terms of machine-learning tasks, 74 studies used LLMs to perform classification tasks. Generation and prediction tasks were investigated in 24 and 23 prior studies, respectively. In sum, LLMs-based innovations have already been used to automate a range of educational tasks, but most of these innovations were built on older models, such as BERT and GPT-2. Although state-of-the-art models, such as GPT-3, were introduced over two years ago (Brown et al., 2020), they have yet to be widely applied to automate educational tasks. A potential reason for this lack of adoption could be these models' commercial and closed-source nature, which increases the financial burden of developing and operating educational technology innovations on top of them.
Practical challenges— RQ2
Technology readiness
According to the Technology Readiness Level scale (Defence Science and Technology Group, 2021), LLMs-based innovations are still in the early development and testing stage. Over three-quarters of the LLMs studies (n = 89) are in the applied research stage (TRL-2), which aims to experiment with the capability of LLMs in automating different educational tasks by developing different models and combining LLMs with other machine-learning and deep-learning techniques (eg, RCNN (Shang et al., 2022)). Thirteen studies have established a proof of concept and demonstrated the feasibility of using LLMs-based innovations to automate certain processes of educational tasks (TRL-3). Nine
studies have developed functional prototypes and conducted preliminary validation under controlled laboratory settings (TRL-4), often involving stakeholders (eg, students and teachers) to test and evaluate the output of their innovations. Only seven studies have taken a further step and conducted validation studies in authentic learning environments, with most functional components integrated into the educational tasks (TRL-5), such as an intelligent virtual standard patient for medical student training (Song et al., 2022) and an intelligent chatbot for university admission (Nguyen et al., 2021). Yet, none of the existing LLMs-based innovations has been verified through successful operations (TRL-6). Together, these findings suggest that although existing LLMs-based innovations can be used to automate certain educational tasks, they have yet to show evidence of improvements to teaching, learning and administrative processes in authentic educational practices.

TABLE 1 Educational tasks in LLMs research.

Profiling and labelling: Forum post classification, dialogue act classification, classification of learning designs, review sentiment analysis, topic modelling, pedagogical classification of MOOCs, collaborative problem-solving modelling, paraphrase quality, speech tagging, labelling educational content with knowledge components, key sentence and keyword extraction, reflective writing analysis, multimodal representational thinking, discipline similarity, concept classification, cognitive level classification, essay argument segmentation.

Detection: Semantic analyses, detecting off-task messages, confusion detection, urgency detection, conversational intent detection, teachers' behaviour detection.

Assessment and grading: Formative and summative assessment grading, short answer grading, essay grading, subjective question grading, student self-explanation.

Teaching support: Classroom teaching, learning community support, online learning conversation agents, intelligent question-answering, teacher activity recognition.

Prediction: Student performance prediction, student dropout prediction, emotional and cognitive engagement detection, growth and development indicators for college students, at-risk student identification.

Knowledge representation: Knowledge graph construction, knowledge entity recognition, knowledge tracing, cause-effect relation extraction.

Feedback: Real-time feedback, post-hoc feedback, aggregated feedback, feedback on feedback (peer-review comments).

Content generation: MCQ generation, open-ended question generation, code generation, reply (natural language) generation.

Recommendation: English reference selection and recommendation, resource recommendation, course recommendation.
Performance
The performance of LLMs-based innovations varies across different machine-learning and educational tasks. For classification tasks, LLMs-based innovations have shown high performance on simple educational tasks, such as modelling the topics from a list of programming assignments (best F1 = 0.95) (Fonseca et al., 2020), analysing the sentiment of student feedback (best F1 = 0.94) (Truong et al., 2020), constructing a subject knowledge graph from teaching materials (best F1 = 0.94) (Su & Zhang, 2020) and classifying educational forum posts (best F1 = 0.92) (Sha, Raković, Lin, et al., 2022). However, the classification performance of LLMs-based innovations decreases for other educational tasks. For example, the F1 scores for detecting student confusion in course forums (Geller et al., 2021) and students' off-task messages in game-based collaborative learning (Carpenter et al., 2020) are around 0.77 and 0.67, respectively. Likewise, the F1 score for classifying short-answer responses varies between 0.61 and 0.82, with lower performance on out-of-sample questions (best F1 = 0.61) (Condor et al., 2021). Similar performance was also observed in classifying students' argumentative essays (best F1 = 0.66) (Ghosh et al., 2020).
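For readers less familiar with the metric reported throughout these studies, F1 is the harmonic mean of precision and recall. A minimal, self-contained illustration (the counts are invented for illustration, not drawn from any reviewed study):

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1: the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Eg, 90 correct detections, 10 false alarms and 10 misses give
# precision = recall = 0.9, hence F1 = 0.9.
score = f1_score(90, 10, 10)
```

Because F1 balances both error types, a model can only score highly (eg, the 0.92-0.95 range above) by keeping false alarms and misses simultaneously low.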
For prediction tasks, LLMs-based innovations have demonstrated reliable performance compared with ground truth or human raters. For example, LLMs-based innovations have achieved high quadratic weighted kappa (QWK) scores in essay scoring, specifically for off-topic (QWK = 0.80), gibberish (QWK = 0.80), and paraphrased answers (QWK = 0.94), indicating substantial to almost perfect agreement with human raters (Doewes & Pechenizkiy, 2021). Similar performance on essay scoring has been observed in several other studies (eg, QWK = 0.80 in Beseiso et al. (2021) and QWK = 0.81 in Sharma et al. (2021)). Likewise, LLMs-based innovations' performance on automatic short-answer grading was also highly correlated with human ratings (Pearson's correlation between 0.75 and 0.82) (Ahmed et al., 2022; Sawatzki et al., 2022).
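QWK, the agreement statistic cited above, penalises disagreements by the squared distance between the two raters' scores, so a grade of 2 against a human's 5 costs far more than a grade of 4. A minimal pure-Python sketch of the statistic (the rating values are illustrative only):

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa (QWK) between two lists of integer ratings."""
    n = max_rating - min_rating + 1
    total = len(rater_a)
    # Observed co-rating matrix: how often rater A gave i while rater B gave j
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    # Marginal histograms, used to build the chance-expected matrix
    hist_a = Counter(a - min_rating for a in rater_a)
    hist_b = Counter(b - min_rating for b in rater_b)
    numerator = denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator

# Perfect agreement yields QWK = 1.0; chance-level agreement gives a value near 0.
kappa = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
```

A QWK of 0.80, as reported for essay scoring above, therefore means the model's weighted disagreement with human raters is only 20% of what chance alone would produce.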
Regarding generation tasks, LLMs-based innovations demonstrated high performance across different educational tasks. For example, LLMs-based innovations have achieved an F1 score of 0.92 for generating MCQs with single-word answers (Kumar et al., 2022). Educational technologies developed by fine-tuning Codex also demonstrated the capability of resolving 81% of advanced mathematics problems (Drori et al., 2022). Text summaries generated using BERT showed no significant differences from student-generated summaries and could not be differentiated by graduate students (Merine & Purkayastha, 2022). Similarly, BERT-generated doctor-patient dialogues were also found to be indistinguishable from actual doctor-patient dialogues, and can be used to create virtual standard patients for medical students' diagnostic training (Song et al., 2022). Additionally, for introductory programming courses, the state-of-the-art LLM Codex could generate sensible and novel exercises for students along with an appropriate sample solution (around three out of four times) and accurate code explanations (67% accuracy) (Sarsa et al., 2022).
In sum, although the classification performance of LLMs-based innovations on complex educational tasks is far from suitable for practical adoption, LLMs-based innovations have already shown high performance on several relatively simple classification tasks and could potentially be deployed to automatically generate meaningful insights for teachers and institutions, such as navigating numerous pieces of student feedback and course reviews. Likewise, LLMs-based innovations' prediction and generation performance reveals a promising future of potentially automating the generation of educational content and the initial grading of student assessments. However, ethical issues must be considered for such implementations, which we cover in the findings for RQ3.
Replicability
Most reviewed studies (n = 107) have not disclosed sufficient details about their methodologies for other researchers and practitioners to replicate their proposed LLMs-based innovations. Among these studies, 12 studies have open-sourced the original code for developing the innovations but failed to open-source the data they used. In contrast, 20 studies have open-sourced the data they used but failed to release the actual code. Around two-thirds of the reviewed studies (n = 75) failed to release both the original code and the data they used, leaving only 11 studies that other researchers and practitioners can replicate without needing to contact the original authors. This lack of replicability could become a vital barrier to adoption, as 87 out of the 107 non-replicable studies required fine-tuning the LLMs to achieve the reported performance. This replication issue also limits others from further evaluating the generalisability of the proposed LLMs-based innovations on other datasets, constraining their potential practical utility.
Ethical challenges— RQ3
Transparency
Based on the transparency index and the three tiers of transparency (Chaudhry et al., 2022), most of the reviewed studies reached at most Tier 1 (n = 109), which is merely considered transparent to AI researchers and practitioners. Although these studies reported details regarding their machine learning models (eg, optimisation and hyperparameters), such information is unlikely to be interpretable and considered transparent for individuals without a strong background in machine learning. The remaining nine studies reached at most Tier 2, as they often involved some form of human-in-the-loop elements. Specifically, making the LLMs innovations available for student evaluation was found in three studies (Merine & Purkayastha, 2022; Nguyen et al., 2021; Song et al., 2022). Such evaluations often involved students differentiating AI-generated from human-generated content (Merine & Purkayastha, 2022; Song et al., 2022) and assessing student satisfaction with AI-generated responses (Nguyen et al., 2021). Likewise, two studies involved experts in evaluating specific features of the content generated by the LLMs-based innovations, such as informativeness (Maheen et al., 2022) and cognitive level (Moore et al., 2022). Surveys have been used to evaluate students' experience with LLMs-based innovations from multiple perspectives, such as the quality and difficulty of AI-generated questions (Drori et al., 2022; Li & Xing, 2021) and the potential learning benefits of the systems (Jayaraman & Black, 2022). Finally, semi-structured interviews have been conducted to understand students' perception of an LLM system after using it in authentic computer-supported collaborative learning activities (Zheng et al., 2022). Although these nine studies had some elements of human-in-the-loop, stakeholders were often involved in a post-hoc evaluation manner rather than throughout the development process, and thus had limited knowledge regarding the operating principles and potential weaknesses of the systems. Consequently, none of the existing LLMs-based innovations can be considered to be at Tier 3, which describes an AI system that is transparent to educational stakeholders (eg, students, teachers, and parents).
Privacy
The privacy issues related to LLMs-based innovations were rarely attended to or investigated in the reviewed studies. Specifically, among studies that fine-tuned LLMs with textual data collected from students, none explicitly explained their consenting strategies (eg, whether students acknowledged the collection and intended usage of their data) or data protection measures (eg, data anonymisation and sanitisation). This lack of attention to privacy issues is particularly concerning as LLMs-based innovations work with stakeholders' natural language, which may contain personal and sensitive information regarding their private lives and identities (Brown et al., 2022). Stakeholders might not be aware that their textual data (eg, forum posts or conversations) on digital platforms (eg, MOOCs and LMSs) are being used in LLMs-based innovations for different purposes of automation (eg, automated replies and training chatbots), as the consenting process is often embedded into enrolment or sign-up for these platforms (Tsai & Gasevic, 2017). This process can hardly be considered informed consent. Consequently, if stakeholders shared personal information on these platforms in natural language (eg, sharing phone numbers and addresses with group members via digital forums), such information could be used as training data for fine-tuning LLMs. This usage could potentially expose private information, as LLMs are incapable of understanding the context and sensitivity of text and thus could return stakeholders' personal information based on semantic relationships (Brown et al., 2022).
Equality
Although most of the studies (n = 95) used LLMs that only apply to English content, we also identified application scenarios of LLMs in automating educational tasks in 12 other languages. Specifically, 19 studies used LLMs that can be applied to Chinese content. Ten prior studies used LLMs for Vietnamese (n = 3), Spanish (n = 3), Italian (n = 2), and German (n = 2) content. Additionally, seven studies applied LLMs to Croatian, Indonesian, Japanese, Romanian, Russian, Swedish, and Hindi content. While the dominance of English-based innovations remains a concerning equality issue, the availability of innovations that support a variety of other languages, specifically in societies outside the western, educated, industrialised, rich and democratic (WEIRD) category (eg, Indonesia and Vietnam), may indicate a promising sign that LLMs-based innovations could have global impacts and level out such equality issues in the future. However, the financial burdens of adopting state-of-the-art models (eg, OpenAI's GPT-3 and Codex) could potentially exacerbate the equality issues, making the best-performing innovations accessible and affordable only to WEIRD societies.
Beneficence
A total of seven studies discussed potential issues related to the violation of the ethical principle of beneficence. For example, one study discussed the potential risk of adopting underperforming models, which could negatively affect students' learning experiences (Li & Xing, 2021). Such issues could be minimised by deferring decisions made by such models (Schneider et al., 2022) and labelling the AI-generated content with a warning message (eg, teachers' manual revision is mandatory before determining the actual correctness) (Angelone et al., 2022). Apart from issues with adopting inaccurate models, two studies suggested that potential bias and discrimination issues may occur when adopting a model that is accurate but unfair (Merine & Purkayastha, 2022; Sha et al., 2021). This issue is particularly concerning as most existing studies focused solely on developing an accurate model. Only nine reviewed studies released descriptive data on different sample groups, such as gender and ethnicity (eg, Pugh et al., 2021). Two studies proposed potential approaches that could address such fairness issues. Specifically, sampling strategies, such as balancing the demographic distribution, have been found to be an effective approach to improving both model fairness and accuracy (Sha, Li, Gasevic, & Chen, 2022; Sha, Raković, Das, et al., 2022). These approaches are essential for ensuring that LLMs-based innovations do not perpetuate problematic and systematic biases (eg, gender biases), especially as the best-performing LLMs are often black-boxed with little interpretability, traceability and justification of the results (Wu, 2022).
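The balancing strategy mentioned above can take many forms; one minimal, illustrative instance (not the exact method of the cited studies) is to downsample the training data so that every demographic group contributes equally before fine-tuning:

```python
import random
from collections import defaultdict

def balance_by_group(records, group_key, seed=0):
    """Downsample so every demographic group contributes the same number
    of training records. `records` is a list of dicts; `group_key` names
    the demographic attribute (eg, "gender")."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    smallest = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced
```

Downsampling discards data from the majority group, so in practice it trades some raw accuracy for fairness; the cited studies report that carefully chosen sampling can improve both.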
DISCUSSION
Main findings
The current study systematically reviewed 118 peer-reviewed empirical studies that used LLMs to automate educational tasks. For the first research question (RQ1), we illustrated the current state of educational research on LLMs. Specifically, we identified 53 types of application scenarios of LLMs in automating educational tasks, summarised into nine general categories: profiling and labelling, detection, assessment and grading, teaching support, prediction, knowledge representation, feedback, content generation and recommendation. While some of these categories resonate with the utilities proposed in prior positioning works (eg, feedback, content generation and recommendation) (Kasneci et al., 2023; Rudolph et al., 2023), novel directions, such as using LLMs to automate the creation of knowledge graphs and entities, further indicate the potential of LLMs-based innovations in supporting institutional practices (eg, creating knowledge-based search engines across multiple disciplines). These identified directions could benefit from state-of-the-art LLMs (eg, GPT-3 and Codex), as most of the reviewed studies (92%) focused on using BERT-based models, which often required manual effort for fine-tuning, whereas state-of-the-art LLMs could potentially achieve similar performance with a zero-shot approach (Bang et al., 2023). While the majority of the reviewed studies (63%) focused on using LLMs to automate classification tasks, future studies could aim to tackle the automation of prediction and generation tasks with the more capable LLMs (Sallam, 2023). Likewise, although supporting teachers is the primary focus (72%) of the existing LLMs-based innovations, students and institutions could also benefit from such innovations as novel utilities continue to emerge from the educational technology literature. Together, the findings of the first research question could inspire educational researchers to explore the potential of state-of-the-art LLMs in augmenting educational practices; specifically, the 53 identified types of application scenarios may all be worth re-exploring in light of ChatGPT and other powerful generative AI models (Kasneci et al., 2023).
Regarding the second research question (RQ2), we identified several practical challenges that need to be addressed for LLMs-based innovations to have actual educational benefits. The development and educational research on LLMs-based innovations are still in the early stages. Most of the innovations demonstrated a low level of technology readiness, having yet to be fully integrated and validated in authentic educational contexts. This finding resonates with previous systematic reviews on related educational technologies, such as reviews on automated question generation (Kurdi et al., 2020), feedback provision (Cavalcanti et al., 2021), essay scoring (Ramesh & Sanampudi, 2022), and chatbot systems (Wollny et al., 2021). There is a pressing need for in-the-wild studies that provide LLMs-based innovations directly to educational stakeholders to support actual educational tasks, instead of testing on different datasets or in laboratory settings. Such authentic studies could also validate whether the existing innovations can achieve the reported high model performance in real-life scenarios, specifically in prediction and generation tasks, instead of being limited to prior datasets. This validation process is vital for preventing inadequate usage, such as adopting a subject-specific prediction model for unintended subjects. Researchers need to carefully examine the extent of the generalisability of their innovations and inform stakeholders of the limitations (Gašević et al., 2016). However, addressing such needs could be difficult considering the current literature's poor replicability, which increases the barriers for others to adopt LLMs-based innovations in authentic educational contexts or to validate them with different samples. Similar replication issues have also been identified in other areas of educational technology research (Yan et al., 2022).
For the third research question (RQ3), we identified several ethical challenges regarding LLMs-based innovations. In particular, most of the existing LLMs-based innovations (92%) were only transparent to AI researchers and practitioners (Tier 1), with only nine studies that can be considered transparent to educational technology experts and enthusiasts (Tier 2). The primary reason behind this low transparency can be attributed to the lack of human-in-the-loop components in prior studies. This finding resonates with the call for explainable and human-centred AI, which stresses the vital role of stakeholders in developing meaningful and impactful educational technology (Khosravi et al., 2022; Yang et al., 2021). Involving stakeholders during the development and evaluation of LLMs-based innovations is essential for addressing both practical and ethical issues. For example, as the current findings revealed, LLMs-based innovations are subject to data privacy issues, but these were rarely mentioned or investigated in the literature (Merine & Purkayastha, 2022), which may be due to the little voice that stakeholders had in prior research. The several concerning issues around beneficence also demand the involvement of stakeholders, as their perspectives are vital for shaping the future directions of LLMs-based innovations, such as how responsible decisions can be made with these AI systems (Schneider et al., 2022). Likewise, the equality issue regarding the financial burdens of adopting innovations that leverage commercial LLMs (eg, GPT-3 and Codex) can also be further studied with institutional stakeholders.
Implications
The current findings have several implications for education research and practice with LLMs, which we have summarised into three recommendations that aim to support future studies in developing practical and ethical innovations that can have actual benefits for educational stakeholders. First, the wide range of application scenarios of LLMs-based innovations can further benefit from improvements in the capability of LLMs. Updating existing innovations with state-of-the-art LLMs may further reduce the amount of manual effort required for fine-tuning while achieving similar performance (Bang et al., 2023). Considering the 53 identified use cases of LLMs in education, there are multiple research trajectories that could foster the development of practical educational technologies. These avenues have the potential to address some of the pressing challenges that plague the global education system. In particular, the use cases in the teaching support, assessment and grading, feedback, and content generation categories (Table 1) could act as catalysts for the development of educational technologies that alleviate teachers' workload and mental stress by automating the laborious tasks associated with creating, evaluating, and providing feedback on student assessments (Carroll et al., 2022). Similarly, further exploration of the use cases in profiling and labelling, detection, prediction, and recommendation could lead to the development of educational technologies that deliver personalised learning support for each student across various disciplines (Wollny et al., 2021). Such improvements could enhance the overall well-being of teachers and increase students' learning opportunities, thereby contributing to the achievement of SDG 4 by 2030 (Boeren, 2019). Nonetheless, researchers should also be mindful of the potential financial and resource burdens that could be imposed on educational stakeholders when innovating with commercial LLMs (eg, GPT-3/4 and ChatGPT).
The unrivalled natural language generation capabilities exhibited by ChatGPT and other cutting-edge LLMs (eg, LLaMA and PaLM 2) might also inspire future studies to delve into a broader spectrum of research directions. These include comparisons between the quality of student-generated and ChatGPT-generated writing (Li et al., 2023) and evaluating these LLMs' capability to tackle educational assessments (Gilson et al., 2023). Such explorations would not only unveil the potential of LLMs and generative AI models in educational content generation and evaluation tasks but also expose the possible threats that these models pose to academic integrity, a pervasive issue across the education sector (Kasneci et al., 2023). Intriguingly, leveraging the use cases of LLMs in tasks such as creating knowledge representations (Zheng et al., 2023) and classifying cognitive levels (Liu et al., 2022) could potentially facilitate the transition from outcome-focused to process-focused assessments. Here, LLMs and generative AI models could be employed for learning assessments in a manner similar to learning analytics (Gašević et al., 2022). Consequently, future studies may begin to explore methods of addressing the potential threats of LLMs with LLMs-based solutions.
For LLMs-based innovations to achieve a high level of technology readiness and performance, the current reporting standards must be improved. Future studies should support the initiative of open-sourcing their models/systems when possible and provide sufficient details about the test datasets, which are essential for others to replicate and validate existing innovations across different contexts, preventing the potential pitfall of another replication crisis (Maxwell et al., 2015). This initiative is particularly vital in the era of generative AI models, as most of these models, especially the commercial ones (eg, ChatGPT and the GPT series), are proprietary. Thus, when using these LLMs to augment educational practices, such as scoring student essays (Doewes & Pechenizkiy, 2021), providing real-time feedback (Zheng et al., 2022) or generating questions for learning activities (Sarsa et al., 2022), researchers need to be systematic and transparent in reporting model usage and prompts (Wu, 2022). For example, when using the ChatGPT API for question generation at scale, researchers should at least report the exact models, prompts, and model temperature used in the process, as different models may differ in their ability to generate accurate and reliable content, and the prompts are essential for others to replicate the same or similar results (Kasneci et al., 2023).
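As a minimal illustration of such reporting, the details above can be captured in a structured record alongside each generation run. The function and field names below are our own illustrative choices, not an established standard; the model name and prompt are hypothetical examples:

```python
import json

def generation_record(model, prompt, temperature, top_p=1.0, seed=None):
    """Capture the minimal reporting details for one LLM generation run:
    the exact model identifier, the verbatim prompt and the sampling settings."""
    return {
        "model": model,              # exact model version, not just "ChatGPT"
        "prompt": prompt,            # verbatim prompt sent to the API
        "temperature": temperature,  # sampling temperature; 0 is most deterministic
        "top_p": top_p,              # nucleus sampling cutoff
        "seed": seed,                # fixed seed, where the API supports one
    }

# A hypothetical run, logged as JSON so it can be published with the study.
record = generation_record(
    model="gpt-3.5-turbo-0613",
    prompt="Generate one multiple-choice question about photosynthesis.",
    temperature=0.2,
)
log_line = json.dumps(record)
```

Publishing such records with a study would let others rerun the same prompts against the same model version and compare outputs directly.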
Apart from the aforementioned technical and methodological details, researchers and educational policymakers should also consider the potential wider impacts of LLMs-based solutions on different stakeholders. For example, in terms of detection and academic integrity, some institutions have rapidly adopted AI-detection tools that claim to have high accuracy and a low false positive rate. Yet, as disclosed in a recent report by Turnitin, a company whose AI-detection function has been applied to more than 38.5 million student submissions, the real-world performance of their solution resulted in a significantly higher occurrence of false positives compared with their laboratory findings (Chechitelli, 2023). Such negligence can be devastating for students who have been falsely accused of academic misconduct, as well as for educators who must handle the repercussions. This example reinforces the importance of conducting rigorous scientific studies with key stakeholders when adopting any LLMs-based solutions that have direct or indirect impacts on students, educators, and other stakeholders. Likewise, the reporting of such studies should also adhere to high standards, incorporating both methodological specifics and detailed data descriptions. These details are especially pertinent when considering the diverse cultural backgrounds of students and the fact that most LLMs are primarily trained on English datasets, which could potentially introduce biases against non-native English students (Liang et al., 2023).
Adopting a human-centred approach when developing and evaluating LLMs-based innovations is essential for ensuring these innovations remain ethical in practice, especially as ethical principles may not guarantee ethical AI due to their top-down manner (eg, being developed by regulatory bodies) (Mittelstadt, 2019). Future studies need to consider the ethical issues that may arise from their specific application scenarios and actively involve stakeholders to identify and address such issues. Specifically, LLM-based innovations should aim to reach at least Tier 3 in the transparency index and TRL-7 in technology readiness. This involves a fully functional system being integrated into authentic learning environments and validated by students and educators in terms of its practicality and ethical considerations. For any decision made by an LLM-based innovation, the relevant stakeholders should be informed about how the decision was reached, as well as the potential risks and biases involved. For instance, when students receive an assessment that has been automatically graded, the grades should be accompanied by a warning message indicating that they were graded by LLMs and AI (Angelone et al., 2022). Students should also have the opportunity to consult their teacher regarding any concerns.
The active involvement of stakeholders should also extend beyond the education sector to policymakers and industry companies, to establish guidelines for adopting LLMs-based innovations in learning and teaching practices, as such adoptions could have broader implications for society beyond the education sector. For example, human-AI collaboration might become an essential skill for students to succeed in the job market as AI solutions become an integral component of productivity in the industrial sector (Wang et al., 2020). Therefore, institutions that aim to prohibit AI tools could inadvertently place their students at a disadvantage compared with institutions that proactively welcome such changes. Such proactive adaptation could be achieved by consistently refining institutional policies regarding the use of LLMs and generative AI solutions, based on stakeholder feedback and empirical evidence.
Limitations
The current findings should be interpreted with several limitations in mind. First, although
we assessed the practicality and ethicality of LLM-based innovations with seven different
items, there could be other aspects of these multi-dimensional concepts that we omitted.
Nevertheless, these assessment items were drawn directly from the corresponding
definitions and relate to pressing issues in the literature (Adams et al., 2021; Weidinger
et al., 2021). Second, we only included English publications, which could have biased our
findings regarding the availability of LLM-based innovations across different countries.
Third, as we strictly followed the PRISMA protocol and only included peer-reviewed
publications, we may have omitted emerging works published in preprint archives; these
works may contain interesting findings regarding the latest LLMs (eg, ChatGPT).
Additionally, this review focused on the potential of LLM-based innovations for automating
educational tasks, and thus other pressing issues, such as the potential threat to academic
integrity, were outside its scope. We briefly touched on these issues in the implications and
illustrated how the current findings can support future educational studies in addressing
them. Moreover, since this study is a systematic scoping review, we did not assess the
quality of the included studies, and thus the findings, particularly the performance metrics
extracted from the reviewed studies, may need further evaluation. The goal of this study is
to provide an overview of the different educational tasks that can be augmented by LLMs
and generative AI models, which can serve as a reference point for future studies building
on state-of-the-art models (eg, ChatGPT and PaLM 2). Furthermore, the transparency
index that we adopted for RQ3 did not consider transparency to students, which could be
an important direction for future human-centred AI studies. Finally, a number of recent
workshop and preliminary papers, while contributing to this field, were not incorporated in
this scoping review due to time constraints (Caines et al., 2023; Leiker et al., 2023; Ma
et al., 2023). Their exclusion limits the breadth of this study, reflecting the rapid pace of
scholarly advancement in this area.
CONCLUSION
In this study, we systematically reviewed the current state of educational research on LLMs
and identified several practical and ethical challenges that need to be addressed for
LLM-based innovations to become beneficial and impactful. Based on the findings, we
proposed three recommendations for future studies: updating existing innovations with
state-of-the-art models, embracing the initiative of open-sourcing models and systems, and
adopting a human-centred approach throughout the development process. These
recommendations could support future studies in developing practical and ethical
innovations that can be implemented in authentic contexts to automate a wide range of
educational tasks.
ACKNOWLEDGEMENTS
This research was funded partially by the Australian Government through the Australian
Research Council (project numbers DP210100060 and DP220101209). Roberto Martinez-
Maldonado's research is partly funded by the Jacobs Foundation. This research was also
funded partially by the Jacobs Foundation (CELLA 2 CERES). This material is in part based
on research sponsored by the Defense Advanced Research Projects Agency (DARPA)
under agreement number HR0011-22-2-0047. The U.S. Government is authorised to
reproduce and distribute reprints for Governmental purposes notwithstanding any copyright
notation thereon. The views and conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing the official policies or endorsements,
either expressed or implied, of DARPA or the U.S. Government. Open access publishing
facilitated by Monash University, as part of the Wiley - Monash University agreement via
the Council of Australian University Librarians.
FUNDING INFORMATION
This research was at least in part funded by the Australian Research Council (DP210100060;
DP220101209), the Jacobs Foundation (Research Fellowship; CELLA 2 CERES), and the
Defense Advanced Research Projects Agency (HR0011-22-2-0047).
CONFLICT OF INTEREST STATEMENT
The authors have declared no conflicts of interest.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in the supplementary
document.
ORCID
Lixiang Yan https://orcid.org/0000-0003-3818-045X
Linxuan Zhao https://orcid.org/0000-0001-5564-0185
Yuheng Li https://orcid.org/0000-0002-5971-8469
Guanliang Chen https://orcid.org/0000-0002-8236-3133
ENDNOTE
1 Such as classification, prediction, clustering, etc.
REFERENCES
Adams, C., Pente, P., Lemermeyer, G., & Rockwell, G. (2021). Artificial intelligence ethics guidelines for K-12 edu-
cation: A review of the global landscape. In Artificial Intelligence in Education: 22nd International Conference,
AIED 2021, Utrecht, The Netherlands, June 14–18, 2021, Proceedings, Part II (pp. 24–28). Springer.
Ahmed, A., Joorabchi, A., & Hayes, M. J. (2022). On the application of sentence transformers to automatic short
answer grading in blended assessment. In 2022 33rd Irish Signals and Systems Conference (ISSC) (pp.
1– 6). IEEE.
Angelone, A. M., Galassi, A., & Vittorini, P. (2022). Improved automated classification of sentences in data science
exercises. In Methodologies and Intelligent Systems for Technology Enhanced Learning, 11th International
Conference 11 (pp. 12– 21). Springer.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., Do, Q. V., Xu,
Y., & Fung, P. (2023). A multitask, multilingual, multimodal evaluation of chatGPT on reasoning, hallucination,
and interactivity. arXiv preprint arXiv:2302.04023.
Becker, H. J. (2000). Findings from the teaching, learning, and computing survey. Education Policy Analysis
Archives, 8, 51.
Beseiso, M., Alzubi, O. A., & Rashaideh, H. (2021). A novel automated essay scoring approach for reliable higher
educational assessments. Journal of Computing in Higher Education, 33, 727–746.
Boeren, E. (2019). Understanding sustainable development goal (SDG) 4 on "quality education" from micro, meso
and macro perspectives. International Review of Education, 65, 277–294.
Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022). What does it mean for a language model
to preserve privacy? In 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 2280–
2292). Association for Computing Machinery.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry,
G., Askell, A., Agarwal, S., Herbert- Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A ., Ziegler, D.
M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few- shot learners. Advances in Neural
Information Processing Systems, 33, 1877– 1901.
Bulut, O., & Yildirim- Erbasli, S. N. (2022). Automatic story and item generation for reading comprehension as-
sessments with transformers. International Journal of Assessment Tools in Education, 9, 72 – 87.
Caines, A., Benedetto, L., Taslimipoor, S., Davis, C., Gao, Y., Andersen, O., Yuan, Z., Elliott, M., Moore,
R., Bryant, C., Rei, M., Mullooly, A., Nicholls, D., & Buttery, P. (2023). On the application of large
language models for language teaching and assessment technology. In AIED Workshops, in press.
CEUR- WS.org
Carpenter, D., Emerson, A., Mott, B. W., Saleh, A., Glazewski, K. D., Hmelo- Silver, C. E., & Lester, J. C. (2020).
Detecting off- task behavior from student dialogue in game- based collaborative learning. In Artificial
Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6 – 10, 2020,
Proceedings, Part I 21 (pp. 55– 66). Springer.
Carroll, A., Forrest, K., Sanders-O'Connor, E., Flynn, L., Bower, J. M., Fynes-Clinton, S., York, A., & Ziaei, M.
(2022). Teacher stress and burnout in Australia: Examining the role of intrapersonal and environmental fac-
tors. Social Psychology of Education, 25, 441–469.
Cavalcanti, A. P., Barbosa, A., Carvalho, R., Freitas, F., Tsai, Y.- S., Gašević, D., & Mello, R. F. (2021). Automatic
feedback in online learning environments: A systematic literature review. Computers and Education: Artificial
Intelligence, 2, 100027.
Chaudhry, M. A., Cukurova, M., & Luckin, R. (2022). A transparency index framework for AI in education. In
Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry
and Innovation Tracks, Practitioners’ and Doctoral Consortium: 23rd International Conference, AIED 2022,
Durham, UK, July 27– 31, 2022, Proceedings, Part II (pp. 195 – 198). Springer.
Chechitelli, A. (2023). AI writing detection update from Turnitin's chief product officer. https://www.turnitin.com/
blog/ai-writing-detection-update-from-turnitins-chief-product-officer
Condor, A., Litster, M., & Pardos, Z. (2021). Automatic short answer grading with sbert on out- of- sample ques-
tions. International Educational Data Mining Society.
Defence Science and Technology Group. (2021). Technology readiness levels definitions and descriptions. https://
www.dst.defence.gov.au/sites/default/files/basic_pages/documents/TRL%20Explanations_1.pdf
Devlin, J., Chang, M.- W., Lee, K., & Toutanova, K. (2018). Bert: Pre- training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805.
Doewes, A., & Pechenizkiy, M. (2021). On the limitations of human- computer agreement in automated essay
scoring. International Educational Data Mining Society.
Doyle, W., & Ponder, G. A. (1977). The practicality ethic in teacher decision- making. Interchange, 8, 1– 12.
Drori, I., Zhang, S., Shuttleworth, R., Tang, L., Lu, A., Ke, E., Liu, K., Chen, L., Tran, S., Cheng, N., Wang, R.,
Singh, N., Patti, T. L., Lynch, J., Shporer, A., Verma, N., Wu, E., & Strang, G. (2022). A neural network solves,
explains, and generates university math problems by program synthesis and few- shot learning at human
level. Proceedings of the National Academy of Sciences, 119, e2123433119.
Ertmer, P. A. (1999). Addressing first- and second- order barriers to change: Strategies for technology integration.
Educational Technology Research and Development, 47, 47– 61.
Ferguson, R., Hoel, T., Scheffel, M., & Drachsler, H. (2016). Guest editorial: Ethics and privacy in learning analyt-
ics. Journal of Learning Analytics, 3, 5– 15.
Fonseca, S. C., Pereira, F. D., Oliveira, E. H., Oliveira, D. B., Carvalho, L. S., & Cristea, A. I. (2020). Automatic
subject-based contextualisation of programming assignment lists. International Educational Data Mining
Society.
Gašević, D., Dawson, S., Rogers, T., & Gasevic, D. (2016). Learning analytics should not promote one size fits all:
The effects of instructional conditions in predicting academic success. The Internet and Higher Education,
28, 68– 84.
Gašević, D., Greiff, S., & Shaffer, D. W. (2022). Towards strengthening links between learning analytics and as-
sessment: Challenges and potentials of a promising new bond. Computers in Human Behavior, 134, 107304.
https://www.sciencedirect.com/science/article/pii/S0747563222001261
Geller, S. A., Gal, K., Segal, A., Sripathi, K., Kim, H. G., Facciotti, M. T., Igo, M., Hoernle, N., & Karger, D. (2021).
New methods for confusion detection in course forums: Student, teacher, and machine. IEEE Transactions
on Learning Technologies, 14, 665 – 679.
Ghosh, D., Klebanov, B. B., & Song, Y. (2020). An exploratory study of argumentative writing by young students:
A transformer- based approach. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for
Building Educational Applications (pp. 145– 150). Association for Computational Linguistics.
Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi, L., Taylor, R. A., & Chartash, D. (2023). How does
chatGPT perform on the United States medical licensing examination? The implications of large language
models for medical education and knowledge assessment. JMIR Medical Education, 9, e45312.
Holmes, W., & Porayska- Pomsta, K. (2022). The ethics of artificial intelligence in education: Practices, chal-
lenges, and debates. Taylor & Francis.
Jayaraman, J., & Black, J. (2022). Effectiveness of an intelligent question answering system for teaching financial
literacy: A pilot study. In Innovations in Learning and Technology for the Workplace and Higher Education:
Proceedings of ‘The Learning Ideas Conference’ 2021 (pp. 133– 140). Springer.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G.,
Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O.,
Sailer, M., Schmidt, A., Seidel, T., … Kasneci, G. (2023). ChatGPT for good? On opportunities and chal-
lenges of large language models for education. Learning and Individual Dif ferences, 103, 102 2 74.
Khosravi, H., Shum, S. B., Chen, G., Conati, C., Tsai, Y.-S., Kay, J., Knight, S., Martinez-Maldonado, R., Sadiq,
S., & Gašević, D. (2022). Explainable artificial intelligence in education. Computers and Education: Artificial
Intelligence, 3, 100074.
Kumar, N., Mali, R., Ratnam, A., Kurpad, V., & Magapu, H. (2022). Identification and addressal of knowledge
gaps in students. In 2022 3rd International Conference for Emerging Technology (INCET) (pp. 1–6).
IEEE.
Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al- Emari, S. (2020). A systematic review of automatic question gen-
eration for educational purposes. International Journal of Artificial Intelligence in Education, 30, 121– 20 4.
Leiker, D., Finnigan, S., Gyllen, A . R., & Cukurova, M. (2023). Prototyping the use of large language models (llms)
for adult learning content creation at scale. In AIED Workshops, in press. CEUR- WS.org
Li, C., & Xing, W. (2021). Natural language generation using deep learning to support mooc learners. International
Journal of Artificial Intelligence in Education, 31, 186– 214.
Li, Y., Sha, L., Yan, L., Lin, J., Raković, M., Galbraith, K., Lyons, K., Gašević, D., & Chen, G. (2023). Can large
language models write reflectively. Computers and Education: Artificial Intelligence, 4, 10 0140.
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native
English writers. arXiv preprint arXiv:2304.02819.
Liu, S., Liu, S., Liu, Z., Peng, X., & Yang, Z. (2022). Automated detection of emotional and cognitive engagement
in mooc discussions to predict learning achievement. Computers & Education, 181, 104461.
Liu, Z., He, X., Liu, L., Liu, T., & Zhai, X. (2023). Context mat ters: A strategy to pre- train language model for sci-
ence education. arXiv preprint arXiv:2301.12031.
Ma, Q., Wu, S., & Koedinger, K. (2023). Is LLM the better programming partner? In AIED Workshops, in press.
CEUR-WS.org
Maheen, F., Asif, M., Ahmad, H., Ahmad, S., Alturise, F., Asiry, O., & Ghadi, Y. Y. (2022). Automatic computer
science domain multiple- choice questions generation based on informative sentences. PeerJ Computer
Science, 8, e1010.
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does
“failure to replicate” really mean? American Psychologist, 70, 487– 498.
Merine, R., & Purkayastha, S. (2022). Risks and benefits of AI-generated text summarization for expert level
content in graduate health informatics. In 2022 IEEE 10th International Conference on Healthcare Informatics
(ICHI) (pp. 567–574). IEEE.
Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen, T. H., Sainz, O., Agirre, E., Heinz, I., & Roth, D. (2021).
Recent advances in natural language processing via large pre- trained language models: A survey. arXiv
preprint arXiv:2111.01243.
Mittelstadt, B. (2019). Principles alone cannot guarantee ethical AI. Nature Machine Intelligence, 1, 501– 507.
Moore, S., Nguyen, H. A., Bier, N., Domadia, T., & Stamper, J. (2022). Assessing the quality of student- generated
short answer questions using GPT- 3. In Educating for a New Future: Making Sense of Technology- Enhanced
Learning Adoption: 17th European Conference on Technology Enhanced Learning, EC- TEL 2022, Toulouse,
France, September 12– 16, 2022, Proceedings (pp. 243– 257). Springer.
Munn, Z., Peters, M. D., Stern, C., Tufanaru, C., McArthur, A., & Aromataris, E. (2018). Systematic review or
scoping review? Guidance for authors when choosing between a systematic or scoping review approach.
BMC Medical Research Methodology, 18, 1–7.
Nguyen, T. T., Le, A. D., Hoang, H. T., & Nguyen, T. (2021). Neu- chatbot: Chatbot for admission of national eco-
nomics university. Computers and Education: Artificial Intelligence, 2, 100036.
Nye, B., Mee, D., & Core, M. G. (2023) Generative large language models for dialog- based tutoring: An early
consideration of opportunities and concerns. In AIED Workshops, in press. CEUR- WS.org
Olney, A. (2023). Generating multiple choice questions from a textbook: LLMs match human performance on most
metrics. In AIED Workshops, in press. CEUR-WS.org
OpenAI. (2023). Introducing chatGPT. https://openai.com/blog/chatgpt
Page, M. J., McKenzie, J. E., Bossuy t, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J.
M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T.,
Loder, E. W., Mayo- Wilson, E., McDonald, S., … Moher, D. (2021). The prisma 2020 statement: An updated
guideline for reporting systematic reviews. International Journal of Surgery, 88, 105906.
Pardo, A., & Siemens, G. (2014). Ethical and privacy princ iples for learning analytics. British Journal of Educational
Technology, 45, 438– 450.
Pugh, S. L., Subburaj, S. K., Rao, A. R., Stewart, A. E., Andrews- Todd, J., & D'Mello, S. K. (2021). Say what?
Automatic modeling of collaborative problem solving skills from student speech in the wild. International
Educational Data Mining Society.
Ramesh, D., & Sanampudi, S. K. (2022). An automated essay scoring systems: A systematic literature review.
Artificial Intelligence Review, 55, 2495–2527.
Rudolph, J., Tan, S., & Tan, S. (2023). ChatGPT: Bullshit spewer or the end of traditional assessments in higher
education? Journal of Applied Learning and Teaching, 6, 342– 363.
Sallam, M. (2023). The utility of chatGPT as an example of large language models in healthcare education,
research and practice: Systematic review on the future perspectives and potential limitations. medRxiv,
2023– 02.
Sarsa, S., Denny, P., Hellas, A., & Leinonen, J. (2022). Automatic generation of programming exercises and code
explanations using large language models. In Proceedings of the 2022 ACM Conference on International
Computing Education Research- Volume 1 (pp. 27– 43). Association for Computing Machinery.
Sawatzki, J., Schlippe, T., & Benner- Wickner, M. (2022). Deep learning techniques for automatic short answer
grading: Predicting scores for English and German answers. In Artificial Intelligence in Education: Emerging
Technologies, Models and Applications: Proceedings of 2021 2nd International Conference on Artificial
Intelligence in Education Technology (pp. 65– 75). Springer.
Schneider, J., Richner, R., & Riser, M. (2022). Towards trustworthy autograding of short, multi- lingual, multi- type
answers. International Journal of Artificial Intelligence in Education, 33, 88– 118.
Schramowski, P., Turan, C., Andersen, N., Rothkopf, C. A., & Kersting, K. (2022). Large pre- trained language
models contain human- like biases of what is right and wrong to do. Nature Machine Intelligence, 4, 258– 268.
Selwyn, N. (2019). What's the problem with learning analytics? Journal of Learning Analytics, 6, 11 – 19 .
Sha, L., Li, Y., Gasevic, D., & Chen, G. (2022). Bigger data or fairer data? Augmenting ber t via active sampling
for educational text classification. In Proceedings of the 29th International Conference on Computational
Linguistics (pp. 1275– 1285). International Committee on Computational Linguistics.
Sha, L., Raković, M., Das, A., Gašević, D., & Chen, G. (2022). Leveraging class balancing techniques to alle-
viate algorithmic bias for predictive tasks in education. IEEE Transactions on Learning Technologies, 15,
481– 492.
Sha, L., Raković, M., Lin, J., Guan, Q., Whitelock-Wainwright, A., Gašević, D., & Chen, G. (2022). Is the latest
the greatest? A comparative study of automatic approaches for classifying educational forum posts. IEEE
Transactions on Learning Technologies, 16, 339–352.
Sha, L., Rakovic, M., Whitelock-Wainwright, A., Carroll, D., Yew, V. M., Gasevic, D., & Chen, G. (2021). Assessing
algorithmic fairness in automatic classifiers of educational forum posts. In Artificial Intelligence in Education:
22nd International Conference, AIED 2021, Utrecht, The Netherlands, June 14–18, 2021, Proceedings, Part
I 22 (pp. 381–394). Springer.
Shang, J., Huang, J., Zeng, S., Zhang, J., & Wang, H. (2022). Representation and extraction of physics knowledge
based on knowledge graph and embedding-combined text classification for cooperative learning. In 2022
IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD) (pp.
1053– 1058). IEEE.
Sharma, A., Kabra, A., & Kapoor, R. (2021). Feature enhanced capsule networks for robust automatic essay
scoring. In Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track: European
Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part V 21 (pp. 365–
380). Springer.
Song, W., Hou, X., Li, S., Chen, C., Gao, D., Sun, Y., Hou, J., & Hao, A. (2022). An intelligent virtual standard pa-
tient for medical students training based on oral knowledge graph. IEEE Transactions on Multimedia, 1– 14.
Sridhar, P., Doyle, A., Agarwal, A., Bogart, C., Savelka, J., & Sakr, M. (2023). Harnessing LLMs in curricular design:
Using GPT-4 to support authoring of learning objectives. In AIED Workshops, in press. CEUR-WS.org
Su, Y., & Zhang, Y. (2020). Automatic construction of subject knowledge graph based on educational big data.
In Proceedings of the 2020 The 3rd International Conference on Big Data and Education (pp. 30– 36).
Association for Computing Machinery.
Truong, T.-L., Le, H.-L., & Le-Dang, T.-P. (2020). Sentiment analysis implementing BERT-based pre-trained
language model for Vietnamese. In 2020 7th NAFOSTED Conference on Information and Computer Science
(NICS) (pp. 362–367). IEEE.
Tsai, Y.- S., & Gasevic, D. (2017). Learning analytics in higher education — Challenges and policies: A review of
eight learning analytics policies. In Proceedings of the Seventh International Learning Analytics & Knowledge
Conference (pp. 233 – 242). Association for Computing Machinery.
Tsai, Y.- S., Whitelock- Wainwright, A., & Gašević, D. (2020). The privacy paradox and its implications for learning
analytics. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge (pp.
230– 239). Association for Computing Machinery.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017).
Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998– 6008.
Wang, D., Churchill, E., Maes, P., Fan, X., Shneiderman, B., Shi, Y., & Wang, Q. (2020). From human- human col-
laboration to human- ai collaboration: Designing AI systems that can work together with people. In Extended
Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1– 6). Association for
Computing Machinery.
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B.,
Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L.,
Hendricks, L. A., … Gabriel, I. (2021). Ethical and social risks of harm from language models. arXiv preprint
arXiv:2112.04359.
Wollny, S., Schneider, J., Di Mitri, D., Weidlich, J., Rittberger, M., & Drachsler, H. (2021). Are we there yet?— A
systematic literature review on chatbots in education. Frontiers in Artificial Intelligence, 4, 654924.
Wu, J. (2022). Analysis and evaluation of the impact of integrating mental health education into the teaching
of university civics courses in the context of artificial intelligence. Wireless Communications and Mobile
Computing, 2022, 1 – 11 .
Wu, X., He, X., Li, T., Liu, N., & Zhai, X. (2023). Matching exemplar as next sentence prediction (mensp): Zero- shot
prompt learning for automatic scoring in science education. arXiv preprint arXiv:2301.08771.
Yan, L., Zhao, L., Gasevic, D., & Martinez-Maldonado, R. (2022). Scalability, sustainability, and ethicality of mul-
timodal learning analytics. In LAK22: 12th International Learning Analytics and Knowledge Conference (pp.
13– 23). Association for Computing Machinery.
Yang, S. J., Ogata, H., Matsui, T., & Chen, N.- S. (2021). Human- centered artificial intelligence in education:
Seeing the invisible through the visible. Computers and Education: Artificial Intelligence, 2, 100008.
Zawacki- Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial in-
telligence applications in higher education— Where are the educators? International Journal of Educational
Technology in Higher Education, 16, 1 – 2 7.
Zeng, Z., Gašević, D., & Chen, G. (2023). On the effectiveness of curriculum learning in educational text scoring.
Proceedings of the AAAI Conference on Artificial Intelligence, 37(12), 14602–14610. https://doi.org/10.1609/
aaai.v37i12.26707
Zheng, L., Niu, J., Long, M., & Fan, Y. (2023). An automatic knowledge graph construction approach to promoting
collaborative knowledge building, group performance, social interaction and socially shared regulation in
CSCL. British Journal of Educational Technology, 54, 6 8 6 – 711.
Zheng, L., Niu, J., & Zhong, L. (2022). Effects of a learning analytics - based real- time feedback approach on
knowledge elaboration, knowledge convergence, interactive relationships and group performance in CSCL.
British Journal of Educational Technology, 53, 130– 149.
SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section
at the end of this article.
How to cite this article: Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R.,
Chen, G., Li, X., Jin, Y., & Gašević, D. (2024). Practical and ethical challenges of large
language models in education: A systematic scoping review. British Journal of
Educational Technology, 55, 90–112. https://doi.org/10.1111/bjet.13370