Br J Educ Technol. 2024;55:2039–2057.
Received: 14 December 2023 | Accepted: 28 March 2024
DOI: 10.1111/bjet.13465
ORIGINAL ARTICLE
Bridging large language model disparities:
Skill tagging of multilingual educational
content
Yerin Kwak | Zachary A. Pardos
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2024 The Authors. British Journal of Educational Technology published by John Wiley & Sons Ltd on behalf of British Educational Research Association.
UC Berkeley, School of Education, Berkeley, California, USA
Correspondence
Zachary A. Pardos, UC Berkeley, School of Education, Berkeley, CA, USA.
Email: pardos@berkeley.edu
Abstract
The adoption of large language models (LLMs) in education holds much promise. However, like many technological innovations before them, adoption and access can often be inequitable from the outset, creating more divides than they bridge. In this paper, we explore the magnitude of the country and language divide in the leading open-source and proprietary LLMs with respect to knowledge of K-12 taxonomies in a variety of countries and their performance on tagging problem content with the appropriate skill from a taxonomy, an important task for aligning open educational resources and tutoring content with state curricula. We also experiment with approaches to narrowing the performance divide by enhancing LLM skill tagging performance across four countries (the USA, Ireland, South Korea and India–Maharashtra) for more equitable outcomes. We observe considerable performance disparities not only with non-English languages but with English and non-US taxonomies. Our findings demonstrate that fine-tuning GPT-3.5 with a few labelled examples can improve its proficiency in tagging problems with relevant skills or standards, even for countries and languages that are underrepresented during training. Furthermore, the fine-tuning results show the potential viability of GPT as a multilingual skill classifier. Using both an open-source model, Llama2-13B, and a closed-source model, GPT-3.5, we also observe large disparities in tagging performance between the two and find that fine-tuning and skill information in the prompt improve both, but the closed-source model improves to a much greater extent. Our study contributes the first empirical results on mitigating disparities across countries and languages with LLMs in an educational context.

KEYWORDS
ChatGPT, Llama2, LLM, multilingual, skill tagging

Practitioner notes

What is already known about this topic
- Recent advances in generative AI have led to increased applications of LLMs in education, offering diverse opportunities.
- LLMs excel predominantly in English and exhibit a bias towards the US context.
- Automated content tagging has been studied using English-language content and taxonomies.

What this paper adds
- Investigates the country and language disparities in LLMs concerning knowledge of educational taxonomies and their performance in tagging content.
- Presents the first empirical findings on addressing disparities in LLM performance across countries and languages within an educational context.
- Improves GPT-3.5's tagging accuracy through fine-tuning, even for non-US countries, starting from zero accuracy.
- Extends automated content tagging to non-English languages using both open-source and closed-source LLMs.

Implications for practice and/or policy
- Underscores the importance of considering the performance generalizability of LLMs to languages other than English.
- Highlights the potential viability of ChatGPT as a skill tagging classifier across countries.
INTRODUCTION
New technologies can revolutionize various aspects or segments of society but can also con-
tribute to considerable equity divides through unequal access and adoption (Van Dijk, 2006).
These divides manifest across socio- economic status, race, gender and country (Chen &
Wellman, 2004). This is true of large language models (LLMs) as well, not only in unequal
access to them but also in unequal performance of the leading models on tasks requiring
knowledge of different languages and cultures due to their heavy reliance on scraping of
text mostly produced in English by US- based authors. Unequal access to technology tends
to underrepresent marginalized groups online, skewing the web data used to train LLMs
(Baker & Hawn, 2021). Specifically, the restricted digital access and proficiency in certain
countries lead to an uneven distribution of digital resources across countries and languages
(Ta & Lee, 2023), contributing to performance disparities in LLMs. This difference is espe-
cially pronounced in countries with low- resource languages and limited digital presence
(Bang et al., 2023). In contrast, LLMs show familiarity with well- represented languages and
regions, such as the USA (Johnson et al., 2022; Rettberg, 2022).
Another factor hindering the equitable application of LLMs is their proprietary nature, as
the majority of powerful models remain closed source. While these proprietary models, like
ChatGPT, demonstrate higher efficacy, they lack accessibility and transparency, limiting broader
collaboration and modification of the model for safe use (Liao & Vaughan, 2023). In contrast,
open- source LLMs, such as Llama2 (Touvron et al., 2023), offer transparency by providing their
architecture, code and training data, and this openness enables users to understand the mod-
el's functioning and potential biases. Users can also tailor these models to their specific needs,
potentially contributing to bias reduction. However, conflicting perspectives exist regarding the
effectiveness of open- source LLMs (Gudibande et al., 2023; Kadous, 2023).
In this paper, we evaluate and then improve performance of LLMs across languages and
countries in an educational context, tagging math problems with relevant skills or standards.
Our assessment involves both an open- source LLM (Llama2) and a closed- source LLM
(ChatGPT) to understand their capabilities. Our work is the first to evaluate this tagging task
on non- English content using LLMs, revealing and addressing significant disparities among
countries and models.
RELATED WORK
Skill tagging
Tagging content to specific skills or standards from an educational curriculum is essential in
order to allow educators, open education resource repositories and education technology
companies to align content to curricula. This type of fine- grained categorization is essential
for making crowdsourced content findable by teachers, and other problem content, including
those created by generative AI, usable by adaptive tutoring systems. As an example, Khan
Academy has labelled its content with skills from the US Common Core State Standards
(CCSS) taxonomy1 and other systems like ASSISTments (Razzaq et al., 2007), Cognitive
Tutor (Stamper & Pardos, 2016) and OATutor (Pardos et al., 2023) rely on their own skill
taxonomies to properly function. For adaptive tutoring systems such as Intelligent Tutoring
Systems (ITS), organizing problem content by skill allows for fine- grained cognitive mas-
tery estimation and adaptive item selection to match the student's mastery level (Anderson
et al., 1995; Bloom, 1984). Moreover, skill tagging adds an important piece of meta informa-
tion that facilitates learning- oriented learning analytics to be produced from the problem-
solving behaviour logs of students (Lang et al., 2017).
Several terms, such as ‘tagging’ (Li et al., 2021), ‘mapping’ (Desmarais et al., 2012) and
‘alignment’ (Supraja et al., 2017 ), have often been used to describe the classification of edu-
cational content into taxonomies. The elements within a taxonomy are referred to as knowl-
edge components (KC; Koedinger et al., 2012) or skills, with both terms often being used
interchangeably in prior works (Pardos & Dadu, 2017; Shen et al., 2021; Shi et al., 2023). In
this paper, we adopt ‘skills’ or ‘standards’ to describe the standards within an educational cur-
riculum. We use ‘educational taxonomy’ to refer to a taxonomy that consists of these skills or
standards and ‘tagging’ to indicate the process of classifying each problem into its standards.
While skill tagging has traditionally been performed manually by human experts, this method
is time intensive and laborious, requiring extensive knowledge about content and the desired
curricula. To alleviate this challenge, previous research has explored automating this task,
primarily focusing on English content and taxonomies. Various approaches have been taken to
automate the skill tagging task without using LLMs. For example, Pardos and Dadu (2017) lev-
eraged the skip- gram model and neural networks to classify skills based on problem text and
the sequence of problem IDs. Shen et al. (2021) introduced MathBERT, a pretrained language
model for mathematics, to predict knowledge components for math problems. Shi et al. (2023)
proposed the KC- Finder model, which employed the code2vec model and neural networks to
predict the relevant KCs of each problem using student code submissions. Recently, a state-
of- the- art pretrained model was introduced to leverage multimodal data to align educational
resources with CCSS (Li et al., 2024). However, no prior work has extended this automation to
different languages, nor explored the potential contributions of LLMs.
Bias in LLMs
Recent advances in generative AI have sparked researchers and practitioners to ex-
plore its innovative applications in educational contexts (Nguyen et al., 2023; Pardos &
Bhandari, 2023). However, concerns have also emerged that LLMs' inherent biases can
amplify existing inequalities and unfairness in society, thereby negatively impacting equita-
ble teaching and learning processes (Baker & Hawn, 2021; Kasneci et al., 2023). Thus, it
becomes crucial to investigate the origins and types of biases in LLMs and explore fairness
approaches in educational contexts.
A primary factor of bias in LLMs is their training data, predominantly sourced from the
web (Ferrara, 2023). This approach may lead to the underrepresentation of languages and countries
that are less prominent online. Language distribution in LLMs' training data reveals that a
majority of the data are from high- resource languages like English, while fewer instances
are from medium- resource languages like Korean and low- resource languages like Marathi
(Brown et al., 2020; Touvron et al., 2023). As a result, studies reveal a notable proficiency
gap, with LLMs excelling in English but displaying limitations in non- English languages, par-
ticularly those with fewer online resources (Bang et al., 2023; Deng et al., 2023). Additionally,
the geographical distribution of web data, with over half hosted in the USA on platforms like
Common Crawl (Dodge et al., 2021), suggests a significant skew in LLMs' training data to-
wards American contexts. This skewed distribution manifests in the models' familiarity with
American culture and knowledge (Johnson et al., 2022; Rettberg, 2022) and inaccurate un-
derstandings of non- US countries, especially those in the global south (Gondwe, 2023).
Despite the potential harm, there have been limited efforts to address these disparities in
the educational context. The language- and country- related bias in LLMs potentially harms
equitable and inclusive opportunities (Kasneci et al., 2023). For instance, these biases af-
fect students and teachers using medium- or low- resource languages, limiting the benefits
they could gain from LLMs. Moreover, a bias favouring certain cultural backgrounds could
disadvantage students from marginalized groups (Ferrara, 2023). Outside of education, pre-
vious studies have explored various strategies using fine- tuning (Ramezani & Yang, 2023)
and prompting (Tao et al., 2023) to mitigate cultural gaps. Japan has developed a Japanese
version of ChatGPT to address limitations in understanding the Japanese language and
cultural nuances (Hornyak, 2023).
Open- versus closed- source LLMs
Open- source LLMs are large language models whose source code and pretrained weights
are publicly accessible, allowing for free access, modification and distribution. In contrast,
closed- source LLMs, like ChatGPT, have proprietary source code and weights that are not
publicly available. In contrast to the extensive educational research using ChatGPT, there
is a dearth of studies involving open- source LLMs in the educational context, to the best of
our knowledge. This gap is noteworthy, given that leveraging these open- source LLMs, such
as BLOOM (BigScience Workshop et al., 2022), Llama (Touvron et al., 2023) and Falcon
(Almazrouei et al., 2023), can enhance transparency, adaptability and accessibility, foster-
ing a more inclusive and equitable educational research environment. Their transparency
in technical aspects and training data enables users to understand the models' biases and
limitations (Ferrara, 2023). The reproducibility of these models helps the community use them more safely and effectively. However, conflicting study results exist on whether
open- source models can be viable alternatives to closed ones. Some argue open- source
LLMs perform as well as or surpass closed models (Kadous, 2023), while others contend
they still lag behind (Gudibande et al., 2023).
The present study
Our research focuses on LLM- based skill tagging for educational problem content from four
countries using both open and closed pretrained models. We measure the magnitude of dis-
parities in performance on this task between content from different countries and apply two
strategies to mitigate performance differences. Our evaluation extends to both open- source
and closed- source models, Llama2- 13B and GPT- 3.5. By exploring the effectiveness of
these strategies across models and country contexts, this study contributes a novel example
of how LLM technology and error mitigation techniques may be differentially effective for dif-
ferent countries and languages.
RQ1: What is the baseline performance and to what extent do performance improve-
ment strategies effectively enhance LLM skill tagging capabilities across countries and
languages?
RQ2: How do LLM performance enhancement strategies perform when applied to open-
source and closed- source LLMs in skill tagging tasks?
DATASETS
We used Khan Academy math problem content aligned with the educational curricula
of four countries: the USA, Ireland, South Korea and India (specifically, Maharashtra).
For India, our study focuses on Maharashtra, a state in India with Marathi as its offi-
cial language. For clarity, we consistently refer to Maharashtra by its state name rather
than India in our paper. In Table 1, we provide an overview of each country's official
language and educational curriculum. The languages of the selected countries indicate
three distinct language resource levels: a high- resource language (English), a medium-
resource language (Korean) and a low-resource language (Marathi) (Bang et al., 2023; Deng et al., 2023). These languages also exhibit varying levels of representation in the pretraining data of Llama2, with English comprising 89.7%, Korean 0.06% and Marathi <0.005% (Touvron et al., 2023).

TABLE 1 Education curricula and languages of different countries.

Country | Curriculum | Language
USA | Common Core State Standards | English
Ireland | Irish Education Curriculum | English
South Korea | National Curriculum of Korea | Korean
Maharashtra (India) | Maharashtra Education Curriculum | Marathi
We randomly selected 50 skills from each country's educational curriculum, along with
their associated 16 problems, resulting in a total of 800 problems for each country. Out of
these, we tested the two proposed techniques using 500 problems—10 for each skill. The
remaining 300 were used for fine- tuning LLMs. While some of Khan Academy's problems
are tagged with multiple skills, we only chose problems labelled with a single skill. The prob-
lem texts, formatted in Markdown syntax, were used as input to LLMs without any prepro-
cessing to predict the most relevant skill.
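To make the sampling procedure concrete, the following is a minimal sketch of the per-country selection and split, assuming each problem is available as a record with hypothetical 'skills' and 'markdown' fields (the actual data format is not specified in the paper):

```python
import random
from collections import defaultdict

def build_splits(problems, n_skills=50, test_per_skill=10, seed=0):
    """Sample 50 skills with 16 single-skill problems each; hold out 10 problems
    per skill for testing (500 total) and keep the rest for fine-tuning (300 total)."""
    rng = random.Random(seed)
    by_skill = defaultdict(list)
    for p in problems:
        if len(p["skills"]) == 1:               # keep only single-skill problems
            by_skill[p["skills"][0]].append(p["markdown"])

    eligible = [s for s, items in by_skill.items() if len(items) >= 16]
    selected = rng.sample(eligible, n_skills)

    test, finetune = {}, {}
    for skill in selected:
        items = rng.sample(by_skill[skill], 16)
        test[skill] = items[:test_per_skill]      # 50 x 10 = 500 test problems
        finetune[skill] = items[test_per_skill:]  # 50 x 6  = 300 fine-tuning problems
    return test, finetune
```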
Taxonomies and problems
The US problems are aligned with the US Common Core State Standards (CCSS), a set
of academic standards outlining the mathematical knowledge and skills that K- 12 students
should acquire.2 CCSS includes what we refer to as ‘skill codes’, such as ‘5.NBT.A.3’, with
detailed descriptions such as ‘Read, write, and compare decimals to thousandths’, referred
to as ‘skills’ or ‘standards’.
The Irish math problems are tagged with the Irish Education Curriculum, which is guided
by the National Council for Curriculum and Assessment.3 Specifically, the problems are
aligned with the primary math curriculum from 1999, the junior cycle curriculum from 2016
and the senior cycle curriculum from 2015. Standards or learning outcomes are specified for
different subjects, including math, grade levels and strands (eg, number and data).
The Korean math problems are labelled with standards from the 2015 revision of the National Curriculum of Korea. This curriculum serves as a standardized framework,
periodically revised by the Ministry of Education every 5 to 10 years.4
The problems from Maharashtra are mapped according to the Maharashtra Education
Curriculum. In Maharashtra, the primary curriculum is developed by the Maharashtra State Council for Educational Research and Training5 and the curriculum for higher classes is developed by the Maharashtra State Board of Secondary and Higher Secondary Education.
Other taxonomies
To comprehensively assess the pre- existing knowledge level of LLMs on educational taxono-
mies, we conducted evaluations for the USA, Ireland, Korea, Maharashtra, England and Japan.
The national curriculum in England, developed by the Department for Education in 2013, serves
as a framework of subjects and standards to ensure standardized education.6 In official docu-
ments, standards or ‘statutory requirements’ are detailed for each grade and domain area (eg,
Measurement). The Curriculum Guidelines of Japan outline the standards developed by the
Ministry of Education, Culture, Sports, Science and Technology.7 These curriculum guidelines
are updated approximately every decade. We used the most recent versions for elementary
and junior high school, which were announced in 2017, and for high school, released in 2018.
METHODS
In the “Model” section, we explain the two models used in our study. The “Prompt Engineering” section provides the prompts we used and the process of their selection. The “Performance Improvement Strategies” section outlines our strategies for enhancing tagging capabilities for
four countries. In the “Comparison of Open- Source and Closed- Source LLMs” section, we
detail how we compare the performance of an open- and a closed- source LLM on our tasks.
Model
GPT-3.5
Developed by OpenAI, GPT is the leading closed- source LLM. Among several versions of
GPT, we used the gpt- 3.5- turbo model of GPT- 3.5, which can be fine- tuned through the
OpenAI API (OpenAI, 2023). To control the randomness of the model's output, we set the
‘temperature’ parameter to 0.2.
Llama2- 13B
To evaluate open- source LLMs, we used Llama2, introduced by Meta. Trained on publicly
available online data, it consists of models ranging from 7 to 70 billion parameters (Touvron
et al., 2023), outperforming other open-source chat models on multiple benchmarks. Due to its substantial size, running Llama2 can entail high computational and memory costs, requiring multiple powerful GPUs. To address this challenge, quantization methods have emerged that represent weights and activations with low-precision data types. Several experiments
have demonstrated that quantizing to 4 bits is highly efficient with negligible accuracy deg-
radation (Frantar et al., 2022). Through quantization, we successfully ran Llama2- 13B using
a single Quadro RTX 8000.
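As a rough illustration only, the sketch below loads a 4-bit quantized Llama2-13B chat checkpoint (assumed here to be meta-llama/Llama-2-13b-chat-hf) on a single GPU with the Hugging Face transformers and bitsandbytes libraries and issues one tagging prompt; the authors cite GPTQ (Frantar et al., 2022), so the exact quantization tooling they used may differ, and the example problem is invented.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint

# 4-bit quantization keeps the 13B model within a single GPU's memory budget
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

prompt = ("Please provide one standard from the Common Core State Standards for "
          "Mathematics that is most closely related to the given problem. "
          "Problem: Compare 0.305 and 0.35 using <, >, or =.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```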
Prompt engineering
A prompt refers to a set of customized instructions given to an LLM (Liu et al., 2023). LLMs
are sensitive to the user's prompt, making prompt engineering crucial. Thus, we assessed
two types of prompts that demonstrated success in text classification tasks using GPT- 3.5
and Llama2 in recent literature (Sivarajkumar et al., 2023). After assessing their perfor-
mance with our dataset, we identified the most effective prompt for our task. Additionally,
it has been reported that LLMs tend to perform better with English prompts, even in tasks
intended for languages other than English (Lai et al., 2023). Therefore, we opted to use
English prompts, even with Korean and Marathi content and taxonomies.
Sivarajkumar et al. (2023) investigated various prompt types in natural language process-
ing tasks. Their study demonstrated that, in a text classification task similar to our skill tagging
task, both GPT- 3.5 and Llama2 performed well with prefix and anticipatory prompts. Prefix
prompts prepend a word or phrase indicating the task (eg, ‘Please provide one Common Core
Standard for Mathematics that is most closely related to the given problem’). Anticipatory
prompts are question- type prompts anticipating the response (eg, ‘Which Common Core
State Standards for Mathematics is most closely related to the given problem?’).
In our experimentation with these prompt types using 500 US problems and CCSS, GPT-
3.5 and Llama2- 13B were asked to identify the most relevant CCSS for a given problem.
GPT- 3.5 achieved scores of 0.272 with anticipatory prompts and 0.238 with prefix prompts.
Llama2- 13B scored 0.016 with prefix prompts and 0.012 with anticipatory prompts. Based
on this observation, we selected prompt types for each model and consistently used them
throughout our analyses (Table 2). ‘Skills list’ prompts were used in the ‘Standards in the prompt’ strategy (see the “Standards in the prompt” sections).

TABLE 2 Selected prompt types for GPT-3.5 and Llama2-13B.

Type | Model | Prompt
Anticipatory | GPT-3.5 | Which standard from {Taxonomy name} is most closely related to the given problem? Problem: {problem text}
Prefix | Llama2-13B | Please provide one standard from {Taxonomy name} that is most closely related to the given problem. Problem: {problem text}
Skills list—Anticipatory | GPT-3.5 | Which standard from {Taxonomy name} is most closely related to the given problem? Problem: {problem text} Standards: {a list of standards}
Skills list—Prefix | Llama2-13B | Please provide one standard from {Taxonomy name} that is most closely related to the given problem. Problem: {problem text} Standards: {a list of standards}
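For illustration, a sketch of how these prompt templates can be assembled and sent to gpt-3.5-turbo with the OpenAI Python client (v1-style interface shown here; the client version and any response post-processing used by the authors are not stated):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ANTICIPATORY = ("Which standard from {taxonomy} is most closely related to the "
                "given problem? Problem: {problem}")

def tag_problem(problem_text, taxonomy_name, standards=None):
    """Ask GPT-3.5 for the most relevant standard; if `standards` is given,
    append the list to form the 'Skills list' prompt type from Table 2."""
    prompt = ANTICIPATORY.format(taxonomy=taxonomy_name, problem=problem_text)
    if standards:
        prompt += " Standards: " + "; ".join(standards)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature to limit output randomness
    )
    return response.choices[0].message.content.strip()
```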
Performance improvement strategies
We utilized LLMs to tag the most relevant skill from a national curriculum for a given problem
across four different countries. Our evaluation metric was exact match accuracy, wherein a
prediction was considered accurate only when the predicted set of skills exactly matched the
ground truth. Previous studies have observed that LLMs exhibit robustness in English (Bang
et al., 2023; Deng et al., 2023) and familiarity with knowledge derived from the USA (Johnson
et al., 2022; Rettberg, 2022). For more equitable performance improvement across countries,
we implemented two strategies. We began by assessing the knowledge related to the national
curriculum and establishing baseline scores for both models in the skill tagging task. We then
applied each technique and evaluated their efficacy in improving performance across all coun-
tries. Our proposed methods are (1) fine- tuning and (2) standards in the prompt.
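A minimal sketch of the exact-match metric; because only single-skill problems were used, each prediction is compared to one labelled standard, and the light string normalization shown here is an assumption rather than a detail reported by the authors.

```python
def exact_match_accuracy(predictions, ground_truths):
    """Count a prediction as correct only if it exactly matches the labelled standard."""
    assert len(predictions) == len(ground_truths)
    normalize = lambda s: " ".join(s.strip().lower().split())
    hits = sum(normalize(p) == normalize(g)
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Two of three predictions match exactly, so accuracy is 2/3
print(exact_match_accuracy(["5.NBT.A.3", "7.EE.A.1", "4.OA.A.1"],
                           ["5.NBT.A.3", "7.EE.A.1", "6.RP.A.2"]))
```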
Base LLMs: Familiarity with national curricula and baseline tagging
performance
We assessed the models' familiarity with the educational curriculum standards of various
countries. This evaluation is crucial, as the models' pre- existing knowledge can influence
the baseline scores and the effectiveness of our proposed strategies. For a more compre-
hensive analysis, we included England and Japan in this task, along with the four selected
countries. We instructed the models to generate educational curricula for these six countries,
as outlined in the "DATASETS" section. Given that all these educational taxonomies follow
a hierarchical structure, including grades, domains and standards, we asked the models to
list standards associated with specific grade and domain levels. For example, we requested
eight standards within the ‘Measurement’ domain for Year 2 in the National Curriculum of
England. The evaluation metric was an exact match with the official documents.
To evaluate the skill tagging performance of the base versions of GPT- 3.5 and Llama2-
13B across the four countries, we prompted both models to identify the relevant standards
for given problems. This assessment was conducted without employing any specific strate-
gies, solely relying on prompts (anticipatory and prefix) in the “Prompt Engineering” section.
Fine- tuning
Fine- tuning involves adapting LLMs to specific tasks using task- specific data. It is an ef-
fective approach to enhance performance (Min et al., 2021; Wei et al., 2021) and introduce
desired behaviours (Ouyang et al., 2022). In our case, to improve skill tagging performance
across different languages and incorporate knowledge about educational taxonomies in
non- US countries into LLMs, we fine- tuned LLMs using a few sets of labelled problems
per skill.
Fine- tuning requires training, validation and test data. Our dataset contains 50 skills, each
associated with 16 problems for each country. We divided the data for each skill as follows:
30% (5 problems) for training, 1 for validation and the remaining 10 for test data. We fine-
tuned the models using two, three, four and five training examples per skill and evaluated
their performance.
In the fine- tuning process, certain hyperparameters control the learning process. One
crucial parameter is the number of epochs, where an epoch is a complete iteration through
a dataset during the training. OpenAI set the default epoch value to 3 and recommended
increasing it by 1 or 2 epochs to check the results. We began with 3 epochs for both models
and tested 4 epochs for GPT- 3.5, following OpenAI's suggestion. To explore whether more
epochs lead to better outcomes, we increased epochs to 10 for both models and attempted
20 for Llama2- 13B.
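A sketch of this setup with the OpenAI Python client: each labelled problem becomes one chat-format JSONL record pairing the tagging prompt with its ground-truth standard, and a job is launched with an explicit epoch count. The file name, example pair and prompt wrapper are illustrative rather than taken from the paper.

```python
import json
from openai import OpenAI

client = OpenAI()

def write_training_file(train_pairs, taxonomy_name, path="tagging_train.jsonl"):
    """train_pairs: list of (problem_markdown, standard) tuples, a few per skill."""
    with open(path, "w", encoding="utf-8") as f:
        for problem, standard in train_pairs:
            record = {"messages": [
                {"role": "user", "content":
                    f"Which standard from {taxonomy_name} is most closely "
                    f"related to the given problem? Problem: {problem}"},
                {"role": "assistant", "content": standard},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path

train_pairs = [
    ("Compare 0.305 and 0.35 using <, >, or =.",            # illustrative problem
     "Read, write, and compare decimals to thousandths"),   # CCSS 5.NBT.A.3
]
path = write_training_file(train_pairs, "the Common Core State Standards for Mathematics")

uploaded = client.files.create(file=open(path, "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 4},  # defaults to 3; 4 and 10 were also tested
)
print(job.id)
```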
Previous research observed a considerable performance decrease in base LLMs with
low- resource languages (Bang et al., 2023; Deng et al., 2023). To address this issue for
Marathi, we conducted fine- tuning using 10 problems per skill; 5 problems were sourced
from our training data and the remaining 5 were synthetically generated by GPT- 4. This
approach is inspired by Gudibande et al. (2023), where fine- tuning the Llama base model
to imitate ChatGPT using synthetic data generated by prompting ChatGPT resulted in im-
proved performance. We prompted GPT- 4 to generate five different math problems to help
master a given skill and used these problems for fine- tuning the model.
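The sketch below paraphrases that generation step with the OpenAI client; the exact GPT-4 prompt wording and output parsing are assumptions, since the paper only describes asking for five problems per skill.

```python
from openai import OpenAI

client = OpenAI()

def synthesize_problems(skill_description, n=5):
    """Ask GPT-4 for n practice problems targeting one skill, one per line."""
    prompt = (f"Generate {n} different math problems that would help a student "
              f"master the following skill. Return one problem per line.\n"
              f"Skill: {skill_description}")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().split("\n")
    return [line.strip() for line in lines if line.strip()][:n]

# The synthetic problems are paired with the skill label and appended to the
# five human-authored examples per skill in the fine-tuning data.
extra_problems = synthesize_problems("Read, write, and compare decimals to thousandths")
```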
Unlike GPT, where fine-tuning is done via an API, fine-tuning Llama2-13B has high computational and storage requirements. To address these challenges, the parameter effi-
cient fine- tuning (PEFT) method has emerged to fine- tune LLMs efficiently by reducing the
number of trainable parameters. Among several PEFT methods, we used QLoRA (Dettmers
et al., 2023), which uses 4- bit quantization to compress the model and effectively reduce
memory usage without sacrificing performance. With this technique, we fine- tuned Llama2-
13B using our datasets from different languages on a single GPU.
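A compressed sketch of QLoRA fine-tuning with the Hugging Face peft, transformers and bitsandbytes libraries; the LoRA rank, target modules and other hyperparameters shown are illustrative defaults, not values reported in the paper, and the actual training loop is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint

bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID,
                                             quantization_config=bnb_config,
                                             device_map="auto")

# QLoRA: keep the 4-bit base model frozen and train small low-rank adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # roughly 0.1% of weights are trainable

# From here, the (prompt, standard) pairs are tokenized and passed to a standard
# causal-LM trainer (e.g. transformers.Trainer or trl.SFTTrainer) for 3-20 epochs.
```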
Standards in the prompt
When utilizing base LLMs for the skill tagging task without model modifications like fine-
tuning, including a list of standards in the prompt enables the models to select the most
relevant standards. Although this method does not require the models’ prior knowledge
of the standards, it is possible that the pre- existing knowledge could impact performance.
We added a list of standards to the prompt, using the ‘skills list’ prompt type (Table 2), and
compared the predicted standards to the ground truths. Our primary focus was to evaluate
whether noticeable differences in results emerged across countries.
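A sketch of this evaluation loop, assuming the `test` split, `client` and `exact_match_accuracy` helpers from the earlier sketches; the list of 50 candidate standards for a country is simply appended to the prompt, and the country shown is arbitrary.

```python
def skills_list_prompt(problem_text, taxonomy_name, standards):
    """'Skills list' prompt type: append all candidate standards so the model
    selects one rather than recalling it from pretraining."""
    return ("Which standard from {t} is most closely related to the given "
            "problem? Problem: {p} Standards: {s}"
            ).format(t=taxonomy_name, p=problem_text, s="; ".join(standards))

candidate_standards = list(test.keys())  # the 50 sampled skills for one country
predictions, labels = [], []
for skill, problems in test.items():
    for problem in problems:
        prompt = skills_list_prompt(problem, "the Irish Education Curriculum",
                                    candidate_standards)
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        ).choices[0].message.content.strip()
        predictions.append(reply)
        labels.append(skill)

print(exact_match_accuracy(predictions, labels))
```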
Comparison of open- source and closed- source LLMs
To assess the effectiveness of open- source and closed- source models in enhancing per-
formance and producing equitable outcomes across countries, we compared the highest
scores achieved through our strategies for each country. We also evaluated the impact
of fine- tuning on both models by comparing their average score improvement per training
example.
RESULTS
In this section, we present the findings from our analyses. The “Base GPT- 3.5: Familiarity
with National Curricula and Baseline Tagging Performance” section provides skill tagging
results of the base GPT- 3.5 model across countries and its inherent knowledge of educa-
tional curricula. The “GPT- 3.5: Performance Improvement Strategies” and “Llama2- 13B:
Performance Improvement Strategies” sections detail the outcomes of our proposed strat-
egies to enhance the skill tagging performance of GPT- 3.5 and Llama2- 13B, respectively,
for both US and non- US countries. The “Comparison of Open- Source and Closed- Source
LLMs” section compares open- and closed- source models for this task.
Base GPT- 3.5: Familiarity with national curricula and baseline tagging
performance
We observed that GPT- 3.5 demonstrates a strong understanding of the US CCSS, a re-
duced familiarity with the England curriculum and zero with the standards from Ireland,
Korea and Maharashtra (see ‘Familiarity with National Curricula’ in Table 3). These results
align with findings from previous studies, confirming ChatGPT's pronounced familiarity with
US- related culture and knowledge (Johnson et al., 2022; Rettberg, 2022).
GPT- 3.5's limited knowledge of standards from countries other than the USA affected
the baseline score, resulting in an accuracy of 0.272 for tagging US problems with CCSS and 0.000 for the remaining countries, Ireland, Korea and Maharashtra (see ‘Skill Tagging Accuracy’ in Table 3).

TABLE 3 Base GPT-3.5 exhibits greater familiarity with the US CCSS than non-US curricula and proficiency in tagging US content to CCSS.

Measure | US | England | Ireland | Korea | Japan | Maharashtra
Familiarity with national curricula | 0.675 | 0.273 | 0.000 | 0.000 | 0.000 | 0.000
Skill tagging accuracy | 0.272 | N/A | 0.000 | 0.000 | N/A | 0.000
GPT- 3.5: Performance improvement strategies
To enhance the tagging capabilities of GPT-3.5 for all four countries, three of which started from 0% accuracy, we explored two improvement methods: fine-tuning and standards in the prompt.
Fine- tuning
Figure 1 demonstrates that fine- tuning GPT- 3.5 with varying hyperparameters, specifi-
cally the number of training examples and epochs, narrowed the enormous performance
gap between countries; increasing the number of examples and epochs enhanced per-
formance for all countries. Through fine- tuning, the USA and Ireland achieved scores of
0.888 and 0.914, respectively, indicating their applicability to real- world scenarios. GPT-
3.5's lack of familiarity with the Ireland curriculum may affect its performance, initially
falling behind the USA, particularly using 3 epochs with two and three examples and 4
epochs with two examples. However, with more epochs and examples, Ireland's scores
exhibited an upward trajectory similar to that of the USA. For Korea, fine- tuning, espe-
cially with epochs at 10, resulted in scores comparable to the USA and Ireland. In the
case of Maharashtra, fine- tuning improves its baseline scores, but a performance gap still
exists. One possible reason could be the lack of training data for the Marathi language. To
mitigate this further, we augmented the number of training examples to 10 – five sourced
from our training pool and the remaining five synthesized using GPT- 4 (Figure 2). Using
10 examples and 4 epochs yielded a score of 0.386, comparable to the highest score of
0.406 achieved with 5 examples and 10 epochs.

FIGURE 2 GPT-3.5: Fine-tuning with synthetic data to increase Marathi tagging performance.
Standards in the prompt
Adding information to the prompt is an alternative, simpler approach to augmenting the
model with additional information that does not involve any additional training. Using the
‘Skills list’ prompt type from Table 2, GPT- 3.5 was able to tag problems with relevant stand-
ards, even for non- US countries where the model lacks knowledge about the curricula.
Ireland achieved an accuracy close to that of the USA, trailing behind by only 0.028. Despite
both the US and Ireland datasets being in English, the performance difference may arise
from GPT's familiarity with the CCSS and its lack of knowledge about Ireland's curriculum
(Table 3). The gap between the USA and Korea decreased to 0.148. However, Marathi still
exhibited a larger gap compared to other countries. This observation implies a potential limi-
tation in the base GPT- 3.5 model's understanding of Marathi, possibly due to its significantly
lower inclusion in the model's training data (Table 4).

TABLE 4 GPT-3.5: Standards in the prompt.

Measure | USA | Ireland | Korea | Maharashtra
Skill tagging accuracy | 0.490 | 0.462 | 0.342 | 0.068
FIGURE 1 GPT- 3.5: Fine- tuning with varying training examples and epochs.
Llama2- 13B: Performance improvement strategies
Replicating our experiment from the “Base GPT-3.5: Familiarity with National Curricula and Baseline Tagging Performance” section, this time having Llama2-13B list educational standards from the different countries' taxonomies, the model exhibited zero accuracy for all countries, including the USA. However, this may be due to our strict evaluation metric, an exact match
with official documents. We observed that Llama2- 13B demonstrated a basic understand-
ing of the US CCSS. The model provided brief explanations about CCSS and existing skill
codes (eg, 7.EE.A.1) along with their descriptions, although these were not identical to the official standards but only written in a similar fashion. However, considering its inability to gener-
ate exact matches, the model's understanding of CCSS appears to be lower than that of
GPT- 3.5. Moreover, for educational curricula outside the USA, the model consistently pro-
vided highly inaccurate information. This observation can also be validated in an experiment
where we asked Llama2- 13B to generate one standard most relevant to a given problem
without providing any standards in the prompt. The model achieved an accuracy of 0.016
for the US problems and CCSS but 0.000 accuracy for Ireland, Korea and Maharashtra. To
improve Llama2- 13B's performance in tagging problem content from different countries to
skills, we employed the same methods as applied to GPT- 3.5.
Fine- tuning
Figure 3 shows that fine- tuning Llama2- 13B increased the tagging performance for all four
countries. Across all countries, the combination of five training examples and 20 epochs
yielded the highest scores. Notably, Korea achieved its best score of 0.364, comparable to
the USA (0.396). Ireland obtained 0.22 and Maharashtra scored 0.036. With 20 epochs and
a small number of examples, such as two or three, the relationship between the number of
examples and scores is not proportional in the US, Ireland and Korea datasets. Therefore,
we did not increase the number of epochs any further. Similar to GPT, the lower performance
with Marathi may be due to the limited availability of training data for the Marathi language
(<0.005% in Llama2 (Touvron et al., 2023)). To further enhance its lower performance, we
employed 10 training examples, the same set used for GPT- 3.5. With 10 examples and 20
epochs, Marathi achieved the highest score of 0.068 (Figure 4).

FIGURE 3 Llama2-13B: Fine-tuning with varying training examples and epochs.
FIGURE 4 Llama2-13B: Fine-tuning with synthetic data to increase Marathi tagging performance.
Standards in the prompt
When presented with the standards, Llama2- 13B achieved accuracy scores of 0.008 with
the US dataset, 0.010 with Ireland, 0.004 with Korea and 0.000 with Maharashtra. However,
these results may not accurately reflect the model's capability of understanding different
languages and tagging problems; rather, the scores may be due to chance. The model oc-
casionally failed to choose any standards, and when it did, it consistently predicted the same
standard. For example, the model tended to favour ‘4.OA.A.1’ with the US dataset. It some-
times selected multiple standards despite explicit instructions to choose only one.
Comparison of open- source and closed- source LLMs
Performance gap
When comparing the best performance of the two models, a significant gap emerges be-
tween a closed- source and open- source LLM (Figure 5). Specifically, when including stand-
ards in the prompt to both models, the performance gap is most pronounced with the US
dataset (0.482), followed by Ireland (0.452), Korea (0.338) and Maharashtra (0.068). For
fine- tuning, the performance gap is most substantial with the Ireland dataset (0.694), fol-
lowed by the USA (0.492), Korea (0.474) and Maharashtra (0.338). In both techniques, the
gap follows the order of English, Korean and Marathi.

FIGURE 5 Skill tagging performance of the best model strategy for each country: Llama2-13B versus GPT-3.5.
Average score improvement during fine- tuning
Figure 6 shows that fine- tuning leads to a substantial average improvement in GPT- 3.5
over Llama2- 13B. This result indicates that GPT- 3.5 may have higher capabilities for
multilingual tagging tasks. While the extent of improvement varies, fine- tuning consist-
ently led to enhanced accuracy scores across datasets from all countries. Specifically,
both models exhibit a larger average score improvement in all countries with 10 epochs,
and Korean tagging performance improved as much as English. Regarding the tagging
performance of the fine- tuned GPT- 3.5 on the US dataset, it seems that fine- tuning has a
less pronounced impact than Ireland and Korea. This might be because GPT- 3.5 already
has a high baseline score.

FIGURE 6 Average score improvement through fine-tuning: Llama2-13B versus GPT-3.5.
DISCUSSION
We observed a notable bias of LLMs towards a US- based curriculum, which affects our
skill tagging tasks and potentially other educational tasks such as grading and content
creation. These observations align with prior research on a general bias of LLMs towards
American cultures and knowledge (Johnson et al., 2022; Rettberg, 2022). Our results,
obtained by prompting LLMs to list the skills associated with taxonomies from six coun-
tries, show not only a deficit in non- English- speaking countries but also a non- US deficit.
We observed a substantial accuracy difference, with a 0.402 gap between the USA and
England and a 0.675 gap between the USA and other countries. The models' pre- existing
familiarity with the US- based CCSS appears to result in a performance gap in baseline
scores and our proposed strategies across countries, particularly standards in the prompt
and fine- tuning with two or three examples and 3 epochs. However, by fine- tuning with a
greater number of epochs and examples, we were able to overcome this bias and narrow
the gap among countries.
The fine-tuning results suggest that LLMs' bias towards English and the USA can be overcome. We gain the novel insight that fine-tuning with a few examples per skill enhances skill tagging performance, even for countries and languages that are underrepresented in LLMs' training data. Using GPT-3.5, the best fine-tuning practices achieved equitable
tagging results among the USA, Ireland and Korea. For Marathi, we enhanced tagging
performance from zero to half of the US score. The results with Korean and Marathi are
notable, given that both languages are not well represented in the training data, with Korean
constituting only 0.65% and Marathi only 0.02% in the Common Crawl dataset, the primary
training data source for GPT- 3 (Brown et al., 2020). Moreover, we further enhanced the
score for Marathi through fine- tuning using both our training examples and synthetic data
from GPT- 4. Future work may expand on these methods for less- represented countries and
low- resource languages.
Our study offers insights into the performance of LLMs in skill tagging problem content
across different languages with varying resource levels. While our analyses did not en-
compass an extensive number of languages, the selected ones indicate different resource
levels of languages: English (high resource), Korean (medium resource) and Marathi (low re-
source). Although previous studies have suggested that LLMs excel in understanding high-
resource languages, followed by medium- and low- resource languages (Bang et al., 2023;
Lai et al., 2023), our findings suggest that fine- tuning with a few labelled examples can en-
hance tagging performance across all language levels. Moreover, it can notably narrow the
performance gap between high- and medium- resource languages.
We find significant performance disparities between open- source and closed- source
LLMs in the tagging task, confirming previous research suggesting that open- source models
lag behind proprietary ones (Gudibande et al., 2023). However, our fine- tuning results indi-
cate that similar to closed- source models, open- source models can also achieve improved
performance across languages, leading to more equal performance, especially between
the USA and Korea. This result is noteworthy, considering Korean represents only 0.06%
of Llama2's training data, while English comprises 89.7% (Touvron et al., 2023). Moreover,
the open nature of these models empowers communities to contribute and enhance them,
revealing the potential for advancement. Further improvement would be needed, however, for the Llama2 model to reach performance levels of practical utility.
Furthermore, the fine- tuning results demonstrate the potential viability of LLMs, particularly
ChatGPT, as skill classifiers for both English and non- English content and taxonomies. By
adjusting the model with a few labelled examples, GPT-3.5 achieves over 80% agreement with human experts in skill tagging for both English and Korean. LLMs' proficiency in multilingual
skill tagging can particularly benefit countries such as South Korea and Japan, where national
curricula are regularly updated, facilitating the realignment of textbooks based on the new cur-
ricula. Similarly, for educational platforms adopting new taxonomies, LLMs can significantly
ease the burden of reorganizing their content. These models can also assist teachers in identifying content relevant to the curriculum or learning objectives of their instruction.
LIMITATIONS AND FUTURE WORK
This study has three limitations that can guide future research. First, our focus was exclu-
sively on evaluating the performance of LLMs in the skill tagging context. To promote a more
equitable and inclusive application of LLMs, future work could extend its scope to investigate
other educational tasks, such as essay grading or content creation from different countries
and languages. These tasks not only require LLMs' proficiency in understanding diverse
languages and educational curricula but may also demand a grasp of cultural context.
Second, our observations regarding open- source models suggest the potential for en-
hancement through searching for optimal hyperparameters and trying more advanced open-
source models. Unlike proprietary models that restrict users from selecting hyperparameter
values during the fine- tuning process, open- source models allow users to experiment with
various values for hyperparameters, such as learning rate and batch size. Furthermore,
while we used the Llama2- 13B model in this study, future work could assess the perfor-
mance of larger models, such as Llama2- 70B.
Third, the focus of this paper was on the divides across different languages on the task
of skill tagging with open- source and closed- source LLMs. Therefore, we did not compare
to other non- LLM approaches to skill tagging. We observed, however, that using the same
English- language Khan Academy problem set, our GPT- 3.5 evaluation scored 0.272, com-
pared to the state- of- the- art classification model reported in Li et al. (2024), which scored
0.223. Future work can include a more rigorous comparison of LLM and non- LLM ap-
proaches to the skill tagging task.
SUMMARY AND CONCLUSIONS
Through fine-tuning, GPT-3.5 achieved performance increases across all languages, effectively eliminating the US performance gap with Ireland and Korea with 10 epochs and 5 examples per skill. Its accuracy in tagging US problems to CCSS increased by 3.3 times. Ireland achieved a score 3% higher than the USA from a baseline of zero, and Korea's best score differed from the USA's by only 6%. Marathi scored half of what the USA achieved.
Llama2- 13B's multilingual tagging capabilities also increased through fine- tuning. Its ac-
curacy in tagging US problems to CCSS increased by 25 times. Notably, Korea showed an
8% difference from the USA, Ireland scored 44% lower than the USA and Marathi achieved
only 1/5th of the USA score.
Previous research suggested a capabilities gap between open- source LLMs and their
closed- source counterparts, which is challenging to bridge (Gudibande et al., 2023). We
also observed this in our skill tagging task. However, with fine-tuning, we substantially narrowed the performance gap in skill tagging applied to the US dataset. Initially, the base GPT-3.5 accuracy
(0.272) was 17 times higher than base Llama2- 13B (0.016), but after fine- tuning, GPT's ac-
curacy (0.888) was only 2 times higher than fine- tuned Llama2 (0.396).
In conclusion, our research demonstrates the ability of LLMs to overcome language and
country biases and enhance inclusivity in performance on an educational task. As educa-
tion technology firms and research labs proceed to experiment and innovate with LLMs in
educational settings, our work emphasizes the critical consideration of performance gener-
alizability across different cultural contexts.
FUNDING INFORMATION
None.
CONFLICT OF INTEREST STATEMENT
None.
DATA AVAILABILITY STATEMENT
Data used in this research are content mapped to different countries' educational curricula
and can be accessed publicly. There are no specific conditions or restrictions on accessing
these data. These data are available through the following link: USA, Ireland, South Korea
and India- Maharashtra.
ETHICS STATEMENT
This study did not involve human subjects as participants and, therefore, did not undergo
ethical review by an institutional ethics committee.
ORCID
Yerin Kwak https://orcid.org/0009-0008-2897-9042
Zachary A. Pardos https://orcid.org/0000-0002-6016-7051
ENDNOTES
1 https://www.khanacademy.org/standards/CCSS.Math/K.CC.
2 https://learning.ccsso.org/common-core-state-standards-initiative/.
3 https://www.curriculumonline.ie/Home/.
4 https://ncic.re.kr/.
5 https://www.maa.ac.in/index.php?tcf=learningoutcomes.
6 https://www.gov.uk/government/publications/national-curriculum-in-england-mathematics-programmes-of-study.
7 https://www.mext.go.jp/a_menu/shotou/new-cs/index.htm.
REFERENCES
Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, E., Heslow, D.,
Launay, J., Malartic, Q., & Noune, B. (2023). Falcon- 40B: An open large language model with state- of- the-
art performance. Findings of the Association for Computational Linguistics: ACL, 2023, 10755–10773.
Anderson, J. R., Corbett, A. T., Koedinger, K. R., & Pelletier, R. (1995). Cognitive tutors: Lessons learned. Journal
of the Learning Sciences, 4(2), 167–207.
Baker, R. S., & Hawn, A. (2021). Algorithmic bias in education. International Journal of Artificial Intelligence in Education, 2021, 1–41.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., & Do, Q. V.
(2023). A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
BigScience Workshop, Le Scao, T., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A.
S., Yvon, F., Gallé, M., & Tow, J. (2022). Bloom: A 176b- parameter open- access multilingual language model.
arXiv preprint arXiv:2211.05100.
Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one- to- one
tutoring. Educational Researcher, 13(6), 4–16.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., & Agarwal, S. (2020). Language models are few- shot learners. Advances in Neural Information
Processing Systems, 33, 1877–1901.
Chen, W., & Wellman, B. (2004). The global digital divide – within and between countries. IT & Society, 7, 39–45.
Deng, Y., Zhang, W., Pan, S. J., & Bing, L. (2023). Multilingual jailbreak challenges in large language models.
arXiv preprint arXiv:2310.06474.
Desmarais, M. C., Beheshti, B., & Naceur, R. (2012). Item to skills mapping: Deriving a conjunctive q- matrix from
data. In Intelligent Tutoring Systems: 11th International Conference, ITS 2012, Chania, Crete, Greece, June
14-18, 2012 (pp. 454– 463). Springer.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). Qlora: Efficient finetuning of quantized LLMs.
arXiv preprint arXiv:2305.14314.
Dodge, J., Sap, M., Marasović, A ., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021).
Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758.
Ferrara, E. (2023). Should chatgpt be biased? Challenges and risks of bias in large language models. arXiv pre-
print arXiv:2304.03738.
Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). Gptq: Accurate post- training quantization for genera-
tive pre- trained transformers. arXiv preprint arXiv:2210.17323.
Gondwe, G. (2023). CHATGPT and the global south: How are journalists in sub- Saharan Africa engaging with
generative AI? Online Media and Global Communication, 2(2), 228–249.
Gudibande, A ., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., & Song, D. (2023). The false prom-
ise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717.
Hornyak, T. (2023). Why Japan is building its own version of ChatGPT. Nature. Advance online publication. https://doi.org/10.1038/d41586-023-02868-z
Johnson, R. L., Pistilli, G., Menédez- González, N., Duran, L. D. D., Panai, E., Kalpokiene, J., & Bertulfo, D.
J. (2022). The ghost in the machine has an American accent: value conflict in GPT- 3. arXiv preprint
arXiv:2203.07785.
Kadous, W. (2023). Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper. Anyscale. https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G.,
Günnemann, S., Hüllermeier, E., & Krusche, S. (2023). ChatGPT for good? On opportunities and challenges
of large language models for education. Learning and Individual Differences, 103, 102274.
Koedinger, K. R., Corbett, A. T., & Perfetti, C. (2012). The knowledge- learning- instruction framework: Bridging the
science- practice chasm to enhance robust student learning. Cognitive Science, 36(5), 757–798.
Lai, V. D., Ngo, N. T., Veyseh, A. P. B., Man, H., Dernoncourt, F., Bui, T., & Nguyen, T. H. (2023). Chatgpt beyond
English: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv pre-
print arXiv:2304.05613.
Lang, C., Siemens, G., Wise, A., & Gasevic, D. (2017). Handbook of learning analytics. Solar, Society for Learning
Analytics and Research.
Li, Z., Pardos, Z. A., & Ren, C. (2024). Aligning open educational resources to new taxonomies: How AI technol-
ogies can help and in which scenarios. Computers & Education, 216, 105027.
Li, Z., Ren, C., Li, X., & Pardos, Z. A. (2021). Learning skill equivalencies across platform taxonomies. In LAK21: 11th International Learning Analytics and Knowledge Conference (pp. 354–363).
Liao, Q. V., & Vaughan, J. W. (2023). AI transparency in the age of LLMs: A human- centered research roadmap.
arXiv preprint arXiv:2306.01941.
Liu, P., Yuan, W., Jinlan, F., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre- train, prompt, and predict: A system-
atic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1–35.
Min, S., Lewis, M., Zettlemoyer, L., & Hajishirzi, H. (2021). Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943.
Nguyen, H. A., Stec, H., Hou, X., Di, S., & McLaren, B. M. (2023). Evaluating ChatGPT's decimal skills and feedback generation in a digital learning game. In European Conference on Technology Enhanced Learning (pp. 278–293). Springer.
OpenAI. (2023). GPT guide—OpenAI. https://platform.openai.com/docs/guides/gpt
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., & Schulman, J. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
Pardos, Z. A., & Bhandari, S. (2023). Learning gain differences between ChatGPT and human tutor generated algebra hints. arXiv preprint arXiv:2302.06871.
Pardos, Z. A., & Dadu, A. (2017). Imputing KCs with representations of problem content and context. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization (pp. 148–155).
Pardos, Z. A., Tang, M., Anastasopoulos, I., Sheel, S. K., & Zhang, E. (2023). OATutor: An open-source adaptive tutoring system and curated content library for learning sciences research. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1–17).
Ramezani, A., & Xu, Y. (2023). Knowledge of cultural moral norms in large language models. arXiv preprint arXiv:2306.01857.
Razzaq, L., Heffernan, N. T., Feng, M., & Pardos, Z. A. (2007). Developing fine-grained transfer models in the ASSISTment system. Technology, Instruction, Cognition and Learning, 5, 3.
Rettberg, J. W. (2022). ChatGPT is multilingual but monocultural, and it's learning your values. jill/txt. https://jilltxt.net/right-now-chatgpt-is-multilingual-but-monocultural-but-its-learning-your-values/
Shen, J. T., Yamashita, M., Prihar, E., Heffernan, N., Wu, X., Graff, B., & Lee, D. (2021). Mathbert: A pre-trained language model for general NLP tasks in mathematics education. arXiv preprint arXiv:2106.07340.
Shi, Y., Schmucker, R., Chi, M., Barnes, T., & Price, T. (2023). KC-Finder: Automated knowledge component discovery for programming problems. In Proceedings of the 16th International Conference on Educational Data Mining (EDM).
Sivarajkumar, S., Kelley, M., Samolyk-Mazzanti, A., Visweswaran, S., & Wang, Y. (2023). An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing. arXiv preprint arXiv:2309.08008.
Stamper, J., & Pardos, Z. A. (2016). The 2010 KDD cup competition dataset: Engaging the machine learning community in predictive learning analytics. Journal of Learning Analytics, 3(2), 312–316.
Supraja, S., Hartman, K., Tatinati, S., & Khong, A. W. H. (2017). Toward the automatic labeling of course questions
for ensuring their alignment with learning outcomes. International Educational Data Mining Society.
Ta, R., & Lee, N. T. (2023). How language gaps constrain generative AI development. International Journal of Comparative Studies in International Relations and Development, 9, 48–52.
Tao, Y., Viberg, O., Baker, R. S., & Kizilcec, R. F. (2023). Auditing and mitigating cultural bias in LLMs. arXiv preprint arXiv:2311.14096.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., & Bikel, D. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Van Dijk, J. A. G. M. (2006). Digital divide research, achievements and shortcomings. Poetics, 34, 221–235.
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
How to cite this article: Kwak, Y., & Pardos, Z. A. (2024). Bridging large language model disparities: Skill tagging of multilingual educational content. British Journal of Educational Technology, 55, 2039–2057. https://doi.org/10.1111/bjet.13465