Source publication
The adoption of large language models (LLMs) in education holds much promise. However, like many technological innovations before them, adoption and access can often be inequitable from the outset, creating more divides than they bridge. In this paper, we explore the magnitude of the country and language divide in the leading open‐source and propri...
Similar publications
Multilingual large language models (LLMs) have greatly increased the ceiling of performance on non-English tasks. However, the mechanisms behind multilingualism in these LLMs are poorly understood. Of particular interest is the degree to which internal representations are shared between languages. Recent work on neuron analysis of LLMs has focused o...
Large language models (LLMs) such as ChatGPT play a crucial role in guiding critical decisions nowadays, such as in choosing a college major. Therefore, it is essential to assess the limitations of these models’ recommendations and understand any potential biases that may mislead human decisions. In this study, I investigate bias in terms of GPT-3....
Word Sense Induction (WSI) is the task of discovering senses of an ambiguous word by grouping usages of this word into clusters corresponding to these senses. Many approaches were proposed to solve WSI in English and a few other languages, but these approaches are not easily adaptable to new languages. We present multilingual substitution-based WSI...
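A minimal sketch of the substitution-based idea described above, assuming a masked language model as the substitute generator; the model name, example sentences, and cluster count are illustrative, not the paper's setup.

```python
# Substitution-based WSI sketch: represent each usage of an ambiguous word by
# its top lexical substitutes, then cluster usages into approximate senses.
from collections import Counter

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")  # multilingual MLM

usages = [
    "He sat on the bank of the river and watched the water.",
    "She deposited the check at the bank on Monday.",
    "The bank raised its interest rates again.",
    "Grass covered the bank where the stream bends.",
]

# Mask the target word and collect the model's top substitutes for each usage.
substitute_bags = []
for sentence in usages:
    masked = sentence.replace("bank", fill_mask.tokenizer.mask_token, 1)
    preds = fill_mask(masked, top_k=10)
    substitute_bags.append(" ".join(p["token_str"].strip() for p in preds))

# Cluster usages by substitute overlap; each cluster approximates one sense.
vectors = TfidfVectorizer().fit_transform(substitute_bags).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)

for label, bag in zip(labels, substitute_bags):
    print(label, Counter(bag.split()).most_common(3))
```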
Most research on the automated processing of collocations has focused on the use of association measures. However, the focus has slowly been shifting towards exploring the effectiveness of neural language models (NLMs). In this article, we investigate the latter method...
Citations
... Escalante, Pack and Barrett (2023). Finally, LLMs face technical and cultural challenges. Models trained predominantly on English-language data have limited applicability in under-represented languages and cultures (Kwak & Pardos, 2024). Moreover, tools such as ChatGPT frequently generate fictitious content, known as "hallucination", compromising the reliability of their responses (Mohapatra et al., 2023). ...
... The results found in the four macro-themes analysed are described below. 3.1 DEVELOPMENT OF TOOLS AND PERSONALIZED LEARNING WITH LLMS Large Language Models (LLMs) have been transforming education by enhancing learning, personalized tutoring, and administrative tasks (Kwak & Pardos, 2024; Bratić et al., 2024). Misiejuk, Kaliisa and Scianna (2024) point out that AI tools are revolutionizing teaching and communication by providing detailed data on students' performance and behaviour. ...
This article analyses the use of Large Language Models (LLMs) in adaptive and personalized education, using the integrative literature review method. The study explores the benefits, challenges, and future perspectives of these technologies, with emphasis on their application in personalizing learning and in developing advanced educational tools. The six steps of the integrative review method were followed: definition of the topic, inclusion and exclusion criteria, identification and selection of studies, categorization, analysis and interpretation of results, and presentation of the knowledge synthesis. The research considered articles published between 2023 and 2024, drawn from the Scopus and Web of Science databases, resulting in 29 studies analysed. The thematic analysis identified four main themes: development of tools and personalized learning with LLMs; positive impacts on education; challenges and limitations in the use of LLMs; and future perspectives. Among the challenges, ethical issues, unequal access, and technical limitations stand out. Finally, the article proposes new research directions for integrating LLMs effectively into the educational environment, contributing to methodological and technological advances in the field.
... Recent advancements in natural language processing have opened new avenues for automating the process of identifying and labeling LOs. For instance, [21] uses GPT-3 to classify questions with LOs from OpenStax textbooks, while [9] explores the abilities of LLMs in tagging multilingual problem content with the appropriate skill from a taxonomy. LLMs have also shown promising results in various educational tasks, including automated grading, intelligent tutoring systems, and student modeling [10,17,11,16]. ...
This paper introduces a novel approach to create a high-resolution "map" for physics learning: an "atomic" learning objectives (LOs) system designed to capture detailed cognitive processes and concepts required for problem solving in a college-level introductory physics course. Our method leverages Large Language Models (LLMs) for automated labeling of physics questions and introduces a comprehensive set of metrics to evaluate the quality of the labeling outcomes. The atomic LO system, covering nine chapters of an introductory physics course, uses a "subject-verb-object" structure to represent specific cognitive processes. We apply this system to 131 questions from expert-curated question banks and the OpenStax University Physics textbook. Each question is labeled with 1-8 atomic LOs across three chapters. Through extensive experiments using various prompting strategies and LLMs, we compare automated LO labeling results against human expert labeling. Our analysis reveals both the strengths and limitations of LLMs, providing insight into LLMs' reasoning processes for labeling LOs and identifying areas for improvement in LO system design. Our work contributes to the field of learning analytics by proposing a more granular approach to mapping learning objectives to questions. Our findings have significant implications for the development of intelligent tutoring systems and personalized learning pathways in STEM education, paving the way for more effective "learning GPS" systems.
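As a concrete illustration of this kind of automated labeling pipeline, a minimal sketch follows, assuming the OpenAI Python client; the prompt wording, model choice, and candidate LO list are illustrative assumptions, not the authors' exact protocol.

```python
# Sketch: ask an LLM to tag a physics question with atomic LOs drawn from a
# fixed "subject-verb-object" list, as in the labeling task described above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CANDIDATE_LOS = [  # hypothetical atomic LOs for illustration
    "student applies Newton's second law",
    "student decomposes forces into components",
    "student computes net torque about a pivot",
]

def label_question(question: str) -> str:
    """Ask the model to pick the applicable atomic LOs for one question."""
    prompt = (
        "Label the physics question below with the applicable learning "
        "objectives, chosen ONLY from this list:\n"
        + "\n".join(f"- {lo}" for lo in CANDIDATE_LOS)
        + f"\n\nQuestion: {question}\nReturn one objective per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labeling eases comparison with experts
    )
    return response.choices[0].message.content

print(label_question(
    "A 2 kg block slides down a frictionless 30-degree incline; "
    "find its acceleration."
))
```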
... GenAI tools are not transparent and produce unexplainable output (Miao & Holmes, 2023). Consequently, GenAI tools may be implemented in inequitable ways and contain unknown biases (Hacker et al., 2024; Kwak & Pardos, 2024). The risks of GenAI tools in education also relate to academic integrity and ownership of work (Cotton et al., 2023; Perkins, 2023; Rudolph et al., 2023), as well as the undetectability of GenAI outputs in submitted student work (Chaka, 2024; Weber-Wulff et al., 2023). ...
This scoping review examines the relationship between Generative AI (GenAI) and agency in education, analyzing the literature available through the lens of Critical Digital Pedagogy. Following PRISMA-ScR guidelines, we collected 10 studies from academic databases focusing on both learner and teacher agency in GenAI-enabled environments. We conducted an AI-supported hybrid thematic analysis that revealed three key themes: Control in Digital Spaces, Variable Engagement and Access, and Changing Notions of Agency. The findings suggest that while GenAI may enhance learner agency through personalization and support, it also risks exacerbating educational inequalities and diminishing learner autonomy in certain contexts. This review highlights gaps in the current research on GenAI's impact on agency. These findings have implications for educational policy and practice, suggesting the need for frameworks that promote equitable access while preserving learner agency in GenAI-enhanced educational environments.
... We can obtain relatively reliable answers if there is enough data from a certain discipline and sufficient training. However, if the question is complex and the training data are insufficient, the answers may be incorrect, resulting in what are called hallucinations or biases (Kwak & Pardos, 2024; Williams, 2024). For example, in a previous study, we asked ChatGPT 4 to create an image of a NaCl model (Feldman-Maggor et al., 2024b) and received an inaccurate representation (Figure 1). ...
This paper discusses the ethical considerations surrounding generative artificial intelligence (GenAI) in chemistry education, aiming to guide teachers toward responsible AI integration. GenAI, driven by advanced AI models like Large Language Models, has shown substantial potential in generating educational content. However, this technology’s rapid rise has brought forth ethical concerns regarding general and educational use that require careful attention from educators. The UNESCO framework on GenAI in education provides a comprehensive guide to controversies around generative AI and ethical educational considerations, emphasizing human agency, inclusion, equity, and cultural diversity. Ethical issues include digital poverty, lack of national regulatory adaptation, use of content without consent, unexplainable models used to generate outputs, AI-generated content polluting the internet, lack of understanding of the real world, reducing diversity of opinions, and further marginalizing already marginalized voices and generating deep fakes. The paper delves into these eight controversies, presenting relevant examples from chemistry education to stress the need to evaluate AI-generated content critically. The paper emphasizes the importance of relating these considerations to chemistry teachers’ content and pedagogical knowledge and argues that responsible AI usage in education must integrate these insights to prevent the propagation of biases and inaccuracies. The conclusion stresses the necessity for comprehensive teacher training to effectively and ethically employ GenAI in educational practices.
... Another approach to mitigate cultural bias is to fine-tune models on culturally relevant data. This can improve cultural alignment [41,34] but requires resources that render this approach accessible to only a few. For example, AI Sweden released a Swedish version of GPT, and the government of Japan started development of a Japanese version of ChatGPT to address cultural and linguistic bias [26]. ...
Culture fundamentally shapes people's reasoning, behavior, and communication. As people increasingly use generative artificial intelligence (AI) to expedite and automate personal and professional tasks, cultural values embedded in AI models may bias people's authentic expression and contribute to the dominance of certain cultures. We conduct a disaggregated evaluation of cultural bias for five widely used large language models (OpenAI's GPT-4o/4-turbo/4/3.5-turbo/3) by comparing the models' responses to nationally representative survey data. All models exhibit cultural values resembling English-speaking and Protestant European countries. We test cultural prompting as a control strategy to increase cultural alignment for each country/territory. For recent models (GPT-4, 4-turbo, 4o), this improves the cultural alignment of the models' output for 71-81% of countries and territories. We suggest using cultural prompting and ongoing evaluation to reduce cultural bias in the output of generative AI.
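A minimal sketch of what the cultural prompting strategy described above can look like in practice, assuming the OpenAI Python client; the persona wording and survey item are illustrative, not the paper's exact instrument.

```python
# Cultural prompting sketch: elicit a survey answer with and without a
# country-specific persona, to compare baseline vs. culturally prompted output.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def survey_response(item: str, country: str | None = None) -> str:
    """Answer a values-survey item, optionally as a persona from a country."""
    messages = []
    if country:
        messages.append({
            "role": "system",
            "content": (
                f"You are an average person born and living in {country}. "
                "Answer from that cultural perspective."
            ),
        })
    messages.append({"role": "user", "content": item})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

item = ("On a scale of 1-10, how justifiable is it to avoid a fare "
        "on public transport?")
print(survey_response(item))           # baseline cultural stance of the model
print(survey_response(item, "Japan"))  # culturally prompted stance
```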
... Hint generation has also been a focus, with researchers examining the effectiveness of LLMs in providing hints (i.e., worked solutions) to support learning in mathematics [40,41], computer programming [42,43,44,45,46,47], and various other STEM subjects. Additionally, studies have investigated human-AI collaboration in skill tagging, assessing its effectiveness across multiple languages and its speed and accuracy [48,49]. However, unlike these areas, the topic of using LLMs to simulate respondents remains under-researched. ...
Effective educational measurement relies heavily on the curation of well-designed item pools (i.e., possessing the right psychometric properties). However, item calibration is time-consuming and costly, requiring a sufficient number of respondents for the response process. We explore using six different LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus) and various combinations of them using sampling methods to produce responses with psychometric properties similar to human answers. Results show that some LLMs have comparable or higher proficiency in College Algebra than college students. No single LLM mimics human respondents due to narrow proficiency distributions, but an ensemble of LLMs can better resemble college students' ability distribution. The item parameters calibrated by LLM-Respondents have high correlations with their human-calibrated counterparts (e.g., > 0.8 for GPT-3.5) and closely resemble the parameters of the human subset (e.g., 0.02 Spearman correlation difference). Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
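The headline comparison above reduces to a rank correlation between two sets of calibrated item parameters. A minimal sketch with made-up placeholder difficulty values:

```python
# Compare item difficulties calibrated on human respondents vs. on
# LLM-simulated respondents via Spearman rank correlation.
import numpy as np
from scipy.stats import spearmanr

# IRT difficulty parameters for the same items, calibrated on two populations
# (placeholder numbers for illustration only).
human_difficulty = np.array([-1.2, -0.4, 0.1, 0.8, 1.5, 2.0])
llm_difficulty = np.array([-1.0, -0.5, 0.3, 0.7, 1.7, 1.9])

rho, p_value = spearmanr(human_difficulty, llm_difficulty)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```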
... The resulting fine-tuned model retains the extensive knowledge embedded in the base model and additionally incorporates domain-specific information. In the education context, this method has been applied to improve automatic assessment scoring (Latif & Zhai, 2024), to support math tutors in remediating students' mistakes, to assess personal qualities in college admission essays (Lira et al., 2023), and to reduce performance disparities in math problem skill tagging tasks across different languages (Kwak & Pardos, 2024). ...
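A minimal sketch of this fine-tuning recipe, using the Hugging Face Trainer; the base model, toy corpus, and hyperparameters are illustrative assumptions, not the setup of any cited study.

```python
# Continue training an open-source base model on a small domain corpus so it
# retains general knowledge while absorbing domain-specific patterns.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # stand-in for any open-source base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny in-memory corpus standing in for domain-specific education data,
# e.g. math problems paired with curriculum skill labels.
corpus = Dataset.from_dict({"text": [
    "Problem: 3x + 5 = 11. Skill: solving linear equations.",
    "Problem: area of a circle with r = 2. Skill: circle area formula.",
]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-education", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # base knowledge is retained; domain patterns are added
```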
... First, there is a need for education-focused benchmark datasets that better represent a broader range of sociodemographic groups across the world, especially considering that applications like Khanmigo are expected to be used by a diverse group of students and teachers (Gallegos et al., 2024). Additionally, there is a need for high-quality education datasets for pre-training and fine-tuning models (Kwak & Pardos, 2024; Li et al., 2023; Lozhkov et al., 2024). Second, there is a need to develop a specific taxonomy of harms for LLMs in educational contexts that promotes responsible use and highlights the perspectives of educators, students, and their families. ...
Large language models (LLMs) are increasingly adopted in educational contexts to provide personalized support to students and teachers. The unprecedented capacity of LLM‐based applications to understand and generate natural language can potentially improve instructional effectiveness and learning outcomes, but the integration of LLMs in education technology has renewed concerns over algorithmic bias, which may exacerbate educational inequalities. Building on prior work that mapped the traditional machine learning life cycle, we provide a framework of the LLM life cycle from the initial development of LLMs to customizing pre‐trained models for various applications in educational settings. We explain each step in the LLM life cycle and identify potential sources of bias that may arise in the context of education. We discuss why current measures of bias from traditional machine learning fail to transfer to LLM‐generated text (eg, tutoring conversations) because text encodings are high‐dimensional, there can be multiple correct responses, and tailoring responses may be pedagogically desirable rather than unfair. The proposed framework clarifies the complex nature of bias in LLM applications and provides practical guidance for their evaluation to promote educational equity.
Practitioner notes
What is already known about this topic
The life cycle of traditional machine learning (ML) applications which focus on predicting labels is well understood.
Biases are known to enter in traditional ML applications at various points in the life cycle, and methods to measure and mitigate these biases have been developed and tested.
Large language models (LLMs) and other forms of generative artificial intelligence (GenAI) are increasingly adopted in education technologies (EdTech), but current evaluation approaches are not specific to the domain of education.
What this paper adds
A holistic perspective of the LLM life cycle with domain‐specific examples in education to highlight opportunities and challenges for incorporating natural language understanding (NLU) and natural language generation (NLG) into EdTech.
Potential sources of bias are identified in each step of the LLM life cycle and discussed in the context of education.
A framework for understanding where to expect potential harms of LLMs for students, teachers, and other users of GenAI technology in education, which can guide approaches to bias measurement and mitigation.
Implications for practice and/or policy
Education practitioners and policymakers should be aware that biases can originate from a multitude of steps in the LLM life cycle, and the life cycle perspective offers them a heuristic for asking technology developers to explain each step to assess the risk of bias.
Measuring the biases of systems that use LLMs in education is more complex than with traditional ML, in large part because the evaluation of natural language generation is highly context‐dependent (eg, what counts as good feedback on an assignment varies).
EdTech developers can play an important role in collecting and curating datasets for the evaluation and benchmarking of LLM applications moving forward.
... The details of such results in a specific situation can lay out important and challenging values-based questions for institutional leaders to consider as they determine their equity and inclusion priorities. Kwak and Pardos (2024) present the first set of empirical results on mitigating disparities in LLMs' representation of educationally relevant knowledge (in this case, K-12 state curricula taxonomies) across countries and languages, as well as their use to tag problem content, a critical task for aligning open educational resources and tutoring content with state curricula. Acknowledging that LLMs perform best in English and exhibit a bias towards US content, in their study they investigate the extent of the country and language divide in two large language models: ChatGPT 3.5 (a closed-source model) and LLaMA-2-13B (an open-source model). ...
... The ABROCA metric modification to support assessment of intersectional fairness further offers a good example of how machine learning methods can be usefully adapted to reflect concepts central to equity considerations. The next paper (Kwak & Pardos, 2024) demonstrates the dramatic effects that focused attention on fine-tuning can have on improving model performance for minority and underrepresented languages and regions. The fourth paper (Bayer et al., 2024) is impactful in shifting the lens of focus from student degree attainment to institutional degree awarding, and in working with stakeholders as part of the dashboard development process. ...
A key goal of educational institutions around the world is to provide inclusive, equitable quality education and lifelong learning opportunities for all learners. Achieving this requires contextualized approaches to accommodate diverse global values and promote learning opportunities that best meet the needs and goals of all learners as individuals and members of different communities. Advances in learning analytics (LA), natural language processing (NLP), and artificial intelligence (AI), especially generative AI technologies, offer potential to aid educational decision making by supporting analytic insights and personalized recommendations. However, these technologies also raise serious risks for reinforcing or exacerbating existing inequalities; these dangers arise from multiple factors including biases represented in training datasets, the technologies' abilities to take autonomous decisions, and processes for tool development that do not centre the needs and concerns of historically marginalized groups. To ensure that Educational Decision Support Systems (EDSS), particularly AI‐powered ones, are equipped to promote equity, they must be created and evaluated holistically, considering their potential for both targeted and systemic impacts on all learners, especially members of historically marginalized groups. Adopting a socio‐technical and cultural perspective is crucial for designing, deploying, and evaluating AI‐EDSS that truly advance educational equity and inclusion. This editorial introduces the contributions of five papers for the special section on advancing equity and inclusion in educational practices with AI‐EDSS. These papers focus on (i) a review of biases in large language model (LLM) applications that offers practical guidelines for their evaluation to promote educational equity, (ii) techniques to mitigate disparities across countries and languages in LLMs' representation of educationally relevant knowledge, (iii) implementing equitable and intersectionality‐aware machine learning applications in education, (iv) introducing a LA dashboard that aims to promote institutional equality, diversity, and inclusion, and (v) vulnerable student digital well‐being in AI‐EDSS. Together, these contributions underscore the importance of an interdisciplinary approach in developing and utilizing AI‐EDSS to not only foster a more inclusive and equitable educational landscape worldwide but also reveal a critical need for a broader contextualization of equity that incorporates the socio‐technical questions of what kinds of decisions AI is being used to support, for what purposes, and whose goals are prioritized in this process.
... The resulting fine-tuned model retains the extensive knowledge embedded in the base model and additionally incorporates domain-specific information. In the education context, this method has been applied to improve automatic assessment scoring (Latif & Zhai, 2024), to support math tutors in remediating students' mistakes, to assess personal qualities in college admission essays (Lira et al., 2023), and to reduce performance disparities in math problem skill tagging tasks across different languages (Kwak & Pardos, 2024). Another potential technique for customizing an LLM is preference tuning using RLHF or Direct Preference Optimization (DPO). ...
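For reference, the standard DPO objective (Rafailov et al., 2023) fits the policy directly to preference pairs without training a separate reward model; this is the general formula, not a method used in the cited studies:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the tuned policy may drift from the reference.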
... Additionally, there is a need for high-quality education datasets for pre-training and fine-tuning models (Y. Li et al., 2023; Kwak & Pardos, 2024; Lozhkov et al., 2024). Second, there is a need to develop a specific taxonomy of harms for LLMs in educational contexts that promotes responsible use and highlights the perspectives of educators, students, and their families. ...
Large Language Models (LLMs) are increasingly adopted in educational contexts to provide personalized support to students and teachers. The unprecedented capacity of LLM-based applications to understand and generate natural language can potentially improve instructional effectiveness and learning outcomes, but the integration of LLMs in education technology has renewed concerns over algorithmic bias, which may exacerbate educational inequities. In this review, building on prior work on mapping the traditional machine learning life cycle, we provide a holistic map of the LLM life cycle from the initial development of LLMs to customizing pre-trained models for various applications in educational settings. We explain each step in the LLM life cycle and identify potential sources of bias that may arise in the context of education. We discuss why current measures of bias from traditional machine learning fail to transfer to LLM-generated content in education, such as tutoring conversations, because the text is high-dimensional, there can be multiple correct responses, and tailoring responses may be pedagogically desirable rather than unfair. This review aims to clarify the complex nature of bias in LLM applications and provide practical guidance for their evaluation to promote educational equity.
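One way to make the high-dimensionality point concrete: since generated text has no single correct label, group differences can be probed in embedding space rather than through label-parity metrics. A minimal sketch follows, with an illustrative encoder and made-up responses; this is not a measure proposed by the paper.

```python
# Embed LLM-generated tutoring responses for two (hypothetical) student groups
# and compare within-group vs. across-group similarity of the generated text.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Responses the same system generated for two groups (placeholder examples).
group_a = ["Great try! Let's revisit how to isolate x on one side.",
           "You're close; check the sign when you subtract 5."]
group_b = ["Incorrect. Redo the problem.",
           "Wrong answer. Try again."]

emb_a = encoder.encode(group_a, normalize_embeddings=True)
emb_b = encoder.encode(group_b, normalize_embeddings=True)

# With normalized embeddings, dot products are cosine similarities; averages
# here include self-pairs for simplicity of the sketch.
within = (emb_a @ emb_a.T).mean()
across = (emb_a @ emb_b.T).mean()
print(f"within-group similarity {within:.2f} vs across-group {across:.2f}")
```

A large within-vs-across gap would flag systematically different responses for the two groups, which is only a starting point: as the review argues, tailored responses can be pedagogically desirable, so such signals require contextual interpretation rather than automatic flagging as unfair.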