Summary results of the Pearson's correlation analysis.


Context in source publication

Context 1
... order to validate the prior observations, the data were subjected to further investigation using Pearson's correlation analysis. The results demonstrated a strong positive correlation between ratings, particularly those generated by the chatbot, as shown in Table 4. ...
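As a minimal sketch of the analysis summarized above, the snippet below computes a Pearson correlation between two sets of ratings; the rating values and variable names are illustrative placeholders, not the study's data.

```python
# Minimal sketch: Pearson's correlation between human and chatbot essay ratings.
# The rating values below are illustrative placeholders, not the study's data.
from scipy.stats import pearsonr

human_ratings = [72, 65, 88, 54, 91, 77, 69, 83, 60, 75]
chatbot_ratings = [70, 68, 85, 58, 93, 74, 71, 80, 63, 78]

r, p_value = pearsonr(human_ratings, chatbot_ratings)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
```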

Citations

... ChatGPT, a prominent GenAI tool, has been envisioned as an upgraded version of traditional Automated Writing Evaluation (AWE) systems (Bucol & Sangkawong, 2024; Shi et al., 2025). Unlike AWE tools, such as Grammarly, Criterion, and Pigai, which use natural language processing and AI to provide feedback on grammar, style, and coherence, ChatGPT is based on large language models (LLMs) and offers more comprehensive feedback. ...
Article
This study explores the longitudinal reciprocal associations between task enjoyment, task anxiety, and task boredom, and their interactions in relation to artificial intelligence (AI)-assisted L2 writing achievement within a data-driven learning context. Drawing on the experience sampling method, data were collected from 67 participants across four sessions of an L2 academic writing workshop, where they rated their emotions in real time during task-based activities every 10 minutes, leading to a total of 1,588 observations. Dynamic structural equation modeling (DSEM) revealed that task enjoyment increased, while boredom and anxiety decreased over time. At the within-individual level, task anxiety showed the highest stability, followed by task enjoyment, with task boredom being the least persistent. Findings also indicated that task emotions exerted significant reciprocal influences on each other, with stronger impacts of task enjoyment. At the between-individual level, learners with higher levels of task enjoyment and boredom did not necessarily carry these emotions over into future sessions, whereas learners with higher task anxiety tended to maintain this emotional state over time. Additionally, higher levels of task enjoyment and lower levels of task anxiety and boredom predicted better L2 writing outcomes, with the persistence of these emotions playing a crucial role. This research provides insights into the emotional dynamics of task-based learning, offering practical implications for designing data-driven learning environments that foster positive emotions.
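The "stability" (inertia) estimates mentioned above correspond to within-person autoregressive effects. The sketch below illustrates that idea only in a simplified form, with simulated data and a pooled AR(1) slope estimated by ordinary least squares; full DSEM is typically fit in specialized software such as Mplus, not reproduced here.

```python
# Simplified illustration of the within-person autoregressive ("inertia")
# component that DSEM estimates. All data below are simulated, not the
# study's observations; real DSEM is typically fit in Mplus.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_obs = 67, 24          # assumed shape: persons x repeated emotion ratings
true_ar = 0.4
emotions = np.zeros((n_persons, n_obs))
for t in range(1, n_obs):
    emotions[:, t] = true_ar * emotions[:, t - 1] + rng.normal(size=n_persons)

# Person-mean center, then pool lagged pairs and estimate a single AR(1) slope.
centered = emotions - emotions.mean(axis=1, keepdims=True)
x = centered[:, :-1].ravel()
y = centered[:, 1:].ravel()
ar_estimate = np.dot(x, y) / np.dot(x, x)   # OLS slope without intercept
print(f"Estimated within-person AR(1) coefficient: {ar_estimate:.2f}")
```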
... This interest likely arises from their desire to alleviate burdensome tasks, such as evaluating students' writing and designing standardised L2 exams (+3: item 20). Recent studies by Fatih, Özgür, and Gamze (2024) and Bucol and Sangkawong (2024) support this, confirming the reliability and validity of GenAI as a writing assessment tool by comparing it with human expert evaluations. Teachers also showed a strong interest in assessing the effectiveness of various GenAI platforms (+3: item 24), an area not yet fully explored in the literature. ...
Article
Recently, there has been a significant increase in research on Generative Artificial Intelligence (GenAI) in the domain of second language (L2) education. Given the limited resources, it is essential for GenAI research to focus on key areas. However, there is still uncertainty about which topics should be prioritized. Research priorities are often shaped by individual researchers’ personal interests, which can skew the focus of many studies. Additionally, stakeholder perspectives on these topics can vary widely. Therefore, this study employs the Q methodology to reveal the consensus among different stakeholder groups. To this end, a total of 19 participants, including 6 researchers, 6 teachers, and 7 students, engaged in a Q-sort exercise involving 34 statements. Centroid factor analysis with varimax rotation, carried out in the KADE software, was then used to extract patterns. The analysis revealed three common perspectives across stakeholder groups: psychological factors of teachers and students, multiple scenarios of measurement, and the improvement of L2 competence. These findings provide valuable insights that can inform and refine research agendas in GenAI for L2 education, optimizing the allocation of resources.
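For orientation, the sketch below approximates the factor-extraction step on a simulated 34-statement by 19-participant Q-sort matrix. It is only a rough stand-in: KADE uses centroid extraction, whereas scikit-learn's maximum-likelihood factor analysis (with varimax rotation) is used here for illustration, and all values are invented.

```python
# Rough stand-in for the Q-methodology step: factor participants' Q-sorts
# and apply varimax rotation. KADE uses centroid extraction; scikit-learn's
# maximum-likelihood FactorAnalysis is used here purely as an approximation.
# The Q-sort matrix is simulated (34 statements x 19 participants).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)
q_sorts = rng.integers(-4, 5, size=(34, 19)).astype(float)

fa = FactorAnalysis(n_components=3, rotation="varimax")
fa.fit(q_sorts)                      # columns (participants) are the "variables"
loadings = fa.components_.T          # shape: (19 participants, 3 factors)
print(np.round(loadings, 2))
```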
... Recent developments in AI-powered assessment systems have revealed a complex interplay between technical capabilities and practical implementation challenges. While Bucol and Sangkawong (2024) demonstrated ChatGPT's potential for consistent and efficient writing assessment (ρ = 0.95), subsequent studies have exposed critical limitations in AI's evaluation capabilities. Lin and Yu's (2024) comprehensive structural equation modeling analysis (N = 1042) revealed that despite AI's technical proficiency, fundamental challenges persist in classroom administration and developmental support. ...
Article
Full-text available
The integration of artificial intelligence (AI) in language education necessitates new forms of digital leadership, yet research on how language teachers develop e-leadership competencies remains limited, particularly in non-Western contexts. This study investigates how Vietnamese EFL teachers develop and exercise e-leadership competencies in implementing AI tools. Using an exploratory sequential mixed-methods design, the study combined in-depth interviews with 17 EFL teachers and a survey of 211 teachers across Vietnamese universities. The research framework integrated e-Leadership Theory and the Technology Acceptance Model. Key findings reveal that successful e-leadership requires a balance of technical proficiency (β = 0.31, p < 0.001) and cultural sensitivity (β = 0.28, p < 0.001). Three primary dimensions of e-leadership competencies emerged: technological proficiency with guidance capability, pedagogical innovation in AI integration, and culturally responsive change management. The research also highlights critical ethical considerations in AI implementation, particularly regarding assessment transparency and decision-making processes. These findings inform the development of culturally sensitive professional development programs and provide a framework for understanding e-leadership development in non-Western educational settings.
... The use of ChatGPT for AES has attracted researchers' interest, demonstrating the potential for scoring efficiency, as Shermis (2024) stated. Regarding scoring consistency compared to human raters, research has shown both discouraging (e.g., Bui & Barrot, 2025; Kim et al., 2024) and encouraging (e.g., Bucol & Sangkawong, 2024) results. These contradictory results are not unanticipated; as shown by Yavuz et al. (2024), the consistency of AI depends on how it is utilized for AES. ...
Preprint
Full-text available
Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in low-resource languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.93, Pearson correlation of 0.73, and an overlap measure of 83.5%. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in underrepresented languages. The limitations of the study, such as reliance on single-rater scores and focusing on a specific task type, highlight avenues for future research to expand and refine the proposed approach.
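The agreement statistics reported in that abstract (QWK, Pearson correlation, and an overlap measure) can be computed as sketched below. The score arrays are illustrative placeholders, and "overlap" is taken here as the proportion of exact matches, which is an assumption about how that measure was defined.

```python
# Sketch of the agreement measures reported above (QWK, Pearson r, and an
# exact-agreement "overlap"), computed on illustrative placeholder scores.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_scores = np.array([3, 4, 2, 5, 3, 4, 1, 2, 5, 4])
ai_scores = np.array([3, 4, 3, 5, 3, 3, 1, 2, 4, 4])

qwk = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")
r, _ = pearsonr(human_scores, ai_scores)
overlap = np.mean(human_scores == ai_scores)   # proportion of exact matches (assumed definition)
print(f"QWK = {qwk:.2f}, Pearson r = {r:.2f}, overlap = {overlap:.1%}")
```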
... It can interact naturally with humans in a human-like manner, and it has been assessed to possess good knowledge in most subject matter (Tian et al., accepted). Its affordances have spurred a line of research into its capabilities and potential applications to aid writing as a form of automated feedback (Bucol & Sangkawong, 2024; Guo & Wang, 2023; Pfau et al., 2023; Steiss et al., 2024; Su et al., 2023). ...
... In writing research, ChatGPT tended to be envisioned as an upgraded assessment tool, in comparison to the well-established automated writing evaluation (AWE) systems (Bucol & Sangkawong, 2024; Godwin-Jones, 2024; Kohnke, 2024; Su et al., 2023). AWE systems, such as Grammarly, Criterion, and Pigai, were designed to provide both summative scores and formative feedback for learners' writing, based on natural language processing, artificial intelligence, latent semantic analysis, etc. (Barrot, 2021; Shi & Aryadoust, 2023; Warschauer & Ware, 2006). ...
... In the education sector, AI tools are increasingly utilized to personalize learning experiences, automate routine tasks, and provide instant feedback (Bucol & Sangkawong, 2024; Dodigovic & Tovmasyan, 2021; Yang et al., 2023). These advancements position AI as a significant asset for enhancing efficiency and engagement. ...
... These tools are designed to deliver immediate feedback on various aspects of writing, including grammar, vocabulary, syntax, and style (Dodigovic & Tovmasyan, 2021; Rahma & Zen, 2023). Their increasing integration into educational environments underscores their potential to enhance the quality of students' compositions and streamline the writing process (Bucol & Sangkawong, 2024; Bui & Barrot, 2024). For EFL learners, who often encounter difficulties in mastering linguistic subtleties, AWE systems provide a valuable platform for language development and proficiency enhancement. ...
... This challenge arises from the tools' inability to recognize stylistic and contextual variations in language. Similarly, ProWritingAid and ChatGPT were found to provide suggestions that were not contextually appropriate or aligned with the intended meaning of the text, further eroding user trust in these tools (Ariyanto et al., 2019; Bucol & Sangkawong, 2024). ...
Article
Full-text available
This study analyzes the constraints of Automated Writing Evaluation (AWE) programs, including Grammarly, Pigai, ChatGPT, and ProWritingAid, which are transforming how users develop their composition skills. These tools offer immediate feedback on grammar, style, coherence, and readability, enhancing precision and efficiency by enabling revision and self-evaluation. Nonetheless, despite their capabilities, AWE systems are constrained by their inability to identify all varieties of writing errors, requiring human involvement for thorough feedback. This research employs a mixed-methods literature review to examine the restrictions faced by English as a Foreign Language (EFL) students in both school and higher education contexts across various countries. The results indicate that refining AWE tools to address these deficiencies could make them more effective in promoting users' writing development.
... Recent studies have explored the effectiveness of AI in essay evaluation, with the findings indicating that while human raters generally outperform AI in providing high-quality feedback, AI tools such as ChatGPT show potential in criteria-based assessment and automated scoring (Latif & Zhai, 2024;Steiss et al., 2024). Furthermore, fine-tuned AI models have demonstrated high accuracy in domain-specific educational data scoring (Latif & Zhai, 2024), and ChatGPT has shown notable correlations with human ratings in writing evaluation, offering advantages such as consistency, efficiency, and scalability (Bucol & Sangkawong, 2024). These studies underscore the need for thoughtful and ethical implementation of AI in assessments to ensure equitable outcomes for all learners. ...
... This can be attributed to the fact that advanced AI tools often rely on predefined patterns and may miss deeper and more subjective elements that humans naturally perceive. The findings of this study also support the conclusions of Bucol and Sangkawong (2024) and Mizumoto and Eguchi (2023), who found that GPT models can score essays with a degree of accuracy and reliability comparable to those of human raters. The study's findings are in contrast to those of Bevilacqua et al. (2023), who reported that transformer-based models such as GPT tend to score AI-generated essays 10-15% higher than those written by humans despite being trained on human text. ...
Article
This study aimed to explore how artificial intelligence (AI) tools compare with humans in evaluating the essays written by students in a writing course. Using a dataset of 30 essays written by English as a foreign language (EFL) students, the evaluations by the AI tools were compared with those of human evaluators, to examine whether the AI evaluations differed with respect to the quality of the entire essay or specific categories (i.e., content, vocabulary, organization, and accuracy). The results indicated that the AI tools provided high-quality feedback to students across all categories despite differences regarding essay quality. Additionally, AI tools differed in the scores they assigned, consistently grading lower than human raters across multiple evaluation categories while providing more detailed feedback than human raters. The scores assigned by each AI tool for student essays across various assessment categories did not differ significantly from the overall scores assigned by AI tools.
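A simple way to probe the pattern described above (AI tools consistently grading lower than human raters across categories) is a per-category paired comparison, sketched below. The essay scores, the assumed score gap, and the category names follow the abstract's four categories, but the data are simulated for illustration only.

```python
# Illustrative per-category check of whether an AI tool scores lower than
# human raters (content, vocabulary, organization, accuracy); data simulated.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
categories = ["content", "vocabulary", "organization", "accuracy"]
human = rng.normal(loc=7.5, scale=1.0, size=(30, 4))        # 30 essays x 4 categories
ai = human - rng.normal(loc=0.8, scale=0.5, size=(30, 4))   # AI assumed to score lower

for i, cat in enumerate(categories):
    t, p = ttest_rel(human[:, i], ai[:, i])
    diff = np.mean(human[:, i] - ai[:, i])
    print(f"{cat:12s} mean diff = {diff:.2f}, t = {t:.2f}, p = {p:.4f}")
```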
... Finally, Bucol & Sangkawong (2024) investigated the potential of ChatGPT as a tool for grading essays written by EFL (English as a Foreign Language) students in Thailand, comparing the scores assigned by ChatGPT to those from human instructors and analyzing the strengths and weaknesses of this AI tool. Ten EFL instructors participated in the study, divided into two groups. ...
Article
This paper examines the ability of ChatGPT to generate synthetic comment datasets that mimic those produced by humans. To this end, a collection of datasets containing human comments, freely available in the Kaggle repository, was compared to comments generated via ChatGPT. The latter were based on prompts designed to provide the necessary context for approximating human results. It was hypothesized that the responses obtained from ChatGPT would demonstrate a high degree of similarity with the human-generated datasets with regard to vocabulary usage. Two categories of prompts were analyzed, depending on whether they specified the desired length of the generated comments. The evaluation of the results primarily focused on the vocabulary used in each comment dataset, employing several analytical measures. This analysis yielded noteworthy observations, which reflect the current capabilities of ChatGPT in this particular task domain. It was observed that ChatGPT typically employs a reduced number of words compared to human respondents and tends to provide repetitive answers. Furthermore, the responses of ChatGPT have been observed to vary considerably when the length is specified. It is noteworthy that ChatGPT employs a smaller vocabulary, which does not always align with human language. Furthermore, the proportion of non-stop words in ChatGPT’s output is higher than that found in human communication. Finally, ChatGPT’s vocabulary is more closely aligned with human language than the vocabularies of the two ChatGPT configurations are with each other; this alignment is particularly evident in the use of stop words. While it does not fully achieve the intended purpose, the generated vocabulary serves as a reasonable approximation, enabling specific applications such as the creation of word clouds.
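The vocabulary measures discussed in that abstract (vocabulary size, proportion of non-stop words, and vocabulary overlap) can be sketched as below. The comment texts, the stop-word list, and the use of Jaccard overlap are assumptions made for illustration, not the paper's actual datasets or exact metrics.

```python
# Minimal sketch of the vocabulary comparisons described above: vocabulary
# size, proportion of non-stop words, and vocabulary overlap between a human
# comment set and a ChatGPT-generated one. The comments are toy examples.
import re

STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "of", "in", "that", "was", "but"}

def vocabulary(comments):
    tokens = [w for c in comments for w in re.findall(r"[a-z']+", c.lower())]
    return tokens, set(tokens)

human_comments = ["The service was great and the staff friendly.",
                  "It took a while to get a reply, but the answer helped."]
chatgpt_comments = ["The service was helpful and the staff responded quickly.",
                    "The reply was prompt and the answer was helpful."]

for name, comments in [("human", human_comments), ("chatgpt", chatgpt_comments)]:
    tokens, vocab = vocabulary(comments)
    non_stop = sum(1 for t in tokens if t not in STOP_WORDS) / len(tokens)
    print(f"{name}: vocab size = {len(vocab)}, non-stop-word share = {non_stop:.2f}")

_, human_vocab = vocabulary(human_comments)
_, gpt_vocab = vocabulary(chatgpt_comments)
jaccard = len(human_vocab & gpt_vocab) / len(human_vocab | gpt_vocab)
print(f"vocabulary overlap (Jaccard) = {jaccard:.2f}")
```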