
Artificial Intelligence Applications in English Writing Assessment


Abstract

Video: https://youtu.be/PWNHVwdkjlk
Dr. Ali H. Al-Hoorie
Saudi TESOL Association
27 January 2024
Artificial Intelligence Applications in English Writing Assessment
AI Powered Teaching: An Online Generative AI Bootcamp for Educators
Dr. Ali H. Al-Hoorie
Associate Professor of English Language
Holds a Master's degree from the University of Essex in Applied Linguistics and in
Data Analysis, and a PhD in English language from the University of Nottingham
Published more than 40 peer-reviewed papers, six books, and 50 international
conference papers
Member of the editorial board of more than ten international journals and has
reviewed about 200 research papers
Received several international awards including the Early Career Scholar Award,
the Best Paper Award, and the Best Book Award
Elected to the Board of Directors of the Saudi Linguistics Society at King Saud
University and to the Board of Directors of the Saudi English Language Teaching
Association at King Abdulaziz University
Obtained a patent for a device to lower anxiety while listening. Also won a gold
medal for an invention to help blind people write (patent pending)
Listed among the top 2% of scientists in the world according to what is commonly
known as the Stanford ranking
Assessment fairness
Exams are high-stakes for students
Failure
Lower grade/GPA than deserved
Visa/immigration denial
Rater variability: characteristics of raters, not performance of students
Rater variability
Raters may differ (Eckes, 2008); a numerical sketch follows this list:
a) in the degree to which they comply with the scoring rubric,
b) in the way they interpret criteria employed in operational scoring sessions,
c) in the degree of severity or leniency exhibited when scoring examinee performance,
d) in the understanding and use of rating scale categories, or
e) in the degree to which their ratings are consistent across examinees, scoring criteria, and
performance tasks.
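A minimal numerical sketch of points (c) and (e): the rater names and scores below are invented, and the calculation uses only the Python standard library (statistics.correlation requires Python 3.10+). A higher mean suggests leniency; a low or negative correlation with a colleague suggests inconsistent ranking of the same examinees.

from statistics import mean, correlation

# Hypothetical band scores (0-9) given by three raters to the same five essays
ratings = {
    "Rater A": [6, 7, 5, 8, 6],   # baseline rater
    "Rater B": [5, 6, 4, 7, 5],   # ranks the essays the same way but about one band harsher
    "Rater C": [7, 4, 6, 5, 8],   # inconsistent: ranks the same essays quite differently
}

for rater, scores in ratings.items():
    severity = mean(scores)                              # higher mean = more lenient overall
    agreement = correlation(scores, ratings["Rater A"])  # rank agreement with Rater A
    print(f"{rater}: mean = {severity:.1f}, correlation with Rater A = {agreement:+.2f}")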
Raters focus on different elements (Milanovic et al., 1996):
Length: no. of words, no. of lines, quick glance
Legibility: handwriting, readability
Grammar: type & frequency of errors
Structure: sentence-level, paragraph-level, narrative-level
Communicative effectiveness: success in conveying the message
Tone: naturalness of expression
Composition elements raters focus on
Vocabulary: accuracy & variety of word choice
Spelling: frequency & difficulty of misspellings
Content: dull, lively, showing individuality
Task realization: meeting the criteria set in the question
Punctuation: accuracy of writing mechanisms
Weight attributed to each element varied widely among raters
Rater biases
Manifestations of rating biases (Davidson & Coombe, 2023):
Strictness bias: tendency to score too harshly
Leniency bias: tendency to score highly
Central tendency bias: tendency to score near the center
Restriction of range bias: tendency to limit score range
Halo effect bias: high score on one part leads to high scores on other parts
Horns effect bias: low score on one part leads to lower scores on other parts
Contrast effect bias: tendency to compare performance of different examinees
First impression bias: heavily influenced by the beginning of performance
Recency bias: heavily influenced by the end of performance
Cultural familiarity bias: score candidates from familiar backgrounds differently
Cultural unfamiliarity bias: score ones from unfamiliar backgrounds differently
Acquaintanceship bias: score candidates you know differently
Similar-to-me bias: score someone similar to you differently
Dissimilar-to-me bias: score someone different from you differently
Personal bias: attitudes toward ethnicity, gender, class, age, social status
Sympathy bias: leniency toward students who are failing or who need a certain grade
Current-state-of-mind bias: morning coffee, late at night, marking deadline
Why variability
Factors leading to variability (Davidson & Coombe, 2023):
Cognitive load: amount of information working memory can process
Different criteria: perceive and evaluate different elements simultaneously
Multi-tasking: teaching, admin, marking different exams, family life
Monitoring pressure: especially if part of annual evaluation, contract renewal
Time pressure: looming deadline
Fatigue: large number of students
Rater training
rater training has been shown to be much less effective at reducing rater variability
than expected; that is, raters typically remained far from functioning interchangeably
even after extensive training sessions or after individualized feedback on their ratings
(Eckes, 2008, p. 156)
Types of rubrics
(Gonzalez, 2014)
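As a rough illustration of how the rubric types described by Gonzalez (2014) differ in structure, they can be represented as simple data structures. The criteria and descriptor wording below are invented placeholders, not the actual TOEFL or IELTS descriptors.

# Holistic rubric: one overall scale, a single score per essay
holistic = {
    5: "Fully addresses the task; well organised; wide and accurate language use",
    3: "Partially addresses the task; some lapses in organisation and accuracy",
    1: "Barely addresses the task; frequent errors impede communication",
}

# Analytic rubric: a separate scale for each criterion, scored independently
analytic = {
    "Task response":          {"max_band": 9, "descriptors": "..."},
    "Coherence and cohesion": {"max_band": 9, "descriptors": "..."},
    "Lexical resource":       {"max_band": 9, "descriptors": "..."},
    "Grammatical range":      {"max_band": 9, "descriptors": "..."},
}

# Single-point rubric: only the target standard is described; the rater notes
# where the essay falls short of, meets, or exceeds it
single_point = {
    "Task response": "Fully answers the question with relevant, developed ideas",
    "Organisation":  "Ideas are logically sequenced and clearly linked",
}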
TOEFL Writing Rubric
Source: https://www.ets.org/pdfs/toefl/toefl-ibt-writing-rubrics.pdf
IELTS Writing Rubric
Source: https://s3.eu-west-2.amazonaws.com/ielts-web-static/production/Guides/ielts-writing-band-descriptors.pdf
Human vs ChatGPT ratings
ChatGPT prompt:
Rate the following essay based on the 4 IELTS Band Descriptors (out of 9 each)
Band Descriptors (not fed into ChatGPT):
Task Response/Achievement
Coherence and Cohesion
Lexical Resource
Grammatical Range and Accuracy
Compare your rating to ChatGPT's rating (an API version of this prompt is sketched below)
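The prompt above was typed into the ChatGPT web interface. The same query could also be sent programmatically, for example with the openai Python package as sketched below; the model name, temperature setting, and essay placeholder are assumptions rather than part of the original demonstration. As in the live demo, the band descriptors themselves are not included in the prompt.

from openai import OpenAI

client = OpenAI()  # expects an OPENAI_API_KEY environment variable

essay = "..."  # paste the candidate essay here

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whichever model is available
    messages=[{
        "role": "user",
        "content": "Rate the following essay based on the 4 IELTS Band Descriptors "
                   "(out of 9 each)\n\n" + essay,
    }],
    temperature=0,  # lower temperature reduces, but does not remove, run-to-run variation
)

print(response.choices[0].message.content)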
Sample essay 1
Source: www.ielts-blog.com
Rate Essay 1
ChatGPT rating
Sample essay 2
Source: www.ielts-blog.com
Rate Essay 2
ChatGPT rating
Sample essay 3
Source: www.ielts-blog.com
Rate Essay 3
ChatGPT rating
ChatGPT to evaluate essays?
ChatGPT evaluation may vary slightly if regenerated (one way to gauge this is sketched after this list)
Explanations are sometimes unclear or not relevant
May help train markers and encourage them to reflect
but we can't rely on it fully
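One rough way to handle the run-to-run variation mentioned in this list is to regenerate the rating several times, record the overall band each time, and inspect the spread before treating any single score as meaningful. The band values below are hypothetical, and the calculation uses only the Python standard library.

from statistics import mean, stdev

# Hypothetical overall bands returned by five regenerations of the same prompt
regenerated_bands = [6.5, 7.0, 6.5, 6.0, 6.5]

print(f"mean band:   {mean(regenerated_bands):.2f}")
print(f"spread (SD): {stdev(regenerated_bands):.2f}")
print(f"range:       {min(regenerated_bands)}-{max(regenerated_bands)}")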
References
Davidson, P., & Coombe, C. (2023, March). How can we reduce rater bias. Paper presented at the 5th Applied Linguistics and
Language Teaching International Conference and Exhibition, Dubai, UAE.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language
Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
Gonzalez, J. (2014). Know your terms: Holistic, analytic, and single-point rubrics. Cult of Pedagogy.
https://www.cultofpedagogy.com/holistic-analytic-single-point-rubrics/
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision-making behaviour of composition markers. In M.
Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language
Testing Research Colloquium (Studies in language testing, Vol. 3, pp. 92–114). Cambridge University Press.
Thank you!
@Ali_AlHoorie
hoorie_ali@hotmail.com
www.ali-alhoorie.com