Dr. Ali H. Al-Hoorie
Saudi TESOL Association
27 January 2024
Artificial Intelligence Applications in English Writing Assessment
AI Powered Teaching: An Online Generative AI Bootcamp for Educators
Dr. Ali H. Al-Hoorie
•Associate Professor of English Language
•Holds a Master’s degree from the University of Essex in Applied Linguistics and in
Data Analysis, and a PhD in English Language from the University of Nottingham
•Published more than 40 peer-reviewed papers, six books, and 50 international
conference papers
•Member of the editorial board of more than ten international journals and has
reviewed about 200 research papers
•Received several international awards including the Early Career Scholar Award,
the Best Paper Award, and the Best Book Award
•Elected to the Board of Directors of the Saudi Linguistics Society at King Saud
University and to the Board of Directors of the Saudi English Language Teaching
Association at King Abdulaziz University
•Obtained a patent for a device to lower anxiety while listening. Also won a gold
medal for an invention to help blind people write (patent pending)
•Listed among the top 2% of scientists in the world according to what is commonly
known as the Stanford ranking
Assessment fairness
•Exams are high-stakes for students
•Failure
•Lower grade/GPA than deserved
•Visa/immigration denial
•Rater variability: scores can reflect characteristics of the raters rather than the performance of students
Rater variability
•Raters may differ in several ways (Eckes, 2008); a toy numeric illustration follows this list:
•a) in the degree to which they comply with the scoring rubric,
•b) in the way they interpret criteria employed in operational scoring sessions,
•c) in the degree of severity or leniency exhibited when scoring examinee performance,
•d) in the understanding and use of rating scale categories, or
•e) in the degree to which their ratings are consistent across examinees, scoring criteria, and
performance tasks.
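To make severity, leniency, and restricted range concrete, here is a minimal, illustrative Python sketch (not from the presentation; rater names and scores are invented). A serious analysis would use many-facet Rasch measurement, as Eckes (2008) does.

```python
# Toy ratings matrix: rows are raters, values are band scores given to the
# same five examinees. All numbers are invented for illustration only.
from statistics import mean, stdev

ratings = {
    "Rater A": [6.0, 7.0, 5.0, 8.0, 6.0],   # wider spread of scores
    "Rater B": [7.0, 8.0, 6.5, 8.5, 7.0],   # consistently higher: leniency
    "Rater C": [6.5, 6.5, 6.5, 7.0, 6.5],   # clustered scores: central tendency
}

for rater, scores in ratings.items():
    # The mean hints at severity/leniency; the standard deviation hints at
    # central tendency or restriction of range.
    print(f"{rater}: mean={mean(scores):.2f}, sd={stdev(scores):.2f}")
```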
Composition elements raters focus on
•Raters focus on different elements (Milanovic et al., 1996):
•Length: no. of words, no. of lines, or a quick glance
•Legibility: handwriting, readability
•Grammar: type & frequency of errors
•Structure: sentence-level, paragraph-level, narrative-level
•Communicative effectiveness: success in conveying the message
•Tone: naturalness of expression
•Vocabulary: accuracy & variety of word choice
•Spelling: frequency & difficulty of misspellings
•Content: dull, lively, showing individuality
•Task realization: meeting the criteria set in the question
•Punctuation: accuracy of writing mechanisms
•The weight attributed to each element varied widely among raters (a toy illustration follows)
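As a toy illustration of this weighting point (invented element scores and weight profiles, not data from Milanovic et al.), the same composition can receive different overall marks depending on how each rater weights the elements:

```python
# One composition's element scores (out of 9), combined under two invented
# rater weight profiles. Same performance, different overall marks.
element_scores = {"grammar": 7, "vocabulary": 6, "content": 8, "structure": 5}

rater_weights = {
    "Rater 1": {"grammar": 0.4, "vocabulary": 0.2, "content": 0.2, "structure": 0.2},
    "Rater 2": {"grammar": 0.1, "vocabulary": 0.2, "content": 0.6, "structure": 0.1},
}

for rater, weights in rater_weights.items():
    total = sum(element_scores[k] * weights[k] for k in element_scores)
    print(f"{rater}: weighted score = {total:.2f}")  # 6.60 vs 7.20
```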
Rater biases
•Manifestations of rating biases (Davidson & Coombe, 2023):
•Strictness bias: tendency to score too harshly
•Leniency bias: tendency to score too generously
•Central tendency bias: tendency to score near the center
•Restriction of range bias: tendency to limit score range
•Halo effect bias: high score on one part leads to high scores on other parts
•Horns effect bias: low score on one part leads to lower scores on other parts
•Contrast effect bias: tendency to score an examinee by comparison with other examinees
•First impression bias: heavily influenced by the beginning of performance
•Recency bias: heavily influenced by the end of performance
•Cultural familiarity bias: score candidates from familiar backgrounds differently
•Cultural unfamiliarity bias: score candidates from unfamiliar backgrounds differently
•Acquaintanceship bias: score candidates you know differently
•Similar-to-me bias: score someone similar to you differently
•Dissimilar-to-me bias: score someone different from you differently
•Personal bias: attitudes toward ethnicity, gender, class, age, or social status
•Sympathy bias: reluctance to fail a student, or knowing a student needs a certain grade
•Current-state-of-mind bias: mood and alertness (before morning coffee, late at night, under a marking deadline)
Why variability?
•Factors leading to variability (Davidson & Coombe, 2023):
•Cognitive load: the limited amount of information working memory can process
•Different criteria: raters perceive and evaluate different elements simultaneously
•Multi-tasking: teaching, admin, marking different exams, family life
•Monitoring pressure: especially if part of annual evaluation, contract renewal
•Time pressure: looming deadline
•Fatigue: large number of students
Rater training
•“rater training has been shown to be much less effective at reducing rater variability
than expected; that is, raters typically remained far from functioning interchangeably
even after extensive training sessions… or after individualized feedback on their ratings”
(Eckes, 2008, p. 156)
Types of rubrics: holistic, analytic, and single-point (Gonzalez, 2014)
TOEFL Writing Rubric
IELTS Writing Rubric
Source: https://s3.eu-west-2.amazonaws.com/ielts-web-static/production/Guides/ielts-writing-band-descriptors.pdf
Human vs ChatGPT ratings
•ChatGPT prompt:
• “Rate the following essay based on the 4 IELTS Band Descriptors (out of 9 each)”
•Band Descriptors (not fed into ChatGPT):
•Task Response/Task Achievement
•Coherence and Cohesion
•Lexical Resource
•Grammatical Range and Accuracy
•Compare your rating to ChatGPT’s rating (a sketch of automating this prompt appears below)
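For readers who want to try this comparison programmatically, here is a minimal sketch using the OpenAI Python SDK. The model name and the rate_essay helper are illustrative assumptions, not part of the workshop, and an OPENAI_API_KEY environment variable must be set.

```python
# Minimal sketch of sending the workshop's rating prompt to a chat model.
# Assumptions: the "openai" package (v1 SDK) is installed, OPENAI_API_KEY is
# set, and "gpt-4o-mini" is just an illustrative model choice.
from openai import OpenAI

client = OpenAI()

PROMPT = ("Rate the following essay based on the 4 IELTS Band Descriptors "
          "(out of 9 each)")

def rate_essay(essay_text: str) -> str:
    """Send one essay with the rating prompt; return the model's free-text rating."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model would do
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{essay_text}"}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(rate_essay("Paste one of the workshop essays here."))
```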
Rate Essay 1
ChatGPT rating
Rate Essay 2
ChatGPT rating
Rate Essay 3
ChatGPT rating
ChatGPT to evaluate essays?
•ChatGPT’s evaluation may vary slightly when regenerated (a toy consistency check follows below)
•Its explanations are sometimes unclear or irrelevant
•It may help train markers and encourage them to reflect,
•but it cannot be relied on fully
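One way to see the regeneration variability noted above is to rate the same essay several times and inspect the spread of the extracted scores. A deliberately naive sketch, reusing the hypothetical rate_essay helper from the earlier block (free-text parsing like this is fragile and only meant to illustrate the point):

```python
# Rate one essay n times and extract any "7/9"-style scores with a regex.
# Assumes rate_essay() from the previous sketch; parsing is intentionally naive.
import re
from statistics import mean, pstdev

def band_means(essay_text: str, n: int = 5) -> list[float]:
    """Return the mean extracted band score for each of n regenerations."""
    per_run = []
    for _ in range(n):
        reply = rate_essay(essay_text)
        found = [float(s) for s in re.findall(r"(\d(?:\.\d)?)\s*/\s*9", reply)]
        if found:
            per_run.append(mean(found))  # average over the four descriptors
    return per_run

scores = band_means("The same essay text as before.")
if scores:
    print(f"runs={len(scores)}, mean={mean(scores):.2f}, sd={pstdev(scores):.2f}")
```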
References
Davidson, P., & Coombe, C. (2023, March). How can we reduce rater bias? Paper presented at the 5th Applied Linguistics and
Language Teaching International Conference and Exhibition, Dubai, UAE.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language
Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
Gonzalez, J. (2014). Know your terms: Holistic, analytic, and single-point rubrics. Cult of Pedagogy.
https://www.cultofpedagogy.com/holistic-analytic-single-point-rubrics/
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision-making behaviour of composition markers. In M.
Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language
Testing Research Colloquium (Studies in language testing, Vol. 3, pp. 92–114). Cambridge University Press.
Thank you!
@Ali_AlHoorie
hoorie_ali@hotmail.com
www.ali-alhoorie.com