ENHANCING WRITING ASSESSMENT
WITH CHATGPT
Ali H. Al-Hoorie
Saudi TESOL Association
Presented at TEFL Kuwait Symposium
18 November 2023
ASSESSMENT FAIRNESS
•Exams are high-stakes for students
•Failure
•Lower grade/GPA than deserved
•Visa/immigration denial
•Rater variability: scores reflect characteristics of the raters, not the performance of the students
RATER VARIABILITY
•Raters may differ
•a) in the degree to which they comply with the scoring rubric,
•b) in the way they interpret criteria employed in operational scoring sessions,
•c) in the degree of severity or leniency exhibited when scoring examinee
performance,
•d) in the understanding and use of rating scale categories, or
•e) in the degree to which their ratings are consistent across examinees,
scoring criteria, and performance tasks.
•(Eckes, 2008)
•Severity (c) and consistency (e) can be checked numerically from a score matrix, as sketched below
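A minimal sketch, in Python with hypothetical scores, of how two of these dimensions might be quantified: severity/leniency as a rater's distance from the grand mean, and consistency as a rater's correlation with the other raters. This is a simple descriptive proxy, not the measurement models used in studies like Eckes (2008).

```python
# A minimal sketch (hypothetical data) of quantifying two dimensions of rater
# variability from a raters-by-essays score matrix: (c) severity/leniency and
# (e) consistency across examinees.
from statistics import mean, correlation  # correlation requires Python 3.10+

# Hypothetical scores: each rater scored the same five essays out of 9.
scores = {
    "Rater A": [5, 6, 7, 4, 6],
    "Rater B": [7, 8, 8, 6, 8],   # systematically more lenient
    "Rater C": [6, 5, 7, 6, 5],   # less aligned with the other raters
}

grand_mean = mean(s for essays in scores.values() for s in essays)

for rater, essays in scores.items():
    # Severity/leniency: how far the rater's average sits from the grand mean.
    severity = mean(essays) - grand_mean
    # Consistency: correlation with the average of the *other* raters.
    others = [
        mean(other[i] for name, other in scores.items() if name != rater)
        for i in range(len(essays))
    ]
    r = correlation(essays, others)
    print(f"{rater}: severity {severity:+.2f}, consistency r = {r:.2f}")
```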
COMPOSITION ELEMENTS RATERS FOCUS ON
•Raters focus on different elements (Milanovic et al., 1996):
•Length: no. of words, no. of lines, quick glance
•Legibility: handwriting, readability
•Grammar: type & frequency of errors
•Structure: sentence-level, paragraph-level, narrative-level
•Communicative effectiveness: success in conveying the message
•Tone: naturalness of expression
•Vocabulary: accuracy & variety of word choice
•Spelling: frequency & difficulty of misspellings
•Content: dull, lively, showing individuality
•Task realization: meeting the criteria set in the question
•Punctuation: accuracy of writing mechanics
•Weight attributed to each element varied widely among raters, as the sketch below illustrates
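A minimal sketch, with hypothetical element scores and weights, of how two raters who score the same elements identically but weight them differently arrive at different totals for the very same essay:

```python
# A minimal sketch (hypothetical weights and scores) of how raters who weight
# the composition elements above differently reach different totals for the
# exact same essay.
element_scores = {  # one essay, each element scored out of 10 (hypothetical)
    "grammar": 6, "vocabulary": 7, "structure": 5,
    "content": 8, "task realization": 7,
}

rater_weights = {  # each rater's weights sum to 1.0
    "Rater A": {"grammar": 0.40, "vocabulary": 0.20, "structure": 0.15,
                "content": 0.15, "task realization": 0.10},
    "Rater B": {"grammar": 0.10, "vocabulary": 0.15, "structure": 0.15,
                "content": 0.30, "task realization": 0.30},
}

for rater, weights in rater_weights.items():
    total = sum(weights[e] * s for e, s in element_scores.items())
    print(f"{rater}: weighted total = {total:.2f} / 10")
# Rater A (grammar-focused) scores it 6.45; Rater B (content-focused) 6.90.
```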
RATER BIASES
•Manifestations of rating biases (Davidson & Coombe, 2023):
•Strictness bias: tendency to score too harshly
•Leniency bias: tendency to score too generously
•Central tendency bias: tendency to score near the center of the scale
•Restriction of range bias: tendency to limit the range of scores used (both patterns can be flagged from simple score statistics, as sketched after this list)
•Halo effect bias: high score on one part leads to high scores on other parts
•Horns effect bias: low score on one part leads to lower scores on other parts
•Contrast effect bias: tendency to compare performance of different examinees
•First impression bias: heavily influenced by the beginning of performance
•Recency bias: heavily influenced by the end of performance
•Cultural familiarity bias: score candidates from familiar backgrounds differently
•Cultural unfamiliarity bias: score ones from unfamiliar backgrounds differently
•Acquaintanceship bias: score candidates you know differently
RATER BIASES
•Similar-to-me bias: score someone similar to you differently
•Dissimilar-to-me bias: score someone different from you differently
•Personal bias: attitudes toward ethnicity, gender, class, age, social status
•Sympathy bias: sympathy toward failing students or those who need a certain grade
•Current-state-of-mind bias: morning coffee, late at night, a looming marking deadline
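A minimal sketch, on hypothetical data, of how central tendency and restriction-of-range biases could be flagged from a rater's score distribution; the thresholds here are illustrative assumptions, not established cut-offs:

```python
# A minimal sketch (hypothetical data, illustrative thresholds) of flagging
# two of the biases above from descriptive statistics.
from statistics import mean, pstdev

MIDPOINT, SCALE_MAX = 5, 9              # a 9-point band scale, midpoint 5
ratings = {
    "Rater A": [1, 3, 5, 7, 9, 2, 8],   # uses the full scale
    "Rater B": [4, 5, 5, 6, 5, 4, 6],   # clusters near the middle
}

for rater, scores in ratings.items():
    spread = pstdev(scores)
    near_mid = abs(mean(scores) - MIDPOINT) < 1
    if spread < 1.5 and near_mid:
        print(f"{rater}: possible central tendency bias (sd = {spread:.2f})")
    elif max(scores) - min(scores) < SCALE_MAX // 2:
        print(f"{rater}: possible restriction of range "
              f"(range = {max(scores) - min(scores)})")
    else:
        print(f"{rater}: no flag (sd = {spread:.2f})")
```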
WHY VARIABILITY?
•Factors leading to variability (Davidson & Coombe, 2023):
•Cognitive load: amount of information working memory can process
•Different criteria: perceive and evaluate different elements simultaneously
•Multi-tasking: teaching, admin, marking different exams, family life
•Monitoring pressure: especially if part of annual evaluation, contract renewal
•Time pressure: looming deadline
•Fatigue: large number of students
RATER TRAINING
•“rater training has been shown to be much less effective at reducing rater
variability than expected; that is, raters typically remained far from
functioning interchangeably even after extensive training sessions… or after
individualized feedback on their ratings” (Eckes, 2008, p. 156)
TYPES OF RUBRICS
(Gonzalez, 2014)
TOEFL WRITING RUBRIC
IELTS WRITING RUBRIC
Source: https://s3.eu-west-2.amazonaws.com/ielts-web-static/production/Guides/ielts-writing-band-descriptors.pdf
HUMAN VS CHATGPT RATINGS
•ChatGPT prompt:
•“Rate the following essay based on the 4 IELTS Band Descriptors (out of 9 each)”
•Band Descriptors (not fed into ChatGPT):
•Task Response/Achievement
•Coherence and Cohesion
•Lexical Resource
•Grammatical Range and Accuracy
•Compare your rating to ChatGPT's rating (a scripted version of this prompt is sketched below)
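The comparison above used the ChatGPT web interface; below is a minimal sketch of issuing the same prompt programmatically, assuming the openai Python package, an OPENAI_API_KEY environment variable, and a placeholder model name (not what the presentation used):

```python
# A minimal sketch of sending the slide's rating prompt through the API
# instead of the ChatGPT web interface. Assumes the `openai` package
# (pip install openai) and an OPENAI_API_KEY environment variable; the
# model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

essay = "..."  # paste the examinee's essay here

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Rate the following essay based on the 4 IELTS Band "
                   "Descriptors (out of 9 each):\n\n" + essay,
    }],
)
print(response.choices[0].message.content)
```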
CHATGPT RATING
CHATGPT TO EVALUATE ESSAYS?
•ChatGPT's evaluation may vary slightly when regenerated (one way to measure this is sketched after this list)
•Explanations are sometimes unclear or irrelevant
•May help train markers and encourage them to reflect
•but it cannot be relied on fully
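One way to gauge the regeneration variability noted above is to request the same rating several times and compare the returned scores. A minimal sketch under the same assumptions as the earlier API example, additionally assuming the model replies with a bare number as instructed:

```python
# A minimal sketch of measuring how much the overall band score drifts
# across regenerations. Same assumptions as the earlier API example; the
# float() parse assumes the model complies with the "number only" instruction.
from statistics import mean, stdev
from openai import OpenAI

client = OpenAI()
essay = "..."  # paste the examinee's essay here

prompt = ("Rate the following essay based on the 4 IELTS Band Descriptors. "
          "Reply with the overall band score only, as a single number:\n\n"
          + essay)

scores = []
for _ in range(5):  # five independent regenerations
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    scores.append(float(response.choices[0].message.content.strip()))

print(f"scores: {scores}")
print(f"mean {mean(scores):.2f}, sd {stdev(scores):.2f}")
```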
REFERENCES
Davidson, P., & Coombe, C. (2023, March). How can we reduce rater bias? Paper presented at the 5th Applied Linguistics and Language Teaching International Conference and Exhibition, Dubai, UAE.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability.
Language Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
Gonzalez, J. (2014). Know your terms: Holistic, analytic, and single-point rubrics. Cult of Pedagogy.
https://www.cultofpedagogy.com/holistic-analytic-single-point-rubrics/
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision-making behaviour of composition markers. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (Studies in Language Testing, Vol. 3, pp. 92–114). Cambridge University Press.
THANK YOU!
@Ali_AlHoorie
hoorie_ali@hotmail.com
www.ali-alhoorie.com