Video: https://youtu.be/WcvgK6kwLvc
ENHANCING WRITING ASSESSMENT
WITH CHATGPT
Ali H. Al-Hoorie
Saudi TESOL Association
Presented at TEFL Kuwait Symposium
18 November 2023
ASSESSMENT FAIRNESS
Exams are high-stakes for students; possible consequences include:
Failure
Lower grade/GPA than deserved
Visa/immigration denial
Rater variability: scores end up reflecting characteristics of the raters, not the performance of the students
RATER VARIABILITY
Raters may differ
a) in the degree to which they comply with the scoring rubric,
b) in the way they interpret criteria employed in operational scoring sessions,
c) in the degree of severity or leniency exhibited when scoring examinee
performance,
d) in the understanding and use of rating scale categories, or
e) in the degree to which their ratings are consistent across examinees,
scoring criteria, and performance tasks.
(Eckes, 2008)
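As a rough illustration of (c) severity/leniency and (e) consistency, the sketch below computes per-rater statistics on invented scores (toy data, not from any real scoring session); the 9-point band scale and the use of the panel mean as a reference point are assumptions made only for the example:

```python
# Toy illustration of rater severity/leniency and consistency.
# The scores below are invented for illustration only (band scale 1-9).
from statistics import mean
from math import sqrt

# Rows: raters; columns: the same five essays scored by each rater.
scores = {
    "Rater A": [6, 7, 5, 8, 6],   # close to the panel average
    "Rater B": [5, 6, 4, 7, 5],   # about one band more severe, but consistent
    "Rater C": [7, 5, 7, 6, 8],   # similar average, yet ranks the essays differently
}

# Panel mean for each essay, used here as the reference point.
panel_mean = [mean(col) for col in zip(*scores.values())]

def pearson(x, y):
    """Plain Pearson correlation, used as a rough consistency index."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

for rater, s in scores.items():
    severity = mean(s) - mean(panel_mean)   # negative = harsher than the panel
    consistency = pearson(s, panel_mean)    # low = orders essays differently
    print(f"{rater}: severity {severity:+.2f}, consistency r = {consistency:.2f}")
```

Operational analyses such as the many-facet Rasch modelling used by Eckes (2008) treat these effects far more rigorously; the sketch only shows what severity and consistency mean in practice.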
COMPOSITION ELEMENTS RATERS FOCUS ON
Raters focus on different elements (Milanovic et al., 1996):
Length: no. of words, no. of lines, quick glance
Legibility: handwriting, readability
Grammar: type & frequency of errors
Structure: sentence-level, paragraph-level, narrative-level
Communicative effectiveness: success in conveying the message
Tone: naturalness of expression
Vocabulary: accuracy & variety of word choice
Spelling: frequency & difficulty of misspellings
Content: dull, lively, showing individuality
Task realization: meeting the criteria set in the question
Punctuation: accuracy of writing mechanics
Weight attributed to each element varied widely among raters
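To make the weighting point concrete, here is a minimal sketch (all element scores and weights are hypothetical) showing how two raters who agree on every single element can still arrive at different overall marks purely because they weight the elements differently:

```python
# Hypothetical element scores (out of 9) that both raters fully agree on.
element_scores = {
    "grammar": 6,
    "vocabulary": 7,
    "structure": 5,
    "content": 8,
    "task_realization": 7,
}

# Each rater distributes importance differently (weights sum to 1).
weights = {
    "Rater A": {"grammar": 0.40, "vocabulary": 0.30, "structure": 0.10,
                "content": 0.10, "task_realization": 0.10},
    "Rater B": {"grammar": 0.10, "vocabulary": 0.10, "structure": 0.20,
                "content": 0.30, "task_realization": 0.30},
}

for rater, w in weights.items():
    overall = sum(w[element] * score for element, score in element_scores.items())
    print(f"{rater}: overall {overall:.2f}")   # Rater A: 6.50, Rater B: 6.80
```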
RATER BIASES
Manifestations of rating biases (Davidson & Coombe, 2023):
Strictness bias: tendency to score too harshly
Leniency bias: tendency to score highly
Central tendency bias: tendency to score near the center
Restriction of range bias: tendency to limit score range
Halo effect bias: high score on one part leads to high scores on other parts
Horns effect bias: low score on one part leads to lower scores on other parts
Contrast effect bias: tendency to compare performance of different examinees
First impression bias: heavily influenced by the beginning of performance
Recency bias: heavily influenced by the end of performance
Cultural familiarity bias: score candidates from familiar backgrounds differently
Cultural unfamiliarity bias: score candidates from unfamiliar backgrounds differently
Acquaintanceship bias: score candidates you know differently
Similar-to-me bias: score someone similar to you differently
Dissimilar-to-me bias: score someone different from you differently
Personal bias: attitudes toward ethnicity, gender, class, age, social status
Sympathy bias: sympathy for failing students or those who need a certain grade
Current-state-of-mind bias: before the morning coffee, late at night, under a marking deadline
WHY VARIABILITY
Factors leading to variability (Davidson & Coombe, 2023):
Cognitive load: amount of information working memory can process
Different criteria: perceive and evaluate different elements simultaneously
Multi-tasking: teaching, admin, marking different exams, family life
Monitoring pressure: especially if part of annual evaluation, contract renewal
Time pressure: looming deadline
Fatigue: large number of students
RATER TRAINING
“rater training has been shown to be much less effective at reducing rater
variability than expected; that is, raters typically remained far from
functioning interchangeably even after extensive training sessions or after
individualized feedback on their ratings” (Eckes, 2008, p. 156)
TYPES OF RUBRICS
Holistic, analytic, and single-point rubrics (Gonzalez, 2014)
TOEFL WRITING RUBRIC
Source: https://www.ets.org/pdfs/toefl/toefl-ibt-writing-rubrics.pdf
IELTS WRITING RUBRIC
Source: https://s3.eu-west-2.amazonaws.com/ielts-web-static/production/Guides/ielts-writing-band-descriptors.pdf
HUMAN VS CHATGPT RATINGS
ChatGPT prompt (see the API sketch after this list):
“Rate the following essay based on the 4 IELTS Band Descriptors (out of 9 each)”
Band Descriptors (not fed into ChatGPT):
Task Response/Achievement
Coherence and Cohesion
Lexical Resource
Grammatical Range and Accuracy
Compare your rating to ChatGPT's rating
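A minimal sketch of how this comparison could be scripted with the OpenAI Python API is shown below. The model name, temperature, and output handling are assumptions, not part of the presented workflow; as on the slide, only the quoted prompt is sent, without the descriptor texts.

```python
# Minimal sketch: ask ChatGPT to rate an essay against the 4 IELTS Band Descriptors.
# Assumes the openai package (v1+) is installed and OPENAI_API_KEY is set in the environment;
# the model name and temperature are illustrative choices, not the presenter's setup.
from openai import OpenAI

client = OpenAI()

def rate_essay(essay_text: str, model: str = "gpt-4o-mini") -> str:
    # The prompt is the one quoted on the slide; the essay is simply appended to it.
    prompt = (
        "Rate the following essay based on the 4 IELTS Band Descriptors (out of 9 each)\n\n"
        + essay_text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # low temperature reduces (but does not remove) run-to-run variation
    )
    return response.choices[0].message.content

# Usage: print the model's rating next to your own for comparison.
# print(rate_essay(open("sample_essay_1.txt").read()))
```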
SAMPLE ESSAY 1
Source: www.ielts-blog.com
CHATGPT RATING
SAMPLE ESSAY 2
Source: www.ielts-blog.com
CHATGPT RATING
SAMPLE ESSAY 3
Source: www.ielts-blog.com
CHATGPT RATING
CHATGPT TO EVALUATE ESSAYS?
ChatGPT's evaluation may vary slightly if regenerated (see the sketch after this list)
Explanations are sometimes unclear or irrelevant
May help train markers and encourage them to reflect
but it cannot be relied on fully
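One way to gauge how much the evaluation shifts when regenerated is to request the same rating several times and inspect the spread. The sketch below reuses the hypothetical rate_essay() from the earlier sketch; the regular expression assumes the reply states an overall band such as "Overall: 6.5" and may need adjusting, and the web interface regenerates at a higher temperature than the temperature=0 used earlier, so its variation will typically be larger.

```python
# Minimal sketch: regenerate the same rating several times and inspect the spread.
import re
from statistics import mean, pstdev

def overall_band(reply: str):
    """Pull an overall band such as 'Overall: 6.5' out of the reply, if present."""
    match = re.search(r"[Oo]verall[^0-9]*(\d(?:\.\d)?)", reply)
    return float(match.group(1)) if match else None

def rating_spread(essay_text: str, rate_fn, runs: int = 5) -> None:
    """Call the rater several times and report how the overall bands vary."""
    bands = [b for b in (overall_band(rate_fn(essay_text)) for _ in range(runs)) if b is not None]
    if bands:
        print(f"bands: {bands}  mean: {mean(bands):.2f}  sd: {pstdev(bands):.2f}")

# Usage with the rate_essay() sketch defined earlier:
# rating_spread(open("sample_essay_1.txt").read(), rate_essay)
```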
REFERENCES
Davidson, P., & Coombe, C. (2023, March). How can we reduce rater bias. Paper presented at the 5th Applied Linguistics
and Language Teaching International Conference and Exhibition, Dubai, UAE.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability.
Language Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
Gonzalez, J. (2014). Know your terms: Holistic, analytic, and single-point rubrics. Cult of Pedagogy.
https://www.cultofpedagogy.com/holistic-analytic-single-point-rubrics/
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision-making behaviour of composition markers. In M.
Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th
Language Testing Research Colloquium (Studies in language testing, Vol. 3, pp. 92–114). Cambridge University
Press.
THANK YOU!
@Ali_AlHoorie
hoorie_ali@hotmail.com
www.ali-alhoorie.com