AUTOMATED ASSESSMENT OF ARTEFACTS USING AI
C. Del Gobbo, M. Birkenkrahe, N. Yonts, W. Beal
Lyon College (UNITED STATES)
Abstract
We tested the ability of generative AI (GAI) to serve as a non-expert grader in the context of
school-wide curriculum assessment. OpenAI's ChatGPT-4o large language model was used to
create diverse student personas, and synthetic artefacts were generated from exam questions of two
undergraduate courses on religion and philosophy. A custom GPT model, Claude 3.5, and human
non-expert educators were then used to evaluate the artefacts against a rubric. We performed a
variety of statistical tests on the results to identify correlations and patterns within the GAI graders
and between human and GAI graders. We found that different GAI models grade consistently with
one another, while GAI and human grades differ significantly. We identified a systemic bias of the
GAIs towards synthetic, GAI-generated artefacts, found that synthetic artefacts were unsuitable for
validating GAI graders, and we identify several directions for further study. Despite these limitations,
we conclude that GAI-based grading in curriculum assessment represents a viable and promising
use of AI to offer an intermediate evaluation perspective on student work.
Keywords: Generative AI, educational assessment, automated grading, custom GPT-4o, AI in
education.
1 INTRODUCTION
The integration of generative artificial intelligence (GAI) tools into education is widely anticipated to
transform teaching and learning practices. One promising application is in educational assessment,
where AI can provide both quantitative and qualitative feedback. Given the increasing demands on
educators to deliver timely and constructive evaluations, GAI models — trained on extensive datasets
— can produce coherent and contextually relevant responses, making them viable candidates for
automated grading and feedback systems. However, despite its potential, the use of GAI in
educational assessment remains underexplored, particularly from the perspective of educators [1].
Our investigation began with the question of whether GAI could be applied to
curriculum assessment, where educators evaluate the quality of another discipline's curriculum by
grading student work. For instance, an English instructor might be asked to assess the quality of a
philosophy curriculum based on essays written by philosophy students. In this scenario, the English
instructor acts as a non-expert, likely hesitant to assume the role of an authority in a colleague's field
and uncertain about the accuracy of their assessments. This reluctance could make it challenging to
motivate faculty to undertake such tasks. This is where GAI enters as a possible solution: Could GAI,
as a non-human, non-expert entity, effectively perform this role?
A plethora of secondary questions are implied in this question, including:
1. Can GAI provide valid grades and useful feedback for students?
2. How does the GAI's assessment compare with human feedback?
3. How could one test these questions experimentally?
The first of these questions already contains an ambiguity: when is a grade "valid", and when is feedback
"useful"? One great advantage of humans over machines is the ability to behave in non-standard ways
and to take aspects into account that rely on having a relationship with the student. This suggests that a
study comparing the overall impact of human vs. AI assessment of student artefacts would have
to take many variables into account. Several studies have explored the second issue: it was inferred in
[2] that using a custom GPT model tailored for grading tasks could reduce the observed gap between
human and AI feedback; another study revealed a near-even split in preference between AI and
human feedback, with students appreciating the detail and consistency of AI feedback, and valuing the
interactiveness and personalisation of human feedback [3].
Our ambition was much more modest, and the more we learnt about working with GAI tools, the more
modest it became. Still, we now think that the hopes of other educators that this technology can be
integrated into teaching and learning practices may be justified, and in particular that GAI may
play a valuable role in automating and enhancing assessments, though much work remains to be
done [4].
2 METHODOLOGY
Our approach combined synthetic persona generation recently proposed to improve Large Language
Models (LLMs) [5] with AI-driven assessment increasingly used in large-scale, objective grading [6].
All data used for, and obtained from, this research are available online in a Google Drive folder (link:
https://tinyurl.com/delgobbo-et-al).
2.1 Synthetic Artefacts Generation
Fig. 1 shows the synthetic persona and artefact creation: the ChatGPT-4o LLM [7] was used to create
20 student personas and a writer in the form of a custom GPT model ("writer-GPT"). This model was
then used to generate 40 student artefacts using two prompts from an undergraduate religion and
philosophy course.
Figure 1. Synthetic persona and artefact creation.
Here is an example of a student persona generated by ChatGPT-4o:
Persona 1: Emma Johnson
Demographics:
- Age: 20
- Gender: Female
- Ethnicity: Caucasian
- Socioeconomic Background: Middle class
- Educational Level: Undergraduate, Junior year
Academic Background:
- Major or Field of Study: Biology with a minor in Environmental Science
- Academic Performance: GPA 3.8, strong in science courses, struggles with math
- Favourite Subjects: Ecology, Genetics
- Challenges Faced in Education: Difficulty with complex mathematical concepts,
balancing extracurricular activities with study time
Personal Characteristics:
- Personality Traits: Ambitious, extroverted, organised
- Hobbies and Interests: Hiking, volunteering at animal shelters, reading scientific journals
- Learning Style: Visual and kinesthetic learner
- Motivation for Studying: Passion for environmental conservation and desire to work in
wildlife protection
Attitude Toward Education:
- Attitude Toward Teachers and Peers: Respectful and collaborative, often takes a
leadership role in group projects
- Engagement in Class: Highly participative, asks questions, and engages in discussions
- Approach to Assignments: Proactive, often completes assignments well before deadlines
- Future Aspirations: Pursue a Master's degree in Environmental Science, work for a
conservation organisation.
The AI writer persona, writer-GPT, was based on the following prompt:
Make an academic writer who helps generate diverse student artefacts based on provided
student personas and assignment prompts [...] capable of creating detailed essays, reflective
writings, and other academic assignments, ensuring that each artefact is contextually relevant
and tailored to the characteristics and academic background of the specified student persona.
The artefacts should reflect real-world variability, with some being better written than others to
accurately represent a range of student abilities and performance levels.
The writer-GPT should approach generating assignments by closely aligning with the characteristics
and academic backgrounds of the specified student personas.
Emphasis should be placed on:
- Authenticity: The writing should reflect the real-world variability in student abilities. Some
artefacts should be well-written and insightful, while others might have common errors, less
clarity, or weaker arguments, depending on the persona's academic performance and writing
skills.
- Contextual Relevance: Ensure that each artefact is contextually relevant to the assignment
prompt and accurately reflects the specified persona's understanding and interpretation.
- Personality and Perspective: Infuse the artefacts with the unique personality traits and
perspectives of each persona, making the writing more personalised and reflective of
individual differences.
Things to avoid were:
- Overly Technical or Advanced Language: Avoid using language or concepts that are too
advanced for the persona’s described academic level.
- Perfection: Do not make every artefact flawless. Include common student mistakes and areas
for improvement to make the artefacts realistic.
- Generic Responses: Avoid creating artefacts that lack the personalised touch of the student
persona’s background and characteristics.
The writer-GPT should communicate in an academic yet approachable tone. It should
maintain a level of formality appropriate for educational settings, ensuring clarity and professionalism.
However, it should also be engaging and supportive, providing explanations and context where
necessary to ensure the artefacts are well-understood. The tone should be friendly enough to
encourage interaction but formal enough to maintain the integrity and seriousness of academic work.
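For readers who wish to reproduce or adapt this pipeline programmatically, the sketch below shows how persona and artefact generation of this kind could be scripted against the OpenAI API. We used the ChatGPT web interface and a custom GPT rather than a script, so the model identifier, prompt wording, function names, and the shortened prompts here are illustrative assumptions, not the exact setup used.

```python
# Minimal sketch of how the persona-and-artefact pipeline could be scripted with
# the OpenAI Python SDK. The study itself used the ChatGPT web interface and a
# custom GPT, so the model identifier, prompt wording, and structure below are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PERSONA_PROMPT = (
    "Create one fictional undergraduate student persona with demographics, "
    "academic background, personal characteristics, and attitude toward education."
)

WRITER_SYSTEM = (
    "You are an academic writer who generates diverse student artefacts based on a "
    "provided student persona and assignment prompt. Reflect the persona's ability "
    "level and writing skills; do not make every artefact flawless."
)

def generate_persona() -> str:
    """Ask the model for one synthetic student persona."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PERSONA_PROMPT}],
    )
    return response.choices[0].message.content

def generate_artefact(persona: str, assignment: str) -> str:
    """Ask the writer persona to produce one artefact for a given assignment."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": WRITER_SYSTEM},
            {"role": "user", "content": f"Persona:\n{persona}\n\nAssignment:\n{assignment}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    personas = [generate_persona() for _ in range(20)]        # 20 personas
    prompts = ["<writing prompt 1>", "<writing prompt 2>"]    # two prompts -> 40 artefacts
    artefacts = [generate_artefact(p, q) for p in personas for q in prompts]
    print(f"Generated {len(artefacts)} artefacts")
```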
The AI grader persona, grader-GPT, was based on this prompt:
Make an academic evaluator, who grades student artefacts based on assignment prompts and
associated rubrics [...] to provide detailed feedback and grades in a specified format [...] for different
assignments. The feedback should be constructive, highlighting strengths and areas for improvement,
and the grades should align with the given rubric criteria.
The grader-GPT should emphasise providing constructive and specific feedback that focuses
on areas for improvement while also acknowledging the strengths of the student's work. The
evaluations should be clear, actionable, and supportive, aiming to guide the student in their
learning process.
Key aspects to emphasise were:
- Constructive Feedback: Highlight areas where the student can improve with specific
suggestions and examples.
- Strengths: Acknowledge and reinforce what the student did well to encourage positive
reinforcement.
- Clarity: Ensure feedback is clear and easy to understand.
- Actionable Advice: Provide specific steps or resources the student can use to improve their
work.
Key aspects to avoid were:
- Rudeness or Harsh Criticism: Avoid being rude or overly critical. Feedback should be
supportive and encouraging, not discouraging.
- Vagueness: Avoid vague comments that do not provide clear guidance on how to improve.
- Bias: Ensure evaluations are fair and unbiased, focusing solely on the quality of the work
according to the rubric.
For the feedback, the tone of the grader-GPT should be like that of a great professor: formal but
also supportive. This means maintaining a professional and respectful tone while being encouraging
and understanding. The GPT should communicate in a way that conveys expertise and authority while
also showing empathy and a genuine desire to help the student improve.
2.2 Writing Prompts and Rubric
The grader-GPT was designed to grade and give feedback based on a writing prompt and a specific
rubric matching the writing prompt, both of which were obtained from an expert instructor.
The two writing prompts were as follows:
1. Reflect on how different religious perspectives relate to your own experiences and attitudes
toward religion (or your philosophy of life). Drawing upon at least two different religions (one of
which must not be an Abrahamic faith), discuss in detail how these religions have influenced
your thoughts about your religious experience or philosophy of life.
2. Reflect on the integration of faith (or one's philosophy of life) and reason. Consider how
historical analysis (such as understanding the historical contexts of certain texts), literary
analysis (such as examining various genres and thematic concerns of texts), and social
analysis (such as exploring the significance of texts for understanding issues like society,
gender, family, and sexuality) affect, challenge, enrich, change, and/or help you understand
your own faith or philosophy of life.
The rubric criteria for these writing prompts are shown in table 1:
Table 1. Rubric Criteria.

Prompt 1:
- Discusses at least two religions
- The religions are accurately portrayed
- The discussion is clear
- Adequate detail for understanding
- Thoughtfulness and insight

Prompt 2:
- Clarity
- Comprehensiveness
- Reasons Given
- Faith or Philosophy of Life
2.3 Grading Process
Two of the authors (MB and WB), both professors experienced in grading and both non-experts in the
subject matter of the student artefacts, were enlisted to grade and give feedback.
Both grader-GPT and human graders were given the same synthetic student artefacts to grade. They
were instructed to provide a numerical grade for each artefact, following the rubric provided, with the
general instructions to "provide feedback in 4-5 sentences that, in your opinion, will help the students
improve their work."
The AI was prompted to give constructive feedback, highlight strengths, be clear and easy to
understand, and give actionable advice; rudeness or harsh criticism, vagueness, and bias were to be
avoided. The synthetic artefacts were also graded by another GAI, Claude 3.5 [8]. In addition, the
human graders were asked to record the time needed to grade each artefact.
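As with the artefact generation, the grading itself was carried out interactively through the custom GPT and the Claude interface. A minimal sketch of how a rubric-based grading call could instead be scripted, shown here against the Anthropic API, follows; the system prompt, model version string, and argument values are illustrative assumptions.

```python
# Minimal sketch of a scripted rubric-based grading call using the Anthropic Python
# SDK (corresponding to one of the two GAI graders, Claude 3.5). The study ran the
# grading interactively, so the prompt text and model version below are assumptions.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

GRADER_SYSTEM = (
    "You are an academic evaluator. Grade the student artefact against the rubric, "
    "give a numerical grade, and provide 4-5 sentences of constructive, specific, "
    "actionable feedback. Avoid harsh criticism, vagueness, and bias."
)

def grade_artefact(artefact: str, writing_prompt: str, rubric: str) -> str:
    """Return the model's grade and feedback for one artefact."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model version
        max_tokens=600,
        system=GRADER_SYSTEM,
        messages=[{
            "role": "user",
            "content": (
                f"Writing prompt:\n{writing_prompt}\n\n"
                f"Rubric:\n{rubric}\n\n"
                f"Student artefact:\n{artefact}"
            ),
        }],
    )
    return message.content[0].text
```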
3 RESULTS
To analyse our data (the grades given by humans and by AI), we ran several standard statistical tests.
To establish intra-rater and inter-rater reliability for the grades assigned by the human graders and
the GAI graders, we used the intraclass correlation coefficient (ICC) [9, 10]. The results are shown in
table 2.
Table 2. ICC reliability scores for GAIs and human raters.

- Intra-rater reliability among human graders (between the grades given by the two human raters): 0.054*
- Intra-rater reliability among GAI graders (between the grades given by the two GAI raters): 0.545**
- Inter-rater reliability between human graders and GAI graders: -0.030*

** p < .05
* p > .05
The intra-rater reliability for human graders indicates poor agreement between the grades given by
the two humans (ICC = 0.05). The ICC value for GAI graders, on the other hand, indicates moderate
reliability in their grading (ICC = 0.55). To calculate the intra-rater reliability, we treated the two human
graders and the two GAI graders as single entities; each entity evaluated the artefacts twice (once per
grader). This approach allowed us to assess how humans and GAIs differ in their grading processes.
Our findings indicate that human graders exhibit greater variability in their evaluations compared to
GAI graders. As expected from these results, the inter-rater reliability between human graders and
GAI graders showed poor agreement between the two groups (ICC = -0.03).
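The ICC analysis could be reproduced along the following lines, assuming the grades are exported in a long format with one row per (artefact, rater) pair; the file and column names below are placeholders rather than part of the published data set, and pingouin is one common implementation choice.

```python
# Sketch of the ICC computation on a long-format grade table; file and column
# names are assumptions, and pingouin is one of several suitable libraries.
import pandas as pd
import pingouin as pg

grades = pd.read_csv("grades_long.csv")  # columns: artefact, rater, grade

# Reliability within the pair of human raters (MB, WB); the GAI pair (GPT, CL)
# and the human-vs-GAI comparison can be computed analogously on other subsets.
human_pair = grades[grades["rater"].isin(["MB", "WB"])]
icc_human = pg.intraclass_corr(
    data=human_pair, targets="artefact", raters="rater", ratings="grade"
)
print(icc_human[["Type", "ICC", "pval", "CI95%"]])
```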
Pearson’s correlation coefficients [11] are shown in table 3:
Table 3. Pearson's correlation coefficients among rater averages.

                   Human (WB)    Human (MB)    Custom GPT-4o    Claude 3.5
Human (WB)         -
Human (MB)         0.145*        -
Custom GPT-4o      -0.291*       -0.488**      -
Claude 3.5         0.078*        -0.027*       0.375**          -

** p < .05
* p > .05
There is practically no correlation between the human graders and the GAI graders, and the
relationships appear to be weakly negative rather than positive. There is a statistically significant
(p-value = 0.001) negative correlation between human grader (MB) and Grader-GPT, suggesting that
they tend to grade artefacts in opposite ways. When one assigns a high grade, the other tends to
assign a low grade, and vice versa.
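A sketch of how such a pairwise correlation table could be computed is given below, assuming a wide-format table with one column of average grades per rater; the file and column names are placeholders.

```python
# Sketch of the pairwise Pearson correlations between rater averages; the
# wide-format file and its column names are assumptions for illustration.
import pandas as pd
import pingouin as pg

wide = pd.read_csv("grades_wide.csv")  # columns: WB, MB, GPT, CL (one row per artefact)
pairwise = pg.pairwise_corr(wide, columns=["WB", "MB", "GPT", "CL"], method="pearson")
print(pairwise[["X", "Y", "r", "p-unc"]])
```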
The results of an ANOVA [12] test are given in table 4:
Table 4. ANOVA test results.

Source    SS (Sum of Squares)    DF (Degrees of Freedom)    MS (Mean Square)    F          p-value      np2 (Eta-squared)
Grader    45615.76               3                          15205.25            311.50     1.248e-65    0.856
Within    7614.80                156                        48.81               -          -            -

(F is the ratio of the mean square between groups to the mean square within groups.)
There are significant differences in the mean grades assigned by the different graders, as predicted by
the previous results. The large effect size (eta-squared = 0.856) also indicates that the grader type
accounts for a substantial portion of the variance in grades.
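The one-way ANOVA could be reproduced as sketched below, again assuming the long-format grade table described earlier; the library choice is an assumption, and pingouin's detailed output mirrors the columns of table 4.

```python
# Sketch of the one-way ANOVA of grade by grader; the long-format file and its
# column names are the same assumptions as in the ICC sketch above.
import pandas as pd
import pingouin as pg

grades = pd.read_csv("grades_long.csv")  # columns: artefact, rater, grade
aov = pg.anova(data=grades, dv="grade", between="rater", detailed=True)
print(aov[["Source", "SS", "DF", "MS", "F", "p-unc", "np2"]])
```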
To understand which specific groups differ from one another, we also performed a post-hoc test,
specifically Tukey's HSD (Honestly Significant Difference) test [13], to identify the pairs of graders with
significant differences in their mean grades, shown in Table 5:
Table 5. Tukey's HSD (Honestly Significant Difference) test results.

A      B      Mean(A)    Mean(B)    Diff      Se       T         p-value     Hedges' g
CL     GPT    94.43      94.93      -0.500    1.562    -0.320    9.8e-01     -0.113
CL     MB     94.43      54.56      39.86     1.562    25.51     8.3e-15     4.405
CL     WB     94.43      71.83      22.60     1.562    14.46     8.3e-15     5.832
GPT    MB     94.93      54.56      40.36     1.562    25.83     8.3e-15     4.390
GPT    WB     94.93      71.83      23.10     1.562    14.78     8.3e-15     5.501
MB     WB     54.56      71.83      -17.26    1.562    -11.05    8.3e-15     -1.929

(A and B are the two graders being compared; Mean(A) and Mean(B) are their mean grades; Diff is the
difference between the means; Se is the standard error of the difference; T is the t-value; Hedges' g is
the effect size.)
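The post-hoc comparison could be reproduced as sketched below, under the same assumptions about the data layout as in the previous sketches; pingouin's pairwise_tukey output mirrors the columns of Table 5.

```python
# Sketch of the post-hoc Tukey HSD comparison between graders; the long-format
# file and its column names are assumptions carried over from earlier sketches.
import pandas as pd
import pingouin as pg

grades = pd.read_csv("grades_long.csv")  # columns: artefact, rater, grade
tukey = pg.pairwise_tukey(data=grades, dv="grade", between="rater")
print(tukey[["A", "B", "mean(A)", "mean(B)", "diff", "se", "T", "p-tukey", "hedges"]])
```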
To provide an overall impression of the score dispersion for each rater, we have included Fig. 2,
which depicts histograms of the grade distributions.
Figure 2. Grade Distribution Comparison (Humans vs GAIs).
4 LIMITATIONS
We had originally planned to add a second research phase in which a group of different human
professors would evaluate the grades and feedback gathered in the first phase, without knowing which
were created by humans. Without this check, we cannot properly assess the validity of the grades and
the usefulness of the feedback asked about in the first research question. We also did not use the
qualitative feedback generated by the human graders for this investigation.
The most serious limitation was the unexpected homogeneity of the synthetic student artefacts despite
the apparent diversity of the synthetic student personas. This homogeneity made the artefacts difficult
to grade, since differentiated grades depend on differences in quality, perspective, and tone. It is also
not what we would expect from real student artefacts.
Another limitation concerns our observer bias: the two human evaluators of the student artefacts were
both authors of this study. As such, we had access to all the information leading to the creation of the
student artefacts, including detailed descriptions of their synthetic author personas. This bias has likely
influenced our grading and feedback, and can lead to distortions of judgement.
These qualitative limitations give rise to interesting conclusions (see below). We also encountered
quantitative limitations because of the small sample size, the limited subject area, and the small
number of participants.
5 CONCLUSIONS AND OUTLOOK
We attempted to find preliminary answers to the following questions:
1. Can GAI provide valid grades and useful feedback for students?
2. How does the GAI's assessment compare with human feedback?
3. How could one test these questions experimentally?
Ad 1. We observed a significant discrepancy between the grades given by humans and AI. This could
be because our synthetic data lack nuance and distinction, which humans penalise, while GAI may
prefer artefacts that are optimised to their own standards, much like an author asked to critically
review their own paper. Recent research indicates that training LLMs with synthetic data can lead to
model degradation and collapse over time [14]. Even our human graders exhibited considerable
variation. This may be attributed to their disciplines and their grading habits, which were formed in
their respective fields (English language, and computer and data science). The human graders
nevertheless consistently awarded lower grades than the AI graders. We conclude that synthetic
artefacts are less valued by human graders. As one of them (MB) said:
Every time I dive into these essays, I am gobsmacked by their superficiality, glibness,
and lack of detail or connection, with rare exceptions.
Ad 2. We did not address this question because we did not proceed to the next phase of our research.
Ad 3. Our investigation was hampered by multiple limitations, but we confirm that the general
experimental setup still leads to interesting, novel results that seem worth exploring further. Two of our
limitations lead us to the following additional conclusions:
- The homogeneity of the synthetic artefacts may be a spectre of things to come: if students
adopt the use of AI for writing purposes uncritically and unchecked, we expect that real artefacts
will exhibit the same homogeneity that we observed in our synthetic artefacts.
- The cognitive self-reflexive bias that we observed is well known [15]. It is normally mitigated
by anonymising student work. When pitting human graders against AI graders, the origin of the
artefact (human or GAI) needs to be anonymised as well.
The two most surprising results for us were:
1. Inserting GAI for grading and feedback into the assessment mix did revitalise the question of
assessment and opened new avenues for thinking about, planning, and motivating colleagues to
participate in this often unloved activity.
2. Synthetic data is no replacement for real data. Even though the current GAI LLMs give a good
first impression of being able to produce diverse results, "regression to the mean" dominates.
In summary, we find that GAI can potentially speed up the grading process. This is not unexpected,
since automation generally leads to faster processing. However, the reliability of these models for
grading purposes needs further investigation: a broader range of subjects and types of assignment
needs to be examined.
ACKNOWLEDGEMENTS
We gratefully acknowledge financial support from Lyon College. The authors also thank Lyon College's
Institutional Review Board for evaluating and approving this study. We are indebted to Dr. Paul Bube
for sharing data from his teaching to help us create synthetic data.
“Chi vuol esser lieto, sia: di domani non c'è certezza.” (Original version) [16]
“Live joyfully if you want, tomorrow is not certain” (English translation)
REFERENCES
[1] Wang, S., Wang, F., Zhu, Z., Wang, J., Tran, T., & Du, Z. (2024). Artificial intelligence in
education: A systematic literature review. Expert Systems with Applications, 252(Part A), 124167.
https://doi.org/10.1016/j.eswa.2024.124167
[2] Dai, W., Lin, J., Jin, F., Li, T., Tsai, Y.-S., Gašević, D., & Chen, G. (2024). Can large language
models provide feedback to students? A case study on ChatGPT. Expert Systems with
Applications, 252(Part A), 124167. https://doi.org/10.1016/j.eswa.2024.124167
[3] Escalante, J., Pack, A., & Barrett, A. (2023). AI-generated feedback on writing: Insights into
efficacy and ENL student preference. International Journal of Educational Technology in Higher
Education, 20(57). https://doi.org/10.1186/s41239-023-00425-2
[4] Kaplan-Rakowski, R., Grotewold, K., Hartwick, P. & Papin, K. (2023). Generative AI and
Teachers’ Perspectives on Its Implementation in Education. Journal of Interactive Learning
Research, 34(2), 313-338. Waynesville, NC: Association for the Advancement of Computing in
Education (AACE). Retrieved September 16, 2024
from https://www.learntechlib.org/primary/p/222363/.
[5] Scaling Synthetic Data Creation with 1,000,000,000 Personas. arXiv preprint (28 Jun 2024).
https://doi.org/10.48550/arXiv.2406.20094
[6] González-Calatayud, V., Prendes-Espinosa, P., & Roig-Vila, R. (2021). Artificial Intelligence for
Student Assessment: A Systematic Review. Applied Sciences, 11(12), 5467.
https://doi.org/10.3390/app11125467
[7] OpenAI. (2023). Introducing GPTs. https://openai.com/blog/introducing-gpts.
[8] Anthropic. (2024). Claude 3.5 Sonnet. Retrieved from
https://www.anthropic.com/news/claude-3-5-sonnet
[9] de Raadt, A., Warrens, M. J., Bosker, R. J., & Kiers, H. A. L. (2021). A comparison of reliability
coefficients for ordinal rating scales. Journal of Classification.
https://doi.org/10.1007/s00357-021-09386-5
[10] Mehta, S., Bastero-Caballero, R. F., Sun, Y., Zhu, R., Murphy, D. K., Hardas, B., & Koch, G.
(2018). Performance of intraclass correlation coefficient (ICC) as a reliability index under various
distributions in scale reliability studies. Statistics in Medicine, 37(18), 2734–2752.
https://doi.org/10.1002/sim.7679
[11] Awidi, I. T. (2024). Comparing expert tutor evaluation of reflective essays with marking by
generative artificial intelligence (AI) tool. Computers and Education: Artificial
Intelligence, 6, 100226. https://doi.org/10.1016/j.caeai.2024.100226
[12] Jukiewicz, M. (2024). The future of grading programming assignments in education: The role of
ChatGPT in automating the assessment and feedback process. Thinking Skills and Creativity,
52, 101522. https://doi.org/10.1016/j.tsc.2024.101522
[13] Tukey, J. W. (1949). Comparing individual means in the analysis of variance. Biometrics, 5(2),
99–114. https://doi.org/10.2307/3001913
[14] Peel, M. (2024). Researchers suggest that using 'synthetic' data, created by AI systems to train
LLMs, could lead to the rapid degradation of AI models and a collapse over time. Financial
Times. Retrieved from https://www.ft.com/content/ae507468-7f5b-440b-8512-aea81c6bf4a5
[15] Mugg, J., & Khalidi, M. A. (2021). Self-reflexive cognitive bias. European Journal for Philosophy
of Science, 11(3). https://doi.org/10.1007/s13194-021-00404-2
[16] De Medici, L. (1490). Triumph of Bacchus: Di doman non c'è certezza. Retrieved from
https://www.treccani.it/magazine/strumenti/una_poesia_al_giorno/07_22_Medici_Lorenzo_de.html