December 2024
We tested the ability of generative AI (GAI) to serve as a non-expert grader in the context of school-wide curriculum assessment. OpenAI's ChatGPT-4o large language model was used to create diverse student personas, and synthetic artefacts were generated from exam questions in two undergraduate courses on religion and philosophy. A custom GPT model, Claude 3.5, and human non-expert educators then evaluated these artefacts against a rubric. We applied a range of statistical tests to the results to identify correlations and patterns among the GAI graders and between the human and GAI graders. We found that different GAI models graded reliably, but that GAI and human grades differed significantly, and we identified a systematic bias in the GAI graders towards synthetic, GAI-generated artefacts. Consequently, we found synthetic artefacts unsuitable for validating GAI graders, and we identify several directions for further study. Despite these limitations, we conclude that GAI-based grading in curriculum assessment is a viable and promising use of AI, offering an intermediate evaluation perspective on student work.
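The abstract does not name the specific statistical tests used. As a hedged illustration of the kind of analysis it describes, the sketch below computes a rank correlation between two GAI graders (reliability) and a paired test of whether GAI and human grades differ; the scores, variable names, and choice of Spearman and Wilcoxon tests are assumptions for illustration, not the paper's method or data.

    # Illustrative sketch only: made-up rubric scores (0-10) for the same
    # ten artefacts, graded by two GAI models and one human educator.
    from scipy.stats import spearmanr, wilcoxon

    gpt_scores    = [7, 8, 6, 9, 7, 8, 5, 9, 6, 7]
    claude_scores = [7, 8, 7, 9, 6, 8, 5, 8, 6, 7]
    human_scores  = [5, 7, 5, 8, 6, 6, 4, 7, 5, 6]

    # Inter-rater reliability between the two GAI graders (rank correlation).
    rho, p_rho = spearmanr(gpt_scores, claude_scores)
    print(f"GAI-GAI Spearman rho = {rho:.2f} (p = {p_rho:.3f})")

    # Paired test of whether GAI and human grades differ systematically.
    stat, p_diff = wilcoxon(gpt_scores, human_scores)
    print(f"GAI vs human Wilcoxon statistic = {stat:.1f} (p = {p_diff:.3f})")

A high GAI-GAI correlation alongside a significant GAI-versus-human difference would mirror the pattern the abstract reports: the GAI graders agree with each other, yet diverge as a group from the human graders.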