Article
PDF available

Automated Essay Scoring Feedback (AESF): An Innovative Writing Solution to the Malaysian University English Test (MUET)

Authors:
  • SMK GREEN ROAD

Abstract

Recent advances in information and communication technology (ICT) infrastructure can be harnessed to support and improve the quality of teaching and learning of English writing skills, especially in second-language contexts where rule-based support is necessary. Essay writing is among the most demanding tasks for both teachers and students: teachers must conduct the class, assign tasks, mark essays, and provide feedback, while students must draft, submit, and resubmit their essays in ongoing iterative cycles that facilitate improvement. However, a common scenario is that this iterative process takes too much time, resulting in limited practice. An innovative solution that imitates the process is the Automated Essay Scoring Feedback (AESF) tool. AESF is a networked tool that can score students' essays and provide feedback instantaneously. With speed that exceeds human ability and the accuracy of a human scorer, it is hoped that AESF can increase the frequency of essay writing in class and eventually improve students' performance. This paper highlights the novelty and rationale of AESF, its design and features, and how the tool can be blended into the writing classroom, particularly for the Malaysian University English Test (MUET) extended essay.
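The score-and-feedback loop described in the abstract can be pictured as a small networked service. The sketch below is purely illustrative: the /score endpoint, the toy length-based band scorer, and the two feedback rules are assumptions for demonstration, not the actual AESF implementation.

```python
# Minimal sketch of a networked score-and-feedback service in the spirit of
# AESF. The scorer and feedback rules below are placeholders, not AESF's logic.
from flask import Flask, request, jsonify

app = Flask(__name__)

def band_score(text):
    """Toy scorer: a real system would apply a trained scoring model."""
    words = len(text.split())
    return min(6, 1 + words // 60)  # map length to a MUET-style band of 1-6

def feedback(text):
    """Toy rule-based feedback of the kind useful to L2 writers."""
    notes = []
    if len(text.split()) < 350:
        notes.append("MUET extended essays are expected to be at least 350 words.")
    if text.count(".") < 8:
        notes.append("Consider breaking long ideas into more sentences.")
    return notes

@app.route("/score", methods=["POST"])
def score():
    essay = (request.get_json(silent=True) or {}).get("essay", "")
    return jsonify({"band": band_score(essay), "feedback": feedback(essay)})

if __name__ == "__main__":
    app.run()  # students POST drafts and receive a band plus feedback instantly
```

Because each draft receives a response in well under a second, a loop of this shape is what makes the frequent write-feedback-revise cycles the abstract argues for practical in a classroom.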
... In the initial stages of AEG systems, optimization is achieved through pre-processing, involving preparing questions, reference answers, and systematic dataset processing [4]. AEG represents a promising technology addressing scalability, speed, and standardization issues in grading written essays [3,4]. Efforts to improve the effectiveness of AEG systems have led to the development of integrated and adaptive models, including FFDNN, which are designed to adapt to nuances in essays, facilitating nuanced and comprehensive evaluation [5]. ...
... Specifically, our emphasis lies in evaluating students' proficiency in information technology through feature importance analysis, particularly leveraging word synonyms within the content dimension. This innovative approach aims to enhance the accuracy and depth of computer-based essay grading [3][4][5]. ...
Conference Paper
Applying a Feed-Forward Deep Neural Network (FFDNN) within a deep learning framework, Automated Essay Grading (AEG) has demonstrated notable efficacy in evaluating responses characterized by their open-ended, short-answer nature. The application of FFDNN, coupled with Python 3, TensorFlow, and Keras, has proven to be a successful approach to accurately appraising and grading such written assessments. Despite advances in feature extraction and system evaluation, a research gap exists in adopting this approach to tackle the short-answer AEG problem, leveraging its known capabilities in handling complex sequential tasks. The dataset comprises 1200 sets, with 767 subsets allocated for training and 433 for validation of information technology essays. Both datasets include actual marks assigned by subject matter experts to each essay answer, ensuring a robust foundation for our analysis. The correlation coefficient has been employed to measure the alignment between the model's predicted and actual marks. Additionally, two error measures have been utilized to assess the effectiveness of the developed model. This paper contributes novel insights to the discourse on AEG of short essays, aiming to bridge the identified research gap. While the reported metrics indicate promising performance, a critical discussion is warranted to examine potential outliers, sensitivity to variations, and the real-world implications of observed errors.
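As a rough illustration of the pipeline this abstract describes, the sketch below trains a small feed-forward regression network with Keras (which the abstract names) and evaluates it with a correlation coefficient plus two error measures. The layer sizes, the 300-dimensional features, the 0-10 mark scale, and the choice of MAE and RMSE as the two error measures are all assumptions; the synthetic data merely stand in for extracted essay features.

```python
# Sketch of an FFDNN essay-mark regressor in Keras, evaluated by a
# correlation coefficient and two error measures. Data, shapes, and layer
# sizes are placeholders, not the paper's actual features or architecture.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X_train, y_train = rng.random((767, 300)), rng.random(767) * 10  # 767 training essays
X_val, y_val = rng.random((433, 300)), rng.random(433) * 10      # 433 validation essays

model = keras.Sequential([
    keras.layers.Input(shape=(300,)),            # fixed-length essay feature vector
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),                       # regression output: predicted mark
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)

pred = model.predict(X_val, verbose=0).ravel()
r = np.corrcoef(pred, y_val)[0, 1]              # alignment with expert marks
mae = np.mean(np.abs(pred - y_val))             # error measure 1 (assumed: MAE)
rmse = np.sqrt(np.mean((pred - y_val) ** 2))    # error measure 2 (assumed: RMSE)
print(f"r={r:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```

With random placeholder features the correlation will hover near zero; the point of the sketch is the shape of the pipeline, not the numbers.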
... These traits are described in detail in the Essay Traits Section. Most researchers conducted and evaluated experiments on their datasets (Crossley, Kyle, & McNamara, 2015; Liu et al., 2016; Ng, Bong, Lee, & Sam, 2015; Ng, Bong, Sam, & Lee, 2019; Shehab et al., 2016). There are a few public datasets with feedback annotations. ...
Preprint
The first automated essay scoring system was developed 50 years ago. Automated essay scoring systems have since evolved beyond simple scoring into systems with richer functions. Their purpose is not only to score essays but also to serve as learning tools that improve users' writing skills. Feedback is the most important aspect of making an automated essay scoring system useful in real life; its importance was already emphasized in the first AES system. This paper reviews research on feedback in automated essay scoring, including different feedback types and essay traits. We also review the latest case studies of automated essay scoring systems that provide feedback.
... Raiman et al. (2017) identified that WhatsApp is able to facilitate project-based learning. In the domain of language, Mistar and Embi (2016) found WhatsApp to be a useful platform in language learning, Cetinkaya and Sütçü (2018) in acquiring vocabulary, and Ng et al. (2016) in enhancing students' writing skills. However, Malhotra and Bansal (2017) stated that although the use of WhatsApp allows information and knowledge sharing among students, major setbacks were also highlighted by students, including poor performance, lack of interest and less time for learning activities. ...
Article
Full-text available
Purpose: Recent years have documented the growing interest in using WhatsApp in higher education. However, the determinants of students' satisfaction and loyalty towards WhatsApp groups have received less attention. This study aims to extend the DeLone and McLean information system success model by incorporating social and emotional factors to investigate the drivers of satisfaction.
Design/methodology/approach: Data were collected through questionnaires completed by 308 undergraduate students. The partial least squares technique was used for data analysis.
Findings: The findings reveal that information quality, trust in members and social usefulness play crucial roles in shaping students' satisfaction and loyalty to WhatsApp groups. System quality has no significant effect on satisfaction. Furthermore, emotional connection negatively moderates the relationship between social usefulness and satisfaction.
Practical implications: The findings of this study will be useful for educators and practitioners seeking to integrate WhatsApp into their pedagogical repertoire. The results demonstrate the importance of considering the social and emotional needs of students in addition to the quality of the information provided.
Originality/value: To the best of the authors' knowledge, this study is the first attempt to integrate system characteristics, particularly with social and emotional factors. Furthermore, this study extends the literature on WhatsApp use in higher education by testing the drivers of students' satisfaction and loyalty.
... Apart from relying on online interaction with lecturers, tertiary students can now develop learner autonomy by using tools like PaperRater. Sing et al. (2016) believe that AES also works well in tandem: used by lecturers alongside physical raters of writing, and by students for purposes such as self-study, formative assessment and assessment for learning. ...
Conference Paper
Full-text available
Automated Writing Evaluation (AWE) is an innovation in the field of language teaching and learning; with features like portfolios and writing-assistant resources, it has become a useful alternative for supporting language assessment processes during the pandemic. As with many artificial intelligence-based tools, there are always concerns about scoring accuracy, reliability, and acceptance by users. This paper aims to explore language learners' experience in using an AWE tool called PaperRater (PR), available on the internet. Data were elicited via a questionnaire designed based on the Technology Acceptance Model (TAM), focusing on six variables of acceptance, namely perceived usefulness, perceived ease of use, user satisfaction, usability, user behaviour and user profiles. Rasch model and descriptive statistical analyses were used to analyse responses from 62 undergraduates. The respondents are found to have a positive level of acceptance towards the use of AWE, as depicted by Rasch logit values ranging from -1.21 to 2.07. The tool is also perceived to be beneficial for formative learning purposes via students' self-assessment, in the absence of educators in physical classes and with limited online access to educators during the pandemic.
... Despite the usefulness of AES, its use has not been extended to school teachers and public examiners in Malaysia, who are in fact the ones who most seriously need AES as a practical tool to assist their essay assessment work [4]. As such, this research is carried out as a study towards the realization of AES in the Malaysian English test environment. ...
Article
Full-text available
Automated Essay Scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational assessment context. It is developed to overcome time, cost, and reliability issues in writing assessment. Most contemporary AES are "western" proprietary products, designed for native English speakers, where the source code is not made available to the public and the assessment criteria tend to be associated with the scoring rubrics of a particular English test context. Therefore, such AES may not be appropriate for direct adoption in the Malaysian context. No actual software development work has been found on building an AES for the Malaysian English test environment. As such, this work is carried out as a study for formulating the requirements of a local AES targeted at Malaysia's essay assessment environment. In our work, we assessed a well-known AES called LightSide to determine its suitability for our local context. We used various machine learning techniques provided by LightSide to predict the scores of Malaysian University English Test (MUET) essays and compared its performance, i.e. the percentage of exact agreement between LightSide and the human scores of the essays. We also review and discuss the theoretical aspects of AES, i.e. its state of the art and its reliability and validity requirements. The findings in this paper will be used as the basis of our future work in developing a local AES, namely the Intelligent Essay Grader (IEG), for the Malaysian English test environment.
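The evaluation metric named here, percentage of exact agreement, is straightforward to compute. The sketch below also reports adjacent agreement (within one band), a common companion metric in AES research that this abstract itself does not mention; the band scores are invented for illustration.

```python
# Illustrative computation of the percentage of exact agreement between
# machine-predicted scores and human scores, the comparison the abstract
# describes. The band scores below are made up for demonstration.
human   = [4, 5, 3, 4, 6, 2, 4, 5, 3, 4]
machine = [4, 4, 3, 4, 6, 3, 4, 5, 2, 4]

n = len(human)
exact = sum(h == m for h, m in zip(human, machine)) / n * 100
adjacent = sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / n * 100
print(f"exact agreement: {exact:.1f}%  adjacent agreement: {adjacent:.1f}%")
```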
Article
Full-text available
The objective of high-stakes assessment is to appraise students' achievement at the end of an academic calendar. Students' results in such assessments are impactful in determining their future, such as qualification for university enrolment or job interviews. In operationalizing high-stakes assessment, standardization is an important element to ensure students are assessed fairly and only on their abilities. High-stakes assessment is normally administered by a centralised body without the involvement of students' respective schools. However, the decision to assign students' own teachers as the sole examiners of high-stakes assessment is contentious: can teachers adhere to standardization and measurement principles? Therefore, this paper aims to discuss the challenges faced by teachers when they are tasked to assess students in high-stakes assessment, particularly in the context of Pentaksiran Tingkatan Tiga (PT3) in Malaysia, and to propose several potential solutions. The main idea underpinning this paper is to enhance the quality of assessment practice in the education system, which is salient to assuring the validity and reliability of the results students receive. Such discussion is also needed to acknowledge teachers as professional players in the arena of educational assessment. More importantly, this paper puts forward an urge for educational stakeholders to redesign the assessment system based on research-based practice.
Conference Paper
Full-text available
A pilot study of a vendor-provided automated grading system was conducted in a Business Law class of 27 students. Students answered a business law fact-pattern question which was reviewed and graded by the textbook vendor utilizing artificial intelligence software. Students were surveyed on their use, satisfaction, perceptions and technical issues in utilizing the Write Experience automated essay scoring (AES) software. The instructor also chronicles the adoption, set-up and use of an AES. Also detailed are the advantages and disadvantages of utilizing such software in an undergraduate course environment where some students may not be technologically adept or may lack motivation to experiment with a new testing procedure. Automated grading of student assignments is part of the next wave of textbook enhancements that vendors will be providing to instructors in the near future. Several vendors are conducting beta testing with instructors in the hope of offering automated grading as part of their textbook and course support. Of course, such services will be an additional cost on top of the text; exactly what will be charged for them remains to be seen. The vast majority of previous research on AES has been limited to its use in grading assignments in the STEM fields. Computer science instructors have been experimenting with self-created automated grading software for several decades. Recently, automated grading software has been implemented by vendors to score essay questions in online tests such as the Graduate Management Admission Test (GMAT). The experience of one business law class in using automated grading software as part of such a beta test can give valuable insight into the needs and expectations of the three stakeholders in this situation: instructor, student and vendor. Implementation of an AES raises numerous issues. Should an instructor be willing to relinquish control of the grading process to an outside entity? Are students' needs for feedback and grading fairness being met? Is the technology advanced enough to replace the human element always present in grading assignments? What will be the economic impact on students if such software is adopted? The implications for the various stakeholders are discussed and addressed. The author comes to conclusions about the pedagogical usefulness of AES systems and offers suggestions for best practices to be employed by instructors interested in implementing such software in their courses.
Conference Paper
Full-text available
Various word-processing systems have been developed to identify grammatical errors and mark learners' essays. However, they are not specifically developed for Malaysian ESL (English as a second language) learners. A marking tool capable of identifying errors in these learners' ESL writing is very much needed. Though numerous techniques have been adopted in grammar checking and automated essay marking systems, research on the formation and use of heuristics to aid the construction of automated essay marking systems has been scarce. This paper introduces a heuristics-based approach that can be utilized for grammar checking of tenses. This approach, which uses natural language processing techniques, can be applied as part of the software requirements for a CBEM (Computer Based Essay Marking) system for ESL learners. Preliminary results based on the training set show that the heuristics are useful and can improve the effectiveness of an automated essay marking tool for detecting grammatical tense errors in ESL writing.
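As an illustration of what one such heuristic might look like (an assumed example, not one of the paper's actual rules), the sketch below uses NLTK part-of-speech tags to flag an auxiliary "do" form followed by a past-tense verb, a common ESL tense error such as "She did went ...".

```python
# Sketch of a single tense-checking heuristic of the kind the paper proposes:
# flag "do/does/did" followed directly by a past-tense verb (VBD). The rule
# and example sentence are illustrative assumptions.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)      # older NLTK
nltk.download("averaged_perceptron_tagger_eng", quiet=True)  # NLTK >= 3.9

def flag_do_plus_past(sentence):
    tokens = sentence.split()  # simple whitespace tokenization keeps the sketch light
    tags = nltk.pos_tag(tokens)
    errors = []
    for (w1, _), (w2, t2) in zip(tags, tags[1:]):
        # Heuristic: an auxiliary "do" form should be followed by a base
        # form (VB), not a past-tense form (VBD).
        if w1.lower() in {"do", "does", "did"} and t2 == "VBD":
            errors.append(f"'{w1} {w2}': use the base form of the verb after '{w1}'")
    return errors

print(flag_do_plus_past("She did went to school yesterday"))
```

A full CBEM system would chain many such rules over tagged text and aggregate the flags into marking feedback; the sketch only shows the pattern-match-and-report shape of one rule.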
Article
The development of language processing technologies and statistical methods has enabled modern automated writing evaluation (AWE) systems to provide feedback on language and content in addition to an automated score. However, concerns have been raised with regard to the instructional and assessment value of AWE in writing classrooms, and the findings from the few classroom-based studies concerning the impact of AWE on writing instruction and performance are largely inconclusive. Meanwhile, since research provides favorable evidence for the reliability of AWE corrective feedback, and since writing accuracy is both an important and a frustrating issue, it is worthwhile to examine more specifically the impact of AWE corrective feedback on writing accuracy. Therefore, this study used a mixed-methods design to investigate how Criterion® affected writing instruction and performance. Results suggested that Criterion® led to increased revision, and that the corrective feedback from Criterion® helped improve accuracy from a rough to a final draft. The potential benefits were also confirmed by the instructors' interviews. The students' perspectives were mixed, and the extent to which their views varied may depend on the students' language proficiency level and their instructors' use and perceptions of AWE.
Article
Language contact touches on theoretical, descriptive and applied linguistics. Typically, research into lexical contact outcomes has tended to be cumulative, collecting contact effects, whether old or recent, on the assumption that they are all relevant today. Their currency or demise or, in the case of recent formations, their increasing currency and spread are under-researched. Such themes are crucial in multilingual, multi-cultural and multi-religious contexts such as Malaysia in South-East Asia, where loan expressions can signal either a developing over-arching 'Malaysian English-ness' or the existence of ethnic lines of division. This paper investigates the contact of English with other languages in Malaysia and raises three questions. Given the large number of loan words in English and in Malaysian English, the first question is to what extent loans from Malay, Chinese and Indian languages are 'known' and to what extent this knowledge stratifies in terms of ethnicity, religion, etc. Findings might signal a trend towards an overarching, pan-ethnic general Malaysian English (MalE). Particularly interesting are expressions from Arabic, which occur in growing numbers and raise questions of inter-religious comprehension. The second question derives from a tentative classification of selected words into those that go back to the contact with European languages from the 15th century up to independence and those that have come into MalE since; the first category is considered 'old', the second 'recent'. New words may signal a development towards endo-normativity. The third question, which we hoped to find evidence for, is whether age is a factor that interacts with the findings for the other two questions. Though limited in scope, this study has wider applications to English, socio- and educational linguistics.
Article
Integrated writing tasks that depend on input from other language abilities are gaining ground in teaching and assessment of L2 writing. Understanding how raters assign scores to integrated tasks is a necessary step for interpreting performance from this assessment method. The current study investigates how raters approach reading-to-write tasks, how they react to source use, the challenges they face, and the features influencing their scoring decisions. To address these issues, the study employed an inductive analysis of interviews and think-aloud data obtained from two raters. The results of the study showed raters attending to judgment strategies more than interpretation behaviors. In addition, the results found raters attending to a number of issues specifically related to source use: (a) locating source information, (b) citation mechanics, and (c) quality of source use. Furthermore, the analysis revealed a number of challenges faced by raters when working on integrated tasks. While raters focused on surface source use features at lower levels, they shifted their attention to more sophisticated issues at advanced levels. These results demonstrate the complex nature of integrated tasks and stress the need for writing professionals to consider the scoring and rating of these tasks carefully.
Article
We studied the effect of the web-based tool "Calibrated Peer Review"™ (CPR) on students' confidence in their ability to recognize the quality of their own work. CPR can be used in large-enrollment classes to allow a controlled peer review of moderate-length student essays. We expected that teaching students how to grade an essay and having them grade their own work would increase confidence in assessing the quality of their own essays, and the results support this. Three introductory astronomy classes participated in this study during 2005 at the University of Wisconsin Eau Claire, a four-year university. Four essays were assigned in both the experimental class (104 students) and the control classes (34 students). In the control classes, students were given a score on each essay and perhaps a few written comments. The experimental group used the CPR tool, in which students are taught how to evaluate an essay, evaluate assignments written by peers, and evaluate their own essay. Three survey questions were used to characterize the change in students' confidence in their ability to assess their own work; results from a survey at the end of the semester were compared to those from the same survey administered at the beginning. A measurable effect on the average confidence level of the experimental class was found: by the final survey, significantly more of the CPR students had shifted to a more positive statement of their confidence in evaluating their own written work. No effect was seen in the classes that wrote essays but did not use the CPR system, showing that this result is due to using the CPR system for the essays, not just to writing essays or becoming more confident over the course of the semester.
Article
This chapter discusses the e-rater® scoring engine, automated essay scoring with natural language processing, holistic scoring, and CriterionSM.