Conference Paper

STAGE: a software tool for automatic grading of testing exercises: case study paper


Abstract

We report on an approach and associated tool support for automatically evaluating and grading exercises in Software Engineering courses by connecting various third-party tools to the online learning platform Moodle. In the case study presented here, the tool was used in several instances of a lecture course to automatically measure test coverage criteria with respect to the test cases that students defined for given Java code. We report on empirical evidence gathered in this case study (involving more than 250 students), including the results of a survey conducted after the exercises (which yielded positive feedback from the students), as well as a performance evaluation of our tool implementation.


... Comparing to OC, unit testing (UT) offers more granularity of evaluation, being able to assess classes, methods, and even statements [6]. Examples of automated assessment tools using xUnit testing frameworks include JPLAS [83] and STAGE [172]. Web testing frameworks, such as Selenium [214] and Mocha [91], and mobile testing frameworks, such as Appium [80], are also applied to check the functionality of students' web pages and mobile apps automatically. ...
... Hence, a few automated assessment tools evaluate the quality of the test suites produced by students, particularly through test coverage, i.e., the amount of code executed when it runs. External tools are typically integrated to measure coverage such as CodeCover [172,174] and Emma [180,205]. ...
... Although most tools still rely on user permissions [2,60], jails [40,164,203], and JVM [115,172,242] security mechanisms, several new tools [156,180,226] are adopting containerization, particularly through Docker containers, to perform the dynamic analysis of students' programs. Docker reduces the entry barrier to using containers, which enables their fast-growing popularization lately, as containers also offer a similar level of isolation to that of a VM while keeping a much lower overhead, slightly above other solutions depending on the image [178]. ...
Article
Full-text available
Practical programming competencies are critical to the success in computer science education and go-to-market of fresh graduates. Acquiring the required level of skills is a long journey of discovery, trial and error, and optimization seeking through a broad range of programming activities that learners must perform themselves. It is not reasonable to consider that teachers could evaluate all attempts that the average learner should develop multiplied by the number of students enrolled in a course, much less in a timely, thorough, and fair fashion. Unsurprisingly, exploring the formal structure of programs to automate the assessment of certain features has long been a hot topic among CS education practitioners. Assessing a program is considerably more complex than asserting its functional correctness, as the proliferation of tools and techniques in the literature over the past decades indicates. Program efficiency, behavior, readability, among many other features, assessed either statically or dynamically, are now also relevant for automatic evaluation. The outcome of an evaluation evolved from the primordial boolean values to information about errors and tips on how to advance, possibly taking into account similar solutions. This work surveys the state-of-the-art in the automated assessment of CS assignments, focusing on the supported types of exercises, security measures adopted, testing techniques used, type of feedback produced, and the information they offer the teacher to understand and optimize learning. A new era of automated assessment, capitalizing on static analysis techniques and containerization, has been identified. Furthermore, this review presents several other findings, discusses the current challenges of the field, and proposes some future research directions.
... The most common uses of automatic scoring and feedback are in three areas: 1) programming problems through mechanisms such as assisting the student with the coding [86,87], analyzing coding patterns [88,89], automatic grading [90,91,92,93,94], and customized feedback [86,88,87]; 2) short essays [95]; and 3) extended essays [96]. Programming problems are the easiest to use as input for this technology, as they appear in a structured language that computers can understand [97]. ...
Article
Full-text available
Automatic scoring and feedback tools have become critical components of online learning proliferation. These tools range from multiple-choice questions to grading essays using machine learning (ML). Learning environments such as massive open online courses (MOOCs) would not be possible without them. The usage of this mechanism has brought many exciting areas of study, from the design of questions to the ML grading tools’ precision and accuracy. This paper analyzes the findings of 125 studies published in journals and proceedings between 2016 and 2020 on the usages of automatic scoring and feedback as a learning tool. This analysis gives an overview of the trends, challenges, and open questions in this research area. The results indicate that automatic scoring and feedback have many advantages. The most important benefits include enabling scaling the number of students without adding a proportional number of instructors, improving the student experience by reducing the time between submission grading and feedback, and removing bias in scoring. On the other hand, these technologies have some drawbacks. The main problem is creating a disincentive to develop innovative answers that do not match the expected one or have not been considered when preparing the problem. Another drawback is potentially training the student to answer the question instead of learning the concepts. In addition, given the existence of a correct answer, such an answer could be leaked to the internet, making it easier for students to avoid solving the problem. Overall, each of these drawbacks presents an opportunity to look at ways to improve technologies to use these tools to provide a better learning experience to students.
... Various systems have been built to support the delivery of teaching methods. One of them supports automatic assessment for measuring students' ability to answer the questions of a test or exam [1]. ...
Article
Full-text available
Valid evaluation results depend on choosing the right test. However, the types of test most often used at present do not take into account the cognitive abilities of the students being tested. This raises a problem, namely a mismatch in evaluation results caused by the discrepancy between the learning outcomes outlined in the Semester Learning Plan (RPS) and the type of evaluation given. The problem arises because there is no knowledge about the concept of separating the cognitive abilities of students based on the learning outcomes they are meant to achieve. Based on Bloom's Taxonomy, there are 6 levels of the cognitive domain, namely remembering (C1), understanding (C2), applying (C3), analyzing (C4), evaluating (C5), and creating (C6). Evaluation media for students based on cognitive domains were developed in previous studies. The evaluation media consist of several types of tests with several choices of cognitive domains according to the level of education of the students. A good test question will be able to evaluate the extent to which students master the indicators determined by the teacher. In this study, researchers tested the level of usability of the system. The resulting application was tested in the field with lecturers and students in a course to see the level of usability of the application. Based on the questionnaire results, the level of usability of the application is 83%. This percentage is interpreted as a very good level of application usability. Keywords: media evaluation, level of use, cognitive
... The coverage of a test suite is ostensibly a measure of how effective the suite is at catching bugs. This measure is attractive because it reflects professional software engineering practice [22] and is not labor-intensive [6]. However, coverage is at best an indirect measure, since it does not involve observing whether a test suite actually catches bugs. ...
Conference Paper
Flawed problem comprehension leads students to produce flawed implementations. However, testing alone is inadequate for checking comprehension: if a student develops both their tests and implementation with the same misunderstanding, running their tests against their implementation will not reveal the issue. As a solution, some pedagogies encourage the creation of input-output examples independent of testing-but seldom provide students with any mechanism to check that their examples are correct and thorough. We propose a mechanism that provides students with instant feedback on their examples, independent of their implementation progress. We assess the impact of such an interface on an introductory programming course and find several positive impacts, some more neutral outcomes, and no identified negative effects.
... Code coverage is often used as a proxy for thoroughness; AS-SYST [23], Web-CAT [7], and Marmoset [34] all take this approach. Code coverage is attractive because it reflects professional software engineering practice [26] and is not labor-intensive [8]. However, a growing body of evidence challenges the appropriateness of coverage as a measure of thoroughness [1,9,20], in both professional and pedagogic contexts. ...
Conference Paper
Instructors routinely use automated assessment methods to evaluate the semantic qualities of student implementations and, sometimes, test suites. In this work, we distill a variety of automated assessment methods in the literature down to a pair of assessment models. We identify pathological assessment outcomes in each model that point to underlying methodological flaws. These theoretical flaws broadly threaten the validity of the techniques, and we actually observe them in multiple assignments of an introductory programming course. We propose adjustments that remedy these flaws and then demonstrate, on these same assignments, that our interventions improve the accuracy of assessment. We believe that with these adjustments, instructors can greatly improve the accuracy of automated assessment.
Article
Full-text available
The first years in engineering degree courses are usually made of large groups with a low teacher-student ratio. Overcrowding in classrooms hinders continuous assessment much needed to promote independent learning. Therefore, there is a need to apply some kind of automatic evaluation to facilitate the correction of exercises outside the classroom. We introduce here a first experience using surveys in Moodle 2.0 in order to get an automatic evaluation of practices in our Database course. We report survey valuation of the autonomous learning tool and preliminary statistics assessing correlation to an improvement in the practice exam marks.
Article
Full-text available
Nowadays, with the use of technology and the Internet, education is undergoing significant changes, contemplating new ways of teaching and learning. One of the widely used teaching methods for promoting knowledge consists in the use of virtual environments available in various formats, such as the teaching-learning platforms that are available online. Internet access and the use of laptops have created the technological conditions for teachers and students to benefit from the diversity of online information, communication, collaboration and sharing with others. The integration of Internet services in teaching practices can provide thematic, social and digital enrichment for the agents involved.
Conference Paper
Full-text available
This paper presents a systematic literature review of the recent (2006-2010) development of automatic assessment tools for programming exercises. We discuss the major features that the tools support and the different approaches they are using both from the pedagogical and the technical point of view. Examples of these features are ways for the teacher to define tests, resubmission policies, security issues, and so forth. We have also identified a list of novel features, like assessing web software, that are likely to get more research attention in the future. As a conclusion, we state that too many new systems are developed, but also acknowledge the current reasons for the phenomenon. As one solution we encourage opening up the existing systems and joining efforts on developing those further. Selected systems from our survey are briefly described in Appendix A.
Article
Full-text available
Systems that automatically assess student programming assignments have been designed and used for over forty years. Systems that objectively test and mark student programming work were developed simultaneously with programming assessment in the computer science curriculum. This article reviews a number of influential automatic assessment systems, including descriptions of the earliest systems, and presents some of the most recent developments. The final sections explore a number of directions automated assessment systems may take, presenting current developments alongside a number of important emerging e-learning specifications.
Conference Paper
Assessment of learning and assessment for learning are at the core of the research on new teaching strategies involving the use of new technologies recently performed by University of Turin. The practice of automated assessment in Mathematics using the grading system Maple T.A. has been introduced in many undergraduate scientific courses and, after the initial success, it has been diffused in high-schools through several projects aimed to improve Maths teaching and learning. The following paper is intended to describe the effectiveness of automated assessment as a learning tool, the strength of Maple T.A. for grading Mathematics and its integration in a learning content management system, and the results obtained at University and in high-schools.
Article
Many CS instructors are embracing MOOCs by creating their own course offering on one of the major MOOC platforms. Along with MOOCs has come a debate about their effect on academia. But rather than discussing MOOCs as a stand-alone entity, we want to discuss the usage of MOOCs to augment normal lecture classes. In such cases, instructors are supplementing their live lectures with on-line lectures which are often produced by leading experts of the field. Fox refers to these as SPOCs: small private online courses [1]. The marriage of live lectures and MOOC materials has the potential of maximizing both their strengths while minimizing some of their perceived weaknesses. On the one hand you have a live instructor who has a vested interest in conveying course material in the best way possible. An instructor who lectures to the students on a regular basis and keeps them on task. An instructor who holds office hours where students can get face-to-face help when needed. And on the other hand you have high quality lectures available 24 hours a day that students can watch as many times as desired. You have well vetted homework assignments that have powerful auto-graders to grade them. This panel is made up of CS faculty members who have used MOOC materials to augment their courses, not replace them. In some cases the MOOC material constitutes the bulk of the course, while in other cases it may just be a small portion of the class. In some cases the MOOC materials were created by the instructors themselves, while in other cases the instructors are using MOOCs produced by others. The panel will address the logistics of such a course, and will share experiences, both good and bad, about running such a course. In the end, we hope to encourage more to consider leveraging the power of MOOC platforms to augment their own courses.
Article
Many educators now include software testing activities in programming assignments, so there is a growing demand for appropriate methods of assessing the quality of student-written software tests. While tests can be hand-graded, some educators also use objective performance metrics to assess software tests. The most common measures used at present are code coverage measures—tracking how much of the student’s code (in terms of statements, branches, or some combination) is exercised by the corresponding software tests. Code coverage has limitations, however, and sometimes it overestimates the true quality of the tests. Some researchers have suggested that mutation analysis may provide a better indication of test quality, while some educators have experimented with simply running every student’s test suite against every other student’s program—an “all-pairs” strategy that gives a bit more insight into the quality of the tests. However, it is still unknown which one of these measures is more accurate, in terms of most closely predicting the true bug revealing capability of a given test suite. This paper directly compares all three methods of measuring test quality in terms of how well they predict the observed bug revealing capabilities of student-written tests when run against a naturally occurring collection of student-produced defects. Experimental results show that all-pairs testing—running each student’s tests against every other student’s solution—is the most effective predictor of the underlying bug revealing capability of a test suite. Further, no strong correlation was found between bug revealing capability and either code coverage or mutation analysis scores.
Article
Universities are supplementing classroom experience with small private online courses (SPOC) in place of massive open online courses (MOOC). Many universities are successfully using MOOC technology differently to achieve these objectives. Students in an analog circuits course used MIT-authored MOOC lectures and homework assignments created by Anant Agarwal under a pilot program at San José State University in California, US. The students' in-classroom time has been spent working on lab and design problems with local faculty and teaching assistants. The students in this SPOC have scored five percentage points higher on the first exam and 10 points on the second exam than the previous cohort that had used the traditional material. Using MOOC materials in a SPOC format is one way that MOOCs can be successful in helping to answer this broader question.
Conference Paper
Automatic grading of programming assignments is an important topic in academic research. It aims at improving the level of feedback given to students and optimizing the professor’s time. Its importance is more remarkable as the amount and complexity of assignments increases. Several studies have reported the development of software tools to support this process. They usually consider particular deployment scenarios and specific requirements of the interested institution. However, the quantity and diversity of these tools makes it difficult to get a quick and accurate idea of their features. This paper reviews an ample set of tools for automatic grading of programming assignments. The review includes a description of every tool selected and their key features. Among others, the key features analyzed include the programming language used to build the tool, the programming languages supported for grading, the criteria applied in the evaluation process, the work mode (as a plugin, as an independent tool, etc.), the logical and deployment architectures, and the communications technology used. Then, implementations and operation results are described with quantitative and qualitative indicators to understand how successful the tools were. Quantitative indicators include number of courses, students, tasks, submissions considered for tests, and acceptance percentage after tests. Qualitative indicators include motivation, support, and skills improvement. A comparative analysis among the tools is shown, and as result a set of common gaps detected is provided. The lack of normalized evaluation criteria for assignments is identified as a key gap in the reviewed tools. Thus, an evaluation metrics frame to grade programming assignments is proposed. The results indicate that many of the analyzed features highly depend on the current technology infrastructure that supports the teaching process. Therefore, they are a limiting factor in reusing the tools in new implementation cases. 
Another fact is the inability to support new programming languages, which is limited by tools’ updates. On metrics for evaluation process, the set of analyzed tools showed much diversity and inflexibility. Knowing which implementation features are always specific and particular independently of the project, and which others could be common will be helpful before the implementation and operation of a tool. Considering how much flexibility could be attained in the evaluation process will be helpful to design a new tool, which will be used not only in particular cases, and to define the automation level of the evaluation process.
Article
Fifteen months ago the first version of an “automatic grader” was tried with a group of twenty students taking a formal course in programming. The first group of twenty programs took only five minutes on the computer (an IBM 650). With such a satisfactory beginning, the grader was then used for the entire course with this group of students and has been used at Rensselaer ever since. For all exercises, the average time spent on the computer has run from half a minute to a minute for each student. In general only an eighth as much computer time is required when the grader is used as is required when each student is expected to run his own program, probably less than a third as much staff time, and considerably less student time. The grader easily justifies itself on economic grounds. It accomplishes more than savings in time and money; it makes possible the teaching of programming to large numbers of students. This spring we had 80 students taking a full semester course in programming; over 120 are expected next spring. We could not accommodate such numbers without the use of the grader. Even though the grader makes the teaching of programming to large numbers of students possible and economically feasible, a most serious question remains: how well did the students learn? After fifteen months, our experience leads us to believe that students learn programming not only as well but probably better than they did under the method we previously used (laboratory groups of four or five students). They are not as skilled in machine operation, however, since they get only a brief introduction to it late in the course. After learning programming, very little time is needed for each student to become at least an adequate machine operator. Students seem to like the grader and are not reluctant to suggest improvements!
Article
The ALGOL grader programs are presented for the computer evaluation of student ALGOL programs. One is for a beginner's program; it furnishes random data and checks answers. The other provides a searching test of the reliability and efficiency of a rootfinding procedure. There is a statement of the essential properties of a computer system, in order that grader programs can be effectively used.
Article
The project conceived in 1929 by Gardner Murphy and the writer aimed first to present a wide array of problems having to do with five major "attitude areas"--international relations, race relations, economic conflict, political conflict, and religion. The kind of questionnaire material falls into four classes: yes-no, multiple choice, propositions to be responded to by degrees of approval, and a series of brief newspaper narratives to be approved or disapproved in various degrees. The monograph aims to describe a technique rather than to give results. The appendix, covering ten pages, shows the method of constructing an attitude scale. A bibliography is also given.
R. Lobb, "Coderunner documentation (v2.4.2)." http://coderunner.org.nz/mod/book/tool/print/index.php?id=50, October 2015.
S. Zhigang, S. Xiaohong, Z. Ning, and C. Yanyu, "Moodle plugins for highly efficient programming courses," in Moodle Research Conference, vol. 1, pp. 157-163, 2012.
S. T. Deane, Further development on the Moodle CodeHandIn Package. PhD thesis, Flinders University-Adelaide, Australia, 2014.
S. Robertson, "Moodle on the move." WebEx session, https://onlinevideo.napier.ac.uk/Play/5081, 2015.
R. Likert, "A technique for the measurement of attitudes," Archives of Psychology, 1932.