Conference Paper

Evaluating Distance Measures for Program Repair

Article
Full-text available
Practical programming competencies are critical to success in computer science education and to the transition of fresh graduates into the job market. Acquiring the required level of skill is a long journey of discovery, trial and error, and optimization, pursued through a broad range of programming activities that learners must perform themselves. It is not reasonable to expect teachers to evaluate every attempt that the average learner produces, multiplied by the number of students enrolled in a course, much less to do so in a timely, deep, and fair fashion. Unsurprisingly, exploring the formal structure of programs to automate the assessment of certain features has long been a hot topic among CS education practitioners. Assessing a program is considerably more complex than asserting its functional correctness, as the proliferation of tools and techniques in the literature over the past decades indicates. Program efficiency, behavior, and readability, among many other features, assessed either statically or dynamically, are now also relevant for automatic evaluation. The outcome of an evaluation has evolved from a simple boolean verdict to information about errors and tips on how to advance, possibly taking similar solutions into account. This work surveys the state of the art in the automated assessment of CS assignments, focusing on the supported types of exercises, the security measures adopted, the testing techniques used, the type of feedback produced, and the information offered to teachers to understand and optimize learning. A new era of automated assessment, capitalizing on static analysis techniques and containerization, has been identified. Furthermore, the review presents several other findings, discusses the current challenges of the field, and proposes future research directions.
Article
Full-text available
As computing becomes a mainstream discipline embedded in the school curriculum and acts as an enabler for an increasing range of academic disciplines in higher education, the literature on introductory programming is growing. Although there have been several reviews that focus on specific aspects of introductory programming, there has been no broad overview of the literature exploring recent trends across the breadth of the field. This paper is the report of an ITiCSE working group that conducted a systematic review in order to gain an overview of the introductory programming literature. Partitioning the literature into papers addressing the student, teaching, the curriculum, and assessment, we explore trends, highlight advances in knowledge over the past 15 years, and indicate possible directions for future research.
Conference Paper
Full-text available
Compile-time errors pose a major learning hurdle for students of introductory programming courses. Compiler error messages, while accurate, are targeted at seasoned programmers and seem cryptic to beginners. In this work, we address the problem of pedagogically inspired program repair and report TRACER (Targeted RepAir of Compilation ERrors), a system for repairing compilation errors, aimed at introductory programmers. TRACER invokes a novel combination of tools from programming language theory and deep learning and offers repairs that not only enable successful compilation but are also very close to those actually performed by students on similar errors. The ability to offer such targeted corrections, rather than just code that compiles, makes TRACER more relevant for real-time feedback to students in lab or tutorial sessions, compared to existing works that merely offer a certain compilation success rate. In an evaluation on 4500 erroneous C programs written by students of a freshman-year programming course, TRACER recommends a repair exactly matching the one expected by the student in 68% of the cases, and produces a compilable repair in 79.27% of the cases. On a further set of 6971 programs that require errors to be fixed on multiple lines, TRACER achieved a success rate of 44%, compared to the 27% success rate of the state-of-the-art technique DeepFix.
Conference Paper
Full-text available
Errors in the logic of a program (sometimes referred to as semantic errors) can be very frustrating for novice programmers to locate and resolve. Developing a better understanding of the kinds of logic error that are most common and problematic for students, and finding strategies for targeting them, may help to inform teaching practice and reduce student frustration. In this paper we analyse 15,000 code fragments, created by novice programming students, that contain logic errors, and we classify the errors into algorithmic errors, misinterpretations of the problem, and fundamental misconceptions. We find that misconceptions are the most frequent source of logic errors, and lead to the most difficult errors for students to resolve. We list the most common errors of this type as a starting point for designing specific teaching interventions to address them.
Conference Paper
Full-text available
Encountering the same compiler error repeatedly, particularly several times consecutively, has been cited as a strong indicator that a student is struggling with important programming concepts. Despite this, there are relatively few studies which investigate repeated errors in isolation or in much depth. There are also few data-driven metrics for measuring programming performance, and fewer for measuring repeated errors. This paper makes two contributions. First, we introduce a new metric to quantify repeated errors, the repeated error density (RED). We compare this to Jadud's Error Quotient (EQ), the most studied metric, and show that RED has advantages over EQ, including being less context dependent and being useful for short sessions. This allows us to answer two questions posited by Jadud in 2006 that have until now been unanswered. Second, we compare EQ and RED scores using data from an empirical control/intervention group study involving an editor which enhances compiler error messages. This intervention group has previously been shown to have a reduced overall number of student errors, number of errors per student, and number of repeated student errors per compiler error message. In this research we find a reduction in EQ, providing further evidence that error message enhancement has positive effects. In addition we find a significant reduction in RED, providing evidence that this metric is valid.
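The abstract above does not spell out how RED is computed, so the following is only a minimal sketch of a repeated-error statistic over a sequence of compilation events; the function name, the reduction of each event to an error identifier, and the normalisation by session length are assumptions made for illustration, not the published RED formula.

```python
# Minimal sketch of a repeated-error-style metric, assuming each compilation
# event is reduced to its error identifier (or None if it compiled). The exact
# RED formula is not given in the abstract; this simplified variant just counts
# consecutive repeats of the same error and normalises by session length.

def repeated_error_density(error_sequence):
    """error_sequence: list of error identifiers per compile, None on success."""
    if len(error_sequence) < 2:
        return 0.0
    repeats = 0
    for prev, curr in zip(error_sequence, error_sequence[1:]):
        if prev is not None and prev == curr:
            repeats += 1
    return repeats / (len(error_sequence) - 1)

# Example session: the student hits the same missing-semicolon error three
# times in a row before fixing it.
session = ["';' expected", "';' expected", "';' expected", None,
           "cannot find symbol", None]
print(repeated_error_density(session))  # 0.4
```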
Conference Paper
Full-text available
Educational data mining and learning analytics promise better understanding of student behavior and knowledge, as well as new information on the tacit factors that contribute to student actions. This knowledge can be used to inform decisions related to course and tool design and pedagogy, and to further engage students and guide those at risk of failure. This working group report provides an overview of the body of knowledge regarding the use of educational data mining and learning analytics focused on the teaching and learning of programming. In a literature survey on mining students' programming processes for 2005–2015, we observe a significant increase in work related to the field. However, the majority of the studies focus on simplistic metric analysis and are conducted within a single institution and a single course. This indicates the existence of further avenues of research and a critical need for validation and replication to better understand the various contributing factors and the reasons why certain results occur. We introduce a novel taxonomy to analyse replicating studies and discuss the importance of replicating and reproducing previous work. We describe the state of the art in collecting and sharing programming data. To better understand the challenges involved in replicating or reproducing existing studies, we report our experiences from three case studies using programming data. Finally, we present a discussion of future directions for the education and research community.
Conference Paper
Full-text available
The programming task known as the Rainfall Problem has developed a reputation for being surprisingly difficult for introductory-level (CS1) students. We contribute a survey of studies of the problem as well as a new study of students' solutions collected at three institutions. In all three CS1 courses, roughly half of the students or more were able to fully solve the problem, and the large majority were at least close. Failure to handle invalid or missing input accounted for most bugs. Our survey and study together suggest that the Rainfall Problem is not necessarily overwhelmingly difficult: success rates vary, and some reasonably good results have been achieved under multiple programming paradigms. We provide a breakdown of confounding factors and suggest improvements and hypotheses for future studies of the Rainfall Problem.
Article
Full-text available
More and more people take up learning how to program: in schools and universities, in large open online courses, or on their own. A large number of tools have been developed over the years to support learners with the difficult task of building programs. Many of these tools focus on the resulting program and not on the process: they fail to help the student take the necessary steps towards the final program. We have developed a prototype of a programming tutor that helps students with feedback and hints to progress towards a solution to an introductory imperative programming problem. We draw upon the ideas of a similar tutor for functional programming and translate these ideas to a different paradigm. Our tutor is based on model solutions from which a programming strategy is derived, capturing the different paths to these solutions. We allow for variation by expanding the strategy with alternatives and using program transformations. The instructor is able to adapt the behaviour of the tutor by annotating the model solutions. We show a tutoring session to demonstrate that a student can arrive at a solution by following the generated hints. We found that we can recognise between 33% and 75% of student solutions to three programming exercises as similar to a model solution, a proportion that can be increased by incorporating more variations.
Article
Full-text available
Previous investigations of student errors have typically focused on samples of hundreds of students at individual institutions. This work uses a year's worth of compilation events from over 250,000 students all over the world, taken from the large Blackbox data set. We analyze the frequency, time-to-fix, and spread of errors among users, showing how these factors inter-relate, in addition to their development over the course of the year. These results can inform the design of courses, textbooks and also tools to target the most frequent (or hardest to fix) errors.
Article
Full-text available
To provide personalized help to students who are working on code-writing problems, we introduce a data-driven tutoring system, ITAP (Intelligent Teaching Assistant for Programming). ITAP uses state abstraction, path construction, and state reification to automatically generate personalized hints for students, even when given states that have not occurred in the data before. We provide a detailed description of the system’s implementation and perform a technical evaluation on a small set of data to determine the effectiveness of the component algorithms and ITAP’s potential for self-improvement. The results show that ITAP is capable of producing hints for almost any given state after being given only a single reference solution, and that it can improve its performance by collecting data over time.
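As a rough illustration of the state-abstraction step that data-driven hint generators such as ITAP build on, the sketch below collapses superficially different Python submissions into one canonical state by parsing and unparsing the code. ITAP's real abstraction is considerably richer, and it additionally constructs paths between states and reifies hints, so this is only a hedged sketch of the first step; it assumes Python 3.9+ for ast.unparse, and the function name is illustrative.

```python
# Sketch of the "state abstraction" idea: reduce a submission to a canonical
# form so that superficially different programs map to the same state. Here the
# abstraction is simply "parse, then unparse the AST", which discards comments
# and formatting; ITAP's actual abstraction is much richer. Requires Python 3.9+.
import ast

def abstract_state(source: str) -> str:
    """Canonical textual form of a submission; raises SyntaxError if it does not parse."""
    return ast.unparse(ast.parse(source))

a = "def f(x): return x+1   # add one"
b = "def f(x):\n    return x + 1\n"
print(abstract_state(a) == abstract_state(b))  # True: both collapse to one state
```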
Conference Paper
Full-text available
The frequency of different kinds of error made by students learning to write computer programs has long been of interest to researchers and educators. In the past, various studies investigated this topic, usually by recording and analysing compiler error messages and producing tables of relative frequencies of specific error diagnostics produced by the compiler. In this paper, we improve on such prior studies by investigating actual logical errors in student code, as opposed to diagnostic messages produced by the compiler. The actual errors reported here are more precise, more detailed, and more accurate than the diagnostics produced automatically. In order to present frequencies of actual errors, error categories were developed and validated, and student code captured at the time of compilation failure was manually analysed by multiple researchers. The results show that error causes can be manually analysed by independent researchers with good reliability. The resulting table of error frequencies shows that prior work using diagnostic messages tended to group some distinct errors together in single categories, which can now be listed more accurately.
Article
Automated tutoring systems offer the flexibility and scalability necessary to facilitate the provision of high-quality and universally accessible programming education. To realise the potential of these systems, recent work has proposed a diverse range of techniques for automatically generating feedback in the form of hints to assist students with programming exercises. This article integrates these apparently disparate approaches into a coherent whole. Specifically, it emphasises that all hint techniques can be understood as a series of simpler components with similar properties. Using this insight, it presents a simple framework for describing such techniques, the Hint Iteration by Narrow-down and Transformation Steps framework, and surveys recent work in the context of this framework. Findings from this survey include that (1) hint techniques share similar properties, which can be used to visualise them together, (2) the individual steps of hint techniques should be considered when designing and evaluating hint systems, (3) more work is required to develop and improve evaluation methods, and (4) interesting relationships, such as the link between automated hints and data-driven evaluation, should be further investigated. Ultimately, this article aims to facilitate the development, extension, and comparison of automated programming hint techniques to maximise their educational potential.
Article
The problem of automatically fixing programming errors is a very active research topic in software engineering. This is a challenging problem as fixing even a single error may require analysis of the entire program. In practice, a number of errors arise due to a programmer's inexperience with the programming language or lack of attention to detail. We call these common programming errors. These are analogous to grammatical errors in natural languages. Compilers detect such errors, but their error messages are usually inaccurate. In this work, we present an end-to-end solution, called DeepFix, that can fix multiple such errors in a program without relying on any external tool to locate or fix them. At the heart of DeepFix is a multi-layered sequence-to-sequence neural network with attention, which is trained to predict erroneous program locations along with the required correct statements. On a set of 6971 erroneous C programs written by students for 93 programming tasks, DeepFix could fix 1881 (27%) programs completely and 1338 (19%) programs partially.
Conference Paper
Programming error messages play an important role in learning to program. The cycle of program input and error message response completes a loop between the programmer and the compiler/interpreter and is a fundamental interaction between human and computer. However, error messages are notoriously problematic, especially for novices. Despite numerous guidelines citing the importance of message readability, there is little empirical research dedicated to understanding and assessing it. We report three related experiments investigating factors that influence programming error message readability. In the first two experiments we identify possible factors, and in the third we ask novice programmers to rate messages using scales derived from these factors. We find evidence that several key factors significantly affect message readability: message length, jargon use, sentence structure, and vocabulary. This provides novel empirical support for previously untested long-standing guidelines on message design, and informs future efforts to create readability metrics for programming error messages.
Article
Novice programmers often struggle with the formal syntax of programming languages. In the traditional classroom setting, they can make progress with the help of real-time feedback from their instructors, which is often impossible to get in the massive open online course (MOOC) setting. Syntactic error repair techniques have huge potential to assist them at scale. Towards this, we design a novel programming language correction framework amenable to reinforcement learning. The framework allows an agent to mimic human actions for text navigation and editing. We demonstrate that the agent can be trained through self-exploration directly from the raw input, that is, the program text itself, without either supervision or any prior knowledge of the formal syntax of the programming language. We evaluate our technique on a publicly available dataset containing 6975 erroneous C programs with typographic errors, written by students during an introductory programming course. Our technique fixes 1699 (24.4%) programs completely and 1310 (18.8%) programs partially, outperforming DeepFix, a state-of-the-art syntactic error repair technique, which uses a fully supervised neural machine translation approach.
Conference Paper
In this article, we report an experiment where students in an introductory programming course were given the opportunity to view model solutions to programming assignments whenever they wished, without the need to complete the assignments beforehand or to wait for the deadline to pass. Our experiment was motivated by the observation that some students may spend hours stuck with an assignment, leading to non-productive study time. At the same time, we considered the possibility of students using the sample solutions as worked examples, which could help students to improve the design of their own programs. Our experiment suggests that many of the students use the model solutions sensibly, indicating that they can control their own work. At the same time, a minority of students used the model solutions as a way to proceed in the course, leading to poor exam performance.
Conference Paper
Automated hints, a powerful feature of many programming environments, have been shown to improve students' performance and learning. New methods for generating these hints use historical data, allowing them to scale easily to new classrooms and contexts. These scalable methods often generate next-step, code hints that suggest a single edit for the student to make to their code. However, while these code hints tell the student what to do, they do not explain why, which can make these hints hard to interpret and decrease students' trust in their helpfulness. In this work, we augmented code hints by adding adaptive, textual explanations in a block-based, novice programming environment. We evaluated their impact in two controlled studies with novice learners to investigate how our results generalize to different populations. We measured the impact of textual explanations on novices' programming performance. We also used quantitative analysis of log data, self-explanation prompts, and frequent feedback surveys to evaluate novices' understanding and perception of the hints throughout the learning process. Our results showed that novices perceived hints with explanations as significantly more relevant and interpretable than those without explanations, and were also better able to connect these hints to their code and the assignment. However, we found little difference in novices' performance. Our results suggest that explanations have the potential to make code hints more useful, but it is unclear whether this translates into better overall performance and learning.
Article
The types of programming errors that novice programmers make and struggle to resolve have long been of interest to researchers. Various past studies have analyzed the frequency of compiler diagnostic messages. This information, however, does not correlate directly with the types of errors students make, due to the inaccuracy and imprecision of diagnostic messages. Furthermore, few attempts have been made to determine the severity of different kinds of errors in terms other than frequency of occurrence. Previously, we developed a method for meaningful categorization of errors and produced a frequency distribution of these error categories; in this article, we extend the previous method to also determine error difficulty, in order to give a better measurement of the overall severity of different kinds of errors. An error category hierarchy was developed and validated, and errors in snapshots of students' source code were categorized accordingly. The result is a frequency table of logical error categories rather than diagnostic messages. Resolution time for each of the analyzed errors was calculated, and the average resolution time for each category of error was determined; this defines an error difficulty score. The combination of frequency and difficulty allows us to identify the types of error that are most problematic for novice programmers. The results show that ranking errors by severity (a product of frequency and difficulty) yields a significantly different ordering than ranking them by frequency alone, indicating that error frequency by itself may not be a suitable indicator of which errors are actually the most problematic for students.
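Since the abstract defines difficulty as average resolution time and severity as the product of frequency and difficulty, a compact sketch of that ranking is given below; the error categories and timings are made-up illustrative data, not figures from the article.

```python
# Small sketch of the severity ranking described above: difficulty is the
# average resolution time per error category, and severity is the product of
# frequency and difficulty. Category names and numbers are invented.
from collections import defaultdict

# (category, resolution_time_in_seconds) for each observed error instance
observations = [
    ("semicolon expected", 20), ("semicolon expected", 35),
    ("cannot find symbol", 180), ("cannot find symbol", 240),
    ("incompatible types", 300),
]

times = defaultdict(list)
for category, seconds in observations:
    times[category].append(seconds)

severity = {
    cat: len(ts) * (sum(ts) / len(ts))   # frequency * average resolution time
    for cat, ts in times.items()
}

for cat, score in sorted(severity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat}: severity {score:.0f}")
```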
Article
In the domain of programming, a growing number of algorithms automatically generate data-driven, next-step hints that suggest how students should edit their code to resolve errors and make progress. While these hints have the potential to improve learning if done well, few evaluations have directly assessed or compared the quality of different hint generation approaches. In this work, we present the QualityScore procedure, a novel method for automatically evaluating and comparing the quality of next-step programming hints using expert ratings. We first demonstrate that the automated QualityScore ratings agree with experts’ manual ratings. We then use the QualityScore procedure to compare the quality of six data-driven, next-step hint generation algorithms using two distinct programming datasets in two different programming languages. Our results show that there are large and significant differences between the quality of the six algorithms and that these differences are relatively consistent across datasets and problems. We also identify situations where the six algorithms struggle to produce high-quality hints, and we suggest ways that future work might address these challenges. We make our methods and data publicly available and encourage researchers to use the QualityScore procedure to evaluate additional algorithms and benchmark them against our results.
Conference Paper
In this work, we propose a new methodology to profile individual students of computer science based on their programming design using a technique called embeddings. We investigate different approaches to analyze user source code submissions in the Python language. We compare the performances of different source code vectorization techniques to predict the correctness of a code submission. In addition, we propose a new mechanism to represent students based on their code submissions for a given set of laboratory tasks on a particular course. This way, we can make deeper recommendations for programming solutions and pathways to support student learning and progression in computer programming modules effectively at a Higher Education Institution. Recent work using Deep Learning tends to work better when more and more data is provided. However, in Learning Analytics, the number of students in a course is an unavoidable limit. Thus we cannot simply generate more data as is done in other domains such as FinTech or Social Network Analysis. Our findings indicate there is a need to learn and develop better mechanisms to extract and learn effective data features from students so as to analyze the students' progression and performance effectively.
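The submission-vectorisation idea discussed above can be illustrated with a very simple baseline: a bag of Python token counts projected onto a fixed vocabulary, which could then feed any correctness classifier. This is only a hedged sketch of the general "submission to feature vector" shape, not one of the embedding techniques the paper actually compares, and all names and data are illustrative.

```python
# Sketch of one simple source-code vectorisation: a bag of Python token strings
# produced by the standard tokenize module, projected onto a fixed vocabulary
# so every submission yields a vector of the same length.
import io
import tokenize
from collections import Counter

def token_bag(source: str) -> Counter:
    """Count token strings in a (syntactically valid) Python submission."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return Counter(tok.string for tok in tokens if tok.string.strip())

def to_vector(bag: Counter, vocabulary: list) -> list:
    """Project a token bag onto a fixed vocabulary."""
    return [bag.get(term, 0) for term in vocabulary]

submission = "def mean(xs):\n    return sum(xs) / len(xs)\n"
vocab = ["def", "return", "/", "for", "while"]
print(to_vector(token_bag(submission), vocab))  # [1, 1, 1, 0, 0]
```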
Conference Paper
The software development process often follows a circuitous path, littered with mistakes and backtracks. This is particularly true for novice programmers, who typically navigate through a variety of errors en route to their final solution. This paper presents a quantitative analysis of a large dataset of Python programs written by novice students. The analysis paints a multifaceted picture of the errors that students encounter, providing insight into the distribution, duration, and evolution of these errors. Ultimately, this paper aims to incite further conversation on the mistakes made by novice programmers, and to inform the decisions instructors make as they help students overcome these mistakes.
Conference Paper
The interaction between a novice programmer and the compiler plays a crucial role in the learning process of the novice programmer. Of particular importance is the compiler's feedback on errors in the program code. Accordingly, compiler error messages are an important and active field of research. Yet a language that has largely been left out of this discussion so far is Python. We have collected Python programs from high school students taking introductory courses. For each collected erroneous program, we sought to classify the effective error and assess whether the student was able to fix it. Our study is a precursor to providing improved error messages in Python and assessing their effectiveness. As such, we are eventually interested in finding ways to automatically determine the effective error on which to base the displayed message. From our data, we found that a considerable part of students' errors can be attributed to minor mistakes, which can easily be identified and corrected. However, beyond such minor mistakes, a proper error diagnosis might have to be based on a goal/plan analysis of the entire program. Likewise, proper assessment of whether an error has been fixed frequently requires more context than is provided by the program alone.
Conference Paper
The ability to predict student performance in a course or program creates opportunities to improve educational outcomes. With effective performance prediction approaches, instructors can allocate resources and instruction more accurately. Research in this area seeks to identify features that can be used to make predictions, to identify algorithms that can improve predictions, and to quantify aspects of student performance. Moreover, research in predicting student performance seeks to determine interrelated features and to identify the underlying reasons why certain features work better than others. This working group report presents a systematic literature review of work in the area of predicting student performance. Our analysis shows a clearly increasing amount of research in this area, as well as an increasing variety of techniques used. At the same time, the review uncovered a number of issues with research quality that drives a need for the community to provide more detailed reporting of methods and results and to increase efforts to validate and replicate work.
Article
This paper introduces the “Search, Align, and Repair” data-driven program repair framework to automate feedback generation for introductory programming exercises. Distinct from existing techniques, our goal is to develop an efficient, fully automated, and problem-agnostic technique for large or MOOC-scale introductory programming courses. We leverage the large amount of available student submissions in such settings and develop new algorithms for identifying similar programs, aligning correct and incorrect programs, and repairing incorrect programs by finding minimal fixes. We have implemented our technique in the Sarfgen system and evaluated it on thousands of real student attempts from the Microsoft-DEV204.1x edX course and the Microsoft CodeHunt platform. Our results show that Sarfgen can, within two seconds on average, generate concise, useful feedback for 89.7% of the incorrect student submissions. It has been integrated with the Microsoft-DEV204.1X edX class and deployed for production use.
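To make the search-align-repair pipeline concrete, here is a deliberately crude, line-level sketch: pick the closest correct submission by textual similarity (search), line the two programs up (align), and report the differing lines as candidate fixes (repair). Sarfgen's actual algorithms work on richer program representations and compute minimal fixes far more carefully, so everything below, including the helper names and toy programs, is an illustrative assumption.

```python
# Crude line-level sketch of a "Search, Align, and Repair" style pipeline,
# built on difflib only to illustrate the control flow of the approach.
import difflib

def search(incorrect, correct_pool):
    """Search: return the correct submission most similar to the incorrect one."""
    return max(correct_pool,
               key=lambda c: difflib.SequenceMatcher(None, incorrect, c).ratio())

def align_and_repair(incorrect, reference):
    """Align the two programs line by line and report differing regions as fixes."""
    fixes = []
    sm = difflib.SequenceMatcher(None, incorrect.splitlines(), reference.splitlines())
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            fixes.append((tag, incorrect.splitlines()[i1:i2], reference.splitlines()[j1:j2]))
    return fixes

correct_pool = ["def absolute(x):\n    if x < 0:\n        return -x\n    return x"]
incorrect = "def absolute(x):\n    if x > 0:\n        return -x\n    return x"
reference = search(incorrect, correct_pool)
for tag, old, new in align_and_repair(incorrect, reference):
    print(tag, old, "->", new)   # replace ['    if x > 0:'] -> ['    if x < 0:']
```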
Article
Formative feedback, aimed at helping students to improve their work, is an important factor in learning. Many tools that offer programming exercises provide automated feedback on student solutions. We have performed a systematic literature review to find out what kind of feedback is provided, which techniques are used to generate the feedback, how adaptable the feedback is, and how these tools are evaluated. We have designed a labelling scheme to classify the tools, and use Narciss’ feedback content categories to classify feedback messages. We report on the results of coding a total of 101 tools. We have found that feedback mostly focuses on identifying mistakes and less on fixing problems and taking a next step. Furthermore, teachers cannot easily adapt tools to their own needs. However, the diversity of feedback types has increased over the past decades and new techniques are being applied to generate feedback that is increasingly helpful for students.
Conference Paper
Providing feedback on programming assignments is a tedious task for the instructor, and even impossible in large Massive Open Online Courses with thousands of students. Previous research has suggested that program repair techniques can be used to generate feedback in programming education. In this paper, we present a novel fully automated program repair algorithm for introductory programming assignments. The key idea of the technique, which enables automation and scalability, is to use the existing correct student solutions to repair the incorrect attempts. We evaluate the approach in two experiments: (I) We evaluate the number, size and quality of the generated repairs on 4,293 incorrect student attempts from an existing MOOC. We find that our approach can repair 97% of student attempts, while 81% of those are small repairs of good quality. (II) We conduct a preliminary user study on performance and repair usefulness in an interactive teaching setting. We obtain promising initial results (the average usefulness grade 3.4 on a scale from 1 to 5), and conclude that our approach can be used in an interactive setting.
Article
Neural program embeddings have recently shown much promise for a variety of program analysis tasks, including program synthesis, program repair, and fault localization. However, most existing program embeddings are based on syntactic features of programs, such as raw token sequences or abstract syntax trees. Unlike images and text, a program has an unambiguous semantic meaning that can be difficult to capture by only considering its syntax (i.e. syntactically similar programs can exhibit vastly different run-time behavior), which makes syntax-based program embeddings fundamentally limited. This paper proposes a novel semantic program embedding that is learned from program execution traces. Our key insight is that program states expressed as sequential tuples of live variable values not only capture program semantics more precisely, but also offer a more natural fit for Recurrent Neural Networks to model. We evaluate different syntactic and semantic program embeddings on predicting the types of errors that students make in their submissions to an introductory programming class and two exercises on the CodeHunt education platform. Evaluation results show that our new semantic program embedding significantly outperforms syntactic program embeddings based on token sequences and abstract syntax trees. In addition, we augment a search-based program repair system with the predictions obtained from our semantic embedding, and show that search efficiency is also significantly improved.
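As a small illustration of the kind of trace such an embedding is learned from, the sketch below records a snapshot of the live local variables at every executed line of a toy Python function. The use of sys.settrace, the helper names, and the example function are assumptions for illustration only; the paper's own tracing infrastructure and target programs are not described at this level in the abstract.

```python
# Sketch of collecting an execution trace as a sequence of snapshots of local
# variable values, one per executed line of the traced function.
import sys

def trace_states(func, *args):
    states = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            states.append(dict(frame.f_locals))  # snapshot of live variables
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return states

def running_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

for state in trace_states(running_sum, [1, 2, 3]):
    print(state)   # e.g. {'xs': [1, 2, 3], 'total': 0, 'x': 1}, ...
```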
Article
Teaching is the process of conveying knowledge and skills to learners. It involves preventing misunderstandings or correcting misconceptions that learners have acquired. Thus, effective teaching relies on solid knowledge of the discipline, but also a good grasp of where learners are likely to trip up or misunderstand. In programming, there is much opportunity for misunderstanding, and the penalties are harsh: failing to produce the correct syntax for a program, for example, can completely prevent any progress in learning how to program. Because programming is inherently computer-based, we have an opportunity to automatically observe programming behaviour - more closely even than an educator in the room at the time. By observing students' programming behaviour, and surveying educators, we can ask: do educators have an accurate understanding of the mistakes that students are likely to make? In this study, we combined two years of the Blackbox dataset (with more than 900 thousand users and almost 100 million compilation events) with a survey of 76 educators to investigate which mistakes students make while learning to program Java, and whether the educators could make an accurate estimate of which mistakes were most common. We find that educators' estimates do not agree with one another or with the student data, and discuss the implications of these results.
Conference Paper
In large introductory programming classes, teacher feedback on individual incorrect student submissions is often infeasible. Program synthesis techniques are capable of fixing student bugs and generating hints automatically, but they lack the deep domain knowledge of a teacher and can generate functionally correct but stylistically poor fixes. We introduce a mixed-initiative approach which combines teacher expertise with data-driven program synthesis techniques. We demonstrate our novel approach in two systems that use different interaction mechanisms. Our systems use program synthesis to learn bug-fixing code transformations and then cluster incorrect submissions by the transformations that correct them. The MistakeBrowser system learns transformations from examples of students fixing bugs in their own submissions. The FixPropagator system learns transformations from teachers fixing bugs in incorrect student submissions. Teachers can write feedback about a single submission or a cluster of submissions and propagate the feedback to all other submissions that can be fixed by the same transformation. Two studies suggest this approach helps teachers better understand student bugs and write reusable feedback that scales to a massive introductory programming classroom.
Conference Paper
Students have enthusiastically taken to online programming lessons and contests. Unfortunately, they tend to struggle due to lack of personalized feedback. There is an urgent need of program analysis and repair techniques capable of handling both the scale and variations in student submissions, while ensuring quality of feedback. Towards this goal, we present a novel methodology called semi-supervised verified feedback generation. We cluster submissions by solution strategy and ask the instructor to identify or add a correct submission in each cluster. We then verify every submission in a cluster against the instructor-validated submission in the same cluster. If faults are detected in the submission then feedback suggesting fixes to them is generated. Clustering reduces the burden on the instructor and also the variations that have to be handled during feedback generation. The verified feedback generation ensures that only correct feedback is generated. We implemented a tool, named CoderAssist, based on this approach and evaluated it on dynamic programming assignments. We have designed a novel counter-example guided feedback generation algorithm capable of suggesting fixes to all faults in a submission. In an evaluation on 2226 submissions to 4 problems, CoderAssist could generate verified feedback for 1911 (85%) submissions in 1.6s each on an average. It does a good job of reducing the burden on the instructor. Only one submission had to be manually validated or added for every 16 submissions.
Article
We present a method for automatically generating repair feedback for syntax errors in introductory programming problems. Syntax errors constitute one of the largest classes of errors (34%) in our dataset of student submissions obtained from a MOOC course on edX. Previous techniques for generating automated feedback on programming assignments have focused on functional correctness and style considerations of student programs. These techniques analyze the AST of the program and then perform dynamic and symbolic analyses to compute repair feedback. Unfortunately, it is not possible to generate ASTs for student programs with syntax errors, and therefore the previous feedback techniques are not applicable to repairing syntax errors. We present a technique for providing feedback on syntax errors that uses recurrent neural networks (RNNs) to model syntactically valid token sequences. Our approach is inspired by recent work on learning language models from Big Code (large code corpora). For a given programming assignment, we first learn an RNN to model all valid token sequences using the set of syntactically correct student submissions. Then, for a student submission with syntax errors, we query the learnt RNN model with the prefix token sequence to predict token sequences that can fix the error by either replacing or inserting the predicted token sequence at the error location. We evaluate our technique on over 14,000 student submissions with syntax errors. Our technique can completely repair 31.69% (4501/14203) of submissions with syntax errors and partially correct a further 6.39% (908/14203) of the submissions.
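The query pattern described above (train on correct token sequences, then ask for a likely continuation of the prefix ending at the error location) can be illustrated with a far simpler model than the paper's RNN. The sketch below uses a bigram token model as a stand-in; the training data, function names, and single-token suggestion are illustrative simplifications, not the published technique.

```python
# Simplified stand-in for the RNN-based syntax repair described above: train a
# bigram model over token sequences from syntactically correct submissions,
# then, given the prefix at the reported error location, propose the most
# likely next token as an insertion candidate.
from collections import Counter, defaultdict

def train_bigrams(token_sequences):
    follow = defaultdict(Counter)
    for tokens in token_sequences:
        for a, b in zip(tokens, tokens[1:]):
            follow[a][b] += 1
    return follow

def suggest_insertion(follow, prefix_tokens):
    candidates = follow.get(prefix_tokens[-1])
    return candidates.most_common(1)[0][0] if candidates else None

# Token sequences from "correct" submissions (hand-made toy data).
correct = [
    ["int", "x", "=", "0", ";"],
    ["int", "y", "=", "1", ";"],
    ["x", "=", "x", "+", "1", ";"],
]
model = train_bigrams(correct)

# The student wrote "int x = 0" and the compiler flagged the end of the line.
print(suggest_insertion(model, ["int", "x", "=", "0"]))  # ';'
```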
Article
Exploring the whole sequence of steps a student takes to produce work, and the patterns that emerge from thousands of such sequences, is fertile ground for a richer understanding of learning. In this paper we autonomously generate hints for the Code.org 'Hour of Code' (to the best of our knowledge, the largest online course to date) using historical student data. We first develop a family of algorithms that can predict the way an expert teacher would encourage a student to make forward progress. Such predictions can form the basis for effective hint generation systems. The algorithms are more accurate than current state-of-the-art methods at recreating expert suggestions, are easy to implement, and scale well. We then show that the same framework which motivated the hint-generating algorithms suggests a sequence-based statistic that can be measured for each learner. We discover that this statistic is highly predictive of a student's future success.
Conference Paper
We present the TCS Alignment Toolbox, which offers a flexible framework to calculate and visualize (dis)similarities between sequences in the context of educational data mining and intelligent tutoring systems. The toolbox offers a variety of alignment algorithms, allows for complex input sequences comprised of multi-dimensional elements, and is adjustable via rich parameterization options, including mechanisms for an automatic adaptation based on given data. Our demo shows an example in which the alignment measure is adapted to distinguish students' Java programs w.r.t. different solution strategies, via a machine learning technique.
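Since the toolbox's core service is an adjustable alignment-based (dis)similarity between sequences, a minimal Python sketch of a global alignment distance with parameterisable gap and mismatch costs is shown below. The real toolbox is a Java framework that also handles multi-dimensional sequence elements and can learn its parameters from data, so this is only a stripped-down illustration with made-up token sequences.

```python
# Minimal global-alignment dissimilarity over token sequences with adjustable
# gap and mismatch costs, in the spirit of parameterised alignment measures.
def alignment_distance(a, b, gap_cost=1.0, mismatch_cost=1.0):
    n, m = len(a), len(b)
    # dp[i][j] = cost of aligning a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch_cost
            dp[i][j] = min(dp[i - 1][j - 1] + sub,    # match / substitute
                           dp[i - 1][j] + gap_cost,   # gap in b (delete from a)
                           dp[i][j - 1] + gap_cost)   # gap in a (insert from b)
    return dp[n][m]

s1 = ["for", "i", "in", "range", "(", "n", ")", ":"]
s2 = ["while", "i", "<", "n", ":"]
print(alignment_distance(s1, s2))  # 5.0 with unit costs
```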
Article
In MOOCs, a single programming exercise may produce thousands of solutions from learners. Understanding solution variation is important for providing appropriate feedback to students at scale. The wide variation among these solutions can be a source of pedagogically valuable examples and can be used to refine the autograder for the exercise by exposing corner cases. We present OverCode, a system for visualizing and exploring thousands of programming solutions. OverCode uses both static and dynamic analysis to cluster similar solutions, and lets teachers further filter and cluster solutions based on different criteria. We evaluated OverCode against a nonclustering baseline in a within-subjects study with 24 teaching assistants and found that the OverCode interface allows teachers to more quickly develop a high-level view of students' understanding and misconceptions, and to provide feedback that is relevant to more students' solutions.
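One half of the clustering idea above, grouping solutions by dynamic behaviour, can be sketched very simply: submissions that produce identical results on a shared set of test inputs fall into the same cluster. OverCode's actual pipeline also uses static analysis and renames common variables, so the helpers and toy submissions below are illustrative assumptions only.

```python
# Sketch of the dynamic-analysis half of solution clustering: submissions that
# behave identically on shared test inputs land in the same cluster.
from collections import defaultdict

def behaviour_key(func, test_inputs):
    """Tuple of results (or error markers) on the shared test inputs."""
    results = []
    for args in test_inputs:
        try:
            results.append(repr(func(*args)))
        except Exception as exc:
            results.append(f"error:{type(exc).__name__}")
    return tuple(results)

def cluster_by_behaviour(submissions, test_inputs):
    clusters = defaultdict(list)
    for name, func in submissions.items():
        clusters[behaviour_key(func, test_inputs)].append(name)
    return list(clusters.values())

# Three toy "submissions" for a mean() exercise.
submissions = {
    "alice": lambda xs: sum(xs) / len(xs),
    "bob":   lambda xs: sum(xs) / float(len(xs)),
    "carol": lambda xs: sum(xs) // len(xs),     # integer-division bug
}
tests = [([1, 2, 3],), ([2, 2],)]
print(cluster_by_behaviour(submissions, tests))  # [['alice', 'bob'], ['carol']]
```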
Conference Paper
While computing educators have put plenty of effort into researching and developing programming environments that make it easier for students to create their first programs, these tools often bear little resemblance to the tools used in industry. We report on a study where students with no previous programming experience started to program directly in an industry-strength programming environment. The programming environment was augmented with logging capability that recorded every keystroke and event within the system, which provided a view of how the novices tackle their first lines of code. Our results show that while at first the students struggle with syntax -- as is typical when learning a new language -- no evidence can be found to suggest that learning to use the programming environment is hard. In a two-week period, the students learned to use the basic features of the programming environment, such as specific shortcuts. Although we observed students using copy-paste programming relatively often, most of the pasted code is from their own previous work. Finally, when considering the compilation errors and error distributions, we hypothesize that the errors are a product of three factors: the exercises, the environment, and the data-logging granularity.
Conference Paper
As programming is the basis of many CS courses, meaningfully supporting students on their journey towards becoming better programmers is a matter of utmost importance. Programming is not only about learning simple syntax constructs and their applications, but about honing practical problem-solving skills in meaningful contexts. In this article, we describe our current work on an automated assessment system called Test My Code (TMC), which is one of the feedback and support mechanisms that we use in our programming courses. TMC is an assessment service that (1) enables building scaffolding into programming exercises; (2) retrieves and updates tasks in the students' programming environment as students work on them; and (3) causes no additional overhead to students' programming process. Instructors benefit from TMC as it can be used to perform code reviews, and to collect and send feedback even on fully online courses.
Conference Paper
Massive open online courses (MOOCs), one of the latest internet revolutions, have engendered hope that constant iterative improvement and economies of scale may cure the "cost disease" of higher education. While scalable in many ways, providing feedback for homework submissions (particularly open-ended ones) remains a challenge in the online classroom. In courses where the student-teacher ratio can be ten thousand to one or worse, it is impossible for instructors to personally give feedback to students or to understand the multitude of student approaches and pitfalls. Organizing and making sense of massive collections of homework solutions is thus a critical web problem. Despite the challenges, the dense sampling of the solution space in highly structured homeworks for some MOOCs suggests an elegant solution to providing quality feedback to students on a massive scale. We outline a method for decomposing online homework submissions into a vocabulary of "code phrases", and based on this vocabulary, we architect a queryable index that allows for fast searches into the massive dataset of student homework submissions. To demonstrate the utility of our homework search engine we index over a million code submissions from users worldwide in Stanford's Machine Learning MOOC and (a) semi-automatically learn shared structure amongst homework submissions and (b) generate specific feedback for student mistakes. Codewebs is a tool that leverages the redundancy of densely sampled, highly structured homeworks in order to force-multiply teacher effort. Giving articulate, instant feedback is a crucial component of the online learning process, and by building a homework search engine we hope to take a step towards higher-quality free education.
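A toy analogue of the "code phrase" index can be sketched in a few lines: serialise every AST subtree of each submission and use the serialisations as index keys, so that submissions sharing a phrase can be retrieved together. Codewebs itself indexes Octave code from the Machine Learning MOOC at far larger scale; the Python helpers, data, and query below are illustrative assumptions only.

```python
# Small analogue of a "code phrase" index: every expression/statement subtree of
# each submission is serialised and used as an index key, so submissions that
# share a phrase can be retrieved together.
import ast
from collections import defaultdict

def code_phrases(source: str):
    """Yield a canonical string for every expression/statement subtree."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.expr, ast.stmt)):
            yield ast.dump(node, annotate_fields=False)

def build_index(submissions: dict) -> dict:
    index = defaultdict(set)
    for student, source in submissions.items():
        for phrase in code_phrases(source):
            index[phrase].add(student)
    return index

submissions = {
    "s1": "total = sum(xs) / len(xs)",
    "s2": "avg = sum(xs) / len(xs)",
    "s3": "avg = sum(xs) / n",
}
index = build_index(submissions)
query = ast.dump(ast.parse("sum(xs) / len(xs)").body[0].value, annotate_fields=False)
print(sorted(index[query]))  # ['s1', 's2'] share this code phrase
```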
Conference Paper
A large body of systems that gather data on students' programming process exists, and with the increase of massive open online courses in programming, the amount of gathered data is growing at an even higher rate. A common issue for data analysis is the lack of shared tools for visualizing source code snapshots. We have created a browser-side snapshot analysis tool called CodeBrowser that provides a clean REST API into which anyone can integrate their snapshot data. In this article, we describe CodeBrowser and, as an example, discuss how it has been used to seek differences between novice programmers that have passed (n=10) or failed (n=10) an introductory programming course.
Conference Paper
The high failure rates of many programming courses mean there is a need to identify struggling students as early as possible. Prior research has focused on using a set of tests to assess a student's demographic, psychological, and cognitive traits as predictors of performance. But these traits are static in nature, and therefore fail to encapsulate changes in a student's learning progress over the duration of a course. In this paper we present a new approach for predicting a student's performance in a programming course, based upon analyzing directly logged data describing various aspects of their ordinary programming behavior. An evaluation using data logged from a sample of 45 programming students at our University showed that our approach was an excellent early predictor of performance, explaining 42.49% of the variance in coursework marks - double the explanatory power of the closest related technique in the literature.