Exploring Automated Code Evaluation Systems and
Resources for Code Analysis: A Comprehensive Survey
MD. MOSTAFIZER RAHMAN, Dhaka University of Engineering & Technology, Gazipur, Bangladesh
YUTAKA WATANOBE, The University of Aizu, Japan
ATSUSHI SHIRAFUJI, The University of Aizu, Japan
MOHAMED HAMADA, The University of Aizu, Japan
The automated code evaluation system (AES) is mainly designed to reliably assess user-submitted code. The code is compiled and then tested in a unified environment with predefined input and output test cases. Due to their extensive range of applications and the accumulation of valuable resources, AESs are becoming increasingly popular. Research on the application of AESs and their real-world resource exploration for diverse coding tasks is still lacking. In this study, we conducted a comprehensive survey on AESs and their resources. This survey explores the application areas of AESs, available resources, and resource utilization for coding tasks. AESs are categorized into programming contests, programming learning and education, recruitment, online compilers, and additional modules, depending on their application. We explore the available datasets and other resources of these systems for research, analysis, and coding tasks. The success of machine learning models for inference procedures depends primarily on the purity of the data, where the accumulated real-life data (e.g., codes and submission logs) from AESs can be a valuable treasure. Moreover, we provide an overview of machine learning-driven coding tasks, such as bug detection, code review, comprehension, refactoring, search, representation, and repair. These tasks are performed using real-life datasets. In addition, we briefly discuss the Aizu Online Judge (AOJ) platform as a real example of an AES from the perspectives of system design (hardware and software), operation (competition and education), and research. This is due to the scalability of the AOJ platform (programming education, competitions, and practice), open internal features (hardware and software), attention from the research community, open source data (e.g., solution codes and submission documents), and transparency. We also analyze the overall performance of this system and the perceived challenges over the years.
CCS Concepts: • General and reference → Code Assessment; Evaluation platform; Online Judge; • Code Analysis and dataset → Machine Learning; • Networks → Online Computing.
Additional Key Words and Phrases: Resources of Online Judge, datasets, Machine Learning, Aizu Online Judge
ACM Reference Format:
Md. Mostafizer Rahman, Yutaka Watanobe, Atsushi Shirafuji, and Mohamed Hamada. 2018. Exploring Automated Code Evaluation Systems and Resources for Code Analysis: A Comprehensive Survey. J. ACM 37, 4, Article 111 (August 2018), 34 pages. https://doi.org/XXXXXXX.XXXXXXX
Also with The University of Aizu, Japan.
Authors' addresses: Md. Mostafizer Rahman, mostafiz@duet.ac.bd, Dhaka University of Engineering & Technology, Gazipur, Bangladesh, 1707; Yutaka Watanobe, The University of Aizu, Aizuwakamatsu, Japan, yutaka@u-aizu.ac.jp; Atsushi Shirafuji, The University of Aizu, Aizuwakamatsu, Japan, atsushirafuji@gmail.com; Mohamed Hamada, The University of Aizu, Aizuwakamatsu, Japan, hamada@u-aizu.ac.jp.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2018 Association for Computing Machinery.
0004-5411/2018/8-ART111 $15.00
https://doi.org/XXXXXXX.XXXXXXX
1 INTRODUCTION
Competitive programming contests (CPCs) are usually held online or offline (local area network), where participants try to solve programming problems according to given specifications. CPCs, also called programming sports, are recognized worldwide. Many educational institutions and multinational tech giants (e.g., Google, Facebook, etc.) foster these programming events. Generally, CPC organizers provide a set of mathematical or logical problems for participants. Typically, problems are related to categories such as data structures, string analysis, graph theory, combinatorics, computational geometry, number theory, and game theory. In addition, artificial intelligence (AI) and constraint programming are also included in certain competitions. Participants write code
to solve those problems. The problem-solving process can be divided into two main steps: (i) development of an efficient algorithm and (ii) implementation of the algorithm in an appropriate
programming language. The correctness of the submitted solution code is judged in a special
environment called an online judge (OJ) or automated code evaluation system (AES). Numerous
factors are considered in the judging, including output quality, program size, memory usage, and
CPU time. In most contests, the submitted solution code is automatically evaluated by the host
computer (judge server), where each solution code is judged against a set of test cases. A solution
code is accepted only if it passes all test cases, otherwise, it is rejected. However, in some contests,
partial judging may occur depending on the quality of the results, the number of test cases passed,
or other criteria.
In the last 30 years, we have observed a significant rise in the popularity of programming computing events, exemplified by the International Collegiate Programming Contest (ICPC). This competition holds the distinction of being the largest, oldest, and most fiercely contested CPC for university students worldwide. The ICPC serves as a platform where students can engage with one another, enhance their programming abilities, and develop algorithmic thinking, teamwork, and problem-solving skills. It not only provides a valuable opportunity for academic, community, and industry collaboration but also holds the overarching goal of nurturing the next generation of skilled computing professionals. The inaugural edition of the ICPC took place at Texas A&M University in 1970 [189].
Over the years, the ICPC has evolved into one of the most esteemed programming competitions worldwide. A testament to its widespread appeal can be observed through the participation numbers. For instance, the event attracted over 52,709 students from 3,233 universities across 110 countries in 2018, a substantial increase compared to the participation of 840 teams from 560 universities in 1997. Following the success of the first ICPC final, numerous algorithmic competitions subsequently adopted similar AESs. The International Olympiad in Informatics (IOI) is one of them, having integrated an AES into its evaluation process starting in 1989. In contrast to the ICPC, the IOI specifically targets high school students, with participants tackling problems individually instead of working in teams. Notably, participants in the IOI receive scores that can be full or partial based on the merit of their submitted solutions, rather than solely binary (correct or incorrect) scores as in the ICPC. Furthermore, big companies such as Google and Facebook regularly organize CPCs [87].
The popular Codeforces CPC platform hosts weekly contests that attract a substantial number of participants, reaching tens of thousands [111]. Impressively, this platform has over 500,000 active users. In a study conducted by Cormack et al. [35], the fundamental regulations, evaluation procedures, and scoring mechanisms employed by prominent CPCs such as the ICPC and IOI were elucidated.
The CPC environment encompasses a distinct system designed to automatically verify the
correctness of submitted solution code relying on predefined input/output test cases. Moreover, the
system calculates various resources such as time and memory usage limits during the evaluation
process of the submitted solution code. Such a specialized system is commonly referred to as an AES¹ or OJ system. The concept of an OJ system was first introduced by Kurnia and collaborators [77] in 2001, enabling the automated and real-time evaluation of solution code. However, the development, implementation, and maintenance of an OJ system is a complex endeavor, as various essential factors must be carefully considered to ensure its security and seamless operation. Typically, an AES assesses submitted codes using either local or cloud-based infrastructures and should be well-equipped to handle various threats that may arise during the evaluation process. For instance, submitted solution codes can take the form of executable files (.exe) or malicious source code that have the potential to alter the test environment, prolong compilation time, or manipulate restricted resources such as memory and time. A comprehensive exploration of potential threat types and corresponding countermeasures for CPCs can be found in a study [47]. To mitigate such threats and ensure secure execution of the submitted solution code, recent advancements have introduced specialized sandboxes [202], virtualization techniques, containerization [43], and the utilization of Docker frameworks [108]. These approaches hold promise in enhancing the security and reliability of AESs.
Apart from CPCs, AESs offer a range of auxiliary functionalities that find application in diverse domains such as programming education, online compilers, data mining, recruitment, and software development. Furthermore, the resources generated by AESs, including evaluation results, solution codes, and logs, are highly regarded as valuable assets for both educational and industrial research purposes. Over the years, numerous academic research endeavors have yielded significant findings by leveraging the resources provided by AESs, which are extensively utilized for programming tutoring across various academic institutions [137, 138]. These results have shed light on submission patterns, verdict statistics, problem-solving capabilities for constraint-oriented tasks, and prevalent solution patterns among different student groups. Educators are capitalizing on the insights gained from such analytical research to enhance their lesson plans and teaching methodologies. Moreover, the amassed data resources within AESs are recognized as one of the largest real-world code repositories. This has paved the way for software engineering research in areas such as software code analysis, vulnerability prediction, source code mining, suggestions for class and method names, and software refactoring [2, 3, 8, 12, 16, 103, 104, 139].
In recent times, the field of AI, particularly machine learning (ML) and deep learning (DL), has witnessed remarkable advancements in text [1, 83], speech [53, 54, 151], and image [76, 170] processing. These breakthroughs in ML and DL have been facilitated by the proliferation of open-source codes and the rapid development of computational hardware. This progress has inspired both practitioners and researchers to explore source code and software engineering challenges [9, 14, 78, 182, 208]. Within the domain of source code analysis, researchers and professionals have embraced the utilization of DL models for various code-related tasks. These tasks encompass code representation [9, 60], program synthesis [60, 78, 91, 120, 182, 199, 208, 211], code completion [98, 103], code testing [91, 120, 211], code summarization [8, 82, 98], code refactoring [16], code repair [104], and analysis of source code quality and vulnerability [14, 20, 154, 158, 174]. The adoption of DL for code analysis is on the rise, accompanied by a growing array of methods, techniques, resources, tools, and datasets. Consequently, researchers face the challenge of comprehending the expansive landscape of available resources and research directions in this domain. However, numerous endeavors have been made to consolidate application-specific research through survey publications [156, 189, 198]. In addition to the state-of-the-art (SOTA) surveys, our survey offers several advantages to researchers and professionals engaged in code analysis tasks. Firstly, it encompasses a comprehensive summary of a substantial number of SOTA surveys. Secondly, it provides a literature review categorized by specific areas such as quality assessment, testing, code repair, and program synthesis within the field of software engineering. Lastly, it explores the existing tools and datasets available for code analysis.
¹The terms automated code evaluation system (AES), automated code assessment system (AAS), and online judge system (OJ) are employed interchangeably to refer to the same concept.
In the past few years, the resources provided by AESs have emerged as valuable datasets for a variety of code analysis tasks. One notable dataset is CodeNet [131], a comprehensive collection encompassing 14 million real-world solution codes and approximately 500 million lines of code spanning 55 programming languages. These codes have been sourced from the Aizu Online Judge (AOJ) [190] and AtCoder [115] OJ systems. CodeNet is a meticulously curated dataset specifically designed for code analysis applications, including code classification and code similarity measurement. Another large dataset, CodeContests [87], is tailored for ML applications and is extensively employed in the training of models like AlphaCode. CodeContests incorporates datasets sourced from diverse CPC platforms, including AOJ, AtCoder, CodeChef, Codeforces, and HackerEarth. After implementing the necessary filtering procedures, the CodeContests dataset contains a total of 715.1 GB of code. POJ-104 [87] and GCJ [176] are extensively utilized benchmark datasets that originated from a pedagogical AES and the Google Code Jam competition held between 2008 and 2020, respectively. Notably, GCJ-297 [49] stands as a benchmark dataset comprising 297 problems and approximately 208,000 solution codes. For an in-depth exploration of code analysis tasks and their associated benchmark datasets, a comprehensive discussion can be found in CodeXGLUE [101].
The objective of this paper is as follows. To the best of our knowledge, there is a lack of existing surveys on AESs and their available resources for code analysis tasks; our aim is to bridge this gap. Initially, we provide an overview of the current landscape of AESs and their potential application domains. Considering the diverse nature of these systems, a classification based on their specific applications holds significant value for both researchers and professionals. We summarize the resources, including datasets, generated by AESs and other similar platforms. These accumulated datasets can serve as valuable assets for practitioners and researchers engaging in further research and analysis. Moreover, we provide a comprehensive review of recent studies focusing on code analysis tasks employing ML techniques. We also present an in-depth investigation of a real-world AES, exploring aspects such as system design (software and hardware), operation, maintenance, and research opportunities.
The remainder of this paper is organized as follows: Section 2 introduces the evaluation methodology and scoring of AESs, Section 3 classifies AESs based on their use, Section 4 examines the available resources of AESs, Section 5 summarizes ML-based code analysis tasks, Section 6 explores aspects of a real-world AES, including system design (software and hardware), operation, maintenance, and research opportunities, and finally, Section 7 concludes the study.
2 AUTOMATED CODE EVALUATION SYSTEMS
2.1 Evaluation Method
An AES is a reliable, secure, and continuous online evaluation system for algorithms submitted by users around the world. For better understanding, the AES evaluation method can be defined as follows:
Evaluation Method: The evaluation method consists of three main steps: (i) code submission, (ii) code evaluation with test datasets, and (iii) evaluation scoring.
In the code submission phase, the submitted code is compiled and verified to check whether it is executable in a homogeneous evaluation environment. If the verification is successful, each solution code is reliably evaluated on a coherent evaluation infrastructure using problem-specific test cases. The evaluation of the test cases determines, for each submission, whether: (i) the code executes without errors, (ii) the resource constraints (time and memory) have not been exceeded for the given problem, and (iii) the obtained result satisfies all problem definitions.
Code evaluation is basically a complex process in which the submitted code must be compiled and executed against each test instance. We can formally define a test instance as follows.
Test Instance: Let $\Omega$ be an alphabet containing input and output data. A test instance $\tau_i \in \mathcal{T}$, where $\mathcal{T}$ is the set of all test instances of a particular problem. Here, $\tau_i$ can be defined as a triple $\tau_i = (g_i, o_i, p_i)$, where $g_i \in \Omega^*$ and $o_i \in \Omega^*$ are the input data and reference output, respectively, and $p_i \in \Omega^*$ is the set of parameters.
The parameter set $p_i$ represents the resource limits, such as the maximum usage of primary memory (RAM), CPU time, etc., that cannot be exceeded during evaluation for a given instance. The parameter set $p_i$ can be empty ($p_i = \emptyset$) if the configuration of the evaluation engine is set to default resource limits. In code evaluation, the solution is compiled as an output of the submission phase. Here, the solution can be defined as follows.
Solution: A solution is a function, denoted $b(g_i, p'_i) \rightarrow z_i$, that computes output data $z_i$ based on input data $g_i$ and parameters $p'_i$ given by the evaluation engine. It represents the binary form of the submission.
Here, the execution parameter set $p'_i$ provided by the evaluation engine can be the same as the original parameter set $p_i$ defined as part of the respective test instance (i.e., $p'_i = p_i$). However, some variations may also occur; e.g., $p'_i$ can be a subset of $p_i$ ($p'_i \subset p_i$), $p'_i$ can be completely different from $p_i$ (i.e., $p'_i \neq p_i$), or $p'_i$ can be an empty set (i.e., $p'_i = \emptyset$). In contrast to the parameter set $p_i$, $p'_i$ is usually set to empty because the parameters affecting the evaluation are hidden from the solution. The evaluation engine can be formally defined as follows.
Evaluation Engine: The evaluation engine is a function that executes a binary file ($b$) with a test instance $\tau_i$, denoted $e(b, \tau_i) \rightarrow (v_i, s_i, l_i)$. This function returns the evaluation verdict $v_i$, the evaluation score $s_i$ ($s_i \in \mathbb{R}$), and a list of statistics $l_i$ for the entire execution process.
The evaluation verdict $v_i$ can be one of the following: Accepted (AC), Time Limit Exceeded (TLE), Memory Limit Exceeded (MLE), Wrong Answer (WA), Runtime Error (RE), Presentation Error (PE), and Output Limit Exceeded (OLE). Details of these evaluation verdicts can be found in [192]. Many OJ systems evaluate submissions according to the ICPC rule, which evaluates the submissions on each test instance $\tau_i$ in a binary way, i.e., correct or incorrect. In this case, the evaluation score $s_i$ is always 0 (i.e., $s_i = 0$) because $s_i$ is not used in evaluation according to ICPC rules [189]. In addition, $s_i$ characterizes the evaluation even if users have received the verdict "Accepted" (i.e., $v_i = AC$) for a particular instance. In general, the test instances $\tau_i$ can be characterized by the different values of $s_i$, but there are also more complex scoring methods that can be applied. Some evaluation scoring methods are presented in Section 2.2. Finally, the statistics $l_i$ collect information about the maximum utilization of resources, such as memory and time, during the execution. If the OJ does not share this information with the user, $l_i$ can be empty, i.e., $l_i = \emptyset$.
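To make these definitions concrete, the following is a minimal sketch, in Python, of how an evaluation engine of the form $e(b, \tau_i) \rightarrow (v_i, s_i, l_i)$ might run a compiled submission against one test instance and return a verdict, a score, and execution statistics. The function names, the fixed set of verdicts, and the resource handling are illustrative assumptions rather than the implementation of any particular OJ system; memory limits, for instance, are omitted for brevity.

```python
import subprocess
import time
from dataclasses import dataclass, field

@dataclass
class TestInstance:
    input_data: str                                # g_i: input fed to the program
    expected_output: str                           # o_i: reference output
    params: dict = field(default_factory=dict)     # p_i: resource limits (may be empty)

def evaluate(binary_path: str, tau: TestInstance):
    """Sketch of e(b, tau_i): run one test instance and return (verdict, score, stats)."""
    time_limit = tau.params.get("time_limit", 2.0)  # default limit if p_i is empty
    start = time.time()
    try:
        proc = subprocess.run(
            [binary_path], input=tau.input_data,
            capture_output=True, text=True, timeout=time_limit,
        )
    except subprocess.TimeoutExpired:
        return "TLE", 0, {"cpu_time": time_limit}
    stats = {"cpu_time": time.time() - start}      # l_i: execution statistics
    if proc.returncode != 0:
        return "RE", 0, stats
    if proc.stdout.strip() != tau.expected_output.strip():
        return "WA", 0, stats
    return "AC", 1, stats                          # s_i = 1 here; systems may assign other points
```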
2.2 Evaluation Score
The evaluation score is basically the aggregated evaluation of $s_i$ and $v_i$ to calculate the final score $s$ and verdict $v$ of each user submission. The final score of a submission is used to rank the solutions to the problem.
A solution receives an AC verdict if and only if it passes all test instances (Eq. (1)). Otherwise, the solution receives another verdict that is different from AC (e.g., WA, TLE) (Eq. (2)).
$$ v = AC \iff \forall i \;\, v_i = AC \tag{1} $$
$$ v = v_j \iff (\forall i < j \;\, v_i = AC) \land (v_j \neq AC) \tag{2} $$
If the problem is not an optimization problem, the total evaluation score $s$ is calculated in one of the following two ways. (1) According to the ICPC evaluation rules, it is a binary evaluation of the submission, and thus $s = 0$. (2) Otherwise, it is the sum of the evaluation scores over all correctly solved instances, as given by Eq. (3).
$$ s = \sum_{i=1}^{|\mathcal{T}|} \begin{cases} s_i, & \text{if } v_i = AC \\ 0, & \text{otherwise} \end{cases} \tag{3} $$
When calculating the score for optimization problems, the best solution among all the submissions submitted by the participants is taken into account. In most cases, Eq. (4) is used to calculate the evaluation score $s$ for optimization problems where the objective function is maximized [189]. Here, $b_i$ denotes the score of the best solution on the $i$-th instance. Usually, Eq. (4) is used to rank all submitted solutions to the problems.
$$ s = \frac{100}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} \begin{cases} s_i / b_i, & \text{if } v_i = AC \\ 0, & \text{otherwise} \end{cases} \tag{4} $$
In addition, more customized and complex evaluation procedures can be used for scoring. For example, the Polish Olympiad in Informatics uses Eq. (5), which penalizes a solution if it takes more than half of the time limit [189]. Here, $\Upsilon_i$ denotes the maximum points that can be assigned to a test instance $\tau_i$, $\Lambda_i$ is the maximum time limit for a solution, and $\lambda_i$ is the processing time required for the solution to produce output for an instance $\tau_i$.
$$ s = \Upsilon_i \cdot \min\!\left(1.0,\; \frac{2.0 \cdot (\Lambda_i - \lambda_i)}{\Lambda_i}\right) \tag{5} $$
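As a worked illustration of Eqs. (1)-(4), the fragment below aggregates per-instance results into a final verdict and score. It assumes per-instance pairs $(v_i, s_i)$ such as those produced by the engine sketched in Section 2.1; the ICPC-style and optimization-style branches mirror the binary rule and Eqs. (3)-(4), respectively.

```python
def aggregate(results, best_scores=None, icpc_rules=False):
    """results: list of (verdict, score) pairs, one per test instance, in order.
    best_scores: per-instance best scores b_i (only used for optimization problems)."""
    # Final verdict (Eqs. (1)-(2)): AC only if every instance is AC,
    # otherwise the verdict of the first failing instance.
    verdict = "AC"
    for v, _ in results:
        if v != "AC":
            verdict = v
            break
    if icpc_rules:                     # binary evaluation, so s = 0
        return verdict, 0
    if best_scores is None:            # non-optimization problem, Eq. (3)
        score = sum(s for v, s in results if v == "AC")
    else:                              # optimization problem, Eq. (4)
        score = 100 / len(results) * sum(
            s / b for (v, s), b in zip(results, best_scores) if v == "AC"
        )
    return verdict, score

# Example: three instances, the second one fails.
print(aggregate([("AC", 10), ("WA", 0), ("AC", 5)]))   # ('WA', 15)
```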
3 CLASSIFICATION OF AUTOMATED CODE EVALUATION SYSTEMS
In this section, we present a classification of AESs on the basis of their application domains. Considering programming contests, there are few literature reviews on the classification of contests organized using AESs. Pohl [125] was the first to propose a classification of programming contests based on criteria such as contest style, duration, submission and evaluation methods, scoring, and entrance. Furthermore, classifications based on programming contests, types of programming exercises, and characteristics of the AES system have also been discussed in studies [34, 116]. However, most of these classifications are limited to a single application such as education or programming contests. There is no clear or explicit classification of AESs based on their potential applications that can be useful for users. Therefore, we decided to classify AESs based on their applications as follows.
3.1 Competitive Programming Contest
OJ systems have a wide range of applications for competitive programming. Many educational institutions use this platform to prepare their students to participate in competitive programming contests. Competitive programming contests are also held by various organizations and have gained popularity. The first OJ is UVa [144], which received great attention worldwide. It was founded in 1995 at the University of Valladolid in Spain. Based on the collected UVa dataset, Skiena and Revilla [162] wrote the book "Programming Challenges: The Programming Contest Training Manual" to help students in programming contests.
Table 1. OJ systems used for the competitive programming contests

| Name | In Operation | Language | # Problems | # Users | Founded |
|---|---|---|---|---|---|
| UVa Online Judge | Yes | Eng | 4,300 | 100,000 | 1995 |
| Aizu Online Judge (AOJ) | Yes | Eng, Jpn | 3,000 | 130,000 | 2004 |
| National Tsing Hua University Online Judge | Yes | Eng | 10,000 | - | 2015 |
| National Taiwan University Online Judge | Yes | Chi | 2,600 | 600 | 2016 |
| Sphere Online Judge (SPOJ) | Yes | Eng, Pol, Por, Vie | 20,000 | 315,000 | 2004 |
| PKU Judge Online (POJ) | Yes | Eng, Chi | 3,000 | 250,000 | 2003 |
| Topcoder | Yes | Eng | 5,200 | 4,000 | 2001 |
| Codeforces | Yes | Eng, Rus | 3,000 | 600,000 | 2010 |
| AtCoder | Yes | Eng, Jpn | 5,900 | 400,000 | 2012 |
| Google Code Jam (GCJ) | Yes | Eng | 450 | 670,000 | 2003 |
| Facebook Hacker Cup | Yes | Eng | - | 80,000 | 2011 |
| HUSTOJ | Yes | Eng, Chi | 650 | 26,000 | 2014 |
| Timus Online Judge | Yes | Eng | 1,000 | 110,000 | 2000 |
| IEEEXtreme | Yes | Eng | - | - | 2006 |
| ICPC | Yes | Eng | - | - | 1970 |
| IOI | Yes | Eng | - | - | 1989 |
| Judge0 | Yes | Eng | - | - | 2017 |
| LeetCode | Yes | Eng | - | - | 2015 |
Judge0 [41] is an open-source², scalable, and powerful online code execution tool that can be used in a wide range of programming-related applications such as programming competitions, e-learning platforms, recruitment, and online code editors and IDEs. A partial list of OJ systems is given in Table 1.
Furthermore, data science, AI and ML experts compete against each other on various online
platforms, adding a new dynamic to competitive programming. In particular, AI game competitions
are known as bot programming competitions or AI programming competitions. Many online
platforms are used for AI, ML, and data science competitions, including the Kaggle platform,
which is mainly used for data science and ML competitions. Google AI Challenge is a biennial
competition for students; Battlecode is one of the longest-running AI programming competitions
hosted by the Massachusetts Institute of Technology; Lux AI is an AI programming competition
held on the Kaggle platform. There are also numerous platforms and AI/ML challenges such as
Halite, CodinGame, AI Coliseum, Russian AI Cup, AWS DeepRacer, SamurAI, CodeCup, Coder One,
Terminal by Correlation One, SIGNATE, and so on.
3.2 Academic Tool
Recently, OJ systems have emerged as academic tools for programming learning, programming
assignments, and assessment. Teachers/instructors at many educational institutions automatically
grade students' assignments using OJs. The benefits of using OJ systems in education are innumerable.
For example, the submitted solution codes are checked with higher accuracy, no wrong solutions
2https://github.com/judge0/judge0
are accepted, students can get their results immediately, and the teacher can take action to improve students' programming skills based on the results. A successful application of the OJ system for algorithms and data structures, together with an analytical investigation based on the collected data, is presented in studies [29, 137, 138]. Peking University has integrated POJ [194] as an essential tool for programming education. POJ is being utilized in various programming courses, such as Introduction to Computing, Problem Solving and Programming, and Data Structures and Algorithms. Ala-Mutka [7] gave a detailed review of the application of OJ systems in education. A review [67] addresses the available software for automatic assessment of programming tasks that may be of interest for learning programming. Fonte et al. [46] presented an advanced version of the OJ system that can be used to provide valuable feedback to programmers to help them understand where the problem lies and how to improve the solution code. Additionally, the literature [37, 130, 145, 161] explores topics such as code plagiarism detection, automated code correction, and personalized feedback through OJ systems.
In contrast, the limited scope of black box testing and oversimplified feedback (correct or incorrect) restricts the widespread application of OJ systems in programming education. To overcome this constraint, Zhou et al. [215] introduced a novel OJ system framework that enables effective programming education and provides a comprehensive assessment of submitted solutions. The new OJ system comprises four modules: code quality checking, similarity checking, personalized feedback for students, and advising on teaching adjustments. This OJ system has been implemented in a programming in C course for over ten years and has demonstrated substantial improvements in students' programming abilities. Lu et al. [102] showed the positive influence of the OJ system, which increases the performance level and also arouses students' interest in programming throughout the year. Teachers are increasingly incorporating OJ systems into their daily programming teaching activities. Although the performance of OJ systems is impressive, the contributions and effects of teachers are invaluable and irreplaceable. OJ systems can serve as excellent tools to assist teachers in programming education. By providing reliable, high-efficiency, and objective evaluations, as well as fair and accurate scoring, OJ systems can greatly reduce teachers' workload [215]. Teachers can utilize OJ systems to assign programming tasks, analyze student performance through reports and statistics, identify code similarities, and more. For example, in a lesson on recursive algorithms, students may be given a problem that can be solved using either brute force or a recursive algorithm. If the teacher finds that most students used the brute force algorithm (which is relatively easier than the recursive algorithm), they may reinforce the use of the recursive algorithm to improve students' algorithmic skills. Table 2 shows a partial list of OJ systems that have been incorporated into programming education.
3.3 Recruitment and Selection
In recent years, the use of technologies such as big data analytics, AI, and OJs in the recruitment process has become more common among companies and organizations [119]. These technologies provide rigorous, effective, and cost-efficient ways to find suitable candidates. One study [119] explores how prescreening applications, practical skills testing, and candidate communication can be handled using AI. Furthermore, technology in the different phases of the recruitment and selection process is presented in the study [118]. These phases are attracting, screening, selection, and on-boarding and socialization. There are numerous platforms that use OJ systems to support the recruiting process. These platforms are mainly commercial and automatically evaluate the submitted codes and rank the programmers. We present some OJ systems that are used for recruitment, for example, HackerEarth, HackerRank, Qualified, CodeEval, Codility, Track Test, and so on. HackerEarth is an online platform dedicated to hiring talented developers, hosting crowdsourcing-based ideas, and organizing hackathons. Codility helps hiring managers to find the best developers from a large pool of skilled programmers in the shortest possible time. A list of OJ systems applicable to the recruitment and selection process can be found in [189].
Table 2. OJ systems used for education

| Name | In Operation | Language | # Problems | Founded |
|---|---|---|---|---|
| UVa Online Judge | Yes | Eng | 4,300 | 1995 |
| Aizu Online Judge | Yes | Eng, Jpn | 3,000 | 2004 |
| Jutge.org | Yes | Eng, Esp, Fre, Cat, Dut | 4,843 | 2006 |
| POJ | Yes | Eng, Chi | 3,000 | 2003 |
| CodeChef | Yes | Eng | 3,000 | 2009 |
| CodeHunt | Yes | Eng | 8,300 | 2014 |
| Codecademy | Yes | Eng | - | 2011 |
| CodeWars | Yes | Eng | 1,200 | 2012 |
| URI Online Judge | Yes | Eng, Spa, Por | 1,100 | 2011 |
| Hangzhou Dianzi University (HDU) OJ | Yes | Eng, Chi | 6,000 | - |
| Judge0 | Yes | Eng | - | 2017 |
| LeetCode | Yes | Eng | - | 2015 |
3.4 Online Compilers
Another category of OJ systems is online compilers, where code developed in different languages by users can be compiled and executed remotely through a browser. Codeanywhere is one of the most feature-rich online compilers; it offers a dedicated custom development environment and real-time collaboration. It allows users to automatically connect to GitHub, Amazon Cloud, FTP servers, and Bitbucket. Coding Ground offers a full-featured IDE that allows users to edit, compile, and run their projects in the cloud. Codio is a cloud-based IDE that supports a large number of programming languages, can be integrated into e-learning platforms, and can be used to detect plagiarism. Furthermore, there are many online compilers whose goal is to provide support for code compilation without any physical configuration. CodeSkulptor, for example, is an online compiler for the Python programming language and also supports the learning process of programmers. The Codepad compiler allows users to share their code with other collaborators via URLs. In addition, C++ Shell, Web Compiler, OnlineGDB C, Tutorialspoint, Replit, Rextester, myCompiler, OneCompiler, Online Compiler CodeChef, and Techiedelight are compilers used for online code compilation.
3.5 Development Platforms
In contrast, many OJ development platforms are available to host programming competitions or educational activities on local infrastructure. DOMjudge is a well-known OJ development platform that can be easily installed to host programming contests. It allows users to prepare and run programming contests according to ACM ICPC rules. Mooshak [81] is an automatic judge and full-fledged contest manager. It is an open system that can be installed on a Linux operating system with an Apache HTTP server. The architecture of Mooshak is suitable not only for small competitions with one server, but also for competitions with multiple sites. CloudCoder, a web-based platform inspired by CodingBat [164], is designed to assess students' submitted code assigned by the instructor.
Table 3. OJ systems used as online compilers and development platforms

| Name | In Operation | Founded | Compiler | Development Platform |
|---|---|---|---|---|
| Codeanywhere | Yes | 2013 | ✓ | × |
| Coding Ground | Yes | 2006 | ✓ | × |
| DOMJudge | Yes | 2004 | × | ✓ |
| C++ Shell | Yes | 2014 | ✓ | × |
| Mooshak | 2015 | 2005 | × | ✓ |
| SIO2 | Yes | 2012 | × | ✓ |
| Ideone | Yes | 2009 | ✓ | × |
| CodeChef | Yes | - | ✓ | × |
| Virtual Programming Lab | Yes | 2012 | × | ✓ |
| Online Compiler | Yes | 2009 | ✓ | × |
| A+ | Yes | 2017 | × | ✓ |
| Codio | Yes | 2013 | ✓ | × |
| TestMyCode | Yes | 2013 | × | ✓ |
| Programiz | Yes | - | ✓ | × |
| xLx | Yes | 2001 | × | ✓ |
| Web-CAT | Yes | 2003 | × | ✓ |
| Web Compiler | Yes | 2014 | ✓ | × |
| Codepad | Yes | 2008 | ✓ | × |
| BOSS | Yes | 2012 | × | ✓ |
| CodeSkulptor | Yes | 2012 | ✓ | × |
| CloudCoder | Yes | 2012 | × | ✓ |
| Tsinghua Online Judge | Yes | 2012 | × | ✓ |
| paiza.io | Yes | 2014 | ✓ | × |
| Judge0 | Yes | 2017 | ✓ | ✓ |
At Tsinghua University, an online assessment system was developed for university-level code evaluation [213], allowing users to add new problems. However, the code resources of this system are not open source. Xavier and collaborators described an OJ platform developed at the University of Porto, which drew inspiration from DOMjudge and Moodle [197]; they also reviewed existing OJ development platforms. Furthermore, various plug-ins for the Moodle platform have been created for code evaluation, such as CodeRunner, Virtual Programming Lab, and the Online Judge plug-in for Moodle [146, 214]. Table 3 provides a list of online compilers and development platforms.
4 RESOURCES OF AUTOMATED CODE EVALUATION SYSTEMS
In this section, we present a consolidated overview of AES datasets and tools. Since AESs are regularly used for a variety of purposes, including recruitment, programming education and contests, data mining, and research, a huge number of data resources (codes, problems, analysis results, scores, submission logs, etc.) are generated. Kaggle is an online platform for data scientists and ML programmers [48]. It is also used to organize data mining and ML competitions. Kaggle allows users to collaborate with others, publish datasets, find datasets of others, access GPU-integrated notebooks, and compete against each other to solve data science problems. There are
many data science and ML codes and datasets (e.g., sports, health, travel, software, and food) that are available (open source) and mostly reliable. DREAM Challenges is another platform for solving biological problems, especially scientific challenges, and is therefore targeted at researchers only [38, 109, 150]. OpenML [178] is an open platform for sharing datasets, code, algorithms, and experimental results. It has a rich and diverse data repository that stores more than 5,000 datasets and supports Python, Java, the REST API, and many other scientific programming languages. Recently, Hugging Face³ has gained attention as an open-source platform for sharing ML models, datasets, evaluation metrics, and demo applications for open science. Since the platform is a Git-based code repository similar to GitHub, researchers can easily share their own models and datasets. In addition, users can easily try the public models and datasets through Python libraries.
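As an example of how such shared resources can be consumed programmatically, the snippet below sketches loading a dataset through the Hugging Face datasets library. The identifier "owner/code-dataset" is a placeholder rather than a specific published dataset; datasets mentioned later in this survey (e.g., HumanEval or CodeSearchNet) are distributed on the Hub under their own identifiers, which should be checked in the platform's documentation.

```python
# Minimal sketch; requires `pip install datasets`.
from datasets import load_dataset

# "owner/code-dataset" is a placeholder identifier, not a real dataset name.
ds = load_dataset("owner/code-dataset", split="train")
print(ds)        # summary: number of rows and column names
print(ds[0])     # first record, e.g., a solution code and its verdict
```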
Furthermore, numerous solution codes and logs are available on various platforms (AOJ, AtCoder, etc.) and can be used for education and programming research. These real-world and rich data resources have become attractive for coding, education, learning analytics, ML, and data mining research [138, 140]. In a study [137], a comprehensive data-driven analysis based on OJ data was conducted. The experimental results show the shortcomings of students' programming and the scope for possible improvements. Hsiao et al. [63] leveraged an educational learning analysis tool called "Web Programming Grading Assistant (WPGA)" to study the effectiveness of students' programming learning. Rahman et al. [138] conducted educational data mining using OJ data to support programming learning. However, benchmark datasets have had a significant impact on the growth of coding-related research using ML. Here we present some benchmark datasets for code intelligence research. Code representation is a fundamental activity for preparing source code so that it is compatible with ML models. It involves converting code into a numerical representation to feed ML models for solving specific tasks (e.g., pattern recognition, method name prediction, and comment classification); a minimal illustration is given after this paragraph. code2vec [13] and GHTorrent [52] can be useful datasets for code representation tasks. Code quality assessment is a broad category of coding tasks with subcategories such as clone detection, smell detection, quality assessment, and prediction. The BigCloneBench, Multi-label Smells, DL Smells, QScored, and ML for software refactoring datasets can be useful for code quality assessment [16, 56, 155, 157, 166]. Vulnerability analysis is another important area of code analysis that is used to determine whether code is vulnerable or not. Datasets such as Draper VDISC, Genome, TRL, and Project-KB [93, 126, 148, 216] are used to identify vulnerabilities in code. Code testing datasets such as Defects4J, BugDetection, DeepBugs, and DTLDP [30, 73, 89, 127] can be useful for the testing task.
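Returning to the code representation task mentioned above, the minimal sketch below turns a code fragment into a numerical representation by counting lexical tokens. This bag-of-tokens view is only a baseline illustration of the idea; datasets and models such as code2vec learn far richer distributed representations.

```python
import re
from collections import Counter

def bag_of_tokens(source: str) -> Counter:
    """Split source code into identifier/operator tokens and count them,
    giving a simple numerical representation that an ML model could consume."""
    tokens = re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", source)
    return Counter(tokens)

vec = bag_of_tokens("for (int i = 0; i < n; i++) sum += a[i];")
print(vec.most_common(4))   # roughly [('i', 4), (';', 3), ('+', 3), ('=', 2)]
```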
In addition, code search [55, 66, 129] is important when programmers want to reuse other code. Such a system automatically finds semantically similar code based on a natural language query. A code completion system [25, 103, 143, 168, 169] can help programmers automatically complete their code. Also, code-to-code translation systems [75, 117, 147] assist programmers in translating their code from one language to another (e.g., Python to Java and Java to Python). As the use of real datasets for coding-related research increases, we present a list of available datasets from platforms such as OJs, contest platforms, Stack Overflow, and GitHub in Table 4.
5 CODE ANALYSIS USING MACHINE LEARNING
According to Evans Data Corporation⁵, there were approximately 23.9 million professional developers in 2019, and that number is expected to reach approximately 28.7 million by the end of 2024 [101]. ML-based code intelligence can be used to improve the productivity of a growing number of professional programmers. At the same time, benchmark datasets have a significant impact on the growth of applied ML research.
3https://huggingface.co/.
5https://evansdata.com/press/viewRelease.php?pressID=278
Table 4. A list of datasets and their application in coding analysis

| Sl. | Dataset Name | Coding Tasks | Size |
|---|---|---|---|
| 1 | CodeNet [131] | code classification, code similarity | 14 million |
| 2 | Aizu [190] | code classification, code-to-code translation, code completion, refactoring, summarization | 8 million |
| 3 | AtCoder [115] | code classification, code-to-code translation, code completion, refactoring, summarization | 7.5 million |
| 4 | BigCloneBench [167] | code clone | 6 million (true clone pairs) and 260,000 (false clone pairs) |
| 5 | POJ-104 [24] | code classification, code similarity, code clone | 52,000 |
| 6 | GCJ [177] | code plagiarism detection | 2.4 million |
| 7 | PY150 [142] | code completion | 150,000 |
| 8 | Devign [217] | fault detection | 27,318 |
| 9 | QuixBugs [92] | code repair | 40 |
| 10 | Bugs2Fix [173] | code repair | 122,000 |
| 11 | CodeSearchNet [66] | code summarization | 1.1 million |
| 12 | CodeContests [87] | text-to-code generation | 4.5 million |
| 13 | APPS [61] | text-to-code generation | 10,000 |
| 14 | CONCODE [70] | text-to-code generation | 104,000 |
| 15 | MBPP [19] | text-to-code generation | 974 |
| 16 | MathQA-Python [19] | text-to-code generation | 23,914 |
| 17 | HumanEval [32] | text-to-code generation | 164 |
| 18 | HumanEval-X | text-to-code generation | 820 |
| 19 | HumanEval Infilling [23] | code infilling | 8,488 |
| 20 | code-docstring-corpus [110] | text-to-code generation, code summarization | 150,370 |
| 21 | CoNaLa [203] | text-to-code generation, code summarization | 2,879 (annotated) and 598,237 (not annotated) |
| 22 | MCoNaLa [188] | text-to-code generation, code summarization | 896 |
| 23 | MBXP [18] | text-to-code generation, code summarization, code translation, code infilling, prompt robustness | - |
| 24 | CodeXGLUE [101] | clone detection, defect detection, cloze test, code completion, code translation, code search, code repair, code summarization, text-to-code generation, document translation | - |
| 25 | StaQC [200] | code repair | 148K Python and 120K SQL question-code pairs from Stack Overflow |
| 26 | IntroClass [80] | code repair | - |
Table 5. Machine learning approaches for coding analysis tasks

| Sl. | Code Analysis Task | Article |
|---|---|---|
| 1 | Defect Detection | [27, 31, 90, 128, 132, 141, 183, 184, 217] |
| 2 | Clone Detection | [21, 28, 112-114, 167, 185, 195, 201, 209] |
| 3 | Code Completion | [25, 85, 97, 98, 103, 143, 168, 169] |
| 4 | Code Repair | [15, 50, 51, 59, 79, 96, 104, 173, 179] |
| 5 | Code Search | [55, 66, 149, 160, 180, 186] |
| 6 | Code Summarization | [11, 33, 45, 65, 69, 181, 193] |
| 7 | Text-to-Code Generation | [33, 58, 68, 193, 204, 205] |
| 8 | Code Classification | [71, 103-105, 139, 175, 191] |
Recently, researchers have begun to utilize statistical models such as neural networks for code analysis tasks. Also, the application of pre-trained models that learn from large programming data (e.g., code), such as BERT [40]-based models [44, 57, 72, 74, 100, 187] and GPT [26, 133, 134]-based models [32, 121], has achieved great success in a wide range of coding tasks. With the growth of resources, datasets, tools, and methods, code analysis research is also expanding. Therefore, Table 5 summarizes recent attempts at code analysis tasks using ML.
A brief description of the coding tasks with ML is as follows. The purpose of the defect detection task is to identify errors/defects within the body of the source code. Classification of code as buggy or not is based on the identification of errors in the code. Compiling the code is not enough to detect all errors, because logic errors may remain in the code even after successful compilation. Sometimes, the entire code needs to be manually inspected to detect logic errors, which is time-consuming and costly. ML models are able to detect these logic errors in code based on a large corpus of correct code. Ghassan [153] proposed a practical approach to detect logic errors in code for object-oriented environments such as the C# .NET Framework. To avoid logic errors, the proposed environment includes some predefined behaviors using Xceed, Alsing, and Mind Fusion components. Al-Ashwal et al. [6] presented a tool for identifying logic errors in Java code using both static and dynamic methods. Song et al. [163] proposed an automatic logic error detection method for functional programming tasks using correct reference solutions for each counter assignment. This method was much more efficient than manually identifying logic errors in code. A BiLSTM model has also been used to detect both logical and syntactic errors in code [104].
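A toy, hedged sketch of the idea behind ML-based defect detection is shown below: code fragments labeled as buggy or clean are vectorized by token counts and a classifier is trained on them. The hand-made samples and the scikit-learn pipeline are illustrative assumptions; the studies cited above rely on much richer features (ASTs, data flow, learned embeddings) and large corpora.

```python
# Toy defect-detection sketch: label code as buggy (1) or clean (0) and train a
# classifier on token counts; real systems use far richer features and corpora.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

codes  = ["if (x = 0) return;", "if (x == 0) return;",
          "while (i < n) { }",  "while (i < n) { i++; }"]
labels = [1, 0, 1, 0]                       # hand-made example labels

vec = CountVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(codes)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["if (y = 1) return;"])))
```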
Clone detection is used to identify semantically similar code. It has two subtasks: searching for similar code and classification. Code cloning is a well-known approach to reusing source code in software development. Ain et al. [5] provided a comprehensive review of the latest tools, trends, and techniques in the field of code clone detection. Code clones pose a challenge to software maintenance because verification is required for similar segments when a bug is identified in one segment. Feng and collaborators [86] proposed a token-based approach to clone detection using a DL model called CCLEARNER. A classifier is trained using known method-level code clones and non-clones, and the trained classifier is then used to detect code clones in a given code. Fang et al. [42] proposed a supervised DL model for function-level code clone detection. They proposed a joint code representation that uses fusion embedding techniques to learn the unseen semantic and syntactic features from the source code. The proposed fusion learning approach outperformed state-of-the-art models in detecting function-level code clones on a large C++ program dataset. White et al. [196] introduced a learning-based code clone detection technique using DL. Moreover, a systematic review of DL applications for code clone detection has been conducted in the study [84].
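The fragment below gives a deliberately simplified stand-in for learned clone detection: two code fragments are represented as token-count vectors and flagged as clones when their cosine similarity exceeds a threshold. Approaches such as CCLEARNER or fusion embeddings replace the counting step with trained neural representations; the threshold of 0.7 is an arbitrary illustrative choice.

```python
# Simplified clone-detection sketch: token-count vectors + cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "int add(int x, int y) { return x + y; }"
b = "int sum(int p, int q) { return p + q; }"

X = CountVectorizer(token_pattern=r"\S+").fit_transform([a, b])
sim = cosine_similarity(X[0], X[1])[0, 0]
print("clone" if sim > 0.7 else "not clone", round(sim, 2))
```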
Code completion is another task that helps programmers complete code correctly. Programmers sometimes get confused about what to write next, and in such cases, code completion helps them complete the code. Code completion can be done in two ways: (i) token-level completion and (ii) line-level completion. The code repair task identifies bugs and automatically fixes them; typically, such systems identify bugs in the code and fix the code according to its context. Rahman et al. [104] proposed a BiLSTM language model for code evaluation and repair. The model is trained on a large number of correct codes and used for code evaluation and repair. The model also proposes correct words as replacements for erroneous words in the code. Li et al. [88] proposed a DL model called DEAR that can repair/fix common errors that require dependent changes in one or more subsequent segments of the code. They leveraged a tree-based LSTM model and a divide-and-conquer strategy to understand the actual context of the code.
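As a minimal illustration of token-level completion, the sketch below learns bigram frequencies from a tiny corpus of correct code and predicts the most frequent next token. It is a drastic simplification of the neural language models cited above (e.g., the BiLSTM of [104]); the corpus and tokenization are illustrative assumptions.

```python
# Toy token-level code completion: predict the most frequent next token.
from collections import Counter, defaultdict

corpus = ["for ( int i = 0 ; i < n ; i ++ )",
          "for ( int j = 0 ; j < m ; j ++ )"]
bigrams = defaultdict(Counter)
for line in corpus:
    toks = line.split()
    for prev, nxt in zip(toks, toks[1:]):
        bigrams[prev][nxt] += 1

def complete(prev_token: str) -> str:
    """Return the most frequent successor of prev_token seen in the corpus."""
    return bigrams[prev_token].most_common(1)[0][0] if bigrams[prev_token] else "<unk>"

print(complete("int"))    # "i" or "j", whichever the tiny corpus makes more frequent
```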
Code search is a task that measures the semantic relevance between code and natural language. It is the activity of searching for code based on a natural language query. Gu et al. [55] proposed a novel DL model called CODEnn for code search. The model embeds code and its natural-language description together in a high-dimensional vector space. Code snippets can be retrieved for a natural language query based on this shared vector space, which ensures a more relevant code search. Ling et al. [94] proposed an adaptive model for deep code search (AdaCS), which is trained once and applied to different code bases. AdaCS divides the learning procedure into two main steps: embedding domain-specific words and identifying common syntactic patterns for matching. Experimental results show that AdaCS exhibits improved resilience and outperforms existing state-of-the-art deep code search techniques when applied to unseen software projects that are not included in the training dataset.
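A much-simplified sketch of natural-language code search follows: code snippets and the query are embedded as TF-IDF vectors and ranked by cosine similarity. Deep models such as CODEnn or AdaCS instead learn a joint embedding space, but the retrieval step is analogous; the snippets and query here are made-up examples.

```python
# Simplified code search: rank snippets against a natural language query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

snippets = [
    "def read_file(path): return open(path).read()",
    "def sort_list(xs): return sorted(xs)",
    "def http_get(url): import urllib.request; return urllib.request.urlopen(url).read()",
]
query = "download the contents of a url"

X = TfidfVectorizer().fit_transform(snippets + [query])
scores = cosine_similarity(X[len(snippets)], X[:len(snippets)]).ravel()
print(snippets[scores.argmax()])      # expected to rank the http_get snippet first
```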
Code summarization provides an abstracted summary of the code and helps with program comprehension and software maintenance. Producing high-quality annotations is therefore a major challenge for researchers. Liu et al. [95] proposed a novel code summarization method based on neural networks that uses the call dependency information of the source code. All call dependencies are extracted and converted into a sequence of tokens for a Seq2Seq model for code summarization. The experimental results demonstrated that the model understands the call dependencies during the code summarization task. The text-to-code task is used to generate code based on an input natural language description. Code classification can be done in many ways, such as classifying source code based on programming language, application, and error.
To perform coding tasks, researchers have developed various ML models and tools and have had great success in the tasks of code representation, code search, code understanding, code summarization, quality assessment, vulnerability assessment, testing, and code synthesis. Here we present a list of tools and ML models used to analyze code, as shown in Table 6. Each model/tool is listed with the actual reference, citation count, year of first appearance online, and a short description. All metadata for these models/tools was collected manually by searching author websites, electronic libraries, and publisher websites.
6 THE AOJ: AN EXAMPLE OF AN AUTOMATED CODE EVALUATION SYSTEM
The Aizu Online Judge (AOJ) [190] is an AES that has been in operation for more than 11 years. It has a wide range of applications, such as academics, programming contests, regular practice, and research. AOJ has over 130,000 registered users who perform programming activities regularly. It has already evaluated more than eight million program codes. The University of Aizu, Japan, officially uses the AOJ platform for several programming courses as an active academic tool. In addition, AOJ provides API services to access its resources for research purposes⁶, which has led to a large number of studies in education, code analysis, data mining, and datasets based on its data [103, 104, 131, 137-139, 191].
6http://developers.u-aizu.ac.jp/.
Table 6. A list of ML models and tools used to analyze code

| Tasks | Tools/ML Models | #Citations | Year of Pub. | Description |
|---|---|---|---|---|
| Code Search | Deep Code Search [55] | 418 | 2018 | Code embedding technique is used to search code |
| Code Representation | Graph-based code modeling [10] | 702 | 2018 | Code modeling with graph |
| Code Representation | Code2vec [13] | 929 | 2019 | Generate distributed representation of the code |
| Code Representation | Vocabulary learning on code [39] | 40 | 2019 | Generate AST from Java code |
| Code Representation | CC2vec [62] | 110 | 2020 | Use the representation of code changes |
| Quality Assessment | SVF [165] | 269 | 2016 | Interprocedural static value-flow analysis in LLVM |
| Quality Assessment | ML for software refactoring [16] | 47 | 2020 | Explores the effectiveness of ML in predicting software refactoring |
| Quality Assessment | CREC [207] | 32 | 2018 | Automated code clone recommendation for refactoring |
| Quality Assessment | SMAD [22] | 30 | 2020 | SMAD is used to detect two renowned anti-patterns: God Class and Feature Envy |
| Quality Assessment | DL smells [155] | 32 | 2021 | Detects code smells by using DL and transfer learning |
| Vulnerability Assessment | VCCFinder [123] | 219 | 2015 | Detects vulnerable codes in repositories |
| Vulnerability Assessment | SWAN [124] | 15 | 2019 | Detects vulnerable codes |
| Vulnerability Assessment | WAP [107] | 9 | 2013 | Detects and corrects vulnerable codes |
| Code Understanding | CodeNN [69] | 607 | 2016 | Neural attention based model is used to summarize the code |
| Code Understanding | ChangeScribe [36] | 171 | 2014 | Automatically generate commit messages based on code changes |
| Code Understanding | DeepSim [212] | 192 | 2018 | DL is used to find code similarity |
| Code Understanding | NeuralCodeSum [4] | 226 | 2020 | Source code summarization using Transformer |
| Code Understanding | CODES [122] | 118 | 2012 | Source descriptions are mined from programmers' communication |
| Code Understanding | Rencos [210] | 132 | 2020 | Code summarization based on neural network |
| Testing | AppFlow [64] | 61 | 2018 | Using ML to synthesize robust reusable UI tests |
| Testing | DeepFuzz [99] | 57 | 2019 | Automatic generation of syntax for Fuzz testing |
Fig. 1. Component diagram of the AOJ system
Considering the scalability and impact of the AOJ system on practical applications, we present the AOJ system as an example of an AES in this study.
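As an indication of how the API services mentioned above can be used to retrieve AOJ resources, the sketch below issues a plain HTTP request. The base URL and path are assumptions made for illustration; the authoritative endpoint list is maintained on the developers site cited earlier (http://developers.u-aizu.ac.jp/), and responses should be handled according to that documentation.

```python
# Hedged sketch of pulling public AOJ data over a REST-style API.
import requests

BASE = "https://judgeapi.u-aizu.ac.jp"    # assumed API host; verify against the docs
resp = requests.get(f"{BASE}/problems", timeout=10)
resp.raise_for_status()
problems = resp.json()                     # expected: a list of problem records
print(len(problems), "problems fetched")
```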
6.1 Hardware Architecture
The AOJ system has a complex component architecture with multiple dedicated components that
work together to provide a smooth and uninterrupted evaluation. Fig. 1 shows the component
diagram of the AOJ system, in which the web server, database server, judge server, broadcaster, and
load balancer are interconnected. The web server is the interface used to communicate between the
external system and the OJ core system. Basic tasks such as browsing, registration, authentication,
and solution submission are implemented in the web server. It also communicates with the database
server and load balancer to obtain the evaluation/judgment results of each submission. The database
server may consist of one or more servers that store all information related to the evaluation, such as
problem set, evaluation results, solution codes, statistical data, and user information. The database
servers communicate with the web server and load balancer via private APIs. The judge master
is a key server responsible for managing the judge data for the judge cluster. It provides refined
judge data for all judge servers and builds the necessary checker and reactive programs for all judge
servers. It also contains a non-stop integration mechanism and is considered one of the special
components of the OJ system. Similarly, the load balancer is a core component of the OJ system to
interact between the web server and the judge servers. It also communicates with the broadcaster
and the database server as well.
The load balancer is primarily responsible for scheduling submissions for judging. It receives the evaluation results from the judge servers and sends notifications to the web server, database server, and broadcaster. The broadcaster is basically used to inform users asynchronously about the status of their submissions. Finally, the main role of the judge servers is to perform the assigned tasks based on the judge data. The judge servers should be isolated and should only allow authorized processes to connect. It is also important to avoid data loss due to malicious or unexpected operations. More details about these hardware components of the AOJ system can be found in [192].
We have presented the system information of AOJ in Table 7, which lists the AOJ servers7, the configuration of each server, and the programming languages each server is used for. This list includes spare/unavailable servers kept for quick replacement or updates to ensure seamless operation.
7https://onlinejudge.u-aizu.ac.jp/system_info.
Table 7. System information of judge servers and corresponding application for programming languages
Judge Server | Machine | Processor and RAM | Using for
#0 | IBM ThinkPad | Pentium(R) M 1.30 GHz, 1 GB | C, C++
#1 | Dell PowerEdge R300 | Dual Core Intel(R) Xeon(R) E3113 3.0 GHz, 1 GB | C, C++
#2 | Dell PowerEdge R300 | Dual Core Intel(R) Xeon(R) E3113 3.0 GHz, 1 GB | JAVA
#3 | Dell PowerEdge R210 | Intel(R) Xeon(R) E3-1270 3.40 GHz, 2 GB | Ruby, Python, Python3, PHP, JavaScript
#4 | Dell PowerEdge R210 II | Intel(R) Xeon(R) E3-1270 3.40 GHz, 4 GB | C, C++
#5 | Dell PowerEdge R210 II | Intel(R) Xeon(R) E3-1240v2 3.40 GHz, 8 GB | C++11, JAVA, C#, D
#6 | Dell PowerEdge R210 II | Intel(R) Xeon(R) E3-1280 v2 3.6 GHz, Turbo 4C/8T, 8 GB | Ruby, Python, Python3, PHP, JavaScript
#7 | Dell PowerEdge R220 | Intel(R) Xeon(R) E3-1286 v3 3.7 GHz, Turbo 4C/8T, 8 GB | C++11, JAVA, Scala, Haskell, OCaml, C#, D
#8 | Dell PowerEdge R220 | Intel(R) Xeon(R) E3-1286 v3 3.7 GHz, Turbo 4C/8T, 8 GB | C, C++, C++14
#9 | Dell PowerEdge R210 II | Intel(R) Xeon(R) E3-1280 v2 3.6 GHz, Turbo 4C/8T, 8 GB | Ruby, Python, Python3, PHP, JavaScript, Rust, Go, Kotlin
#10 | Dell PowerEdge R240 | Intel(R) Xeon(R) E-2286G 4.0 GHz, Turbo 6C/12T, 8 GB | C++14, C++17, Rust
#11 | Dell PowerEdge R240 | Intel(R) Xeon(R) E-2286G 4.0 GHz, Turbo 6C/12T, 8 GB | C++11, JAVA, C#, PHP, Kotlin
#12 | Dell PowerEdge R250 | Intel(R) Xeon(R) E-2378G 2.8 GHz, Turbo 8C/16T, 8 GB | PyPy3
#13 | Dell PowerEdge R250 | Intel(R) Xeon(R) E-2378G 2.8 GHz, Turbo 8C/16T, 8 GB | —
6.2 Software Architecture
In this section, we briefly describe the software architecture of the AOJ system, in particular the core components such as the load balancer, broadcaster, and judge server. The load balancer is a core component of any OJ system. It receives submissions from the web server and distributes them to the judge servers. Finally, it reports judging results through a variety of channels. The load balancer is organized as multithreaded processes with three instances: SubmissionReceiver, SubmissionSender, and SubmissionProvider. The software architecture, submission, and evaluation in the load balancer are shown in Fig. 2. In addition, the details of the operation and flow of the load balancer are described in [192].
Fig. 2. Internal software architecture of load balancer
Fig. 3 shows the software architecture of the broadcaster, representing the communication between the broadcaster, load balancer, and external systems. Since the broadcaster performs asynchronous communication, WebSocket is used for push-type communication. First, the external system establishes a WebSocket connection to the broadcaster. Next, the load balancer sends the status change to the broadcaster through a TCP connection, and the broadcaster receives the status using its internal TCP server. Finally, the broadcaster pushes the status to all connected external systems through the pre-established WebSocket connections.
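As a rough illustration of this push-style flow, the following Java sketch uses plain TCP sockets on both sides (the WebSocket handshake and framing are omitted) to show how status lines received from a load balancer could be fanned out to all connected clients. The port numbers and class names are assumptions made for the example, not part of the AOJ design.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a broadcaster that receives status updates from the
// load balancer over TCP and pushes them to all connected clients.
public class BroadcasterSketch {
    private static final Set<PrintWriter> clients = ConcurrentHashMap.newKeySet();

    public static void main(String[] args) throws Exception {
        // Client-facing listener: each external system that connects is registered.
        new Thread(() -> acceptClients(9001)).start();

        // Internal TCP server: reads status changes sent by the load balancer
        // and broadcasts each line to every registered client.
        try (ServerSocket fromLoadBalancer = new ServerSocket(9002);
             Socket lb = fromLoadBalancer.accept();
             BufferedReader in = new BufferedReader(new InputStreamReader(lb.getInputStream()))) {
            String status;
            while ((status = in.readLine()) != null) {
                for (PrintWriter client : clients) {
                    client.println(status);   // push-type notification
                }
            }
        }
    }

    private static void acceptClients(int port) {
        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                Socket client = server.accept();
                clients.add(new PrintWriter(client.getOutputStream(), true));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```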
Fig. 3. Software architecture of broadcaster
The internal software architecture of the Judge server is shown in Fig. 4. An operating system is installed on the Judge server, and there are five processes: Controller, Observer, Executor, Launcher, and Judge. The Observer is an administrator that monitors all submission processes; if a submission process misbehaves, the Observer reacts and terminates it. The Controller is a communicator
Fig. 4. Software architecture of judge server
between the load balancer and the Judge server. It receives submissions from the load balancer and returns the corresponding judging results. The Launcher prepares submissions for the Executor, which is then activated by the Launcher to execute the program code. Finally, the Judge evaluates the program code based on the judge data. However, problems can occur in the Launcher, Executor, and Judge phases when executing and evaluating program code. If processes such as code.exe in the Executor phase, or the checker and reactor in the Judge phase, take more time than usual, they are terminated by the Observer.
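The time-limit enforcement role of the Observer can be approximated with standard Java process control. The sketch below launches a submitted program, assumed here to be an already compiled ./code.exe as named in the text, and forcibly terminates it if it exceeds an assumed time limit; it is a simplified illustration, not the AOJ implementation.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch of Observer-style time-limit enforcement: run the
// compiled submission and kill it if it does not finish within the limit.
public class ObserverSketch {
    public static void main(String[] args) throws Exception {
        long timeLimitSeconds = 2;  // assumed per-problem time limit

        Process submission = new ProcessBuilder("./code.exe")   // name taken from the text
                .redirectErrorStream(true)
                .start();

        boolean finished = submission.waitFor(timeLimitSeconds, TimeUnit.SECONDS);
        if (!finished) {
            submission.destroyForcibly();       // the Observer terminates the runaway process
            System.out.println("Time Limit Exceeded");
        } else {
            System.out.println("Finished with exit code " + submission.exitValue());
        }
    }
}
```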
In addition, the entire operation of the AOJ system relies on a variety of software technologies [192]. The web server is implemented with the Spring framework (2018 and onwards) and Apache Tomcat (before 2018). Scala is used for the broadcaster, and the load balancer runs entirely as a Java application. In contrast, the judge servers run on scripting languages.
6.3 Operational Availability
Since the AOJ system is active in a variety of programming activities, here are some brief operation-
oriented statistics. AOJ has successfully held 100 on-site programming contests and over 3,000
virtual contests. AOJ is used as an academic tool in programming courses at the University of
Aizu and has evaluated over 8 million submitted solution codes through its long-term service. The
AOJ system has been in operation around the clock for more than a decade, supporting programming learning and practice. We have calculated the availability of the AOJ system in hours from 2011 to 2022.
The total possible hours (TPH) can be calculated using Eq. (6):

$TPH = \mathrm{years}_{\mathrm{operation}} \times (365 \times 24)$  (6)
In each year, the services of the AOJ system are suspended for planned maintenance. Therefore, the total operating hours (TOH) can be calculated using Eq. (7), where $y$ is the year and $MH_y$ stands for the total number of maintenance hours in a year $y$:

$TOH_y = (365 \times 24) - MH_y$  (7)

Assuming $MH_{2011}$ is 71, $TOH_{2011}$ would be 8,689 hours according to the above formula. We consider an hour as an actual working hour (AWH) if the AOJ system evaluates at least one submission in that hour; otherwise, it is considered an idle hour (IH). Fig. 5 shows the $TOH_y$, $MH_y$, $AWH_y$, and $IH_y$ of the AOJ system during 2011-2022.
Fig. 5. The availability of the AOJ system during 2011-2022: (a) TOH and AWH per year; (b) MH and IH per year.
From Fig. 5, some observations can be made: (i) AWH is relatively low in 2011 and 2012; (ii) in the last few years, AWH is close to TOH; (iii) IH was also found to be very low in 2020, 2021, and 2022; (iv) MH also decreases year by year. Furthermore, the TPH (2011-2022) for the past 12 years was 105,120 hours, in which the AOJ system was in operation (available) for about 104,656 hours, which is 99.56% of the TPH (2011-2022) excluding scheduled MH. In addition, the average AWH is 98.74% of TOH over the past 12 years. This statistic demonstrates the high operational availability of the AOJ system throughout the years.
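For readers who want to reproduce this arithmetic, the short Java sketch below applies Eqs. (6) and (7) and computes the two availability percentages. The per-year maintenance and working-hour values are placeholders, since the actual figures are only reported in aggregate and in Fig. 5.

```java
// Illustrative sketch: computing TPH, TOH, and availability percentages from
// Eqs. (6) and (7). The maintenanceHours/workingHours values are placeholders,
// not the actual per-year AOJ numbers.
public class AvailabilitySketch {
    public static void main(String[] args) {
        int years = 12;                                   // 2011-2022
        int[] maintenanceHours = new int[years];          // MH_y (placeholder values)
        int[] workingHours = new int[years];              // AWH_y (placeholder values)
        java.util.Arrays.fill(maintenanceHours, 40);
        java.util.Arrays.fill(workingHours, 8600);

        int tph = years * 365 * 24;                       // Eq. (6): TPH = years * (365 * 24)
        int totalToh = 0, totalAwh = 0;
        for (int y = 0; y < years; y++) {
            int toh = 365 * 24 - maintenanceHours[y];     // Eq. (7): TOH_y = 8760 - MH_y
            totalToh += toh;
            totalAwh += workingHours[y];
        }
        System.out.printf("Availability (TOH/TPH): %.2f%%%n", 100.0 * totalToh / tph);
        System.out.printf("Average AWH over TOH:   %.2f%%%n", 100.0 * totalAwh / totalToh);
    }
}
```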
6.4 Computational Performance
Over the years, the AOJ system has been used in a variety of areas, including competitions, academics, and practice. As a result, the AOJ system has received millions of submissions for evaluation. Fig. 6a provides the statistics of submissions received by the AOJ system over the years, showing that the AOJ received approximately 1,059,756 and 1,096,846 submissions in 2021 and 2022, respectively. Moreover, the number of submissions appears to be increasing each year. In addition, Fig. 6b shows a further breakdown of submissions for October 2022, a randomly selected month, and demonstrates that the AOJ receives a large number of submissions every day.
Fig. 6. The statistics of submissions received by the AOJ system over the years: (a) submissions received by the AOJ per year; (b) day-wise submissions for October 2022.
The average waiting time for the AOJ system to evaluate a submission was also calculated. Fig. 7 shows the waiting time for the AOJ to evaluate submissions by year. The results show that most submissions (approximately 79.11%) were processed within one second. The waiting time has tended to decrease in recent years (see Fig. 7), even though the number of submissions has increased significantly, which indicates that the efficiency of the AOJ in evaluating submitted code is improving. These statistics demonstrate that the AOJ system can be a good candidate for a state-of-the-art AES.
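A per-year average waiting time such as the one plotted in Fig. 7 can be derived from submission logs in a few lines. The sketch below assumes each log record carries a submission timestamp and a judged timestamp; this record layout is an assumption made for illustration, not the actual AOJ log schema.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: average evaluation waiting time per year computed from
// (submittedAt, judgedAt) pairs. The record layout is an assumption.
public class WaitingTimeSketch {
    record LogEntry(LocalDateTime submittedAt, LocalDateTime judgedAt) {}

    static Map<Integer, Double> averageWaitSecondsPerYear(List<LogEntry> log) {
        return log.stream().collect(Collectors.groupingBy(
                e -> e.submittedAt().getYear(),
                Collectors.averagingDouble(
                        e -> Duration.between(e.submittedAt(), e.judgedAt()).toMillis() / 1000.0)));
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
                new LogEntry(LocalDateTime.parse("2022-10-01T10:00:00"),
                             LocalDateTime.parse("2022-10-01T10:00:01")),
                new LogEntry(LocalDateTime.parse("2022-10-01T11:00:00"),
                             LocalDateTime.parse("2022-10-01T11:00:02")));
        System.out.println(averageWaitSecondsPerYear(log));   // e.g. {2022=1.5}
    }
}
```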
Fig. 7. Year-wise average waiting time (in seconds) for the AOJ to evaluate submissions
6.5 Research
The real-world resources (e.g., source codes, logs, and educational data) of the AOJ system have become important for research in learning analytics, educational data mining, coding analysis, and programming support. The data resources of the AOJ system are freely available for research8. Meanwhile, the solution code of the AOJ system has been used by IBM for their CodeNET project
8http://developers.u-aizu.ac.jp/index
[131] and by Google for their AlphaCode project [87]. Table 8 shows the list of published research papers using AOJ data resources. These papers have attracted the attention of researchers and received numerous citations; note that the citations are counted based on Google Scholar. In addition, other research papers can be found on the AOJ website9.
7 CONCLUSION
The extensive utilization of automated code evaluation systems (AESs) across various domains is undeniably significant, and the reach of these systems continues to expand. Consequently, the classification of AESs according to their specific applications holds immense value for users. Additionally, the data resources generated by AESs offer considerable utility for diverse research and development tasks. In this survey paper, we have presented a thorough and concise examination of AESs and their associated data resources. Our initial emphasis was on categorizing AESs according to their respective application domains. Subsequently, we provided an overview of the available resources and datasets within AESs, facilitating users in their research and development endeavors. Furthermore, we summarized the application of machine learning in various coding analysis tasks aimed at problem-solving in programming. Lastly, we briefly discussed the Aizu Online Judge System as an illustrative real-world example of an AES, considering factors such as hardware and software development, operation, maintenance, and research aspects.
ACKNOWLEDGMENTS
This research was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI
(https://kaken.nii.ac.jp/en/grant/KAKENHI-PROJECT-23H03508/ accessed on 15 June 2023). Grant
Number: 23H03508.
REFERENCES
[1]
Osama Abdeljaber, Onur Avci, Serkan Kiranyaz, Moncef Gabbouj, and Daniel J. Inman. 2017. Real-time vibration-based
structural damage detection using one-dimensional convolutional neural networks. Journal of Sound and Vibration
388 (2017), 154–170. https://doi.org/10.1016/j.jsv.2016.10.043
[2]
Ibrahim Abunadi and Mamdouh Alenezi. 2015. Towards Cross Project Vulnerability Prediction in Open Source Web
Applications. In Proceedings of The International Conference on Engineering’ MIS 2015 (Istanbul, Turkey) (ICEMIS ’15).
Association for Computing Machinery, New York, NY, USA, Article 42, 5 pages. https://doi.org/10.1145/2832987.
2833051
[3]
Simran Aggarwal. 2020. Software Code Analysis Using Ensemble Learning Techniques (AISS ’19). Association for
Computing Machinery, New York, NY, USA, Article 9, 7 pages. https://doi.org/10.1145/3373477.3373486
[4]
Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A transformer-based approach for
source code summarization. arXiv preprint arXiv:2005.00653 (2020).
[5]
Qurat Ul Ain, Wasi Haider Butt, Muhammad Waseem Anwar, Farooque Azam, and Bilal Maqbool. 2019. A Systematic
Review on Code Clone Detection. IEEE Access 7 (2019), 86121–86144. https://doi.org/10.1109/ACCESS.2019.2918202
[6]
Deena Al-Ashwal, Eman Zaid Al-Sewari, and Asma Abdulghani Al-Shargabi. 2018. A CASE tool for JAVA programs
logical errors detection: Static and dynamic testing. In 2018 International Arab Conference on Information Technology
(ACIT). IEEE, 1–6.
[7]
Kirsti M Ala-Mutka. 2005. A survey of automated assessment approaches for programming assignments. Computer
science education 15, 2 (2005), 83–102.
[8]
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting Accurate Method and Class
Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (Bergamo, Italy) (ESEC/FSE
2015). Association for Computing Machinery, New York, NY, USA, 38–49. https://doi.org/10.1145/2786805.2786849
[9]
Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for
Big Code and Naturalness. ACM Comput. Surv. 51, 4, Article 81 (jul 2018), 37 pages. https://doi.org/10.1145/3212695
9https://onlinejudge.u-aizu.ac.jp/papers
Table 8. A list of published research using AOJ data
Category of Study | Title [Ref] | #Citations | Year of Pub.
Learning Analytics and Data Mining | Impact of Practical Skills on Academic Performance: A Data-Driven Analysis [137] | 12 | 2021
Learning Analytics and Data Mining | Learning Path Recommendation System for Programming Education Based on Neural Networks [152] | 73 | 2020
Learning Analytics and Data Mining | A Novel Rule-Based Online Judge Recommender System to Promote Computer Programming Education [140] | 8 | 2021
Learning Analytics and Data Mining | Data Analysis and Code Assessment Using Machine Learning Techniques for Programming Activities [135] | 1 | 2022
Learning Analytics and Data Mining | Educational Data Mining to Support Programming Learning Using Problem-Solving Data [138] | 12 | 2022
Learning Analytics and Data Mining | Categorization of frequent errors in solution codes created by novice programmers [136] | 3 | 2021
Dataset | Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks [131] | 44 | 2021
Dataset | Competition-Level Code Generation with AlphaCode [87] | 155 | 2022
Programming Support | A Model with Iterative Trials for Correcting Logic Errors in Source Code [106] | 5 | 2021
Programming Support | A Bidirectional LSTM Language Model for Code Evaluation and Repair [104] | 50 | 2021
Programming Support | Prompt Sensitivity of Language Model for Solving Programming Problems [159] | – | 2022
Programming Support | Source Code Assessment and Classification Based on Estimated Error Probability Using Attentive LSTM Language Model and Its Application in Programming Education [139] | 43 | 2020
Programming Support | Algorithm to Determine Extended Edit Distance between Program Codes [17] | 5 | 2019
Programming Support | Code Completion for Programming Education based on Deep Learning [172] | 6 | 2021
Programming Support | Identifying Algorithm in Program Code based on Structural Features Using CNN Classification Model [191] | 6 | 2022
Programming Support | Automatic Generation of Fill-in-the-Blank Programming Problems [171] | 14 | 2019
Programming Support | Logic Error Detection System Based On Structure Pattern And Error Degree [206] | 7 | 2019
OJ System | Online Judge System: Requirements, Architecture, and Experiences [192] | 9 | 2022
[10]
Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs.
arXiv preprint arXiv:1711.00740 (2017).
[11]
Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summariza-
tion of source code. In International conference on machine learning. PMLR, 2091–2100.
[12]
Miltiadis Allamanis and Charles Sutton. 2013. Mining Source Code Repositories at Massive Scale Using Language
Modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories (San Francisco, CA, USA)
(MSR ’13). IEEE Press, 207–216.
[13]
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of
code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 1–29.
[14]
Hadeel Alsolai and Marc Roper. 2020. A systematic literature review of machine learning techniques for software
maintainability prediction. Information and Software Technology 119 (2020), 106214. https://doi.org/10.1016/j.infsof.
2019.106214
[15]
Leonardo Afonso Amorim, Mateus F. Freitas, Altino Dantas, Eduardo F. de Souza, Celso G. Camilo-Junior, and
Wellington S. Martins. 2018. A New Word Embedding Approach to Evaluate Potential Fixes for Automated Program
Repair. In 2018 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN.2018.
8489079
[16]
Maurício Aniche, Erick Maziero, Rafael Durelli, and Vinicius H. S. Durelli. 2022. The Effectiveness of Supervised
Machine Learning Algorithms in Predicting Software Refactoring. IEEE Transactions on Software Engineering 48, 4
(2022), 1432–1450. https://doi.org/10.1109/TSE.2020.3021736
[17]
Kazuki Anzai and Yutaka Watanobe. 2019. Algorithm to determine extended edit distance between program codes. In
2019 IEEE 13th International symposium on embedded multicore/many-core systems-on-chip (MCSoC). IEEE, 180–186.
[18]
Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad,
Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton,
Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati,
Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, and Bing Xiang. 2022. Multi-lingual Evaluation of
Code Generation Models. arXiv:2210.14868 [cs.LG]
[19]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang,
Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models.
arXiv:2108.07732 [cs.PL]
[20]
Muhammad Ilyas Azeem, Fabio Palomba, Lin Shi, and Qing Wang. 2019. Machine learning techniques for code
smell detection: A systematic literature review and meta-analysis. Information and Software Technology 108 (2019),
115–138.
[21]
Upul Bandara and Gamini Wijayarathna. 2011. A machine learning based tool for source code plagiarism detection.
International Journal of Machine Learning and Computing 1, 4 (2011), 337.
[22]
Antoine Barbez, Foutse Khomh, and Yann-Gaël Guéhéneuc. 2020. A machine-learning based ensemble method for
anti-patterns detection. Journal of Systems and Software 161 (2020), 110486.
[23]
Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark
Chen. 2022. Efficient Training of Language Models to Fill in the Middle. arXiv:2207.14255 [cs.CL]
[24]
Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural Code Comprehension: A Learnable
Representation of Code Semantics. In Proceedings of the 32nd International Conference on Neural Information Processing
Systems (NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 3589–3601.
[25]
Pavol Bielik, Veselin Raychev, and Martin Vechev. 2016. PHOG: probabilistic model for code. In International conference
on machine learning. PMLR, 2933–2942.
[26]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Proceedings of Advances in Neural
Information Processing Systems (NeurIPS), Vol. 33. 1877–1901.
[27]
Laurie Butgereit. 2019. Using Machine Learning to Prioritize Automated Testing in an Agile Environment. In 2019
Conference on Information Communications Technology and Society (ICTAS). 1–6. https://doi.org/10.1109/ICTAS.2019.
8703639
[28]
Lutz Büch and Artur Andrzejak. 2019. Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code
Clone Detection. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER).
95–104. https://doi.org/10.1109/SANER.2019.8668039
[29]
Brenda Cheang, Andy Kurnia, Andrew Lim, and Wee-Chong Oon. 2003. On Automated Grading of Programming
Assignments in an Academic Institution. Comput. Educ. 41, 2 (sep 2003), 121–131. https://doi.org/10.1016/S0360-
1315(03)00030-7
[30]
Jinyin Chen, Keke Hu, Yue Yu, Zhuangzhi Chen, Qi Xuan, Yi Liu, and Vladimir Filkov. 2020. Software visualization
and deep transfer learning for effective software defect prediction. In Proceedings of the ACM/IEEE 42nd international
conference on software engineering. 578–589.
[31]
Jinyin Chen, Keke Hu, Yue Yu, Zhuangzhi Chen, Qi Xuan, Yi Liu, and Vladimir Filkov. 2020. Software Visualization
and Deep Transfer Learning for Effective Software Defect Prediction. In 2020 IEEE/ACM 42nd International Conference
on Software Engineering (ICSE). 578–589.
[32]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf,
Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser,
Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert,
Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas
Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira
Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech
Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021).
https://doi.org/10.48550/arXiv.2107.03374
[33]
Colin B Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5:
multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150
(2020).
[34]
Sébastien Combés and Jérémy Wautelet. 2014. Programming Trainings and Informatics Teaching Through Online
Contests. Olympiads in Informatics 8 (2014).
[35]
Gordon Cormack, Ian Munro, Troy Vasiga, and Graeme Kemkes. 2006. Structure, Scoring and Purpose of Computing
Competitions. Informatics in Education 5, 1 (jan 2006), 15–36.
[36]
Luis Fernando Cortés-Coy, Mario Linares-Vásquez, Jairo Aponte, and Denys Poshyvanyk. 2014. On automatically
generating commit messages via summarization of source code changes. In 2014 IEEE 14th International Working
Conference on Source Code Analysis and Manipulation. IEEE, 275–284.
[37]
Georgina Cosma and Mike Joy. 2008. Towards a definition of source-code plagiarism. IEEE Transactions on Education
51, 2 (2008), 195–200.
[38]
JC Costello and G Stolovitzky. 2013. Seeking the wisdom of crowds through challenge-based competitions in
biomedical research. Clinical Pharmacology & Therapeutics 93, 5 (2013), 396–398.
[39]
Milan Cvitkovic, Badal Singh, and Animashree Anandkumar. 2019. Open vocabulary learning on source code with a
graph-structured cache. In International Conference on Machine Learning. PMLR, 1475–1485.
[40]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[41]
Herman Zvonimir Došilović and Igor Mekterović. 2020. Robust and Scalable Online Code Execution System. In
2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO). 1627–1632.
https://doi.org/10.23919/MIPRO48935.2020.9245310
[42]
Chunrong Fang, Zixi Liu, Yangyang Shi, Jeff Huang, and Qingkai Shi. 2020. Functional Code Clone Detection with
Syntax and Semantics Fusion Learning. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software
Testing and Analysis (Virtual Event, USA) (ISSTA 2020). Association for Computing Machinery, New York, NY, USA,
516–527. https://doi.org/10.1145/3395363.3397362
[43]
Wes Felter, Alexandre Ferreira, Ram Rajamony, and Juan Rubio. 2015. An updated performance comparison of virtual
machines and Linux containers. In 2015 IEEE International Symposium on Performance Analysis of Systems and Software
(ISPASS). 171–172. https://doi.org/10.1109/ISPASS.2015.7095802
[44]
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu,
Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In
Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics,
Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
[45]
Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2018. Structured neural summarization. arXiv
preprint arXiv:1811.01824 (2018).
[46]
Daniela Fonte, Daniela da Cruz, Alda Lopes Gançarski, and Pedro Rangel Henriques. 2013. A flexible dynamic system
for automatic grading of programming exercises. (2013).
[47] Michal Forisek. 2007. Security of Programming Contest Systems.
[48]
Anthony Goldbloom. 2010. Data Prediction Competitions Far More than Just a Bit of Fun. In 2010 IEEE International
Conference on Data Mining Workshops. 1385–1386. https://doi.org/10.1109/ICDMW.2010.56
[49]
Google. 2021. gcj-dataset. Available: https://openreview.net/attachment?id=AZ4vmLoJft&name=supplementary_
material.
[50]
Divya Gopinath, Sarfraz Khurshid, Diptikalyan Saha, and Satish Chandra. 2014. Data-guided repair of selection
statements. In Proceedings of the 36th International Conference on Software Engineering. 243–253.
[51]
Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12
(2019), 56–65.
[52]
Georgios Gousios. 2013. The GHTorrent dataset and tool suite. In 2013 10th Working Conference on Mining Software
Repositories (MSR). IEEE, 233–236.
[53]
Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with Deep Bidirectional
LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. 273–278. https://doi.org/10.1109/
ASRU.2013.6707742
[54]
Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A
Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, 10 (2017), 2222–2232. https:
//doi.org/10.1109/TNNLS.2016.2582924
[55]
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep Code Search. In 2018 IEEE/ACM 40th International
Conference on Software Engineering (ICSE). 933–944. https://doi.org/10.1145/3180155.3180167
[56]
Thirupathi Guggulothu and Salman Abdul Moiz. 2020. Code smell detection using multi-label classication approach.
Software Quality Journal 28 (2020), 1063–1086.
[57]
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie LIU, Long Zhou, Nan Duan, Alexey Svyatkovskiy,
Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and
Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In International Conference
on Learning Representations. https://openreview.net/forum?id=jLoC4ez43PZ
[58]
Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2019. Coupling retrieval and meta-learning for context-
dependent semantic parsing. arXiv preprint arXiv:1906.07108 (2019).
[59]
Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by
Deep Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (San Francisco, California,
USA) (AAAI’17). AAAI Press, 1345–1351.
[60]
Vincent J. Hellendoorn and Premkumar Devanbu. 2017. Are Deep Neural Networks the Best Choice for Modeling
Source Code?. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn,
Germany) (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 763–773. https://doi.org/10.
1145/3106237.3106290
[61]
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir
Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS.
In: Proceedings of Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks
(Round 2) (2021).
[62]
Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. 2020. CC2Vec: Distributed representations of code changes.
In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 518–529.
[63]
I-Han Hsiao, Po-Kai Huang, and Hannah Murphy. 2020. Integrating Programming Learning Analytics Across Physical
and Digital Space. IEEE Transactions on Emerging Topics in Computing 8, 1 (2020), 206–217. https://doi.org/10.1109/
TETC.2017.2701201
[64]
Gang Hu, Linjie Zhu, and Junfeng Yang. 2018. AppFlow: using machine learning to synthesize robust, reusable UI
tests. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium
on the Foundations of Software Engineering. 269–282.
[65]
Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing Source Code with Transferred API
Knowledge. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (Stockholm, Sweden)
(IJCAI’18). AAAI Press, 2269–2275.
[66]
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet
challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[67] Petri Ihantola, Tuukka Ahoniemi, Ville Karavirta, and Otto Seppälä. 2010. Review of Recent Systems for Automatic
Assessment of Programming Assignments. In Proceedings of the 10th Koli Calling International Conference on Computing
Education Research (Koli, Finland) (Koli Calling ’10). Association for Computing Machinery, New York, NY, USA,
86–93. https://doi.org/10.1145/1930464.1930480
[68]
Srinivasan Iyer, Alvin Cheung, and Luke Zettlemoyer. 2019. Learning programmatic idioms for scalable semantic
parsing. arXiv preprint arXiv:1904.09086 (2019).
[69]
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural
attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers). 2073–2083.
[70]
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Program-
matic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association
for Computational Linguistics, Brussels, Belgium, 1643–1652. https://doi.org/10.18653/v1/D18-1192
[71]
Tao Ji, Jinkun Pan, Liqian Chen, and Xiaoguang Mao. 2018. Identifying Supplementary Bug-fix Commits. In 2018
IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), Vol. 01. 184–193. https://doi.org/10.
1109/COMPSAC.2018.00031
[72]
Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, and Lei Lyu. 2021. TreeBERT: A tree-based pre-trained model for
programming language. In Uncertainty in Artificial Intelligence. PMLR, 54–63.
[73]
René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled
testing studies for Java programs. In Proceedings of the 2014 international symposium on software testing and analysis.
437–440.
[74]
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and Evaluating Contextual
Embedding of Source Code. In Proceedings of the 37th International Conference on Machine Learning (ICML’20).
JMLR.org, Article 474, 12 pages.
[75]
Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. 2014. Phrase-Based Statistical Translation of Programming
Languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on
Programming & Software (Portland, Oregon, USA) (Onward! 2014). Association for Computing Machinery, New York,
NY, USA, 173–184. https://doi.org/10.1145/2661136.2661148
[76]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet Classification with Deep Convolutional
Neural Networks. Commun. ACM 60, 6 (may 2017), 84–90. https://doi.org/10.1145/3065386
[77]
Andy Kurnia, Andrew Lim, and Brenda Cheang. 2001. Online Judge. Computers & Education 36, 4 (2001), 299–315.
https://doi.org/10.1016/S0360-1315(01)00018-5
[78]
Triet H. M. Le, Hao Chen, and Muhammad Ali Babar. 2020. Deep Learning for Source Code Modeling and Generation:
Models, Applications, and Challenges. ACM Comput. Surv. 53, 3, Article 62 (jun 2020), 38 pages. https://doi.org/10.
1145/3383458
[79]
Xuan-Bach D. Le, Tien-Duy B. Le, and David Lo. 2015. Should fixing these failures be delegated to automated
program repair?. In 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE). 427–437.
https://doi.org/10.1109/ISSRE.2015.7381836
[80]
Claire Le Goues, Neal Holtschulte, Edward K Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley
Weimer. 2015. The ManyBugs and IntroClass benchmarks for automated repair of C programs. IEEE Transactions on
Software Engineering 41, 12 (2015), 1236–1256.
[81]
José Paulo Leal and Fernando Silva. 2003. Mooshak: A Web-based multi-site programming contest system. Software:
Practice and Experience 33, 6 (2003), 567–581.
[82]
Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A Neural Model for Generating Natural Language
Summaries of Program Subroutines. In Proceedings of the 41st International Conference on Software Engineering
(Montreal, Quebec, Canada) (ICSE ’19). IEEE Press, 795–806. https://doi.org/10.1109/ICSE.2019.00087
[83]
Song-Mi Lee, Sang Min Yoon, and Heeryon Cho. 2017. Human activity recognition from accelerometer data using
Convolutional Neural Network. In 2017 IEEE International Conference on Big Data and Smart Computing (BigComp).
131–134. https://doi.org/10.1109/BIGCOMP.2017.7881728
[84]
Maggie Lei, Hao Li, Ji Li, Namrata Aundhkar, and Dae-Kyoo Kim. 2022. Deep learning application on code clone
detection: A review of current knowledge. Journal of Systems and Software 184 (2022), 111141. https://doi.org/10.
1016/j.jss.2021.111141
[85]
Jian Li, Yue Wang, Michael R. Lyu, and Irwin King. 2018. Code Completion with Neural Attention and Pointer
Networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (Stockholm, Sweden)
(IJCAI’18). AAAI Press, 4159–25.
[86]
Liuqing Li, He Feng, Wenjie Zhuang, Na Meng, and Barbara Ryder. 2017. CCLearner: A Deep Learning-Based Clone
Detection Approach. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). 249–260.
https://doi.org/10.1109/ICSME.2017.46
[87]
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling,
Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin,
Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz,
Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-
level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097. https://doi.org/10.1126/science.abq1158
[88]
Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. DEAR: A Novel Deep Learning-Based Approach for Automated
Program Repair. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania)
(ICSE ’22). Association for Computing Machinery, New York, NY, USA, 511–523. https://doi.org/10.1145/3510003.
3510177
[89] Yi Li, Shaohua Wang, Tien N Nguyen, and Son Van Nguyen. 2019. Improving bug detection via context-based code
representation learning and attention-based neural networks. Proceedings of the ACM on Programming Languages 3,
OOPSLA (2019), 1–30.
[90]
Yi Li, Shaohua Wang, Tien N. Nguyen, and Son Van Nguyen. 2019. Improving Bug Detection via Context-Based Code
Representation Learning and Attention-Based Neural Networks. Proc. ACM Program. Lang. 3, OOPSLA, Article 162
(oct 2019), 30 pages. https://doi.org/10.1145/3360588
[91]
Rui Lima, António Miguel Rosado da Cruz, and Jorge Ribeiro. 2020. Artificial Intelligence Applied to Software
Testing: A Literature Review. In 2020 15th Iberian Conference on Information Systems and Technologies (CISTI). 1–6.
https://doi.org/10.23919/CISTI49556.2020.9141124
[92]
Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A Multi-Lingual Pro-
gram Repair Benchmark Set Based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN
International Conference on Systems, Programming, Languages, and Applications: Software for Humanity (Vancou-
ver, BC, Canada) (SPLASH Companion 2017). Association for Computing Machinery, New York, NY, USA, 55–56.
https://doi.org/10.1145/3135932.3135941
[93]
Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, Yang Xiang, Olivier De Vel, and Paul Montague. 2018. Cross-project
transfer representation learning for vulnerable function discovery. IEEE Transactions on Industrial Informatics 14, 7
(2018), 3289–3297.
[94]
Chunyang Ling, Zeqi Lin, Yanzhen Zou, and Bing Xie. 2020. Adaptive Deep Code Search. In Proceedings of the 28th
International Conference on Program Comprehension (Seoul, Republic of Korea) (ICPC ’20). Association for Computing
Machinery, New York, NY, USA, 48–59. https://doi.org/10.1145/3387904.3389278
[95]
Bohong Liu, Tao Wang, Xunhui Zhang, Qiang Fan, Gang Yin, and Jinsheng Deng. 2019. A Neural-Network Based Code
Summarization Approach by Using Source Code and Its Call Dependencies. In Proceedings of the 11th Asia-Pacific
Symposium on Internetware (Fukuoka, Japan) (Internetware ’19). Association for Computing Machinery, New York,
NY, USA, Article 12, 10 pages. https://doi.org/10.1145/3361242.3362774
[96]
Chang Liu, Xinyun Chen, Eui Chul Shin, Mingcheng Chen, and Dawn Song. 2016. Latent attention for if-then program
synthesis. Advances in Neural Information Processing Systems 29 (2016).
[97]
Fang Liu, Ge Li, Bolin Wei, Xin Xia, Zhiyi Fu, and Zhi Jin. 2020. A self-attentional neural architecture for code
completion with multi-task learning. In Proceedings of the 28th International Conference on Program Comprehension.
37–47.
[98]
Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2021. Multi-Task Learning Based Pre-Trained Language Model for
Code Completion. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
(Virtual Event, Australia) (ASE ’20). Association for Computing Machinery, New York, NY, USA, 473–485. https:
//doi.org/10.1145/3324884.3416591
[99]
Xiao Liu, Xiaoting Li, Rupesh Prajapati, and Dinghao Wu. 2019. Deepfuzz: Automatic generation of syntax valid c
programs for fuzz testing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1044–1051.
[100]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettle-
moyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint
arXiv:1907.11692 (2019). https://doi.org/10.48550/arXiv.1907.11692
[101] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain,
Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and
generation. arXiv preprint arXiv:2102.04664 (2021).
[102]
Xudong Lu, Dongyu Zheng, and Lei Liu. 2017. Data Driven Analysis on the Effect of Online Judge System. In
2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications
(GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData). 573–577.
https://doi.org/10.1109/iThings-GreenCom-CPSCom-SmartData.2017.90
[103]
Rahman M. Mostazer, Yutaka Watanobe, and Keita Nakamura. 2020. A Neural Network Based Intelligent Support
Model for Program Code Completion. Scientic Programming 2020 (2020). https://doi.org/10.1155/2020/7426461
[104]
Rahman M. Mostazer, Yutaka Watanobe, and Keita Nakamura. 2021. A Bidirectional LSTM Language Model for
Code Evaluation and Repair. Symmetry 13, 2 (2021). https://doi.org/10.3390/sym13020247
[105]
Yuzhan Ma, Sarah Fakhoury, Michael Christensen, Venera Arnaoudova, Waleed Zogaan, and Mehdi Mirakhorli. 2018.
Automatic Classication of Software Artifacts in Open-Source Applications. In Proceedings of the 15th International
Conference on Mining Software Repositories (Gothenburg, Sweden) (MSR ’18). Association for Computing Machinery,
New York, NY, USA, 414–425. https://doi.org/10.1145/3196398.3196446
[106]
Taku Matsumoto, Yutaka Watanobe, and Keita Nakamura. 2021. A model with iterative trials for correcting logic
errors in source code. Applied Sciences 11, 11 (2021), 4755.
[107]
Ibéria Medeiros, Nuno F Neves, and Miguel Correia. 2013. Securing energy metering software with automatic source
code correction. In 2013 11th IEEE International Conference on Industrial Informatics (INDIN). IEEE, 701–706.
[108]
Dirk Merkel. 2014. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J.
2014, 239, Article 2 (mar 2014).
[109]
Pablo Meyer and Julio Saez-Rodriguez. 2021. Advances in systems biology modeling: 10 years of crowdsourcing
DREAM challenges. Cell Systems 12, 6 (2021), 636–653.
[110]
Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A Parallel Corpus of Python Functions and Documentation
Strings for Automated Code Documentation and Code Generation. In Proceedings of the Eighth International Joint
Conference on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing,
Taipei, Taiwan, 314–319. https://aclanthology.org/I17-2053
[111]
Mike Mirzayanov. 2020. Codeforces: Results of 2020. https://codeforces.com/blog/entry/89502. Accessed: 2023-05-23.
[112]
Golam Mostaeen, Banani Roy, Chanchal K Roy, Kevin Schneider, and Jeffrey Svajlenko. 2020. A machine learning
based framework for code clone validation. Journal of Systems and Software 169 (2020), 110686.
[113]
Golam Mostaeen, Jeffrey Svajlenko, Banani Roy, Chanchal K. Roy, and Kevin A. Schneider. 2018. On the Use
of Machine Learning Techniques Towards the Design of Cloud Based Automatic Code Clone Validation Tools.
In 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM). 155–164.
https://doi.org/10.1109/SCAM.2018.00025
[114]
Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for
Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix,
Arizona) (AAAI’16). AAAI Press, 1287–1293.
[115] Takahashi Naohiro. 2012. AtCoder Inc. Available: https://atcoder.jp/.
[116]
Ágnes Erdősné NÉMETH and László ZSAKÓ. 2015. Online training and contests for informatics contestants of
secondary school age. Edukacja-Technika-Informatyka 6, 1 (2015), 273–280.
[117]
Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N. Nguyen. 2015. Divide-and-Conquer Approach for Multi-phase
Statistical Migration for Source Code (T). In 2015 30th IEEE/ACM International Conference on Automated Software
Engineering (ASE). 585–596. https://doi.org/10.1109/ASE.2015.74
[118]
Ioannis Nikolaou. 2021. What is the Role of Technology in Recruitment and Selection? The Spanish journal of
psychology 24 (2021), e2.
[119] Reija Oksanen. 2018. New technology-based recruitment methods. Master’s thesis.
[120]
Safa Omri and Carsten Sinz. 2020. Deep Learning for Software Defect Prediction: A Survey. In Proceedings of the
IEEE/ACM 42nd International Conference on Software Engineering Workshops (Seoul, Republic of Korea) (ICSEW’20).
Association for Computing Machinery, New York, NY, USA, 209–214. https://doi.org/10.1145/3387940.3391463
[121]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow
instructions with human feedback. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal,
Danielle Belgrave, and Kyunghyun Cho (Eds.). https://openreview.net/forum?id=TG8KACxEON
[122]
Sebastiano Panichella, Jairo Aponte, Massimiliano Di Penta, Andrian Marcus, and Gerardo Canfora. 2012. Mining
source code descriptions from developer communications. In 2012 20th IEEE International Conference on Program
Comprehension (ICPC). IEEE, 63–72.
[123]
Henning Perl, Sergej Dechand, Matthew Smith, Daniel Arp, Fabian Yamaguchi, Konrad Rieck, Sascha Fahl, and
Yasemin Acar. 2015. VCCFinder: Finding potential vulnerabilities in open-source projects to assist code audits. In
Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 426–437.
[124]
Goran Piskachev, Lisa Nguyen Quang Do, and Eric Bodden. 2019. Codebase-adaptive detection of security-relevant
methods. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 181–191.
[125]
Wolfgang Pohl. 2006. Computer Science Contests for Secondary School Students: Approaches to Classification.
Informatics in Education 5, 1 (jan 2006), 125–132.
[126]
Serena Elisa Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, and Cédric Dangremont. 2019. A manually-curated
dataset of xes to vulnerabilities of open-source software. In 2019 IEEE/ACM 16th International Conference on Mining
Software Repositories (MSR). IEEE, 383–387.
[127]
Michael Pradel and Koushik Sen. 2018. Deepbugs: A learning approach to name-based bug detection. Proceedings of
the ACM on Programming Languages 2, OOPSLA (2018), 1–25.
[128]
Michael Pradel and Koushik Sen. 2018. DeepBugs: A Learning Approach to Name-Based Bug Detection. Proc. ACM
Program. Lang. 2, OOPSLA, Article 147 (oct 2018), 25 pages. https://doi.org/10.1145/3276517
[129]
Varot Premtoon, James Koppel, and Armando Solar-Lezama. 2020. Semantic Code Search via Equational Reasoning.
In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London,
UK) (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 1066–1082. https://doi.org/10.1145/
3385412.3386001
[130]
Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, and Regina Barzilay. 2016. sk_p: a neural program corrector
for MOOCs. In Companion Proceedings of the 2016 ACM SIGPLAN International Conference on Systems, Programming,
Languages and Applications: Software for Humanity. 39–40.
[131]
Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie
Chen, Mihir Choudhury, Lindsey Decker, et al. 2021. Project codenet: A large-scale ai for code dataset for learning a
diversity of coding tasks. arXiv preprint arXiv:2105.12655 1035 (2021).
[132]
Osama Al Qasem, Mohammed Akour, and Mamdouh Alenezi. 2020. The Influence of Deep Learning Algorithms
Factors in Software Fault Prediction. IEEE Access 8 (2020), 63945–63960. https://doi.org/10.1109/ACCESS.2020.2985290
[133]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by
generative pre-training. (2018).
[134]
Alec Radford, Je Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are
Unsupervised Multitask Learners. (2019).
[135]
Md Mostazer Rahman. 2022. Data Analysis and Code Assessment Using Machine Learning Techniques for Program-
ming Activities. https://doi.org/10.15016/00000215
[136]
Md Mostazer Rahman, Shunsuke Kawabayashi, and Yutaka Watanobe. 2021. Categorization of frequent errors in
solution codes created by novice programmers. In SHS Web of Conferences, Vol. 102. EDP Sciences, 04014.
[137] Md. Mostazer Rahman, Yutaka Watanobe, Rage Uday Kiran, Truong Cong Thang, and Incheon Paik. 2021. Impact
of Practical Skills on Academic Performance: A Data-Driven Analysis. IEEE Access 9 (2021), 139975–139993. https:
//doi.org/10.1109/ACCESS.2021.3119145
[138]
Md. Mostazer Rahman, Yutaka Watanobe, Taku Matsumoto, Rage Uday Kiran, and Keita Nakamura. 2022. Educational
Data Mining to Support Programming Learning Using Problem-Solving Data. IEEE Access 10 (2022), 26186–26202.
https://doi.org/10.1109/ACCESS.2022.3157288
[139]
Md. Mostazer Rahman, Yutaka Watanobe, and Keita Nakamura. 2020. Source Code Assessment and Classication
Based on Estimated Error Probability Using Attentive LSTM Language Model and Its Application in Programming
Education. Applied Sciences 10, 8 (2020). https://doi.org/10.3390/app10082973
[140]
Md Mostazer Rahman, Yutaka Watanobe, Uday Kiran Rage, and Keita Nakamura. 2021. A novel rule-based online
judge recommender system to promote computer programming education. In Advances and Trends in Artificial
Intelligence. From Theory to Practice: 34th International Conference on Industrial, Engineering and Other Applications of
Applied Intelligent Systems, IEA/AIE 2021, Kuala Lumpur, Malaysia, July 26–29, 2021, Proceedings, Part II 34. Springer,
15–27.
[141]
Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, and Premkumar Devanbu.
2016. On the "Naturalness" of Buggy Code. In Proceedings of the 38th International Conference on Software Engineering
(Austin, Texas) (ICSE ’16). Association for Computing Machinery, New York, NY, USA, 428–439. https://doi.org/10.
1145/2884781.2884848
[142]
Veselin Raychev, Pavol Bielik, and Martin Vechev. 2016. Probabilistic Model for Code with Decision Trees. SIGPLAN
Not. 51, 10 (oct 2016), 731–747. https://doi.org/10.1145/3022671.2984041
[143]
Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In
Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (Edinburgh,
United Kingdom) (PLDI ’14). Association for Computing Machinery, New York, NY, USA, 419–428. https://doi.org/10.
1145/2594291.2594321
[144]
Miguel A Revilla, Shahriar Manzoor, and Rujia Liu. 2008. Competitive learning in informatics: The UVa online judge
experience. Olympiads in Informatics 2, 10 (2008), 131–148.
[145]
Kelly Rivers and Kenneth R Koedinger. 2013. Automatic generation of programming feedback: A data-driven approach.
In The First Workshop on AI-supported Education for Computer Science (AIEDCS 2013), Vol. 50. 50–59.
[146]
Juan Carlos Rodríguez-del Pino, Enrique Rubio Royo, and Zenón Hernández Figueroa. 2012. A Virtual Programming
Lab for Moodle with automatic assessment and anti-plagiarism features. (2012).
[147]
Baptiste Roziere, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised Translation of
Programming Languages. In Proceedings of the 34th International Conference on Neural Information Processing Systems
(Vancouver, BC, Canada) (NIPS’20). Curran Associates Inc., Red Hook, NY, USA, Article 1730, 11 pages.
[148]
Rebecca Russell, Louis Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul Ellingwood, and Marc
McConley. 2018. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. In 2018
17th IEEE International Conference on Machine Learning and Applications (ICMLA). 757–762. https://doi.org/10.1109/
ICMLA.2018.00120
[149]
Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. 2018. Retrieval on Source
Code: A Neural Code Search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning
and Programming Languages (Philadelphia, PA, USA) (MAPL 2018). Association for Computing Machinery, New York,
NY, USA, 31–41. https://doi.org/10.1145/3211346.3211353
[150]
Julio Saez-Rodriguez, James C Costello, Stephen H Friend, Michael R Kellen, Lara Mangravite, Pablo Meyer, Thea
Norman, and Gustavo Stolovitzky. 2016. Crowdsourcing biomedical research: leveraging communities as innovation
engines. Nature Reviews Genetics 17, 8 (2016), 470–486.
[151]
Tara N. Sainath, Brian Kingsbury, George Saon, Hagen Soltau, Abdel-rahman Mohamed, George Dahl, and Bhuvana
Ramabhadran. 2015. Deep Convolutional Neural Networks for Large-scale Speech Tasks. Neural Networks 64 (2015),
39–48. https://doi.org/10.1016/j.neunet.2014.08.005 Special Issue on “Deep Learning of Representations”.
[152]
Tomohiro Saito and Yutaka Watanobe. 2020. Learning Path Recommendation System for Programming Education
Based on Neural Networks. Int. J. Distance Educ. Technol. 18, 1 (jan 2020), 36–64. https://doi.org/10.4018/IJDET.
2020010103
[153]
Ghassan Samara. 2017. A practical approach for detecting logical error in object oriented environment. arXiv preprint
arXiv:1712.04189 (2017).
[154]
Asaf Shabtai, Robert Moskovitch, Yuval Elovici, and Chanan Glezer. 2009. Detection of malicious code by applying
machine learning classiers on static features: A state-of-the-art survey. information security technical report 14, 1
(2009), 16–29.
[155]
Tushar Sharma, Vasiliki Efstathiou, Panos Louridas, and Diomidis Spinellis. 2021. Code smell detection by deep
direct-learning and transfer-learning. Journal of Systems and Software 176 (2021), 110936.
[156]
Tushar Sharma, Maria Kechagia, Stefanos Georgiou, Rohit Tiwari, Indira Vats, Hadi Moazen, and Federica Sarro. 2021.
A survey on machine learning techniques for source code analysis. arXiv preprint arXiv:2110.09610 (2021).
[157]
Tushar Sharma and Marouane Kessentini. 2021. QScored: A large dataset of code smells and quality metrics. In 2021
IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). IEEE, 590–594.
[158]
Zhidong Shen and Si Chen. 2020. A survey of automatic software vulnerability detection, program repair, and defect
prediction techniques. Security and Communication Networks 2020 (2020), 1–16.
[159]
Atsushi Shirafuji, Takumi Ito, Makoto Morishita, Yuki Nakamura, Yusuke Oda, Jun Suzuki, and Yutaka Watanobe.
2022. Prompt Sensitivity of Language Model for Solving Programming Problems. (2022), 346–359. https://doi.org/10.
3233/FAIA220264
[160]
Jianhang Shuai, Ling Xu, Chao Liu, Meng Yan, Xin Xia, and Yan Lei. 2020. Improving Code Search with Co-
Attentive Representation Learning. In Proceedings of the 28th International Conference on Program Comprehension
(Seoul, Republic of Korea) (ICPC ’20). Association for Computing Machinery, New York, NY, USA, 196–207. https:
//doi.org/10.1145/3387904.3389269
[161]
Rishabh Singh, Sumit Gulwani, and Armando Solar-Lezama. 2013. Automated feedback generation for introductory
programming assignments. In Proceedings of the 34th ACM SIGPLAN conference on Programming language design and
implementation. 15–26.
[162]
Steven S Skiena and Miguel A Revilla. 2003. Programming challenges: The programming contest training manual.
ACM SIGACT News 34, 3 (2003), 68–74.
[163]
Dowon Song, Myungho Lee, and Hakjoo Oh. 2019. Automatic and scalable detection of logical errors in functional
programming assignments. Proceedings of the ACM on Programming Languages 3, OOPSLA (2019), 1–30.
[164]
Jaime Spacco, Paul Denny, Brad Richards, David Babcock, David Hovemeyer, James Moscola, and Robert Duvall.
2015. Analyzing student work patterns using programming exercise data. In Proceedings of the 46th ACM Technical
Symposium on Computer Science Education. 18–23.
[165]
Yulei Sui and Jingling Xue. 2016. SVF: interprocedural static value-flow analysis in LLVM. In Proceedings of the 25th
international conference on compiler construction. 265–266.
[166]
Jeffrey Svajlenko, Judith F Islam, Iman Keivanloo, Chanchal K Roy, and Mohammad Mamun Mia. 2014. Towards a big
data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance
and Evolution. IEEE, 476–480.
[167]
Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal K. Roy, and Mohammad Mamun Mia. 2014. Towards a Big
Data Curated Benchmark of Inter-project Code Clones. In 2014 IEEE International Conference on Software Maintenance
and Evolution. 476–480. https://doi.org/10.1109/ICSME.2014.77
[168]
Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code Generation
Using Transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference
and Symposium on the Foundations of Software Engineering (Virtual Event, USA) (ESEC/FSE 2020). Association for
Computing Machinery, New York, NY, USA, 1433–1443. https://doi.org/10.1145/3368089.3417058
[169]
Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. 2019. Pythia: AI-Assisted Code Completion
System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
(Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2727–2735. https:
//doi.org/10.1145/3292500.3330699
[170]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). 1–9. https://doi.org/10.1109/CVPR.2015.7298594
[171]
Kenta Terada and Yutaka Watanobe. 2019. Automatic generation of fill-in-the-blank programming problems. In 2019
IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC). IEEE, 187–193.
[172]
Kenta Terada and Yutaka Watanobe. 2021. Code completion for programming education based on deep learning.
International Journal of Computational Intelligence Studies 10, 2-3 (2021), 78–98.
[173]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019.
An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. ACM Trans. Softw.
Eng. Methodol. 28, 4, Article 19 (sep 2019), 29 pages. https://doi.org/10.1145/3340544
[174]
Daniele Ucci, Leonardo Aniello, and Roberto Baldoni. 2019. Survey of machine learning techniques for malware
analysis. Computers & Security 81 (2019), 123–147.
[175]
Secil Ugurel, Robert Krovetz, and C. Lee Giles. 2002. What’s the Code? Automatic Classification of Source Code
Archives. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (Edmonton, Alberta, Canada) (KDD ’02). Association for Computing Machinery, New York, NY, USA, 632–638.
https://doi.org/10.1145/775047.775141
[176]
Farhan Ullah, Hamad Naeem, Sohail Jabbar, Shehzad Khalid, Muhammad Ahsan Latif, Fadi Al-turjman, and Leonardo
Mostarda. 2019. Cyber Security Threats Detection in Internet of Things Using Deep Learning Approach. IEEE Access
7 (2019), 124379–124389. https://doi.org/10.1109/ACCESS.2019.2937347
[177]
Farhan Ullah, Hamad Naeem, Sohail Jabbar, Shehzad Khalid, Muhammad Ahsan Latif, Fadi Al-turjman, and Leonardo
Mostarda. 2019. Cyber Security Threats Detection in Internet of Things Using Deep Learning Approach. IEEE Access
7 (2019), 124379–124389. https://doi.org/10.1109/ACCESS.2019.2937347
[178]
Jan N Van Rijn, Bernd Bischl, Luis Torgo, Bo Gao, Venkatesh Umaashankar, Simon Fischer, Patrick Winter, Bernd
Wiswedel, Michael R Berthold, and Joaquin Vanschoren. 2013. OpenML: A collaborative science platform. In Machine
Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic,
September 23-27, 2013, Proceedings, Part III 13. Springer, 645–649.
[179]
Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. 2019. Neural program repair by
jointly learning to localize and repair. arXiv preprint arXiv:1904.01720 (2019).
[180]
Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip S. Yu. 2020. Multi-Modal Attention
Network Learning for Semantic Source Code Retrieval. In Proceedings of the 34th IEEE/ACM International Conference
on Automated Software Engineering (San Diego, California) (ASE ’19). IEEE Press, 13–25. https://doi.org/10.1109/ASE.
2019.00012
[181]
Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving Automatic
Source Code Summarization via Deep Reinforcement Learning. In Proceedings of the 33rd ACM/IEEE International
Conference on Automated Software Engineering (Montpellier, France) (ASE ’18). Association for Computing Machinery,
New York, NY, USA, 397–407. https://doi.org/10.1145/3238147.3238206
[182]
Zhiyuan Wan, Xin Xia, David Lo, and Gail C. Murphy. 2021. How does Machine Learning Change Software
Development Practices? IEEE Transactions on Software Engineering 47, 9 (2021), 1857–1871. https://doi.org/10.1109/
TSE.2019.2937083
[183]
Song Wang, Devin Chollak, Dana Movshovitz-Attias, and Lin Tan. 2016. Bugram: Bug detection with n-gram language
models. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). 708–719.
[184]
Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically Learning Semantic Features for Defect Prediction. In
Proceedings of the 38th International Conference on Software Engineering (Austin, Texas) (ICSE ’16). Association for
Computing Machinery, New York, NY, USA, 297–308. https://doi.org/10.1145/2884781.2884804
[185]
Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and
flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and
Reengineering (SANER). IEEE, 261–271.
[186]
Wenhua Wang, Yuqun Zhang, Zhengran Zeng, and Guandong Xu. 2020. TranS^3: A Transformer-based Framework
for Unifying Code Summarization and Code Search. arXiv preprint arXiv:2003.03238 (2020).
[187]
Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, and Xin Jiang.
2021. SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation. arXiv preprint
arXiv:2108.04556 (2021). https://doi.org/10.48550/arXiv.2108.04556
[188]
Zhiruo Wang, Grace Cuenca, Shuyan Zhou, Frank F. Xu, and Graham Neubig. 2023. MCoNaLa: A Benchmark for
Code Generation from Multiple Natural Languages. arXiv:2203.08388 [cs.CL]
[189]
Szymon Wasik, Maciej Antczak, Jan Badura, Artur Laskowski, and Tomasz Sternal. 2018. A Survey on Online Judge
Systems and Their Applications. ACM Comput. Surv. 51, 1, Article 3 (jan 2018), 34 pages. https://doi.org/10.1145/
3143560
[190] Yutaka Watanobe. 2018. Aizu Online Judge. Available: https://onlinejudge.u-aizu.ac.jp/.
[191]
Y. Watanobe, Md. Mostafizer Rahman, R. Kabir, and Md. Faizul Ibne Amin. 2022. Identifying Algorithm in Program Code Based on Structural Features Using CNN Classification Model. Applied Intelligence (2022).
[192]
Yutaka Watanobe, Md. Mostafizer Rahman, Taku Matsumoto, Uday Kiran Rage, and Penugonda Ravikumar. 2022.
Online Judge System: Requirements, Architecture, and Experiences. International Journal of Software Engineering and
Knowledge Engineering 32, 06 (2022), 917–946. https://doi.org/10.1142/S0218194022500346
[193]
Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, and Zhi Jin. 2019. Code Generation as a Dual Task of Code Summarization. Curran
Associates Inc., Red Hook, NY, USA.
[194]
Li Wen-xin and Guo Wei. 2005. Peking University Online Judge and its applications. Journal of Changchun Post and Telecommunication Institute S2 (2005), 23.
[195]
Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments
for code clone detection. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE).
87–98.
[196]
Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep Learning Code Fragments
for Code Clone Detection (ASE ’16). Association for Computing Machinery, New York, NY, USA, 87–98. https:
//doi.org/10.1145/2970276.2970326
[197]
João Cristovão Afonso Sampaio Xavier et al. 2011. Computer-based assessment system for e-learning applied to programming education. (2011).
[198]
Wang Xiaomeng, Zhang Tao, Xin Wei, and Hou Changyu. 2018. A Survey on Source Code Review Using Machine
Learning. In 2018 3rd International Conference on Information Systems Engineering (ICISE). 56–60. https://doi.org/10.
1109/ICISE.2018.00018
[199]
Eran Yahav. 2018. From programs to interpretable deep models and back. In Computer Aided Verification: 30th
International Conference, CAV 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, July 14-17,
2018, Proceedings, Part I 30. Springer, 27–37.
[200]
Ziyu Yao, Daniel S Weld, Wei-Peng Chen, and Huan Sun. 2018. StaQC: A systematically mined question-code dataset from Stack Overflow. In Proceedings of the 2018 World Wide Web Conference. 1693–1703.
[201]
Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Nesime Tatbul, Jesmin Jahan Tithi, Paul Petersen, Timothy Mattson, Tim Kraska, Pradeep Dubey, et al. 2020. MISIM: An end-to-end neural code similarity system. arXiv preprint arXiv:2006.05265 (2020).
[202]
Chao Yi, Su Feng, and Zhi Gong. 2014. A Comparison of Sandbox Technologies Used in Online Judge Systems. In
Mechanical Design and Power Engineering (Applied Mechanics and Materials, Vol. 490). 1201–1204. https://doi.org/10.
4028/www.scientific.net/AMM.490-491.1201
[203]
Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to Mine Aligned
Code and Natural Language Pairs from Stack Overflow. In Proceedings of the 15th International Conference on Mining
Software Repositories (Gothenburg, Sweden) (MSR ’18). Association for Computing Machinery, New York, NY, USA,
476–486. https://doi.org/10.1145/3196398.3196408
[204]
Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to Mine Aligned
Code and Natural Language Pairs from Stack Overflow. In Proceedings of the 15th International Conference on Mining
Software Repositories (Gothenburg, Sweden) (MSR ’18). Association for Computing Machinery, New York, NY, USA,
476–486. https://doi.org/10.1145/3196398.3196408
[205]
Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv
preprint arXiv:1704.01696 (2017).
[206]
Yuto Yoshizawa and Yutaka Watanobe. 2019. Logic error detection system based on structure pattern and error degree.
Advances in Science, Technology and Engineering Systems Journal 4, 5 (2019), 1–15.
[207]
Ruru Yue, Zhe Gao, Na Meng, Yingfei Xiong, Xiaoyin Wang, and J David Morgenthaler. 2018. Automatic clone
recommendation for refactoring based on the present and the past. In 2018 IEEE International Conference on Software
Maintenance and Evolution (ICSME). IEEE, 115–126.
[208]
Du Zhang and Jeffrey JP Tsai. 2003. Machine learning and software engineering. Software Quality Journal 11 (2003),
87–119.
[209]
Fanlong Zhang and Siau-Cheng Khoo. 2021. An empirical study on clone consistency prediction based on machine
learning. Information and Software Technology 136 (2021), 106573.
[210]
Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2020. Retrieval-based neural source code
summarization. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1385–1397.
[211]
Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2022. Machine Learning Testing: Survey, Landscapes and Horizons.
IEEE Transactions on Software Engineering 48, 1 (2022), 1–36. https://doi.org/10.1109/TSE.2019.2962027
[212]
Gang Zhao and Jeff Huang. 2018. DeepSim: deep learning code functional similarity. In Proceedings of the 2018 26th
ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software
Engineering. 141–151.
[213]
Ninghan Zheng, Shuzhen Tian, and Yongqiang Chen. 2015. Online Learning Management System. In 2015 International Conference on Computational Science and Computational Intelligence (CSCI). 293–299. https://doi.org/10.1109/CSCI.2015.160
[214]
Sun Zhigang, Su Xiaohong, Zhu Ning, and Cheng Yanyu. 2001. Moodle plugins for highly efficient programming
courses. (2001).
[215]
Wenju Zhou, Yigong Pan, Yinghua Zhou, and Guangzhong Sun. 2018. The framework of a new online judge system
for programming education. In Proceedings of the ACM Turing Celebration Conference - China. 9–14.
[216]
Yajin Zhou and Xuxian Jiang. 2012. Dissecting android malware: Characterization and evolution. In 2012 IEEE
symposium on security and privacy. IEEE, 95–109.
[217]
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification
by Learning Comprehensive Program Semantics via Graph Neural Networks. Curran Associates Inc., Red Hook, NY,
USA.