Received 11 July 2022, accepted 30 July 2022, date of publication 4 August 2022, date of current version 10 August 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3196347
Code Generation Using Machine Learning:
A Systematic Review
ENRIQUE DEHAERNE 1,2, (Graduate Student Member, IEEE),
BAPPADITYA DEY 2, (Member, IEEE), SANDIP HALDER 2,
STEFAN DE GENDT 1,3, (Senior Member, IEEE),
AND WANNES MEERT 1, (Member, IEEE)
1Department of Computer Science, KU Leuven, 3001 Leuven, Belgium
2Interuniversity Microelectronics Centre (IMEC), 3001 Leuven, Belgium
3Department of Chemistry, KU Leuven, 3001 Leuven, Belgium
Corresponding author: Enrique Dehaerne (enrique.dehaerne@student.kuleuven.be)
ABSTRACT Recently, machine learning (ML) methods have been used to create powerful language models
for a broad range of natural language processing tasks. An important subset of this field is that of generating
code of programming languages for automatic software development. This review provides a broad and
detailed overview of studies for code generation using ML. We selected 37 publications indexed in arXiv
and IEEE Xplore databases that train ML models on programming language data to generate code. The three
paradigms of code generation we identified in these studies are description-to-code, code-to-description, and
code-to-code. The most popular applications that work in these paradigms were found to be code generation
from natural language descriptions, documentation generation, and automatic program repair, respectively.
The most frequently used ML models in these studies include recurrent neural networks, transformers, and
convolutional neural networks. Other neural network architectures, as well as non-neural techniques, were
also observed. In this review, we have summarized the applications, models, datasets, results, limitations,
and future work of 37 publications. Additionally, we include discussions on topics general to the literature
reviewed. This includes comparing different model types, comparing tokenizers, the volume and quality
of data used, and methods for evaluating synthesized code. Furthermore, we provide three suggestions for
future work for code generation using ML.
INDEX TERMS Automatic programming, computer languages, data collection, machine learning, natural language processing, neural networks, recurrent neural networks, software debugging, software maintenance, text mining.
I. INTRODUCTION
Software development is a complex and time-consuming process. It consists of two main phases: analysis and coding [1]. In the analysis phase, the requirements and architecture of the software system are formalized. In the coding phase, source code is written and tested to meet the requirements set in the first phase. Usually, maintenance of the system is included as an additional phase in the software development cycle where previous steps can be adapted to reflect changes in the needs of the system user. Figure 1 shows a flowchart for a simple software development model. In this review, we focus on the coding phase, which works directly with source code.
The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Olague.
Modern society relies on complex software applications.
These applications can consist of millions of lines written
in many programming languages (PLs) by many teams of
developers. Even small software projects will often leverage
large libraries that are expected to be easy to use and trusted to
be efficient and safe. PLs are difficult to read and understand
quickly so the developers must also document their programs
to make them more maintainable. Mistakes made during the
coding phase lead to software bugs that can cost time and
FIGURE 1. An example model for software development with three
phases, each consisting of multiple steps. Software development models
used in practice vary in the number of steps and their ordering compared
to the model depicted in this flowchart.
money for the software creators and users. In the worst-case
scenario, software bugs can jeopardize the safety of human
beings.
As a result, many software development tools and tech-
nologies have been created to help developers write better
software. A popular technology used by software devel-
opers is a "linter" which flags syntactic errors in code. Auto-formatters will add or remove whitespace and "newline" characters to code to improve readability. Statement
auto-complete tools can suggest tokens that programmers
might write next to improve their productivity. While these
traditional tools can be useful for programmers, most of
them can’t help a developer with complex tasks such as
writing understandable code documentation or implementing
algorithms.
More recently, machine learning (ML) has opened up the
possibility to automate difficult programming-related tasks.
In particular, advances in neural network (NN) architectures,
such as recurrent neural networks (RNNs) and transform-
ers [2], have been used to advance the state-of-the-art (SOTA)
for many difficult automated software engineering tasks.
These tasks include code generation from code documenta-
tion [3], [4], documentation generation from code [5], [6], and
cross-PL translation [7]. These technologies, among others,
have even led to commercial products such as Tabnine [8]
and Github [9]’s Copilot [10].
To provide a broad and detailed introduction to this field,
this systematic review summarizes and discusses publica-
tions that use ML to generate code. More specifically, pub-
lications retrieved from searches on arXiv [11] and IEEE
Xplore [12] databases that propose models that synthesize
code from non-code inputs (description-to-code), generate
code documentation given code inputs (code-to-description),
or modify existing code from one form to another (code-
to-code) were reviewed. We categorize each publication by
its ML model and the relevant sub-domain. The intention of
this review is to provide a broad but detailed overview of
ML techniques applied to the domain of automatic software
generation. A summary of each publication and general dis-
cussions are provided. Topics discussed in this review include
the application categories, popular ML models, tokenization
strategies, the quantity and quality of data, and metrics used to
FIGURE 2. Flow diagram which depicts the sequence of steps taken to
search and select studies to be reviewed.
evaluate synthesized code. Additionally, three directions for
future work are suggested before concluding this systematic
review.
II. METHODOLOGY
This review followed the Preferred Reporting Elements for
Systematic Reviews and Meta-analyses (PRISMA) guide-
lines [13] where appropriate for the scope of this review
(see Section III for more details). The arXiv [11] and IEEE
Xplore [12] databases were searched to identify potential
publications for review. Inclusion criteria refer to filters
applied to the search functions of each database. The search
applied two general inclusion criteria and one inclusion crite-
rion specific for each of the databases searched. Four exclu-
sion criteria were then applied to the studies retrieved from
this search to remove studies that are not appropriate for this
review. Figure 2 provides a graphical overview of the search
and selection methodology explained in more detail in the
remainder of this section.
The first inclusion criterion applied to both databases was
the search terms used. The search phrase ‘‘code generation
using machine learning’’ was applied to all fields (title,
abstract, full-text, etc.) using each database’s search engine to
identify possible publications for review. The second general
inclusion criterion was the publication date. To ensure variety in the ML models studied by the retrieved publications, the start of the first publication date range was chosen to be 2016, one year before the introduction of the transformer architecture [2]. This was done because transformers have become very popular recently and we wanted to make sure that studies using other ML models were also retrieved by our search. Therefore, the first searches were limited to studies published between 2016 and 2021. Additional searches were applied iteratively to earlier years, decrementing the earliest publication year searched by one each time while keeping all other inclusion criteria the same, until no new studies were selected for the review.
An additional filter was applied to each search using the filtering options provided by the respective database's search function. The search on the IEEE Xplore
database was limited to conference and journal publications.
For the search on the arXiv database, the search was limited
to publications on the subject of computer science (including
cross-listings).
Four exclusion criteria were defined to filter publications
returned by search queries using the inclusion criteria that
are not applicable to the review. These exclusion criteria
are as follows: (i) publications that do not propose new ML
techniques or models for code-to-description, description-to-
code, or code-to-code applications; (ii) survey, benchmark,
and vision papers; (iii) publications where an ML model
predicts numerical parameters for non-ML code generation
engines; and (iv) publications not written in English. These criteria were applied first to the titles and abstracts of the publications. Titles and abstracts that did not provide sufficient information to justify exclusion were not filtered at this stage. Next, duplicates among the remaining publications from both databases were removed. Finally, the full texts of the remaining publications were screened using the exclusion criteria. The publications not filtered in this final stage were added to the set of selected publications. All searches were performed in April 2022.
III. RESULTS
The initial query, using the publication date range 2016-2021, returned 613 publications from the arXiv [11] database and 274 from the IEEE Xplore [12] database.
After applying the inclusion and exclusion criteria to the titles
and abstracts of the 887 identified publications, 811 publica-
tions were excluded from consideration for the review. One
duplicate study was found among the remaining 76 studies
between the two databases and was removed. The final selec-
tion of publications was obtained by applying the inclusion
and exclusion criteria to the full text of the 75 remaining
publications. The final selection consisted of 37 publications,
28 indexed in the arXiv database [11] and 9 indexed in the
IEEE database [12].
An overview of these search results for every step in the
selection methodology is shown in Figure 3. Figure 3 also
shows the results of the second search which was limited
to studies published in 2015, one year earlier than the pub-
lication date range of the first search. Since no new stud-
ies were selected from this second search, no additional
searches were conducted. All searches were performed in
April 2022.
A direct comparison of the results of each publication is
not possible as they use different model types, are trained
and evaluated on different datasets, and use different meth-
ods for evaluation. Instead, a summary of each publication
is provided in Table 1. The table consists of the following
columns: (i) the application studied, (ii) the ML model used,
(iii) the datasets used by the study, (iv) the results of the
study and discussion, and (v) limitations and/or future work.
FIGURE 3. Flow diagram which shows the number of publications at each
step of the search and filtering process.
General aspects of the applications, ML model types, tokeniz-
ers, datasets, and evaluation methods of these publications are
discussed in more detail in the next section.
Common abbreviations used in Table 1 and tables in
Section IV are as follows (in alphabetical order): API (Appli-
cation Programming Interface), APR (Automatic Program
Repair), AST (Abstract Syntax Tree), CNN (Convolutional
Neural Network), DSL (Domain Specific Language), ENN
(Essence Neural Network), GRU (Gated Recurrent Unit),
GUI (Graphical User Interface), kNN (k Nearest Neigh-
bors), LSTM (Long Short-Term Memory network), MLP
(Multi-Layer Perceptron), NL (Natural language), NN (Neu-
ral Network), NMT (Neural Machine Translation), PBE
(Programming-By-Example), PL (Programming Language),
RF (Random Forests), RL (Reinforcement Learning), RNN
(Recurrent Neural Network), SM (Statistical Model), SOTA
(State-Of-The-Art).
IV. DISCUSSION
This review examines 37 studies that propose ML models
to generate code from non-code descriptions, generate code
documentation, or modify code. This section discusses com-
mon challenges these studies faced as well as key findings
from the selected studies as a whole. First, the different appli-
cation categories are introduced and explained with the help
of examples. Next, the popular model types are introduced
and compared. Subsequently, different tokenization strategies
for code generation are compared. The quantity and quality
of readily-available data for the different applications are
discussed in their respective sub-sections. A section where
the different metrics for measuring the quality of synthesized
source code are compared is also included. Finally, three
suggestions for future work are listed.
A. APPLICATION CATEGORIES
The studies reviewed in Table 1 can be categorized into
three paradigms: description-to-code, code-to-description,
and code-to-code. The studies can further be categorized
into a number of application categories as shown in Table 2.
TABLE 1. Summary of all selected publications. The rows are ordered chronologically by publication date.
TABLE 2. Application categorization of the selected publications. The application categories are grouped by three paradigms: description-to-code, code-to-description, and code-to-code. These paradigms describe the nature of the inputs and outputs of their applications. All studies within a row are ordered chronologically by publication date.
Figure 4 shows the percentage of selected studies belonging
to each paradigm. This section explains the three paradigms
and introduces the most popular application categories of the
selected publications.
FIGURE 4. Percentage of selected publications that study each of the
three application paradigms: description-to-code, code-to-description,
and code-to-code.
Description-to-code applications involve generating code
conditioned on model inputs that are not code. This was the
most popular paradigm, applicable to 46% of all selected
studies. The descriptions can come in various forms. The
most popular description type is natural language (NL) doc-
umentation. These descriptions are often obtained from code
comments written before a code snippet. An example NL
description with an associated code implementation is shown
in Figure 5.
Programming-by-example (PBE) is the second most popu-
lar application category for the description-to-code paradigm.
For PBE, the functionality of the desired program to be
FIGURE 5. An input-output pair example for code generation from NL.
The input consists of an NL description, depicted here as a code
comment, which describes the functionality of the desired code output.
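A minimal illustrative sketch of such an NL-to-code pair in Python [39] is given below; the description and function are hypothetical examples, not taken from any selected study or from Figure 5.

    # NL description (model input), written as a code comment:
    # "Return the largest of two numbers."

    # Code implementation (desired model output):
    def largest(a, b):
        return a if a >= b else b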
FIGURE 6. An example of a data-program pair for
programming-by-example (PBE) with a verification code snippet. PBE
generates a program that satisfies the functionality described by
input-output data. This data is depicted here as a list of tuples (top
rectangle) consisting of an input and then an output. Under this data is a
for loop which verifies that the function to be generated should return
the corresponding output for each input. The example program (bottom
rectangle) returns the desired output for each input in the list "inputs".
generated is described by pairs of program input and output
examples. Figure 6 shows an example of a list of possible
program inputs and a list of corresponding outputs as well as
a program that satisfies the given input-output pairs.
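A minimal Python [39] sketch of such a PBE instance is shown below; the input-output examples, candidate program, and verification loop are hypothetical and only illustrate the setup described in Figure 6.

    # Input-output examples that specify the desired program behavior.
    inputs = [1, 2, 3, 4]
    outputs = [2, 4, 6, 8]

    # A candidate program that satisfies every example.
    def program(x):
        return 2 * x

    # Verification loop: the generated program must map each input to its output.
    for inp, out in zip(inputs, outputs):
        assert program(inp) == out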
Another important description type is images. For [24]
and [29] the images are screenshots of graphical user inter-
faces (GUIs) and for [114] the images are sketches of
data visualizations. The desired output is a program that
can synthesize the given image. Figure 7 shows a possible
image-program pair for a simple GUI implemented in the
HTML language.
Code-to-description studies in this review all belong to a
single application category, documentation generation. With
25% of all selected studies being code-to-description studies,
the paradigm is the least popular of the three paradigms while
documentation generation is the single most popular applica-
tion category. Sometimes called source code summarization,
the objective of this task is to generate an NL description of
the code, usually in the form of a comment. An example of
documentation generation is shown in Figure 8, as well as Figure 7 if the input and output data were swapped.
Code-to-code applications generate code conditioned on
other code. The most popular application category of this
paradigm is automatic program repair (APR). The input for
APR is faulty, or buggy, code for which the model should
generate similar code that does not have the bug. Figure 9
shows a buggy line of Python [39] code that has a syntactic
error and the fixed line of code as the output.
FIGURE 7. An example of an image-code pair for code generation from
images. The image depicted here is a simple GUI with a black background
and two buttons. The desired code to be generated, shown in the bottom
rectangle, is HTML code that synthesizes this GUI.
FIGURE 8. An input-output pair example for documentation generation
from code. The input consists of code, in this case a Python [39] function.
The desired output is NL which describes the semantics of the given code.
The problem statement for documentation generation is the same as the
problem statement of code generation from NL with the inputs and
outputs reversed.
FIGURE 9. An example of an input-output pair for automatic program
repair (APR). The input example here contains a syntactic error that has
been recognized by the Pylance [116] linting tool. The output example is
the same code as the input with an additional closing bracket which fixes
the input’s error.
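A minimal sketch of such a repair pair in Python [39] is given below; the bug and fix are hypothetical and only analogous to the Figure 9 example.

    # Buggy input (syntactic error: the list bracket is never closed),
    # shown as a comment so this snippet itself remains runnable:
    #   print(sum([1, 2, 3)
    #
    # Repaired output with the missing closing bracket added:
    print(sum([1, 2, 3]))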
Cross-PL translation involves translating code written in
one programming language (PL) to code written in another
PL while preserving as many features of the original code
as possible. An example where C++ [40] code is translated
to Python that preserves the same functionality is shown
in Figure 10. Refactoring is similar to cross-PL translation
FIGURE 10. A cross-PL translation pair example. The first code snippet is
written in C++ [40]. A functionally equivalent Python [39] translation of
the code snippet is shown in the second code snippet.
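A minimal sketch of such a translation pair is shown below; the C++ [40] source is kept as a comment and the Python [39] translation as code, and both are hypothetical examples in the spirit of Figure 10.

    # C++ source (model input):
    #   int square(int x) { return x * x; }
    #
    # Functionally equivalent Python translation (desired model output):
    def square(x):
        return x * x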
TABLE 3. Overview of the types of ML models used by selected studies.
Types are partitioned into sub-types where appropriate. All studies within
a row are ordered chronologically by publication date. If a study proposes
multiple models or proposes a model that combines different
model-types, they are all listed in this table with the exception of MLPs
which are not listed if connected to another model type (e.g. RNN,
Transformer, CNN, etc).
in that features of the input code should be preserved but
different because the input and output code are written in the
same PL. Refactoring aims to transform the input code to a
form that is better understandable for humans. Reference [14]
does this by adding or removing whitespaces or new-line
characters to input code while [62] attempts to paraphrase
code statements so that the transformed code is more concise
than the input. The final application category, code com-
pletion, involves predicting subsequent code statements only
from prior code.
B. MACHINE LEARNING MODEL TYPES
The review shows that a wide variety of ML methods can be
used for different code generation tasks. Table 3 shows which of the selected studies use certain model types. Figure 11 shows the number of times an ML method is used for the
different application categories introduced in Section IV-A.
The popular ML methods used by the selected studies are
introduced and compared in this section.
Recurrent neural networks (RNN) are a class of NN
often used with sequential data such as NL. RNNs use
FIGURE 11. The number of selected studies that used a certain model
type. Each bar is categorized by application. The vertical axis gives the
percentage of the number of studies that use a certain model type out
of 37, the total number of selected studies.
previous outputs and states of the network as supplementary
information to the current input. Diagrams and equations of three types of RNNs (a basic RNN, an LSTM, and a GRU) are shown in Figures 12, 13, and 14, respectively. The hidden state, h_t, allows the model to use previous data in a sequence alongside the current input. Basic RNNs do not capture long-term dependencies well. LSTMs address this issue by passing along a cell state, c_t [117]. GRUs combine the hidden state and cell state into one state, simplifying the network. Attention mechanisms, first proposed in [118], are often used by RNNs to encode information in variable-length vectors, which reduces information loss for large inputs.
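For concreteness, a minimal NumPy sketch of a single LSTM step is given below. It follows the standard LSTM gate equations summarized in Figure 13 rather than any particular reviewed model; the parameter layout (separate weight matrices per gate) is an assumption made for readability.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_cell(x_t, h_prev, c_prev, W, U, b):
        # W, U, and b are 4-tuples holding the parameters of the input,
        # forget, and output gates and the candidate cell state.
        W_i, W_f, W_o, W_c = W
        U_i, U_f, U_o, U_c = U
        b_i, b_f, b_o, b_c = b
        i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)    # input gate
        f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)    # forget gate
        o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)    # output gate
        c_hat = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # candidate cell state
        c_t = f_t * c_prev + i_t * c_hat                 # new cell state
        h_t = o_t * np.tanh(c_t)                         # new hidden state
        return h_t, c_t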
FIGURE 12. A simple recurrent neural network (RNN) unrolled across time for two sequence inputs (t and t+1).
RNNs are the most popular NN type in the review, used in
43% of the selected studies, 80% of which use LSTMs specif-
ically. RNNs are used for every application category from
FIGURE 13. A long short-term memory (LSTM) network. Compared to simple RNNs such as the one shown in Figure 12, LSTMs add a cell state (c_t) which allows for better handling of long-term dependencies in sequential data.
FIGURE 14. A gated recurrent unit (GRU) network. GRUs aim to handle
long-term dependencies like LSTMs but combine the hidden state and cell
state into one state to simplify the network.
Section IV-A except for refactoring and cross-PL translation.
They are most used for documentation generation studies.
Five out of nine documentation generation studies use RNNs.
RNN decoders are used in combination with CNN
encoders in four studies [24], [66], [69], [114]. References [24], [114] use CNNs as encoders for image inputs
while [66], [69] use CNNs for encoding text.
FIGURE 15. Graphic depiction of multi-head attention, the main building
block of the transformer architecture [2]. Multi-head attention allows for
multiple levels of attention between tokens to be learned.
Transformers [2] rely solely on attention mechanisms
to capture dependencies between tokens in sequential data.
While complex RNNs incorporate attention mechanisms to
enhance dependency information, only using self-attention
allows for greater parallelization. Figure 15 shows multi-head
attention, the main building block of the original transformer
architecture [2].
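A minimal NumPy sketch of the scaled dot-product attention operation at the core of multi-head attention [2] is shown below (a single head, with no masking or learned projections).

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, and V have shape (sequence_length, d_k).
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # pairwise token affinities
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
        return weights @ V                               # weighted sum of the values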
The transformer architecture is studied in 12 out of the
37 selected publications, as shown in Figure 16, making it the
second most popular model type overall. Figure 16 also shows
that transformers are the most popular model type if only
the last year is considered. Similar to RNNs, transformers
are used for a wide variety of tasks. Transformers are used
FIGURE 16. The number of selected studies that use a certain model type
by year of publication. Each bar is categorized by application.
for five out of seven code generation from NL description
studies, more than any other model type. The only tasks that
do not have a selected study that uses a transformer are code
generation from other structured data and refactoring.
As RNNs and transformers work well with sequential data,
they are often compared in experiments. Of the 12 studies that
use transformers in their proposed models, 9 of them [47],
[76], [79], [83], [85], [89], [91], [95], [102] compare their
results with RNN baselines. The transformer model used
in [47] underperformed a comparative RNN model studied.
The other eight publications propose transformer-based mod-
els that outperformed RNN-based baselines of each respec-
tive study. Reference [111] proposed a method with an LSTM
component that outperformed transformer baselines but it
was suggested that replacing the LSTM component with a
transformer could improve the proposed method.
Convolutional neural networks (CNN) use convolution
layers that sweep an input using a feature filter to aggregate
information about the input. Convolutional operators work
well on grid-like data such as images. Pooling is frequently
used in CNNs to reduce the number of parameters of the
model. Figure 17 shows an example of the convolution oper-
ation as well as the pooling operation.
FIGURE 17. Convolution and pooling operation examples, equations, and
dimensions. Note that no zero-padding is used in this example and
simple parameters are used (e.g., strides of size 1). For more detailed
information on these, we refer to [119].
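A minimal NumPy sketch of the two operations illustrated in Figure 17 is given below, using the same simple parameters (no zero-padding, stride 1 for the convolution).

    import numpy as np

    def conv2d(image, kernel):
        # Valid 2-D convolution (cross-correlation) with stride 1 and no padding.
        h, w = image.shape
        kh, kw = kernel.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def max_pool2d(feature_map, size=2):
        # Non-overlapping max pooling that reduces each spatial dimension by `size`.
        h, w = feature_map.shape
        out = np.zeros((h // size, w // size))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = feature_map[i * size:(i + 1) * size,
                                        j * size:(j + 1) * size].max()
        return out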
All three publications on code generation from images
used CNNs. Reference [29] used a CNN to classify objects
in GUI screenshots. References [24], [114] used CNNs to
extract features from images in combination with RNN or
transformer decoders. CNNs are also used as encoders in
other encoder-decoder models to create embeddings of code
inputs [38], [66], [89]. References [38], [66] use CNNs in
RNN encoder-decoder architectures while [89] use CNNs to
augment a transformer architecture. Reference [69] uses a
convolutional graph NN on AST inputs due to their ability
to encode spatial information well.
ML augmented search for PBE is a method of generating
programs by building an AST node-by-node where the next
node is chosen by an ML model until the program satisfies
all input-output examples. Explicit functional specifications
such as input-output examples are important so that the model
knows when to stop searching. The advantage of this method,
compared to other code-generation techniques, is that a gen-
erated program is guaranteed to compile and behave as spec-
ified by the input-output examples given. Three of the five
PBE studies [59], [63], [73] use ML augmented search. These
studies use DSLs specifically designed to reduce the search
space of all possible programs. General PLs like Java [16] or
Python [39] have large program spaces that make these search
techniques infeasible. For general PLs, RNNs or transformers
are commonly utilized as they can more efficiently build
sequences one token at a time by using decoding strategies
such as beam search [120]. These decoding strategies usually
do not provide any syntactic or functional guarantees for the
generated program in contrast to PBE with AST search.
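The drastically simplified Python [39] sketch below illustrates the general idea over a toy DSL of string transformations: candidate programs are ranked by a scoring model (stubbed here as a constant, whereas in the reviewed studies a trained ML model guides node-by-node AST expansion) and the search stops once a candidate satisfies all input-output examples.

    def satisfies(program, examples):
        # A program is accepted only if it reproduces every I/O example.
        return all(program(inp) == out for inp, out in examples)

    def ml_guided_search(candidates, score, examples):
        # Try candidate programs in order of decreasing model score.
        for program in sorted(candidates, key=score, reverse=True):
            if satisfies(program, examples):
                return program   # guaranteed to behave as specified by the examples
        return None

    examples = [("hello", "HELLO"), ("abc", "ABC")]
    toy_dsl = [str.upper, str.lower, str.title]              # toy DSL primitives
    result = ml_guided_search(toy_dsl, score=lambda p: 0.0, examples=examples)
    print(result("hello"))                                   # HELLO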
C. TOKENIZERS AND THE OUT-OF-VOCABULARY
PROBLEM
Tokenization is a preprocessing step where an input string
is partitioned into chunks. These chunks, or "tokens", are
mapped to numbers that ML models can process. The outputs
of the models can be mapped back to tokens which form a
part of the model output. Models with tokenizers recognize
a finite set of tokens which are called the vocabulary of the
model. Whenever a chunk of the input string does not have a
matching token in the vocabulary, a special <unknown> token
must be used. This results in a loss of information which is
referred to as the out-of-vocabulary (OOV) problem.
Table 4 categorizes the selected studies of the review into three main tokenizer types: word-based, character-based, and subword-based. Word-based tokenizers split words on whitespace characters, with special rules for punctuation. In the context of code, parsers are frequently used to handle "code punctuation" such as brackets. Word tokens capture a complete unit of meaning from the input string but require a large vocabulary. Character-based tokenizers split the input string on every character. This simplifies the vocabulary, but each token usually holds little meaningful information. Subword-based tokenization provides a compromise between character- and word-based tokenization. The vocabulary consists of all base characters as well as frequently occurring sequences of characters. Subword tokenizers are the most popular tokenizer type, as shown in Figure 18. Figure 19 shows examples of each tokenizer type.
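The toy Python [39] sketch below illustrates the compromise a subword tokenizer makes: a greedy longest-match tokenizer over a small, hand-picked vocabulary. Real subword vocabularies are learned from a training corpus (e.g., with BPE), so this is only an illustration.

    def subword_tokenize(text, vocab, unk="<unk>"):
        # Greedy longest-match subword tokenization over a fixed vocabulary.
        tokens, i = [], 0
        while i < len(text):
            # Take the longest vocabulary entry that matches at position i.
            match = next((text[i:j] for j in range(len(text), i, -1)
                          if text[i:j] in vocab), None)
            if match is None:
                tokens.append(unk)     # out-of-vocabulary character
                i += 1
            else:
                tokens.append(match)
                i += len(match)
        return tokens

    vocab = {"hello", "world", "_", "h", "e", "l", "o", "w", "r", "d"}
    print(subword_tokenize("hello_world", vocab))   # ['hello', '_', 'world']
    print(list("hello_world"))                      # character-based equivalent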
TABLE 4. Overview of tokenizer types used by the selected studies. For non-string data, tokenizers are not used.
FIGURE 18. Percentages of selected studies that used different tokenizer types. Non-string data refers to studies that did not use tokenizers to preprocess data for ML models.
A problem that tokenizing source code faces is that the number of unique "words" in code is generally much larger than in NL. This is mostly due to identifiers for functions or variables, which are multiple words concatenated together using some naming or casing convention (e.g., a function that prints "hello world" to the console can be named "helloWorld" or "hello_world"). Therefore, tokenizers should be designed with the OOV problem in mind. Using character-level tokenization is a possible solution but has limits for statements with more than 15 characters [121]. Only one of the selected studies used character-based tokenization. Word-based tokenization split on whitespace characters is popular for processing NL but is susceptible to the OOV problem. References [38], [66], [83], [97] used subword-based tokenization by separating words on capital letters, underscores, and numbers as well. The byte-pair-encoding (BPE) algorithm [122], which builds a vocabulary of frequently occurring subwords in a training corpus, was used by [4], [47], [79], [85], [95], [102].
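A sketch of the identifier-splitting heuristic mentioned above is shown below: subwords are separated on underscores, digits, and capital letters. The regular expressions are an assumption for illustration; the cited studies may differ in the exact rules.

    import re

    def split_identifier(name):
        # Split a code identifier into lower-cased subwords.
        parts = re.split(r"[_\d]+", name)                  # underscores and digits
        subwords = []
        for part in parts:
            # Split camelCase / PascalCase humps.
            subwords.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part))
        return [s.lower() for s in subwords if s]

    print(split_identifier("helloWorld"))     # ['hello', 'world']
    print(split_identifier("hello_world"))    # ['hello', 'world']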
Custom tokenization processes can be used to keep vocab-
ulary sizes small while encapsulating useful information in
each token. Token copying is one such process observed
in the selected publications. Reference [76] used positioned
<unknown> tokens to copy tokens not in the model vocabu-
lary from the input to the output string. Similarly, [75] keeps
out of vocabulary tokens and a position encoding in a lookup
table to replace <unknown> tokens during the decoding of
the output. Reference [50] used copying mechanisms for AST
tokens based on probabilities from the training data.
The second type of tokenization enhancement process
observed in the selected literature is token abstraction by
using multiple <unknown> tokens that distinguish different
types of tokens. Reference [21] noticed that <unknown>
tokens are terminal nodes in the AST and uses the node
type as a replacement token rather than a generic <unknown>
token. Similarly, [34], [67] abstracted OOV identifiers for
methods and variables to separate <unknown> tokens.
Reference [31] used an identifier abstraction mechanism
where all instances of an identifier were tokenized to a num-
bered identifier token.
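The sketch below combines the two ideas in a simplified form: OOV tokens are abstracted to numbered placeholder tokens and recorded in a lookup table, which can later be used to copy the original tokens back into the decoded output. This is illustrative only and not the exact mechanism of any single study.

    def abstract_oov(tokens, vocab):
        # Replace OOV tokens with numbered placeholders and remember them.
        table, abstracted = {}, []
        for tok in tokens:
            if tok in vocab:
                abstracted.append(tok)
            else:
                placeholder = table.setdefault(tok, f"<unk_{len(table)}>")
                abstracted.append(placeholder)
        return abstracted, table

    def restore_oov(tokens, table):
        # Copy the original OOV tokens back in place of their placeholders.
        inverse = {v: k for k, v in table.items()}
        return [inverse.get(tok, tok) for tok in tokens]

    abstracted, table = abstract_oov(["print", "(", "my_counter", ")"],
                                     vocab={"print", "(", ")"})
    print(abstracted)                       # ['print', '(', '<unk_0>', ')']
    print(restore_oov(abstracted, table))   # ['print', '(', 'my_counter', ')']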
D. VOLUME OF AVAILABLE DATA
The volume of data needed to train and evaluate code gen-
eration models is critical to the performance of the model.
The studies in this review use open-source repository data,
manually created data, and/or automatically generated data.
Table 5 shows which data source types were used by each
selected study. Percentages of the number of studies that use
data from different combinations of these sources are shown
in Figure 20.
Data from open-source repositories are used by 62% of
selected publications. Open-source repositories provide large volumes of varied data in many PLs. Reference [4] shows
an example of this in their proposed model which was trained
on millions of lines of code to be able to generate a wide
variety of multi-line Python [39] functions. These lines of
code were collected from Github [9]. Eighteen of the selected
publications [4], [14], [21], [31], [34], [38], [43], [46], [47],
[66], [67], [71], [79], [85], [91], [95], [97], [102] used data
sourced from git repositories. This data often includes source
code, documentation, as well as repository change infor-
mation. This last type of data is especially useful for APR
studies as these changes are occasionally pre- and post-bug-
fix pairs. All APR studies used open-source repository data.
Reference [76] used data sourced from StackOverflow [123],
a forum where users ask and answer programming-related
questions, which is especially useful for obtaining NL-code
pairs.
A lack of available data led five studies [26], [44], [53],
[59], [114] to manually create data. For example, [75]
manually wrote translations in a target programming lan-
guage given code from a source programming language.
Manually created datasets made publicly available from other
studies were used by six [29], [50], [62], [63], [76] of the
selected studies. Data generation performed by humans is
time-intensive which is why it is avoided where possible.
FIGURE 19. Example word-, character-, and subword-based tokenizers with limited vocabularies. Each tokenizer processes the input string
differently. The output elements are called tokens. These tokens can then be translated to numbers using a lookup table to create a valid input for
a neural network. The <unknown> token is denoted as <unk>.
TABLE 5. Overview of data source types of the datasets used by the selected studies. Manually created datasets have humans-in-the-loop during data
generation while automatic generation implies data created in a programmatic manner.
FIGURE 20. Percentages of data-source types used by the selected
studies.
If there is not enough data readily available, automatic
data generation methods can also be considered. Machine-
generated data is used in all of the works reviewed in the
domain of PBE since any randomly generated input-output
pairs for numeric calculations or string transformations are
usually acceptable to derive a program from. References [26],
[114] generate data automatically in addition to creating data
manually to achieve a balance between quality and quantity
of data.
The need for a partition of the dataset for evaluation
purposes reduces the amount of data that can be used for
training. This problem is usually resolved by using cross-
validation. Cross-validation involves training many models
and is often computationally expensive. This is especially the
case for large language models. More discussion on this topic
is provided in Section IV-G.
E. QUALITY OF AVAILABLE DATA
As mentioned in the previous section, researchers leverage
automatic mining from open-source repositories to obtain
large volumes of data. Even after preprocessing and filtering,
the quality of the automatically collected data can be unreli-
able. Automatically mined source code often has dependen-
cies that can be difficult to obtain automatically, making the
source code non-executable. Reference [91] found this to be
problematic for obtaining input-output pairs in the domain
of PBE. Executable source code is also important for func-
tional evaluation of other applications. References [4], [114]
manually curated test datasets to ensure better quality testing
data. An alternative to evaluation by comparing synthesized
code with a reference code snippet is human evaluation. This is done
TABLE 6. Overview of evaluation methods used by the selected studies. Token match metrics used in the review which are popular for NLP applications
include BLEU [22], CIDEr [35], ROUGE [37], and METEOR [36]. Examples of other token match metrics are "exact match" and "token accuracy". Dynamic
analysis analyzes runtime behavior of code while static analysis does not require code to be executed.
by [4], [43], [44], [95]. Manually generating or evaluating
data is time-intensive which is why automatic data mining
or generation and automatic evaluation are more common.
F. EVALUATING GENERATED CODE
Evaluating the quality of synthesized code is done either by
comparing it to a "ground-truth" code statement, analyzing it statically, or analyzing it at runtime. Table 6 shows which
evaluation methods each of the selected studies used.
The selected studies that evaluate synthesized code by
comparing it to ground-truth code statements do so at the
token level. Token comparisons are performed either by algorithms from the NLP literature, such as BLEU [22], CIDEr [35], ROUGE [37], and METEOR [36], or by other metrics such as "exact match" and "token accuracy". Figure 21 shows that
token match is used by 76% of the selected studies. Token
match evaluation is popular because the same data used for
training can be used for evaluation, it is automatic, and does
not require the code to be executable. Using token match
metrics from NLP for documentation generation is not a
problem as these metrics have been shown to correlate with human
judgments. However, reference [4] shows that this is not the
case for synthesized code. Reference [124] argues that BLEU
and exact match do not properly capture code semantics and
instead propose a code-specific metric, CodeBLEU. Code-
BLEU uses a weighted sum of BLEU, BLEU weighted on
code keywords, syntactic similarity by comparing ASTs, and
data-flow similarity [124]. CodeBLEU was not used by any
of the works reviewed. References [18], [111] used custom,
code-specific token match metrics to measure program equiv-
alence to better measure code semantics.
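The sketch below shows two of the simpler token match metrics named above, exact match and token accuracy, computed over pre-tokenized sequences. These are common definitions; individual studies may define them slightly differently.

    def exact_match(pred_tokens, ref_tokens):
        # 1.0 if the predicted token sequence equals the reference, else 0.0.
        return float(pred_tokens == ref_tokens)

    def token_accuracy(pred_tokens, ref_tokens):
        # Fraction of reference positions whose predicted token matches.
        correct = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
        return correct / max(len(ref_tokens), 1)

    pred = ["return", "a", "+", "b"]
    ref = ["return", "a", "*", "b"]
    print(exact_match(pred, ref))      # 0.0
    print(token_accuracy(pred, ref))   # 0.75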
FIGURE 21. The number of selected studies that used certain evaluation
methods. The vertical axis gives the percentage of the number of studies
that used a certain evaluation method out of 37, the total number of
selected studies.
Dynamic analysis involves evaluating the functional cor-
rectness and/or the time-to-completion of executable code
at runtime. Functional correctness requires certain types of
data such as input-output examples (e.g. PBE), unit tests [4],
or formal specifications [53]. This is a code-appropriate
metric and allows for different code implementations that
are functionally equivalent to obtain the same score. This
is unlike token match evaluation which will give a better
score to an implementation that is most similar to reference
code. Furthermore, reference [4] argues that functional cor-
rectness correlates well with what humans would consider to
be quality code. Dynamic analysis often requires the code to
be syntactically correct which is not always guaranteed by
ML-based code generation methods as discussed at the end
of Section IV-B.
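A minimal sketch of functional-correctness checking is shown below: a generated candidate is executed against reference test cases, and any compilation or runtime error counts as a failure. The pass@k evaluation of [4] is more elaborate; the function and test names here are hypothetical.

    def functionally_correct(candidate_source, test_cases, func_name="solution"):
        # Return True if the generated code defines `func_name` and passes all tests.
        namespace = {}
        try:
            exec(candidate_source, namespace)        # the candidate must at least run
            func = namespace[func_name]
            return all(func(*args) == expected for args, expected in test_cases)
        except Exception:
            return False                             # a syntax or runtime error fails

    candidate = "def solution(a, b):\n    return a + b\n"
    tests = [((1, 2), 3), ((0, 0), 0)]
    print(functionally_correct(candidate, tests))    # True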
Static analysis is more accessible than dynamic analy-
sis since the code does not need to be executable and no
ground-truth references are needed. However, only using
syntactic correctness as a metric to validate models can
lead to degeneration where synthesized code does not
exhibit any desirable functional or semantic properties [111].
Human evaluation is a holistic evaluation method but is
time-consuming and requires programmers with knowledge
of the PL in which the code is written.
G. FUTURE WORK
In this section, we list three suggestions for future work that
would contribute to the field of code generation using ML.
Improving language model efficiency is our first sug-
gestion for future work. Models such as transformers [2] are
good for general code generation tasks but are extremely data-
hungry [4], [95]. Training and evaluating these models are
therefore computationally expensive. Similarly, improving
the energy efficiency of language models would also lower
the barrier of entry for research. High energy consumption
leads to high monetary costs. References [43], [71], [75], [83]
mention computation costs as restrictive to their research.
Ensemble learning is our second suggestion for future
work. Some models excel in specific contexts while perform-
ing poorly in general. References [69], [71] are examples of
this as they found that different models performed better on
certain bug types in the context of APR. Reference [75] gives
an example of how training and evaluating a model for three
specific types of cross-PL translations required half as much
data to achieve similar performance compared to a model
trained and evaluated on four types of translations. Studying
ensembles that combine the strengths of different models to
improve performance over a variety of cases is a promising
direction for future work.
New ways of using Abstract Syntax Tree (AST) representations of source code are our final suggestion for future
work. Multiple studies discussed in this review, such as [21],
[95], use AST representations of code for their models. Fur-
ther exploitation of this data structure, which is characteristic
of PLs, is recommended for future research. Generalized
ASTs over multiple programming languages could lead to
greater transfer learning capabilities and models that gener-
alize to multiple languages. Code-specific decoding methods
for sequential output models remain unexplored to the best
of our knowledge. A decoding method that exploits AST or
other syntax information could lead to more efficient and
syntactically correct synthesized code.
V. THREATS TO VALIDITY
This section first discusses threats to the validity of the search
criteria used by this review. Afterward, the threats to the
validity of the models proposed by the selected studies are
discussed.
Only one search phrase was used across the two databases due to the large number of publications returned. While we believe the
search phrase accurately and precisely defines the types of
works we aimed to survey, we recognize that it is sensitive
to variations in terminology and/or missing keywords. For
example, [6] presented an influential code language model,
CodeBERT, but never mentions "machine learning" even
though ML techniques were used by the study. This led
to it not being returned by our search. Other examples of
influential models that were not retrieved by our searches
are PyMT5 [3], Code2Seq [5], and TransCoder [7]. This
survey should first and foremost be used as an introduction
to the various applications, ML models, tokenizers, data,
and evaluation methods used in the various sub-domains of
code generation. We encourage readers who want to read
more about a topic covered in this review to perform cita-
tion searches on the selected publications, or the influential
publications mentioned above, to find additional relevant
literature.
To investigate whether our chosen search phrase retrieves a
disproportionate amount of description-to-code publications,
a small experiment was conducted. Table 7 compares retrieval
statistics for the original query for publication dates between
2016 and 2021 with a similar query which also includes
two similar search phrases. These additional search phrases,
‘‘code modification using machine learning’ and ‘‘code sum-
marization using machine learning’’, are more specific for
the code-to-code and code-to-description paradigms, respec-
tively. This extended search query retrieved roughly 8% more
publications than the query with only the original search
query. Extrapolated to the number of selected studies, this
increase leads to 40 selected studies. We consider this to be
a relatively small number of additional studies and therefore
conclude that our search phrase adequately covers all three
paradigms discussed in this review.
As mentioned in Section IV-F, many of the metrics used
to evaluate generated code are not code-specific and rely on
comparisons with a ground truth code statement. Comparing
generated code to a ground-truth program is limiting as the
space of valid output programs for a given input is generally
large. Furthermore, many token match metrics such as BLEU
are sensitive to the tokenizer used [125]. This means that
results from different studies using these metrics should only
be compared if a similar tokenizer is used to tokenize the
output before calculating the token match metric. Functional
correctness is the best metric to avoid this problem but
requires extra data in the form of test programs.
TABLE 7. Statistics for database publication searches using different search phrases and a publication date range between 2016 and 2021. The search phrase "code generation using machine learning" is the one applied to identify publications for this review. Adding two additional search phrases, "code modification using machine learning" and "code summarization using machine learning", using the OR logical operator retrieves only slightly more publications.
VI. CONCLUSION
This systematic review selected 37 works published in the last
six years in the arXiv [11] and IEEE Xplore [12] databases
that proposed ML models for code generation, documentation
generation, and code modification. Each publication’s appli-
cation, model, datasets, results, limitations, and proposed
future work were summarized. Then, the general findings of
these 37 studies were discussed.
The discussion started by introducing the various appli-
cation categories of the selected studies. The most popular
applications of the selected publications include code gener-
ation from NL descriptions, documentation generation, APR,
and PBE. The popular model types used by these studies
such as RNNs, transformers [2], CNNs, and ML augmented
search, were introduced and compared in the context of
different applications. RNNs and transformer models were
used mostly for code generation given natural language (NL)
descriptions as well as documentation generation. In general,
transformer models outperformed RNN models when the
two were compared by an evaluative study. CNNs are used
for image data but also to augment other models such as
transformers.
Different tokenization strategies used by these publications
are listed. How some of these strategies handle the out-of-
vocabulary problem is also discussed. Effective tokenization
processes used by the reviewed publications are subword-
tokenization, copy-mechanisms, and/or multiple <unknown>
tokens to capture a particular subset of possible tokens such
as variable identifiers.
Limitations in the quantity and quality of data for code
generation models were highlighted in respective sections
of the discussion. Language models require large datasets,
especially when the model is expected to be able to generate
code in many different types of contexts. Automatic mining
of online sources such as Github [9] is often needed to
obtain large enough volumes of data. The quality of auto-
matically mined source code varies greatly. This led two
studies [4], [114] to manually create their own test datasets.
Automatically generating data is a fast alternative for
obtaining data but is only appropriate for certain contexts
such as PBE.
The question of how to measure the quality of code is
also discussed in a corresponding section. Automatic eval-
uation by comparing tokens of generated code to tokens
from ground-truth references was conducted by 76% of the
selected studies. However, most of these token match algo-
rithms are not appropriate for evaluating code. Functional
correctness is a viable alternative but requires certain types of
data and requires the generated code to be executable. Static
analysis does not require the code to be executable but can lead to degeneration when only syntax is considered; human evaluation is holistic but time-consuming.
Finally, three promising directions for future work were
suggested: (i) improving the efficiency of language models,
(ii) ensemble learning for specialized models, and (iii) more
research on the possibilities for exploiting abstract syntax tree
representations of source code.
REFERENCES
[1] W. W. Royce, ‘‘Managing the development of large software systems:
Concepts and techniques,’ in Proc. 9th Int. Conf. Softw. Eng., 1987,
pp. 328–338.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘Attention is all you
need,’ 2017, arXiv:1706.03762.
[3] C. B. Clement, D. Drain, J. Timcheck, A. Svyatkovskiy, and
N. Sundaresan, ‘‘PyMT5: Multi-mode translation of natural language
and Python code with transformers,’ 2020, arXiv:2010.03150.
[4] M. Chen et al., ‘‘Evaluating large language models trained on code,’’
2021.
[5] U. Alon, S. Brody, O. Levy, and E. Yahav, ‘Code2seq: Gen-
erating sequences from structured representations of code,’ 2018,
arXiv:1808.01400.
[6] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin,
T. Liu, D. Jiang, and M. Zhou, ‘CodeBERT: A pre-trained model for
programming and natural languages,’ 2020, arXiv:2002.08155.
[7] M.-A. Lachaux, B. Roziere, L. Chanussot, and G. Lample, ‘‘Unsuper-
vised translation of programming languages,’ 2020, arXiv:2006.03511.
[8] Tabnine. Code FasterWith AI Code Completions. Accessed: Mar. 6, 2022.
[Online]. Available: https://www.tabnine.com/
[9] Github. Github: Where the World Builds Software. Accessed:
Mar. 6, 2022. [Online]. Available: https://github.com/
[10] Github. Github Copilot—Your AI Pair Programmer. Accessed:
Mar. 6, 2022. [Online]. Available: https://copilot.github.com/
[11] Cornell University. E-Print Scholarly Articles Archive. Accessed:
Mar. 6, 2022. [Online]. Available: https://arxiv.org/
[12] IEEE. IEEE Xplore Digital Library. Accessed: Mar. 6, 2022.
[Online]. Available: https://www.ieee.org/publications/subscriptio
ns/products/mdl/ieeexplore-access.html
[13] (2011). Finding What Works in Health Care: Standards for Systematic
Reviews, J. Eden, L. Levit, A. Berg, and S. Morton, Eds. Washing-
ton, DC, USA: The National Academies Press. Institute of Medicine.
[Online]. Available: https://www.nap.edu/catalog/13059/finding-what-
works-in-health-care-standards-for-systematic-reviews
[14] T. Parr and J. Vinju, ‘Technical report: Towards a universal code format-
ter through machine learning,’ 2016, arXiv:1606.08866. [Online]. Avail-
able: https://arxiv.org/abs/1606.08866, doi: 10.48550/arxiv.1606.08866.
[15] T. Parr. ANTLR (Another Tool for Language Recognition). Accessed:
Mar. 6, 2022. [Online]. Available: https://www.antlr.org/
[16] Oracle. Java. Accessed: Mar. 6, 2022. [Online]. Available:
https://www.java.com/en/
[17] Quorum. The Quorum Programming Language. Accessed: Mar. 6, 2022.
[Online]. Available: https://quorumlanguage.com/
[18] V. Murali, L. Qi, S. Chaudhuri, and C. Jermaine, ‘Neural
sketch learning for conditional program generation,’ 2017,
arXiv:1703.05698. [Online]. Available: https://arxiv.org/abs/1703.05698,
doi: 10.48550/arxiv.1703.05698.
[19] AndroidDrawer. AndroidDrawer, Android App Repository. Facebook Page. Accessed: Jun. 12, 2022. [Online]. Available: https://www.facebook.com/android.drawer/about/?ref=page_internal
[20] K. Sohn, X. Yan, and H. Lee, ‘Learning structured output represen-
tation using deep conditional generative models,’’ in Proc. 28th Int.
Conf. Neural Inf. Process. Syst., Cambridge, MA, USA, vol. 2, 2015,
pp. 3483–3491.
[21] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, ‘‘Deep code comment generation,’
in Proc. 26th Conf. Program Comprehension, New York, NY, USA,
May 2018, pp. 200–210, doi: 10.1145/3196321.3196334.
[22] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, ‘‘BLEU: A method
for automatic evaluation of machine translation,’’ in Proc. 40th Annu.
Meeting Assoc. Comput. Linguistics, Philadelphia, PA, USA, Jul. 2002,
pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040
[23] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, ‘Summarizing source
code using a neural attention model,’ in Proc. 54th Annu. Meeting Assoc.
Comput. Linguistics, 2016.
[24] Z. Zhu, Z. Xue, and Z. Yuan, ‘Automatic graphics program generation
using attention-based hierarchical decoder,’’ 2018.
[25] T. Beltramelli, ‘‘Pix2code: Generating code from a graphical user inter-
face screenshot,’ 2017, arXiv:1705.07962.
[26] Y. Kim and H. Kim, ‘Translating CUDA to OpenCL for hardware gen-
eration using neural machine translation,’ in Proc. IEEE/ACM Int. Symp.
Code Gener. Optim. (CGO), Feb. 2019, pp. 285–286.
[27] P. Vingelmann and F.H. Fitzek. (2020). Cuda, Release: 10.2.89. [Online].
Available: https://developer.nvidia.com/cuda-toolkit
[28] NVIDIA Developer. Opencl. Accessed: Mar.6, 2022. [Online]. Available:
https://developer.nvidia.com/opencl
[29] B. Asiroglu, B. R. Mete, E. Yildiz, Y. Nalcakan, A. Sezen, M. Dagtekin,
and T. Ensari, ‘Automatic HTML code generation from mock-up
images using machine learning techniques,’ in Proc. Sci. Meeting Elect.-
Electron. Biomed. Eng. Comput. Sci. (EBBT), Apr. 2019, pp. 1–4.
[30] V. Jain, P. Agrawal, S. Banga, R. Kapoor, and S. Gulyani, ‘‘Sketch2Code:
Transformation of sketches to UI in real-time using deep neural network,’’
2019, arXiv:1910.08930.
[31] M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and
D. Poshyvanyk, ‘An empirical study on learning bug-fixing
patches in the wild via neural machine translation,’ 2018,
arXiv:1812.08693. [Online]. Available: https://arxiv.org/abs/1812.08693,
doi: 10.48550/arxiv.1812.08693.
[32] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus,
‘‘Fine-grained and accurate source code differencing,’ in Proc. 29th
ACM/IEEE Int. Conf. Automated Softw. Eng., New York, NY, USA,
Sep. 2014, pp. 313–324, doi: 10.1145/2642937.2642982.
[33] Z. Chen and M. Monperrus, ‘‘The CodRep machine learning on source
code competition,’ 2018, arXiv:1807.03200.
[34] Y. Shido, Y. Kobayashi, A. Yamamoto, A. Miyamoto, and
T. Matsumura, ‘Automatic source code summarization with
extended tree-LSTM,’ 2019, arXiv:1906.08094. [Online]. Available:
https://arxiv.org/abs/1906.08094, doi: 10.48550/arxiv.1906.08094.
[35] R. Vedantam, C. L. Zitnick, and D. Parikh, ‘CIDEr: Consensus-based
image description evaluation,’’ 2014, arXiv:1411.5726.
[36] S. Banerjee and A. Lavie, ‘‘METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,’’ in Proc. ACL
Workshop Intrinsic Extrinsic Eval. Meas. Mach. Transl. Summarization,
Jun. 2005, pp. 65–72. [Online]. Available: https://aclanthology.org/W05-
0909
[37] C.-Y. Lin, ‘ROUGE: A package for automatic evaluation of summaries,’
in Proc. Workshop ACL Text Summarization Branches Out, Jul. 2004,
pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013
[38] T. Lutellier, L. Pang, V. H. Pham, M. Wei, and L. Tan, ‘‘ENCORE:
Ensemble learning using convolution neural machine translation for auto-
matic program repair,’’ 2019, arXiv:1906.08691. [Online]. Available:
https://arxiv.org/abs/1906.08691, doi: 10.48550/arxiv.1906.08691.
[39] Python Software Foundation. Welcome to Python. Accessed:
Mar. 6, 2022. [Online]. Available: https://www.python.org/
[40] Standard C++ Foundation. Standard C++. Accessed: Mar. 6, 2022.
[Online]. Available: https://isocpp.org/
[41] R. Just, D. Jalali, and M. D. Ernst, ‘‘Defects4J: A database of exist-
ing faults to enable controlled testing studies for Java programs,’’ in
Proc. Int. Symp. Softw. Test. Anal. (ISSTA), New York, NY, USA, 2014,
pp. 437–440, doi: 10.1145/2610384.2628055.
[42] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama, ‘‘QuixBugs: A multi-lingual program repair benchmark set based on the Quixey challenge,’’ in Proc. Companion ACM SIGPLAN Int. Conf. Syst., Program.,
Lang., Appl., Softw. Hum., New York, NY, USA, 2017, pp. 55–56, doi:
10.1145/3135932.3135941.
[43] H. Hata, E. Shihab, and G. Neubig, ‘‘Learning to generate
corrective patches using neural machine translation,’’ 2018,
arXiv:1812.07170. [Online]. Available: https://arxiv.org/abs/1812.07170,
doi: 10.48550/arxiv.1812.07170.
[44] A. Takahashi, H. Shiina, and N. Kobayashi, ‘Automatic generation of
program comments based on problem statements for computational think-
ing,’ in Proc. 8th Int. Congr. Adv. Appl. Informat. (IIAI-AAI), Jul. 2019,
pp. 629–634.
[45] B. W. Kernighan and D. M. Ritchie, The C Programming Language,
2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 1988.
[46] H. Phan, ‘‘Self Learning from large scale code corpus to infer structure
of method invocations,’’ 2019, arXiv:1909.03147. [Online]. Available:
https://arxiv.org/abs/1909.03147, doi: 10.48550/arxiv.1909.03147.
[47] R. Agashe, S. Iyer, and L. Zettlemoyer, ‘JuICe: A large scale distantly
supervised dataset for open domain context-based code generation,’ in
Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint
Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019.
[48] Jupyter Team. Jupyter Notebook Documentation. Accessed: Mar. 6,
2022. [Online]. Available: https://jupyter-notebook.readthedocs.io/
en/stable/notebook.html
[49] Jupyter Team. Nbgrader Documentation. Accessed: Mar. 6, 2022.
[Online]. Available: https://nbgrader.readthedocs.io/en/stable/
[50] R. Shin, M. Allamanis, M. Brockschmidt, and O. Polozov, ‘‘Pro-
gram synthesis and semantic parsing with learned code idioms,’ 2019,
arXiv:1906.10816. [Online]. Available: https://arxiv.org/abs/1906.10816,
doi: 10.48550/arxiv.1906.10816.
[51] W. Ling, P. Blunsom, E. Grefenstette, K. M. Hermann, T. Kočiský,
F. Wang, and A. Senior, ‘‘Latent predictor networks for code genera-
tion,’ in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics, 2016,
pp. 1–11.
[52] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li,
Q. Yao, S. Roman, Z. Zhang, and D. Radev, ‘Spider: A large-scale
human-labeled dataset for complex and cross-domain semantic parsing
and text-to-SQL task,’ in Proc. Conf. Empirical Methods Natural Lang.
Process., 2019, pp. 1–11.
[53] Y. Koroglu and A. Sen, ‘‘Reinforcement learning-driven test genera-
tion for Android GUI applications using formal specifications,’ 2019,
arXiv:1911.05403. [Online]. Available: https://arxiv.org/abs/1911.05403,
doi: 10.48550/arxiv.1911.05403.
[54] C. Gultniek. (2010). F-Droid Benchmarks. [Online]. Available: https://f-droid.org/
[55] R. N. Zaeem, M. R. Prasad, and S. Khurshid, ‘‘Automated generation
of oracles for testing user-interaction features of mobile apps,’ in Proc.
IEEE 7th Int. Conf. Softw. Test., Verification Validation, Mar. 2014,
pp. 183–192.
[56] Android Developers. Android UI/Application Exerciser Monkey.
Accessed: Jun. 12, 2022. [Online]. Available: https://developer.android.
com/studio/test/monkey
[57] Y. Koroglu, A. Sen, O. Muslu, Y. Mete, C. Ulker, T. Tanriverdi, and
Y. Donmez, ‘‘QBE: QLearning-based exploration of Android applica-
tions,’ in Proc. IEEE 11th Int. Conf. Softw. Test., Verification Validation
(ICST), Apr. 2018, pp. 105–115.
[58] U. Hustadt. Metric Temporal Logic: Tools and Experiments. Accessed:
Mar. 16, 2022. [Online]. Available: https://cgi.csc.liv.ac.uk/~ullrich/MTL/
[59] S. Shim, P. Patil, R. R. Yadav, A. Shinde, and V. Devale, ‘‘DeeperCoder:
Code generation using machine learning,’ in Proc. 10th Annu. Comput.
Commun. Workshop Conf. (CCWC), Jan. 2020, pp. 0194–0199.
[60] D. P. Kingma and J. Ba, ‘Adam: A method for stochastic opti-
mization,’ 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/
abs/1412.6980, doi: 10.48550/arxiv.1412.6980.
[61] T. Dozat, ‘Incorporating Nesterov momentum into Adam,’ Com-
put. Sci., Stanford Univ., Stanford, CA, USA, Tech. Rep. 54, 2016.
Accessed: Jul. 7, 2022. [Online]. Available: https://cs229.stanford.edu/
proj2015/054_report.pdf
[62] A. J. Stein, L. Kapllani, S. Mancoridis, and R. Greenstadt, ‘‘Exploring
paraphrasing techniques on formal language for generating semantics
preserving source code transformations,’ in Proc. IEEE 14th Int. Conf.
Semantic Comput. (ICSC), Feb. 2020, pp. 242–248.
[63] K. Morton, W. Hallahan, E. Shum, R. Piskac, and M. Santolucito,
‘‘Grammar filtering for syntax-guided synthesis,’’ 2020,
arXiv:2002.02884. [Online]. Available: https://arxiv.org/abs/2002.02884,
doi: 10.48550/arxiv.2002.02884.
[64] C. W. Barrett, C. L. Conway, M. Deters, L. Hadarean, D. Jovanovic,
T. King, A. Reynolds, and C. Tinelli, ‘‘CVC4,’’ in Proc. CAV, 2011,
pp. 171–177.
[65] R. Alur, D. Fisman, R. Singh, and A. Solar-Lezama, ‘‘SyGuS-comp 2017:
Results and analysis,’ Electron. Proc. Theor. Comput. Sci., vol. 260,
pp. 97–115, Nov. 2017, doi: 10.4204/EPTCS.260.9.
[66] Y. Choi, S. Kim, and J.-H. Lee, ‘Source code summarization using
attention-based keyword memory networks,’’ in Proc. IEEE Int. Conf. Big
Data Smart Comput. (BigComp), Feb. 2020, pp. 564–570.
[67] C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshy-
vanyk, ‘On learning meaningful assert statements for unit test
cases,’ in Proc. ACM/IEEE 42nd Int. Conf. Softw. Eng., Jun. 2020,
pp. 1398–1409.
[68] JUnit Team. Developer-Side Testing on the Java Virtual Machine.
Accessed: Mar. 6, 2022. [Online]. Available: https://junit.org/junit5/
[69] A. LeClair, S. Haque, L. Wu, and C. Mcmillan, ‘Improved code summa-
rization via a graph neural network,’ in Proc. 28th Int. Conf. Program
Comprehension, Jul. 2020, pp. 184–195.
[70] A. LeClair and C. Mcmillan, ‘‘Recommendations for datasets for source
code summarization,’ in Proc. Conf. North, 2019, pp. 1–7.
[71] S. Chakraborty, Y. Ding, M. Allamanis, and B. Ray, ‘‘CODIT: Code
editing with tree-based neural models,’ IEEE Trans. Softw. Eng., vol. 48,
no. 4, pp. 1385–1399, Apr. 2022, doi: 10.1109/TSE.2020.3020502.
[72] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and D. Poshyvanyk,
‘‘On learning meaningful code changes via neural machine translation,’’
in Proc. IEEE/ACM 41st Int. Conf. Softw. Eng. (ICSE), May 2019,
pp. 25–36.
[73] K. Grouwstra, ‘‘Type-driven neural programming by example,’’ 2020,
arXiv:2008.12613. [Online]. Available: https://arxiv.org/abs/2008.12613,
doi: 10.48550/arxiv.2008.12613.
[74] Haskell. Haskell Language. Accessed: Mar. 6, 2022. [Online]. Available:
https://www.haskell.org/
[75] M. H. Hassan, O. A. Mahmoud, O. I. Mohammed, A. Y. Baraka,
A. T. Mahmoud, and A. H. Yousef, ‘‘Neural machine based mobile
applications code translation,’ in Proc. 2nd Novel Intell. Lead. Emerg.
Sci. Conf. (NILES), Oct. 2020, pp. 302–307.
[76] C. Gemmell, F. Rossetto, and J. Dalton, ‘‘Relevance transformer: Gen-
erating concise code snippets with relevance feedback,’’ in Proc. 43rd Int.
ACM SIGIR Conf. Res. Develop. Inf. Retr., Jul. 2020, pp. 2005–2008, doi:
10.1145/3397271.3401215.
[77] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and
S. Nakamura, ‘‘Learning to generate pseudo-code from source code
using statistical machine translation,’ in Proc. 30th IEEE/ACM Int. Conf.
Automated Softw. Eng. (ASE), Nov. 2015, pp. 574–584.
[78] P. Yin, B. Deng, E. Chen, B. Vasilescu, and G. Neubig, ‘‘Learning
to mine aligned code and natural language pairs from stack over-
flow,’’ in Proc. 15th Int. Conf. Mining Softw. Repositories, May 2018,
pp. 476–486.
[79] L. Perez, L. Ottens, and S. Viswanathan, ‘‘Automatic code gener-
ation using pre-trained language models,’ 2021, arXiv:2102.10535.
[Online]. Available: https://arxiv.org/abs/2102.10535, doi: 10.48550/
arxiv.2102.10535.
[80] T. B. Brown et al., ‘Language models are few-shot learners,’’ 2020,
arXiv:2005.14165. [Online]. Available: https://arxiv.org/abs/2005.14165,
doi: 10.48550/arxiv.2005.14165.
[81] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt,
‘‘CodeSearchNet challenge: Evaluating the state of semantic
code search,’ 2019, arXiv:1909.09436. [Online]. Available: https://
arxiv.org/abs/1909.09436, doi: 10.48550/arxiv.1909.09436.
[82] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, ‘‘Character-aware neural
language models,’ 2015, arXiv:1508.06615.
[83] R. Shahbazi, R. Sharma, and F. H. Fard, ‘API2Com: On the improvement
of automatically generated code comments using API documentations,’
in Proc. IEEE/ACM 29th Int. Conf. Program Comprehension (ICPC),
May 2021, pp. 411–421.
[84] W. Wang, Y. Zhang, Z. Zeng, and G. Xu, ‘‘TranS3: A transformer-based
framework for unifying code summarization and code search,’’ 2020,
arXiv:2003.03238. [Online]. Available: https://arxiv.org/abs/2003.03238,
doi: 10.48550/arxiv.2003.03238.
[85] N. Jiang, T. Lutellier, and L. Tan, ‘‘CURE: Code-aware neural
machine translation for automatic program repair,’’ in Proc. IEEE/ACM
43rd Int. Conf. Softw. Eng. (ICSE), May 2021, pp. 1161–1173, doi:
10.1109/ICSE43902.2021.00107.
[86] T. Lutellier, H. V. Pham, L. Pang, Y. Li, M. Wei, and L. Tan, ‘‘CoCoNuT:
Combining context-aware neural translation models using ensemble for
program repair,’’ in Proc. 29th ACM SIGSOFT Int. Symp. Softw. Test.
Anal., Jul. 2020, pp. 101–114.
[87] T. B. Brown et al., ‘Language models are few-shot learners,’’ 2020,
arXiv:2005.14165. [Online]. Available: https://arxiv.org/abs/2005.14165,
doi: 10.48550/arxiv.2005.14165.
[88] B. Wang and A. Komatsuzaki. (May 2021). GPT-J-6B: A 6 Bil-
lion Parameter Autoregressive Language Model. [Online]. Available:
https://github.com/kingoflolz/mesh-transformer-jax
[89] G. Yang, Y. Zhou, X. Chen, and C. Yu, ‘Fine-grained pseudo-code
generation method via code feature extraction and transformer,’’ in Proc.
28th Asia–Pacific Softw. Eng. Conf. (APSEC), Dec. 2021, pp. 213–222.
[90] S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken,
and P. Liang, ‘‘SPoC: Search-based pseudocode to code,’’ 2019,
arXiv:1906.04908.
[91] J. Hong, D. Dohan, R. Singh, C. Sutton, and M. Zaheer, ‘‘Latent
programmer: Discrete latent codes for program synthesis,’ 2020,
arXiv:2012.00377. [Online]. Available: https://arxiv.org/abs/2012.00377,
doi: 10.48550/arxiv.2012.00377.
[92] E. Parisotto, A.-R. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli,
‘‘Neuro-symbolic program synthesis,’’ 2016, arXiv:1611.01855.
[Online]. Available: https://arxiv.org/abs/1611.01855, doi:
10.48550/arxiv.1611.01855.
[93] Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and
P. S. Yu, ‘‘Improving automatic source code summarization via
deep reinforcement learning,’ in Proc. 33rd ACM/IEEE Int. Conf.
Automated Softw. Eng., New York, NY, USA, Sep. 2018, pp. 397–407,
doi: 10.1145/3238147.3238206.
[94] J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A. Mohamed,
and P. Kohli, ‘‘RobustFill: Neural program learning under noisy
I/O,’ in Proc. 34th Int. Conf. Mach. Learn. (ICML), Mar. 2017,
pp. 990–998. [Online]. Available: https://www.microsoft.com/en-
us/research/publication/robustfill-neural-program-learning-noisy-io/
[95] G. Yang, X. Chen, J. Cao, S. Xu, Z. Cui, C. Yu, and K. Liu, ‘‘Com-
Former: Code comment generation via transformer and fusion method-
based hybrid code representation,’ in Proc. 8th Int. Conf. Dependable
Syst. Their Appl. (DSA), Aug. 2021, pp. 30–41.
[96] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, ‘‘Deep code comment generation
with hybrid lexical and syntactical information,’ Empirical Softw. Eng.,
vol. 25, no. 3, pp. 2179–2217, May 2020.
[97] S. Wang, K. Liu, B. Lin, L. Li, J. Klein, X. Mao, and T. F. Bissyandé,
‘‘Beep: Fine-grained fix localization by learning to predict buggy code
elements,’ 2021, arXiv:2111.07739.
[98] R.-M. Karampatsis and C. Sutton, ‘‘How often do single-statement bugs
occur? The ManySStuBs4J dataset,’ 2019, arXiv:1905.13334.
[99] F. Madeiral, S. Urli, M. Maia, and M. Monperrus, ‘‘BEARS: An extensi-
ble Java bug benchmark for automatic program repair studies,’’ in Proc.
IEEE 26th Int. Conf. Softw. Anal., Evol. Reeng. (SANER), Feb. 2019,
pp. 468–478.
[100] R. Saha, Y. Lyu, W. Lam, H. Yoshida, and M. Prasad, ‘‘Bugs.jar: A large-
scale, diverse dataset of real-world Java bugs,’’ in Proc. IEEE/ACM 15th
Int. Conf. Mining Softw. Repositories (MSR), May 2018, pp. 10–13.
[101] K. Liu, D. Kim, A. Koyuncu, L. Li, T. F. Bissyande, and Y. Le Traon,
‘‘A closer look at real-world patches,’’ in Proc. IEEE Int. Conf. Softw.
Maintenance Evol. (ICSME), Sep. 2018, pp. 275–286.
[102] T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman, ‘‘Control-
ling conditional language models without catastrophic forgetting,’ 2021,
arXiv:2112.00791.
[103] S. Black, L. Gao, P. Wang, C. Leahy, and S. R. Biderman, ‘‘GPT-Neo:
Large scale autoregressive language modeling with mesh-tensorflow,’’
2021.
[104] G. van Rossum, B. Warsaw, and N. Coghlan. (2001). Style Guide for
Python Code. [Online]. Available: https://peps.python.org/pep-0008/
[105] P. Bielik, V. Raychev, and M. Vechev, ‘PHOG: Probabilistic
model for code,’ in Proc. The 33rd Int. Conf. Mach. Learn.,
vol. 48, M. F. Balcan and K. Q. Weinberger, Eds. New York, NY, USA, Jun. 2016, pp. 2933–2942. [Online]. Available:
https://proceedings.mlr.press/v48/bielik16.html
[106] R. J. Williams, ‘‘Simple statistical gradient-following algorithms for
connectionist reinforcement learning,’ Mach. Learn., vol. 8, nos. 3–4,
pp. 229–256, May 1992, doi: 10.1007/BF00992696.
[107] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei,
P. Christiano, and G. Irving, ‘‘Fine-tuning language models from human
preferences,’ 2019, arXiv:1909.08593.
[108] P. J. Blazek, K. Venkatesh, and M. M. Lin, ‘Deep distilling:
Automated code generation using explainable deep learning,’ 2021,
arXiv:2111.08275.
[109] P. J. Blazek and M. M. Lin, ‘Explainable neural networks that simulate
reasoning,’ Nature Comput. Sci., vol. 1, no. 9, pp. 607–618, Sep. 2021,
doi: 10.1038/s43588-021-00132-w.
[110] S. Wolfram, ‘‘Statistical mechanics of cellular automata,’’ Rev. Mod. Phys., vol. 55, pp. 601–644, Jul. 1983.
[111] R. Mukherjee, Y. Wen, D. Chaudhari, T. W. Reps, S. Chaudhuri, and
C. Jermaine, ‘‘Neural program generation modulo static analysis,’’ 2021,
arXiv:2111.01633.
[112] S. Lu et al., ‘‘CodeXGLUE: A machine learning benchmark dataset for
code understanding and generation,’ 2021, arXiv:2102.04664.
[113] M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov, ‘‘Gener-
ative code modeling with graphs,’’ 2018, arXiv:1805.08490.
[114] Z. Teng, Q. Fu, J. White, and D. C. Schmidt, ‘‘Sketch2Vis: Generating
data visualizations from hand-drawn sketches with deep learning,’ in
Proc. 20th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2021,
pp. 853–858.
[115] M. Li, Z. Lin, R. Mech, E. Yumer, and D. Ramanan, ‘Photo-sketching:
Inferring contour drawings from images,’ 2019, arXiv:1901.00542.
[116] S. Ostrowski. (2020). Announcing Pylance: Fast, Feature-Rich Language Support for Python in Visual Studio Code. Accessed: Jun. 7, 2022. [Online]. Available: https://devblogs.microsoft.com/python/announcing-pylance-fast-feature-rich-language-support-for-python-in-visual-studio-code/
[117] C. Olah. (Aug. 2015). Understanding LSTM Networks. [Online]. Available: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[118] D. Bahdanau, K. Cho, and Y. Bengio, ‘Neural machine
translation by jointly learning to align and translate,’ 2014,
arXiv:1409.0473. [Online]. Available: https://arxiv.org/abs/1409.0473,
doi: 10.48550/arxiv.1409.0473.
[119] A. Amidi and S. Amidi. CS 230—Convolutional Neural Networks
Cheatsheet. Accessed: Mar. 6, 2022. [Online]. Available:
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks
[120] P. von Platen. (2020). How to Generate Text: Using Different Decoding Meth-
ods for Language Generation With Transformers. Accessed: Jun. 8, 2022.
[Online]. Available: https://huggingface.co/blog/how-to-generate
[121] R. Gupta, S. Pal, A. Kanade, and S. Shevade. (2017). Fixing Common C
Language Errors by Deep Learning. [Online]. Available: http://www.iisc-seal.net/deepfix
[122] R. Sennrich, B. Haddow, and A. Birch, ‘‘Neural machine transla-
tion of rare words with subword units,’ in Proc. ACL, Aug. 2016,
pp. 1715–1725. [Online]. Available: https://aclanthology.org/P16-1162
[123] Stack Exchange. (2022). Stack Overflow | Where Developers Learn, Share, & Build Careers. [Online]. Available: https://stackoverflow.com/
[124] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan,
M. Zhou, A. Blanco, and S. Ma, ‘‘CodeBLEU: A method for automatic
evaluation of code synthesis,’’ 2020, arXiv:2009.10297. [Online]. Avail-
able: https://arxiv.org/abs/2009.10297, doi: 10.48550/arxiv.2009.10297.
[125] M. Post, ‘‘A call for clarity in reporting BLEU scores,’’ in Proc. 3rd Conf.
Mach. Transl., Res. Papers. Brussels, Belgium, Oct. 2018, pp. 186–191.
[Online]. Available: https://aclanthology.org/W18-6319
ENRIQUE DEHAERNE (Graduate Student Mem-
ber, IEEE) received the Bachelor of Engineering
degree from KU Leuven, Belgium, in 2020, where
he is currently pursuing the Master of Engineering
degree in computer science.
He was a Research Engineering Intern at Nokia
Bell Labs, Antwerp, in Summer 2021. He is cur-
rently collaborating with the Advanced Patterning
Center, Interuniversity Microelectronics Centre
(IMEC), to complete his master thesis. Among
other projects, the IMEC Advanced Patterning Laboratory researches com-
puter vision and data analysis applications for use in the semiconductor
industry. His research interests include applications of machine learning,
computer vision, natural language processing, and robotics.
BAPPADITYA DEY (Member, IEEE) received the
B.Sc. and M.Sc. degrees (Hons.) in physics and
electronic science from the University of Calcutta,
Kolkata, India, in 2006 and 2008, respectively, the M.Tech. degree in computer science and engineering from MAKAUT (formerly known as WBUT), Kolkata, in 2010, and the M.Sc. and
Ph.D. degrees in computer engineering from the
Center for Advanced Computer Studies (CACS),
University of Louisiana at Lafayette, USA, in
2017 and 2022, respectively. He joined IMEC, Belgium, in 2018, and
worked in various roles since then. He is currently a Research and Devel-
opment Engineer at the Advanced Patterning Center, IMEC. To date, he has authored or coauthored 24 publications and has presented at several international conferences. His research interests include VLSI, microelectronics,
reconfigurable hardware, machine learning, computer vision, artificial intel-
ligence, and semiconductor process optimization. He is a member of SPIE.
SANDIP HALDER received the Ph.D. degree in
metallurgy and materials science from RWTH
Aachen, in 2006. He joined IMEC, in 2007,
as a Research Scientist at the Advanced Mate-
rials and Process Department and was respon-
sible for leading the metrology and inspection
path-finding activities for the 3D SIC program.
In 2013, he moved to the Advanced Patterning
Center, IMEC, where he has worked as a Research
Scientist and then as the Team Leader for metrol-
ogy and inspection. Since 2020, he has been the Litho Group Lead within
the same department. He has published more than 90 papers and has ten
published patents.
STEFAN DE GENDT (Senior Member, IEEE)
received the Doctor of Science degree from the
University of Antwerp, in 1995. He subsequently
was recruited by IMEC, Leuven, Belgium, the
world’s largest independent research institute in
nanoelectronics and technology. He is currently
a Full Professor (part-time) at KU Leuven and
the Scientific Director of IMEC. Together with
his respective teams, he has (co)authored more
than 500 peer-reviewed journal publications. His
research interests over his 25-year career at IMEC include metrology, semi-
conductor cleaning and passivation, high-k and metal gate unit process
research, and post-CMOS nanotechnology (including nanowires, carbon
nanotubes, graphene, and related 2D materials).
WANNES MEERT (Member, IEEE) received the
Master of Electrotechnical Engineering degree in
microelectronics, the Master of Artificial Intelli-
gence degree, and the Ph.D. degree in computer
science from KU Leuven, in 2005, 2006, and 2011,
respectively. He is currently an IOF Fellow and a
Research Manager at the DTAI Section, Depart-
ment of CS, KU Leuven. His work is focused on
applying machine learning, artificial intelligence,
and anomaly detection technology to industrial
application domains.