PySStuBs: Characterizing Single-Statement Bugs in
Popular Open-Source Python Projects
Arthur V. Kamienski, Luisa Palechor, Cor-Paul Bezemer, Abram Hindle
University of Alberta
Edmonton, Alberta, Canada
{kamiensk,palechor,bezemer,hindle1}@ualberta.ca
Abstract—Single-statement bugs (SStuBs) can have a severe
impact on developer productivity. Despite usually being simple
and not offering much of a challenge to fix, these bugs may still
disturb a developer’s workflow and waste precious development
time. However, few studies have paid attention to these simple
bugs, focusing instead on bugs of any size and complexity. In
this study, we explore the occurrence of SStuBs in some of
the most popular open-source Python projects on GitHub, while
also characterizing their patterns and distribution. We further
compare these bugs to SStuBs found in a previous study on
Java Maven projects. We find that these Python projects have
different SStuB patterns than the ones in Java Maven projects
and identify 7 new SStuB patterns. Our results may help uncover
the importance of understanding these bugs for the Python
programming language, and how developers can handle them
more effectively.
Index Terms—Single-statement bugs, Python, open-source
projects
I. INTRODUCTION
All software developers have to deal with bugs at some
point in their careers, either while working on toy projects for
leisure or developing enterprise-grade software for industry.
These bugs may occur due to countless reasons, from syntax
errors to programming logic-related issues [11]. Bugs may also
vary in size and complexity, ranging from a single wrong token
to many lines of code spread across different components.
Tricky bugs that span multiple functions and statements may
offer developers a great challenge to unravel, making them
waste precious hours of their work time [15]. Single-statement
bugs (also known as simple stupid bugs, or SStuBs) [9]
also jeopardize developer productivity by interrupting their
workflow, despite being easier to fix. Frequently occurring
SStuBs may significantly impact workflows by continuously
making developers switch contexts to fix problems.
Several studies have analyzed the impact of bugs on de-
veloper productivity and projects’ lifecycles [7], [18]. Many
of those have focused on automatically identifying bugs to
relieve the developers’ burden of manually searching and
fixing them [21]. However, researchers have not given a lot
of attention to SStuBs and their relevance to software devel-
opment. Studying and characterizing these SStuBs can help
developers identify them sooner, reducing the time they invest
in fixing them.
Recently, a study by Karampatsis and Sutton [9] identi-
fied and analyzed SStuBs in 1,000 open-source Java Maven
projects. The authors characterize 16 SStuB patterns and
discuss their frequency in those projects. While that study
sheds light on how SStuBs occur and the importance of
studying them, it only focuses on Java Maven projects.
With that in mind, this paper identifies and analyzes the
occurrence of SStuBs in a different programming language,
namely 1,000 of the most popular Python projects on GitHub.
The Python programming language shows several differences
from other languages such as Java [6], [16], which may be
reflected in the types and number of occurrences of SStuBs.
By collecting data from these Python projects and using
a similar approach as the one used by Karampatsis and
Sutton [9], we seek to understand the differences in SStuBs
between the Java and Python projects. More specifically, we
answer the following Research Questions (RQs):
RQ1. What are the most common single-statement bugs
in the most popular open-source Python projects?
The differences between Python and other programming
languages may result in the occurrence of different types
of single-statement bugs in Python projects. In this research
question, we discuss the types of SStuBs we identified in
the studied Python projects. We identify the 16 most frequently
occurring patterns and characterize 7 new patterns that are not
among those presented by Karampatsis and Sutton [9].
RQ2. How do the single-statement bugs we identified
compare to the ones found in Java Maven projects?
While Python projects might contain new SStuB patterns,
we still expect to find some of the patterns described for Java
Maven projects due to the similarities between the syntaxes
of the two languages (e.g., control structures and arithmetic
operators). In this research question, we compare the types
of Python SStuBs to the ones in Java Maven projects as
described by Karampatsis and Sutton [9]. We find that some of
the SStuBs are unique to each programming language, which
affects their frequency in the projects.
II. METHODOLOGY
In this section, we describe the methodology we used to
select the studied projects, gather their data, and identify
SStuBs. Figure 1 shows an overview of the steps we took.
A. Selecting Python projects
We selected the 1,000 most popular Python projects on
GitHub as measured by their number of stars in January 2021.
We used GitHub’s search engine to obtain a list of Python
[Fig. 1. Overview of the steps taken in our methodology. Data gathering: starting from 1,000 Python projects on GitHub, commits and files are retrieved through World of Code; commit messages are filtered using keywords, yielding 1,844,369 bug-fixing commits. Identifying single-statement bugs: Python 3.6 files modified by commits with single-statement changes are selected (148,450 file pairs), ASTs are extracted from each file and compared, refactoring changes are removed, and patterns are identified (73,013 SStuBs).]
projects (i.e., projects which have most of their content written
in Python) ordered by their number of stars. We chose this
specific number of projects to provide a fairer comparison with
the 1,000 Java Maven projects collected by Karampatsis and
Sutton [9] for their “SStuBs L” dataset.
B. Gathering data
We gathered data from the projects we selected in Section II-A
using World of Code (WoC) [14], an infrastructure for mining
open-source software and its version control data that is
updated monthly. Starting from the project IDs on GitHub, we
used the WoC API to collect the commits associated with them.
We used a similar process to gather the files containing source
code associated with each commit. The data was collected in
January 2021.
We note that the data we collected from WoC is not an exact
representation of the projects, as the API could not retrieve
some of the entities belonging to the projects. For example,
we could not collect some of the commits referenced by the
projects. However, we measured an overall data loss of less
than 10%, which should not substantially affect our results.
We collected 6,062,534 commits from the 1,000 projects
we selected. We identified bug-fixing commits using the same
methodology described by Karampatsis and Sutton [9], i.e., by
filtering commit messages which contain one of the following
keywords: ‘error’, ‘bug’, ‘fix’, ‘issue’, ‘mistake’, ‘incorrect’,
‘fault’, ‘defect’, ‘flaw’, and ‘type’. Using this method, we ob-
tained 1,844,369 bug-fixing commits. In addition, we excluded
commit messages that included the ‘refactor’ keyword to help
reduce the number of false positives in our sample, but those
only accounted for 1% (18,156) of the commits. Lastly, we
filtered out any commits that added or deleted files.
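For illustration, the following is a minimal sketch of this filtering step. The keyword list is the one given above; the commit-message source is a hypothetical list of (hash, message) pairs retrieved through the WoC API, not part of our published tooling.

    # Sketch of the keyword-based selection of bug-fixing commits.
    BUG_KEYWORDS = ('error', 'bug', 'fix', 'issue', 'mistake',
                    'incorrect', 'fault', 'defect', 'flaw', 'type')

    def is_bug_fixing(message: str) -> bool:
        """Keep messages that match a bug keyword but not 'refactor'."""
        text = message.lower()
        if 'refactor' in text:  # excluded to reduce false positives
            return False
        return any(keyword in text for keyword in BUG_KEYWORDS)

    # Hypothetical usage: commits is a list of (commit_hash, message) pairs.
    # bug_fixing = [(h, m) for h, m in commits if is_bug_fixing(m)]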
We moved on to gather the files referenced by each commit.
In this step, we collected the files before and after they were
modified by the commit. As the projects can also contain files
that are not written in Python, we filtered out any files whose
names do not end with the '.py' suffix. We also excluded any
files that did not contain valid Python 3.6 syntax, as those
could not be parsed into Abstract Syntax Trees (ASTs) in later
steps using our version of Python 3.
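This syntax filter can be sketched with the ast module itself; files whose contents fail to parse are dropped (the helper name is ours):

    import ast

    def is_valid_python3(source: str) -> bool:
        """Return True if the file parses under our Python 3 grammar."""
        try:
            ast.parse(source)
            return True
        except (SyntaxError, ValueError):
            return False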
C. Identifying single-statement bugs
We followed a similar approach as Karampatsis and Sut-
ton [9] to identify SStuBs. First, we used the Unix ‘diff’
command to identify the line changes between each file pair.
We selected only the pairs which showed single-statement
changes, while discarding those containing line deletions and
additions. We also discarded all of the files from commits that
showed multiple-statement changes in any position in any file.
After this step, we were left with 148,450 file pairs.
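While our pipeline used the Unix 'diff' command, this filter can be approximated in Python with the standard difflib module. The sketch below treats a single replaced line as a proxy for a single-statement change, which is a simplification since a statement may span multiple lines:

    import difflib

    def is_single_statement_change_pair(before: str, after: str) -> bool:
        """Keep file pairs whose diff consists only of single-line
        replacements (no pure line additions or deletions)."""
        matcher = difflib.SequenceMatcher(None,
                                          before.splitlines(),
                                          after.splitlines())
        changes = [op for op in matcher.get_opcodes() if op[0] != 'equal']
        return bool(changes) and all(
            tag == 'replace' and i2 - i1 == 1 and j2 - j1 == 1
            for tag, i1, i2, j1, j2 in changes)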
For file pairs that contain multiple single-statement changes,
we derived new pairs by applying each of the changes to the
original file, one at a time. Thus, each new pair contained only
one change. We parsed the files of each pair using Python’s
ast library, yielding an Abstract Syntax Tree (AST) for each of
them. The ast library ignores comments and whitespace, and
we therefore do not consider changes to those in our analysis.
We also ignore changes to class and function docstrings.
We wrote a custom Python script to compare the resulting
AST pairs. Using the script, we perform a simultaneous
depth-first traversal of each tree and locate the first pair of
nodes in which the trees differ. Each pair of diverging nodes
thus corresponds to a single-statement change between the
files. We manually analyzed each type of node difference to
identify whether it matched any of the SStuB patterns described
by Karampatsis and Sutton [9]. However, unlike Karampatsis
and Sutton [9], our method only matches the first pattern found
for a statement, not all of them. Furthermore, we analyzed
the most common types of node differences to identify the
16 most common SStuBs, and found 7 new SStuB patterns
(described in Section III-A).
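As an illustration of this comparison, the following is a minimal sketch of such a comparator (not our exact script): it performs a simultaneous depth-first traversal of two trees produced by the ast library and returns the first point of divergence.

    import ast

    def first_divergence(a, b):
        """Return the first diverging node pair (or field-value pair)
        found during a simultaneous depth-first traversal, or None."""
        if type(a) is not type(b):
            return a, b
        if not isinstance(a, ast.AST):
            # Primitive field: identifier, constant value, etc.
            return None if a == b else (a, b)
        for field in a._fields:
            va = getattr(a, field, None)
            vb = getattr(b, field, None)
            if isinstance(va, list) and isinstance(vb, list):
                if len(va) != len(vb):
                    return a, b
                children = zip(va, vb)
            else:
                children = [(va, vb)]
            for ca, cb in children:
                found = first_divergence(ca, cb)
                if found is not None:
                    return found
        return None

    # Hypothetical usage with the two versions of a file:
    # old_tree, new_tree = ast.parse(old_source), ast.parse(new_source)
    # divergence = first_divergence(old_tree, new_tree)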
D. Removing refactoring changes
We noticed that some of the patterns we identified in Sec-
tion II-C described changes to function and class definitions,
such as their names and arguments. As those patterns likely
relate to the refactoring of code and not to bug-fixing changes,
we decided to exclude them from our analysis. We also
excluded changes made to statements that referenced those
refactored entities in any of the files that originated from the
same commit. For example, if a commit changed the name of
a class in one of its files, we excluded all of the changes made
to that class’ usages across all of the files in the commit.
We also observed that many of the single-statement changes
we identified describe changes to the values of string constants.
Strings in Python serve a large number of purposes, from
indexing values in dictionaries to storing data, and developers
may need to frequently change them to account for new
code versions. Strings also have a flexible length and can
contain any type of written text, and are therefore prone to
errors and misspellings. However, not all of the changes to
those string values can be considered bugs, and including
them in our analysis may introduce many false positives.
For example, developers frequently use hard-coded strings
in natural language to write messages that describe errors,
program functionalities or interactions with users, and changes
to those strings may not change the behaviour of the program.
We therefore excluded these changes from our analysis.
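For illustration, the two exclusion rules described above could be sketched as checks on the pre- and post-change nodes. The helper names are ours; we use the modern ast.Constant node, whereas older Python 3 versions represent these literals as ast.Str/ast.Num; the cross-file exclusion of renamed entities' usages is omitted for brevity.

    import ast

    DEFINITIONS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

    def is_definition_rename(old, new) -> bool:
        """Changes to a function or class name are treated as
        refactorings rather than bug fixes."""
        return (isinstance(old, DEFINITIONS) and type(old) is type(new)
                and old.name != new.name)

    def is_string_value_change(old, new) -> bool:
        """Changes that only alter a string constant's value are
        excluded, as many of them are not bug fixes."""
        return (isinstance(old, ast.Constant) and isinstance(new, ast.Constant)
                and isinstance(old.value, str) and isinstance(new.value, str)
                and old.value != new.value)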
We note that we only identified trivial changes, and we
did not remove more complex refactorings. In the end, we
were left with 126,912 single-statement changes that altered
the ASTs, 58% (73,013) of which belong to the 23 SStuB
patterns we used. The remaining changes did not fit any of
our patterns. Our final dataset with 73,013 SStuBs is publicly
available online [8].
III. RESULTS
In this section we answer our two RQs by describing the
results of our analysis of the 73,013 SStuBs and 23 patterns
we obtained from Section II.
A. RQ1. What are the most common single-statement bugs in
the most popular open-source Python projects?
We found 7 new SStuB patterns in Python projects. Out
of the 23 SStuBs we identified, 7 were not previously defined
by Karampatsis and Sutton [9]. While some of these patterns
may occur in Java, others only occur due to the difference in
syntax between the languages. We give a brief description of
each of these new patterns below. The number of occurrences
of each pattern can be seen in Table I.
Change Attribute Used - When developers change
the attribute accessed from an object. For example,
person.name changes to person.age.
Add Function Around Expression - When developers put
an expression inside a function call, often for modifying
the returned value. For example, human = person
changes to human = is_human(person).
Add Elements to Iterable - When developers add an
element to a hard-coded iterable, such as a list or a
tuple. For example, info = (name, age) changes
to info = (name, age, height).
Change Keyword Argument Used - When developers
change the keyword argument used in a function call or
object instantiation. For example, Person(name=20)
changes to Person(age=20).
Add Method Call - When developers add a method call
to an expression which references an object, changing the
return value. For example, year = person changes to
year = person.birth_year().
Change Constant Type - When developers
change the type of a hard-coded constant. For
example, person.age = '10' changes to
person.age = 10.
Add Attribute Access - When developers access the
attribute of an object instead of the object itself.
For example, say_hello_to(person) changes to
say_hello_to(person.name).
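To make the matching concrete, below are sketches of how two of these patterns might be recognized on a pre-/post-change node pair. The function names are ours, not those of our script, and ast.Constant again assumes a modern Python 3 ast module.

    import ast

    def is_change_attribute_used(old, new) -> bool:
        """'Change Attribute Used': same base expression, different
        attribute name (e.g., person.name -> person.age)."""
        return (isinstance(old, ast.Attribute)
                and isinstance(new, ast.Attribute)
                and old.attr != new.attr
                and ast.dump(old.value) == ast.dump(new.value))

    def is_change_constant_type(old, new) -> bool:
        """'Change Constant Type': a hard-coded constant whose Python
        type changed (e.g., '10' -> 10)."""
        return (isinstance(old, ast.Constant)
                and isinstance(new, ast.Constant)
                and type(old.value) is not type(new.value))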
TABLE I
COUNTS OF SSTUB PATTERNS IN PYTHON AND JAVA MAVEN PROJECTS.
PATTERNS MARKED WITH * ARE THE NEW PATTERNS WE IDENTIFIED;
† MARKS THE PATTERNS THAT OCCUR OVER TWO TIMES MORE IN JAVA.

Pattern name                      Python      %   Java [9]     %
Same Function More Args            9,958     14      5,100     8
Wrong Function/Method Name         9,091     12     10,179    16
Change Identifier Used †           8,973     12     22,668    35
Add Function Around Expression *   6,363      9          0     0
Change Attribute Used *            5,229      7          0     0
Change Numeric Literal             4,775      7      5,447     8
Change Operand                     4,657      6        807     1
Same Function Less Args            3,381      5      1,588     2
Add Method Call *                  3,338      5          0     0
Add Elements to Iterable *         2,541      3          0     0
More Specific If                   2,443      3      2,381     4
Change Constant Type *             2,199      3          0     0
Change Unary Operator              2,187      3      1,016     2
Change Keyword Argument Used *     1,554      2          0     0
Change Boolean Literal             1,466      2      1,842     3
Add Attribute Access *             1,439      2          0     0
Same Function Wrong Caller         1,163      2      1,504     2
Change Binary Operator †             976      1      2,241     5
Less Specific If †                   943      1      2,813     4
Same Function Swap Args              336     <1        612     1
Change Modifier                        0      0      5,011     8
Delete Throws Exception                0      0        508     1
Missing Throws Exception               0      0        206    <1
Total                             73,013    100     63,923   100
B. RQ2. How do the single-statement bugs we identified
compare to the ones found in Java Maven projects?
The studied Python and Java Maven projects share most
of the 16 original SStuB patterns. We could find 13 of the 16
SStuB patterns identified in Java Maven projects in the Python
projects, although in different proportions. We applied a
chi-squared (χ²) test to the SStuB categories found in both Java
and Python and found that the difference in the proportion of
SStuB types is statistically significant (p < .001).
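This result can be reproduced approximately from Table I. The sketch below assumes a 2×13 contingency table over the patterns shared by both languages, with counts taken from Table I; the variable names are ours and the exact test setup is not part of our published tooling.

    from scipy.stats import chi2_contingency

    # Counts for the 13 SStuB patterns observed in both languages (Table I).
    python_counts = [9958, 9091, 8973, 4775, 4657, 3381, 2443,
                     2187, 1466, 1163, 976, 943, 336]
    java_counts = [5100, 10179, 22668, 5447, 807, 1588, 2381,
                   1016, 1842, 1504, 2241, 2813, 612]

    chi2, p_value, dof, _ = chi2_contingency([python_counts, java_counts])
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")  # p < .001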
We observed this difference in patterns such as Wrong
Function/Method Name (as seen in a commit from the
Keras project [4] with a change from model.train to
model.fit), which comprised 16% (10,179) of the bugs
in Java and 12% (9,091) of the bugs in Python.
While shared patterns can occur in both languages, dif-
ferences between the syntax and type system of the two
programming languages make it impossible for the other three
patterns to occur in Python. Therefore, we did not observe
any Change Modifier, Missing Throws Exception, or Delete
Throws Exception SStuBs. The Python programming language
does not have access level modifiers (e.g., public and private)
and developers instead use naming conventions to simulate
the access restriction behaviour. Similarly, there is no way to
explicitly denote that a function throws an exception.
Many of the new SStuBs we identified relate to Python’s
dynamic type system. We observed that many of the SStuB
patterns we identified in Python can be linked to the flexible
way in which it allows developers to work with data types.
In Python, variables can be created and assigned a value
without an explicit type declaration, and then later be reas-
signed a new value of a different type. This dynamic typing
system allows for patterns such as Change Constant Type,
exemplified by a commit in the Django project [3] when the
value assigned to a variable was changed from a string to an
integer (param = "1" to param = 1).
Other examples are Add Method Call and Add At-
tribute Access, which can only occur if the change from
an object reference to a return value is allowed. We ob-
served this pattern in a commit in the Ansible project [2],
when base.group_upgrade(group) was changed to
base.group_upgrade(group.id). Similarly, the Add
Function Around Expression pattern is usually related to
changing the value and type of an expression. We could
observe such a change in the Scipy project [1], in which a
variable was cast to an integer when being returned from a
function (return nnz changed to return int(nnz)).
In contrast, Java has a static type system that checks for
type inconsistencies during program compilation. While this
system still allows for changes in the value of variables and
constants of the same type (described by patterns such as
Change Identifier Used and Change Numerical Literal), it
prevents the occurrence of the patterns mentioned above.
IV. THREATS TO VALIDITY
Internal validity. We selected the 1,000 Python projects
based on their number of stars as a measure of popularity.
However, there are other ways to measure the popularity of
a project (e.g., the number of forks and contributors) which
could lead to the selection of a different set of projects.
Construct validity. We are limited by the accuracy of the
data provided by the GitHub project search. This may have
excluded some relevant projects from our analysis, including
projects that are more popular than the ones we selected.
Despite checking commits and files for refactoring changes,
we could not detect all of them and the number of SStuBs
may be overestimated. For example, we did not check for
the renaming of entities such as variables. Others have shown
that identifying refactorings in Python is complicated due to
Python’s dynamic nature [22]. As Python is a dynamically
typed language, refactoring Python code tends to cause more
errors than in statically typed languages like Java [17].
External validity. Our results are limited to popular open-
source Python projects and may not generalize to other pro-
gramming languages, or even Python code from other sources.
However, many of our findings overlap with the ones from
Karampatsis and Sutton [9], which may indicate general trends
for other programming languages and projects.
V. RELATED WORK
Prior work has also focused on detecting bugs using AST
representations. Karampatsis and Sutton [9] found single-
statement bugs by mining a set of 100 and 1,000 open-
source Java projects. The authors used ASTs extracted from
the modified files before and after the bugs were fixed,
finding that around 33% of the fixes could be described with
their patterns. Ye et al. [20] presented an empirical study of
automatic program repair on the QuixBugs benchmark, even
though the authors focus on Java implementations. Xu et
al. [19] proposed a predictive analysis for Python that collects
execution traces and detects bugs; they evaluated their prototype
on 11 Python projects and identified 46 bugs. Lin et al. [13]
used ASTs to study fine-grained source code changes in Python
and, later on, Chen et al. [5] analyzed the changes of dynamic
feature code when fixing bugs.
Other studies investigated program repair patterns and some
of the patterns we use in our work have been used by Le Goues
et al. [12], Kim et al. [10], and Karampatsis and Sutton [9].
This research differs from those mentioned above in that
we detect single-statement bugs in 1,000 Python projects and
compare them with the single-statement bugs from Java Maven
projects found by Karampatsis and Sutton [9].
VI. CONCLUSION
In this paper we analyze the most common single-statement
bugs in Python code using data from some of the most popular
open-source Python projects on GitHub. We selected projects
based on their number of stars and used World of Code (WoC)
to collect commit messages and files. After preprocessing the
data, we compared the Abstract Syntax Trees (ASTs) for pairs
of files before and after the bug fixes. As a result, we identified
23 “Simple Stupid Bug” (SStuB) patterns and 73,013 changes
that matched those patterns. Additionally, we characterize 7
new SStuB patterns found in the studied Python projects. We
moved on to compare the SStuBs we found to the ones found
by Karampatsis and Sutton [9], showing that differences
between the programming languages and their typing
disciplines (dynamic versus static) change the types of SStuBs
identified. Our findings may help in understanding these types
of bugs in Python code and may help developers handle them
more effectively. We also share our dataset online [8], allowing
its use in future research.
REFERENCES
[1] “Commit on GitHub: BUG: Homogenezie nnz type to be int for
all sparse matrix types.” [Online]. Available: https://github.com/scipy/
scipy/commit/42420f14c14a622149024a6514b8b7f10620d5bc
[2] “Commit on GitHub: Fix using DNF group upgrade/remove
api.” [Online]. Available: https://github.com/ansible/ansible/commit/
7a6c5dd1ab079d301f41c9557c50557cd0a69029
[3] “Commit on GitHub: Fixed #23434 Coerce Oracle bool params
to int.” [Online]. Available: https://github.com/django/django/commit/
4298f6261060a13eff1e30e6119c2ad5d64e5700
[4] “Commit on GitHub: Fixed import errors with six.moves.cPickle
and model.train typo in the skipgram embeddings exam-
ple.” [Online]. Available: https://github.com/keras-team/keras/commit/
034822359d9a4c3b2fc63de0676bad512b426112
[5] Z. Chen, W. Ma, W. Lin, L. Chen, Y. Li, and B. Xu, “A study on the
changes of dynamic feature code when fixing bugs: towards the benefits
and costs of python dynamic features,” Science China Information
Sciences, vol. 61, 01 2018.
[6] G. Destefanis, M. Ortu, S. Porru, S. Swift, and M. Marchesi, “A statistical comparison of java and python software metric properties,” in Proceedings of the 7th International Workshop on Emerging Trends in Software Metrics, 2016, pp. 22–28.
[7] L. A. F. Gomes, R. da Silva Torres, and M. L. Côrtes, “Bug report severity level prediction in open source software: A survey and research opportunities,” Information and Software Technology, vol. 115, pp. 58–78, 2019.
[8] A. V. Kamienski, L. Palechor, C.-P. Bezemer, and A. Hindle, “PySStuBs: Single-Statement Bugs in Popular Open-Source Python Projects,” Jan. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.4589607
[9] R.-M. Karampatsis and C. Sutton, “How often do single-statement bugs occur? The ManySStuBs4J dataset,” in Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 573–577.
[10] D. Kim, J. Nam, J. Song, and S. Kim, “Automatic patch generation
learned from human-written patches,” in 2013 35th International Con-
ference on Software Engineering (ICSE). IEEE, 2013, pp. 802–811.
[11] A. J. Ko and B. A. Myers, “A framework and methodology for studying the causes of software errors in programming systems,” Journal of Visual Languages & Computing, vol. 16, no. 1-2, pp. 41–84, 2005.
[12] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “GenProg: A generic method for automatic software repair,” IEEE Transactions on Software Engineering, vol. 38, no. 1, pp. 54–72, 2011.
[13] W. Lin, Z. Chen, W. Ma, L. Chen, L. Xu, and B. Xu, “An empirical study
on the characteristics of python fine-grained source code change types,”
in 2016 IEEE International Conference on Software Maintenance and
Evolution (ICSME), 2016, pp. 188–199.
[14] Y. Ma, C. Bogart, S. Amreen, R. Zaretzki, and A. Mockus, “World of code: an infrastructure for mining the universe of open source VCS data,” in 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, 2019, pp. 143–154.
[15] S. Planning, “The economic impacts of inadequate infrastructure for
software testing,” National Institute of Standards and Technology, 2002.
[16] L. Prechelt, “An empirical comparison of c, c++, java, perl, python, rexx
and tcl,” IEEE Computer, vol. 33, no. 10, pp. 23–29, 2000.
[17] M. Schäfer, “Refactoring tools for dynamic languages,” in Proceedings of the Fifth Workshop on Refactoring Tools, ser. WRT ’12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 59–62. [Online]. Available: https://doi.org/10.1145/2328876.2328885
[18] J. Uddin, R. Ghazali, M. M. Deris, R. Naseem, and H. Shah, “A survey
on bug prioritization,” Artificial Intelligence Review, vol. 47, no. 2, pp.
145–180, 2017.
[19] Z. Xu, P. Liu, X. Zhang, and B. Xu, “Python predictive analysis for bug
detection,” in Proceedings of the 2016 24th ACM SIGSOFT International
Symposium on Foundations of Software Engineering, ser. FSE 2016.
New York, NY, USA: Association for Computing Machinery, 2016, p.
121–132. [Online]. Available: https://doi.org/10.1145/2950290.2950357
[20] H. Ye, M. Martinez, T. Durieux, and M. Monperrus, “A comprehensive
study of automatic program repair on the quixbugs benchmark,” in 2019
IEEE 1st International Workshop on Intelligent Bug Fixing (IBF), 2019,
pp. 1–10.
[21] T. Zhang, H. Jiang, X. Luo, and A. T. Chan, “A literature review of research in bug resolution: Tasks, challenges and future directions,” The Computer Journal, vol. 59, no. 5, pp. 741–773, 2016. [Online]. Available: https://doi.org/10.1093/comjnl/bxv114
[22] W. Zhou, Y. Zhao, G. Zhang, and X. Shen, “Harp: Holistic analysis
for refactoring python-based analytics programs,” in 2020 IEEE/ACM
42nd International Conference on Software Engineering (ICSE), 2020,
pp. 506–517.