Understanding Bugs in Multi-Language Deep
Learning Frameworks
Zengyang Li, Sicheng Wang, Wenshuo Wang, Peng Liang‡∗, Ran Mo, Bing Li
School of Computer Science & Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning,
Central China Normal University, Wuhan, China
School of Computer Science, Wuhan University, Wuhan, China
{zengyangli, moran}@ccnu.edu.cn, {scwang1998, wenshuowang}@mails.ccnu.edu.cn, {liangp, bingli}@whu.edu.cn
Abstract—Deep learning frameworks (DLFs) have been playing
an increasingly important role in this intelligence age since they
act as a basic infrastructure for an increasingly wide range of AI-
based applications. Meanwhile, as multi-programming-language
(MPL) software systems, DLFs are inevitably suffering from
bugs caused by the use of multiple programming languages
(PLs). Hence, it is of paramount significance to understand
the bugs (especially the bugs involving multiple PLs, i.e., MPL
bugs) of DLFs, which can provide a foundation for preventing,
detecting, and resolving bugs in the development of DLFs. To
this end, we manually analyzed 1497 bugs in three MPL DLFs,
namely MXNet, PyTorch, and TensorFlow. First, we classified
bugs in these DLFs into 12 types (e.g., algorithm design bugs and
memory bugs) according to their bug labels and characteristics.
Second, we further explored the impacts of different bug types
on the development of DLFs, and found that deployment bugs
and memory bugs negatively impact the development of DLFs in
different aspects the most. Third, we found that 28.6%, 31.4%,
and 16.0% of bugs in MXNet, PyTorch, and TensorFlow are
MPL bugs, respectively; the PL combination of Python and
C/C++ is most used in fixing more than 92% of MPL bugs in all
DLFs. Finally, the code change complexity of MPL bug fixes is
significantly greater than that of single-programming-language
(SPL) bug fixes in all the three DLFs, while in PyTorch MPL
bug fixes have longer open time and greater communication
complexity than SPL bug fixes. These results provide insights
for bug management in DLFs.
Index Terms—Deep Learning Framework, Bug, Multiple-
Programming-Language Software System, Empirical Study
I. INTRODUCTION
Deep learning frameworks (DLFs), such as PyTorch, have
been playing an increasingly important role in this intelligence
age since they act as basic infrastructures for an increasingly
wide range of AI-based applications. DLFs provide building
blocks for designing, training, and validating deep neural
network models through a high-level programming interface.
Therefore, the reliability of DLFs becomes more and more im-
portant for the fast-growing AI-based applications. To ensure
their reliability, a necessary step is to understand the charac-
teristics and impact of bugs in DLFs. The previous research
on bugs in DLFs is mainly divided into two categories: the
This work was funded by the Natural Science Foundation of Hubei
Province of China under Grant No. 2021CFB577, the National Natural
Science Foundation of China under Grant Nos. 62002129 and 62172311, and
the Knowledge Innovation Program of Wuhan-Shuguang Project under Grant
No. 2022010801020280.
Corresponding author
first category studies the bugs in the implementation of DLFs,
e.g., bug categorization, severity, symptoms, root causes, and
impacts in various DLFs [1][2][3][4]; the second category
studies the bugs in the use of specific DLFs, e.g., dependency
and performance bugs in deep learning (DL) systems in terms
of symptoms, causes, and fix modes [5][6][7][8].
Popular DLFs are usually written in multiple programming
languages (PLs) [9], such as TensorFlow, which is mainly
written in Python and C++. Previous research suggests that
static code analysis in a multi-programming-language (MPL)
software system is much more difficult than in a single-
programming-language (SPL) one [10] [11]; meanwhile, chal-
lenges in understanding multiple PLs and cross-language
communication are usually faced by MPL software systems
[12][11]. As MPL software systems, DLFs are inevitably
suffering from bugs caused by the use of multiple PLs.
Therefore, it is of paramount significance to understand the
bugs (especially the bugs involving multiple PLs, i.e., MPL
bugs) of DLFs, which can provide a foundation for preventing,
detecting, and resolving bugs in the development of DLFs.
In this paper, we investigated the bugs in DLFs. Specifically,
we conducted an empirical study on the bugs and their cor-
responding fixes in three popular DLFs, namely MXNet [13],
PyTorch [14] and TensorFlow [15] on GitHub. The purpose
of this work is to systematically understand the bugs and their
impacts in the development of DLFs, with a focus on MPL
bugs. The main contributions of this work are threefold:
We conducted an empirical study by manual analysis
of 1497 bugs and their corresponding bug fixes from
three popular MPL DLFs, namely MXNet, PyTorch, and
TensorFlow.
We classified these bugs based on the labels tagged
by developers and bug characteristics, and explored the
impacts of bugs on the DLF development in terms of the
open time of bugs, the code change complexity of bug
fixes, and the communication complexity of developers
during bug fixing.
We explored the MPL bugs and their impact on the three
DLFs. Specifically, we looked into the proportion of MPL
bugs and the distribution of PL combinations used in each
DLF, and the difference between SPL and MPL bugs
regarding their impact on the DLFs.
The remainder of this paper is organized as follows. Section
II presents the related work; Section III describes the design
of the empirical study; Section IV presents the results of
the study; Section V discusses the study results; Section VI
identifies the threats to validity of the results; and Section VII
concludes this work with future research directions.
II. RELATED WORK
A. Bug Classification of DLFs
In past years, a number of researchers tried to classify bugs
due to different research objectives and perspectives. Islam et
al. obtained five types of bugs (e.g., API bugs and structural
bugs) in the DLFs through 2716 Stack Overflow posts and 500
bug-fixing commits from GitHub [8]. According to the loca-
tion of the buggy source code, Jia et al. obtained a preliminary
bug classification of TensorFlow, because TensorFlow tends
to place the source files in different directories according to
different functions [16]. Seaman et al. obtained several defect
classification schemes applicable to most projects [17], then
Thung et al. added a new category, namely configuration, to the
bug classification scheme, and obtained their classification [1].
Yang et al. summarized the reference architecture of DLFs,
based on which they built a bug classification of DLFs [2].
Different from the previous classification methods of bugs in
DLFs, we employed the grounded theory [18] and took the bug
labels assigned by developers into consideration. Therefore,
we obtained a more comprehensive bug classification of DLFs
with four newly identified bug types (e.g., deployment and
version compatibility bugs). We believe that our bug classifi-
cation is close to the classification rationale of bug reports by
the developers of DLFs, which makes it convenient for developers to
understand which types of bugs will be more costly to fix.
B. Impact of the Use of Multiple PLs on Software Systems
Recently, more and more researchers have begun to pay
attention to the impact of the use of multiple PLs on software
systems. Ray et al. found a correlation between 11 PLs and
software quality in 729 projects hosted on GitHub [19]. Berger
et al. repeated the research of Ray et al. and found that
only four PLs were statistically significantly associated with
bugs, and the association was very small [20]. Kochhar et al.
collected a large dataset consisting of 628 projects to study the
impact of different PLs on the number of bug fixing commits
[21]. They found that implementing a project with more PLs
would increase its defect proneness. Abidi et al. analyzed the
source code of MPL systems, and found six anti-patterns
[22] and twelve code smells [23]. Li et al. analyzed 18 Apache
MPL software projects, and confirmed that the use of multiple
PLs is related to the increase of development difficulty and the
decline of software quality [24]. Our work further explored the
impact of multiple PLs on DLFs.
III. STUDY DESIGN
In order to investigate the bugs in DLFs, we performed
an empirical study on mainstream DLFs. In this section,
we describe the study, which was designed and reported by
following the guidelines proposed by Runeson and Höst [25].
A. Objective and Research Questions
The goal of this study, described using the Goal-Question-
Metric (GQM) approach [26], is: to analyze bugs and their
corresponding bug-fixing pull requests for the purpose of
investigation with respect to the bugs with a focus on MPL
bugs in DLFs as well as their impacts on DLF development,
from the point of view of software developers in the context
of the development of MPL DLFs.
Based on the aforementioned goal, we formulated four
research questions (RQs), in which RQ1 and RQ2 focus on
the bugs in general in DLFs while RQ3 and RQ4 pay more
attention to MPL bugs in DLFs. The RQs are described as
follows:
-RQ1: What is the classification of bugs in DLFs?
Rationale: By classifying bugs in DLFs, we can have a better
understanding on the causes and distribution of bugs in DLFs,
so that we can conduct an in-depth investigation on each type
of the bugs.
-RQ2: What are the impacts of different types of bugs on the
development of DLFs?
Rationale: Different types of bugs may have different impacts
on the development of DLFs. We study the impacts in the
aspects of the open time of bugs, the complexity of code
change in bug fixing, and the complexity of communication
during bug fixing. The answer to this RQ can help recognize
the types of bugs that have greatest influence on the devel-
opment of DLFs, which can be further used to guide the bug
management in the development of DLFs.
-RQ3: What is the proportion of MPL bugs in DLFs? How
do MPL bugs distribute over bug types and PL combinations?
Rationale: By investigating the proportion of MPL bugs in
DLFs, we can understand the prevalence of multiple PLs used
for bug fixing in DLFs; by looking into the distribution of
MPL bugs over bug types and PL combinations, we can get
to know the tendency of MPL bugs among bug types and
the popularity of different PL combinations used in MPL bug
fixing in DLFs.
-RQ4: How does the use of multiple PLs affect the bug fixing
of DLFs?
Rationale: With this RQ, we investigate whether the use of
multiple PLs can cause additional costs for the bug fixing of
DLFs, so as to analyze the impact of the use of multiple PLs
on DLFs.
B. Cases and Unit Analysis
This study investigates DLFs, i.e., cases, and each bug and
the corresponding bug-fixing pull request is a single unit of
analysis (also called a data unit).
C. Case Selection
In this study, we only investigated DLFs hosted on GitHub.
The reason we used GitHub projects is that most DLFs
are open source on GitHub. At the same time, this can ensure
that all bugs in different DLFs have the same format, so that
we can handle the bugs in the same way. To select each case
included in our study, we employed five inclusion criteria:
C1: The source code written in the main PL accounts for no more
than 75% of the code of the project. This criterion was
set to ensure that the PL use is not extremely unbalanced
so that the biases caused by specific PLs can be reduced.
C2: It has more than 10k stars, which indicates that it is
a popular DLF and has a large group of users.
C3: The DLF has more than 2,000 pull requests. This
criterion was set to ensure that the selected DLF is
nontrivial, and that the resulting dataset is big enough
to be statistically analyzed.
C4: The number of issues of the project is no less than
1,000. This criterion was set to ensure that the selected
DLF had been in the maintenance and evolution stage
for a reasonable length of period, and thus sufficient data
about bug introduction can be collected.
C5: The DLF has more than 150 pairs of bugs and corre-
sponding bug-fixing pull requests. It was set to ensure that
the resulting dataset is big enough for statistical analysis.
The thresholds of C1, C3, and C4 were set according to [24],
and those of C2 and C5 were set based on our experience.
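As a rough sketch, criteria C1-C5 can be checked automatically against repository metadata; the field names below (main_pl_ratio, stars, pull_requests, issues, bug_fix_pairs) are hypothetical placeholders for values retrievable via the GitHub API or from the extracted dataset, and the example record simply reuses TensorFlow's figures from Table II for illustration.

```python
def satisfies_criteria(repo):
    """Check inclusion criteria C1-C5 for a candidate DLF repository."""
    return (repo["main_pl_ratio"] <= 0.75       # C1: main PL <= 75% of the code
            and repo["stars"] > 10_000          # C2: more than 10k stars
            and repo["pull_requests"] > 2_000   # C3: more than 2,000 pull requests
            and repo["issues"] >= 1_000         # C4: no less than 1,000 issues
            and repo["bug_fix_pairs"] > 150)    # C5: more than 150 bug/fix pairs

# Illustrative record reusing TensorFlow's figures from Table II
tensorflow = {"main_pl_ratio": 0.631, "stars": 169_000,
              "pull_requests": 22_191, "issues": 36_007, "bug_fix_pairs": 382}
print(satisfies_criteria(tensorflow))  # True
```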
D. Data Collection
1) Data Items to be Collected: To answer the RQs, we take
a bug and its corresponding pull request as the analysis unit,
and list the data items to be collected in Table I. For each bug,
we need to calculate the open time (OT), lines of source code
modified (LOCM), number of files modified (NOFM), entropy,
number of developers participating (NODP), and number of
comments in the pull request (NOC) to quantify the impact
of the bug on the development of a DLF. Since most data
items are easy to understand, we only explain the definition
of entropy in detail [27].
Suppose that the set of source files modified in a pull request
for fixing a bug is $\{f_1, f_2, \dots, f_n\}$. File $f_i$ ($1 \le i \le n$) is
modified $m_i$ times in the pull requests during the last 60
days [28] before the current bug-fixing pull request. Let
$p_i = m_i / \sum_{j=1}^{n} m_j$. Then, the entropy is
$H(n) = -\sum_{i=1}^{n} p_i \log_2 p_i$.
Since the number of modified source files differs between
different bug-fixing pull requests, we need to normalize the
entropy to be comparable. Considering that $H(n)$ achieves
its maximum of $\log_2 n$ when $p_i = 1/n$ ($1 \le i \le n$), the
normalized entropy is

$$\tilde{H}(n) = \begin{cases} H(n)/\log_2 n, & n > 1, \\ 0, & n = 1. \end{cases} \qquad (1)$$
The entropy of a bug fix reflects the uncertainty faced by the
bug fix. A larger entropy means a bigger uncertainty, resulting
in a higher code change complexity.
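To make the computation of Eq. (1) concrete, the following is a minimal Python sketch of the normalized entropy; it assumes the per-file modification counts over the 60-day window have already been extracted from the repository history, and it returns 0 when no prior modifications are recorded (a corner case not covered by the definition above).

```python
import math

def normalized_entropy(mod_counts):
    """Normalized entropy of a bug-fixing pull request, following Eq. (1).

    mod_counts: list of m_i, the number of times each modified source file
    was changed in pull requests during the 60 days before the current fix.
    """
    n = len(mod_counts)
    total = sum(mod_counts)
    if n <= 1 or total == 0:
        return 0.0
    # p_i = m_i / sum_j m_j; files with m_i = 0 contribute 0 by convention
    probs = [m / total for m in mod_counts if m > 0]
    h = -sum(p * math.log2(p) for p in probs)
    # Normalize by the maximum entropy log2(n)
    return h / math.log2(n)

# Example: three files modified 5, 3, and 2 times in the last 60 days
print(round(normalized_entropy([5, 3, 2]), 3))  # 0.937
```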
2) Data Collection Procedure: To collect the data items
in Table I, we developed a dedicated tool based on GitHub
GraphQL APIs to extract the required data from the repository
of each DLF on GitHub. Specifically, we collected the data in
the following steps:
Step 1: Export available issue reports. Extract all related issues
and corresponding pull requests in the project.
TABLE I
DATA ITEMS TO BE COLLECTED.
# Name Description Relevant RQ
D1 IssueID The identity number of the issue. -
D2 IssueLabels The labels of the issue. RQ1, RQ3
D3 IssueTitle The title of the issue. RQ1
D4 IssueContent The description of the issue. RQ1
D5 IssueCreatedAt The creation time of the issue. RQ2, RQ4
D6 IssueClosedAt The close time of the issue. RQ2, RQ4
D7 PrID The identity number of the pull request associated with the issue. -
D8 PrLabels The labels of the pull request. RQ1, RQ3
D9 PrTitle The title of the pull request. RQ1
D10 PrContent The description of the pull request. RQ1
D11 Files The files modified in the pull request. RQ2-RQ4
D12 OT The open time of the issue, i.e., OT = D6 - D5. RQ2, RQ4
D13 LOCM The number of lines of source code modified in the pull request. RQ2, RQ4
D14 NOFM The number of files modified in the pull request. RQ2, RQ4
D15 Entropy The normalized entropy of the modified files in the pull request. RQ2, RQ4
D16 NODP The number of developers participating in the pull request (excluding robots). RQ2, RQ4
D17 NOC The number of comments in the pull request (excluding robots). RQ2, RQ4
D18 IsMPLF Whether the pull request is an MPL fix, i.e., involving source files in multiple PLs. RQ3, RQ4
Step 2: Filter bug reports. The filtering logic (satisfying any
one of the conditions below) is that: an issue or the pull request connected
to it has at least one label whose name or description
involves the word “Bug”; the word “Bug”, “bug”, or “BUG”
appears in the issue title; or the issue description adopts a bug report
template. In this way, we can get the bug reports we need from
all available reports.
Step 3: Filter available bugs. The status of the issue must be
“closed” and the status of the pull request corresponding to
the issue must be “merged”. This ensures that the bug has an
effective fix.
Step 4: Remove abnormal data. An abnormal data unit means
that a bug does not strictly correspond to a single bug-fixing
pull request (e.g., a pull request corresponds to multiple bugs,
a bug corresponds to multiple pull requests). This will lead to
inaccurate results of the impact of the bug on the project.
Step 5: Calculate the data items listed in Table I.
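As an illustration of Steps 2-4, the sketch below applies the bug-report filter and the validity checks to issue records already exported in Step 1. The field names (labels, linked_prs, title, state, uses_bug_report_template) are hypothetical placeholders rather than the actual schema of our tool, and the symmetric check of Step 4 (one pull request linked to several bugs) is omitted for brevity.

```python
def is_bug_report(issue):
    """Step 2: a label of the issue or its linked pull request mentions "Bug",
    the word "bug" appears in the title, or the description follows a bug
    report template (lower-casing covers "Bug"/"bug"/"BUG")."""
    labels = list(issue["labels"]) + [l for pr in issue["linked_prs"] for l in pr["labels"]]
    if any("bug" in (l["name"] + " " + l.get("description", "")).lower() for l in labels):
        return True
    if "bug" in issue["title"].lower():
        return True
    return issue.get("uses_bug_report_template", False)

def is_valid_data_unit(issue):
    """Steps 3-4: keep closed bugs fixed by exactly one merged pull request."""
    prs = issue["linked_prs"]
    return (issue["state"] == "closed"
            and len(prs) == 1
            and prs[0]["state"] == "merged")

def filter_bugs(exported_issues):
    """Apply Steps 2-4 to the issue records exported in Step 1."""
    return [i for i in exported_issues if is_bug_report(i) and is_valid_data_unit(i)]
```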
E. Data Analysis
1) Data Analysis for RQ1: For RQ1, to obtain the bug
classification in DLFs, we followed a four-step process below.
Step 1: Preliminarily classify bugs based on bug labels. This
step is to get a preliminary bug classification. (1) We collected
all the labels of bugs in the selected DLFs through a dedicated
tool. (2) We then examined relevant official documents and
label descriptions of bugs to have a deep understanding of
the labels. (3) Based on the bug labels and characteristics, the
second and third authors classify bug labels separately. Then,
the first author reviewed the label classification and solved
disagreements through discussion with the second and third
authors. Finally, we obtained a preliminary classification of
the bugs in DLFs.
Step 2: Construct a relatively stable bug classification. The
first two authors manually analyzed a set of sampled bugs
with the help of grounded theory [18]. When the classifica-
tion obtained in Step 1 was found inappropriate during the
bug analysis, the first two authors tried to improve the bug
classification through discussion. In this step, new bug types
might arise, existing bug types might be removed, and the
bug classification would be updated accordingly. Finally, a
relatively stable classification was obtained.
Step 3: Conduct a pilot bug tagging. This step is to get a
stable bug classification and reach a common understanding on
each bug type in the bug classification. We randomly selected
10% of the bugs, and the second and third authors separately
tagged each of the bugs with an appropriate bug type from
the classification obtained in Step 1. If there was any
disagreement on bug tagging, the second and third authors
discussed it with the first author to reach a consensus. We
used Cohen’s Kappa [29] to measure the consistency between
the bug tagging results of the two authors (a minimal sketch of
this agreement check is given after Step 4). When the Cohen’s
Kappa value was less than 90%, the first three authors discussed
to resolve the disagreements and randomly extracted
another 10% of the bugs for another round of bug tagging. The
bug tagging was an iterative process, which stopped when the
Cohen’s Kappa value exceeded 90%.
Step 4: Classify the remaining bugs. The second author
classified the remaining bugs into different bug types.
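The agreement check used in the pilot tagging of Step 3 can be sketched as follows; the two label lists are hypothetical examples, and scikit-learn's cohen_kappa_score is used in place of a hand-rolled computation.

```python
from sklearn.metrics import cohen_kappa_score

def tagging_agreement(labels_rater_a, labels_rater_b):
    """Cohen's Kappa between two raters' bug-type labels for the same sample
    of bugs; the iterative tagging stops once the value exceeds 0.90."""
    return cohen_kappa_score(labels_rater_a, labels_rater_b)

# Hypothetical pilot round over five sampled bugs
a = ["data", "build", "memory", "data", "test"]
b = ["data", "build", "memory", "code", "test"]
print(tagging_agreement(a, b))  # 0.75 here, so another tagging round would be needed
```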
2) Data Analysis for RQ2-RQ4: To answer RQ2, we inves-
tigated the impact of different types of bugs on DLF develop-
ment through six impact indicators (i.e., data items D12-D17)
in three aspects: (1) open time of bugs, i.e., OT (D12); (2)
complexity of code change in bug fixing, including LOCM
(D13), NOFM (D14), and entropy (D15); and (3) complexity
of communication during bug fixing, including NODP (D16)
and NOC (D17). For each indicator, we calculated the average
value and ranking of each bug type in each DLF, and also
calculated the mean ranking for all the selected DLFs by
averaging their ranking numbers. The integrated ranking
(InteRanking) numbers for all bug types are assigned according
to their mean rankings. To answer RQ3, we examined the
extensions of the modified source files in the bug-fixing pull
requests to identify the MPL and SPL fixes, and calculated
the bug type distribution of bugs with these MPL and SPL
fixes. In addition, we calculated the distribution of MPL bugs
over different PL combinations. Similar to the PLs examined
in [24], we only considered the following 18 general-purpose
PLs: C/C++, C#, Clojure, CoffeeScript, Erlang, Go, Haskell,
Java, JavaScript, Kotlin, Objective-C, Perl, PHP, Python, Ruby,
Scala, Swift, and TypeScript. To answer RQ4, we studied
the difference between MPL bug fixes and SPL bug fixes
regarding their impact on DLF development through Mann-
Whitney U tests. Since the data of the variables to be tested do
not necessarily follow a specific distribution, it is reasonable
to use the Mann-Whitney U test, a non-parametric test,
in this study. A test result was considered significant at p-value < 0.05, which
means that the tested groups have a significant difference.
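To make the RQ3/RQ4 analysis concrete, the sketch below derives IsMPLF (D18) from the extensions of the modified files (D11) and compares one impact indicator between MPL and SPL fixes with a Mann-Whitney U test via SciPy. The extension-to-PL mapping is a partial, illustrative one rather than the full mapping we used for the 18 general-purpose PLs.

```python
from scipy.stats import mannwhitneyu

# Partial, illustrative mapping from file extensions to general-purpose PLs
EXT_TO_PL = {".py": "Python", ".c": "C/C++", ".cc": "C/C++", ".cpp": "C/C++",
             ".h": "C/C++", ".hpp": "C/C++", ".go": "Go", ".java": "Java"}

def languages_of_fix(modified_files):
    """General-purpose PLs touched by the files of a bug-fixing pull request (D11)."""
    exts = {"." + name.rsplit(".", 1)[-1].lower() for name in modified_files if "." in name}
    return {EXT_TO_PL[e] for e in exts if e in EXT_TO_PL}

def is_mpl_fix(modified_files):
    """IsMPLF (D18): a fix is MPL if it touches source files in more than one PL."""
    return len(languages_of_fix(modified_files)) > 1

def compare_indicator(mpl_values, spl_values, alpha=0.05):
    """Two-sided Mann-Whitney U test on one impact indicator (e.g., LOCM)."""
    _, p = mannwhitneyu(mpl_values, spl_values, alternative="two-sided")
    return p, p < alpha  # the difference is considered significant if p < 0.05
```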
IV. STUDY RESULTS
Through case selection, we got three DLFs, i.e., MXNet,
PyTorch, and TensorFlow, whose demographic information is
shown in Table II. To be specific, 189, 926, and 382 data units
for analysis were collected from the whole bug set of MXNet,
PyTorch, and TensorFlow respectively, and 1497 data units in
total were obtained for investigation. The dataset is available
online [30]. Some DLFs such as Caffe and Keras were not
included, since only a few bugs are bound to their bug-fixing
pull requests or only a single general-purpose PL is used.
TABLE II
DEMOGRAPHIC INFORMATION OF THE SELECTED DLFS.
DLF #Pr #Issue #star (k) %Main PL #Unit
MXNet 11096 9532 20.2 48.6 189
PyTorch 59555 29575 60.3 49.4 926
TensorFlow 22191 36007 169.0 63.1 382
Total 92842 75114 289.5 - 1497
A. Bug Types in Deep Learning Frameworks (RQ1)
Through manual analysis following the process presented
in Section III-E1, we classified the collected 1497 bugs in the
three DLFs into 12 types, which are shown in Fig. 1. The
details of the 12 bug types are described as follows.
TABLE III
NUMBERS AND PROPORTIONS OF DIFFERENT BUG TYPES (RQ1).
Bug Type MXNet PyTorch TensorFlow Total
# % # % # % # %
Algorithm design bug 21 11.1 119 12.9 34 8.9 174 11.6
Build bug 32 16.9 112 12.1 31 8.1 175 11.7
Code bug 15 7.9 81 8.8 45 11.8 141 9.4
Data bug 45 23.8 155 16.7 96 25.1 296 19.8
Deployment bug 7 3.7 106 11.5 25 6.5 138 9.2
Documentation bug 15 7.9 29 3.1 80 20.9 124 8.3
Memory bug 4 2.1 27 2.9 6 1.6 37 2.5
Performance bug 2 1.1 22 2.4 0 0.0 24 1.6
Processor bug 10 5.3 132 14.3 13 3.4 155 10.4
Test bug 19 10.1 83 9.0 23 6.0 125 8.4
Version compatibility bug 10 5.3 48 5.2 18 4.7 76 5.1
Visualization bug 9 4.8 12 1.3 11 2.9 32 2.1
(1) Algorithm design bug. This type of bug is related
to the defects in DL algorithm design. There are three spe-
cific cases: 1) design bugs in the model layer, e.g., network
structure errors, model accuracy deterioration, and data flow
errors between networks; 2) design bugs in the loss function
algorithm, e.g., the incorrect return value of the loss function
and the internal structure error of the loss function; 3) gradient
errors caused by incorrect settings of optimization parameters
or design bugs of optimization functions, e.g., optimizer-model
mismatches and learning rate algorithm errors.
(2) Build bug. This type of bug refers to the failures
encountered during framework build or compilation. There are
two specific cases: 1) framework build failures, which may be
caused by configuration file errors or by failures when building
frameworks in different platforms, such as Linux, Windows,
Mac, or Docker environments; and 2) API import failures.
(3) Code bug. This type of bug refers to coding logic
errors or code smells introduced in code. Typical instances
include 1) erroneous return messages; 2) naming errors or
naming conflicts; 3) problems in cross-programming-language
communication; 4) coding problems in user-defined functions,
e.g., design defects of class templates; 5) obsolete code or
dead code; 6) file path errors, e.g., hard coded path and path
identification errors; and 7) circular dependencies.
Fig. 1. Bug types in DLFs (RQ1).
(4) Data bug. This type of bug refers to problems related
to data preprocessing that occur before data is input into a
model. Their symptoms are usually function output errors, data
loading errors, data acquisition exceptions, data corruption,
parameter errors, etc. There are three specific cases: 1) bugs
occurring in data calculation operations, e.g., scalar operations,
vector operations, matrix operations, and broadcast mecha-
nisms; 2) bugs occurring in data structure operations, e.g., data
creation, data replication, index slicing, type conversion, and
other related errors; and 3) bugs occurring during data loading.
(5) Deployment bug. This type of bug refers to problems
that arise when a trained model is exported or deployed in a
specific environment. There are two cases: 1) bugs occurring
during model import or export, e.g., abnormal behaviors when
storing trained models; and 2) bugs occurring in the process
of model transformation.
(6) Documentation bug. This type of bug refers to prob-
lems in documentation. Typical bugs: 1) missing documents;
2) incomplete documents; and 3) official document errors.
(7) Memory bug. This type of bug refers to memory-related
errors, which are often caused by inappropriate boundary
settings during training or by poor coding. There are two main
cases: 1) memory errors occurring during the running of a
DLF, e.g., memory leak, memory corruption, illegal memory
access, and insufficient memory; and 2) errors in the process
of memory operation.
(8) Performance bug. This type of bug refers to defects
related to unsatisfying execution performance of DLFs. Typ-
ical bugs include: 1) undesirable computing speed; and 2)
performance regression caused by version change.
(9) Processor bug. This type of bug is related to proces-
sors, such as CPUs or GPUs. There are three cases: 1) bugs
occurring when a model or an operator runs on a specific
processor; 2) bugs in distributed training and data parallel
processing on multiple processors; and 3) bugs occurring when
the processor is not properly initialized, e.g., the processor
does not match some platform environments.
(10) Test bug. This type of bug refers to problems in testing,
including 1) test failures; 2) sample code errors; 3) lack of test
cases; and 4) static code check failures.
(11) Version compatibility bug. This type of bug refers
to compatibility errors for some features due to framework
version changes or technical updates. There are three cases:
1) Bugs occurring in model code because the DL framework
is updated to a new version; 2) bugs caused by the use of
obsolete APIs; and 3) bugs caused by incompatibility between
the current version of a PL and its previous version that results
in syntax errors or unavailability of certain operators.
(12) Visualization bug. This type of bug refers to failures
of visualizing training results or errors in representations of a
DLF. There are three cases: 1) errors in the process of model
visualization; 2) errors found in the process of visualizing
model parameters; and 3) errors in the page display of DLFs, e.g.,
broken links in the front end and page display errors.
The number and proportion of each bug type are presented
in TABLE III, which shows that the proportions of the bug
types are not well balanced, either in individual DLFs or in
all of them as a whole. In MXNet, data bugs have the
largest proportion (23.8%) and performance bugs have the
smallest proportion (1.1%). In PyTorch, data bugs have the
largest proportion (16.7%) and visualization bugs have the
smallest proportion (1.3%). In TensorFlow, data bugs have
the largest proportion (25.1%) and there are no performance
bugs in our collected dataset. Taking all DLFs as a whole,
data bugs (19.8%) are the leading bug type, followed by build
bugs (11.7%) and algorithm design bugs (11.6%); the least common bug
types are performance bugs (1.6%), followed by visualization
bugs (2.1%) and memory bugs (2.5%).
Answer to RQ1: Bugs in DLFs can be classified into 12
types. Data bugs are the dominant bug type (19.8%) and
performance bugs account for the smallest proportion (1.6%).
B. Impacts of Bugs on DLF Development (RQ2)
We present the impacts of different types of bugs on the
development of DLFs in three aspects, i.e., the open time of
bugs, the complexity of code change in bug fixing, and the
complexity of communication during bug fixing.
1) Open Time of Bugs: We compared the open time of
the different bug types across the three DLFs, as shown in
Table IV, in which InteRanking denotes the integrated ranking
of an indicator according
to its mean ranking. For MXNet, visualization bugs and perfor-
mance bugs are ranked in the top and bottom respectively; for
PyTorch, performance bugs and version compatibility bugs are
ranked in the top and bottom respectively; and for TensorFlow,
memory bugs and test bugs are ranked in the top and bottom
respectively. Taking the three DLFs as a whole, deployment
bugs, documentation bugs, and memory bugs are the top 3 bug
types that cost the most OT during bug fixing; build bugs, test
bugs, and processor bugs cost the least OT during bug fixing.
2) Complexity of Code Change in Bug Fixing: We investi-
gated the impact of different types of bugs on the complexity
of code change in terms of LOCM, NOFM, and entropy of
bug-fixing pull requests in the three selected DLFs.
(1) LOCM. The average LOCM and its ranking of each
bug type in the three DLFs are shown in Table V. For
MXNet, memory bugs and performance bugs are ranked in the
top and bottom respectively; for PyTorch, deployment bugs
and documentation bugs are ranked in the top and bottom
respectively; and for TensorFlow, processor bugs and build
bugs are ranked in the top and bottom respectively. Taking
the three DLFs as a whole, memory bugs, algorithm design
bugs, code bugs, and deployment bugs are the top 4 bug
types that need the most LOCM during bug fixing; build
bugs, documentation bugs, and visualization bugs need the
least LOCM during bug fixing.
(2) NOFM. The average NOFM and its ranking of each
bug type in the three DLFs are shown in Table VI. For
MXNet, memory bugs and performance bugs are ranked in the
top and bottom respectively; for PyTorch, deployment bugs
and documentation bugs are ranked in the top and bottom
respectively; and for TensorFlow, version compatibility bugs
and documentation bugs are ranked in the top and bottom
respectively. Taking the three DLFs as a whole, memory bugs,
processor bugs, and algorithm design bugs are the top 3 bug
types that need the most NOFM during bug fixing; perfor-
mance bugs, documentation bugs, test bugs, and visualization
bugs need the least NOFM during bug fixing.
(3) Entropy. The average entropy and its ranking of each
bug type in the three DLFs are shown in Table VII. For
MXNet, processor bugs and memory bugs are ranked in the
top and bottom respectively; for PyTorch, deployment bugs
and documentation bugs are ranked in the top and bottom
respectively; and for TensorFlow, data bugs and documentation
bugs are ranked in the top and bottom respectively. Taking
the three DLFs as a whole, deployment bugs, data bugs, and
processor bugs are the top 3 bug types that have the most
entropy during bug fixing; build bugs, documentation bugs,
and visualization bugs have the least entropy during bug fixing.
3) Complexity of Communication during Bug Fixing: We
present the results regarding the impact of different types of
bugs on the complexity of communication in terms of NODP
and NOC of bug-fixing pull requests in the three DLFs.
(1) NODP. The average NODP and its ranking of each bug
type in the three DLFs are shown in Table VIII. For MXNet,
documentation bugs and performance bugs are ranked in the
top and bottom respectively; for PyTorch, memory bugs and
documentation bugs are ranked in the top and bottom respec-
tively; and for TensorFlow, memory bugs and visualization
bugs are ranked in the top and bottom respectively. Taking
the three DLFs as a whole, memory bugs, deployment bugs,
and processor bugs are the top 3 bug types that have the most
NODP during bug fixing; performance bugs, test bugs, and
code bugs need the least NODP during bug fixing.
(2) NOC. The average NOC and its ranking of each bug
type in the three DLFs are shown in Table IX. For MXNet,
memory bugs and performance bugs are ranked in the top
and bottom respectively; for PyTorch, memory bugs and doc-
umentation bugs are ranked in the top and bottom respectively;
and for TensorFlow, memory bugs and visualization
bugs are ranked in the top and bottom respectively. Taking
the three DLFs as a whole, memory, processor, and algorithm
design bugs are the top 3 bug types that have the most NOC
during bug fixing; version compatibility, documentation, code,
and visualization bugs have the least NOC during bug fixing.
Answer to RQ2: Deployment bugs negatively impact the
development of DLFs the most in terms of open time; de-
ployment bugs negatively impact code change complexity the
most in terms of entropy; memory bugs negatively impact
code change complexity the most in terms of LOCM and
NOFM; and memory bugs negatively impact communication
complexity the most in terms of NODP and NOC.
C. Proportions and Distribution of MPL Bug Fixes (RQ3)
Proportions of MPL bugs. As shown in TABLE X, among
the 189 bugs analyzed in MXNet, 54 are MPL bugs (i.e., bugs whose
bug-fixing pull requests involve multiple PLs), accounting for
28.6%. In PyTorch, 291 out of the analyzed 926 bugs are MPL
TABLE IV
OPEN TIME (IN DAYS) OF DIFFERENT BUG TYPES (RQ2).
MXNet PyTorch TensorFlow Total
Bug type OT Ranking OT Ranking OT Ranking Mean ranking InteRanking
Algorithm design bug 22.92 10 74.70 3 130.32 2 5.00 4
Build bug 40.24 7 30.63 10 33.86 10 9.00 11=
Code bug 46.34 6 57.68 5 47.30 7 6.00 5=
Data bug 55.23 5 39.63 9 68.15 5 6.33 7
Deployment bug 77.84 3 81.51 2 66.53 6 3.67 1
Documentation bug 81.32 2 50.53 6 84.38 4 4.00 2
Memory bug 40.18 8 68.90 4 135.39 1 4.33 3
Performance bug 0.75 12 106.09 1 - - 6.50 8
Processor bug 21.41 11 30.23 11 99.04 3 8.33 10
Test bug 27.45 9 48.62 7 32.07 11 9.00 11=
Version compatibility bug 58.45 4 17.09 12 38.47 8 8.00 9
Visualization bug 82.84 1 43.89 8 37.17 9 6.00 5=
TABLE V
AVERAGE LOCM AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type LOCM Ranking LOCM Ranking LOCM Ranking Mean ranking InteRanking
Algorithm design bug 66.10 2 60.71 2 32.50 4 2.67 2
Build bug 34.72 7 26.22 10 14.58 11 9.33 12
Code bug 39.20 6 54.91 4 30.96 5 5.00 3=
Data bug 34.27 8 46.19 7 29.59 6 7.00 7
Deployment bug 28.14 11 61.97 1 51.68 3 5.00 3=
Documentation bug 43.93 4 18.79 12 15.78 10 8.67 10=
Memory bug 87.75 1 55.33 3 57.17 2 2.00 1
Performance bug 15.00 12 54.05 5 - - 8.50 9
Processor bug 28.50 10 51.52 6 57.69 1 5.67 5
Test bug 39.26 5 23.75 11 18.52 8 8.00 8
Version compatibility bug 58.30 3 26.31 9 25.83 7 6.33 6
Visualization bug 31.33 9 45.75 8 16.18 9 8.67 10=
TABLE VI
AVERAGE NOFM AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type NOFM Ranking NOFM Ranking NOFM Ranking Mean ranking InteRanking
Algorithm design bug 3.00 5 2.99 3 1.97 9 5.67 3
Build bug 3.03 4 1.99 9 2.25 5 6.00 4
Code bug 2.73 7 2.80 5 2.04 7 6.33 5
Data bug 2.31 10 2.87 4 2.18 6 6.67 6=
Deployment bug 2.57 9 3.25 1 1.96 10 6.67 6=
Documentation bug 3.53 2 1.41 12 1.19 11 8.33 11
Memory bug 8.00 1 3.19 2 3.00 3 2.00 1
Performance bug 1.50 12 2.27 7 - - 9.50 12
Processor bug 2.60 8 2.67 6 3.15 2 5.33 2
Test bug 3.16 3 1.78 10 2.04 8 7.00 9=
Version compatibility bug 2.00 11 2.25 8 3.17 1 6.67 6=
Visualization bug 2.78 6 1.67 11 2.36 4 7.00 9=
bugs, accounting for 31.4%. In TensorFlow, 61 out of the 382
analyzed bugs are MPL bugs, accounting for 16.0%.
Distribution of MPL bugs over bug types. As shown
in TABLE X, in MXNet, algorithm design bugs are the bug
type with the largest proportion of MPL bugs, while
deployment bugs, performance bugs, and visualization bugs
are the three bug types that have no MPL bugs; in PyTorch,
data bugs are the bug type with the largest proportion
of MPL bugs, while visualization bugs are the only bug type
that has no MPL bugs; in TensorFlow, processor bugs are the
bug type with the largest proportion of MPL bugs,
while version compatibility bugs are the only bug type that
has no MPL bugs. Performance bugs in TensorFlow are not
discussed here because there are no performance bugs.
Use of PL combinations in MPL bug fixes. The results
of MPL bugs over different PL combinations are shown in
TABLE XI. In MXNet, only the combination of Python and
C/C++ is used in all the MPL bug fixes. In PyTorch, the
combination of Python and C/C++ is used in 268 (92.1%)
MPL bug fixes; the combination of Python and Objective-C
is used in 17 (5.8%) MPL bug fixes; the combination of
C/C++ and Objective-C is used in 3 MPL bug fixes; the
combination of Python, C/C++, and Objective-C is used in
2 MPL bug fixes; and the combination of C/C++, Ruby, and
Objective-C is used in 1 MPL bug fix. In TensorFlow, the
combination of Python and C/C++ is used in 60 (98.4%) MPL
bug fixes, and only one MPL bug fix uses the combination of
Python and Go.
Answer to RQ3: 28.6%, 31.4%, and 16.0% of bug fixes are
MPL fixes in MXNet, PyTorch, and TensorFlow, respectively.
Algorithm design bugs, data bugs, and processor bugs are the
bug types that hold the largest proportions in MXNet, PyTorch,
TABLE VII
AVERAGE ENTROPY AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type Entropy Ranking Entropy Ranking Entropy Ranking Mean ranking InteRanking
Algorithm design bug 0.65 3 0.40 4 0.61 4 3.67 4
Build bug 0.37 11 0.18 11 0.24 10 10.67 11=
Code bug 0.57 5 0.37 6 0.36 8 6.33 6
Data bug 0.62 4 0.51 2 0.67 1 2.33 2
Deployment bug 0.66 2 0.56 1 0.62 3 2.00 1
Documentation bug 0.42 9 0.15 12 0.06 11 10.67 11=
Memory bug 0.25 12 0.33 8 0.64 2 7.33 7
Performance bug 0.44 8 0.34 7 - - 7.50 8
Processor bug 0.80 1 0.46 3 0.54 6 3.33 3
Test bug 0.46 7 0.25 9 0.39 7 7.67 9
Version compatibility bug 0.46 6 0.37 5 0.58 5 5.33 5
Visualization bug 0.39 10 0.25 10 0.32 9 9.67 10
TABLE VIII
AVERAGE NODP AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type NODP Ranking NODP Ranking NODP Ranking Mean ranking InteRanking
Algorithm design bug 4.05 7 5.53 3 6.18 4 4.67 4
Build bug 4.06 6 5.13 7 6.13 5 6.00 5
Code bug 3.93 9 4.96 10 5.91 8 9.00 10
Data bug 3.89 10 5.42 4 6.06 7 7.00 7
Deployment bug 4.57 3 5.58 2 6.12 6 3.67 2
Documentation bug 4.80 1 4.62 12 5.83 9 7.33 9
Memory bug 4.75 2 5.85 1 8.17 1 1.33 1
Performance bug 3.00 12 4.82 11 - - 11.50 12
Processor bug 4.10 5 5.36 5 6.62 2 4.00 3
Test bug 3.79 11 5.01 8 5.65 10 9.67 11
Version compatibility bug 4.00 8 4.96 9 6.33 3 6.67 6
Visualization bug 4.11 4 5.17 6 5.00 11 7.00 7
TABLE IX
AVERAGE NOC AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type NOC Ranking NOC Ranking NOC Ranking Mean ranking InteRanking
Algorithm design bug 4.00 8 6.26 4 4.09 3 5.00 3
Build bug 4.03 7 5.20 7 3.55 4 6.00 5
Code bug 3.87 9 5.15 8 3.22 6 7.67 9=
Data bug 3.51 10 6.37 3 2.90 8 7.00 6=
Deployment bug 4.57 6 6.13 5 3.24 5 5.33 4
Documentation bug 4.60 5 3.97 12 2.23 10 9.00 11
Memory bug 7.75 1 10.74 1 8.67 1 1.00 1
Performance bug 2.00 12 6.86 2 - - 7.00 6=
Processor bug 6.30 3 5.56 6 4.77 2 3.67 2
Test bug 5.32 4 4.99 9 2.30 9 7.33 8
Version compatibility bug 2.70 11 4.50 11 3.17 7 9.67 12
Visualization bug 7.22 2 4.83 10 2.18 11 7.67 9=
and TensorFlow, respectively. The PL combination of Python
and C/C++ is most popular in MPL bug fixes of the DLFs.
D. Impact of the Use of Multiple PLs on Bug Fixing (RQ4)
We further compared the six impact indicators of MPL bug
fixes with those of SPL bug fixes in the DLFs, in order to
understand whether there are significant differences in the
characteristics of MPL and SPL bug fixes. The results
are shown in TABLE XII. (1) In MXNet, the LOCM and
NOFM of MPL bug fixes are significantly larger than the
LOCM and NOFM of SPL bug fixes, respectively, while there
are no significant differences between MPL bug fixes and SPL
bug fixes on OT, entropy, NODP, and NOC. (2) In PyTorch,
every indicator of MPL bug fixes is significantly greater
than that of SPL bug fixes. (3) In TensorFlow, the LOCM,
NOFM, and Entropy of MPL bug fixes are significantly larger
than those of SPL bug fixes, respectively, while there are no
significant differences between MPL bug fixes and SPL bug
fixes on OT, NODP, and NOC.
Answer to RQ4: No impact indicators of MPL bug fixes
are significantly smaller than those of SPL bug fixes. Code
change complexity of MPL bug fixes is significantly greater
than that of SPL bug fixes in terms of LOCM and NOFM.
V. DISCUSSION
A. Interpretation of Study Results
RQ1: (1) Data bugs and algorithm design bugs are very
common in these algorithm intensive software systems. The
most common causes are problems in the input and output
links of operators or neural network layers, e.g., the lack of
input type checks (TensorFlow #13506), the lack of necessary
edge checks (PyTorch #1653), and the output of unexpected
TABLE X
TYPE DISTRIBUTION OF BUGS WITH MPL AND SPL FIXES (RQ3).
MXNet PyTorch TensorFlow
SPL MPL %MPL SPL MPL %MPL SPL MPL %MPL
Algorithm design bug 10 11 52.4 68 51 42.9 30 4 11.8
Build bug 29 3 9.4 106 6 5.4 31 0 0.0
Code bug 12 3 20.0 67 14 17.3 37 8 17.8
Data bug 23 22 48.9 72 83 53.6 62 34 35.4
Deployment bug 7 0 0.0 63 43 40.6 23 2 8.0
Documentation bug 12 3 20.0 27 2 6.9 78 2 2.5
Memory bug 2 2 50.0 18 9 33.3 5 1 16.7
Performance bug 2 0 0.0 17 5 22.7 - - -
Processor bug 5 5 50.0 73 59 44.7 6 7 53.9
Test bug 17 2 10.5 75 8 9.6 21 2 8.7
Version compatibility bug 7 3 30.0 37 11 22.9 18 0 0.0
Visualization bug 9 0 0.0 12 0 0.0 10 1 9.1
All bug types 135 54 28.6 635 291 31.4 321 61 16.0
TABLE XI
PL COMBINATIONS OF MPL BUG FIXES (RQ3).
MXNet PyTorch TensorFlow
PL combination # % # % # %
C/C++, Python 54 100.0 268 92.1 60 98.4
Objective-C, Python 0 0.0 17 5.8 0 0.0
C/C++, Objective-C 0 0.0 3 1.0 0 0.0
Go, Python 0 0.0 0 0.0 1 1.6
C/C++, Objective-C, Python 0 0.0 2 0.7 0 0.0
C/C++, Objective-C, Ruby 0 0.0 1 0.3 0 0.0
results (MXNet #8303). (2) Documentation bugs account for
a relatively large proportion (20.94%) in TensorFlow, which
indicates that during the development of TensorFlow, bugs in
updating documents are often a major difficulty for developers.
This was confirmed by the results of our searches for the key-
word “document” on Stack Overflow regarding the three DLFs:
13 hits returned for MXNet, 174 for PyTorch, and 913
for TensorFlow. (3) In PyTorch, processor bugs
are also a major bug type, which resonates with the great effort
PyTorch developers have spent on data synchronization, parallelism,
and multiprocessing [14]. (4) The percentage of performance
bugs in the three DLFs is very small. We checked the issues
labeled with “performance” in the three DLFs and found that
most of such issues are not considered as bugs by developers.
RQ2: (1) Deployment bugs are ranked on the top in the
integrated ranking with respect to the OT and entropy of bugs,
and memory bugs are ranked on the top in the integrated
ranking with respect to the LOCM and NOFM of bugs.
As evidenced in [31], a larger code change complexity and
longer OT often mean higher maintenance cost of a bug.
In this sense, deployment bugs and memory bugs may incur
much maintenance cost. (2) Memory bugs are ranked on the
top in the integrated ranking with respect to the NODP and
NOC during bug fixing, which means the communication of
fixing memory bugs is most complex in the sense that most
developers are involved and the bugs are most discussed. It
also resonates with the finding that the code change complexity
of memory bugs is the largest in terms of LOCM and NOFM.
RQ3: (1) The proportion of MPL bugs in TensorFlow is
obviously smaller than that in MXNet and PyTorch. One main
potential reason is that the main PL in TensorFlow accounts
for 63.1% (shown in TABLE II), which is much larger than
that in MXNet (48.6%) and PyTorch (49.4%). This implies that
there are far fewer chances for the main PL to work with other
PLs in TensorFlow. (2) The combination of Python and C/C++
is the most used PL combination in fixing MPL bugs of the
selected DLFs. This is also the PL choice of most popular
DLFs. We speculate the reason is that Python is probably the
most comfortable PL for many data scientists and machine
learning experts, and it is easy to integrate C/C++ components
and is compatible with a large number of computing libraries.
They prefer to use C/C++ as the computational part to improve
performance [32]. We conducted a sampling analysis of MPL
fixes, which verified our conjecture that the vast majority of MPL
bug fixes did occur in the back end. When a bug in the back
end is resolved, developers need to test whether the issue is
resolved in the front end.
RQ4: In MXNet, PyTorch, and TensorFlow, it is more
difficult and costlier to fix MPL bugs than SPL bugs, in the
sense that (1) MPL bug fixes have significantly larger code
change complexity than SPL bug fixes in terms of LOCM
and NOFM in the three DLFs, and (2) no impact indicators
of MPL bug fixes are significantly smaller than those of
SPL bug fixes in the three DLFs. This is consistent with the
previous research results on MPL commits in Apache projects
[31], which suggests that the use of multiple PLs has a non-
negligible impact on the development of DLFs.
B. Implications for Researchers
Firstly, we classified the bugs in the selected DLFs based
on bug labels, which helps researchers to gain a general
understanding of the distribution of bugs
in DLFs, thereby facilitating further investigation of these
bugs. Secondly, it is worth further exploring how different
combinations of PLs influence the development of DLFs, since
the results of this exploration can guide the development to a
more efficient way. Thirdly, researchers may put effort into
studying how the combination of C/C++ and Python may
impact MPL bugs, because more than 92% of MPL bugs
involve this PL combination. Finally, MPL bugs should receive
more attention from researchers in the software engineering
community, and not only in the DLF domain. Currently,
MPL bugs are seldom explored. Due to the significant impact
of MPL bugs, they need to be studied in more depth.
TABLE XII
MANN-WHITNEY U TEST RESULTS REGARDING IMPACTS OF THE USE OF MULTIPLE PLS ON THE FIXES OF BUGS (RQ4).
MXNet PyTorch TensorFlow
SPL bugs MPL bugs p-value SPL bugs MPL bugs p-value SPL bugs MPL bugs p-value
OT 51.51 47.01 0.643 48.21 59.24 <0.001 73.04 87.00 0.227
LOCM 35.01 57.63 <0.001 34.37 76.36 <0.001 27.69 36.80 <0.001
NOFM 2.25 4.41 <0.001 2.07 3.95 <0.001 1.89 2.67 <0.001
Entropy 0.53 0.71 0.059 0.34 0.57 <0.001 0.39 0.80 <0.001
NODP 3.93 4.31 0.479 5.16 5.62 <0.001 6.13 5.93 0.267
NOC 3.91 5.07 0.057 5.44 6.96 <0.001 3.30 2.80 0.837
C. Implications for Practitioners
Firstly, our research results will enable developers to un-
derstand the distribution of bugs in the DLFs and which types
of bugs are more cost-effective to resolve first. We recommend that
developers employ our bug classification of DLFs in practice,
thereby facilitating bug management by taking into account
the impacts of different bug types. Besides, developers could
choose more experienced developers to fix more complex
bug types (e.g., memory bugs and deployment bugs) that can
be identified through bug descriptions or related bug labels.
For bugs with relatively low fix complexity, new developers
can be assigned to fix them, thereby reducing the bug fixing
cost. Meanwhile, the developers of DLFs usually leverage the
advantages of different PLs to implement DLFs; however, the
developers should be aware that the use of multiple PLs
for implementing DLFs is likely to increase the complexity of
code changes, resulting in higher maintenance cost.
VI. THREATS TO VALIDITY
There are several threats to the validity of the study results.
We discuss these threats according to the guidelines in [25].
A. Construct Validity
Construct validity is concerned with whether the values of
the variables (listed in TABLE I) we obtained are consistent
with the real values that we expected. A potential threat is that
not all bugs resolved are explicitly linked to corresponding pull
requests, which may negatively affect the representativeness of
the collected bugs. Through our analysis, we confirmed that
the bugs with explicit links to corresponding pull requests were
not reported in a short time span and were not fixed by a small
group of developers. Therefore, this threat is to some extent
mitigated. Another possible threat is that different developers
may have biases during manual analysis, which may lead to
inconsistent classification and incorrect bug tagging results. To
reduce this threat, we adopted a four-step bug classification
process, in which a pilot bug tagging process was adopted to
identify and resolve possible disagreements with bug types (as
described in Section III-E1).
B. External Validity
External validity is concerned with the generalizability of
the study results. A potential threat is whether the selected
DLFs are representative enough. To select appropriate DLFs,
we set a number of criteria as described in Section III-C,
and the finally selected DLFs are MXNet, PyTorch, and
TensorFlow. These three DLFs are very popular and support a
wide range of DL applications. Thus, this threat is mitigated
to some extent. Another possible threat is that more than
92% MPL bug fixes use the PL combination of Python and
C/C++, which means significant imbalance regarding the use
of multiple PLs. However, the PL combination of Python and
C/C++ is a natural choice in the development of DLFs by trading
off the usability and performance of the two PLs.
C. Reliability
Reliability is concerned with whether the same results can
be obtained when the study is replicated by other researchers.
A potential threat is related to the implementation of our soft-
ware tool for data collection. The tool was implemented by the
second author, and the code of the key functionalities had been
reviewed by the first and third authors. In addition, sufficient
tests were performed to ensure that the calculation of data
items is correct. Thus, this threat has been alleviated. Another
threat is related to the correctness of the Mann–Whitney U
tests. Since only IBM SPSS was used to run the tests, this
threat is minimized.
VII. CONCLUSIONS AND FUTURE WORK
In this work, we conducted an empirical study on bugs in
three mainstream DLFs (i.e., MXNet, PyTorch, and Tensor-
Flow) and their impacts on the development of such DLFs.
We collected 75114 issues and 92842 pull requests of the
DLFs, and obtained 1497 bugs after a set of filterings. We
manually analyzed the 1497 bugs, and got the following
findings. (1) The bugs in DLFs can be classified into 12
types, e.g., algorithm design bugs, data bugs, and memory
bugs. Among the 12 bug types, data bugs account for the
largest proportion in all the three DLFs. (2) Deployment bugs
and memory bugs negatively impact the development of DLFs
the most in various aspects. (3) 28.6%, 31.4%, and 16.0%
of bugs in MXNet, PyTorch, and TensorFlow are MPL bugs
respectively, and the PL combination of Python and C/C++ is
most used in fixing more than 92% of MPL bugs in the three
DLFs. (4) The code change complexity of MPL bug fixes is
significantly greater than that of SPL bug fixes in the three
DLFs, while in PyTorch MPL bug fixes have longer open time
and greater communication complexity. In the future, we plan
to expand the obtained dataset in this study, and construct bug
prediction models for MPL bugs in DLFs. In addition, we are
also interested in investigating MPL bugs in other application
domains in depth.
REFERENCES
[1] F. Thung, S. Wang, D. Lo, and L. Jiang, “An empirical study of
bugs in machine learning systems,” in Proceedings of the 23rd IEEE
International Symposium on Software Reliability Engineering (ISSRE).
IEEE, 2012, pp. 271–280.
[2] Y. Yang, T. He, Z. Xia, and Y. Feng, “A comprehensive empirical study
on bug characteristics of deep learning frameworks,” Information and
Software Technology, vol. 151, p. 107004, 2022.
[3] J. Chen, Y. Liang, Q. Shen, and J. Jiang, “Toward understanding deep
learning framework bugs,” arXiv preprint arXiv:2203.04026, 2022.
[4] L. Quan, Q. Guo, X. Xie, S. Chen, X. Li, and Y. Liu, “Towards
understanding the faults of javascript-based deep learning systems,”
in Proceedings of the 37th IEEE/ACM International Conference on
Automated Software Engineering (ASE). ACM, 2022, pp. 1–13.
[5] K. Huang, B. Chen, S. Wu, J. Cao, L. Ma, and X. Peng, “De-
mystifying dependency bugs in deep learning stack,” arXiv preprint
arXiv:2207.10347, 2022.
[6] J. Cao, B. Chen, C. Sun, L. Hu, S. Wu, and X. Peng, “Understanding
performance problems in deep learning systems,” in Proceedings of
the 30th ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/FSE).
ACM, 2022, pp. 357–369.
[7] Y. Zhang, Y. Chen, S.-C. Cheung, Y. Xiong, and L. Zhang, “An empirical
study on tensorflow program bugs,” in Proceedings of the 27th ACM
SIGSOFT International Symposium on Software Testing and Analysis
(ISSTA). ACM, 2018, pp. 129–140.
[8] M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive
study on deep learning bug characteristics,” in Proceedings of the 27th
ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/FSE).
ACM, 2019, pp. 510–520.
[9] M. Grichi, E. E. Eghan, and B. Adams, “On the impact of multi-
language development in machine learning frameworks, in Proceedings
of the 36th IEEE International Conference on Software Maintenance and
Evolution (ICSME). IEEE, 2020, pp. 546–556.
[10] A. Shatnawi, H. Mili, M. Abdellatif, Y.-G. Guéhéneuc, N. Moha,
G. Hecht, G. E. Boussaidi, and J. Privat, “Static code analysis of mul-
tilanguage software systems,” arXiv preprint arXiv:1906.00815, 2019.
[11] M. Kargar, A. Isazadeh, and H. Izadkhah, “Multi-programming language
software systems modularization,” Computers & Electrical Engineering,
vol. 80, p. 106500, 2019.
[12] A. Shatnawi, H. Mili, M. Abdellatif, Y.-G. Guéhéneuc, N. Moha,
G. Hecht, G. E. Boussaidi, and J. Privat, “Static code analysis of mul-
tilanguage software systems,” arXiv preprint arXiv:1906.00815, 2019.
[13] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu,
C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine
learning library for heterogeneous distributed systems,” arXiv preprint
arXiv:1512.01274, 2015.
[14] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imper-
ative style, high-performance deep learning library,” in Proceedings of
the 32nd Annual Conference on Neural Information Processing Systems
(NeurIPS). NeurIPS Proceedings, 2019, pp. 8024–8035.
[15] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for
large-scale machine learning,” in Proceedings of the 12th USENIX
Symposium on Operating Systems Design and Implementation (OSDI).
ACM, 2016, pp. 265–283.
[16] L. Jia, H. Zhong, X. Wang, L. Huang, and X. Lu, “The symptoms, causes, and repairs of bugs inside a deep learning library,” Journal of Systems and Software, vol. 177, p. 110935, 2021.
[17] C. B. Seaman, F. Shull, M. Regardie, D. Elbert, R. L. Feldmann,
Y. Guo, and S. Godfrey, “Defect categorization: making use of a
decade of widely varying historical data,” in Proceedings of the 2nd
ACM/IEEE International Symposium on Empirical Software Engineering
and Measurement (ESEM). ACM, 2008, pp. 149–157.
[18] K. Charmaz, Constructing Grounded Theory: A Practical Guide through
Qualitative Analysis. SAGE, 2006.
[19] B. Ray, D. Posnett, V. Filkov, and P. Devanbu, “A large scale study of programming languages and code quality in GitHub,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE). ACM, 2014, pp. 155–165.
[20] E. D. Berger, C. Hollenbeck, P. Maj, O. Vitek, and J. Vitek, “On the impact of programming languages on code quality: A reproduction study,” ACM Transactions on Programming Languages and Systems, vol. 41, no. 4, pp. 1–24, 2019.
[21] P. S. Kochhar, D. Wijedasa, and D. Lo, “A large scale study of multiple programming languages and code quality,” in Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 2016, pp. 563–573.
[22] M. Abidi, F. Khomh, and Y.-G. Guéhéneuc, “Anti-patterns for multi-language systems,” in Proceedings of the 24th European Conference on Pattern Languages of Programs (EuroPLoP). ACM, 2019, pp. 1–14.
[23] M. Abidi, M. S. Rahman, M. Openja, and F. Khomh, “Are multi-language design smells fault-prone? An empirical study,” ACM Transactions on Software Engineering and Methodology, vol. 30, no. 3, pp. 1–56, 2021.
[24] Z. Li, X. Qi, Q. Yu, P. Liang, R. Mo, and C. Yang, “Exploring
multi-programming-language commits and their impacts on software
quality: An empirical study on apache projects,” Journal of Systems
and Software, vol. 194, p. 111508, 2022.
[25] P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Software Engineering, vol. 14, no. 2, pp. 131–164, 2009.
[26] V. R. Basili, “Software modeling and measurement: The goal/question/metric paradigm,” 1992. [Online]. Available: http://drum.lib.umd.edu/bitstream/1903/7538/1/Goal Question Metric.pdf
[27] A. E. Hassan, “Predicting faults using the complexity of code changes,”
in Proceedings of the 31st IEEE International Conference on Software
Engineering (ICSE). IEEE, 2009, pp. 78–88.
[28] Z. Li, P. Liang, D. Li, R. Mo, and B. Li, “Is bug severity in line
with bug fixing change complexity?” International Journal of Software
Engineering and Knowledge Engineering, vol. 30, no. 11&12, pp. 1779–
1800, 2020.
[29] A. J. Viera, J. M. Garrett et al., “Understanding interobserver agreement: The kappa statistic,” Family Medicine, vol. 37, no. 5, pp. 360–363, 2005.
[30] Z. Li, S. Wang, W. Wang, P. Liang, R. Mo, and B. Li, “Datasets for ‘Understanding Bugs in Multi-Language Deep Learning Frameworks’,” 2022. [Online]. Available: https://anonymous.4open.science/r/dataItem-B377
[31] Z. Li, Q. Yu, P. Liang, R. Mo, and C. Yang, “Interest of defect technical
debt: An exploratory study on apache projects,” in Proceedings of
the 36th IEEE International Conference on Software Maintenance and
Evolution (ICSME). IEEE, 2020, pp. 629–639.
[32] R. Mu and X. Zeng, “A review of deep learning research,” KSII
Transactions on Internet and Information Systems, vol. 13, no. 4, pp.
1738–1764, 2019.