Understanding Bugs in Multi-Language Deep
Learning Frameworks
Zengyang Li†, Sicheng Wang†, Wenshuo Wang†, Peng Liang‡∗, Ran Mo†, Bing Li‡
†School of Computer Science & Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning,
Central China Normal University, Wuhan, China
‡School of Computer Science, Wuhan University, Wuhan, China
{zengyangli, moran}@ccnu.edu.cn, {scwang1998, wenshuowang}@mails.ccnu.edu.cn, {liangp, bingli}@whu.edu.cn
Abstract—Deep learning frameworks (DLFs) have been playing
an increasingly important role in this intelligence age since they
act as a basic infrastructure for an increasingly wide range of AI-
based applications. Meanwhile, as multi-programming-language
(MPL) software systems, DLFs inevitably suffer from
bugs caused by the use of multiple programming languages
(PLs). Hence, it is of paramount significance to understand
the bugs (especially the bugs involving multiple PLs, i.e., MPL
bugs) of DLFs, which can provide a foundation for preventing,
detecting, and resolving bugs in the development of DLFs. To
this end, we manually analyzed 1497 bugs in three MPL DLFs,
namely MXNet, PyTorch, and TensorFlow. First, we classified
bugs in these DLFs into 12 types (e.g., algorithm design bugs and
memory bugs) according to their bug labels and characteristics.
Second, we further explored the impacts of different bug types
on the development of DLFs, and found that deployment bugs
and memory bugs negatively impact the development of DLFs in
different aspects the most. Third, we found that 28.6%, 31.4%,
and 16.0% of bugs in MXNet, PyTorch, and TensorFlow are
MPL bugs, respectively; the PL combination of Python and
C/C++ is used in fixing more than 92% of MPL bugs in all
DLFs. Finally, the code change complexity of MPL bug fixes is
significantly greater than that of single-programming-language
(SPL) bug fixes in all the three DLFs, while in PyTorch MPL
bug fixes have longer open time and greater communication
complexity than SPL bug fixes. These results provide insights
for bug management in DLFs.
Index Terms—Deep Learning Framework, Bug, Multiple-
Programming-Language Software System, Empirical Study
I. INTRODUCTION
Deep learning frameworks (DLFs), such as PyTorch, have
been playing an increasingly important role in this intelligence
age since they act as basic infrastructures for an increasingly
wide range of AI-based applications. DLFs provide building
blocks for designing, training, and validating deep neural
network models through a high-level programming interface.
Therefore, the reliability of DLFs becomes more and more im-
portant for the fast-growing AI-based applications. To ensure
their reliability, a necessary step is to understand the charac-
teristics and impact of bugs in DLFs. The previous research
on bugs in DLFs is mainly divided into two categories: the
This work was funded by the Natural Science Foundation of Hubei
Province of China under Grant No. 2021CFB577, the National Natural
Science Foundation of China under Grant Nos. 62002129 and 62172311, and
the Knowledge Innovation Program of Wuhan-Shuguang Project under Grant
No. 2022010801020280.
∗Corresponding author
first category studies the bugs in the implementation of DLFs,
e.g., bug categorization, severity, symptoms, root causes, and
impacts in various DLFs [1][2][3][4]; the second category
studies the bugs in the use of specific DLFs, e.g., dependency
and performance bugs in deep learning (DL) systems in terms
of symptoms, causes, and fix modes [5][6][7][8].
Popular DLFs are usually written in multiple programming
languages (PLs) [9], such as TensorFlow, which is mainly
written in Python and C++. Previous research suggests that
static code analysis in a multi-programming-language (MPL)
software system is much more difficult than in a single-
programming-language (SPL) one [10][11]; meanwhile, MPL software
systems usually face challenges in understanding multiple PLs and
cross-language communication [11][12]. As MPL software systems,
DLFs inevitably suffer from bugs caused by the use of multiple PLs.
Therefore, it is of paramount significance to understand the
bugs (especially the bugs involving multiple PLs, i.e., MPL
bugs) of DLFs, which can provide a foundation for preventing,
detecting, and resolving bugs in the development of DLFs.
In this paper, we investigated the bugs in DLFs. Specifically,
we conducted an empirical study on the bugs and their cor-
responding fixes in three popular DLFs, namely MXNet [13],
PyTorch [14], and TensorFlow [15] on GitHub. The purpose
of this work is to systematically understand the bugs and their
impacts in the development of DLFs, with a focus on MPL
bugs. The main contributions of this work are threefold:
• We conducted an empirical study by manual analysis
of 1497 bugs and their corresponding bug fixes from
three popular MPL DLFs, namely MXNet, PyTorch, and
TensorFlow.
• We classified these bugs based on the labels tagged
by developers and bug characteristics, and explored the
impacts of bugs on the DLF development in terms of the
open time of bugs, the code change complexity of bug
fixes, and the communication complexity of developers
during bug fixing.
• We explored the MPL bugs and their impact on the three
DLFs. Specifically, we looked into the proportion of MPL
bugs and the distribution of PL combinations used in each
DLF, and the difference between SPL and MPL bugs
regarding their impact on the DLFs.
The remainder of this paper is organized as follows. Section
II presents the related work; Section III describes the design
of the empirical study; Section IV presents the results of
the study; Section V discusses the study results; Section VI
identifies the threats to validity of the results; and Section VII
concludes this work with future research directions.
II. RELATED WORK
A. Bug Classification of DLFs
In past years, a number of researchers tried to classify bugs
due to different research objectives and perspectives. Islam et
al. identified five types of bugs (e.g., API bugs and structural
bugs) in DLFs by analyzing 2,716 Stack Overflow posts and 500
bug-fixing commits from GitHub [8]. According to the location
of the buggy source code, Jia et al. obtained a preliminary
bug classification of TensorFlow, since TensorFlow tends
to place its source files in different directories according to
their functions [16]. Seaman et al. derived several defect
classification schemes applicable to most projects [17]; Thung
et al. later added a new category, namely configuration, to the
scheme and obtained their own classification [1].
Yang et al. summarized the reference architecture of DLFs,
based on which they built a bug classification of DLFs [2].
Different from the previous classification methods of bugs in
DLFs, we employed grounded theory [18] and took the bug
labels assigned by developers into consideration. Therefore,
we obtained a more comprehensive bug classification of DLFs
with four newly identified bug types (e.g., deployment and
version compatibility bugs). We believe that our bug classification
is close to the rationale that developers of DLFs use when
classifying bug reports, which makes it easier for developers to
understand which types of bugs will be more costly to fix.
B. Impact of the Use of Multiple PLs on Software Systems
Recently, more and more researchers have begun to pay
attention to the impact of the use of multiple PLs on software
systems. Ray et al. found a correlation between 11 PLs and
software quality in 729 projects hosted on GitHub [19]. Berger
et al. repeated the research of Ray et al. and found that
only four PLs were statistically significantly associated with
bugs, and the association was very small [20]. Kochhar et al.
collected a large dataset consisting of 628 projects to study the
impact of different PLs on the number of bug fixing commits
[21]. They found that implementing a project with more PLs
would increase its defect proneness. Abidi et al. analyzed the
source code of MPL systems and identified six anti-patterns
[22] and twelve code smells [23]. Li et al. analyzed 18 Apache
MPL software projects and confirmed that the use of multiple
PLs is associated with increased development difficulty and
decreased software quality [24]. Our work further explored the
impact of multiple PLs on DLFs.
III. STUDY DESIGN
In order to investigate the bugs in DLFs, we performed
an empirical study on mainstream DLFs. In this section,
we describe the study, which was designed and reported by
following the guidelines proposed by Runeson and Höst [25].
A. Objective and Research Questions
The goal of this study, described using the Goal-Question-
Metric (GQM) approach [26], is: to analyze bugs and their
corresponding bug-fixing pull requests for the purpose of
investigation with respect to the bugs with a focus on MPL
bugs in DLFs as well as their impacts on DLF development,
from the point of view of software developers in the context
of the development of MPL DLFs.
Based on the aforementioned goal, we formulated four
research questions (RQs), in which RQ1 and RQ2 focus on
the bugs in general in DLFs while RQ3 and RQ4 pay more
attention to MPL bugs in DLFs. The RQs are described as
follows:
- RQ1: What is the classification of bugs in DLFs?
Rationale: By classifying bugs in DLFs, we can gain a better
understanding of the causes and distribution of bugs in DLFs,
so that we can conduct an in-depth investigation of each type
of bug.
- RQ2: What are the impacts of different types of bugs on the
development of DLFs?
Rationale: Different types of bugs may have different impacts
on the development of DLFs. We study the impacts in the
aspects of the open time of bugs, the complexity of code
change in bug fixing, and the complexity of communication
during bug fixing. The answer to this RQ can help recognize
the types of bugs that have the greatest influence on the devel-
opment of DLFs, which can be further used to guide the bug
management in the development of DLFs.
- RQ3: What is the proportion of MPL bugs in DLFs? How
are MPL bugs distributed over bug types and PL combinations?
Rationale: By investigating the proportion of MPL bugs in
DLFs, we can understand the prevalence of multiple PLs used
for bug fixing in DLFs; by looking into the distribution of
MPL bugs over bug types and PL combinations, we can get
to know the tendency of MPL bugs among bug types and
the popularity of different PL combinations used in MPL bug
fixing in DLFs.
- RQ4: How does the use of multiple PLs affect the bug fixing
of DLFs?
Rationale: With this RQ, we investigate whether the use of
multiple PLs can cause additional costs for the bug fixing of
DLFs, so as to analyze the impact of the use of multiple PLs
on DLFs.
B. Cases and Unit Analysis
This study investigates DLFs, i.e., cases, and each bug and
the corresponding bug-fixing pull request is a single unit of
analysis (also called a data unit).
C. Case Selection
In this study, we only investigated DLFs hosted on GitHub.
The reason we used GitHub projects is that most of the DLFs
are open source on GitHub. At the same time, this can ensure
that all bugs in different DLFs have the same format, so that
we can handle the bugs in the same way. To select each case
included in our study, we employed five inclusion criteria:
• C1: The source code written in the main PL accounts for
no more than 75% of the code of the project. This criterion was
set to ensure that the PL use is not extremely unbalanced
so that the biases caused by specific PLs can be reduced.
• C2: It has more than 10k stars, which indicates that it is
a popular DLF and has a large group of users.
• C3: The DLF has more than 2,000 pull requests. This
criterion was set to ensure that the selected DLF is
nontrivial, and that the resulting dataset is big enough
to be statistically analyzed.
• C4: The number of issues of the project is no less than
1,000. This criterion was set to ensure that the selected
DLF had been in the maintenance and evolution stage
for a reasonable length of period, and thus sufficient data
about bug introduction can be collected.
• C5: The DLF has more than 150 pairs of bugs and corre-
sponding bug-fixing pull requests. It was set to ensure that
the resulting dataset is big enough for statistical analysis.
The thresholds of C1, C3, and C4 were set according to [24],
and those of C2 and C5 were set based on our experience.
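To make the selection criteria concrete, the following is a minimal sketch (not our actual selection tooling) of how C1-C4 can be checked for a candidate repository via the GitHub GraphQL API; the access token is a placeholder, the main-PL share is approximated from GitHub's language statistics, and C5 is omitted because it requires the bug and bug-fixing pull request pairing described in Section III-D.

```python
import requests

QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    stargazerCount
    issues { totalCount }
    pullRequests { totalCount }
    languages(first: 20, orderBy: {field: SIZE, direction: DESC}) {
      totalSize
      edges { size node { name } }
    }
  }
}
"""

def meets_criteria(owner, name, token):
    """Check C1-C4 for one repository (illustrative approximation only)."""
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"owner": owner, "name": name}},
        headers={"Authorization": f"bearer {token}"},
    )
    repo = resp.json()["data"]["repository"]
    langs = repo["languages"]
    main_pl_share = langs["edges"][0]["size"] / langs["totalSize"]  # approximated main PL share
    return (
        main_pl_share <= 0.75                              # C1: main PL no more than 75% of the code
        and repo["stargazerCount"] > 10_000                # C2: more than 10k stars
        and repo["pullRequests"]["totalCount"] > 2_000     # C3: more than 2,000 pull requests
        and repo["issues"]["totalCount"] >= 1_000          # C4: no less than 1,000 issues
    )
```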
D. Data Collection
1) Data Items to be Collected: To answer the RQs, we take
a bug and its corresponding pull request as the analysis unit,
and list the data items to be collected in Table I. For each bug,
we need to calculate the open time (OT), lines of source code
modified (LOCM), number of files modified (NOFM), entropy,
number of developers participating (NODP), and number of
comments in the pull request (NOC) to quantify the impact
of the bug on the development of a DLF. Since most data
items are easy to understand, we only explain the definition
of entropy in detail [27].
Suppose that the source files modified in a pull request for fixing a bug are $\{f_1, f_2, \dots, f_n\}$. File $f_i$ ($1 \le i \le n$) is modified $m_i$ times in the pull requests during the last 60 days [28] before the current bug-fixing pull request. Let $p_i = m_i / \sum_{j=1}^{n} m_j$. Then, the entropy is $H(n) = -\sum_{i=1}^{n} p_i \log_2 p_i$. Since the number of modified source files differs between different bug-fixing pull requests, we need to normalize the entropy to make it comparable. Considering that $H(n)$ achieves its maximum of $\log_2 n$ when $p_i = 1/n$ ($1 \le i \le n$), the normalized entropy is

$$\tilde{H}(n) = \begin{cases} H(n)/\log_2 n, & n > 1,\\ 0, & n = 1. \end{cases} \qquad (1)$$
The entropy of a bug fix reflects the uncertainty faced by the
bug fix. A larger entropy means a bigger uncertainty, resulting
in a higher code change complexity.
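As a small worked example of Eq. (1), consider a hypothetical bug-fixing pull request that modifies three files which were changed 6, 3, and 1 times, respectively, in the pull requests of the preceding 60 days; the Python sketch below computes the normalized entropy for this fix.

```python
import math

def normalized_entropy(mod_counts):
    """Normalized entropy of the files modified in a bug-fixing pull request (Eq. (1))."""
    n = len(mod_counts)
    total = sum(mod_counts)
    if n == 1 or total == 0:
        return 0.0
    probs = [m / total for m in mod_counts]
    h = -sum(p * math.log2(p) for p in probs if p > 0)  # H(n)
    return h / math.log2(n)                             # normalize by the maximum log2(n)

# p = (0.6, 0.3, 0.1) gives H(3) ~= 1.295, normalized by log2(3) ~= 1.585
print(normalized_entropy([6, 3, 1]))  # ~0.82: the recent changes are fairly scattered over the files
```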
2) Data Collection Procedure: To collect the data items
in Table I, we developed a dedicated tool based on GitHub
GraphQL APIs to extract the required data from the repository
of each DLF on GitHub. Specifically, we collected the data in
the following steps:
Step 1: Export available issue reports. Extract all related issues
and corresponding pull requests in the project.
TABLE I
DATA ITEMS TO BE COLLECTED.
# Name Description Relevant RQ
D1 IssueID The identity number of the issue. -
D2 IssueLabels The labels of the issue. RQ1, RQ3
D3 IssueTitle The title of the issue. RQ1
D4 IssueContent The description of the issue. RQ1
D5 IssueCreatedAt The creation time of the issue. RQ2, RQ4
D6 IssueClosedAt The close time of the issue. RQ2, RQ4
D7 PrID The identity number of the pull request
associated with the issue. -
D8 PrLabels The labels of the pull request. RQ1, RQ3
D9 PrTitle The title of the pull request. RQ1
D10 PrContent The description of the pull request. RQ1
D11 Files The files modified in the pull request. RQ2- RQ4
D12 OT The open time of the issue, i.e., OT=D6-D5. RQ2, RQ4
D13 LOCM The number of lines of source code modified
in the pull request. RQ2, RQ4
D14 NOFM The number of files modified in the pull request. RQ2, RQ4
D15 Entropy The normalized entropy of the modified files
in the pull request. RQ2, RQ4
D16 NODP The number of developers participating in the
pull request (excluding robots). RQ2, RQ4
D17 NOC The number of comments in the pull request
(excluding robots). RQ2, RQ4
D18 IsMPLF Whether the pull request is a MPL fix, i.e.,
involving source files in multiple PLs. RQ3, RQ4
Step 2: Filter bug reports. An issue is kept if it satisfies any one
of the following conditions: the issue or the pull request connected
to it has at least one label whose name or description contains the
word ”Bug”; the word ”Bug”, ”bug”, or ”BUG” appears in the issue
title; or the issue description adopts a bug report template. In this
way, we obtain the bug reports we need from all available reports.
Step 3: Filter available bugs. The status of the issue must be
“closed” and the status of the pull request corresponding to
the issue must be “merged”. This ensures that the bug has an
effective fix.
Step 4: Remove abnormal data. An abnormal data unit means
that a bug does not strictly correspond to a single bug-fixing
pull request (e.g., a pull request corresponds to multiple bugs,
or a bug corresponds to multiple pull requests). Such data would
lead to inaccurate results regarding the impact of a bug on the project.
Step 5: Calculate the data items listed in Table I.
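The sketch below illustrates Steps 2 and 3 and the open time calculation (D12), under the assumption that the records exported in Step 1 are available as plain dictionaries; the field names and the bug-report-template markers are illustrative assumptions, not the exact output format of our tool.

```python
import re
from datetime import datetime

# Assumed markers of a bug report template (placeholders; the real templates differ per DLF).
BUG_TEMPLATE_MARKERS = ("Describe the bug", "Bug Report")

def is_bug_report(issue, pr):
    """Step 2: keep an issue if any of the three filtering conditions holds."""
    labels = issue.get("labels", []) + pr.get("labels", [])
    if any("bug" in (l.get("name", "") + " " + l.get("description", "")).lower() for l in labels):
        return True
    if re.search(r"\bbug\b", issue.get("title", ""), re.IGNORECASE):
        return True
    return any(marker in issue.get("body", "") for marker in BUG_TEMPLATE_MARKERS)

def is_valid_unit(issue, pr):
    """Step 3: the issue must be closed and the associated pull request merged."""
    return issue.get("state") == "closed" and pr.get("state") == "merged"

def open_time_days(issue):
    """D12: OT = IssueClosedAt - IssueCreatedAt, in days."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(issue["createdAt"], fmt)
    closed = datetime.strptime(issue["closedAt"], fmt)
    return (closed - created).total_seconds() / 86400

# Step 4 (not shown): drop units where a bug maps to several pull requests or vice versa,
# e.g., by counting issue-PR links and keeping only strict one-to-one pairs.
```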
E. Data Analysis
1) Data Analysis for RQ1: For RQ1, to obtain the bug
classification in DLFs, we followed a four-step process below.
Step 1: Preliminarily classify bugs based on bug labels. This
step is to get a preliminary bug classification. (1) We collected
all the labels of bugs in the selected DLFs through a dedicated
tool. (2) We then examined relevant official documents and
label descriptions of bugs to have a deep understanding of
the labels. (3) Based on the bug labels and characteristics, the
second and third authors classified bug labels separately. Then,
the first author reviewed the label classification and solved
disagreements through discussion with the second and third
authors. Finally, we obtained a preliminary classification of
the bugs in DLFs.
Step 2: Construct a relatively stable bug classification. The
first two authors manually analyzed a set of sampled bugs
with the help of grounded theory [18]. When the classifica-
tion obtained in Step 1 was found inappropriate during the
bug analysis, the first two authors tried to improve the bug
classification through discussion. In this step, new bug types
might arise, existing bug types might be removed, and the
bug classification would be updated accordingly. Finally, a
relatively stable classification was obtained.
Step 3: Conduct a pilot bug tagging. This step is to get a
stable bug classification and reach a common understanding on
each bug type in the bug classification. We randomly selected
10% of the bugs, and the second and third authors separately
tagged each of them with an appropriate bug type from the
classification obtained in the previous steps. If there was any
disagreement on bug tagging, the second and third authors
discussed it with the first author to reach a consensus. We
used Cohen’s Kappa [29] to measure the consistency between
the bug tagging results of the two authors. When the Cohen’s
Kappa value was less than 90%, the first three authors discussed
to resolve the disagreements and randomly extracted another
10% of the bugs for another round of bug tagging. The bug
tagging was thus an iterative process, which stopped when the
Cohen’s Kappa value exceeded 90%.
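For the consistency check in this step, Cohen's Kappa can be computed directly from the two taggers' labels, e.g., with scikit-learn; the sketch below uses purely illustrative labels.

```python
from sklearn.metrics import cohen_kappa_score

# Bug types assigned independently by the two taggers to the same sampled bugs (illustrative).
tags_author2 = ["data", "build", "memory", "data", "processor", "test"]
tags_author3 = ["data", "build", "memory", "code", "processor", "test"]

kappa = cohen_kappa_score(tags_author2, tags_author3)
if kappa < 0.90:
    print(f"Kappa = {kappa:.2f}: discuss disagreements and tag another 10% sample")
else:
    print(f"Kappa = {kappa:.2f}: agreement is sufficient, proceed to Step 4")
```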
Step 4: Classify the remaining bugs. The second author
classified the remaining bugs into different bug types.
2) Data Analysis for RQ2-RQ4: To answer RQ2, we inves-
tigated the impact of different types of bugs on DLF develop-
ment through six impact indicators (i.e., data items D12-D17)
in three aspects: (1) open time of bugs, i.e., OT (D12); (2)
complexity of code change in bug fixing, including LOCM
(D13), NOFM (D14), and entropy (D15); and (3) complexity
of communication during bug fixing, including NODP (D16)
and NOC (D17). For each indicator, we calculated the average
value and ranking of each bug type in each DLF, and also
calculated the mean ranking for all the selected DLFs by
averaging their ranking numbers. The integrated ranking (In-
teRanking) numbers for all bug types are assigned according
to their mean rankings. To answer RQ3, we examined the
extensions of the modified source files in the bug-fixing pull
requests to identify the MPL and SPL fixes, and calculated
the bug type distribution of bugs with these MPL and SPL
fixes. In addition, we calculated the distribution of MPL bugs
over different PL combinations. Similar to the PLs examined
in [24], we only considered the following 18 general-purpose
PLs: C/C++, C#, Clojure, CoffeeScript, Erlang, Go, Haskell,
Java, JavaScript, Kotlin, Objective-C, Perl, PHP, Python, Ruby,
Scala, Swift, and TypeScript. To answer RQ4, we studied
the difference between MPL bug fixes and SPL bug fixes
regarding their impact on DLF development through Mann-
Whitney U tests. Since the data of the variables to be tested do
not necessarily follow a specific distribution, it is reasonable
to use the Mann-Whitney U test – a non-parametric test –
in this study. A test result was considered significant at p-value < 0.05,
which means that the tested groups have a significant difference.
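The two analysis steps for RQ3 and RQ4 can be sketched as follows: a bug fix is classified as MPL (D18) from the extensions of its modified source files, and an impact indicator of the MPL and SPL groups is then compared with the Mann-Whitney U test. The extension-to-PL map is deliberately partial and the indicator values are made up for illustration.

```python
import os
from scipy.stats import mannwhitneyu

# Partial, illustrative mapping from file extensions to the considered general-purpose PLs.
EXT_TO_PL = {
    ".py": "Python",
    ".c": "C/C++", ".cc": "C/C++", ".cpp": "C/C++", ".h": "C/C++", ".hpp": "C/C++",
    ".go": "Go", ".java": "Java", ".js": "JavaScript", ".rb": "Ruby", ".m": "Objective-C",
}

def is_mpl_fix(modified_files):
    """D18: a fix is MPL if its modified source files involve more than one general-purpose PL."""
    pls = {EXT_TO_PL[os.path.splitext(f)[1]]
           for f in modified_files if os.path.splitext(f)[1] in EXT_TO_PL}
    return len(pls) > 1

# Compare one impact indicator (here LOCM, with made-up values) between MPL and SPL fixes.
locm_mpl = [120, 75, 44, 310, 66]
locm_spl = [12, 30, 8, 55, 21, 40]
_, p_value = mannwhitneyu(locm_mpl, locm_spl, alternative="two-sided")
print(f"p = {p_value:.3f}" + (" (significant at 0.05)" if p_value < 0.05 else ""))
```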
IV. STUDY RESULTS
Through case selection, we got three DLFs, i.e., MXNet,
PyTorch, and TensorFlow, whose demographic information is
shown in Table II. To be specific, 189, 926, and 382 data units
for analysis were collected from the whole bug set of MXNet,
PyTorch, and TensorFlow respectively, and 1497 data units in
total were obtained for investigation. The dataset is available
online [30]. Some DLFs such as Caffe and Keras were not
included, since only a few bugs are bound to their bug-fixing
pull requests or only a single general-purpose PL is used.
TABLE II
DEMOGRAPHIC INFORMATION OF THE SELECTED DLFS.
DLF #Pr #Issue #star (k) %Main PL #Unit
MXNet 11096 9532 20.2 48.6 189
PyTorch 59555 29575 60.3 49.4 926
TensorFlow 22191 36007 169.0 63.1 382
Total 92842 75114 289.5 - 1497
A. Bug Types in Deep Learning Frameworks (RQ1)
Through manual analysis following the process presented
in Section III-E1, we classified the collected 1497 bugs in the
three DLFs into 12 types, which are shown in Fig. 1. The
details of the 12 bug types are described as follows.
TABLE III
NUMBERS AND PROPORTIONS OF DIFFERENT BUG TYPES (RQ1).
Bug Type MXNet PyTorch TensorFlow Total
# % # % # % # %
Algorithm design bug 21 11.1 119 12.9 34 8.9 174 11.6
Build bug 32 16.9 112 12.1 31 8.1 175 11.7
Code bug 15 7.9 81 8.8 45 11.8 141 9.4
Data bug 45 23.8 155 16.7 96 25.1 296 19.8
Deployment bug 7 3.7 106 11.5 25 6.5 138 9.2
Documentation bug 15 7.9 29 3.1 80 20.9 124 8.3
Memory bug 4 2.1 27 2.9 6 1.6 37 2.5
Performance bug 2 1.1 22 2.4 0 0.0 24 1.6
Processor bug 10 5.3 132 14.3 13 3.4 155 10.4
Test bug 19 10.1 83 9.0 23 6.0 125 8.4
Version compatibility bug 10 5.3 48 5.2 18 4.7 76 5.1
Visualization bug 9 4.8 12 1.3 11 2.9 32 2.1
(1) Algorithm design bug. This type of bug is related
to the defects in DL algorithm design. There are three spe-
cific cases: 1) design bugs in the model layer, e.g., network
structure errors, model accuracy deterioration, and data flow
errors between networks; 2) design bugs in the loss function
algorithm, e.g., the incorrect return value of the loss function
and the internal structure error of the loss function; 3) gradient
errors caused by incorrect settings of optimization parameters
or design bugs of optimization functions, e.g., optimizer-model
mismatches and learning rate algorithm errors.
(2) Build bug. This type of bug refers to failures
encountered during framework build or compilation. There are
two specific cases: 1) framework build failures, which may be
caused by configuration file errors or by failures when building
frameworks in different platforms, such as Linux, Windows,
Mac, or Docker environments; and 2) API import failures.
(3) Code bug. This type of bug refers to coding logic
errors or code smells introduced in code. Typical instances
include 1) erroneous return messages; 2) naming errors or
naming conflicts; 3) problems in cross-programming-language
communication; 4) coding problems in user-defined functions,
e.g., design defects of class templates; 5) obsolete code or
dead code; 6) file path errors, e.g., hard coded path and path
identification errors; and 7) circular dependencies.
Fig. 1. Bug types in DLFs (RQ1).
(4) Data bug. This type of bug refers to problems related
to data preprocessing that occur before data is input into a
model. Their symptoms are usually function output errors, data
loading errors, data acquisition exceptions, data corruption,
parameter errors, etc. There are three specific cases: 1) bugs
occurring in data calculation operations, e.g., scalar operations,
vector operations, matrix operations, and broadcast mecha-
nisms; 2) bugs occurring in data structure operations, e.g., data
creation, data replication, index slicing, type conversion, and
other related errors; and 3) bugs occurring during data loading.
(5) Deployment bug. This type of bug refers to problems
that arise when a trained model is exported or deployed in a
specific environment. There are two cases: 1) bugs occurring
during model import or export, e.g., abnormal behaviors when
storing trained models; and 2) bugs occurring in the process
of model transformation.
(6) Documentation bug. This type of bug refers to prob-
lems in documentation. Typical bugs: 1) missing documents;
2) incomplete documents; and 3) official document errors.
(7) Memory bug. This type of bug refers to memory-related
errors, which are often caused by inappropriate boundary
settings during training or by poor coding. There are two main
cases: 1) memory errors occurring during the running of a
DLF, e.g., memory leak, memory corruption, illegal memory
access, and insufficient memory; and 2) errors in the process
of memory operation.
(8) Performance bug. This type of bug refers to defects
related to unsatisfactory execution performance of DLFs. Typ-
ical bugs include: 1) undesirable computing speed; and 2)
performance regression caused by version change.
(9) Processor bug. This type of bug is related to proces-
sors, such as CPUs or GPUs. There are three cases: 1) bugs
occurring when a model or an operator runs on a specific
processor; 2) bugs in distributed training and data parallel
processing on multiple processors; and 3) bugs occurring when
the processor is not properly initialized, e.g., the processor
does not match some platform environments.
(10) Test bug. This type of bug refers to problems in testing,
including 1) test failures; 2) sample code errors; 3) lack of test
cases; and 4) static code check failures.
(11) Version compatibility bug. This type of bug refers
to compatibility errors for some features due to framework
version changes or technical updates. There are three cases:
1) Bugs occurring in model code because the DL framework
is updated to a new version; 2) bugs caused by the use of
obsolete APIs; and 3) bugs caused by incompatibility between
the current version of a PL and its previous version that results
in syntax errors or unavailability of certain operators.
(12) Visualization bug. This type of bug refers to failures
in visualizing training results or errors in the visual representations
of a DLF. There are three cases: 1) errors in the process of model
visualization; 2) errors found in the process of visualizing
model parameters; and 3) errors in the page display of DLFs, e.g.,
broken links in the front end and page display errors.
The number and proportion of each bug type are presented
in TABLE III, which shows that the proportions of the bug
types are not well balanced, either in individual DLFs or in
all of them as a whole. In MXNet, data bugs account for the
largest proportion (23.8%) and performance bugs for the
smallest (1.1%). In PyTorch, data bugs account for the
largest proportion (16.7%) and visualization bugs for the
smallest (1.3%). In TensorFlow, data bugs account for
the largest proportion (25.1%), and there are no performance
bugs in our collected dataset. Taking all DLFs as a whole,
data bugs (19.8%) are the leading bug type, followed by build
bugs (11.7%) and algorithm design bugs (11.6%); the least common
bug types are performance bugs (1.6%), followed by visualization
bugs (2.1%) and memory bugs (2.5%).
Answer to RQ1: Bugs in DLFs can be classified into 12
types. Data bugs are the dominant bug type (19.8%) and
performance bugs account for the smallest proportion (1.6%).
B. Impacts of Bugs on DLF Development (RQ2)
We present the impacts of different types of bugs on the
development of DLFs in three aspects, i.e., the open time of
bugs, the complexity of code change in bug fixing, and the
complexity of communication during bug fixing.
1) Open Time of Bugs: We compared the open time of
different bug types across the three DLFs, as shown in Table IV,
in which InteRanking denotes the integrated ranking of a bug type according
to its mean ranking. For MXNet, visualization bugs and perfor-
mance bugs are ranked in the top and bottom respectively; for
PyTorch, performance bugs and version compatibility bugs are
ranked in the top and bottom respectively; and for TensorFlow,
memory bugs and test bugs are ranked in the top and bottom
respectively. Taking the three DLFs as a whole, deployment
bugs, documentation bugs, and memory bugs are the top 3 bug
types that cost the most OT during bug fixing; build bugs, test
bugs, and processor bugs cost the least OT during bug fixing.
2) Complexity of Code Change in Bug Fixing: We investi-
gated the impact of different types of bugs on the complexity
of code change in terms of LOCM, NOFM, and entropy of
bug-fixing pull requests in the three selected DLFs.
(1) LOCM. The average LOCM and its ranking of each
bug type in the three DLFs are shown in Table V. For
MXNet, memory bugs and performance bugs are ranked in the
top and bottom respectively; for PyTorch, deployment bugs
and documentation bugs are ranked in the top and bottom
respectively; and for TensorFlow, processor bugs and build
bugs are ranked in the top and bottom respectively. Taking
the three DLFs as a whole, memory bugs, algorithm design
bugs, code bugs, and deployment bugs are the top 4 bug
types that need the most LOCM during bug fixing; build
bugs, documentation bugs, and visualization bugs need the
least LOCM during bug fixing.
(2) NOFM. The average NOFM and its ranking of each
bug type in the three DLFs are shown in Table VI. For
MXNet, memory bugs and performance bugs are ranked in the
top and bottom respectively; for PyTorch, deployment bugs
and documentation bugs are ranked in the top and bottom
respectively; and for TensorFlow, version compatibility bugs
and documentation bugs are ranked in the top and bottom
respectively. Taking the three DLFs as a whole, memory bugs,
processor bugs, and algorithm design bugs are the top 3 bug
types that need the most NOFM during bug fixing; perfor-
mance bugs, documentation bugs, test bugs, and visualization
bugs need the least NOFM during bug fixing.
(3) Entropy. The average entropy and its ranking of each
bug type in the three DLFs are shown in Table VII. For
MXNet, processor bugs and memory bugs are ranked in the
top and bottom respectively; for PyTorch, deployment bugs
and documentation bugs are ranked in the top and bottom
respectively; and for TensorFlow, data bugs and documentation
bugs are ranked in the top and bottom respectively. Taking
the three DLFs as a whole, deployment bugs, data bugs, and
processor bugs are the top 3 bug types with the highest
entropy during bug fixing; build bugs, documentation bugs,
and visualization bugs have the lowest entropy during bug fixing.
3) Complexity of Communication during Bug Fixing: We
present the results regarding the impact of different types of
bugs on the complexity of communication in terms of NODP
and NOC of bug-fixing pull requests in the three DLFs.
(1) NODP. The average NODP and its ranking of each bug
type in the three DLFs are shown in Table VIII. For MXNet,
documentation bugs and performance bugs are ranked in the
top and bottom respectively; for PyTorch, memory bugs and
documentation bugs are ranked in the top and bottom respec-
tively; and for TensorFlow, memory bugs and visualization
bugs are ranked in the top and bottom respectively. Taking
the three DLFs as a whole, memory bugs, deployment bugs,
and processor bugs are the top 3 bug types that have the most
NODP during bug fixing; performance bugs, test bugs, and
code bugs need the least NODP during bug fixing.
(2) NOC. The average NOC and its ranking of each bug
type in the three DLFs are shown in Table IX. For MXNet,
memory bugs and performance bugs are ranked in the top
and bottom respectively; for PyTorch, memory bugs and doc-
umentation bugs are ranked in the top and bottom respectively;
and for TensorFlow, memory bugs and version compatibility
bugs are ranked in the top and bottom respectively. Taking
the three DLFs as a whole, memory, processor, and algorithm
design bugs are the top 3 bug types that have the most NOC
during bug fixing; version compatibility, documentation, code,
and visualization bugs have the least NOC during bug fixing.
Answer to RQ2: Deployment bugs negatively impact the
development of DLFs the most in terms of open time; de-
ployment bugs negatively impact code change complexity the
most in terms of entropy; memory bugs negatively impact
code change complexity the most in terms of LOCM and
NOFM; and memory bugs negatively impact communication
complexity the most in terms of NODP and NOC.
C. Proportions and Distribution of MPL Bug Fixes (RQ3)
Proportions of MPL bugs. As shown in TABLE X, among
the 189 bugs analyzed in MXNet, 54 are MPL bugs (i.e., bugs
whose bug-fixing pull requests involve multiple PLs), accounting for
28.6%. In PyTorch, 291 out of the analyzed 926 bugs are MPL
TABLE IV
OPEN TIME (IN DAYS) OF DIFFERENT BUG TYPES (RQ2).
MXNet PyTorch TensorFlow Total
Bug type OT Ranking OT Ranking OT Ranking Mean ranking InteRanking
Algorithm design bug 22.92 10 74.70 3 130.32 2 5.00 4
Build bug 40.24 7 30.63 10 33.86 10 9.00 11=
Code bug 46.34 6 57.68 5 47.30 7 6.00 5=
Data bug 55.23 5 39.63 9 68.15 5 6.33 7
Deployment bug 77.84 3 81.51 2 66.53 6 3.67 1
Documentation bug 81.32 2 50.53 6 84.38 4 4.00 2
Memory bug 40.18 8 68.90 4 135.39 1 4.33 3
Performance bug 0.75 12 106.09 1 - - 6.50 8
Processor bug 21.41 11 30.23 11 99.04 3 8.33 10
Test bug 27.45 9 48.62 7 32.07 11 9.00 11=
Version compatibility bug 58.45 4 17.09 12 38.47 8 8.00 9
Visualization bug 82.84 1 43.89 8 37.17 9 6.00 5=
TABLE V
AVERAGE LOCM AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type LOCM Ranking LOCM Ranking LOCM Ranking Mean ranking InteRanking
Algorithm design bug 66.10 2 60.71 2 32.50 4 2.67 2
Build bug 34.72 7 26.22 10 14.58 11 9.33 12
Code bug 39.20 6 54.91 4 30.96 5 5.00 3=
Data bug 34.27 8 46.19 7 29.59 6 7.00 7
Deployment bug 28.14 11 61.97 1 51.68 3 5.00 3=
Documentation bug 43.93 4 18.79 12 15.78 10 8.67 10=
Memory bug 87.75 1 55.33 3 57.17 2 2.00 1
Performance bug 15.00 12 54.05 5 - - 8.50 9
Processor bug 28.50 10 51.52 6 57.69 1 5.67 5
Test bug 39.26 5 23.75 11 18.52 8 8.00 8
Version compatibility bug 58.30 3 26.31 9 25.83 7 6.33 6
Visualization bug 31.33 9 45.75 8 16.18 9 8.67 10=
TABLE VI
AVERAGE NOFM AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type NOFM Ranking NOFM Ranking NOFM Ranking Mean ranking InteRanking
Algorithm design bug 3.00 5 2.99 3 1.97 9 5.67 3
Build bug 3.03 4 1.99 9 2.25 5 6.00 4
Code bug 2.73 7 2.80 5 2.04 7 6.33 5
Data bug 2.31 10 2.87 4 2.18 6 6.67 6=
Deployment bug 2.57 9 3.25 1 1.96 10 6.67 6=
Documentation bug 3.53 2 1.41 12 1.19 11 8.33 11
Memory bug 8.00 1 3.19 2 3.00 3 2.00 1
Performance bug 1.50 12 2.27 7 - - 9.50 12
Processor bug 2.60 8 2.67 6 3.15 2 5.33 2
Test bug 3.16 3 1.78 10 2.04 8 7.00 9=
Version compatibility bug 2.00 11 2.25 8 3.17 1 6.67 6=
Visualization bug 2.78 6 1.67 11 2.36 4 7.00 9=
bugs, accounting for 31.4%. In TensorFlow, 61 out of the 382
analyzed bugs are MPL bugs, accounting for 16.0%.
Distribution of MPL bugs over bug types. As shown
in TABLE X, in MXNet, algorithm design bugs are the bug
type with the largest proportion of MPL bugs, while
deployment bugs, performance bugs, and visualization bugs
are the three bug types that have no MPL bugs; in PyTorch,
data bugs are the bug type with the largest proportion
of MPL bugs, while visualization bugs are the only bug type
that has no MPL bugs; in TensorFlow, processor bugs are the
bug type with the largest proportion of MPL bugs,
while version compatibility bugs are the only bug type that
has no MPL bugs. Performance bugs in TensorFlow are not
discussed here because there are no such bugs in the dataset.
Use of PL combinations in MPL bug fixes. The results
of MPL bugs over different PL combinations are shown in
TABLE XI. In MXNet, only the combination of Python and
C/C++ is used in all the MPL bug fixes. In PyTorch, the
combination of Python and C/C++ is used in 268 (92.1%)
MPL bug fixes; the combination of Python and Objective-C
is used in 17 (5.8%) MPL bug fixes; the combination of
C/C++ and Objective-C is used in 3 MPL bug fixes; the
combination of Python, C/C++, and Objective-C is used in
2 MPL bug fixes; and the combination of C/C++, Ruby, and
Objective-C is used in 1 MPL bug fix. In TensorFlow, the
combination of Python and C/C++ is used in 60 (98.4%) MPL
bug fixes, and only one MPL bug fix uses the combination of
Python and Go.
Answer to RQ3: 28.6%, 31.4%, and 16.0% of bug fixes are
MPL fixes in MXNet, PyTorch, and TensorFlow, respectively.
Algorithm design bugs, data bugs, and processor bugs are the
bug types with the largest proportions of MPL bugs in MXNet, PyTorch,
TABLE VII
AVERAGE ENTROPY AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type Entropy Ranking Entropy Ranking Entropy Ranking Mean ranking InteRanking
Algorithm design bug 0.65 3 0.40 4 0.61 4 3.67 4
Build bug 0.37 11 0.18 11 0.24 10 10.67 11=
Code bug 0.57 5 0.37 6 0.36 8 6.33 6
Data bug 0.62 4 0.51 2 0.67 1 2.33 2
Deployment bug 0.66 2 0.56 1 0.62 3 2.00 1
Documentation bug 0.42 9 0.15 12 0.06 11 10.67 11=
Memory bug 0.25 12 0.33 8 0.64 2 7.33 7
Performance bug 0.44 8 0.34 7 - - 7.50 8
Processor bug 0.80 1 0.46 3 0.54 6 3.33 3
Test bug 0.46 7 0.25 9 0.39 7 7.67 9
Version compatibility bug 0.46 6 0.37 5 0.58 5 5.33 5
Visualization bug 0.39 10 0.25 10 0.32 9 9.67 10
TABLE VIII
AVERAGE NODP AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type NODP Ranking NODP Ranking NODP Ranking Mean ranking InteRanking
Algorithm design bug 4.05 7 5.53 3 6.18 4 4.67 4
Build bug 4.06 6 5.13 7 6.13 5 6.00 5
Code bug 3.93 9 4.96 10 5.91 8 9.00 10
Data bug 3.89 10 5.42 4 6.06 7 7.00 7
Deployment bug 4.57 3 5.58 2 6.12 6 3.67 2
Documentation bug 4.80 1 4.62 12 5.83 9 7.33 9
Memory bug 4.75 2 5.85 1 8.17 1 1.33 1
Performance bug 3.00 12 4.82 11 - - 11.50 12
Processor bug 4.10 5 5.36 5 6.62 2 4.00 3
Test bug 3.79 11 5.01 8 5.65 10 9.67 11
Version compatibility bug 4.00 8 4.96 9 6.33 3 6.67 6
Visualization bug 4.11 4 5.17 6 5.00 11 7.00 7
TABLE IX
AVERAGE NOC AND ITS RANKING OF EACH BUG TYPE IN DLFS (RQ2).
MXNet PyTorch TensorFlow Total
Bug type NOC Ranking NOC Ranking NOC Ranking Mean ranking InteRanking
Algorithm design bug 4.00 8 6.26 4 4.09 3 5.00 3
Build bug 4.03 7 5.20 7 3.55 4 6.00 5
Code bug 3.87 9 5.15 8 3.22 6 7.67 9=
Data bug 3.51 10 6.37 3 2.90 8 7.00 6=
Deployment bug 4.57 6 6.13 5 3.24 5 5.33 4
Documentation bug 4.60 5 3.97 12 2.23 10 9.00 11
Memory bug 7.75 1 10.74 1 8.67 1 1.00 1
Performance bug 2.00 12 6.86 2 - - 7.00 6=
Processor bug 6.30 3 5.56 6 4.77 2 3.67 2
Test bug 5.32 4 4.99 9 2.30 9 7.33 8
Version compatibility bug 2.70 11 4.50 11 3.17 7 9.67 12
Visualization bug 7.22 2 4.83 10 2.18 11 7.67 9=
and TensorFlow, respectively. The PL combination of Python
and C/C++ is most popular in MPL bug fixes of the DLFs.
D. Impact of the Use of Multiple PLs on Bug Fixing (RQ4)
We further compared the six impact indicators of MPL bug
fixes with those of SPL bug fixes in the DLFs, in order to
understand whether there are significant differences on the
characteristics between MPL and SPL bug fixes. The results
are shown in TABLE XII. (1) In MXNet, the LOCM and
NOFM of MPL bug fixes are significantly larger than the
LOCM and NOFM of SPL bug fixes, respectively, while there
are no significant differences between MPL bug fixes and SPL
bug fixes on OT, entropy, NODP, and NOC. (2) In PyTorch,
every indicator of MPL bug fixes is significantly greater
than that of SPL bug fixes. (3) In TensorFlow, the LOCM,
NOFM, and Entropy of MPL bug fixes are significantly larger
than those of SPL bug fixes, respectively, while there are no
significant differences between MPL bug fixes and SPL bug
fixes on OT, NODP, and NOC.
Answer to RQ4: No impact indicators of MPL bug fixes
are significantly smaller than those of SPL bug fixes. Code
change complexity of MPL bug fixes is significantly greater
than that of SPL bug fixes in terms of LOCM and NOFM.
V. DISCUSSION
A. Interpretation of Study Results
RQ1: (1) Data bugs and algorithm design bugs are very
common in these algorithm intensive software systems. The
most common causes are problems in the input and output
links of operators or neural network layers, e.g., the lack of
input type checks (TensorFlow #13506), the lack of necessary
edge checks (PyTorch #1653), and the output of unexpected
TABLE X
TYPE DISTRIBUTION OF BUGS WITH MPL AND SPL FIXES (RQ3).
MXNet PyTorch TensorFlow
SPL MPL %MPL SPL MPL %MPL SPL MPL %MPL
Algorithm design bug 10 11 52.4 68 51 42.9 30 4 11.8
Build bug 29 3 9.4 106 6 5.4 31 0 0.0
Code bug 12 3 20.0 67 14 17.3 37 8 17.8
Data bug 23 22 48.9 72 83 53.6 62 34 35.4
Deployment bug 7 0 0.0 63 43 40.6 23 2 8.0
Documentation bug 12 3 20.0 27 2 6.9 78 2 2.5
Memory bug 2 2 50.0 18 9 33.3 5 1 16.7
Performance bug 2 0 0.0 17 5 22.7 - - -
Processor bug 5 5 50.0 73 59 44.7 6 7 53.9
Test bug 17 2 10.5 75 8 9.6 21 2 8.7
Version compatibility bug 7 3 30.0 37 11 22.9 18 0 0.0
Visualization bug 9 0 0.0 12 0 0.0 10 1 9.1
All bug types 135 54 28.6 635 291 31.4 321 61 16.0
TABLE XI
PL COMBINATIONS OF MPL BUG FIXES (RQ3).
MXNet PyTorch TensorFlow
PL combination # % # % # %
C/C++, Python 54 100.0 268 92.1 60 98.4
Objective-C, Python 0 0.0 17 5.8 0 0.0
C/C++, Objective-C 0 0.0 3 1.0 0 0.0
Go, Python 0 0.0 0 0.0 1 1.6
C/C++, Objective-C, Python 0 0.0 2 0.7 0 0.0
C/C++, Objective-C, Ruby 0 0.0 1 0.3 0 0.0
results (MXNet #8303). (2) Documentation bugs account for
a relatively large proportion (20.94%) in TensorFlow, which
indicates that during the development of TensorFlow, keeping
documents up to date is often a major difficulty for developers.
This was confirmed by the results of our searches for the key-
word ”document” in Stack Overflow regarding the three DLFs:
13 hits returned for MXNet, 174 hits returned for PyTorch, 913
hits returned for TensorFlow. (3) In PyTorch, processor bugs
are also a major bug type, which resonates with the great effort
that PyTorch developers have spent on data synchronization,
parallelism, and multiprocessing [14]. (4) The percentage of performance
bugs in the three DLFs is very small. We checked the issues
labeled with ”performance” in the three DLFs and found that
most of such issues are not considered as bugs by developers.
RQ2: (1) Deployment bugs are ranked on the top in the
integrated ranking with respect to the OT and entropy of bugs,
and memory bugs are ranked on the top in the integrated
ranking with respect to the LOCM and NOFM of bugs.
As evidenced in [31], a larger code change complexity and
longer OT often mean higher maintenance cost of a bug.
In this sense, deployment bugs and memory bugs may incur
much maintenance cost. (2) Memory bugs are ranked on the
top in the integrated ranking with respect to the NODP and
NOC during bug fixing, which means the communication involved
in fixing memory bugs is the most complex, in the sense that the
largest number of developers participate and the bugs are discussed the most. It
also resonates with the finding that the code change complexity
of memory bugs is the largest in terms of LOCM and NOFM.
RQ3: (1) The proportion of MPL bugs in TensorFlow is
obviously smaller than that in MXNet and PyTorch. One main
potential reason is that the main PL in TensorFlow accounts
for 63.1% of the code (shown in TABLE II), which is much larger
than in MXNet (48.6%) and PyTorch (49.4%). This implies that
there are far fewer chances for the main PL to work with other
PLs in TensorFlow. (2) The combination of Python and C/C++
is the most used PL combination in fixing MPL bugs of the
selected DLFs. This is also the PL choice of most popular
DLFs. We speculate that the reason is that Python is probably the
most comfortable PL for many data scientists and machine
learning experts, is easy to integrate with C/C++ components,
and is compatible with a large number of computing libraries,
while C/C++ is preferred for the computational part to improve
performance [32]. A sampling analysis of MPL fixes verified our
conjecture: the vast majority of MPL bug fixes did occur in the
back end, and when a bug in the back end is resolved, developers
need to test whether the issue is also resolved in the front end.
RQ4: In MXNet, PyTorch, and TensorFlow, it is more
difficult and costlier to fix MPL bugs than SPL bugs, in the
sense that (1) MPL bug fixes have significantly larger code
change complexity than SPL bug fixes in terms of LOCM
and NOFM in the three DLFs, and (2) no impact indicators
of MPL bug fixes are significantly smaller than those of
SPL bug fixes in the three DLFs. This is consistent with the
previous research results on MPL commits in Apache projects
[31], which suggests that the use of multiple PLs has a non-
negligible impact on the development of DLFs.
B. Implications for Researchers
Firstly, we classified the bugs in the selected DLFs based
on bug labels, which helps researchers gain a general
understanding of the distribution of bugs in DLFs and thereby
facilitates further investigation of these bugs. Secondly, it is
worth further exploring how different combinations of PLs
influence the development of DLFs, since the results of such
an exploration can guide the development in a more efficient
direction. Thirdly, researchers may put their efforts into studying
how the combination of C/C++ and Python impacts MPL bugs,
because more than 92% of MPL bugs involve this PL combination.
Finally, MPL bugs should receive more attention from researchers
in the software engineering community, not limited to the DLF
domain. Currently, MPL bugs are seldom explored; given their
significant impact, they need to be studied in more depth.
TABLE XII
MANN-WHITNEY U TEST RESULTS REGARDING IMPACTS OF THE USE OF MULTIPLE PLS ON THE FIXES OF BUGS (RQ4).
MXNet PyTorch TensorFlow
SPL bugs MPL bugs p-value SPL bugs MPL bugs p-value SPL bugs MPL bugs p-value
OT 51.51 47.01 0.643 48.21 59.24 <0.001 73.04 87.00 0.227
LOCM 35.01 57.63 <0.001 34.37 76.36 <0.001 27.69 36.80 <0.001
NOFM 2.25 4.41 <0.001 2.07 3.95 <0.001 1.89 2.67 <0.001
Entropy 0.53 0.71 0.059 0.34 0.57 <0.001 0.39 0.80 <0.001
NODP 3.93 4.31 0.479 5.16 5.62 <0.001 6.13 5.93 0.267
NOC 3.91 5.07 0.057 5.44 6.96 <0.001 3.30 2.80 0.837
C. Implications for Practitioners
Firstly, our research results enable developers to understand
the distribution of bugs in DLFs and which types of bugs are
more worthwhile to address first. We recommend that developers
employ our bug classification of DLFs in practice, thereby
facilitating bug management that takes the impacts of different
bug types into account. Besides, teams could assign more
experienced developers to fix more complex bug types (e.g.,
memory bugs and deployment bugs), which can be identified
through bug descriptions or related bug labels. For bugs with
relatively low fix complexity, new developers can be assigned
to fix them, thereby reducing the bug fixing cost. Meanwhile,
although the developers of DLFs usually leverage the advantages
of different PLs to implement DLFs, they should be aware that
the use of multiple PLs is likely to increase the complexity of
code changes, resulting in higher maintenance cost.
VI. THREATS TO VALIDITY
There are several threats to the validity of the study results.
We discuss these threats according to the guidelines in [25].
A. Construct Validity
Construct validity is concerned with whether the values of
the variables (listed in TABLE I) we obtained are consistent
with the real values that we expected. A potential threat is that
not all bugs resolved are explicitly linked to corresponding pull
requests, which may negatively affect the representativeness of
the collected bugs. Through our analysis, we confirmed that
the bugs with explicit links to corresponding pull requests were
not reported in a short time span and were not fixed by a small
group of developers. Therefore, this threat is to some extent
mitigated. Another possible threat is that different developers
may have biases during manual analysis, which may lead to
inconsistent classification and incorrect bug tagging results. To
reduce this threat, we adopted a four-step bug classification
process, in which a pilot bug tagging process was adopted to
identify and resolve possible disagreements on bug types (as
described in Section III-E1).
B. External Validity
External validity is concerned with the generalizability of
the study results. A potential threat is whether the selected
DLFs are representative enough. To select appropriate DLFs,
we set a number of criteria as described in Section III-C,
and the finally selected DLFs are MXNet, PyTorch, and
TensorFlow. These three DLFs are very popular and support a
wide range of DL applications. Thus, this threat is mitigated
to some extent. Another possible threat is that more than
92% of MPL bug fixes use the PL combination of Python and
C/C++, which indicates a significant imbalance regarding the use
of multiple PLs. However, the PL combination of Python and
C/C++ is a natural choice in the development of DLFs, as it
trades off the usability and performance of the two PLs.
C. Reliability
Reliability is concerned with whether the same results can
be obtained when the study is replicated by other researchers.
A potential threat is related to the implementation of our soft-
ware tool for data collection. The tool was implemented by the
second author, and the code of the key functionalities had been
reviewed by the first and third authors. In addition, sufficient
tests were performed to ensure that the calculation of data
items is correct. Thus, this threat has been alleviated. Another
threat is related to the correctness of the Mann–Whitney U
tests. Since only IBM SPSS was used to run the tests, this
threat is minimized.
VII. CONCLUSIONS AND FUTURE WORK
In this work, we conducted an empirical study on bugs in
three mainstream DLFs (i.e., MXNet, PyTorch, and Tensor-
Flow) and their impacts on the development of such DLFs.
We collected 75114 issues and 92842 pull requests of the
DLFs, and obtained 1497 bugs after a series of filtering steps. We
manually analyzed the 1497 bugs, and got the following
findings. (1) The bugs in DLFs can be classified into 12
types, e.g., algorithm design bugs, data bugs, and memory
bugs. Among the 12 bug types, data bugs account for the
largest proportion in all the three DLFs. (2) Deployment bugs
and memory bugs negatively impact the development of DLFs
the most in various aspects. (3) 28.6%, 31.4%, and 16.0%
of bugs in MXNet, PyTorch, and TensorFlow are MPL bugs
respectively, and the PL combination of Python and C/C++ is
most used in fixing more than 92% of MPL bugs in the three
DLFs. (4) The code change complexity of MPL bug fixes is
significantly greater than that of SPL bug fixes in the three
DLFs, while in PyTorch MPL bug fixes have longer open time
and greater communication complexity. In the future, we plan
to expand the obtained dataset in this study, and construct bug
prediction models for MPL bugs in DLFs. In addition, we are
also interested in investigating MPL bugs in other application
domains in depth.
REFERENCES
[1] F. Thung, S. Wang, D. Lo, and L. Jiang, “An empirical study of
bugs in machine learning systems,” in Proceedings of the 23rd IEEE
International Symposium on Software Reliability Engineering (ISSRE).
IEEE, 2012, pp. 271–280.
[2] Y. Yang, T. He, Z. Xia, and Y. Feng, “A comprehensive empirical study
on bug characteristics of deep learning frameworks,” Information and
Software Technology, vol. 151, p. 107004, 2022.
[3] J. Chen, Y. Liang, Q. Shen, and J. Jiang, “Toward understanding deep
learning framework bugs,” arXiv preprint arXiv:2203.04026, 2022.
[4] L. Quan, Q. Guo, X. Xie, S. Chen, X. Li, and Y. Liu, “Towards
understanding the faults of javascript-based deep learning systems,”
in Proceedings of the 37th IEEE/ACM International Conference on
Automated Software Engineering (ASE). ACM, 2022, pp. 1–13.
[5] K. Huang, B. Chen, S. Wu, J. Cao, L. Ma, and X. Peng, “De-
mystifying dependency bugs in deep learning stack,” arXiv preprint
arXiv:2207.10347, 2022.
[6] J. Cao, B. Chen, C. Sun, L. Hu, S. Wu, and X. Peng, “Understanding
performance problems in deep learning systems,” in Proceedings of
the 30th ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/FSE).
ACM, 2022, pp. 357–369.
[7] Y. Zhang, Y. Chen, S.-C. Cheung, Y. Xiong, and L. Zhang, “An empirical
study on tensorflow program bugs,” in Proceedings of the 27th ACM
SIGSOFT International Symposium on Software Testing and Analysis
(ISSTA). ACM, 2018, pp. 129–140.
[8] M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive
study on deep learning bug characteristics,” in Proceedings of the 27th
ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/FSE).
ACM, 2019, pp. 510–520.
[9] M. Grichi, E. E. Eghan, and B. Adams, “On the impact of multi-
language development in machine learning frameworks,” in Proceedings
of the 36th IEEE International Conference on Software Maintenance and
Evolution (ICSME). IEEE, 2020, pp. 546–556.
[10] A. Shatnawi, H. Mili, M. Abdellatif, Y.-G. Guéhéneuc, N. Moha,
G. Hecht, G. E. Boussaidi, and J. Privat, “Static code analysis of mul-
tilanguage software systems,” arXiv preprint arXiv:1906.00815, 2019.
[11] M. Kargar, A. Isazadeh, and H. Izadkhah, “Multi-programming language
software systems modularization,” Computers & Electrical Engineering,
vol. 80, p. 106500, 2019.
[12] A. Shatnawi, H. Mili, M. Abdellatif, Y.-G. Guéhéneuc, N. Moha,
G. Hecht, G. E. Boussaidi, and J. Privat, “Static code analysis of mul-
tilanguage software systems,” arXiv preprint arXiv:1906.00815, 2019.
[13] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu,
C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine
learning library for heterogeneous distributed systems,” arXiv preprint
arXiv:1512.01274, 2015.
[14] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imper-
ative style, high-performance deep learning library,” in Proceedings of
the 32nd Annual Conference on Neural Information Processing Systems
(NeurIPS). NeurIPS Proceedings, 2019, pp. 8024–8035.
[15] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for
large-scale machine learning,” in Proceedings of the 12th USENIX
Symposium on Operating Systems Design and Implementation (OSDI).
ACM, 2016, pp. 265–283.
[16] L. Jia, H. Zhong, X. Wang, L. Huang, and X. Lu, “The symptoms,
causes, and repairs of bugs inside a deep learning library,” Journal of
Systems and Software, vol. 177, p. 110935, 2021.
[17] C. B. Seaman, F. Shull, M. Regardie, D. Elbert, R. L. Feldmann,
Y. Guo, and S. Godfrey, “Defect categorization: making use of a
decade of widely varying historical data,” in Proceedings of the 2nd
ACM/IEEE International Symposium on Empirical Software Engineering
and Measurement (ESEM). ACM, 2008, pp. 149–157.
[18] K. Charmaz, Constructing Grounded Theory: A Practical Guide through
Qualitative Analysis. SAGE, 2006.
[19] B. Ray, D. Posnett, V. Filkov, and P. Devanbu, “A large scale study of
programming languages and code quality in github,” in Proceedings of
the 22nd ACM SIGSOFT International Symposium on Foundations of
Software Engineering (FSE). ACM, 2014, pp. 155–165.
[20] E. D. Berger, C. Hollenbeck, P. Maj, O. Vitek, and J. Vitek, “On
the impact of programming languages on code quality: A reproduction
study,” ACM Transactions on Programming Languages and Systems,
vol. 41, no. 4, pp. 1–24, 2019.
[21] P. S. Kochhar, D. Wijedasa, and D. Lo, “A large scale study of multiple
programming languages and code quality,” in Proceedings of 23rd
IEEE International Conference on Software Analysis, Evolution, and
Reengineering (SANER). IEEE, 2016, pp. 563–573.
[22] M. Abidi, F. Khomh, and Y.-G. Guéhéneuc, “Anti-patterns for multi-
language systems,” in Proceedings of the 24th European Conference on
Pattern Languages of Programs (EuroPLoP). ACM, 2019, pp. 1–14.
[23] M. Abidi, M. S. Rahman, M. Openja, and F. Khomh, “Are multi-
language design smells fault-prone? an empirical study,” ACM Trans-
actions on Software Engineering and Methodology, vol. 30, no. 3, pp.
1–56, 2021.
[24] Z. Li, X. Qi, Q. Yu, P. Liang, R. Mo, and C. Yang, “Exploring
multi-programming-language commits and their impacts on software
quality: An empirical study on apache projects,” Journal of Systems
and Software, vol. 194, p. 111508, 2022.
[25] P. Runeson and M. Höst, “Guidelines for conducting and reporting case
study research in software engineering,” Empirical Software Engineer-
ing, vol. 14, no. 2, pp. 131–164, 2009.
[26] V. R. Basili, “Software modeling and measurement: The
goal/question/metric paradigm,” 1992. [Online]. Available:
http://drum.lib.umd.edu/bitstream/1903/7538/1/Goal Question Metric.pdf
[27] A. E. Hassan, “Predicting faults using the complexity of code changes,”
in Proceedings of the 31st IEEE International Conference on Software
Engineering (ICSE). IEEE, 2009, pp. 78–88.
[28] Z. Li, P. Liang, D. Li, R. Mo, and B. Li, “Is bug severity in line
with bug fixing change complexity?” International Journal of Software
Engineering and Knowledge Engineering, vol. 30, no. 11&12, pp. 1779–
1800, 2020.
[29] A. J. Viera, J. M. Garrett et al., “Understanding interobserver agreement:
the kappa statistic,” Fam med, vol. 37, no. 5, pp. 360–363, 2005.
[30] Z. Li, S. Wang, W. Wang, P. Liang, R. Mo, and B. Li,
“Datasets for ‘Understanding Bugs in Multi-Language Deep Learning
Frameworks’,” 2022. [Online]. Available: https://anonymous.4open.science/r/dataItem-B377
[31] Z. Li, Q. Yu, P. Liang, R. Mo, and C. Yang, “Interest of defect technical
debt: An exploratory study on apache projects,” in Proceedings of
the 36th IEEE International Conference on Software Maintenance and
Evolution (ICSME). IEEE, 2020, pp. 629–639.
[32] R. Mu and X. Zeng, “A review of deep learning research,” KSII
Transactions on Internet and Information Systems, vol. 13, no. 4, pp.
1738–1764, 2019.