Automated Bug Triaging in an Industrial Context
Václav Dedík
Faculty of Informatics
Masaryk University, Brno, Czech Republic
dedik@mail.muni.cz
Bruno Rossi
Faculty of Informatics
Masaryk University, Brno, Czech Republic
brossi@mail.muni.cz
Abstract—There is an increasing need to introduce some form
of automation within the bug triaging process, so that no time
is wasted on the initial assignment of issues. However, there is a
gap in current research, as most of the studies deal with open
source projects, ignoring the industrial context and needs.
In this paper, we report our experience in dealing with the
automation of the bug triaging process within a research-industry
cooperation. After reporting the requirements and needs that
were set within the industrial project, we compare the analysis
results with those from an open source project used frequently
in related research (Firefox). Although the projects differ in size and
development process, the data distributions are similar, and so are the
best-performing models. We found that more easily configurable models
(such as SVM+TF–IDF) are preferred, and that top-x recommendations, the
number of issues per developer, and online learning can all be relevant
factors when dealing with an industrial collaboration.
Keywords-Software Bug Triaging; Bug Reports; Bug Assignment; Machine Learning; Text Classification; Industrial Scale;
I. INTRODUCTION
In this paper, we deal with bug triaging, that is, the process of
assigning new issues or tickets to the developers who can best
handle them. Most of the papers in the area of automated
bug triaging deal with Open Source Software (OSS) projects
(e.g. [1], [2]). In the current paper we take a point of view
that has not often been investigated, namely the different needs
that arise in a research-industry collaboration. In fact, OSS
has a distributed development model [3] that implies not only a
different triaging process but also different results when an
automated triaging process is applied. For this reason, we set
up the main research question:
what are the differences in bug triaging automation when
considering machine learning models between an OSS project
and a company-based proprietary project? From this main
research question, throughout the paper we provide several
sub-questions that can give more insights on different aspects.
To answer the main research question, we conducted an
experimentation within a software company from the Czech
Republic, running a SCRUM-based development process cen-
tered around several teams and a JIRA issue tracker that
provided the central part of the triaging process. At the time
the experimentation started, tickets were left unassigned until
a responsible person began the triaging process. In the paper,
we discuss the differences, when implementing an automated
triaging system, that were detected between such a proprietary
project and data from Mozilla Firefox.
II. RELATED WORKS
Over the years there have been many attempts to auto-
mate bug triaging. The first effort was made by Čubranić
and Murphy [1] using a text categorization approach with a
Naïve Bayes classifier on Eclipse data. Their dataset consisted
of 15,670 bug reports with 162 classes (developers). They
achieved about 30% accuracy.
Anvik et al. [2] used Support Vector Machines (SVM) on
Eclipse, Firefox and GCC data. They achieved precision of
64% and 58% on Firefox and Eclipse data, but only 6%
on GCC data. As for recall, only 2%, 7% and 0.3% were
achieved on Firefox, Eclipse and GCC data, respectively.
Ahsan et al. [4] used an SVM classifier on Mozilla data.
They reached 44.4% classification accuracy, 30% precision
and 28% recall using SVM with LSI.
An extensive study was done by Alenezi et al. [5], using a
Naïve Bayes classifier. The best results were achieved using
χ² feature selection, with precision values of 38%, 50%, 50% and 50% and
recall values of 30%, 35%, 21% and 46% on the Eclipse-SWI,
Eclipse-UI, NetBeans and Maemo projects, respectively.
Xia et al. [6] employed an algorithm named DevRec based
on multi-label k-nearest neighbor classifier (ML–kNN) and
topic modeling using Latent Dirichlet Allocation (LDA). For
top-5 recommendation, the recall values were 56%, 48%, 56%,
71%, 80% and the precision values 25%, 21%, 25%, 32%,
25% for GCC, OpenOffice, Mozilla, NetBeans and Eclipse.
Automated bug triaging in industrial projects has not been
discussed often (in [7], only two out of 25 studies reviewed).
Hu et al. [8] provided an evaluation of automatic bug triaging
considering a component-developer-bug network on two in-
dustrial projects (1,008/686 bug reports, 19/11 developers) and
three OSS projects (Eclipse, Mozilla, Netbeans). They overall
noticed that the proposed approach performs worse on the
industrial projects. Jonsson et al. [7] is one recent study
that underlines the lack of automatic triaging studies
within the industrial context. Using an ensemble learner,
the authors collect 50,000 bug reports from five projects of two
companies, showing that the system scales well with large
training sets. The main findings are that SVM models
perform well in an industrial context and that model performance
drops when project team size increases.
III. EXPERIMENTAL EVALUATION
The process of triaging automation begins by retrieving a
dataset from a bug tracking system (Fig. 1).
Fig. 1. Data analysis process: from bug tracking data to the prediction results.
The next step is to filter unwanted bug reports from the dataset. All tickets
that are unassigned, not fixed or not resolved are removed.
There can be also bug reports that are assigned to a universal
computer-generated user (e.g. nobody@mozilla.org) —
we remove these reports as well. Bug reports resolved by
developers who have not fixed a sufficient number of reports
in the past (e.g. 30) are also removed. Another step is to
shuffle the bug reports randomly. This is done to achieve more
accurate performance results with the cross-validation (CV)
set. The process continues by splitting the resulting dataset
into two sets—a cross-validation set and a training set. The
CV set contains 30% of the bugs while the training set contains
the remaining 70%.
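A minimal sketch of this preparation stage is shown below. It assumes the issue tracker has been exported to a CSV file with hypothetical columns assignee, status, resolution, summary and description; the column names and the CSV export are illustrative assumptions, not the exact setup used in the study.

import pandas as pd
from sklearn.model_selection import train_test_split

MIN_FIXED = 30  # minimum number of fixed reports per developer, as in the paper

# Hypothetical export of the issue tracker; column names are assumptions.
reports = pd.read_csv("bug_reports.csv")

# Keep only assigned, fixed and resolved reports and drop the generic account.
reports = reports[
    reports["assignee"].notna()
    & (reports["status"] == "RESOLVED")
    & (reports["resolution"] == "FIXED")
    & (reports["assignee"] != "nobody@mozilla.org")
]

# Remove developers who fixed fewer than MIN_FIXED reports.
counts = reports["assignee"].value_counts()
reports = reports[reports["assignee"].isin(counts[counts >= MIN_FIXED].index)]

# The textual features are the summary and description fields.
reports["text"] = reports["summary"].fillna("") + " " + reports["description"].fillna("")

# Shuffle, then split into 70% training and 30% cross-validation sets.
train, cv = train_test_split(reports, test_size=0.3, shuffle=True, random_state=42)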
The second stage is to train the machine learning model
(e.g. SVM, NB). First, however, it is necessary to run the
training dataset through a feature extraction step. This step
applies techniques that improve the performance of a machine
learning model, e.g. stop-words removal and TF–IDF weighting.
The last stage uses the trained machine learning model
to generate prediction results that can be used, for example,
to compute various performance metrics. The first step is to
(again) apply feature extraction techniques on the CV dataset
and then use the trained classifier to predict results.
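As an illustration of these two stages, the sketch below trains a TF–IDF plus linear SVM pipeline with scikit-learn, reusing the train/cv split from the previous sketch; the concrete parameters are assumptions rather than the tuned values from the study.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Feature extraction: stop-words removal and TF-IDF weighting.
vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True)

# Train on the 70% split ...
model = make_pipeline(vectorizer, LinearSVC(C=1.0))
model.fit(train["text"], train["assignee"])

# ... and predict assignees on the 30% cross-validation split.
predictions = model.predict(cv["text"])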
One of the most prominent models for text classification is
known as Support Vector Machine (SVM). Instead of trying
to find a function that fits training data as precisely as possible
by optimizing the distance from samples to the function,
the SVM algorithm attempts to find a linear separator of
positive and negative samples by optimizing the margin of
the decision boundary (hyperplane) [9]. The samples that
limit the margin of the decision boundary on both sides are
called support vectors. However, positive and negative samples
can be mixed up together so much that it is impossible to
construct a decision boundary. This is why the strictness
of the optimization algorithm can be relaxed by tuning the
regularization parameter C [9].
In many applications, linear classification is not enough
to obtain a good decision boundary and performance. Using
kernel functions, we can effectively transform the decision
boundary into a polynomial, Gaussian or sigmoid one. By
using a technique called kernel trick, this can be done without
increasing the time complexity of the learning algorithm [9].
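As a minimal illustration of these parameters, assuming scikit-learn's SVC (the values of C and gamma below are placeholders, not the settings used in the study):

from sklearn.svm import SVC

# Linear decision boundary; C relaxes or tightens the margin optimization
# (a smaller C tolerates more misclassified training samples).
linear_svm = SVC(kernel="linear", C=1.0)

# Gaussian (RBF) kernel: via the kernel trick the decision boundary becomes
# non-linear without an explicit mapping to a higher-dimensional space.
rbf_svm = SVC(kernel="rbf", C=10.0, gamma="scale")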
To evaluate the performance of the models, we considered
three metrics — accuracy, precision and recall1.
1) Accuracy: This metric measures the overall fraction of
correctly predicted assignees out of all predictions [9].
\[ \mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{1} \]
2) Macro-Averaged Precision: Precision measures the fraction
of assignees correctly predicted as positive out of all
assignees predicted as positive, whether correctly or incorrectly.
A value is computed separately for every class (assignee) and
the result is the mean of all these values.
\[ \mathrm{Precision}_{\mathrm{macro}} = \frac{1}{q} \sum_{\lambda=1}^{q} \frac{tp_{\lambda}}{tp_{\lambda} + fp_{\lambda}} \tag{2} \]
3) Macro-Averaged Recall: Similar to precision, except that recall
measures the fraction of assignees correctly predicted
as positive out of the assignees either correctly predicted as
positive or incorrectly predicted as negative. Again, the macro-
averaged variant is computed as the mean of the recall values
of every class (assignee).
\[ \mathrm{Recall}_{\mathrm{macro}} = \frac{1}{q} \sum_{\lambda=1}^{q} \frac{tp_{\lambda}}{tp_{\lambda} + fn_{\lambda}} \tag{3} \]
A. Dataset description
The Proprietary dataset was collected within the context of
the collaboration with a software company from the Czech
Republic. The development process is run with a SCRUM-
based process centered around several teams and a JIRA issue
tracker that collects all the issues. At the time of starting
the experimentation, the tickets were left unassigned until one
responsible would start the triaging process. The mined dataset
contains 2,926 bug reports created between 2012-11-23 and
2015-10-16. Only bug reports that were resolved with assigned
developers were considered. There are 110 developers in this
dataset. Only 2,424 bugs assigned to 35 developers were
retained after removing developers with fewer than 30 fixed
bugs (Fig. 2).
The OSS dataset was mined from the Mozilla repository
from the Firefox project². We downloaded all bugs that are in
status RESOLVED with resolution FIXED and were created
in 2010 or later. We also removed bugs with the field
assigned_to set to nobody@mozilla.org, as those
tickets were not assigned to an actual developer. In total, we
were able to retrieve 9,141 bugs. To get a better comparison
with the other datasets, we only use 3,000 data points
for training and cross-validation that were created between
2010-01-01 and 2012-07-10. This dataset contains 343 labels
(developers). Finally, we remove developers who did not fix at
least 30 bugs, yielding 1,810 bugs with 20 developers (Fig. 3).
¹ For both precision and recall, we considered the macro-averaged variants.
² https://bugzilla.mozilla.org
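The Firefox data can be retrieved through the public Bugzilla REST API; the sketch below shows one plausible query, but the exact mining procedure is not described in the paper, so the endpoint parameters should be read as an assumption based on the Bugzilla REST documentation.

import requests

# Bugs of the Firefox product that are RESOLVED/FIXED and created in 2010 or later.
params = {
    "product": "Firefox",
    "status": "RESOLVED",
    "resolution": "FIXED",
    "creation_time": "2010-01-01",
    "include_fields": "id,summary,assigned_to,creation_time",
    "limit": 500,  # page through results in practice
}
response = requests.get("https://bugzilla.mozilla.org/rest/bug", params=params, timeout=60)
bugs = response.json()["bugs"]

# Drop bugs assigned to the generic account.
bugs = [b for b in bugs if b["assigned_to"] != "nobody@mozilla.org"]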
Fig. 2. Proprietary project: distribution of tickets per single developer.
Fig. 3. Firefox project: distribution of tickets per single developer.
B. Experimental Questions
For the analysis, we structured the experimental part linking
the original goal to several questions:
Q1/Q2. Do the two datasets come from the same
distribution? Do the samples have the same population
mean? This would allow us to determine how similar the
two datasets are in terms of issue distribution;
Q3. Do classification models bring benefits over a base-
line classifier? This allows us to review the relevance of
the provided models;
Q4. Which classification model brings the best results for
both datasets in terms of accuracy, precision and recall?
This allows us to look into differences of performance of
the models taking into account the two datasets;
Q5. Taking into account the best identified models, how
do they perform for different numbers of issues
per developer? Companies want to understand how
much time needs to pass until a developer can be
included in the prediction process;
Q6. Taking into account the best identified models, what
is their performance for a higher
number of recommendations? One of the requirements
of companies is to provide a ranked list of assignees that
could be useful in case of ticket re-assignments.
C. Dataset distributions (Q1/Q2)
We use the two-sided alternative of the two-sample chi-
square test to evaluate the null hypothesis that two samples
are from the same distribution. The test with Firefox and
Proprietary samples returns a p-value equal to 0.8648, so we
fail to reject the null hypothesis that the two samples are from
the same distribution (5% significance).
After testing the hypothesis that the datasets
come from the same distribution, we test the null hypothesis
that the samples have the same population mean, allowing
us to further learn how statistically similar the datasets
are, also looking at the variances of the samples. We use the
two-sided alternative of the standard independent two-sample
t-test to test the null hypothesis that the samples have
the same population mean, and the two-sided alternative
of the Levene test to test whether the samples have equal
population variance. Testing data from the Firefox and Proprietary
datasets, the Levene test yields a p-value equal to 0.2730; as the
p-value > 0.05, we fail to reject the null hypothesis of equal variances (5%
significance). To test the population mean of the two samples,
we can therefore use the standard independent t-test.
The standard independent t-test results in p-value of 0.1954,
which means we fail to reject the null hypothesis of equal
population mean of the two samples (5% significance).
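These tests are available in SciPy; a small self-contained sketch is shown below, where the per-developer ticket counts are made-up placeholders standing in for the distributions of Figs. 2 and 3.

from scipy import stats

# Placeholder per-developer ticket counts (illustrative only).
prop_counts = [412, 310, 188, 150, 97, 85, 60, 44, 38, 31]
firefox_counts = [350, 280, 200, 160, 110, 90, 70, 52, 40, 33]

# H0: equal population variances (Levene test).
_, p_levene = stats.levene(prop_counts, firefox_counts)
# H0: equal population means (standard independent two-sample t-test, two-sided).
_, p_ttest = stats.ttest_ind(prop_counts, firefox_counts)

print(f"Levene p-value: {p_levene:.4f}")
print(f"t-test p-value: {p_ttest:.4f}")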
Chi-square and t-test imply there is little difference between
proprietary and Firefox dataset in terms of data distribution.
D. Comparison with the Baseline Model (Q3)
A first step in the evaluation of the models was to compare
them against a baseline model. We defined such a model as a classifier
that assigns every bug report to the developer with the
highest number of reports. While the accuracy of this model
is relatively high (18%), the precision and recall values are
much lower (1% and 5%) on the Firefox data (Fig. 4).
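This baseline corresponds to a most-frequent-class classifier; a sketch with scikit-learn's DummyClassifier, reusing the train/cv split from the earlier sketches:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Always predict the developer who fixed the most reports in the training set.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(train["text"], train["assignee"])
baseline_pred = baseline.predict(cv["text"])

print(accuracy_score(cv["assignee"], baseline_pred))
print(precision_score(cv["assignee"], baseline_pred, average="macro", zero_division=0))
print(recall_score(cv["assignee"], baseline_pred, average="macro", zero_division=0))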
There is also an increase in performance after all stop-words
were removed from the feature vectors of the Firefox data; the
performance slightly increased for all models.
Accuracy increased by 3% for SVM. The precision value of
the SVM model decreased by 1% but increased by 6% with
Naïve Bayes. Finally, the recall values of SVM and Naïve Bayes
increased by 4% and 2%, respectively. We therefore conclude
that the performance boost of stop-words removal is significant
enough to warrant better results, which matches the conclusion
of Čubranić and Murphy [1].
The initial models bring benefits over the baseline classifier.
E. Comparison of the Models (Q4)
Based on related work, we compared two models: Naïve
Bayes (see previous work from one of the authors, [10]), and
SVM. Parameters of all models were optimized by grid search.
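The grid search can be expressed as below, again reusing the training split from the earlier sketches; the parameter grid is a plausible example, not the exact grid used by the authors.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

# Illustrative grid over TF-IDF options and the SVM regularization parameter C.
param_grid = {
    "tfidf__sublinear_tf": [True, False],
    "svm__C": [0.1, 1, 10, 100],
}
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(train["text"], train["assignee"])
print(search.best_params_, search.best_score_)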
SVM+TF–IDF offers the best performance of 53% accuracy,
59% precision and 47% recall. The same model with
LSI also shows quite good performance, and the Naïve Bayes
model with χ² and TF–IDF performs quite well as far as
precision is concerned. Both SVM and Naïve Bayes models
exhibit rather wide spreads between precision and recall values
(Fig. 5), which could be an indication of higher variance of
the proprietary data (Figs. 2 and 3).
Fig. 4. Firefox project: baseline model and stop-words removal.
Fig. 5. Proprietary project: models comparison.
On the Firefox data, SVM+TF–IDF achieves the best performance with an accuracy
of 57%. Its precision and recall also outperform all the
other approaches, with values of 51% and 45% (Fig. 6). The
comparison shows that SVM+TF–IDF performs best on all
datasets. It also generalizes very well, because there was
no need to readjust the parameters. The disadvantage of the
model is that it is the most computationally expensive one,
since SVM is the slowest when there are many classes and
features. This can be at least partially mitigated by using
χ² feature extraction in conjunction with TF–IDF, while
sacrificing some of the classification performance.
Fig. 6. Firefox project: models comparison.
SVM+TF–IDF+Stop Words Removal is the best model for
both proprietary company-based and Firefox data.
F. Number of Issues per Developer (Q5)
We compared the two datasets using the SVM model with TF–IDF
weighting by computing their performance (accuracy, precision
and recall) for six different settings of the minimum
issues per developer requirement (1, 3, 5, 10, 20, 30) (Fig.
7). The accuracy on the proprietary dataset (53%) is a bit
lower than the accuracy on the open-source dataset (54%) when
the minimum issues per developer equals 30. An interesting fact is
that the initial behaviour is quite different, with much higher
performance on the proprietary dataset (42%) than on the open-
source dataset (29%).
Fig. 7. Comparison of accuracy of the SVM model.
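One way to reproduce this sweep is sketched below; it re-filters the dataset for each threshold and re-evaluates the SVM+TF–IDF pipeline (model) from the earlier sketches. The dataframe all_reports is a hypothetical version of the dataset before the minimum-issues filter is applied.

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# all_reports: dataset before the minimum-issues filter (same columns as above).
for min_issues in (1, 3, 5, 10, 20, 30):
    counts = all_reports["assignee"].value_counts()
    subset = all_reports[all_reports["assignee"].isin(counts[counts >= min_issues].index)]

    train, cv = train_test_split(subset, test_size=0.3, shuffle=True, random_state=42)
    model.fit(train["text"], train["assignee"])
    preds = model.predict(cv["text"])
    print(min_issues, accuracy_score(cv["assignee"], preds))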
The precision value of the proprietary dataset (Fig. 8)
is ultimately higher (59%) than the precision value of the
open-source dataset (54%). For recall, the right-most
case (minimum number of issues per developer
equal to 30) shows the performance of the proprietary dataset
slightly lower (47%) than that of the open-source dataset
(50%). Also in this case, the performance for a minimum
number of issues per developer equal to one is much higher
for the proprietary dataset (14% vs 3%).
Fig. 8. Comparison of precision of the SVM model.
The performance of the proprietary dataset is generally
quite similar to that of the open-source dataset. An interesting
conclusion is that the spread between precision and recall is
much higher for the proprietary dataset. This possibly implies
higher variance of the dataset. All our results show a significant
difference in performance for a minimum number of issues
per developer equal to one. The higher performance of the
proprietary dataset in this regard can probably be explained
by the fact that open-source bug repositories are open to
anyone.
One-time assignees can negatively impact performance
when considering different numbers of issues per developer.
G. Performance for Higher Recommendations Number (Q6)
We examined the performance of the models with different
number of recommendations. We show the performance of the
SVM model with TF–IDF weighting trained on the proprietary
and Firefox datasets for numbers of recommendations from one
to ten (Fig. 9). The accuracy increases with the number of
recommendations, which is expected as the more recommen-
dations, the higher the chance of a hit (i.e. the chance that the
list of predictions contains the correct assignee).
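Top-x accuracy can be computed from the classifier's decision scores; the helper below is an illustration that assumes the fitted SVM+TF–IDF pipeline (model) and the CV split from the earlier sketches.

import numpy as np

def top_x_accuracy(pipeline, texts, true_assignees, x):
    """Fraction of reports whose true assignee is among the x highest-scoring developers."""
    scores = pipeline.decision_function(texts)   # shape (n_reports, n_classes) in the multiclass case
    top_x = np.argsort(scores, axis=1)[:, -x:]   # column indices of the x best-scoring classes
    classes = pipeline.classes_
    hits = [true in classes[idx] for true, idx in zip(true_assignees, top_x)]
    return float(np.mean(hits))

for x in range(1, 11):
    print(x, top_x_accuracy(model, cv["text"], list(cv["assignee"]), x))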
Fig. 9. Comparison of accuracy of top-x recommendations for SVM model.
It is apparent from the plot that the highest performance
boost happens when the number of recommendations
changes from one to two both for the proprietary dataset
(54% vs 66%) and the Firefox dataset (59% vs 72%).
The accuracy on the Firefox dataset is equal to 90% for 5
recommendations and 81% for the proprietary dataset. When
the number of recommendations is 10, the performance is
97% and 90% for the Firefox and proprietary data, respectively.
Both datasets show similar changes in accuracy as the number
of recommended developers varies.
IV. DISCUSSION & CONCLUSION
In the current paper, we evaluated the usage of an automated
bug triaging process within a Czech Republic-based company
to look at similarities and differences to the usually analyzed
open source projects. We found that the best classification
model is SVM with TF–IDF, achieving an accuracy of 53%,
precision of 59% and recall of 47%, not dissimilar from
the Firefox dataset: 57%, 51% and 45%, respectively. One
appreciated advantage of such a model within the company is
that it does not require extensive parameter optimization. With
respect to the many aspects we analyzed, the distributions
of the datasets seem to be very similar (t-test and chi-square
test). The subsequent evaluation of performance on the datasets
using SVM and TF–IDF supports this conclusion further.
Unfortunately, this conclusion does not hold for all feature
extraction methods; our results show that it is necessary to
choose different parameters for both the LSI and χ² techniques. In
terms of performance, however, there is a significant difference
when we omit the filtering of developers who fix few bugs. The
most likely explanation is that the company-based proprietary
dataset contains fewer developers who fixed only a few bugs, as
the proprietary bug repository is not public.
Comparing the performance of the models with the related
work can be problematic due to different processes of analysis,
datasets and metrics used. Overall, the results on the Firefox
project can be considered comparable with our results. Differences
mostly lie in the way inactive developers are
filtered ([2]) or in the number of reports used (we used 3,000 to
compare with the proprietary project, while [5] considered 11,311).
Looking at the related works in the application of automated
bug triaging within industrial context (mainly [7]), there are
many characteristics that proprietary projects do not typically
share with open source software and that can be at the basis
of different results. Within our company-based collaboration,
we particularly noticed the need to discuss: i) the number
of issues per developer required before predictions become reliable,
ii) the number of top-x recommendations to present, iii) the
implications of online learning, and iv) also providing
support for team assignment in parallel with individual assignment.
REFERENCES
[1] G. C. Murphy and D. Cubranic, “Automatic bug triage using text
categorization,” in the Sixteenth Int. Conference on Software Engineering
& Knowledge Engineering. KSI Press, 2004, pp. 92–97.
[2] J. Anvik, L. Hiew, and G. C. Murphy, “Who should fix this bug?” in Pro-
ceedings of the 28th International Conference on Software Engineering,
ser. ICSE ’06. New York, NY, USA: ACM, 2006, pp. 361–370.
[3] A. Mockus, R. T. Fielding, and J. D. Herbsleb, “Two case studies of
open source software development: Apache and Mozilla,” ACM Trans.
Softw. Eng. Methodol., vol. 11, no. 3, pp. 309–346, Jul. 2002.
[4] S. N. Ahsan, J. Ferzund, and F. Wotawa, “Automatic Software Bug
Triage System (BTS) Based on Latent Semantic Indexing and Support
Vector Machine,” 2009 Fourth International Conference on Software
Engineering Advances, pp. 216–221, Sep. 2009.
[5] M. Alenezi, K. Magel, and S. Banitaan, “Efficient Bug Triaging Using
Text Mining,” Journal of Software, vol. 8, no. 9, pp. 2185–2190, Sep.
2013.
[6] X. Xia, D. Lo, X. Wang, and B. Zhou, “Dual analysis for recommending
developers to resolve bugs,” Journal of Software: Evolution and Process,
vol. 27, no. 3, pp. 195–220, 2015.
[7] L. Jonsson, M. Borg, D. Broman, K. Sandahl, S. Eldh, and P. Runeson,
“Automated bug assignment: Ensemble-based machine learning in large
scale industrial contexts,” Empirical Software Engineering, pp. 1–46,
2015.
[8] H. Hu, H. Zhang, J. Xuan, and W. Sun, “Effective bug triage based
on historical bug-fix information.” in ISSRE. IEEE Computer Society,
2014, pp. 122–132.
[9] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information
Retrieval. New York, NY, USA: Cambridge University Press, 2008.
[10] N. K. Singha Roy and B. Rossi, “Towards an improvement of bug sever-
ity classification,” in Software Engineering and Advanced Applications
(SEAA), 2014 40th EUROMICRO Conference on. IEEE, 2014, pp.
269–276.