Automated Bug Triaging in an Industrial Context
Václav Dedík
Faculty of Informatics
Masaryk University, Brno, Czech Republic
dedik@mail.muni.cz
Bruno Rossi
Faculty of Informatics
Masaryk University, Brno, Czech Republic
brossi@mail.muni.cz
Abstract—There is an increasing need to introduce some form
of automation within the bug triaging process, so that no time
is wasted on the initial assignment of issues. However, there is a
gap in current research, as most of the studies deal with open
source projects, ignoring the industrial context and needs.
In this paper, we report our experience in dealing with the
automation of the bug triaging process within a research-industry
cooperation. After reporting the requirements and needs that
were set within the industrial project, we compare the analysis
results with those from an open source project used frequently
in related research (Firefox). In spite of the fact that the projects have different sizes and development processes, the data distributions are similar and so are the best models. We found that more easily configurable models (such as SVM+TF–IDF) are preferred, and that top-x recommendations, the number of issues per developer, and online learning can all be relevant factors when dealing with an industrial collaboration.
Keywords-Software Bug Triaging; Bug Reports; Bug Assignment; Machine Learning; Text Classification; Industrial Scale
I. INTRODUCTION
In this paper, we deal with bug triaging, that is, the process of assigning new issues or tickets to the developers who can best handle them. Most of the papers in the area of automated bug triaging deal with Open Source Software (OSS) projects (e.g. [1], [2]). In the current paper, we take a point of view that has not often been investigated: the different needs, in terms of approaches, that arise in a research-industry collaboration. In fact, OSS has a distributed development model [3] that implies a different triaging process, and also different results when an automated triaging approach is applied. For this reason, we set up the main research question:
what are the differences in bug triaging automation when
considering machine learning models between an OSS project
and a company-based proprietary project? From this main research question, we derive throughout the paper several sub-questions that give more insight into different aspects.
To answer the main research question, we conducted an experimentation within a software company from the Czech Republic that runs a SCRUM-based development process centered around several teams, with a JIRA issue tracker providing the central part of the triaging process. At the time of starting the experimentation, tickets were left unassigned until a responsible person would start the triaging process. In the paper we discuss the differences that were detected between such a proprietary project and data from Mozilla Firefox when implementing an automated triaging system.
II. RELATED WORKS
Over the years there have been many attempts to automate bug triaging. The first effort was made by Čubranić and Murphy [1] using a text categorization approach with a Naïve Bayes classifier on Eclipse data. Their dataset consisted of 15,670 bug reports with 162 classes (developers). They achieved about 30% accuracy.
Anvik et al. [2] used Support Vector Machines (SVM) on Eclipse, Firefox and GCC data. They achieved precision of 64% and 58% on Firefox and Eclipse data, but only 6% on GCC data. As for recall, only 2%, 7% and 0.3% were achieved on Firefox, Eclipse and GCC data, respectively.
Ahsan et al. [4] used an SVM classifier on Mozilla data.
They reached 44.4% classification accuracy, 30% precision
and 28% recall using SVM with LSI.
An extensive study was done by Alenezi et al. [5], using a Naïve Bayes classifier. The best results were achieved using χ², with precision values of 38%, 50%, 50% and 50% and recall values of 30%, 35%, 21% and 46% on the Eclipse-SWT, Eclipse-UI, NetBeans and Maemo projects respectively.
Xia et al. [6] employed an algorithm named DevRec based
on multi-label k-nearest neighbor classifier (ML–kNN) and
topic modeling using Latent Dirichlet Allocation (LDA). For
top-5 recommendation, the recall values were 56%, 48%, 56%,
71%, 80% and the precision values 25%, 21%, 25%, 32%,
25% for GCC, OpenOffice, Mozilla, NetBeans and Eclipse.
Automated bug triaging in industrial projects has not been
discussed often (in [7], only two out of 25 studies reviewed).
Hu et al. [8] provided an evaluation of automatic bug triaging
considering a component-developer-bug network on two in-
dustrial projects (1,008/686 bug reports, 19/11 developers) and
three OSS projects (Eclipse, Mozilla, Netbeans). They noticed that, overall, the proposed approach performs worse on the industrial projects. Jonsson et al. [7] represent one recent study underlining the limited availability of automated triaging studies within the industrial context. Using an ensemble learner, the authors collected 50,000 bug reports from five projects at two companies, showing that the system can scale well with large training sets. The main findings are that SVM models perform well in an industrial context and that model performance drops when project team size increases.
III. EXPERIMENTAL EVALUATION
The triaging automation process begins by retrieving a dataset from a bug tracking system (Fig. 1).

Fig. 1. Data analysis process: from bug tracking data to the prediction results.

The next step is to filter unwanted bug reports from the dataset. All tickets
that are unassigned, not fixed or not resolved are removed.
There can also be bug reports assigned to a universal, computer-generated user (e.g. nobody@mozilla.org); we remove these reports as well. Bug reports resolved by developers who have not fixed a sufficient number of reports in the past (e.g. 30) are also removed. Another step is to
shuffle the bug reports randomly. This is done to achieve more
accurate performance results with the cross-validation (CV)
set. The process continues by splitting the resulting dataset
into two sets—a cross-validation set and a training set. The
CV set contains 30% of the bugs while the training set contains
the remaining 70%.
The second stage is to train the machine learning model
(e.g. SVM, NB). First, however, it is necessary to run the
training dataset through a feature extraction step. This step
applies techniques that improve the performance of a machine
learning model, e.g. stop-word removal, TF–IDF weighting, etc.
The last stage uses the trained machine learning model
to generate prediction results that can be used, for example,
to compute various performance metrics. The first step is to
(again) apply feature extraction techniques on the CV dataset
and then use the trained classifier to predict results.
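To make the pipeline concrete, the following is a minimal sketch of the split, feature extraction, training and prediction stages, assuming the filtered bug report texts and their assignees are already available. The data, variable names and the scikit-learn choices (TfidfVectorizer, LinearSVC) are illustrative assumptions, not the exact tooling used in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy bug report texts and their assignees (placeholders, not real data).
reports = [
    "crash when opening the preferences dialog",
    "crash on startup with a corrupted profile",
    "startup crash after the latest update",
    "wrong label on the save button",
    "save button label is truncated",
    "typo in the settings menu label",
]
assignees = ["dev.a", "dev.a", "dev.a", "dev.b", "dev.b", "dev.b"]

# Shuffle and split: 70% training set, 30% cross-validation set.
X_train, X_cv, y_train, y_cv = train_test_split(
    reports, assignees, test_size=0.30, shuffle=True,
    stratify=assignees, random_state=42)

# Feature extraction: stop-word removal and TF-IDF weighting,
# fitted on the training set and reused on the CV set.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_cv_vec = vectorizer.transform(X_cv)

# Train the classifier and predict assignees for the CV set.
clf = LinearSVC(C=1.0)
clf.fit(X_train_vec, y_train)
predictions = clf.predict(X_cv_vec)
print(list(zip(y_cv, predictions)))
```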
One of the most prominent models for text classification is
known as Support Vector Machine (SVM). Instead of trying
to find a function that fits training data as precisely as possible
by optimizing the distance from samples to the function,
the SVM algorithm attempts to find a linear separator of
positive and negative samples by optimizing the margin of
the decision boundary (hyperplane) [9]. The samples that
limit the margin of the decision boundary on both sides are
called support vectors. However, positive and negative samples can be so intermixed that it is impossible to construct a separating decision boundary. This is why the strictness of the optimization algorithm can be relaxed by tuning the regularization parameter C [9].
In many applications, a linear classifier is not enough to obtain a good decision boundary and good performance. Using kernel functions, we can effectively transform the decision boundary into a polynomial, Gaussian or sigmoid function. By using a technique called the kernel trick, this can be done without increasing the time complexity of the learning algorithm [9].
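As an illustration of the role of C and of the kernel choice, the snippet below fits scikit-learn's SVC on a made-up two-dimensional dataset with a linear and an RBF (Gaussian) kernel; the data and parameter values are purely for demonstration.

```python
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]   # toy feature vectors
y = [0, 0, 1, 1]                                        # class labels

# A linear decision boundary with a soft margin controlled by C:
# smaller C relaxes the optimization, larger C penalizes violations more.
linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The kernel trick replaces the linear boundary with a non-linear one
# (polynomial, Gaussian/RBF or sigmoid) without changing the algorithm.
rbf_clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y)

print(linear_clf.predict([[0.1, 0.05]]), rbf_clf.predict([[0.95, 1.0]]))
```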
To evaluate the performance of the models, we considered three metrics: accuracy, precision and recall¹.
1) Accuracy: This metric measures the overall fraction of
correctly predicted assignees out of all predictions [9].
$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \qquad (1)$$
2) Macro-Averaged Precision: Precision measures the fraction of assignees correctly predicted as positive out of all assignees predicted as positive, whether correctly or incorrectly. A value is computed separately for every class (assignee) and the result is the mean of all these values.
$$\text{Precision}_{macro} = \frac{1}{q}\sum_{\lambda=1}^{q}\frac{tp_\lambda}{tp_\lambda + fp_\lambda} \qquad (2)$$
3) Macro-Averaged Recall: Similar to precision, except recall measures the fraction of assignees correctly predicted as positive out of the assignees either correctly predicted as positive or incorrectly predicted as negative. Again, the macro-averaged variant is computed as the mean of the recall values of all classes (assignees).
$$\text{Recall}_{macro} = \frac{1}{q}\sum_{\lambda=1}^{q}\frac{tp_\lambda}{tp_\lambda + fn_\lambda} \qquad (3)$$
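A minimal sketch of how these three metrics can be computed follows, assuming predicted and actual assignees are available as label lists; scikit-learn's average="macro" option corresponds to the macro-averaged variants in Eqs. (2) and (3). The labels are illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["dev.a", "dev.b", "dev.a", "dev.c", "dev.b"]   # actual assignees
y_pred = ["dev.a", "dev.a", "dev.a", "dev.c", "dev.b"]   # predicted assignees

# Overall fraction of correct predictions, Eq. (1).
accuracy = accuracy_score(y_true, y_pred)

# average="macro" computes the metric per class (assignee) and takes the mean,
# matching the macro-averaged precision and recall in Eqs. (2) and (3).
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
print(accuracy, precision, recall)
```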
A. Dataset description
The Proprietary dataset was collected within the context of
the collaboration with a software company from the Czech
Republic. The development follows a SCRUM-based process centered around several teams, with a JIRA issue tracker collecting all the issues. At the time of starting the experimentation, tickets were left unassigned until a responsible person would start the triaging process. The mined dataset
contains 2,926 bug reports created between 2012-11-23 and
2015-10-16. Only bug reports that were resolved with assigned
developers were considered. There are 110 developers in this
dataset. Only 2,424 bugs assigned to 35 developers were retained after removing developers with fewer than 30 fixed bugs (Fig. 2).
The OSS dataset was mined from the Mozilla repository for the Firefox project². We downloaded all bugs that are in
status RESOLVED with resolution FIXED and were created
in year 2010 or later. We also removed bugs with field
assigned_to set to nobody@mozilla.org as those
tickets were not assigned to an actual developer. In total, we
were able to retrieve 9,141 bugs. To get a better compari-
son with the other datasets, we only use 3,000 data points
for training and cross-validation that were created between
2010-01-01 and 2012-07-10. This dataset contains 343 labels
(developers). Finally, we remove developers who did not fix at
least 30 bugs, yielding 1,810 bugs with 20 developers (Fig. 3).
¹ For both precision and recall, we considered the macro-averaged variants.
² https://bugzilla.mozilla.org
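As an illustration of the filtering described in this section, the following sketch assumes the mined reports are available as a CSV export with hypothetical column names (status, resolution, assigned_to, created); the actual field names in a Bugzilla or JIRA export may differ.

```python
import pandas as pd

# Hypothetical export of the mined bug reports.
bugs = pd.read_csv("firefox_bugs.csv", parse_dates=["created"])

# Keep only resolved/fixed reports created in 2010 or later that are
# assigned to a real developer rather than the placeholder account.
bugs = bugs[(bugs["status"] == "RESOLVED")
            & (bugs["resolution"] == "FIXED")
            & (bugs["created"] >= "2010-01-01")
            & (bugs["assigned_to"] != "nobody@mozilla.org")]

# Drop developers who fixed fewer than 30 bugs.
counts = bugs["assigned_to"].value_counts()
bugs = bugs[bugs["assigned_to"].isin(counts[counts >= 30].index)]
```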
Fig. 2. Proprietary project: distribution of tickets per single developer.
Fig. 3. Firefox project: distribution of tickets per single developer.
B. Experimental Questions
For the analysis, we structured the experimental part linking
the original goal to several questions:
Q1/Q2. Are the two datasets coming from the same
distribution? Do the samples have the same population
mean? This allows us to determine how similar the two datasets are in terms of issue distribution;
Q3. Do classification models bring benefits over a base-
line classifier? This allows us to review the relevance of
the provided models;
Q4. Which classification model brings the best results for
both datasets in terms of accuracy, precision and recall?
This allows us to look into differences of performance of
the models taking into account the two datasets;
Q5. Taking into account the best identified models, how
do they perform when considering different numbers of issues per developer? Companies want to understand how much time needs to pass before a developer can be included in the prediction process;
Q6. Taking into account the best identified models, what
is the performance of the models considering a higher
number of recommendations? One of the requirements
of companies is to provide a ranked list of assignees that can be useful in case of ticket re-assignments.
C. Dataset distributions (Q1/Q2)
We use the two-sided alternative of the two-sample chi-
square test to evaluate the null hypothesis that two samples
are from the same distribution. The test with Firefox and
Proprietary samples returns a p-value of 0.8648, so we fail to reject the null hypothesis that the two samples are from the same distribution (5% significance).
Having tested the hypothesis that the datasets come from the same distribution, we next test the null hypothesis that the samples have the same population mean, which allows us to further learn how statistically similar the datasets are while also looking at the variances of the samples. We use the two-sided alternative of the standard independent two-sample t-test for the population mean, and the two-sided alternative of the Levene test to check whether the samples have equal population variance. Testing the data from the Firefox and Proprietary datasets, the Levene test yields a p-value of 0.2730; as the p-value > 0.05, we fail to reject the null hypothesis of equal variances (5% significance). To test the population means of the two samples, we can therefore use the standard independent t-test. It results in a p-value of 0.1954, which means we fail to reject the null hypothesis of equal population means (5% significance).
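A hedged sketch of these tests with SciPy follows, using placeholder per-developer ticket counts; the two-sample chi-square test is expressed here as a chi-square test of homogeneity over binned counts, which is one possible realization (an assumption on our part) of the test described above.

```python
import numpy as np
from scipy import stats

# Placeholder per-developer ticket counts for the two projects.
proprietary = np.array([30, 45, 51, 60, 72, 88, 95, 110, 130, 160])
firefox = np.array([31, 40, 55, 63, 70, 85, 99, 120, 140, 150])

# Bin both samples identically and test homogeneity of the two distributions.
bins = np.histogram_bin_edges(np.concatenate([proprietary, firefox]), bins=5)
table = np.vstack([np.histogram(proprietary, bins)[0],
                   np.histogram(firefox, bins)[0]])
chi2_stat, chi2_p, _, _ = stats.chi2_contingency(table)

# Levene test for equal variances decides which t-test variant applies.
levene_p = stats.levene(proprietary, firefox).pvalue

# Standard independent two-sample t-test (equal variances not rejected).
ttest_p = stats.ttest_ind(proprietary, firefox, equal_var=True).pvalue
print(chi2_p, levene_p, ttest_p)
```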
The chi-square and t-tests imply there is little difference between the proprietary and Firefox datasets in terms of data distribution.
D. Comparison with the Baseline Model (Q3)
A first step in the evaluation of the models was to compare them against a baseline model. We defined this baseline as a classifier that assigns every bug report to the developer with the highest number of reports. While the accuracy of this model is relatively high (18%), the precision and recall values are much lower (1% and 5%) on the Firefox data (Fig. 4).
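Such a baseline can be sketched, for example, with scikit-learn's DummyClassifier, which always predicts the most frequent class (the developer with the highest number of reports); this is an assumed stand-in for the paper's own implementation and the data is illustrative.

```python
from sklearn.dummy import DummyClassifier

X_train = [[0], [0], [0], [0]]                 # features are ignored by the baseline
y_train = ["dev.a", "dev.a", "dev.a", "dev.b"] # dev.a has fixed the most reports

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline.predict([[0], [0]]))            # -> ['dev.a' 'dev.a']
```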
There is also an increase in performance after all stop-words were removed from the feature vector of the Firefox data. The performance of the classifier slightly increased for all models. Accuracy increased by 3% with SVM. The precision value of the SVM model decreased by 1% but increased by 6% with Naïve Bayes. Finally, recall values of SVM and Naïve Bayes increased by 4% and 2% respectively. We therefore conclude that the performance boost of stop-words removal is significant enough to warrant better results, which matches the conclusion of Čubranić and Murphy [1].
The initial models bring benefits over the baseline classifier.
E. Comparison of the Models (Q4)
Based on related work, we compared two models: Naïve Bayes (see previous work from one of the authors, [10]) and SVM. Parameters of all models were optimized by grid search.
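The grid search can be sketched as below, tuning an SVM+TF–IDF pipeline over an illustrative parameter grid; the specific grid values and the toy data are assumptions, not the values used in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["crash on startup", "wrong label in dialog", "crash when saving",
         "dialog button misaligned", "startup crash on linux", "label typo"]
labels = ["dev.a", "dev.b", "dev.a", "dev.b", "dev.a", "dev.b"]

pipeline = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                     ("svm", LinearSVC())])

# Search over the regularization strength and the TF-IDF n-gram range.
grid = GridSearchCV(pipeline,
                    {"svm__C": [0.1, 1.0, 10.0],
                     "tfidf__ngram_range": [(1, 1), (1, 2)]},
                    cv=3)
grid.fit(texts, labels)
print(grid.best_params_)
```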
On the proprietary dataset, SVM+TF–IDF offers the best performance of 53% accuracy, 59% precision and 47% recall. The same model with LSI also shows quite good performance, and the Naïve Bayes model with χ² and TF–IDF performs quite well as far as precision is concerned. Both SVM and Naïve Bayes models exhibit rather wide spreads between precision and recall values (Fig. 5), which could be an indication of higher variance of the proprietary data (Fig. 2 and 3).

Fig. 4. Firefox project: baseline model and stop-words removal.

Fig. 5. Proprietary project: models comparison.
On the Firefox dataset, SVM+TF–IDF again achieves the best performance, with an accuracy of 57%. Its precision and recall also outperform all the other approaches, with values of 51% and 45% (Fig. 6). The comparison shows that SVM+TF–IDF performs best on all datasets. It also generalizes very well, because there was no need to readjust the parameters. The disadvantage of the model is that it is the most computationally complex one, because SVM training is the slowest when there are many classes and features. This can be at least partially dealt with by using χ² feature selection in conjunction with TF–IDF, while sacrificing some of the classification performance.
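One possible way to realize this combination is a pipeline that inserts chi-square feature selection between TF–IDF and the SVM; the value of k below is an assumed example, not a value reported in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

reduced_model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("chi2", SelectKBest(chi2, k=1000)),   # keep only the k most informative terms
    ("svm", LinearSVC()),
])
# reduced_model.fit(train_texts, train_assignees)  # fit as with the full model
```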
Fig. 6. Firefox project: models comparison.
SVM+TF–IDF+Stop Words Removal is the best model for
both proprietary company-based and Firefox data.
F. Number of Issues per Developer (Q5)
We compared the two datasets using SVM with TF–IDF weighting by computing the performance (accuracy, precision and recall) for six different settings of the minimum issues per developer requirement (1, 3, 5, 10, 20, 30) (Fig. 7). The accuracy on the proprietary dataset (53%) is a bit lower than the accuracy on the open-source dataset (54%) when the minimum issues per developer equals 30. An interesting fact is that the initial behaviour is quite different, with much higher performance on the proprietary dataset (42%) than on the open-source dataset (29%).
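The sweep behind this comparison can be sketched as follows, assuming a DataFrame with hypothetical text and assigned_to columns from the earlier filtering step; this is an illustrative reconstruction, not the study's exact code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def accuracy_for_threshold(bugs, min_issues):
    # Keep only developers with at least `min_issues` fixed bugs.
    counts = bugs["assigned_to"].value_counts()
    subset = bugs[bugs["assigned_to"].isin(counts[counts >= min_issues].index)]
    X_train, X_cv, y_train, y_cv = train_test_split(
        subset["text"], subset["assigned_to"], test_size=0.30, random_state=42)
    # Retrain SVM+TF-IDF on the filtered data and measure CV accuracy.
    model = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                      ("svm", LinearSVC())]).fit(X_train, y_train)
    return model.score(X_cv, y_cv)

# accuracies = {t: accuracy_for_threshold(bugs, t) for t in (1, 3, 5, 10, 20, 30)}
```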
Fig. 7. Comparison of accuracy of the SVM model.
The precision value of the proprietary dataset (Fig. 8) is eventually higher (59%) than that of the open-source dataset (54%). For recall, the right-most case (minimum number of issues per developer equal to 30) shows the performance of the proprietary dataset slightly lower (47%) than that of the open-source dataset (50%). Also in this case, the performance for a minimum number of issues per developer equal to one is much higher for the proprietary dataset (14% vs 3%).
Fig. 8. Comparison of precision of the SVM model.
The performance on the proprietary dataset is generally quite similar to that on the open-source dataset. An interesting observation is that the spread between precision and recall is much higher for the proprietary dataset, which possibly implies higher variance of the dataset. All our results show a significant difference in performance when the minimum number of issues per developer equals one. The higher performance of the proprietary dataset in this regard can probably be explained by the fact that open-source bug repositories are open to anyone.
One-time assignees can negatively impact the performance when considering different numbers of issues per developer.
G. Performance for a Higher Number of Recommendations (Q6)
We examined the performance of the models with different numbers of recommendations. We show the performance of the SVM model with TF–IDF weighting trained on the proprietary and Firefox datasets for numbers of recommendations from one to ten (Fig. 9). The accuracy increases with the number of recommendations, which is expected: the more recommendations, the higher the chance of a hit (i.e. the chance that the list of predictions contains the correct assignee).
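Top-x accuracy can be computed from the per-developer scores of the trained SVM, counting a hit whenever the correct assignee appears among the x highest-scoring developers; the sketch below assumes a multi-class classifier exposing decision_function, and the variable names are illustrative.

```python
import numpy as np

def top_x_accuracy(classifier, X_cv, y_cv, x):
    scores = classifier.decision_function(X_cv)   # one score per developer (class)
    top_x = np.argsort(scores, axis=1)[:, -x:]    # indices of the x highest scores
    classes = np.asarray(classifier.classes_)
    # A prediction is a hit if the true assignee is among the top-x developers.
    hits = [true in classes[row] for true, row in zip(y_cv, top_x)]
    return np.mean(hits)

# top_x_accuracy(clf, X_cv_vec, y_cv, x=5)   # e.g. top-5 recommendation accuracy
```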
Fig. 9. Comparison of accuracy of top-x recommendations for SVM model.
It is apparent from the plot that the highest performance boost happens when the number of recommendations changes from one to two, both for the proprietary dataset (54% vs 66%) and the Firefox dataset (59% vs 72%). The accuracy on the Firefox dataset reaches 90% for 5 recommendations, against 81% for the proprietary dataset. When the number of recommendations is 10, the performance is 97% and 90% for the Firefox and proprietary data respectively.
Both datasets show similar behaviour in the changes of accuracy when the number of recommended developers varies.
IV. DISCUSSION & CONCLUSION
In the current paper, we evaluated the usage of an automated
bug triaging process within a Czech Republic-based company
to look at similarities and differences to the usually analyzed
open source projects. We found that the best classification
model is SVM with TF–IDF, achieving an accuracy of 53%, precision of 59% and recall of 47% on the proprietary dataset, not dissimilar from the Firefox dataset: 57%, 51% and 45% respectively. One appreciated advantage of such a model within the company is that it does not require extensive parameter optimization. From the
point of view of the many aspects we analyzed, the distribution
of the datasets seems to be very similar (t-test and chi-square
test). Subsequent evaluation of performance of the datasets
using SVM and TF-IDF supports this conclusion further.
Unfortunately, this conclusion does not apply to all feature extraction methods: our results show that it is necessary to choose different parameters for both the LSI and χ² techniques. In terms of performance, however, there is a significant difference when we omit the filtering of developers who fix few bugs. The most likely explanation is that the company-based proprietary dataset contains fewer developers who fixed only a few bugs, as the proprietary bug repository is not public.
Comparing the performance of the models with the related work can be problematic due to the different analysis processes, datasets and metrics used. Overall, the published results on the Firefox project can be considered comparable with ours. Differences arise mostly in the way inactive developers are filtered ([2]) or in the number of reports used (we used 3,000 to compare with the proprietary project, while [5] considered 11,311).
Looking at the related works on the application of automated bug triaging within an industrial context (mainly [7]), there are many characteristics that proprietary projects do not typically share with open source software and that can be at the root of different results. Within our company-based collaboration, we particularly noticed the need to discuss: i) the number of issues per developer needed before predictions become reliable, ii) the number of top-x recommendations to present, iii) the implications of online learning, as well as iv) providing support for team assignment in parallel with individual assignment.
REFERENCES
[1] G. C. Murphy and D. Čubranić, “Automatic bug triage using text categorization,” in the Sixteenth Int. Conference on Software Engineering & Knowledge Engineering. KSI Press, 2004, pp. 92–97.
[2] J. Anvik, L. Hiew, and G. C. Murphy, “Who should fix this bug?” in Pro-
ceedings of the 28th International Conference on Software Engineering,
ser. ICSE ’06. New York, NY, USA: ACM, 2006, pp. 361–370.
[3] A. Mockus, R. T. Fielding, and J. D. Herbsleb, “Two case studies of open source software development: Apache and Mozilla,” ACM Trans. Softw. Eng. Methodol., vol. 11, no. 3, pp. 309–346, Jul. 2002.
[4] S. N. Ahsan, J. Ferzund, and F. Wotawa, “Automatic Software Bug
Triage System (BTS) Based on Latent Semantic Indexing and Support
Vector Machine,” 2009 Fourth International Conference on Software
Engineering Advances, pp. 216–221, Sep. 2009.
[5] M. Alenezi, K. Magel, and S. Banitaan, “Efficient Bug Triaging Using Text Mining,” Journal of Software, vol. 8, no. 9, pp. 2185–2190, Sep. 2013.
[6] X. Xia, D. Lo, X. Wang, and B. Zhou, “Dual analysis for recommending developers to resolve bugs,” Journal of Software: Evolution and Process, vol. 27, no. 3, pp. 195–220, 2015.
[7] L. Jonsson, M. Borg, D. Broman, K. Sandahl, S. Eldh, and P. Runeson,
“Automated bug assignment: Ensemble-based machine learning in large
scale industrial contexts,” Empirical Software Engineering, pp. 1–46,
2015.
[8] H. Hu, H. Zhang, J. Xuan, and W. Sun, “Effective bug triage based
on historical bug-fix information.” in ISSRE. IEEE Computer Society,
2014, pp. 122–132.
[9] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.
[10] N. K. Singha Roy and B. Rossi, “Towards an improvement of bug sever-
ity classification,” in Software Engineering and Advanced Applications
(SEAA), 2014 40th EUROMICRO Conference on. IEEE, 2014, pp.
269–276.