Predicting Bugs' Components via Mining Bug Reports

Computing Research Repository - CORR 10/2010; 7(5). DOI: 10.4304/jsw.7.5.1149-1154
Source: arXiv


The number of bug reports in complex software is increasing dramatically.
Because bugs are currently triaged manually, bug triage (assignment) is a
labor-intensive and time-consuming task. Without knowledge of the structure of
the software, testers often specify the component of a new bug incorrectly.
Meanwhile, it is difficult for triagers to determine the component of a bug
from its description alone. We find that the components of 28,829 bugs in the
Eclipse bug project were specified incorrectly and modified at least once. As a
result, these bugs have to be reassigned, which delays bug fixing. The average
time to fix wrongly specified bugs is longer than that for correctly specified
ones. To address the problem automatically, we use historical fixed bug reports
as a training corpus and build classifiers based on support vector machines and
Naïve Bayes to predict the component of a new bug. The best prediction accuracy
reaches 81.21% on our validation corpus from the Eclipse project. On average,
our predictive model can save triagers and developers about 54.3 days in
repairing a bug.

Keywords: bug reports; bug triage; text classification; predictive model
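
To make the classification step concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier that predicts a component from a bug summary. It is not the authors' implementation: the abstract does not describe their features, preprocessing, or tooling, so the tokenizer, the add-one smoothing, the class name `NaiveBayesComponentClassifier`, and the six toy reports with their component labels are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a bug summary and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesComponentClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, reports, components):
        self.doc_counts = Counter(components)      # reports per component
        self.word_counts = defaultdict(Counter)    # token counts per component
        self.vocab = set()
        for text, component in zip(reports, components):
            tokens = tokenize(text)
            self.word_counts[component].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(reports)
        return self

    def log_scores(self, text):
        """Return log P(component) + sum of log P(token | component)."""
        # Tokens never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for component, n_docs in self.doc_counts.items():
            counts = self.word_counts[component]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            scores[component] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training data: invented summaries and component names,
# standing in for historical fixed bug reports.
TRAIN = [
    ("Button label truncated in preferences dialog", "UI"),
    ("Dialog window does not repaint after resize", "UI"),
    ("Breakpoint ignored when stepping into method", "Debug"),
    ("Debugger hangs while evaluating watch expression", "Debug"),
    ("Compiler reports wrong error for generic method", "Core"),
    ("Incremental build misses changed class file", "Core"),
]

model = NaiveBayesComponentClassifier().fit(
    [summary for summary, _ in TRAIN],
    [component for _, component in TRAIN],
)

if __name__ == "__main__":
    print(model.predict("Debugger skips breakpoint when stepping"))
```

In a realistic setting the training set would be the historical fixed reports, and accuracy would be measured on a held-out validation corpus, as the abstract describes. A support vector machine could be substituted for the Naive Bayes model without changing the surrounding workflow.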



Available from: Wenjun Wu, Jan 22, 2014
  • ABSTRACT: Much work has been done on feature selection. Existing methods are based on document frequency, such as Chi-Square Statistic, Information Gain, etc. However, these methods have two shortcomings: one is that they are not reliable for low-frequency terms, and the other is that they only count whether a term occurs in a document and ignore the term frequency. In fact, high-frequency terms within a specific category are often regarded as discriminators. This paper focuses on how to construct the feature selection function based on term frequency, and proposes a new approach based on the $t$-test, which is used to measure the diversity of the distributions of a term between the specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that our new approach is comparable to or slightly better than the state-of-the-art feature selection methods (i.e., $\chi^2$ and IG) in terms of macro-$F_1$ and micro-$F_1$.
    Pattern Recognition Letters 05/2013; 45(1). DOI:10.1145/2396761.2398457 · 1.55 Impact Factor
  • ABSTRACT: Software maintenance starts as soon as the first artifacts are delivered and is essential for the success of the software. However, keeping maintenance activities and their related artifacts on track comes at a high cost. In this respect, change request (CR) repositories are fundamental in software maintenance. They facilitate the management of CRs and are also the central point to coordinate activities and communication among stakeholders. However, the benefits of CR repositories do not come without issues, and commonly occurring ones should be dealt with, such as the following: duplicate CRs, the large number of CRs to assign, or poorly described CRs. Such issues have led researchers to an increased interest in investigating CR repositories, by considering different aspects of software development and CR management. In this paper, we performed a systematic mapping study to characterize this research field. We analyzed 142 studies, which we classified in two ways. First, we classified the studies into different topics and grouped them into two dimensions: challenges and opportunities. Second, the challenge topics were classified in accordance with an existing taxonomy for information retrieval models. In addition, we investigated tools and services for CR management, to understand whether and how they addressed the topics identified. Copyright © 2013 John Wiley & Sons, Ltd.
    Journal of Software: Evolution and Process 12/2013; 26(7). DOI:10.1002/smr.1639 · 0.62 Impact Factor
  • ABSTRACT: Large open source bug tracking systems receive a large number of bug reports daily. Managing this volume of incoming bug reports is a challenging task. Dealing with these reports manually consumes time and resources, which delays the resolution of important bugs that need to be identified and resolved early. Bug triaging is an important process in software maintenance. Some bugs are important and need to be fixed right away, whereas others are minor and their fixes can be postponed until resources are available. Most automatic bug assignment approaches do not take the priority of bug reports into consideration. Assigning bug reports based on their priority may play an important role in enhancing the bug triaging process. In this paper, we present an approach to predict the priority of a reported bug using different machine learning algorithms, namely Naive Bayes, Decision Trees, and Random Forest. We also investigate the effect of using two feature sets on the classification accuracy. We conduct an experimental evaluation using two open-source projects, namely Eclipse and Firefox. The experimental evaluation shows that the proposed approach is feasible in predicting the priority of bug reports. It also shows that feature-set-2 outperforms feature-set-1. Moreover, both Random Forests and Decision Trees outperform Naive Bayes.
    Proceedings of the 2013 12th International Conference on Machine Learning and Applications - Volume 02; 12/2013