GiveMeLabeledIssues: An Open Source Issue
Recommendation System
Joseph Vargovich, Fabio Santos, Jacob Penney, Marco A. Gerosa, Igor Steinmacher
School of Informatics, Computing, and Cyber Systems
Northern Arizona University
Flagstaff, United States
joseph vargovich@nau.edu, fabio santos@nau.edu, jacob penney@nau.edu, Marco.Gerosa@nau.edu, Igor.Steinmacher@nau.edu
Abstract—Developers often struggle to navigate an Open
Source Software (OSS) project’s issue-tracking system and find
a suitable task. Proper issue labeling can aid task selection, but
current tools are limited to classifying the issues according to
their type (e.g., bug, question, good first issue, feature, etc.). In
contrast, this paper presents a tool (GiveMeLabeledIssues) that
mines project repositories and labels issues based on the skills
required to solve them. We leverage the domain of the APIs
involved in the solution (e.g., User Interface (UI), Test, Databases
(DB), etc.) as a proxy for the required skills. GiveMeLabeledIssues
facilitates matching developers’ skills to tasks, reducing the
burden on project maintainers. The tool obtained a precision of
83.9% when predicting the API domains involved in the issues.
The replication package contains instructions on executing the
tool and including new projects. A demo video is available at
https://www.youtube.com/watch?v=ic2quUue7i8
Index Terms—Open Source Software, Machine Learning, La-
bel, Tag, Task, Issue Tracker
I. INTRODUCTION
Contributors struggle to find issues to work on in Open
Source Software (OSS) projects due to difficulty determining
the skills required to work on an issue [1]. Labeling issues
manually can help [2], but manual work increases maintainers’
effort, and many projects end up not labeling their issues [3].
To facilitate issue labeling, several studies proposed mining
software repositories to tag the issues with labels such as
bug/non-bug and good-first-issue (to help newcomers find a
suitable task) [4, 5, 6, 7]. However, newcomers to a project
have a variety of skill levels and these approaches do not
indicate the skills needed to work on the tasks.
APIs encapsulate modules with specific functionality. If we
can predict the APIs used to solve an issue, we can guide new
contributors on what to contribute. However, since projects
include thousands of different APIs, adding each API as a
label would harm the user experience. To keep the number
of labels manageable, we classified the APIs into domains for labeling [2, 8].
In our study, API domains represent categories of APIs such as “UI,” “DB,” “Test,” etc. In our previous work, we used 31
distinct API domains [8]. However, we acknowledge that the
set of possible API-domain labels varies by project, as each
project requires different skills and expertise. In another work, Santos et al. [9] evaluated the API-domain labels in an
empirical experiment with developers. The study showed that
the labels enabled contributors to find OSS tasks that matched
their expertise more efficiently. Santos et al.’s study showed promising results when classifying issues by API domain,
with precision, recall, and F-measure scores of 0.755, 0.747,
and 0.751, respectively.
Following this idea, we implemented GiveMeLabeledIssues
to classify issues for potential contributors. GiveMeLabeledIs-
sues is a web tool that indicates API-domain labels for open
issues in registered projects. Currently, GiveMeLabeledIssues
works with three open-source projects: JabRef, PowerToys,
and Audacity. The tool enables users to select a project, input
their skill set (in terms of our categories), and receive a list
of open issues that would require those skills.
We trained the tool with the title and body text of closed issues and the APIs used in the code that solved the issues, as gathered from pull requests. We evaluated the tool using
closed issues as the ground truth and found an average of
83.8% precision and 78.5% recall when training and testing
models for individual projects.
II. GIVEMELABELEDISSUES ARCHITECTURE
GiveMeLabeledIssues leverages prediction models trained
with closed issues linked with merged pull requests. On top
of these models, we built a platform that receives user input
and queries the open issues based on the users’ skills and
desired projects. In the following, we detail the process of
building the model, the issue classification process, and the
user interface that enables users to get the recommendation.
GiveMeLabeledIssues is structured in two layers: the frontend
web interface1 and the backend REST API2.
A. Model training
In the current implementation, a model must be built for
each project using data collected from project issues linked
with merged pull requests. The tool maps the issue text data
to the APIs used in the source code that solved the issues
(Figure 1).
1https://github.com/JoeyV55/GiveMeLabeledIssuesUI
2https://github.com/JoeyV55/GiveMeLabeledIssuesAPI
Fig. 1. The Process of Training a Model. (Pipeline shown in the figure: mine repositories → collect issues and pull requests → link issues to pull requests → filter out pull requests without source code → parse APIs from the source code → filter APIs → construct corpus → split train/test data → train models → classify API domains, producing one model per project: PowerToys, JabRef, and Audacity.)
1) Mining repositories: We collected 18,482 issues and
3,129 pull requests (PRs) from JabRef, PowerToys, and Au-
dacity projects up to November 2021. We used the GitHub
REST API v3 to collect the title, body, comments, closure date, names of the files changed in the PR, and commit messages.
2) API parsing: We built a parser to process all source
files from the projects to identify the APIs used in the source
code affected by each pull request. In total, we found 3,686
different APIs in 3,108 source files. The parser looked for
specific commands, i.e., import (Java), using (C#), and include
(C++). The parser identified all classes, including the complete
namespace from each import/using/include statement.
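For illustration, a minimal Python sketch of this parsing step follows; the regular expressions, file extensions, and example path are assumptions rather than the tool’s actual implementation.

```python
import re
from pathlib import Path

# Import-like statements per language (simplified; the real parser may cover more syntax).
PATTERNS = {
    ".java": re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)\s*;", re.MULTILINE),
    ".cs":   re.compile(r"^\s*using\s+([\w.]+)\s*;", re.MULTILINE),
    ".cpp":  re.compile(r'^\s*#\s*include\s*[<"]([\w./]+)[>"]', re.MULTILINE),
}

def extract_apis(path):
    """Return the namespaces/headers referenced by one source file."""
    pattern = PATTERNS.get(Path(path).suffix)
    if pattern is None:
        return set()
    return set(pattern.findall(Path(path).read_text(errors="ignore")))

# Hypothetical usage on a file changed by a pull request:
# extract_apis("src/main/java/org/jabref/gui/StateManager.java")
```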
3) Dataset construction: We kept only the data from issues
linked with merged and closed pull requests since we needed
to map issue data to the APIs that are used in the source
code files changed to close the issue. To find the links
between pull requests and issues, we searched for the symbol
#issue_number in the pull request title and body and
checked the URL associated with each link. We also filtered
out issues linked to pull requests without at least one source
code file (e.g., those associated only with documentation files)
since they do not provide the model with content related to
any API.
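A minimal sketch of this linking and filtering step is shown below; the file-extension list and data shapes are assumptions made for illustration.

```python
import re

ISSUE_REF = re.compile(r"#(\d+)")
SOURCE_EXTENSIONS = (".java", ".cs", ".cpp", ".h")

def linked_issues(pr_title, pr_body):
    """Collect the #issue_number references found in a pull request's title and body."""
    return {int(n) for n in ISSUE_REF.findall(f"{pr_title}\n{pr_body or ''}")}

def touches_source_code(changed_files):
    """Keep only pull requests that modify at least one source code file."""
    return any(name.endswith(SOURCE_EXTENSIONS) for name in changed_files)

# Example: a PR titled "Fix crash on import #1234" that only edits README.md is discarded.
print(linked_issues("Fix crash on import #1234", ""))   # {1234}
print(touches_source_code(["README.md"]))               # False
```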
4) API categorization: We use the API domain categories defined by Santos et al. [8]. These 31 categories were defined by experts to encompass APIs from several projects (e.g., UI, IO, Cloud, DB, etc.); see our replication package3.
5) Corpus construction: We used the issue title and body
as our corpus to train our model since they performed well in
our previous analysis [9]. Similar to other studies [10, 11],
we applied TF-IDF as a technique for quantifying word
importance in documents by assigning a weight to each word, following the same process described in our previous work [9]. TF-IDF returns a vector whose length is the number
of terms used to calculate the scores. Before calculating the
scores, we converted each word to lowercase and removed URLs, source code, numbers, and punctuation. After that, we removed templates and stop words and stemmed the words. These TF-IDF
scores are then passed to the Random Forest classifier (RF)
as features for prediction. RF was chosen since it obtained
the best results in previous work [9]. The ground truth has a
3https://doi.org/10.5281/zenodo.7575116
binary value (0 or 1) for each API domain, indicating whether
the corresponding domain is present in the issue solution.
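A compact sketch of this preprocessing and vectorization pipeline is shown below, assuming NLTK and scikit-learn; the exact cleaning rules, template removal, and vectorizer settings are assumptions for illustration.

```python
import re
from nltk.corpus import stopwords          # may require nltk.download("stopwords") once
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean(text):
    """Lowercase, drop URLs, numbers, and punctuation, remove stop words, and stem."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)   # URLs
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers and punctuation
    # (source code snippets and issue templates would be stripped in a similar way)
    return " ".join(stemmer.stem(t) for t in text.split() if t not in stop_words)

# Toy linked issues; in the tool these come from the mined dataset.
issues = [
    {"title": "Fix database connection leak", "body": "SQL sessions are never closed ..."},
    {"title": "Dark theme for the main window", "body": "UI colors should follow the OS ..."},
]
corpus = [clean(i["title"] + " " + i["body"]) for i in issues]
X = TfidfVectorizer().fit_transform(corpus)  # features handed to the Random Forest classifier
# y (not shown) is a binary matrix with one column per API domain, as described above.
```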
We also offer the option of using a BERT model in
GiveMeLabeledIssues. We created two separate CSV files to train BERT: one pairing the issue corpus with binary expert API-domain labels, and one listing the possible labels for the specific project. BERT directly labels the issue using the corpus text and the list of possible labels, without needing an additional classifier (such as Random Forest).
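For illustration, the two files could be assembled as in the sketch below with pandas; the column layout and file names are assumptions, not the exact format expected by the tool.

```python
import pandas as pd

# Hypothetical linked issues with expert-assigned binary API-domain labels.
rows = [
    {"text": "Fix database connection leak ...", "DB": 1, "UI": 0, "Test": 0},
    {"text": "Dark theme for the main window ...", "DB": 0, "UI": 1, "Test": 0},
]
possible_labels = ["DB", "UI", "Test"]  # the label set varies per project

pd.DataFrame(rows).to_csv("train.csv", index=False)                          # corpus + binary labels
pd.Series(possible_labels).to_csv("labels.csv", index=False, header=False)   # possible labels
```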
6) Building the model: The BERT model was built using
the Python package fast-bert4, which builds on the Transformers5 library for PyTorch. Before training the model, the optimal
learning rate was computed using a LAMB optimizer [12].
Finally, the model was fit over 11 epochs and validated every
epoch. The BERT model was trained on an NVIDIA Tesla
V100 GPU with an Intel(R) Xeon(R) Gold 6132 CPU within
a computing cluster.
TF-IDF and BERT models were trained and validated for
every fold in a ShuffleSplit 10-fold cross-validation. Once
trained, the models were hosted on the backend. The replica-
tion package contains instructions on registering a new project
by running the model training pipeline that feeds the demo
tool. The models can then output predictions quickly without
continually retraining with each request.
For the RandomForestClassifier (TF-IDF), the best classifier, we kept the following parameters: criterion='entropy', max_depth=50, min_samples_leaf=1, min_samples_split=3, and n_estimators=50.
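A sketch of this training setup with the hyperparameters listed above is shown below; the toy data, test-set size, and averaging choice are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import f1_score

# Toy stand-ins for the TF-IDF features (X) and the binary API-domain matrix (y).
rng = np.random.default_rng(0)
X = rng.random((200, 50))            # 200 issues, 50 TF-IDF terms
y = rng.integers(0, 2, (200, 3))     # 3 API domains

# RandomForestClassifier handles the multi-label binary matrix natively.
clf = RandomForestClassifier(criterion="entropy", max_depth=50, min_samples_leaf=1,
                             min_samples_split=3, n_estimators=50)

cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)  # test_size is an assumption
for train_idx, test_idx in cv.split(X):
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    print(f1_score(y[test_idx], pred, average="micro", zero_division=0))
```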
B. Issue Classification Process
GiveMeLabeledIssues classifies currently open issues for
each registered project. The tool combines the title and body
text and sends it to the classifier. The classifier then determines
which domain labels are relevant to the gathered issues based
on the inputted issue text. The labeled issues are stored in an
SQLite database for future requests, recording the issue num-
ber, title, body, and domain labels outputted by the classifier.
The open issues for all projects registered with GiveMeLa-
beledIssues are reclassified daily to ensure that the database is
up to date as issues are opened and closed. Figure 2 outlines
the daily classification procedure.
Fig. 2. The Process of Classifying and Storing Issues
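A minimal sketch of the storage step follows, assuming a simple schema; the tool’s actual Django models and table layout may differ.

```python
import sqlite3

def store_labeled_issues(db_path, project, labeled_issues):
    """Persist classified issues so user requests can be served without re-running the models."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS labeled_issue (
                       project TEXT, number INTEGER, title TEXT, body TEXT, labels TEXT,
                       PRIMARY KEY (project, number))""")
    con.executemany(
        "INSERT OR REPLACE INTO labeled_issue VALUES (?, ?, ?, ?, ?)",
        [(project, i["number"], i["title"], i["body"], ",".join(i["labels"]))
         for i in labeled_issues],
    )
    con.commit()
    con.close()

# Example with a hypothetical classifier output for one open issue:
store_labeled_issues("issues.db", "JabRef",
                     [{"number": 1, "title": "Group tree not refreshed", "body": "...",
                       "labels": ["UI", "Databases"]}])
```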
C. User Interface
GiveMeLabeledIssues outputs the labeled issues to the User
Interface. The user interface is implemented using the Angular
web framework. To use the tool, users provide the project
4https://github.com/utterworks/fast-bert
5https://huggingface.co/docs/transformers/index
name and select API-domain labels they are interested in. This
information is sent to the backend REST endpoint via a GET
request. The backend processes the request, recommending a
set of relevant issues for the user.
The backend REST API is implemented using the Django
Rest Framework. It houses the trained TF-IDF and BERT text
classification models and provides an interface to the labeled
issues. When receiving the request, the backend queries the
selected project for issues that match the user’s skills. Once
the query is completed, the backend returns the labeled issues
to the user interface. Each labeled issue includes a link to the
open issue page on GitHub and the issue’s title, number, and
applicable labels. The querying process is shown in Figure 3.
Fig. 3. The Process of Outputting Issues
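For illustration only, a request from the frontend (or any HTTP client) might look like the sketch below; the endpoint path, parameter names, and response shape are hypothetical and should be checked against the backend repository.

```python
import requests

BASE_URL = "http://localhost:8000"  # hypothetical local deployment of the Django backend

# Hypothetical route: ask for JabRef issues matching the selected API-domain labels.
resp = requests.get(f"{BASE_URL}/issues/JabRef/", params={"labels": "User Interface,Databases"})
for issue in resp.json().get("issues", []):
    print(issue["number"], issue["title"], issue["labels"])
```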
Figure 4 shows JabRef selected as the project and “Utility,” “Databases,” “User Interface,” and “Application” as the API domains provided by the user. Figure 5 shows the results of
this query, which displays all JabRef open issues that match
those labels.
Fig. 4. Selection of a project and API domains
III. EVALUATION
We have evaluated the performance of the models used to
output API-domain labels using a dataset comprised of 18,482
issues, 3,129 PRs, 3,108 source code files, and 3,686 distinct
APIs, chosen from active projects and diverse domains: Au-
dacity (C++), PowerToys (C#), and JabRef (Java).6 Audacity
is an audio editor, PowerToys is a set of utilities for Windows,
and JabRef is an open-source bibliography manager. Table
I shows the results with Precision, Recall, F-Measure, and
Hamming Loss values. We trained models to predict API-
domain labels using individual issue datasets from each project
and a single dataset that combined the data from all the
projects.
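For reference, these multi-label metrics can be computed as sketched below with scikit-learn; the averaging strategy shown is an assumption, as the exact evaluation code is provided in the replication package.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, hamming_loss

# Toy predictions: rows are issues, columns are API domains (1 = domain present).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("Precision   ", precision_score(y_true, y_pred, average="weighted", zero_division=0))
print("Recall      ", recall_score(y_true, y_pred, average="weighted", zero_division=0))
print("F-measure   ", f1_score(y_true, y_pred, average="weighted", zero_division=0))
print("Hamming loss", hamming_loss(y_true, y_pred))
```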
As shown in Table I, TF-IDF outperformed BERT both in the per-project analysis and for the complete dataset. The
6The model training replication package is available at https://zenodo.org/record/7726323#.ZA5oy-zMIeY
Fig. 5. Labeled Issues Outputted for JabRef with the Utility, Databases, User
Interface, and Application skills Selected
TABLE I
OVERALL METRICS FROM MODELS - AVERAGES. RF = RANDOM FOREST, HLA = HAMMING LOSS

Model (averages)     One/Multi projects   Precision   Recall   F-measure   HLa
RF TF-IDF            O                    0.839       0.799    0.817       0.113
BERT                 O                    0.595       0.559    0.568       0.269
RF TF-IDF            M                    0.659       0.785    0.573       0.153
BERT                 M                    0.593       0.725    0.511       0.219
Izadi et al. [13]    M                    0.796       0.776    0.766       NA
Kallis et al. [14]   M                    0.832       0.826    0.826       NA
Santos et al. [9]    O                    0.755       0.747    0.751       0.142
difference was quite large when using single projects. The
results were closer when we used the combined dataset that
included data from all projects (3,736 linked issues and pull
requests). We hypothesize that the sample size influenced the
classifiers’ performance. This aligns with previous research on
issue labeling that showed that BERT performed better than
other language models for datasets larger than 5,000 issues
[15]. TF-IDF performs very well when the dataset is from
a single project because the vocabulary used in the project
is very contextual, and the frequency of terms can identify
different aspects of each issue. When we include the dataset
from all the projects, the performance of TF-IDF drops as the
context is not unique. These results outperformed the results from the API-domain labels case study conducted by Santos et al. [9]. The project metrics (Table II) varied by less than 6% (e.g., recall ranged from 0.839 for Audacity to 0.776 for PowerToys). Audacity had the best scores for all metrics except Hamming Loss.

TABLE II
OVERALL METRICS FROM MODELS - BY PROJECT. RF = RANDOM FOREST, HLA = HAMMING LOSS

Model (by project)   Project     Precision   Recall   F-measure   HLa
RF TF-IDF            Audacity    0.872       0.839    0.854       0.103
RF TF-IDF            JabRef      0.806       0.782    0.793       0.143
RF TF-IDF            PowerToys   0.84        0.776    0.805       0.094
BERT                 Audacity    0.382       0.511    0.434       0.42
BERT                 JabRef      0.791       0.606    0.686       0.192
BERT                 PowerToys   0.619       0.643    0.626       0.187
IV. RELATED WORK
The existing literature explores strategies to help newcom-
ers find tasks within OSS projects. Steinmacher et al. [16]
proposed a portal to help newcomers find relevant documen-
tation. Although the portal points contributors to existing resources, a newcomer may have difficulty relating the documentation to the skills required to solve a task.
There are also several approaches designed to label issues
automatically. However, most of them only try to distinguish
bug reports from non-bug reports [4, 17, 18]. Zhou et al. [17]
built a Naive Bayes (NB) classifier to predict the priorities of
bug reports. Xia et al. [18] tagged questions using preexisting
labels. Other work [6, 7] is also restricted to existing labels, while other studies [13, 14] proposed new label sets. Kallis et al. [14],
for instance, employed the textual description of the issues to
classify the issues into types.
Labeling has also attracted the attention of software requirements researchers. Pérez-Verdejo et al. [19] and Quba et al. [20] proposed categorizing documents with functional and non-functional requirement labels to solve software engineers’ problems. Non-functional requirement labels included 12 general domains like “Performance” or “Availability” [20].
Such approaches use higher-level labels that would not guide
contributors to choose appropriate tasks given their skills.
Availability, for instance, may be related to a database, network, or cloud module, and a newcomer may find it challenging to decide where to start, given how many modules must be analyzed to find the root cause of a bug.
APIs are also often investigated in software engineering.
Recent work focuses on providing API recommendations [21, 22], giving API tips to developers [23], defining the skill space for APIs, developers, and projects [24], or identifying experts in
libraries [25]. The APIs usually implement a concept (database
access, thread management, etc.). Knowing the API involved
in a potential solution allows newcomers to pick an issue that
matches their skillset. Therefore, unlike the presented related
studies, our tool labels issues based on API domains [9].
V. CONCLUSION AND FUTURE WORK
GiveMeLabeledIssues provides OSS developers with issues
that can potentially match their skill sets or interests. Such a
tool may facilitate the onboarding of newcomers and alleviate
the workload of project maintainers.
Future work can explore different domain labels, such as those offered by accredited standards (e.g., ACM/IEEE/ISO/SWEBOK). As a future step to evaluate the tool’s impact, we will conduct a study to receive feedback from contributors and assess how the tool influences their choices by means of controlled experiments. Future work can also incorporate the use of social features [8] and integrate the tool into GitHub Actions or bots [26].
ACKNOWLEDGMENTS
This work is partially supported by the National Science
Foundation under Grant numbers 1815503, 1900903, and
2236198.
REFERENCES
[1] I. Steinmacher, M. A. G. Silva, M. A. Gerosa, and D. F. Redmiles, “A systematic literature review on the barriers faced by newcomers to open source software projects,” Information and Software Technology, vol. 59, 2015.
[2] F. Santos, B. Trinkenreich, J. F. Pimentel, I. Wiese, I. Steinmacher, A. Sarma, and M. A. Gerosa, “How to choose a task? Mismatches in perspectives of newcomers and existing contributors,” in Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2022.
[3] A. Barcomb, K.-J. Stol, B. Fitzgerald, and D. Riehle, “Managing episodic volunteers in free/libre/open source software communities,” IEEE Transactions on Software Engineering, 2020.
[4] N. Pingclasai, H. Hata, and K.-i. Matsumoto, “Classifying bug reports to bugs and other requests using topic modeling,” in 2013 20th Asia-Pacific Software Engineering Conference (APSEC), vol. 2. IEEE, 2013, pp. 13–18.
[5] Y. Zhu, M. Pan, Y. Pei, and T. Zhang, “A bug or a suggestion? An automatic way to label issues,” arXiv preprint arXiv:1909.00934, 2019.
[6] F. El Zanaty, C. Rezk, S. Lijbrink, W. van Bergen, M. Côté, and S. McIntosh, “Automatic recovery of missing issue type labels,” IEEE Software, 2020.
[7] Q. Perez, P.-A. Jean, C. Urtado, and S. Vauttier, “Bug or not bug? That is the question,” in 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). IEEE, 2021, pp. 47–58.
[8] F. Santos, J. Penney, J. F. Pimentel, I. Wiese, I. Stein-
macher, and M. A. Gerosa, “Tell me who are you talking
to and i will tell you what issues need your skills,”
in 2023 IEEE/ACM 20th International Conference on
Mining Software Repositories (MSR), 2023.
[9] F. Santos, I. Wiese, B. Trinkenreich, I. Steinmacher,
A. Sarma, and M. A. Gerosa, “Can i solve it? iden-
tifying apis required to complete oss tasks,” in 2021
IEEE/ACM 18th International Conference on Mining
Software Repositories (MSR). IEEE, 2021.
[10] D. Behl, S. Handa, and A. Arora, “A bug mining tool
to identify and analyze security bugs using naive bayes
and tf-idf,” in 2014 International Conference on Reliabil-
ity Optimization and Information Technology (ICROIT).
IEEE, 2014, pp. 294–299.
[11] S. L. Vadlamani and O. Baysal, “Studying software
developer expertise and contributions in Stack Overflow
and GitHub,” in 2020 IEEE International Conference on
Software Maintenance and Evolution (ICSME). IEEE,
2020, pp. 312–323.
[12] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, “Large batch optimization for deep learning: Training BERT in 76 minutes,” in International Conference on Learning Representations, 2020.
[13] M. Izadi, K. Akbari, and A. Heydarnoori, “Predicting
the objective and priority of issue reports in software
repositories,” Empirical Software Engineering, vol. 27,
no. 2, pp. 1–37, 2022.
[14] R. Kallis, A. Di Sorbo, G. Canfora, and S. Panichella,
“Ticket tagger: Machine learning driven issue classi-
fication,” in 2019 IEEE International Conference on
Software Maintenance and Evolution (ICSME). IEEE,
2019, pp. 406–409.
[15] J. Wang, X. Zhang, and L. Chen, “How well do pre-
trained contextual language representations recommend
labels for github issues?” Knowledge-Based Systems, vol.
232, p. 107476, 2021.
[16] I. Steinmacher, T. U. Conte, C. Treude, and M. A. Gerosa, “Overcoming open source project entry barriers with a portal for newcomers,” in Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 273–284.
[17] Y. Zhou, Y. Tong, R. Gu, and H. Gall, “Combining text mining and data mining for bug report classification,” Journal of Software: Evolution and Process, vol. 28, no. 3, pp. 150–176, 2016.
[18] X. Xia, D. Lo, X. Wang, and B. Zhou, “Tag recommendation in software information sites,” in Mining Software Repositories. USA: IEEE, 2013.
[19] J. M. Pérez-Verdejo, Á. J. Sánchez-García, J. O. Ocharán-Hernández, E. Mezura-Montes, and K. Cortes-Verdin, “Requirements and GitHub issues: An automated approach for quality requirements classification,” Programming and Computer Software, vol. 47, 2021.
[20] G. Y. Quba, H. Al Qaisi, A. Althunibat, and S. AlZu’bi, “Software requirements classification using machine learning algorithm’s,” in 2021 International Conference on Information Technology (ICIT). IEEE, 2021.
[21] H. Zhong and H. Mei, “An empirical study on API usages,” IEEE Transactions on Software Engineering, vol. 45, no. 4, pp. 319–334, 2019.
[22] Y. Zhou, C. Wang, X. Yan, T. Chen, S. Panichella, and H. C. Gall, “Automatic detection and repair recommendation of directive defects in Java API documentation,” IEEE Transactions on Software Engineering, pp. 1–1, 2018.
[23] S. Wang, N. Phan, Y. Wang, and Y. Zhao, “Extracting API tips from developer question and answer websites,” in 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), 2019, pp. 321–332.
[24] T. Dey, A. Karnauch, and A. Mockus, “Representation of developer expertise in open source software,” arXiv preprint arXiv:2005.10176, 2020.
[25] J. E. Montandon, L. Lourdes Silva, and M. T. Valente,
“Identifying experts in software libraries and frameworks
among github users,” in 2019 IEEE/ACM 16th Inter-
national Conference on Mining Software Repositories
(MSR), 2019, pp. 276–287.
[26] T. Kinsman, M. Wessel, M. A. Gerosa, and C. Treude,
“How do software developers use github actions to
automate their workflows?” in 2021 IEEE/ACM 18th In-
ternational Conference on Mining Software Repositories
(MSR). IEEE, 2021, pp. 420–431.