Can I Solve It? Identifying APIs Required to
Complete OSS Tasks
Fabio Santos,1 Igor Wiese,2 Bianca Trinkenreich,1 Igor Steinmacher,1,2 Anita Sarma,3 Marco A. Gerosa1
1Northern Arizona University, USA, 2Universidade Tecnológica Federal do Paraná, Brazil, 3Oregon State University, USA
fabio firstname.lastname@example.org, email@example.com, bianca firstname.lastname@example.org, email@example.com
Abstract—Open Source Software projects add labels to open
issues to help contributors choose tasks. However, manually
labeling issues is time-consuming and error-prone. Current
automatic approaches for creating labels are mostly limited to
classifying issues as a bug/non-bug. In this paper, we investigate
the feasibility and relevance of labeling issues with the domain of
the APIs required to complete the tasks. We leverage the issues’
description and the project history to build prediction models,
which resulted in precision up to 82% and recall up to 97.8%.
We also ran a user study (n=74) to assess these labels’ relevancy
to potential contributors. The results show that the labels were
useful to participants in choosing tasks, and the API-domain
labels were selected more often than the existing architecture-
based labels. Our results can inspire the creation of tools to
automatically label issues, helping developers to ﬁnd tasks that
better match their skills.
Index Terms—API identification, Labelling, Tagging, Skills, Multi-Label Classification, Mining Software Repositories, Case Study

I. INTRODUCTION
Finding tasks to contribute to in Open Source projects is
challenging [1, 2, 3, 4, 5]. Open tasks vary in complexity
and required skills, which can be difficult to determine by reading the task descriptions alone, especially for new contributors [6, 7, 8]. Adding labels to issues (a.k.a. tasks, bug reports) helps new contributors choose their tasks. However, community managers find that labeling issues is challenging and time-consuming because projects require skills in different languages, frameworks, databases, and Application Programming Interfaces (APIs).
APIs usually encapsulate modules that have specific purposes (e.g., cryptography, database access, logging), abstracting the underlying implementation. If contributors know which types of APIs will be required to work on each issue, they can choose tasks that better match their skills or that involve skills they want to learn.
Given this context, in this paper, we investigate the feasi-
bility of automatically labeling issues with domains of APIs
to facilitate contributors’ task selection. Since an issue may
require knowledge of multiple APIs, we applied a multi-label classification approach, which has been used in software engineering for purposes such as classifying questions in Stack Overflow (e.g., Xia et al.), detecting types of failures (e.g., Feng et al.), and identifying code smells (e.g., Guggulothu and Moiz).
By employing an exploratory case study and a user study,
we aimed to answer the following research questions:
RQ1: To what extent can we predict the domain of APIs
used in the code that ﬁxes a software issue? To answer RQ1,
we employed a multi-label classiﬁcation approach to predict
the API-domain labels. We also explored the inﬂuence of task
elements (i.e., title, body, and comments) and machine learning
setup (i.e., n-grams and different algorithms) on the prediction
model. Overall, we found that pre-processing the issue body using unigrams and the Random Forest algorithm can predict the API-domain labels with up to 82% precision and up to 97.8% recall. This configuration outperformed recent approaches reported in the literature.
RQ2: How relevant are the API-domain labels to new contributors? To answer RQ2, we conducted a study with
74 participants from both academia and industry. After asking
participants to select and rank real issues they would like to
contribute to, we provided a follow-up survey to determine
what information was relevant to make the decision. We
compared answers from the treatment group (with access to the
API-domain labels) with the Control group (who used only the
pre-existing project labels). The participants considered API-
domain labels more relevant than the project labels—with a
large effect size.
These results indicate that labeling issues with API domains is feasible and relevant for new contributors who need to determine which issues to contribute to.
II. RELATED WORK
New contributors need specific guidance on what to contribute [9, 15]. In particular, finding an appropriate issue can be a daunting task, which can discourage contributors. Social coding platforms like GitHub encourage projects to label issues that are easy for new contributors, which several communities do (e.g., LibreOffice, KDE, and Mozilla). However, community managers argue that manually labeling issues is difficult and time-consuming.
Several studies have proposed ways to automatically label
issues as bug/non-bug, combining text mining techniques with
classiﬁcation to mitigate this problem. For example, Antoniol
et al.  compared text-based mining with Naive Bayes
(NB), Logistic Regression (LR), and Decision Trees (DT)
to process data from titles, descriptions, and discussions and
achieved a recall up to 82%. Pingclasai et al.  used the
same techniques to compare a topic and word-based approach
and found F-measures from 0.68 to 0.81 using the topic-
based approach. More recently, Zhou et al.  used two-stage
processing, introducing the use of structured information from
the issue tracker, improving the recall obtained by Antoniol
et al. . Kallis et al.  simpliﬁed the data mining step to
produce a tool able to classify issues on demand. They used the
title and body to create a bag of words used to classify issues
as “bug report”, “enhancement”, and “question”. El Zanaty et al. applied type detection on issues and attempted to transfer the learned models to other projects using the same training data. The best results had F-measures around 0.64–0.75. Finally, Xia et al. employed a multi-label classification using text data from Stack Overflow questions, obtaining recall values starting at 0.59.
In contrast to these related works, which focus mostly on classifying the type of issue (e.g., bug/non-bug), our work focuses on identifying the domain of the APIs used in the implementation code, which might reflect the skills needed to complete a task.
Regarding APIs, recent work focuses on understanding the
crowd’s opinion about API usage , understanding and
categorizing the API discussion , creating tutorial sections
that explain a given API type , generating/understanding
API documentation [23, 24], providing API recommendations
[25, 26, 27], offering API tips to developers , and deﬁning
the skill space for APIs, developers, and projects. In contrast to this previous work, we focus on predicting the domain of the APIs used in the code that fixes an issue.
III. METHOD
This study comprises three phases, as summarized in Fig. 1:
mining software repository, building the classiﬁers, and evalu-
ating the API-domain labels with developers. To foster repro-
ducibility, we provide a publicly available dataset5containing
the raw data, the Jupyter notebook scripts that build and test
the models, and the survey data.
We conducted an exploratory case study using the JabRef project as our case. JabRef is an open-source bibliography reference manager developed by a community of volunteers, including contributors with different backgrounds and diverse sets of skills (with and without a computer science background), which helped us evaluate the approach with a diverse set of contributor profiles. JabRef is a mature and
active project created in 2003 (migrated to GitHub in 2014),
with 15.7k commits, 42 releases, 337 contributors, 2.7k closed
issues, and 4.1k closed pull requests. JabRef has also been
frequently investigated in scientiﬁc studies [32, 33, 34, 35, 36].
We chose JabRef as our case study because of these character-
istics and because we have access to the project’s contributors.
Fig. 1. Research Design
A. Phase 1 - Mining JabRef Repository
We used the GitHub API to collect data from JabRef. We
collected 1976 issues and pull requests (PR), including title,
description (body), comments, and submission date. We also
collected the name of the ﬁles changed in the PR and the
commit message associated with each commit. The data was
collected in April 2020.
After collecting the data, we ﬁltered out open issues and
pull requests not explicitly linked to issues. To ﬁnd the links
between pull requests and issues, we searched for the symbol
#issue_number in the pull request title and body and checked
the URL associated with each link. We manually inspected
a random sample of issues to check whether the data was
correctly collected and reﬂected what was shown on the
GitHub interface. Two authors manually examined 50 issues,
comparing the collected data with the GitHub interface. All
records were consistent. We also ﬁltered out issues linked
to pull requests without at least one Java ﬁle (e.g., those
associated only with documentation ﬁles). Our ﬁnal dataset
comprises 705 issues and their corresponding pull requests.
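The linking step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual script: a regular expression captures `#<number>` references in a pull request's title and body.

```python
import re

# Matches "#<issue_number>" references, e.g. "#1024".
ISSUE_REF = re.compile(r"#(\d+)")

def linked_issues(pr_title, pr_body):
    """Return the set of issue numbers referenced by a pull request."""
    text = f"{pr_title}\n{pr_body or ''}"
    return {int(n) for n in ISSUE_REF.findall(text)}

print(linked_issues("Fix crash on save (#1024)", "Closes #1024, related to #987"))
# {1024, 987}
```

In the study, each matched number was additionally verified against the URL of the linked issue before the pair was kept.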
We also wrote a parser to process all Java files from the project. In total, 1,692 import statements from 1,472 Java source files were mapped to 796 distinct APIs. The parser
identiﬁed all classes, including the complete namespace from
each import statement. Then we ﬁltered out APIs not found
in the latest version of the source code (JabRef 5.0) to avoid
recommending APIs that were no longer used in the project.
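A minimal sketch of such a parser, assuming plain-text Java sources; the regular expression and function name are illustrative, not the authors' implementation:

```python
import re

# One fully qualified name per Java import statement (static imports included).
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)\s*;", re.MULTILINE)

def imported_apis(java_source):
    """Extract the fully qualified names from a Java file's import statements."""
    return IMPORT_RE.findall(java_source)

src = """
import java.util.List;
import static org.junit.Assert.assertEquals;
import java.nio.file.Path;
"""
print(imported_apis(src))
# ['java.util.List', 'org.junit.Assert.assertEquals', 'java.nio.file.Path']
```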
Then, we employed a card-sorting approach to manually
classify the imported APIs into higher-level categories based
on the API's domain. For instance, we classified “java.nio.x” as “IO”, “java.sql.x” as “Database”, and “java.util.x” as “Utils”. A three-member team performed this classification
(first, second, and fourth authors); one of them is a contributor to JabRef and another is an expert Java developer. They analyzed the APIs identified in the previous step; each person classified the APIs individually, and the team then discussed to reach a consensus. After classifying all the APIs, the researchers
conducted a second round to revise the classiﬁcation (∼8
hours). During this step, we excluded some labels and ag-
gregated some others. The ﬁnal set of categories of API
domains contains: Google Commons, Test, OS, IO, UI, Net-
work, Security, OpenOfﬁce Documents, Database, Utils, PDF,
Logging, and Latex. We used these categories as labels for the 705 issues previously collected, based on the presence of the corresponding APIs in the changed files. We used this
annotated set to build our training and test sets for the multi-
label classiﬁcation models.
B. Phase 2 - Building the Multi-label Classiﬁers
1) Corpus construction: To build our classification models, we created a corpus comprising issue titles, bodies, and comments. We converted each word to lowercase and removed URLs, source code, numbers, and punctuation. After that, we removed stop-words and stemmed the words using the Python nltk package. We also filtered out issue and pull request templates, since the templates were not consistently used among all the issues. We found in our exploratory studies that their repetitive structure introduced too much noise.
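The cleaning pipeline can be sketched as below. This is an assumption-laden reconstruction: the stopword set here is a small illustrative subset standing in for nltk's full English stopword list, and the exact regular expressions the authors used are not given in the paper.

```python
import re
from nltk.stem import PorterStemmer

# Illustrative subset standing in for nltk's full English stopword list.
STOP = {"the", "a", "an", "is", "to", "of", "and", "in", "this"}
STEMMER = PorterStemmer()

def preprocess(text):
    """Lowercase, strip URLs/code/numbers/punctuation, drop stopwords, stem."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)   # URLs
    text = re.sub(r"`[^`]*`", " ", text)   # inline code spans
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers and punctuation
    tokens = [t for t in text.split() if t not in STOP]
    return " ".join(STEMMER.stem(t) for t in tokens)

print(preprocess("Fixing the Bugs in `BibEntry` https://example.com 123!"))
# fix bug
```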
Next, similar to other studies [37, 38, 39], we applied TF-
IDF, which is a technique for quantifying word importance in
documents by assigning a weight to each word. After applying
TF-IDF, we obtained a vector for each issue. The vector length
is the number of terms used to calculate the TF-IDF plus the
13 labels in the dataset. Each label received a binary value
(0 or 1), indicating whether the corresponding API-domain is
present in the issue and each term received the TF-IDF score.
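The vectorization step maps directly onto scikit-learn's TfidfVectorizer; the three toy documents below are invented stand-ins for preprocessed issue bodies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for preprocessed issue texts.
docs = [
    "crash when open pdf file",
    "ui freeze open database connection",
    "log error write pdf export",
]
vec = TfidfVectorizer(ngram_range=(1, 1))  # unigrams, as in the final setup
X = vec.fit_transform(docs)
print(X.shape)  # one row per issue, one column per term
```

Each row of `X` is then paired with a binary vector of the 13 API-domain labels to form one training example.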
2) Training/Test Sets: We split the data into training and
test sets using the ShufﬂeSplit method , which is a model
selection technique that emulates cross-validation for multi-
label classiﬁers. We randomly split our 705 issues into a
training set with 80% (564) of the issues and a test set with the
remaining 20% (142 issues). To avoid overﬁtting, we ran each
experiment ten times, using ten different training and test sets
to match a 10-fold cross-validation. To improve the balance of the data set, we ran the SMOTE algorithm for multi-label data.
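The splitting strategy can be sketched as follows (the SMOTE re-balancing step is omitted here); the array is a random stand-in for the 705 TF-IDF issue vectors, and `random_state` is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy stand-in for the 705 TF-IDF issue vectors.
X = np.arange(705).reshape(-1, 1)

# Ten random 80/20 splits, emulating 10-fold cross-validation.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
splits = list(splitter.split(X))
for train_idx, test_idx in splits:
    pass  # build and evaluate one model per split
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```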
3) Classiﬁers: To create the classiﬁcation models, we
chose five classifiers that work with the multi-label approach and implement different strategies to create learning models: Decision Tree, Random Forest (an ensemble classifier), MLPC (a neural-network multilayer perceptron), MLkNN (a multi-label lazy learning approach based on the traditional k-nearest-neighbor algorithm) [40, 42], and Logistic Regression. We ran the classifiers using the Python sklearn package and tested several parameters. For the RandomForestClassifier, the best classifier (see Section IV), we kept the following parameters: criterion = 'entropy', max_depth = 50, min_samples_leaf = 1, min_samples_split = 3, n_estimators = 50.
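The reported best configuration can be instantiated as below. The feature matrix and label matrix here are random stand-ins (5 toy labels instead of 13); note that sklearn's RandomForestClassifier accepts a binary multi-label matrix directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Best reported Random Forest configuration.
rf = RandomForestClassifier(
    criterion="entropy",
    max_depth=50,
    min_samples_leaf=1,
    min_samples_split=3,
    n_estimators=50,
    random_state=0,
)

rng = np.random.default_rng(0)
X = rng.random((100, 20))               # stand-in for TF-IDF vectors
Y = rng.integers(0, 2, size=(100, 5))   # stand-in binary label matrix
rf.fit(X, Y)                            # forests accept multi-label Y
print(rf.predict(X[:2]).shape)          # one binary column per label
```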
4) Classifiers Evaluation: To evaluate the classifiers, we employed the following metrics (also calculated using the Python sklearn package):
• Hamming loss measures the fraction of wrong labels to the total number of labels.
• Precision measures the proportion between the number of correctly predicted labels and the total number of predicted labels.
• Recall corresponds to the percentage of correctly predicted labels among all truly relevant labels.
• F-measure calculates the harmonic mean of precision and recall. F-measure is a weighted measure of how many relevant labels are predicted and how many of the predicted labels are relevant.
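These metrics can be computed as below on a toy pair of true/predicted label matrices. The micro-averaging choice is an assumption on our part; the paper does not state which averaging strategy was used.

```python
import numpy as np
from sklearn.metrics import hamming_loss, precision_score, recall_score, f1_score

# Toy true/predicted label matrices (rows = issues, columns = API domains).
y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

print(hamming_loss(y_true, y_pred))                       # wrong cells / all cells
print(precision_score(y_true, y_pred, average="micro"))   # TP / predicted positives
print(recall_score(y_true, y_pred, average="micro"))      # TP / true positives
print(f1_score(y_true, y_pred, average="micro"))          # harmonic mean
```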
5) Data Analysis: To conduct the data analysis, we used the
aforementioned evaluation metrics and the confusion matrix
logged after each model’s execution. We used the Mann-
Whitney U test to compare the classiﬁer metrics, followed
by Cliff's delta effect size tests. The Cliff's delta magnitude was assessed using the thresholds provided by Romano et al., i.e., |d| < 0.147 “negligible”, |d| < 0.33 “small”, |d| < 0.474 “medium”, otherwise “large”.
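Cliff's delta and the Romano et al. thresholds can be sketched as below; the pairwise implementation is a straightforward textbook version, not the authors' code, and the sample values are invented.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(d):
    """Romano et al. interpretation thresholds for |d|."""
    d = abs(d)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

d = cliffs_delta([0.80, 0.82, 0.85], [0.70, 0.71, 0.72])
print(d, magnitude(d))
```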
6) Dataset Analysis: Multi-label datasets are usually de-
scribed by label cardinality and label density . Label
cardinality is the average number of labels per sample. Label
density is the number of labels per sample divided by the
total number of labels, averaged over the samples. For our
dataset, the label cardinality is 3.04. The density is 0.25.
These values consider the 705 distinct issues and API-domain
labels obtained after the previous section’s pre-processing
steps. Since our density can be considered high, the multi-label
learning process or inference ability is not compromised .
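Both dataset measures follow directly from the binary label matrix; the toy matrix below (4 labels instead of 13) is an invented example.

```python
import numpy as np

# Toy binary label matrix: rows = issues, columns = API-domain labels.
Y = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
])
cardinality = Y.sum(axis=1).mean()   # average number of labels per issue
density = cardinality / Y.shape[1]   # cardinality divided by total labels
print(cardinality, density)
```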
For the remainder of our analysis, we removed the API label
“Utils,” since we found that this label was present in 96%
of the issues in our ﬁnal dataset and has an overly generic
meaning. The API-domain labels “IO”, “UI”, and “Logging”
had 492, 452, and 417 occurrences respectively. These last
three labels occurred in approximately 60% of the issues. We
also observed that “Test”, “Network”, and “Google Commons”
appeared in almost 29% of the issues (212, 208, and 206
times). “OS”, “Database”, “PDF”, “Open Office”, “Security”, and “Latex” were less common, with 56, 31, 21, 21, 20, and 14 occurrences, respectively.
Finally, we checked the distribution of the number of labels
per issue (Fig. 2). We found 140 issues with ﬁve labels, 132
issues with three labels, 121 issues with two labels, and 117
issues with four labels. Only 8.5% of the issues have a single label, which confirms that this is a multi-label classification problem.
Fig. 2. Number of labels per issue
C. Phase 3 - Evaluating the API-Domain Labels with Developers
To evaluate the relevancy of the API-domain labels from a new contributor's perspective, we conducted a user study with 74 participants. We created two versions of the JabRef
issues page (with and without our labels) and divided our
participants into two groups (between-subjects design). We
asked participants to choose and rank three issues they would like to contribute to and to answer a follow-up survey about what
information supported their decision. The artifacts used in this
phase are also part of the replication package.
1) Participants: We recruited participants from both in-
dustry and academia. We reached out to our own students
in addition to instructors and IT managers of our personal
and professional networks and asked them to help in inviting
participants. From industry, we recruited participants from one medium-sized IT startup hosted in Brazil and the IT department
of a large and global company. Students included under-
graduate and graduate computer science students from one
university in the US and two others in Brazil. We also recruited
graduate data science students from a university in Brazil,
since they are also potential contributors to the JabRef project.
We present the demographics of the participants in Table I. We
offered an Amazon Gift card to incentivize participation.
We categorized the participants’ development tenure into
novice and experienced coders, splitting our sample in half—
below and above the average “years as professional developer”
(4). We also segmented the participants into industry practitioners and students. Participants are identified by “P”, followed by a sequential number, a character representing the location where they were recruited (University: U; Industry: I), and “T” for the Treatment or “C” for the Control group.
TABLE I
DEMOGRAPHIC SUBGROUPS FOR THE EXPERIMENT'S PARTICIPANTS

Population  Quantity  Percentage | Tenure  Quantity  Percentage
Industry       41        55.5    | Expert     19        25.7
Student        33        44.5    | Novice     55        74.3
The participants were randomly split into two groups:
Control and Treatment. Of the 120 participants who started the survey, 74 (61.7%) finished all the steps, and we only considered these participants in the analysis. We ended up with 33 and 41 participants in the Control and Treatment groups, respectively.
2) Experiment Planning: We selected 22 existing JabRef
issues and built mock GitHub pages for Control and Treatment
groups. The issues were selected from the most recent ones,
trying to maintain similar distributions of the number of API-
domain labels predicted per issue and the counts of predicted
API-domain labels (see Section III-B6). The Control group's mock page had only the original labels from the JabRef issues, and the Treatment group's mock page presented the original labels in addition to the API-domain labels. These
pages are available in the replication package.
Fig. 3. Survey question about the regions' relevance
3) Survey Data Collection: The survey included the following questions:
• Select the three issues that you would like to work on.
• Select the information (region) from the issue page that helped you decide which issues to select (Fig. 3).
• Why is the information you selected relevant? (open-ended question)
• Select the labels you considered relevant for choosing the issues.
The survey also asked about participants’ experience level,
experience as an OSS contributor, and expertise level in the
technologies used in JabRef.
Fig. 3 shows an example of an issue details page and an
issue entry on an issue list page. After selecting the issues
to contribute, the participant was presented with this page to
select what information (region) was relevant to the previous selection.
4) Survey Data Analysis: Next, to understand participants’
perceptions about what information (regions) they considered
important and the relevancy of the API-domain labels, we
first compared the Treatment and Control groups' results. We used
violin plots to visually compare the distributions and measured
the effect size using the Cliff’s Delta test.
Then, we analyzed the data, aggregating participants ac-
cording to their demographic information, resulting in the
subgroups presented in Table I. We calculated the odds ratio
to check how likely it would be to get similar responses
from both groups. We used a 2x2 contingency table for each
comparison—for instance, industry practitioners vs. students
and experienced vs. novice coders. We used the following
formula to calculate the odds ratio:

OddsRatio (OR) = (a/c) / (b/d)

Odds ratios greater than 1 mean that the first subgroup is more likely to report a type of label, while values less than 1 mean that the second subgroup has greater chances.
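The computation can be sketched as below; the counts in the example are invented, not the study's data, and the cell-to-argument mapping is documented in the function.

```python
def odds_ratio(a, b, c, d):
    """OR = (a/c) / (b/d) for a 2x2 contingency table:

                    group 1   group 2
    label type X       a         b
    other picks        c         d
    """
    return (a / c) / (b / d)

# Illustrative counts: group 1 picked the label 30 times vs. 20 other
# picks; group 2 picked it 15 times vs. 25 other picks.
print(odds_ratio(30, 15, 20, 25))  # (30/20) / (15/25) = 2.5
```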
To understand the rationale behind the label choices, we
qualitatively analyzed the answers to the open question (”Why
was the information you selected relevant?”). We selected
representative quotes to illustrate the participants’ perceptions
about the labels’ relevancy.
IV. RESULTS
We report the results grouped by research question.
A. RQ1. To what extent can we predict the domain of APIs
used in the code that ﬁxes a software issue?
To predict the API domains used in the files changed for an issue (RQ1), we started by testing a simple configuration that served as a baseline. For this baseline model, we used only the issue TITLE as input and the Random Forest (RF) algorithm, since it is insensitive to parameter settings and has been shown to yield good prediction results in software engineering studies [48, 49, 50, 51]. Then, we evaluated the corpus configuration alternatives, varying the input information: only TITLE (T), only BODY (B), TITLE and BODY (T, B), and TITLE, BODY, and COMMENTS (T, B, Comments). To compare the different models, we selected the best Random Forest configuration and used the Mann-Whitney U test with the Cliff's delta effect size.
We also tested alternative configurations using n-grams. At each step, the best configuration was kept. Then, we compared different machine learning algorithms against a dummy (random) classifier.
TABLE II
OVERALL METRICS (SECTION III-B-4) FROM MODELS CREATED TO EVALUATE THE CORPUS. HLA = HAMMING LOSS

Model            Precision  Recall  F-measure  Hla
Title (T) 0.717 0.701 0.709 0.161
Body (B) 0.752 0.742 0.747 0.143
T, B 0.751 0.738 0.744 0.145
T, B, Comments 0.755 0.747 0.751 0.142
As Table II shows, when we tested different inputs and compared them to TITLE only, all alternative settings provided better results. We observed improvements in terms of precision, recall, and F-measure. When using TITLE, BODY, and COMMENTS, we reached a precision of 75.5%, recall of 74.7%, and F-measure of 75.1%.
We found statistical differences when comparing the results using TITLE only against the three other corpus configurations
TABLE III
CLIFF'S DELTA FOR F-MEASURE AND PRECISION: COMPARISON OF CORPUS MODEL ALTERNATIVES - SECTION III-B-1. TITLE (T), BODY (B), AND COMMENTS (C)
Corpus Cliff’s delta
Comparison F-measure Precision
T versus B -0.86 large*** -0.92 large***
T versus T+B -0.8 large** -0.88 large***
T versus T+B+C -0.88 large** -0.88 large***
B versus T+B 0.04 negligible 0.04 negligible
B versus T+B+C -0.24 small -0.12 negligible
T+B versus T+B+C -0.3 small -0.08 negligible
* p ≤0.05; ** p ≤0.01; *** p ≤0.001
Fig. 4. Comparison between the unigram model and the n-gram models
for both F-measure (p-value ≤ 0.01 for all cases, Mann-Whitney U test) and precision (p-value ≤ 0.001 for all cases, Mann-Whitney U test), with large effect sizes. TITLE+BODY+COMMENTS performed better than all others in terms of precision, recall, and F-measure. However, the results suggest that using only the BODY would provide good enough outcomes, since there was no statistically significant difference compared to the other two configurations (using TITLE and/or COMMENTS in addition to the BODY), and it achieved similar results with less effort. The model built using only BODY presented only 14.3% incorrect predictions (hamming loss metric) across all 12 labels. Table III shows the Cliff's delta comparison between each pair of corpus configurations.
Next, we investigated the use of bi-grams, tri-grams, and four-grams, comparing the results to the use of unigrams. We used the corpus with only the issue BODY for this analysis, since this configuration performed well in the previous step. Fig. 4 and Table IV present how the Random Forest model performs for each n-gram configuration. The unigram configuration outperformed the others with a large effect size.
Finally, to investigate the inﬂuence of the machine learning
(ML) classiﬁer, we compared several options using the title
with unigrams as a corpus. The options included: Random
Forest (RF), Neural Network Multilayer Perceptron (MLPC),
Decision Tree (DT), LR, MlKNN, and a Dummy Classiﬁer
TABLE IV
CLIFF'S DELTA FOR F-MEASURE AND PRECISION: COMPARISON BETWEEN N-GRAM MODELS - SECTION III-B-5
n-Grams Cliff’s delta
Comparison F-measure Precision
1 versus 2 1.0 large*** 0.86 large***
1 versus 3 1.0 large*** 0.84 large***
1 versus 4 1.0 large*** 0.96 large***
2 versus 3 0.8 large** 0.18 small
2 versus 4 0.78 large** 0.72 large**
3 versus 4 0.12 negligible 0.62 large*
* p ≤0.05; ** p ≤0.01; *** p ≤0.001
with strategy “most frequent”. Dummy or random classiﬁers
are often used as a baseline [52, 53]. We used the imple-
mentation from the Python package scikit-learn . Fig. 5
shows the comparison among the algorithms, and Table V
presents the pair-wise statistical results comparing F-measure
and precision using Cliff’s delta.
Fig. 5. Comparison between the baseline model and other machine learning algorithms
TABLE V
CLIFF'S DELTA FOR F-MEASURE AND PRECISION: COMPARISON BETWEEN MACHINE LEARNING ALGORITHMS - SECTION III-B-5
Algorithms Cliff’s delta
Comparison F-measure Precision
RF versus LR 1.0 large*** 0.62 large*
RF versus MLPC 0.54 large* 0.88 large***
RF versus DT 1.0 large*** 1.0 large***
RF versus MlkNN 0.98 large*** 0.78 large***
LR versus MLPC -0.96 large*** 0.24 small
LR versus DT 0.4 medium 0.94 large***
LR versus MlkNN 0.5 large* 0.48 large*
MLPC versus DT 0.98 large*** 0.98 large***
MLPC versus MlkNN 0.94 large*** 0.32 small
MlkNN versus DT -0.28 small 0.0 negligible
RF versus Dummy 1.0 large*** 1.0 large***
* p ≤0.05; ** p ≤0.01; *** p ≤0.001
Random Forest (RF) and the Neural Network Multilayer Perceptron (MLPC) were the best models when compared to Decision Tree (DT), Logistic Regression (LR), MlKNN, and the Dummy classifier. Random Forest outperformed these four algorithms with large effect sizes considering both F-measure and precision.
RQ1 Summary. It is possible to predict the API-domain labels with a precision of 0.755, recall of 0.747, F-measure of 0.751, and Hamming loss of 0.142 using the Random Forest algorithm with TITLE, BODY, and COMMENTS as the corpus and unigrams as features.
B. RQ2. How relevant are the API-domain labels to new contributors?
To answer this research question, we conducted a user study with 74 participants and analyzed their responses.
What information is used when selecting a task? Understanding the type of information that participants use to make their decision when selecting an issue to work on can help projects better organize such information on their issue pages. Fig. 6 shows the different regions that participants
found useful. In the Control group, the top two regions of
interest included the body of the issue (75.7%) and the title
(78.7%), followed by the labels (54.5%) and then the code
itself (54.5%). This suggests that the labels generated by the
project were only marginally useful and participants had to
also review the code. In contrast, in the Treatment group, the
top four regions of interest by priority were: Title, Label, Body,
and then Code (97.5%, 82.9%, 70.7%, 56.1%, respectively).
This shows that participants in the Treatment group found
the labels more useful than those participants in the Control
group: 82.9% usage in the Treatment group as compared to
54.5% in the Control group. Comparing the body and label regions in both groups, we found that participants from the Treatment group selected 1.6x more labels than the Control group (p=0.05). The odds ratio analysis suggests that labels were more relevant in the Treatment group.
Fig. 6. The region counts (normalized) of the issue's information page selected as most relevant by participants from the Treatment and Control groups. 1-Title, 2-Label, 3-Body, 4-Code, 5-Comments, 6-Author, 7-Linked
Qualitative analysis of the reason behind the choice of
participants in the Treatment group reveals that the Title and
the Labels together provided a comprehensive view of the
issue. For instance, P4IT mentioned: “labels were useful to know the problem area and after reading the title of the issues, it was the first thing taken into consideration, even before opening to check the details”. Participants found the labels to
be useful in pointing out the speciﬁc topic about the issue,
as P14IT stated: “[labels are] hints about what areas have
connection with the problem occurring”.
What is the role of labels? We also investigated which type
of labels helped the participants in their decision making. We divided the labels available to our participants into three groups based on the type of information they imparted.
•Issue type (already existing in the project): This included
information about the type of the task: Bug, Enhance-
ment, Feature, Good First Issue, and GSoC.
•Code component (already existing in the project): This
included information about the speciﬁc Code components
of JabRef: Entry, Groups, External.Files, Main Table, Fetcher, Entry.Editor, Preferences, Import, Keywords.
•API-domain (new labels): the labels that were generated
by our classiﬁer (IO, UI, Network, Security, etc.). These
labels were available only to the Treatment group.
TABLE VI
LABEL DISTRIBUTIONS AMONG THE CONTROL AND TREATMENT GROUPS

Type of Label  Control  C %   Treatment  T %
Issue Type       145    56.4     168     36.8
Components       112    43.6      94     20.6
API Domain        -      -       195     42.7
Table VI compares the labels that participants considered
relevant (section III-C-3) across the Treatment and Control
groups, distributed across these label types. In the Control
group, a majority of selected labels (56.4%) relate to the
type of issue (e.g., Bug or Enhancement). In the Treatment
group, however, this number drops down to 36.8%, with API-
domain labels being the majority (42.7%), followed by Code
component labels (20.6%). This difference in distributions
alludes to the usefulness of the API-domain labels.
To better understand the usefulness of the API-domain
labels as compared to the other types of labels, we further
investigated the label choices among the Treatment group
participants. Figure 7 presents two violin plots comparing (a)
API-domain labels against Code component labels and (b)
API-domain labels against type of issue. Wider sections of the violin plot represent a higher probability of observations taking a given value, while thinner sections correspond to a lower probability. The plots show that API-domain labels are more frequently chosen (median of 5 labels) compared to Code component labels (median of 2 labels), with a large effect size
(|d| = 0.52). However, the distributions of the Issue Type and API-domain labels are similar, as confirmed by a negligible effect size (|d| = 0.1). These results indicate that while the type of issue (bug fix, enhancement, suitable for newcomers) is important, understanding the technical (API) requirements of the task is equally important to developers when deciding which task to select.
Fig. 7. Density probability of labels (Y-axis): API-domain vs. Components and API-domain vs. Issue Type
Finally, we analyzed whether the demographic subgroups had different perceptions of the API-domain labels (Table VII). When comparing Industry vs. Students, we found that participants from industry selected 1.9x (p-value=0.001) more API-domain labels than students when we controlled by component labels. We found the same odds when we controlled by issue type (p-value=0.0007). When we compared Experienced vs. Novice coders, we did not find statistical significance (p=0.11) when controlling by component labels. However, we found that experienced coders selected 1.7x more API-domain labels than novice coders (p-value=0.01) when controlling by issue type.
The odds ratio analysis suggests that API-domain labels
are more likely to be perceived relevant by practitioners and
experienced developers than by students and novice coders.
RQ2 Summary. Our ﬁndings suggest that labels are rel-
evant for selecting an issue to work on. API-domain
labels increased the perception of the labels’ relevancy.
API-domain labels are especially relevant for industry
practitioners and experienced coders.
TABLE VII
ANSWERS FROM DIFFERENT DEMOGRAPHIC SUBGROUPS REGARDING THE
API LABELS (API/COMPONENT/ISSUE TYPE)
Subgroup Comparison API % Comp or Type %
Industry API/Comp 56.0 44.0
Students API/Comp 40.0 60.0
Exp. Coders API/Comp 50.9 49.1
Novice Coders API/Comp 41.5 58.5
Industry API/issue Type 45.5 55.5
Students API/issue Type 30.6 69.4
Exp. Coders API/issue Type 43.5 56.5
Novice Coders API/issue Type 30.9 69.1
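The reported odds ratios can be approximated directly from the proportions in Table VII. A sketch of the arithmetic (the paper's statistical procedure, e.g. a model controlling for covariates, may differ):

```python
def odds_ratio(p_group, p_reference):
    """Odds ratio of selecting an API-domain label, group vs. reference."""
    return (p_group / (1 - p_group)) / (p_reference / (1 - p_reference))

# Industry (56.0%) vs. Students (40.0%), controlling by component labels
print(round(odds_ratio(0.560, 0.400), 1))  # → 1.9
# Experienced (43.5%) vs. Novice (30.9%), controlling by issue type
print(round(odds_ratio(0.435, 0.309), 1))  # → 1.7
```

Both values reproduce the 1.9x and 1.7x figures quoted in the text.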
Are API-domain labels relevant? Finding an appropriate
issue to work on involves multiple aspects, one of which is
knowing the APIs required for a task, which is what we investi-
gate here. Our ﬁndings show that participants considered API-
domain labels relevant in selecting issues. API-domain labels
were considered more relevant than the Code component
labels and slightly more favored than the type of issue. This
suggests that a higher-level understanding of the API-domain
is more relevant than deeper information about the speciﬁc
component in the project. Therefore, our automated labeling
approach can be of service to open source projects.
When controlling for issue type and component, API-
domain labels were considered more relevant by experienced
coders than by novices (or students). This might suggest that
novices may need more help than “just” the technology for
which they need skills. For example, novices could be helped
if the issues provide additional details about the complexity
levels, how much knowledge about the particular APIs is
needed, the required/recommended academic courses needed
for the skill level, estimated time to completion, contact for
help, etc. Further research is needed to provide effective ways
to help novice contributors in onboarding.
What are the effects of the characteristics of the data
corpus? Observing the results reported for different corpora
used as input, we noticed that the baseline model created
using only the issue body performed similarly to the models
using issue title, body, and comments, and better than the
model using only the title. By inspecting the results, we
noticed that adding more words to the model makes the
feature matrix sparser and does not improve the results.
We also found co-occurrence among labels. For instance,
“UI”, “Logging”, and “IO” appeared together more often than
the other labels. This is due to the strong relationship found
in the source ﬁles. By searching the references for these API-
domain categories in the source code, we found that "UI"
appeared in 366 source code files, while "IO" and "Logging"
appeared in 377 and 200, respectively. We also found that "UI"
and "IO" co-occurred in 85 source files, while "UI" and
"Logging" and "IO" and "Logging" co-occurred in 74 and 127
files, respectively. On the other hand, the API-domain labels
for "Latex" and "Open Office Document" appeared in only five
Java files, while "Security" appeared in only six files. Future
research can investigate co-occurrence prediction techniques
(e.g., ) in this context.
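The pairwise counts above can be derived by scanning each file's (or issue's) label set. A minimal sketch with toy data; the real counts come from the project's source files:

```python
from collections import Counter
from itertools import combinations

def label_cooccurrence(items):
    """Count how often each pair of API-domain labels appears together."""
    pair_counts = Counter()
    for labels in items:
        for pair in combinations(sorted(set(labels)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Toy label sets per file; JabRef's real data yields e.g. UI/IO in 85 files
files = [["UI", "IO"], ["UI", "Logging", "IO"], ["IO", "Logging"], ["Latex"]]
counts = label_cooccurrence(files)
print(counts[("IO", "UI")])       # → 2
print(counts[("IO", "Logging")])  # → 2
```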
We suspect that the high occurrence of the "UI", "Logging",
and "IO" labels (>400 issues), compared with the rarity of
"Security", "Open Office Documents", "Database", "PDF", and
"Latex" (<32 issues), may influence the precision and F-measure
values. We tested the classifier
with only the top 5 most prevalent API-domain labels and did
not observe statistically signiﬁcant differences. One possible
explanation is that the transformation method used to create
the classifier was Binary Relevance, which creates a single
classifier for each label, overlooking possible co-occurrence
among labels.
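Binary Relevance can be sketched with scikit-learn, whose `OneVsRestClassifier` fits one independent binary classifier per label column. The corpus, labels, and hyperparameters below are illustrative, not the paper's exact setup:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy issue texts and a two-column label matrix (columns: IO, UI);
# the paper's model uses issue bodies and 12 API-domain labels.
texts = ["fix file writer crash", "button layout broken",
         "log output garbled", "read bibtex file fails"]
labels = [[1, 0], [0, 1], [0, 0], [1, 0]]

# Binary Relevance: one independent Random Forest per label column,
# so co-occurrence between labels is ignored by construction.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),   # unigram features
    OneVsRestClassifier(RandomForestClassifier(random_state=42)),
)
clf.fit(texts, labels)
pred = clf.predict(["file writer fails"])
print(pred.shape)  # one row, one column per label
```

Transformation methods that model label dependencies (e.g., classifier chains or label powerset) would be the natural alternative when co-occurrence matters.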
What are the difficulties in labeling? Although we could not
accurately predict the rare labels, we were able to predict
those with more than 50 occurrences with reasonable precision.
We argue that JabRef's nature contributes to the number
of issues related to the “UI” and “IO.” “Logging” occurs in all
ﬁles and therefore explains its high occurrence. On the other
hand, some speciﬁc API domains that are supposedly highly
relevant to JabRef—such as “Latex”, “PDF”, and “Open Ofﬁce
Documents”—are not well represented in the predictions.
Comparing Table VIII with the aforementioned
co-occurrence data, we can derive some expectations and
induce some predictions. For example, the "Database" label
occurred more frequently when "UI" and "IO" were present.
So, when an issue has both labels, we can likely suggest a
"Database" label, even when the machine learning algorithm
could not predict it. The same can happen with the "Latex"
label, which co-occurred with "IO" and "Network". Possible
future work can combine the machine learning algorithm
proposed in this work with frequent itemset mining techniques,
such as apriori . Thus, we could find association rules among
the previously predicted labels to suggest additional ones.
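A minimal sketch of the idea, computing the confidence of a hypothetical rule {UI, IO} → {Database} over toy label sets (a full apriori run would additionally prune candidate itemsets by support before deriving rules):

```python
def rule_confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over label sets."""
    ante = [t for t in transactions if antecedent <= t]
    if not ante:
        return 0.0
    return sum(1 for t in ante if consequent <= t) / len(ante)

# Toy per-issue label sets (illustrative, not the JabRef data)
issues = [{"UI", "IO", "Database"}, {"UI", "IO", "Database"},
          {"UI", "IO"}, {"IO", "Network", "Latex"}]
print(rule_confidence(issues, {"UI", "IO"}, {"Database"}))  # → 0.6666666666666666
```

A rule whose confidence exceeds a chosen threshold could then add its consequent label whenever the classifier predicts the antecedent labels.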
TABLE VIII
OVERALL METRICS FROM THE SELECTED MODEL
API-Domain TN FP FN TP Precision Recall
Google Commons 107 15 27 30 66.6% 52.6%
Test 112 18 29 20 52.6% 40.8%
OS 152 8 8 11 57.8% 57.8%
IO 9 30 3 137 82.0% 97.8%
UI 30 26 10 113 81.2% 91.8%
Network 107 10 30 32 76.1% 51.6%
Security 167 6 2 4 40.0% 66.6%
OpenOfﬁce 165 6 3 5 45.4% 62.5%
Database 154 3 6 16 84.2% 72.7%
PDF 164 5 4 6 54.5% 60.0%
Logging 19 32 18 110 77.4% 85.9%
Latex 170 1 1 7 87.5% 87.5%
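The table's metrics follow directly from the confusion-matrix counts in each row. A sketch checking the "Latex" row (the other rows match up to rounding):

```python
def precision_recall(tn, fp, fn, tp):
    """Precision and recall from binary confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)

# "Latex" row of Table VIII: TN=170, FP=1, FN=1, TP=7
p, r = precision_recall(170, 1, 1, 7)
print(p, r)  # → 0.875 0.875
```

For example, the "IO" row gives 137/(137+30) ≈ 82.0% precision and 137/(137+3) ≈ 97.8% recall, matching the reported values.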
What are the practical implications for different stakeholders?
New contributors. API-domain labels can help open source
contributors, enabling them to review the skills needed to
work on the issues up front. This is especially useful for new
contributors and casual contributors [57, 58], who have no
previous experience with the project terminology.
Project maintainers. Automatic API-domain labeling can
help maintainers distribute team effort to address project
tasks based on required expertise. Project maintainers can
also identify which type of APIs generate more issues in
the project. Our results show that we can predict the most
prominent API domains—in this case, "UI", "Logging", "IO",
"Network", and "Test"—with precision and recall of up to 87.5%
and 97.8%, respectively (see Table VIII).
Platform/Forge Managers. Participants often selected TITLE,
BODY, and LABELS to look for information when deciding
which issue to contribute to. Our results can be used to propose
better layouts for the issue list and detail pages, prioritizing
these regions against other information regions (3). In the issue
detail page on GitHub, for instance, the label information
appears outside of the main contributor focus, on the right side
of the page.
Templates that guide GitHub users in filling out the issue
body can be useful not only for making the information
space consistent across issues, but also for helping automated
classifiers that use this information to predict API labels.
For instance, some of the wrong predictions in our study
could possibly be caused by titles and bodies that had very
little useful information from which to generate labels.
Researchers. The scientiﬁc community can extend the pro-
posed approach to other languages and projects, including
more data and different algorithms. Our approach can also
be used to improve tools that recommend tasks that match
new contributor’s skills and career goals (e.g., ).
Educators. Educators who assign contributions to OSS as part
of their coursework  can also beneﬁt from our approach.
Labeling issues in OSS projects can help them select examples
or tasks for their classes, bringing a practical perspective to
the learning environment.
VI. THREATS TO VALIDITY
One of the threats to the validity of this study is the
API domain categorization. We acknowledge the threat that
different individuals can create different categorizations, which
may introduce some bias in our results. To mitigate this
problem, three individuals, including a Java Developer expert
and a contributor to the JabRef project, created the API-
domain labels. In the future, we can improve this classiﬁcation
process with (semi-)automated or collaborative approaches
(e.g., [61, 62]).
Another concern is the number of issues in our dataset and
the link between issues and pull requests. To include an issue
in the dataset, we needed to link it to its solution submitted via
pull request. By linking the issue with its correspondent pull
request, we could identify the APIs used to create the labels
and deﬁne our ground truth (check Section III-A). To ensure
that the link was correctly identiﬁed, we selected a random
sample of 50 issues and manually checked for consistency.
All of the issues in this validation set were correctly linked to
their pull requests.
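GitHub's closing-keyword convention is one common way such issue-to-pull-request links are established. A sketch of that heuristic (the paper's exact linking procedure is described in Section III-A and may differ):

```python
import re

# Closing keywords GitHub recognizes in pull request descriptions
LINK_RE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)

def linked_issues(pr_description):
    """Return issue numbers referenced with closing keywords."""
    return [int(n) for n in LINK_RE.findall(pr_description)]

print(linked_issues("Fixes #6543 and closes #7001"))  # → [6543, 7001]
print(linked_issues("see #12 for context"))           # → []
```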
In prediction models, overfitting occurs when a model
captures random error or noise instead of the underlying
relationship, i.e., the training phase exploits information that
does not generalize to unseen data. To mitigate this problem,
we shuffled the data before randomly splitting it into the
training and test samples.
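A shuffled split can be sketched with scikit-learn's `train_test_split`; the data and split ratio below are stand-ins, not the study's configuration:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]    # stand-ins for vectorized issues
y = [i % 2 for i in range(100)]  # stand-ins for one label column

# Shuffling before splitting removes any ordering effect (e.g. the
# chronological order of issues) from the train/test partition.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)
print(len(X_tr), len(X_te))  # → 75 25
```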
Further, we acknowledge that we did not investigate if the
labels helped the users to ﬁnd the most appropriate tasks. It
was not part of the user study to evaluate how effective the
API labels were to ﬁnd a match with user skills. Besides,
we did not evaluate how False Positive labels would impact
task selection or ranking. Our focus was on understanding the
relevance that the API-domain labels have on the participants’
decision. However, we believe the impact is minimal since,
among the three most selected issues, only 1 of the 11
recommended labels was a false positive. Investigating the
effectiveness of API labels for skill matching and the problems
that misclassification causes are potential avenues for future work.
Although participants with different proﬁles participated
in the user study, the sample cannot represent the entire
population and results can be biased. The study link ran-
domly assigned a group to each participant. However, some
participants did not ﬁnish the survey and the groups ended up
not being balanced. Also, the way we created subgroups can
introduce bias into the analysis. The classification of participants
as industry practitioners or students was based on where they
were recruited, and some students could also be industry
practitioners and vice-versa. However, the results of this analysis
were corroborated by the aggregation by experience level.
Generalization is also a limitation of any case study. The
outcomes could differ for other projects or programming
language ecosystems. We expect to extend this approach in that
direction in future work. Nevertheless, the case study in a real
world system showed how a multi-label classiﬁcation approach
can be useful for predicting API-domain labels and how
relevant such a label can be to new contributors. Moreover,
the API-domain labels that we identiﬁed can generalize to
other projects that use the same APIs across multiple project
domains (Desktop and Web applications). JabRef adopts a
common architecture (MVC) and frameworks (JavaFX, JUnit,
etc.), which makes it similar to a large number of other
projects. As described by Qiu et al. , projects adopt
common APIs, which account for up to 53% of the APIs used.
Moreover, our data can be used as a training set for automated
API-domain label generation in other projects.
VII. CONCLUSION
In this paper, we investigate to what extent we can predict
API-domain labels. To do that, we mined data from 705
issues from the JabRef project and predicted 12 API-domain
labels over this dataset. The model that was created using the
Random Forest algorithm, unigrams, and data from the issue
body offered the best results. The labels most present in the
issues can be predicted with high precision.
To investigate whether API-domain labels are helpful to
contributors, we designed a study to present a mocked list
of open issues with the API-domain labels mixed with the
original labels. We found that industry practitioners and
experienced coders selected API-domain labels more often
than students and novice coders. Participants also preferred
API-domain labels over Code component labels, which were
already used in the project.
This study is a step toward helping new contributors match
their API skills with each task and better identify an ap-
propriate task to start their onboarding process into an OSS
project. For future work, we will explore new projects, a word
embedding approach (Word2vec) with domain-trained data to
vectorize the issues, and investigate a unified API label schema
capable of accurately mapping the skills needed to contribute
to OSS projects.
ACKNOWLEDGMENTS
This work is partially supported by the National Science
Foundation under Grant numbers 1815486, 1815503, 1900903,
and 1901031, CNPq grant #313067/2020-1. We also thank the
developers who spent their time answering our survey.
REFERENCES
 J. Wang and A. Sarma, "Which bug should I fix: helping
new developers onboard a new project,” in Proceedings
of the 4th International Workshop on Cooperative and
Human Aspects of Software Engineering. ACM, 2011,
 I. Steinmacher, T. U. Conte, and M. A. Gerosa, “Un-
derstanding and supporting the choice of an appropriate
task to start with in open source software communities,”
in 2015 48th Hawaii International Conference on System
Sciences. IEEE, 2015, pp. 5299–5308.
 I. Steinmacher, M. A. G. Silva, M. A. Gerosa, and D. F.
Redmiles, “A systematic literature review on the barriers
faced by newcomers to open source software projects,”
Information and Software Technology, vol. 59, pp. 67–85,
 I. Steinmacher, T. Conte, M. A. Gerosa, and D. Red-
miles, “Social barriers faced by newcomers placing their
ﬁrst contribution in open source software projects,” in
Proceedings of the 18th ACM Conference on Computer
Supported Cooperative Work & Social Computing, ser.
CSCW ’15. New York, NY, USA: Association for
Computing Machinery, 2015, p. 1379–1392.
 C. Stanik, L. Montgomery, D. Martens, D. Fucci, and
W. Maalej, “A simple nlp-based approach to support
onboarding and retention in open source communities,”
in 2018 IEEE International Conference on Software
Maintenance and Evolution (ICSME). IEEE, 2018, pp.
 T. Zimmermann, R. Premraj, N. Bettenburg, S. Just,
A. Schroter, and C. Weiss, “What makes a good bug
report?” IEEE Transactions on Software Engineering,
vol. 36, no. 5, pp. 618–643, 2010.
N. Bettenburg, S. Just, A. Schröter, C. Weiß, R. Premraj,
and T. Zimmermann, "Quality of bug reports in
eclipse,” in Proceedings of the 2007 OOPSLA Workshop
on Eclipse Technology eXchange, ser. eclipse ’07. New
York, NY, USA: ACM, 2007, pp. 21–25.
 L. Vaz, I. Steinmacher, and S. Marczak, “An empirical
study on task documentation in software crowdsourcing
on topcoder,” in 2019 ACM/IEEE 14th International
Conference on Global Software Engineering (ICGSE).
IEEE, 2019, pp. 48–57.
 I. Steinmacher, C. Treude, and M. A. Gerosa, “Let me in:
Guidelines for the successful onboarding of newcomers
to open source projects,” IEEE Software, vol. 36, no. 4,
pp. 41–49, 2018.
 A. Barcomb, K. Stol, B. Fitzgerald, and D. Riehle,
“Managing episodic volunteers in free/libre/open source
software communities,” IEEE Transactions on Software
Engineering, pp. 1–1, 2020.
 X. Xia, D. Lo, X. Wang, and B. Zhou, “Tag recom-
mendation in software information sites,” in 2013 10th
Working Conference on Mining Software Repositories
(MSR). IEEE, 2013, pp. 287–296.
 Y. Feng, J. Jones, Z. Chen, and C. Fang, “An empirical
study on software failure classiﬁcation with multi-label
and problem-transformation techniques,” in 2018 IEEE
11th International Conference on Software Testing, Ver-
iﬁcation and Validation (ICST). IEEE, 2018, pp. 320–
 T. Guggulothu and S. A. Moiz, “Code smell detection us-
ing multi-label classiﬁcation approach,” Software Quality
Journal, vol. 28, no. 3, pp. 1063–1086, 2020.
 F. El Zanaty, C. Rezk, S. Lijbrink, W. van Bergen,
e, and S. McIntosh, “Automatic recovery of miss-
ing issue type labels,” IEEE Software, 2020.
 Y. Park and C. Jensen, “Beyond pretty pictures: Exam-
ining the beneﬁts of code visualization for open source
newcomers,” in Proceedings of the 5th IEEE Interna-
tional Workshop on Visualizing Software for Understand-
ing and Analysis, ser. VISSOFT ’09. IEEE, Sep. 2009,
G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G.
Guéhéneuc, "Is it a bug or an enhancement? a text-based
approach to classify change requests,” in Proceedings of
the 2008 conference of the center for advanced studies
on collaborative research: meeting of minds, 2008, pp.
 N. Pingclasai, H. Hata, and K.-i. Matsumoto, “Clas-
sifying bug reports to bugs and other requests using
topic modeling,” in 2013 20Th asia-paciﬁc software
engineering conference (APSEC), vol. 2. IEEE, 2013,
 Y. Zhou, Y. Tong, R. Gu, and H. Gall, “Combining text
mining and data mining for bug report classiﬁcation,”
Journal of Software: Evolution and Process, vol. 28,
no. 3, pp. 150–176, 2016.
 R. Kallis, A. Di Sorbo, G. Canfora, and S. Panichella,
“Ticket tagger: Machine learning driven issue classi-
ﬁcation,” in 2019 IEEE International Conference on
Software Maintenance and Evolution (ICSME). IEEE,
2019, pp. 406–409.
 G. Uddin and F. Khomh, “Automatic mining of opinions
expressed about apis in stack overﬂow,” IEEE Transac-
tions on Software Engineering, pp. 1–1, 2019.
 D. Hou and L. Mo, “Content categorization of API
discussions,” in 2013 IEEE International Conference on
Software Maintenance, 2013, pp. 60–69.
 G. Petrosyan, M. P. Robillard, and R. De Mori, “Discov-
ering information explaining API types using text clas-
siﬁcation,” in 2015 IEEE/ACM 37th IEEE International
Conference on Software Engineering, vol. 1, 2015, pp.
 C. Treude and M. P. Robillard, “Augmenting API doc-
umentation with insights from stack overﬂow,” in 2016
IEEE/ACM 38th International Conference on Software
Engineering (ICSE), 2016, pp. 392–403.
 Y. Zhou, R. Gu, T. Chen, Z. Huang, S. Panichella, and
H. Gall, “Analyzing apis documentation and code to
detect directive defects,” in 2017 IEEE/ACM 39th In-
ternational Conference on Software Engineering (ICSE),
2017, pp. 27–37.
 Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang,
“Api method recommendation without worrying about
the task-api knowledge gap,” in Proceedings of the
33rd ACM/IEEE International Conference on Automated
Software Engineering, ser. ASE 2018. New York, NY,
USA: Association for Computing Machinery, 2018, p.
 H. Zhong and H. Mei, “An empirical study on API
usages,” IEEE Transactions on Software Engineering,
vol. 45, no. 4, pp. 319–334, 2019.
 Y. Zhou, C. Wang, X. Yan, T. Chen, S. Panichella, and
H. C. Gall, “Automatic detection and repair recommen-
dation of directive defects in Java API documentation,”
IEEE Transactions on Software Engineering, pp. 1–1,
 S. Wang, N. Phan, Y. Wang, and Y. Zhao, “Extracting
API tips from developer question and answer websites,”
in 2019 IEEE/ACM 16th International Conference on
Mining Software Repositories (MSR), 2019, pp. 321–332.
 T. Dey, A. Karnauch, and A. Mockus, “Representation
of developer expertise in open source software,” arXiv
preprint arXiv:2005.10176, 2020.
 R. K. Yin, Case Study Research: Design and Methods,
ser. Applied social research methods series. Beverly
Hills, CA: Sage Publications, 1984.
 JabRef, “JabRef project,” 2019. [Online]. Available:
 T. Olsson, M. Ericsson, and A. Wingkvist, “The rela-
tionship of code churn and architectural violations in
the open source software jabref,” in Proceedings of the
11th European Conference on Software Architecture:
Companion Proceedings, 2017, pp. 152–158.
A. Mayr, R. Plösch, and C. Körner, "A benchmarking-based
model for technical debt calculation," in 2014 14th
International Conference on Quality Software. IEEE,
2014, pp. 305–314.
 S. Herold, “An initial study on the association between
architectural smells and degradation,” in European Con-
ference on Software Architecture. Springer, 2020, pp.
 Z. Shi, J. Keung, and Q. Song, “An empirical study of
bm25 and bm25f based feature location techniques,” in
Proceedings of the International Workshop on Innova-
tive Software Development Methodologies and Practices,
2014, pp. 106–114.
 S. Feyer, S. Siebert, B. Gipp, A. Aizawa, and J. Beel,
“Integration of the scientiﬁc recommender system mr.
dlib into the reference manager jabref,” in European
Conference on Information Retrieval. Springer, 2017,
 J. Ramos et al., “Using tf-idf to determine word rele-
vance in document queries,” in Proceedings of the ﬁrst
instructional conference on machine learning, vol. 242.
Piscataway, NJ, 2003, pp. 133–142.
 D. Behl, S. Handa, and A. Arora, “A bug mining tool
to identify and analyze security bugs using naive bayes
and tf-idf,” in 2014 International Conference on Reliabil-
ity Optimization and Information Technology (ICROIT).
IEEE, 2014, pp. 294–299.
 S. L. Vadlamani and O. Baysal, “Studying software
developer expertise and contributions in Stack Overﬂow
and GitHub,” in 2020 IEEE International Conference on
Software Maintenance and Evolution (ICSME). IEEE,
2020, pp. 312–323.
 F. Herrera, F. Charte, A. J. Rivera, and M. J. del Jesus,
Multilabel Classiﬁcation: Problem Analysis, Metrics and
Techniques, 1st ed. Springer Publishing Company,
 F. Charte, A. J. Rivera, M. J. del Jesus, and F. Herrera,
“Mlsmote: approaching imbalanced multilabel learning
through synthetic instance generation,” Knowledge-Based
Systems, vol. 89, pp. 385–397, 2015.
 M.-L. Zhang and Z.-H. Zhou, “Ml-knn: A lazy learning
approach to multi-label learning,” Pattern recognition,
vol. 40, no. 7, pp. 2038–2048, 2007.
 J. Romano, J. Kromrey, J. Coraggio, and J. Skowronek,
“Appropriate statistics for ordinal level data: Should
we really be using t-test and Cohen’sd for evaluating
group differences on the NSSE and other surveys?” in
annual meeting of the Florida Association of Institutional
Research, 2006, pp. 1–3.
A. Blanco, A. Casillas, A. Pérez, and A. D. de Ilarraza,
“Multi-label clinical document classiﬁcation: Impact of
label-density,” Expert Systems with Applications, vol.
138, p. 112835, 2019.
 M. Szumilas, “Explaining odds ratios,” Journal of the
Canadian academy of child and adolescent psychiatry,
vol. 19, no. 3, p. 227, 2010.
 D. Sheskin, Handbook of Parametric and Nonparametric
Statistical Procedures, 5th ed. Chapman & Hall, 2020.
 C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and
K. Matsumoto, “The impact of automated parameter
optimization on defect prediction models,” IEEE Trans-
actions on Software Engineering, vol. 45, no. 7, pp. 683–
D. Petkovic, M. Sosnick-Pérez, K. Okada, R. Todtenhoefer,
S. Huang, N. Miglani, and A. Vigil, "Using the
random forest classiﬁer to assess and predict student
learning of software engineering teamwork,” in 2016
IEEE Frontiers in Education Conference (FIE). IEEE,
2016, pp. 1–7.
 E. Goel, E. Abhilasha, E. Goel, and E. Abhilasha,
“Random forest: A review,” International Journal of
Advanced Research in Computer Science and Software
Engineering, vol. 7, no. 1, 2017.
 T. Pushphavathi, V. Suma, and V. Ramaswamy, “A novel
method for software defect prediction: hybrid of fcm
and random forest,” in 2014 International Conference
on Electronics and Communication Systems (ICECS).
IEEE, 2014, pp. 1–5.
 S. M. Satapathy, B. P. Acharya, and S. K. Rath, “Early
stage software effort estimation using random forest tech-
nique based on use case points,” IET Software, vol. 10,
no. 1, pp. 10–17, 2016.
 T. Saito and M. Rehmsmeier, “The precision-recall plot
is more informative than the roc plot when evaluating
binary classiﬁers on imbalanced datasets,” PloS one,
vol. 10, no. 3, p. e0118432, 2015.
 P. A. Flach and M. Kull, “Precision-recall-gain curves:
Pr analysis done right.” in NIPS, vol. 15, 2015.
 “scikit-learn dummy classiﬁer,” https://scikit-learn.
dummy.DummyClassiﬁer, accessed: 2021-03-12.
 I. S. Wiese, R. Ré, I. Steinmacher, R. T. Kuroda, G. A.
Oliva, C. Treude, and M. A. Gerosa, "Using contextual
information to predict co-changes," Journal of Systems
and Software, vol. 128, pp. 220–235, 2017.
 R. Agrawal, T. Imieliński, and A. Swami, "Mining
association rules between sets of items in large
databases," SIGMOD Rec., vol. 22, no. 2, pp. 207–216,
1993.
 G. Pinto, I. Steinmacher, and M. A. Gerosa, "More
common than you think: An in-depth study of casual
contributors," in IEEE 23rd International Conference on
Software Analysis, Evolution, and Reengineering, SANER
2016, Suita, Osaka, Japan, March 14-18, 2016 - Volume
1, 2016, pp. 112–123.
 S. Balali, I. Steinmacher, U. Annamalai, A. Sarma, and
M. A. Gerosa, "Newcomers' barriers... is that all? An
analysis of mentors' and newcomers' barriers in OSS
projects," Comput. Supported Coop. Work, vol. 27, no.
3–6, pp. 679–714, Dec. 2018.
 A. Sarma, M. A. Gerosa, I. Steinmacher, and R. Leano,
"Training the future workforce through task curation in
an OSS ecosystem," in Proceedings of the 2016 24th
ACM SIGSOFT International Symposium on Foundations
of Software Engineering, 2016, pp. 932–935.
 G. H. L. Pinto, F. Figueira Filho, I. Steinmacher, and
M. A. Gerosa, “Training software engineers using open-
source software: the professors’ perspective,” in 2017
IEEE 30th Conference on Software Engineering Educa-
tion and Training (CSEE&T). IEEE, 2017, pp. 117–121.
 M. Ferreira Moreno, W. H. Sousa Dos Santos, R. Costa
Mesquita Santos, and R. Fontoura De Gusmao Cerqueira,
“Supporting knowledge creation through has: The hyper-
knowledge annotation system,” in 2018 IEEE Interna-
tional Symposium on Multimedia (ISM), 2018, pp. 239–
 Y. Lu, G. Li, Z. Zhao, L. Wen, and Z. Jin, “Learning
to infer API mappings from API documents,” in Inter-
national Conference on Knowledge Science, Engineering
and Management. Springer, 2017, pp. 237–248.
 D. Qiu, B. Li, and H. Leung, “Understanding the API
usage in Java,” Information and software technology,
vol. 73, pp. 81–100, 2016.