QuerTCI: A Tool Integrating GitHub Issue Querying with Comment Classification

Ye Paing
CUNY Hunter College
New York, NY, USA
Ye.Paing89@myhunter.cuny.edu
Tatiana Castro Vélez
CUNY Graduate Center
New York, NY, USA
tcastrovelez@gradcenter.cuny.edu
Ra Khatchadourian
CUNY Hunter College
New York, NY, USA
ra.khatchadourian@hunter.cuny.edu
ABSTRACT
Issue tracking systems enable users and developers to comment on problems plaguing a software system. Empirical Software Engineering (ESE) researchers study (open-source) project issues and the comments and threads within to discover, among others, challenges developers face when, e.g., incorporating new technologies, platforms, and programming language constructs. However, issue discussion threads accumulate over time and thus can become unwieldy, hindering any insight that researchers may gain. While existing approaches alleviate this burden by classifying issue thread comments, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate a research infrastructure tool called QuerTCI that bridges this gap by integrating the GitHub issue comment search API with the classification models found in existing approaches. Using queries, ESE researchers can retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues. Our hope is that, using our tool, ESE researchers can uncover challenges related to particular technologies using certain keywords through popular open-source repositories more seamlessly than previously possible. A tool demonstration video may be found at: https://youtu.be/fADKSxn0QUk.
CCS CONCEPTS
• Software and its engineering → Software libraries and repositories.
KEYWORDS
software repository mining, GitHub, issue comments, classification
1 INTRODUCTION
Issue tracking systems, e.g., GitHub issues, allow users and developers to discuss current problems with software. Empirical Software Engineering (ESE) researchers engage in several activities related to mining software repositories (MSR), including studying (open-source) project issues. Researchers examine comments and threads contained in issues to discover the challenges developers face in writing software. For example, developers may struggle with incorporating new technologies, platforms, and programming language constructs, and they document and discuss their progress in issue “tickets.”
Unfortunately, such issue discussion threads accumulate over time and thus can become unwieldy, hindering insights researchers may gain from them. Moreover, issues can be unlabeled or improperly named, making it difficult to understand the problem at hand. Approaches (e.g., [2]) exist to alleviate this burden by classifying issue thread comments; however, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate QuerTCI, a Query-based Tool for Classifying GitHub Issue thread comments that bridges this gap by integrating the GitHub issue comment search API with a classification model. While the default classification model used by QuerTCI is the one developed by Arya et al. [2], it can also use other issue comment classification models. QuerTCI is a Python-based tool that queries GitHub’s search API for issues and comments relating to a query string that the user provides. Then, it automatically preprocesses (i.e., parses, cleans, tokenizes) each line of the issue comments retrieved from the GitHub API and runs them through the pre-trained NLP model for classification.
QuerTCI is highly customizable, supporting a wide range of additional functionalities, and works in either interactive or batch (non-interactive) mode. As shown in Fig. 1, users can limit the number of issues retrieved (the GitHub API supports up to 1,000), as well as change the sorting criteria (which is important for capped queries like GitHub’s). Users may also filter the results by issue comment classification category, e.g., omitting particular categories or retrieving only issues that have at least one comment corresponding to a “solution discussion” [2].
Using queries, QuerTCI enables ESE researchers to retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues in an integrated manner. Our hope is that, by using QuerTCI as part of a broader research infrastructure, ESE researchers can uncover challenges related to particular technologies using certain keywords through popular open-source repositories more seamlessly than previously possible. It alleviates the legwork of data querying and preparation required for the data to be: (i) compatible with the underlying comment classification model and (ii) related to particular programming language constructs (à la gitcproc [3]). QuerTCI is open-source and publicly available [11], and a demonstration video may be found at: https://youtu.be/fADKSxn0QUk.
2 ENVISIONED USERS
Since we envision QuerTCI being part of a broader research infrastructure, our envisioned users are mainly ESE researchers seeking to unearth challenges developers face in particular situations or using specific technologies. For example, ESE researchers may be interested in discovering, using a keyword-based search, the kinds of discussions surrounding a particular Application Programmer’s Interface (API), programming language feature, or new platform version. Using QuerTCI, they are able to receive a “quick gist” of the otherwise long discussion threads commonly found on GitHub issues, with NLP automatically identifying the topics/semantics of what is being discussed. ESE researchers can then quantify, on a relatively large scale, the categories of discussions taking place under GitHub issues containing keywords of interest.
Figure 1: QuerTCI interactive command-line interface.
arXiv:2202.08761v1 [cs.SE] 17 Feb 2022
To further understand developer challenges, ESE researchers may filter for issues containing keywords corresponding to certain language constructs and having particular comment classifications. Then, they may use manual inspection to investigate further. QuerTCI can empower ESE researchers to narrow the scope of manual investigation so that they may focus on GitHub issues with the most relevant discussions. For instance, we may be interested in GitHub issues involving Java 8 streams (e.g., stream(), parallelStream(), Collectors) that include no solution discussion. The resulting set can then be manually inspected by users to understand why issues involving these constructs remain unsolved.
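The filter described above can be sketched roughly as follows. This is a minimal illustration, not QuerTCI’s actual code: it assumes rows shaped like the result CSV (id, comment line, category), and the function name `issues_without_category` and the sample rows are hypothetical.

```python
from collections import defaultdict


def issues_without_category(rows, category):
    """Return sorted ids of issues in which no comment line has `category`.

    `rows` is an iterable of (issue_id, comment_line, category) tuples that
    mirrors the columns of the result CSV (an assumption for this sketch).
    """
    categories_by_issue = defaultdict(set)
    for issue_id, _comment_line, row_category in rows:
        categories_by_issue[issue_id].add(row_category)
    return sorted(
        issue_id
        for issue_id, seen in categories_by_issue.items()
        if category not in seen
    )


# Hypothetical sample rows (not real tool output).
sample_rows = [
    (101, "parallelstream slower sequential stream", "Observed Bug Behavior"),
    (101, "try collecting list first", "Solution Discussion"),
    (202, "collectors tomap throws duplicate key", "Observed Bug Behavior"),
    (202, "i similar issue please help", "Usage"),
]
unsolved = issues_without_category(sample_rows, "Solution Discussion")
print(unsolved)
```

Only issue 202 lacks a “Solution Discussion” line here, so it is the one a researcher would open for manual inspection.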
Another class of users may be practicing Software Engineers who are tackling common problems pertaining to certain topics. Using QuerTCI, they can summarize issues that may arise for a particular topic/query string of interest. Furthermore, Software Engineers can use QuerTCI to filter out certain discussion types (e.g., “Social Discussion” [2]) from the tool’s output, thereby saving time that might have been spent combing through large discussion threads.
Lastly, QuerTCI may prove useful to (Computer Science) students who are studying a particular topic and are looking to platforms, e.g., GitHub, to gain insight into what experienced Software Engineers are discussing. Using our tool, students can focus on issue threads that pertain to relevant categories (e.g., “Solution Discussion,” “Bug Reproduction” [2]). Such expert developer discussion may prove useful to students learning a new programming language, struggling with a new framework or library (APIs), or adopting a new software platform.
Figure 2: QuerTCI high-level architecture.
3 ADDRESSED EMPIRICAL SOFTWARE ENGINEERING RESEARCH CHALLENGES
With QuerTCI, we aim to reduce the time spent by researchers, and perhaps even practicing Software Engineers, in combing through long issue discussion threads for answers relating to their topics of interest. QuerTCI was created to help ESE researchers quantify and ultimately understand long issue discussion threads within GitHub issues containing particular keywords (e.g., those related to certain APIs). A challenge that ESE researchers face is the time it takes to fully digest and understand issue discussion threads using manual inspection. Although GitHub issues may be “tagged,” such tags may not be accurate or may not reflect how issue discussions evolve over time. Moreover, issue tags, e.g., “bug,” “enhancement,” relate to the GitHub issue as a whole and thus may not represent the conversations occurring within the issue. As (manual) empirical studies are typically labor- and resource-intensive, ESE researchers can use QuerTCI to shrink the search space needed to find answers to their (research) questions. With QuerTCI, researchers can query GitHub issues for keywords and subsequently rely on a pre-trained NLP model for issue thread discussion categorization. Doing so can limit the search space (i.e., issues and their discussion threads) that ESE researchers must (manually) inspect more thoroughly.
For example, suppose that we are interested in understanding challenges that data scientists face when using a particular TensorFlow API, e.g., tf.module. Simply using QuerTCI to query for the string “tf.module” would yield a large number of GitHub issue threads that would automatically be categorized using a pre-trained NLP model, allowing researchers to save time, gain an overview of the nature of the issues surrounding this particular query, and choose from the returned results a proper subset of GitHub issues to (manually) examine in further detail.
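One way to obtain such an overview is to count the classified comment lines per category. The sketch below assumes rows shaped like the result CSV (id, comment line, category); the function name `category_overview` and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter


def category_overview(rows):
    """Count classified comment lines per category, most frequent first."""
    counts = Counter(category for _issue_id, _line, category in rows)
    return counts.most_common()


# Hypothetical sample rows (not real tool output).
sample_rows = [
    (1, "tf version CODE unfortunately i use docker", "Usage"),
    (1, "could specify tensorflow version", "Solution Discussion"),
    (2, "i tested running pre trained agent", "Usage"),
    (2, "i can reproduce reported behavior", "Bug Reproduction"),
]
overview = category_overview(sample_rows)
for category, count in overview:
    print(f"{category}: {count}")
```

Sorting by frequency puts the dominant discussion type first, which is usually what a researcher wants to see before choosing a subset to read in full.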
4 IMPLEMENTATION
Figure 2 depicts the overall architecture of QuerTCI, which includes querying GitHub’s query service API for results pertaining to the input query string, preprocessing the retrieved data locally, and running the preprocessed data through a pre-trained NLP model for classification. Finally, QuerTCI writes the results to output files on the local file system.
NLP Model, Constraints, Serialization & CLI Prototyping. The QuerTCI implementation uses a pre-built, customizable NLP model to classify issue comments. By default, the tool integrates a model provided by Arya et al. [2], but users may provide their own. The models are serialized using the sklearn [14] and pickle [13] Python libraries, persisting them into a file that is “imported” into QuerTCI. The pre-built (input) model should only be responsible for classification; models should not employ any preprocessing or tokenization of the input strings (i.e., those representing individual lines of GitHub issue comment threads). Preprocessing and tokenization, the steps of which are covered later in this section, are done by QuerTCI before running the strings through the model. This implementation step involved thorough testing to ensure that integrating with the imported model did not generate any errors or data corruption. QuerTCI’s CLI was prototyped using an iterative process. To assess usability, we surveyed several ESE researchers in our lab (independent of this project) to understand the options needed and ease-of-use.
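The serialization round trip can be illustrated with a small sketch. To stay dependency-free, it pickles a toy keyword-to-category table instead of a trained scikit-learn estimator; the table, the helper names (`save_model`, `load_model`, `classify`), and the in-memory buffer standing in for a model file are all assumptions made for illustration, not QuerTCI’s actual code.

```python
import io
import pickle

# Stand-in "model": a plain keyword-to-category table. A real model would be a
# trained estimator; this toy table is only an assumption for the sketch.
toy_model = {"workaround": "Workarounds", "reproduce": "Bug Reproduction"}


def save_model(model, stream):
    """Persist the model object to a binary stream."""
    pickle.dump(model, stream)


def load_model(stream):
    """Load a model object; only unpickle data from sources you trust."""
    return pickle.load(stream)


def classify(model, preprocessed_line, default=None):
    """Classify an already preprocessed line; no cleaning happens here."""
    for word in preprocessed_line.split():
        if word in model:
            return model[word]
    return default


buffer = io.BytesIO()  # stands in for a model file on disk
save_model(toy_model, buffer)
buffer.seek(0)
loaded = load_model(buffer)
label = classify(loaded, "removing decorator viable workaround")
print(label)
```

Keeping `classify` free of any cleaning mirrors the constraint stated above: the model only classifies, and all preprocessing happens before the call.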
Interfacing with the GitHub API. Our tool interfaces with the GitHub API (Fig. 2, step 2). Our implementation lifts several burdens from ESE researchers seeking to programmatically query GitHub, including API query throttling, API key transfer, and HTTP request authentication (Fig. 2, step 1). Then, a query encompassing a search term (Fig. 2, step 3) is used to retrieve a list of GitHub issue threads (e.g., pull/patch requests,¹ issue discussions), and the comments for each of the returned issue threads are extracted. This step is accomplished via several REST API endpoints, e.g., /issues, which, given a query parameter q, returns a list of issues that are related to the desired query parameter string [7].
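A rough sketch of how such requests might be prepared is shown below. It is not QuerTCI’s implementation: it assumes GitHub’s public `/search/issues` REST endpoint, a page size of at most 100, and the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers; the function names and default sort order are hypothetical. Nothing is sent over the network.

```python
import time
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"
MAX_RESULTS = 1000  # cap on search results mentioned in the text
MAX_PER_PAGE = 100  # maximum page size of the search endpoint


def build_search_requests(query, token, limit=MAX_RESULTS, sort="comments", order="desc"):
    """Return one authenticated Request per result page (requests are not sent)."""
    limit = max(1, min(limit, MAX_RESULTS))
    per_page = min(limit, MAX_PER_PAGE)
    pages = -(-limit // per_page)  # ceiling division
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    reqs = []
    for page in range(1, pages + 1):
        params = urllib.parse.urlencode(
            {"q": query, "sort": sort, "order": order, "per_page": per_page, "page": page}
        )
        reqs.append(urllib.request.Request(f"{API_ROOT}/search/issues?{params}", headers=headers))
    return reqs


def seconds_until_reset(response_headers, now=None):
    """Seconds to wait before the next call, given rate-limit response headers."""
    now = time.time() if now is None else now
    if int(response_headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0
    return max(0.0, float(response_headers.get("X-RateLimit-Reset", "0")) - now)


reqs = build_search_requests("tf.function", "TOKEN", limit=250)
print(len(reqs), reqs[0].full_url)
```

Because pages are fetched whole, the last page may return more items than the requested limit; a caller would trim the combined list to `limit` after fetching.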
With the results from the previous search query (Fig. 2, step 4), we then query the /comments REST API endpoint for each of the retrieved issues. The URL of this endpoint is provided as part of the issue results. Querying this endpoint returns the list of comments for each of the issues (Fig. 2, step 5), thus allowing us to extract comment strings for further processing, cleaning, and tokenizing before they are run through the classification model. At this step, we also filter out noisy comments, e.g., for query strings that include punctuation,² as well as issues that do not contain any discussion.
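The extraction step can be sketched on already decoded JSON payloads. The field names used here (`items`, `comments`, `comments_url`, `html_url`, `body`) follow GitHub’s REST responses for issue search results and issue comments; the function names and the sample payloads (including the `example/repo` repository) are hypothetical.

```python
def issues_with_discussion(search_payload):
    """Yield (id, html_url, comments_url) for issues with at least one comment."""
    for item in search_payload.get("items", []):
        if item.get("comments", 0) > 0:
            yield item["id"], item["html_url"], item["comments_url"]


def comment_lines(comments_payload):
    """Split each comment body into non-empty lines, ready for preprocessing."""
    lines = []
    for comment in comments_payload:
        body = comment.get("body") or ""
        lines.extend(line.strip() for line in body.splitlines() if line.strip())
    return lines


# Hypothetical payloads standing in for decoded API responses.
search_payload = {
    "items": [
        {
            "id": 1,
            "html_url": "https://github.com/example/repo/issues/1",
            "comments_url": "https://api.github.com/repos/example/repo/issues/1/comments",
            "comments": 2,
        },
        {
            "id": 2,
            "html_url": "https://github.com/example/repo/issues/2",
            "comments_url": "https://api.github.com/repos/example/repo/issues/2/comments",
            "comments": 0,
        },
    ]
}
comments_payload = [{"body": "First line\n\n  second line  "}, {"body": None}]

selected = list(issues_with_discussion(search_payload))
lines = comment_lines(comments_payload)
print(selected, lines)
```

Issue 2 is dropped because it has no comments, which corresponds to filtering out issues that do not contain any discussion.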
¹ GitHub treats issues and pull requests similarly.
² GitHub ignores punctuation in all query strings.
Data Preprocessing & Tokenization. With the list of comments previously retrieved from the /comments REST API call, we then preprocess and clean the returned data, removing any noise and unnecessary stop words (Fig. 2, step 6). We use the list of stop words included in the NLTK library [10], augmenting it with several of our own custom words to further help reduce noise within the comment corpus. Additionally, we tokenize certain (common) strings in order to extract the essence of the GitHub issue text. This process mainly centers around tokenizing screen (GitHub user) names, URLs, quotes (both single and double), and code snippets (strings beginning with backticks). Each such string is then replaced with a token name, e.g., USER_NAME, URL, QUOTE, CODE. Lastly, due to the way GitHub processes queries (using relaxed matching), QuerTCI further filters out issue comments that do not contain the original query string. This is particularly important for queries representing programming language constructs or API calls, which typically include punctuation. Issues not matching this stricter check are omitted and stored in a corresponding “omitted” file for users to inspect further if necessary (Fig. 2, step 8).
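The token replacement and the stricter query check can be approximated with regular expressions, as in the sketch below. The patterns, the tiny stop-word set (a stand-in for the NLTK list plus custom words), the case-insensitive match, and the comment-level partitioning are all assumptions made for illustration; QuerTCI’s actual rules may differ.

```python
import re

# Code and URLs are replaced first so their contents are not tokenized again.
TOKEN_PATTERNS = [
    (re.compile(r"`{3}.*?`{3}|`[^`]*`", re.DOTALL), " CODE "),
    (re.compile(r"https?://\S+"), " URL "),
    (re.compile(r'"[^"]*"'), " QUOTE "),
    (re.compile(r"(?<!\w)'[^']*'(?!\w)"), " QUOTE "),
    (re.compile(r"(?<!\w)@[A-Za-z0-9-]+"), " USER_NAME "),
]
# Tiny stand-in stop-word list (assumption; not the NLTK list).
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "it"}


def preprocess(raw_line):
    """Replace user names, URLs, quotes, and code with tokens; drop stop words."""
    text = raw_line
    for pattern, token in TOKEN_PATTERNS:
        text = pattern.sub(token, text)
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def partition_by_query(raw_comments, query):
    """Split raw comments into those containing the literal query and the rest."""
    kept, omitted = [], []
    for comment in raw_comments:
        target = kept if query.lower() in comment.lower() else omitted
        target.append(comment)
    return kept, omitted


cleaned = preprocess('@alice see `tf.function` at https://example.com/x and "retry" it')
kept, omitted = partition_by_query(
    ["Use tf.function here", "tf function is slow"], "tf.function"
)
print(cleaned, kept, omitted)
```

The strict check runs on the raw text, before cleaning, so that punctuation in a query such as `tf.function` is still present when it is compared.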
Model Classification & Result Output. The retrieved issue comment data, now cleaned and preprocessed, is fed into the model for classification (Fig. 2, step 7). Classification results are then written to a CSV file. As comment bodies may be lengthy, each comment line is classified individually. We also list the source issue identifiers, including browser- and API-friendly URLs (not shown in Table 1), to help users easily navigate to the GitHub issue via a web browser for further (manual) inspection. Similar to the classified results, we also provide the source issue identifiers and URLs for the omitted issues to facilitate manual inspection if needed.
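Writing one row per classified comment line might look like the following sketch. The column names follow Table 1; the function name `write_results`, the stand-in classifier, and the in-memory stream used instead of a file are assumptions for illustration.

```python
import csv
import io


def write_results(issues, classify, stream):
    """Write one CSV row per preprocessed comment line: id, comment line, category."""
    writer = csv.writer(stream)
    writer.writerow(["id", "comment line", "category"])
    for issue_id, lines in issues:
        for line in lines:
            writer.writerow([issue_id, line, classify(line)])


def toy_classifier(line):
    """Hypothetical stand-in for the pre-trained model."""
    return "Workarounds" if "workaround" in line else "Usage"


# Hypothetical preprocessed comment lines grouped by issue id.
issues = [
    (1, ["removing decorator viable workaround", "tf version CODE"]),
    (2, ["i use docker"]),
]
output = io.StringIO()  # stands in for the result file on disk
write_results(issues, toy_classifier, output)
rows = output.getvalue().splitlines()
print(rows)
```

Using the `csv` module rather than manual string joins keeps rows valid even when a comment line contains commas or quotes.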
Table 1 portrays an example result CSV file snippet produced by QuerTCI; the complete example file may be found in our dataset [12]. Column id represents the unique GitHub issue identifier, column comment line the preprocessed, tokenized comment text, and column category the classification category as produced using the pre-built model of Arya et al. [2]. The results were obtained for an empirical study on the challenges facing developers in improving the run-time performance of imperative Deep Learning (DL) code using hybridization [15]. The query tf.function was used (period included) to uncover unsolved GitHub issues mentioning the TensorFlow [1] hybridization API keyword. The issues (filtered by QuerTCI) were then manually inspected.
5 EVALUATION
As QuerTCI is in its early development stages, a thorough evaluation is pending. However, QuerTCI integrates several successful technologies and approaches. The GitHub API is widely used in both industry and research. ESE researchers have successfully used the GitHub API at scale, e.g., Dilhara et al. [4] use it to discover 1,000 top-rated Machine Learning (ML) systems comprising 58 million source lines of code (SLOC). Furthermore, through qualitative content analysis of 15 complex issue threads across three GitHub projects, Arya et al. [2], whose model is our default NLP model, uncovered 16 different comment classification types, creating a labeled corpus containing 4,656 sentences. Their model has an F-score of 0.61 and 0.42 for existing and new GitHub issues, respectively. As mentioned in Section 4, our tool has been used in a prior empirical study [15].
Our isolated preliminary assessment of QuerTCI involved a double-blind open card sort between two authors to independently evaluate our tool’s integration with GitHub and the model of Arya et al. [2] and subsequently assess its accuracy. The authors chose a random selection of issue comment threads based on the same query and independently categorized them. The results were then compared to reach an agreed manual classification. We then used QuerTCI to classify the issue comments and checked whether QuerTCI had categorized these issues in a similar way.
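The text does not state how similarity between the manual and tool classifications was measured. One simple possibility, shown here purely as an assumption, is the fraction of comment lines on which the agreed manual label and the tool’s label coincide; the function name and the sample labels are hypothetical.

```python
def agreement_rate(manual_labels, tool_labels):
    """Fraction of items on which the manual and tool labels coincide."""
    if len(manual_labels) != len(tool_labels):
        raise ValueError("label lists must have the same length")
    if not manual_labels:
        return 0.0
    matches = sum(m == t for m, t in zip(manual_labels, tool_labels))
    return matches / len(manual_labels)


# Hypothetical labels for four comment lines (not real evaluation data).
manual = ["Usage", "Solution Discussion", "Bug Reproduction", "Usage"]
tool = ["Usage", "Solution Discussion", "Usage", "Usage"]
rate = agreement_rate(manual, tool)
print(rate)
```

Raw agreement ignores matches that would occur by chance, so a chance-corrected statistic could be preferable when the category distribution is skewed.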
While the initial results are promising, we plan to expand the evaluation in the future by involving external ESE researchers. Specifically, we will recruit independent ESE researchers to use QuerTCI for an empirical study, e.g., one studying particular API usage. Then, we will also recruit other independent ESE researchers not using QuerTCI as a control. To reduce the number of variables, we would need each of the research teams to perform the same study with and without our tool. However, achieving this goal is highly unlikely as the repeated studies would not be novel. More practically, we will use a mass survey among ESE researchers who have not used our tool and then compare the results with those from researchers who did use it. Although the comparison will not be completely isolated, if the scale is large enough, we foresee that the obtained information will nevertheless be useful.
Table 1: Example result CSV file snippet. Issue URLs not shown. Column id is the GitHub issue identifier.
id | comment line | category
415902593 | however get u step closer running original code actual error message tensorboard propagate ui CODE | Observed Bug Behavior
415902593 | i think simplest fix around would call trace_on trace_export separately around graph call so something like | Workarounds
417390174 | some detail i using subclassed model complex valued data | Motivation
740456602 | removing tf.function decorator viable workaround best practice a related issue URL tensorflow issues/27120 | Potential New Issues & Requests
755665148 | 74 fix acquisition optimizer | Solution Discussion
767685452 | SCREEN_NAME still issue latest version coremltools if still issue please share additional code show ... | Action on Issue
873531279 | i similar issue please help would great | Contribution & Commitment
947976601 | situation actually much worse i realised CODE CODE the following test pass | Solution Discussion
947976601 | ... CODE raised even though value execute error branch perhaps due tracing covering every branch this suggests ... | Usage
1004824336 | SCREEN_NAME could specify tensorflow version do use docker | Solution Discussion
1004824336 | tf version CODE unfortunately i use docker | Usage
1004824336 | thx i guess could something wrong pretraining cobblestone because i tested running pre trained cobblestone agent ... | Usage
1004824336 | hi i’ve checked sliced_trajectory data part correct may i ask chain used pretraining training part forger most likely one? CODE | Usage
1004824336 | SCREEN_NAME I can reproduce reported behavior docker version also i tried reproduce without docker got error ... | Bug Reproduction
1004824336 | yes CODE trajectory i see problem chain for example agent place additional crafting table creating stone pickaxe look first ... | Expected Behavior
6 RELATED WORK
Casalnuovo et al. [3] present gitcproc, a tool for processing and classifying GitHub commits. Our tool, in contrast, processes and classifies GitHub issue comments. Like our tool, theirs is motivated by analyzing programming language constructs using (e.g., API) keywords. However, gitcproc does not analyze GitHub SE artifacts at scale; each project repository must be downloaded locally and subsequently (serially) analyzed. QuerTCI, on the other hand, leverages the (indexed) GitHub API online, nearly instantly obtaining GitHub issue data from thousands of GitHub projects.
Users of the approach of Arya et al. [2] must sanitize and manually enter the issue comments as input. Our keyword query-based approach automatically interfaces with GitHub’s API directly, passing along data representing comments only from issues matching a particular query string. Moreover, QuerTCI cleans, preprocesses, and tokenizes the automatically retrieved GitHub data and subsequently runs the sanitized issue comment threads through Arya et al.’s model for classification. Further, the model may be interchanged using the pickle file discussed in Section 4.
Karantonis [8] provides a multi-label prediction for GitHub issues using the RoBERTa NLP model [9]. Their approach, however, is for automatically assigning issue labels (e.g., “bug,” “feature”), whereas our approach classifies issue comments based on a keyword-based query string. Also, unlike Karantonis, who strictly relies on a Python notebook interface, QuerTCI provides an (optionally interactive) CLI UI using the PyInquirer library, enabling an interactive command menu and command-line arguments. Furthermore, QuerTCI automatically authenticates with GitHub’s API using a supplied access token and checks for the remaining API query limit.
Fadhel [5] writes a blog post and associated IPython notebook [6] describing how to classify discussions within code reviews that are part of GitHub pull requests. Similar to Karantonis [8], there is no GitHub integration (users must enter the data manually) and no query feature. Although pull requests are treated similarly to issues in GitHub, Fadhel’s classification model is highly tuned to code review discussions, which may not be entirely amenable to our stated use case of studying, e.g., usage of particular APIs.
7 CONCLUSION & FUTURE WORK
We have demonstrated an open-source, publicly available tool [11], as part of a broader research infrastructure, that helps ESE researchers quantify the types of discussions in GitHub issue comment threads around a particular query string of interest. QuerTCI is implemented in Python and relies heavily on other libraries, such as NLTK for string preprocessing, and loads an NLP model to automatically classify each of the strings. QuerTCI also interfaces with GitHub’s API and features an (optionally interactive) CLI UI. As the tool is in its early stages, plans for a fuller evaluation were discussed.
In the future, we plan to expand onto other platforms such as Stack Overflow to also process developer Q&A posts. To further enhance performance, we will classify each issue thread as it is retrieved from GitHub’s API instead of waiting for all threads to be retrieved. Other future plans include exploring alternate tool forms, e.g., browser extensions, and performing a thorough evaluation.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: a system for large-scale Machine Learning. In Symposium on Operating Systems Design and Implementation.
[2] Deeksha Arya, Wenting Wang, Jin L. C. Guo, and Jinghui Cheng. 2019. Analysis and detection of information types of open source software issue discussions. In International Conference on Software Engineering. (May 2019), 454–464. doi: 10.1109/ICSE.2019.00058.
[3] Casey Casalnuovo, Yagnik Suchak, Baishakhi Ray, and Cindy Rubio-González. 2017. GitcProc: a tool for processing and classifying GitHub commits. In International Symposium on Software Testing and Analysis (ISSTA ’17). ACM, 396–399. doi: 10.1145/3092703.3098230.
[4] Malinda Dilhara, Ameya Ketkar, Nikhith Sannidhi, and Danny Dig. 2022. Discovering repetitive code changes in Python ML systems. In International Conference on Software Engineering (ICSE ’22). To appear.
[5] Muntazir Fadhel. 2018. Dissecting GitHub code reviews: a text classification experiment. Retrieved 11/24/2021 from http://mfadhel.com/github-code-reviews/mfadhel.com/github-code-reviews/.
[6] Muntazir Fadhel. 2020. What code reviewers talk about. (December 17, 2020). Retrieved 11/24/2021 from https://git.io/JMTiL.
[7] GitHub, Inc. 2021. Issues. REST API. Reference. GitHub Docs. Retrieved 11/24/2021 from https://git.io/JMJYC.
[8] Giorgos Karantonis. 2021. Predicting issues’ labels with RoBERTa. (March 12, 2021). Retrieved 11/23/2021 from https://git.io/J1jBr.
[9] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. (2019). arXiv: 1907.11692 [cs.CL].
[10] Edward Loper and Steven Bird. 2002. NLTK: the natural language toolkit. (May 17, 2002). Retrieved 02/16/2022 from https://www.nltk.org/. arXiv: cs/0205028 [cs.CL].
[11] Ye Paing, Tatiana Castro Vélez, and Raffi Khatchadourian. 2021. ponder-lab/GitHub-Issue-Classifier. (March 25, 2021). doi: 10.5281/zenodo.4637636.
[12] Ye Paing, Tatiana Castro Vélez, and Raffi Khatchadourian. QuerTCI: a tool integrating GitHub issue querying with comment classification. (February 16, 2022). doi: 10.5281/zenodo.6115404.
[13] Python Software Foundation. 2022. Pickle. Python object serialization. Python documentation. Version 3.10.2. (February 16, 2022). Retrieved 02/16/2022 from https://docs.python.org/3/library/pickle.html.
[14] scikit-learn developers. 2022. scikit-learn. Machine Learning in Python. scikit-learn documentation. Version 1.0.2. (February 16, 2022). Retrieved 02/16/2022 from https://scikit-learn.org/stable/.
[15] Tatiana Castro Vélez, Raffi Khatchadourian, Mehdi Bagherzadeh, and Anita Raja. 2022. Challenges in migrating imperative Deep Learning programs to graph execution: an empirical study. (January 24, 2022). arXiv: 2201.09953 [cs.SE].