On the Shoulders of Giants: A New Dataset for Pull-based Development Research
Xunhui Zhang
National University of Defense
Technology, Changsha, China.
zhangxunhui@nudt.edu.cn
Ayushi Rastogi
Delft University of Technology,
the Netherlands
a.rastogi@tudelft.nl
Yue Yu
National University of Defense
Technology, Changsha, China.
yuyue@nudt.edu.cn
ABSTRACT
Pull-based development is a widely adopted paradigm for collaboration in distributed software development, attracting attention from both academia and industry. To better study the pull-based development model, this paper presents a new dataset containing 96 features collected from 11,230 projects and 3,347,937 pull requests. We describe the creation process and explain the features in detail. To the best of our knowledge, our dataset is the most comprehensive and largest one available toward a complete picture for pull-based development research.
CCS CONCEPTS
• Software and its engineering → Programming teams.
KEYWORDS
pull-based development, pull request, distributed software development
ACM Reference Format:
Xunhui Zhang, Ayushi Rastogi, and Yue Yu. 2020. On the Shoulders of Giants: A New Dataset for Pull-based Development Research. In 17th International Conference on Mining Software Repositories (MSR '20), October 5–6, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3379597.3387489
1 INTRODUCTION
The pull-based development model [7] has changed the traditional way of code contribution [31], code review [28] and process automation [27]. Since Gousios et al. [8] proposed a firsthand dataset of pull requests, plenty of valuable studies have been designed based on it to better understand the essence of modern software development, e.g., human aspects of SE, DevOps and collaborative environments. Meanwhile, those studies introduced a considerable number of extended features, e.g., gender [25], social connection [5], geographical location [18, 19], personality [10] and emotion [11]. However, a comprehensive dataset that offers a more complete picture and supports both the investigation of new work and the reproduction and verification of prior work on pull-based development is still lacking.
In this paper, standing on the shoulders of giants [6, 8], we create a new, upgraded dataset called new_pullreq (10 times larger than the original one) by adding as many new features as possible to the pool of existing metrics. To the best of our knowledge, new_pullreq is the largest dataset for pull-based development research; it contains 11,230 OSS projects (representative of both small and large projects), 96 metrics and 3,347,937 pull requests. Our dataset is publicly available (https://zenodo.org/record/3700595#.XmS2GJP0kY1) and the source code (https://github.com/zhangxunhui/new_pullreq_msr2020) is open for replication as well as extension.
2 FEATURE SELECTION
The feature selection is based on Gousios et al.'s dataset [8] as well as studies on pull request development from 2009 until 2019. By combining “pull-based development”, “pull-request”, “GitHub” and “open source” with zero or more of the following sub-terms: “model”, “software”, “accepted”, “rejected”, “review”, “merged”, we searched paper titles using Google Scholar's boolean search engine and identified 76 papers, a subset of which presented features for decision making. These features broadly fall into three categories, relating to the contributor, the project and the pull request, although some features lie at their intersections. Below we describe all 69 new features in addition to the 27 features reported in Gousios et al.'s dataset [8].
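To illustrate the search protocol, the sketch below enumerates the candidate title queries. The paper does not list the exact queries issued to Google Scholar, so the combination scheme (all core terms joined with every subset of sub-terms) is an assumption.

```python
from itertools import combinations

# Core search terms (always present) and optional sub-terms, per Section 2.
CORE = ['"pull-based development"', '"pull-request"', '"GitHub"', '"open source"']
SUB = ["model", "software", "accepted", "rejected", "review", "merged"]

# Append zero or more sub-terms to the core terms.
queries = [
    " ".join(CORE + list(subset))
    for r in range(len(SUB) + 1)
    for subset in combinations(SUB, r)
]
print(len(queries))  # 64 candidate query strings (2^6 sub-term subsets)
```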
2.1 Contributor characteristics
Contributor characteristics relate to submitters (or developers) and integrators (or committers). Some of the factors relate to individuals, while others capture interactions between two contributors or between a contributor and a project.
Experience of developers, conceptualized as the count of previous pull requests, the previous pull request acceptance rate [8], the accepted commit count [12], as well as the days since account creation [17], can influence pull request acceptance. First pull requests are less likely to be accepted [20, 21]. Similarly, the experience of integrators, calculated as the count of prior reviews, influences decision making [2].
Core members' pull requests are more likely to be accepted [1, 3, 16, 21, 24, 29].
Response time of an integrator, often measured as the time to first response, likely influences the latency as well as the chances of pull request acceptance [29].
Gender of developers, when identified as female, reduces the chances of pull request acceptance [23].
Country of developers influences the pull request acceptance rate differently for different countries [19]. Further, if the developer
and integrator are from the same country, the chances of pull request acceptance increase [19].
Affiliation of developers and integrators to companies, as well as belongingness of both the developer and the committer to the same company, changes the chances of pull request acceptance [2, 14].
Personality of developers and integrators individually, as well as the difference in personalities between the two, influences decision making [10]. Here, personality is conceptualized as Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (or OCEAN), and differences in personality as the difference between the respective scores, for example, Extraversion_submitter - Extraversion_integrator.
Emotion of developers as well as integrators, characterized as the percentage of positive and negative emotions, as well as the emotion of the first comment, is found to influence the acceptance decision [11].
Social distance refers to the closeness of the code submitter to the potential integrator as well as to the project. Following the integrator or the project prior to code contribution is seen to positively influence pull request decision making [24]. Relatedly, the fraction of team members who interacted with a developer over the team size in the last three months is used as a signal of social strength/trust, which increases the chances of pull request acceptance [29] (a computation sketch follows below).
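To make the interaction-based features concrete, the following sketch computes social_strength from raw interaction events. The event-table layout (actor, target, created_at) is our assumption for illustration, not the schema used in the dataset.

```python
import pandas as pd

def social_strength(events: pd.DataFrame, team: set, contributor: str,
                    pr_created_at: pd.Timestamp) -> float:
    """Fraction of current team members who interacted with the contributor
    in the three months before the pull request was opened (cf. [29])."""
    window = events[
        (events["target"] == contributor)
        & (events["created_at"] >= pr_created_at - pd.DateOffset(months=3))
        & (events["created_at"] < pr_created_at)
    ]
    interacting = set(window["actor"]) & team
    return len(interacting) / len(team) if team else 0.0
```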
2.2 Project characteristics
Programming languages tend to have different pull request acceptance rates [15, 17, 21]. For example, pull requests in Java and Python have a lower chance of acceptance, and the opposite holds for Scala and R.
Popularity of a project, measured as watcher count [8], star count [8], and fork count [13, 17], negatively influences pull request acceptance [13, 17, 24].
Age of a project, measured as the time interval between project creation and pull request creation (in months), indicates the maturity of the project as well as a lower likelihood of pull request acceptance [24, 29].
Workload of a project, as inferred from the number of open pull requests, decreases the chance of pull request acceptance [2, 29].
Activeness of a project, as inferred from the time interval in seconds between the opening times of the two latest pull requests, influences pull request acceptance [13] (a computation sketch follows below).
Openness of a project, as inferred from the count of open issues as well as the pull request acceptance rate, increases the likelihood of pull request acceptance [13].
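A minimal sketch of the activeness and acceptance-rate computations, assuming a hypothetical per-project pull request table with a creation timestamp and a final state column:

```python
import pandas as pd

def project_features(prs: pd.DataFrame, pr_created_at: pd.Timestamp) -> dict:
    """pushed_delta and pr_succ_rate for one project at submission time.
    `prs` is assumed to have columns: created_at, state ('merged'/'closed')."""
    earlier = prs[prs["created_at"] < pr_created_at]
    latest_two = earlier["created_at"].nlargest(2).tolist()
    return {
        # seconds between the opening times of the two latest pull requests [13]
        "pushed_delta": (latest_two[0] - latest_two[1]).total_seconds()
        if len(latest_two) == 2 else None,
        # acceptance (merge) rate of earlier pull requests in the project [13]
        "pr_succ_rate": (earlier["state"] == "merged").mean()
        if len(earlier) else None,
    }
```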
2.3 Pull request characteristics
Size of change is measured at the commit level (number of commits), the file level (files added, deleted, modified and changed) as well as by the type of files changed (source, document and others). Some of these metrics are coarse-grained, only capturing change (like source and test churn), while other metrics separate churn into addition and deletion [29]. Typically, an increase in size reduces the chances of acceptance and vice versa.
Complexity of a pull request, as inferred from the length of its description, is seen to negatively influence pull request acceptance [29].
Nature of a pull request as a bug fix, for example, can increase the chances of PR acceptance [12, 15].
Test inclusion in pull requests increases the chances of acceptance [16, 24, 29].
Reference to a contributor, issue or pull request can increase the chance of pull request acceptance [4, 29] (a pattern-matching sketch follows below).
Conflict of a pull request, as explicitly mentioned in comments [7], negatively influences the chances of pull request acceptance.
Hotness or relevance of a PR, as inferred from the number of comments during the code review process, is seen to influence decision making [7, 12, 14, 20, 24, 29]. In addition to the issue comment count [8] and commit comment count [8], we add the pull request comment count. Another indicator of hotness, the number of participants [8], is also updated to reflect participation in issues, commits, and pull requests.
Emotions (positive, negative and neutral) surrounding a pull request discussion reflect the reviewers' reactions and are found to influence decision making [11].
Continuous integration of a pull request (its existence or not), latency, build count, whether all tests passed, the percentage of tests passed/failed, and the first and last build status are all seen to influence pull request decision making [9, 22, 27, 29, 30].
The summary of each category of new_pullreq is shown in Table 1, which includes feature tags, descriptions and related citations. The MySQL table structure (https://github.com/zhangxunhui/new_pullreq_msr2020/blob/master/table_structure.pdf) and the technical report (https://github.com/zhangxunhui/new_pullreq_msr2020/blob/master/technical_report.pdf) can be found in the GitHub project.
3 DATASET
Similar to the previous study by Gousios et al. [8], the new dataset for pull-based development research builds on the publicly available datasets hosted on GHTorrent (http://ghtorrent.org/). We use the latest version of the MySQL data dump (http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2019-06-01.tar.gz, created on 1 June 2019) and complement it with additional information (e.g., issue comments) provided in the comparable version of the MongoDB dump (http://ghtorrent-downloads.ewi.tudelft.nl/mongo-daily/mongo-dump-2019-06-30.tar.gz).
To create a large dataset of active and representative software repositories, we applied several inclusion and exclusion criteria.
(1) We selected all source (base) repositories and removed forks and repositories deleted from the GHTorrent dataset or GitHub. Forks that share history with their source repository can harm representativeness, while deleted repositories are no longer active. Further, to select actively developed repositories, we included only repositories with new pull requests in the last three months.
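A hypothetical query sketch for this step, assuming the table and column names of the public GHTorrent MySQL dump (projects.forked_from, projects.deleted, pull_requests, pull_request_history):

```python
# Criterion (1): source, non-deleted, recently active repositories.
# Treat as a sketch; the dump date is 2019-06-01, so "last three months"
# is approximated as pull requests opened on or after 2019-03-01.
QUERY = """
SELECT p.id, p.name, p.language
FROM projects p
WHERE p.forked_from IS NULL          -- source (base) repositories only
  AND p.deleted = 0                  -- drop repositories deleted upstream
  AND EXISTS (
      SELECT 1
      FROM pull_requests pr
      JOIN pull_request_history prh ON prh.pull_request_id = pr.id
      WHERE pr.base_repo_id = p.id
        AND prh.action = 'opened'
        AND prh.created_at >= '2019-03-01')
"""
```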
Table 1: Factors influencing pull-based development

Contributor characteristics
- acc_commit_num: number of accepted commits of a contributor before the creation of a pull request [12]
- account_creation_days: time interval in days from the contributor's account creation to the pull request creation [17]
- first_pr: whether it is the first pull request of a contributor [20, 21]
- prior_review_num: number of prior reviews of an integrator [2]
- core_member: whether the contributor is a core member or not [1, 3, 16, 21, 24, 29]
- first_response_time: time interval in minutes from pull request creation to the first response by a reviewer [29]
- contrib_gender: gender of a contributor [23]
- contrib/inte_country: country of residence of contributor/integrator [19]
- same_country: whether contributor and integrator come from the same country [19]
- prior_interaction: number of times that the contributor interacted with the project in the last three months [24]
- same_affiliation: whether the contributor and the integrator belong to the same affiliation [2, 14]
- contrib/inte_affiliation: the affiliation that the contributor/integrator belongs to [2, 14]
- contrib/inte_X: the Big Five personality traits of contributor/integrator (open: openness; cons: conscientiousness; extra: extraversion; agree: agreeableness; neur: neuroticism) [10]
- perc_contrib/inte_X: percentage of the contributor/integrator's emotion in comments (neg: negative; pos: positive; neu: neutral) [11]
- X_diff: absolute difference of Big Five personality traits between contributor and integrator [11]
- contrib/inte_first_emo: emotion of the contributor/integrator's first comment [11]
- social_strength: fraction of team members that interacted with the contributor in the last three months [29]
- contrib_follow_integrator: whether the contributor follows the integrator when submitting a pull request [24]

Project characteristics
- language: programming language of the project [15, 17, 21]
- open_issue_num: number of open issues when submitting the pull request [13]
- project_age: time interval in months from the project creation to the pull request creation [24, 29]
- open_pr_num: number of open pull requests when submitting the pull request [2, 29]
- pushed_delta: time interval in seconds between the opening times of the two latest pull requests [13]
- fork_num: number of forks of the project when submitting the pull request [13, 17]
- pr_succ_rate: acceptance rate of pull requests in the project [13]

Pull request characteristics
- churn_addition: number of added lines of code [29]
- churn_deletion: number of deleted lines of code [29]
- bug_fix: whether the pull request fixes a bug [12, 15]
- description_length: word count of the pull request description [29]
- test_inclusion: whether test code exists in the pull request [16, 24, 29]
- comment_conflict: whether the keyword "conflict" exists in comments [7]
- hash/at_tag: whether a #/@ tag exists in comments or description [4, 29]
- pr_comment_num: number of pull request comments [8]
- part_num_X: number of participants in comments (issue: issue comment; pr: pull request comment; commit: commit comment) [8]
- part_num_code: number of participants in both pull request comments and commit comments [8]
- ci_exists: whether the pull request uses continuous integration tools [27]
- ci_build_num: number of CI builds [30]
- ci_latency: time interval in minutes from pull request creation to the first build finish time of CI tools [29]
- perc_neg/pos/neu_emotion: percentage of negative/positive/neutral emotion in comments [11]
- ci_test_passed: whether all the CI builds passed [9, 22]
- ci_first_build_status: first build result of the CI tool [30]
- ci_failed_perc: percentage of failed CI builds [30]
- ci_last_build_status: last build result of the CI tool [30]
(2) Next, we selected projects from six programming languages, different in size and activity count, for meaningful analysis. We selected all projects with at least 33 submitted pull requests. These projects constitute the top 3% of all projects in terms of pull request count (as against the top 1% in the case of Gousios et al.'s dataset [8]). We extended the original selection of four programming languages (Ruby, Python, Java, and Scala) with Go and JavaScript. The resulting 19,572 projects were distributed across languages as follows: JavaScript: 6,584; Python: 5,121; Java: 3,044; Ruby: 2,794; Go: 1,497; and Scala: 532. Next, we selected different-sized projects (small, medium, and large) in terms of contributor count. Small teams comprised 12 or fewer developers, medium-sized teams 13 up to 30 developers, and large teams more than 30 developers. We selected 4,000 projects from each class, resulting in a total of 12,000 projects. Among these projects, we removed “everypolitician/everypolitician-data”, which is extremely large, is used for holding the data for national legislatures worldwide, and receives a large fraction of its activity from bots.
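The size-class bucketing and per-class selection can be sketched as follows. The input frame is hypothetical, and the use of uniform random sampling is an assumption; the paper does not state how the 4,000 projects per class were drawn.

```python
import pandas as pd

def team_class(contributors: int) -> str:
    """Size classes from step (2): <=12 small, 13-30 medium, >30 large."""
    if contributors <= 12:
        return "small"
    if contributors <= 30:
        return "medium"
    return "large"

# Hypothetical input with columns: id, language, contributor_count.
projects = pd.read_csv("candidate_projects.csv")
projects["size_class"] = projects["contributor_count"].map(team_class)

# 4,000 projects from each size class.
sample = (projects.groupby("size_class", group_keys=False)
          .apply(lambda g: g.sample(n=4000, random_state=0)))
```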
(3) Finally, at the pull request level, we included all pull requests that were submitted to the default branch of the repository and are not open (no decision has been made on open pull requests). Moreover, we removed projects with fewer than 20 closed pull requests targeting the default branch. This gives us 11,230 projects comprising 3,347,937 pull requests. In comparison to Gousios et al.'s dataset of 865 projects and 336,502 pull requests, our dataset has 12 times more projects but only about 10 times more pull requests (since we also included small projects).
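A matching filter sketch for this final step, again over an assumed pandas layout:

```python
import pandas as pd

# Hypothetical input with columns: project_id, base_branch, default_branch, state.
prs = pd.read_csv("pull_requests.csv")

# Keep default-branch pull requests on which a decision has been made.
prs = prs[(prs["base_branch"] == prs["default_branch"]) & (prs["state"] != "open")]

# Keep projects with at least 20 such pull requests.
sizes = prs.groupby("project_id")["project_id"].transform("size")
prs = prs[sizes >= 20]
```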
4 FEATURE COLLECTION
For extracting features from the data, we followed the procedure specified in the respective paper. We retained all variants of a feature proposed in the literature, with a few exceptions (like emotion and personality) discussed below. There were, however, situations where we had to extrapolate a solution for representativeness. For example, the existing solution to analyze continuous integration works only for TravisTorrent. To make our dataset generalizable, we expanded the existing solution with some heuristics.
Personality. Many models of personality are available in the literature and used in existing research. For this dataset, we chose the state-of-the-practice tool, IBM Watson Personality Insights (https://www.ibm.com/watson/services/personality-insights/), to measure the Big Five personality traits of each user [10]. We collected all comments of developers from issue discussions, pull request discussions as well as commit discussions. We processed the data to remove code snippets and special characters (including quotes, # tags, @ mentions, IPs, email addresses, URLs, and numbers), which are of no use for inferring personality. The resulting data is fed as input to the Personality Insights tool, conditioned on the availability of 100 or more words of input to ensure a sizable text for reliable interpretation.
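A sketch of this cleaning step; the regular expressions below are illustrative stand-ins for whatever patterns the actual pipeline uses.

```python
import re
from typing import List, Optional

# Content stripped before personality inference: code, quotes, tags,
# IPs, email addresses, URLs, and numbers (order matters: specific
# patterns such as IPs must run before the generic number pattern).
PATTERNS = [
    r"`{3}[\s\S]*?`{3}",              # fenced code snippets
    r"`[^`]*`",                       # inline code
    r"^>.*$",                         # quoted lines
    r"https?://\S+",                  # URLs
    r"[\w.+-]+@[\w-]+\.[\w.-]+",      # email addresses
    r"\b(?:\d{1,3}\.){3}\d{1,3}\b",   # IP addresses
    r"[#@]\w+",                       # '#' tags and '@' mentions
    r"\d+",                           # numbers
]

def clean_comments(comments: List[str]) -> Optional[str]:
    """Concatenate a user's comments and strip non-linguistic content."""
    text = "\n".join(comments)
    for pattern in PATTERNS:
        text = re.sub(pattern, " ", text, flags=re.MULTILINE)
    words = text.split()
    # Query the tool only when at least 100 words remain.
    return " ".join(words) if len(words) >= 100 else None
```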
Country and gender. We infer the country and gender of developers using the tool proposed by Vasilescu et al. [25].
Emotion. Similar to personality, many models exist to infer emotions. We use the best prediction model known so far, ULMFiT [11], to infer emotions in discussions.
Continuous integration. Previous studies used only one tool, travis-ci [27]. However, for determining whether a pull request uses a CI tool, relying on travis-ci alone means the existing solution no longer holds. To overcome this challenge, we propose a few heuristics applicable to a wider range of CI tools.
To nd whether a pull request uses a CI tool or not, we started by
searching for terms commonly used for continuous integration such
as “continuous”, “integration”, “-ci”, “ci-”, “ci/”, “ci.” in the text elds
of pull requests. These elds include context, description as well as
the associated URL, information on which are inferred from GitHub
status API
9
. If a term match is found, the pull request uses CI tool,
otherwise we look for keywords “build” and “test” in the context
and description. Alternatively, we compiled a list of widely used
CI tools from GitHub marketplace
10
and veried it manually by
looking at related posts online. Further, we checked for the presence
of tool name in text elds as a sign of CI tool use. We assume that
if the above steps did not link a pull request to a CI tool, CI tool
is not used. To check for the accuracy of the proposed heuristic,
the rst author randomly selected 200 pull requests inferred using
CI tools and another 200 pull requests not inferred using CI tools.
The rst author then manually checked all 400 pull requests and
found 99.5% precision for pull requests that uses CI tools and 99%
precision for pull requests that do not use CI tools.
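A condensed sketch of this heuristic. The tool-name list is a small illustrative subset of the marketplace list, and the exact field handling reflects our reading of the steps above.

```python
CI_TERMS = ("continuous", "integration", "-ci", "ci-", "ci/", "ci.")
FALLBACK_TERMS = ("build", "test")
CI_TOOLS = ("travis", "circleci", "drone", "jenkins", "appveyor")  # subset

def uses_ci(context: str, description: str, target_url: str) -> bool:
    """Decide whether a pull request used CI, from its status-API text fields."""
    all_fields = " ".join((context, description, target_url)).lower()
    # Step 1: common CI terms in any text field.
    if any(term in all_fields for term in CI_TERMS):
        return True
    # Step 2: fall back to build/test keywords in context and description.
    if any(term in (context + " " + description).lower() for term in FALLBACK_TERMS):
        return True
    # Step 3: known CI tool names in any field.
    return any(tool in all_fields for tool in CI_TOOLS)
```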
For other metrics, such as the CI build count and the percentage of failed CI builds, in order to make the dataset more generalizable, we present insights from three widely used CI tools: travis-ci, circle-ci and drone-ci. For travis-ci, we use the method proposed by Vasilescu et al. [26] to retrieve CI-related metrics. For the other two CI tools, unlike travis-ci, there is no direct link between a pull request and the CI tool. To link a pull request to a CI tool, we used the repository slug “username/repo” as an argument and matched the commit SHA of each build with a pull request. If such a match exists, we link the build with the pull request.
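The SHA-based linking can be sketched as follows; the build and pull request record shapes are hypothetical.

```python
from typing import Dict, List, Tuple

def link_builds_to_prs(builds: List[dict], prs: List[dict]) -> List[Tuple[dict, dict]]:
    """Attach circle-ci/drone-ci builds to pull requests via commit SHAs.
    `builds` entries carry 'commit_sha'; `prs` entries carry 'head_shas',
    the SHAs of a pull request's commits (assumed record shapes)."""
    pr_by_sha: Dict[str, dict] = {}
    for pr in prs:
        for sha in pr["head_shas"]:
            pr_by_sha[sha] = pr
    # A build belongs to a pull request if its commit appears in that PR.
    return [(build, pr_by_sha[build["commit_sha"]])
            for build in builds if build["commit_sha"] in pr_by_sha]
```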
Aliation Studies on aliation as a feature examined well-known
repositories, the solution, however, was not generalizable. In this
study, we introduced a new approach to infer aliation from com-
pany and email domain information derived from GitHub API. First,
we select the company texts that appear more than 10 times in the
dataset. By manually checking it, the rst author identify a list of
stop words including freelancer, student, and remove it. Next, we
look for university aliation by mapping both the university name
and its abbreviation to university. All other aliations are seen as
related to a company. To lter company name, we removed prex
“@” and suxes such as “ltd.” and “corp.”. Further, we changed some
names to its alias. For example, aws to amazon and qihoo to 360.
We further enrich our inference of aliation using email domain.
We identied world popular domains
11
, list of world university
domains
12
. We removed world popular domains as they cannot
identify a company affiliation uniquely. Next, we mapped university email domains to university affiliations. For all other email domains ending in “.org” or “.com”, we mapped them to a company. We also defined some stop words, including gmail, github and yahoo, and removed them, as some domains are missing from the popular-domain list. If an email domain uniquely maps to an alias (with at least 30 data points to avoid false positives), we append this known affiliation to the affiliation inferred from the company text.
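A sketch of the combined company-text and email-domain heuristic. The lookup tables are tiny illustrative stand-ins for the full stop-word, alias, popular-domain and university-domain lists.

```python
from typing import Optional

STOP_WORDS = {"freelancer", "student", "gmail", "github", "yahoo"}
ALIASES = {"aws": "amazon", "qihoo": "360"}      # alias examples from the text
POPULAR_DOMAINS = {"gmail.com", "yahoo.com"}     # stand-in for the popular list
UNIVERSITY_DOMAINS = {"tudelft.nl": "tu delft"}  # stand-in for the university list

def infer_affiliation(company: Optional[str], email: Optional[str]) -> Optional[str]:
    """Infer an affiliation from GitHub profile company text and email domain."""
    if company:
        name = company.lower().lstrip("@").strip()
        for suffix in (" ltd.", " corp."):
            if name.endswith(suffix):
                name = name[: -len(suffix)].strip()
        name = ALIASES.get(name, name)
        if name and name not in STOP_WORDS:
            return name
    if email and "@" in email:
        domain = email.split("@", 1)[1].lower()
        if domain in POPULAR_DOMAINS:        # cannot identify an affiliation uniquely
            return None
        if domain in UNIVERSITY_DOMAINS:
            return UNIVERSITY_DOMAINS[domain]
        if domain.endswith((".org", ".com")):
            root = domain.split(".")[-2]     # e.g. 'amazon' from 'amazon.com'
            if root not in STOP_WORDS:
                return root
    return None
```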
5 LIMITATIONS
With the objective of creating a large and representative dataset for future research on pull-based development, we collected 96 features from 11K+ projects and 3 million+ pull requests. In the process, however, we made many choices, or inherited them from previous studies, that can impact the dataset. Our work builds on a decade of research on pull-based development, extracting features relevant for decision making. In this way we not only stand on the shoulders of giants, and hence benefit from them, but also inherit the limitations of the features they present. The methodology adopted in this study is similar to and builds on the original study by Gousios et al. [8]. Similar to their work, we combine data from multiple sources: the GHTorrent MySQL data dump, the MongoDB data dump, as well as git repository data downloaded from GitHub. Since each of the three data sources holds data at a different level of abstraction, this can lead to some differences in outcome. We, however, went the extra mile to improve the representativeness of the dataset: we added two new programming languages and extrapolated known features (e.g., continuous integration) to a variety of small and large projects. That being said, we realize that an even more representative dataset could also include code-related metrics, which otherwise are found not important for decision making and less explored, or metrics that cannot be studied objectively.
6 RESEARCH OPPORTUNITIES
The pull-based model provides a synthesized paradigm for distributed collaboration, which has attracted global attention in recent years. In this paper, we create a comprehensive and large-scale dataset collected from 11K+ representative OSS projects on GitHub, describe the creation process and explain all features in detail. Our dataset, in addition to supporting research on pull-based development, provides new opportunities to related research fields, spanning collaborative environments (e.g., code patches and code review), software maintenance (e.g., bug prediction), software process (e.g., continuous integration and DevOps), and human factors in computing systems (e.g., developer personality). Researchers can achieve a more complete picture of distributed development through empirical studies, and even train artificial intelligence models on our carefully filtered data samples.
ACKNOWLEDGMENTS
This work is supported by Science and Technology Innovation 2030 of China (Grant No. 2018AAA0102304), the National Natural Science Foundation of China (Grant No. 61702534) and the China Scholarship Council. We thank Dr. Georgios Gousios, Rahul N. Iyer, Frenk van Mil, Celal Karakoc, Leroy Velzel, Daan Groenewegen and Sarah de Wolf for their technical help.
REFERENCES
[1] O. Baysal, O. Kononenko, R. Holmes, and M. W. Godfrey. 2012. The Secret Life of Patches: A Firefox Case Study. In 2012 19th Working Conference on Reverse Engineering. 447–455. https://doi.org/10.1109/WCRE.2012.54
[2] O. Baysal, O. Kononenko, R. Holmes, and M. W. Godfrey. 2013. The influence of non-technical factors on code review. In 2013 20th Working Conference on Reverse Engineering (WCRE). 122–131. https://doi.org/10.1109/WCRE.2013.6671287
[3] Amiangshu Bosu and Jeffrey C. Carver. 2014. Impact of Developer Reputation on Code Review Outcomes in OSS Projects: An Empirical Investigation. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (Torino, Italy) (ESEM '14). Association for Computing Machinery, New York, NY, USA, Article 33, 10 pages. https://doi.org/10.1145/2652524.2652544
[4] Fabio Calefato, Filippo Lanubile, and Nicole Novielli. 2017. A preliminary analysis on the effects of propensity to trust in distributed software development. In 2017 IEEE 12th International Conference on Global Software Engineering (ICGSE). IEEE, 56–60.
[5] Casey Casalnuovo, Bogdan Vasilescu, Premkumar Devanbu, and Vladimir Filkov. 2015. Developer onboarding in GitHub: the role of prior social links and language experience. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 817–828.
[6] Georgios Gousios. 2013. The GHTorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (San Francisco, CA, USA) (MSR '13). IEEE Press, Piscataway, NJ, USA, 233–236. http://dl.acm.org/citation.cfm?id=2487085.2487132
[7] Georgios Gousios, Martin Pinzger, and Arie van Deursen. 2014. An Exploratory Study of the Pull-Based Software Development Model. In Proceedings of the 36th International Conference on Software Engineering (Hyderabad, India) (ICSE 2014). Association for Computing Machinery, New York, NY, USA, 345–355. https://doi.org/10.1145/2568225.2568260
[8] Georgios Gousios and Andy Zaidman. 2014. A Dataset for Pull-Based Development Research. In Proceedings of the 11th Working Conference on Mining Software Repositories (Hyderabad, India) (MSR 2014). Association for Computing Machinery, New York, NY, USA, 368–371. https://doi.org/10.1145/2597073.2597122
[9] G. Gousios, A. Zaidman, M. Storey, and A. van Deursen. 2015. Work Practices and Challenges in Pull-Based Development: The Integrator's Perspective. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 358–368. https://doi.org/10.1109/ICSE.2015.55
[10] R. N. Iyer, S. A. Yun, M. Nagappan, and J. Hoey. 2019. Effects of Personality Traits on Pull Request Acceptance. IEEE Transactions on Software Engineering (2019), 1–1. https://doi.org/10.1109/TSE.2019.2960357
[11] Rahul Iyer. 2019. Effects of Personality Traits and Emotional Factors in Pull Request Acceptance. http://hdl.handle.net/10012/14952
[12] Y. Jiang, B. Adams, and D. M. German. 2013. Will my patch make it? And how fast? Case study on the Linux kernel. In 2013 10th Working Conference on Mining Software Repositories (MSR). 101–110. https://doi.org/10.1109/MSR.2013.6624016
[13] Nikhil Khadke, Ming Han Teh, and Minghan Shen. [n.d.]. Predicting Acceptance of GitHub Pull Requests. ([n.d.]).
[14] O. Kononenko, T. Rose, O. Baysal, M. Godfrey, D. Theisen, and B. de Water. 2018. Studying Pull Request Merges: A Case Study of Shopify's Active Merchant. In 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). 124–133.
[15] Rohan Padhye, Senthil Mani, and Vibha Singhal Sinha. 2014. A Study of External Community Contribution to Open-Source Projects on GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (Hyderabad, India) (MSR 2014). Association for Computing Machinery, New York, NY, USA, 332–335. https://doi.org/10.1145/2597073.2597113
[16] Gustavo Pinto, Luiz Felipe Dias, and Igor Steinmacher. 2018. Who Gets a Patch Accepted First? Comparing the Contributions of Employees and Volunteers. In Proceedings of the 11th International Workshop on Cooperative and Human Aspects of Software Engineering (Gothenburg, Sweden) (CHASE '18). Association for Computing Machinery, New York, NY, USA, 110–113. https://doi.org/10.1145/3195836.3195858
[17] Mohammad Masudur Rahman and Chanchal K. Roy. 2014. An Insight into the Pull Requests of GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (Hyderabad, India) (MSR 2014). Association for Computing Machinery, New York, NY, USA, 364–367. https://doi.org/10.1145/2597073.2597121
[18] Ayushi Rastogi. 2016. Do Biases Related to Geographical Location Influence Work-Related Decisions in GitHub?. In Proceedings of the 38th International Conference on Software Engineering Companion (Austin, Texas) (ICSE '16). Association for Computing Machinery, New York, NY, USA, 665–667. https://doi.org/10.1145/2889160.2891035
[19] Ayushi Rastogi, Nachiappan Nagappan, Georgios Gousios, and André van der Hoek. 2018. Relationship between Geographical Location and Evaluation of Developer Contributions in GitHub. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (Oulu, Finland) (ESEM '18). Association for Computing Machinery, New York, NY, USA, Article 22, 8 pages. https://doi.org/10.1145/3239235.3240504
[20] D. M. Soares, M. L. d. L. Júnior, L. Murta, and A. Plastino. 2015. Rejection Factors of Pull Requests Filed by Core Team Developers in Software Projects with High Acceptance Rates. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA). 960–965. https://doi.org/10.1109/ICMLA.2015.41
[21] Daricélio Moreira Soares, Manoel Limeira de Lima Júnior, Leonardo Murta, and Alexandre Plastino. 2015. Acceptance Factors of Pull Requests in Open-Source Projects. In Proceedings of the 30th Annual ACM Symposium on Applied Computing (Salamanca, Spain) (SAC '15). Association for Computing Machinery, New York, NY, USA, 1541–1546. https://doi.org/10.1145/2695664.2695856
[22] Y. Tao, D. Han, and S. Kim. 2014. Writing Acceptable Patches: An Empirical Study of Open Source Project Patches. In 2014 IEEE International Conference on Software Maintenance and Evolution. 271–280. https://doi.org/10.1109/ICSME.2014.49
[23] Josh Terrell, Andrew Kofink, Justin Middleton, Clarissa Rainear, Emerson Murphy-Hill, Chris Parnin, and Jon Stallings. 2017. Gender differences and bias in open source: Pull request acceptance of women versus men. PeerJ Computer Science 3 (2017), e111.
[24] Jason Tsay, Laura Dabbish, and James Herbsleb. 2014. Influence of Social and Technical Factors for Evaluating Contribution in GitHub. In Proceedings of the 36th International Conference on Software Engineering (Hyderabad, India) (ICSE 2014). Association for Computing Machinery, New York, NY, USA, 356–366. https://doi.org/10.1145/2568225.2568315
[25] Bogdan Vasilescu, Daryl Posnett, Baishakhi Ray, Mark G. J. van den Brand, Alexander Serebrenik, Premkumar Devanbu, and Vladimir Filkov. 2015. Gender and Tenure Diversity in GitHub Teams. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (Seoul, Republic of Korea) (CHI '15). Association for Computing Machinery, New York, NY, USA, 3789–3798. https://doi.org/10.1145/2702123.2702549
[26] B. Vasilescu, S. van Schuylenburg, J. Wulms, A. Serebrenik, and M. G. J. van den Brand. 2014. Continuous Integration in a Social-Coding World: Empirical Evidence from GitHub. In 2014 IEEE International Conference on Software Maintenance and Evolution. 401–405. https://doi.org/10.1109/ICSME.2014.62
[27] Bogdan Vasilescu, Yue Yu, Huaimin Wang, Premkumar Devanbu, and Vladimir Filkov. 2015. Quality and Productivity Outcomes Relating to Continuous Integration in GitHub. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (Bergamo, Italy) (ESEC/FSE 2015). Association for Computing Machinery, New York, NY, USA, 805–816. https://doi.org/10.1145/2786805.2786850
[28] Y. Yu, H. Wang, G. Yin, and C. X. Ling. 2014. Who Should Review this Pull-Request: Reviewer Recommendation to Expedite Crowd Collaboration. In 2014 21st Asia-Pacific Software Engineering Conference, Vol. 1. 335–342. https://doi.org/10.1109/APSEC.2014.57
[29] Yue Yu, Gang Yin, Tao Wang, Cheng Yang, and Huaimin Wang. 2016. Determinants of pull-based development in the context of continuous integration. Science China Information Sciences 59, 8 (2016), 080104. https://doi.org/10.1007/s11432-016-5595-8
[30] F. Zampetti, G. Bavota, G. Canfora, and M. D. Penta. 2019. A Study on the Interplay between Pull Request Review and Continuous Integration Builds. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). 38–48. https://doi.org/10.1109/SANER.2019.8667996
[31] Jiaxin Zhu, Minghui Zhou, and Audris Mockus. 2016. Effectiveness of code contribution: From patch-based to pull-request-based tools. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 871–882.