“May the Fork Be with You”:
Novel Metrics to Analyze Collaboration on GitHub
Marco Biazzini, Benoit Baudry
INRIA – Bretagne Atlantique
<name>.<surname>@inria.fr
ABSTRACT
Multi–repository software projects are becoming more and
more popular, thanks to web–based facilities such as GitHub.
Code and process metrics generally assume a single reposi-
tory must be analyzed, in order to measure the characteris-
tics of a codebase. Thus they are not apt to measure how
much relevant information is hosted in multiple repositories
contributing to the same codebase. Nor can they capture
the characteristics of such a distributed development process.
We present a set of novel metrics, based on an original
classification of commits, conceived to capture some interesting
aspects of a multi–repository development process. We
also describe an efficient way to build a data structure that
allows these metrics to be computed on a set of Git repositories.
Interesting outcomes, obtained by applying our metrics
on a large sample of projects hosted on GitHub, show the
usefulness of our contribution.
1. INTRODUCTION
We witness an impressive growth in the adoption of De-
centralized Version Control Systems (from now on DVCS),
which are in many cases preferred to centralized ones (CVCS)
because of their flexibility for handling concurrent develop-
ment and distribution of “mergeable” codebases.
The purpose of CVCSs has always been to maintain a sin-
gle authoritative codebase, while letting each developer have
only a single revision of each of its files at a time. DVCSs
are primarily meant to let the developers access, maintain
and compare various versions of the same codebase, along
with their commit histories, in a decentralized fashion. In
such a scenario, authoritative repositories (if any) are just
conventionally designated as such by the community of de-
velopers.
Current software metrics focus either on the analysis of the
code (code metrics) [1, 4, 6] or on characterizing the devel-
opment process (process metrics) [7–9]. They have merits
and pitfalls, well studied in the literature. Recent studies
focus on the impact of branching and merging on the quality
of the product [10]. They all implicitly assume a single
repository to be analyzed to measure the characteristics of a
single software product. Nowadays, this is no longer a safe
assumption.
The recent boost in the adoption of DVCSs has been
driven by public web–based aggregators that greatly facili-
tate the access to DVCS-based repositories and the interac-
tion among different copies of their codebases. Let us take
the case of GitHub. Thanks to the “fork–and–contribute”
policy of GitHub, the highly non–linear history of Git
repositories becomes publicly exposed and easily duplicable
in independent but inter–communicating copies at will.
GitHub makes public what is normally disclosed only
among developers sharing branches of their Git repositories.
An explicit (and possibly cumbersome) peer-wise synchro-
nization of local repositories is no longer needed. Anyone can
contribute to any repository by creating a personal public
fork and pushing changes to this fork. Any change may even-
tually be pulled from any other fork of the same repository,
including of course the original one.
The facilities provided by web hubs like GitHub have
been shown to have an impact on the characteristics of the
development process, both by easing the parallelization of
the tasks within a team [3] and by increasing the number of
relevant contributions coming from outsiders [11]. This fact
poses at least two open issues: (i) how to analyze commit
histories scattered in multiple repositories and (ii) how to
characterize such a distributed development process.
For the purpose of having a consistent chronological his-
tory of a project, it may be enough to only consider its
mainline. This repository is progressively updated and thus
reliably and consistently shows the evolution of the code.
But, since the other forks of a project are independent
and publicly available as well, such a choice seems too
limited in scope. It may easily discard the complexity of the
state of the software: the complete codebase of a project (or,
from a slightly different perspective, the set of all versions
of a software) on GitHub is more than what is committed
in the mainline.
A legitimate question arises: is analyzing the mainline
alone enough to fully grasp how the software is developed
by the community of all contributors? To answer this
question meaningfully, we need a way to quantify the amount
of contributions dispersed in the various forks, in order to
understand how much information one would discard by
considering only the mainline repository and ignoring the rest
of the project–related cosmos out there on GitHub.
In this paper we bring the following contributions:
• A methodology to efficiently aggregate and analyze
commit histories of GitHub forks related to the same
project.
• A classification of commits, explicitly conceived for
the analysis of DVCSs, which characterizes their
distributed development process.
• A set of novel metrics to quantify the degree of dispersion
of the contributions in a codebase which is
distributed on multiple repositories.
We then illustrate the usefulness of our metrics by
reporting outcomes obtained by mining 342 GitHub projects,
comprising a total of 3673 forks.
The paper is organized as follows. Section 2 explains
the motivations and the challenges of our work; Section 3
presents our contributions; Section 4 describes interesting
experimental outcomes obtained by computing our metrics
on a large sample of GitHub projects; Section 5 shows how
our commit classification and metrics can be used in visual
data analysis to highlight interesting features of
multi–repository projects. Finally, Section 6 presents our
conclusion.
2. THE IDENTITY OF A CODEBASE ON
GITHUB
Using GitHub as a centralized facility produces a previ-
ously unseen way of distributing the development process.
The mainline repository of a software is no longer the only
publicly available codebase.
The status of the various forked repositories of a project is
an interesting novelty: these forks are complete codebases,
but they do not represent the “official version of the code”,
which is in the mainline repository. Thus, hierarchical
structures of interconnected public forks, which ultimately
end in an official mainline, are possible.
Public open–access forks, which variably differ from their
mainline, greatly increase the chances of software diversi-
fication to occur: different variants of the same software
are publicly available as distinct codebases, which can be
independently modified. Such a phenomenon, which is
identifiable as diversification only from the standpoint of a global
look at the whole ensemble of project forks, is mostly produced
unintentionally in each fork. It starts from pure code
redundancy and then evolves towards an emergent diversity.
Evolutionarily speaking, it is possible for a forked repository
to become the mainline of a new “breed”. A developer may
freely choose which one of many forks is mainline to her,
notwithstanding what is currently designated as mainline
by the core team of developers in a project.
The challenges to be faced in analyzing a multi–repository
project have been well explained, taking the prominent case
of Git repositories [2]. One of the most serious unresolved
difficulties is that, since the commits are
dispersed in distinct repositories, it is unclear whether and to what
extent all relevant contributions can be expected to be found in
the mainline repository. Are the possibly many other forks
worth mining as well? Even once one has decided to mine
them, an efficient way to aggregate, compare
and mine information from a set of forks belonging to
the same “family” is still to be investigated.
Whenever a fork is created from an existing repository on
GitHub, its commit history is an exact duplicate of that of
the original repository. Then, the histories of the reposito-
ries may repeatedly diverge and re-converge, following possi-
bly very different evolutions. The divergence of two commit
histories may be measured by tracking those commits that
are made after the creation of a fork and whose occurrence
in the other forks, which derive from the same mainline,
presents non-trivial traits (e.g., they are not present in all
the forks of a given mainline, but only in some of them).
Let us call these interesting commits iCommits, for brevity.
By tracking iCommits we can actually measure the extent
to which the family of forks of a given project contains relevant
content which is not in the mainline of its codebase, or
which is shared only among subgroups of its community of
developers.
Thanks to our novel data structure and commit classifi-
cation, upon which our original metrics are defined, we are
able to answer the following research questions:
R.Q. 1 — Are there commits related to a given project
that are dispersed in forks other than the project mainline?
R.Q. 2 — Are there differences in the collaboration pat-
terns of multi–repository projects which we can track by
analyzing their distributed commit history?
3. METHODOLOGY AND METRICS
We propose in the following a methodology to effectively
extract from a set of forks some useful information about
their similarity in terms of commit history. We first pro-
pose a viable way to create an all–encompassing repository,
which gathers the history of all the forks pertaining to the
same project. Then we define a classification of commits
that allows us to quantify the differences among the
commit histories of the various forks. Finally, we define a
set of metrics based on our commit classification.
3.1 One umbrella to rule them all
To get to know which commits belong to each category,
we need a practical way to analyze the ensemble of all forks
of a given project. We need to know what they have in
common and how each fork differs from the mainline. Since
a software repository may have hundreds of forks, each of
which may comprise thousands of commits, a clever way of
handling the data complexity is needed.
We propose an original approach, consisting of building a
single Git repository that includes all GitHub forks of the
same project. We call it the umbrella repository. From the
operational standpoint, the procedure to build the umbrella
repository of a project P is quite straightforward:
1. Create an empty Git repository R;
2. For each fork f in the fork family of P, add f as a
git remote to R, naming it with a unique identifier;
3. For each branch b of each fork f added to R, fetch the
content of b into R.
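The three steps above map onto standard Git commands (git init, git remote add, git fetch). The following Python sketch is only an illustration of the procedure and is not GitWorks itself; the function names, the remote-naming scheme and the example URLs are assumptions introduced here.

```python
import subprocess
from pathlib import Path


def umbrella_commands(fork_urls):
    """Return the git commands that build an umbrella repository.

    fork_urls maps a unique remote name (hypothetical naming scheme) to the
    clone URL of one fork in the family.
    """
    commands = [["git", "init"]]  # step 1: empty repository R
    for name, url in sorted(fork_urls.items()):
        # Step 2: register the fork as a remote under a unique identifier.
        commands.append(["git", "remote", "add", name, url])
        # Step 3: with the default refspec, fetch retrieves every branch
        # of that remote into R.
        commands.append(["git", "fetch", name])
    return commands


def build_umbrella(path, fork_urls):
    """Run the commands inside `path` (needs git on PATH and network access)."""
    repo = Path(path)
    repo.mkdir(parents=True, exist_ok=True)
    for cmd in umbrella_commands(fork_urls):
        subprocess.run(cmd, cwd=repo, check=True)


# Placeholder URLs, for illustration only.
example = {
    "mainline": "https://github.com/example-owner/project.git",
    "fork_alice": "https://github.com/alice/project.git",
}
for cmd in umbrella_commands(example):
    print(" ".join(cmd))
```

Since Git identifies commits by content hash, a commit present in several forks is stored once in R; a command such as `git branch -r --contains <commit>` then lists the remote branches whose history includes it.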
By adding all branches of each fork as remotes in the
same Git repository, we can let Git work for us in building
the common commit history among all forks and optimize
the memory needed to store all data coming from different
repositories. The umbrella repository contains the official
mainline of the project and any other commit published in
one of its forks. Identical commits are automatically de-
tected and their presence (or absence) in the various forks is
easily traceable. Same considerations hold for information
related to branches, authors etc.
By melting together all the forks of the same project we
obtain very complex development histories organized in
directed acyclic graphs, in which all structural information
is preserved and can be matched, compared and mined
in a seamless way. In order to ease the task of data
organization and metric extraction from Git repositories, we
implemented our own toolset, called GitWorks, which is also
available on GitHub (footnote 1). It is a pure Java application,
which works on top of JGit (footnote 2). Thanks to GitWorks,
the whole procedure, from the creation of umbrella repositories
to the computation of our metrics on all projects, is completely
automated.
Our approach can be useful for different purposes. It can
be used to characterize the “official” history of the develop-
ment of a software with respect to the rest of the contribu-
tions, for instance by reporting the differences between the
mainline and the various forks. In the following, we use it
to “extensionally” characterize the state of the art of a given
codebase, across all publicly available forks at a given point
in time.
3.2 Commit classification
In order to understand if and how much the various forks
composing an umbrella repository contain iCommits, we
propose to first detect, in each fork, all commits made af-
ter the creation of the fork itself. By considering these ones
only, we discard all commits that are part of a fork since the
very moment of its creation, thus not meaningful to assess
developers’ activity.
Once we have the set of iCommits, we partition them into
the following categories:
• unique: iCommits existing in one fork only.
• vip: iCommits existing in several (but not all) forks
and in the mainline.
• u–vip: iCommits existing in the mainline and in one
other fork only.
• scattered: iCommits existing in several forks, but
not in the mainline.
• pervasive: iCommits existing in all repositories.
These categories are useful to get a glimpse of the activity
in the various forks of a project codebase. unique and scat-
tered commits are interesting in that they are proof of de-
velopment activity which is independent from the mainline
repository. vip and u–vip commits, on the other hand, are
evidence of mainline–related activity, which is distributed
in subsets of forks. pervasive commits indicate to which
extent new contributions are shared among the whole com-
munity of contributors.
With the help of some notation, we can formally define
our categories as sets of commits related to a given umbrella
repository R.
Footnote 1: See https://github.com/marbiaz/GitWorks.
Footnote 2: See http://eclipse.org/jgit.
Let $\mathcal{C}$ be the set of all commits $c_i$ belonging to $R$.
Let $\mathcal{F}$ be the set of all forks $f_i$ composing $R$. In the following,
we assume $|\mathcal{F}| > 1$.
Let $\mathcal{M}$ be the set of all commits belonging to the history of
the mainline in $R$.
Finally, let $fCount : \mathcal{C} \to \mathbb{N}$ be a function which, given
a commit $c \in \mathcal{C}$, returns the number of forks in $\mathcal{F}$ created
before $c$ and whose commit history includes $c$.
We give the following formal definitions, for a given um-
brella repository R.
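Def. 1. The set $\mathcal{U}$ of unique commits is defined as
$\mathcal{U} = \{c_i \in \mathcal{C} : fCount(c_i) = 1\}$.
Def. 2. The set $\mathcal{V}$ of vip commits is defined as
$\mathcal{V} = \{c_i \in \mathcal{C} : c_i \in \mathcal{M} \wedge |\mathcal{F}| > fCount(c_i) > 2\}$.
Def. 3. The set $\mathcal{W}$ of u–vip commits is defined as
$\mathcal{W} = \{c_i \in \mathcal{C} : c_i \in \mathcal{M} \wedge |\mathcal{F}| > fCount(c_i) \wedge fCount(c_i) = 2\}$.
Def. 4. The set $\mathcal{S}$ of scattered commits is defined as
$\mathcal{S} = \{c_i \in \mathcal{C} : c_i \notin \mathcal{M} \wedge fCount(c_i) > 1\}$.
Def. 5. The set $\mathcal{P}$ of pervasive commits is defined as
$\mathcal{P} = \{c_i \in \mathcal{C} : fCount(c_i) = |\mathcal{F}|\}$.
We are also able to give a more precise definition of the
iCommits of $R$:
Def. 6. We call iCommits the commits belonging to the
union set
$\mathcal{I} = \mathcal{U} \cup \mathcal{V} \cup \mathcal{W} \cup \mathcal{S} \cup \mathcal{P}$.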
In Section 4, we present some evidence of the occurrence
of iCommits on a sample of GitHub projects.
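As an illustration of Defs. 1 to 5, the following Python sketch partitions a toy set of commits. It is not the GitWorks implementation. It assumes that commits inherited at fork creation have already been filtered out, so that fCount reduces to the number of forks containing a commit, and it reads the definitions as counting the mainline among the forks in F (a u–vip commit, found in the mainline and one other fork, has fCount = 2). The function name and the toy data are hypothetical.

```python
from collections import Counter

CATEGORIES = ("unique", "vip", "u-vip", "scattered", "pervasive")


def classify_icommits(fork_commits, mainline):
    """Partition iCommits into the five categories of Defs. 1-5.

    fork_commits maps each fork name (mainline included) to the set of
    commit ids in its history. Assumption: commits that predate a fork's
    creation were removed upstream of this function.
    """
    n_forks = len(fork_commits)
    if n_forks < 2:
        raise ValueError("the definitions assume |F| > 1")
    fcount = Counter()
    for commits in fork_commits.values():
        fcount.update(set(commits))  # a fork counts a commit at most once
    in_mainline = set(fork_commits[mainline])
    result = {name: set() for name in CATEGORIES}
    for commit, count in fcount.items():
        if count == n_forks:
            result["pervasive"].add(commit)   # Def. 5
        elif count == 1:
            result["unique"].add(commit)      # Def. 1
        elif commit in in_mainline:
            # Defs. 2 and 3: in the mainline and in some, but not all, forks
            result["u-vip" if count == 2 else "vip"].add(commit)
        else:
            result["scattered"].add(commit)   # Def. 4
    return result


# Toy umbrella repository with four forks (commit ids are placeholders).
forks = {
    "main": {"p", "v", "w", "m"},
    "f1": {"p", "v", "w", "s"},
    "f2": {"p", "v", "s"},
    "f3": {"p", "u"},
}
categories = classify_icommits(forks, "main")
for name in CATEGORIES:
    print(name, sorted(categories[name]))
```

In the toy data, "m" and "u" are unique, "v" is vip, "w" is u–vip, "s" is scattered and "p" is pervasive, so every commit falls in exactly one category.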
3.3 Dispersion metrics
We now define some simple metrics, based on the above
given definitions, for a given umbrella repository R.
M. 1: unique-count is defined as $uc = |\mathcal{U}|$.
M. 2: unique-ratio is defined as $ur = |\mathcal{U}| / |\mathcal{I}|$.
M. 3: vip-count is defined as $vc = |\mathcal{V}|$.
M. 4: vip-ratio is defined as $vr = |\mathcal{V}| / |\mathcal{I}|$.
M. 5: u–vip-count is defined as $uvc = |\mathcal{W}|$.
M. 6: u–vip-ratio is defined as $uvr = |\mathcal{W}| / |\mathcal{I}|$.
M. 7: scattered-count is defined as $sc = |\mathcal{S}|$.
M. 8: scattered-ratio is defined as $sr = |\mathcal{S}| / |\mathcal{I}|$.
M. 9: pervasive-count is defined as $pc = |\mathcal{P}|$.
M. 10: pervasive-ratio is defined as $pr = |\mathcal{P}| / |\mathcal{I}|$.
While the -count metrics are the cardinalities of the sets
we defined, the -ratio metrics are the same cardinalities
normalized over the total number of iCommits.
These metrics allow us to quantify the extent to which the
commits of an umbrella repository are scattered among its forks.
By computing these metrics, we obtain a set of values that
synthetically describe the commit dispersion in a multi–
repository project.
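A minimal sketch of how the ten metrics could be derived once the five sets are available. The function name and the toy sets are hypothetical, and returning 0.0 for every ratio when there are no iCommits is an assumption made here, not a convention stated in the paper.

```python
def dispersion_metrics(categories):
    """Compute the -count and -ratio metrics (M. 1 to M. 10).

    categories maps a category name to its set of commits. The five sets
    are assumed to be pairwise disjoint, so |I| is the sum of their sizes.
    """
    sizes = {name: len(commits) for name, commits in categories.items()}
    total = sum(sizes.values())  # |I|
    metrics = {}
    for name, size in sizes.items():
        metrics[f"{name}-count"] = size
        # Assumption: ratios are reported as 0.0 when there are no iCommits.
        metrics[f"{name}-ratio"] = size / total if total else 0.0
    return metrics


# Toy input: six iCommits, two of them unique (ids are placeholders).
toy = {
    "unique": {"m", "u"},
    "vip": {"v"},
    "u-vip": {"w"},
    "scattered": {"s"},
    "pervasive": {"p"},
}
for key, value in dispersion_metrics(toy).items():
    print(key, round(value, 3))
```

With these toy sets, uc = 2 and |I| = 6, so ur = 2/6, while each of the other ratios equals 1/6; the five ratios always sum to 1 when |I| > 0.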
4. PRYING UNDER THE UMBRELLAS
According to FLOSSmole [5] (Free Libre OpenSource Software)
statistics, GitHub had 191765 publicly available
repositories as of May 2012. In order to obtain a statistically
representative sample of GitHub hosted projects, we sort
these projects according to the number of watchers. To
discard outliers and less significant entries, we cut off
the extremes of the range, i.e. projects whose number of
watchers is less than 2 or more than 1000. Then we select
1% of the projects in each of three subsets:
• Projects that had from 2 to 9 watchers (total: 30236;
sampled: 303)
• Projects that had from 10 to 99 watchers (total: 3554;
sampled: 36)
• Projects that had from 100 to 999 watchers (total: 286;
sampled: 3)
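The reported sample sizes are consistent with taking 1% of each subset and rounding up (303 + 36 + 3 = 342). The sketch below reproduces that arithmetic and shows one possible stratified selection; the uniform random draw, the seed and the function names are assumptions, since the selection procedure within each subset is not detailed here.

```python
import random

# Watcher ranges of the three subsets, bounds inclusive.
STRATA = ((2, 9), (10, 99), (100, 999))


def sample_size(total, percent=1):
    """Return percent% of total, rounded up, using integer arithmetic."""
    return (total * percent + 99) // 100


def stratified_sample(watchers, seed=0):
    """Pick 1% of the projects in each stratum.

    watchers maps a project name to its number of watchers. Assumption:
    projects are drawn uniformly at random within each stratum.
    """
    rng = random.Random(seed)
    sample = []
    for low, high in STRATA:
        stratum = sorted(p for p, w in watchers.items() if low <= w <= high)
        sample.extend(rng.sample(stratum, sample_size(len(stratum))))
    return sample


sizes = [sample_size(n) for n in (30236, 3554, 286)]
print(sizes, sum(sizes))  # [303, 36, 3] 342
```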
For each sampled project, we clone the mainline and all
the publicly available forks descending from it (direct forks,
forks of the forks, etc.). The resulting set of 342 umbrella
repositories, each of which has a mainline and all
“generations” of its forks, sums up to a total of 3673 Git
repositories. This is our GitHub sample. Information about the
fork family of each project, the owner and the creation time
of each fork, as well as many other metadata, can be
retrieved from GitHub via its publicly available REST API
(footnote 3). The complete list of repositories in our sample
is available online (footnote 4).
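A sketch of how a fork family (direct forks, forks of the forks, and so on) could be enumerated. It assumes the current GitHub REST API endpoint for listing forks, `GET /repos/{owner}/{repo}/forks`, and its `full_name` and `created_at` response fields; it is not the code used to build the sample, and the function names are hypothetical. The lister is injectable so that the traversal can be tried without network access.

```python
import json
import urllib.request
from collections import deque

API = "https://api.github.com/repos/{full_name}/forks?per_page=100&page={page}"


def github_list_forks(full_name):
    """Yield the direct forks of `owner/repo`, following pagination.

    Unauthenticated requests are rate limited by GitHub.
    """
    page = 1
    while True:
        url = API.format(full_name=full_name, page=page)
        with urllib.request.urlopen(url) as response:
            batch = json.load(response)
        if not batch:
            return
        yield from batch
        page += 1


def fork_family(mainline, list_forks=github_list_forks):
    """Breadth-first walk over all generations of forks of `mainline`."""
    family = []
    seen = {mainline}
    queue = deque([mainline])
    while queue:
        parent = queue.popleft()
        for fork in list_forks(parent):
            name = fork["full_name"]
            if name in seen:
                continue
            seen.add(name)
            family.append({
                "full_name": name,
                "parent": parent,
                "created_at": fork.get("created_at"),
            })
            queue.append(name)
    return family


# Offline usage with a fake lister (repository names are placeholders).
fake = {
    "owner/project": [{"full_name": "alice/project"}, {"full_name": "bob/project"}],
    "alice/project": [{"full_name": "carol/project"}],
}
for entry in fork_family("owner/project", lambda name: fake.get(name, [])):
    print(entry["full_name"], "forked from", entry["parent"])
```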
We create a single umbrella repository per project, com-
prising the mainline and all its descendants. Once we have
computed our metrics on the umbrella repositories of all
projects in our GitHub sample, we are able to see if and
how much our initial intuition is backed up by real data.
To get an overall bird's-eye view, we measure the
distribution of values for each of our -ratio metrics,
aggregating the data coming from all the repositories in
our GitHub sample. The 342 projects in the sample differ
from each other in every quantitative aspect (number of
forks, branches, commits, authors, etc.), whereas the ratios
provide values already normalized in the interval [0, 1].
By aggregating values this way, we can have a general idea
about the relative importance of each category of commits
in the umbrella repositories of our GitHub sample.
Figure 1 shows boxplots for each metric. The boxes extend
over the standard interquartile range, while the whiskers cover
up to 95% of the data points. We suppress the outliers
because they are so numerous, especially in the upper range
of the interval, that they would hinder the legibility of the
plots.
Figure 1a shows the aggregates of the metrics over the
whole sample. We see that commits shared among mainlines
and some of their forks (vip and u–vip) may often represent
a remarkable share of the iCommits. Another quite interest-
ing fact: pervasive commits are globally much less present.
Their ratio with respect to the total number of iCommits
in their umbrella repositories is often close to 0. This
may have two different causes: (i) forks are created but
not kept up–to–date with respect to the mainline and the
other forks; (ii) forks are created and then no new commit
is added to their upstream repository (the one from which
they have been forked). Clearly, to find out which case
applies, one must analyze every umbrella repository
in detail.
Footnote 3: See http://developer.github.com/.
Footnote 4: See http://people.rennes.inria.fr/Benoit.Baudry/sampe-github-projects/.
Figure 1: Distributions of commits per category: aggregates on the whole GitHub sample. (a) All repositories. (b) Omitting repositories with no contribution.
Quite surprisingly, the most represented category in our
sample is that of unique commits. This fact, whose
implications would of course require a deeper investigation, shows
that the amount of “original” development which stays
outside the mainline of a project is often quite large and thus
not to be neglected. A similar consideration holds for
scattered commits, which, although much less common, may
in some cases be fairly important (notice the long whisker
of the sr boxplot). Intuitively, the uc and sc metrics may
be useful to detect emergent diversity in a multi–repository
project, since they can point out those forks which are con-
tributing the most to the phenomenon.
As said, all the distributions have a large number of
outliers. In order to see the variability of the values among
the repositories which do have iCommits, we plot in Figure
1b the same dataset excluding the entries equal to 0.
Here we can see the fairly large variety of situations that
exist “in the wild”. Most of all, it becomes evident that
unique commits are extremely common in our sample. Their
ur boxplots in the two figures are actually identical, because
less than 3% of the umbrella repositories in our sample have
no unique commit (thus only outliers, not shown, would
differ). Finally, we see that pervasive commits, although
generally being a rare specimen, may represent, whenever
they occur, a relevant portion of the iCommits of an um-
brella repository.
While this is not a rigorous quantitative analysis of the
“composition” of the various projects in our sample, it is
enough to positively answer our first research question:
R.Q. 1 — Are there commits related to a given project
that are dispersed in forks other than the project mainline?
Answer — As measured by our dispersion metrics, there are
relevant amounts of information disseminated among various
forks of the same project, which cannot be captured by an-
alyzing the mainline repository only.
In Section 5 we see how our classification proves to be
useful in getting insights about the characteristics of the
families of forks belonging to the same project.
5. PEACOCK TAILS
The kind of analysis we propose deals with the fact that a
software codebase may be scattered among different repositories,
which may be only partially synchronized with each
other. All existing code and process metrics can be used in
order to measure interesting properties of the single forks.
But our dispersion metrics can be used to give some pre-
liminary insights about the composition of a family of forks,
which can be useful to guide further analysis towards the
more interesting ones. In order to ease the presentation and
facilitate the legibility of the aggregates computed on each
family of forks, some visualization tool can be used to pic-
torially represent our outcomes and highlight some features.
In the following we present selected pictographs, which
represent the information obtained on our GitHub sample,
for some umbrella repositories. The pictures have been
drawn with Circos (footnote 5). Given their shape and look, we
nickname them “peacock tails”. We stress that the graphic
representation in itself is not a major concern of ours, but a
simple yet very helpful way of presenting the data and
spotting some interesting features of diversely distributed
software projects.
Each pictograph represents the mainline of a project with
the subset of its forks having one or more iCommits. The
largest stripe at the bottom is the mainline repository. Then
the forks are sorted, clockwise from the left, according to
their creation timestamp. They are shown as stripes,
connected to the mainline by elongated commit–links. On the
outer rim, centered at each fork stripe, the identifiers of the
forks are reported.
The length of a fork stripe (except for the mainline) is
proportional to the number of its iCommits, excluding unique
commits. The color of each fork stripe and its commit–link is
also correlated to its length: from grey and violet for smaller
stripes, through blue and green for medium stripes, to or-
ange and then red for larger ones. Thus, while the length of
the stripes tells immediately which forks share more iCom-
mits with the mainline, the color of the stripes is useful to
quickly see which forks have a similar amount of iCommits.
Footnote 5: See http://circos.ca
Figure 2: Peacock tail example with legend.
Figure 3: Peacock tail of the PySynergy project.
For each fork (including the mainline) unique commits
are plotted as circles centered at their fork stripe. The
diameter of these circles is proportional to the uc value
of the fork, though not on the same scale as the length of the
fork stripes.
So, to sum up: stripes (except for the mainline) can be compared
with other stripes, circles can be compared with other
circles and their position around the clock tells about their age.
Figure 2 graphically explains the peacock tails’
characteristics.
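The two quantities that drive a peacock tail, the stripe length and the circle diameter of each fork, can be derived from the per-fork sets of iCommits. The sketch below is a hypothetical data-preparation step rather than the Circos input actually used for the figures; it returns raw counts, to which the plotted lengths and diameters are proportional, and it assumes that commits predating each fork have already been removed.

```python
from collections import Counter


def peacock_profile(fork_commits):
    """Return, per fork, the counts behind its stripe and its circle.

    "stripe" is the number of the fork's iCommits that are not unique,
    "circle" is the fork's uc value (iCommits found in that fork only).
    fork_commits maps each fork name to its set of iCommit ids.
    """
    fcount = Counter()
    for commits in fork_commits.values():
        fcount.update(set(commits))  # a fork counts a commit at most once
    profile = {}
    for fork, commits in fork_commits.items():
        unique = sum(1 for c in set(commits) if fcount[c] == 1)
        profile[fork] = {"stripe": len(set(commits)) - unique, "circle": unique}
    return profile


# Toy umbrella repository (commit ids are placeholders).
forks = {
    "main": {"p", "v", "w", "m"},
    "f1": {"p", "v", "w", "s"},
    "f2": {"p", "v", "s"},
    "f3": {"p", "u"},
}
for fork, values in peacock_profile(forks).items():
    print(fork, values)
```

In the toy data, f1 has the longest stripe (4 shared iCommits, no unique ones), while f3 has a short stripe (1) and a circle of size 1.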
Visualizing umbrella repositories as peacock tails allows
us to observe different collaboration models. We give a few
examples in the following.
“Chick” collaboration model — In the chick model, the
mainline is forked only a few times. The PySynergy project in
Figure 3 is an example of this model.
“Seabirds” collaboration model — In the seabird model,
the amount of iCommits which link forks and mainline is
balanced: several forks are equally involved in the distributed
development. The MailCore project in Figure 4 is an
example of that model.
The peacock visualization highlights the balance via the
colors of the stripes and the commit–links: several of them have
similar colors, indicating an equivalent amount of iCommits
shared with the main fork.
Figure 4: Peacock tail of the MailCore project.
Figure 5: Peacock tail of the zamboni project.
“Goose” collaboration model — In the goose model, forks
differ more from each other in their activity: some forks
are heavily involved, while others very little. The zamboni
project in Figure 5 is an example of that model.
The peacock visualization highlights the lack of balance: we
can recognize four groups of commit–links, by grouping them
according to their color, and a fairly large number of forks
with very few iCommits.
“Galapagos” Effect — The Galapagos model emphasizes
the presence of some forks that have many unique commits.
Our intuition is that this fact indicates a “speciation” inside
a fork, probably one or several branches that are used to
develop alternative solutions that are not shared with the
other forks.
The pyrocms project in Figure 6 is an example in which the
mainline has a very high uc value (the thin orange circle
traversing the plot is actually the uc circle centered at the
mainline).
An interesting feature highlighted by these pictographs is
the relation between the amount of iCommits and the age
of the fork. It is a common finding that older forks are the
most contributing, but it is not always the case. The uc
values, instead, do not show correlations with the age of the
forks or their amount of iCommits.
This quick and intuitive look at multi–repository projects
can be very helpful to study the composition of the various
forks and for a preliminary screening, in order to identify
interesting forks that are worth investigating.
We can thus answer our second research question
positively:
R.Q. 2 — Are there differences in the collaboration pat-
terns of multi–repository projects which we can track by
analyzing their distributed commit history?
Answer — A characterization of multi–repository projects
based on our dispersion metrics reveals that indeed collabo-
ration patterns may differ significantly in different projects.
The visual analysis we briefly discuss here can help decide
the initial directions for a deeper investigation.
6. CONCLUSION
The widespread adoption of decentralized versioning
systems and the advent of web–based aggregators have caused
a substantial increase in multi–repository projects. These
projects are characterized by the fact that their complete
codebase is scattered among distinct and possibly
unsynchronized repositories. Existing metrics are not able to
capture such a distributed development scenario.
This paper presents novel tools to tackle the analysis of
projects whose codebase is distributed among several forks
on GitHub.
We describe a methodology to efficiently aggregate and
analyze commit histories of GitHub forks related to the
same project. We propose a classification of commits, which
characterizes a distributed development process that is typ-
ical of DVCSs. We define a set of novel metrics to quan-
tify the degree of dispersion of the overall contributions in a
multi–repository project. We finally report aggregate statis-
tics, measured on a sample of thousands of GitHub reposi-
tories, which show that our metrics shed some light on novel
interesting aspects of the software development process in
multi–repository projects.
Our future work will deal with a limitation of our approach.
Aggregating commits in an umbrella repository results in a
complex commit history, which brings under the spotlight
“spatial” information about the development process, but
does not capture its temporal evolution per se. Nonetheless,
we think our methodology can be fruitfully improved, along
with some data mining on GitHub logs (pull requests etc.),
by taking “snapshots” of the state of a distributed codebase
at regular time intervals, building temporal sequences of um-
brella repositories per project.
Figure 6: Peacock tail of the pyrocms project.
This way, our metrics could be used to track the evolu-
tion of project forks over time. Analogously to what has
been found for branching strategies, the long term goal is to
identify distributed development patterns which affect soft-
ware quality.
On a different track, we plan to exploit our metrics in
order to devise a measure of emergent software diversity.
7. REFERENCES
[1] V. R. Basili, L. C. Briand, and W. L. Melo. A
validation of object-oriented design metrics as quality
indicators. Software Engineering, IEEE Transactions
on, 22(10):751–761, 1996.
[2] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton,
D. M. German, and P. Devanbu. The promises and
perils of mining git. In Mining Software Repositories,
2009. MSR’09. 6th IEEE International Working
Conference on, pages 1–10. IEEE, 2009.
[3] L. Dabbish, C. Stuart, J. Tsay, and J. Herbsleb. Social
coding in github: transparency and collaboration in an
open software repository. In Proceedings of the ACM
2012 conference on Computer Supported Cooperative
Work, pages 1277–1286. ACM, 2012.
[4] S. Demeyer, S. Ducasse, and O. Nierstrasz. Finding
refactorings via change metrics. In ACM SIGPLAN
Notices, volume 35/10, pages 166–177. ACM, 2000.
[5] J. Howison, M. Conklin, and K. Crowston. Flossmole:
A collaborative repository for floss research data and
analyses. International Journal of Information
Technology and Web Engineering, 1:17–26, 2006.
[6] R. Moser, W. Pedrycz, and G. Succi. A comparative
analysis of the efficiency of change metrics and static
code attributes for defect prediction. In Software
Engineering, 2008. ICSE’08. ACM/IEEE 30th
International Conference on, pages 181–190. IEEE,
2008.
[7] D. Posnett, R. D’Souza, P. Devanbu, and V. Filkov.
Dual ecological measures of focus in software
development. In Proceedings of the 2013 International
Conference on Software Engineering, pages 452–461.
IEEE Press, 2013.
[8] F. Rahman and P. Devanbu. Ownership, experience
and defects: a fine-grained study of authorship. In
Proceedings of the 33rd International Conference on
Software Engineering, pages 491–500. ACM, 2011.
[9] C. Rodriguez-Bustos and J. Aponte. How distributed
version control systems impact open source software
projects. In Mining Software Repositories (MSR),
2012 9th IEEE Working Conference on, pages 36–39.
IEEE, 2012.
[10] E. Shihab, C. Bird, and T. Zimmermann. The effect of
branching strategies on software quality. In
Proceedings of the ACM-IEEE international
symposium on Empirical software engineering and
measurement, pages 301–310. ACM, 2012.
[11] F. Thung, T. Bissyande, D. Lo, and L. Jiang. Network
structure of social coding in github. In Software
Maintenance and Reengineering (CSMR), 2013 17th
European Conference on, pages 323–326, March 2013.