Identifying Key Developers using Artifact Traceability Graphs


Abstract

Developers are the most important resource to build and maintain software projects. Due to various reasons, some developers take more responsibility, and this type of developers are more valuable and indispensable for the project. Without them, the success of the project would be at risk. We use the term key developers for these essential and valuable developers, and identifying them is a crucial task for managerial decisions such as risk assessment for potential developer resignations. We study key developers under three categories: jacks, mavens and connectors. A typical jack (of all trades) has a broad knowledge of the project, they are familiar with different parts of the source code, whereas mavens represent the developers who are the sole experts in specific parts of the projects. Connectors are the developers who involve different groups of developers or teams. They are like bridges between teams. To identify key developers in a software project, we propose to use traceable links among software artifacts such as the links between change sets and files. First, we build an artifact traceability graph, then we define various metrics to find key developers. We conduct experiments on three open source projects: Hadoop, Hive and Pig. To validate our approach, we use developer comments in issue tracking systems and demonstrate that the identified key developers by our approach match the top commenters up to 92%. CCS CONCEPTS • Software and its engineering → Programming teams.
H. Alperen Çetin
Bilkent University
Ankara, Turkey
Eray Tüzün
Bilkent University
Ankara, Turkey
Keywords: key developers, most valuable developers, developer categories,
social networks, artifact traceability graphs, jack, maven, connector
ACM Reference Format:
H. Alperen Çetin and Eray Tüzün. 2020. Identifying Key Developers using
Artifact Traceability Graphs. In Proceedings of the 16th ACM International
Conference on Predictive Models and Data Analytics in Software Engineering
(PROMISE ’20), November 8–9, 2020, Virtual, USA. ACM, New York, NY, USA,
10 pages.
PROMISE ’20, November 8–9, 2020, Virtual, USA
©2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8127-7/20/11. . . $15.00
1 Introduction
Software development mainly depends on human effort. In a project,
some developers take more responsibility, and the success rate of
the project heavily depends on these developers. Thus, they are
valuable and essential to develop and maintain the project, in other
words, they are the key developers of the project.
Developers leave and join projects for numerous reasons, such
as transferring to another project in the same company or leaving
a company to work in another one. When a developer position
is opened, it is filled by another developer in time. This is also
known as developer turnover, which is a common phenomenon
in software development and a critical risk for software projects
[ ]. It is more critical when the key developers leave the project.
Therefore, identifying the valuable and indispensable developers is
a vital and challenging task for project management. Developers
can be valuable for a project in many different ways. All developers
contribute to the project in some way or another. For instance, a
developer may know a specific module very well, while another
knows a little about multiple modules. In our study, similar
to our previous work [ ], we examine key developers under three
categories: jacks, mavens and connectors.
Our motivation for this categorization comes from The Tipping
Point by Gladwell [ ]. The book discusses the reasons behind
word-of-mouth epidemics. In the chapter The Law of the Few, the author
argues that three kinds of people are responsible for tipping ideas:
connector, salesman and maven. Connectors have connections to
different social groups, and they allow ideas to spread between these
groups. Salesmen have a charisma that allows them to persuade
people and change their decisions. Mavens have a great knowledge
of specific topics and thus help people to make informed decisions.
Since the traceable links among software artifacts resemble the
connections among people in real life, we propose to use a similar
categorization, connector and maven, as described in the book to
find the key developers in a software project. A typical connector
represents a developer who is involved in different (sub)projects
or different groups of developers. Connecting divergent groups or
(sub)projects increases this type of developers’ significance because
they connect the developers who are not in the same group (i.e.
team), and touching different parts of a project means collective
knowledge from different aspects of the project. The maven category
represents the developers who are masters in the details of specific
modules or files in the project. Being the rare experts of specific
parts of the source code makes these developers difficult to replace.
Jacks (of all trades) are the developers who have a broad knowledge
of the project. They use or modify files from different parts
of the project. Here, the jack and connector definitions may overlap
since both define key developers who touch different
parts of the project. To make it clearer, the jack category
focuses purely on knowledge, whereas the connector category focuses on
connecting developers. The name "jack" comes from a figure of speech,
jack of all trades, used to define people “who can do passable work at
various tasks”. For the developers who have broad knowledge of
projects, we use jack to recall this phrase.
To discover these three types of key developers, in this study,
we address the following research questions (RQs):
RQ 1: How can we identify key developers in a software project?
RQ 1.1: How can we identify jacks in a software project?
RQ 1.2: How can we identify mavens in a software project?
RQ 1.3: How can we identify connectors in a software project?
In the following section, we share the related work. In Section
3, we explain our methodology addressing the research questions.
In Section 4, we share the details of the datasets and the important
points of the preprocessing. In Section 5, we perform case studies
in three different open source software (OSS) projects. In Section
6, we discuss the threats to validity of our study. In Section 7, we
present our conclusions and possible future works.
2 Related Work
In the literature, there are many studies on truck factor, developer
roles and social developer networks. In the following, we mention
them under separate sections.
2.1 Truck Factor
Truck factor (a.k.a. bus factor) is the answer to the following
question: What is the minimum number of developers who have to leave
the project before the project becomes incapacitated and has serious
problems? To address this problem, Avelino et al. [ ] associated files
to authors by using the degree of authorship [ ], then they found
the minimum number of developers whose total file coverage is
more than 50% of all files. Cosentino et al. [ ] measured developers’
knowledge on artifacts (e.g. files, directories and project itself) with
different metrics such as "last change takes it all" and "multiple
changes equally considered". They defined primary and secondary
developers for the artifacts and proposed that the project will have
problems with the artifact if all primary and secondary developers
leave the project. Rigby et al. [ ] studied a model on file
abandonment. In their study, the author of a line is assigned by using git
blame, and a file is abandoned when the authors of 90% of its lines
left the project. They proposed to remove developers randomly
until a specific amount of file loss occurs, and use the number of
removed developers as the truck factor at that point.
Moreover, some researchers published empirical studies on
existing truck factor algorithms. Avelino et al. [ ] investigated
abandoned OSS projects. In their definition, a project is abandoned when
all truck factor developers leave. Ferreira et al. [ ] performed a
comparative study on truck factor algorithms and discussed them
comprehensively from many different viewpoints, such as
the accuracy of the reported results in the studies and the reasons
why the truck factor algorithms fail in some circumstances.
2.2 Developer Roles and Social Networks
A number of studies have examined developer types
from different perspectives. Kosti et al. [ ] investigated archetypal
personalities of software engineers. They chose extraversion and
conscientiousness as their main criteria and focused on the binary
combinations of them. Cheng and Guo [ ] made an activity-based
analysis of OSS contributors, then adopted a data-driven approach
to find out the dynamics and roles of the contributors. Bella, Sillitti
and Succi [ ] classified OSS contributors as core, active, occasional
and rare developers. Also, there are studies that examined the core
and periphery [ ], hero [ ], and key developers [ ] in OSS
projects. Likewise, Zhou and Mockus [ ] claimed that Long Term
Contributors (LTCs) are valuable for projects. They all have similar
definitions, and in Section 5.3, we further discuss these studies by
comparing them with our study.
Besides developer types, researchers worked on communication
networks of developers. Wu and Goh [ ] studied the long term
effects of communication patterns on success and performed
experiments on how graph centrality, graph density and leadership
centrality affect the success of OSS projects. Also, Kakimoto et al.
[17] worked on knowledge collaboration through communication
tools. They applied social network analysis to 4 OSS communities,
and partially verified their hypothesis, which claims "Communications
are actively encouraged before/after OSS released, especially
among community members with a variety of roles but not among
particular members" [ ]. Moreover, Allaho and Lee [ ] conducted
a social network (SN) analysis on OSS projects and found that OSS
SNs follow a power-law distribution, which means a small number
of developers dominate the projects.
3 Methodology
To identify the key developers in a project, first we need to construct
an artifact traceability graph as described in Section 3.1. Afterwards,
from Section 3.2 to Section 3.4, we will explain our methodology
for each corresponding research question.
3.1 Artifact Traceability Graph
Artifact traceability graphs include software artifacts and the con-
nections between them. We denote nodes for software artifacts,
which are developers, change sets (e.g. commits in Git), source files
and issues. Then, we denote undirected edges for the relations (e.g.
commit, review, include and linked) between those artifacts. For
example, we add an edge for a commit relation between the developer
node and the change set node if the developer is the author
of the change set. The edges are undirected because reaching from
one change set to another should be possible over the edges if they
include the same file. Developers commit or review change sets.
Change sets include a set of source files. Issues can be linked to a
set of change sets and vice versa.
In the graph, we denote distances for each edge. Distances of
the edges between developers and change sets are always zero
(0) because these connections are there in order to keep track of
who made commits and reviews. Other than commit and review
cases, edge distances are calculated by using the recency of the
bound change set. Our distance metric is inversely proportional to
the recency of the change set. Recency and distance metrics are
calculated as follows:
$$\mathit{Recency} = 1 - \frac{\text{\# of days passed}}{\text{\# of days included to the graph}} \tag{1}$$

$$\mathit{Distance} = \frac{1}{\mathit{Recency}} \tag{2}$$
Figure 1 shows a sample traceability graph, where the graph
includes 300 days of the project history, and the numbers in the
parentheses are the days on which the commits were made. For example,
CS3 was committed on the 90th day (i.e. 210 days ago). All the edges
of CS3 have the same distance, which is calculated as follows.

$$\mathit{Distance} = \frac{1}{\mathit{Recency}} = \frac{1}{0.30} = 3.33$$
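As a minimal sketch of Equations (1) and (2), the following Python snippet recomputes the CS3 example. The function and parameter names are illustrative assumptions and do not come from the original study.

```python
def recency(days_passed: int, days_in_graph: int) -> float:
    """Equation (1): 1 - (days passed / days included in the graph)."""
    return 1 - days_passed / days_in_graph


def distance(days_passed: int, days_in_graph: int) -> float:
    """Equation (2): reciprocal of recency.

    Note: a change set that is exactly days_in_graph days old has a
    recency of 0, so this sketch would raise ZeroDivisionError for it.
    """
    return 1 / recency(days_passed, days_in_graph)


# CS3 in the text: committed on day 90 of a 300-day graph, i.e. 210 days ago.
cs3_distance = distance(days_passed=210, days_in_graph=300)
print(round(cs3_distance, 2))
```

Older change sets get a smaller recency and therefore a larger distance, which is what later makes them harder to reach under a distance threshold.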
3.2 Jacks (RQ 1.1)
To nd Jacks in a software project, we analyze the general knowl-
edge of the developers on the project. By looking at the history
of the project from its version control data, we can say that the
source les keep the knowledge, in other words, the know-how
of the project. There are studies to nd the authors of source les
(e.g. degree of authorship [
]). Authorship is not only about being
the rst author of the le but also about changing the source les
in time depending on the recency of the change. In our study, we
dene reachability similar to this denition. If a developer can reach
a le, s/he knows that le. Also, multiple developers can reach the
same le at the same time. In the following, we explain how we
nd reachable les and le coverage for each developer.
3.2.1 Finding Reachable Files: We define reachable files of a
developer as the files that are reached by the developer through the
connections in the artifact traceability graph. For example, in Figure
1, the D2 node can reach every file node in the graph through change
sets, issues and other developers if there is no distance limit (i.e. a
limit for the sum of distances on the edges in the graph). Actually,
every developer can reach every file if the graph is connected and
there is no distance limit. In that case, every developer would know
every file, and we could not distinguish which developers know
which source files. Also, assuming that all edges have the same
importance might be problematic since some edges represent recent
commits and reviews while others represent older ones. Our
recency and distance definitions are utilized here to distinguish these
types of situations. For example, the distance between CS3 and F4
is 3.33 while the distance between CS4 and F5 is 1.67, and there are
around three months between the commit times of CS3 and CS4.
Therefore, to handle these situations, we define the following rules:
• We need to set a threshold for distance while reaching from
  one node to another. For example, D2 cannot reach F5 if the
  threshold is 5 because the total distance of the path from D2
  to F5 is 6.67, which is beyond the threshold of 5.
• One developer cannot reach files through other developers
  because it would transfer reachable files of a developer to
  another developer if the distance threshold is large enough.
  For example, in Figure 1, D2 cannot reach F1 through D1,
  even if the distance threshold is 308.33 or more.
Distance threshold is a parameter, and it depends on the distance
formula given in Equation (2). Due to its nature, distance goes to
infinity when recency goes to zero. In the graph, the oldest relations
are the relations from the first day, and their edges have the highest
distance. In this case, their distance is calculated as follows:
$$\mathit{Distance} = \frac{1}{\mathit{Recency}} = \frac{1}{1 - \frac{299}{300}} = 300$$

Figure 1: Sample artifact traceability graph. Some nodes and
edges are highlighted to illustrate how the reachable files by
D2 are found. (D: Developer, F: File, CS: Change Set, I: Issue)
Therefore, we need to set our threshold to 300 if we want to use
every direct relation in the graph. Since almost all recently changed
files are reachable by the developers who have recent commits in
that case, using 300 as threshold would not enable us to distinguish
which developers know which files.
We follow a simple way while deciding the distance threshold.
In a 300-day graph, the edges with 0.1 or less recency belong to the
change sets committed in the first 30 days. The rest of the graph
corresponds to 90% of the time covered in the graph. Therefore,
we can set the distance threshold to 10, which allows us to use all
direct relations from the last 90% of the days in the graph. Also, if
we set it to 5, 80% of the time would be covered. After this point, we
continue with 10 as the distance threshold unless otherwise stated.
$$\mathit{Distance} = \frac{1}{\mathit{Recency}} = \frac{1}{0.1} = 10$$
Figure 1 shows how reachable files for D2 are found in the sample
graph. The highlighted files (F3, F4, F5) are reachable by D2. While
finding these reachable files, we run a depth first search (DFS)
algorithm starting from D2 with a stopping condition for reaching
the distance threshold. The highlighted edges show the visited edges
when DFS is started from the D2 node. Also, the DFS algorithm does
not go through another developer node. For example, the algorithm
stops when it encounters the node of D1. Algorithm 1
shows the pseudo code for finding reachable files for each developer.
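The two rules (a distance threshold and no traversal through other developers) can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy graph is hypothetical and is not Figure 1, node types are inferred from name prefixes, and the search is a stack-based DFS variant that re-expands a node when it is reached over a shorter path, which the paper does not specify.

```python
from collections import defaultdict

# Hypothetical toy graph (not Figure 1): (node_a, node_b, distance) triples.
EDGES = [
    ("D1", "CS2", 0.0),  # developer-to-change-set edges always have distance 0
    ("D2", "CS1", 0.0),
    ("CS1", "F1", 3.0),  # older change set, larger distance
    ("CS2", "F1", 2.0),
    ("CS2", "F2", 2.0),
]


def kind(node):
    """Node type from its name prefix (an assumption of this sketch)."""
    if node.startswith("CS"):
        return "change_set"
    return {"D": "developer", "F": "file", "I": "issue"}[node[0]]


def build_graph(edges):
    """Undirected adjacency list: node -> [(neighbor, distance), ...]."""
    graph = defaultdict(list)
    for a, b, dist in edges:
        graph[a].append((b, dist))
        graph[b].append((a, dist))
    return graph


def reachable_files(graph, dev, threshold):
    """Files whose cheapest path from dev costs at most threshold."""
    best = {dev: 0.0}          # cheapest known distance to each node
    stack = [(dev, 0.0)]
    files = set()
    while stack:
        node, dist = stack.pop()
        if dist > best[node]:
            continue           # a cheaper path to this node was found later
        for neighbor, edge_dist in graph[node]:
            if kind(neighbor) == "developer":
                continue       # never walk through another developer
            new_dist = dist + edge_dist
            if new_dist > threshold:
                continue       # stopping condition: beyond the threshold
            if new_dist >= best.get(neighbor, float("inf")):
                continue       # no improvement over a known path
            best[neighbor] = new_dist
            if kind(neighbor) == "file":
                files.add(neighbor)
            stack.append((neighbor, new_dist))
    return files


graph = build_graph(EDGES)
print(reachable_files(graph, "D2", threshold=5))   # F2 is 7.0 away from D2
print(reachable_files(graph, "D2", threshold=10))
```

In this toy graph D2 reaches F2 only over the path CS1, F1, CS2, whose total distance is 7.0, so F2 is reachable with a threshold of 10 but not with 5.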
3.2.2 Identifying Jacks: While finding jacks, we sort the developers
in descending order according to their file coverage in the software
project. File coverage is simply the ratio of the number of reachable
files by the developer to the number of all files in the project, not
just currently available files in the graph. Equation 3 shows the file
coverage of some developer d.

$$\mathit{FileCoverage}_d = \frac{\text{\# of reachable files by } d}{\text{\# of all files in the project}} \tag{3}$$
Algorithm 1 Finding Reachable Files
function DevToFiles(graph, threshold)
    devs ← GetDevelopers(graph)                  ▷ list
    devToReachableFiles ← HashMap()              ▷ string to list
    for dev in devs do
        reachableFiles ← DFS(graph, dev, threshold)
        devToReachableFiles.put(dev, reachableFiles)
    return devToReachableFiles
Algorithm 2 Finding Jacks
function FindJacks(graph, threshold)
    devToFiles ← DevToFiles(graph, threshold)
    devToCoverage ← HashMap()                    ▷ string to float
    numFiles ← GetNumFiles(graph)
    for dev in devToFiles.keys() do
        numDevFiles ← devToFiles.get(dev).length()
        fileCoverage ← numDevFiles / numFiles
        devToCoverage.put(dev, fileCoverage)
    return SortByValue(devToCoverage)
Table 1: Reachable files and file coverages for each developer
in the sample artifact traceability graph

Developer   Reachable Files   File Coverage
D1          F2 and F3         40%
D2          F3, F4 and F5     60%
D3          F3, F4 and F5     60%
Table 1 shows the reachable files and file coverages for each
developer in the sample artifact traceability graph given in Figure 1.
Algorithm 2 shows the pseudo code of finding jacks. First, it finds
reachable files for each developer, then calculates file coverage
scores for developers. Finally, it returns developers in descending
order according to their file coverage scores.
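The ranking step can be sketched in a few lines of Python. The reachable-file sets below are the ones listed in Table 1, and the total of five files follows from the coverage values in that table; the function name `find_jacks` is an assumption made for this illustration.

```python
def find_jacks(dev_to_files, num_files):
    """Rank developers by file coverage (Equation 3), highest first."""
    coverage = {
        dev: len(files) / num_files
        for dev, files in dev_to_files.items()
    }
    return sorted(coverage.items(), key=lambda item: item[1], reverse=True)


# Reachable files from Table 1; the sample project has five files (F1 to F5).
dev_to_files = {
    "D1": {"F2", "F3"},
    "D2": {"F3", "F4", "F5"},
    "D3": {"F3", "F4", "F5"},
}

ranking = find_jacks(dev_to_files, num_files=5)
for dev, score in ranking:
    print(f"{dev}: {score:.0%}")
```

D2 and D3 tie at 60% and D1 follows at 40%, matching Table 1.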
3.3 Mavens (RQ 1.2)
By denition, mavens are the rare experts of specic parts, les
or modules of the project. As we stated in Section 3.2, the source
les in a software project are the reection of the knowledge (i.e.
know-how). Since mavens are the rare expert developers on specic
parts, they have knowledge that the others do not have. Thus, we
need to nd lesser-known parts of the project.
3.3.1 Rarely Reached Files: First, reaching a le through the edges
in the artifact graph means knowing the le. To meet the maven
denition, we can use the les only reached by a limited number
of developers. We call this type of les rarely reached les, and we
set this limit to 1, which means that the les reached by only one
developer are the rarely reached les. This could be a congurable
parameter according to the size of the project. For example, for the
graph given in Figure 1, F2 is a rarely reached le. Actually it is the
only one as it can be seen in Table 1.
Algorithm 3 shows how to find rarely reached files. It is assumed
that devToRareFiles is initialized with developer names and empty
lists. Also, the InvertMapping function generates a mapping from
values to keys. For instance, it inverts the hashmap
{D1 : [F1], D2 : [F1, F2]} to the hashmap {F1 : [D1, D2], F2 : [D2]}.

Algorithm 3 Finding Rarely Reached Files
function DevToRareFiles(graph, threshold)
    devToFiles ← DevToFiles(graph, threshold)
    fileToDevs ← InvertMapping(devToFiles)
    devToRareFiles ← HashMap()                   ▷ string to list
    for file in fileToDevs.keys() do
        devs ← fileToDevs.get(file)
        if devs.length() is 1 then
            devToRareFiles.get(devs.get(0)).append(file)
    return devToRareFiles

Algorithm 4 Finding Mavens
function FindMavens(graph, threshold)
    devToRareFiles ← DevToRareFiles(graph, threshold)
    devToMavenness ← HashMap()                   ▷ string to float
    numRareFiles ← GetNumRareFiles(graph)
    for dev in devToRareFiles.keys() do
        numDevFiles ← devToRareFiles.get(dev).length()
        mavenness ← numDevFiles / numRareFiles
        devToMavenness.put(dev, mavenness)
    return SortByValue(devToMavenness)
3.3.2 Identifying Mavens: To find mavens, we consider the number
of the rarely reached files of the developers. For a better comparison
among developers, we define mavenness of a developer d
as follows:

$$\mathit{Mavenness}_d = \frac{\text{\# of rarely reached files of } d}{\text{\# of all rarely reached files}} \tag{4}$$
While finding mavens, first we find reachable files as explained
in Section 3.2.1, then we find rarely reached files as explained in
Section 3.3.1 and given in Algorithm 3, finally we calculate mavenness
scores and sort the developers according to them in descending
order. Algorithm 4 shows the procedure.
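A compact Python sketch of the same procedure is given below, using the reachable files from Table 1 as input. The function names and the handling of the "no rarely reached files" case (returning 0.0 for everyone) are assumptions of this sketch.

```python
def rarely_reached_files(dev_to_files):
    """Map each developer to the files that only he or she can reach."""
    # Invert developer -> files into file -> developers.
    file_to_devs = {}
    for dev, files in dev_to_files.items():
        for file in files:
            file_to_devs.setdefault(file, []).append(dev)

    dev_to_rare = {dev: [] for dev in dev_to_files}
    for file, devs in file_to_devs.items():
        if len(devs) == 1:  # the limit of 1 used in the text
            dev_to_rare[devs[0]].append(file)
    return dev_to_rare


def find_mavens(dev_to_files):
    """Rank developers by mavenness (Equation 4), highest first."""
    dev_to_rare = rarely_reached_files(dev_to_files)
    total_rare = sum(len(files) for files in dev_to_rare.values())
    mavenness = {
        dev: (len(files) / total_rare if total_rare else 0.0)
        for dev, files in dev_to_rare.items()
    }
    return sorted(mavenness.items(), key=lambda item: item[1], reverse=True)


# Reachable files from Table 1.
dev_to_files = {
    "D1": {"F2", "F3"},
    "D2": {"F3", "F4", "F5"},
    "D3": {"F3", "F4", "F5"},
}

print(rarely_reached_files(dev_to_files))
print(find_mavens(dev_to_files))
```

F2 is the only file reached by a single developer, so D1 gets a mavenness of 1.0 and the other two developers get 0.0.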
3.4 Connectors (RQ 1.3)
Connectors are the developers who are involved in different
subprojects or teams. The main idea behind the connector definition
is connecting developers who have no other connections, in other
words, being the bridge between different groups of developers.
Using node centrality, we identify this type of developers on artifact
traceability graphs defined in Section 3.1.
3.4.1 Calculating Betweenness Centrality: Betweenness centrality
of a node is based on the number of shortest paths passing through
that node. Freeman [ ] discussed that betweenness centrality is
related to control of communication. Also, Bird et al. [ ] used
betweenness centrality to find the gatekeepers in the social networks
of mail correspondents. Therefore, we hypothesize that betweenness
centrality can be a measure to find connectors. Betweenness
centrality of some node v is calculated as follows, where V is the set
of nodes and s and t are some nodes other than v:

$$\mathit{Betweenness}(v) = \sum_{s,t \in V} \frac{\text{\# of shortest } (s,t)\text{-paths passing through } v}{\text{\# of shortest } (s,t)\text{-paths}} \tag{5}$$

For a better comparison among developers, betweenness values
are normalized with 2/((n − 1)(n − 2)), where n is the number of nodes
in the graph. For betweenness centrality related operations, we
use the NetworkX package, which uses the faster betweenness centrality
algorithm of Brandes [6].
To use betweenness centrality, we need a graph composed of
only developers because we are looking for developers who connect
other developers to each other. Sulun et al. [ ] proposed a metric,
know-about, to find how much developers know the files. They
found different paths between the files and the developers in the
artifact graph, and defined know-about as the summation of the
reciprocals of the path lengths. Similarly, we propose to use different
paths between developers to find how much they are connected in
the artifact graph. The next section explains the details.
3.4.2 Constructing the Developer Graph: Developer graph is a
projection of the artifact traceability graph. It defines distances directly
between developers in a different way, not as we mentioned in
Section 3.1. When projecting an artifact graph to a developer graph,
we find all different paths between each developer pair with a depth
limit of 4. Since the connector definition is not about knowing the files
but about connecting the other developers, recency is not a concern
and it is assumed that all edges in the artifact graph have the same
distance of 1. Thus, depth limit 4 means that the maximum path
length can be 4. Therefore, in the traceability graph, two developers
can be connected through the paths with length of 2 (through a
change set node), or the paths with length of 4 (through 2 different
change set nodes and a file node connected to them). These
kinds of paths can be seen in the sample graph in Figure 2. We find
the paths between two developers through software artifacts, not
through other developers. For example, in Figure 2, there is a path
between D2 and D3 through D1, but we interpret this path as the
combination of two paths: D1-D2 and D1-D3. Since the developers
in the same team potentially work on the same group of files
and these files will be close to each other (they will be connected
through change sets because they will be changed by the same
group frequently) in the traceability graph, the method mentioned
above finds the paths between the developers in the same group.
So, the developers who have connections in different groups will
be favored in betweenness centrality calculations.
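The path search described above can be sketched as a depth-limited DFS over unit-length edges that stops whenever it meets another developer. The toy graph below is hypothetical (it is not Figure 2), and identifying developers by a "D" name prefix is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical artifact graph; every edge has length 1 in this projection.
EDGES = [
    ("D1", "CS1"), ("D2", "CS1"),   # e.g. D1 commits and D2 reviews CS1
    ("D1", "CS2"), ("CS2", "F1"),
    ("D2", "CS3"), ("CS3", "F1"),
    ("D3", "CS4"), ("CS4", "F1"),
]


def build_graph(edges):
    """Undirected adjacency list: node -> [neighbor, ...]."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
        graph[b].append(a)
    return graph


def is_developer(node):
    """Developers are the nodes named D<number> (assumption of this sketch)."""
    return node.startswith("D")


def path_lengths(graph, start, max_depth=4):
    """Lengths of all simple paths from start to every other developer.

    Paths go through artifacts only and have at most max_depth edges.
    """
    lengths = defaultdict(list)

    def dfs(node, visited, depth):
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            if is_developer(neighbor):
                # Record the path and stop: never continue through a developer.
                lengths[neighbor].append(depth + 1)
            elif depth + 1 < max_depth:
                dfs(neighbor, visited | {neighbor}, depth + 1)

    dfs(start, {start}, 0)
    return dict(lengths)


graph = build_graph(EDGES)
print(path_lengths(graph, "D1"))
```

Here D1 and D2 are connected by one path of length 2 (through CS1) and one of length 4 (through CS2, F1 and CS3), while D1 and D3 share only a path of length 4.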
After finding the paths between each developer pair, we define
a new distance metric, Reciprocal of Sum of Reciprocal Distances
(RSRD). We define RSRD as follows, where D denotes the set of
all distances between two developers (i.e. the set of lengths of all
different paths between two developers) and d is a distance in D:

$$\mathit{RSRD} = \frac{1}{\sum_{d \in D} \frac{1}{d}}$$

Reciprocals of distances make larger contributions to the score
for closer nodes. For example, 1/2 > 1/4, so the path with length of
2 makes a larger contribution. After summing the contributions of
all reciprocal distances, larger values represent a stronger
connection. For example, the first one means a stronger connection when
the sum for the first one is 0.75 and the sum for the second one is 0.5. To
use betweenness centrality, we need to invert the result of this
summation, because the nodes with stronger connections need to
be closer. For example, for the numbers in the previous example,
1.33 is smaller than 2, and it means a closer relation. In the
end, a smaller RSRD score represents a closer relationship between
two developers, just like any other distance metric.

Figure 2: Another sample artifact traceability graph. (D: Developer,
F: File, CS: Change Set)

Figure 3: Sample developer graph. (D: Developer)
Figure 3 shows how the developer graph is constructed from the
sample artifact graph in Figure 2. For example, (2, 4, 4, 4) are the
distances of the different paths between D1 and D2 in Figure 2, and
the RSRD between these two developers is calculated as follows:

$$\mathit{RSRD} = \frac{1}{\frac{1}{2} + \frac{1}{4} + \frac{1}{4} + \frac{1}{4}} = \frac{1}{1.25} = 0.8$$
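The RSRD computation itself is short enough to check by hand. The following Python sketch recomputes the D1 and D2 example from the path lengths (2, 4, 4, 4); the function name `rsrd` is an assumption made for illustration.

```python
def rsrd(path_lengths):
    """Reciprocal of Sum of Reciprocal Distances for one developer pair."""
    if not path_lengths:
        raise ValueError("RSRD is undefined for a pair with no connecting path")
    sum_of_reciprocals = sum(1 / length for length in path_lengths)
    return 1 / sum_of_reciprocals


# D1 and D2 in the text: one path of length 2 and three paths of length 4.
print(rsrd([2, 4, 4, 4]))

# More or shorter paths give a smaller RSRD, i.e. a closer relationship.
print(rsrd([4, 4, 4]) < rsrd([4]))
```

Because every extra path adds a positive term to the sum, adding paths between two developers can only decrease their RSRD.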
Algorithm 5 shows the pseudo code for calculating RSRD for a
given graph and depth limit. First, it runs a DFS algorithm starting
from each developer to find the paths to other developers. Second,
it removes duplicates because the DFS results include paths for two
ways. For example, DFS finds paths for both (D1, D2) and (D2, D1),
and it removes (D2, D1). Then, for each developer pair, it calculates
their RSRD value by using the length of the paths.
3.4.3 Identifying Connectors: When identifying connectors, we use
the betweenness centrality of developers in the developer graph.
Algorithm 6 shows the procedure. First, it finds different paths and
RSRD values for each developer pair as mentioned above. Then,
it creates a developer graph with these RSRD values and finds
betweenness centrality for each developer in that graph. Finally, it
sorts developers in descending order according to their centrality.
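To illustrate why a bridging developer ranks first, the sketch below computes normalized betweenness centrality on a small weighted developer graph. It is a plain-Python implementation in the style of Brandes' algorithm written for this example, not the NetworkX code used in the study, and the graph (two teams joined only by developer C, with RSRD-like edge weights) is hypothetical.

```python
import heapq

EPS = 1e-9  # tolerance when comparing floating-point path lengths


def betweenness(graph):
    """Normalized betweenness for an undirected graph {node: {neighbor: weight}}.

    Assumes strictly positive edge weights.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Dijkstra from source, counting shortest paths and predecessors.
        dist = {source: 0.0}
        sigma = {source: 1.0}   # number of shortest paths from source
        preds = {source: []}    # predecessors on those shortest paths
        done, order = set(), []
        heap = [(0.0, source)]
        while heap:
            d, v = heapq.heappop(heap)
            if v in done:
                continue
            done.add(v)
            order.append(v)
            for w, weight in graph[v].items():
                new_dist = d + weight
                if w not in dist or new_dist < dist[w] - EPS:
                    dist[w], sigma[w], preds[w] = new_dist, sigma[v], [v]
                    heapq.heappush(heap, (new_dist, w))
                elif abs(new_dist - dist[w]) <= EPS and w not in done:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest node backwards.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    n = len(graph)
    # Each unordered pair was counted twice, so 2/((n-1)(n-2)) on the
    # halved score equals 1/((n-1)(n-2)) on the accumulated score.
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: s * scale for v, s in score.items()}


# Hypothetical developer graph: teams {A, B} and {D, E} are linked only via C.
dev_graph = {
    "A": {"B": 0.8, "C": 0.8},
    "B": {"A": 0.8, "C": 0.8},
    "C": {"A": 0.8, "B": 0.8, "D": 1.33, "E": 1.33},
    "D": {"C": 1.33, "E": 0.8},
    "E": {"C": 1.33, "D": 0.8},
}

scores = betweenness(dev_graph)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking[0])
```

All four cross-team pairs have their only shortest path through C, so C gets a raw betweenness of 4 and a normalized value of 4 × 2/(4 × 3) ≈ 0.67, while every other developer scores 0.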
Algorithm 5 Calculating RSRD
function CalculateRsrd(graph, maxDepth)
    devs ← CurrentDevelopers(graph)              ▷ list
    devPairToPaths ← HashMap()                   ▷ string pair to list
    for start in devs do
        otherDevs ← devs − start
        paths ← DFS(graph, start, otherDevs, maxDepth)
        for path in paths do
            end ← path.getLast()
            devPairToPaths.get((start, end)).append(path)
    RemoveDuplicates(devPairToPaths)
    devPairToRsrd ← HashMap()
    for (start, end) in devPairToPaths.keys() do
        srd ← 0
        paths ← devPairToPaths.get((start, end))
        for path in paths do
            srd ← srd + 1 / path.length()
        rsrd ← 1 / srd
        devPairToRsrd.put((start, end), rsrd)
    return devPairToRsrd
Algorithm 6 Finding Connectors
function FindConnectors(graph, maxDepth)
    devPairToRsrd ← CalculateRsrd(graph, maxDepth)
    devGraph ← DeveloperGraph(devPairToRsrd)
    devToBtwn ← BetweennessCentrality(devGraph)
    return SortByValue(devToBtwn)
4 Dataset and Preprocessing
4.1 Selecting Datasets
As we mentioned before, we use software artifacts from project
history to construct the artifact traceability graph. More specifically,
our approach needs change sets (i.e. commits) and their related data
such as author, changed files and linked issues. Rath and Mader [ ]
published datasets for 33 OSS projects, SEOSS 33. All 33 datasets are
available online. Out of 33 projects, we selected Apache Hadoop,
Apache Hive and Apache Pig since these three projects have the
highest issue and change set link ratios among SEOSS 33 datasets.
The datasets include data from version control systems (e.g. Git)
and issue tracking systems (e.g. Jira). Table 2 shows the details for
each dataset with a varying number of issues and change sets.
4.2 Preprocessing
Data is already extracted from the version control and issue
tracking platforms and provided in an SQL dataset. Nonetheless, we
processed the data in order to prevent errors and calculate specific
fields. We did not use all information in the dataset; change_set,
code_change and change_set_link tables were enough to create
nodes for developers, change sets, issues and files.
We processed change sets from the change_set table ordered by
commit_date, extracted the data required and dumped them into
a file as a JSON formatted string of change sets in temporal
order. For each change set, we extracted the following information:
commit hash, author, date (commit_date), issues linked, set of file
paths with their change types, number of files in the project (after
the change set).
In the data extraction, the following points are important:
• We only extracted the code changes in Java files, which we assumed end with the ".java" extension. If a change set had no code change that includes a Java file, we ignored it completely.
• We ignored merge change sets (is_merge is 1) since they could inflate the contributions of some developers [1].
• We created a look-up table for each project to detect different author names of the same author. The tables were created manually by looking at the developer names and their email addresses. For example, "John Doe" and "Doe John" can be the same developer if they share a common email address. We used this table to correct the author names by replacing them.
• In order to calculate the mavenness score (see Section 3.3), we needed the number of files in the project after each change set. Thus, we tracked the set of current files over time. After each code change, we removed the file if its change type was DELETE, and we added the file if its change type was ADD.
• Git does not track RENAME situations explicitly, and the dataset [21] did not share such information about the code changes. When a file is renamed, it is a DELETE and an ADD for Git (if there is no change in the file). In the code_change table, there are three change types: ADD, DELETE and MODIFY. We needed to handle renames because they would affect our traceability graph and change the knowledge balance among developers. We treated (DELETE, ADD) pairs in the same change set (commit) as a RENAME when the following conditions were satisfied:
  • Both have the same file name (the file paths are different).
  • The number of lines deleted in the DELETE code change and the number of lines added in the ADD code change are equal.
  So, our rename algorithm only detects file path changes and does not check file contents. For example, it detects a RENAME when a file is moved from "module1/" to "module2/" but does not detect it when the file content is also changed. Since we used the datasets from the SQL tables [21] directly and did not mine them from the Git repositories ourselves, we used the heuristic given above.
• In the Hadoop dataset, we detected duplicate commits. Even though the commit hashes were different, the rest of the extracted data were identical. This situation only applies to Hadoop; the same preprocessing steps did not produce such a situation for Hive and Pig. We removed these change sets by using string comparison for all parts of the JSON string except the commit hash.
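The rename heuristic from the list above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' preprocessing script: the dictionary keys (`path`, `change_type`, `lines_added`, `lines_deleted`) and the function name `detect_renames` are hypothetical, and the file names in the usage are made up.

```python
import posixpath
from typing import Dict, List, Tuple


def detect_renames(code_changes: List[Dict]) -> List[Tuple[str, str]]:
    """Pair DELETE and ADD changes of one change set that look like a rename.

    Each change is a dict with the hypothetical keys
    'path', 'change_type', 'lines_added' and 'lines_deleted'.
    """
    deletes = [c for c in code_changes if c["change_type"] == "DELETE"]
    adds = [c for c in code_changes if c["change_type"] == "ADD"]
    renames = []
    used = set()  # indices of ADD changes already matched to a DELETE
    for d in deletes:
        for i, a in enumerate(adds):
            if i in used:
                continue
            same_name = posixpath.basename(d["path"]) == posixpath.basename(a["path"])
            different_path = d["path"] != a["path"]
            same_size = d["lines_deleted"] == a["lines_added"]
            if same_name and different_path and same_size:
                renames.append((d["path"], a["path"]))
                used.add(i)
                break
    return renames


changes = [
    {"path": "module1/Foo.java", "change_type": "DELETE", "lines_added": 0, "lines_deleted": 10},
    {"path": "module2/Foo.java", "change_type": "ADD", "lines_added": 10, "lines_deleted": 0},
]
renames = detect_renames(changes)  # [("module1/Foo.java", "module2/Foo.java")]
```

As in the text, the sketch compares only file names and line counts, so a file that is moved and edited in the same commit is not reported as a rename.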
Table 2 shows the number of change sets for each dataset after preprocessing. We also share our implementation online, and it includes the preprocessing script.
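The duplicate removal described for Hadoop can be illustrated with a short Python sketch. It is an assumption-laden example rather than the shared preprocessing script: the key `commit_hash` and the function name `drop_duplicate_change_sets` are hypothetical, and each change set is assumed to be a JSON-serialisable dictionary.

```python
import json
from typing import Dict, List


def drop_duplicate_change_sets(change_sets: List[Dict]) -> List[Dict]:
    """Keep the first change set among those identical except for the hash."""
    seen = set()
    unique = []
    for cs in change_sets:
        # Serialise everything but the hash; sort_keys makes the string canonical.
        rest = {k: v for k, v in cs.items() if k != "commit_hash"}
        key = json.dumps(rest, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(cs)
    return unique


commits = [
    {"commit_hash": "aaa", "author": "John Doe", "files": ["A.java"]},
    {"commit_hash": "bbb", "author": "John Doe", "files": ["A.java"]},
    {"commit_hash": "ccc", "author": "John Doe", "files": ["B.java"]},
]
deduplicated = drop_duplicate_change_sets(commits)
```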
Identifying Key Developers using Artifact Traceability Graphs PROMISE ’20, November 8–9, 2020, Virtual, USA
Table 2: Dataset Details
Columns 2 to 5 describe the datasets before preprocessing [21]; columns 6 to 8 describe them after preprocessing.
| Project | Time Period (months) | # of Issues | # of Change Sets | Change Sets Linked to a Set of Issues (%) | # of Change Sets | # of Change Sets that Added or Modified More than 10 Files | # of Change Sets that Added or Modified More than 50 Files |
| Hadoop | 150 | 39,086 | 27,776 | 97.13 | 15,178 | 1,900 (12.5%) | 129 (0.8%) |
| Hive | 113 | 18,025 | 11,179 | 96.34 | 9,030 | 1,062 (11.8%) | 127 (1.4%) |
| Pig | 123 | 5,234 | 3,134 | 92.85 | 2,401 | 240 (10.0%) | 32 (1.3%) |
4.3 Handling Large Change Sets
A change set is a set of file changes, and it is called a large change set when the number of changed files exceeds a specific number. For example, the initial commit of a project most probably includes many files, which makes it a typical large commit. Another example is moving one project into another. In that case, the change set includes all the files of the added project.
Committing a large number of files in one change set is not considered a good practice in software engineering. Sadowski et al. [25] claim that 90% of the changes at Google modify fewer than 10 files. Also, Rigby and Bird [22] excluded the changes that contain more than 10 files in their case studies. In our experiments, we use a looser limit for excluding change sets. In the following, we explain the details of removing large change sets:
• Regardless of the size of a change set, we apply the changes of the DELETE and RENAME types to the traceability graph, since the knowledge of deleted files is not required after that point and renamed files need to proceed with their new names.
• If a change set includes more than 50 files that are added or modified, we ignore these ADD and MODIFY code changes. We did not use 10 as the limit because we did not want to lose 10.0-12.5% of the datasets (see Table 2). Also, large commits can sometimes exist even though they are not a good practice. Our purpose is to handle the initial commits of the projects and project movements. So, choosing 50 is a good trade-off for the limit on the number of files added or modified in a change set: it is not so small that we lose 10.0-12.5% of the datasets, and not so large that commits like initial commits are included.
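The two rules above can be condensed into a small filter. The following Python sketch is illustrative only: the function name `changes_to_apply` and the `change_type` key are hypothetical, and RENAME is assumed to be the derived type produced by the rename heuristic of Section 4.2.

```python
from typing import Dict, List

LARGE_CHANGE_SET_LIMIT = 50  # limit chosen in the text


def changes_to_apply(code_changes: List[Dict], limit: int = LARGE_CHANGE_SET_LIMIT) -> List[Dict]:
    """Return the code changes of one change set that should update the graph."""
    added_or_modified = [c for c in code_changes if c["change_type"] in ("ADD", "MODIFY")]
    if len(added_or_modified) <= limit:
        return list(code_changes)
    # Large change set: keep only DELETE and RENAME so file bookkeeping stays correct.
    return [c for c in code_changes if c["change_type"] in ("DELETE", "RENAME")]


small = [{"change_type": "ADD"}, {"change_type": "DELETE"}]
large = [{"change_type": "ADD"}] * 51 + [{"change_type": "DELETE"}, {"change_type": "RENAME"}]
kept_small = changes_to_apply(small)
kept_large = changes_to_apply(large)
```

A change set with exactly 50 added or modified files is kept in full, since the text excludes only change sets with more than 50 such files.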
5.1 Experimental Setup
In the experiments, we used the NetworkX package for graph operations. In order to prevent potential bugs, we used its built-in functions whenever possible (e.g. calculating betweenness centrality, finding paths between developers). However, we implemented the DFS algorithm for reachable files in Algorithm 1 ourselves because it was very specific to our case (e.g. the stopping condition is different). Our source code is available online.
How much time the artifact graph should cover is a parameter in our method. We chose a sliding window approach over an incremental window in time; in other words, the artifact graph always includes the change sets committed in a constant time period. The reasons behind this choice are as follows:
• If the time period of the graph changes over time, the meaning of recency changes. For example, 0.9 recency means 30 days ago in a 300-day graph, while the same recency corresponds to 50 days ago in a 500-day graph. Thus, keeping the time period (sliding window size) constant enables recency scores to have the same meaning at different time points.
• Keeping every artifact from the history enlarges the graph every day, and the algorithms run slower on larger graphs. Therefore, removing unnecessary parts (the artifacts older than one year) means less run time.
• In OSS projects, there is no data about leaving developers. If we kept every artifact from the history, we would calculate scores even for former developers. Therefore, removing old artifacts enables us to keep track of the current developers. If the graph keeps the last 365 days, we assume the developers who contributed to the project in the last 365 days are the current active developers.
Figure 4: Experimental Setup. A 1-year (365-day) window moves along the timeline (in days) one day per step, from days 1-365 up to days 3003-3367.
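The recency example in the first reason is consistent with a linear definition of recency. The relation below is inferred from that example only and should be read as an assumption, not as a formula quoted from this section:

```latex
\[
\text{recency} = 1 - \frac{\text{age of the artifact in days}}{\text{window length in days}},
\qquad
1 - \frac{30}{300} = 1 - \frac{50}{500} = 0.9
\]
```

Under this reading, a fixed 365-day window makes a recency of 0.9 always correspond to 36.5 days ago.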
We used a 1-year (365 days) sliding window in our experiments. Figure 4 shows how the included days change over the iterations. The numbers in the figure come from the Hive dataset: "3367 days" corresponds to the number of days after preprocessing, and there are 3003 iterations including the initial window. We tracked the dates over change sets. When moving the window forward by one day, we first removed the change sets of the first day of the window. Then, we added the change sets of the day after the last day of the window. For each iteration, we calculated scores for jacks, mavens and connectors, and reported the developers and their scores in descending order. The same procedure was repeated for Hadoop and Pig.
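The window movement described above can be sketched as a generator. This is a minimal Python illustration, assuming a non-empty list of (date, commit hash) pairs with unique hashes; the function name `sliding_windows` is hypothetical and the scoring of jacks, mavens and connectors is left out.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterator, List, Tuple

WINDOW_DAYS = 365


def sliding_windows(
    change_sets: List[Tuple[date, str]], window_days: int = WINDOW_DAYS
) -> Iterator[Tuple[date, date, List[str]]]:
    """Yield (first_day, last_day, commit hashes inside the window), one day per step."""
    by_day: Dict[date, List[str]] = defaultdict(list)
    for day, commit_hash in change_sets:
        by_day[day].append(commit_hash)
    last_available = max(by_day)
    start = min(by_day)
    end = start + timedelta(days=window_days - 1)
    window = [h for d in sorted(by_day) if start <= d <= end for h in by_day[d]]
    while end <= last_available:
        yield start, end, list(window)
        # Slide one day: drop the first day's change sets, add the next day's.
        dropped = set(by_day.get(start, []))
        window = [h for h in window if h not in dropped]
        start += timedelta(days=1)
        end += timedelta(days=1)
        window.extend(by_day.get(end, []))


# Hypothetical data: one change set per day for five days, 3-day window.
history = [(date(2020, 1, d), f"c{d}") for d in range(1, 6)]
windows = list(sliding_windows(history, window_days=3))
```

A timeline of N days yields N - window_days + 1 windows. For the Hive numbers quoted above, 3367 - 365 + 1 = 3003, which matches the reported number of iterations.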
5.2 Results
Since we propose jack, maven and connector as the key developer categories for the first time and there is no classification of developers for these types in the literature, we are not able to compare our approach with others. Also, since we conducted our experiments on OSS projects, we have no developer labels for these projects. However, we can show that the results of our approach are compatible with other statistics of the projects.
To validate our approach, we propose to use developers’ comments on issues. Jacks are, by definition, the developers who have broad knowledge, and we identified them by finding their file coverage in the project. Therefore, the jacks should be involved in issues such as bugs and enhancements more than other developers. We claim that, by definition, the top jacks and the top commenters in the project’s issue tracking system (e.g. Jira) should be mostly the same developers. However, we cannot claim that mavens and connectors should be among the top commenters. To validate the results of these categories, we use the developers who are jack, maven and connector at the same time, in other words, the intersection of all kinds of key developers. In that way, we include mavens and connectors in our validation, and we show how the intersection developers overlap with the top commenters. While using the intersection developers, we sorted them by their jack score (i.e. file coverage) since we cannot combine betweenness centrality, mavenness score and file coverage properly. So, in the case studies, we examined the jacks and the intersection developers.
Table 3: Mean accuracy percentages for the key developers of our approach vs. the developers selected randomly in Monte Carlo simulation
Columns 4 to 7 give the accuracy for the key developers of our approach; columns 8 to 11 give the accuracy for the randomly selected developers.
| Key Developer Category | Project | Top Commenters | Key Developers Top-1 | Top-3 | Top-5 | Top-10 | Randomly Selected Developers Top-1 | Top-3 | Top-5 | Top-10 |
| Jacks | Hadoop | Top-1 | 6.82 | 19.04 | 27.10 | 50.47 | 1.97 | 5.88 | 9.79 | 19.56 |
| Jacks | Hadoop | Top-3 | - | 22.65 | 30.20 | 47.81 | - | 5.58 | 9.30 | 18.59 |
| Jacks | Hadoop | Top-5 | - | - | 29.62 | 47.40 | - | - | 9.60 | 19.18 |
| Jacks | Hadoop | Top-10 | - | - | - | 41.75 | - | - | - | 19.14 |
| Jacks | Hive | Top-1 | 44.16 | 71.43 | 81.65 | 92.21 | 6.24 | 18.73 | 31.23 | 58.17 |
| Jacks | Hive | Top-3 | - | 54.78 | 70.20 | 84.12 | - | 17.46 | 29.10 | 54.17 |
| Jacks | Hive | Top-5 | - | - | 57.02 | 73.28 | - | - | 26.60 | 49.53 |
| Jacks | Hive | Top-10 | - | - | - | 55.31 | - | - | - | 38.84 |
| Jacks | Pig | Top-1 | 59.16 | 86.23 | 88.90 | 89.91 | 11.77 | 35.41 | 56.35 | 83.55 |
| Jacks | Pig | Top-3 | - | 75.27 | 85.26 | 89.99 | - | 34.95 | 55.61 | 83.05 |
| Jacks | Pig | Top-5 | - | - | 66.91 | 79.66 | - | - | 46.61 | 73.73 |
| Jacks | Pig | Top-10 | - | - | - | 59.15 | - | - | - | 55.84 |
| Intersection (Sorted by Jack Score) | Hadoop | Top-1 | 7.22 | 21.33 | 34.97 | 59.04 | 1.96 | 5.87 | 9.77 | 17.48 |
| Intersection (Sorted by Jack Score) | Hadoop | Top-3 | - | 23.80 | 32.68 | 52.05 | - | 5.57 | 9.28 | 16.65 |
| Intersection (Sorted by Jack Score) | Hadoop | Top-5 | - | - | 31.26 | 49.08 | - | - | 9.59 | 17.16 |
| Intersection (Sorted by Jack Score) | Hadoop | Top-10 | - | - | - | 44.38 | - | - | - | 17.04 |
| Intersection (Sorted by Jack Score) | Hive | Top-1 | 56.94 | 75.92 | 81.15 | 87.98 | 6.23 | 10.92 | 12.50 | 14.68 |
| Intersection (Sorted by Jack Score) | Hive | Top-3 | - | 49.40 | 55.17 | 61.33 | - | 10.38 | 11.90 | 14.01 |
| Intersection (Sorted by Jack Score) | Hive | Top-5 | - | - | 38.86 | 44.94 | - | - | 10.79 | 12.68 |
| Intersection (Sorted by Jack Score) | Hive | Top-10 | - | - | - | 30.48 | - | - | - | 10.70 |
| Intersection (Sorted by Jack Score) | Pig | Top-1 | 66.26 | 86.41 | 86.63 | 86.63 | 11.78 | 22.36 | 23.58 | 23.58 |
| Intersection (Sorted by Jack Score) | Pig | Top-3 | - | 55.54 | 56.30 | 56.30 | - | 22.44 | 23.78 | 23.78 |
| Intersection (Sorted by Jack Score) | Pig | Top-5 | - | - | 39.39 | 39.39 | - | - | 20.73 | 20.73 |
| Intersection (Sorted by Jack Score) | Pig | Top-10 | - | - | - | 21.29 | - | - | - | 15.45 |
The datasets [21] we used for the experiments include data from issue tracking systems (e.g. Jira). In the change_set_link table, there are links between issues and change sets, which means we can use the comments made on the issues in the traceability graph. The datasets supply the display_name of the commenters in the issue_comment table. The names in the display_name field match the developer names in the author field of the change_set table. So, we can find how many comments each developer made on the issues in the graph. To increase the validity of the number of comments for each developer, we corrected the commenter names by using the look-up table created manually in preprocessing (see Section 4.2).
The Key Developers column in Table 3 shows the accuracy of our approach when we treat the top commenters as the actual key developers (i.e. ground truth). "Top commenters" means a ranked list of commenters according to the number of comments that they made on the issues in the last year. The key developers predicted by our model are consistent with the top commenters up to 92%.
Accuracy is calculated as shown in Equation 7, where $KD$ is the ranked list of key developers, $C$ is the ranked list of commenters, $D$ is the set of dates (i.e. days or iterations in Figure 4) and $k$ refers to the numbers in the Top-k phrases in Table 3. For example, the accuracy of day $d$ for the (Top-3, Top-5) cell is calculated as follows: if the top-3 commenters are $\{D_1, D_2, D_3\}$ and only $D_1$ and $D_2$ are among the top-5 key developers, the accuracy is $|\{D_1, D_2\}| / |\{D_1, D_2, D_3\}| = 2/3$.
\[
\text{Mean Accuracy}(k_1, k_2) = \frac{1}{|D|} \sum_{d \in D} \frac{|C_d(k_1) \cap KD_d(k_2)|}{|C_d(k_1)|} \tag{7}
\]
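Equation 7 translates directly into a few lines of Python. The sketch below is illustrative and not taken from the authors' code: the function names, the per-day dictionary layout and the developer labels beyond D1, D2 and D3 are assumptions.

```python
from typing import Dict, List


def accuracy(commenters: List[str], key_devs: List[str], k1: int, k2: int) -> float:
    """|C(k1) ∩ KD(k2)| / |C(k1)| for one day; both lists are ranked, best first."""
    top_commenters = set(commenters[:k1])
    top_key_devs = set(key_devs[:k2])
    return len(top_commenters & top_key_devs) / len(top_commenters)


def mean_accuracy(days: Dict[str, Dict[str, List[str]]], k1: int, k2: int) -> float:
    """Average the daily accuracy over all days of the sliding window run."""
    values = [accuracy(d["commenters"], d["key_devs"], k1, k2) for d in days.values()]
    return sum(values) / len(values)


# The (Top-3, Top-5) example: D1 and D2 are both top commenters and key developers.
day_accuracy = accuracy(["D1", "D2", "D3"], ["D1", "D2", "D4", "D5", "D6"], k1=3, k2=5)

runs = {
    "day1": {"commenters": ["D1", "D2", "D3"], "key_devs": ["D1", "D2", "D4", "D5", "D6"]},
    "day2": {"commenters": ["D1", "D2", "D3"], "key_devs": ["D3", "D2", "D1", "D5", "D6"]},
}
overall = mean_accuracy(runs, k1=3, k2=5)
```

Here `day_accuracy` is 2/3, as in the worked example, and `overall` averages 2/3 and 1 over the two hypothetical days.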
Since there is no comparable approach that finds our subcategories of key developers, we used a Monte Carlo simulation as a baseline. We randomly selected the key developers for each day from the existing developers in the graph, in other words, from the developers who committed changes to the source code in the sliding window period. While producing random developers, we considered the number of key developers in our results since the simulation should provide random results in the same structure. For example, we selected 4 random developers if our approach found 4 jacks on that day even if k is 5. Then, we calculated the mean accuracies in the same way as shown in Equation 7. This experiment with random key developers was repeated 1000 times. The Randomly Selected Developers column in Table 3 shows the average of the accuracies of the 1000 simulations. Our approach is more accurate than the random model in all cases.
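The random baseline for a single day can be sketched as follows. This is a simplified Python illustration under stated assumptions: the function name `random_baseline_accuracy`, the fixed seed and the developer names are hypothetical, the list of top commenters is assumed to be non-empty, and the averaging over days is omitted.

```python
import random
from typing import List


def random_baseline_accuracy(
    active_devs: List[str],
    top_commenters: List[str],
    n_selected: int,
    runs: int = 1000,
    seed: int = 0,
) -> float:
    """Mean overlap between the top commenters and n_selected random developers."""
    rng = random.Random(seed)
    targets = set(top_commenters)
    total = 0.0
    for _ in range(runs):
        # Draw as many developers as the approach reported for that day.
        picked = set(rng.sample(active_devs, min(n_selected, len(active_devs))))
        total += len(targets & picked) / len(targets)
    return total / runs


devs = [f"dev{i}" for i in range(10)]
baseline = random_baseline_accuracy(devs, ["dev0"], n_selected=1, runs=1000, seed=42)
```

With 10 active developers and one top commenter, the baseline settles near 10%, which shows why the random accuracies in Table 3 fall as the number of active developers grows.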
Hadoop, Hive and Pig have different scales. Pig is a small project with tens of developers, while the others have hundreds of developers in their whole history. Even though Hadoop and Hive both have hundreds of developers and their time periods are not that different, Hadoop has a lot more change sets and issues than Hive, as shown in Table 2. The average number of active developers for each project, in other words, the average number of developers in the traceability graph over time, is as follows: 9.05 in Pig, 35.08 in Hive and 49.86 in Hadoop. So, it is clear that Hadoop is a more active project than Hive. The differences between the results of the projects in Table 3 lead to the same conclusion. Both the results of our algorithms and the results of the Monte Carlo simulation show that the more active developers there are, the harder it becomes to predict key developers. Even though the accuracies differ among the projects for this reason, our results are better than the results of the random model in all cases. Also, the key developers predicted by our approach and the top commenters overlap up to 92%.
5.3 Discussion
In this section, we discuss the studies which propose a definition for core, hero, key or LTC (long-term contributor) developers similar to our definition.
Agrawal et al. [1] worked on hero developers in software projects. According to their definition, a project has hero developers if 20% of the developers made 80% of the contributions. Hero developers are similar to key developers in our case, as projects mostly depend on them. The authors provided a definition for the projects with hero developers, then analysed hero and non-hero projects without providing a validation. In our study, we subcategorized the key developers into three categories and provided separate algorithms to identify key developers.
Oliva et al. [20] worked on characterizing the key developers, that is "the set of developers who evolve the technical core". First, they detected the commits to core files in the project. Then, they ranked the developers according to their core commit counts, and considered the first 25% of the list as key developers. Without validating their key developers, they analyzed the identified key developers in terms of contribution characteristics, communication and coordination within the project. Also, they only performed an experiment on a small project with 16 developers. In our study, we investigated our algorithms in three projects of different sizes, and validated our key developers using top commenters in issue tracking systems.
Crowston et al. [10] examined the core and periphery of OSS team communications. They analyzed whether the following three methods produce similar results: the contributors named officially (e.g. support manager, developer), the contributors who contribute the most to the bug reports, and the contributors who are defined by a pattern of interactions in bug tracking systems. Since they used data from SourceForge, they had access to official labels of the contributors. In our case, Git repositories do not provide any labels for developer roles in the project, and the Apache organization only provides the full list of current developers.
Joblin et al. [16] studied core and peripheral developers. The core developers are the essential developers in the projects, like the key developers in our study. They worked on count-based (e.g. number
of commits as in [
]) and network-based (e.g. degree centrality in
developer graph from version control systems and mailing lists) metrics. They established a ground truth by conducting a survey with 166 participants. We were not able to examine their data because the links they provide to the project and survey data were not accessible. Also, their survey shows the actual core developers for a snapshot of the time, while our approach provides results continuously with a sliding window approach.
Zhou and Mockus [28] defined an LTC as "a participant who stays with the project for at least three years and is productive". They claimed that LTCs are crucial for the success of the projects. They mainly investigated how a new joiner becomes an LTC (i.e. a valuable contributor). In our study, we investigated how to find key developers rather than examining how new joiners evolve over time.
Construct validity is about how the operational measures in the study represent what is investigated according to the RQs [24]. We used a dataset from another study [21], and their mining process can potentially affect our results. To reduce the threat caused by the data mining process, we eliminated the possible problems in preprocessing (see Section 4.2); for example, we corrected author names manually by looking at their names and email addresses. However, there might still be problems related to data integrity.
Internal validity concerns whether the causal relations are examined [24]. While building the graphs and defining the algorithms, we made many decisions related to thresholds, including choosing 50 as the limit for the number of files added or modified in a change set, choosing 10 as the distance threshold in file reachability and using a 365-day (1-year) sliding window in the experiments. We tried various options and made the final decisions after evaluating their results. In the corresponding sections of this study, we shared the justifications behind these decisions. For example, we chose 10 as the distance threshold since it corresponds to 90% of the covered time in the graph because of the nature of the distance formula.
The potential errors in the implementation of our approach threaten the validity of our results. We benefited from a stable graph package, NetworkX, in our operations and used its methods whenever possible, for example, for betweenness centrality calculations and DFS for finding paths between developers. To prevent potential bugs, we performed multiple code review sessions with a third researcher. Also, we shared the implementation online for the replicability of the results.
We used developer comments in issue tracking systems to validate our approach; however, we do not claim that the number of comments shows the key developers in a project. We only claim that there should be a correlation between the top commenters and the key developers in the same time period. Then, we used this idea to show that our approach produces more reasonable results than the random case obtained with a Monte Carlo simulation.
External validity concerns the generalization of the findings of a study [24]. In our case studies, we used three different OSS projects. Even though we did not conduct a case study in an industrial company, we selected projects from Apache, a foundation that has been established for 20 years. Also, the sizes of the projects are different, as seen in Table 2. Although we believe that we have enough data for an initial assessment, in the future, we need to run our algorithms on more OSS and industrial datasets.
In this study, we proposed different categories for key developers in software development projects: jacks, mavens and connectors. To identify the developers in these subcategories of key developers, we proposed separate algorithms. Then, we conducted case studies on three OSS projects (Hadoop, Hive and Pig). Since there was no labeled data for the key developer categories, we used developers’ comments in issue tracking systems to validate our results. The key developers found by our model were compatible with the top commenters up to 92%. The results indicate that our approach is promising for identifying key developers in software projects.
We can summarize the contributions of our study as follows:
• We offered a novel categorization for the key developers inspired by The Tipping Point by Gladwell [15].
• For each of the three key developer categories (jacks, mavens and connectors), we proposed the corresponding algorithms using a traceability graph (network) of software artifacts.
• The findings of this study might shed light on the truck factor problem. Key developers can be used to find the truck factor of the projects.
• Identifying key developers in a software project might help software practitioners make managerial decisions.
As future work, we plan to run our algorithms on industrial datasets and validate our results by interviewing project stakeholders and creating a labeled dataset for these types of key developers. We are also planning to use the artifact traceability graph to categorize projects (balanced or hero) and recommend replacements for leaving developers.
[1] Amritanshu Agrawal, Akond Rahman, Rahul Krishna, Alexander Sobran, and Tim Menzies. 2018. We don’t need another hero?: the impact of heroes on software development. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. ACM, 245–253.
[2] Mohammad Y Allaho and Wang-Chien Lee. 2013. Analyzing the social ties and structure of contributors in open source software community. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 56–60.
[3] Guilherme Avelino, Eleni Constantinou, Marco Tulio Valente, and Alexander Serebrenik. 2019. On the abandonment and survival of open source projects: An empirical investigation. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 1–12.
[4] Guilherme Avelino, Leonardo Passos, Andre Hora, and Marco Tulio Valente. 2016. A novel approach for estimating truck factors. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). IEEE, 1–10.
[5] Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, and Anand Swaminathan. 2006. Mining email social networks. In Proceedings of the 2006 international workshop on Mining software repositories. 137–143.
[6] Ulrik Brandes. 2001. A faster algorithm for betweenness centrality. Journal of mathematical sociology 25, 2 (2001), 163–177.
[7] H Alperen Cetin. 2019. Identifying the most valuable developers using artifact traceability graphs. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1196–1198.
[8] Jinghui Cheng and Jin LC Guo. 2019. Activity-based analysis of open source software contributors: roles and dynamics. In 2019 IEEE/ACM 12th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE). IEEE, 11–18.
[9] Valerio Cosentino, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2015. Assessing the bus factor of Git repositories. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 499–503.
[10] Kevin Crowston, Kangning Wei, Qing Li, and James Howison. 2006. Core and periphery in free/libre and open source software team communications. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS’06), Vol. 6. IEEE, 118a–118a.
[11] Enrico Di Bella, Alberto Sillitti, and Giancarlo Succi. 2013. A multivariate classification of open source developers. Information Sciences 221 (2013), 72–83.
[12] Mívian Ferreira, Thaís Mombach, Marco Tulio Valente, and Kecia Ferreira. 2019. Algorithms for estimating truck factors: a comparative study. Software Quality Journal 27, 4 (2019), 1583–1617.
[13] Linton C Freeman. 1978. Centrality in social networks conceptual clarification. Social networks 1, 3 (1978), 215–239.
[14] Thomas Fritz, Gail C Murphy, Emerson Murphy-Hill, Jingwen Ou, and Emily Hill. 2014. Degree-of-knowledge: Modeling a developer’s knowledge of code. ACM Transactions on Software Engineering and Methodology (TOSEM) 23, 2 (2014).
[15] Malcolm Gladwell. 2006. The tipping point: How little things can make a big difference. Little, Brown.
[16] Mitchell Joblin, Sven Apel, Claus Hunsen, and Wolfgang Mauerer. 2017. Classifying developers into core and peripheral: An empirical study on count and network metrics. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 164–174.
[17] Takeshi Kakimoto, Yasutaka Kamei, Masao Ohira, and Kenichi Matsumoto. 2006. Social network analysis on communications for knowledge collaboration in oss communities. In Proceedings of the International Workshop on Supporting Knowledge Collaboration in Software Development (KCSD’06). Citeseer, 35–41.
[18] Makrina Viola Kosti, Robert Feldt, and Lefteris Angelis. 2016. Archetypal personalities of software engineers and their work preferences: a new perspective for empirical studies. Empirical Software Engineering 21, 4 (2016), 1509–1532.
[19] Audris Mockus. 2010. Organizational volatility and its effects on software defects. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. 117–126.
[20] Gustavo Ansaldi Oliva, José Teodoro da Silva, Marco Aurélio Gerosa, Francisco Werther Silva Santana, Cláudia Maria Lima Werner, Cleidson Ronald Botelho de Souza, and Kleverton Carlos Macedo de Oliveira. 2015. Evolving the system’s core: a case study on the identification and characterization of key developers in Apache Ant. Computing and Informatics 34, 3 (2015), 678–724.
[21] Michael Rath and Patrick Mäder. 2019. The SEOSS 33 dataset – Requirements, bug reports, code history, and trace links for entire projects. Data in brief 25 (2019), 104005.
[22] Peter C Rigby and Christian Bird. 2013. Convergent contemporary software peer review practices. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. 202–212.
[23] Peter C Rigby, Yue Cai Zhu, Samuel M Donadelli, and Audris Mockus. 2016. Quantifying and mitigating turnover-induced knowledge loss: case studies of Chrome and a project at Avaya. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). IEEE, 1006–1016.
[24] Per Runeson and Martin Höst. 2009. Guidelines for conducting and reporting case study research in software engineering. Empirical software engineering 14, 2 (2009), 131.
[25] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. Modern code review: a case study at google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. 181–190.
[26] Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. 2019. Reviewer Recommendation using Software Artifact Traceability Graphs. In Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering. 66–75.
[27] Jing Wu and Khim Yong Goh. 2009. Evaluating longitudinal success of open source software projects: A social network perspective. In 2009 42nd Hawaii International Conference on System Sciences. IEEE, 1–10.
[28] Minghui Zhou and Audris Mockus. 2012. What make long term contributors: Willingness and opportunity in OSS community. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 518–528.
... Cetin et al. [10] categorized developers in a software development team into three categories: Jacks, Mavens, and ...
In software development teams, developer turnover is among the primary reasons for project failures as it leads to a great void of knowledge and strain for the newcomers. Unfortunately, no established methods exist to measure how knowledge is distributed among development teams. Knowing how this knowledge evolves and is owned by key developers in a project helps managers reduce risks caused by turnover. To this end, this paper introduces a novel, realistic representation of domain knowledge distribution: the ConceptRealm. To construct the ConceptRealm, we employ a latent Dirichlet allocation model to represent textual features obtained from 300k issues and 1.3M comments from 518 open-source projects. We analyze whether the newly emerged issues and developers share similar concepts or how aligned the developers' concepts are with the team over time. We also investigate the impact of leaving members on the frequency of concepts. Finally, we evaluate the soundness of our approach to closed-source software, thus allowing the validation of the results from a practical standpoint. We find out that the ConceptRealm can represent the high-level domain knowledge within a team and can be utilized to predict the alignment of developers with issues. We also observe that projects exhibit many keepers independent of project maturity and that abruptly leaving keepers harm the team's concept familiarity.
... While changed or added requirements give input to allocate more resources, verification of said requirements provides grounds for their deallocation. However, without knowing if a development task has been properly addressed, such deallocation can be delayed or spent inefficiently (for example, see [36,37,38]). ...
The concept of traceability between artifacts is considered an enabler for software project success. This concept has received plenty of attention from the research community and is by many perceived to always be available in an industrial setting. In this industry-academia collaborative project, a team of researchers, supported by testing practitioners from a large telecommunication company, sought to investigate the partner company's issues related to software quality. However, it was soon identified that the fundamental traceability links between requirements and test cases were missing. This lack of traceability impeded the implementation of a solution to help the company deal with its quality issues. In this experience report, we discuss lessons learned about the practical value of creating and maintaining traceability links in complex industrial settings and provide a cautionary tale for researchers.
... The extensions in this study over our previous works (Cetin 2019;Ç etin and Tüzün 2020) are as follows: -Enhancing our evaluation by adding three more case studies in RQ 1. ...
Full-text available
Context In a software project, properly analyzing the contributions of developers could provide valuable insights for decision-makers. The contributions of a developer could be in many different forms such as committing and reviewing code, opening and resolving issues. Previous approaches mainly consider the commit-based contributions which provide an incomplete picture of developer contributions. Objective Different from the traditional commit-based approaches for analyzing developer contributions, we aim to provide a more holistic approach to reflect the rich set of software development activities using artifact traceability graphs. Method For analyzing the developer contributions, we propose a novel categorization of developers (Jacks, Mavens and Connectors) in a software project. We introduce a set of algorithms on artifact traceability graphs to identify key developers, recommend replacements for leaving developers and evaluate knowledge distribution among developers. Results We evaluate our proposed algorithms on six open-source projects and demonstrate that the identified key developers match the top commenters up to 98%, recommended replacements are correct up to 91% and identified knowledge distribution labels are compatible 94% on average with the baseline approaches. Conclusions The proposed algorithms using artifact traceability graphs for analyzing developer contributions could be used by software project decision-makers in several scenarios. (1) Identifying different types of key developers. (2) Finding a replacement developer in large teams. (3) Evaluating the overall knowledge distribution amongst developers to take early precautions.
Software development is a knowledge-intensive industry. For this reason, concentration of knowledge in software projects tends to be very risky, which increases the relevance of strategies that reveal how source code knowledge is distributed among team members. The truck factor (also known as the bus factor) is an increasingly popular concept, proposed by practitioners, that indicates the minimal number of developers that have to be hit by a truck (or leave the team) before a project is incapacitated. Therefore, it is a measure that reveals the concentration of knowledge and the key developers in a project. Due to the importance of this concept, algorithms have been proposed to automatically compute truck factors, using maintenance activity data extracted from version control systems. However, we still lack large studies that assess the results of truck factor algorithms. To fill this gap in the literature, this paper describes the results of three empirical studies. In the first study, we validate the results produced by three algorithms to estimate truck factors. To this purpose, we build an oracle of truck factors, gathered via a survey with 35 open-source project teams. In the second study, we provide a comparison between truck factors and core developers, a related concept commonly used to denote the key developers of open-source projects. Our results indicate that truck factor developers are in most cases a subset of core developers. Finally, as the algorithms proposed so far are based on commit data, in the third study, we investigate other factors that may impact the computation of truck factors.
This paper provides a systematically retrieved dataset consisting of 33 open-source software projects containing a large number of typed artifacts and trace links between them. The artifacts stem from the projects' issue tracking system and source version control system to enable their joint analysis. Enriched with additional metadata, such as time stamps, release versions, component information, and developer comments, the dataset is highly suitable for empirical research, e.g., in requirements and software traceability analysis, software evolution, bug and feature localization, and stakeholder collaboration. It can stimulate new research directions, facilitate the replication of existing studies, and act as a benchmark for the comparison of competing approaches. The data is hosted on Harvard Dataverse using DOI 10.7910/DVN/PDDZ4Q accessible via
Conference Paper
Employing lightweight, tool-based code review of code changes (aka modern code review) has become the norm for a wide variety of open-source and industrial systems. In this paper, we make an exploratory investigation of modern code review at Google. Google introduced code review early on and evolved it over the years; our study sheds light on why Google introduced this practice and analyzes its current status, after the process has been refined through decades of code changes and millions of code reviews. By means of 12 interviews, a survey with 44 respondents, and the analysis of review logs for 9 million reviewed changes, we investigate motivations behind code review at Google, current practices, and developers' satisfaction and challenges.
Conference Paper
A software project has "Hero Developers" when 80% of contributions are delivered by 20% of the developers. Are such heroes a good idea? Are too many heroes bad for software quality? Is it better to have more or fewer heroes for different kinds of projects? To answer these questions, we studied 661 open source projects from Public open source software (OSS) GitHub and 171 projects from an Enterprise GitHub. We find that hero projects are very common. In fact, as projects grow in size, nearly all projects become hero projects. These findings motivated us to look more closely at the effects of heroes on software development. Analysis shows that the frequency of closing issues and bugs is not significantly affected by the presence of heroes or project type (Public or Enterprise). Similarly, the time needed to resolve an issue/bug/enhancement is not affected by heroes or project type. This is a surprising result since, before looking at the data, we expected that increasing the number of heroes on a project would slow down how fast that project reacts to change. However, we do find a statistically significant association between heroes, project types, and enhancement resolution rates. Heroes do not affect enhancement resolution rates in Public projects. However, in Enterprise projects, heroes increase the rate at which projects complete enhancements. In summary, our empirical results call for a revision of a long-held truism in software engineering. Software heroes are far more common and valuable than suggested by the literature, particularly for medium to large Enterprise developments. Organizations should reflect on better ways to find and retain more of these heroes.
Conference Paper
Finding the most valuable and indispensable developers is a crucial task in software development. We categorize these valuable developers into two categories: connector and maven. A typical connector represents a developer who connects different groups of developers in a large-scale project. Mavens represent the developers who are the sole experts in specific modules of the project. To identify the connectors and mavens, we propose an approach using graph centrality metrics and connections of traceability graphs. We conducted a preliminary study on this approach by using two open source projects: QT 3D Studio and Android. Initial results show that the approach leads to the identification of the essential developers.
Conference Paper
Various types of artifacts (requirements, source code, test cases, documents, etc.) are produced throughout the lifecycle of a software system. These artifacts are often related to each other via traceability links that are stored in modern application lifecycle management repositories. Throughout the lifecycle of a software system, various types of changes can arise in any one of these artifacts. It is important to review such changes to minimize their potential negative impacts. To maximize the benefits of the review process, the reviewer(s) should be chosen appropriately. In this study, we reformulate the reviewer suggestion problem using software artifact traceability graphs. We introduce a novel approach, named RSTrace, to automatically recommend the reviewers that are best suited based on their familiarity with a given artifact. The proposed approach, in theory, could be applied to all types of artifacts. For the purpose of this study, we focused on the source code artifact and conducted an experiment on finding the appropriate code reviewer(s). We initially tested RSTrace on an open source project and achieved a top-3 recall of 0.85 with an MRR (mean reciprocal rank) of 0.73. In a further empirical evaluation of 37 open source projects, we confirmed that the proposed reviewer recommendation approach yields promising top-k and MRR scores on average compared to the existing reviewer recommendation approaches.