What is the Truck Factor of Popular GitHub
Applications? A First Assessment
Guilherme Avelino, Marco Tulio Valente, Andre Hora
Department of Computer Science, UFMG, Brazil
The Truck Factor designates the minimal number of developers that have to be hit by
a truck (or quit) before a project is incapacitated. It can be seen as a measurement of
the concentration of information in individual team members. We calculate the Truck
Factor for 133 popular GitHub applications, in six languages. Results show that most
systems have a small Truck Factor (34% have Truck Factor=1 and 30% have Truck Factor=2).
1 Introduction
The Truck Factor designates the minimal number of developers that have to be hit by a truck
(or quit) before a project is incapacitated [1]. Wikipedia defines it as a “measurement
of the concentration of information in individual team members. A high Truck Factor means
that many individuals know enough to carry on and the project could still succeed even in
very adverse events.” The term is also known as Bus Factor (or Bus Number).
In this paper, we report the first results of a study conducted to estimate the Truck Factor
of popular GitHub applications. Our results show that most systems have a small Truck
Factor (34% have Truck Factor=1 and 30% have Truck Factor=2). Section 2 reports our
study setup, including a description of the technique we used to calculate code authorship,
the dataset used in the paper, and the heuristic we used to estimate the Truck Factor.
Section 3 presents our first results.
PeerJ Preprints | | CC BY 4.0 Open Access | rec: 2 Jan 2017, publ: 2 Jan 2017
2 Study Setup
2.1 Code Authorship
We define an author as a developer able to influence or command the implementation of a
file. Therefore, an author is not just a collaborator with some expertise in the file, but, for
example, someone able to lead other developers working on it. To identify the authors of a
file, we rely on the Degree of Authorship (DOA) measure [2, 3], which is computed as

DOA = 3.293 + 1.098 * FA + 0.164 * DL − 0.321 * ln(1 + AC)
The degree of authorship of a developer d in a file f depends on three factors: first
authorship (FA), number of deliveries (DL), and number of acceptances (AC). If d is the
first author of f, FA is 1; otherwise, it is 0. DL is the number of changes in f made by d,
and AC is the number of changes in f made by other developers. Basically, the weights of
each variable assume that FA is by far the strongest predictor of file authorship. The number
of deliveries (DL) also contributes positively to authorship, but with less importance. Finally,
changes by other developers (AC) decrease someone’s DOA, but at a slower rate. The weights
used in the DOA equation were empirically derived through an experiment with seven
professional Java developers [2]. The authors also showed that the model is robust enough
to be used in different environments and projects.
In this study, we consider only normalized DOA values. For a file f, the normalized DOA
ranges from 0 to 1, where 1 is granted to the developer with the highest absolute DOA among
the developers that worked on f. A developer d is an author of a file f if her normalized
DOA is greater than a threshold k. We assume k = 0.75, a value that presented reasonable
accuracy in a manual validation we performed on a sample of systems.
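To make the authorship computation concrete, here is a minimal sketch in Python (the function names and the example figures are ours, not from the original tooling; the constants come from the DOA equation above):

```python
import math

def doa(first_author, deliveries, acceptances):
    """Absolute Degree of Authorship of a developer in a file."""
    fa = 1 if first_author else 0
    return 3.293 + 1.098 * fa + 0.164 * deliveries - 0.321 * math.log(1 + acceptances)

def authors(doa_by_dev, k=0.75):
    """Developers whose normalized DOA (relative to the file's
    highest absolute DOA) is greater than the threshold k."""
    top = max(doa_by_dev.values())
    return [d for d, v in doa_by_dev.items() if v / top > k]

# Hypothetical file: 'alice' created it and changed it 10 times (3 changes
# by others); 'bob' made 3 changes (10 by others).
scores = {"alice": doa(True, 10, 3), "bob": doa(False, 3, 10)}
print(authors(scores))  # only 'alice' clears the 0.75 threshold
```

Here bob's normalized DOA is roughly 0.54, below the 0.75 cutoff, so alice is the file's single author.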
2.2 Dataset
We evaluate systems implemented in the six languages with the largest number of repositories
in GitHub: JavaScript, Python, Ruby, C/C++, Java, and PHP. We initially select the
top-100 most popular systems in each language, based on their number of stars (starring is a
GitHub feature that lets users show their interest in repositories). Considering only the
systems in a given language, we compute the first quartile of the distribution of three measures:
number of developers, number of commits, and number of files (as collected from GitHub on
February 25th, 2015). We then discard systems that fall in the first quartile of any of these
measures. The goal is to focus on the most important systems per language, implemented
by teams with a considerable number of active developers and with a considerable number
of files. A similar procedure is followed by other studies on GitHub [4].
After this first selection, we remove repositories with evidence of being incorrectly migrated
to GitHub (e.g., from another version control system, like SVN). Specifically, we remove
systems having more than 50% of their files added in less than 20 commits (i.e., less than
10% of the minimal number of commits we initially considered). This is evidence that the
system was developed on another version control platform and that the migration to GitHub
did not preserve its previous version history. Finally, we manually inspected the GitHub
page of the selected systems. As a result, we decided to remove the repositories
raspberrypi/linux and django/django-old. The first is very similar to torvalds/linux
and the second is an old version of a repository already in the dataset.
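One plausible reading of the migration check can be sketched in Python (the function name and the exact cutoff interpretation are ours):

```python
def looks_migrated(files_added_per_commit, total_files,
                   max_commits=20, threshold=0.5):
    """Flag a repository if fewer than `max_commits` commits together
    added more than `threshold` of its files, which suggests a lossy
    import from another version control system."""
    # "less than 20 commits" read strictly: at most max_commits - 1 commits.
    top = sorted(files_added_per_commit, reverse=True)[: max_commits - 1]
    return sum(top) > threshold * total_files

# A repo whose initial import commit added 150 of its 230 files is flagged;
# a repo with evenly spread additions is not.
print(looks_migrated([150] + [1] * 80, 230))   # flagged
print(looks_migrated([2] * 100, 200))          # not flagged
```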
Table 1 summarizes the final list of repositories we selected for the study. It includes 133
systems, in six languages; Ruby is the language with the most systems (33) and PHP the
one with the fewest (17). Considering all systems, the dataset includes
more than 373K files, 41 MLOC, and 2 million commits.
Table 1: Dataset
Language Repositories Developers Commits Files LOC
JavaScript 22 5,740 108,080 24,688 3,661,722
Python 22 8,627 276,174 35,315 2,237,930
Ruby 33 19,960 307,603 33,556 2,612,503
C/C++ 18 21,039 847,867 107,464 19,915,316
Java 21 4,499 418,003 140,871 10,672,918
PHP 17 3,329 125,626 31,221 2,215,972
Total 133 63,194 2,083,353 373,115 41,316,361
File Cleaning: Studies on code authorship should consider only files representing the source
code of the selected systems. Therefore, files representing documentation, images, examples,
etc. should be discarded. Moreover, it is also fundamental to discard source files associated
with third-party libraries, which are frequently found in repositories of systems implemented
in dynamic languages. For this purpose, we initially used the Linguist library, which is the
tool GitHub uses to show the percentage of files in a repository implemented in different
programming languages. We excluded from our dataset the same files that Linguist discards
when computing language statistics, e.g., documentation and vendored (or third-party) files.
As a result, we automatically removed 129,455 files (34%), including 5,125 .js files, 3,099 .php
files, and 2,049 .c files. After this automatic cleanup step, we manually inspected the first two
top-level directories in each repository, mainly to detect third-party libraries and
documentation files not handled by the Linguist tool. As a result, we manually removed 10,450 files.
Handling Aliases: A second challenge when inferring code authorship from software
repositories is to detect aliases (i.e., different IDs for the same developer). To tackle this
challenge, we first consider commits identified with different developers’ names but the same
e-mail address as coming from the same developer. Second, we compared the names of the
developers in each commit using the Levenshtein distance [5]. Basically, this distance counts
the minimum number of single-character edits (insertions, deletions, or replacements)
required to change one string into the other. We considered as possible aliases the commits
whose developers’ names are separated by a Levenshtein distance of just one. We then
manually checked these cases to confirm whether or not they denote the same developer.
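The alias-candidate step can be sketched as follows (a minimal sketch; the helper names are ours, and the standard Wagner-Fischer dynamic program is used for the distance):

```python
from itertools import combinations

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or replacements needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # replacement
        prev = cur
    return prev[-1]

def alias_candidates(names):
    """Pairs of developer names at distance exactly 1,
    to be confirmed manually."""
    return [(a, b) for a, b in combinations(names, 2)
            if levenshtein(a, b) == 1]

print(alias_candidates(["John Doe", "Jon Doe", "Alice"]))
# a single candidate pair: ("John Doe", "Jon Doe")
```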
2.3 Truck Factor
To calculate the Truck Factor, we use a greedy heuristic: we consecutively remove the author
with the most authored files in a system, until more than 50% of the system’s files are
orphans (i.e., without an author). Therefore, we consider that a system is in trouble if more
than 50% of its files are orphans.
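The greedy heuristic above can be sketched as follows (a minimal sketch; the input maps each file to its set of authors, as produced by the DOA step):

```python
def truck_factor(authors_by_file):
    """Greedy heuristic: repeatedly remove the developer who authors the
    most files, until more than half of the files have no author left.
    Returns the number of developers removed."""
    files = {f: set(devs) for f, devs in authors_by_file.items()}
    total = len(files)
    removed = 0
    while sum(1 for devs in files.values() if not devs) <= total / 2:
        counts = {}
        for devs in files.values():
            for d in devs:
                counts[d] = counts.get(d, 0) + 1
        top = max(counts, key=counts.get)  # author of the most files
        for devs in files.values():
            devs.discard(top)
        removed += 1
    return removed

# Hypothetical system: 'a' authors 3 of 6 files, 'b' 2, 'c' 1.
# Removing 'a' orphans 3 files (not > 50%); removing 'b' orphans 5, so TF = 2.
print(truck_factor({"f1": {"a"}, "f2": {"a"}, "f3": {"a"},
                    "f4": {"b"}, "f5": {"b"}, "f6": {"c"}}))  # prints 2
```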
3 Results
Table 2 presents the Truck Factor (TF) we calculated for the analyzed GitHub repositories.
The results in this table are summarized as follows:
Most systems have a small Truck Factor:
45 systems have TF=1 (34%), including systems such as mbostock/d3, and
40 systems have TF=2 (30%), including systems such as cucumber/cucumber,
clojure/clojure, and netty/netty.
The two systems with the highest Truck Factor are torvalds/linux (TF = 130)
and Homebrew/homebrew (TF = 250).

Table 2: Truck Factor results
TF   Repositories
11   saltstack/salt, Seldaek/monolog, v8/v8
12   git/git, webscalesql/webscalesql-5.6
13   fog/fog
14   odoo/odoo
18   php/php-src
19   android/platform_frameworks_base, moment/moment
23   fzaninotto/Faker
56   caskroom/homebrew-cask
130  torvalds/linux
250  Homebrew/homebrew

(Systems with an updated TF, compared to the previous version of this preprint, are shown in bold.)

Homebrew is a package manager for the Mac OS operating system. The system can be extended by implementing formulas,
which are recipes for installing specific software packages. Homebrew currently supports
thousands of formulas, which are typically implemented by the package’s developers or
users, and rarely by Homebrew’s core developers. For this reason, the system has one
of the largest bases of contributors on GitHub (almost 5K contributors, as of July 14th,
2015). All these facts contribute to Homebrew having the largest Truck Factor in our
study. However, if we do not consider the files in Library/Formula, HomeBrew’s Truck
Factor decreases to just two.
We also found that our heuristic overestimates the Truck Factor of systems hosting a
large collection of plug-ins or similar code units in their repositories. Besides Homebrew,
this happens in at least two other systems: torvalds/linux and caskroom/homebrew-cask.
If we exclude the files of Linux’s drivers subsystem and the files in the Casks folder of
homebrew-cask, Linux’s Truck Factor decreases to 57 and homebrew-cask’s to just one.
Acknowledgments

Our research is supported by CNPq and FAPEMIG.

References

[1] L. Williams and R. Kessler, Pair Programming Illuminated. Addison Wesley, 2003.
[2] T. Fritz, G. C. Murphy, E. Murphy-Hill, J. Ou, and E. Hill, “Degree-of-knowledge: Mod-
eling a developer’s knowledge of code,” ACM Transactions on Software Engineering and
Methodology, vol. 23, no. 2, 2014.
[3] T. Fritz, J. Ou, G. C. Murphy, and E. Murphy-Hill, “A degree-of-knowledge model to cap-
ture source code familiarity,” in 32nd International Conference on Software Engineering
(ICSE), 2010, pp. 385–394.
[4] B. Ray, D. Posnett, V. Filkov, and P. Devanbu, “A large scale study of programming
languages and code quality in GitHub,” in 22nd International Symposium on Foundations
of Software Engineering (FSE), 2014, pp. 155–165.
(For the drivers subsystem, we use the mapping from files to Linux subsystems proposed by Passos et al. [6].)
[5] G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys,
vol. 33, no. 1, pp. 31–88, 2001.
[6] L. Passos, J. Padilla, T. Berger, S. Apel, K. Czarnecki, and M. T. Valente, “Feature
scattering in the large: a longitudinal study of Linux kernel device drivers,” in 14th
International Conference on Modularity, 2015, pp. 81–92.