Conference Paper

Assessing the Bus Factor of Git Repositories

Abstract and Figures

Software development projects face many risks (requirements inflation, poor scheduling, technical problems, etc.). Underestimating those risks may endanger the project's success. One of the most critical risks is employee turnover, that is, the risk of key personnel leaving the project. A good indicator of this risk is the concentration of information in individual developers. This is popularly known as the bus factor (“number of key developers who would need to be incapacitated, i.e. hit by a bus, to make a project unable to proceed”). Despite the simplicity of the concept, calculating the actual bus factor for a specific project can quickly turn into an error-prone and time-consuming activity as the size of the project and of the development team increases. To help project managers assess the bus factor of their projects, in this paper we present a tool that, given a Git-based repository, automatically measures the bus factor for any file, directory and branch in the repository and for the project itself. The tool can also simulate what would happen to the project (e.g., which files would become orphans) if one or more developers disappeared.
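The departure simulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that knowledge of each file has already been reduced to a set of developer names; the function name `orphaned_files` and the input mapping are hypothetical and are not taken from the tool itself, whose knowledge model is not described here.

```python
def orphaned_files(file_authors, departed):
    """Return the files whose knowledgeable developers have all departed.

    file_authors: dict mapping a file path to the set of developers assumed
                  to know that file (hypothetical input format).
    departed:     iterable of developer names removed in the simulation.
    """
    gone = set(departed)
    # A file is orphaned when it had at least one knowledgeable developer
    # and every one of them is in the departed set.
    return sorted(
        path
        for path, authors in file_authors.items()
        if authors and set(authors) <= gone
    )


if __name__ == "__main__":
    repo = {
        "a.py": {"alice"},
        "b.py": {"alice", "bob"},
        "c.py": {"bob"},
    }
    print(orphaned_files(repo, ["alice"]))
```

In this toy repository only `a.py` loses all of its knowledgeable developers when `alice` leaves, while `b.py` is still covered by `bob`.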
... For instance, the first estimation procedure proposed in the literature [6] required solving an NP-hard problem and did not scale to projects with more than 30 people [1], [7]. Another measure [8] requires defining primary and secondary developers and relies on two thresholds as inputs. The current state-of-the-art measure [1] assumes that a project stalls when more than 50% of files in the project are abandoned. ...
... The p-value for the test B(G) < B(null) is 0.007, indicating that the bus-factor of our project is statistically lower than what we would expect by randomly allocating people to tasks. This finding echoes those from prior research about the low bus-factor of many projects [1]-[4], [6], [8]. ...
... developers who contribute to the vast majority of the code base, and 2) quantifying the bus-factor of a project. Core developers are detected by estimating the degree of authorship (DoA) of each file by combining factors extracted from repositories, code review, meetings, etc. [1], [2], [4], [8]. These approaches are specific to computer science and, since in this paper we propose a general framework, we do not review them. ...
Conference Paper
Full-text available
When enough people leave a project, the project might stall due to lack of knowledgeable personnel. The minimum number of people who are required to disappear in order for a project to stall is referred to as bus-factor. The bus-factor has been found to be real and tangible and many approaches to measure it have been developed. These approaches are problematic: some of them do not scale to large projects, others rely on ad-hoc notions of primary and secondary developers, and others use arbitrary thresholds. None of them proposes a normalized measure of the bus-factor. Therefore, in this paper we propose a framework that, by modelling a project with a bipartite graph linking people to tasks, allows us to 1) quantify the bus-factor of a project with a normalized measure which does not rely on thresholds; and 2) increase the bus-factor of a project by reassigning people to tasks. We demonstrate our approach on a real case, discuss the advantages of our framework, and outline possibilities for future research.
... BF is usually calculated along with an ordered list of key developers [6,8,18,27,37] who are also called BF developers. Besides turnover risk mitigation, information about the project's BF enables decision-makers to detect potential bottlenecks in the development process and avoid future problems. ...
... Several research groups studied the bus factor. The common approach, used by Zazworka et al. [37], Cosentino et al. [8], Rigby et al. [29], and Avelino et al. [6], is to estimate code authorship from the version control history data. Jabrayilzade et al. [18] augment the authorship measurement by adding other authorship indicators, such as code reviews. ...
... The first work on this subject by Zazworka et al. [37] uses a greedy iterative method to simulate the most knowledgeable developer removal and evaluate the effect on ownership coverage. Cosentino et al. [8] use different developer interaction patterns like commit frequency and code churn to find the best one to assess developer knowledge. Finally, Rigby et al. [29] suggest using a random instead of a greedy method to find the minimum set of BF developers. ...
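As a rough illustration of the greedy idea described in these excerpts, the following sketch repeatedly removes the developer who covers the most files and stops once more than half of the files are abandoned (the 50% stall criterion quoted earlier on this page). It is a simplified sketch under stated assumptions, not the algorithm of any of the cited papers: the input format, the function name `greedy_bus_factor`, and the alphabetical tie-break are all hypothetical.

```python
from collections import Counter


def greedy_bus_factor(file_authors, threshold=0.5):
    """Greedily remove top developers until the project 'stalls'.

    file_authors: dict mapping file path -> set of developers assumed to
                  know the file (hypothetical input format).
    threshold:    fraction of abandoned files above which the project is
                  considered stalled (0.5 mirrors the 50% rule above).
    Returns the list of removed developers; its length is the estimate.
    """
    # Work on copies and ignore files nobody is credited with.
    remaining = {p: set(a) for p, a in file_authors.items() if a}
    total = len(remaining)
    if total == 0:
        return []
    removed = []
    while True:
        abandoned = sum(1 for authors in remaining.values() if not authors)
        if abandoned / total > threshold:
            return removed
        # Count how many files each remaining developer still covers.
        coverage = Counter(d for authors in remaining.values() for d in authors)
        if not coverage:
            return removed
        # Pick the developer covering the most files; ties broken by name
        # so that the result is deterministic.
        top = min(coverage, key=lambda d: (-coverage[d], d))
        removed.append(top)
        for authors in remaining.values():
            authors.discard(top)


if __name__ == "__main__":
    repo = {
        "f1": {"alice"},
        "f2": {"alice"},
        "f3": {"alice", "bob"},
        "f4": {"bob"},
    }
    print(greedy_bus_factor(repo))
```

A greedy pass like this is fast but is not guaranteed to find the smallest possible set, which is one reason the excerpts mention alternative search strategies.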
Article
Full-text available
Software projects experience the departure of developers for various reasons. As developers are one of the main sources of knowledge in software projects, their absence will inevitably result in a certain degree of knowledge depletion. Bus Factor (BF) is a metric to evaluate how this knowledge loss can affect the project's continuity. Conventionally, BF is calculated as the size of the smallest set of developers whose departure removes over half of the project knowledge. Current state-of-the-art approaches measure developers' knowledge by the number of authored files, utilizing version control system (VCS) information. However, numerous studies have shown that files in software projects have different significance. In this study, we explore how weighting files according to their significance affects the performance of two prevailing BF estimators. We derive significance scores by computing five well-known graph metrics from the project's dependency graph: PageRank, In-/Out-/All-Degree, and Betweenness Centralities. Furthermore, we introduce BFSig, a prototype of our approach. Finally, we present a new dataset comprising reported BF scores collected by surveying software practitioners from five prominent GitHub repositories. Our results indicate that BFSig outperforms the baselines by up to an 18% reduction in terms of Normalized Mean Absolute Error (NMAE). Moreover, BFSig yields 18% fewer False Negatives in identifying potential risks associated with low BF. In addition, our respondent confirmed BFSig's versatility by showing its ability to assess the BF of the project's subfolders. In conclusion, we believe that, to estimate BF from authorship, software components of higher importance should be assigned heavier weights. Currently, BFSig exclusively explores the topological characteristics of these components. Nevertheless, considering attributes such as code complexity and bug proneness could potentially enhance the performance of BFSig.
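The conventional definition given in this abstract, combined with per-file weights, can be sketched as an exhaustive search for the smallest set of developers whose departure removes more than half of the total weight. This is only an illustration: it is not BFSig, the weights are hypothetical stand-ins for significance scores, the function name `weighted_bus_factor` is invented, and the search is exponential in the number of developers, so it is only practical for toy inputs.

```python
from itertools import combinations


def weighted_bus_factor(file_authors, weights):
    """Smallest developer set whose departure loses > half the total weight.

    file_authors: dict mapping file path -> set of authoring developers;
                  every file is assumed to have at least one author.
    weights:      dict mapping file path -> non-negative significance
                  weight (hypothetical values, e.g. a centrality score).
    Returns one smallest such set as a sorted list of names.
    """
    developers = sorted({d for authors in file_authors.values() for d in authors})
    total = sum(weights[path] for path in file_authors)
    # Try candidate sets in order of increasing size, so the first hit
    # is a smallest set.
    for size in range(1, len(developers) + 1):
        for group in combinations(developers, size):
            gone = set(group)
            # A file's weight is lost when all of its authors are gone.
            lost = sum(
                weights[path]
                for path, authors in file_authors.items()
                if set(authors) <= gone
            )
            if lost > total / 2:
                return list(group)
    return developers


if __name__ == "__main__":
    repo = {"core.py": {"alice"}, "util.py": {"bob"}, "docs.py": {"bob"}}
    print(weighted_bus_factor(repo, {"core.py": 6, "util.py": 2, "docs.py": 2}))
    print(weighted_bus_factor(repo, {"core.py": 1, "util.py": 1, "docs.py": 1}))
```

With uniform weights the developer who authored the most files is the critical one; once `core.py` is weighted more heavily, the result shifts to its sole author, which is the effect the abstract attributes to significance weighting.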
... Existing techniques for tracking and managing knowledge concentration in software development frequently rely on simplistic metrics, such as the number of commits, number of lines, or the identity of the last modifier, which do not adequately capture the depth of a developer's expertise [11,26,27]. Additionally, these techniques fall short of providing comprehensive insights into the expertise distribution among developers. ...
... Truck Factor, also called Bus Factor, is a measure that indicates the minimum number of developers who need to leave a software project for it to stall [20]. This metric helps practitioners identify the concentration of knowledge in their projects and has already been the focus of different studies in the software engineering literature [4,11,12,14,16,20]. Some studies focused on proposing new ways of estimating the Truck Factor. ...
Conference Paper
Current software development is often a cooperative activity, where different situations can arise that put the existence of a project at risk. One common and extensively studied issue in the software engineering literature is the concentration of a significant portion of knowledge about the source code in a few developers on a team. In this scenario, the departure of one of these key developers could make it impossible to continue the project. This work presents Knowledge Islands, a tool that visualizes the concentration of knowledge in a software repository using a state-of-the-art knowledge model. Key features of Knowledge Islands include user authentication, cloning, and asynchronous analysis of user repositories, identification of the expertise of the team’s developers, calculation of the Truck Factor for all folders and source code files, and identification of the main developers and repository files. This open-source tool enables practitioners to analyze GitHub projects, determine where knowledge is concentrated within the development team, and implement measures to maintain project health. The source code of Knowledge Islands is available in a public repository, and there is a presentation about the tool in video.
... Cosentino et al. computed the bus factor on files, directories, and branches for Git-based development [7] to perform a risk assessment. Jabrayilzade et al. [17] proposed a multimodal bus factor algorithm, integrating history in the version control system, code reviews, and meetings meta-data. ...
... Most works (except [1]) focus on a limited number of systems (e.g., [7,17,28]) and various evolving definitions of the bus factor. Our main contribution is a large-scale study on GitHub repositories leveraging the pony factor to derive data-driven insights. ...
Article
In software development, developer turnover is among the primary reasons for project failures, leading to a great void of knowledge and strain for newcomers. Unfortunately, no established methods exist to measure how the problem domain knowledge is distributed among developers. Awareness of how this knowledge evolves and is owned by key developers in a project helps stakeholders reduce risks caused by turnover. To this end, this paper introduces a novel, realistic representation of problem domain knowledge distribution: the ConceptRealm. To construct the ConceptRealm, we employ a latent Dirichlet allocation model to represent textual features obtained from 300K issues and 1.3M comments from 518 open‐source projects. We analyze whether the newly emerged issues and developers share similar concepts or how aligned the individual developers' concepts are with the team over time. We also investigate the impact of leaving developers on the frequency of concepts. Finally, we also evaluate the soundness of our approach on a closed‐source software project, thus allowing the validation of the results from a practical standpoint. We find out that the ConceptRealm can represent the problem domain knowledge within a project and can be utilized to predict the alignment of developers with issues. We also observe that projects exhibit many keepers independent of project maturity and that abruptly leaving keepers correlates with a decline of their core concepts as the remaining developers cannot quickly familiarize themselves with those concepts.
Article
Full-text available
Community smells are negative patterns in software development teams’ interactions that impede their ability to successfully create software. Examples are team members working in isolation, lack of communication and collaboration across departments or sub-teams, or areas of the codebase where only a few team members can work on. Current approaches aim to detect community smells by analysing static network representations of software teams’ interaction structures. In doing so, they are insufficient to locate community smells within development processes. Extending beyond the capabilities of traditional social network analysis, we show that higher-order network models provide a robust means of revealing such hidden patterns and complex relationships. To this end, we develop a set of centrality measures based on the MOGen higher-order network model and show their effectiveness in predicting influential nodes using five empirical datasets. We then employ these measures for a comprehensive analysis of a product team at the German IT security company genua GmbH, showcasing our method’s success in identifying and locating community smells. Specifically, we uncover critical community smells in two areas of the team’s development process. Semi-structured interviews with five team members validate our findings: while the team was aware of one community smell and employed measures to address it, it was not aware of the second. This highlights the potential of our approach as a robust tool for identifying and addressing community smells in software development teams. More generally, our work contributes to the social network analysis field with a powerful set of higher-order network centralities that effectively capture community dynamics and indirect relationships.
Article
Full-text available
Very few software projects are completed on time, on budget, and to their original specification, causing the global IT software industry to lose billions each year in project overruns and software rework. Research suggests that projects usually fail because of management mistakes rather than technical mistakes. Risk Management in software projects focuses on what the practitioner needs to know about risk in the pursuit of delivering successful software projects.
Conference Paper
Full-text available
When software repositories are mined, two distinct sources of information are usually explored: the history log and snapshots of the system. Results of analyses derived from these two sources are biased by the frequency with which developers commit their changes. We argue that the usage of mainstream SCM systems influences the way that developers work. For example, since it is tedious to resolve conflicts due to parallel commits, developers tend to minimize conflicts by not contemporarily modifying the same file. This however defeats one of the purposes of such systems. We mine repositories created by our Syde tool, which records every change by every developer in multi-developer projects. This new source of information can augment the accuracy of analyses and breaks new ground in terms of how such information can assist developers. In this paper we illustrate how the information we mine can help to provide a refined notion of code ownership. As a case study, we analyze the developers' activities of the development of a commercial system.
Conference Paper
Full-text available
Ownership is a key aspect of large-scale software development. We examine the relationship between different ownership measures and software failures in two large software projects: Windows Vista and Windows 7. We find that in all cases, measures of ownership such as the number of low-expertise developers, and the proportion of ownership for the top owner have a relationship with both pre-release faults and post-release failures. We also empirically identify reasons that low-expertise developers make changes to components and show that the removal of low-expertise contributions dramatically decreases the performance of contribution based defect prediction. Finally we provide recommendations for source code change policies and utilization of resources such as code inspections based on our results.
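The two kinds of ownership measure named in this abstract can be sketched from per-developer contribution counts for a single component. The abstract does not state how a low-expertise developer is identified, so the 5% share cutoff, the function name `ownership_measures`, and the use of commit counts as the contribution unit are all assumptions made for illustration.

```python
def ownership_measures(commits_by_dev, minor_threshold=0.05):
    """Compute simple ownership measures for one component.

    commits_by_dev:  dict mapping developer -> number of commits touching
                     the component (hypothetical input format).
    minor_threshold: share below which a developer is treated as a
                     low-expertise contributor (assumed cutoff).
    """
    total = sum(commits_by_dev.values())
    if total == 0:
        return {"top_owner_share": 0.0, "minor_contributors": 0}
    shares = {dev: count / total for dev, count in commits_by_dev.items()}
    return {
        # Proportion of contributions made by the single largest owner.
        "top_owner_share": max(shares.values()),
        # Number of developers whose share falls below the cutoff.
        "minor_contributors": sum(1 for s in shares.values() if s < minor_threshold),
    }


if __name__ == "__main__":
    print(ownership_measures({"alice": 90, "bob": 8, "carol": 2}))
```

A high top-owner share together with few minor contributors indicates concentrated ownership of the component.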
Conference Paper
Full-text available
Adopting new software processes and practices in organizational and academic environments requires training the developers and validating the applicability of the newly introduced activities. Investigating process conformance during training and understanding whether programmers are able and willing to follow the specific steps are crucial to evaluating whether the process improves various software product quality factors. In this paper we present a process-model-independent approach to detect process non-conformance. Our approach is based on non-intrusively collected data captured by a version control system and provides the project manager with timely updates. Further, we provide evidence of the applicability of our approach by investigating process conformance in a five-day training class on eXtreme Programming (XP) practices at the Leibniz Universität Hannover. Our results show that the approach enabled researchers to formulate minimally intrusive methods to check for conformance and that violations could be detected for the majority of the investigated XP practices.
Conference Paper
Full-text available
In spite of its potential relevance for managers, and even though the Truck Factor definition has been well known in the “agile world” for many years, shared and validated measurements, algorithms, tools, thresholds and empirical studies on this topic are still lacking. In this paper, we explore the situation by implementing the only approach proposed in the literature that is able to compute the Truck Factor. Then, using our tool, we conduct an exploratory study with 37 open source projects to discover limitations and drawbacks that could prevent its usage. Lessons learnt from the exploratory study and open issues are drawn at the end of this work. The most important lesson that we have learnt is that more research is needed to render the notion of Truck Factor operative and usable.
Conference Paper
As systems evolve, their structure changes in ways not expected upfront. As time goes by, the knowledge of the developers becomes more and more critical for the process of understanding the system. That is, when we want to understand a certain issue of the system we ask the knowledgeable developers. Yet, in large systems, not every developer is knowledgeable in all the details of the system. Thus, we would want to know which developer is knowledgeable in the issue at hand. In this paper we make use of the mapping between the changes and the author identifiers (e.g., user names) provided by versioning repositories. We first define a measurement for the notion of code ownership. We then use this measurement to define the ownership map visualization, which shows when and how different developers interacted and in which parts of the system. We report the results we obtained on several large systems.
Article
The emerging discipline of software risk management is described. It is defined as an attempt to formalize the risk-oriented correlates of success into a readily applicable set of principles and practices. Its objectives are to identify, address, and eliminate risk items before they become either threats to successful software operation or major sources of software rework. The basic concepts are set forth, and the major steps and techniques involved in software risk management are explained. Suggestions for implementing risk management are provided.