Conference Paper

Assessing the Bus Factor of Git Repositories

Abstract

Software development projects face many risks (requirements inflation, poor scheduling, technical problems, etc.). Underestimating those risks may endanger the project's success. One of the most critical risks is employee turnover, that is, the risk of key personnel leaving the project. A good indicator for evaluating this risk is the concentration of information in individual developers. This is popularly known as the bus factor (“number of key developers who would need to be incapacitated, i.e. hit by a bus, to make a project unable to proceed”). Despite the simplicity of the concept, calculating the actual bus factor for specific projects can quickly turn into an error-prone and time-consuming activity as the size of the project and development team increases. To help project managers assess the bus factor of their projects, in this paper we present a tool that, given a Git-based repository, automatically measures the bus factor for any file, directory and branch in the repository and for the project itself. The tool can also simulate what would happen to the project (e.g., which files would become orphans) if one or more developers disappeared.
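The simulation idea in the abstract can be illustrated with a minimal Python sketch. This is not the tool's actual implementation: it assumes a hypothetical input that maps each file to per-developer authorship shares, and it treats a developer as knowledgeable about a file when their share reaches an assumed threshold (0.3 here). The function names are illustrative only.

```python
from typing import Dict, List, Set

# Hypothetical input: for every file, each developer's share of authorship (0..1).
Ownership = Dict[str, Dict[str, float]]


def knowledgeable(ownership: Ownership, threshold: float = 0.3) -> Dict[str, Set[str]]:
    """Developers whose authorship share of a file reaches the assumed threshold."""
    return {
        path: {dev for dev, share in shares.items() if share >= threshold}
        for path, shares in ownership.items()
    }


def file_bus_factor(ownership: Ownership, threshold: float = 0.3) -> Dict[str, int]:
    """Per-file bus factor, taken here as the number of knowledgeable developers."""
    return {path: len(devs) for path, devs in knowledgeable(ownership, threshold).items()}


def orphans_after_removal(
    ownership: Ownership, removed: Set[str], threshold: float = 0.3
) -> List[str]:
    """Files left without any knowledgeable developer once `removed` have gone."""
    return sorted(
        path
        for path, devs in knowledgeable(ownership, threshold).items()
        if not (devs - removed)
    )


# Made-up example data, for illustration only.
example_ownership: Ownership = {
    "core/engine.py": {"alice": 0.9, "bob": 0.1},
    "core/io.py": {"alice": 0.5, "bob": 0.5},
    "docs/guide.md": {"carol": 1.0},
}

if __name__ == "__main__":
    print(file_bus_factor(example_ownership))
    print(orphans_after_removal(example_ownership, {"alice"}))
```

In this made-up data, removing `alice` orphans only `core/engine.py`, because `bob` still holds half of `core/io.py`. Directory- or project-level figures could be aggregated from the same per-file sets, but the aggregation rule would be a further assumption.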
... BF is usually calculated along with an ordered list of key developers [6,8,18,27,37] who are also called BF developers. Besides turnover risk mitigation, information about the project's BF enables decision-makers to detect potential bottlenecks in the development process and avoid future problems. ...
... Several research groups studied the bus factor. The common approach, used by Zazworka et al. [37], Cosentino et al. [8], Rigby et al. [29], and Avelino et al. [6], is to estimate code authorship from the version control history data. Jabrayilzade et al. [18] augment the authorship measurement by adding other authorship indicators, such as code reviews. ...
... The first work on this subject by Zazworka et al. [37] uses a greedy iterative method to simulate the most knowledgeable developer removal and evaluate the effect on ownership coverage. Cosentino et al. [8] use different developer interaction patterns like commit frequency and code churn to find the best one to assess developer knowledge. Finally, Rigby et al. [29] suggest using a random instead of a greedy method to find the minimum set of BF developers. ...
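The greedy removal idea mentioned in the excerpt above can be sketched as follows. This is a simplified illustration, not the algorithm from any of the cited papers: it assumes a hypothetical mapping from files to their sets of authors, removes the developer who authors the most still-covered files at each step, and stops once the fraction of files with at least one remaining author drops below an assumed threshold of 0.5.

```python
from collections import Counter
from typing import Dict, Iterable, List


def coverage(files: Dict[str, set]) -> float:
    """Fraction of files that still have at least one remaining author."""
    return sum(1 for devs in files.values() if devs) / len(files)


def greedy_bus_factor(
    authors_by_file: Dict[str, Iterable[str]], min_coverage: float = 0.5
) -> List[str]:
    """Developers removed, in order, until coverage falls below `min_coverage`.

    The length of the returned list is the estimated bus factor.
    """
    if not authors_by_file:
        return []
    remaining = {path: set(devs) for path, devs in authors_by_file.items()}
    removed: List[str] = []
    while coverage(remaining) >= min_coverage:
        counts = Counter(dev for devs in remaining.values() for dev in devs)
        if not counts:  # nobody left to remove
            break
        # Greedy choice: the developer authoring the most files; ties broken by name.
        top = min(counts, key=lambda dev: (-counts[dev], dev))
        removed.append(top)
        for devs in remaining.values():
            devs.discard(top)
    return removed


# Made-up example data, for illustration only.
example_authors = {
    "a.py": {"alice"},
    "b.py": {"alice"},
    "c.py": {"alice", "bob"},
    "d.py": {"bob"},
    "e.py": {"carol"},
}

if __name__ == "__main__":
    key_devs = greedy_bus_factor(example_authors)
    print(len(key_devs), key_devs)
```

On the example data, removing `alice` leaves three of five files covered, and removing `bob` as well leaves one of five, so the sketch reports two key developers. A randomized variant would replace the greedy choice with sampled candidate sets.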
Article
Software projects experience the departure of developers for various reasons. As developers are one of the main sources of knowledge in software projects, their absence will inevitably result in a certain degree of knowledge depletion. Bus Factor (BF) is a metric to evaluate how this knowledge loss can affect the project's continuity. Conventionally, BF is calculated as the smallest set of developers whose departure removes over half of the project knowledge. Current state-of-the-art approaches measure developers' knowledge by the number of authored files, utilizing version control system (VCS) information. However, numerous studies have shown that files in software projects have different significance. In this study, we explore how weighting files according to their significance affects the performance of two prevailing BF estimators. We derive significance scores by computing five well-known graph metrics from the project's dependency graph: PageRank, In-/Out-/All-Degree, and Betweenness Centralities. Furthermore, we introduce BFSig, a prototype of our approach. Finally, we present a new dataset comprising reported BF scores collected by surveying software practitioners from five prominent GitHub repositories. Our results indicate that BFSig outperforms the baselines by up to an 18% reduction in terms of Normalized Mean Absolute Error (NMAE). Moreover, BFSig yields 18% fewer False Negatives in identifying potential risks associated with low BF. Besides, our respondent confirmed BFSig's versatility by showing its ability to assess the BF of the project's subfolders. In conclusion, we believe that, to estimate BF from authorship, software components of higher importance should be assigned heavier weights. Currently, BFSig exclusively explores the topological characteristics of these components. Nevertheless, considering attributes such as code complexity and bug proneness could potentially enhance the performance of BFSig.
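To make the weighting idea concrete, here is a minimal Python sketch. It is not BFSig itself: it assumes a hypothetical dependency map (file to the files it depends on), uses only in-degree as the significance score, and adds 1 to every weight so that files nobody depends on still count. That smoothing is an assumption of this sketch, not something stated in the abstract.

```python
from typing import Dict, Iterable, Set


def in_degree_weights(deps: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Weight of a file = 1 + number of files that depend on it (assumed smoothing)."""
    weights = {path: 1.0 for path in deps}
    for targets in deps.values():
        for target in targets:
            weights[target] = weights.get(target, 1.0) + 1.0
    return weights


def weighted_loss(
    authors_by_file: Dict[str, Set[str]],
    weights: Dict[str, float],
    removed: Set[str],
) -> float:
    """Share of total file weight whose authors have all been removed."""
    total = sum(weights.get(path, 1.0) for path in authors_by_file)
    if total == 0:
        return 0.0
    lost = sum(
        weights.get(path, 1.0)
        for path, devs in authors_by_file.items()
        if not (set(devs) - removed)
    )
    return lost / total


# Made-up example data, for illustration only.
example_deps = {
    "app.py": ["core.py", "util.py"],
    "cli.py": ["core.py"],
    "core.py": ["util.py"],
    "util.py": [],
}
example_file_authors = {
    "app.py": {"bob"},
    "cli.py": {"bob"},
    "core.py": {"alice"},
    "util.py": {"alice"},
}

if __name__ == "__main__":
    w = in_degree_weights(example_deps)
    print(weighted_loss(example_file_authors, w, {"alice"}))
    print(weighted_loss(example_file_authors, w, {"bob"}))
```

In this made-up data each developer authors two of the four files, so an unweighted count treats them alike. With in-degree weights, `alice` accounts for 0.75 of the total weight and `bob` for 0.25, so only the departure of `alice` removes over half of the weighted knowledge.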
... Cosentino et al. computed the bus factor on files, directories, and branches for Git-based development [7] to perform a risk assessment. Jabrayilzade et al. [17] proposed a multimodal bus factor algorithm, integrating history in the version control system, code reviews, and meetings meta-data. ...
... Most works (except [1]) focus on a limited number of systems (e.g., [7,17,28]) and various evolving definitions of the bus factor. Our main contribution is a large-scale study on GitHub repositories leveraging the pony factor to derive data-driven insights. ...
... We claim that funding and incentive structures must be increased to ensure the middle- to long-term maintenance of packages developed by academic researchers, see Schönbrodt (2022) for an example proposal. Package longevity should be taken into account from the beginning of a project, for example by distributing competence over several researchers to improve the "bus factor" (Cosentino et al., 2015). ...
Preprint
Programming is ubiquitous in applied biostatistics; adopting software engineering skills will help biostatisticians do a better job. To explain this, we start by highlighting key challenges for software development and application in biostatistics. Silos between different statistician roles, projects, departments, and organizations lead to the development of duplicate and suboptimal code. Building on top of open-source software requires critical appraisal and risk-based assessment of the used modules. Code that is written needs to be readable to ensure reliable software. The software needs to be easily understandable for the user, as well as developed within testing frameworks to ensure that long term maintenance of the software is feasible. Finally, the reproducibility of research results is hindered by manual analysis workflows and uncontrolled code development. We next describe how the awareness of the importance and application of good software engineering practices and strategies can help address these challenges. The foundation is a better education in basic software engineering skills in schools, universities, and during the work life. Dedicated software engineering teams within academic institutions and companies can be a key factor for the establishment of good software engineering practices and catalyze improvements across research projects. Providing attractive career paths is important for the retainment of talents. Readily available tools can improve the reproducibility of statistical analyses and their use can be exercised in community events. [...]
Article
Community smells are negative patterns in software development teams’ interactions that impede their ability to successfully create software. Examples are team members working in isolation, lack of communication and collaboration across departments or sub-teams, or areas of the codebase where only a few team members can work on. Current approaches aim to detect community smells by analysing static network representations of software teams’ interaction structures. In doing so, they are insufficient to locate community smells within development processes. Extending beyond the capabilities of traditional social network analysis, we show that higher-order network models provide a robust means of revealing such hidden patterns and complex relationships. To this end, we develop a set of centrality measures based on the MOGen higher-order network model and show their effectiveness in predicting influential nodes using five empirical datasets. We then employ these measures for a comprehensive analysis of a product team at the German IT security company genua GmbH, showcasing our method’s success in identifying and locating community smells. Specifically, we uncover critical community smells in two areas of the team’s development process. Semi-structured interviews with five team members validate our findings: while the team was aware of one community smell and employed measures to address it, it was not aware of the second. This highlights the potential of our approach as a robust tool for identifying and addressing community smells in software development teams. More generally, our work contributes to the social network analysis field with a powerful set of higher-order network centralities that effectively capture community dynamics and indirect relationships.
Article
Open source software development is regarded as a collaborative activity in which developers interact to build a software product. Such a human collaboration is described as an organized effort of the “social” activity of organizations, individuals, and stakeholders, which can affect the development community and the open source project health. Negative effects of the development community manifest typically in the form of community smells, which represent symptoms of organizational and social issues within the open source software development community that often lead to additional project costs and reduced software quality. Recognizing the advantages of the early detection of potential community smells in a software project, we introduce a novel approach that learns from various community organizational, social, and emotional aspects to provide an automated support for detecting community smells. In particular, our approach learns from a set of interleaving organizational–social and emotional symptoms that characterize the existence of community smell instances in a software project. We build a multi‐label learning model to detect 10 common types of community smells. We use the ensemble classifier chain (ECC) model that transforms multi‐label problems into several single‐label problems, which are solved using genetic programming (GP) to find the optimal detection rules for each smell type. To evaluate the performance of our approach, we conducted an empirical study on a benchmark of 143 open source projects. The statistical tests of our results show that our approach can detect community smells with an average F‐measure of 93%, achieving a better performance compared to different state‐of‐the‐art techniques. Furthermore, we investigate the most influential community‐related metrics to identify each community smell type.
Chapter
Blockchain interoperability has gained importance in practice, is increasingly discussed in the literature, and serves as a basis for new use cases such as manufacturing and financial services. However, many of the blockchain interoperability solutions discussed in the literature are still in the design phase, are unpopular or have a small developer community. Therefore, this study proposes a comparison framework and examines implemented public blockchain interoperability solutions, focusing on data from published GitHub repositories. The results show that these implementations vary significantly in terms of popularity, their developer communities as well as their source code, indicating differences in quality. The insights gained in this work facilitate the selection of an appropriate implementation to enable blockchain interoperability use cases.
Keywords: Blockchain interoperability; Comparison framework; Empirical study; GitHub; Implementations
Article
Very few software projects are completed on time, on budget, and to their original specification, causing the global IT software industry to lose billions each year in project overruns and software rework. Research indicates that projects usually fail because of management mistakes rather than technical mistakes. Risk Management in software projects focuses on what the practitioner needs to know about risk in the pursuit of delivering successful software projects.
Conference Paper
When software repositories are mined, two distinct sources of information are usually explored: the history log and snapshots of the system. Results of analyses derived from these two sources are biased by the frequency with which developers commit their changes. We argue that the usage of mainstream SCM systems influences the way that developers work. For example, since it is tedious to resolve conflicts due to parallel commits, developers tend to minimize conflicts by not contemporarily modifying the same file. This however defeats one of the purposes of such systems. We mine repositories created by our Syde tool, which records every change by every developer in multi-developer projects. This new source of information can augment the accuracy of analyses and breaks new ground in terms of how such information can assist developers. In this paper we illustrate how the information we mine can help to provide a refined notion of code ownership. As a case study, we analyze the developers' activities of the development of a commercial system.
Conference Paper
Ownership is a key aspect of large-scale software development. We examine the relationship between different ownership measures and software failures in two large software projects: Windows Vista and Windows 7. We find that in all cases, measures of ownership such as the number of low-expertise developers, and the proportion of ownership for the top owner have a relationship with both pre-release faults and post-release failures. We also empirically identify reasons that low-expertise developers make changes to components and show that the removal of low-expertise contributions dramatically decreases the performance of contribution based defect prediction. Finally we provide recommendations for source code change policies and utilization of resources such as code inspections based on our results.
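The two kinds of ownership measure named in the abstract above (the top owner's proportion and the number of low-expertise developers) can be sketched from a list of commit authors for one component. This is an illustrative sketch only: the input format, the function name, and the 5% cut-off for a low-expertise contributor are assumptions here, not values taken from the abstract.

```python
from collections import Counter
from typing import Dict, List, Optional, Union

Measure = Dict[str, Union[Optional[str], float, int]]


def ownership_measures(commit_authors: List[str], minor_threshold: float = 0.05) -> Measure:
    """Top owner, the top owner's share of commits, and the count of minor contributors.

    A minor contributor is assumed to be anyone whose share of the component's
    commits is below `minor_threshold`.
    """
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return {"top_owner": None, "top_share": 0.0, "minor_contributors": 0}
    shares = {dev: n / total for dev, n in counts.items()}
    # Highest share first; ties broken alphabetically for determinism.
    top = min(shares, key=lambda dev: (-shares[dev], dev))
    minor = sum(1 for share in shares.values() if share < minor_threshold)
    return {"top_owner": top, "top_share": shares[top], "minor_contributors": minor}


# Made-up commit history for one component, for illustration only.
example_commits = ["alice"] * 45 + ["bob"] * 4 + ["carol"] * 1

if __name__ == "__main__":
    print(ownership_measures(example_commits))
```

In the made-up history, `alice` holds 45 of 50 commits and `carol` (1 of 50) falls under the assumed cut-off, so the component has one minor contributor.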
Conference Paper
Adapting new software processes and practices in organizational and academic environments requires training the developers and validating the applicability of the newly introduced activities. Investigating process conformance during training and understanding whether programmers are able and willing to follow the specific steps are crucial to evaluating whether the process improves various software product quality factors. In this paper we present a process-model-independent approach to detect process non-conformance. Our approach is based on non-intrusively collected data captured by a version control system and provides the project manager with timely updates. Further, we provide evidence of the applicability of our approach by investigating process conformance in a five-day training class on eXtreme Programming (XP) practices at the Leibniz Universität Hannover. Our results show that the approach enabled researchers to formulate minimally intrusive methods to check for conformance and that violations could be detected for the majority of the investigated XP practices.
Conference Paper
In spite of its potential relevance for managers, and even though the Truck Factor definition has been well known in the “agile world” for many years, shared and validated measurements, algorithms, tools, thresholds and empirical studies on this topic are still lacking. In this paper, we explore the situation by implementing the only approach proposed in the literature that is able to compute the Truck Factor. Then, using our tool, we conduct an exploratory study with 37 open source projects to discover limitations and drawbacks that could prevent its usage. Lessons learnt from the exploratory study and open issues are drawn at the end of this work. The most important lesson we have learnt is that more research is needed to render the notion of Truck Factor operative and usable.
Conference Paper
As systems evolve, their structure changes in ways not expected upfront. As time goes by, the knowledge of the developers becomes more and more critical for the process of understanding the system. That is, when we want to understand a certain issue of the system, we ask the knowledgeable developers. Yet, in large systems, not every developer is knowledgeable in all the details of the system. Thus, we want to know which developer is knowledgeable in the issue at hand. In this paper we make use of the mapping between the changes and the author identifiers (e.g., user names) provided by versioning repositories. We first define a measurement for the notion of code ownership. We use this measurement to define the ownership map visualization to understand when and how different developers interacted, in which way, and in which part of the system. We report the results we obtained on several large systems.
Article
The emerging discipline of software risk management is described. It is defined as an attempt to formalize the risk-oriented correlates of success into a readily applicable set of principles and practices. Its objectives are to identify, address, and eliminate risk items before they become either threats to successful software operation or major sources of software rework. The basic concepts are set forth, and the major steps and techniques involved in software risk management are explained. Suggestions for implementing risk management are provided.