Conference Paper

Evidence for the Pareto principle in Open Source Software Activity

Abstract

Numerous empirical studies analyse evolving open source software (OSS) projects and try to estimate the activity and effort in these projects. Most of these studies, however, focus on a limited set of artefacts, namely source code and defect data. In our research, we extend the analysis by also taking mailing list information into account. The main goal of this article is to find evidence for the Pareto principle in this context by studying how the activity of developers and users involved in OSS projects is distributed: it appears that most of the activity is carried out by a small group of people. Following the GQM paradigm, we provide evidence for this principle. We selected a range of metrics used in economics to measure inequality in the distribution of wealth, and adapted these metrics to assess how OSS project activity is distributed. Regardless of whether we analyse version repositories, bug trackers, or mailing lists, and for all three projects we studied, it turns out that the distribution of activity is highly imbalanced.
... In previous studies, researchers have assessed the distribution of contributions in open-source software and private projects, focusing on analyzing the Pareto principle [16][17][18]. Studies in academic contexts have focused on extracting metrics such as commits, merges, lines of code, number of modified files, types of modified files, time spent, and component and developer entropy [9,[12][13][14]. ...
... Regarding the inequality measures, they have been previously used and analyzed in open source software projects. Goeminne and Mens [16] used inequality indexes (the Hoover, Gini, and Theil index) to analyze the distribution of activity in open source projects. They analyzed the distribution of contributions for the number of commits, mails, new bug report submissions, comments added to existing bug reports, and changes to existing bug reports, mining the data from version repositories, mailing lists, and bug trackers. ...
... • Inequality indexes: We used the Hoover index, Gini index, and Theil index, as defined by Goeminne and Mens [16]. The value of these measures ranges from 0 to 1, with 0 indicating perfect equality and 1 perfect inequality. ...
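As a concrete illustration of these measures, the sketch below computes the three indexes in pure Python on a hypothetical list of per-developer commit counts. The formulas are the standard econometric definitions (with the Theil index normalised by log n so it falls in [0, 1]); they may differ in detail from the exact variants used in [16].

```python
import math

def gini(values):
    # Gini index via the sorted-rank formula:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending.
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

def hoover(values):
    # Hoover (Robin Hood) index: half the relative mean deviation.
    total = sum(values)
    mean = total / len(values)
    return sum(abs(x - mean) for x in values) / (2 * total)

def theil(values):
    # Theil T index, divided by log(n) so it also falls in [0, 1].
    n = len(values)
    mean = sum(values) / n
    t = sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n
    return t / math.log(n)

# Hypothetical commit counts: one dominant contributor and a long tail.
commits = [120, 30, 10, 5, 3, 2, 1, 1, 1, 1]
print(round(gini(commits), 2), round(hoover(commits), 2), round(theil(commits), 2))
# → 0.77 0.66 0.54
```

All three indexes approach 0 for a perfectly even distribution and 1 when a single contributor does all the work, which is what makes them usable as activity-imbalance measures.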
Article
Many software engineering courses are centered around team-based project development. Analyzing the source code contributions during the projects' development could provide both instructors and students with constant feedback to identify common trends and behaviors that can be improved during the courses. Evaluating course projects is a challenge due to the difficulty of measuring individual student contributions versus team contributions during the development. The adoption of distributed version control systems like git enables the measurement of students' and teams' contributions to the project. In this work, we analyze the contributions within eight software development projects, with 150 students in total, from undergraduate courses that used project-based learning. We generate visualizations of aggregated git metrics using inequality measures and the contribution per module, offering insights into the practices and processes followed by students and teams throughout the project development. This approach allowed us to identify inequality among students' contributions, the modules where students contributed, development processes with a non-steady pace, and integration practices, providing a useful feedback tool for instructors and students during the project's development. Further studies can be conducted to assess the quality, complexity, and ownership of the contributions by analyzing software artifacts.
... Table 1, projects in our dataset are mature, popular GitHub projects with a non-trivial code base, active maintenance, rich development history, and frequent Dependabot usage. We notice a long-tail distribution in the metrics concerning the size of the project, i.e., number of contributors, lines of code, and commit frequency, which is expected and common in most mining software repository (MSR) datasets [54,83,89]. ...
Preprint
Dependency update bots automatically open pull requests to update software dependencies on behalf of developers. Early research shows that developers are suspicious of updates performed by bots and feel tired of overwhelming notifications from these bots. Despite this, dependency update bots are becoming increasingly popular. Such contrast motivates us to investigate Dependabot, currently the most visible bot on GitHub, to reveal the effectiveness and limitations of state-of-the-art dependency update bots. We use exploratory data analysis and a developer survey to evaluate the effectiveness of Dependabot in keeping dependencies up-to-date, reducing update suspicion, and reducing notification fatigue. We obtain mixed findings. On the positive side, Dependabot is effective in reducing technical lag and developers are highly receptive to its pull requests. On the negative side, its compatibility scores are too scarce to be effective in reducing update suspicion; developers tend to configure Dependabot toward reducing the number of notifications; and 11.3% of projects have deprecated Dependabot in favor of other alternatives. Our findings reveal a large room for improvement in dependency update bots, which calls for effort from both bot designers and software engineering researchers.
... Contributions to FOSS being concentrated in the hands of a few hyper-productive participants has been noted since the origins of FOSS (Hill et al., 1992; Kuk, 2006), with more recent examples confirming this observation (Chełkowski et al., 2016). Whilst examining OS development patterns, some researchers found evidence for the so-called 'Pareto principle', whereby less than 20% of users make 80% or more of the contributions (Goeminne & Mens, 2011; Akond Rahman et al., 2018); others found less support for this principle (see Yamashita et al., 2015; Gasparini et al., 2020). Contribution patterns in our network were highly unequal, both when considered in aggregate, by firm domains (see Tables 7 and 8), with a few key organisational players massively contributing, and in terms of individual developers (see Table 9). ...
Article
Full-text available
The global economy’s digital infrastructure is based on free and open source software. To analyse how firms indirectly collaborate via employee contributions to developer-run projects, we propose a formal definition of ‘industrial public goods’ – inter-firm cooperation, volunteer and paid labour overlap, and participation inequality. We verify its empirical robustness by collecting networks of commits made by firm employees to active GitHub software repositories. Despite paid workers making more contributions, volunteers play a significant role. We find which firms contribute most, which projects benefit from firm investments, and identify distinct ‘contribution territories’ since the two central firms never co-contribute to top-20 repositories. We highlight the challenge posed by ‘Big Tech’ to the non-rival status of industrial public goods, thanks to cloud-based systems which resist sharing, and suggest there may be ‘contribution deserts’ neglected by large information technology firms, despite their importance for the open source ecosystem’s sustainability and diversity.
... The difference between university teachers and their teaching assistants in Personal Innovativeness in the adoption of and experimentation with information technology is small in terms of effect size. Based on raw data and the lack of reference cut-off points, we can only speculate that innovativeness in students and educators follows the well-known 80/20 Pareto principle [42], as observed in different domains of software application [43,44,45,46] but not by use of comparative measurement tools [47]; however, we can recognize that innovativeness and creativity are scarce even at the university. ...
Article
Two online surveys among 1105 university students and 656 employees were conducted with the inclusion of the construct Personal Innovativeness in the domain of Information Technologies (PIIT). After calculating descriptive statistics, statistically significant differences between the personal innovativeness of university students and teachers were sought by the application of one-way ANOVA. The first and most important finding was that the average perceived PIIT of teachers and students falls around the middle of the seven-point scale, which cannot be regarded as a plausible predictor of upgrading the University as an Innovative Ecosystem. The second was that university teachers scored higher than their students, a situation that could produce an expectancy conflict between those who want to work in an innovative way and those who would prefer study by the book. Teaching assistants, who should belong to the generation of digital natives, are only slightly more innovative than university teachers, who can be regarded as digital immigrants. If innovativeness can be upgraded by learning, then efforts should be made by University Management to encourage and support Personal Innovativeness (and other creativities as well) as a preferred teaching practice.
... Therefore, the management has to find efficient organisational structures for the division of labour [12,13], even though these communities are, typically, highly heterogeneous in dedication and skills [14]. Some developers only contribute a single time, while core-developers perform the majority of work [15]. From an extensive analysis of the network of task assignments, Zanetti et al. [16] found that over time developers in GENTOO tended to rely mainly on a single central contributor who became responsible for handling most of the tasks. ...
Article
Full-text available
We study the lock-in effect in a network of task assignments. Agents have a heterogeneous fitness for solving tasks and can redistribute unfinished tasks to other agents. They learn over time to whom to reassign tasks and preferably choose agents with higher fitness. A lock-in occurs if reassignments can no longer adapt. Agents overwhelmed with tasks then fail, leading to failure cascades. We find that the probability for lock-ins and systemic failures increase with the heterogeneity in fitness values. To study this dependence, we use the Shannon entropy of the network of task assignments. A detailed discussion links our findings to the problem of resilience and observations in social systems.
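The entropy measure referred to above can be illustrated with a short sketch: the Shannon entropy of the distribution of task assignments drops as assignments concentrate on a single agent. The task counts below are hypothetical, chosen only to contrast a balanced network with a centralised one.

```python
import math

def shannon_entropy(counts):
    # H = -sum(p * log2(p)) over each agent's share of the assigned tasks.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical task assignments over four agents.
balanced = [25, 25, 25, 25]      # tasks spread evenly
centralised = [85, 5, 5, 5]      # one central contributor
print(shannon_entropy(balanced))     # 2.0 bits, the maximum for 4 agents
print(shannon_entropy(centralised))  # ~0.85 bits: assignments concentrated
```

Low entropy thus flags the kind of reliance on a single central contributor that the abstract links to lock-in and failure cascades.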
... Researchers have found that a small proportion of the contributors to an OSS project make most of its code contributions (Avelino et al., 2016;Dinh-Trong & Bieman, 2005;Geldenhuys, 2010;Mockus et al., 2002). The findings of several studies in which small sample sizes of one to nine projects were used indicated that OSS projects follow the Pareto principle, which means that approximately 20% of the developers make about 80% of the contributions (Goeminne & Mens, 2011;Koch & Schneider, 2002;Robles et al., 2004). However, a more recent study that used a sample size of thousands of projects employed multiple heuristics for determining which contributors were core developers (Yamashita et al., 2015). ...
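A minimal sketch of how such an 80/20 check can be carried out on per-developer contribution counts (the commit numbers below are illustrative, not taken from the cited studies):

```python
def top_share(contributions, top_fraction=0.2):
    # Fraction of all contributions made by the top `top_fraction`
    # of contributors, ranked by contribution count.
    ranked = sorted(contributions, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# Illustrative commit counts for ten developers.
commits = [400, 350, 90, 40, 30, 25, 25, 20, 10, 10]
print(f"top 20% of developers made {top_share(commits):.0%} of commits")
# → top 20% of developers made 75% of commits
```

A share near or above 80% for the top 20% of developers is what the cited studies describe as Pareto-principle behaviour.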
Thesis
Full-text available
The study addresses the problem of the inadequacy of conventional software testing methods to detect all software defects. This problem affects software users and researchers due to poor software performance, reduced precision or accuracy of software output, and retractions of research publications. Detecting software defects may also be challenging due to the oracle problem. Existing research supports the metamorphic testing method's effectiveness for handling the oracle problem and finding software defects that conventional testing methods cannot detect. The research questions ask about the relationships among the use and acceptance of the metamorphic testing method among open-source developers and the constructs of performance expectancy and effort expectancy. The study's purpose was to examine these relationships. Another objective of the study was to understand how the variables of age, gender, and experience moderate these relationships. The guiding theoretical framework of the study was the unified theory of acceptance and use of technology. In this study, a quantitative methodology with a correlational design was employed. The participants were contributors to open-source software projects contained in the GitHub "Software in science" collection. The data were collected via an online survey instrument and analyzed using Spearman's rank-order correlation tests and moderated multiple regression analysis. Moderate to strong positive relationships were found between both performance expectancy and effort expectancy and the acceptance and use of metamorphic testing. This finding suggests that increasing the extent to which developers believe that metamorphic testing will improve their job performance and improving its ease of use will increase the adoption of metamorphic testing. The creation of interventions to educate developers on the use of metamorphic testing is recommended.
The study results support the applicability of the unified theory of acceptance and use of technology to software testing methods. Future research could involve studying the relationship between metamorphic testing adoption and other factors, such as social influence and facilitating conditions.
... To analyze the causal relationship quantitatively, research using a time lag is practical [8]. There have been several studies over time [14], [15]. Stewart et al. [16] investigated the impacts of organizational sponsorship and license restrictiveness on developers over time and concluded that the cues regarding the future of a project can be picked up from the available information. ...
Article
Full-text available
Open source software (OSS) has become indispensable to our society. The success of OSS depends on the participation of a large number of developers or maintainers (contributors). Shedding light on the mechanisms of their participation has been an important academic and practical matter. One factor in the decision to participate is the future prospects of a project. However, the causal mechanism behind participation has yet to be studied exhaustively and remains unclear. In this study, we used cryptocurrency projects, many of which were developed on GitHub, to better understand this mechanism. Both GitHub and cryptocurrencies are highly transparent, i.e., information is fully disclosed; we can analyze relevant information on a project, such as the contributors' activities, financial information, and development status. We adopted market capitalization as a proxy for the future prospects of a project and analyzed its relationship with the number of contributors using time series analysis techniques, such as the Granger causality test and regression. We found that the number of contributors increases two months after market capitalization increases. This quantifies the impact of the future prospects of a project, i.e., of the market capitalization of a cryptocurrency, on the participation of contributors.
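The study relies on a Granger causality test, which needs a statistics package; the sketch below illustrates only the underlying lead-lag idea in pure Python, correlating one monthly series against another shifted by a fixed lag. All series values are hypothetical, constructed so that contributor counts trail market capitalisation by exactly two months.

```python
def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def lagged_corr(leader, follower, lag):
    # Correlate leader[t] with follower[t + lag]: a high value suggests
    # the leader series moves first.
    return pearson(leader[:-lag], follower[lag:])

# Hypothetical monthly series.
market_cap = [1, 2, 4, 3, 5, 7, 6, 8, 9, 11]
contributors = [5, 5, 5, 6, 8, 7, 9, 11, 10, 12]
print(lagged_corr(market_cap, contributors, lag=2))  # close to 1.0
```

A proper Granger test additionally checks whether the leader improves prediction beyond the follower's own history, but the lagged correlation conveys the core intuition of the two-month lead the study reports.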
Article
Full-text available
Context: In a software project, properly analyzing the contributions of developers could provide valuable insights for decision-makers. The contributions of a developer could take many different forms, such as committing and reviewing code, and opening and resolving issues. Previous approaches mainly consider commit-based contributions, which provide an incomplete picture of developer contributions.
Objective: Different from the traditional commit-based approaches for analyzing developer contributions, we aim to provide a more holistic approach that reflects the rich set of software development activities using artifact traceability graphs.
Method: For analyzing the developer contributions, we propose a novel categorization of developers (Jacks, Mavens and Connectors) in a software project. We introduce a set of algorithms on artifact traceability graphs to identify key developers, recommend replacements for leaving developers, and evaluate knowledge distribution among developers.
Results: We evaluate our proposed algorithms on six open-source projects and demonstrate that the identified key developers match the top commenters up to 98%, recommended replacements are correct up to 91%, and identified knowledge distribution labels are on average 94% compatible with the baseline approaches.
Conclusions: The proposed algorithms using artifact traceability graphs for analyzing developer contributions could be used by software project decision-makers in several scenarios: (1) identifying different types of key developers; (2) finding a replacement developer in large teams; (3) evaluating the overall knowledge distribution among developers to take early precautions.
Article
Full-text available
Software repositories such as versioning systems, defect tracking systems, and archived communication between project personnel are used to help manage the progress of software projects. Software practitioners and researchers increasingly recognize the potential benefit of mining this information to support the maintenance of software systems, improve software design or reuse, and empirically validate novel ideas and techniques. Research is now proceeding to uncover ways in which mining these repositories can help to understand software development, to support predictions about software development, and to plan various evolutionary aspects of software projects. This chapter presents several analysis and visualization techniques to understand software evolution by exploiting the rich sources of artifacts that are available. Based on the data models that need to be developed to cover sources such as modification and bug reports, we describe how to use a Release History Database for evolution analysis. To that end, we present approaches to analyze developer effort for particular software entities. Further, we present change coupling analyses that can reveal hidden change dependencies among software entities. Finally, we show how to investigate architectural shortcomings over many releases and to identify trends in the evolution. Kiviat graphs can be effectively used to visualize such analysis results.
Article
Full-text available
Communication between developers plays a very central role in team-based software development for a variety of tasks, such as coordinating development and maintenance activities, discussing requirements for better comprehension, assessing alternative solutions to complex problems, and the like. However, the frequency of communication varies from time to time: sometimes developers exchange more messages with each other than at other times. In this paper, we investigate whether developer communication has any bearing on software quality by examining the relationship between communication frequency and the number of bugs injected into the software. The data used for this study is drawn from the bug database, version archive, and mailing lists of the JDT sub-project in ECLIPSE. Our results show a statistically significant positive correlation between communication frequency and the number of injected bugs in the software. We also noted that the communication levels of key developers in the project do not correlate with the number of injected bugs. Moreover, we show that defect prediction models can accommodate social aspects of the development process and potentially deliver more reliable results.
Article
Full-text available
While software metrics are commonly used to assess software maintainability and study software evolution, they are usually defined at a micro-level (method, class, package). Metrics should therefore be aggregated in order to provide insights into the evolution at the macro-level (system). In addition to traditional aggregation techniques such as the mean, econometric aggregation techniques such as the Gini index and the Theil index have recently been proposed. Advantages and disadvantages of the different aggregation techniques have not been evaluated empirically so far. In this paper we present the preliminary results of a comparative study of different aggregation techniques.
Article
Open source software has risen to prominence within the last decade, largely due to the success of well known projects such as the GNU/Linux operating system and the Apache web server, amongst others. Their significant commercial impact, with GNU/Linux reportedly running on 25% of server machines and Apache on 60% of web servers, has prompted many companies who use and who develop software to reassess their traditional modes of functioning. A number of companies such as IBM, HP and Sun have invested significantly in developing open source software. Much early written work on open source software development aimed at raising awareness and advocating its uptake. More recently the interest has been in quantifying and qualifying the advantages, disadvantages and other features of open source software. This paper aims to contribute in this second area. Most work on open source implicitly treats all projects as equivalent, for want of ways of classifying them. Benefits of 'typical' projects are claimed, with little attention to what constitutes a 'typical' project. In this paper we look at data available on SourceForge, a web site hosting upward of 30,000 open source projects, and characterise the distribution of projects. Considering the number of downloads per week of the software, we show that for the most part the data follows a Pareto-type distribution, i.e., there are a small number of exceptionally popular projects, most projects are much less popular, and the number of projects with more than a given number of downloads tails off. We offer explanations for this distribution and for the places where the actual distribution deviates from the model, and propose ways that these explanations could be tested. In particular there seem to be fewer than expected projects with a small number of weekly downloads.
Likely explanations would seem to be either that projects with a small number of downloads per week do not tend to use SourceForge, or that this small number of downloads indicates a low level of interest in the project and such projects are inherently unstable (either they die or become more popular). Two practical applications of this work are: first, it gives people or companies starting an open source project an idea of what a 'typical' project might entail; secondly, it enables analysis of best practice and benefits to be tied to some sort of classification of projects and allows questions such as how benefits scale with project size to be examined.
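The Pareto-type tail described above can be illustrated with a short sketch: for a synthetic Zipf-like sample of weekly download counts, the number of projects above a download threshold falls off roughly as 1/threshold. All numbers below are synthetic, not SourceForge data.

```python
def ccdf_counts(downloads, thresholds):
    # Number of projects with more than t weekly downloads, for each threshold.
    return [sum(1 for d in downloads if d > t) for t in thresholds]

# Synthetic Zipf-like sample: the rank-r project gets ~1000/r downloads,
# giving a few very popular projects and a long tail of obscure ones.
downloads = [int(1000 / rank) for rank in range(1, 1001)]
thresholds = [10, 100, 500]
for t, c in zip(thresholds, ccdf_counts(downloads, thresholds)):
    print(f"projects with > {t} downloads/week: {c}")  # 90, 9, 1
```

The tenfold drop in project count for each tenfold increase in threshold is the signature of the heavy-tailed distribution the paper fits to the SourceForge data.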
Article
CVS logs are a rich source of software trails (information left behind by the contributors to the development process, usually in the form of logs). This paper describes how softChange extracts these trails and enhances them. It also addresses some challenges that CVS fact extraction poses to researchers.