Conference Paper

Evidence for the Pareto principle in Open Source Software Activity

Authors: Mathieu Goeminne, Tom Mens

Abstract

Numerous empirical studies analyse evolving open source software (OSS) projects and try to estimate the activity and effort in these projects. Most of these studies, however, only focus on a limited set of artefacts, namely source code and defect data. In our research, we extend the analysis by also taking into account mailing list information. The main goal of this article is to find evidence for the Pareto principle in this context, by studying how the activity of developers and users involved in OSS projects is distributed: it appears that most of the activity is carried out by a small group of people. Following the GQM paradigm, we provide evidence for this principle. We selected a range of metrics used in economics to measure inequality in the distribution of wealth, and adapted these metrics to assess how OSS project activity is distributed. Regardless of whether we analyse version repositories, bug trackers, or mailing lists, and for all three projects we studied, it turns out that the distribution of activity is highly imbalanced.
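The abstract does not name the specific inequality metrics used, but the Gini coefficient is one of the standard economic measures of this kind, so a minimal sketch of how such a metric could be applied to per-developer activity counts might look as follows (the commit counts are hypothetical, not from the paper):

```python
import numpy as np

def gini(values):
    """Gini coefficient of non-negative activity counts: 0 means activity
    is spread perfectly evenly; values near 1 mean it is concentrated in
    very few contributors."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    # Standard closed form over the ascending-sorted values x_1..x_n:
    # G = (2 * sum(i * x_i)) / (n * sum(x)) - (n + 1) / n
    i = np.arange(1, n + 1)
    return float((2.0 * np.sum(i * x)) / (n * x.sum()) - (n + 1.0) / n)

# Hypothetical per-developer commit counts for a single project
commits = [412, 198, 90, 33, 12, 7, 5, 3, 2, 1, 1, 1]
print(f"Gini of commit activity: {gini(commits):.2f}")
```

The same function applies unchanged to bug-tracker or mailing-list activity counts, which is what makes such metrics convenient for comparing imbalance across artefact types.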
... Overall, our analysis reveals that various aspects of development activity on the HF Hub (e.g., interactions in model, dataset, and space repositories; collaboration in model repositories; and model adoption in spaces) exhibit right-skewed, Pareto distributions, which is a well-documented pattern in OSS development [23,24,25,26,27]. While the open model development life-cycle involves unique practices which differ from OSS development [22], such as model training and fine-tuning, the observed similarities in the overall patterns of activity suggest that future research on open source AI can benefit from drawing on the extensive, multidisciplinary literature on the social dynamics of OSS development. ...
... Numerous studies highlight that various types of activity in OSS development, such as discussions in mailing lists, bug-spotting in issue trackers, and commit activity, exhibit right-skewed, Pareto distributions [23,25,27]. Indeed, it is a well-documented observation that OSS development is typically characterised by the Pareto principle, commonly known as the 80/20 rule or the law of the vital few, which states that approximately 80% of effects come from 20% of causes [79]. ...
... Activity distributions follow power law patterns, with a small fraction of repositories accounting for most interactions (e.g., <1% for 80% of likes, 10% for 80% of discussions, 30% for 80% of commits, <1% for 80% of downloads). Similarly, the collaboration networks exhibit right-skewed centrality distributions, indicating that influence is concentrated amongst a few developers, congruent with prior observations that OSS development patterns generally follow Pareto distributions [23,24,25,26,27]. Influence also flows across the HF Hub, with likes per model having strong correlations with their usage in spaces (ρ = 0.66, p < 0.001). ...
Preprint
Full-text available
Open source developers have emerged as key actors in the political economy of artificial intelligence (AI), with open model development being recognised as an alternative to closed-source AI development. However, we still have a limited understanding of collaborative practices in open source AI. This paper responds to this gap with a three-part quantitative analysis of development activity on the Hugging Face (HF) Hub, a popular platform for building, sharing, and demonstrating models. First, we find that various types of activity across 348,181 model, 65,761 dataset, and 156,642 space repositories exhibit right-skewed distributions. Activity is extremely imbalanced between repositories; for example, over 70% of models have 0 downloads, while 1% account for 99% of downloads. Second, we analyse a snapshot of the social network structure of collaboration on models, finding that the community has a core-periphery structure, with a core of prolific developers and a majority of isolate developers (89%). Upon removing isolates, collaboration is characterised by high reciprocity regardless of developers' network positions. Third, we examine model adoption through the lens of model usage in spaces, finding that a minority of models, developed by a handful of companies, are widely used on the HF Hub. Overall, we find that various types of activity on the HF Hub are characterised by Pareto distributions, congruent with prior observations about OSS development patterns on platforms like GitHub. We conclude with a discussion of the implications of the findings and recommendations for (open source) AI researchers, developers, and policymakers.
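The core-periphery, isolate, and reciprocity measurements in this abstract map directly onto standard network-analysis primitives. A minimal sketch with networkx, using a small hypothetical collaboration graph (the preprint's actual data pipeline is not reproduced here):

```python
import networkx as nx

# Hypothetical directed collaboration graph: an edge u -> v means
# developer u contributed to a repository owned by developer v.
G = nx.DiGraph()
G.add_nodes_from(["a", "b", "c", "d", "e", "f"])   # "f" has no edges
G.add_edges_from([("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("d", "e")])

# Isolates are developers with no collaboration ties at all
isolates = list(nx.isolates(G))
print(f"isolate share: {len(isolates) / G.number_of_nodes():.0%}")

G.remove_nodes_from(isolates)                   # drop isolates, as in the study
print(f"reciprocity: {nx.reciprocity(G):.2f}")  # share of mutual directed edges
```

On this toy graph, four of the five remaining edges are reciprocated, so reciprocity is 0.80; the study reports analogous aggregate figures for the real HF Hub network.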
... This phenomenon has also been studied in the context of Open Source Software (OSS) development. Most of the studies investigated the principle by studying the patterns in the OSS community's ways of working (i.e., commits [1], communication [2], [3], and issue trackers), while others focused on modeling the distribution of code smells, architecture, data, and software defects [4], [5]. ...
... • Evidence for the Pareto principle in open source software activity [3] focuses on analyzing the activity of users in mailing lists on a sample of three OSS projects. This study is only loosely related to our work, since the studied object differs from ours. ...
... Contributions to FOSS being concentrated in the hands of a few hyper-productive participants has been noted since the origins of FOSS (Hill et al., 1992; Kuk, 2006), with more recent examples confirming this observation (Chełkowski et al., 2016). Whilst examining OS development patterns, some researchers found evidence for the so-called 'Pareto principle', whereby less than 20% of users make 80% or more of contributions (Goeminne & Mens, 2011; Akond Rahman et al., 2018); others found less support for this principle (see Yamashita et al., 2015; Gasparini et al., 2020). Contribution patterns in our network were highly unequal, both when considered in aggregate by firm domains (see Tables 7 and 8), with a few key organisational players contributing massively, and in terms of individual developers (see Table 9). ...
Article
Full-text available
The data economy depends on digital infrastructure produced in self-managed projects and communities. To understand how information technology (IT) firms communicate to a volunteer workforce, we examine IT firm and foundation employee discourses about open source. We posit that organizations employ rhetorical strategies to advocate for or resist changing the meaning of this institution. Our analysis of discourses collected at three open source professional conferences in 2019 is complemented by computational methods, which generate semantic clusters from presentation summaries. In terms of defining digital infrastructure, business models, and the firm-community relationship, we find a clear division between the discourses of large firm and consortia foundation employees, on one hand, and small firm and non-profit foundation employees, on the other. These divisions reflect these entities’ roles in the data economy and levels of concern about predatory “Big Tech” practices, which transform common goods to be shared into proprietary assets to be sold.
... Method. To analyze contributor backgrounds and numbers (Findings 1-2), we collected contribution data from GitHub and identified the core contributors as those collectively responsible for 80% of all commits (in line with past work [26,111]). We manually classified each contributor's background as SE-focused, ML-focused, or other (e.g., physics, finance), based on public self-description, professional title, and education history stated in their GitHub profile, LinkedIn profile, and personal or company websites, if available. ...
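The 80%-of-commits rule described in this snippet is straightforward to operationalise: rank contributors by commit count and take the smallest prefix that covers the threshold. A minimal sketch (the function name and input format are our own, not the study's):

```python
def core_contributors(commit_counts, threshold=0.80):
    """Smallest set of contributors collectively responsible for
    `threshold` of all commits, taken in descending order of activity."""
    ranked = sorted(commit_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(commit_counts.values())
    core, covered = [], 0
    for name, n in ranked:
        if covered >= threshold * total:
            break
        core.append(name)
        covered += n
    return core

# Hypothetical per-contributor commit counts
counts = {"alice": 520, "bob": 210, "carol": 90, "dave": 40, "erin": 15}
print(core_contributors(counts))  # ['alice', 'bob'] covers 730/875 ≈ 83%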
Preprint
Full-text available
Machine learning (ML) components are increasingly incorporated into software products, yet developers face challenges in transitioning from ML prototypes to products. Academic researchers struggle to propose solutions to these challenges and to evaluate interventions because they often do not have access to closed-source ML products from industry. In this study, we define and identify open-source ML products, curating a dataset of 262 repositories from GitHub, to facilitate further research and education. As a start, we explore six broad research questions related to different development activities and report 21 findings from a sample of 30 ML products from the dataset. Our findings reveal a variety of development practices and architectural decisions surrounding different types and uses of ML models that offer ample opportunities for future research innovations. We also find very little evidence of industry best practices such as model testing and pipeline automation within the open-source ML products, which leaves room for further investigation into their potential impact on the development and eventual end-user experience of the products.
Article
Full-text available
This article aims to evaluate accounting and financial statement fraud through the lens of Vilfredo Pareto's criterion of logical and non-logical behaviour. The introduction provides background on Pareto's life and work. In recent years, as psychology and the social sciences have studied human decision-making mechanisms, it has become clear that human behaviour is not always rational. In line with this view, Pareto argued that people are prone to non-logical behaviour and that this shapes the dynamics of society. According to Pareto, human behaviour can be examined in two categories, logical and non-logical, with non-logical behaviour being the more common. Pareto holds that non-logical behaviour rests on certain residues, and that these residues are legitimised through derivations. The article examines in detail the methods and behavioural categories Pareto used to analyse logical and non-logical behaviour.
Article
Process discovery techniques analyze process logs to extract models that characterize the behavior of business processes. In real-life logs, however, noise exists and adversely affects the extraction, thus decreasing the understandability of discovered models. In this paper, we propose a novel double granularity filtering method, executed on both the event and trace levels, to detect noise by analyzing the directly-following and parallel relations between events. Based on the probability of an event occurring in a sequence, the infrequent behaviors and redundant events in the logs can be filtered out. In addition, the missing events in parallel blocks are detected to further improve the performance of filtering. Experiments on synthetic logs and five real-life datasets demonstrate that our method significantly outperforms other state-of-the-art methods.
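The abstract does not reproduce the double granularity method itself; as a point of reference, the event-level half of such a filter, reduced to simple relative-frequency thresholding, can be sketched as follows (the directly-following and parallel-relation analysis that distinguishes the actual method is omitted):

```python
from collections import Counter

def drop_infrequent_events(traces, min_support=0.05):
    """Remove events whose relative frequency in the log falls below
    min_support. A crude frequency-only baseline, not the paper's
    double granularity filter."""
    counts = Counter(event for trace in traces for event in trace)
    total = sum(counts.values())
    keep = {e for e, c in counts.items() if c / total >= min_support}
    return [[e for e in trace if e in keep] for trace in traces]

log = [["a", "b", "c"], ["a", "b", "c"], ["a", "x", "b", "c"]]
print(drop_infrequent_events(log, min_support=0.2))
# [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]  ('x' filtered out)
```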
Article
Dependency management bots automatically open pull requests to update software dependencies on behalf of developers. Early research shows that developers are suspicious of updates performed by dependency management bots and feel tired of overwhelming notifications from these bots. Despite this, dependency management bots are becoming increasingly popular. Such contrast motivates us to investigate Dependabot, currently the most visible bot on GitHub, to reveal the effectiveness and limitations of state-of-the-art dependency management bots. We use exploratory data analysis and a developer survey to evaluate the effectiveness of Dependabot in keeping dependencies up-to-date, interacting with developers, reducing update suspicion, and reducing notification fatigue. We obtain mixed findings. On the positive side, projects do reduce technical lag after Dependabot adoption and developers are highly receptive to its pull requests. On the negative side, its compatibility scores are too scarce to be effective in reducing update suspicion; developers tend to configure Dependabot toward reducing the number of notifications; and 11.3% of projects have deprecated Dependabot in favor of other alternatives. The survey confirms our findings and provides insights into the key missing features of Dependabot. Based on our findings, we derive and summarize the key characteristics of an ideal dependency management bot, which can be grouped into four dimensions: configurability, autonomy, transparency, and self-adaptability.
Article
Objectives. Assessing team software development projects is notoriously difficult and typically based on subjective metrics. To help make assessments more rigorous, we conducted an empirical study to explore relationships between subjective metrics based on peer and instructor assessments, and objective metrics based on GitHub and chat data. Participants. We studied 23 undergraduate software teams (n = 117 students) from two undergraduate computing courses at two North American research universities. Study Method. We collected data on teams’ (a) commits and issues from their GitHub code repositories; (b) chat messages from their Slack and Microsoft Teams channels; (c) peer evaluation ratings from the CATME peer evaluation system, and (d) individual assignment grades from the courses. We derived metrics from (a) and (b) to measure both individual team members’ contributions to the team, and the equality of team members’ contributions. We then performed Pearson analyses to identify correlations among the metrics, peer evaluation ratings, and individual grades. Findings. We found significant positive correlations between team members’ GitHub contributions, chat contributions, and peer evaluation ratings. In addition, the equality of teams’ GitHub contributions was positively correlated with teams’ average peer evaluation ratings, and negatively correlated with the variance in those ratings. However, no such positive correlations were detected between the equality of teams’ chat contributions and their peer evaluation ratings. Conclusions. Our study extends previous research results by providing evidence that (a) team members’ chat contributions, like their GitHub contributions, are positively correlated with their peer evaluation ratings; (b) team members’ chat contributions are positively correlated with their GitHub contributions; and (c) the equality of teams’ GitHub contributions is positively correlated with their peer evaluation ratings. These results lend further support to the idea that combining objective and subjective metrics can make the assessment of team software projects more comprehensive and rigorous.
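The Pearson analyses described above are reproducible with standard tooling; a minimal sketch with hypothetical per-student data (the study's actual metric definitions are richer than raw counts):

```python
from scipy.stats import pearsonr

# Hypothetical per-student values: GitHub commits, chat messages,
# and peer evaluation ratings on a 1-5 scale
commits = [42, 17, 55, 8, 30, 23, 61, 12]
chats = [210, 95, 260, 40, 150, 120, 300, 70]
ratings = [4.5, 3.2, 4.8, 2.9, 4.0, 3.6, 4.9, 3.0]

for name, xs in [("commits", commits), ("chat", chats)]:
    r, p = pearsonr(xs, ratings)  # correlation coefficient and p-value
    print(f"{name} vs. peer rating: r = {r:.2f}, p = {p:.3f}")
```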
Article
The purpose of the study is to test the application of the Pareto principle to the research productivity of journals. Oncology was selected as the subject of study and data were extracted from the “Web of Science.” A series of keywords specifying Oncology and its sub-fields was derived from the Medical Subject Headings (MeSH). A string of 15 search terms (Lead, Related, and Narrow) was connected using the Boolean operator “OR” to retrieve results, limiting the scope to journal articles from India and Iran respectively. The results were not strictly in line with Pareto's 80/20 rule, but came close (75/25 in the case of India and 65/35 in the case of Iran). The results provide strong evidence that the Pareto principle fits the research productivity of journals to a great extent. The study could help libraries improve the efficiency of collection development and financial management policies. The results will be highly applicable to the acquisition of scholarly journals for libraries, especially for library consortia. This law will be highly useful for cost-benefit analysis (CBA) of serial publications and will help in “subscribing the maximum collection at the least cost.”
Article
Full-text available
Software repositories such as versioning systems, defect tracking systems, and archived communication between project personnel are used to help manage the progress of software projects. Software practitioners and researchers increasingly recognize the potential benefit of mining this information to support the maintenance of software systems, improve software design or reuse, and empirically validate novel ideas and techniques. Research is now proceeding to uncover ways in which mining these repositories can help to understand software development, to support predictions about software development, and to plan various evolutionary aspects of software projects. This chapter presents several analysis and visualization techniques to understand software evolution by exploiting the rich sources of artifacts that are available. Based on the data models that need to be developed to cover sources such as modification and bug reports, we describe how to use a Release History Database for evolution analysis. To that end, we present approaches for analyzing developer effort for particular software entities. Further, we present change coupling analyses that can reveal hidden change dependencies among software entities. Finally, we show how to investigate architectural shortcomings over many releases and identify trends in the evolution. Kiviat graphs can be effectively used to visualize such analysis results.
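Of the analyses listed, change coupling is the most mechanical: two entities are coupled when they repeatedly change in the same commit. A minimal sketch over a commit history given as sets of touched files (the input format is our own assumption, not the chapter's):

```python
from collections import Counter
from itertools import combinations

def change_coupling(commits):
    """Count, for every pair of files, how often both were modified in
    the same commit. High counts suggest hidden change dependencies."""
    pairs = Counter()
    for files in commits:
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs

history = [{"db.c", "db.h"}, {"db.c", "db.h", "ui.c"}, {"ui.c"}]
print(change_coupling(history).most_common(1))  # [(('db.c', 'db.h'), 2)]
```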
Article
Full-text available
Communication between developers plays a very central role in team-based software development for a variety of tasks such as coordinating development and maintenance activities, discussing requirements for better comprehension, assessing alternative solutions to complex problems, and the like. However, the frequency of communication varies from time to time: sometimes developers exchange more messages with each other than at other times. In this paper, we investigate whether developer communication has any bearing on software quality by examining the relationship between communication frequency and the number of bugs injected into the software. The data used for this study is drawn from the bug database, version archive, and mailing lists of the JDT sub-project in ECLIPSE. Our results show a statistically significant positive correlation between communication frequency and the number of injected bugs in the software. We also noted that the communication levels of key developers in the project do not correlate with the number of injected bugs. Moreover, we show that defect prediction models can accommodate social aspects of the development process and potentially deliver more reliable results.
Article
Open source software has risen to prominence within the last decade, largely due to the success of well known projects such as the GNU/Linux operating system and the Apache web server, amongst others. Their significant commercial impact, with GNU/Linux reportedly running on 25% of server machines and Apache on 60% of web servers, has prompted many companies who use and who develop software to reassess their traditional modes of functioning. A number of companies such as IBM, HP and Sun have invested significantly in developing open source software. Much early written work on open source software development aimed at raising awareness and advocating its uptake. More recently the interest has been in quantifying and qualifying the advantages, disadvantages and other features of open source software. This paper aims to contribute in this second area. Most work on open source implicitly treats all projects as equivalent, for want of ways of classifying them. Benefits of 'typical' projects are claimed, with little attention to what constitutes a 'typical' project. In this paper we look at data available on SourceForge, a web site hosting upward of 30,000 open source projects, and characterise the distribution of projects. Considering the number of downloads per week of the software, we show that for the most part the data follows a Pareto-type distribution, i.e. there are a small number of exceptionally popular projects, most projects are much less popular, and the number of projects with more than a given number of downloads tails off exponentially. We offer explanations for this distribution and for the places where the actual distribution deviates from the model, and propose ways that these explanations could be tested. In particular, there seem to be fewer projects with a small number of weekly downloads than expected. Likely explanations would seem to be either that projects with a small number of downloads per week do not tend to use SourceForge, or that this small number of downloads indicates a low level of interest in the project and such projects are inherently unstable (either they die or become more popular). Two practical applications of this work are: first, it is useful for people or companies starting an Open Source project to have an idea of what a 'typical' project might entail; second, it enables analysis of best practice and benefits to be tied to some sort of classification of projects, and allows questions such as how benefits scale with project size to be examined.
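The "number of projects with more than a given number of downloads" is the empirical complementary CDF, and a Pareto-type tail shows up as a straight line on log-log axes. A rough sketch of that check on synthetic data (a rigorous fit would use maximum-likelihood estimation, e.g. Clauset et al., rather than a log-log regression):

```python
import numpy as np

def ccdf_tail_slope(downloads, xmin=10):
    """Log-log slope of the empirical CCDF above xmin. For a Pareto-type
    tail this approximates the negated tail exponent; a rigorous fit
    would use MLE instead of least squares."""
    x = np.sort(np.asarray([d for d in downloads if d >= xmin], dtype=float))
    n = x.size
    ccdf = 1.0 - np.arange(n) / n          # P(X >= x_i) for sorted x
    slope, _ = np.polyfit(np.log(x), np.log(ccdf), 1)
    return slope

# Synthetic weekly download counts from a Pareto distribution (alpha = 2)
rng = np.random.default_rng(0)
xm, alpha = 10.0, 2.0
sample = xm * (rng.pareto(alpha, size=5000) + 1.0)
print(f"fitted log-log slope: {ccdf_tail_slope(sample):.2f}")  # close to -2
```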
Article
While software metrics are commonly used to assess software maintainability and study software evolution, they are usually defined at a micro-level (method, class, package). Metrics should therefore be aggregated in order to provide insights into the evolution at the macro-level (system). In addition to traditional aggregation techniques such as the mean, econometric aggregation techniques such as the Gini index and the Theil index have recently been proposed. The advantages and disadvantages of different aggregation techniques have not yet been evaluated empirically. In this paper we present the preliminary results of a comparative study of different aggregation techniques.
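For completeness, the Theil index mentioned above is as easy to compute as the Gini index shown earlier; a minimal sketch over hypothetical per-class metric values:

```python
import numpy as np

def theil(values):
    """Theil index of positive metric values (e.g. per-class LOC):
    0 means a perfectly even distribution, ln(n) is the maximum."""
    x = np.asarray(values, dtype=float)
    x = x[x > 0]                      # the index is undefined at zero
    mu = x.mean()
    return float(np.mean((x / mu) * np.log(x / mu)))

print(f"{theil([120, 80, 95, 2000, 60]):.2f}")  # dominated by one class
```

Unlike the Gini index, the Theil index is additively decomposable across subgroups (e.g. packages), which is one reason it is attractive for aggregating micro-level software metrics.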
Article
CVS logs are a rich source of software trails (information left behind by the contributors to the development process, usually in the form of logs). This paper describes how softChange extracts these trails and enhances them. It also addresses some challenges that CVS fact extraction poses to researchers.