Article

Analysing the 'biodiversity' of open source ecosystems: the GitHub case


Abstract

In nature, the diversity of species and genes in ecological communities affects the functioning of these communities. Biologists have found that more diverse communities appear to be more productive than less diverse ones. Moreover, such communities appear to be more stable in the face of perturbations. In this paper, we draw an analogy between ecological communities and Open Source Software (OSS) ecosystems, and we investigate the diversity and structure of OSS communities. To this end, we use the MSR 2014 challenge dataset, which includes data from the top-10 software projects for the top programming languages on GitHub. Our findings show that OSS communities on GitHub consist of three types of users (core developers, active users, and passive users). Moreover, we show that the percentage of core developers and active users does not change as the project grows and that the majority of members of large projects are passive users.


... This fact could have an effect on the set of commits that are marked as mutant in the study and potentially also reduce the number of committers labeled as mutant. Matragkas et al. (Matragkas et al. 2014) analyzed user activity in projects to cluster users into roles, investigating the structure of the ecosystem of open source communities on GitHub. In the study each repository is considered and referred to as a "project", regardless of whether it is a base repository or a fork of one (see Table 1 in Matragkas et al. (2014)). ...
... Matragkas et al. (Matragkas et al. 2014) analyzed user activity in projects to cluster users into roles, investigating the structure of the ecosystem of open source communities on GitHub. In the study each repository is considered and referred to as a "project", regardless of whether it is a base repository or a fork of one (see Table 1 in Matragkas et al. (2014)). The rationale behind this choice is that it is hard to determine if work done in a fork is collaboration with other repositories or independent work that will not be contributed to other repositories; hence it is safer to consider them as separate. ...
... We would like to thank the authors of Padhye et al. (2014) and Matragkas et al. (2014) for their valuable feedback regarding the evaluation of the impact of these perils on their research. We would also like to thank Margaret-Anne Storey for her invaluable help in the development of this paper. ...
Article
Full-text available
With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers mine the information stored in GitHub’s event logs to understand how its users employ the site to collaborate on software, but so far there have been no studies describing the quality and properties of the available GitHub data. We document the results of an empirical study aimed at understanding the characteristics of the repositories and users in GitHub; we see how users take advantage of GitHub’s main features and how their activity is tracked on GitHub and related datasets to point out misalignment between the real and mined data. Our results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. For example, we show that the majority of the projects are personal and inactive, and that almost 40% of all pull requests do not appear as merged even though they were. Also, approximately half of GitHub’s registered users do not have public activity, while the activity of GitHub users in repositories is not always easy to pinpoint. We use our identified perils to see if they can pose validity threats; we review selected papers from the MSR 2014 Mining Challenge and see if there are potential impacts to consider. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.
... Vasilescu et al. report that "everyone who does something in the project" (e.g., pushes code, submits pull requests, reports issues) is considered part of the community [67], while Yamashita et al. identify two kinds of users, core and non-core developers (where the former are granted write permission on the project while the latter are not) [23]. Other works [70], [71], [72], [73] provide more detailed structures by relying on the user experience, coding activity, popularity and actions on the platform (e.g., watching, forking, commenting, etc.). ...
... Stability of core developers. The proportion of core developers remains stable as the project gets larger [72], [23]. ...
Article
Full-text available
Context: GitHub, nowadays the most popular social coding platform, has become the reference for mining Open Source repositories, a growing research trend aiming at learning from previous software projects to improve the development of new ones. In recent years, a considerable number of research papers have been published reporting findings based on data mined from GitHub. As the community continues to deepen its understanding of software engineering thanks to the analysis performed on this platform, we believe it is worthwhile to reflect on how research papers have addressed the task of mining GitHub and what findings they have reported. Objective: The main objective of our work is to identify the quantity, topic and empirical methods of research works targeting the analysis of how software development practices are influenced by the use of a distributed social coding platform like GitHub. Method: A systematic mapping study was conducted with four research questions and assessed 80 publications from 2009 to 2016. Results: Most works focused on the interaction around coding-related tasks and project communities. We also identified some concerns about how reliable these results were, based on the fact that, overall, papers used small datasets, poor sampling techniques, employed a scarce variety of methodologies and/or were hard to replicate. Conclusions: Our study attested the high activity of research work around the field of Open Source collaboration, especially in the software domain, revealed a set of shortcomings and proposed some actions to mitigate them. We hope this work can also create the basis for additional studies on other collaborative activities (like book writing, for instance) that are also moving to GitHub.
... Besides, S. Daniel et al. performed a diversity-related analysis on SourceForge and found that diversity in project roles and experiences positively impacts project success [9]. Nicholas Matragkas et al. also analyzed the biodiversity of GitHub, and showed that the percentage of core developers and active users does not change as the project grows [10]. Some other studies examine the composition of Software Ecosystems instead of SECO health and biodiversity. ...
Article
Full-text available
In natural ecosystems, animal life-spans are determined by genes and some other biological characteristics. Similarly, software project life-spans are related to some internal or external characteristics. Analyzing the relations between these characteristics and the project life-span may help developers, investors, and contributors to control the development cycle of the software project. The paper provides an insight into the project life-span for a free open source software ecosystem. A statistical analysis of some project characteristics in GitHub is presented, and we find that the choice of programming language, the number of files, the label format of the project, and the relevant membership expressions can impact the life-span of a project. Based on these discovered characteristics, we also propose a prediction model to estimate the project life-span in open source software ecosystems. These results may help developers reschedule projects in open source software ecosystems.
... [6]), (2) its structure (e.g. [7], [8], [2], [9]), (3) its diversity (e.g. [10], [11], [12]), (4) the profile of the users in the community (e.g. ...
Conference Paper
Full-text available
Many open source projects stagnate after an initial push and end up fading away. In this talk we will argue that, most of the time, the reason has nothing to do with the quality of the software itself but with the project's inability to attract and support a healthy community around it. The community of contributors and also the users must take an active role. MDE tools are no exception to this challenge. We will review several actions and strategies that OSS project managers of MDE tools could put into practice to reverse this situation, mostly taken from other disciplines like social science, economics and political science.
Chapter
GitHub is the most common code hosting and repository service for open-source software (OSS) projects. Thanks to its great variety of features, researchers use GitHub to address a wide range of OSS development challenges. In this context, the authors thought that it was important to conduct a literature review of studies that used GitHub data. To find these studies, they based the review on a GitHub dataset source study instead of a keyword-based search in digital libraries. Since GHTorrent is the most widely known GitHub dataset according to the literature, they considered the studies that cite this dataset for the systematic literature review. They reviewed the 172 selected studies that used the dataset as a data source according to a set of criteria. They classified them within the scope of OSS development challenges using information extracted from the studies' metadata. They raise some issues about the dataset and identify the fields that attract the most attention, as well as open challenges that they encourage researchers to investigate.
Article
Full-text available
Understanding user behaviors and social relations in social media has been an important topic in Human-Computer Interaction research. In this paper, we look at an emerging form of social media, Event-based Social Networks (EBSNs), which support a special type of hybrid community where people are connected online to organize themselves for offline gatherings. EBSN users are labeled as either organizers or members on existing platforms, and their behavioral and relationship patterns in offline events have not been described systematically. To understand participation dynamics in EBSNs, we present an interview study with 12 Meetup users and categorize a variety of social roles beyond organizers and members. We identify that different types of organizers, classical leaders and delegated leaders, have different relationship patterns with various types of members, including active contributors, active followers, newcomers and occasional visitors. By comparing these roles with purely offline or online communities, we discuss how participation dynamics in EBSNs reflect the intertwined impacts of hybrid community and implications of our findings for technology designs.
Chapter
The mining of software archives has enabled new ways for increasing the productivity in software development: Analyzing software quality, mining project evolution, investigating change patterns and evolution trends, mining models for development processes, developing methods of integrating mined data from various historical sources, or analyzing natural language artifacts in software repositories, are examples of research topics. Software repositories include various data, ranging from source control systems, issue tracking systems, artifact repositories such as requirements, design and architectural documentation, to archived communication between project members. Practitioners and researchers have recognized the potential of mining these sources to support the maintenance of software, to improve their design or architecture, and to empirically validate development techniques or processes. We revisited software mining studies that were published in recent years in the top venues of software engineering, such as ICSE, ESEC/FSE, and MSR. In analyzing these software mining studies, we highlight different viewpoints: pursued goals, state-of-the-art approaches, mined artifacts, and study replicability. To analyze the mining artifacts, we (lexically) analyzed research papers of more than a decade. In terms of replicability we looked at existing work in the field in mining approaches, tools, and platforms. We address issues of replicability and reproducibility to shed light onto challenges for large-scale mining studies that would enable a stronger conclusion stability.
Conference Paper
Full-text available
During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve high-quality, interconnected data. The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year. In this paper, we present the dataset details and construction process and outline the challenges and research opportunities emerging from it.
Article
Diversity is a defining characteristic of global collectives facilitated by the Internet. Though substantial evidence suggests that diversity has profound implications for a variety of outcomes including performance, member engagement, and withdrawal behavior, the effects of diversity have been predominantly investigated in the context of organizational workgroups or virtual teams. We use a diversity lens to study the success of nontraditional virtual work groups exemplified by open source software (OSS) projects. Building on the diversity literature, we propose that three types of diversity (separation, variety, and disparity) influence two critical outcomes for OSS projects: community engagement and market success. We draw on the OSS literature to further suggest that the effects of diversity on market success are moderated by the application development stage. We instantiate the operational definitions of three forms of diversity to the unique context of open source projects. Using archival data from 357 projects hosted on SourceForge, we find that disparity diversity, reflecting variation in participants' contribution-based reputation, is positively associated with success. The impact of separation diversity, conceptualized as culture and measured as diversity in the spoken language and country of participants, has a negative impact on community engagement but an unexpected positive effect on market success. Variety diversity, reflected in dispersion in project participant roles, positively influences community engagement and market success. The impact of diversity on market success is conditional on the development stage of the project. We discuss how the study's findings advance the literature on antecedents of OSS success, expand our theoretical understanding of diversity, and present the practical implications of the results for managers of distributed collectives.
Article
Before contributing to a free or open source software project, understand the developers, leaders, and active users behind it. The computing world lauds many Free/Libre and open source software offerings for both their reliability and features. Successful projects such as the Apache httpd Web server and Linux operating system kernel have made FLOSS a viable option for many commercial organizations. While FLOSS code is easy to access, understanding the communities that build and support the software can be difficult. Despite accusations from threatened proprietary vendors, few continue to believe that open source programmers are all amateur teenaged hackers working alone in their bedrooms. But neither are they all part of robust, well-known communities like those behind Apache and Linux.
Biodiversity evaluation tools for European forests. Criteria and Indicators for Sustainable Forest Management at the Forest Management Unit Level
  • T Larsson