Conference Paper

On the topology of package dependency networks: A comparison of three programming language ecosystems


Abstract

Package-based software ecosystems are composed of thousands of interdependent software packages. Many empirical studies have focused on software packages belonging to a single software ecosystem, and suggest that their results generalise to other ecosystems. We claim that such a generalisation is not always possible, because the technical structure of software ecosystems can be very different, even if these ecosystems belong to the same domain. We confirm this claim through a study of three large and popular package-based programming language ecosystems: R’s CRAN archive network, Python’s PyPI distribution, and JavaScript’s NPM package manager. We study and compare the structure of their package dependency graphs and reveal some important differences that may make it difficult to generalise the findings of one ecosystem to another.


... To address these research questions, this study focuses on the PyPI ecosystem due to its widespread use in modern software development and the strong integration of libraries into its development practices [1,8,14]. We gather assigned URLs from the project page on PyPI, along with available details on the repository owner's type (individual or organization) and any donation platform links provided on GitHub. ...
... While many studies focus on specific ecosystem aspects, others examine broader structural features, such as dependencies [5,8], with particular attention to transitive dependencies [10,18,33]. However, the usage of donation platforms in dependency chains remains unexplored, creating a gap that this study intends to fill. ...
... To evaluate the usage of donation platforms within the PyPI ecosystem, we collect library information from different sources with various Python scripts. This requires first a comprehensive list of available libraries, which is provided via an endpoint. For detailed information about each library, including assigned URLs and dependency data, an additional endpoint is used. ...
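Dependency data gathered this way typically arrives as `requires_dist` strings in PyPI's JSON metadata. As a minimal sketch (not the authors' actual scripts), the function below extracts the names of hard direct dependencies from such entries, skipping optional extras; the example entries are hypothetical.

```python
import re

def direct_dependencies(requires_dist):
    """Extract direct dependency names from PyPI ``requires_dist`` entries.

    Entries look like ``"requests (>=2.25.0)"`` or
    ``"idna (>=2.5) ; extra == 'socks'"``; dependencies guarded by an
    ``extra ==`` environment marker are optional and skipped here.
    """
    names = []
    for spec in requires_dist or []:
        base, _, marker = spec.partition(";")
        if "extra ==" in marker:  # optional dependency, not a hard edge
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", base.strip())
        if match:
            names.append(match.group(0).lower())
    return names

# hypothetical metadata for one library
assert direct_dependencies(
    ["requests (>=2.25.0)", "idna (>=2.5) ; extra == 'socks'"]
) == ["requests"]
```

Running this over every library's metadata yields the edge list of the ecosystem's dependency graph.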
Preprint
Software systems rely heavily on open source software (OSS) libraries, which offer benefits but also pose risks. When vulnerabilities arise, the OSS community may struggle to address them due to inactivity or lack of resources. Research highlights the link between OSS maintenance and financial support. To sustain the OSS ecosystem, maintainers should register on donation platforms and link these profiles on their project pages, enabling financial support from users and industry stakeholders. However, a detailed study on donation platform usage in OSS is missing. This study analyzes the adoption of donation platforms in the PyPI ecosystem. For each PyPI library, we retrieve assigned URLs, dependencies, and, when available, owner type and GitHub donation links. Using PageRank, we analyze different subsets of libraries from both a library and dependency chain perspective. Our findings reveal that donation platform links are often omitted from PyPI project pages and instead listed on GitHub repositories. GitHub Sponsors is the dominant platform, though many PyPI-listed links are outdated, emphasizing the need for automated link verification. Adoption rates vary significantly across libraries and dependency chains: while individual PyPI libraries show low adoption, those used as dependencies have much higher usage. This suggests that many dependencies actively seek financial support, benefiting developers relying on PyPI libraries.
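The PageRank analysis mentioned above is straightforward to sketch. The following is a minimal power-iteration implementation over a toy dependency graph, not the study's actual code; the library names are hypothetical, and an edge A -> B means A depends on B, so libraries deep in many dependency chains accumulate rank.

```python
def pagerank(graph, damping=0.85, iters=100):
    """Power-iteration PageRank over ``graph``: node -> list of successors."""
    nodes = set(graph)
    for succs in graph.values():
        nodes.update(succs)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for node in nodes:
            succs = graph.get(node, [])
            if succs:
                share = damping * rank[node] / len(succs)
                for v in succs:
                    new[v] += share
            else:
                # dangling node: distribute its rank uniformly
                for v in nodes:
                    new[v] += damping * rank[node] / n
        rank = new
    return rank

deps = {  # hypothetical PyPI-like dependency edges (A depends on B)
    "app-a": ["requests"],
    "app-b": ["requests", "numpy"],
    "requests": ["urllib3"],
}
ranks = pagerank(deps)
assert ranks["requests"] > ranks["app-a"]  # depended-upon beats leaf apps
assert ranks["urllib3"] > ranks["numpy"]
```

Ranking libraries this way highlights dependencies whose financial sustainability matters to many downstream developers.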
... It remained unexplored until now whether the references indicate genuine dependencies. In contrast, the dependencies between software packages across various programming languages have been extensively analyzed in various studies [5,16,11]. This research has shed light on the level of trust placed in popular software libraries and their influence on other packages. ...
... Dependencies in software packages. Decan et al. [5] first probed this area, comparing the package topologies of the Python, R, and Javascript languages. The key takeaway was that there are considerable differences in how package dependencies are treated, each with distinct security implications. ...
... To better understand the threat that a vulnerability in a high-reach device could pose to its surroundings, we measured how many products it may propagate to. In the component-reuse sub-graph, we selected all weakly-connected components with at least 10 certified products (15 different components in total). In each of the components, we monitored the product with the highest reach and manually annotated all its incoming transitive references with labels from the fine-grained categorization. ...
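The notion of reach used here can be illustrated with a short breadth-first traversal over incoming references. This is only a sketch of the idea, with hypothetical product identifiers, not the study's method.

```python
from collections import deque

def reach(references, target):
    """Count products from which ``target`` is transitively reachable.

    ``references`` maps a product to the products it references (e.g. a
    composite product referencing a certified chip). The reach of
    ``target`` is the number of products whose reference chains lead to
    it, i.e. the products a compromise of ``target`` could affect.
    """
    # invert the graph: who references each product (incoming edges)
    incoming = {}
    for src, dsts in references.items():
        for dst in dsts:
            incoming.setdefault(dst, set()).add(src)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for src in incoming.get(node, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return len(seen)

refs = {"prod-a": ["chip-x"], "prod-b": ["prod-a"], "prod-c": ["prod-b"]}
assert reach(refs, "chip-x") == 3  # a, b and c all transitively rely on chip-x
```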
Chapter
Full-text available
With 5394 security certificates of IT products and systems, the Common Criteria for Information Technology Security Evaluation have bred an ecosystem entangled with various kinds of relations between the certified products. Yet, the prevalence and nature of dependencies among Common Criteria certified products remains largely unexplored. This study devises a novel method for building the graph of references among the Common Criteria certified products, determining the different contexts of references with a supervised machine-learning algorithm, and measuring how often the references constitute actual dependencies between the certified products. With the help of the resulting reference graph, this work identifies just a dozen certified components that are relied on by at least 10% of the whole ecosystem -- making them a prime target for malicious actors. The impact of their compromise is assessed and potentially problematic references to archived products are discussed.
... Thus far, the literature (Goswami et al. 2020; Maste 2017; Lamb and Zacchiroli 2021) considered build reproducibility specific to an individual project. However, in the ecosystem settings of open-source distributions, there are strong socio-technical dependencies between packages (Decan et al. 2016). Similar to how a package might "inherit" a vulnerability by depending on a vulnerable library package (Zerouali et al. 2018), theoretically, a package's build could be unreproducible because a package it depends on at build time is not reproducible. ...
... This is an example of the three ecosystem-related external factors potentially impacting a package's build reproducibility that we analyze in this RQ. The three external factors are inspired by previous work on software ecosystems (Decan et al. 2016): the distribution in which a package is released; the build dependencies of a package; and the hardware architecture on which the package is compiled. ...
... The following paragraphs explain how we collected the data necessary to investigate each external factor, also showing the total number of packages available for analysis in each case. Table 5 lists, inspired by the work of Decan et al. (2016) on the socio-technical dependencies between packages, the external factors taken into consideration for our study. Furthermore, the table describes the detailed steps (third column) performed to collect data related to each external factor. ...
Article
Full-text available
Context A reproducible build occurs if, given the same source code, build instructions, and build environment (i.e., installed build dependencies), compiling a software project repeatedly generates the same build artifacts. Reproducible builds are essential to identify tampering attempts responsible for supply chain attacks, with most of the research on reproducible builds considering build reproducibility as a project-specific issue. In contrast, modern software projects are part of a larger ecosystem and depend on dozens of other projects, which begs the question of to what extent build reproducibility of a project is the responsibility of that project or perhaps something forced on it. Objective This empirical study aims at analyzing reproducible and unreproducible builds in Linux Distributions to systematically investigate the process of making builds reproducible in open-source distributions. Our study targets builds performed on 11,528 and 597,066 Arch Linux and Debian packages, respectively. Method We compute the likelihood of unreproducible packages becoming reproducible (and vice versa) and identify the root causes behind unreproducible builds. Finally, we compute the correlation between the reproducibility status of packages and three ecosystem factors (i.e., factors outside the control of a given package). Results Arch Linux packages become reproducible a median of 30 days quicker when compared to Debian packages, while Debian packages remain reproducible for a median of 68 days longer once fixed. We identified a taxonomy of 16 root causes of unreproducible builds and found that the build reproducibility status of a package across different hardware architectures is statistically significantly different (strong effect size). At the same time, the status also differs between versions of a package for different distributions and depends on the build reproducibility of a package's build dependencies, albeit with weaker effect sizes.
Conclusions The ecosystem a project belongs to plays an important role w.r.t. the project's build reproducibility. Since these factors are outside a developer's control, future work on (fixing) unreproducible builds should consider these ecosystem influences.
... Human migration [6], urban mobility [7], population size estimation [8], the brain [9], missing person searching [10], and social interaction [11] are only some of many application examples [12,13]. In particular, the analysis of free and open-source software (FOSS) [14,15] ecosystems under a network perspective has been conducted on multiple FOSS languages' libraries, including Fortran, Perl and Javascript [16,17], SQL and Linux [18], Python [17], and R [16,19,20,21,22]. FOSS ecosystems are exciting examples of human-made complex systems due to the multiple levels of structure they exhibit: functions and objects within a software package [18,23,24] and interactions between packages [17,21,25,26], repositories [20], and maintainers or developers [27,28,29]. ...
... In particular, the analysis of free and open-source software (FOSS) [14,15] ecosystems under a network perspective has been conducted on multiple FOSS languages' libraries, including Fortran, Perl and Javascript [16,17], SQL and Linux [18], Python [17], and R [16,19,20,21,22]. FOSS ecosystems are exciting examples of human-made complex systems due to the multiple levels of structure they exhibit: functions and objects within a software package [18,23,24] and interactions between packages [17,21,25,26], repositories [20], and maintainers or developers [27,28,29]. All of these elements can be characterized through a CNA framework. ...
Article
We analyze the evolution of the main package library of the programming language R, a free and open-source software used in Statistics, Economics, Machine Learning, Geography, and many other fields. R-packages are self-contained pieces of the software that can relate to each other through dependency and suggestion relationships, giving rise to empirical collaborative networks that have grown significantly in the last twenty years. The dependency network connects two packages if one requires another, and the suggestion network connects packages if there are examples using them together. Each network’s structure is composed by two main groups: the biggest connected component (BCC) and the set of independent packages, isolated from the rest. We characterize how new packages enter the network in terms of the number of connections they incorporate, and the packages they connect to. The number of incorporated connections follows a log-normal distribution, whose scale is linear on the fraction of packages in the BCC. We characterize to which packages the incomers connect to in terms of preferential attachment, finding super-linear preferential attachment in both networks. We provide a detailed characterization of the network’s evolution, and point possible links to the history of the R community. The constructed dataset with the networks at different times is freely available through a public repository.
... 2) Accuracy. Existing work [49,64,66,77] only conducts dependency-based analysis by identifying transitive dependencies via reachability reasoning while neglecting the NPM-specific dependency resolution rules [16], which can lead to inaccurate results. 3) Efficiency. ...
... Lots of work [49,64,66,77] has been carried out to investigate transitive dependencies in the NPM ecosystem. However, none of them have taken into consideration the platform-specific dependency resolution rules [6], which could result in inaccurate dependencies being resolved (illustrated by the example shown in Figure 2). ...
... Wittern et al. [77] investigated the NPM ecosystem from several aspects (e.g., library dependency, download metrics). Decan et al. [49,50,53] conducted comparison studies of different ecosystems, and they [48] also recommend semantic versions by the wisdom of the crowd. Similarly, Kikas et al. [62] analyzed the dependency network and evolution of three ecosystems (i.e., NPM, Ruby, and Rust). ...
Preprint
Full-text available
Third-party libraries with rich functionalities facilitate the fast development of Node.js software, but also bring new security threats that vulnerabilities could be introduced through dependencies. In particular, the threats could be excessively amplified by transitive dependencies. Existing research either considers direct dependencies or reasons about transitive dependencies based on reachability analysis, which neglects the NPM-specific dependency resolution rules, resulting in wrongly resolved dependencies. Consequently, further fine-grained analysis, such as vulnerability propagation and their evolution in dependencies, cannot be carried out precisely at a large scale, as well as deriving ecosystem-wide solutions for vulnerabilities in dependencies. To fill this gap, we propose a knowledge graph-based dependency resolution, which resolves the dependency relations of dependencies as trees (i.e., dependency trees), and investigates the security threats from vulnerabilities in dependency trees at a large scale. We first construct a complete dependency-vulnerability knowledge graph (DVGraph) that captures the whole NPM ecosystem (over 10 million library versions and 60 million well-resolved dependency relations). Based on it, we propose DTResolver to statically and precisely resolve dependency trees, as well as transitive vulnerability propagation paths, by considering the official dependency resolution rules. Based on that, we carry out an ecosystem-wide empirical study on vulnerability propagation and its evolution in dependency trees. Our study unveils many useful findings, and we further discuss the lessons learned and solutions for different stakeholders to mitigate the vulnerability impact in NPM. For example, we implement a dependency tree based vulnerability remediation method (DTReme) for NPM packages, which achieves much better performance than the official tool (npm audit fix).
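The platform-specific resolution rules contrasted here with plain reachability can be illustrated with a simplified caret-range matcher. This sketch implements only npm's `^x.y.z` rule (updates allowed up to the left-most non-zero version component) and assumes plain `x.y.z` version strings; real npm ranges are considerably richer.

```python
def parse(v):
    """Parse a plain ``x.y.z`` version string into a comparable tuple."""
    return tuple(int(x) for x in v.split("."))

def satisfies(version, caret_range):
    """Check an npm caret range: ``^1.2.3`` allows changes that do not
    modify the left-most non-zero component (so ``^0.2.3`` pins 0.2.x)."""
    want = parse(caret_range.lstrip("^"))
    got = parse(version)
    if got < want:
        return False
    for i, part in enumerate(want):
        if part != 0:  # everything left of this component is pinned
            return got[:i] == want[:i] and got[i] == want[i]
    return got == want

def resolve(available, caret_range):
    """Pick the highest available version satisfying the range."""
    ok = [v for v in available if satisfies(v, caret_range)]
    return max(ok, key=parse) if ok else None

assert satisfies("1.9.0", "^1.2.0")
assert not satisfies("2.0.0", "^1.2.0")      # major bump excluded
assert not satisfies("0.3.0", "^0.2.3")      # 0.x: minor acts like major
assert resolve(["1.2.0", "1.9.0", "2.0.0"], "^1.2.0") == "1.9.0"
```

Naive reachability reasoning would treat all three versions of the dependency as reachable, whereas range-aware resolution selects exactly one, which is why ignoring these rules mis-attributes vulnerable versions.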
... IIS bugs appear on average 16.46% and 14.96% in R and Python programs, respectively. Even though Python has more packages, the dependencies between R packages are higher (Decan et al. 2016), which results in a similar number of IIS bugs. Other sources of IIS bugs are more general among R and Python developers, such as changing operating systems or changing the underlying graphics library ggplot2 so3 (2019b). ...
... Library developers tend to reuse already available packages, and these changes propagate to dependent packages. These dependencies between packages are more severe in R programs (Decan et al. 2016), and the number of isolated Python libraries that are not dependent on others is much higher. This suggests that there is a need for a better dependency analysis tool for R programs. ...
Preprint
R and Python are among the most popular languages used in many critical data analytics tasks. However, we still do not fully understand the capabilities of these two languages w.r.t. bugs encountered in data analytics tasks. What types of bugs are common? What are the main root causes? What is the relation between bugs and root causes? How can these bugs be mitigated? We present a comprehensive study of 5,068 Stack Overflow posts, 1,800 bug fix commits from GitHub repositories, and several GitHub issues of the most used libraries to understand bugs in R and Python. Our key findings include: while both R and Python have bugs due to inexperience with data analysis, Python sees significantly more data preprocessing bugs compared to R. Developers experience significantly more data flow bugs in R because intermediate results are often implicit. We also found that changes and bugs in packages and libraries cause more bugs in R compared to Python, while package or library misselection and conflicts cause more bugs in Python than R. While R has a slightly higher readability barrier for data analysts, the statistical power of R leads to fewer bad-performance bugs. In terms of data visualization, R packages have significantly more bugs than Python libraries. We also identified a strong correlation between comparable packages in R and Python despite their linguistic and methodological differences. Lastly, we contribute a large dataset of manually verified R and Python bugs.
... We argue that it is important to study other software ecosystems to contrast with npm and draw more generalizable empirical evidence about vulnerabilities in software ecosystems. Our argument is supported by previous studies (e.g., Bogart et al. 2015; Decan et al. 2016, 2017) that show differences across ecosystems. For instance, Decan et al. (2017) found that the PyPI ecosystem has a less complex and intertwined network than ecosystems such as npm and CRAN. ...
... A few studies conducted a comparison across software ecosystems. Decan et al. (2016) empirically compared the dependency network evolution in 7 ecosystems (including npm). They discovered some differences across ecosystems that can be attributed to ecosystems' policies. ...
Article
Full-text available
Software ecosystems play an important role in modern software development, providing an open platform of reusable packages that speed up and facilitate development tasks. However, this level of code reusability supported by software ecosystems also makes the discovery of security vulnerabilities much more difficult, as software systems depend on an increasingly high number of packages. Recently, security vulnerabilities in the npm ecosystem, the ecosystem of Node.js packages, have been studied in the literature. As different software ecosystems embody different programming languages and particularities, we argue that it is also important to study other popular programming languages to build stronger empirical evidence about vulnerabilities in software ecosystems. In this paper, we present an empirical study of 1,396 vulnerability reports affecting 698 Python packages in the Python ecosystem (PyPI). In particular, we study the propagation and life span of security vulnerabilities, accounting for how long they take to be discovered and fixed. In addition, vulnerabilities in packages may affect software projects that depend on them (dependent projects), making them vulnerable too. We study a set of 2,224 GitHub Python projects to better understand the prevalence of vulnerabilities in their dependencies and how long it takes to update them. Our findings show that the discovered vulnerabilities in Python packages are increasing over time, and they take more than 3 years to be discovered. A large portion of these vulnerabilities (40.86%) are only fixed after being publicly announced, giving attackers ample time for exploitation. Moreover, we find that more than half of the dependent projects rely on at least one vulnerable package, taking a considerably long time (7 months) to update to a non-vulnerable version. We find similarities in some characteristics of vulnerabilities in PyPI and npm, as well as divergences that can be attributed to specific PyPI policies.
By leveraging our findings, we provide a series of implications that can help the security of software ecosystems by improving the process of discovering, fixing and managing package vulnerabilities.
... npm is the main package manager for the JavaScript programming language, with more than one million packages. An estimated 97% of web applications come from npm [1], making it the most extensive dependency network [9]. We employed mixed methods to identify and analyze the types of manifesting breaking changes (changes in a provider release that render the client's build defective) and how client packages deal with them in their projects. ...
... file. This field allows developers to manually specify the Node.js version that runs the associated code with the build of a specific release. ...
Preprint
Full-text available
Complex software systems have a network of dependencies. Developers often configure package managers (e.g., npm) to automatically update dependencies with each publication of new releases containing bug fixes and new features. When a dependency release introduces backward-incompatible changes, commonly known as breaking changes, dependent packages may not build anymore. This may indirectly impact downstream packages, but the impact of breaking changes and how dependent packages recover from these breaking changes remain unclear. To close this gap, we investigated the manifestation of breaking changes in the npm ecosystem, focusing on cases where packages' builds are impacted by breaking changes from their dependencies. We measured the extent to which breaking changes affect dependent packages. Our analyses show that around 12% of the dependent packages and 14% of their releases were impacted by a breaking change during updates of non-major releases of their dependencies. We observed that, from all of the manifesting breaking changes, 44% were introduced both in minor and patch releases, which in principle should be backward compatible. Clients recovered themselves from these breaking changes in half of the cases, most frequently by upgrading or downgrading the provider's version without changing the versioning configuration in the package manager. We expect that these results help developers understand the potential impact of such changes and recover from them.
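The distinction between major, minor, and patch releases that this analysis builds on can be sketched as a small classifier. Versions are assumed to be plain `x.y.z` strings (pre-release tags are ignored), and the example versions are hypothetical.

```python
def bump_type(old, new):
    """Classify the update from ``old`` to ``new`` as major/minor/patch.

    Under semantic versioning only a major bump may break clients, so a
    breaking change observed in a minor or patch release is a violation
    of the backward-compatibility contract.
    """
    o = [int(x) for x in old.split(".")]
    n = [int(x) for x in new.split(".")]
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"

assert bump_type("1.4.2", "2.0.0") == "major"
assert bump_type("1.4.2", "1.5.0") == "minor"  # should be backward compatible
assert bump_type("1.4.2", "1.4.3") == "patch"
```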
... While its relative popularity has varied over the years, the increasing emphasis on statistical analyses in academia and industry has been evidenced by R's increasing absolute rank in the TIOBE index. Inspired by research such as [2]-[6], our prior research [7] on Python's PyPI, and professional experience with software development and information security, we seek in this paper to empirically describe the package ecosystem of this important language. Unlike extant literature, we analyze both complete historical package metadata and package contents, providing a more comprehensive understanding of releases, authors, licenses, dependencies, and other trends in package source and metadata over time. ...
... For example, while ggplot2 2.2.1 exists as both a GitHub release and on CRAN, we count it as a single unique package release for 2016. Further, much of the activity on GitHub may relate to forks of popular packages as users contribute via GitHub pull request workflows. Such activities are an important part of a healthy open-source community, but may result in overcounting R activity at the package level. ...
Preprint
Full-text available
In this research, we present a comprehensive, longitudinal empirical summary of the R package ecosystem, including not just CRAN, but also Bioconductor and GitHub. We analyze more than 25,000 packages, 150,000 releases, and 15 million files across two decades, providing comprehensive counts and trends for common metrics across packages, releases, authors, licenses, and other important metadata. We find that the historical growth of the ecosystem has been robust under all measures, with a compound annual growth rate of 29% for active packages, 28% for new releases, and 26% for active maintainers. As with many similar social systems, we find a number of highly right-skewed distributions with practical implications, including the distribution of releases per package, packages and releases per author or maintainer, package and maintainer dependency in-degree, and size per package and release. For example, the top five packages are imported by nearly 25% of all packages, and the top ten maintainers support packages that are imported by over half of all packages. We also highlight the dynamic nature of the ecosystem, recording both dramatic acceleration and notable deceleration in the growth of R. From a licensing perspective, we find a notable majority of packages are distributed under copyleft licensing or omit licensing information entirely. The data, methods, and calculations herein provide an anchor for public discourse and industry decisions related to R and CRAN, serving as a foundation for future research on the R software ecosystem and "data science" more broadly.
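The growth rates quoted above follow the standard compound-annual-growth-rate formula. The sketch below uses hypothetical package counts, not the paper's data, to show how such a figure is computed.

```python
def cagr(start, end, years):
    """Compound annual growth rate: the constant yearly rate that takes
    ``start`` to ``end`` over ``years`` years,
    i.e. (end / start) ** (1 / years) - 1."""
    return (end / start) ** (1.0 / years) - 1.0

# hypothetical: 100 active packages growing at 29%/year over two decades
final = 100 * (1 + 0.29) ** 20
assert abs(cagr(100, final, 20) - 0.29) < 1e-9
```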
... Wittern et al. [30] performed the first large-scale analysis of the npm ecosystem, revealing its evolution through direct dependencies. Decan et al. [31]- [33] compared dependency graph evolution in various programming ecosystems. Jens et al. [34] reviewed dependency declarations across 17 ecosystems and categorized version classifications. ...
Preprint
As the default package manager for Node.js, npm has become one of the largest package management systems in the world. To facilitate dependency management for developers, npm supports a special type of dependency, Peer Dependency, whose installation and usage differ from regular dependencies. However, conflicts between peer dependencies can trap the npm client into infinite loops, leading to resource exhaustion and system crashes. We name this problem PeerSpin. Although PeerSpin poses a severe risk to ecosystems, it was overlooked by previous studies, and its impacts have not been explored. To bridge this gap, this paper conducts the first in-depth study to understand and detect PeerSpin in the npm ecosystem. First, by systematically analyzing the npm dependency resolution, we identify the root cause of PeerSpin and characterize two peer dependency patterns to guide detection. Second, we propose a novel technique called Node-Replacement-Conflict based PeerSpin Detection, which leverages the state of the directory tree during dependency resolution to achieve accurate and efficient PeerSpin detection. Based on this technique, we developed a tool called PeerChecker to detect PeerSpin. Finally, we apply PeerChecker to the entire NPM ecosystem and find that 5,662 packages, totaling 72,968 versions, suffer from PeerSpin. Up until now, we confirmed 28 real PeerSpin problems by reporting them to the package maintainer. We also open source all PeerSpin analysis implementations, tools, and data sets to the public to help the community detect PeerSpin issues and enhance the reliability of the npm ecosystem.
... Goblin [19] is the most recent dataset; it is based on Maven Central and made available as a neo4j graph database [20]. For our present study, we are using the 30-08-2024 version. It is well-known that SECOs exhibit strong growth over time, and this can be observed in the respective networks [10], [40], [21], [9]. There are several networks that can be considered here. ...
Preprint
Full-text available
Maven Central is a large popular repository of Java components that has evolved over the last 20 years. The distribution of dependencies indicates that the repository is dominated by a relatively small number of components that other components depend on. The question is whether those elites are static or change over time, and how this relates to innovation in the Maven ecosystem. We study those questions using several metrics. We find that elites are dynamic, and that the rate of innovation is slowing as the repository ages but remains healthy.
... Numerous studies have explored the idea of comparing similar aspects across software ecosystems [16,22,26]. Decan et al. [26] empirically compared the evolution of dependency network in seven ecosystems. ...
Preprint
To comply with high productivity demands, software developers reuse free open-source software (FOSS) code to avoid reinventing the wheel when incorporating software features. The reliance on FOSS reuse has been shown to improve productivity and the quality of delivered software; however, reusing FOSS comes at the risk of exposing software projects to public vulnerabilities. Massacci and Pashchenko have explored this trade-off in the Java ecosystem through the lens of technical leverage: the ratio of code borrowed from FOSS over the code developed by project maintainers. In this paper, we replicate the work of Massacci and Pashchenko and expand the analysis to include level-1 transitive dependencies to study technical leverage in the fastest-growing NPM ecosystem. We investigated 14,042 NPM library releases and found that both opportunities and risks of technical leverage are magnified in the NPM ecosystem. Small-medium libraries leverage 2.5x more code from FOSS than their own code, while large libraries leverage only 3% of FOSS code in their projects. Our models indicate that technical leverage shortens the release cycle for small-medium libraries. However, the risk of vulnerability exposure is 4-7x higher for libraries with high technical leverage. We also expanded our replication study to include the first level of transitive dependencies, and show that the results still hold, albeit with significant changes in the magnitude of both opportunities and risks of technical leverage. Our results indicate the extremes of opportunities and risks in NPM, where high technical leverage enables fast releases but comes at the cost of security risks.
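Technical leverage, as defined by Massacci and Pashchenko and used in this replication, is a simple ratio; the line counts in the example are hypothetical.

```python
def technical_leverage(own_loc, foss_loc):
    """Technical leverage: lines of code borrowed from FOSS dependencies
    divided by lines of code developed by the project maintainers."""
    return foss_loc / own_loc

# a small library shipping 5k own lines atop 50k lines of dependencies
assert technical_leverage(5_000, 50_000) == 10.0
```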
... The choice of Python is driven by the following reasons: (i) Its widespread use [28,8,9]. ...
Preprint
Full-text available
Software Bills of Material (SBOMs), which improve transparency by listing the components constituting software, are a key countermeasure to the mounting problem of Software Supply Chain attacks. SBOM generation tools take project source files and provide an SBOM as output, interacting with the software ecosystem. While SBOMs are a substantial improvement for security practitioners, providing a complete and correct SBOM is still an open problem. This paper investigates the causes of the issues affecting SBOM completeness and correctness, focusing on the PyPI ecosystem. We analyze four popular SBOM generation tools using the CycloneDX standard. Our analysis highlights issues related to dependency versions, metadata files, remote dependencies, and optional dependencies. Additionally, we identified a systematic issue with the lack of standards for metadata in the PyPI ecosystem. This includes inconsistencies in the presence of metadata files as well as variations in how their content is formatted.
... Limitation 1: Both Platforms do not Consider all Dependency Types and Scopes. Library dependency graphs can be large and complex (Louridas et al. 2008; Decan et al. 2016). For the Maven build system, there are various dependency scopes, transitive dependencies, optional dependencies, etc. ...
Article
Full-text available
Developers rely on software ecosystems such as Maven to manage and reuse external libraries (i.e., dependencies). Due to the complexity of the used dependencies, developers may face challenges in choosing which library to use and whether they should upgrade or downgrade a library. One important factor that affects this decision is the number of potential vulnerabilities in a library and its dependencies. Therefore, state-of-the-art platforms such as Maven Repository (MVN) and Open Source Insights (OSI) help developers in making such a decision by presenting vulnerability information associated with every dependency. In this paper, we first conduct an empirical study to understand how the two platforms, MVN and OSI, present and categorize vulnerability information. We found that these two platforms may either overestimate or underestimate the number of associated vulnerabilities in a dependency, and they lack prioritization mechanisms on which dependencies are more likely to cause an issue. Hence, we propose a tool named VulNet to address the limitations we found in MVN and OSI. Through an evaluation of 19,886 versions of the top 200 popular libraries, we find VulNet includes 90.5% and 65.8% of the dependencies that were omitted by MVN and OSI, respectively. VulNet also helps reduce 27% of potentially unreachable or less impactful vulnerabilities listed by OSI in test dependencies. Finally, our user study with 24 participants gave VulNet an average rating of 4.5/5 in presenting and prioritizing vulnerable dependencies, compared to 2.83 (MVN) and 3.14 (OSI).
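Counting which of a library's transitive dependencies carry known vulnerabilities, as platforms like MVN, OSI, and the proposed VulNet must, reduces to a reachability query over the dependency graph. A minimal sketch with a hypothetical graph and advisory set (package names are ours, not from the paper):

```python
from collections import deque

def transitive_deps(graph, root):
    """All packages reachable from `root` via dependency edges (BFS)."""
    seen, queue = set(), deque([root])
    while queue:
        pkg = queue.popleft()
        for dep in graph.get(pkg, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def vulnerable_deps(graph, root, advisories):
    """Subset of root's transitive dependencies with known advisories."""
    return transitive_deps(graph, root) & advisories

graph = {"app": ["log-lib", "http-lib"],
         "http-lib": ["tls-lib"],
         "log-lib": []}
print(vulnerable_deps(graph, "app", {"tls-lib"}))  # {'tls-lib'}
```

A real tool would additionally resolve versions and filter by dependency scope, which is exactly where the paper reports MVN and OSI diverge.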
... For PyPI, in contrast, it takes a median of almost 69.5 h to be notified of the external repository. One possible reason is that NPM packages are less likely to be isolated than PyPI ones (Decan et al. 2016). For the metric related to the number of posts, as shown in Fig. 4(b), the median number of posts in the Reporting an Enhancement category is two for both ecosystems, larger than in the other three reason categories, suggesting that more discussion is likely to be involved in this instance. ...
Article
Full-text available
Popular and large contemporary open-source projects now embrace a diverse set of documentation for communication channels. Examples include contribution guidelines (i.e., commit message guidelines, coding rules, submission guidelines), code of conduct (i.e., rules and behavior expectations), governance policies, and Q&A forums. In 2020, GitHub released Discussion to distinguish between communication and collaboration. However, it remains unclear how developers maintain these channels, how trivial it is, and whether deciding on conversion takes time. We conducted an empirical study on 259 NPM and 148 PyPI repositories, devising two taxonomies of reasons for converting discussions into issues and vice-versa. The most frequent conversion from a discussion to an issue is when developers request a contributor to clarify their idea into an issue (Reporting a Clarification Request, 35.1% and 34.7%, respectively), while having a non-actionable topic (QA, ideas, feature requests, 55.0% and 42.0%, respectively) is the most frequent reason for converting an issue into a discussion. Furthermore, we show that not all reasons for conversion are trivial (e.g., not a bug), and raising a conversion intent potentially takes time (i.e., a median of 15.2 and 35.1 h, respectively, taken from issues to discussions). Our work contributes to complementing the GitHub guidelines and helping developers effectively utilize the Issue and Discussion communication channels to maintain their collaboration.
... The reusable code usually takes the form of packages delivered by package management systems, such as npm for JavaScript packages, PyPI for Python packages, and Maven for Java packages. In recent years, researchers have conducted substantial studies investigating a variety of aspects of software ecosystems, including their evolution [17,22], dependencies of packages [14][15][16] and security risks [1,27,71]. A few studies make comparisons across software ecosystems, such as the structure [26] and evolution [17] of dependencies across software ecosystems. ...
Article
Full-text available
Rust is an emerging programming language designed for the development of systems software. To facilitate the reuse of Rust code, crates.io, as a central package registry of the Rust ecosystem, hosts thousands of third-party Rust packages. The openness of crates.io enables the growth of the Rust ecosystem but comes with security risks, as evidenced by severe security advisories. Although Rust guarantees a software program to be safe via programming language features and strict compile-time checking, the unsafe keyword in Rust allows developers to bypass compiler safety checks for certain regions of code. Prior studies have empirically investigated the memory safety and concurrency bugs in the Rust ecosystem, as well as the usage of unsafe keywords in practice. Nonetheless, the literature lacks a systematic investigation of the security risks in the Rust ecosystem. In this paper, we perform a comprehensive investigation into the security risks present in the Rust ecosystem, asking "what are the characteristics of the vulnerabilities, what are the characteristics of the vulnerable packages, and how are the vulnerabilities fixed in practice?". To facilitate the study, we first compile a dataset of 433 vulnerabilities, 300 vulnerable code repositories, and 218 vulnerability fix commits in the Rust ecosystem, spanning over 7 years. With the dataset, we characterize the types, life spans, and evolution of the disclosed vulnerabilities. We then characterize the popularity, categorization, and vulnerability density of the vulnerable Rust packages, as well as their versions and code regions affected by the disclosed vulnerabilities. Finally, we characterize the complexity of vulnerability fixes and localities of corresponding code changes, and inspect how practitioners fix vulnerabilities in Rust packages with various localities. We find that memory safety and concurrency issues account for nearly two thirds of the vulnerabilities in the Rust ecosystem.
It takes over 2 years for the vulnerabilities to become publicly disclosed, and one third of the vulnerabilities have no fixes committed before their disclosure. In terms of vulnerability density, we observe a continuous upward trend at the package level over time, but a decreasing trend at the code level since August 2020. In the vulnerable Rust packages, the vulnerable code tends to be localized at the file level, and contains statistically significantly more unsafe functions and blocks than the rest of the code. More popular packages tend to have more vulnerabilities, while the less popular packages suffer from vulnerabilities for more versions. The vulnerability fix commits tend to be localized to a limited number of lines of code. Developers tend to address vulnerable safe functions by adding safe functions or lines to them, vulnerable unsafe blocks by removing them, and vulnerable unsafe functions by modifying unsafe trait implementations. Based on our findings, we discuss implications, provide recommendations for software practitioners, and outline directions for future research.
... Wittern et al. [72] investigated the evolution of NPM and found that the number of packages and dependency relationships among packages increase rapidly. Decan et al. [32,33,35] and Kikas et al. [50] analyzed the distribution of direct and transitive dependencies in package SCs in multiple package ecosystems. They found growing packages and dependency relationships among packages in these SCs, and that transitive dependencies lead to the fragility of package SCs. ...
Preprint
Full-text available
Deep learning (DL) package supply chains (SCs) are critical for DL frameworks to remain competitive. However, vital knowledge on the nature of DL package SCs is still lacking. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. Applications, Infrastructure, and Sciences categories account for over 85% of popular packages in either SC, and the TensorFlow and PyTorch SCs have developed specializations on Infrastructure and Applications packages respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes: Arrow, Star, Tree, and Forest, with increasing dependency complexity. Most clusters are Arrow or Star, but Tree and Forest clusters account for most packages (TensorFlow SC: 70%, PyTorch SC: 90%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. The most common disengagement reasons in the two SCs are different. Our study provides rich implications on the maintenance and dependency management practices of PyPI DL SCs.
... Target Package Ecosystems. We selected five popular and well-studied software ecosystems [19][20][21]. NPM is a package manager for the JavaScript programming language that was recently purchased by Microsoft via GitHub on March 16, 2020 [9]. PyPI is the library ecosystem that serves the Python programming language, an interpreted, high-level, general-purpose programming language. ...
Preprint
Full-text available
Using libraries in applications has helped developers reduce the costs of reinventing already existing code. However, an increase in diverse technology stacks and third-party library usage has led developers to inevitably switch technologies and search for similar libraries implemented in the new technology. To assist with searching for these replacement libraries, maintainers have started to release their libraries to multiple ecosystems. Our goal is to explore the extent to which these libraries are intertwined between ecosystems. We perform a large-scale empirical study of 1.1 million libraries from five different software ecosystems, i.e., PyPI, CRAN, Maven, RubyGems, and NPM, to identify 4,146 GitHub repositories. As a starting point, insights from the study have implications for library maintainers, users, contributors, and researchers in understanding how these different ecosystems are becoming more intertwined with each other.
... file. 9 This field allows developers to manually specify the Node.js version that runs the associated code with the build of a specific release. ...
Article
Full-text available
Complex software systems have a network of dependencies. Developers often configure package managers (e.g., npm) to automatically update dependencies with each publication of new releases containing bug fixes and new features. When a dependency release introduces backward-incompatible changes, commonly known as breaking changes, dependent packages may not build anymore. This may indirectly impact downstream packages, but the impact of breaking changes and how dependent packages recover from these breaking changes remain unclear. To close this gap, we investigated the manifestation of breaking changes in the npm ecosystem, focusing on cases where packages' builds are impacted by breaking changes from their dependencies. We measured the extent to which breaking changes affect dependent packages. Our analyses show that around 12% of the dependent packages and 14% of their releases were impacted by a breaking change during updates of non-major releases of their dependencies. We observed that, of all the manifesting breaking changes, 44% were introduced both in minor and patch releases, which in principle should be backward compatible. Clients recovered from these breaking changes in half of the cases, most frequently by upgrading or downgrading the provider's version without changing the versioning configuration in the package manager. We expect that these results help developers understand the potential impact of such changes and recover from them.
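The reason non-major releases propagate automatically, and can therefore break dependents, is the default caret version ranges used by npm. A simplified sketch of caret semantics for versions at or above 1.0.0 (ignoring prerelease tags and the special 0.x rules):

```python
def satisfies_caret(version, base):
    """Simplified npm caret semantics for versions >= 1.0.0:
    ^1.2.3 accepts any 1.x.y with (x, y) >= (2, 3), i.e. the same
    major version and an equal-or-newer minor/patch pair."""
    v = tuple(map(int, version.split(".")))
    b = tuple(map(int, base.split(".")))
    return v[0] == b[0] and v[1:] >= b[1:]

print(satisfies_caret("1.3.0", "1.2.3"))  # True  -> installed automatically
print(satisfies_caret("2.0.0", "1.2.3"))  # False -> requires a manual bump
```

So a breaking change shipped in a minor or patch release, which the paper finds happens in 44% of the manifesting cases, lands in dependents' builds without any action on their part.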
... Previous research has focused on vulnerability propagation in the ecosystems of NPM [6], [8], [26], [27], RubyGems [12], [16], [28], Maven [29], [30], PyPI [31]-[34], Packagist [16], etc. Very little work has been done on vulnerability propagation in the Cargo ecosystem, so in this study we focus on the security of dependencies in the Cargo ecosystem. Most other studies of vulnerability propagation in ecosystems consider only direct dependencies; fewer consider transitive dependencies. ...
Preprint
Full-text available
Currently, little is known about the structure of the Cargo ecosystem and the potential for vulnerability propagation. Many empirical studies generalize third-party dependency governance strategies from a single software ecosystem to other ecosystems but ignore the differences in the technical structures of different software ecosystems, making it difficult to directly generalize security governance strategies from other ecosystems to the Cargo ecosystem. To fill the gap in this area, this paper constructs a knowledge graph of dependency vulnerabilities for the Cargo ecosystem. This paper is the first large-scale empirical study in a related research area to address vulnerability propagation in the Cargo ecosystem. This paper proposes a dependency-vulnerability knowledge graph parsing algorithm to determine the vulnerability propagation path and propagation range, and empirically studies the characteristics of vulnerabilities in the Cargo ecosystem, the propagation range, and the factors that cause vulnerability propagation. Our research found that the Cargo ecosystem's security vulnerabilities are primarily memory-related. 18% of the libraries affected by a vulnerability are still affected in the latest version of the library. The proportion of versions affected by vulnerability propagation is 19.78% across the entire Cargo ecosystem. This paper examines the characteristics and propagation factors triggering vulnerabilities in the Cargo ecosystem, and provides some practical resolution strategies for administrators of the Cargo community, developers who use Cargo to manage third-party libraries, and library owners. This paper provides new ideas for improving the overall security of the Cargo ecosystem.
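The propagation range of a vulnerability can be estimated by following reverse dependency edges outward from the vulnerable package, in the spirit of the graph parsing described above. A minimal sketch, with illustrative package names (not taken from the paper):

```python
from collections import deque

def affected_by(reverse_deps, vulnerable_pkg):
    """Packages that transitively depend on a vulnerable package,
    found by BFS over reverse dependency edges
    (package -> packages that depend on it)."""
    seen, queue = set(), deque([vulnerable_pkg])
    while queue:
        pkg = queue.popleft()
        for dependent in reverse_deps.get(pkg, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

reverse_deps = {"low-level-crate": ["wrapper-crate"],
                "wrapper-crate": ["app-crate"]}
print(affected_by(reverse_deps, "low-level-crate"))
# {'wrapper-crate', 'app-crate'}
```

A version-aware analysis like the paper's would additionally intersect this set with the version ranges that actually resolve to the vulnerable release.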
... Among R, Python, and JavaScript, Python has the largest standard library of functions, with a dependency graph that is more isolated, while npm lacks a standard library, so it has the largest package manager index, with many small packages that provide this basic functionality, and is more connected [12]. A similar study of note is [7], which compares JavaScript, Ruby, and Rust and examines their evolution over time. That study confirmed the npm analysis of [6] and finds the result, also seen in this study, that a small subset of packages dominates the dependency graph. ...
Preprint
Full-text available
The Python ecosystem represents a global, data-rich, technology-enabled network. By analyzing Python's dependency network, its top 14 most-imported libraries, and cPython (or core Python) libraries, this research finds clear evidence that the Python network can be considered a problem-solving network. Analysis of the contributor network of the top 14 libraries and cPython reveals emergent specialization, where experts on specific libraries are isolated and focused while other experts link these critical libraries together, optimizing both local and global information-exchange efficiency. As these networks are expanded, the local efficiency drops while the density increases, representing a possible transition point between exploitation (optimizing working solutions) and exploration (finding new solutions). These results provide insight into the optimal functioning of technology-enabled social networks and may have larger implications for the effective functioning of modern organizations.
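The global information-exchange efficiency referred to above is commonly defined as the average inverse shortest-path length over all ordered node pairs. A small self-contained sketch (the hub-and-spoke example network is ours, chosen to mirror the "connector experts" pattern the abstract describes):

```python
from collections import deque

def global_efficiency(adj):
    """Average of 1/d(u, v) over all ordered node pairs,
    where d is the shortest-path length (unreachable pairs
    contribute 0)."""
    def bfs_dists(src):
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    n = len(adj)
    total = 0.0
    for u in adj:
        dist = bfs_dists(u)
        total += sum(1.0 / dist[v] for v in adj if v != u and v in dist)
    return total / (n * (n - 1))

# One connector ("hub") linking three specialized contributors:
adj = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
print(global_efficiency(adj))  # 0.75
```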
... Decan et al. [23] study the topology of CRAN, NPM, and PyPI and confirm that PyPI comprises the highest number of disconnected packages compared to other package managers, see Table II. They call ecosystems with many transitive dependencies on few packages "fragile" [24]. ...
Conference Paper
Full-text available
Recently, Google's Open Source team presented the criticality score [1], a metric to assess the "influence and importance" of a project in an ecosystem from project-specific signals, e.g., number of dependents, commit frequency, etc. The community showed mixed reactions towards the score, doubting whether it can accurately identify critical projects. We share the community's doubts, and we hypothesize that a combination of PageRank (PR) and Truck Factor (TF) can more accurately identify critical projects than Google's current Criticality Score (CS). To verify our hypothesis, we conduct an experiment in which we compute the PR of thousands of projects from various ecosystems, such as Maven (Java), NPM (JavaScript), PyPI (Python), etc., we compute the TFs of the projects with the highest PR in the respective ecosystems, and we compare these to the scores provided by the Google project. Unlike Google's CS, our approach identifies projects such as six and idna from PyPI, com.typesafe:config from Maven, or tap from NPM, as critical projects with a high degree of transitive dependents (highest PR) and a low number of core developers (each of them possessing a TF of one).
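The PageRank half of the proposed PR+TF combination can be sketched with plain power iteration over a toy dependency graph. Edges point from a package to its dependencies, so heavily depended-upon packages accumulate rank; the package names are illustrative, not from the paper:

```python
def pagerank(edges, nodes, d=0.85, iters=50):
    """Plain power-iteration PageRank. Mass from dangling nodes
    (no outgoing edges) is spread evenly over all nodes."""
    pr = {n: 1.0 / len(nodes) for n in nodes}
    out = {n: [v for u, v in edges if u == n] for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or list(nodes)  # dangling: spread evenly
            for t in targets:
                nxt[t] += d * pr[n] / len(targets)
        pr = nxt
    return pr

nodes = ["app1", "app2", "util", "core"]
edges = [("app1", "util"), ("app2", "util"), ("util", "core")]
pr = pagerank(edges, nodes)
print(max(pr, key=pr.get))  # 'core': everything transitively depends on it
```

The point of the paper's combination is then to flag nodes like this one that pair a high PR with a truck factor of one.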
... Kikas et al. [5] studied the structure and evolution of package dependency networks of JavaScript, Ruby, and Rust ecosystems. Decan et al. [3] showed that experimental results related to software packages belonging to a single software ecosystem fail to generalise to other ecosystems because of the diversity of their structure. ...
Chapter
Full-text available
Social network research has focused on hyperlink graphs, bibliographic citations, friend/follow patterns, influence spread, etc. Large software repositories also form a highly valuable networked artifact, usually in the form of a collection of packages, their developers, dependencies among them, and bug reports. This “social network of code” is rarely studied by social network researchers. We introduce two new problems in this setting. These problems are well-motivated in the software engineering community but not closely studied by social network scientists. The first is to identify packages that are most likely to be troubled by bugs in the immediate future, thereby demanding the greatest attention. The second is to recommend developers to packages for the next development cycle. Simple autoregression can be applied to historical data for both problems, but we propose a novel method to integrate network-derived features and demonstrate that our method brings additional benefits. Apart from formalizing these problems and proposing new baseline approaches, we prepare and contribute a substantial dataset connecting multiple attributes built from the long-term history of 20 releases of Ubuntu, growing to over 25,000 packages with their dependency links, maintained by over 3,800 developers, with over 280k bug reports.
Article
The Rust programming language is gaining popularity rapidly in building reliable and secure systems due to its security guarantees and outstanding performance. To provide extra functionalities, the Rust compiler introduces Rust unstable features (RUFs) to extend compiler functionality, syntax, and standard library support. However, their inherent instability poses significant challenges, including potential removal that can lead to large-scale compilation failures across the entire ecosystem. While our original study provided the first ecosystem-wide analysis of RUF usage and impacts, this extended study builds upon our prior work to further explore RUF evolution, propagation, and mitigation. We introduce novel techniques for extracting and matching RUF APIs across compiler versions and find that the proportion of RUF APIs has increased from 3% to 15%. Our analysis of 590K package versions and 140M transitive dependencies reveals that the Rust ecosystem uses 1,000 different RUFs, and 44% of package versions are affected by RUFs, causing compilation failures for 12% of package versions. Additionally, we extend our analysis outside the ecosystem and find that popular Rust applications also rely heavily on RUFs. To mitigate the impacts of RUFs, we propose a mitigation technique integrated into the build process without requiring developer intervention. Our audit algorithm can systematically adjust dependencies and compiler versions to resolve RUF-induced compilation failures, successfully recovering 91% of compilation failures caused by RUFs. We believe our techniques, findings, and tools can help to stabilize the Rust compiler, ultimately enhancing the security and reliability of the ecosystem.
Article
Deep learning (DL) frameworks have become the cornerstone of the rapidly developing DL field. Through installation dependencies specified in the distribution metadata, numerous packages directly or transitively depend on DL frameworks, layer after layer, forming DL package supply chains (SCs), which are critical for DL frameworks to remain competitive. However, vital knowledge on how to nurture and sustain DL package SCs is still lacking. Achieving this knowledge may help DL frameworks formulate effective measures to strengthen their SCs to remain competitive and shed light on dependency issues and practices in the DL SC for researchers and practitioners. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. Applications , Infrastructure , and Sciences categories account for over 85% of popular packages in either SC and TensorFlow and PyTorch SC have developed specializations on Infrastructure and Applications packages respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes: Arrow, Star, Tree, and Forest with increasing dependency complexity. Most clusters are Arrow or Star, while Tree and Forest clusters account for most packages (Tensorflow SC: 70.7%, PyTorch SC: 92.9%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. 
The most common reason in the TensorFlow SC is dependency incompatibility, while in the PyTorch SC it is to simplify functionalities and reduce installation size. Our study provides rich implications for DL framework vendors, researchers, and practitioners on the maintenance and dependency management practices of PyPI DL SCs.
Chapter
Building software from source code requires a build environment that meets certain requirements, such as the presence of specific compilers, libraries, or other tools. Unfortunately, requirements for different packages can conflict with each other, so it is often impossible to use a single build environment when building a large collection of software. This paper develops techniques to minimize the number of distinct build environments required, and measures the practical impact of our techniques on build time. In particular, we introduce the notion of a "conflict graph," and prove that the problem of minimizing the number of build environments is equivalent to the graph coloring problem on this graph. We explore several heuristic techniques to compute conflict graph colorings, finding solutions that result in surprisingly small sets of build environments. Using Ubuntu 20.04 as our primary experimental dataset, we computed just 4 different environments that were sufficient for building the "Top 500" most popular source packages, and 11 build environments were sufficient for building all 30,646 source packages included in Ubuntu 20.04. Finally, we experimentally evaluate the benefit of these environments by comparing the work required for building the "Top 500" with our environments to the work required using the traditional minimal environment build. We saw that the total work required for building these packages dropped from 139 h 36 min to 54 h 18 min, a 61% reduction. Keywords: Build environments, Large-scale analysis
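The equivalence to graph coloring can be made concrete with the standard greedy heuristic (one of the simplest of the heuristic techniques the abstract alludes to, not necessarily the paper's exact algorithm): packages joined by an edge in the conflict graph must land in different build environments (colors).

```python
def greedy_coloring(conflicts, order=None):
    """Greedy coloring of a conflict graph: each package gets the
    lowest color not already used by a conflicting neighbor."""
    color = {}
    for pkg in order or sorted(conflicts):
        used = {color[n] for n in conflicts[pkg] if n in color}
        c = 0
        while c in used:
            c += 1
        color[pkg] = c
    return color

# a conflicts with b and c; b and c are mutually compatible
conflicts = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
coloring = greedy_coloring(conflicts)
print(len(set(coloring.values())))  # 2 build environments suffice
```

The number of distinct colors is the number of build environments; the vertex order matters for greedy coloring, which is why the paper explores several heuristics.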
Article
Full-text available
Self-Admitted Technical Debt (SATD) is primarily studied in Object-Oriented (OO) languages and traditionally commercial software. However, scientific software coded in dynamically-typed languages such as R differs in paradigm, and the source code comments' semantics are different (i.e., more aligned with algorithms and statistics when compared to traditional software). Additionally, many Software Engineering topics are understudied in scientific software development, with SATD detection remaining a challenge for this domain. This gap adds complexity, since prior works determined that SATD in scientific software does not match many of the keywords identified for OO SATD, possibly hindering its automated detection. Therefore, we investigated how classification models (traditional machine learning, deep neural networks, and deep neural Pre-Trained Language Models (PTMs)) automatically detect SATD in R packages. This study aims to assess the capabilities of these models to classify different TD types in this domain and manually analyze the causes of each in a representative sample. Our results show that PTMs (i.e., RoBERTa) outperform other models and work well when the number of comments labelled as a particular SATD type has low occurrences. We also found that some SATD types are more challenging to detect. We manually identified sixteen causes, including eight new causes detected by our study. The most common cause was failure to remember, in agreement with previous studies. These findings will help R package authors automatically identify SATD in their source code and improve their code quality. In the future, checklists for R developers can also be developed by scientific communities such as rOpenSci to guarantee a higher quality of packages before submission.
Preprint
Full-text available
The R package ecosystem is expanding fast and dependencies among packages in the ecosystem are becoming more complex. In this study, we explored package dependencies from a new angle. We applied a new metric named "dependency heaviness", which measures the number of additional strong dependencies that a package uniquely contributes to its child or downstream packages. It also measures the total reduction in dependencies across the ecosystem when the role of a package is changed from a strong parent to a weak parent. We systematically studied how dependency heaviness spreads from parent to child packages, and how it further spreads to remote downstream packages in the CRAN/Bioconductor ecosystem. We extracted the top packages and key paths that mainly transmit heavy dependencies in the ecosystem. Additionally, the dependency heaviness analysis of the ecosystem has been implemented as a web-based database that provides comprehensive tools for querying the dependencies of individual R packages.
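The "dependency heaviness" idea can be illustrated on a toy dependency graph (the graph and names below are hypothetical; the paper's actual definition and data are richer): count the strong dependencies a child would shed if one parent were demoted to a weak parent, i.e. those reachable only through that parent.

```python
# Strong (Depends/Imports-style) parents of each package; toy example.
deps = {
    "child":   {"parentA", "parentB"},
    "parentA": {"libX", "libY"},
    "parentB": {"libX"},
    "libX": set(),
    "libY": set(),
}

def upstream(pkg, graph):
    """All transitive strong dependencies of pkg."""
    seen, stack = set(), list(graph.get(pkg, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(graph.get(p, ()))
    return seen

def heaviness(parent, child, graph):
    """Dependencies of `child` reachable only via `parent`: what the child
    would lose if `parent` became a weak (suggested) parent."""
    others = {p for p in graph[child] if p != parent}
    reach_without = set(others)
    for p in others:
        reach_without |= upstream(p, graph)
    reach_with = {parent} | upstream(parent, graph)
    return len(reach_with - reach_without)

print(heaviness("parentA", "child", deps))  # -> 2 (parentA itself and libY)
print(heaviness("parentB", "child", deps))  # -> 1 (libX also comes via parentA)
```

Note that libX does not count against parentB, since the child reaches it through parentA anyway; only uniquely contributed dependencies add heaviness.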
Article
The recent success of machine learning (ML) has led to an explosive growth of systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date, focusing on questions that can advance our understanding of the field and determine investments. Specifically, we download and analyze (a) over 8M notebooks publicly available on GitHub and (b) over 2M enterprise ML pipelines developed within Microsoft. Our analysis includes coarse-grained statistical characterizations, fine-grained analysis of libraries and pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret and draw actionable conclusions on (a) what system builders should focus on to better serve practitioners and (b) what technologies practitioners should rely on.
Chapter
Today’s development of client-side web applications is based on one of the JavaScript frameworks, such as Angular or React. The excessive dependencies that arise in the ecosystem of the Node Package Manager increase the security risk and the dependence of one's own web application on third-party packages. The frameworkless approach, by contrast, proposes a renaissance of classic web development, because it strives to avoid external dependencies as far as possible and to fall back on the standards. Whether a frameworkless implementation achieves the maintainability and security of frameworks is questionable. Therefore, it makes sense to research which core concepts of the frameworks meet the requirements for maintainability and security, and how these are implemented. The novelty is that the concepts to be explored are moved to a standard in order to ensure developer efficiency, security, performance and maintainability in the long term. This allows existing approaches to focus on other essential features. Keywords: Web application modelling and engineering · Developer efficiency · Concepts and patterns · Standards
Article
While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, the first lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository which are each subjected to software and ML performance validation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.
Conference Paper
Full-text available
Users and developers of software distributions are often confronted with installation problems due to conflicting packages. A prototypical example of this are the Linux distributions such as Debian. Conflicts between packages have been studied under different points of view in the literature, in particular for the Debian operating system, but little is known about how these package conflicts evolve over time. This article presents an extensive analysis of the evolution of package incompatibilities, spanning a decade of the life of the Debian stable and testing distributions for its most popular architecture, i386. Using the technique of survival analysis, this empirical study sheds some light on the origin and evolution of package incompatibilities, and provides the basis for building indicators that may be used to improve the quality of package-based distributions.
Conference Paper
Full-text available
This paper explores the ecosystem of software packages for R, one of the most popular environments for statistical computing today. We empirically study how R packages are developed and distributed on different repositories: CRAN, BioConductor, R-Forge and GitHub. We also explore the role and size of each repository, the inter-repository dependencies, and how these repositories grow over time. With this analysis, we provide a deeper insight into the extent and the evolution of the R package ecosystem.
Article
Full-text available
The number of R extension packages available from the CRAN repository has tremendously grown over the past 10 years. We look at this phenomenon in more detail, and discuss some of its consequences. In particular, we argue that the statistical computing community needs a more common understanding of software quality, and better domain-specific semantic resources.
Article
Full-text available
Software ecosystems consist of multiple software projects, often interrelated by means of dependency relations. When one project undergoes changes, other projects may decide to upgrade their dependency. For example, a project could use a new version of a component from another project because the latter has been enhanced or subject to some bug-fixing activities. In this paper we study the evolution of dependencies between projects in the Java subset of the Apache ecosystem, consisting of 147 projects, for a period of 14 years, resulting in 1,964 releases. Specifically, we investigate (i) how dependencies between projects evolve over time when the ecosystem grows, (ii) what are the product and process factors that can likely trigger dependency upgrades, (iii) how developers discuss the needs and risks of such upgrades, and (iv) what is the likely impact of upgrades on client projects. The study results (qualitatively confirmed by observations made by analyzing the developers’ discussion) indicate that when a new release of a project is issued, it triggers an upgrade when the new release includes major changes (e.g., new features/services) as well as a large amount of bug fixes. Instead, developers are reluctant to perform an upgrade when some APIs are removed. The impact of upgrades is generally low, unless it is related to frameworks/libraries used in crosscutting concerns. Results of this study can support the understanding of the library/component upgrade phenomenon, and provide the basis for a new family of recommenders aimed at supporting developers in the complex (and risky) activity of managing library/component upgrades within their software projects.
Conference Paper
Full-text available
Software ecosystems consist of multiple software projects, often interrelated by means of dependency relations. When one project undergoes changes, other projects may decide to upgrade their dependency. For example, a project could use a new version of another project because the latter has been enhanced or subject to some bug-fixing activities. This paper reports an exploratory study aimed at observing the evolution of the Java subset of the Apache ecosystem, consisting of 147 projects, over a period of 14 years and 1,964 releases. Specifically, we analyze (i) how dependencies change over time; (ii) whether a dependency upgrade is due to different kinds of factors, such as different kinds of API changes or licensing issues; and (iii) how an upgrade impacts a related project. Results of this study help to comprehend the phenomenon of library/component upgrade, and provide the basis for a new family of recommenders aimed at supporting developers in the complex (and risky) activity of managing library/component upgrades within their software projects.
Conference Paper
Full-text available
Reverse engineering is the process of recovering a project's components and the relationships between them, with the goal of creating representations of the project at a higher level of abstraction. When dealing with the large amounts of information that are analyzed during reverse engineering, visualization and exploratory navigation are important tools. However, a software system does not exist by itself. Instead, a project is part of a larger software ecosystem of projects that is developed in the context of an organization, a research group, or an open-source community. In our work, we argue that reverse engineering an ecosystem is a natural and complementary extension of traditional single-system reverse engineering. We propose a methodology based on visualization, top-down exploration, architecture recovery and software evolution analysis for the reverse engineering of software ecosystems. Our methodology starts with visualizing high-level structural and evolutionary aspects of the ecosystem, from which the reverse engineer can navigate to views which present architectural aspects of the individual projects. To support our approach we implemented tool support for analyzing the ecosystem level as well as the intra-project level.
Article
Full-text available
The upgrade problems faced by Free and Open Source Software distributions have characteristics not easily found elsewhere. We describe the structure of packages and their role in the upgrade process. We show that state of the art package managers have shortcomings inhibiting their ability to cope with frequent upgrade failures. We survey current countermeasures to such failures, argue that they are not satisfactory, and sketch alternative solutions.
Conference Paper
Change introduces conflict into software ecosystems: breaking changes may ripple through the ecosystem and trigger rework for users of a package, but often developers can invest additional effort or accept opportunity costs to alleviate or delay downstream costs. We performed a multiple case study of three software ecosystems with different tooling and philosophies toward change, Eclipse, R/CRAN, and Node.js/npm, to understand how developers make decisions about change and change-related costs and what practices, tooling, and policies are used. We found that all three ecosystems differ substantially in their practices and expectations toward change and that those differences can be explained largely by different community values in each ecosystem. Our results illustrate that there is a large design space in how to build an ecosystem, its policies and its supporting infrastructure; and there is value in making community values and accepted tradeoffs explicit and transparent in order to resolve conflicts and negotiate change-related costs.
Conference Paper
The node package manager (npm) serves as the frontend to a large repository of JavaScript-based software packages, which foster the development of currently huge amounts of server-side Node.js and client-side JavaScript applications. In a span of 6 years since its inception, npm has grown to become one of the largest software ecosystems, hosting more than 230,000 packages, with hundreds of millions of package installations every week. In this paper, we examine the npm ecosystem from two complementary perspectives: 1) we look at package descriptions, the dependencies among them, and download metrics, and 2) we look at the use of npm packages in publicly available applications hosted on GitHub. In both perspectives, we consider historical data, providing us with a unique view on the evolution of the ecosystem. We present analyses that provide insights into the ecosystem's growth and activity, into conflicting measures of package popularity, and into the adoption of package versions over time. These insights help understand the evolution of npm, design better package recommendation engines, and can help developers understand how their packages are being used.
Conference Paper
When developing software packages in a software ecosystem, an important and well-known challenge is how to deal with dependencies to other packages. In presence of multiple package repositories, dependency management tends to become even more problematic. For the R ecosystem of statistical computing, dependency management is currently insufficient to deal with multiple package versions and inter-repository package dependencies. We explore how the use of GitHub influences the R ecosystem, both for the distribution of R packages and for inter-repository package dependency management. We also discuss how these problems could be addressed.
Article
When the Application Programming Interface (API) of a framework or library changes, its clients must be adapted. This change propagation---known as a ripple effect---is a problem that has garnered interest: several approaches have been proposed in the literature to react to these changes. Although studies of ripple effects exist at the single system level, no study has been performed on the actual extent and impact of these API changes in practice, on an entire software ecosystem associated with a community of developers. This paper reports on an empirical study of API deprecations that led to ripple effects across an entire ecosystem. Our case study subject is the development community gravitating around the Squeak and Pharo software ecosystems: seven years of evolution, more than 3,000 contributors, and more than 2,600 distinct systems. We analyzed 577 methods and 186 classes that were deprecated, and answer research questions regarding the frequency, magnitude, duration, adaptation, and consistency of the ripple effects triggered by API changes.
Article
One of the most powerful features of R is its infrastructure for contributed code. The built-in package manager and complementary repositories provide a great system for development and exchange of code, and have played an important role in the growth of the platform towards the de-facto standard in statistical computing that it is today. However, the number of packages on CRAN and other repositories has increased beyond what might have been foreseen, and is revealing some limitations of the current design. One such problem is the general lack of dependency versioning in the infrastructure. This paper explores this problem in greater detail, and suggests approaches taken by other open source communities that might work for R as well. Three use cases are defined that exemplify the issue, and illustrate how improving this aspect of package management could increase reliability while supporting further growth of the R community.
Article
This research analyzes complex networks in open-source software at the inter-package level, where package dependencies often span across projects and between development groups. We review complex networks identified at “lower” levels of abstraction, and then formulate a description of interacting software components at the package level, a relatively “high” level of abstraction. By mining open-source software repositories from two sources, we empirically show that the coupling of modules at this granularity creates a small-world and scale-free network in both instances.
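The two properties named in this abstract are commonly checked with an average clustering coefficient (high in small-world graphs) and a heavy-tailed degree distribution (the scale-free signature). A minimal pure-Python sketch on a toy undirected graph follows; the edges are made up for illustration, not mined package data.

```python
from collections import Counter

# Toy undirected dependency graph as an edge list (illustrative only).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("d", "e"), ("a", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def clustering(node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for x in nbrs for y in nbrs if x < y and y in adj[x])
    return 2 * links / (k * (k - 1))

avg_c = sum(clustering(n) for n in adj) / len(adj)          # small-world ingredient
degree_counts = Counter(len(adj[n]) for n in adj)           # degree -> #nodes
print(round(avg_c, 3), dict(degree_counts))
```

On real dependency graphs, a scale-free claim is usually supported by plotting `degree_counts` on log-log axes and checking for an approximately straight (power-law-like) tail, alongside a clustering coefficient well above that of a random graph of the same size.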
NPM & left-pad: Have we forgotten how to program?
  • D Haney
D. Haney. NPM & left-pad: Have we forgotten how to program? http://www.haneycodes.net/npm-left-padhave-we-forgotten-how-to-program/, March 2016.
The npm blog: kik, left-pad, and npm
  • I Z Schlueter
I. Z. Schlueter. The npm blog: kik, left-pad, and npm. http://blog.npmjs.org/post/141577284765/kik-leftpad-and-npm, March 2016.
JavaScript: A language in search of a standard library and module system
  • Z Hemel
Z. Hemel. JavaScript: A language in search of a standard library and module system. http://zef.me/blog/2856/javascript-a-language-insearch-of-a-standard-library-and-module-system, February 2010.
Python dependency analysis: Adventures of the datastronomer
  • K Gullikson
K. Gullikson. Python dependency analysis: Adventures of the datastronomer.
How I decide when to trust an R package
  • J Leek
J. Leek. How I decide when to trust an R package. http://simplystatistics.org/?p=4409, November 2015.
On the development and distribution of R packages: An empirical analysis of the R ecosystem
  • A Decan
  • T Mens
  • M Claes
  • P Grosjean
A. Decan, T. Mens, M. Claes, and P. Grosjean. On the development and distribution of R packages: An empirical analysis of the R ecosystem. In European Conference on Software Architecture Workshops, pages 41:1-41:6, 2015.
Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys?
  • J Romano
  • J Kromrey
  • J Coraggio
  • J Skowronek
J. Romano, J. Kromrey, J. Coraggio, and J. Skowronek. Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys? In annual meeting of the Florida Association of Institutional Research, pages 1-3, 2006.