Preprint

Analyzing Maintenance Activities of Software Libraries


Abstract

Industrial applications now rely heavily on open-source software libraries. Beyond the benefits that libraries bring, they can also pose a real threat when a library is affected by a vulnerability but its community is not active in creating a fixing release. Therefore, I want to introduce an automatic monitoring approach for industrial applications to identify open-source dependencies that show negative signs regarding their current or future maintenance activities. Since most research in this field is limited by a lack of features, labels, and transitive links, and is thus not applicable in industry, my approach aims to close this gap by capturing the impact of direct and transitive dependencies in terms of their maintenance activities. Automatically monitoring the maintenance activities of dependencies reduces the manual effort of application maintainers and supports application security by ensuring that dependencies remain well maintained.
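To make the idea of monitoring direct and transitive dependencies concrete, the following is a minimal sketch and not the approach proposed in the preprint. It assumes a hypothetical in-memory dependency graph and a hypothetical table of last release dates, and it uses "no release within `max_age_days`" as a stand-in signal for low maintenance activity. The function name, the threshold, and the sample data are all illustrative assumptions.

```python
from collections import deque
from datetime import date


def stale_dependencies(graph, last_release, root, today, max_age_days=365):
    """Return {package: depth} for each direct or transitive dependency of
    `root` whose latest release is older than `max_age_days`.

    Assumption: a missing release date is treated as a negative sign too.
    """
    flagged = {}
    seen = {root}
    # Breadth-first traversal, so each package is reported at its shallowest depth.
    queue = deque((dep, 1) for dep in graph.get(root, []))
    while queue:
        pkg, depth = queue.popleft()
        if pkg in seen:
            continue  # already visited via a shorter path (also guards against cycles)
        seen.add(pkg)
        released = last_release.get(pkg)
        if released is None or (today - released).days > max_age_days:
            flagged[pkg] = depth
        queue.extend((dep, depth + 1) for dep in graph.get(pkg, []))
    return flagged


# Hypothetical example data: "app" depends on lib-a and lib-b, both depend on lib-c.
graph = {
    "app": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": ["lib-c"],
    "lib-c": [],
}
last_release = {
    "lib-a": date(2022, 3, 1),
    "lib-b": date(2019, 6, 15),
    "lib-c": date(2018, 1, 10),
}

report = stale_dependencies(graph, last_release, "app", today=date(2022, 6, 1))
print(report)  # depth 1 = direct dependency, depth 2 or more = transitive
```

A single release-age threshold is only one possible signal; the depth value is kept so that a maintainer can distinguish a stale direct dependency from one that is pulled in transitively.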

Article
Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes. Algorithms like k-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted, using recent data for predicting multiple health indicators of open-source projects. To facilitate open science (and replications and extensions of this work), all our materials are available online at https://github.com/arennax/Health_Indicator_Prediction.
Article
Third-party libraries are a key building block in software development as they allow developers to reuse common functionalities instead of reinventing the wheel. However, third-party libraries and client projects are developed and continuously evolving in an asynchronous way. As a result, outdated third-party libraries might be commonly used in client projects, while developers are unaware of the potential risk (e.g., security bugs) in usages. Outdated third-party libraries might be updated in client projects in a delayed way, while developers are less aware of the potential risk (e.g., API incompatibilities) in updates. Developers of third-party libraries may be unaware of how their third-party libraries are used or updated in client projects. Therefore, a quantitative and holistic study on usages, updates and risks of third-party libraries in open-source projects can provide concrete evidence on these problems, and practical insights to improve the ecosystem sustainably. In this paper, we make the first contribution towards such a study in the Java ecosystem. First, using 806 open-source projects and 13,565 third-party libraries, we conduct a library usage analysis (e.g., usage intensity and usage outdatedness), followed by a library update analysis (e.g., update intensity and update delay). The two analyses aim to quantify usage and update practices from the two holistic perspectives of open-source projects and third-party libraries. Then, we carry out a library risk analysis (e.g., usage risk and update risk) on 806 open-source projects and 544 security bugs. This analysis aims to quantify the potential risk of using and updating outdated third-party libraries with respect to security bugs. Our findings suggest practical implications to developers and researchers on problems and potential solutions in maintaining third-party libraries (e.g., smart alerting and automated updating of outdated third-party libraries). 
To demonstrate the usefulness of our findings, we propose a security bug-driven alerting system, named LibSecurify, for assisting developers to make confident decisions by quantifying risks and effort when updating outdated third-party libraries. 33 open-source projects have confirmed the presence of security bugs after receiving our alerts, and 24 of those 33 have updated their third-party libraries. We have released our dataset to foster valuable applications and improve the Java third-party library ecosystem.
Article
Systems with unmaintained embedded open source software (OSS) components are vulnerable to severe risks. This article introduces the OSS Abandonment Risk Assessment model to help companies avoid potentially dire consequences.
Article
Context: Open Source Software (OSS) is nowadays used and integrated in most commercial products. However, the selection of OSS projects for integration is not a simple process, mainly due to a lack of clear selection models and a lack of information from the OSS portals. Objective: We investigate the factors and metrics that practitioners currently consider when selecting OSS. We also investigate the sources of information and portals that can be used to assess the factors, as well as the possibility of automatically extracting such information with APIs. Method: We elicited the factors and the metrics adopted to assess and compare OSS by performing a survey among 23 experienced developers who often integrate OSS in the software they develop. Moreover, we investigated the APIs of the portals adopted to assess OSS, extracting information for the 100K most starred projects in GitHub. Result: We identified a set consisting of 8 main factors and 74 sub-factors, together with 170 related metrics that companies can use to select OSS to be integrated in their software projects. Unexpectedly, only a small part of the factors can be evaluated automatically, and out of 170 metrics, only 40 are available, of which only 22 returned information for all the 100K projects. Therefore, we recommend that project maintainers and project repositories pay attention to providing information for the projects they are hosting, so as to increase the likelihood of being adopted. Conclusion: OSS selection can be partially automated by extracting the information needed for the selection from portal APIs. OSS producers can benefit from our results by checking if they are providing all the information commonly required by potential adopters. Developers can benefit from our results, using the list of factors we selected as a checklist during the selection of OSS, or using the APIs we developed to automatically extract the data from OSS projects.
Conference Paper
Background. Open Source Software (OSS) is experiencing an increasing popularity both in industry and in academia. Aim. We investigated models for the selection, evaluation, and adoption of OSS, focusing on factors that affect most the evaluation of OSS. Method. We conducted a Systematic Literature Review of 262 studies published until the end of 2019, to understand whether OSS selection is still an interesting topic for researchers, and which factors are considered by stakeholders and are assessed by the available models. Result. We selected 60 primary studies: 20 surveys and 5 lessons learned studies elicited the motivations for OSS adoption; 35 papers proposed several OSS evaluation models focusing on different technical aspects. This Systematic Literature Review provides an overview of the available OSS evaluation methods, highlighting their limits and strengths, based on the wide range of technicalities and aspects explored by the selected primary studies. Conclusion. OSS producers can benefit from our results by checking if they are providing all the information commonly required by potential adopters. Users can learn how models work and which models cover the relevant characteristics of OSS they are most interested in.
Article
Context: GitHub hosts an impressive number of high-quality OSS projects. However, selecting the "right tool for the job" is a challenging task, because we do not have precise information about those high-quality projects. Objective: In this paper, we propose a data-driven approach to measure the level of maintenance activity of GitHub projects. Our goal is to alert users about the risks of using unmaintained projects and possibly motivate other developers to assume the maintenance of such projects. Method: We train machine learning models to define a metric to express the level of maintenance activity of GitHub projects. Next, we analyze the historical evolution of 2,927 active projects in the time frame of one year. Results: From 2,927 active projects, 16% become unmaintained in the interval of one year. We also found that Objective-C projects tend to have lower maintenance activity than projects implemented in other languages. Finally, software tools--such as compilers and editors--have the highest maintenance activity over time. Conclusions: A metric about the level of maintenance activity of GitHub projects can help developers to select open source projects.
Article
Nowadays, with the rapid growth of open source software (OSS), library reuse becomes more and more popular since a large number of third-party libraries are available to download and reuse. A deeper understanding of why developers reuse a library (i.e., replacing self-implemented code with an external library) or re-implement a library (i.e., replacing an imported external library with self-implemented code) could help researchers better understand the factors that developers are concerned with when reusing code. This understanding can then be used to improve existing libraries and API recommendation tools for researchers and practitioners by using the developers' concerns identified in this study as design criteria. In this work, we investigated the reasons behind library reuse and re-implementation. To achieve this goal, we first crawled data from two popular sources, F-Droid and GitHub. Then, potential instances of library reuse and re-implementation were found automatically based on certain heuristics. Next, for each instance, we further manually identified whether it is valid or not. For library re-implementation, we obtained 82 instances which are distributed in 75 repositories. We then conducted two types of surveys (i.e., individual survey to corresponding developers of the validated instances and another open survey) for library reuse and re-implementation. For library reuse individual survey, we received 36 responses out of 139 contacted developers. For re-implementation individual survey, we received 13 responses out of 71 contacted developers. In addition, we received 56 responses from the open survey. Finally, we perform qualitative and quantitative analysis on the survey responses and commit logs of the validated instances. The results suggest that library reuse occurs mainly because developers were initially unaware of the library or the library had not been introduced.
Re-implementation occurs mainly because the used library method is only a small part of the library, the library dependencies are too complicated, or the library method is deprecated. Finally, based on all findings obtained from analyzing the surveys and commit messages, we provided a few suggestions to improve the current library recommendation systems: tailored recommendation according to users’ preferences, detection of external code that is similar to a part of the users’ code (to avoid duplication or re-implementation), grouping similar recommendations for developers to compare and select the one they prefer, and disrecommendation of poor-quality libraries.
Conference Paper
Open-source projects do not exist in a vacuum. They benefit from reusing other projects and themselves are being reused by others, creating complex networks of interdependencies, i.e., software ecosystems. Therefore, the sustainability of projects comprising ecosystems may no longer be determined solely by factors internal to the project, but rather by the ecosystem context as well. In this paper we report on a mixed-methods study of ecosystem-level factors affecting the sustainability of open-source Python projects. Quantitatively, using historical data from 46,547 projects in the PyPI ecosystem, we modeled the chances of project development entering a period of dormancy (limited activity) as a function of the projects' position in their dependency networks, organizational support, and other factors. Qualitatively, we triangulated the revealed effects and further expanded on our models through interviews with project maintainers. Results show that the number of project ties and the relative position in the dependency network have significant impact on sustained project activity, with nuanced effects early in a project's life cycle and later on.
Conference Paper
Background: Vulnerable dependencies are a known problem in today's open-source software ecosystems because OSS libraries are highly interconnected and developers do not always update their dependencies. Aim: Our paper addresses the over-inflation problem of academic and industrial approaches for reporting vulnerable dependencies in OSS software, and therefore, caters to the needs of industrial practice for correct allocation of development and audit resources. Method: Careful analysis of deployed dependencies, aggregation of dependencies by their projects, and distinction of halted dependencies allow us to obtain a counting method that avoids over-inflation. To understand the industrial impact of a more precise approach, we considered the 200 most popular OSS Java libraries used by SAP in its own software. Our analysis included 10905 distinct GAVs (group, artifact, version) in Maven when considering all the library versions. Results: We found that about 20% of the dependencies affected by a known vulnerability are not deployed, and therefore, they do not represent a danger to the analyzed library because they cannot be exploited in practice. Developers of the analyzed libraries are able to fix (and actually responsible for) 82% of the deployed vulnerable dependencies. The vast majority (81%) of vulnerable dependencies may be fixed by simply updating to a new version, while 1% of the vulnerable dependencies in our sample are halted, and therefore, potentially require a costly mitigation strategy. Conclusions: Our case study shows that the correct counting allows software development companies to receive actionable information about their library dependencies, and therefore, correctly allocate costly development and audit resources, which is spent inefficiently in case of distorted measurements.
Conference Paper
Modern software systems build on a significant number of external libraries to deliver feature-rich and high-quality software in a cost-efficient and timely manner. As a consequence, these systems contain a considerable amount of third-party code. External libraries thus have a significant impact on maintenance activities in the project. However, most approaches that assess the maintainability of software systems largely neglect this important factor. Hence, risks may remain unidentified, threatening the ability to effectively evolve the system in the future. We propose a structured approach to assess the third-party library usage in software projects and identify potential problems. Industrial experience strongly influences our approach, which we designed in a lightweight way to enable easy adoption in practice. We present an industrial case study showing the applicability of the approach to a real-world software system.
Conference Paper
Open Source Software (OSS) proponents suggest that when developers lose interest in their project, their last duty is to "hand it off to a competent successor." However, the mechanisms of such a hand-off are not clear, or widely known among OSS developers. As a result, many OSS projects, after a certain long period of evolution, stop evolving, in fact becoming "inactive" or "abandoned" projects. This paper presents an analysis of the population of projects contained within one of the largest OSS repositories available (SourceForge.net), in order to identify how projects abandoned by their developers can be identified, and to identify the attributes and characteristics of these inactive projects. In particular, the paper attempts to differentiate projects that experienced maintainability issues from those that are inactive for other reasons, in order to be able to correlate common characteristics to the "failure" of these projects.
Article
Case study is a suitable research methodology for software engineering research since it studies contemporary phenomena in its natural context. However, the understanding of what constitutes a case study varies, and hence the quality of the resulting studies. This paper aims at providing an introduction to case study methodology and guidelines for researchers conducting case studies and readers studying reports of such studies. The content is based on the authors’ own experience from conducting and reading case studies. The terminology and guidelines are compiled from different methodology handbooks in other research domains, in particular social science and information systems, and adapted to the needs in software engineering. We present recommended practices for software engineering case studies as well as empirically derived and evaluated checklists for researchers and readers of case study research. Keywords: Case study, Research methodology, Checklists, Guidelines
Article
Vulnerable dependencies are a known problem in today's free open-source software ecosystems because FOSS libraries are highly interconnected, and developers do not always update their dependencies. Our paper proposes Vuln4Real, the methodology for counting actually vulnerable dependencies, that addresses the over-inflation problem of academic and industrial approaches for reporting vulnerable dependencies in FOSS software, and therefore, caters to the needs of industrial practice for correct allocation of development and audit resources. To understand the industrial impact of a more precise methodology, we considered the 500 most popular FOSS Java libraries used by SAP in its own software. Our analysis included 25767 distinct library instances in Maven. We found that the proposed methodology has visible impacts on both ecosystem view and the individual library developer view of the situation of software dependencies: Vuln4Real significantly reduces the number of false alerts for deployed code (dependencies wrongly flagged as vulnerable), provides meaningful insights on the exposure to third-parties (and hence vulnerabilities) of a library, and automatically predicts when dependency maintenance starts lagging, so it may not receive updates for arising issues.
Conference Paper
Commercial use of open source software is on the rise as more companies realize the benefits of using FLOSS components in their products. At the same time, the ungoverned use of such components can result in legal, financial, intellectual property, and other risks. To mitigate these risks, companies must govern their use of open source through appropriate processes. This paper presents an initial theory of industry best practices on getting started with open source governance and compliance. Through a qualitative survey, we conducted and analyzed 15 expert interviews in companies with advanced capabilities in open source governance. We also studied practitioner reports on existing practices for introducing FLOSS governance processes. We cast our resulting initial theory in the actionable format of best practice patterns that, when combined, form a practical handbook of getting started with FLOSS governance in companies.
Article
Software reuse is finally here but comes with risks.
Conference Paper
BACKGROUND: Software libraries provide a set of reusable functionality, which helps developers write code in a systematic and timely manner. However, selecting the appropriate library to use is often not a trivial task. AIMS: In this paper, we investigate the usefulness of software metrics in helping developers choose libraries. Different developers care about different aspects of a library and two developers looking for a library in a given domain may not necessarily choose the same library. Thus, instead of directly recommending a library to use, we provide developers with a metric-based comparison of libraries in the same domain to empower them with the information they need to make an informed decision. METHOD: We use software data analytics from several sources of information to create quantifiable metric-based comparisons of software libraries. For evaluation, we select 34 open-source Java libraries from 10 popular domains and extract nine metrics related to these libraries. We then conduct a survey of 61 developers to evaluate whether our proposed metric-based comparison is useful, and to understand which metrics developers care about. RESULTS: Our results show that developers find that the proposed technique provides useful information when selecting libraries. We observe that developers care the most about metrics related to the popularity, security, and performance of libraries. We also find that the usefulness of some metrics may vary according to the domain. CONCLUSIONS: Our survey results showed that our proposed technique is useful. We are currently building a public website for metric-based library comparisons, while incorporating the feedback we obtained from our survey participants.
Conference Paper
With millions of apps available to users, the mobile app market is rapidly becoming very crowded. Given the intense competition, the time to market is a critical factor for the success and profitability of an app. In order to shorten the development cycle, developers often focus their efforts on the unique features and workflows of their apps and rely on third-party Open Source Software (OSS) for the common features. Unfortunately, despite their benefits, careless use of OSS can introduce significant legal and security risks, which if ignored can not only jeopardize security and privacy of end users, but can also cause app developers high financial loss. However, tracking OSS components, their versions, and interdependencies can be very tedious and error-prone, particularly if an OSS is imported with little to no knowledge of its provenance. We therefore propose OSSPolice, a scalable and fully-automated tool for mobile app developers to quickly analyze their apps and identify free software license violations as well as usage of known vulnerable versions of OSS. OSSPolice introduces a novel hierarchical indexing scheme to achieve both high scalability and accuracy, and is capable of efficiently comparing similarities of app binaries against a database of hundreds of thousands of OSS sources (billions of lines of code). We populated OSSPolice with 60K C/C++ and 77K Java OSS sources and analyzed 1.6M free Google Play Store apps. Our results show that 1) over 40K apps potentially violate GPL/AGPL licensing terms, and 2) over 100K of apps use known vulnerable versions of OSS. Further analysis shows that developers violate GPL/AGPL licensing terms due to lack of alternatives, and use vulnerable versions of OSS despite efforts from companies like Google to improve app security. OSSPolice is available on GitHub.
Conference Paper
Nearly every popular programming language comes with one or more open source software packaging ecosystem(s), containing a large collection of interdependent software packages developed in that programming language. Such packaging ecosystems are extremely useful for their respective software development community. We present an empirical analysis of how the dependency graphs of three large packaging ecosystems (npm, CRAN and RubyGems) evolve over time. We study how the existing package dependencies impact the resilience of the three ecosystems over time and to which extent these ecosystems suffer from issues related to package dependency updates. We analyse specific solutions that each ecosystem has put into place and argue that none of these solutions is perfect, motivating the need for better tools to deal with package dependency update problems.
Article
The purpose of this explanatory mixed methods study was to examine the perceived value of mixed methods research for graduate students. The quantitative phase was an experiment examining the effect of a passage’s methodology on students’ perceived value. Results indicated students scored the mixed methods passage as more valuable than those who scored the quantitative or qualitative passage. The qualitative phase involved focus groups to better understand students’ perceptions of the perceived value of mixed methods. Findings suggested graduate students view mixed methods passages as having rigorous methods, a newer history, and providing a deeper meaning of the phenomenon. This study adds to the literature base by revealing what value graduate students assign to quantitative, qualitative, and mixed methods research.
Article
A fundamental problem in many disciplines is the classification of objects in a domain of interest into a taxonomy. Developing a taxonomy, however, is a complex process that has not been adequately addressed in the information systems (IS) literature. The purpose of this paper is to present a method for taxonomy development that can be used in IS. First, this paper demonstrates through a comprehensive literature survey that taxonomy development in IS has largely been ad hoc. Then the paper defines the problem of taxonomy development. Next, the paper presents a method for taxonomy development that is based on taxonomy development literature in other disciplines and shows that the method has certain desirable qualities. Finally, the paper demonstrates the efficacy of the method by developing a taxonomy in a domain in IS.
Article
Many of today's most innovative products and solutions are developed on the basis of free and open source software (FOSS). Most of us can no longer imagine the world of software engineering without open source operating systems, databases, application servers, Web servers, frameworks, and tools. Brands such as Linux, MySQL, Apache, and Eclipse have shaped product and service development. They facilitate competition and open markets as well as innovation to meet new challenges. De facto FOSS standards such as Eclipse and Corba simplify the integration of products, whether they're all from one company or from multiple suppliers. IEEE Software has assembled this theme section to provide a brief yet practical overview of where FOSS is heading.
Article
To make better decisions relative to CBSs, we need empirical knowledge. To gain this knowledge, we must more fully understand the lifecycle processes people use when harnessing COTS packages. The initial findings reported here are but the first step in our attempts to capture this empirical knowledge. We plan to continue collecting data and investigating the phenomenology of COTS-based systems.
Veronika Bauer and Lars Heinemann. 2012. Understanding API usage to support informed decision making in software maintenance. In 2012 16th European Conference on Software Maintenance and Reengineering. IEEE, 435-440.
Jusop Choi, Wonseok Choi, William Aiken, Hyoungshick Kim, Jun Ho Huh, Taesoo Kim, Yongdae Kim, and Ross Anderson. 2022. Attack of the Clones: Measuring the Maintainability, Originality and Security of Bitcoin 'Forks' in the Wild. arXiv preprint arXiv:2201.08678 (2022).
Raula Gaikovina Kula, Coen De Roover, Daniel German, Takashi Ishio, and Katsuro Inoue. 2014. Visualizing the evolution of systems and their library dependencies. In 2014 Second IEEE Working Conference on Software Visualization. IEEE, 127-136.
Suhaib Mujahid, Diego Elias Costa, Rabe Abdalkareem, Emad Shihab, Mohamed Aymen Saied, and Bram Adams. 2021. Toward using package centrality trend to identify packages in decline. IEEE Transactions on Engineering Management (2021).
Mike Pittenger. 2016. Open source security analysis: The state of open source security in commercial applications. Black Duck Software, Tech. Rep (2016).
Steven Raemaekers, Arie van Deursen, and Joost Visser. 2011. Exploring risks in the usage of third-party libraries. In Proceedings of the BElgian-NEtherlands software eVOLution seminar. 31.
Sefa Eren Sahin, Kubilay Karpat, and Ayse Tosun. 2019. Predicting Popularity of Open Source Projects Using Recurrent Neural Networks. In IFIP International Conference on Open Source Systems. Springer, 80-90.