Article · Publisher preview available

Predicting health indicators for open source projects (using hyperparameter optimization)

Authors: Tianpei Xia, Wei Fu, Rui Shu, Rishabh Agrawal, Tim Menzies

Abstract and Figures

Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes. Algorithms like k-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted that uses recent data to predict multiple health indicators of open-source projects. To facilitate open science (and replications and extensions of this work), all our materials are available online at https://github.com/arennax/Health_Indicator_Prediction.
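The abstract's core claim, that tuning a learner's control parameters can cut prediction error, can be sketched with a toy example. Everything below is illustrative (a one-feature KNN regressor and a small grid of k values), not the paper's actual pipeline or data:

```python
import random

def knn_regress(train, query, k):
    """Predict by averaging the targets of the k nearest training rows (1-D feature)."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    return sum(y for _, y in nearest) / k

def tune_k(train, valid, candidates):
    """Hyperparameter optimization in miniature: pick the k with the
    lowest mean absolute error on a held-out validation split."""
    def mae(k):
        return sum(abs(knn_regress(train, x, k) - y) for x, y in valid) / len(valid)
    return min(candidates, key=mae)

# Toy project-month data: (month index, activity count), deterministic noise.
data = [(m, 2 * m + random.Random(m).uniform(-1, 1)) for m in range(24)]
train, valid = data[:18], data[18:]
best_k = tune_k(train, valid, candidates=[1, 2, 3, 5, 7])
```

The same pattern scales up to the paper's setting by swapping in real learners (KNN, SVR, CART, etc.) and a real search strategy over their parameter spaces.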
https://doi.org/10.1007/s10664-022-10171-0
Predicting health indicators for open source projects
(using hyperparameter optimization)
Tianpei Xia¹ · Wei Fu¹ · Rui Shu¹ · Rishabh Agrawal¹ · Tim Menzies¹
Accepted: 17 March 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Keywords Hyperparameter optimization · Project health · Machine learning
Communicated by: Federica Sarro
Tim Menzies
timm@ieee.org
Tianpei Xia
txia4@ncsu.edu
Wei Fu
fuwei.ee@gmail.com
Rui Shu
rshu@ncsu.edu
Rishabh Agrawal
agrawa3@ncsu.edu
¹ Department of Computer Science, North Carolina State University, Raleigh, NC, USA
Published online: 22 June 2022
Empirical Software Engineering (2022) 27: 122
... Different studies predicted specific features for different topics, such as health-related features [22,37] or popularity measures [4,35]. Predictions for multivariate maintenance activity features are missing so far. Statistical algorithms such as logistic regression, k-nearest neighbors, support vector regression, linear regression and regression trees [22,37], but also neural-network-based algorithms such as LSTM RNNs [4,35], were applied. The prediction periods range from 1 to 30 days [4,22,35,37], 1 to 6 months [4,22,37], and up to 12 months and longer [4,37]. The main gap of RQ3 is the integration of transitive links into the prediction approach, since current approaches are based on project-level features only. ...
Preprint
Industrial applications heavily integrate open-source software libraries nowadays. Beyond the benefits that libraries bring, they can also impose a real threat in case a library is affected by a vulnerability but its community is not active in creating a fixing release. Therefore, I want to introduce an automatic monitoring approach for industrial applications to identify open-source dependencies that show negative signs regarding their current or future maintenance activities. Since most research in this field is limited due to lack of features, labels, and transitive links, and thus is not applicable in industry, my approach aims to close this gap by capturing the impact of direct and transitive dependencies in terms of their maintenance activities. Automatically monitoring the maintenance activities of dependencies reduces the manual effort of application maintainers and supports application security by continuously having well-maintained dependencies.
... Understanding when a software project is in a healthy state remains a critical yet unsolved challenge in software development. While repositories provide extensive data about project activities, from code changes to community interactions, current approaches struggle to convert this wealth of information into actionable insights about project health [1,2]. This gap affects both practitioners managing projects and researchers studying software development. ...
... The potential impact of our stability-based vision is threefold: (1) it enables systematic, quantitative evaluation of repository health, (2) it paves the way for data-driven project management through clear stability metrics, and (3) it opens new research directions in empirical software engineering. By bridging control theory and repository analysis, we offer a rigorous yet practical approach to a fundamental challenge. ...
Preprint
Full-text available
Drawing from engineering systems and control theory, we introduce a framework to understand repository stability: a repository's capacity to return to equilibrium following disturbances such as a sudden influx of bug reports, key contributor departures, or a spike in feature requests. The framework quantifies stability through four indicators: commit patterns, issue resolution, pull request processing, and community engagement, measuring development consistency, problem-solving efficiency, integration effectiveness, and sustainable participation, respectively. These indicators are synthesized into a Composite Stability Index (CSI) that provides a normalized measure of repository health proxied by its stability. Finally, the framework introduces several important theoretical properties that validate its usefulness as a measure of repository health and stability. At a conceptual phase and open to debate, our work establishes mathematical criteria for evaluating repository stability and proposes new ways to understand sustainable development practices. The framework bridges control theory concepts with modern collaborative software development, providing a foundation for future empirical validation.
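The synthesis step described above can be sketched as a weighted sum of the four normalized indicators. The equal weights and indicator names below are assumptions, since the abstract leaves the exact formula open:

```python
def composite_stability_index(indicators, weights=None):
    """Combine four normalized stability indicators (each in [0, 1]) into one
    score. Equal weights are an assumption; the CSI paper leaves the exact
    synthesis open for debate."""
    names = ("commits", "issues", "pull_requests", "community")
    if weights is None:
        weights = {n: 0.25 for n in names}
    for n in names:
        if not 0.0 <= indicators[n] <= 1.0:
            raise ValueError(f"indicator {n!r} must be normalized to [0, 1]")
    return sum(weights[n] * indicators[n] for n in names)

csi = composite_stability_index(
    {"commits": 0.9, "issues": 0.7, "pull_requests": 0.8, "community": 0.6}
)
```

With equal weights the score is simply the mean of the four indicators, here 0.75; unequal weights would let a team emphasize, say, issue resolution over commit cadence.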
... DE has been widely applied [67]. Within software engineering, DE has been used for optimization tasks such as Fu et al.'s tuning study on defect prediction [68], Shu et al.'s study on tuning detectors for security issues [69], and Xia et al.'s study that tuned project health predictors for open-source Java systems [34]. ...
Preprint
Full-text available
SE analytics problems do not always need complex AI. Better and faster solutions can sometimes be obtained by matching the complexity of the problem to the complexity of the solution. This paper introduces the Dimensionality Reduction Ratio (DRR), a new metric for predicting when lightweight algorithms suffice. Analyzing SE optimization problems ranging from software configuration to process decisions and open-source project health, we show that DRR pinpoints "simple" tasks where costly methods like DEHB (a state-of-the-art evolutionary optimizer) are overkill. For high-DRR problems, simpler methods can be just as effective and run two orders of magnitude faster.
... (3) In the realm of ML algorithm selection, previous studies have predominantly employed a single proposed algorithm for pattern recognition within the DBDD framework (Bokaeian et al., 2021; Fallahian et al., 2018b, 2022). Additionally, some investigations have explored the influence of HPs on algorithm performance and implemented methods to optimize these HPs (Adedeji et al., 2021; Agrawal and Chakraborty, 2021; Chandrasekhar et al., 2021; Gui et al., 2017; Gulgec et al., 2019; Guo et al., 2019; Iannelli et al., 2022; Ibrahim et al., 2019; Kaur et al., 2020; Kong et al., 2022; Li et al., 2023; Xia et al., 2022). The DBDD method's theoretical underpinnings clearly indicate that the choice of learning algorithm and its associated HP values significantly impacts prediction accuracy. ...
Article
One of the appealing alternatives, especially for detecting, localizing, and quantifying multiple simultaneous damages in structures, is using dataset-based damage detection (DBDD) techniques. While machine learning (ML) algorithms have been applied to pattern recognition in DBDD studies, comprehensive research and sensitivity analysis on the effects of various parameters, such as ML algorithm choice, hyperparameter (HP) tuning, and feature engineering techniques, remain limited. This can hinder the selection of optimal parameter values, especially for large structures, without incurring significant computational costs. This study addresses this gap by employing extensive single-objective sensitivity analyses to evaluate the impact of these parameters on DBDD damage detection in a 3D sample structure, whose numerical model has been verified by comparing its Frequency Response Functions (FRFs) with experimental data. By comparing grid search outcomes, it was shown that leveraging insights from single-objective optimization can significantly reduce computational costs by limiting the parameter search space. Additionally, an innovative hybrid feature engineering method is proposed to enhance feature quality, reduce dataset size, and enable the feasibility of conducting extensive sensitivity analyses. Furthermore, the study investigates the impact of ensemble techniques with tuned algorithms and excitation point configuration on DBDD prediction accuracy. By using the proposed feature engineering, optimal excitation configuration, and optimized ML algorithms and HPs, in addition to shrinking the dataset to 5% of its original size, the accuracy of DBDD damage prediction can be improved by up to 60%. These results can be leveraged to pre-select optimal parameters for DBDD damage detection in similar structures, significantly reducing computational costs and improving accuracy.
... However, the FOSS sustainability impact on the community's product is less known. While recent FOSS sustainability research has focused on forecasting sustainability (Yin et al. 2021), reasons for project failure (Coelho and Valente 2017), community health indicators (Xia et al. 2022; Manikas and Hansen 2013; Linåker et al. 2022), project popularity (Han et al. 2019; Borges et al. 2016), and developers' and users' attraction to the community (Chengalur-Smith et al. 2010), little work has been devoted to the implications of FOSS sustainability and its various aspects on community outputs. Software quality (SWQ) is deemed an important aspect of these outputs (Vasilescu et al. 2015). ...
Article
Full-text available
Context: Free and Open Source Software (FOSS) communities' ability to stay viable and productive over time is pivotal for society as they maintain the building blocks that digital infrastructure, products, and services depend on. Sustainability may, however, be characterized from multiple aspects, and less is known about how these aspects interplay and impact community outputs, and software quality specifically.
Objective: This study, therefore, aims to empirically explore how the different aspects of FOSS sustainability impact software quality.
Method: 16 sustainability metrics across four categories were sampled and applied to a set of 217 OSS projects sourced from the Apache Software Foundation Incubator program. The impact of a decline in the sustainability metrics was analyzed against eight software quality metrics using Bayesian data analysis, which incorporates probability distributions to represent the regression coefficients and intercepts.
Results: Findings suggest that the selected sustainability metrics do not significantly affect defect density or code coverage. However, a positive impact of community age was observed on specific code quality metrics, such as risk complexity, number of very large files, and code duplication percentage. Interestingly, findings show that even when communities are experiencing sustainability, certain code quality metrics are negatively impacted.
Conclusion: Findings imply that code quality practices are not consistently linked to sustainability, and defect management and prevention may be prioritized over the former. Results suggest that growth, resulting in a more complex and large codebase, combined with a probable lack of understanding of code quality standards, may explain the degradation in certain aspects of code quality.
Article
When data is scarce, software analytics can make many mistakes. For example, consider learning predictors for open source project health (e.g. the number of closed pull requests in twelve months' time). The training data for this task may be very small (e.g. five years of data, collected every month, means just 60 rows of training data). The models generated from such tiny data sets can make many prediction errors. Those errors can be tamed by a landscape analysis that selects better learner control parameters. Our niSNEAK tool (a) clusters the data to find the general landscape of the hyperparameters; then (b) explores a few representatives from each part of that landscape. niSNEAK is both faster and more effective than prior state-of-the-art hyperparameter optimization algorithms (e.g. FLASH, HYPEROPT, OPTUNA). The configurations found by niSNEAK have far less error than other methods. For example, for project health indicators such as C = number of commits, I = number of closed issues, and R = number of closed pull requests, niSNEAK's 12-month prediction errors are {I=0%, R=33%, C=47%}, while other methods have far larger errors of {I=61%, R=119%, C=149%}. We conjecture that niSNEAK works so well since it finds the most informative regions of the hyperparameters, then jumps to those regions. Other methods (that do not reflect over the landscape) can waste time exploring less informative options. Based on the above, we recommend landscape analytics (e.g. niSNEAK) especially when learning from very small data sets. This paper only explores the application of niSNEAK to project health. That said, we see nothing in principle that prevents the application of this technique to a wider range of problems. To assist other researchers in repeating, improving, or even refuting our results, all our scripts and data are available on GitHub at https://github.com/zxcv123456qwe/niSneak.
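The cluster-then-probe idea behind niSNEAK can be sketched in a few lines. The equal-width bucketing and single-number configurations below are deliberate simplifications; the actual tool clusters full hyperparameter vectors:

```python
def cluster_then_probe(configs, score, n_clusters=3):
    """Landscape-style tuning sketch: partition candidate configurations into
    clusters, evaluate only one representative per cluster, and return the
    best representative. Configs here are single numbers and clustering is
    plain equal-width bucketing, a stand-in for richer clustering."""
    lo, hi = min(configs), max(configs)
    width = (hi - lo) / n_clusters or 1
    buckets = {}
    for c in configs:
        i = min(int((c - lo) / width), n_clusters - 1)
        buckets.setdefault(i, []).append(c)
    # Representative = cluster median; only these few configs get evaluated.
    reps = [sorted(b)[len(b) // 2] for b in buckets.values()]
    return min(reps, key=score)

# 30 candidate configs, but only 3 evaluations are spent.
best = cluster_then_probe(list(range(1, 31)), score=lambda c: abs(c - 12))
```

The payoff is the evaluation budget: 3 calls to `score` instead of 30, which matters when each evaluation means training a learner on the data.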
Article
Full-text available
This paper claims that a new field of empirical software engineering research and practice is emerging: data mining using/used-by optimizers for empirical studies, or DUO. For example, data miners can generate models that are explored by optimizers. Also, optimizers can advise how to best adjust the control parameters of a data miner. This combined approach acts like an agent leaning over the shoulder of an analyst that advises “ask this question next” or “ignore that problem, it is not relevant to your goals”. Further, those agents can help us build “better” predictive models, where “better” can be either greater predictive accuracy or faster modeling time (which, in turn, enables the exploration of a wider range of options). We also caution that the era of papers that just use data miners is coming to an end. Results obtained from an unoptimized data miner can be quickly refuted, just by applying an optimizer to produce a different (and better performing) model. Our conclusion, hence, is that for software analytics it is possible, useful and necessary to combine data mining and optimization using DUO.
Article
Full-text available
As software systems grow in complexity and the space of possible configurations increases exponentially, finding the near-optimal configuration of a software system becomes challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, collecting enough sample configurations can be very expensive since each such sample requires configuring, compiling, and executing the entire system using a complex test suite. When learning on new data is too expensive, it is possible to use Transfer Learning to “transfer” old lessons to the new context. Traditional transfer learning has a number of challenges, specifically, (a) learning from excessive data takes excessive time, and (b) the performance of the models built via transfer can deteriorate as a result of learning from a poor source. To resolve these problems, we propose a novel transfer learning framework called BEETLE, which is a “bellwether”-based transfer learner that focuses on identifying and learning from the most relevant source from amongst the old data. This paper evaluates BEETLE with 57 different software configuration problems based on five software systems (a video encoder, an SAT solver, a SQL database, a high-performance C-compiler, and a streaming data analytics tool). In each of these cases, BEETLE found configurations that are as good as or better than those found by other state-of-the-art transfer learners while requiring only a fraction 1/7th of the measurements needed by those other methods. Based on these results, we say that BEETLE is a new high-water mark in optimally configuring software.
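The bellwether idea underlying BEETLE, picking the source whose model transfers best to the other datasets, can be sketched as follows. The toy datasets and fitting routine below are illustrative, not BEETLE's actual procedure:

```python
def find_bellwether(datasets, train, error):
    """Bellwether-style source selection sketch: for each candidate source,
    train on it and sum prediction error over every *other* dataset; the
    source with the lowest total error is the bellwether to transfer from.
    `train` and `error` are caller-supplied callables (assumed interfaces)."""
    def total_error(name):
        model = train(datasets[name])
        return sum(error(model, datasets[o]) for o in datasets if o != name)
    return min(datasets, key=total_error)

# Toy (x, y) datasets; the "model" is just a slope fitted through the origin.
datasets = {
    "a": [(x, 2 * x) for x in range(1, 6)],
    "b": [(x, 2 * x + 1) for x in range(1, 6)],
    "c": [(x, 5 * x) for x in range(1, 6)],
}

def train(rows):
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)

def error(k, rows):
    return sum(abs(k * x - y) for x, y in rows) / len(rows)

best_source = find_bellwether(datasets, train, error)
```

Here "b" wins because its fitted slope sits between the other two datasets' trends, so it transfers with the lowest total error, which is exactly the bellwether intuition.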
Article
Full-text available
Context: GitHub hosts an impressive number of high-quality OSS projects. However, selecting the "right tool for the job" is a challenging task, because we do not have precise information about those high-quality projects. Objective: In this paper, we propose a data-driven approach to measure the level of maintenance activity of GitHub projects. Our goal is to alert users about the risks of using unmaintained projects and possibly motivate other developers to assume the maintenance of such projects. Method: We train machine learning models to define a metric to express the level of maintenance activity of GitHub projects. Next, we analyze the historical evolution of 2,927 active projects in the time frame of one year. Results: From 2,927 active projects, 16% become unmaintained in the interval of one year. We also found that Objective-C projects tend to have lower maintenance activity than projects implemented in other languages. Finally, software tools--such as compilers and editors--have the highest maintenance activity over time. Conclusions: A metric about the level of maintenance activity of GitHub projects can help developers to select open source projects.
Article
Full-text available
Machine learning techniques applied to software engineering tasks can be improved by hyperparameter optimization, i.e., automatic tools that find good settings for a learner's control parameters. We show that such hyperparameter optimization can be unnecessarily slow, particularly when the optimizers waste time exploring "redundant tunings", i.e., pairs of tunings which lead to indistinguishable results. By ignoring redundant tunings, DODGE, a tuning tool, runs orders of magnitude faster, while also generating learners with more accurate predictions than seen in prior state-of-the-art approaches.
Article
How can we make software analytics simpler and faster? One method is to match the complexity of analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by "DODGE-ing"; i.e., simply steering away from settings that lead to similar conclusions. But when is it wise to use that simple approach and when must we use more complex (and much slower) optimizers? To answer this, we applied hyperparameter optimization to 120 SE data sets that explored bad smell detection, predicting GitHub issue close time, bug report analysis, defect prediction, and dozens of other non-SE problems. We find that the simple DODGE works best for data sets with low "intrinsic dimensionality" (μ_D ≈ 3) and very poorly for higher-dimensional data (μ_D > 8). Nearly all the SE data seen here was intrinsically low-dimensional, indicating that DODGE is applicable for many SE analytics tasks.
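The "steering away from redundant tunings" idea behind DODGE can be sketched as follows; the sampling loop, the `eps` threshold, and the deprecation rule below are illustrative, not the tool's exact algorithm:

```python
import random

def dodge(candidates, evaluate, budget=10, eps=0.05, seed=0):
    """DODGE-style sketch: sample tunings, but treat any result within `eps`
    of an already-seen result as redundant and stop exploring the setting
    that produced it, so the budget goes to genuinely different regions."""
    rng = random.Random(seed)
    seen, best, best_score = [], None, float("inf")
    pool = list(candidates)
    for _ in range(budget):
        if not pool:
            break
        c = rng.choice(pool)
        s = evaluate(c)
        if any(abs(s - t) < eps for t in seen):
            pool.remove(c)  # redundant region: deprecate this tuning
            continue
        seen.append(s)
        if s < best_score:
            best, best_score = c, s
    return best, best_score
```

Because redundant settings are removed from the pool, later samples are forced toward tunings whose results actually differ, which is why this style of search can run far faster than exhaustive optimizers on low-dimensional data.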
Article
Many methods have been proposed to estimate how much effort is required to build and maintain software. Much of that research tries to recommend a single method – an approach that makes the dubious assumption that one method can handle the diversity of software project data. To address this drawback, we apply a configuration technique called “ROME” (Rapid Optimizing Methods for Estimation), which uses sequential model-based optimization (SMO) to find what configuration settings of effort estimation techniques work best for a particular data set. We test this method using data from 1161 traditional waterfall projects and 120 contemporary projects (from GitHub). In terms of magnitude of relative error and standardized accuracy, we find that ROME achieves better performance than the state-of-the-art methods for both traditional waterfall and contemporary projects. In addition, we conclude that we should not recommend one method for estimation. Rather, it is better to search through a wide range of different methods to find what works best for the local data. To the best of our knowledge, this is the largest effort estimation experiment yet attempted and the only one to test its methods on traditional waterfall and contemporary projects.
Article
The continuous contributions made by long time contributors (LTCs) are a key factor enabling open source software (OSS) projects to be successful and survive. We study GitHub as it has a large number of OSS projects and millions of contributors, which enables the study of the transition from newcomers to LTCs. In this paper, we investigate whether we can effectively predict newcomers in OSS projects to be LTCs based on their activity data that is collected from GitHub. We collect GitHub data from GHTorrent, a mirror of GitHub data. We select the most popular 917 projects, which contain 75,046 contributors. We determine a developer as an LTC of a project if the time interval between his/her first and last commit in the project is larger than a certain time T. In our experiment, we use three different settings on the time interval: 1, 2, and 3 years. There are 9,238, 3,968, and 1,577 contributors who become LTCs of a project in the three settings of time interval, respectively. To build a prediction model, we extract many features from the activities of developers on GitHub, which we group into five dimensions: developer profile, repository profile, developer monthly activity, repository monthly activity, and collaboration network. We apply several classifiers including naive Bayes, SVM, decision tree, kNN and random forest. We find that the random forest classifier achieves the best performance with AUCs of more than 0.75 in all three settings of time interval for LTCs. We also investigate the most important features that differentiate newcomers who become LTCs from newcomers who stay in the projects for a short time. We find that the number of followers is the most important feature in all three settings of the time interval studied. We also find that the programming language and the average number of commits contributed by other developers when a newcomer joins a project also belong to the top 10 most important features in all three settings of time interval for LTCs.
Finally, we provide several implications for action based on our analysis results to help OSS projects retain newcomers.
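Since the study reports follower count as the most important feature, a minimal baseline can be sketched as a one-feature decision stump. The feature names and the threshold below are illustrative, not the paper's actual model (which trains a random forest over many features across five dimensions):

```python
def ltc_features(dev):
    """Flatten a newcomer's activity record into a feature dict. The keys
    here are stand-ins for the study's five feature dimensions."""
    return {
        "followers": dev["followers"],
        "first_month_commits": dev["commits"][0],
        "repo_stars": dev["repo_stars"],
    }

def predict_ltc(features, follower_threshold=10):
    """One-feature decision stump: threshold on follower count alone,
    the feature the study found most important. The threshold value is
    a made-up illustration, not a number from the paper."""
    return features["followers"] >= follower_threshold

dev = {"followers": 25, "commits": [4, 7, 2], "repo_stars": 310}
is_ltc = predict_ltc(ltc_features(dev))
```

A stump like this gives a sanity-check baseline: any richer classifier (naive Bayes, SVM, random forest) should beat it before its extra complexity is justified.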