Figure - available from: Empirical Software Engineering

The hyperparameters to be tuned in CART

Source publication
Article
Full-text available
Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstra...
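The citing papers below note that the source article tunes CART (regression-tree) hyperparameters with differential evolution. The following sketch only illustrates that general idea, not the authors' actual code: scipy's differential_evolution searches over max_depth, min_samples_split, and min_samples_leaf of a scikit-learn DecisionTreeRegressor. The bounds, scoring metric, and synthetic data are my assumptions.

```python
# Hedged sketch: tuning CART hyperparameters with differential evolution.
# Data, bounds, and scoring are illustrative assumptions, not the article's setup.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

def neg_cv_score(params):
    """Objective: cross-validated MAE of a CART built from a candidate config."""
    max_depth, min_samples_split, min_samples_leaf = params
    model = DecisionTreeRegressor(
        max_depth=int(round(max_depth)),
        min_samples_split=int(round(min_samples_split)),
        min_samples_leaf=int(round(min_samples_leaf)),
        random_state=0,
    )
    mae = -cross_val_score(model, X, y,
                           scoring="neg_mean_absolute_error", cv=5).mean()
    return mae  # differential_evolution minimizes this value

# Assumed ranges for (max_depth, min_samples_split, min_samples_leaf).
bounds = [(1, 20), (2, 20), (1, 12)]
result = differential_evolution(neg_cv_score, bounds, maxiter=30, seed=1)
print("best hyperparameters (rounded):", np.round(result.x).astype(int))
print("cross-validated MAE:", result.fun)
```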

Similar publications

Article
Full-text available
A prediction model with sinter drum strength as the evaluation index was established based on the index data and historical sintering data generated during the sintering process. Machine-learning regression models were applied to the prediction of sinter drum strength. After verifying the feasibilit...

Citations

... Understanding when a software project is in a healthy state remains a critical yet unsolved challenge in software development. While repositories provide extensive data about project activities, from code changes to community interactions, current approaches struggle to convert this wealth of information into actionable insights about project health [1,2]. This gap affects both practitioners managing projects and researchers studying software development. ...
... The potential impact of our stability-based vision is threefold: (1) it enables systematic, quantitative evaluation of repository health, (2) it paves the way for data-driven project management through clear stability metrics, and (3) it opens new research directions in empirical software engineering. By bridging control theory and repository analysis, we offer a rigorous yet practical approach to a fundamental challenge. ...
Preprint
Full-text available
Drawing from engineering systems and control theory, we introduce a framework to understand repository stability, that is, a repository's capacity to return to equilibrium following disturbances, such as a sudden influx of bug reports, key contributor departures, or a spike in feature requests. The framework quantifies stability through four indicators: commit patterns, issue resolution, pull request processing, and community engagement, measuring development consistency, problem-solving efficiency, integration effectiveness, and sustainable participation, respectively. These indicators are synthesized into a Composite Stability Index (CSI) that provides a normalized measure of repository health proxied by its stability. Finally, the framework introduces several important theoretical properties that validate its usefulness as a measure of repository health and stability. Still at a conceptual phase and open to debate, our work establishes mathematical criteria for evaluating repository stability and proposes new ways to understand sustainable development practices. The framework bridges control theory concepts with modern collaborative software development, providing a foundation for future empirical validation.
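The abstract describes four indicators synthesized into a normalized Composite Stability Index (CSI) but does not spell out the formulas here. The sketch below shows only one plausible shape such a synthesis could take; the min-max normalization, equal weights, and toy data are assumptions, not the framework's definitions.

```python
# Hedged sketch of a composite stability index: four per-period indicator
# series are normalized to [0, 1] and averaged. Normalization scheme and
# equal weights are illustrative assumptions, not the paper's definitions.
import numpy as np

def minmax(series):
    """Scale a 1-D series into [0, 1]; a constant series maps to 0.5."""
    series = np.asarray(series, dtype=float)
    span = series.max() - series.min()
    return np.full_like(series, 0.5) if span == 0 else (series - series.min()) / span

def composite_stability_index(commits, issues_resolved, prs_processed, engagement,
                              weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted mean of normalized indicators -> one CSI value per period."""
    indicators = np.vstack([minmax(commits), minmax(issues_resolved),
                            minmax(prs_processed), minmax(engagement)])
    return np.average(indicators, axis=0, weights=weights)

# Toy monthly data for one repository (purely made up).
csi = composite_stability_index(
    commits=[120, 95, 130, 40, 110],
    issues_resolved=[30, 28, 35, 10, 32],
    prs_processed=[22, 18, 25, 5, 20],
    engagement=[60, 55, 70, 20, 58],
)
print(np.round(csi, 2))
```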
... DE has been widely applied [67]. Within software engineering, DE has been used for optimization tasks such as Fu et al.'s tuning study on defect prediction [68], Shu et al.'s study on tuning detectors for security issues [69], and Xia et al.'s study that tuned project health predictors for open-source Java systems [34]. ...
Preprint
Full-text available
SE analytics problems do not always need complex AI. Better and faster solutions can sometimes be obtained by matching the complexity of the problem to the complexity of the solution. This paper introduces the Dimensionality Reduction Ratio (DRR), a new metric for predicting when lightweight algorithms suffice. Analyzing SE optimization problems from software configuration to process decisions and open-source project health, we show that DRR pinpoints "simple" tasks where costly methods like DEHB (a state-of-the-art evolutionary optimizer) are overkill. For high-DRR problems, simpler methods can be just as effective and run two orders of magnitude faster.
... (3) In the realm of ML algorithm selection, previous studies have predominantly employed a single proposed algorithm for pattern recognition within the DBDD framework (Bokaeian et al., 2021; Fallahian et al., 2018b, 2022). Additionally, some investigations have explored the influence of HPs on algorithm performance and implemented methods to optimize these HPs (Adedeji et al., 2021; Agrawal and Chakraborty, 2021; Chandrasekhar et al., 2021; Gui et al., 2017; Gulgec et al., 2019; Guo et al., 2019; Iannelli et al., 2022; Ibrahim et al., 2019; Kaur et al., 2020; Kong et al., 2022; Li et al., 2023; Xia et al., 2022). The DBDD method's theoretical underpinnings clearly indicate that the choice of learning algorithm and its associated HP values significantly impacts prediction accuracy. ...
Article
One appealing alternative, especially for detecting, localizing, and quantifying multiple simultaneous damages in structures, is dataset-based damage detection (DBDD) techniques. While machine learning (ML) algorithms have been applied to pattern recognition in DBDD studies, comprehensive research and sensitivity analysis on the effects of various parameters, such as ML algorithm choice, hyperparameter (HP) tuning, and feature engineering techniques, remain limited. This can hinder the selection of optimal parameter values, especially for large structures, without incurring significant computational costs. This study addresses this gap by employing extensive single-objective sensitivity analyses to evaluate the impact of these parameters on DBDD damage detection in a 3D sample structure, whose numerical model has been verified by comparing its Frequency Response Functions (FRFs) with experimental data. By comparing grid search outcomes, it was shown that leveraging insights from single-objective optimization can significantly reduce computational costs by limiting the parameter search space. Additionally, an innovative hybrid feature engineering method is proposed to enhance feature quality, reduce dataset size, and enable the feasibility of conducting extensive sensitivity analyses. Furthermore, the study investigates the impact of ensemble techniques with tuned algorithms and excitation point configuration on DBDD prediction accuracy. By using the proposed feature engineering, optimal excitation configuration, and optimized ML algorithms and HPs, in addition to shrinking the dataset to 5% of its original size, the accuracy of DBDD damage prediction can be improved by up to 60%. These results can be leveraged to pre-select optimal parameters for DBDD damage detection in similar structures, significantly reducing computational costs and improving accuracy.
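The abstract above evaluates hyperparameter sensitivity via grid search. The sketch below illustrates that general pattern only: a small grid over one classifier's hyperparameters, with per-configuration scores exposing how sensitive accuracy is to each setting. The classifier, grid, and synthetic stand-in data are assumptions, not the study's setup.

```python
# Hedged sketch: single-objective sensitivity of a classifier's accuracy to its
# hyperparameters via grid search, in the spirit of the DBDD study above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for damage-sensitive features extracted from FRFs (made up).
X, y = make_classification(n_samples=400, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)

grid = {"n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 3, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
# Per-configuration scores show how sensitive accuracy is to each HP value.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(f"{score:.3f}  {params}")
```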
... However, the impact of FOSS sustainability on the community's product is less well known. While recent FOSS sustainability research has focused on forecasting sustainability (Yin et al. 2021), reasons for project failure (Coelho and Valente 2017), community health indicators (Xia et al. 2022; Manikas and Hansen 2013; Linåker et al. 2022), project popularity (Han et al. 2019; Borges et al. 2016), and the attraction of developers and users to the community (Chengalur-Smith et al. 2010), little work has been devoted to the implications of FOSS sustainability and its various aspects on community outputs. Software quality (SWQ) is deemed an important aspect of these outputs (Vasilescu et al. 2015). ...
Article
Full-text available
Context Free and Open Source Software (FOSS) communities’ ability to stay viable and productive over time is pivotal for society as they maintain the building blocks that digital infrastructure, products, and services depend on. Sustainability may, however, be characterized from multiple aspects, and less is known about how these aspects interplay and impact community outputs, and software quality specifically. Objective This study, therefore, aims to empirically explore how the different aspects of FOSS sustainability impact software quality. Method 16 sustainability metrics across four categories were sampled and applied to a set of 217 OSS projects sourced from the Apache Software Foundation Incubator program. The impact of a decline in the sustainability metrics was analyzed against eight software quality metrics using Bayesian data analysis, which incorporates probability distributions to represent the regression coefficients and intercepts. Results Findings suggest that selected sustainability metrics do not significantly affect defect density or code coverage. However, a positive impact of community age was observed on specific code quality metrics, such as risk complexity, number of very large files, and code duplication percentage. Interestingly, findings show that even when communities are experiencing sustainability, certain code quality metrics are negatively impacted. Conclusion Findings imply that code quality practices are not consistently linked to sustainability, and defect management and prevention may be prioritized over the former. Results suggest that growth, resulting in a more complex and large codebase, combined with a probable lack of understanding of code quality standards, may explain the degradation in certain aspects of code quality.
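The study above models quality metrics with probability distributions over regression coefficients and intercepts. As a minimal, hedged sketch of what such a Bayesian regression can look like (not the authors' model), the code below regresses one toy code-quality metric on one toy sustainability metric using PyMC; the priors, variable names, and data are all assumptions.

```python
# Hedged sketch: Bayesian linear regression of one code-quality metric on one
# sustainability metric, with distributions over coefficient and intercept.
# Priors, variable names, and data are illustrative assumptions.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
community_age = rng.uniform(1, 10, size=100)                       # toy predictor
code_duplication = 5 + 0.8 * community_age + rng.normal(0, 1.5, 100)  # toy outcome

with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=5)
    sigma = pm.HalfNormal("sigma", sigma=5)
    mu = intercept + beta * community_age
    pm.Normal("y", mu=mu, sigma=sigma, observed=code_duplication)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

# Posterior summaries: does the credible interval for beta exclude zero?
print(az.summary(idata, var_names=["intercept", "beta"]))
```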
... Different studies predicted specific features for different topics, such as health-related features [22,37] or popularity measures [4,35]. Predictions for multivariate maintenance activity features are missing so far. Statistical algorithms such as logistic regression, k-nearest neighbors, support vector regression, linear regression, and regression trees [22,37], as well as neural-network-based algorithms such as LSTM RNNs [4,35], were applied. The prediction periods range from 1 to 30 days [4,22,35,37], 1 to 6 months [4,22,37], and up to 12 months and longer [4,37]. The main gap of RQ3 is the integration of transitive links into the prediction approach since current approaches are based on project-level features only. ...
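The snippet above lists the statistical learners applied in this line of work. A minimal, hedged sketch of such a comparison follows: the learners named there are evaluated with cross-validation on synthetic stand-in data; the data, target, and error metric are assumptions, not any cited study's setup.

```python
# Hedged sketch: comparing the statistical learners mentioned above for
# predicting a maintenance-activity feature. Data and metric are assumptions.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Stand-in for project-level features and next-month commit counts (made up).
X, y = make_regression(n_samples=200, n_features=12, noise=15.0, random_state=0)

learners = {
    "linear regression": LinearRegression(),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=5),
    "support vector regression": SVR(C=10.0),
    "regression tree": DecisionTreeRegressor(max_depth=6, random_state=0),
}

for name, model in learners.items():
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name:28s} MAE = {mae:.1f}")
```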
Preprint
Industrial applications heavily integrate open-source software libraries nowadays. Beyond the benefits that libraries bring, they can also impose a real threat in case a library is affected by a vulnerability but its community is not active in creating a fixing release. Therefore, I want to introduce an automatic monitoring approach for industrial applications to identify open-source dependencies that show negative signs regarding their current or future maintenance activities. Since most research in this field is limited due to lack of features, labels, and transitive links, and thus is not applicable in industry, my approach aims to close this gap by capturing the impact of direct and transitive dependencies in terms of their maintenance activities. Automatically monitoring the maintenance activities of dependencies reduces the manual effort of application maintainers and supports application security by continuously having well-maintained dependencies.
... The authors of [10] deal with development process metrics at the project level to predict the health of a project. The authors of [11] analyze the governance of open source projects to extract best practices for project management. ...
Chapter
Projects of modern software systems comprise a wide number of various design artifacts. There is an extensive set of approaches and tools for creating design artifacts. It is very important to use good practices to improve the quality of design artifacts and the development process of creating software systems. Modern software project repositories contain many software systems. However, there are no effective methods for information retrieval and analysis of existing projects to gain access to best practices. Thus, it is necessary to develop a model of a software system project, considering various approaches and tools for design artifact creation and development process organization. Such a model will make it possible to search for software projects, considering their domain features and the properties of design artifacts and the development process. Various data mining and systems analysis methods can be applied to a found set of projects to extract a set of best practices. We described the approach to building the information retrieval module for the intelligent design repository in this paper. Also, we presented the model of a software system project. We considered the algorithm for indexing a software system project to build the information retrieval module index. There is a measure of the distance between a search query and an index element of the information retrieval module. We proposed an algorithm for calculating the relevance of an index element to a search query. The article also provides examples of the work of the information retrieval module for the intelligent design repository. Keywords: Information retrieval, Data mining, Software repository
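The chapter above builds an index over software projects and ranks index elements by relevance to a search query. As a generic, hedged stand-in for that idea (the chapter's own project model and distance measure are not reproduced here), the sketch below indexes project descriptions with TF-IDF and ranks them by cosine similarity; the project names and texts are made up.

```python
# Hedged sketch: indexing project descriptions and ranking them against a
# search query with TF-IDF + cosine similarity. A generic IR stand-in, not
# the chapter's actual project model or relevance measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

projects = {
    "proj-a": "web service REST API Java Spring microservice deployment",
    "proj-b": "embedded C firmware sensor driver real-time scheduler",
    "proj-c": "machine learning Python pipeline feature engineering model training",
}

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(projects.values())   # index over design artifacts

query = "Python machine learning model pipeline"
scores = cosine_similarity(vectorizer.transform([query]), index).ravel()

# Rank projects by relevance to the query.
for name, score in sorted(zip(projects, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.2f}")
```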
... Most importantly, all this information must be collected and analyzed, considering the dynamics of the development of the project during its life cycle. The following works influenced our study: [19]. ...
... We also considered papers about the analysis of open-source software repositories to evaluate the quality of the repository, depending on various design, construction, and project management practices [16][17][18][19][20][21]. ...
... In the paper [19], the authors selected project indicators based on the developer survey and proposed a forecasting method based on a combination of machine learning methods. ...
Article
Full-text available
GitHub and GitLab contain many project repositories. Each repository contains many design artifacts and specific project management features. Developers can automate the processes of design and project management with the approach proposed in this paper. We described the knowledge base model and diagnostic analytics method for solving design automation and project management tasks. This paper also presents examples of use cases for applying the proposed approach.
... Attracting such support is easier when projects show healthy trends in their development. For this reason, in a survey of hundreds of decision makers from dozens of open source projects [71], it was found that 79% to 93% of decision makers wanted methods to predict the open source health indicators of Table 1, such as commits, closed pull requests, and number of contributors. ...
... Making such predictions is a small data problem; i.e., it may be necessary to draw conclusions from just a few dozen rows described by just a dozen columns. For example, Xia et al. [71] explored what project health information could be consistently collected across 1000s of GitHub projects. They found they could reliably access data with around 15 columns (see Table 1). ...
... But for small data sets, these methods are highly error prone. For example, Xia et al.'s [71] hyperparameter studies with our small data found models with error rates twice as bad as what Sarro et al. would consider acceptable. Hence, in this work, we attempt to out-perform Xia et al. ...
Preprint
When learning from very small data sets, the resulting models can make many mistakes. For example, consider learning predictors for open source project health. The training data for this task may be very small (e.g. five years of data, collected every month means just 60 rows of training data). Using this data, prior work had unacceptably large errors in their learned predictors. We show that these high error rates can be tamed by better configuration of the control parameters of the machine learners. For example, we present here a landscape analytics method (called SNEAK) that (a) clusters the data to find the general landscape of the hyperparameters; then (b) explores a few representatives from each part of that landscape. SNEAK is both faster and more effective than prior state-of-the-art hyperparameter optimization algorithms (FLASH, HYPEROPT, OPTUNA, and differential evolution). More importantly, the configurations found by SNEAK had far less error than other methods. We conjecture that SNEAK works so well since it finds the most informative regions of the hyperparameters, then jumps to those regions. Other methods (that do not reflect over the landscape) can waste time exploring less informative options. From this, we make the following conclusions. Firstly, for predicting open source project health, we recommend landscape analytics (e.g., SNEAK). Secondly, and more generally, when learning from very small data sets, we recommend using hyperparameter optimization (e.g., SNEAK) to select learning control parameters. Due to its speed and implementation simplicity, we suggest SNEAK might also be useful in other "data-light" SE domains. To assist other researchers in repeating, improving, or even refuting our results, all our scripts and data are available on GitHub at https://github.com/zxcv123456qwe/niSneak
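The abstract above describes a two-step idea: cluster candidate hyperparameter configurations to map their landscape, then evaluate only a few representatives. The sketch below illustrates that general pattern, not the SNEAK implementation itself; the configuration ranges, clustering choices, and tiny synthetic data set are assumptions.

```python
# Hedged sketch of the two-step idea above: (a) cluster candidate hyperparameter
# configurations, (b) evaluate one representative per cluster. Illustration only,
# not the SNEAK algorithm.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=60, n_features=10, noise=20.0, random_state=0)  # tiny data

# (a) Sample many candidate configs and cluster them into a few regions.
candidates = np.column_stack([rng.integers(1, 21, 200),    # max_depth
                              rng.integers(2, 21, 200),    # min_samples_split
                              rng.integers(1, 13, 200)])   # min_samples_leaf
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit(candidates)

# (b) Evaluate only the candidate closest to each cluster centre.
best_mae, best_cfg = np.inf, None
for centre in clusters.cluster_centers_:
    rep = candidates[np.argmin(euclidean_distances([centre], candidates))]
    model = DecisionTreeRegressor(max_depth=int(rep[0]),
                                  min_samples_split=int(rep[1]),
                                  min_samples_leaf=int(rep[2]), random_state=0)
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    if mae < best_mae:
        best_mae, best_cfg = mae, rep
print("representative config:", best_cfg, "MAE:", round(best_mae, 1))
```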
... Xia et al. predicted a number of health indicators of OSS projects, such as the number of developers and the number of revisions. These predictions were made using regression trees that were optimized using differential evolution, leading to a 10% increase in prediction accuracy over the baseline [25]. Norick et al. analyzed OSS projects using code quality measures and observed no significant evidence that the number of committing developers affects software quality [15]. ...
Preprint
Full-text available
A recent study applied frequentist survival analysis methods to a subset of the Software Heritage Graph and determined which attributes of an OSS project contribute to its health. This paper serves as an exact replication of that study. In addition, Bayesian survival analysis methods were applied to the same dataset, and an additional project attribute was studied to serve as a conceptual replication. Both analyses focus on the effects of certain attributes on the survival of open-source software projects as measured by their revision activity. Methods such as the Kaplan-Meier estimator, Cox Proportional-Hazards model, and the visualization of posterior survival functions were used for each of the project attributes. The results show that projects which publish major releases, have repositories on multiple hosting services, possess a large team of developers, and make frequent revisions have a higher likelihood of survival in the long run. The findings were similar to the original study; however, a deeper look revealed quantitative inconsistencies.
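The abstract above names the Kaplan-Meier estimator and the Cox Proportional-Hazards model as its survival analysis methods. As a minimal, hedged sketch of applying those two methods with the lifelines library (the columns and values below are made-up stand-ins, not the Software Heritage data):

```python
# Hedged sketch: Kaplan-Meier and Cox Proportional-Hazards analysis of toy
# project-survival data with lifelines. Columns and values are assumptions.
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

df = pd.DataFrame({
    "months_active":  [12, 48, 36, 60, 7, 90, 24, 55],   # observed duration
    "inactive":       [1, 0, 1, 0, 1, 1, 0, 0],          # 1 = project went dormant
    "major_releases": [0, 5, 1, 7, 0, 2, 1, 4],
    "team_size":      [2, 14, 3, 20, 1, 6, 4, 12],
})

# Non-parametric survival curve over all projects.
kmf = KaplanMeierFitter()
kmf.fit(df["months_active"], event_observed=df["inactive"])
print(kmf.survival_function_.tail())

# Semi-parametric model: which attributes change the hazard of going dormant?
cph = CoxPHFitter()
cph.fit(df, duration_col="months_active", event_col="inactive")
cph.print_summary()
```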