Conference Paper

Applying Machine Learning in Technical Debt Management: Future Opportunities and Challenges

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Technical Debt Management (TDM) is a fast-growing field that in the last years has attracted the attention of both academia and industry. TDM is a complex process, in the sense that it relies on multiple and heterogeneous data sources (e.g., source code, feature requests, bugs, developers' activity, etc.), which cannot be straightforwardly synthesized; leading the community to using mostly qualitative empirical methods. However, empirical studies that involve expert judgement are inherently biased, compared to automated or semi-automated approaches. To overcome this limitation, the broader (not TDM) software engineering community has started to employ machine learning (ML) technologies. Our goal is to investigate the opportunity of applying ML technologies for TDM, through a Systematic Literature Review (SLR) on the application of ML to software engineering problems (since ML applications on TDM are limited). Thus, we have performed a broader scope study, i.e., on machine learning for software engineering, and then synthesize the results so as to achieve our high-level goal (i.e., possible application of ML in TDM). Therefore, we have conducted a literature review, by browsing the research corpus published in five high-quality SE journals, with the goal of cataloging: (a) the software engineering practices in which ML is used; (b) the machine learning technologies that are used for solving them; and (c) the intersection of the two: developing a problem-solution mapping. The results are useful to both academics and industry, since the former can identify possible gaps, and interesting future research directions, whereas the later can obtain benefits by adopting ML technologies.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... Another systematic literature review conducted by Tsintzira et al. (2020) focuses on technical debt management (TDM) for ML software with 90 primary studies. Since the authors of this study investigated current challenges and solutions in the context of TDM and ML, their outcomes, scope, and purpose are very different from ours. ...
Article
Full-text available
There is a widespread demand for Artificial Intelligence (AI) software, specifically Machine Learning (ML). It is getting increasingly popular and being adopted in various applications we use daily. AI-based software quality is different from traditional software quality because it generally addresses distinct and more complex kinds of problems. With the fast advance of AI technologies and related techniques, how to build high-quality AI-based software becomes a very prominent subject. This paper aims at investigating the state of the art on software quality (SQ) for AI-based systems and identifying quality attributes, applied models, challenges, and practices that are reported in the literature. We carried out a systematic literature review (SLR) from 1988 to 2020 to (i) analyze and understand related primary studies and (ii) synthesize limitations and open challenges to drive future research. Our study provides a road map for researchers to understand quality challenges, attributes, and practices in the context of software quality for AI-based software better. From the empirical evidence that we have gathered by this SLR, we suggest future work on this topic be structured under three categories which are Definition/Specification, Design/Evaluation, and Process/Socio-technical.
Article
Full-text available
Context Smells in software systems impair software quality and make them hard to maintain and evolve. The software engineering community has explored various dimensions concerning smells and produced extensive research related to smells. The plethora of information poses challenges to the community to comprehend the state-of-the-art tools and techniques. Objective We aim to present the current knowledge related to software smells and identify challenges as well as opportunities in the current practices. Method We explore the definitions of smells, their causes as well as effects, and their detection mechanisms presented in the current literature. We studied 445 primary studies in detail, synthesized the information, and documented our observations. Results The study reveals five possible defining characteristics of smells — indicator, poor solution, violates best-practices, impacts quality, and recurrence. We curate ten common factors that cause smells to occur including lack of skill or awareness and priority to features over quality. We classify existing smell detection methods into five groups — metrics, rules/heuristics, history, machine learning, and optimization-based detection. Challenges in the smells detection include the tools’ proneness to false-positives and poor coverage of smells detectable by existing tools.
Article
Full-text available
Context: Technical debt (TD) is a metaphor reflecting technical compromises that can yield short-term benefit but may hurt the long-term health of a software system. Objective: This work aims at collecting studies on TD and TD management (TDM), and making a classification and thematic analysis on these studies, to obtain a comprehensive understanding on the TD concept and an overview on the current state of research on TDM. Method: A systematic mapping study was performed to identify and analyze research on TD and its management, covering publications between 1992 and 2013. Results: Ninety-four studies were finally selected. TD was classified into ten types, eight TDM activities were identified, and twenty-nine tools for TDM were collected. Conclusions: The term “debt” has been used in different ways by different people, which leads to ambiguous interpretation of the term. Code-related TD and its management have gained the most attention. There is a need for more empirical studies with high-quality evidence on the whole TDM process and on the application of specific TDM approaches in industrial settings. Moreover, dedicated TDM tools are needed for managing various types of TD in the whole TDM process.
Article
Full-text available
Agile software development represents a major departure from traditional, plan-based approaches to software engineering. A systematic review of empirical studies of agile software development up to and including 2005 was conducted. The search strategy identified 1996 studies, of which 36 were identified as empirical studies. The studies were grouped into four themes: introduction and adoption, human and social factors, perceptions on agile methods, and comparative studies. The review investigates what is currently known about the benefits and limitations of, and the strength of evidence for, agile methods. Implications for research and practice are presented. The main implication for research is a need for more and better empirical studies of agile software development within a common research agenda. For the industrial readership, the review provides a map of findings, according to topic, that can be compared for relevance to their own settings and situations.
Article
Full-text available
In this article, we present a novel algorithmic method for the calculation of thresholds for a metric set. To this aim, machine learning and data mining techniques are utilized. We define a data-driven methodology that can be used for efficiency optimization of existing metric sets, for the simplification of complex classification models, and for the calculation of thresholds for a metric set in an environment where no metric set yet exists. The methodology is independent of the metric set and therefore also independent of any language, paradigm or abstraction level. In four case studies performed on large-scale open-source software metric sets for C functions, C+ +, C# methods and Java classes are optimized and the methodology is validated.
Article
Full-text available
In the last decade, empirical studies on object-oriented design metrics have shown some of them to be useful for predicting the fault-proneness of classes in object-oriented software systems. This research did not, however, distinguish among faults according to the severity of impact. It would be valuable to know how object-oriented design metrics and class fault-proneness are related when fault severity is taken into account. In this paper, we use logistic regression and machine learning methods to empirically investigate the usefulness of object-oriented design metrics, specifically, a subset of the Chidamber and Kemerer suite, in predicting fault-proneness when taking fault severity into account. Our results, based on a public domain NASA data set, indicate that 1) most of these design metrics are statistically related to fault-proneness of classes across fault severity, and 2) the prediction capabilities of the investigated metrics greatly depend on the severity of faults. More specifically, these design metrics are able to predict low severity faults in fault-prone classes better than high severity faults in fault-prone classes
Article
Full-text available
Empirical studies on software prediction models do not converge with respect to the question "which prediction model is best?" The reason for this lack of convergence is poorly understood. In this simulation study, we have examined a frequently used research procedure comprising three main ingredients: a single data sample, an accuracy indicator, and cross validation. Typically, these empirical studies compare a machine learning model with a regression model. In our study, we use simulation and compare a machine learning and a regression model. The results suggest that it is the research procedure itself that is unreliable. This lack of reliability may strongly contribute to the lack of convergence. Our findings thus cast some doubt on the conclusions of any study of competing software prediction models that used this research procedure as a basis of model comparison. Thus, we need to develop more reliable research procedures before we can have confidence in the conclusions of comparative studies of software prediction models.
Article
Context: Secondary studies are vulnerable to threats to validity. Although, mitigating these threats is crucial for the credibility of these studies, we currently lack a systematic approach to identify, categorize and mitigate threats to validity for secondary studies. Objective: In this paper, we review the corpus of secondary studies, with the aim to identify: (a) the trend of reporting threats to validity, (b) the most common threats to validity and corresponding mitigation actions, and (c) possible categories in which threats to validity can be classified. Method: To achieve this goal we employ the tertiary study research method that is used for synthesizing knowledge from existing secondary studies. In particular, we collected data from more than 100 studies, published until December 2016 in top quality software engineering venues (both journals and conference). Results: Our results suggest that in recent years, secondary studies are more likely to report their threats to validity. However, the presentation of such threats is rather ad/hoc, e.g., the same threat may be presented with a different name, or under a different category. To alleviate this problem, we propose a classification schema for reporting threats to validity and possible mitigation actions. Both the classification of threats and the associated mitigation actions have been validated by an empirical study, i.e., Delphi rounds with experts. Conclusion: Based on the proposed schema, we provide a checklist, which authors of secondary studies can use for identifying and categorizing threats to validity and corresponding mitigation actions, while readers of secondary studies can use the checklist for assessing the validity of the reported results.
Article
Context: It has been often argued that it is challenging to modify code fragments from existing software that contains files that are difficult to comprehend. Since systematic software maintenance includes an extensive human activity, cognitive complexity is one of the intrinsic factors that could potentially contribute to or impede an efficient software maintenance practice, the empirical validation of which remains vastly unaddressed. Objective: This study conducts an experimental analysis in which the software developer's level of difficulty in comprehending the software: the cognitive complexity, is theoretically computed and empirically evaluated for estimating its relevance to actual software change. Method: For multiple successive releases of two Java-based software projects, where the source code of a previous release has been substantively used in a novel release, we calculate the change results and the values of the cognitive complexity for each of the version's source code Java files. We construct eight datasets and build predictive models using statistical analysis and machine learning techniques. Results: The pragmatic comparative examination of the estimated cognitive complexity against prevailing metrics of software change and software complexity clearly validates the cognitive complexity metric as a noteworthy measure of version to version source code change.
Article
Context: Software developers spend a significant amount of time fixing faults. However, not many papers have addressed the actual effort needed to fix software faults. Objective: The objective of this paper is twofold: (1) analysis of the effort needed to fix software faults and how it was affected by several factors and (2) prediction of the level of fix implementation effort based on the information provided in software change requests. Method: The work is based on data related to 1200 failures, extracted from the change tracking system of a large NASA mission. The analysis includes descriptive and inferential statistics. Predictions are made using three supervised machine learning algorithms and three sampling techniques aimed at addressing the imbalanced data problem. Results: Our results show that (1) 83% of the total fix implementation effort was associated with only 20% of failures. (2) Both post-release failures and safety-critical failures required more effort to fix than pre-release and non-critical counterparts, respectively; median values were two or more times higher. (3) Failures with fixes spread across multiple components or across multiple types of software artifacts required more effort. The spread across artifacts was more costly than spread across components. (4) Surprisingly, some types of faults associated with later life-cycle activities did not require significant effort. (5) The level of fix implementation effort was predicted with 73% overall accuracy using the original, imbalanced data. Oversampling techniques improved the overall accuracy up to 77% and, more importantly, significantly improved the prediction of the high level effort, from 31% to 85%. Conclusions: This paper shows the importance of tying software failures to changes made to fix all associated faults, in one or more software components and/or in one or more software artifacts, and the benefit of studying how the spread of faults and other factors affect the fix implementation effort.
Article
The need to overcome the weaknesses of single estimation techniques for prediction tasks has given rise to ensemble methods in software development effort estimation (SDEE). An ensemble effort estimation (EEE) technique combines several of the single/classical models found in the SDEE literature. However, to the best of our knowledge, no systematic review has yet been performed with a focus on the use of EEE techniques in SDEE. The purpose of this review is to analyze EEE techniques from six viewpoints: single models used to construct ensembles, ensemble estimation accuracy, rules used to combine single estimates, accuracy comparison of EEE techniques with single models, accuracy comparison between EEE techniques and methodologies used to construct ensemble methods. We performed a systematic review of EEE studies published between 2000 and 2016, and we selected 24 of them to address the questions raised in this review. We found that EEE techniques may be separated into two types: homogeneous and heterogeneous, and that the machine learning single models are the most frequently employed in constructing EEE techniques. We also found that EEE techniques usually yield acceptable estimation accuracy, and in fact are more accurate than single models.
Article
Several code smell detection tools have been developed providing different results, because smells can be subjectively interpreted, and hence detected, in different ways. In this paper, we perform the largest experiment of applying machine learning algorithms to code smells to the best of our knowledge. We experiment 16 different machine-learning algorithms on four code smells (Data Class, Large Class, Feature Envy, Long Method) and 74 software systems, with 1986 manually validated code smell samples. We found that all algorithms achieved high performances in the cross-validation data set, yet the highest performances were obtained by J48 and Random Forest, while the worst performance were achieved by support vector machines. However, the lower prevalence of code smells, i.e., imbalanced data, in the entire data set caused varying performances that need to be addressed in the future studies. We conclude that the application of machine learning to the detection of these code smells can provide high accuracy (>96 %), and only a hundred training examples are needed to reach at least 95 % accuracy.
Conference Paper
Change impact analysis investigates the negative consequence of system changes, i.e., the propagation of changes to other parts of the system (also known as the ripple effect). Identifying modules of the system that will be affected by the ripple effect is an important activity, before and after the application of any change. However, in the literature, there is only a limited set of studies that investigate the probability of a random change occurring in one class, to propagate to another. In this paper we discuss the Ripple Effect Measure (in short REM), a metric that can be used to assess the aforementioned probability. To evaluate the capacity of REM as an assessor of the probability of a class to change due to the ripple effect, we: (a) mathematically validate it against established metric properties (e.g., non-negativity, monotonicity, etc.), proposed by Briand et al., and (b) empirically investigate its validity as an assessor of class proneness to the ripple effect, based on the 1061-1998 IEEE Standard on Software Measurement (e.g., correlation, predictive power, etc.). To apply the empirical validation process, we conducted a holistic multiple-case study on java open-source classes. The results of REM validation (both mathematical and empirical) suggest that REM is a theoretically sound measure that is the most valid assessor of the probability of a class to change due to the ripple effect, compared to other existing metrics.
Article
The metaphor of technical debt in software development was introduced two decades ago to explain to nontechnical stakeholders the need for what we call now "refactoring." As the term is being used to describe a wide range of phenomena, this paper proposes an organization of the technical debt landscape, and introduces the papers on technical debt contained in the issue.
Article
ContextAutomated static analysis (ASA) identifies potential source code anomalies early in the software development lifecycle that could lead to field failures. Excessive alert generation and a large proportion of unimportant or incorrect alerts (unactionable alerts) may cause developers to reject the use of ASA. Techniques that identify anomalies important enough for developers to fix (actionable alerts) may increase the usefulness of ASA in practice.ObjectiveThe goal of this work is to synthesize available research results to inform evidence-based selection of actionable alert identification techniques (AAIT).MethodRelevant studies about AAITs were gathered via a systematic literature review.ResultsWe selected 21 peer-reviewed studies of AAITs. The techniques use alert type selection; contextual information; data fusion; graph theory; machine learning; mathematical and statistical models; or dynamic detection to classify and prioritize actionable alerts. All of the AAITs are evaluated via an example with a variety of evaluation metrics.ConclusionThe selected studies support (with varying strength), the premise that the effective use of ASA is improved by supplementing ASA with an AAIT. Seven of the 21 selected studies reported the precision of the proposed AAITs. The two studies with the highest precision built models using the subject program’s history. Precision measures how well a technique identifies true actionable alerts out of all predicted actionable alerts. Precision does not measure the number of actionable alerts missed by an AAIT or how well an AAIT identifies unactionable alerts. Inconsistent use of evaluation metrics, subject programs, and ASAs in the selected studies preclude meta-analysis and prevent the current results from informing evidence-based selection of an AAIT. We propose building on an actionable alert identification benchmark for comparison and evaluation of AAIT from literature on a standard set of subjects and utilizing a common set of evaluation metrics.
Article
ContextSoftware development effort estimation (SDEE) is the process of predicting the effort required to develop a software system. In order to improve estimation accuracy, many researchers have proposed machine learning (ML) based SDEE models (ML models) since 1990s. However, there has been no attempt to analyze the empirical evidence on ML models in a systematic way.Objective This research aims to systematically analyze ML models from four aspects: type of ML technique, estimation accuracy, model comparison, and estimation context.Method We performed a systematic literature review of empirical studies on ML model published in the last two decades (1991–2010).ResultsWe have identified 84 primary studies relevant to the objective of this research. After investigating these studies, we found that eight types of ML techniques have been employed in SDEE models. Overall speaking, the estimation accuracy of these ML models is close to the acceptable level and is better than that of non-ML models. Furthermore, different ML models have different strengths and weaknesses and thus favor different estimation contexts.ConclusionML models are promising in the field of SDEE. However, the application of ML models in industry is still limited, so that more effort and incentives are needed to facilitate the application of ML models. To this end, based on the findings of this review, we provide recommendations for researchers as well as guidelines for practitioners.
Survey on machine learning-based QoE-QoS correlation models
  • S Aroussi
  • A Mellouk
Aroussi, S., Mellouk, A.: Survey on machine learning-based QoE-QoS correlation models. International Conference on Computing, Management and Telecommunications (ComMan-Tel'), Da Nang, Vietnam, 27-29 April 2014.
Estimating the breaking point for technical debt. 7 th International Workshop on Managing Technical Debt (MTD' 15)
  • A Chatzigeorgiou
  • Ap Ampatzoglou
  • Ar Ampatzoglou
  • T Amanatidis
Chatzigeorgiou, A., Ampatzoglou, Ap., Ampatzoglou, Ar., Amanatidis, T.: Estimating the breaking point for technical debt. 7 th International Workshop on Managing Technical Debt (MTD' 15), IEEE, Germany, pp.53-56, 2 Oct. 2015.
An investigation of machine learning based prediction systems
  • C Mair
  • G Kadoda
  • M Lefley
  • K Phalp
  • C Schofied
  • M Shepperd
  • S Webster
Mair, C., Kadoda, G., Lefley, M., Phalp, K., Schofied, C., Shepperd, M., Webster, S.: An investigation of machine learning based prediction systems. Journal of Systems and Software, 53(1), 23-29 (2000).
Predicting and quantifying the technical debt in cloud software engineering. 19 th International Workshop on Computer-Aided Modeling and Design of Communication Links and Networks (CAMAD)
  • G Skourletopoulos
  • C Mavromoustakis
  • R Bahsoon
  • G Masotrakis
  • E Pallis
Skourletopoulos, G., Mavromoustakis, C., Bahsoon, R., Masotrakis, G., Pallis, E.: Predicting and quantifying the technical debt in cloud software engineering. 19 th International Workshop on Computer-Aided Modeling and Design of Communication Links and Networks (CAMAD), IEEE Computer Society, 2014.