Article

Learning a Metric for Code Readability

Authors: Raymond P. L. Buse and Westley R. Weimer

Abstract

In this paper, we explore the concept of code readability and investigate its relation to software quality. With data collected from 120 human annotators, we derive associations between a simple set of local code features and human notions of readability. Using those features, we construct an automated readability measure and show that it can be 80 percent effective and better than a human, on average, at predicting readability judgments. Furthermore, we show that this metric correlates strongly with three measures of software quality: code changes, automated defect reports, and defect log messages. We measure these correlations on over 2.2 million lines of code, as well as longitudinally, over many releases of selected projects. Finally, we discuss the implications of this study on programming language design and engineering practice. For example, our data suggest that comments, in and of themselves, are less important than simple blank lines to local judgments of readability.
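As a concrete illustration of the approach the abstract describes, the sketch below extracts a few of the named local features (line length, identifiers, blank lines, comments) and fits a logistic-regression readability classifier. The feature set, snippets, and labels are toy stand-ins, not the paper's actual model or data.

```python
import re
from sklearn.linear_model import LogisticRegression

def local_features(snippet: str) -> list[float]:
    lines = snippet.splitlines() or [""]
    n = len(lines)
    identifiers = re.findall(r"[A-Za-z_]\w*", snippet)
    return [
        sum(len(l) for l in lines) / n,                                   # average line length
        len(identifiers) / n,                                             # identifiers per line
        sum(1 for l in lines if not l.strip()) / n,                       # blank-line ratio
        sum(1 for l in lines if l.lstrip().startswith(("//", "#"))) / n,  # comment-line ratio
    ]

# Toy training set: snippets paired with averaged annotator judgments (1 = readable).
snippets = ["total = 0\n\n# accumulate\nfor v in vals:\n    total += v",
            "t=0\nfor v in vals: t+=v; u=t*t; w=u-v"]
labels = [1, 0]
model = LogisticRegression().fit([local_features(s) for s in snippets], labels)
print(model.predict_proba([local_features("count = 0  # items seen")])[0, 1])
```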

... Readability (RA): Readability of software code refers to a human judgement of how easy the code is to understand [14]. In our research, we consider readability as one of the most important quality metrics for an exception handler in the code example. ...
... The baseline idea is: the more readable and understandable the handler code is, the easier it is to leverage in exception handling. Buse and Weimer [14] propose a code readability model trained on human perception of readability and understandability. The model uses different textual features (e.g., length of identifiers, number of comments, line length) of the code that are likely to affect the human perception of readability. ...
... Q_ehc = µ × RA + η × AHA + κ × HCR (4) Here, µ, η, and κ are the weights of the corresponding quality metrics, which are calculated using a machine learning technique involving logistic regression (Section IV-D). While the HCR metric is likely to encourage examples with excessive handling code, the AHA metric ensures that the handlers contain meaningful statements, and the RA metric penalizes code with too many parentheses [14] (i.e., code with too many handlers). ...
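A minimal sketch of how Eq. (4) above combines the three handler-quality metrics. The weight values below (and the symbol η for the middle weight) are placeholders, since the cited paper fits them with logistic regression.

```python
# Illustrative only: Eq. (4) combines three normalized handler-quality metrics
# with fitted weights; these weight values are placeholders, not fitted ones.
def handler_quality(ra: float, aha: float, hcr: float,
                    mu: float = 0.5, eta: float = 0.3, kappa: float = 0.2) -> float:
    """Q_ehc = mu*RA + eta*AHA + kappa*HCR, all metrics scaled to [0, 1]."""
    return mu * ra + eta * aha + kappa * hcr

print(handler_quality(ra=0.8, aha=0.6, hcr=0.4))  # 0.66
```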
Preprint
Full-text available
Studies show that software developers often either misuse exception handling features or use them inefficiently, and such practice may leave an ongoing software project with a fragile, insecure, and non-robust application system. In this paper, we propose a context-aware code recommendation approach that recommends exception handling code examples from a number of popular open-source code repositories hosted on GitHub. It collects the code examples using the GitHub code search API, and then analyzes, filters, and ranks them against the code under development in the IDE by leveraging not only structural (i.e., graph-based) and lexical features but also heuristic quality measures of the exception handlers in the examples. Experiments with 4,400 code examples and 65 exception handling scenarios, as well as comparisons with four existing approaches, show that the proposed approach is highly promising.
... Although disregarded by compilers, these elements are deliberately integrated into the source code to assist developers in comprehending the code. Many factors in the source code can affect code understandability, such as the number of identifiers, size of a line of code, and nesting [20], [21]. However, the definition of whether a change is an understandability improvement depends on the developer's background (e.g., skill and experience) [15]. ...
... Buse and Weimer [21] proposed a method for measuring code understandability (code readability in their paper) based on human notions of understandability. They collected feedback from 120 human annotators on their understanding of 100 code snippets, correlating this feedback with source code features. ...
... The resulting model predicted understandability judgments with an 80% success rate. Posnett et al. [80] proposed a simplification of Buse and Weimer's [21] model. They leverage source code size metrics and Halstead metrics [81]. ...
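The Posnett et al. simplification mentioned above reduces to a logistic model over size and Halstead features. The sketch below uses the commonly quoted coefficients (z = 8.87 - 0.033*Volume + 0.40*Lines - 1.5*Entropy) and a token-level entropy; treat both as illustrative assumptions rather than a faithful reimplementation.

```python
import math
import re
from collections import Counter

def halstead_volume(tokens: list[str]) -> float:
    # V = N * log2(n), with N total tokens and n distinct tokens.
    return len(tokens) * math.log2(len(set(tokens))) if len(set(tokens)) > 1 else 0.0

def token_entropy(tokens: list[str]) -> float:
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def simple_readability(snippet: str) -> float:
    tokens = re.findall(r"\w+|[^\w\s]", snippet)
    lines = max(1, len(snippet.splitlines()))
    z = 8.87 - 0.033 * halstead_volume(tokens) + 0.40 * lines - 1.5 * token_entropy(tokens)
    return 1.0 / (1.0 + math.exp(-z))  # probability the snippet is judged readable

print(simple_readability("int a = b + c;"))
```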
Preprint
Full-text available
Motivation: Code understandability is crucial in software development, as developers spend 58% to 70% of their time reading source code. Improving it can improve productivity and reduce maintenance costs. Problem: Experimental studies often identify factors influencing code understandability in controlled settings but overlook real-world influences like project culture, guidelines, and developers' backgrounds. Ignoring these factors may yield results with limited external validity. Objective: This study investigates how developers enhance code understandability through code review comments, assuming that code reviewers are specialists in code quality. Method and Results: We analyzed 2,401 code review comments from Java open-source projects on GitHub, finding that over 42% focus on improving code understandability. We further examined 385 comments specifically related to this aspect and identified eight categories of concerns, such as inadequate documentation and poor identifiers. Notably, 83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted. We identified various types of patches that enhance understandability, from simple changes like removing unused code to context-dependent improvements such as optimizing method calls. Additionally, we evaluated four well-known linters for their ability to flag these issues, finding they cover less than 30%, although many could be easily added as new rules. Implications: Our findings encourage the development of tools to enhance code understandability, as accepted changes can serve as reliable training data for specialized machine-learning models. Our dataset supports this training and can inform the development of evidence-based code style guides. Data Availability: Our data is publicly available at https://codeupcrc.github.io.
... dos Santos and Gerosa (2018) found that 7 out of 11 Java coding practices improve the readability perceived by developers, while one of them decreases readability and the others are neutral. The assessed practices refer to lexical and syntactic features or algorithmic simplicity and were derived from existing code readability models (Buse and Weimer 2010; Scalabrino et al. 2016). Mi et al. (2023) used a causal analysis process on 420 labeled code snippets to find that a higher number of comments increases code readability, while more assignments, identifiers, and periods decrease it. ...
... For this purpose, there have been progressive improvements in the formulation of metrics and predictive models. Buse and Weimer (2010) identified simple code features correlated with human notions of readability and, based on them, built one of the first automated measures of readability. This seminal work demonstrated the possibility of automating the continuous assessment and monitoring of source code readability. ...
... A reflection by Posnett et al. (2021) lays out the evolutionary history of readability studies and models, building on the work of Buse and Weimer (2010). They also point out that the results on readability have led to recent work on assessing the understandability of code snippets (Scalabrino et al. 2021). ...
Article
Full-text available
Context While developing software, developers must first read and understand source code in order to work on change requests such as bug fixes or feature additions. The easier it is for them to understand what the code does, the faster they can get to working on change tasks. Source code is meant to be consumed by humans, and hence, the human factor of how readable the code is plays an important role. During the past decade, software engineering researchers have used eye trackers to see how developers comprehend code. The eye tracker enables us to see exactly what parts of the code the developer is reading (and for how long) in an objective manner without prompting them. Objective In this paper, we leverage eye tracking technology to replicate a prior online questionnaire-based controlled experiment (Johnson et al. 2019) to determine the visual effort needed to read code presented in different readability rule styles. As in the prior study, we assess two readability rules: minimize nesting and avoid do-while loops. Each rule is evaluated on code snippets that are correct and incorrect with respect to a requirement. Method This study was conducted in a lab setting with the Tobii X-60 eye tracker. Each of the 46 participants (21 undergraduate students, 24 graduate students, and 6 part-time or full-time professional developers) was given eight Java methods from a total set of 32 Java methods in four categories: ones that follow/do not follow the readability rule and that are correct/incorrect. After reading each code snippet, they were asked to answer a multiple-choice comprehension question about the code and some questions related to logical correctness and confidence. In addition to comparing the time and accuracy of answering the questions with the prior study, we also report on the visual effort of completing the tasks via gaze-based metrics. Results The results of this study concur with the online study, in that following the minimize nesting rule showed higher confidence (14.8%), decreased time spent reading programming tasks (7.1%), and decreased accuracy in finding bugs (5.4%). However, the decrease in accuracy was not significant. For method analysis tasks showing one Java method at a time, participants spent proportionally less time fixating on code lines (9.9%) and had fewer fixations on code lines (3.5%) when a snippet does not follow the minimize-nesting rule. However, the opposite is true when the snippet is logically incorrect (3.4% and 3.9%, respectively), regardless of whether the rule was followed. The avoid do-while rule, however, did not have as significant an effect. Following the avoid do-while rule did result in higher accuracy in task performance, albeit with lower fixation counts. We also note a lower rate for a majority of the gaze-based linearity metrics on the rule-breaking code snippet when the rule-following and rule-breaking code snippets are displayed side-by-side. Conclusions The results of this study show strong support for the use of the minimize nesting rule. All participants considered the minimize nesting rule to be important and considered the avoid do-while rule to be less important, despite the results showing that participants were more accurate when the avoid do-while rule was followed. Overall, participants ranked the snippets following the readability rules higher than the snippets that do not follow the rules.
We discuss the implications of these results for advancing the state of the art for reducing visual effort and cognitive load in code readability research.
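Not from the study itself: a minimal illustration of how gaze-based metrics of the kind reported above (fixation counts and fixation time on code lines) are computed from an eye tracker's exported fixation records. The Fixation record layout and AOI naming are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Fixation:
    duration_ms: int
    aoi: str  # area of interest, e.g. "code:12" for code line 12, or "question"

def gaze_metrics(fixations: list[Fixation]) -> dict:
    on_code = [f for f in fixations if f.aoi.startswith("code:")]
    total_ms = sum(f.duration_ms for f in fixations) or 1
    return {
        "code_fixation_count": len(on_code),
        "code_fixation_time_ms": sum(f.duration_ms for f in on_code),
        "proportion_on_code": sum(f.duration_ms for f in on_code) / total_ms,
    }

print(gaze_metrics([Fixation(220, "code:3"), Fixation(180, "question")]))
```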
Article
Motivation: Code understandability plays a crucial role in software development, as developers spend between 58% and 70% of their time reading source code. Improving code understandability can lead to enhanced productivity and save maintenance costs. Problem: Experimental studies aim to establish what makes code more or less understandable in a controlled setting, but ignore that what makes code easier to understand in the real world also depends on extraneous elements such as project culture and guidelines, and developers' background. Not accounting for the influence of these factors may lead to results that are sound but have little external validity. Objective: This study aims to investigate how developers improve code understandability during software development through code review comments. Its basic assumption is that code reviewers are specialists in code quality within a project. Method and Results: We manually analyzed 2,401 code review comments from Java open-source projects on GitHub and found that over 42% of all comments focus on improving code understandability, demonstrating the significance of this aspect in code reviews. We further explored a subset of 385 comments related to code understandability and identified eight categories of code understandability concerns, such as incomplete or inadequate code documentation, bad identifiers, and unnecessary code. Among the suggestions to improve code understandability, 83.9% were accepted and integrated into the codebase. Among these, only two (less than 1%) ended up being reverted later. We also identified types of patches that improve code understandability, ranging from simple changes (e.g., removing unused code) to more context-dependent improvements (e.g., replacing a method call chain with an existing API). Finally, we evaluated the ability of four well-known linters to flag the identified code understandability issues. These linters cover less than 30% of these issues, although some of them could be easily added as new rules. Implications: Our findings motivate and provide practical insight for the construction of tools to make code more understandable, e.g., understandability improvements are rarely reverted and thus can be used as reliable training data for specialized ML-based tools. This is also supported by our dataset, which can be used to train such models. Finally, our findings can also serve as a basis to develop evidence-based code style guides. Data Availability: Our data is publicly available at https://codeupcrc.github.io.
... These studies employ different response variables, including time, accuracy, opinion, and visual metrics, as criteria for assessing CU. Some studies directly solicit opinions from a limited number of developers about specific code pieces [18,19,20,21]. Other studies rely on metrics such as likes, stars, or votes on software repositories to gauge CU [22,23,10,24]. ...
... Another contribution of our study is the exploration of notebook metrics that have a more significant impact on CU. Previous studies have introduced a set of code metrics (such as lines of code, maximum line length, and cyclomatic complexity) that are related to CU for traditional programming languages like Java, Python, and C [19,26,27,21,28]. However, some of these metrics (such as coupling, cohesion, and depth of inheritance tree) are not relevant to notebooks. ...
... Given that each code cell in Jupyter notebooks is a regular Python script, many of the metrics for code scripts proposed by prior studies [19,39,27,21,40,41] are also applicable to code cells in Jupyter notebooks. Considering that a small percentage of notebooks use the concept of class and object orientation and that notebooks typically have a weaker structure, some metrics were ignored. ...
Preprint
Full-text available
Computational notebooks have become the primary coding environment for data scientists. However, research on their code quality is still emerging, and the code shared is often of poor quality. Given the importance of maintenance and reusability, understanding the metrics that affect notebook code comprehensibility is crucial. Code understandability, a qualitative variable, is closely tied to user opinions. Traditional approaches to measuring it either use limited questionnaires to review a few code pieces or rely on metadata such as likes and votes in software repositories. Our approach enhances the measurement of Jupyter notebook understandability by leveraging user comments related to code understandability. As a case study, we used 542,051 Kaggle Jupyter notebooks from our previous research, named DistilKaggle. We employed a fine-tuned DistilBERT transformer to identify user comments associated with code understandability. We established a criterion called User Opinion Code Understandability (UOCU), which considers the number of relevant comments, upvotes on those comments, total notebook views, and total notebook upvotes. UOCU proved to be more effective than previous methods. Furthermore, we trained machine learning models to predict notebook code understandability based solely on their metrics. We collected 34 metrics for 132,723 final notebooks as features in our dataset, using UOCU as the label. Our predictive model, using the Random Forest classifier, achieved 89% accuracy in predicting the understandability levels of computational notebooks.
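A hedged sketch of the two stages described above: a UOCU-style label built from comment and vote counts (the exact weighting in the paper is not reproduced; this formula is an assumption) and a Random Forest that predicts the label from notebook metrics alone.

```python
import math
from sklearn.ensemble import RandomForestClassifier

def uocu(n_relevant_comments: int, comment_upvotes: int,
         views: int, notebook_upvotes: int) -> float:
    # Assumed combination: engagement normalized by exposure (views).
    engagement = n_relevant_comments + comment_upvotes + notebook_upvotes
    return engagement / math.log(views + 2)

print(round(uocu(12, 30, 500, 40), 2))  # understandability label for one notebook

# Metrics-only prediction of a binarized UOCU label, as in the paper's final step.
X = [[120, 15, 3], [800, 95, 14], [60, 10, 2]]  # toy metric vectors (e.g. LOC, max line length, cells)
y = [1, 0, 1]                                   # toy high/low-UOCU classes
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[200, 30, 5]]))
```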
... Code readability is essential to program comprehension and maintainability (Lucas et al., 2019;Rugaber, 2000;Buse and Weimer, 2010). Readability is especially important for newcomers to a project since they still do not have the contextual knowledge that helps to understand the source code (Steinmacher et al., 2015). ...
... Recent studies revealed that some projects abuse annotations and that a high number of identifiers in the code harms readability (Buse and Weimer, 2010;Tashtoush et al., 2013). On the positive side, an analysis of the code evolution (Yu et al., 2019) pointed out that annotated Java code tends to be less error-prone. ...
... Readability can be defined as "a human judgment of how easy a text is to understand" (Buse and Weimer, 2010) and is directly related to its maintainability (Buse and Weimer, 2010). Development teams pursue this quality attribute since the typical software product life-cycle cost distribution is 70% maintenance and 30% development (Boehm and Basili, 2001). ...
Article
Full-text available
Context Code annotations have gained widespread popularity in programming languages, offering developers the ability to attach metadata to code elements to define custom behaviors. Many modern frameworks and APIs use annotations to keep integration less verbose and located nearer to the corresponding code element. Despite these advantages, practitioners’ anecdotal evidence suggests that annotations might negatively affect code readability. Objective To better understand this effect, this paper systematically investigates the relationship between code annotations and code readability. Method In a survey with software developers (n=332), we present 15 pairs of Java code snippets with and without code annotations. These pairs were designed considering five categories of annotation used in real-world Java frameworks and APIs. Survey participants selected the code snippet they considered more readable for each pair and answered an open question about how annotations affect the code’s readability. Results Preferences were scattered for all categories of annotation usage, revealing no consensus among participants. The answers were spread even when segregated by participants’ programming or annotation-related experience. Nevertheless, some participants showed a consistent preference in favor or against annotations across all categories, which may indicate a personal preference. Our qualitative analysis of the open-ended questions revealed that participants often praise annotation impacts on design, maintainability, and productivity but expressed contrasting views on understandability and code clarity. Conclusions Software developers and API designers can consider our results when deciding whether to use annotations, equipped with the insight that developers express contrasting views of the annotations’ impact on code readability.
... A core activity in both software development and maintenance is code reading [60]. Code readability refers to how easily a developer can comprehend the structure and logic of the source code [61], [62], [63], [64]. High readability is essential for efficient maintenance, as it reduces the cognitive load on developers and accelerates the process of updating and fixing code. ...
... High readability is essential for efficient maintenance, as it reduces the cognitive load on developers and accelerates the process of updating and fixing code. In this study, we employed two distinct readability metrics: one proposed by Buse et al. [61] (Readability) and the other by Posnett et al. [62] (SimpleReadability). Both metrics assign a readability score to each method, ranging from 0 (least readable) to 1 (most readable), providing a quantitative measure of how easily the code can be understood. ...
Preprint
Full-text available
Self-Admitted Technical Debt (SATD) refers to the phenomenon where developers explicitly acknowledge technical debt through comments in the source code. While considerable research has focused on detecting and addressing SATD, its true impact on software maintenance remains underexplored. The few studies that have examined this critical aspect have not provided concrete evidence linking SATD to negative effects on software maintenance. These studies, however, focused only on file- or class-level code granularity. This paper aims to empirically investigate the influence of SATD on various facets of software maintenance at the method level. We assess SATD's effects on code quality, bug susceptibility, change frequency, and the time practitioners typically take to resolve SATD. By analyzing a dataset of 774,051 methods from 49 open-source projects, we discovered that methods containing SATD are not only larger and more complex but also exhibit lower readability and a higher tendency for bugs and changes. We also found that SATD often remains unresolved for extended periods, adversely affecting code quality and maintainability. Our results provide empirical evidence highlighting the necessity of early identification, resource allocation, and proactive management of SATD to mitigate its long-term impacts on software quality and maintenance costs.
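For illustration, a classic keyword-pattern SATD detector over method comments, in the spirit of the comment-based identification described above; the pattern list is a common one from the SATD literature, not necessarily this paper's pipeline.

```python
import re

# Keywords commonly used to flag self-admitted technical debt in comments.
SATD_PATTERNS = re.compile(
    r"\b(todo|fixme|hack|workaround|temporary|kludge)\b", re.IGNORECASE)

def has_satd(method_source: str) -> bool:
    # Extract Java-style line and block comments, then scan them for SATD markers.
    comments = re.findall(r"//[^\n]*|/\*.*?\*/", method_source, re.DOTALL)
    return any(SATD_PATTERNS.search(c) for c in comments)

print(has_satd("int f(int n) { // HACK: fails for n < 0\n  return n; }"))  # True
```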
... Various metrics have been proposed to evaluate the readability of modules [3]. For example, Mi et al. classified modules based on their readability using a convolutional neural network [5]. ...
... In this study, we focus on the necessity of personalization for the readability evaluations. The readability of modules could be different among software developers [3]. That is, one developer might judge a module as having high readability, whereas it might not be easy for other developers to read it. ...
Preprint
Code readability is an important indicator of software maintenance as it can significantly impact maintenance efforts. Recently, LLMs (large language models) have been utilized for code readability evaluation. However, readability evaluation differs among developers, so personalization of the LLM-based evaluation is needed. This study proposes a method that calibrates the evaluation using collaborative filtering. Our preliminary analysis suggested that the method effectively enhances the accuracy of readability evaluation using LLMs.
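A minimal sketch of one way collaborative filtering could calibrate per-developer readability scores: a bias model in which a score decomposes into a global mean plus developer and module offsets. This is an assumption about the method, not the paper's formulation.

```python
from collections import defaultdict

def fit_biases(ratings):
    """ratings: list of (developer, module, readability_score) triples."""
    mu = sum(s for _, _, s in ratings) / len(ratings)  # global mean score
    dev_b, mod_b = defaultdict(float), defaultdict(float)
    for _ in range(20):  # alternating least-squares-style bias updates
        for d in {d for d, _, _ in ratings}:
            rs = [s - mu - mod_b[m] for dd, m, s in ratings if dd == d]
            dev_b[d] = sum(rs) / len(rs)
        for m in {m for _, m, _ in ratings}:
            rs = [s - mu - dev_b[d] for d, mm, s in ratings if mm == m]
            mod_b[m] = sum(rs) / len(rs)
    return mu, dev_b, mod_b

ratings = [("alice", "m1", 4), ("alice", "m2", 5), ("bob", "m1", 2)]
mu, dev_b, mod_b = fit_biases(ratings)
# Calibrated readability of module m2 as developer "bob" would likely rate it:
print(round(mu + dev_b["bob"] + mod_b["m2"], 2))
```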
... The visual organization of code, including formatting, the use of white spaces, and consistency in naming, can significantly affect how developers interpret and interact with the source code. Similarly, Buse and Weimer [6] developed a metric for code readability, demonstrating that certain characteristics of the code can significantly influence how it is understood. This metric considers factors such as the complexity of control structures, clarity in variable and function naming, and code conciseness. ...
... Buse and Weimer [6,7] and Posnett et al. [26] aimed to identify specific source code characteristics that directly impact its readability and comprehensibility. These features were assessed through the subjective perceptions of students and programmers, offering valuable insights into factors contributing to code's legibility. ...
Conference Paper
The incorporation and adaptation of style guides play an essential role in software development, influencing code formatting, naming conventions, and structure to enhance readability and simplify maintenance. However, many of these guides lack empirical studies validating their recommendations. Previous studies have examined the impact of code styles on developer performance, concluding that some styles have a negative impact on code readability. However, there is a need for more studies that assess other perspectives, and combinations of these perspectives, on a common basis through experiments. This study aimed to investigate, through eye-tracking, the impact of guidelines in style guides, with a special focus on the PEP8 guide for Python, recognized for its best practices. We conducted a controlled experiment with 32 Python novices, measuring time, the number of attempts, and visual effort through eye-tracking (fixation duration, fixation count, and regression count) for four PEP8 recommendations. Additionally, we conducted interviews to explore the subjects' difficulties and preferences with the programs. The results highlighted that not following the PEP8 Line Break after an Operator guideline increased the eye regression count by 70% in the code snippet where the standard should have been applied. Most subjects preferred the version that adhered to the PEP8 guideline, and some found the left-aligned organization of operators easier to understand. The other evaluated guidelines revealed further interesting nuances, such as the True Comparison guideline, which negatively impacted eye metrics for the PEP8 standard, although subjects preferred the PEP8 suggestion. We recommend that practitioners select guidelines supported by experimental evaluations.
... For the method level, the popular code metrics include size [23], cyclomatic complexity [63], nested block depth [25], code readability [15,78], fanout [77], maintainability index [25], Halstead metrics [3], comment ratio [105], etc. Unlike the class-level code metrics, method-level code metrics were found to be useful in multiple studies. ...
... Readability is another important indicator of maintenance effort [14,43,89,90], as practitioners spend a significant amount of time in reading code. We included two different scores of readability: one according to Buse et al. [15] and the other according to Posnett et al. [78]. We considered the fanout measurements as an indication of dependency because if there is a problem in any of the called methods, that problem can propagate to the caller method. ...
Preprint
Full-text available
The cost of software maintenance often surpasses the initial development expenses, making it a significant concern for the software industry. A key strategy for alleviating future maintenance burdens is the early prediction and identification of change-prone code components, which allows for timely optimizations. While prior research has largely concentrated on predicting change-prone files and classes, an approach less favored by practitioners, this paper shifts focus to predicting highly change-prone methods, aligning with the preferences of both practitioners and researchers. We analyzed 774,051 source code methods from 49 prominent open-source Java projects. Our findings reveal that approximately 80% of changes are concentrated in just 20% of the methods, demonstrating the Pareto 80/20 principle. Moreover, this subset of methods is responsible for the majority of the identified bugs in these projects. After establishing their critical role in mitigating software maintenance costs, our study shows that machine learning models can effectively identify these highly change-prone methods from their inception. Additionally, we conducted a thorough manual analysis to uncover common patterns (or concepts) among the more difficult-to-predict methods. These insights can help future research develop new features and enhance prediction accuracy.
... Simple textual features such as shorter lines, consistent indentation, and judicious use of comments enhance code readability [49,50]. While comments may not uniformly indicate high readability, they directly communicate intent, making their use preferable. ...
... While comments may not uniformly indicate high readability, they directly communicate intent, making their use preferable. Blank lines are positively correlated with readability [49,51]. Xiaoran et al. propose SEGMENT [52], a heuristic solution for automatic blank line insertion based on program structure and naming information. ...
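Not SEGMENT itself, but a toy heuristic in its spirit: insert a blank line whenever consecutive statements switch kind (declaration, assignment, call, control flow), approximating segment boundaries from structure alone.

```python
import re

def kind(line: str) -> str:
    # Crude statement classification for Java-like lines.
    s = line.strip()
    if re.match(r"(if|for|while|return)\b", s): return "control"
    if re.match(r"\w+\s+\w+\s*=", s):           return "declaration"
    if "=" in s:                                 return "assignment"
    return "call"

def insert_blank_lines(code: str) -> str:
    out, prev = [], None
    for line in code.splitlines():
        k = kind(line) if line.strip() else prev
        if prev is not None and k != prev and line.strip():
            out.append("")  # new segment starts here
        out.append(line)
        prev = k
    return "\n".join(out)

print(insert_blank_lines("int a = 1;\nint b = 2;\nprocess(a, b);\nreturn a;"))
```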
Preprint
Full-text available
Context: IoT systems, networks of connected devices powered by software, require studying software quality for maintenance. Despite extensive studies on non-IoT software quality, research on IoT software quality is lacking. It is uncertain whether IoT and non-IoT systems software are comparable, hindering the confident application of results and best practices gained on non-IoT systems. Objective: Therefore, we compare the code quality of two equivalent sets of IoT and non-IoT systems to determine whether there are similarities and differences. We also collect and revisit software-engineering best practices in non-IoT contexts to apply them to IoT. Method: We design and apply a systematic method to select two sets of 94 non-IoT and IoT systems from GitHub with comparable characteristics. We compute quality metrics on the systems in these two sets and then analyse and compare the metric values. We analyse in depth and provide specific examples of IoT systems' complexity and how it manifests in the codebases. After the comparison, we systematically select and present a list of best practices to address the observed differences between IoT and non-IoT code. Results: Through a comparison of metrics, we conclude that software for IoT systems is more complex, coupled, and larger, and less maintainable and cohesive than non-IoT systems. Several factors, such as integrating multiple hardware and software components and managing data communication between them, contribute to these differences. Considering these differences, we present a revisited best-practices list with approaches, tools, or techniques for developing IoT systems. For example, applying modularity and refactoring are best practices for lowering complexity. Conclusion: Based on our work, researchers can now make an informed decision about using existing studies on the quality of non-IoT systems for IoT systems.
... The simplest code metrics include the number of lines of code (NLOC) metric and its variants [17,18], as well as the number of methods and the number of fields [10]. In [19], researchers annotated a collection of code snippets by manually assigning subjective readability assessments to programs. Then, these judgments were compared to the values of different simple code metrics. ...
... For example, in the autograding system described in [25], student submissions are checked for compliance with the PEP8 Python programming standard. PEP8 limits line lengths, indentation, and the counts of spaces and other special characters that are known to complicate code readability [19]. In addition, the DTA system [25] checks the CycC values of the submitted code snippets, encouraging students to refactor their programs into small and readable functions or classes. ...
Article
Full-text available
Modern software systems consist of many software components, and their source code is hard for new developers to understand and maintain. Aiming to improve the readability and understandability of source code, companies that specialize in software development adopt programming standards, software design patterns, and static analyzers to decrease software complexity. Recent research introduced a number of code metrics allowing the numerical characterization of the maintainability of code snippets. Cyclomatic Complexity (CycC) is one widely used metric for measuring the complexity of software. The value of CycC is equal to the number of decision points in a program plus one. However, CycC does not take into account the nesting levels of the syntactic structures that break the linear control flow in a program. Aiming to resolve this, the Cognitive Complexity (CogC) metric was proposed as a successor to CycC. In this paper, we describe a rule-based algorithm and its specializations for measuring the complexity of programs. We express the CycC and CogC metrics by means of the described algorithm and propose a new complexity metric named Educational Complexity (EduC) for use in educational digital environments. EduC is at least as strict as CycC and CogC, and includes additional checks that are based on definition-use graph analysis of a program. We evaluate the CycC, CogC, and EduC metrics using the source code of programs submitted to a Digital Teaching Assistant (DTA) system that automates a university programming course. The obtained results confirm that EduC rejects more overcomplicated and difficult-to-understand programs in solving unique programming exercises generated by the DTA system when compared to CycC and CogC.
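A runnable sketch of the two metrics under discussion, for Python functions: Cyclomatic Complexity counts decision points plus one, while Cognitive Complexity additionally charges a nesting penalty. The rule sets are simplified; the full definitions (boolean-operator sequences, recursion, jumps) are richer.

```python
import ast

DECISIONS = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)

def cyclomatic(tree: ast.AST) -> int:
    # CycC = number of decision points + 1.
    return 1 + sum(isinstance(n, DECISIONS) for n in ast.walk(tree))

def cognitive(node: ast.AST, nesting: int = 0) -> int:
    # CogC: +1 per control structure, plus its current nesting depth.
    score = 0
    for child in ast.iter_child_nodes(node):
        if isinstance(child, (ast.If, ast.For, ast.While, ast.ExceptHandler)):
            score += 1 + nesting
            score += cognitive(child, nesting + 1)
        else:
            score += cognitive(child, nesting)
    return score

src = "def f(xs):\n    for x in xs:\n        if x > 0:\n            print(x)\n"
tree = ast.parse(src)
print(cyclomatic(tree), cognitive(tree))  # 3 3  (for: +1; nested if: +2 for CogC)
```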
... Our current study is only a first step, and more empirical studies must be conducted in order to pin down the usefulness of unit testing in practice. In particular, the following issues should be meticulously investigated: 1) the relation between unit testing and code quality should be examined across different domains and development methodologies, 2) the quality of unit tests can vary, so mere correlation between coverage and defects can underestimate the effect size, 3) coverage may be an inadequate measure of unit testing, 4) code complexity and cohesion have a major effect on code quality [16] and are therefore likely to be strong confounding factors, and 5) the developers' choice of which files should be tested can be deliberate. ...
Preprint
Unit testing has been considered to play a key role in building high-quality software, and therefore it has been widely used in practice. However, data on the relationship between unit testing and aspects of software quality remain scarce. A survey study with 235 responses from seven organizations was conducted in order to understand the correlation between practitioners' perception of code quality and unit testing practices. In addition, we conducted a case study in one of these organizations to investigate the correlation between unit test coverage and post-unit-test defects. In both cases, no or only weak correlations were found. We recommend further research on the effectiveness of different testing practices in order to help practitioners understand how to best allocate their resources to the testing chain.
... Code readability has a profound impact on decompiled code understandability. Buse et al. [31] proposed a method to model code readability with machine-learning algorithms based on a simple set of code features. In a subsequent development, Posnett et al. [32] constructed a simpler model based on size metrics and Halstead metrics [33]. ...
Preprint
Full-text available
Decompilation, the process of converting machine-level code into readable source code, plays a critical role in reverse engineering. Given that the main purpose of decompilation is to facilitate code comprehension in scenarios where the source code is unavailable, the understandability of decompiled code is of great importance. In this paper, we propose the first empirical study on the understandability of Java decompiled code and obtain the following findings: (1) Understandability of Java decompilation is considered as important as its correctness, and decompilation understandability issues are even more commonly encountered than decompilation failures. (2) A notable percentage of code snippets decompiled by Java decompilers exhibit significantly lower or higher levels of understandability in comparison to their original source code. (3) Unfortunately, Cognitive Complexity demonstrates relatively acceptable precision but low recall in recognizing these code snippets exhibiting diverse understandability during decompilation. (4) Even worse, perplexity demonstrates lower levels of precision and recall in recognizing such code snippets. Inspired by these four findings, we further propose six code patterns and the first metric for the assessment of decompiled code understandability. This metric was extended from Cognitive Complexity, with six more rules harvested from an exhaustive manual analysis of 1,287 pairs of source code snippets and corresponding decompiled code. The metric was also validated using the original and updated datasets, yielding an impressive macro F1-score of 0.88 on the original dataset and 0.86 on the test set.
... However, all the existing LLM-based evolutionary heuristic search methods focus on a single objective regarding the optimized performance of the target problem (Ma et al. 2023;Nasir et al. 2024;Liu et al. 2024;Romera-Paredes et al. 2024;Zhang et al. 2024;Yao et al. 2024;van Stein and Bäck 2024;Li et al. 2024;Zeng et al. 2024;Mao et al. 2024;Ma et al. 2024). Other important heuristic design criteria, such as heuristic complexity (Ausiello et al. 2012) and code readability (Buse and Weimer 2009), which could be vital in practice, are neglected. While some studies have attempted to optimize multiple objectives by combining them into a single objective function, resulting in a single heuristic, the conflicting nature of diverse objectives often makes it challenging to find a single heuristic that satisfies all simultaneously. ...
Preprint
Full-text available
Heuristics are commonly used to tackle diverse search and optimization problems. Designing heuristics usually requires tedious manual crafting with domain knowledge. Recent works have incorporated large language models (LLMs) into automatic heuristic search, leveraging their powerful language and coding capacity. However, existing research focuses on the optimal performance on the target problem as the sole objective, neglecting other criteria such as efficiency and scalability, which are vital in practice. To tackle this challenge, we propose to model heuristic search as a multi-objective optimization problem and consider introducing practical criteria beyond optimal performance. Due to the complexity of the search space, conventional multi-objective optimization methods struggle to handle multi-objective heuristic search effectively. We propose the first LLM-based multi-objective heuristic search framework, Multi-objective Evolution of Heuristic (MEoH), which integrates LLMs in a zero-shot manner to generate a non-dominated set of heuristics that meet multiple design criteria. We design a new dominance-dissimilarity mechanism for effective population management and selection, which incorporates both code dissimilarity in the search space and dominance in the objective space. MEoH is demonstrated on two well-known combinatorial optimization problems: the online Bin Packing Problem (BPP) and the Traveling Salesman Problem (TSP). Results indicate that a variety of elite heuristics is automatically generated in a single run, offering more trade-off options than existing methods. It successfully achieves competitive or superior performance while improving efficiency by up to 10 times. Moreover, we observe that the multi-objective search introduces novel insights into heuristic design and leads to the discovery of diverse heuristics.
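For illustration, the non-dominated filtering at the core of any multi-objective heuristic search, applied to candidate heuristics carrying objective vectors (lower is better). MEoH's actual dominance-dissimilarity mechanism additionally scores code dissimilarity; that part is omitted here.

```python
def dominates(a: tuple, b: tuple) -> bool:
    # a dominates b if it is no worse in every objective and better in at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(candidates):
    """candidates: list of (name, objective_vector); returns the Pareto front."""
    return [
        (name, obj) for name, obj in candidates
        if not any(dominates(other, obj) for _, other in candidates if other != obj)
    ]

# Toy objectives: (runtime cost, code complexity), both minimized.
pool = [("h1", (1.0, 9.0)), ("h2", (2.0, 4.0)), ("h3", (2.5, 4.5)), ("h4", (5.0, 1.0))]
print(non_dominated(pool))  # h1, h2, h4 form the front; h3 is dominated by h2
```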
... The present experiment looks like a contradiction of the results by Buse and Weimer (2010) as well as Scalabrino et al. (2016), who constructed readability models using ML techniques. In both cases, it is documented that an increase in indentation decreases readability. ...
Article
Full-text available
Indentation is an old technique that emphasizes elements in source code using white spaces or tabs. But while this technique has been taught and applied for decades, evidence for its effectiveness is weak: up to 2022, relatively few experiments can be found. The present authors are aware of only one single experiment that revealed an effect of indentation and reported the effect size, and even in that experiment the effect of indentation was found to be weak. The situation changed recently, when an experiment was published that suddenly revealed a strong and large effect of indentation in control flows on reaction time. However, although the experiment provided an initial indication of a possible cause for the difference between indented and non-indented code (the length of the code that can be skipped in indented code), the evidence for this indicator was rather weak. The present paper presents a formal model of the differences in reading times (measured in terms of reaction times) between indented and non-indented code. Based on that model, a controlled experiment on generated tasks was designed and executed on 27 participants (undergraduate students, PhD students, and professionals). The experiment (again) confirms a strong (p < .001) and large (ηp² = .198, M_Non-Indented / M_Indented = 2.13) effect of indentation. Furthermore, it confirms that the larger the skippable code is, the larger the difference in reading time between indented and non-indented code (p = .001, ηp² = .072). That is, the experiment goes beyond the point "indented vs. non-indented code" and explains the difference by revealing a factor that controls it. However, although the previous statements hold true for the whole sample in the experiment, this effect could only be shown for a subset of individual participants.
... Go code is formatted mechanically to a consistent form using a tool included in the tool chain; this form, and other aspects of the Go grammar and coding style promote short lines, favoring code readability (Buse and Weimer, 2010), an important aspect in code maintainability and peer review. ...
Preprint
Full-text available
bíogo is a framework designed to ease development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed compiled language with built-in support for concurrent processing and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate manipulation and analysis of large-scale genomic and other biological data. bíogo uses a concise and expressive syntax, lowering the barriers to entry for researchers needing to process large data sets with custom analyses while retaining computational safety and ease of code review. We believe bíogo provides an excellent environment for training and research in computational biology because of its combination of strict typing, simple and expressive syntax, and high performance.
... The second category applies code readability studies to test readability. Based on Buse and Weimer's research on code readability [36], Daka et al. [17] gathered a dataset of tests along with their human-rated readability, used common code metrics as features (including identifier length, parentheses, test length, and casts), and added extra features for tests (e.g., assertions). They trained a linear regression model to improve the readability of automated unit tests. ...
Preprint
Full-text available
Automated test techniques usually generate unit tests with higher code coverage than manual tests. However, the readability of automated tests is crucial for code comprehension and maintenance. The readability of unit tests involves many aspects. In this paper, we focus on test inputs. The central limitation of existing studies on input readability is that they focus on test code alone without taking the tested source code into consideration, making them either ignore different source codes' different readability requirements or require manual effort to write readable inputs. However, we observe that the source code specifies the contexts that test inputs must satisfy. Based on this observation, we introduce the Context Consistency Criterion (a.k.a. C3), a readability measurement tool that leverages Large Language Models to extract primitive-type (including string-type) parameters' readability contexts from the source code and checks whether test inputs are consistent with those contexts. We have also proposed EvoSuiteC3, which leverages C3's extracted contexts to help EvoSuite generate readable test inputs. We have evaluated C3's performance on 409 Java classes and compared the readability of manual and automated tests under C3 measurement. The results are two-fold. First, the Precision, Recall, and F1-Score of C3's mined readability contexts are \precision{}, \recall{}, and \fone{}, respectively. Second, under C3's measurement, the string-type input readability scores of EvoSuiteC3, ChatUniTest (an LLM-based test generation tool), manual tests, and two traditional tools (EvoSuite and Randoop) are 90%, 83%, 68%, 8%, and 8%, respectively, showing the traditional tools' inability to generate readable string-type inputs.
... However, a simple execution success does not always guarantee the reproducibility of issues [6]. Several studies investigate the quality of SO code snippets by measuring their readability [12,13,14,15,16,17], and understandability [18,19,20]. Unfortunately, their capability of reproducing the issues reported in SO questions was not investigated. ...
Preprint
Full-text available
Software developers often submit questions to technical Q&A sites like Stack Overflow (SO) to resolve code-level problems. In practice, they include example code snippets with questions to explain the programming issues. Existing research suggests that users attempt to reproduce the reported issues using given code snippets when answering questions. Unfortunately, such code snippets could not always reproduce the issues due to several unmet challenges that prevent questions from receiving appropriate and prompt solutions. One previous study investigated reproducibility challenges and produced a catalog. However, how practitioners perceive this challenge catalog is unknown. Practitioners' perspectives are essential for validating these challenges and estimating their severity. This study first surveyed 53 practitioners to understand their perspectives on reproducibility challenges. We attempt to (a) see whether they agree with these challenges, (b) determine the impact of each challenge on answering questions, and (c) identify the need for tools to promote reproducibility. Survey results show that (a) about 90% of the participants agree with the challenges, (b) "missing an important part of code" most severely hurts reproducibility, and (c) participants strongly recommend introducing automated tool support to promote reproducibility. Second, we extract nine code-based features (e.g., LOC, compilability) and build five Machine Learning (ML) models to predict issue reproducibility. Early detection might help users improve code snippets and their reproducibility. Our models achieve 84.5% precision, 83.0% recall, 82.8% F1-score, and 82.8% overall accuracy, which are highly promising. Third, we systematically interpret the ML model and explain how code snippets with reproducible issues differ from those with irreproducible issues.
... Buse and Weimer [13] show that the average number of identifiers per line, the average line length, and the average nesting depth are negatively correlated with readability. In contrast, the average number of comment lines and the average number of semantically breaking blank lines are positively correlated with readability. ...
Article
Full-text available
In this article, we present an approach to the ABZ 2020 case study that differs from those usually presented at ABZ: Rather than using a (correct-by-construction) approach following a formal method, we use C for a low-level implementation instead. We strictly adhere to test-driven development for validation, and only afterwards apply model checking using CBMC for verification. While the approach has several benefits compared to the more rigorous approaches, it also provides less mathematical clarity and overall less thorough verification. In consequence, our realization of the ABZ case study serves as a baseline reference for comparison, allowing to assess the benefit provided by the various formal modeling languages, methods and tools.
... Given that each code cell in Jupyter notebooks is a regular Python script, many of the metrics for code scripts proposed by prior studies [4,7,12,15,17,23] are also applicable to code cells in Jupyter notebooks. Considering that a small percentage of notebooks use the concept of class and object orientation and have a weaker structure, some metrics were ignored. ...
Conference Paper
Full-text available
Jupyter notebooks have become indispensable tools for data analysis and processing in various domains. However, despite their widespread use, there is a notable research gap in understanding and analyzing the contents and code metrics of these notebooks. This gap is primarily attributed to the absence of datasets that encompass both Jupyter notebooks and their extracted code metrics. To address this limitation, we introduce DistilKaggle, a unique dataset specifically curated to facilitate research on code metrics in Jupyter notebooks, utilizing the Kaggle repository as a prime source. Through an extensive study, we identify thirty-four code metrics that significantly impact Jupyter notebook code quality. These features, such as lines of code per cell, mean number of words in markdown cells, and performance tier of the developer, are crucial for understanding and improving the overall effectiveness of computational notebooks. The DistilKaggle dataset, derived from a vast collection of notebooks, constitutes two distinct datasets: (i) the Code Cells and Markdown Cells Dataset, presented in two CSV files that integrate easily into researchers' workflows as dataframes; it provides a granular view of the content structure within 542,051 Jupyter notebooks, enabling detailed analysis of code and markdown cells; and (ii) the Notebook Code Metrics Dataset, focused on the identified code metrics of notebooks. Researchers can leverage this dataset to access Jupyter notebooks with specific code quality characteristics, surpassing the limitations of the filters available on the Kaggle website. Furthermore, the reproducibility of the notebooks in our dataset is ensured through the code cells and markdown cells datasets, offering a reliable foundation for researchers to build upon. Given the substantial size of our datasets, this becomes an invaluable resource for the research community, surpassing the capabilities of individual Kaggle users to collect such extensive data. For accessibility and transparency, both the dataset and the code utilized in crafting it are publicly available at https://github.com/ISE-Research/DistilKaggle.
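A hedged sketch of extracting two of the dataset's metric families from a raw .ipynb file (lines of code in code cells, mean words in markdown cells). The field names follow the standard notebook JSON format, not the paper's own extraction code; the file path in the usage line is hypothetical.

```python
import json

def notebook_metrics(path: str) -> dict:
    with open(path, encoding="utf-8") as fh:
        cells = json.load(fh)["cells"]
    # "source" may be a string or a list of lines; "".join handles both.
    code = ["".join(c["source"]) for c in cells if c["cell_type"] == "code"]
    md = ["".join(c["source"]) for c in cells if c["cell_type"] == "markdown"]
    return {
        "code_cells": len(code),
        "loc": sum(len(s.splitlines()) for s in code),
        "mean_markdown_words": (sum(len(s.split()) for s in md) / len(md)) if md else 0.0,
    }

# print(notebook_metrics("analysis.ipynb"))  # hypothetical notebook path
```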
... Over the years, naming variables has proven to be one of the most challenging steps that developers face during programming. Choosing a poor variable name decreases the readability and understandability of the code, since the variable's purpose and meaning are not reflected directly by the label assigned [3]. Thus, programmers communicate their intentions via suggestive names, which can serve as a form of documentation within the code itself, helping other developers understand the code without relying heavily on external comments or documentation. ...
Article
Full-text available
The JavaScript code deployed goes through the process of minification, in which variables are renamed using single-character names and spaces are removed in order for the files to have a smaller size, thus loading faster. Because of this, the code becomes unintelligible, making it harder to analyze manually. Since JavaScript experts can understand it, machine learning approaches to deobfuscating the minified file are possible. Thus, we propose a technique that finds a fitting name for each obfuscated variable, both intuitive and meaningful given the usage of that variable, based on a Sequence-to-Sequence model that generates the name character by character to cover all possible variable names. The proposed approach achieves an average exact name generation accuracy of 70.53%, outperforming the state of the art by 12%. Keywords and phrases: JavaScript deobfuscation, variable name prediction, Deep Learning, Recurrent Neural Network, Abstract Syntax Tree.
... Code readability is another quality that businesses and individual developers should practice consistently. Poor readability could mean that only the person who wrote the code can understand it, thereby hindering growth and resulting in losses for one of the involved parties [19,20]. ...
... - Subjective rating. This is the subjective perception of how well maintainers believe they understood the code they are asked to maintain, usually on an ordinal scale (Börstler and Paech 2016; Buse and Weimer 2010; Scalabrino et al. 2021). Other studies (Floyd et al. 2017; Fucci et al. 2019; Ikutani and Uwano 2014; Peitek et al. 2020; Sharafi et al. 2021) investigated the physiological activities occurring in the human body when understanding software code, involving for instance the brain, heart, and skin. ...
Article
Full-text available
Context Insufficient code understandability makes software difficult to inspect and maintain and is a primary cause of software development cost. Several source code measures may be used to identify difficult-to-understand code, including well-known ones such as Lines of Code and McCabe’s Cyclomatic Complexity, and novel ones, such as Cognitive Complexity. Objective We investigate whether and to what extent source code measures, individually or together, are correlated with code understandability. Method We carried out an empirical study with students who were asked to carry out realistic maintenance tasks on methods from real-life Open Source Software projects. We collected several data items, including the time needed to correctly complete the maintenance tasks, which we used to quantify method understandability. We investigated the presence of correlations between the collected code measures and code understandability by using several Machine Learning techniques. Results We obtained models of code understandability using one or two code measures. However, the obtained models are not very accurate, the average prediction error being around 30%. Conclusions Based on our empirical study, it does not appear possible to build an understandability model based on structural code measures alone. Specifically, even the newly introduced Cognitive Complexity measure does not seem able to fulfill the promise of providing substantial improvements over existing measures, at least as far as code understandability prediction is concerned. It seems that, to obtain models of code understandability of acceptable accuracy, process measures should be used, possibly together with new source code measures that are better related to code understandability.
Article
The fractal property has been regarded as a fundamental property of complex networks, characterizing the self-similarity of a network. Such a property is usually numerically characterized by the fractal dimension metric, and it not only helps the understanding of the relationship between the structure and function of complex networks, but also finds a wide range of applications in complex systems. The existing literature shows that class-level software networks (i.e., class dependency networks) are complex networks with the fractal property. However, the fractal property at the feature (i.e., methods and fields) level has never been investigated, although it is useful for measuring class complexity and predicting bugs in classes. Furthermore, existing studies on the fractal property of software systems were all performed on un-weighted software networks and have not been used in any practical quality assurance tasks such as bug prediction. Generally, considering the weights on edges can give us more accurate representations of the software structure and thus help us obtain more accurate results. The illustration of an approach's practical use can promote its adoption in practice. In this paper, we examine the fractal property of classes by proposing a new metric. Specifically, we build an FLSN (Feature Level Software Network) for each class to represent the methods/fields and their couplings (including coupling frequencies) within the class, and propose a new metric, FDC (Fractal Dimension for Classes), to numerically describe the fractal property of classes using FLSNs, which captures class complexity. We evaluate FDC theoretically against Weyuker's nine properties, and the results show that it adheres to eight of the nine. Empirical experiments performed on a set of twelve large open-source Java systems show that i) for most classes (more than 96%), the fractal property exists in their FLSNs, ii) FDC is capable of capturing additional aspects of class complexity that have not been addressed by existing complexity metrics, iii) FDC significantly correlates with both the existing class-level complexity metrics and the number of bugs in classes, and iv) when used together with existing class-level complexity metrics, FDC can significantly improve bug prediction in classes in three scenarios (i.e., bug-count, bug-classification, and effort-aware) of the cross-project context, but not in the within-project context.
Article
Automated test generation tools enable test automation and alleviate the low efficiency of writing hand-crafted test cases. However, existing automated tools are not mature enough to be widely used by software testing groups. This paper presents an empirical study of state-of-the-art automated test generation tools for Java, i.e., EvoSuite, Randoop, JDoop, JTeXpert, T3, and Tardis. We design a test workflow to facilitate the process, which can automatically run the tools for test generation, collect data, and evaluate various metrics. Furthermore, we empirically analyze these six tools and their related techniques from different aspects, i.e., code coverage, mutation score, test suite size, readability, and real fault detection ability. We discuss the benefits and drawbacks of hybrid techniques based on the experimental results. In addition, we report our experience in setting up and executing these tools, and summarize their usability and user-friendliness. Finally, we give some insights into automated tools in terms of test suite readability improvement, meaningful assertion generation, test suite reduction for random testing tools, and symbolic execution integration.
Chapter
Small and Medium-Sized Enterprises (SMEs), which are recognized as having a significant impact on the economies of developing countries, require specialized software frameworks to design and manage their systems. However, software developers often face challenges in maintaining existing software systems, particularly when trying to comprehend complex code. While researchers have proposed code interpretation principles to help with this, those principles do not address code structure transformation. In this study, a code transformation approach is presented to improve the readability of code structure. The approach includes a set of heuristic rules based on five primary transformation rules for analysis. Two case studies were conducted to evaluate the effectiveness of the method; the results show that developers find it easier and less time-consuming to understand the code under examination with this approach than with conventional methods.
Article
Code comments are a crucial source of software documentation that captures various aspects of the code. Such comments play a vital role in understanding the source code and facilitating communication between developers. However, with iterative software releases, software projects become larger and more complex, leading to a corresponding increase in issues such as mismatched, incomplete, or outdated code comments. These inconsistencies in code comments can misguide developers and result in potential bugs, and there has been a steady rise in reports of such inconsistencies over time. Despite numerous methods being proposed for detecting code-comment inconsistencies, their learning effect remains limited due to a lack of consideration for issues such as characterization noise and labeling errors in datasets. To overcome these limitations, we propose a novel approach called MCCL that first removes noise from the dataset and then detects inconsistent code comments in a timely manner, thereby enhancing the model’s learning ability. Our proposed model facilitates better matching between code and comments, leading to improved development of software engineering projects. MCCL comprises two components, namely method comment detection and confidence-learning denoising. The method comment detection component captures the intricate relationships between code and comments by learning their syntactic and semantic structures. It correlates the code and comments through an attention mechanism to identify how changes in the code affect the comments. Furthermore, the confidence-learning denoising component of MCCL identifies and removes characterization noise and labeling errors to enhance the quality of the datasets. This is achieved by implementing principles such as pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. By effectively eliminating noise from the dataset, our model is able to learn inconsistencies between comments and source code more accurately. Our experiments on 1,518 open-source projects demonstrate that MCCL can accurately detect inconsistencies, achieving an average F1-score of 82.6%. This result outperforms state-of-the-art methods by 2.4% to 28.0%. MCCL is therefore more effective than existing approaches at identifying comments that are inconsistent with code changes.
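The "probabilistic threshold" idea mentioned above can be sketched in a few lines: drop a training example when the model's out-of-sample confidence in its assigned label falls well below the average self-confidence for that label. This is a generic confidence-learning illustration, not MCCL's actual denoising component; the 0.5 slack factor, function names, and data are all invented.

```python
# Illustrative sketch of confidence-based label-noise pruning.
import numpy as np

def prune_noisy(probs, labels, slack=0.5):
    """probs: (n, k) out-of-sample predicted probabilities;
    labels: (n,) assigned labels. Returns indices of examples kept."""
    labels = np.asarray(labels)
    # Per-class threshold: mean predicted probability among examples
    # that carry that label (their average "self-confidence").
    thresholds = np.array([
        probs[labels == c, c].mean() for c in range(probs.shape[1])
    ])
    self_conf = probs[np.arange(len(labels)), labels]
    return np.where(self_conf >= thresholds[labels] * slack)[0]

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.05, 0.95], [0.85, 0.15]])
labels = [0, 1, 0, 0]            # the third example looks mislabeled
print(prune_noisy(probs, labels))  # keeps indices 0, 1, 3
```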
Article
Fixing software bugs can be colossally expensive, especially if they are discovered in the later phases of the software development life cycle. As such, bug prediction has been a classic problem for the research community. As of now, a Google Scholar search for the phrase “bug prediction” returns roughly 113,000 hits. Despite this staggering effort by the research community, bug prediction research is criticized for not being decisively adopted in practice. A significant problem of the existing research is the granularity level (i.e., class/file level) at which bug prediction has historically been studied. Practitioners find it difficult and time-consuming to locate bugs at the class/file-level granularity. Consequently, method-level bug prediction has become popular in the last decade. We ask: are these method-level bug prediction models ready for industry use? Unfortunately, the answer is no. The reported high accuracies of these models dwindle significantly if we evaluate them in realistic, time-sensitive contexts. It may seem hopeless at first, but encouragingly, we show that future method-level bug prediction can be improved significantly. In general, we show how to reliably evaluate future method-level bug prediction models, and how to improve them by focusing on four different improvement avenues: building noise-free bug data, addressing concept drift, selecting similar training projects, and developing a mixture of models. Our findings are based on three publicly available method-level bug datasets and a newly built bug dataset of 774,051 Java methods originating from 49 open-source software projects.
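A "time-sensitive context" in this sense can be made concrete with a tiny sketch: instead of randomly cross-validating, train only on methods whose bug labels were already known before a cutoff date and test on the rest. The record fields and dates below are invented for illustration; this is not the paper's evaluation harness.

```python
# Illustrative sketch: a time-sensitive train/test split for
# method-level bug prediction, as opposed to random cross-validation.
from datetime import date

dataset = [
    # (method_id, features, buggy, label_known_on)
    ("m1", [12, 3], 0, date(2020, 1, 10)),
    ("m2", [80, 9], 1, date(2020, 6, 2)),
    ("m3", [25, 4], 0, date(2021, 2, 14)),
    ("m4", [60, 7], 1, date(2021, 8, 30)),
]
cutoff = date(2021, 1, 1)
# Only labels that already existed at the cutoff may be trained on;
# everything after the cutoff simulates the "future" being predicted.
train = [r for r in dataset if r[3] < cutoff]
test = [r for r in dataset if r[3] >= cutoff]
print(len(train), "training methods,", len(test), "test methods")
```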
Article
A large number of tutorials for popular software development technologies are available online, and those about the same technology vary widely in their presentation. We studied the design of tutorials in the software documentation landscape for five popular programming languages: Java, C#, Python, Javascript, and Typescript. We investigated the extent to which tutorial pages, i.e., resources, differ and report statistics of variations in resource properties. We developed a framework for characterizing resources based on their distinguishing attributes, i.e., properties that vary widely for the resource relative to other resources. Additionally, we propose that a resource can be represented by its resource style, i.e., the combination of its distinguishing attributes. We discuss three techniques for characterizing resources based on our framework, to capture notable and relevant content and presentation properties of tutorial pages. We apply these techniques on a data set of 2551 resources to validate that our framework identifies valid and interpretable styles. We contribute this framework for reasoning about the design of resources in the online software documentation landscape.
Conference Paper
Full-text available
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment (over half a million runs of C4.5 and a Naive-Bayes algorithm) to estimate the effects of different parameters of these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computational power allows using more folds.
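As a concrete illustration of the recommended procedure, the sketch below runs ten-fold stratified cross-validation to choose between a Naive-Bayes classifier and a decision tree (scikit-learn's rough stand-in for C4.5) on a stock dataset; the dataset and library choice are for illustration, not the paper's experimental setup.

```python
# Illustrative sketch: ten-fold stratified cross-validation for
# model selection, using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [("NaiveBayes", GaussianNB()),
                  ("DecisionTree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=cv)  # one accuracy per fold
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```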
Article
Full-text available
This article argues that the general practice of describing interrater reliability as a single, unified concept is at best imprecise, and at worst potentially misleading. Rather than representing a single concept, different statistical methods for computing interrater reliability can be more accurately classified into one of three categories based upon the underlying goals of analysis. The three general categories introduced and described in this paper are: 1) consensus estimates, 2) consistency estimates, and 3) measurement estimates. The assumptions, interpretation, advantages, and disadvantages of estimates from each of these three categories are discussed, along with several popular methods of computing interrater reliability coefficients that fall under the umbrella of consensus, consistency, and measurement estimates. Researchers and practitioners should be aware that different approaches to estimating interrater reliability carry with them different implications for how ratings across multiple judges should be summarized, which may impact the validity of subsequent study results.
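To make the categories concrete, the sketch below computes two of the estimates discussed: raw percent agreement (a consensus estimate) and Cohen's kappa, which corrects observed agreement for chance. The ratings are invented for illustration.

```python
# Illustrative sketch: two interrater-reliability estimates for the
# same pair of raters.
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two raters give the same rating."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_obs = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[c] * cb[c] for c in set(a) | set(b)) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)

rater1 = [1, 2, 3, 3, 2, 1, 1, 2]
rater2 = [1, 2, 3, 2, 2, 1, 3, 2]
print(percent_agreement(rater1, rater2))  # 0.75
print(round(cohens_kappa(rater1, rater2), 3))  # ~0.619, lower than raw
```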
Article
Full-text available
First Page of the Article
Article
Full-text available
Many techniques have been developed over the years to automatically find bugs in software. Often, these techniques rely on formal methods and sophisticated program analysis. While these techniques are valuable, they can be difficult to apply, and they aren't always effective in finding real bugs. Bug patterns are code idioms that are often errors. We have implemented automatic detectors for a variety of bug patterns found in Java programs. In this extended abstract, we describe how we have used bug pattern detectors to find serious bugs in several widely used Java applications and libraries. We have found that the effort required to implement a bug pattern detector tends to be low, and that even extremely simple detectors find bugs in real applications. From our experience applying bug pattern detectors to real programs, we have drawn several interesting conclusions. First, we have found that even well-tested code written by experts contains a surprising number of obvious bugs. Second, Java (and similar languages) have many language features and APIs which are prone to misuse. Finally, simple automatic techniques can be effective at countering the impact of both ordinary mistakes and misunderstood language features.
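The flavor of an "extremely simple detector" can be shown in a few lines: flag Java code that compares a string to a literal with ==, a classic reference-versus-content mistake. The pattern, the sample source, and the use of Python as the host language are all illustrative; production detectors generally analyze parsed program structure rather than raw text.

```python
# Illustrative sketch: a deliberately naive bug-pattern detector that
# flags Java string comparison with == against a string literal.
import re

PATTERN = re.compile(r'(\w+)\s*==\s*"(?:[^"\\]|\\.)*"')

def scan(source: str):
    """Yield (line number, line) for each suspicious comparison."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        if PATTERN.search(line):
            yield lineno, line.strip()

java = '''
String name = getName();
if (name == "admin") {   // bug: compares references, not contents
    grantAccess();
}
'''
for lineno, line in scan(java):
    print(f"line {lineno}: suspicious string '==': {line}")
```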
Conference Paper
Full-text available
The F-measure - the number of distinct test cases to detect the first program failure - is an effectiveness measure for debug testing strategies. We show that for random testing with replacement, the F-measure is distributed according to the geometric distribution. A simulation study examines the distribution of two adaptive random testing methods, to study how closely their sampling distributions approximate the geometric distribution, revealing that in the worst case scenario, the sampling distribution for adaptive random testing is very similar to random testing. Our results have provided an answer to a conjecture that adaptive random testing is always a more effective alternative to random testing, with reference to the F-measure. We consider the implications of our findings for previous studies conducted in the area, and make recommendations to future studies.
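The claimed distribution is easy to check empirically: if each randomly drawn test case fails with probability theta, the number of tests until the first failure is Geometric(theta), with mean 1/theta. A quick Monte Carlo sketch (an illustration, not the paper's simulation study):

```python
# Illustrative sketch: the F-measure of random testing with
# replacement is geometrically distributed.
import random

def f_measure(theta):
    """Number of test cases drawn until the first failure."""
    draws = 1
    while random.random() >= theta:  # each test fails with prob. theta
        draws += 1
    return draws

theta = 0.01
samples = [f_measure(theta) for _ in range(100_000)]
print("empirical mean:", sum(samples) / len(samples))  # ~ 1/theta = 100
```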
Conference Paper
Full-text available
WEKA is a workbench for machine learning that is intended to aid in the application of machine learning techniques to a variety of real-world problems, in particular, those arising from agricultural and horticultural domains. Unlike other machine learning projects, the emphasis is on providing a working environment for the domain specialist rather than the machine learning expert. Lessons learned include the necessity of providing a wealth of interactive tools for data manipulation, result visualization, database linkage, and cross-validation and comparison of rule sets, to complement the basic machine learning tools
Article
Full-text available
A set of properties of syntactic software complexity measures is proposed to serve as a basis for the evaluation of such measures. Four known complexity measures are evaluated and compared using these criteria. This formalized evaluation clarifies the strengths and weaknesses of the examined complexity measures, which include the statement count, cyclomatic number, effort measure, and data flow complexity measures. None of these measures possesses all nine properties, and several are found to fail to possess particularly fundamental properties; this failure calls into question their usefulness in measuring syntactic complexity.
Article
The utility of technical materials is influenced to a marked extent by their reading level or readability. This article describes the derivation and validation of the Automated Readability Index (ARI) for use with technical materials. The method allows for the easy, automatic collection of data as narrative material is typed on a slightly modified electric typewriter. Data collected includes word length (a measure of word difficulty) and sentence length (a measure of sentence difficulty). Appropriate weightings of these factors in a multiple regression equation result in an index of reading difficulty. Uses of the index for evaluating and controlling the readability of large quantities of technical material are described.
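The index itself is a small linear formula over those two measures; in its commonly cited form, ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43. A minimal sketch, with naive word and sentence splitting standing in for the typewriter-based data collection the abstract describes:

```python
# Illustrative sketch: the commonly cited form of the Automated
# Readability Index, combining word length and sentence length.
import re

def ari(text: str) -> float:
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w) for w in words)
    return (4.71 * chars / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)

# Very simple text yields a low (here negative) grade-level score.
print(round(ari("The cat sat on the mat. It purred."), 2))
```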
Article
The consensus in the programming community is that indentation aids program comprehension, although many studies do not back this up. The authors tested program comprehension on a Pascal program. Two styles of indentation were used - blocked and nonblocked - in addition to four possible levels of indentation (0, 2, 4, 6 spaces). Both experienced and novice subjects were used. Although the blocking style made no difference, the level of indentation had a significant effect on program comprehension (2-4 spaces had the highest mean score for program comprehension). It is recommended that a moderate level of indentation be used to increase program comprehension and user satisfaction.
Article
An abstract is not available.
Article
The paper is based on the premise that the productivity and quality of software development and maintenance, particularly in large and long-term projects, are related to software readability. Software readability depends on such things as coding conventions and system overview documentation. Existing mechanisms to ensure readability - for example, peer reviews - are not sufficient. The paper proposes that software organizations or projects institute a readability/documentation group, similar to a test or usability group. This group would be composed of programmers and would produce overview documentation and ensure code and documentation readability. The features and functions of this group are described. Its benefits and possible objections to it are discussed.
Article
It is argued that program reading is an important programmer activity and that reading skill should be taught in programming courses. Possible teaching methods are suggested. The use of program reading in test construction and as part of an overall teaching strategy is discussed. A classification of reading comprehension testing methods is provided in an appendix.
Article
The problem of poorly written hyperdocuments has already been identified. Furthermore, there is no complete definition of hyperdocument quality, and the methodology and tools that will help in analysing and assessing the quality of hyperdocuments are missing. The ability to measure attributes of hyperdocuments is indispensable for the fields of hyperdocument authoring and hypertext engineering. Useful paradigms can be drawn from the practices used in the software engineering and software measurement fields. In this paper we define a hyperdocument quality model, based on the ideas of the well-known Factor-Criteria-Metric hierarchical model. The important factors of readability and maintainability are defined, as well as the corresponding criteria. Finally, structure metrics, which can be computed on the hypertext graph, are presented. Most of these metrics are derived from well-known software metrics. Experimentation is a key issue for the application of measurement, and flexible tools for the automatic collection of measures are needed to support it. Athena, a tool that was originally developed for software measurement and later tailored to meet hypertext measurement needs, is used for hyperdocument measurement.
Conference Paper
In this paper, we explore the concept of code readability and investigate its relation to software quality. With data collected from human annotators, we derive associations between a simple set of local code features and human notions of readability. Using those features, we construct an automated readability measure and show that it can be 80% effective, and better than a human on average, at predicting readability judgments. Furthermore, we show that this metric correlates strongly with two traditional measures of software quality, code changes and defect reports. Finally, we discuss the implications of this study on programming language design and engineering practice. For example, our data suggest that comments, in and of themselves, are less important than simple blank lines to local judgments of readability.
Conference Paper
Formal specifications can help with program testing, optimization, refactoring, documentation, and, most importantly, debugging and repair. Unfortunately, formal specifications are difficult to write manually, while techniques that infer specifications automatically suffer from 90–99% false positive rates. Consequently, neither option is currently practical for most software development projects. We present a novel technique that automatically infers partial correctness specifications with a very low false positive rate. We claim that existing specification miners yield false positives because they assign equal weight to all aspects of program behavior. By using additional information from the software engineering process, we are able to dramatically reduce this rate. For example, we grant less credence to duplicate code, infrequently-tested code, and code that exhibits high turnover in the version control system. We evaluate our technique in two ways: as a preprocessing step for an existing specification miner and as part of novel specification inference algorithms. Our technique identifies which input is most indicative of program behavior, which allows off-the-shelf techniques to learn the same number of specifications using only 60% of their original input. Our inference approach has few false positives in practice, while still finding useful specifications on over 800,000 lines of code. When minimizing false alarms, we obtain a 5% false positive rate, an order-of-magnitude improvement over previous work. When used to find bugs, our mined specifications locate over 250 policy violations. To the best of our knowledge, this is the first specification miner with such a low false positive rate, and thus a low associated burden of manual inspection.
Article
The list itself is based on a concise selection of empirical data and is in rough priority order. The first fact had the greatest effect on defect reduction in the empirical data used for evaluation, while the last fact was less important. The priority of the facts is debatable and depends on the context.
Article
Frequently, when circumstances require that a computer program be modified, the program is found to be extremely difficult to read and understand. In this case a new step to make the program more readable should be added at the beginning of the software modification cycle. A small investment will make (1) the specifications for the modifications easier to write, (2) the estimate of the cost of the modifications more accurate, (3) the design for the modifications simpler, and (4) the implementation of the modifications less error-prone.
Article
Treemaps, a space-filling method for visualizing large hierarchical data sets, are receiving increasing attention. Several algorithms have been previously proposed to create more useful displays by controlling the aspect ratios of the rectangles that make up a treemap. While these algorithms do improve visibility of small items in a single layout, they introduce instability over time in the display of dynamically changing data, fail to preserve order of the underlying data, and create layouts that are difficult to visually search. In addition, continuous treemap algorithms are not suitable for displaying fixed-sized objects within them, such as images. This paper introduces a new "strip" treemap algorithm which addresses these shortcomings, and analyzes other "pivot" algorithms we recently developed, showing the trade-offs between them. These ordered treemap algorithms ensure that items near each other in the given order will be near each other in the treemap layout. Using experimental evidence from Monte Carlo trials and from actual stock market data, we show that, compared to other layout algorithms, ordered treemaps are more stable, while maintaining relatively favorable aspect ratios of the constituent rectangles. A user study with 20 participants clarifies the human performance benefits of the new algorithms. Finally, we present quantum treemap algorithms, which modify the layout of the continuous treemap algorithms to generate rectangles that are integral multiples of an input object size. The quantum treemap algorithm has been applied to PhotoMesa, an application that supports browsing of large numbers of images.
Article
This paper describes a graph-theoretic complexity measure and illustrates how it can be used to manage and control program complexity. The paper first explains how the graph-theory concepts apply and gives an intuitive explanation of the graph concepts in programming terms. The issue of using nonstructured control flow is also discussed. A characterization of nonstructured control graphs is given, and a method of measuring the "structuredness" of a program is developed. The last section of this paper deals with a testing methodology used in conjunction with the complexity measure; a testing strategy is defined which dictates that a program can either admit of a certain minimal testing level or be structurally reduced.
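The measure in question is McCabe's cyclomatic complexity, v(G) = E - N + 2P for a control-flow graph with E edges, N nodes, and P connected components; for a single structured routine it also equals the number of binary decision points plus one. A minimal sketch on a toy control-flow graph (the graph itself is invented for illustration):

```python
# Illustrative sketch: cyclomatic complexity computed two ways on a
# toy control-flow graph.
def cyclomatic(edges, num_nodes, components=1):
    """v(G) = E - N + 2P."""
    return len(edges) - num_nodes + 2 * components

# CFG of: if (a) {...} else {...}; while (b) {...}
# Nodes: 0 if-test, 1 then, 2 else, 3 while-test, 4 body, 5 exit.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 3), (3, 5)]
print(cyclomatic(edges, num_nodes=6))  # 7 - 6 + 2 = 3
# Equivalently: 2 decision nodes (the if and the while) + 1 = 3.
```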
Article
The project conceived in 1929 by Gardner Murphy and the writer aimed first to present a wide array of problems having to do with five major "attitude areas"--international relations, race relations, economic conflict, political conflict, and religion. The kind of questionnaire material falls into four classes: yes-no, multiple choice, propositions to be responded to by degrees of approval, and a series of brief newspaper narratives to be approved or disapproved in various degrees. The monograph aims to describe a technique rather than to give results. The appendix, covering ten pages, shows the method of constructing an attitude scale. A bibliography is also given.
Conference Paper
Software systems evolve over time due to changes in requirements, optimization of code, fixes for security and reliability bugs, etc. Code churn, which measures the changes made to a component over a period of time, quantifies the extent of this change. We present a technique for early prediction of system defect density using a set of relative code churn measures that relate the amount of churn to other variables such as component size and the temporal extent of churn. Using statistical regression models, we show that while absolute measures of code churn are poor predictors of defect density, our set of relative measures of code churn is highly predictive of defect density. A case study performed on Windows Server 2003 indicates the validity of the relative code churn measures as early indicators of system defect density. Furthermore, our code churn metric suite is able to discriminate between fault-prone and not fault-prone binaries with an accuracy of 89.0 percent.
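The distinction between absolute and relative churn is easy to make concrete: two components with identical churned line counts can look very different once churn is normalized by component size or by change count. The component data below is invented, and the study's actual relative measures are more numerous than these two ratios.

```python
# Illustrative sketch: absolute vs. relative code churn.
components = [
    # (name, churned LOC, total LOC, number of changes)
    ("parser.c", 1200, 3000, 40),
    ("util.c",   1200, 30000, 12),
]
for name, churned, total, changes in components:
    # Same absolute churn, very different relative churn.
    print(name,
          "| absolute churn:", churned,
          "| churn/LOC: %.2f" % (churned / total),
          "| churn per change: %.1f" % (churned / changes))
```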
Conference Paper
This paper describes an empirical study investigating whether programmers improve the readability of their source code if they have support from a source code editor that offers dynamic feedback on their identifier naming practices. An experiment, employing both students and professional software engineers, and requiring the maintenance and production of software, demonstrated a statistically significant improvement in source code readability over that of the control.
Conference Paper
For large software systems, the maintenance phase tends to have a comparatively much longer duration than all the previous life-cycle phases taken together, obviously resulting in much more effort. A good measure of software maintainability can help better manage the maintenance-phase effort. Software maintainability cannot be adequately measured by source code or documents alone; the readability and understandability of both source code and documentation should be considered to measure maintainability. This paper proposes an integrated measure of software maintainability. The paper also proposes a new representation for the rule base of fuzzy models, which requires less space for storage and is efficient in finding results in simulation. The proposed model measures software maintainability based on three important aspects of software: readability of source code (RSC), documentation quality (DOQ), and understandability of software (UOS). Keeping in view the nature of these parameters, a fuzzy approach has been used to integrate these three aspects. This integrated measurement of software maintainability, which to our knowledge is a first attempt to quantify integrated maintainability, is bound to be better than any single-parameter maintainability measurement approach. Thus the output of this model can advise software project managers in judging the maintenance effort of the software.
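As a toy illustration of how three [0, 1] quality scores might be fuzzily combined into one maintainability figure, the sketch below uses a trapezoidal membership function and a weighted aggregation. The membership shape, the weights, and the function names are all invented; the paper's actual rule-base representation is richer than this.

```python
# Illustrative sketch: fuzzily combining RSC, DOQ, and UOS scores.
def membership_high(x, lo=0.5, hi=0.8):
    """Degree to which a [0, 1] score counts as 'high' (trapezoidal)."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def maintainability(rsc, doq, uos, weights=(0.4, 0.2, 0.4)):
    """Weighted aggregation of the three fuzzified aspects."""
    degrees = [membership_high(v) for v in (rsc, doq, uos)]
    return sum(w * d for w, d in zip(weights, degrees))

print(round(maintainability(rsc=0.7, doq=0.9, uos=0.6), 3))  # 0.6
```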
Article
This paper describes a graph-theoretic complexity measure and illustrates how it can be used to manage and control program complexity. The paper first explains how the graph-theory concepts apply and gives an intuitive explanation of the graph concepts in programming terms. The control graphs of several actual Fortran programs are then presented to illustrate the correlation between intuitive complexity and the graph-theoretic complexity. Several properties of the graph-theoretic complexity are then proved which show, for example, that complexity is independent of physical size (adding or subtracting functional statements leaves complexity unchanged) and complexity depends only on the decision structure of a program.
Article
A 3×2 factorial experiment was performed to compare the effects of procedure format (none, internal, or external) with those of comments (absent or present) on the readability of a PL/1 program. The readability of six editions of the program, each having a different combination of these factors, was inferred from the accuracy with which students could answer questions about the program after reading it. Both extremes in readability occurred in the program editions having no procedures: without comments the procedureless program was the least readable and with comments it was the most readable
Article
Software's complexity and accelerated development schedules make avoiding defects difficult. We have found, however, that researchers have established objective and quantitative data, relationships, and predictive models that help software developers avoid predictable pitfalls and improve their ability to predict and control efficient software projects. The article presents 10 techniques that can help reduce the flaws in your code.
Article
Software is the key technology in applications as diverse as accounting, hospital management, aviation, and nuclear power. Application advances in different domains such as these-each with different requirements-have propelled software development from small batch programs to large, real-time programs with multimedia capabilities. To cope, software's enabling technologies have undergone tremendous improvement in hardware, communications, operating systems, compilers, databases, programming languages, and user interfaces, among others. In turn, those improvements have fueled even more advanced applications. Improvements in VLSI technology and multimedia, for example, have resulted in faster, more compact computers that significantly widened the range of software applications. Database and user interface enhancements, on the other hand, have spawned more interactive and collaborative development environments. Such changes have a ripple effect on software development processes as well as on software techniques and tools. In this article, we highlight software development's crucial methods and techniques of the past 30 years.
Article
Program understanding is an essential part of all software maintenance and enhancement activities. As currently practiced, program understanding consists mainly of code reading. The few automated understanding tools that are actually used in industry provide helpful but relatively shallow information, such as the line numbers on which variable names occur or the calling structure possible among system components. These tools rely on analyses driven by the nature of the programming language used. As such, they are adequate to answer questions concerning implementation details, so called what questions. They are severely limited, however, when trying to relate a system to its purpose or requirements, the why questions. Application programs solve real-world problems. The part of the world with which a particular application is concerned is that application's domain. A model of an application's domain can serve as a supplement to programming-language-based analysis methods and tools....
H. Sutter and A. Alexandrescu, C++ Coding Standards: 101 Rules, Guidelines, and Best Practices. Addison-Wesley Professional, 2004.
S. MacHaffie, R. McLeod, B. Roberts, P. Todd, and L. Anderson, "A readability metric for computer-generated mathematics," Saltire Software, http://www.saltire.com/equation.html, Tech. Rep., retrieved 2007.
S. Ambler, "Java Coding Standards," Software Development.
G. H. McLaughlin, "SMOG Grading: A New Readability Formula," Journal of Reading, 1969.