
A graph-based code representation method to improve code readability classification

May 2023
Empirical Software Engineering 28(4)


DOI:10.1007/s10664-023-10319-6

Authors:

Qing Mi

Beijing University of Technology


Context Code readability is crucial for developers because it is closely tied to code maintenance and affects developers' work efficiency. Code readability classification assigns source code to one of several predefined readability levels. Many code readability classification models have been proposed in existing studies, including deep learning networks that achieve relatively high accuracy.

Objective In terms of representation, however, these methods do not effectively preserve the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method.

Method First, the source code is parsed into a graph that contains its abstract syntax tree (AST) combined with control and data flow edges, which preserves the semantic structural information. We then convert the source code and type information of each graph node into vectors. Finally, we train a graph neural network model composed of Graph Convolutional Network (GCN), DMoNPooling, and k-dimensional Graph Neural Network (k-GNN) layers to extract these features from the program graph.

Result We evaluate our approach on the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% accuracy in three-class classification and 88% in two-class classification.

Conclusion We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation is effective at extracting syntactic and semantic information from source code and ultimately improves code readability classification.
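The graph construction described in the Method paragraph can be illustrated with a minimal sketch. This is not the authors' implementation: the paper works on Java, whereas the sketch uses Python's standard `ast` module as a stand-in parser. The function name `build_program_graph`, the edge labels `ast`, `next_stmt`, and `data_flow`, and the flow heuristics are all assumptions made for illustration. Control flow is approximated by linking consecutive statements in a block, and data flow by linking the latest textual write of a variable to each later read.

```python
import ast
from collections import Counter


def build_program_graph(source):
    """Build a toy program graph: AST edges plus approximate flow edges.

    Returns (nodes, edges): nodes is a list of AST node type names indexed
    by node id; edges is a set of (src_id, dst_id, kind) triples.
    All names and edge kinds here are hypothetical, chosen for illustration.
    """
    tree = ast.parse(source)
    # Keep modules, statements and expressions only; operator and context
    # objects (ast.Add, ast.Load, ...) may be shared between AST nodes.
    kept = [n for n in ast.walk(tree)
            if isinstance(n, (ast.mod, ast.stmt, ast.expr))]
    ids = {id(n): i for i, n in enumerate(kept)}
    nodes = [type(n).__name__ for n in kept]
    edges = set()

    for parent in kept:
        # Syntax edges: parent -> child, as in the AST itself.
        for child in ast.iter_child_nodes(parent):
            if id(child) in ids:
                edges.add((ids[id(parent)], ids[id(child)], "ast"))
        # Crude control flow: consecutive statements in the same block.
        for field in ("body", "orelse", "finalbody"):
            block = getattr(parent, field, None)
            if isinstance(block, list):
                for first, second in zip(block, block[1:]):
                    edges.add((ids[id(first)], ids[id(second)], "next_stmt"))

    # Crude data flow: latest textual write of a name -> each later read.
    # Reads on a line are visited before writes on that line, so that
    # `x = x + 1` links the previous write of x to this read.
    names = [n for n in kept if isinstance(n, ast.Name)]
    names.sort(key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store),
                              n.col_offset))
    last_write = {}
    for name in names:
        if isinstance(name.ctx, ast.Store):
            last_write[name.id] = ids[id(name)]
        elif isinstance(name.ctx, ast.Load) and name.id in last_write:
            edges.add((last_write[name.id], ids[id(name)], "data_flow"))
    return nodes, edges


nodes, edges = build_program_graph("a = 1\nb = a + 2\nprint(b)\n")
print(len(nodes), Counter(kind for _, _, kind in edges))
```

In a pipeline like the one the abstract outlines, the node type names (and the source text of each node) would next be embedded as vectors and the typed edges passed to the graph neural network. Branches, loops, and scoping are ignored here, so the flow edges are only a rough approximation.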

Figures (available from Empirical Software Engineering):

Approach overview

AST-based program graph with additional control and data flow edges

Different methods of adding data flows

Adding control flows in different node types

Model architecture


Empirical Software Engineering (2023) 28:87


Qing Mi¹ · Yi Zhan¹ · Han Weng¹ · Qinghang Bao¹ · Longjie Cui¹ · Wei Ma¹

Accepted: 13 March 2023 / Published online: 23 May 2023


Keywords Code readability classification · Graph neural network · Code representation · Abstract syntax tree · Program comprehension

Communicated by: Simone Scalabrino, Rocco Oliveto, Felipe Ebert, Fernanda Madeiral, Fernando Castor

This article belongs to the Topical Collection: Code Legibility, Readability and Understandability.

✉ Wei Ma

mawei@bjut.edu.cn

¹ Faculty of Information Technology, Beijing University of Technology, Beijing, China


