Conference Paper

Clone Detection Using Abstract Syntax Suffix Trees

Abstract and Figures

Reusing software through copying and pasting is a continuous plague in software development despite the fact that it creates serious maintenance problems. Various techniques have been proposed to find duplicated redundant code (also known as software clones). A recent study has compared these techniques and shown that token-based clone detection based on suffix trees is extremely fast but yields clone candidates that are often not syntactic units (26). Current techniques based on abstract syntax trees, on the other hand, find syntactic clones but are considerably less efficient. This paper describes how we can make use of suffix trees to find clones in abstract syntax trees. This new approach is able to find syntactic clones in linear time and space. The paper reports the results of several large case studies in which we empirically compare the new technique to other techniques using the Bellon benchmark for clone detectors.
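The idea in the abstract (serialize the syntax tree, search the serialized sequence for repeats, keep only repeats that are complete subtrees) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: it uses Python's own `ast` module as the parser, the names `serialize`, `syntactic_clones`, and `EXAMPLE` are hypothetical, and a sorted list of suffixes stands in for the suffix tree, so this sketch does not run in linear time.

```python
import ast


def serialize(tree):
    """Return the AST in preorder as (node_type, descendant_count) pairs."""
    out = []

    def visit(node):
        index = len(out)
        out.append(None)  # reserve this node's preorder slot
        for child in ast.iter_child_nodes(node):
            visit(child)
        out[index] = (type(node).__name__, len(out) - index - 1)

    visit(tree)
    return out


def common_prefix(seq, i, j):
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
        n += 1
    return n


def syntactic_clones(source, min_nodes=5):
    """Report (start_a, start_b, node_count) for repeated complete subtrees."""
    seq = serialize(ast.parse(source))
    # Assumption: sorted suffixes replace the suffix tree (simple, not linear time).
    order = sorted(range(len(seq)), key=lambda i: seq[i:])
    clones = set()
    for a, b in zip(order, order[1:]):
        span = seq[a][1] + 1  # number of nodes in the subtree rooted at a
        # Keep the repeat only if it covers the whole subtree (a syntactic unit).
        if span >= min_nodes and common_prefix(seq, a, b) >= span:
            clones.add((min(a, b), max(a, b), span))
    return sorted(clones)


# Hypothetical input: two functions that differ only in identifier names.
EXAMPLE = """
def f(x):
    y = x + 1
    return y * 2

def g(a):
    b = a + 1
    return b * 2
"""

for start_a, start_b, size in syntactic_clones(EXAMPLE):
    print(start_a, start_b, size)
```

Because only node types are compared, identifier names are ignored and no consistent-renaming check is performed; whether a real detector should add one is a separate design choice.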
Number of candidates. Figure 6 shows the number of candidates found by the various tools for SNNS. Here, the tool clones found 71% more clones than CCFinder (the tool with the highest number in the earlier experiment), while the number for cscope is comparable to the numbers of the other token-based tools. cpdetector yields a low number of candidates, and ccdiml is comparable to cscope. The token-based approaches find almost twice as many candidates as the AST-based tools. The rejected candidates are shown in Figure 7. The numbers are very high for clones and cscope: other token-based methods have reject rates of 50% and 54%, while clones and cscope have 90% and 82%. The reason is that clones and cscope do not check for consistent renaming of identifiers and literals. Rejects for ccdiml and cpdetector are comparable to those of CCFinder, Dup, and Duploc with respect to type-2 clones, because ccdiml and cpdetector do not check for consistent renaming either. Given that ccdiml and cpdetector have far fewer rejected type-1 clones, we conclude that they could perform better than token-based techniques if they checked for consistent renaming. Figure 8 contains the true negatives. The tools clones, cscope, and cpdetector find an average of 30%, 33%, and 26% of the references, respectively. ccdiml has the second best result (53%) after CCFinder (61%). There is no substantial difference in the overall percentage of true negatives for token-based versus AST-based approaches. In Figure 9, recall is shown. The average recall of clones, cscope, and cpdetector is 30%, 33%, and 26%, respectively. The recall of ccdiml, at 53%, is the second best by a considerable margin. The average recall of the token-based tools (clones, cscope, Dup, CCFinder) is higher (54%) than that of the AST-based tools (30%).
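The consistent-renaming check mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the check used by any of the tools or by the benchmark: the function name `consistently_renamed` is hypothetical, and the sketch assumes that "consistent" means a one-to-one mapping between the identifiers (or literals) of the two fragments, taken in order of occurrence.

```python
def consistently_renamed(names_a, names_b):
    """True if the names in names_a map one-to-one onto those in names_b."""
    if len(names_a) != len(names_b):
        return False
    forward, backward = {}, {}
    for x, y in zip(names_a, names_b):
        # setdefault records the first pairing and returns it on later visits.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


# x -> a and y -> b everywhere: consistent.
print(consistently_renamed(["x", "y", "x"], ["a", "b", "a"]))
# The third occurrence maps x to b although x was already mapped to a: inconsistent.
print(consistently_renamed(["x", "y", "x"], ["a", "b", "b"]))
```

A detector that skips such a check would accept both pairs as type-2 candidates, which is consistent with the higher reject rates described above.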
... Two similar code fragments form a "clone pair". If more than two code fragments are similar, they are called a "clone class" or "clone group" [5], [8]. Code clones can be classified into four types [8]: -Type 1: Identical or exact copy without modification after pasting; -Type 2: Nearly identical, syntactically identical copy, in which some elements of the code such as variable names, method names, or class names are changed without changing the syntax; -Type 3: Gapped: copied fragments are modified by adding or deleting some statements. -Type ...
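The difference between Type 1 and Type 2 can be made concrete with a small token-level sketch. It is only an illustration under assumptions: it uses Python's standard `tokenize` module on Python fragments, the names `tokens` and `clone_type` are hypothetical, comments and blank lines are ignored for Type 1 (a common convention that the excerpt does not state), and Type 3 and beyond are not handled.

```python
import io
import keyword
import tokenize

IGNORED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def tokens(source, normalize=False):
    """Token strings of a fragment; optionally abstract identifiers and literals."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in IGNORED:
            continue
        if tok.type in LAYOUT:
            out.append(tokenize.tok_name[tok.type])  # keep block structure only
        elif normalize and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif normalize and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out


def clone_type(a, b):
    """Return 1 or 2 for Type 1 / Type 2 clones, or None if neither applies."""
    if tokens(a) == tokens(b):
        return 1
    if tokens(a, normalize=True) == tokens(b, normalize=True):
        return 2
    return None


print(clone_type("x = y + 1\n", "x = y + 1  # copied\n"))
print(clone_type("x = y + 1\n", "a = b + 2\n"))
print(clone_type("x = y + 1\n", "x = y * 1\n"))
```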
... The AST operations to build all AST subtrees are relatively costly and slow. As a follow-up on AST, Koschke [8] proposed using suffix trees to find clones in ASTs. Their approach builds on the string-based algorithm of Ukkonen [10] and operates on the AST tokens. ...
Conference Paper
A code clone is a portion of source code that is similar or identical to another. Current code clone detection techniques detect, refactor, remove, and redirect clones without archiving them in a Discoverable Digital Clone Library (DDCL). This paper introduces the Clone Wrapper Detection Technique (CWDT), which detects commonly used structural clones, wraps them into a DDCL, and extracts metadata for each clone to induce a Family Tree Ontology of related class clones. To evaluate the usefulness of CWDT, we conducted preliminary experiments on large open source software, including the Java Development Kit (JDK), Apache, and JConnector projects. The preliminary results show a great number of structural, reusable, and sharable Type 1, Type 2, and Type 3 clones detected in large system software. The results of the experiments also show a significant reduction in clone detection time.
... Therefore, we need an indicator to evaluate the overall recovery effect of our work. We apply clone detection methods using abstract syntax suffix trees [45] to compare the similarity between the de-obfuscated PSCmds and their original content. Previous research has divided the clones into three types: ...
... • Type 3 clones are syntactically dissimilar copies with modifications of statements but are semantically similar. Since we leverage the methods in [45] to compare similarities by converting the AST of PSCmds into a semantic ...
Article
In recent years, PowerShell has become the common tool that helps attackers launch targeted attacks using living-off-the-land tactics and fileless attack techniques. Unfortunately, malware-derived PowerShell Commands (PSCmds) have typically been obfuscated to hide the malicious intent from detection and analysis. Also, malicious PSCmds’ expansive use of multiple obfuscation strategies and encryption methods makes them difficult to reveal. Despite the advances in malicious PSCmds detection incorporating new approaches such as machine learning and deep learning, there is still no consensus on the solution to de-obfuscating malicious PSCmds and profiling their behavior. To address this challenge, we propose a hybrid framework that combines deep learning and program analysis for automatic PowerShell De-obfuscation and behavioral Profiling (PowerDP) through multi-label classification in a static manner. First, we use character distribution features to forecast obfuscation types of malicious PSCmds. Second, we develop an extensive de-obfuscator utilizing static regular expression replacement to recover the original content of obfuscated PSCmds based on the predicted obfuscation types. Finally, we profile the behavior of PSCmds by features extracted from the abstract syntax tree of PSCmds after de-obfuscation. Our results show that PowerDP achieves a promising 99.82% accuracy and 0.18% hamming loss in obfuscation multi-label classification using deep learning. Furthermore, the successful recovery rate of the de-obfuscator against 15 obfuscation types is 98.11% on average with semantic similarity comparison, and the accuracy of the behavior multi-label classification for identifying 5 behaviors in malicious PSCmds averages 98.53%. The evaluation indicates that PowerDP is able to classify and profile complicated PSCmds.
... XML representations exist for Java, Prolog, and C++. Koschke et al. [17] describe a technique that finds syntactic clones in linear complexity. This technique is compared with other techniques using the Bellon benchmark. Suffix tree detection is applied to find the clone pairs. ...
Article
Clones are software code snippets that are similar to each other with small modifications. Clones typically make up 10-20 percent of the code in software. Many techniques have been developed for their detection. With code clone detection, the software developer gets an idea of which clones to remove or refactor. Code clones have both advantages and disadvantages in a particular software system. In this paper, we explore the types of code clones, their advantages and disadvantages, and the reasons for cloning. This paper also describes various techniques using several parameters. Lastly, we discuss gaps in the research.
... Koschke et al. [42] presented a practical CCD based on ASTs for arbitrary program fragments. Jiang et al. [43] proposed an accurate and scalable CCD approach utilizing an efficient similar subtree identification algorithm. ...
Article
Code duplication detection is the act of finding similar code in software development. It is important for software engineers to address the issues of code duplication detection. In this paper, a critical review of previous works on code duplication for code clone and plagiarism detection is performed. The review involves five main parts. Firstly, a systematic literature review is conducted to confirm the selected articles. Secondly, a critical review of different code duplication approaches is conducted based on three phases: processing, detection, and decision. Thirdly, a statistical analysis of the number of reviewed articles is performed to show the trends and hot topics of code duplication research. Moreover, a quantitative analysis of different code duplication approaches is presented to show the effectiveness of the different approaches. Fourthly, the advantages and disadvantages of different approaches and techniques are summarized and discussed. Finally, the conclusion of the review is summarized and future research directions for code duplication are described.
Article
Our contemporary society has never been more connected and aware of vital information in real time, through the use of innovative technologies. A considerable number of applications have transitioned into the cyber-physical domain, automating and optimizing their routines and processes via the dense network of sensing devices and the immense volumes of data they collect and instantly share. In this paper, we propose an innovative architecture based on the monitoring, analysis, planning, and execution (MAPE) paradigm for network and service performance optimization. Our study confirms distinct evidence that the utilization of learning algorithms, consuming datasets enriched with the users’ empirical opinions as input during the analysis and planning phases, contributes greatly to the optimization of video streaming quality, especially by handling different packet loss rates, paving the way for the achievable provision of a resilient communications platform for calamity assessment and management.
Article
In spite of significant research done in the past 3 decades introducing more than 250 clone detection tools and techniques for finding same-language clones, there exists no single framework to detect and classify all 4 basic types of clones with great accuracy (precision and recall). In this paper, we propose an accurate and language-agnostic technique to classify 4 types of clones. The method first generates an ANTLR parse tree for the input program file using freely available ANTLR grammar files, then finds the edit distance between the two parse trees using the Levenshtein distance algorithm and converts the edit distance into similarity using. We obtained 100% precision and recall in detecting type 1 & 2 clone types and achieve 98.50 and 98.12 respectively for type 3 and 4 clone types for our datasets containing microprograms of C, CPP, and Java. This paper provides evidence that the Levenshtein distance on ANTLR parse trees is a good choice for building a complete and accurate software clone detector and for acting as a proper validation tool to detect code plagiarism.
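As a rough illustration of the edit-distance step described above, the following sketch computes the Levenshtein distance between two sequences, for example serialized parse-tree node labels. The abstract does not state how the distance is converted into a similarity, so the normalization in `similarity` (one minus the distance divided by the longer length) is purely an assumption chosen for illustration, as are the function names and the sample node labels.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical sequences, 0.0 for disjoint ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical serialized node labels of two small parse trees.
tree_a = ["FunctionDef", "Assign", "BinOp", "Return"]
tree_b = ["FunctionDef", "Assign", "Call", "Return"]
print(levenshtein(tree_a, tree_b), similarity(tree_a, tree_b))
```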
Article
Code search is a common activity in software development, and code-to-code search is beneficial in a wide range of use-case scenarios. Code-to-code search uses a code fragment as the query for searching similar code fragments from large corpora. The results of a search can be applied to some software engineering tasks, such as search-based code recommendation, data-driven program repair, and software plagiarism detection. To be put into daily use, code-to-code search needs to find similar code fragments accurately and efficiently in a large dataset. Some search engines can locate exactly similar code, but are not able to search for syntactic clones. Therefore, we propose ASTENS-BWA, a novel approach for searching syntactically similar code regions between code fragments via a tree-based sequence alignment. Source code is transformed into a tree-based sequence that contains the structure information, and a sequence alignment algorithm is applied to find similar regions. We evaluate ASTENS-BWA on three different tasks; the results demonstrate that our approach can find syntactically similar regions in program code and retrieve similar code fragments fast and with high accuracy. As a code clone detection tool, ASTENS-BWA can report clone pairs with high recall, but it needs manual checking to reduce false alarms. ASTENS-BWA is scalable and can report cloned code fragments in seconds for a code corpus of a million lines of code.
Conference Paper
We describe the design and implementation of a program called sim to measure similarity between two C computer programs. It is useful for detecting plagiarism among a large set of homework programs. This software is part of a project to construct tools to assist the teaching of computer science.
Conference Paper
Programs often have a lot of duplicated code, which makes both understanding and maintenance more difficult. This problem can be alleviated by detecting duplicated code, extracting it into a separate new procedure, and replacing all the clones (the instances of the duplicated code) by calls to the new procedure. This paper describes the design and initial implementation of a tool that finds clones and displays them to the programmer. The novel aspect of our approach is the use of program dependence graphs (PDGs) and program slicing to find isomorphic PDG subgraphs that represent clones. The key benefits of this approach are that our tool can find non-contiguous clones (clones whose components do not occur as contiguous text in the program), clones in which matching statements have been reordered, and clones that are intertwined with each other. Furthermore, the clones that are found are likely to be meaningful computations, and thus good candidates for extraction.
Article
We present the R²D² redundancy detector. R²D² identifies redundant code fragments in large software systems written in Lisp. For each pair of code fragments, R²D² uses a combination of techniques, ranging from syntax-based analysis to semantics-based analysis, that detects positive and negative evidence regarding the redundancy of the analyzed code fragments. This evidence is combined according to a well-defined model, and sufficiently redundant fragments are reported to the user. R²D² explores several techniques and heuristics to operate within reasonable time and space bounds and is designed to be extensible.
Article
Code cloning, that is, the gratuitous duplication of source code within a software system, is an endemic problem in large, industrial systems (6, 5). While there has been much research into techniques for clone detection and analysis, there has been relatively little empirical study on characterizing how, where, and why clones occur in industrial software systems. Our current research is to perform an in-depth analysis of code cloning in real software systems and to build a taxonomy of types of code duplication.
Article
An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It always has the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries. Regardless of its quadratic worst case this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give, in a natural way, the well-known algorithms for constructing suffix automata (DAWGs).
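The simple quadratic suffix trie construction that the abstract refers to can be sketched as follows. This is not the linear-time suffix tree algorithm itself: it is a minimal illustration of left-to-right, symbol-by-symbol construction, assuming nested dictionaries as trie nodes and an explicit list of active nodes where the linear-time method would use suffix links. The names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(text):
    """Build a suffix trie on-line, one symbol at a time (quadratic size)."""
    root = {}
    # Nodes that end a suffix of the prefix scanned so far (root = empty suffix).
    active = [root]
    for symbol in text:
        # Extend every current suffix by the new symbol, then restart at the root.
        active = [node.setdefault(symbol, {}) for node in active] + [root]
    return root


def contains(trie, pattern):
    """True if pattern occurs as a substring of the text the trie was built from."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return False
        node = node[symbol]
    return True


trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))
```

After each loop iteration the trie already contains every suffix of the scanned prefix, which mirrors the on-line property described in the abstract.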
Article
We present randomized algorithms to solve the following string-matching problem and some of its generalizations: Given a string X of length n (the pattern) and a string Y (the text), find the first occurrence of X as a consecutive block within Y. The algorithms represent strings of length n by much shorter strings called fingerprints, and achieve their efficiency by manipulating fingerprints instead of longer strings. The algorithms require a constant number of storage locations, and essentially run in real time. They are conceptually simple and easy to implement. The method readily generalizes to higher-dimensional pattern-matching problems.
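A minimal sketch of fingerprint-based matching in this spirit is shown below. It is a simplified variant under assumptions, not the algorithm as published: the modulus is a fixed large prime rather than a randomly chosen one, and every fingerprint match is verified by a direct comparison so that the result is always correct. The function name `find_first` and its parameters are hypothetical.

```python
def find_first(pattern, text, base=256, modulus=(1 << 61) - 1):
    """Index of the first occurrence of pattern in text, or -1 if absent."""
    n, m = len(pattern), len(text)
    if n == 0:
        return 0
    if n > m:
        return -1
    high = pow(base, n - 1, modulus)  # weight of the leading symbol in a window
    fp_pattern = fp_window = 0
    for i in range(n):
        fp_pattern = (fp_pattern * base + ord(pattern[i])) % modulus
        fp_window = (fp_window * base + ord(text[i])) % modulus
    for start in range(m - n + 1):
        # Compare fingerprints first; verify to rule out a false match.
        if fp_window == fp_pattern and text[start:start + n] == pattern:
            return start
        if start + n < m:
            # Slide the window: drop text[start], append text[start + n].
            fp_window = ((fp_window - ord(text[start]) * high) * base
                         + ord(text[start + n])) % modulus
    return -1


print(find_first("ana", "banana"))
```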
Article
Sparse Dynamic Programming has emerged as an essential tool for the design of efficient algorithms for optimization problems coming from such diverse areas as computer science, computational biology, and speech recognition. We provide a new sparse dynamic programming technique that extends the Hunt–Szymanski paradigm for the computation of the longest common subsequence (LCS) and apply it to solve the LCS from Fragments problem: given a pair of strings X and Y (of length n and m, respectively) and a set M of matching substrings of X and Y, find the longest common subsequence based only on the symbol correspondences induced by the substrings. This problem arises in an application to analysis of software systems. Our algorithm solves the problem in O(|M| log |M|) time using balanced trees, or O(|M| log log min(|M|, nm/|M|)) time using Johnson's version of Flat Trees. These bounds apply for two cost measures. The algorithm can also be adapted to finding the usual LCS in O((m + n) log |Σ| + |M| log |M|) time using balanced trees or O((m + n) log |Σ| + |M| log log min (|M|, nm/|M|)) time using Johnson's version of Flat Trees, where M is the set of maximal matches between substrings of X and Y and Σ is the alphabet. These bounds improve on those of the original Hunt–Szymanski algorithm while retaining the overall approach.
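For orientation, the classic Hunt–Szymanski idea that the abstract builds on can be sketched as follows: only matching position pairs are processed, and a sorted list of thresholds is updated by binary search. This sketch computes the ordinary LCS length only; it is not the LCS-from-fragments algorithm of the paper and does not reproduce its bounds. The name `lcs_length` is hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    positions = defaultdict(list)
    for j, symbol in enumerate(y):
        positions[symbol].append(j)
    # tails[k] = smallest end position in y of a common subsequence of length k + 1
    tails = []
    for symbol in x:
        # Visit matches in decreasing order so one symbol of x is used at most once.
        for j in reversed(positions.get(symbol, ())):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)


print(lcs_length("ABCBDAB", "BDCABA"))
```

The running time depends on the number of matching pairs rather than on the full product of the two lengths, which is the sparsity that this family of algorithms exploits.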
Conference Paper
Previous research shows that most software systems contain significant amounts of duplicated, or cloned, code. Some clones are exact duplicates of each other, while others differ in small details only. We designate these almost-perfect clones as "near-miss" clones. While technically difficult, detection of near-miss clones has many benefits, both academic and practical. Finding these clones can give us better insight into the way developers maintain and reuse code, and we can also parameterize and remove near-miss clones to reduce overall source code size and decrease system complexity. This paper presents a simple, general and practical way to detect near-miss clones, and summarizes the results of its application to two production websites. We use standard lexical comparison tools coupled with language-specific extractors to locate potential clones. Our approach separates code comparisons from code understanding, and makes the comparisons language independent. This makes it easy to adapt to different programming languages.