Conference Paper

Systematic Mapping Study of Metrics based Clone Detection Techniques


Abstract

A code clone is a code fragment that is similar or identical to another code fragment in the source code. Code clones generally occur in large systems and affect system maintenance and quality. Removing clones is one way to avoid the problems that arise from their presence. Clone detection techniques based on software metrics are less complex to apply when searching for clones. This systematic mapping study focuses on metric-based clone detection techniques and the tools used in previous studies. The existing work is classified into categories and presented in systematic maps. The paper also highlights open problems in clone detection research.


... The literature review by [6] studied code cloning and various techniques to detect code clones. The SMS by [7] focuses on metric-based clone detection techniques and the tools used in previous studies. The literature review by [8] sheds light on all the types of clones and various techniques for their detection. ...
Conference Paper
Both novice and experienced developers rely more and more on external sources of code, copying and pasting code snippets into their programs. This behavior differs from the traditional software design approach, where cohesion was achieved through a conscious design effort. It is therefore essential to know how these copy-and-paste programming practices are actually carried out, so that IDEs and code recommenders can be designed to fit developer expectations and habits. Our objective is to identify the role of copy-and-paste programming, or code cloning, in current development practices. A Systematic Mapping Study (SMS) has been conducted, searching the main scientific databases. The search retrieved 1271 citations, and 39 articles were retained as primary studies. The primary studies were categorized into eight areas: general information on clone usage, developer behavior, techniques and tools for clone detection, techniques and tools for clone reuse, patterns of cloning, clone evolution, effects of code cloning on software maintenance and development, and tools for clone visualization. The areas of clone detection techniques and tools and of developer behavior are strongly represented in the sample. The least-studied areas in the literature found in the SMS are clone visualization tools and patterns of cloning.
... These techniques collect a number of metrics from source code. Metrics are calculated from names, layout, expressions and control flow functions [3]. Then these metrics are compared to find clones. ...
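The comparison step described in this snippet can be illustrated with a minimal, hypothetical sketch (the metric choices and tolerance are assumptions, not any surveyed tool's actual algorithm): each function is reduced to a vector of metric values, and pairs whose vectors agree within a tolerance are reported as clone candidates.

```python
def metric_distance(m1, m2):
    """Chebyshev distance between two metric vectors."""
    return max(abs(a - b) for a, b in zip(m1, m2))

def clone_candidates(functions, tolerance=0):
    """functions: dict mapping a function name to a metric vector,
    e.g. (lines_of_code, num_params, cyclomatic_complexity)."""
    names = sorted(functions)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if metric_distance(functions[a], functions[b]) <= tolerance:
                pairs.append((a, b))
    return pairs

# Invented metric vectors for three functions.
metrics = {
    "parse_v1": (42, 3, 7),
    "parse_v2": (42, 3, 7),   # identical metrics -> clone candidate
    "render":   (120, 5, 15),
}
print(clone_candidates(metrics))  # -> [('parse_v1', 'parse_v2')]
```

Comparing small fixed-size vectors instead of whole code fragments is what keeps metric-based detection cheap, at the cost of possible false positives when different code happens to share the same metric values.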
... For the classification of clone types to be carried out properly, detection techniques have been developed, in which the way the analyzed code fragment is represented is the main difference between the most commonly used techniques [11][5]: ...
Conference Paper
Code clones are source code parts that are identical to, or have some degree of similarity with, another part of the code. Cloning arises for a variety of reasons, including copy-and-paste and the ad-hoc reuse of code by programmers. Detection of information system clones aims to support propagating changes to all clones during development, maintenance and evolution, preserving data consistency, correcting errors, and so on. Clones can be classified as types 1, 2, 3 and 4, depending on their similarity and characteristics. Several techniques and tools have been created to detect code clones; to do so, they represent the source code as text, tokens, trees, graphs, metrics, or hybrids of these. This systematic mapping work answers four research questions that aim to identify, count and catalog data from a set of 875 articles, of which 128 were selected as relevant. In all, 52 clone detection tools were identified, reinforcing the currency of the topic; 26 ways of representing source code for clone detection, among which the commonly used ones stand out for ease of understanding and handling; and 13 programming languages across 6 paradigms. The mapping highlights the strong presence of clone detection in object-oriented information systems, covering all 4 types of clones as well as semantic and syntactic clones, which reinforces this research line's division of clones into four types.
Chapter
In this paper, we propose a collaborative method that analyzes both code structure and function semantics for code comparison. First, we create the function call graph of the code and use it to obtain structure semantics with a graph auto-encoder. Then the function semantics are obtained from the names and definitions of the library functions and built-in classes used in the code. Finally, we integrate the structure and function semantics to collaboratively analyze the similarity of code fragments. We adopt several real code datasets to validate our method, and the experimental results show that it outperforms other baselines. The ablation experiments show that the function call structure contributes the most to the performance. We also visualize the semantics of function structures to illustrate that the proposed method can extract the correlations and differences between code fragments. Keywords: Code Structure, Function Semantics, Self-Encoder
Chapter
The presence of bad smells in code hampers software’s maintainability, comprehensibility, and extensibility. A type of code smell, which is common in software projects, is “duplicated code” bad smell, also known as code clones. These types of smells generally arise in a software system due to the copy-paste-modify actions of software developers. They can either be exact copies or copies with certain modifications. Different clone detection techniques exist, which can be broadly classified as text-based, token-based, abstract syntax tree-based (AST-based), metrics-based, or program dependence graph-based (PDG-based) approaches based on the amount of preprocessing required on the input source code. Researchers have also built clone detection techniques using a hybrid of two or more approaches described above. In this paper, we did a narrative review of the metrics-based techniques (solo or hybrid) reported in the previously published studies and analyzed them for their quality in terms of run-time efficiency, accuracy values, and the types of clones they detect. This study can be helpful for practitioners to select an appropriate set of metrics, measuring all the code characteristics required for clone detection in a particular scenario. Keywords: Clone detection, Metrics-based techniques, Hybrid clone detection techniques, Categorization, Qualitative analysis
Article
Code cloning (CC) is the process of copying and reconfiguring a code fragment and using it in another part of a software project. These clones increase the running overhead of the software. As a result, Code Clone Detection (CCD) has become an active research area in software development research. The detection of Large-Variance Code Clones (LV-CCs) is very difficult when the number of lines of code (LOCs) in the source code is very large. Distance metrics have been used in LV-CC detection by calculating the distance between training feature sets and testing feature sets of source code. However, threshold selection for detecting clones is a challenging issue in distance-based LV-CC detection. To solve this, a Collaborative CCD using Deep Learning (CCCD-DL) is developed in this paper, utilising lexical, syntactic, semantic and structural features to identify all types of clones together. Lexical features are extracted from Clone Pairs (CPs) identified by LV-Mapper. Syntactic and semantic features are identified via the Abstract Syntax Tree (AST) and Control Flow Graph (CFG). Structural features are extracted via code size metrics (CZMs) and object-oriented metrics (OOMs). All features are combined and fed into the input layer of a DNN. In the multi-classification stage, the hidden layers transform the inputs through linear transformations followed by non-linear activations, producing a model with a weight matrix fitted to the training sequence; this completes the feed-forward step. The model then uses back-propagation to adjust the weight matrix based on the training set. Finally, a softmax layer turns clone detection into a classification task.
The experimental results show that the proposed method solves distance-based problems more quickly and effectively than traditional CCD methods.
Preprint
The application of code clone technology accelerates code search, improves code reuse efficiency, and assists in software quality assessment and code vulnerability detection. However, the application of code clones also introduces software quality issues and increases the cost of software maintenance. As an important research field in software engineering, code cloning has been extensively explored, and related studies on various sub-research fields have emerged, including code clone detection, code clone evolution, and code clone analysis. However, a comprehensive exploration of the entire field of code clones, together with an analysis of the trends in each sub-research field, is lacking. This paper collects related work on code clones from the past ten years. In summary, the contributions of this paper are to: (1) summarize and classify the sub-research fields of code clones, and explore the relative popularity and relations of these sub-research fields; (2) analyze the overall research trend of code clones and of each sub-research field; (3) compare and analyze the differences between academia and industry regarding code clone research; (4) construct a network of researchers, and identify the major contributors to the code clone research field; (5) statistically analyze the popular conferences and journals. Popular future research directions include clone visualization and clone management. For clone detection techniques, researchers can optimize the scalability and execution efficiency of methods, target particular clone detection tasks and contextual environments, or apply the technology to other related research fields.
Article
Full-text available
Refactoring and smells have been well researched by the software-engineering research community these past decades. Several secondary studies have been published on code smells, discussing their implications on software quality, their impact on maintenance and evolution, and existing tools for their detection. Other secondary studies addressed refactoring, discussing refactoring techniques, opportunities for refactoring, impact on quality, and tool support. In this paper, we present a tertiary systematic literature review of previous surveys, secondary systematic literature reviews, and systematic mappings. We identify the main observations (what we know) and challenges (what we do not know) on code smells and refactoring. We perform this tertiary review using eight scientific databases, based on a set of five research questions, identifying 40 secondary studies between 1992 and 2018. We organize the main observations and challenges about code smells and their refactoring into: smell definitions, most common code-smell detection approaches, code-smell detection tools, most common refactorings, and refactoring tools. We show that code smells and refactoring have a strong relationship with quality attributes, i.e., with understandability, maintainability, testability, complexity, functionality, and reusability. We argue that code smells and refactoring could be considered as two faces of the same coin. Moreover, we find that refactoring affects quality attributes more than code smells do. We also discuss the implications of this work for practitioners, researchers, and instructors. We identify 13 open issues that could guide future research work, and we highlight the gap between code smells and refactoring in the current state of software-engineering research. We hope this work helps the software-engineering research community to collaborate on future work on code smells and refactoring.
Article
Full-text available
While finding clones in source code has drawn considerable attention, there has been very little work on finding similar fragments in binary code and intermediate languages such as Java bytecode. Some recent studies showed that it is possible to find distinct sets of clone pairs in the bytecode representation of source code, which are not always detectable at the source code level. In this paper, we present a bytecode clone detection approach, called SeByte, which exploits the benefits of compilers (the bytecode representation) for detecting a specific type of semantic clone in Java bytecode. SeByte is a hybrid metric-based approach that takes advantage of both Semantic Web technologies and set theory. We use a two-step analysis process: (1) pattern matching via Semantic Web querying and reasoning, and (2) content matching, using the Jaccard coefficient for set similarity measurement. Semantic Web-based pattern matching helps us find method blocks that share similar patterns even in cases of extreme dissimilarity (e.g., numerous repetitions or large gaps). Although it leads to high recall, it also gives a high false positive rate. We therefore use content matching (via Jaccard) to reduce the false positive rate by focusing on semantic resemblance of content. Our evaluation on four Java systems against five other tools shows that SeByte can detect a large number of semantic clones that are either not detected or not supported by source-code-based clone detectors.
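SeByte's content-matching step relies on the Jaccard coefficient, which is simple to state directly. The sketch below is illustrative only; the bytecode token sets are made up and do not come from the paper:

```python
def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are conventionally identical
    return len(a & b) / len(a | b)

# Hypothetical bytecode instruction sets from two method blocks.
left  = {"iload", "iadd", "istore", "return"}
right = {"iload", "imul", "istore", "return"}
print(jaccard(left, right))  # 3 shared of 5 total -> 0.6
```

Because Jaccard compares sets rather than sequences, it tolerates reordering and repetition, which matches the paper's motivation of handling gaps and repeated code.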
Article
Full-text available
It is generally said that code clones are one of the factors that make software maintenance difficult. They are pairs or sets of code portions in source files that are identical or similar to each other. If a bug is found in a code portion, it is necessary to correct it in all of its clones. However, in large-scale software, it is very difficult to find all clones by hand. In this paper, we develop a code clone analysis environment, Gemini, that uses the clone detection tool CCFinder. We also conduct a case study for evaluation.
Article
Full-text available
Many techniques for detecting duplicated source code (software clones) have been proposed in the past. However, it is not yet clear how these techniques compare in terms of recall and precision as well as space and time requirements. This paper presents an experiment that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC). Their clone candidates were evaluated by one of the authors as independent third party. The selected techniques cover the whole spectrum of the state-of-the-art in clone detection. The techniques work on text, lexical and syntactic information, software metrics, and program dependency graphs.
Article
Full-text available
BACKGROUND: A software engineering systematic map is a defined method to build a classification scheme and structure a software engineering field of interest. The analysis of results focuses on frequencies of publications for categories within the scheme. Thereby, the coverage of the research field can be determined. Different facets of the scheme can also be combined to answer more specific research questions. OBJECTIVE: We describe how to conduct a systematic mapping study in software engineering and provide guidelines. We also compare systematic maps and systematic reviews to clarify how to choose between them. This comparison leads to a set of guidelines for systematic maps. METHOD: We have defined a systematic mapping process and applied it to complete a systematic mapping study. Furthermore, we compare systematic maps with systematic reviews by systematically analyzing existing systematic reviews. RESULTS: We describe a process for software engineering systematic mapping studies and compare it to systematic reviews. Based on this, guidelines for doing systematic maps are defined. CONCLUSIONS: Systematic maps and reviews differ in terms of goals, breadth, validity issues and implications. Thus, they should be used complementarily and require different methods (e.g., for analysis).
Conference Paper
Full-text available
Similarity is an important concept in information theory. A challenging question is how to measure the amount of shared information between two systems. A large number of metrics have been proposed and used to measure similarity between two computer programs or two portions of the same program. In this paper, we present an approach for assessing which metrics are most useful for similarity prediction in the context of clone detection. The presented approach uses clustering to identify clone candidates. In the experiments conducted, we applied sequential clustering using all possible permutations of a subset of the metrics used in the metric-based clone detection literature. Precision and recall are calculated in every experiment. Experimental results show that the order of the metrics used affects the results dramatically, which indicates that the metrics are of variable relevance.
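A minimal sketch of one-pass (sequential) clustering on a single metric may clarify the idea; the metric values and threshold below are invented, and real metric-based detectors apply several metrics, one after another, which is why metric order matters:

```python
def sequential_clusters(items, key, threshold):
    """One-pass clustering: an item joins the current cluster while its
    metric stays within `threshold` of the cluster seed's value;
    otherwise it starts a new cluster."""
    ordered = sorted(items, key=key)
    clusters, seed = [], None
    for it in ordered:
        if seed is None or abs(key(it) - key(seed)) > threshold:
            clusters.append([it])  # open a new cluster seeded by `it`
            seed = it
        else:
            clusters[-1].append(it)
    return clusters

# Hypothetical lines-of-code metric per function.
loc = {"f1": 10, "f2": 11, "f3": 40, "f4": 41}
clusters = sequential_clusters(list(loc), key=lambda f: loc[f], threshold=2)
print(clusters)  # -> [['f1', 'f2'], ['f3', 'f4']]
```

Each resulting cluster groups functions that are close in the metric space and thus serves as a pool of clone candidates, against which precision and recall can be computed.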
Conference Paper
Full-text available
Code clone detection tools may report a large number of code clones, while software developers are interested only in the subset of code clones relevant to software development tasks such as refactoring. Our research group has supported many software developers with the code clone detection tool CCFinder and its GUI front-end Gemini. Gemini shows clone sets (i.e., sets of code clones identical or similar to each other) with several clone metrics, including their length and the number of code clones; however, it is not clear how to use those metrics to extract code clones of interest to developers. In this paper, we propose a method that combines clone metrics to extract code clones for refactoring activity. We conducted an empirical study on a web application developed by a Japanese software company. The result indicates that combinations of simple clone metrics are more effective for extracting refactoring candidates from detected code clones than individual clone metrics.
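The idea of combining clone metrics can be sketched as a conjunction of thresholds. The metric names (`len`, `pop`) and the threshold values below are illustrative assumptions, not Gemini's actual configuration:

```python
def refactoring_candidates(clone_sets, min_len=20, min_pop=3):
    """Keep a clone set only when SEVERAL metrics pass their thresholds
    together: 'len' = tokens per clone, 'pop' = number of clone
    instances in the set (both names are hypothetical)."""
    return [cs for cs in clone_sets
            if cs["len"] >= min_len and cs["pop"] >= min_pop]

sets_ = [
    {"id": 1, "len": 50, "pop": 4},   # long and widespread -> candidate
    {"id": 2, "len": 50, "pop": 2},   # long but rare       -> filtered out
    {"id": 3, "len": 5,  "pop": 10},  # frequent but tiny   -> filtered out
]
print([cs["id"] for cs in refactoring_candidates(sets_)])  # -> [1]
```

Filtering on a single metric would admit sets 2 or 3 as well, which mirrors the paper's finding that metric combinations are more selective than individual metrics.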
Conference Paper
Full-text available
In this paper, we present a replicated study predicting fault-prone modules with code clone metrics, following Baba's experiment. We empirically evaluated the performance of fault prediction models with clone metrics using three datasets from the Eclipse project and compared them to fault prediction without clone metrics. Contrary to the original experiment, we could not confirm a significant effect of clone metrics: the F1-measure of fault prediction was not improved by adding clone metrics to the prediction model. To explain this result, we analyzed the relationship between clone metrics and fault density. The result suggested that clone metrics are effective in fault prediction for large modules but not for small modules.
Conference Paper
Full-text available
A relevant consequence of the expansion of the web and e-commerce is the growing demand for new web sites and web applications. As a result, web sites and applications are usually developed without a formalized process: web pages are coded directly and incrementally, with new pages obtained by duplicating existing ones. Duplicated web pages, having the same structure and differing only in the data they include, can be considered clones. Identifying clones may reduce the effort devoted to testing, maintaining and evolving web sites and applications. Moreover, clone detection across different web sites can reveal possible plagiarism. In this paper we propose an approach, based on similarity metrics, to detect duplicated pages in web sites and applications implemented with the HTML language and ASP technology. The proposed approach has been assessed by analyzing several web sites and web applications. The results obtained are reported in the paper with respect to some case studies.
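One plausible way to realize such a structure-based similarity metric (a sketch under our own assumptions, not the paper's actual algorithm) is to reduce each page to its sequence of tag names, discarding the embedded data, and compare the sequences:

```python
import difflib
import re

def tag_sequence(html):
    """Sequence of opening-tag names; text content and closing tags are
    ignored, so pages with the same layout but different data match."""
    return re.findall(r"<\s*(\w+)", html)

def structural_similarity(page_a, page_b):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    return difflib.SequenceMatcher(None, tag_sequence(page_a),
                                   tag_sequence(page_b)).ratio()

# Two hypothetical pages: same structure, different data.
a = "<html><body><table><tr><td>Alice</td></tr></table></body></html>"
b = "<html><body><table><tr><td>Bob</td></tr></table></body></html>"
print(structural_similarity(a, b))  # identical structure -> 1.0
```

A regex-based tag scan is only a rough approximation of HTML parsing, but it is enough to illustrate how structurally identical pages with different content score as clones.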
Conference Paper
Full-text available
This paper shows that existing software metric tools interpret and implement the definitions of object-oriented software metrics differently. This delivers tool-dependent metrics results and even has implications for the results of analyses based on these metrics. In short, the metrics-based assessment of a software system and the measures taken to improve its design differ considerably from tool to tool. To support our case, we conducted an experiment with a number of commercial and free metrics tools. We calculated metrics values using the same set of standard metrics for three software systems of different sizes. Measurements show that, for the same software system and metrics, the metrics values are tool dependent. We also defined a (simple) software quality model for "maintainability" based on the metrics selected. It defines a ranking of the classes that are most critical with respect to maintainability. Measurements show that even the ranking of classes in a software system is metrics-tool dependent.
Article
Full-text available
SUMMARY A code clone is a code fragment that has other code fragments identical or similar to it in the source code. The presence of code clones is generally regarded as one factor that makes software maintenance more difficult. For example, if a code fragment with code clones is modified, it is necessary to consider whether each of the other code clones has to be modified as well. Removing code clones is one way of avoiding problems that arise due to the presence of code clones. This makes the source code more maintainable and more comprehensible. This paper proposes a set of metrics that suggest how code clones can be refactored. As well, the tool Aries, which automatically computes these metrics, is presented. The tool gives metrics that are indicators for certain refactoring methods rather than suggesting the refactoring methods themselves. The tool performs only lightweight source code analysis; hence, it can be applied to a large number of code lines. This paper also describes a case study that illustrates how this tool can be used. Based on the results of this case study, it can be concluded that this method can efficiently merge code clones. Copyright © 2008 John Wiley & Sons, Ltd.
Article
Full-text available
In this paper, we explain our refactoring support tool Aries. Aries characterizes code clones by several metrics, and suggests how to remove them.
Article
Full-text available
A legacy system is an operational, large-scale software system that is maintained beyond its first generation of programmers. It typically represents a massive economic investment and is critical to the mission of the organization it serves. As such systems age, they become increasingly complex and brittle, and hence harder to maintain. They also become even more critical to the survival of their organization because the business rules encoded within the system are seldom documented elsewhere. Our research is concerned with developing a suite of tools to aid the maintainers of legacy systems in recovering the knowledge embodied within the system. The activities, known collectively as “program understanding”, are essential preludes for several key processes, including maintenance and design recovery for reengineering. In this paper we present three pattern-matching techniques: source code metrics, a dynamic programming algorithm for finding the best alignment between two code fragments, and a statistical matching algorithm between abstract code descriptions represented in an abstract language and actual source code. The methods are applied to detect instances of code cloning in several moderately-sized production systems including tcsh, bash, and CLIPS. The programmer's skill and experience are essential elements of our approach. Selection of particular tools and analysis methods depends on the needs of the particular task to be accomplished. Integration of the tools provides opportunities for synergy, allowing the programmer to select the most appropriate tool for a given task.
Conference Paper
Full-text available
Many Web applications use a mixture of HTML and scripting language code as the front-end to business services. Analogously to traditional applications, redundant code is introduced by copy-and-paste practices. Code duplication is a pathological form of software reuse because of its effects on the maintenance of large software systems. This paper describes how a simple semi-automated approach can be used to identity cloned functions within scripting code of Web applications. The results obtained from applying our approach to three Web applications show that the approach is useful for a fast selection of script function clones, and can be applied to prevent clone spreading or to remove redundant scripting code.
Conference Paper
Full-text available
The actual effort to evolve and maintain a software system is likely to vary depending on the amount of clones (i.e., duplicated or slightly different code fragments) present in the system. This paper presents a method for monitoring and predicting clone evolution across subsequent versions of a software system. Clones are first identified using a metric-based approach; they are then modeled as time series to identify a predictive model. The proposed method has been validated with an experimental activity performed on 27 subsequent versions of mSQL, a medium-size software system written in C. The time span of the analyzed mSQL releases covers four years, from May 1995 (mSQL 1.0.6) to May 1999 (mSQL 2.0.10). For any given software release, the identified model was able to predict the clone percentage of the subsequent release with an average error below 4%. A higher prediction error was observed only in correspondence with major system redesigns.
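The prediction step can be illustrated with a deliberately simple stand-in for the paper's time-series model: an ordinary least-squares trend line fitted to the clone percentage of past releases, extrapolated one release ahead. The history values below are made up:

```python
def fit_line(ys):
    """Ordinary least squares y = a*x + b over x = 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical clone percentage per release of some system.
history = [12.0, 12.5, 13.1, 13.4, 14.0]
a, b = fit_line(history)
predicted_next = a * len(history) + b  # forecast for the next release
print(round(predicted_next, 2))        # -> 14.47
```

Real time-series models (e.g., autoregressive ones) would also weight recent releases and model noise, but even this linear sketch shows how a per-release prediction error can be measured against the actually observed clone percentage.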
Conference Paper
Most software systems contain a number of code clones. Although cloning makes software development easier, it may also cause several maintenance- and cost-related problems. A number of clone detection techniques have been proposed so far. In this paper, an approach for selecting a set of appropriate metrics from a large list of metrics is presented. The proposed approach evaluates a set of independent metrics on the basis of their precision and recall in clone detection, starting from all combinations of one metric and gradually increasing the number of metrics in the combinations until the complete set of metrics is evaluated. As an example, the result of applying the proposed approach to a C-language software system is provided.
Article
Context: Detection of an unauthorized use of a software library is a clone detection problem that, in the case of commercial products, has additional complexity due to the fact that only binary code is available. Objective: The goal of this paper is to propose an approach for estimating the level of similarity between procedures originating from different binary codes. The assumption is that the clones in the binary codes come from the use of a common software library that may be compiled with different toolsets. Method: The approach uses a set of software metrics adapted from high-level languages and extends the set with new metrics that take into account syntactical changes introduced by the use of different toolsets and optimizations. Moreover, the approach compares metric values and introduces transformers and formulas that can use training data to produce measures of similarity between two procedures in binary codes. The approach has been evaluated on programs from the STAMP benchmark and the BusyBox tool, compiled with different toolsets in different modes. Results: The experiments with programs from the STAMP benchmark show that recall in detecting the same procedures can be up to 1.44 times higher using the new metrics. Knowledge about the compiling toolset used can bring up to a 2.28-fold improvement in recall. The experiment with the BusyBox tool shows 43% recall at 43% precision. Conclusion: The most useful newly proposed metrics are those that consider the frequency of arithmetic instructions, the number and frequency of occurrences of instructions, and the number of occurrences of target addresses in calls. The best way to combine the results of comparing metrics is to use a geometric mean or, when previous knowledge is available, an arithmetic mean with an appropriate transformer.
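The geometric-mean combination of per-metric comparisons might look like the sketch below; the metric names (`arith`, `calls`) are made up for illustration, and the paper's transformers and trained weights are omitted:

```python
import math

def metric_similarity(a, b):
    """Per-metric similarity in [0, 1]: 1 when equal, shrinking with
    the relative difference between the two (non-negative) counts."""
    if a == b:
        return 1.0
    return min(a, b) / max(a, b) if min(a, b) > 0 else 0.0

def procedure_similarity(m1, m2):
    """Combine per-metric similarities with a geometric mean, the
    combination reported most useful when no training data is available."""
    sims = [metric_similarity(m1[k], m2[k]) for k in m1]
    return math.prod(sims) ** (1 / len(sims)) if sims else 0.0
```

A single strongly divergent metric drags the geometric mean down much harder than an arithmetic mean would, which is one reason it works as a conservative default.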
Conference Paper
The detection of function clones in software systems is valuable for the code adaptation and error checking maintenance activities. This paper presents an efficient metrics-based data mining clone detection approach. First, metrics are collected for all functions in the software system. A data mining algorithm, fractal clustering, is then utilized to partition the software system into a relatively small number of clusters. Each of the resulting clusters encapsulates functions that are within a specific proximity of each other in the metrics space. Finally, clone classes, rather than pairs, are easily extracted from the resulting clusters. For large software systems, the approach is very space efficient and linear in the size of the data set. Evaluation is performed using medium and large open source software systems. In this evaluation, the effect of the chosen metrics on the detection precision is investigated.
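The cluster-then-extract pipeline can be illustrated with a simple grid clustering as a stand-in for the fractal clustering the paper uses (the metric vectors here are hypothetical; the real approach clusters functions in a richer metrics space):

```python
from collections import defaultdict

def cluster_by_metrics(metric_vectors, cell=5.0):
    """Grid clustering as a simple stand-in for fractal clustering:
    functions whose metric vectors fall in the same cell of a coarse
    grid end up in the same cluster."""
    clusters = defaultdict(list)
    for name, vec in metric_vectors.items():
        key = tuple(int(v // cell) for v in vec)
        clusters[key].append(name)
    return list(clusters.values())

def clone_classes(metric_vectors, clusters):
    """Within each cluster, group functions with identical metric
    vectors into clone classes (whole classes, not pairs)."""
    classes = []
    for cluster in clusters:
        groups = defaultdict(list)
        for name in cluster:
            groups[tuple(metric_vectors[name])].append(name)
        classes.extend(g for g in groups.values() if len(g) > 1)
    return classes
```

Because candidates are only compared within a cluster, the expensive pairwise work is confined to small neighborhoods of the metrics space, which is what makes the approach scale to large systems.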
Article
Context: Reusing software by means of copy and paste is a frequent activity in software development. The duplicated code is known as a software clone and the activity as code cloning. Software clones may lead to bug propagation and serious maintenance problems. Objective: This study reports an extensive systematic literature review of software clones in general and software clone detection in particular. Method: We used the standard systematic literature review method on a comprehensive set of 213 articles, drawn from a total of 2039 articles published in 11 leading journals and 37 premier conferences and workshops. Results: Existing literature about software clones is classified broadly into different categories. The importance of semantic clone detection and model-based clone detection led to separate classifications. Empirical evaluation of clone detection tools/techniques is presented. Clone management, its benefits, and its cross-cutting nature are reported. The number of studies pertaining to each of nine different clone types is reported, along with thirteen intermediate representations and 24 match detection techniques. Conclusion: We call for an increased awareness of the potential benefits of software clone management, and identify the need to develop semantic and model clone detection techniques. Recommendations are given for future research.
Article
A metric space is a set together with a definition of distance between its elements. This paper introduces metric spaces into code clone detection and uses the distance within a metric space to measure the similarity of code. It proposes a process for building up a metric space to detect clones in a software system. Based on a metric space derived from software metrics, clone detection gains convenience and flexibility. We also exercise the approach on a real industrial project.
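The idea of measuring code similarity as distance in a metric space can be sketched with a toy three-metric extractor; real systems would compute far richer software metrics, and the fragments below are hypothetical:

```python
import math

def metric_vector(code):
    """Toy metric extractor: non-blank lines, token count, branch count."""
    lines = [l for l in code.splitlines() if l.strip()]
    tokens = code.split()
    branches = sum(tokens.count(k) for k in ("if", "for", "while"))
    return (len(lines), len(tokens), branches)

def distance(a, b):
    """Euclidean distance between two metric vectors; a small distance
    marks the underlying fragments as clone candidates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Two fragments that differ only in identifier names map to the same metric vector, so their distance is zero, which is exactly the Type-2 behavior a metrics-based detector exploits.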
Article
Copying a code fragment and reusing it by pasting, with or without minor modifications, is a common practice in software development environments. Various techniques have been proposed to find duplicated, redundant code. Previous work offered simple and practical methods for detecting exact and near-miss clones over arbitrary program fragments in source code by using abstract syntax trees. Our proposal is a new technique for finding similar code blocks and for quantifying their similarity. Our technique can be used to find clone clusters, sets of code blocks all within a user-supplied similarity threshold. It detects Type-1 and Type-2 clones using metrics.
Conference Paper
Clone detection has evolved considerably over the last decade, leading to approaches with better results but increasing complexity. Most existing approaches are limited to finding program fragments similar in their syntax or semantics, while the fraction of candidates that are actually clones and the fraction of actual clones identified as candidates remain, on average, similar. In this paper, a metric-based approach combined with textual comparison of the source code is proposed for the detection of functional clones in C source code. Various metrics were formulated and their values utilized during the detection process. Compared to other approaches, this method is among the least complex and aims to provide a more accurate and efficient way of clone detection. The results obtained were compared with two other existing tools on the open source project Weltab.
Article
Over the last decade many techniques and tools for software clone detection have been proposed. In this paper, we provide a qualitative comparison and evaluation of the current state-of-the-art in clone detection techniques and tools, and organize the large amount of information into a coherent conceptual framework. We begin with background concepts, a generic clone detection process and an overall taxonomy of current techniques and tools. We then classify, compare and evaluate the techniques and tools in two different dimensions. First, we classify and compare approaches based on a number of facets, each of which has a set of (possibly overlapping) attributes. Second, we qualitatively evaluate the classified techniques and tools with respect to a taxonomy of editing scenarios designed to model the creation of Type-1, Type-2, Type-3 and Type-4 clones. Finally, we provide examples of how one might use the results of this study to choose the most appropriate clone detection tool or technique in the context of a particular set of goals and constraints. The primary contributions of this paper are: (1) a schema for classifying clone detection techniques and tools and a classification of current clone detectors based on this schema, and (2) a taxonomy of editing scenarios that produce different clone types and a qualitative evaluation of current clone detectors based on this taxonomy.
Conference Paper
Graphics Processing Units (GPUs) have been around for a while. Although they are primarily used for high-end 3D graphics processing, their use is now acknowledged for general massively parallel computing. This paper presents an original technique based on [10] to compute many instances of the longest common subsequence problem on a generic GPU architecture using classic DP-matching [7]. This algorithm has been found useful for filtering false positives produced by metrics-based clone detection methods. Experimental results of this application are presented, along with a discussion of the possibilities of using GPUs for other cloning-related problems.
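The classic dynamic-programming LCS underlying the GPU technique, and its use as a false-positive filter, can be sketched sequentially as below (a GPU version fills each anti-diagonal of the table in parallel; the similarity threshold is an assumption for illustration):

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming LCS. On a GPU,
    the cells of each anti-diagonal are independent and can be
    computed in parallel."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_similarity(a, b):
    """Filter for metric-reported clone pairs: keep only pairs whose
    token sequences share a sufficiently long common subsequence."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

Metric vectors can coincide for structurally unrelated fragments; re-checking each reported pair with `lcs_similarity` above some cutoff (say 0.7) discards such false positives.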
Conference Paper
This paper presents a technique to automatically identify duplicate and near-duplicate functions in a large software system. The identification technique is based on metrics extracted from the source code using the tool Datrix™. This clone identification technique uses 21 function metrics grouped into four points of comparison. Each point of comparison is used to compare functions and determine their cloning level. An ordinal scale of eight cloning levels is defined, ranging from an exact copy to distinct functions. The metrics, the thresholds and the process used are fully described. The results of applying the clone detection technique to two telecommunication monitoring systems totaling one million lines of source code are provided as examples. The information provided by this study is useful in monitoring the maintainability of large software systems.
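The ordinal-scale idea can be sketched as follows; the comparison-point names, metric names, and tolerance here are hypothetical placeholders, not the actual Datrix metric groups or thresholds:

```python
def cloning_level(m1, m2, points, tol=0.1):
    """Ordinal cloning level in the spirit of the approach above:
    count how many comparison points (groups of metrics) agree within
    a relative tolerance; the more points agree, the lower (closer)
    the cloning level. `points` maps a comparison-point name to the
    list of metric names it groups."""
    def point_matches(names):
        for n in names:
            a, b = m1[n], m2[n]
            denom = max(abs(a), abs(b), 1e-9)
            if abs(a - b) / denom > tol:
                return False
        return True
    matching = sum(point_matches(names) for names in points.values())
    # Level 0 = all points agree (exact copy); the maximum level
    # (no points agree) marks distinct functions.
    return len(points) - matching
```

With four comparison points this yields five raw levels; the paper's finer eight-level scale additionally grades how closely each point matches.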
Conference Paper
Cloning (ad hoc reuse by duplication of design or code) speeds up development, but also hinders future maintenance. Cloning also hints at reuse opportunities that, if exploited systematically, might have positive impact on development and maintenance productivity. Unstable requirements and tight schedules pose unique challenges for Web Application engineering that encourage cloning. We are conducting a systematic study of cloning in Web Applications of different sizes, developed using a range of Web technologies, and serving diverse purposes. Our initial results show cloning rates up to 63% in both newly developed and already maintained Web Applications. Expected contribution of this work is two-fold: (1) to confirm potential benefits of reuse-based methods in addressing clone related problems of Web engineering, and (2) to create a framework of metrics and presentation views to be used in other similar studies.
Conference Paper
Code clones (duplicated source code in a software system) are one of the major factors in decreasing maintainability. Many code clone detection methods have been proposed to find code clones automatically from large-scale software. However, it is still hard to find harmful code clones to improve maintainability because there are many code clones that should remain. Thus, to help find harmful code clones, we propose a code clone visualization method and a metrics application on the visualized information. Our method enables the location of harmful code clones diffused in a software system. We apply our method to three open source software programs and visualize their code clone information.
Article
Identifying code duplication in large multi-platform software systems is a challenging problem. This is due to a variety of reasons including the presence of high-level programming languages and structures interleaved with hardware-dependent low-level resources and assembler code, the use of GUI-based configuration scripts generating commands to compile the system, and the extremely high number of possible different configurations.This paper studies the extent and the evolution of code duplications in the Linux kernel. Linux is a large, multi-platform software system; it is based on the Open Source concept, and so there are no obstacles in discussing its implementation. In addition, it is decidedly too large to be examined manually: the current Linux kernel release (2.4.18) is about three million LOCs.Nineteen releases, from 2.4.0 to 2.4.18, were processed and analyzed, identifying code duplication among Linux subsystems by means of a metric-based approach. The obtained results support the hypothesis that the Linux system does not contain a relevant fraction of code duplication. Furthermore, code duplication tends to remain stable across releases, thus suggesting a fairly stable structure, evolving smoothly without any evidence of degradation.
Conference Paper
The paper presents extensions to the Bell Canada source code quality assessment suite (Datrix™) for handling Java language systems. These extensions are based on source code object metrics, including Java interface metrics, which are presented and explained in detail. The assessment suite helps evaluate the quality of medium-to-large software systems by identifying parts of a system that have unusual characteristics. The paper also studies and reports the occurrence of clones in medium-to-large Java software systems. Clone presence affects quality since it increases system size and often leads to higher maintenance costs. The clone identification process uses Java-specific metrics to determine similarities between methods throughout a system. The results obtained from experiments with software evaluation and clone detection techniques, on over 500 KLOC of Java source code, are presented.
Kodhai, E., Perumal, A., and Kanmani, S. 2010. Clone detection using textual and metric analysis to figure out all type of clones. International Journal of Computer Communications and Information System 2, 1 (Dec. 2010), 99-103.
Lanubile, F. and Mallarado, T. 2003. Finding function clones in web applications. In Proceedings of the 7th European Conference on Software Maintenance and Reengineering (Benevento, Italy, Mar. 26-28, 2003), 379-386.