An Empirical Study of Long-Lived Code Clones.
Dongxiang Cai1 and Miryung Kim2
1Hong Kong University of Science and Technology
2The University of Texas at Austin
Abstract. Previous research has shown that refactoring code clones as
soon as they are formed or discovered is not always feasible or worth-
while to perform, since some clones never change during evolution and
some disappear in a short amount of time, while some undergo repetitive
similar edits over their long lifetime.
Toward a long-term goal of developing a recommendation system that
selectively identifies clones to refactor, as a first step, we conducted an
empirical investigation into the characteristics of long-lived clones. Our
study of 13558 clone genealogies from 7 large open source projects, over
the history of 33.25 years in total, found surprising results. The size of
a clone, the number of clones in the same group, and the method-level
distribution of clones are not strongly correlated with the survival time
of clones. However, the number of developers who modified clones and
the time since the last addition or removal of a clone to its group are
highly correlated with the survival time of clones. This result indicates
that the evolutionary characteristics of clones may be a better indicator
for refactoring needs than static or spatial characteristics such as LOC,
the number of clones in the same group, or the dispersion of clones in a
system.
Keywords: Software evolution, empirical study, code clones, refactoring
Code clones are code fragments similar to one another in syntax and semantics.
Existing research on code cloning indicates that a significantly large portion
of software (e.g., gcc: 8.7%, JDK: 29%, Linux: 22.7%) contains
code duplicates created by copy and paste programming practices. Though code
cloning helps developers to reuse existing design and implementation, it could
incur a significant maintenance cost because programmers need to apply repet-
itive edits when the common logic among clones changes. Neglecting to update
clones consistently may introduce a bug.
Refactoring is defined as a disciplined technique for restructuring existing
software systems, altering a program’s internal structure without changing its
external behavior. Because refactoring is considered a key to keeping source
code easier to understand, modify, and extend, previous research effort has
focused on automatically identifying clones [15,4,2,21,13].
* This research was conducted while the first author was a graduate student intern at
The University of Texas at Austin.
However, recent studies on code clones [7,18–20,26] indicated that cloning
is not necessarily harmful and that refactoring may not be always applicable to
or beneficial for clones. In particular, our previous study of clone evolution 
found that (1) some clones never change during evolution, (2) some clones disap-
pear after staying in a system for only a short amount of time due to divergent
changes, and (3) some clones stay in a system for a long time and undergo con-
sistent updates repetitively, indicating a high potential return for the refactoring
investment. These findings imply that it is crucial to selectively identify clones
to refactor.
We hypothesize that the benefit of clone removal may depend on how long
clones survive in the system and how often they require similar edits over their
lifetime. Toward a long-term goal of developing a system that recommends clones
to refactor, as a first step, we conducted an empirical investigation into the
characteristics of long-surviving clones. Based on our prior work on clone
genealogies (an automatically extracted history of clone evolution from a
sequence of program versions), we first studied various factors that may influence a
clone’s survival time, such as the number of clones in the same group, the num-
ber of consistent updates to clones in the past, the degree of clone dispersion
in a system, etc. In total, we extracted 34 attributes from a clone genealogy
and investigated correlation between each attribute and a clone’s survival time
in terms of the number of days before disappearance from a system. Our study
found several surprising results. The more developers maintain code clones, the
longer the clones survive in a system. The longer it has been since the time of
the last addition or deletion of a clone to its clone group, the longer the clones
survive in a system. On the other hand, a clone’s survival time showed little
correlation with the size of clones, the number of clones in the same group, or
the number of methods in which the clones are located.
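The attribute-by-attribute correlation described above can be sketched in a few lines. This is only an illustration: it assumes Pearson's coefficient (this passage does not say which correlation measure was used), and the attribute names and toy values are hypothetical, not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_attributes(rows, target):
    """Rank every attribute by |r| against the target column, strongest first."""
    names = [name for name in rows[0] if name != target]
    ys = [row[target] for row in rows]
    scored = [(name, pearson([row[name] for row in rows], ys)) for name in names]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: one row per dead genealogy (not taken from the study).
toy_genealogies = [
    {"num_developers": 1, "clone_loc": 40, "survival_days": 30},
    {"num_developers": 2, "clone_loc": 10, "survival_days": 60},
    {"num_developers": 3, "clone_loc": 40, "survival_days": 90},
    {"num_developers": 4, "clone_loc": 10, "survival_days": 120},
]
ranking = rank_attributes(toy_genealogies, "survival_days")
```

Sorting by absolute value keeps strongly negative correlations visible next to strongly positive ones, which matters when the goal is to find any attribute that tracks survival time.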
For each subject, we developed a decision tree-based model that predicts a
clone’s survival time based on its characteristics. The model’s precision ranges
from 58.1% to 79.4% and the recall measure ranges from 58.8% to 79.3%. This
result shows promise in developing a refactoring recommendation system that
selects long-lived clones.
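For readers less familiar with the two measures, the following minimal sketch computes precision and recall for one predicted class. The class labels "long" and "short" are hypothetical placeholders; this passage does not state how survival time was divided into classes or how the per-class scores were aggregated.

```python
def precision_recall(actual, predicted, label):
    """Precision and recall of `label`, given parallel lists of class labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == label and p == label)
    false_pos = sum(1 for a, p in pairs if a != label and p == label)
    false_neg = sum(1 for a, p in pairs if a == label and p != label)
    # Guard against empty denominators when a class is never predicted or never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical labels for five clone genealogies.
actual = ["long", "long", "short", "short", "long"]
predicted = ["long", "short", "short", "long", "long"]
long_precision, long_recall = precision_recall(actual, predicted, "long")
```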
The rest of this paper is organized as follows. Section 2 describes related
work. Section 3 gives background on our previous clone genealogy research, and
Section 4 describes the characteristics of clone genealogy data and the subject
programs that we studied. Section 5 describes correlation analysis results and
Section 6 presents construction and evaluation of decision tree-based prediction
models. Section 7 discusses threats to validity, and Section 8 summarizes our
2 Related Work
This section describes tool support for identifying refactoring opportunities,
empirical studies of code cloning, and clone evolution analysis.
Identification of Refactoring Opportunities. Higo et al. propose Aries
to identify refactoring candidates based on the number of assigned variables, the
number of referred variables, and clone dispersion in the class hierarchy. Aries
suggests two types of refactorings, extract method and pull-up method. A
refactoring can be suggested if the clone metrics satisfy certain predefined
values. Komondoor’s technique extracts non-contiguous lines of clones into a
procedure that can then be refactored by applying an extract method refactoring.
Koni-N’Sapu provides refactoring suggestions based on the location of clones
with respect to a system’s class hierarchy. Balazinska et al. suggest clone
refactoring opportunities based on the differences between the cloned methods
and the context of attributes, methods, and classes containing clones. Breakaway
automatically identifies detailed structural correspondences between two
abstract syntax trees to help programmers generalize two pieces of similar code.
Several techniques [35,34,33,11,28] automatically identify bad smells that
indicate refactoring needs. For example, Tsantalis and Chatzigeorgiou’s technique
identifies extract method refactoring opportunities using static slicing. Our work
is different from these refactoring opportunity identification techniques in that
it uses clone evolution history to predict how long clones are likely to survive in
a system.
Studies about Cloning Practice. Cordy notes that cloning is a common
method of risk minimization used by financial institutions because modifying
an abstraction can introduce the risk of breaking existing code. Fixing a shared
abstraction is costly and time consuming as it requires any dependent code to
be extensively tested. On the other hand, clones increase the degrees of freedom
in maintaining each new application or module. Cordy noted that propagating
bug fixes to clones is not always a desired practice because the risk of changing
an already properly working module is too high.
Godfrey et al. conducted a preliminary investigation of cloning in Linux
SCSI drivers and found that super-linear growth in Linux is largely caused by
cloning of drivers. Kapser and Godfrey further studied cloning practices in
several open source projects and found that clones are not necessarily harmful.
Developers create new features by starting from existing similar ones, as this
cloning practice permits the use of stable, already tested code. While
interviewing and surveying developers about how they develop software, LaToza et al.
uncovered six patterns of why programmers create clones: repeated work,
example, scattering, fork, branch, and language. For each pattern, fewer than half of the
developers interviewed thought that the cloning pattern was a problem. LaToza
et al.’s study confirms that most clones are unlikely to be created with ill
intentions. Rajapakse et al. found that reducing duplication in a web application
had negative effects on the extensibility of the application. After significantly
reducing the size of the source code, a single change often required testing a vastly
larger portion of the system. Avoiding clones during initial development could
also add significant overhead. These studies indicate that not all clones
are harmful and that it is important to selectively identify clones to refactor.
Clone Evolution Analysis. While our study uses the evolutionary
characteristics captured by the clone genealogy model, the following clone evolution
analyses could serve as a basis for generating clone evolution data. The
evolution of code clones was analyzed for the first time by Laguë et al. Aversano
et al. refined our clone genealogy model by further categorizing the
Inconsistent Change pattern into the Independent Evolution pattern and the Late
Propagation pattern. Krinke also extended our clone genealogy analysis and
independently studied clone evolution patterns. Balint et al. developed a
visualization tool to show who created and modified code clones, the time of the
modifications, the location of clones in the system, and the size of code clones.
Classification of Code Clones. Bellon et al. categorized clones into Type 1
(an exact copy without modifications), Type 2 (a syntactically identical copy)
and Type 3 (a copy with further modifications, e.g., addition and deletion of
statements) in order to distinguish the kinds of clones that can be detected by
existing clone detectors. Kapser and Godfrey [17,16] taxonomized clones to
increase user comprehension of code duplication and to filter false positives
in clone detection results. Several attributes of a clone genealogy in Section 4 are
motivated by Kapser and Godfrey’s region and location based clone filtering
criteria. Our work differs from these projects in that it identifies the characteristics
of long-lived clones.
3 Background on Clone Genealogy and Data Set
[Figure 1: two clone genealogies drawn across versions Vi to Vi+6. Dead genealogy: disappeared through refactoring at the age of 5. Alive genealogy: present in the last version with the age of 4.]
Fig. 1. Example clone genealogies: G1 (above) and G2 (below)
A clone genealogy describes how groups of code clones change over multiple
versions of the program. A clone group is a set of clones considered equivalent
according to a clone detector. For example, clones A and B belong to the same
group in version i because a clone detector finds them equivalent. In a clone’s
genealogy, a group to which the clone belongs is traced to its origin clone group
in the previous version. The model associates related clone groups that have
originated from the same ancestor group. In addition, the genealogy contains
information about how each element in a group of clones changed with respect
to other elements in the same group. A detailed description of the clone
genealogy representation is given elsewhere. The following evolution patterns
describe all possible changes in a clone group.
Same: none of the code snippets in the new version’s clone group changed from
the old version’s clone group.
Add: at least one code snippet is newly added to the clone group.
Subtract: at least one code snippet in the old version’s clone group does
not appear in the corresponding clone group in the new version.
Consistent Change: all code snippets in the old version’s clone group have
changed consistently; thus, they all belong to the new clone group.
Inconsistent Change: at least one code snippet in the old version’s clone
group changed inconsistently; thus, it no longer belongs to the same group in
the new version.
Shift: at least one code snippet in the new clone group partially overlaps
with at least one code snippet in the original clone group.
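The pattern definitions above can be turned into a small labeling routine. The sketch below is a simplification under stated assumptions, not the genealogy extractor itself: each clone group is modeled as a hypothetical mapping from a snippet identifier to its text, `new_version` is an assumed mapping of every snippet that still exists in the new version (grouped or not), and Shift is omitted because it needs line-range overlap information that this toy model does not carry.

```python
def classify_transition(old_group, new_group, new_version):
    """Label the change from old_group to new_group with evolution patterns.

    old_group / new_group: dict mapping snippet id -> snippet text.
    new_version: dict mapping snippet id -> text for every snippet that
    still exists anywhere in the new version (an assumption of this sketch).
    Returns a set of labels, since patterns are not mutually exclusive.
    """
    labels = set()
    old_ids, new_ids = set(old_group), set(new_group)

    if new_ids - old_ids:
        labels.add("Add")

    departed = old_ids - new_ids
    for sid in departed:
        if sid in new_version and new_version[sid] != old_group[sid]:
            # Still in the system, but edited so it no longer matches the group.
            labels.add("Inconsistent Change")
        else:
            labels.add("Subtract")

    changed = {sid for sid in old_ids & new_ids if old_group[sid] != new_group[sid]}
    if old_ids and not departed:
        if changed == old_ids:
            # Every old snippet changed and all of them stayed in the group.
            labels.add("Consistent Change")
        elif not changed and not labels:
            labels.add("Same")
    return labels


# Hypothetical three-snippet group used to exercise each label.
old = {"a": "x = f(y)", "b": "x = f(y)", "c": "x = f(y)"}
same = classify_transition(old, dict(old), dict(old))
consistent = classify_transition(old, {k: "x = g(y)" for k in old},
                                 {k: "x = g(y)" for k in old})
```

Returning a set rather than a single label reflects that one transition can, for example, both add a snippet and change the existing ones consistently.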
A clone lineage is a directed acyclic graph that describes the evolution history
of a sink node (clone group). A clone group in the k-th version is connected to a
clone group in the (k−1)-th version by an evolution pattern. For example, Figure 1
shows a clone lineage following the sequence of Add, Same, Consistent Change,
Consistent Change, and Inconsistent Change. In the figure, code snippets
with similar content are filled with the same shade.
A clone genealogy is a set of clone lineages that have originated from the
same clone group. A clone genealogy is a connected component where every
clone group is connected by at least one evolution pattern. A clone genealogy
approximates how programmers create, propagate, and evolve code clones.
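Because a genealogy is defined as a connected component of clone groups linked by evolution patterns, it can be recovered with an ordinary graph traversal. The sketch below assumes a hypothetical representation in which each clone group has a string identifier and each link is an `(older_group, newer_group, pattern)` triple; edge direction is ignored when grouping, since connectivity is all that matters here.

```python
from collections import defaultdict


def genealogies(groups, edges):
    """Partition clone groups into genealogies (connected components).

    groups: iterable of clone-group identifiers.
    edges: iterable of (older_group, newer_group, pattern) triples.
    """
    adjacency = defaultdict(set)
    for older, newer, _pattern in edges:
        adjacency[older].add(newer)
        adjacency[newer].add(older)

    seen = set()
    components = []
    for start in groups:
        if start in seen:
            continue
        component = set()
        stack = [start]
        seen.add(start)
        while stack:  # iterative depth-first search from `start`
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# Hypothetical clone groups: two evolving genealogies and one isolated group.
toy_groups = ["A0", "A1", "A2", "B1", "B2", "C2"]
toy_edges = [
    ("A0", "A1", "Add"),
    ("A1", "A2", "Same"),
    ("B1", "B2", "Consistent Change"),
]
toy_genealogies = genealogies(toy_groups, toy_edges)
```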
Clone genealogies are classified into two groups: dead genealogies that do
not include clone groups of the final version and alive genealogies that include
clone groups of the final version. We differentiate a dead genealogy from an alive
genealogy because only dead genealogies provide information about how long
clones stayed in the system before they disappeared. On the other hand, for an
alive genealogy, we cannot tell how long its clones will survive because they are
still evolving. In Figure 1, G1 is a dead genealogy with an age of 5, and G2 is an
alive genealogy with an age of 4. Dead genealogies are essentially genealogies that
disappeared because their clones were refactored, were deleted by
a programmer, or are no longer considered clones by a clone
detector due to divergent changes.
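The dead/alive distinction reduces to checking whether a genealogy reaches the final version. In the minimal sketch below, a genealogy is modeled as a hypothetical set of `(version_index, group_id)` pairs, and age is assumed to be the number of version transitions it spans (last version index minus first). The text quotes ages without defining the unit, so this definition is an assumption of the sketch.

```python
def classify_genealogy(nodes, final_version):
    """Return (status, age) for a genealogy.

    nodes: non-empty set of (version_index, group_id) pairs.
    final_version: index of the last analyzed program version.
    Age is assumed here to be last_version - first_version.
    """
    if not nodes:
        raise ValueError("a genealogy needs at least one clone group")
    versions = [version for version, _group in nodes]
    first, last = min(versions), max(versions)
    status = "alive" if last == final_version else "dead"
    return status, last - first


# Hypothetical genealogies over versions 0..6; the ages are chosen to mirror
# those quoted for G1 and G2, not reconstructed from the figure itself.
g1 = {(v, "g1") for v in range(0, 6)}   # last seen in version 5
g2 = {(v, "g2") for v in range(2, 7)}   # still present in version 6
g1_status = classify_genealogy(g1, final_version=6)
g2_status = classify_genealogy(g2, final_version=6)
```

Only the genealogies labeled dead yield a complete survival time; for alive ones the computed age is a lower bound.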