Identifying Source Code Reuse across Repositories
using LCS-based Source Code Similarity
Naohiro Kawamitsu, Takashi Ishio, Tetsuya Kanda, Raula Gaikovina Kula, Coen De Roover∗†, Katsuro Inoue
Graduate School of Information Science and Technology
Osaka University
1–5 Yamadaoka, Suita, Osaka 565–0871, Japan
Email: {n-kawamt, ishio, t-kanda, raula-k, coen, inoue}@ist.osaka-u.ac.jp
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2, 1050 Brussels, Belgium
Email: cderoove@vub.ac.be
Abstract—Developers often reuse source files developed for
another project. In order to update a reused file to a newer
version released by the original project, developers have to track
which revision of a file was reused and how its content was
modified. However, such tracking is tedious for developers. Many
projects keep older versions of files whose bugs are already fixed
in the original project. In this paper, we propose a technique to
automatically identify source code reuse relationships between
two repositories. Using a similarity metric based on longest
common subsequence, we identify pairs of similar revisions of
files across the repositories. To evaluate our approach, we have
analyzed eight project pairs of open source software projects
and compared the result with the recorded information in the
repositories. As a result, we have identified 1394 file revisions as
instances of source code reuse. While 75.3% of the instances
are recorded in the repositories, 20.1% of the instances are
unrecorded but recovered by our approach.
I. INTRODUCTION
Clone-and-own is one of the popular approaches to source
code reuse [1][2]. Developers copy source files developed
by another project into their own. Reusing existing libraries
reduces cost and enables quick software development. This
even happens for libraries which are intended to be shared
between projects. Rather than linking against a compiled version of the library, developers sometimes copy its source code into their own project, for instance when a project-specific enhancement is required.
Developers should keep track of the library version they copied from, as this facilitates keeping copies up-to-date, for instance to patch newly discovered security vulnerabilities.
Nevertheless, 18.7% of the projects studied in [3] had no records of the actual versions of the third-party code they were reusing. A simple diff is often insufficient to identify
the origin of a copy, because it cannot distinguish project-
specific changes from updates to the original project. Indeed,
many copies are modified for project-specific enhancements.
The practice of clone-and-own reuse therefore requires tool
support for identifying the original version of the copy.
In general, there are two approaches to the problem.
Forward engineering approaches aim to provide tool support
for a more systematic clone-and-own reuse. Jablonski et al.,
for instance, proposed an extension of the Eclipse IDE that
automatically records copy-and-paste activities [4]. Reverse
engineering approaches, in contrast, aim to identify instances
Fig. 1. Separate evolutions of a copy and the origin file.
of reused code in given code bases. Our approach belongs to the latter category.
We propose a file-level analysis technique for detecting
copies and the origin files they are copied from across project
repositories. For each revision of a source file in a project, we determine the file revision it originated from in another project, while accounting for project-specific changes to the copy.
Figure 1 depicts separate evolutions of a copy and the origin file it is copied from. The file f in the source repository Rs is copied into the destination repository Rd as file f′. A version f1 of f is updated in the source repository to f2, which is copied to the destination repository as f′2. Next, the destination repository adds a new feature to f′2, resulting in a new revision f′3. As f′3 is newer than f3, it might seem up-to-date upon a casual inspection. In reality, however, it contains outdated source code that still stems from f2. This example illustrates that commit timestamps do not suffice for determining which revision in a source repository is copied.
Our approach therefore detects instances of clone-and-own
reuse based on the similarity between a copy and a candidate
origin file revision instead. More concretely, we use Longest
Common Subsequence (LCS) [5] as a similarity metric. LCS is
common in clone detection [6] and product evolution analyses
[7]. We implemented our approach as a tool. Given a source
repository and a destination repository, our tool identifies pairs
of similar versions of files across the repositories. For each
copy in a destination repository, our tool reports project version
numbers of the original file revision in a source repository.
To evaluate our approach, we applied our tool to eight
open source software projects of which six are known to have
reused libpng and two are known to have reused libcurl. The
tool reported 1394 file revisions in a destination project and
their origin file versions in a source project. We manually
verified the reported revisions using the directory structure
of the source and destination projects, the commit log of the
destination project, and the contents of the copy and the origin
file. As a result, 1004 file revisions are correctly recorded in the destination projects. 46 of the reported origin file revisions revealed potential problems: developers recorded version numbers that differ from the actual origin file revisions. Furthermore, 201 of the reported instances are not recorded in the destination projects, even though the instances have the same contents except for white space and code comments. Note that the instances of source code reuse are contained in 73 commits in the repositories. Because 23 (31.5%) of these commits do not record version numbers, our automatic analysis is important for recovering
source code reuse information across repositories.
The contributions of the paper are summarized as follows.
• An automated analysis method is proposed to detect instances of source code reuse between software repositories.
• The approach is evaluated using the source code reuse information recorded in publicly available software repositories.
• Actual instances of source code reuse in eight projects are analyzed.
Section II shows motivating examples of our approach.
The approach itself is detailed in Section III. Section IV
presents the evaluation of our approach using a prototype
implementation and the aforementioned open source projects.
Before concluding in Section VI, we describe related work in
Section V.
II. MOTIVATING EXAMPLES
When reusing source files from another project, developers
may modify these files for their own purpose. An example
is found in the V8Monkey project. The project includes a
file pngget.c which stems from the libpng project. The
V8Monkey project modified this file to support PNG ani-
mations. When the libpng project released a new version
of pngget.c, the V8Monkey project merged these changes
with theirs. Another example is found in the source code of
Wolfenstein: Enemy Territory. The project includes variants
of libcurl files that have been modified to satisfy the project’s
own code formatting conventions.
To maintain all of these copies, knowing the project a file
originates from does not suffice. Developers require knowledge
about the concrete revision each copy was copied from.
Otherwise, it is not clear what revision to update each copy
to, since two or more releases of a library may be available at
the same time. For example, the Cocos2D-iPhone project tried to update their libpng files from 1.2.38 to 1.4.1, but downgraded to 1.2.43 because of source incompatibility. This traceability information is recorded in several manners in practice. Commit messages in the repository of V8Monkey indicate project version numbers of libpng files. For example, the message of commit ID 9def47a86c95fd5f¹ says "updated libpng to 1.4.3." In the source code of Wolfenstein, on the other hand, the directory name curl-7.12.2 hints at the included version of libcurl.
However, traceability information recorded in this manner is not always available. Xia et al. [3] reported that 18.7% of projects had no version information for the third-party code. Even if traceability information has been
recorded when files are copied, the history may not be
directly visible to developers. For example, the commit 3a6c8755c4b08c2c in the V8Monkey project records "Move libpng to media/libpng." Hence, an older revision must be analyzed to identify which version of libpng is used in the project. In addition, incorrect traceability information may be recorded in the repository. For example, the git repository of the Haiku-services-branch project includes a commit cc57c65424afbcb7 whose message states "updated libpng to 1.2.31." The commit updated three files. While two of them are exactly the same as files in libpng 1.2.31, one file named png.h is the same as png.h in libpng 1.2.30 except for additional code comments. While this case does not cause a serious problem, an incorrectly recorded newer version number may prevent developers from updating a vulnerable copy. To avoid manual, error-prone recording of traceability information, an automated technique for tracking source code reuse between repositories is required.
Two clues for recovering traceability of source code reuse
are the time the copy is made and the content of the copy. A
simple heuristic is that developers copied the latest version available at the time. However, developers sometimes intentionally copy an older version. For example, the fs2open project started to use files copied from libpng 1.2.42, although the newer version 1.4.0 was also available.
Another clue is the content of a copy. Since developers may modify the copy, a copy has to be compared with candidate original versions. Identifiers play an important role in this comparison, because most bug fixes are implemented in a small number of lines of code [8]. For
example, commit f2e2833f28fa11ba of libpng patched a
bug using the following string replacement:
- png_ptr->transformations |= PNG_STRIP_ALPHA;
+ png_ptr->flags |= PNG_FLAG_STRIP_ALPHA;
Small differences are therefore often vital to identifying
which revision a copy stems from. Unfortunately, these are
exactly the differences code clone detection tools are inten-
tionally oblivious to [9].
It should be noted that a code comment in a copy may indicate a version different from its actual version. For example, the header comment of png.c in libpng 1.2.31 says that the file is 1.2.30. The header comment of pngmem.c in libpng 1.0.38 also says that the file is 1.2.30. A correct version number of a copy must therefore be verified against the content of the copy.
III. OUR APPROACH
¹In this paper, a commit ID for a git repository is represented by its first 16 characters, which uniquely identify the commit.

Fig. 2. An example repository including three files f1, f2, and f3.

Our approach determines which source code from a source repository S (e.g., a library project) is reused in a destination repository D (e.g., an application project). An instance of source code reuse is a link between a file revision (a version of a file) in S and a file revision in D. The approach is based on the following assumptions about source code reuse:
• Developers copy a file from a release version of a library.
• Version numbers are available as tags in a source repository.
• Developers do not modify the content of a copy significantly.
• An origin file exists in a source repository before a copy appears in a destination repository.
Based on these assumptions, instances of source code reuse can be detected using the following step-wise approach:
1) Identify file pairs (s, d) such that a revision of s in S is similar to a revision of d in D.
2) Exclude file pairs (s, d) for which d is older than s.
3) Identify the most similar revision si of s for each revision dj of d. Output the project version numbers corresponding to si as the origin of dj.
The resulting set includes all instances of source code reuse
between the repositories. It enables developers of a destination
project to know which file revisions have been derived from
the source project and should be updated. It also enables
developers of a source project to understand how their files are
used and extended in the destination project and to improve
functionalities in origin files.
We regard a repository as a directed graph of which the
vertices represent revisions of files, and of which the edges
represent successors of revisions. Figure 2 shows an example
of a repository including three files f1, f2, and f3. In the figure, a circle represents a file revision. In the repository, r1 is the first revision of file f1; r2 and r3 are modified revisions of r1. The file f2 is modified once and renamed to f3: the edge from r4 to r5 represents the modification, and the edge from r5 to r6 represents the rename. The lack of an edge from revision r8 to r6 indicates that file revision r8 has been deleted and then replaced by r6.
Rk(f) refers to all file revisions related to file f in the repository k. To include renamed files in the analysis, Rk(f) includes revisions in which the file is named f and also revisions weakly connected to them. For example, in Figure 2, Rk(f1) = {r1, r2, r3}, Rk(f2) = {r4, r5, r6}, and Rk(f3) = {r4, r5, r6, r7, r8}. Although Rk(f) may include a file revision whose name is not f, we denote the file revisions in Rk(f) as fi for simplicity.

Fig. 3. An example of a file pair (s, d).
A. File Pair Extraction
A file d is likely a reuse of a file s if the similarity between a revision of s and a revision of d is above a certain predetermined threshold. We extract a set of file pairs C that are likely instances of source code reuse as follows.

C = {(s, d) | ∃si ∈ RS(s), ∃dj ∈ RD(d) : sim(si, dj) ≥ th}

sim(si, dj) is a similarity metric between revisions. For sim, our technique employs a metric that stems from product evolution analysis [7]. This token-based metric is computed as follows:

sim(si, dj) = |LCS(si, dj)| / (|si| + |dj| − |LCS(si, dj)|)

where |si| and |dj| are the numbers of tokens in the file revisions, and |LCS(si, dj)| is the length of the LCS of the token sequences of the file revisions. In this comparison, each file revision is normalized to a sequence of tokens excluding code comments and white space. All other tokens, including keywords, macros, and identifiers, are kept as is. The threshold th is a configurable parameter of the metric, but we have arbitrarily chosen th = 0.8 for our implementation.
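As an illustration, the metric can be sketched in Python as follows. This is a sketch, not the authors' implementation; it assumes the token sequences have already been extracted with comments and white space removed.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences,
    via the classic dynamic programming recurrence, computed row by row
    to keep memory at O(len(b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(b)]

def sim(si_tokens, dj_tokens):
    """sim(si, dj) = |LCS| / (|si| + |dj| - |LCS|), ranging over [0, 1]."""
    lcs = lcs_length(si_tokens, dj_tokens)
    denom = len(si_tokens) + len(dj_tokens) - lcs
    return lcs / denom if denom else 1.0
```

The metric reaches 1.0 only when the two token sequences are identical, and a single-token bug fix lowers it only slightly, which is what makes it sensitive enough to distinguish adjacent library versions.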
This step compares all the file revision pairs between the
repositories. To avoid unnecessary computation of LCS for a
pair of less similar files, we have employed an optimization. In
short, we compare term-frequency vectors of two file revisions
to estimate the similarity. If two file revisions have only a
small number of common tokens, we do not need to compute
the LCS for the revisions, as they cannot have a long common
subsequence. By normalizing identifiers in source code, each
file is translated into a fixed-length vector representing the
frequencies of each lexical element such as keywords and
operators in a programming language. In addition to the
optimization, we have employed an LCS algorithm suitable for similar strings [10] in order to efficiently compare many similar file revisions.
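One way to realize such a prefilter is sketched below (with naming of our own; this is not necessarily the exact vectorization used by the tool). It exploits two facts: the LCS can never be longer than the multiset intersection of the two token bags, and sim grows monotonically with the LCS length, so an upper bound on the LCS yields an upper bound on sim.

```python
from collections import Counter

def sim_upper_bound(a_tokens, b_tokens):
    """Cheap upper bound on the LCS-based similarity: |LCS(a, b)| is at
    most sum over tokens t of min(tf_a(t), tf_b(t))."""
    bound = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    denom = len(a_tokens) + len(b_tokens) - bound
    return bound / denom if denom else 1.0

def worth_computing_lcs(a_tokens, b_tokens, th=0.8):
    """Skip the quadratic LCS computation when even this optimistic
    bound stays below the similarity threshold."""
    return sim_upper_bound(a_tokens, b_tokens) >= th
```

Counting token frequencies is linear in the file size, so pairs of dissimilar revisions are discarded long before the quadratic LCS step.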
Fig. 4. An example of files that have similar revisions.
Figure 3 shows an example of a pair of a file s in a source repository and a file d in a destination repository. In the figure, the file s has three revisions s1, s2, and s3, and the file d has two revisions d1 and d2. The directed edges between revisions indicate the successor relationship between revisions. The revisions are placed left to right according to their commit time. Cross-repository links between a revision of s and a revision of d carry the similarity between the file revisions. For example, sim(s1, d1) = 0.7, sim(s1, d2) = 0.8, and so on. Note that these links are undirected. We regard the file pair (s, d) as a candidate instance of source code reuse, because there are several links whose similarity values are equal to or greater than our 0.8 threshold.
B. Filtering by Commit Time
Having computed the set C of pairs (s, d) of files whose revisions are similar to each other, we filter out file pairs in C that are less likely reuse instances using a heuristic involving commit times. If a file d in D has been created by reusing s, the oldest revision of s that is similar to a revision of d should have been committed earlier than the oldest similar revision of d.

To compare the commit times, we identify the oldest similar file revisions of s and d. We select the file revisions in both repositories that are similar to each other. Formally, we select the following revisions.

TS(s) = {si ∈ RS(s) | ∃dj ∈ RD(d) : sim(si, dj) ≥ th}
TD(d) = {dj ∈ RD(d) | ∃si ∈ RS(s) : sim(si, dj) ≥ th}

We compare the oldest revisions of s and d in these sets. If the oldest revision of d is older than the oldest revision of s, the file pair (s, d) is removed from C. The filtered file pair set Cfiltered is obtained as follows.

Cfiltered = {(s, d) ∈ C | ∃si ∈ TS(s), ∀dj ∈ TD(d) : t(si) ≤ t(dj)}

where t(r) is the commit time of a file revision r.

The selected revisions have at least one similar revision in the peer repository. In the case of the pair (s, d) shown in Figure 3, TS(s) = {s1, s2, s3} and TD(d) = {d2}. Figure 4 shows the selected revisions. This figure can be obtained by removing links whose similarity is less than the threshold, and
removing the revisions which have no links. In the figure, the oldest revisions of files s and d are s1 and d2, respectively. According to their commit times, s1 is older than d2, i.e., t(s1) < t(d2). Hence, the pair (s, d) is kept for the next step.

Fig. 5. An example of the result of the final step. Each file revision in the destination repository is linked to its most similar file revision in the source repository. Two revision pairs (s1, d1) and (s2, d2) are the output of our approach for this file pair.
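The filter for a single candidate pair can be sketched as follows (illustrative names of our own: rev_sims holds the precomputed cross-repository similarity links, and commit_time maps a revision to its timestamp):

```python
def keep_pair(rev_sims, commit_time, th=0.8):
    """Commit-time filter for one candidate file pair (s, d).

    rev_sims: iterable of (si, dj, similarity) links between revisions
    of s and revisions of d. The pair survives only if the oldest
    revision in T_S(s) is no newer than the oldest revision in T_D(d)."""
    ts = {si for si, dj, v in rev_sims if v >= th}  # T_S(s)
    td = {dj for si, dj, v in rev_sims if v >= th}  # T_D(d)
    if not ts or not td:
        return False
    return min(commit_time[r] for r in ts) <= min(commit_time[r] for r in td)
```

For the pair in Figure 3, T_S(s) = {s1, s2, s3} and T_D(d) = {d2}; since t(s1) < t(d2), the pair is kept.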
C. Identify Revision Pairs
We identify similar file revision pairs in the history of each file pair (s, d) in Cfiltered and translate the source file revisions into version numbers of the source project. For each revision dj ∈ RD(d), we identify the most similar revision si in RS(s) as the original revision corresponding to dj using the following criteria. While these criteria rely on the same similarity metric as before, they incorporate additional criteria for tie-breaking:
1) Select the source revisions that are the most similar to dj.
2) If two or more source revisions have the same similarity to dj, select the revisions that are most similar to dj using a text-based similarity metric that is a variant of sim using lines instead of tokens. It takes into account code comments and white space, while it ignores line separators. This rule focuses on file revisions in a source repository whose differences are only code comments, e.g., version numbers in their headers. We still ignore line separators because two projects may use different line separators.
3) If two or more source revisions still have the same similarity to dj, select the oldest one.
This step generates a set of reuse instances (si, dj) between the two repositories. Finally, we translate si into version numbers Vi by extracting the tags associated with si. Since multiple file revisions may have the same content across several branches, we list all file revisions having the same content as si, and use their tags as the version numbers corresponding to si. We output (Vi, dj), representing that dj is likely a copy of the original file revision in versions Vi. If tags are unavailable, we output the commit ID of si as the original file revision of dj. An example is shown in Figure 5, which is computed for the file pair from Figure 3 and Figure 4. Two links {(s1, d1), (s2, d2)} are produced by this step. If version numbers v1 and v2 are available for s1 and s2, the final output is {({v1}, d1), ({v2}, d2)}.
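The three tie-breaking rules can be sketched as a single lexicographic key (a sketch with names of our own choosing; token_sim and line_sim stand for the precomputed token-based and line-based similarities of each candidate source revision to dj):

```python
def pick_origin(candidates, token_sim, line_sim, commit_time):
    """Choose the origin revision for a destination revision dj:
    1) maximize the token-based similarity,
    2) break ties by the line-based similarity (comments and white
       space count, line separators do not),
    3) break remaining ties by taking the oldest commit."""
    return max(candidates,
               key=lambda si: (token_sim[si], line_sim[si], -commit_time[si]))
```

Negating the commit time inside a max() is a compact way to express rule 3: among equally similar revisions, the smallest timestamp wins.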
D. Implementation
We have implemented our approach as a tool for C pro-
grams under version management by Git. The tool takes as
input a pair of Git repositories and compares .c and .h files
in the repositories.
The similarity metric is not defined for a file without source
code such as an empty file. Hence, the tool excludes such files
from analysis.
The output of the tool is a set of pairs of a destination file revision and the source file revision from which it was copied. A file revision is identified by a file path and a commit ID in a repository. The tool also outputs similarity values between revisions to facilitate subsequent analyses: a similarity value less than 1 indicates that the destination file revision was modified from the original, so that developers can compare the contents of the revision pair. Below we list an extract of the results for the source repository libpng and the destination repository fs2open:
Source              Destination
Path      Tags      Path             Commit   sim
png.c     v1.2.42   libpng/png.c     101018d  1
png.c     v1.5.2    libpng/png.c     623b6ad  1
png.c     v1.5.7    libpng/png.c     58f9e77  1
png.h     v1.2.42   libpng/png.h     101018d  1
png.h     v1.5.2    libpng/png.h     623b6ad  1
png.h     v1.5.7    libpng/png.h     58f9e77  1
pngrio.c  v1.0.52,  libpng/pngrio.c  101018d  1
          v1.2.42
Due to the limited space, the commit IDs in the list
are shortened. The first line indicates that png.c tagged
as version 1.2.42 in the libpng repository is similar to
libpng/png.c in the commit 101018d in the fs2open
repository. The full result indicates that developers of the
fs2open project copied files in libpng to their repository three
times. Because the similarity values are always 1.0, it is likely
that the copies remained unmodified afterwards (though code
comments could have been modified). In the repository of
fs2open, the three commits record the version numbers of
libpng in the messages: 1.2.42, 1.5.2, and 1.5.7. We can verify
the correctness of the recorded version numbers from the
output of the tool. Note that the tool outputs a number of
tags for a destination file revision, if the same file content has
been included in several releases. For example, the tool reports
that libpng/pngrio.c in the commit 101018d is a copy
of pngrio.c in the versions 1.0.52, 1.2.42, and several other
versions (omitted in the above list).
IV. EVALUATION
To evaluate the effectiveness of our approach, we have applied our tool to 10 of the projects in Xia's work [11], which identified projects whose files are potentially copied between projects. Two projects, libpng and libcurl, are selected
as source repositories of reused code. Six projects using libpng
and two projects using libcurl are selected as destination
repositories. Table I shows the project names, URLs, and the
summary of the repository attributes. We have assigned IDs
for referring to the destination repositories in this section. The
columns “Duration,” “Latest commit ID,” and “#Commits”
show the analyzed history of the repositories. The column
“LOC” indicates the number of lines of code of .c and .h in
the latest version. Although several destination projects include
.cpp files, we have analyzed only C files because the source
repositories use .c files for their implementation.
We have evaluated the precision of the output of our tool by comparing it with the instances of source code reuse that are recorded by developers. As the ground truth, we identify project version numbers of origin files as follows.
• In destination repositories 1-6, we identify a version number from the commit message recorded when the copy was committed. For example, the version number of a destination file revision committed with the message "Updated to libpng 1.2.31" is 1.2.31.
• In destination repository 7, all files are located in a curl-7.12.2 directory. Hence, we assume that a destination revision committed in the directory is copied from version 7.12.2.
• In destination repository 8, all files are located in a curl directory. We identify a version number in a file named CHANGES in the curl directory that is committed with other source files at the same time.
Since it is hard to analyze the whole repository, we have
analyzed only commits and directories including an instance
of source code reuse reported by our tool.
A. Research Method
An instance of source code reuse reported by our tool is correct if the same information is recorded in the destination repository, i.e., the reuse instance is consistent with the recorded information. We classified the instances of source code reuse reported by our tool into four groups: consistent, inconsistent, unrecorded, and redundant.
• A reported instance is consistent with the recorded information if the reported destination file revision has been recorded as a copy of version v, and the reported version numbers include the version v.
• A reported instance is inconsistent with the recorded information if the reported version numbers do not include the version recorded by developers.
• A reported instance is unrecorded if no version number is identified for the reported destination file revision.
• A reported instance is redundant if another report links the destination file revision to another source file version that is more appropriate. For example, if two similar files A and B are copied to a destination repository as A' and B', our tool reports four pairs (A, A'), (B, B'), (A, B'), and (B, A'). In this case, (A, B') and (B, A') are regarded as redundant pairs.
The classification process compares the output of the tool with the ground truth. When a destination file revision is similar to revisions of two or more files in a source repository, we manually check the source file revisions in the reported versions to identify redundant instances. We select the most similar file revision, based on the file paths and a visual inspection, as the original revision.
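A minimal sketch of this classification, with illustrative names (in the actual study, the comparison against the ground truth involved manual inspection where needed):

```python
def classify(reported_versions, recorded_version, redundant=False):
    """Map one reported reuse instance to one of the four groups.

    reported_versions: set of version tags the tool reports for a
    destination file revision; recorded_version: the version developers
    recorded for that revision, or None if nothing was recorded."""
    if redundant:                 # a better report exists for this revision
        return "redundant"
    if recorded_version is None:  # no version number in the ground truth
        return "unrecorded"
    if recorded_version in reported_versions:
        return "consistent"
    return "inconsistent"
```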
TABLE I. ANALYZED PROJECTS

#ID  Project name           Repository URL                                           Duration             Latest commit ID  #Commits  LOC
-    libpng                 git://libpng.git.sourceforge.net/gitroot/libpng/libpng   Jul 1995 - Nov 2013  0e60f06b7c14e698  3517      76419
-    libcurl                https://github.com/bagder/curl                           Dec 1999 - Oct 2013  72f850571d24ae48  16891     152515
1    cocos2d-iphone         https://github.com/hansoninteractive/cocos2d-iphone.git  Jun 2008 - Jul 2010  b47eab8e90f6ba3f  3754      83111
2    apitrace               https://github.com/apitrace/apitrace                     Jul 2008 - Oct 2013  a2ad18752c5692a0  2694      113472
3    guliverkli2            https://github.com/athomasm/guliverkli2.git              Sep 2007 - Feb 2010  5e6d5d4caa2cf74d  107       420486
4    fs2open                https://github.com/sobczyk/fs2open.git                   Jun 2002 - Jul 2013  156565c8e94b1f58  8423      197532
5    v8monkey               https://github.com/zpao/v8monkey.git                     Mar 2007 - Feb 2012  0280cf71d2c36e0e  87435     2709919
6    Haiku-services-branch  git://github.com/Barrett17/Haiku-services-branch.git     Jul 2002 - Feb 2013  85e3cb0c85752c45  44842     5205544
7    Enemy-Territory        https://github.com/id-Software/Enemy-Territory.git       Jan 2012             40342a9e3690cb5b  1         553309
8    doom3.gpl              https://github.com/TTimo/doom3.gpl.git                   Nov 2011 - Apr 2012  8047099afdfc5c97  39        252618
TABLE II. THE ANALYSIS RESULT

#ID  Source   Destination            Reported  Consistent    Inconsistent  Unrecorded   Redundant
1    libpng   cocos2d-iphone         147       127 (86.4%)   0 (0.0%)      20 (13.6%)   0 (0.0%)
2    libpng   apitrace               78        39 (50.0%)    0 (0.0%)      39 (50.0%)   0 (0.0%)
3    libpng   guliverkli2            131       71 (54.2%)    0 (0.0%)      60 (45.8%)   0 (0.0%)
4    libpng   fs2open                57        57 (100.0%)   0 (0.0%)      0 (0.0%)     0 (0.0%)
5    libpng   v8monkey               275       183 (66.5%)   18 (6.5%)     74 (26.9%)   0 (0.0%)
6    libpng   Haiku-services-branch  306       218 (71.2%)   1 (0.3%)      87 (28.4%)   0 (0.0%)
7    libcurl  Enemy-Territory        210       150 (71.4%)   26 (12.4%)    0 (0.0%)     34 (16.2%)
8    libcurl  doom3.gpl              190       159 (83.7%)   1 (0.5%)      0 (0.0%)     30 (15.8%)
     Total                           1394      1004 (72.0%)  46 (3.3%)     280 (20.1%)  64 (4.6%)
TABLE III. THE RESULT OF CONTENT COMPARISON OF CONSISTENT REUSE INSTANCES

#ID  Destination            Not Modified  Modified
1    cocos2d-iphone         127           0
2    apitrace               39            0
3    guliverkli2            68            3
4    fs2open                57            0
5    v8monkey               63            120
6    Haiku-services-branch  218           0
7    Enemy-Territory        54            96
8    doom3.gpl              156           3
     Total                  782           222
B. Results
We have applied our tool to eight project pairs. Table II shows the result of the classification. The column "Reported" shows the number of reuse instances extracted by our approach. The columns "Consistent," "Inconsistent," "Unrecorded," and "Redundant" show the classified results. Our tool reported 1394 pairs as instances of source code reuse.
1) Consistent Instances: 1004 instances (72.0%) of the
reported 1394 instances are consistent with the recorded
information. To analyze how often developers modified file
revisions after copying, we have compared the contents of
source file revisions and destination revisions. In Table III,
the consistent instances are classified into two groups: Not
Modified and Modified. A reuse instance is classified as Not
Modified if the similarity metric value of the pair is 1.0. In
other words, the differences are limited to code comments
and white space. Otherwise, a reuse instance is classified as
Modified. Our results indicate that projects #5 (V8Monkey) and #7 (Wolfenstein: Enemy Territory) often modified the source code after copying from the source repositories. As described in Section II, V8Monkey modified copied revisions to handle animated PNG. Wolfenstein also modified various functions in the files. Except for these projects, developers tend to reuse source code as is.
2) Inconsistent Instances: 46 instances (3.3%) are incon-
sistent with the recorded information. While this group may
include reuse instances incorrectly reported by the tool, it also
TABLE IV. THE RESULT OF CONTENT COMPARISON OF UNRECORDED REUSE INSTANCES

      Reuse w/o Version Number    Others
#ID   Not Modified  Modified      Not Modified  Modified
1     0             0             20            0
2     39            0             0             0
3     0             0             56            4
4     0             0             0             0
5     14            38            9             13
6     21            0             42            24
7     0             0             0             0
8     0             0             0             0
Total 74            38            127           41
includes reuse instances incorrectly recorded by developers. By analyzing the instances, we have identified potential problems in the destination repositories. The first example is found in project #5. A commit message says that files are updated to libpng 1.2.31, whereas some of the committed file revisions are included in either 1.0.38 or 1.2.30. Those files were presumably recorded as 1.2.31 partly because the files in these versions have the same contents except for code comments. The second example is png.h in project #6, which is described in Section II. The content of the file is the same as 1.2.30 but recorded as 1.2.31.
3) Unrecorded Instances: 280 instances (20.1%) are un-
recorded because of the lack of version numbers in commit
messages. 112 of them are described as source code reuse in the
commit messages, although no version numbers are given. The
remaining 168 instances have no explicit information about
source code reuse. We have compared the contents of source
revisions and destination revisions and classified the instances
into two categories: Not Modified and Modified. A reported
reuse instance is Not Modified if the similarity of the source
revision and the destination revision is 1.0. Otherwise, a reuse
instance is Modified. The result is shown in Table IV. The
columns “Reuse w/o Version Number” show the number of
reuse instances whose commit message mentioned source code
reuse. The columns “Others” show the number of other reuse
instances. The 201 “Not Modified” instances likely represent
instances of source code reuse, since the file revisions have the
same contents between repositories. Although some instances
may point to different source versions from actual versions,
we expect that most of them are correct, as the number
of consistent instances is 22 times greater than the number
of inconsistent instances.
Developers did not record version numbers for these in-
stances partly because the developers focused on source code
management issues rather than source code reuse. For example,
project #1 updated libpng copies with the message “v0.99.1 release
tag.” The project #3 has a similar commit updating libpng
files with a message “Guliverkli revision 611.” In addition,
developers did not record version numbers when they did not
change the contents of revisions. For example, project #2 and
#5 moved libpng copies in their repositories. They recorded
only what they did, e.g. “Move libpng to media/libpng,”
without version numbers. The version numbers of those files
recovered by our tool would help developers understand
which files are included in the repositories.
4) Redundant Instances: We have identified 64 redundant
reuse instances in projects #7 and #8. One cause is the lack of
history of file moving and renaming in the libcurl repository.
For example, the file getpass.c is located in the lib directory
in libcurl 7.10.6 and in the src directory in libcurl 7.11.2.
Our tool could not identify the revision in src as a successor
version of the file in the lib directory. Hence, our tool
reported two instances of source code reuse for a single version
of the file getpass.c in the project #7. In the analysis, we
manually identified the correct version, because the project #7
stores the file revision in the libcurl-7.12.2/src directory.
In addition, the file content is the same as src/getpass.c
in 7.11.2.
Redundant instances are also reported when two or more
similar files are included in a source repository. For example,
lib500.c and lib501.c in libcurl 7.12.2 defined test
functions whose contents are almost the same except for two
lines. Since the files are not linked in a repository, our tool
reported two redundant reuse instances: lib500.c is reused
as lib501.c, and vice versa. As in the first case, we
have manually identified correct reuse instances by file paths.
C. Precision
We calculate precision of our tool by comparing the
reported instances with the recorded information as follows.
P = c / (c + i + r)

where c is the number of consistent reuse instances (1004),
i is the number of inconsistent reuse instances (46), and r
is the number of redundant reuse instances (64). The resulting
value is P = 0.901. The number of unrecorded reuse instances
is excluded from this calculation because we cannot verify
whether they are actual source code reuse or not.
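Plugging in the reported counts confirms the precision value; a quick check:

```python
# Counts reported in the evaluation: consistent, inconsistent, and
# redundant reuse instances (unrecorded instances are excluded).
c = 1004  # instances consistent with recorded version numbers
i = 46    # instances contradicting the recorded information
r = 64    # redundant reports for a single reused file

precision = c / (c + i + r)
print(round(precision, 3))  # 0.901
```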
We have assumed that similar source files are source code
reuse, but the assumption is not always true. One example
is md4.c in the project #7. While all the analyzed projects
put their library copies in directories having special names,
e.g., external/libpng/ and thirdparty/libpng; this
is the only reported case of a file revision outside such
a directory. Hence, the file is likely third-party code used
by the project #7 and libcurl. Another case is a file named
lib/getdate.c in the project #8.

TABLE V. THE NUMBER OF UNREPORTED FILE REVISIONS INCLUDED IN DIRECTORIES THAT CONTAIN COPIES FROM LIBRARIES

 #ID     #Revisions     Source     No Source
  1           5            5           0
  2           3            0           3
  3           0            0           0
  4           2            0           2
  5          19            1          18
  6           9            9           0
  7           6            6           0
  8           3            3           0
 Total       47           24          23

The file is generated by
a code generator during the build process of libcurl. Since
the file is involved in both source and destination repositories,
our tool reported the file as an instance of reuse even though
developers did not directly copy the file. We consider these to be
false positives of our approach. However, we could not verify
the latter case in the experiment: the two file revisions
are involved in both repositories and marked as reuse by
developers, so there are no differences from other consistent
instances. We identified the case only because the source
repository included a commit message stating that the file is a
generated one that would be removed in a future release. We
could not count the number of such false positives
because of our limited resources.
D. Recall
Although we cannot know the number of files actually
reused, developers tend to put copies of a library in the same
directory. Indeed, only 12 directories and their subdirectories
are used for the 8 projects. We have analyzed all the file
revisions in the directories, because they are also likely copies
from the library. Table V shows the result. The column
“#Revisions” shows the number of file revisions that are not
reported by our tool. The file revisions are classified into
“Source” and “No Source” according to whether the same file
name is found in a source repository or not.
The file names of 24 file revisions are found in source
repositories. 12 of 24 revisions include only comments instead
of source code. Our tool simply removed such files from the
analysis. One example is pnggvrd.c found in the project
#1. The file includes only a single comment as follows.
/*pnggvrd.c was removed
from libpng-1.2.20. */
Other file revisions are modified from the original version, and
their similarity metric values are less than the threshold.
For example, the similarity between a copy of pngconf.h in
the project #6 and its most similar revision in libpng is 0.77.
14 of those 24 revisions are recorded as reuse in their commits.
Hence, they are likely false negatives accidentally missed by
our tool.
23 file revisions are not found in source repositories. The
file revisions are associated with two files: pnglibconf.h
and mozpngconf.h. The former file is created by a script
in the libpng project, although the project #4 recorded the re-
visions as source code reuse from libpng. The latter file is
created by the Mozilla project to replace function names for reuse
in the project.

TABLE VI. THE NUMBER OF COMMITS INCLUDING INSTANCES OF SOURCE CODE REUSE

 #ID     #Total     #Consistent     #Inconsistent     #Unrecorded
  1         8            7                0                1
  2         4            2                0                2
  3         8            4                0                4
  4         3            3                0                0
  5        25           10                4               11
  6        23           17                1                5
  7         1            0                1                0
  8         1            0                1                0
 Total     73           43                7               23

[Figure: bar chart; y-axis “Ratio”, ranging from 0.00 to 0.35]
Fig. 6. The ratio of inconsistent instances against the number of reuse
instances in a commit

For example, a macro replaces a function
png_read_data in libpng with MOZ_PNG_read_data.
We do not regard these file revisions as false negatives because
they are not copied from source repositories.
Assuming there are no potential copies in other directories,
the recall of our tool can be estimated as follows:
R_estimated = c / (c + i + n)

where c is the number of consistent pairs (1004), i is the
number of inconsistent pairs (46), and n is the number of
false negatives we identified (14). The resulting value is
R_estimated = 0.943.
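As with precision, the estimate follows directly from the counts; note that the exact quotient 1004/1064 is approximately 0.9436, which the text reports as 0.943:

```python
# Counts used in the recall estimate.
c = 1004  # consistent pairs
i = 46    # inconsistent pairs
n = 14    # false negatives found by manual inspection of directories

recall_estimated = c / (c + i + n)  # 1004 / 1064, approximately 0.9436
```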
E. Commit-level Analysis
Since developers may copy a number of files at once
from a source repository, we have analyzed the number of
commits including reuse instances for each project. Table VI
shows the result. The column “#Total” indicates the number of
commits that include at least one reuse instance. The column
“#Consistent” indicates the number of commits with correct
version numbers. The column “#Inconsistent” indicates the
number of commits including at least one inconsistent reuse
instance. The column “#Unrecorded” indicates the number
of commits without version numbers. The result shows that
an automatic analysis is important for developers, since 23
(31.5%) of 73 commits do not include version numbers.
Inconsistent reuse instances are always committed with
several consistent reuse instances. Figure 6 shows the ratio of
inconsistent instances against the number of reuse instances
in a commit. For example, if a commit updated three files
obtained from a library but one of them was copied from
another version of the library, the ratio is 0.33. The result
shows that in 7 (9.6%) of 73 commits, 18.1% of the files were
accidentally copied from a different version of a library. In such a case,
our tool reports different version numbers for each file in a
commit. Consequently, developers can recognize such potential
problems using our tool.
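The per-commit ratio plotted in Figure 6 is a simple fraction; a minimal sketch, with the function name our own:

```python
def inconsistent_ratio(flags: list) -> float:
    """Ratio of inconsistent reuse instances in one commit.
    flags: one boolean per reuse instance; True marks an inconsistent one."""
    return sum(flags) / len(flags)

# A commit updating three files, one of which was copied from
# another version of the library:
print(round(inconsistent_ratio([False, False, True]), 2))  # 0.33
```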
TABLE VII. THE EXECUTION TIME

 #ID     Execution Time     #LCS Computed
  1      40 min 51 sec            87,280
  2      55 min  6 sec            76,066
  3      38 min 13 sec            77,525
  4      23 min 43 sec            28,378
  5     225 min 33 sec           307,910
  6     139 min 45 sec           175,588
  7       5 min 26 sec            10,691
  8       4 min 35 sec            10,162
F. Performance
Our tool compares all file revision pairs between repos-
itories in the worst case, although we have employed an
optimization. Table VII shows the time taken to execute the
analysis using a single thread on Intel(R) Xeon(R) CPU E5-
1603 2.80GHz. The time does not include the time to copy Git
repositories to a local storage. We have two large repositories
in the analysis. Repository #5 involves 87,435 commits. The
total amount of code we have analyzed is 62.9 MLOC. Repos-
itory #6 involves 44,842 commits. The total amount of code
we have analyzed is 37.8 MLOC. Our tool finished analysis
for those large products in several hours. The time depends
on the number of similar file pairs between repositories that
require computing the longest common subsequence between
file revisions. Table VII indicates the number of computed
longest common subsequences. The numbers are much smaller
than the number of possible revision pairs between repositories
because of our optimization. In addition, the computed results
can be reused for an incremental analysis, since similarity
values between existing file revisions do not change in the
future.
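The incremental reuse of similarity values mentioned above could be realized with a simple content-addressed cache; the sketch below, with names of our own choosing, keys the memo on SHA-1 content hashes (Git blob ids are likewise SHA-1 based):

```python
import hashlib

# Cache of similarity values keyed by (source hash, destination hash).
_similarity_cache: dict = {}

def blob_id(content: bytes) -> str:
    """Content hash identifying one file revision."""
    return hashlib.sha1(content).hexdigest()

def cached_similarity(src: bytes, dst: bytes, compute) -> float:
    """Memoize the similarity per revision pair: values between existing
    revisions never change, so later incremental runs can reuse them."""
    key = (blob_id(src), blob_id(dst))
    if key not in _similarity_cache:
        _similarity_cache[key] = compute(src, dst)
    return _similarity_cache[key]
```

On a second run over the same pair, `compute` is never invoked, which is what makes the incremental analysis cheap.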
G. Threats to Validity
The precision of our approach is computed against the
information recorded in the repositories. If developers of the
analyzed projects often copied file revisions without recording
this and subsequently modified most copies significantly, such
modified files are not taken into account. As described in
Section IV-E, several false positives may be regarded as true
positives in our analysis. On the other hand, inconsistent
reuse instances are regarded as false positives. Some of them
could not be verified because of incorrectly recorded version
numbers.
We have used a single threshold value th = 0.8 in our
implementation. While another threshold value may change the
detection result, we believe the choice is not critical
because most of the consistent reuse instances are file copies
without modification. Indeed, our manual analysis of direc-
tories identified only 14 false negatives that are accidentally
filtered out.
We have selected two library projects for source repos-
itories: libpng for manipulating image files and libcurl for
transferring data using various protocols. Both are utility
libraries used to implement application features. The source code
reuse activities of developers in the eight projects may be limited
to a particular style of using such a utility library.
V. RELATED WO RK
A. Software Product Lines
Software Product Line Engineering involves much forking
and copying of files across variants of a software product.
Tracing similarities between variants helps in selecting
the most appropriate variant. Duszynski [12] proposed
Variant Analysis to compare source code of product variants
to understand product-specific features and common (reused)
features.
Hemel [13] showed that the tracing of the evolution also
has reverse and forward engineering benefits. Nonaka [14]
visualized the relationships of variants and analyzed corrective
maintenance data for a particular product family. Yamamoto
[15] defined similarity between source code of software prod-
ucts. Kanda [7] defined similarity between software products
using a file-level similarity metric to identify the origin of
the variants with their evolution. We have employed the file
similarity metric to compare file revisions. Kanda [7] also in-
troduced an optimization using term frequency vectors to avoid
unnecessary comparison of files. We normalize identifiers to
compute a fixed-length term frequency vector for source code.
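One plausible reading of this optimization is sketched below: identifiers collapse into a single "$id" term (and numbers into "$num"), giving every file a comparable fixed-length vector, and a cheap cosine check then filters out pairs that cannot be similar before any expensive LCS is computed. The vocabulary and names here are our own illustration, not the paper's actual vector definition.

```python
import re

# Fixed vocabulary: normalized placeholders plus a few common tokens
# (illustrative only; a real tool would use a larger fixed vocabulary).
VOCAB = ("$id", "$num", "if", "return", "=", ";", "(", ")", "{", "}")

def tf_vector(source: str) -> list:
    """Fixed-length term frequency vector with normalized identifiers."""
    counts = {term: 0 for term in VOCAB}
    for tok in re.findall(r"\w+|[^\w\s]", source):
        if tok[0].isdigit():
            term = "$num"
        elif tok[0].isalpha() or tok[0] == "_":
            term = tok if tok in VOCAB else "$id"
        else:
            term = tok
        if term in counts:
            counts[term] += 1
    return [counts[t] for t in VOCAB]

def cosine(a: list, b: list) -> float:
    """Cheap vector similarity used to skip unpromising file pairs."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

Because identifiers are normalized, renaming a variable does not change the vector: `tf_vector("if (x) return y;")` equals `tf_vector("if (a) return b;")`.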
Software Product Line targets reuse between variants of
the same products or from projects originating from the same
family. Ray [16] analyzed how forked projects import source
code changes from other projects. While their analysis compares
patches between projects, our analysis compares file revisions
between projects. One reason is that files copied from a library
are updated less frequently.
B. Code Clones and Origin Analysis
Similar fragments of code are considered to be code clones.
The longest common subsequence has also been used in
suffix-based clone detection tools. Recently, clone detection
efforts have concentrated on improving the state of the art in
scalability, speed, and precision [17], [18].
Other studies have performed cross-project clone detection,
albeit with different intentions. Bauer [19] investigated the
extraction of similar pairs of files between projects to create
a library. Our approach extracts source files copied from a
library. Kim [20] coined clone genealogy to help understand
clone management and refactoring. Our analysis might enable
a genealogy analysis between projects, since the output of our
tool links the history of files in a library and their copies in an
application. Krinke [21] proposed to use version information
in repositories to distinguish original code from its clones
across projects. Unlike code clone detection tools, we employ
source code similarity without normalizing identifiers, as
described in Section II. Al-Ekram [22] and Kawaguchi
[23] studied patterns of cloning. Ichi-tracker [24] is an
example of the many code search systems that search for
clones across various repositories on the Internet. While Ichi-
tracker extracts similar file revisions in repositories, it does
not tell which one is likely the original revision. Our analysis
employs heuristics for identifying an original file revision.
According to Godfrey [25], the merging and splitting of
source code entities are common activities during the lifespan
of a software system. During this process, the original context
may be lost over time. They then show how tracking origins
is beneficial for ownership, code comprehension, refactoring
and software evolution research. Hashimoto [26] proposed an
AST-based comparison of file revisions to track co-evolution
of two branches in a project. Our analysis enables tracking the
history of source code reuse between two projects. German
and colleagues [27] proposed Software Bertillonage for tracing
the licensing implications of copied code fragments.
C. Software Reuse
Software reuse has become standard practice in
software engineering. Most research has focused on the social
aspects, extent, and nature of software reuse. Most studies borrow code
clone detection tools for identifying actual source code reuse.
Due to the complex nature of white-box reuse, manual
verification is usually required. For example, Heinemann [28]
analyzed white-box source code reuse among Java projects by
manually inspecting source code with the result of a code clone
detection tool. Xia [3], [11] also manually analyzed how source
code is reused, while the source code was obtained by a clone
detection technique. In this study, we present and evaluate
an automatic detection technique for white-box reuse. The
manual analyses conducted in [3], [11] can be automated by
our tool. In those studies, an original file revision was manually
identified in files reported by a clone detection tool. Our tool
can automatically identify the most similar file revision as the
origin of the reused file.
German et al. [29] analyzed code siblings copied across
open source software projects. They detected code clones
for certain releases and then investigated their history when
the code clones are introduced. Our tool analyzes the entire
repository of projects so that developers can analyze file
revisions that are modified after copy.
VI. CONCLUSION
Developers often reuse source code developed by another
project. Using a source code similarity metric, we have auto-
matically extracted revision pairs that are likely source code
reuse. In the experiment, we have extracted 1394 revision pairs
from eight project pairs. The estimated precision and recall of
the tool are 0.901 and 0.943. 1004 (72.0%) of the pairs are
consistent with the information recorded in the repositories.
We have identified several inconsistent pairs that are caused by
incorrect information in the repositories. 201 unrecorded reuse
instances pointed to file revisions in source repositories whose
contents are the same as revisions in destination repositories.
31.5% of commits for source code reuse have no version
numbers in the commit messages.
Developers may use the tool as a static checker before a
release to avoid security vulnerability and known issues of a
library in their software. Even if developers did not record a
version number of a library, our tool reports version numbers
of the library using the source code in their repository. The
version numbers help developers to decide whether files reused
from the library should be updated or not. The tool also reports
a similarity between a reused file and its original file so that
developers can integrate their project-specific enhancement
into the latest version of the library.
In future work, we plan to improve the execution time of the
tool. We would also like to automatically identify how a file
revision is modified in a destination repository. Since developers
may import small changes such as security fixes without changing
the rest of the source code, automatically generating an explanation
of how a copy has been modified from its most similar revision
would be valuable for developers analyzing the current status of
their project's source code. Finally,
we are also interested in cross-project analysis including more
than two projects. Because our current implementation reports
the same library file used in two projects as source code reuse,
we would like to include multiple projects in analysis so that
our tool can report more accurate and useful results.
ACKNOWLEDGMENT
We would like to thank Dr. Daniel M. German for his
valuable comments on this research.
The work was supported by JSPS KAKENHI No.25220003
and Osaka University Program for Promoting International
Joint Research.
REFERENCES
[1] J. Rubin, A. Kirshin, G. Botterweck, and M. Chechik, “Managing forked
product variants,” in Proceedings of the 16th International Software
Product Line Conference, 2012, pp. 156–160.
[2] T. Mende, R. Koschke, and F. Beckwermert, “An evaluation of code
similarity identification for the grow-and-prune model,” Journal of
Software Maintenance and Evolution, vol. 21, no. 2, pp. 143–169, 2009.
[3] P. Xia, M. Matsushita, N. Yoshida, and K. Inoue, “Studying reuse of
out-dated third-party code in open source projects,” JSSST Computer
Software, vol. 30, no. 4, pp. 98–104, 2013.
[4] P. Jablonski and D. Hou, “Aiding software maintenance with copy-
and-paste clone-awareness,” in Proceedings of the 18th International
Conference on Program Comprehension, 2010, pp. 170–179.
[5] D. Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge
University Press, 1997.
[6] C. K. Roy and J. R. Cordy, “NiCad: Accurate detection of near-miss
intentional clones using flexible pretty-printing and code normalization,”
in Proceedings of the 16th International Conference on Program
Comprehension, 2008, pp. 172–181.
[7] T. Kanda, T. Ishio, and K. Inoue, “Extraction of product evolution
tree from source code of product variants,” in Proceedings of the 17th
International Software Product Line Conference, 2013, pp. 141–150.
[8] Lucia, F. Thung, D. Lo, and L. Jiang, “Are faults localizable?” in
Proceedings of the 9th Working Conference on Mining Software Repos-
itories, 2012, pp. 74–77.
[9] C. K. Roy and J. R. Cordy, “Scenario-based comparison of clone detec-
tion techniques,” in Proceedings of the 16th International Conference
on Program Comprehension, 2008, pp. 153–162.
[10] S. Wu, U. Manber, and G. Myers, “An O(NP) sequence comparison
algorithm,” Information Processing Letters, vol. 35, no. 6, pp. 317–323,
1990.
[11] P. Xia, “An empirical study of out-dated third-party code in open source
software,” Master’s thesis, Osaka University, 2013.
[12] S. Duszynski, J. Knodel, and M. Becker, “Analyzing the source code
of multiple software variants for reuse potential,” in Proceedings of the
18th Working Conference on Reverse Engineering, 2011, pp. 303–307.
[13] A. Hemel and R. Koschke, “Reverse engineering variability in source
code using clone detection: A case study for Linux variants of consumer
electronic devices,” in Proceedings of the 19th Working Conference on
Reverse Engineering, 2012, pp. 357–366.
[14] M. Nonaka, K. Sakuraba, and K. Funakoshi, “A preliminary analysis
on corrective maintenance for an embedded software product family,”
IPSJ SIG Technical Report, vol. 2009-SE-166, no. 13, pp. 1–8, 2009.
[15] T. Yamamoto, M. Matsushita, T. Kamiya, and K. Inoue, “Measuring
similarity of large software systems based on source code correspon-
dence,” in Proceedings of the 6th International Conference on Product
Focused Software Process Improvement, 2005, pp. 530–544.
[16] B. Ray and M. Kim, “A case study of cross-system porting in forked
projects,” in Proceedings of the ACM SIGSOFT 20th International
Symposium on the Foundations of Software Engineering, 2012, pp. 1–
11.
[17] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: a multilinguistic
token-based code clone detection system for large scale source code,”
IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654–
670, 2002.
[18] N. Schwarz, M. Lungu, and R. Robbes, “On how often code is
cloned across repositories,” in Proceedings of the 34th International
Conference on Software Engineering, 2012, pp. 1289–1292.
[19] V. Bauer and B. Hauptmann, “Assessing cross-project clones for reuse
optimization,” in Proceedings of the 8th International Workshop on
Software Clones, 2013, pp. 60–61.
[20] M. Kim, V. Sazawal, D. Notkin, and G. Murphy, “An empirical study
of code clone genealogies,” in Proceedings of the 10th European
Software Engineering Conference Held Jointly with 13th International
Symposium on Foundations of Software Engineering, 2005, pp. 187–
196.
[21] J. Krinke, N. Gold, Y. Jia, and D. Binkley, “Cloning and copying be-
tween GNOME projects,” in Proceedings of the 7th Working Conference
on Mining Software Repositories, May 2010, pp. 98–101.
[22] R. Al-Ekram, C. Kapser, R. C. Holt, and M. W. Godfrey, “Cloning by
accident: an empirical study of source code cloning across software sys-
tems,” in Proceedings of the 4th International Symposium on Empirical
Software Engineering, 2005, pp. 376–385.
[23] S. Kawaguchi, P. K. Garg, M. Matsushita, and K. Inoue, “MUDABlue:
an automatic categorization system for open source repositories,” Jour-
nal of Systems and Software, vol. 79, no. 7, pp. 939–953, 2006.
[24] K. Inoue, Y. Sasaki, P. Xia, and Y. Manabe, “Where does this code
come from and where does it go? – integrated code history tracker
for open source systems –,” in Proceedings of the 34th International
Conference on Software Engineering, 2012, pp. 331–341.
[25] M. Godfrey and L. Zou, “Using origin analysis to detect merging
and splitting of source code entities,” IEEE Transactions on Software
Engineering, vol. 31, no. 2, pp. 166–181, 2005.
[26] M. Hashimoto and A. Mori, “A method for analyzing code homol-
ogy in genealogy of evolving software,” in Proceedings of the 13th
International Conference on Fundamental Approaches to Software
Engineering, 2010, pp. 91–106.
[27] J. Davies, D. M. German, M. W. Godfrey, and A. Hindle, “Software
bertillonage: Finding the provenance of an entity,” in Proceedings of
the 8th Working Conference on Mining Software Repositories, 2011,
pp. 183–192.
[28] L. Heinemann, F. Deissenboeck, M. Gleirscher, B. Hummel, and
M. Irlbeck, “On the extent and nature of software reuse in open source
java projects,” in Proceedings of the 12th International Conference on
Top Productivity Through Software Reuse, 2011, pp. 207–222.
[29] D. M. German, M. D. Penta, Y.-G. Gueheneuc, and G. Antoniol, “Code
siblings: Technical and legal implications of copying code between
applications,” in Proceedings of the 6th Working Conference on Mining
Software Repositories, 2009, pp. 81–90.
Conference Paper
Full-text available
Organizational structures (e.g., separate accounting, heterogeneous infrastructure, or different development processes) can restrict systematic reuse among projects within companies. As a consequence, code is often copied between projects, which increases maintenance costs and can cause failures due to inconsistent bug fixing. Assessing cross-project clones helps to uncover organizational obstacles for code reuse and to leverage other ways of systematic reuse. Furthermore, knowing how strongly clones are entangled with the surrounding code helps to decide if and how to extract them to commonly used libraries. We propose to combine cross-project clone detection and dependency analyses to detect (1) what is cloned between projects, (2) how far the cloned code is entangled with the surrounding system, and (3) what are candidates for extraction into common libraries.
Conference Paper
Full-text available
The Consumer Electronics Working Group (CEWG) in the Linux Foundation has identified several problems in the re-use process of embedded Linux software for consumer electronic devices. Among these is the increasing fragmentation of Linux derivatives. Vendors of electronic devices copy the Linux sources and make their modifications to adapt it to their own devices, but fail to back port their modifications to the mainstream Linux sources. Likewise, later improvements of the Linux sources are not integrated into the vendors' variants. CEWG launched the Long Term Support Initiative (LTSI) for an industry-managed tree of the Linux sources, maintained by CEWG, that is based on the long-term stable kernel tree annually updated with the latest mainstream kernel version to address their needs. In order to justify this initiative, CEWG asked us to investigate whether, and if so how much, non-upstream code can be found in industry products, and in which parts of the kernel. We used large-scale clone detection techniques to compare various Linux versions to their vendor-specific variants. We found many changes that were not back ported. Some of these changes were even found in Linux subsystems where neither we nor people from the Linux Foundation would expect them. We also found instances of defects fixed in the mainstream kernel that were not integrated into the vendors' code. Overall, our investigation provides enough evidence to support the need for an LTSI and better collaboration among Linux developers, both of the mainstream and of the vendor variants.
Article
Full-text available
Many fault localization techniques have been proposed to facilitate debugging activities. Most of them attempt to pinpoint the location of faults (i.e., localize faults) based on a set of failing and correct executions and expect debuggers to investigate a certain number of located program elements to find faults. These techniques thus assume that faults are localizable, i.e., only one or a few lines of code that are close to one another are responsible for each fault. However, in reality, are faults localizable? In this work, we investigate hundreds of real faults in several software systems, and find that many faults may not be localizable to a few lines of code and these include faults with high severity level.
Article
This paper presents an approach to automatically distinguish the copied clone from the original in a pair of clones. It matches the line-by-line version information of a clone against that of the pair's other clone. A case study on the GNOME Desktop Suite revealed a complex flow of reused code between the different subprojects. In particular, it showed that the majority of larger clones (with a minimum size of 28 lines or more) exist between the subprojects, and that more than 60% of the clone pairs can be automatically separated into original and copy.
Article
Reusing existing source code as third-party code to build new software systems has become very popular these days. However, much existing code keeps being updated throughout its life cycle. Different versions of code, even outdated ones, are reused by other software and spread all over the world. This paper presents an empirical study on the reuse of out-dated third-party source code from several famous open source libraries. Given target source code, using repository mining techniques and file clone detection techniques, we identified the different versions of that code in other user projects, and recovered the vulnerability information of the out-dated versions. We also investigated how user projects manage their code. The result shows that a large proportion of open source projects reuse out-dated third-party code, and that many of them are not well managed.
Article
Polypropylene membrane with 71% porosity was prepared for PEMFC because of its low cost and easy handling. The pores and porosity were controlled by altering the polypropylene concentration and the extraction rate of camphene from the membrane in supercritical CO2. The average pore size in the membrane was about 2-3 μm, and the porosities were 80, 76, and 71% with 10, 20, and 30 wt% polypropylene, respectively. The breaking points of the polypropylene membrane with 10, 20, and 30 wt% polypropylene were 0.17, 0.24, and 0.46 kgf/mm2, respectively. The optimum conditions for the camphene extraction were 45°C and 150 bar for 10 min. The thickness of the polypropylene membrane was 70 ±3 μm, and that of the composite membrane impregnated with Nafion solution was 105 ±3 μm. The water uptake and ion conductivity of the polypropylene composite membrane were 25 ±3% and 0.0030 ±0.0005 S/cm, respectively.
Conference Paper
A large number of software products may be derived from an original single product. Although software product line engineering is advocated as an effective approach to maintaining such a family of products, re-engineering existing products requires developers to understand the evolution history of the products. This can be challenging because developers typically only have access to product source code. In this research, we propose to extract a Product Evolution Tree that approximates the evolution history from source code of products. Our key idea is that two successive products are the most similar to one another in the evolution history. We construct a Product Evolution Tree as a minimum spanning tree whose cost function is defined by the number of similar files between products. As an experiment, we extracted Product Evolution Trees from 6 datasets of open-source projects. The result showed that 53% to 92% of edges in the extracted trees were consistent with the actual evolution history of the projects.
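The Product Evolution Tree idea above can be illustrated with a small, hypothetical sketch: treat each product as a node, weight an edge by the number of files that differ between two products, and take a minimum spanning tree so that the most similar products end up adjacent. The products and file contents below are toy data, and the cost function is a simplified stand-in for the paper's similar-file count.

```python
def diff_count(files_a: dict, files_b: dict) -> int:
    """Number of file paths whose content differs or that exist in only one product."""
    paths = set(files_a) | set(files_b)
    return sum(files_a.get(p) != files_b.get(p) for p in paths)

def mst_edges(products: dict):
    """Prim's algorithm over the complete product-similarity graph."""
    names = list(products)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the cheapest edge crossing from the tree to a new product.
        cost, u, v = min((diff_count(products[u], products[v]), u, v)
                         for u in in_tree for v in names if v not in in_tree)
        edges.append((u, v))
        in_tree.add(v)
    return edges

products = {
    "v1":   {"a.c": "x", "b.c": "y"},
    "v2":   {"a.c": "x", "b.c": "y2"},              # one file changed from v1
    "fork": {"a.c": "x", "b.c": "y2", "c.c": "z"},  # one file added to v2
}
print(mst_edges(products))  # [('v1', 'v2'), ('v2', 'fork')]
```

The tree connects v2 to v1 and the fork to v2, matching the intuition that successive products in the evolution history are the most similar pairs.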
Article
Detecting code duplication in large code bases, or even across project boundaries, is problematic due to the massive amount of data involved. Large-scale clone detection also opens new challenges beyond asking for the provenance of a single clone fragment, such as assessing the prevalence of code clones on the entire code base, and their evolution. We propose a set of lightweight techniques that may scale up to very large amounts of source code in the presence of multiple versions. The common idea behind these techniques is to use bad hashing to get a quick answer. We report on a case study, the Squeaksource ecosystem, which features thousands of software projects, with more than 40 million versions of methods, across more than seven years of evolution. We provide estimates for the prevalence of type-1, type-2, and type-3 clones in Squeaksource.
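A rough sketch of the "bad hashing" idea, under assumptions of mine rather than the paper's actual pipeline: normalize each fragment (collapse whitespace for type-1-style matching; additionally rename identifiers for type-2-style matching) and bucket it by a cheap hash. Fragments sharing a bucket are candidate clones, and the hash is allowed to be imprecise because only an estimate of clone prevalence is needed. The normalization rules and names below are illustrative.

```python
import re
from collections import defaultdict

def normalize(code: str, rename_identifiers: bool = False) -> str:
    """Collapse whitespace; optionally map every identifier-like token to ID."""
    code = re.sub(r"\s+", " ", code).strip()
    if rename_identifiers:  # crude type-2 normalization
        code = re.sub(r"\b[A-Za-z_]\w*\b", "ID", code)
    return code

def candidate_clones(fragments: dict, rename_identifiers: bool = False):
    """Bucket fragments by a cheap hash of their normalized form."""
    buckets = defaultdict(list)
    for name, code in fragments.items():
        buckets[hash(normalize(code, rename_identifiers))].append(name)
    # Any bucket with more than one member is a candidate clone group.
    return [group for group in buckets.values() if len(group) > 1]

fragments = {
    "m1": "int f(int a) { return a + 1; }",
    "m2": "int f(int a)  {  return a + 1; }",   # type-1: layout change only
    "m3": "int g(int x) { return x + 1; }",     # type-2: identifiers renamed
}
print(candidate_clones(fragments))        # [['m1', 'm2']]
print(candidate_clones(fragments, True))  # [['m1', 'm2', 'm3']]
```

Hash collisions between unrelated fragments would inflate the counts slightly, which is acceptable when the goal is a fast estimate over tens of millions of method versions rather than an exact clone report.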
Article
When we reuse a code fragment from an open source system, it is very important to know the history of the code, such as its origin and evolution. In this paper, we propose an integrated approach to code history tracking for open source repositories. The approach takes a query code fragment as its input and returns code fragments that contain clones of the query code, utilizing publicly available code search engines as external resources. Based on this model, we have designed and implemented a prototype system named Ichi Tracker. Using Ichi Tracker, we have conducted three case studies. These case studies reveal the ancestors and descendants of the code, allowing us to recognize its evolution history.