Normalizing Source Code Vocabulary
Dawn Lawrie Dave Binkley Christopher Morrell
Loyola University Maryland
Baltimore MD
21210-2699, USA
{lawrie, binkley}@cs.loyola.edu, chm@loyola.edu
Keywords: source code analysis tools, information retrieval, program comprehension
Abstract
Information Retrieval (IR) based tools complement tra-
ditional static and dynamic analysis tools by exploiting the
natural language found within a program’s text. Tools in-
corporating IR have tackled problems, such as feature lo-
cation, that previously required considerable human effort.
However, to reap the full benefit of IR-based techniques, the
language used across all software artifacts (e.g., require-
ment and design documents, test plans, as well as the source
code) must be consistent. Vocabulary normalization aligns
the vocabulary found in source code with that found in other
software artifacts. Normalization both splits an identifier
into its constituent parts and expands each part into a full
dictionary word to match vocabulary in other artifacts.
An algorithm for normalization is presented. Its cur-
rent implementation incorporates a greatly improved split-
ter that exploits a collection of resources including several
dictionaries, frequency distributions derived from the cor-
pus of programs, and co-occurrence data. Empirical study
of this new splitter, GenTest, on almost 8000 identifiers
finds that it correctly splits 82%, outperforming the current
state-of-the-art. A preliminary experiment with the normal-
ization algorithm finds it improving the FLAT^3 feature lo-
cator’s scores of relevant code from 0.60 to 0.95 on a scale
from 0 to 1.
1 Introduction
Best known for its use by search engines on the Internet,
IR encompasses a growing collection of techniques that ap-
ply to large repositories of natural language [27]. Recent
research has found that software systems contain significant
and useful natural language information [2, 9, 23, 24, 29,
35, 36]. Furthermore, exploiting this information comple-
ments existing tools and techniques based on the structure
of a program (for example, tools exploiting data and control
dependence information). Many of the challenges faced by
software engineers as they attempt to recover information
from existing software projects can be effectively addressed
using IR techniques applied to the unstructured text found
in the source code and its associated documents [33]. IR-
based tools have tackled problems previously requiring con-
siderable human effort. Examples include (re)establishing
links between a program and its documentation [2], devel-
oping software metrics [31], and performing feature loca-
tion [35, 36].
Most tools that leverage IR techniques make the implicit
assumption that software artifacts (particularly source code)
contain exploitable natural language information. How-
ever, at present, there is a debilitating mismatch between
the vocabulary used in source code and that used in other
software artifacts (e.g., the design and requirements docu-
ments). The mismatch stems from identifiers being writ-
ten in what amounts to a different language than the rest of
the documentation, as they include significant abbreviations
and acronyms [21]. The negative impact of this mismatch
comes from the implicit assumption of most IR techniques
that the same words are used whenever a particular concept
is described [27]. A previous study found a wealth of natu-
ral language information within the source code, but a ma-
jor stumbling block in applying IR techniques to code is the
disparate vocabulary used in source code, especially when
compared to external documentation [23]. Thus, there is a
need for vocabulary normalization if the full benefit of ap-
plying IR techniques is to be realized. Normalization will
bring the vocabulary of the code and documentation in line
with each other, thus making it more appropriate for con-
sumption by IR-based tools.
This paper makes the following contributions:
1. It presents an algorithm, Normalize, for vocabulary
normalization. This algorithm involves two key tasks:
splitting and expansion.
2. Efficient implementation of Normalize requires a fast
precise splitter. The paper describes in detail the con-
struction of a new splitting algorithm, GenTest, which
accurately splits an identifier into its constituent parts.
3. Finally, it empirically evaluates GenTest, finding that
its 82% accuracy improves on the state of the art.
In the remainder of this paper, Section 2 describes back-
ground material. Then Section 3 presents the normalization
algorithm. After considering prior splitting approaches in
Section 4, the building and evaluation of GenTest is con-
sidered in Sections 5 and 6. Finally, Sections 7, 8 and 9
present related work, future challenges, and a summary of
the paper.
2 Background
This section first introduces some necessary terminology
used to describe identifiers and then the two subtasks that
make up normalization. When applying IR to software en-
gineering artifacts, the first step is to break the input text up
into words (or other atomic units). For natural language
documents such as the requirements or the design, word
separation is often straightforward. For example, many lan-
guages use separators such as white-space. However, there
is no simple means of separating identifiers into individual
words, which poses a problem to the adaptation of IR algo-
rithms to SE problems because most IR algorithms make the
(often implicit) assumption that the same or similar collec-
tion of words is used to describe a concept in all documents.
In the following, word breaks (e.g., underscores and
camel-casing) are referred to as division markers, while the
strings of characters between division markers and the end-
points of an identifier are referred to as hard-words. For
example, sponge_bob and spongeBob include the hard-
words sponge and bob. Sometimes splitting into hard-
words is sufficient (e.g., when all hard-words are dictionary
words); however, other times hard-word splitting is not suf-
ficient, as with identifiers composed of juxtaposed lower-
case words (e.g., spongebob). In this case further division
is required. The resulting strings of characters are referred
to as soft-words. Thus, a soft-word is either the entire hard-
word or a sub-string of a hard-word. Take, for example,
the identifier hashtable entry. This identifier consists of
one division marker (an underscore) and, thus, two hard-
words, hashtable and entry. The hard-word hashtable
is composed of two soft-words, hash and table, while the
hard-word entry is composed of a single soft-word.
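To make the terminology concrete, here is a minimal sketch (illustrative only, not the splitter used in this paper) that extracts hard-words by breaking at underscores and at simple camel-case boundaries; it deliberately ignores the multiple-adjacent-uppercase case discussed next.

```python
import re

def hard_words(identifier):
    """Extract hard-words: split at underscores and at lower-to-upper
    camel-case transitions, then lowercase the pieces.
    (Runs of adjacent uppercase letters are not handled here.)"""
    no_underscores = identifier.replace("_", " ")
    camel_split = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", no_underscores)
    return [w.lower() for w in camel_split.split()]

print(hard_words("hashtable_entry"))  # ['hashtable', 'entry']
print(hard_words("spongeBob"))        # ['sponge', 'bob']
```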
Identifying hard-words is not always a trivial task. For
example, multiple adjacent uppercase characters in camel-
cased identifiers can lead to two possible splits – one where
all uppercase characters make up a hard-word and one
where the final uppercase character is part of the succeeding
hard-word. Samurai’s mixed case algorithm [13] produces
hard-words with sufficient accuracy, even in the presence of
multiple adjacent uppercase characters, that the problem of
discovering hard splits in camel-cased identifiers is consid-
ered solved and is not addressed further in this paper.
Motivation for splitting beyond the hard-word level
comes from several past efforts. In applications of IR tech-
niques to source code, splitting is generally based solely on
division markers [10, 2, 33, 1]. Such restricted splitting has
been noted by some authors to be insufficient. For example,
Zhao and Zhang note that “It should be indicated that this
relatively simple preprocessing [splitting at word markers]
is not enough for further use of IR. For instance, identifiers
like featurelocation and floc need some more sophisticated
word recognizers. In our experiment, we preprocess such
cases manually · · · .” [36]
Their need for such manual preprocessing illustrates
both of the tasks undertaken by normalization. The splitting
subtask separates the hard-word featurelocation into the
soft-words feature and location and the hard-word floc into
the soft-words f and loc. Here the identifier floc presents a
greater challenge. Once correctly split, it still presents dif-
ficulties to an IR-based tool because IR techniques tend not
to equate f and feature nor loc and location. To establish
the correct link, the soft-words need to be expanded map-
ping f to feature and loc to location. Part of the challenge
here is to avoid alternate realistic expansions such as file
lines-of-code for floc.
3 Algorithm
This section presents the normalization algorithm and
then explains how the algorithm is used as a preprocess-
ing step to improve IR-based tools. As described above,
normalization has two tasks: splitting an identifier into soft-
words and expansion of those soft-words to associate a
meaning (e.g., a dictionary word) with each. While there are
advantages to carrying out the two steps concurrently as they
can inform each other, for ease of presentation the two are
described sequentially.
Task 1 is to separate hard-words into soft-words com-
posed of character sequences that represent words, abbrevi-
ations, or acronyms. A generate and test algorithm, named
GenTest, is used to accomplish Task 1. The generation part
of the algorithm is simple, as it generates all possible splittings.
While this generates an exponential number of splittings,
most hard-words are short and thus the computational effort
is not excessive. The test part is more complex although it
is quite efficient as it simply evaluates a scoring function
against each proposed splitting. The complexity is within
this function, which is a linear combination of metrics that
describe the quality of the split. The construction of this
function is described in Section 6.
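A minimal sketch of the generate-and-test idea follows; the enumeration matches the description above, while score_split stands in for the fitted scoring function of Section 6 (the toy scorer in the example is only for illustration).

```python
def all_splits(hard_word):
    """Generate every splitting of a hard-word into soft-words.
    A hard-word of length n has 2^(n-1) candidate splittings."""
    if len(hard_word) <= 1:
        return [[hard_word]]
    splits = []
    for rest in all_splits(hard_word[1:]):
        splits.append([hard_word[0]] + rest)                # cut after the first character
        splits.append([hard_word[0] + rest[0]] + rest[1:])  # or extend the first soft-word
    return splits

def gen_test(hard_word, score_split):
    """Generate and test: return the candidate splitting with the highest score."""
    return max(all_splits(hard_word), key=score_split)

# Toy scorer that simply prefers fewer soft-words; the real scorer is the model of Section 6.
toy_score = lambda split: -len(split)
print(gen_test("strlen", toy_score))  # ['strlen'] under this toy scorer
```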
Task 2 assigns a meaning to each soft-word. There are
two cases. The first is the easy case where the soft-word is
a dictionary word and has a well-established meaning that
coincides with the use of the word in the source code. In
the second case, non-dictionary words are assumed to be
abbreviations or acronyms. A naive, but illustrative algo-
rithm for expanding an abbreviation is based on a combi-
nation of wild-card expansion [22] and ideas taken from
machine translation. For example, given the abbreviation
horiz, the code and the external documentation are searched
for strings matching “h*o*r*i*z*”, where a “*” represents an
arbitrary sequence of letters. A search of the program from
which horiz was extracted uncovers a unique match, the
dictionary word horizontal; thus, the meaning of the abbre-
viation horiz is equated to the word horizontal.
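A sketch of this naive wildcard matcher is shown below, assuming the candidate words have already been mined from the code and documentation; the function name and word list are illustrative.

```python
import re

def wildcard_expansions(abbrev, candidate_words):
    """Match 'h*o*r*i*z*'-style patterns: every letter of the abbreviation
    must appear in order, each optionally followed by more letters."""
    pattern = re.compile("^" + "".join(ch + "[a-z]*" for ch in abbrev.lower()) + "$")
    return [w for w in candidate_words if pattern.match(w.lower())]

# Candidate words would be extracted from the program and its documentation.
words = ["horizontal", "height", "hazard", "vertical"]
print(wildcard_expansions("horiz", words))  # ['horizontal']
```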
Acronym expansion is more challenging. First acronyms
need to be identified. Then a phrase needs to be matched to
the acronym. The source code and documentation can be
mined for phrases that may be expansions for acronyms.
For example, “cms” appears within many of ghostscript’s
identifiers. A phrase finder [16] uncovers the text Color
Management System (CMS) in the ghostscript documen-
tation, which is a likely expansion for the identifier.
Machine translation techniques offer a means of expand-
ing abbreviations and acronyms into words and phrases that
create coherent identifiers. One machine translation tech-
nique, the maximum coherence model [25], is the planned
basis for the Normalize algorithm.
The following description of Normalize is a declarative
statement of the algorithm. Because the set Splits(id) (de-
fined below) includes an exponential number of possible
splits, the efficient implementation of the algorithm requires
some care. Fortunately, in practice only a handful of po-
tential splits needs to be considered. For example, in the
empirical evaluation the GenTest splitter ranks the correct
split in the top ten over 99% of the time. Sections 5 and 6
describe the engineering of an effective splitter used in the
implementation of Normalize.
The formalization of the function Normalize identifies
the best expansion over all possible splits. It uses Splits(id)
to denote the set of all possible splits of identifier id. A
split s ∈ Splits(id) is composed of a sequence of soft-words
s_1 s_2 ... s_n. Finally, for soft-word s_i, E(s_i) denotes the set
of distinct expansions e_{i,1}, e_{i,2}, ..., e_{i,m} for s_i. The heart
of the algorithm is a similarity metric computed from co-
occurrence data. This data is used because it has proven use-
ful in resolving translation ambiguity. In other words, the
normalization relies on the fact that expanded soft-words
should be found co-located in the documentation or in gen-
eral text. For the general text, a data set of over a trillion
words extracted by Google and distributed by the Linguistic
Data Consortium [8] is used. In the algorithm the similarity
between two expansions, sim(e_1, e_2), is the probability of
e_1 and e_2 co-occurring in a five-word window in the Google
data set.
Part 1 For a selected splitting s = s_1 s_2 ... s_n:

  (a) For each expansion e_{i,j} ∈ E(s_i), define the similarity score
      between e_{i,j} and each other soft-word s_k of s (k ≠ i) as the
      sum of the similarities between e_{i,j} and the elements of E(s_k):

        sim(e_{i,j}, s_k) = Σ_{e ∈ E(s_k)} sim(e_{i,j}, e)

  (b) Define the cohesion of e_{i,j} relative to s as

        cohesion(e_{i,j}, s) = log( Σ_{s_k ∈ s, k ≠ i} sim(e_{i,j}, s_k) )

  (c) For each s_i ∈ s, score(s_i) is the cohesion of the expansion
      e_{i,j} ∈ E(s_i) having the maximal cohesion:

        score(s_i) = max_{e_{i,j} ∈ E(s_i)} [ cohesion(e_{i,j}, s) ]

Part 2 Finally, for the identifier id, Normalize(id) identifies the split
with the highest cumulative score:

        Normalize(id) = max_{s ∈ Splits(id)} [ Σ_{s_i ∈ s} score(s_i) ]
As an example of how this algorithm works, consider
the hard-word strlen where there are two possible splits:
Splits(strlen) = {st-rlen, str-len}. Let E(st) = {stop,
string, set}, E(rlen) = {riflemen}, E(str) = {steer, string},
and E(len) = {lender, length} be the possible expansions
(translations) of each soft-word. Similarity scores are com-
puted using the Google data set:

  sim(stop, riflemen)   = 9.95857 × 10^-8
  sim(string, riflemen) = 9.95857 × 10^-8
  sim(set, riflemen)    = 9.95857 × 10^-8
  sim(steer, lender)    = 0
  sim(steer, length)    = 9.95857 × 10^-8
  sim(string, lender)   = 0
  sim(string, length)   = 0.00262389

With these similarity scores, sim(e_{i,j}, s_k) can be com-
puted. If e_{i,j} = string and s_k = len, then

  sim(string, len) = sim(string, lender) + sim(string, length)

in Part 1(a). The computation of cohesion sums over all the
other soft-words besides str. In this case there is one other
soft-word, so

  cohesion(string, str-len) = log(sim(string, len))

in Part 1(b). The maximal cohesion for Part 1(c) is chosen from
among the possible expansions. Given

  cohesion(string, str-len) = −5.9431
  cohesion(steer, str-len)  = −16.1222

score(str) = −5.9431, using string as the expansion.
Normalize then identifies the split with the highest over-
all score sum. In the example, str-len produces the sum
−11.8862, while for st-rlen the sum is −31.1459. There-
fore, in Part 2, Normalize(strlen) selects str-len, using the
expansions string and length, because it has the maximal
score sum.
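The following sketch reproduces the calculation above under simplifying assumptions: the candidate splits, candidate expansions, and pairwise similarities are supplied directly (in the full algorithm they come from GenTest, the expansion resources, and the Google 5-gram data), and the function names are illustrative.

```python
import math

def cohesion(expansion, split, index, expansions, sim):
    """cohesion(e, s): log of the summed similarity between expansion e
    (proposed for soft-word split[index]) and all other soft-words of the split."""
    total = sum(sim(expansion, e2)
                for k, sk in enumerate(split) if k != index
                for e2 in expansions[sk])
    return math.log(total) if total > 0 else float("-inf")

def normalize(candidate_splits, expansions, sim):
    """Return (score, split, chosen expansions) for the highest-scoring split."""
    best = None
    for split in candidate_splits:
        chosen, total = [], 0.0
        for i, si in enumerate(split):
            e_best = max(expansions[si],
                         key=lambda e: cohesion(e, split, i, expansions, sim))
            chosen.append(e_best)
            total += cohesion(e_best, split, i, expansions, sim)
        if best is None or total > best[0]:
            best = (total, split, chosen)
    return best

# The strlen example, with the similarities listed above.
pair_sims = {("stop", "riflemen"): 9.95857e-8, ("string", "riflemen"): 9.95857e-8,
             ("set", "riflemen"): 9.95857e-8, ("steer", "lender"): 0.0,
             ("steer", "length"): 9.95857e-8, ("string", "lender"): 0.0,
             ("string", "length"): 0.00262389}
sim = lambda a, b: pair_sims.get((a, b), pair_sims.get((b, a), 0.0))
expansions = {"st": ["stop", "string", "set"], "rlen": ["riflemen"],
              "str": ["steer", "string"], "len": ["lender", "length"]}
score, split, chosen = normalize([["st", "rlen"], ["str", "len"]], expansions, sim)
print(split, chosen, round(score, 4))  # ['str', 'len'] ['string', 'length'] -11.8862
```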
4 Prior Splitting Approaches
The GenTest algorithm which produces possible splits
for Normalize builds on ideas presented in two prior algo-
rithms: Greedy and Samurai. This section introduces these
two algorithms. The Greedy Algorithm [15] relies on a dic-
tionary to determine where to insert a split in a hard-word.
The algorithm identifies in the hard-word the longest prefix
or suffix that is in the dictionary. Once a dictionary word
is discovered, it is set aside as a soft-word, and the remain-
ing characters are recursively searched for additional soft-
words. Since both the prefix search and suffix search are
invoked on the remaining characters, the search returning
the higher ratio of soft-words found in the dictionary to the
total number of soft-words is used. The process ends when
the remainder is a dictionary word or contains no dictionary
words.
The dictionary used by the Greedy Algorithm is com-
posed of three groups of words. The dominant group is
natural language words. For example, the initial experi-
ments used the publicly available dictionary that accom-
panies ispell Version 3.1.20. The second group augments
the dictionary with common abbreviations (e.g., alt for al-
titude) and programming abbreviations (e.g., txt for text).
The final group is made up of programming-language spe-
cific words including keywords (e.g., while), predefined
identifiers (e.g., NULL), library function and variable names
(e.g., strcpy and errno), and all identifiers that consist
of a single character.
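A simplified sketch of the Greedy strategy appears below; it treats the dictionary as a single set of strings (folding together the three groups just described), uses illustrative names, and should be read as an approximation of the published algorithm rather than a reimplementation.

```python
def greedy_split(word, dictionary):
    """Greedy splitting sketch: peel off the longest prefix or suffix found in
    the dictionary, recurse on the remainder, and keep whichever direction
    yields the higher ratio of dictionary soft-words."""
    if not word:
        return []
    if word in dictionary:
        return [word]

    def peel(w, from_front):
        for n in range(len(w) - 1, 0, -1):  # longest piece first
            piece, rest = (w[:n], w[n:]) if from_front else (w[-n:], w[:-n])
            if piece in dictionary:
                rest_split = greedy_split(rest, dictionary)
                return [piece] + rest_split if from_front else rest_split + [piece]
        return [w]  # no dictionary word found; leave unsplit

    def ratio(split):
        return sum(1 for s in split if s in dictionary) / len(split)

    prefix_try, suffix_try = peel(word, True), peel(word, False)
    return prefix_try if ratio(prefix_try) >= ratio(suffix_try) else suffix_try

print(greedy_split("hashtable", {"hash", "table", "has"}))  # ['hash', 'table']
```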
The second splitting algorithm, Samurai [13], scores
potential splits using the frequencies of the occurrence of
strings from two sources: those appearing in the program
being analyzed and those appearing in a large corpus of pro-
grams. The strings used by Samurai are the hard-words ex-
tracted from the identifiers of both sources, together with
the words found in the comments of both sources. The al-
gorithm builds two tables that map each string to the num-
ber of times that the string occurs as a hard-word or in a
comment. One table is constructed from the program be-
ing analyzed, yielding the program-specific frequency table,
progFreq, and the other from the entire corpus, yielding the
global frequency table, globalFreq. In addition, hand-built
prefix and suffix lists are used to prevent certain strings from
appearing as soft-words. Prefixes include “afro”, “co”, and
“peri” while suffixes include “aholic”, “eous”, and “tropy”.
The two tables are used in the following string scoring
function:
progFreq(s, p) + (globalFreq(s) / log(AllStrsFreq(p)))
where p is the program under analysis and AllStrsFreq(p) is
the total number of strings that occur in p. With the goal of
maximizing the score, the algorithm compares the score for
the entire hard-word and those for all possible splits of the
hard-word into two soft-words subject to two constraints:
first, the left soft-word must not be on the list of prefixes and
the right soft-word must not be on the list of suffixes. The second con-
straint is that there be overwhelming evidence in favor of the
split. Such evidence exists if the square root of the two soft-
words’ scores is greater than that of the unsplit hard-word.
If the first constraint holds, but only the left soft-word has
a sufficiently high score, then the right soft-word is recur-
sively split, but not vice versa.
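A sketch of Samurai's scoring and two-way comparison follows; the frequency tables are plain dictionaries, the names are illustrative, and the "overwhelming evidence" test is ambiguous in the prose above, so this sketch reads it as the square root of the product of the two part scores (an interpretation, not the published formula). The recursive re-splitting of the right soft-word is omitted.

```python
import math

def samurai_score(s, prog_freq, global_freq, all_strs_freq):
    """Samurai's score for string s: program frequency plus the global
    frequency scaled down by log of the total number of strings in the program."""
    return prog_freq.get(s, 0) + global_freq.get(s, 0) / math.log(all_strs_freq)

def split_once(hard_word, score, prefixes, suffixes):
    """Compare the unsplit hard-word against every two-way split, accepting a
    split only with overwhelming evidence (see the interpretation noted above)."""
    best_score, best_split = score(hard_word), [hard_word]
    for i in range(1, len(hard_word)):
        left, right = hard_word[:i], hard_word[i:]
        if left in prefixes or right in suffixes:
            continue  # first constraint: reject forbidden prefixes and suffixes
        evidence = math.sqrt(score(left) * score(right))
        if evidence > best_score:
            best_score, best_split = evidence, [left, right]
    return best_split
```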
5 GenTest Scoring Metrics
This section presents the metrics used by GenTest. In
contrast to previous splitting algorithms, rather than identi-
fying chunks of a hard-word as acceptable and then attempt-
ing to split the remainder of the hard-word, GenTest uses a
generate and test strategy in which all possible splits are ma-
terialized and then scored (i.e., tested). When used only for
splitting, the one with the highest score is selected. When
used by Normalize, a ranked list of high-scoring splits is
used to prioritize the expansions considered.
Although there are an exponential number of possible
splits relative to the length of the hard-word, in practice the
number of possible splits is manageable because identifiers,
and thus their hard-words, tend to be short. For those where
the number of splits becomes large, machine learning tech-
niques such as genetic algorithms can provide an efficient
way of exploring the search space given an accurate scoring
function.
The metrics that GenTest uses aim to characterize a high
quality splitting of a hard-word. After describing each met-
ric, motivation for the metric is provided. In Section 6, lo-
gistic regression is used to empirically build a model and
thus a scoring function that captures the best combination
of metrics.
There are three categories of metrics: soft-word char-
acteristics, metrics incorporating external information, and
metrics incorporating internal information. Soft-word char-
acteristics are characteristics of the strings produced by the
splitting. External information includes dictionaries and
other information that is either human engineered or ex-
tracted from non-source code sources. Internal information
is derived from the source code, either the program itself or
a collection of programs.
5.1 Soft-word Characteristics
There are five metrics derived from soft-word character-
istics. The first is simply the number of soft-words the hard-
word is divided into, number of words. This can be any-
where from 1 to n, where n is the length of the hard-word.
Empirical analysis shows that generally fewer soft-words
are better.
The second metric is the average soft-word size, aver-
age word length. This is simply the average length of the
soft-word strings. In general, longer soft-words are more
understandable.
The third metric is the longest soft-word size, longest
word. This is highly correlated to average soft-word size. It
was included because programmers frequently add a short
prefix to a well understood soft-word such as when using
Hungarian notation.
The fourth metric, words with vowels, is the number of
soft-words that contain vowels. The value of this metric is
higher when there are fewer vowel-free soft-words. Even
non-dictionary soft-words, such as abbreviations, usually in-
clude some vowels; this metric captures that expectation.
The final metric, single letter count, is the number of
soft-words containing a single letter (i.e., length one). Un-
like the other metrics, it is hypothesized that smaller values
for this metric will lead to better splits.
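A short sketch computing these five characteristics for one candidate split (names are illustrative):

```python
VOWELS = set("aeiou")

def softword_characteristics(split):
    """The five soft-word characteristic metrics for a candidate split."""
    lengths = [len(s) for s in split]
    return {
        "number of words": len(split),
        "average word length": sum(lengths) / len(split),
        "longest word": max(lengths),
        "words with vowels": sum(1 for s in split if set(s.lower()) & VOWELS),
        "single letter count": sum(1 for s in split if len(s) == 1),
    }

print(softword_characteristics(["hash", "table"]))
# {'number of words': 2, 'average word length': 4.5, 'longest word': 5,
#  'words with vowels': 2, 'single letter count': 0}
```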
5.2 External Information
Eight metrics make use of external information such as
dictionaries to calculate metric values. Three of the eight
are derived from counting the number of soft-words found
in the dictionary. Two different dictionaries are used. One
is Debian's wamerican (6-2), which consists of a concate-
nation of Kevin Atkinson's SCOWL word list sizes 10 through
50 [3]. This includes 98,569 entries. The other dictionary
has 479,625 entries and is distributed in /usr/share/dict
with Red Hat 4.1.2-14. The first dictionary is referred to
as the ‘small dictionary’ and the second as the ‘large dic-
tionary’. The small dictionary is a subset of the large dic-
tionary. One metric, small dictionary match count, is the
number of soft-words found in the small dictionary. The
second metric, large dictionary match count, is the number
of soft-words found in the large dictionary. The third met-
ric, large dictionary match len 3 count, is the number of
soft-words that have a length of three or more and are found
in the large dictionary.
The fourth metric, programming word count, uses the
same list of programming specific words employed in
Greedy Splitting. In this case, the number of soft-words that
are found on the list is counted. Since the list is program-
ming language specific, the programming language of the
file where the soft-word occurs is taken into consideration
when counting.
The fifth metric, dictionary expansion count, counts the
number of soft-words that are not on the programming spe-
cific list and that can be found in the large dictionary or
expanded into a word in the small dictionary. To expand
a soft-word, the soft-word is first stemmed using Krovetz’s
morphological stemmer [20]. Then, provided that the soft-
word has more than one character, the dictionary is searched
for a wild-card expansion, as described in Section 3. If one
or more matches are found for a soft-word, then that soft-
word is counted as having a dictionary expansion.
The final metrics are computed using co-occurrence in-
formation developed by Google and provided by the Lin-
guistic Data Consortium. The data set provides the number
of times a series of 5 words (5-grams) is observed on web
pages crawled by Google as of 2006. Co-occurrence is com-
puted for words w1 and w2 by counting the number of times
w1 appears in a 5-gram with w2, divided by the number of
5-grams w2 occurs in.
For example, let w1 = linguistics and w2 = Google, and as-
sume the following 5-grams and frequencies:

  5-gram                                  frequency
  about the future of Google              101
  exactly the same as Google              87
  Google Scholar People in Linguistics    93

Then Co-occurrence(linguistics, Google) = 93 / (101 + 87 + 93) =
0.331, which means that roughly one-third of the time that
Google occurs, linguistics also occurs.
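Expressed over raw 5-gram counts, the definition looks like the sketch below, shown with the toy counts from the example; the real computation runs over the Web 1T data set rather than an in-memory dictionary.

```python
def co_occurrence(w1, w2, five_grams):
    """Co-occurrence(w1, w2): count of 5-grams containing both w1 and w2,
    divided by the count of 5-grams containing w2 (case-insensitive)."""
    def contains(gram, w):
        return w.lower() in (t.lower() for t in gram.split())
    with_w2 = sum(c for g, c in five_grams.items() if contains(g, w2))
    with_both = sum(c for g, c in five_grams.items()
                    if contains(g, w1) and contains(g, w2))
    return with_both / with_w2 if with_w2 else 0.0

grams = {"about the future of Google": 101,
         "exactly the same as Google": 87,
         "Google Scholar People in Linguistics": 93}
print(round(co_occurrence("linguistics", "Google", grams), 3))  # 0.331
```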
There are three metrics computed from the co-
occurrence information. Co-occurrence is the average co-
occurrence of each successive pair of soft-words in the hard-
word. Through inspection it was discovered that many two
character ‘words’ in this set are a result of faulty optical
character recognition (OCR). These small words caused ex-
cessive splitting. Co-occurrence len 3 only includes pairs of
soft-words where both words in the pair are of length three
or longer. The final metric, co-occurrence when combined
not word, also attempts to cope with the OCR problem by
ignoring pairs whose combined characters make up a word.
5.3 Internal Information
There are five internal metrics. The first metric, in
source, tallies the number of soft-words that meet one of the
following conditions: (1) the characters form an acronym
for a phrase found in the source, or (2) the soft-word or
an expansion of the soft-word is a dictionary word appear-
ing in the comments or the code. Phrases are identified by
running the comments and multi-word-identifiers through
a phrase finder [16]. Here, the first letter of each word
in the phrase is used to build an acronym. If a soft-word
matches an acronym exactly, then the soft-word is part of
the tally. For instance, the phrase finder extracts the phrase
Color Management System from ghostscript documenta-
tion. This phrase matches the soft-word cms. This metric
counts higher quality expansions than the dictionary expan-
sion count by including the subset of the dictionary that is
found in the source. This works well when abbreviations
are defined elsewhere in the source code.
The final four metrics are rooted in the frequency tables
used in the Samurai algorithm. However, instead of straight
frequency counts, these metrics use normalized frequencies
– the frequency of a string divided by the total number of
strings observed. A normalized frequency can be inter-
preted as the probability of encountering a particular string
in the program or collection. Assuming that these proba-
bilities are independent and disjoint (a common assumption
in IR even though it is generally false), the probability of
observing the union of a set of words is obtained by sum-
ming their individual probabilities, while the probability of
observing intersection of the words from the set is obtained
by computing the product of their probabilities.
For these metrics the words in the unions and inter-
sections include only soft-words that have three or more
characters. Short soft-words are given a very small or no
probability since their counts are likely to be inflated by
homonymy. Short soft-words have an artificially high prob-
ability because they can expand into so many different con-
cepts. The probability of the union of the soft-words is
computed using both the program normalized frequencies
and the global or collection normalized frequencies leading
to the union-program probability and union-global prob-
ability metrics. The probability of the intersection of the
soft-words is computed using both the program normalized
frequencies and the global normalized frequencies leading
to the final metrics, intersection-program probability and
intersection-global probability.
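Under the stated assumptions, the union and intersection probabilities reduce to a sum and a product of normalized frequencies. The sketch below computes the four metrics from two caller-supplied frequency tables (names are illustrative); note that if no soft-word has three or more characters, the empty product defaults to 1.0, which a real implementation would likely special-case.

```python
import math

def probability_metrics(split, prog_freq, prog_total, global_freq, global_total):
    """Union (sum) and intersection (product) of normalized frequencies of the
    soft-words, against both the program-specific and the global tables.
    Only soft-words of three or more characters are included, since shorter
    strings have counts inflated by homonymy."""
    longwords = [s for s in split if len(s) >= 3]
    prog_p = [prog_freq.get(s, 0) / prog_total for s in longwords]
    glob_p = [global_freq.get(s, 0) / global_total for s in longwords]
    return {
        "union-program probability": sum(prog_p),
        "union-global probability": sum(glob_p),
        "intersection-program probability": math.prod(prog_p),
        "intersection-global probability": math.prod(glob_p),
    }
```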
6 Building and Evaluating GenTest
GenTest combines the metrics from Section 5 using sta-
tistical analysis. This section first describes the oracle data
used in this analysis and the building of the model from the
oracle data set. It then compares GenTest’s results with the
two algorithms presented in Section 4 and finally provides
a brief discussion of threats to validity.
6.1 Oracle Data Set
The oracle data set identifies the correct splitting for
each identifier. It was generated by having four program-
mers hand split overlapping subsets of a random sample
of four thousand identifiers. The random sample was drawn
from a source base that includes 186 programs containing
26MLoC of C, 15MLoC of C++, and 7MLoC of Java. The
total of almost 50MLoC includes almost 3 million identifier
instances, of which 746,345 are unique. These are com-
posed of 104,278 unique hard-words. To produce the ora-
cle data, 4,000 identifiers were randomly chosen from the
746,345.
The 4,000 identifiers contain 2,180 unique hard-words.
Because the same hard-word can generate different program
specific metric values, each program was searched for iden-
tifiers containing that hard-word. This increased the size of
the oracle to 8,455 identifiers. Finally, hard-words contain-
ing fewer than three characters or more than twelve char-
acters were excluded. One and two character hard-words
almost never require splitting, while those longer than twelve
characters tend to be composed of juxtaposed (English)
words where a simpler splitter, such as the Greedy algo-
rithm [15], is more efficient and accurate. In total, 7,941
hard-words formed the oracle. To avoid over-fitting the
data, the oracle set was randomly divided into two halves:
an estimation set, used for model construction, and a vali-
dation set.
Oracle identifiers come from three different program-
ming languages. The bulk, 67%, are from C programs.
The data set is 26% C++ identifiers and 7% comes from
Java programs. This roughly reflects the source code base.
The most frequently occurring hard-word in the data set is
val with 50 occurrences, whereas accfragref is among the
1,141 hard-words that only occur once.
6.2 Model Construction and Quality
Statistical analysis by SAS version 9.1 was used to gen-
erate a model of correctly-split identifiers. Because the re-
sponse variable is binary (whether a particular split is right
or wrong), logistic regression is used to model the associa-
tion between the response variable and the explanatory vari-
ables. The resulting model, a weighted combination of the
statistically significant metrics, becomes the scoring func-
tion for potential splittings.
The first step in model construction is to generate the
metric values for all possible splits of each oracle hard-
word. However, because no hard-word required more than
three splits, those with four or more splits were ignored.
Because each identifier still has multiple splits, repeated
measures are present in the data; thus, a generalized linear
mixed model (GLMM) [32]) is appropriate to handle the
within identifier correlations caused by repeated measures.
However, fitting such a model produced very small (essen-
tially zero) variance estimates, suggesting that, the more so-
phisticated GLMM analysis is not needed. Thus, multiple
logistic regression is used to fit the models to obtain pre-
dicted probabilities of a correct prediction.
When modeling the data, two of the metrics were elimi-
nated: co-occurrence when combined not word, which was
not significant (p > 0.05), and intersection-global probabil-
ity, which, while statistically significant, had a very small
effect and was highly correlated with intersection-program
probability.
The resulting scoring function appears as Equation (2) in
Figure 1. Statistically, it is a very good model of the data.
The area under the ROC curve is 0.984 (the c-statistic),
which is in the “excellent” range. The high percent concor-
dant value of 97.8% indicates that the correct splitting has a
higher predicted probability than incorrect splits 97.8% of
the time. Somer’s D and the Gamma statistics also indicate
an almost perfect association.
Logistic Regression Parameter Estimates

Explanatory Variable                  Min  Max     Std. Error  Wald χ2     Pr > χ2   Odds-Ratio
number of words                       1    4       0.1620      1497.5287   <.0001    0.002
programming word count                0    2       0.1847      534.9597    <.0001    71.738
in source                             0    4       0.1118      353.4142    <.0001    8.179
small dictionary match count          0    4       0.0774      578.7864    <.0001    6.442
large dictionary match count          0    4       0.1071      17.6977     <.0001    0.637
large dictionary match len 3 count    0    3       0.1097      379.1848    <.0001    8.465
single letter count                   0    4       0.1392      46.7953     <.0001    2.592
dictionary expansion count            0    4       0.0975      331.1915    <.0001    5.897
co-occurrence                         0    0.0543  285.6       6.2660      0.0123    <0.001
co-occurrence len 3                   0    0.0543  286.7       8.3633      0.0038    >999.999
average word size                     1    12      0.0603      248.7042    <.0001    0.386
longest word                          1    12      0.0491      178.4353    <.0001    1.928
words with vowels                     0    4       0.0716      103.6284    <.0001    0.482
union-program probability             0    0.0395  23.7065     13.0305     0.0003    >999.999
intersection-program probability      0    84.817  0.00260     465.0384    <.0001    0.946
union-global probability              0    0.0294  33.4389     127.1047    <.0001    >999.999

Table 1. Statistical information concerning each of the explanatory variables in the model.

η = 8.2271
    − 6.27 · number of words
    + 4.27 · programming word count
    + 2.10 · in source
    + 1.86 · small dictionary match count
    − 0.45 · large dictionary match count
    + 2.14 · large dictionary match len 3 count
    + 0.95 · single letter count
    + 1.77 · dictionary expansion count
    − 714.9 · co-occurrence
    + 829.2 · co-occurrence len 3
    − 0.95 · average word length
    + 0.66 · longest word
    − 0.73 · words with vowels
    + 85.58 · union-program probability
    − 0.06 · intersection-program probability
    + 377.0 · union-global probability        (1)

Score = e^η / (1 + e^η)                       (2)

Figure 1. GenTest's Scoring Function

The following discussion details how the metrics con-
tribute to the overall score. First, the coefficients of Equa-
tion (1) are either positive or negative. For negative coef-
ficients smaller values produce higher scores. The oppo-
site is true for positive coefficients. For example, a higher
score comes with fewer words (i.e., a lower value of num-
ber of words); however, more programming-words (i.e., a
higher value of programming word count) is better. In
essence, the model says that all things being equal fewer
soft-words are better, but if there are more soft-words, it
is best that they be on the programming list. Second, to
fully appreciate the impact a metric has, it is important to
consider the range of values that were observed as shown
in Table 1 by the minimum and maximum columns. This
limit bounds the impact that a metric can have on the over-
all value of Equation (1). For example, the most that union-
program probability contributes to Equation (1) is 3.38
(85.5753 × 0.03951), whereas number of words potentially
has a greater impact: at its largest value its (negative) contri-
bution has a magnitude of 25.08 (6.27 × 4).
There are several interesting features in Equation (1).
The number of words is the most influential single variable.
However, the combined program probabilities can make a
very high positive contribution to the score. In fact, many of
the similar metrics have opposite signs (co-occurrence and
co-occurrence len 3, for instance); because they are highly
correlated, members of such a pair have a dampening ef-
fect on each other. Both are left in the model as each also
brings something unique to the model. In the case of the
two co-occurrences, co-occurrence len 3 is always less than
or equal to co-occurrence; therefore, the highest scores will
result when the two are equal. Also, although the coeffi-
cients for these variables are large, the maximum values are
small and thus the pair never has a large impact on the value
of Equation (1).
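Applying the fitted model is mechanical: compute η from Equation (1) and pass it through the logistic transform of Equation (2). The sketch below hard-codes the coefficients from Figure 1; the metric names used as dictionary keys are chosen here for illustration.

```python
import math

# Coefficients from Equation (1); keys name the metrics of Section 5.
COEFFICIENTS = {
    "number of words": -6.27, "programming word count": 4.27,
    "in source": 2.10, "small dictionary match count": 1.86,
    "large dictionary match count": -0.45, "large dictionary match len 3 count": 2.14,
    "single letter count": 0.95, "dictionary expansion count": 1.77,
    "co-occurrence": -714.9, "co-occurrence len 3": 829.2,
    "average word length": -0.95, "longest word": 0.66,
    "words with vowels": -0.73, "union-program probability": 85.58,
    "intersection-program probability": -0.06, "union-global probability": 377.0,
}
INTERCEPT = 8.2271

def gentest_score(metrics):
    """Equation (2): logistic transform of the linear predictor eta (Equation (1))."""
    eta = INTERCEPT + sum(COEFFICIENTS[name] * metrics.get(name, 0.0)
                          for name in COEFFICIENTS)
    # Algebraically identical to e^eta / (1 + e^eta), but numerically steadier.
    return 1.0 / (1.0 + math.exp(-eta))
```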
Finally, Table 1 includes some statistical information re-
lating to how good the logistic regression parameter esti-
mates are and the odds-ratio. Since the model is a logistic
regression, the value produced by the score can be inter-
preted as the odds that the split is the correct split. The
standard error is the usual estimate of variability/uncertainty
in the parameter estimate. The Wald χ2 statistic, which is
equal to (estimate/stderr)^2, is similar to the z-statistic
squared. The p-value is the probability that a χ2 with 1 de-
gree of freedom is greater than or equal to the Wald statistic;
it tests whether a parameter equals zero. Together, the standard
error and Wald χ2 statistic test whether a parameter is needed. A
p-value over 0.05 is not statistically significant.

Algorithm       Overall   C        C++      Java
GenTest (val)   81.82%    80.91%   82.66%   87.64%
GenTest (est)   82.07%    80.81%   82.54%   92.01%
Greedy          64.45%    65.57%   60.17%   69.88%
Samurai         70.32%    69.98%   67.72%   83.54%

Table 2. Accuracy of the different algorithms overall and by programming language.
Table 1 also contains the odds-ratio. Each parameter es-
timate (i.e., coefficient in Equation (1)) provides an estimate
of the change in log(odds-ratio) for a unit increase in the ex-
planatory variable (i.e., metric). If the parameter estimate is
negative the odds-ratio will be less than one and the effect of
the explanatory variable on the response is a decrease in the
odds of success for a unit increase in the explanatory vari-
able. Conversely, if the parameter estimate is positive the
odds-ratio will be greater than one, and the effect of the ex-
planatory variable on the response is an increase in the odds
of success for a unit increase in the explanatory variable.
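In symbols (a standard logistic-regression identity rather than an equation taken from the paper), the predicted odds and the odds-ratio for a unit increase in metric x_j are:

```latex
% Odds of a correct split under the logistic model, and the odds-ratio
% associated with a one-unit increase in explanatory variable x_j.
\[
  \frac{\Pr(\text{correct split})}{1 - \Pr(\text{correct split})} = e^{\eta},
  \qquad
  \text{odds-ratio}_j = \frac{e^{\eta + \beta_j}}{e^{\eta}} = e^{\beta_j}.
\]
% Example: for number of words, e^{-6.27} \approx 0.002, matching Table 1.
```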
6.3 Results and Comparison
To empirically evaluate the performance of GenTest, it
was compared with Greedy and Samurai. Table 2 shows
the overall accuracy for each algorithm and the accuracy
by programming language. The results for the other algo-
rithms are over all the data (the combination of the estima-
tion and validation data sets). Their performance is iden-
tical when applied to these sets separately. Using McNe-
mar's test to compare proportions in paired samples, Gen-
Test out-performs the other algorithms in all categories
(columns), using either the validation or estimation data
set (p < 0.0001 in all cases).
As the prediction was not 100% accurate, the errantly
split identifiers were analyzed to determine whether either
algorithm tended to insert splits where they did not belong
or to miss desired splits. If a split was present in the oracle
answer, but not the one proposed by an algorithm, it was
considered a miss. If a split was present in the proposed
split, but not the oracle, then it was considered an addition.
The GenTest algorithm tends to under-split the identifiers.
When there is an inaccurate split, over 76% of the time the inaccu-
racy is caused by a missed split. Samurai more evenly dis-
tributes the inaccuracies. An inaccurate split is the result of
an additional split 58% of the time. The Greedy algorithm
over-splits at nearly the same rate that GenTest under-splits,
77% of the time. Thus, if under-splitting is more desirable
than over-splitting, GenTest also performs better when it is
inaccurate.
6.4 Threats to Validity
Most of the threats to the validity of this work are typ-
ical for this kind of statistical modeling. For example, the
external validity is threatened by the selection of only cer-
tain identifiers for the oracle set. This sample may not be
representative of identifiers in general, although given the
randomness of the sample and the similar performance on
the estimation and validation sets, it is likely that the sam-
ple is representative of (open source) software. One threat is
particular to this experiment: following Anquetil and Leth-
bridge [1], vocabulary normalization assumes that software
engineers are trying to give identifiers meaningful names
(although they may have failed in the attempt).
7 Related Work
This section briefly considers the broader category of
IR-based tools giving one example, and then considers
previously published splitting and expansion algorithms.
Many of the challenges faced in software engineering can
be (partially) addressed using IR techniques applied to
source code and its associated documents. Example ap-
plications include traceability link recovery [2], concept
or feature location [36, 35], reverse engineering [17], im-
pact analysis [9], software clustering metrics [29], software
libraries [31], developer identification [24], and software
comprehension [10]. Of these application categories, over
half are rather strongly dependent on the language con-
tained in identifiers.
An example application comes from metrics used to as-
sess design quality, predict software quality, identify fault
prone modules, and identify reusable components. Existing
(non-IR) metrics are primarily based on structural aspects
of software, such as the number of attributes in a class, the
number of lines of code, etc. Recently, Marcus et al. de-
fined a coupling metric between classes in terms of concep-
tual similarity between each classes’ methods [29]. For two
methods, cosine similarity between the natural language
found in the two is used to establish the similarity. How-
ever, they found that when two conceptually related classes
used even slightly different abbreviations for concepts, the
cosine similarity under-represented the true coupling. The
vocabulary normalization techniques presented herein will
help such an approach by replacing abbreviations with their
normalized equivalent. This will generate a more accurate
approximation of the true coupling.
Two splitting algorithms were presented in Section 4. A
third was inspired by speech recognition techniques [26].
The goal of the approach is to identify matchings between
identifier substrings and dictionary words. The identifier is
considered a signal of unknown meaning described by the
feature vector x_1, x_2, ..., x_N. Each dictionary word is then
used as a second (known) signal described by the feature
vector y_1, y_2, ..., y_M. The algorithm performs a dynamic
time warping (DTW) of x and y to find the optimal match
between the two vectors. The “time warp” part of the search
allows N and M to differ and, in the splitting domain, al-
lows abbreviations to be accounted for. The optimal match
is made by computing local distances and then choosing
matches that minimize the overall distance using dynamic
programming.
When comparing the four algorithms, Samurai’s need for
overwhelming evidence produces fewer over-splits than the
Greedy Algorithm, but more than GenTest. The DTW al-
gorithm is highly dependent on the dictionary used. The
presented results [26] appear dependent on a small focused
dictionary. When specialized dictionaries are not available,
the technique may be prohibitively slow and will tend to
over-split identifiers. A head-to-head comparison with the
DTW algorithm is left to future work as it will require care-
ful planning to correctly understand and evaluate the dictio-
nary’s impact on the algorithm’s performance.
Finally, three expansion algorithms are considered here.
The DTW algorithm provides hints for expansion by iden-
tifying abbreviations of dictionary words. When the algo-
rithm uses an abbreviation, the algorithm can map the ab-
breviation back to the original word. However, this feature
of the algorithm is not evaluated and currently has no way
to choose among multiple words that produce the same ab-
breviation. The Scoped Approach [18] finds possible ex-
pansions in dictionary words found in the source code and
documentation. Expansions are associated with soft-words
based on regular expression patterns. When multiple ex-
pansions are possible, higher frequencies of occurring in
the source code are favored. The final expansion approach
uses wildcard expansion [22] as described in Section 3. The
source code is used as the initial source of expansions and a
dictionary is used as a secondary resource. Soft-words are
associated with an expansion when there is only one possi-
ble expansion.
8 Future Challenges
Future work, already underway, will consider improve-
ments to the vocabulary normalization algorithm’s two
phases and its impact on IR tools. An example improve-
ment is expected from the incorporation of relative en-
tropy [11, 28]. Such a probabilistic approach is well suited
for automatic abbreviation and acronym detection. For ex-
ample, n-gram language models [30], built from nsequen-
tial letters, have been successfully used in speech recog-
nition software [28]. In essence, relative entropy identi-
fies unusually frequent sequences of characters. These are
likely to be meaningful in a particular document. For ex-
ample, “cms” appears within many of ghostscript’s iden-
tifiers. Since it has a high relative entropy, it is identified
as an acronym used in the program. Adding a relative en-
tropy metric should benefit GenTest. Furthermore, com-
bined with a phrase finder [16] it should allow the automatic
expansion of acronyms.
This paper focuses on the splitting aspect of normaliza-
tion. Expansion can follow splitting by taking the chosen
split as input and attempting to expand any non-dictionary
soft words. However, the two aspects are expected to per-
form better after integration. For example, GenTest can
produce a ranked list of the top X splits. If expansion works
better with one of these ten, then it is the preferred split. In
support of integrating the two, GenTest ranks the correct
split in the top ten over 99% of the time.
Once the normalization implementation is in place,
empirical investigation of its impact on existing tools is
planned. Tools and techniques to be experimented with will
be drawn from the collection of problems in software engi-
neering to which IR has been applied [5, 6, 14]. Vocabulary
normalization is expected to dramatically improve existing
(and future) IR-based tools. As an illustration of the impact
that Normalize can have, consider the FLAT^3 [34]
feature locator. In addition to dynamic tracing, this tool
indexes the current Eclipse workspace and then uses co-
sine similarity to determine how close a query is to any
class, method, or other file in the workspace. A prelimi-
nary experiment paired GenTest with a simple expansion
algorithm. The results showed a dramatic improvement in
FLAT^3 performance. An example that highlights the ben-
efits of expansion considers the search for “account num-
ber”. Before normalization this search returns files contain-
ing accountNum having a confidence (cosine similarity)
of 0.60. After normalization, which replaces accountNum
with accountNumber, the confidence jumps to over 0.95.
Similar improvement is expected from other IR-based SE
tools.
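As a toy illustration of why normalization raises such cosine scores (the vectors and numbers below are invented for this sketch, not taken from the FLAT^3 experiment): the query and the code are compared as bags of words, so an unsplit accountNum shares no terms with the query, while the normalized form does.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two bags of words."""
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = ["account", "number"]
before = ["accountnum", "balance", "update"]         # identifier left as-is
after = ["account", "number", "balance", "update"]   # split and expanded
print(cosine(query, before), round(cosine(query, after), 2))  # 0.0 0.71
```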
9 Summary
IR-based techniques complement techniques grounded in
structural (compiler) based analysis, which presently dom-
inate the field. Popular examples include Prevent [12],
Klocwork K7 [19], and historically lint [7]. Further ex-
amples can be found in a recent survey of the past, present,
and future of Source Code Analysis [4].
Vocabulary normalization is a preprocessing step that al-
lows source code to satisfy the often implicit assumption of
IR-based techniques that the same words are used whenever
describing a particular concept. This assumption is com-
monly violated by the language used in source-code iden-
tifiers and that used in the documentation. The vocabulary
normalization described in this paper is a key step to im-
proving existing and future IR-based tools and techniques.
Normalization has two aspects: splitting and meaning as-
signment. The splitting algorithm leverages a collection of
metrics to correctly split a hard-word 82% of the time. Pre-
liminary experiments with the second phase find that it
works well. Further evolution of the algorithms for both
phases and continued empirical study are expected to show
the continued improvement and benefit of vocabulary nor-
malization.
10 Acknowledgments
Thanks to Justin Overfelt and Austin Wheeler for their
input on the application of Normalize. Special thanks to
the anonymous referees for their suggestions, which con-
tributed to a substantial improvement to the paper. Support
for this work was provided by NSF grant CCF 0916081.
References
[1] N. Anquetil and T. Lethbridge. Assessing the relevance of identi-
fier names in a legacy software system. In Proceedings of the 1998
conference of the Centre for Advanced Studies on Collaborative Re-
search, Toronto, Ontario, Canada, November 1998.
[2] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Re-
covering traceability links between code and documentation. IEEE
Transactions on Software Engineering, 28(10), October 2002.
[3] K. Atkinson. Spell checking oriented word lists (SCOWL).
[4] D. Binkley. Source code analysis: A road map. ICSE 2007 special
track on the Future of Software Engineering, May 2007.
[5] D. Binkley and D. Lawrie. Applications of information retrieval to
software development. Encyclopedia of Software Engineering (P.
Laplante, ed.), (to appear).
[6] D. Binkley and D. Lawrie. Applications of information retrieval to
software maintenance and evolution. Encyclopedia of Software En-
gineering (P. Laplante, ed.), (to appear).
[7] A. Binstock. Extra-strength code cleaners, Jan-
uary 2006. www.infoworld.com/article/06/01/26/
74270 05FEcodelint 1.html.
[8] T. Brants and A. Franz. Web 1t 5-gram version 1, 2006. Linguistic
Data Consortium, Philadelphia.
[9] G. Canfora and L. Cerulo. Jimpa: An eclipse plug-in for impact
analysis. In Proceedings of the Tenth Conference on Software Main-
tenance and Reengineering, Bari, Italy, March 2006.
[10] B. Caprile and P. Tonella. Restructuring program identifier names.
In ICSM, 2000.
[11] T. M. Cover and J. A. Thomas. Elements of information theory.
Wiley-Interscience, New York, New York, 1991.
[12] Coverity. www.coverity.com, 2006.
[13] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker. Mining source
code to automatically split identifiers for software analysis. In Pro-
ceedings of the 2009 Mining Software Repositories (MSR). IEEE,
May 2009.
[14] L. Etzkorn and T. Menzies. Information retrieval for program com-
prehension. Journal of Empirical Software Engineering, to appear.
[15] H. Feild, D. Binkley, and D. Lawrie. An empirical comparison of
techniques for extracting concept abbreviations from identifiers. In
Proceedings of IASTED International Conference on Software Engi-
neering and Applications, Dallas, TX, November 2006.
[16] F. Feng and W.B. Croft. Probabilistic techniques for phrase extrac-
tion. Information Process Management, 37(2), March 2001.
[17] R. Ferenc, A. Beszédes, M. Tarkiainen, and T. Gyimóthy. Columbus
- reverse engineering tool and schema for C++. In IEEE Interna-
tional Conference on Software Maintenance (ICSM 2002), Montreal,
Canada, October 2002.
[18] E. Hill, Z. Fry, H. Boyd, G. Sridhara, Y. Novikova, L. Pollock, and
K. Vijay-Shanker. Amap: Automatically mining abbreviation ex-
pansions in programs to enhance software maintenance tools. In
Proceedings of the 2008 Mining Software Repositories (MSR). IEEE,
May 2008.
[19] Klocwork K7. www.klocwork.com, 2006.
[20] R. Krovetz. Viewing morphology as an inference process. In R. Ko-
rfhage et al., editor, Proceedings of the 16th ACM SIGIR Conference,
June 1993.
[21] Kari Laitinen. Estimating understandability of software documents.
SIGSOFT Software Engineering Notes, 21(4), 1996.
[22] D. Lawrie, H. Feild, and D. Binkley. Extracting meaning from ab-
breviated identifiers. In Proceedings of 2007 IEEE Conference on
Source Code Analysis and Manipulation (SCAM’07), Paris, France,
September 2007.
[23] D. Lawrie, H. Feild, and D. Binkley. Quantifying identifier quality:
An analysis of trends. Journal of Empirical Software Engineering,
12(4), 2007.
[24] E. Linstead, P. Rigor, S. Bajracharya, C. Lopes, and P. Baldi. Min-
ing eclipse developer contributions via author-topic models. In Pro-
ceedings of the Fourth International Workshop on Mining Software
Repositories, Dubrovnik, Croatia, May 2007.
[25] Y. Liu, R. Jin, and J. Chai. A maximum coherence model for
dictionary-based cross-language information retrieval. In Proceed-
ings of the 2005 SIGIR, Salvador, Brazil, August 2005. ACM.
[26] N. Madani, L. Guerrouj, M. Di Penta, Y. Gueheneuc, and G. Anto-
niol. Recognizing words from source code identifiers using speech
recognition techniques. In Proceedings of the 14th European Con-
ference on Soware Maintenance and Reengineering (CSMR). IEEE,
March 2010.
[27] C. Manning, P. Raghavan, and H. Schutze. Introduction to Informa-
tion Retrieval. Cambridge University Press, 2008.
[28] C. Manning and H. Schutze. Foundations of statistical natural lan-
guage processing. The MIT Press, 1999.
[29] A. Marcus, D. Poshyvanyk, and R. Ferenc. Using the conceptual
cohesion of classes for fault prediction in object-oriented systems.
IEEE Transactions on Software Engineering, 34(2), 2008.
[30] John G. McMahon and F. Jack Smith. A review of statistical lan-
guage processing techniques. Artificial Intelligence Review, 12(5),
1998.
[31] H. Mili, E. Ah-ki, R. Godin, and H. Mcheick. An experiment in
software component retrieval. Information and Software Technology,
45, 2003.
[32] G. Molenberghs and G. Verbeke. Models for Discrete Longitudinal
Data. Springer, Berlin, 2006.
[33] J. Rilling and T. Klemola. Identifying comprehension bottlenecks
using program slicing and cognitive complexity metrics. In Proceed-
ings of the 11th IEEE International Workshop on Program Compre-
hension, Portland, Oregon, USA, May 2003.
[34] T. Savage, M. Revelle, and D. Poshyvanyk. FLAT^3: Feature loca-
tion and textual tracing tool. In Proceedings of 32nd ACM/IEEE In-
ternational Conference on Software Engineering (ICSE’10), Formal
Research Tool Demonstration. ACM, May 2010.
[35] D. Shepherd, Z. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using
natural language program analysis to locate and understand action-
oriented concerns. In International Conference on Aspect Oriented
Software Development, Vancouver, British Columbia, March 2007.
[36] W. Zhao and L. Zhang. Sniafl: Towards a static non-interactive ap-
proach to feature location. ACM Transactions on Software Engineer-
ing and Methodology, 15(2), 2006.
... Where most of the approaches apply Information Retrieval (IR) techniques to collect lexical information with the assumption that the textual information of source code and comment are same. However, that assumption can be violated [4] in several cases, for example, the vocabulary developers use to write source DOI reference number: 10.18293/SEKE2020-062. code can be different from the vocabulary of comment (e.g. synonym). ...
Conference Paper
Full-text available
In modern era, the size of software is increasing, as a result a large number of software developers are assigned into software projects. To have a better understanding about source codes these developers are highly dependent on code comments. However, comments and source codes are often inconsistent in a software project because keeping comments up-to-date is often neglected. Since these comments are written in natural language and consist of context related topics from source codes, manual inspection is needed to ensure the quality of the comment associated with the corresponding code. Existing approaches consider entire texts as a feature, which fail to capture dominant topics to build the bridge between comments and its corresponding code. In this paper, an effective approach has been proposed to automatically extract dominant topics as well as to identify the consistency between a code snippet and its corresponding comment. This approach is evaluated with a benchmark dataset containing 2.8K Java code-comment pairs, which showed that the proposed approach has achieved better performance with respect to the several evaluation metrics than the existing state-of-the-art Support Vector Machine on vector space model.
Article
Refactoring is a widespread practice of improving the quality of software systems by applying changes on their internal structures without affecting their observable behaviors. Rename is one of the most recurring and widely used refactoring operation. A rename refactoring is often required when a software entity was poorly named in the beginning or its semantics have changed and therefore should be renamed to reflect its new semantics. However, identifying renaming opportunities is often challenging as it involves several aspects including source code semantics, natural language understanding and developer's experience. To this end, we propose a new approach to identify rename refactoring opportunities by leveraging feature requests. The rationale is that, when implementing a feature request there are chances that the semantics of software entities could significantly change to fulfill the requested feature. Consequently, their names should be modified as well to portray their latest semantics. The approach employs textual similarity to assess the similarity between a feature request description and identifiers. The approach has been validated on the dataset of 15 open source Java applications by comparing the recommended renaming opportunities against those recovered from the refactoring history of the involved subject applications. The evaluation results suggest that, the proposed approach can identify renaming opportunities on up to 66% precision and 72% recall.
Article
Although the negative impact of abbreviations in source code is well recognized, abbreviations are common for various reasons. A number of approaches have therefore been proposed to expand abbreviations in identifiers. However, such approaches are either inaccurate or confined to specific identifiers. In this paper, we propose a generic and accurate approach to expand identifier abbreviations by leveraging both semantic relations and transfer expansion. One key insight of the approach is that abbreviations in the name of a software entity e have a great chance of finding their full terms in the names of software entities that are semantically related to e. Consequently, the proposed approach builds a knowledge graph to represent such entities and their relationships with e and searches the graph for full terms. Another key insight is that literally identical abbreviations within the same application are likely (but not necessarily) to have identical expansions, and thus a semantics-based expansion in one place may be transferred to other places. To investigate when abbreviation expansion can be transferred safely, we conduct a case study on three open-source applications. The results suggest that a significant part (75 percent) of expansions can be transferred among lexically identical abbreviations within the same application. However, the risk of transfer varies according to various factors, e.g., the length of the abbreviation, the physical distance between abbreviations, and the semantic relations between abbreviations. Based on these findings, we design nine heuristics for transfer expansion and propose a learning-based approach to prioritize both transfer heuristics and semantics-based expansion heuristics. Evaluation results on nine open-source applications suggest that the proposed approach significantly improves the state of the art, improving recall from 29 to 89 percent and precision from 39 to 92 percent.
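The following toy sketch illustrates the transfer idea under a single assumed heuristic (a minimum abbreviation length); it does not reproduce the paper's nine heuristics or its learned prioritization, and all names, expansions, and thresholds are invented for the example.

# Toy sketch of transfer expansion under one assumed heuristic.
MIN_TRANSFER_LENGTH = 3  # assumed: very short abbreviations are too risky to transfer

# Expansions already resolved at some sites (e.g., via semantically related names).
resolved = {"ctx": "context", "msg": "message", "db": "database"}

# Lexically identical abbreviations found at other sites in the same application.
pending = ["ctx", "db", "msg", "cfg"]

for abbr in pending:
    expansion = resolved.get(abbr)
    if expansion and len(abbr) >= MIN_TRANSFER_LENGTH:
        print(f"{abbr} -> {expansion}  (transferred)")
    else:
        # "db" is resolved elsewhere but too short to transfer under this heuristic;
        # "cfg" has no known expansion yet.
        print(f"{abbr} -> ?  (needs a semantics-based expansion of its own)")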
Article
More than 70% of the characters in source code are used to form identifiers. Consequently, identifiers are one of the most important sources for program comprehension. Meaningful identifiers are crucial to understanding and maintaining programs. However, for reasons such as constrained schedules, inexperience, and unplanned evolution, identifiers may fail to convey the semantics of the entities associated with them. As a result, such entities should be renamed to improve software quality. However, manual renaming and recommendation are fastidious, time consuming, and error prone, whereas automating the process of renaming is challenging: (1) it involves complex natural language processing to understand the meaning of identifiers; (2) it also involves difficult semantic analysis to determine the role of software entities. Researchers have proposed a number of approaches and tools to facilitate renaming. We present a survey of existing approaches and classify them into identification of renaming opportunities, execution of renamings, and detection of renamings. We find that there is an imbalance among the three types of approaches, and that most implementations and evaluation datasets are not publicly available. We also discuss the challenges and present potential research directions. To the best of our knowledge, this survey is the first comprehensive study of the renaming of software entities.
Article
Full-text available
There is a growing interest in creating tools that can assist engineers in all phases of the software life cycle. This assistance requires techniques that go beyond traditional static and dynamic analysis. An example of such a technique is the application of information retrieval (IR), which exploits information found in a project's natural language. Such information can be extracted from the source code's identifiers and comments and in artifacts associated with the project, such as the requirements. The techniques described pertain to the maintenance and evolution phase of the software life cycle and focus on problems such as feature location and impact analysis. These techniques highlight the bright future that IR brings to addressing software engineering problems.
Article
Full-text available
When a programmer is faced with the task of modifying code written by others, he or she must first gain an understanding of the concepts and entities used by the program. Comments and identifiers are the two main sources of such knowledge. In the case of identifiers, the meaning can be hidden in abbreviations that make comprehension more difficult. A tool that can automatically replace abbreviations with their full word meanings would improve the ability of programmers (especially less experienced ones) to understand and work with the code. Such a tool first needs to isolate abbreviations within the identifiers. When identifiers are separated by division markers such as underscores or camel casing, this isolation task is trivial. However, many identifiers lack these division markers. Therefore, the first task of automatic expansion is separation of identifiers into their constituent parts. Presented here is a comparison of three techniques that accomplish this task: a random algorithm (used as a straw man), a greedy algorithm, and a neural network based algorithm. The greedy algorithm's performance ranges from 75 to 81 percent correct, while the neural network's performance ranges from 71 to 95 percent correct.
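A rough Python sketch of the greedy, dictionary-based strategy described above (an illustration, not the authors' implementation): explicit markers are handled first, and marker-free runs of letters are split by repeatedly taking the longest known dictionary prefix; the tiny word list is a placeholder.

import re

DICTIONARY = {"print", "file", "name", "count", "str", "get", "set"}  # placeholder word list

def split_markers(identifier: str) -> list[str]:
    """Split on underscores, digits, and lower-to-upper camel-case boundaries."""
    parts = []
    for chunk in re.split(r"[_\d]+", identifier):
        # Insert a break before an upper-case letter that follows a lower-case one.
        parts.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk).split())
    return parts

def greedy_split(word: str) -> list[str]:
    """Repeatedly peel off the longest dictionary prefix; keep any unmatched remainder whole."""
    word = word.lower()
    result = []
    while word:
        for size in range(len(word), 0, -1):
            if word[:size] in DICTIONARY:
                result.append(word[:size])
                word = word[size:]
                break
        else:  # no dictionary prefix matched
            result.append(word)
            break
    return result

def split_identifier(identifier: str) -> list[str]:
    return [piece for part in split_markers(identifier) for piece in greedy_split(part)]

print(split_identifier("printfilename"))  # ['print', 'file', 'name']
print(split_identifier("fileCount_str"))  # ['file', 'count', 'str']

Being greedy, such a splitter is easily misled when a prefix happens to be a word but is not the intended one, which is one reason the comparison above also considers alternatives to the purely greedy strategy.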
Article
Feature location is the process of finding the source code that implements a functional requirement of a software system. It plays an important role in software maintenance activities, but when it is performed manually, it can be challenging and time-consuming, especially for large, long-lived systems. This paper describes a tool called FLAT^3 that integrates textual and dynamic feature location techniques along with feature annotation capabilities and a useful visualization technique, providing a complete suite of tools that allows developers to quickly and easily locate the code that implements a feature and then save these annotations for future use.
Article
Software developers and maintainers need to read and understand source programs and other kinds of software documents in their work. Understandability of software documents is thus important. This paper introduces a method for estimating the understandability of software documents. The method is based on a language theory according to which every software document is considered to contain a language of its own, which is a set of symbols. The understandability of documents written according to different documentation practices can be compared using the rules of the language theory. The method and the language theory are presented by using source programs with different naming styles as example documents. The method can, at least theoretically, be applied to any kind of document. It can also be used to explain the benefits of some well-known software design methods.
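To make the "each document defines its own language" idea concrete, the sketch below (an illustration, not the paper's method) takes a document's language to be the set of word parts appearing in its identifiers and compares two naming styles by how much of that language consists of recognizable dictionary words; the word list and example expressions are assumptions.

import re

DICTIONARY = {"total", "price", "item", "count", "tax", "rate"}  # placeholder word list

def document_language(source: str) -> set[str]:
    """The set of lower-cased word parts appearing in the document's identifiers."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", source)
    return set(re.findall(r"[a-z]+", spaced.lower()))

def dictionary_coverage(language: set[str]) -> float:
    return len(language & DICTIONARY) / len(language) if language else 0.0

full_words = "totalPrice = itemCount * price * (1 + taxRate)"
abbreviated = "totPrc = itmCnt * prc * (1 + txRt)"

for label, src in [("full words", full_words), ("abbreviated", abbreviated)]:
    lang = document_language(src)
    print(f"{label}: |language| = {len(lang)}, dictionary coverage = {dictionary_coverage(lang):.2f}")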