Query Quality Prediction on Source Code Base Dataset: A Comparative Study

Swathi B.P
Department of Information and Communication Technology
Manipal Institute of Technology
Manipal Academy of Higher Education
Manipal, India
swathi.bp@manipal.edu
Balachandra Muniyal
Department of Information and Communication Technology
Manipal Institute of Technology
Manipal Academy of Higher Education
Manipal, India
bala.chandra@manipal.edu
Abstract—Source code retrieval is a text retrieval task that
software developers perform regularly. Existing source code
retrieval approaches are based on regular expressions and assume
that the developer querying the code base is already well
acquainted with the source code. Because keywords and regular
expressions are difficult to remember, developers should instead
be able to query the code base in sentential form. However, the
performance of a text search depends heavily on query quality:
the search succeeds only when the quality of the textual query is
high. Predicting query quality before the query is executed on a
source code retrieval system saves developers time and effort by
notifying them when a query is unlikely to perform well. This
paper assesses the performance of four prominent classification
algorithms, namely Support Vector Machine (SVM), Logistic
Regression (LR), Gradient Boosted Tree (GBT) and Decision Tree
(DT), in predicting query quality on a data set created from the
documentation of source code files. Experimental results on data
sets built from benchmark open source projects demonstrate that
Gradient Boosted Tree performs better than the other classifiers.
Keywords—Data mining, Information retrieval, Text
retrieval, Source code retrieval, Pre-retrieval metrics, Query
quality prediction.
I. INTRODUCTION
Text retrieval is a branch of information retrieval in which the
retrieved information is primarily in the form of text. Among the
many applications of text retrieval, source code retrieval helps
to locate source code files in a source code repository[1].
Existing source code retrieval approaches are keyword or regular
expression based and expect the developer querying the code base
to be thoroughly familiar with the source code. JSearch[2], Google
Eclipse Search[3], Searchcode[4], InstaSearch[5], Strathcona[6],
Catalog[7] and Sando[8] are some of the tools that retrieve source
code through keyword search on method names, class names and
variable names. Such proficiency cannot be expected from a
developer who is new to the code base. When a change request
arrives during the software development cycle, developers spend
most of their time searching a large code repository for the code
that has to be modified. This search time can be reduced by
letting developers query the code base in sentential form. A
common problem of text retrieval applications, however, is that
the retrieval result depends mainly on the quality of the query.
Retrieval is hindered when the developer's vocabulary differs
from that of the code base, since the query given by the user may
or may not contain terms from the code base. A user query is said
to be of high quality when its terms match terms in the code base
and of low quality when its terms are not present in the code
base[9].
When a query cannot retrieve the relevant documents, a lot of
time is spent reading irrelevant documents in order to manually
identify and locate the relevant source code. A query should
therefore be tested for its quality and categorized as high or
low quality before it is run. In this paper, the features used to
predict query quality are adopted from natural language
processing[10]. The measured properties (features) of the query,
e.g. specificity, are called pre-retrieval metrics because they
are computed before the query runs on an information retrieval
engine[11]. The training data is created by first collecting the
documentation (comments) from the source code files and then
computing the pre-retrieval metrics of the collected
documentation. The construction of the training data is followed
by a comparative study of four classification algorithms, Support
Vector Machine, Logistic Regression, Gradient Boosted Tree and
Decision Tree, which predict query quality using the constructed
training data. The pre-retrieval metrics and the algorithms are
implemented in the R language.
II. QUERY PROPERTIES
The pre-retrieval metrics required to analyze query quality fall
into four groups: specificity, coherency, similarity and term
relatedness[11]. The models under comparison are trained and
tested using the dataset constructed from the source code base.
The dataset has 21 pre-retrieval metrics from natural language
processing as features.
A. Specificity
Specificity is a property of the user query that concerns how
the query terms are distributed across the source code base. A
query is more likely to retrieve good results if its terms occur
frequently in only a few documents of the source code base. When
the same query term appears in most of the documents, it becomes
difficult to distinguish relevant from non-relevant documents.
The twelve specificity variants are listed in Table 1, where $D$
is the set of documents in the collection, $D_t$ is the set of
documents in the collection containing the term $t$, $q$ is a
term of the query, $Q$ is the set of query terms, $d$ is a
document in the document collection $D$, $tf(t, D)$ is the
frequency of term $t$ in all documents, $tf(t, d)$ is the
frequency of term $t$ in $d$, and

$$idf(t) = \log\left(\frac{|D|}{|D_t|}\right) \tag{1}$$

$$ictf(t) = \log\left(\frac{|D|}{tf(t, D)}\right) \tag{2}$$

$$entropy(t) = \sum_{d \in D_t} \frac{tf(t, d)}{tf(t, D)} \cdot \log_{|D|}\left(\frac{tf(t, d)}{tf(t, D)}\right) \tag{3}$$

$$P_t(D) = \frac{tf(t, D)}{|D|} \tag{4}$$
978-1-5386-5314-2/18/$31.00 ©2018 IEEE 1115
Authorized licensed use limited to: MANIPAL INSTITUTE OF TECHNOLOGY. Downloaded on May 17,2021 at 05:56:43 UTC from IEEE Xplore. Restrictions apply.
TABLE 1: Variants of Specificity metric

| Metric | Description | Formula |
|---|---|---|
| AvgIDF | Average of the Inverse Document Frequency values over all query terms | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} idf(q)$ |
| MaxIDF | Maximum of the Inverse Document Frequency values over all query terms | $\max_{q \in Q} idf(q)$ |
| DevIDF | The standard deviation of the Inverse Document Frequency values over all query terms | $\sqrt{\frac{1}{\lvert Q \rvert} \sum_{q \in Q} \left(idf(q) - AvgIDF\right)^2}$ |
| AvgICTF | Average of the Inverse Collection Term Frequency values over all query terms | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} ictf(q)$ |
| MaxICTF | Maximum of the Inverse Collection Term Frequency values over all query terms | $\max_{q \in Q} ictf(q)$ |
| DevICTF | The standard deviation of the Inverse Collection Term Frequency (ictf) values over all query terms | $\sqrt{\frac{1}{\lvert Q \rvert} \sum_{q \in Q} \left(ictf(q) - AvgICTF\right)^2}$ |
| AvgEntropy | Average of the entropy values over all query terms | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} entropy(q)$ |
| MedEntropy | Median of the entropy values over all query terms | $\mathrm{median}_{q \in Q}\, entropy(q)$ |
| MaxEntropy | Maximum of the entropy values over all query terms | $\max_{q \in Q} entropy(q)$ |
| DevEntropy | The standard deviation of the entropy values over all query terms | $\sqrt{\frac{1}{\lvert Q \rvert} \sum_{q \in Q} \left(entropy(q) - AvgEntropy\right)^2}$ |
| Query Scope (QS) | The percentage of documents in the collection containing at least one of the query terms | $\frac{\lvert \bigcup_{q \in Q} D_q \rvert}{\lvert D \rvert}$ |
| Simplified Clarity Score (SCS) | The Kullback-Leibler divergence of the query language model $P_q(Q)$ from the collection language model $P_q(D)$ | $\sum_{q \in Q} P_q(Q) \cdot \log\left(\frac{P_q(Q)}{P_q(D)}\right)$ |
B. Coherency
The second metric is coherency, a query property that captures
how concentrated the query is on a particular topic. In source
code retrieval, the query terms should focus on a particular
feature (topic) so that the set of all files implementing that
feature can be retrieved. The four variants of the coherency
metric are listed in Table 2, where

$$VAR(t) = \sqrt{\frac{1}{|D_t|} \sum_{d \in D_t} \left(w(t, d) - \bar{w}_t\right)^2} \tag{5}$$

$$w(t, d) = \frac{1}{|d|} \log\left(1 + tf(t, d)\right) \cdot idf(t) \tag{6}$$

$$\bar{w}_t = \frac{1}{|D_t|} \sum_{d \in D_t} w(t, d) \tag{7}$$

and $sim(d_i, d_j)$ is the cosine similarity between the vector
space representations of $d_i$ and $d_j$.
TABLE 2: Variants of Coherency metric

| Metric | Description | Formula |
|---|---|---|
| AvgVAR | Average of the variances of the query term weights over the documents containing the query term (VAR), over all query terms | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} VAR(q)$ |
| MaxVAR | Maximum of the variances of the query term weights over the documents containing the query term (VAR), over all query terms | $\max_{q \in Q} VAR(q)$ |
| SumVAR | Sum of the variances of the query term weights over the documents containing the query term (VAR), over all query terms | $\sum_{q \in Q} VAR(q)$ |
| Coherence Score (CS) | The average of the pairwise similarity between all pairs of documents containing one of the query terms among all documents in the corpus | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} \left( \frac{\sum_{d_i, d_j \in D_q,\, i \neq j} sim(d_i, d_j)}{\lvert D_q \rvert \cdot (\lvert D_q \rvert - 1)} \right)$ |
TABLE 3: Variants of Similarity metric

| Metric | Description | Formula |
|---|---|---|
| AvgSCQ | The average of the collection-query similarity (SCQ) over all query terms | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} SCQ(q)$ |
| MaxSCQ | The maximum of the collection-query similarity (SCQ) over all query terms | $\max_{q \in Q} SCQ(q)$ |
| SumSCQ | The sum of the collection-query similarity (SCQ) over all query terms | $\sum_{q \in Q} SCQ(q)$ |
C. Similarity
The third metric is similarity, which measures how similar the
query as a whole is to the source code base. A query retrieves
better results when its terms, taken together, are similar to the
source code base. The variants of the similarity metric are
listed in Table 3, where

$$SCQ(t) = \left(1 + \log\left(tf(t, D)\right)\right) \cdot idf(t) \tag{8}$$
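Equation (8) and the three aggregations of Table 3 translate directly into code. The following Python sketch assumes the natural logarithm and an invented toy corpus; the function names are hypothetical.

```python
import math

# Hypothetical toy corpus: each document is a list of preprocessed terms.
docs = [
    ["open", "file", "read", "buffer"],
    ["close", "file", "handle"],
    ["decode", "audio", "buffer", "buffer"],
    ["parse", "archive", "header"],
]

def scq(term, docs):
    """Equation (8): (1 + log tf(t, D)) * idf(t). Assumes the term occurs in the corpus."""
    collection_tf = sum(d.count(term) for d in docs)
    containing = sum(1 for d in docs if term in d)
    return (1 + math.log(collection_tf)) * math.log(len(docs) / containing)

query = ["buffer", "archive"]
scq_values = [scq(t, docs) for t in query]
avg_scq = sum(scq_values) / len(scq_values)
max_scq = max(scq_values)
sum_scq = sum(scq_values)
print(avg_scq, max_scq, sum_scq)
```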
D. Term Relatedness
Term relatedness is the fourth query property. It states that a
query retrieves more meaningful results when its terms co-occur in
the source code base. The two measures of term relatedness are
listed in Table 4, where

$$PMI(t_1, t_2) = \log\left(\frac{P_{t_1, t_2}(D)}{P_{t_1}(D) \cdot P_{t_2}(D)}\right) \tag{9}$$
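A small Python sketch can make equation (9) concrete. It assumes that the probabilities are estimated at document level, i.e. as the fraction of documents containing a term or both terms; the paper does not spell out its estimator, so this choice, the toy corpus and the function names are assumptions.

```python
import math
from itertools import combinations

# Hypothetical toy corpus: each document is the set of its preprocessed terms.
docs = [
    {"open", "file", "read", "buffer"},
    {"close", "file", "handle"},
    {"decode", "audio", "buffer"},
    {"parse", "archive", "header"},
]

def pmi(t1, t2, docs):
    """Equation (9) with document-level probability estimates (an assumption)."""
    n = len(docs)
    p1 = sum(1 for d in docs if t1 in d) / n
    p2 = sum(1 for d in docs if t2 in d) / n
    p12 = sum(1 for d in docs if t1 in d and t2 in d) / n
    if p12 == 0:
        return float("-inf")  # the two terms never co-occur
    return math.log(p12 / (p1 * p2))

query = ["open", "file", "buffer"]
# One PMI value per unordered pair of query terms.
pair_scores = [pmi(a, b, docs) for a, b in combinations(query, 2)]
avg_pmi = sum(pair_scores) / len(pair_scores)
max_pmi = max(pair_scores)
print(avg_pmi, max_pmi)
```

Pairs that never co-occur yield negative infinity in this sketch; a practical implementation would need a smoothing or skipping policy for such pairs, which is outside what the paper describes.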
III. CLASSIFICATION TECHNIQUES
A. Support Vector Machine
A support vector machine is a decision machine that classifies
both linearly and non-linearly separable data[12]. In SVM, the
decision boundary is the one that maximizes the margin.
Statistical learning theory is used to learn the maximum margin
solution. In this work, an SVM with a Radial Basis Function (RBF)
kernel is used, with the kernel parameter gamma set to 1.
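For reference, the RBF kernel in its standard form is k(x, y) = exp(-gamma * ||x - y||^2). The Python sketch below evaluates it for gamma = 1; it only illustrates the kernel function and is not the SVM training procedure, and the function name is hypothetical.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical feature vectors have similarity 1; similarity decays with distance.
same = rbf_kernel([0.2, 0.5], [0.2, 0.5])
near = rbf_kernel([1.0, 0.0], [0.0, 0.0])
far = rbf_kernel([3.0, 0.0], [0.0, 0.0])
print(same, near, far)
```

With gamma fixed at 1, features on very different scales would dominate the distance, so scaling the 21 metrics before training would be a natural step; the paper does not say whether this was done.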
B. Logistic Regression
A logit model is applied to a binary dependent variable[13].
Logistic regression is the estimation of the parameters of the
logit model. In this work, logistic regression is implemented
using the Limited-memory Broyden–Fletcher–Goldfarb–Shanno
(L-BFGS) solver.
C. Gradient Boosted Trees
Gradient boosting is a machine learning technique for regression
and classification problems[14]. GBT produces a prediction model
by assembling weaker models, typically decision trees. In this
work, 20 trees with a depth of 5 yielded the best accuracy.
D. Decision Tree
Decision tree induction is the construction of decision trees
from class-labelled training samples[15]. This implementation
uses gain ratio as the splitting criterion and a tree depth of 10
for better accuracy.
IV. METHODOLOGY
Query quality prediction is composed of two main modules: the
construction of the dataset followed by the prediction of query
quality. The input to the dataset construction module is a
benchmark source code base that contains documentation in the
form of comments. In this work, the comments of three open source
projects, VLC media player, Code Blocks and 7-zip, have been
retrieved. The retrieval of comments from the source code base is
followed by query processing operations (lexical analysis,
stemming, stop word removal) and the computation of the 21
pre-retrieval metrics. The 21 metrics are computed for each
comment in the source code base to form the training data and are
stored in a database; MongoDB, a document-oriented NoSQL database,
is used in the implementation. Any missing value in the dataset is
replaced by the average of the values observed for that metric. To
assign a label to each sample in the training data, each comment
from the source code base is run as a query on the information
retrieval engine, which is implemented with the Lucene library. If
the document containing the comment is retrieved among the top
few results (say the top 5 or 10 documents), the sample holding
that comment's pre-retrieval metrics is labelled 1; otherwise it
is labelled 0. Fig 1 shows the model used to construct the
training data set. The attribute ‘Quality’ is the class label of
the data set.
Once the training set is prepared, the four models, namely SVM,
LR, GBT and DT, are trained and tested on the data set, with 70%
of the data used for training and 30% for testing.
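The labelling rule, the mean imputation of missing values and the 70/30 split described above can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: the ranked result list stands in for the output of the Lucene-based engine, and the function names, the cut-off k and the random seed are hypothetical.

```python
import random

def label_query(ranked_files, source_file, k=10):
    """Return 1 if the file the comment came from is among the top k results, else 0."""
    return 1 if source_file in ranked_files[:k] else 0

def impute_mean(rows):
    """Replace None entries with the mean of the observed values of that metric.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)] for r in rows]

def split_70_30(samples, seed=0):
    """Shuffle the samples and split them into 70% training and 30% test data."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical ranked result list for one comment used as a query.
ranked = ["codec.c", "input.c", "playlist.c"]
label_hit = label_query(ranked, "input.c", k=2)      # source file ranked second
label_miss = label_query(ranked, "playlist.c", k=2)  # source file outside the top 2
filled = impute_mean([[1.0, None], [3.0, 4.0]])
train, test = split_70_30(list(range(10)))
print(label_hit, label_miss, filled, len(train), len(test))
```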
Fig 1. Model to construct training data set
TABLE 4: Variants of Term Relatedness metric

| Metric | Description | Formula |
|---|---|---|
| AvgPMI | Average Pointwise Mutual Information (PMI) over all pairs of terms in the query | $\frac{2(\lvert Q \rvert - 2)!}{\lvert Q \rvert!} \sum_{q_1, q_2 \in Q} PMI(q_1, q_2)$ |
| MaxPMI | Maximum Pointwise Mutual Information (PMI) over all pairs of terms in the query | $\max_{q_1, q_2 \in Q} PMI(q_1, q_2)$ |
Fig 2. Framework to predict query quality
V. RESULTS
The dataset created from the 7-zip, VLC media player and Code
Blocks source code consists of the 21 pre-retrieval metrics of
319, 1055 and 798 comments respectively. Table 5 and Table 7
report the accuracy and the AUC score of the classifiers on each
dataset, and Table 6 reports their Kappa statistics.
TABLE 5. ACCURACIES OF CLASSIFIERS

| Dataset | SVM | LR | GBT | DT |
|---|---|---|---|---|
| 7-zip | 92.2% | 90% | 94.8% | 93.7% |
| VLC media player | 89.4% | 90.1% | 91.3% | 89.9% |
| Code Blocks | 92.6% | 87.6% | 92.6% | 91.6% |
TABLE 6. KAPPA STATISTICS OF CLASSIFIERS

| Dataset | SVM | LR | GBT | DT |
|---|---|---|---|---|
| 7-zip | 0.666 | 0.820 | 0.910 | 0.711 |
| VLC media player | 0.498 | 0.50 | 0.504 | 0.503 |
| Code Blocks | 0.368 | 0.521 | 0.564 | 0.506 |
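Accuracy and the Kappa statistic can diverge: on Code Blocks, for instance, SVM reaches 92.6% accuracy but a Kappa of only 0.368. Cohen's kappa corrects the observed agreement for the agreement expected by chance, which the Python sketch below computes from true and predicted labels. The label vectors are invented for illustration. One possible explanation for such a gap, which the paper does not state, is an imbalance between the two classes.

```python
def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equally long label lists."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    labels = set(y_true) | set(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if expected == 1.0:
        return observed, 0.0  # degenerate case: only one label is ever used
    return observed, (observed - expected) / (1 - expected)

# Hypothetical imbalanced data: predicting the majority class everywhere
# gives 90% accuracy but no agreement beyond chance.
acc_major, kappa_major = accuracy_and_kappa([1] * 9 + [0], [1] * 10)

# Hypothetical balanced data with one error.
acc_bal, kappa_bal = accuracy_and_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(acc_major, kappa_major, acc_bal, kappa_bal)
```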
TABLE 7. AUC SCORE OF CLASSIFIERS

| Dataset | SVM | LR | GBT | DT |
|---|---|---|---|---|
| 7-zip | 0.92 | 0.82 | 0.94 | 0.79 |
| VLC media player | 0.9 | 0.91 | 0.92 | 0.81 |
| Code Blocks | 0.87 | 0.61 | 0.92 | 0.75 |
Fig 3, Fig 4 and Fig 5 show the ROC plots of the classifiers for
the three datasets. From the graphs, it is evident that GBT has
the highest AUC score in all cases.
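The AUC score can be read as the probability that a randomly chosen high quality query receives a higher classifier score than a randomly chosen low quality one. The Python sketch below computes it from this pairwise definition; the labels and scores are invented, and the quadratic loop is chosen for clarity rather than efficiency.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        (1.0 if p > n else (0.5 if p == n else 0.0))
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for four queries (1 = high quality).
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
perfect_auc = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(example_auc, perfect_auc)
```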
Fig 3. ROC for Code Blocks dataset of (a) GBT (b) SVM (c) LR (d) Decision Tree
Fig 4. ROC for VLC dataset of (a) GBT (b) SVM (c) LR (d) Decision Tree
Fig 5. ROC for 7-zip dataset of (a) GBT (b) SVM (c) LR (d) Decision Tree
VI. CONCLUSION AND FUTURE WORK
The comparison of the SVM, LR, GBT and DT classification
algorithms on the data set constructed from standard open source
projects shows that GBT predicts query quality better than the
other models, with accuracies of 94.8%, 91.3% and 92.6% for the
7-zip, VLC media player and Code Blocks source code respectively
(on Code Blocks, SVM reaches the same accuracy). Boosting works
well because it builds the model incrementally, with each new tree
concentrating on the observations that are hard to classify.
Therefore, GBT is an appropriate classification algorithm for
predicting the quality of a query run on a source code base to
retrieve classes/methods when a developer is altogether new to
the code base.
In the future, a module can be developed that reformulates a low
quality developer query with terms from the source code base to
provide a better search result. The label “low quality” produced
by GBT is the trigger for the query reformulation module. The
reformulation is to be performed based on the concept of
synonyms.
REFERENCES
[1] B. Sisman, S.A. Akbar, and A.C. Kak, “Exploiting spatial code
proximity and order for improved source code retrieval for bug
localization,” Journal of Software: Evolution and Process, vol. 29,
no. 1, 2017.
[2] R. Sindhgatta, “Using an information retrieval system to retrieve
source code samples,” Proceedings of the 28th ACM International
Conference on Software Engineering, pp. 905-908, 2006.
[3] D. Poshyvanyk, M. Petrenko, A. Marcus, X. Xie, and D. Liu,
“Source code exploration with Google,” Proceedings of the 22nd
International Conference on Software Maintenance, pp. 334-338, 2006.
[4] searchcode.com, ‘searchcode’, 2018. [Online]. Available:
https://searchcode.com [Accessed: 31-March-2018]
[5] marketplace.eclipse.org, ‘eclipse marketplace’, 2018. [Online].
Available: http://marketplace.eclipse.org [Accessed: 20-March-2018]
[6] R. Holmes and G.C. Murphy, “Using structural context to
recommend source code examples,” Proceedings of the 27th ACM
International Conference on Software Engineering, pp. 117-125, 2005.
[7] W.B. Frakes and B.A. Nejmeh, “Software reuse through information
retrieval,” ACM SIGIR Forum, vol. 21, no. 1-2, pp. 30-36, 1986.
[8] X. Ge, D. Shepherd, K. Damevski, and E. Murphy-Hill, “How
developers use multi-recommender system in local code search,”
Proceedings of the IEEE Symposium on Visual Languages and
Human-Centric Computing, pp. 69-76, 2014.
[9] C. Mills, G. Bavota, S. Haiduc, R. Oliveto, A. Marcus, and A. De
Lucia, “Predicting query quality for applications of text retrieval
to software engineering tasks,” ACM Transactions on Software
Engineering and Methodology, vol. 26, no. 1, 2017.
[10] M. Boughanem, C. Berrut, and J. Mothe, “Advances in
Information Retrieval,” Proceedings of the 31st European Conference
on IR, Toulouse, France, 2009.
[11] S. Haiduc, G. Bavota, R. Oliveto, A. De Lucia, and A. Marcus,
“Automatic query performance assessment during the retrieval of
software artifacts,” Proceedings of the 27th IEEE/ACM International
Conference on Automated Software Engineering, pp. 90-99, 2012.
[12] C.M. Bishop, “Pattern recognition and machine learning,”
Springer-Verlag New York, Inc., Secaucus, NJ, 2006.
[13] D.A. Freedman, “Statistical models: theory and practice,”
Cambridge University Press, 2009.
[14] J.H. Friedman, “Greedy function approximation: a gradient
boosting machine,” Annals of Statistics, pp. 1189-1232, 2001.
[15] J. Han, J. Pei, and M. Kamber, “Data mining: concepts and
techniques,” Elsevier, 2011.