E-SC4R: Explaining Software Clustering for Remodularisation
Alvin Jian Jia Tana,Chun Yong Chonga,and Aldeida Aletib
aSchool of Information Technology, Monash University Malaysia, Jalan Lagoon Selatan, Bandar Sunway, 47500 Subang Jaya, Selangor, Malaysia
bFaculty of Information Technology, Monash University, Clayton 3168, VIC, Australia
ARTICLE INFO
Keywords:
architecture recovery
software remodularisation
software clustering
feature extraction
footprint visualisation
ABSTRACT
Maintenance of existing software requires a large amount of time for comprehending the source code. The architecture of a software system, however, may not be clear to maintainers if up-to-date documentation is not available. Software clustering is often used as a remodularisation and architecture recovery technique to help recover a semantic representation of the software design. Due to the diverse domains, structure, and behaviour of software systems, the suitability of different clustering algorithms for different software systems has not been investigated thoroughly. Research that introduces new clustering techniques usually validates them on a specific domain, which might limit their generalisability. If the chosen test subjects represent only a narrow perspective of the whole picture, researchers risk not being able to address the external validity of their findings. This work aims to fill this gap by introducing a new approach, Explaining Software Clustering for Remodularisation (E-SC4R), to evaluate the effectiveness of different software clustering approaches. This work focuses on hierarchical and Bunch clustering algorithms and provides information about their suitability according to the features of the software, which, as a consequence, enables the selection of the optimal algorithm and configuration from our existing pool of choices for a particular software system. The E-SC4R framework is tested on 30 open-source software systems with varying sizes and domains, and demonstrates that it can characterise both the strengths and weaknesses of the analysed software clustering algorithms using software features extracted from the code. The proposed approach also provides a better understanding of the algorithms' behaviour by showing a 2D representation of the effectiveness of clustering techniques on the feature space generated through the application of dimensionality reduction techniques.
1. Introduction
Software clustering is one of the software remodularisation and software architecture recovery techniques that have received a substantial amount of attention in recent years [1,2,3,4,5,6,7]. There are many goals for software remodularisation, including, but not limited to, representing the high-level architectural view of the analysed software, removing potential technical debt due to suboptimal software structure, and reverse-documenting poorly documented systems. Software remodularisation can help software developers and maintainers better understand the interrelationships between software components.
As discussed in the work by Teymourian et al. [8], most software clustering algorithms fall into two main categories: agglomerative hierarchical [2,3,9,10] and search-based algorithms [11,12,13]. In general, software clustering works by choosing from a collection of software entities (methods, classes, or packages) and then forming multiple groups of entities such that the entities within the same group are similar to each other while being dissimilar from entities in other groups. By dividing and grouping
ilar from entities in other groups. By dividing and grouping
software entities based on their functionality, these groups
or clusters can be recognised as functionally similar subsys-
tems, which can be used to represent the software architec-
ture of the system. Ultimately, the high-level architecture
view of the software system will aid software maintainers in
Corresponding author
alvin.tan@monash.edu (A.J.J. Tan); chong.chunyong@monash.edu
(C.Y. Chong); aldeida.aleti@monash.edu (A. Aleti)
implementing new functionalities or make changes to exist-
ing code through better comprehension of the software de-
sign.
Due to their capabilities to aid in architecture recovery
and software remodularisation, software clustering techniques
have been widely investigated and a large number of tech-
niques have been introduced [1,14]. These techniques dif-
fer greatly in terms of the chosen common clustering fea-
tures, similarity measures, clustering algorithm, and evalu-
ation metric [12,14,15]. There is a vast variety of software
systems from different domains with unique structures, characteristics, and behaviour. However, the suitability of different clustering algorithms for different software systems has not been investigated thoroughly.
Different clustering algorithms tend to produce seman-
tically different clustering results. For instance, if classes
are chosen as the basis to perform software clustering, the
clustering feature extraction method will only look at class-
level interaction between those classes. Subsequently, the
clustering results produced by the class-level clustering al-
gorithm will be completely different from a method-level
clustering algorithm, although both results might be equally
feasible. Furthermore, comparing software clustering algo-
rithms within the same level of granularity is also not straight-
forward, due to different fitness functions and cluster valid-
ity metrics employed by different algorithms [9,16]. Even if
we were to compare the effectiveness of the clustering algo-
rithms from the same family (i.e., agglomerative hierarchical
clustering), there are still different ways to configure them
(i.e. different distance metrics, different linkage algorithms,
and different validity indices for hierarchical clustering algo-
AJJ Tan et al.: Preprint submitted to Elsevier Page 1 of 31
arXiv:2107.01766v1 [cs.SE] 5 Jul 2021
rithm). It is then up to the researchers to choose a software clustering evaluation method, depending on whether they are able to produce a reference decomposition or model to evaluate the effectiveness of the clustering results. Almost all of the existing studies in software clustering emphasise only the advantages and benefits brought by the proposed clustering technique, while few studies discuss the limitations of their approaches [14,15,17]. This raises the question of how software clustering algorithms should be evaluated.
Most studies that introduce new clustering algorithms evaluate their approach only on a specific set of problem instances [14,16,17]. Different from existing studies, this work aims to provide a better understanding of which software/code features (i.e., lines of code, number of methods, etc.) are related to the performance of clustering algorithms, and whether the software/code features can be used to select the most suitable clustering algorithm. Our work is inspired by similar research efforts in optimisation and search-based software testing [18,19].
This work aims to fill this gap by introducing a new approach that evaluates the effectiveness of software clustering techniques by providing information on their suitability according to the software or code features, which, as a consequence, enables the selection of the optimal algorithm and configuration from our existing pool of choices for a particular profile of a software system. Using the proposed framework, the pool of chosen software clustering algorithms needs to be profiled only once in order for the framework to recommend the optimal algorithm and configuration from the existing pool of choices. Software systems that exhibit characteristics matching a profile from the pool of choices (software clustering algorithms) are then recommended the respective clustering algorithm to improve the overall efficiency and effectiveness. The entire workflow is summarised in Figure 1.
Different from traditional software clustering research, which uses a trial-and-error approach to identify the optimal clustering algorithm and its configuration (dashed line in Figure 1), the proposed framework only requires the developer or researcher to extract the software features of the software to be remodularised in order for the framework to recommend the most suitable clustering algorithm.
Note that in this paper, we only focus on comparing and evaluating the effectiveness of different variants of agglomerative hierarchical clustering algorithms and the Bunch clustering algorithm because (i) agglomerative clustering and Bunch (a search-based software clustering algorithm) are two of the most popular clustering algorithms, as discussed in [8], and (ii) results produced by different families of software clustering algorithms exhibit significantly different structure and behaviour, which are difficult to compare directly.
Our proposed technique aims to characterise both the
strengths and weaknesses of the analysed software clustering
algorithms using existing and newly developed software fea-
tures extracted from the code. The proposed technique also
provides a clear understanding of the algorithm’s behaviour
by showing a 2D representation of the effectiveness of soft-
ware clustering techniques on the feature space through the
application of dimensionality reduction techniques. Note that, to avoid confusion, the term clustering features used in this paper refers to the features extracted from the chosen clustering entities (classes), while software features refers to characteristics of the software/code, such as lines of code, number of public methods, number of static methods, etc.
In essence, the proposed approach can be used to char-
acterise the software/code features that have an impact on
the effectiveness of clustering algorithms. It is a known fact
that the selection of clustering entities and clustering fea-
tures will directly influence the final clustering results. If
we could understand and correlate the relationships between software/code features and the effectiveness of software clustering algorithms, it is then possible to choose the optimal clustering algorithm and configuration from our existing pool of choices, based on the profiled software/code features. We show how such software/code features can be
measured, and how the footprints of software clustering al-
gorithms (regions where clustering algorithms’ strengths are
expected) can be visualised across the inspected software
components. The proposed approach can be used for perfor-
mance prediction, enabling the selection of the most suitable
clustering algorithm according to the software/code features.
The results can also lead to algorithm improvements, consid-
ering one of the aims of the proposed approach is to reveal
the weaknesses of clustering algorithms.
The research questions are:
RQ1 How can we identify the strengths and weaknesses of
clustering techniques for the remodularisation of soft-
ware systems?
RQ2 How can we select the most suitable clustering tech-
nique from a portfolio of hierarchical and Bunch clus-
tering algorithms?
2. The E-SC4R Framework
The E-SC4R framework provides a way to objectively assess the overall effectiveness of software clustering techniques. Understanding the effectiveness of a software clustering technique is critical in selecting the optimal technique for a particular software system, avoiding trial-and-error application of software clustering techniques.
The purpose of E-SC4R is to identify the optimal algorithm and configuration from our existing pool of choices for software clustering.
The approach involves two main parts: Strengths and Weaknesses of Clustering Techniques, which learns significant software features, such as lines of code, cohesion, coupling, and complexity, that reveal why certain remodularisation problems are hard, and visualises footprints of clustering techniques that expose their strengths and weaknesses; and Clustering Technique Selection, which
Figure 1: E-SC4R framework design and workflow.

Figure 2: An overview of E-SC4R. The boxes are the artefacts: Software Systems (𝑠 ∈ 𝑆), Software Clustering Techniques (𝐶(𝑠) ∈ 𝑌), the Clustering Technique Performance Indicator (𝐼(𝐶(𝑠)) ∈ 𝑅), Strengths and Weaknesses of Clustering Techniques (𝑄(𝐹(𝑇)) ∈ 𝑅), and Clustering Technique Selection (𝐴(𝑄(𝐹(𝑇))) ∈ 𝑇); the arrows carry software features, the source code and dissimilarity matrix, clustering results, MoJoFM results, and footprints.
addresses the problem of selecting the most suitable tech-
nique for software remodularisation. The approach we pro-
pose has two main goals:
• to help designers of software clustering techniques gain insight into why some techniques might be more or less suited to remodularise certain software systems, thus devising new and better techniques that address any challenging areas, and
• to help software developers select the most effective clustering technique for their software systems.
An overview of the E-SC4R framework is presented in
Figure 2. The boxes represent the artefacts, while the ar-
rows are the processes/steps for creating the artefacts. In the
following subsections, we describe our framework in more
detail for each artefact/step.
2.1. Software Systems
The software systems, denoted 𝑠 ∈ 𝑆 in Figure 2, are the systems to be remodularised by researchers using software clustering techniques to recover a high-level abstraction of the software architecture.
The software systems used in this study are chosen in a
pseudo-random manner, with a mixture of GitHub projects
and Apache projects.
The following process is used when selecting projects from the main GitHub and Apache repositories:
1. Set the search parameters to filter for Java-based projects at https://github.com/topics/java,
2. Sort the projects by the number of stars,
3. Manually choose projects that meet the following criteria:
• have at least 10 commits in the past year,
• the README.MD, project title, "About" section, and comments are written in English.
In order to make sure that the selected project is still cur-
rently active, we only chose projects that have at least 10
commits in 2021 (the year in which we conducted the exper-
iment).
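As an illustration, the selection criteria above can be encoded as a simple filter over repository metadata. This is a sketch of our own, not the authors' tooling; the field names `stars`, `commits_last_year`, and `english_docs` are hypothetical placeholders for values that would be retrieved from the GitHub API.

```python
# Hypothetical sketch of the project-selection criteria above; the metadata
# fields (stars, commits_last_year, english_docs) are placeholders for values
# that would be retrieved from the GitHub API.

def select_projects(candidates, min_commits=10):
    """Sort candidates by stars, keep active projects with English docs."""
    ranked = sorted(candidates, key=lambda c: c["stars"], reverse=True)
    return [c for c in ranked
            if c["commits_last_year"] >= min_commits and c["english_docs"]]
```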
AJJ Tan et al.: Preprint submitted to Elsevier Page 3 of 31
E-SC4R: Explaining Software Clustering for Remodularisation
2.2. Software Clustering
Software clustering techniques are defined as 𝐶(𝑠) ∈ 𝑌 in Figure 2, where a specific clustering technique 𝐶, drawn from the set 𝑌, is applied to software 𝑠.
Different clustering algorithms proposed in the existing literature address the software remodularisation and architecture recovery problem in different ways, depending on their interpretation of what constitutes a meaningful and effective clustering result. While these algorithms usually
aim to achieve common goals, i.e., improve the software
modularity, quality, and software comprehension, they usu-
ally produce significantly different clustering results but at
the same time, might also offer equally valid high-level ab-
straction views of the analysed software [20]. Hence, it is
necessary to discuss the working principle of some of the
widely adopted software clustering algorithms available in
the current literature.
2.2.1. Search-based Software Remodularisation
Algorithms
Search-based approaches have been successfully utilised
to address the software remodularisation problem. In general, the workflow of search-based clustering algorithms consists of the following steps [12,3].
1. Generate a random seed solution based on some initial parameters,
2. Explore the neighbourhood structure of the solution. If a neighbour has a better fitness, it becomes the new solution,
3. Repeat step (2) until the neighbourhood of the current candidate solution offers no further improvement to the fitness function (local optimum),
4. Restart from step (1) with a different seed to search for a better solution and fitness (global optimum).
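The steps above can be sketched as a simple hill climber with random restarts. This is an illustrative toy, not the Bunch implementation; the fitness is a TurboMQ-style modularisation quality, and the function and variable names are our own.

```python
import random

# Illustrative sketch of the search-based workflow above (not the Bunch tool):
# hill climbing with random restarts over partitions of a module dependency
# graph (MDG), guided by a TurboMQ-style modularisation quality fitness.

def turbo_mq(partition, edges):
    """Sum of cluster factors CF_i = 2*intra_i / (2*intra_i + inter_i)."""
    mq = 0.0
    for c in set(partition.values()):
        intra = sum(1 for a, b in edges if partition[a] == c and partition[b] == c)
        inter = sum(1 for a, b in edges if (partition[a] == c) != (partition[b] == c))
        if intra > 0:
            mq += 2.0 * intra / (2.0 * intra + inter)
    return mq

def neighbours(partition, n_clusters):
    # Step 2: neighbouring partitions move one module to another cluster.
    for module in partition:
        for c in range(n_clusters):
            if c != partition[module]:
                nb = dict(partition)
                nb[module] = c
                yield nb

def hill_climb(modules, edges, n_clusters, restarts=10, seed=0):
    rng = random.Random(seed)
    best, best_fit = None, float("-inf")
    for _ in range(restarts):  # step 4: restart from a new random seed solution
        part = {m: rng.randrange(n_clusters) for m in modules}  # step 1
        improved = True
        while improved:  # step 3: stop when no neighbour improves the fitness
            improved = False
            for nb in neighbours(part, n_clusters):
                if turbo_mq(nb, edges) > turbo_mq(part, edges):
                    part, improved = nb, True
                    break
        if turbo_mq(part, edges) > best_fit:
            best, best_fit = part, turbo_mq(part, edges)
    return best, best_fit
```

On a toy MDG with two independent components (e.g., edges a–b and c–d), the climber typically recovers the two-cluster split, which scores TurboMQ = 2.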
Numerous existing studies that adopt search-based ap-
proaches are based on the work by Mitchell and Mancoridis
[12,21], where their approaches have been implemented as
part of the Bunch software clustering tool. The authors pro-
posed the notion of a module dependency graph (MDG) as the basis of their clustering entities. In their context, a module represents a source code entity that encapsulates data and functions that operate on the data (e.g., Java/C++ classes, C source code files). Dependencies between the modules,
on the other hand, are binary relations between the source
code entities, depending upon the programming language
used to implement the analysed software systems (e.g., func-
tion/procedure invocation, variable access, and inheritance).
The Bunch clustering algorithm works by generating a random partition of the MDG, which is then re-partitioned systematically by examining neighbouring structures in order to find a better partition. When an improved partition which yields higher intra-cluster cohesion and lower inter-cluster coupling is found, the process repeats by using the newly found partition as the basis for finding the next improved partition. The algorithm stops when it cannot find a better partition.
Building on the work by Mitchell and Mancoridis, Harman et al. proposed a search-based optimisation method to search for the best partition [22]. In their work, Harman et al. proposed a new objective function for the search-based problem such that, for each pair of modules in a partition/cluster, the optimisation score is incremented if there is a dependency between the modules; otherwise, it is decremented. The authors do not consider inter-cluster coupling and focus only on intra-cluster cohesion. Experiment results show that their proposed approach can tolerate noise better than the Bunch clustering algorithm [21] and reach optimal results faster than the Bunch-guided approach.
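A minimal sketch of this style of cohesion-only, pairwise objective (our own naming, not the authors' code) could look like:

```python
from itertools import combinations

# Sketch of a Harman-style cohesion-only objective (illustrative naming, not
# the authors' code): for every pair of modules placed in the same cluster,
# add one point if they share a dependency, otherwise subtract one.

def cohesion_objective(partition, dependencies):
    deps = {frozenset(d) for d in dependencies}
    score = 0
    for a, b in combinations(partition, 2):
        if partition[a] == partition[b]:
            score += 1 if frozenset((a, b)) in deps else -1
    return score
```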
In the work by Beck and Diehl [23], the authors dis-
cussed that clustering algorithms based on the Bunch tool
often only rely on the structural and static information of the
source code in order to measure the similarity and dependen-
cies among software entities. The authors attempted to enrich the structural data with some evolutionary aspects (historical data) of the analysed software, such as the size of packages, the ratio of code to comments, and the number of downloads. Experiment results show that using evolutionary data alone does not perform better than traditional clustering algorithms that utilise structural data; it is only when both types of data are integrated that the clustering results achieve much higher accuracy when compared against the reference model.
2.2.2. Hierarchical Clustering Algorithms
On the other hand, hierarchical clustering iteratively merges smaller clusters into larger ones or divides large clusters into smaller ones, depending on whether it is a bottom-up or top-down approach. Merging or dividing operations are usually
dependent on the clustering algorithm used in the existing
studies. In general, hierarchical clustering algorithms can
be divided into two main approaches, divisive (top-down)
and agglomerative (bottom-up) hierarchical clustering algo-
rithms.
Divisive clustering is based on a top-down hierarchical clustering approach where the clustering process starts at the top with all data in one big cluster. The cluster is then split into smaller clusters in a recursive manner until each data point resides in its own cluster. For problem domains with a fixed number of top levels, using flat algorithms such as 𝐾-means yields lower computational complexity because divisive clustering is linear in the number of clusters [24]. Although
the computational complexity of divisive clustering is lower
than agglomerative clustering, complete information about
the global distribution of the data is needed when making
the top-level clustering decisions [25]. Most of the time, software maintainers are not involved in the earlier software design phases. If the software documentation is not up to date, it is hard for maintainers to identify the ideal number of software packages (or the number of clusters in the context of software clustering) before attempting to remodularise a software system.
On the other hand, the work by Wiggerts [26] discussed how agglomerative clustering, a bottom-up clustering approach, would be helpful to software engineers in remodularising legacy and poorly documented software systems. According to the author, the working principle of agglom-
erative clustering is actually similar to reverse engineering
where the abstractions of software design are recovered in
a bottom-up manner. Agglomerative hierarchical clustering
starts by placing each cluster entity (usually code or classes
in the context of software clustering) in a cluster on its own.
At each iteration as we move up the hierarchy, two of the
most similar clusters from the lower layer are merged and the
number of clusters is reduced by one. Which clusters to merge depends on the similarity measure and the linkage algorithm used, i.e., the nearest neighbour between two clusters (single linkage), the furthest neighbour between two clusters (complete linkage), or the average distance between two clusters (average linkage).
Once the two chosen clusters have been merged, the strength of similarity between the newly formed cluster and the rest of the clusters is updated to reflect the changes. The merging process continues until there is only one cluster left. The results of agglomerative clustering are usually presented in a tree diagram, called a dendrogram. A dendrogram shows the
taxonomic relationships of clusters produced by the cluster-
ing algorithm. Cutting the dendrogram at a certain height
produces a set of disjoint clusters.
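As a minimal sketch (using SciPy, which is not necessarily the tooling used in this work), agglomerative clustering over a toy dissimilarity matrix and cutting the resulting dendrogram at a fixed height might look like:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy dissimilarity matrix over four entities: two tight pairs (a,b) and (c,d).
entities = ["a", "b", "c", "d"]
dissim = np.array([
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
])

# Condense the square matrix and build the merge tree with complete linkage
# (single and average linkage are the other common choices mentioned above).
Z = linkage(squareform(dissim), method="complete")

# Cutting the dendrogram at height 0.5 produces a set of disjoint clusters.
labels = fcluster(Z, t=0.5, criterion="distance")
clusters = dict(zip(entities, labels))
```

Here the pairs (a, b) and (c, d) merge at height 0.1, while joining the two pairs would cost 0.9, so cutting at 0.5 yields exactly two clusters.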
In this paper, we will focus on examining the strengths and weaknesses of different configurations of the agglomerative hierarchical clustering algorithm and the Bunch algorithm, because they are among the most widely used generic clustering and search-based algorithms in the existing literature [9,14,23,27,28]. It will be interesting to compare the performance of two different families of software remodularisation techniques and examine their suitability for datasets with different behaviour.
2.3. Clustering Technique Performance Indicator
The Clustering Technique Performance Indicator, denoted as 𝐼(𝐶(𝑠)) ∈ 𝑅, takes as input the clustering results generated by a clustering algorithm 𝐶(𝑠) for a particular software system 𝑠 ∈ 𝑆. There exist various ways to evaluate the performance of software clustering algorithms. Typically, the clustering results generated by a clustering algorithm 𝐶(𝑠) are measured against a reference model, which refers to a known good clustering result or a reliable reference that can act as a baseline for comparison. Hence, the performance indicator typically measures the similarity of the clustering results against the reference model.
One way of obtaining the reference model is through feedback from domain experts on the analysed system, for instance, the original designer, system architect, or senior developers directly involved in the development of the software. However, this approach is difficult to realise because software maintainers are usually not involved in the initial design and development of the maintained software.
In existing works, when there are no inputs from domain
experts to create a reference model or ground truth, several
authors have chosen to use the directory structure or pack-
age structure of the analysed software to create an artificial
ground truth or reference model [4,23,27]. This method is
less expensive compared to retrieving the reference model
from domain experts because it can usually be automated.
However, the reliability of using the directory or package
structure of the analysed system is strongly dependent on the
skills and experience of the software developers because it is
assumed that software developers follow the best practice of putting functionally relevant and similar classes into the same package directory.
If an artificial ground truth or reference model can be retrieved, we can then compare it with the clustering results generated by a clustering algorithm 𝐶(𝑠) to measure the extent to which two given decompositions of the software are similar to each other. The work by [29] discussed that one of the most popular performance indicators for software clustering results is the MoJo family of metrics [30]:
MoJoFM(𝐴, 𝐵) = (1 − 𝑚𝑛𝑜(𝐴, 𝐵) / max(𝑚𝑛𝑜(∀𝐴, 𝐵))) × 100% (1)
where 𝑚𝑛𝑜(𝐴, 𝐵) is the minimum number of Move or Join operations needed to transform the clustering result (𝐴) into the ground truth (𝐵), and max(𝑚𝑛𝑜(∀𝐴, 𝐵)) is the maximum number of operations needed to transform any clustering result into the ground truth. MoJoFM returns 0 if the clustering result is very different from the ground truth, and returns 100 if the clustering result is identical to the ground truth. We will be using MoJoFM as the Clustering Technique Performance Indicator 𝑅 in this paper.
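For intuition only, the sketch below computes a simplified, move-only proxy for MoJoFM: candidate clusters are optimally matched to ground-truth clusters, and entities outside the matching are counted as Moves. The exact metric also accounts for Join operations and uses a precise worst-case normalisation, so this is an approximation of our own, not the published algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative, move-only proxy for MoJoFM (NOT the exact metric): match each
# candidate cluster to a ground-truth cluster via optimal assignment, then
# count the entities that would have to move. Both partitions are assumed to
# cover the same entity set; the normalisation is our own simplification.

def move_count(candidate, truth):
    """candidate, truth: lists of disjoint sets of entity names."""
    overlap = np.array([[len(a & b) for b in truth] for a in candidate])
    # Pad to square so every candidate cluster gets at most one target.
    n = max(overlap.shape)
    padded = np.zeros((n, n), dtype=int)
    padded[:overlap.shape[0], :overlap.shape[1]] = overlap
    rows, cols = linear_sum_assignment(padded, maximize=True)
    matched = padded[rows, cols].sum()
    total = sum(len(a) for a in candidate)
    return total - matched  # entities not covered by the matching must move

def mojofm_like(candidate, truth):
    total = sum(len(a) for a in candidate)
    worst = total - max(len(b) for b in truth)  # simplified worst case
    if worst == 0:
        return 100.0
    return (1 - move_count(candidate, truth) / worst) * 100
```

Identical decompositions score 100, while a candidate sharing only one entity per matched cluster with the ground truth scores 0 under this simplified normalisation.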
2.4. Strengths and Weaknesses of Clustering
Techniques
One of the important steps in E-SC4R is identifying features of software systems 𝐹(𝑠) ∈ 𝐹 that have an impact on the effectiveness of software clustering techniques. Software features such as lines of code, number of public methods, number of static methods, etc., are problem dependent and must be chosen such that, depending on the type of target software systems 𝑠 ∈ 𝑆, any known structural properties of the software systems are captured, and any known advantages and limitations of the different software clustering algorithms are related to the features.
For each version of the analysed project, software fea-
tures are extracted using the Java code metrics calculator,
which is publicly available online [31]. The tool is capable of calculating simple size metrics, such as the number of methods, lines of code, and number of private fields, as well as more complex measures, such as depth of inheritance, coupling between objects, and the rest of the CK suite of metrics. In total, we extracted 40 metrics using the tool. These metrics are calculated at the class level. To aggregate the calculated metrics at the project level, we calculate the max, mean, standard deviation, and sum over all classes of a project, resulting in 40 × 4 = 160 metrics for each project.
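The project-level aggregation described above can be sketched with pandas. This is our own sketch, not the authors' pipeline, and the column names are hypothetical.

```python
import pandas as pd

# Sketch of the project-level aggregation described above (column names are
# hypothetical): class-level metrics are collapsed into max/mean/std/sum per
# project, turning k class-level metrics into 4k project-level features.

class_metrics = pd.DataFrame({
    "project": ["p1", "p1", "p2", "p2"],
    "loc":     [120,  80,   300,  60],
    "wmc":     [10,   4,    25,   3],
})

agg = class_metrics.groupby("project").agg(["max", "mean", "std", "sum"])
# Flatten the (metric, statistic) column pairs into single feature names.
agg.columns = ["%s_%s" % c for c in agg.columns]
```

With 2 metrics this yields 2 × 4 = 8 features per project; with the 40 CK metrics it yields the 160 features used in the paper.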
Table 1 shows some of the metrics used in this paper as well as their definitions. The full list of metrics along with their descriptions can be found on the tool’s GitHub page1. The set of features listed in Table 1 is only a subset of the total
1https://github.com/mauricioaniche/ck.
Table 1
Description of software features.
Object oriented features
WMC McCabe’s complexity. It counts the number of branch instructions in a class.
DIT Depth of Inheritance Tree, counts the number of parents a class has.
Number of Fields Counts the number of fields. Specific numbers for total number of fields, static, public,
private, protected, default, final, and synchronised fields.
Number of Methods Counts the number of methods. Specific numbers for total number of methods, static,
public, abstract, private, protected, default, final, and synchronised methods. Constructor
methods also count here.
Number of visible methods Counts the number of visible methods. A method is visible if it is not private.
NOSI Number of static invocations, counts the number of invocations to static methods.
RFC Response for a class, counts the number of unique method invocations in a class.
LOC Lines of code. It counts the lines of code, ignoring empty lines and comments (i.e., it is Source Lines of Code, or SLOC).
TCC Tight Class Cohesion, measures the cohesion of a class with a value range from 0 to 1. TCC measures cohesion via direct connections between visible methods; two methods are directly connected if they or their invocation trees access the same class variable.
LCC Loose Class Cohesion, similar to TCC but it further includes the number of indirect connections between visible methods in the cohesion calculation. Thus, the constraint LCC >= TCC always holds.
No. of Returns The number of return instructions.
No. of Loops The number of loops (i.e., for, while, do while, enhanced for).
No. of Comparisons The number of comparisons (i.e., == and !=).
No. of try/catches The number of try/catches.
No. of () expressions The number of expressions inside parenthesis.
String literals The number of string literals (e.g., "John Doe"). Repeated strings count as many times
as they appear.
Quantity of Number The number of numbers (i.e., int, long, double, float) literals.
Quantity of Math Operations The number of math operations (times, divide, remainder, plus, minus, left shift, right shift).
Quantity of Variables Number of declared variables.
Max nested blocks The highest number of blocks nested together.
Number of unique words Number of unique words in the source code. Counts the number of words in a
method/class, after removing Java keywords. Names are split based on camel case and
underline.
Number of Log Statements Number of log statements in the source code. The counting uses REGEX compatible
with SLF4J and Log4J API calls.
Has Javadoc Boolean indicating whether a method has javadoc.
Usage of each variable How often each variable was used inside each method.
Usage of each field How often each local field was used inside each method; local fields are fields within a class (subclasses are not included). Indirect local field usages are also detected; these include all usages of fields within the local invocation tree of a class, e.g., if A invokes B and B uses field a, then a is indirectly used by A.
Method invocations All directly invoked methods, variations are local invocations and indirect local invoca-
tions.
number of features used in the paper, where we list some of
the more general and widely used software features.
As such, the terms software metrics and software fea-
tures will be used interchangeably in this paper to denote the
metrics that represent different features of the analysed soft-
ware.
Using the results gathered from clustering algorithms and
software feature extraction, we can create the footprint visu-
alisations of each clustering algorithm in order to identify
the most significant software features that have an impact on
its effectiveness.
E-SC4R identifies software features that have an impact
on the effectiveness of software clustering techniques. The
clustering results can be affected by the program structure
and/or source code, the complexity of dependencies between
classes, information about the input/output data space, and
information dynamically obtained from program execution.
All these aspects may influence the suitability of software
clustering techniques for a particular software system.
In E-SC4R, a subset of features is considered significant
AJJ Tan et al.: Preprint submitted to Elsevier Page 6 of 31
E-SC4R: Explaining Software Clustering for Remodularisation
if they result in an instance space – as defined by the 2-
dimensional projection of the subset of features – with soft-
ware systems where a particular clustering technique per-
forms well being clustered together. The software instances
are initially projected in the 2D instance space in such a way
that if two software systems are similar according to some
features, they are closer together, and if they are dissimilar,
then they are far apart. Since we focus on arranging the software systems in a space where the instances on which a technique is effective are separated from those on which it is ineffective, we represent a software system as a vector of the most significant features that are likely to correlate with a clustering technique's effectiveness. E-SC4R identifies software features that are
able to create a clear separation in instance space, such that
we can clearly see the different clusters of software systems
where the techniques are effective. We refer to these clusters
as clustering technique footprints.
The most significant features are determined in two steps [32]. First, MoJo is chosen as the performance metric that measures the effectiveness of a clustering algorithm on a software system described by its features (e.g., lines of code, number of methods).
Next, a genetic algorithm is applied to select the set of features that maximises MoJo, as given below:
1. Select a set of software features,
2. Generate an instance space using PCA for dimensionality reduction, as described below,
3. Evaluate the fitness of the set of software features,
4. If the set of software features is not suitable, return to Step 1.
The genetic algorithm will search the space for possible
subsets and determine the optimal subset with the classifica-
tion accuracy on an out-of-sample test set used as the fitness
function. The instance space is generated in iterations, until
an optimal subset of features is found [33].
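The search loop above can be sketched as a minimal genetic algorithm over feature bitstrings. This is an illustrative sketch, not the authors' implementation: the `fitness` function below is a hypothetical stand-in for the out-of-sample classification accuracy used as the fitness function, and `N_FEATURES` and `TARGET` are invented for the example.

```python
import random

random.seed(0)

N_FEATURES = 8       # hypothetical number of candidate software features
TARGET = {0, 2, 5}   # hypothetical "truly predictive" feature subset


def to_subset(bits):
    return {i for i, b in enumerate(bits) if b}


def fitness(subset):
    # Stand-in for out-of-sample classification accuracy:
    # reward target features, lightly penalise noise features.
    return len(TARGET & subset) - 0.1 * len(subset - TARGET)


def evolve(pop_size=20, generations=30, mutation_rate=0.1):
    population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by fitness and keep the better half (truncation selection).
        population.sort(key=lambda ind: fitness(to_subset(ind)), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_FEATURES)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                     # bit-flip mutation
            children.append(child)
        population = parents + children
    best = max(population, key=lambda ind: fitness(to_subset(ind)))
    return to_subset(best)


print(evolve())
```

Because the better half of each generation is carried over unchanged, the best subset found so far is never lost, mirroring the iterative refinement described above.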
A subset of the features is considered of high quality if
they result in a footprint visualisation with distinct clusters.
The best subset of features is the one that can best discrimi-
nate between high performing and low performing clustering
algorithms.
After the most significant software features (lines of code,
number of methods, etc.) are identified, these features are
then used as input to an SVM to capture the relationship between the selected features and the MoJo value of the clustering results. Similar to our previous work [34,35], we use
principal component analysis (PCA) as a method for project-
ing the software instances to two dimensions, while making
sure that we retain as much information as possible. PCA
rotates the data to a new coordinate system ℝ^k, with axes
defined by linear combinations of the selected features. The
new axes are the eigenvectors of the covariance matrix. The
subset of features that have large coefficients and therefore
contribute significantly to the variance of each Principal Com-
ponent (PC), are identified as the significant features. Re-
taining the two main PCs, the software instances are then
projected on this 2D space. Following a similar approach
to previous work on dimensionality reduction [36], we ac-
cept the new two dimensional instance space as adequate
if most of the variance in the data is explained by the two
principal axes. The two principal components are then used
to visualise the footprints of the clustering technique. The
footprint indicates the area of strength (software instances
on which the clustering technique is effective) of each clus-
tering technique. This step provides an answer to the first
research question, RQ1 : How can we identify the strengths
and weaknesses of clustering techniques for the remodular-
isation of software systems?
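The projection step above can be sketched with plain NumPy, following the same recipe: centre the data, take the eigenvectors of the covariance matrix, keep the two leading principal components, and check the variance they explain. The feature matrix here is synthetic; the paper's real input is the selected software metrics.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical feature matrix: 30 software systems x 6 selected metrics.
X = rng.normal(size=(30, 6))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=30)  # one correlated metric

Xc = X - X.mean(axis=0)                 # centre each feature
cov = np.cov(Xc, rowvar=False)          # covariance of the selected features
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
order = np.argsort(eigvals)[::-1]       # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs[:, :2]                 # 2D instance space (z1, z2)
explained = eigvals[:2].sum() / eigvals.sum()
print(f"variance explained by 2 PCs: {explained:.2f}")
```

The `explained` ratio is the adequacy check described above: the 2D instance space is accepted only if most of the variance is captured by the two principal axes.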
2.5. Clustering Technique Selection
Finally, E-SC4R is used to predict, based on the most
significant software features, the most effective clustering
technique for new remodularisation and architecture recov-
ery problems. This step answers the second research ques-
tion: RQ2: How can we select the most suitable clustering
technique from a portfolio of hierarchical and Bunch clus-
tering algorithms? This is achieved by modelling the rela-
tionship between software features and clustering technique
effectiveness by employing a Support Vector Machine [37]
to learn this relationship. E-SC4R uses the two-dimensional
space as an input to the Support Vector Machine to learn the
relationship between the software features and remodulari-
sation method performance.
The SVM uses C-classification. The cost C in the regularisation term and the RBF kernel hyper-parameter γ are tuned via grid search in [1, 10] and [0, 1], respectively.
We use 10-fold cross validation to train the model and
assess the model generalisation ability. The cross-validated
Mean Squared Error and Error Rate are used as estimates of
the model generalisation ability in classification. At the end
of this process, E-SC4R creates a model that can select the
most effective technique for remodularisation based on the
features of software programs. This model can be retrained
and extended further with new remodularisation techniques
and software features.
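A sketch of this tuning step, assuming scikit-learn is available; the instance-space coordinates and labels are synthetic, and the discrete grid values are illustrative points drawn from the stated ranges [1, 10] for C and [0, 1] for γ (a strictly positive γ is required by the RBF kernel).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical 2D instance-space coordinates and best-technique labels.
Z = rng.normal(size=(60, 2))
y = (Z[:, 0] + Z[:, 1] > 0).astype(int)  # 0/1 = which technique wins

# Illustrative grid points within the ranges stated in the text.
param_grid = {"C": [1, 2, 5, 10], "gamma": [0.01, 0.1, 0.5, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)  # 10-fold CV
search.fit(Z, y)

error_rate = 1 - search.best_score_  # cross-validated error rate
print(search.best_params_, f"CV error rate: {error_rate:.2f}")
```

The cross-validated error rate reported here plays the role of the generalisation estimate described above.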
3. Experimental Design
In this section, we discuss how the experiment is designed to examine the strengths and weaknesses of different configurations of the agglomerative hierarchical clustering and Bunch algorithms.
3.1. Agglomerative Hierarchical Software
Clustering
The process of agglomerative hierarchical clustering can
be summarised in the following steps, illustrated in Figure 3.
1. Identification of clustering entities,
2. Identification of clustering features,
3. Calculation of similarity measure,
4. Application of clustering algorithm,
5. Evaluation of clustering results.
Figure 3: Agglomerative hierarchical and Bunch clustering processes.
Identification of clustering entities In software cluster-
ing, the typical choices of entities are in the form of methods,
classes, or packages because they represent the basic com-
ponents and functionalities of a software system. In this pa-
per, we focus on representing the software at the class-level
because in object-oriented software systems, classes are the
main building blocks that contain the implementation details
of the examined components.
Identification of clustering features The similarities be-
tween entities are determined based on their characteristics
or clustering features extracted from the available informa-
tion. Extracting dependencies between code entities is criti-
cal in architecture recovery because it helps to understand
the static and dynamic relationships between them. Sev-
eral existing studies have proposed different methods to ex-
tract dependencies from software entities at different lev-
els of granularity, including code, class, and package lev-
els [38,39,40]. The work by Jin et al. [38], in particular,
released their open-source dependency extraction tool, De-
pends2, which is capable of gathering syntactical relations
among source code entities such as files and methods. In this
work, we choose to utilise Depends to extract dependencies
2https://github.com/multilang-depends/depends
between classes in the analysed software in order to improve
the replicability of our research findings. Depends extracts
the following dependency types:
1. Import,
2. Contain,
3. Parameter,
4. Call,
5. Return,
6. Throw,
7. Implement,
8. Extend,
9. Create,
10. Use,
11. Cast,
12. ImplLink,
13. Annotation,
14. Mixin.
We are then able to generate an N×N (N = number of classes) matrix that denotes the relationships between all the
classes. Sample data that are extracted from Depends are
shown in Table 2, which are then aggregated into relation-
ships between clustering entities shown in Table 3.
Table 2
Examples of relationships extracted from Depends.
src dest Cast Call Return Use Contain Import Extend Implement
a.java b.java 1 8 4 9 0 0 6 3
b.java c.java 1 0 0 8 1 0 1 5
Table 3
Examples of relationships between clustering entities aggregated from Table 2.
a.java b.java c.java d.java
a.java 0 31 11 9
b.java 31 0 16 4
c.java 11 16 0 3
d.java 9 4 3 0
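Using the counts of Table 2 as input, the aggregation into the symmetric strength matrix of Table 3 can be sketched as follows; the dictionary of per-type counts is a simplified stand-in for Depends' actual output format.

```python
# Per-type dependency counts between class files, as in Table 2.
edges = {
    ("a.java", "b.java"): {"Cast": 1, "Call": 8, "Return": 4, "Use": 9,
                           "Contain": 0, "Import": 0, "Extend": 6,
                           "Implement": 3},
    ("b.java", "c.java"): {"Cast": 1, "Call": 0, "Return": 0, "Use": 8,
                           "Contain": 1, "Import": 0, "Extend": 1,
                           "Implement": 5},
}

classes = sorted({c for pair in edges for c in pair})
index = {c: i for i, c in enumerate(classes)}
n = len(classes)
matrix = [[0] * n for _ in range(n)]

for (src, dest), counts in edges.items():
    strength = sum(counts.values())  # collapse all dependency types
    i, j = index[src], index[dest]
    matrix[i][j] += strength
    matrix[j][i] += strength         # undirected aggregated strength

print(matrix)
```

Summing the Table 2 row for a.java/b.java gives 31, and the row for b.java/c.java gives 16, matching the corresponding cells of Table 3.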
Calculation of similarity measure The next step is to
ascertain the similarity between entities by referring to the
clustering features identified in the previous step. In this pa-
per, we choose to use distance measures because we are able
to quantify the strength of dependencies between classes with
the aid of Depends. We do not attempt to distinguish between the different types of dependencies identified by Depends, but instead sum all of them to represent an overall strength of dependency between two classes. In order to generate the distance matrix, the following distance measures were considered to compute the dissimilarity between each pair of classes in the examined software:
Euclidean distance: least squares, minimising the sum of the squares of the differences between a pair of classes:

    d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (2)

where n is the number of classes, and x_i and y_i are the i-th components of the vectors x and y, respectively.

Manhattan distance: least absolute deviations, minimising the sum of the absolute differences between a pair of classes:

    d(x, y) = \sum_{i=1}^{n} |x_i - y_i|    (3)

Cosine distance: one minus the cosine of the angle between a pair of classes:

    d(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}    (4)
These distance measures are chosen because they have proven effective in measuring the similarity between software components in related studies [14,41].
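The three distance measures can be sketched directly from equations (2)-(4); the vectors a and b stand for the aggregated dependency rows of two classes and are illustrative values.

```python
import math


def euclidean(x, y):
    # Equation (2): root of the sum of squared differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    # Equation (3): sum of absolute differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def cosine_distance(x, y):
    # Equation (4): one minus the cosine of the angle between x and y.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1 - dot / (norm_x * norm_y)


# Hypothetical aggregated dependency rows for two classes.
a = [0, 31, 11, 9]
b = [31, 0, 16, 4]
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
```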
Application of clustering algorithm A clustering algo-
rithm is needed to decide upon how and when to merge two
clusters. Depending on the algorithm used, certain algo-
rithms merge the most similar pair first while others merge
the most dissimilar first. Once the two chosen clusters have
been merged, the strength of similarity or dissimilarity be-
tween the newly formed cluster and the rest of the clusters
are updated to reflect the changes. It is very common that
during hierarchical clustering, there exist more than two en-
tities which are equally similar or dissimilar. In this kind of
scenario, the selection of candidate entities to be clustered is
arbitrary [14].
In this work, we use the following three linkage algo-
rithms [14]:
Single Linkage Algorithm - defines the similarity of
two chosen clusters as the maximum similarity strength
among all pairs of entities (classes) in the two clusters
Average Linkage Algorithm - defines the similarity
measure between two clusters as the arithmetic aver-
age of similarity strengths among all pairs of entities
(classes) in the two clusters
Complete Linkage Algorithm - defines the similarity
of two chosen clusters as the minimum similarity strength
among all pairs of entities (classes) in the two clusters
There exist many newer algorithms proposed for software architecture recovery. Currently, we include only these three basic linkage algorithms to demonstrate E-SC4R's ability to identify the most suitable algorithm and configuration from our existing pool of choices. Newer algorithms will be added to E-SC4R in future iterations.
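The three linkage variants above can be illustrated with SciPy's hierarchical clustering, a sketch under the assumption that similarity has already been converted to dissimilarity (so "maximum similarity" corresponds to minimum distance); the dissimilarity matrix is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric dissimilarity matrix between four classes
# (a larger value means a weaker dependency, i.e. more dissimilar).
D = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])

condensed = squareform(D)  # SciPy expects the condensed distance form
for method in ("single", "average", "complete"):
    Z = linkage(condensed, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```

On this toy matrix all three linkages agree, grouping the first two and last two classes; the variants diverge on data where cluster-to-cluster distances are less clear-cut.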
Apart from that, we also attempt to determine the most
optimum range of number of clusters for each of the chosen
hierarchical clustering algorithms. In hierarchical cluster-
ing, the final output is represented in a dendrogram, which
is a tree diagram that shows the taxonomic relationships of
clusters of software entities produced by hierarchical clus-
tering. The distance at which the dendrogram tree is cut de-
termines the number of clusters formed. Cutting the dendro-
gram tree at a higher distance value always yields a smaller
number of clusters. However, this decision involves a trade-
off with respect to relaxing the constraint of cohesion in the
cluster memberships [9,16,42]. As such, in this work, we attempt to determine the optimal total number of clusters by dividing the total number of classes by the following divisors: 5, 7, 10, 20, and 25. These numbers were chosen based on the package distribution of the ground truth that we generated, and depend on the number of classes of the analysed software. We use these divisors instead of an exhaustive approach to save computation time while still obtaining a range of candidate numbers of clusters. In practice,
tice, E-SC4R would allow the user to specify the number of
clusters or divisors.
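The divisor scheme above amounts to a small helper; this sketch (the function name is ours) maps a project's class count to its candidate numbers of clusters:

```python
def candidate_cluster_counts(n_classes, divisors=(5, 7, 10, 20, 25)):
    # Each divisor maps the class count to one candidate number of
    # clusters; at least one cluster is always requested.
    return {d: max(1, round(n_classes / d)) for d in divisors}


# Hypothetical project with 312 classes.
print(candidate_cluster_counts(312))
```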
We use different configurations of clustering algorithms
which differ between the combination of different distance
metrics, linkage algorithms, and the number of clusters. We
then record the clustering results of each combination of the
clustering algorithm on each version of the software, to be
compared with the ground truth. For example:
Agglomerative Hierarchical Configuration 1
Linkage = Single
Distance Metric = Euclidean
Cluster Divisor = 5
Agglomerative Hierarchical Configuration 2
Linkage = Complete
Distance Metric = Cosine
Cluster Divisor = 7
Evaluation of clustering results As mentioned earlier,
creating a reference model to act as the ground truth by en-
gaging domain experts is expensive in terms of time and ef-
fort. On the other hand, the reliability of the package struc-
ture of the analysed software is strongly dependent on the
experience of the software developers, as well as maturity
of the analysed project.
In this paper, we create the reference model (ground truth)
by looking at the most commonly occurring directory struc-
ture patterns for the 10 previous releases of the analysed soft-
ware. A more detailed example is available in Section 3.4.
To evaluate the performance of each hierarchical clustering algorithm against the reference model, we use the MoJoFM metric proposed in [30,43]. The MoJo family of metrics is widely used in the domain of software clustering to evaluate the performance of different clustering algorithms [14,16,27,28]. Hence, in the remainder of this paper, the term performance of a clustering algorithm refers to the MoJoFM value computed when comparing the produced clustering results against the ground truth.
3.2. Bunch Clustering Algorithm
Bunch supports three main clustering algorithms, namely the hill-climbing, exhaustive, and genetic algorithms. The authors claim that the exhaustive clustering algorithm only works well for small systems [11], while the hill-climbing algorithm performs well for most software systems [12].
Bunch starts off by generating a random partition of the
MDG. Then, depending on the chosen clustering algorithm
(hill climbing, genetic algorithm, or exhaustive), it will clus-
ter each of the random partitions in the population and se-
lect the result with the largest Modularisation Quality (MQ)
as the suboptimal solution. MQ measures the quality of an
MDG partition by taking into consideration the trade-off be-
tween the dependencies between the clustering entities (classes)
of two distinct clusters (package/subsystem), and the depen-
dencies between the clustering entities (classes) of the same
cluster (package/subsystem). The assumption made is that
high quality software systems should be designed with cohe-
sive subsystems that are loosely coupled between each other.
As the size of the problem (software systems) increases, the
probability of finding a good sub-optimal solution (MQ) also
increases.
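The cohesion/coupling trade-off behind MQ can be sketched with a simplified TurboMQ-style cluster factor, where each cluster contributes 2·intra / (2·intra + inter) to the total, so cohesive and loosely coupled partitions score higher. Bunch's actual calculators differ in weighting and incremental updates; the graph and partitions here are hypothetical.

```python
def turbo_mq(edges, assignment):
    # edges: {(src, dst): weight}; assignment: {node: cluster_id}
    clusters = set(assignment.values())
    intra = {c: 0.0 for c in clusters}
    inter = {c: 0.0 for c in clusters}
    for (src, dst), w in edges.items():
        cs, cd = assignment[src], assignment[dst]
        if cs == cd:
            intra[cs] += w
        else:
            inter[cs] += w  # edge leaving cluster cs
            inter[cd] += w  # edge entering cluster cd
    mq = 0.0
    for c in clusters:
        denom = 2 * intra[c] + inter[c]
        if denom > 0:
            mq += 2 * intra[c] / denom  # cluster factor of c
    return mq


# Hypothetical MDG and two candidate partitions of it.
edges = {("a", "b"): 1, ("b", "a"): 1, ("c", "d"): 1, ("a", "c"): 1}
good = {"a": 0, "b": 0, "c": 1, "d": 1}  # cohesive, loosely coupled
bad = {"a": 0, "b": 1, "c": 0, "d": 1}   # splits the dependent pairs
print(turbo_mq(edges, good), turbo_mq(edges, bad))
```

The cohesive partition scores higher, which is exactly the signal Bunch's search maximises when comparing candidate partitions.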
In this paper, we use combinations of the different algorithms and MQ calculators and evaluate their performance against the chosen datasets. For example,
Bunch Configuration 1
Algorithm = HillClimbing
Calculator = TurboMQ
Bunch Configuration 2
Algorithm = GeneticAlgorithm
Calculator = TurboMQIncr
3.3. Dataset Collection
For each chosen project, we compare the clustering re-
sults across 10 releases to ensure the stability of the clus-
tering algorithm. Stability in software clustering is defined
as the sensitivity of a particular clustering algorithm toward
the changes in the dataset [22]. For any good clustering al-
gorithm, small changes in the target software (clustering al-
gorithm applied on multiple small increment releases of the
same software) should not alter the clustering results signif-
icantly. Due to the way we create the ground truth, 10 prior releases of the examined software are needed to identify the common directory structure, as shown in Figure 4. As such, 21 releases of the chosen project are required, which forms one of our selection criteria.
Once we identified the suitable software, 21 versions of
the project source code were then downloaded using GitHub
CLI to a server based on the project link, release tag, and
version name. In total, we had chosen 30 Java-based projects
collected from GitHub as we could not find more which suited
Table 4
List of chosen projects and their versions.
Project Name firstRelease lastRelease
bkromhout-realm-java v0.87.1 v0.89.0
btraceio-btrace v1.3.4 v1.3.9
bytedeco-javacpp 1.4 1.5.3
codecentric-spring-boot-admin 1.4.5 1.5.7
codenvy-legacy-che-plugins 3.13.4.4 3.9.5
coobird-thumbnailator 0.4.10 0.4.9
dropwizard v2.0.11 v2.0.9
dropwizard-metrics v4.1.10.1 v4.1.9
evant-gradle-retrolambda v3.3.0-beta1 v3.7.0
facebook-android-sdk 5.15.1 5.9.0
facebook-java-business-sdk v3.2.7 v3.3.6
facebook-fresco v1.14.2 v1.9.0
facebook-litho v2020.05.11 v2020.07.20
facebook-react-native-fbsdk 0.6.1 v0.10.3
google-cdep 0.8.27 0.8.9
google-dagger 2.28.2 2.9.0
google-error-prone v2.1.1 v2.4.0
google-gitiles v0.2-9 v0.4
google-openrtb 1.5.11 1.5.9
google-openrtb-doubleclick 1.5.14 1.5.9
grpc-grpc-java v1.30.1 v1.9.1
havarunner-havarunner 0.8.4 0.9.5
immutables-immutables 2.6.0 2.7.5
ionic-team-capacitor 1.0.0 1.5.2
jankotek-mapdb 3.0.0 3.0.8
javafunk-funk 0.1.24 0.2.0
javaparser 3.6.9 3.13.10
permissions-dispatcher 2.1.0 2.4.0
pxb1988-dex2jar 0.0.9.14 0.0.9.9
web3j v4.5.2 v4.6.1
our search criteria. The selected projects are shown in Table 4, and the complete set of datasets can be found on our GitHub page3.
The columns firstRelease and lastRelease in Table 4 indicate the versions of the software that we examined and used for our experiments. Note that we use the 21 incremental releases between the stated firstRelease and lastRelease to ensure the stability of the clustering algorithms, and to generate the ground truth.
3.4. Generation of Ground Truth
In this work, we attempt to improve the existing ground truth generation method by looking at the evolution of the analysed software over multiple releases instead of only the latest version's package structure. The ground truth is created via the extraction of common directories across the 10 previous releases of the software. For each version of the software, the previous 10 releases were analysed, and only the file directory structure common to all 10 versions was extracted.
Given the scale of the 30 open source projects over mul-
tiple releases, it is challenging to find domain experts for
3https://github.com/alvintanjianjia/SoftwareRemodularization
each of the software systems, a problem similarly encountered in the work by [44]. Ground truths that are gen-
erated and manually approved by senior developers work-
ing on the project would only be practical for a handful of
projects. However, this approach will not provide our re-
search with enough data points for footprint visualisations
during the evaluation of clustering results. Given that the
current factual architecture of the system has been created
by the open source developers or administrators themselves,
it is reasonable to assume the current package structure is
held to a certain standard [45,29]. Additionally, these open source projects are among the highest rated Java projects on GitHub based on stars, which provides additional confidence in the correctness of the generated ground truth based on the evolution of the package structure over multiple releases.
In the example shown in Figure 4a, we use the common
directories across apache_spark-1.0 to apache_spark-1.9 to
generate the ground truth. Subsequently, this ground truth
comprising the common directory from releases 1.0 to 1.9
will be used to evaluate against the clustering results that we
produce in the next release, which is apache_spark-2.0, as
shown in Figure 4b. To illustrate another simple example,
given the following directory structure of a software in 3 in-
Figure 4: a.) Method used to select project releases and
generate ground truth b.) Method used to evaluate clus-
tering results against ground truth.
cremental releases:
1. V 2.5: src/var/c.java, src/var/d.java, tmp/eg/z.java
2. V 2.6: src/var/c.java, src/var/d.java, temp/util/z.java
3. V 2.7: src/var/c.java, src/var/d.java, temp/eg/z.java
Based on our approach, only src/var/c.java and src/var/d.java are extracted from the given 3 versions. A parent-child cluster relationship is then defined based on the extracted directory paths, in the form parent contains child.
For example, given that the following directory paths are
extracted
1. user/spark/java/a.java
2. user/spark/java/b.java
3. user/spark/main/test.java
The following parent-child clusters will be created.
java contains a.java
java contains b.java
main contains test.java
spark contains java
spark contains main
user contains spark
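The two examples above can be reproduced with a short sketch: intersect the file paths of the releases, then derive parent-contains-child relations from each retained path. The release lists mirror the V2.5-V2.7 example.

```python
releases = {
    "v2.5": ["src/var/c.java", "src/var/d.java", "tmp/eg/z.java"],
    "v2.6": ["src/var/c.java", "src/var/d.java", "temp/util/z.java"],
    "v2.7": ["src/var/c.java", "src/var/d.java", "temp/eg/z.java"],
}

# Keep only files whose full path is identical in every release.
common = set.intersection(*(set(files) for files in releases.values()))

# Derive parent-child "contains" relations from each retained path.
contains = set()
for path in sorted(common):
    parts = path.split("/")
    for parent, child in zip(parts, parts[1:]):
        contains.add((parent, child))

print(sorted(common))
print(sorted(contains))
```

z.java is discarded because its path differs between releases, while the shared src/var prefix yields the parent-contains-child pairs used as the reference model.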
The final clustering result obtained is then taken as the
ground truth, which will be used as the reference model when
compared using the MoJoFM metric.
Table 5
Summary of parameters and settings associated with hierar-
chical clustering.
Parameters         Values
Linkage Method     Average, Complete, Single
Distance Metric    Euclidean, Cosine, Manhattan
Divisor            5, 7, 10, 20, 25
Table 6
Summary of parameters and settings associated with Bunch
clustering.
Parameters          Values
Bunch Algorithm     HillClimbing, GeneticAlgorithm, Exhaustive
Bunch Calculator    TurboMQIncrW, TurboMQIncr, TurboMQ, TurboMQW, BasicMQ
3.5. Measuring Dependencies between Classes
As mentioned earlier, for hierarchical clustering algo-
rithms, we use Depends [38] to quantify the strength of de-
pendencies between classes in the examined software. The tool can create an N×N matrix to show the types and frequency of dependencies between all the classes in the analysed software (N = number of classes).
On the other hand, the Bunch tool uses a Module Dependency Graph (MDG) to measure the strength of dependencies between classes [12,21].
3.6. Selection and Permutation of Chosen
Clustering Algorithms
For each of the chosen software projects, we ran different permutations of hierarchical and Bunch clustering algorithms based on the configurations shown in Table 5 and Table 6.
In total, we have 45 unique configurations of hierarchical
clustering algorithms and 15 unique configurations of Bunch
clustering algorithms to run on each chosen project. Further-
more, for each configuration, we ran it against the 10 prior
releases of the target software to ensure the stability of the
algorithm.
3.7. Evaluation of Clustering Results
The clustering results are compared with the ground truth
using MoJoFM [30]. Due to the size of the table, we are un-
able to show the full set of clustering results from all the
Table 7
Top 10 hierarchical clustering results based on MoJoFM.
Project Name Project Version Divisor Affinity Linkage MoJoFM
facebook-facebook-java-business-sdk 3.2.7 25 euclidean single 98.657
facebook-facebook-java-business-sdk 3.2.8 25 euclidean single 96.979
facebook-facebook-java-business-sdk 3.2.9 25 euclidean single 96.812
facebook-facebook-java-business-sdk 3.3.1 25 euclidean single 96.644
facebook-facebook-java-business-sdk 3.3.5 25 euclidean single 96.643
facebook-facebook-java-business-sdk 3.3.6 25 euclidean single 96.642
facebook-facebook-java-business-sdk 3.3.0 25 euclidean single 96.476
facebook-facebook-java-business-sdk 3.3.2 25 euclidean single 96.475
facebook-facebook-java-business-sdk 3.3.3 25 euclidean single 96.474
javafunk-funk 0.2.0 25 euclidean single 96.268
Table 8
Top 10 Bunch clustering results based on MoJoFM.
Project Name Project Version Algorithm Calculator MoJoFM
google-openrtb 1.5.7 GA TurboMQIncrW 90.909
google-openrtb 1.5.5 HillClimbing BasicMQ 90.909
facebook-react-native-fbsdk 0.10.2 Exhaustive TurboMQIncrW 90.476
facebook-react-native-fbsdk 0.10.0 HillClimbing TurboMQ 90.476
bytedeco-javacpp 1.4.2 HillClimbing TurboMQW 88.961
google-openrtb 1.5.9 Exhaustive TurboMQIncr 88.637
google-openrtb 1.5.2 Exhaustive TurboMQIncr 88.637
google-openrtb 1.5.4 HillClimbing BasicMQ 88.636
google-openrtb 1.5.6 Exhaustive TurboMQ 88.636
google-openrtb 1.5.11 HillClimbing BasicMQ 86.363
chosen projects. The complete set of results can be accessed from our GitHub page. The summarised version of the clustering results, showing the top 10 results for hierarchical clustering and Bunch, is shown in Table 7 and Table 8 respectively.
4. Results and Discussion
The proposed E-SC4R framework identifies the most sig-
nificant software features, which have an impact on the per-
formance of clustering techniques. The resulting SVM pre-
dictions are plotted in the reduced instance space as shown
in Figure 5.
Based on the output of the SVM model, we took a deeper dive into the accuracy, precision, and recall scores of each algorithm, and found that most of the algorithms with distinct separable clusters have high recall scores, while the algorithms with indistinguishable clusters have low recall scores.
Distinct separable clusters are manually identified from the footprint visualisations: within a specific range of software metrics, one algorithm clearly performs the best at clustering those software systems. Drawing an example from Figure 5, in the range where 0.2 < z1 < 0.4 and 0.6 < z2 < 0.8, cosine_average_10 performs the best for these software systems. The values that z1 and z2 represent are explained in more detail in Section 4.1.1.
Given that:

True Positive (TP) = the clustering algorithm is correctly predicted,
True Negative (TN) = the clustering algorithm is correctly rejected,
False Positive (FP) = the clustering algorithm is wrongly predicted,
False Negative (FN) = the clustering algorithm is wrongly rejected,

Accuracy = the proportion of predictions in which E-SC4R correctly predicts the most suitable clustering algorithm or correctly rejects an unsuitable one: (TP + TN) / (TP + TN + FP + FN)

Precision = the proportion of times E-SC4R is correct in predicting the best algorithm out of all the times it predicts that algorithm: TP / (TP + FP)

Recall = the proportion of times E-SC4R is correct in predicting the best algorithm out of all the times the best algorithm should have been predicted: TP / (TP + FN)
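As a sanity check, the three measures can be computed from a hypothetical confusion count; the values below are invented to illustrate a high-accuracy, high-precision, low-recall outcome.

```python
def classification_metrics(tp, tn, fp, fn):
    # Standard definitions, matching the formulas above.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Hypothetical confusion counts for one clustering algorithm.
print(classification_metrics(tp=18, tn=60, fp=2, fn=20))
```

Here accuracy (0.78) and precision (0.90) are high while recall (about 0.47) is low, because many cases where the algorithm was actually best were missed.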
We are able to identify 3 main patterns/clusters of algorithms from the results shown in Table 9. Note that, due to the size of the table, we only show some of the examples in the last column of Table 9. The average accuracy, precision, and recall of the agglomerative and Bunch algorithms produced by the SVM model (on the 300 releases) are shown in Tables 10, 11, 12, and 13.
Table 9
Categorisation of clustering results based on SVM.
Cluster Type Observation Examples
𝑐1High accuracy, high preci-
sion and high recall
We are able to correctly predict when these
algorithms are suitable, as well as which fea-
tures are the most important when determin-
ing the selection of these algorithms.
cosine_single_20, eu-
clidean_single_25, man-
hattan_single_15, hill-
climbing_turbomqincrw
𝑐2Low accuracy, low preci-
sion and low recall
It is hard to predict whether these algorithms
are suitable.
cosine_complete_20,
euclidean_complete_15,
manhat-
tan_complete_15,
euclidean_average_15,
ga_turbomq,
ga_turbomqincr
𝑐3High accuracy, high preci-
sion and low recall
The model is generally able to accurately pre-
dict when these algorithms are suitable, but
most of the time the model prioritises other
algorithms compared to the selected algo-
rithms.
cosine_average_10,
cosine_single_5, exhaus-
tive_turbomqincr, ex-
haustive_turbomqincrw
Table 10
Performance of SVM model for agglomerative cosine algorithm.
Algorithm Accuracy Precision Recall
cosine_average_10 88.7 90.0 21.4
cosine_average_15 81.7 94.7 25.0
cosine_average_20 69.0 66.7 41.0
cosine_average_25 63.3 82.8 34.9
cosine_average_5 97.3 70.0 58.3
cosine_average_7 96.3 100.0 47.6
cosine_complete_10 93.7 90.0 33.3
cosine_complete_15 56.3 8.5 29.4
cosine_complete_20 64.3 20.0 34.0
cosine_complete_25 49.3 27.1 67.6
cosine_complete_5 97.0 70.0 53.8
cosine_complete_7 97.0 100.0 52.6
cosine_single_10 55.7 90.0 6.4
cosine_single_15 85.3 86.9 95.8
cosine_single_20 90.7 90.4 99.6
cosine_single_25 89.0 88.6 99.6
cosine_single_5 88.0 100.0 21.7
cosine_single_7 79.7 100.0 14.1
Algorithms that fall into c1, which possess high recall, are preferred, as we are able to easily identify the software features that contribute to determining whether a particular algorithm is the most suitable for a given software system.
4.1. PCA Visualisation
To visualise the results in a meaningful way, we apply PCA as a dimensionality reduction technique on the optimal subset of software features. The aim is to plot the performance of the different clustering algorithms across the project space in 2D, which is likely to reveal where the clustering algorithms perform well and where their weaknesses lie. Two new axes were created as linear combinations of the selected set of software features. The projection onto the two principal components retains 85% of the variation in the data.
4.1.1. Combined PCA Visualisation of Agglomerative
and Bunch algorithms
A combined visualisation (Figure 5) is created to give a general overview of the spread of the algorithms and features. Based on the MoJoFM results, an initial comparison was made of which algorithms are to be prioritised among the 300 projects (30 unique projects with 10 releases each); some examples of the projects are shown in Table 4. Recall that for each of the chosen 30 projects, we run different configurations of agglomerative and Bunch clustering algorithms over 10 releases. The best-performing algorithm (in terms of MoJoFM values) is prioritised.
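The prioritisation rule above amounts to a per-project argmax over MoJoFM scores; the scores below are invented for illustration, not taken from our results:

```python
# For each project, the configuration with the highest MoJoFM score
# is marked as the prioritised algorithm. Scores are hypothetical.
scores = {
    "dropwizard": {"cosine_average_10": 68.2, "manhattan_average_25": 74.5,
                   "ga_turbomq": 61.0},
    "javaparser": {"cosine_average_10": 70.1, "manhattan_average_25": 69.8,
                   "ga_turbomq": 55.4},
}

prioritised = {p: max(cfgs, key=cfgs.get) for p, cfgs in scores.items()}
print(prioritised)
# {'dropwizard': 'manhattan_average_25', 'javaparser': 'cosine_average_10'}
```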
The coordinate system that defines the new instance space
AJJ Tan et al.: Preprint submitted to Elsevier Page 14 of 31
E-SC4R: Explaining Software Clustering for Remodularisation
Table 11
Performance of SVM model for agglomerative Euclidean algorithm.
Algorithm Accuracy Precision Recall
euclidean_average_10 89.0 90.0 22.0
euclidean_average_15 51.7 27.0 54.7
euclidean_average_20 64.0 45.3 43.4
euclidean_average_25 58.7 54.4 99.3
euclidean_average_5 96.3 70.0 46.7
euclidean_average_7 94.0 100.0 35.7
euclidean_complete_10 88.3 90.0 20.9
euclidean_complete_15 63.3 23.3 17.9
euclidean_complete_20 59.7 49.2 52.9
euclidean_complete_25 59.0 55.0 98.7
euclidean_complete_5 95.3 70.0 38.9
euclidean_complete_7 95.0 100.0 40.0
euclidean_single_10 75.0 90.0 10.8
euclidean_single_15 67.0 65.9 100.0
euclidean_single_20 83.0 82.4 100.0
euclidean_single_25 87.7 87.2 100.0
euclidean_single_5 93.3 70.0 29.2
euclidean_single_7 85.7 100.0 18.9
Table 12
Performance of SVM model for agglomerative Manhattan algorithm.
Algorithm Accuracy Precision Recall
manhattan_average_10 70.3 90.0 9.3
manhattan_average_15 84.0 83.0 99.1
manhattan_average_20 92.7 92.1 100.0
manhattan_average_25 94.0 93.6 100.0
manhattan_average_5 92.3 70.0 25.9
manhattan_average_7 87.0 100.0 20.4
manhattan_complete_10 85.7 90.0 17.6
manhattan_complete_15 59.3 35.0 48.8
manhattan_complete_20 59.0 54.8 99.3
manhattan_complete_25 77.7 75.6 99.5
manhattan_complete_5 95.0 60.0 35.3
manhattan_complete_7 95.0 100.0 40.0
manhattan_single_10 62.3 90.0 7.4
manhattan_single_15 93.3 92.9 100.0
manhattan_single_20 92.7 92.1 100.0
manhattan_single_25 94.0 93.6 100.0
manhattan_single_5 90.0 70.0 20.6
manhattan_single_7 76.7 100.0 12.5
is defined as

\[
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} =
\begin{bmatrix}
-0.1027 & 0.0546 \\
-0.0137 & 0.1251 \\
0.1056 & -0.1107 \\
-0.0784 & 0.0747 \\
0.0708 & 0.0566 \\
0.0910 & -0.0484 \\
0.1015 & 0.0525 \\
-0.0418 & -0.0474 \\
0.1215 & 0.0154 \\
0.0578 & 0.1086
\end{bmatrix}^{\top}
\begin{bmatrix}
\text{staticMethods\_sum} \\
\text{modifiers\_mean} \\
\text{defaultMethods\_mean} \\
\text{maxNestedBlocks\_mean} \\
\text{totalMethods\_mean} \\
\text{protectedMethods\_mean} \\
\text{finalFields\_mean} \\
\text{stringLiteralsQty\_mean} \\
\text{lambdasQty\_mean} \\
\text{returnQty\_mean}
\end{bmatrix}
\tag{5}
\]
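Applying the projection of Equation 5 is a single matrix-vector product, $z = A^{\top} f$. The coefficient matrix below is taken from Equation 5; the feature vector is a made-up example of scaled feature values:

```python
import numpy as np

# Coefficient matrix A from Equation 5 (one row per software feature).
A = np.array([
    [-0.1027,  0.0546],   # staticMethods_sum
    [-0.0137,  0.1251],   # modifiers_mean
    [ 0.1056, -0.1107],   # defaultMethods_mean
    [-0.0784,  0.0747],   # maxNestedBlocks_mean
    [ 0.0708,  0.0566],   # totalMethods_mean
    [ 0.0910, -0.0484],   # protectedMethods_mean
    [ 0.1015,  0.0525],   # finalFields_mean
    [-0.0418, -0.0474],   # stringLiteralsQty_mean
    [ 0.1215,  0.0154],   # lambdasQty_mean
    [ 0.0578,  0.1086],   # returnQty_mean
])

# Hypothetical scaled feature values for one project.
f = np.array([1.2, -0.3, 0.8, 0.1, 2.0, -1.1, 0.4, 0.9, -0.2, 1.5])

z1, z2 = A.T @ f          # the project's coordinates in the instance space
print(z1, z2)
```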
As seen from the visualisation in Figure 5, agglomerative algorithms are heavily prioritised over Bunch algorithms: at most points of the instance space, an agglomerative algorithm is the prioritised choice. Figure 6 can be interpreted as follows: by looking at Equation 5 and the generated visualisation, when 𝑧2 is within the range of -1 to -1.4 and 𝑧1 is within the range of 0 to 0.2, the prioritised algorithm is cosine_average_10. Based on Table 14 and Equation 5, we can see that agglomerative algorithms are prioritised over Bunch algorithms if MoJoFM is used as the main evaluation metric, based both on the output of the SVM framework and on the raw data from the first part of the experiments.
A higher value for an individual feature in Equation 5 would mean that the feature has a higher influence on predicting which algorithm is best, and lower values would
Table 13
Performance of SVM model for Bunch algorithm.
Algorithm Mean Accuracy Mean Precision Mean Recall
exhaustive_basicmq 60.3 11.5 24.4
exhaustive_turbomq 82.0 0.0 0.0
exhaustive_turbomqincr 79.0 10.0 7.7
exhaustive_turbomqincrw 77.0 19.0 7.1
exhaustive_turbomqw 82.0 0.0 0.0
ga_basicmq 70.3 19.6 20.0
ga_turbomq 85.7 0.0 0.0
ga_turbomqincr 82.7 0.0 0.0
ga_turbomqincrw 72.3 22.8 25.0
ga_turbomqw 61.0 13.3 23.5
hillclimbing_basicmq 73.3 22.7 17.9
hillclimbing_turbomq 73.7 26.5 23.2
hillclimbing_turbomqincr 79.7 16.7 12.2
hillclimbing_turbomqincrw 51.0 21.6 63.6
hillclimbing_turbomqw 85.7 0.0 0.0
Figure 5: Combined algorithm spread.
mean a lower feature importance. For example, maxNestedBlocks_mean in Equation 5 has comparatively low values. This means that the same clustering method may be suitable for programs with vastly different values for this feature.
Upon further investigation, we discovered that although the agglomerative clustering algorithm appears to be the superior algorithm when measured against MoJoFM, it usually generates many small clusters with few classes (5-10 classes). On the other hand, Bunch tends to generate clustering results with fewer clusters and a more even number of classes inside each cluster, which might make it easier for software maintainers to follow the suggested decomposition. Our findings largely agree with the experiments done by Wu et al. [4], who discovered that algorithms that give good clustering results according to one criterion (i.e., MoJoFM) often do not give good results according to other criteria (i.e., size and number of clusters).
As such, we have decided to analyse the strengths and
weaknesses of agglomerative and Bunch clustering algorithms
separately using PCA, instead of combining the two.
Figure 6: Agglomerative algorithm selection.
Table 14
Performance of agglomerative vs Bunch clustering algorithm
over 300 projects.
Algorithms Number of times algorithm is prioritised
Agglomerative 291
Bunch 9
Summary of Combined PCA Visualisations: Agglomerative algorithms are prioritised over Bunch algorithms if MoJoFM is used as the main evaluation metric. Agglomerative algorithms usually generate many small clusters with few classes (5-10). Bunch algorithms usually generate fewer clusters with a more balanced spread of the number of classes inside each cluster.
4.2. Agglomerative PCA Visualisation
Figure 6 illustrates the footprint visualisation generated for agglomerative algorithms.
The coordinate system that defines the new instance space
is defined as:
\[
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} =
\begin{bmatrix}
0.0224 & 0.0178 \\
-0.1018 & 0.0626 \\
-0.0086 & 0.0677 \\
-0.0459 & -0.0050 \\
-0.0708 & 0.0803 \\
0.0695 & 0.1282 \\
0.0908 & 0.0030
\end{bmatrix}^{\top}
\begin{bmatrix}
\text{staticMethods\_std} \\
\text{privateMethods\_mean} \\
\text{subClassesQty\_mean} \\
\text{cbo\_mean} \\
\text{modifiers\_max} \\
\text{publicMethods\_mean} \\
\text{anonymousClassQty\_mean}
\end{bmatrix}
\tag{6}
\]
where the footprint shows three main clusters:
- cosine_average_15 (brown)
- cosine_average_7 (blue)
- manhattan_average_25 (pink)
The seven software features in Equation 6 are identified as the most impactful on the performance of the different clustering algorithms used for software remodularisation. We noticed that six out of the seven software features that form the coordinate system in Figure 6 are size metrics. This shows that, for agglomerative clustering algorithms, size metrics have a stronger influence on the performance of the algorithm.
4.2.1. Agglomerative PCA Relationship between
Features and Clusters
Using the new coordinate system for the agglomerative clustering algorithms, we visualise the footprints of the different techniques as shown in Figures 7, 8 and 9. We show the results for the prioritised clustering methods individually (Figures 7(a)-7(c)), setting the threshold for good performance at a software clustering quality above 70% (MoJoFM). Each data point represents a project, which is labelled as good if its MoJoFM score is above 70%, and as bad otherwise.
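The labelling rule above can be sketched directly; the project names and scores here are invented for illustration:

```python
# A project is labelled "good" for an algorithm if its MoJoFM score
# exceeds the 70% threshold, and "bad" otherwise. Scores are hypothetical.
mojofm = {
    "project-a": 82.5,
    "project-b": 64.0,
    "project-c": 71.3,
}

labels = {name: ("good" if score > 70.0 else "bad")
          for name, score in mojofm.items()}
print(labels)  # {'project-a': 'good', 'project-b': 'bad', 'project-c': 'good'}
```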
To better understand why certain clustering algorithms work better for the software projects in a cluster, we did a side-by-side comparison of software features and SVM model performance (Figures 8, 9) to try to draw correlations.
By comparing Figures 7, 8 and 9, we are able to draw similarities between the algorithms' footprint patterns and the distribution of the features' values. This means that the following features are the most important when determining the priority of the algorithms.
- Modifiers_max (Figure 8(a)) and Manhattan Average 25 (linkage method; distance metric; cluster divisor) (Figure 7(c)): There are distinct distributions between the top-left and bottom-right clusters which are reflected in both footprints. Representing the software features of the target software 𝑠 ∈ 𝑆 in the new instance space, when the software features fall in the region of 𝑧2 > -0.2 and 𝑧1 < 0.4, modifiers_max is the most important feature in determining whether Manhattan Average 25 is the most suitable clustering algorithm.
- PublicMethods_mean (Figure 8(b)) and Cosine Average 15 (Figure 7(b)): There are distinct distributions between the left and right clusters which are reflected in both footprints. Representing the software features of the target software 𝑠 ∈ 𝑆 in the new instance space, when the software features fall in the region of 0 > 𝑧2 > -0.1 and 𝑧1 > 0.4, publicMethods_mean is the most important feature in determining whether Cosine Average 15 is the most suitable clustering algorithm.
- AnonymousClassesQty_mean (Figure 9(c)) and Cosine Average 7 (Figure 7(a)): There are distinct distributions between the left and right clusters which are reflected in both footprints. Representing the software features of the target software 𝑠 ∈ 𝑆 in the new instance space, when the software features fall in the region of 𝑧2 < -0.6 and 𝑧1 > 0.1, AnonymousClassesQty_mean is the most important feature in determining whether Cosine Average 7 is the most suitable clustering algorithm.
However, certain features, such as staticMethods_std (Figure 8(c)) and subClassesQty_mean (Figure 9(b)), do not have a clear, distinct distribution among the clusters. These features do not show any similar distribution patterns when compared to the prioritised agglomerative algorithms either.
4.2.2. Findings from Agglomerative Footprints
Software systems with higher values of staticMethods, publicMethods, privateMethods, and modifiers are more complex, which leads to more opportunities for remodularisation [46]. While there is no existing literature that identifies a correlation between the number of methods and the performance of software clustering algorithms, the works in [46, 47, 48] discuss how metrics related to the size of the software can be effectively used to measure the quality of object-oriented software systems and for fault prediction. Our approach provides clear evidence of the impact these software features have on the effectiveness of software clustering and remodularisation techniques.
Table 15
Performance of agglomerative linkage distribution.
Algorithms Number of times algorithm is prioritised
Average 48
Complete 52
Single 200
For agglomerative clustering, the algorithm carries out clustering based on the distance matrix provided by the Depends tool, where classes with high functional dependency (i.e., minimal distance) are clustered together to form a cluster (subsystem). The Depends tool measures dependencies between classes by analysing method invocation, type casting, and variable containment, which are strongly correlated with the seven main software features that we have identified: staticMethods, privateMethods, subClassesQty, CBO (Coupling Between Objects), modifiers, publicMethods, and anonymousClassQty. When there are more methods and assignment operations in a class, the probability that the method operations and variable assignments involve instances from another class is higher. With the richer dependency information extracted from Depends, we can better illustrate the interrelationships between classes in the analysed software.
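A minimal sketch of agglomerative clustering over such a precomputed class-to-class distance matrix (in our pipeline the matrix comes from the Depends output; the 4x4 matrix below is a toy stand-in where classes A-B and C-D are strongly coupled):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix for 4 classes (zero diagonal).
D = np.array([
    [0.0,  0.1,  0.9,  0.8],
    [0.1,  0.0,  0.85, 0.9],
    [0.9,  0.85, 0.0,  0.2],
    [0.8,  0.9,  0.2,  0.0],
])

# Average linkage on the condensed form of the distance matrix.
Z = linkage(squareform(D), method="average")
clusters = fcluster(Z, t=2, criterion="maxclust")
print(clusters)  # classes A,B end up in one cluster; C,D in the other
```

Swapping `method="average"` for `"single"` or `"complete"` gives the other linkage variants compared in Table 15.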
Based on the results, we are able to observe very distinct
clusters from the footprints for the prioritised algorithms.
When we investigate the distribution of software features
in Figures 8(a) and 8(b), we see a clear gradient of change
from top left to bottom right and from bottom left to top right
respectively. This means that we are able to clearly assign the most suitable algorithms to these projects based on these software features, modifiers_max and publicMethods_mean, respectively.
Intuitively, this makes sense, as classes with a high number of public methods and modifiers have more opportunities to be remodularised by splitting large and complex classes into smaller ones with fewer methods.
SubClassesQty and Coupling Between Objects, on the other hand, increase the complexity of the software remodularisation problem. Intuitively, this makes sense, as the presence of subclasses and high coupling (CBO) makes it harder to separate these classes during the clustering process. Hence, the findings from this section help provide answers to the first research question.
As for the linkage algorithm, our results show that single linkage triumphs over average and complete linkage in two-thirds of the projects (200 out of 300), as reflected in Table 15. However, single linkage is not prioritised over average linkage by the SVM model, which is due to the overall SVM accuracy, precision, and recall of each individual algorithm, as discussed above. The single linkage algorithm belongs to the 𝑐3 cluster (high accuracy, high precision, and low recall), where the SVM model generally prioritises other algorithms due to the low recall of single linkage algorithms. This is interesting because the single linkage algorithm tends to form large and less coupled clusters. Upon further investigation, we found that the ground truth extracted from the analysed projects tends to have a large directory structure as well, which contributes toward this finding.
(a) Footprints of Cosine Average 7. (b) Footprints of Cosine Average 15.
(c) Footprints of Manhattan Average 25.
Figure 7: Agglomerative algorithm footprint visualisation.
While the work by Maqbool [14] claimed that the complete linkage algorithm is capable of forming the best software clustering results in terms of cluster cohesiveness, they used binary clustering features (identifying the presence or absence of similar features) to identify the interrelationships between software entities. Our work, on the other hand, utilises quantifiable measures to assign a relative weight that indicates the strength of dependencies between classes. Apart from that, when running the same linkage algorithms on 10 previous releases of the examined software, we found that single linkage produces much more stable results than the complete and average linkage algorithms, which largely agrees with the observation in the work by [49]. Since single linkage outperforms the other linkage algorithms in most scenarios, we do not include the other footprint visualisations in this paper.
The work by Tzerpos et al. [49] stated that single linkage forms the least cohesive clusters. However, we would like to argue that the experiments conducted by the authors were performed on software written in the C programming language. Hence, the same might not be applicable to modern Java-based systems, which possess a completely different structure compared to software written in C. We theorise that the feature extraction tool used to capture the relationships between the clustering entities (classes) enables us to extract richer information on the interactions between the classes, thus producing slightly different results compared to the work by Tzerpos et al.
(a) Distribution of modifiers_max. (b) Distribution of public methods_mean.
(c) Distribution of static methods_std. (d) Distribution of private methods_mean.
Figure 8: Agglomerative software features footprint visualisation.
To provide a simple illustration of how E-SC4R can effectively recommend the optimum clustering algorithm from the pool of choices, we have compared the E-SC4R framework against some baseline agglomerative hierarchical clustering algorithms. Table 16 shows a comparison between the MoJo values for "Average Euclidean 10", "Complete Cosine 10", "Single Manhattan 10", and the configuration recommended by E-SC4R. The lower the MoJo value, the better suited the algorithm configuration is for the specific project. The last column in Table 16 is the configuration suggested by E-SC4R for the target software. We would like to note that, due to space constraints, we are unable to show all the configurations compared against the ones recommended by E-SC4R. The complete information on all clustering results is available on our GitHub page.
By using the proposed framework, developers and researchers can easily identify the optimum clustering algorithm and its configuration, instead of adopting an exhaustive or trial-and-error approach, which is tedious and error-prone.
(a) Distribution of Coupling Between Objects_mean. (b) Distribution of SubClassesQty_mean.
(c) Distribution of AnonymousClassesQty_mean.
Figure 9: Agglomerative software features footprint visualisation.
Summary of Agglomerative Footprints Visual-
isation: Agglomerative clustering algorithms are
most impacted by staticMethods, privateMethods,
subClassesQty, cbo, modifiers, publicMethods and
anonymousClassQty. Single linkage outperforms
average and complete linkage.
4.3. Bunch Footprints Visualisation
Figure 10 illustrates the footprint generated by the Bunch algorithm.
The coordinate system that defines the new instance space
is defined as:
\[
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} =
\begin{bmatrix}
0.1127 & 0.0705 \\
0.2152 & 0.0490 \\
-0.0411 & 0.2699
\end{bmatrix}^{\top}
\begin{bmatrix}
\text{RFC\_mean} \\
\text{staticMethods\_mean} \\
\text{stringLiteralsQty\_mean}
\end{bmatrix}
\tag{7}
\]
The three software features in Equation 7 are identified as the most impactful on the performance of the Bunch clustering algorithms used for software remodularisation. staticMethods and stringLiteralsQty are size-related metrics, while RFC is a coupling- and complexity-related metric. All three metrics are correlated, because larger projects tend to have more complex classes with a higher number of methods, which leads to higher RFC.
Table 16
Baseline comparison.
Project Name | MoJo for Average Euclidean 10 | MoJo for Complete Cosine 10 | MoJo for Single Manhattan 10 | MoJo for E-SC4R Recommended Configuration | E-SC4R Recommended Configuration
bkromhout-realm-java 88 129 52 43 manhattan average 25
btraceio-btrace 240 301 75 53 manhattan average 25
bytedeco-javacpp 41 75 14 8 manhattan average 25
codecentric-spring-boot-admin 35 65 21 16 manhattan average 25
codenvy-legacy-che-plugins 1933 1358 414 331 manhattan average 25
coobird-thumbnailator 42 51 27 21 manhattan average 25
dropwizard 672 429 172 121 manhattan average 25
dropwizard-metrics 98 145 58 37 manhattan average 25
evant-gradle-retrolambda 3 3 3 3 cosine average 7
facebook-android-sdk 173 210 194 120 cosine average 15
facebook-java-business-sdk 137 185 170 137 manhattan average 25
facebook-fresco 307 388 168 140 manhattan average 25
facebook-litho 1058 990 250 164 manhattan average 25
facebook-react-native-fbsdk 1 1 1 1 manhattan average 25
google-cdep 56 115 30 19 manhattan average 25
google-dagger 91 138 63 47 manhattan average 25
google-error-prone 639 830 418 370 manhattan average 25
google-gitiles 48 43 17 8 manhattan average 25
google-openrtb 19 20 13 8 manhattan average 25
google-openrtb-doubleclick 8 14 8 6 manhattan average 25
grpc-grpc-java 183 160 75 55 manhattan average 25
havarunner-havarunner 20 23 17 13 manhattan average 25
immutables-immutables 300 412 150 126 manhattan average 25
ionic-team-capacitor 18 42 14 12 manhattan average 25
jankotek-mapdb 47 89 25 15 manhattan average 25
javafunk-funk 19 50 17 14 cosine average 15
javaparser 657 792 251 216 manhattan average 25
permissions-dispatcher 7 10 8 7 manhattan average 25
pxb1988-dex2jar 37 63 30 26 manhattan average 25
web3j 280 378 90 60 manhattan average 25
An interesting finding in Figure 10 is the cluster of projects (yellow colour) where none of the Bunch clustering algorithms is predicted to perform well, labelled as "None". The footprint of this cluster lies in the "grey zone", where all three software features (RFC, staticMethods, and stringLiteralsQty) have medium scores, such that 0.2 < 𝑧1 < 0.4 and -0.2 < 𝑧2 < 0.2. This is an indication of the specialisation of the Bunch clustering algorithms, and provides evidence that these methods are good at solving extreme cases. This finding highlights another important aspect of our methodology: by analysing the strengths and weaknesses of existing software clustering techniques, we can identify areas that require improvement. In this case, it is evident that there is a gap in software clustering techniques that are able to solve problems with a medium number of RFC, staticMethods, and stringLiteralsQty.
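Identifying which projects fall into this "grey zone" is a simple range check on the projected coordinates; the coordinates below are illustrative, not taken from our data:

```python
import numpy as np

# Projected (z1, z2) coordinates for four hypothetical projects.
z = np.array([[0.30,  0.00],    # inside the grey zone
              [0.25,  0.15],    # inside the grey zone
              [-0.50, 0.90],
              [0.35, -0.50]])

# Grey-zone region from the text: 0.2 < z1 < 0.4 and -0.2 < z2 < 0.2.
grey = (z[:, 0] > 0.2) & (z[:, 0] < 0.4) & (z[:, 1] > -0.2) & (z[:, 1] < 0.2)
print(grey)  # [ True  True False False]
```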
4.3.1. Bunch PCA Relationship between Features and
Clusters
Using the new coordinate system for Bunch, we visualise the footprints of the different techniques as shown in Figures 11, 12 and 13. To better understand why certain clustering algorithms or parameters work better for the software projects in a cluster, we did a side-by-side comparison of feature and SVM model performance to try to draw correlations.
By comparing Figures 11, 12 and 13, we are able to draw similarities between the algorithms' footprint patterns and the distribution of the feature values. This means that these features are the most important when determining the priority of the algorithms.
Figure 10: Bunch algorithm selection.
- StringLiteralsQty_mean (Figure 13(c)) and HillClimbing TurboMQ (Figure 12(b)): There are distinct distributions between the top-left and bottom-right clusters which are reflected in both footprints. Representing the software features of the target software 𝑠 ∈ 𝑆 in the new instance space, when the software features fall in the region of 𝑧2 > 0.4 and 𝑧1 < 0.4, StringLiteralsQty_mean is the most important feature in determining whether HillClimbing TurboMQ is the most suitable clustering algorithm.
- RFC_mean (Figure 13(a)) and GA TurboMQIncrW (Figure 11(d)): There are distinct distributions between the right and left clusters which are reflected in both footprints. Representing the software features of the target software 𝑠 ∈ 𝑆 in the new instance space, when the software features fall in the region of 𝑧2 > 0.5 and 𝑧1 > 0.4, RFC_mean is the most important feature in determining whether GA TurboMQIncrW is the most suitable clustering algorithm.
- StaticMethods_mean (Figure 13(b)) and GA BasicMQ (Figure 11(c)): There are distinct distributions between the bottom-right and top-left clusters which are reflected in both footprints. Representing the software features of the target software 𝑠 ∈ 𝑆 in the new instance space, when the software features fall in the region of 𝑧2 > 0.4 and 𝑧1 > 0.3, StaticMethods_mean is the most important feature in determining whether GA BasicMQ is the most suitable clustering algorithm.
4.3.2. Findings from Bunch Footprints Visualisation
For Bunch, the algorithm carries out clustering based on
source code analysis. Bunch uses a family of source code
analysis tools (supports C, C++ and Java) that is based on an
entity relationship model, where the source code is scanned
and a relational database is constructed to store the enti-
ties and relations [12]. Bunch also assumes that all rela-
tion types have an equal weight. Hence, when taking into
account variable references or global variables, the amount
of StringType literals can be a distinguishable feature. This
is why StringLiteralsQty turns out to be an important fea-
ture within Bunch where the more String variables that are
present within the software, there will be more relationships
that the source code analysis tool can identify, which subse-
quently helps to correctly identify the distribution of the cor-
rect clusters. At the same time, features such as staticMeth-
ods and RFC which contributes to the richness of informa-
tion within the entity relationship model is also important
because they can better illustrate the interrelationships be-
tween classes.
The footprint visualisations for the prioritised Bunch algorithms agree, to a certain extent, with the authors' claims that the exhaustive clustering algorithm only works well for small systems [11] and that the hill-climbing algorithm performs well for most software systems [12]. The footprints of Exhaustive BasicMQ (Figure 11(a)) and Exhaustive TurboMQIncrW (Figure 11(b)) show significantly larger clusters of instances labelled as "Good" by the SVM model when the values of 𝑧1 and 𝑧2 are small (low RFC, staticMethods, and stringLiteralsQty, shown in the bottom-left corner of Figure 11). This shows that exhaustive clustering works better on software projects classified as small based on the software features. One good example is shown in Figure 11(b), where only Exhaustive TurboMQIncrW is able to show "Good" results when 𝑧1 is between -0.4 and -0.6, and 𝑧2 is between -0.8 and -1.0.
(a) Footprints of Exhaustive BasicMQ. (b) Footprints of Exhaustive TurboMQIncrW.
(c) Footprints of GA BasicMQ. (d) Footprints of GA TurboMQIncrW.
Figure 11: Bunch algorithm footprint visualisation - Part 1.
On the other hand, when the size-related metrics staticMethods and stringLiteralsQty are high (top-right corner of Figures 13(b) and 13(c)), the footprints of HillClimbing BasicMQ (Figure 12(a)) and HillClimbing TurboMQIncrW (Figure 12(c)) show instances labelled as "Good". This suggests that HillClimbing algorithms perform well on large-sized projects. As such, our E-SC4R framework not only reaffirms that the hill-climbing approach is well suited for large projects; we further discover that the values of RFC, staticMethods, and stringLiteralsQty can be a good indicator for researchers to decide whether to choose exhaustive Bunch or hill-climbing Bunch when performing software remodularisation.
(a) Footprints of HillClimbing BasicMQ. (b) Footprints of HillClimbing TurboMQ.
(c) Footprints of HillClimbing TurboMQIncrW.
Figure 12: Bunch algorithm footprint visualisation - Part 2.
Another interesting finding from the footprint visualisations for the prioritised Bunch algorithms is that the same algorithm with a different calculator is able to cater to an entirely different software type. For example, GA BasicMQ performs well for most software systems, but is unable to cater to extremely large software systems. GA TurboMQIncrW, on the other hand, only performs well on extremely large software systems. Similarly, for HillClimbing, HillClimbing BasicMQ and HillClimbing TurboMQIncrW perform well for most software systems, but HillClimbing TurboMQ specifically caters for software with size metrics that are extremely high in the 𝑧1 range (𝑧1 > 0) and in the middle of the 𝑧2 range (-1 < 𝑧2 < 1).
(a) Distribution of RFC_mean. (b) Distribution of StaticMethods_mean.
(c) Distribution of StringLiteralsQty_mean.
Figure 13: Bunch software features footprint visualisation.
Summary of Bunch Footprints Visualisation: Bunch clustering algorithms are most impacted by staticMethods, stringLiteralsQty, and RFC. The exhaustive algorithm works better with small systems. The hill-climbing algorithm performs generally well across all systems. The calculator for the algorithm plays a huge role in determining which software type the configuration of the algorithm is best suited for.
4.4. Summary of Experiment Results
By analysing the strengths and weaknesses of existing software clustering techniques, we can identify areas in current software remodularisation and architecture recovery research that require improvement. Based on our experiment results and footprints, it is evident that there is a gap in clustering techniques that can address software projects with more distinct and substantial features that are not captured very well by the existing software clustering algorithms, possibly due to their working principles. For example, based on Figure 9(a), existing agglomerative clustering techniques are unable to cluster software projects based on Coupling Between Objects (CBO). We are not suggesting that CBO is a poor indicator for software remodularisation problems, but rather that existing software clustering algorithms do not benefit from CBO as a feature during the clustering process.
The ability to identify the most optimum algorithm and configuration from our existing pool of choices (such as the linkage algorithm, distance metric, etc.) based on the characteristics of different software can help software developers and maintainers reduce the time and effort needed to perform software remodularisation effectively, rather than resorting to intuition or trial and error, both of which have far lower accuracy and are either resource-intensive (when evaluating multiple approaches) or likely to settle for a sub-optimal cluster set. The experiment results also suggest that, on a larger scale (analysis of more projects and more distinct identification of domains or metrics to classify a pool of software), it would be possible, at the minimum, to eliminate approaches or parameters that are already known to be sub-optimal. The visualisation has the potential to show the most optimum approach that a software maintainer can adopt, thus improving the accuracy of clustering results as well as saving resources that would otherwise be wasted.
RQ1: The strengths and weaknesses of agglomerative hierarchical and Bunch clustering algorithms can be evaluated from the generated footprints. Drawing examples from the Bunch footprint, the strength of the algorithm can be seen in its ability to cluster software accurately through the RFC, staticMethods, and stringLiteralsQty metrics, where identifiable and significant clusters can be found in the feature footprints. The weakness of the algorithm can instead be seen through the clusters labelled as "None" in the prioritised algorithm footprint visualisation. This shows that the techniques are good at solving extreme cases, but unable to properly cluster software projects with medium-sized metrics, identifying the gap in the aforementioned software clustering techniques. As such, the answers to both research questions are discussed in these summaries.
RQ2: When using MoJoFM as the evaluation cri-
terion, agglomerative hierarchical clustering proves
to be the clear winner when compared to the different
variations and configurations of the Bunch clustering
algorithm. It is only when the two algorithms are eval-
uated separately using E-SC4R, and the conclusions
drawn from the footprints are applied, that we are able
to more objectively select the most suitable clustering
technique based on the software features of each test
subject. Based on our results, modifier_max, pub-
licMethods_mean, and anonymousClassQty_mean
are the three most prevalent software features that
affect the performance of the agglomerative hierarchical
clustering algorithm, while RFC_mean, staticMeth-
ods_mean, and stringLiteralsQty_mean play an essen-
tial role in deciding the type of Bunch algorithm
to aid in software remodularisation and architecture
recovery. As such, the identified software features
can serve as indicators to aid researchers in selecting
the most suitable clustering technique, depending on
the characteristics of the software remodularisation
and architecture recovery problem.
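Both answers rely on MoJoFM as the quality measure. As a reference for the reader, the measure is commonly defined in the literature as follows (a sketch of the standard formulation, not a restatement from this paper):

```latex
% MoJoFM between a produced decomposition A and the ground truth B,
% where mno(A, B) is the minimum number of Move and Join operations
% needed to transform A into B:
\[
\mathrm{MoJoFM}(A, B)
  = \left( 1 - \frac{\mathrm{mno}(A, B)}{\max_{\forall A'} \mathrm{mno}(A', B)} \right) \times 100\%
\]
% Higher values (up to 100%) indicate a decomposition closer to the ground truth.
```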
5. Threats to Validity
Based on the classification schema of Runeson et al. [50],
Construct Validity in our case refers to whether all relevant
parameters for hierarchical clustering and Bunch clustering
algorithm have been explored to visualise the footprint for
software remodularisation. To mitigate this risk, we consid-
ered a plethora of parameters such as number of projects,
number of revisions, different linkage algorithms, distance
metrics, and search-based fitness functions. In addition, we took
the past releases of the examined software into consideration
when generating the ground truth.
Internal Validity is related to the examination of causal
relations. Our results pinpoint the particular software fea-
tures that are associated with the effectiveness of the agglomerative
and Bunch clustering methods on the analysed projects; these
associations are not inferred from causal relationships.
With respect to External Validity, the risk is mitigated
by selecting a pool of projects that are well-known and popular
in the open-source community (projects selected based
on the number of stars on GitHub), forming a representative
sample for analysis. To provide more information
about the quality of the chosen projects, SonarQube [51] has
been used to analyse their quality in
terms of the number of bugs, code smells, and code duplication,
as presented in Table 17. This demonstrates the application
of E-SC4R to a variety of software projects in terms
of size and quality. However, a replication of this study on a
larger scale, comprising projects written in different languages,
would be valuable in verifying the current findings.
We have created a replication package on our GitHub page.
Ground truth generation plays a vital role in determining
AJJ Tan et al.: Preprint submitted to Elsevier Page 27 of 31
E-SC4R: Explaining Software Clustering for Remodularisation
Table 17
Quality metrics of the chosen projects extracted from SonarQube.

Project                       | Bugs | Bugs Rating | Code Smells (thousands) | Code Smells Rating | Duplications (%) | Duplications Rating | Lines (thousands) | Size Rating
bkromhout-realm-java          | 137  | E | 4.5  | A | 9.6  | C | 52  | M
btraceio-btrace               | 78   | E | 2.3  | A | 2.5  | A | 46  | M
bytedeco-javacpp              | 301  | E | 1.9  | A | 5.9  | C | 27  | M
codecentric-spring-boot-admin | 152  | E | 1.2  | A | 1.8  | A | 20  | M
codenvy-legacy-che-plugins    | 548  | E | 36   | A | 11.1 | D | 477 | L
coobird-thumbnailator         | 70   | E | 1.8  | A | 10.2 | D | 21  | M
dropwizard                    | 19   | E | 1.8  | A | 3.2  | B | 61  | M
dropwizard-metrics            | 27   | D | 0.9  | A | 23.4 | E | 36  | M
evant-gradle-retrolambda      | 1    | E | 0.02 | A | 26.8 | E | 2   | S
facebook-android-sdk          | 88   | E | 2.1  | A | 1.6  | A | 73  | M
facebook-java-business-sdk    | 26   | E | 25   | A | 36.8 | E | 381 | L
facebook-litho                | 257  | E | 38   | A | 30.9 | E | 568 | XL
facebook-react-native-fbsdk   | 0    | A | 0.03 | A | 0    | A | 1.5 | S
google-cdep                   | 49   | E | 1.4  | A | 4.6  | B | 16  | M
google-dagger                 | 46   | E | 3.4  | A | 4.1  | B | 104 | L
google-gitiles                | 16   | E | 0.5  | A | 2.9  | A | 16  | M
google-openrtb                | 6    | C | 0.04 | A | 0    | A | 2.3 | S
google-openrtb-doubleclick    | 3    | C | 0.08 | A | 0    | A | 2.8 | S
grpc-java                     | 253  | E | 7.8  | A | 7.7  | C | 203 | L
havarunner                    | 1    | C | 0.35 | A | 0.9  | A | 3.6 | S
immutables                    | 54   | D | 1.7  | A | 2.2  | A | 71  | M
ionic-team-capacitor          | 3    | E | 0.2  | A | 0.4  | A | 6.2 | S
jankotek-mapdb                | 151  | E | 2.3  | A | 6.7  | C | 20  | M
javafunk-funk                 | 29   | C | 3    | B | 1.3  | A | 26  | M
javaparser                    | 2200 | E | 12   | A | 23.6 | E | 187 | L
permissions-dispatcher        | 1    | C | 0.5  | A | 27   | E | 14  | M
pxb1988-dex2jar               | 34   | E | 2.5  | A | 2.7  | A | 36  | M
web3j                         | 90   | E | 2.2  | A | 5.7  | C | 48  | M
the optimum clustering algorithm and configuration from
the existing pool. Getting input from domain experts may
help reaffirm the validity of our ground truth. However, given
the scope of the project, in which we experimented on 30
open source projects, it is challenging to get developers
from the open source community to individually evaluate the
ground truth for each version of each project. Based
on the state of the art, there is no single well-acknowledged
method for creating the ground truth for software clustering.
One of the most popular approaches, however, is to leverage
the package structure of the analysed software. Hence,
in this research, we have adopted a similar approach to address
the problem. We would like to note that the proposed E-SC4R
framework can work with any kind of clustering algorithm
and ground truth, as long as the ground truth is standardised
for comparison across the existing pool of clustering algorithms.
Our results might also not generalise to software systems
that have only a few directories with a large number of files
in each. Upon examining our existing ground truths, we
found that the effect of having a few directories with a large
number of files is negligible. In fact, the majority of
the ground truths that we use for the experiments consist of
projects with a large number of directories (packages), each
containing a small number of files. We have uploaded
some of the ground truths that we used for the experiments
on the GitHub page.⁴
⁴ https://github.com/alvintanjianjia/SoftwareRemodularization/tree/master/sample_groundtruth

On the other hand, code quality, which includes coding
style, readability, level of cohesion, and other indicators, is a
factor that might impact the effectiveness of the presented
clustering algorithms. In this research, we have only evaluated
the proposed approach on open source systems. While
the code quality of open source and real-life industrial systems
is highly subjective and context dependent (the quality
of open source projects can be low compared to industrial
projects, and vice versa), we expect that applying the
proposed E-SC4R framework to projects with different code
quality (real-life industrial systems included) will yield different
results, mainly because the characteristics of the software
(i.e., the CK metrics of the analysed software) will affect the
footprint constructed using our framework.
The challenge of running the experiments on low code
quality projects is the construction of ground truth to vali-
date the clustering algorithm. In our proposed approach and
in existing studies, package structure is the most commonly
used method to create the artificial ground truth for software
clustering. We construct the ground truth by looking at the
package structure of the past 10 releases of the software and
finding the overlapping, most common directory structure to
be used as the ground truth. If the same approach were applied
to low code quality projects, the ground truth would be
heavily skewed and unreliable, which affects the profiling
of these systems. With an accurate ground truth generated
manually by an expert, the E-SC4R framework is applicable
to both open source and industrial projects for recovering
the documentation of poorly documented systems with high
technical debt.
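The ground-truth construction described above can be sketched as follows. This is a simplified illustration only: the function and variable names are ours, and the real pipeline operates on package structures extracted from the analysed releases rather than hand-written dictionaries.

```python
from collections import Counter

def common_ground_truth(releases, min_support=0.8):
    """Sketch of the artificial ground-truth construction.

    `releases` is a list of dicts mapping a source file to the package
    (directory) it lives in, one dict per analysed release. A file is kept
    only if it appears in at least `min_support` of the releases, and it is
    assigned to the package it most commonly belonged to across releases.
    """
    votes = {}
    for release in releases:
        for source_file, package in release.items():
            votes.setdefault(source_file, Counter())[package] += 1

    threshold = min_support * len(releases)
    return {
        source_file: counts.most_common(1)[0][0]
        for source_file, counts in votes.items()
        if sum(counts.values()) >= threshold
    }

# Example: three releases where Util.java briefly moves between packages.
releases = [
    {"A.java": "core", "Util.java": "core"},
    {"A.java": "core", "Util.java": "helpers"},
    {"A.java": "core", "Util.java": "core", "New.java": "ui"},
]
print(common_ground_truth(releases, min_support=1.0))
# {'A.java': 'core', 'Util.java': 'core'}
```

Recently added files (such as `New.java` above) are excluded unless `min_support` is lowered, which mirrors how unstable parts of a project contribute less to the ground truth.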
6. Related Works on Software Clustering for
Architecture Recovery
In general, software clustering consists of the following
four steps. First, common clustering features are chosen to
determine the similarity between entities (methods, classes,
or packages depending on the level of granularity). Second,
a similarity measure is chosen to determine the similarity
strength between two entities (method invocation, passing of
parameters, sharing of variables, etc.) [14]. Third, a cluster-
ing algorithm is chosen to group similar entities together. Fi-
nally, a form of validation is required to measure the quality
of the clustering results. The results of software clustering
can be viewed as a high-level abstraction of the software ar-
chitecture to aid in software comprehension. All four steps
mentioned above play a significant role in determining the
quality of clustering results because the selection of differ-
ent clustering features or metrics will produce substantially
different clustering results.
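As an illustration of the four steps, the following minimal sketch clusters classes by the similarity of their dependency sets and validates the result against a toy ground truth. The entities, data, and the naive single-linkage implementation are all hypothetical, not the paper's actual tooling.

```python
# Step 2: a similarity measure between two entities' dependency sets.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Step 3: a naive agglomerative clustering with single linkage, merging
# while any pair of clusters is at least `threshold` similar.
def single_linkage(entities, threshold):
    clusters = [{name} for name in entities]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(jaccard(entities[x], entities[y])
                          for x in clusters[i] for y in clusters[j])
                if sim >= best:
                    best, pair = sim, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i] |= clusters.pop(j)

# Step 1: clustering features -- here, the classes each class depends on.
deps = {
    "Parser": {"Lexer", "Ast"},
    "Lexer": {"Ast"},
    "Gui": {"Widget", "Theme"},
    "Widget": {"Theme"},
}
clusters = single_linkage(deps, threshold=0.3)

# Step 4: validation -- compare against an (artificial) ground truth.
ground_truth = [{"Parser", "Lexer"}, {"Gui", "Widget"}]
assert all(cluster in ground_truth for cluster in clusters)
```

Swapping the feature set in step 1 (e.g. shared variables instead of class dependencies) or the measure in step 2 changes the resulting decomposition, which is exactly why the four choices jointly determine clustering quality.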
Although some clustering algorithms produce a single
clustering result for any given dataset, a dataset may have
more than one natural and optimum clustering result. For
instance, source code can only reveal very limited informa-
tion about the architectural design of a software system since
it is a very low-level software artefact. On the other hand,
some implementation details might be lost if software packages
are used to represent clustering entities. Hence,
identifying the most appropriate clustering algorithm and
configuring its parameters is a non-trivial task in software
remodularisation.
The work by Deursen and Kuipers [52] adopted a greedy
search method by using mathematical analysis to analyse the
structure of cluster entities and identify the clustering fea-
tures that are shared by them. The proposed approach finds
all of the possible combinations of clusters and evaluates
the quality of each combination. Agglomerative hierarchi-
cal clustering is used in this work. The authors discovered
that it is hard to analyse all possible combinations, and useful
information might be missed if no attention is given to
analysing all the results generated from different dendrogram
cutting points.
In contrast to the greedy search method proposed by Deursen
and Kuipers, the work by Fokaefs et al. [53] proposed an ap-
proach that produces multiple clustering results from which
software developers and maintainers can choose the best re-
sult based on their experiences. The goal is to decompose
large classes by identifying ‘Extract Class’ refactoring opportunities,
where the candidates are classes that contain
many methods without clear functionality. The authors adopted
the agglomerative clustering algorithm to generate a den-
drogram and cut the dendrogram at several places to form
multiple sets of results. The authors argued that clustering
algorithms that produce a single result are too rigid and not
feasible to fit into the context of software development.
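The multiple-cutting-point idea can be sketched with SciPy's hierarchical-clustering utilities: build the dendrogram once, then cut it at several heights to obtain several candidate decompositions. The feature vectors below are toy data, not drawn from either paper's experiments.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy feature vectors for six classes (rows), e.g. structural metrics.
features = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],   # one natural group
    [5.0, 5.1], [5.1, 5.0], [5.2, 5.1],   # another natural group
])

dendrogram = linkage(features, method="average")  # built once

# Cut the same dendrogram at several heights; each cut yields one
# candidate decomposition the maintainer can inspect and choose from.
candidates = {
    height: fcluster(dendrogram, t=height, criterion="distance")
    for height in (0.1, 1.0, 10.0)
}
# Small heights give many fine-grained clusters; large heights give
# few coarse ones. Here 0.1 -> 6 clusters, 1.0 -> 2, 10.0 -> 1.
```

Presenting the family of cuts, rather than a single fixed decomposition, is what gives the maintainer room to apply their own judgement.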
Work by Anquetil and Lethbridge [54] attempted to perform
agglomerative clustering on source files and found
that using source code alone to aid in software remodularisation
yields poor results. In their study, clustering entities are
represented in the form of source code. The authors found
that the quantity of information, such as the number of variables
used in the source code, the dependencies between routines,
and the data passed and shared by functions, helps to
improve the reliability of clustering.
The work by Cui and Chae [55] attempted to analyse
the performance, strengths, and weaknesses of different ag-
glomerative hierarchical clustering algorithms using multi-
ple case studies and setups. The authors conducted a series
of experiments using 18 clustering strategies. The clustering
strategies are the combination of different similarity mea-
sures, linkage methods, and weighting schemes. The case
studies comprise 11 systems in which source code was used
as the input. The authors found that it is difficult
to identify a perfect clustering strategy that can fulfil all
the evaluation criteria proposed by the authors.
As discussed in the systematic literature review conducted
by Alsarhan et al. [29], the selection of clustering algo-
rithms remains a challenging problem in the area of soft-
ware clustering for remodularisation and architectural recov-
ery. While there have been attempts to propose guidelines
for selecting or rejecting a clustering algorithm for a given
software [56], there is a lack of comprehensive methods for
clustering algorithm selection.
Our work extends existing research in analysing the effectiveness
of software clustering techniques by examining
which software features impact the performance of agglomerative
and Bunch clustering algorithms. We characterise a
software system using software/code features (e.g., depth of
inheritance tree, cohesion, and coupling) and determine the
most significant features that have an impact on whether a
software clustering technique can generate a good cluster-
ing result.
7. Conclusion and Future Work
Acknowledging the lack of a universal approach for finding
the optimum clustering algorithm for any software remodularisation
problem, given the numerous algorithms that
exist in the literature and the various parameters that may
be used to configure software clustering algorithms, we
provide empirical evidence that helps identify
key characteristics of software/code features that influence
the effectiveness of hierarchical and Bunch-based software
clustering algorithms.
Given the relatively high cost of running a clustering al-
gorithm on large and complex software systems, the pro-
posed approach optimises the resources spent on software
remodularisation. The results in this paper, while promising,
are constrained in a number of ways. First, while they are among
the most popular software remodularisation approaches,
only agglomerative hierarchical clustering and Bunch clustering
methods were assessed in this paper, because it is difficult
to compare the clustering results produced by different
families of clustering algorithms in a fair and unbiased
manner. The relationships extracted from Depends
are aggregated in the proposed approach, which may lead to
the loss of some semantic information. One way to address
this problem is by running the experiments multiple times
using only one type of the extracted relationship at a time.
For example, if we have 14 types of relationships extracted
using Depends, we can run it 14 times for each version of the
project, and evaluate the effectiveness of different clustering
algorithms using each type of relationship (or combination
of multiple relationships).
The creation of the ground truth relies on the past 10 releases
of a particular software system, which strongly favours legacy
software with more active developers and contributors.
Less stable software with radical changes between versions
makes it difficult to construct a usable ground truth. For future
work, a separate method of defining and creating the
ground truth can be further explored. An approach similar
to the work by Naseem et al. [44], taking a deep dive
into one or two of our generated ground truths with a few senior
and experienced developers who have worked on the
respective projects, would be one of the future directions of
this research, to affirm the validity of our ground truth and
clustering results.
Nonetheless, the experiment results show that our find-
ings are a step forward in the area of software remodulari-
sation to reveal the strengths and weaknesses of different hi-
erarchical clustering and Bunch clustering algorithms, and
it is hoped that the work discussed in this paper can serve
as a framework for further analysis and improvements to be
made.
Acknowledgement
This work was carried out within the framework of the
research project FRGS/1/2018/ICT01/MUSM/03/1 under the
Fundamental Research Grant Scheme provided by the Min-
istry of Education, Malaysia.
References
[1] M. Hall, N. Walkinshaw, and P. McMinn, “Effectively incorporat-
ing expert knowledge in automated software remodularisation,” IEEE
Transactions on Software Engineering, vol. 44, no. 7, pp. 613–630,
2018.
[2] N. Anquetil and T. C. Lethbridge, “Comparative study of cluster-
ing algorithms and abstract representations for software remodular-
isation,” IEE Proceedings - Software, vol. 150, no. 3, pp. 185–201,
2003.
[3] K. Praditwong, M. Harman, and Y. Xin, “Software module clustering
as a multi-objective search problem,” IEEE Transactions on Software
Engineering, vol. 37, no. 2, pp. 264–282, 2011.
[4] J. Wu, A. E. Hassan, and R. C. Holt, “Comparison of clustering al-
gorithms in the context of software evolution,” in 21st IEEE Inter-
national Conference on Software Maintenance (ICSM’05). IEEE,
2005, pp. 525–535.
[5] M. Fokaefs, N. Tsantalis, A. Chatzigeorgiou, and J. Sander, “Decom-
posing object-oriented class modules using an agglomerative cluster-
ing technique,” IEEE International Conference on Software Mainte-
nance, pp. 93–101, 2009.
[6] R. Aull-Hyde, S. Erdogan, and J. M. Duke, “An experiment on the
consistency of aggregated comparison matrices in ahp,” European
Journal of Operational Research, vol. 171, no. 1, pp. 290–295, 2006.
[Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0377221704005971
[7] P. Andritsos and V. Tzerpos, “Information-theoretic software cluster-
ing,” IEEE Transactions on Software Engineering, vol. 31, no. 2, pp.
150–165, 2005.
[8] N. Teymourian, H. Izadkhah, and A. Isazadeh, “A fast clustering al-
gorithm for modularization of large-scale software systems,” IEEE
Transactions on Software Engineering, 2020.
[9] C. Y. Chong, S. P. Lee, and T. C. Ling, “Efficient software cluster-
ing technique using an adaptive and preventive dendrogram cutting
approach,” Information and Software Technology, vol. 55, no. 11, pp.
1994–2012, 2013.
[10] M. Aghdasifam, H. Izadkhah, and A. Isazadeh, “A new metaheuristic-
based hierarchical clustering algorithm for software modularization,”
Complexity, vol. 2020, 2020.
[11] B. S. Mitchell and S. Mancoridis, A heuristic search approach to solv-
ing the software clustering problem. Drexel University Philadelphia,
PA, USA, 2002.
[12] B. Mitchell and S. Mancoridis, “On the automatic modularization of
software systems using the bunch tool,” IEEE Transactions on Soft-
ware Engineering, vol. 32, no. 3, pp. 193–208, March 2006.
[13] A. Prajapati and Z. W. Geem, “Harmony search-based approach for
multi-objective software architecture reconstruction,” Mathematics,
vol. 8, no. 11, p. 1906, 2020.
[14] O. Maqbool and H. Babri, “Hierarchical clustering for software ar-
chitecture recovery,” IEEE Transactions on Software Engineering,
vol. 33, no. 11, 2007.
[15] M. Shtern and V. Tzerpos, “On the comparability of software clus-
tering algorithms,” in Program Comprehension (ICPC), 2010 IEEE
18th International Conference on. IEEE, 2010, pp. 64–67.
[16] C. Y. Chong and S. P. Lee, “Automatic clustering constraints deriva-
tion from object-oriented software using weighted complex network
with graph theory analysis,” Journal of Systems and Software, vol.
133, pp. 28–53, 2017.
[17] M. Shtern and V. Tzerpos, “Clustering methodologies for software en-
gineering,” Advances in Software Engineering, vol. 2012, p. 1, 2012.
[18] C. Oliveira, A. Aleti, Y.-F. Li, and M. Abdelrazek, “Footprints of
fitness functions in search-based software testing,” in Proceedings of
the Genetic and Evolutionary Computation Conference, ser. GECCO
’19. New York, NY, USA: ACM, 2019, pp. 1399–1407. [Online].
Available: http://doi.acm.org/10.1145/3321707.3321880
[19] C. Oliveira, A. Aleti, L. Grunske, and K. Smith-Miles, “Mapping
the effectiveness of automated test suite generation techniques,” IEEE
Transactions on Reliability, vol. 67, no. 3, pp. 771–785, 2018.
[20] M. Shtern and V. Tzerpos, “Factbase and decomposition generation,”
in 2011 15th European Conference on Software Maintenance and
Reengineering, March 2011, pp. 111–120.
[21] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner,
“Using automatic clustering to produce high-level system organiza-
tions of source code,” in Program Comprehension, 1998. IWPC’98.
Proceedings., 6th International Workshop on. IEEE, 1998, pp. 45–
52.
[22] M. Harman, S. Swift, and K. Mahdavi, “An empirical study of the ro-
bustness of two module clustering fitness functions,” in Proceedings
of the 7th annual conference on Genetic and evolutionary computa-
tion. ACM, 2005, pp. 1029–1036.
[23] F. Beck and S. Diehl, “On the impact of software evolution on soft-
ware clustering,” Empirical Software Engineering, vol. 18, no. 5, pp.
970–1004, 2013.
[24] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to infor-
mation retrieval. Cambridge University Press Cambridge, 2008,
vol. 39.
[25] I. S. Dhillon, S. Mallela, and R. Kumar, “Enhanced word clustering
for hierarchical text classification,” in Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data
mining. ACM, 2002, pp. 191–200.
[26] T. A. Wiggerts, “Using clustering algorithms in legacy systems re-
modularization,” in Reverse Engineering, 1997. Proceedings of the
Fourth Working Conference on. IEEE, 1997, pp. 33–43.
[27] F. Beck, J. Melcher, and D. Weiskopf, “Identifying modularization
patterns by visual comparison of multiple hierarchies,” in Program
Comprehension (ICPC), 2016 IEEE 24th International Conference
on. IEEE, 2016, pp. 1–10.
[28] R. Naseem, M. M. Deris, O. Maqbool, and S. Shahzad, “Euclidean
space based hierarchical clusterers combinations: an application to
software clustering,” Cluster Computing, vol. 22, no. 3, pp. 7287–
7311, 2019.
[29] Q. Alsarhan, B. S. Ahmed, M. Bures, and K. Z. Zamli, “Software
module clustering: An in-depth literature analysis,” IEEE Transac-
tions on Software Engineering, pp. 1–1, 2020.
[30] Z. Wen and V. Tzerpos, “An optimal algorithm for mojo distance,” in
Program Comprehension, 2003. 11th IEEE International Workshop
on. IEEE, 2003, pp. 227–235.
[31] M. Aniche, Java code metrics calculator (CK), 2015, available in
https://github.com/mauricioaniche/ck/.
[32] A. Aleti and M. Martinez, “E-apr: Mapping the effectiveness of au-
tomated program repair,” arXiv preprint arXiv:2002.03968, 2020.
[33] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles, “Instance
spaces for machine learning classification,” Machine Learning, vol.
107, no. 1, pp. 109–147, 2018.
[34] C. Oliveira, A. Aleti, L. Grunske, and K. Smith-Miles, “Mapping
the effectiveness of automated test suite generation techniques,” IEEE
Transactions on Reliability, vol. 67, no. 3, pp. 771–785, 2018.
[35] C. Oliveira, A. Aleti, Y.-F. Li, and M. Abdelrazek, “Footprints of fit-
ness functions in search-based software testing,” in Genetic and Evo-
lutionary Computation Conference, 2019, pp. 1399–1407.
[36] K. Smith-Miles, D. Baatar, B. Wreford, and R. Lewis, “Towards ob-
jective measures of algorithm performance across instance space,”
Computers & Operations Research, vol. 45, pp. 12–24, 2014.
[37] R. Gholami and N. Fakhari, “Support vector machine: principles,
parameters, and applications,” in Handbook of Neural Computation.
Elsevier, 2017, pp. 515–535.
[38] W. Jin, Y. Cai, R. Kazman, Q. Zheng, D. Cui, and T. Liu, “Enre: a
tool framework for extensible entity relation extraction,” in Proceed-
ings of the 41st International Conference on Software Engineering:
Companion Proceedings. IEEE Press, 2019, pp. 67–70.
[39] C. Patel, A. Hamou-Lhadj, and J. Rilling, “Software clustering using
dynamic analysis and static dependencies,” in 2009 13th European
Conference on Software Maintenance and Reengineering. IEEE,
2009, pp. 27–36.
[40] T. Lutellier, D. Chollak, J. Garcia, L. Tan, D. Rayside, N. Medvidovic,
and R. Kroeger, “Comparing software architecture recovery tech-
niques using accurate dependencies,” in 2015 IEEE/ACM 37th IEEE
International Conference on Software Engineering, vol. 2. IEEE,
2015, pp. 69–78.
[41] N. Tsantalis and A. Chatzigeorgiou, “Identification of move method
refactoring opportunities,” IEEE Transactions on Software Engineer-
ing, vol. 35, no. 3, p. 347–367, 2009.
[42] C. Y. Chong and S. P. Lee, “Analyzing maintainability and reliability
of object-oriented software using weighted complex network,” Jour-
nal of Systems and Software, vol. 110, pp. 28–53, 2015.
[43] V. Tzerpos and R. C. Holt, “Mojo: A distance metric for software
clusterings,” in Reverse Engineering, 1999. Proceedings. Sixth Work-
ing Conference on. IEEE, 1999, pp. 187–193.
[44] R. Naseem, M. B. Deris, O. Maqbool, J.-P. Li, S. Shahzad, and
H. Shah, “Improved binary similarity measures for software modu-
larization,” Frontiers of Information Technology & Electronic Engi-
neering, vol. 18, no. 8, p. 1082–1107, 2017.
[45] F. Beck and S. Diehl, “Evaluating the impact of software evolution
on software clustering,” 2010 17th Working Conference on Reverse
Engineering, 2010.
[46] S. R. Chidamber and C. F. Kemerer, “A metrics suite for object ori-
ented design,” IEEE Transactions on Software Engineering, vol. 20,
no. 6, pp. 476–493, 1994.
[47] G. Scanniello, C. Gravino, A. Marcus, and T. Menzies, “Class level
fault prediction using software clustering,” in 2013 28th IEEE/ACM
International Conference on Automated Software Engineering (ASE).
IEEE, 2013, pp. 640–645.
[48] G. Scanniello, A. D’Amico, C. D’Amico, and T. D’Amico, “Using
the kleinberg algorithm and vector space model for software system
clustering,” in Program Comprehension (ICPC), 2010 IEEE 18th In-
ternational Conference on. IEEE, 2010, pp. 180–189.
[49] V. Tzerpos and R. C. Holt, “On the stability of software clustering
algorithms,” in Program Comprehension, 2000. Proceedings. IWPC
2000. 8th International Workshop on. IEEE, 2000, pp. 211–218.
[50] P. Runeson, M. Host, A. Rainer, and B. Regnell, Case study research
in software engineering: Guidelines and examples. John Wiley &
Sons, 2012.
[51] G. A. Campbell and P. P. Papapetrou, SonarQube in action. Manning
Publications Co., 2013.
[52] A. van Deursen and T. Kuipers, “Identifying objects using cluster
and concept analysis,” in Proceedings of the 21st International
Conference on Software Engineering, ser. ICSE ’99. New York,
NY, USA: ACM, 1999, pp. 246–255. [Online]. Available: http:
//doi.acm.org/10.1145/302405.302629
[53] M. Fokaefs, N. Tsantalis, E. Stroulia, and A. Chatzigeorgiou,
“Identification and application of extract class refactorings in
object-oriented systems,” Journal of Systems and Software, vol. 85,
no. 10, pp. 2241–2260, Oct. 2012. [Online]. Available: http:
//dx.doi.org/10.1016/j.jss.2012.04.013
[54] N. Anquetil and T. C. Lethbridge, “Experiments with clustering as
a software remodularization method,” in Reverse Engineering, 1999.
Proceedings. Sixth Working Conference on. IEEE, 1999, pp. 235–
255.
[55] J. F. Cui and H. S. Chae, “Applying agglomerative hierarchical clus-
tering algorithms to component identification for legacy systems,”
Information and Software Technology, vol. 53, no. 6, pp. 601–614,
2011.
[56] M. Shtern and V. Tzerpos, “Methods for selecting and improving soft-
ware clustering algorithms,” in 2009 IEEE 17th International Confer-
ence on Program Comprehension. IEEE, 2009, pp. 248–252.
AJJ Tan et al.: Preprint submitted to Elsevier Page 31 of 31
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Article
Automated Program Repair (APR) is a fast growing area with numerous new techniques being developed to tackle one of the most challenging software engineering problems. APR techniques have shown promising results, giving us hope that one day it will be possible for software to repair itself. In this paper, we focus on the problem of objective performance evaluation of APR techniques. We introduce a new approach, Explaining Automated Program Repair (E-APR), which identifies features of buggy programs that explain why a particular instance is difficult for an APR technique. E-APR is used to examine the diversity and quality of the buggy programs used by most researchers, and analyse the strengths and weaknesses of existing APR techniques. E-APR visualises an instance space of buggy programs, with each buggy program represented as a point in the space. The instance space is constructed to reveal areas of hard and easy buggy programs, and enables the strengths and weaknesses of APR techniques to be identified.
Full-text available
Article
Software module clustering is an unsupervised learning method used to cluster software entities (e.g., classes, modules, or files) with similar features. The obtained clusters may be used to study, analyze, and understand the software entities' structure and behavior. Implementing software module clustering with optimal results is challenging. Accordingly, researchers have addressed many aspects of software module clustering in the past decade. Thus, it is essential to present the research evidence that has been published in this area. In this study, 143 research papers from well-known literature databases that examined software module clustering were reviewed to extract useful data. The obtained data were then used to answer several research questions regarding state-of-the-art clustering approaches, applications of clustering in software engineering, clustering processes, clustering algorithms, and evaluation methods. Several research gaps and challenges in software module clustering are discussed in this paper to provide a useful reference for researchers in this field.
Full-text available
Article
The success of any software system highly depends on the quality of architectural design. It has been observed that over time, the quality of software architectural design gets degraded. The software system with poor architecture design is difficult to understand and maintain. To improve the architecture of a software system, multiple design goals or objectives (often conflicting) need to be optimized simultaneously. To address such types of multi-objective optimization problems a variety of metaheuristic-oriented computational intelligence algorithms have been proposed. In existing approaches, harmony search (HS) algorithm has been demonstrated as an effective approach for numerous types of complex optimization problems. Despite the successful application of the HS algorithm on different non-software engineering optimization problems, it gained little attention in the direction of architecture reconstruction problem. In this study, we customize the original HS algorithm and propose a multi-objective harmony search algorithm for software architecture reconstruction (MoHS-SAR). To demonstrate the effectiveness of the MoHS-SAR, it has been tested on seven object-oriented software projects and compared with the existing related multi-objective evolutionary algorithms in terms of different software architecture quality metrics and metaheuristic performance criteria. The experimental results show that the MoHS-SAR performs better compared to the other related multi-objective evolutionary algorithms.
Full-text available
Article
Software refactoring is a software maintenance action to improve the software internal quality without changing its external behavior. During the maintenance process, structural refactoring is performed by remodularizing the source code. Software clustering is a modularization technique to remodularize artifacts of source code aiming to improve readability and reusability. Due to the NP hardness of the clustering problem, evolutionary approaches such as the genetic algorithm have been used to solve this problem. In the structural refactoring literature, there exists no search-based algorithm that employs a hierarchical approach for modularization. Utilizing global and local search strategies, in this paper, a new search-based top-down hierarchical clustering approach, named TDHC, is proposed that can be used to modularize the system. The output of the algorithm is a tree in which each node is an artifact composed of all artifacts in its subtrees and is a candidate to be a software module (i.e., cluster). This tree helps a software maintainer to have better vision on source code structure to decide appropriate composition points of artifacts aiming to create modules (i.e., files, packages, and components). Experimental results on seven folders of Mozilla Firefox with different functionalities and five other software systems show that the TDHC produces modularization closer to the human expert’s decomposition (i.e., directory structure) than the other existing algorithms. The proposed algorithm is expected to help a software maintainer for better remodularization of a source code. The source codes and dataset related to this paper can be accessed at https://github.com/SoftwareMaintenanceLab.
Article
A software system evolves over time to meet the needs of its users. Understanding a program is the most important step in implementing new requirements. Clustering techniques, which divide a program into small and meaningful parts, make it possible to understand the program. In general, clustering algorithms fall into two categories: hierarchical and non-hierarchical (such as search-based approaches). Since clustering problems generally tend to be NP-hard, search-based algorithms produce acceptable clusterings, but they have time and space constraints and are therefore inefficient for large-scale software systems. Most algorithms currently used in software clustering do not scale well when applied to large and very large applications. In this paper, we present a new and fast clustering algorithm, named FCA, that overcomes the space and time constraints of existing algorithms by performing operations on the dependency matrix and extracting other matrices based on a set of features. Experimental results on ten small-sized applications, ten folders with different functionalities from Mozilla Firefox, a large-sized application (ITK), and a very large-sized application (Chromium) demonstrate that the proposed algorithm achieves higher-quality modularization than hierarchical algorithms. It can also compete with search-based algorithms and a clustering algorithm based on subsystem patterns, while its running time is much shorter than that of both the hierarchical and non-hierarchical algorithms. The source code of the proposed algorithm can be accessed at https://github.com/SoftwareMaintenanceLab.
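Clustering directly from a dependency matrix, as this abstract describes, can be illustrated with a toy row-similarity grouping: each entity is represented by its row of the matrix, and rows that depend on similar things are grouped together. The single-pass scheme, threshold, and `cluster_by_rows` name below are illustrative assumptions, not the paper's FCA.

```python
def cluster_by_rows(matrix, names, threshold=0.5):
    """Toy dependency-matrix clustering (illustrative, not the paper's FCA).

    matrix: one 0/1 dependency row per entity; names: entity labels.
    Entities whose rows have Jaccard similarity above the threshold with
    any member of an existing cluster join that cluster (single-link style).
    """
    def jaccard(r1, r2):
        inter = sum(1 for x, y in zip(r1, r2) if x and y)
        union = sum(1 for x, y in zip(r1, r2) if x or y)
        return inter / union if union else 0.0

    clusters = []
    for name, row in zip(names, matrix):
        placed = False
        for cluster in clusters:
            if any(jaccard(row, matrix[names.index(m)]) > threshold for m in cluster):
                cluster.append(name)
                placed = True
                break
        if not placed:
            clusters.append([name])
    return clusters

# "a" and "b" depend on the same two targets; "c" depends on two others.
clusters = cluster_by_rows([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]],
                           ["a", "b", "c"])
```

One pass over the rows is what makes matrix-based schemes fast relative to repeated global search; the real FCA derives further matrices from the dependency matrix rather than thresholding a single similarity.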
Conference Paper
Testing is technically and economically crucial for ensuring software quality. One of the most challenging testing tasks is creating test suites that reveal potential defects in software. However, as the size and complexity of software systems increase, this task becomes more labour-intensive, and manual test data generation becomes infeasible. To address this issue, researchers have proposed different approaches to automate test data generation using search techniques, an area known as Search-Based Software Testing (SBST). SBST methods require a fitness function to guide the search toward promising areas of the solution space. Over the years, a plethora of fitness functions have been proposed: some use control information, others focus on goals. Deciding which fitness function to use is not easy, as it depends on the software system under test. This work investigates the impact of software features on the effectiveness of different fitness functions. We propose the Mapping the Effectiveness of Test Automation (META) framework, which analyses the footprint of different fitness functions and creates a decision tree that enables selection of the appropriate function based on software features.
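As an example of the control-oriented fitness functions this abstract mentions, the classic branch-distance heuristic from the SBST literature scores an input by how close it comes to making a target branch predicate true. The constant `k` and the operator subset below are simplifications of the standard formulation, not META itself.

```python
def branch_distance(lhs, rhs, op):
    """Classic SBST branch-distance heuristic for relational predicates.

    Returns 0 when the target branch is taken; otherwise a value that
    shrinks as the inputs get closer to satisfying the predicate, giving
    the search a gradient to follow.
    """
    k = 1  # small penalty constant added when the predicate is strictly false
    if op == "==":
        return abs(lhs - rhs)
    if op == "<":
        return 0 if lhs < rhs else lhs - rhs + k
    if op == "<=":
        return 0 if lhs <= rhs else lhs - rhs + k
    raise ValueError(f"unsupported operator: {op}")

# A search minimising this distance is nudged toward inputs that make
# the target branch `x < 10` true: x = 15 scores 6, x = 12 scores 3, ...
assert branch_distance(15, 10, "<") == 6
assert branch_distance(5, 10, "<") == 0
```

A framework like META would characterise how well searches guided by such functions perform across systems with different features, then recommend one per system.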
Article
Remodularising the components of a software system is challenging: sound design principles (e.g., coupling and cohesion) need to be balanced against developer intuition about which entities conceptually belong together. Despite this, automated approaches to remodularisation tend to ignore domain knowledge, leading to results that can be nonsensical to developers. Nevertheless, supplying such knowledge manually is a potentially burdensome task: a great deal of information may need to be specified, particularly for large systems. Addressing these concerns, we propose the SUMO (SUpervised reMOdularisation) approach. SUMO is a technique that leverages a small subset of domain knowledge about a system to produce a remodularisation that will be acceptable to a developer. With SUMO, developers refine a modularisation by iteratively supplying corrections. These corrections constrain the eventual remodularisation, enabling SUMO to dramatically reduce the solution space, which in turn reduces the amount of feedback the developer needs to supply. We perform a comprehensive systematic evaluation using 100 real-world subject systems. Our results show that SUMO guarantees convergence on a target remodularisation with a tractable amount of user interaction.
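The correction-driven refinement loop this abstract describes can be sketched as checking a candidate modularisation against developer-supplied pairwise constraints. The must-link/cannot-link encoding, the `apply_corrections` name, and the toy entities below are our own simplification of SUMO's corrections, not its actual implementation.

```python
def apply_corrections(modularisation, must_link, cannot_link):
    """Sketch of constraint checking in a SUMO-style loop (our simplification).

    modularisation: dict mapping each entity to its module name.
    must_link / cannot_link: (a, b) entity pairs supplied as developer
    corrections. Returns the pairs the current modularisation still
    violates; a remodularisation step would then repair them, and the
    loop repeats until no violations remain.
    """
    violations = []
    for a, b in must_link:
        if modularisation[a] != modularisation[b]:
            violations.append(("must-link", a, b))
    for a, b in cannot_link:
        if modularisation[a] == modularisation[b]:
            violations.append(("cannot-link", a, b))
    return violations

# Hypothetical system: the developer says Parser and Lexer belong
# together, while Parser and Gui must be kept apart.
modularisation = {"Parser": "core", "Lexer": "util", "Gui": "core"}
violations = apply_corrections(modularisation,
                               must_link=[("Parser", "Lexer")],
                               cannot_link=[("Parser", "Gui")])
```

Each satisfied constraint permanently rules out a region of the solution space, which is why a small number of corrections can steer convergence.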
Article
Hierarchical clustering groups similar entities on the basis of some similarity (or distance) measure and produces a tree-like structure called a dendrogram. Dendrograms represent clusters in a nested manner, where at each step an entity either forms a new cluster or merges into an existing one. Hierarchical clustering has many applications, so researchers have made efforts to devise improved hierarchical clustering approaches. One approach that has received attention is based on combining clustering results, since different hierarchical clustering algorithms produce different dendrograms and their combination has produced more promising results than individual hierarchical clusterings. This paper proposes a hierarchical clustering combination (HCC) approach that uses the different types of structural features present in a dendrogram. First, the dendrograms are represented in a (4+N)-dimensional Euclidean space (4+NDES), where 4 is the number of extracted features and N allows for additional features, resulting in vector matrices. 4+NDES is a structural representation of the dendrogram that contains not only the relative but also the absolute features of the entities in the dendrogram. The vector matrices are then aggregated, and the distance between each pair of vectors is calculated using the Euclidean distance measure. The final hierarchy is obtained using a recovery tool such as individual hierarchical clustering. 4+NDES-HCC utilizes the structural contents of the dendrogram and has the flexibility to handle an increasing number of features. The proposed approach is tested on software clustering, which plays an important role in the maintenance of software systems. Experimental results and a comparative analysis with existing approaches reveal the effectiveness of HCC for software clustering.
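The aggregation step this abstract describes, averaging per-entity feature vectors across dendrograms and then computing pairwise Euclidean distances, can be sketched as follows. The feature values below are hypothetical placeholders; in 4+NDES-HCC the vectors would be extracted from real dendrogram structure.

```python
import math

def combine_and_distance(vector_sets):
    """Sketch of an HCC-style aggregation step (hypothetical feature values).

    vector_sets: list of dicts, one per input dendrogram, each mapping an
    entity to its structural feature vector in the (4+N)-dimensional space.
    Vectors are averaged entity-wise, then pairwise Euclidean distances are
    computed; a final hierarchical clustering would run on these distances.
    """
    entities = sorted(vector_sets[0])
    combined = {
        e: [sum(vs[e][i] for vs in vector_sets) / len(vector_sets)
            for i in range(len(vector_sets[0][e]))]
        for e in entities
    }
    distances = {}
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            distances[(a, b)] = math.dist(combined[a], combined[b])
    return combined, distances

# Two dendrograms' (hypothetical) 4-dimensional vectors for entities a and b.
vector_sets = [{"a": [1, 1, 0, 0], "b": [0, 0, 1, 1]},
               {"a": [3, 1, 0, 0], "b": [0, 0, 1, 1]}]
combined, distances = combine_and_distance(vector_sets)
```

Averaging lets each input dendrogram contribute equally; running a standard agglomerative algorithm on `distances` would recover the combined hierarchy.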
Article
Various binary similarity measures have been employed in clustering approaches to form homogeneous groups of similar entities in the data. These similarity measures are mostly based only on the presence or absence of features. Binary similarity measures have also been explored with different clustering approaches (e.g., agglomerative hierarchical clustering) for software modularization, to make software systems understandable and manageable. Each similarity measure has its own strengths and weaknesses, which respectively improve or deteriorate the clustering results. We highlight the strengths of some well-known existing binary similarity measures for software modularization. Furthermore, based on these existing measures, we introduce several improved new binary similarity measures. Proofs of correctness with illustrations and a series of experiments are presented to evaluate the effectiveness of our new binary similarity measures.
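Three standard binary similarity measures of the kind discussed above, Jaccard, Dice, and simple matching, differ mainly in how they treat joint absences. A minimal sketch over 0/1 feature vectors:

```python
def binary_similarities(x, y):
    """Three standard binary similarity measures over 0/1 feature vectors.

    a = features present in both, b/c = present in only one,
    d = absent from both. Jaccard and Dice ignore joint absences (d);
    simple matching counts them as agreement.
    """
    a = sum(1 for i, j in zip(x, y) if i and j)
    b = sum(1 for i, j in zip(x, y) if i and not j)
    c = sum(1 for i, j in zip(x, y) if not i and j)
    d = sum(1 for i, j in zip(x, y) if not i and not j)
    jaccard = a / (a + b + c) if a + b + c else 0.0
    dice = 2 * a / (2 * a + b + c) if a + b + c else 0.0
    simple_matching = (a + d) / (a + b + c + d)
    return {"jaccard": jaccard, "dice": dice, "simple_matching": simple_matching}

# Two entities sharing 2 features, each with 1 unique feature,
# and 1 feature absent from both.
sims = binary_similarities([1, 1, 0, 1, 0], [1, 1, 1, 0, 0])
```

Whether joint absences should count as agreement is exactly the kind of strength/weakness trade-off the abstract refers to: in sparse software feature data, simple matching can inflate similarity between unrelated entities.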