Figures
Table 1 - uploaded by Chun Yong Chong
Summary of related works on constrained clustering

Source publication
Article
Full-text available
Constrained clustering, or semi-supervised clustering, has received a lot of attention due to its flexibility in incorporating minimal supervision from domain experts or side information to help improve the results of classic unsupervised clustering techniques. In the domain of software remodularisation, classic unsupervised software clustering...

Contexts in source publication

Context 1
... that, in hierarchical clustering, one does not need to specify the number of clusters because the algorithm is capable of finding the natural number of partitions in the dataset. Table 1 provides a summary of all the discussed papers. In summary, most existing works use k-means to perform constrained clustering, partly because the fulfilment of ML and CL constraints is easier to achieve by manipulating the cluster assignment, i.e. the initial seeding of clustering entities involved in ML or CL constraints. ...
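The seeding/assignment-based handling of ML and CL constraints described above can be sketched in the style of COP-k-means. This is a minimal illustration on hypothetical 1-D data; the function names, points, and constraint pairs are invented for the example, not taken from the surveyed papers.

```python
# Sketch of a COP-k-means-style constrained assignment step
# (hypothetical 1-D toy data; ML = must-link, CL = cannot-link pairs).

def violates(item, cluster, assigned, ml, cl):
    """True if putting `item` into `cluster` breaks any ML/CL constraint."""
    for a, b in ml:
        other = b if a == item else a if b == item else None
        if other is not None and other in assigned and assigned[other] != cluster:
            return True
    for a, b in cl:
        other = b if a == item else a if b == item else None
        if other is not None and assigned.get(other) == cluster:
            return True
    return False

def assign(points, centroids, ml, cl):
    """Greedy constrained assignment: the nearest feasible centroid wins."""
    assigned = {}
    for item, x in points.items():
        ranked = sorted(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
        for c in ranked:
            if not violates(item, c, assigned, ml, cl):
                assigned[item] = c
                break
        # if no feasible cluster exists, the item is left unassigned
    return assigned

points = {"A": 1.0, "B": 1.2, "C": 5.0, "D": 5.1}
ml = [("A", "B")]   # A and B must share a cluster
cl = [("A", "C")]   # A and C must be separated
labels = assign(points, [1.0, 5.0], ml, cl)
```

Items are greedily assigned to the nearest centroid that does not break a constraint; a full algorithm would also recompute centroids and iterate until convergence.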
Context 2
... shown in Table 1, existing studies in constrained clustering mainly focus on the domains of data mining and machine learning, clustering or classifying text documents and images and performing biological classifications. While there are several studies that apply classic unsupervised software clustering techniques to aid in the remodularisation of poorly designed or poorly documented software systems (Chong et al., 2013; Cui & Chae, 2011; Fokaefs, Tsantalis, Chatzigeorgiou, & Sander, 2009; Maqbool & Babri, 2007), there is a lack of work that integrates domain knowledge or side information for the same purpose. ...
Context 3
... most of the selected software systems fall into the range of A-rated and B-rated SQALE rating, it is assumed that the selected software can reveal some of the properties and characteristics of good OO software.
EmpireDB 470 41775 307 days B
19 Apache Archiva 506 75638 535 days C
20 Apache Roller 528 55395 532 days B
21 Titan 532 35415 350 days B
22 Jajuk 543 57029 58 days A
23 Apache Mina 583 36978 723 days C
24 Apache Abdera 682 50568 783 days C
25 Apache Log4j 704 32987 209 days B
26 Apache ...
Context 4
... to size and page constraints, all the clustering constraints derived from the 40 test subjects are presented in Table A1 in Appendix. Some examples of Table A1 are illustrated in Table 6, which shows the clustering constraints derived from Apache Gora, openFAST, and Apache Tika. ...
Context 5
... to size and page constraints, all the clustering constraints derived from the 40 test subjects are presented in Table A1 in Appendix. Some examples of Table A1 are illustrated in Table 6, which shows the clustering constraints derived from Apache Gora, openFAST, and Apache Tika. ...
Context 6
... second column in Table 6 and Table A1 shows the hubs found in each test subject, while the third column shows the neighbouring classes that form a complete clique with each corresponding hub in the second column. Note that cannot-link constraints are established for each pair of hubs in order to promote the notion of separation of concerns. ...
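The constraint-derivation rule described in this context can be sketched as follows. The hub names and clique neighbours are hypothetical placeholders for the second and third columns of Table 6/Table A1; the cannot-link pairs between hubs follow the stated rule, while the hub-to-neighbour must-link pairs are an additional assumption for illustration.

```python
from itertools import combinations

# Hypothetical hubs (second column) and their clique neighbours (third column).
hubs = ["HubA", "HubB", "HubC"]
clique_neighbours = {"HubA": ["N1", "N2"], "HubB": ["N3"], "HubC": ["N4"]}

# Cannot-link constraints for each pair of hubs, promoting separation of concerns.
cannot_link = list(combinations(hubs, 2))

# Assumed for illustration: each hub stays together with its clique neighbours.
must_link = [(hub, n) for hub, ns in clique_neighbours.items() for n in ns]
```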
Context 7
... et al. reported that several classes in JFreeChart became denser with each incremental update. Based on the experimental findings in Table A1, classes that behave like god classes are XYItemRenderer.java, Plot.java, ...
Context 8
... results in Table 6 and Table A1 show that graph theory analysis is able to automatically derive clustering constraints from the implicit structure of software systems. The proposed method succeeded in deriving a number of clustering constraints without the need for user feedback, helping to facilitate the subsequent constrained clustering process. ...

Similar publications

Article
Full-text available
Graph partitioning is an important method for accelerating large distributed graph computation. Streaming graph partitioning is more efficient than offline partitioning and has seen continuous development in recent years. In this work, we first introduce a heuristic greedy streaming partitioning method a...
Article
Full-text available
Overlapping clustering is a fundamental and widely studied subject that identifies all densely connected groups of vertices and separates them from other vertices in complex networks. However, most conventional algorithms extract modules directly from the whole large-scale graph using various heuristics, resulting in either high time consumption or...
Chapter
Full-text available
Psychotherapy, unanimously described as a particularly organized and systematic relationship between a patient and a therapist, is a truly complex system. The interaction among the numerous variables belonging to the patient, the therapist, and the context in which the therapeutic couple is embedded presents auto-poietic characteristics and generate...
Preprint
Full-text available
We provide an up-to-date view of the structure of the energy landscape of the low autocorrelation binary sequences problem, a typical representative of the $NP$-hard class. To study the landscape features of interest we use the local optima network methodology through exhaustive extraction of the optima graphs for problem sizes up to $24$. Several...
Article
Full-text available
We seek to quantify the extent of similarity among nodes in a complex network with respect to two or more node-level metrics (like centrality metrics). In this pursuit, we propose the following unit disk graph-based approach: we first normalize the values for the node-level metrics (using the sum of the squares approach) and construct a unit disk g...
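The unit disk graph construction sketched in this abstract (normalise each node-level metric by the sum-of-squares, then connect nearby nodes) can be illustrated as follows. The metric values and the radius r = 0.1 are assumptions, since the abstract does not give concrete numbers.

```python
import math

# Toy node-level metrics (hypothetical values), e.g. (degree, betweenness).
metrics = {"n1": (4, 0.10), "n2": (4, 0.12), "n3": (1, 0.90)}

# Normalise each metric dimension by the square root of its sum of squares.
dims = list(zip(*metrics.values()))
norms = [math.sqrt(sum(v * v for v in d)) for d in dims]
unit = {n: tuple(v / s for v, s in zip(vec, norms)) for n, vec in metrics.items()}

# Connect two nodes when their normalised points lie within radius r (assumed).
r = 0.1
edges = [(a, b) for a in unit for b in unit
         if a < b and math.dist(unit[a], unit[b]) <= r]
```

With these toy values, only n1 and n2, whose normalised metric vectors nearly coincide, end up connected, i.e. they are deemed similar with respect to both metrics.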

Citations

... Sarhan et al. [5] systematically reported the state-of-the-art empirical contributions in software module clustering. Chong and Lee [6] also presented a method to integrate the concept of graph theory analysis to automatically derive clustering constraints from the implicit structure of software systems. ...
... Basically, modularisation is a design principle whereby a complex system is composed of smaller subsystems that are able to work independently [6,7]. Furthermore, the module view is the most common way to understand software system architecture [8]. ...
Full-text available
Article
In software engineering, a software development process, also known as the software development life cycle (SDLC), involves several distinct activities for developing, testing, maintaining, and evolving a software system. Within the stages of the SDLC, software maintenance accounts for most of the total cost of the software's life. However, after extended maintenance activities, software quality always degrades due to increasing size and complexity. To solve this problem, software modularisation using clustering is an intuitive way to modularise and classify code into small pieces. A multi-pattern clustering (MPC) algorithm for software modularisation is proposed in this study. The proposed MPC algorithm can be divided into five steps: (1) preprocessing, (2) file labelling, (3) collection of chain dependencies, (4) hierarchical agglomerative clustering, and (5) modification of the clustering result. The performance of the proposed MPC algorithm is compared with selected clustering techniques using three open-source and one closed-source software programs. Experimental results show that the modularisation quality of the proposed MPC algorithm is nearly 1.6 times better than that of the expert decomposition. Additionally, compared to other software clustering algorithms, the proposed MPC algorithm has, on average, a 13% enhancement in producing results similar to human thinking. Consequently, the proposed MPC algorithm is suitable for human comprehension while producing better module quality than other clustering algorithms.
... Subsequently, the clustering results produced by the class-level clustering algorithm will be completely different from a method-level clustering algorithm, although both results might be equally feasible. Furthermore, comparing software clustering algorithms within the same level of granularity is also not straightforward, due to different fitness functions and cluster validity metrics employed by different algorithms (Chong and Lee, 2017;Chong et al., 2013). Even if we were to compare the effectiveness of the clustering algorithms from the same family (i.e., agglomerative hierarchical clustering), there are still different ways to configure them (i.e. ...
... Most studies that introduce new clustering algorithms evaluate their approach only on a specific set of problem instances (Maqbool and Babri, 2007; Chong and Lee, 2017; Shtern and Tzerpos, 2012). Different from existing studies, this work aims to provide a better understanding of which software/code features (i.e., lines of code, number of methods, coupling between objects, depth of inheritance) are related to the performance of clustering algorithms, and whether the software/code features can be used to select the most suitable clustering algorithm. ...
... To evaluate the performance of each hierarchical clustering algorithm against the reference model, we use the MoJoFM metric proposed in the works of Tzerpos and Holt (1999) and Wen and Tzerpos (2003). The MoJo family of metrics has been widely used in the domain of software clustering to evaluate the performance of different clustering algorithms (Maqbool and Babri, 2007; Chong and Lee, 2017; Beck et al., 2016; Naseem et al., 2019). Hence, in the remainder of this paper, the term performance of a clustering algorithm refers to the MoJoFM value computed when comparing the produced clustering results against the ground truth. ...
Article
Maintenance of existing software requires a large amount of time for comprehending the source code. The architecture of a software system, however, may not be clear to maintainers if up-to-date documentation is not available. Software clustering is often used as a remodularisation and architecture recovery technique to help recover a semantic representation of the software design. Due to the diverse domains, structure, and behaviour of software systems, the suitability of different clustering algorithms for different software systems has not been investigated thoroughly. Research that introduces new clustering techniques usually validates the approach on a specific domain, which might limit its generalisability. If the chosen test subjects can only represent a narrow perspective of the whole picture, researchers risk not being able to address the external validity of their findings. This work aims to fill this gap by introducing a new approach, Explaining Software Clustering for Remodularisation (E-SC4R), to evaluate the effectiveness of different software clustering approaches. This work focuses on hierarchical clustering and Bunch clustering algorithms and provides information about their suitability according to the features of the software, which in turn enables the selection of the most suitable algorithm and configuration, from our existing pool of choices, that can achieve the best MoJoFM value for a particular software system. The E-SC4R framework is tested on 30 open-source software systems with varying sizes and domains, and demonstrates that it can characterise both the strengths and weaknesses of the analysed software clustering algorithms using software features extracted from the code. The proposed approach also provides a better understanding of the algorithms' behaviour by showing a 2D representation of the effectiveness of clustering techniques on the feature space generated through the application of dimensionality reduction techniques.
... Subsequently, the clustering results produced by the class-level clustering algorithm will be completely different from a method-level clustering algorithm, although both results might be equally feasible. Furthermore, comparing software clustering algorithms within the same level of granularity is also not straightforward, due to different fitness functions and cluster validity metrics employed by different algorithms [9,16]. Even if we were to compare the effectiveness of the clustering algorithms from the same family (i.e., agglomerative hierarchical clustering), there are still different ways to configure them (i.e. ...
... Most studies that introduce new clustering algorithms evaluate their approach only on a specific set of problem instances [14,16,17]. Different from existing studies, this work aims to provide a better understanding of which software/code features (i.e., lines of code, number of methods, etc.) are related to the performance of clustering algorithms, and whether the software/code features can be used to select the most suitable clustering algorithm. ...
... Cutting the dendrogram tree at a higher distance value always yields a smaller number of clusters. However, this decision involves a trade-off with respect to relaxing the constraint of cohesion in the cluster memberships [9,16,42]. As such, in this work, we attempt to determine the optimal total number of clusters by dividing the total number of classes by the following divisors: 5, 7, 10, 20, and 25. ...
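The divisor-based choice of cluster counts quoted above can be sketched as follows. `candidate_cluster_counts` mirrors the stated divisors; `cut_height`, which converts a desired cluster count into a dendrogram cut threshold, is an illustrative helper not described in the cited work.

```python
def candidate_cluster_counts(n_classes, divisors=(5, 7, 10, 20, 25)):
    """Candidate numbers of clusters obtained by dividing the number of
    classes by each divisor (at least one cluster per candidate)."""
    return [max(1, n_classes // d) for d in divisors]

def cut_height(merge_distances, k):
    """Return a cut threshold that yields k clusters from a dendrogram over
    n = len(merge_distances) + 1 items (distances of the n - 1 merges)."""
    d = sorted(merge_distances)
    n = len(d) + 1
    if k >= n:
        return 0.0                      # every item stays in its own cluster
    lo = d[n - k - 1]                   # last merge allowed to happen
    hi = d[n - k] if n - k < len(d) else lo + 1.0
    return (lo + hi) / 2                # threshold between allowed/forbidden merges

counts = candidate_cluster_counts(500)            # e.g. a 500-class system
threshold = cut_height([1.0, 2.0, 3.0, 4.0], 2)   # toy merge distances
```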
Full-text available
Preprint
Maintenance of existing software requires a large amount of time for comprehending the source code. The architecture of a software system, however, may not be clear to maintainers if up-to-date documentation is not available. Software clustering is often used as a remodularisation and architecture recovery technique to help recover a semantic representation of the software design. Due to the diverse domains, structure, and behaviour of software systems, the suitability of different clustering algorithms for different software systems has not been investigated thoroughly. Research that introduces new clustering techniques usually validates the approach on a specific domain, which might limit its generalisability. If the chosen test subjects can only represent a narrow perspective of the whole picture, researchers risk not being able to address the external validity of their findings. This work aims to fill this gap by introducing a new approach, Explaining Software Clustering for Remodularisation, to evaluate the effectiveness of different software clustering approaches. This work focuses on hierarchical clustering and Bunch clustering algorithms and provides information about their suitability according to the features of the software, which in turn enables the selection of the most suitable algorithm and configuration from our existing pool of choices for a particular software system. The proposed framework is tested on 30 open-source software systems with varying sizes and domains, and demonstrates that it can characterise both the strengths and weaknesses of the analysed software clustering algorithms using software features extracted from the code. The proposed approach also provides a better understanding of the algorithms' behaviour through the application of dimensionality reduction techniques.
... In other words, a node with special properties, like high in-degree centrality, can potentially be mapped to a class attribute such as the level of reusability of a class. This attribute can in turn be mapped to a bad smell such as shotgun surgery or the lazy class (Chong and Lee 2017; Jenkins and Kirk 2007). The main difference between the proposed approach and similar existing approaches is that the refactoring process executes immediately after bad-smell identification. ...
... This cohesion metric can represent badly written code and can be used to find bad smells. Chong and Lee used weighted complex networks and graph theory concepts to automatically derive clustering constraints (Chong and Lee 2017). These methods can also be used for code refactoring. ...
... Clustering approaches (Gu et al. 2017; Chong and Lee 2017): effective in finding dependencies and improving cohesion and coupling ...
Full-text available
Article
The creation of high-quality software is of great importance in the current state of enterprise systems. High-quality software should exhibit certain features, including flexibility, maintainability, and a well-designed structure. Correctly adhering to object-oriented principles is a primary approach to making code more flexible. Developers usually try to leverage these principles, but often neglect them due to lack of time and the extra costs involved. As a result, they sometimes create confusing, complex, and problematic structures in code known as code smells. Code smells have specific and well-known anti-patterns that can be corrected after their identification with the help of refactoring techniques. This process can be performed either manually by the developers or automatically. In this paper, an automated method for identifying and refactoring a series of code smells in Java programs is introduced. The primary mechanism for performing such automated refactoring is to leverage a fuzzy genetic method. Besides, a graph model is used as the core representation scheme, along with corresponding measures such as betweenness, load, in-degree, out-degree, and closeness centrality, to identify the code smells in the programs. The applied fuzzy approach is then combined with the genetic algorithm to refactor the code using the graph-related features. The proposed method is evaluated using Freemind, Jag, JGraph, and JUnit as sample projects, and the results are compared against the Fontana dataset, which contains results from IPlasma, FluidTool, Anti-patternScanner, PMD, and Marinescu. It is shown that the proposed approach can identify on average 68.92% of the bad classes similarly to the Fontana dataset and also refactor 77% of the classes correctly with respect to the coupling measures.
This is a noteworthy result among the currently existing refactoring mechanisms and also among the studies that consider both the identification and the refactoring of bad smells.
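The degree-based part of the graph measures listed in this abstract can be sketched with a plain edge list. The class names and dependencies below are hypothetical; in practice, betweenness, load, and closeness centrality would typically come from a graph library such as networkx rather than hand-rolled code.

```python
# Feature extraction sketch: in-/out-degree of each class in a dependency graph.
# (Betweenness, load, and closeness centrality are omitted here; only the
# degree-based measures are computed.)
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]  # hypothetical deps

in_deg, out_deg = {}, {}
for src, dst in edges:
    out_deg[src] = out_deg.get(src, 0) + 1
    in_deg[dst] = in_deg.get(dst, 0) + 1

# A class with unusually high in-degree (widely depended upon) is a candidate
# hub; combined with size metrics, this can flag god-class-like smells.
nodes = sorted(set(in_deg) | set(out_deg))
features = {n: (in_deg.get(n, 0), out_deg.get(n, 0)) for n in nodes}
```

In this toy graph, class C has the highest in-degree, so a smell detector would examine it first.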
... The maintainability and reliability of these systems are then carefully evaluated based on the networks. Later, they [64] provided an approach that can help practitioners automatically derive clustering constraints from the implicit structure of software systems based on graph theory. In [65], C. R. Myers discussed the relationships between several network topological measurements and software engineering practices. ...
Full-text available
Article
In the current decade, software systems have been more intensively employed in every aspect of our lives. However, it is disappointing that the quality of software is far from satisfactory. More importantly, the complexity and size of today’s software systems are increasing dramatically, which means that the number of required modifications also increases exponentially. Therefore, it is necessary to understand how function-level modifications impact the distribution of software bugs. In addition, other factors such as function’s structural characteristics as well as attributes of functions themselves may also serve as informative indicators for software fault prediction. In this paper, we perform statistical methods and logistic regression to analyze the possible factors which are related to the distribution of software bugs. We demonstrate our study from the following five perspectives: 1) the distribution of bugs in time and space; 2) the distribution of function-level modifications in time and space; 3) the relationship between function-level modifications and functions’ fault-proneness; 4) the relationship between functional attributes and functions’ fault-proneness; and 5) the relationship between software structural characteristics and functions’ fault-proneness.
... However, the ways to represent software-based complex networks are generally not standardized across studies, because different studies might address specific issues at different levels of granularity, i.e. package level [5], class level [6,7], or code level [8]. While most of the existing studies focus on utilizing source code as the main source of information to form a software-based complex network, there is a lack of studies that attempt to harness the data and metadata available on source code management systems (SCMS). ...
Full-text available
Chapter
Various studies have successfully utilized graph theory analysis as a way to gain a high-level abstraction view of software systems, such as constructing the call graph to visualize the dependencies among software components. The level of granularity and information shown by the graph usually depends on the input, such as variable, method, class, package, or a combination of multiple levels. However, very few studies have investigated how software evolution and change history can be used as a basis to model a software-based complex network. It is a common understanding that stable and well-designed source code receives fewer updates throughout a software development lifecycle; it is badly designed code that tends to be updated due to broken dependencies, high coupling, or dependencies on other classes. This paper puts forward an approach to model a commit change-based weighted complex network from historical software change and evolution data captured from GitHub repositories, with the aim of identifying potentially fault-prone classes. Four well-established graph centrality metrics were used as proxy metrics to discover fault-prone classes. Experiments on ten open-source projects discovered that when all centrality metrics are used together, they can yield reasonably good precision when compared against the ground truth.
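A minimal sketch of the commit change-based weighted network the chapter describes, using hypothetical commit data: edge weights count how often two classes changed in the same commit, and weighted degree (strength) stands in here for the four centrality metrics used as fault-proneness proxies.

```python
from itertools import combinations

# Hypothetical commit history: each commit lists the classes it touched.
commits = [
    ["Order", "Invoice"],
    ["Order", "Invoice", "Customer"],
    ["Order", "Cart"],
]

# Build the weighted co-change network: edge weight = number of commits
# in which two classes changed together.
weight = {}
for files in commits:
    for a, b in combinations(sorted(files), 2):
        weight[(a, b)] = weight.get((a, b), 0) + 1

# Weighted degree (strength) as a simple centrality proxy: classes that keep
# changing together with many others are candidate fault-prone classes.
strength = {}
for (a, b), w in weight.items():
    strength[a] = strength.get(a, 0) + w
    strength[b] = strength.get(b, 0) + w
```

Here `Order` accumulates the highest strength, so it would be flagged first as a candidate fault-prone class.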
... Some tools are available to support Class Responsibility Assignment (CRA), which can be used for analysing and designing OOP and were designed to provide a cognitive toolkit for designers and developers [4]. Designing object-oriented software (OOS) is a complex process, and its initial steps include analysing the class candidates and allocating the responsibilities of a system to them [5]. This type of initial design is applied in more advanced object-oriented mechanisms such as interfaces, design patterns, inheritance, or even architectural styles [6]. ...
Article
Object-oriented programming (OOP) provides a way to reuse code by pre-implementing functionality in the software. Developing object-oriented software is difficult, yet it is important in computer programming. In OOP, modularisation mainly depends on the class. Many methods exist for assigning responsibilities, but most of them rely on humans for decision making. In this research, a back-propagation neural network (BPNN) was used to provide a solution for the object-oriented design of software. The Cinema Booking System (CBS) was taken as the input documentation, and Formal Concept Analysis (FCA) was then used to find the relationships among the elements, arranged in a lattice. The results showed that the proposed system outperformed the existing system as well as the design made manually.
... A complex network approach was used to study software dependency network evolution [13,14]. Chong and Lee used weighted complex network with graph theory analysis to automate the derivation of clustering constraints from object-oriented software [15]. Joblin et al. investigated the evolutionary trends of developer coordination using a network approach [16]. ...
Full-text available
Article
The phenomenon of local worlds (also known as communities) exists in numerous real-life networks, for example, computer networks and social networks. We propose the Weighted Multi-Local-World (WMLW) network evolving model, taking into account (1) the dense links between nodes in a local world, (2) the sparse links between nodes from different local worlds, and (3) the different importance of intra-local-world links and inter-local-world links. On topology evolution, new links between existing local worlds and new local worlds are added to the network, while new nodes and links are added to existing local worlds. On the weighting mechanism, weights of links within a local world and weights of links between different local worlds are given different meanings. It is theoretically proven that the strength distribution of the network generated by the WMLW model follows a power-law distribution. Simulations show the correctness of the theoretical results. Meanwhile, the degree distribution also follows a power-law distribution. Analysis and simulation results show that the proposed WMLW model can be used to model the evolution of class diagrams of software systems.
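The power-law strength distribution claimed above can be sanity-checked with a least-squares fit on log-log axes. The sketch below uses synthetic data with a known exponent of -2 (rather than an actual WMLW-generated network), so the fitted slope should recover that exponent.

```python
import math

# Synthetic strength distribution with a known power-law exponent of -2:
# p(s) ∝ s^-2, so the log-log plot is an exact straight line of slope -2.
strengths = [1, 2, 4, 8, 16]
counts = [s ** -2 for s in strengths]

# Ordinary least-squares slope on log-log axes estimates the exponent.
xs = [math.log(s) for s in strengths]
ys = [math.log(c) for c in counts]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
```

On real network data, the counts would come from binning node strengths, and a dedicated estimator (e.g. maximum likelihood) is usually preferred over a plain log-log fit.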
Article
Programmers strive to design programs that are flexible, updateable, and maintainable. However, several factors such as lack of time, high costs, and workload lead to the creation of software with inadequacies known as anti-patterns. To identify and refactor software anti-patterns, many research studies have been conducted using machine learning. Even though some of the previous works were very accurate in identifying anti-patterns, a method that takes into account the relationships between different structures is still needed. Furthermore, a practical method is needed that is trained according to the characteristics of each program. This method should be able to identify anti-patterns and perform the necessary refactorings. This paper proposes a framework based on probabilistic graphical models for identifying and refactoring anti-patterns. A graphical model is created by extracting the class properties from the source code. As a final step, a Bayesian network is trained, which determines whether anti-patterns are present or not based on the characteristics of neighboring classes. To evaluate the proposed approach, the model is trained on six different anti-patterns and six different Java programs. The proposed model has identified these anti-patterns with a mean accuracy of 85.16 percent and a mean recall of 79%. Additionally, this model has been used to introduce several methods for refactoring, and it has been shown that these refactoring methods will ultimately create a system with less coupling and higher cohesion.
Full-text available
Preprint
Marketplaces for distributing software products and services have been gaining increasing popularity. GitHub, which is best known for its version control functionality through Git, launched its own marketplace in 2017. GitHub Marketplace hosts third-party apps and actions to automate workflows in software teams. Currently, this marketplace hosts 440 Apps and 7,878 Actions across 32 different categories. Overall, 419 third-party developers released their apps on this platform, which 111 distinct customers adopted. The popularity and accessibility of GitHub projects have made this platform, and the projects hosted on it, one of the most frequent subjects for experimentation in software engineering research. A simple Google Scholar search shows that 24,100 research papers have discussed GitHub within the software engineering field since 2017, but none have looked into the marketplace. The GitHub Marketplace provides a unique source of information on the tools used by practitioners in the Open Source Software (OSS) ecosystem for automating their projects' workflows. In this study, we (i) mine and provide a descriptive overview of the GitHub Marketplace, (ii) perform a systematic mapping of research studies in automation for open source software, and (iii) compare the state of the art with the state of the practice on the automation tools. We conclude the paper by discussing the potential of GitHub Marketplace for knowledge mobilization and collaboration within the field. This is the first study on the GitHub Marketplace in the field.