Article

Data clustering: a review. ACM Comput Surv

Authors: A. K. Jain, M. N. Murty, P. J. Flynn

Abstract

This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.


... While there is no single clustering algorithm that can be effectively applied to every problem, multiple algorithms have been developed to meet the needs of different types of datasets and analyses [24]. For instance, hierarchical agglomerative clustering algorithms apply a bottom-up strategy that successively merges the closest clusters until only a single cluster remains, which creates a hierarchical tree that represents the nested grouping of patterns [25]. Despite its higher computational cost, it only needs to be calculated once before it can be used to create any number of clusters, and it has been applied to gene expression datasets to group genes that exhibit similar expression patterns over time or over diverse experimental conditions [26], [27]. ...
... Hierarchical clustering may result in long processing times but allows the user to select any number of clusters after being calculated only once [25]. K-means has a fast execution time but requires the user to specify the desired number of clusters beforehand [28]. ...
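To make the trade-off described in the two excerpts above concrete, here is a small illustrative sketch (not taken from the cited works) using SciPy and scikit-learn on a hypothetical dataset X: the agglomerative linkage tree is built once and can then be cut at any number of clusters, whereas k-means needs k fixed before each run.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # hypothetical patterns to be clustered

# Hierarchical agglomerative clustering: the bottom-up merge tree is computed once...
Z = linkage(X, method="ward")
# ...and can then be cut into any desired number of clusters without recomputation.
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_5 = fcluster(Z, t=5, criterion="maxclust")

# K-means: fast, but the number of clusters must be specified beforehand,
# and each new choice of k requires a fresh run.
labels_k3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```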
Article
Full-text available
Many fields of study still face the challenges inherent to the analysis of complex multidimensional datasets, such as the field of computational biology, whose research on infectious diseases must contend with large protein-protein interaction networks with thousands of genes that vary in expression values over time. In this paper, we explore the visualization of multivariate data through CroP, a data visualization tool with a coordinated multiple views framework where users can adapt the workspace to different problems through flexible panels. In particular, we focus on the visualization of relational and temporal data, the latter being represented through layouts that distort timelines to represent the fluctuations of values across complex datasets, creating visualizations that highlight significant events and patterns. Moreover, CroP provides various layouts and functionalities not only to highlight relationships between different variables, but also to dig down into discovered patterns in order to better understand their sources and their effects. These methods are demonstrated through multiple experiments with diverse multivariate datasets, with a focus on gene expression time-series datasets. In addition to a discussion of our results, we also validate CroP through model and interface tests performed with participants from both the fields of information visualization and computational biology.
... There have been many reviews of existing clustering algorithms (Ezugwu et al., 2020; Feczko & Fair, 2020; Gan et al., 2020; Giordani, 2020; Jain, 2010; Jain et al., 1999; Rui & Wunsch, 2005; Xu & Tian, 2015). However, most of these reviews target audiences in the areas of statistics, machine learning, and computer science, and many only provide a general review of methods without detailed practical guidance for those less familiar with these approaches. ...
... Distance between two points in n-dimensional space. It is the most commonly used distance measure in clustering; however, it can be sensitive to outliers (Jain et al., 1999). A few other distance measures, such as the weighted Euclidean distance and the average Euclidean distance, are modifications of the Euclidean distance. ...
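As a concrete illustration of the distance measures mentioned in this excerpt (my own sketch, assuming NumPy and two arbitrary points x and y with hypothetical weights w):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
w = np.array([0.5, 1.0, 2.0])   # hypothetical per-dimension weights

# Euclidean distance: sqrt(sum_i (x_i - y_i)^2); sensitive to outlying dimensions.
d_euclidean = np.sqrt(np.sum((x - y) ** 2))

# Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2).
d_weighted = np.sqrt(np.sum(w * (x - y) ** 2))

print(d_euclidean, d_weighted)
```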
Preprint
Full-text available
Clustering models or cluster analyses have been widely used to explore individual heterogeneity in mental health research. Despite advances in new algorithms and increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements when using clustering models. In this review, we first provided a comprehensive introduction to the philosophy, design and implementation of major algorithms that are particularly relevant in mental health research. The design, comparisons and implementations (in R package) of different models and dissimilarity measures are discussed. The extensions of basic models, such as kernel method, deep learning, semi-supervised clustering, and clustering ensembles were subsequently introduced. Methods for pre-clustering data processing, clustering evaluation and validation, as well as important issues commonly faced in clustering tasks are discussed. Importantly, we provided general guidance on clustering workflow and reporting requirements. A rapid review of publications (December 2020-December 2021) was conducted focusing on the top six psychology and psychiatry journals that published most of the clustering papers. The results have highlighted that there was a lack of diversity in the algorithm of choice, robust validation processes via resampling, and available data and analysis code to improve reproducibility. This comprehensive review offers researchers advanced tools and guidelines to address some of these issues, improve practice and ultimately understanding of the complexity of mental illness.
... Grouping similar observations, data points, or feature vectors based on similar characteristics is referred to as clustering analysis or clustering. Jain et al. (1999) and Hancer et al. (2020) defined the clustering process as the automatic grouping of unlabeled samples in a given dataset using similarity measures such as the Euclidean distance, point symmetry, and signal coding. Clustering algorithms are generally divided into two types: hierarchical and non-hierarchical (partitional) clustering algorithms (Jain 2010). ...
... The centroid µi can be calculated using Eq. (2). (Fig. 6: The approaches of clustering analysis, after Jain et al. 1999.) ...
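Eq. (2) itself is not reproduced in the excerpt; the standard definition of a cluster centroid (a common convention, not necessarily the cited paper's exact notation) is:

```latex
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```

That is, the centroid of cluster C_i is simply the component-wise mean of the patterns assigned to it.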
Article
Full-text available
It would be beneficial to consider the results of the long-term evaluation of natural disasters in the decision-making process for disaster damage reduction/prevention. However, disaster evaluation is a complex and time-consuming process depending on different factors such as data type and data period. In this study, a new approach is proposed to determine the risk groups of the provinces in Turkey according to the disaster types (avalanche, landslide, rockfall, and flood) at regional and national scales. Disaster data between 1950 and 2020 were evaluated by considering the number of disasters in the provinces. The obtained data were subjected to cluster analysis, and then, the cluster groups were converted into risk classes. Finally, the risk weight ratios of the provinces and regions were calculated and thematically mapped by integrating them with GIS methods. According to the results, when four disaster types were considered together, Trabzon is the riskiest province on a provincial basis and the Black Sea is the riskiest region on a regional basis in Turkey. Additionally, the results of the study show that cluster analysis offers an effective solution for the evaluation of long-term large datasets. Furthermore, it was found that the new approach, which is used to minimize the errors that may be caused by surface area differences, makes a significant contribution to the evaluation process. This new approach will make a positive contribution to the analysts at the stage of giving priority to disasters and establishing protective and preventive policies on a national and global scale.
... When choosing the measure of the distance between the clusters, we considered the fact that the original data were min-max normalized, resulting in a more compact data set with a diminished outlier effect [73]. Given the advantages of min-max normalization, one of the members of the Minkowski family of measures, which includes the Manhattan distance and the Euclidean distance, is recommended as the measure of the distance between clusters [74]. ...
... The Euclidean distance is frequently used and easy to calculate, and it is well suited to data sets with compact or isolated clusters [73,75]. ...
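A minimal sketch (my own illustration, assuming a small NumPy feature matrix X) of min-max normalization followed by two members of the Minkowski family:

```python
import numpy as np

X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [8.0, 20.0]])    # hypothetical raw data

# Min-max normalization: rescale each feature to [0, 1], which compacts the
# data set and damps the effect of outliers.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

d_manhattan = minkowski(X_norm[0], X_norm[1], p=1)
d_euclidean = minkowski(X_norm[0], X_norm[1], p=2)
```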
Article
Full-text available
Against the background of the exponential growth of the world's population, compounded by the decrease of natural resources and the continuous, accentuated degradation of the quality of the environment, with global warming as its main effect, ensuring the sustainability of economic and social processes is becoming a growing concern. At the European Union level, it is important that all member countries adhere to and implement common measures on sustainable development, which involve, inter alia, ensuring the convergence of policies and their effects at EU level. Through the detailed SDGs, the EU presents a system of indicators structured around 17 objectives; these indicators are adopted, implemented, and calculated by EUROSTAT. Based on a Composite Index of Sustainable Development of EU Countries' Economies (ISDE-EU), the study proposes an analysis of the convergence of the sustainability of EU states' economies, not so much at the individual level as at the cluster level, with each cluster containing EU countries with similar/close ISDE-EU levels and dynamics. The results of the analysis confirm the partial existence of beta and sigma convergence of the sustainability of EU countries' economies. Please note that at the time we processed the data, the UK was an EU state, which is why it was included in the analysis.
... 15 The median value of each isotopy per season was calculated by considering the isotopy values of the episodes belonging to that season. 16 The principal clustering methods are partitioning methods (k-means, PAM, CLARA), which subdivide the dataset into a set of k groups, where k is the number of groups pre-specified by the analyst; hierarchical clustering, which identifies groups in the data without subdividing it; fuzzy clustering; density-based clustering; and model-based clustering (for more details see Jain et al., 1999; Rokach and Maimon, 2005; Berkhin, 2006). 17 The optimal number of clusters is a central issue in partitioning clustering such as k-means clustering. ...
Article
Full-text available
TV series have gained both economic and cultural relevance. Their development over time can hardly be traced back to the simple programmatic action of creative intentionality. Instead, TV series might be studied as narrative ecosystems with emergent trends and patterns. This paper aims to boost quantitative research in the field of media studies, first considering a comparative and data-driven study of the narrative features of US medical TV series, one of the most popular and longest-running genres on global television. Based on a corpus of more than 400 h of video, we investigate the storytelling evolution of eight audiovisual serial products by identifying three main narrative features (i.e., isotopies). The implemented schematization makes it possible to grasp the basic components of the social interactions, showing the strength of the medical genre and its ability to rebuild, in its microcosm, the essential traits of the human macrocosm, where random everyday life elements (seen in the medical cases plot) mix and overlap with working and social relationships (professional plot) and personal relationships (sentimental plot). This study relies on data-driven research that combines content analysis and clustering analysis. It significantly differs from traditional studies regarding the narrative features of medical dramas and, more broadly, the field of television studies. We proved that the three isotopies are good descriptors for the medical drama genre and identified four narrative profiles which emphasize the strong stability of these serial products. Contrary to what is often taken for granted in many interpretative studies, creative decisions rarely significantly change the general narrative aspects of the wider series.
... Among data mining algorithms, grouping techniques are supposed to find the most homogeneous clusters that are as distinct as possible from other clusters: maximizing inter-cluster variance while minimizing intra-cluster variance [24]. In other words, these algorithms should automatically recognize patterns intrinsically present within the dataset [25]. ...
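The objective described in this excerpt can be written down directly; the following sketch (my own, assuming NumPy data X and integer labels produced by any clustering algorithm) computes the intra-cluster (within) and inter-cluster (between) sums of squares that such algorithms trade off:

```python
import numpy as np

def cluster_variances(X, labels):
    """Return (within-cluster SS, between-cluster SS) for a labelled partition."""
    overall_mean = X.mean(axis=0)
    wss, bss = 0.0, 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        wss += np.sum((members - centroid) ** 2)                      # intra-cluster term
        bss += len(members) * np.sum((centroid - overall_mean) ** 2)  # inter-cluster term
    return wss, bss
```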
Article
Full-text available
The recent improvement of infrared image quality has increased the use of thermography as a non-destructive diagnostic technique. Amongst other applications, thermography can be used to monitor historic buildings. The present work was carried out within the framework of the Horizon 2020 European project SHELTER, which aims to create a management plan for cultural heritage subject to environmental and anthropogenic risk. Among the chosen case studies is the Santa Croce Complex in Ravenna (Italy), which is exposed to different hazards, including flooding. The church has a peculiar architecture that develops below the street level, so the internal walls are affected by the deterioration caused by rising humidity. In such a case of advanced degradation, passive thermography cannot be used to its full potential. For this reason, an innovative methodology involving active thermography was first developed and validated with laboratory tests. Secondly, we conducted its first application to a real case study. With this purpose, an active thermography survey with forced ventilation was carried out to enhance different stages of material degradation by means of automatic classification of multitemporal data. These experiments have resulted in a method using an active thermal survey in a high moisture content environment to detect masonry degradation.
... Clustering analysis is performed to classify a group of objects into different categories, such that objects in the same cluster are more similar to each other than to objects in other clusters [1]. It is often used to discover structure in data that is not directly visible. ...
Article
Full-text available
Over the years, research on fuzzy clustering algorithms has attracted the attention of many researchers, and they have been applied to various areas, such as image segmentation and data clustering. Various fuzzy clustering algorithms have been put forward based on the initial Fuzzy C-Means clustering (FCM) with Euclidean distance. However, the existing fuzzy clustering approaches ignore two problems. Firstly, clustering algorithms based on Euclidean distance have a high error rate, and are more sensitive to noise and outliers. Secondly, the parameters of the fuzzy clustering algorithms are hard to determine. In practice, they are often determined by the user's experience, which results in poor performance of the clustering algorithm. Therefore, considering the above deficiencies, this paper proposes a novel fuzzy clustering algorithm by combining the Gaussian kernel function and Grey Wolf Optimizer (GWO), called Kernel-based Picture Fuzzy C-Means clustering with Grey Wolf Optimizer (KPFCM-GWO). In KPFCM-GWO, the Gaussian kernel function is used as a symmetrical measure of distance between data points and cluster centers, and the GWO is utilized to determine the parameter values of PFCM. To verify the validity of KPFCM-GWO, a comparative study was conducted. The experimental results indicate that KPFCM-GWO outperforms other clustering methods, and the improvement of KPFCM-GWO is mainly attributed to the combination of the Gaussian kernel function and the parameter optimization capability of the GWO. What is more, the paper applies KPFCM-GWO to analyze the value of an airline's customers, and five levels of customer categories are defined.
... Its efficiency and effectiveness in clustering have been proven in many practical applications. The algorithm starts with k randomly selected centers, where the value of k is pre-selected [31]. The algorithm then assigns each data point to its nearest center. ...
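The two steps named in this excerpt (random initialization, then nearest-center assignment followed by center updates) are the core of the classic k-means iteration; a compact, illustrative-only NumPy sketch (empty clusters are not handled here):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k randomly selected centers
    for _ in range(n_iter):
        # Assignment step: attach each data point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```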
Article
Full-text available
Road damage such as potholes and cracks may reduce ride comfort and traffic safety. This influence can be prevented by regular, proper monitoring and maintenance of roads. Traditional methods and existing methods of surveying are very time-consuming, expensive, require a lot of human effort, and, thus, cannot be conducted frequently. A more efficient and cost-effective process is required to augment profilometer and traditional road-condition recognition systems. In this study, we propose deep-learning methods using smartphone data to devise a cost-effective and ad-hoc approach. Information from sensors on smartphones such as gyroscopes, accelerometers, magnetometers, and cameras are harnessed to detect road damage using deep-learning algorithms. In order to give heuristic and accurate information about the road damage, we used a cloud-based collaborative approach to fuse all the data and update a map frequently with these road-surface conditions. Deep-learning models were able to perform well, with an accuracy of 94% on the sensor-based long short-term memory (LSTM) model and 87.5% on the vision-based YOLOv5 model.
... The goal of clustering is to separate similar and unlabeled data into several independent subsets, each representing a class. Many clustering methods [1][2][3][4] have been proposed in the literature and are widely applied to various practical problems. However, their shortcomings are becoming more and more serious for high-dimensional data. ...
Article
Clustering is an important and challenging research topic in many fields. Although various clustering algorithms have been developed in the past, traditional shallow clustering algorithms cannot mine the underlying structural information of the data. Recent advances have shown that deep clustering can achieve excellent performance on clustering tasks. In this work, a novel variational autoencoder-based deep clustering algorithm is proposed. It treats the Gaussian mixture model as the prior latent space and uses an additional classifier to distinguish different clusters in the latent space accurately. A similarity-based loss function is proposed consisting specifically of the cross-entropy of the predicted transition probabilities of clusters and the Wasserstein distance of the predicted posterior distributions. The new loss encourages the model to learn meaningful cluster-oriented representations to facilitate clustering tasks. The experimental results show that our method consistently achieves competitive results on various data sets.
... ML methods are divided into several classes depending on the learning method and the purpose of the algorithm [62] and include the following: supervised learning (SL) [13], unsupervised learning (UL) or cluster analysis [14], dimensionality reduction, semi-supervised learning (SSL), reinforcement learning (RL) [63], and deep learning (DL) [64]. UL methods solve the task of splitting a set of unlabeled objects into isolated or intersecting groups by applying an automatic procedure based on the properties of these objects [65,66]. UL reveals hidden patterns in data, as well as anomalies and imbalances. ...
Article
Full-text available
Artificial intelligence (AI) is an evolving set of technologies used for solving a wide range of applied issues. The core of AI is machine learning (ML)—a complex of algorithms and methods that address the problems of classification, clustering, and forecasting. The practical application of AI&ML holds promising prospects; therefore, research in this area is intensive. However, the industrial applications of AI and its more intensive use in society are not widespread at the present time. The challenges of widespread AI application need to be considered from both the AI (internal problems) and the societal (external problems) perspectives. This consideration will identify the priority steps for more intensive practical application of AI technologies, their introduction, and their involvement in industry and society. The article presents the identification and discussion of the challenges of employing AI technologies in the economy and society of resource-based countries. The systematization of AI&ML technologies is implemented based on publications in these areas. This systematization allows for the specification of the organizational, personnel, social, and technological limitations. This paper outlines the directions of studies in AI and ML that will allow us to overcome some of these limitations and expand the scope of AI&ML applications.
... In this implementation, the similarity metric obtained from the all-to-all registration strategy is converted into a dissimilarity metric, and multidimensional scaling [110] is used to translate the values of each registration pair into spatial coordinates projected in a multi-dimensional space. The multi-dimensional particles are then classified using k-means clustering [111] and each class is averaged to generate the final reconstructions. In particular, the authors provide a relatively simple strategy to determine an optimal number of classes without a priori knowledge. ...
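A schematic of that similarity-to-clusters pipeline (my own sketch with scikit-learn, assuming a symmetric precomputed similarity matrix S with values in [0, 1]; names and sizes are hypothetical):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
S = rng.random((50, 50))            # hypothetical all-to-all similarity scores
S = (S + S.T) / 2                   # symmetrize
np.fill_diagonal(S, 1.0)

D = 1.0 - S                         # convert similarity into dissimilarity

# Multidimensional scaling: embed the dissimilarities as spatial coordinates.
coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

# Classify the embedded particles with k-means.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
```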
Article
Full-text available
Understanding the structure of supramolecular complexes provides insight into their functional capabilities and how they can be modulated in the context of disease. Super-resolution microscopy (SRM) excels in performing this task by resolving ultrastructural details at the nanoscale with molecular specificity. However, technical limitations, such as underlabelling, preclude its ability to provide complete structures. Single-particle analysis (SPA) overcomes this limitation by combining information from multiple images of identical structures and producing an averaged model, effectively enhancing the resolution and coverage of image reconstructions. This review highlights important studies using SRM–SPA, demonstrating how it broadens our knowledge by elucidating features of key biological structures with unprecedented detail.
... Each Gaussian is replaced with a single data point that represents its time of day, amplitude and width. The peak data is grouped into clusters by using k-means clustering (Jain et al., 1999) to cluster the data in terms of the time of day, as shown by the 3-dimensional plot in Fig. 1b. The optimal number of clusters is determined by using the elbow method (Kodinariya & Makwana, 2013). ...
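A minimal sketch of the elbow heuristic mentioned above (illustrative only, assuming scikit-learn and a hypothetical feature matrix X of peak descriptors): the within-cluster sum of squares (inertia) is recorded for increasing k, and the "elbow" where the improvement flattens is taken as the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))       # hypothetical (time of day, amplitude, width) features

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)    # within-cluster sum of squares for this k

# Inspect the (k, inertia) pairs and pick k at the "elbow" of the curve.
for k, w in zip(ks, inertias):
    print(k, round(w, 2))
```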
Article
Accurate modelling of household electricity on a large scale is a cornerstone for demand management and decarbonisation efforts. This paper introduces a data-driven model for South African households. According to the Paris Agreement, a 50 % reduction in global CO2 emissions is required to ensure the global average temperature does not surpass 2 °C above pre-industrial levels. In this context, 26 % of the overall European energy consumption is used by the residential sector and the global energy consumption is estimated to increase by 1.3 % on average per year until 2050. In the last decade, the development and implementation of smart grid technologies have grown to efficiently and cost-effectively meet the electricity demands of the grid and mitigating greenhouse gas emissions. We present a data-driven, enveloped sum of Gaussians-based model and residential household synthesiser that statistically models the household's electricity usage demand based on smart meter data and generates synthetic data that accurately represents the actual demand profile, load peaks, and daily variances for an individual or aggregate group of households. The measured data was gathered over a one year period for 1200 households in South Africa. Our model accounts for temporal variations such as seasonality and the day of week, household uniqueness, and is fully autonomous. Our results show that the root mean square error between the aggregated measured and synthetic electricity profiles is 0.181 A (5.68 %) and the total energy of each profile is 75.9 and 76.3 A·h, respectively.
... Clustering is the process of grouping objects into clusters according to the similarities within the data objects (Jain et al., 1999). Cluster analysis (Jain, 2010) does not need to refer to any classification information beforehand and can classify data by judging the similarity of data features. ...
Article
Full-text available
Clustering is an unsupervised learning technique widely used in the field of data mining and analysis. Clustering encompasses many specific methods, among which the K-means algorithm maintains its predominant popularity owing to its simplicity and efficiency. However, its efficiency is significantly influenced by the initial solution and it is susceptible to being stuck in a local optimum. To eliminate these deficiencies of K-means, this paper proposes a quantum-inspired moth-flame optimizer with an enhanced local search strategy (QLSMFO). Firstly, quantum double-chain encoding and quantum revolving gates are introduced in the initial phase of the algorithm, which can enrich the population diversity and efficiently improve the exploration ability. Second, an improved local search strategy on the basis of the Shuffled Frog Leaping Algorithm (SFLA) is implemented to boost the exploitation capability of the standard MFO. Finally, the poor solutions are updated using Levy flight to obtain a faster convergence rate. Ten well-known UCI benchmark datasets dedicated to clustering are selected to test the efficiency of QLSMFO, which is compared with K-means and ten currently popular swarm intelligence algorithms. Meanwhile, the Wilcoxon rank-sum test and Friedman test are utilized to evaluate the effect of QLSMFO. The simulation experimental results demonstrate that QLSMFO significantly outperforms other algorithms with respect to precision, convergence speed, and stability.
... Clustering is a well-known problem in data analysis and machine learning, and has been widely studied in the literature (Jain et al., 1999;Estivill-Castro, 2002;Jain, 2010). In a classical setting, the data to be clustered is stored in a single dataset, and a single algorithm (or agent) is in charge of finding the "best" clusters according to some optimisation criteria. ...
Article
Full-text available
Swarm intelligence leverages collective behaviours emerging from interaction and activity of several “simple” agents to solve problems in various environments. One problem of interest in large swarms featuring a variety of sub-goals is swarm clustering, where the individuals of a swarm are assigned or choose to belong to zero or more groups, also called clusters. In this work, we address the sensing-based swarm clustering problem, where clusters are defined based on both the values sensed from the environment and the spatial distribution of the values and the agents. Moreover, we address it in a setting characterised by decentralisation of computation and interaction, and dynamicity of values and mobility of agents. For the solution, we propose to use the field-based computing paradigm, where computation and interaction are expressed in terms of a functional manipulation of fields, distributed and evolving data structures mapping each individual of the system to values over time. We devise a solution to sensing-based swarm clustering leveraging multiple concurrent field computations with limited domain and evaluate the approach experimentally by means of simulations, showing that the programmed swarms form clusters that well reflect the underlying environmental phenomena dynamics.
Chapter
Clustering has been proven to produce better results when applied to learning problems that fall under semi-supervised paradigms, where only incomplete or partial information about the dataset is available to perform clustering. Classic constrained clustering and recent monotonic clustering problems belong to the semi-supervised learning paradigm, although a combination of both has never been addressed. This study aims to prove that the fusion of the background knowledge leveraged by the two aforementioned semi-supervised clustering techniques results in improved performance. To do so, a hybrid objective function combining them is proposed and optimized by means of an expectation minimization scheme. The capabilities of the proposed method are tested in a wide variety of datasets with incremental levels of background knowledge and compared to purely monotonic clustering and purely constrained clustering methods belonging to the state-of-the-art. Bayesian statistical testing is used to validate the obtained results.
Keywords: Pairwise instance-level constraints; Monotonicity constraints; Expectation-minimization; Hybrid objective function
Chapter
The current energy crisis, coupled with concerns about climate change, makes renewable energies a priority. Since solar energy is one of the most representative renewable sources in Spain, being able to predict the short-term behavior of a solar panel can be very useful for determining the coverage provided by this energy source. To achieve this goal, four regression techniques in combination with a clustering algorithm have been applied, with the aim of predicting the solar energy generation of a panel during the year 2011 in the autonomous region of Galicia (Spain). A data set with continuous information has been used, and it has been possible to validate the best way to combine regression and clustering techniques to achieve the best prediction.
Keywords: Regression; Neural networks; Solar energy; Renewable energies; Clustering
Chapter
Research on the non-stationary nature of road vehicle vibrations (RVV) has led to advances in simulating such processes. Contemporary methods introduced for the analysis of RVV primarily aim at partitioning the signal in the time or time-frequency domain, providing differing segments of a signal. However, a degree of dissimilarity, or conversely similarity, is still challenging to quantify. Here we argue that in some cases a mere statement of dissimilarity between neighbouring segments within a signal might be good enough, though from a broader perspective, the assessment of the similarity of discrete Fourier transforms (DFT) may be the next practical step forward. For this reason, the current paper presents the hierarchical clustering of elements of the short-time Fourier transform (STFT) plane from an RVV measurement; secondly, it introduces a clustering validation metric to arrive at an optimum distance metric and a threshold to use in binary hierarchical clusters.
Keywords: Clustering spectrums; Distance probability; Hierarchical clustering; Road vehicle vibration
Article
The European transport sector is evolving rapidly, and so are the challenges associated with its fuel needs. Advanced biofuels of the second and third generations, based on lignocellulosic (LC) and microalgae biomass, have emerged as promising alternative biofuel sources. The paper reviews the renewable energy scenario and its contribution to the transport sector in the European region. A techno-economic analysis is presented for LC- and algae-based advanced biofuels. A SWOT analysis is performed to understand the challenges and opportunities associated with making 2nd and 3rd generation advanced biofuels market-ready.
Article
Context: Nowadays, the use of large amounts of data acquired from diverse electronic or optical devices and other measurement technologies creates a data analysis problem when extracting the information of interest from the acquired samples. In these cases, correctly grouping the data is necessary to obtain relevant and precise information that reveals the physical phenomenon of interest. Methodology: This work presents the evolution of a five-stage methodology for developing a data clustering technique using machine learning and artificial intelligence techniques. It consists of five phases, called analysis, design, development, evaluation, and distribution, based on open-source standards and grounded in the unified languages for the interpretation of software in engineering. Results: The methodology was validated by creating two data analysis methods, with an average execution time of 20 weeks, obtaining precision values 40% and 29% higher than the classical k-means and fuzzy c-means clustering algorithms. Additionally, a methodology for massive experimentation with automated unit tests was established, which managed to group, label, and validate 3.6 million samples, accumulating a total of 100 runs of groups of 900 samples, in approximately 2 hours. Conclusions: The research results show that the methodology is intended to guide the systematic development of data clustering techniques for specific problems with databases composed of samples with quantitative attributes, such as channel parameters in a communication system or image segmentation using the RGB values of pixels; moreover, when software and hardware are developed, execution is more versatile than in purely theoretical applications. Funding: Universidad Francisco de Paula Santander and Universidade Federal de Minas Gerais.
Article
The Fuzzy C-means (FCM) clustering algorithm is an important and popular clustering algorithm utilized in various application domains such as pattern recognition, machine learning, and data mining. Although this algorithm has shown acceptable performance in diverse problems, the current literature lacks studies on how to improve the clustering quality of partitions with overlapping classes. The better the clustering quality of a partition, the better the interpretation of the data, which is essential to understand real problems. This work proposes two robust FCM algorithms to prevent ambiguous membership in clusters. For this, we compute two types of weights: one weight to avoid the problem of overlapping clusters, and another weight to enable the algorithm to identify clusters of different shapes. We perform a study with synthetic datasets, where each one contains classes of different shapes and different degrees of overlapping. Moreover, the study considered real application datasets. Our results indicate such weights are effective in reducing the ambiguity of membership assignments, thus generating a better data interpretation.
Article
Full-text available
Binary colloidal superlattices (BSLs) have demonstrated enormous potential for the design of advanced multifunctional materials that can be synthesized via colloidal self-assembly. However, mechanistic understanding of the three-dimensional self-assembly of BSLs is largely limited due to a lack of tractable strategies for characterizing the many two-component structures that can appear during the self-assembly process. To address this gap, we present a framework for colloidal crystal structure characterization that uses branched graphlet decomposition with deep learning to systematically and quantitatively describe the self-assembly of BSLs at the single-particle level. Branched graphlet decomposition is used to evaluate local structure via high-dimensional neighborhood graphs that quantify both structural order (e.g., body-centered-cubic vs face-centered-cubic) and compositional order (e.g., substitutional defects) of each individual particle. Deep autoencoders are then used to efficiently translate these neighborhood graphs into low-dimensional manifolds from which relationships among neighborhood graphs can be more easily inferred. We demonstrate the framework on in silico systems of DNA-functionalized particles, in which two well-recognized design parameters, particle size ratio and interparticle potential well depth can be adjusted independently. The framework reveals that binary colloidal mixtures with small interparticle size disparities (i.e., A- and B-type particle radius ratios of r A/r B = 0.8 to r A/r B = 0.95) can promote the self-assembly of defect-free BSLs much more effectively than systems of identically sized particles, as nearly defect-free BCC-CsCl, FCC-CuAu, and IrV crystals are observed in the former case. The framework additionally reveals that size-disparate colloidal mixtures can undergo nonclassical nucleation pathways where BSLs evolve from dense amorphous precursors, instead of directly nucleating from dilute solution. These findings illustrate that the presented characterization framework can assist in enhancing mechanistic understanding of the self-assembly of binary colloidal mixtures, which in turn can pave the way for engineering the growth of defect-free BSLs.
Article
Clustering is one of the most crucial problems in unsupervised learning, and the well-known k-means algorithm can be implemented on a quantum computer with a significant speedup. However, for the clustering problems that cannot be solved using the k-means algorithm, a powerful method called spectral clustering is used. In this study, we propose a circuit design to implement spectral clustering on a quantum processor with substantial speedup by initializing the processor into a maximally entangled state and encoding the data information into an efficiently simulatable Hamiltonian. Compared to the established quantum k-means algorithms, our method does not require a quantum random access memory or a quantum adiabatic process. It relies on an appropriate embedding of quantum phase estimation into Grover’s search to gain the quantum speedup. Simulations demonstrate that our method effectively solves clustering problems and is an important supplement to quantum k-means algorithm for unsupervised learning.
Article
In structural health monitoring (SHM), damage detection is the final target for knowing the real status of the structure of interest. Vibration-based damage detection is a commonly used method, since it makes full use of the dynamic characteristics. Improving the efficiency of this kind of method has attracted increasing attention. The uncertainty in modal parameters identified from measured data may significantly affect the detection accuracy. Furthermore, an optimization algorithm with a better convergence speed can improve the detection accuracy and reduce the computational time. This article presents work to develop a novel damage detection method based on a fundamental Bayesian two-stage model and sparse regularization. In this method, the most probable values of the modal parameters and the associated posterior uncertainty are combined to investigate the effect of uncertainty on damage detection. The use of sparse regularization in the objective function can decrease the modeling complexity and avoid the overfitting problem. A machine learning method combining an intelligent swarm optimization algorithm with K-means clustering was used to carry out the optimization. Finally, a method combining three existing techniques, namely the fundamental Bayesian two-stage model, sparse regularization, and the I-Jaya algorithm, was developed. To investigate the efficiency of the proposed method, the traditional objective functions with and without sparse regularization were also used for comparison. The proposed method was verified with an ASCE benchmark example and then applied to an experimental structure. The results show that, due to the consideration of uncertainty, the objective function based on the fundamental Bayesian model and sparse regularization performs better.
Article
Full-text available
Wireless technologies provide a wide variety of unique characteristics geared toward various purposes and demands. They allow billions of people to use the Internet in the current information age and benefit from the modern digital economy and digital technology. Discussing how to apply wireless communication networks to the study of volleyball strategies has great theoretical and practical significance. In line with the significance of data mining and wireless communication networks, we propose a Markov-based model for the technical and tactical analysis of volleyball games, along with a strategy to extract the key elements of winning volleyball games. A computerized solution to the problem of finding the key factors that change volleyball games is necessary, and wireless communication plays a pivotal role in this context. In data acquisition, the speed of retrieval is increased by frequently searching for records to meet real-time requirements for location capturing. In data processing, the data ambiguity caused by the rules of the game of volleyball is handled by processing data separately and setting a threshold for the rate of global change. This study shows that the proposed design emphasizes the order and efficiency of the project. Therefore, a Markov-style approach using data mining is adopted for the technical and tactical analysis of volleyball matches, and the results of our proposed approach outperform existing techniques and approaches.
Chapter
In the field of clustering, non-spherical data clustering is a relatively complex case. To satisfy the practical application, the solution should be able to capture non-convex patterns in data sets with high performance. At present, the multi-prototype method can meet the former requirement, but the time cost is still high. This paper proposes a new multi-prototype extension of the K-multiple-means type algorithm, which aims to further reduce the computation time in processing non-spherical data sets with a concise principle while maintaining close performance. Compared with other methods, the method still adopts the idea of multiple prototypes and uses agglomerative strategies in the phase of class cluster connection. However, to reduce the amount of data involved in the computation and the interference of incorrect partition, the subclass data of the first partition is filtered. In addition, the agglomeration is divided into two stages: the agglomeration between prototypes and the agglomeration between clusters, and two agglomeration modes are provided to deal with different clustering tasks. Before updating the means, the filtered data needs a quadratic partition. Experimental results show that compared with the state-of-the-art approaches, the proposed method is still effective with lower time complexity in both synthetic and real-world data sets.
Article
This article presents a framework to cluster buildings into typologically similar groups and select indicator buildings for regional seismic response and damage analysis. The framework requires a robust database of buildings to provide high-level structural and site information of buildings. Here, a database of 234 reinforced concrete buildings with five or more above-ground stories in the central business district of Wellington, New Zealand, has been selected as the case study of this research. First, key structural and site parameters that contribute to the seismic demand, response, and damage of each building are extracted from the database. Extracted parameters comprise three numerical and five categorical attributes of each building, including the year of construction, height, period, lateral load resisting system, floor system, site subsoil class, importance level, and strong motion station. Next, two prominent unsupervised machine learning clustering approaches are utilized to cluster the mixed categorical and numerical building database: k-prototype on the mixed numerical and categorical database and k-means on principal components numerical subspace adopted from factor analysis of mixed data (FAMD). A novel autoencoder deep learning neural network is also designed and trained to convert the mixed data into a low-dimensional subspace called latent space and feed this into k-means for clustering. The proposed autoencoder method is demonstrated to be more effective at clustering buildings into useful typological clusters for seismic response and damage analysis based on multiple criteria from both data-science and engineering perspectives. The details of selected indicator buildings for each similar seismic vulnerability cluster are also represented.
Article
Monitoring and maintenance of rotating machinery are a common issue in industry due to the numerous types of faults that can occur in their components. In this context, a relevant number of stops come from failures in rolling bearings, elements widely used in rotating systems. Vibration signals are usually used to monitor this component, and among the most employed methods are the data-driven ones, basically divided into the following steps: signal acquisition and processing, feature extraction and selection, and pattern recognition/fault classification. Several studies have focused on each of these steps, developing new methods or improving existing ones. In this work, feature analysis is investigated through unsupervised learning to evaluate how features can be combined to improve bearing fault identification. A bibliographic review was carried out to gather time-domain and frequency-domain features that could describe the fault content of vibration signals. Vibration signals were obtained from two different databases to compare the results. The k-means algorithm was employed to group similar data and to enable the understanding of which features are relevant when the objective is to analyze the presence of faults in rolling bearings, considering both the location and the severity of the faults. Results were discussed and compared using an appropriate accuracy metric and showed that high accuracies (over 99%) can be obtained for both datasets with the same combination of features, comprising both time- and frequency-domain features. It is revealed that the extraction of simple features in the time and frequency domains, combined with a common unsupervised learning method, is suitable for grouping vibration data containing different types of faults on rolling bearings. This type of simple feature extraction can assist the evaluation of methods that may be proposed for fault identification in rolling bearings.
Article
Full-text available
To provide health services, hospitals consume electrical power and contribute to CO2 emissions. This paper aims to develop a modelling approach to optimize hospital services while reducing CO2 emissions. To capture treatment processes and the production of carbon dioxide, a hybrid method of data mining and simulation–optimization techniques is proposed. Different clustering algorithms are used to categorize patients. Using quality indicators, clustering methods are evaluated to find the best cluster sets, and then patients are categorized accordingly. Discrete-event simulation is applied to each patient category to estimate performance measures such as the number of patients being served, waiting times, and length of stay, as well as the amount of CO2 emission. To optimize performance measures of patient flow, metaheuristic searches have been used. The dataset of Bushehr Heart Hospital is considered as a case study. Based on K-means, K-medoid, hierarchical clustering, and Fuzzy C-means clustering methods, patients are categorized into two groups of high-risk and low-risk patients. The number of patients being served, total waiting time, length of stay, and CO2 emitted during care processes are improved for both groups. The proposed hybrid method is an effective method for hospitals to categorize patients based on care processes. The problems and the proposed solution approach reported in this study could be applicable to other hospitals worldwide to help both optimize patient flow and minimize the environmental consequences of care services.
Article
Accurate estimation of daily rainfall return levels associated with large return periods is needed for a number of hydrological planning purposes, including protective infrastructure, dams, and retention basins. This is especially relevant at small spatial scales. The ERA-5 reanalysis product provides seasonal daily precipitation over Europe on a 0.25° × 0.25° grid (about 27 × 27 km). This translates to more than 20,000 land grid points and leads to models with a large number of parameters when estimating return levels. To bypass this abundance of parameters, we build on regional frequency analysis (RFA), a well-known strategy in statistical hydrology. This approach consists in identifying homogeneous regions by gathering locations with similar distributions of extremes up to a normalizing factor, and in developing sparse regional models. In particular, we propose a step-by-step blueprint that leverages a recently developed and fast clustering algorithm to infer return level estimates over large spatial domains. This enables us to produce maps of return level estimates of ERA-5 reanalysis daily precipitation over continental Europe for various return periods and seasons. We discuss limitations and practical challenges and also provide a GitHub repository. We show that a relatively parsimonious model with only a spatially varying scale parameter can compete well against statistical models of higher complexity.
Article
Full-text available
A new non-parametric procedure is presented for extracting seismic fragility curves by considering the frequency content of real ground motion records. The implemented non-parametric method is based on parametric methods, averaging and clustering the input data based on an intensity measure (IM) using the K-means clustering method. As an advantage, it can considerably decrease the number of analyses via an optimization process and a Monte Carlo procedure. The proposed method's accuracy is evaluated for real ground motion records. Results show that although the random selection of records among different clusters can be effective for synthetic records, it is not suitable for selecting real ones and can lead to large estimation errors. Therefore, a classification procedure based on the frequency content of the records is used in this study to select an appropriate set of real ground motion records for providing the required IM observations in the method. Then, a set of 472 real ground motion records corresponding to 60 earthquakes is chosen as input data for extracting the fragility curves of the case study structure, to evaluate the accuracy and applicability of the non-parametric method. The mean period (Tm) of the ground motions is considered as a suitable frequency content measure to classify the real ground motion records, and this classification leads to better and more accurate results than the Monte Carlo procedure. According to the results, using the classification procedure, the optimized non-parametric method may require fewer than 70 nonlinear analyses to extract the fragility curves.
Article
Full-text available
Nowadays, university rankings are used to assess all aspects of universities. Due to the impact of university rankings on assessing the performance of universities, this research aims to explore university rankings in depth. University rankings are considered contributors to assessing university performance. Previous literature identified different types of goals, such as output and support goals, and advised aligning these two types of goals. Universities have different goals, but university rankings still measure all universities on the same criteria. Subsequently, this research used the most widely used university ranking in the literature, the QS world ranking dataset. Unsupervised machine learning was then performed to cluster the universities. The results divided universities among four clusters. This study helps in allocating each university to the appropriate cluster, and it helps university managers define the goals of their universities. The study recommends that universities align their support goals with their output goals, develop international goals and strategies, and support research by supporting their scholars. This study's novelty lies in connecting university rankings and goals using management analytics in education.
Chapter
One of the benefits of computer-based assessments lies in the automatic generation of log data. Such behavioural process data provide a time-stamped documentation of students' interactions with the assessment system (e.g., mouse clicks). This chapter explores the usefulness of computer-generated log data for the measurement of professional competence and their potential for the research on professional learning and development. Based on a selection of studies, we illustrate how interindividual differences in task completion processes can be analysed with the help of log data, e.g. to identify the use of certain problem-solving strategies, or to reveal subgroups of students with efficiency barriers. We further present our own research, where we applied a theory on the diagnostic process (Abele, Vocat Learn 11(1):133–159, 2018) in order to assess diagnostic strategies (Abele and von Davier, CDMs in vocational education: assessment and usage of diagnostic problem-solving strategies in car mechatronics. In: von Davier M, Lee YS (eds) Handbook of diagnostic classification models. Springer International Publishing, pp 461–488. https://doi.org/10.1007/978-3-030-05584-4_22, 2019) in the domain of car mechatronics using log data. A profound understanding of interindividual process differences may supplement a merely product-oriented competence measurement and pave the way for a more process-oriented approach. Challenges concerning the assessment, analysis and interpretation of log data will be discussed.
Keywords: Process data; Log data; Competence measurement; Computer-based assessment; Vocational education
Chapter
Density-based clustering methods can detect clusters of arbitrary shapes. Most traditional clustering methods need the number of clusters to be given as a parameter, but this information is usually not available. And some density-based clustering methods cannot estimate local density accurately. When estimating the density of a given point, each neighbor of the point should have different importance. To solve these problems, based on the K-nearest neighbor density estimation and shared nearest neighbors, a new density-based clustering method is proposed, which assigns different weights to k-nearest neighbors of the given point and redefines the local density. In addition, a new clustering process is introduced: the number of shared nearest neighbors between the given point and the higher-density points is calculated first, the cluster that the given point belongs to can be identified, and the remaining points are allocated according to the distance between them and the nearest higher-density point. Using this clustering process, the proposed method can automatically discover the number of clusters. Experimental results on synthetic and real-world datasets show that the proposed method has the best performance compared with K-means, DBSCAN, CSPV, DPC, and SNN-DPC.
Article
Full-text available
The K-means clustering algorithm is an iterative unsupervised learning algorithm that tries to partition the given dataset into k pre-defined, distinct, non-overlapping clusters where each data point belongs to only one group. However, its performance is affected by its sensitivity to the initial cluster centroids, the possibility of convergence to a local optimum, and the specification of the cluster number as an input parameter. Recently, the hybridization of metaheuristic algorithms with the K-means algorithm has been explored to address these problems and effectively improve the algorithm's performance. Nonetheless, most metaheuristic algorithms require rigorous parameter tuning to achieve an optimum result. This paper proposes a hybrid clustering method that combines the well-known symbiotic organisms search (SOS) algorithm with K-means, using the SOS as a global search metaheuristic for generating the optimum initial cluster centroids for the K-means. The SOS algorithm is a largely parameter-free metaheuristic with excellent search quality that only requires initialising a single control parameter. The performance of the proposed algorithm is investigated by comparing it with the classical SOS, classical K-means, and other existing hybrid clustering algorithms on eleven (11) UCI Machine Learning Repository datasets and one artificial dataset. The results from the extensive computational experimentation show improved performance of the hybrid SOSK-Means for solving automatic clustering compared to the standard K-means, symbiotic organisms search clustering methods, and other hybrid clustering approaches.
Chapter
Choosing a proper strategy is the first step or prerequisite for drug design/discovery, particularly rational design of multitarget drugs (MTDs), which has been discussed in Chap. 18. Using appropriate web-based data resources for initial analyses and predictions is the next step to success in drug design/discovery, and this information has been conveyed in Chap. 19. Yet to be able to put the strategy and data resource into a real-world process for hit identification and validation as well as hit-to-lead development, application of proper and efficient methods is indispensable. The present chapter provides detailed descriptions of step-by-step procedures of a few of the most commonly and frequently used methods in drug design and discovery, including high-throughput screening assays, molecular docking methods, target and compound similarity-based methods, X-ray crystallography, machine learning, and information mining (data mining and text mining).
Article
Request for quotation (RFQ) is a lengthy document soliciting vendor products and services according to rigid specifications. This research develops an integrated natural language processing (NLP), text mining, and machine learning approach for intelligent RFQ summarization. Over 1,300 power transformer RFQ requests are used to build a word-embedding model for training and testing. Domain keywords are extracted using N-gram TF-IDF. The method automatically extracts essential specifications such as voltage, capacity, and impedance from RFQs using text analytics. The K-means algorithm groups the sentences of each specification. The TextRank algorithm identifies important sentences of all specifications to generate RFQ summaries. The summarization system helps engineers shorten the time to identify all specifications and reduces the risk of missing important requirements during manual RFQ reading. The system helps improve the complex product design for manufacturers and improve the cost estimation and competitiveness of quotations in a highly competitive marketplace.
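The clustering stage of such a pipeline can be sketched with scikit-learn; the snippet below uses N-gram TF-IDF features and K-means, and picks the sentence nearest each centroid as a simple stand-in for the paper's TextRank ranking step (names and parameters are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def summarize_rfq(sentences, n_groups=5):
    """Group specification sentences with N-gram TF-IDF + K-means and return
    one representative sentence per group (centroid-nearest), as a rough
    stand-in for the TextRank-based summary described in the abstract."""
    vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    tfidf = vec.fit_transform(sentences)
    km = KMeans(n_clusters=n_groups, n_init=10).fit(tfidf)
    # Representative sentence = the one closest to each cluster centroid.
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, tfidf)
    return [sentences[i] for i in sorted(set(closest))]
```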
Article
Full-text available
The correlation clustering problem identifies clusters in a set of objects when the qualitative information about objects’ mutual similarities or dissimilarities is given in a signed network. This clustering problem has been studied in different scientific areas, including computer sciences, operations research, and social sciences. A plethora of applications, problem extensions, and solution approaches have resulted from these studies. This paper focuses on the cross-disciplinary evolution of this problem by analysing the taxonomic and bibliometric developments during the 1992 to 2020 period. With the aim of enhancing cross-fertilization of knowledge, we present a unified discussion of the problem, including details of several mathematical formulations and solution approaches. Additionally, we analyse the literature gaps and propose some dominant research directions for possible future studies.
Article
Full-text available
Urban–rural fringes, as special zones where urban and rural areas meet, are the most sensitive areas in the urbanization process. The quantitative identification of urban–rural fringes is the basis for studying the social structure, landscape pattern, and development gradient of fringes, and is also a prerequisite for quantitative analyses of the ecological effects of urbanization. However, few studies have compared the identification accuracy of the US Air Force Defense Meteorological Satellite Program (DMSP) and Visible Infrared Imaging Radiometer Suite (VIIRS) nighttime light data from the same year, which would enable long time-series monitoring of the urban–rural fringe. Therefore, in this study, taking Shenyang as an example, a K-means algorithm was used to delineate the urban–rural fringe from DMSP and VIIRS nighttime light data for 2013, the identification results were compared, and the changes between 2013 and 2020 were analysed. The results showed a high degree of overlap between the two types of data in 2013, with the overlap accounting for 75% of the VIIRS identification results. Furthermore, the VIIRS data identified more urban and rural detail than the DMSP data. The area of the urban–rural fringe in Shenyang increased from 1872 km2 to 2537 km2, with the growth direction mainly concentrated in the southwest. This study helps to move urban–rural fringe identification from static identification to dynamic tracking, and from spatial identification to temporal identification. The research results can be applied to the comparative analysis of urban–rural differences and the study of the ecological and environmental effects of urbanization.
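The delineation step can be illustrated with a simple sketch: cluster per-pixel brightness into three groups and take the middle-brightness group as a proxy for the fringe. The function below is an assumption-laden illustration, not the paper's exact workflow:

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_night_lights(brightness):
    """Cluster a 2-D array of nighttime-light brightness values (DMSP or
    VIIRS) into three groups and rank them by mean brightness, so the
    middle group approximates the urban-rural fringe. Details (three
    classes, plain K-means) are illustrative assumptions."""
    values = brightness.reshape(-1, 1).astype(float)
    km = KMeans(n_clusters=3, n_init=10).fit(values)
    order = np.argsort(km.cluster_centers_.ravel())   # dark -> bright
    rank = np.zeros(3, dtype=int)
    rank[order] = np.arange(3)                        # 0=rural, 1=fringe, 2=urban
    return rank[km.labels_].reshape(brightness.shape)
```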
Article
Full-text available
Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure-activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP-), and Adam optimization learning algorithms featuring rational train-test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with the Kennard-Stone train-test splitting based on the structural descriptor similarity metrics was found more effective than the partitioning with the ranking by activity based on biological activity values metrics for the entire set of VLA-SMILES featured QSAR. Robustness and the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. In addition, the method of statistical H0 hypothesis testing of the linear regression between real and observed activities based on the F(2, n−2) criterion was used for predictability estimation among VLA-SMILES featured QSAR-MLPs (with n being the volume of the testing set). Both approaches of QSAR parametric model validation and statistical hypothesis testing were found to correlate when used for the quantitative evaluation of predictabilities of the designed QSAR models with VLA-SMILES descriptors.
Article
Flow cytometry (FCM) determines the characteristics of individual biological cells using optical and fluorescence measurements. It is a widely used standard method for analysing blood samples in medical diagnostics, through identifying and quantifying the different types of cells in the samples. The multidimensional dataset obtained from FCM is large and complex, so it is difficult and time-consuming to analyse manually. The main process of differentiation and therefore labelling of the different cell populations in the data is referred to as Gating. This is the first step of FCM data analysis and is highly subjective, an issue that significant research has focussed on reducing. However, a faster standard gating technique is still needed. Existing automated gating techniques are time-consuming or retain subjectivity by requiring many user-defined parameters. This paper presents and discusses FLOPTICS: a novel automated gating technique that is a combination of density-based and grid-based clustering algorithms. FLOPTICS has an ability to classify and label cell populations in FCM data faster and with fewer user-defined parameters than many state-of-the-art techniques.
Article
Full-text available
The multi-level perspective has been criticized for being functionalistic and paying little attention to actor-based perspectives. Nevertheless, for the identification and assessment of potential change agents in a sustainability transition, a clear conceptual and methodological approach is necessary. This paper, thus, develops a multi-dimensional typology of niche, regime, and hybrid actors, which is conceptually grounded in transition studies and empirically illustrated by a cluster analysis based on a survey of pig and poultry farmers in Germany, France, and the Netherlands. Animal husbandry is chosen as a case study because a significant share of the environmental impact within the agri-food system is attributed to this sector and there is evidence for resistance to change by mainstream actors. Conceptually, the paper provides a framework of constitutive elements for different kinds of actors and contributes to an extension of the niche–regime dichotomy by adding the group of hybrid actors. The empirical results show that cluster analysis is a suitable approach to identify conceptually meaningful differences among interviewed farmers. Among pig and poultry farmers, the regime actors are by far the largest group. The smaller group of hybrid actors, however, has large potential to act as boundary spanners. A particularly interesting finding is that several larger farms are among the group of niche actors which hints at the possibility that larger farms are not necessarily resistant to change.
Article
Full-text available
As the core component of permanent magnet motors, magnetic tiles with surface defects seriously affect the quality of industrial motors. Automatic recognition of magnetic tile surface defects is difficult because the defect patterns are complex and diverse. Existing defect recognition methods are hard to apply in practice due to their complicated system structure and the low accuracy of image segmentation and target detection across such diverse defect patterns. A self-supervised learning (SSL) method, which benefits from its nonlinear feature extraction performance, is proposed in this study to improve on existing approaches. We propose an efficient multihead self-attention method that can automatically locate single or multiple defect areas of a magnetic tile and extract features of the defects. We also design a full-connection classifier that can accurately classify the different defect types. A knowledge distillation process without labeling is proposed, which simplifies the self-supervised training process. Our method proceeds as follows. A feature extraction model, consisting of a standard vision transformer (ViT) backbone, is trained by contrastive learning without a labeled dataset and is used to extract global and local features from the input magnetic tile images. Then, a full-connection neural network, trained on a labeled dataset, classifies the known defect types. Finally, the feature extraction model and the defect classification model are combined into a relatively simple integrated system. The public magnetic tile surface defect dataset, which holds 5 defect categories and 1 non-defect category, is used for training, validation, and testing. We also use online data augmentation techniques to increase the number of training samples so that the model converges and achieves high classification accuracy. The experimental results show that the SSL method extracts richer and more detailed features than a supervised learning model. The composite model reaches a high testing accuracy of 98.3% and shows relatively strong robustness and good generalization ability.
Article
Full-text available
The image segmentation task becomes complex in the presence of vague boundary structure and spatially distributed noise. In the literature, fuzzy sets and their extension-based clustering methods are utilized to address vagueness in the image segmentation problem. In this work, a picture fuzzy set theoretic clustering method is proposed to enhance MRI image segmentation so that it is robust to vague boundary structure, nonlinearity and spatially distributed noise. Most of the related work for handling noise in the segmentation process is based on smoothing the image, which results in the loss of fine structure; moreover, the nonlinearity present in the data leads to inaccurate segmentation results. In this work, we define an optimization problem for clustering the pixel intensity values using the picture fuzzy set theoretic approach, which handles the vagueness. We also include a spatial neighborhood information term in the optimization problem of the proposed method, which avoids smoothing and preserves fine details in the segmentation process. Further, a kernel distance measure is used in the proposed optimization problem to capture the nonlinear structure present in the image. The experiments are carried out on a synthetic image dataset and two publicly available brain MRI datasets. The comparison with state-of-the-art methods shows that the proposed picture fuzzy clustering method provides better segmentation performance in terms of average segmentation accuracy and Dice score.
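For orientation, a kernelized fuzzy clustering objective of the general form used by such methods can be written as follows; this is a generic sketch (standard kernel fuzzy c-means with a Gaussian kernel, up to a constant factor), not the authors' exact picture fuzzy formulation or their spatial regularization term:

$$
J_m \;=\; \sum_{j=1}^{c}\sum_{i=1}^{n} u_{ij}^{\,m}\,\bigl(1 - K(x_i, v_j)\bigr),
\qquad
K(x, v) \;=\; \exp\!\left(-\frac{\lVert x - v \rVert^{2}}{\sigma^{2}}\right),
\qquad
\sum_{j=1}^{c} u_{ij} = 1 ,
$$

where $u_{ij}$ is the membership of pixel $x_i$ in cluster $j$, $v_j$ is the cluster prototype, and $m>1$ is the fuzzifier.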
Article
Full-text available
Just as miners must process huge quantities of rock and dirt to obtain valuable ores, data analysts must often process huge volumes of raw data to extract useful information.
Article
Full-text available
discussed in the accompanying article, involve datasets so large that their direct manipulation is impractical. Some method of data compression or consolidation must first be applied to reduce the size of the dataset without losing the essential character of the data. All consolidation methods sacrifice some detail; the most desirable methods are computationally efficient and yield results that are, at least for practical applications, representative of the original data. Here we introduce several widely used algorithms that consolidate data by clustering, or grouping, and then present a new method, the continuous k-means algorithm, developed at the Laboratory specifically for clustering large datasets.
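For reference, the standard batch k-means loop that such consolidation methods build on can be sketched as follows (a generic Lloyd-style implementation, not the Laboratory's continuous variant):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain batch k-means (Lloyd's algorithm): assign each point to its
    nearest centroid, then move each centroid to the mean of its points,
    repeating until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```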
Article
Humans detect and identify faces in a scene with little or no effort. However, building an automated system that accomplishes this task is very difficult. There are several related subproblems: detection of a pattern as a face, identification of the face, analysis of facial expressions, and classification based on physical features of the face. A system that performs these operations will find many applications, e.g. criminal identification, authentication in secure systems, etc. Most of the work to date has been in identification. This paper surveys the past work in solving these problems. The capability of the human visual system with respect to these problems is also discussed. It is meant to serve as a guide for an automated system. Some new approaches to these problems are also briefly discussed.
Article
One of the currently most active research areas within Artificial Intelligence is the field of Machine Learning, which involves the study and development of computational models of learning processes. A major goal of research in this field is to build computers capable of improving their performance with practice and of acquiring knowledge on their own. The intent of this book is to provide a snapshot of this field through a broad, representative set of easily assimilated short papers. As such, this book is intended to complement the two volumes of Machine Learning: An Artificial Intelligence Approach (Morgan-Kaufman Publishers), which provide a smaller number of in-depth research papers. Each of the 77 papers in the present book summarizes a current research effort, and provides references to longer expositions appearing elsewhere. These papers cover a broad range of topics, including research on analogy, conceptual clustering, explanation-based generalization, incremental learning, inductive inference, learning apprentice systems, machine discovery, theoretical models of learning, and applications of machine learning methods. A subject index is provided to assist in locating research related to specific topics. The majority of these papers were collected from the participants at the Third International Machine Learning Workshop, held June 24-26, 1985 at Skytop Lodge, Skytop, Pennsylvania. While the list of research projects covered is not exhaustive, we believe that it provides a representative sampling of the best ongoing work in the field, and a unique perspective on where the field is and where it is headed.
Article
Since it is usually impossible to determine a 'best' clustering procedure, admissible clustering procedures are suggested. Let A denote some property which should be satisfied by any reasonable procedure either in general or when used in a special application. Any procedure which satisfies A is called A-admissible. Nine admissibility conditions are defined and several standard clustering methods are compared with them.
Article
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
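BIRCH is available in common libraries; for instance, scikit-learn's sklearn.cluster.Birch implements the CF-tree idea. A minimal usage sketch (the data and parameter values are illustrative, not from the paper):

```python
import numpy as np
from sklearn.cluster import Birch

# Illustrative data: 10,000 two-dimensional points.
X = np.random.default_rng(0).normal(size=(10_000, 2))

# threshold controls sub-cluster radius, branching_factor the CF-tree width;
# the CF-tree is built in (near) one pass over the data.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)
```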
Article
San Marcos Pueblo, dating between A.C. 1340 and 1700, is a large, Glaze Period town of the Anasazi-Pueblo cultural tradition. An aerial map of this protohistoric site was made from CIR (color infrared) photography provided by the Stennis Space Center, NASA. Ground confirmation of the aerial map has been affected through archaeological studies beginning in 1915 and continuing, intermittently, until 1993. The verified map is presented as a modern settlement statement as well as a vehicle for plotting differential spatial distributions of artifacts useful in evaluating the role of tasks and activities. Among these intramural activities are apartment living, religious life, craft production, water hauling, and other work tasks. Outside of the town were mining and "walk-out" gardening locales. The town of San Marcos played a significant role as a member of the "Eastern Frontier" of pueblos manning the border between the American Southwest and the High Plains. Interaction with Plains buffalo hunters was based on shifting political alliances reflecting a changing mixture of strategies ranging from hostile raids to peaceful trade fairs. Intensive farming and interregional trade are identified as critical variables in the evolution of social complexity and the rise of towns from a village farming base.
Article
Minimum spanning trees (MST) and single linkage cluster analysis (SLCA) are explained and it is shown that all the information required for the SLCA of a set of points is contained in their MST. Known algorithms for finding the MST are discussed. They are efficient even when there are very many points; this makes a SLCA practicable when other methods of cluster analysis are not. The relevant computing procedures are published in the Algorithm section of the same issue of Applied Statistics. The use of the MST in the interpretation of vector diagrams arising in multivariate analysis is illustrated by an example.
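The MST-SLCA connection can be demonstrated directly: build the minimum spanning tree of the pairwise distances and cut its k−1 longest edges to obtain the k single-linkage clusters. A small sketch (helper name and details are ours; it assumes distinct points):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_single_linkage(points, k):
    """Cluster points into k groups by cutting the k-1 longest MST edges,
    which reproduces a single-linkage (SLCA) partition."""
    dist = squareform(pdist(points))              # pairwise Euclidean distances
    mst = minimum_spanning_tree(dist).toarray()   # the n-1 MST edges
    edges = np.argwhere(mst > 0)
    weights = mst[edges[:, 0], edges[:, 1]]
    # Remove the k-1 heaviest edges to split the tree into k components.
    for i, j in edges[np.argsort(weights)[::-1][: k - 1]]:
        mst[i, j] = 0
    _, labels = connected_components(mst, directed=False)
    return labels
```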
Article
An analysis of surface pollen samples to discover if they fall naturally into distinct groups of similar samples is an example of a classification problem. In Euclidean classification, a set of n objects can be represented as n points in Euclidean space of p dimensions. The sum of squares criterion defines the optimal partition of the points into g disjoint groups to be the partition which minimizes the total within-group sum of squared distances about the g centroids. It is not usually feasible to examine all possible partitions of the objects into g groups. A critical review is made of algorithms which have been proposed for seeking optimal partitions. The problem is reformulated in non-linear programming terms, and a new algorithm for seeking the minimum sum of squares is described. The performance of this algorithm in analyzing the pollen data is found to compare well with the performance of three of the existing algorithms. An efficient hybrid algorithm is introduced.
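As a reminder, the sum of squares criterion referred to here is the standard within-group criterion: for a partition of the n points into g disjoint groups $C_1,\dots,C_g$ with centroids $\bar{x}_j$, one seeks the partition minimizing

$$
W(C_1,\dots,C_g) \;=\; \sum_{j=1}^{g} \sum_{x_i \in C_j} \lVert x_i - \bar{x}_j \rVert^{2},
\qquad
\bar{x}_j \;=\; \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .
$$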
Article
Conceptual clustering is an important way of summarizing and explaining data. However, the recent formulation of this paradigm has allowed little exploration of conceptual clustering as a means of improving performance. Furthermore, previous work in conceptual clustering has not explicitly dealt with constraints imposed by real world environments. This article presents COBWEB, a conceptual clustering system that organizes data so as to maximize inference ability. Additionally, COBWEB is incremental and computationally economical, and thus can be flexibly applied in a variety of domains.
Article
A project to develop artificial intelligence (AI) computers that can understand and emulate human speech, perform physical functions, and make reasoned judgments has developed to the point where expert systems with modest powers that suggest reasoning are at work in business and industry. Present-day computers process data, but must be programmed with instructions. AI systems process information more rapidly and can comprehend new types of programming languages that use symbols rather than numbers. The programs compare facts and rules to make deductive, reasoned responses. Sales of AI software and hardware amount to $150 million a year, and Japan is competing for industry and defense contracts. There is no conclusive proof that computers can do more than make inferences, nor is it known how to instill common sense or spontaneous problem solving.
Article
The complete-link hierarchical clustering strategy is reinterpreted as a heuristic procedure for coloring the nodes of a graph. Using this framework, the problem of assessing goodness-of-fit in complete-link clustering is approached through the number of “extraneous” edges in the fit of the constructed partitions to a sequence of graphs obtained from the basic proximity data. Several simple numerical examples that illustrate the suggested paradigm are given and some Monte Carlo results presented.
Article
In this paper the notion of conceptual cohesiveness is made precise and used to group objects semantically, based on a knowledge structure called a ‘cohesion forest’. A set of axioms is proposed which should be satisfied to make the generated clusters meaningful.
Article
Clustering analysis(1–4) is a newly developed computer-oriented data analysis technique. It is a product of many research fields: statistics, computer science, operations research, and pattern recognition. Because of the diverse backgrounds of researchers, clustering analysis has many different names. In biology, clustering analysis is called “taxonomy”.(5,6) In pattern recognition(7–15) it is called “unsupervised learning.” Perhaps the most confusing name of all, the term “classification” sometimes also denotes clustering analysis. Since classification may denote discriminant analysis, which is totally different from clustering analysis, it is perhaps important to distinguish these two terms.
Article
A simple step-wise procedure for the clustering of variables is described. Two alternative criteria for the merger of groups at each pass are discussed: (1) maximization of the pairwise correlation between the centroids of two groups, and (2) minimization of Wilks’ statistic to test the hypothesis of independence between two groups. For a set of sample covariance matrices the step-wise solution for each criterion is compared with the optimal two-group separation of variables found by total enumeration of the possible groupings.
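A minimal sketch of criterion (1), greedy merging of variable groups by the correlation of their centroids, might look as follows; the centroids are taken as simple group means, and the function and variable names are our own illustration rather than the paper's procedure:

```python
import numpy as np

def stepwise_variable_clustering(X, n_groups):
    """Greedily merge the two groups of variables whose centroids (mean
    variables) have the highest pairwise correlation, until n_groups remain.
    X has shape (samples, variables); returns a list of index groups."""
    groups = [[j] for j in range(X.shape[1])]
    while len(groups) > n_groups:
        best, best_corr = None, -np.inf
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                ca = X[:, groups[a]].mean(axis=1)
                cb = X[:, groups[b]].mean(axis=1)
                r = np.corrcoef(ca, cb)[0, 1]
                if r > best_corr:
                    best, best_corr = (a, b), r
        a, b = best
        groups[a] += groups[b]
        del groups[b]
    return groups
```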
Article
An algorithm to generate a minimal spanning tree is presented when the nodes with their coordinates in some m-dimensional Euclidean space and the corresponding metric are given. This algorithm is tested on manually generated data sets. The worst case time complexity of this algorithm is O(n log2n) for a collection of n data samples.
In this paper, we present a comparative study of four different classifiers for isolated handprinted character recognition. These four classifiers are i) a nearest template (NT) classifier, ii) an enhanced nearest template (ENT) classifier, iii) a standard feedforward neural network (FNN) classifier, and iv) a hybrid classifier. The NT classifier is a variation of the nearest neighbor classifier which stores a small number of templates (or prototypes) and their statistics generated by a special clustering algorithm. Motivated by radial basis function networks, the ENT classifier is proposed to augment the NT classifier with an optimal transform which maps the distances generated by the NT classifier to character categories. The FNN classifier is a 3-layer (with one hidden layer) feedforward network trained using the backpropagation algorithm. The hybrid classifier combines results from the FNN and NT classifiers in an efficient way to improve the recognition accuracy with only a slight increase in computation. In this paper, we evaluate the performance of these four classifiers in terms of recognition accuracy, top 3 coverage rate, and recognition speed, using the NIST isolated lower-case alphabet database. Our experiments show that the FNN classifier outperforms the NT and ENT classifiers in all the three evaluation criteria. The hybrid classifier achieves the best recognition accuracy at a cost of little extra computation over the FNN classifier. The ENT classifier can significantly improve the recognition accuracy of the NT classifier when a small number of templates is used.
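As a rough illustration of the nearest-template idea described above (ignoring the template statistics, and using plain k-means in place of the paper's special clustering algorithm), a per-class template compression followed by nearest-template assignment might be sketched as:

```python
import numpy as np
from sklearn.cluster import KMeans

class NearestTemplateClassifier:
    """Sketch of a nearest-template (NT) classifier: per class, k-means
    compresses the training samples into a few template vectors; a query is
    assigned the class of its nearest template."""
    def __init__(self, templates_per_class=5):
        self.templates_per_class = templates_per_class

    def fit(self, X, y):
        templates, labels = [], []
        for c in np.unique(y):
            km = KMeans(n_clusters=self.templates_per_class, n_init=10)
            km.fit(X[y == c])
            templates.append(km.cluster_centers_)
            labels.append(np.full(self.templates_per_class, c))
        self.templates_ = np.vstack(templates)
        self.template_labels_ = np.concatenate(labels)
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.templates_[None, :, :]) ** 2).sum(-1)
        return self.template_labels_[np.argmin(d, axis=1)]
```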
This chapter presents an empirical study of the performance of heuristic methods for clustering and discusses the problem of clustering a set of data points into a given number of groups (or clusters). Among the popular heuristic methods are K-MEANS and its variations. The task of empirically evaluating and comparing heuristic methods presents several difficulties: first, it is usually not possible to derive the time complexity of all the methods under consideration; second, the trade-off between the computation cost and the quality of solutions produced must be taken into consideration; and third, an assessment of the quality of solutions produced must be based on several instances of input data of different sizes, and on different types of input. It is extremely difficult to control the amount of effort that goes into making the design choices for a given method. An appropriate set of choices and parameter values is highly important for obtaining good performance.
Article
Cluster analysis is a collective term covering a wide variety of techniques for delineating natural groups or clusters in data sets. This book integrates the necessary elements of data analysis, cluster analysis, and computer implementation to cover the complete sequence of steps from raw data to the finished analysis. The author develops a conceptual and philosophical basis for using cluster analysis as a tool of discovery and applies it systematically throughout the book. He provides a comprehensive discussion of variables, scales, and measures of association that establishes a sound basis for constructing an operational definition of similarity tailored to the needs of any particular problem, and devotes special attention to the problems of analyzing data sets containing mixtures of nominal, ordinal, and interval variables. (Author)
Article
Artificial Intelligence (AI) methods for machine learning can be viewed as forms of exploratory data analysis, even though they differ markedly from the statistical methods generally connoted by the term. The distinction between methods of machine learning and statistical data analysis is primarily due to differences in the way techniques of each type represent data and structure within data. That is, methods of machine learning are strongly biased toward (symbolic as opposed to numeric) data representations. We explore this difference within a limited context, devoting the bulk of our paper to the explication of conceptual clustering, and its extension to the statistically based methods of numerical taxonomy. In conceptual clustering the formation of object clusters is dependent on the quality of 'higher-level' characterizations, termed concepts, of the clusters. The form of concepts used by existing conceptual clustering systems (sets of necessary and sufficient conditions) is described in some detail. This is followed by descriptions of several conceptual clustering techniques, along with sample output. We conclude with a discussion of how alternative concept representations might enhance the effectiveness of future conceptual clustering systems. Keywords: Conceptual clustering; Concept formation; Hierarchical classification; Numerical taxonomy; Heuristic search; Exploratory data analysis.