Conference Paper

K-Means++: The Advantages of Careful Seeding

Authors: David Arthur, Sergei Vassilvitskii

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
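The seeding rule can be stated compactly. Below is a minimal NumPy sketch of the D²-sampling step described in the abstract: the first center is drawn uniformly at random, and each subsequent center is drawn with probability proportional to its squared distance to the nearest center chosen so far. The function name and interface are illustrative, not taken from the paper.

    import numpy as np

    def kmeanspp_seed(X, k, seed=0):
        """Pick k initial centers by D^2-sampling (k-means++ seeding)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        centers = [X[rng.integers(n)]]                          # first center: uniform at random
        for _ in range(1, k):
            diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
            d2 = (diffs ** 2).sum(axis=-1).min(axis=1)          # squared distance to nearest chosen center
            centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # D^2 weighting
        return np.asarray(centers)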

... The optimized algorithm is known as k-means++, and it guarantees an 8(ln k + 2)-approximation in expectation [47]. The details of the solving strategies will be discussed in the Literature Review section. ...
... With the intention of bounding the sensitivity and suppressing the estimator's variance at a low level, researchers proposed D²-sampling [47,54]. This approach starts by roughly approximating the optimal solution, in settings where time efficiency matters more than the quality of the solution obtained. ...
Thesis
Full-text available
Recent work has proposed solving the k-means clustering problem in quantum computers via the Quantum Approximate Optimization Algorithm (QAOA) and coreset techniques. QAOA is a variational quantum algorithm that tackles k-means clustering by translating it as a form of MAX-CUT problem. Besides, the coreset from computational geometry is a data-reduction approach approximating a dataset into a much smaller subset with weights. However, the current method only proves the feasibility of quantum k-means clustering without guaranteeing high accuracy and stability in various datasets. Therefore, we aim to design a new quantum framework that yields comparable results to the classical method in k-means clustering. To achieve this, we compared the performance of solving the clustering problem with existing coreset techniques and quantum variational algorithms. Most importantly, this work proposes solving the k-means clustering problem with the variational quantum eigensolver (VQE) and a novel coreset, the Contour coreset. VQE is a quantum-classical hybrid optimization algorithm based on variational principles. We demonstrate that VQE is an alternative method to QAOA but exhibits higher accuracy in the clustering problem. Moreover, we provide a novel coreset construction algorithm for Contour coreset, which is tailor-made for in-putting large datasets into noisy intermediate-scale quantum (NISQ) computers with limited qubits. The construction algorithm is faster than existing coreset algorithms, and the Contour coreset can better represent large datasets compared to the state-of-art coresets for k-means clustering. To the best of our knowledge, this is the first core-set algorithm built for quantum computing. Extensive experiments with synthetic and real-life data demonstrated that our approach outperforms existing quantum k-means clustering approaches with higher accuracy and lower standard deviation.
... The K-means method is a popular clustering approach. Various studies [29][30][31] provide competitive solutions for the k-means problem. In their work, Arthur and Vassilvitskii [30] suggested a method to estimate the initial values of centers for k-means. ...
... Various studies [29][30][31] provide competitive solutions for the k-means problem. In their work, Arthur and Vassilvitskii [30] suggested a method to estimate the initial values of centers for k-means. Their approach is based on the idea that the initial k cluster centers should be spaced relatively far apart: each subsequent center is selected from the remaining data points with probability proportional to its squared distance to the nearest center already chosen. ...
Article
Full-text available
Wireless Sensor Networks (WSNs) have emerged as a pivotal technology interlinked with numerous burgeoning sectors. Myriad sensor nodes, diverse in nature, constitute these networks, which are dispersed within a given environment to collect and relay pertinent data to a central station. Given the typical deployment of sensor nodes-bearing limited energy reserves and often stationed at extensive distances for prolonged periods-energy conservation becomes a paramount concern for enhancing the network's lifespan. One avenue explored to address this challenge involves the clustering of sensor nodes within the network. This study introduces a dynamic approach for clustering nodes in WSNs, designed to accommodate mobile nodes. The approach leverages an enhanced version of the k-means algorithm in tandem with a novel cluster head selection method, capable of clustering even moving nodes. This strategy proposes an innovative solution to select cluster heads, aiming to reduce the energy consumption of nodes and augment reliability during data transmission within sensor networks.
... D²-sampling returns an (O(1), O(log(k)))-approximate solution for k-means with high probability (Jiang et al., 2024); adding the extra point x* has a negligible effect. The proof follows that of Lemma 3.1 in Arthur & Vassilvitskii (2007), with the only difference being that the equality of Lemma 3.1 becomes an inequality. We summarise the result in the following lemma: ...
... This improves on the algorithm given by Jiang et al. (2024), which has running time Õ(nk). The running time of Jiang et al. (2024) is dominated by the running time of D²-sampling, Algorithm 1, which is just the k-means++ initialisation algorithm (Arthur & Vassilvitskii, 2007) in kernel space. For completeness, we provide the full description of the ε-coreset algorithm of Jiang et al. (2024) in Algorithm 4 in Appendix C.1. ...
Preprint
Full-text available
Coresets have become an invaluable tool for solving k-means and kernel k-means clustering problems on large datasets with small numbers of clusters. On the other hand, spectral clustering works well on sparse graphs and has recently been extended to scale efficiently to large numbers of clusters. We exploit the connection between kernel k-means and the normalised cut problem to combine the benefits of both. Our main result is a coreset spectral clustering algorithm for graphs that clusters a coreset graph to infer a good labelling of the original graph. We prove that an α-approximation for the normalised cut problem on the coreset graph is an O(α)-approximation on the original. We also improve the running time of the state-of-the-art coreset algorithm for kernel k-means on sparse kernels, from Õ(nk) to Õ(n · min{k, d_avg}), where d_avg is the average number of non-zero entries in each row of the n × n kernel matrix. Our experiments confirm our coreset algorithm is asymptotically faster on large real-world graphs with many clusters, and show that our clustering algorithm overcomes the main challenge faced by coreset kernel k-means on sparse kernels, which is getting stuck in local optima.
... A Balanced SSE Objective. As Theorem 4.3 indicates, assigning a balanced number of original nodes to each synthetic node can effectively reduce the upper bound of the parameter distance, thereby enhancing the objectives in both the training and testing stages. (The reduction rate is defined as (#nodes in synthetic set)/(#nodes in training set).) ...
... To enable the GECC pipeline to effectively condense graphs that evolve over time and increase in size, we adopt a clustering approach inspired by the K-means++ initialization technique [1]. This method improves the quality of clustering by ensuring a better spread of centroids during initialization. ...
Preprint
Full-text available
Graph data has become a pivotal modality due to its unique ability to model relational datasets. However, real-world graph data continues to grow exponentially, resulting in a quadratic increase in the complexity of most graph algorithms as graph sizes expand. Although graph condensation (GC) methods have been proposed to address these scalability issues, existing approaches often treat the training set as static, overlooking the evolving nature of real-world graph data. This limitation leads to inefficiencies when condensing growing training sets. In this paper, we introduce GECC (Graph Evolving Clustering Condensation), a scalable graph condensation method designed to handle large-scale and evolving graph data. GECC employs a traceable and efficient approach by performing class-wise clustering on aggregated features. Furthermore, it can inherit previous condensation results as clustering centroids when the condensed graph expands, thereby attaining an evolving capability. This methodology is supported by robust theoretical foundations and demonstrates superior empirical performance. Comprehensive experiments show that GECC achieves better performance than most state-of-the-art graph condensation methods while delivering around a 1,000x speedup on large datasets.
... to tasks such as clustering, dimensionality reduction, and anomaly detection [56]. K-means is one popular clustering algorithm, which divides data into K clusters by minimizing the distance between each data point and its cluster center [57]. Hierarchical clustering forms a hierarchy of clusters through iterative merging or splitting [58]. ...
Article
Full-text available
Bridges are critical links for transportation networks, making their functionality and serviceability attract long-lasting research interests. Among various loads acting on bridges during their lifetime, vehicle/traffic loads are recognized as the most dominant, which excites extensive studies to understand the complex behavior of vehicle–bridge systems (VBS). Traditional methods for analyzing VBS face significant challenges due to the increasing demand for accuracy, efficiency, and adaptability. Recent advancements in machine learning (ML) offer promising solutions to these challenges, bearing great potential to develop intelligent vehicle–bridge systems (IVBS) that are imperative for future intelligent monitoring and maintenance of bridges. This paper reviews the current status of ML applications in VBS, highlighting how ML enhances vehicle load monitoring, bridge dynamic performance and reliability evaluation, and bridge damage identification. This paper also discusses the key challenges and associated countermeasures of integrating ML into VBS, attempting to provide the first seminal roadmap for building future IVBS.
... [CGK+19] identify leaders in time (k log n k) by sparsifying the instance via coresets and then determining the distance of leaders to the optimal solution in time (log n) k. Our overall approach significantly speeds up these two steps by way of distance sampling similar to k-means++ [AV07]. ...
Preprint
Full-text available
In the classical NP-hard metric k-median problem, we are given a set of n clients and centers with metric distances between them, along with an integer parameter k ≥ 1. The objective is to select a subset of k open centers that minimizes the total distance from each client to its closest open center. In their seminal work, Jain, Mahdian, Markakis, Saberi, and Vazirani presented the Greedy algorithm for facility location, which implies a 2-approximation algorithm for k-median that opens k centers in expectation. Since then, substantial research has aimed at narrowing the gap between their algorithm and the best achievable approximation by an algorithm guaranteed to open exactly k centers. During the last decade, all improvements have been achieved by leveraging their algorithm or a small improvement thereof, followed by a second step called bi-point rounding, which inherently increases the approximation guarantee. Our main result closes this gap: for any ε > 0, we present a (2+ε)-approximation algorithm for k-median, improving the previous best-known approximation factor of 2.613. Our approach builds on a combination of two algorithms. First, we present a non-trivial modification of the Greedy algorithm that operates with O(log n/ε²) adaptive phases. Through a novel walk-between-solutions approach, this enables us to construct a (2+ε)-approximation algorithm for k-median that consistently opens at most k + O(log n/ε²) centers. Second, we develop a novel (2+ε)-approximation algorithm tailored for stable instances, where removing any center from an optimal solution increases the cost by at least an Ω(ε³/log n) fraction. Achieving this involves a sampling approach inspired by the k-means++ algorithm and a reduction to submodular optimization subject to a partition matroid.
... • K-Means [49]. For K-Means, it is necessary to specify the number of clusters to be identified. ...
Article
Full-text available
Significant computational research has been dedicated to automatic key and mode detection in Western tonal music, particularly within the major and minor modes. However, limited research has focused on identifying alternative diatonic modes in traditional and folk music contexts. This paper addresses this gap by comparing the effectiveness of various preprocessing techniques in unsupervised machine learning for diatonic mode detection. Using a dataset of Irish folk music that incorporates diatonic modes such as Ionian, Dorian, Mixolydian, and Aeolian, we assess how different preprocessing approaches influence clustering accuracy and mode distinction. By examining multiple feature transformations and reductions, this study highlights the impact of preprocessing choices on clustering performance, aiming to optimize the unsupervised classification of diatonic modes in folk music traditions.
... If the triples belong to the same cluster, the effect of improvement propagates among them, even if the explicit relations are different. The clustered sampling process is depicted in Algorithm 3. First, we obtain the embeddings of all triples in the KG and cluster them using the k-means++ algorithm [35]. Then, we sample N pc triples from each cluster in order of their scores. ...
Article
Full-text available
Knowledge graphs are graph-structured data models that provide a robust scheme for representing real-world relational facts with structured triples. The structural and factual information in knowledge graphs are extensively leveraged in various downstream applications. Unfortunately, knowledge graphs often contain incorrect triples due to the automated extraction processes. Therefore, to ensure the reliability and usability of knowledge graphs, it is crucial to identify and rectify these incorrect triples. However, this is a challenging task, as the knowledge graph is a complex structure containing a vast number of triples composed of diverse entities, relations, and their complex interconnections. This paper proposes an effective method to enhance knowledge graph accuracy by introducing an active learning framework. The proposed framework integrates the advantages of machine-based models and human involvement to enable efficient and reliable improvement in knowledge graph accuracy. Additionally, the proposed method includes sampling strategies that consider the relation distribution in the knowledge graph to maximize the effectiveness of the active learning framework. Extensive experimental results demonstrate the effectiveness of the proposed active learning framework and sampling strategies in improving knowledge graph accuracy. Furthermore, this paper provides an exploration of the level of human involvement and a discussion of practical approaches to improve knowledge graph accuracy in real-world scenarios.
... Based on the collected customer behavior data (607,570 user sessions), clustering was performed using the K-means algorithm (with K-means++ [49] as the method for selecting initial values), four clusters (k = 4), and a maximum number of iterations equal to 300. The choice of clustering parameters was the result of a multi-criteria analysis based on the TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution) method [50]. ...
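For reference, a hedged scikit-learn sketch of the configuration quoted above (k-means++ initialisation, k = 4, at most 300 iterations). The `session_features` array is a random placeholder standing in for the behavioural features of the user sessions, which are not available here.

    import numpy as np
    from sklearn.cluster import KMeans

    session_features = np.random.default_rng(0).random((1000, 6))   # placeholder feature matrix
    km = KMeans(n_clusters=4, init="k-means++", max_iter=300, n_init=10, random_state=0)
    labels = km.fit_predict(session_features)                       # cluster id per user session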
Article
Full-text available
This paper proposes a new framework for personalizing e-commerce, which integrates multivariate user interfaces (MultiUI) with AI-generated content (AIGC). By utilizing customer behavioral data, our approach customizes both the visual layout and product descriptions for specific customer segments. This addresses the current research gap that often overlooks the synergy between UI design and content personalization. We conducted an empirical study to demonstrate the effectiveness of this integrated approach, showing that personalized user interface variants significantly improve customer engagement and conversion rates. In addition, we explore the potential of AIGC by using behavioral clusters to generate customized product descriptions. This showcases how AI can improve the relevance and appeal of product information, contributing to a more engaging and effective e-commerce experience. Although our initial findings using a simplified approach with ChatGPT are promising, future research will focus on refining AIGC models by incorporating domain-specific knowledge and leveraging comprehensive customer behavior data to generate highly tailored product descriptions. This research advances information processing in e-commerce by demonstrating how AI can be used to extract valuable insights from customer data, adapt UI designs, and generate personalized content, ultimately leading to more profitable online shopping experiences. Experimental studies showed that only about 10% of the most popular words were repeated in the product descriptions generated for the three different clusters. At the same time, two-thirds of the most popular words were dominant in only one of the clusters, confirming the satisfactory degree of matching descriptions to the specifics of customer groups.
... A list of all clustering methods considered, and the approaches we used for model selection, is given below: 1) K-means (KM): The classical clustering model, where we used the implementation in the R [8] package ClusterR [9], and the popular K-means++ initialisation [10]. We used 10 initialisations due to the randomness in the K-means++ method, and selected the number of clusters using the silhouette score [11]. ...
Preprint
Full-text available
A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/CNS
... The algorithm is sensitive to the choice of initial centroids, which can lead to convergence at suboptimal local minima. To address this, methods like k-means++ select initial centroids probabilistically, ensuring better initial separation between clusters [25,28]. Other strategies, such as hierarchical approaches that begin with a larger number of clusters and merge them iteratively, have been developed to improve initialization and cluster quality [26]. ...
Article
Full-text available
Electrofacies are log-related signatures that reflect specific physical and compositional characteristics of rock units. The concept was developed to encapsulate a collection of recorded well-log responses, enabling the characterization and differentiation of one rock unit from another. The analysis of the lateral and vertical distribution of electrofacies is crucial for understanding reservoir properties; however, well-log analysis can be labor-intensive, time-consuming, and prone to inaccuracies due to the subjective nature of the process. In addition, there is no unique way of reliably classifying logs or deriving electrofacies due to the varying accuracy of different methods. In this study, we develop a workflow that mitigates the variability in results produced by different clustering algorithms using a committee machine. Using several unsupervised machine learning methods, including k-means, k-median, hierarchical clustering, spectral clustering, and the Gaussian mixture model, we predict electrofacies from wireline well log data and generate their 3D vertical and lateral distributions and inferred geological properties. The results from the different methods are used to constitute a committee machine, which is then used to implement electrofacies-guided well placement. 3D distributed petrophysical properties are also computed from core-calibrated porosity and permeability data for reservoir simulation. The results indicate that wells producing from a specific electrofacies, as predicted by the committee machine, have significantly better production than wells producing from other electrofacies. This proposed detailed machine learning workflow allows for strategic decision-making in development and the practical application of these findings for improved oil recovery.
... However, this differs from the batch selection in our approach in that we choose a set of vectors within a budget B rather than choosing k vectors. Therefore, we propose a new selection method, which is inspired by the work [3] using the k-means++ seeding algorithm [2] as an approximation algorithm for k-DPP. ...
Preprint
Full-text available
Active learning (AL) is a label-efficient machine learning paradigm that focuses on selectively annotating high-value instances to maximize learning efficiency. Its effectiveness can be further enhanced by incorporating weak supervision, which uses rough yet cost-effective annotations instead of exact (i.e., full) but expensive annotations. We introduce a novel AL framework, Instance-wise Supervision-Level Optimization (ISO), which not only selects the instances to annotate but also determines their optimal annotation level within a fixed annotation budget. Its optimization criterion leverages the value-to-cost ratio (VCR) of each instance while ensuring diversity among the selected instances. In classification experiments, ISO consistently outperforms traditional AL methods and surpasses a state-of-the-art AL approach that combines full and weak supervision, achieving higher accuracy at a lower overall cost. This code is available at https://github.com/matsuo-shinnosuke/ISOAL.
... The choice of the number of clusters is crucial and is based on calculations to obtain ideal values that explain the maximum variability. The initial theory of K-Means was improved by Arthur and Vassilvitskii (2007) with the development of K-Means++, which is used for initializations in the K-Means implementation in the Python package "Scikit-Learn" (Pedregosa et al. 2011), utilized in the present study. ...
Article
Full-text available
The identification of weather patterns and associated surface waves for the Southwestern Atlantic Ocean is the goal of the present work. For this purpose, a K-means algorithm was adopted to group data into similar atmospheric conditions considering 25 years of reanalysis data (1993-2017) of zonal and meridional wind components and geopotential height at 1000 hPa. Three points (Vitoria, Santos and Rio Grande) along the Brazilian coast were chosen to evaluate the wave extremes and which Weather Patterns are associated with the extremes in these three different regions. The knee point detection method was used to determine the ideal number of centroids for representing the Weather Patterns at each point. The dates corresponding to each WP were used to plot the average wave field associated with each WP. The results indicate that WPs are dominated by both cyclones and anti-cyclones in the domain. Cyclones with a south/southwest fetch induce extreme waves in Santos and Rio Grande, while for Vitoria, extreme wave generation is more dominant due to the influence of the post-frontal high.
... We investigate our algorithmic contribution in two ways. First, we compare our method for task clustering with three common clustering methods: KMeans++ (Arthur & Vassilvitskii, 2006), DBScan (Khan et al., 2014), GMM (Bishop, 2006), as well as with random clustering. As shown in Table 3, the proposed approach outperforms all baselines. ...
Preprint
Full-text available
Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time. Typical approaches for these problems, such as multi-task and meta reinforcement learning, do not generalize well when the tasks are diverse. On the other hand, approaches that aim to tackle task diversity, such as using task embedding as policy context and task clustering, typically lack performance guarantees and require a large number of training tasks. To address these challenges, we propose a novel approach for learning a policy committee that includes at least one near-optimal policy with high probability for tasks encountered during execution. While we show that this problem is in general inapproximable, we present two practical algorithmic solutions. The first yields provable approximation and task sample complexity guarantees when tasks are low-dimensional (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide a provable sample complexity bound for few-shot learning. Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin.
... K-means clustering has been utilized in the architectural domain since the early 20th century. The algorithm is particularly recognized for its effectiveness in grouping similar geometries by minimizing within-cluster variation, making it suitable for applications requiring efficient classification and optimization (Arthur & Vassilvitskii, 2007). Previous researchers (Kontovourkis & Panayiotou, 2022) have introduced the methodology of using machine learning (ML) to classify and predict the design solution. ...
Conference Paper
Full-text available
Topological interlocking systems provide an innovative and sustainable solution for construction, offering reusable and adhesive-free assemblies. However, applying these systems to curved surfaces presents significant challenges due to the geometric complexity and the need for custom designs for each panel. This results in inefficiencies, increased fabrication costs, and higher chances of errors. This research introduces a methodology to address these challenges by optimizing the mesh through a form-finding process, clustering mesh edges using K-means, and generating interlocking interfaces. The optimal number of formworks is determined by clustering panels with similar edge lengths, significantly reducing fabrication complexity and minimizing material waste. The method also ensures the precision of panel shapes while maintaining design flexibility. The results demonstrate that applying k-means clustering can significantly reduce the number of custom formworks needed, leading to less material waste and improved efficiency in production. By addressing these challenges, this research contributes to more efficient and cost-effective methods for implementing topologically interlocking systems for curved surfaces.
... For inspections of those 3D shapes composed of 2D planes, the algorithm proposed by Phung et al. [22] splits each 2D target surface into uniform grids and plans an inspection viewpoint for each grid to inspect it at the required spatial resolution. Similarly, Asa et al. [20], [21] use K-means++ [35] to cluster the inspection target surfaces into as few patches as possible, each patch size equals the UAV's inspection range, and then plan an inspection viewpoint for each patch. All the above algorithms require UAVs to inspect the surface orthogonally. ...
Article
Full-text available
We propose a viewpoint generation algorithm for the visual inspection of civil engineering structures using robots equipped with a gimbal-mounted camera. This algorithm enables robots to completely inspect target surfaces at the required spatial resolution while reducing movement by effectively utilizing the gimbal. Previous studies generally assume that the camera is always oriented orthogonally to the target surface, and they focus on the inspection position optimization problem, resulting in the underutilization of gimbals and heavy movement of robots. In this study, firstly we have formulated the conditions under which the collected data’s quality does not deteriorate in terms of damage detection, and we have relaxed the orthogonal measuring restrictions. Based on the conditions, we propose a viewpoint generation algorithm that can carry out inspection work by changing measurement orientations instead of moving the robot. In simulation experiments, our method reduces inspection time by 47.7% compared to the benchmark method for the same tasks of inspecting a pier surface with the required spatial resolution. In addition, the viewpoint generation time was reduced by 96.8% compared to the benchmark method. We also conducted experiments in real environments using pan-tilt heads and confirmed that our algorithm can collect data qualified for the damage detection of visual inspections.
... One critical drawback lies in its sensitivity to the initial positioning of centroids and the necessity of predefining the number of clusters. To overcome these challenges, the K-means++ algorithm was developed to optimize centroid initialization, thereby improving its overall performance [6]. Alternatively, alternative algorithms such as DBSCAN are often preferred for more intricate analyses, particularly when dealing with irregularly shaped clusters, a frequent occurrence in nonlinear economic data [7]. ...
Article
Full-text available
This study examines the repercussions of the COVID-19 pandemic on Moroccan companies by employing the K-means clustering algorithm to classify them based on their performance. Owing to its efficiency, this algorithm excels in segmenting complex datasets, making it an ideal tool for clustering companies according to their size, sales volume, resilience, and adaptability to new economic realities. The literature indicates that sustainable governance practices are crucial in fostering resilience during crises. In this context, the study adopts a methodology that combines the K-means algorithm with data normalization techniques, which facilitate the creation of homogeneous groups of companies. The results reveal distinct clusters with varying sales performance and strategic orientations. On the one hand, high-performing companies tend to embrace digitization and diversification strategies, thereby reinforcing their resilience. On the other hand, clusters with weaker performance exhibit limited adoption of such measures, opting instead for approaches such as reducing working hours. These insights highlight the importance of adopting digital transformation and innovation as pivotal strategies to increase competitiveness. Ultimately, the study offers actionable recommendations to strengthen corporate governance and resilience, particularly in times of crisis.
... The batch set extracted by the above three methods, is obtained by selection of top-B score samples. Kmeans++ Vassilvitskii & Arthur (2006) and Coreset Khakham (2019); Sener & Savarese (2018) are diversity-based BMAL methods, and therefore applicable for B > 1. In Kmeans++, the batch samples are chosen as the closest points to each of the B centroids, and in Coreset, we ensure that the batch samples adequately represent the entire candidate pool based on the L 2 norm distance. ...
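A small sketch of the diversity-based batch selection described above, under the assumption that the candidate pool is available as a plain feature matrix: fit B clusters with k-means++ initialisation and return the pool point nearest to each centroid. The function and variable names are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin

    def select_batch(pool, B, seed=0):
        """Return indices of the pool points closest to each of B k-means++ centroids."""
        km = KMeans(n_clusters=B, init="k-means++", n_init=10, random_state=seed).fit(pool)
        return pairwise_distances_argmin(km.cluster_centers_, pool)   # one index per centroid

    pool = np.random.default_rng(0).random((500, 16))                 # placeholder candidate pool
    batch = select_batch(pool, B=8)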
... [1][2][3][4][5][6][7] Nonhierarchical clustering methods are widely used due to their ease of implementation and scalability. Techniques like k-means [1,8,9] and k-medoids [10] are frequently applied. Another method, Radial Threshold Clustering (RTC), was developed by Daura and his team. ...
Preprint
Full-text available
Unsupervised learning techniques play a pivotal role in unraveling protein folding landscapes, constructing Markov State Models, expediting replica exchange simulations, and discerning drug binding patterns, among other applications. A fundamental challenge in current clustering methods lies in how similarities among objects are accessed. Traditional similarity operations are typically only defined over pairs of objects, and this limitation is at the core of many performance issues. The crux of the problem in this field is that efficient algorithms like k-means struggle to distinguish between metastable states effectively. However, more robust methods like density-based clustering demand substantial computational resources. Extended similarity techniques have been proven to swiftly pinpoint high and low-density regions within the data in linear O(N) time. This offers a highly convenient means to explore complex conformational landscapes, enabling focused exploration of rare events or identification of the most representative conformations, such as the medoid of the dataset. In this contribution, we aim to bridge this gap by introducing a novel density clustering algorithm to the Molecular Dynamics Analysis with N-ary Clustering Ensembles (MDANCE) software package based on n-ary similarity framework.
... During training, we divided the two groups of images into a training set, a validation set, and a test set at a ratio of 6:3:1. Meanwhile, considering that the adaptive anchor boxes preset in YOLOv5 do not match the small and dim infrared target dataset used in this experiment, we used the k-means++ clustering algorithm [42] to recalculate the sizes of the anchor boxes before training to replace the initial anchor boxes. Figure 7 shows some sample images of representative backgrounds in the dataset. ...
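A hedged sketch of that anchor-recalculation step: cluster the ground-truth box (width, height) pairs with k-means++ initialisation and use the centroids as anchors. The box array and the nine-anchor setting are placeholders, and plain Euclidean distance is used here rather than the IoU-based distance often preferred for anchor clustering.

    import numpy as np
    from sklearn.cluster import KMeans

    wh = np.random.default_rng(0).random((2000, 2))                   # placeholder (width, height) pairs
    km = KMeans(n_clusters=9, init="k-means++", n_init=10, random_state=0).fit(wh)
    anchors = km.cluster_centers_[np.argsort(km.cluster_centers_.prod(axis=1))]  # anchors sorted by area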
Article
Full-text available
In recent years, the technology for detecting small and dim infrared targets has played a crucial role in both military and civilian security fields. Deep learning-based methods have also achieved remarkable progress in this area. However, it is still restricted by challenges such as small target size, low signal-to-noise ratio, and complex backgrounds. Therefore, this paper proposes an improved model IRE-YOLO based on You Only Look Once (YOLO) to enhance the detection accuracy of small targets. To improve the model’s feature extraction ability for targets, a receptive field enhancement module based on dilated convolution and shared weights is proposed. By expanding the receptive field of the feature map, it can extract the detailed features and local information of multi-scale targets. Secondly, to address the difficulties of small target size and low image resolution, a space-to-depth convolution is added to the backbone network. By converting the spatial dimension into the depth dimension, it can effectively capture the context information of small targets. In addition, to enhance the accuracy of the model for real detection boxes, this paper proposes an SNS algorithm, which can effectively remove redundant detection boxes. IRE-YOLO is compared and evaluated with other models on two public datasets, IDTA and SIRST. The experimental results show that compared with the baseline YOLOv5s, the mean average precision (mAP) of IRE-YOLO has increased by 2% and 2.1%, respectively, significantly improving the detection accuracy of small and dim infrared targets.
... To enhance the diversity of samples, we divided each test set into 50 clusters and randomly selected one commit from each cluster. We used the OpenAI embedding model [50] for vectorization and K-means++ [76] for clustering. In total, we sampled 400 <code diff, commit message> pairs to score. ...
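A rough sketch of that sampling scheme, assuming the commit embeddings have already been computed (the study used an OpenAI embedding model; any vectoriser would do for illustration): cluster into 50 groups with k-means++ initialisation, then draw one item per cluster.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(4000, 64))                          # placeholder commit embeddings
    km = KMeans(n_clusters=50, init="k-means++", n_init=10, random_state=0).fit(embeddings)
    picks = [int(rng.choice(np.where(km.labels_ == c)[0])) for c in range(50)]  # one commit per cluster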
Preprint
Commit messages concisely describe code changes in natural language and are important for software maintenance. Several approaches have been proposed to automatically generate commit messages, but they still suffer from critical limitations, such as time-consuming training and poor generalization ability. To tackle these limitations, we propose to borrow the weapon of large language models (LLMs) and in-context learning (ICL). Our intuition is based on the fact that the training corpora of LLMs contain extensive code changes and their pairwise commit messages, which makes LLMs capture the knowledge about commits, while ICL can exploit the knowledge hidden in the LLMs and enable them to perform downstream tasks without model tuning. However, it remains unclear how well LLMs perform on commit message generation via ICL. In this paper, we conduct an empirical study to investigate the capability of LLMs to generate commit messages via ICL. Specifically, we first explore the impact of different settings on the performance of ICL-based commit message generation. We then compare ICL-based commit message generation with state-of-the-art approaches on a popular multilingual dataset and a new dataset we created to mitigate potential data leakage. The results show that ICL-based commit message generation significantly outperforms state-of-the-art approaches on subjective evaluation and achieves better generalization ability. We further analyze the root causes for LLM's underperformance and propose several implications, which shed light on future research directions for using LLMs to generate commit messages.
... It relies on random search and probabilistic acceptance of new solutions to gradually converge to the optimal solution. Moreover, we also compare the clustering performance of these schemes with the K-means++ algorithm [51]. These algorithms are implemented by using Python 3.7 on a computer with an NVIDIA RTX 3080 Ti GPU and we run the quantum annealing algorithm on D-Wave quantum annealing machines (QAMs). ...
Preprint
In wireless communication networks, it is difficult to solve many NP-hard problems owing to computational complexity and high cost. Recently, quantum annealing (QA) based on quantum physics was introduced as a key enabler for solving optimization problems quickly. However, only some studies consider quantum-based approaches in wireless communications. Therefore, we investigate the performance of a QA solution to an optimization problem in wireless networks. Specifically, we aim to maximize the sum rate by jointly optimizing clustering, sub-channel assignment, and power allocation in a multi-unmanned aerial vehicle-aided wireless network. We formulate the sum rate maximization problem as a combinatorial optimization problem. Then, we divide it into two sub-problems: 1) a QA-based clustering and 2) sub-channel assignment and power allocation for a given clustering configuration. Subsequently, we obtain an optimized solution for the joint optimization problem by solving these two sub-problems. For the first sub-problem, we convert the problem into a simplified quadratic unconstrained binary optimization (QUBO) model. As for the second sub-problem, we introduce a novel QA algorithm with optimal scaling parameters to address it. Simulation results demonstrate the effectiveness of the proposed algorithm in terms of the sum rate and running time.
Preprint
Full-text available
Customer segmentation plays a key role in retail, enabling businesses to tailor marketing strategies, optimize resource allocation, and improve overall customer satisfaction. Traditional clustering methods, such as K-means and DBSCAN, have been widely employed for this purpose. However, each of these methods has its limitations, such as sensitivity to noise in DBSCAN or the assumption of spherical clusters in K-means. To address these limitations, this paper proposes a hybrid clustering approach that combines multiple clustering methods, including K-means, DBSCAN, Spectral Clustering, Fuzzy C-means, and Hierarchical Clustering. By combining the strengths of these methods, the hybrid approach delivers more accurate and nuanced segmentation results, especially in complex retail environments where customer behaviour is heterogeneous and data structures are intricate. The results show that the hybrid approach significantly improves segmentation accuracy compared to individual methods. The combined clustering techniques allow the model to detect both distinct, well-separated clusters and more nuanced, overlapping segments, providing a more comprehensive view of customer behaviors. This enhanced segmentation empowers retailers to make data-driven decisions, such as better-targeted marketing campaigns, more efficient inventory management, and personalized customer experiences. The paper concludes by discussing the potential for further research in applying hybrid clustering models to other sectors and integrating predictive analytics for even more refined customer insights.
Article
Full-text available
Deep-learning-based object detection algorithms play a pivotal role in various domains, including face detection, automatic driving, security monitoring, and industrial production. Compared with traditional object detection algorithms and two-stage object detection algorithms, the YOLO (You Only Look Once) series improved both detection speed and accuracy. In addition, the YOLO series of object detection algorithms are widely used in industrial fields due to their real-time and high-precision characteristics. This work summarizes the main versions of the YOLO series algorithms as well as their main improvements, and then analyzes their industrial application fields together with some application examples. Furthermore, this work summarizes the general improvement measures for industrial applications of the YOLO series algorithms. For comparison, this work implements basic tests of industrial application performance on different datasets. Finally, the development directions and challenges for YOLO series algorithms are pointed out.
Article
Full-text available
Integrated workflows for mineral resource estimation from exploration to mining must be able to process typical geodata (e.g., borehole data), perform data engineering (e.g., geodomaining), and spatial modeling (e.g., block modeling). Several methods exist; however, they can only handle individual subtasks and are either semi or fully automatable. Thus, an integrated workflow has not been established, which is needed to handle bigger geodata sets, perform remote monitoring, or provide short-term operational feedback. Bigger (more voluminous, higher velocity and higher dimensional) geodata sets are both emerging and anticipated in future exploration and mining operations, necessitating a geodata science counterpart to traditional, segregated, and routinely manual geostatistical workflows for resource estimation. In this paper, we demonstrate a prototype that integrates various data processing, pointwise geodomaining, domain boundary delineation, combinatorics-based visualization, and geostatistical modeling methods to create a modern resource estimation workflow. For the purpose of geodomaining, we employed a semi-automated, machine learning-based workflow to perform spatially aware geodomaining. We demonstrate the effectiveness of the method using actual mining data. This workflow makes use of methods that are properly geodata science-based as opposed to merely data science-based (explicitly leveraging the spatial aspects of data). The workflow achieves these benefits through the use of objective metrics and semi-automated modeling practices as part of geodata science (e.g., cross-validation), enabling high automation potential, practitioner-agnosticism, replicability, and objectivity. We also evaluate the integrated resource estimation workflow using a real dataset from the platiniferous Merensky Reef of the Bushveld Complex (South Africa) known for its high nugget effect.
Article
Full-text available
Speaker recognition is essential in smart voice applications for personal identification. Current state-of-the-art techniques primarily focus on ideal acoustic conditions. However, the traditional spectrogram struggles to differentiate between noise, reverberation, and speech. To overcome this challenge, MFCC can be replaced with the output from a self-supervised learning model. This study introduces a TDNN enhanced with a pre-trained model for robust performance in noisy and reverberant environments, referred to as PNR-TDNN. The PNR-TDNN employs HuBERT as its backbone, while the TDNN is an improved ECAPA-TDNN. The pre-trained model employs the Canopy/Mini Batch k-means++ strategy. In the TDNN architecture, several enhancements are implemented, including a cross-channel fusion mechanism based on Res2Net. Additionally, a non-average attention mechanism is applied to the pooling operation, focusing on the weight information of each channel within the Squeeze-and-Excitation Net. Furthermore, the contribution of individual channels to the pooling of time-domain frames is enhanced by substituting attentive statistics with multi-head attention statistics. Validated by zhvoice in noisy conditions, the minimized PNR-TDNN demonstrates a 5.19% improvement in EER compared to CAM++. In more challenging environments with noise and reverberation, the minimized PNR-TDNN further improves EER by 3.71% and 9.6%, respectively, and MinDCF by 3.14% and 3.77%, respectively. The proposed method has also been validated on the VoxCeleb1 and cn-celeb_v2 datasets, representing a significant breakthrough in the field of speaker recognition under challenging conditions. This advancement is particularly crucial for enhancing safety and protecting personal identification in voice-enabled microphone applications.
Article
The combined effects of extreme high temperatures and the urban heat island effect have emerged as significant challenges hindering sustainable urban development. Due to the spatial heterogeneity arising from imbalanced development in high-density Chinese cities, heat health risk (HHR) varies greatly across urban regions. To address this issue, this study focused on Qingdao, a northern coastal city in China, and proposed an assessment framework based on Local Climate Zones (LCZ) to evaluate HHR. Additionally, the Dagum Gini coefficient and Bayesian Quantile Regression were utilized to assess spatial inequality and driving factors. The results showed that: (1) Compact high-rise and mid-rise zones showed the highest HHR. Generally, higher population density and building concentration were associated with greater HHR. (2) The HHR distribution across the study area exhibited significant spatial inequality (Gini coefficient: 0.499), with the primary source of inequality arising from differences between LCZs (inter-group Gini coefficient: 0.329). (3) Medical facility density, gross domestic product, and cooling shelter density had significant influence in high-HHR areas (LCZ2, LCZ4, LCZ5-1, LCZ8). This study provided policymakers with a valuable tool to formulate targeted measures based on the characteristics of LCZs and risk assessment outcomes to better address the challenges posed by climate change and promote sustainable urban development.
Article
High-precision urban subsurface geophysical imaging is critical for city development, including urban construction, seismic hazard assessment, and renewable energy development. We investigate the top 5 km VS, sediment thickness, Moho depth, and crustal average VP/VS of Singapore using teleseismic P-wave polarizations and receiver functions from a nodal array and some permanent stations. We present the first sediment model of Singapore, which shows the thickest compacted sediments (∼1.2 km) in the west, negligible sediments in central area and some localized sediments in the east (∼0.8 km) and northwest (∼1.1 km). Based on the new crustal model and previous geological surveys, we discuss a possible tectonic evolutionary history of Singapore, which is related to the subduction of the Paleo-Tethys slab. The observed low-VS anomaly beneath the Sembawang hot spring reveals its possible deep heat source, indicating a potential source of geothermal energy in Singapore. In addition, ground-motion amplification variations caused by local site conditions across Singapore are estimated, showing that areas of soft sediment, especially the reclaimed land in the east, have the highest seismic risk.
Article
Meiyu, featuring prolonged periods of rainfall over the Yangtze–Huai River basin (YHRB), not only replenishes water resources and sustains ecological balance, but also poses potential disaster risks. Accurate early forecasting of Meiyu is crucial for effectively implementing flood prevention strategies. To help refine numerical models and providing guidance for operational forecasters, this study explores the capabilities of two global ensemble prediction systems (GEPSs) of the China Meteorological Administration (CMA) and the ECMWF in forecasting the Meiyu characteristics in 2023 over the YHRB. Results show that the ECMWF GEPS reasonably forecasts the Meiyu rainfall, while the CMA GEPS presents a notable underestimation. The predictable lead time of the Meiyu onset date is eight days by the ECMWF GEPS and six days by the CMA GEPS, respectively. Regarding the regional rainstorm processes, the two GEPSs generally provide a predictable lead time of 48–168 h for reasonably forecasting the patterns of the heavy rainfall area. To further examine their strengths and weaknesses in Meiyu forecasting, this paper revisits their abilities in forecasting key influence systems. By verifying against their respective analyses, it is demonstrated that the ECMWF GEPS reasonably forecasts the spatial coverages of the northwestern Pacific subtropical high (NWPSH) and the South Asian high (SAH), whereas the CMA GEPS presents substantial underestimation. Both GEPSs show generally southward deviations for the eastern ridge line position (RLP) of the SAH, and exhibit a northward deviation for the western RLP of the NWPSH during early forecast lead times. The less Meiyu rainfall predicted by the CMA GEPS compared to the ECMWF GEPS can be attributed to its weaker low-level convergence belt and weaker upper-level divergence area. A deeper exploration into these forecast discrepancies in upper-level divergence and lower-level convergence suggests that they likely originate from their initial analysis fields.
Article
The rapid growth of digital repositories containing textual data, such as research articles, news stories, and reviews, necessitates effective clustering techniques for categorization and information retrieval. However, traditional clustering methods like KMeans and DBSCAN often struggle with high-dimensional, sparse text data. This paper proposes an approach that leverages BERT embeddings to enhance clustering performance. By integrating BERT embeddings with K-Means and DBSCAN, we improve clustering accuracy by capturing the semantic richness of textual data. Experimental results on the BBC Full Text dataset demonstrate superior clustering performance, achieving a 91% accuracy based on Purity and Adjusted Rand Index (ARI). Furthermore, an interactive visualization component is introduced to aid in the interpretation of clustered data.
Article
The aim of this paper is to compare the level of quality of life across Polish voivodeships. The study was carried out using a hierarchical clustering method (Ward's method) and a non-hierarchical clustering method (the k-means method), and a ranking of voivodeships was built using TOPSIS. These algorithms were used to form groups of voivodeships with a similar level of quality of life in 2010 and 2022, respectively. According to the results of the analysis based on the selected indicators, the highest level of quality of life was recorded in the Mazowieckie, Śląskie, and Małopolskie voivodeships, while lower levels were recorded in, among others, the Zachodniopomorskie and Podlaskie voivodeships.
Article
Full-text available
Despite its critical function in the immune system and accumulating evidence of immunological abnormalities in schizophrenia, the thymus has long been overlooked. We aimed to investigate thymic morphological alterations and their corresponding heterogeneity in patients with schizophrenia. Imaging-derived thymic morphology was assessed and compared between 419 patients with schizophrenia and 460 age- and sex-matched control participants aged 16–40 years who underwent chest computed tomography (CT) scanning. These included measurements reflecting thymic size and density, such as average maximal thickness, anteroposterior distance, and average CT attenuation, which were also used to identify patient subtypes based on an unsupervised machine learning algorithm. Once the thymus-based patient subtypes were identified, between-subtype differences in the thymic and blood immunometabolic profiles were further tested. Case-control comparisons revealed that patients had greater average maximal thickness (Glass’s delta [Δ] effect size = 0.37) but lower average CT attenuation (Δ = −0.18) in the thymus than controls. Two thymus-based subtypes with disparate thymic and blood immunometabolic profiles were identified. Specifically, Subtype 1 (containing 40.1% of patients) was characterized as greater average maximal thickness (Δ = 1.36) and longer anteroposterior distance (Δ = 0.71) but lower CT attenuation (Δ = −0.97), contrary to the abnormal patterns of Subtype 2. Furthermore, Subtype 1 had higher levels of blood immunometabolic profiles, such as lymphocyte count and lipid measures, than Subtype 2. Altered thymic morphology with considerable heterogeneity was first reported in schizophrenia, providing evidence for the immune hypothesis and facilitating the discovery of imaging biomarkers reflecting the immunometabolic status.
Chapter
Clustering algorithms are used extensively in data analysis for data exploration and discovery. Technological advancements lead to continual growth of data in terms of volume, dimensionality and complexity. This provides great opportunities in data analytics, as the data can be interrogated for many different purposes. This, however, leads to challenges, such as identification of relevant features for a given task. In supervised tasks, one can utilise a number of methods to optimise the input features for the task objective (e.g. classification accuracy). In unsupervised problems, such tools are not readily available, in part due to an inability to quantify feature relevance in unlabelled tasks. In this paper, we investigate the sensitivity of clustering performance using noisy uncorrelated variables iteratively added to baseline datasets with well-defined clusters. The clustering quality is evaluated using labelled and unlabelled metrics, covering a range of dimensionalities in the baseline data to understand the impact of irrelevant features on popular clustering metrics. We show how different types of irrelevant variables can impact the outcome of a clustering result from k-means in different ways. We observe a resilience to very high proportions of irrelevant features for adjusted Rand index (ARI) and normalised mutual information (NMI), when the irrelevant features are Gaussian distributed. For uniformly distributed irrelevant features, we notice the resilience of ARI and NMI is dependent on the dimensionality of the data and exhibits tipping points between high scores and near zero. Our results show that the Silhouette Coefficient and the Davies–Bouldin score are the most sensitive to irrelevant added features, exhibiting large changes in score for comparably low proportions of irrelevant features regardless of underlying distribution or data scaling. As such, the Silhouette Coefficient and the Davies–Bouldin score are good candidates for optimising feature selection in unsupervised clustering tasks. Finally, we observe that standardising and mean centering the data prior to clustering removes the discrepancies between Gaussian and uniformly distributed irrelevant features and in general reduces variability in metrics between repeated cluster runs.
Article
Full-text available
Machine learning (ML) is a technological paradigm that has transformed diverse areas of society, from clinical diagnostics to algorithmic trading and from personalized media delivery to self-driving cars. In this article, we review the history of ML algorithms in chronological order. This provides a broad perspective on how various ML techniques evolved, the theoretical underpinnings that support them, and the potential of forthcoming research in this branch of computational intelligence. Tracing this history helps us appreciate how the interplay between theoretical innovations, the expanding availability of computational resources, and the demands of complex real-world problems has shaped the field. This evolution tells the story of how ML has become such an influential driver of much of the technology we see today. Lastly, the article discusses the interaction among improvements in algorithm design, computational infrastructure, and the challenges posed by data tasks, paving the way for further advances in the years to come.
Conference Paper
Well logs, crucial for drilling and post-drilling analysis, provide continuous measurements of subsurface formations as a function of depth. Logging while drilling (LWD) and electrical wireline logs (EWL) are commonly used techniques for well-log acquisition. Both methods are prone to depth measurement errors due to various factors, so the logs need to be aligned to a common depth scale for subsequent analysis. This study compares two automated machine learning approaches for aligning repeated measurements of the same parameters from LWD and EWL logs of the same well. The first approach is based on supervised learning and the second on unsupervised learning. The supervised approach trains a 1D convolutional neural network (1D CNN) classification model on actual well-log data from the Norwegian North Sea, using LWD-EWL pairs of log slices. A specific depth discrepancy is introduced for each pair, and the logs are divided into various classes based on the depth error between them. The unsupervised method combines autoencoders and K-means clustering to identify potential lithological boundaries in EWL and LWD multiparameter log data. These predicted boundaries are validated by requiring a maximal Pearson correlation. The performance of the classification model is evaluated using metrics such as accuracy, precision, and recall. The optimal number of clusters for K-means clustering is identified using silhouette scores and the elbow method. Depth alignment is verified through visual inspection, correlation analysis, and Euclidean distances between logs. Both the supervised and unsupervised approaches significantly improve the alignment of various logs, such as bulk density, deep resistivity, sonic compressional, and neutron porosity, and both outperform maximization of cross-correlation for specific logs, such as deep resistivity and neutron porosity. These results highlight the potential of machine learning for efficient and accurate depth alignment of well logs, with promising implications for enhancing drilling and post-drilling analysis in the oil and gas industry.
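One step in the unsupervised pipeline above, choosing the number of clusters for K-means via silhouette scores and the elbow method, can be sketched as follows; the log matrix here is a random placeholder for real (or autoencoder-compressed) log curves, and the range of k is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical multiparameter log matrix (depth samples x log curves),
# e.g. bulk density, deep resistivity, sonic, neutron porosity.
rng = np.random.default_rng(1)
logs = rng.normal(size=(2000, 4))              # placeholder data

inertia, sil = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(logs)
    inertia[k] = km.inertia_                   # within-cluster SSE for the elbow plot
    sil[k] = silhouette_score(logs, km.labels_)

best_k = max(sil, key=sil.get)                 # k with the highest average silhouette
print(inertia, sil, best_k)
```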
Article
Full-text available
Clustering is one of the main mathematical challenges in large-scale gene expression analysis. We describe a clustering procedure based on a sequential k-means algorithm with additional refinements that is able to handle high-throughput data on the order of hundreds of thousands of data items measured on hundreds of variables. The practical motivation for our algorithm is oligonucleotide fingerprinting, a method for simultaneous determination of the expression level of every active gene of a specific tissue, although the algorithm can be applied as well to other large-scale projects like EST clustering and qualitative clustering of DNA-chip data. As a pairwise similarity measure between two p-dimensional data points, x and y, we introduce mutual information, which can be interpreted as the amount of information about x contained in y, and vice versa. We show that for our purposes this measure is superior to commonly used metric distances, for example, Euclidean distance. We also introduce a modified version of mutual information as a novel method for validating clustering results when the true clustering is known. The performance of our algorithm with respect to experimental noise is shown by extensive simulation studies. The algorithm is tested on a subset of 2029 cDNA clones coming from 15 different genes from a cDNA library derived from human dendritic cells. Furthermore, the clustering of these 2029 cDNA clones is demonstrated when the entire set of 76,032 cDNA clones is processed.
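The central ingredient above, mutual information as a pairwise similarity between two profiles, can be estimated from a joint histogram. The following sketch is a rough plug-in estimator on synthetic profiles; the bin count and data are arbitrary assumptions, not the paper's discretization.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Empirical mutual information (in bits) between two equally long
    profiles, estimated from a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + 0.3 * rng.normal(size=300)            # correlated profile: high similarity
c = rng.normal(size=300)                      # unrelated profile: low similarity
print(mutual_information(a, b), mutual_information(a, c))
```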
Article
The k-means method is an old but popular clustering algorithm known for its observed speed and its simplicity. Until recently, however, no meaningful theoretical bounds were known on its running time. In this paper, we demonstrate that the worst-case running time of k-means is superpolynomial by improving the best known lower bound from Ω(n) iterations to 2^Ω(√n).
Article
For a partition of an n-point set into k subsets (clusters) S_1, S_2, ..., S_k, we consider the cost function ∑_{i=1}^{k} ∑_{x ∈ S_i} ‖x − c(S_i)‖², where c(S_i) denotes the center of gravity of S_i. For k = 2 and for any fixed d and ε > 0, we present a deterministic algorithm that finds a 2-clustering with cost no worse than (1+ε)-times the minimum cost in time O(n log n); the constant of proportionality depends polynomially on ε. For an arbitrary fixed k, we get an O(n log^k n) algorithm for a fixed ε, again with a polynomial dependence on ε.
Article
In k-means clustering we are given a set of n data points in d-dimensional space and an integer k, and the problem is to determine a set of k points in R^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9−ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.
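A toy version of the swap-based local search described above (not the authors' exact procedure, and without the candidate-set machinery behind the (9+ε) guarantee) might look like the following sketch; swap candidates are subsampled purely to keep the example fast.

```python
import numpy as np

def kmeans_cost(X, centers):
    # squared distance from every point to its nearest center, summed
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d.min(axis=1).sum()

def single_swap_heuristic(X, k, seed=0, max_rounds=50, n_candidates=20):
    """Toy single-swap local search: replace one center with one data point
    whenever that strictly lowers the k-means cost."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    cost = kmeans_cost(X, centers)
    for _ in range(max_rounds):
        improved = False
        for i in range(k):
            for x in X[rng.choice(len(X), n_candidates, replace=False)]:
                trial = centers.copy()
                trial[i] = x
                trial_cost = kmeans_cost(X, trial)
                if trial_cost < cost:
                    centers, cost, improved = trial, trial_cost, True
        if not improved:
            break
    return centers, cost

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(200, 2)) for c in ((0, 0), (5, 0), (0, 5))])
print(single_swap_heuristic(X, k=3)[1])
```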
Conference Paper
We study clustering problems in the streaming model, where the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storage space. Our main result is a randomized algorithm for the k-Median problem which produces a constant factor approximation in one pass using storage space O(k polylog n). This is a significant improvement over the previous best algorithm, which yielded a 2^O(1/ε)-approximation using O(n^ε) space. Next we give a streaming algorithm for the k-Median problem with an arbitrary distance function. We also study algorithms for clustering problems with outliers in the streaming model. Here, we give bicriterion guarantees, producing constant factor approximations by increasing the allowed fraction of outliers slightly.
Conference Paper
Let k be a fixed integer. We consider the problem of partitioning an input set of points endowed with a distance function into k clusters. We give polynomial time approximation schemes for the following three clustering problems: Metric k-Clustering, ℓ₂² k-Clustering, and ℓ₂² k-Median. In the k-Clustering problem, the objective is to minimize the sum of all intra-cluster distances. In the k-Median problem, the goal is to minimize the sum of distances from points in a cluster to the (best choice of) cluster center. In metric instances, the input distance function is a metric. In ℓ₂² instances, the points are in R^d and the distance between two points x, y is measured by ‖x − y‖₂² (notice that (R^d, ‖·‖₂²) is not a metric space). For the first two problems, our results are the first polynomial time approximation schemes. For the third problem, the running time of our algorithms is a vast improvement over previous work.
Conference Paper
In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a point set P in R^d, one can compute a weighted set S ⊆ P, of size O(k ε^−d log n), such that one can compute the k-median/k-means clustering on S instead of on P and get a (1+ε)-approximation. As a result, we improve the fastest known algorithms for (1+ε)-approximate k-means and k-median. Our algorithms have linear running time for fixed k and ε. In addition, we can maintain the (1+ε)-approximate k-median or k-means clustering of a stream when points are only being inserted, using polylogarithmic space and update time.
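The coreset idea, replacing P with a small weighted subset S on which clustering is nearly as good, can be illustrated with a much simpler importance-sampling construction than the grid-based one in the paper; the sampling rule, sizes, and data below are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def simple_coreset(X, k, m, seed=0):
    """Very simplified importance-sampling coreset (not the paper's grid-based
    construction): sample points with probability proportional to a mix of
    their squared distance to a rough k-means solution and a uniform term,
    and weight each sample by the inverse of its sampling probability."""
    rng = np.random.default_rng(seed)
    rough = KMeans(n_clusters=k, n_init=1, max_iter=5, random_state=seed).fit(X)
    d2 = ((X - rough.cluster_centers_[rough.labels_]) ** 2).sum(axis=1)
    p = 0.5 * d2 / d2.sum() + 0.5 / len(X)
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])
    return X[idx], weights

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(2000, 2)) for c in ((0, 0), (4, 0), (0, 4))])
S, w = simple_coreset(X, k=3, m=200)
# cluster the small weighted set instead of the full point set
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(S, sample_weight=w)
print(km.cluster_centers_)
```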
Conference Paper
In many applications it is desirable to cluster high dimensional data along various subspaces, which we refer to as projective clustering. We propose a new objective function for projective clustering, taking into account the inherent trade-off between the dimension of a subspace and the induced clustering error. We then present an extension of the k-means clustering algorithm for projective clustering in arbitrary subspaces, and also propose techniques to avoid local minima. Unlike previous algorithms, ours can choose the dimension of each cluster independently and automatically. Furthermore, experimental results show that our algorithm is significantly more accurate than the previous approaches.
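A minimal sketch of the core loop behind this kind of projective clustering, alternating between fitting a low-dimensional subspace to each cluster and reassigning points to the nearest subspace, is given below; unlike the algorithm in the paper it uses a fixed, common subspace dimension and makes no attempt to choose dimensions automatically or to escape local minima.

```python
import numpy as np

def k_subspaces(X, k, dim, iters=20, seed=0):
    """Toy k-subspaces loop: fit a dim-dimensional affine subspace to each
    cluster via SVD, then reassign points to the subspace with the smallest
    squared residual. Random initial assignment; a careful initialization
    would be needed in practice."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(iters):
        bases, means = [], []
        for j in range(k):
            pts = X[labels == j]
            if len(pts) == 0:                       # refill an empty cluster
                pts = X[rng.integers(0, len(X), size=dim + 1)]
            mu = pts.mean(axis=0)
            _, _, vt = np.linalg.svd(pts - mu, full_matrices=False)
            means.append(mu)
            bases.append(vt[:dim])                  # principal directions
        resid = np.empty((len(X), k))
        for j in range(k):
            d = X - means[j]
            proj = d @ bases[j].T @ bases[j]
            resid[:, j] = np.linalg.norm(d - proj, axis=1) ** 2
        labels = resid.argmin(axis=1)
    return labels

rng = np.random.default_rng(0)
t = rng.normal(size=(300, 1))
# two clusters lying near different 1-D subspaces of R^3
X = np.vstack([np.hstack([t, 2 * t, 0 * t]),
               np.hstack([0 * t, t, -t]) + 5])
X += 0.05 * rng.normal(size=X.shape)
print(np.bincount(k_subspaces(X, k=2, dim=1)))
```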
Conference Paper
We present the first linear time (1+ε)-approximation algorithm for the k-means problem for fixed k and ε. Our algorithm runs in O(nd) time, which is linear in the size of the input. Another feature of our algorithm is its simplicity: the only technique involved is random sampling.
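The random-sampling idea can already be seen in the 1-means case: the centroid of a small uniform sample is, in expectation, within a (1 + 1/m) factor of the optimal cost, a classical sampling lemma that such algorithms build on. A quick numerical check of this fact, with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(loc=(3.0, -1.0), scale=1.0, size=(100_000, 2))

true_centroid = cluster.mean(axis=0)
true_cost = ((cluster - true_centroid) ** 2).sum()

# Centroid of a small uniform sample (with replacement): in expectation its
# cost is within a (1 + 1/m) factor of the optimal 1-means cost.
for m in (4, 16, 64):
    sample = cluster[rng.choice(len(cluster), m, replace=True)]
    cost = ((cluster - sample.mean(axis=0)) ** 2).sum()
    print(m, round(cost / true_cost, 4))
```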
Conference Paper
We present polynomial upper and lower bounds on the number of iterations performed by the k-means method (a.k.a. Lloyd’s method) for k-means clustering. Our upper bounds are polynomial in the number of points, number of clusters, and the spread of the point set. We also present a lower bound, showing that in the worst case the k-means heuristic needs to perform Ω(n) iterations, for n points on the real line and two centers. Surprisingly, the spread of the point set in this construction is polynomial. This is the first construction showing that the k-means heuristic requires more than a polylogarithmic number of iterations. Furthermore, we present two alternative algorithms, with guaranteed performance, which are simple variants of the k-means method. Results of our experimental studies on these algorithms are also presented.
Article
It has long been realized that in pulse-code modulation (PCM), with a given ensemble of signals to handle, the quantum values should be spaced more closely in the voltage regions where the signal amplitude is more likely to fall. It has been shown by Panter and Dite that, in the limit as the number of quanta becomes infinite, the asymptotic fractional density of quanta per unit voltage should vary as the one-third power of the probability density per unit voltage of signal amplitudes. In this paper the corresponding result for any finite number of quanta is derived; that is, necessary conditions are found that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy. The optimization criterion used is that the average quantization noise power be a minimum. It is shown that the result obtained here goes over into the Panter and Dite result as the number of quanta becomes large. The optimum quantization schemes for 2^b quanta, b = 1, 2, ..., 7, are given numerically for Gaussian and for Laplacian distributions of signal amplitudes.
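The necessary conditions derived in this paper correspond to the two alternating steps of what is now called the Lloyd-Max quantizer: each quantum is the conditional mean of its interval, and each interval boundary is the midpoint between adjacent quanta. A sample-based sketch for four quanta (b = 2) on a Gaussian source, with arbitrary initial levels:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=200_000)            # stand-in for the Gaussian source

levels = np.array([-1.5, -0.5, 0.5, 1.5])     # initial quanta (arbitrary)
for _ in range(100):
    thresholds = (levels[:-1] + levels[1:]) / 2          # midpoint condition
    bins = np.digitize(samples, thresholds)
    levels = np.array([samples[bins == j].mean()         # centroid condition
                       for j in range(len(levels))])

thresholds = (levels[:-1] + levels[1:]) / 2
noise_power = ((samples - levels[np.digitize(samples, thresholds)]) ** 2).mean()
print(levels.round(3), round(noise_power, 4))
```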
Article
In this paper, we first draw a connection between a level set algorithm and k-Means plus nonlinear diffusion preprocessing. Then, we exploit this link to develop a new hybrid numerical technique for segmentation that draws on the speed and simplicity of k-Means procedures, and the robustness of level set algorithms. The proposed method retains spatial coherence on initial data characteristic of curve evolution techniques, as well as the balance between a pixel/voxel's proximity to the curve and its intention to cross over the curve from the underlying energy. However, it is orders of magnitude faster than standard curve evolutions. Moreover, it does not suffer from the limitations of k-Means due to inaccurate local minima and allows for segmentation results ranging from k-Means clustering type partitioning to level set partitions.
Article
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification: it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems. They are the subject of this survey.
Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA ’05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 877–885, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.