Article

The Complexity of the Generalized Lloyd-Max Problem

Authors: M. R. Garey, D. S. Johnson, H. S. Witsenhausen

Abstract

A simple (combinatorial) special case of the generalized Lloyd-Max (or quantization) problem is shown to be nondeterministic polynomial (NP)-complete. A fortiori, the general problem of communication theory, in its combinatorial forms, has at least that complexity.
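For readers unfamiliar with the problem, here is a sketch of the standard quantization (Lloyd-Max) formulation the abstract refers to, written in our own notation rather than the paper's:

\min_{q_1,\dots,q_K \in \mathbb{R}} \; \mathbb{E}\!\left[ \min_{1 \le i \le K} (X - q_i)^2 \right],

where X is the source random variable and q_1, ..., q_K are the reproduction levels; the combinatorial (data-driven) variant replaces the expectation by an average over a finite training set, and the generalized problem allows vector-valued sources and more general distortion measures.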

... Some of the automatic grouping problems can be considered from the point of view of location problems. The most popular k-means clustering model [10] can be described by Equation (1) where L (·,·) is the squared Euclidean distance between two points. The existence of a trivial non-iterative solution of the corresponding Weber problem with the squared Euclidean distances [11] makes the k-means problem one of the most popular optimization models for clustering. ...
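For context (the cited equation (1) is not reproduced on this page), the k-means objective and the closed-form single-center solution mentioned above can be sketched as follows, in notation of our own choosing:

\min_{C_1,\dots,C_k;\ \mu_1,\dots,\mu_k} \; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\arg\min_{\mu} \sum_{x \in C_j} \lVert x - \mu \rVert^2 \;=\; \frac{1}{|C_j|} \sum_{x \in C_j} x,

i.e., for squared Euclidean distances the one-center Weber problem is solved non-iteratively by the centroid, which is what makes the k-means model computationally attractive.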
... [Results table: objective-function values achieved by Lloyd (multistart), j-means (SWAP 1 + Lloyd), and AGGL r algorithms for r ∈ {1, 2, 3, 5, 7, 10, 12, 15, 20, 25, 30, 50, 75, 100}; the tabular layout was lost in extraction.] ...
Article
Full-text available
The continuous p-median problem (CPMP) is one of the most popular and widely used models in location theory that minimizes the sum of distances from known demand points to the sought points called centers or medians. This NP-hard location problem is also useful for clustering (automatic grouping). In this case, sought points are considered as cluster centers. Unlike similar k-means model, p-median clustering is less sensitive to noisy data and appearance of the outliers (separately located demand points that do not belong to any cluster). Local search algorithms including Variable Neighborhood Search as well as evolutionary algorithms demonstrate rather precise results. Various algorithms based on the use of greedy agglomerative procedures are capable of obtaining very accurate results that are difficult to improve on with other methods. The computational complexity of such procedures limits their use for large problems, although computations on massively parallel systems significantly expand their capabilities. In addition, the efficiency of agglomerative procedures is highly dependent on the setting of their parameters. For the majority of practically important p-median problems, one can choose a very efficient algorithm based on the agglomerative procedures. However, the parameters of such algorithms, which ensure their high efficiency, are difficult to predict. We introduce the concept of the AGGLr neighborhood based on the application of the agglomerative procedure, and investigate the search efficiency in such a neighborhood depending on its parameter r. Using the similarities between local search algorithms and (1 + 1)-evolutionary algorithms, as well as the ability of the latter to adapt their search parameters, we propose a new algorithm based on a greedy agglomerative procedure with the automatically tuned parameter r. Our new algorithm does not require preliminary tuning of the parameter r of the agglomerative procedure, adjusting this parameter online, thus representing a more versatile computational tool. The advantages of the new algorithm are shown experimentally on problems with a data volume of up to 2,000,000 demand points.
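As a point of reference for the abstract above, the continuous p-median objective can be written (in our notation, which may differ from the paper's) as:

\min_{c_1,\dots,c_p \in \mathbb{R}^d} \; \sum_{i=1}^{N} \min_{1 \le j \le p} \lVert a_i - c_j \rVert,

where a_1, ..., a_N are the demand points and the p sought centers (medians) are unconstrained points of the space; unlike k-means, the distances are not squared, which is why the model is less sensitive to outliers.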
... K-means clustering is an NP-hard problem, but we can approximate the optimal solution [44,45]. Note that the brute-force approach of examining all possibilities is impractical [44]. ...
... K-means clustering is an NP-hard problem, but we can approximate the optimal solution [44,45]. Note that the brute-force approach of examining all possibilities is impractical [44]. The classical solution to this problem includes Lloyd's algorithm, an efficient heuristic algorithm that locates a local optimum [46]. ...
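For concreteness, here is a minimal NumPy sketch of Lloyd's heuristic referred to in these excerpts; function and variable names are illustrative and not taken from any cited implementation:

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=None):
    # Plain Lloyd iterations: assign each point to its nearest centroid, then recompute means.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # assignment step: squared Euclidean distance to every centroid
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # no further change: a local optimum of the SSE objective
        centroids = new_centroids
    return centroids, labels

Because this procedure only converges to a local optimum, it is typically restarted from several random initializations, as the multistart results quoted above illustrate.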
Thesis
Full-text available
Recent work has proposed solving the k-means clustering problem in quantum computers via the Quantum Approximate Optimization Algorithm (QAOA) and coreset techniques. QAOA is a variational quantum algorithm that tackles k-means clustering by translating it into a form of MAX-CUT problem. In addition, the coreset from computational geometry is a data-reduction approach approximating a dataset by a much smaller weighted subset. However, the current method only proves the feasibility of quantum k-means clustering without guaranteeing high accuracy and stability on various datasets. Therefore, we aim to design a new quantum framework that yields comparable results to the classical method in k-means clustering. To achieve this, we compared the performance of solving the clustering problem with existing coreset techniques and quantum variational algorithms. Most importantly, this work proposes solving the k-means clustering problem with the variational quantum eigensolver (VQE) and a novel coreset, the Contour coreset. VQE is a quantum-classical hybrid optimization algorithm based on variational principles. We demonstrate that VQE is an alternative method to QAOA but exhibits higher accuracy in the clustering problem. Moreover, we provide a novel coreset construction algorithm for the Contour coreset, which is tailor-made for inputting large datasets into noisy intermediate-scale quantum (NISQ) computers with limited qubits. The construction algorithm is faster than existing coreset algorithms, and the Contour coreset can better represent large datasets compared to the state-of-the-art coresets for k-means clustering. To the best of our knowledge, this is the first coreset algorithm built for quantum computing. Extensive experiments with synthetic and real-life data demonstrated that our approach outperforms existing quantum k-means clustering approaches with higher accuracy and lower standard deviation.
... In many practical situations, the distribution f_{P0}(p_0) is not available, but data drawn from it is available. The optimal design of quantizers from data is NP-hard [17], [18]. However, the Lloyd-Max algorithm and its close cousin K-means can be used on data with the Bayes risk error fidelity criterion. ...
... dp_0 is substituted into the high-rate distortion approximation for the MBRE criterion (17). Taking R = log_2(K), there is a constant gap between the rates using the MBRE point density and the MAE point density for all distortion values. ...
Preprint
Bayesian hypothesis testing is investigated when the prior probabilities of the hypotheses, taken as a random vector, are quantized. Nearest neighbor and centroid conditions are derived using mean Bayes risk error as a distortion measure for quantization. A high-resolution approximation to the distortion-rate function is also obtained. Human decision making in segregated populations is studied assuming Bayesian hypothesis testing with quantized priors.
... The k-means problem is a continuous unconstrained global optimization problem which has become a classic clustering model. This problem is proved to be NP-hard [1,2], so it is necessary to find a compromise between the computation time and the solution preciseness. The aim of the problem is to find a set S = {X_1, . . . ...
... In [45], the authors refer to their algorithm as "Evolutionary k-Means." However, they actually solved an alternative problem which aimed to increase the clustering stability instead of minimizing (1). Their algorithm operates with binary consensus matrices and uses two types of mutation genetic operators: cluster split (dissociative) and cluster merge (agglomerative) mutation. ...
Article
Full-text available
The k-means problem is one of the most popular models of cluster analysis. The problem is NP-hard, and modern literature offers many competing heuristic approaches. Sometimes practical problems require obtaining such a result (albeit not exact), within the framework of the k-means model, which would be difficult to improve by known methods without a significant increase in the computation time or computational resources. In such cases, genetic algorithms with a greedy agglomerative heuristic crossover operator might be a good choice. However, their computational complexity makes it difficult to use them for large-scale problems. The crossover operator, which includes the k-means procedure and takes the absolute majority of the computation time, is essential for such algorithms, and other genetic operators such as mutation are usually eliminated or simplified. The importance of maintaining the population diversity, in particular with the use of a mutation operator, is more significant with an increase in the data volume and available computing resources such as graphical processing units (GPUs). In this article, we propose a new greedy heuristic mutation operator for such algorithms and investigate the influence of new and well-known mutation operators on the objective function value achieved by the genetic algorithms for large-scale k-means problems. Our computational experiments demonstrate the ability of the new mutation operator, as well as the mechanism for organizing subpopulations, to improve the result of the algorithm.
... This is however beyond the scope of this paper. We note that the classic (non-compressed) k-means problem by minimization of the empirical risk is known to be NP-hard [Garey et al., 1982, Aloise et al., 2009] and that guarantees for approaches such as K-means++ [Arthur and Vassilvitskii, 2007] are only in expectation and with a logarithmic sub-optimality factor. ...
Preprint
We describe a general framework -- compressive statistical learning -- for resource-efficient large-scale learning: the training collection is compressed in one pass into a low-dimensional sketch (a vector of random empirical generalized moments) that captures the information relevant to the considered learning task. A near-minimizer of the risk is computed from the sketch through the solution of a nonlinear least squares problem. We investigate sufficient sketch sizes to control the generalization error of this procedure. The framework is illustrated on compressive PCA, compressive clustering, and compressive Gaussian mixture Modeling with fixed known variance. The latter two are further developed in a companion paper.
... A globally optimal solution is out of reach unless the number of users is very small, since this problem is NP-hard (exponential complexity), even in its simple form [54]-[56]. ...
Article
Full-text available
A novel analytical approach to the optimal base station (BS) location problem is proposed. It is based on the widely used system and propagation path models but, unlike known studies, makes use of a convex optimization formulation to minimize the total transmit power subject to quality-of-service (QoS, rate) constraints. In contrast to the previously proposed approaches, the sufficient Karush-Kuhn-Tucker (KKT) conditions are used here to characterize a globally (rather than locally) optimum point as a convex combination of user locations, where convex weights depend on user parameters, path loss exponent and overall geometry of the problem. Based on this characterization, a number of novel closed-form solutions are obtained. In particular, the optimum BS location is shown to be the average of user locations in the case of unobstructed line-of-sight (LOS) propagation (the path loss exponent equals 2) and identical user parameters but not in general. If the user set is symmetric, the optimal BS location is independent of the path loss exponent, which is not the case in general. The analytical results show the impact of propagation conditions (e.g. clear/obstructed LOS) as well as system and user parameters (bandwidth, rate demand, etc.) on optimal BS location: the higher the path loss exponent, the heavier the impact of distant users; users with higher rate demands have more impact. The obtained analytical results facilitate insights, which are unavailable from purely numerical studies and which can be used to develop design guidelines. Based on these results, an iterative algorithm is proposed and its convergence is proved. The single-BS results are further extended to multi-BS scenarios (e.g. a cell cluster) using the K-means algorithm with proper modifications, so that the total (sum) BS power in a cell cluster is locally minimized, subject to user rate constraints. Numerical experiments validate the analytical solutions and show the effectiveness of the proposed algorithms. Overall, the emphasis is on an analytical framework, solutions and insights rather than on numerical algorithms.
... It has been shown that the clustering problem, in its combinatorial basis, is NP-hard [1][2], that is, no polynomial algorithm exists which can solve the problem optimally. Therefore, heuristic algorithms have to be used in practice, which can only find suboptimal solutions by attempting to minimize the SSE cost function. ...
Chapter
This paper describes an approach to formal modeling of clustering algorithms based on medoids using timed automata. The approach makes it possible to assess properties such as the minimization of the sum-of-squared-error objective cost (SSE) by exhaustive model checking. Although constrained to integer semantics and problem instances of small size, the contribution, which is mainly inspired by didactic concerns, highlights how abstract specification of and reasoning on clustering by medoids can be established through the introduction of concrete data structures and functions in the context of the timed automata (TA) of the popular Uppaal model checker. The paper describes the rationale of the proposed approach, summarizes its implementation which purposely exploits the high-level character of the TA supported by Uppaal, and demonstrates its practical application through synthetic datasets.
... where µ_c = (1/|S_c|) Σ_{d∈S_c} d is the centroid of the c-th cluster. The minimization of (1) is NP-hard [18]. Hence, we consider an approximate solution via the k-means clustering algorithm. ...
Preprint
In this study, we propose using an over-the-air computation (OAC) scheme for the federated k-means clustering algorithm to reduce the per-round communication latency when it is implemented over a wireless network. The OAC scheme relies on an encoder exploiting the representation of a number in a balanced number system and computes the sum of the updates for the federated k-means via signal superposition property of wireless multiple-access channels non-coherently to eliminate the need for precise phase and time synchronization. Also, a reinitialization method for ineffectively used centroids is proposed to improve the performance of the proposed method for heterogeneous data distribution. For a customer-location clustering scenario, we demonstrate the performance of the proposed algorithm and compare it with the standard k-means clustering. Our results show that the proposed approach performs similarly to the standard k-means while reducing communication latency.
... Another common termination criterion is to stop the iterations whenever the relative change in the SSE between two consecutive iterations drops below a threshold (Linde et al. 1980). In the clustering literature, k-means may refer to an objective function to be minimized or to the best-known algorithm for minimizing this objective. In the early CQ literature, the hardness of the KM problem was incorrectly attributed to various authors, including Hyafil and Rivest (1976), Brucker (1978), and Garey et al. (1982). The standard KM algorithm is also known as Lloyd's algorithm (Lloyd 1982), the generalized Lloyd algorithm (GLA) (Gray and Karnin 1982), or the Linde-Buzo-Gray (LBG) algorithm (Linde et al. 1980). ...
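A tiny illustration of the relative-change stopping rule mentioned in the excerpt; the tolerance name and default value are arbitrary and only for illustration:

def should_stop(sse_prev, sse_curr, tol=1e-4):
    # Stop when the relative SSE improvement between consecutive iterations drops below tol.
    return sse_prev > 0 and (sse_prev - sse_curr) / sse_prev < tol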
Article
Full-text available
Color quantization (CQ), the reduction of the number of distinct colors in a given image with minimal distortion, is a common image processing operation with various applications in computer graphics, image processing/analysis, and computer vision. The first CQ algorithm, median-cut, was proposed over 40 years ago. Since then, many clustering algorithms have been applied to the CQ problem. In this paper, we present a comprehensive overview of the CQ algorithms proposed in the literature. We first examine various aspects of CQ, including the number of distinguishable colors, CQ artifacts, types of CQ, applications of CQ, data structures, data reduction, color spaces and color difference equations, and color image fidelity assessment. We then provide an overview of image-independent CQ algorithms, followed by a detailed survey of image-dependent ones. After presenting a brief discussion of pixel mapping, we conclude our survey with an outline of the open problems in CQ.
... As will be seen in the numerical performance analysis, the proposed algorithm performs remarkably well in the low-resolution regime. Note that the direct minimization of the general form of the OL is an NP-hard problem, since it is a mathematical generalization of the conventional quantization problem (see e.g., [20], [42]). Therefore, using approximants and suboptimal procedures is a classical approach in the area of quantization, especially for vector quantization. ...
Article
Full-text available
In this paper, the situation in which a receiver has to execute a task from a quantized version of the information source of interest is considered. The task is modeled by the minimization problem of a general goal function f ( x; g ) for which the decision x has to be taken from a quantized version of the parameters g . This problem is relevant in many applications, e.g., for radio resource allocation (RA), high spectral efficiency communications, controlled systems, or data clustering in the smart grid. By resorting to high resolution (HR) analysis, it is shown how to design a quantizer that minimizes the gap between the minimum of f (which would be reached by knowing g perfectly) and what is effectively reached with a quantized g . The conducted formal analysis both provides quantization strategies in the HR regime and insights for the general regime and allows a practical algorithm to be designed. The analysis also allows one to provide some elements to the new and fundamental problem of the relationship between the goal function regularity properties and the hardness to quantize its parameters. The derived results are discussed and supported by a rich numerical performance analysis in which known RA goal functions are studied and allows one to exhibit very significant improvements by tailoring the quantization operation to the final task.
... As will be seen in the numerical performance analysis, the proposed algorithm performs remarkably well in the low-resolution regime. Note that the direct minimization of the general form of the OL is an NP-hard problem, since it is a mathematical generalization of the conventional quantization problem (see e.g., [20], [42]). Therefore, using approximants and suboptimal procedures is a classical approach in the area of quantization, especially for vector quantization. ...
Preprint
Full-text available
In this paper, the situation in which a receiver has to execute a task from a quantized version of the information source of interest is considered. The task is modeled by the minimization problem of a general goal function f(x;g) for which the decision x has to be taken from a quantized version of the parameters g. This problem is relevant in many applications e.g., for radio resource allocation (RA), high spectral efficiency communications, controlled systems, or data clustering in the smart grid. By resorting to high resolution (HR) analysis, it is shown how to design a quantizer that minimizes the gap between the minimum of f (which would be reached by knowing g perfectly) and what is effectively reached with a quantized g. The conducted formal analysis both provides quantization strategies in the HR regime and insights for the general regime and allows a practical algorithm to be designed. The analysis also allows one to provide some elements to the new and fundamental problem of the relationship between the goal function regularity properties and the hardness to quantize its parameters. The derived results are discussed and supported by a rich numerical performance analysis in which known RA goal functions are studied and allows one to exhibit very significant improvements by tailoring the quantization operation to the final task.
... Nevertheless, as n data items enjoy Ω(k^n) possible k-partitionings (Stirling number of the second kind), the problem of optimising a general CVI over the whole search space is very difficult computationally (provably hard in the case of the said WCSS, see [1,25], amongst many others). ...
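To make the size of this search space concrete, the number of k-partitionings of n items is the Stirling number of the second kind S(n, k), which grows on the order of k^n / k! for fixed k; a small illustrative snippet (not taken from the cited works) using the standard recurrence:

from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    # S(n, k): number of ways to partition n items into exactly k non-empty clusters.
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # A new item either starts its own block or joins one of the k existing blocks.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

print(stirling2(25, 3))  # about 1.4e11 possible 3-partitionings of only 25 items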
Preprint
Full-text available
Internal cluster validity measures (such as the Calinski-Harabasz, Dunn, or Davies-Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping with regards to, say, the Silhouette index really meaningful? It turns out that many cluster (in)validity indices promote clusterings that match expert knowledge quite poorly. We also introduce a new, well-performing variant of the Dunn index that is built upon OWA operators and the near-neighbour graph so that subspaces of higher density, regardless of their shapes, can be separated from each other better.
... First invented for improving the visibility of quantized pictures [18], subtractive dithering aims to alleviate potential distortions that originate from quantization. Subtractive dithering was later extended for other domains such as speech [8], distributed deep learning [1], and federated learning [21]. ...
Conference Paper
Full-text available
We consider the fundamental problem of communicating an estimate of a real number x ∈ [0,1] using a single bit. A sender that knows x chooses a value X ∈ {0,1} to transmit. In turn, a receiver estimates x based on the value of X. The goal is to minimize the cost, defined as the worst-case (over the choice of x) expected squared error. We first overview common biased and unbiased estimation approaches and prove their optimality when no shared randomness is allowed. We then show how a small amount of shared randomness, which can be as low as a single bit, reduces the cost in both cases. Specifically, we derive lower bounds on the cost attainable by any algorithm with unrestricted use of shared randomness and propose optimal and near-optimal solutions that use a small number of shared random bits. Finally, we discuss open problems and future directions.
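For intuition on the setup in this abstract, the classical unbiased scheme without shared randomness simply transmits a 1 with probability x and uses the received bit as the estimate, giving E[X] = x and an expected squared error of x(1 - x) <= 1/4. A minimal sketch under these assumptions (not the authors' shared-randomness constructions):

import random

def encode_one_bit(x, rng=random):
    # Unbiased one-bit encoding of x in [0, 1]: transmit 1 with probability x.
    return 1 if rng.random() < x else 0

def estimate_from_bit(bit):
    # The receiver's unbiased estimate is the bit itself: E[bit] = x,
    # and the expected squared error is x * (1 - x), at most 1/4.
    return float(bit)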
... Exact algorithms for solving optimisation-based partition clustering problems are typically not available in practice because of the complexity of the problems; it is well known, for instance, that several formulations of the k-means problem are NP-hard [9], [10]. Approaches for solving clustering problems include, among others, heuristic methods such as k-means itself and statistics-based methods such as mixture models, but metaheuristics have also been widely used on a variety of clustering problems [11], [12]. ...
... Let X be a dataset with n entities and d dimensions, i.e., X ⊂ R^d. The goal of centroid-based clustering algorithms is to group X into k disjoint clusters, such that each entity is assigned to the closest centroid c ∈ C. As this problem is NP-hard [3,26], several heuristics exist which aim to approximate the solution. One of these heuristics is the k-Means algorithm [40,41]. ...
Article
Full-text available
Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results in exploratory clustering analyses, parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit large runtimes, in particular when large datasets are analyzed using clustering algorithms with super-polynomial runtimes, which repeatedly need to be executed within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time regarding the defined search space, i.e., provably requiring less executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses in order to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how these challenges can be tackled at the same time. To the best of our knowledge, this is the first work which simultaneously addresses the above-mentioned challenges. In our comprehensive evaluation, we unveil that our proposed methods significantly outperform state-of-the-art methods, thus especially supporting novice analysts for exploratory clustering analyses in large-scale exploration processes.
... If there are 2^5 = 32 quantization bins, for example, a single-precision floating-point value's bit depth reduces from 32 to 5. In addition, various entropy coding techniques, such as Huffman coding, can be further employed to losslessly reduce the bit depth. While the quantization could be done in a traditional way, e.g., using Lloyd-Max quantization [36] after the neural codec is fully trained, we encompass the quantization step as a trainable part of the neural network as proposed in [37]. Consequently, we expect that the codec is aware of the quantization error, which the training procedure tries to reduce. ...
Article
Full-text available
We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components to NWC, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models adopt a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs a linear predictive coding (LPC) module as its first autoencoder. The hybrid design integrates LPC and NWC by redefining the LPC quantization as a differentiable process, making the system training end-to-end. The decoder of the proposed system uses either one NWC (0.12 million parameters) in the low-to-medium bitrate range (12 to 20 kbps) or two NWCs at the high bitrate (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly reduced from that of other neural speech coders, such as a WaveNet-based vocoder. For wide-band speech coding quality, our system yields comparable or superior performance to AMR-WB and Opus on TIMIT test utterances at low and medium bitrates. The proposed system can scale up to higher bitrates to achieve near-transparent performance.
... Nevertheless, as n data items enjoy Ω(k^n) possible k-partitionings (Stirling number of the second kind), the problem of optimising a general CVI over the whole search space is very difficult computationally (provably hard in the case of the said WCSS, see [1,25], amongst many others). ...
Article
Full-text available
Internal cluster validity measures (such as the Caliński–Harabasz, Dunn, or Davies–Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping with regards to, say, the Silhouette index really meaningful? It turns out that many cluster (in)validity indices promote clusterings that match expert knowledge quite poorly. We also introduce a new, well-performing variant of the Dunn index that is built upon OWA operators and the near-neighbour graph so that subspaces of higher density, regardless of their shapes, can be separated from each other better.
... However, in general, finding the optimal cluster centers is a computationally hard problem. Even for k = 2, the problem is NP-hard [18]. ...
Article
Full-text available
Many quantum algorithms for machine learning require access to classical data in superposition. However, for many natural data sets and algorithms, the overhead required to load the data set in superposition can erase any potential quantum speedup over classical algorithms. Recent work by Harrow introduces a new paradigm in hybrid quantum-classical computing to address this issue, relying on coresets to minimize the data loading overhead of quantum algorithms. We investigated using this paradigm to perform k-means clustering on near-term quantum computers, by casting it as a QAOA optimization instance over a small coreset. We used numerical simulations to compare the performance of this approach to classical k-means clustering. We were able to find data sets with which coresets work well relative to random sampling and where QAOA could potentially outperform standard k-means on a coreset. However, finding data sets where both coresets and QAOA work well—which is necessary for a quantum advantage over k-means on the entire data set—appears to be challenging.
... The latter is NP-hard, yet, under RIP conditions, provably good and computationally efficient algorithms (either greedy or based on convex relaxations) have been derived [Foucart and Rauhut, 2012]. Remark that the classic (non-compressed) k-means problem by minimization of the empirical risk is also known to be NP-hard [Garey et al., 1982, Aloise et al., 2009], and that guarantees for approaches such as K-means++ [Arthur and Vassilvitskii, 2007] are only in expectation and with a logarithmic sub-optimality factor. ...
Article
We provide statistical learning guarantees for two unsupervised learning tasks in the context of compressive statistical learning, a general framework for resource-efficient large-scale learning that we introduced in a companion paper. The principle of compressive statistical learning is to compress a training collection, in one pass, into a low-dimensional sketch (a vector of random empirical generalized moments) that captures the information relevant to the considered learning task. We explicitly describe and analyze random feature functions whose empirical averages preserve the needed information for compressive clustering and compressive Gaussian mixture modeling with fixed known variance, and establish sufficient sketch sizes given the problem dimensions.
... Similarly to iterative algorithms such as KMC, convergence to a global minimum is not guaranteed in general. Maximizing jointly a function w.r.t. the set of clusters and the set of representatives is known to be an NP-hard problem (See [29] [30]). This is the reason why we resort to an alternating optimization algorithm. ...
Preprint
Full-text available
Data clustering is an instrumental tool in the area of energy resource management. One problem with conventional clustering is that it does not take the final use of the clustered data into account, which may lead to a very suboptimal use of energy or computational resources. When clustered data are used by a decision-making entity, it turns out that significant gains can be obtained by tailoring the clustering scheme to the final task performed by the decision-making entity. The key to having good final performance is to automatically extract the important attributes of the data space that are inherently relevant to the subsequent decision-making entity, and partition the data space based on these attributes instead of partitioning the data space based on predefined conventional metrics. For this purpose, we formulate the framework of decision-making oriented clustering and propose an algorithm providing a decision-based partition of the data space and good representative decisions. By applying this novel framework and algorithm to a typical problem of real-time pricing and that of power consumption scheduling, we obtain several insightful analytical results such as the expression of the best representative price profiles for real-time pricing and a very significant reduction in terms of required clusters to perform power consumption scheduling as shown by our simulations.
... Similarly to iterative algorithms such as KMC, convergence to a global minimum is not guaranteed in general. Maximizing jointly a function w.r.t. the set of clusters and the set of representatives is known to be an NP-hard problem (See [29] [30]). This is the reason why we resort to an alternating optimization algorithm. ...
Article
Full-text available
Data clustering is an instrumental tool in the area of energy resource management. One problem with conventional clustering is that it does not take the final use of the clustered data into account, which may lead to a very suboptimal use of energy or computational resources. When clustered data are used by a decision-making entity, it turns out that significant gains can be obtained by tailoring the clustering scheme to the final task performed by the decision-making entity. The key to having good final performance is to automatically extract the important attributes of the data space that are inherently relevant to the subsequent decision-making entity, and partition the data space based on these attributes instead of partitioning the data space based on predefined conventional metrics. For this purpose, we formulate the framework of decision-making oriented clustering and propose an algorithm providing a decision-based partition of the data space and good representative decisions. By applying this novel framework and algorithm to a typical problem of real-time pricing and that of power consumption scheduling, we obtain several insightful analytical results such as the expression of the best representative price profiles for real-time pricing and a very significant reduction in terms of required clusters to perform power consumption scheduling as shown by our simulations.
... If there are 2^5 = 32 quantization bins, for example, a single-precision floating-point value's bit depth reduces from 32 to 5. In addition, various entropy coding techniques, such as Huffman coding, can be further employed to losslessly reduce the bit depth. While the quantization could be done in a traditional way, e.g., using Lloyd-Max quantization [29] after the neural codec is fully trained, we encompass the quantization step as a trainable part of the neural network, as in Algorithm 1 (trainable Softmax quantization Q(h, α, β)): given the code h = F_enc(x) (the encoder output), the Softmax scaling factor α, and the centroid vector β ∈ R^K, compute the dissimilarity matrix D_nk ← ℓ2(h_n, β_k), then apply the Softmax conversion A^(soft)_n: ...
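A rough NumPy sketch of the soft-assignment step outlined in the algorithm fragment above; the exact operations and shapes used in the cited codec may differ, and all names here are our own:

import numpy as np

def softmax_quantize(h, beta, alpha):
    # h: code values, shape (N,); beta: centroid vector, shape (K,); alpha: Softmax sharpness.
    d = (h[:, None] - beta[None, :]) ** 2            # dissimilarity matrix D[n, k]
    a_soft = np.exp(-alpha * d)
    a_soft /= a_soft.sum(axis=1, keepdims=True)      # soft assignment of each code value to the centroids
    h_soft = a_soft @ beta                           # differentiable (training-time) quantization
    h_hard = beta[d.argmin(axis=1)]                  # hard (test-time) quantization
    return h_soft, h_hard

The soft version keeps gradients flowing through the quantizer during training, while the hard version is what an actual bitstream would encode at test time.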
Preprint
Full-text available
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine. The proposed CNN autoencoder also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model architectures to our fully convolutional network model, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models are with a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where an NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs linear predictive coding (LPC) module as its first autoencoder. We redefine LPC's quantization as a trainable module to enhance the bit allocation tradeoff between LPC and its following NWC modules. Compared to the other autoregressive decoder-based neural speech coders, our decoder has significantly smaller architecture, e.g., with only 0.12 million parameters, more than 100 times smaller than a WaveNet decoder. Compared to the LPCNet-based speech codec, which leverages the speech production model to reduce the network complexity in low bitrates, ours can scale up to higher bitrates to achieve transparent performance. Our lightweight neural speech coding model achieves comparable subjective scores against AMR-WB at the low bitrate range and provides transparent coding quality at 32 kbps.
... In terms of complexity, finding the global optimum of the k-means objective function is an NP-hard (non-deterministic polynomial-time hard) problem [48,49]. To avoid solving the NP-hard problem exactly, as already indicated, Lloyd's clustering algorithm [44] is used, but it offers only a local search heuristic for k-means. ...
Article
Full-text available
Evaluating the quality of reconstructed images requires consistent approaches to extracting information and applying metrics. Partitioning medical images into tissue types permits the quantitative assessment of regions that contain a specific tissue. The assessment facilitates the evaluation of an imaging algorithm in terms of its ability to reconstruct the properties of various tissue types and identify anomalies. Microwave tomography is an imaging modality that is model-based and reconstructs an approximation of the actual internal spatial distribution of the dielectric properties of a breast over a reconstruction model consisting of discrete elements. The breast tissue types are characterized by their dielectric properties, so the complex permittivity profile that is reconstructed may be used to distinguish different tissue types. This manuscript presents a robust and flexible medical image segmentation technique to partition microwave breast images into tissue types in order to facilitate the evaluation of image quality. The approach combines an unsupervised machine learning method with statistical techniques. The key advantage for using the algorithm over other approaches, such as a threshold-based segmentation method, is that it supports this quantitative analysis without prior assumptions such as knowledge of the expected dielectric property values that characterize each tissue type. Moreover, it can be used for scenarios where there is a scarcity of data available for supervised learning. Microwave images are formed by solving an inverse scattering problem that is severely ill-posed, which has a significant impact on image quality. A number of strategies have been developed to alleviate the ill-posedness of the inverse scattering problem. The degree of success of each strategy varies, leading to reconstructions that have a wide range of image quality. A requirement for the segmentation technique is the ability to partition tissue types over a range of image qualities, which is demonstrated in the first part of the paper. The segmentation of images into regions of interest corresponding to various tissue types leads to the decomposition of the breast interior into disjoint tissue masks. An array of region and distance-based metrics are applied to compare masks extracted from reconstructed images and ground truth models. The quantitative results reveal the accuracy with which the geometric and dielectric properties are reconstructed. The incorporation of the segmentation that results in a framework that effectively furnishes the quantitative assessment of regions that contain a specific tissue is also demonstrated. The algorithm is applied to reconstructed microwave images derived from breasts with various densities and tissue distributions to demonstrate the flexibility of the algorithm and that it is not data-specific. The potential for using the algorithm to assist in diagnosis is exhibited with a tumor tracking example. This example also establishes the usefulness of the approach in evaluating the performance of the reconstruction algorithm in terms of its sensitivity and specificity to malignant tissue and its ability to accurately reconstruct malignant tissue.
... Since the k-means is an NP-hard optimization problem [21,22,23], the results easily get stuck at a locally optimal solution. Genetic algorithms are popular instruments for global optimization. ...
Article
Full-text available
The k-means problem and the algorithm of the same name are the most commonly used clustering model and algorithm. Being a local search optimization method, the k-means algorithm falls into a local minimum of the objective function (sum of squared errors) and depends on the initial solution, which is given or selected randomly. This disadvantage of the algorithm can be avoided by combining it with more sophisticated methods such as the Variable Neighborhood Search, agglomerative or dissociative heuristic approaches, genetic algorithms, etc. Aiming at the shortcomings of the k-means algorithm and combining its advantages with those of the evolutionary approach, a genetic clustering algorithm with a cross-mutation operator was designed. The efficiency of genetic algorithms with tournament selection, one-point crossover and various mutation operators (without any mutation operator, with the uniform mutation, DBM mutation and the new cross-mutation) is compared on data sets of up to 2 million data vectors. We used data from the UCI repository and a special data set collected during the testing of highly reliable semiconductor components. In this paper, we do not discuss the comparative efficiency of the genetic algorithms for the k-means problem in comparison with other (non-genetic) algorithms, nor the comparative adequacy of the k-means clustering model. Here, we focus only on the influence of various mutation operators on the efficiency of the genetic algorithms.
... The division of the set of observations into k ≤ n clusters is based upon the minimisation of the within-cluster sum of squares, and thus variance. It should be noted here that k-means clustering belongs to the class of NP-hard problems [59][60][61][62][63]. In computational complexity theory, NP (non-deterministic polynomial time) is the complexity class containing the set of decision problems that can be solved by a non-deterministic Turing machine in polynomial time [64] (p.
Article
Full-text available
The location quotient is one of the basic quantitative tools for identifying the regional poles and the turnpikes of economic growth in spatial economy. The disadvantage of this traditional measure is the limited scope of economic information contained in it. The new measure of economic development proposed in the article encompasses a complex spectrum of phenomena in one number, as it takes into account the influence of the public administration sector, as well as top technology in the form of ICT and its practical business models. It also takes into account the digital prosumption and the platforms for participation. The participation platforms in the public administration sector are the websites of municipal public administration offices. A cluster analysis was used to distinguish four quality classes of these websites. These classes were assigned four different colours, which were then used to draw up a map of the selected province. Each municipality is marked with a colour that corresponds to the quality class of the website of the state administration office operating on its territory. The colour system resulting from the four-colour theorem and the corresponding dual graph play the role of a reference system in relation to each empirical colour distribution and another dual graph related to it. The measure of the economic development of a region is the degree of reduction of the dual graph corresponding to the empirical distribution of colours, which identifies the actual growth poles and determines the routes of growth. The presented indicator better and more precisely identifies poles and routes of economic growth than the traditional location quotient.
... , X_k even in the cases with no cluster structure in the data [9,10]. Moreover, the NP-hardness [11,12] of the problem makes the exact methods [6] applicable only for very small problems. ...
Article
Full-text available
The k-means problem is one of the most popular models in cluster analysis that minimizes the sum of the squared distances from clustered objects to the sought cluster centers (centroids). The simplicity of its algorithmic implementation encourages researchers to apply it in a variety of engineering and scientific branches. Nevertheless, the problem is proven to be NP-hard which makes exact algorithms inapplicable for large scale problems, and the simplest and most popular algorithms result in very poor values of the squared distances sum. If a problem must be solved within a limited time with the maximum accuracy, which would be difficult to improve using known methods without increasing computational costs, the variable neighborhood search (VNS) algorithms, which search in randomized neighborhoods formed by the application of greedy agglomerative procedures, are competitive. In this article, we investigate the influence of the most important parameter of such neighborhoods on the computational efficiency and propose a new VNS-based algorithm (solver), implemented on the graphics processing unit (GPU), which adjusts this parameter. Benchmarking on data sets composed of up to millions of objects demonstrates the advantage of the new algorithm in comparison with known local search algorithms, within a fixed time, allowing for online computation.
... The color quantization problem is complex, since the selection of the best colors to define the quantized palette is an NP-complete problem [23]. For this reason, several solution methods have been proposed to solve this problem. ...
Article
Full-text available
This article presents a color quantization technique that combines two previously proposed approaches: the Binary splitting method and the Iterative ant-tree for color quantization method. The resulting algorithm can obtain good quality images with low time consumption. In addition, the iterative nature of the proposed method allows the quality of the quantized image to improve as the iterations progress, although it also allows a good initial image to be quickly obtained. The proposed method was compared to 13 other color quantization techniques and the results showed that it could generate better quantized images than most of the techniques assessed. The statistical significance of the improvement obtained using the new method is confirmed by applying a statistical test to the results of all the methods compared.
... Furthermore, the optimization problems upon which multirobot area coverage algorithms build are known to be NP-hard [141]. Therefore, part of the existing research has focused on probabilistic approaches. ...
Article
Full-text available
Search and rescue (SAR) operations can take significant advantage from supporting autonomous or teleoperated robots and multi-robot systems. These can aid in mapping and situational assessment, monitoring and surveillance, establishing communication networks, or searching for victims. This paper provides a review of multi-robot systems supporting SAR operations, with system-level considerations and focusing on the algorithmic perspectives for multi-robot coordination and perception. This is, to the best of our knowledge, the first survey paper to cover (i) heterogeneous SAR robots in different environments, (ii) active perception in multi-robot systems, while (iii) giving two complementary points of view from the multi-agent perception and control perspectives. We also discuss the most significant open research questions: shared autonomy, sim-to-real transferability of existing methods, awareness of victims' conditions, coordination and interoperability in heterogeneous multi-robot systems, and active perception. The different topics in the survey are put in the context of the different challenges and constraints that various types of robots (ground, aerial, surface, or underwater) encounter in different SAR environments (maritime, urban, wilderness, or other post-disaster scenarios). The objective of this survey is to serve as an entry point to the various aspects of multi-robot SAR systems to researchers in both the machine learning and control fields by giving a global overview of the main approaches being taken in the SAR robotics area.
... Furthermore, the optimization problems upon which multirobot area coverage algorithms build are known to be NP-hard [148]. Therefore, part of the existing research has focused on probabilistic approaches. ...
Preprint
Autonomous or teleoperated robots have been playing increasingly important roles in civil applications in recent years. Across the different civil domains where robots can support human operators, one of the areas where they can have more impact is in search and rescue (SAR) operations. In particular, multi-robot systems have the potential to significantly improve the efficiency of SAR personnel with faster search of victims, initial assessment and mapping of the environment, real-time monitoring and surveillance of SAR operations, or establishing emergency communication networks, among other possibilities. SAR operations encompass a wide variety of environments and situations, and therefore heterogeneous and collaborative multi-robot systems can provide the most advantages. In this paper, we review and analyze the existing approaches to multi-robot SAR support, from an algorithmic perspective and putting an emphasis on the methods enabling collaboration among the robots as well as advanced perception through machine vision and multi-agent active perception. Furthermore, we put these algorithms in the context of the different challenges and constraints that various types of robots (ground, aerial, surface or underwater) encounter in different SAR environments (maritime, urban, wilderness or other post-disaster scenarios). This is, to the best of our knowledge, the first review considering heterogeneous SAR robots across different environments, while giving two complementary points of view: control mechanisms and machine perception. Based on our review of the state-of-the-art, we discuss the main open research questions, and outline our insights on the current approaches that have potential to improve the real-world performance of multi-robot SAR systems.
Article
Existing methods for integerized training speed up deep learning by using low-bitwidth integerized weights, activations, gradients, and optimizer buffers. However, they overlook the issue of full-precision latent weights, which consume excessive memory to accumulate gradient-based updates for optimizing the integerized weights. In this paper, we propose the first latent weight quantization schema for general integerized training, which minimizes the quantization perturbation to the training process via residual quantization with an optimized dual quantizer. We leverage residual quantization to eliminate the correlation between the latent weight and the integerized weight, thereby suppressing quantization noise. We further propose a dual quantizer with an optimal nonuniform codebook to avoid frozen weights and to ensure a statistically unbiased training trajectory, as with full-precision latent weights. The codebook is optimized to minimize the disturbance of weight updates under importance guidance and is realized with a three-segment polyline approximation for hardware-friendly implementation. Extensive experiments show that the proposed schema allows integerized training with latent weights as low as 4 bits for various architectures including ResNets, MobileNetV2, and Transformers, and yields negligible performance loss in image classification and text generation. Furthermore, we successfully fine-tune large language models with up to 13 billion parameters on a single GPU using the proposed schema.
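As a generic illustration of the residual-quantization idea mentioned in this abstract (a simplified sketch, not the authors' dual-quantizer scheme with its optimized nonuniform codebook):

import numpy as np

def residual_quantize(w, codebook1, codebook2):
    # Stage 1: nearest-codeword quantization of the weights.
    q1 = codebook1[np.abs(w[:, None] - codebook1[None, :]).argmin(axis=1)]
    # Stage 2: quantize only the residual left by stage 1 with a second codebook,
    # so the second-stage error is decoupled from the first-stage codewords.
    r = w - q1
    q2 = codebook2[np.abs(r[:, None] - codebook2[None, :]).argmin(axis=1)]
    return q1 + q2   # reconstructed latent weight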
Chapter
K-Means is a well-known clustering algorithm, very often used for its simplicity and efficiency. Its properties have been thoroughly investigated. It has emerged that K-Means heavily depends on the seeding method used to initialize the cluster centroids and that, besides the seeding procedure, it mainly acts as a local refiner of the centroids and can easily become stuck around a locally sub-optimal solution of the objective function cost. As a consequence, K-Means is often repeated many times, always starting with a different centroid configuration, to increase the likelihood of finding a clustering solution near the optimal one. In this paper, the Hartigan and Wong variation of K-Means (HWKM) is chosen because of its increased probability of ending up near the optimal solution. HWKM is then enhanced with the use of careful seeding methods and with an incremental technique which constrains the movement of points among clusters according to their Silhouette coefficients. The result is HWKM+, which, through a small number of restarts, is capable of generating a careful clustering solution with compact and well-separated clusters. The current implementation of HWKM+ rests on Java parallel streams. The paper describes the design and development of HWKM+ and demonstrates its abilities through a series of benchmark and real-world datasets.
Article
Full-text available
The aim of computer vision techniques and deep learning in the era of digitalization is to derive valuable insights from digital images and generate novel understanding. This makes it possible to employ imaging to quickly diagnose and treat a variety of diseases. In the field of dermatology, deep neural networks are utilized to differentiate between images of melanoma and non-melanoma skin lesions. In this paper, we have emphasised two important aspects of melanoma detection research. The accuracy of classifiers is the first thing to take into account: even very small modifications to the dataset's characteristics can produce large differences in accuracy. We investigated transfer learning issues in this case. We propose that continual training-test iterations are necessary to create reliable prediction models based on the results of the initial study. The second argument is the need for a system with a flexible design that can accommodate changes to training datasets. Our proposal for creating and implementing a melanoma detection service that utilizes clinical and dermoscopic images involves developing and implementing a hybrid architecture that fuses fog, edge, and cloud computing. In addition, this design should aim to decrease the duration of the ongoing retraining process, which is necessary to accommodate the large volume of data that requires evaluation. This notion has been reinforced by experiments using a single computer and a variety of distribution techniques, which show how a distributed strategy ensures that results are obtained in a noticeably more acceptable amount of time.
Chapter
Image color representation is among the most important aspects of Content-Based Image Retrieval (CBIR). Indeed, color features are among the low-level image features commonly used in CBIR, and the Color Histogram (CH) is one of the most widely used techniques for color feature extraction in CBIR systems. However, we believe that other techniques can outperform CH in terms of retrieval precision. In this work, a new color feature descriptor called the Color Octree Quantization Descriptor (COQD), combined with color strings coding (CSC), is proposed. It applies the Octree Color Quantization (OCQ) algorithm to a color image and then constructs a color palette of size K, which is used to extract a color string from the octree. Since most of today's images are first represented in the RGB (Red, Green, Blue) color space, we test in the RGB color space. It is critical to consider how visual features are extracted using color, so the proposed method uniformly encodes the resulting image into a color string that is used as a feature. The proposed approach is experimentally validated on the Wang dataset containing 1000 natural images, and the results show that the proposed COQD method outperforms the CH descriptor in terms of precision. Keywords: RGB color space; Octree quantization; Content-based image retrieval; Color string; Color histogram
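To make the octree idea concrete, the sketch below computes the index of the octree leaf that contains an RGB pixel at a fixed depth, taking one bit of R, G and B per level. It illustrates only the indexing principle behind octree color quantization; the adaptive OCQ/COQD construction and the color-string coding of the paper are not reproduced, and the helper names are hypothetical.

```python
import numpy as np

def octree_index(rgb, depth=3):
    """Leaf index of an RGB color in a complete octree of the given depth.
    Each level picks one of 8 children from one bit of R, G and B."""
    r, g, b = (int(c) for c in rgb)
    idx = 0
    for level in range(depth):
        bit = 7 - level                                   # most significant bits first
        child = (((r >> bit) & 1) << 2) | (((g >> bit) & 1) << 1) | ((b >> bit) & 1)
        idx = (idx << 3) | child
    return idx                                            # value in [0, 8**depth)

def color_string(image, depth=3):
    """Encode an H x W x 3 uint8 image as a sequence of leaf indices
    (a hypothetical stand-in for the paper's color-string coding)."""
    return [octree_index(px, depth) for px in image.reshape(-1, 3)]

# toy usage on a 2 x 2 image
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [128, 128, 128]]], dtype=np.uint8)
print(color_string(img, depth=2))
```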
Article
Utilizing computer vision, machine learning, and deep learning, the objective is to discover new information and extract knowledge from digital images. Images can now be used for both early illness detection and treatment. In dermatology, deep neural networks are used to distinguish images with melanoma from images without it. Two important melanoma detection research topics are emphasized in this paper. First, classifier accuracy is impacted by even minor alterations to the dataset's parameters, the primary variable under investigation; we examined the transfer learning issues in this case and, on the basis of this initial evaluation's findings, we propose using continuous training-test cycles to create trustworthy prediction models. Second, a more flexible design philosophy that can accommodate changes in the training datasets is essential. We recommend the creation and use of a hybrid design based on cloud, fog, and edge computing to provide melanoma detection management based on clinical and dermoscopic images. By reducing the duration of the continual retraining, this architecture can constantly adapt to the amount of data that must be analyzed. This aspect has been highlighted in experiments conducted on a single PC using various distribution methods, demonstrating how a distributed system guarantees output delivery in a considerably more acceptable amount of time.
Article
A federated kernel k-means (FedKKM) algorithm is developed in this article to conduct distributed clustering with low memory consumption on user devices. In FedKKM, a federated eigenvector approximation (FEA) algorithm is designed to iteratively determine the low-dimensional approximate vectors of the transformed feature vectors, using only low-dimensional random feature vectors. To maintain high communication efficiency in each iteration of FEA, a communication-efficient Lanczos algorithm (CELA) is further designed in FEA to reduce the communication cost. Based on the low-dimensional approximate vectors, the clustering result is obtained by leveraging a distributed linear k-means algorithm. A theoretical analysis shows that: 1) FEA has a convergence rate of O(1/T), where T is the number of iterations; 2) the scalability of FedKKM is not affected by the dataset size since the communication cost of FedKKM is independent of the number of users' data; and 3) FedKKM is a (1+ε)-approximation algorithm. The experimental results show that FedKKM achieves clustering quality comparable to that of a centralized kernel k-means. Compared with state-of-the-art schemes, FedKKM reduces the memory consumption on user devices by up to 94% and also reduces the communication cost by more than 40%.
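FedKKM's components (FEA, CELA) are beyond a short example, but the underlying trick of kernel k-means with random features can be shown compactly: map the points into a low-dimensional random feature space whose inner products approximate an RBF kernel, then run ordinary linear k-means there. The sketch below is only this centralized building block, assuming an RBF kernel and random Fourier features; it is not the federated protocol of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_fourier_features(X, dim=128, gamma=0.5, seed=0):
    """Random Fourier features z(x) with z(x).z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], dim))
    b = rng.uniform(0, 2 * np.pi, size=dim)
    return np.sqrt(2.0 / dim) * np.cos(X @ W + b)

# toy usage: approximate kernel k-means = linear k-means in the random feature space
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 4])
Z = random_fourier_features(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(labels)
```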
Thesis
Full-text available
The classic paradigm for designing a transmitter (encoder) and a receiver (decoder) is to ensure that the information reconstructed by the receiver is sufficiently close to the information that the transmitter has formatted to send over the communication medium. This is referred to as a criterion of fidelity or of reconstruction quality (measured, for example, in terms of distortion, bit error rate, packet error rate, or communication outage probability). The problem with the classic paradigm is that it can lead to an unjustified investment in communication resources (oversized data storage, very high-rate and expensive communication media, very fast components, etc.) and can even make exchanges more vulnerable to attacks. The reason is that the classic approach (based on the fidelity of the information), applied in wireless networks, typically leads to exchanges that are excessively rich in information relative to the decision that the recipient of the information will have to take; in the simplest case, this decision may even be binary, meaning that in theory a single bit of information could suffice. As it turns out, the engineer does not currently have at his disposal a methodology to design a transmitter-receiver pair suited to the intended use (or uses) of the recipient. Therefore, a new communication paradigm named goal-oriented communication is proposed to address the limits of classic communications. The ultimate objective of goal-oriented communications is to accomplish certain tasks or goals instead of merely improving the accuracy of the reconstructed signal. Tasks are generally characterized by utility functions or cost functions to be optimized. In the present thesis, we focus on the quantization problem of goal-oriented communication, i.e., goal-oriented quantization. We first formulate the goal-oriented quantization problem formally. Secondly, we propose an approach to solve the problem when only realizations of the utility function are available. A special scenario with extra knowledge about regularity properties of the utility functions is treated as well. Thirdly, we extend high-resolution quantization theory to the goal-oriented quantization problem and propose implementable schemes to design a goal-oriented quantizer. Fourthly, the goal-oriented quantization problem is studied in the framework of games in strategic form. It is shown that goal-oriented quantization can improve the overall performance of the system when the famous Braess paradox is present. Finally, the Nash equilibrium of a multi-user multiple-input multiple-output multiple access channel game with energy efficiency as the utility is studied and achieved by different methods.
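One compact way to write the problem described above (the notation is ours, not necessarily the thesis's) is to replace the classic distortion criterion $\min_Q \mathbb{E}[d(X, Q(X))]$ by the expected loss of utility caused by deciding from the quantized observation instead of the true one:

\[
Q^{\star} \in \arg\min_{Q:\,|Q(\mathcal{X})| \le M} \ \mathbb{E}_X\!\left[\max_{a} u(X; a) \;-\; u\big(X; a^{\star}(Q(X))\big)\right],
\qquad
a^{\star}(q) \in \arg\max_{a} \ \mathbb{E}\!\left[u(X; a) \mid Q(X) = q\right],
\]

where $u$ is the task utility, $M$ is the number of quantization cells, and $a^{\star}(q)$ is the best decision the recipient can take given only the cell index $q$.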
Article
Full-text available
We present a method to construct optimal clustering via a sequence of merge steps. We formulate merge-based clustering as a minimum redundancy search tree and then search for the optimal clustering with a branch-and-bound technique. The optimal clustering is found regardless of the objective function used. We also consider two suboptimal polynomial-time variants based on the proposed branch-and-bound technique. However, all variants are slow and of merely theoretical interest. We discuss the reasons for these results.
Article
Classification of mineralized areas into different geochemical classes in terms of prospectivity is crucial for the optimal management of exploration risk and costs. Machine learning (ML) algorithms can serve as appropriate alternatives for separating ore-related anomalies, since they avoid assumptions about the statistical distribution and are compatible with the multivariate nature of geochemical features. By hybridizing ML with a metaheuristic algorithm called particle swarm optimization (PSO), this contribution aims to provide an innovative approach to optimize the classification of geochemical anomalies within the study area. The PSO algorithm is inspired by the social behavior of flocks of birds in search of food. The Dagh-Dali Zn-Pb (±Au) mineral prospect in northwest Iran was used as a case study to examine the integrity of the proposed method. Mineralization-related features were extracted by applying principal component analysis (PCA) to metallogenic elements analyzed in soil samples, yielding PC1 and PC2 with elemental assemblages of Ag-Au-Pb-Zn and Pb-Zn, respectively. The silhouette index was employed to estimate the number of underlying geochemical clusters within the adopted feature space. To constitute a comparative analysis, k-means clustering and PSO-based learning (PSO-L) algorithms were implemented to classify the gridded data of PC1 and PC2 within the study area. The results indicated that the use of PSO improved the cost function of the clustering problem (by up to 4%). Comparing the mineralization classes with the metallogenic evidence provided by boreholes drilled in the study area indicated that PSO-L was superior to the traditional k-means method, improving the accurate estimation of subsurface mineralization classes by 7%. By overcoming the drawback of conventional methods of becoming trapped in local optima, PSO-based learning has the potential to highlight weak mineralization signals that are numerically located in boundary conditions. The results show that the proposed approach can serve as an effective medium for optimal modeling of geochemical classes and management of detailed exploration operations.
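As a rough illustration of PSO-based clustering (our own minimal sketch, not the PSO-L configuration of the paper), each particle can encode a full set of k cluster centers and the swarm can minimize the usual sum-of-squared-errors cost; the inertia and acceleration coefficients below are conventional default values, not the paper's settings.

```python
import numpy as np

def sse(centers, X):
    """Sum of squared distances from each point to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def pso_cluster(X, k, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Sketch of PSO clustering: a particle is a (k, d) array of centers."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    pos = X[rng.integers(0, n, size=(n_particles, k))].astype(float)   # init from data points
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([sse(p, X) for p in pos])
    g = int(np.argmin(pbest_cost))
    gbest, gbest_cost = pbest[g].copy(), pbest_cost[g]
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        for i, p in enumerate(pos):
            cost = sse(p, X)
            if cost < pbest_cost[i]:
                pbest[i], pbest_cost[i] = p.copy(), cost
                if cost < gbest_cost:
                    gbest, gbest_cost = p.copy(), cost
    return gbest, gbest_cost

# toy usage
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 5])
centers, cost = pso_cluster(X, k=2)
print(np.round(centers, 2), round(float(cost), 2))
```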
Article
In today's world, the temperature of the environment is gradually rising. One of the main reasons is the declining quality of air, caused mainly by air pollution. Many harmful substances present in the environment contribute to this decline; these pollutants mix with the air and pollute the environment. Two air pollutants, CO2 and NO, are considered here. To reduce air pollution, the amount of pollutants must be known; in this experiment the level of pollutants is monitored with the help of sensors, and based on these measurements a prediction mechanism is developed to determine the level of pollutants in the future. Some machine learning concepts are involved: k-means clustering for the classification of pollutants, along with an SVM (Support Vector Machine). With a successful prediction of the level of pollutants, the necessary countermeasures can be adopted.
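One plausible reading of that pipeline (the data, feature choice, and class count below are hypothetical; the paper does not specify them) is to let k-means group historical sensor readings into pollution-level classes and then train an SVM to predict the class of new readings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Hypothetical sensor readings: columns are CO2 (ppm) and NO (ppb).
rng = np.random.default_rng(1)
readings = np.vstack([rng.normal([400, 20], [30, 5], size=(200, 2)),
                      rng.normal([800, 60], [50, 10], size=(200, 2))])

# Step 1: k-means assigns each reading an (unlabeled) pollution-level class.
levels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(readings)

# Step 2: an SVM learns to predict that class for future readings.
clf = SVC(kernel="rbf").fit(readings, levels)
print(clf.predict([[450, 25], [780, 55]]))
```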
Article
In the era of digitized images, the goal is to be able to extract information from them and create new knowledge thanks to the use of Computer Vision techniques, Machine Learning and Deep Learning. This allows their use for early diagnosis and subsequent determination of the treatment of many pathologies. In the specific case treated here, deep neural networks are used in the dermatological field to distinguish between melanoma and non-melanoma images. In this work we have underlined two essential points of melanoma detection research. The first aspect taken into consideration is how even a simple modification of the parameters in the dataset determines a change of the accuracy of the classifiers, while working on the same original dataset. The second point is the need to have a system architecture that can be more flexible in updating the training datasets for the classification of this pathology. In this context, the proposed work pursues the goal of developing and implementing a hybrid architecture based on Cloud, Fog and Edge Computing in order to provide a Melanoma Detection service based on clinical and/or dermoscopic images. At the same time, this architecture must be able to interface with the amount of data to be analyzed by reducing the running time of the necessary computational operations. This has been highlighted with experiments carried out on a single machine and on different distribution systems, highlighting how a distributed approach guarantees the achievement of an output in a much more acceptable time without the need to fully rely on data scientists' skills.
Article
The Iterative Ant-tree for color quantization algorithm has recently been proposed to reduce the colors of an image at a low computational cost. It is a clustering-based method that generates good images compared to several well-known color quantization methods. This article proposes the modification of two features of the original algorithm: the value assigned to the parameter associated with the algorithm and the order in which the pixels of the image are processed. As a result, the new variant of the algorithm generates better images than the original and the results are less sensitive to the value selected for the parameter.
Article
Electrical utilities apply condition monitoring to power transformers (PTs) to prevent unplanned outages and detect incipient faults. This monitoring is often done using dissolved gas analysis (DGA) coupled with engineering methods to interpret the data; however, the obtained results lack accuracy and reproducibility. In order to improve accuracy, various advanced analytical methods have been proposed in the literature. Nonetheless, these methods are often hard to interpret by the decision-maker and require a substantial amount of failure records to be trained. In the context of PTs, failure data quality is recurrently questionable, and failure records are scarce compared to non-failure records. This work tackles these challenges by proposing a novel unsupervised methodology for diagnosing PT condition. Differently from the supervised approaches in the literature, our method does not require the labeling of DGA records and incorporates a visual representation of the results in a 2D scatter plot to assist interpretation. A modified clustering technique is used to classify the condition of different PTs using historical DGA data. Finally, well-known engineering methods are applied to interpret each of the obtained clusters. The approach was validated using two real-world data sets provided by a generation company and a distribution system operator. The results highlight the advantages of the proposed approach, which outperformed both engineering methods (from the IEC and IEEE standards) and the companies' legacy method. The approach was also validated on the public IEC TC10 database, showing the capability to achieve accuracy comparable with supervised learning methods from the literature. As a result of the methodology's performance, both companies are currently using it in their daily DGA diagnosis.
Article
Model quantization is essential to deploy deep convolutional neural networks (DCNNs) on resource-constrained devices. In this article, we propose a general bitwidth assignment algorithm based on theoretical analysis for efficient layerwise weight and activation quantization of DCNNs. The proposed algorithm develops a prediction model to explicitly estimate the loss of classification accuracy caused by weight quantization using a geometrical approach. Dynamic programming is then adopted to achieve an optimal bitwidth assignment for the weights based on the estimated error. Furthermore, we optimize the bitwidth assignment for activations by considering the signal-to-quantization-noise ratio (SQNR) between weight and activation quantization. The proposed algorithm is general enough to reveal the tradeoff between classification accuracy and model size for various network architectures. Extensive experiments demonstrate the efficacy of the proposed bitwidth assignment algorithm and the error rate prediction model. Furthermore, the proposed algorithm extends well to object detection.
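The dynamic-programming step can be sketched independently of the error model (our own simplification: the predicted per-layer, per-bitwidth accuracy loss is taken as a given table, whereas deriving it is the paper's contribution; the bitwidth choices and budget below are illustrative).

```python
def assign_bitwidths(pred_err, layer_params, budget_bits, choices=(2, 4, 6, 8)):
    """DP over layers: pick one bitwidth per layer so that the summed predicted
    accuracy loss is minimal subject to sum(params * bits) <= budget_bits."""
    dp = {0: (0.0, [])}                      # used bits -> (total predicted loss, picks)
    for l, n_params in enumerate(layer_params):
        nxt = {}
        for used, (err, picks) in dp.items():
            for b in choices:
                u = used + n_params * b
                if u > budget_bits:
                    continue
                e = err + pred_err[l][b]
                if u not in nxt or e < nxt[u][0]:
                    nxt[u] = (e, picks + [b])
        dp = nxt
    return min(dp.values(), key=lambda t: t[0])   # (min loss, bitwidth per layer)

# toy usage with a made-up loss table (lower bitwidth -> larger predicted loss)
layer_params = [1000, 5000, 2000]
pred_err = [{b: 1.0 / b for b in (2, 4, 6, 8)} for _ in layer_params]
print(assign_bitwidths(pred_err, layer_params, budget_bits=40000))
```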
Article
Full-text available
The main goal of this paper is to address the problem of deploying a team of heterogeneous, autonomous robots in a partially known environment. To handle such arbitrary environments, we first represent them as a weighted directed graph. Then, a new partitioning algorithm is given that is capable of capturing the heterogeneity of robots in terms of the speed and onboard power. It is shown that the proposed partitioning method assigns a larger subgraph to a robot that has more resources or better capabilities compared to its neighbors. Next, a distributed deployment strategy is proposed to optimally distribute robots on the graph with the aim of monitoring specified regions of interest in the environment. It will be proved that the proposed combined partitioning and deployment strategy is an optimal solution in the sense that any other arbitrary partition than the proposed one results in a larger coverage cost, and that our deployment strategy also minimizes the considered cost. Moreover, the application of the proposed methodology for monitoring an agricultural field is studied, where a series of simulations and experimental studies are carried out to demonstrate that the proposed approach can yield an optimal partitioning and deployment and offer promise to be used in practice.
Preprint
Why not have a computer just draw a map? This is something you hear a lot when people talk about gerrymandering, and it's easy to think at first that this could solve redistricting altogether. But there are more than a couple problems with this idea. In this chapter, two computer scientists survey what's been done in algorithmic redistricting, discuss what doesn't work and highlight approaches that show promise. This preprint was prepared as a chapter in the forthcoming edited volume Political Geometry, an interdisciplinary collection of essays on redistricting. (https://mggg.org/gerrybook)
Article
Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results, parameters of the clustering algorithm, e.g., the number of clusters, have to be set appropriately, which is a tremendous pitfall. To this end, analysts rely on their domain knowledge in order to define parameter search spaces. While experienced analysts may be able to define a small search space, especially novice analysts often define rather large search spaces due to the lack of in-depth domain knowledge. These search spaces can be explored in different ways by estimation methods for the number of clusters. In the worst case, estimation methods perform an exhaustive search in the given search space, which leads to infeasible runtimes for large datasets and large search spaces. We propose LOG-Means, which is able to overcome these issues of existing methods. We show that LOG-Means provides estimates in sublinear time regarding the defined search space, thus being a strong fit for large datasets and large search spaces. In our comprehensive evaluation on an Apache Spark cluster, we compare LOG-Means to 13 existing estimation methods. The evaluation shows that LOG-Means significantly outperforms these methods in terms of runtime and accuracy. To the best of our knowledge, this is the most systematic comparison on large datasets and search spaces as of today.
Article
It has long been realized that in pulse-code modulation (PCM), with a given ensemble of signals to handle, the quantum values should be spaced more closely in the voltage regions where the signal amplitude is more likely to fall. It has been shown by Panter and Dite that, in the limit as the number of quanta becomes infinite, the asymptotic fractional density of quanta per unit voltage should vary as the one-third power of the probability density per unit voltage of signal amplitudes. In this paper the corresponding result for any finite number of quanta is derived; that is, necessary conditions are found that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy. The optimization criterion used is that the average quantization noise power be a minimum. It is shown that the result obtained here goes over into the Panter and Dite result as the number of quanta becomes large. The optimum quantization schemes for 2^b quanta, b = 1, 2, ..., 7, are given numerically for Gaussian and for Laplacian distributions of signal amplitudes.
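For the squared-error criterion, the two necessary conditions referred to above take a simple standard form (with $p(x)$ the amplitude density, $t_i$ the interval endpoints and $y_i$ the quanta):

\[
t_i = \frac{y_i + y_{i+1}}{2}
\qquad\text{and}\qquad
y_i = \mathbb{E}\!\left[X \mid t_{i-1} < X \le t_i\right]
    = \frac{\int_{t_{i-1}}^{t_i} x\,p(x)\,dx}{\int_{t_{i-1}}^{t_i} p(x)\,dx},
\]

i.e., each interval endpoint lies midway between the two adjacent quanta, and each quantum is the centroid of its interval.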
Article
The fact that the general decoding problem for linear codes and the general problem of finding the weights of a linear code are both NP-complete is shown. This strongly suggests, but does not rigorously imply, that no algorithm for either of these problems which runs in polynomial time exists.
Article
This paper discusses the problem of the minimization of the distortion of a signal by a quantizer when the number of output levels of the quantizer is fixed. The distortion is defined as the expected value of some function of the error between the input and the output of the quantizer. Equations are derived for the parameters of a quantizer with minimum distortion. The equations are not soluble without recourse to numerical methods, so an algorithm is developed to simplify their numerical solution. The case of an input signal with normally distributed amplitude and an expected squared error distortion measure is explicitly computed and values of the optimum quantizer parameters are tabulated. The optimization of a quantizer subject to the restriction that both input and output levels be equally spaced is also treated, and appropriate parameters are tabulated for the same case as above.
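The equations for the minimum-distortion quantizer under squared error (interval endpoints midway between adjacent output levels, output levels equal to the conditional means of their intervals) can also be solved by a simple fixed-point iteration. The sketch below is a Lloyd-style iteration for a standard normal input, not the specific numerical procedure developed in this paper.

```python
import numpy as np
from scipy.stats import norm

def lloyd_max_gaussian(n_levels=4, iters=500):
    """Alternate the two necessary conditions for a standard normal input and
    squared error: boundaries are midpoints of adjacent output levels, and
    levels are the conditional means of their intervals."""
    y = np.linspace(-2, 2, n_levels)                       # initial output levels
    for _ in range(iters):
        t = np.concatenate(([-np.inf], (y[:-1] + y[1:]) / 2, [np.inf]))
        # E[X | a < X < b] for X ~ N(0, 1): (pdf(a) - pdf(b)) / (cdf(b) - cdf(a))
        y = (norm.pdf(t[:-1]) - norm.pdf(t[1:])) / (norm.cdf(t[1:]) - norm.cdf(t[:-1]))
    t = np.concatenate(([-np.inf], (y[:-1] + y[1:]) / 2, [np.inf]))
    return y, t

levels, boundaries = lloyd_max_gaussian(n_levels=4)
print(np.round(levels, 4))        # approximately [-1.510, -0.4528, 0.4528, 1.510]
```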
[1] M. R. Garey and D. S. Johnson, Computers and Intractability. San Francisco: Freeman, 1979.
[3] J. Max, "Quantizing for minimum distortion," IRE Trans. Inform. Theory, vol. IT-6, pp. 7-12, 1960.
[4] S. P. Lloyd, "Least squares quantization in PCM," unpublished Bell Laboratories manuscript, July 31, 1957; also IEEE Trans. Inform. Theory, vol. IT-28, pp. 129-137, 1982.
[5] H. S. Witsenhausen, "Informational aspects of stochastic control," in Analysis and Optimization of Stochastic Systems, Jacobs et al., Eds. New York: Academic, 1980, pp. 273-284.
... and which results in the cells $S_i$ being the Voronoi or Dirichlet regions [2] of the alphabet, and (b) that $A$ should be optimal for $S$, which is accomplished by choosing $y_i$ so that
\[
E\{d(X, y_i) \mid X \in S_i\} = \min_{u} E\{d(X, u) \mid X \in S_i\}. \qquad (3)
\]