Article

Measuring Human Performance on Clustering Problems: Some Potential Objective Criteria and Experimental Research Opportunities


Abstract

The study of human performance on discrete optimization problems has a considerable history that spans various disciplines. The two most widely studied problems are the Euclidean traveling salesperson problem and the quadratic assignment problem. The purpose of this paper is to outline a program of study for the measurement of human performance on discrete optimization problems related to clustering of points in the two-dimensional plane. I describe possible objective criteria for clustering problems, the measurement of agreement of solutions produced by subjects, and categories of experiments for investigating human performance on clustering problems.


... Two clusters are well-separated if there are no points in one cluster that are close to any point in the second cluster. The most well-known criteria reported by [6] are 1) maximizing the partition split, 2) minimizing the partition diameter, and 3) minimizing the within-cluster sums of squares. ...
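For concreteness, the three criteria can be scored directly for any fixed partition of points in the plane. The sketch below is ours, not from [6], and all names and data are illustrative: it computes the partition split (to be maximized), the partition diameter, and the within-cluster sums of squares (both to be minimized).

```python
from itertools import combinations
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def split(clusters):
    # Smallest distance between points lying in different clusters.
    return min(dist(p, q)
               for a, b in combinations(clusters, 2)
               for p in a for q in b)

def diameter(clusters):
    # Largest distance between two points in the same cluster.
    return max((dist(p, q) for c in clusters for p, q in combinations(c, 2)),
               default=0.0)

def wcss(clusters):
    # Sum over clusters of squared distances to the cluster centroid.
    total = 0.0
    for c in clusters:
        cx = sum(p[0] for p in c) / len(c)
        cy = sum(p[1] for p in c) / len(c)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in c)
    return total

clusters = [[(0, 0), (1, 0)], [(5, 5), (6, 5), (5, 6)]]
print(split(clusters), diameter(clusters), wcss(clusters))
```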
Full-text available
Conference Paper
In the post-proceedings of the Workshop "Visibility in Information Spaces and in Geographic Environments" a selection of research papers is presented where the topic of visibility is addressed in different contexts. Visibility governs information selection in geographic environments as well as in information spaces and in cognition. The users of social media navigate in information spaces and at the same time, as embodied agents, they move in geographic environments. Both activities follow a similar type of information economy in which decisions by individuals or groups require a highly selective filtering to avoid information overload. In this context, visibility refers to the fact that in social processes some actors, topics or places are more salient than others. Formal notions of visibility include the centrality measures from social network analysis or the plethora of web page ranking methods. Recently, comparable approaches have been proposed to analyse activities in geographic environments: Place Rank, for instance, describes the social visibility of urban places based on the temporal sequence of tourist visit patterns. The workshop aimed to bring together researchers from AI, Geographic Information Science, Cognitive Science, and other disciplines who are interested in understanding how the different forms of visibility in information spaces and geographic environments relate to one another and how the results from basic research can be used to improve spatial search engines, geo-recommender systems or location-based social networks.
... The success of these displays possibly lies in the ability of the human visual system to easily detect patterns in the display such as clusters and outliers. As Brusco (2007) has pointed out, the ability to partition such point arrays into clusters is one of many visual combinatorial optimisation problems for which the human visual system appears to be very well adapted (see also Vickers et al., 2001). ...
Full-text available
Article
Two experiments were conducted examining the effectiveness of visualizations of unstructured texts. The first experiment presented transcriptions of unrehearsed dialog and the second used emails. Both experiments showed an advantage in overall performance for semantically structured two-dimensional (2D) spatialized layouts, such as multidimensional scaling (MDS), over structured and non-structured list displays. The second experiment also demonstrated that this advantage is not simply due to the 2D nature of the display, but the combination of 2D display and the semantic structure underpinning it. Without this structure, performance fell to that of a Random List of documents. The effect of document type in this study and in the Butavicius and Lees (2007) study on visualizations of news articles may be partly described by a change in bias on a speed-accuracy trade-off. At one extreme, users were accurate but slow in answering questions based on the dialog texts while, at the other extreme, users were fast but relatively inaccurate when responding to queries about emails. Similarly, users could respond accurately using the non-structured list interface; however, this was at the cost of very long response times and was associated with a technique whereby participants navigated by clicking on neighboring document representations. Implications of these findings for real-world applications are discussed.
Article
Intelligent mental representations of physical, cognitive and social environments allow humans to navigate enormous search spaces, whose sizes vastly exceed the number of neurons in the human brain. This allows us to solve a wide range of problems, such as the Traveling Salesperson Problem, insight problems, as well as mathematics and physics problems. As an area of research, problem solving has steadily grown over time. Researchers in Artificial Intelligence have been formulating theories of problem solving for the last 70 years. Psychologists, on the other hand, have focused their efforts on documenting the observed behavior of subjects solving problems. This book represents the first effort to merge the behavioral results of human subjects with formal models of the causative cognitive mechanisms. The first coursebook to deal exclusively with the topic, it provides a main text for elective courses and a supplementary text for courses such as cognitive psychology and neuroscience.
Article
Context: Understanding the process through which adolescents and young adults first try legal and illegal substances is crucial for the development of tailored prevention and treatment programs. However, patterns of substance first use can be very complex when multiple substances are considered, requiring reduction to a small number of meaningful categories. Data: We used data from a survey on adolescent and young adult health conducted in 2002 in Switzerland. Answers from 2212 subjects aged 19 and 20 were included. The first-ever consumption of 10 substances (tobacco, cannabis, medicine to get high, sniff (volatile substances and inhalants), ecstasy, GHB, LSD, cocaine, methadone, and heroin) was considered, for a grand total of 516 different patterns. Methods: In a first step, automatic clustering was used to decrease the number of patterns to 50. Then, two groups of substance-use experts (three social field workers, and three toxicologists and health professionals) were asked to reduce them to a maximum of 10 meaningful categories. Results: Classifications obtained through our methodology are of practical interest, revealing associations invisible to purely automatic algorithms. The article includes a detailed analysis of both final classifications, and a discussion of the advantages and limitations of our approach.
Full-text available
Article
The planar Euclidean version of the Traveling Salesperson Problem (TSP) requires finding a tour of minimal length through a two-dimensional set of points. Despite the computational intractability of these problems, most empirical studies have found that humans are able to find good solutions. For this reason, understanding human performance on TSPs has the potential to offer insights into basic perceptual and cognitive decision making processes, as well as potentially informing theorizing in individual differences and intelligence. Through the convex hull hypothesis, previous researchers (MacGregor & Ormerod 1996; MacGregor, Ormerod, & Chronicle 1999; MacGregor, Ormerod, & Chronicle 2000; Ormerod & Chronicle 1999) have suggested people use a global-to-local solution process. We review the empirical evidence for and against this idea, before suggesting an alternative local-to-global solution process, based on the avoidance of intersections in constructing a tour. To compare these two competing approaches, we present the results of an experiment that measures the different effects the number of points on the convex hull and the number of potential intersections have on human performance. It is found that both have independent effects on the degree to which people deviate from the optimal solution, with performance worsening as more points are added to the convex hull and as fewer potential intersections are present. A measure of response uncertainty, capturing the degree to which different people produce the same types of solution, is found to be unaffected by the number of points on the convex hull but to increase as fewer potential intersections are present. A possible interpretation of these results in terms of a generative transformational theory of human visual perception is discussed.
Full-text available
Article
Cluster analysis involves the problem of optimal partitioning of a given set of entities into a pre-assigned number of mutually exclusive and exhaustive clusters. Here the problem is formulated in two different ways in terms of the distance function: (a) minimizing the within-groups sums of squares, and (b) minimizing the maximum distance within groups. These lead to different kinds of linear and non-linear (0–1) integer programming problems. Computational difficulties are discussed and efficient algorithms are provided for some special cases.
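Restated symbolically (our notation, hedged; the paper's own notation may differ), with entities x_1, ..., x_n partitioned into clusters C_1, ..., C_K, the two objectives are:

```latex
% Our notation: \bar{x}_k is the centroid of C_k, d(.,.) the distance function.
\begin{align}
\text{(a)}\quad & \min_{C_1,\dots,C_K}\ \sum_{k=1}^{K} \sum_{x_i \in C_k}
    \lVert x_i - \bar{x}_k \rVert^2
    && \text{(within-groups sums of squares)} \\
\text{(b)}\quad & \min_{C_1,\dots,C_K}\ \max_{1 \le k \le K}\
    \max_{x_i, x_j \in C_k} d(x_i, x_j)
    && \text{(maximum within-group distance)}
\end{align}
```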
Full-text available
Chapter
A general parametric scheme of hierarchical clustering procedures that is invariant under monotone transformations of similarity values and under renumbering of objects is described. The scheme consists of two steps: correction of the given similarity values between objects, followed by transitive closure of the resulting valued relation. Some theoretical properties of the scheme are studied. Different parametric classes of clustering procedures within this scheme, based on intuitions such as "keep similarity classes" and "break bridges between clusters," are considered. Several examples illustrate the application of the proposed clustering procedures to the analysis of similarity structures in data.
Full-text available
Article
MacGregor and Ormerod (1996) have presented results purporting to show that human performance on visually presented traveling salesman problems, as indexed by a measure of response uncertainty, is strongly determined by the number of points in the stimulus array falling inside the convex hull, as distinct from the total number of points. It is argued that this conclusion is artifactually determined by their constrained procedure for stimulus construction, and, even if true, would be limited to arrays with fewer than around 50 points.
Full-text available
Article
Given a set of entities, Cluster Analysis aims at finding subsets, called clusters, which are homogeneous and/or well separated. As many types of clustering and criteria for homogeneity or separation are of interest, this is a vast field. A survey is given from a mathematical programming viewpoint. Steps of a clustering study, types of clustering and criteria are discussed. Then algorithms for hierarchical, partitioning, sequential, and additive clustering are studied. Emphasis is on solution methods, i.e., dynamic programming, graph theoretical algorithms, branch-and-bound, cutting planes, column generation and heuristics.
Full-text available
Article
The processing time for quantifying numerosity of two-dimensional dot patterns was investigated as a function of both number of dots and relative proximity between dots. A cluster algorithm (CODE) was first developed as a formal model of how human subjects organize neighboring dots into groups. CODE-based predictions of grouping effects on number processing latencies were then tested with patterns consisting of n dots (range n = 13–23). The results largely confirmed CODE-based predictions and thereby indicated that large collections of dots are preferably counted by groups. Small (n ≤ 5) groups are subitized and their partial results are summed to a running total. Based on criteria other than dot proximity, large (n > 5), proximity-based groups are subdivided into smaller groups of two or three dots, which are again subitized.
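The CODE model itself is not reproduced here, but a crude stand-in conveys the grouping step it formalizes: dots closer than a threshold are linked, and connected components form the groups whose sizes a subitize-and-add account would then quantify. The threshold and point data below are invented for illustration.

```python
import math

def group_by_proximity(dots, threshold):
    # Union-find over dots; dots within `threshold` of each other are linked.
    n = len(dots)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(dots[i], dots[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(dots[i])
    return list(groups.values())

dots = [(0, 0), (0.5, 0), (4, 4), (4.3, 4.1), (9, 0)]
print([len(g) for g in group_by_proximity(dots, 1.0)])  # [2, 2, 1]
```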
Full-text available
Article
Two experiments on performance on the traveling salesman problem (TSP) are reported. The TSP consists of finding the shortest path through a set of points, returning to the origin. It appears to be an intransigent mathematical problem, and heuristics have been developed to find approximate solutions. The first experiment used 10-point, the second, 20-point problems. The experiments tested the hypothesis that complexity of TSPs is a function of number of nonboundary points, not total number of points. Both experiments supported the hypothesis. The experiments provided information on the quality of subjects' solutions. Their solutions clustered close to the best known solutions, were an order of magnitude better than solutions produced by three well-known heuristics, and on average fell beyond the 99.9th percentile in the distribution of random solutions. The solution process appeared to be perceptually based.
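The stimulus measure at issue, the number of boundary versus nonboundary points, is easy to make concrete. A hedged sketch (ours, not the authors') counts convex hull and interior points with the standard monotone-chain algorithm; the hypothesis predicts that difficulty tracks the interior count rather than the total count.

```python
def convex_hull(points):
    # Andrew's monotone chain; returns hull vertices in counterclockwise order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull = convex_hull(points)
print(len(hull), "boundary points,", len(points) - len(hull), "interior points")
```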
Full-text available
Article
The traveling salesperson problem (TSP) consists of finding the shortest tour around a set of locations and is an important task in computer science and operations research. In four experiments, the relationship between processes implicated in the recognition of good figures and the identification of TSP solutions was investigated. In Experiment 1, a linear relationship was found between participants' judgments of good figure and the optimality of solutions to TSPs. In Experiment 2, identification performance was shown to be a function of solution optimality and problem orientation. Experiment 3 replicated these findings with a forced-pace method, suggesting that global processing, rather than a local processing strategy involving point-by-point analysis of TSP solutions, is the primary process involved in the derivation of best figures for the presented TSPs. In Experiment 4, the role of global precedence was confirmed using a priming method, in which it was found that short (100 msec) primes facilitated solution identification, relative to no prime or longer primes. Effects of problem type were found in all the experiments, suggesting that local features of some problems may disrupt global processing. The results are discussed in terms of Sanocki's (1993) global-to-local contingency model. We argue that global perceptual processing may contribute more generally to problem solving and that human performance can complement computational TSP methods.
Chapter
This chapter discusses a production system for counting, subitizing, and adding. It presents an explicit model of the process of quantification. The general model states that quantification of n items takes place via subitizing for n ≤ 4, and via subitizing and addition for n > 4; the latter process is what is conventionally called counting. The models form a collection of independent rules, called productions, that together form a production system. A production system obeys simple operating rules: (1) the productions are considered in sequence, starting with the first; (2) each condition is compared with the current state of knowledge in the system, as represented by the symbols in STM, and if all of the elements in a condition can be matched with elements in STM, then the condition is satisfied; (3) if a condition is not satisfied, the next production rule in the ordered list is considered; (4) if a condition is satisfied, the actions to the right of the arrow are taken, and the production system is then reentered from the first step; (5) when a condition is satisfied, all the STM elements that were matched are moved to the front of STM, providing a form of automatic rehearsal; (6) actions can change the state of goals, replace elements, apply operators, or add elements to STM; and (7) STM is a stack in which a new element appears at the top, pushing everything else in the stack down one position. Since STM is limited in size, elements may be lost.
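The operating rules above translate almost directly into code. The following toy interpreter is ours: the productions and STM contents are invented, and the model's actual quantification rules are not reproduced. It illustrates ordered matching, rehearsal to the front of STM, and the bounded stack.

```python
STM_SIZE = 7

def run(productions, stm, max_cycles=50):
    for _ in range(max_cycles):
        for condition, action in productions:     # rule 1: ordered scan
            if all(c in stm for c in condition):  # rules 2-3: first match wins
                for c in reversed(condition):     # rule 5: rehearsal to front
                    stm.remove(c)
                    stm.insert(0, c)
                action(stm)                       # rules 4 and 6: fire action
                del stm[STM_SIZE:]                # rule 7: bounded stack
                break                             # reenter from the first step
        else:
            return stm                            # no production fired: halt
    return stm

# Invented toy productions: tick off "dot" elements one at a time, then
# drop the goal when no dots remain.
productions = [
    (["goal:count", "dot"],
     lambda stm: (stm.remove("dot"), stm.insert(0, "counted+1"))),
    (["goal:count"],
     lambda stm: stm.remove("goal:count")),
]
print(run(productions, ["goal:count", "dot", "dot", "dot"]))
```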
Article
The first part of this monograph's title, Combinatorial Data Analysis (CDA), refers to a wide class of methods for the study of relevant data sets in which the arrangement of a collection of objects is absolutely central. Characteristically, CDA is involved either with the identification of arrangements that are optimal for a specific representation of a given data set (usually operationalized with some specific loss or merit function that guides a combinatorial search defined over a domain constructed from the constraints imposed by the particular representation selected), or with the determination in a confirmatory manner of whether a specific object arrangement given a priori reflects the observed data. As the second part of the title, Optimization by Dynamic Programming, suggests, the sole focus of this monograph is on the identification of arrangements; it is then restricted further, to where the combinatorial search is carried out by a recursive optimization process based on the general principles of dynamic programming. For an introduction to confirmatory CDA without any type of optimization component, the reader is referred to the monograph by Hubert (1987). For the use of combinatorial optimization strategies other than dynamic programming for some (clustering) problems in CDA, the recent comprehensive review by Hansen and Jaumard (1997) provides a particularly good introduction.
Article
The classification maximum likelihood approach is sufficiently general to encompass many current clustering algorithms, including those based on the sum of squares criterion and on the criterion of H. P. Friedman and J. Rubin [J. Am. Stat. Assoc. 62, 1159-1178 (1967)]. However, as currently implemented, it does not allow the specification of which features (orientation, size, and shape) are to be common to all clusters and which may differ between clusters. Also, it is restricted to Gaussian distributions and it does not allow for noise. We propose ways of overcoming these limitations. A reparameterization of the covariance matrix allows us to specify that some, but not all, features be the same for all clusters. A practical framework for non-Gaussian clustering is outlined, and a means of incorporating noise in the form of a Poisson process is described. An approximate Bayesian method for choosing the number of clusters is given. The performance of the proposed methods is studied by simulation, with encouraging results. The methods are applied to the analysis of a data set arising in the study of diabetes, and the results seem better than those of previous analyses. A magnetic resonance image (MRI) of the brain is also analyzed, and the methods appear successful in extracting the main features of anatomical interest. The methods described here have been implemented in both Fortran and S-PLUS versions, and the software is freely available through StatLib.
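The central idea, constraining which covariance features are shared across clusters, survives in modern model-based clustering software. A hedged illustration using scikit-learn's GaussianMixture (not the authors' Fortran/S-PLUS software, and with invented data): the covariance_type argument imposes the analogous constraint, and BIC serves as a rough stand-in for the approximate Bayesian choice of the number of clusters.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),    # tight spherical cluster
               rng.normal([4, 4], 1.5, (50, 2))])   # diffuse cluster

# 'spherical', 'diag', 'tied', and 'full' successively relax which
# covariance features must be common to all clusters.
for cov in ("spherical", "diag", "tied", "full"):
    gm = GaussianMixture(n_components=2, covariance_type=cov,
                         random_state=0).fit(X)
    print(cov, round(gm.bic(X), 1))   # lower BIC = preferred model
```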
Article
Several multivariate clustering methods are analyzed in which each cluster may have a different metric depending on its covariance matrix. Numerical experiments show that the only reliable method among these is one using a metric suggested by Rohlf [1970] based on the within cluster covariance matrix normalized for unit determinant.
Article
The class of discrete optimization problems may be partitioned into two subclasses PS and P?. PS is the subclass of problems which are known to be polynomial solvable. P? is the subclass of problems for which it is not yet known whether an algorithm exists which is polynomial bounded. The subclass P? contains the class NPC of NP-complete discrete optimization problems. NPC has the important property that if only one member of NPC can be shown to be polynomial solvable, then all members of NPC as well as a large number of other combinatorial problems are polynomial solvable. However, it seems very unlikely that all NP-complete problems are polynomial solvable.
Article
A method for investigating the relationships of points in multi-dimensional space is described. Using an analysis of variance technique, the points are divided into the two most-compact clusters, and the process repeated sequentially so that a 'tree' diagram is formed. The application of the method to problems of classification is particularly stressed, and numerical examples are given.
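One step of such a divisive scheme can be made concrete by exhaustion, which is feasible only for small n since there are 2^(n-1) - 1 bipartitions. The sketch below is ours (not the paper's analysis-of-variance shortcut) and finds the two most compact clusters by total within-group sum of squares:

```python
from itertools import combinations

def ss(group):
    # Within-group sum of squared deviations from the centroid.
    n = len(group)
    cx = sum(p[0] for p in group) / n
    cy = sum(p[1] for p in group) / n
    return sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in group)

def best_split(points):
    n = len(points)
    best = None
    rest = points[1:]
    # Fix points[0] in group A so each bipartition is examined once.
    for r in range(0, n - 1):
        for combo in combinations(rest, r):
            a = [points[0], *combo]
            b = [p for p in rest if p not in combo]
            cost = ss(a) + ss(b)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
cost, a, b = best_split(points)
print(round(cost, 3), a, b)
```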
Article
Methods of cluster analysis based on maximizing or minimizing certain criteria are examined. The effect of adding a single point to the data casts light on the properties of the methods.
Article
A basic problem in cluster analysis is how to partition the entities of a given set into a preassigned number of homogeneous subsets called clusters. The homogeneity of the clusters is often expressed as a function of a dissimilarity measure between entities. The objective function considered here is the minimization of the maximum dissimilarity between entities in the same cluster. It is shown that the clustering problem so defined is reducible to the problem of optimally coloring a sequence of graphs, and is NP-complete. An efficient algorithm is proposed and computational experience with problems involving up to 270 entities is reported on.
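The reduction can be sketched directly: a partition into K clusters with diameter at most t exists exactly when the graph joining entities whose dissimilarity exceeds t is K-colourable. The toy code below is ours and uses exact backtracking colouring, feasible only for small instances (unlike the efficient algorithm the paper proposes); it scans candidate thresholds in increasing order.

```python
def k_colorable(adj, n, k, colors=None, v=0):
    # Backtracking search for a proper k-colouring; returns one if found.
    if colors is None:
        colors = [-1] * n
    if v == n:
        return colors[:]
    for c in range(k):
        if all(colors[u] != c for u in adj[v]):
            colors[v] = c
            result = k_colorable(adj, n, k, colors, v + 1)
            if result:
                return result
        colors[v] = -1
    return None

def min_diameter_partition(d, k):
    # d: symmetric dissimilarity matrix; try thresholds in increasing order.
    n = len(d)
    for t in sorted({d[i][j] for i in range(n) for j in range(i + 1, n)}):
        adj = [[j for j in range(n) if j != i and d[i][j] > t]
               for i in range(n)]
        colors = k_colorable(adj, n, k)
        if colors:
            return t, colors
    return None

d = [[0, 1, 9, 9],
     [1, 0, 9, 8],
     [9, 9, 0, 2],
     [9, 8, 2, 0]]
print(min_diameter_partition(d, 2))  # diameter 2: clusters {0,1} and {2,3}
```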
Article
This paper deals with methods of "cluster analysis". In particular we attack the problem of exploring the structure of multivariate data in search of "clusters". The approach taken is to use a computer procedure to obtain the "best" partition of n objects into g groups. A number of mathematical criteria for "best" are discussed and related to statistical theory. A procedure for optimizing the criteria is outlined. Some of the criteria are compared with respect to their behavior on actual data. Results of data analysis are presented and discussed.
Article
Linear programming is used to reduce the combinatorial magnitude of travelling-salesman problems. To illustrate the method, a step-by-step solution of Barachet's ten-city example is presented.
Article
Various authors propose the use of flow dominance in complexity ratings to evaluate the complexity of facilities layout problems, to determine the choice between computer algorithms and visual based methods for plant layout, and to decide on the particular layout configuration (line or process layout) to be installed. This paper examines critically past contributions and casts some serious doubts on the validity of the flow dominance concept and the related measures of layout complexity. It is shown that flow dominance does not serve its intended purpose and that the complexity rating factors suggested in the literature not only show serious problems with regard to their interpretability, but are largely unusable. Finally, the future work needed in this problem area is elaborated.
Article
New clustering criteria for use when a mixture of multivariate normal distributions is an appropriate model are presented. They are derived from maximum likelihood and Bayesian approaches corresponding to different assumptions about the covariance matrices of the mixture components. Two of the criteria are modifications of the determinant of the within-groups sum-of-squares criterion of Friedman and Rubin (1967, Journal of the American Statistical Association 62, 1159-1178); these criteria appear to be more sensitive to disparate cluster sizes. Two others are appropriate for different-shaped clusters. The practical aspects of these criteria, and of another one studied by Maronna and Jacovkis (1974, Biometrics 30, 499-505) for heterogeneous covariance matrices, are outlined. An example involving the separation of two types of diabetic patients from normal subjects, each group having a distinct covariance structure, is given. The results with the three criteria appropriate for different-shaped clusters were comparable to one another and preferable to those obtained with the three criteria for similar-shaped clusters. Results obtained for the example with two additional clustering procedures are presented.
Article
The standard classification model with several normal populations is extended to the cluster analysis situation where little or no previous information about the population parameters is available. Some common clustering procedures are shown to be extensions of likelihood ratio methods of classification. The analysis suggests that the procedures may have a tendency to partition the sample into groups of about the same size. This suggestion is examined in an example.
Article
The concept of a spanning tree for a weighted graph is used to characterize several methods of clustering a set of objects. In particular, most of the paper is devoted to stating relationships between spanning trees, single-link and complete-link hierarchical clustering, network flow and two divisive clustering procedures. Several related topics using the notion of a spanning tree are also mentioned.
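The single-link case of this characterization is compact enough to sketch: build a minimum spanning tree and delete its k-1 longest edges; the remaining components are exactly the single-link clusters at that level. The code below is ours (Prim's algorithm on invented points) and illustrates the correspondence:

```python
import math

def mst_edges(points):
    # Prim's algorithm; returns MST edges as (length, u, v) tuples.
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n   # (distance to tree, nearest tree vertex)
    in_tree[0] = True
    for j in range(1, n):
        best[j] = (math.dist(points[0], points[j]), 0)
    edges = []
    for _ in range(n - 1):
        v = min((j for j in range(n) if not in_tree[j]),
                key=lambda j: best[j][0])
        edges.append((best[v][0], best[v][1], v))
        in_tree[v] = True
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[v], points[j])
                if d < best[j][0]:
                    best[j] = (d, v)
    return edges

def single_link(points, k):
    edges = sorted(mst_edges(points))   # ascending by length
    if k > 1:
        edges = edges[:-(k - 1)]        # cut the k-1 longest MST edges
    parent = list(range(len(points)))   # union-find over the kept edges

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, u, v in edges:
        parent[find(u)] = find(v)
    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    return list(clusters.values())

points = [(0, 0), (1, 0), (0.5, 1), (8, 8), (9, 8)]
print(single_link(points, 2))
```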
Article
Recently, Mulvey and Crowder (Mulvey, J., H. Crowder. 1979. Cluster analysis: an application of Lagrangian relaxation. Management Sci. 25 329-340.) suggested that the p-median problem might be useful for cluster analysis problems (where the goal is to group objects described by a vector of characteristics in such a way that objects in the same group are somehow more alike than objects in different groups). The intent of this paper is to test Mulvey and Crowder's proposal using the mixture model approach; i.e., by applying a number of algorithms (including one for the p-median problem) to a set of objects randomly sampled from a number of known multivariate populations and comparing the ability of each algorithm to detect the original populations. In order to evaluate the results, a generalized partition comparison measure and its distribution are developed. Using this measure, results from various algorithms are compared.
Article
An experiment with plant layout formulated as a quadratic assignment problem gave the following results: (1) The CRAFT algorithm does as well as human subjects in solving the plant layout problem; (2) CRAFT does better as problem size increases if the human subjects have no prior knowledge of the computer solution values; (3) there is no breakpoint in relative problem solving ability between humans and computers at 200% flow dominance (the coefficient of variation of the flow matrix expressed as a percent); (4) the point at which the computer becomes better, as opposed to equally good, is more clearly indicated by problem size than by flow dominance; (5) humans can solve problems with high flow dominance and problems with near-zero flow dominance better than they can solve problems with low flow dominance, though the computer is still at least as good across all levels of flow dominance.
Article
This work offers a critique of the methodology used by Scriabin and Vergin [Scriabin, Michael, Roger C. Vergin. 1975. Comparison of computer algorithms and visual based methods for plant layout. Management Sci. 22 (2, October) 172-181] in their study of computer algorithms versus humans in designing plant layouts. It attempts to show that Scriabin's and Vergin's experiments do not provide a useful comparison of computers and humans, and to point out several experimental procedures that would make comparisons of heuristic computer algorithms and humans more valid.
Article
Increasing emphasis on the reduction of materials handling costs in the modern plant has led to research into new methods of planning the process type layout in such a way as to minimize these costs. This project compares the performances of three highly rated computer algorithms prescribed for the solution of the plant layout problem with the performances of selected human subjects using the manual and visual methods still used and recommended by industrial engineers for plant layout design. The objective of this comparison is to determine whether there is in fact an advantage to using one of the available computer programs to solve the problem, instead of designing the layout by traditional visual-based methods. These tests, performed under the control of a computer system which accurately recorded the solutions achieved by each subject, show not only that the computer algorithms do not perform better than selected human subjects in the design of plant layouts, but that the human subjects, without the benefit of any prescriptive help from a computer, actually achieve layouts which are stochastically better than those produced by the computer programs.
Article
It is well known that minimum-diameter partitioning of symmetric dissimilarity matrices can be framed within the context of coloring the vertices of a graph. Although confusion data are typically represented in the form of asymmetric similarity matrices, they are also amenable to a graph-coloring perspective. In this paper, we propose the integration of the minimum-diameter partitioning method with a neighborhood-based coloring approach for analyzing digraphs corresponding to confusion data. This procedure is capable of producing minimum-diameter partitions with the added desirable property that vertices with the same color have similar in-neighborhoods (i.e., directed edges entering the vertex) and out-neighborhoods (i.e., directed edges exiting the vertex) for the digraph corresponding to the minimum partition diameter.
Article
Although problem solving is an essential expression of intelligence, both experimental and differential psychology have neglected an important class of problems, for which it is difficult or impossible for systematic procedures to provide a definitive solution. Two experiments are described, in which participants’ solutions to three computationally difficult problems (a Travelling Salesperson, a Minimal Spanning Tree, and a Generalised Steiner Tree problem) all showed consistent individual differences that intercorrelated reliably and correlated moderately with scores on Raven's Advanced Progressive Matrices. The results are interpreted in terms of a theory of visual perception based on the efficient use of information about the relative position of stimulus elements.
Conference Paper
Discrete optimization problems arise in various applications such as airline crew scheduling and rostering, vehicle routing, frequency assignment, communication network design, etc.
Article
Improvements to the dynamic programming (DP) strategy for partitioning (nonhierarchical classification) as discussed in Hubert, Arabie, and Meulman (2001) are proposed. First, it is shown how the number of evaluations in the DP process can be decreased without affecting generality. Both a completely nonredundant and a quasi-nonredundant method are proposed. Second, an efficient implementation of both approaches is discussed. This implementation is shown to have a dramatic increase in speed over the original program. The flexibility of the approach is illustrated by analyzing three data sets.
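The DP principle itself is easy to exhibit on a toy scale. The sketch below is ours and omits the paper's nonredundancy and implementation refinements: dp[k][S] is the best cost of partitioning object subset S into k clusters, and anchoring each new cluster on the lowest-indexed element of S ensures no split is evaluated twice.

```python
import math

def cluster_cost(points, mask):
    # WCSS of the single cluster indexed by the bits of `mask`.
    members = [p for i, p in enumerate(points) if mask >> i & 1]
    cx = sum(p[0] for p in members) / len(members)
    cy = sum(p[1] for p in members) / len(members)
    return sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)

def dp_partition(points, k):
    n = len(points)
    full = (1 << n) - 1
    cost = [0.0] * (full + 1)
    for mask in range(1, full + 1):
        cost[mask] = cluster_cost(points, mask)
    dp = [[math.inf] * (full + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for kk in range(1, k + 1):
        for mask in range(1, full + 1):
            low = mask & -mask     # lowest set bit anchors the new cluster
            sub = mask
            while sub:             # enumerate nonempty submasks of `mask`
                if sub & low:
                    cand = dp[kk - 1][mask ^ sub] + cost[sub]
                    if cand < dp[kk][mask]:
                        dp[kk][mask] = cand
                sub = (sub - 1) & mask
    return dp[k][full]

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
print(round(dp_partition(points, 2), 3))
```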
Article
The traveling salesman problem belongs to an important class of scheduling and routing problems. It is also a subproblem in solving others, such as the warehouse distribution problem. It has been attacked by many mathematical methods with but meager success. Only for special forms of the problem or for problems with a moderate number of points can it be solved exactly, even if very large amounts of computer time are used. Heuristic procedures have been proposed and tested with only slightly better results. This paper describes a computer aided heuristic technique which uses only a modest amount of computer time in real-time to solve large (100-200) point problems. This technique takes advantage of both the computer's and the human's problem-solving abilities. The computer is not asked to solve the problem in a brute force way as in many of today's heuristics, but it is asked to organize the data for the human so that the human can solve the problem easily. The technique used in this paper seems to point to new directions in the field of man-machine interaction and in the field of artificial intelligence.
Article
We enumerate all the p-partitions of the vertices of a complete valued graph (partitions with p classes) with minimum diameter Δ. When p = 2, we recall M.R. Rao's algorithm, which makes it possible to count and enumerate the bipartitions. When p > 2, the problem is equivalent to building a minimum-threshold spanning subgraph Gσ having edges greater than Δ and being p-colourable, so it is an NP-hard problem. First, we use heuristics to determine σ, an upper approximation of Δ, and enumerate all the p-colourings of Gσ. Next we reduce σ as long as the partitions remain compatible. When the algorithm stops, the diameter of the remaining partitions is the greatest edge length below the threshold σ.
Article
Minimization of the within-cluster sums of squares (WCSS) is one of the most important optimization criteria in cluster analysis. Although cluster analysis modules in commercial software packages typically use heuristic methods for this criterion, optimal approaches can be computationally feasible for problems of modest size. This paper presents a new branch-and-bound algorithm for minimizing WCSS. Algorithmic enhancements include an effective reordering of objects and a repetitive solution approach that precludes the need for splitting the data set, while maintaining strong bounds throughout the solution process. The new algorithm provided optimal solutions for problems with up to 240 objects and eight well-separated clusters. Poorly separated problems with no inherent cluster structure were optimally solved for up to 60 objects and six clusters. The repetitive branch-and-bound algorithm was also successfully applied to three empirical data sets from the classification literature.
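A toy version conveys why branch-and-bound works for this criterion (this is our sketch of the idea, not the paper's algorithm, which adds object reordering and the repetitive bounding scheme): because WCSS can only grow as objects are assigned, the cost of a partial assignment is a valid lower bound, and any branch reaching the incumbent can be pruned.

```python
import math

def wcss(clusters):
    total = 0.0
    for c in clusters:
        if c:
            cx = sum(p[0] for p in c) / len(c)
            cy = sum(p[1] for p in c) / len(c)
            total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in c)
    return total

def branch_and_bound(points, k):
    best = [math.inf, None]

    def recurse(i, clusters):
        cost = wcss(clusters)
        if cost >= best[0]:
            return                     # prune: bound reaches the incumbent
        if i == len(points):
            if all(clusters):          # require k nonempty clusters
                best[0], best[1] = cost, [c[:] for c in clusters]
            return
        used = sum(1 for c in clusters if c)
        for j in range(min(used + 1, k)):  # skip symmetric empty clusters
            clusters[j].append(points[i])
            recurse(i + 1, clusters)
            clusters[j].pop()

    recurse(0, [[] for _ in range(k)])
    return best

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
print(branch_and_bound(points, 2)[0])
```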
Article
This paper attempts to review and expand upon the relationship between graph theory and the clustering of a set of objects. Several graph-theoretic criteria are proposed for use within a general clustering paradigm as a means of developing procedures in between the extremes of complete-link and single-link hierarchical partitioning; these same ideas are then extended to include the more general problem of constructing subsets of objects with overlap. Finally, a number of related topics are surveyed within the general context of reinterpreting and justifying methods of clustering either through standard concepts in graph theory or their simple extensions.
Article
Techniques for partitioning objects into optimally homogeneous groups on the basis of empirical measures of similarity among those objects have received increasing attention in several different fields. This paper develops a useful correspondence between any hierarchical system of such clusters, and a particular type of distance measure. The correspondence gives rise to two methods of clustering that are computationally rapid and invariant under monotonic transformations of the data. In an explicitly defined sense, one method forms clusters that are optimally “connected,” while the other forms clusters that are optimally “compact.”
Article
The more ways there are of understanding a clustering technique, the more effectively the results can be analyzed and used. I will give a general procedure, called parameter modification, to obtain from a clustering criterion a variety of equivalent forms of the criterion. These alternative forms reveal aspects of the technique that are not necessarily apparent in the original formulation. This procedure is successful in improving the understanding of a significant number of clustering techniques. The insight obtained will be illustrated by applying parameter modification to partitioning, mixture and fuzzy clustering methods, resulting in a unified approach to the study of these methods and a general algorithm for optimizing them.
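A standard identity illustrates the kind of equivalent form at issue (our example, not necessarily one of the paper's parameter modifications): the centroid form of the within-cluster sum of squares equals a pairwise form in which no centroid appears, which is why such criteria can be evaluated from distances alone.

```latex
% Standard identity (illustrative): centroid form vs. pairwise form of WCSS.
\sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert^2
  \;=\; \frac{1}{2\lvert C_k\rvert}
        \sum_{x_i \in C_k} \sum_{x_j \in C_k} \lVert x_i - x_j \rVert^2 ,
\qquad
\bar{x}_k = \frac{1}{\lvert C_k\rvert} \sum_{x_i \in C_k} x_i .
```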
Article
The problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. We begin by reviewing a well-known measure of partition correspondence often attributed to Rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by Morey and Agresti (1984) and adopted by others (e.g., Milligan and Cooper 1985) is based on an incorrect assumption. Then, the general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. They are generated from corresponding partitions using various scoring rules. Special cases derivable include traditionally familiar statistics and/or ones tailored to weight certain object pairs differentially. Finally, we propose a measure based on the comparison of object triples having the advantage of a probabilistic interpretation in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between ±1.
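The measures discussed here are short to compute. The sketch below is ours (the labels are invented); it derives the Rand index and its chance-corrected (adjusted) form from the contingency table of two partitions given as label lists.

```python
from math import comb
from collections import Counter

def rand_indices(labels_a, labels_b):
    n = len(labels_a)
    pairs = comb(n, 2)
    nij = Counter(zip(labels_a, labels_b))   # contingency-table cells
    ai = Counter(labels_a)                   # row margins
    bj = Counter(labels_b)                   # column margins
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in ai.values())
    sum_b = sum(comb(c, 2) for c in bj.values())
    # Rand: fraction of object pairs treated alike by both partitions.
    rand = (pairs + 2 * sum_ij - sum_a - sum_b) / pairs
    # Adjusted Rand: zero in expectation under random partitions with
    # fixed margins, one for identical partitions.
    expected = sum_a * sum_b / pairs
    ari = (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
    return rand, ari

print(rand_indices([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # (0.8, 0.375)
```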
Article
Visual numerosity judgements were made for tachistoscopically presented arrays of dots. The arrangement within the arrays was either linear or such that dots could be easily perceptually subdivided into two groups. Subdivision was either in terms of an orientation difference, a colour difference, or a spacing difference in the centre of the array. For a large difference in orientation between the two 'arms' of the array (90 degrees), or a large central space (three times the interdot interval), up to 8 dots were accurately perceived. This numerosity limit was twice that found for equivalent linear arrays, with no grouping. Although in terms of accuracy it seems that in these conditions the two groups within each array can be counted independently, there is no evidence for independent processing in terms of response times. From the results of a subsidiary experiment it seems likely that the slow response times in the subgrouping conditions are due to the necessity of processes other than counting (such as judgements of symmetry). For arrays where subgrouping was in terms of a colour difference, or an orientation difference of between approximately 45 degrees and 90 degrees, or a small central space of twice the interdot interval, there was an improvement in accuracy compared to equivalent linear arrays, but no evidence of independent processing, up to a limit of 4, in each of the subgroups. From these preliminary results, tentative proposals concerning 'numerosity' units and their properties are made.
Article
Mathematically, the process of enumeration is fundamental to arithmetic. Psychologically, it is a sensorimotor chain controlled at every stage by a shifting perceptual organization. Enumeration requires a chant ("1, 2, 3 ..."), a shifting indicator response (pointing), and a perceptual grouping of objects into those already counted and those still ahead. The arrangement of the objects has, theoretically, an important effect on the speed and accuracy of enumeration. Further analysis shows that the serial chain of behavior, required for counting a fairly large set of objects, must be divided into parts, and the objects grouped into corresponding subsets. Three experiments show the relationship between arrangement of objects and counting.