Misty Mountain Clustering: Application to Fast Unsupervised Flow Cytometry Gating

Department of Neurology and Center for Translational Systems Biology, Mount Sinai School of Medicine, New York, NY, USA.
BMC Bioinformatics (Impact Factor: 2.58). 10/2010; 11(3):502. DOI: 10.1186/1471-2105-11-502
Source: PubMed


There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets such as flow cytometry data, prove unsatisfactory because of slow speed, problems with local minima, or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval; the final cluster number is then selected by various criteria. These supervised serial clustering methods are time-consuming, and different criteria frequently yield different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points, which are often generated by high-throughput experiments.
To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map, identifying clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing the cross sections of the histogram only once. The overall run time for the composite steps of the algorithm increases linearly with the number of data points. Clustering 10^6 data points in a 2D data space takes about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state-of-the-art automated flow cytometry gating methods indicates that Misty Mountain provides substantial improvements in both run time and the accuracy of cluster assignment.
Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.
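The "top-down removal of clouds" picture described above can be illustrated with a minimal sketch. This is not the published Misty Mountain procedure: it is a simplified stand-in that bins 2D points into a histogram, then lowers a level from the tallest bin downward and tracks connected groups of uncovered bins with a union-find structure. The function names (`histogram2d`, `level_sweep_clusters`) and the `min_prominence` rule for deciding whether two peaks are distinct are assumptions introduced here for illustration only; the paper uses its own statistical criterion.

```python
import math
import random
from collections import defaultdict

def histogram2d(points, bins, lo, hi):
    """Count points on a bins x bins grid covering [lo, hi) on both axes."""
    width = (hi - lo) / bins
    counts = defaultdict(int)
    for x, y in points:
        i = math.floor((x - lo) / width)
        j = math.floor((y - lo) / width)
        if 0 <= i < bins and 0 <= j < bins:
            counts[(i, j)] += 1
    return dict(counts)

def level_sweep_clusters(counts, min_prominence):
    """Visit bins from tallest to shortest and label each bin by its peak.

    Assumed rule (not from the paper): when two groups meet at a saddle bin,
    the lower group stays separate only if its peak rises at least
    `min_prominence` counts above that saddle; otherwise it is absorbed.
    """
    parent = {}  # union-find forest over uncovered bins
    peak = {}    # root bin -> height of that group's peak

    def find(b):
        while parent[b] != b:
            parent[b] = parent[parent[b]]  # path halving
            b = parent[b]
        return b

    for b, h in sorted(counts.items(), key=lambda kv: -kv[1]):
        i, j = b
        neighbors = ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1))
        roots = {find(n) for n in neighbors if n in parent}
        if not roots:
            # A new peak emerges from the clouds.
            parent[b] = b
            peak[b] = h
            continue
        # Attach the bin to the tallest adjacent group.
        top = max(roots, key=lambda r: peak[r])
        parent[b] = top
        for r in roots:
            if r != top and peak[r] - h < min_prominence:
                parent[r] = top  # minor bump: merge into the taller group
    return {b: find(b) for b in parent}

if __name__ == "__main__":
    # Synthetic demo data: two Gaussian blobs (purely illustrative).
    random.seed(0)
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10000)]
    pts += [(random.gauss(6, 1), random.gauss(6, 1)) for _ in range(10000)]
    labels = level_sweep_clusters(histogram2d(pts, 30, -5.0, 11.0), 50)
    print("clusters found:", len(set(labels.values())))
```

In this sketch the histogram is built in a single pass over the data, so that step scales linearly with the number of points; the sweep itself depends only on the number of occupied bins. Every bin is visited once, which loosely mirrors the idea of examining each cross section of the histogram a single time.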

Available from: Istvan Sugar
  • ABSTRACT: Conventional compensation of flow cytometry (FCM) data of an N-stained sample requires additional data sets, from N single-stained control samples, to estimate the spillover coefficients. Single-stained controls, however, are the least rigorous controls because any of the multi-stained controls are closer to the N-stained sample. In this article, a new, optimization-based compensation method has been developed that is able to use not only single- but also multi-stained controls to improve estimates of the spillover coefficients. The method is demonstrated on a data set from five-stained dendritic cells (DCs) with five single-stained and eight multi-stained controls. This approach is practical and leads to significant improvements in FCM compensation.
    Cytometry Part A 05/2011; 79(5):356-60. DOI:10.1002/cyto.a.21062 · 2.93 Impact Factor
  • ABSTRACT: The evolution of immunology research from measurements of single entities to large-scale data-intensive assays necessitates the integration of experimental work with bioinformatics and computational approaches. The introduction of physics into immunology has led to the study of new phenomena, such as cellular noise, which is likely to prove increasingly important for understanding immune system responses. The fusion of "hard science" and biology is also leading to a re-examination of data acquisition, analysis, and statistical validation and is resulting in the development of easy-to-access tools for immunology research. Here, we review some of our models, computational tools, and results related to studies of the innate immune response of human dendritic cells to viral infection. Our project functions on an open model across institutions with electronic record keeping and public sharing of data. Our tools, models, and data can be accessed at http://tsb.mssm.edu/primeportal/.
    Immunologic Research 04/2012; 54(1-3):160-8. DOI:10.1007/s12026-012-8322-6 · 3.10 Impact Factor
  • ABSTRACT: For flow cytometry data, there are two common approaches to the unsupervised clustering problem: one is based on the finite mixture model and the other on spatial exploration of the histograms. The former is computationally slow and has difficulty identifying clusters of irregular shape. The latter cannot be applied directly to high-dimensional data, as the computational time and memory become unmanageable and the estimated histogram is unreliable. An algorithm without these two problems would be very useful. In this article, we combine ideas from the finite mixture model and histogram spatial exploration. This new algorithm, which we call flowPeaks, can be applied directly to high-dimensional data and can identify irregularly shaped clusters. The algorithm first uses the K-means algorithm with a large K to partition the cell population into many small clusters. These partitioned data allow the generation of a smoothed density function using the finite mixture model. All local peaks are exhaustively searched by exploring the density function, and the cells are clustered by the associated local peak. The flowPeaks algorithm is automatic, fast, reliable, and robust to cluster shape and outliers. It has been applied to flow cytometry data and compared with state-of-the-art algorithms, including Misty Mountain, FLOCK, flowMeans, flowMerge and FLAME. The R package flowPeaks is available at https://github.com/yongchao/flowPeaks. Contact: yongchao.ge@mssm.edu. Supplementary data are available at Bioinformatics online.
    Bioinformatics 05/2012; 28(15):2052-8. DOI:10.1093/bioinformatics/bts300 · 4.98 Impact Factor