Conference Paper

A Novel Workflow for Semi-supervised Annotation of Cell-type Clusters in Mass Cytometry Data

Authors:

Abstract

Mass cytometry by time-of-flight (CyTOF) is a widely used technology for studying variation in immune cell populations by simultaneously measuring the expression of 40-50 protein markers in millions of single cells. Traditionally, cell types are identified with a clustering method that uses cell-surface marker expression profiles to group similar cells. Although instrumental in analyzing high-dimensional CyTOF datasets, current clustering-based strategies face a number of limitations. For larger datasets, sub-sampling is routinely performed (often only 10% or even less of all events are used), and the randomly selected cells are assumed to be representative of the entire cell population [1]. Sub-sampling is done primarily to reduce computation time and memory use, but it causes significant data loss and reduces the probability of annotating non-canonical cells with small population sizes. Moreover, the cluster to which a cell is assigned varies with its neighboring cells, which makes cell annotation difficult. Conversely, the statistical recurrence of a given cell within a single cell-type cluster, despite varying neighboring cells, can be used to assign it to its statistically most probable cell type. Therefore, to extend the usability of existing approaches, we present a novel bootstrapping-based workflow, integrated with automated cell-type identification, that predicts statistically reproducible cell clusters. Briefly, the method first creates blocks of a fixed number of randomly selected cells from each sample; an expression sub-matrix is then created by picking one block at random from each sample and concatenating the picked blocks. The cells in the sub-matrix are then subjected to cell-type annotation using Linear Discriminant Analysis or the ACDC algorithm [2].
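The block-and-concatenate step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_blocks`, `draw_submatrix`), the toy sample names, and the choice to drop a trailing incomplete block are all assumptions made for illustration, and cells are represented only by their indices rather than by marker expression vectors.

```python
import random

def make_blocks(cell_ids, block_size, rng):
    """Shuffle one sample's cells and cut them into blocks of block_size cells."""
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    # Assumption: a trailing remainder smaller than block_size is dropped
    # so that every block has exactly the same number of cells.
    n_full = len(shuffled) // block_size
    return [shuffled[i * block_size:(i + 1) * block_size] for i in range(n_full)]

def draw_submatrix(blocks_by_sample, rng):
    """Pick one random block per sample and concatenate them into row labels."""
    rows = []
    for sample, blocks in blocks_by_sample.items():
        chosen = rng.choice(blocks)
        rows.extend((sample, cell) for cell in chosen)
    return rows

# Hypothetical toy input: two samples with 10 and 8 cells.
rng = random.Random(0)
samples = {"sample_A": range(10), "sample_B": range(8)}
blocks_by_sample = {
    name: make_blocks(cells, 4, rng) for name, cells in samples.items()
}
# Each (sample, cell) pair identifies one row of the expression sub-matrix.
submatrix_rows = draw_submatrix(blocks_by_sample, rng)
```

In a real analysis the `(sample, cell)` pairs would index rows of the per-sample expression matrices, and `draw_submatrix` would be called once per iteration before the annotation step.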
The steps are repeated with a unique expression sub-matrix in each iteration, which provides a framework for testing the annotation of every cell to one or more cell types under varying neighboring cells. The statistical significance of a cell-type association is measured by the frequency with which a cell occurs in a given cell type across all iterations. Spurious and unstable cell-type clusters are identified by the variation in silhouette score, cluster size, and average Euclidean distance within each cell-type cluster across all iterations. Stable clusters are expected to produce meaningful and reproducible results, whereas unstable and dynamic cell-type clusters can be considered for the identification of unknown or rare cell types, or they may represent batch-affected cells contaminated with technical noise. We benchmarked the accuracy of the workflow by classifying 22 hand-gated cells from 38 markers obtained in replicate mass cytometry measurements from mice [3]. Preliminary results suggest ~85% accuracy in the classification of different cell subtypes across 500 iterations. We are currently improving the performance of this approach by integrating faster (GPU-based) clustering methods and by benchmarking it on other public datasets with non-canonical cell types.
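The frequency-based assignment described above can be sketched as a simple vote count over iterations. This is an illustrative sketch rather than the workflow's actual code: the function name `consensus_labels`, the `min_frequency` threshold, the `"unassigned"` label, and the choice to normalize by the number of iterations in which a cell was actually drawn are assumptions.

```python
from collections import Counter, defaultdict

def consensus_labels(iteration_labels, min_frequency=0.5):
    """Assign each cell its most frequent cell type across iterations.

    iteration_labels: list of dicts mapping cell id -> predicted cell type,
    one dict per iteration (a cell may be absent from some iterations).
    Returns a dict mapping cell id -> (label, frequency of the top label).
    """
    votes = defaultdict(Counter)
    for labels in iteration_labels:
        for cell, cell_type in labels.items():
            votes[cell][cell_type] += 1

    consensus = {}
    for cell, counts in votes.items():
        top_type, top_count = counts.most_common(1)[0]
        # Assumption: frequency is relative to the iterations that drew this cell.
        frequency = top_count / sum(counts.values())
        label = top_type if frequency >= min_frequency else "unassigned"
        consensus[cell] = (label, frequency)
    return consensus

# Hypothetical predictions from three iterations.
iteration_labels = [
    {"c1": "B cell", "c2": "T cell"},
    {"c1": "B cell", "c2": "NK cell"},
    {"c1": "B cell", "c2": "B cell", "c3": "T cell"},
]
result = consensus_labels(iteration_labels, min_frequency=0.6)
```

Here `c1` is labeled consistently in every iteration, whereas `c2` receives a different label each time and falls below the threshold; cells of the latter kind are the ones the abstract associates with unstable clusters, rare cell types, or technical noise.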


... Additionally, these data-driven algorithms provide more reproducible results, removing researcher bias that comes with manually setting gates in bivariate plots. Since the publication of the first clustering method for cytometry data in 2007, many clustering algorithms have been published and their performance thoroughly compared [3][4][5][6][7]. However, every high-dimensional analysis method makes assumptions on the underlying data that need to be understood by the researcher implementing these methods [8,9]. ...
Article
Background: Current methods of high-dimensional unsupervised clustering of mass cytometry data lack means to monitor and evaluate clustering results. Whether unsupervised clustering is correct is typically evaluated by agreement with dimensionality reduction techniques or by benchmarking against manually classified cells. The ambiguity and lack of reproducibility of sequential gating has been replaced with ambiguity in the interpretation of clustering results. On the other hand, spurious overclustering of data leads to loss of statistical power. We have developed INFLECT, an R package designed to give insight into clustering results and provide an optimal number of clusters. In our approach, a mass cytometry dataset is intentionally overclustered using FlowSOM to ensure that the smallest phenotypically different subsets are captured. A range of metacluster number endpoints are generated and evaluated using marker interquartile range and distribution unimodality checks. The fraction of marker distributions that pass these checks is taken as a measure of clustering success. The fraction of unimodal distributions within metaclusters is plotted against the number of generated metaclusters and reaches a plateau of diminishing returns. The inflection point at which this occurs gives an optimal point of capturing cellular heterogeneity versus statistical power. Results: We applied INFLECT to four publicly available mass cytometry datasets of different size and number of markers. The unimodality score consistently reached a plateau, with an inflection point dependent on dataset size and number of dimensions. We tested both ConsensusClusterPlus metaclustering and hierarchical clustering. While hierarchical clustering is less computationally expensive and thus faster, it achieved similar results to ConsensusClusterPlus. The four datasets consisted of labeled data, and we compared INFLECT metaclustering to published results. INFLECT identified a higher optimal number of metaclusters for all datasets. We illustrated the underlying heterogeneity within labels, showing that these labels encompass distinct types of cells. Conclusion: INFLECT addresses a knowledge gap in high-dimensional cytometry analysis, namely assessing clustering results. This is done through monitoring marker distributions for interquartile range and unimodality across a range of metacluster numbers. The inflection point is the optimal trade-off between cellular heterogeneity and statistical power, applied in this work for FlowSOM clustering on mass cytometry datasets.
Article
Mass cytometry allows high-resolution dissection of the cellular composition of the immune system. However, the high dimensionality, large size, and non-linear structure of the data pose considerable challenges for data analysis. In particular, dimensionality reduction-based techniques like t-SNE offer single-cell resolution but are limited in the number of cells that can be analyzed. Here we introduce Hierarchical Stochastic Neighbor Embedding (HSNE) for the analysis of mass cytometry data sets. HSNE constructs a hierarchy of non-linear similarities that can be interactively explored with a stepwise increase in detail up to the single-cell level. We apply HSNE to a study on gastrointestinal disorders and three other available mass cytometry data sets. We find that HSNE efficiently replicates previous observations and identifies rare cell populations that were previously missed due to downsampling. Thus, HSNE removes the scalability limit of conventional t-SNE analysis, a feature that makes it highly suitable for the analysis of massive high-dimensional data sets.
Article
Motivation: Recent advances in mass cytometry allow simultaneous measurements of up to 50 markers at single-cell resolution. However, the high dimensionality of mass cytometry data introduces computational challenges for automated data analysis and hinders translation of new biological understanding into clinical applications. Previous studies have applied machine learning to facilitate processing of mass cytometry data. However, manual inspection is still inevitable and is becoming the barrier to reliable large-scale analysis. Results: We present a new algorithm called Automated Cell-type Discovery and Classification (ACDC) that fully automates the classification of canonical cell populations and highlights novel cell types in mass cytometry data. Evaluations on real-world data show that ACDC provides accurate and reliable estimations compared to manual gating results. Additionally, ACDC automatically classifies previously ambiguous cell types to facilitate discovery. Our findings suggest that ACDC substantially improves both reliability and interpretability of results obtained from high-dimensional mass cytometry profiling data. Availability and Implementation: A Python package (Python 3) and analysis scripts for reproducing the results are available at https://bitbucket.org/dudleylab/acdc. Contact: brian.kidd@mssm.edu or joel.dudley@mssm.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Accurate identification of cell subsets in complex populations is key to discovering novelty in multidimensional single-cell experiments. We present X-shift (http://web.stanford.edu/~samusik/vortex/), an algorithm that processes data sets using fast k-nearest-neighbor estimation of cell event density and arranges populations by marker-based classification. X-shift enables automated cell-subset clustering and access to biological insights that 'prior knowledge' might prevent the researcher from discovering.