Visual analysis of mass cytometry data by hierarchical stochastic neighbour embedding reveals rare cell types

Mass cytometry allows high-resolution dissection of the cellular composition of the immune system. However, the high dimensionality, large size, and non-linear structure of the data pose considerable challenges for data analysis. In particular, dimensionality reduction-based techniques like t-SNE offer single-cell resolution but are limited in the number of cells that can be analyzed. Here we introduce Hierarchical Stochastic Neighbor Embedding (HSNE) for the analysis of mass cytometry data sets. HSNE constructs a hierarchy of non-linear similarities that can be interactively explored with a stepwise increase in detail up to the single-cell level. We apply HSNE to a study on gastrointestinal disorders and three other available mass cytometry data sets. We find that HSNE efficiently replicates previous observations and identifies rare cell populations that were previously missed due to downsampling. Thus, HSNE removes the scalability limit of conventional t-SNE analysis, a feature that makes it highly suitable for the analysis of massive high-dimensional data sets.
Vincent van Unen¹, Thomas Höllt²,³, Nicola Pezzotti², Na Li¹, Marcel J.T. Reinders⁴, Elmar Eisemann², Frits Koning¹, Anna Vilanova² & Boudewijn P.F. Lelieveldt⁴,⁵
DOI: 10.1038/s41467-017-01689-9 OPEN
¹Department of Immunohematology and Blood Transfusion, Leiden University Medical Center, Albinusdreef 2, 2333 ZA Leiden, The Netherlands.
²Computer Graphics and Visualization Group, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands.
³Computational Biology Center, Leiden University Medical Center, Albinusdreef 2, 2333 ZA Leiden, The Netherlands.
⁴Pattern Recognition and Bioinformatics Group, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands.
⁵Division of Image Processing, Department of Radiology, Leiden University Medical Center, Albinusdreef 2, 2333 ZA Leiden, The Netherlands.
Vincent van Unen, Thomas Höllt and Nicola Pezzotti contributed equally to this work. Frits Koning, Anna Vilanova and Boudewijn P.F. Lelieveldt jointly supervised this work. Correspondence and requests for materials should be addressed to V.v.U. (email: V.van_unen@lumc.nl) or to B.P.F.L. (email: B.P.F.Lelieveldt@lumc.nl)
NATURE COMMUNICATIONS |8: 1740 |DOI: 10.1038/s41467-017-01689-9 |www.nature.com/naturecommunications 1
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Mass cytometry (cytometry by time-of-flight; CyTOF) allows the simultaneous analysis of multiple cellular markers (>30) present on biological samples consisting of millions of cells. Computational tools for the analysis of such data sets can be divided into clustering-based and dimensionality reduction-based techniques¹, each having distinctive advantages and disadvantages. The clustering-based techniques, including SPADE², FlowMaps³, Phenograph⁴, VorteX⁵ and Scaffold maps⁶, allow the analysis of data sets consisting of millions of cells but only provide aggregate information on generated cell clusters at the expense of local data structure (i.e., single-cell resolution). Dimensionality reduction-based techniques, such as PCA⁷, t-SNE⁸ (implemented in viSNE⁹), and Diffusion maps¹⁰, do allow analysis at the single-cell level. However, the linear nature of PCA renders it unsuitable to dissect the non-linear relationships in the mass cytometry data, while the non-linear methods (t-SNE⁸ and Diffusion maps¹⁰) do retain local data structure, but are limited by the number of cells that can be analyzed. This limit is imposed by a computational burden but, more importantly, by local neighborhoods becoming too crowded in the high-dimensional space, resulting in overplotting and presenting misleading information in the visualization. In cytometry studies, this poses a problem, as a significant number of cells needs to be removed by random downsampling to make dimensionality reduction computationally feasible and reliable. Future increases in acquisition rate and dimensionality in mass and flow cytometry are expected to amplify this problem significantly¹¹,¹².
Fig. 1 Schematic overview of Cytosplore+HSNE for exploring the mass cytometry data. By creating a multi-level hierarchy of an illustrative 3D data set (a), we achieve a clear separation of different cell groups in an overview embedding (left panel b) that conserves non-linear relationships (i.e., follows the distance indicated by the dashed line in a, instead of the grey arrow) and more detail within the separate groups on the data level (right panel b). c Construction and exploration of the hierarchy. The hierarchy is constructed starting with the data level (left two columns). On the basis of the high-dimensional expression patterns of the cells, a weighted kNN graph is constructed, which is used to find representative cells used as landmarks in the next coarser level. By administering the area of influence (AoI) of the landmarks, cells/landmarks can be aggregated without losing the global structure of the underlying data or creating shortcuts. The exploration of the hierarchy is shown in the two rightmost columns. At the bottom, we see the overview level (in this example the 3rd level in the hierarchy), which shows that a group of landmarks has low expression in marker c (bottom-right panel). Selecting this group of landmarks for further exploration results in a look-up of the landmarks in the preceding level (neighborhood graph, intermediate level) that are in the AoI, with which a new embedding can be created at the 2nd level of the hierarchy (middle-right panel). Marker b shows a strong separation between the upper and lower landmarks at this level. Zooming in on the landmarks with low expression of marker b reveals further separation in marker a at the lowest level, the full data level (top-right panel)

Here we adapted Hierarchical Stochastic Neighbor Embedding (HSNE)¹³, which was recently introduced for the analysis of hyperspectral satellite imaging data, to the analysis of mass cytometry data sets to visually explore millions of cells while avoiding downsampling. HSNE builds a hierarchical representation of the complete data that preserves the non-linear high-dimensional relationships between cells. We implemented HSNE in an integrated single-cell analysis framework called Cytosplore+HSNE. This framework allows interactive exploration of the hierarchy by a set of embeddings, two-dimensional scatter plots where cells are positioned based on the similarity of all marker expressions simultaneously, which are used for subsequent analysis such as clustering of cells at different levels of the hierarchy. We found that Cytosplore+HSNE replicates the previously identified hierarchy in the immune-system-wide single-cell data⁴,⁵,¹⁴, i.e., we can immediately identify major lineages at the highest overview level, while acquiring more information by dissecting the immune system at the deeper levels of the hierarchy on demand. Additionally, Cytosplore+HSNE does so in a fraction of the time required by other analysis tools. Furthermore, we identified rare cell populations specifically associated with diseases in both the innate and adaptive immune compartments that were previously missed due to downsampling. We highlight the scalability and generalizability of Cytosplore+HSNE using three other data sets, consisting of up to 15 million cells. Thus, Cytosplore+HSNE combines the scalability of clustering-based methods with the local single-cell detail preservation of non-linear dimensionality reduction-based methods. Finally, Cytosplore+HSNE is not only applicable to mass cytometry data sets, but can also be used for other high-dimensional data such as single-cell transcriptomic data sets.
Fig. 2 Gain of information by analyzing the mass cytometry data at full resolution with Cytosplore+HSNE. a Pie chart showing cellular composition of the mass cytometry data set. Color represents the subsets (N = 142), as identified in our previous study¹⁴. Black represents the cells discarded by stochastic downsampling and grey represents the cells discarded by ACCENSE clustering. b Embeddings of the 1.1 million cells annotated in ref. 14 showing the top three levels of the HSNE hierarchy (five levels in total). Color represents annotations as in a. Size of the landmarks is proportional to the number of cells in the AoI that each landmark represents. Bottom map shows density features depicting the local probability density of cells for the level 3 embedding, where black dots indicate the centroids of identified cluster partitions using GMS clustering. c Embeddings of all 5.2 million cells, again showing only the top three levels of the hierarchy (five levels in total). Colors as in a. Right panels visualize landmarks representing cells discarded by stochastic downsampling (black) and the cells discarded by ACCENSE (grey). Bottom map shows density features for the level 3 embedding as described in (b). d Frequency of annotated cells for 145 clusters identified by Cytosplore+HSNE at the third hierarchical level using GMS clustering in c. Color coding as in a
Results
Hierarchical exploration of massive single-cell data. For a given high-dimensional data set such as the three-dimensional illustrative example in Fig. 1a, HSNE¹³ builds a hierarchy of local neighborhoods in this high-dimensional space, starting with the raw data that, subsequently, is aggregated at more abstract hierarchical levels. The hierarchy is then explored in reverse order, by embedding the neighborhoods using the similarity-based embedding technique Barnes-Hut (BH)-SNE¹⁵. To allow for more detail and faster computation, each level can be partitioned in part or completely, by manual gating or unsupervised clustering, and partitions are embedded separately on the next, more
detailed level (compare Fig. 1b). HSNE works particularly well for the analysis of the mass cytometry data because the local neighborhood information of the data level is propagated through the complete hierarchy. Groups of cells that are close in the Euclidean sense (Fig. 1a, grey arrow), but not on the non-linear manifold (Fig. 1a, dashed black line), are well separated even at higher aggregation levels (Fig. 1b). The power of HSNE lies in its scalability to tens of millions of cells, while the possibility to continuously explore the hierarchy allows the identification of rare cell populations at the more detailed levels. Next follows a general description of how the hierarchy is built and explored through embeddings. More details can be found in the Methods section.
The left panels of Fig. 1c give an overview of the HSNE hierarchy construction. We show the hierarchy from the fine-grained data level to an overview level from the top to bottom panels. The number of levels is defined by the user and depends mostly on the input-data size. While the data aggregation is completely data-driven, for a typical mass cytometry data set, every additional level reduces the number of landmarks by roughly one order of magnitude. Therefore, we recommend using log₁₀(N/100) levels, with N being the number of cells: this generally results in at most a few thousand landmarks at the highest level of the hierarchy.
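The rule of thumb above can be turned into a small helper. This is a minimal sketch: the function name `recommended_levels`, the rounding to the nearest integer and the lower bound of one level are assumptions made for illustration, since the text only gives the formula log₁₀(N/100).

```python
import math


def recommended_levels(n_cells: int) -> int:
    """Rule of thumb from the text: roughly log10(N / 100) hierarchy levels."""
    if n_cells < 1:
        raise ValueError("n_cells must be a positive number of cells")
    # Rounding and the minimum of one level are assumptions, not from the paper.
    return max(1, round(math.log10(n_cells / 100)))


# Hypothetical data set sizes, chosen only to show how the estimate grows.
for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} cells -> about {recommended_levels(n)} levels")
```

Because the aggregation itself is data-driven, the result is best read as a starting value that the user may adjust rather than a fixed requirement.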
The foundation of the hierarchy is constructed using the original input data. Each dot represents a single cell (Fig. 1c, data level). Similarities between cells on the data level are defined by building an approximated, weighted k-nearest neighbor (kNN) graph¹⁶ using the Euclidean distances based on the complete marker expression (Fig. 1c, top-center panel). The weights of this graph can directly be used as input to embed the data into a two-dimensional space (Fig. 1c, top-right panel). With BH-SNE, the two-dimensional embedding is generated such that the layout of the points indicates similarities between the cells in the high-dimensional space according to the neighborhood graph.
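To make the data-level graph concrete, the following sketch builds a small weighted kNN graph from Euclidean distances. It is illustrative only and rests on several assumptions: the search is exact and brute-force (the paper uses an approximated search), the weights come from a Gaussian kernel with a fixed, hypothetical `sigma` (the text does not specify the weighting), and the toy cells are invented.

```python
import math


def knn_graph(cells, k, sigma=1.0):
    """Return {cell_index: {neighbour_index: weight}} for the k nearest neighbours.

    Exact brute-force search and a fixed-width Gaussian kernel are simplifying
    assumptions made for this sketch.
    """
    graph = {}
    for i, a in enumerate(cells):
        nearest = sorted(
            (math.dist(a, b), j) for j, b in enumerate(cells) if j != i
        )[:k]
        weights = {j: math.exp(-(d * d) / (2 * sigma ** 2)) for d, j in nearest}
        total = sum(weights.values())
        if total == 0.0:
            # All weights underflowed: fall back to equal weights.
            graph[i] = {j: 1.0 / len(weights) for j in weights}
        else:
            # Normalise so each row can be read as transition probabilities.
            graph[i] = {j: w / total for j, w in weights.items()}
    return graph


# Six invented cells measured on three markers, forming two obvious groups.
cells = [
    (0.1, 0.2, 4.0), (0.2, 0.1, 4.1), (0.0, 0.3, 3.9),
    (5.0, 4.8, 0.1), (5.1, 5.0, 0.2), (4.9, 5.2, 0.0),
]
graph = knn_graph(cells, k=2)
for cell, neighbours in graph.items():
    print(cell, sorted(neighbours))
```

Normalising each row to sum to one lets the same structure be read as the transition probabilities of a random walk, which is how the graph is used when landmarks are selected.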
To aggregate the data into the next level (Fig. 1c, intermediate levels), we identify representative cells to use as landmarks (Fig. 1c, white circles). For that, the weighted kNN graph is interpreted as a finite Markov chain and the most influential (i.e., best-connected) nodes are chosen as landmarks, using a Monte Carlo process. The landmarks are then embedded into a two-dimensional space based on their similarities. However, simply repeating the kNN construction with Euclidean distances for the selected landmarks in the high-dimensional space would eventually eliminate non-linear structures by creating undesired "shortcuts" in the graph (a problem reported by Setty et al.¹⁷ in a different setting). Instead, we define the area of influence (AoI) of each landmark, indicated by the grey hulls (Fig. 1c, left panels), as the cells that are well-represented by the landmark according to the kNN graph. Different landmarks can have overlapping regions of locally-similar cells. Therefore, we define the similarity of two landmarks as the overlap of their respective AoIs. Furthermore, we construct a neighborhood graph based on these similarities. Here, two nodes are connected if they have overlapping AoIs. The strength of the connection is defined by the number of data points within the overlapping region. This graph replaces the kNN graph as input for levels subsequent to the data level. Hereby, we effectively maintain the non-linear structure of the data to the top of the hierarchy and avoid shortcuts (Fig. 1c, bottom panels). We show that the preservation of non-linear neighborhoods by HSNE indeed conserves structure that is otherwise lost by random downsampling (Supplementary Note 1, "Cytosplore+HSNE is reproducible and robust", and Supplementary Fig. 1).
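The aggregation step described above can be sketched with plain random walks. This is a simplified illustration, not the published implementation: picking the `n_landmarks` most frequently reached nodes, assigning a cell to a landmark's AoI whenever one of its walks reaches that landmark first, and all function names and parameter values are assumptions. Only the last function, which connects landmarks by the number of cells shared between their AoIs, follows the text directly.

```python
import random
from collections import Counter
from itertools import combinations


def step(graph, node, rng):
    """Take one random-walk step, choosing a neighbour in proportion to edge weight."""
    neighbours = list(graph[node])
    weights = [graph[node][n] for n in neighbours]
    return rng.choices(neighbours, weights=weights)[0]


def select_landmarks(graph, n_landmarks, walks_per_node=20, walk_length=10, seed=0):
    """Monte Carlo estimate of influence: nodes where fixed-length walks end most often."""
    rng = random.Random(seed)
    hits = Counter()
    for start in graph:
        for _walk in range(walks_per_node):
            node = start
            for _step in range(walk_length):
                node = step(graph, node, rng)
            hits[node] += 1
    return [node for node, _count in hits.most_common(n_landmarks)]


def areas_of_influence(graph, landmarks, walks_per_node=20, max_steps=100, seed=0):
    """Put each node into the AoI of every landmark that one of its walks reaches first."""
    rng = random.Random(seed)
    aoi = {landmark: set() for landmark in landmarks}
    for start in graph:
        for _walk in range(walks_per_node):
            node = start
            for _step in range(max_steps):
                if node in aoi:
                    break
                node = step(graph, node, rng)
            if node in aoi:
                aoi[node].add(start)
    return aoi


def overlap_graph(aoi):
    """Connect landmarks whose AoIs overlap; edge strength = number of shared cells."""
    edges = {}
    for a, b in combinations(aoi, 2):
        shared = len(aoi[a] & aoi[b])
        if shared:
            edges[(a, b)] = shared
    return edges


# Toy chain graph 0-1-2-3-4-5 with equal transition weights (illustrative only).
chain = {i: {j: 1.0 for j in (i - 1, i + 1) if 0 <= j <= 5} for i in range(6)}
landmarks = select_landmarks(chain, n_landmarks=2)
aoi = areas_of_influence(chain, landmarks)
print(landmarks, overlap_graph(aoi))
```

Because the landmark similarities are derived from shared cells rather than from fresh Euclidean distances between landmarks, two landmarks that are close in marker space but share no cells remain unconnected, which is how the shortcut problem is avoided.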
The data exploration in Cytosplore+HSNE starts with the visualization of the embedding at the highest level, the overview level (Fig. 1c, bottom-right panel). Similar to other embedding techniques for visualizing single-cell data⁴,⁹, the layout of the landmarks indicates similarity in the high-dimensional space according to the level's neighborhood graph. Color is used to represent additional traits, such as marker expressions. The landmark size reflects its AoI. While it is possible to continuously select all landmarks and compute a complete embedding of the next, more detailed level, this strategy would eventually embed all the data and suffer from the same scalability problems as a t-SNE embedding, i.e., overcrowding (Supplementary Note 2, "Millions of cells cause performance issues and overcrowding in t-SNE", and Supplementary Fig. 2) and slow performance. Instead, we envision that the user selects a group of landmarks, by manual gating based on visual cues such as patterns found in marker expression, or by performing unsupervised Gaussian mean shift (GMS) clustering¹⁸ of the landmarks based on the density representation of the embedding (Fig. 1c, right panels). Then, the user can zoom into this selection by means of a more detailed embedding. This means that all landmarks/cells in the combined AoI on the preceding level are retrieved from the neighborhood graph (Fig. 1c, blue encirclements), embedded, and visualized in a new view. Moreover, interactively linked heatmap visualizations of clusters (Fig. 1c, right panels) and descriptive statistics of markers within a selection can be used to guide the exploration. For example, these tools allow inspection of the heterogeneity of cells within individual clusters, including the cells associated with individual landmarks. Importantly, all of the described tools are available at every level of the hierarchy and linked interactively. Selections in the embedding and heatmap at one level of the hierarchy can thus be highlighted in the embeddings of other levels (Supplementary Fig. 3). All these aspects are further demonstrated using a typical exploration workflow with Cytosplore+HSNE in Supplementary Movie 1. With this strategy, tens of millions of cells can be explored, with visualizations ranging from a global overview down to single-cell resolution, while preserving non-linear relationships between landmarks/cells at all levels of the hierarchy.
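The zoom-in operation amounts to a look-up of the combined AoI of the selected landmarks, repeated once per level. The sketch below assumes that each level stores its AoIs as a dictionary from landmark to the set of items it represents on the preceding level; the function name `drill_down` and the landmark labels are hypothetical.

```python
def drill_down(selection, aoi_per_level):
    """Follow a landmark selection through successive levels of the hierarchy.

    aoi_per_level[0] maps the selected level's landmarks to items on the level
    below, aoi_per_level[1] maps those items one level further down, and so on.
    """
    current = set(selection)
    for aoi in aoi_per_level:
        # The combined AoI is the union of the AoIs of everything selected so far.
        current = set().union(*(aoi[landmark] for landmark in current))
    return current


# Invented three-level hierarchy: overview -> intermediate -> data level.
aoi_overview = {"O1": {"M1", "M2"}, "O2": {"M2", "M3"}}
aoi_intermediate = {"M1": {0, 1, 2}, "M2": {2, 3}, "M3": {4, 5, 6}}

print(sorted(drill_down(["O1"], [aoi_overview])))
print(sorted(drill_down(["O1"], [aoi_overview, aoi_intermediate])))
```

Only the retrieved subset is embedded at the next level, so the cost of each new embedding depends on the size of the selection rather than on the size of the full data set.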
Fig. 3 Analysis of the CD7⁺CD3⁻ innate lymphocyte compartment in inflammatory intestinal diseases. a First HSNE level embedding of 5.2 million cells. Color represents arcsin5-transformed marker expression as indicated. Size of the landmarks represents AoI. Blue encirclement indicates selection of landmarks representing CD7⁺CD3⁻ innate lymphocytes and CD4⁺ T cells further discussed in Fig. 5. b The major immune lineages, annotated on the basis of lineage marker expression. c Third HSNE level embedding of the CD7⁺CD3⁻ innate lymphocytes (5.0 × 10⁵ cells). Color represents arcsin5-transformed marker expression in top panels, and tissue-origin and clinical features in bottom panels. Blue encirclement indicates selection of landmarks representing CD127⁺ ILC and ILC-like cells. d Third HSNE level embedding shows density features depicting the local probability density of cells, where black dots indicate the centroids of identified cluster partitions using GMS clustering. e Embedding of the CD127⁺ ILC and ILC-like cells (6.0 × 10⁴ cells) at single-cell resolution. Arrows indicate ILC1 (blue), ILC2 (orange) and ILC3 (green). Bottom-right panel shows corresponding cluster partitions using GMS clustering based on density features (top-right panel). f A heatmap summary of median expression values (same color coding as for the embeddings) of cell markers expressed by CD127⁺ ILC and ILC-like clusters identified in b and hierarchical clustering thereof. g Composition of cells for each cluster is represented graphically by a horizontal bar in which segment lengths represent the proportion of cells with: (left) tissue-of-origin, (middle) disease status and (right) sampling status

HSNE eliminates the need for downsampling. In a previous study¹⁴, a mass cytometry data set on 5.2 million cells derived from intestinal biopsies and paired blood samples was analyzed using a SPADE-t-SNE-ACCENSE pipeline. Due to t-SNE limitations, the data set had to be downsampled by 57.7% (Fig. 2a); the number of cells from blood and intestinal samples was equalized for a balanced comparison, which led to the exclusion of more cells from the blood samples. Moreover, ACCENSE clustered only 50% of the t-SNE-embedded data into subsets (Fig. 2a). Together, this excluded 78.8% of the cells from the analysis. The remaining 1.1 million cells were annotated into 142 phenotypically distinct immune subsets¹⁴ (Fig. 2a).
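The percentages quoted above follow from chaining the two losses. The short check below uses only the rounded figures given in the text, so its results agree with the reported values up to rounding; the function name is hypothetical.

```python
def downsampling_summary(total_cells, downsampled_fraction, clustered_fraction):
    """Chain the two losses: random downsampling, then incomplete clustering."""
    embedded = total_cells * (1 - downsampled_fraction)   # cells that reached t-SNE
    annotated = embedded * clustered_fraction             # cells ACCENSE assigned to subsets
    excluded_fraction = 1 - annotated / total_cells       # everything lost along the way
    return embedded, annotated, excluded_fraction


embedded, annotated, excluded = downsampling_summary(5_200_000, 0.577, 0.50)
print(f"embedded by t-SNE: {embedded / 1e6:.2f} million cells")
print(f"annotated subsets: {annotated / 1e6:.2f} million cells")
print(f"excluded overall:  {excluded * 100:.2f}% of all cells")
```

In other words, keeping 42.3% of the cells and then clustering half of those leaves about 21% of the original data, i.e., roughly 1.1 of the 5.2 million cells.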
To determine whether Cytosplore+HSNE could identify similar subsets, we embedded the 1.1 million annotated cells (Fig. 2b). Computation time was on the order of minutes and the analysis was finished within an hour, compared to 8 weeks of computation in the original study. Color coding shows the grouping of subsets at all hierarchical levels. GMS clustering at the third level embedding (Fig. 2b, bottom panel) reveals that 75.5% of cells were assigned to a single subset by both methods (Supplementary Fig. 4). Hence, to reach similar results it was not necessary to explore the data at lower (more detailed) levels.
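For readers unfamiliar with mean shift, the sketch below shows the basic idea behind Gaussian mean shift clustering on a two-dimensional embedding: every point is moved uphill on a kernel density estimate and points that arrive at the same density peak form one cluster. It operates directly on point coordinates, whereas Cytosplore+HSNE clusters on a density representation of the embedding, so it should be read as a conceptual illustration under that simplification. The bandwidth, iteration count, merge tolerance and toy points are assumptions.

```python
import math


def gaussian_mean_shift(points, bandwidth, iterations=50):
    """Return (centroids, labels) found by plain Gaussian mean shift."""
    modes = [tuple(p) for p in points]
    for _ in range(iterations):
        shifted = []
        for m in modes:
            # Gaussian weight of every data point as seen from the current position.
            weights = [
                math.exp(-math.dist(m, p) ** 2 / (2 * bandwidth ** 2)) for p in points
            ]
            total = sum(weights)
            # One mean-shift step: move to the weighted mean of the data.
            shifted.append(
                tuple(
                    sum(w * p[d] for w, p in zip(weights, points)) / total
                    for d in range(len(m))
                )
            )
        modes = shifted

    # Merge positions that ended up at (nearly) the same density peak.
    centroids, labels = [], []
    for m in modes:
        for index, c in enumerate(centroids):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(index)
                break
        else:
            centroids.append(m)
            labels.append(len(centroids) - 1)
    return centroids, labels


# Two invented groups of landmarks in a 2D embedding.
embedding = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centroids, labels = gaussian_mean_shift(embedding, bandwidth=0.5)
print(len(centroids), labels)
```

The bandwidth controls the granularity: a larger kernel merges neighbouring density peaks into fewer clusters, a smaller one splits them.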
Next, we utilized Cytosplore+HSNE to analyze the complete data set on 5.2 million cells, thus including the cells that were discarded in the SPADE-t-SNE-ACCENSE pipeline. The embeddings show by color coding that subsets of the same immune lineage clustered at all three levels (Fig. 2c). More interestingly, the cells removed during downsampling (shown in black) and cells ignored during the ACCENSE clustering (shown in grey) were positioned throughout the entire map (Fig. 2c). We selected 145 clusters using GMS clustering at the third level and observed that the identified clusters contained variable numbers of downsampled and non-classified cells (Fig. 2d). These findings indicate that both the non-uniform downsampling and the cell losses during the ACCENSE clustering introduce a potential bias in observed heterogeneity in the immune system. Cytosplore+HSNE overcomes this problem as it analyzes all cells and does so efficiently.
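The per-cluster comparison behind Fig. 2d boils down to counting, for every cluster, how many of its cells were annotated, downsampled or discarded in the earlier pipeline. A minimal sketch with invented cluster labels and status values:

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_ids, statuses):
    """Return {cluster: {status: fraction of the cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, status in zip(cluster_ids, statuses):
        counts[cluster][status] += 1
    return {
        cluster: {status: n / sum(c.values()) for status, n in c.items()}
        for cluster, c in counts.items()
    }


# Invented example: six cells in two clusters.
cluster_ids = [0, 0, 0, 0, 1, 1]
statuses = ["annotated", "annotated", "downsampled", "discarded",
            "downsampled", "downsampled"]
composition = cluster_composition(cluster_ids, statuses)
for cluster, fractions in composition.items():
    print(cluster, fractions)
```

A cluster such as the second one in this toy example, made up entirely of cells removed by downsampling, would have been invisible in an analysis restricted to the retained cells.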
HSNE identifies rare subsets in the ILC compartment. We illustrate an exploration workflow with Cytosplore+HSNE using the data set of 5.2 million cells¹⁴ (Fig. 3). At the overview level, 4090 landmarks depict the general composition of the immune system (Fig. 3a) and color coding is applied to reveal CD-marker expression patterns on the basis of which the major immune lineages are identified (Fig. 3b). Next, the CD7⁺CD3⁻ cell clusters were selected as indicated and a new higher resolution embedding was generated at level 3 of the hierarchy (Fig. 3c). Here, coloring of the landmarks based on marker expression (Fig. 3c, top panels) and a density plot of the embedding (Fig. 3d) are shown alongside the clinical features of the subjects from which the samples were obtained and the tissue-origin of the landmarks (Fig. 3c, bottom panels). This reveals a cluster of cells abundantly present in the intestine of patients with refractory celiac disease (RCDII). In addition, a large cluster of CD45RA⁺CD56⁺ NK cells and three distinct innate lymphoid cell (ILC) clusters with a characteristic lineage⁻CD7⁺CD161⁺CD127⁺ marker expression profile¹⁹,²⁰ are visualized. Strikingly, a distinct population of CD7⁺CD127⁻CD45RA⁻ and partly CD56⁺ cells is found in between the NK, RCDII and ILC cell clusters.
To uncover the phenotypes of these ILC-related clusters, we next embedded the ILC and ILC-like clusters (Fig. 3c, selection) at the full single-cell data level (59,775 cells; 1.2% of total) (Fig. 3e). The marker expression overlays revealed that the majority of cells are CD7⁺ and displayed variable expression levels for CD127, CD45RA, and CD56 (Fig. 3e). In addition, and in line with previous reports²¹,²², (co-)expression of CD127 with CD27, CRTH2, and c-KIT revealed the phenotypes corresponding to helper-like ILC type 1, 2 and 3, respectively (indicated by arrows in Fig. 3e). Moreover, by visualizing the tissue-origin in the Cytosplore+HSNE embedding, the tissue-specific location of ILC and ILC-related phenotypes became evident (Fig. 3e).
Subset   Phenotype                           Annotation
16       CD127⁺CD161⁺CD25⁺CD122⁻CRTH2⁺       ILC2
15       CD127⁺CD161⁺CD25⁺CD122⁻CRTH2⁻       ILC2-like
4        CD56⁺NKp46⁺CD127⁻CD161⁻c-KIT⁻       NK-like
17       CD56⁺NKp46⁺CD127⁺CD161⁻c-KIT⁻       ILC1-like
9        CD56⁺NKp46⁺CD127⁺CD161⁻c-KIT⁻       ILC1-like
11       CD56⁺NKp46⁺CD127⁺CD161⁻c-KIT⁻       ILC1-like
10       CD56⁺NKp46⁺CD127⁻CD161⁻c-KIT⁻       NK-like
1        CD7⁻CD127⁺CD161⁺c-KIT⁺              ILC3-like
5        CD7⁺CD127⁺CD161⁺c-KIT⁺              ILC3
12       CD56⁺CD127⁺CD161⁺c-KIT⁻CD27⁻        ILC1-like
19       CD56⁻CD127⁻NKp46⁻CD161dim           Lin⁻ cells
13       CD56⁻CD127⁻NKp46⁻CD161dim           Lin⁻ cells
18       CD56⁻CD127⁻NKp46⁺CD161⁻             Lin⁻ cells
14       CD56⁻CD127⁻NKp46⁺CD161⁻             Lin⁻ cells
6        CD56⁻CD127⁻NKp46⁺CD161⁺             Lin⁻ cells
8        CD56⁻CD127⁻NKp46⁺CD161⁺             Lin⁻ cells
7        CD56⁺CD127⁻CD45RA⁻CD161⁻            NK-like
2        CD56⁺CD127⁻CD45RA⁻CD161⁺            NK-like
3        CD56⁺CD127⁻CD45RA⁻CD161⁺            NK-like

Fig. 4 CD127⁺ ILC and ILC-like subsets identified by Cytosplore+HSNE. Table showing cluster number, distinguishing phenotypic marker expression profiles and biological annotation for the clusters identified in Fig. 3e. Black color indicates clusters described in previous reports and red color additional unknown clusters. Hierarchical clustering of clusters based on marker expression profile shown in the heatmap depicted in Fig. 3f
Next, we performed GMS clustering on the full data level embedding, which resulted in 19 phenotypically distinct clusters (Fig. 3e, right plots) based on marker expression profiles (Fig. 3f). The cell surface phenotypes of 8 out of the 19 clusters (Fig. 3f) matched previously described²¹ biological annotations (Fig. 4, black annotations) including the CRTH2⁺ ILC2 (cluster 16), c-KIT⁺ ILC3 (cluster 5) and CD56⁻CD127⁻lineage⁻ IELs (clusters 19, 13, 18, 14, 6, and 8), the latter representing innate type of lymphocytes with dual T-cell precursor and NK/ILC traits²³⁻²⁵. Remarkably, the remaining 11 clusters strongly resembled distinct ILC types, but did not fulfill the complete phenotypic requirements according to established nomenclature²¹ (Fig. 4, red annotations). For example, cluster 15 is highly similar to ILC2 (cluster 16) based on the expression of CD7, CD127, CD161, and CD25, but lacks the ILC2-defining marker CRTH2. Also, clusters 17, 9 and 11 bear close resemblance to ILC1 based on their CD7⁺CD127⁺c-KIT⁻ marker expression profile, but lack the ILC-defining CD161 marker. Finally, cluster 1 is very similar to ILC3 (cluster 5) based on CD127, CD161 and c-KIT positivity, but lacks the lymphoid marker CD7. Interestingly, the ILC3 (cluster 5) and ILC3-like (cluster 1) populations resided mainly in intestinal biopsies of patients with Crohn's disease (Fig. 3f) and may be related. Cluster 4 was mainly present in peripheral blood of patients with RCDII, suggesting a possible association with this pre-malignant disease state. Importantly, three clusters (4, 17, and 19) (Fig. 3f) were essentially missed in our previous study¹⁴ due to the downsampling. Finally, all identified cell clusters consist to a variable extent of cells that were downsampled in the original analysis (Fig. 3g). Thus, the analysis of the full data set provides increased detail and confidence in establishing the phenotypes of these low abundance innate cell subsets.
HSNE identifies rare CD4+ T-cell subsets in blood. Next, we selected the CD4+ T-cell
lineage (Fig. 3a) and show the distribution of the landmarks at the third level,
revealing several clusters within the CD4+ T-cell compartment (Fig. 5a), including a
small CD28−CD4+ T-cell memory population (25,398 cells; 0.5% of total), most likely
representing terminally differentiated cells (ref. 26). Subsequent analysis at the
single-cell level (Fig. 5b) identified a CD56+ population within the CD28−CD4+ T cells
that is enriched in blood of patients with Crohn's disease (Fig. 5b, bottom panels,
dashed black circle), as well as a CD56− population of CD28−CD4+ T cells (Fig. 5b,
bottom panels, dashed yellow circle) present in blood samples of both patients and
controls. Importantly, this latter cell population was not identified in our previous
publication due to the non-uniform downsampling of cells (Fig. 5b). Together, these
findings emphasize that Cytosplore+HSNE is highly efficient in unbiased analysis of both
abundant and rare cell populations in health and disease by permitting full single-cell
[Fig. 5 graphic. Panel a: level 3 embedding (1.9 × 10^6 cells, 36.9%) with marker expression and density views and the CD28− memory selection. Panel b: data level embedding (2.5 × 10^4 cells, 0.5%) with marker expression and density views. Legends: disease status (RCDII, EATLII, Crohn, CeD, Ctrl), sampling status (Downsampled, Discarded, Subset), tissue (Intestine, Blood).]
Fig. 5 Analysis of the CD4+ T-cell compartment in inflammatory intestinal diseases. a Third HSNE level embedding of the CD4+ T cells (1.4 × 10^6 cells,
selected in Fig. 3). Color and size of landmarks as described in Fig. 3. Right panel shows density features for the level 3 embedding. Blue encirclement
indicates selection of landmarks representing CD28−CD4+ T cells. b Embedding of the CD28−CD4+ T cells (2.6 × 10^4 cells) at single-cell resolution.
Bottom-left panel shows yellow and black dashed encirclements based on CD56− and CD56+ expression, respectively. Three bottom-right panels show
cells colored according to: (left) disease status of the subjects (CeD, Crohn, EATLII, RCDII, and controls), (middle) sampling status (annotated
subset, discarded by ACCENSE and downsampled) and (right) tissue-of-origin (blood and intestine)
resolution. It enables the simultaneous identification and visualization of known cell
subsets and provides evidence for additional heterogeneity in the immune system, as it
reveals the presence of cell clusters that were missed in a previous analysis due to
downsampling of the input data. These currently unspecified cell clusters might represent
intermediate stages of differentiation or novel rare cell types with presently unknown
function.
HSNE is robust and outperforms current single-cell methods. While the exploration of the
hierarchy requires analysis at multiple levels, the workflow is robust and reproducible,
as shown in Supplementary Fig. 5. In this exemplary analysis, we obtained the same
Cytosplore+HSNE clusters at the single-cell level upon reconstructing the hierarchy and
embeddings in a matter of minutes (Methods section). In addition, we tested the
applicability of Cytosplore+HSNE to three different public mass cytometry data sets.
First, we analyzed a well-characterized bone marrow data set (ref. 27) containing 81,747
cells as a benchmark case (Supplementary Fig. 6) and demonstrated that the landmarks in
the overview level (2,632; 3.2% of total) that were selected by the HSNE algorithm were
distributed across almost all of the manually gated cell types (Supplementary Fig. 6a),
indicating that the global data heterogeneity was accurately preserved. Also, GMS
clustering resulted in HSNE clusters that were phenotypically similar to the manually
gated cell types and displayed additional diversity within those subsets (Supplementary
Fig. 6b). However, as the power of Cytosplore+HSNE lies in its scalability to data sets
exceeding millions of cells, we also tested the versatility of Cytosplore+HSNE by
comparing it to other state-of-the-art scalable single-cell analysis methods and
accompanying large data sets (Supplementary Note 3, "Cytosplore+HSNE offers advantages
over current scalable single-cell analysis methods"; Supplementary Figs. 7 and 8). Here
Cytosplore+HSNE computed the analysis of the VorteX data set (ref. 5) containing 0.8
million cells in 4 min, compared to 22 h using the publicly available VorteX
implementation on the same computer. Similarly, analysis of the Phenograph data set
(ref. 4) containing 15 million cells was computed in 3.5 h, compared to 40 h using the
publicly available Phenograph implementation on the same computer. Both analyses show
that Cytosplore+HSNE reproduces the main findings as presented in the original
publications. More importantly, Cytosplore+HSNE provides the distinct advantage of
visualizing all cells and intracluster heterogeneity at subsequent levels of detail up to
the single-cell level, even for the 15-million-cell data set, without a need for
downsampling. Also, VorteX failed to complete clustering of the 5.2-million-cell
gastrointestinal data set within 3 days (regardless of using Euclidean or Angular
distance), whereas Cytosplore+HSNE accomplished this within 29 min. Moreover, while
Phenograph did identify rare clusters that largely consisted of CD56+ cells within the
CD28−CD4+ memory T cells (Fig. 5b), these clusters did not accurately correspond to the
total number of CD56+ cells, obscuring the association with Crohn's disease, further
highlighting the advantages of Cytosplore+HSNE over these other computational tools.
Finally, we investigated whether density-based downsampling, as implemented for instance
by SPADE (ref. 2), could provide better results compared to random downsampling. However,
solely applying density-based downsampling does not allow for quantitative analysis of
the resulting sample, as different types of cells will be reduced by different amounts.
To mitigate this problem, SPADE implements an elaborate pipeline of downsampling,
clustering and subsequent upsampling to enable such a comparison, whereas this is an
inherent part of HSNE. Therefore, we made a direct comparison between the density-based
downsampling used in the SPADE pipeline (ref. 2) and HSNE on the same 5.2-million-cell
gastrointestinal data set. On the basis of the expression of major lineage markers
(Fig. 3a), HSNE created six large clusters (Fig. 3b) in the two-dimensional space at the
overview level where similar landmark cells group closely, laying out all the cells of
one cluster very close to any other cell of the same cluster, but distant from the cells
of the other clusters. The SPADE analysis on the same data (Supplementary Fig. 9) created
a dendrogram where cells of one cluster are close to cells of other clusters, while in
high-dimensional space, they could be dissimilar and far apart. Importantly, we compared
the ability of the SPADE analysis to preserve rare cellular subsets with that of HSNE.
Despite density-based downsampling, several SPADE nodes that were created displayed a
mixture of different phenotypes (underclustering), as revealed by the single-cell
resolution of a linked t-SNE analysis that we show for the CD56+CD4+ T-cell node as an
example (Supplementary Fig. 9b, node #1), while other SPADE nodes contained cells with
overlapping phenotypes (overclustering), such as several myeloid cell populations
(Supplementary Fig. 9c, nodes #2–5). In addition, rare subsets such as the CD28−
subpopulations of CD4+ memory T cells (Supplementary Fig. 9d) or the ILC-like clusters
(Supplementary Fig. 9e) that we could identify with HSNE (Figs. 3 and 5) were in the
resulting SPADE tree indistinguishable from other CD4+ T cells or innate lymphocytes,
respectively (shown by the overlapping distributions of cells from different nodes); this
indicates that SPADE is less suitable for rare cell analysis. A similar problem was
reported by Amir et al., where leukemic cells were not separated from healthy cells in
the SPADE tree (ref. 9). Thus, combining the single-cell resolution with the enhanced
scalability may be critical for the success of HSNE in preserving rare cells.
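The quantitative problem with density-based downsampling noted above can be illustrated with a small sketch. It is not SPADE's implementation: the one-dimensional toy data, the `radius` and `target_density` parameters, and the rule "keep a cell with probability min(1, target/density)" are assumptions chosen only to show that dense and sparse populations are thinned by different amounts, so that population frequencies in the downsampled data no longer reflect the original sample.

```python
def local_density(values, radius):
    """Number of cells (including the cell itself) within `radius` of each cell."""
    return [sum(1 for v in values if abs(v - x) <= radius) for x in values]


def keep_probabilities(densities, target_density):
    """Always keep cells in sparse regions; thin dense regions toward the target."""
    return [min(1.0, target_density / d) for d in densities]


# Hypothetical toy sample: one abundant, tightly packed population and one rare one.
abundant = [i * 0.001 for i in range(900)]
rare = [10.0 + i * 0.01 for i in range(20)]
cells = abundant + rare

probs = keep_probabilities(local_density(cells, radius=0.1), target_density=20)

# Expected number of retained cells per population (no random draw needed).
expected_abundant = sum(probs[:900])
expected_rare = sum(probs[900:])

fraction_before = len(rare) / len(cells)
fraction_after = expected_rare / (expected_abundant + expected_rare)
print(f"rare fraction before: {fraction_before:.3f}, after: {fraction_after:.3f}")
```

In this toy case the rare population is fully retained while the abundant one is strongly reduced, which is helpful for finding rare cells but means frequencies must be restored by a later upsampling step before any quantitative comparison.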
Discussion
Mass cytometry data sets generally consist of millions of cells. Current tools can either
extract global information with no single-cell resolution or provide single-cell
resolution but at the expense of the number of cells that can be analyzed. Consequently,
when single-cell resolution is of interest, most current tools require downsampling of
the data sets. However, reducing the number of included cells in the analysis pipeline
may hamper the identification of rare subsets.
To overcome this problem, we introduce Cytosplore+HSNE. On the basis of a novel
hierarchical embedding of the data (HSNE), Cytosplore+HSNE enables the analysis of tens
of millions of cells using the whole data in a fraction of the time required by currently
available tools. The power of the hierarchical embedding strategy is that Cytosplore+HSNE
provides visualizations of the data at different levels of resolution, while preserving
the non-linear phenotypic similarities of the single cells at each level. Cytosplore+HSNE
enables the user to interactively select groups of data points at each resolution level,
either hand-picked or guided by density-based clustering, to further zoom in on the
underlying data points in the hierarchy up to the single-cell resolution. Using a data
set of 5.2 million cells, we demonstrate that Cytosplore+HSNE allows a rapid analysis of
the composition of the cells in the data set, that the representation of these cells
preserves phenotypic relationships at all levels of the hierarchy, and that one can zoom
in on rare cell populations that were missed with other analysis tools. The
identification of such rare immune subsets offers opportunities to determine cellular
parameters that correlate with disease.
There is an ongoing scientific debate on the validity of clustering in t-SNE maps versus
direct clustering on the high-dimensional space. However, it has been shown that
stochastic neighbor embedding (SNE) preserves and separates clusters in the
high-dimensional space (ref. 28). While clustering the data points on highly non-linear
manifolds is possible with complex models, we argue that the presented approach
simplifies clustering
considerably. We show that HSNE efficiently unfolds the non-linearity in the
high-dimensional data, as other SNE approaches do, and therefore simpler clustering
methods based on locality in the map suffice to partition the data faithfully (e.g., the
density-based GMS clustering implemented in Cytosplore+HSNE). Especially when combined
with an interactive quality control mechanism to visually inspect residual variance
within each cluster, the kernel size can be selected such that within-cluster variance is
minimized, which supports the validity of the clusters with respect to potential
underclustering. This is indeed confirmed by comparisons to other scalable tools (i.e.,
Phenograph and VorteX), showing that Cytosplore+HSNE provides a superior discriminatory
ability to identify and visualize rare phenotypically distinct cell clusters in large
data sets in a very short time span. However, depending on user preference,
Cytosplore+HSNE can be used in conjunction with such direct clustering approaches. This
allows the user to identify additional heterogeneity that is potentially missed by direct
clustering, and provides the tools for an informed merging and splitting of clusters as
the user deems appropriate. The recent application of mass cytometry and other
high-dimensional single-cell analysis techniques has greatly increased the number of
phenotypically distinct cell clusters within the immune system. This raises obvious
questions about the true distinctiveness and function of such cell clusters in health and
disease, an issue that is beyond the scope of the present study but needs to be addressed
in future studies.
In conclusion, Cytosplore+HSNE allows an interactive and fast analysis of large
high-dimensional mass cytometry data sets from a global overview to the single-cell level
and is coupled to patient-specific features. This may provide crucial information for the
identification of disease-associated changes in the adaptive and innate immune system,
which may aid in the development of disease- and patient-specific treatment protocols.
Finally, the applicability of Cytosplore+HSNE goes beyond analyzing mass cytometry data
sets, as it is able to analyze any high-dimensional single-cell data set.
Methods
HSNE algorithm. HSNE builds a hierarchy of local and non-linear similarities of
high-dimensional data points (ref. 13), where landmarks on a coarser level of the
hierarchy represent a set of similar points or landmarks of the preceding more detailed
level. To represent the non-linear structures of the data, the similarity of these
landmarks is not described by Euclidean distance, but by the concept of AoI on
landmarks of the preceding level. The similarities described in every level of the
hierarchy are then used as input for an adapted version of the similarity-based
embedding technique BH-SNE (ref. 15) for visualization.
The algorithm works as follows: First, a weighted k-nearest neighbor (kNN)
graph is computed from the raw input data. For optimal performance and
scalability, the neighborhoods are approximated as described in ref. 16. The weight
of the link between the two data points in the kNN graph describes the similarity of
the connected data points.
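As a rough illustration of this first step, the following sketch builds a weighted kNN graph for a handful of points. It is a simplification under stated assumptions: neighbors are found by brute force rather than by the approximated search of ref. 16, and the Gaussian weighting with a fixed `sigma`, normalized per node, is a hypothetical stand-in for the similarity used by HSNE. The function name `knn_graph` is illustrative only.

```python
import math


def knn_graph(points, k, sigma=1.0):
    """Return {node: {neighbor: weight}}; weights of each node sum to 1."""
    graph = {}
    for i, p in enumerate(points):
        # Brute-force search for the k nearest neighbors of point i.
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        # Closer neighbors receive a larger similarity (assumed Gaussian kernel).
        raw = {j: math.exp(-(d * d) / (2.0 * sigma * sigma)) for d, j in nearest}
        total = sum(raw.values())
        graph[i] = {j: w / total for j, w in raw.items()}
    return graph


# Hypothetical toy "cells" in a two-marker space: two well-separated groups.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
graph = knn_graph(cells, k=2)
print(graph[0])
```

Normalizing the weights per node lets them be read directly as jump probabilities for random walks on the graph.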
In the subsequent steps, the hierarchy is built based on the similarities of the data
level. To this end, a number of random walks of predefined length is carried out
starting from every node in the kNN graph, using the similarities as probability for
the next jump; nodes similar to the current node are more likely to be the target of
the next jump. Nodes in the graph that are reached more often are considered more
important and selected as landmarks for the next coarser level. The number of
landmarks is selected in a data-driven manner, based on this importance. The AoI
of a landmark is defined by a second set of random walks started from all nodes
(data points or landmarks on the preceding level). Here, the length is not
predefined. Rather, once a landmark is reached, the random walk terminates. The
influence on the node is then defined for every reached landmark as the fraction of
walks that terminated in that landmark. Inversely, the AoI for each landmark is
defined as the set of all nodes that reached this landmark at least once in this second
set of random walks. Consequently, since multiple random walks initiated at the
same node can end in different landmarks, the AoIs of different landmarks can overlap.
We use this overlap to define a new neighborhood graph at the levels above the
data level. Here, two nodes in the graph corresponding to landmarks at this level
are connected if they have overlapping AoIs, where the link between the nodes is
weighted by the number of data points in the overlapping area. This process is
carried out iteratively, until a predefined number of hierarchical levels has been
constructed. For the full technical details, we refer to our previous work (ref. 13).
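The landmark selection, influence computation and AoI overlap described above can be sketched on a toy graph as follows. This is a minimal illustration, not the C++ implementation of ref. 13. Several details are assumptions: importance is counted at the node where each fixed-length walk ends, the number of landmarks is passed in as a fixed value (`n_landmarks`) instead of being chosen in a data-driven manner, and `max_steps` is a hypothetical safety cap for walks that never reach a landmark. All function names are illustrative.

```python
import random
from collections import Counter


def step(graph, node, rng):
    """Jump to a neighbor with probability proportional to the edge weight."""
    nbrs = graph[node]
    return rng.choices(list(nbrs), weights=list(nbrs.values()))[0]


def select_landmarks(graph, n_walks, length, n_landmarks, rng):
    """Run fixed-length walks from every node and keep the most visited endpoints."""
    hits = Counter()
    for start in graph:
        for _ in range(n_walks):
            node = start
            for _ in range(length):
                node = step(graph, node, rng)
            hits[node] += 1
    return [node for node, _ in hits.most_common(n_landmarks)]


def influence(graph, landmarks, n_walks, rng, max_steps=10_000):
    """For each node: fraction of walks that terminated in each landmark."""
    marks = set(landmarks)
    result = {}
    for start in graph:
        ends = Counter()
        for _ in range(n_walks):
            node, steps = start, 0
            while node not in marks and steps < max_steps:
                node = step(graph, node, rng)
                steps += 1
            if node in marks:
                ends[node] += 1
        total = sum(ends.values())
        result[start] = {lm: c / total for lm, c in ends.items()} if total else {}
    return result


def areas_of_influence(infl, landmarks):
    """AoI of a landmark: all nodes that reached it at least once."""
    return {lm: {n for n, f in infl.items() if lm in f} for lm in landmarks}


def overlap_graph(aoi):
    """Connect landmarks with overlapping AoIs, weighted by the overlap size."""
    lms = list(aoi)
    edges = {}
    for idx, a in enumerate(lms):
        for b in lms[idx + 1:]:
            shared = len(aoi[a] & aoi[b])
            if shared:
                edges[(a, b)] = shared
    return edges


rng = random.Random(0)
# Hypothetical toy graph: a ring of 8 nodes with equal jump probabilities.
ring = {i: {(i - 1) % 8: 0.5, (i + 1) % 8: 0.5} for i in range(8)}
landmarks = select_landmarks(ring, n_walks=100, length=15, n_landmarks=2, rng=rng)
infl = influence(ring, landmarks, n_walks=15, rng=rng)
aoi = areas_of_influence(infl, landmarks)
edges = overlap_graph(aoi)
print(landmarks, edges)
```

The walk counts and walk length in the usage lines mirror the standard parameters reported for the hierarchy construction in this paper (100 walks of length 15 for landmark selection, 15 walks for the influence computation); repeating the same functions on the landmark graph would yield the next coarser level.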
HSNE implementation in Cytosplore+HSNE. We implemented our integrated analysis tool
Cytosplore+HSNE using a combination of C++, JavaScript and OpenGL. All computationally
demanding parts are implemented in C++ and make use of parallelization where possible.
The density estimation and GMS clustering make use of the graphics processing unit (GPU)
if possible, as described in our original publication on Cytosplore (ref. 29), allowing
clustering of millions of points in less than a second. We implemented the visualizations
of the embedding in OpenGL on the GPU for optimal performance, and less computationally
demanding visualizations, such as the heatmap, in JavaScript. We implemented the HSNE
algorithm in C++, as presented in ref. 13. Since we use sparse data structures, memory
consumption strongly depends on the data complexity. Maximum memory consumption during
the construction of a four-level hierarchy plus overview embedding of the 841,644-cell
VorteX data set was 1,684 MB; construction of a five-level hierarchy of our human
inflammatory intestinal diseases data set, consisting of 5,220,347 cells, required a
maximum of 9,357 MB of main memory; and finally, the 15,299,616-cell Phenograph data set
required a maximum of 24.3 GB of memory during the computation of a five-level hierarchy
plus the overview embedding. Computation times for the described hierarchies plus the
first level embedding after 1,000 iterations were 4 min, 29 min, and 3 h and 37 min,
respectively, on an HP Z440 workstation with a single Intel Xeon E5-1620 v3 CPU (4
cores) clocked at 3.5 GHz, 64 GB of main memory and an NVIDIA GeForce GTX 980
GPU with 4 GB of memory, running Windows 7.
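For orientation, the reported figures can be turned into throughput and memory per cell with a few lines of arithmetic. The cell counts, times and memory values are taken from the paragraph above; the conversion of 24.3 GB to MB with a factor of 1,024 is an assumption, and the derived numbers are only a back-of-the-envelope summary, not a benchmark.

```python
# (cells, minutes, peak memory in MB) as reported above.
runs = {
    "VorteX": (841_644, 4, 1_684),
    "Gastrointestinal": (5_220_347, 29, 9_357),
    "Phenograph": (15_299_616, 3 * 60 + 37, 24.3 * 1024),  # GB to MB: assumed factor
}

stats = {}
for name, (cells, minutes, peak_mb) in runs.items():
    stats[name] = {
        "cells_per_second": cells / (minutes * 60),
        "mb_per_million_cells": peak_mb / (cells / 1_000_000),
    }

for name, s in stats.items():
    print(
        f"{name}: {s['cells_per_second']:.0f} cells/s, "
        f"{s['mb_per_million_cells']:.0f} MB per million cells"
    )
```

Because throughput differs between the three data sets, these numbers should not be extrapolated linearly to other data sizes or marker panels.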
Human gastrointestinal disorders mass cytometry data set. A detailed description of the
mass cytometry data set on human gastrointestinal disorders can be found in our previous
work (ref. 14). In brief, samples (N=102) were collected from patients who were
undergoing routine diagnostic endoscopies. The cells from the epithelium and lamina
propria were isolated from two or three intestinal biopsies by treatment with EDTA
followed by a collagenase mix under rotation at 37 °C. We analyzed single-cell
suspensions from biological samples including duodenum biopsies (N=36), rectum biopsies
(N=13), perianal fistulas (N=6), and PBMC from control individuals (N=15) and from
patients with inflammatory intestinal diseases (celiac disease (CeD), N=13; RCD type II
(RCDII), N=5; enteropathy-associated T-cell lymphoma type II (EATLII), N=1 and Crohn's
disease (Crohn), N=10). A CyTOF panel of 32 metal isotope-tagged monoclonal antibodies
was designed to obtain a global overview of the heterogeneity of the innate and adaptive
immune system. Primary antibody metal-conjugates were either purchased or conjugated
in-house. Procedures for mass cytometry antibody staining and data acquisition were
carried out as previously described (ref. 27). CyTOF data were acquired and analyzed
on-the-fly, using dual-count mode and noise-reduction on. All other settings were either
default settings or optimized with a tuning solution. After data acquisition, the mass
bead signal was used to normalize the short-term signal fluctuations with the reference
EQ passport P13H2302 during the course of each experiment and the bead events were
removed (ref. 30).
Processing of mass cytometry data. We transformed data from the human inflammatory
intestinal diseases data set using hyperbolic arcsin with a cofactor of 5 directly within
Cytosplore+HSNE. We discriminated live, single CD45+ immune cells with DNA stains and
event length for the human inflammatory intestinal diseases study. We analyzed the other
data (Phenograph and VorteX data sets) as available, apart from applying the hyperbolic
arcsin transformation with a cofactor of 5.
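The transformation itself is a one-liner. The sketch below applies the hyperbolic arcsin with a cofactor of 5 to a list of raw intensities; the function name and the example values are hypothetical, and the code is independent of how Cytosplore+HSNE implements the step internally.

```python
import math


def arcsinh_transform(values, cofactor=5.0):
    """Scale raw intensities by the cofactor, then apply the hyperbolic arcsin."""
    return [math.asinh(v / cofactor) for v in values]


# Hypothetical raw ion counts for one marker.
raw_counts = [0.0, 5.0, 50.0, 500.0]
transformed = arcsinh_transform(raw_counts)
print([round(t, 3) for t in transformed])
```

The transform is approximately linear near zero and logarithmic for large values, which compresses the wide dynamic range of mass cytometry signals while keeping zero counts at zero.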
Cytosplore+HSNE analysis. Cytosplore+HSNE facilitates the complete exploration
pipeline in an integrated manner (see Supplementary Movie 1). All presented tools
are available for every step of the exploration and every level of the hierarchy. Data
analysis in Cytosplore+HSNE included the following steps: We applied the hyperbolic arcsin
transform with a cofactor of five upon loading the data sets. After that, we started a
new HSNE analysis and defined the markers that should be used for the similarity
computation. We used markers CD3, CD4, CD7, CD8a, CD8b, CD11b, CD11c,
CD14, CD19, CD25, CD27, CD28, CD34, CD38, CD45, CD45RA, CD56, CD103,
CD122, CD123, CD127, CD161, CCR6, CCR7, c-KIT, CRTH2, IL-15Ra, IL-21R,
NKp46, PD-1, TCRab, and TCRgd for the human inflammatory intestinal diseases
data set, all available markers for the bone marrow benchmark data set, surface
markers CD3, CD7, CD11b, CD15, CD19, CD33, CD34, CD38, CD41, CD44,
CD45, CD47, CD64, CD117, CD123 and HLA-DR for the Phenograph data set, and
markers CD3, CD4, CD5, CD8, CD11b, CD11c, CD16/32, CD19, CD23, CD25,
CD27, CD34, CD43, CD44, CD45.2, CD49b, CD64, CD103, CD115, CD138,
CD150, 120g8, B220, CCR7, c-KIT, F4/80, FceR1a, Foxp3, IgD, IgM, Ly6C, Ly6G,
MHCII, NKp46, Sca1, SiglecF, TCRb, TCRgd and Ter119 to construct the
hierarchy for the VorteX data set. We used the standard parameters for the hierarchy
construction: number of random walks for landmark selection, N=100; random
walk length, L=15; number of random walks for influence computation, N=15.
For any clustering that occurred the GMS grid size was set to S=256 ref. 2. The
reduction factor from one level in the hierarchy to the next coarser level is
completely data-driven. In our experiments with mass cytometry data, the number of
landmarks was consistently reduced by roughly one order of magnitude from one
level to the next. Embeddings consisting of only a few hundred points usually
provide little insight. Therefore, we defined the number of levels such that the
overview level could be expected to consist of in the order of 1,000 landmarks,
meaning N=5 for the human inflammatory intestinal diseases data set and Phenograph
data set, N=3 for the bone marrow benchmark data set, and N=4 for the
VorteX data set. Building the hierarchy automatically creates a visualization of the
overview level using BH-SNE. Cytosplore+HSNE enables color coding of the landmarks
using expression (e.g., Fig. 3a) of any provided markers or by sample. For
example, we created the clinical feature (e.g., Fig. 3c, bottom-left panel) and blood/
intestine (e.g., Fig. 3c, bottom-right panel) color schemes based on samples for the
human inflammatory intestinal diseases data set within Cytosplore+HSNE, and for
the Phenograph data set, we created a color scheme that represented the sample
coloring as provided in ref. 4 (Supplementary Fig. 7). For zooming into the data, we
generally selected cells based on visible clusters, either using manual selection or by
selecting clusters derived by using the GMS clustering. For the VorteX data set, we
clustered the third level embedding (Supplementary Fig. 8). We specified a kernel
size of 0.18 of the embedding size, to match the 48 clusters created by the X-shift
clustering described in ref. 5, resulting in 50 clusters.
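The choice of the number of hierarchy levels described above can be approximated by a simple rule of thumb. The sketch below assumes that N counts the data level plus all coarser levels, that each level reduces the number of landmarks by a factor of about 10, and that the overview should hold roughly 1,000 landmarks. The function `suggest_levels` and the rounding rule are illustrative assumptions, not part of Cytosplore+HSNE; the actual reduction per level is data-driven.

```python
import math


def suggest_levels(n_cells, target_overview=1_000, reduction=10):
    """Estimate the number of levels (data level included) for a given data size."""
    # Number of roughly tenfold reductions needed to get near the target size.
    coarser_levels = max(0, round(math.log(n_cells / target_overview, reduction)))
    return coarser_levels + 1


# Cell counts as reported in this paper.
data_sets = {
    "Gastrointestinal": 5_220_347,
    "Phenograph": 15_299_616,
    "Bone marrow": 81_747,
    "VorteX": 841_644,
}
for name, n_cells in data_sets.items():
    print(name, suggest_levels(n_cells))
```

For the four data sets analyzed here, this heuristic gives the same values of N as those reported in the text (5, 5, 3 and 4).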
For subset classification, we first cluster the embedding at a given level using the
GMS clustering. Next, we inspect the clustering by using the integrated descriptive
marker statistics and heatmap visualization. If there is still meaningful variation of
the marker expression within clusters, we zoom further into these clusters. If
clusters are phenotypically homogeneous, the corresponding cell types are defined
by inspecting the full marker expression profile in the heatmap and then the cluster
is exported from any level in the hierarchy.
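To make the clustering step concrete, here is a naive Gaussian mean shift on a two-dimensional embedding. It is a sketch only: Cytosplore+HSNE uses a GPU-based, grid-accelerated GMS, whereas this version iterates over all points directly. Interpreting the kernel size as a fraction of the embedding extent that sets the Gaussian bandwidth, and merging modes closer than half a bandwidth, are assumptions; the toy embedding is hypothetical.

```python
import math


def gaussian_mean_shift(points, bandwidth, n_iter=50):
    """Shift every point to its density mode; points sharing a mode share a label."""
    modes, labels = [], []
    for x, y in points:
        for _ in range(n_iter):
            wsum = xs = ys = 0.0
            for qx, qy in points:
                # Gaussian weight of each point relative to the current position.
                w = math.exp(-((qx - x) ** 2 + (qy - y) ** 2) / (2 * bandwidth ** 2))
                wsum += w
                xs += w * qx
                ys += w * qy
            x, y = xs / wsum, ys / wsum
        # Assign to an existing mode if close enough, otherwise open a new cluster.
        for idx, (mx, my) in enumerate(modes):
            if math.hypot(mx - x, my - y) < bandwidth / 2:
                labels.append(idx)
                break
        else:
            modes.append((x, y))
            labels.append(len(modes) - 1)
    return labels, modes


# Hypothetical toy embedding with two well-separated groups of landmarks.
embedding = [
    (0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2),
    (10.0, 10.0), (10.2, 9.9), (9.9, 10.1),
]
extent = max(
    max(p[0] for p in embedding) - min(p[0] for p in embedding),
    max(p[1] for p in embedding) - min(p[1] for p in embedding),
)
labels, modes = gaussian_mean_shift(embedding, bandwidth=0.18 * extent)
print(labels)
```

A smaller kernel fraction splits the embedding into more clusters and a larger one merges them, which is why the within-cluster marker variation is inspected before deciding whether to zoom in further.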
Data availability. The gastrointestinal mass cytometry data set that supports the
findings of this study is publicly available on Cytobank, experiment no. 60564,
https://community.cytobank.org/cytobank/experiments/60564. The source code of
the HSNE library, written in C++, is available at
https://github.com/Nicola17/High-Dimensional-Inspector. Furthermore, we provide a
Cytosplore+HSNE installer for Windows, allowing exploration of several million cells,
for academic use at https://www.cytosplore.org.
Received: 16 June 2017 Accepted: 25 September 2017
References
1. Saeys, Y., Gassen, S. V. & Lambrecht, B. N. Computational flow cytometry:
helping to make sense of high-dimensional immunology data. Nat. Rev.
Immunol. 16, 449–462 (2016).
2. Qiu, P. et al. Extracting a cellular hierarchy from high-dimensional cytometry
data with SPADE. Nat. Biotechnol. 29, 886–891 (2011).
3. Zunder, E. R., Lujan, E., Goltsev, Y., Wernig, M. & Nolan, G. P. A continuous
molecular roadmap to iPSC reprogramming through progression analysis of
single-cell mass cytometry. Cell Stem Cell 16, 323–337 (2015).
4. Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals
progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
5. Samusik, N., Good, Z., Spitzer, M. H., Davis, K. L. & Nolan, G. P. Automated
mapping of phenotype space with single-cell data. Nat. Methods 13, 493–496
(2016).
6. Spitzer, M. H. et al. IMMUNOLOGY. An interactive reference framework for
modeling a dynamic immune system. Science 349, 1259425 (2015).
7. Hotelling, H. Analysis of a complex of statistical variables into principal
components. J. Ed. Psychol. 24, 417–441 (1933).
8. van der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data
using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
9. Amir, E.-A. D. et al. viSNE enables visualization of high dimensional single-cell
data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31,
545–552 (2013).
10. Haghverdi, L., Buettner, F. & Theis, F. J. Diffusion maps for high-dimensional
single-cell analysis of differentiation data. Bioinformatics 31, 2989–2998 (2015).
11. Bendall, S. C., Nolan, G. P., Roederer, M. & Chattopadhyay, P. K. A deep
profiler's guide to cytometry. Trends Immunol. 33, 323–332 (2012).
12. Chattopadhyay, P. K., Gierahn, T. M., Roederer, M. & Love, J. C. Single-cell
technologies for monitoring immune systems. Nat. Immunol. 15, 128–135
(2014).
13. Pezzotti, N., Höllt, T., Lelieveldt, B., Eisemann, E. & Vilanova, A. Hierarchical
Stochastic Neighbor Embedding. Comput. Graph. Forum 35, 21–30 (2016).
14. van Unen, V. et al. Mass cytometry of the human mucosal immune system
identifies tissue- and disease-associated immune subsets. Immunity 44,
1227–1239 (2016).
15. van der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach.
Learn. Res. 15, 3221–3245 (2014).
16. Pezzotti, N. et al. Approximated and user steerable tSNE for progressive visual
analytics. IEEE Trans. Vis. Comput. Graph. 23, 1739–1752 (2016).
17. Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from
single-cell data. Nat. Biotechnol. 34, 637–645 (2016).
18. Comaniciu, D. & Meer, P. Mean shift: a robust approach toward feature space
analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24, 603–619 (2002).
19. Spits, H. & Cupedo, T. Innate lymphoid cells: emerging insights in
development, lineage relationships, and function. Annu. Rev. Immunol. 30,
647–675 (2012).
20. McKenzie, A. N. J., Spits, H. & Eberl, G. Innate lymphoid cells in inflammation
and immunity. Immunity 41, 366–374 (2014).
21. Spits, H. et al. Innate lymphoid cells--a proposal for uniform nomenclature.
Nat. Rev. Immunol. 13, 145–149 (2013).
22. Robinette, M. L. et al. Transcriptional programs define molecular characteristics
of innate lymphoid cell classes and subsets. Nat. Immunol. 16, 306–317
(2015).
23. Schmitz, F. et al. Identification of a potential physiological precursor of aberrant
cells in refractory coeliac disease type II. Gut 62, 509–519 (2013).
24. Schmitz, F. et al. The composition and differentiation potential of the duodenal
intraepithelial innate lymphocyte compartment is altered in coeliac disease.
Gut 65, 1269–1278 (2016).
25. Ettersperger, J. et al. Interleukin-15-dependent T-cell-like innate intraepithelial
lymphocytes develop in the intestine and transform into lymphomas in celiac
disease. Immunity 45, 610–625 (2016).
26. Mou, D., Espinosa, J., Lo, D. J. & Kirk, A. D. CD28 negative T cells: is their loss
our gain? Am. J. Transplant. 14, 2460–2466 (2014).
27. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug
responses across a human hematopoietic continuum. Science 332, 687–696
(2011).
28. Shaham, U. & Steinerberger, S. Stochastic neighbor embedding separates well-
separated clusters. arXiv:1702.02670 [stat.ML] (2017).
29. Höllt, T. et al. Cytosplore: Interactive immune cell phenotyping for large single-
cell datasets. Comput. Graph. Forum 35, 171–180 (2016).
30. Finck, R. et al. Normalization of mass cytometry data with bead standards.
Cytometry A 83, 483–494 (2013).
Acknowledgements
The research leading to these results has received funding from Leiden University
Medical Center, the Netherlands Organization for Scientific Research (ZonMW grant
91112008) and the Technology Foundation STW, the Netherlands (VAnPIRe; grant
12720, and Genes in Space; grant 12721). We thank Drs M.W. Schilham,
M. Yazdanbakhsh, J. Goeman, K. Schepers, J. van Bergen and S.E. de Jong for critical
review of the manuscript and B. van Lew for narrating the Supplementary Movie 1.
Author contributions
V.v.U., T.H., N.P., F.K., A.V. and B.P.F.L.: Conceived the study. T.H., N.P., A.V. and
B.P.F.L.: Developed the HSNE method and implementation in Cytosplore+HSNE. V.v.U.
and F.K.: Performed the biological analysis and interpretation. T.H.: Performed the
t-SNE scalability analysis and comparison. V.v.U.: Performed the hierarchy robustness
analysis. V.v.U. and T.H.: Performed the comparison with other methods. N.L., M.J.T.R.
and E.E.: Provided conceptual input. V.v.U., T.H., N.P., F.K., A.V. and B.L.: Wrote the
manuscript. All authors discussed the results and commented on the manuscript.
Additional information
Supplementary Information accompanies this paper at doi:10.1038/s41467-017-01689-9.
Competing interests: The authors declare no competing financial interests.
Reprints and permission information is available online at
http://npg.nature.com/reprintsandpermissions/
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made. The images or other third party
material in this article are included in the article's Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not included in the
article's Creative Commons license and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2017
... Mass cytometry allows the simultaneous measurement of over 40 cellular markers at a single-cell resolution, providing the opportunity to investigate the immune response with unprecedented resolution in an unbiased and data-driven manner [7]. As the analysis methods for traditional flow cytometry are not suitable for high-dimensional mass cytometry datasets, novel computational algorithms have been developed, such as hierarchical stochastic neighbour embedding (HSNE) [8]. ...
... Pooled data from individual live CD45+ cells that were individually gated using FlowJo software (version 10.5.3), as shown in Additional file 2, were sample-tagged and hyperbolic-arcsinh-transformed with a cofactor of 5 using Cytosplore+HSNE software [8]. The hierarchical stochastic neighbour embedding (H-SNE) analysis was carried out with default settings (Perplexity: 30; iteration: 1000). ...
... This antibody panel allowed us to identify all of the major immune lineages, including myeloid cells, innate lymphoid cells (ILCs), CD4+ T cells, CD8+ T cells, other T cells and B cells, and investigate the heterogeneity within each lineage. After obtaining data from individual live CD45+ immune cells (Additional file 2), we pooled all of the data (219,967 CD45+ cells) derived from 14 lungs and carried out an HSNE analysis in Cytosplore [8]. Here, the landmarks described the lung immune composition (Figure 1E). ...
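The hyperbolic arcsinh transformation with a cofactor of 5 mentioned in the excerpts above can be illustrated with a minimal sketch. It assumes the common cytometry convention of computing asinh(x / cofactor) per marker intensity. The function name and the sample values are hypothetical and are not taken from Cytosplore+HSNE.

```python
import math


def arcsinh_transform(values, cofactor=5.0):
    """Return asinh(x / cofactor) for each raw marker intensity x.

    Assumption: this is the usual cytometry convention; the exact
    implementation inside Cytosplore+HSNE is not shown in the source.
    """
    if cofactor <= 0:
        raise ValueError("cofactor must be positive")
    return [math.asinh(x / cofactor) for x in values]


# Hypothetical raw intensities for one marker (not real data).
raw_counts = [0.0, 5.0, 50.0, 500.0]
transformed = arcsinh_transform(raw_counts)

for raw, value in zip(raw_counts, transformed):
    print(f"{raw:7.1f} -> {value:.3f}")
```

The transform is roughly linear for intensities near zero and roughly logarithmic for large intensities, which is why it is often preferred over a plain logarithm for data containing zeros.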
Article
Due to the increase in bacterial resistance, improving the anti-infectious immunity of the host is rapidly becoming a new strategy for the prevention and treatment of bacterial pneumonia. However, the specific lung immune responses and key immune cell subsets involved in bacterial infection are obscure. Actinobacillus pleuropneumoniae (APP) can cause porcine pleuropneumonia, a highly contagious respiratory disease that has caused severe economic losses in the swine industry. Here, using high-dimensional mass cytometry, the major immune cell repertoire in the lungs of mice with APP infection was profiled. Various phenotypically distinct neutrophil subsets and Ly-6C⁺ inflammatory monocytes/macrophages accumulated post-infection. Moreover, a linear differentiation trajectory from inactivated to activated to apoptotic neutrophils corresponded with the stages of uninfected, onset, and recovery of APP infection. CD14⁺ neutrophils, which mainly increased in number during the recovery stage of infection, were revealed to have a stronger ability to produce cytokines, especially IL-10 and IL-21, than their CD14⁻ counterparts. Importantly, MHC-II⁺ neutrophils with antigen-presenting cell features were identified, and their numbers increased in the lung after APP infection. Similar results were further confirmed in the lungs of piglets infected with APP and Klebsiella pneumoniae infection by using a single-cell RNA-seq technique. Additionally, a correlation analysis between cluster composition and the infection process yielded a dynamic and temporally associated immune landscape where key immune clusters, including previously unrecognized ones, marked various stages of infection. Thus, these results reveal the characteristics of key neutrophil clusters and provide a detailed understanding of the immune response to bacterial pneumonia. Supplementary Information The online version contains supplementary material available at 10.1186/s13567-023-01207-4.
... We investigated the immune response to PfSPZ-CVac [CQ] associated with protection at the pre-CHMI time point (c-1; 8-10 weeks after the third PfSPZ-CVac [CQ] vaccination and 1 day prior to CHMI). Unsupervised clustering using hierarchical stochastic neighbor embedding (HSNE) (17,18) identified a total of 103 distinct cell clusters (Supplemental Figure 1), classified into lineages and subsets (Supplemental Table 1) based on their marker expression. Seven major immune lineages were annotated: CD4+ T cells, CD8+ T cells, γδ T cells, unconventional T cells, B cells, and innate lymphoid cells (including NK cells) as well as monocytes and DCs (Supplemental Figure 1). ...
... The normalized FCS files were exported and analyzed with FlowJo V10 (TreeStar) to exclude EQ beads and select live CD45+ cells (Supplemental Figure 4A). The FCS files from selected live CD45+ cells were then analyzed using the hierarchical stochastic neighborhood embedding (HSNE) method on Cytosplore (17,18), a dimensionality reduction visualization tool (Supplemental Figure 4B). The HSNE method of clustering enables the selection of cell landmarks (also referred to as clusters) per level based on the similarity of their marker expression. ...
... Thus, the following discussion will be restricted to t-SNE. In the context of t-SNE, a hierarchical method has been previously introduced [18]. However, the hierarchies are tailored to enhance visual tasks, not the performance of the embedding itself. ...
... • How does our approach relate to HSNE [18]? The two approaches follow different goals, but can they be combined? ...
Preprint
Widely used pipelines for the analysis of high-dimensional data utilize two-dimensional visualizations. These are created, e.g., via t-distributed stochastic neighbor embedding (t-SNE). When it comes to large data sets, applying these visualization techniques creates suboptimal embeddings, as the hyperparameters are not suitable for large data. Cranking up these parameters usually does not work as the computations become too expensive for practical workflows. In this paper, we argue that a sampling-based embedding approach can circumvent these problems. We show that hyperparameters must be chosen carefully, depending on the sampling rate and the intended final embedding. Further, we show how this approach speeds up the computation and increases the quality of the embeddings.
... Both arcsinh-transformation and fdaNorm-normalization were carried out using the R scripts published by Melsen et al. [15]. For dimensionality reduction and clustering, we utilized HSNE (hierarchical stochastic neighbor embedding) that constructs a hierarchical representation of the entire dataset, maintaining the non-linear high-dimensional relationships between cells that can be gradually explored from a broader overview to the detailed single-cell level [16]. HSNE was applied in the Cytosplore app [17], an integrated single-cell analysis framework that enables dynamic exploration of the hierarchy through a two-dimensional scatter plot in which cells are arranged according to the similarity in the expression of all markers at the same time. ...
Article
Hypoxia leads to metabolic changes at the cellular, tissue, and organismal levels. The molecular mechanisms for controlling physiological changes during hypoxia have not yet been fully studied. Erythroid cells are essential for adjusting the rate of erythropoiesis and can influence the development and differentiation of immune cells under normal and pathological conditions. We simulated high-altitude hypoxia conditions for mice and assessed the content of erythroid nucleated cells in the spleen and bone marrow under the existing microenvironment. For a pure population of CD71+ erythroid cells, we assessed the production of cytokines and the expression of genes that regulate the immune response. Our findings show changes in the cellular composition of the bone marrow and spleen during hypoxia, as well as changes in the composition of the erythroid cell subpopulations during acute hypoxic exposure in the form of a decrease in orthochromatophilic erythroid cells that are ready for rapid enucleation and the accumulation of their precursors. Cytokine production normally differs only between organs; this effect persists during hypoxia. In the bone marrow, during hypoxia, genes of the C-lectin pathway are activated. Thus, hypoxia triggers the activation of various adaptive and compensatory mechanisms in order to limit inflammatory processes and modify metabolism.
Article
Advancements in cytometry technologies have enabled quantification of up to 50 proteins across millions of cells at single cell resolution. Analysis of cytometry data routinely involves tasks such as data integration, clustering, and dimensionality reduction. While numerous tools exist, many require extensive run times when processing large cytometry data containing millions of cells. Existing solutions, such as random subsampling, are inadequate as they risk excluding rare cell subsets. To address this, we propose SuperCellCyto, an R package that builds on the SuperCell tool which groups highly similar cells into supercells. SuperCellCyto is available on GitHub (https://github.com/phipsonlab/SuperCellCyto) and Zenodo (https://doi.org/10.5281/zenodo.10521294).
Article
Background Pulmonary infections are a crucial health concern for patients with advanced non–small-cell lung cancer (NSCLC). Whether the clinical outcome of pulmonary infection is influenced by immunotherapy (IO) remains unclear. By evaluating immune signatures, this study investigated the post-immunotherapy risk of pulmonary infection in patients with lung cancer and identified circulating biomarkers that predict post-immunotherapy infection. Methods Blood specimens were prospectively collected from patients with NSCLC before and after chemotherapy (C/T) and/or IO to explore dynamic changes in immune signatures. Real-world clinical data were extracted from medical records for outcome evaluation. Mass cytometry and ELISA were employed to analyze immune signatures and cytokine profiles to reveal potential correlations between immune profiles and the risk of infection. Results The retrospective cohort included 283 patients with advanced NSCLC. IO was associated with a lower risk of pneumonia (odds ratio=0.46, p=0.012). Patients who received IO and remained pneumonia-free exhibited the most favorable survival outcomes compared with those who received C/T or developed pneumonia (p<0.001). The prospective cohort enrolled 30 patients. The proportion of circulating NK cells significantly increased after treatment in the IO alone (p<0.001) and C/T+IO groups (p<0.01). An increase in cell densities of circulating PD-1⁺CD8⁺ (cytotoxic) T cells (p<0.01) and PD-1⁺CD4⁺ T cells (p<0.01) was observed in the C/T alone group after treatment. In the IO alone group, a decrease in cell densities of TIM-3⁺ and PD-1⁺ cytotoxic T cells (p<0.05) and PD-1⁺CD4⁺ T cells (p<0.01) was observed after treatment. In the C/T alone and C/T+IO groups, cell densities of circulating PD-1⁺ cytotoxic T cells significantly increased in patients with pneumonia after treatment (p<0.05).
However, in the IO alone group, the cell density of PD-1⁺ cytotoxic T cells significantly decreased in patients without pneumonia after treatment (p<0.05). TNF-α significantly increased after treatment with IO alone (p<0.05) but decreased after C/T alone (p<0.01). Conclusions Our results indicate that the incorporation of immunotherapy into treatment regimens may offer protective effects against pulmonary infection. Protective effects are associated with a reduction of exhausted T cells and augmentation of TNF-α and NK cells. Exhausted T cells, NK cells, and TNF-α may play crucial roles in immune responses against infections. These observations highlight the potential utility of certain circulating biomarkers, particularly exhausted T cells, for predicting post-treatment infections.
Article
Multiple myeloma (MM) is a malignant neoplasm characterized by clonal proliferation of abnormal plasma cells. In many countries, it ranks as the second most prevalent malignant neoplasm of the hematopoietic system. Although treatment methods for MM have been continuously improved and the survival of patients has been dramatically prolonged, MM remains an incurable disease with a high probability of recurrence. As such, there are still many challenges to be addressed. One promising approach is single-cell RNA sequencing (scRNA-seq), which can elucidate the transcriptome heterogeneity of individual cells and reveal previously unknown cell types or states in complex tissues. In this review, we outlined the experimental workflow of scRNA-seq in MM, listed some commonly used scRNA-seq platforms and analytical tools. In addition, with the advent of scRNA-seq, many studies have made new progress in the key molecular mechanisms during MM clonal evolution, cell interactions and molecular regulation in the microenvironment, and drug resistance mechanisms in target therapy. We summarized the main findings and sequencing platforms for applying scRNA-seq to MM research and proposed broad directions for targeted therapies based on these findings.
Article
Precision oncology approaches for patients with colorectal cancer (CRC) continue to lag behind other solid cancers. Functional precision oncology—a strategy that is based on perturbing primary tumor cells from cancer patients—could provide a road forward to personalize treatment. We extend this paradigm to measuring proteome activity landscapes by acquiring quantitative phosphoproteomic data from patient-derived organoids (PDOs). We show that kinase inhibitors induce inhibitor- and patient-specific off-target effects and pathway crosstalk. Reconstruction of the kinase networks revealed that the signaling rewiring is modestly affected by mutations. We show non-genetic heterogeneity of the PDOs and upregulation of stemness and differentiation genes by kinase inhibitors. Using imaging mass-cytometry-based profiling of the primary tumors, we characterize the tumor microenvironment (TME) and determine spatial heterocellular crosstalk and tumor-immune cell interactions. Collectively, we provide a framework for inferring tumor cell intrinsic signaling and external signaling from the TME to inform precision (immuno-) oncology in CRC.
Article
Cancers evade T-cell immunity by several mechanisms such as secretion of anti-inflammatory cytokines, downregulation of antigen presentation machinery, upregulation of immune checkpoint molecules, and exclusion of T cells from tumor tissues. The distribution and function of immune checkpoint molecules on tumor cells and tumor-infiltrating leukocytes is well established, but less is known about their impact on intratumoral endothelial cells. Here, we demonstrated that V-domain Ig suppressor of T-cell activation (VISTA), a PD-L1 homologue, was highly expressed on endothelial cells in synovial sarcoma, subsets of different carcinomas, and immune-privileged tissues. We created an ex vivo model of the human vasculature and demonstrated that expression of VISTA on endothelial cells selectively prevented T-cell transmigration over endothelial layers, under physiological flow conditions, whereas it does not affect migration of other immune cell types. Furthermore, endothelial VISTA correlated with reduced infiltration of T cells and poor prognosis in metastatic synovial sarcoma. In endothelial cells, we detected VISTA on the plasma membrane and in recycling endosomes, and its expression was upregulated by cancer cell-secreted factors in a VEGF-A-dependent manner. Our study reveals that endothelial VISTA is upregulated by cancer-secreted factors and that it regulates T-cell accessibility to cancer and healthy tissues. This newly identified mechanism should be considered when using immunotherapeutic approaches aimed at unleashing T cell-mediated cancer immunity.
Article
Stochastic Neighbor Embedding and its variants are widely used dimensionality reduction techniques -- despite their popularity, no theoretical results are known. We prove that the optimal SNE embedding of well-separated clusters from high dimensions to the real line $\mathbb{R}$ manages to successfully separate the clusters in a quantitative way.
Article
Accurate identification of cell subsets in complex populations is key to discovering novelty in multidimensional single-cell experiments. We present X-shift (http://web.stanford.edu/~samusik/vortex/), an algorithm that processes data sets using fast k-nearest-neighbor estimation of cell event density and arranges populations by marker-based classification. X-shift enables automated cell-subset clustering and access to biological insights that 'prior knowledge' might prevent the researcher from discovering.
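As a rough illustration of the k-nearest-neighbor density estimation mentioned in the abstract above, the sketch below computes the textbook kNN density estimate by brute force. It is not X-shift's own estimator or its fast neighbor search. The function name and the toy points are hypothetical.

```python
import math


def knn_density(points, k):
    """Textbook kNN density estimate: k / (n * volume of the ball that
    reaches the k-th nearest neighbour of each point).

    Assumption: brute-force search and Euclidean distance, chosen only
    for clarity; this is not the X-shift implementation.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dim = len(points[0])
    # Volume of the unit ball in `dim` dimensions.
    unit_ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radius = dists[k - 1]
        if radius == 0.0:
            # k or more exact duplicates: the estimate diverges.
            densities.append(float("inf"))
        else:
            densities.append(k / (n * unit_ball * radius ** dim))
    return densities


# Hypothetical 2D events: a tight group of four plus one isolated event.
events = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
densities = knn_density(events, k=2)
print(densities)
```

Events in the tight group receive a much higher density estimate than the isolated event, which is the property a density-based clustering method relies on when it looks for local density maxima.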
Article
In recent years, dimensionality-reduction techniques have been developed and are widely used for hypothesis generation in Exploratory Data Analysis. However, these techniques are confronted with overcoming the trade-off between computation time and the quality of the provided dimensionality reduction. In this work, we address this limitation, by introducing Hierarchical Stochastic Neighbor Embedding (Hierarchical-SNE). Using a hierarchical representation of the data, we incorporate the well-known mantra of Overview-First, Details-On-Demand in non-linear dimensionality reduction. First, the analysis shows an embedding, that reveals only the dominant structures in the data (Overview). Then, by selecting structures that are visible in the overview, the user can filter the data and drill down in the hierarchy. While the user descends into the hierarchy, detailed visualizations of the high-dimensional structures will lead to new insights. In this paper, we explain how Hierarchical-SNE scales to the analysis of big datasets. In addition, we show its application potential in the visualization of Deep-Learning architectures and the analysis of hyperspectral images.
Article
The nature of gut intraepithelial lymphocytes (IELs) lacking antigen receptors remains controversial. Herein we showed that, in humans and in mice, innate intestinal IELs expressing intracellular CD3 (iCD3(+)) differentiate along an Id2 transcription factor (TF)-independent pathway in response to TF NOTCH1, interleukin-15 (IL-15), and Granzyme B signals. In NOTCH1-activated human hematopoietic precursors, IL-15 induced Granzyme B, which cleaved NOTCH1 into a peptide lacking transcriptional activity. As a result, NOTCH1 target genes indispensable for T cell differentiation were silenced and precursors were reprogrammed into innate cells with T cell marks including intracellular CD3 and T cell rearrangements. In the intraepithelial lymphoma complicating celiac disease, iCD3(+) innate IELs acquired gain-of-function mutations in Janus kinase 1 or Signal transducer and activator of transcription 3, which enhanced their response to IL-15. Overall we characterized gut T cell-like innate IELs, deciphered their pathway of differentiation and showed their malignant transformation in celiac disease.
Article
Recent advances in flow cytometry allow scientists to measure an increasing number of parameters per cell, generating huge and high-dimensional datasets. To analyse, visualize and interpret these data, newly available computational techniques should be adopted, evaluated and improved upon by the immunological community. Computational flow cytometry is emerging as an important new field at the intersection of immunology and computational biology; it allows new biological knowledge to be extracted from high-throughput single-cell data. This Review provides non-experts with a broad and practical overview of the many recent developments in computational flow cytometry.
Article
Inflammatory intestinal diseases are characterized by abnormal immune responses and affect distinct locations of the gastrointestinal tract. Although the role of several immune subsets in driving intestinal pathology has been studied, a system-wide approach that simultaneously interrogates all major lineages on a single-cell basis is lacking. We used high-dimensional mass cytometry to generate a system-wide view of the human mucosal immune system in health and disease. We distinguished 142 immune subsets and through computational applications found distinct immune subsets in peripheral blood mononuclear cells and intestinal biopsies that distinguished patients from controls. In addition, mucosal lymphoid malignancies were readily detected as well as precursors from which these likely derived. These findings indicate that an integrated high-dimensional analysis of the entire immune system can identify immune subsets associated with the pathogenesis of complex intestinal disorders. This might have implications for diagnostic procedures, immune-monitoring, and treatment of intestinal diseases and mucosal malignancies.
Article
Recent single-cell analysis technologies offer an unprecedented opportunity to elucidate developmental pathways. Here we present Wishbone, an algorithm for positioning single cells along bifurcating developmental trajectories with high resolution. Wishbone uses multi-dimensional single-cell data, such as mass cytometry or RNA-Seq data, as input and orders cells according to their developmental progression, and it pinpoints bifurcation points by labeling each cell as pre-bifurcation or as one of two post-bifurcation cell fates. Using 30-channel mass cytometry data, we show that Wishbone accurately recovers the known stages of T-cell development in the mouse thymus, including the bifurcation point. We also apply the algorithm to mouse myeloid differentiation and demonstrate its generalization to additional lineages. A comparison of Wishbone to diffusion maps, SCUBA and Monocle shows that it outperforms these methods both in the accuracy of ordering cells and in the correct identification of branch points.
Article
To understand how the immune system works, one needs to have a clear picture of its cellular composition and the cells’ corresponding properties and functionality. Mass cytometry is a novel technique to determine the properties of single cells with unprecedented detail. This amount of detail allows for much finer differentiation but also comes at the cost of more complex analysis. In this work, we present Cytosplore, implementing an interactive workflow to analyze mass cytometry data in an integrated system, providing multiple linked views, showing different levels of detail and enabling the rapid definition of known and unknown cell types. Cytosplore handles millions of cells, each represented as a high-dimensional data point, facilitates hypothesis generation and confirmation, and provides a significant speedup of the current workflow. We show the effectiveness of Cytosplore in a case study evaluation.