UMAP: Uniform Manifold
Approximation and Projection for
Dimension Reduction
Leland McInnes
Tue Institute for Mathematics and Computing
leland.mcinnes@gmail.com
John Healy
Tue Institute for Mathematics and Computing
jchealy@gmail.com
James Melville
jlmelville@gmail.com
September 21, 2020
Abstract
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that is applicable to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.
1 Introduction
Dimension reduction plays an important role in data science, being a fundamental technique for both visualisation and pre-processing for machine learning. Dimension reduction techniques are being applied in a broadening range of fields and on ever increasing sizes of datasets. It is thus desirable to have an algorithm that is both scalable to massive data and able to cope with the diversity of data available. Dimension reduction algorithms tend to fall into two categories: those that seek to preserve the pairwise distance structure amongst all the data samples, and those that favor the preservation of local distances over global distance. Algorithms such as PCA [27], MDS [30], and Sammon mapping [50] fall into the former category, while t-SNE [59, 58], Isomap [56], LargeVis [54], Laplacian eigenmaps [6, 7] and diffusion maps [16] all fall into the latter category.
In this paper we introduce a novel manifold learning technique for di-
mension reduction. We provide a sound mathematical theory grounding
the technique and a practical scalable algorithm that applies to real world
data. UMAP (Uniform Manifold Approximation and Projection) builds upon
mathematical foundations related to the work of Belkin and Niyogi on
Laplacian eigenmaps. We seek to address the issue of uniform data distri-
butions on manifolds through a combination of Riemannian geometry and
the work of David Spivak [52] in category theoretic approaches to geomet-
ric realization of fuzzy simplicial sets. t-SNE is the current state-of-the-art
for dimension reduction for visualization. Our algorithm is competitive
with t-SNE for visualization quality and arguably preserves more of the
global structure with superior run time performance. Furthermore, the algorithm is able to scale to significantly larger data set sizes than are feasible
for t-SNE. Finally, UMAP has no computational restrictions on embedding
dimension, making it viable as a general purpose dimension reduction tech-
nique for machine learning.
Based upon preliminary releases of a software implementation, UMAP has already found widespread use in the fields of bioinformatics [5, 12, 17, 46, 2, 45, 15], materials science [34, 23], and machine learning [14, 20, 21, 24, 19, 47] among others.
This paper is laid out as follows. In Section 2 we describe the theory underlying the algorithm. Section 2 is necessary to understand both the theory underlying why UMAP works and the motivation for the choices that were made in developing the algorithm. A reader without a background (or interest) in topological data analysis, category theory or the theoretical underpinnings of UMAP may safely skip over this section and proceed directly to Section 3.
That being said, we feel that strong theory and mathematically justified algorithmic decisions are of particular importance in the field of unsupervised learning. This is, at least partially, due to the plethora of proposed objective functions within the area. We attempt to highlight in this paper that UMAP's design decisions were all grounded in a solid theoretic foundation and not derived through experimentation with any particular task focused objective function. Though all neighbourhood based manifold learning algorithms must share certain fundamental components, we believe it to be advantageous for these components to be selected through well grounded theoretical decisions. One of the primary contributions of this paper is to reframe the problem of manifold learning and dimension reduction in a different mathematical language, allowing practitioners to apply a new field of mathematics to the problems.
In Section 3 we provide a more computational description of UMAP. Section 3 should provide readers less familiar with topological data analysis with a better foundation for understanding the theory described in Section 2. Appendix C contrasts UMAP against the more familiar algorithms t-SNE and LargeVis, describing all these algorithms in similar language. This section should assist readers already familiar with those techniques to quickly gain an understanding of the UMAP algorithm, though it will grant little insight into its theoretical underpinnings.
In Section 4 we discuss implementation details of the UMAP algorithm. This includes a more detailed algorithmic description, and discussion of the hyper-parameters involved and their practical effects.
In Section 5 we provide practical results on real world datasets as well
as scaling experiments to demonstrate the algorithm’s performance in real
world scenarios as compared with other dimension reduction algorithms.
In Section 6 we discuss relative weaknesses of the algorithm, and applications for which UMAP may not be the best choice.
Finally, in Section 7 we detail a number of potential extensions of UMAP that are made possible by its construction upon solid mathematical foundations. These avenues for further development include semi-supervised learning, metric learning and heterogeneous data embedding.
2 Theoretical Foundations for UMAP
The theoretical foundations for UMAP are largely based in manifold theory and topological data analysis. Much of the theory is most easily explained in the language of topology and category theory. Readers may consult [39], [49] and [40] for background. Readers more interested in practical computational aspects of the algorithm, and not necessarily the theoretical motivation for the computations involved, may wish to skip this section. Readers more familiar with traditional machine learning may find the relationships between UMAP, t-SNE and LargeVis located in Appendix C enlightening. Unfortunately, this purely computational view fails to shed any light upon the reasoning that underlies the algorithmic decisions made in UMAP. Without strong theoretical foundations the only arguments which can be made about algorithms amount to empirical measures, for which there are no clear universal choices for unsupervised problems.
At a high level, UMAP uses local manifold approximations and patches
together their local fuzzy simplicial set representations to construct a topo-
logical representation of the high dimensional data. Given some low dimen-
sional representation of the data, a similar process can be used to construct
an equivalent topological representation. UMAP then optimizes the layout
of the data representation in the low dimensional space, to minimize the
cross-entropy between the two topological representations.
The construction of fuzzy topological representations can be broken down into two problems: approximating a manifold on which the data is assumed to lie, and constructing a fuzzy simplicial set representation of the approximated manifold. In explaining the algorithm we will first discuss the method of approximating the manifold for the source data. Next we will discuss how to construct a fuzzy simplicial set structure from the manifold approximation. Finally, we will discuss the construction of the fuzzy simplicial set associated to a low dimensional representation (where the manifold is simply $\mathbb{R}^d$), and how to optimize the representation with respect to our objective function.
2.1 Uniform distribution of data on a manifold and
geodesic approximation
e rst step of our algorithm is to approximate the manifold we assume
the data (approximately) lies on. e manifold may be known apriori (as
simply Rn) or may need to be inferred from the data. Suppose the manifold
is not known in advance and we wish to approximate geodesic distance on
it. Let the input data be X={X1, . . . , XN}. As in the work of Belkin and
Niyogi on Laplacian eigenmaps [6, 7], for theoretical reasons it is benecial
to assume the data is uniformly distributed on the manifold, and even if that
assumption is not made (e.g [26]) results are only valid in the limit of innite
data. In practice, nite real world data is rarely so nicely behaved. However,
if we assume that the manifold has a Riemannian metric not inherited from
the ambient space, we can nd a metric such that the data is approximately
uniformly distributed with regard to that metric.
4
Formally, let Mbe the manifold we assume the data to lie on, and let g
be the Riemannian metric on M. us, for each point p∈ M we have gp,
an inner product on the tangent space TpM.
Lemma 1. Let $(\mathcal{M}, g)$ be a Riemannian manifold in an ambient $\mathbb{R}^n$, and let $p \in \mathcal{M}$ be a point. If $g$ is locally constant about $p$ in an open neighbourhood $U$ such that $g$ is a constant diagonal matrix in ambient coordinates, then in a ball $B \subseteq U$ centered at $p$ with volume $\frac{\pi^{n/2}}{\Gamma(n/2+1)}$ with respect to $g$, the geodesic distance from $p$ to any point $q \in B$ is $\frac{1}{r} d_{\mathbb{R}^n}(p, q)$, where $r$ is the radius of the ball in the ambient space and $d_{\mathbb{R}^n}$ is the existing metric on the ambient space.

See Appendix A of the supplementary materials for a proof of Lemma 1.
If we assume the data to be uniformly distributed on $\mathcal{M}$ (with respect to $g$) then, away from any boundaries, any ball of fixed volume should contain approximately the same number of points of $X$ regardless of where on the manifold it is centered. Given finite data and small enough local neighborhoods this crude approximation should be accurate enough even for data samples near manifold boundaries. Now, conversely, a ball centered at $X_i$ that contains exactly the $k$-nearest-neighbors of $X_i$ should have approximately fixed volume regardless of the choice of $X_i \in X$. Under Lemma 1 it follows that we can approximate geodesic distance from $X_i$ to its neighbors by normalising distances with respect to the distance to the $k$-th nearest neighbor of $X_i$.

In essence, by creating a custom distance for each $X_i$, we can ensure the validity of the assumption of uniform distribution on the manifold. The cost is that we now have an independent notion of distance for each and every $X_i$, and these notions of distance may not be compatible. We have a family of discrete metric spaces (one for each $X_i$) that we wish to merge into a consistent global structure. This can be done in a natural way by converting the metric spaces into fuzzy simplicial sets.
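To make this concrete, write $X_i^{(k)}$ for the $k$-th nearest neighbor of $X_i$ (notation introduced here purely for illustration). Lemma 1 then licenses approximating local geodesic distance by the normalized ambient distance
$$d_i(X_i, X_j) \approx \frac{d_{\mathbb{R}^n}(X_i, X_j)}{d_{\mathbb{R}^n}(X_i, X_i^{(k)})},$$
so that each point sees its $k$-th nearest neighbor at unit distance, realizing the uniformity assumption locally.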
2.2 Fuzzy topological representation
We will use functors between the relevant categories to convert from metric spaces to fuzzy topological representations. This will provide a means to merge the incompatible local views of the data. The topological structure of choice is that of simplicial sets. For more details on simplicial sets we refer the reader to [25], [40], [48], or [22]. Our approach draws heavily upon the work of Michael Barr [3] and David Spivak in [52], and many of the definitions and theorems below are drawn or adapted from those sources. We assume familiarity with the basics of category theory. For an introduction to category theory readers may consult [39] or [49].
To start we will review the definitions for simplicial sets. Simplicial sets provide a combinatorial approach to the study of topological spaces. They are related to the simpler notion of simplicial complexes – which construct topological spaces by gluing together simple building blocks called simplices – but are more general. Simplicial sets are most easily defined purely abstractly in the language of category theory.
Denition 1. e category has as objects the nite order sets [n] =
{1, . . . , n}, with morphims given by (non-strictly) order-preserving maps.
Following standard category theoretic notation, op denotes the cate-
gory with the same objects as and morphisms given by the morphisms
of with the direction (domain and codomain) reversed.
Denition 2. Asimplicial set is a functor from op to Sets, the category of
sets; that is, a contravariant functor from to Sets.
Given a simplicial set $X : \Delta^{op} \to \mathbf{Sets}$, it is common to denote the set $X([n])$ as $X_n$ and refer to the elements of the set as the $n$-simplices of $X$. The simplest possible examples of simplicial sets are the standard simplices $\Delta^n$, defined as the representable functors $\hom(\cdot, [n])$. It follows from the Yoneda lemma that there is a natural correspondence between $n$-simplices of $X$ and morphisms $\Delta^n \to X$ in the category of simplicial sets, and it is often helpful to think in these terms. Thus for each $x \in X_n$ we have a corresponding morphism $x : \Delta^n \to X$. By the density theorem and employing a minor abuse of notation we then have
$$\operatorname*{colim}_{x \in X_n} \Delta^n \cong X.$$
There is a standard covariant functor $|\cdot| : \Delta \to \mathbf{Top}$ mapping from the category $\Delta$ to the category of topological spaces that sends $[n]$ to the standard $n$-simplex $|\Delta^n| \subset \mathbb{R}^{n+1}$ defined as
$$|\Delta^n| \triangleq \left\{ (t_0, \ldots, t_n) \in \mathbb{R}^{n+1} \,\middle|\, \sum_{i=0}^{n} t_i = 1,\; t_i \geq 0 \right\}$$
with the standard subspace topology. If $X : \Delta^{op} \to \mathbf{Sets}$ is a simplicial set then we can construct the realization of $X$ (denoted $|X|$) as the colimit
$$|X| = \operatorname*{colim}_{x \in X_n} |\Delta^n|$$
and thus associate a topological space with a given simplicial set. Conversely, given a topological space $Y$ we can construct an associated simplicial set $S(Y)$, called the singular set of $Y$, by defining
$$S(Y) : [n] \mapsto \hom_{\mathbf{Top}}(|\Delta^n|, Y).$$
It is a standard result of classical homotopy theory that the realization functor and singular set functor form an adjunction, and provide the standard means of translating between topological spaces and simplicial sets. Our goal will be to adapt these powerful classical results to the case of finite metric spaces.
We draw signicant inspiration from Spivak, specically [52], where
he extends the classical theory of singular sets and topological realization
to fuzzy singular sets and metric realization. To develop this theory here
we will rst outline a categorical presentation of fuzzy sets, due to [3], that
will make extending classical simplicial sets to fuzzy simplicial sets most
natural.
Classically a fuzzy set [65] is dened in terms of a carrier set Aand a
map µ:A[0,1] called the membership function. One is to interpret the
value µ(x)for xAto be the membership strength of xto the set A. us
membership of a set is no longer a bi-valent true or false property as in
classical set theory, but a fuzzy property taking values in the unit interval.
We wish to formalize this in terms of category theory.
Let $I$ be the unit interval $(0, 1] \subseteq \mathbb{R}$ with topology given by intervals of the form $[0, a)$ for $a \in (0, 1]$. The category of open sets (with morphisms given by inclusions) can be imbued with a Grothendieck topology in the natural way for any poset category.

Definition 3. A presheaf $\mathcal{P}$ on $I$ is a functor from $I^{op}$ to $\mathbf{Sets}$. A fuzzy set is a presheaf on $I$ such that all maps $\mathcal{P}(a \leq b)$ are injections.

Presheaves on $I$ form a category with morphisms given by natural transformations. We can thus form a category of fuzzy sets by simply restricting to the sub-category of presheaves that are fuzzy sets. We note that such presheaves are trivially sheaves under the Grothendieck topology on $I$. As one might expect, limits (including products) of such sheaves are well defined, but care must be taken to define colimits (and coproducts) of sheaves. To link to the classical approach to fuzzy sets one can think of a section $\mathcal{P}([0, a))$ as the set of all elements with membership strength at least $a$. We can now define the category of fuzzy sets.

Definition 4. The category $\mathbf{Fuzz}$ of fuzzy sets is the full subcategory of sheaves on $I$ spanned by fuzzy sets.
With this categorical presentation in hand, defining fuzzy simplicial sets is simply a matter of considering presheaves of $\Delta$ valued in the category of fuzzy sets rather than the category of sets.

Definition 5. The category of fuzzy simplicial sets $\mathbf{sFuzz}$ is the category with objects given by functors from $\Delta^{op}$ to $\mathbf{Fuzz}$, and morphisms given by natural transformations.

Alternatively, a fuzzy simplicial set can be viewed as a sheaf over $\Delta \times I$, where $\Delta$ is given the trivial topology and $\Delta \times I$ has the product topology. We will use $\Delta^n_{<a}$ to denote the sheaf given by the representable functor of the object $([n], [0, a))$. The importance of this fuzzy (sheafified) version of simplicial sets is their relationship to metric spaces. We begin by considering the larger category of extended-pseudo-metric spaces.
Denition 6. An extended-pseudo-metric space (X, d)is a set Xand a
map d:X×XR0∪ {∞} such that
1. d(x, y)>0, and x=yimplies d(x, y)=0;
2. d(x, y) = d(y, x); and
3. d(x, z)6d(x, y) + d(y, z)or d(x, z) = .
e category of extended-pseudo-metric spaces EPMet has as objects extended-
pseudo-metric spaces and non-expansive maps as morphisms. We denote the
subcategory of nite extended-pseudo-metric spaces FinEPMet.
e choice of non-expansive maps in Denition 6 is due to Spivak, but
we note that it closely mirrors the work of Carlsson and Memoli in [13] on
topological methods for clustering as applied to nite metric spaces. is
choice is signicant since pure isometries are too strict and do not provide
large enough Hom-sets.
In [52] Spivak constructs a pair of adjoint functors, $\mathrm{Real}$ and $\mathrm{Sing}$, between the categories $\mathbf{sFuzz}$ and $\mathbf{EPMet}$. These functors are the natural extension of the classical realization and singular set functors from algebraic topology. The functor $\mathrm{Real}$ is defined in terms of standard fuzzy simplices $\Delta^n_{<a}$ as
$$\mathrm{Real}(\Delta^n_{<a}) \triangleq \left\{ (t_0, \ldots, t_n) \in \mathbb{R}^{n+1} \,\middle|\, \sum_{i=0}^{n} t_i = -\log(a),\; t_i \geq 0 \right\}$$
similarly to the classical realization functor $|\cdot|$. The metric on $\mathrm{Real}(\Delta^n_{<a})$ is simply inherited from $\mathbb{R}^{n+1}$. A morphism $\Delta^n_{<a} \to \Delta^m_{<b}$ exists only if $a \leq b$, and is determined by a morphism $\sigma : [n] \to [m]$. The action of $\mathrm{Real}$ on such a morphism is given by the map
$$(x_0, x_1, \ldots, x_n) \mapsto \frac{\log(b)}{\log(a)} \left( \sum_{i_0 \in \sigma^{-1}(0)} x_{i_0},\; \sum_{i_0 \in \sigma^{-1}(1)} x_{i_0},\; \ldots,\; \sum_{i_0 \in \sigma^{-1}(m)} x_{i_0} \right).$$
Such a map is clearly non-expansive since $0 \leq a \leq b \leq 1$ implies that $\log(b)/\log(a) \leq 1$.
We then extend this to a general simplicial set $X$ via colimits, defining
$$\mathrm{Real}(X) \triangleq \operatorname*{colim}_{\Delta^n_{<a} \to X} \mathrm{Real}(\Delta^n_{<a}).$$
Since the functor $\mathrm{Real}$ preserves colimits, it follows that there exists a right adjoint functor. Again, analogously to the classical case, we find the right adjoint, denoted $\mathrm{Sing}$, is defined for an extended pseudo metric space $Y$ in terms of its action on the category $\Delta \times I$:
$$\mathrm{Sing}(Y) : ([n], [0, a)) \mapsto \hom_{\mathbf{EPMet}}(\mathrm{Real}(\Delta^n_{<a}), Y).$$
For our case we are only interested in finite metric spaces. To correspond with this we consider the subcategory of bounded fuzzy simplicial sets $\mathbf{Fin\text{-}sFuzz}$. We therefore use the analogous adjoint pair $\mathrm{FinReal}$ and $\mathrm{FinSing}$. Formally we define the finite fuzzy realization functor as follows:

Definition 7. Define the functor $\mathrm{FinReal} : \mathbf{Fin\text{-}sFuzz} \to \mathbf{FinEPMet}$ by setting
$$\mathrm{FinReal}(\Delta^n_{<a}) \triangleq (\{x_1, x_2, \ldots, x_n\}, d_a),$$
where
$$d_a(x_i, x_j) = \begin{cases} -\log(a) & \text{if } i \neq j, \\ 0 & \text{otherwise}, \end{cases}$$
and then defining
$$\mathrm{FinReal}(X) \triangleq \operatorname*{colim}_{\Delta^n_{<a} \to X} \mathrm{FinReal}(\Delta^n_{<a}).$$

Similar to Spivak's construction, the action of $\mathrm{FinReal}$ on a map $\Delta^n_{<a} \to \Delta^m_{<b}$, where $a \leq b$, defined by $\sigma : \Delta^n \to \Delta^m$, is given by
$$(\{x_1, x_2, \ldots, x_n\}, d_a) \mapsto (\{x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(n)}\}, d_b),$$
which is a non-expansive map since $a \leq b$ implies $d_a \geq d_b$.

Since $\mathrm{FinReal}$ preserves colimits it admits a right adjoint, the fuzzy singular set functor $\mathrm{FinSing}$. We can then define the (finite) fuzzy singular set functor in terms of the action of its image on $\Delta \times I$, analogously to $\mathrm{Sing}$.

Definition 8. Define the functor $\mathrm{FinSing} : \mathbf{FinEPMet} \to \mathbf{Fin\text{-}sFuzz}$ by
$$\mathrm{FinSing}(Y) : ([n], [0, a)) \mapsto \hom_{\mathbf{FinEPMet}}(\mathrm{FinReal}(\Delta^n_{<a}), Y).$$
We then have the following theorem.

Theorem 1. The functors $\mathrm{FinReal} : \mathbf{Fin\text{-}sFuzz} \to \mathbf{FinEPMet}$ and $\mathrm{FinSing} : \mathbf{FinEPMet} \to \mathbf{Fin\text{-}sFuzz}$ form an adjunction with $\mathrm{FinReal}$ the left adjoint and $\mathrm{FinSing}$ the right adjoint.

The proof of this is by construction. Appendix B provides a full proof of the theorem.
With the necessary theoretical background in place, the means to handle the family of incompatible metric spaces described above becomes clear. Each metric space in the family can be translated into a fuzzy simplicial set via the fuzzy singular set functor, distilling the topological information while still retaining metric information in the fuzzy structure. Ironing out the incompatibilities of the resulting family of fuzzy simplicial sets can be done by simply taking a (fuzzy) union across the entire family. The result is a single fuzzy simplicial set which captures the relevant topological and underlying metric structure of the manifold $\mathcal{M}$.

It should be noted, however, that the fuzzy singular set functor applies to extended-pseudo-metric spaces, which are a relaxation of traditional metric spaces. The results of Lemma 1 only provide accurate approximations of geodesic distance local to $X_i$ for distances measured from $X_i$ – the geodesic distances between other pairs of points within the neighborhood of $X_i$ are not well defined. In deference to this lack of information we define distances between $X_j$ and $X_k$ in the extended-pseudo-metric space local to $X_i$ (where $i \neq j$ and $i \neq k$) to be infinite (local neighborhoods of $X_j$ and $X_k$ will provide suitable approximations).

For real data it is safe to assume that the manifold $\mathcal{M}$ is locally connected. In practice this can be realized by measuring distance in the extended-pseudo-metric space local to $X_i$ as geodesic distance beyond the nearest neighbor of $X_i$. Since this sets the distance to the nearest neighbor to be equal to 0, this is only possible in the more relaxed setting of extended-pseudo-metric spaces. It ensures, however, that each 0-simplex is the face of some 1-simplex with fuzzy membership strength 1, meaning that the resulting topological structure derived from the manifold is locally connected. We note that this has a similar practical effect to the truncated similarity approach of Lee and Verleysen [33], but derives naturally from the assumption of local connectivity of the manifold.
Combining all of the above we can define the fuzzy topological representation of a dataset.

Definition 9. Let $X = \{X_1, \ldots, X_N\}$ be a dataset in $\mathbb{R}^n$. Let $\{(X, d_i)\}_{i=1,\ldots,N}$ be a family of extended-pseudo-metric spaces with common carrier set $X$ such that
$$d_i(X_j, X_k) = \begin{cases} d_{\mathcal{M}}(X_j, X_k) - \rho & \text{if } i = j \text{ or } i = k, \\ \infty & \text{otherwise}, \end{cases}$$
where $\rho$ is the distance to the nearest neighbor of $X_i$ and $d_{\mathcal{M}}$ is geodesic distance on the manifold $\mathcal{M}$, either known a priori, or approximated as per Lemma 1.

The fuzzy topological representation of $X$ is
$$\bigcup_{i=1}^{N} \mathrm{FinSing}((X, d_i)).$$

The (fuzzy set) union provides the means to merge together the different metric spaces. This provides a single fuzzy simplicial set as the global representation of the manifold formed by patching together the many local representations.
Given the ability to construct such topological structures, either from a known manifold, or by learning the metric structure of the manifold, we can perform dimension reduction by simply finding low dimensional representations that closely match the topological structure of the source data. We now consider the task of finding such a low dimensional representation.
2.3 Optimizing a low dimensional representation

Let $Y = \{Y_1, \ldots, Y_N\} \subseteq \mathbb{R}^d$ be a low dimensional ($d \ll n$) representation of $X$ such that $Y_i$ represents the source data point $X_i$. In contrast to the source data, where we want to estimate a manifold on which the data is uniformly distributed, a target manifold for $Y$ is chosen a priori (usually this will simply be $\mathbb{R}^d$ itself, but other choices such as $d$-spheres or $d$-tori are certainly possible). Therefore we know the manifold and manifold metric a priori, and can compute the fuzzy topological representation directly. Of note, we still want to incorporate the distance to the nearest neighbor as per the local connectedness requirement. This can be achieved by supplying a parameter that defines the expected distance between nearest neighbors in the embedded space.
Given fuzzy simplicial set representations of $X$ and $Y$, a means of comparison is required. If we consider only the 1-skeleton of the fuzzy simplicial sets we can describe each as a fuzzy graph, or, more specifically, a fuzzy set of edges. To compare two fuzzy sets we will make use of fuzzy set cross entropy. For these purposes we will revert to classical fuzzy set notation. That is, a fuzzy set is given by a reference set $A$ and a membership strength function $\mu : A \to [0, 1]$. Comparable fuzzy sets have the same reference set. Given a sheaf representation $\mathcal{P}$ we can translate to classical fuzzy sets by setting $A = \bigcup_{a \in (0,1]} \mathcal{P}([0, a))$ and $\mu(x) = \sup\{a \in (0, 1] \mid x \in \mathcal{P}([0, a))\}$.

Definition 10. The cross entropy $C$ of two fuzzy sets $(A, \mu)$ and $(A, \nu)$ is defined as
$$C((A, \mu), (A, \nu)) \triangleq \sum_{a \in A} \left( \mu(a) \log\left(\frac{\mu(a)}{\nu(a)}\right) + (1 - \mu(a)) \log\left(\frac{1 - \mu(a)}{1 - \nu(a)}\right) \right).$$
Similar to t-SNE we can optimize the embedding $Y$ with respect to fuzzy set cross entropy $C$ by using stochastic gradient descent. However, this requires a differentiable fuzzy singular set functor. If the expected minimum distance between points is zero, the fuzzy singular set functor is differentiable for these purposes; however, for any non-zero value we need to make a differentiable approximation (chosen from a suitable family of differentiable functions).
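As a concrete illustration (our own sketch, not drawn from the reference implementation), the cross entropy of Definition 10 over a shared reference set of 1-simplices can be computed directly from arrays of membership strengths; a small clamp guards against $\log(0)$:

import numpy as np

def fuzzy_set_cross_entropy(mu, nu, eps=1e-12):
    # Cross entropy C((A, mu), (A, nu)) of two fuzzy sets sharing a
    # reference set A, given as arrays of membership strengths.
    mu = np.clip(np.asarray(mu, dtype=float), eps, 1.0 - eps)
    nu = np.clip(np.asarray(nu, dtype=float), eps, 1.0 - eps)
    return np.sum(mu * np.log(mu / nu)
                  + (1.0 - mu) * np.log((1.0 - mu) / (1.0 - nu)))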
This completes the algorithm: by using manifold approximation and patching together local fuzzy simplicial set representations we construct a topological representation of the high dimensional data. We then optimize the layout of data in a low dimensional space to minimize the error between the two topological representations.

We note that in this case we restricted attention to comparisons of the 1-skeleton of the fuzzy simplicial sets. One can extend this to $\ell$-skeleta by defining a cost function $C_\ell$ as
$$C_\ell(X, Y) = \sum_{i=1}^{\ell} \lambda_i C(X_i, Y_i),$$
where $X_i$ denotes the fuzzy set of $i$-simplices of $X$ and the $\lambda_i$ are suitably chosen real valued weights. While such an approach will capture the overall topological structure more accurately, it comes at non-negligible computational cost due to the increasingly large numbers of higher dimensional simplices. For this reason current implementations restrict to the 1-skeleton at this time.
3 A Computational View of UMAP

To understand what computations the UMAP algorithm is actually making from a practical point of view, a less theoretical and more computational description may be helpful for the reader. This description of the algorithm lacks the motivation for a number of the choices made. For that motivation please see Section 2.

The theoretical description of the algorithm works in terms of fuzzy simplicial sets. Computationally this is only tractable for the one skeleton, which can ultimately be described as a weighted graph. This means that, from a practical computational perspective, UMAP can ultimately be described in terms of the construction of, and operations on, weighted graphs. In particular this situates UMAP in the class of $k$-neighbour based graph learning algorithms such as Laplacian Eigenmaps, Isomap and t-SNE.

As with other $k$-neighbour graph based algorithms, UMAP can be described in two phases. In the first phase a particular weighted $k$-neighbour graph is constructed. In the second phase a low dimensional layout of this graph is computed. The differences between all algorithms in this class amount to specific details in how the graph is constructed and how the layout is computed. The theoretical basis for UMAP as described in Section 2 provides novel approaches to both of these phases, and provides clear motivation for the choices involved.

Finally, since t-SNE is not usually described as a graph based algorithm, a direct comparison of UMAP with t-SNE, using the similarity/probability notation commonly used to express the equations of t-SNE, is given in Appendix C.
In Section 2 we made a few basic assumptions about our data. From these assumptions we made use of category theory to derive the UMAP algorithm. That said, all of these derivations assume the following axioms to be true:

1. There exists a manifold on which the data would be uniformly distributed.

2. The underlying manifold of interest is locally connected.

3. Preserving the topological structure of this manifold is the primary goal.

The topological theory of Section 2 is driven by these axioms, particularly the interest in modelling and preserving topological structure. In particular, Section 2.1 highlights the underlying motivation, in terms of topological theory, of representing a manifold as a $k$-neighbour graph.
As highlighted in Appendix C, any algorithm that attempts to use a mathematical structure akin to a $k$-neighbour graph to approximate a manifold must follow a similar basic structure.

Graph Construction

1. Construct a weighted $k$-neighbour graph.

2. Apply some transform to the edges, converting ambient distance to local distance.

3. Deal with the inherent asymmetry of the $k$-neighbour graph.

Graph Layout

1. Define an objective function that preserves desired characteristics of this $k$-neighbour graph.

2. Find a low dimensional representation which optimizes this objective function.

Many dimension reduction algorithms can be broken down into these steps because they are fundamental to a particular class of solutions. Choices for each step must either be made through task oriented experimentation or by selecting a set of believable axioms and building strong theoretical arguments from these. Our belief is that basing our decisions on a strong foundational theory will allow for a more extensible and generalizable algorithm in the long run.

We theoretically justify the choice of using a $k$-neighbour graph to represent a manifold in Section 2.1. The choices for our kernel transform and symmetrization function can be found in Section 2.2. Finally, the justifications underlying our choices for graph layout are outlined in Section 2.3.
3.1 Graph Construction

The first phase of UMAP can be thought of as the construction of a weighted $k$-neighbour graph. Let $X = \{x_1, \ldots, x_N\}$ be the input dataset, with a metric (or dissimilarity measure) $d : X \times X \to \mathbb{R}_{\geq 0}$. Given an input hyper-parameter $k$, for each $x_i$ we compute the set $\{x_{i_1}, \ldots, x_{i_k}\}$ of the $k$ nearest neighbors of $x_i$ under the metric $d$. This computation can be performed via any nearest neighbour or approximate nearest neighbour search algorithm. For the purposes of our UMAP implementation we prefer to use the nearest neighbor descent algorithm of [18].

For each $x_i$ we will define $\rho_i$ and $\sigma_i$. Let
$$\rho_i = \min\{d(x_i, x_{i_j}) \mid 1 \leq j \leq k,\; d(x_i, x_{i_j}) > 0\},$$
and set $\sigma_i$ to be the value such that
$$\sum_{j=1}^{k} \exp\left(\frac{-\max(0, d(x_i, x_{i_j}) - \rho_i)}{\sigma_i}\right) = \log_2(k).$$

The selection of $\rho_i$ derives from the local-connectivity constraint described in Section 2.2. In particular it ensures that $x_i$ connects to at least one other data point with an edge of weight 1; this is equivalent to the resulting fuzzy simplicial set being locally connected at $x_i$. In practical terms this significantly improves the representation on very high dimensional data where other algorithms such as t-SNE begin to suffer from the curse of dimensionality.

The selection of $\sigma_i$ corresponds to a (smoothed) normalisation factor, defining the Riemannian metric local to the point $x_i$ as described in Section 2.1.
We can now dene a weighted directed graph ¯
G= (V, E , w). e
vertices Vof ¯
Gare simply the set X. We can then form the set of directed
edges E={(xi, xij)|1jk, 1iN}, and dene the weight
function wby seing
w((xi, xij)) = exp max(0, d(xi, xij)ρi)
σi.
For a given point xithere exists an induced graph of xiand outgoing edges
incident on xi. is graph is the 1-skeleton of the fuzzy simplicial set as-
sociated to the metric space local to xiwhere the local metric is dened
in terms of ρiand σi. e weight associated to the edge is the member-
ship strength of the corresponding 1-simplex within the fuzzy simplicial
set, and is derived from the adjunction of eorem 1 using the right adjoint
(nearest inverse) of the geometric realization of a fuzzy simplicial set. In-
tuitively one can think of the weight of an edge as akin to the probability
that the given edge exists. Section 2 demonstrates why this construction
faithfully captures the topology of the data. Given this set of local graphs
(represented here as a single directed graph) we now require a method to
combine them into a unied topological representation. We note that while
patching together incompatible nite metric spaces is challenging, by using
eorem 1 to convert to a fuzzy simplicial set representation, the combin-
ing operation becomes natural.
Let Abe the weighted adjacency matrix of ¯
G, and consider the sym-
metric matrix
B=A+A>AA>,
15
where $\circ$ is the Hadamard (or pointwise) product. This formula derives from the use of the probabilistic t-conorm used in unioning the fuzzy simplicial sets. If one interprets the value of $A_{ij}$ as the probability that the directed edge from $x_i$ to $x_j$ exists, then $B_{ij}$ is the probability that at least one of the two directed edges (from $x_i$ to $x_j$ and from $x_j$ to $x_i$) exists. The UMAP graph $G$ is then an undirected weighted graph whose adjacency matrix is given by $B$. Section 2 explains this construction in topological terms, providing the justification for why this construction provides an appropriate fuzzy topological representation of the data – that is, this construction captures the underlying geometric structure of the data in a faithful way.
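The entire construction fits in a few lines of (unoptimized) code. The sketch below is our own illustration: it uses a brute-force neighbor search in place of nearest neighbor descent, and a simplified binary search for $\sigma_i$ relative to the reference implementation:

import numpy as np
from scipy.spatial.distance import cdist

def fuzzy_graph(X, k):
    # Construct the symmetric UMAP graph B from data X of shape (N, dim).
    D = cdist(X, X)                      # brute force; NN-descent in practice
    knn = np.argsort(D, axis=1)[:, 1:k + 1]
    N = X.shape[0]
    A = np.zeros((N, N))
    for i in range(N):
        d = D[i, knn[i]]
        rho = d[d > 0].min()             # distance to nearest distinct neighbor
        lo, hi = 1e-3, 1e3               # binary search for sigma_i
        for _ in range(64):
            sigma = (lo + hi) / 2.0
            if np.exp(-np.maximum(d - rho, 0) / sigma).sum() > np.log2(k):
                hi = sigma
            else:
                lo = sigma
        A[i, knn[i]] = np.exp(-np.maximum(d - rho, 0) / sigma)
    return A + A.T - A * A.T             # fuzzy union via the probabilistic t-conorm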
3.2 Graph Layout

In practice UMAP uses a force directed graph layout algorithm in low dimensional space. A force directed graph layout utilizes a set of attractive forces applied along edges and a set of repulsive forces applied among vertices. Any force directed layout algorithm requires a description of both the attractive and repulsive forces. The algorithm proceeds by iteratively applying attractive and repulsive forces at each edge or vertex. This amounts to a non-convex optimization problem. Convergence to a local minimum is guaranteed by slowly decreasing the attractive and repulsive forces, in a similar fashion to that used in simulated annealing.

In UMAP the attractive force between two vertices $i$ and $j$ at coordinates $y_i$ and $y_j$ respectively, is determined by:
$$\frac{-2ab\|y_i - y_j\|_2^{2(b-1)}}{1 + \|y_i - y_j\|_2^2}\, w((x_i, x_j))\, (y_i - y_j)$$
where $a$ and $b$ are hyper-parameters.

Repulsive forces are computed via sampling due to computational constraints. Thus, whenever an attractive force is applied to an edge, one of that edge's vertices is repulsed by a sampling of other vertices. The repulsive force is given by
$$\frac{2b}{(\epsilon + \|y_i - y_j\|_2^2)(1 + a\|y_i - y_j\|_2^{2b})}\, (1 - w((x_i, x_j)))\, (y_i - y_j).$$
$\epsilon$ is a small number to prevent division by zero (0.001 in the current implementation).
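Read as gradient update rules, the two forces translate directly into code. The sketch below (ours; the reference implementation additionally clips step sizes and handles degenerate distances) performs one stochastic update for a sampled edge $(i, j)$. The factor $w((x_i, x_j))$ is accounted for by sampling edges with probability $w$, and negative samples are assumed to have membership strength 0, so the $(1 - w)$ factor becomes 1:

import numpy as np

def sgd_step(y, i, j, a, b, alpha, n_neg_samples, eps=0.001):
    # Attractive update: move y[i] toward y[j] along the sampled edge (i, j).
    d2 = max(np.sum((y[i] - y[j]) ** 2), 1e-8)
    coef = -2.0 * a * b * d2 ** (b - 1.0) / (1.0 + d2)
    y[i] += alpha * coef * (y[i] - y[j])
    # Repulsive updates: push y[i] away from randomly sampled vertices.
    for _ in range(n_neg_samples):
        c = np.random.randint(len(y))
        d2 = np.sum((y[i] - y[c]) ** 2)
        coef = 2.0 * b / ((eps + d2) * (1.0 + a * d2 ** b))
        y[i] += alpha * coef * (y[i] - y[c])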
The algorithm can be initialized randomly, but in practice, since the symmetric Laplacian of the graph $G$ is a discrete approximation of the Laplace-Beltrami operator of the manifold, we can use a spectral layout to initialize the embedding. This provides both faster convergence and greater stability within the algorithm.

The forces described above are derived from gradients optimising the edge-wise cross-entropy between the weighted graph $G$ and an equivalent weighted graph $H$ constructed from the points $\{y_i\}_{i=1,\ldots,N}$. That is, we are seeking to position points $y_i$ such that the weighted graph induced by those points most closely approximates the graph $G$, where we measure the difference between weighted graphs by the total cross entropy over all the edge existence probabilities. Since the weighted graph $G$ captures the topology of the source data, the equivalent weighted graph $H$ constructed from the points $\{y_i\}_{i=1,\ldots,N}$ matches the topology as closely as the optimization allows, and thus provides a good low dimensional representation of the overall topology of the data.
4 Implementation and Hyper-parameters

Having completed a theoretical description of the approach, we now turn our attention to the practical realization of this theory. We begin by providing a more detailed description of the algorithm as implemented, and then discuss a few implementation specific details. We conclude this section with a discussion of the hyper-parameters for the algorithm and their practical effects.
4.1 Algorithm description

In overview the UMAP algorithm is relatively straightforward (see Algorithm 1). When performing a fuzzy union over local fuzzy simplicial sets we have found it most effective to work with the probabilistic t-conorm (as one would expect if treating membership strengths as a probability that the simplex exists). The individual functions for constructing the local fuzzy simplicial sets, determining the spectral embedding, and optimizing the embedding with regard to fuzzy set cross entropy, are described in more detail below.

The inputs to Algorithm 1 are: X, the dataset to have its dimension reduced; n, the neighborhood size to use for local metric approximation; d, the dimension of the target reduced space; min-dist, an algorithmic parameter controlling the layout; and n-epochs, controlling the amount of optimization work to perform.

Algorithm 1 UMAP algorithm
function UMAP(X, n, d, min-dist, n-epochs)
    # Construct the relevant weighted graph
    for all x ∈ X do
        fs-set[x] ← LocalFuzzySimplicialSet(X, x, n)
    top-rep ← ⋃_{x ∈ X} fs-set[x]    # We recommend the probabilistic t-conorm
    # Perform optimization of the graph layout
    Y ← SpectralEmbedding(top-rep, d)
    Y ← OptimizeEmbedding(top-rep, Y, min-dist, n-epochs)
    return Y
Algorithm 2 describes the construction of local fuzzy simplicial sets. To represent fuzzy simplicial sets we work with the fuzzy set images of $[0]$ and $[1]$ (i.e. the 1-skeleton), which we denote as fs-set_0 and fs-set_1. One can work with higher order simplices as well, but the current implementation does not. We can construct the fuzzy simplicial set local to a given point $x$ by finding the $n$ nearest neighbors, generating the appropriate normalised distance on the manifold, and then converting the finite metric space to a simplicial set via the functor FinSing, which translates into exponential of the negative distance in this case.

Rather than directly using the distance to the $n$-th nearest neighbor as the normalization, we use a smoothed version of knn-distance that fixes the cardinality of the fuzzy set of 1-simplices to a fixed value. We selected $\log_2(n)$ for this purpose based on empirical experiments. This is described briefly in Algorithm 3.

Spectral embedding is performed by considering the 1-skeleton of the global fuzzy topological representation as a weighted graph and using standard spectral methods on the symmetric normalized Laplacian. This process is described in Algorithm 4.
e nal major component of UMAP is the optimization of the em-
bedding through minimization of the fuzzy set cross entropy. Recall that
18
Algorithm 2 Constructing a local fuzzy simplicial set
function LocalFuzzySimplicialSet(X, x, n)
    knn, knn-dists ← ApproxNearestNeighbors(X, x, n)
    ρ ← knn-dists[1]    # Distance to nearest neighbor
    σ ← SmoothKNNDist(knn-dists, n, ρ)    # Smooth approximator to knn-distance
    fs-set_0 ← X
    fs-set_1 ← {([x, y], 0) | y ∈ X}
    for all y ∈ knn do
        d_{x,y} ← max(0, dist(x, y) − ρ)/σ
        fs-set_1 ← fs-set_1 ∪ ([x, y], exp(−d_{x,y}))
    return fs-set

Algorithm 3 Compute the normalizing factor for distances σ
function SmoothKNNDist(knn-dists, n, ρ)
    Binary search for σ such that Σ_{i=1}^{n} exp(−(knn-dists_i − ρ)/σ) = log_2(n)
    return σ

Algorithm 4 Spectral embedding for initialization
function SpectralEmbedding(top-rep, d)
    A ← 1-skeleton of top-rep expressed as a weighted adjacency matrix
    D ← degree matrix for the graph A
    L ← D^{−1/2}(D − A)D^{−1/2}
    evec ← Eigenvectors of L (sorted)
    Y ← evec[1..d + 1]    # 0-base indexing assumed
    return Y
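In Python, Algorithm 4 amounts to a few calls to standard sparse linear algebra routines. The following is our own rendering using SciPy helpers, not a transcription of the reference implementation:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

def spectral_embedding(B, d):
    # Symmetric normalized Laplacian of the 1-skeleton adjacency matrix B.
    L = laplacian(csr_matrix(B), normed=True)
    # Bottom d + 1 eigenpairs; the eigenvector for eigenvalue 0 is trivial.
    vals, vecs = eigsh(L, k=d + 1, which="SM")
    order = np.argsort(vals)
    return vecs[:, order[1:d + 1]]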
The final major component of UMAP is the optimization of the embedding through minimization of the fuzzy set cross entropy. Recall that fuzzy set cross entropy, with respect to given membership functions $\mu$ and $\nu$, is given by
$$\begin{aligned}
C((A, \mu), (A, \nu)) &= \sum_{a \in A} \left( \mu(a) \log\left(\frac{\mu(a)}{\nu(a)}\right) + (1 - \mu(a)) \log\left(\frac{1 - \mu(a)}{1 - \nu(a)}\right) \right) \\
&= \sum_{a \in A} \Big( \mu(a) \log(\mu(a)) + (1 - \mu(a)) \log(1 - \mu(a)) \Big) \\
&\quad - \sum_{a \in A} \Big( \mu(a) \log(\nu(a)) + (1 - \mu(a)) \log(1 - \nu(a)) \Big).
\end{aligned} \tag{1}$$

The first sum depends only on $\mu$, which takes fixed values during the optimization; thus the minimization of cross entropy depends only on the second sum, and we seek to minimize
$$- \sum_{a \in A} \Big( \mu(a) \log(\nu(a)) + (1 - \mu(a)) \log(1 - \nu(a)) \Big).$$
Following both [54] and [41], we take a sampling based approach to the optimization. We sample 1-simplices with probability $\mu(a)$ and update according to the value of $\nu(a)$, which handles the term $\mu(a)\log(\nu(a))$. The term $(1 - \mu(a))\log(1 - \nu(a))$ requires negative sampling – rather than computing this over all potential simplices we randomly sample potential 1-simplices and assume them to be a negative example (i.e. with membership strength 0) and update according to the value of $1 - \nu(a)$. In contrast to [54] the above formulation provides a vertex sampling distribution of
$$P(x_i) = \frac{\sum_{\{a \in A \mid d_0(a) = x_i\}} 1 - \mu(a)}{\sum_{\{b \in A \mid d_0(b) \neq x_i\}} 1 - \mu(b)}$$
for negative samples, which can be reasonably approximated by a uniform distribution for sufficiently large data sets.
It therefore only remains to find a differentiable approximation to $\nu(a)$ for a given 1-simplex $a$ so that gradient descent can be applied for optimization. This is done as follows:

Definition 11. Define $\Phi : \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$, a smooth approximation of the membership strength of a 1-simplex between two points in $\mathbb{R}^d$, as
$$\Phi(x, y) = \left(1 + a\left(\|x - y\|_2^2\right)^b\right)^{-1},$$
where $a$ and $b$ are chosen by non-linear least squares fitting against the curve $\Psi : \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$ where
$$\Psi(x, y) = \begin{cases} 1 & \text{if } \|x - y\|_2 \leq \text{min-dist}, \\ \exp(-(\|x - y\|_2 - \text{min-dist})) & \text{otherwise}. \end{cases}$$
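Fitting $a$ and $b$ is a routine non-linear least squares problem. The sketch below is our own illustration (the reference implementation contains an equivalent helper): we sample $\Psi$ as a function of distance on a grid and fit the one-dimensional form of $\Phi$:

import numpy as np
from scipy.optimize import curve_fit

def fit_phi(min_dist, max_r=3.0):
    # Target curve: psi(r) = 1 for r <= min_dist, exp(-(r - min_dist)) otherwise.
    r = np.linspace(0.0, max_r, 300)
    psi = np.where(r <= min_dist, 1.0, np.exp(-(r - min_dist)))
    # phi(r) = (1 + a * r^(2b))^(-1); fit a and b by least squares.
    phi = lambda r, a, b: 1.0 / (1.0 + a * r ** (2.0 * b))
    (a, b), _ = curve_fit(phi, r, psi)
    return a, b

For min-dist = 0.1 this procedure yields values in the vicinity of $a \approx 1.6$ and $b \approx 0.9$.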
The optimization process is now executed by stochastic gradient descent as given by Algorithm 5.

Algorithm 5 Optimizing the embedding
function OptimizeEmbedding(top-rep, Y, min-dist, n-epochs)
    α ← 1.0
    Fit Φ from Ψ defined by min-dist
    for e ← 1, …, n-epochs do
        for all ([a, b], p) ∈ top-rep_1 do
            if Random() ≤ p then    # Sample simplex with probability p
                y_a ← y_a + α · ∇(log Φ)(y_a, y_b)
                for i ← 1, …, n-neg-samples do
                    c ← random sample from Y
                    y_a ← y_a + α · ∇(log(1 − Φ))(y_a, y_c)
        α ← 1.0 − e/n-epochs
    return Y

This completes the UMAP algorithm.
4.2 Implementation

Practical implementation of this algorithm requires (approximate) k-nearest-neighbor calculation and efficient optimization via stochastic gradient descent.

Efficient approximate k-nearest-neighbor computation can be achieved via the Nearest-Neighbor-Descent algorithm of [18]. The error intrinsic in a dimension reduction technique means that such approximation is more than adequate for these purposes. While no theoretical complexity bounds have been established for Nearest-Neighbor-Descent, the authors of the original paper report an empirical complexity of $O(N^{1.14})$. A further benefit of Nearest-Neighbor-Descent is its generality; it works with any valid dissimilarity measure, and is efficient even for high dimensional data.

In optimizing the embedding under the provided objective function, we follow the work of [54], making use of probabilistic edge sampling and negative sampling [41]. This provides a very efficient approximate stochastic gradient descent algorithm since there is no normalization requirement. Furthermore, since the normalized Laplacian of the fuzzy graph representation of the input data is a discrete approximation of the Laplace-Beltrami operator of the manifold (see [6, 7]), we can provide a suitable initialization for stochastic gradient descent by using the eigenvectors of the normalized Laplacian. The amount of optimization work required will scale with the number of edges in the fuzzy graph (assuming a fixed negative sampling rate), resulting in a complexity of $O(kN)$.

Combining these techniques results in highly efficient embeddings, which we will discuss in Section 5. The overall complexity is bounded by the approximate nearest neighbor search complexity and, as mentioned above, is empirically approximately $O(N^{1.14})$. A reference implementation can be found at https://github.com/lmcinnes/umap, and an R implementation can be found at https://github.com/jlmelville/uwot.
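For readers who wish to experiment, the reference implementation follows the scikit-learn estimator API. A minimal usage sketch (parameter names as in the umap-learn package at the time of writing; n_neighbors and n_components correspond to n and d of Algorithm 1):

import numpy as np
import umap

X = np.random.rand(1000, 50)            # any (n_samples, n_features) array
embedding = umap.UMAP(
    n_neighbors=15,                     # n, the neighborhood size
    n_components=2,                     # d, the target dimension
    min_dist=0.1,
    metric="euclidean",
).fit_transform(X)
print(embedding.shape)                  # (1000, 2)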
For simplicity these experiments were carried out on a single core version of our algorithm. It should be noted that at the time of this publication both Nearest-Neighbour-Descent and SGD have been parallelized, and thus the Python reference implementation can be significantly accelerated. Our intention in this paper was to introduce the underlying theory behind our UMAP algorithm, and we felt that parallel vs single core discussions would distract from our intent.
4.3 Hyper-parameters

As described in Algorithm 1, the UMAP algorithm takes four hyper-parameters:

1. n, the number of neighbors to consider when approximating the local metric;

2. d, the target embedding dimension;

3. min-dist, the desired separation between close points in the embedding space; and

4. n-epochs, the number of training epochs to use when optimizing the low dimensional representation.

The effects of the parameters d and n-epochs are largely self-evident, and will not be discussed in further detail here. In contrast the effects of the number of neighbors n and of min-dist are less clear.
One can interpret the number of neighbors n as the local scale at which to approximate the manifold as roughly flat, with the manifold estimation averaging over the n neighbors. Manifold features that occur at a smaller scale than within the n nearest-neighbors of points will be lost, while large scale manifold features that cannot be seen by patching together locally flat charts at the scale of n nearest-neighbors may not be well detected. Thus n represents some degree of trade-off between fine grained and large scale manifold features — smaller values will ensure detailed manifold structure is accurately captured (at a loss of the "big picture" view of the manifold), while larger values will capture large scale manifold structures, but at a loss of fine detail structure which will get averaged out in the local approximations. With smaller n values the manifold tends to be broken into many small connected components (care needs to be taken with the spectral embedding for initialization in such cases).
In contrast min-dist is a hyperparameter directly affecting the output, as it controls the fuzzy simplicial set construction from the low dimensional representation. It acts in lieu of the distance to the nearest neighbor used to ensure local connectivity. In essence this determines how closely points can be packed together in the low dimensional representation. Low values of min-dist will result in potentially densely packed regions, but will likely more faithfully represent the manifold structure. Increasing the value of min-dist will force the embedding to spread points out more, assisting visualization (and avoiding potential overplotting issues). We view min-dist as an essentially aesthetic parameter, governing the appearance of the embedding, and thus it is more important when using UMAP for visualization.
In Figure 1 we provide examples of the effects of varying the hyper-parameters for a toy dataset. The data is uniform random samples from a 3-dimensional color-cube, allowing for easy visualization of the original 3-dimensional coordinates in the embedding space by using the corresponding RGB colour. Since the data fills a 3-dimensional cube there is no local manifold structure, and hence for such data we expect larger n values to be more useful. Low values will interpret the noise from random sampling as fine scale manifold structure, producing potentially spurious structure¹.

¹See the discussion of the constellation effect in Section 6.

Figure 1: Variation of UMAP hyperparameters n and min-dist result in different embeddings. The data is uniform random samples from a 3-dimensional color-cube, allowing for easy visualization of the original 3-dimensional coordinates in the embedding space by using the corresponding RGB colour. Low values of n spuriously interpret structure from the random sampling noise – see Section 6 for further discussion of this phenomenon.
In Figure 2 we provide examples of the same hyperparameter choices as Figure 1, but for the PenDigits dataset². In this case we expect small to medium n values to be most effective, since there is significant cluster structure naturally present in the data. The min-dist parameter expands out tightly clustered groups, allowing more of the internal structure of densely packed clusters to be seen.

²See Section 5 for a description of the PenDigits dataset.

Finally, in Figure 3 we provide an equivalent example of hyperparameter choices for the MNIST dataset³. Again, since this dataset is expected to have significant cluster structure, we expect medium sized values of n to be most effective. We note that large values of min-dist result in the distinct clusters being compressed together, making the distinctions between the clusters less clear.

³See Section 5 for details on the MNIST dataset.
5 Practical Ecacy
While the strong mathematical foundations of UMAP were the motivation
for its development, the algorithm must ultimately be judged by its prac-
tical ecacy. In this section we examine the delity and performance of
low dimensional embeddings of multiple diverse real world data sets under
UMAP. e following datasets were considered:
Pen digits [1, 10] is a set of 1797 grayscale images of digits entered using
a digitiser tablet. Each image is an 8x8 image which we treat as a single 64
dimensional vector, assumed to be in Euclidean vector space.
COIL 20 [43] is a set of 1440 greyscale images consisting of 20 objects un-
der 72 dierent rotations spanning 360 degrees. Each image is a 128x128
image which we treat as a single 16384 dimensional vector for the purposes
of computing distance between images.
COIL 100 [44] is a set of 7200 colour images consisting of 100 objects un-
der 72 dierent rotations spanning 360 degrees. Each image consists of 3
128x128 intensity matrices (one for each color channel). We treat this as
a single 49152 dimensional vector for the purposes of computing distance
between images.
Mouse scRNA-seq [11] is proled gene expression data for 20,921 cells
from an adult mouse. Each sample consists of a vector of 26,774 measure-
ments.
Statlog (Shuttle) [35] is a NASA dataset consisting of various data associ-
ated to the positions of radiators in the space shule, including a timestamp.
2See Section 5 for a description of the PenDigits dataset
3See section 5 for details on the MNIST dataset
25
Figure 2: Variation of UMAP hyperparameters n and min-dist result in different embeddings. The data is the PenDigits dataset, where each point is an 8x8 grayscale image of a hand-written digit.

Figure 3: Variation of UMAP hyperparameters n and min-dist result in different embeddings. The data is the MNIST dataset, where each point is a 28x28 grayscale image of a hand-written digit.
MNIST [32] is a dataset of 28x28 pixel grayscale images of handwritten digits. There are 10 digit classes (0 through 9) and 70000 total images. This is treated as 70000 different 784 dimensional vectors.

F-MNIST [63] or Fashion MNIST is a dataset of 28x28 pixel grayscale images of fashion items (clothing, footwear and bags). There are 10 classes and 70000 total images. As with MNIST this is treated as 70000 different 784 dimensional vectors.

Flow cytometry [51, 9] is a dataset of flow cytometry measurements of CDT4 cells comprised of 1,000,000 samples, each with 17 measurements.

GoogleNews word vectors [41] is a dataset of 3 million words and phrases derived from a sample of Google News documents and embedded into a 300 dimensional space via word2vec.

For all the datasets except GoogleNews we use Euclidean distance between vectors. For GoogleNews, as per [41], we use cosine distance (or angular distance in t-SNE, which does not support non-metric distances, in contrast to UMAP).
5.1 Qualitative Comparison of Multiple Algorithms

We compare a number of algorithms – UMAP, t-SNE [60, 58], LargeVis [54], Laplacian Eigenmaps [7], and Principal Component Analysis [27] – on the COIL20 [43], MNIST [32], Fashion-MNIST [63], and GoogleNews [41] datasets. The Isomap algorithm was also tested, but failed to complete in any reasonable time for any of the datasets larger than COIL20.

The Multicore t-SNE package [57] was used for t-SNE. The reference implementation [53] was used for LargeVis. The scikit-learn [10] implementations were used for Laplacian Eigenmaps and PCA. Where possible we attempted to tune parameters for each algorithm to give good embeddings.

Historically t-SNE and LargeVis have offered a dramatic improvement in finding and preserving local structure in the data. This can be seen qualitatively by comparing their embeddings to those generated by Laplacian Eigenmaps and PCA in Figure 4. We claim that the quality of embeddings produced by UMAP is comparable to t-SNE when reducing to two or three dimensions. For example, Figure 4 shows both UMAP and t-SNE embeddings of the COIL20, MNIST, Fashion MNIST, and Google News datasets. While the precise embeddings are different, UMAP distinguishes the same structures as t-SNE and LargeVis.
Figure 4: A comparison of several dimension reduction algorithms. We note that UMAP successfully reflects much of the large scale global structure that is well represented by Laplacian Eigenmaps and PCA (particularly for MNIST and Fashion-MNIST), while also preserving the local fine structure similar to t-SNE and LargeVis.

It can be argued that UMAP has captured more of the global and topological structure of the datasets than t-SNE [4, 62]. More of the loops in the COIL20 dataset are kept intact, including the intertwined loops. Similarly the global relationships among different digits in the MNIST digits dataset are more clearly captured with 1 (red) and 0 (dark red) at far corners of the embedding space, and 4, 7, 9 (yellow, sea-green, and violet) and 3, 5, 8 (orange, chartreuse, and blue) separated as distinct clumps of similar digits. In the Fashion MNIST dataset the distinction between clothing (dark red, yellow, orange, vermilion) and footwear (chartreuse, sea-green, and violet) is made more clear. Finally, while both t-SNE and UMAP capture groups of similar word vectors, the UMAP embedding arguably evidences a clearer global structure among the various word clusters.
5.2 Quantitative Comparison of Multiple Algorithms

We compare UMAP, t-SNE, LargeVis, Laplacian Eigenmaps and PCA embeddings with respect to the performance of a k-nearest neighbor classifier trained on the embedding space for a variety of datasets. The k-nearest neighbor classifier accuracy provides a clear quantitative measure of how well the embedding has preserved the important local structure of the dataset. By varying the hyper-parameter k used in the training we can also consider how structure preservation varies under transition from purely local, to non-local, to more global structure. The embeddings used for training the kNN classifier are for those datasets that come with defined training labels: PenDigits, COIL-20, Shuttle, MNIST, and Fashion-MNIST.

We divide the datasets into two classes: smaller datasets (PenDigits and COIL-20), for which a smaller range of k values makes sense, and larger datasets, for which much larger values of k are reasonable. For each of the small datasets a stratified 10-fold cross-validation was used to derive a set of 10 accuracy scores for each embedding. For the Shuttle dataset a 10-fold cross-validation was used due to constraints imposed by class sizes and the stratified sampling. For MNIST and Fashion-MNIST a 20-fold cross-validation was used, producing 20 accuracy scores.
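This evaluation protocol is straightforward to reproduce with standard tooling. The sketch below (ours, with an arbitrary random seed) produces the per-fold accuracy scores for one embedding and one value of k:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def knn_accuracy_scores(embedding, labels, k, n_folds=10):
    # One accuracy score per fold for a kNN classifier on the embedding.
    clf = KNeighborsClassifier(n_neighbors=k)
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    return cross_val_score(clf, embedding, labels, cv=cv)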
In Table 1 we present the average accuracy across the 10 folds for the PenDigits and COIL-20 datasets. UMAP performs at least as well as t-SNE and LargeVis (given the confidence bounds on the accuracy) for k in the range 10 to 40, but for larger k values of 80 and 160 UMAP has significantly higher accuracy on COIL-20, and shows evidence of higher accuracy on PenDigits. Figure 5 provides swarm plots of the accuracy results across the COIL-20 and PenDigits datasets.

In Table 2 we present the average cross validation accuracy for the Shuttle, MNIST and Fashion-MNIST datasets. UMAP performs at least as well as t-SNE and LargeVis (given the confidence bounds on the accuracy) for k in the range 100 to 400 on the Shuttle and MNIST datasets (but notably underperforms on the Fashion-MNIST dataset), but for larger k values of 800 and 3200 UMAP has significantly higher accuracy on the Shuttle dataset, and shows evidence of higher accuracy on MNIST. For k values of 1600 and 3200 UMAP establishes comparable performance on Fashion-MNIST. Figure 6 provides swarm plots of the accuracy results across the Shuttle, MNIST and Fashion-MNIST datasets.
k t-SNE UMAP LargeVis Eigenmaps PCA
COIL-20
10 0.934 (±0.115) 0.921 (±0.075) 0.888 (±0.092) 0.629 (±0.153) 0.667 (±0.179)
20 0.901 (±0.133) 0.907 (±0.064) 0.870 (±0.125) 0.605 (±0.185) 0.663 (±0.196)
40 0.857 (±0.125) 0.904 (±0.056) 0.833 (±0.106) 0.578 (±0.159) 0.620 (±0.230)
80 0.789 (±0.118) 0.899 (±0.058) 0.803 (±0.100) 0.565 (±0.119) 0.531 (±0.294)
160 0.609 (±0.067) 0.803 (±0.138) 0.616 (±0.066) 0.446 (±0.110) 0.375 (±0.111)
PenDigits
10 0.977 (±0.033) 0.973 (±0.044) 0.966 (±0.053) 0.778 (±0.113) 0.622 (±0.092)
20 0.973 (±0.033) 0.976 (±0.035) 0.973 (±0.044) 0.778 (±0.116) 0.633 (±0.082)
40 0.956 (±0.064) 0.954 (±0.060) 0.959 (±0.066) 0.778 (±0.112) 0.636 (±0.078)
80 0.948 (±0.060) 0.951 (±0.072) 0.949 (±0.072) 0.767 (±0.111) 0.643 (±0.085)
160 0.949 (±0.065) 0.951 (±0.085) 0.921 (±0.085) 0.747 (±0.108) 0.629 (±0.107)
Table 1: kNN Classifier accuracy for varying values of k over the embedding
spaces of the COIL-20 and PenDigits datasets. Average accuracy scores are given
over a 10-fold cross-validation for each of PCA, Laplacian Eigenmaps, LargeVis,
t-SNE and UMAP.
As evidenced by this comparison, UMAP provides largely comparable
performance in embedding quality to t-SNE and LargeVis at local scales, but
performs markedly better than t-SNE or LargeVis at non-local scales. This
bears out the visual qualitative assessment provided in Subsection 5.1.
5.3 Embedding Stability
Since UMAP makes use of both stochastic approximate nearest neighbor
search, and stochastic gradient descent with negative sampling for
optimization, the resulting embedding is necessarily different from run to
run, and under sub-sampling of the data. This is potentially a concern for a
k t-SNE UMAP LargeVis Eigenmaps PCA
Shule
100 0.994 (±0.002) 0.993 (±0.002) 0.992 (±0.003) 0.962 (±0.004) 0.833 (±0.013)
200 0.992 (±0.002) 0.990 (±0.002) 0.987 (±0.003) 0.957 (±0.006) 0.821 (±0.007)
400 0.990 (±0.002) 0.988 (±0.002) 0.976 (±0.003) 0.949 (±0.006) 0.815 (±0.007)
800 0.969 (±0.005) 0.988 (±0.002) 0.957 (±0.004) 0.942 (±0.006) 0.804 (±0.003)
1600 0.927 (±0.005) 0.981 (±0.002) 0.904 (±0.007) 0.918 (±0.006) 0.792 (±0.003)
3200 0.828 (±0.004) 0.957 (±0.005) 0.850 (±0.008) 0.895 (±0.006) 0.786 (±0.001)
MNIST
100 0.967 (±0.015) 0.967 (±0.014) 0.962 (±0.015) 0.668 (±0.016) 0.462 (±0.023)
200 0.966 (±0.015) 0.967 (±0.014) 0.962 (±0.015) 0.667 (±0.016) 0.467 (±0.023)
400 0.964 (±0.015) 0.967 (±0.014) 0.961 (±0.015) 0.664 (±0.016) 0.468 (±0.024)
800 0.963 (±0.016) 0.967 (±0.014) 0.961 (±0.015) 0.660 (±0.017) 0.468 (±0.023)
1600 0.959 (±0.016) 0.966 (±0.014) 0.947 (±0.015) 0.651 (±0.014) 0.467 (±0.0233)
3200 0.946 (±0.017) 0.964 (±0.014) 0.920 (±0.017) 0.639 (±0.017) 0.459 (±0.022)
Fashion-MNIST
100 0.818 (±0.012) 0.790 (±0.013) 0.808 (±0.014) 0.631 (±0.010) 0.564 (±0.018)
200 0.810 (±0.013) 0.785 (±0.014) 0.805 (±0.013) 0.624 (±0.013) 0.565 (±0.016)
400 0.801 (±0.013) 0.780 (±0.013) 0.796 (±0.013) 0.612 (±0.011) 0.564 (±0.017)
800 0.784 (±0.011) 0.767 (±0.014) 0.771 (±0.014) 0.600 (±0.012) 0.560 (±0.017)
1600 0.754 (±0.011) 0.747 (±0.013) 0.742 (±0.013) 0.580 (±0.014) 0.550 (±0.017)
3200 0.727 (±0.011) 0.730 (±0.011) 0.726 (±0.012) 0.542 (±0.014) 0.533 (±0.017)
Table 2: kNN Classifier accuracy for varying values of k over the embedding
spaces of the Shuttle, MNIST and Fashion-MNIST datasets. Average accuracy scores
are given over a 10-fold or 20-fold cross-validation for each of PCA, Laplacian
Eigenmaps, LargeVis, t-SNE and UMAP.
Figure 5: kNN Classifier accuracy for varying values of k over the embedding
spaces of the COIL-20 and PenDigits datasets. Accuracy scores are given for each
fold of a 10-fold cross-validation for each of PCA, Laplacian Eigenmaps, LargeVis,
t-SNE and UMAP. We note that UMAP produces accuracy scores competitive with
t-SNE and LargeVis in most cases, and outperforms both t-SNE and LargeVis for
larger k values on COIL-20.
variety of use cases, so establishing some measure of how stable UMAP
embeddings are, particularly under sub-sampling, is of interest. In this
subsection we compare the stability under subsampling of UMAP, LargeVis and
t-SNE (the three stochastic dimension reduction techniques considered).
To measure the stability of an embedding we make use of the normalized
Procrustes distance to measure the distance between two potentially
comparable distributions. Given two datasets $X = \{x_1, \ldots, x_N\}$ and
$Y = \{y_1, \ldots, y_N\}$ such that $x_i$ corresponds to $y_i$, we can define
the Procrustes distance between the datasets, $d_P(X, Y)$, in the following
manner. Determine $Y' = \{y'_1, \ldots, y'_N\}$, the optimal translation,
uniform scaling, and rotation of $Y$ that minimizes the squared error
$\sum_{i=1}^N (x_i - y'_i)^2$, and define
$$d_P(X, Y) = \sqrt{\sum_{i=1}^{N} (x_i - y'_i)^2}.$$
Since any measure that makes use of distances in the embedding space is
potentially sensitive to the extent or scale of the embedding, we normalize
the data before computing the Procrustes distance by dividing by the
average norm of the embedded dataset. In Figure 7 we visualize the results
of using Procrustes alignment of embeddings of sub-samples for both UMAP
and t-SNE, demonstrating how Procrustes distance can measure the stability
of the overall structure of the embedding.
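A minimal sketch of this measure, assuming NumPy and SciPy
(scipy.linalg.orthogonal_procrustes recovers the optimal rotation, from which
the optimal uniform scaling follows):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def normalized_procrustes_distance(X, Y):
    """d_P(X, Y) where row i of X corresponds to row i of Y.
    Both embeddings are centered and divided by their average point
    norm before alignment, as described above."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.mean(np.linalg.norm(X, axis=1))
    Y = Y / np.mean(np.linalg.norm(Y, axis=1))
    # Optimal rotation R and uniform scale s minimizing ||X - s * Y @ R||_F;
    # s is the returned singular value sum divided by ||Y||_F^2.
    R, sv_sum = orthogonal_procrustes(Y, X)
    s = sv_sum / np.sum(Y ** 2)
    return np.sqrt(np.sum((X - s * (Y @ R)) ** 2))
```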
Figure 6: kNN Classifier accuracy for varying values of k over the embedding
spaces of the Shuttle, MNIST and Fashion-MNIST datasets. Accuracy scores are
given for each fold of a 10-fold cross-validation for Shuttle, and a 20-fold
cross-validation for MNIST and Fashion-MNIST, for each of PCA, Laplacian
Eigenmaps, LargeVis, t-SNE and UMAP. UMAP performs better than the other
algorithms for large k, particularly on the Shuttle dataset. For Fashion-MNIST
UMAP provides slightly poorer accuracy than t-SNE and LargeVis at small scales,
but is competitive at larger k values.
(a) UMAP (b) t-SNE

Figure 7: Procrustes based alignment of a 10% subsample (red) against the full
dataset (blue) for the flow cytometry dataset for both UMAP and t-SNE.
Given a measure of distance between different embeddings we can ex-
amine stability under sub-sampling by considering the normalized Pro-
crustes distance between the embedding of a sub-sample, and the corre-
sponding sub-sample of an embedding of the full dataset. As the size of
the sub-sample increases the average distance per point between the sub-
sampled embeddings should decrease, potentially toward some asymptote
of maximal agreement under repeated runs. Ideally this asymptotic value
would be zero error, but for stochastic embeddings such as UMAP and t-
SNE this is not achievable.
We performed an empirical comparison of algorithms with respect to
stability using the Flow Cytometry dataset, due to its large size, interesting
structure, and low ambient dimensionality (aiding runtime performance
for t-SNE). We note that for a dataset this large we found it necessary to
increase the default n_iter value for t-SNE from 1000 to 1500 to ensure
better convergence. While this had an impact on the runtime, it significantly
improved the Procrustes distance results by providing more stable and
consistent embeddings. Figure 8 provides a comparison between UMAP and
t-SNE, demonstrating that UMAP has significantly more stable results than
t-SNE. In particular, after sub-sampling just 5% of the million data points, the
per-point error for UMAP was already below any value achieved by t-SNE.
5.4 Computational Performance Comparisons
Benchmarks against the real world datasets were performed on a Macbook
Pro with a 3.1 GHz Intel Core i7 and 8GB of RAM for Table 3, and on a
server with Intel Xeon E5-2697v4 processors and 512GB of RAM for the
large scale benchmarking in Subsections 5.4.1, 5.4.2, and 5.4.3.
For t-SNE we chose MulticoreTSNE [57], which we believe to be the
fastest extant implementation of Barnes-Hut t-SNE at this time, even when
run in single core mode. It should be noted that MulticoreTSNE is a heavily
optimized implementation written in C++ based on Van der Maaten's
bhtsne [58] code.
As a fast alternative approach to t-SNE we also consider the FIt-SNE
algorithm [37]. We used the reference implementation [36], which, like
MulticoreTSNE, is an optimized C++ implementation. We also note that
FIt-SNE makes use of multiple cores.
LargeVis [54] was benchmarked using the reference implementation
[53]. It was run with default parameters, including use of 8 threads on the
4-core machine. The only exceptions were small datasets, where we explicitly
set the -samples parameter to n_samples/100 as per the recommended
values in the documentation of the reference implementation.
The Isomap [55] and Laplacian Eigenmaps [7] implementations in scikit-learn
[10] were used. We suspect the Laplacian Eigenmaps implementation
may not be well optimized for large datasets, but did not find a better
performing implementation that provided comparable quality results. Isomap
failed to complete for the Shuttle, Fashion-MNIST, MNIST and GoogleNews
datasets, while Laplacian Eigenmaps failed to run for the GoogleNews
dataset.
To allow a broader range of algorithms to run, some of the datasets
were subsampled or had their dimension reduced by PCA. The Flow Cytometry
dataset was benchmarked on a 10% sample and the GoogleNews dataset
was subsampled down to 200,000 data points. Finally, the Mouse scRNA
dataset was reduced to 1,000 dimensions via PCA.
Timings were performed for the COIL20 [43], COIL100 [44], Shuttle [35],
MNIST [32], Fashion-MNIST [63], and GoogleNews [41] datasets. Results
can be seen in Table 3. UMAP consistently performs faster than any of
the other algorithms, aside from on the very small PenDigits dataset, where
Laplacian Eigenmaps and Isomap have a small edge.
Figure 8: Comparison of average Procrustes distance per point for t-SNE, LargeVis
and UMAP over a variety of sizes of subsamples from the full Flow Cytometry
dataset. UMAP sub-sample embeddings are very close to the full embedding even
for subsamples of 5% of the full dataset, outperforming the results of t-SNE and
LargeVis even when they use the full Flow Cytometry dataset.
Dataset (samples x dims)    UMAP    FIt-SNE    t-SNE    LargeVis    Eigenmaps    Isomap
Pen Digits (1797x64)          9s        48s      17s         20s           2s        2s
COIL20 (1440x16384)          12s        75s      22s         82s          47s       58s
COIL100 (7200x49152)         85s      2681s     810s       3197s        3268s     3210s
scRNA (21086x1000)           28s       131s     258s        377s         470s      923s
Shuttle (58000x9)            94s       108s     714s        615s         133s        --
MNIST (70000x784)            87s       292s    1450s       1298s       40709s        --
F-MNIST (70000x784)          65s       278s     934s       1173s        6356s        --
Flow (100000x17)            102s       164s    1135s       1127s       30654s        --
Google News (200000x300)    361s       652s   16906s       5392s           --        --
Table 3: Runtime of several dimension reduction algorithms on various datasets.
To allow a broader range of algorithms to run, some of the datasets were
subsampled or had their dimension reduced by PCA. The Flow Cytometry dataset
was benchmarked on a 10% sample and the GoogleNews dataset was subsampled down
to 200,000 data points. Finally, the Mouse scRNA dataset was reduced to 1,000
dimensions via PCA. A double dash indicates that the algorithm failed to
complete on that dataset.
5.4.1 Scaling with Embedding Dimension
UMAP is significantly more performant than t-SNE when embedding into
dimensions larger than 2 (comparisons were performed against MulticoreTSNE,
as the current implementation of FIt-SNE does not support embedding into any
dimension larger than 2). This is particularly important when the intention
is to use the low dimensional representation for further machine learning
tasks such as clustering or anomaly detection rather than merely for
visualization. The computational performance of UMAP is far more efficient
than t-SNE, even for very small embedding dimensions of 6 or 8 (see Figure 9).
This is largely due to the fact that UMAP does not require global
normalisation (since it represents data as a fuzzy topological structure
rather than as a probability distribution). This allows the algorithm to work
without the need for space trees (such as the quad-trees and oct-trees that
t-SNE uses [58]). Such space trees scale exponentially in dimension, resulting
in t-SNE's relatively poor scaling with respect to embedding dimension.
By contrast, we see that UMAP consistently scales well in embedding
dimension, making the algorithm practical for a wider range of applications
beyond visualization.
5.4.2 Scaling with Ambient Dimension
Through a combination of the local-connectivity constraint and the
approximate nearest neighbor search, UMAP can perform effective dimension
reduction even for very high dimensional data (see Figure 13 for an example
of UMAP operating directly on 1.8 million dimensional data). This stands in
contrast to many other manifold learning techniques, including t-SNE and
LargeVis, for which it is generally recommended to reduce the dimension
with PCA before applying these techniques (see [59] for example).

To compare runtime performance scaling with respect to the ambient
dimension of the data we chose to use the Mouse scRNA dataset, which
is high dimensional, but is also amenable to the use of PCA to reduce the
dimension of the data as a pre-processing step without losing too much
of the important structure (in contrast to COIL100, on which PCA destroys
much of the manifold structure). We compare the performance of UMAP,
FIt-SNE, MulticoreTSNE, and LargeVis on PCA reductions of the Mouse scRNA
dataset to varying dimensionalities, and on the original dataset, in Figure
10.
While all the implementations tested show a significant increase in runtime
with increasing dimension, UMAP is dramatically more efficient for
(a) A comparison of run time for UMAP, t-SNE and LargeVis with respect to
embedding dimension on the Pen digits dataset. We see that t-SNE scales worse
than exponentially while UMAP and LargeVis scale linearly with a slope so
slight as to be undetectable at this scale.

(b) Detail of scaling for embedding dimension of six or less. We can see that
UMAP and LargeVis are essentially flat. In practice they appear to scale
linearly, but the slope is essentially undetectable at this scale.

Figure 9: Scaling performance with respect to embedding dimension of UMAP,
t-SNE and LargeVis on the Pen digits dataset.
Figure 10: Runtime performance scaling of UMAP, t-SNE, FIt-SNE and LargeVis
with respect to the ambient dimension of the data. As the ambient dimension
increases beyond a few thousand dimensions the computational cost of t-SNE,
FIt-SNE, and LargeVis all increase dramatically, while UMAP continues to
perform well into the tens-of-thousands of dimensions.
large ambient dimensions, easily scaling to run on the original unreduced
dataset. The ability to run manifold learning on raw source data, rather than
dimension reduced data that may have lost important manifold structure in
the pre-processing, is a significant advantage. This advantage comes from
the local connectivity assumption, which ensures good topological
representation of high dimensional data, particularly with smaller numbers of
near neighbors, and the efficiency of the NN-Descent algorithm for
approximate nearest neighbor search even in high dimensions.
Since UMAP scales well with ambient dimension, the Python implementation
also supports input in sparse matrix format, allowing scaling to extremely
high dimensional data, such as the integer data shown in Figures
13 and 14.
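As an illustration, a minimal sketch assuming the umap-learn package (whose
UMAP estimator accepts SciPy sparse matrices directly); the random data here
is purely illustrative:

```python
import scipy.sparse
import umap

# Illustrative sparse data: 10,000 samples in 100,000 dimensions,
# with roughly 10 non-zero entries per row.
X = scipy.sparse.random(10000, 100000, density=1e-4,
                        format="csr", random_state=42)

# UMAP consumes the sparse matrix without densification.
embedding = umap.UMAP(metric="cosine").fit_transform(X)
print(embedding.shape)  # (10000, 2)
```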
5.4.3 Scaling with the Number of Samples
For dataset size performance comparisons we chose to compare UMAP with
FIt-SNE [37], a version of t-SNE that uses approximate nearest neighbor
search and a Fourier interpolation optimisation approach; MulticoreTSNE
[57], which we believe to be the fastest extant implementation of Barnes-Hut
t-SNE; and LargeVis [54]. It should be noted that FIt-SNE, MulticoreTSNE,
and LargeVis are all heavily optimized implementations written in C++. In
contrast our UMAP implementation was written in Python, making use of the
numba [31] library for performance. MulticoreTSNE and LargeVis were run in
single threaded mode to make fair comparisons to our single threaded UMAP
implementation.
We benchmarked all four implementations using subsamples of the GoogleNews
dataset. The results can be seen in Figure 11. This demonstrates that
UMAP has superior scaling performance in comparison to Barnes-Hut t-SNE,
even when Barnes-Hut t-SNE is given multiple cores. Asymptotic
scaling of UMAP is comparable to that of FIt-SNE (and LargeVis). On this
dataset UMAP demonstrated somewhat faster absolute performance compared
to FIt-SNE, and was dramatically faster than LargeVis.
The UMAP embedding of the full GoogleNews dataset of 3 million word
vectors, as seen in Figure 12, was completed in around 200 minutes, as
compared with several days required for MulticoreTSNE, even using multiple
cores.
To scale even further we were inspired by the work of John Williamson
on embedding integers [61], as represented by (sparse) binary vectors of
their prime divisibility. This allows the generation of arbitrarily large,
extremely high dimension datasets that still have meaningful structure to be
explored.
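A minimal sketch of this data construction, at a much smaller scale than the
30,000,000 point experiment (the sieve and the 100,000 integer cutoff are our
own illustrative choices):

```python
import numpy as np
from scipy.sparse import lil_matrix

def primes_below(n):
    """Simple sieve of Eratosthenes."""
    sieve = np.ones(n, dtype=bool)
    sieve[:2] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = False
    return np.flatnonzero(sieve)

n_integers = 100_000  # scaled down for illustration
primes = primes_below(n_integers)

X = lil_matrix((n_integers, len(primes)), dtype=np.float32)
for j, p in enumerate(primes):
    X[p::p, j] = 1.0  # rows p, 2p, 3p, ... are divisible by p
X = X.tocsr()  # ready to pass to UMAP as sparse input
```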
Figure 11: Runtime performance scaling of t-SNE and UMAP on various sized
sub-samples of the full Google News dataset. The lower t-SNE line is the wall
clock runtime for Multicore t-SNE using 8 cores.
Figure 12: Visualization of the full 3 million word vectors from the GoogleNews
dataset as embedded by UMAP.
In Figures 13 and 14 we show an embedding of 30,000,000 data
samples from an ambient space of approximately 1.8 million dimensions.
This computation took approximately 2 weeks on a large memory SMP.
Note that despite the high ambient dimension, and vast amount of data,
UMAP is still able to find and display interesting structure. In Figure 15 we
show local regions of the embedding, demonstrating the fine detail
structure that was captured.
6 Weaknesses
While we believe UMAP to be a very effective algorithm for both visualization
and dimension reduction, most algorithms must make trade-offs and
UMAP is no exception. In this section we will briefly discuss those areas or
use cases where UMAP is less effective, and suggest potential alternatives.
For a number of use cases the interpretability of the reduced dimension
results is of critical importance. Similarly to most non-linear dimension
reduction techniques (including t-SNE and Isomap), UMAP lacks the strong
interpretability of Principal Component Analysis (PCA) and related
techniques such as Non-Negative Matrix Factorization (NMF). In particular the
dimensions of the UMAP embedding space have no specific meaning, unlike
PCA where the dimensions are the directions of greatest variance in
the source data. Furthermore, since UMAP is based on the distance between
observations rather than the source features, it does not have an equivalent
of the factor loadings that linear techniques such as PCA, or Factor Analysis,
can provide. If strong interpretability is critical we therefore recommend
linear techniques such as PCA, NMF or pLSA.
One of the core assumptions of UMAP is that there exists manifold
structure in the data. Because of this UMAP can tend to find manifold
structure within the noise of a dataset, similar to the way the human mind
finds structured constellations among the stars. As more data is sampled
the amount of structure evident from noise will tend to decrease and UMAP
becomes more robust; however, care must be taken with small sample sizes
of noisy data, or data with only large scale manifold structure. Detecting
when a spurious embedding has occurred is a topic of further research.
UMAP is derived from the axiom that local distance is of more im-
portance than long range distances (similar to techniques like t-SNE and
LargeVis). UMAP therefore concerns itself primarily with accurately rep-
resenting local structure. While we believe that UMAP can capture more
global structure than these other techniques, it remains true that if global
Figure 13: Visualization of 30,000,000 integers as represented by binary vectors
of prime divisibility, colored by density of points.
Figure 14: Visualization of 30,000,000 integers as represented by binary vectors
of prime divisibility, colored by integer value of the point (larger values are green
or yellow, smaller values are blue or purple).
(a) Upper right spiral (b) Lower right spiral and starbursts
(c) Central cloud
Figure 15: Zooming in on various regions of the integer embedding reveals
further layers of fine structure have been preserved.
structure is of primary interest then UMAP may not be the best choice for
dimension reduction. Multi-dimensional scaling specifically seeks to preserve
the full distance matrix of the data, and as such is a good candidate
when all scales of structure are of equal importance. PHATE [42] is a good
example of a hybrid approach that begins with local structure information
and makes use of MDS to attempt to preserve long scale distances as well. It
should be noted that these techniques are more computationally intensive
and thus rely on landmarking approaches for scalability.
It should also be noted that a significant contributor to UMAP's relative
global structure preservation is derived from the Laplacian Eigenmaps
initialization (which, in turn, followed from the theoretical foundations).
This was noted in, for example, [29]. The authors of that paper demonstrate
that t-SNE, with similar initialization, can perform equivalently to UMAP
in a particular measure of global structure preservation. However, the
objective function derived for UMAP (cross-entropy) is significantly
different from that of t-SNE (KL-divergence) in how it penalizes failures to
preserve non-local and global structure, and is also a significant
contributor (the authors would like to thank Nikolay Oskolkov for his article
"tSNE vs. UMAP: Global Structure", which does an excellent job of highlighting
these aspects from an empirical and theoretical basis).
It is worth noting that, in combining the local simplicial set structures,
pure nearest neighbor structure in the high dimensional space is not
explicitly preserved. In particular the combination introduces so-called
"reverse nearest neighbors" into the classical knn-graph. This, combined with
the fact that UMAP is preserving topology rather than pure metric structures,
means that UMAP will not perform as well as some methods on quality measures
based on metric structure preservation, particularly against methods, such as
MDS, which are explicitly designed to optimize metric structure preservation.
UMAP aempts to discover a manifold on which your data is uniformly
distributed. If you have strong condence in the ambient distances of your
data you should make use of a technique that explicitly aempts to preserve
these distances. For example if your data consisted of a very loose structure
in one area of your ambient space and a very dense structure in another
region region UMAP would aempt to put these local areas on an even
footing.
Finally, to improve the computational efficiency of the algorithm a number
of approximations are made. This can have an impact on the results
of UMAP for small (less than 500 samples) dataset sizes. In particular the
use of approximate nearest neighbor algorithms, and the negative sampling
used in optimization, can result in suboptimal embeddings. For this reason
we encourage users to take care with particularly small datasets. A slower
but exact implementation of UMAP for small datasets is a future project.
7 Future Work
Having established both relevant mathematical theory and a concrete
implementation, there still remains significant scope for future developments
of UMAP.
A comprehensive empirical study which examines the impact of the
various algorithmic components, choices, and hyper-parameters of the
algorithm would be beneficial. While the structure and choices of the
algorithm presented were derived from our foundational mathematical
framework, examining the impacts that these choices have on practical results
would be enlightening and a significant contribution to the literature.
As noted in the weaknesses section there is a great deal of uncertainty
surrounding the preservation of global structure among the field of manifold
learning algorithms. In particular this is hampered by the lack of clear
objective measures, or even definitions, of global structure preservation.
While some metrics exist, they are not comprehensive, and are often specific
to various downstream tasks. A systematic study of both metrics of
non-local and global structure preservation, and performance of various
manifold learning algorithms with respect to them, would be of great
benefit. We believe this would aid in better understanding UMAP's success in
various downstream tasks.
Making use of the fuzzy simplicial set representation of data, UMAP can
potentially be extended to support (semi-)supervised dimension reduction,
and dimension reduction for datasets with heterogeneous data types. Each
data type (or prediction variables in the supervised case) can be seen as an
alternative view of the underlying structure, each with a different
associated metric; for example categorical data may use Jaccard or Dice
distance, while ordinal data might use Manhattan distance. Each view and
metric can be used to independently generate fuzzy simplicial sets, which can
then be intersected together to create a single fuzzy simplicial set for
embedding. Extending UMAP to work with mixed data types would vastly increase
the range of datasets to which it can be applied. Use cases for
(semi-)supervised dimension reduction include semi-supervised clustering, and
interactive labelling tools.
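As a sketch of the core operation (our own illustration, not part of the
reference implementation): if each view's fuzzy simplicial set is represented
by its 1-skeleton as a sparse matrix of membership strengths in [0, 1],
intersection under the probabilistic t-norm is an element-wise product.

```python
import scipy.sparse

def intersect_fuzzy_sets(A, B):
    """Intersect two fuzzy simplicial 1-skeletons over the same points,
    combining membership strengths with the probabilistic t-norm
    (element-wise product)."""
    return A.multiply(B)
```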
The computational framework established for UMAP allows for the
potential development of techniques to add new unseen data points into an
existing embedding, and to generate high dimensional representations of
arbitrary points in the embedded space. Furthermore, the combination of
supervision and the addition of new samples to an existing embedding
provides avenues for metric learning. The addition of new samples to an
existing embedding would allow UMAP to be used as a feature engineering
tool as part of a general machine learning pipeline for either clustering or
classification tasks. Pulling points back to the original high dimensional
space from the embedded space would potentially allow UMAP to be used
as a generative model similar to some use cases for autoencoders. Finally,
there are many use cases for metric learning; see [64] or [8] for example.
There also remains significant scope to develop techniques to both detect
and mitigate against potentially spurious embeddings, particularly for
small data cases. The addition of such techniques would make UMAP far
more robust as a tool for exploratory data analysis, a common use case
when reducing to two dimensions for visualization purposes.
Experimental versions of some of this work are already available in the
referenced implementations.
8 Conclusions
We have developed a general purpose dimension reduction technique that
is grounded in strong mathematical foundations. The algorithm implementing
this technique is demonstrably faster than t-SNE and provides
better scaling. This allows us to generate high quality embeddings of larger
data sets than had previously been attainable. The use and effectiveness
of UMAP in various scientific fields demonstrates the strength of the
algorithm.
Acknowledgements The authors would like to thank Colin Weir, Rick
Jardine, Brendan Fong, David Spivak and Dmitry Kobak for discussion and
useful commentary on various drafts of this paper.
A Proof of Lemma 1
Lemma 1. Let $(\mathcal{M}, g)$ be a Riemannian manifold in an ambient
$\mathbb{R}^n$, and let $p \in \mathcal{M}$ be a point. If $g$ is locally
constant about $p$ in an open neighbourhood $U$ such that $g$ is a constant
diagonal matrix in ambient coordinates, then in a ball $B \subseteq U$
centered at $p$ with volume $\frac{\pi^{n/2}}{\Gamma(n/2+1)}$ with respect to
$g$, the geodesic distance from $p$ to any point $q \in B$ is
$\frac{1}{r} d_{\mathbb{R}^n}(p, q)$, where $r$ is the radius of the ball in
the ambient space and $d_{\mathbb{R}^n}$ is the existing metric on the
ambient space.

Proof. Let $x_1, \ldots, x_n$ be the coordinate system for the ambient space.
A ball $B$ in $\mathcal{M}$ under Riemannian metric $g$ has volume given by
$$\int_B \sqrt{\det(g)} \, dx_1 \wedge \cdots \wedge dx_n .$$
If $B$ is contained in $U$, then $g$ is constant in $B$ and hence
$\sqrt{\det(g)}$ is constant and can be brought outside the integral. Thus,
the volume of $B$ is
$$\sqrt{\det(g)} \int_B dx_1 \wedge \cdots \wedge dx_n
  = \sqrt{\det(g)} \, \frac{\pi^{n/2} r^n}{\Gamma(n/2+1)} ,$$
where $r$ is the radius of the ball in the ambient $\mathbb{R}^n$. If we fix
the volume of the ball to be $\frac{\pi^{n/2}}{\Gamma(n/2+1)}$ we arrive at
the requirement that
$$\det(g) = \frac{1}{r^{2n}} .$$
Now, since $g$ is assumed to be diagonal with constant entries, we can solve
for $g$ itself as
$$g_{ij} = \begin{cases} \frac{1}{r^2} & \text{if } i = j, \\
  0 & \text{otherwise}. \end{cases} \tag{2}$$
The geodesic distance on $\mathcal{M}$ under $g$ from $p$ to $q$ (where
$p, q \in B$) is defined as
$$\inf_{c \in C} \int_a^b \sqrt{g(\dot{c}(t), \dot{c}(t))} \, dt ,$$
where $C$ is the class of smooth curves $c$ on $\mathcal{M}$ such that
$c(a) = p$ and $c(b) = q$, and $\dot{c}$ denotes the first derivative of $c$
on $\mathcal{M}$. Given that $g$ is as defined in (2), we see that this
simplifies to
$$\inf_{c \in C} \int_a^b \sqrt{\tfrac{1}{r^2}
    \langle \dot{c}(t), \dot{c}(t) \rangle} \, dt
  = \frac{1}{r} \inf_{c \in C} \int_a^b \| \dot{c}(t) \| \, dt
  = \frac{1}{r} d_{\mathbb{R}^n}(p, q). \tag{3}$$
B Proof that FinReal and FinSing are adjoint

Theorem 2. The functors $\mathrm{FinReal} : \textbf{Fin-sFuzz} \to
\textbf{FinEPMet}$ and $\mathrm{FinSing} : \textbf{FinEPMet} \to
\textbf{Fin-sFuzz}$ form an adjunction with $\mathrm{FinReal}$ the left
adjoint and $\mathrm{FinSing}$ the right adjoint.

Proof. The adjunction is evident by construction, but can be made more
explicit as follows. Define a functor $F : \Delta \times I \to
\textbf{FinEPMet}$ by
$$F([n], [0, a)) = (\{x_1, x_2, \ldots, x_n\}, d_a),$$
where
$$d_a(x_i, x_j) = \begin{cases} -\log(a) & \text{if } i \neq j, \\
  0 & \text{otherwise}. \end{cases}$$
Now $\mathrm{FinSing}$ can be defined in terms of $F$ as
$$\mathrm{FinSing}(Y) : ([n], [0, a)) \mapsto
  \hom_{\textbf{FinEPMet}}(F([n], [0, a)), Y),$$
where the face maps $d_i$ are given by pre-composition with $F d^i$, and
similarly for degeneracy maps, at any given value of $a$. Furthermore,
post-composition with $F$ level-wise for each $a$ defines maps of fuzzy
simplicial sets, making $\mathrm{FinSing}$ a functor.

We now construct $\mathrm{FinReal}$ as the left Kan extension of $F$ along
the Yoneda embedding $y : \Delta \times I \to \textbf{Fin-sFuzz}$. Explicitly
this results in a definition of $\mathrm{FinReal}$ at a fuzzy simplicial set
$X$ as a colimit:
$$\mathrm{FinReal}(X) = \operatorname*{colim}_{y([n], [0,a)) \to X}
  F([n], [0, a)).$$
Further, it follows from the Yoneda lemma that
$\mathrm{FinReal}(\Delta^n_{<a}) \cong F([n], [0, a))$, and hence this
definition as a left Kan extension agrees with Definition 7, and the
definition of $\mathrm{FinSing}$ above agrees with that of Definition 8. To
see that $\mathrm{FinReal}$ and $\mathrm{FinSing}$ are adjoint we note that
$$\hom_{\textbf{Fin-sFuzz}}(\Delta^n_{<a}, \mathrm{FinSing}(Y))
  \cong \mathrm{FinSing}(Y)^n_{<a}
  = \hom_{\textbf{FinEPMet}}(F([n], [0, a)), Y)
  \cong \hom_{\textbf{FinEPMet}}(\mathrm{FinReal}(\Delta^n_{<a}), Y). \tag{4}$$
The first isomorphism follows from the Yoneda lemma, the equality is by
construction, and the final isomorphism follows by another application of
the Yoneda lemma. Since every simplicial set can be canonically expressed
as a colimit of standard simplices, and $\mathrm{FinReal}$ commutes with
colimits (as it was defined via a colimit formula), it follows that
$\mathrm{FinReal}$ is completely determined by its image on standard
simplices. As a result the isomorphism of equation (4) extends to the
required isomorphism demonstrating the adjunction.
C From t-SNE to UMAP
As an aid to implementation of UMAP and to illuminate the algorithmic
similarities with t-SNE and LargeVis, here we review the main equations
used in those methods, and then present the equivalent UMAP expressions
in a notation which may be more familiar to users of those other methods.

In what follows we are concerned with defining similarities between
two objects $i$ and $j$ in the high dimensional input space $X$ and low
dimensional embedded space $Y$. These are normalized and symmetrized in
various ways. In a typical implementation, these pair-wise quantities are
stored and manipulated as (potentially sparse) matrices. Quantities with
the subscript $ij$ are symmetric, i.e. $v_{ij} = v_{ji}$. Extending the
conditional probability notation used in t-SNE, $j|i$ indicates an asymmetric
similarity, i.e. $v_{j|i} \neq v_{i|j}$.
t-SNE denes input probabilities in three stages. First, for each pair of
points, iand j, in X, a pair-wise similarity, vij, is calculated, Gaussian with
respect to the Euclidean distance between xiand xj:
vj|i= exp(− kxixjk2
2/2σ2
i)(5)
where σ2
iis the variance of the Gaussian.
Second, the similarities are converted into Nconditional probability
distributions by normalization:
pj|i=vj|i
Pk6=ivk|i
(6)
σiis chosen by searching for a value such that the perplexity of the proba-
bility distribution p·|imatches a user-specied value.
ird, these probability distributions are symmetrized and then further
normalized over the entire matrix of values to give a joint probability dis-
tribution:
54
pij =pj|i+pi|j
2N(7)
We note that this is a heuristic denition and not in accordance with stan-
dard relationship between conditional and joint probabilities that would be
expected under probability semantics usually used to describe t-SNE.
Similarities between pairs of points in the output space $Y$ are defined
using a Student t-distribution with one degree of freedom on the squared
Euclidean distance:
$$w_{ij} = \left(1 + \|y_i - y_j\|_2^2\right)^{-1} \tag{8}$$
followed by the matrix-wise normalization, to form $q_{ij}$:
$$q_{ij} = \frac{w_{ij}}{\sum_{k \neq l} w_{kl}} \tag{9}$$
The t-SNE cost is the Kullback-Leibler divergence between the two
probability distributions:
$$C_{t\text{-}SNE} = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{10}$$
This can be expanded into constant and non-constant contributions:
$$C_{t\text{-}SNE} = \sum_{i \neq j} p_{ij} \log p_{ij}
  - p_{ij} \log q_{ij} \tag{11}$$
Because both $p_{ij}$ and $q_{ij}$ require calculations over all pairs of
points, improving the efficiency of t-SNE algorithms has involved separate
strategies for approximating these quantities. Similarities in the high
dimensions are effectively zero outside of the nearest neighbors of each
point, due to the calibration of the $p_{j|i}$ values to reproduce a desired
perplexity. Therefore an approximation used in Barnes-Hut t-SNE is to only
calculate $v_{j|i}$ for $n$ nearest neighbors of $i$, where $n$ is a multiple
of the user-selected perplexity, and to assume $v_{j|i} = 0$ for all other
$j$. Because the low dimensional coordinates change with each iteration, a
different approach is used to approximate $q_{ij}$. In Barnes-Hut t-SNE and
related methods this usually involves grouping together points whose
contributions can be approximated as a single point.
A further heuristic algorithm optimization technique employed by t-SNE
implementations is the use of early exaggeration where, for some number
of initial iterations, the $p_{ij}$ are multiplied by some constant greater
than 1.0 (usually 12.0). In theoretical analyses of t-SNE such as [38],
results are obtained only under an early exaggeration regimen with either a
large constant (of the order of the number of samples), or in the limit of
infinite exaggeration. Further papers, such as [37] and [28], suggest the
option of using exaggeration for all iterations rather than just early ones,
and demonstrate the utility of this. The effectiveness of these analyses and
practical approaches suggests that KL-divergence as a measure between
probability distributions is not what makes the t-SNE algorithm work, since,
under exaggeration, the $p_{ij}$ are manifestly not a probability
distribution. This is another example of how the probability semantics used
to describe t-SNE are primarily descriptive rather than foundational.
Nonetheless, t-SNE is highly effective and clearly produces useful results
on a very wide variety of tasks.
LargeVis uses a similar approach to Barnes-Hut t-SNE when approximating
$p_{ij}$, but further improves efficiency by only requiring approximate
nearest neighbors for each point. For the low dimensional coordinates,
it abandons normalization of $w_{ij}$ entirely. Rather than use the
Kullback-Leibler divergence, it optimizes a likelihood function, which is
hence maximized, not minimized:
$$C_{LV} = \sum_{i \neq j} p_{ij} \log w_{ij}
  + \gamma \sum_{i \neq j} \log(1 - w_{ij}) \tag{12}$$
Here $p_{ij}$ and $w_{ij}$ are defined as in Barnes-Hut t-SNE (apart from the
use of approximate nearest neighbors for $p_{ij}$, and the fact that, in
implementation, LargeVis does not normalize the $p_{ij}$ by $N$) and $\gamma$
is a user-chosen positive constant which weights the strength of the
repulsive contributions (second term) relative to the attractive contribution
(first term). Note also that the first term resembles the optimizable part of
the Kullback-Leibler divergence but using $w_{ij}$ instead of $q_{ij}$.
Abandoning calculation of $q_{ij}$ is a crucial change, because the LargeVis
cost function is amenable to optimization via stochastic gradient descent.
Ignoring specific definitions of $v_{ij}$ and $w_{ij}$, the UMAP cost
function, the cross entropy, is:
$$C_{UMAP} = \sum_{i \neq j} v_{ij} \log \frac{v_{ij}}{w_{ij}}
  + (1 - v_{ij}) \log \frac{1 - v_{ij}}{1 - w_{ij}} \tag{13}$$
Like the Kullback-Leibler divergence, this can be arranged into two constant
contributions (those containing $v_{ij}$ only) and two optimizable
contributions (containing $w_{ij}$):
$$C_{UMAP} = \sum_{i \neq j} \Big[ v_{ij} \log v_{ij}
  + (1 - v_{ij}) \log(1 - v_{ij})
  - v_{ij} \log w_{ij}
  - (1 - v_{ij}) \log(1 - w_{ij}) \Big] \tag{14}$$
Ignoring the two constant terms, the UMAP cost function has a very
similar form to that of LargeVis, but without a $\gamma$ term to weight the
repulsive component of the cost function, and without requiring matrix-wise
normalization in the high dimensional space. The cost function for UMAP
can therefore be optimized (in this case, minimized) with stochastic gradient
descent in the same way as LargeVis.
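To make the optimization concrete, the following is a minimal sketch (our own
illustration, not the reference implementation) of a single stochastic
gradient step on the optimizable terms of Equation 14: an edge $(i, j)$
sampled proportionally to $v_{ij}$ contributes an attractive force, and a
handful of randomly sampled vertices contribute repulsive forces. The
gradient coefficients follow from differentiating $-\log w_{ij}$ and
$-\log(1 - w_{ij})$ with $w_{ij}$ as in Equation 17; the small eps guards
against division by zero.

```python
import numpy as np

def umap_sgd_step(Y, i, j, neg_samples, lr, a=1.929, b=0.7915, eps=1e-3):
    """One negative-sampling SGD step on embedding coordinates Y."""
    # Attractive term: gradient of -v_ij log w_ij for the sampled edge.
    diff = Y[i] - Y[j]
    d2 = diff @ diff
    if d2 > 0.0:
        g = (2.0 * a * b * d2 ** (b - 1.0)) / (1.0 + a * d2 ** b)
        Y[i] -= lr * g * diff
        Y[j] += lr * g * diff
    # Repulsive term: gradient of -(1 - v_ik) log(1 - w_ik) for each
    # negative sample k (treating v_ik as approximately zero).
    for k in neg_samples:
        diff = Y[i] - Y[k]
        d2 = diff @ diff
        g = (2.0 * b) / ((eps + d2) * (1.0 + a * d2 ** b))
        Y[i] += lr * g * diff
```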
Although the above discussion places UMAP in the same family of methods
as t-SNE and LargeVis, it does not use the same definitions for $v_{ij}$ and
$w_{ij}$. Using the notation established above, we now provide the equivalent
expressions for the UMAP similarities. In the high dimensional space, the
similarities $v_{j|i}$ are the local fuzzy simplicial set memberships, based
on the smooth nearest neighbors distances:
$$v_{j|i} = \exp[(-d(x_i, x_j) + \rho_i)/\sigma_i] \tag{15}$$
As with LargeVis, $v_{j|i}$ is calculated only for $n$ approximate nearest
neighbors and $v_{j|i} = 0$ for all other $j$. Here $d(x_i, x_j)$ is the
distance between $x_i$ and $x_j$, which UMAP does not require to be
Euclidean; $\rho_i$ is the distance to the nearest neighbor of $i$; and
$\sigma_i$ is the normalizing factor, which is chosen by Algorithm 3 and
plays a similar role to the perplexity-based calibration of $\sigma_i$ in
t-SNE. Calculation of $v_{j|i}$ with Equation 15 corresponds to Algorithm 2.

Symmetrization is carried out by fuzzy set union using the probabilistic
t-conorm and can be expressed as:
$$v_{ij} = v_{j|i} + v_{i|j} - v_{j|i} v_{i|j} \tag{16}$$
Equation 16 corresponds to forming top-rep in Algorithm 1. Unlike t-SNE,
further normalization is not carried out.
The low dimensional similarities are given by:
$$w_{ij} = \left(1 + a \|y_i - y_j\|_2^{2b}\right)^{-1} \tag{17}$$
where $a$ and $b$ are user-defined positive values. The procedure for finding
them is given in Definition 11. Use of this procedure with the default values
in the UMAP implementation results in $a \approx 1.929$ and
$b \approx 0.7915$. Setting $a = 1$ and $b = 1$ results in the Student
t-distribution used in t-SNE.
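The following is a minimal sketch of Equations 15 through 17 (our own
illustration; in particular, sigma is taken as given rather than solved for
by the binary search of Algorithm 3, and the clamping to non-negative
exponents mirrors the fact that $d(x_i, x_j) \geq \rho_i$ for the retained
neighbors):

```python
import numpy as np

def membership_strengths(dists, rho, sigma):
    """Eq. 15: directed memberships v_{j|i} for the n nearest neighbors
    of point i, given their distances, the nearest-neighbor distance
    rho, and the normalizing factor sigma."""
    return np.exp(-np.maximum(dists - rho, 0.0) / sigma)

def symmetrize(V):
    """Eq. 16: fuzzy set union (probabilistic t-conorm) applied to a
    dense matrix of directed memberships V[i, j] = v_{j|i}."""
    return V + V.T - V * V.T

def low_dim_similarity(y_i, y_j, a=1.929, b=0.7915):
    """Eq. 17 with the default fitted values of a and b."""
    d2 = np.sum((y_i - y_j) ** 2)
    return 1.0 / (1.0 + a * d2 ** b)
```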
References
[1] E Alpaydin and Fevzi Alimoglu. Pen-based recognition of handwritten
digits data set. UCI Machine Learning Repository. Irvine: University of
California, 4(2), 1998.
[2] Frederik Otzen Bagger, Savvas Kinalis, and Nicolas Rapin. Bloodspot:
a database of healthy and malignant haematopoiesis updated with purified
and single cell mrna sequencing profiles. Nucleic Acids Research,
2018.
[3] Michael Barr. Fuzzy set theory and topos theory. Canad. Math. Bull,
29(4):501–508, 1986.
[4] Etienne Becht, Charles-Antoine Dutertre, Immanuel W.H. Kwok,
Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Evaluation of
umap as an alternative to t-sne for single-cell data. bioRxiv, 2018.
[5] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine
Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and
Evan W Newell. Dimensionality reduction for visualizing single-cell
data using umap. Nature biotechnology, 37(1):38, 2019.
[6] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spec-
tral techniques for embedding and clustering. In Advances in neural
information processing systems, pages 585–591, 2002.
[7] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimen-
sionality reduction and data representation. Neural computation,
15(6):1373–1396, 2003.
[8] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on
metric learning for feature vectors and structured data. arXiv preprint
arXiv:1306.6709, 2013.
[9] Tess Brodie, Elena Brenna, and Federica Sallusto. Omip-018:
Chemokine receptor expression on human t helper cells. Cytometry
Part A, 83(6):530–532, 2013.
[10] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa,
Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer,
Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas,
Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine
learning software: experiences from the scikit-learn project. In ECML
PKDD Workshop: Languages for Data Mining and Machine Learning,
pages 108–122, 2013.
[11] John N Campbell, Evan Z Macosko, Henning Fenselau, Tune H Pers,
Anna Lyubetskaya, Danielle Tenen, Melissa Goldman, Anne MJ Ver-
stegen, Jon M Resch, Steven A McCarroll, et al. A molecular census of
arcuate hypothalamus and median eminence cell types. Nature neu-
roscience, 20(3):484, 2017.
[12] Junyue Cao, Malte Spielmann, Xiaojie Qiu, Xingfan Huang, Daniel M
Ibrahim, Andrew J Hill, Fan Zhang, Stefan Mundlos, Lena Chris-
tiansen, Frank J Steemers, et al. The single-cell transcriptional land-
scape of mammalian organogenesis. Nature, page 1, 2019.
[13] Gunnar Carlsson and Facundo Mémoli. Classifying clustering
schemes. Foundations of Computational Mathematics, 13(2):221–252,
2013.
[14] Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson,
and Chris Olah. Activation atlas. Distill, 2019.
https://distill.pub/2019/activation-atlas.
[15] Brian Clark, Genevieve Stein-O'Brien, Fion Shiau, Gabrielle Cannon,
Emily Davis, Thomas Sherman, Fatemeh Rajaii, Rebecca James-Esposito,
Richard Gronostajski, Elana Fertig, et al. Comprehensive
analysis of retinal development at single cell resolution identifies NFI
factors as essential for mitotic exit and specification of late-born cells.
bioRxiv, page 378950, 2018.
[16] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and
computational harmonic analysis, 21(1):5–30, 2006.
[17] Alex Diaz-Papkovich, Luke Anderson-Trocme, and Simon Gravel. Re-
vealing multi-scale population structure in large cohorts. bioRxiv,
page 423632, 2018.
[18] Wei Dong, Charikar Moses, and Kai Li. Efficient k-nearest neighbor
graph construction for generic similarity measures. In Proceedings
of the 20th International Conference on World Wide Web, WWW ’11,
pages 577–586, New York, NY, USA, 2011. ACM.
[19] Carlos Escolano, Marta R Costa-jussà, and José AR Fonollosa. (Self-attentive)
autoencoder-based universal language representation for
machine translation. arXiv preprint arXiv:1810.06351, 2018.
[20] Mateus Espadoto, Nina ST Hirata, and Alexandru C Telea. Deep learn-
ing multidimensional projections. arXiv preprint arXiv:1902.07958,
2019.
[21] Mateus Espadoto, Francisco Caio M Rodrigues, and Alexandru C
Telea. Visual analytics of multidimensional projections for constructing
classifier decision boundary maps.
[22] Greg Friedman et al. Survey article: an elementary illustrated intro-
duction to simplicial sets. Rocky Mountain Journal of Mathematics,
42(2):353–423, 2012.
[23] Lukas Fuhrimann, Vahid Moosavi, Patrick Ole Ohlbrock, and Pierluigi
Dacunto. Data-driven design: Exploring new structural forms using
machine learning and graphic statics. arXiv preprint arXiv:1809.08660,
2018.
[24] Benoit Gaujac, Ilya Feige, and David Barber. Gaussian mixture models
with wasserstein distance. arXiv preprint arXiv:1806.04465, 2018.
[25] Paul G Goerss and John F Jardine. Simplicial homotopy theory.
Springer Science & Business Media, 2009.
[26] Matthias Hein, Jean-Yves Audibert, and Ulrike von Luxburg. Graph
laplacians and their convergence on random neighborhood graphs.
Journal of Machine Learning Research, 8(Jun):1325–1368, 2007.
[27] Harold Hotelling. Analysis of a complex of statistical variables into
principal components. Journal of educational psychology, 24(6):417,
1933.
[28] Dmitry Kobak and Philipp Berens. The art of using t-sne for single-cell
transcriptomics. Nature communications, 10(1):1–14, 2019.
[29] Dmitry Kobak and George C Linderman. Umap does not preserve
global structure any better than t-sne when using the same initialization.
bioRxiv, 2019.
[30] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit
to a nonmetric hypothesis. Psychometrika, 29(1):1–27, Mar 1964.
[31] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A llvm-
based python jit compiler. In Proceedings of the Second Workshop on
the LLVM Compiler Infrastructure in HPC, LLVM ’15, pages 7:1–7:6,
New York, NY, USA, 2015. ACM.
[32] Yann Lecun and Corinna Cortes. The MNIST database of handwritten
digits.
[33] John A Lee and Michel Verleysen. Shift-invariant similarities circum-
vent distance concentration in stochastic neighbor embedding and
variants. Procedia Computer Science, 4:538–547, 2011.
[34] Xin Li, Ondrej E Dyck, Mark P Oxley, Andrew R Lupini, Leland
McInnes, John Healy, Stephen Jesse, and Sergei V Kalinin. Mani-
fold learning of four-dimensional scanning transmission electron mi-
croscopy. npj Computational Materials, 5(1):5, 2019.
[35] M. Lichman. UCI machine learning repository, 2013.
[36] George Linderman. Fit-sne. https://github.com/KlugerLab/
FIt-SNE, 2018.
[37] George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan
Steinerberger, and Yuval Kluger. Efficient algorithms for t-distributed
stochastic neighborhood embedding. arXiv preprint arXiv:1712.09005,
2017.
[38] George C Linderman and Stefan Steinerberger. Clustering with t-sne,
provably. SIAM Journal on Mathematics of Data Science, 1(2):313–332,
2019.
[39] Saunders Mac Lane. Categories for the working mathematician, vol-
ume 5. Springer Science & Business Media, 2013.
[40] J Peter May. Simplicial objects in algebraic topology, volume 11. Uni-
versity of Chicago Press, 1992.
[41] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff
Dean. Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing sys-
tems, pages 3111–3119, 2013.
[42] Kevin R Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B
Burkhardt, William S Chen, Kristina Yim, Antonia van den Elzen,
Matthew J Hirn, Ronald R Coifman, et al. Visualizing structure and
transitions in high-dimensional biological data. Nature biotechnology,
37(12):1482–1492, 2019.
[43] Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia object
image library (COIL-20). Technical report, 1996.
[44] Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia object
image library (COIL-100). Technical report, 1996.
[45] Karolyn A Oetjen, Katherine E Lindblad, Meghali Goswami, Gege Gui,
Pradeep K Dagur, Catherine Lai, Laura W Dillon, J Philip McCoy, and
Christopher S Hourigan. Human bone marrow assessment by single
cell rna sequencing, mass cytometry and flow cytometry. bioRxiv,
2018.
[46] Jong-Eun Park, Krzysztof Polanski, Kerstin Meyer, and Sarah A Te-
ichmann. Fast batch alignment of single cell transcriptomes unifies
multiple mouse cell atlases into an integrated landscape. bioRxiv, page
397042, 2018.
[47] Jose Daniel Gallego Posada. Simplicial autoencoders. 2018.
[48] Emily Riehl. A leisurely introduction to simplicial sets. Unpublished
expository article, available online at http://www.math.harvard.edu/~eriehl,
2011.
[49] Emily Riehl. Category theory in context. Courier Dover Publications,
2017.
[50] John W Sammon. A nonlinear mapping for data structure analysis.
IEEE Transactions on computers, 100(5):401–409, 1969.
[51] Josef Spidlen, Karin Breuer, Chad Rosenberg, Nikesh Kotecha, and
Ryan R Brinkman. FlowRepository: a resource of annotated flow cytometry
datasets associated with peer-reviewed publications. Cytometry
Part A, 81(9):727–731, 2012.
[52] David I Spivak. Metric realization of fuzzy simplicial sets. Self pub-
lished notes, 2012.
[53] Jian Tang. Largevis. https://github.com/lferry007/LargeVis,
2016.
[54] Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing
large-scale and high-dimensional data. In Proceedings of the 25th Inter-
national Conference on World Wide Web, pages 287–297. International
World Wide Web Conferences Steering Committee, 2016.
[55] Joshua B. Tenenbaum. Mapping a manifold of perceptual observa-
tions. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in
Neural Information Processing Systems 10, pages 682–688. MIT Press,
1998.
[56] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global
geometric framework for nonlinear dimensionality reduction. science,
290(5500):2319–2323, 2000.
[57] Dmitry Ulyanov. Multicore-tsne. https://github.com/
DmitryUlyanov/Multicore-TSNE, 2016.
[58] Laurens van der Maaten. Accelerating t-sne using tree-based algo-
rithms. Journal of machine learning research, 15(1):3221–3245, 2014.
[59] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using
t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
[60] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using
t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[61] John Williamson. What do numbers look like? https://johnhw.
github.io/umap_primes/index.md.html, 2018.
[62] Duoduo Wu, Joe Yeong, Grace Tan, Marion Chevrier, Josh Loh, Tony
Lim, and Jinmiao Chen. Comparison between umap and t-sne for
multiplex-immunofluorescence derived single-cell data from tissue
sections. bioRxiv, page 549659, 2019.
[63] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel
image dataset for benchmarking machine learning algorithms. CoRR,
abs/1708.07747, 2017.
[64] Liu Yang and Rong Jin. Distance metric learning: A comprehensive
survey. Michigan State Universiy, 2(2):4, 2006.
[65] Lotfi A Zadeh. Fuzzy sets. Information and Control, 8(3):338–353, 1965.
... In BERTopic, before clustering similar documents based on sentence embeddings, the dimension of embeddings is reduced. To reduce the dimensionality of document embeddings, instead of traditional PCA [24] or t-SNE [25], they used UMAP [26], which has been shown to preserve more local and global features of high-dimensional data in lower projected dimensions [26]. ...
... In BERTopic, before clustering similar documents based on sentence embeddings, the dimension of embeddings is reduced. To reduce the dimensionality of document embeddings, instead of traditional PCA [24] or t-SNE [25], they used UMAP [26], which has been shown to preserve more local and global features of high-dimensional data in lower projected dimensions [26]. ...
Article
Full-text available
Recently, open-domain question-answering systems have achieved tremendous progress because of developments in large language models (LLMs), and have successfully been applied to question-answering (QA) systems, or Chatbots. However, there has been little progress in open-domain question answering in the geographic domain. Existing open-domain question-answering research in the geographic domain relies heavily on rule-based semantic parsing approaches using few data. To develop intelligent GeoQA agents, it is crucial to build QA systems upon datasets that reflect the real users’ needs regarding the geographic domain. Existing studies have analyzed geographic questions using the geographic question corpora Microsoft MAchine Reading Comprehension (MS MARCO), comprising real-world user queries from Bing in terms of structural similarity, which does not discover the users’ interests. Therefore, we aimed to analyze location-related questions in MS MARCO based on semantic similarity, group similar questions into a cluster, and utilize the results to discover the users’ interests in the geographic domain. Using a sentence-embedding-based topic modeling approach to cluster semantically similar questions, we successfully obtained topic models that could gather semantically similar documents into a single cluster. Furthermore, we successfully discovered latent topics within a large collection of questions to guide practical GeoQA systems on relevant questions.
... Analysis of single cell RNA measurements Dimension reduction and trajectory analysis. Dimension reduction and trajectory analysis were performed on the filtered scRNAseq dataset (a matrix of 9785 genes × 1914 cells) as implemented in Monocle3 [70][71][72][73] . A brief description of the steps are as follows: (1) Using the preprocess_cds() function, the matrix was log2 transformed and dimensionality of the data was reduced using PCA (principal component analysis) to the top 50 principal components. ...
Article
Full-text available
A proper understanding of disease etiology will require longitudinal systems-scale reconstruction of the multitiered architecture of eukaryotic signaling. Here we combine state-of-the-art data acquisition platforms and bioinformatics tools to devise PAMAF, a workflow that simultaneously examines twelve omics modalities, i.e., protein abundance from whole-cells, nucleus, exosomes, secretome and membrane; N-glycosylation, phosphorylation; metabolites; mRNA, miRNA; and, in parallel, single-cell transcriptomes. We apply PAMAF in an established in vitro model of TGFβ-induced epithelial to mesenchymal transition (EMT) to quantify >61,000 molecules from 12 omics and 10 timepoints over 12 days. Bioinformatics analysis of this EMT-ExMap resource allowed us to identify; –topological coupling between omics, –four distinct cell states during EMT, –omics-specific kinetic paths, –stage-specific multi-omics characteristics, –distinct regulatory classes of genes, –ligand–receptor mediated intercellular crosstalk by integrating scRNAseq and subcellular proteomics, and –combinatorial drug targets (e.g., Hedgehog signaling and CAMK-II) to inhibit EMT, which we validate using a 3D mammary duct-on-a-chip platform. Overall, this study provides a resource on TGFβ signaling and EMT.
... We will assess our methodology in terms of F1 score. b) Similarity Retrieval: We evaluate our embeddings, e.g. for information retrieval applications [23], [24], projecting them from size 20 to 2 using a Uniform Manifold Approximation and Projection (UMAP) [25]. We perform a qualitative analysis showing its clustering capabilities. ...
Preprint
This work explores scene graphs as a distilled representation of high-level information for autonomous driving, applied to future driver-action prediction. Given the scarcity and strong imbalance of data samples, we propose a self-supervision pipeline to infer representative and well-separated embeddings. Key aspects are interpretability and explainability; as such, we embed in our architecture attention mechanisms that can create spatial and temporal heatmaps on the scene graphs. We evaluate our system on the ROAD dataset against a fully-supervised approach, showing the superiority of our training regime.
Article
Full-text available
Insoluble particles in ice cores record signatures of past climate parameters like vegetation dynamics, volcanic activity, and aridity. For some of them, the analytical detection relies on intensive bench microscopy investigation and requires dedicated sample preparation steps. Both are laborious, require in-depth knowledge, and often restrict sampling strategies. To help overcome these limitations, we present a framework based on flow imaging microscopy coupled to a deep neural network for autonomous image classification of ice core particles. We train the network to classify seven commonly found classes, namely mineral dust, felsic and mafic (basaltic) volcanic ash grains (tephra), three species of pollen (Corylus avellana, Quercus robur, Quercus suber), and contamination particles that may be introduced onto the ice core surface during core handling operations. The trained network achieves 96.8 % classification accuracy at test time. We present the system's potential and its limitations with respect to the detection of mineral dust, pollen grains, and tephra shards, using both controlled materials and real ice core samples. The methodology requires little sample material, is non-destructive, fully reproducible, and does not require any sample preparation procedures. The presented framework can bolster research in the field by cutting down processing time, supporting human-operated microscopy, and further unlocking the paleoclimate potential of ice core records by providing the opportunity to identify an array of ice core particles. Suggestions for an improved system to be deployed within a continuous flow analysis workflow are also presented.
Chapter
Graph-driven techniques have been widely used in chemoinformatics and bioinformatics. It is of a great beneficial to develop toxicity prediction models. However, toxicity mechanisms are so complicated that they cannot be well explained. Many toxicity-related molecular features have been designed or explored, and the stacking ensemble strategies of machine learning models have been often used to boost toxicity predictive power. Herein, we review graph kernel learning (GKL) techniques for predictive toxicity models. These GKL techniques are fully graph data-driven, involving composed of graph kernels, graph neural networks, and graph embeddings. We briefly introduce the fundamental elements and developments of the GKL techniques in chemoinformatics. We systematically collect and evaluate the performance of the GKL methods on the public toxicity data sets. Consequently, we discuss applications, challenges, and perspectives about the GKL techniques for toxicity-related problems. We hope this chapter could help better understand and guide applications of GKL in solving computational toxicity problems.
Preprint
Analysis of Electrochemical Impedance Spectroscopy (EIS) data for electrochemical systems often consists of defining an Equivalent Circuit Model (ECM) using expert knowledge and then optimizing the model parameters to deconvolute the various resistive, capacitive, inductive, or diffusion responses. For small data sets, this procedure can be conducted manually; however, it is not feasible to manually define a proper ECM for extensive data sets with a wide range of EIS responses. Automatic identification of an ECM would substantially accelerate the analysis of large sets of EIS data. Here, we showcase machine learning methods developed during the BatteryDEV hackathon to classify the ECMs of 9,300 EIS measurements provided by QuantumScape. The best-performing approach is a gradient-boosted tree model utilizing a library to automatically generate features, followed by a random forest model using the raw spectral data. A convolutional neural network using boolean images of Nyquist representations is presented as an alternative, although it achieves a lower accuracy. We publish the data and open-source the associated code. The approaches described in this article can serve as benchmarks for further studies. A key remaining challenge is that the labels contain uncertainty and human bias, underscored by the performance of the trained models.
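To give a flavor of the second approach (a random forest on raw spectra), here is a minimal sketch; the synthetic spectra, array shapes, and the number of ECM classes are placeholders, not the published hackathon data.

# Sketch: ECM classification from raw EIS spectra with a random forest.
# Synthetic spectra and labels stand in for the published hackathon data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_spectra, n_freqs = 9300, 60
X = rng.normal(size=(n_spectra, 2 * n_freqs))  # flattened Re(Z) and Im(Z)
y = rng.integers(0, 9, size=n_spectra)         # 9 hypothetical ECM classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")  # ~chance on random data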
Article
Cyclic immunohistochemistry (cycIHC) uses sequential rounds of colorimetric immunostaining and imaging for quantitative mapping of the location and number of cells of interest. Additionally, cycIHC benefits from the speed and simplicity of brightfield microscopy, making the collection of entire tissue sections and slides possible at a trivial cost compared to other high-dimensional imaging modalities. However, large cycIHC datasets currently require an expert data scientist to concatenate separate open-source tools for each step of image pre-processing, registration, and segmentation, or the use of proprietary software. Here, we present a unified and user-friendly pipeline for processing, aligning, and analyzing cycIHC data: Cyclic Analysis of Single-Cell Subsets and Tissue Territories (CASSATT). CASSATT registers scanned slide images across all rounds of staining, segments individual nuclei, and measures marker expression on each detected cell. Beyond straightforward single-cell data analysis outputs, CASSATT explores the spatial relationships between cell populations. By calculating the log odds of interaction frequencies between cell populations within tissues and tissue regions, the pipeline helps users identify populations of cells that interact, or do not interact, at frequencies greater than those occurring by chance. It also identifies specific neighborhoods of cells based on the assortment of neighboring cell types that surround each cell in the sample. The presence and location of these neighborhoods can be compared across slides or within distinct regions of a tissue. CASSATT is a fully open-source workflow tool developed to process cycIHC data and will allow greater utilization of this powerful staining technique.
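As a rough formalization of the log-odds idea (an illustrative assumption, not CASSATT's documented formula), one can compare the observed neighbor-pair frequency of two cell types against a permutation null in which cell-type labels are shuffled:

\[
  \mathrm{LOR}(a,b) \;=\;
  \log\frac{p_{\mathrm{obs}}(a,b)\,/\,\bigl(1 - p_{\mathrm{obs}}(a,b)\bigr)}
           {p_{\mathrm{null}}(a,b)\,/\,\bigl(1 - p_{\mathrm{null}}(a,b)\bigr)}
\]

where p_obs(a,b) is the observed fraction of neighbor pairs joining types a and b, p_null(a,b) is its expectation under label shuffling, and values above zero flag interactions more frequent than chance.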
Article
The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data. PHATE, a new data visualization tool, better preserves patterns in high-dimensional data after dimensionality reduction.
Article
Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE. t-SNE is widely used for dimensionality reduction and visualization of high-dimensional single-cell data. Here, the authors introduce a protocol to help avoid common shortcomings of t-SNE, for example, enabling preservation of the global structure of the data.
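A minimal sketch of two of the protocol's recommendations, using scikit-learn's TSNE as an assumed tool: PCA initialisation and a high learning rate. Multi-scale similarity kernels and downsampling-based initialisation are not exposed by scikit-learn's API, and the data below is a random placeholder.

# Sketch of the protocol's core recommendations with scikit-learn's TSNE:
# PCA initialisation and a high learning rate (~ n_samples / 12).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))  # stand-in for PCA-reduced expression data

tsne = TSNE(
    n_components=2,
    init="pca",                     # PCA initialisation preserves global layout
    learning_rate=X.shape[0] / 12,  # high-learning-rate heuristic
    early_exaggeration=12,
    random_state=0,
)
Y = tsne.fit_transform(X)           # 2-D embedding, shape (2000, 2)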
Article
Mammalian organogenesis is a remarkable process. Within a short timeframe, the cells of the three germ layers transform into an embryo that includes most of the major internal and external organs. Here we investigate the transcriptional dynamics of mouse organogenesis at single-cell resolution. Using single-cell combinatorial indexing, we profiled the transcriptomes of around 2 million cells derived from 61 embryos staged between 9.5 and 13.5 days of gestation, in a single experiment. The resulting ‘mouse organogenesis cell atlas’ (MOCA) provides a global view of developmental processes during this critical window. We use Monocle 3 to identify hundreds of cell types and 56 trajectories, many of which are detected only because of the depth of cellular coverage, and collectively define thousands of corresponding marker genes. We explore the dynamics of gene expression within cell types and trajectories over time, including focused analyses of the apical ectodermal ridge, limb mesenchyme and skeletal muscle.
Article
Four-dimensional scanning transmission electron microscopy (4D-STEM) of local atomic diffraction patterns is emerging as a powerful technique for probing intricate details of atomic structure and atomic electric fields. However, efficient processing and interpretation of large volumes of data remain challenging, especially for two-dimensional or light materials, because the diffraction signal recorded on the pixelated arrays is weak. Here we employ data-driven manifold learning approaches for straightforward visualization and exploratory analysis of 4D-STEM datasets, distilling real-space neighboring effects on atomically resolved deflection patterns from single-layer graphene, with single dopant atoms, as recorded on a pixelated detector. These extracted patterns relate to both individual atom sites and sublattice structures, effectively discriminating single dopant anomalies via multi-mode views. We believe manifold learning analysis will accelerate physics discoveries coupled between data-rich imaging mechanisms and materials such as ferroelectric, topological spin, and van der Waals heterostructures.
Article
Advances in single-cell technologies have enabled high-resolution dissection of tissue composition. Several tools for dimensionality reduction are available to analyze the large number of parameters generated in single-cell studies. Recently, a nonlinear dimensionality-reduction technique, uniform manifold approximation and projection (UMAP), was developed for the analysis of any type of high-dimensional data. Here we apply it to biological data, using three well-characterized mass cytometry and single-cell RNA sequencing datasets. Comparing the performance of UMAP with five other tools, we find that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters. The work highlights the use of UMAP for improved visualization and interpretation of single-cell data.
Article
We present Fashion-MNIST, a new dataset comprising 28x28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size, data format, and structure of training and testing splits. The dataset is freely available at https://github.com/zalandoresearch/fashion-mnist.
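Because the dataset mirrors MNIST's format, swapping it into an existing pipeline is a one-line change; a minimal loading sketch via tf.keras (one of several available loaders) follows.

# Minimal sketch: load Fashion-MNIST as a drop-in MNIST replacement.
# The same call signature applies to tf.keras.datasets.mnist.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = \
    tf.keras.datasets.fashion_mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
print(sorted(set(y_train)))         # 10 classes, labelled 0-9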
Article
t-distributed Stochastic Neighbor Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations, and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the 'early exaggeration' phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed. As a byproduct, the proof suggests novel ways of setting the exaggeration parameter α and step size h. Numerical examples illustrate the effectiveness of these rules: in particular, the quality of the embedding of topological structures (e.g. the Swiss roll) improves. We also discuss a connection to spectral clustering methods.
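For reference, a sketch in standard t-SNE notation: early exaggeration replaces the input affinities p_ij by α p_ij, so plain gradient descent with step size h updates each embedded point as (constant factors absorbed into h):

\[
  y_i \leftarrow y_i \;-\; h \sum_{j \neq i}
    \bigl(\alpha\, p_{ij} - q_{ij}\bigr)\,
    \frac{y_i - y_j}{1 + \lVert y_i - y_j \rVert^{2}}
\]

where q_ij are the low-dimensional Student-t similarities; the paper derives conditions on α and h under which points belonging to a well-separated cluster contract during this phase.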
Article
The hypothalamic arcuate–median eminence complex (Arc-ME) controls energy balance, fertility and growth through molecularly distinct cell types, many of which remain unknown. To catalog cell types in an unbiased way, we profiled gene expression in 20,921 individual cells in and around the adult mouse Arc-ME using Drop-seq. We identify 50 transcriptionally distinct Arc-ME cell populations, including a rare tanycyte population at the Arc-ME diffusion barrier, a new leptin-sensing neuron population, multiple agouti-related peptide (AgRP) and pro-opiomelanocortin (POMC) subtypes, and an orexigenic somatostatin neuron population. We extended Drop-seq to detect dynamic expression changes across relevant physiological perturbations, revealing cell type–specific responses to energy status, including distinct responses in AgRP and POMC neuron subtypes. Finally, integrating our data with human genome-wide association study data implicates two previously unknown neuron populations in the genetic control of obesity. This resource will accelerate biological discovery by providing insights into molecular and cell type diversity from which function can be inferred.