The Earth Mover’s Distance as a Metric for the Space of Inorganic
Compositions
Cameron J. Hargreaves, Matthew S. Dyer,*Michael W. Gaultois, Vitaliy A. Kurlin,
and Matthew J. Rosseinsky
Cite This: https://dx.doi.org/10.1021/acs.chemmater.0c03381
ABSTRACT: It is a core problem in any field to reliably tell how
close two objects are to being the same, and once this relation has
been established, we can use this information to precisely quantify
potential relationships, both analytically and with machine learning
(ML). For inorganic solids, the chemical composition is a
fundamental descriptor, which can be represented by assigning
the ratio of each element in the material to a vector. These vectors
are a convenient mathematical data structure for measuring
similarity, but unfortunately, the standard metric (the Euclidean
distance) gives little to no variance in the resultant distances
between chemically dissimilar compositions. We present the earth
mover’s distance (EMD) for inorganic compositions, a well-
defined metric which enables the measure of chemical similarity in
an explainable fashion. We compute the EMD between two compositions from the ratio of each of the elements and the absolute
distance between the elements on the modified Pettifor scale. This simple metric shows clear strength at distinguishing compounds
and is efficient to compute in practice. The resultant distances have greater alignment with chemical understanding than the
Euclidean distance, which is demonstrated on the binary compositions of the inorganic crystal structure database. The EMD is a
reliable numeric measure of chemical similarity that can be incorporated into automated workflows for a range of ML techniques.
We have found that with no supervision, the use of this metric gives a distinct partitioning of binary compounds into clear trends and
families of chemical properties, with future applications for nearest neighbor search queries in chemical database retrieval systems and
supervised ML techniques.
■INTRODUCTION
Even before Aristotle, philosophers sought to explain the
properties of materials through their elemental compositions.
For an experimental chemist, the first step in any investigation is
choosing which elements, and in what ratio, to put into the sample,
and the composition is arguably the most important
independent variable under control. In many functional
materials, where disorder is important to the functional
properties (such as electronic or ionic conductivity), the
elemental composition is a fundamental property that is well
described. This is because the nominal composition that is put
into a synthetic process is generally well-defined, and also
because there are extensive characterization methods to
experimentally determine the elemental composition.
Although the underlying theory has evolved considerably
since antiquity, the elemental composition of a material
continues to be a prime director of material properties, and we
know now the chemical composition largely dictates the nature
of the chemical bonding, which has a strong influence on the
crystal structure and physical properties. Similar compositions
lead to similar properties, and when estimating material
properties, it is important to consider the closest known
composition to the one being considered. These similarities
can be defined quantitatively in a distance function, which
returns a real valued number, such that identical objects have a
distance of 0, and less similar objects return a larger value. We
would expect that small changes in chemical makeup would
lead to correspondingly small variations in chemical property,
and that chemically dissimilar compounds may behave entirely
differently.
The chemist develops such understanding naturally through
their exploration of the sciences. While each practitioner may
have a personal tolerance for what they believe to be
“chemically similar,”two compositions which differ only by a
minor dopant or by the substitution of a similar element have
inarguable similarity. This relationship may be immediately
clear to the chemist, but in practice it is difficult to capture
these small physical changes numerically. In this paper, we
present a new technique to calculate the distance between two
compositions, which captures nuanced variations in stoichi-
ometry for both similar and dissimilar compounds.
By correlating the chemical composition of materials with their
observed behavior, we may automatically detect underlying statistical
relationships between the two. This can be used in an automated process to
usefully inform the chemistry, whether by relating a material to other
clusters of similar materials or by estimating its properties. This has been exploited implicitly by
modern machine learning (ML) methods, which have been
applied to capitalize on the strong determination of properties
by composition; there are many reports of regression models
to estimate material performance of inorganic solids from
compositions alone.1−3
For these models to be successful, we require two things: a
large collection of data and a method of differentiating these
such that we may uncover the subtle relationships which
govern a material’s properties. Having a metric to quantify
these relationships allows us to take our bearings and construct
maps of chemical space to enable clear exploration, providing
an awareness of compositional relationships between materials.
When predicting the properties of a new composition, we must
form an understanding of its relation to other reported
compounds, and the distinguishing quality of the similarity
metric chosen can vastly affect performance. The choice of
metric should therefore possess enough fidelity to give an
accurate representation of chemical relationships between
entries in a database of compositions and align with human
understanding.
Though widely used as a metric, the compositional
Euclidean distance (CED) can perform poorly at the task of
distinguishing compounds. A common method of encoding a
composition is to store the relative ratio of each element in the
compound to its associated index in a vector of length 103, for
each of the naturally stable elements. Taking x_i to be the
fraction of the ith element in a compound X, we take the CED
to a second such vector, Y, via the standard formula
$\sqrt{\sum_i (x_i - y_i)^2}$. Due to the sparsity of these vectors, the
CED overly simplifies and exaggerates physical differences. As
an example, taking the atomic number of the 103 stable
elements as our index, the compositional vectors of LiF and
BeO would be (0_1, 0_2, 0.5_3, 0_4, ..., 0_8, 0.5_9, ..., 0_103) and
(0_1, 0_2, 0_3, 0.5_4, ..., 0.5_8, 0_9, ..., 0_103),
respectively, where the subscripts denote the vector index. Taking the
nonzero elements with indices 3, 4, 8, and 9, the CED between these
vectors would thus be

$\sqrt{(0.5 - 0)^2 + (0 - 0.5)^2 + (0 - 0.5)^2 + (0.5 - 0)^2} = 1$
A third binary composition, BN, which with a compositional
vector of (0_1, ..., 0.5_5, 0_6, 0.5_7, ..., 0_103) is arguably less chemically
similar to LiF than BeO, would also have a CED of 1 to both of
these compounds, as demonstrated in Figure 1a,b. A CED of 1
would be calculated between any two binary compositions
which did not have a common element. This discrete nature of
the CED does not provide an accurate distinction between
compounds which may be entirely different chemically, and
while this can capture local trends in a chemical data set, global
information may be lost. We can improve on this shortcoming
by incorporating a measure of elemental similarity which may
be applied to a compositional vector directly.
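As a minimal numerical check of this degeneracy (an illustrative sketch, not code from this work), the snippet below builds fractional composition vectors indexed by atomic number and confirms that LiF, BeO, and BN are pairwise separated by a CED of exactly 1:

```python
import numpy as np

# Atomic numbers used as vector indices (1-based), as in the example above.
ATOMIC_NUMBER = {"Li": 3, "Be": 4, "B": 5, "N": 7, "O": 8, "F": 9}

def composition_vector(fractions, length=103):
    """Place the fractional amount of each element at its atomic-number index."""
    vec = np.zeros(length)
    for element, fraction in fractions.items():
        vec[ATOMIC_NUMBER[element] - 1] = fraction
    return vec

def ced(x, y):
    """Compositional Euclidean distance between two composition vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

LiF = composition_vector({"Li": 0.5, "F": 0.5})
BeO = composition_vector({"Be": 0.5, "O": 0.5})
BN = composition_vector({"B": 0.5, "N": 0.5})

# Every pair of binary compounds with no shared element sits at distance 1.
print(ced(LiF, BeO), ced(LiF, BN), ced(BeO, BN))  # 1.0 1.0 1.0
```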
The earth mover's distance (EMD) is a metric which is well-constructed to pair elements between compositions and, from this pairing, to judge their similarity; it has had successful applications in multiple fields.4−6 The EMD may analogously be thought of as the minimal amount of work to move piles of earth to fill pits of equal overall volume but different shapes, a long studied transportation problem7 with fast algorithmic implementations.8,9 This consistently returns a unitless quantity of work which may be interpreted as a measure of distance.

Figure 1. In (a,b), the CED is demonstrated between compositional vectors of LiF, BeO, and BN by taking the absolute difference at atomic sites, with the atomic number as index. In (c,d), the equivalent EMD is shown, where we instead match elements with one another. By calculating the cost to transport each atom along the modified Pettifor scale, we arrive at a distance which is reflective of the chemical similarity. While each of these elements has a similar atomic number, they possess well-known chemical differences, which is displayed by the greater variance in the output.
We could assign the atomic number as the vector index for
each element, then take the difference between indices as a
measure of elemental similarity, but this approach loses the
natural clustering of chemical properties afforded by the
periodic table. An ideal elemental indexing would perfectly
capture the chemical trends observed in nature, but ordering
the elements in such a manner is problematic. As well as the
unclear resolution of how to handle the f-block elements,
chemical trends moving down the periodic table tend to be the
direct opposite of those moving across. This leads to some
elements having greater substitutional feasibility to their
diagonal neighbor than their immediate neighbor, making a
simple placement of these difficult.
To solve this problem, Pettifor proposed a method of labeling the elemental scale in his seminal paper of 1984,10 drawn from extensive domain knowledge. These numeric labels may form the basis of a coordinate system allowing us to associate patterns in geometric and physiochemical properties, with extensions to this idea continuing to guide practitioners.11,12 This concept of labeling was further developed by Glawe et al.,13 who analyzed the probability that an element can be substituted for another within the same structural framework across 20,500 compounds of the inorganic crystal structure database (ICSD). This probability matrix can be reordered to maximize the likelihood that local neighborhoods will contain elements with greater feasibility of stable substitutions, thus possessing inherent chemical similarities.14 We take the associated indices of this final ordering to give each element its modified Pettifor number.
In this report, we define a composition vector by taking the ratio of each element in a compound assigned to the index of its respective modified Pettifor number. By assuming that these compounds sample the set of feasibly stable compounds (although we know this is not strictly the case15), we can see that these indices capture truly physical similarities between elements from statistical analysis. Using the modified Pettifor scale gives resultant similarities between compounds which align with human judgement, but the scale may be substituted with any continuous elemental scale, including less equally spaced distributions such as, for example, Pauling electronegativity.
This gives us the ability to place new compositions within
the context of previously reported compounds, allowing us to
attribute properties to these before the lengthy process of
synthesis. We can do this automatically with ML techniques,
where the EMD forms part of the workflow to predict
properties quantitatively. We may additionally assign proper-
ties to compositions qualitatively by simply searching through
multiple databases to find the most similar existing entries.
This second approach requires the practitioner’s judgement on
whether to take the property of the closest match, an average
of many similar compounds, or to conclude that the reported
landscape is not sufficiently complete to make an accurate
judgement.
■METHODS
The PyCifRW package (version 4.4.1) was used to extract the composition from each crystallographic information file (cif) of the ICSD (2017).16 These compositions were processed into vectors with our python implementation, where the EMD between two vectors is computed via the network simplex algorithm. All embeddings are created by first constructing a complete distance matrix between compositions in a data set and providing this as an input to the chosen technique. Density-based spatial clustering of applications with noise (DBSCAN)17 parameters were tuned by hand until clusters were obtained which clearly bound regions of similarity of the periodic table. Implementation details of uniform manifold approximation and projection (UMAP) and principal component analysis (PCA) may be found in refs 18 and 19, respectively. All code was written in python 3.7, with the matplotlib (3.1.3) and datashader (0.10.0) libraries used to plot the binary and complete maps, respectively.
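The overall embedding workflow described above can be sketched as follows. This is an illustration rather than the released code, and it assumes a list of composition vectors and an EMD function are already available (for instance the linear-programming sketch given after eq 1 below). umap-learn accepts a precomputed distance matrix, and DBSCAN is then applied to the embedded coordinates; the parameter values shown are placeholders for the hand tuning described above.

```python
import numpy as np
import umap                      # umap-learn
from sklearn.cluster import DBSCAN

def pairwise_distance_matrix(vectors, metric):
    """Complete symmetric distance matrix between all composition vectors."""
    n = len(vectors)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = metric(vectors[i], vectors[j])
    return D

def embed_and_cluster(vectors, metric, eps=0.5, min_samples=10):
    """Embed precomputed distances with UMAP, then label clusters with DBSCAN."""
    D = pairwise_distance_matrix(vectors, metric)
    # n_neighbors=15 mirrors the neighborhood size used for the binary maps;
    # eps and min_samples are illustrative and would be tuned by hand.
    coords = umap.UMAP(metric="precomputed", n_neighbors=15).fit_transform(D)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    return coords, labels

# Usage (assuming `compositions` is a list of composition vectors and `emd`
# is an EMD function such as the linear-programming sketch given later):
# coords, labels = embed_and_cluster(compositions, emd)
```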
■RESULTS
EMD. We take an initial matching by pairing each of the m elements in a vector, X, to its most similar unmatched partner in the n elements of a second vector, Y, until all have been paired. The quantity matched, q, from the i-th element of X to the j-th element of Y, is given by q_ij. A cost is calculated (eq 1a) by summing all quantity, q, paired through each matching multiplied by the difference p_i − p_j in indices on the modified Pettifor scale, p, between the elements matched. When all elements have been paired, this is a feasible solution to the problem; however, it may not be an optimally minimal solution. Given two vectors we take a feasible matching and successively improve this via the network simplex algorithm until the total summed cost is verified to be optimal.
For a compositional vector X, $\sum_{i=1}^{m} x_i = 1$, and therefore the total quantity matched with any other vector will also equal 1. As we are describing a transformation between two distributions with the same total sum of 1, this satisfies the axioms of a metric space, as proven in ref 4, Appendix A. The formal definition of the EMD between two compositional vectors X = (x_1, ..., x_m) and Y = (y_1, ..., y_n), as given by eq 7 of ref 20, is thus

$$\mathrm{EMD}(X, Y) = \min \sum_{i=1}^{m} \sum_{j=1}^{n} q_{ij} \, |p_i - p_j| \qquad (1a)$$

subject to

$$q_{ij} \geq 0 \quad \text{for any } i, j \qquad (1b)$$

$$\sum_{j=1}^{n} q_{ij} \leq x_i \quad \text{for any } 1 \leq i \leq m \qquad (1c)$$

$$\sum_{i=1}^{m} q_{ij} \leq y_j \quad \text{for any } 1 \leq j \leq n \qquad (1d)$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n} q_{ij} = 1 \qquad (1e)$$

Constraint 1b defines that we may only match a nonnegative quantity from X to Y; 1c and 1d state that each element will only pair to another up to its ratio. The final constraint ensures that all of the elements in X are matched to an element in Y such that a feasible solution has been achieved.
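To make the optimization concrete, the following sketch solves exactly this linear program with a general-purpose solver (scipy.optimize.linprog) rather than the network simplex implementation used in this work. It assumes normalized composition vectors and, by default, unit spacing between vector indices, as on the modified Pettifor scale.

```python
import numpy as np
from scipy.optimize import linprog

def emd(x, y, scale=None):
    """Earth mover's distance between two composition vectors (eq 1),
    solved as a linear program.  `scale` gives the position of each vector
    index on the chosen elemental scale; by default the index itself is used
    (unit spacing, as for the modified Pettifor numbers)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if scale is None:
        scale = np.arange(len(x))
    i_idx, j_idx = np.nonzero(x)[0], np.nonzero(y)[0]
    m, n = len(i_idx), len(j_idx)

    # Objective: cost of moving one unit of composition from element i to element j.
    cost = np.abs(scale[i_idx][:, None] - scale[j_idx][None, :]).ravel()

    # Constraints 1c and 1d: each source may send at most x_i,
    # each destination may receive at most y_j.
    A_ub = np.zeros((m + n, m * n))
    for a in range(m):
        A_ub[a, a * n:(a + 1) * n] = 1.0
    for b in range(n):
        A_ub[m + b, b::n] = 1.0
    b_ub = np.concatenate([x[i_idx], y[j_idx]])

    # Constraint 1e: all of the (unit) mass must be moved;
    # the default variable bounds (0, None) enforce constraint 1b.
    A_eq = np.ones((1, m * n))
    b_eq = [x[i_idx].sum()]

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  method="highs")
    return res.fun
```

Because the elements lie on a one-dimensional scale, the same optimum can also be obtained directly from the cumulative distributions (for example with scipy.stats.wasserstein_distance), which helps explain why the metric is efficient to compute in practice.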
Taking three candidate solid-state electrolytes with known
dissimilarity in composition and structure to exemplify this, as
shown in Figure 2, we can see how the EMD allows greater
depth of analysis when defining chemical similarity compared
to the CED. From the figure, we see how the solution not only
gives us the measure of distance but also the quantity of
elements that are paired to one another. We may apply this to
any two chemical formulae, enabling us to highlight chemically
similar substitutions, and thus familial relation, which may not
have been immediately obvious from the compound formula.
A reference implementation may be found at https://github.com/lrcfmd/ElMD/.
Pairing Structures to Compositions. It is widely recognized that composition is not the sole determinant of physical performance: crystal structure plays a fundamental role in determining properties, and can be dependent on many different length scales. Codifying these structures such that we may compare them for similarity has known difficulties21 due to the periodicity of inorganic systems. For organic molecules, there exist methods of formally encoding a structure derived from the strict lexicographic conventions of organic chemistry,22,23 and methods of encoding an inorganic crystal's local environment and symmetry have been successfully implemented.24−26 Structural features are a known asset to ML models, and the addition of this information generally gives stronger predictive performance at screening and property prediction.27−29
Unfortunately, structural information is often not reported
in tandem with experimentally determined chemophysical
properties, and many such properties are reported from solid
solutions where a similar reported structure may not even exist.
In many cases, only the composition and the property under
investigation will be reported, leaving a fragmented data
landscape with a barrier between databases. With the number
of reported compounds untenable for any person to feasibly
audit, we must bring this information together in an automated
manner. The EMD allows us to connect compounds to their
closest determined structure in such a fashion, allowing
databases with compositional information to be joined to
databases of structural information.
Using the EMD we may now pair query formulae, including
those which have never been synthesized, to their most similar
compositions in one of the many chemical databases such as
the ICSD (2017), consisting of 188,631 cifs. A recent review of
materials with reported ionic conductivity was undertaken with
842 compounds identified. Each compound had a comparative
search applied to every ICSD entry, and these pairings were
analyzed by a team of 21 researchers at the Materials
Innovation Factory, University of Liverpool, with the quality
of these matchings assessed. Of these compounds, 528 had a
perfect match to a cif or an exact match under a minor change
in stoichiometry. A further 254 compounds have a matching cif
with a small number of elemental substitutions and are similar
in crystal structures. The remaining 60 formulae did not find a
good match, mostly due to the materials being reported more
recently than any database entries. In Table 1, we see some
commonly cited compounds from this field and their closest
matches in the ICSD. Clearly, a distance of zero gives an exact
match barring polymorphs; however, there remains the more
general case where an exact structure has not been reported.
We can see that in each example a chemically similar compound has been returned, and we would expect the extracted structural features to have a high degree of correlation with the true structural information.
Figure 2. All feasible solutions to compute the EMD from Li1.3Al0.3Ti1.7(PO4)3 → La0.5Li0.35TiO3 are presented in (a), where the area of each disk represents the fraction of the element in the compound, and the width of each arc shows the maximal quantity which could be theoretically matched within the constraints of the problem. The optimal matching is shown in (b), giving a resultant distance between these of 12.70. For an arguably less similar compound, Li6PS5Cl, these chemical differences are reflected by the greater distance of 31.08 (c). Taking the final distance from Li6PS5Cl → La0.5Li0.35TiO3 of 18.38 (d), it can be seen that a simple embedding of these three compounds may be constructed (e).
Table 1. Top Three Most Similar Results when Querying Some Commonly Cited Solid-State Electrolytes against the ICSD (2017) with EMD^a

query                       three closest matches           EMD
Li1.3Al0.3Ti1.7(PO4)3       Na1.261Al0.302Ti1.696(PO4)3     0.231
                            Li1.2Al0.2Ti1.8(PO4)3           0.302
                            Li1.4Al0.4Ge0.4Ti1.2(PO4)3      0.368
Li10GeP2S12                 Li10GeP2S12                     0.000
                            Li10SnP2S12                     0.040
                            Li9.81Sn0.81P2.19S12            0.439
Li6PS5Cl                    Li6AsS5I                        0.231
                            Li6PO5Cl                        0.385
                            Li6PO5Br                        0.462
Li5La3Nb2O12                Li5La3Nb2O12                    0.000
                            Li5La3Ta2O12                    0.091
                            Li5.08La3Ta1.51Zr0.39O12        0.322
Li1.5Al0.5Ge1.5(PO4)3       Na1.5Sn1.5Sb0.3(PO4)3           0.543
                            BaGa(PO4)2                      0.712
                            BaSn(PO4)2                      0.745
Li7La3Zr2O12                Li7La3Zr2O12                    0.000
                            Li7La3Hf2O12                    0.083
                            Li7.1La3(Zr1.9Cr0.1)O12         0.200
Li14Zn(GeO4)4               Li14Zn(GeO4)4                   0.000
                            Li6Ge2O7                        0.410
                            Li6(Si2O7)                      0.543

^a While there are only four queries with exact matches, owing to how recently these electrolytes were reported, it may be seen that the remainder are chemically similar.
While 8.43%
of the compounds have been poorly matched, the number of
false matches can be reduced by filtering the data set with the
application of a maximum threshold value. By removing all
matches which have a distance greater than 1, we discard 12%
of the entries but improve the false positive rate to 5.7%.
Although caution should be applied as known errors are introduced, this provides an efficient method for the automated creation of data sets on the scale required to uncover complex statistical relationships.
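A minimal sketch of such a nearest-composition lookup with a distance cutoff is given below; it is illustrative only, and the `vectorise` helper named in the usage comment is hypothetical.

```python
import numpy as np

def closest_matches(query_vec, database_vecs, database_names, metric,
                    k=3, max_distance=1.0):
    """Return up to k database entries closest to the query under `metric`,
    discarding any match further away than `max_distance`."""
    distances = np.array([metric(query_vec, v) for v in database_vecs])
    order = np.argsort(distances)[:k]
    return [(database_names[i], distances[i])
            for i in order if distances[i] <= max_distance]

# Usage (assuming `icsd_vectors`/`icsd_formulae` hold the database, `emd` is
# the distance function sketched earlier, and `vectorise` is a hypothetical
# formula-to-vector parser):
# closest_matches(vectorise("Li6PS5Cl"), icsd_vectors, icsd_formulae, emd)
```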
By assessing those matchings in Table 1 that are imperfect,
we may see how the top ranked match remains structurally
related, with the remainder being simple dopings and
substitutions. Of interest is Li1.5Al0.5Ge1.5(PO4)3 in row 5: while there is a clear chemical relation between these three compounds, we see some dissimilarity in their structures, as Na1.5Sn1.5Sb0.3(PO4)3, BaGa(PO4)2, and BaSn(PO4)2 represent the NASICON and pyrosilicate phases. It is likely, perhaps
certain, that the chemist develops a deeper understanding of
the relations between the compounds they study, their
combinations, and their behaviors under environmental
conditions, than can be captured by a simple number. An
engineered representation of compositions has however
allowed us to express chemical relationships which have not
previously been possible to express quantifiably. With a clearly
defined metric of chemical similarity, we may use this as we
would any other distance, with additional confidence that the
underlying mechanics are mathematically aligned with
chemical knowledge.
Mapping Compositional Space. The discovery of new
materials has always been data driven, and mapping
compositions to predict the existence of structures is a time-
honored technique in crystallography.30,31 The visual medium
provides a tangible clarity to the human reader, where abstract
relations between compounds can be difficult to conceptualize
through numerical analysis. The EMD in conjunction with
modern visualization techniques has a clear application in this
regard, giving the ability to plot detailed maps which clearly
align with known chemical clustering. The metric space is
given by compositional vectors in 103 dimensions and their
relationships with respect to the EMD.
This space and its induced structure have a mathematically
complex geometry, and as we only possess the distances
between points, we do not have the two-dimensional
coordinates that are required for plotting. We may use
dimensionality reduction techniques to generate these
coordinates, and the resultant points are called an embedding
of the space. By measuring a line between embedded points
with a ruler, distance may be used in the ordinary meaning of
the word to define the similarity between two points. As it is
generally impossible to represent a complex space without
distorting the relationships between points, many dimension-
ality reduction techniques exist, each with their own focus at
emphasizing specific relationships across a data set. In general,
we wish to align the distance between points with the
associated EMD between compositions such that our
embeddings give a valuable representation of the metric
space. In this paper, we discuss embeddings produced by
UMAP,17 which gives clustered plots which allow the
qualitative assessment of chemical data sets with unsupervised
ML, and PCA which we find gives more accurate
representations of the relationship between points, with less
overall distortion from the true positions.
In the UMAP algorithm, every composition is represented
by a point and edges to each of the 15 most similar compounds
in the data set calculated with respect to the EMD. It is not
however possible to plot these distances directly, due to the
contradictory information that arises when embedding a graph
of high degree to the plane. Approximations of the metric
distances are realized in two-dimensional Euclidean space by
constructing an inaccurate embedding of the point cloud, and
refining the positions of the points along edges to each
neighbor, such that distances between them align with the true
EMD, with respect to local cluster density. In doing this, we
disregard the majority of the inter-compound distances, yet
retain a skeletal backbone which follows the local trends of the
data, pulling together clusters of similar compounds. The
resultant two-dimensional plots are highly distorted from their
true positions in the metric space, but in a manner which draws
out the most prominent global trends of a data set, from which
we may pick out clear patterns and clusters both manually and
automatically.
Binary Compositions. The binary compounds have
simple compositions for us to demonstrate the efficacy of the
EMD and its alignment with domain knowledge. The 12,623
binary compounds in the ICSD were identified, the complete
inter-compound distance matrix calculated with respect to the
EMD, and the resultant distances reduced to two dimensions
with the application of UMAP (Figure 3a). Each of the clearly separated clusters tends to contain AB pairs from the same, or similar, families of the periodic table, with trends across clusters following expected transitions in chemical composition
through the modified Pettifor scale. In some clusters, there is
greater chemical discontinuity, yet there are trends across
regions of these points which follow smooth variations in AB
ratio.
Adding chemical labels (e.g., which blocks of the periodic table are found in the compound) demonstrates how trends in chemical properties are preserved when using the EMD (Figure 3). As there are no experimental properties barring chemical formula and atomic positions reported in cifs, we must derive known features from the composition alone.

Figure 3. Here, we see 12,623 binary composition vectors from the ICSD, with the EMD (a) and CED (b) between these calculated and reduced to two-dimensional coordinates using UMAP. With the application of color labels to display the placement of elements present in the composition across the periodic table, it can be seen that the EMD has separated the space into complex-shaped clusters of related chemical families which have strong alignment with chemists' perception of similarity, which is not present with the CED. The maps produced with the CED contain many isolated clusters with trivial shapes and few members, with poor resolution of chemistry (i.e., many labels within a cluster).
In binary compositions, we know that the block of the periodic table that each of the two elements comes from will play a significant role in the resultant chemical properties. By labeling
these blocks, we can immediately see how UMAP has
partitioned the space into clusters of compositions from the
same, or arguably similar, blocks of the periodic table. A clear
example is the pink cluster in the upper left of the map, where
we may find every compound in the ICSD containing two p-
block elements. By embedding approximations of distances
from the metric space, these chemical maps have been given a
structure which aligns with domain knowledge. This alignment
arises because the EMD preserves chemical relationships
between elements, and thus the chemical context from the
periodic table is present in the metric, which allows reference
between elements (and by extension regions of the periodic
table) ensuring that these trends are well captured.
Maps of inorganic compositional space have previously been
created with the CED;32 however, for these maps to possess a
structure which aligns with chemical judgement, any method
employing CED requires a high incidence of compounds with
shared elements. When a similar methodology is applied to the
entire periodic table (e.g., binary compositions of the ICSD
shown in Figure 3b), all detailed structure present when using
the EMD is lost. When using the EMD, compounds with
elements from similar regions or blocks of the periodic table
are clustered in groups with nontrivial shapes, high purity (i.e.,
a low number of labels per cluster), with a sensible relationship
between clusters (Figure 3a). When using the CED, all these
desirable properties are lost (Figure 3b); clusters have trivial
shapes with little variation, are impure (i.e., have a high number
of labels in each cluster with combinations of each block in the
periodic table evenly distributed across the map), and the
clusters are evenly distributed throughout the projection.
Furthermore, rather than large, connected clusters generated
when using EMD, using the CED we get small and often
isolated islands, with no clear relationship between these.
Because compositions without shared elements have roughly
equal distances under the CED, there are not enough global
points of reference to place clusters in relation to one another
with fidelity. There is dense clustering of compounds with
similar stoichiometry (i.e., shared elements) due to the
comparatively small distances between these, making it difficult
to differentiate points within clusters. This shows how a metric with qualitatively poor ability at distinguishing compositions will lead to quantitative confusion of known chemical relationships. While the CED may provide enough distinguishing quality to be of benefit when applied to certain chemical data sets (e.g., where there are shared elements), the lack of chemical relationship between elements leads to a loss of discernibility and is a guaranteed source of noise for models. By processing compositional vectors with respect to the EMD, chemical relationships and context have been preserved to a high enough standard that they may be captured reliably with automated methods.

Figure 4. In (a) we see how the same embedding of binary compounds from Figure 3a may be segmented into 26 distinct clusters using the DBSCAN algorithm, with a full analysis of these provided in the Supporting Information. The AB2 compounds of cluster 13 are given in (b), where there are clear chemical trends. Here, the ordered grouping is a clear reflection of the landscape of reported compounds and the relative stability of the AB2 structure prototype under different elemental doping.
The use of the EMD enables the comparison between
different choices of the elemental scale defining the indices in
the compositional vector. This is not the case for the CED,
where the distance is the same regardless of the elemental scale
chosen. Even when using simple atomic numbers as the
elemental index, the EMD introduces a significant structure to
the UMAP generated clusters, leading to clusters with
nontrivial shapes, however without the purity of labels
observed when using the modified Pettifor scale (Figure S1).
Elemental scales such as Pettifor's original Mendeleev number13 and alternate orderings of this scale33 result in plots with similar cluster shapes and purity to the modified Pettifor scale (Figures S2−S6). An alternative approach to the use of compositional vectors X and Y is the use of recently developed vectors of features which are derived from values of physicochemical properties of the elements present in the composition.34−36 Application of UMAP to the Euclidean distances between the magpie features37 of these binary compounds results in clusters with low levels of chemical purity, similar to the results obtained using the EMD with the atomic number scale (Figure S8).
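The scale dependence noted at the start of this paragraph can be demonstrated directly: permuting the element-to-index assignment leaves the CED unchanged but changes the EMD, because the ground distance between indices changes. In the sketch below the "permuted" ordering is a toy stand-in for a chemically informed scale, not the modified Pettifor numbering itself.

```python
import numpy as np

def ced(x, y):
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def emd_1d(x, y):
    """EMD for normalised vectors on a scale with unit index spacing:
    the area between the two cumulative distributions."""
    return np.sum(np.abs(np.cumsum(np.asarray(x, float) - np.asarray(y, float))))

def vector(fractions, index, length=10):
    """Toy composition vector using the supplied element -> index lookup."""
    v = np.zeros(length)
    for el, f in fractions.items():
        v[index[el]] = f
    return v

# Two toy orderings of the same handful of elements: atomic number and a
# hypothetical permutation standing in for a chemically informed scale.
atomic = {"Li": 3, "Be": 4, "O": 8, "F": 9}
permuted = {"Li": 0, "Be": 6, "O": 2, "F": 1}

for index in (atomic, permuted):
    LiF = vector({"Li": 0.5, "F": 0.5}, index)
    BeO = vector({"Be": 0.5, "O": 0.5}, index)
    print(ced(LiF, BeO), emd_1d(LiF, BeO))
# The CED is 1.0 under both orderings; the EMD changes with the scale.
```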
The labels in Figure 4a are assigned with the density-based clustering algorithm DBSCAN38 on the points obtained by the EMD and UMAP, which assigns class labels to groups of points such that clusters which have been plotted closely together on the plane share a label. As there are few pre-existing chemical
properties we can attribute to this data set, we rely on
unsupervised learning to gain insights from our data with ML.
Here, the EMD gives strong separation into chemical clusters
with clear patterns in atomic trends, the most prominent
example being the zoom of cluster 13, as shown in Figure 4b,
such that most compounds are of the form AB2 with the A ion being a lanthanide and the B ion being a transition metal. It is worth noting here that the analogous YB2 phases (orange dots)
are classified with the other lanthanides (brown dots),
reflecting the common practice of grouping them together as
the rare-earth metals. Another notable feature is cluster 20,
containing the entirety of the transition-metal d−d compounds
with increased concentrations of each transition metal as we
progress around the crescent. After cluster 13, the second
clearest example of parallel trendlines in chemical features can
be seen to the left of cluster 6, with the general form AB3, A being an f-block metal and B being a p-block metal. From the
top left of the cluster to the bottom right, A ions follow the
Pettifor scale. Across each successive line from left to right, the
B ions progress through Al, Ga, In, Tl, Pb, Sn, and finally Ge.
Complete analysis of each DBSCAN cluster is given in the
Supporting Information. With no prior chemical knowledge of
these compounds, we can draw attention to underlying
chemical properties, providing visually qualitative maps, capturing families of clear relation.

Figure 5. 125,627 compositions from the ICSD with their inter-compound EMD calculated and resultant distances reduced to two-dimensional coordinates with both UMAP (a) and PCA (b). Three candidate solid-state electrolytes are overlaid with the planar distances between points labeled. In addition, the standard deviation of electronegativity for the elements in each compound is given by the coloring from red (more covalent) to blue (more ionic). It can be seen that UMAP has accentuated some of the more subtle aspects of chemical similarity by distorting the more accurate representation given to us by PCA, where many regions are too densely plotted to make out clearly.
Inorganic Crystal Structure Database. For each of the 125,627 unique compound formulae in the ICSD, we may apply the same process, but defining clusters becomes difficult due to the scale of the task. We may instead attribute known
chemistry about each composition to uncover underlying
trends in the data. In Figure 5a, we see these compounds
plotted via UMAP and colored by taking the standard
deviation of the respective electronegativities of the constituent
atoms. We calculate this by taking the associated Pauling electronegativity for each of the nonzero elements in a compositional vector, giving a set of electronegativities, E, of length n. The average electronegativity of the set, $\bar{e}$, is calculated, and the standard deviation is obtained via the standard formula

$$\mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{n} (e_i - \bar{e})^2}{n - 1}}$$
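A short sketch of this calculation for the elements of La0.5Li0.35TiO3, using Pauling electronegativity values quoted from standard tables, is given below.

```python
import numpy as np

# Pauling electronegativities of the elements in the example compound,
# quoted to two decimal places from standard tables.
PAULING = {"Li": 0.98, "La": 1.10, "Ti": 1.54, "O": 3.44}

def electronegativity_sd(elements):
    """Sample standard deviation of the Pauling electronegativities of the
    distinct elements present in a composition (the coloring in Figure 5)."""
    e = np.array([PAULING[el] for el in elements])
    return np.sqrt(np.sum((e - e.mean()) ** 2) / (len(e) - 1))

print(electronegativity_sd(["La", "Li", "Ti", "O"]))  # ~1.14 for La0.5Li0.35TiO3
```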
This simple measure reveals a clear trend in chemical property across the reported compounds, from the more ionic compounds on the right side of Figure 5a to the more covalently bonded compounds along the left boundary.
It should be noted that the UMAP algorithm emphasizes the
clusters of a metric space. When optimizing distances, UMAP
ensures that clusters of compounds are closely packed within
families and clearly separated from other clusters on the plane.
Unsupervised density-based clustering algorithms such as
DBSCAN therefore work consistently and effectively on the
resultant plots, allowing the swift classification of new
compounds from existing knowledge.
While local neighborhoods will have similar structure to the
metric space, the global trends appear warped. This is
highlighted by the three SSEs from the previous section
overlaid on Figure 5. It can be seen that these do follow the
approximate similarity given to us from EMD but have been
distorted from the perfect line they fall on in the EMD metric
space. We may take the local distance between each of the
embedded points, and by calculating the Pearson’s correlation
between each of these and their associated EMD, the quality of
these embeddings may be assessed. While UMAP has given
value by separating these projections into clusters of familial
relation, by referring to Table 2, with a correlation of 0.748
many of the distances have been distorted from their true
values, making these potentially unsuitable inputs for
regression tasks. We would expect that reducing our distances
to higher dimensional coordinate systems would give UMAP
more degrees of freedom when embedding a graph layout;
however, the correlation does not improve past two
dimensions, in part due to the implementation’s focus on
planar projections.
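This check can be sketched as follows, assuming the embedded coordinates and the full EMD matrix are held as numpy arrays (the function name is illustrative):

```python
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import pdist

def embedding_quality(coords, emd_matrix):
    """Pearson correlation between the pairwise Euclidean distances of the
    embedded points and the corresponding EMD values."""
    embedded = pdist(coords)                      # condensed Euclidean distances
    n = emd_matrix.shape[0]
    true = emd_matrix[np.triu_indices(n, k=1)]    # matching condensed EMD values
    return pearsonr(embedded, true)[0]
```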
Principal Component Analysis. A truer picture of the metric space may be obtained via PCA, a dimensionality reduction technique widely used in the natural sciences for projecting data along axes of greatest variance. In practice, this compresses every one of the distances linearly, and as there is rarely an embedding of a higher dimensional point cloud in a lower dimensional Euclidean space which respects the global structure perfectly, this often creates overcrowded plots with a loss of intrinsic structure. When applied to our data set, this does create overly dense regions of points, making them unsuitable for the automated identification of clusters with an algorithm such as DBSCAN, but it retains a strong resemblance to the true structure of the metric space, as shown in Figure 5b.
We take an initial embedding of the EMD inter-compound distance matrix by considering each row as a centered embedded point, which is produced by taking each value in the distance matrix and forming a Gram-centered matrix.39
These normalized points can then be reduced with PCA, giving
us a linear projection of the data set. We have found that even
in lower dimensional spaces, the local Euclidean distances
between points retain a reasonably high degree of correlation
with the EMD. A full discussion on the structure of this space
is beyond the scope of this paper, but it should be noted that
the space has a non-Euclidean geometry, with a brief
introduction to this topic given in the Supporting Information.
This may make the EMD unsuitable for direct application in
some clustering and supervised learning algorithms, without
prior dimensionality reduction. In three dimensions, with a
correlation of 0.945, we may take these as semi-reliable
reduced composition vectors with respect to the Euclidean
distance. Embedding to higher dimensions with PCA does not
improve on this correlation, as the underlying space is seen to be approximately two-dimensional,40 with an observed global saddle shape in three dimensions.
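One standard way to realize this centering step is classical multidimensional scaling: double-center the squared distance matrix to recover the Gram matrix and project onto its leading eigenvectors, which is equivalent to PCA on the centered points. The sketch below follows that textbook recipe and only approximates the exact procedure used here.

```python
import numpy as np

def classical_mds(D, n_components=3):
    """Embed a distance matrix by double-centering its squared distances
    (the Gram matrix) and projecting onto the leading eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of centered points
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues signal non-Euclidean structure; clip them to zero.
    lam = np.clip(eigvals[order], 0, None)
    return eigvecs[:, order] * np.sqrt(lam)
```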
It can be seen that each of the single elements may be found
along the bottom edge of the plot at the tip of each of the
parabolas. Trending away from these are the associated binary
and ternary compounds in divergent lines of placement. We
can clearly see the abundance and scarcity of reported
compositions containing certain elements along the modified
Pettifor scale, and trends in chemical makeup can be observed.
While this may not give us the best map of compositions for
effective ML, it remains valuable for its accurate realization of
the metric space. This enables us to map the chemical
relationships between all of the compositions in the ICSD,
with confidence that our embedding is representative of the
relation between compounds given to us with the metric,
which may be explored interactively at www.elmd.io/plots/.
Table 2. Pearson's Correlation Coefficient between the Complete Euclidean Distance Matrix for Embedded Points and the True EMD Distances between Compounds in Successively Higher Dimensional Embeddings

embedded dimension    UMAP     PCA
1                     0.538    0.860
2                     0.748    0.938
3                     0.736    0.945
5                     0.661    0.945

■CONCLUSIONS
By directly calculating the similarity of constituent elements, we present the EMD as a computable mathematical relationship between any two compounds. This provides a natural extension to the physical scale introduced by Pettifor, allowing us not only to calculate the similarity of elements but also to quantitatively measure the similarity of compounds. These distances give a reliable measure of chemical similarity which aligns with human judgement, which we may use to associate relationships either analytically or with ML. These distances present a method to connect separate chemical data sets and to pair compounds, including potentially theoretical ones, to reported
chemical information. A search interface using the compound
formula as a query may be implemented, providing chemists
with a natural interface to retrieve and explore data. This has
been demonstrated by pairing a recent survey of 842
compositions with known ionic conductivity to their most
likely reported structural information in the ICSD where, with
a cutoff distance of 1, we have automatically returned a good
match in 94% of cases. One clear future possibility is to
connect chemical properties from multiple databases of
potentially different compositions, where the distance may be
used as a numeric measure of uncertainty for each assignment.
When designing statistical models, it is tempting to include
all available chemical information in the hopes of arriving at
the most accurate correlative results possible. There is however
growing sentiment within the community to go further than
simply black box curve fitting statistical models,41 with an
increased call for interpretable models which not only give
predictions but also some understanding of how we have
arrived at our answer. Here, we use the EMD to visualize and
analyze solid-state compounds in the ICSD, including the
subset of binary compounds. With this metric, we have created
detailed chemical maps using modern data visualization
techniques, which preserve clear trends in chemical relation-
ships. The quality of these maps is of high enough degree for
the unsupervised ML method DBSCAN to automatically
assign cluster labels such that similar compositions share a
label. These assignments have a verifiable alignment with
human judgement, which is given to us from the imbued
domain knowledge engineered into the metric. Meaningfully
understanding any large chemical data set is a daunting task,
and these maps aid us by giving a broad overview of a
compositional space. It has been shown that simple metrics like
the CED are ineffective for this as they do not possess the
resolution to differentiate disparate compositions in a space as
complex as the domain of feasible compounds. This leads to an
assessment of numeric similarity which does not align with
chemical judgement, and in creating maps using this metric, we
find dissimilar compositions in close proximity to one another.
In traditional ML models, for those with no background in
statistical inference, determining why two points have a
calculated proximity may be challenging. With the EMD,
should greater depth of investigation be required, a complete
analytic solution can be calculated between two points to
justify their exact positioning with respect to one another. These
solutions provide chemists with thorough explanations for why
two materials in a map have their calculated vicinity.
Understanding ML predictions requires us to not only
understand the materials but also the relationships between
these. Stepping back from the forest of details allows us to look
for general patterns, and with the results of ever more
experiments readily available, we need ML to carry this
forward. Following patterns to predict complex physical
properties with 100% accuracy may prove to be impossible
but we know that natural trends, although well hidden, almost
always exist. If we are to understand these, we believe that the
EMD and other crafted metrics will prove to be invaluable
tools in the categorization of material space and in further
interpreting the artificial intelligence we use in cheminfor-
matics.
■ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge at
https://pubs.acs.org/doi/10.1021/acs.chemmater.0c03381.
In depth discussion of the EMD, plots of the binary
compounds generated using different approaches, and
more in depth analysis of unsupervised learning derived
clusters (PDF)
■AUTHOR INFORMATION
Corresponding Author
Matthew S. Dyer −Department of Chemistry, University of
Liverpool, Liverpool L69 7ZD, U.K.; orcid.org/0000-
0002-4923-3003; Email: msd30@liverpool.ac.uk
Authors
Cameron J. Hargreaves −Department of Chemistry and
Department of Computer Science, University of Liverpool,
Liverpool L69 7ZD, U.K.
Michael W. Gaultois −Department of Chemistry and
Leverhulme Research Center for Functional Material Design,
Materials Innovation Factory, University of Liverpool,
Liverpool L69 7ZD, U.K.; orcid.org/0000-0003-2172-
2507
Vitaliy A. Kurlin −Department of Computer Science and
Leverhulme Research Center for Functional Material Design,
Materials Innovation Factory, University of Liverpool,
Liverpool L69 7ZD, U.K.
Matthew J. Rosseinsky −Department of Chemistry and
Leverhulme Research Center for Functional Material Design,
Materials Innovation Factory, University of Liverpool,
Liverpool L69 7ZD, U.K.; orcid.org/0000-0002-1910-
2483
Complete contact information is available at:
https://pubs.acs.org/10.1021/acs.chemmater.0c03381
Notes
The authors declare no competing financial interest.
All code used to generate distances between compositions may
be found at https://github.com/lrcfmd/ElMD/.
All supporting code documentation and presentation of results
may be found at https://www.elmd.io/.
■ACKNOWLEDGMENTS
The authors would like to thank the following colleagues at the
University of Liverpool, Materials Innovation Factory for their
dedicated efforts in assessing the quality of structural matching
functionality: Alexandra Morscher, Andrij Vasylenko, Aris
Robinson, Arnaud Perez, Benjamin Duff, Bernhard Leube,
Catriona Crawford, Chris Collins, Elvis Shoko, Jacinthe
Gamon, Kate Thompson, Lu Wang, Matthew Bright, Matthew
Wright, Michael Moran, Oliver Rogan, Paul Sharp, Prasad
Beluvalli Eshwarappa, Will Thomas, Yun Dang, and Yundong
Zhou. We would additionally like to thank Jiahui An, Yulan
Liu, and Wenkai Zhang for their work in the development and
testing of code. We thank Leszek Gąsieniec (University of
Liverpool) for his helpful comments on multi-commodity flow
problems, and Alice Rizzardo (University of Liverpool) for her
valuable input on the structure of the EMD metric space. This
work was supported by the University of Liverpool (student-
ship to C.J.H.), by the Faraday Institution (grant number
FIRG007), and the EPSRC grant New Approaches to Data
Science: Application Driven Topological Data Analysis EP/
R018472/1. The authors thank the Leverhulme Trust for
funding this research via the Leverhulme Research Centre for
Functional Materials Design (RC-2015-036). M.J.R. thanks the
Royal Society for the award of a Research Professor position.
This work was undertaken on Barkla, part of the High
Performance Computing facilities at the University of
Liverpool, UK.
■REFERENCES
(1) Schmidt, J.; Shi, J.; Borlido, P.; Chen, L.; Botti, S.; Marques, M.
A. L. Predicting the thermodynamic stability of solids combining
density functional theory and machine learning. Chem. Mater. 2017,
29, 5090−5103.
(2) Wen, C.; Zhang, Y.; Wang, C.; Xue, D.; Bai, Y.; Antonov, S.; Dai,
L.; Lookman, T.; Su, Y. Machine learning assisted design of high
entropy alloys with desired property. Acta Mater. 2019,170, 109−
117.
(3) Jha, D.; Ward, L.; Paul, A.; Liao, W.-K.; Choudhary, A.;
Wolverton, C.; Agrawal, A. ElemNet: Deep learning the chemistry of
materials from only elemental composition. Sci. Rep. 2018,8, 17593.
(4) Rubner, Y.; Tomasi, C.; Guibas, L. The Earth Mover’s Distance
as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000,40,99−121.
(5) Orlova, D. Y.; Zimmerman, N.; Meehan, S.; Meehan, C.; Waters,
J.; Ghosn, E. E. B.; Filatenkov, A.; Kolyagin, G. A.; Gernez, Y.; Tsuda,
S.; Moore, W.; Moss, R. B.; Herzenberg, L. A.; Walther, G. Earth
mover’s distance (EMD): a true metric for comparing biomarker
expression levels in cell populations. PLoS One 2016,11,
No. e0151859.
(6) Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. From word
embeddings to document distances. Proceedings of the 32nd Interna-
tional Conference on Machine Learning, 2015; Vol. 37, pp 957−996.
(7) Monge, G. Mémoire sur la théorie des déblais et des remblais.
Histoire de l’Académie Royale des Sciences de Paris, 1781; pp 666−704.
(8) Pele, O.; Werman, M. A linear time histogram metric for
improved sift matching. Computer VisionECCV 2008, 2008; pp
495−508.
(9) Ahuja, R. K.; Magnanti, T. L.; Orlin, J. B. Minimum Cost Flows:
Network Simplex Algorithms. Network Flows, 1st ed.; Prentice Hall:
Harlow, 1993; pp 402−461.
(10) Pettifor, D. G. A chemical scale for crystal-structure maps. Solid
State Commun. 1984,51,31−34.
(11) Goldsmith, B. R.; Boley, M.; Vreeken, J.; Scheffler, M.;
Ghiringhelli, L. M. Uncovering structure-property relationships of
materials by subgroup discovery. New J. Phys. 2017,19, 013031.
(12) Isayev, O.; Fourches, D.; Muratov, E. N.; Oses, C.; Rasch, K.;
Tropsha, A.; Curtarolo, S. Materials cartography: Representing and
mining materials space using structural and electronic fingerprints.
Chem. Mater. 2015,27, 735−743.
(13) Glawe, H.; Sanna, A.; Gross, E. K. U.; Marques, M. A. L. The
optimal one dimensional periodic table: a modified Pettifor chemical
scale from data mining. New J. Phys. 2016,18, 093011.
(14) Ong, S. P.; Chevrier, V. L.; Hautier, G.; Jain, A.; Moore, C. J.;
Kim, S.; Ma, X.; Ceder, G. Voltage, stability and diffusion barrier
differences between sodium-ion and lithium-ion intercalation
materials. Energy Environ. Sci. 2011,4, 3680−3688.
(15) Haghighatlari, M.; Shih, C. Y.; Hachmann, J.. Thinking globally,
acting locally: On the issue of training set imbalance and the case for
local machine learning models in chemistry. 2019, 8796947,
ChemRxiv. https://chemrxiv.org/articles/preprint/Thinking_
Globally_Acting_Locally_On_the_Issue_of_Training_Set_
Imbalance_and_the_Case_for_Local_Machine_Learning_Models_
in_Chemistry/8796947/2 (accessed November 11, 2020).
(16) Hellenbrandt, M. The inorganic crystal structure database
(ICSD) present and future. Crystallogr. Rev. 2004,10,17−22.
(17) Pedregosa, F.; Varoquax, G.; Gramfort, A.; Michel, V.; Thirion,
B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.;
Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.;
Duchesnay, E. Scikit-learn: Machine Learning in Python. J. Mach.
Learn. Res. 2011,12, 2825−2830.
(18) McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold
Approximation and Projection for Dimension Reduction. 2018,
1802.03426, arXiv, https://arxiv.org/abs/1802.03426 (accessed No-
vember 11, 2020).
(19) Krzanowski, W. Subspace representation of units, derived from
numerical dissimilarities. Principles of Multivariate Analysis, 2nd ed.;
Oxford Statistical Science Series; Oxford University Press, 2000; pp
104−109.
(20) Kolouri, S.; Park, S. R.; Thorpe, M.; Slepcev, D.; Rohde, G. K.
Optimal Mass Transport: Signal processing and machine-learning
applications. IEEE Signal Process. Mag. 2017,34,43−59.
(21) Fourches, D.; Muratov, E.; Tropsha, A. Trust, but verify: On
the importance of chemical structure curation in cheminformatics and
qsar modeling research. J. Chem. Inf. Model. 2010,50, 1189−1204.
(22) Stokes, J. M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.;
Donghia, N. M.; MacNair, C. R.; French, S.; Carfrae, L. A.; Bloom-
Ackermann, Z.; Tran, V. M.; Chiappino-Pepe, A.; Badran, A. H.;
Andrews, I. W.; Chory, E. J.; Church, G. M.; Brown, E. D.; Jaakkola,
T. S.; Barzilay, R.; Collins, J. J. A deep learning approach to antibiotic
discovery. Cell 2020,181, 475−483.
(23) Duan, J.; Dixon, S. L.; Lowrie, J. F.; Sherman, W. Analysis and
comparison of 2d fingerprints: Insights into database screening
performance using eight fingerprint methods. J. Mol. Graphics 2010,
29, 157−170.
(24) Bartók, A. P.; Kondor, R.; Csányi, G. On representing chemical
environments. Phys. Rev. B: Condens. Matter Mater. Phys. 2013,87,
184115.
(25) Ziletti, A.; Kumar, D.; Scheffler, M.; Ghiringhelli, L. M.
Insightful classification of crystal structures using deep learning. Nat.
Commun. 2018,9, 2775.
(26) Ward, L.; Dunn, A.; Faghaninia, A.; Zimmermann, N. E. R.;
Bajaj, S.; Wang, Q.; Montoya, J.; Chen, J.; Bystrom, K.; Dylla, M.;
Chard, K.; Asta, M.; Persson, K. A.; Snyder, G. J.; Foster, I.; Jain, A.
Matminer: An open source toolkit for materials data mining. Comput.
Mater. Sci. 2018,152,60−69.
(27) Ward, L.; Liu, R.; Krishna, A.; Hegde, V. I.; Agrawal, A.;
Choudhary, A.; Wolverton, C. Including crystal structure attributes in
machine learning models of formation energies via Voronoi
tessellations. Phys. Rev. B: Condens. Matter Mater. Phys. 2017,96,
024104.
(28) Zhang, Y.; He, X.; Chen, Z.; Bai, Q.; Nolan, A. M.; Roberts, C.
A.; Banerjee, D.; Matsunaga, T.; Mo, Y.; Ling, C. Unsupervised
discovery of solid-state lithium ion conductors. Nat. Commun. 2019,
10, 5260.
(29) Schmidt, J.; Marques, M. R. G.; Botti, S.; Marques, M. A. L.
Recent advances and applications of machine learning in solid-state
materials science. npj Comput. Mater. 2019,5, 83.
(30) Wood, E. A. Polymorphism in potassium niobate, sodium
niobate, and other ABO3 compounds. Acta Crystallogr. 1951, 4, 353.
(31) Villars, P.; Cenzual, K.; Daams, J.; Chen, Y.; Iwata, S. Data-
driven atomic environment prediction for binaries using the
Mendeleev number: Part 1. Composition AB. J. Alloys Compd.
2004,367, 167−175.
(32) Zakutayev, A.; Wunder, N.; Schwarting, M.; Perkins, J. D.;
White, R.; Munch, K.; Tumas, W.; Philips, C. An open experimental
database for exploring inorganic materials. Sci. Data 2018,5, 180053.
(33) Villars, P.; Brandenburg, K.; Berndt, M.; LeClair, S.; Jackson,
A.; Pao, Y.-H.; Igelnik, B.; Oxley, M.; Bakshi, B. R.; Chen, P.; Iwata, S.
Binary, ternary and quaternary compound former/nonformer
prediction via Mendeleev number. J. Alloys Compd. 2001,317,26−38.
(34) Tshitoyan, V.; Dagdelen, J.; Weston, L.; Dunn, A.; Rong, Z.;
Kononova, O.; Persson, K. A.; Ceder, G.; Jain, A. Unsupervised word
embeddings capture latent knowledge from materials science
literature. Nature 2019,571,95−98.
(35) Isayev, O.; Oses, C.; Toher, C.; Gossett, E.; Curtarolo, S.;
Tropsha, A. Universal fragment descriptors for predicting properties
of inorganic crystals. Nat. Commun. 2017,8, 15679.
(36) Zhou, Q.; Tang, P.; Liu, S.; Pan, J.; Yan, Q.; Zhang, S.-C.
Learning atoms for materials discovery. Proc. Natl. Acad. Sci. U. S. A.
2018,115, E6411−E6417.
(37) Ward, L.; Agrawal, A.; Choudhary, A.; Wolverton, C. A general-
purpose machine learning framework for predicting properties of
inorganic materials. npj Comput. Mater. 2016,2, 16028.
(38) Ester, M.; Kriegel, H. P.; Sander, J.; Xu, X. A density-based
algorithm for discovering clusters in large spatial databases with noise.
KDD ’96: Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining, 1996; Vol. 96, pp 226−231.
(39) Rabadán, R.; Blumberg, A. J. Dimensionality Reduction,
Manifold Learning, and Metric Geometry. Topological Data Analysis
for Genomics and Evolution: Topology in Biology, 1st ed.; Cambridge
University Press, 2019; pp 235−270.
(40) Ohta, S.-I. Gradient flows on Wasserstein spaces over compact
Alexandrov spaces. Am. J. Math. 2009,131, 475−516.
(41) Chuang, K. V.; Keiser, M. J. Adversarial Controls for Scientific
Machine Learning. ACS Chem. Biol. 2018,13, 2819−2821.