The Earth Mover's Distance as a Metric for the Space of Inorganic Compositions
Cameron J. Hargreaves, Matthew S. Dyer,* Michael W. Gaultois, Vitaliy A. Kurlin,
and Matthew J. Rosseinsky
Cite This: https://dx.doi.org/10.1021/acs.chemmater.0c03381
Received: August 19, 2020
Revised: November 20, 2020
ABSTRACT: It is a core problem in any field to reliably tell how close two objects are to being the same, and once this relation has been established, we can use this information to precisely quantify potential relationships, both analytically and with machine learning (ML). For inorganic solids, the chemical composition is a fundamental descriptor, which can be represented by assigning the ratio of each element in the material to a vector. These vectors are a convenient mathematical data structure for measuring similarity, but unfortunately, the standard metric (the Euclidean distance) gives little to no variance in the resultant distances between chemically dissimilar compositions. We present the earth mover's distance (EMD) for inorganic compositions, a well-defined metric which enables the measure of chemical similarity in an explainable fashion. We compute the EMD between two compositions from the ratio of each of the elements and the absolute distance between the elements on the modified Pettifor scale. This simple metric shows clear strength at distinguishing compounds and is efficient to compute in practice. The resultant distances have greater alignment with chemical understanding than the Euclidean distance, which is demonstrated on the binary compositions of the inorganic crystal structure database. The EMD is a reliable numeric measure of chemical similarity that can be incorporated into automated workflows for a range of ML techniques. We have found that with no supervision, the use of this metric gives a distinct partitioning of binary compounds into clear trends and families of chemical property, with future applications for nearest neighbor search queries in chemical database retrieval systems and supervised ML techniques.
INTRODUCTION
Even before Aristotle, philosophers sought to explain the
properties of materials through their elemental compositions.
For an experimental chemist, the first step in any investigation is choosing which elements, and in what ratio, to put into the sample, and the composition is arguably the most important
independent variable under control. In many functional
materials, where disorder is important to the functional
properties (such as electronic or ionic conductivity), the
elemental composition is a fundamental property that is well
described. This is because the nominal composition that is put
into a synthetic process is generally well-defined, and also
because there are extensive characterization methods to
experimentally determine the elemental composition.
Although the underlying theory has evolved considerably
since antiquity, the elemental composition of a material
continues to be a prime director of material properties, and we
now know that the chemical composition largely dictates the nature of the chemical bonding, which has a strong influence on the
crystal structure and physical properties. Similar compositions
lead to similar properties, and when estimating material
properties, it is important to consider the closest known
composition to the one being considered. These similarities
can be defined quantitatively in a distance function, which
returns a real valued number, such that identical objects have a
distance of 0, and less similar objects return a larger value. We
would expect that small changes in chemical makeup would
lead to correspondingly small variations in chemical property,
and that chemically dissimilar compounds may behave entirely
differently.
The chemist develops such understanding naturally through
their exploration of the sciences. While each practitioner may
have a personal tolerance for what they believe to be
chemically similar, two compositions which differ only by a
minor dopant or by the substitution of a similar element have
inarguable similarity. This relationship may be immediately
clear to the chemist, but in practice it is difficult to capture
these small physical changes numerically. In this paper, we
present a new technique to calculate the distance between two
compositions, which captures nuanced variations in stoichi-
ometry for both similar and dissimilar compounds.
By correlating relationships between the chemical compo-
sition of materials and their observed behavior, we may
automatically detect underlying statistical relationships be-
tween these. This can be used in an automated process to usefully inform the chemistry, whether by relating a material to clusters of similar materials or by estimating its properties. This has been exploited implicitly by
modern machine learning (ML) methods, which have been
applied to capitalize on the strong determination of properties
by composition; there are many reports of regression models
to estimate material performance of inorganic solids from
compositions alone.1−3
For these models to be successful, we require two things: a
large collection of data and a method of differentiating these
such that we may uncover the subtle relationships which
govern a material's properties. Having a metric to quantify
these relationships allows us to take our bearings and construct
maps of chemical space to enable clear exploration, providing
an awareness of compositional relationships between materials.
When predicting the properties of a new composition, we must
form an understanding of its relation to other reported
compounds, and the distinguishing quality of the similarity
metric chosen can vastly affect performance. The choice of
metric should therefore possess enough fidelity to give an
accurate representation of chemical relationships between
entries in a database of compositions and align with human
understanding.
Though widely used as a metric, the compositional Euclidean distance (CED) can perform poorly at the task of distinguishing compounds. A common method of encoding a composition is to store the relative ratio of each element in the compound at its associated index in a vector of length 103, one index for each of the naturally stable elements. Taking $x_i$ to be the fraction of the $i$th element in a compound $X$, we take the CED to a second such vector, $Y$, via the standard formula

$$\mathrm{CED}(X, Y) = \sqrt{\sum_{i}(x_i - y_i)^2}$$

Due to the sparsity of these vectors, the CED overly simplifies and exaggerates physical differences. As an example, taking the atomic number of the 103 stable elements as our index, the compositional vectors of LiF and BeO would be $(0, 0, 0.5_{3}, \ldots, 0.5_{9}, \ldots, 0_{103})$ and $(0, 0, 0, 0.5_{4}, \ldots, 0.5_{8}, \ldots, 0_{103})$, respectively, where the subscripts denote the vector index. Taking the nonzero entries at indices 3, 4, 8, and 9, the CED between these vectors would thus be

$$\sqrt{(0.5 - 0)^2 + (0 - 0.5)^2 + (0 - 0.5)^2 + (0.5 - 0)^2} = 1$$

A third binary composition, BN, which with a compositional vector of $(0, 0, 0, 0, 0.5_{5}, \ldots, 0.5_{7}, \ldots, 0_{103})$ is arguably less chemically similar to LiF than BeO is, would also have a CED of 1 to both of these compounds, as demonstrated in Figure 1a,b. A CED of 1 would be calculated between any two binary compositions which do not share a common element. This discrete behavior of the CED does not provide an accurate distinction between compounds which may be entirely different chemically, and while it can capture local trends in a chemical data set, global information may be lost. We can improve on this shortcoming by incorporating a measure of elemental similarity which may be applied to a compositional vector directly.
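To make the degeneracy above concrete, the following is a minimal sketch (not the authors' implementation) of the CED on sparse composition vectors; the composition_vector helper and the equimolar dictionaries are illustrative constructions, and the printed values simply reproduce the LiF/BeO/BN example discussed above.

```python
import numpy as np

def composition_vector(fractions, n_elements=103):
    """Sparse composition vector: `fractions` maps atomic number -> molar fraction."""
    vec = np.zeros(n_elements)
    for atomic_number, fraction in fractions.items():
        vec[atomic_number - 1] = fraction
    return vec

def ced(x, y):
    """Compositional Euclidean distance between two composition vectors."""
    return float(np.linalg.norm(x - y))

# Equimolar binaries, indexed by atomic number (Li=3, Be=4, B=5, N=7, O=8, F=9).
LiF = composition_vector({3: 0.5, 9: 0.5})
BeO = composition_vector({4: 0.5, 8: 0.5})
BN = composition_vector({5: 0.5, 7: 0.5})

print(ced(LiF, BeO))  # 1.0
print(ced(LiF, BN))   # 1.0 -- the CED cannot separate these two cases
```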
The earth mover's distance (EMD) is a metric which is well-constructed to pair elements between compositions and, from this pairing, judge their similarity; it has had successful applications in multiple fields.4−6 The EMD may analogously be thought of as the minimal amount of work required to move piles of earth to fill pits of equal overall volume but different shapes, a long-studied transportation problem7 with fast algorithmic implementations.8,9 This consistently returns a unitless quantity of work which may be interpreted as a measure of distance.

Figure 1. In (a,b), the CED is demonstrated between compositional vectors of LiF, BeO, and BN by taking the absolute difference at each atomic site, with the atomic number as index. In (c,d), the equivalent EMD is shown, where we instead match elements with one another. By calculating the cost to transport each atom along the modified Pettifor scale, we arrive at a distance which is reflective of the chemical similarity. While each of these elements has a similar atomic number, they possess well-known chemical differences, which is displayed by the greater variance in the output.
We could assign the atomic number as the vector index for each element, then take the difference between indices as a measure of elemental similarity, but this approach loses the natural clustering of chemical properties afforded by the periodic table. An ideal elemental indexing would perfectly capture the chemical trends observed in nature, but ordering the elements in such a manner is problematic. As well as the unclear resolution of how to handle the f-block elements, chemical trends moving down the periodic table tend to be the direct opposite of those moving across. This leads to some elements having greater substitutional feasibility with their diagonal neighbor than with their immediate neighbor, making a simple placement of these difficult.

To solve this problem, Pettifor proposed a method of labeling the elemental scale in his seminal paper of 1984,10 drawn from extensive domain knowledge. These numeric labels may form the basis of a coordinate system allowing us to associate patterns in geometric and physicochemical properties, with extensions to this idea continuing to guide practitioners.11,12 This concept of labeling was further developed by Glawe et al.,13 who analyzed the probability that an element can be substituted for another within the same structural framework across 20,500 compounds of the inorganic crystal structure database (ICSD). This probability matrix can be reordered to maximize the likelihood that local neighborhoods will contain elements with greater feasibility of stable substitutions, and thus inherent chemical similarities.14 We take the associated indices of this final ordering to give each element its modified Pettifor number.
In this report, we define a composition vector by taking the ratio of each element in a compound, assigned to the index of its respective modified Pettifor number. By assuming that this data set samples the set of feasibly stable compounds (although we know this is not strictly the case15), we can see that these indices capture genuinely physical similarities between elements through statistical analysis. Using the modified Pettifor scale gives similarities between compounds which align with human judgement, though it may be substituted with any continuous elemental scale, including less evenly spaced distributions, for example, Pauling electronegativity.

This gives us the ability to place new compositions within the context of previously reported compounds, allowing us to attribute properties to these before the lengthy process of synthesis. We can do this automatically with ML techniques, where the EMD forms part of the workflow to predict properties quantitatively. We may additionally assign properties to compositions qualitatively by simply searching through multiple databases to find the most similar existing entries. This second approach requires the practitioner's judgement on whether to take the property of the closest match, an average of many similar compounds, or to conclude that the reported landscape is not sufficiently complete to make an accurate judgement.
METHODS
The PyCifRW package (version 4.4.1) was used to extract the composition from each crystallographic information file (cif) of the ICSD (2017).16 These compositions were processed into vectors with our Python implementation, where the EMD between two vectors is computed via the network simplex algorithm. All embeddings are created by first constructing a complete distance matrix between compositions in a data set and providing this as an input to the chosen technique. Density-based spatial clustering of applications with noise (DBSCAN)17 parameters were tuned by hand until clusters were obtained which clearly bound regions of similarity of the periodic table. Implementation details of uniform manifold approximation and projection (UMAP) and principal component analysis (PCA) may be found in refs 18 and 19, respectively. All code was written in Python 3.7, with the matplotlib (3.1.3) and datashader (0.10.0) libraries used to plot the binary and complete maps, respectively.
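As a hedged illustration of this workflow (not the authors' exact scripts), the sketch below builds a precomputed distance matrix for a toy set of random composition vectors, embeds it with UMAP, and clusters it with DBSCAN. The random data, the stand-in metric, and the eps/min_samples values are all assumptions for demonstration, and the umap-learn and scikit-learn packages are assumed to be installed.

```python
import numpy as np
import umap                                  # umap-learn
from scipy.stats import wasserstein_distance
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

def random_composition(n_elements=103, n_present=3):
    """Toy stand-in for an ICSD-derived composition vector."""
    vec = np.zeros(n_elements)
    idx = rng.choice(n_elements, size=n_present, replace=False)
    vec[idx] = rng.dirichlet(np.ones(n_present))
    return vec

def emd(x, y):
    """1D earth mover's distance over element indices (a generic stand-in here;
    the paper orders the indices by the modified Pettifor scale, see Results)."""
    positions = np.arange(len(x))
    return wasserstein_distance(positions, positions, x, y)

compositions = [random_composition() for _ in range(200)]
n = len(compositions)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = emd(compositions[i], compositions[j])

# Two-dimensional embedding of the precomputed metric (UMAP defaults to 15 neighbours).
embedding = umap.UMAP(metric="precomputed", random_state=0).fit_transform(D)

# Illustrative DBSCAN parameters, not the hand-tuned values used in the paper.
labels = DBSCAN(eps=1.0, min_samples=5, metric="precomputed").fit_predict(D)
```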
RESULTS
EMD. We take an initial matching by pairing each of the m elements in a vector, X, to its most similar unmatched partner in the n elements of a second vector, Y, until all have been paired. The quantity matched, q, from the ith element of X to the jth element of Y is given by q_ij. A cost is calculated (eq 1a) by summing all quantity, q, paired through each matching, multiplied by the difference |p_i − p_j| in indices on the modified Pettifor scale, p, between the elements matched. When all elements have been paired, this is a feasible solution to the problem; however, it may not be an optimally minimal solution. Given two vectors, we take a feasible matching and successively improve this via the network simplex algorithm until the total summed cost is verified to be optimal.
For a compositional vector X, $\sum_{i=1}^{m} x_i = 1$, and therefore the total quantity matched with any other vector will also equal 1. As we are describing a transformation between two distributions with the same total sum of 1, this satisfies the axioms of a metric space, as proven in ref 4, Appendix A. The formal definition of the EMD between two compositional vectors X = (x_1, ..., x_m) and Y = (y_1, ..., y_n), as given by eq 7 of ref 20, is thus defined as

$$\mathrm{EMD}(X, Y) = \min \sum_{i=1}^{m} \sum_{j=1}^{n} q_{ij}\,|p_i - p_j| \tag{1a}$$

$$\text{subject to } q_{ij} \geq 0 \quad \text{for any } i, j \tag{1b}$$

$$\sum_{j=1}^{n} q_{ij} \leq x_i \quad \text{for any } 1 \leq i \leq m \tag{1c}$$

$$\sum_{i=1}^{m} q_{ij} \leq y_j \quad \text{for any } 1 \leq j \leq n \tag{1d}$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n} q_{ij} = 1 \tag{1e}$$

Constraint 1b defines that we may only match a positive quantity from X to Y; 1c and 1d state that each element will only pair to another up to its ratio. The final constraint ensures that all of the elements in X are matched to an element in Y, such that a feasible solution has been achieved.
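Because the compositions live on a one-dimensional elemental scale, the transport problem in eq 1 reduces to the first Wasserstein distance between two weighted point sets, which can be evaluated without an explicit network simplex solve. The sketch below assumes this reduction and uses placeholder (not the published) modified Pettifor numbers; it is an illustration of eq 1, not the reference ElMD implementation.

```python
from scipy.stats import wasserstein_distance

# Placeholder indices for illustration only; the real modified Pettifor scale
# assigns each of the 103 elements an integer position (Glawe et al., ref 13).
PETTIFOR = {"Li": 1, "Na": 2, "Ti": 51, "Al": 73, "P": 90, "O": 97, "F": 102}

def emd(comp_x, comp_y, scale=PETTIFOR):
    """EMD of eq 1 between two compositions given as {element: fraction} dicts,
    each with fractions summing to 1 (constraint 1e)."""
    x_pos = [scale[el] for el in comp_x]
    y_pos = [scale[el] for el in comp_y]
    return wasserstein_distance(x_pos, y_pos, list(comp_x.values()), list(comp_y.values()))

print(emd({"Li": 0.5, "F": 0.5}, {"Na": 0.5, "O": 0.5}))
```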
Taking three candidate solid-state electrolytes with known dissimilarity in composition and structure to exemplify this, as shown in Figure 2, we can see how the EMD allows greater depth of analysis when defining chemical similarity compared to the CED. From the figure, we see how the solution not only gives us the measure of distance but also the quantity of the elements that are paired to one another. We may apply this to any two chemical formulae, enabling us to highlight chemically similar substitutions, and thus familial relation, which may not have been immediately obvious from the compound formula.
A reference implementation may be found at https://github.com/lrcfmd/ElMD/.
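A usage sketch of that package, as described in its repository README, is shown below; the class and method names are taken from the README and may differ between versions, and the printed values are only expected to approximate the distances reported in Figure 2.

```python
# pip install ElMD
from ElMD import ElMD

x = ElMD("Li1.3Al0.3Ti1.7(PO4)3")
print(x.elmd("La0.5Li0.35TiO3"))  # expected to be near 12.70 (cf. Figure 2b)
print(x.elmd("Li6PS5Cl"))         # expected to be near 31.08 (cf. Figure 2c)
```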
Pairing Structures to Compositions. It is widely recognized that composition is not the sole determinant of physical performance, and crystal structure plays a fundamental role in property, which can be dependent on many different length scales. Codifying these structures such that we may compare them for similarity has known difficulties21 due to the periodicity of inorganic systems. For organic molecules, there exist methods of formally encoding a structure derived from the strict lexicographic conventions of organic chemistry,22,23 and methods of encoding an inorganic crystal's local environment and symmetry have been successfully implemented.24−26 Structural features are a known asset to ML models, and the addition of this information generally gives stronger predictive performance at screening and property prediction.27−29
Unfortunately, structural information is often not reported
in tandem with experimentally determined chemophysical
properties, and many such properties are reported from solid
solutions where a similar reported structure may not even exist.
In many cases, only the composition and the property under
investigation will be reported, leaving a fragmented data
landscape with a barrier between databases. With the number
of reported compounds untenable for any person to feasibly
audit, we must bring this information together in an automated
manner. The EMD allows us to connect compounds to their
closest determined structure in such a fashion, allowing
databases with compositional information to be joined to
databases of structural information.
Using the EMD we may now pair query formulae, including
those which have never been synthesized, to their most similar
compositions in one of the many chemical databases such as
the ICSD (2017), consisting of 188,631 cifs. A recent review of
materials with reported ionic conductivity was undertaken with
842 compounds identified. Each compound had a comparative
search applied to every ICSD entry, and these pairings were
analyzed by a team of 21 researchers at the Materials
Innovation Factory, University of Liverpool, with the quality
of these matchings assessed. Of these compounds, 528 had a
perfect match to a cif or an exact match under a minor change
in stoichiometry. A further 254 compounds have a matching cif with a small number of elemental substitutions and a similar crystal structure. The remaining 60 formulae did not find a
good match, mostly due to the materials being reported more
recently than any database entries. In Table 1, we see some
commonly cited compounds from this field and their closest
matches in the ICSD. Clearly, a distance of zero gives an exact
match barring polymorphs; however, there remains the more
general case where an exact structure has not been reported.
We can see that in each example a chemically similar compound has been returned, and we would expect the extracted structural features to have a high degree of correlation with the true structural information.
Figure 2. All feasible solutions to compute the EMD from Li1.3Al0.3Ti1.7(PO4)3 to La0.5Li0.35TiO3 are presented in (a), where the area of each disk represents the fraction of the element in the compound, and the width of each arc shows the maximal quantity which could theoretically be matched within the constraints of the problem. The optimal matching is shown in (b), giving a resultant distance between these of 12.70. For an arguably less similar compound, Li6PS5Cl, these chemical differences are reflected by the greater distance of 31.08 (c). Taking the final distance from Li6PS5Cl to La0.5Li0.35TiO3 of 18.38 (d), it can be seen that a simple embedding of these three compounds may be constructed (e).
Table 1. Top Three Most Similar Results when Querying Some Commonly Cited Solid-State Electrolytes against the ICSD (2017) with EMD^a

query                      three closest matches            EMD
Li1.3Al0.3Ti1.7(PO4)3      Na1.261Al0.302Ti1.696(PO4)3      0.231
                           Li1.2Al0.2Ti1.8(PO4)3            0.302
                           Li1.4Al0.4Ge0.4Ti1.2(PO4)3       0.368
Li10GeP2S12                Li10GeP2S12                      0.000
                           Li10SnP2S12                      0.040
                           Li9.81Sn0.81P2.19S12             0.439
Li6PS5Cl                   Li6AsS5I                         0.231
                           Li6PO5Cl                         0.385
                           Li6PO5Br                         0.462
Li5La3Nb2O12               Li5La3Nb2O12                     0.000
                           Li5La3Ta2O12                     0.091
                           Li5.08La3Ta1.51Zr0.39O12         0.322
Li1.5Al0.5Ge1.5(PO4)3      Na1.5Sn1.5Sb0.3(PO4)3            0.543
                           BaGa(PO4)2                       0.712
                           BaSn(PO4)2                       0.745
Li7La3Zr2O12               Li7La3Zr2O12                     0.000
                           Li7La3Hf2O12                     0.083
                           Li7.1La3(Zr1.9Cr0.1)O12          0.200
Li14Zn(GeO4)4              Li14Zn(GeO4)4                    0.000
                           Li6Ge2O7                         0.410
                           Li6(Si2O7)                       0.543

^a While there are only four queries with exact matches due to the recency of the electrolytes' reporting, it may be seen that the remainder are chemically similar.
While 8.43% of the compounds have been poorly matched, the number of false matches can be reduced by filtering the data set with the application of a maximum threshold value. By removing all matches which have a distance greater than 1, we discard 12% of the entries but improve the false positive rate to 5.7%. Although caution should be applied given the introduction of known errors, this provides an efficient method for the automated creation of data sets on the scale required to uncover complex statistical relationships.
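A minimal sketch of this kind of thresholded lookup is given below; the database mapping, the emd helper (sketched earlier), and the cutoff of 1 mirror the procedure described above, but the function is illustrative rather than the matching code used for the 842-compound survey.

```python
def closest_matches(query, database, emd, k=3, max_dist=1.0):
    """Return up to k (distance, formula) pairs nearest to `query` under the EMD.

    `query` and the values of `database` are composition dicts, `emd` is a
    metric such as the one sketched earlier, and matches beyond `max_dist`
    (the cutoff of 1 discussed above) are discarded to reduce false matches.
    """
    scored = sorted((emd(query, comp), formula) for formula, comp in database.items())
    return [(d, f) for d, f in scored if d <= max_dist][:k]
```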
By assessing those matchings in Table 1 that are imperfect, we may see how the top ranked match remains structurally related, with the remainder being simple dopings and substitutions. Of interest is Li1.5Al0.5Ge1.5(PO4)3 in row 5: while there is a clear chemical relation between these three compounds, we see some dissimilarity in their structures, as Na1.5Sn1.5Sb0.3(PO4)3, BaGa(PO4)2, and BaSn(PO4)2 represent the NASICON and pyrosilicate phases. It is likely, perhaps certain, that the chemist develops a deeper understanding of the relations between the compounds they study, their combinations, and their behaviors under environmental conditions than can be captured by a simple number. An engineered representation of compositions has, however, allowed us to express chemical relationships which it has not previously been possible to express quantifiably. With a clearly defined metric of chemical similarity, we may use this as we would any other distance, with additional confidence that the underlying mechanics are mathematically aligned with chemical knowledge.
Mapping Compositional Space. The discovery of new
materials has always been data driven, and mapping
compositions to predict the existence of structures is a time-
honored technique in crystallography.30,31
The visual medium
provides a tangible clarity to the human reader, where abstract
relations between compounds can be difficult to conceptualize
through numerical analysis. The EMD in conjunction with
modern visualization techniques has a clear application in this
regard, giving the ability to plot detailed maps which clearly
align with known chemical clustering. The metric space is
given by compositional vectors in 103 dimensions and their
relationships with respect to the EMD.
This space and its induced structure have a mathematically
complex geometry, and as we only possess the distances
between points, we do not have the two-dimensional
coordinates that are required for plotting. We may use
dimensionality reduction techniques to generate these
coordinates, and the resultant points are called an embedding
of the space. By measuring a line between embedded points
with a ruler, distance may be used in the ordinary meaning of
the word to define the similarity between two points. As it is
generally impossible to represent a complex space without
distorting the relationships between points, many dimension-
ality reduction techniques exist, each with their own focus at
emphasizing specific relationships across a data set. In general,
we wish to align the distance between points with the
associated EMD between compositions such that our
embeddings give a valuable representation of the metric
space. In this paper, we discuss embeddings produced by
UMAP,18
which gives clustered plots allowing the qualitative assessment of chemical data sets with unsupervised ML, and PCA, which we find gives more accurate
representations of the relationship between points, with less
overall distortion from the true positions.
In the UMAP algorithm, every composition is represented
by a point and edges to each of the 15 most similar compounds
in the data set calculated with respect to the EMD. It is not
however possible to plot these distances directly, due to the
contradictory information that arises when embedding a graph
of high degree to the plane. Approximations of the metric
distances are realized in two-dimensional Euclidean space by
constructing an inaccurate embedding of the point cloud, and
refining the positions of the points along edges to each
neighbor, such that distances between them align with the true
EMD, with respect to local cluster density. In doing this, we
disregard the majority of the inter-compound distances, yet
retain a skeletal backbone which follows the local trends of the
data, pulling together clusters of similar compounds. The
resultant two-dimensional plots are highly distorted from their
true positions in the metric space, but in a manner which draws
out the most prominent global trends of a data set, from which
we may pick out clear patterns and clusters both manually and
automatically.
Binary Compositions. The binary compounds have
simple compositions for us to demonstrate the efficacy of the
EMD and its alignment with domain knowledge. The 12,623
binary compounds in the ICSD were identied, the complete
inter-compound distance matrix calculated with respect to the
EMD, and the resultant distances reduced to two dimensions
with the application of UMAP (Figure 3a). Each of the clusters
of clear separation tend to contain AB pairs from the same, or
similar families on the periodic table, with trends across
clusters following expected transitions in chemical composition
through the modied Pettifor scale. In some clusters, there is
greater chemical discontinuity, yet there are trends across
regions of these points which follow smooth variations in AB
ratio.
Adding chemical labels (e.g., which blocks of the periodic table are found in the compound) demonstrates how trends in chemical properties are preserved when using the EMD (Figure 3). As there are no experimental properties barring chemical formula and atomic positions reported in cifs, we must derive known features from the composition alone.
Figure 3. Here, we see 12,623 binary composition vectors from the ICSD, with the EMD (a) and CED (b) between these calculated and reduced to two-dimensional coordinates using UMAP. With the application of color labels to display the placement across the periodic table of the elements present in each composition, it can be seen that the EMD has separated the space into complex-shaped clusters of related chemical families which have strong alignment with chemists' perception of similarity, which is not present with the CED. The maps produced with the CED contain many isolated clusters with trivial shapes and few members, with poor resolution of chemistry (i.e., many labels within a cluster).
In binary compositions, we know that the block of the periodic table from which each of the two elements comes will play a significant role in the resultant chemical properties. By labeling
these blocks, we can immediately see how UMAP has
partitioned the space into clusters of compositions from the
same, or arguably similar, blocks of the periodic table. A clear
example is the pink cluster in the upper left of the map, where
we may find every compound in the ICSD containing two p-
block elements. By embedding approximations of distances
from the metric space, these chemical maps have been given a
structure which aligns with domain knowledge. This alignment
arises because the EMD preserves chemical relationships
between elements, and thus the chemical context from the
periodic table is present in the metric, which allows reference
between elements (and by extension regions of the periodic
table) ensuring that these trends are well captured.
Maps of inorganic compositional space have previously been
created with the CED;32
however, for these maps to possess a
structure which aligns with chemical judgement, any method
employing CED requires a high incidence of compounds with
shared elements. When a similar methodology is applied to the
entire periodic table (e.g., binary compositions of the ICSD
shown in Figure 3b), all detailed structure present when using
the EMD is lost. When using the EMD, compounds with
elements from similar regions or blocks of the periodic table
are clustered in groups with nontrivial shapes, high purity (i.e.,
a low number of labels per cluster), with a sensible relationship
between clusters (Figure 3a). When using the CED, all these
desirable properties are lost (Figure 3b); clusters have trivial
shapes with little variation, are impure (i.e., have a high number
of labels in each cluster with combinations of each block in the
periodic table evenly distributed across the map), and the
clusters are evenly distributed throughout the projection.
Furthermore, rather than large, connected clusters generated
when using EMD, using the CED we get small and often
isolated islands, with no clear relationship between these.
Because compositions without shared elements have roughly
equal distances under the CED, there are not enough global
points of reference to place clusters in relation to one another
with fidelity. There is dense clustering of compounds with
similar stoichiometry (i.e., shared elements) due to the
comparatively small distances between these, making it difficult to differentiate points within clusters.

Figure 4. In (a), we see how the same embedding of binary compounds from Figure 3a may be segmented into 26 distinct clusters using the DBSCAN algorithm, with a full analysis of these provided in the Supporting Information. The AB2 compounds of cluster 13 are given in (b), where there are clear chemical trends. Here, the ordered grouping is a clear reflection of the landscape of reported compounds and of the relative stability of the AB2 structure prototype under different elemental doping.

This shows how a metric
with qualitatively poor ability at distinguishing compositions
will lead to quantitative confusion of known chemical
relationships. While the CED may provide enough distinguish-
ing quality to be of benefit when applied to certain chemical
data sets (e.g., where there are shared elements), the lack of
chemical relationship between elements leads to a loss of
discernibility and is a guaranteed source of noise for models.
By processing compositional vectors with respect to the EMD,
chemical relationships and context have been preserved to a
high enough standard that it may be captured reliably with
automated methods.
The use of the EMD enables the comparison between different choices of the elemental scale defining the indices in the compositional vector. This is not the case for the CED, where the distance is the same regardless of the elemental scale chosen. Even when using simple atomic numbers as the elemental index, the EMD introduces a significant structure to the UMAP generated clusters, leading to clusters with nontrivial shapes, however without the purity of labels observed when using the modified Pettifor scale (Figure S1). Elemental scales such as Pettifor's original Mendeleev number13 and alternate orderings of this scale33 result in plots with similar cluster shapes and purity to the modified Pettifor scale (Figures S2−S6). An alternative approach to the use of compositional vectors X and Y is the use of recently developed vectors of features which are derived from values of physicochemical properties of the elements present in the composition.34−36 Application of UMAP to the Euclidean distances between the magpie features37 of these binary compounds results in clusters with low levels of chemical purity, similar to the results obtained using the EMD with the atomic number scale (Figure S8).
The labels in Figure 4a are assigned with the density-based clustering algorithm DBSCAN38 on the points obtained by the EMD and UMAP, which assigns class labels to groups of points such that clusters which have been closely plotted together on the plane share a label. As there are few pre-existing chemical properties we can attribute to this data set, we rely on unsupervised learning to gain insights from our data with ML.
Here, the EMD gives strong separation into chemical clusters with clear patterns in atomic trends, the most prominent example being the zoom of cluster 13, as shown in Figure 4b, where most compounds are of the form AB2 with the A ion being a lanthanide and the B ion being a transition metal. It is worth noting here that the analogous YB2 phases (orange dots) are classified with the other lanthanides (brown dots), reflecting the common practice of grouping them together as the rare-earth metals. Another notable feature is cluster 20, containing the entirety of the transition-metal d−d compounds, with increased concentrations of each transition metal as we progress around the crescent. After cluster 13, the second clearest example of parallel trendlines in chemical features can be seen to the left of cluster 6, with the general form AB3, A being an f-block metal and B being a p-block metal. From the top left of the cluster to the bottom right, the A ions follow the Pettifor scale. Across each successive line from left to right, the B ions progress through Al, Ga, In, Tl, Pb, Sn, and finally Ge. Complete analysis of each DBSCAN cluster is given in the Supporting Information. With no prior chemical knowledge of these compounds, we can draw attention to underlying chemical properties, providing visually qualitative maps capturing families of clear relation.

Figure 5. 125,627 compositions from the ICSD with their inter-compound EMD calculated and the resultant distances reduced to two-dimensional coordinates with both UMAP (a) and PCA (b). Three candidate solid-state electrolytes are overlaid with the planar distances between points labeled. In addition, the standard deviation of electronegativity for the elements in each compound is given by the coloring, from red (more covalent) to blue (more ionic). It can be seen that UMAP has accentuated some of the more subtle aspects of chemical similarity by distorting the more accurate representation given to us by PCA, where many regions are too densely plotted to make out clearly.
Inorganic Crystal Structure Database. For each of the 125,627 unique compound formulae in the ICSD, we may apply the same process, but defining clusters becomes difficult due to the scale of the task. We may instead attribute known chemistry to each composition to uncover underlying trends in the data. In Figure 5a, we see these compounds plotted via UMAP and colored by taking the standard deviation of the respective electronegativities of the constituent atoms. We calculate this by taking the associated Pauling electronegativity for each of the nonzero elements in a compositional vector, giving a set of electronegativities, E, of length n. The average electronegativity of the set, ē, is calculated, and the standard deviation is obtained via the standard formula

$$\mathrm{SD}(E) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(e_i - \bar{e})^2}$$

This simple measure reveals a clear trend in chemical property across the reported compounds, from the more ionic compounds across the right side of Figure 5a to the more covalently bonded across the left boundary.
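A small sketch of this coloring feature is shown below; the use of pymatgen to look up Pauling electronegativities is an assumption about tooling (the paper does not state how the values were obtained), and only the distinct elements present are considered, as in the formula above.

```python
import numpy as np
from pymatgen.core.periodic_table import Element  # one possible source of Pauling values

def electronegativity_sd(elements):
    """Population standard deviation of the Pauling electronegativities of the
    distinct elements in a composition, as used to color Figure 5."""
    en = np.array([Element(el).X for el in elements])
    return float(np.sqrt(np.mean((en - en.mean()) ** 2)))

print(electronegativity_sd(["Li", "F"]))   # large spread: more ionic
print(electronegativity_sd(["Ga", "As"]))  # small spread: more covalent
```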
It should be noted that the UMAP algorithm emphasizes the
clusters of a metric space. When optimizing distances, UMAP
ensures that clusters of compounds are closely packed within
families and clearly separated from other clusters on the plane.
Unsupervised density-based clustering algorithms such as
DBSCAN therefore work consistently and effectively on the resultant plots, allowing the swift classification of new
compounds from existing knowledge.
While local neighborhoods will have similar structure to the
metric space, the global trends appear warped. This is
highlighted by the three candidate solid-state electrolytes from the previous section
overlaid on Figure 5. It can be seen that these do follow the
approximate similarity given to us from EMD but have been
distorted from the perfect line they fall on in the EMD metric
space. We may take the local distance between each of the
embedded points, and by calculating the Pearson correlation
between each of these and their associated EMD, the quality of
these embeddings may be assessed. While UMAP has given
value by separating these projections into clusters of familial
relation, by referring to Table 2, with a correlation of 0.748
many of the distances have been distorted from their true
values, making these potentially unsuitable inputs for
regression tasks. We would expect that reducing our distances
to higher dimensional coordinate systems would give UMAP
more degrees of freedom when embedding a graph layout;
however, the correlation does not improve past two
dimensions, in part due to the implementation's focus on planar projections.
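A simplified, all-pairs version of this quality check is sketched below (the text describes using local distances; restricting the comparison to nearest neighbours would follow the same pattern). It assumes a square EMD matrix and an embedding array with one row per compound.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

def embedding_correlation(D_emd, embedding):
    """Pearson correlation between the true EMD matrix and the Euclidean
    distances of the embedded points (higher means a more faithful embedding)."""
    d_embedded = pdist(embedding)             # condensed pairwise Euclidean distances
    d_true = squareform(D_emd, checks=False)  # condensed form of the square EMD matrix
    r, _ = pearsonr(d_embedded, d_true)
    return r
```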
Principal Component Analysis. A truer picture of the metric space may be obtained via PCA, a dimensionality reduction technique widely used in the natural sciences for projecting data along the axes of greatest variance. In practice, this compresses every one of the distances linearly, and as there is rarely an embedding of a higher dimensional point cloud in a lower dimensional Euclidean space which respects the global structure perfectly, this often creates overcrowded plots with a loss of intrinsic structure. When applied to our data set, this does create overly dense regions of points, making these unsuitable for the automated identification of clusters with an algorithm such as DBSCAN; however, the embedding retains a strong resemblance to the true structure of the metric space, as shown in Figure 5b.
We take an initial embedding of the EMD inter-compound
distance matrix by considering each row as a centered
embedded point, which is produced by taking each value in
the distance matrix and forming a Gram-centered matrix.39
These normalized points can then be reduced with PCA, giving
us a linear projection of the data set. We have found that even
in lower dimensional spaces, the local Euclidean distances
between points retain a reasonably high degree of correlation
with the EMD. A full discussion on the structure of this space
is beyond the scope of this paper, but it should be noted that
the space has a non-Euclidean geometry, with a brief
introduction to this topic given in the Supporting Information.
This may make the EMD unsuitable for direct application in
some clustering and supervised learning algorithms, without
prior dimensionality reduction. In three dimensions, with a
correlation of 0.945, we may take these as semi-reliable
reduced composition vectors with respect to the Euclidean
distance. Embedding to higher dimensions with PCA does not
improve on this correlation, as the underlying space is seen to be approximately two-dimensional,40
with an observed global
saddle shape in three dimensions.
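One standard way to realise this Gram-centring construction is classical multidimensional scaling, sketched below under the assumption that it is equivalent to the centred-PCA step described above; negative eigenvalues, which signal the non-Euclidean geometry mentioned in the text, are simply clipped.

```python
import numpy as np

def gram_centered_embedding(D, k=3):
    """Classical MDS: double-centre the squared distance matrix into a Gram
    matrix and keep the top-k eigenvector coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centring matrix
    B = -0.5 * J @ (D ** 2) @ J             # Gram matrix of the centred configuration
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]
    coords = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))
    return coords
```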
It can be seen that each of the single elements may be found
along the bottom edge of the plot at the tip of each of the
parabolas. Trending away from these are the associated binary
and ternary compounds in divergent lines of placement. We
can clearly see the abundance and scarcity of reported
compositions containing certain elements along the modied
Pettifor scale, and trends in chemical makeup can be observed.
While this may not give us the best map of compositions for
effective ML, it remains valuable for its accurate realization of
the metric space. This enables us to map the chemical
relationships between all of the compositions in the ICSD,
with confidence that our embedding is representative of the
relation between compounds given to us with the metric,
which may be explored interactively at www.elmd.io/plots/.
CONCLUSIONS
By directly calculating similarity of constituent elements, we
present the EMD as a computable mathematical relationship
between any two compounds. This provides a natural
extension to the physical scale introduced by Pettifor, allowing
us to not only calculate the similarity of elements, but to
quantitatively measure the similarity of compounds. These
distances give a reliable measure of chemical similarity which
aligns with human judgement, which we may use to associate
relationships either analytically or with ML. These distances present a method to connect separate chemical data sets and to pair compounds (potentially theoretical ones) to reported chemical information.
Table 2. Pearson's Correlation Coefficient between the Complete Euclidean Distance Matrix for Embedded Points and the True EMD Distances between Compounds in Successively Higher Dimensional Embeddings
embedded dimension UMAP PCA
1 0.538 0.860
2 0.748 0.938
3 0.736 0.945
5 0.661 0.945
A search interface using the compound formula as a query may be implemented, providing chemists with a natural way to retrieve and explore data. This has been demonstrated by pairing a recent survey of 842 compositions with known ionic conductivity to their most likely reported structural information in the ICSD where, with a cutoff distance of 1, we have automatically returned a good
match in 94% of cases. One clear future possibility is to
connect chemical properties from multiple databases of
potentially different compositions, where the distance may be
used as a numeric measure of uncertainty for each assignment.
When designing statistical models, it is tempting to include
all available chemical information in the hopes of arriving at
the most accurate correlative results possible. There is however
growing sentiment within the community to go further than simply black box curve fitting statistical models,41 with an
increased call for interpretable models which not only give
predictions but also some understanding of how we have
arrived at our answer. Here, we use the EMD to visualize and
analyze solid-state compounds in the ICSD, including the
subset of binary compounds. With this metric, we have created
detailed chemical maps using modern data visualization
techniques, which preserve clear trends in chemical relation-
ships. The quality of these maps is of high enough degree for
the unsupervised ML method DBSCAN to automatically
assign cluster labels such that similar compositions share a
label. These assignments have a verifiable alignment with human judgement, which comes from the domain knowledge engineered into the metric. Meaningfully
understanding any large chemical data set is a daunting task,
and these maps aid us by giving a broad overview of a
compositional space. It has been shown that simple metrics like
the CED are ineffective for this, as they do not possess the resolution to differentiate disparate compositions in a space as
complex as the domain of feasible compounds. This leads to an
assessment of numeric similarity which does not align with
chemical judgement, and in creating maps using this metric, we
find dissimilar compositions in close proximity to one another.
In traditional ML models, for those with no background in
statistical inference, determining why two points have a
calculated proximity may be challenging. With the EMD,
should greater depth of investigation be required, a complete
analytic solution can be calculated between two points to
justify their exact positioning with respect to one another. These
solutions provide chemists with thorough explanations for why
two materials in a map have their calculated vicinity.
Understanding ML predictions requires us to not only
understand the materials but also the relationships between
these. Stepping back from the forest of details allows us to look
for general patterns, and with the results of ever more
experiments readily available, we need ML to carry this
forward. Following patterns to predict complex physical
properties with 100% accuracy may prove to be impossible,
but we know that natural trends, although well hidden, almost
always exist. If we are to understand these, we believe that the
EMD and other crafted metrics will prove to be invaluable
tools in the categorization of material space and in further
interpreting the artificial intelligence we use in cheminformatics.
ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge at
https://pubs.acs.org/doi/10.1021/acs.chemmater.0c03381.
In-depth discussion of the EMD, plots of the binary compounds generated using different approaches, and more in-depth analysis of the unsupervised-learning-derived clusters (PDF)
AUTHOR INFORMATION
Corresponding Author
Matthew S. Dyer – Department of Chemistry, University of
Liverpool, Liverpool L69 7ZD, U.K.; orcid.org/0000-
0002-4923-3003; Email: msd30@liverpool.ac.uk
Authors
Cameron J. Hargreaves – Department of Chemistry and
Department of Computer Science, University of Liverpool,
Liverpool L69 7ZD, U.K.
Michael W. Gaultois – Department of Chemistry and
Leverhulme Research Center for Functional Material Design,
Materials Innovation Factory, University of Liverpool,
Liverpool L69 7ZD, U.K.; orcid.org/0000-0003-2172-
2507
Vitaliy A. Kurlin – Department of Computer Science and
Leverhulme Research Center for Functional Material Design,
Materials Innovation Factory, University of Liverpool,
Liverpool L69 7ZD, U.K.
Matthew J. Rosseinsky – Department of Chemistry and
Leverhulme Research Center for Functional Material Design,
Materials Innovation Factory, University of Liverpool,
Liverpool L69 7ZD, U.K.; orcid.org/0000-0002-1910-
2483
Complete contact information is available at:
https://pubs.acs.org/10.1021/acs.chemmater.0c03381
Notes
The authors declare no competing financial interest.
All code used to generate distances between compositions may
be found at https://github.com/lrcfmd/ElMD/.
All supporting code documentation and presentation of results
may be found at https://www.elmd.io/.
ACKNOWLEDGMENTS
The authors would like to thank the following colleagues at the
University of Liverpool, Materials Innovation Factory for their
dedicated efforts in assessing the quality of structural matching
functionality: Alexandra Morscher, Andrij Vasylenko, Aris
Robinson, Arnaud Perez, Benjamin Duff, Bernhard Leube,
Catriona Crawford, Chris Collins, Elvis Shoko, Jacinthe
Gamon, Kate Thompson, Lu Wang, Matthew Bright, Matthew
Wright, Michael Moran, Oliver Rogan, Paul Sharp, Prasad
Beluvalli Eshwarappa, Will Thomas, Yun Dang, and Yundong
Zhou. We would additionally like to thank Jiahui An, Yulan
Liu, and Wenkai Zhang for their work in the development and
testing of code. We thank Leszek Gąsieniec (University of
Liverpool) for his helpful comments on multi-commodity flow
problems, and Alice Rizzardo (University of Liverpool) for her
valuable input on the structure of the EMD metric space. This
work was supported by the University of Liverpool (student-
ship to C.J.H.), by the Faraday Institution (grant number
FIRG007), and the EPSRC grant New Approaches to Data
Science: Application Driven Topological Data Analysis EP/
R018472/1. The authors thank the Leverhulme Trust for
funding this research via the Leverhulme Research Centre for
Functional Materials Design (RC-2015-036). M.J.R. thanks the
Royal Society for the award of a Research Professor position.
This work was undertaken on Barkla, part of the High
Performance Computing facilities at the University of
Liverpool, UK.
REFERENCES
(1) Schmidt, J.; Shi, J.; Borlido, P.; Chen, L.; Botti, S.; Marques, M.
A. L. Predicting the thermodynamic stability of solids combining
density functional theory and machine learning. Chem. Mater. 2017,
29, 5090−5103.
(2) Wen, C.; Zhang, Y.; Wang, C.; Xue, D.; Bai, Y.; Antonov, S.; Dai,
L.; Lookman, T.; Su, Y. Machine learning assisted design of high
entropy alloys with desired property. Acta Mater. 2019, 170, 109−117.
(3) Jha, D.; Ward, L.; Paul, A.; Liao, W.-K.; Choudhary, A.;
Wolverton, C.; Agrawal, A. ElemNet: Deep learning the chemistry of
materials from only elemental composition. Sci. Rep. 2018,8, 17593.
(4) Rubner, Y.; Tomasi, C.; Guibas, L. The Earth Mover's Distance
as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99−121.
(5) Orlova, D. Y.; Zimmerman, N.; Meehan, S.; Meehan, C.; Waters,
J.; Ghosn, E. E. B.; Filatenkov, A.; Kolyagin, G. A.; Gernez, Y.; Tsuda,
S.; Moore, W.; Moss, R. B.; Herzenberg, L. A.; Walther, G. Earth
mover's distance (EMD): a true metric for comparing biomarker
expression levels in cell populations. PLoS One 2016,11,
No. e0151859.
(6) Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. From word
embeddings to document distances. Proceedings of the 32nd Interna-
tional Conference on Machine Learning, 2015; Vol. 37, pp 957−966.
(7) Monge, G. Mémoire sur la théorie des déblais et des remblais.
Histoire de l'Académie Royale des Sciences de Paris, 1781; pp 666−704.
(8) Pele, O.; Werman, M. A linear time histogram metric for
improved SIFT matching. Computer Vision – ECCV 2008, 2008; pp 495−508.
(9) Ahuja, R. K.; Magnanti, T. L.; Orlin, J. B. Minimum Cost Flows:
Network Simplex Algorithms. Network Flows, 1st ed.; Prentice Hall:
Harlow, 1993; pp 402−461.
(10) Pettifor, D. G. A chemical scale for crystal-structure maps. Solid
State Commun. 1984, 51, 31−34.
(11) Goldsmith, B. R.; Boley, M.; Vreeken, J.; Scheffler, M.;
Ghiringhelli, L. M. Uncovering structure-property relationships of
materials by subgroup discovery. New J. Phys. 2017,19, 013031.
(12) Isayev, O.; Fourches, D.; Muratov, E. N.; Oses, C.; Rasch, K.;
Tropsha, A.; Curtarolo, S. Materials cartography: Representing and
mining materials space using structural and electronic fingerprints.
Chem. Mater. 2015,27, 735743.
(13) Glawe, H.; Sanna, A.; Gross, E. K. U.; Marques, M. A. L. The
optimal one dimensional periodic table: a modified Pettifor chemical
scale from data mining. New J. Phys. 2016,18, 093011.
(14) Ong, S. P.; Chevrier, V. L.; Hautier, G.; Jain, A.; Moore, C. J.;
Kim, S.; Ma, X.; Ceder, G. Voltage, stability and diffusion barrier
differences between sodium-ion and lithium-ion intercalation
materials. Energy Environ. Sci. 2011, 4, 3680−3688.
(15) Haghighatlari, M.; Shih, C. Y.; Hachmann, J.. Thinking globally,
acting locally: On the issue of training set imbalance and the case for
local machine learning models in chemistry. 2019, 8796947,
ChemRxiv. https://chemrxiv.org/articles/preprint/Thinking_
Globally_Acting_Locally_On_the_Issue_of_Training_Set_
Imbalance_and_the_Case_for_Local_Machine_Learning_Models_
in_Chemistry/8796947/2 (accessed November 11, 2020).
(16) Hellenbrandt, M. The inorganic crystal structure database
(ICSD) – present and future. Crystallogr. Rev. 2004, 10, 17−22.
(17) Pedregosa, F.; Varoquax, G.; Gramfort, A.; Michel, V.; Thirion,
B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.;
Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.;
Duchesnay, E. Scikit-learn: Machine Learning in Python. J. Mach.
Learn. Res. 2011, 12, 2825−2830.
(18) McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold
Approximation and Projection for Dimension Reduction. 2018,
1802.03426, arXiv, https://arxiv.org/abs/1802.03426 (accessed No-
vember 11, 2020).
(19) Krzanowski, W. Subspace representation of units, derived from
numerical dissimilarities. Principles of Multivariate Analysis, 2nd ed.;
Oxford Statistical Science Series; Oxford University Press, 2000; pp
104109.
(20) Kolouri, S.; Park, S. R.; Thorpe, M.; Slepcev, D.; Rohde, G. K.
Optimal Mass Transport: Signal processing and machine-learning
applications. IEEE Signal Process. Mag. 2017, 34, 43−59.
(21) Fourches, D.; Muratov, E.; Tropsha, A. Trust, but verify: On
the importance of chemical structure curation in cheminformatics and
QSAR modeling research. J. Chem. Inf. Model. 2010, 50, 1189−1204.
(22) Stokes, J. M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.;
Donghia, N. M.; MacNair, C. R.; French, S.; Carfrae, L. A.; Bloom-
Ackermann, Z.; Tran, V. M.; Chiappino-Pepe, A.; Badran, A. H.;
Andrews, I. W.; Chory, E. J.; Church, G. M.; Brown, E. D.; Jaakkola,
T. S.; Barzilay, R.; Collins, J. J. A deep learning approach to antibiotic
discovery. Cell 2020, 181, 475−483.
(23) Duan, J.; Dixon, S. L.; Lowrie, J. F.; Sherman, W. Analysis and
comparison of 2d fingerprints: Insights into database screening
performance using eight fingerprint methods. J. Mol. Graphics 2010,
29, 157−170.
(24) Bartók, A. P.; Kondor, R.; Csányi, G. On representing chemical
environments. Phys. Rev. B: Condens. Matter Mater. Phys. 2013,87,
184115.
(25) Ziletti, A.; Kumar, D.; Scheffler, M.; Ghiringhelli, L. M.
Insightful classification of crystal structures using deep learning. Nat.
Commun. 2018,9, 2775.
(26) Ward, L.; Dunn, A.; Faghaninia, A.; Zimmermann, N. E. R.;
Bajaj, S.; Wang, Q.; Montoya, J.; Chen, J.; Bystrom, K.; Dylla, M.;
Chard, K.; Asta, M.; Persson, K. A.; Snyder, G. J.; Foster, I.; Jain, A.
Matminer: An open source toolkit for materials data mining. Comput.
Mater. Sci. 2018,152,6069.
(27) Ward, L.; Liu, R.; Krishna, A.; Hegde, V. I.; Agrawal, A.;
Choudhary, A.; Wolverton, C. Including crystal structure attributes in
machine learning models of formation energies via Voronoi
tessellations. Phys. Rev. B: Condens. Matter Mater. Phys. 2017,96,
024104.
(28) Zhang, Y.; He, X.; Chen, Z.; Bai, Q.; Nolan, A. M.; Roberts, C.
A.; Banerjee, D.; Matsunaga, T.; Mo, Y.; Ling, C. Unsupervised
discovery of solid-state lithium ion conductors. Nat. Commun. 2019,
10, 5260.
(29) Schmidt, J.; Marques, M. R. G.; Botti, S.; Marques, M. A. L.
Recent advances and applications of machine learning in solid-state
materials science. npj Comput. Mater. 2019,5, 83.
(30) Wood, E. A. Polymorphism in potassium niobate, sodium
niobate, and other ABO3 compounds. Acta Crystallogr. 1951, 4, 353.
(31) Villars, P.; Cenzual, K.; Daams, J.; Chen, Y.; Iwata, S. Data-
driven atomic environment prediction for binaries using the
Mendeleev number: Part 1. Composition AB. J. Alloys Compd.
2004, 367, 167−175.
(32) Zakutayev, A.; Wunder, N.; Schwarting, M.; Perkins, J. D.;
White, R.; Munch, K.; Tumas, W.; Philips, C. An open experimental
database for exploring inorganic materials. Sci. Data 2018,5, 180053.
(33) Villars, P.; Brandenburg, K.; Berndt, M.; LeClair, S.; Jackson,
A.; Pao, Y.-H.; Igelnik, B.; Oxley, M.; Bakshi, B. R.; Chen, P.; Iwata, S.
Binary, ternary and quaternary compound former/nonformer
prediction via Mendeleev number. J. Alloys Compd. 2001, 317, 26−38.
(34) Tshitoyan, V.; Dagdelen, J.; Weston, L.; Dunn, A.; Rong, Z.;
Kononova, O.; Persson, K. A.; Ceder, G.; Jain, A. Unsupervised word
embeddings capture latent knowledge from materials science
literature. Nature 2019, 571, 95−98.
(35) Isayev, O.; Oses, C.; Toher, C.; Gossett, E.; Curtarolo, S.;
Tropsha, A. Universal fragment descriptors for predicting properties
of inorganic crystals. Nat. Commun. 2017,8, 15679.
... (with X representing the first five bins of an RDF) will have the same Euclidean distance to Y = ... suggested as a way to quantify compositional similarity in inorganic materials. 36 In this work, ...
... Hargreaves et al. 36 Our method represents the normalised elemental fractions as a 78-element vector in atomic-number order (considering elements up to Bi, but excluding noble gases); taking SrTiO3 as a representative example would give values of 0.6, 0.2 and 0.2 at the 7th, 19th and 34th positions in this vector, respectively. Rather than ordering this vector by the Pettifor scale and computing the EMD directly as in [36], we instead introduce a pairwise dissimilarity metric (Fig. S3) between elements based on the statistical likelihood of species occurring within the same crystal structure (see methods). 36 The advantage of this approach is that while the Pettifor scale assumes a constant distance between adjacent species, the substitutional (dis)similarity approach gives a more chemically meaningful metric. ...
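To make the construction in this excerpt concrete, the short Python sketch below builds fractional composition vectors over a fixed element alphabet and evaluates an earth mover's distance under an arbitrary pairwise element dissimilarity matrix. It is only an illustration of the idea: the four-element alphabet and the dissimilarity values are invented placeholders (the cited work uses a 78-element vector and a data-derived matrix), and availability of the POT library's ot.emd2 solver is assumed.

import numpy as np
import ot  # POT (Python Optimal Transport), assumed installed via `pip install pot`

# Toy element alphabet and a purely hypothetical ground-distance matrix;
# the real method uses 78 elements and a statistically derived dissimilarity matrix.
elements = ["O", "Ti", "Ca", "Sr"]
D = np.array([[0.0, 0.8, 0.9, 0.9],
              [0.8, 0.0, 0.5, 0.6],
              [0.9, 0.5, 0.0, 0.1],
              [0.9, 0.6, 0.1, 0.0]])

def comp_vector(counts):
    # Normalised elemental fractions over the fixed alphabet above.
    v = np.array([float(counts.get(el, 0.0)) for el in elements])
    return v / v.sum()

srtio3 = comp_vector({"Sr": 1, "Ti": 1, "O": 3})  # [0.6, 0.2, 0.0, 0.2]
catio3 = comp_vector({"Ca": 1, "Ti": 1, "O": 3})  # [0.6, 0.2, 0.2, 0.0]

# ot.emd2 returns the optimal transport cost between the two histograms
# under the ground-distance matrix D, i.e. the EMD between the compositions.
print(ot.emd2(srtio3, catio3, D))  # here 0.2 * 0.1 = 0.02 (move the Sr fraction to Ca)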
Article
Full-text available
Determining how similar two materials are in terms of both atomic composition and crystallographic structure remains a challenge, the solution of which would enable generalised machine learning using crystal structure...
... In our previous work, we introduced the Element Mover's Distance (ElMD) 28 as a metric to quantify the similarity between two chemical formulae. This was demonstrated to be an expressive measure of chemical similarity that aligns with domain expert judgement. ...
... A centred Gram matrix 30 is first obtained from the given distance matrix, and singular value decomposition of the Gram matrix is then carried out to obtain the coordinates of each point projected onto the first two principal components. PCA linearly scales each metric distance to maximally preserve each of the interpoint relationships across the dataset, which has previously been shown to closely reflect the true structure of the metric space 28. [Figure legend residue: Complete Distribution (465), Anti-Perovskite (8), Argyrodite (23), Garnet (67), Glass (36), Glass-Ceramic (7), LISICON (28), Lysonite (2), NASICON (154), Olivine (6), Other (17), Perovskite (64); Fig. 2 caption: Distribution of room temperature conductivities across expert-curated structural families.] ...
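The projection described in this excerpt is classical multidimensional scaling (principal coordinates analysis). A minimal numpy sketch of that procedure is given below; it is a generic illustration rather than the cited authors' code, and assumes only a symmetric metric distance matrix D, for example a matrix of pairwise ElMD values.

import numpy as np

def classical_mds(D, n_components=2):
    # Double-centre the squared distance matrix to obtain the centred Gram matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J
    # Eigendecomposition of the symmetric Gram matrix (equivalent to its SVD).
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    # Coordinates of each point along the leading principal components.
    return eigvecs[:, order] * scale

# Toy example: three collinear points at 0, 1 and 3 on a line are recovered (up to sign).
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
print(classical_mds(D))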
Article
Full-text available
The application of machine learning models to predict material properties is determined by the availability of high-quality data. We present an expert-curated dataset of lithium ion conductors and associated lithium ion conductivities measured by a.c. impedance spectroscopy. This dataset has 820 entries collected from 214 sources; entries contain a chemical composition, an expert-assigned structural label, and ionic conductivity at a specific temperature (from 5 to 873 °C). There are 403 unique chemical compositions with an associated ionic conductivity near room temperature (15–35 °C). The materials contained in this dataset are placed in the context of compounds reported in the Inorganic Crystal Structure Database with unsupervised machine learning and the Element Movers Distance. This dataset is used to train a CrabNet-based classifier to estimate whether a chemical composition has high or low ionic conductivity. This classifier is a practical tool to aid experimentalists in prioritizing candidates for further investigation as lithium ion conductors.
... CrabNet [24] | Composition-based property regression | Predict performance for proxy scores
ElMD [25] | Composition-based distance metric | Supply distance matrix to DensMAP
DensMAP [26] | Density-aware dimensionality reduction | Obtain densities for density proxy
HDBSCAN [27] | Density-aware clustering | Create chemically homogeneous clusters
Peak proxy ...
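The table above can be read as a small pipeline: a composition-based distance matrix feeds a density-aware embedding, which in turn feeds density-aware clustering. The sketch below strings these pieces together on a handful of formulae. It is a rough illustration only; it assumes the ElMD, umap-learn and hdbscan packages are installed and expose the interfaces used here (ElMD(formula).elmd(other), UMAP(densmap=True, metric="precomputed") and HDBSCAN(metric="precomputed")), and it omits the CrabNet regression and proxy-score steps.

import numpy as np
from ElMD import ElMD   # assumed interface: ElMD("formula").elmd("other")
import umap
import hdbscan

formulas = ["Li7La3Zr2O12", "LiCoO2", "NaCl", "SrTiO3", "Li10GeP2S12"]

# Pairwise ElMD distance matrix (a dedicated pairwise implementation would
# be preferable for large datasets).
n = len(formulas)
dist = np.zeros((n, n))
for i in range(n):
    x = ElMD(formulas[i])
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = x.elmd(formulas[j])

# Density-aware 2D embedding from the precomputed distances.
embedding = umap.UMAP(densmap=True, metric="precomputed", n_neighbors=3).fit_transform(dist)

# Density-aware clustering on the same distance matrix.
labels = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2).fit_predict(dist)
print(labels)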
Preprint
Full-text available
One of the biggest unsolved problems in condensed matter physics is what mechanism causes high-temperature superconductivity and whether there is a material that can exhibit superconductivity at both room temperature and atmospheric pressure. Among the many important properties of a superconductor, the critical temperature (Tc) or transition temperature is the point at which a material transitions into a superconductive state. In this implementation, machine learning is used to predict the critical temperatures of chemically unique compounds in an attempt to identify new chemically novel, high-temperature superconductors. The training data set (SuperCon) consists of known superconductors and their critical temperatures, and the testing data set (NOMAD) consists of around 700,000 novel chemical formulae. The chemical formulae in these data sets are first passed through a collection of rapid screening tools, SMACT, to check for chemical validity. Next, the DiSCoVeR algorithm is trained on the SuperCon data to form a model, which then screens batches of the formulae in the NOMAD data set. Combining a chemical distance metric, density-aware dimensionality reduction, clustering, and a regression model, the DiSCoVeR algorithm serves as a tool to identify and assess these superconducting compositions [1]. This research and implementation resulted in the screening of chemically novel compositions exhibiting critical temperatures upwards of 150 K, which corresponds to superconductors in the cuprate class. This implementation demonstrates a process of performing machine learning-assisted superconductor screening (while exploring chemically distinct spaces) which can be utilized in the materials discovery process.
... 3,4 Meanwhile, Matt Rosseinsky's group had recently developed a new tool for assessing chemical similarity. 5 By putting these two tools together, we have been able to create generative models that can be guided away from common chemistries toward unusual new materials. 6 We are now constructing structural distance metrics in order to steer discovery toward unusual structures as well. ...
... For a given candidate 2D material formula, the TCSP algorithm first searches all known 2D material structure templates that share the same composition prototype as this formula (e.g., SiTiO3 has prototype ABC3). The Element Mover's Distance (ElMD) [34] is used to measure the compositional similarity between the query formula and the compositions of all possible template structures. It then picks the top 5 structures with the smallest compositional distances as the candidate templates. ...
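As a sketch of the ranking step described in this excerpt (leaving out the prototype-matching filter), the snippet below orders a list of template compositions by their ElMD to the query formula and keeps the five closest. The template list is an arbitrary placeholder and the ElMD package interface (ElMD(formula).elmd(other)) is assumed, as above.

from ElMD import ElMD  # assumed interface, as above

def closest_templates(query_formula, template_formulas, k=5):
    # Rank candidate templates by compositional (ElMD) distance to the query.
    q = ElMD(query_formula)
    return sorted(template_formulas, key=lambda t: q.elmd(t))[:k]

templates = ["CaTiO3", "BaTiO3", "MgSiO3", "SrZrO3", "KNbO3", "NaCl", "MoS2"]
print(closest_templates("SiTiO3", templates))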
Preprint
Full-text available
Two-dimensional (2D) materials have wide applications in superconductors, quantum, and topological materials. However, their rational design is not well established, and currently less than 6,000 experimentally synthesized 2D materials have been reported. Recently, deep learning, data-mining, and density functional theory (DFT)-based high-throughput calculations are widely performed to discover potential new materials for diverse applications. Here we propose a generative material design pipeline, namely material transformer generator (MTG), for large-scale discovery of hypothetical 2D materials. We train two 2D materials composition generators using self-learning neural language models based on Transformers with and without transfer learning. The models are then used to generate a large number of candidate 2D compositions, which are fed to known 2D materials templates for crystal structure prediction. Next, we performed DFT computations to study their thermodynamic stability based on energy-above-hull and formation energy. We report four new DFT-verified stable 2D materials with zero e-above-hull energies, including NiCl4, IrSBr, CuBr3, and CoBrCl. Our work thus demonstrates the potential of our MTG generative materials design pipeline in the discovery of novel 2D materials and other functional materials.
... Since synthesis also plays a crucial role in real-world materials discovery, the sintering temperature (a key synthesis parameter) of the material is included. Finally, to evaluate the diversity of the generated compounds, the Element Mover's Distance (ElMD) [11] and % uniqueness are employed. ...
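For illustration, the two diversity measures mentioned here could be computed along the following lines: the percentage of unique formulae in a generated batch, and the mean pairwise ElMD between the unique formulae. This is a hedged sketch, not the cited authors' evaluation code; the ElMD interface is assumed as above and the example formulae are placeholders.

from itertools import combinations
from ElMD import ElMD  # assumed interface, as above

def diversity(formulas):
    # Percentage of formulae in the batch that are unique.
    pct_unique = 100.0 * len(set(formulas)) / len(formulas)
    # Mean pairwise ElMD between the unique formulae.
    unique = sorted(set(formulas))
    pairs = list(combinations(unique, 2))
    mean_elmd = sum(ElMD(a).elmd(b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return pct_unique, mean_elmd

print(diversity(["LiFePO4", "LiFePO4", "NaFePO4", "KMnO4"]))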
Preprint
A major obstacle to the realization of novel inorganic materials with desirable properties is the inability to perform efficient optimization across both materials properties and synthesis of those materials. In this work, we propose a reinforcement learning (RL) approach to inverse inorganic materials design, which can identify promising compounds with specified properties and synthesizability constraints. Our model learns chemical guidelines such as charge and electronegativity neutrality while maintaining chemical diversity and uniqueness. We demonstrate a multi-objective RL approach, which can generate novel compounds with targeted materials properties including formation energy and bulk/shear modulus alongside a lower sintering temperature synthesis objectives. Using this approach, the model can predict promising compounds of interest, while suggesting an optimized chemical design space for inorganic materials discovery.
... Therefore, mofdscribe allows users to flexibly choose from a wide variety of elemental properties in addition to other encodings such as the (modified) Pettifor scales, which have been shown to better capture similarities of elements across the periodic table. [58][59][60] For instance, Pettifor scales can be ... [Listing 2 caption: Example of using aggregations in mofdscribe.] Many featurizers compute more than one feature vector per structure; for instance, one feature vector per atom. ...
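As a small illustration of a Pettifor-style element encoding (not mofdscribe's actual API), the snippet below maps each element of a composition to the Mendeleev number exposed by pymatgen's Element.mendeleev_no, which follows Pettifor's ordering and may differ slightly from the modified Pettifor scale discussed in the text.

from pymatgen.core import Composition

def pettifor_style_encoding(formula):
    # Map each element to (Mendeleev number, fractional amount).
    comp = Composition(formula).fractional_composition
    return {el.symbol: (el.mendeleev_no, round(frac, 3)) for el, frac in comp.items()}

print(pettifor_style_encoding("SrTiO3"))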
Preprint
Full-text available
The space of all plausible materials for a given application is so large that it cannot be explored using a brute-force approach. This is, in particular, the case for reticular chemistry which provides materials designers with a practically infinite playground on different length scales. One promising approach to guide the design and discovery of materials is machine learning, which typically involves learning a mapping of structures onto properties from data. While there have been plenty of examples of the use of machine learning for reticular materials, the progress in the field seems to have stagnated. From our perspective, an important reason is that digital reticular chemistry is still more an art than a science in which many parts are only accessible to experienced groups. The lack of standardization across all the steps of the machine learning pipeline makes it practically impossible to directly compare machine learning models and build on top of prior results. To confront these challenges, we present mofdscribe: a software ecosystem that accompanies—seasoned as well as novice—digital reticular chemists on all steps from ideation to model publication. Our package provides reference datasets (including a completely new one), more than 35 reported as well as completely novel featurization strategies, data splitters, and validation helpers which can be used to benchmark new modeling strategies on standard benchmark tasks and to report the results on a public leaderboard. We envision that this ecosystem allows for a more robust, comparable, and productive area of digital reticular chemistry.
Article
Weak thruster fault feature extraction and fault severity identification methods for autonomous underwater vehicles (AUVs) are studied in this paper. One traditional method of fault feature extraction is based on wavelet transformation + modified Bayes (MB), after which the grey relation analysis (GRA) method is used to identify the fault severity of the thruster. These methods are effective for strong thruster faults, but for weak faults the ratio of fault eigenvalues to noise eigenvalues in the extracted features is low and the fault identification accuracy is unsatisfactory. To overcome these deficiencies, resonance-based sparse signal decomposition (RSSD) together with stochastic resonance (SR) + MB is proposed for weak thruster fault feature extraction, and the Euclidean distance together with the grey relation (GR) method is proposed to improve the identification accuracy of weak thruster faults. Finally, pool experiments are performed on the Beaver II AUV, and the effectiveness of the proposed method is demonstrated by comparison.
Article
Full-text available
Although machine learning has gained great interest in the discovery of functional materials, the advancement of reliable models is impeded by the scarcity of available materials property data. Here we propose and demonstrate a distinctive approach for materials discovery using unsupervised learning, which does not require labeled data and thus alleviates the data scarcity challenge. Using solid-state Li-ion conductors as a model problem, unsupervised materials discovery utilizes a limited quantity of conductivity data to prioritize a candidate list from a wide range of Li-containing materials for further accurate screening. Our unsupervised learning scheme discovers 16 new fast Li-conductors with conductivities of 10⁻⁴–10⁻¹ S cm⁻¹ predicted in ab initio molecular dynamics simulations. These compounds have structures and chemistries distinct from known systems, demonstrating the capability of unsupervised learning for discovering materials over a wide materials space with limited property data. Predictions of new solid-state Li-ion conductors are challenging due to the diverse chemistries and compositions involved. Here the authors combine unsupervised learning techniques and molecular dynamics simulations to discover new compounds with high Li-ion conductivity.
Article
Full-text available
One of the most exciting tools that have entered the material science toolbox in recent years is machine learning. This collection of statistical methods has already proved to be capable of considerably speeding up both fundamental and applied research. At present, we are witnessing an explosion of works that develop and apply machine learning to solid-state systems. We provide a comprehensive overview and analysis of the most recent research in this topic. As a starting point, we introduce machine learning principles, algorithms, descriptors, and databases in materials science. We continue with the description of different machine learning approaches for the discovery of stable materials and the prediction of their crystal structure. Then we discuss research in numerous quantitative structure–property relationships and various approaches for the replacement of first-principle methods by machine learning. We review how active learning and surrogate-based optimization can be applied to improve the rational design process and related examples of applications. Two major questions are always the interpretability of and the physical understanding gained from machine learning models. We consider therefore the different facets of interpretability and their importance in materials science. Finally, we propose solutions and future research paths for various challenges in computational materials science.
Article
Full-text available
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases 1,2, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing 3–10, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings 11–13 (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
Article
Full-text available
Conventional machine learning approaches for predicting material properties from elemental compositions have emphasized the importance of leveraging domain knowledge when designing model inputs. Here, we demonstrate that by using a deep learning approach, we can bypass such manual feature engineering requiring domain knowledge and achieve much better results, even with only a few thousand training samples. We present the design and implementation of a deep neural network model referred to as ElemNet; it automatically captures the physical and chemical interactions and similarities between different elements using artificial intelligence which allows it to predict the materials properties with better accuracy and speed. The speed and best-in-class accuracy of ElemNet enable us to perform a fast and robust screening for new material candidates in a huge combinatorial space; where we predict hundreds of thousands of chemical systems that could contain yet-undiscovered compounds.
Article
Due to the rapid emergence of antibiotic-resistant bacteria, there is a growing need to discover new antibiotics. To address this challenge, we trained a deep neural network capable of predicting molecules with antibacterial activity. We performed predictions on multiple chemical libraries and discovered a molecule from the Drug Repurposing Hub—halicin—that is structurally divergent from conventional antibiotics and displays bactericidal activity against a wide phylogenetic spectrum of pathogens including Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. Halicin also effectively treated Clostridioides difficile and pan-resistant Acinetobacter baumannii infections in murine models. Additionally, from a discrete set of 23 empirically tested predictions from >107 million molecules curated from the ZINC15 database, our model identified eight antibacterial compounds that are structurally distant from known antibiotics. This work highlights the utility of deep learning approaches to expand our antibiotic arsenal through the discovery of structurally distinct antibacterial molecules. A trained deep neural network predicts antibiotic activity in molecules that are structurally different from known antibiotics, among which Halicin exhibits efficacy against broad-spectrum bacterial infections in mice.
Chapter
Topological Data Analysis for Genomics and Evolution - by Raúl Rabadán December 2019
Book
Topological Data Analysis for Genomics and Evolution - by Raúl Rabadán
Article
We formulate a materials design strategy combining a machine learning (ML) surrogate model with experimental design algorithms to search for high entropy alloys (HEAs) with large hardness in a model Al-Co-Cr-Cu-Fe-Ni system. We fabricated several alloys with hardness 10% higher than the best value in the original training dataset via only seven experiments. We find that a strategy using both the compositions and descriptors based on knowledge of the properties of HEAs outperforms one based merely on the compositions alone. This strategy offers a recipe to rapidly optimize multi-component systems, such as bulk metallic glasses and superalloys, towards desired properties.
Article
New machine learning methods to analyze raw chemical and biological data are now widely accessible as open-source toolkits. This positions researchers to leverage powerful, predictive models in their own domains. We caution, however, that the application of machine learning to experimental research merits careful consideration. Machine learning algorithms readily exploit confounding variables and experimental artifacts instead of relevant patterns, leading to overoptimistic performance and poor model generalization. In parallel to the strong control experiments that remain a cornerstone of experimental research, we advance the concept of adversarial controls for scientific machine learning: the design of exacting and purposeful experiments to ensure that predictive performance arises from meaningful models.
Article
As materials data sets grow in size and scope, the role of data mining and statistical learning methods to analyze these materials data sets and build predictive models is becoming more important. This manuscript introduces matminer, an open-source, Python-based software platform to facilitate data-driven methods of analyzing and predicting materials properties. Matminer provides modules for retrieving large data sets from external databases such as the Materials Project, Citrination, Materials Data Facility, and Materials Platform for Data Science. It also provides implementations for an extensive library of feature extraction routines developed by the materials community, with 47 featurization classes that can generate thousands of individual descriptors and combine them into mathematical functions. Finally, matminer provides a visualization module for producing interactive, shareable plots. These functions are designed in a way that integrates closely with machine learning and data analysis packages already developed and in use by the Python data science community. We explain the structure and logic of matminer, provide a description of its various modules, and showcase several examples of how matminer can be used to collect data, reproduce data mining studies reported in the literature, and test new methodologies.
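As a brief usage sketch of the kind of composition featurization matminer provides (based on its documented ElementProperty featurizer and the "magpie" preset; treat the exact call signatures as assumptions that depend on the installed version), a composition can be turned into a descriptor vector as follows.

from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

# Magpie-style elemental-property statistics for a single composition.
featurizer = ElementProperty.from_preset("magpie")
features = featurizer.featurize(Composition("SrTiO3"))
print(len(features), features[:5])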