
It is a core problem in any field to reliably tell how close two objects are to being the same, and once this relation has been established, we can use this information to precisely quantify potential relationships, both analytically and with machine learning (ML). For inorganic solids, the chemical composition is a fundamental descriptor, which can be represented by assigning the ratio of each element in the material to a vector. These vectors are a convenient mathematical data structure for measuring similarity, but unfortunately, the standard metric (the Euclidean distance) gives little to no variance in the resultant distances between chemically dissimilar compositions. We present the earth mover's distance (EMD) for inorganic compositions, a well-defined metric which enables the measure of chemical similarity in an explainable fashion. We compute the EMD between two compositions from the ratio of each of the elements and the absolute distance between the elements on the modified Pettifor scale. This simple metric shows clear strength at distinguishing compounds and is efficient to compute in practice. The resultant distances have greater alignment with chemical understanding than the Euclidean distance, which is demonstrated on the binary compositions of the inorganic crystal structure database. The EMD is a reliable numeric measure of chemical similarity that can be incorporated into automated workflows for a range of ML techniques. We have found that with no supervision, the use of this metric gives a distinct partitioning of binary compounds into clear trends and families of chemical property, with future applications for nearest neighbor search queries in chemical database retrieval systems and supervised ML techniques.
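As a rough illustration of the idea (not the authors' implementation, and with invented scale positions standing in for the modified Pettifor scale), the 1D optimal-transport computation between two compositions can be sketched in a few lines:

```python
def composition_emd(comp_a, comp_b, scale):
    """1D earth mover's distance between two compositions.

    comp_a, comp_b: dicts mapping element symbol -> atomic fraction
    (each should sum to 1). scale: dict mapping element -> position
    on a 1D elemental ordering.
    """
    # Signed mass at each occupied position: +fraction for A, -fraction for B.
    mass = {}
    for el, f in comp_a.items():
        mass[scale[el]] = mass.get(scale[el], 0.0) + f
    for el, f in comp_b.items():
        mass[scale[el]] = mass.get(scale[el], 0.0) - f
    # Sweep along the scale; the cost is total mass times distance moved.
    positions = sorted(mass)
    cost, carried, prev = 0.0, 0.0, positions[0]
    for p in positions:
        cost += abs(carried) * (p - prev)
        carried += mass[p]
        prev = p
    return cost

# Toy positions for four elements (illustrative, not the published scale).
TOY_SCALE = {"Na": 11, "K": 10, "Cl": 94, "Br": 93}

# NaCl -> KBr: each half-unit of mass moves one step on the toy scale.
print(composition_emd({"Na": 0.5, "Cl": 0.5}, {"K": 0.5, "Br": 0.5}, TOY_SCALE))  # 1.0
```

Because the scale is one-dimensional, the optimal transport reduces to a single sorted sweep, which is why the metric is cheap to compute in practice.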

... We can compare PDD matrices that have the same number of columns but possibly different numbers of rows by interpreting PDD(S; k) as a distribution of unordered rows (points in ℝ^k) with weights or probabilities. One metric on such weighted distributions is the Earth Mover's Distance (EMD), which was previously used in [19] for chemical compositions from the ICSD. If any point is perturbed by up to ε in Euclidean distance, any inter-point distance changes by up to 2ε. ...

... This EMD metric also outputs which atomic types and/or occupancies were correctly matched and which were not. Since most A-lab crystals had several geometric nearest neighbors with small EMD distances, we selected the neighbor with the most similar composition as measured by the element mover's distance [19]; these neighbors are listed in Tables 2 and 4 below. The local novelty distance of each A-lab crystal is not more than the Earth Mover's Distance listed in the column EMD100. ...

... Table 2 Close neighbors of each A-lab crystal in the ICSD. The ICSD entry with the smallest element mover's distance [19] was selected from the list of 100 nearest neighbors by ADA 100 . Disordered crystals are marked with an asterisk *. ...

With the advent of self-driving labs promising to synthesize large numbers of new materials, new automated tools are required for checking potential duplicates in existing structural databases before a material can be claimed as novel. To avoid duplication, we rigorously define the novelty metric of any periodic material as the smallest distance to its nearest neighbor among already known materials. Using ultra-fast structural invariants, all such nearest neighbors can be found within seconds on a typical computer even if a given crystal is disguised by changing a unit cell, perturbing atoms, or replacing chemical elements. This real-time novelty check is demonstrated by finding near-duplicates of the 43 materials produced by Berkeley's A-lab in the world's largest collections of inorganic structures, the Inorganic Crystal Structure Database and Materials Project. To help future self-driving labs successfully identify novel materials, we propose navigation maps of the materials space where any new structure can be quickly located by its invariant descriptors similar to a geographic location on Earth.
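A minimal sketch of the novelty definition above, assuming each material is already summarised by a fixed-length invariant vector (the vectors and the Chebyshev metric here are illustrative stand-ins, not the paper's actual invariants):

```python
def novelty(query, known, metric=None):
    """Novelty of a structure = distance to its nearest neighbour
    among already-known structures, each represented by a fixed-length
    invariant vector.

    Defaults to the Chebyshev (L-infinity) metric: the maximum
    absolute difference of corresponding coordinates.
    """
    if metric is None:
        metric = lambda a, b: max(abs(x - y) for x, y in zip(a, b))
    return min(metric(query, k) for k in known)

# Two toy "known" invariant vectors; the query is a near-duplicate of the first.
known = [(1.0, 2.0, 3.0), (1.1, 2.1, 3.0)]
print(novelty((1.05, 2.0, 3.0), known))  # ~0.05
```

A novelty of zero means an exact duplicate already exists; in practice the nearest-neighbour search would run over a spatial index rather than a linear scan.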

... Such a metric has recently been suggested as a way to quantify compositional similarity in inorganic materials. 36 In this work, we have used EMD to compare between GRID distributions by computing the 1D pairwise (ith-ith) distances, before finally computing the mean EMD across all GRID groups. Note, however, that a different weighting within the mean could be applied to favour short- or long-range similarity, or alternatively a 2D EMD could compare across all GRID groups at once, at the expense of computational time. ...
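The grouped comparison described here can be sketched as follows, using the standard identity that the 1D EMD between two equal-weight samples of the same size is the mean absolute difference of their sorted values; the group values below are invented for illustration:

```python
def emd_1d_samples(a, b):
    """1D EMD between two equal-weight empirical distributions with the
    same number of samples: mean |sorted(a) - sorted(b)|."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def grid_distance(grid_a, grid_b):
    """Mean EMD across corresponding groups of two GRID-like descriptors
    (each a list of interatomic-distance lists). A non-uniform weighting
    here would favour short- or long-range similarity, as noted above."""
    emds = [emd_1d_samples(ga, gb) for ga, gb in zip(grid_a, grid_b)]
    return sum(emds) / len(emds)

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.5, 2.5], [3.0, 5.0]]
print(grid_distance(a, b))  # 0.5
```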

... To address the absence of compositional information, we have therefore combined our GRID EMD with a similar compositional EMD, in a modified version of that demonstrated by Hargreaves et al. 36 Our method represents the normalised elemental fractions as a 78-element vector in atomic number order (considering elements up to Bi, but excluding noble gases); taking SrTiO3 as a representative example would give values of 0.6, 0.2 and 0.2 at the 7th, 19th and 34th elements in this vector, respectively. Rather than ordering this vector by Pettifor scale and computing EMD directly as in ref. 36, we instead introduce a pairwise dissimilarity metric (Fig. S3†) between elements based on the statistical likelihood of species occurring within the same crystal structure (see Methods). 36 The advantage of this approach is that while the Pettifor scale assumes a constant distance between adjacent species, the substitutional (dis)similarity approach gives a more chemically meaningful metric. ...

Determining how similar two materials are in terms of both atomic composition and crystallographic structure remains a challenge, the solution of which would enable generalised machine learning using crystal structure data. We demonstrate a new method of describing crystal structures based on interatomic distances, termed the Grouped Representation of Interatomic Distances (GRID). This fast-to-compute descriptor can equally be applied to crystalline or disordered materials, and encodes additional information beyond pairwise distances, such as coordination environments. Combined with earth mover's distance as a measure of similarity, we show that GRID is able to quantitatively compare materials involving both short- and long-range structural variation. Using this new material descriptor, we show that it can accurately predict bulk moduli using a simple nearest-neighbour model, and that the resulting similarity shows good generalisability across multiple materials properties.

... DiSCoVeR depends on clusters exhibiting homogeneity with respect to chemical classes, which we enforce via a recently introduced distance metric: Element Mover's Distance (ElMD). 54 Dimensionality reduction algorithms such as Uniform Manifold Approximation and Projection (UMAP) 55 or t-distributed stochastic neighbor embedding 56 can then be used to create low-dimensional embeddings suitable for clustering algorithms such as Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN*) 57 or k-means clustering. 58 Finally, these can be fed into density estimator algorithms such as Density-preserving Uniform Manifold Approximation and Projection (DensMAP), 59 a UMAP variant, or kernel density estimation, 60,61 where density is then used as a proxy for chemical uniqueness. ...

... In this work, we use DensMAP for dimensionality reduction and HDBSCAN* for clustering, similar to the work by Hargreaves et al., 54 which successfully reported clusters of compounds that match chemical intuition. Removal of the DensMAP step results in a much higher proportion of points classified as "noise" in the clustering results and precludes the use of density-based proxies unless other suitable density estimation algorithms are used. ...

... where X_{v,j}, m_{t,i}, S_{t,i}, and n_train denote the j-th validation DensMAP embedding position at which the density is evaluated, the i-th training DensMAP embedding position, the i-th training covariance matrix, and the total number of training points, respectively (the remaining symbol denotes matrix multiplication). ...

| Method | Role | Use in DiSCoVeR |
| --- | --- | --- |
| CrabNet 40 | Composition-based property regression | Predict performance for proxy scores |
| ElMD 54 | Composition-based distance metric | Supply distance matrix to DensMAP |
| DensMAP 59 | Density-aware dimensionality reduction | Obtain densities for density proxy |
| HDBSCAN* 57 | Density-aware clustering | Create chemically homogeneous clusters |

We present Descending from Stochastic Clustering Variance Regression (DiSCoVeR) (https://github.com/sparks-baird/mat_discover), a Python tool for identifying and assessing high-performing, chemically unique compositions relative to existing compounds using a combination of...

... The key is in the dissimilarity metric used to compute distances between compounds. Recently, ElMD [53] was developed based on Earth Mover's or Wasserstein Distance; ElMD calculates distances between compounds in a way that more closely matches chemical intuition. For example, compounds with similar composition templates (e.g. ...

... In this work, we use UMAP for dimensionality reduction and HDBSCAN* for clustering, similar to the work by Hargreaves et al. [53], which successfully reported clusters of compounds that match chemical intuition. ...

We present Descending from Stochastic Clustering Variance Regression (DiSCoVeR), a Python tool for identifying high-performing, chemically unique compositions relative to existing compounds using a combination of a chemical distance metric, density-aware dimensionality reduction, and clustering. We introduce several new metrics for materials discovery and validate DiSCoVeR on Materials Project bulk moduli using compound-wise and cluster-wise validation methods. We visualize these via multiobjective Pareto front plots and assign a weighted score to each composition where this score encompasses the trade-off between performance and density-based chemical uniqueness. We explore an additional uniqueness proxy related to property gradients in chemical space. We demonstrate that DiSCoVeR can successfully screen materials for both performance and uniqueness in order to extrapolate to new chemical spaces.
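The weighted trade-off score described above can be sketched as follows; the min-max scaling and linear blending here are illustrative assumptions, not the published implementation:

```python
def weighted_scores(performance, uniqueness, weight=0.5):
    """Blend predicted performance and density-based chemical uniqueness
    into a single score per composition. `weight` is the fraction
    assigned to performance; both inputs are min-max scaled to [0, 1]
    first so the two objectives are comparable."""
    def scaled(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    p, u = scaled(performance), scaled(uniqueness)
    return [weight * pi + (1 - weight) * ui for pi, ui in zip(p, u)]

perf = [200.0, 150.0, 100.0]   # e.g. predicted bulk moduli (GPa), invented
uniq = [0.1, 0.9, 0.5]         # e.g. a density-based uniqueness proxy, invented
print(weighted_scores(perf, uniq))  # ~[0.5, 0.75, 0.25]
```

The second candidate scores highest here: middling performance but high uniqueness, exactly the performance/novelty trade-off the Pareto plots visualise.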

... Often this PDF is additionally smoothed to guarantee continuity under perturbations, but even the exact discrete version of the PDF is strictly weaker than the PDD; see the details below Fig. 6 in Section 3 of [8]. The earth mover's distance (EMD) [31] defines a continuous metric on PDDs even if they have different numbers of rows. Here is a summary of the PDD advantages over past descriptors. ...

... Every crystal in the dataset will have differing ranges, and applying each one individually such that each graph uses its own k value yields poor results. This is not surprising: the Earth Mover's Distance [31,47] establishes a continuous metric between PDDs with fixed k according to Theorem 4.3 of [8]. So, given a different range for each sample in the data, we want a single value of k that is sufficient across the dataset. ...

The structure–property hypothesis says that the properties of all materials are determined by an underlying crystal structure. The main obstacle was the ambiguity of conventional crystal representations based on incomplete or discontinuous descriptors that allow false negatives or false positives. This ambiguity was resolved by the ultra-fast pointwise distance distribution, which distinguished all periodic structures in the world’s largest collection of real materials (Cambridge structural database). State-of-the-art results in property prediction were previously achieved by graph neural networks based on various graph representations of periodic crystals, including the Crystal Graph with vertices at all atoms in a crystal unit cell. This work adapts the pointwise distance distribution for a simpler graph whose vertex set is not larger than the asymmetric unit of a crystal structure. The new Distribution Graph reduces mean absolute error by 0.6–12% while having 44–88% of the number of vertices when compared to the Crystal Graph when applied on the Materials Project and Jarvis-DFT datasets using CGCNN and ALIGNN. Methods for hyper-parameters selection for the graph are backed by the theoretical results of the pointwise distance distribution and are then experimentally justified.

... In our previous work, we introduced the Element Mover's Distance (ElMD) 28 as a metric to quantify the similarity between two chemical formulae. It has been demonstrated to be an expressive measure of chemical similarity that aligns with domain-expert judgement. ...

... A centred Gram matrix 30 is first obtained from the given distance matrix, and a singular value decomposition of the Gram matrix is then carried forward to obtain the coordinates of each point projected onto the first two principal components. PCA linearly scales each metric distance to maximally preserve the interpoint relationships across the dataset, which has previously been shown to closely reflect the true structure of the metric space. 28 [Fig. 2: Distribution of room-temperature conductivities across expert-curated structural families; the legend lists the families with entry counts, e.g. Complete Distribution (465), NASICON (154), Garnet (67), Perovskite (64).] ...
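The projection described here is classical multidimensional scaling (double-centre the squared distance matrix into a Gram matrix, then eigendecompose); a minimal NumPy sketch, assuming a symmetric distance matrix:

```python
import numpy as np

def mds_from_distances(D, dims=2):
    """Classical multidimensional scaling: double-centre the squared
    distance matrix into a Gram matrix, then eigendecompose to obtain
    low-dimensional coordinates whose pairwise distances approximate D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    G = -0.5 * J @ (D ** 2) @ J                  # centred Gram matrix
    vals, vecs = np.linalg.eigh(G)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dims]        # keep top `dims` components
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

# Distances between three points on a line (0, 1, 3) are recovered exactly.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
coords = mds_from_distances(D)
```

For a metric like ElMD that is not exactly Euclidean, small negative eigenvalues can appear; clipping them to zero gives the usual least-distortion embedding.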

The application of machine learning models to predict material properties is determined by the availability of high-quality data. We present an expert-curated dataset of lithium ion conductors and associated lithium ion conductivities measured by a.c. impedance spectroscopy. This dataset has 820 entries collected from 214 sources; entries contain a chemical composition, an expert-assigned structural label, and ionic conductivity at a specific temperature (from 5 to 873 °C). There are 403 unique chemical compositions with an associated ionic conductivity near room temperature (15–35 °C). The materials contained in this dataset are placed in the context of compounds reported in the Inorganic Crystal Structure Database with unsupervised machine learning and the Element Movers Distance. This dataset is used to train a CrabNet-based classifier to estimate whether a chemical composition has high or low ionic conductivity. This classifier is a practical tool to aid experimentalists in prioritizing candidates for further investigation as lithium ion conductors.

... The L1 distance is the sum of the absolute deviations between the predicted fractions of each element and the true fractions, and the L2 distance is the square root of the sum of the squared deviations. The EleMD is a recently developed metric for assessing compositional similarity by calculating the minimal amount of work required to transform one distribution of elements into another along the modified Pettifor scale [203,204]. The Null baseline assumes all elements in the precursors are present in the product, in their average relative amounts derived from the training data. ...
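For concreteness, the L1 and L2 baselines can be written directly on element-fraction dictionaries; the compositions below are invented examples:

```python
import math

def _aligned(pred, true):
    """Align two compositions (element -> fraction dicts) over the
    union of their elements, in a fixed order."""
    elements = sorted(set(pred) | set(true))
    return ([pred.get(e, 0.0) for e in elements],
            [true.get(e, 0.0) for e in elements])

def l1_distance(pred, true):
    """Sum of absolute deviations between predicted and true fractions."""
    p, t = _aligned(pred, true)
    return sum(abs(a - b) for a, b in zip(p, t))

def l2_distance(pred, true):
    """Square root of the sum of squared deviations."""
    p, t = _aligned(pred, true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, t)))

pred = {"Li": 0.4, "O": 0.6}
true = {"Li": 0.5, "O": 0.5}
print(l1_distance(pred, true))  # ~0.2
print(l2_distance(pred, true))  # ~0.141
```

Unlike the EleMD, these baselines treat elements as independent axes, so swapping an element for a chemically similar one is penalised as heavily as swapping it for a dissimilar one.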

... The model was implemented in PyTorch [72] and Matminer [273] was used to generate the Magpie features for the precursors. The element mover's distance (EleMD) [204] was calculated using the POT (Python Optimal Transport) package. ...

As we enter the data age, ever-increasing amounts of human knowledge are being recorded in machine-readable formats. This has opened up new opportunities to leverage data to accelerate scientific discovery. This thesis focuses on how we can use historical and computational data to aid the discovery and development of new materials. We begin by looking at a traditional materials informatics task -- elucidating the structure-function relationships of high-temperature cuprate superconductors. One of the most significant challenges for materials informatics is the limited availability of relevant data. We propose a simple calibration-based approach to estimate the apical and in-plane copper-oxygen distances from more readily available lattice parameter data to address this challenge for cuprate superconductors. Our investigation uncovers a large, unexplored region of materials space that may yield cuprates with higher critical temperatures. We propose two experimental avenues that may enable this region to be accessed. Computational materials exploration is bottlenecked by our ability to provide input structures to feed our workflows. Whilst *ab initio* structure identification is possible, it is computationally burdensome and we lack design rules for deciding where to target searches in high-throughput setups. To address this, there is a need to develop tools that suggest promising candidates, enabling automated deployment and increased efficiency. Machine learning models are well suited to this task, however, current approaches typically use hand-engineered inputs. This means that their performance is circumscribed by the intuitions reflected in the chosen inputs. We propose a novel way to formulate the machine learning task as a set regression problem over the elements in a material. We show that our approach leads to higher sample efficiency than other well-established composition-based approaches.
Having demonstrated the ability of machine learning to aid in the selection of promising compound compositions, we next explore how useful machine learning might be for identifying fabrication routes. Using a recently released data-mined data set of solid-state synthesis reactions, we design a two-stage model to predict the products of inorganic reactions. We critically explore the performance of this model, showing that whilst the predictions fall short of the accuracy required to be chemically discriminative, the model provides valuable insights into understanding inorganic reactions. Through careful investigation of the model's failure modes, we explore the challenges that remain in the construction of forward inorganic reaction prediction models and suggest some pathways to tackle the identified issues. One of the principal ways that material scientists understand and categorise materials is in terms of their symmetries. Crystal structure prototypes are assigned based on the presence of symmetrically equivalent sites known as Wyckoff positions. We show that a powerful coarse-grained representation of materials structures can be constructed from the Wyckoff positions by discarding information about their coordinates within crystal structures. One of the strengths of this representation is that it maintains the ability of structure-based methods to distinguish polymorphs whilst also allowing combinatorial enumeration akin to composition-based approaches. We construct an end-to-end differentiable model that takes our proposed Wyckoff representation as input. The performance of this approach is examined on a suite of materials discovery experiments showing that it leads to strong levels of enrichment in materials discovery tasks. The research presented in this thesis highlights the promise of applying data-driven workflows and machine learning in materials discovery and development. 
This thesis concludes by speculating about promising research directions for applying machine learning within materials discovery.

... Subsequently, it classifies each material as redundant or representative, depending on its similarity to the existing representatives already selected into the cluster. Composition similarities are estimated using the ElMD (Element Mover's Distance) 26 package, which offers the option to choose linear, chemically derived, and machine-learned similarity measures. By default, we utilized the Mendeleev similarity and the MatScholar similarity 27 for our non-redundant composition dataset generation. ...

Materials datasets usually contain many redundant (highly similar) materials due to the tinkering approach historically used in material design. This redundancy skews the performance evaluation of machine learning (ML) models when using random splitting, leading to overestimated predictive performance and poor performance on out-of-distribution samples. This issue is well-known in bioinformatics for protein function prediction, where tools like CD-HIT are used to reduce redundancy by enforcing a sequence-similarity threshold between samples. In this paper, we survey the overestimated ML performance in materials science for material property prediction and propose MD-HIT, a redundancy reduction algorithm for material datasets. Applying MD-HIT to composition- and structure-based formation energy and band gap prediction problems, we demonstrate that with redundancy control, the prediction performance of ML models on test sets tends to be lower than that of models evaluated on highly redundant data, but better reflects the models' true predictive capability.
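The CD-HIT-style idea that MD-HIT adapts can be sketched greedily (a simplification for illustration, not the MD-HIT implementation):

```python
def reduce_redundancy(samples, distance, threshold):
    """Greedy redundancy reduction: scan samples in order and keep one
    only if it is at least `threshold` away from every representative
    kept so far; anything closer is flagged as redundant."""
    representatives = []
    for s in samples:
        if all(distance(s, r) >= threshold for r in representatives):
            representatives.append(s)
    return representatives

# 1D toy "materials": values within 0.5 of a kept representative are redundant.
data = [0.0, 0.1, 0.6, 0.65, 2.0]
reps = reduce_redundancy(data, lambda a, b: abs(a - b), 0.5)
print(reps)  # [0.0, 0.6, 2.0]
```

In MD-HIT the distance would be a composition or structure (dis)similarity rather than a 1D gap, but the thresholded greedy selection is the same shape.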

... In addition to these learned descriptors of the elements, hand-crafted representations of elements and compositions continue to play a role in materials informatics for property prediction. [11][12][13][14][15][16][17][18] An often overlooked aspect when dealing with traditional element representations is the role of ions. Ions, and the knowledge of oxidation states, play a significant role in both the structural and electronic properties of materials such as electrical conductivity, 19,20 chemical bonding, and magnetism 21 as a result of their electronic configuration differing from the parent atom. ...

High-dimensional representations of the elements have become common within the field of materials informatics to build useful, structure-agnostic models for the chemistry of materials. However, the characteristics of elements change when they adopt a given oxidation state, with distinct structural preferences and physical properties. We explore several methods for developing embedding vectors of elements decorated with oxidation states. Graphs generated from 110 160 crystals are used to train representations of 84 elements that form 336 species. Clustering these learned representations of ionic species in low-dimensional space reproduces expected chemical heuristics, particularly the separation of cations from anions. We show that these representations have enhanced expressive power for property prediction tasks involving inorganic compounds. We expect that ionic representations, necessary for the description of mixed valence and complex magnetic systems, will support more powerful machine learning models for materials.

... TCSP [17] is a template-based crystal structure prediction algorithm. For a given formula, TCSP first narrows down the candidates to structures with the same prototype and then uses the Element Mover's Distance (ElMD) [58] to measure the compositional similarity between the query formula and the compositions of all possible template structures. We implement BERTOS [59], which achieves over 96.82% accuracy for all-element oxidation-state prediction on the Inorganic Crystal Structure Database (ICSD), within TCSP to enhance the accuracy of oxidation-state prediction in its template-search step. ...

Crystal structure prediction (CSP) is now increasingly used in discovering novel materials with applications in diverse industries. However, despite decades of developments and significant progress in this area, the field lacks a well-defined benchmark dataset, a set of quantitative performance metrics, and studies that evaluate its status. We aim to fill this gap by introducing a CSP benchmark suite with 180 test structures along with our recently implemented CSP performance metric set. We benchmark a collection of 13 state-of-the-art (SOTA) CSP algorithms including template-based CSP algorithms, conventional CSP algorithms based on DFT calculations and global search such as CALYPSO, CSP algorithms based on machine learning (ML) potentials and global search, and distance matrix based CSP algorithms. Our results demonstrate that the performance of the current CSP algorithms is far from satisfactory. Most algorithms cannot even identify the structures with the correct space groups, except for the template-based algorithms when applied to test structures with similar templates. We also find that the ML potential based CSP algorithms are now able to achieve competitive performance compared to the DFT-based algorithms. These CSP algorithms' performance is strongly determined by the quality of the neural potentials as well as the global optimization algorithms. Our benchmark suite comes with a comprehensive open-source codebase and 180 well-selected benchmark crystal structures, making it convenient to evaluate the advantages and disadvantages of CSP algorithms in future studies. All the code and benchmark data are available at https://github.com/usccolumbia/cspbenchmark

... Given the number of compounds in this study it is helpful to represent these in a condensed format, such as a plot, to observe the distribution of candidate materials in compositional space. The Element Movers Distance (ElMD) 64 is an established metric of chemical similarity, which uses optimal transport to return a consistent measure of (dis)similarity between two given compositions that aligns with chemical intuition. An embedding of datasets with respect to this metric may be carried forward via kernel principal component analysis (PCA) by performing PCA on the ElMD distance matrix to give 2-dimensional coordinates for each material. ...

Combinatorial and guided screening of materials space with density-functional theory and related approaches has provided a wealth of hypothetical inorganic materials, which are increasingly tabulated in open databases. The OPTIMADE API is a standardised format for representing crystal structures, their measured and computed properties, and the methods for querying and filtering them from remote resources. Currently, the OPTIMADE federation spans over 20 data providers, rendering over 30 million structures accessible in this way, many of which are novel and have only recently been suggested by machine learning-based approaches. In this work, we outline our approach to non-exhaustively screen this dynamic trove of structures for the next-generation of optical materials. By applying MODNet, a neural network-based model for property prediction, within a combined active learning and high-throughput computation framework, we isolate particular structures and chemistries that should be most fruitful for further theoretical calculations and for experimental study as high-refractive-index materials. By making explicit use of automated calculations, federated dataset curation and machine learning, and by releasing these publicly, the workflows presented here can be periodically re-assessed as new databases implement OPTIMADE, and new hypothetical materials are suggested.

... As shown, an increase in dataset similarity between the pretraining dataset and the finetuning data is correlated with an increase in performance gain. Dataset similarity is measured using the earth mover's distance [10]. [Figure panel: % improvement in performance after masked element modeling (MEM).] ...

Given the vast spectrum of material properties characterizing each compound, learning representations for inorganic materials is intricate. The prevailing trend within the materials informatics community leans towards designing specialized models that predict single properties. We introduce a multi-task learning framework, wherein a transformer-based encoder is co-trained across diverse materials properties and a de-noising objective, resulting in robust and generalizable materials representations. Our method not only improves over the performance observed in single-dataset pretraining but also showcases scalability and adaptability toward multi-dataset pretraining. Experiments demonstrate that the trained encoder MTENCODER captures chemically meaningful representations, surpassing the performance of current structure-agnostic materials encoders. This approach paves the way to improvements in a multitude of materials informatics tasks, prominently including materials property prediction and synthesis planning for materials discovery.

... Computational pipeline of PDD is illustrated for a 2-dimensional square lattice. The Average Minimum Distance AMD(S; k) is the vector obtained by taking the weighted average of the last k columns in PDD(S; k), so AMD is a single vector of k average distances. To compare two AMD vectors of the same length, our experiments used the L∞ (Chebyshev) metric, equal to the maximum absolute difference of corresponding coordinates. For a metric on PDDs, we use the Earth Mover's Distance (EMD) [60] with the L∞ metric on rows. If any point of S is perturbed within its ε-neighborhood, then PDD(S; k) changes by at most 2ε in the EMD metric. ...
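The comparison described in this excerpt reduces to a small transportation linear program: the EMD between two weighted row distributions. The sketch below is our own illustration, not code from the cited work; it uses SciPy's `linprog` with a Chebyshev ground metric on rows, as in the excerpt above.

```python
import numpy as np
from scipy.optimize import linprog

def emd(rows_a, w_a, rows_b, w_b):
    """EMD between two weighted row sets (Chebyshev ground metric on rows)."""
    A = np.asarray(rows_a, float)
    B = np.asarray(rows_b, float)
    # Ground cost: pairwise Chebyshev (L-infinity) distance between rows.
    cost = np.max(np.abs(A[:, None, :] - B[None, :, :]), axis=2)
    n, m = cost.shape
    # Transportation LP: flow f[i, j] >= 0 with marginals fixed to the weights.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0      # sum_j f[i, j] = w_a[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0               # sum_i f[i, j] = w_b[j]
    b_eq = np.concatenate([np.asarray(w_a, float), np.asarray(w_b, float)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return float(res.fun)
```

A general-purpose LP solver is the simplest way to state the problem; dedicated network-simplex solvers are faster for the large PDD comparisons reported in the literature.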

Zeolites are inorganic materials known for their diversity of applications, synthesis conditions, and resulting polymorphs. Although their synthesis is controlled both by inorganic and organic synthesis conditions, computational studies of zeolite synthesis have focused mostly on the design of organic structure-directing agents (OSDAs). In this work, we combine distances between crystal structures and machine learning (ML) to create inorganic synthesis maps in zeolites. Starting with 253 known zeolites, we show how the continuous distances between frameworks reproduce inorganic synthesis conditions from the literature without using labels such as building units. An unsupervised learning analysis shows that neighboring zeolites according to two different representations often share similar inorganic synthesis conditions, even in OSDA-based routes. In combination with ML classifiers, we find synthesis-structure relationships for 14 common inorganic conditions in zeolites, namely Al, B, Be, Ca, Co, F, Ga, Ge, K, Mg, Na, P, Si, and Zn. By explaining the model predictions, we demonstrate how (dis)similarities towards known structures can be used as features for the synthesis space, thus quantifying the intuition that similar structures often share inorganic synthesis routes. Finally, we show how these methods can be used to predict inorganic synthesis conditions for unrealized frameworks in hypothetical databases and interpret the outcomes by extracting local structural patterns from zeolites. In combination with OSDA design, this work can accelerate the exploration of the space of synthesis conditions for zeolites.

... CrabNet [25]: composition-based property regression, used to predict performance for proxy scores. ElMD [26]: composition-based distance metric, used to supply the distance matrix to DensMAP. DensMAP [27]: density-aware dimensionality reduction, used to obtain densities for the density proxy. HDBSCAN [28]: density-aware clustering, used to create chemically homogeneous clusters. Peak proxy ...

One of the biggest unsolved problems in condensed matter physics is the mechanism behind high-temperature superconductivity, and whether there is a material that can exhibit superconductivity at both room temperature and atmospheric pressure. Among the many important properties of a superconductor, the critical temperature (Tc), or transition temperature, is the point at which a material transitions into a superconductive state. In this implementation, machine learning is used to predict the critical temperatures of chemically unique compounds in an attempt to identify chemically novel, high-temperature superconductors. The training data set (SuperCon) consists of known superconductors and their critical temperatures, and the testing data set (NOMAD) consists of around 700,000 novel chemical formulae. The chemical formulae in these data sets are first passed through a collection of rapid screening tools, SMACT, to check for chemical validity. Next, the DiSCoVeR algorithm is trained on the SuperCon data to form a model, which then screens through batches of the formulae in the NOMAD data set. Combining a chemical distance metric, density-aware dimensionality reduction, clustering, and a regression model, the DiSCoVeR algorithm serves as a tool to identify and assess these superconducting compositions [1]. This research and implementation resulted in the screening of chemically novel compositions exhibiting critical temperatures upwards of 150 K, which corresponds to superconductors in the cuprate class. This implementation demonstrates a process of performing machine learning-assisted superconductor screening (while exploring chemically distinct spaces) that can be utilized in the materials discovery process.
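The screening loop described above can be sketched in a few lines. `is_valid` and `predict_tc` below are placeholders standing in for the SMACT validity checks and the trained model, respectively; this is an illustration of the control flow, not the project's actual code.

```python
def screen(formulas, is_valid, predict_tc, top_n=10):
    """Filter candidate formulas for validity, then rank survivors by predicted Tc."""
    candidates = [f for f in formulas if is_valid(f)]
    # Highest predicted critical temperature first.
    return sorted(candidates, key=predict_tc, reverse=True)[:top_n]
```

In the actual pipeline the validity filter is cheap and runs first, so the expensive model only scores formulas that pass the chemical sanity checks.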

... The EMD is a well-studied distance measure between weighted point sets, with many successful applications in a variety of domains; for example, see [8,10,17,19]. The idea of the EMD was first conceived by Monge [14] in 1781, in the context of transportation theory. ...

Many applications in pattern recognition represent patterns as a geometric graph. The geometric graph distance (GGD) has recently been studied as a meaningful measure of similarity between two geometric graphs. Since computing the GGD is known to be NP-hard, the distance measure proves an impractical choice for applications. As a computationally tractable alternative, we propose in this paper the Graph Mover's Distance (GMD), which has been formulated as an instance of the earth mover's distance. The computation of the GMD between two geometric graphs with at most n vertices takes only O(n^3) time. Alongside studying the metric properties of the GMD, we investigate the stability of the GGD and GMD. The GMD also demonstrates extremely promising empirical evidence at recognizing letter drawings from the LETTER dataset [18].

... Pointwise Distance Distributions (PDD) are stronger invariants, which can be continuously compared by the Earth Mover's Distance. This distance was used for visualising the Inorganic Crystal Structure Database [7]. More than 200 billion pairwise comparisons between all 660K+ periodic crystals (full 3D structure; no disorder) in the Cambridge Structural Database (CSD) were completed over two days on a modest desktop PC. ...

... 3,4 Meanwhile, Matt Rosseinsky's group had recently developed a new tool for assessing chemical similarity. 5 By putting these two tools together, we have been able to create generative models that can be guided away from common chemistries toward unusual new materials. 6 We are now constructing structural distance metrics in order to steer discovery toward unusual structures as well. ...

... For a given candidate 2D material formula, the TCSP algorithm first searches all known 2D material structure templates that share the same composition prototype as this formula (e.g., SiTiO3 has prototype ABC3). The element mover's distance (ElMD) [34] is used to measure the compositional similarity between the query formula and the compositions of all possible template structures. It then picks the top 5 structures with the smallest compositional distances as the candidate templates. ...

Two-dimensional (2D) materials have wide applications in superconductors, quantum, and topological materials. However, their rational design is not well established, and currently fewer than 6,000 experimentally synthesized 2D materials have been reported. Recently, deep learning, data-mining, and density functional theory (DFT)-based high-throughput calculations have been widely performed to discover potential new materials for diverse applications. Here we propose a generative material design pipeline, namely the material transformer generator (MTG), for large-scale discovery of hypothetical 2D materials. We train two 2D materials composition generators using self-learning neural language models based on Transformers, with and without transfer learning. The models are then used to generate a large number of candidate 2D compositions, which are fed to known 2D materials templates for crystal structure prediction. Next, we performed DFT computations to study their thermodynamic stability based on energy-above-hull and formation energy. We report four new DFT-verified stable 2D materials with zero e-above-hull energies, including NiCl4, IrSBr, CuBr3, and CoBrCl. Our work thus demonstrates the potential of our MTG generative materials design pipeline in the discovery of novel 2D materials and other functional materials.

... Since synthesis also plays a crucial role in real-world materials discovery, the sintering temperature of the material (a key synthesis parameter) is included. Finally, to evaluate the diversity of the generated compounds, the element mover's distance (ElMD) [11] and % uniqueness are employed. ...

A major obstacle to the realization of novel inorganic materials with desirable properties is the inability to perform efficient optimization across both materials properties and the synthesis of those materials. In this work, we propose a reinforcement learning (RL) approach to inverse inorganic materials design, which can identify promising compounds with specified properties and synthesizability constraints. Our model learns chemical guidelines such as charge and electronegativity neutrality while maintaining chemical diversity and uniqueness. We demonstrate a multi-objective RL approach, which can generate novel compounds with targeted materials properties, including formation energy and bulk/shear modulus, alongside a lower-sintering-temperature synthesis objective. Using this approach, the model can predict promising compounds of interest, while suggesting an optimized chemical design space for inorganic materials discovery.

... Therefore, mofdscribe allows users to flexibly choose from a wide variety of elemental properties in addition to other encodings, such as the (modified) Pettifor scales, which have been shown to better capture similarities of elements across the periodic table. [58][59][60] [Listing 2: Example of using aggregations in mofdscribe.] Many featurizers compute more than one feature vector per structure; for instance, one feature vector per atom. ...

The space of all plausible materials for a given application is so large that it cannot be explored using a brute-force approach. This is, in particular, the case for reticular chemistry which provides materials designers with a practically infinite playground on different length scales. One promising approach to guide the design and discovery of materials is machine learning, which typically involves learning a mapping of structures onto properties from data. While there have been plenty of examples of the use of machine learning for reticular materials, the progress in the field seems to have stagnated. From our perspective, an important reason is that digital reticular chemistry is still more an art than a science in which many parts are only accessible to experienced groups. The lack of standardization across all the steps of the machine learning pipeline makes it practically impossible to directly compare machine learning models and build on top of prior results. To confront these challenges, we present mofdscribe: a software ecosystem that accompanies—seasoned as well as novice—digital reticular chemists on all steps from ideation to model publication. Our package provides reference datasets (including a completely new one), more than 35 reported as well as completely novel featurization strategies, data splitters, and validation helpers which can be used to benchmark new modeling strategies on standard benchmark tasks and to report the results on a public leaderboard. We envision that this ecosystem allows for a more robust, comparable, and productive area of digital reticular chemistry.

... Here, we use the 100 nearest neighbors as the k value. The earth mover's distance was previously used to compare crystal compositions [49] and is now adapted to a continuous metric between PDDs (Tables S4 and S5), which is easier to compute than a metric between complete isoset invariants [50]. The PDD is more robust and quicker to compute than past invariants [51-53], which allowed it to be used to distinguish all the periodic crystals in the Cambridge Structural Database. ...

Mesoporous molecular crystals have potential applications in separation and catalysis, but they are rare and hard to design because many weak interactions compete during crystallization, and most molecules have an energetic preference for close packing. Here, we combine crystal structure prediction (CSP) with structural invariants to continuously qualify the similarity between predicted crystal structures for related molecules. This allows isomorphous substitution strategies, which can be unreliable for molecular crystals, to be augmented by a priori prediction, thus leveraging the power of both approaches. We used this combined approach to discover a rare example of a low-density (0.54 g cm-3) mesoporous hydrogen-bonded framework (HOF), 3D-CageHOF-1. This structure comprises an organic cage (Cage-3-NH2) that was predicted to form kinetically trapped, low-density polymorphs via CSP. Pointwise distance distribution structural invariants revealed five predicted forms of Cage-3-NH2 that are analogous to experimentally realized porous crystals of a chemically different but geometrically similar molecule, T2. More broadly, this approach overcomes the difficulties in comparing predicted molecular crystals with varying lattice parameters, thus allowing for the systematic comparison of energy-structure landscapes for chemically dissimilar molecules.

... The mean value of D_sample, as a function of N_sample, converges at N_sample = 50 (Fig. S2†) with a cutoff of 0.001. While previous studies suggest the earth mover's distance (EMD) is a good distance function for chemical compositions [48,49], changing the L1 loss function in eqn (3) to the EMD on the modified Pettifor scale [48] does not improve results. ...
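For intuition, the EMD between two compositions placed on a one-dimensional elemental scale has a closed form: it equals the L1 distance between the cumulative distributions. A minimal sketch, using a hypothetical five-element scale as a toy stand-in for the modified Pettifor scale (the positions below are made up for illustration):

```python
import numpy as np

SCALE = {"Li": 1, "Na": 2, "K": 3, "O": 4, "S": 5}   # hypothetical positions

def composition_emd(comp_a, comp_b):
    """EMD between two normalised compositions with elements placed on SCALE."""
    positions = sorted(SCALE.values())
    def vec(comp):
        total = sum(comp.values())
        v = np.zeros(len(positions))
        for el, amt in comp.items():
            v[positions.index(SCALE[el])] += amt / total
        return v
    a, b = vec(comp_a), vec(comp_b)
    # In 1-D, EMD = sum over adjacent gaps of |CDF difference| * gap width.
    gaps = np.diff(positions)
    cdf_diff = np.cumsum(a - b)[:-1]
    return float(np.sum(np.abs(cdf_diff) * gaps))
```

The closed form is why the composition EMD is cheap in practice: no linear program is needed when the ground metric lives on a line.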

Despite its simplicity, the composition of a material can be used as input to machine learning models to predict a range of materials properties. However, many property optimization tasks require the generation of novel but realistic materials compositions. In this study, we describe a way to generate compositions of hybrid organic–inorganic crystals through adapting Augmented CycleGAN, a novel generative model that can learn many-to-many relations between two domains. Specifically, we investigate the problem of composition change upon amine swap: for a specific chemical system (set of elements) crystalized with amine A, how would the product chemical compositions change if it is crystalized with amine B? By training with limited data from Cambridge Structural Database, our model can generate realistic chemical compositions for hybrid crystalline materials. The Augmented CycleGAN model can also utilize abundant unpaired data (compositions of different chemical systems), a feature that traditional supervised methods lack. The generated compositions can be used for many tasks, for example, as input fed to a classifier that predicts structural dimensionality.

... We extract the internal vector representations of all 51,242 compounds in the OQMD_Bandgap test dataset from the last self-attention layer of HotCrab, perform dimensionality reduction using UMAP, and finally visualize the compounds as shown in Fig. 4. In addition to coloring the plots by the predicted value, prediction error, and number of distinct elements for the compounds, we also highlight the chemical trend from ionic to covalent bonding character within the compounds. This trend is revealed by calculating and visualizing the standard deviation σ_χ of the Pauling electronegativities of the constituent atoms in a given compound [75] according to Equation 5: σ_χ = sqrt((1/n) Σ_i (χ_i − χ̄)²), where χ_i is the Pauling electronegativity of each element i in the compound (totaling n elements), and χ̄ is the average electronegativity of all elements in the compound. A higher σ_χ signifies a more ionic bonding character, and a lower value signifies a more covalent bonding character. ...
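The electronegativity-spread descriptor in this excerpt is a one-liner in practice. A minimal sketch, with a tiny illustrative subset of Pauling values (real code would use a full tabulation, e.g. from pymatgen):

```python
import math

PAULING = {"Na": 0.93, "Cl": 3.16, "Si": 1.90, "C": 2.55}  # illustrative subset

def en_std(elements):
    """Standard deviation of Pauling electronegativities over a compound's elements."""
    chi = [PAULING[e] for e in elements]
    mean = sum(chi) / len(chi)
    return math.sqrt(sum((x - mean) ** 2 for x in chi) / len(chi))
```

A large spread (e.g. Na with Cl) indicates ionic character; identical electronegativities give zero, the covalent limit.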

Despite recent breakthroughs in deep learning for materials informatics, there exists a disparity between their popularity in academic research and their limited adoption in the industry. A significant contributor to this “interpretability-adoption gap” is the prevalence of black-box models and the lack of built-in methods for model interpretation. While established methods for evaluating model performance exist, an intuitive understanding of the modeling and decision-making processes in models is nonetheless desired in many cases. In this work, we demonstrate several ways of incorporating model interpretability to the structure-agnostic Compositionally Restricted Attention-Based network, CrabNet. We show that CrabNet learns meaningful, material property-specific element representations based solely on the data with no additional supervision. These element representations can then be used to explore element identity, similarity, behavior, and interactions within different chemical environments. Chemical compounds can also be uniquely represented and examined to reveal clear structures and trends within the chemical space. Additionally, visualizations of the attention mechanism can be used in conjunction to further understand the modeling process, identify potential modeling or dataset errors, and hint at further chemical insights leading to a better understanding of the phenomena governing material properties. We feel confident that the interpretability methods introduced in this work for CrabNet will be of keen interest to materials informatics researchers as well as industrial practitioners alike.

... In the special case of lattices, their space of isometry classes was continuously parameterised by root forms [9,10] in dimensions two and three. AMDs were recently extended to Pointwise Distance Distributions (PDD), whose continuity was proved [40] under the Earth Mover's Distance, which was used for comparing chemical compositions [24]. The above root forms of lattices combined with PDD are enough to explicitly reconstruct a periodic point set in general position, which justifies a geometric inverse design for any periodic crystal. ...

The fundamental model of any solid crystalline material (crystal) at the atomic scale is a periodic point set. The strongest natural equivalence of crystals is rigid motion or isometry that preserves all inter-atomic distances. Past comparisons of periodic structures often used manual thresholds, symmetry groups and reduced cells, which are discontinuous under perturbations or thermal vibrations of atoms. This work defines the infinite sequence of continuous isometry invariants (Average Minimum Distances) to progressively capture distances between neighbors. The asymptotic behaviour of the new invariants is theoretically proved in all dimensions for a wide class of sets including non-periodic. The proposed near linear time algorithm identified all different crystals in the world's largest Cambridge Structural Database within a few hours on a modest desktop. The ultra fast speed and proved continuity provide rigorous foundations to continuously parameterise the space of all periodic crystals as a high-dimensional extension of Mendeleev's table of elements.

... Module A1: Element mover's distance for formula similarity calculation. We use the element mover's distance measure ElMD [21] to select the most similar template structures. ElMD is a metric that allows measuring the chemical similarity of two formulas in an explainable fashion. ...

Fast and accurate crystal structure prediction (CSP) algorithms and web servers are highly desirable for exploring and discovering new materials out of the infinite design space. However, the computationally expensive crystal structure prediction algorithms based on first-principles calculation are currently applicable only to relatively small systems and are out of reach of most materials researchers, due to the requirement of high computing resources or the software cost related to ab initio codes such as VASP. Several computational teams have used an element substitution approach for generating or predicting new structures, but usually in an ad hoc way. Here we develop a template-based crystal structure prediction algorithm (TCSP) and its companion web server, which makes this tool accessible to all materials researchers. Our algorithm uses elemental/chemical similarity and oxidation states to guide the selection of template structures, then ranks them based on substitution compatibility, and can return multiple predictions with ranking scores in a few minutes. A benchmark study on the ~98,290 formulas of the Materials Project database using leave-one-out evaluation shows that our algorithm can achieve high accuracy for a large portion of the formulas (for 13,145 target structures, TCSP predicted their structures with RMSD < 0.1). We have also used TCSP to discover new materials of the Ga-B-N system, showing its potential for high-throughput materials discovery. Our user-friendly web app TCSP can be accessed freely at www.materialsatlas.org/crystalstructure on our MaterialsAtlas.org web app platform.
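The template-selection step described above amounts to a nearest-neighbor query under a composition metric. The sketch below is our own schematic, not TCSP code: `distance` is any composition metric (e.g. ElMD), passed in as a placeholder callable.

```python
def rank_templates(query, templates, distance, k=5):
    """Return the k template formulas closest to `query` under `distance`."""
    # Sort all candidate templates by compositional distance; keep the k best.
    return sorted(templates, key=lambda t: distance(query, t))[:k]
```

In TCSP this ranking is restricted to templates sharing the query's composition prototype, so the candidate list is short and the query is fast.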

The Cambridge Structural Database (CSD) played a key role in the recently established crystal isometry principle (CRISP). The CRISP says that any real periodic crystal is uniquely determined as a rigid structure by the geometry of its atomic centers without atomic types. Ignoring atomic types allows us to study all periodic crystals in a common space whose continuous nature is justified by the continuity of real-valued coordinates of atoms. Our previous work introduced structural descriptors pointwise distance distributions (PDD) that are invariant under isometry defined as a composition of translations, rotations, and reflections. The PDD invariants distinguished all nonduplicate periodic crystals in the CSD. This paper presents the first continuous maps of the CSD and its important subsets in invariant coordinates that have analytic formulas and physical interpretations. Any existing periodic crystal has a uniquely defined location on these geographic-style maps. Any newly discovered periodic crystals will appear on the same maps without disturbing the past materials.

Data-to-knowledge started to reveal significant promises in material science. Still, some classes of materials, such as Metal-Organic Frameworks (MOFs), possess multi-dimensional interrelated physicochemical properties that pose challenges in using data...

Synthetic polymers, in contrast to small molecules and deterministic biomacromolecules, are typically ensembles composed of polymer chains with varying numbers, lengths, sequences, chemistry, and topologies. While numerous approaches exist for measuring pairwise similarity among small molecules and sequence-defined biomacromolecules, accurately determining the pairwise similarity between two polymer ensembles remains challenging. This work proposes the earth mover’s distance (EMD) metric to calculate the pairwise similarity score between two polymer ensembles. EMD offers a greater resolution of chemical differences between polymer ensembles than the averaging method and provides a quantitative numeric value representing the pairwise similarity between polymer ensembles in alignment with chemical intuition. The EMD approach for assessing polymer similarity enhances the development of accurate chemical search algorithms within polymer databases and can improve machine learning techniques for polymer design, optimization, and property prediction.
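The polymer-ensemble comparison above can be illustrated with the simplest possible descriptor: the empirical chain-length distribution. This is only a proxy for the richer ensemble descriptors the work actually compares; the sketch uses SciPy's one-dimensional Wasserstein (EMD) routine.

```python
from scipy.stats import wasserstein_distance

def ensemble_distance(lengths_a, lengths_b):
    """1-D EMD between two empirical chain-length distributions."""
    return wasserstein_distance(lengths_a, lengths_b)
```

Unlike comparing only the mean chain lengths, the EMD distinguishes ensembles whose averages coincide but whose distributions differ.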

Recent Machine Learning (ML) developments have opened new perspectives on accelerating the discovery of new materials. However, in the field of materials informatics, the performance of ML estimators is heavily limited by the nature of the available training datasets, which are often severely restricted and unbalanced. Among practitioners, it is usually taken for granted that more data corresponds to better performance. Here, we investigate whether different ML models for property predictions benefit from the aggregation of large databases into smaller repositories. To do this, we probe three different aggregation strategies prioritizing training size, element diversity, and composition diversity. For classic ML models, our results consistently show a reduction in performance under all the considered strategies. Deep Learning models show more robustness, but most changes are not significant. Furthermore, to assess whether this is a consequence of a distribution mismatch between datasets, we simulate the data acquisition process of a single dataset and compare a random selection with prioritizing chemical diversity. We observe that prioritizing composition diversity generally leads to a slower convergence toward better accuracy. Overall, our results suggest caution when merging different data sources and discourage a biased acquisition of novel chemistries when building a training dataset.

Two‐dimensional (2D) materials offer great potential in various fields like superconductivity, quantum systems, and topological materials. However, designing them systematically remains challenging due to the limited pool of fewer than 100 experimentally synthesized 2D materials. Recent advancements in deep learning, data mining, and density functional theory (DFT) calculations have paved the way for exploring new 2D material candidates. Herein, a generative material design pipeline known as the material transformer generator (MTG) is proposed. MTG leverages two distinct 2D material composition generators, both trained using self‐learning neural language models rooted in transformers, with and without transfer learning. These models generate numerous potential 2D compositions, which are plugged into established templates for known 2D materials to predict their crystal structures. To ensure stability, DFT computations assess their thermodynamic stability based on energy‐above‐hull and formation energy metrics. MTG has found four new DFT‐validated stable 2D materials: NiCl 4 , IrSBr, CuBr 3 , and CoBrCl, all with zero energy‐above‐hull values that indicate thermodynamic stability. Additionally, GaBrO and NbBrCl 3 are found with energy‐above‐hull values below 0.05 eV. CuBr 3 and GaBrO exhibit dynamic stability, confirmed by phonon dispersion analysis. In summary, the MTG pipeline shows significant potential for discovering new 2D and functional materials.

A geometric graph is a combinatorial graph endowed with a geometry inherited from its embedding in a Euclidean space. Formulating a meaningful measure of (dis-)similarity in both the combinatorial and geometric structures of two such geometric graphs is a challenging problem in pattern recognition. We study two notions of distance measures for geometric graphs, called the geometric edit distance (GED) and geometric graph distance (GGD). While the former is based on the idea of editing one graph to transform it into the other, the latter is inspired by inexact matching of the graphs. For decades, both notions have lent themselves well as measures of similarity between attributed graphs. If used without any modification, however, they fail to provide a meaningful distance measure for geometric graphs, and even cease to be a metric. We have curated their associated cost functions for the context of geometric graphs. Alongside studying the metric properties of GED and GGD, we investigate how the two notions compare. We further our understanding of the computational aspects of GGD by showing that the distance is NP-hard to compute, even if the graphs are planar and arbitrary cost coefficients are allowed.

Discovery of novel materials is slow but necessary for societal progress. Here, we demonstrate a closed-loop machine learning (ML) approach to rapidly explore a large materials search space, accelerating the intentional discovery of superconducting compounds. By experimentally validating the results of the ML-generated superconductivity predictions and feeding those data back into the ML model to refine, we demonstrate that success rates for superconductor discovery can be more than doubled. Through four closed-loop cycles, we report discovery of a superconductor in the Zr-In-Ni system, re-discovery of five superconductors unknown in the training datasets, and identification of two additional phase diagrams of interest for new superconducting materials. Our work demonstrates the critical role experimental feedback provides in ML-driven discovery, and provides a blueprint for how to accelerate materials progress.

The discovery of new materials often requires collaboration between experimental and computational chemists. Web-based platforms allow more flexibility in this collaboration by giving access to computational tools without the need for access to computational researchers. We present the Liverpool materials discovery server (https://lmds.liverpool.ac.uk/), one such platform, which currently hosts six state-of-the-art computational tools in an easy-to-use format. We describe the development of this platform, highlighting the advantages and disadvantages of the methods used. In addition, we provide source code, a tutorial example, setup scripts, and an application programming interface (API) to enable other research groups to create similar platforms and to promote collaboration both within and between research groups.

The traditional display of elements in the periodic table is convenient for the study of chemistry and physics. However, the atomic number alone is insufficient for training statistical machine learning models to describe and extract composition–structure–property relationships. Here, we assess the similarity and correlations contained within high-dimensional local and distributed representations of the chemical elements, as implemented in an open-source Python package, ElementEmbeddings. These include element vectors of up to 200 dimensions derived from known physical properties, crystal structure analysis, natural language processing, and deep learning models. A range of distance measures are compared, and a clustering of elements into familiar groups is found using dimensionality reduction techniques. The cosine similarity is used to assess the utility of these metrics for crystal structure prediction, showing that they can outperform the traditional radius ratio rules for the structural classification of AB binary solids.
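The cosine similarity used to compare element vectors can be sketched in a few lines of plain Python. The four-dimensional embeddings below are made-up illustrative values, not vectors from the ElementEmbeddings package:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two element vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional embeddings for two chemically similar elements.
na = [0.9, 0.1, 0.3, 0.0]
k = [0.8, 0.2, 0.4, 0.1]
print(round(cosine_similarity(na, k), 3))
```

Values near 1 indicate near-parallel vectors (similar elements); orthogonal vectors score 0, which is why cosine similarity is a natural choice when the absolute magnitude of an embedding carries no chemical meaning.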

Zeolites are inorganic materials known for their diversity of applications, synthesis conditions, and resulting polymorphs. Although their synthesis is controlled both by inorganic and organic synthesis conditions, computational studies of zeolite synthesis have focused mostly on organic template design. In this work, we use a strong distance metric between crystal structures and machine learning (ML) to create inorganic synthesis maps in zeolites. Starting with 253 known zeolites, we show how the continuous distances between frameworks reproduce inorganic synthesis conditions from the literature without using labels such as building units. An unsupervised learning analysis shows that neighboring zeolites according to our metric often share similar inorganic synthesis conditions, even in template-based routes. In combination with ML classifiers, we find synthesis-structure relationships for 14 common inorganic conditions in zeolites, namely Al, B, Be, Ca, Co, F, Ga, Ge, K, Mg, Na, P, Si, and Zn. By explaining the model predictions, we demonstrate how (dis)similarities towards known structures can be used as features for the synthesis space. Finally, we show how these methods can be used to predict inorganic synthesis conditions for unrealized frameworks in hypothetical databases and interpret the outcomes by extracting local structural patterns from zeolites. In combination with template design, this work can accelerate the exploration of the space of synthesis conditions for zeolites.

Two-dimensional (2D) materials exhibit exceptional properties. Thus, many studies have been conducted to discover novel 2D materials with unique characteristics or to find new ways of utilizing existing 2D materials. However, the existing open databases of 2D materials are often inefficient for this purpose. In this study, a material discovery framework is developed to identify new 2D materials using a deep learning-based generative model. First, a previous 2D database is adopted as a training set to develop a machine learning-based surrogate model for predicting the mechanical properties. Next, 2D candidates are generated, and their structural validity is confirmed by employing a classification model and checking their similarities to existing 2D materials. The uncertainty in the predicted mechanical properties of the generated materials is measured and the actual values are verified using density functional theory calculations. A total of 360 structures are newly identified according to the exploration method and the mean absolute error is significantly reduced from 206.025 to 10.185 N/m. We believe that the developed framework is general and can be further modified to search for novel 2D materials satisfying target physicochemical properties.

The vastness of the materials design space makes it impractical to explore using traditional brute-force methods, particularly in reticular chemistry. However, machine learning has shown promise in expediting and guiding materials design. Despite numerous successful applications of machine learning to reticular materials, progress in the field has stagnated, possibly because digital chemistry remains more an art than a science and is of limited accessibility to inexperienced researchers. To address this issue, we present mofdscribe, a software ecosystem tailored to novice and seasoned digital chemists that streamlines the ideation, modeling, and publication process. Though optimized for reticular chemistry, our tools are versatile and can be used in nonreticular materials research. We believe that mofdscribe will enable a more reliable, efficient, and comparable field of digital chemistry.

Weak thruster fault feature extraction and fault severity identification methods for autonomous underwater vehicles (AUVs) are studied in this paper. A traditional method of fault feature extraction is based on wavelet transformation combined with modified Bayes (MB), after which the grey relation analysis (GRA) method is used to identify the fault severity of the thruster. These methods are effective for strong thruster faults, but for weak faults they suffer from a low ratio of fault eigenvalues to noise eigenvalues in the extracted features and unsatisfactory fault identification accuracy. To overcome these deficiencies, resonance-based sparse signal decomposition (RSSD) together with stochastic resonance (SR) and MB is proposed for weak thruster fault feature extraction, and the Euclidean distance together with the grey relation (GR) method is proposed to improve the identification accuracy of weak thruster faults. Finally, pool experiments are performed on the Beaver II AUV, and the effectiveness of the proposed methods is demonstrated in comparison with the traditional ones.

The availability and easy access of large-scale experimental and computational materials data have enabled the emergence of accelerated development of algorithms and models for materials property prediction, structure prediction, and generative design of materials. However, the lack of user-friendly materials informatics web servers has severely constrained the wide adoption of such tools in the daily practice of materials screening, tinkering, and design space exploration by materials scientists. Herein we first survey current materials informatics web apps and then propose and develop MaterialsAtlas.org, a web-based materials informatics toolbox for materials discovery, which includes a variety of routinely needed tools for exploratory materials discovery, including material’s composition and structure validity check (e.g. charge neutrality, electronegativity balance, dynamic stability, Pauling rules), materials property prediction (e.g. band gap, elastic moduli, hardness, and thermal conductivity), search for hypothetical materials, and utility tools. These user-friendly tools can be freely accessed at http://www.materialsatlas.org. We argue that such materials informatics apps should be widely developed by the community to speed up materials discovery processes.
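A composition validity check of the charge-neutrality kind listed above can be sketched as follows. The oxidation states passed in here are illustrative assumptions supplied by the caller, not the tables any particular web server uses:

```python
def is_charge_neutral(composition, oxidation_states):
    """Return True if the stoichiometry-weighted oxidation states sum to zero.

    composition:      {element: count in the formula unit}
    oxidation_states: {element: assumed oxidation state}
    """
    total = sum(count * oxidation_states[el] for el, count in composition.items())
    return abs(total) < 1e-8

# Fe2O3 with Fe(+3)/O(-2) balances; FeO2 with the same states does not.
print(is_charge_neutral({"Fe": 2, "O": 3}, {"Fe": +3, "O": -2}))  # True
print(is_charge_neutral({"Fe": 1, "O": 2}, {"Fe": +3, "O": -2}))  # False
```

A production check would iterate over all plausible oxidation-state combinations for each element rather than assume a single one, but the pass/fail criterion is the same weighted sum.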

Although machine learning has gained great interest in the discovery of functional materials, the advancement of reliable models is impeded by the scarcity of available materials property data. Here we propose and demonstrate a distinctive approach for materials discovery using unsupervised learning, which does not require labeled data and thus alleviates the data scarcity challenge. Using solid-state Li-ion conductors as a model problem, unsupervised materials discovery utilizes a limited quantity of conductivity data to prioritize a candidate list from a wide range of Li-containing materials for further accurate screening. Our unsupervised learning scheme discovers 16 new fast Li-conductors with conductivities of 10⁻⁴–10⁻¹ S cm⁻¹ predicted in ab initio molecular dynamics simulations. These compounds have structures and chemistries distinct to known systems, demonstrating the capability of unsupervised learning for discovering materials over a wide materials space with limited property data. Predictions of new solid-state Li-ion conductors are challenging due to the diverse chemistries and compositions involved. Here the authors combine unsupervised learning techniques and molecular dynamics simulations to discover new compounds with high Li-ion conductivity.

One of the most exciting tools that have entered the materials science toolbox in recent years is machine learning. This collection of statistical methods has already proved capable of considerably speeding up both fundamental and applied research. At present, we are witnessing an explosion of works that develop and apply machine learning to solid-state systems. We provide a comprehensive overview and analysis of the most recent research on this topic. As a starting point, we introduce machine learning principles, algorithms, descriptors, and databases in materials science. We continue with the description of different machine learning approaches for the discovery of stable materials and the prediction of their crystal structure. Then we discuss research in numerous quantitative structure–property relationships and various approaches for the replacement of first-principles methods by machine learning. We review how active learning and surrogate-based optimization can be applied to improve the rational design process, along with related examples of applications. Two recurring questions are the interpretability of, and the physical understanding gained from, machine learning models; we therefore consider the different facets of interpretability and their importance in materials science. Finally, we propose solutions and future research paths for various challenges in computational materials science.

The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases [1, 2], which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing [3-10], which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings [11-13] (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.

Conventional machine learning approaches for predicting material properties from elemental compositions have emphasized the importance of leveraging domain knowledge when designing model inputs. Here, we demonstrate that by using a deep learning approach, we can bypass such manual feature engineering requiring domain knowledge and achieve much better results, even with only a few thousand training samples. We present the design and implementation of a deep neural network model referred to as ElemNet; it automatically captures the physical and chemical interactions and similarities between different elements, allowing it to predict materials properties with better accuracy and speed. The speed and best-in-class accuracy of ElemNet enable us to perform a fast and robust screening for new material candidates in a huge combinatorial space, where we predict hundreds of thousands of chemical systems that could contain yet-undiscovered compounds.
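Composition-only models of this kind typically take a fixed-length vector of elemental molar fractions as input, one slot per element. A minimal sketch of that featurization, using a toy element list rather than the full periodic table assumed by such networks:

```python
# Toy stand-in for the full element vocabulary a real model would use.
ELEMENTS = ["H", "Li", "O", "Na", "Fe"]

def composition_vector(formula_counts):
    """Map {element: count} to a vector of molar fractions over ELEMENTS."""
    total = sum(formula_counts.values())
    return [formula_counts.get(el, 0) / total for el in ELEMENTS]

print(composition_vector({"Fe": 2, "O": 3}))  # [0.0, 0.0, 0.6, 0.0, 0.4]
```

Every composition, regardless of how many elements it contains, maps to the same vector length, which is what lets a single network architecture handle the whole combinatorial search space.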

Significance
Motivated by the recent achievements of artificial intelligence (AI) in linguistics, we design AI to learn properties of atoms from materials data on its own. Our work realizes knowledge representation of atoms via computers and could serve as a foundational step toward materials discovery and design fully based on machine learning.

The use of advanced machine learning algorithms in experimental materials science is limited by the lack of sufficiently large and diverse datasets amenable to data mining. If publicly open, such data resources would also enable materials research by scientists without access to expensive experimental equipment. Here, we report on our progress towards a publicly open High Throughput Experimental Materials (HTEM) Database (htem.nrel.gov). This database currently contains 140,000 sample entries, characterized by structural (100,000), synthetic (80,000), chemical (70,000), and optoelectronic (50,000) properties of inorganic thin film materials, grouped in >4,000 sample entries across >100 materials systems; more than half of these data are publicly available. This article shows how the HTEM database may enable scientists to explore materials by browsing a web-based user interface and through an application programming interface. This paper also describes the HTE approach to generating materials data, and discusses the laboratory information management system (LIMS) that underpins the HTEM database. Finally, this manuscript illustrates how advanced machine learning algorithms can be adapted to materials science problems using this open data resource.

UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP as described has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.

Computational methods that automatically extract knowledge from data are critical for enabling data-driven materials science. A reliable identification of lattice symmetry is a crucial first step for materials characterization and analytics. Current methods require a user-specified threshold, and are unable to detect "average symmetries" for defective structures. Here, we propose a new machine-learning-based approach to automatically classify structures by crystal symmetry. First, we represent crystals by a diffraction image, and then construct a deep-learning neural-network model for classification. Our approach is able to correctly classify a dataset comprising more than 80,000 structures, including heavily defective ones. The internal operations of the neural network are unraveled through attentive response maps, demonstrating that it uses the same landmarks a materials scientist would use, although never explicitly instructed to do so. Our study paves the way for crystal-structure recognition in computational and experimental big-data materials science.

While high-throughput density functional theory (DFT) has become a prevalent tool for materials discovery, it is limited by the relatively large computational cost. In this paper, we explore using DFT data from high-throughput calculations to create faster, surrogate models with machine learning (ML) that can be used to guide new searches. Our method works by using decision tree models to map DFT-calculated formation enthalpies to a set of attributes consisting of two distinct types: (i) composition-dependent attributes of elemental properties (as have been used in previous ML models of DFT formation energies), combined with (ii) attributes derived from the Voronoi tessellation of the compound's crystal structure. The ML models created using this method have half the cross-validation error and similar training and evaluation speeds to models created with the Coulomb matrix and partial radial distribution function methods. For a dataset of 435 000 formation energies taken from the Open Quantum Materials Database (OQMD), our model achieves a mean absolute error of 80 meV/atom in cross validation, which is lower than the approximate error between DFT-computed and experimentally measured formation enthalpies and below 15% of the mean absolute deviation of the training set. We also demonstrate that our method can accurately estimate the formation energy of materials outside of the training set and be used to identify materials with especially large formation enthalpies. We propose that our models can be used to accelerate the discovery of new materials by identifying the most promising materials to study with DFT at little additional computational cost.
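The first family of attributes mentioned above, composition-dependent statistics of elemental properties, can be sketched in plain Python. The electronegativity values below are approximate Pauling-scale numbers used purely for illustration, not the property tables the authors actually used:

```python
def weighted_stats(composition, elemental_property):
    """Fraction-weighted mean and range of an elemental property.

    composition:        {element: count in the formula unit}
    elemental_property: {element: tabulated property value}
    """
    total = sum(composition.values())
    fractions = {el: n / total for el, n in composition.items()}
    mean = sum(f * elemental_property[el] for el, f in fractions.items())
    values = [elemental_property[el] for el in composition]
    return mean, max(values) - min(values)

# Illustrative electronegativities (approximate Pauling values).
chi = {"Na": 0.93, "Cl": 3.16}
mean, spread = weighted_stats({"Na": 1, "Cl": 1}, chi)
print(round(mean, 3), round(spread, 3))  # 2.045 2.23
```

Repeating this for many tabulated properties (atomic radius, valence electron count, melting point, ...) yields the composition half of the attribute set; the Voronoi-tessellation attributes require the crystal structure and are not sketched here.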

Transport-based techniques for signal and data analysis have recently received increased interest. Given their ability to provide accurate generative models for signal intensities and other data distributions, they have been used in a variety of applications, including content-based retrieval, cancer detection, image superresolution, and statistical machine learning, to name a few, and they have been shown to produce state-of-the-art results. Moreover, the geometric characteristics of transport-related metrics have inspired new kinds of algorithms for interpreting the meaning of data distributions. Here, we provide a practical overview of the mathematical underpinnings of mass transport-related methods, including numerical implementation, as well as a review, with demonstrations, of several applications. Software accompanying this article is available from [43].

Subgroup discovery (SGD) is presented here as a data-mining approach to help find interpretable local patterns, correlations, and descriptors of a target property in materials-science data. Specifically, we will be concerned with data generated by density-functional theory calculations. At first, we demonstrate that SGD can identify physically meaningful models that classify the crystal structures of 82 octet binary semiconductors as either rocksalt or zincblende. SGD identifies an interpretable two-dimensional model derived from only the atomic radii of valence s and p orbitals that properly classifies the crystal structures for 79 of the 82 octet binary semiconductors. The SGD framework is subsequently applied to 24 400 configurations of neutral gas-phase gold clusters with 5 to 14 atoms to discern general patterns between geometrical and physicochemical properties. For example, SGD helps find that van der Waals interactions within gold clusters are linearly correlated with their radius of gyration and are weaker for planar clusters than for nonplanar clusters. Also, a descriptor that predicts a local linear correlation between the chemical hardness and the cluster isomer stability is found for the even-sized gold clusters.

Starting from the experimental data contained in the inorganic crystal structure database, we use a statistical analysis to determine the likelihood that a chemical element A can be replaced by another B in a given structure. This information can be used to construct a matrix where each entry (A, B) is a measure of this likelihood. By ordering the rows and columns of this matrix so as to reduce its bandwidth, we construct a one-dimensional ordering of the chemical elements, analogous to the famous Pettifor scale. The new scale shows large similarities with the one of Pettifor, but also striking differences, especially in the ordering of the non-metals.
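With the elements arranged on such a one-dimensional scale, the earth mover's distance between two compositions has a closed form: it is the area between their cumulative distributions along the scale. A minimal sketch, with made-up scale positions standing in for the actual modified-Pettifor numbers:

```python
def composition_emd(comp_a, comp_b, scale):
    """1D earth mover's distance between two compositions.

    comp_a, comp_b: {element: molar fraction} (each summing to 1)
    scale:          {element: position on a one-dimensional elemental scale}
    For a 1D ground distance, the optimal transport cost equals the area
    between the two cumulative distribution functions along the scale.
    """
    elements = sorted(set(comp_a) | set(comp_b), key=lambda el: scale[el])
    emd, cdf_gap, prev_pos = 0.0, 0.0, None
    for el in elements:
        pos = scale[el]
        if prev_pos is not None:
            emd += abs(cdf_gap) * (pos - prev_pos)
        cdf_gap += comp_a.get(el, 0.0) - comp_b.get(el, 0.0)
        prev_pos = pos
    return emd

# Hypothetical scale positions: neighbouring alkali metals, distant halogen.
scale = {"Na": 10, "K": 11, "Cl": 95}
print(composition_emd({"Na": 0.5, "Cl": 0.5}, {"K": 0.5, "Cl": 0.5}, scale))  # 0.5
```

Here NaCl and KCl differ only by moving half the mass one scale position (0.5 × 1 = 0.5), so chemically similar compositions come out close even though their fraction vectors share only one non-zero entry.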

Historically, materials discovery is driven by a laborious trial-and-error process. The growth of materials databases and emerging informatics approaches finally offer the opportunity to transform this practice into data- and knowledge-driven rational design, accelerating the discovery of novel materials exhibiting desired properties. By using data from the AFLOW repository for high-throughput ab initio calculations, we have generated Quantitative Materials Structure-Property Relationship (QMSPR) models to predict three critical material properties, namely the metal/insulator classification, Fermi energy, and band gap energy. The prediction accuracy obtained with these QMSPR models approaches that of the training data for virtually any stoichiometric inorganic crystalline material. We attribute the success and universality of these models to the construction of new material descriptors, referred to as universal property-labeled material fragments (PLMF). This representation affords straightforward model interpretation in terms of simple heuristic design rules that could guide rational materials design. This proof-of-concept study demonstrates the power of materials informatics to dramatically accelerate the search for new materials.

A very active area of materials research is to devise methods that use machine learning to automatically extract predictive models from existing materials data. While prior examples have demonstrated successful models for some applications, many more applications exist where machine learning can make a strong impact. To enable faster development of machine-learning-based models for such applications, we have created a framework capable of being applied to a broad range of materials data. Our method works by using a chemically diverse list of attributes, which we demonstrate are suitable for describing a wide variety of properties, and a novel method for partitioning the data set into groups of similar materials in order to boost the predictive accuracy. In this manuscript, we demonstrate how this new method can be used to predict diverse properties of crystalline and amorphous materials, such as band gap energy and glass-forming ability.

Changes in the frequencies of cell subsets that (co)express characteristic biomarkers, or levels of the biomarkers on the subsets, are widely used as indices of drug response, disease prognosis, stem cell reconstitution, etc. However, although the currently available computational "gating" tools accurately reveal subset frequencies and marker expression levels, they fail to enable statistically reliable judgements as to whether these frequencies and expression levels differ significantly between/among subject groups. Here we introduce a flow cytometry data analysis pipeline which includes the Earth Mover's Distance (EMD) metric as a solution to this problem. Well known as an informative quantitative measure of differences between distributions, the EMD is demonstrated in three exemplary studies showing that it 1) reveals clinically relevant shifts in two markers on blood basophils responding to an offending allergen; 2) shows that ablative tumor radiation induces significant changes in the murine colon cancer tumor microenvironment; and 3) ranks immunological differences in mouse peritoneal cavity cells harvested from three genetically distinct mouse strains.

To evaluate the potential of Na-ion batteries, we contrast in this work the difference between Na-ion and Li-ion based intercalation chemistries in terms of three key battery properties—voltage, phase stability and diffusion barriers. The compounds investigated comprise the layered AMO2 and AMS2 structures, the olivine and maricite AMPO4 structures, and the NASICON A3V2(PO4)3 structures. The calculated Na voltages for the compounds investigated are 0.18–0.57 V lower than that of the corresponding Li voltages, in agreement with previous experimental data. We believe the observed lower voltages for Na compounds are predominantly a cathodic effect related to the much smaller energy gain from inserting Na into the host structure compared to inserting Li. We also found a relatively strong dependence of battery properties on structural features. In general, the difference between the Na and Li voltage of the same structure, ΔVNa–Li, is less negative for the maricite structures preferred by Na, and more negative for the olivine structures preferred by Li. The layered compounds have the most negative ΔVNa–Li. In terms of phase stability, we found that open structures, such as the layered and NASICON structures, that are better able to accommodate the larger Na+ ion generally have both Na and Li versions of the same compound. For the close-packed AMPO4 structures, our results show that Na generally prefers the maricite structure, while Li prefers the olivine structure, in agreement with previous experimental work. We also found surprising evidence that the barriers for Na+ migration can potentially be lower than that for Li+ migration in the layered structures. Overall, our findings indicate that Na-ion systems can be competitive with Li-ion systems.

We review some recently proposed methods to represent atomic neighbourhood environments and analyse their relative merits in terms of their faithfulness and suitability for fitting potential energy surfaces (PES). The crucial properties that such representations (commonly called descriptors) must have are continuity and invariance to the basic symmetries of physics: rotation, reflection, translation, and permutation of identical atoms. We demonstrate that schemes that initially look quite different are specific cases of a general approach, in which a finite set of basis functions with increasing angular wave numbers are used to expand the atomic neighbourhood density function. We quantitatively show using the example system of small clusters that this expansion needs to be carried to higher and higher wave numbers as the number of neighbours increases in order to obtain a faithful representation, and that variants of the descriptors converge at very different rates.

We investigate the properties of a metric between two distributions, the Earth Mover's Distance (EMD), for content-based image retrieval. The EMD is based on the minimal cost that must be paid to transform one distribution into the other, in a precise sense, and was first proposed for certain vision problems by Peleg, Werman, and Rom. For image retrieval, we combine this idea with a representation scheme for distributions that is based on vector quantization. This combination leads to an image comparison framework that often accounts for perceptual similarity better than other previously proposed methods. The EMD is based on a solution to the transportation problem from linear optimization, for which efficient algorithms are available, and also allows naturally for partial matching. It is more robust than histogram matching techniques, in that it can operate on variable-length representations of the distributions that avoid quantization and other binning problems typical of histograms. When used to compare distributions with the same overall mass, the EMD is a true metric. In this paper we focus on applications to color and texture, and we compare the retrieval performance of the EMD with that of other distances.
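The robustness claim over histogram matching can be illustrated for one-dimensional histograms: a bin-wise distance scores a small shift and a large shift identically, while the EMD grows with how far the mass actually moved. A minimal sketch for equal-mass histograms on unit-spaced bins (not the variable-length signatures the authors use):

```python
def emd_1d(hist_a, hist_b):
    """EMD between equal-mass histograms on unit-spaced bins:
    the sum of absolute differences of the running cumulative sums."""
    emd, gap = 0.0, 0.0
    for a, b in zip(hist_a, hist_b):
        gap += a - b
        emd += abs(gap)
    return emd

def binwise_l1(hist_a, hist_b):
    """Bin-by-bin L1 distance, for comparison."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

base = [1, 0, 0, 0]
shift_1 = [0, 1, 0, 0]  # all mass moved one bin
shift_3 = [0, 0, 0, 1]  # all mass moved three bins
print(binwise_l1(base, shift_1), binwise_l1(base, shift_3))  # 2 2
print(emd_1d(base, shift_1), emd_1d(base, shift_3))          # 1.0 3.0
```

The L1 distance cannot tell the two shifts apart, whereas the EMD reports 1.0 versus 3.0, matching the intuitive cost of carrying the mass across the bins.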

Due to the rapid emergence of antibiotic-resistant bacteria, there is a growing need to discover new antibiotics. To address this challenge, we trained a deep neural network capable of predicting molecules with antibacterial activity. We performed predictions on multiple chemical libraries and discovered a molecule from the Drug Repurposing Hub—halicin—that is structurally divergent from conventional antibiotics and displays bactericidal activity against a wide phylogenetic spectrum of pathogens including Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. Halicin also effectively treated Clostridioides difficile and pan-resistant Acinetobacter baumannii infections in murine models. Additionally, from a discrete set of 23 empirically tested predictions from >107 million molecules curated from the ZINC15 database, our model identified eight antibacterial compounds that are structurally distant from known antibiotics. This work highlights the utility of deep learning approaches to expand our antibiotic arsenal through the discovery of structurally distinct antibacterial molecules. A trained deep neural network predicts antibiotic activity in molecules that are structurally different from known antibiotics, among which Halicin exhibits efficacy against broad-spectrum bacterial infections in mice.

Topological Data Analysis for Genomics and Evolution - by Raúl Rabadán December 2019

We formulate a materials design strategy combining a machine learning (ML) surrogate model with experimental design algorithms to search for high entropy alloys (HEAs) with large hardness in a model Al-Co-Cr-Cu-Fe-Ni system. We fabricated several alloys with hardness 10% higher than the best value in the original training dataset via only seven experiments. We find that a strategy using both the compositions and descriptors based on a knowledge of the properties of HEAs, outperforms that merely based on the compositions alone. This strategy offers a recipe to rapidly optimize multi-component systems, such as bulk metallic glasses and superalloys, towards desired properties.
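Experimental-design loops of this kind typically score candidate compositions with an acquisition function, commonly expected improvement over a Gaussian surrogate prediction. A sketch under that assumption; the alloy names and hardness numbers below are hypothetical:

```python
import math

def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement of a candidate whose property the surrogate
    predicts as Gaussian with mean `mu` and std `sigma`, relative to the
    best value measured so far (maximization)."""
    if sigma <= 0.0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best_so_far) * cdf + sigma * pdf

# Hypothetical hardness predictions (mean, std) for three candidate alloys:
candidates = {"alloy_A": (520.0, 30.0), "alloy_B": (505.0, 80.0), "alloy_C": (480.0, 5.0)}
best = 510.0  # best hardness measured so far
scores = {name: expected_improvement(m, s, best) for name, (m, s) in candidates.items()}
print(max(scores, key=scores.get))  # the next alloy to synthesize
```

Note how the uncertain candidate (`alloy_B`) can outscore one with a higher predicted mean: the acquisition function trades off exploitation against exploration.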

New machine learning methods to analyze raw chemical and biological data are now widely accessible as open-source toolkits. This positions researchers to leverage powerful, predictive models in their own domains. We caution, however, that the application of machine learning to experimental research merits careful consideration. Machine learning algorithms readily exploit confounding variables and experimental artifacts instead of relevant patterns, leading to overoptimistic performance and poor model generalization. In parallel to the strong control experiments that remain a cornerstone of experimental research, we advance the concept of adversarial controls for scientific machine learning: the design of exacting and purposeful experiments to ensure that predictive performance arises from meaningful models.
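One simple adversarial control in this spirit is y-scrambling: re-score the model after shuffling the labels, so that any "skill" that survives must come from artifacts rather than the descriptor-property relationship. An illustrative stdlib-only sketch on synthetic data:

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation, used here as a stand-in model score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(0)
descriptor = [float(i) for i in range(30)]
prop = [2.0 * x + random.gauss(0.0, 1.0) for x in descriptor]  # real signal

r_real = pearson_r(descriptor, prop)

shuffled = prop[:]
random.shuffle(shuffled)  # break the descriptor-property link
r_scrambled = pearson_r(descriptor, shuffled)

# A model whose score does NOT collapse under scrambling is likely
# exploiting a confounder rather than a meaningful pattern.
print(abs(r_real), abs(r_scrambled))
```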

As materials data sets grow in size and scope, the role of data mining and statistical learning methods to analyze these materials data sets and build predictive models is becoming more important. This manuscript introduces matminer, an open-source, Python-based software platform to facilitate data-driven methods of analyzing and predicting materials properties. Matminer provides modules for retrieving large data sets from external databases such as the Materials Project, Citrination, Materials Data Facility, and Materials Platform for Data Science. It also provides implementations for an extensive library of feature extraction routines developed by the materials community, with 47 featurization classes that can generate thousands of individual descriptors and combine them into mathematical functions. Finally, matminer provides a visualization module for producing interactive, shareable plots. These functions are designed in a way that integrates closely with machine learning and data analysis packages already developed and in use by the Python data science community. We explain the structure and logic of matminer, provide a description of its various modules, and showcase several examples of how matminer can be used to collect data, reproduce data mining studies reported in the literature, and test new methodologies.
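The flavor of composition-based featurization can be illustrated without matminer itself: composition-weighted statistics of a single elemental property. The function below is a toy sketch, not matminer's API; the electronegativity values are standard Pauling values:

```python
def featurize_composition(fractions, elem_property):
    """Composition-weighted statistics of one elemental property, in the
    spirit of composition featurizers (illustrative sketch only).

    `fractions` maps element symbol -> atomic fraction (summing to 1),
    `elem_property` maps element symbol -> property value.
    """
    mean = sum(f * elem_property[el] for el, f in fractions.items())
    lo = min(elem_property[el] for el in fractions)
    hi = max(elem_property[el] for el in fractions)
    return {"mean": mean, "min": lo, "max": hi, "range": hi - lo}

# Pauling electronegativities:
chi = {"Fe": 1.83, "O": 3.44}
feats = featurize_composition({"Fe": 0.4, "O": 0.6}, chi)  # Fe2O3
print(round(feats["mean"], 3))  # 2.796
```

Real featurizers repeat this over dozens of elemental properties and several statistics to generate the thousands of descriptors mentioned above.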

We perform a large scale benchmark of machine learning methods for the prediction of the thermodynamic stability of solids. We start by constructing a data set that comprises density functional theory calculations of around 250,000 cubic perovskite systems. This includes all possible perovskite and anti-perovskite crystals that can be generated with elements from hydrogen to bismuth, neglecting rare gases and lanthanides. Incidentally, these calculations already reveal a large number of systems (around 500) that are thermodynamically stable but not present in crystal structure databases. Moreover, some of these phases have unconventional compositions and define completely new families of perovskites. This data set is then used to train and test a series of machine learning algorithms to predict the energy distance to the convex hull of stability. In particular, we study the performance of ridge regression, random forests, extremely randomized trees (including adaptive boosting), and neural networks. We find that extremely randomized trees give the smallest mean absolute error of the distance to the convex hull (121 meV/atom) in the test set of 230,000 perovskites, after being trained on 20,000 samples. Surprisingly, the model already works if we give it as sole input features the group and row in the periodic table of the three elements composing the perovskite. Moreover, we find that the prediction accuracy is not uniform across the periodic table, being worse for first-row elements and elements forming magnetic compounds. Our results suggest that machine learning can be used to speed up considerably (by at least a factor of 5) high-throughput DFT calculations, by restricting the space of relevant chemical compositions without degradation of the accuracy.
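Using only group and row as input amounts to a very low-dimensional regression. A stdlib-only sketch of ridge regression with two features, one of the models benchmarked above; the toy targets are invented for illustration, not taken from the perovskite data set:

```python
def ridge_2d(X, y, lam=1.0):
    """Ridge regression with exactly two features, solved in closed form:
    w = (X^T X + lam*I)^{-1} X^T y, with the 2x2 inverse written out."""
    a = sum(x[0] * x[0] for x in X) + lam
    b = sum(x[0] * x[1] for x in X)
    c = sum(x[1] * x[1] for x in X) + lam
    g0 = sum(x[0] * yi for x, yi in zip(X, y))
    g1 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * c - b * b
    return ((c * g0 - b * g1) / det, (a * g1 - b * g0) / det)

# Hypothetical toy data: features are (group, row) of one element; the
# target mimics an energy distance to the convex hull in eV/atom.
X = [(1.0, 2.0), (2.0, 2.0), (13.0, 3.0), (14.0, 3.0), (16.0, 4.0)]
y = [0.30, 0.25, 0.10, 0.08, 0.05]
w = ridge_2d(X, y, lam=0.1)
pred = w[0] * 15.0 + w[1] * 3.0  # prediction for a new (group=15, row=3) element
print(w, pred)
```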

As the proliferation of high-throughput approaches in materials science is increasing the wealth of data in the field, the gap between accumulated information and derived knowledge widens. We address the issue of scientific discovery in materials databases by introducing novel analytical approaches based on structural and electronic materials fingerprints. The framework is employed to (i) query large databases of materials using similarity concepts, (ii) map the connectivity of the materials space (i.e., as a materials cartogram) for rapidly identifying regions with unique organizations/properties, and (iii) develop predictive Quantitative Materials Structure-Property Relationships (QMSPR) models for guiding materials design. In this study, we test these fingerprints by seeking target material properties. As a quantitative example, we model the critical temperatures of known superconductors. Our novel materials fingerprinting and materials cartography approaches contribute to the emerging field of materials informatics by enabling effective computational tools to analyze, visualize, model, and design new materials.

The Inorganic Crystal Structure Database (ICSD) is a comprehensive collection of crystal structure entries for inorganic materials. ICSD is produced by Fachinformationszentrum Karlsruhe, Germany, and the National Institute of Standards and Technology, US. The WWW interface is developed in cooperation with the Institut Laue-Langevin, Grenoble. The ICSD is disseminated in computerized formats with scientific software tools to exploit the content of the database. ICSD includes records of all inorganic crystal structures with atomic coordinates published since 1913. The data base contains 70 102 records as of July 2003. All data are recorded by experts and are checked several times. Apart from updating, data integrity and completeness are important objectives. Incorporation of missing structures, evaluation and correction of data, with the help of authors, users and experts are ongoing activities. This review article gives an overview of the product portfolio and the current activities.

A significant breakthrough has been achieved in the design of new materials by using materials databases, semiempirical approaches and neural networks. It was found in the present work that a nonlinear expression involving one elemental property parameter can be used to predict, with an overall accuracy exceeding 99%, the occurrence of a compound for any binary, ternary or quaternary system. This elemental property parameter, referred to as the Mendeleev number, was conceived by D.G. Pettifor in 1983 to group binary compounds by crystal structures. The immediate profit of this discovery is the obvious savings, in time and resources, relative to the investigation of yet-to-be-studied, materials systems. In the longer term the relation found here will make it possible to better define the search space for the development of new materials and encourage attempts to predict more specific information such as stoichiometries, crystal structures and physical properties.

A chemical scale χ is presented which characterizes each atom in the periodic table. It allows a single two-dimensional structure map (χA, χB) to be plotted for all binary compounds with a given stoichiometry ABn. The resultant map for 574 AB compounds shows excellent structural separation, the 75 octet sp-sp compounds being perfectly separated amongst themselves.

The atomic environment types (AETs) (coordination polyhedra) realized by each chemical element in binary compounds at the equi-atomic composition were analyzed based on a comprehensive set of literature data. The Mendeleev number (MN) (ordering number listing the chemical elements column by column through the periodic system) was successfully used to classify the chemical systems. An atomic environment type map, using as coordinates the maximum Mendeleev number versus the ratio between the minimum and the maximum Mendeleev number, sub-divided the chemical systems where different atomic environment types occur in distinct stability domains. The same maps also showed a clear separation between chemical systems where intermediate compounds form and those where no compounds form. These maps make it possible to predict the existence of compound that have not yet been investigated with a particular atomic environment.

We present a new metric between histograms such as SIFT descriptors and a linear time algorithm for its computation. It is common practice to use the L2 metric for comparing SIFT descriptors. This practice assumes that SIFT bins are aligned, an assumption which is often not correct due to quantization, distortion, occlusion, etc. In this paper we present a new Earth Mover's Distance (EMD) variant. We show that it is a metric (unlike the original EMD [1], which is a metric only for normalized histograms). Moreover, it is a natural extension of the L1 metric. Second, we propose a linear time algorithm for the computation of the EMD variant, with a robust ground distance for oriented gradients. Finally, extensive experimental results on the Mikolajczyk and Schmid dataset [2] show that our method outperforms state-of-the-art distances.
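The connection to the L1 metric can be made concrete in the simplest setting: for two aligned, equal-mass 1-D histograms with unit-spaced bins, the EMD equals the L1 distance between their cumulative sums, computable in one linear pass. A minimal sketch of this special case (not the paper's algorithm for 2-D SIFT grids):

```python
def emd_hist(p, q):
    """EMD between two aligned 1-D histograms of equal total mass.
    With unit-spaced bins it equals the L1 distance between the
    cumulative histograms, so a single pass suffices."""
    assert len(p) == len(q)
    total = 0.0
    carry = 0.0  # mass that must still flow past the current bin boundary
    for pi, qi in zip(p, q):
        carry += pi - qi
        total += abs(carry)
    return total

print(emd_hist([1, 0, 0], [0, 0, 1]))  # 2.0: one unit of mass moves two bins
```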

Virtual screening is a widely used strategy in modern drug discovery and 2D fingerprint similarity is an important tool that has been successfully applied to retrieve active compounds from large datasets. However, it is not always straightforward to select an appropriate fingerprint method and associated settings for a given problem. Here, we applied eight different fingerprint methods, as implemented in the new cheminformatics package Canvas, on a well-validated dataset covering five targets. The fingerprint methods include Linear, Dendritic, Radial, MACCS, MOLPRINT2D, Pairwise, Triplet, and Torsion. We find that most fingerprints have similar retrieval rates on average; however, each has special characteristics that distinguish its performance on different query molecules and ligand sets. For example, some fingerprints exhibit a significant ligand size dependency whereas others are more robust with respect to variations in the query or active compounds. In cases where little information is known about the active ligands, MOLPRINT2D fingerprints produce the highest average retrieval actives. When multiple queries are available, we find that a fingerprint averaged over all query molecules is generally superior to fingerprints derived from single queries. Finally, a complementarity metric is proposed to determine which fingerprint methods can be combined to improve screening results.
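Fingerprint retrieval of the kind benchmarked here is typically ranked by Tanimoto (Jaccard) similarity. A minimal sketch with fingerprints represented as sets of "on" bit indices; the bit values below are made up for illustration:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as
    sets of 'on' bit indices: |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Hypothetical bit sets for a query and two database molecules:
query = {1, 5, 9, 23, 42}
hit = {1, 5, 9, 23, 77}
miss = {2, 8, 64}
print(tanimoto(query, hit), tanimoto(query, miss))  # ~0.667 and 0.0
```

Averaging such fingerprints over multiple query molecules, as the abstract suggests, amounts to ranking by the mean of these per-query similarities.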

Chemical structure curation plays an important role in cheminformatics and QSAR modeling research. Both common sense and the investigations described above indicate that chemical record curation should be viewed as a separate and critical component of any cheminformatics research. Treatment of mixtures is not as simple as it appears. The practice of retaining the component with the highest molecular weight or largest number of atoms is common and widely used, but not necessarily the best solution. Manual conversion of all functional groups to standard forms is too time-consuming and could introduce additional human-dependent, nonsystematic errors. ChemAxon's Standardizer is probably the best-known tool for rapidly and efficiently performing chemotype normalizations. Rigorous statistical analysis of any data set assumes that each compound is unique and thus structurally different from all other compounds.
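The "keep the largest component" mixture rule discussed above can be sketched in a few lines. This is a deliberately crude illustration (string length as a proxy for fragment size; real tools use molecular weight or atom counts), not a recommended curation procedure:

```python
def keep_largest_fragment(smiles):
    """Crude sketch of the 'keep the biggest component' mixture rule:
    split a SMILES string on '.' (the component separator) and keep the
    longest fragment. String length is only a rough proxy for size."""
    return max(smiles.split("."), key=len)

# Aspirin hydrochloride mixture: the Cl counter-ion fragment is dropped.
print(keep_largest_fragment("CC(=O)Oc1ccccc1C(=O)O.Cl"))
```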

We establish the existence of Euclidean tangent cones on Wasserstein spaces over compact Alexandrov spaces of curvature bounded below. By using this Riemannian structure, we formulate and construct gradient flows of functions on such spaces. If the underlying space is a Riemannian manifold of nonnegative sectional curvature, then our gradient flow of the free energy produces a solution of the linear Fokker-Planck equation.
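Concretely, for the free energy combining entropy and a potential term, the standard statement (as in the Jordan-Kinderlehrer-Otto formulation) reads:

```latex
% Free energy: entropy plus potential energy of the density \rho
F(\rho) = \int \rho \log \rho \, dm + \int V \rho \, dm

% Its gradient flow with respect to the Wasserstein metric solves
% the linear Fokker-Planck equation
\partial_t \rho = \Delta \rho + \nabla \cdot (\rho \nabla V)
```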

Erratum: A Deep Learning Approach to Antibiotic Discovery. Cell (2020) 180(4), 688–702.e13. DOI: 10.1016/j.cell.2020.01.021

- Stokes J.M.
- Yang K.
- Swanson K.
- Jin W.
- Cubillos-Ruiz A.
- Donghia N.M.
- MacNair C.R.
- Collins J.J.

Thinking globally, acting locally: On the issue of training set imbalance and the case for local machine learning models in chemistry

- M Haghighatlari
- C Y Shih
- J Hachmann

Haghighatlari, M.; Shih, C. Y.; Hachmann, J. Thinking globally, acting locally: On the issue of training set imbalance and the case for local machine learning models in chemistry. ChemRxiv 2019, 8796947. https://chemrxiv.org/articles/preprint/Thinking_Globally_Acting_Locally_On_the_Issue_of_Training_Set_Imbalance_and_the_Case_for_Local_Machine_Learning_Models_in_Chemistry/8796947/2 (accessed November 11, 2020).

Minimum Cost Flows: Network Simplex Algorithms