Article

# Boosting Quantum Machine Learning Models with a Multilevel Combination Technique: Pople Diagrams Revisited


## Abstract

Inspired by Pople diagrams popular in quantum chemistry, we introduce a hierarchical scheme, based on the multi-level combination (C) technique, to combine various levels of approximations made when calculating molecular energies within quantum chemistry. When combined with quantum machine learning (QML) models, the resulting CQML model is a generalized unified recursive kernel ridge regression which exploits correlations implicitly encoded in training data comprised of multiple levels in multiple dimensions. Here, we have investigated up to three dimensions: chemical space, basis set, and electron correlation treatment. Numerical results have been obtained for atomization energies of a set of $\sim$7'000 organic molecules with up to 7 atoms (not counting hydrogens) containing CHONFClS, as well as for $\sim$6'000 constitutional isomers of C$_7$H$_{10}$O$_2$. CQML learning curves for atomization energies suggest a dramatic reduction in necessary training samples calculated with the most accurate and costly method. In order to generate millisecond estimates of CCSD(T)/cc-pVDZ atomization energies with prediction errors reaching chemical accuracy ($\sim$1 kcal/mol), the CQML model requires only $\sim$100 training instances at the CCSD(T)/cc-pVDZ level, rather than thousands within conventional QML, while more training molecules are required at lower levels. Our results suggest a possibly favourable trade-off between various hierarchical approximations whose computational cost scales differently with electron number.

## No full-text available

... The core idea of MF-ML is hereafter demonstrated by total energy (E) prediction. For brevity, we deal with two levels of theory (the low and high levels are denoted by 0 and 1, respectively) and focus on one flavor of MF-ML, i.e., recursive KRR (r-KRR for short, or MF-KRR) [257], which is similar to its counterpart, recursive GPR (r-GPR, or MF-GPR) [258,259], differing from it only to the extent that KRR differs from GPR. Unlike ∆-ML, MF-ML comprises multiple machines with different labels to learn (two for our exemplified case). ...
... Furthermore, the increasingly expensive training sets must form a strictly nested structure, implying that potentially beneficial correlations between non-nested reference data calculated at different levels of theory are not exploited. To overcome this drawback, Zaspel et al. [257] proposed a multi-level model in 2018, successfully combining ML with sparse grids (SG) [265], a numerical technique widely used to integrate/interpolate high-dimensional functions. ...
... This, however, should always be done with great care. In the original MLGC paper [257], electron correlation levels are reasonably chosen as HF, MP2 and CCSD(T), together with three basis sets, i.e., STO-3G, 6-31G and cc-pVDZ (the number of basis functions increases by a factor of ∼2). ...
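The recursive-KRR construction described in the excerpts above is easy to sketch in code. The following is a minimal two-level toy model, not the implementation of Zaspel et al.: it assumes a one-dimensional surrogate problem, a Laplacian kernel (a common choice in QML), and synthetic stand-ins for the low- and high-level labels; all names and numbers are illustrative.

```python
import numpy as np

def krr_fit(X, y, sigma=1.0, lam=1e-6):
    """Kernel ridge regression with a Laplacian kernel."""
    K = np.exp(-np.abs(X[:, None] - X[None, :]) / sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return (X, alpha, sigma)

def krr_predict(model, Xq):
    X, alpha, sigma = model
    return np.exp(-np.abs(Xq[:, None] - X[None, :]) / sigma) @ alpha

# Synthetic stand-ins for two levels of theory (illustrative only):
f_low  = lambda x: np.sin(x)                        # level 0: cheap, approximate
f_high = lambda x: np.sin(x) + 0.1 * np.sin(5 * x)  # level 1: costly, accurate

X0 = np.linspace(0.0, 6.0, 64)   # many cheap level-0 labels
X1 = X0[::8]                     # nested subset: only 8 costly level-1 labels

m0 = krr_fit(X0, f_low(X0))                          # machine 0 learns level 0
m1 = krr_fit(X1, f_high(X1) - krr_predict(m0, X1))   # machine 1 learns the residual

Xt = np.linspace(0.0, 6.0, 200)
pred_mf = krr_predict(m0, Xt) + krr_predict(m1, Xt)   # recursive (two-level) prediction
pred_hi = krr_predict(krr_fit(X1, f_high(X1)), Xt)    # baseline: costly labels only

mae_mf = np.mean(np.abs(pred_mf - f_high(Xt)))
mae_hi = np.mean(np.abs(pred_hi - f_high(Xt)))
```

Because machine 1 only has to learn the small inter-level residual, the combined model typically beats a model trained on the eight expensive labels alone; the nested design (X1 a subset of X0) is exactly the constraint that the multi-level combination technique relaxes.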
Preprint
Full-text available
Chemical compound space (CCS), the set of all theoretically conceivable combinations of chemical elements and (meta-)stable geometries that make up matter, is colossal. The virtual exploration of this space for the design and discovery of novel molecules and materials exhibiting desirable properties is therefore generally prohibitive for all but the smallest sub-sets and simplest properties, and typically relies heavily on access to substantial allocations on modern high-performance computing hardware. We review studies aimed at tackling this challenge using modern machine learning techniques based on (i) synthetic data generated using quantum mechanics based methods and (ii) model architectures inspired by quantum mechanics. Such Quantum based Machine Learning (QML) approaches combine the advantages of a first principles view on matter, i.e. reflecting properly the underlying physics which guarantees universality and transferability of models across all of CCS, with the numerical efficiency of statistical surrogate models. While state-of-the-art approximations to quantum problems impose severe computational bottlenecks, recent QML based developments indicate the possibility of substantial acceleration without sacrificing the rigour and reliability of a physics based understanding of trends and relationships throughout CCS.
... The core idea of MF-ML is hereafter demonstrated by total energy (E) prediction. For brevity, we deal with two levels of theory (the low and high levels are denoted by 0 and 1, respectively) and focus on one flavor of MF-ML, i.e., recursive KRR (r-KRR for short, or MF-KRR), 266 which is similar to its counterpart, recursive GPR (r-GPR, or MF-GPR), 267,268 differing from it only to the extent that KRR differs from GPR. Unlike Δ-ML, MF-ML comprises multiple machines with different labels to learn (two for our exemplified case). ...
... Furthermore, the increasingly expensive training sets must form a strictly nested structure, implying that potentially beneficial correlations between non-nested reference data calculated at different levels of theory are not exploited. To overcome this drawback, Zaspel et al. 266 proposed a multilevel model in 2018, successfully combining ML with sparse grids (SG), 274 a numerical technique widely used to integrate/interpolate high-dimensional functions. ...
... To incorporate it within the ML framework, one extra variable has to be introduced, i.e., the training set ($x_N$), the size of which indicates the magnitude of $x_N$. 266 Accordingly ...
Article
Full-text available
Chemical compound space (CCS), the set of all theoretically conceivable combinations of chemical elements and (meta-)stable geometries that make up matter, is colossal. The first-principles based virtual sampling of this space, for example, in search of novel molecules or materials which exhibit desirable properties, is therefore prohibitive for all but the smallest subsets and simplest properties. We review studies aimed at tackling this challenge using modern machine learning techniques based on (i) synthetic data, typically generated using quantum mechanics based methods, and (ii) model architectures inspired by quantum mechanics. Such Quantum mechanics based Machine Learning (QML) approaches combine the numerical efficiency of statistical surrogate models with an ab initio view on matter. They rigorously reflect the underlying physics in order to reach universality and transferability across CCS. While state-of-the-art approximations to quantum problems impose severe computational bottlenecks, recent QML based developments indicate the possibility of substantial acceleration without sacrificing the predictive power of quantum mechanics.
... The previous analysis has been applied to computational chemistry methods, but to our knowledge, the error distributions of machine learning (ML) algorithms have not been scrutinized for their ability to deliver a reliable prediction uncertainty, and the general use of the MAE as a benchmark statistic for ML methods [5,6] has to be evaluated. A problem arises notably when comparing methods with different error distribution shapes, as MAE-based ranking might become arbitrary, obscuring important considerations about the risk of large errors for some of the methods [2,7]. ...
... A problem arises notably when comparing methods with different error distribution shapes, as MAE-based ranking might become arbitrary, obscuring important considerations about the risk of large errors for some of the methods [2,7]. This is the main topic of the present paper, where we analyze the prediction errors for effective atomization energies of QM7b molecules calculated at the level of theory CCSD(T)/cc-pVDZ by the kernel ridge regression with Coulomb matrix (CM) and Spectrum of London and Axilrod-Teller-Muto potential (SLATM) representations and L2 distance metric [6]. The ML error distributions are compared with the ones obtained from computational chemistry methods (HF and MP2) on the same reference dataset. ...
... Several statistical trend correction methods have been proposed in the computational chemistry literature, from the simple scaling of the calculated values [12,13], or linear corrections [14,15,1,4,16], to more complex, ML-based corrections, such as ∆-ML [17,18,6] or Gaussian Processes [19]. ...
Article
Full-text available
Quantum machine learning models have been gaining significant traction within atomistic simulation communities. Conventionally, relative model performances are being assessed and compared using learning curves (prediction error vs. training set size). This article illustrates the limitations of using the Mean Absolute Error (MAE) for benchmarking, which is particularly relevant in the case of non-normal error distributions. We analyze more specifically the prediction error distribution of the kernel ridge regression with SLATM representation and L2 distance metric (KRR-SLATM-L2) for effective atomization energies of QM7b molecules calculated at the level of theory CCSD(T)/cc-pVDZ. Error distributions of HF and MP2 at the same basis set referenced to CCSD(T) values were also assessed and compared to the KRR model. We show that the true performance of the KRR-SLATM-L2 method over the QM7b dataset is poorly assessed by the Mean Absolute Error, and can be notably improved after adaptation of the learning set.
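The abstract's point about the MAE and non-normal error distributions can be illustrated with a deliberately constructed example (synthetic numbers, not data from the paper): two error sets with identical MAE but very different tail behavior.

```python
import numpy as np

# Two synthetic absolute-error samples, constructed to share the same MAE:
errs_uniform = np.full(1000, 1.0)                                    # every error is 1.0
errs_heavy = np.concatenate([np.full(950, 0.5), np.full(50, 10.5)])  # 5% large errors

mae_u = np.abs(errs_uniform).mean()   # 1.0
mae_h = np.abs(errs_heavy).mean()     # (950*0.5 + 50*10.5)/1000 = 1.0 as well

# A tail-sensitive statistic separates the two distributions immediately:
q99_u = np.percentile(np.abs(errs_uniform), 99)   # 1.0
q99_h = np.percentile(np.abs(errs_heavy), 99)     # 10.5
```

An MAE-based ranking would call these two "models" equivalent, even though one of them is wrong by an order of magnitude for 5% of its predictions, which is precisely the risk of large errors that MAE-based benchmarking can hide.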
... The focus of this Perspective is to examine the datasets typically seen in these two areas. For this purpose we consider several of the MD17 datasets 7,8 and ones from our own work for the same molecules. ...
... The data from MD17, obtained from DFT direct-dynamics runs at 500 K, are labeled using that term. 7,8 This approach, i.e., using DFT direct-dynamics at thermal energies, perhaps as high as 1000 K, is commonly used in the field to generate data for MLPs of a given molecule. We also use direct-dynamics as one means of generating configurations; however, at a number of total energies, including high energies. ...
Preprint
Full-text available
There has been great progress in developing methods for machine-learned potential energy surfaces. There have also been important assessments of these methods by comparing so-called learning curves on datasets of electronic energies and forces, notably the MD17 database. The dataset for each molecule in this database generally consists of tens of thousands of energies and forces obtained from DFT direct dynamics at 500 K. We contrast the datasets from this database for three "small" molecules, ethanol, malonaldehyde, and glycine, with datasets we have generated with specific targets for the PESs in mind: a rigorous calculation of the zero-point energy and wavefunction, the tunneling splitting in malonaldehyde, and, in the case of glycine, a description of all eight low-lying conformers. We found that the MD17 datasets are too limited for these targets. We also examine recent datasets for several PESs that describe small-molecule but complex chemical reactions. Finally, we introduce a new database, "QM-22", which contains datasets of molecules ranging from 4 to 15 atoms that extend to high energies and a large span of configurations.
... 9 Δ-ML, which is of direct relevance to the present paper, seeks to add a correction to a property obtained using an efficient and thus perforce low-level ab initio theory. [10][11][12][13][14][15] This approach includes an interesting, recent variant based on a "Pople" style composite approach. 11 In this sense, the approach is related, in spirit at least, to the correction potential approach mentioned above, when the property is the PES. ...
... [10][11][12][13][14][15] This approach includes an interesting, recent variant based on a "Pople" style composite approach. 11 In this sense, the approach is related, in spirit at least, to the correction potential approach mentioned above, when the property is the PES. However, it is applicable to much larger molecules. ...
Article
Full-text available
“Δ-machine learning” refers to a machine learning approach to bring a property such as a potential energy surface (PES) based on low-level (LL) density functional theory (DFT) energies and gradients close to a coupled cluster (CC) level of accuracy. Here, we present such an approach that uses the permutationally invariant polynomial (PIP) method to fit high-dimensional PESs. The approach is represented by a simple equation, in obvious notation $V_{\mathrm{LL}\rightarrow\mathrm{CC}} = V_{\mathrm{LL}} + \Delta V_{\mathrm{CC-LL}}$, and demonstrated for CH$_4$, H$_3$O$^+$, and trans- and cis-N-methyl acetamide (NMA), CH$_3$CONHCH$_3$. For these molecules, the LL PES, $V_{\mathrm{LL}}$, is a PIP fit to DFT/B3LYP/6-31+G(d) energies and gradients, and $\Delta V_{\mathrm{CC-LL}}$ is a precise PIP fit obtained using a low-order PIP basis set and based on a relatively small number of CCSD(T) energies. For CH$_4$, these are new calculations adopting an aug-cc-pVDZ basis; for H$_3$O$^+$, previous CCSD(T)-F12/aug-cc-pVQZ energies are used, while for NMA, new CCSD(T)-F12/aug-cc-pVDZ calculations are performed. With as few as 200 CCSD(T) energies, the new PESs are in excellent agreement with benchmark CCSD(T) results for the small molecules, and for 12-atom NMA, training is done with 4696 CCSD(T) energies.
... 82 It was also shown that combining several Δ-ML models achieves better performance and lowers the computational cost of generating the training data. 83,84 This fact has yet to be fully exploited in the construction of molecular ML PESs, and to the best of our knowledge, no such procedure has been devised to find the optimal training data. Furthermore, practical research usually requires knowledge about the choice of QC levels of theory, training set geometries, and sizes before generating the computationally intensive reference data. ...
... It is known that the number of computationally expensive high-level QC calculations can be greatly reduced by combining several Δ-ML models, some of which are trained on many more low-level QC data. 83,84 However, the choice of the optimal number of training points for each constituent Δ-ML model is not trivial, especially for a large number of the Δ-ML models. To the best of our knowledge, until now, no procedure was suggested to determine the training set sizes ahead of time. ...
Article
We present hierarchical machine learning (hML) of highly accurate potential energy surfaces (PESs). Our scheme is based on adding predictions of multiple Δ-machine learning models trained on energies and energy corrections calculated with a hierarchy of quantum chemical methods. Our (semi-)automatic procedure determines the optimal training set size and composition of each constituent machine learning model, simultaneously minimizing the computational effort necessary to achieve the required accuracy of the hML PES. Machine learning models are built using kernel ridge regression, and training points are selected with structure-based sampling. As an illustrative example, hML is applied to a high-level ab initio CH3Cl PES and is shown to significantly reduce the computational cost of generating the PES by a factor of 100 while retaining similar levels of accuracy (errors of ∼1 cm⁻¹).
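The hierarchy of Δ-models can be sketched with a toy three-level example. This is a hypothetical illustration, not the hML procedure itself: a Laplacian-kernel KRR, one-dimensional synthetic "levels of theory", and arbitrarily chosen training-set sizes (64/16/4) standing in for the optimized composition that hML determines automatically.

```python
import numpy as np

def krr_fit(X, y, sigma=1.0, lam=1e-6):
    K = np.exp(-np.abs(X[:, None] - X[None, :]) / sigma)  # Laplacian kernel
    return (X, np.linalg.solve(K + lam * np.eye(len(X)), y), sigma)

def krr_predict(model, Xq):
    X, alpha, sigma = model
    return np.exp(-np.abs(Xq[:, None] - X[None, :]) / sigma) @ alpha

# Toy hierarchy: a cheap baseline plus two increasingly small corrections,
# loosely mimicking e.g. HF -> MP2 -> CCSD(T) (illustrative functions only).
baseline = lambda x: np.sin(x)              # cheap level
corr_1   = lambda x: 0.10 * np.sin(5 * x)   # mid-level correction (e.g. MP2 - HF)
corr_2   = lambda x: 0.02 * np.sin(9 * x)   # top-level correction (e.g. CCSD(T) - MP2)
target   = lambda x: baseline(x) + corr_1(x) + corr_2(x)

X = np.linspace(0.0, 6.0, 64)
m_base = krr_fit(X,       baseline(X))      # 64 cheap labels
m_c1   = krr_fit(X[::4],  corr_1(X[::4]))   # 16 mid-level corrections
m_c2   = krr_fit(X[::16], corr_2(X[::16]))  #  4 top-level corrections

Xt = np.linspace(0.0, 6.0, 200)
pred_h = krr_predict(m_base, Xt) + krr_predict(m_c1, Xt) + krr_predict(m_c2, Xt)
pred_d = krr_predict(krr_fit(X[::16], target(X[::16])), Xt)  # 4 top-level labels only

mae_h = np.mean(np.abs(pred_h - target(Xt)))
mae_d = np.mean(np.abs(pred_d - target(Xt)))
```

Summing the Δ-models concentrates the scarce top-level data on the smallest correction, which is why the hierarchical prediction typically outperforms spending the same four expensive labels on a direct fit of the full target.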
... These schemes are also becoming a target of recent work using ML methods. 135 HF determinants provide good baseline approximations of the ground state electronic structure of many molecules, but they may poorly describe the more complicated bonding that arises during bond dissociation events, excited states, and conical intersections. 136−139 Some many-body wavefunctions are best described as a superposition of two or more configurations, for example, when other configurations in eq 7 can have similar or higher expansion coefficients a than the HF determinant. ...
... 149 This is an area though where ML can bring progress in automating the selections of physically justified active spaces. 129 In closing, there are a large number of available correlated wavefunction methods but many are even more costly than HF theory by virtue of requiring an HF reference energy expression shown in eq 5. Figure 5a depicts a so-called "magic cube" (that is an extension beyond a traditional "Pople diagram" 135,150 ) that concisely shows a full hierarchy of computational approaches across different Hamiltonians, basis sets, and correlation treatment methods. This makes it easy to identify different wavefunction methods that should be more accurate and more likely to provide useful atomic scale insights (as well as those that would be more computationally intensive). ...
Article
Full-text available
Machine learning models are poised to make a transformative impact on chemical sciences by dramatically accelerating computational algorithms and amplifying insights available from computational chemistry methods. However, achieving this requires a confluence and coaction of expertise in computer science and physical sciences. This Review is written for new and experienced researchers working at the intersection of both fields. We first provide concise tutorials of computational chemistry and machine learning methods, showing how insights involving both can be achieved. We follow with a critical review of noteworthy applications that demonstrate how computational chemistry and machine learning can be used together to provide insightful (and useful) predictions in molecular and materials modeling, retrosyntheses, catalysis, and drug design.
... The effective atomization energies (E*) for the QM7b dataset, 24 comprising 7211 molecules with up to seven heavy atoms (C, N, O, S, or Cl), are available for several basis sets (STO-3G, 6-31G, and cc-pVDZ), three quantum chemistry methods [HF, MP2, and CCSD(T)], and four machine learning algorithms (CM-L1, CM-L2, SLATM-L1, and SLATM-L2). The data have been provided on request by the authors of Zaspel et al. 12 The machine learning methods have been trained on a random sample of 1000 CCSD(T) energies (learning set), and the test set contains the prediction errors for the 6211 remaining systems. 12 We retain here only HF, MP2, and SLATM-L2 and compare their ability to predict CCSD(T) values. ...
Article
In Paper I [P. Pernot and A. Savin, J. Chem. Phys. 152, 164108 (2020)], we introduced the systematic improvement probability as a tool to assess the level of improvement on absolute errors to be expected when switching between two computational chemistry methods. We also developed two indicators based on robust statistics to address the uncertainty of ranking in computational chemistry benchmarks: Pinv, the inversion probability between two values of a statistic, and Pr, the ranking probability matrix. In this second part, these indicators are applied to nine data sets extracted from the recent benchmarking literature. We also illustrate how the correlation between the error sets might contain useful information on the benchmark dataset quality, notably when experimental data are used as reference.
... Examples of this philosophy include using transfer learning techniques on a neural network trained with abundant but inaccurate data to re-train it with accurate but scarce data 9,10 and using machine learning to automatically tune parameters of a semiempirical calculation. 11 A more straightforward approach is to use ∆-machine learning [12][13][14] (∆-ML), i.e. use machine learning to predict the difference between a quantity and its estimate from a relatively inexpensive calculation; this is the simplest example of a multifidelity information fusion approach. 15 ∆-ML approaches can be further improved by incorporating additional features from the baseline calculations (e.g. ...
... The idea of ∆-ML methods 12 is to choose $p_{\mathrm{approx}}(q)$ such that the error of estimating $p(q) - p_{\mathrm{approx}}(q)$ is smaller than the error of estimating $p(q)$ for a given $N_{\mathrm{train}}$; this approach has more sophisticated generalizations for cases when several approximations of differing cost and accuracy are available. 13 A natural extension of the concept is to use byproducts of calculating $p_{\mathrm{approx}}$ to define a representation of compound $q$ that would reflect not just the compound's features, but also the physical intuition behind the property $p$. The general idea of the method proposed in this work is to define representations for localized orbitals obtained from a Hartree-Fock calculation and then define the kernel function in terms of these orbital representations obtained from the ground state or excited state calculations. ...
Preprint
We introduce an electronic structure based representation for quantum machine learning (QML) of electronic properties throughout chemical compound space. The representation is constructed using computationally inexpensive ab initio calculations and explicitly accounts for changes in the electronic structure. We demonstrate the accuracy and flexibility of resulting QML models when applied to property labels such as total potential energy, HOMO and LUMO energies, ionization potential, and electron affinity, using as data sets for training and testing entries from the QM7b, QM7b-T, QM9, and LIBE libraries. For the latter, we also demonstrate the ability of this approach to account for molecular species of different charge and spin multiplicity, resulting in QML models that infer total potential energies based on geometry, charge, and spin as input.
... The better the correlation between the levels of theory, the easier it is to learn the difference between them. In a more generalized version of this method called Multilevel-ML [77], one can exploit the correlations between more than 2 levels of theory and basis sets to improve predictions. In this work, we combine the SML method with ∆-ML using data from the QM7b dataset, namely the ZINDO energies as baseline, and the GW energies as target. ...
Preprint
Full-text available
Quantum Machine Learning (QML) models of molecular HOMO-LUMO-gaps often struggle to achieve satisfying data-efficiency as measured by decreasing prediction errors for increasing training set sizes. Partitioning training sets of organic molecules (QM7 and QM9-data-sets) into three classes [systems containing either aromatic rings and carbonyl groups, or single unsaturated bonds, or saturated bonds] prior to training results in independently trained QML models with improved learning rates. The selected QML models of band-gaps (at GW, B3LYP, and ZINDO level of theory) reach mean absolute prediction errors of $\sim$0.1 eV for up to an order of magnitude fewer training molecules than for conventionally trained models. Direct comparison to $\Delta$-QML models of band-gaps suggests that selected QML is substantially more data-efficient. The findings suggest that selected QML, e.g. based on simple classifications prior to training, could help to successfully tackle challenging quantum property screening tasks of large libraries with high fidelity and low computational burden.
... The effective atomization energies (EAE) for the QM7b dataset [16], for molecules with up to seven heavy atoms (C, N, O, S, and Cl), are taken from the study by Zaspel et al. [17]. We consider here values for the cc-pVDZ basis set, and the prediction error for 6211 systems for the SCF, MP2, and machine-learning (SLATM-L2) methods with respect to CCSD(T) values as analyzed by Pernot et al. [18]. ...
Article
Full-text available
Confirming the result of a calculation by a calculation with a different method is often seen as a validity check. However, when the methods considered are all subject to the same (systematic) errors, this practice fails. Using a statistical approach, we define measures for reliability and similarity, and we explore the extent to which the similarity of results can help improve our judgment of the validity of data. This method is illustrated on synthetic data and applied to two benchmark datasets extracted from the literature: band gaps of solids estimated by various density functional approximations, and effective atomization energies estimated by ab initio and machine-learning methods. Depending on the levels of bias and correlation of the datasets, we found that similarity may provide a null-to-marginal improvement in reliability and was mostly effective in eliminating large errors.
... In order to connect to the predominant body of DFT literature on small organic molecules, as well as to the Gn-series of composite quantum chemistry methods developed by Pople, Curtiss, and co-workers [19][20][21], we have consistently opted for B3LYP/cc-pVTZ as the level of theory for all structures and properties. While the shortcomings of common approximations to the exchange-correlation potential in DFT are well known, we note that the fragmentation itself is independent of the electronic structure method, and that it is straightforward to augment and improve upon this level in future studies, e.g. through the use of multi-level grid combination techniques 22 . Furthermore, due to their modest size, all AMONs are sufficiently small to remain amenable to more accurate methods, such as CCSD(T)-F12 in a large basis set. ...
Preprint
We present all **A**mons for **G**DB and **Z**inc databases using no more than 7 non-hydrogen atoms (AGZ7): a calculated organic chemistry building-block dictionary based on the AMON approach [Huang and von Lilienfeld, *Nature Chemistry* (2020)]. AGZ7 records Cartesian coordinates of compositional and constitutional isomers, as well as properties for $\sim$140k small organic molecules obtained by systematically fragmenting all molecules of Zinc and the majority of GDB17 into smaller entities, saturating with hydrogens, and containing no more than 7 heavy atoms (excluding hydrogen atoms). AGZ7 covers the elements {H, B, C, N, O, F, Si, P, S, Cl, Br, Sn and I} and includes optimized geometries, total energy and its decomposition, Mulliken atomic charges, dipole moment vectors, quadrupole tensors, electronic spatial extent, eigenvalues of all occupied orbitals, LUMO, gap, isotropic polarizability, harmonic frequencies, reduced masses, force constants, IR intensity, normal coordinates, rotational constants, zero-point energy, internal energy, enthalpy, entropy, free energy, and heat capacity (all at ambient conditions) using the B3LYP/cc-pVTZ level of theory (pseudopotentials were used for Sn and I). We exemplify the usefulness of this data set with AMON-based machine learning models of total potential energy predictions for seven of the most rigid GDB-17 molecules.
... For instance, empirical trends, simple group-contribution methods and computationally demanding quantum mechanical simulations can generate this low-fidelity (LF) data. Given such a situation, a multi-fidelity (MF) information fusion model aims to consolidate all the available information from the varying fidelity sources to make the most accurate and confident property predictions at the highest level of fidelity [47,48,[112][113][114][115][116]. ...
Preprint
Full-text available
Artificial intelligence (AI) based approaches are beginning to impact several domains of human life, science and technology. Polymer informatics is one such domain where AI and machine learning (ML) tools are being used in the efficient development, design and discovery of polymers. Surrogate models are trained on available polymer data for instant property prediction, allowing screening of promising polymer candidates with specific target property requirements. Questions regarding synthesizability, and potential (retro)synthesis steps to create a target polymer, are being explored using statistical means. Data-driven strategies to tackle unique challenges resulting from the extraordinary chemical and physical diversity of polymers at small and large scales are being explored. Other major hurdles for polymer informatics are the lack of widespread availability of curated and organized data, and approaches to create machine-readable representations that capture not just the structure of complex polymeric situations but also synthesis and processing conditions. Methods to solve inverse problems, wherein polymer recommendations are made using advanced AI algorithms that meet application targets, are being investigated. As various parts of the burgeoning polymer informatics ecosystem mature and become integrated, efficiency improvements, accelerated discoveries and increased productivity can result. Here, we review emergent components of this polymer informatics ecosystem and discuss imminent challenges and opportunities.
... Moreover, there are several properties for which the reference data are rather sparse, leading to small datasets. Another trend, enhanced by the development of machine learning, is to replace experimental values by gold-standard calculations, with limitations on the size of accessible systems 7,8 . As the estimated values of the statistics and their uncertainties depend on the size of the dataset, it is important to assess this size effect and its impact on statistics comparison and ranking. ...
... For instance, empirical trends, simple group-contribution methods and computationally demanding quantum mechanical simulations can generate this low-fidelity (LF) data. Given such a situation, a multifidelity (MF) information fusion model aims to consolidate all the available information from the varying fidelity sources to make the most accurate and confident property predictions at the highest level of fidelity [47,48,112-116]. Comparative studies have shown that the multi-fidelity approach performs better than any single-fidelity based method in terms of prediction accuracy, especially for small (high-fidelity) data sets. ...
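One common flavor of multi-fidelity information fusion feeds the low-fidelity prediction to the high-fidelity model as an extra input feature. The following is a minimal sketch of that idea on synthetic 1-D data with a Gaussian-kernel kernel ridge regression; the function names, toy functions, and hyperparameters are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

def krr_train(X, y, gamma=1.0, lam=1e-6):
    """Kernel ridge regression with a Gaussian kernel; returns a predictor."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    alpha = np.linalg.solve(np.exp(-gamma * sq) + lam * np.eye(len(X)), y)
    def predict(Xq):
        sq_q = np.sum((Xq[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * sq_q) @ alpha
    return predict

rng = np.random.default_rng(0)
# Many cheap low-fidelity (LF) labels, few expensive high-fidelity (HF) ones.
X_lf = rng.uniform(-1, 1, size=(200, 1))
y_lf = np.sin(3 * X_lf[:, 0])
X_hf = rng.uniform(-1, 1, size=(20, 1))
y_hf = np.sin(3 * X_hf[:, 0]) + 0.3 * X_hf[:, 0] + 0.1   # shifted "true" level

lf_model = krr_train(X_lf, y_lf)
# Fusion step: the LF prediction becomes an extra feature of the HF model.
X_hf_aug = np.hstack([X_hf, lf_model(X_hf)[:, None]])
hf_model = krr_train(X_hf_aug, y_hf)

X_test = np.linspace(-1, 1, 50)[:, None]
y_true = np.sin(3 * X_test[:, 0]) + 0.3 * X_test[:, 0] + 0.1
X_test_aug = np.hstack([X_test, lf_model(X_test)[:, None]])
mae = float(np.mean(np.abs(hf_model(X_test_aug) - y_true)))
```

With only 20 high-fidelity points, the augmented model inherits the shape of the target from the 200 cheap points and only has to learn the smooth level-to-level shift.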
Article
Artificial intelligence (AI) based approaches are beginning to impact several domains of human life, science and technology. Polymer informatics is one such domain where AI and machine learning (ML) tools are being used in the efficient development, design and discovery of polymers. Surrogate models are trained on available polymer data for instant property prediction, allowing screening of promising polymer candidates with specific target property requirements. Questions regarding synthesizability, and potential (retro)synthesis steps to create a target polymer, are being explored using statistical means. Data-driven strategies to tackle unique challenges resulting from the extraordinary chemical and physical diversity of polymers at small and large scales are being explored. Other major hurdles for polymer informatics are the lack of widespread availability of curated and organized data, and approaches to create machine-readable representations that capture not just the structure of complex polymeric situations but also synthesis and processing conditions. Methods to solve inverse problems, wherein polymer recommendations are made using advanced AI algorithms that meet application targets, are being investigated. As various parts of the burgeoning polymer informatics ecosystem mature and become integrated, efficiency improvements, accelerated discoveries and increased productivity can result. Here, we review emergent components of this polymer informatics ecosystem and discuss imminent challenges and opportunities.
... Another trend, enhanced by the development of machine learning, is to replace experimental values by gold standard calculations, with limitations on the size of accessible systems. 7,8 As the estimated values of the statistics and their uncertainties depend on the size of the dataset, it is important to assess this size effect and its impact on statistics comparison and ranking. ...
Article
The comparison of benchmark error sets is an essential tool for the evaluation of theories in computational chemistry. The standard ranking of methods by their mean unsigned error is unsatisfactory for several reasons linked to the non-normality of the error distributions and the presence of underlying trends. Complementary statistics have recently been proposed to palliate such deficiencies, such as quantiles of the absolute error distribution or the mean prediction uncertainty. We introduce here a new score, the systematic improvement probability, based on the direct system-wise comparison of absolute errors. Independent of the chosen scoring rule, the uncertainty of the statistics due to the incompleteness of the benchmark datasets is also generally overlooked. However, this uncertainty is essential to appreciate the robustness of rankings. In the present article, we develop two indicators based on robust statistics to address this problem: Pinv, the inversion probability between two values of a statistic, and Pr, the ranking probability matrix. We also demonstrate the essential contribution of the correlations between error sets in these score comparisons.
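The idea behind an inversion probability such as Pinv can be illustrated with a paired bootstrap over benchmark systems: resample the systems, recompute both methods' mean unsigned errors (MUE), and count how often the nominal ranking flips. This is a simplified stdlib-only sketch of the concept, not the authors' exact estimator, and the error sets are invented for illustration.

```python
import random

def mue(errors):
    """Mean unsigned error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

def inversion_probability(err_a, err_b, n_boot=2000, seed=1):
    """Paired-bootstrap estimate of the probability that the MUE ranking of
    two methods flips when the benchmark systems are resampled."""
    random.seed(seed)
    n = len(err_a)
    base = mue(err_a) < mue(err_b)          # nominal ranking: A better than B
    flips = 0
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]   # keep system pairing
        if (mue([err_a[i] for i in idx]) < mue([err_b[i] for i in idx])) != base:
            flips += 1
    return flips / n_boot

# Clearly separated error sets: the ranking survives any resampling.
a = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.15, 0.1]
b = [1.0, -1.2, 0.9, 1.1, -0.8, 1.3, -1.0, 0.9]
p_robust = inversion_probability(a, b)
# Nearly identical error sets: the ranking is fragile, Pinv far from zero.
c = [x + 0.01 for x in b]
p_fragile = inversion_probability(b, c)
```

Resampling the systems jointly (the same indices for both methods) is what preserves the correlation between error sets that the abstract stresses; resampling them independently would overstate the ranking uncertainty.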
... Further, these efforts have been limited to specific properties of single structure prototypes. 18,19 Similarly, transfer learning and Δ-learning 20 are either two-fidelity approaches or non-trivial 21 to extend to more than two fidelities. Multi-task neural network models 22 can handle multi-fidelity data and scale linearly with the number of data fidelities, but require homogeneous data that have all properties labeled for all the data, which is rarely the case in materials property data sets. ...
Article
Full-text available
Predicting the properties of a material from the arrangement of its atoms is a fundamental goal in materials science. While machine learning has emerged in recent years as a new paradigm to provide rapid predictions of materials properties, their practical utility is limited by the scarcity of high-fidelity data. Here, we develop multi-fidelity graph networks as a universal approach to achieve accurate predictions of materials properties with small data sizes. As a proof of concept, we show that the inclusion of low-fidelity Perdew–Burke–Ernzerhof band gaps greatly enhances the resolution of latent structural features in materials graphs, leading to a 22–45% decrease in the mean absolute errors of experimental band gap predictions. We further demonstrate that learned elemental embeddings in materials graph networks provide a natural approach to model disorder in materials, addressing a fundamental gap in the computational prediction of materials properties.
... subsequently formalized and extended in multiple dimensions using the sparse grid combination technique, which combines models trained on different subspaces (e.g., combination of basis set size and correlation level) such that only a few samples are needed on the highest, target, level of accuracy. 574 A different multifidelity learning approach, known as cokriging, can combine low- and high-fidelity training data to predict properties at the highest fidelity level without using the low-fidelity data as features or as a baseline. This technique was used by Pilania et al. to predict band gaps of elpasolites at the hybrid functional level of theory using a training set of properties at both the GGA and hybrid functional levels. ...
Article
By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal–organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also creates new challenges. We simply have too many materials to be processed using conventional, brute force, methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies. In the second part, we review how the different approaches of machine learning have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. Given the increasing interest of the scientific community in machine learning, we expect this list to rapidly expand in the coming years.
... Computational chemistry is naturally a sub-field that has been increasingly boosted by the advances and unique capabilities of ML (Ramakrishnan et al., 2014, 2015; Dral et al., 2015; Sánchez-Lengeling and Aspuru-Guzik, 2017; Christensen et al., 2019; Iype and Urolagin, 2019; Zaspel et al., 2019). ...
... Furthermore, atomistic details (geometries) are often lacking in the case of experimental data, while the level of theory used in theoretical studies can often no longer be considered state of the art. While it is possible to merge reaction data from different sources or to learn their respective differences in the potential energy surface by means of Delta machine learning (∆-ML) [36], multi-fidelity machine learning models [37], the multi-level combination grid technique [38] or transfer learning [39], the resulting multilevel approaches require at least part of the data to be evaluated at multiple levels of theory. Thus, there is a considerable need for one large, consistent data set which could subsequently be used as a basis for multilevel machine learning models and their application in reaction design. ...
... Techniques based on the correlation between high-level and low-level methods are not rare in theoretical chemistry. [18-22] For example, higher-level correlation contribution corrections are added with smaller basis sets in the calculations of weak interactions. 23 Here, this type of idea is extended into the statistical treatment of large ensembles. ...
Preprint
Full-text available
Nuclear densities are frequently represented by an ensemble of nuclear configurations or points in the phase space in various contexts of molecular simulations. The size of the ensemble directly affects the accuracy and computational cost of subsequent calculations of observable quantities. In the present work, we address the question of how many configurations we need and how to select them most efficiently. We focus on the nuclear ensemble method in the context of electronic spectroscopy, where thousands of sampled configurations are usually needed for sufficiently converged spectra. The proposed representative sampling technique allows for a dramatic reduction of the sample size. By using an exploratory method, we model the density from a large sample in the space of transition properties. The representative subset of nuclear configurations is optimized by minimizing its Kullback-Leibler divergence to the full density with simulated annealing. High-level calculations are then performed only for the selected subset of configurations. We tested the algorithm on electronic absorption spectra of three molecules: (E)-azobenzene, the simplest Criegee intermediate, and hydrated nitrate anion. Typically, dozens of nuclear configurations provided sufficiently accurate spectra. A strongly forbidden transition of the nitrate anion presented the most challenging case due to rare geometries with disproportionately high transition intensities. This problematic case was easily diagnosed within the present approach. We also discuss various exploratory methods and a possible extension to dynamical simulations.
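The core loop of such representative sampling, minimizing a Kullback-Leibler divergence between a small subset and the full density by simulated annealing, can be sketched in a few lines. This is a deliberately simplified 1-D, histogram-based toy (Gaussian stand-in for the transition-property density; all names and settings are my own assumptions), not the published algorithm.

```python
import math
import random

def histogram(values, edges):
    """Normalized histogram over fixed bin edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for j in range(len(edges) - 1):
            if edges[j] <= v < edges[j + 1]:
                counts[j] += 1
                break
    n = len(values)
    return [c / n for c in counts]

def kl(p, q, eps=1e-9):
    """Kullback-Leibler divergence with a small regularizer for empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_representative(sample, k, edges, steps=4000, t0=0.1, seed=2):
    """Anneal a k-point subset so its histogram matches the full density."""
    random.seed(seed)
    full = histogram(sample, edges)
    idx = random.sample(range(len(sample)), k)
    cost = kl(full, histogram([sample[i] for i in idx], edges))
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9          # linear cooling schedule
        trial = idx[:]
        cand = random.randrange(len(sample))
        if cand in trial:
            continue
        trial[random.randrange(k)] = cand           # swap one configuration
        c = kl(full, histogram([sample[i] for i in trial], edges))
        if c < cost or random.random() < math.exp((cost - c) / t):
            idx, cost = trial, c                    # Metropolis acceptance
    return idx, cost

random.seed(0)
props = [random.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in transition property
edges = [-4.0 + 0.8 * j for j in range(11)]            # 10 bins on [-4, 4]
subset, cost = select_representative(props, 20, edges)
naive = kl(histogram(props, edges), histogram(props[:20], edges))
```

The annealed 20-point subset reproduces the full histogram far better than an arbitrary 20-point slice, which is the mechanism by which high-level calculations can then be restricted to a few dozen configurations.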
... To circumvent the problem of requiring large high-accuracy data sets, Δ-ML aims to predict the highly accurate target property at the same cost of the computationally cheaper methods, which is often referred to as the baseline property. 45,46 This approach is typically more data-efficient than direct ML, since the computationally expensive high-accuracy simulations are needed only for a considerably smaller subset to obtain a certain predictive power. 23,44 The accurate target property is labeled as p_t and is obtained by ...
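The Δ-ML construction described above, learning only the target-minus-baseline correction and adding it back to the cheap baseline at query time, can be sketched as follows. The kernel ridge regression, the toy harmonic baseline and anharmonic target, and all names are illustrative assumptions for this sketch.

```python
import numpy as np

def delta_ml_predict(x_train, y_base_train, y_target_train,
                     x_query, y_base_query, gamma=0.5, lam=1e-6):
    """Learn the target-minus-baseline correction with KRR, then add it to
    the cheap baseline value of each query: p_t ~ p_b + Delta-model(x)."""
    delta = y_target_train - y_base_train
    K = np.exp(-gamma * (x_train[:, None] - x_train[None, :]) ** 2)
    alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), delta)
    Kq = np.exp(-gamma * (x_query[:, None] - x_train[None, :]) ** 2)
    return y_base_query + Kq @ alpha

# Toy 1-D system: "baseline" harmonic energy; the "target" adds a smooth
# anharmonic term that the Delta-model must learn from only 15 points.
base = lambda x: 0.5 * x ** 2
target = lambda x: 0.5 * x ** 2 + 0.1 * np.sin(2.0 * x)

x_tr = np.linspace(-2.0, 2.0, 15)
x_q = np.linspace(-2.0, 2.0, 100)
pred = delta_ml_predict(x_tr, base(x_tr), target(x_tr), x_q, base(x_q))
mae = float(np.mean(np.abs(pred - target(x_q))))
```

Because the correction is much smoother than the total property, a handful of expensive labels suffices; the price is that the baseline must still be evaluated for every query.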
Article
Full-text available
We present a Δ-machine learning approach for the prediction of GW quasiparticle energies (ΔMLQP) and photoelectron spectra of molecules and clusters, using orbital-sensitive representations (OSRs) based on molecular Cartesian coordinates in kernel ridge regression-based supervised learning. Coulomb matrix, bag-of-bond, and bond-angle-torsion representations are made orbital-sensitive by augmenting them with atom-centered orbital charges and Kohn–Sham orbital energies, both of which are readily available from baseline calculations at the level of density functional theory (DFT). We first illustrate the effects of different constructions of the OSRs on the prediction of frontier orbital energies of 22k molecules of the QM8 data set and show that it is possible to predict the full photoelectron spectrum of molecules within the data set using a single model with a mean absolute error below 0.1 eV. We further demonstrate that the OSR-based ΔMLQP captures the effects of intra- and intermolecular conformations in application to water monomers and dimers. Finally, we show that the approach can be embedded in multiscale simulation workflows, by studying the solvatochromic shifts of quasiparticle and electron–hole excitation energies of solvated acetone in a setup combining molecular dynamics, DFT, the GW approximation, and the Bethe–Salpeter equation. Our findings suggest that the ΔMLQP model allows us to predict quasiparticle energies and photoelectron spectra of molecules and clusters with GW accuracy at DFT cost.
... Furthermore, atomistic details (geometries) are often lacking in the case of experimental data, while level of theory used in the case of theoretical studies can often no longer be considered to be state of the art. While it is possible to merge reaction data from different sources or to learn their respective differences in the potential energy surface by means of Delta machine learning (∆-ML) [36], multi-fidelity machine learning models [37], or multi-level combination grid technique [38], the resulting multilevel approaches require at least part of the data to be evaluated in many different sources. Thus there is considerable need for one large consistent data set which subsequently could be used as a basis for multilevel machine learning models and their application in reaction design. ...
Preprint
Reaction barriers are a crucial ingredient for first principles based computational retro-synthesis efforts as well as for comprehensive reactivity assessments throughout chemical compound space. While extensive databases of experimental results exist, modern quantum machine learning applications require atomistic details which can only be obtained from quantum chemistry protocols. For competing E2 and S$_\text{N}$2 reaction channels we report 4'466 transition state and 143'200 reactant complex geometries and energies at respective MP2/6-311G(d) and single point DF-LCCSD/cc-pVTZ level of theory covering the chemical compound space spanned by the substituents NO$_2$, CN, CH$_3$, and NH$_2$ and early halogens (F, Cl, Br) as nucleophiles and leaving groups. Reactants are chosen such that the activation energy of the competing E2 and S$_\text{N}$2 reactions are of comparable magnitude. The correct concerted motion for each of the one-step reactions has been validated for all transition states. We demonstrate how quantum machine learning models can support data set extension, and discuss the distribution of key internal coordinates of the transition states.
Article
Rational design of compounds with specific properties requires understanding and fast evaluation of molecular properties throughout chemical compound space — the huge set of all potentially stable molecules. Recent advances in combining quantum-mechanical calculations with machine learning provide powerful tools for exploring wide swathes of chemical compound space. We present our perspective on this exciting and quickly developing field by discussing key advances in the development and applications of quantum-mechanics-based machine-learning methods to diverse compounds and properties, and outlining the challenges ahead. We argue that significant progress in the exploration and understanding of chemical compound space can be made through a systematic combination of rigorous physical theories, comprehensive synthetic data sets of microscopic and macroscopic properties, and modern machine-learning methods that account for physical and chemical knowledge. Machine-learning techniques have enabled, among many other applications, the exploration of molecular properties throughout chemical space. The specific development of quantum-based approaches in machine learning can now help us unravel new chemical insights.
Chapter
In this chapter we illustrate in a tutorial way how machine learning can be used to assist quantum chemical research. Pitfalls of machine learning are highlighted and ways to avoid them are suggested. We show how machine learning can be used to improve relatively low-cost, approximated quantum chemical methods in two conceptually different ways. The first way is to improve the low-cost quantum chemical predictions a posteriori, e.g., as in Δ-machine learning. The second way is to improve the low-cost quantum chemical method itself and then use this improved method to make predictions, e.g., as in semiempirical parameter learning. Then we show how pure machine learning can be used to build very accurate potential energy surfaces with spectroscopic accuracy. Here we also discuss the importance of sampling to reduce the number of training points and eliminate many unphysical outliers, e.g., as in structure-based and farthest-point sampling. Then we demonstrate how machine learning can be used for nonadiabatic excited-state dynamics and discuss the associated challenges. In all examples, kernel ridge regression approach to machine learning is used. This approach and its advantages and disadvantages are discussed too.
Article
High-fidelity quantum chemical calculations can provide accurate predictions of molecular energies, but their high computational costs limit their utility, especially for larger molecules. We have shown in previous work that machine learning models trained on high-level quantum chemical calculations (G4MP2) for organic molecules with 1-9 non-hydrogen atoms can provide accurate predictions for other molecules of comparable size, at much lower costs. Here, we demonstrate that such models can also be used to effectively predict energies of molecules larger than those in the training set. To implement this strategy, we first established a set of 191 molecules with 10-14 non-hydrogen atoms having reliable experimental enthalpies of formation. We then assessed the accuracy of computed G4MP2 enthalpies of formation for these 191 molecules. The error in the G4MP2 results was somewhat larger than that for smaller molecules, and the reason for this increase is discussed. Two density functional methods, B3LYP and ωB97X-D, were also used on this set of molecules, with ωB97X-D found to perform better than B3LYP at predicting energies. The G4MP2 energies for the 191 molecules were then predicted using these two functionals with two machine learning methods, the FCHL and SchNet models, with the learning done on calculated energies of the 1-9 non-hydrogen atom molecules. The better-performing model, FCHL, gave atomization energies of the 191 organic molecules with 10-14 non-hydrogen atoms within 0.4 kcal/mol of their G4MP2 energies. Thus, this work demonstrates that quantum chemically informed machine learning can be used to successfully predict energies of large organic molecules whose size is beyond that in the training set.
Article
Atomistic modeling of the optoelectronic properties of organic semiconductors (OSCs) requires a large number of excited-state electronic-structure calculations, a computationally daunting task for many OSC applications. In this work, we advocate the use of deep learning to address this challenge and demonstrate that state-of-the-art deep neural networks (DNNs) are capable of accurately predicting various electronic properties of an important class of OSCs, i.e., oligothiophenes (OTs), including their HOMO and LUMO energies, excited-state energies and associated transition dipole moments. Among the tested DNNs, SchNet shows the best performance for OTs of different sizes, achieving average prediction errors in the range of 20-80 meV. We show that SchNet also consistently outperforms shallow feed-forward neural networks, especially in difficult cases with large molecules or limited training data. We further show that SchNet could predict the transition dipole moment accurately, a task previously known to be difficult for feed-forward neural networks, and we ascribe the relatively large errors in transition dipole prediction seen for some OT configurations to the charge-transfer character of their excited states. Finally, we demonstrate the effectiveness of SchNet by modeling the UV-Vis absorption spectra of OTs in dichloromethane and a good agreement is observed between the calculated and experimental spectra.
Article
We present non-covalent quantum machine learning corrections to six physically motivated density functionals with systematic errors. We demonstrate that the missing massively non-local and non-additive physical effects can be recovered by the quantum machine learning models. The models seamlessly account for various types of non-covalent interactions, and enable accurate predictions of dissociation curves. The correction improves the description of molecular two- and three-body interactions crucial in large water clusters, and provides a reasonable atomic-resolution picture of the interaction energy errors of approximate density functionals, which can be useful information in the development of more accurate density functionals. We show that given sufficient training instances the correction is more flexible than standard molecular mechanical dispersion corrections, and thus it can be applied for cases where many dispersion corrected density functionals fail, such as hydrogen bonding.
Preprint
Machine Learning (ML) has become a promising tool for improving the quality of atomistic simulations. Using formaldehyde as a benchmark system for intramolecular interactions, a comparative assessment of ML models based on state-of-the-art variants of deep neural networks (NN), reproducing kernel Hilbert space (RKHS+F), and kernel ridge regression (KRR) is presented. Learning curves for energies and atomic forces indicate rapid convergence towards excellent predictions for B3LYP, MP2, and CCSD(T)-F12 reference results for modestly sized (in the hundreds) training sets. Typically, learning curve offsets decay as one goes from NN (PhysNet) to RKHS+F to KRR (FCHL). Conversely, the predictive power for extrapolation of energies towards new geometries increases in the same order, with RKHS+F and FCHL performing almost equally. For harmonic vibrational frequencies, the picture is less clear, with PhysNet and FCHL yielding flat learning at $\sim$1 and $\sim$0.2 cm$^{-1}$, respectively, no matter which reference method, while RKHS+F models level off for B3LYP and exhibit continued improvements for MP2 and CCSD(T)-F12. Finite-temperature molecular dynamics (MD) simulations with the same initial conditions yield indistinguishable infrared spectra, in good agreement with experiment except for the high-frequency modes involving hydrogen stretch motion, a known limitation of MD for vibrational spectroscopy. For sufficiently large training set sizes all three models can detect insufficient convergence ("noise") of the reference electronic structure calculations in that the learning curves level off. Transfer learning (TL) from B3LYP to CCSD(T)-F12 with PhysNet indicates that additional improvements in data efficiency can be achieved.
Article
We propose a multi-level method to increase the accuracy of machine learning algorithms for approximating observables in scientific computing, particularly those that arise in systems modelled by differential equations. The algorithm relies on judiciously combining a large number of computationally cheap training samples on coarse resolutions with a few expensive training samples on fine grid resolutions. Theoretical arguments for lowering the generalisation error, based on reducing the variance of the underlying maps, are provided, and numerical evidence indicating significant gains over the underlying single-level machine learning algorithms is presented. Moreover, we also apply the multi-level algorithm in the context of forward uncertainty quantification and observe a considerable speedup over competing algorithms.
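A two-level instance of this combination, one model trained on many cheap coarse labels plus a second model trained on a few expensive level-to-level differences, can be sketched as below. Unlike plain Δ-ML, no coarse evaluation is needed at query time, since the coarse level is itself modeled. The toy functions, sample sizes, and kernel settings are illustrative assumptions.

```python
import numpy as np

def krr(x_tr, y_tr, gamma=2.0, lam=1e-6):
    """Gaussian-kernel ridge regression in 1-D; returns a predictor."""
    K = np.exp(-gamma * (x_tr[:, None] - x_tr[None, :]) ** 2)
    alpha = np.linalg.solve(K + lam * np.eye(len(x_tr)), y_tr)
    return lambda x_q: np.exp(-gamma * (x_q[:, None] - x_tr[None, :]) ** 2) @ alpha

rng = np.random.default_rng(3)
f_coarse = lambda x: np.sin(4.0 * x)                # cheap, oscillatory level
f_fine = lambda x: np.sin(4.0 * x) + 0.2 * x ** 2   # expensive target level

x0 = rng.uniform(-1, 1, 400)                        # many coarse samples
x1 = rng.uniform(-1, 1, 15)                         # few fine samples
model_coarse = krr(x0, f_coarse(x0))
model_delta = krr(x1, f_fine(x1) - f_coarse(x1))    # learn only the difference
combined = lambda x_q: model_coarse(x_q) + model_delta(x_q)

x_t = np.linspace(-1, 1, 200)
mae_comb = float(np.mean(np.abs(combined(x_t) - f_fine(x_t))))
direct = krr(x1, f_fine(x1))                        # same 15 fine labels, no coarse help
mae_direct = float(np.mean(np.abs(direct(x_t) - f_fine(x_t))))
```

The oscillatory part is pinned down by the 400 cheap samples, so the 15 expensive samples only have to resolve the smooth difference, which is the variance-reduction mechanism the abstract refers to.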
Article
Recent advances in theoretical thermochemistry have allowed the study of small organic and bio-organic molecules with high accuracy. However, applications to larger molecules are still impeded by the steep scaling problem of highly accurate quantum mechanical (QM) methods, forcing the use of approximate, more cost-effective methods at a greatly reduced accuracy. One of the most successful strategies to mitigate this error is the use of systematic error-cancellation schemes, in which highly accurate QM calculations can be performed on small portions of the molecule to construct corrections to an approximate method. Herein, we build on ideas from fragmentation and error-cancellation to introduce a new family of molecular descriptors for machine learning modeled after the Connectivity-Based Hierarchy (CBH) of generalized isodesmic reaction schemes. The best performing descriptor ML(CBH-2) is constructed from fragments preserving only the immediate connectivity of all heavy (non-H) atoms of a molecule along with overlapping regions of fragments in accordance with the inclusion-exclusion principle. Our proposed approach offers a simple, chemically intuitive grouping of atoms, tuned with an optimal amount of error-cancellation, and outperforms previous structure-based descriptors using a much smaller input vector length. For a wide variety of density functionals, DFT+ΔML(CBH-2) models, trained on a set of small- to medium-sized organic HCNOSCl-containing molecules, achieved an out-of-sample MAE within 0.5 kcal/mol and 2σ (95%) confidence interval of <1.5 kcal/mol compared to accurate G4 reference values at DFT cost.
Article
Δ-machine learning, or the hierarchical construction scheme, is a highly cost-effective method, as only a small number of high-level ab initio energies are required to improve a potential energy surface (PES) fit to a large number of low-level points. However, there is no efficient and systematic way to select as few points as possible from the low-level data set. We here propose a permutation-invariant-polynomial neural-network (PIP-NN)-based Δ-machine learning approach to construct full-dimensional accurate PESs of complicated reactions efficiently. Particularly, the high flexibility of the NN is exploited to efficiently sample points from the low-level data set. This approach is applied to the challenging case of a HO2 self-reaction with a large configuration space. Only 14% of the DFT data set is used to successfully bring a newly fitted DFT PES to the UCCSD(T)-F12a/AVTZ quality. Then, the quasiclassical trajectory (QCT) calculations are performed to study its dynamics, particularly the mode specificity.
Article
Many of the machine learning-based approaches for materials property predictions use low-cost computational data. The motivation for machine learning models is based on the orders of magnitude speedup compared to DFT calculations or experimental characterization. High-quality experimental materials data would be ideal for training these models; unfortunately, experimental data are typically costly to obtain. As a result, experimental databases are often smaller and less cohesive. Using band gap, we demonstrate how an ensemble learning approach allows us to efficiently model experimental data by combining models trained on otherwise disparate computational and experimental data. This approach demonstrates how disparate data sources can be incorporated into the modeling of sparsely represented experimental data. In the case of band gap prediction, we reduce the root mean squared error by over 9%.
Preprint
Full-text available
We apply response operator based quantum machine learning (OQML) to the problem of geometry optimization and transition state search throughout chemical compound space. Using legacy optimizers for both applications, the impact of including OQML-based atomic forces on the optimization outcome has been explored. Numerical results for randomly sampled small organic query molecules indicate systematic improvement of equilibrium and transition state geometries as training set sizes increase. For geometry optimizations, we have considered 5'989 randomly chosen instances of relaxation paths of 5'500 constitutional isomers (sum formula: C$_7$H$_{10}$O$_2$) from the QM9-database. Using the resulting OQML models with an LBFGS optimizer reproduces the minimum geometry with an RMSD of 0.15 Å. Training on 3'812 instances drawn at random from 200 transition state search trajectories from the QMrxn20 data-set, out-of-sample S$_\mathrm{N}$2 transition state geometries have been obtained using OQML-based forces within the QST2 algorithm with an RMSD of 0.3 Å. For the converged equilibrium and transition state geometries, subsequent vibrational normal mode frequency analysis deviates from MP2 reference results by on average 39 and 41 cm$^{-1}$, respectively. The number of steps until convergence is typically larger for OQML than for DFT based forces. However, the success rate for reaching convergence increases systematically with training set size, indicating OQML's considerable potential applicability.
Article
We introduce an electronic structure based representation for quantum machine learning (QML) of electronic properties throughout chemical compound space. The representation is constructed using computationally inexpensive ab initio calculations and explicitly accounts for changes in the electronic structure. We demonstrate the accuracy and flexibility of resulting QML models when applied to property labels, such as total potential energy, HOMO and LUMO energies, ionization potential, and electron affinity, using as datasets for training and testing entries from the QM7b, QM7b-T, QM9, and LIBE libraries. For the latter, we also demonstrate the ability of this approach to account for molecular species of different charge and spin multiplicity, resulting in QML models that infer total potential energies based on geometry, charge, and spin as input.
Article
Application of machine learning (ML) to the prediction of reaction activation barriers is a new and exciting field for these algorithms. The works covered here are specifically those in which ML is trained to predict the activation energies of homogeneous chemical reactions, where the activation energy is given by the energy difference between the reactants and transition state of a reaction. Particular attention is paid to works that have applied ML to directly predict reaction activation energies, the limitations that may be found in these studies, and where comparisons of different types of chemical features for ML models have been made. Also explored are models that have been able to obtain high predictive accuracies, but with reduced datasets, using the Gaussian process regression ML model. In these studies, the chemical reactions for which activation barriers are modeled include those involving small organic molecules, aromatic rings, and organometallic catalysts. Also provided are brief explanations of some of the most popular types of ML models used in chemistry, as a beginner's guide for those unfamiliar.
Article
Vibrational frequencies were used to achieve chemical accuracy with 3% of the data by Δ-machine learning.
Preprint
Although the amino acid tyrosine is among the main building blocks of life, its photochemistry is not fully understood. Traditional theoretical simulations are neither accurate enough, nor computationally efficient to provide the missing puzzle pieces to the experimentally observed signatures obtained via time-resolved pump-probe spectroscopy or mass spectroscopy. In this work, we go beyond the realms of possibility with conventional quantum chemical methods and develop as well as apply a new technique to shed light on the photochemistry of tyrosine. By doing so, we discover roaming atoms in tyrosine, the first time such a reaction has been discovered in biology. Our findings suggest that roaming atoms are radicals that could play a fundamental role in the photochemistry of peptides and proteins, offering a new perspective. Our novel method is based on deep learning, leverages the physics underlying the data, and combines different levels of theory. This combination of methods to obtain an accurate picture of excited molecules could shape how we study photochemical systems in the future and how we can overcome the current limitations that we face when applying quantum chemical methods.
Article
We present a new iterative scheme for potential energy surface (PES) construction, which relies on both physical information and information obtained through statistical analysis. The adaptive density guided approach (ADGA) is combined with a machine learning technique, namely, the Gaussian process regression (GPR), in order to obtain the iterative GPR–ADGA for PES construction. The ADGA provides an average density of vibrational states as a physically motivated importance-weighting and an algorithm for choosing points for electronic structure computations employing this information. The GPR provides an approximation to the full PES given a set of data points, while the statistical variance associated with the GPR predictions is used to select the most important among the points suggested by the ADGA. The combination of these two methods, resulting in the GPR–ADGA, can thereby iteratively determine the PES. Our implementation, additionally, allows for incorporating derivative information in the GPR. The iterative process commences from an initial Hessian and does not require any presampling of configurations prior to the PES construction. We assess the performance on the basis of a test set of nine small molecules and fundamental frequencies computed at the full vibrational configuration interaction level. The GPR–ADGA, with appropriate settings, is shown to provide fundamental excitation frequencies with a root mean square deviation (RMSD) below 2 cm⁻¹, when compared to those obtained based on a PES constructed with the standard ADGA. This can be achieved with substantial savings of 65%–90% in the number of single point calculations.
Article
Machine learning (ML) methods are being used in almost every conceivable area of electronic structure theory and molecular simulation. In particular, ML has become firmly established in the construction of high-dimensional interatomic potentials. Not a day goes by without another proof of principle being published on how ML methods can represent and predict quantum mechanical properties—be they observable, such as molecular polarizabilities, or not, such as atomic charges. As ML is becoming pervasive in electronic structure theory and molecular simulation, we provide an overview of how atomistic computational modeling is being transformed by the incorporation of ML approaches. From the perspective of the practitioner in the field, we assess how common workflows to predict structure, dynamics, and spectroscopy are affected by ML. Finally, we discuss how a tighter and lasting integration of ML methods with computational chemistry and materials science can be achieved and what it will mean for research practice, software development, and postgraduate training.
Article
Full-text available
The distribution of errors is a central object in the assessment and benchmarking of computational chemistry methods. The popular and often blind use of the mean unsigned error as a benchmarking statistic leads one to ignore distribution features that impact the reliability of the tested methods. We explore how the Gini coefficient offers a global representation of the error distribution, but, except for extreme values, does not enable an unambiguous diagnostic. We propose to relieve the ambiguity by applying the Gini coefficient to mode-centered error distributions. This version can usefully complement benchmarking statistics and alert on error sets with potentially problematic shapes.
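A minimal numerical sketch of the mode-centered Gini diagnostic described above (the histogram-based mode estimator and the synthetic error sets are illustrative assumptions, not prescriptions from the paper):

```python
import numpy as np

def gini(values):
    """Gini coefficient of non-negative values:
    mean absolute pairwise difference, normalized by twice the mean."""
    v = np.asarray(values, dtype=float)
    n = v.size
    return np.abs(v[:, None] - v[None, :]).sum() / (2.0 * n * n * v.mean())

def mode_centered_gini(errors, bins=50):
    """Center the errors on a histogram-based mode estimate (an assumed
    choice; any robust mode estimator would do), then apply the Gini
    coefficient to the absolute deviations."""
    e = np.asarray(errors, dtype=float)
    counts, edges = np.histogram(e, bins=bins)
    k = np.argmax(counts)
    mode = 0.5 * (edges[k] + edges[k + 1])
    return gini(np.abs(e - mode))

rng = np.random.default_rng(0)
narrow = rng.normal(0.0, 1.0, 1000)                            # well-behaved errors
heavy = np.concatenate([narrow, rng.normal(0.0, 10.0, 100)])   # outlier-contaminated
print(mode_centered_gini(narrow), mode_centered_gini(heavy))
```

For a constant error set the coefficient is exactly zero; heavier tails push it toward one, which is the alerting behavior the abstract describes.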
Article
First-principles prediction of nuclear magnetic resonance chemical shifts plays an increasingly important role in the interpretation of experimental spectra, but the required density functional theory (DFT) calculations can be computationally expensive. Promising machine learning models for predicting chemical shieldings in general organic molecules have been developed previously, though the accuracy of those models remains below that of DFT. The present study demonstrates how much higher accuracy chemical shieldings can be obtained via the Δ-machine learning approach, with the result that the errors introduced by the machine learning model are only one-half to one-third the errors expected for DFT chemical shifts relative to experiment. Specifically, an ensemble of neural networks is trained to correct PBE0/6-31G chemical shieldings up to the target level of PBE0/6-311+G(2d,p). It can predict ¹H, ¹³C, ¹⁵N, and ¹⁷O chemical shieldings with root-mean-square errors of 0.11, 0.70, 1.69, and 2.47 ppm, respectively. At the same time, the Δ-machine learning approach is 1-2 orders of magnitude faster than the target large-basis calculations. It is also demonstrated that the machine learning model predicts experimental solution-phase NMR chemical shifts in drug molecules with only modestly worse accuracy than the target DFT model. Finally, the ability to estimate the uncertainty in the predicted shieldings based on variations within the ensemble of neural network models is also assessed.
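The Δ-machine learning pattern used here, learning the difference between a cheap and an expensive level of theory, can be sketched with a toy kernel ridge regression model (one-dimensional synthetic data; the `cheap`/`expensive` functions and all hyperparameters are hypothetical stand-ins, not the paper's neural-network ensemble):

```python
import numpy as np

rng = np.random.default_rng(1)

def cheap(x):       # stand-in for a low-level method (e.g. small basis)
    return np.sin(x)

def expensive(x):   # stand-in for the target level: smooth systematic shift
    return np.sin(x) + 0.3 * np.cos(2 * x) + 0.1 * x

x_train = rng.uniform(-3, 3, 40)
x_test = np.linspace(-3, 3, 200)
delta_train = expensive(x_train) - cheap(x_train)   # learn only the correction

def rbf(a, b, sigma=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / sigma**2)

K = rbf(x_train, x_train) + 1e-8 * np.eye(x_train.size)  # regularized kernel
alpha = np.linalg.solve(K, delta_train)                  # KRR weights
pred = cheap(x_test) + rbf(x_test, x_train) @ alpha      # Δ-ML prediction

mae_delta = np.abs(pred - expensive(x_test)).mean()
mae_cheap = np.abs(cheap(x_test) - expensive(x_test)).mean()
print(mae_delta, mae_cheap)
```

Because the correction is smoother than the target quantity itself, the Δ-model needs far fewer expensive-level labels than a direct model would, which is the premise of the abstract above.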
Article
The age of cognitive computing and artificial intelligence (AI) is just dawning. Inspired by its successes and promises, several AI ecosystems are blossoming, many of them within the domain of materials science and engineering. These materials intelligence ecosystems are being shaped by several independent developments. Machine learning (ML) algorithms and extant materials data are utilized to create surrogate models of materials properties and performance predictions. Materials data repositories, which fuel such surrogate model development, are mushrooming. Automated data and knowledge capture from the literature (to populate data repositories) using natural language processing approaches is being explored. The design of materials that meet target property requirements and of synthesis steps to create target materials appear to be within reach, either by closed-loop active-learning strategies or by inverting the prediction pipeline using advanced generative algorithms. AI and ML concepts are also transforming the computational and physical laboratory infrastructural landscapes used to create materials data in the first place. Surrogate models that can outstrip physics-based simulations (on which they are trained) by several orders of magnitude in speed while preserving accuracy are being actively developed. Automation, autonomy and guided high-throughput techniques are imparting enormous efficiencies and eliminating redundancies in materials synthesis and characterization. The integration of the various parts of the burgeoning ML landscape may lead to materials-savvy digital assistants and to a human–machine partnership that could enable dramatic efficiencies, accelerated discoveries and increased productivity. Here, we review these emergent materials intelligence ecosystems and discuss the imminent challenges and opportunities.
Article
Machine Learning (ML) has become a promising tool for improving the quality of atomistic simulations. Using formaldehyde as a benchmark system for intramolecular interactions, a comparative assessment of ML models based on state-of-the-art variants of deep neural networks (NN), reproducing kernel Hilbert space (RKHS+F), and kernel ridge regression (KRR) is presented. Learning curves for energies and atomic forces indicate rapid convergence towards excellent predictions for B3LYP, MP2, and CCSD(T)-F12 reference results for modestly sized (in the hundreds) training sets. Typically, learning curve offsets decay as one goes from NN (PhysNet) to RKHS+F to KRR (FCHL). Conversely, the predictive power for extrapolation of energies towards new geometries increases in the same order, with RKHS+F and FCHL performing almost equally. For harmonic vibrational frequencies, the picture is less clear, with PhysNet and FCHL yielding flat learning at ∼1 and ∼0.2 cm⁻¹, respectively, no matter which reference method, while RKHS+F models level off for B3LYP and exhibit continued improvements for MP2 and CCSD(T)-F12. Finite-temperature molecular dynamics (MD) simulations with the same initial conditions yield indistinguishable infrared spectra, in good agreement with experiment except for the high-frequency modes involving hydrogen stretch motion, which is a known limitation of MD for vibrational spectroscopy. For sufficiently large training set sizes, all three models can detect insufficient convergence ("noise") of the reference electronic structure calculations, in that the learning curves level off. Transfer learning (TL) from B3LYP to CCSD(T)-F12 with PhysNet indicates that additional improvements in data efficiency can be achieved.
Preprint
In the first part of this study (Paper I), we introduced the systematic improvement probability (SIP) as a tool to assess the level of improvement on absolute errors to be expected when switching between two computational chemistry methods. We also developed two indicators based on robust statistics to address the uncertainty of ranking in computational chemistry benchmarks: Pinv, the inversion probability between two values of a statistic, and Pr, the ranking probability matrix. In this second part, these indicators are applied to nine data sets extracted from the recent benchmarking literature. We also illustrate how the correlation between the error sets might contain useful information on the benchmark dataset quality, notably when experimental data are used as reference.
Article
Full-text available
A survey of the contributions to the Special Topic on Data-enabled Theoretical Chemistry is given, including a glossary of relevant machine learning terms.
Article
Full-text available
Data quality as well as library size are crucial issues for force field development. In order to predict molecular properties in a large chemical space, the foundation to build force fields on needs to encompass a large variety of chemical compounds. The tabulated molecular physicochemical properties also need to be accurate. Due to the limited transparency in data used for development of existing force fields it is hard to establish data quality and reusability is low. This paper presents the Alexandria library as an open and freely accessible database of optimized molecular geometries, frequencies, electrostatic moments up to the hexadecupole, electrostatic potential, polarizabilities, and thermochemistry, obtained from quantum chemistry calculations for 2704 compounds. Values are tabulated and where available compared to experimental data. This library can assist systematic development and training of empirical force fields for a broad range of molecules.
Article
Full-text available
Deep learning has led to a paradigm shift in artificial intelligence, including web, text, and image search, speech recognition, as well as bioinformatics, with growing impact in chemical physics. Machine learning, in general, and deep learning, in particular, are ideally suitable for representing quantum-mechanical interactions, enabling us to model nonlinear potential-energy surfaces or enhancing the exploration of chemical compound space. Here we present the deep learning architecture SchNet that is specifically designed to model atomistic systems by making use of continuous-filter convolutional layers. We demonstrate the capabilities of SchNet by accurately predicting a range of properties across chemical space for molecules and materials, where our model learns chemically plausible embeddings of atom types across the periodic table. Finally, we employ SchNet to predict potential-energy surfaces and energy-conserving force fields for molecular dynamics simulations of small molecules and perform an exemplary study on the quantum-mechanical properties of C20-fullerene that would have been infeasible with regular ab initio molecular dynamics.
Article
Full-text available
The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows our AL algorithm to automatically sample regions of chemical space where the machine-learned potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To validate our AL approach, we develop the COMP6 benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. We show the use of our proposed AL technique develops a universal ANI potential (ANI-1x), which provides very accurate energy and force predictions on the entire COMP6 benchmark. This universal potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials while remaining applicable to the general class of organic molecules composed of the elements CHNO.
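The query-by-committee selection step can be sketched with a bootstrap ensemble of small kernel ridge models (synthetic one-dimensional data; the ANI potentials themselves are neural networks, so everything below is an illustrative stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, sigma=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / sigma**2)

def krr_fit_predict(x_tr, y_tr, x_q):
    """Fit a small kernel ridge model and predict at query points."""
    K = rbf(x_tr, x_tr) + 1e-8 * np.eye(x_tr.size)
    return rbf(x_q, x_tr) @ np.linalg.solve(K, y_tr)

f = lambda x: np.sin(3 * x)            # hidden "reference method"
x_lab = rng.uniform(-1, 1, 15)         # currently labelled training data
y_lab = f(x_lab)
x_pool = np.linspace(-2, 2, 400)       # unlabelled candidate pool (wider range)

# Committee: models trained on bootstrap resamples of the labelled set.
committee = []
for _ in range(8):
    idx = rng.integers(0, x_lab.size, x_lab.size)
    committee.append(krr_fit_predict(x_lab[idx], y_lab[idx], x_pool))

disagreement = np.std(committee, axis=0)   # ensemble spread per candidate
query = x_pool[np.argmax(disagreement)]    # next point sent for labelling
print(query)
```

Candidates where the committee members disagree most are the ones the current training set describes worst, so labelling them first concentrates the expensive reference calculations where they help most.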
Article
Full-text available
One of the grand challenges in modern theoretical chemistry is designing and implementing approximations that expedite ab initio methods without loss of accuracy. Machine learning (ML) methods are emerging as a powerful approach to constructing various forms of transferable atomistic potentials. They have been successfully applied in a variety of applications in chemistry, biology, catalysis, and solid-state physics. However, these models are heavily dependent on the quality and quantity of data used in their fitting. Fitting highly flexible ML potentials, such as neural networks, comes at a cost: a vast amount of reference data is required to properly train these models. We address this need by providing access to a large computational DFT database, which consists of more than 20 M off-equilibrium conformations for 57,462 small organic molecules. We believe it will become a new standard benchmark for comparison of current and future methods in the ML potential community.
Article
Full-text available
Classical intermolecular potentials typically require an extensive parametrization procedure for any new compound considered. To do away with prior parametrization, we propose a combination of physics-based potentials with machine learning (ML), coined IPML, which is transferable across small neutral organic and biologically-relevant molecules. ML models provide on-the-fly predictions for environment-dependent local atomic properties: electrostatic multipole coefficients (significant error reduction compared to previously reported), the population and decay rate of valence atomic densities, and polarizabilities across conformations and chemical compositions of H, C, N, and O atoms. These parameters enable accurate calculations of intermolecular contributions---electrostatics, charge penetration, repulsion, induction/polarization, and many-body dispersion. Unlike other potentials, this model is transferable in its ability to handle new molecules and conformations without explicit prior parametrization: All local atomic properties are predicted from ML, leaving only eight global parameters---optimized once and for all across compounds. We validate IPML on various gas-phase dimers at and away from equilibrium separation, where we obtain mean absolute errors between 0.4 and 0.7 kcal/mol for several chemically and conformationally diverse datasets representative of non-covalent interactions in biologically-relevant molecules. We further focus on hydrogen-bond complexes---essential but challenging due to their directional nature---where datasets of DNA base pairs and amino acids yield an extremely encouraging 1.4 kcal/mol error. Finally, and as a first look, we consider IPML in more condensed-phase systems: water clusters, supramolecular host-guest complexes, and the benzene crystal.
Article
Full-text available
In recent years the machine learning techniques have shown a great potential in various problems from a multitude of disciplines, including materials design and drug discovery. The high computational speed on the one hand and the accuracy comparable to that of DFT on another hand make machine learning algorithms efficient for high-throughput screening through chemical and configurational space. However, the machine learning algorithms available in the literature require large training datasets to reach the chemical accuracy and also show large errors for the so-called outliers---the out-of-sample molecules, not well-represented in the training set. In the present paper we propose a new machine learning algorithm for predicting molecular properties that addresses these two issues: it is based on a local model of interatomic interactions providing high accuracy when trained on relatively small training sets and an active learning algorithm of optimally choosing the training set that significantly reduces the errors of the outliers. We compare our model to the other state-of-the-art algorithms from the literature on the widely used benchmark tests.
Article
Full-text available
Determining the stability of molecules and condensed phases is the cornerstone of atomistic modelling, underpinning our understanding of chemical and materials properties and transformations. Here we show that a machine learning model, based on a local description of chemical environments and Bayesian statistical learning, provides a unified framework to predict atomic-scale properties. It captures the quantum mechanical effects governing the complex surface reconstructions of silicon, predicts the stability of different classes of molecules with chemical accuracy, and distinguishes active and inactive protein ligands with more than 99% reliability. The universality and the systematic nature of our framework provides new insight into the potential energy surface of materials and molecules.
Article
Full-text available
We present a computationally efficient sparse grid approach to allow for multiscale simulations of non-Newtonian polymeric fluids. Multiscale approaches for polymeric fluids often involve model equations of high dimensionality. A conventional numerical treatment of such equations leads to computing times in the order of months even on massively parallel computers. For a reduction of this enormous complexity, we propose the sparse grid combination technique. Compared to classical full grid approaches, the combination technique strongly reduces the computational complexity of a numerical scheme but only slightly decreases its accuracy. Here, we use the combination technique in a general formulation that balances not only different discretization errors but also considers the accuracy of the mathematical model. For an optimal weighting of these different problem dimensions, we employ a dimension-adaptive refinement strategy. We finally verify substantial cost reductions of our approach for simulations of non-Newtonian Couette and extensional flow problems.
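The two-dimensional combination technique can be sketched on a toy quadrature problem, where the midpoint rule on anisotropic grids stands in for the full multiscale solver (the grid levels and test integrand are illustrative assumptions):

```python
import numpy as np

def midpoint_2d(f, ni, nj):
    """Midpoint-rule integral of f over [0,1]^2 on an ni x nj grid."""
    xi = (np.arange(ni) + 0.5) / ni
    yj = (np.arange(nj) + 0.5) / nj
    return f(xi[:, None], yj[None, :]).sum() / (ni * nj)

def combination(f, n):
    """2D sparse-grid combination technique: add the anisotropic-grid
    solutions on the diagonal i + j = n and subtract those on i + j = n - 1."""
    plus = sum(midpoint_2d(f, 2**i, 2**(n - i)) for i in range(n + 1))
    minus = sum(midpoint_2d(f, 2**i, 2**(n - 1 - i)) for i in range(n))
    return plus - minus

f = lambda x, y: np.exp(x + y)
exact = (np.e - 1.0) ** 2

err_comb = abs(combination(f, 6) - exact)     # 13 grids of 64 points each
err_full3 = abs(midpoint_2d(f, 8, 8) - exact) # one full 8 x 8 grid
print(err_comb, err_full3)
```

At level n = 6 each anisotropic grid holds only 2⁶ = 64 points, so the thirteen combined grids together cost far less than the 64 × 64 full grid, yet the combined estimate is more accurate than a full 8 × 8 grid of comparable cost. The CQML scheme of the main article applies the same telescoping idea to levels of theory rather than mesh widths.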
Article
Full-text available
Machine learning has emerged as an invaluable tool in many research areas. In the present work, we harness this power to predict highly accurate molecular infrared spectra with unprecedented computational efficiency. To account for vibrational anharmonic and dynamical effects -- typically neglected by conventional quantum chemistry approaches -- we base our machine learning strategy on ab initio molecular dynamics simulations. While these simulations are usually extremely time consuming even for small molecules, we overcome these limitations by leveraging the power of a variety of machine learning techniques, not only accelerating simulations by several orders of magnitude, but also greatly extending the size of systems that can be treated. To this end, we develop a molecular dipole moment model based on environment dependent neural network charges and combine it with the neural network potentials of Behler and Parrinello. Contrary to the prevalent big data philosophy, we are able to obtain very accurate machine learning models for the prediction of infrared spectra based on only a few hundreds of electronic structure reference points. This is made possible through the introduction of a fully automated sampling scheme and the use of molecular forces during neural network potential training. We demonstrate the power of our machine learning approach by applying it to model the infrared spectra of a methanol molecule, n-alkanes containing up to 200 atoms and the protonated alanine tripeptide, which at the same time represents the first application of machine learning techniques to simulate the dynamics of a peptide. In all these case studies we find excellent agreement between the infrared spectra predicted via machine learning models and the respective theoretical and experimental spectra.
Article
Full-text available
High-throughput computational screening has emerged as a critical component of materials discovery. Direct density functional theory (DFT) simulation of inorganic materials and molecular transition metal complexes is often used to describe subtle trends in inorganic bonding and spin-state ordering, but these calculations are computationally costly and properties are sensitive to the exchange-correlation functional employed. To begin to overcome these challenges, we trained artificial neural networks (ANNs) to predict quantum-mechanically-derived properties, including spin-state ordering, sensitivity to Hartree-Fock exchange, and spin-state-specific bond lengths in transition metal complexes. Our ANN is trained on a small set of inorganic-chemistry-appropriate empirical inputs that are both maximally transferable and do not require precise three-dimensional structural information for prediction. Using these descriptors, our ANN predicts spin-state splittings of single-site transition metal complexes (i.e., Cr-Ni) at arbitrary amounts of Hartree-Fock exchange to within 3 kcal/mol accuracy of DFT calculations. Our exchange-sensitivity ANN enables improved predictions on a diverse test set of experimentally-characterized transition metal complexes by extrapolation from semi-local DFT to hybrid DFT. The ANN also outperforms other machine learning models (i.e., support vector regression and kernel ridge regression), demonstrating particularly improved performance in transferability, as measured by prediction errors on the diverse test set. We establish the value of new uncertainty quantification tools to estimate ANN prediction uncertainty in computational chemistry, and we provide additional heuristics for identification of when a compound of interest is likely to be poorly predicted by the ANN.
Article
Full-text available
Learning from data has led to paradigm shifts in a multitude of disciplines, including web, text and image search, speech recognition, as well as bioinformatics. Can machine learning enable similar breakthroughs in understanding quantum many-body systems? Here we develop an efficient deep learning approach that enables spatially and chemically resolved insights into quantum-mechanical observables of molecular systems. We unify concepts from many-body Hamiltonians with purpose-designed deep tensor neural networks, which leads to size-extensive and uniformly accurate (1 kcal mol−1) predictions in compositional and configurational chemical space for molecules of intermediate size. As an example of chemical relevance, the model reveals a classification of aromatic rings with respect to their stability. Further applications of our model for predicting atomic energies and local chemical potentials in molecules, reliable isomer energies, and molecules with peculiar electronic structure demonstrate the potential of machine learning for revealing insights into complex quantum-chemical systems.
Article
Full-text available
In this paper, we introduce a new scheme for the efficient numerical treatment of the electronic Schrödinger equation for molecules. It is based on the combination of a many-body expansion, which corresponds to the so-called bond order dissection ANOVA approach, with a hierarchy of basis sets of increasing order. Here, the energy is represented as a finite sum of contributions associated to subsets of nuclei and basis sets in a telescoping-sum-like fashion. Under the assumption of data locality of the electronic density (nearsightedness of electronic matter), the terms of this expansion decay rapidly and higher terms may be neglected. We further extend the approach in a dimension-adaptive fashion to generate quasi-optimal approximations, i.e., a specific truncation of the hierarchical series such that the total benefit is maximized for a fixed amount of costs. This way, we are able to achieve substantial speed-up factors compared to conventional first-principles methods, depending on the molecular system under consideration. In particular, the method can deal efficiently with molecular systems which include only a small active part that needs to be described by accurate but expensive models.
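The many-body (ANOVA-style) expansion can be sketched with a toy subset-energy model. By construction, the energy below is additive in atoms and atom pairs, so the two-body truncation of the increment expansion recovers it exactly; all terms are synthetic stand-ins, not electronic-structure data:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n = 6                                    # number of "atoms"
eps = rng.normal(size=n)                 # one-body contributions
J = np.triu(rng.normal(size=(n, n)), 1)  # pairwise contributions (i < j)

def energy(atoms):
    """Energy of an atom subset: one-body plus pairwise terms."""
    atoms = sorted(atoms)
    e = sum(eps[i] for i in atoms)
    e += sum(J[i, j] for i, j in combinations(atoms, 2))
    return e

# Increment expansion: one-body terms plus two-body corrections.
one_body = sum(energy([i]) for i in range(n))
two_body = sum(energy([i, j]) - energy([i]) - energy([j])
               for i, j in combinations(range(n), 2))
approx = one_body + two_body
print(approx, energy(range(n)))
```

For a real molecule the higher-order increments are small but nonzero; the nearsightedness assumption in the abstract is precisely what justifies truncating the series after the low-order terms.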
Article
Full-text available
Using conservation of energy - a fundamental property of closed classical and quantum mechanical systems - we develop an efficient gradient-domain machine learning (GDML) approach to construct accurate molecular force fields using a restricted number of samples from ab initio molecular dynamics (AIMD) trajectories. The GDML implementation is able to reproduce global potential energy surfaces of intermediate-sized molecules with an accuracy of 0.3 kcal/mol for energies and 1 kcal/mol/Å for atomic forces using only 1000 conformational geometries for training. We demonstrate this accuracy for AIMD trajectories of molecules, including benzene, toluene, naphthalene, ethanol, uracil, and aspirin. The challenge of constructing conservative force fields is accomplished in our work by learning in a Hilbert space of vector-valued functions that obey the law of energy conservation. The GDML approach enables quantitative molecular dynamics simulations for molecules at a fraction of the cost of explicit AIMD calculations, thereby allowing the construction of efficient force fields with the accuracy and transferability of high-level ab initio methods.
Article
Full-text available
Composite ab initio methods are multistep theoretical procedures specifically designed to obtain highly accurate thermochemical and kinetic data with confident sub-kcal mol−1 or sub-kJ mol−1 accuracy. These procedures include all energetic terms that contribute to the molecular binding energies at these levels of accuracy (e.g., CCSD(T), post-CCSD(T), core–valence, relativistic, spin-orbit, Born–Oppenheimer, and zero-point vibrational energy corrections). Basis-set extrapolations (and other basis-set acceleration techniques) are used for obtaining these terms at sufficiently high levels of accuracy. Major advances in computer hardware and theoretical methodologies over the past two decades have enabled the application of these procedures to medium-sized organic systems (e.g., ranging from benzene and hexane to amino acids and DNA bases). With these advances, there has been a proliferation in the number of developed composite ab initio methods. We give an overview of the accuracy and applicability of the various types of composite ab initio methods that were developed in recent years. General recommendations to guide selection of the most suitable method for a given problem are presented, with a special emphasis on organic molecules. For further resources related to this article, please visit the WIREs website.
Article
Full-text available
Evaluating the (dis)similarity of crystalline, disordered and molecular compounds is a critical step in the development of algorithms to navigate automatically the configuration space of complex materials. For instance, a structural similarity metric is crucial for classifying structures, searching chemical space for better compounds and materials, and driving the next generation of machine-learning techniques for predicting the stability and properties of molecules and materials. In the last few years several strategies have been designed to compare atomic coordination environments. In particular, the Smooth Overlap of Atomic Positions (SOAP) has emerged as a natural framework to obtain translation, rotation and permutation-invariant descriptors of groups of atoms, driven by the design of various classes of machine-learned inter-atomic potentials. Here we discuss how one can combine such local descriptors using a Regularized Entropy Match (REMatch) approach to describe the similarity of both whole molecular and bulk periodic structures, introducing powerful metrics that allow the navigation of alchemical and structural complexity within a unified framework. Furthermore, using this kernel and a ridge regression method we can also predict atomization energies for a database of small organic molecules with a mean absolute error below 1 kcal/mol, reaching an important milestone in the application of machine-learning techniques to the evaluation of molecular properties.
Article
Full-text available
Elpasolite is the predominant quaternary crystal structure (AlNaK$_2$F$_6$ prototype) reported in the Inorganic Crystal Structure Database. We have developed a machine learning model to calculate density functional theory quality formation energies of all the 2 M pristine ABC$_2$D$_6$ elpasolite crystals which can be made up from main-group elements (up to bismuth). Our model's accuracy can be improved systematically, reaching 0.1 eV/atom for a training set consisting of 10 k crystals. Important bonding trends are revealed: fluoride is best suited to fit the coordination of the D site, which lowers the formation energy, whereas the opposite is found for carbon. The bonding contribution of elements A and B is very small on average. Low formation energies result from A and B being late elements from group (II), C being a late (I) element, and D being fluoride. Out of 2 M crystals, the three degenerate pairs CaSrCs$_2$F$_6$/SrCaCs$_2$F$_6$, CaSrRb$_2$F$_6$/SrCaRb$_2$F$_6$ and CaBaCs$_2$F$_6$/BaCaCs$_2$F$_6$ yield the lowest formation energies: $-3.44$, $-3.41$, and $-3.39$ eV/atom, respectively. In crystals with large negative formation energies unusual atomic oxidation states have been discovered for Sb and Te.
Article
Full-text available
Due to its favorable computational efficiency, time-dependent (TD) density functional theory (DFT) enables the prediction of electronic spectra in a high-throughput manner across chemical space. Its predictions, however, can be quite inaccurate. We resolve this issue with machine learning models trained on deviations of reference second-order approximate coupled-cluster (CC2) singles and doubles spectra from TDDFT counterparts, or even from the DFT gap. We applied this approach to low-lying singlet-singlet vertical electronic spectra of over 20 000 synthetically feasible small organic molecules with up to eight CONF atoms. The prediction errors decay monotonously as a function of training set size. For a training set of 10 000 molecules, CC2 excitation energies can be reproduced to within ±0.1 eV for the remaining molecules. Analysis of our spectral database via chromophore counting suggests that even higher accuracies can be achieved. Based on the evidence collected, we discuss open challenges associated with data-driven modeling of high-lying spectra and transition intensities.
Article
Full-text available
Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C$_7$H$_{10}$O$_2$, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Article
Full-text available
The development of modern materials science has led to a growing need to understand the phenomena determining the properties of materials and processes on an atomistic level. The interactions between atoms and electrons are governed by the laws of quantum mechanics; hence, accurate and efficient techniques for solving the basic quantum-mechanical equations for complex many-atom, many-electron systems must be developed. Density functional theory (DFT) marks a decisive breakthrough in these efforts, and in the past decade DFT has had a rapidly growing impact not only on fundamental but also industrial research. This article discusses the fundamental principles of DFT and the highly efficient computational tools that have been developed for its application to complex problems in materials science. Also highlighted are state-of-the-art applications in many areas of materials research, such as structural materials, catalysis and surface science, nanomaterials, and biomaterials and geophysics.
Article
Full-text available
The materials discovery process can be significantly expedited and simplified if we can learn effectively from available knowledge and data. In the present contribution, we show that efficient and accurate prediction of a diverse set of properties of material systems is possible by employing machine (or statistical) learning methods trained on quantum mechanical computations in combination with the notions of chemical similarity. Using a family of one-dimensional chain systems, we present a general formalism that allows us to discover decision rules that establish a mapping between easily accessible attributes of a system and its properties. It is shown that fingerprints based on either chemo-structural (compositional and configurational information) or the electronic charge density distribution can be used to make ultra-fast, yet accurate, property predictions. Harnessing such learning paradigms extends recent efforts to systematically explore and mine vast chemical spaces, and can significantly accelerate the discovery of new application-specific materials.
Article
Full-text available
Two new schemes for computing molecular total atomization energies (TAEs) and/or heats of formation ($\Delta H_f^\circ$) of first- and second-row compounds to very high accuracy are presented. The more affordable scheme, W1 (Weizmann-1) theory, yields a mean absolute error of 0.30 kcal/mol and includes only a single, molecule-independent, empirical parameter. It requires CCSD (coupled cluster with all single and double substitutions) calculations in spdf and spdfg basis sets, while CCSD(T) (i.e., CCSD with a quasiperturbative treatment of connected triple excitations) calculations are only required in spd and spdf basis sets. On workstation computers and using conventional coupled cluster algorithms, systems as large as benzene can be treated, while larger systems are feasible using direct coupled cluster methods. The more rigorous scheme, W2 (Weizmann-2) theory, contains no empirical parameters at all and yields a mean absolute error of 0.23 kcal/mol, which is lowered to 0.18 kcal/mol for molecules dominated by dynamical correlation. It involves CCSD calculations in spdfg and spdfgh basis sets and CCSD(T) calculations in spdf and spdfg basis sets. On workstation computers, molecules with up to three heavy atoms can be treated using conventional coupled cluster algorithms, while larger systems can still be treated using a direct CCSD code. Both schemes include corrections for scalar relativistic effects, which are found to be vital for accurate results on second-row compounds.
Article
Full-text available
The combination of modern scientific computing with electronic structure theory can lead to an unprecedented amount of data amenable to intelligent data analysis for the identification of meaningful, novel, and predictive structure-property relationships. Such relationships enable high-throughput screening for relevant properties in an exponentially growing pool of virtual compounds that are synthetically accessible. Here, we present a machine learning (ML) model, trained on a database of *ab initio* calculation results for thousands of organic molecules, that simultaneously predicts multiple electronic ground- and excited-state properties. The properties include atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity, and excitation energies. The ML model is based on a deep multi-task artificial neural network, exploiting underlying correlations between various molecular properties. The input is identical to *ab initio* methods, *i.e.* nuclear charges and Cartesian coordinates of all atoms. For small organic molecules the accuracy of such a "Quantum Machine" is similar, and sometimes superior, to modern quantum-chemical methods, at negligible computational cost.
Article
Full-text available
A variation of Gaussian-3 (G3) theory is presented in which the basis set extensions are obtained at the second-order Møller–Plesset level. This method, referred to as G3(MP2) theory, is assessed on 299 energies from the G2/97 test set [J. Chem. Phys. 109, 42 (1998)]. The average absolute deviation from experiment of G3(MP2) theory for the 299 energies is 1.30 kcal/mol, and for the subset of 148 neutral enthalpies it is 1.18 kcal/mol. This is a significant improvement over the related G2(MP2) theory [J. Chem. Phys. 98, 1293 (1993)], which has an average absolute deviation of 1.89 kcal/mol for all 299 energies and 2.03 kcal/mol for the 148 neutral enthalpies. The corresponding average absolute deviations for full G3 theory are 1.01 and 0.94 kcal/mol, respectively. The new method provides significant savings in computational time compared to G3 theory and also G2(MP2) theory. © 1999 American Institute of Physics.
Article
For a theoretical understanding of the reactivity of complex chemical systems, relative energies of stationary points on potential energy hypersurfaces need to be calculated to high accuracy. Due to the large number of intermediates present in all but the simplest chemical processes, approximate quantum chemical methods are required that allow for fast evaluations of the relative energies, but at the expense of accuracy. Despite the plethora of benchmark studies, the accuracy of a quantum chemical method is often difficult to assess. Moreover, a significant improvement of a method's accuracy (e.g., through reparameterization or systematic model extension) is rarely possible. Here, we present a new approach that allows for the systematic, problem-oriented, and rolling improvement of quantum chemical results through the application of Gaussian processes. Due to its Bayesian nature, reliable error estimates are provided for each prediction. A reference method of high accuracy can be employed if the uncertainty associated with a particular calculation is above a given threshold. The new data point is then added to a growing data set in order to continuously improve the model, and as a result, all subsequent predictions. Previous predictions are validated by the updated model to ensure that uncertainties remain within the given confidence bound, a procedure we call backtracking. We demonstrate our approach on the example of a complex chemical reaction network.
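The uncertainty-gated refinement loop described in this abstract can be illustrated with a small numpy sketch. A toy 1D "reference" function stands in for the expensive quantum chemical method, kernel hyperparameters are fixed, and all names are hypothetical assumptions rather than the authors' implementation:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # squared-exponential kernel matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_fit(X, y, noise=1e-6):
    K = rbf(X, X) + noise * np.eye(len(X))
    return np.linalg.cholesky(K)

def gp_predict(X, y, L, Xq):
    Ks = rbf(X, Xq)                            # cross-covariances
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - (v * v).sum(0)                 # RBF prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

def reference(x):                              # stand-in for the costly method
    return np.sin(3 * x[0])

# start from a small data set and grow it whenever the GP is unsure
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([reference(x) for x in X])
threshold = 0.05
for xq in np.linspace(0.0, 2.0, 41).reshape(-1, 1):
    L = gp_fit(X, y)
    mean, std = gp_predict(X, y, L, xq[None, :])
    if std[0] > threshold:                     # fall back to the reference method
        X = np.vstack([X, xq[None, :]])
        y = np.append(y, reference(xq))
n_ref = len(X)                                 # reference calls actually made
```

Because Gaussian conditioning never increases the posterior variance, every previously accepted prediction remains within the confidence bound after the model grows, which is the essence of the backtracking guarantee.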
Article
We introduce a representation of any atom in any chemical environment for the automatized generation of universal kernel ridge regression-based quantum machine learning (QML) models of electronic properties, trained throughout chemical compound space. The representation is based on Gaussian distribution functions, scaled by power laws and explicitly accounting for structural as well as elemental degrees of freedom. The elemental components help us to lower the QML model’s learning curve, and, through interpolation across the periodic table, even enable “alchemical extrapolation” to covalent bonding between elements not part of training. This point is demonstrated for the prediction of covalent binding in single, double, and triple bonds among main-group elements as well as for atomization energies in organic molecules. We present numerical evidence that resulting QML energy models, after training on a few thousand random training instances, reach chemical accuracy for out-of-sample compounds. Compound datasets studied include thousands of structurally and compositionally diverse organic molecules, non-covalently bonded protein side-chains, (H2O)40-clusters, and crystalline solids. Learning curves for QML models also indicate competitive predictive power for various other electronic ground state properties of organic molecules, calculated with hybrid density functional theory, including polarizability, heat-capacity, HOMO-LUMO eigenvalues and gap, zero point vibrational energy, dipole moment, and highest vibrational fundamental frequency.
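As background, the kernel ridge regression (KRR) machinery underlying such QML models reduces to a few lines of linear algebra. Below is a minimal sketch on toy 1D data; the Gaussian kernel, hyperparameters, and function names are illustrative assumptions, not the representation introduced in the paper:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # pairwise squared distances between rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def krr_train(X, y, sigma=0.5, lam=1e-6):
    # solve (K + lambda I) alpha = y for the regression weights
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_query, sigma=0.5):
    return gaussian_kernel(X_query, X_train, sigma) @ alpha

# toy demo: learn a smooth 1D "property" from scattered samples
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * X[:, 0] ** 2
alpha = krr_train(X, y)
Xq = np.linspace(-2, 2, 101).reshape(-1, 1)
pred = krr_predict(X, alpha, Xq)
```

Once `alpha` is solved for, predicting a new query is a single kernel-vector product, which is why trained QML models deliver millisecond estimates.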
Article
We study how geometry optimizations that rely on numerical gradients can be accelerated by means of Gaussian process regression (GPR). The GPR interpolates a local potential energy surface on which the structure is optimized. It is found to be efficient to combine results at a low computational level (HF or MP2) with the GPR-calculated gradient of the difference between the low-level method and the target method, which in this study is a variant of explicitly correlated coupled cluster singles and doubles with perturbative triples correction, CCSD(F12*)(T). Overall convergence is achieved if both the potential and the geometry are converged. Compared to numerical gradient-based algorithms, the number of required single-point calculations is reduced. Although the interpolation introduces an error, the optimized structures are sufficiently close to the minimum of the target level of theory, meaning that the reference and predicted minima differ energetically only in the μE$_h$ regime.
Article
Direct molecular dynamics (MD) simulation with ab initio quantum mechanical and molecular mechanical (QM/MM) methods is very powerful for studying the mechanism of chemical reactions in complex environments but very time consuming. The computational cost of QM/MM calculations during MD simulations can be reduced significantly using semiempirical QM/MM methods with lower accuracy. To achieve higher accuracy at the ab initio QM/MM level, a correction to the existing semiempirical QM/MM model is an attractive way. Recently, we reported a neural network (NN) method, QM/MM-NN, to predict the potential energy difference between semiempirical and ab initio QM/MM approaches. High-level results can be obtained using the neural network based on semiempirical QM/MM MD simulations, but the lack of direct MD sampling at the ab initio QM/MM level is still a deficiency that limits the applications of QM/MM-NN. In the present paper, we developed a dynamic scheme of QM/MM-NN for direct MD simulations on the NN-predicted potential energy surface to approximate ab initio QM/MM MD. Since some configurations excluded from the database for NN training were encountered during simulations, which may cause difficulties in MD sampling, an adaptive procedure inspired by the selection scheme reported by Behler was employed, with some adaptations, to update the NN and carry out MD iteratively. We further applied the adaptive QM/MM-NN MD method to the free energy calculation and transition path optimization of chemical reactions in water. The results at the ab initio QM/MM level can be well reproduced using this method after 2-4 iteration cycles. The saving in computational cost is about 2 orders of magnitude. This demonstrates that QM/MM-NN with direct MD simulations holds great potential not only for the calculation of thermodynamic properties but also for the characterization of reaction dynamics, providing a useful tool to study chemical or biochemical systems in solution or in enzymes.
Article
Rather than numerically solving the computationally demanding equations of quantum or statistical mechanics, machine learning methods can infer approximate solutions, interpolating previously acquired property data sets of molecules and materials. The case is made for quantum machine learning: An inductive molecular modeling approach which can be applied to quantum chemistry problems.
Article
How machine learning and big data are helping chemists search the vast chemical universe for better medicines.
Article
We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to ∼118k distinct molecules. Molecular structures and properties at hybrid density functional theory (DFT) level of theory come from the QM9 database [Ramakrishnan et al., Scientific Data 1, 140022 (2014)] and include enthalpies and free energies of atomization, HOMO/LUMO energies and gap, dipole moment, polarizability, zero point vibrational energy, heat capacity, and the highest fundamental vibrational frequency. Various molecular representations have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution-based variants including histograms of distances (HD), angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR), and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). Out-of-sample errors are strongly dependent on the choice of representation, regressor, and molecular property. Electronic properties are typically best accounted for by MG and GC, while energetic properties are better described by HDAD and KRR. The specific combinations with the lowest out-of-sample errors in the ∼118k training set size limit are (free) energies and enthalpies of atomization (HDAD/KRR), HOMO/LUMO eigenvalue and gap (MG/GC), dipole moment (MG/GC), static polarizability (MG/GG), zero point vibrational energy (HDAD/KRR), heat capacity at room temperature (HDAD/KRR), and highest fundamental vibrational frequency (BAML/RF).
We present numerical evidence that ML model predictions deviate from DFT (B3LYP) less than DFT (B3LYP) deviates from experiment for all properties. Furthermore, out-of-sample prediction errors with respect to the hybrid DFT reference are on par with, or close to, chemical accuracy. The results suggest that ML models could be more accurate than hybrid DFT if explicitly electron-correlated quantum (or experimental) data were available.
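A learning curve of the kind discussed above (out-of-sample error as a function of training set size) can be generated with a minimal KRR sketch. The 1D toy data and hyperparameters below are illustrative assumptions, not the QM9 setup of the paper:

```python
import numpy as np

def k(A, B, sigma=0.5):
    # Gaussian kernel between rows of A and B
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (400, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * X[:, 0] ** 2
Xq = np.linspace(-1.9, 1.9, 200).reshape(-1, 1)       # out-of-sample grid
yq = np.sin(2 * Xq[:, 0]) + 0.1 * Xq[:, 0] ** 2

maes = []
for n in (25, 50, 100, 200, 400):                     # nested training sets
    a = np.linalg.solve(k(X[:n], X[:n]) + 1e-6 * np.eye(n), y[:n])
    maes.append(np.mean(np.abs(k(Xq, X[:n]) @ a - yq)))
```

On a log-log plot, MAE versus training set size is roughly a straight line with negative slope, which is the signature shape of the learning curves reported in such studies.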
Article
We present an efficient approach for generating highly accurate molecular potential energy surfaces (PESs) using self-correcting, kernel ridge regression (KRR) based machine learning (ML). We introduce structure-based sampling to automatically assign nuclear configurations from a pre-defined grid to the training and prediction sets, respectively. Accurate high-level ab initio energies are required only for the points in the training set, while the energies for the remaining points are provided by the ML model with negligible computational cost. The proposed sampling procedure is shown to be superior to random sampling and also eliminates the need for training several ML models. Self-correcting machine learning has been implemented such that each additional layer corrects errors from the previous layer. The performance of our approach is demonstrated in a case study on a published high-level ab initio PES of methyl chloride with 44 819 points. The ML model is trained on sets of different sizes and then used to predict the energies for tens of thousands of nuclear configurations within seconds. The resulting datasets are utilized in variational calculations of the vibrational energy levels of CH3Cl. By using both structure-based sampling and self-correction, the size of the training set can be kept small (e.g., 10% of the points) without any significant loss of accuracy. In ab initio rovibrational spectroscopy, it is thus possible to reduce the number of computationally costly electronic structure calculations through structure-based sampling and self-correcting KRR-based machine learning by up to 90%.
Chapter
A number of machine learning (ML) studies have appeared with the commonality that quantum mechanical properties are being predicted based on regression models defined in chemical compound space (CCS). The quantum mechanical framework is crucial for the unbiased exploration of CCS since it enables, at least in principle, the free variation of nuclear charges, atomic weights, atomic configurations, and electron number. This chapter first gives a brief tutorial summary of the employed ML model in Kernel Ridge Regression. A discussion on the various representations (descriptors) used to encode molecular species, in particular the molecular Coulomb-matrix (CM), sorted or its eigenvalues follows. The chapter also reviews quantum chemistry data of 134k molecules. The local, linearly scaling ML models for atomic properties such as forces on atoms, nuclear magnetic resonance (NMR) shifts, core-electron ionization energies, as well as atomic charges, dipole-moments, and quadrupole-moments for force-field predictions are finally discussed.
Article
We present a multi-fidelity co-kriging statistical learning framework that combines variable-fidelity quantum mechanical calculations of bandgaps to generate a machine-learned model that enables low-cost accurate predictions of the bandgaps at the highest fidelity level. In addition, the adopted Gaussian process regression formulation allows us to predict the underlying uncertainties as a measure of our confidence in the predictions. Using a set of 600 elpasolite compounds as an example dataset and using semi-local and hybrid exchange correlation functionals within density functional theory as two levels of fidelities, we demonstrate the excellent learning performance of the method against actual high fidelity quantum mechanical calculations of the bandgaps. The presented statistical learning method is not restricted to bandgaps or electronic structure methods and extends the utility of high throughput property predictions in a significant way.
Article
The training of molecular models of quantum mechanical properties based on statistical machine learning requires large datasets which exemplify the map from chemical structure to molecular property. Intelligent a priori selection of training examples is often difficult or impossible to achieve, as prior knowledge may be sparse or unavailable. Ordinarily, representative selection of training molecules from such datasets is achieved through random sampling. We use genetic algorithms for the optimization of training set composition consisting of tens of thousands of small organic molecules. The resulting machine learning models are considerably more accurate than those built from small randomly selected training sets: mean absolute errors for out-of-sample predictions are reduced to ~25% for enthalpies, free energies, and zero-point vibrational energy, to ~50% for heat capacity, electron spread, and polarizability, and by more than ~20% for electronic properties such as frontier orbital eigenvalues or dipole moments. We discuss and present optimized training sets consisting of 10 molecular classes for all molecular properties studied. We show that these classes can be used to design improved training sets for the generation of machine learning models of the same properties in similar but unrelated molecular sets.
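Genetic optimization of training set composition can be sketched compactly: individuals are index subsets, fitness is the validation error of a KRR model trained on the subset, and crossover merges two parent subsets. The toy 1D data, operators, and parameters below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def k(A, B, sigma=0.5):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))

def krr_mae(idx, X, y, Xv, yv, lam=1e-6):
    # fitness: validation MAE of a KRR model trained on the subset idx
    Xt, yt = X[idx], y[idx]
    a = np.linalg.solve(k(Xt, Xt) + lam * np.eye(len(Xt)), yt)
    return np.mean(np.abs(k(Xv, Xt) @ a - yv))

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (300, 1)); y = np.sin(2 * X[:, 0])
Xv = np.linspace(-2, 2, 100).reshape(-1, 1); yv = np.sin(2 * Xv[:, 0])

n_train, pop_size, n_gen = 15, 20, 30
pop = [rng.choice(300, n_train, replace=False) for _ in range(pop_size)]
fit = np.array([krr_mae(p, X, y, Xv, yv) for p in pop])
initial_best = fit.min()
for gen in range(n_gen):
    order = np.argsort(fit)
    pop = [pop[i] for i in order]; fit = fit[order]   # elitism: keep best half
    for j in range(pop_size // 2, pop_size):          # replace the worst half
        p1 = pop[rng.integers(pop_size // 2)]
        p2 = pop[rng.integers(pop_size // 2)]
        child = rng.choice(np.union1d(p1, p2), n_train, replace=False)
        if rng.random() < 0.3:                        # mutation: swap one index
            child[rng.integers(n_train)] = rng.integers(300)
        pop[j] = child
        fit[j] = krr_mae(child, X, y, Xv, yv)
final_best = fit.min()
```

Elitism guarantees the best validation error never worsens across generations, which is what makes the selection pressure effective.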
Article
The predictive accuracy of Machine Learning (ML) models of molecular properties depends on the choice of the molecular representation. Inspired by the postulates of quantum mechanics, we introduce a hierarchy of representations which meet uniqueness and target similarity criteria. To systematically control target similarity, we simply rely on interatomic many body expansions, as implemented in universal force-fields, including Bonding, Angular (BA), and higher order terms. Addition of higher order contributions systematically increases similarity to the true potential energy and predictive accuracy of the resulting ML models. We report numerical evidence for the performance of BAML models trained on molecular properties pre-calculated at electron-correlated and density functional theory level of theory for thousands of small organic molecules. Properties studied include enthalpies and free energies of atomization, heat capacity, zero-point vibrational energies, dipole-moment, polarizability, HOMO/LUMO energies and gap, ionization potential, electron affinity, and electronic excitations. After training, BAML predicts energies or electronic properties of out-of-sample molecules with unprecedented accuracy and speed.
Article
Sparse tensor product spaces provide an efficient tool to discretize higher dimensional operator equations. The direct Galerkin method in such ansatz spaces may employ hierarchical bases, interpolets, wavelets, or multilevel frames. An alternative approach is provided by the so-called combination technique, which properly combines the Galerkin solutions of the underlying problem on certain full (but small) tensor product spaces. So far, however, the combination technique has been analyzed only for special model problems. In the present paper, we provide the analysis of the combination technique for quite general operator equations in sparse tensor product spaces. We prove that the combination technique produces the same order of convergence as the Galerkin approximation with respect to the sparse tensor product space. Furthermore, the order of the cost complexity is the same as for the Galerkin approach in the sparse tensor product space. Our theoretical findings are validated by numerical experiments.
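The combination technique at the heart of the CQML scheme can be illustrated in its simplest setting: 2D tensor-product trapezoidal quadrature, where solutions on anisotropic full grids along two diagonals of the level lattice are combined as $Q_n = \sum_{i+j=n} Q_{i,j} - \sum_{i+j=n-1} Q_{i,j}$. This is a generic sketch of the classical combination formula, not the operator-equation analysis of the paper:

```python
import math

def trap_nodes(level):
    # composite trapezoidal rule on [0, 1] with 2**level intervals
    n = 2**level
    h = 1.0 / n
    nodes = [i * h for i in range(n + 1)]
    weights = [h * (0.5 if i in (0, n) else 1.0) for i in range(n + 1)]
    return nodes, weights

def tensor_quad(f, lx, ly):
    # full (anisotropic) tensor-product rule at levels (lx, ly)
    xs, wx = trap_nodes(lx)
    ys, wy = trap_nodes(ly)
    return sum(wi * wj * f(xi, yj)
               for xi, wi in zip(xs, wx) for yj, wj in zip(ys, wy))

def combination_quad(f, n):
    # Q_n = sum_{i+j=n} Q_{i,j} - sum_{i+j=n-1} Q_{i,j}
    total = sum(tensor_quad(f, i, n - i) for i in range(n + 1))
    total -= sum(tensor_quad(f, i, n - 1 - i) for i in range(n))
    return total
```

For $n = 10$ the combination uses roughly 2 x 10^4 function evaluations versus about 10^6 for the full isotropic grid of level (10, 10), while the dominant error terms cancel across the two diagonals.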
Article
Here, the employment of multilayer perceptrons, a type of artificial neural network, is proposed as part of a computational funneling procedure for high-throughput organic materials design. Through the use of state-of-the-art algorithms and a large amount of data extracted from the Harvard Clean Energy Project, it is demonstrated that these methods allow a great reduction in the fraction of the screening library that must actually be calculated. Neural networks can reproduce the results of quantum-chemical calculations with a high level of accuracy. The proposed approach enables large-scale molecular screening projects to be carried out with less computational time. This, in turn, allows for the exploration of increasingly large and diverse libraries.
Article
We introduce machine learning models of quantum mechanical observables of atoms in molecules. Instant out-of-sample predictions for proton and carbon nuclear chemical shifts, atomic core level excitations, and forces on atoms reach accuracies on par with density functional theory.