Article

QDπ: A Quantum Deep Potential Interaction Model for Drug Discovery

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

We report QDπ-v1.0 for modeling the internal energy of drug molecules containing H, C, N, and O atoms. The QDπ model is in the form of a quantum mechanical/machine learning potential correction (QM/Δ-MLP) that uses a fast third-order self-consistent density-functional tight-binding (DFTB3/3OB) model that is corrected to a quantitatively high level of accuracy through a deep-learning potential (DeepPot-SE). The model has the advantage that it is able to properly treat electrostatic interactions and handle changes in charge/protonation states. The model is trained against reference data computed at the ωB97X/6-31G* level (as in the ANI-1x data set) and compared to several other approximate semiempirical and machine learning potentials (ANI-1x, ANI-2x, DFTB3, MNDO/d, AM1, PM6, GFN1-xTB, and GFN2-xTB). The QDπ model is demonstrated to be accurate for a wide range of intra- and intermolecular interactions (despite its intended use as an internal energy model) and has been shown to perform exceptionally well for relative protonation/deprotonation energies and tautomers. An example application to model reactions involved in RNA strand cleavage catalyzed by protein and nucleic acid enzymes illustrates that QDπ has average errors less than 0.5 kcal/mol, whereas the other models compared have errors over an order of magnitude greater. Taken together, this makes QDπ highly attractive as a potential force field model for drug discovery.
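The correction idea described in the abstract, a fast base model plus a learned term fitted to the difference from a higher-level reference, can be illustrated with a toy one-dimensional sketch. Everything here is an assumption for illustration: `base_energy` and `reference_energy` are analytic stand-ins (not DFTB3/3OB or ωB97X/6-31G*), and a polynomial fit takes the place of the DeepPot-SE network.

```python
import numpy as np

def base_energy(r):
    """Cheap stand-in for the low-level model (harmonic well)."""
    return 0.5 * (r - 1.0) ** 2

def reference_energy(r):
    """Stand-in for the high-level reference (Morse-like well)."""
    return (1.0 - np.exp(-(r - 1.0))) ** 2

# Training geometries along a single "bond length" coordinate.
r_train = np.linspace(0.6, 2.0, 40)

# The correction is fitted to the difference, not to the total energy.
delta_train = reference_energy(r_train) - base_energy(r_train)
coeffs = np.polyfit(r_train, delta_train, deg=6)  # polynomial in place of a neural network

def corrected_energy(r):
    """Base model plus the fitted correction."""
    return base_energy(r) + np.polyval(coeffs, r)

# Compare both models with the reference on held-out geometries.
r_test = np.linspace(0.65, 1.95, 25)
err_base = np.max(np.abs(base_energy(r_test) - reference_energy(r_test)))
err_corr = np.max(np.abs(corrected_energy(r_test) - reference_energy(r_test)))
print(f"max |error| base: {err_base:.4f}, corrected: {err_corr:.6f}")
```

Because the base model already captures most of the energy, the correction only has to represent a small, smooth residual, which is the usual motivation for fitting the difference rather than the total.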


... Many of the molecules encountered in the screening process may not have well-established, mature molecular mechanical force field parameters, and some may not have ever been synthesized before. This has led to interest in training universal MLP models that accurately reproduce ab initio energies and forces for a large diversity of molecules 8-14 and chemical systems 15,16 by considering the enormous number of possible atomic permutations, combinations, and conformational isomers 17,18. The training of universal MLP models therefore requires extensive and accurate datasets that sample the diverse chemical space of organic and drug-like molecules. ...
... The QDπ dataset contains 1.6 million molecular structures to express the chemical diversity of 13 elements, and the energies and forces are calculated with the accurate ωB97M-D3(BJ)/def2-TZVPPD method. Molecular conformations were taken from various source datasets including SPICE 35, ANI 12,38, GEOM 18, FreeSolv 43, RE 14, and COMP6 36. We describe several strategies that were used to select structures in a manner that maximizes the chemical diversity while minimizing the number of expensive ab initio evaluations. ...
... quantum mechanical (SQM)/Δ MLP model 14. A SQM/Δ MLP model supplements a standard semiempirical QM (or QM/MM) calculation with a MLP that is trained to reproduce the difference between SQM and ab initio energies and forces. ...
Article
Full-text available
The development of universal machine learning potentials (MLP) for small organic and drug-like molecules requires large, accurate datasets that span diverse chemical spaces. In this study, we introduce the QDπ dataset, which incorporates data taken from several datasets. To construct the QDπ dataset, we use a query-by-committee active learning strategy to extract data from large datasets in a way that maximizes diversity and avoids redundancy, as is relevant for neural network training. The QDπ dataset requires only 1.6 million structures to express the chemical diversity of 13 elements from the various source datasets at the ωB97M-D3(BJ)/def2-TZVPPD level of theory. The QDπ dataset enables creation of flexible target loss functions for neural network training relevant to drug discovery, including information-dense data sets of relative conformational energies and barriers, intermolecular interactions, tautomers and relative protonation energies of drug-like compounds and biomolecular fragments. We hope that the high chemical information density and diversity contained in the QDπ dataset will provide a valuable resource for the development of new universal MLPs for drug discovery.
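A query-by-committee selection step of the kind mentioned above can be sketched as follows: several independently trained models predict the same quantity for each candidate structure, and only structures on which the committee disagrees strongly are kept for labeling. This is a minimal sketch under stated assumptions; the prediction values, the use of the standard deviation as the disagreement measure, and the threshold are illustrative choices, not the criteria used to build the QDπ dataset.

```python
import numpy as np

def committee_deviation(predictions):
    """Standard deviation across committee members for each structure.

    predictions: array of shape (n_models, n_structures).
    """
    return np.std(predictions, axis=0)

def select_by_committee(predictions, threshold):
    """Indices of structures whose committee disagreement exceeds the threshold."""
    return np.flatnonzero(committee_deviation(predictions) > threshold)

# Toy energies (arbitrary units) from a 4-model committee on 5 candidate structures.
preds = np.array([
    [1.00, 2.00, 3.0, 4.00, 5.0],
    [1.01, 2.00, 3.4, 4.01, 5.9],
    [0.99, 2.01, 2.6, 4.00, 4.1],
    [1.00, 1.99, 3.0, 3.99, 5.0],
])

# Hypothetical threshold: only structures 2 and 4 show large disagreement.
selected = select_by_committee(preds, threshold=0.1)
print(selected)
```

Structures on which all models already agree add little new information, so skipping them is what keeps the number of expensive reference calculations down.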
... These three methods are already sufficient to construct a conventional simulation context employing the DP model within OpenMM, although the third method is not always needed. Recent studies, especially in the realm of biomolecular systems, show that most applications of MLPs in molecular dynamics simulations are based on hybrid MLP/MM frameworks [48,61-65]. In such schemes, the particles described by the MLP models constitute a subset of the entire system. For implementing the hybrid MLP/MM scheme in OpenMM, the DeepPotentialModel class offers two optional methods that allow users to specify which particles will be input to the DP model and generate the Force object designated for integration with classical force fields. ...
... Different selection methods within the hybrid DP/MM model cater to diverse application needs. For example, the DP model with pre-selected particles can be deployed either independently [65,91-93] or alongside QM/MM methods [48,63,64] to depict the intramolecular interactions for the subset of interest. When used in isolation, interactions among these particles modeled by classical force fields can be disregarded prior to simulation. ...
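The bookkeeping behind such a hybrid scheme can be sketched with plain NumPy: classical pair terms are evaluated for every pair except those lying entirely inside the MLP subset, and the MLP supplies the energy of that subset. This is a toy sketch, not the OpenMM plugin's API; `lj_pair`, `toy_mlp_energy`, and all parameters are hypothetical stand-ins chosen for illustration.

```python
import itertools
import numpy as np

def lj_pair(r, eps=0.2, sigma=1.0):
    """Lennard-Jones pair energy used as a stand-in classical force field."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def mm_energy(coords, skip_within=frozenset()):
    """Sum of pair energies, skipping pairs with both atoms in skip_within."""
    total = 0.0
    for i, j in itertools.combinations(range(len(coords)), 2):
        if i in skip_within and j in skip_within:
            continue  # this pair is handled by the MLP instead
        total += lj_pair(np.linalg.norm(coords[i] - coords[j]))
    return total

def toy_mlp_energy(sub_coords):
    """Placeholder for a trained MLP: harmonic bonds between consecutive atoms."""
    d = np.linalg.norm(np.diff(sub_coords, axis=0), axis=1)
    return float(np.sum(0.5 * 10.0 * (d - 1.1) ** 2))

def hybrid_energy(coords, mlp_atoms):
    """MLP energy for the selected subset plus classical terms for all other pairs."""
    idx = sorted(mlp_atoms)
    return mm_energy(coords, frozenset(idx)) + toy_mlp_energy(coords[idx])

coords = np.array([
    [0.0, 0.0, 0.0],
    [1.2, 0.0, 0.0],
    [2.4, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [3.0, 3.0, 0.0],
])
e_hybrid = hybrid_energy(coords, mlp_atoms={0, 1, 2})
print(f"hybrid energy: {e_hybrid:.4f}")
```

Skipping the intra-subset pairs up front gives the same total as evaluating the full classical energy and subtracting the subset's classical energy afterwards, while avoiding the double counting that would occur if both descriptions were applied to the same pairs.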
... However, the flexible design of MLP architectures and frameworks means that there is no one-size-fits-all approach for their integration with simulation software. Furthermore, with the community proposing new application scenarios for MLPs, especially hybrid MLP/MM models [48,61-65,94], an efficient and adaptable approach for embedding MLPs alongside other force fields within MD engines becomes essential. Here, the integration of MLPs with MD software is well illustrated by the implementation and validation of the OpenMM Deepmd plugin. ...
Article
Full-text available
Machine learning potentials, particularly the deep potential (DP) model, have revolutionized molecular dynamics (MD) simulations, striking a balance between accuracy and computational efficiency. To facilitate the DP model’s integration with the popular MD engine OpenMM, we have developed a versatile OpenMM plugin. This plugin supports a range of applications, from conventional MD simulations to alchemical free energy calculations and hybrid DP/MM simulations. Our extensive validation tests encompassed energy conservation in microcanonical ensemble simulations, fidelity in canonical ensemble generation, and the evaluation of the structural, transport, and thermodynamic properties of bulk water. The introduction of this plugin is expected to significantly expand the application scope of DP models within the MD simulation community, representing a major advancement in the field.
... The DeePMD-kit implements a series of MLP models known as Deep Potential (DP) models, 9,10,50-54 which have been widely adopted in the fields of physics, chemistry, biology, and material science for studying a broad range of atomistic systems. These systems include metallic materials, 55 non-metallic inorganic materials, 56-60 water, 61-71 organic systems, 10,72 solutions, 52,73-76 gas-phase systems, 77-80 macromolecular systems, 81,82 and interfaces. 83-87 Furthermore, the DeePMD-kit is capable of simulating systems containing almost all Periodic Table elements, 51 operating under a wide range of temperature and pressure, 88 and can handle drug-like molecules, 72,89 ions, 73,76 transition states, 75,77 and excited states. 90 As a result, the DeePMD-kit is a powerful and versatile tool that can be used to simulate a wide range of atomistic systems. ...
... The trained DPRc model with a 6 Å range-correction was applied to simulate RNA 2′-O-transphosphorylation reactions in solution over long timescales 75 and obtain better free energy estimates with the help of the generalization of the weighted thermodynamic perturbation (gwTP) method. 100 Very recently, Zeng et al. 72 have trained a Δ-MLP correction model called Quantum Deep Potential Interaction (QDπ) for drug-like molecules, including tautomeric forms and protonation states, which was found to be superior to other semiempirical methods and pure MLP models. 89 The third important application is large-scale reactive MD simulations over a nanosecond time scale, which enable the construction of interwoven reaction networks for complex reactive systems 101 instead of focusing on studying a single reaction. ...
Article
Full-text available
DeePMD-kit is a powerful open-source software package that facilitates molecular dynamics simulations using machine learning potentials known as Deep Potential (DP) models. This package, which was released in 2017, has been widely used in the fields of physics, chemistry, biology, and material science for studying atomistic systems. The current version of DeePMD-kit offers numerous advanced features, such as DeepPot-SE, attention-based and hybrid descriptors, the ability to fit tensorial properties, type embedding, model deviation, DP-range correction, DP long range, graphics processing unit support for customized operators, model compression, non-von Neumann molecular dynamics, and improved usability, including documentation, compiled binary packages, graphical user interfaces, and application programming interfaces. This article presents an overview of the current major version of the DeePMD-kit package, highlighting its features and technical details. Additionally, this article presents a comprehensive procedure for conducting molecular dynamics as a representative application, benchmarks the accuracy and efficiency of different models, and discusses ongoing developments.
... The artificially expanded genetic information system (AEGIS) dataset also exhibits a rich set of tautomeric forms that have been studied extensively with computational methods. 77,[85][86][87] These tautomeric pairs are illustrated in Figure 6, and their ∆E values are listed in Table III and illustrated in Figure 7. ...
... QDπ-v1.0 is openly available in our GitLab repository at https://gitlab.com/RutgersLBSR/qdpi, which was previously released 77 . The data that support the findings of this study are available from the corresponding author upon reasonable request. ...
... All geometry optimizations using semiempirical QM, MLP or QM/∆-MLP models were performed using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm 125 in the ASE 126 package. Relaxed 2D torsion profiles were generated in the same way as described in Ref. 77. ...
Article
Full-text available
Modern semiempirical electronic structure methods have considerable promise in drug discovery as universal "force fields" that can reliably model biological and drug-like molecules. Herein, we compare the performance of several NDDO-based semiempirical (MNDO/d, AM1, PM6 and ODM2) and density-functional tight-binding based (DFTB3, GFN1-xTB and GFN2-xTB) models with pure machine learning potentials (ANI-1x and ANI-2x) and hybrid quantum mechanical/machine learning potentials (AIQM1 and QDπ) for a wide range of data computed at a consistent ωB97X/6-31G* level of theory (as in the ANI-1x database). This data includes conformational energies, intermolecular interactions, tautomers, and protonation states. Additional comparisons are made to a set of natural and synthetic nucleic acids from the artificially expanded genetic information system (AEGIS). This dataset has important implications in the design of new biotechnology and therapeutics. Finally, we examine acid/base chemistry relevant for RNA cleavage reactions catalyzed by small nucleolytic ribozymes and ribonucleases. Overall, the recently developed QDπ model performs exceptionally well across all datasets, having especially high accuracy for tautomers and protonation states relevant to drug discovery.
... For example, a difference or correction potential $\delta U_{L,H}(\mathbf{R})$ (to distinguish from the "energy gap," $\Delta U_{L \to H}(\mathbf{R})$, in the free energy perturbation expression above and below) is learned based on snapshots from L simulations. Such "δ-learning" can be carried out with either force-matching 80,81 or machine learning 82-89 approaches; it can be done with one cycle, or iteratively to update the L level with multiple rounds of update/re-sampling. For relatively small regions of interest (e.g., tens of QM atoms in QM/MM simulations), the δ-learning can be effective, and the number of snapshots required to improve the low level depends on the difference between L and H as well as the architecture of the ML model. ...
... Along this line, the mapping strategy discussed in the work of Rizzi et al. 126,127 can be adopted in the framework presented here to maximize the data efficiency of high-level energy and force calculations. Once optimized, the efficiency of such multi-level free energy simulations should be analyzed in comparison to the alternative approach of δ-learning 82-89 for improving the accuracy of free energy calculations. ...
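The excerpts above refer to a free energy perturbation expression without reproducing it. A standard form is the Zwanzig relation, $\Delta A_{L \to H} = -k_B T \ln \langle \exp(-\Delta U / k_B T) \rangle_L$, where the average runs over configurations sampled at the low level L and $\Delta U$ is the high-minus-low energy gap. Whether the cited work uses exactly this form is an assumption. The sketch below evaluates the estimator on synthetic Gaussian gaps (not simulation data); the function name and units are illustrative choices.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def fep_free_energy(delta_u, temperature=300.0):
    """Zwanzig estimate of A_H - A_L from gaps U_H - U_L sampled at level L.

    delta_u: 1D array of energy gaps in kcal/mol.
    """
    beta = 1.0 / (K_B * temperature)
    x = -beta * np.asarray(delta_u, dtype=float)
    x_max = np.max(x)
    # Log-mean-exp with the maximum factored out for numerical stability.
    log_mean = x_max + np.log(np.mean(np.exp(x - x_max)))
    return -log_mean / beta

# Sanity check: a constant gap is returned unchanged.
constant_case = fep_free_energy(np.full(100, 1.5))

# Synthetic Gaussian gaps: the exact answer is mean - beta * variance / 2.
rng = np.random.default_rng(0)
gaps = rng.normal(loc=2.0, scale=0.3, size=200_000)
estimate = fep_free_energy(gaps)
exact = 2.0 - 0.3 ** 2 / (2.0 * K_B * 300.0)
print(f"estimate: {estimate:.4f} kcal/mol, Gaussian limit: {exact:.4f} kcal/mol")
```

The exponential average is dominated by configurations with small gaps, so the estimate converges slowly when the two levels differ strongly; a learned correction that narrows the gap distribution reduces this problem.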
Article
Full-text available
Machine learning (ML) techniques have been making major impacts on all areas of science and engineering, including biophysics. In this review, we discuss several applications of ML to biophysical problems based on our recent research. The topics include the use of ML techniques to identify hotspot residues in allosteric proteins using deep mutational scanning data and to analyze how mutations of these hotspots perturb co-operativity in the framework of a statistical thermodynamic model, to improve the accuracy of free energy simulations by integrating data from different levels of potential energy functions, and to determine the phase transition temperature of lipid membranes. Through these examples, we illustrate the unique value of ML in extracting patterns or parameters from complex data sets, as well as the remaining limitations. By implementing the ML approaches in the context of physically motivated models or computational frameworks, we are able to gain a deeper mechanistic understanding or better convergence in numerical simulations. We conclude by briefly discussing how the introduced models can be further expanded to tackle more complex problems.
... Through the unique properties of qubits, quantum computers can offer unprecedented computational power and capabilities, leading to breakthroughs in solving complex problems and driving innovation across multiple sectors. In binary informa- ...
[Table fragment; first and last rows truncated] ... [54], 2023, drug discovery; quantum learning inspired by AlexNet [63], 2023, categorization of brain disorder symptoms; quantum fruit fly algorithm inspired by ResNet50-VGG16 [64], 2023, categorization of medical diseases; transfer-learning-based deep learning protocol [65,71], 2023, image classification and protein calculations for high-precision prediction in large chemical and biological systems; quantum support vector classifiers (QSVC) and variational quantum classifiers (VQC) [68,88,96], 2024, ...
... In a study conducted by researchers [89], an MRI-radiomics variational quantum neural network (QNN) was employed to construct a brain tumor model using an MRI dataset. To address this task, the researchers utilized the mutual information feature selection (MIFS) method, which transformed the problem into a stochastic optimization task. ...
[Table fragment] ... [51], Quantum Fruit Fly Algorithm [64], quantum support vector classifiers (QSVC) [21] and variational quantum classifiers (VQC) [96], quantum heuristic algorithm [94]. Quantum Deep Neural Networks: Quantum Deep Potential Interaction Model [45,54], Quantum Learning through Alex-Net [63], Quantum convolutional neural networks [78,91,97], Quantum orthogonal neural networks [75], MRI-radiomics variational Quantum Neural Network [76], Quantum-LSTM contrastive learning [92], Hybrid classical-quantum transfer learning [72,81], Quantum Single Layer Perceptron [84], Quantum photonic convolutional neural network [87,90], Heap Based Optimization with Deep Quantum Neural Network [86], Quantum ReLU activation function for CNN [101], Quantum Self-Supervised Network [102], 3D Quantum-inspired Self-supervised Tensor Network [103], IoTsspiro and fuzzy-based quantum neural network system [105].
Preprint
Full-text available
The working environment in healthcare analytics is transforming with the emergence of healthcare 5.0 and the advancements in quantum neural networks. In addition to analyzing a comprehensive set of case studies, we also review relevant literature from the fields of quantum computing applications and smart healthcare analytics, focusing on the implications of quantum deep neural networks. This study aims to shed light on the existing research gaps regarding the implications of quantum neural networks in healthcare analytics. We argue that the healthcare industry is currently transitioning from automation towards genuine collaboration with quantum networks, which presents new avenues for research and exploration. Specifically, this study focuses on evaluating the performance of Healthcare 5.0, which involves the integration of diverse quantum machine learning and quantum neural network systems. This study also explores a range of potential challenges and future directions for Healthcare 5.0, particularly focusing on the integration of quantum neural networks.
... DeePMD-kit implements a series of MLP models known as Deep Potential (DP) models, 9,10,41-45 which have been widely adopted in the fields of physics, chemistry, biology, and material science for studying a broad range of atomistic systems. These systems include metallic materials 46-61, non-metallic inorganic materials 62-66, water 67-77, organic systems, 10,78 solutions 43,79-82, gas-phase systems 83-86, macromolecular systems, 87,88 and interfaces 89-93. ...
... Furthermore, DeePMD-kit is capable of simulating systems containing almost all periodic table elements 42, operating under a wide range of temperature and pressure, 94 and can handle drug-like molecules, 78,95 ions, 79,82 transition states, 81,83 and excited states. 96 As a result, DeePMD-kit is a powerful and versatile tool that can be used to simulate a wide range of atomistic systems. ...
Preprint
Full-text available
DeePMD-kit is a powerful open-source software package that facilitates molecular dynamics simulations using machine learning potentials (MLP) known as Deep Potential (DP) models. This package, which was released in 2017, has been widely used in the fields of physics, chemistry, biology, and material science for studying atomistic systems. The current version of DeePMD-kit offers numerous advanced features such as DeepPot-SE, attention-based and hybrid descriptors, the ability to fit tensorial properties, type embedding, model deviation, Deep Potential - Range Correction (DPRc), Deep Potential Long Range (DPLR), GPU support for customized operators, model compression, non-von Neumann molecular dynamics (NVNMD), and improved usability, including documentation, compiled binary packages, graphical user interfaces (GUI), and application programming interfaces (API). This article presents an overview of the current major version of the DeePMD-kit package, highlighting its features and technical details. Additionally, the article benchmarks the accuracy and efficiency of different models and discusses ongoing developments.
... While hybrid QNN and QFT-based QNN models effectively learned data distribution with improved convergence, quantum speedups were limited by state preparation and readout limitations. In [255], authors introduced Quantum Deep-learning Potential Interaction (QDπ)-v1.0, a quantum machine learning model for predicting drug molecule internal energy with high precision, which excelled in handling charge variations and interactions critical for drug discovery. ...
Article
Full-text available
This study presents a comprehensive survey on Quantum Machine Learning (QML) along with its current status, challenges, and perspectives. QML combines quantum computing and machine learning to solve complex problems in different domains, leveraging quantum algorithms to enhance classical machine learning techniques. We explore the application of QML in various domains such as cybersecurity, finance, healthcare, and drug discovery. The survey includes detailed tabular comparisons of the different QML models used for each application area, highlighting key techniques, findings, and their limitations. In this work, we identify important trends such as the strong potential of hybrid quantum-classical models for near-term applications and the significant challenges in the quantum domain due to quantum noise, limited qubit scalability, and costly qRAM implementations. Furthermore, we discuss solutions that emphasize advances in hardware, quantum error correction, and algorithmic innovations to address these challenges. By providing an in-depth analysis of QML’s potential across different fields, this study provides valuable insights into how QML can address complex real-world challenges and transform traditional machine learning practices.
... The range corrected QM/MM-∆MLP strategy uses a neural network to introduce short-range nonelectrostatic corrections to an inexpensive (semiempirical) QM/MM base model to reproduce target ab initio QM/MM energies and forces. The DeePMD-GNN plugin greatly extends the capability of recently developed interoperable software infrastructure 38,39 within Amber 40 for design of next-generation QM/MM-∆MLP models and their application to biochemical reactions 41,42 and drug discovery. 43-45 The new software interfaces are demonstrated by comparing benchmark calculations of NequIP, 28 MACE, 29 and DPA-2 24 models developed with a consistent training strategy. ...
... 25,26 In recent years, atomistic modeling based on MLPs has also been increasingly applied in many fields, such as fuel combustion, materials chemistry, and biochemistry. [27][28][29][30][31] In this work, we develop a deep neural network potential (NNP) of FOX-7 and use it to simulate the decomposition behaviors of FOX-7 at different densities. Meanwhile, combined with high-precision DFT calculations, we construct a complete decomposition reaction network of FOX-7. ...
Article
Full-text available
Condensed phase explosives typically contain defects such as voids, bubbles, and pores; this heterogeneity facilitates the formation of hot spots and triggers decomposition reactions at low densities. The study of the thermal decomposition mechanisms of explosives at different densities has thus attracted considerable research interest. Gaining a deeper insight into these mechanisms would be helpful for elucidating the detonation processes of explosives. In this work, we developed an ab initio neural network potential for the FOX-7 system using a machine learning method. Extensive large-scale (1008 atoms) and long-duration (nanosecond timescale) deep potential molecular dynamics simulations at different densities were performed to investigate the effect of the density on the thermal decomposition mechanism. The results indicate that the initial reaction pathway of the FOX-7 explosive is the cleavage of the C–NO2 bond at all densities studied, while the frequency of C–NO2 bond cleavage decreases at higher density. Increasing the initial density of FOX-7 significantly increases the reaction rate during the initial decomposition and the formation of final products. However, it leads to a decrease in released heat and has minimal impact on the decomposition temperature. In addition, by analyzing the molecular dynamics trajectories and conducting quantum chemical calculations, we identified two lower-barrier pathways that produce CO2 and N2.
... Machine learning potentials (MLPs) have emerged as a powerful approach to modeling complex materials and molecules, bridging the gap between the high accuracy of QM methods and the computational efficiency of EFFs. This has enabled the study of large-scale molecular systems with QM-level accuracy across diverse applications, including drug discovery 3,4, materials design 5-7, and catalysis 8,9. In most MLP applications, the training data is generated from scratch either through brute force ab initio molecular dynamics (MD) simulations 10 or by using a concurrent learning (or active learning) scheme capable of automatically generating the most critical data for building uniformly accurate models 11-14. ...
... 13-15 Consequently, the utility of MLPs has been gradually expanded, embracing a diverse range of applications that include crystalline materials systems, 16-20 interfacial and inhomogeneous systems, 21-23 and complex solutions 24,25 such as those consisting of biomolecules. 26-28 The number of elements in simulation systems has also expanded from one or several to ten or even more. 29,30 The recent emergence of MACE-OFF23 31 and DPA-2 32 models, which were trained on richer datasets for a wide range of organic and material systems, respectively, fully demonstrates the broad application potential of MLPs. ...
Article
Full-text available
Machine learning potentials (MLPs) are promising for various chemical systems, but their complexity and lack of physical interpretability challenge their broad applicability. This study evaluates the transferability of the deep potential (DP) and neural equivariant interatomic potential (NequIP) models for graphene–water systems using numerical metrics and physical characteristics. We found that the data quality from density functional theory calculations significantly influences MLP predictive accuracy. Prediction errors in transferring systems reveal the particularities of quantum chemical calculations on the heterogeneous graphene–water systems. Even for supercells with non-planar graphene carbon atoms, a k-point mesh is necessary to obtain accurate results. In contrast, gamma-point calculations are sufficiently accurate for water molecules. In addition, we performed molecular dynamics (MD) simulations using these two models and compared physical features such as atomic density profiles, radial distribution functions, and self-diffusion coefficients. It was found that although the NequIP model has higher accuracy than the DP model, the differences in the above physical features between them were not significant. Considering the stochasticity and complexity inherent in simulations, as well as the statistical averaging of physical characteristics, this motivates us to explore what accurately predicting atomic forces means for aligning the physical characteristics produced by MD simulations with the actual physical features.
... 18-23 Of particular relevance to the current work is the development of QM/MM-∆MLP models, whereby the energies and forces of a fast, approximate QM model are corrected with a machine-learning potential. 20,24-30 These models have the potential to offer the computational efficiency needed to address complex chemical mechanisms that require sampling of high-dimensional free energy surfaces, while providing accuracy comparable to high level QM methods. A barrier to progress in the development and validation of such methods is their availability in flexible software packages that enable a wide range of applications in the condensed phase. ...
Article
Full-text available
We report the development and testing of new integrated cyberinfrastructure for performing free energy simulations with generalized hybrid quantum mechanical/molecular mechanical (QM/MM) and machine learning potentials (MLPs) in Amber. The Sander molecular dynamics program has been extended to leverage fast, density-functional tight-binding models implemented in the DFTB+ and xTB packages, and an interface to the DeePMD-kit software enables the use of MLPs. The software is integrated through application program interfaces that circumvent the need to perform “system calls” and enable the incorporation of long-range Ewald electrostatics into the external software’s self-consistent field procedure. The infrastructure provides access to QM/MM models that may serve as the foundation for QM/MM–ΔMLP potentials, which supplement the semiempirical QM/MM model with a MLP correction trained to reproduce ab initio QM/MM energies and forces. Efficient optimization of minimum free energy pathways is enabled through a new surface-accelerated finite-temperature string method implemented in the FE-ToolKit package. Furthermore, we interfaced Sander with the i-PI software by implementing the socket communication protocol used in the i-PI client–server model. The new interface with i-PI allows for the treatment of nuclear quantum effects with semiempirical QM/MM–ΔMLP models. The modular interoperable software is demonstrated on proton transfer reactions in guanine-thymine mispairs in a B-form deoxyribonucleic acid helix. The current work represents a considerable advance in the development of modular software for performing free energy simulations of chemical reactions that are important in a wide range of applications.
... Machine learning potentials (MLPs) have shown particular promise in enhancing the accuracy and performance of condensed-phase simulations of chemical reactions [33,74-78]. Of particular relevance to the current work is the development of QM/MM-∆MLP models, whereby the energies and forces of a fast, approximate QM model are corrected with a machine learning potential [76,79-85]. These are described in more detail in the supporting information. ...
Article
Full-text available
Rare tautomeric forms of nucleobases can lead to Watson–Crick-like (WC-like) mispairs in DNA, but the process of proton transfer is fast and difficult to detect experimentally. NMR studies show evidence for the existence of short-time WC-like guanine–thymine (G-T) mispairs; however, the mechanism of proton transfer and the degree to which nuclear quantum effects play a role are unclear. We use a B-DNA helix exhibiting a wGT mispair as a model system to study tautomerization reactions. We perform ab initio (PBE0/6-31G*) quantum mechanical/molecular mechanical (QM/MM) simulations to examine the free energy surface for tautomerization. We demonstrate that while the ab initio QM/MM simulations are accurate, considerable sampling is required to achieve high precision in the free energy barriers. To address this problem, we develop a QM/MM machine learning potential correction (QM/MM-ΔMLP) that is able to improve the computational efficiency, greatly extend the accessible time scales of the simulations, and enable practical application of path integral molecular dynamics to examine nuclear quantum effects. We find that the inclusion of nuclear quantum effects has only a modest effect on the mechanistic pathway but leads to a considerable lowering of the free energy barrier for the GT*⇌G*T equilibrium. Our results enable a rationalization of observed experimental data and the prediction of populations of rare tautomeric forms of nucleobases and rates of their interconversion in B-DNA.
... Furthermore, computational benchmark datasets (PB20-QM, PB20-QM-8k and PB20-QM-3k) were set up to evaluate the structural reliability of 3–19k drug/inhibitor molecules computed by the DFT, MLP and/or SE methods, which will hopefully help the future development of better DFT, MLP and/or SE methods for drug/inhibitor molecules. We are also optimistic that, with the advance of more high-level and general MLPs developed by different groups [55–59], fast and reliable QR of diverse biosystems will be routinely performed on any normal desktop computer in the future. ...
Article
Biomacromolecule structures are essential for drug development and biocatalysis. Quantum refinement (QR) methods, which employ reliable quantum mechanics (QM) methods in crystallographic refinement, showed promise in improving the structural quality or even correcting the structure of biomacromolecules. However, vast computational costs and complex quantum mechanics/molecular mechanics (QM/MM) setups limit QR applications. Here we incorporate robust machine learning potentials (MLPs) in multiscale ONIOM(QM:MM) schemes to describe the core parts (e.g., drugs/inhibitors), replacing the expensive QM method. Additionally, two levels of MLPs are combined for the first time to overcome MLP limitations. Our unique MLPs+ONIOM-based QR methods achieve QM-level accuracy with significantly higher efficiency. Furthermore, our refinements provide computational evidence for the existence of bonded and nonbonded forms of the Food and Drug Administration (FDA)-approved drug nirmatrelvir in one SARS-CoV-2 main protease structure. This study highlights that powerful MLPs accelerate QRs for reliable protein–drug complexes, promote broader QR applications and provide more atomistic insights into drug development.
... We begin by investigating the performance of GFlowNet in simple, well-studied molecular systems in two dimensions: alanine dipeptide, ibuprofen, and ketorolac. 35,36 In this experiment, we aim to assess how well the proposed approach can learn to sample from the target distribution and analyze the impact of the energy estimator. ...
Article
Sampling diverse, thermodynamically feasible molecular conformations plays a crucial role in predicting properties of a molecule. In this paper we propose to use GFlowNets for sampling conformations of small molecules from the Boltzmann distribution, as determined by the molecule's energy. The proposed approach can be used in combination with energy estimation methods of different fidelity and discovers a diverse set of low-energy conformations for drug-like molecules. We demonstrate that GFlowNets can reproduce molecular potential energy surfaces by sampling proportionally to the Boltzmann distribution.
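As a minimal sketch of what "sampling proportionally to the Boltzmann distribution" targets, the snippet below converts a list of conformer energies into normalized Boltzmann populations. The function name, the kcal/mol units, and the default temperature are assumptions made for illustration; they are not taken from the paper.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def boltzmann_populations(energies_kcal, temperature=298.15):
    """Return p_i = exp(-E_i / RT) / Z for a list of energies in kcal/mol."""
    rt = R_KCAL * temperature
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / rt) for e in energies_kcal]
    z = sum(weights)
    return [w / z for w in weights]
```

A sampler that reproduces this distribution should visit each conformer with a frequency close to its population, so low-energy conformers dominate while higher-energy ones still appear occasionally.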
... Machine learning potentials (MLPs) have emerged as a powerful approach to modeling complex materials and molecules, bridging the gap between the high accuracy of QM methods and the computational efficiency of EFFs. This has enabled the study of large-scale molecular systems with QM-level accuracy across diverse applications, including drug discovery [3,4], materials design [5–7], and catalysis [8,9]. ...
Preprint
The rapid development of artificial intelligence (AI) is driving significant changes in the field of atomic modeling, simulation, and design. AI-based potential energy models have been successfully used to perform large-scale and long-time simulations with the accuracy of ab initio electronic structure methods. However, the model generation process still hinders applications at scale. We envision that the next stage would be a model-centric ecosystem, in which a large atomic model (LAM), pre-trained on as many atomic datasets as possible and efficiently fine-tuned and distilled for downstream tasks, would serve as the new infrastructure of the field of molecular modeling. We show that DPA-2 can accurately represent a diverse range of chemical systems and materials, enabling high-quality simulations and predictions with significantly reduced effort compared to traditional methods. Our approach paves the way for a universal large atomic model that can be widely applied in molecular and material simulation research, opening new opportunities for scientific discoveries and industrial applications.
... In the Behler-Parrinello neural network (BPNN) [17] and its ANI variants [18–20], for instance, symmetry functions are used to encode the local environment of each atom into a descriptor called an atomic environment vector (AEV). In the DeepPot-SE models [21–25], on the other hand, embedding neural networks are used to transform the coordinates into descriptors. These and other descriptors (such as the internal coordinates [26], Coulomb matrix [27], permutation invariant polynomial [5,14,28,29], bag of bonds [30], normalized inverted internuclear distances [31], FCHL representation [32], and weighted symmetry functions [33]) are then used as inputs to a regressor, such as a neural network or a kernel-based regressor, to predict the target molecular energy and the corresponding atomic forces. ...
Article
In the last several years, there has been a surge in the development of machine learning potential (MLP) models for describing molecular systems. We are interested in a particular area of this field — the training of system‐specific MLPs for reactive systems — with the goal of using these MLPs to accelerate free energy simulations of chemical and enzyme reactions. To help new members in our labs become familiar with the basic techniques, we have put together a self‐guided Colab tutorial ( https://cc-ats.github.io/mlp_tutorial/ ), which we expect to be also useful to other young researchers in the community. Our tutorial begins with the introduction of simple feedforward neural network (FNN) and kernel‐based (using Gaussian process regression, GPR) models by fitting the two‐dimensional Müller‐Brown potential. Subsequently, two simple descriptors are presented for extracting features of molecular systems: symmetry functions (including the ANI variant) and embedding neural networks (such as DeepPot‐SE). Lastly, these features will be fed into FNN and GPR models to reproduce the energies and forces for the molecular configurations in a Claisen rearrangement reaction.
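To make the notion of a symmetry-function descriptor concrete, here is a small sketch of a Behler-Parrinello-style radial symmetry function with a cosine cutoff. It is an illustrative assumption rather than code from the tutorial: the function names and the parameters `eta`, `r_s`, and `r_c` are chosen here, and real ANI or DeepPot-SE descriptors add angular terms, element resolution, or learned embeddings.

```python
import math


def cosine_cutoff(r, r_c):
    """Smoothly switch a pair contribution off at the cutoff radius r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)


def radial_symmetry_function(coords, i, eta, r_s, r_c):
    """G_i = sum_{j != i} exp(-eta * (r_ij - r_s)^2) * f_c(r_ij).

    coords is a list of (x, y, z) tuples; i is the index of the central atom.
    The value depends only on interatomic distances, so it is invariant to
    translation, rotation, and permutation of the neighbours.
    """
    center = coords[i]
    total = 0.0
    for j, other in enumerate(coords):
        if j == i:
            continue
        r = math.dist(center, other)
        total += math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_c)
    return total
```

Evaluating this function for several `(eta, r_s)` pairs gives a fixed-length vector per atom, which is the kind of input a feedforward network or a Gaussian process regressor can then map to an energy.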
Article
The incorporation of selenium into tacrine derivatives has been explored as a novel strategy to enhance therapeutic efficacy while minimizing toxicity in the treatment of neurodegenerative diseases such as Alzheimer’s. This study utilized computational and experimental approaches, including Density Functional Theory (DFT), molecular docking, pharmacokinetic profiling, and toxicological predictions, to evaluate the potential of these derivatives. The selenium-modified compounds demonstrated improved electronic properties, such as narrower HOMO–LUMO gaps and optimized electronegativity, resulting in enhanced stability and reactivity. Pharmacokinetic analyses revealed favorable absorption, distribution, and blood–brain barrier penetration, while toxicological assessments indicated reduced hepatotoxicity and skin sensitization risks compared to tacrine. Molecular docking and dynamic simulations highlighted strong and stable interactions of the derivatives with critical enzymes, including acetylcholinesterase (AChE) and beta-secretases (BACE1 and BACE2). Compounds 12 and 13, in particular, emerged as the most promising candidates due to their superior stability and binding affinity. These findings underscore the potential of selenium-modified tacrine derivatives as safer and more effective therapeutic agents for Alzheimer’s disease, warranting further experimental validation.
Preprint
The development of semiempirical models to simplify quantum mechanical descriptions of atomistic systems is a practice that started soon after the discovery of quantum mechanics and continues to the present day. There are now many methods for atomistic simulation with many software implementations and many users, on a scale large enough to be considered as a software market. Semiempirical models occupied a large share of this market in its early days, but the research activity in atomistic simulation has steadily polarized over the last three decades towards general-purpose but expensive ab initio quantum mechanics methods and fast but special-purpose molecular mechanics methods. I offer perspective on recent trends in atomistic simulation from the middle ground of semiempirical modeling, to learn from its past success and consider its possible paths to future growth. In particular, there is a lot of ongoing research activity in combining semiempirical quantum mechanics with machine learning models and some unrealized possibilities of tighter integration between ab initio and semiempirical quantum mechanics with more flexible theoretical frameworks and more modular software components.
Article
Accurate prediction of protein–ligand binding affinities is crucial in drug discovery, particularly during the hit-to-lead and lead optimization phases; however, limitations in ligand force fields continue to impact prediction accuracy. In this work, we validate relative binding free energy (RBFE) accuracy using neural network potentials (NNPs) for the ligands. We utilize a novel NNP model, AceFF 1.0, based on the TensorNet architecture for small molecules, that broadens the applicability to diverse drug-like compounds, including all important chemical elements and supporting charged molecules. Using established benchmarks, we show overall improved accuracy and correlation in binding affinity predictions compared with GAFF2 for molecular mechanics and ANI-2x for NNPs, and slightly lower accuracy but comparable correlations relative to OPLS4. We also show that we can run the NNP simulations with a 2 fs time step, at least two times larger than previous NNP models, providing significant speed gains. The results show promise for further evolutions of free energy calculations using NNPs while demonstrating their practical use already with the current generation. The code and NNP model are publicly available for research use.
Article
Tautomerization plays a critical role in chemical and biological processes, influencing molecular stability, reactivity, biological activity, and ADME-Tox properties. Many drug-like molecules exist in multiple tautomeric states in aqueous solution, complicating the study of protein–ligand interactions. Rapid and accurate prediction of tautomer ratios and identification of predominant species are therefore crucial in computational drug discovery. In this study, we introduce sPhysNet-Taut, a deep learning model fine-tuned on experimental data using a Siamese neural network architecture. This model directly predicts tautomer ratios in aqueous solution based on MMFF94-optimized molecular geometries. On experimental test sets, sPhysNet-Taut achieves state-of-the-art performance with root-mean-square error (RMSE) of 1.9 kcal/mol on the 100-tautomers set and 1.0 kcal/mol on the SAMPL2 challenge, outperforming all other methods. It also provides superior ranking power for tautomer pairs on multiple test sets. Our results demonstrate that fine-tuning on experimental data significantly enhances model performance compared to training from scratch. This work not only offers a valuable deep learning model for predicting tautomer ratios but also presents a protocol for modeling pairwise data. To promote usability, we have developed an accessible tool that predicts stable tautomeric states in aqueous solution by enumerating all possible tautomeric states and ranking them using our model. The source code and web server are freely accessible at https://github.com/xiaolinpan/sPhysNet-Taut and https://yzhang.hpc.nyu.edu/tautomer.
Article
Molecular force field (FF) determines the accuracy of molecular dynamics (MD) and is one of the major bottlenecks that limits the application of MD in molecular design. Recently, artificial intelligence...
Article
Machine learning (ML) methods offer a promising route to the construction of universal molecular potentials with high accuracy and low computational cost. It is becoming evident that integrating physical principles into these models, or utilizing them in a Δ-ML scheme, significantly enhances their robustness and transferability. This paper introduces PM6-ML, a Δ-ML method that synergizes the semiempirical quantum-mechanical (SQM) method PM6 with a state-of-the-art ML potential applied as a universal correction. The method demonstrates superior performance over standalone SQM and ML approaches and covers a broader chemical space than its predecessors. It is scalable to systems with thousands of atoms, which makes it applicable to large biomolecular systems. Extensive benchmarking confirms PM6-ML’s accuracy and robustness. Its practical application is facilitated by a direct interface to MOPAC. The code and parameters are available at https://github.com/Honza-R/mopac-ml.
Article
We present a comprehensive study investigating the potential gain in accuracy for calculating absolute solvation free energies (ASFE) using a neural network potential to describe the intramolecular energy of the solute. We calculated the ASFE for most compounds from the FreeSolv database using the Open Force Field (OpenFF) and compared them to earlier results obtained with the CHARMM General Force Field (CGenFF). By applying a nonequilibrium (NEQ) switching approach between the molecular mechanics (MM) description (either OpenFF or CGenFF) and the neural net potential (NNP)/MM level of theory (using ANI-2x as the NNP potential), we attempted to improve the accuracy of the calculated ASFEs. The predictive performance of the results did not change when this approach was applied to all 589 small molecules in the FreeSolv database that ANI-2x can describe. When selecting a subset of 156 molecules, focusing on compounds where the force fields performed poorly, we saw a slight improvement in the root-mean-square error (RMSE) and mean absolute error (MAE). The majority of our calculations utilized unidirectional NEQ protocols based on Jarzynski’s equation. Additionally, we conducted bidirectional NEQ switching for a subset of 156 solutes. Notably, only a small fraction (10 out of 156) exhibited statistically significant discrepancies between unidirectional and bidirectional NEQ switching free energy estimates.
Preprint
Quantum chemical methods developed since 1927 are instrumental in chemical simulations, but human expertise has still been essential in choosing a suitable method. Here we introduce a paradigm shift to universal and updatable artificial intelligence-enhanced quantum mechanical (UAIQM) foundational models with an online platform auto-selecting the models with the best accuracy for the given system, available time, and moderate computational resources (see https://xacs.xmu.edu.cn/docs/mlatom/tutorial_uaiqm.html for instructions). The platform hosts a growing library of state-of-the-art UAIQM models with calibrated uncertainties and provides a mechanism for improving the foundational models continuously with more usage. We demonstrate how the UAIQM platform can be used for massive accurate simulations within hours on commodity hardware, which would take days or weeks on high-performance computing centers with less accurate workhorse quantum chemical methods. We also show that UAIQM sets a new standard for infrared spectra, reaction barriers, and energetics whose accurate predictions can have far-reaching consequences in molecular simulations.
Article
Molecular simulations of high energetic materials (HEMs) are limited by efficiency and accuracy. Recently, neural network potential (NNP) models have achieved molecular simulations of millions of atoms while maintaining the...
Article
This article gives a perspective on the progress of AI tools in computational chemistry through the lens of the author’s decade-long contributions put in the wider context of the trends...
Article
This Perspective provides a contextual explanation of the current state-of-the-art alchemical free energy methods and their role in drug discovery as well as highlights select emerging technologies. The narrative attempts to answer basic questions about what goes on “under the hood” in free energy simulations and provide general guidelines for how to run simulations and analyze the results. It is the hope that this work will provide a valuable introduction to students and scientists in the field.
Article
Mass spectrometric innovations in analytical instrumentation tend to be accompanied by the development of a data-processing methodology, with the expectation of gaining molecular-level insights into real-life objects. Qualitative and semi-quantitative methods have been replaced routinely by precise, accurate, selective, and sensitive quantitative ones. Currently, mass spectrometric 3D molecular structural methods are attractive. As an attempt to establish a reliable link between quantitative and 3D structural analyses, an innovative formula has been developed, $D''_{SD,tot} = \sum_i^n D''_{SD,i} = \sum_i^n 2.6388 \times 10^{-17} \times \left(\overline{I_i^2} - \overline{I_i}^{\,2}\right)$, capable of the exact determination of the analyte amount and its 3D structure. Herein, it is applied to ultra-high resolution mass spectrometric variables of paracetamol, atenolol, propranolol, and benzalkonium chlorides in biota, using mussel tissue and sewage sludge. Quantum chemistry and chemometrics were also used. Results: Data on mixtures of antibiotics and surfactants in biota and the linear dynamic range of concentrations 2–80 ng.(mL)⁻¹ and collision energy CE = 5–60 V are provided. Quantitative analysis of surfactants in biota via the calibration equation ln[D″SD] = f(conc.) yields the exact parameter |r| = 0.99991, examining the peaks of BAC-C12 at m/z 212.209 ± 0.1 and 211.75 ± 0.15 for tautomers of fragmentation ions. The exact parameter |r| = 1 has been obtained, correlating the theory and experiments in determining the 3D molecular structures of ions of paracetamol at m/z 152, 158, 174, 301, and 325 in biota.
Article
Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, along with other useful quantities such as multipole moments and bond orders. We train a set of machine learning potentials on it and demonstrate that they can achieve chemical accuracy across a broad region of chemical space. It can serve as a valuable resource for the creation of transferable, ready to use potential functions for use in molecular simulations.
Article
Glutaredoxins are small enzymes that catalyze the oxidation and reduction of protein disulfide bonds by the thiol-disulfide exchange mechanism. They have either one or two cysteines in their active site, resulting in different catalytic reaction cycles that have been investigated in many experimental studies. However, the exact mechanisms are not yet fully known, and to our knowledge, no theoretical studies have been performed to elucidate the underlying mechanism. In this study, we investigated a proposed mechanism for the reduction of the disulfide bond in the protein HMA4n by a mutated monothiol Homo sapiens glutaredoxin (HsGrx1) and the co-substrate glutathione (GSH). The catalytic cycle involves three successive thiol-disulfide exchanges that occur between the molecules. To estimate the regioselectivity of the different attacks, classical molecular dynamics simulations were performed and the trajectories analyzed regarding the sulfur-sulfur distances and the attack angles between the sulfurs. The free energy profile of each reaction was obtained with hybrid quantum mechanical/molecular mechanical metadynamics simulations. Since this required extensive phase space sampling, the semi-empirical density functional tight-binding (DFTB) method was used to describe the reactive cysteines. For an accurate description, we used specific reaction parameters fitted to B3LYP energies of the thiol-disulfide exchange and a machine learned energy correction that was trained on CCSD(T) energies of thiol-disulfide exchanges. Our calculations show the same regiospecificity as observed in the experiment, and the obtained barrier heights are about 12 and 20 kcal/mol for the different reaction steps, which confirms the proposed pathway.
Article
Machine learning approaches in drug discovery, as well as in other areas of the chemical sciences, benefit from curated datasets of physical molecular properties. However, there currently is a lack of data collections featuring large bioactive molecules alongside first-principle quantum chemical information. The open-access QMugs (Quantum-Mechanical Properties of Drug-like Molecules) dataset fills this void. The QMugs collection comprises quantum mechanical properties of more than 665 k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling ~2 M conformers. QMugs contains optimized molecular geometries and thermodynamic data obtained via the semi-empirical method GFN2-xTB. Atomic and molecular properties are provided on both the GFN2-xTB and on the density-functional levels of theory (DFT, ωB97X-D/def2-SVP). QMugs features molecules of significantly larger size than previously-reported collections and comprises their respective quantum mechanical wave functions, including DFT density and orbital matrices. This dataset is intended to facilitate the development of models that learn from molecular data on different levels of theory while also providing insight into the corresponding relationships between molecular structure and biological activity. Measurement(s): Quantum Mechanics. Technology Type(s): density functional theory.
Article
To fill the gap between accurate (and expensive) ab initio calculations and efficient atomistic simulations based on empirical interatomic potentials, a new class of descriptions of atomic interactions has emerged and been widely applied; i.e., machine learning potentials (MLPs). One recently developed type of MLP is the Deep Potential (DP) method. In this review, we provide an introduction to DP methods in computational materials science. The theory underlying the DP method is presented along with a step-by-step introduction to their development and use. We also review materials applications of DPs in a wide range of materials systems. The DP Library provides a platform for the development of DPs and a database of extant DPs. We discuss the accuracy and efficiency of DPs compared with ab initio methods and empirical potentials.
Article
A key step during indirect alchemical free energy simulations using quantum mechanical/molecular mechanical (QM/MM) hybrid potential energy functions is the calculation of the free energy difference ΔAlow→high between the low level (e.g., pure MM) and the high level of theory (QM/MM). A reliable approach uses nonequilibrium work (NEW) switching simulations in combination with Jarzynski's equation; however, it is computationally expensive. In this study, we investigate whether it is more efficient to use more shorter switches or fewer but longer switches. We compare results obtained with various protocols to reference free energy differences calculated with Crooks' equation. The central finding is that fewer longer switches give better converged results. As few as 200 sufficiently long switches lead to ΔAlow→high values in good agreement with the reference results. This optimized protocol reduces the computational cost by a factor of 40 compared to earlier work. We also describe two tools/ways of analyzing the raw data to detect sources of poor convergence. Specifically, we find it helpful to analyze the raw data (work values from the NEW switching simulations) in a quasi-time series-like manner. Principal component analysis helps to detect cases where one or more conformational degrees of freedom are different at the low and high level of theory.
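For readers unfamiliar with Jarzynski's equation, ΔA = −kT ln⟨exp(−W/kT)⟩, the sketch below shows how a free energy difference could be estimated from a set of nonequilibrium work values. The function name and argument conventions are assumptions for illustration (work and kT must share units, for example kcal/mol with kT of about 0.59 at 298 K); this is not the authors' analysis code.

```python
import math


def jarzynski_free_energy(work_values, kt):
    """Estimate dA = -kT * ln( mean( exp(-W / kT) ) ) from switching work values.

    The minimum work is factored out before exponentiation so that large
    work values do not underflow to zero.
    """
    n = len(work_values)
    w_min = min(work_values)
    shifted_sum = sum(math.exp(-(w - w_min) / kt) for w in work_values)
    return w_min - kt * math.log(shifted_sum / n)
```

Because the exponential average is dominated by the smallest work values, the estimate converges slowly when the work distribution is broad, which is one reason the number and length of switches matter for the protocol discussed above.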
Article
The 10–23 DNAzyme is one of the most prominent catalytically active DNA sequences¹,². Its ability to cleave a wide range of RNA targets with high selectivity entails a substantial therapeutic and biotechnological potential². However, the high expectations have not yet been met, a fact that coincides with the lack of high-resolution and time-resolved information about its mode of action³. Here we provide high-resolution NMR characterization of all apparent states of the prototypic 10–23 DNAzyme and present a comprehensive survey of the kinetics and dynamics of its catalytic function. The determined structure and identified metal-ion-binding sites of the precatalytic DNAzyme–RNA complex reveal that the basis of the DNA-mediated catalysis is an interplay among three factors: an unexpected, yet exciting molecular architecture; distinct conformational plasticity; and dynamic modulation by metal ions. We further identify previously hidden rate-limiting transient intermediate states in the DNA-mediated catalytic process via real-time NMR measurements. Using a rationally selected single-atom replacement, we could considerably enhance the performance of the DNAzyme, demonstrating that the acquired knowledge of the molecular structure, its plasticity and the occurrence of long-lived intermediate states constitutes a valuable starting point for the rational design of next-generation DNAzymes.
Article
High-level quantum mechanical (QM) calculations are indispensable for accurate explanation of natural phenomena on the atomistic level. Their staggering computational cost, however, poses great limitations, which luckily can be lifted to a great extent by exploiting advances in artificial intelligence (AI). Here we introduce the general-purpose, highly transferable artificial intelligence–quantum mechanical method 1 (AIQM1). It approaches the accuracy of the gold-standard coupled cluster QM method with the high computational speed of approximate low-level semiempirical QM methods for neutral, closed-shell species in the ground state. AIQM1 can provide accurate ground-state energies for diverse organic compounds as well as geometries close to experiment even for challenging systems such as large conjugated compounds (fullerene C60). This opens an opportunity to investigate chemical compounds with previously unattainable speed and accuracy, as we demonstrate by determining geometries of polyyne molecules, a task difficult for both experiment and theory. Notably, our method's accuracy is also good for ions and excited-state properties, although the neural network part of AIQM1 was never fitted to these properties.
Article
Quantum-chemistry simulations based on potential energy surfaces of molecules provide invaluable insight into the physicochemical processes at the atomistic level and yield such important observables as reaction rates and spectra. Machine learning potentials promise to significantly reduce the computational cost and hence enable otherwise unfeasible simulations. However, the surging number of such potentials begs the question of which one to choose or whether we still need to develop yet another one. Here, we address this question by evaluating the performance of popular machine learning potentials in terms of accuracy and computational cost. In addition, we deliver structured information for non-specialists in machine learning to guide them through the maze of acronyms, recognize each potential's main features, and judge what they could expect from each one.
Article
Interatomic potentials derived with Machine Learning algorithms such as Deep-Neural Networks (DNNs) achieve the accuracy of high-fidelity quantum mechanical (QM) methods in areas traditionally dominated by empirical force fields and allow performing massive simulations. Most DNN potentials were parametrized for neutral molecules or closed-shell ions due to architectural limitations. In this work, we propose an improved machine learning framework for simulating open-shell anions and cations. We introduce the AIMNet-NSE (Neural Spin Equilibration) architecture, which can predict molecular energies for an arbitrary combination of molecular charge and spin multiplicity with errors of about 2–3 kcal/mol and spin-charges with errors of ~0.01e for small and medium-sized organic molecules, compared to the reference QM simulations. The AIMNet-NSE model allows one to fully bypass QM calculations and derive the ionization potential, electron affinity, and conceptual Density Functional Theory quantities like electronegativity, hardness, and condensed Fukui functions. We show that these descriptors, along with learned atomic representations, could be used to model chemical reactivity through an example of regioselectivity in electrophilic aromatic substitution reactions. Quantum mechanical calculations of molecular ionized states are computationally quite expensive. This work reports a successful extension of a previous deep-neural networks approach towards transferable neural-network models for predicting multiple properties of open shell anions and cations.
Article
The computation of tautomer ratios of druglike molecules is enormously important in computer-aided drug discovery, as over a quarter of all approved drugs can populate multiple tautomeric species in solution. Unfortunately, accurate calculation of aqueous tautomer ratios (the degree to which these species must be penalized in order to correctly account for tautomers in modeling binding for computer-aided drug discovery) is surprisingly difficult. While quantum chemical approaches to computing aqueous tautomer ratios using continuum solvent models and rigid-rotor harmonic-oscillator thermochemistry are currently state of the art, these methods are still surprisingly inaccurate despite their enormous computational expense. Here, we show that a major source of this inaccuracy lies in the breakdown of the standard approach to accounting for quantum chemical thermochemistry using rigid rotor harmonic oscillator (RRHO) approximations, which are frustrated by the complex conformational landscape introduced by the migration of double bonds, creation of stereocenters, and introduction of multiple conformations separated by low energetic barriers induced by migration of a single proton. Using quantum machine learning (QML) methods that allow us to compute potential energies with quantum chemical accuracy at a fraction of the cost, we show how rigorous relative alchemical free energy calculations can be used to compute tautomer ratios in vacuum free from the limitations introduced by RRHO approximations. Furthermore, since the parameters of QML methods are tunable, we show how we can train these models to correct limitations in the underlying learned quantum chemical potential energy surface using free energies, enabling these methods to learn to generalize tautomer free energies across a broader range of predictions.
Article
CONSPECTUS: Machine learning interatomic potentials (MLIPs) are widely used for describing molecular energy and continue bridging the speed and accuracy gap between quantum mechanical (QM) and classical approaches like force fields. In this Account, we focus on the out-of-the-box approaches to developing transferable MLIPs for diverse chemical tasks. First, we introduce the "Accurate Neural Network engine for Molecular Energies," ANAKIN-ME, method (or ANI for short). The ANI model utilizes Justin Smith Symmetry Functions (JSSFs) and realizes training for vast data sets. The training data set of several orders of magnitude larger than before has become the key factor of the knowledge transferability and flexibility of MLIPs. As the quantity, quality, and types of interactions included in the training data set will dictate the accuracy of MLIPs, the task of proper data selection and model training could be assisted with advanced methods like active learning (AL), transfer learning (TL), and multitask learning (MTL). Next, we describe the AIMNet "Atoms-in-Molecules Network" that was inspired by the quantum theory of atoms in molecules. The AIMNet architecture lifts multiple limitations in MLIPs. It encodes long-range interactions and learnable representations of chemical elements. We also discuss the AIMNet-ME model that expands the applicability domain of AIMNet from neutral molecules toward open-shell systems. The AIMNet-ME encompasses a dependence of the potential on molecular charge and spin. It brings ML and physical models one step closer, ensuring the correct molecular energy behavior over the total molecular charge. We finally describe perhaps the simplest possible physics-aware model, which combines ML and the extended Hückel method. In ML-EHM, "Hierarchically Interacting Particle Neural Network," HIP-NN generates the set of a molecule- and environment-dependent Hamiltonian elements αμμ and K‡. As a test example, we show how in contrast to traditional Hückel theory, ML-EHM correctly describes orbital crossing with bond rotations. Hence it learns the underlying physics, highlighting that the inclusion of proper physical constraints and symmetries could significantly improve ML model generalization.
KEY REFERENCES
• Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: An Extensible Neural Network Potential with DFT Accuracy at Force Field Computational Cost. Chem. Sci. 2017, 8, 3192−3203. (ref 1) The first transferable NNP with accuracy comparable to DFT that is applicable to broad classes of organic molecules.
• Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. E. Approaching Coupled Cluster Accuracy with a General-Purpose Neural Network Potential through Transfer Learning. Nat. Commun. 2019, 10, 2903. (ref 2) TL implementation to train NNP that approaches CCSD(T) accuracy on diverse benchmarks: thermochemistry, isomerization, molecular torsion.
• Zubatyuk, R.; Smith, J. S.; Leszczynski, J.; Isayev, O. Accurate and Transferable Multitask Prediction of Chemical Properties with an Atoms-in-Molecules Neural Network. Sci. Adv. 2019, 5, eaav6490. (ref 3) Development of AIMNet modular deep NNP. The AIMNet shows a new dimension of transferability: the aptitude in learning new features from foregoing training through multimodal information.
• Zubatyuk, R.; Smith, J.; Nebgen, B. T.; Tretiak, S.; Isayev, O. Teaching a Neural
Article
We present RegioSQM20, a new version of RegioSQM (Chem Sci 9:660, 2018), which predicts the regioselectivities of electrophilic aromatic substitution (EAS) reactions from the calculation of proton affinities. The following improvements have been made: The open source semiempirical tight binding program is used instead of the closed source program. Any low energy tautomeric forms of the input molecule are identified and regioselectivity predictions are made for each form. Finally, RegioSQM20 offers a qualitative prediction of the reactivity of each tautomer (low, medium, or high) based on the reaction center with the highest proton affinity. The inclusion of tautomers increases the success rate from 90.7 to 92.7%. RegioSQM20 is compared to two machine learning based models: one developed by Struble et al. (React Chem Eng 5:896, 2020) specifically for regioselectivity predictions of EAS reactions (WLN) and a more generally applicable reactivity predictor (IBM RXN) developed by Schwaller et al. (ACS Cent Sci 5:1572, 2019). RegioSQM20 and WLN offer roughly the same success rates for the entire data sets (without considering tautomers), while WLN is many orders of magnitude faster. The accuracy of the more general IBM RXN approach is somewhat lower: 76.3–85.0%, depending on the data set. The code is freely available under the MIT open source license and will be made available as a webservice (regiosqm.org) in the near future.
Article
We review progress in neural network (NN)-based methods for the construction of interatomic potentials from discrete samples (such as ab initio energies) for applications in classical and quantum dynamics including reaction dynamics and computational spectroscopy. The main focus is on methods for building molecular potential energy surfaces (PES) in internal coordinates that explicitly include all many-body contributions, even though some of the methods we review limit the degree of coupling, due either to a desire to limit computational cost or to limited data. Explicit and direct treatment of all many-body contributions is only practical for sufficiently small molecules, which are therefore our primary focus. This includes small molecules on surfaces. We consider direct, single NN PES fitting as well as more complex methods that impose structure (such as a multibody representation) on the PES function, either through the architecture of one NN or by using multiple NNs. We show how NNs are effective in building representations with low-dimensional functions including dimensionality reduction. We consider NN-based approaches to build PESs in the sums-of-product form important for quantum dynamics, ways to treat symmetry, and issues related to sampling data distributions and the relation between PES errors and errors in observables. We highlight combinations of NNs with other ideas such as permutationally invariant polynomials or sums of environment-dependent atomic contributions, which have recently emerged as powerful tools for building highly accurate PESs for relatively large molecular and reactive systems.
Article
Combustion is a complex chemical system which involves thousands of chemical reactions and generates hundreds of molecular species and radicals during the process. In this work, a neural network-based molecular dynamics (MD) simulation is carried out to simulate the benchmark combustion of methane. During MD simulation, detailed reaction processes leading to the creation of specific molecular species including various intermediate radicals and the products are intimately revealed and characterized. Overall, a total of 798 different chemical reactions were recorded and some new chemical reaction pathways were discovered. We believe that the present work heralds the dawn of a new era in which neural network-based reactive MD simulation can be practically applied to simulating important complex reaction systems at ab initio level, which provides atomic-level understanding of chemical reaction processes as well as discovery of new reaction pathways at an unprecedented level of detail beyond what laboratory experiments could accomplish.
Article
Maximum diversification of data is a central theme in building generalized and accurate machine learning (ML) models. In chemistry, ML has been used to develop models for predicting molecular properties, for example quantum mechanics (QM) calculated potential energy surfaces and atomic charge models. The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning, an automated data diversification process. Here, we describe the ANI-1x and ANI-1ccx data sets. To demonstrate data diversity, we visualize it with a dimensionality reduction scheme, and contrast against existing data sets. The ANI-1x data set contains multiple QM properties from 5 M density functional theory calculations, while the ANI-1ccx data set contains 500 k data points obtained with an accurate CCSD(T)/CBS extrapolation. Approximately 14 million CPU core-hours were expended to generate this data. Multiple QM calculated properties for the chemical elements C, H, N, and O are provided: energies, atomic forces, multipole moments, atomic charges, etc. We provide this data to the community to aid research and development of ML models for chemistry.
Article
DFTB+ is a versatile community developed open source software package offering fast and efficient methods for carrying out atomistic quantum mechanical simulations. By implementing various methods approximating density functional theory (DFT), such as the density functional based tight binding (DFTB) and the extended tight binding method, it enables simulations of large systems and long timescales with reasonable accuracy while being considerably faster for typical simulations than the respective ab initio methods. Based on the DFTB framework, it additionally offers approximated versions of various DFT extensions including hybrid functionals, time dependent formalism for treating excited systems, electron transport using non-equilibrium Green’s functions, and many more. DFTB+ can be used as a user-friendly standalone application in addition to being embedded into other software packages as a library or acting as a calculation-server accessed by socket communication. We give an overview of the recently developed capabilities of the DFTB+ code, demonstrating with a few use case examples, discuss the strengths and weaknesses of the various features, and also discuss on-going developments and possible future perspectives.
Article
The semiempirical quantum mechanical (SQM) methods used in drug design are commonly parametrized and tested on data sets of systems that may not be representative models for drug–biomolecule interactions in terms of both size and chemical composition. This is addressed here with a new benchmark data set, PLF547, derived from protein–ligand complexes, consisting of complexes of ligands with protein fragments (such as amino-acid side chains), with interaction energies based on MP2-F12 and DLPNO-CCSD(T) calculations. From these, composite benchmark interaction energies are also built for complexes of the ligand with the complete active site of the protein (PLA15 data set). These data sets are used to test multiple SQM methods with corrections for noncovalent interactions; the role of the solvation model in the calculations is tested as well.
Article
We develop an L-platform/L-scaffold framework we hypothesize may serve as a blueprint to facilitate site-specific RNA-cleaving nucleic acid enzyme design. Building on the L-platform motif originally described by Suslov and coworkers, we identify new critical scaffolding elements required to anchor a conserved general base guanine ("L-anchor") and bind functionally important metal ions at the active site ("L-pocket"). Molecular simulations, together with a broad range of experimental structural and functional data, connect the L-platform/L-scaffold elements to necessary and sufficient conditions for catalytic activity. We demonstrate that the L-platform/L-scaffold framework is common to 5 of the 9 currently known naturally occurring ribozyme classes (Twr, HPr, VSr, HHr, Psr), and intriguingly from a design perspective, the framework also appears in an artificially engineered DNAzyme (8-17dz). The flexibility of the L-platform/L-scaffold framework is illustrated on these systems, highlighting modularity and trends in the variety of known general acid moieties that are supported. These trends give rise to two distinct catalytic paradigms, building on the classifications proposed by Wilson and coworkers and named for the implicated general base and acid. The "G+A" paradigm (Twr, HPr, VSr) exclusively utilizes nucleobase residues for chemistry, and the "G+M+" paradigm (HHr, 8-17dz, Psr) involves structuring of the "L-pocket" metal ion binding site for recruitment of a divalent metal ion that plays an active role in the chemical steps of the reaction. Finally, the modularity of the L-platform/L-scaffold framework is illustrated in the VS ribozyme where the "L-pocket" assumes the functional role of the "L-anchor" element, highlighting a distinct mechanism, but one that is functionally linked with the hammerhead ribozyme.
Article
Molecular dynamics (MD) simulations have become increasingly popular in studying the motions and functions of biomolecules. The accuracy of the simulation, however, is highly determined by the molecular mechanics (MM) force field (FF), a set of functions with adjustable parameters to compute the potential energies from atomic positions. However, the overall quality of the FF, such as our previously published ff99SB and ff14SB, can be limited by assumptions that were made years ago. In the updated model presented here (ff19SB), we have significantly improved the backbone profiles for all 20 amino acids. We fit coupled ϕ/ψ parameters using 2D ϕ/ψ conformational scans for multiple amino acids, using as reference data the entire 2D quantum mechanics (QM) energy surface. We address the polarization inconsistency during dihedral parameter fitting by using both QM and MM in solution. Finally, we examine possible dependency of the backbone fitting on side chain rotamer. To extensively validate ff19SB parameters, we have performed a total of ~5 milliseconds MD simulations in explicit solvent. Our results show that after amino-acid specific training against QM data with solvent polarization, ff19SB not only reproduces the differences in amino acid specific Protein Data Bank (PDB) Ramachandran maps better, but also shows significantly improved capability to differentiate amino acid dependent properties such as helical propensities. We also conclude that an inherent underestimation of helicity is present in ff14SB, which is (inexactly) compensated by an increase in helical content driven by the TIP3P bias toward overly compact structures. In summary, ff19SB, when combined with a more accurate water model such as OPC, should have better predictive power for modeling sequence-specific behavior, protein mutations, and also rational protein design.
Article
We perform molecular dynamics simulations, based on recent crystallographic data, on the 8-17 DNAzyme at four states along the reaction pathway to determine the dynamical ensemble for the active state and transition state mimic in solution. A striking finding is the diverse roles played by Na+ and Pb2+ ions in the electrostatically strained active site that impact all four fundamental catalytic strategies, and share commonality with some features recently inferred for naturally occurring hammerhead and pistol ribozymes. The active site Pb2+ ion helps to stabilize in-line nucleophilic attack, provides direct electrostatic transition state stabilization, and facilitates leaving group departure. A conserved guanine residue is positioned to act as the general base, and is assisted by a bridging Na+ ion that tunes the pKa and facilitates in-line fitness. The present work provides insight into how DNA molecules are able to solve the RNA-cleavage problem, and establishes functional relationships between the mechanism of these engineered DNA enzymes with their naturally evolved RNA counterparts. This adds valuable information to our growing body of knowledge on general mechanisms of phosphoryl transfer reactions catalyzed by RNA, proteins and DNA.
Article
The nucleolytic ribozymes carry out site-specific RNA cleavage reactions by nucleophilic attack of the 2′-oxygen atom on the adjacent phosphorus with an acceleration of a million-fold or greater. A major part of this arises from concerted general acid–base catalysis. Recent identification of new ribozymes has expanded the group to a total of nine and this provides a new opportunity to identify sub-groupings according to the nature of the general base and acid. These include nucleobases, hydrated metal ions, and 2′-hydroxyl groups. Evolution has selected a number of different combinations of these elements that lead to efficient catalysis. These differences provide a new mechanistic basis for classifying these ribozymes.
Article
Computational modeling of chemical and biological systems at atomic resolution is a crucial tool in the chemist's toolset. The use of computer simulations requires a balance between cost and accuracy: quantum-mechanical methods provide high accuracy but are computationally expensive and scale poorly to large systems, while classical force fields are cheap and scalable, but lack transferability to new systems. Machine learning can be used to achieve the best of both approaches. Here we train a general-purpose neural network potential (ANI-1ccx) that approaches CCSD(T)/CBS accuracy on benchmarks for reaction thermochemistry, isomerization, and drug-like molecular torsions. This is achieved by training a network to DFT data then using transfer learning techniques to retrain on a dataset of gold standard QM calculations (CCSD(T)/CBS) that optimally spans chemical space. The resulting potential is broadly applicable to materials science, biology, and chemistry, and billions of times faster than CCSD(T)/CBS calculations.
Article
In recent years, machine learning (ML) methods have become increasingly popular in computational chemistry. After being trained on appropriate ab initio reference data, these methods allow accurate prediction of the properties of chemical systems, circumventing the need for explicitly solving the electronic Schrödinger equation. Because of their computational efficiency and scalability to large datasets, deep neural networks (DNNs) are a particularly promising ML algorithm for chemical applications. This work introduces PhysNet, a DNN architecture designed for predicting energies, forces and dipole moments of chemical systems. PhysNet achieves state-of-the-art performance on the QM9, MD17 and ISO17 benchmarks. Further, two new datasets are generated in order to probe the performance of ML models for describing chemical reactions, long-range interactions, and condensed phase systems. It is shown that explicitly including electrostatics in energy predictions is crucial for a qualitatively correct description of the asymptotic regions of a potential energy surface (PES). PhysNet models trained on a systematically constructed set of small peptide fragments (at most eight heavy atoms) are able to generalize to considerably larger proteins like deca-alanine (Ala10): The optimized geometry of helical Ala10 predicted by PhysNet is virtually identical to ab initio results (RMSD = 0.21 Å). By running unbiased molecular dynamics (MD) simulations of Ala10 on the PhysNet-PES in gas phase, it is found that instead of a helical structure, Ala10 folds into a "wreath-shaped" configuration, which is more stable than the helical form by 0.46 kcal mol⁻¹ according to the reference ab initio calculations.
Article
Electronic wave function calculation is a fundamental task of computational quantum chemistry. Knowledge of the wave function parameters allows one to compute physical and chemical properties of molecules and materials. Unfortunately, it is infeasible to compute the wave functions analytically even for simple molecules. Classical quantum chemistry approaches such as the Hartree-Fock method or density functional theory (DFT) allow one to compute an approximation of the wave function but are very computationally expensive. One way to lower the computational complexity is to use machine learning models that can provide sufficiently good approximations at a much lower computational cost. In this work we: (1) introduce a new curated large-scale dataset of electron structures of drug-like molecules, (2) establish a novel benchmark for the estimation of molecular properties in the multi-molecule setting, and (3) evaluate a wide range of methods with this benchmark. We show that the accuracy of recently developed machine learning models deteriorates significantly when switching from the single-molecule to the multi-molecule setting. We also show that these models lack generalization over different chemistry classes. In addition, we provide experimental evidence that larger datasets lead to better ML models in the field of quantum chemistry.
Chapter
Recently, artificial neural network-based methods for the construction of potential energy surfaces and molecular dynamics (MD) simulations based on them have been increasingly used in the field of theoretical chemistry. The neural network potentials (NNP) strike a good balance between accuracy and computational efficiency relative to quantum chemical calculations and MD simulations based on classical force fields. Thus, NNP is becoming a powerful tool for studying the structure and function of molecules. In this chapter, we introduce the basic theory of NNP. The construction steps and the usage of NNP are also introduced in detail with the MD simulation of methane combustion as an example. We hope that this chapter can help those readers who are new but interested in entering this field.
Chapter
Quantum chemistry (QC) has a vast variety of different methods, with more accurate methods being generally slower. This has several consequences: one is that it is easier to generate more data with less accurate methods for training machine learning (ML), whereas the availability of more accurate data is limited. Another consequence is that the databases are rich in data generated with different methods. In addition, some quantum chemical properties such as heats of formation at 298 K and atomization energies at 0 K are related, but the computational cost of their generation and therefore availability is different too. Such data sets with data from different sources are known as multifidelity data, and ML provides tools to learn from them. Here, we discuss such standard tools, transfer learning (TL), and co-kriging, as well as more specialized tools used in QC such as Δ-learning and hierarchical ML as well as methods going beyond them. We will show that Δ-learning and related methods provide an efficient way to improve low-level quantum chemical methods. At the end of the chapter, case studies for performing Δ-learning, hierarchical ML, and TL are provided.
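The Δ-learning idea described above can be sketched in a few lines: a model is trained on the difference between high-level and low-level energies, and predictions add that learned correction back onto the cheap low-level result. The Python sketch below is a toy illustration under stated assumptions. A one-dimensional least-squares line stands in for the ML model, and all function names are hypothetical and not taken from the chapter's case studies.

```python
def fit_linear(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_delta_model(xs, e_low, e_high):
    """Learn the correction delta(x) = E_high(x) - E_low(x) from training data."""
    deltas = [high - low for high, low in zip(e_high, e_low)]
    slope, intercept = fit_linear(xs, deltas)
    return lambda x: slope * x + intercept


def predict_high(x, low_level, delta_model):
    """Delta-learning prediction: cheap low-level energy plus learned correction."""
    return low_level(x) + delta_model(x)
```

Because the correction is usually smoother and smaller in magnitude than the total energy, it can often be learned from fewer high-level data points than a direct fit would need, which is the efficiency argument the chapter makes.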
Article
Machine-learning-based interatomic potential energy surface (PES) models are revolutionizing the field of molecular modeling. However, although much faster than electronic structure schemes, these models suffer from costly computations via deep neural networks to predict the energy and atomic forces, resulting in lower running efficiency as compared to the typical empirical force fields. Herein, we report a model compression scheme for boosting the performance of the Deep Potential (DP) model, a deep learning-based PES model. This scheme, which we call DP Compress, is an efficient postprocessing step after the training of DP models (DP Train). DP Compress combines several DP-specific compression techniques, which typically speed up DP-based molecular dynamics simulations by an order of magnitude and reduce memory consumption by an order of magnitude. We demonstrate that DP Compress is sufficiently accurate by testing a variety of physical properties of Cu, H2O, and Al-Cu-Mg systems. DP Compress applies to both CPU and GPU machines and is publicly available online.
Article
We present a fast, accurate, and robust approach for determination of free energy profiles and kinetic isotope effects for RNA 2'-O-transphosphorylation reactions with inclusion of nuclear quantum effects. We apply a deep potential range correction (DPRc) for combined quantum mechanical/molecular mechanical (QM/MM) simulations of reactions in the condensed phase. The method uses the second-order density-functional tight-binding method (DFTB2) as a fast, approximate base QM model. The DPRc model modifies the DFTB2 QM interactions and applies short-range corrections to the QM/MM interactions to reproduce ab initio DFT (PBE0/6-31G*) QM/MM energies and forces. The DPRc thus enables both QM and QM/MM interactions to be tuned to high accuracy, and the QM/MM corrections are designed to smoothly vanish at a specified cutoff boundary (6 Å in the present work). The computational speed-up afforded by the QM/MM+DPRc model enables free energy profiles to be calculated that include rigorous long-range QM/MM interactions under periodic boundary conditions and nuclear quantum effects through a path integral approach using a new interface between the AMBER and i-PI software. The approach is demonstrated through the calculation of free energy profiles of a native RNA cleavage model reaction and reactions involving thio-substitutions, which are important experimental probes of the mechanism. The DFTB2+DPRc QM/MM free energy surfaces agree very closely with the PBE0/6-31G* QM/MM results, and are vastly superior to the DFTB2 QM/MM surfaces with and without weighted thermodynamic perturbation corrections. 18O and 34S primary kinetic isotope effects are compared, and the influence of nuclear quantum effects on the free energy profiles is examined.
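One common way to make a short-range correction vanish smoothly at a cutoff is to multiply it by a switching function that goes from 1 to 0 with continuous derivatives. The Python sketch below uses a generic quintic switch for illustration. The abstract does not give the functional form used by DPRc, so the polynomial, the function names, and the 5 Å onset distance are assumptions; only the 6 Å cutoff comes from the text.

```python
def smooth_switch(r, r_on=5.0, r_cut=6.0):
    """Quintic switch: 1 below r_on, 0 at and beyond r_cut, smooth in between.

    The first and second derivatives are zero at both ends of the switching
    region, so energies and forces stay continuous across the cutoff.
    """
    if r <= r_on:
        return 1.0
    if r >= r_cut:
        return 0.0
    u = (r - r_on) / (r_cut - r_on)  # 0 at r_on, 1 at r_cut
    return 1.0 - u ** 3 * (10.0 - 15.0 * u + 6.0 * u * u)


def switched_correction(r, correction):
    """Scale a distance-dependent correction so that it vanishes at the cutoff."""
    return smooth_switch(r) * correction(r)
```

A hard truncation at the cutoff would instead introduce a discontinuity in the energy, which shows up as a spurious force spike in molecular dynamics.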
Article
The high computational cost of first-principles electronic structure methods together with the successful applications of machine learning (ML) techniques in atomistic simulations resulted in a surge of interest in ML-based interatomic potentials. Despite great progress in the field, there remain some challenges to be solved such as the best way of incorporating long-range interactions, as well as nonlocal charge transfer. The first generation of the charge equilibration via neural network technique (CENT) was a major step forward in concurrently taking into account both aforementioned points. Within structure prediction methods, it turned out to be a powerful tool in discovering novel polymorphs of ionic systems. On the other hand, the method is not expected to be appropriate for multicomponent systems with reference data sets in which some or all elements are subject to varying oxidation states. Here, we present the second generation of CENT, with multiple improvements to the original variant that lead to a more accurate treatment of electrostatic interactions. To do this, it aims at reproducing the electric potential function, which is directly related to the charge distribution, rather than only considering total energies. In addition, a charge-free term is added to correct for the difference between the reference energies and those obtained with the energy functional of CENT. Moreover, the Green's function within the Hartree energy is modified to substantially shield interactions from charges in the neighborhood of each point. Also, the charge density is split into ionic and electronic parts, which allows for a better approximation of the electron density. The utility of this method is examined for magnesium oxide clusters, and multiple comparisons with the first generation are made, demonstrating that much more physical electrostatic interactions can be expected from the second generation of CENT.
Article
Semiempirical methods like density functional tight-binding (DFTB) allow extensive phase space sampling, making it possible to generate free energy surfaces of complex reactions in condensed-phase environments. Such a high efficiency often comes at the cost of reduced accuracy, which may be improved by developing a specific reaction parametrization (SRP) for the particular molecular system. Thiol-disulfide exchange is a nucleophilic substitution reaction that occurs in a large class of proteins. Its proper description requires a high-level ab initio method, while DFT-GGA and hybrid functionals were shown to be inadequate, and so is DFTB due to its DFT-GGA descent. We develop an SRP for thiol-disulfide exchange based on an artificial neural network (ANN) implementation in the DFTB+ software and compare its performance to that of a standard SRP approach applied to DFTB. As an application, we use both new DFTB-SRPs as components of a QM/MM scheme to investigate thiol-disulfide exchange in two molecular complexes: a solvated model system and a blood protein. Demonstrating the strengths of the methodology, highly accurate free energy surfaces are generated at a low cost, as the augmentation of DFTB with an ANN only adds a small computational overhead.
Article
We present OrbNet Denali, a machine learning potential that is designed as a drop-in replacement for ground-state density functional theory (DFT) energy calculations. The model is a message-passing neural network that uses symmetry-adapted atomic orbital features from low-cost quantum calculations to predict the energy of a molecule. OrbNet Denali is trained on a vast dataset of 2.3M DFT calculations on molecules and geometries. This dataset covers the most common elements in bio- and organic chemistry (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br, I) as well as charged molecules. OrbNet Denali is demonstrated on several well-established benchmark datasets, and we find that it provides accuracy on par with modern DFT methods while offering a speedup of up to three orders of magnitude. For the GMTKN55 benchmark set, OrbNet Denali achieves WTMAD-1 and WTMAD-2 scores of 7.19 and 9.84, on par with modern DFT functionals. For several GMTKN55 subsets, which contain chemical problems that are not present in the training set, OrbNet Denali produces MAEs comparable to those of DFT methods. For the Hutchison conformers benchmark set, OrbNet Denali has a median correlation coefficient of R^2=0.90 compared to reference DLPNO-CCSD(T) calculations, and R^2=0.97 compared to the method used to generate the training data (wB97X-D3/def2-TZVP), exceeding the performance of any other method with a similar cost. Similarly, the model reaches chemical accuracy for non-covalent interactions in the S66x10 dataset. For torsional profiles, OrbNet Denali reproduces the torsion profiles of wB97X-D3/def2-TZVP with an average MAE of 0.12 kcal/mol for the potential energy surfaces of the diverse fragments in the TorsionNet500 dataset.
Article
Predicting protein-ligand binding affinities and the associated thermodynamics of biomolecular recognition is a primary objective of structure-based drug design. Alchemical free energy simulations offer a highly accurate and computationally efficient route to achieving this goal. While the AMBER molecular dynamics package has successfully been used for alchemical free energy simulations in academic research groups for decades, widespread impact in industrial drug discovery settings has been minimal because of the previous limitations within the AMBER alchemical code, coupled with challenges in system setup and postprocessing workflows. Through a close academia-industry collaboration we have addressed many of the previous limitations with an aim to improve accuracy, efficiency, and robustness of alchemical binding free energy simulations in industrial drug discovery applications. Here, we highlight some of the recent advances in AMBER20 with a focus on alchemical binding free energy (BFE) calculations, which are less computationally intensive than alternative binding free energy methods where full binding/unbinding paths are explored. In addition to scientific and technical advances in AMBER20, we also describe the essential practical aspects associated with running relative alchemical BFE calculations, along with recommendations for best practices, highlighting the importance not only of the alchemical simulation code but also the auxiliary functionalities and expertise required to obtain accurate and reliable results. This work is intended to provide a contemporary overview of the scientific, technical, and practical issues associated with running relative BFE simulations in AMBER20, with a focus on real-world drug discovery applications.
Article
Intermolecular interactions are critical to many chemical phenomena, but their accurate computation using ab initio methods is often limited by computational cost. The recent emergence of machine learning (ML) potentials may be a promising alternative. Useful ML models should not only estimate accurate interaction energies but also predict smooth and asymptotically correct potential energy surfaces. However, existing ML models are not guaranteed to obey these constraints. Indeed, systemic deficiencies are apparent in the predictions of our previous hydrogen-bond model as well as the popular ANI-1X model, which we attribute to the use of an atomic energy partition. As a solution, we propose an alternative atomic-pairwise framework specifically for intermolecular ML potentials, and we introduce AP-Net—a neural network model for interaction energies. The AP-Net model is developed using this physically motivated atomic-pairwise paradigm and also exploits the interpretability of symmetry adapted perturbation theory (SAPT). We show that in contrast to other models, AP-Net produces smooth, physically meaningful intermolecular potentials exhibiting correct asymptotic behavior. Initially trained on only a limited number of mostly hydrogen-bonded dimers, AP-Net makes accurate predictions across the chemically diverse S66x8 dataset, demonstrating significant transferability. On a test set including experimental hydrogen-bonded dimers, AP-Net predicts total interaction energies with a mean absolute error of 0.37 kcal mol⁻¹, reducing errors by a factor of 2–5 across SAPT components from previous neural network potentials. The pairwise interaction energies of the model are physically interpretable, and an investigation of predicted electrostatic energies suggests that the model “learns” the physics of hydrogen-bonded interactions.
Article
This paper presents TorchANI, a PyTorch-based software for training/inference of ANI (ANAKIN-ME) deep learning models to obtain potential energy surfaces and other physical properties of molecular systems. ANI is an accurate neural network potential originally implemented using C++/CUDA in a program called NeuroChem. Compared with NeuroChem, TorchANI has a design emphasis on being lightweight, user friendly, cross platform, and easy to read and modify for fast prototyping, while allowing an acceptable sacrifice in running performance. Because the computation of atomic environmental vectors (AEVs) and atomic neural networks are all implemented using PyTorch operators, TorchANI is able to use PyTorch's autograd engine to automatically compute analytical forces and Hessian matrices, as well as do force training without requiring additional code. TorchANI is open-source and freely available on GitHub: https://github.com/aiqm/torchani
Article
Machine learning (ML) methods have become powerful, predictive tools in a wide range of applications, such as facial recognition and autonomous vehicles. In the sciences, computational chemists and physicists have been using ML for the prediction of physical phenomena, such as atomistic potential energy surfaces and reaction pathways. Transferable ML potentials, such as ANI-1x, have been developed with the goal of accurately simulating organic molecules containing the chemical elements H, C, N, and O. Here we provide an extension of the ANI-1x model. The new model, dubbed ANI-2x, is trained to three additional chemical elements: S, F, and Cl. Additionally, ANI-2x underwent torsional refinement training to better predict molecular torsion profiles. These new features open a wide range of new applications within organic chemistry and drug development. These seven elements (H, C, N, O, F, Cl, S) make up ~90% of drug-like molecules. To show that these additions do not sacrifice accuracy, we have tested this model across a range of organic molecules and applications, including the COMP6 benchmark, dihedral rotations, conformer scoring, and non-bonded interactions. ANI-2x is shown to accurately predict molecular energies compared to DFT with a ~10^6 factor speedup and a negligible slowdown compared to ANI-1x. The resulting model is a valuable tool for drug development that can potentially replace both quantum calculations and classical force fields for myriad applications.
Article
Virtual high throughput screening (vHTS) in drug discovery is a powerful approach to identify hits: when applied successfully, it can be much faster and cheaper than experimental high-throughput screening approaches. However, mainstream vHTS tools have significant limitations: ligand-based methods depend on knowledge of existing chemical matter, while structure-based tools such as docking involve significant approximations that limit their accuracy. Recent advances in scientific methods coupled with dramatic speedups in computational processing with GPUs make this an opportune time to consider the role of more rigorous methods that could improve the predictive power of vHTS workflows. In this Perspective, we assert that alchemical binding free energy methods using all-atom molecular dynamics simulations have matured to the point where they can be applied in virtual screening campaigns as a final scoring stage to prioritize the top molecules for experimental testing. Specifically, we propose that alchemical absolute binding free energy (ABFE) calculations offer the most direct and computationally efficient approach within a rigorous statistical thermodynamic framework for computing binding energies of diverse molecules, as is required for virtual screening. ABFE calculations are particularly attractive for drug discovery at this point in time, where the confluence of large-scale genomics data and insights from chemical biology have unveiled a large number of promising disease targets for which no small molecule binders are known, precluding ligand-based approaches, and where traditional docking approaches have foundered to find progressible chemical matter.
Article
The Non-Covalent Interactions Atlas project (www.nciatlas.org) aims to cover a wide range of noncovalent interactions with a new generation of benchmark data sets. This Article presents the first two data sets focused on hydrogen bonding: HB375, featuring neutral systems, and IHB100 for ionic H-bonds. Both data sets are complemented by 10-point dissociation curves (HB375×10, IHB100×10). The interaction energies are extrapolated to the CCSD(T)/CBS limit from calculations in large basis sets. The Article also summarizes the design principles that will be used to construct the subsequent data sets in the series. The testing of DFT-D methods on the HB375 set has revealed interesting, previously unnoticed issues. The application of the new data to the testing and parametrization of semiempirical QM methods is also discussed.
Article
Evolution has yielded biopolymers that are constructed from exactly four building blocks and are able to support Darwinian evolution. Synthetic biology aims to extend this alphabet, and we recently showed that 8-letter (hachimoji) DNA can support rule-based information encoding. One source of replicative error in non-natural DNA-like systems, however, is the occurrence of alternative tautomeric forms, which pair differently. Unfortunately, little is known about how structural modifications impact free-energy differences between tautomers of the non-natural nucleobases used in the hachimoji expanded genetic alphabet. Determining experimental tautomer ratios is technically difficult and so strategies for improving hachimoji DNA replication efficiency will benefit from accurate computational predictions of equilibrium tautomeric ratios. We now report that high-level quantum-chemical calculations in aqueous solution by the embedded cluster reference interaction site model (EC-RISM), benchmarked against free energy molecular simulations for solvation thermodynamics, provide useful quantitative information on the tautomer ratios of both Watson-Crick and hachimoji nucleobases. In agreement with previous computational studies, all four Watson-Crick nucleobases adopt essentially only one tautomer in water. This is not the case, however, for non-natural nucleobases and their analogs. For example, although the enols of isoguanine and a series of related purines are not populated in water, these heterocycles possess N1-H and N3-H keto tautomers that are similar in energy, thereby adversely impacting accurate nucleobase pairing. These robust computational strategies offer a firm basis for improving experimental measurements of tautomeric ratios, which are currently limited to studying molecules that exist only as two tautomers in solution.
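The link between a free-energy difference and an equilibrium tautomer ratio, which the abstract above relies on, can be illustrated with a minimal sketch. It assumes ideal Boltzmann populations, K = exp(-ΔG/RT), at 298.15 K; the function names `tautomer_ratio` and `populations` are hypothetical and are not taken from the cited work.

```python
import math

# Gas constant in kcal/(mol K); the default temperature is an assumption.
R_KCAL = 0.0019872041


def tautomer_ratio(delta_g, temperature=298.15):
    """Equilibrium ratio [B]/[A] for delta_g = G(B) - G(A) in kcal/mol."""
    return math.exp(-delta_g / (R_KCAL * temperature))


def populations(free_energies, temperature=298.15):
    """Boltzmann populations for a list of tautomer free energies (kcal/mol)."""
    g_min = min(free_energies)  # shift by the minimum for numerical stability
    weights = [
        math.exp(-(g - g_min) / (R_KCAL * temperature)) for g in free_energies
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two tautomers separated by 1 kcal/mol, plus a third that is much higher.
    print(tautomer_ratio(1.0))
    print(populations([0.0, 1.0, 5.0]))
```

Under these assumptions, two tautomers that are close in free energy are both appreciably populated, which is the situation the abstract describes for the N1-H and N3-H keto forms.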
Article
Machine learning (ML) is transforming all areas of science. The complex and time-consuming calculations in molecular simulations are particularly suitable for an ML revolution and have already been profoundly affected by the application of existing ML methods. Here we review recent ML methods for molecular simulation, with particular focus on (deep) neural networks for the prediction of quantum-mechanical energies and forces, on coarse-grained molecular dynamics, on the extraction of free energy surfaces and kinetics, and on generative network approaches to sample molecular equilibrium structures and compute thermodynamics. To explain these methods and illustrate open methodological problems, we review some important principles of molecular physics and describe how they can be incorporated into ML structures. Finally, we identify and describe a list of open challenges for the interface between ML and molecular simulation. Expected final online publication date for the Annual Review of Physical Chemistry, Volume 71 is April 20, 2020. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Article
There have been decades of research on determining/predicting acid dissociation constants (pKa) and tautomer ratios both experimentally and theoretically. However, the lack of an extensive publicly available database of measured tautomeric ratios in water and non-aqueous solvents poses a challenge for researchers interested in theoretical studies related to tautomers. Here we present Tautobase, to date and to the best of our knowledge the first extensive open-source tautomer database of measured/estimated tautomer ratios mainly in water, containing 1680 unique tautomer pairs.
Article
We use the PBE0/6-31G* density functional method to perform ab initio quantum mechanical/molecular mechanical (QM/MM) molecular dynamics (MD) simulations under periodic boundary conditions with rigorous electrostatics using the ambient potential composite Ewald method in order to test the convergence of MM→QM/MM free energy corrections for the prediction of 17 small-molecule solvation free energies and 8 ligand binding free energies to T4 lysozyme. The "indirect" thermodynamic cycle for calculating free energies is used to explore whether a series of reference potentials improve the statistical quality of the predictions. Specifically, we construct a series of reference potentials that optimizes a molecular mechanical (MM) force field's parameters to reproduce the ab initio QM/MM forces from a QM/MM simulation. The optimizations form a systematic progression of successively expanded parameters that include bond, angle, dihedral and charge parameters. For each reference potential, we calculate benchmark quality reference values for the MM→QM/MM correction by performing the mixed MM and QM/MM Hamiltonians at 11 intermediate states, each for 200 ps. We then compare forward and reverse application of Zwanzig's relation, thermodynamic integration, and Bennett's acceptance ratio (BAR) methods as a function of reference potential, simulation time, and the number of simulated intermediate states. We find that Zwanzig's equation is inadequate unless a large number of intermediate states are explicitly simulated. The TI and BAR mean signed errors are very small even when only the end-state simulations are considered, and the standard deviations of the TI and BAR errors are decreased by choosing a reference potential that optimizes the bond and angle parameters. We find a robust approach for the data sets of fairly rigid molecules considered here is to use the bond+angle reference potential together with the end-state-only BAR analysis. 
This requires a QM/MM simulation to be performed in order to generate reference data to parameterize the bond+angle reference potential, and then this same simulation serves a dual purpose as the full QM/MM end-state. The convergence of the results with respect to time suggests that computational resources may be used more efficiently by running multiple simulations for no more than 50 ps, rather than running one long simulation.
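Zwanzig's relation, mentioned in the abstract above, estimates a free energy difference from energy gaps sampled in a single ensemble: ΔA = -kT ln⟨exp(-ΔU/kT)⟩. The sketch below is a generic illustration of the forward estimator, not the authors' workflow. The function name `zwanzig_forward`, the units (kcal/mol), and the sample values are assumptions made for illustration.

```python
import math

# Boltzmann constant in kcal/(mol K); units are an assumption of this sketch.
K_B = 0.0019872041


def zwanzig_forward(delta_u, temperature=298.15):
    """Forward exponential-averaging estimate of a free energy difference.

    delta_u holds U_target - U_reference (kcal/mol), evaluated on
    configurations sampled from the reference ensemble.
    """
    beta = 1.0 / (K_B * temperature)
    exponents = [-beta * du for du in delta_u]
    # Log-sum-exp trick: factor out the largest exponent to avoid overflow.
    largest = max(exponents)
    mean_exp = sum(math.exp(x - largest) for x in exponents) / len(exponents)
    return -(largest + math.log(mean_exp)) / beta


if __name__ == "__main__":
    # Hypothetical energy gaps; real values would come from simulation frames.
    gaps = [1.2, 0.8, 1.5, 0.9, 1.1]
    print(zwanzig_forward(gaps))
```

Because the average is exponentially weighted, a few low-energy-gap frames dominate the estimate. This is consistent with the abstract's finding that the relation is inadequate unless many intermediate states are simulated.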
Article
We propose a simple, but efficient and accurate machine learning (ML) model for developing high-dimensional potential energy surfaces. This so-called embedded atom neural network (EANN) approach is inspired by the well-known empirical embedded atom method (EAM) model used in the condensed phase. It simply replaces the scalar embedded atom density in EAM with a Gaussian-type orbital based density vector, and represents the complex relationship between the embedded density vector and atomic energy by neural networks. We demonstrate that the EANN approach is equally accurate as several established ML models in representing both big molecular and extended periodic systems, yet with much fewer parameters and configurations. It is highly efficient as it implicitly contains the three-body information without an explicit sum of the conventional costly angular descriptors. With high accuracy and efficiency, EANN potentials can vastly accelerate molecular dynamics and spectroscopic simulations in complex systems at the ab initio level.
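The idea of a Gaussian-type orbital density that carries angular information without an explicit three-body loop can be sketched as follows. This is an illustrative reading of the abstract and not the published implementation. The functional form (squared sums of x^lx y^ly z^lz Gaussians with multinomial weights), the omission of a cutoff function, and the parameter names `alpha` and `r_s` are all assumptions.

```python
import math


def gto_density(neighbors, L, alpha=1.0, r_s=0.0, coeffs=None):
    """One component of an embedded density for total angular momentum L.

    neighbors: (x, y, z) displacements from the central atom to each neighbor.
    Each (lx, ly, lz) term is a single sum over neighbors, then squared,
    so the cost is linear in the number of neighbors.
    """
    if coeffs is None:
        coeffs = [1.0] * len(neighbors)
    rho = 0.0
    for lx in range(L + 1):
        for ly in range(L + 1 - lx):
            lz = L - lx - ly
            weight = math.factorial(L) / (
                math.factorial(lx) * math.factorial(ly) * math.factorial(lz)
            )
            orbital_sum = 0.0
            for c, (x, y, z) in zip(coeffs, neighbors):
                r = math.sqrt(x * x + y * y + z * z)
                radial = math.exp(-alpha * (r - r_s) ** 2)
                orbital_sum += c * (x ** lx) * (y ** ly) * (z ** lz) * radial
            rho += weight * orbital_sum ** 2
    return rho


def rotate_z(points, angle):
    """Rotate a list of (x, y, z) points about the z axis."""
    ca, sa = math.cos(angle), math.sin(angle)
    return [(ca * x - sa * y, sa * x + ca * y, z) for x, y, z in points]


if __name__ == "__main__":
    env = [(1.0, 0.0, 0.0), (0.0, 1.1, 0.2), (-0.7, 0.3, 0.9)]
    for L in range(3):
        print(L, gto_density(env, L), gto_density(rotate_z(env, 0.7), L))
```

Expanding the square gives a double sum over neighbor pairs weighted by (r_j · r_k)^L, which is how pairwise angles enter implicitly and why the value is unchanged when the environment is rotated.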
Article
A predictive understanding of the mechanisms of RNA cleavage is important for the design of emerging technology built from biological and synthetic molecules that have promise for new biochemical and medicinal applications. Over the past 15 years, RNA cleavage reactions involving 2'-O-transphosphorylation have been discussed using a simplified framework introduced by Breaker that consists of four fundamental catalytic strategies (designated α, β, γ, and δ) that contribute to rate enhancement. As more detailed mechanistic data emerge, there is need for the framework to evolve and keep pace. We develop an ontology for discussion of strategies of enzymes that catalyze RNA cleavage via 2'-O-transphosphorylation that stratifies Breaker’s framework into primary (1°), secondary (2°) and tertiary (3°) contributions to enable more precise interpretation of mechanism in the context of structure and bonding. Further, we point out instances where atomic-level changes give rise to changes in more than one catalytic contribution, a phenomenon we refer to as ‘functional blurring’. We hope that this ontology will help clarify our conversations and pave the path forward toward a consensus view of these fundamental and fascinating mechanisms. The insight gained will deepen our understanding of RNA cleavage reactions catalyzed by natural protein and RNA enzymes, as well as aid in the design of new engineered DNA and synthetic enzymes.