Article: Literature Review

AI-assisted chemistry research: a comprehensive analysis of evolutionary paths and hotspots through knowledge graphs


Abstract

Artificial intelligence (AI) offers transformative potential for chemical research through its ability to optimize reactions and processes, enhance energy efficiency, and reduce waste. AI-assisted chemical research ("AI + chem") has become a global hotspot. To better understand the current research status of "AI + chem", this study conducted a bibliometric investigation using CiteSpace. The Web of Science Core Collection was used to retrieve original articles related to "AI + chem" published from 2000 to 2024. The obtained data allowed for the visualization of the knowledge background, current research status, and latest knowledge structure of "AI + chem". The field has entered a stage of explosive growth, and the number of papers is predicted to maintain long-term, high-speed growth. This article systematically analyzes the latest progress in "AI + chem" and predicts future trends in areas including molecular design, reaction prediction, materials design, drug design, and quantum chemistry. The outcomes of this study will provide readers with a comprehensive understanding of the overall landscape of "AI + chem".


Article
Full-text available
Molecular optimization constitutes a pivotal phase in many domains since it holds the promise of improving the properties of lead molecules. The advent of artificial intelligence (AI)-driven molecular optimization has revolutionized lead optimization workflows, significantly accelerating the development of drug candidates. However, AI models are also confronted with new challenges in practical molecular optimization, such as high-dimensional chemical space and data sparsity. This paper first highlights the inherent benefits of molecular optimization in terms of optimizing properties and maintaining structural similarity for lead molecules, thereby underscoring its critical role in drug discovery. The next section systematically categorizes and analyzes existing AI-aided molecular optimization methods into iterative search in discrete chemical space, end-to-end generation in continuous latent space, and iterative search in continuous latent space. Finally, we discuss key challenges in AI-aided molecular optimization methods, including molecular representations, dataset selection, properties to be optimized, and optimization algorithms, while proposing potential solutions and future research directions. In summary, this review provides a comprehensive analysis of existing representative AI-aided molecular optimization methods, which offers guidance for future research directions.
Article
Molecular optimization plays a pivotal role in many domains since it holds promise for improving the properties of lead molecules. The advent of artificial intelligence (AI)-driven molecular optimization has revolutionized lead optimization workflows, which have significantly accelerated the development of drug candidates. However, AI models are also confronted with new challenges in practical molecular optimization, such as high-dimensional chemical space and data sparsity issues. This paper initially highlights the inherent benefits of molecular optimization in terms of optimizing the properties and maintaining the structural similarity of lead molecules, thereby highlighting its critical role in drug discovery. The next section systematically categorizes and analyzes existing AI-aided molecular optimization methods, comprising iterative search in discrete chemical space, end-to-end generation in continuous latent space, and iterative search in continuous latent space methods. Finally, we discuss the key challenges in AI-aided molecular optimization methods, including molecular representations, dataset selection, the properties to be optimized, and optimization algorithms, while proposing potential solutions and future research directions. In summary, this review provides a comprehensive analysis of existing representative AI-aided molecular optimization methods, thereby offering guidance for future research directions.
Article
Full-text available
We offer ten diverse perspectives exploring the transformative potential of artificial intelligence (AI) in chemistry, highlighting many of the challenges we face, and offering potential strategies to address them.
Article
Full-text available
The emergence of Artificial Intelligence (AI) in drug discovery marks a pivotal shift in pharmaceutical research, blending sophisticated computational techniques with conventional scientific exploration to break through enduring obstacles. This review paper elucidates the multifaceted applications of AI across various stages of drug development, highlighting significant advancements and methodologies. It delves into AI's instrumental role in drug design, polypharmacology, chemical synthesis, drug repurposing, and the prediction of drug properties such as toxicity, bioactivity, and physicochemical characteristics. Despite AI's promising advancements, the paper also addresses the challenges and limitations encountered in the field, including data quality, generalizability, computational demands, and ethical considerations. By offering a comprehensive overview of AI's role in drug discovery, this paper underscores the technology's potential to significantly enhance drug development, while also acknowledging the hurdles that must be overcome to fully realize its benefits.
Article
Full-text available
Purpose – The article discusses the current relevance of artificial intelligence (AI) in research and how AI improves various research methods. This article focuses on the practical case study of systematic literature reviews (SLRs) to provide a guideline for employing AI in the process. Design/methodology/approach – Researchers no longer require technical skills to use AI in their research. The recent discussion about using Chat Generative Pre-trained Transformer (GPT), a chatbot by OpenAI, has reached the academic world and fueled heated debates about the future of academic research. Nevertheless, as the saying goes, AI will not replace our job; a human being using AI will. This editorial aims to provide an overview of the current state of using AI in research, highlighting recent trends and developments in the field. Findings – The main result is guidelines for the use of AI in the scientific research process. The guidelines were developed for the literature review case but the authors believe the instructions provided can be adjusted to many fields of research, including but not limited to quantitative research, data qualification, research on unstructured data, qualitative data and even on many support functions and repetitive tasks. Originality/value – AI already has the potential to make researchers’ work faster, more reliable and more convenient. The authors highlight the current advantages and limitations of AI, which should be acknowledged in any research utilizing AI. Advantages include objectivity and repeatability in research processes that currently are subject to human error. The most substantial disadvantages lie in the architecture of current general-purpose models, an understanding of which is essential for using them in research. The authors describe the most critical shortcomings without going into technical detail and suggest how to work with these shortcomings in daily practice.
Article
Full-text available
With the development of Industry 4.0, artificial intelligence (AI) is gaining increasing attention for its performance in solving particularly complex problems in industrial chemistry and chemical engineering. Therefore, this review provides an overview of the application of AI techniques, in particular machine learning, in chemical design, synthesis, and process optimization over the past years. In this review, the focus is on the application of AI for structure-function relationship analysis, synthetic route planning, and automated synthesis. Finally, we discuss the challenges and future of AI in making chemical products.
Article
Full-text available
Microreactors have gained widespread attention from academic and industrial researchers due to their exceptionally fast mass and heat transfer and flexible control. In this work, CiteSpace software was used to systematically analyze the relevant literature to gain a comprehensive understanding of the research status of microreactors in various fields. The results show that the research depth and application scope of microreactors are continuing to expand. The top 10 most popular research fields are photochemistry, pharmaceutical intermediates, multistep flow synthesis, mass transfer, computational fluid dynamics, μ‐TAS (micro total analysis system), nanoparticles, biocatalysis, hydrogen production, and solid‐supported reagents. The evolution trends of current focus areas, including photochemistry, mass transfer, biocatalysis and hydrogen production, are examined, and their milestone literature is analyzed in detail. This article demonstrates the development of different fields of microreactor technology and highlights the unending opportunities and challenges offered by this fascinating technology.
Article
Full-text available
The fields of brain‐inspired computing, robotics, and, more broadly, artificial intelligence (AI) seek to implement knowledge gleaned from the natural world into human‐designed electronics and machines. In this review, the opportunities presented by complex oxides, a class of electronic ceramic materials whose properties can be elegantly tuned by doping, electron interactions, and a variety of external stimuli near room temperature, are discussed. The review begins with a discussion of natural intelligence at the elementary level in the nervous system, followed by collective intelligence and learning at the animal colony level mediated by social interactions. An important aspect highlighted is the vast spatial and temporal scales involved in learning and memory. The focus then turns to collective phenomena, such as metal‐to‐insulator transitions (MITs), ferroelectricity, and related examples, to highlight recent demonstrations of artificial neurons, synapses, and circuits and their learning. First‐principles theoretical treatments of the electronic structure, and in situ synchrotron spectroscopy of operating devices are then discussed. The implementation of the experimental characteristics into neural networks and algorithm design is then reviewed. Finally, outstanding materials challenges that require a microscopic understanding of the physical mechanisms, which will be essential for advancing the frontiers of neuromorphic computing, are highlighted.
Article
Full-text available
Generative machine learning models have become widely adopted in drug discovery and other fields to produce new molecules and explore molecular space, with the goal of discovering novel compounds with optimized properties. These generative models are frequently combined with transfer learning or scoring of the physicochemical properties to steer generative design, yet they often cannot address a wide variety of potential problems and tend to converge into similar molecular space when combined with a scoring function for the desired properties. In addition, the generated compounds may not be synthetically feasible, limiting their usefulness in real-world scenarios. Here, we introduce a suite of automated tools called MegaSyn representing three components: a new hill-climb algorithm, which makes use of SMILES-based recurrent neural network (RNN) generative models, analog generation software, and retrosynthetic analysis coupled with fragment analysis to score molecules for their synthetic feasibility. We show that by deconstructing the targeted molecules and focusing on substructures, combined with an ensemble of generative models, MegaSyn generally performs well for the specific tasks of generating new scaffolds as well as targeted analogs, which are likely synthesizable and druglike. We now describe the development, benchmarking, and testing of this suite of tools and, using multiple RNN-based test cases, propose how they might be used to optimize molecules or prioritize promising lead compounds.
Article
Full-text available
In the literature, machine learning (ML) and artificial intelligence (AI) applications tend to start with examples that are irrelevant to process engineers (e.g. classification of images between cats and dogs, house pricing, types of flowers, etc.). However, process engineering principles are also based on pseudo-empirical correlations and heuristics, which are a form of ML. In this work, industrial data science fundamentals will be explained and linked with commonly known examples in process engineering, followed by a review of industrial applications using state-of-the-art ML techniques.
Article
Full-text available
Machine learning (ML) provides an efficient method to predict unknown properties during the exploration of new materials, but how to efficiently represent molecules as input is still not fully solved. Inspired by image processing, one of the classical ML tasks, this work developed a method to predict structure‐dependent properties by converting the atom positions into a three‐dimensional (3D) molecular image and learning the structural features from the image via a classical convolutional neural network. After training on datasets of more than 12,000 species, very high accuracy is obtained in predicting both theoretical molecular energy and experimental properties including melting points, boiling points, and flash points. Since stereoscopic information is explicitly and accurately represented by the molecular images, our model successfully distinguishes the melting points and boiling points of molecules with similar structures, including those of trans–cis isomers.
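The idea of turning atom positions into a 3D image can be illustrated with a minimal sketch. This is not the authors' encoding, which the abstract does not specify; it assumes a simple occupancy grid, and the function name `voxelize`, the grid size, and the box length are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Grid = List[List[List[int]]]


def voxelize(coords: Sequence[Tuple[float, float, float]],
             grid_size: int = 8,
             box: float = 8.0) -> Grid:
    """Count atoms per cell of a cubic grid centred on the origin.

    coords    -- atom positions (assumed to be in angstroms)
    grid_size -- number of cells along each axis (illustrative default)
    box       -- edge length of the cube covered by the grid
    """
    grid = [[[0 for _ in range(grid_size)]
             for _ in range(grid_size)]
            for _ in range(grid_size)]
    cell = box / grid_size   # edge length of one voxel
    half = box / 2.0         # shift so the origin sits at the cube centre
    for x, y, z in coords:
        indices = [int((v + half) // cell) for v in (x, y, z)]
        # Atoms falling outside the box are ignored in this sketch.
        if all(0 <= i < grid_size for i in indices):
            i, j, k = indices
            grid[i][j][k] += 1
    return grid


# Two atoms inside the box, one far outside it.
image = voxelize([(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (100.0, 0.0, 0.0)])
```

A real encoding would likely add one channel per element and smooth each atom over neighbouring voxels before passing the grid to a 3D convolutional network; the occupancy count here only shows the coordinate-to-grid mapping.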
Article
Full-text available
This is a critical review of artificial intelligence/machine learning (AI/ML) methods applied to battery research. It aims at providing a comprehensive, authoritative, and critical, yet easily understandable, review of general interest to the battery community. It addresses the concepts, approaches, tools, outcomes, and challenges of using AI/ML as an accelerator for the design and optimization of the next generation of batteries—a current hot topic. It intends to create both accessibility of these tools to the chemistry and electrochemical energy sciences communities and completeness in terms of the different battery R&D aspects covered.
Article
Full-text available
As a fundamental problem in chemistry, retrosynthesis aims at designing reaction pathways and intermediates for a target compound. The goal of artificial intelligence (AI)-aided retrosynthesis is to automate this process by learning from previous chemical reactions to make new predictions. Although several models have demonstrated their potential for automated retrosynthesis, there is still a significant need to further enhance the prediction accuracy to a more practical level. Here we propose a local retrosynthesis framework called LocalRetro, motivated by the chemical intuition that molecular changes occur mostly locally during chemical reactions. This differs from nearly all existing retrosynthesis methods that suggest reactants based on the global structures of the molecules, often containing fine details not directly relevant to the reactions. This local concept yields local reaction templates involving the atom and bond edits. Because remote functional groups can also affect the overall reaction path as a secondary aspect, the proposed locally encoded retrosynthesis model is then further refined to account for the nonlocal effects of chemical reaction through a global attention mechanism. Our model shows a promising 89.5 and 99.2% round-trip accuracy at top-1 and top-5 predictions for the USPTO-50K dataset containing 50 016 reactions. We further demonstrate the validity of LocalRetro on a large dataset containing 479 035 reactions (USPTO-MIT) with comparable round-trip top-1 and top-5 accuracy of 87.0 and 97.4%, respectively. The practical application of the model is also demonstrated by correctly predicting the synthesis pathways of five drug candidate molecules from various literature sources.
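For readers unfamiliar with top-1 and top-5 figures, the following is a minimal sketch of how a plain top-k accuracy is computed from ranked predictions. It is an illustration only: the round-trip accuracy quoted above additionally checks predictions with a forward reaction model, which is not reproduced here, and the function name and the toy labels are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   targets: Sequence[str],
                   k: int) -> float:
    """Fraction of cases whose target appears among the first k predictions.

    ranked_predictions -- one ranked list of candidate answers per case,
                          best candidate first
    targets            -- the reference answer for each case
    """
    if not targets or len(ranked_predictions) != len(targets):
        raise ValueError("need one non-empty prediction list per target")
    hits = sum(1 for preds, target in zip(ranked_predictions, targets)
               if target in preds[:k])
    return hits / len(targets)


# Toy data: placeholder labels stand in for reactant sets.
predictions = [["A", "B"], ["C", "D"], ["E", "F"]]
references = ["A", "D", "X"]

top1 = top_k_accuracy(predictions, references, k=1)  # only the first case hits
top2 = top_k_accuracy(predictions, references, k=2)  # first and second cases hit
```

In practice the string comparison would be done on canonicalized molecular representations so that equivalent notations of the same reactants count as a match.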
Article
Full-text available
Machine learning models are poised to make a transformative impact on chemical sciences by dramatically accelerating computational algorithms and amplifying insights available from computational chemistry methods. However, achieving this requires a confluence and coaction of expertise in computer science and physical sciences. This Review is written for new and experienced researchers working at the intersection of both fields. We first provide concise tutorials of computational chemistry and machine learning methods, showing how insights involving both can be achieved. We follow with a critical review of noteworthy applications that demonstrate how computational chemistry and machine learning can be used together to provide insightful (and useful) predictions in molecular and materials modeling, retrosyntheses, catalysis, and drug design.
Article
Full-text available
Drug designing and development is an important area of research for pharmaceutical companies and chemical scientists. However, low efficacy, off-target delivery, time consumption, and high cost impose a hurdle and challenges that impact drug design and discovery. Further, complex and big data from genomics, proteomics, microarray data, and clinical trials also impose an obstacle in the drug discovery pipeline. Artificial intelligence and machine learning technology play a crucial role in drug discovery and development. In other words, artificial neural networks and deep learning algorithms have modernized the area. Machine learning and deep learning algorithms have been implemented in several drug discovery processes such as peptide synthesis, structure-based virtual screening, ligand-based virtual screening, toxicity prediction, drug monitoring and release, pharmacophore modeling, quantitative structure–activity relationship, drug repositioning, polypharmacology, and physiochemical activity. Evidence from the past strengthens the implementation of artificial intelligence and deep learning in this field. Moreover, novel data mining, curation, and management techniques provided critical support to recently developed modeling algorithms. In summary, artificial intelligence and deep learning advancements provide an excellent opportunity for rational drug design and discovery process, which will eventually impact mankind.
Graphic abstract: The primary concern associated with drug design and development is time consumption and production cost. Further, inefficiency, inaccurate target delivery, and inappropriate dosage are other hurdles that inhibit the process of drug delivery and development. With advancements in technology, computer-aided drug design integrating artificial intelligence algorithms can eliminate the challenges and hurdles of traditional drug design and development.
Artificial intelligence is referred to as superset comprising machine learning, whereas machine learning comprises supervised learning, unsupervised learning, and reinforcement learning. Further, deep learning, a subset of machine learning, has been extensively implemented in drug design and development. The artificial neural network, deep neural network, support vector machines, classification and regression, generative adversarial networks, symbolic learning, and meta-learning are examples of the algorithms applied to the drug design and discovery process. Artificial intelligence has been applied to different areas of drug design and development process, such as from peptide synthesis to molecule design, virtual screening to molecular docking, quantitative structure–activity relationship to drug repositioning, protein misfolding to protein–protein interactions, and molecular pathway identification to polypharmacology. Artificial intelligence principles have been applied to the classification of active and inactive, monitoring drug release, pre-clinical and clinical development, primary and secondary drug screening, biomarker development, pharmaceutical manufacturing, bioactivity identification and physiochemical properties, prediction of toxicity, and identification of mode of action.
Article
Full-text available
De novo drug design is a computational approach that generates novel molecular structures from atomic building blocks with no a priori relationships. Conventional methods include structure-based and ligand-based design, which depend on the properties of the active site of a biological target or its known active binders, respectively. Artificial intelligence, including machine learning, is an emerging field that has positively impacted the drug discovery process. Deep reinforcement learning is a subdivision of machine learning that combines artificial neural networks with reinforcement-learning architectures. This method has successfully been employed to develop novel de novo drug design approaches using a variety of artificial networks including recurrent neural networks, convolutional neural networks, generative adversarial networks, and autoencoders. This review article summarizes advances in de novo drug design, from conventional growth algorithms to advanced machine-learning methodologies and highlights hot topics for further development.
Article
Full-text available
Using the CiteSpace software and bibliometric methods, with the core collection of the Web of Science (WoS) database as the data source, the development of industrial heritage research in China and Western countries since the 2006 Wuxi Proposal was analyzed. The study found that the latest quantitative changes in China and Western countries’ industrial heritage research have similar fluctuations. However, researchers and institutions in the two places are independent of each other, lacking in-depth cooperative research. Notwithstanding, comprehensive and holistic research needs to be strengthened. The research content in China mainly focuses on the issues of urban renewal, industrial heritage tourism and creative industries, whereas Western countries are dominated by heritage and community building industrial heritage, the exploration of tourism and the protection of industrial sites, post-industrial heritage protection, and new technology use. Finally, by comparing and analyzing the research status of the two regions, future research on industrial heritage in China and Western countries is encouraged.
Article
Full-text available
Fragment-based drug (or lead) discovery (FBDD or FBLD) has developed in the last two decades to become a successful key technology in the pharmaceutical industry for early stage drug discovery and development. The FBDD strategy consists of screening low molecular weight compounds against macromolecular targets (usually proteins) of clinical relevance. These small molecular fragments can bind at one or more sites on the target and act as starting points for the development of lead compounds. In developing the fragments, attractive features that can translate into compounds with favorable physical, pharmacokinetic and toxicity (ADMET: absorption, distribution, metabolism, excretion, and toxicity) properties can be integrated. Structure-enabled fragment screening campaigns use a combination of screening by a range of biophysical techniques, such as differential scanning fluorimetry, surface plasmon resonance, and thermophoresis, followed by structural characterization of fragment binding using NMR or X-ray crystallography. Structural characterization is also used in subsequent analysis for growing fragments of selected screening hits. The latest iteration of the FBDD workflow employs a high-throughput methodology of massively parallel screening by X-ray crystallography of individually soaked fragments. In this review we will outline the FBDD strategies and explore a variety of in silico approaches to support the follow-up fragment-to-lead optimization by growing, linking, or merging. These fragment expansion strategies include hot spot analysis, druggability prediction, SAR (structure-activity relationships) by catalog methods, application of machine learning/deep learning models for virtual screening and several de novo design methods for proposing synthesizable new compounds. Finally, we will highlight recent case studies in fragment-based drug discovery where in silico methods have successfully contributed to the development of lead compounds.
Article
Full-text available
Finding new molecules with a desired biological activity is an extremely difficult task. In this context, artificial intelligence and generative models have been used for molecular de novo design and compound optimization. Herein, we report a generative model that bridges systems biology and molecular design, conditioning a generative adversarial network with transcriptomic data. By doing so, we can automatically design molecules that have a high probability to induce a desired transcriptomic profile. As long as the gene expression signature of the desired state is provided, this model is able to design active-like molecules for desired targets without any previous target annotation of the training compounds. Molecules designed by this model are more similar to active compounds than the ones identified by similarity of gene expression signatures. Overall, this method represents an alternative approach to bridge chemistry and biology in the long and difficult road of drug discovery. High quality hit identification remains a considerable challenge in de novo drug design. Here, the authors train a generative adversarial network with transcriptome profiles induced by a large set of compounds, enabling it to design molecules that are likely to induce desired expression profiles.
Article
Full-text available
Whereas most organic molecules can be synthesized from progressively simpler substrates, syntheses of complex organic targets often involve counterintuitive sequences of steps that first complexify the structure but, by doing so, open up possibilities for pronounced structural simplification in subsequent, downstream steps. Such complexifying/simplifying reaction sequences, called tactical combinations (TCs), can be quite powerful and elegant but also inherently hard to spot; indeed, only some 500 TCs have so far been cataloged, and even fewer are routinely used in synthetic practice. This paper describes computer-driven discovery of large numbers of viable TCs (over 46,000 combinations of reaction classes and ∼4.85 million combinations of reaction variants), the vast majority of which have no prior literature precedent. Examples, including a concise wet lab synthesis of a small natural product, are provided to illustrate how the use of these newly discovered TCs can streamline the design of syntheses leading to important drugs and/or natural products.
Article
Full-text available
In recent years, the development of high-throughput screening (HTS) technologies and their establishment in an industrialized environment have given scientists the possibility to test millions of molecules and profile them against a multitude of biological targets in a short period of time, generating data at a much faster pace and with a higher quality than before. Besides the structure activity data from traditional bioassays, more complex assays such as transcriptomics profiling or imaging have also been established as routine profiling experiments thanks to the advancement of Next Generation Sequencing or automated microscopy technologies. In industrial pharmaceutical research, these technologies are typically established in conjunction with automated platforms in order to enable efficient handling of screening collections of thousands to millions of compounds. To exploit the ever-growing amount of data that are generated by these approaches, computational techniques are constantly evolving. In this regard, artificial intelligence technologies such as deep learning and machine learning methods play a key role in cheminformatics and bio-image analytics fields to address activity prediction, scaffold hopping, de novo molecule design, reaction/retrosynthesis predictions, or high content screening analysis. Herein we summarize the current state of analyzing large-scale compound data in industrial pharmaceutical research and describe the impact it has had on the drug discovery process over the last two decades, with a specific focus on deep-learning technologies.
Article
Full-text available
Organic electronics such as organic field-effect transistors (OFET), organic light-emitting diodes (OLED), and organic photovoltaics (OPV) have flourished over the last three decades, largely due to the development of new conjugated materials. Their designs have evolved through incremental modification and stepwise inspiration by researchers; however, a complete survey of the large molecular space is experimentally intractable. Machine learning (ML), based on the rapidly growing field of artificial intelligence technology, offers high throughput material exploration that is more efficient than high-cost quantum chemical calculations. This review describes the present status and perspective of ML-based development (materials informatics) of organic electronics. Although the complexity of OFET, OLED, and OPV makes revealing their structure-property relationships difficult, a cooperative approach incorporating virtual ML, human consideration, and fast experimental screening may help to navigate growth and development in the organic electronics field.
Article
Full-text available
Traditional methods of discovering new materials, such as the empirical trial and error method and the density functional theory (DFT)‐based method, are unable to keep pace with the development of materials science today due to their long development cycles, low efficiency, and high costs. Accordingly, due to its low computational cost and short development cycle, machine learning is coupled with powerful data processing and high prediction performance and is being widely used in material detection, material analysis, and material design. In this article, we discuss the basic operational procedures in analyzing material properties via machine learning, summarize recent applications of machine learning algorithms to several mature fields in materials science, and discuss the improvements that are required for wide‐ranging application. Machine learning has been widely used in various fields of materials science. This review focused on the basic operational procedures of machine learning in analyzing the properties of materials; it summarized the applications of machine learning algorithms in materials science in recent years, which include material property analysis, materials design, and quantum chemistry; and it discussed problems and possible new directions in the development of machine learning.
Article
Full-text available
In the space of only a few years, deep generative modeling has revolutionized how we think of artificial creativity, yielding autonomous systems which produce original images, music, and text. Inspired by these successes, researchers are now applying deep generative modeling techniques to the generation and optimization of molecules; in our review we found 45 papers on the subject published in the past two years. These works point to a future where such systems will be used to generate lead molecules, greatly reducing resources spent downstream synthesizing and characterizing bad leads in the lab. In this review we survey the increasingly complex landscape of models and representation schemes that have been proposed. The four classes of techniques we describe are recursive neural networks, autoencoders, generative adversarial networks, and reinforcement learning. After first discussing some of the mathematical fundamentals of each technique, we draw high level connections and comparisons with other techniques and expose the pros and cons of each. Several important high level themes emerge as a result of this work, including the shift away from the SMILES string representation of molecules towards more sophisticated representations such as graph grammars and 3D representations, the importance of reward function design, the need for better standards for benchmarking and testing, and the benefits of adversarial training and reinforcement learning over maximum likelihood based training.
Article
Full-text available
In recent years, machine learning (ML) methods have become increasingly popular in computational chemistry. After being trained on appropriate ab initio reference data, these methods can accurately predict the properties of chemical systems, circumventing the need to explicitly solve the electronic Schrödinger equation. Because of their computational efficiency and scalability to large datasets, deep neural networks (DNNs) are a particularly promising ML algorithm for chemical applications. This work introduces PhysNet, a DNN architecture designed for predicting energies, forces and dipole moments of chemical systems. PhysNet achieves state-of-the-art performance on the QM9, MD17 and ISO17 benchmarks. Further, two new datasets are generated in order to probe the performance of ML models for describing chemical reactions, long-range interactions, and condensed phase systems. It is shown that explicitly including electrostatics in energy predictions is crucial for a qualitatively correct description of the asymptotic regions of a potential energy surface (PES). PhysNet models trained on a systematically constructed set of small peptide fragments (at most eight heavy atoms) are able to generalize to considerably larger proteins like deca-alanine (Ala10): the optimized geometry of helical Ala10 predicted by PhysNet is virtually identical to ab initio results (RMSD = 0.21 Å). By running unbiased molecular dynamics (MD) simulations of Ala10 on the PhysNet-PES in gas phase, it is found that instead of a helical structure, Ala10 folds into a "wreath-shaped" configuration, which is more stable than the helical form by 0.46 kcal mol⁻¹ according to the reference ab initio calculations.
Article
Full-text available
The electronic charge density plays a central role in determining the behavior of matter at the atomic scale, but its computational evaluation requires demanding electronic-structure calculations. We introduce an atom-centered, symmetry-adapted framework to machine-learn the valence charge density based on a small number of reference calculations. The model is highly transferable, meaning it can be trained on electronic-structure data of small molecules and used to predict the charge density of larger compounds with low, linear-scaling cost. Applications are shown for various hydrocarbon molecules of increasing complexity and flexibility, and demonstrate the accuracy of the model when predicting the density on octane and octatetraene after training exclusively on butane and butadiene. This transferable, data-driven model can be used to interpret experiments, accelerate electronic structure calculations, and compute electrostatic interactions in molecules and condensed-phase systems.
Article
Chemical and biomass processing systems release volatile matter compounds into the environment daily. Catalytic reforming can convert these compounds into valuable fuels, but developing stable and efficient catalysts is challenging. Machine learning can handle complex relationships in big data and optimize reaction conditions, making it an effective solution for addressing the mentioned issues. This study is the first to develop a machine-learning-based research framework for modeling, understanding, and optimizing the catalytic steam reforming of volatile matter compounds. Toluene catalytic steam reforming is used as a case study to show how chemical/textural analyses (e.g., X-ray diffraction analysis) can be used to obtain input features for machine learning models. Literature is used to compile a database covering a variety of catalyst characteristics and reaction conditions. The process is thoroughly analyzed, mechanistically discussed, modeled by six machine learning models, and optimized using the particle swarm optimization algorithm. Ensemble machine learning provides the best prediction performance (R² > 0.976) for toluene conversion and product distribution. The optimal tar conversion (higher than 77.2%) is obtained at temperatures between 637.44 and 725.62 °C, with a steam-to-carbon molar ratio of 5.81–7.15 and a catalyst BET surface area of 476.03–638.55 m²/g. The feature importance analysis satisfactorily reveals the effects of input descriptors on model prediction. Operating conditions (50.9%) and catalyst properties (49.1%) are equally important in modeling. The developed framework can expedite the search for optimal catalyst characteristics and reaction conditions, not only for catalytic chemical processing but also for related research areas.
Article
Finding the optimum structures of non-stoichiometric or berthollide materials, such as (1D, 2D, 3D) materials or nanoparticles (0D), is challenging due to the huge chemical/structural search space. Computational methods...
Article
Hydrovoltaic technology can harvest sustainable energy and clean water directly from various environments, providing a novel way to alleviate global environmental problems and the energy crisis. A wide variety of hydrovoltaic materials with distinctly different morphological, mechanical and functional features have been created by using graphene oxide (GO) as a versatile building block. However, there is still a lack of comprehensive knowledge regarding the involvement of GO in hydrovoltaic technology and its future perspectives. In this review, the latest progress in the preparation of GO-based hydrovoltaic materials and their various applications is summarized. The working mechanisms of hydrovoltaic power generation and some remaining challenges are also discussed. Finally, some suggestions are given for further development of GO-based hydrovoltaic technology.
Article
Recent research on artificial intelligence indicates that machine learning algorithms can auto-generate novel drug-like molecules. Generative models have revolutionized de novo drug discovery, rendering the explorative process more efficient. Several model frameworks and input formats have been proposed to enhance the performance of intelligent algorithms in generative molecular design. In this systematic literature review of experimental articles and reviews over the last five years, machine learning models, challenges associated with computational molecule design along with proposed solutions, and molecular encoding methods are discussed. A query-based search of the PubMed, ScienceDirect, Springer, Wiley Online Library, arXiv, MDPI, bioRxiv, and IEEE Xplore databases yielded 87 studies. Twelve additional studies were identified via citation searching. Of the articles in which machine learning was implemented, six prominent algorithms were identified: long short-term memory recurrent neural networks (LSTM-RNNs), variational autoencoders (VAEs), generative adversarial networks (GANs), adversarial autoencoders (AAEs), evolutionary algorithms, and gated recurrent unit recurrent neural networks (GRU-RNNs). Furthermore, eight central challenges were designated: homogeneity of generated molecular libraries, deficient synthesizability, limited assay data, model interpretability, incapacity for multi-property optimization, incomparability, restricted molecule size, and uncertainty in model evaluation. Molecules were encoded either as strings, which were occasionally augmented using randomization, as 2D graphs, or as 3D graphs. Statistical analysis and visualization are performed to illustrate how approaches to machine learning in de novo drug design have evolved over the past five years. Finally, future opportunities and reservations are discussed.
Article
This article reviews recent developments in the applications of machine learning, data-driven modeling, transfer learning, and autonomous experimentation for the discovery, design, and optimization of soft and biological materials. The design and engineering of molecules and molecular systems have long been a preoccupation of chemical and biomolecular engineers using a variety of computational and experimental techniques. Increasingly, researchers have looked to emerging and established tools in artificial intelligence and machine learning to integrate with established approaches in chemical science to realize powerful, efficient, and in some cases autonomous platforms for molecular discovery, materials engineering, and process optimization. This review summarizes the basic principles underpinning these techniques and highlights recent successful example applications in autonomous materials discovery, transfer learning, and multi-fidelity active learning. Expected final online publication date for the Annual Review of Chemical and Biomolecular Engineering, Volume 13 is October 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Article
In recent years, emerging contaminants have been found in wastewater, surface water, and even drinking water, and they should be treated to ensure the safety of our living environment. In this study, we provide a comprehensive summary of wastewater treatment and emerging contaminants research from 1998 to 2021 using bibliometric analysis. This study is based on the Web of Science Core Collection database. The bibliometrix R-package, VOSviewer and CiteSpace software are used for bibliometric analysis and science mapping. A dataset of 10,605 publications has been retrieved. The analysis results show that China has produced the most publications. China and the United States have the closest cooperation. Analysis of the most cited papers reveals that purification or removal techniques such as ozonation or membrane filtration can effectively remove pharmaceutical compounds from the water environment. We also found that the efficient detection of emerging contaminants and the optimization of removal methods are current challenges. Finally, future research directions are discussed.
Article
We present OrbNet Denali, a machine learning potential that is designed as a drop-in replacement for ground-state density functional theory (DFT) energy calculations. The model is a message-passing neural network that uses symmetry-adapted atomic orbital features from low-cost quantum calculations to predict the energy of a molecule. OrbNet Denali is trained on a vast dataset of 2.3M DFT calculations on molecules and geometries. This dataset covers the most common elements in bio- and organic chemistry (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br, I) as well as charged molecules. OrbNet Denali is demonstrated on several well-established benchmark datasets, and we find that it provides accuracy on par with modern DFT methods while offering a speedup of up to three orders of magnitude. For the GMTKN55 benchmark set, OrbNet Denali achieves WTMAD-1 and WTMAD-2 scores of 7.19 and 9.84, on par with modern DFT functionals. For several GMTKN55 subsets, which contain chemical problems that are not present in the training set, OrbNet Denali produces MAEs comparable to those of DFT methods. For the Hutchison conformers benchmark set, OrbNet Denali has a median correlation coefficient of R² = 0.90 compared to reference DLPNO-CCSD(T) calculations, and R² = 0.97 compared to the method used to generate the training data (ωB97X-D3/def2-TZVP), exceeding the performance of any other method with a similar cost. Similarly, the model reaches chemical accuracy for non-covalent interactions in the S66x10 dataset. For torsional profiles, OrbNet Denali reproduces the torsion profiles of ωB97X-D3/def2-TZVP with an average MAE of 0.12 kcal/mol for the potential energy surfaces of the diverse fragments in the TorsionNet500 dataset.
Article
The design of chemical-based products and functional materials is vital to modern technologies, yet remains expensive and slow. Artificial intelligence and machine learning offer new approaches to leverage data to overcome these challenges. This review focuses on recent applications of Bayesian optimization (BO) to chemical products and materials including molecular design, drug discovery, molecular modeling, electrolyte design, and additive manufacturing. Numerous examples show how BO often requires an order of magnitude fewer experiments than Edisonian search. The essential equations for BO are introduced in a self-contained primer specifically written for chemical engineers and others new to the area. Finally, the review discusses four current research directions for BO and their relevance to product and materials design.
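As a rough illustration of the kind of equation a BO primer typically covers, the sketch below computes the expected improvement (EI) acquisition function for a maximization problem. EI is assumed here as a representative choice; the abstract does not say which acquisition functions the review uses. The candidate names, predicted means, and uncertainties are hypothetical stand-ins for the output of a surrogate model such as a Gaussian process.

```python
import math

def normal_pdf(z: float) -> float:
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu: float, sigma: float, best_so_far: float) -> float:
    """EI for maximization, given the surrogate's mean and standard deviation."""
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * normal_cdf(z) + sigma * normal_pdf(z)

# Hypothetical surrogate predictions (mean yield, standard deviation)
# for two untested formulations; the best measured yield so far is 0.78.
candidates = {"A": (0.80, 0.01), "B": (0.70, 0.20)}
best_measured = 0.78

scores = {name: expected_improvement(mu, sd, best_measured)
          for name, (mu, sd) in candidates.items()}
next_experiment = max(scores, key=scores.get)
```

In this toy case candidate B has the higher EI even though its predicted mean is lower, because its larger uncertainty leaves more room for improvement; this exploration-versus-exploitation trade-off is what allows BO to economize on experiments.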
Article
Under various names, such as, data science, Industry 4.0, or smart manufacturing, digital technologies are transforming our world. Although value statements and promises are published in a steady stream, uptake in the chemical and process industries has been moderate. Successful transformations are not confined to tasks, the “whats”. They also require great care in how they are carried out. This overview, aimed at all participants in the digital transformation of the chemical industry, presents “dos and don’ts” method recommendations for three successive steps: strategy development to define goals, (organisational) mobilisation for implementation, and project delivery. Successful strategy development requires assembling an empowered and skilled team; truly understanding the data science and digital transformation topics; accepting emergence and iteration; and focusing on real needs. Mobilising an organisation is essential so that it can translate strategy to tactics and value. Within organisations, one must therefore: enable project identification; set up a supportive organisational structure and skilful people within it. Looking outside, participation in partnerships is essential to access external resources. Delivery of valuable projects is the end goal. A diverse portfolio is needed, as well as effective collaborations between subject matter experts and data scientists. Technically, the use of software best practice is beneficial, and care must be taken of the data themselves. In the longer term, data science opportunities will extend beyond merely improving traditional analytics to make them faster, better, and more user-friendly. The early identification of beneficial future trends requires encouraging those individuals who have an interest in disruptive currents, and the perceptiveness to sense their areas of application.
Article
Renewable energy resources have enabled the mitigation of global environmental pollution and sustainable energy generation. Due to renewability, cleanliness, and vast sustainability aspects of wind energy, wind power generation (WPG) systems have recently found rapid development. In this study, the sustainability aspects of WPG systems are briefly summarized by comparing the sustainability parameters for various sources of WPG. The techno-socio-economic, aesthetic, and cultural impacts of conventional wind farms and the influence of the COVID-19 pandemic on the accomplishment of new wind energy projects were also discussed. The study aims to visualize the intellectual background, current research status and state-of-the-art knowledge structure of WPG-related literature using CiteSpace based scientometric investigation. The WPG-related original articles, published from 2005 to 2020, were retrieved from the Web of Science core collection. The most prolific publications, countries and journals involved in the flourishment of WPG research were identified. Moreover, visualization methods were employed to determine the highly productive articles, keywords, hotspots, and research frontiers in the WPG domain. Furthermore, the classification of WPG knowledge was performed in the form of clusters and knowledge structure to achieve ten distinct sub-domains. It was revealed that China was the most prominent country among others in the research of WPG, holding 29% of the total publications; the most probable reason is the more assertive funding support policy from the Chinese government and research institutions, compared to other countries. This study can help the researchers to spot the new research frontiers and distinguish among the most critical sub-domains of WPG based knowledge.
Article
The de novo design of molecular structures using deep learning generative models introduces an encouraging solution to drug discovery in the face of the continuously increasing cost of new drug development. From the generation of original texts, images, and videos to the design of novel molecular structures from scratch, the creativity of deep learning generative models shows the heights machine intelligence can reach. The purpose of this paper is to review the latest advances in generative chemistry, which relies on generative modeling to expedite the drug discovery process. This review starts with a brief history of artificial intelligence in drug discovery to outline this emerging paradigm. Commonly used chemical databases, molecular representations, and tools in cheminformatics and machine learning are covered as the infrastructure for generative chemistry. Detailed discussions focus on the use of cutting-edge generative architectures, including recurrent neural networks, variational autoencoders, adversarial autoencoders, and generative adversarial networks, for compound generation. Challenges and future perspectives follow.
Article
Machine-learned ranking models have been developed for the prediction of substrate-specific cross-coupling reaction conditions. Datasets of published reactions were curated for Suzuki, Negishi, and C–N couplings, as well as Pauson–Khand reactions. String, descriptor, and graph encodings were tested as input representations, and models were trained to predict the set of conditions used in a reaction as a binary vector. Unique reagent dictionaries categorized by expert-crafted reaction roles were constructed for each dataset, leading to context-aware predictions. We find that relational graph convolutional networks and gradient-boosting machines are very effective for this learning task, and we disclose a novel reaction-level graph-attention operation in the top-performing model.
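The "binary vector" target described above can be pictured with a minimal sketch: each position in the vector corresponds to one reagent in a role-grouped dictionary, and it is set to 1 when that reagent appears in the reaction's conditions. The dictionary contents, the role names, and the function `encode_conditions` are hypothetical illustrations, not the paper's actual dictionaries or code.

```python
from typing import Dict, List, Tuple

# Hypothetical reagent dictionary grouped by reaction role (illustrative only).
REAGENT_DICTIONARY: Dict[str, List[str]] = {
    "catalyst": ["Pd(PPh3)4", "Pd(OAc)2"],
    "base": ["K2CO3", "Cs2CO3", "NaOtBu"],
    "solvent": ["THF", "toluene", "dioxane"],
}

def vocabulary(dictionary: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten the role-grouped dictionary into an ordered list of (role, reagent)."""
    return [(role, name) for role, names in dictionary.items() for name in names]

def encode_conditions(conditions: Dict[str, List[str]],
                      dictionary: Dict[str, List[str]] = REAGENT_DICTIONARY) -> List[int]:
    """Return a multi-hot vector: 1 where a (role, reagent) pair is used, else 0."""
    used = {(role, name) for role, names in conditions.items() for name in names}
    vocab = vocabulary(dictionary)
    unknown = used - set(vocab)
    if unknown:
        raise ValueError(f"reagents not in dictionary: {sorted(unknown)}")
    return [1 if entry in used else 0 for entry in vocab]

# One made-up set of conditions for a cross-coupling reaction.
example = {"catalyst": ["Pd(OAc)2"], "base": ["K2CO3"], "solvent": ["dioxane"]}
vector = encode_conditions(example)
# vector == [0, 1, 1, 0, 0, 0, 0, 1]
```

Keying each position on the (role, reagent) pair, not on the reagent name alone, is one simple way to keep predictions role-aware: the same compound could appear under two roles without the positions colliding.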
Article
This study combines applied mathematics, visual analysis technology, and information science with a scientometric approach to systematically analyze the development status, research distribution and future trends of intelligent vehicles research. A total of 3933 published papers indexed by SCIE and SSCI from 2000 to 2019 are examined using Mapping Knowledge Domain (MKD) and scientometric approaches. Firstly, this paper analyzes the literature in the field of intelligent vehicles in terms of the number of publications, the most productive countries, research organizations, co-authorship among the main research groups, and the journals from which the articles are mainly sourced. Then, co-citation analysis is used to obtain five major research directions in the field of intelligent vehicles: "system framework", "internet of vehicles", "intersection control algorithms", "influence on traffic flow", and "policies and barriers". Keyword co-occurrence analysis is applied to identify four dominant clusters: "planning and control system", "autonomous vehicle questionnaire", "sensor and vision", and "connected vehicles". Finally, we divide burst keywords into three phases according to publication date to show more clearly how the research focus and direction have changed over time.
Article
This manuscript presents a scientometric analysis of studies on the application of biochar for soil amendment, in order to investigate the research and developments in this field and to identify existing gaps and provide recommendations for future studies. A total of 2982 bibliographic records were retrieved from the Web of Science (WoS) database using appropriate sets of keywords, and these were analyzed based on the criteria of authors, publishing journals, citations received, contributing countries, institutions, and categories in research and development. Based on these data, the progress of research was mapped to identify the scientific status, such as current scientific and technological trends as well as the knowledge gaps. The majority of scientific developments started in the early 2000s and accelerated considerably after 2014. China and the USA are the leading countries in the application of biochar for the treatment of soils. Among the active journals, "Plant and Soil" has received the highest number of citations. This study aims to provide a comprehensive discussion and understanding of scientific advances and of the progress made, especially in recent years.
Article
The world needs new materials to stimulate the chemical industry in key sectors of our economy: environment and sustainability, information storage, optical telecommunications, and catalysis. Yet, nearly all functional materials are still discovered by "trial-and-error", of which the lack of predictability affords a major materials bottleneck to technological innovation. The average "molecule-to-market" lead time for materials discovery is currently 20 years. This is far too long for industrial needs, as highlighted by the Materials Genome Initiative, which has ambitious targets of up to 4-fold reductions in average molecule-to-market lead times. Such a large step change in progress can only be realistically achieved if one adopts an entirely new approach to materials discovery. Fortunately, a fundamentally new approach to materials discovery has been emerging, whereby data science with artificial intelligence offers a prospective solution to speed up these average molecule-to-market lead times. This approach is known as data-driven materials discovery. Its broad prospects have only recently become a reality, given the timely and major advances in "big data", artificial intelligence, and high-performance computing (HPC). Access to massive data sets has been stimulated by government-regulated open-access requirements for data and literature. Natural-language processing (NLP) and machine-learning (ML) tools that can mine data and find patterns therein are becoming mainstream. Exascale HPC capabilities that can aid data mining and pattern recognition and also generate their own data from calculations are now within our grasp. These timely advances present an ideal opportunity to develop data-driven materials-discovery strategies to systematically design and predict new chemicals for a given device application. 
This Account shows how data science can afford materials discovery via a four-step "design-to-device" pipeline that entails (1) data extraction, (2) data enrichment, (3) material prediction, and (4) experimental validation. Massive databases of cognate chemical and property information are first forged from "chemistry-aware" natural-language-processing tools, such as ChemDataExtractor, and enriched using machine-learning methods and high-throughput quantum-chemical calculations. New materials for a bespoke application can then be predicted by mining these databases with algorithmic encodings of relationships between chemical structures and physical properties that are known to deliver functional materials. These may take the form of classification, enumeration, or machine-learning algorithms. A data-mining workflow short-lists these predictions to a handful of lead candidate materials that go forward to experimental validation. This design-to-device approach is being developed to offer a roadmap for the accelerated discovery of new chemicals for functional applications. Case studies presented demonstrate its utility for photovoltaic, optical, and catalytic applications. While this Account is focused on applications in the physical sciences, the generic pipeline discussed is readily transferable to other scientific disciplines such as biology and medicine.
Article
Microplastic particles less than 5 mm in diameter have been detected in human feces and freshwater systems. Microplastics could cause serious physical and chemical harm to humans and other organisms. Previous studies on microplastics have mainly concentrated on the marine environment, and few have focused on freshwater microplastics. Therefore, CiteSpace II is used to systematically analyze the related literature in order to comprehensively understand the state of research on freshwater microplastics. The results show that there is still a large gap between research on freshwater and marine microplastics. Studies on freshwater microplastics have mainly been undertaken in developed countries such as the United States and Germany, while fewer studies have been conducted in the developing countries that face the most serious plastic pollution. Most studies focus on rivers and lakes, but other freshwater sources with microplastic pollution, such as groundwater and reservoirs, have received less attention. This study also explores the opportunities and challenges that freshwater research may face, in order to inform specific policies and measures to mitigate this emerging pollutant.
Article
Recently, many research groups have been addressing data-driven approaches for (retro)synthetic reaction prediction and retrosynthetic analysis. Although the performances of the data-driven approach have progressed due to recent advances of machine learning and deep learning techniques, problems such as improving capability of reaction prediction and the black-box problem of neural networks persist for practical use by chemists. To spread data-driven approaches to chemists, we focused on two challenges: improvement of retrosynthetic reaction prediction and interpretability of the prediction. In this paper, we propose an interpretable prediction framework using Graph Convolutional Networks (GCN) for retrosynthetic reaction prediction and Integrated Gradients (IGs) for visualization of contributions to the prediction to address these challenges. As a result, from the viewpoint of balanced accuracies, our model showed better performances than the approach using Extended-Connectivity Fingerprint (ECFP). Furthermore, IGs based visualization of the GCN prediction successfully highlighted reaction-related atoms.
Article
Predicting how a complex molecule reacts with different reagents, and how to synthesise complex molecules from simpler starting materials, are fundamental to organic chemistry. We show that an attention-based machine translation model - Molecular Transformer - tackles both reaction prediction and retrosynthesis by learning from the same dataset. Reagents, reactants and products are represented as SMILES text strings. For reaction prediction, the model “translates” the SMILES of reactants and reagents to product SMILES, and the converse for retrosynthesis. Moreover, a model trained on publicly available data is able to make accurate predictions on proprietary molecules extracted from pharma electronic lab notebooks, demonstrating generalisability across chemical space. We expect our versatile framework to be broadly applicable to problems such as reaction condition prediction, reagent prediction and yield prediction.
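Treating reaction prediction as translation requires splitting each SMILES string into tokens first. The sketch below shows one way to do that with a regular expression; the pattern is a simplified assumption written for illustration (it covers bracket atoms, the two-letter halogens Cl and Br, ring-closure digits, and common bond and branch symbols) and is not claimed to be the tokenizer used by the Molecular Transformer.

```python
import re
from typing import List

# Simplified, illustrative SMILES token pattern (an assumption, not the paper's).
# Order matters: bracket atoms and two-letter halogens are tried before single letters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@?>*$])"
)

def tokenize(smiles: str) -> List[str]:
    """Split a SMILES (or reaction SMILES) string into tokens.

    Raises ValueError if any character is not covered by the pattern,
    so that tokenization is guaranteed to be lossless.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in: {smiles!r}")
    return tokens

# Source side of a toy "translation" example: reactants separated by '.'.
source_tokens = tokenize("CC(=O)Cl.OCC")
# source_tokens == ['C', 'C', '(', '=', 'O', ')', 'Cl', '.', 'O', 'C', 'C']
```

A sequence-to-sequence model would consume `source_tokens` and emit the product's tokens; for retrosynthesis the roles of source and target are swapped.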
Article
Reaction databases provide a great deal of useful information to assist planning of experiments, but do not provide any interpretation or chemical concepts to accompany this information. In this work reactions are labeled with experimental conditions and network analysis shows that consistencies within clusters of data points can be leveraged to organize this information. In particular, this analysis shows how particular experimental conditions (specifically solvent) are effective in enabling specific organic reactions (Friedel-Crafts, Aldol addition, Claisen condensation, Diels-Alder, and Wittig), including variations within each reaction class. An example of network analysis is shown in the graphical abstract, where data points for a Claisen condensation reaction break into clusters that depend on the catalyst and chemical structure. This type of clustering, which mimics how a chemist reasons, is derived directly from the network. Therefore the findings of this work could augment synthesis planning by providing predictions in a fashion that mimics human chemists. To numerically evaluate solvent prediction ability, three methods are compared: network analysis (through the k-nearest neighbor algorithm), a support vector machine, and a deep neural network. The most accurate method in 4 of the 5 test cases is the network analysis, with deep neural networks also showing good prediction scores. The network analysis tool was evaluated by an expert panel of chemists, who generally agreed that the algorithm produced accurate solvent choices while simultaneously being transparent in the underlying reasons for its predictions.
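The k-nearest-neighbor idea mentioned above can be sketched in a few lines: a query reaction takes the solvent most common among its closest labeled neighbors. Everything concrete here is an assumption for illustration. The two-dimensional feature vectors, the Euclidean distance, and the solvent labels are made up, and the paper's network-based similarity is not reproduced.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Reaction = Tuple[Sequence[float], str]  # (feature vector, solvent label)

def knn_predict(train: List[Reaction], query: Sequence[float], k: int = 3) -> str:
    """Predict a solvent by majority vote among the k nearest training reactions."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    # On a tie, Counter.most_common keeps first-seen order, so the label
    # of the closest tied neighbor wins.
    return votes.most_common(1)[0][0]

# Hypothetical training data: toy 2D descriptors with recorded solvents.
training_set: List[Reaction] = [
    ([0.0, 0.0], "DCM"),
    ([0.1, 0.2], "DCM"),
    ([1.0, 1.0], "THF"),
    ([0.9, 1.1], "THF"),
    ([1.2, 0.8], "toluene"),
]

prediction = knn_predict(training_set, [0.05, 0.1], k=3)
# prediction == "DCM"
```

Because the prediction is just a vote among identifiable neighbors, the supporting reactions can be shown to the chemist directly, which is the transparency property the abstract highlights.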
Article
Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence–function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function. This review provides an overview of machine learning techniques in protein engineering and illustrates the underlying principles with the help of case studies.
Article
In Polymer Informatics, quantitative structure-property relationship (QSPR) modeling is an emerging approach for predicting relevant properties of polymers in the context of computer-aided design of industrial materials. Nevertheless, most QSPR models available in the literature use simplistic computational representations of polymers based on their structural repetitive unit. The aim of this work is to evaluate the effect of this simplification and to analyze new strategies to achieve alternative characterizations that capture the phenomenon of polydispersity. In particular, the experiments reported in this work are focused on three mechanical properties derived from the tensile test. The reported results revealed the disadvantages of using these simplified representations. Besides, we contributed with alternative representations for the databases of polymer molecular descriptors that achieved more realistic and accurate QSPR models.
Article
Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development.
Article
An extended semiempirical tight-binding model is presented, which is primarily designed for the fast calculation of structures and noncovalent interaction energies for molecular systems with roughly 1000 atoms. The essential novelty in this so-called GFN2-xTB method is the inclusion of anisotropic second order density fluctuation effects via short-range damped interactions of cumulative atomic multipole moments. Without noticeable increase in the computational demands, this results in a less empirical and overall more physically sound method, which does not require any classical halogen or hydrogen bonding corrections and which relies solely on global and element-specific parameters (available up to radon, Z = 86). Moreover, the atomic partial charge dependent D4 London dispersion model is incorporated self-consistently, which can be naturally obtained in a tight-binding picture from second order density fluctuations. Fully analytical and numerically precise gradients (nuclear forces) are implemented. The accuracy of the method is benchmarked for a wide variety of systems and compared with other semiempirical methods. Along with excellent performance for the "target" properties, we also find lower errors for "off-target" properties such as barrier heights and molecular dipole moments. High computational efficiency along with the improved physics compared to its precursor GFN-xTB makes this method well-suited to explore the conformational space of molecular systems. Significant improvements are furthermore observed for various benchmark sets, which are prototypical for biomolecular systems in aqueous solution.
Article
Predicting catalyst selectivity
Asymmetric catalysis is widely used in chemical research and manufacturing to access just one of two possible mirror-image products. Nonetheless, the process of tuning catalyst structure to optimize selectivity is still largely empirical. Zahrt et al. present a framework for more efficient, predictive optimization. As a proof of principle, they focused on a known coupling reaction of imines and thiols catalyzed by chiral phosphoric acid compounds. By modeling multiple conformations of more than 800 prospective catalysts, and then training machine-learning algorithms on a subset of experimental results, they achieved highly accurate predictions of enantioselectivities. Science, this issue p. eaau5631