Article (Literature Review)

Artificial intelligence for natural product drug discovery


Abstract

Developments in computational omics technologies have provided new means to access the hidden diversity of natural products, unearthing new potential for drug discovery. In parallel, artificial intelligence approaches such as machine learning have led to exciting developments in the computational drug design field, facilitating biological activity prediction and de novo drug design for molecular targets of interest. Here, we describe current and future synergies between these developments to effectively identify drug candidates from the plethora of molecules produced by nature. We also discuss how to address key challenges in realizing the potential of these synergies, such as the need for high-quality datasets to train deep learning algorithms and appropriate strategies for algorithm validation.


... 19 Supervised learning entails training an algorithm on a labeled dataset; the trained algorithm is then used to classify previously unseen data. 20 Herein we comprehensively explore the biosynthetic potential encoded in atropopeptide BGCs. We introduce AtropoFinder, a machine learning-based genome mining tool for the identification of atropopeptide BGCs. ...
Preprint
Full-text available
Ribosomally synthesized and posttranslationally modified peptides (RiPPs) constitute a diverse class of natural products. Atropopeptides are a recent addition to the fast-growing number of RiPP families. Characterized members of the peptide family feature a particularly intricate three-dimensional shape. Here we developed AtropoFinder, a machine learning-based algorithm to chart the biosynthetic landscape of the atropopeptides. AtropoFinder identified more than 650 atropopeptide biosynthetic gene clusters (BGCs). Through bioinformatics and modeling analyses, we pinpointed crucial motifs and residues in leader and core peptide sequences, prompting a refined definition of the atropopeptide RiPP family. Our study revealed that a substantial subset of atropopeptide BGCs harbors multiple tailoring genes, thus suggesting a broader structural diversity than previously anticipated. To verify AtropoFinder, we heterologously expressed four atropopeptide BGCs, which resulted in the identification of novel atropopeptides with varying peptide lengths and varying numbers and types of modifications. Most notably, our study resulted in the characterization of an atropopeptide that is more extensively modified than previously identified members, resulting in an even more rigid three-dimensional shape. Moreover, one characterized atropopeptide BGC encoding a single P450 is involved in the biosynthesis of two peptides with the same sequence but distinct and non-overlapping modification patterns. This work expands the atropopeptide chemical space, advances our understanding of atropopeptide biosynthesis and underscores the potential of machine learning in uncovering the uncharted biosynthetic diversity encoded in RiPP biosynthetic blueprints.
Article
Full-text available
Brachybacterium sp. GU-2 was isolated from the hard coral Porites lobata found in Apra Harbor, Guam, Micronesia. This genome sequence will be beneficial to understand the role of actinomycetes in coral holobionts. The Brachybacterium genome contains several gene clusters for bioactive compounds, including antibiotics.
Article
With the emergence of next-generation nucleotide sequencing and mass spectrometry-based proteomics and metabolomics tools, we have comprehensive and scalable methods to analyze the genes, transcripts, proteins, and metabolites of a multitude of biological systems. Despite the fascinating new molecular insights at the genome, transcriptome, proteome and metabolome scale, we are still far from fully understanding cellular organization, cell cycles and biology at the molecular level. Significant advances in sensitivity and depth for both sequencing and mass spectrometry-based methods allow analysis at the single-cell and single-molecule level. At the same time, new tools are emerging that enable the investigation of molecular interactions throughout the central dogma of molecular biology. In this review, we provide an overview of established and recently developed mass spectrometry-based tools to probe metabolite-protein interactions, from individual interaction pairs to interactions at the proteome-metabolome scale.
Preprint
Full-text available
Natural products (bio)synthesised by microbes are an important component of the pharmacopeia with a vast array of biomedical applications, in addition to their key role in many ecological interactions. One approach for the discovery of these metabolites is the identification of biosynthetic gene clusters (BGCs), genomic units which encode the molecular machinery required for producing the natural product. Genome mining has revolutionised the discovery of BGCs, yet metagenomic assemblies represent a largely untapped source of natural products. The imbalanced distribution of BGC classes in existing databases restricts the generalisation of detection patterns and limits the ability of mining methods to recognise a broader spectrum of BGCs. This problem is further intensified in metagenomic datasets, where BGC genes may be split across multiple contigs. This work presents SanntiS, a new machine learning-based approach for identifying BGCs. SanntiS achieved high precision and recall in both genomic and metagenomic datasets, effectively capturing a broad range of BGCs. Application of SanntiS to metagenomic assemblies found in MGnify led to a resource containing 1.1 million BGC predictions with associated contextual data from diverse biomes. Additionally, experimental validation of a previously undescribed BGC, detected solely by SanntiS, further demonstrates the potential of this approach in uncovering novel bioactive compounds. The study illustrates the significance of metagenomic datasets in comprehensively understanding the diversity and distribution of BGCs in microbial communities.
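The precision and recall mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that predicted and reference BGCs can be compared as simple identifiers; real evaluations usually match overlapping genomic regions, and the identifiers and the function name `precision_recall` below are hypothetical, not taken from SanntiS.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two collections of hashable items."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Precision: fraction of predictions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of reference items that were recovered.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
reference_bgcs = {"bgc_a", "bgc_b", "bgc_c", "bgc_d"}
predicted_bgcs = {"bgc_a", "bgc_b", "bgc_x"}

precision, recall = precision_recall(predicted_bgcs, reference_bgcs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of three predictions are correct and two of four reference clusters are recovered, which shows why the two metrics are reported together: a tool can raise one at the expense of the other.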
Article
Full-text available
Acinetobacter baumannii is a nosocomial Gram-negative pathogen that often displays multidrug resistance. Discovering new antibiotics against A. baumannii has proven challenging through conventional screening approaches. Fortunately, machine learning methods allow for the rapid exploration of chemical space, increasing the probability of discovering new antibacterial molecules. Here we screened ~7,500 molecules for those that inhibited the growth of A. baumannii in vitro. We trained a neural network with this growth inhibition dataset and performed in silico predictions for structurally new molecules with activity against A. baumannii. Through this approach, we discovered abaucin, an antibacterial compound with narrow-spectrum activity against A. baumannii. Further investigations revealed that abaucin perturbs lipoprotein trafficking through a mechanism involving LolE. Moreover, abaucin could control an A. baumannii infection in a mouse wound model. This work highlights the utility of machine learning in antibiotic discovery and describes a promising lead with targeted activity against a challenging Gram-negative pathogen.
Preprint
Full-text available
Proteochemometric (PCM) modelling is a powerful computational drug discovery tool used in bioactivity prediction of potential drug candidates relying on both chemical and protein information. In PCM, features are computed to describe small molecules and proteins, and these features directly impact the quality of the predictive models. State-of-the-art protein descriptors, however, are calculated from the protein sequence and neglect the dynamic nature of proteins. This dynamic nature can be computationally simulated with molecular dynamics (MD). Here, novel 3D dynamic protein descriptors (3DDPDs) were designed to be applied in bioactivity prediction tasks with PCM models. As a test case, publicly available G protein-coupled receptor (GPCR) MD data from GPCRmd were used. GPCRs are membrane-bound proteins, which are activated by hormones and neurotransmitters, and constitute an important target family for drug discovery. GPCRs exist in different conformational states that allow transmission of diverse signals and that can be modified by ligand interactions, among other factors. To translate the MD-encoded protein dynamics, two types of 3DDPDs were considered: one-hot encoded residue-specific (rs) and embedding-like protein-specific (ps) 3DDPDs. The descriptors were developed by calculating distributions of trajectory coordinates and partial charges, applying dimensionality reduction, and subsequently condensing them into vectors per residue or protein, respectively. 3DDPDs were benchmarked on a number of PCM tasks against state-of-the-art non-dynamic protein descriptors. Our rs- and ps3DDPDs outperformed non-dynamic descriptors in regression tasks using a temporal split, and showed comparable performance with a random split and in all classification tasks. Combinations of non-dynamic descriptors with 3DDPDs did not result in increased performance. Finally, the power of 3DDPDs to capture dynamic fluctuations in mutant GPCRs was explored. The results presented here show the potential of including protein dynamic information in machine learning tasks, specifically bioactivity prediction, and open opportunities for applications in drug discovery, including oncology.
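The difference between the temporal and random splits mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' benchmarking code: the record fields (`compound`, `year`, `pchembl`), the cutoff year and the function names are hypothetical.

```python
import random


def temporal_split(records, cutoff_year):
    """Train on records published before cutoff_year, test on the rest."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


def random_split(records, test_fraction, seed=0):
    """Shuffle a copy of the records and hold out a fixed fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical bioactivity records, one per year from 2010 to 2019.
records = [
    {"compound": f"cpd_{i}", "year": 2010 + i, "pchembl": 5.0 + 0.1 * i}
    for i in range(10)
]

train_t, test_t = temporal_split(records, cutoff_year=2017)
train_r, test_r = random_split(records, test_fraction=0.3)
print(len(train_t), len(test_t), len(train_r), len(test_r))
```

A temporal split never lets the model train on data newer than the test set, so it is generally considered a closer proxy for prospective use than a random split, in which older and newer measurements are mixed.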
Article
Full-text available
Natural products research increasingly applies -omics technologies to guide molecular discovery. While the combined analysis of genomic and metabolomic datasets has proved valuable for identifying natural products and their biosynthetic gene clusters (BGCs) in bacteria, this integrated approach lacks application to fungi. Because fungi are hyper-diverse and underexplored for new chemistry and bioactivities, we created a linked genomics–metabolomics dataset for 110 Ascomycetes, and optimized both gene cluster family (GCF) networking parameters and correlation-based scoring for pairing fungal natural products with their BGCs. Using a network of 3,007 GCFs (organized from 7,020 BGCs), we examined 25 known natural products originating from 16 known BGCs and observed statistically significant associations between 21 of these compounds and their validated BGCs. Furthermore, the scalable platform identified the BGC for the pestalamides, demystifying its biogenesis, and revealed more than 200 high-scoring natural product–GCF linkages to direct future discovery.
Article
Full-text available
Rational drug design often starts from specific scaffolds to which side chains/substituents are added or modified due to the large drug-like chemical space available to search for novel drug-like molecules. With the rapid growth of deep learning in drug discovery, a variety of effective approaches have been developed for de novo drug design. In previous work we proposed a method named DrugEx, which can be applied in polypharmacology based on multi-objective deep reinforcement learning. However, the previous version is trained under fixed objectives and does not allow users to input any prior information (i.e. a desired scaffold). In order to improve the general applicability, we updated DrugEx to design drug molecules based on scaffolds which consist of multiple fragments provided by users. Here, a Transformer model was employed to generate molecular structures. The Transformer is a multi-head self-attention deep learning model containing an encoder to receive scaffolds as input and a decoder to generate molecules as output. In order to deal with the graph representation of molecules a novel positional encoding for each atom and bond based on an adjacency matrix was proposed, extending the architecture of the Transformer. The graph Transformer model contains growing and connecting procedures for molecule generation starting from a given scaffold based on fragments. Moreover, the generator was trained under a reinforcement learning framework to increase the number of desired ligands. As a proof of concept, the method was applied to design ligands for the adenosine A2A receptor (A2AAR) and compared with SMILES-based methods. The results show that 100% of the generated molecules are valid and most of them had a high predicted affinity value towards A2AAR with given scaffolds.
Article
Full-text available
Generative chemical language models (CLMs) can be used for de novo molecular structure generation by learning from a textual representation of molecules. Here, we show that hybrid CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), a collection of virtual molecules was created with a generative CLM. This virtual compound library was refined using a CLM-based classifier for bioactivity prediction. This second hybrid CLM was pretrained with patented molecular structures and fine-tuned with known PI3Kγ ligands. Several of the computer-generated molecular designs were commercially available, enabling fast prescreening and preliminary experimental validation. A new PI3Kγ ligand with sub-micromolar activity was identified, highlighting the method’s scaffold-hopping potential. Chemical synthesis and biochemical testing of two of the top-ranked de novo designed molecules and their derivatives corroborated the model’s ability to generate PI3Kγ ligands with medium to low nanomolar activity for hit-to-lead expansion. The most potent compounds led to pronounced inhibition of PI3K-dependent Akt phosphorylation in a medulloblastoma cell model, demonstrating efficacy of PI3Kγ ligands in PI3K/Akt pathway repression in human tumor cells. The results positively advocate hybrid CLMs for virtual compound screening and activity-focused molecular design.
Article
Full-text available
With the ongoing rapid growth of publicly available ligand–protein bioactivity data, there is a trove of valuable data that can be used to train a plethora of machine-learning algorithms. However, not all data are equal in terms of size and quality, and a significant portion of researchers’ time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge on its own. To meet these challenges, we have constructed the Papyrus dataset. Papyrus comprises around 60 million data points. This dataset contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with several smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how data can be filtered in a variety of ways and also perform some examples of quantitative structure–activity relationship analyses and proteochemometric modelling. Our ambition is that this pruned data collection constitutes a benchmark set that can be used for constructing predictive models, while also providing an accessible data source for research.
Preprint
Full-text available
The AlphaFold neural network model has revolutionized structural molecular biology with unprecedented performance. We demonstrate that by stochastically perturbing the neural network by enabling dropout at inference combined with massive sampling, it is possible to improve the quality of the generated models. We generated around 6,000 models per target compared to 25 default for AF2-multimer, with v1 and v2 multimer network models, with and without templates, and increased the number of recycles within the network. The method was benchmarked in CASP15, and compared to AF2-multimer it improved the average DockQ from 0.41 to 0.55 using identical input and was ranked at the very top in the protein assembly category when compared to all other groups participating in CASP15. The simplicity of the method should facilitate the adaptation by the field, and the method should be useful for anyone interested in modelling multimeric structures, alternate conformations or flexible structures. Availability: AFsample is available online at http://wallnerlab.org/AFsample.
Article
Full-text available
Machine learning has become a crucial tool in drug discovery and chemistry at large, e.g., to predict molecular properties, such as bioactivity, with high accuracy. However, activity cliffs (pairs of molecules that are highly similar in their structure but exhibit large differences in potency) have received limited attention for their effect on model performance. Not only are these edge cases informative for molecule discovery and optimization but also models that are well equipped to accurately predict the potency of activity cliffs have increased potential for prospective applications. Our work aims to fill the current knowledge gap on best-practice machine learning methods in the presence of activity cliffs. We benchmarked a total of 24 machine and deep learning approaches on curated bioactivity data from 30 macromolecular targets for their performance on activity cliff compounds. While all methods struggled in the presence of activity cliffs, machine learning approaches based on molecular descriptors outperformed more complex deep learning methods. Our findings highlight large case-by-case differences in performance, advocating for (a) the inclusion of dedicated "activity-cliff-centered" metrics during model development and evaluation and (b) the development of novel algorithms to better predict the properties of activity cliffs. To this end, the methods, metrics, and results of this study have been encapsulated into an open-access benchmarking platform named MoleculeACE (Activity Cliff Estimation, available on GitHub at: https://github.com/molML/MoleculeACE). MoleculeACE is designed to steer the community toward addressing the pressing but overlooked limitation of molecular machine learning models posed by activity cliffs.
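The definition of an activity cliff given above (high structural similarity, large potency difference) can be made concrete with a small sketch. It is not the MoleculeACE implementation: fingerprints are represented as plain sets of "on" bit indices, and the similarity threshold, fold-change threshold, molecule names and potency values are all assumptions chosen for illustration.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of fingerprint bit indices."""
    bits_a, bits_b = set(bits_a), set(bits_b)
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def find_activity_cliffs(molecules, min_similarity=0.9, min_fold_change=10.0):
    """Return name pairs that are similar in structure but differ in potency."""
    cliffs = []
    for (name_a, mol_a), (name_b, mol_b) in combinations(molecules.items(), 2):
        similarity = tanimoto(mol_a["bits"], mol_b["bits"])
        weaker = max(mol_a["ki_nm"], mol_b["ki_nm"])
        stronger = min(mol_a["ki_nm"], mol_b["ki_nm"])
        if similarity >= min_similarity and weaker / stronger >= min_fold_change:
            cliffs.append((name_a, name_b))
    return cliffs


# Hypothetical molecules: bit sets stand in for real fingerprints.
molecules = {
    "mol_1": {"bits": set(range(1, 11)), "ki_nm": 5.0},
    "mol_2": {"bits": set(range(1, 12)), "ki_nm": 500.0},
    "mol_3": {"bits": set(range(1, 11)) | {12}, "ki_nm": 8.0},
    "mol_4": {"bits": set(range(20, 30)), "ki_nm": 900.0},
}

cliffs = find_activity_cliffs(molecules)
print(cliffs)
```

Only `mol_1` and `mol_2` qualify here: they share 10 of 11 bits yet differ 100-fold in potency, whereas `mol_3` is similar to `mol_1` but about equally potent, and `mol_4` is structurally unrelated.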
Article
Full-text available
With an ever-increasing amount of (meta)genomic data being deposited in sequence databases, (meta)genome mining for natural product biosynthetic pathways occupies a critical role in the discovery of novel pharmaceutical drugs, crop protection agents and biomaterials. The genes that encode these pathways are often organised into biosynthetic gene clusters (BGCs). In 2015, we defined the Minimum Information about a Biosynthetic Gene cluster (MIBiG): a standardised data format that describes the minimally required information to uniquely characterise a BGC. We simultaneously constructed an accompanying online database of BGCs, which has since been widely used by the community as a reference dataset for BGCs and was expanded to 2021 entries in 2019 (MIBiG 2.0). Here, we describe MIBiG 3.0, a database update comprising large-scale validation and re-annotation of existing entries and 661 new entries. Particular attention was paid to the annotation of compound structures and biological activities, as well as protein domain selectivities. Together, these new features keep the database up-to-date, and will provide new opportunities for the scientific community to use its freely available data, e.g. for the training of new machine learning models to predict sequence-structure-function relationships for diverse natural products. MIBiG 3.0 is accessible online at https://mibig.secondarymetabolites.org/.
Article
Full-text available
The identity and biological activity of most metabolites still remain unknown. A bottleneck in the exploration of metabolite structures and pharmaceutical activities is the compound purification needed for bioactivity assignments and downstream structure elucidation. To enable bioactivity-focused compound identification from complex mixtures, we develop a scalable native metabolomics approach that integrates non-targeted liquid chromatography tandem mass spectrometry and detection of protein binding via native mass spectrometry. A native metabolomics screen for protease inhibitors from an environmental cyanobacteria community reveals 30 chymotrypsin-binding cyclodepsipeptides. Guided by the native metabolomics results, we select and purify five of these compounds for full structure elucidation via tandem mass spectrometry, chemical derivatization, and nuclear magnetic resonance spectroscopy as well as evaluation of their biological activities. These results identify rivulariapeptolides as a family of serine protease inhibitors with nanomolar potency, highlighting native metabolomics as a promising approach for drug discovery, chemical ecology, and chemical biology studies. Bioactivity-guided isolation of specialized metabolites is an iterative process. Here, the authors demonstrate a native metabolomics approach that allows for fast screening of complex metabolite extracts against a protein of interest and simultaneous structure annotation.
Article
Full-text available
The complete biosynthetic pathways are unknown for most natural products (NPs); it is thus valuable to make computer-aided bio-retrosynthesis predictions. Here, a navigable and user-friendly toolkit, BioNavi-NP, is developed to predict the biosynthetic pathways for both NPs and NP-like compounds. First, a single-step bio-retrosynthesis prediction model is trained using both general organic and biosynthetic reactions through end-to-end transformer neural networks. Based on this model, plausible biosynthetic pathways can be efficiently sampled through an AND-OR tree-based planning algorithm from iterative multi-step bio-retrosynthetic routes. Extensive evaluations reveal that BioNavi-NP can identify biosynthetic pathways for 90.2% of 368 test compounds and recover the reported building blocks as in the test set for 72.8%, 1.7 times more accurate than existing conventional rule-based approaches. The model is further shown to identify biologically plausible pathways for complex NPs collected from the recent literature. The toolkit as well as the curated datasets and learned models are freely available to facilitate the elucidation and reconstruction of the biosynthetic pathways for NPs.
Article
Full-text available
Current methods for structure elucidation of small molecules rely on finding similarity with spectra of known compounds, but do not predict structures de novo for unknown compound classes. We present MSNovelist, which combines fingerprint prediction with an encoder–decoder neural network to generate structures de novo solely from tandem mass spectrometry (MS²) spectra. In an evaluation with 3,863 MS² spectra from the Global Natural Product Social Molecular Networking site, MSNovelist predicted 25% of structures correctly on first rank, retrieved 45% of structures overall and reproduced 61% of correct database annotations, without having ever seen the structure in the training phase. Similarly, for the CASMI 2016 challenge, MSNovelist correctly predicted 26% and retrieved 57% of structures, recovering 64% of correct database annotations. Finally, we illustrate the application of MSNovelist in a bryophyte MS² dataset, in which de novo structure prediction substantially outscored the best database candidate for seven spectra. MSNovelist is ideally suited to complement library-based annotation in the case of poorly represented analyte classes and novel compounds.
Article
Full-text available
Contemporary bioinformatic and chemoinformatic capabilities hold promise to reshape knowledge management, analysis and interpretation of data in natural products research. Currently, reliance on a disparate set of non-standardized, insular, and specialized databases presents a series of challenges for data access, both within the discipline and for integration and interoperability between related fields. The fundamental elements of exchange are referenced structure-organism pairs that establish relationships between distinct molecular structures and the living organisms from which they were identified. Consolidating and sharing such information via an open platform has strong transformative potential for natural products research and beyond. This is the ultimate goal of the newly established LOTUS initiative, which has now completed the first steps toward the harmonization, curation, validation and open dissemination of 750,000+ referenced structure-organism pairs. LOTUS data is hosted on Wikidata and regularly mirrored on https://lotus.naturalproducts.net. Data sharing within the Wikidata framework broadens data access and interoperability, opening new possibilities for community curation and evolving publication models. Furthermore, embedding LOTUS data into the vast Wikidata knowledge graph will facilitate new biological and chemical insights. The LOTUS initiative represents an important advancement in the design and deployment of a comprehensive and collaborative natural products knowledge base.
Article
Full-text available
Bacterial specialized metabolites are a proven source of antibiotics and cancer therapies, but whether we have sampled all the secondary metabolite chemical diversity of cultivated bacteria is not known. We analysed ~170,000 bacterial genomes and ~47,000 metagenome assembled genomes (MAGs) using a modified BiG-SLiCE and the new clust-o-matic algorithm. We estimate that only 3% of the natural products potentially encoded in bacterial genomes have been experimentally characterized. We show that the variation in secondary metabolite biosynthetic diversity drops significantly at the genus level, identifying it as an appropriate taxonomic rank for comparison. Equal comparison of genera based on relative evolutionary distance revealed that Streptomyces bacteria encode the largest biosynthetic diversity by far, with Amycolatopsis, Kutzneria and Micromonospora also encoding substantial diversity. Finally, we find that several less-well-studied taxa, such as Weeksellaceae (Bacteroidota), Myxococcaceae (Myxococcota), Pleurocapsa and Nostocaceae (Cyanobacteria), have potential to produce highly diverse sets of secondary metabolites that warrant further investigation. A comprehensive survey of secondary metabolites encoded in bacteria identifies large differences in biosynthetic diversity among genera and pinpoints those that can be targeted for novel chemistries provisionally suitable as antimicrobials.
Article
Full-text available
Deep learning has disrupted nearly every field of research, including those of direct importance to drug discovery, such as medicinal chemistry and pharmacology. This revolution has largely been attributed to the unprecedented advances in highly parallelizable graphics processing units (GPUs) and the development of GPU-enabled algorithms. In this Review, we present a comprehensive overview of historical trends and recent advances in GPU algorithms and discuss their immediate impact on the discovery of new drugs and drug targets. We also cover the state-of-the-art of deep learning architectures that have found practical applications in both early drug discovery and consequent hit-to-lead optimization stages, including the acceleration of molecular docking, the evaluation of off-target effects and the prediction of pharmacological properties. We conclude by discussing the impacts of GPU acceleration and deep learning models on the global democratization of the field of drug discovery that may lead to efficient exploration of the ever-expanding chemical universe to accelerate the discovery of novel medicines. GPUs, which are highly parallel computer processing units, were originally designed for graphics applications, but they have played an important role in accelerating the development of deep learning methods. In this Review, Pandey and colleagues summarize how GPUs have advanced machine learning in the field of drug discovery.
Article
Full-text available
The current global health emergency in the form of the Coronavirus 2019 (COVID-19) pandemic has highlighted the need for fast, accurate, and efficient drug discovery pipelines. Traditional drug discovery projects relying on in vitro high-throughput screening (HTS) involve large investments and sophisticated experimental set-ups, affordable only to big biopharmaceutical companies. In this scenario, application of efficient state-of-the-art computational methods and modern artificial intelligence (AI)-based algorithms for rapid screening of repurposable chemical space [approved drugs and natural products (NPs) with proven pharmacokinetic profiles] to identify the initial leads is a powerful option to save resources and time. Structure-based drug repurposing is a popular in silico repurposing approach. In this review, we discuss traditional and modern AI-based computational methods and tools applied at various stages for structure-based drug discovery (SBDD) pipelines. Additionally, we highlight the role of generative models in generating molecules with scaffolds from repurposable chemical space. Teaser: This review highlights the importance of repurposable chemical space, and the contributions of conventional in silico approaches and modern machine-learning algorithms for rapid structure-based drug repurposing.
Article
Full-text available
Background: An increasing number of studies now produce multiple omics measurements that require using sophisticated computational methods for analysis. While each omics dataset can be examined separately, jointly integrating multiple omics data allows for deeper understanding and insights to be gained from the study. In particular, data integration can be performed horizontally, where biological entities from multiple omics measurements are mapped to common reactions and pathways. However, data integration remains a challenge due to the complexity of the data and the difficulty in interpreting analysis results. Results: Here we present GraphOmics, a user-friendly platform to explore and integrate multiple omics datasets and support hypothesis generation. Users can upload transcriptomics, proteomics and metabolomics data to GraphOmics. Relevant entities are connected based on their biochemical relationships, and mapped to reactions and pathways from Reactome. From the Data Browser in GraphOmics, mapped entities and pathways can be ranked, sorted and filtered according to their statistical significance (p values) and fold changes. Context-sensitive panels provide information on the currently selected entities, while interactive heatmaps and clustering functionalities are also available. As a case study, we demonstrated how GraphOmics was used to interactively explore multi-omics data and support hypothesis generation using two complex datasets from existing Zebrafish regeneration and Covid-19 human studies. Conclusions: GraphOmics is fully open-sourced and freely accessible from https://graphomics.glasgowcompbio.org/. It can be used to integrate multiple omics data horizontally by mapping entities across omics to reactions and pathways. Our demonstration showed that by using interactive explorations from GraphOmics, interesting insights and biological hypotheses could be rapidly revealed.
Article
Full-text available
Transformer models coupled with the Simplified Molecular-Input Line-Entry System (SMILES) have recently proven to be a powerful combination for solving challenges in cheminformatics. These models, however, are often developed specifically for a single application and can be very resource-intensive to train. In this work we present the Chemformer model, a Transformer-based model which can be quickly applied to both sequence-to-sequence and discriminative cheminformatics tasks. Additionally, we show that self-supervised pre-training can improve performance and significantly speed up convergence on downstream tasks. On direct synthesis and retrosynthesis prediction benchmark datasets we publish state-of-the-art results for top-1 accuracy. We also improve on existing approaches for a molecular optimisation task and show that Chemformer can optimise on multiple discriminative tasks simultaneously. Models, datasets and code will be made available after publication.
Article
Full-text available
The Natural Products Magnetic Resonance Database (NP-MRD) is a comprehensive, freely available electronic resource for the deposition, distribution, searching and retrieval of nuclear magnetic resonance (NMR) data on natural products, metabolites and other biologically derived chemicals. NMR spectroscopy has long been viewed as the 'gold standard' for the structure determination of novel natural products and novel metabolites. NMR is also widely used in natural product dereplication and the characterization of biofluid mixtures (metabolomics). All of these NMR applications require large collections of high quality, well-annotated, referential NMR spectra of pure compounds. Unfortunately, referential NMR spectral collections for natural products are quite limited. It is because of the critical need for dedicated, open access natural product NMR resources that the NP-MRD was funded by the National Institutes of Health (NIH). Since its launch in 2020, the NP-MRD has grown quickly to become the world's largest repository for NMR data on natural products and other biological substances. It currently contains both structural and NMR data for nearly 41,000 natural product compounds from >7400 different living species. All structural, spectroscopic and descriptive data in the NP-MRD is interactively viewable, searchable and fully downloadable in multiple formats. Extensive hyperlinks to other databases of relevance are also provided. The NP-MRD also supports community deposition of NMR assignments and NMR spectra (1D and 2D) of natural products and related meta-data. The deposition system performs extensive data enrichment, automated data format conversion and spectral/assignment evaluation. Details of these database features, how they are implemented and plans for future upgrades are also provided. The NP-MRD is available at https://np-mrd.org.
Preprint
Full-text available
Natural products produced by microorganisms constitute an important source of essential pharmaceuticals, including antimicrobial and anti-tumor drugs. These bioactive molecules are microbial secondary metabolites synthesized by co-localized genes termed Biosynthetic Gene Clusters (BGCs). The rapid increase of microbial genomics resources, due to the availability of high-throughput sequencing technologies, has spurred the development of computational methods for microbial genome mining for BGC discovery. Current machine learning methods, however, have had limited success in uncovering novel BGCs due to an excessive number of false positives in their predictions. To this end, we propose Deep-BGCpred, a framework that effectively addresses the aforementioned issue by improving a deep learning model termed DeepBGC. The new model embeds multi-source protein family domains and employs a stacked Bidirectional Long Short-Term Memory model to boost accuracy for BGC identification. In particular, it integrates two customized strategies, a sliding window strategy and dual-model serial screening, to improve the model's performance stability and reduce the number of false positives in BGC predictions. We compare the proposed model against other well-established methods on common benchmarks and achieve new state-of-the-art results with convincing evidence. We expect that researchers working on genome mining for natural products may benefit greatly from our newly proposed method, Deep-BGCpred.
Article
Full-text available
Systematic, large-scale, studies at the genomic, metabolomic, and functional level have transformed the natural product sciences. Improvements in technology and reduction in cost for obtaining spectroscopic, chromatographic, and genomic data coupled with the creation of readily accessible curated and functionally annotated data sets have altered the practices of virtually all natural product research laboratories. Gone are the days when the natural products researchers were expected to devote themselves exclusively to the isolation, purification, and structure elucidation of small molecules. We now also engage with big data in taxonomic, genomic, proteomic, and/or metabolomic collections, and use these data to generate and test hypotheses. While the oft stated aim for the use of large-scale -omics data in the natural products sciences is to achieve a rapid increase in the rate of discovery of new drugs, this has not yet come to pass. At the same time, new technologies have provided unexpected opportunities for natural products chemists to ask and answer new and different questions. With this viewpoint, we discuss the evolution of big data as a part of natural products research and provide a few examples of how discoveries have been enabled by access to big data. We also draw attention to some of the limitations in our existing engagement with large datasets and consider what would be necessary to overcome them.
Article
Full-text available
Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are generally considered to be characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity but this approach is strongly limited by a generally poor correlation between both metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of > 100,000 mass spectra of about 15,000 unique known compounds, we trained MS2DeepScore to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model’s prediction uncertainty. On 3600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly-reliable structural matches and to predict Tanimoto scores for pairs of molecules based on their fragment spectra with a root mean squared error of about 0.15. Furthermore, the prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. Furthermore, we demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra. 
Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity measures have great potential for a range of metabolomics data processing pipelines.
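The quantities mentioned in this abstract (Tanimoto scores, root mean squared error, and selecting a subset of predictions by their uncertainty) can be illustrated with a small, self-contained sketch. It is not MS2DeepScore code: fingerprints are represented as plain sets of "on" bit indices, and all function names and numbers below are made up for illustration.

```python
from math import sqrt


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def rmse(predicted, actual):
    """Root mean squared error between two equally long, non-empty sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sqrt(sum((p - t) ** 2 for p, t in zip(predicted, actual)) / len(predicted))


def rmse_below_uncertainty(predicted, actual, uncertainty, threshold):
    """RMSE restricted to predictions whose uncertainty is at most `threshold`."""
    kept = [(p, t) for p, t, u in zip(predicted, actual, uncertainty) if u <= threshold]
    return rmse([p for p, _ in kept], [t for _, t in kept])


# Two toy fingerprints sharing 2 of 6 distinct bits.
print(tanimoto({1, 2, 3, 4}, {3, 4, 5, 6}))

# Hypothetical predicted vs. true similarity scores with per-pair uncertainties.
predicted = [0.9, 0.2, 0.5, 0.7]
actual = [1.0, 0.1, 0.9, 0.7]
uncertainty = [0.05, 0.05, 0.30, 0.02]
print(rmse(predicted, actual))
print(rmse_below_uncertainty(predicted, actual, uncertainty, threshold=0.1))
```

In this toy data the least certain prediction is also the least accurate, so dropping it lowers the error; with real models that relationship has to be checked empirically.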
Article
Full-text available
Within the natural products field there is an increasing emphasis on the study of compounds from microbial sources. This has been fuelled by interest in the central role that microorganisms play in mediating both interspecies interactions and host-microbe relationships. To support the study of natural products chemistry produced by microorganisms we released the Natural Products Atlas, a database of known microbial natural products structures, in 2019. This paper reports the release of a new version of the database which includes a full RESTful application programming interface (API), a new website framework, and an expanded database that includes 8128 new compounds, bringing the total to 32 552. In addition to these structural and content changes we have added full taxonomic descriptions for all microbial taxa and have added chemical ontology terms from both NP Classifier and ClassyFire. We have also performed manual curation to review all entries with incomplete configurational assignments and have integrated data from external resources, including CyanoMetDB. Finally, we have improved the user experience by updating the Overview dashboard and creating a dashboard for taxonomic origin. The database can be accessed via the new interactive website at https://www.npatlas.org.
Article
Full-text available
Computational approaches such as genome and metabolome mining are becoming essential to natural products (NPs) research. Consequently, a need exists for an automated structure-type classification system to handle the massive amounts of data appearing for NP structures. An ideal semantic ontology for the classification of NPs should go beyond the simple presence/absence of chemical substructures, but also include the taxonomy of the producing organism, the nature of the biosynthetic pathway, and/or their biological properties. Thus, a holistic and automatic NP classification framework could have considerable value to comprehensively navigate the relatedness of NPs, and especially so when analyzing large numbers of NPs. Here, we introduce NPClassifier, a deep-learning tool for the automated structural classification of NPs from their counted Morgan fingerprints. NPClassifier is expected to accelerate and enhance NP discovery by linking NP structures to their underlying properties.
Article
Full-text available
Untargeted metabolomics experiments rely on spectral libraries for structure annotation, but, typically, only a small fraction of spectra can be matched. Previous in silico methods search in structure databases but cannot distinguish between correct and incorrect annotations. Here we introduce the COSMIC workflow that combines in silico structure database generation and annotation with a confidence score consisting of kernel density P value estimation and a support vector machine with enforced directionality of features. On diverse datasets, COSMIC annotates a substantial number of hits at low false discovery rates and outperforms spectral library search. To demonstrate that COSMIC can annotate structures never reported before, we annotated 12 natural bile acids. The annotation of nine structures was confirmed by manual evaluation and two structures using synthetic standards. In human samples, we annotated and manually validated 315 molecular structures currently absent from the Human Metabolome Database. Application of COSMIC to data from 17,400 metabolomics experiments led to 1,715 high-confidence structural annotations that were absent from spectral libraries. COSMIC outperforms spectral library search for metabolite annotation and annotates previously unseen structures.
Article
Full-text available
The analysis of nuclear magnetic resonance (NMR) spectra for the comprehensive and unambiguous identification and characterization of peaks is a difficult, but critically important step in all NMR analyses of complex biological molecular systems. Here, we introduce DEEP Picker, a deep neural network (DNN)-based approach for peak picking and spectral deconvolution which semi-automates the analysis of two-dimensional NMR spectra. DEEP Picker includes 8 hidden convolutional layers and was trained on a large number of synthetic spectra of known composition with variable degrees of crowdedness. We show that our method is able to correctly identify overlapping peaks, including ones that are challenging for expert spectroscopists and existing computational methods alike. We demonstrate the utility of DEEP Picker on NMR spectra of folded and intrinsically disordered proteins as well as a complex metabolomics mixture, and show how it provides access to valuable NMR information. DEEP Picker should facilitate the semi-automation and standardization of protocols for better consistency and sharing of results within the scientific community.
Article
Full-text available
Microbial specialized metabolites are key mediators in host-microbiome interactions. Most of the chemical space produced by the microbiome currently remains unexplored and uncharacterized. This situation calls for new and improved methods to exploit the growing publicly available genomic and metabolomic data sets and connect the outcomes to structural and functional knowledge inferred from transcriptomics and proteomics experiments. Here, we first describe currently available approaches that support the comprehensive mining of metabolomics and genomics data. Next, we provide our vision on how to move forward toward the automated linking of omics data of specialized metabolites to their structures, biosynthesis pathways, producers, and functions.
Article
Full-text available
The amount of data available on chemical structures and their properties has increased steadily over the past decades. In particular, articles published before the mid-1990s are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, Optical Chemical Structure Recognition (OCSR) tools were developed where the best performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50–100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.
Article
Full-text available
While neural networks achieve state-of-the-art performance for many molecular modeling and structure–property prediction tasks, these models can struggle with generalization to out-of-domain examples, exhibit poor sample efficiency, and produce uncalibrated predictions. In this paper, we leverage advances in evidential deep learning to demonstrate a new approach to uncertainty quantification for neural network-based molecular structure–property prediction at no additional computational cost. We develop both evidential 2D message passing neural networks and evidential 3D atomistic neural networks and apply these networks across a range of different tasks. We demonstrate that evidential uncertainties enable (1) calibrated predictions where uncertainty correlates with error, (2) sample-efficient training through uncertainty-guided active learning, and (3) improved experimental validation rates in a retrospective virtual screening campaign. Our results suggest that evidential deep learning can provide an efficient means of uncertainty quantification useful for molecular property prediction, discovery, and design tasks in the chemical and physical sciences.
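One simple way to check the first claim, that predicted uncertainty correlates with error, is a rank correlation between the uncertainties and the absolute errors. The sketch below implements Spearman's coefficient with the standard library only. It is a generic diagnostic rather than the evidential method of the paper, and the numbers are invented.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predictions, targets and predicted uncertainties.
predictions = [1.2, 0.45, 2.9, 3.3]
targets = [1.0, 0.5, 2.0, 3.0]
uncertainties = [0.3, 0.1, 1.0, 0.2]

abs_errors = [abs(p - t) for p, t in zip(predictions, targets)]
rho = spearman(uncertainties, abs_errors)
print(round(rho, 3))
```

A coefficient near 1 means that the predictions flagged as most uncertain tend to be the least accurate. It does not by itself show that the uncertainties are calibrated in magnitude.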
Article
Full-text available
More than 60% of pharmaceuticals are related to natural products (NPs), chemicals produced by living organisms. Despite this, the rate of NP discovery has slowed over the past few decades. In many cases the rate-limiting step in NP discovery is structural characterization. Here we report the use of microcrystal electron diffraction (MicroED), an emerging cryogenic electron microscopy (CryoEM) method, in combination with genome mining to accelerate NP discovery and structural elucidation. As proof of principle we rapidly determine the structure of a new 2-pyridone NP, Py-469, and revise the structure of fischerin, an NP isolated more than 25 years ago, with potent cytotoxicity but hitherto ambiguous structural assignment. This study serves as a powerful demonstration of the synergy of MicroED and synthetic biology in NP discovery, technologies that when taken together will ultimately accelerate the rate at which new drugs are discovered. Combined use of microcrystal electron diffraction and genome mining for biosynthetic gene clusters enables the rapid structural elucidation of natural products, including a newly discovered 2-pyridone compound and a revised structure of fischerin.
Article
Full-text available
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1–4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the 3-D structure that a protein will adopt based solely on its amino acid sequence, the structure prediction component of the ‘protein folding problem’ [8], has been an important open research problem for more than 50 years [9]. Despite recent progress [10–14], existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even where no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) [15], demonstrating accuracy competitive with experiment in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
Article
Full-text available
The ability to access chemical information openly is an essential part of many scientific disciplines. The Journal of Cheminformatics is leading the way for rigorous, open cheminformatics in many ways, but there remains room for improvement in primary areas. This letter discusses how both authors and the journal alike can help increase the FAIRness (Findability, Accessibility, Interoperability, Reusability) of the chemical structural information in the journal. A proposed chemical structure template can serve as an interoperable Additional File format (already accessible), made more findable by linking the DOI of this data file to the article DOI metadata, supporting further reuse.
Article
Full-text available
Mass spectra provide the ultimate evidence to support the findings of mass spectrometry proteomics studies in publications, and it is therefore crucial to be able to trace the conclusions back to the spectra. The Universal Spectrum Identifier (USI) provides a standardized mechanism for encoding a virtual path to any mass spectrum contained in datasets deposited to public proteomics repositories. USI enables greater transparency of spectral evidence, with more than 1 billion USI identifications from over 3 billion spectra already available through ProteomeXchange repositories. Universal Spectrum Identifier (USI) provides a standardized mechanism for encoding a virtual path to mass spectra deposited to public repositories or contained in public spectral libraries.
Article
Full-text available
Chemical language models enable de novo drug design without the requirement for explicit molecular construction rules. While such models have been applied to generate novel compounds with desired bioactivity, the actual prioritization and selection of the most promising computational designs remains challenging. In this work, we leveraged the probabilities learnt by chemical language models with the beam search algorithm as a model‐intrinsic technique for automated molecule design and scoring. Prospective application of this method yielded three novel inverse agonists of retinoic acid receptor‐related orphan receptors (RORs). Each design was synthesizable in three reaction steps and presented low‐micromolar to nanomolar potency towards RORγ. This model‐intrinsic sampling technique eliminates the strict need for external compound scoring functions, thereby further extending the applicability of generative artificial intelligence to data‐driven drug discovery.
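The idea of using a language model's own probabilities with beam search, both to generate sequences and to score them, can be sketched on a toy example. The model below is a hypothetical lookup table of next-token probabilities conditioned only on the previous token. It is not the chemical language model from the paper, and the token set is purely illustrative.

```python
from math import log

# Hypothetical toy "language model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"C": 0.7, "N": 0.3},
    "C": {"O": 0.5, "N": 0.3, "</s>": 0.2},
    "N": {"</s>": 0.6, "C": 0.4},
    "O": {"</s>": 1.0},
}


def beam_search(model, width=2, max_len=5):
    """Return finished sequences with their log-probabilities, best first."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        # Expand every open beam by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + log(prob), tokens + [token]))
        # Keep only the `width` most probable candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for candidate in candidates[:width]:
            if candidate[1][-1] == "</s>":
                finished.append(candidate)
            else:
                beams.append(candidate)
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Strip the start and end markers; the log-probability is the ranking score.
    return [("".join(tokens[1:-1]), score) for score, tokens in finished]


results = beam_search(TOY_MODEL)
for sequence, score in results:
    print(sequence, round(score, 3))
# Ranking for this toy model: "CO" first, then "CN".
```

Because the ranking score is the model's own cumulative log-probability, no external scoring function is needed to order the designs, which is the property the abstract highlights. Real implementations typically also handle length normalisation and validity checks, which are omitted here.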
Article
Full-text available
Chemical descriptors encode the physicochemical and structural properties of small molecules, and they are at the core of chemoinformatics. The broad release of bioactivity data has prompted enriched representations of compounds, reaching beyond chemical structures and capturing their known biological properties. Unfortunately, bioactivity descriptors are not available for most small molecules, which limits their applicability to a few thousand well characterized compounds. Here we present a collection of deep neural networks able to infer bioactivity signatures for any compound of interest, even when little or no experimental information is available for them. Our signaturizers relate to bioactivities of 25 different types (including target profiles, cellular response and clinical outcomes) and can be used as drop-in replacements for chemical descriptors in day-to-day chemoinformatics tasks. Indeed, we illustrate how inferred bioactivity signatures are useful to navigate the chemical space in a biologically relevant manner, unveiling higher-order organization in natural product collections, and to enrich mostly uncharacterized chemical libraries for activity against the drug-orphan target Snail1. Moreover, we implement a battery of signature-activity relationship (SigAR) models and show a substantial improvement in performance, with respect to chemistry-based classifiers, across a series of biophysics and physiology activity prediction benchmarks. Small molecules bioactivity descriptors are enriched representations of compounds, reaching beyond chemical structures and capturing their known biological properties. Here the authors present a collection of deep neural networks able to infer bioactivity signatures for any compound of interest, even when little or no experimental information is available for them.
Article
Full-text available
Identification of small molecules is a critical task in various areas of life science. Recent advances in mass spectrometry have enabled the collection of tandem mass spectra of small molecules from hundreds of thousands of environments. To identify which molecules are present in a sample, one can search mass spectra collected from the sample against millions of molecular structures in small molecule databases. The existing approaches are based on chemistry domain knowledge, and they fail to explain many of the peaks in mass spectra of small molecules. Here, we present molDiscovery, a mass spectral database search method that improves both efficiency and accuracy of small molecule identification by learning a probabilistic model to match small molecules with their mass spectra. A search of over 8 million spectra from the Global Natural Product Social molecular networking infrastructure shows that molDiscovery correctly identifies six times more unique small molecules than previous methods. A large number of mass spectra from different samples have been collected, and to identify small molecules from these spectra, database searches are needed, which is challenging. Here, the authors report molDiscovery, a mass spectral database search method that uses an algorithm to generate mass spectrometry fragmentations and learns a probabilistic model to match small molecules with their mass spectra.
Article
Full-text available
Research in natural products, the genetically encoded small molecules produced by organisms in an idiosyncratic fashion, deals with molecular structure, biosynthesis, and biological activity. Bioinformatics analyses of microbial genomes can successfully reveal the genetic instructions, biosynthetic gene clusters, that produce many natural products. Genes to molecule predictions made on biosynthetic gene clusters have revealed many important new structures. There is no comparable method for genes to biological activity predictions. To address this missing pathway, we developed a machine learning bioinformatics method for predicting a natural product’s antibiotic activity directly from the sequence of its biosynthetic gene cluster. We trained commonly used machine learning classifiers to predict antibacterial or antifungal activity based on features of known natural product biosynthetic gene clusters. We have identified classifiers that can attain accuracies as high as 80% and that have enabled the identification of biosynthetic enzymes and their corresponding molecular features that are associated with antibiotic activity.
Article
Full-text available
Many microorganisms produce natural products that form the basis of antimicrobials, antivirals, and other drugs. Genome mining is routinely used to complement screening-based workflows to discover novel natural products. Since 2011, the "antibiotics and secondary metabolite analysis shell—antiSMASH" (https://antismash.secondarymetabolites.org/) has supported researchers in their microbial genome mining tasks, both as a free-to-use web server and as a standalone tool under an OSI-approved open-source license. It is currently the most widely used tool for detecting and characterising biosynthetic gene clusters (BGCs) in bacteria and fungi. Here, we present the updated version 6 of antiSMASH. antiSMASH 6 increases the number of supported cluster types from 58 to 71, displays the modular structure of multi-modular BGCs, adds a new BGC comparison algorithm, allows for the integration of results from other prediction tools, and more effectively detects tailoring enzymes in RiPP clusters.
Preprint
Full-text available
Biosynthetic gene clusters (BGCs) are enticing targets for (meta)genomic mining efforts, as they may encode novel, specialized metabolites with potential uses in medicine and biotechnology. Here, we describe GECCO (GEne Cluster prediction with COnditional random fields; https://gecco.embl.de), a high-precision, scalable method for identifying novel BGCs in (meta)genomic data using conditional random fields (CRFs). Based on an extensive evaluation of de novo BGC prediction, we found GECCO to be more accurate and over 3x faster than a state-of-the-art deep learning approach. When applied to over 12,000 genomes, GECCO identified nearly twice as many BGCs compared to a rule-based approach, while achieving higher accuracy than other machine learning approaches. Introspection of the GECCO CRF revealed that its predictions rely on protein domains with both known and novel associations to secondary metabolism. The method developed here represents a scalable, interpretable machine learning approach, which can identify BGCs de novo with high precision.
Article
Full-text available
Specialised metabolites from microbial sources are well-known for their wide range of biomedical applications, particularly as antibiotics. When mining paired genomic and metabolomic data sets for novel specialised metabolites, establishing links between Biosynthetic Gene Clusters (BGCs) and metabolites represents a promising way of finding such novel chemistry. However, due to the lack of detailed biosynthetic knowledge for the majority of predicted BGCs, and the large number of possible combinations, this is not a simple task. This problem is becoming ever more pressing with the increased availability of paired omics data sets. Current tools are not effective at identifying valid links automatically, and manual verification is a considerable bottleneck in natural product research. We demonstrate that using multiple link-scoring functions together makes it easier to prioritise true links relative to others. Based on standardising a commonly used score, we introduce a new, more effective score, and introduce a novel score using an Input-Output Kernel Regression approach. Finally, we present NPLinker, a software framework to link genomic and metabolomic data. Results are verified using publicly available data sets that include validated links.
Article
Two years after DeepMind’s revolutionary AI swept a competition for predicting protein structures, researchers are building on AlphaFold’s success. [Image: protein structure model of DNA polymerase I, an enzyme that participates in DNA replication. Credit: Leonid Andronov/Alamy]
Chapter
TeachOpenCADD is a teaching platform developed with students for students and researchers. The material teaches how to leverage open source cheminformatics and structural bioinformatics resources to explore key questions in computer-aided drug design (CADD). Both the theoretical and practical aspects of CADD concepts are covered in interactive Jupyter Notebooks using Python. This setup makes it easy for students from various fields of science to understand computational drug design techniques with hands-on programming examples. In this book chapter, we explain the motivation for putting the TeachOpenCADD material together, how this teaching material can be and has been used in different teaching formats, and what lessons we have learned so far.
Article
The identification of metabolites from complex biofluids and extracts of tissues is an essential process for understanding metabolic profiles. Nuclear magnetic resonance (NMR) spectroscopy is widely used in metabolomics studies for identification and quantification of metabolites. However, the accurate identification of individual metabolites is still a challenging process with higher peak intensity or similar chemical shifts from different metabolites. In this study, we applied a convolutional neural network (CNN) to ¹H-¹³C HSQC NMR spectra to achieve accurate peak identification in complex mixtures. The results reveal that the neural network was successfully trained on metabolite identification from these 2D NMR spectra and achieved very good performance compared with other NMR-based metabolomic tools.
Article
Geometric deep learning (GDL) is based on neural network architectures that incorporate and process symmetry information. GDL bears promise for molecular modelling applications that rely on molecular representations with different symmetry properties and levels of abstraction. This Review provides a structured and harmonized overview of molecular GDL, highlighting its applications in drug discovery, chemical synthesis prediction and quantum chemistry. It contains an introduction to the principles of GDL, as well as relevant molecular representations, such as molecular graphs, grids, surfaces and strings, and their respective properties. The current challenges for GDL in the molecular sciences are discussed, and a forecast of future opportunities is attempted.
Article
Covering: 2016 up to 2021. Mass spectrometry (MS) is an essential technology in natural products research with MS fragmentation (MS/MS) approaches becoming a key tool. Recent advancements in MS yield dense metabolomics datasets which have been, conventionally, used by individual labs for individual projects; however, a shift is brewing. The movement towards open MS data (and other structural characterization data) and accessible data mining tools is emerging in natural products research. Over the past 5 years, this movement has rapidly expanded and evolved with no slowdown in sight; the capabilities of today vastly exceed those of 5 years ago. Herein, we address the analysis of individual datasets, a situation we are calling the '2021 status quo', and the emergent framework to systematically capture sample information (metadata) and perform repository-scale analyses. We evaluate public data deposition, discuss the challenges of working in the repository scale, highlight the challenges of metadata capture and provide illustrative examples of the power of utilizing repository data and the tools that enable it. We conclude that the advancements in MS data collection must be met with advancements in how we utilize data; therefore, we argue that open data and data mining is the next evolution in obtaining the maximum potential in natural products research.
Article
All organisms produce specialized organic molecules, ranging from small volatile chemicals to large gene-encoded peptides, that have evolved to provide them with diverse cellular and ecological functions. As natural products, they are broadly applied in medicine, agriculture and nutrition. The rapid accumulation of genomic information has revealed that the metabolic capacity of virtually all organisms is vastly underappreciated. Pioneered mainly in bacteria and fungi, genome mining technologies are accelerating metabolite discovery. Recent efforts are now being expanded to all life forms, including protists, plants and animals, and new integrative omics technologies are enabling the increasingly effective mining of this molecular diversity.