Article

Prediction of human pharmacokinetic parameters incorporating SMILES information

Abstract

This study aimed to develop a model that applies natural language processing to simplified molecular-input line-entry system (SMILES) strings to predict clearance (CL) and volume of distribution at steady state (Vd,ss) in humans. The CL and Vd,ss prediction models were constructed using data from 435 and 439 compounds, respectively. The machine learning features comprised animal pharmacokinetic data, in vitro experimental data, molecular descriptors, and SMILES, with XGBoost as the algorithm. The ChemBERTa model was used to analyze substance SMILES, and its last-hidden-layer embedding was examined as a feature. The model was evaluated using geometric mean fold error (GMFE), r², root mean squared error (RMSE), and accuracy within 2- and 3-fold error. CL prediction performed best when animal pharmacokinetic data, in vitro experimental data, and SMILES were used as features, yielding a GMFE of 1.768, an r² of 0.528, and an RMSE of 0.788, with accuracies within 2-fold and 3-fold error of 75.8% and 81.8%, respectively. Vd,ss prediction performed best when animal pharmacokinetic data and in vitro experimental data were used as features, yielding a GMFE of 1.401, an r² of 0.902, and an RMSE of 0.413, with accuracies within 2-fold and 3-fold error of 93.8% and 100%, respectively. This study thus developed a highly predictive model for CL and Vd,ss. In particular, SMILES information contributed predictive power for CL.
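The fold-error metrics named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions (GMFE as 10 raised to the mean absolute log10 ratio of predicted to observed values, and n-fold accuracy as the fraction of compounds whose fold error does not exceed n); the abstract does not state the exact formulas, and the function names and sample values below are hypothetical.

```python
import math


def fold_errors(predicted, observed):
    """Fold error per compound: max(pred/obs, obs/pred), always >= 1."""
    return [max(p / o, o / p) for p, o in zip(predicted, observed)]


def gmfe(predicted, observed):
    """Geometric mean fold error: 10 ** mean(|log10(pred / obs)|)."""
    logs = [abs(math.log10(p / o)) for p, o in zip(predicted, observed)]
    return 10 ** (sum(logs) / len(logs))


def fraction_within(predicted, observed, fold):
    """Fraction of compounds whose fold error is at most `fold`."""
    errors = fold_errors(predicted, observed)
    return sum(e <= fold for e in errors) / len(errors)


# Hypothetical values for illustration only (not data from the study).
observed = [1.0, 2.0, 4.0, 10.0]
predicted = [2.0, 2.0, 1.0, 10.0]

print(f"GMFE: {gmfe(predicted, observed):.3f}")
print(f"within 2-fold: {fraction_within(predicted, observed, 2):.2%}")
print(f"within 3-fold: {fraction_within(predicted, observed, 3):.2%}")
```

A GMFE of 1 would mean perfect agreement; because the metric works on ratios, a 2-fold over-prediction and a 2-fold under-prediction are penalized equally.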

Article
Accurate prediction of a new compound's pharmacokinetic (PK) profile is pivotal for the success of drug discovery programs. An initial assessment of PK in preclinical species and humans is typically performed through allometric scaling and mathematical modeling. These methods use parameters estimated from in vitro or in vivo experiments, which, although helpful for an initial estimation, require extensive animal experiments. Furthermore, mathematical models are limited by the mechanistic underpinnings of the drugs' absorption, distribution, metabolism, and elimination (ADME), which are largely unknown in the early stages of drug discovery. In this work, we propose a novel methodology in which the concentration-versus-time profile of small molecules in rats is directly predicted by machine learning (ML) using structure-driven molecular properties as input, thus mitigating the need for animal experimentation. The proposed framework first predicts ADME properties from molecular structure and then uses them as input to an ML model that predicts the PK profile. For the compounds tested, our results demonstrate that PK profiles can be adequately predicted using the proposed algorithm, especially for compounds with a Tanimoto score greater than 0.5, for which the average mean absolute percentage error between the predicted and observed PK profiles was less than 150%. The suggested framework aims to facilitate PK predictions and thus support molecular screening and design earlier in the drug discovery process.
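The two quantities used to report performance here, the Tanimoto score and the mean absolute percentage error (MAPE), can be sketched as follows. This is an illustration under stated assumptions: fingerprints are represented as sets of "on" bit indices, and the fingerprint type, the function names, and the sample values are hypothetical rather than taken from the work described.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def mape(predicted, observed):
    """Mean absolute percentage error, in percent, relative to observed values."""
    ratios = [abs(p - o) / abs(o) for p, o in zip(predicted, observed)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical fingerprints and concentration values, for illustration only.
query_fp = {1, 2, 3}
training_fp = {2, 3, 4}
print(tanimoto(query_fp, training_fp))   # shared bits / all bits

observed_conc = [1.0, 4.0]
predicted_conc = [1.5, 3.0]
print(mape(predicted_conc, observed_conc))
```

In practice the similarity would usually be computed between a query compound and its nearest training-set compound, so that prediction error can be reported as a function of how far the query lies from the training data.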
Article
Machine learning techniques are extensively employed in drug discovery, with a significant focus on developing QSAR models that interpret the structural information of potential drugs. In this study, the pre-trained natural language processing (NLP) model ChemBERTa was utilized in the drug discovery process. We proposed and evaluated four core model architectures: deep neural network (DNN), encoder, concatenation (concat), and pipe. The DNN model processes physicochemical properties as input, while the encoder model leverages the simplified molecular input line entry system (SMILES) along with NLP techniques. The latter two models, concat and pipe, incorporate both SMILES and physicochemical properties, operating in parallel and sequentially, respectively. We collected 5238 entries from DrugBank, including their physicochemical properties and absorption, distribution, metabolism, excretion, and toxicity (ADMET) features. The models' performance was assessed by the area under the receiver operating characteristic curve (AUROC), with the DNN, encoder, concat, and pipe models achieving 62.4%, 76.0%, 74.9%, and 68.2%, respectively. In a separate test with 84 experimental microsomal stability datasets, the AUROC scores for external data were 78% for DNN, 44% for the encoder, and 50% for concat, indicating that the DNN model had superior predictive capabilities for new data. This suggests that models based on structural information may require further optimization or alternative tokenization strategies. The application of natural language processing techniques to pharmaceutical challenges has demonstrated promising results, highlighting the need for more extensive data to enhance model generalization.
Article
Natural language processing (NLP) technology has recently been used to predict substance properties based on their Simplified Molecular-Input Line-Entry System (SMILES) representations. We aimed to develop a model predicting human skin sensitizers by integrating text features derived from SMILES with in vitro test outcomes. The dataset on SMILES, physicochemical properties, in vitro tests (DPRA, KeratinoSens™, h-CLAT, and SENS-IS assays), and human potency categories for 122 substances was sourced from the Cosmetics Europe database. The ChemBERTa model was employed to analyze the SMILES of substances. The last hidden layer embedding of ChemBERTa was tested with other features. Given the modest dataset size, we trained five XGBoost models using subsets of the training data, and subsequently employed bagging to create the final model. Notably, the features computed from SMILES played a pivotal role in the model for distinguishing sensitizers and non-sensitizers. The final model demonstrated a classification accuracy of 80% and an AUC-ROC of 0.82, effectively discriminating sensitizers from non-sensitizers. Furthermore, the model exhibited an accuracy of 82% and an AUC-ROC of 0.82 in classifying strong and weak sensitizers. In summary, we demonstrated that integrating NLP of SMILES with in vitro test results can enhance the prediction of health hazards associated with chemicals.
Article
We developed a novel drug metabolism and pharmacokinetics (DMPK) analysis platform named DruMAP. This platform consists of a database for DMPK parameters and programs that can predict many DMPK parameters based on the chemical structure of a compound. The DruMAP database includes curated DMPK parameters from public sources and in-house experimental data obtained under standardized conditions; it also stores predicted DMPK parameters produced by our prediction programs. Users can predict several DMPK parameters simultaneously for novel compounds not found in the database. Furthermore, the highly flexible search system enables users to search for compounds as they desire. The current version of DruMAP comprises more than 30,000 chemical compounds, about 40,000 activity values (collected from public databases and in-house data), and about 600,000 predicted values. Our platform provides a simple tool for searching and predicting DMPK parameters and is expected to contribute to the acceleration of new drug development. DruMAP can be freely accessed at: https://drumap.nibiohn.go.jp/.
Article
Pharmacokinetic research plays an important role in the development of new drugs. Accurate predictions of human pharmacokinetic parameters are essential for the success of clinical trials. Clearance (CL) and volume of distribution (Vd) are important factors for evaluating pharmacokinetic properties, and many previous studies have attempted to use computational methods to extrapolate these values from nonclinical laboratory animal models to human subjects. However, it is difficult to obtain sufficient, comprehensive experimental data from these animal models, and many studies are missing critical values. As a result, studies using nonclinical data as explanatory variables can use only a small number of compounds for model training. In this study, we perform missing-value imputation and feature selection on nonclinical data to increase the number of training compounds and nonclinical datasets available for these kinds of studies. We obtained novel models for total body clearance (CLtot) and steady-state Vd (Vdss) (CLtot: geometric mean fold error [GMFE], 1.92; percentage within 2-fold error, 66.5%; Vdss: GMFE, 1.64; percentage within 2-fold error, 71.1%). These accuracies were comparable to those of conventional animal scale-up models. Moreover, this method differs from animal scale-up methods in that it does not require animal experiments, which are becoming more strictly regulated over time.
Article
Ninety percent of clinical drug development fails despite the implementation of many successful strategies, which raises the question of whether certain aspects of target validation and drug optimization are overlooked. Current drug optimization overly emphasizes potency/specificity using structure‒activity relationship (SAR) but overlooks tissue exposure/selectivity in disease/normal tissues using structure‒tissue exposure/selectivity relationship (STR), which may mislead drug candidate selection and impact the balance of clinical dose/efficacy/toxicity. We propose the structure‒tissue exposure/selectivity–activity relationship (STAR) to improve drug optimization; it classifies drug candidates based on potency/selectivity, tissue exposure/selectivity, and the dose required to balance clinical efficacy/toxicity. Class I drugs have high specificity/potency and high tissue exposure/selectivity and need a low dose to achieve superior clinical efficacy/safety with a high success rate. Class II drugs have high specificity/potency and low tissue exposure/selectivity; they require a high dose to achieve clinical efficacy, with high toxicity, and need to be cautiously evaluated. Class III drugs have relatively low (adequate) specificity/potency but high tissue exposure/selectivity; they require a low dose to achieve clinical efficacy with manageable toxicity but are often overlooked. Class IV drugs have low specificity/potency and low tissue exposure/selectivity, achieve inadequate efficacy/safety, and should be terminated early. STAR may improve drug optimization and clinical studies for the success of clinical drug development.
Article
Despite their importance in determining the dosing regimen of drugs in the clinic, only a few studies have investigated methods for predicting blood-to-plasma concentration ratios (Rb). This study established an Rb prediction model incorporating typical human pharmacokinetics (PK) parameters. Experimental Rb values were compiled for 289 compounds, offering reliable predictions by expanding the applicability domain. Notably, it is the largest list of Rb values reported so far. Subsequently, human PK parameters calculated from plasma drug concentrations, including the volume of distribution (Vd), clearance, mean residence time, and plasma protein binding rate, as well as 2702 kinds of molecular descriptors, were used to construct quantitative structure–PK relationship models for Rb. Among the evaluated PK parameters, logVd correlated best with Rb (correlation coefficient of 0.47). Thus, in addition to molecular descriptors selected by XGBoost, logVd was employed to construct the prediction models. Among the analyzed algorithms, artificial neural networks gave the best results. Following optimization using six molecular descriptors and logVd, the model exhibited a correlation coefficient of 0.64 and a root-mean-square error of 0.205, which were superior to those previously reported for other Rb prediction methods. Since Vd values and chemical structures are known for most medications, the Rb prediction model described herein is expected to be valuable in clinical settings.
Article
Research into pharmacokinetics plays an important role in the development process of new drugs. Accurately predicting human pharmacokinetic parameters from preclinical data can increase the success rate of clinical trials. Since clearance (CL), which indicates the capacity of the entire body to process a drug, is one of the most important parameters, many prediction methods have been developed. However, there is still room for improvement for practical use in drug discovery research in two respects: improving CL prediction accuracy and understanding the chemical structure of compounds in terms of pharmacokinetics. To address these, this research proposes a multimodal learning method based on deep learning that takes not only the chemical structure of a drug but also rat CL as inputs. Good results were obtained compared with the conventional animal scale-up method; the geometric mean fold error was 2.68 and the proportion of compounds with prediction errors of 2-fold or less was 48.5%. Furthermore, it was found to be possible to infer the partial structures useful for CL prediction by a structure contributing factor inference method. The validity of these structural interpretations of metabolic stability was confirmed by chemists.
Article
Volume of distribution at steady state (VD,ss) is one of the key pharmacokinetic parameters estimated during the drug discovery process. Despite considerable efforts to predict VD,ss, accuracy and choice of prediction methods remain a challenge, with evaluations constrained to a small set (<150) of compounds. To address these issues, a series of in silico methods for predicting human VD,ss directly from structure were evaluated using a large set of clinical compounds. Machine learning (ML) models were built to predict VD,ss directly and to predict input parameters required for mechanistic and empirical VD,ss predictions. In addition, log D, fraction unbound in plasma (fup), and blood-to-plasma partition ratio (BPR) were measured on 254 compounds to estimate the impact of measured data on predictive performance of mechanistic models. Furthermore, the impact of novel methodologies such as measuring partition (Kp) in adipocytes and myocytes (n = 189) on VD,ss predictions was also investigated. In predicting VD,ss directly from chemical structures, both mechanistic and empirical scaling using a combination of predicted rat and dog VD,ss demonstrated comparable performance (62%–71% within 3-fold). The direct ML model outperformed other in silico methods (75% within 3-fold, r² = 0.5, AAFE = 2.2) when built from a larger data set. Scaling to human from predicted VD,ss of either rat or dog yielded poor results (<47% within 3-fold). Measured fup and BPR improved performance of mechanistic VD,ss predictions significantly (81% within 3-fold, r² = 0.6, AAFE = 2.0). Adipocyte intracellular Kp showed good correlation to the VD,ss but was limited in estimating the compounds with low VD,ss. SIGNIFICANCE STATEMENT This work advances the in silico prediction of VD,ss directly from structure and with the aid of in vitro data. Rigorous and comprehensive evaluation of various methods using a large set of clinical compounds (n = 956) is presented. 
The scale of techniques evaluated is far beyond any previously presented. The novel data set (n = 254) generated using a single protocol for each in vitro assay reported in this study could further aid in advancing VD,ss prediction methodologies.
Article
Virtual screening (VS) has emerged in drug discovery as a powerful computational approach to screen large libraries of small molecules for new hits with desired properties that can then be tested experimentally. As with other computational approaches, the intention of VS is not to replace in vitro or in vivo assays, but to speed up the discovery process, to reduce the number of candidates to be tested experimentally, and to rationalize their choice. Moreover, VS has become very popular in pharmaceutical companies and academic organizations because it saves time, cost, resources, and labor. Among the VS approaches, quantitative structure–activity relationship (QSAR) analysis is the most powerful method due to its high and fast throughput and good hit rate. As the first step of QSAR model development, relevant chemogenomics data are collected from databases and the literature. Then, chemical descriptors are calculated on different levels of representation of molecular structure, ranging from 1D to nD, and then correlated with the biological property using machine learning techniques. Once developed and validated, QSAR models are applied to predict the biological property of novel compounds. Although the experimental testing of computational hits is not an inherent part of QSAR methodology, it is highly desired and should be performed as an ultimate validation of developed models. In this mini-review, we summarize and critically analyze the recent trends of QSAR-based VS in drug discovery and demonstrate successful applications in identifying promising compounds with desired properties. Moreover, we provide some recommendations about the best practices for QSAR-based VS along with the future perspectives of this approach.
Article
Predicting the fraction unbound in plasma provides a good understanding of the pharmacokinetic properties of a drug to assist candidate selection in the early stages of drug discovery. It is also an effective tool to mitigate the risk of late-stage attrition and to optimize further screening. In this study, we built in silico prediction models of fraction unbound in human plasma with freely available software, aiming specifically to improve the accuracy in the low value ranges. We employed several machine learning techniques and built prediction models trained on the largest ever data set of 2738 experimental values. The classification model showed a high true positive rate of 0.826 for the low fraction unbound class on the test set. The strongly biased distribution of the fraction unbound in plasma was mitigated by a logarithmic transformation in the regression model, leading to improved accuracy at lower values. Overall, our models showed better performance than those of previously published methods, including commercial software. Our prediction tool can be used on its own or integrated into other pharmacokinetic modeling systems.
Article
DrugBank (www.drugbank.ca) is a web-enabled database containing comprehensive molecular information about drugs, their mechanisms, their interactions and their targets. First described in 2006, DrugBank has continued to evolve over the past 12 years in response to marked improvements to web standards and changing needs for drug research and development. This year's update, DrugBank 5.0, represents the most significant upgrade to the database in more than 10 years. In many cases, existing data content has grown by 100% or more over the last update. For instance, the total number of investigational drugs in the database has grown by almost 300%, the number of drug-drug interactions has grown by nearly 600% and the number of SNP-associated drug effects has grown more than 3000%. Significant improvements have been made to the quantity, quality and consistency of drug indications, drug binding data as well as drug-drug and drug-food interactions. A great deal of brand new data have also been added to DrugBank 5.0. This includes information on the influence of hundreds of drugs on metabolite levels (pharmacometabolomics), gene expression levels (pharmacotranscriptomics) and protein expression levels (pharmacoproteomics). New data have also been added on the status of hundreds of new drug clinical trials and existing drug repurposing trials. Many other important improvements in the content, interface and performance of the DrugBank website have been made and these should greatly enhance its ease of use, utility and potential applications in many areas of pharmacological research, pharmaceutical science and drug education.
Conference Paper
Understanding why a model makes a certain prediction can be as crucial as the prediction's accuracy in many applications. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, such as ensemble or deep learning models, creating a tension between accuracy and interpretability. In response, various methods have recently been proposed to help users interpret the predictions of complex models, but it is often unclear how these methods are related and when one method is preferable over another. To address this problem, we present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties. The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, we present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
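The idea of assigning each feature an importance value for a single prediction can be illustrated with the classical Shapley value that SHAP builds on. The sketch below is not the SHAP library and not the approximation methods the paper proposes; it is a brute-force computation that averages each feature's marginal contribution over all orderings, which is only feasible for a handful of features. The toy value function and feature names are hypothetical.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            # Marginal contribution of f when added to the current coalition.
            totals[f] += value_fn(with_f) - value_fn(coalition)
            coalition = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_model(coalition):
    """Hypothetical value function: 'a' adds 2, 'b' adds 1, and a and b
    together add an interaction bonus of 1; 'c' has no effect."""
    value = 0.0
    if "a" in coalition:
        value += 2.0
    if "b" in coalition:
        value += 1.0
    if "a" in coalition and "b" in coalition:
        value += 1.0
    return value


phi = shapley_values(toy_model, ["a", "b", "c"])
print(phi)
```

The attributions sum to the difference between the value of the full feature set and the empty set, which is the additivity property the abstract refers to; the interaction bonus is split equally between the two features that produce it.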
Article
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public repository for information on chemical substances and their biological activities, launched in 2004 as a component of the Molecular Libraries Roadmap Initiatives of the US National Institutes of Health (NIH). For the past 11 years, PubChem has grown to a sizable system, serving as a chemical information resource for the scientific research community. PubChem consists of three inter-linked databases, Substance, Compound and BioAssay. The Substance database contains chemical information deposited by individual data contributors to PubChem, and the Compound database stores unique chemical structures extracted from the Substance database. Biological activity data of chemical substances tested in assay experiments are contained in the BioAssay database. This paper provides an overview of the PubChem Substance and Compound databases, including data sources and contents, data organization, data submission using PubChem Upload, chemical structure standardization, web-based interfaces for textual and non-textual searches, and programmatic access. It also gives a brief description of PubChem3D, a resource derived from theoretical three-dimensional structures of compounds in PubChem, as well as PubChemRDF, Resource Description Framework (RDF)-formatted PubChem data for data sharing, analysis and integration with information contained in other databases.
Article
Introduction: Interspecies allometric scaling provides a simple and fast option to interpolate or extrapolate drug dose or pharmacokinetic parameters to a species of interest. Over the years, new scaling methods have been developed in order to improve the performance of these predictions. It is critical to choose appropriate allometric scaling approach(es) to analyze the available pharmacokinetic data. Areas covered: This review provides updated information on the latest allometric scaling methods developed for the most frequently interpolated or extrapolated pharmacokinetic parameters. The different degrees of success and advantages/disadvantages of different methods are compared and contrasted. The pitfalls that affect the accuracy of prediction and the solutions to avoid the risk of prediction errors are discussed. The application of allometric scaling in veterinary medicine is presented. Expert opinion: Although interspecies allometric scaling needs further refinements and has limitations, it is still a potential tool and rational option for the estimate of pharmacokinetic parameters in species for which there are no data available or to better interpret preclinical efficacy and safety trials. Allometric scaling can offer insight into possible mechanisms of species-dependent drug disposition.
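As an illustration of the simplest form of interspecies allometric scaling, the sketch below uses the power law Y = a · W^b, where Y is a pharmacokinetic parameter and W is body weight. The review covers many refinements that are not shown here. The exponent of 0.75 for clearance and the 70 kg human body weight are common conventions used as assumptions, and the function names and sample data are hypothetical.

```python
import math


def scale_clearance(cl_animal, bw_animal, bw_human=70.0, exponent=0.75):
    """Single-species scaling: CL_human = CL_animal * (BW_human / BW_animal) ** b.

    The exponent 0.75 and the 70 kg human weight are assumed defaults.
    """
    return cl_animal * (bw_human / bw_animal) ** exponent


def fit_allometry(weights, values):
    """Fit Y = a * W ** b by least squares on log10-transformed data.

    Returns (a, b). Requires at least two species with distinct weights.
    """
    xs = [math.log10(w) for w in weights]
    ys = [math.log10(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = 10 ** (mean_y - b * mean_x)
    return a, b


# Hypothetical body weights (kg) and clearances, for illustration only.
weights = [0.25, 2.5, 10.0]
clearances = [2.0 * w ** 0.75 for w in weights]
a, b = fit_allometry(weights, clearances)
print(f"a = {a:.3f}, b = {b:.3f}, predicted human CL = {a * 70.0 ** b:.2f}")
```

Fitting across several species, rather than scaling from one, lets the exponent be estimated from the data instead of being fixed in advance.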
Article
ChEMBL is an Open Data database containing binding, functional and ADMET information for a large number of drug-like bioactive compounds. These data are manually abstracted from the primary published literature on a regular basis, then further curated and standardized to maximize their quality and utility across a wide range of chemical biology and drug-discovery research problems. Currently, the database contains 5.4 million bioactivity measurements for more than 1 million compounds and 5200 protein targets. Access is available through a web-based interface, data downloads and web services at: https://www.ebi.ac.uk/chembldb.
Article
Quantitative structure-activity relationships (QSAR) have been applied for decades in the development of relationships between physicochemical properties of chemical substances and their biological activities to obtain a reliable statistical model for prediction of the activities of new chemical entities. The fundamental principle underlying the formalism is that the difference in structural properties is responsible for the variations in biological activities of the compounds. In the classical QSAR studies, affinities of ligands to their binding sites, inhibition constants, rate constants, and other biological end points, with atomic, group or molecular properties such as lipophilicity, polarizability, electronic and steric properties (Hansch analysis) or with certain structural features (Free-Wilson analysis) have been correlated. However such an approach has only a limited utility for designing a new molecule due to the lack of consideration of the 3D structure of the molecules. 3D-QSAR has emerged as a natural extension to the classical Hansch and Free-Wilson approaches, which exploits the three-dimensional properties of the ligands to predict their biological activities using robust chemometric techniques such as PLS, G/PLS, ANN etc. It has served as a valuable predictive tool in the design of pharmaceuticals and agrochemicals. Although the trial and error factor involved in the development of a new drug cannot be ignored completely, QSAR certainly decreases the number of compounds to be synthesized by facilitating the selection of the most promising candidates. Several success stories of QSAR have attracted the medicinal chemists to investigate the relationships of structural properties with biological activity. 
This review seeks to provide a bird's eye view of the different 3D-QSAR approaches employed within the current drug discovery community to construct predictive structure-activity relationships and also discusses the limitations that are fundamental to these approaches, as well as those that might be overcome with the improved strategies. The components involved in building a useful 3D-QSAR model are discussed, including the validation techniques available for this purpose.
Article
The pharmacokinetic (PK) and toxicokinetic profile of a drug from its preclinical evaluation helps the researcher determine whether the drug should be tested in humans based on its safety and toxicity. Preclinical studies require time and resources and are prone to error. Moreover, according to the United States Food and Drug Administration Modernisation Act 2, animal testing is no longer mandatory for new drug development, and an animal-free alternative, such as cell-based assay and computer models, can be used. Different physiologically based PK models were developed for an anaplastic lymphoma kinase inhibitor in rats and monkeys after intravenous and oral administration using its physicochemical properties and in vitro characterisation data. The developed model was validated against the in vivo data available in the literature, and the validation results were found within the acceptable limit. A parameter sensitivity analysis was performed to identify the properties of the compound influencing the PK profile. This work demonstrates the application of the physiologically based PK model to predict the PKs of a drug, which will eventually assist in reducing the number of animal studies and save time and cost of drug discovery and development.
Article
Optimisation of compound pharmacokinetics (PK) is an integral part of drug discovery and development. Animal in vivo PK data as well as human and animal in vitro systems are routinely utilised to evaluate PK in humans. In recent years machine learning and artificial intelligence (AI) emerged as a major tool for modelling of in vivo animal and human PK, enabling prediction from chemical structure early in drug discovery, and therefore offering opportunities to guide the design and prioritisation of molecules based on relevant in vivo properties and, ultimately, predicting human PK at the point of design. This review presents recent advances in machine learning and AI models for in vivo animal and human PK for small-molecule compounds as well as some examples for antibody therapeutics.
Article
The skin irritation test is an essential part of the safety assessment of chemicals. Recently, computational models to predict skin irritation have drawn attention as alternatives to animal testing. We developed prediction models for skin irritation/corrosion of liquid chemicals using machine learning algorithms, with 34 physicochemical descriptors calculated from the structure. The training and test dataset of 545 liquid chemicals with reliable in vivo skin hazard classifications based on the UN Globally Harmonized System [category 1 (corrosive, Cat 1), 2 (irritant, Cat 2), 3 (mild irritant, Cat 3), and no category (nonirritant, NC)] was collected from public databases. After the curation of input data through removal and correlation analysis, every model was constructed to predict skin hazard classification for liquid chemicals with 22 physicochemical descriptors. Seven machine learning algorithms [Logistic regression, Naïve Bayes, k-nearest neighbor, Support vector machine, Random Forest, Extreme gradient boosting (XGB), and Neural net] were applied to ternary and binary classification of skin hazard. The XGB model demonstrated the highest accuracy (0.73-0.81), sensitivity (0.71-0.92), and positive predictive value (0.65-0.81). The contribution of physicochemical descriptors to the classification was analyzed using a Shapley Additive exPlanations plot to provide insight into the skin irritation of chemicals. Supplementary information: The online version contains supplementary material available at 10.1007/s43188-022-00168-8.
Article
Problems with drug ADME are responsible for many clinical failures. By understanding the ADME properties of marketed drugs and modeling how chemical structure contributes to these inherent properties, we can help new projects reduce their risk profiles. Kinetic aqueous solubility, the parallel artificial membrane permeability assay (PAMPA), and rat liver microsomal stability constitute the Tier I ADME assays at the National Center for Advancing Translational Sciences (NCATS). Using recent data generated from in-house lead optimization Tier I studies, we update quantitative structure–activity relationship (QSAR) models for these three endpoints and validate in silico performance against a set of marketed drugs (balanced accuracies range between 71% and 85%). Improved models and experimental datasets are of direct relevance to drug discovery projects and, together with the prediction services that have been made publicly available at the ADME@NCATS web portal (https://opendata.ncats.nih.gov/adme/), provide important tools for the drug discovery community. The results are discussed in light of our previously reported ADME models and state-of-the-art models from scientific literature.
Preprint
We apply a Transformer architecture, specifically BERT, to learn flexible and high quality molecular representations for drug discovery problems. We study the impact of using different combinations of self-supervised tasks for pre-training, and present our results for the established Virtual Screening and QSAR benchmarks. We show that: i) The selection of appropriate self-supervised task(s) for pre-training has a significant impact on performance in subsequent downstream tasks such as Virtual Screening. ii) Using auxiliary tasks with more domain relevance for Chemistry, such as learning to predict calculated molecular properties, increases the fidelity of our learnt representations. iii) Finally, we show that molecular representations learnt by our model 'MolBert' improve upon the current state of the art on the benchmark datasets.
Preprint
GNNs and chemical fingerprints are the predominant approaches to representing molecules for property prediction. However, in NLP, transformers have become the de-facto standard for representation learning thanks to their strong downstream task transfer. In parallel, the software ecosystem around transformers is maturing rapidly, with libraries like HuggingFace and BertViz enabling streamlined training and introspection. In this work, we make one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via our ChemBERTa model. ChemBERTa scales well with pretraining dataset size, offering competitive downstream performance on MoleculeNet and useful attention-based visualization modalities. Our results suggest that transformers offer a promising avenue of future work for molecular representation learning and property prediction. To facilitate these efforts, we release a curated dataset of 77M SMILES from PubChem suitable for large-scale self-supervised pretraining.
Conference Paper
With the rapid progress of AI in both academia and industry, Deep Learning has been widely introduced into various areas of drug discovery to accelerate its pace and cut R&D costs. Among all the problems in drug discovery, molecular property prediction is one of the most important. Unlike general Deep Learning applications, the scale of labeled data is limited in molecular property prediction. To better solve this problem, Deep Learning methods have started focusing on how to utilize large amounts of unlabeled data to improve prediction performance on small-scale labeled data. In this paper, we propose a semi-supervised model named SMILES-BERT, which consists of attention-mechanism-based Transformer layers. Large-scale unlabeled data were used to pre-train the model through a Masked SMILES Recovery task. The pre-trained model could then easily be generalized to different molecular property prediction tasks via fine-tuning. In the experiments, the proposed SMILES-BERT outperforms the state-of-the-art methods on all three datasets, showing the effectiveness of our unsupervised pre-training and the great generalization capability of the pre-trained model.
Article
Modeling and simulation of drug disposition has emerged as an important tool in drug development, clinical study design and regulatory review, and the number of physiologically based pharmacokinetic (PBPK) modeling related publications and regulatory submissions has risen dramatically in recent years. However, the extent of use of PBPK modeling by researchers, and the public availability of models has not been systematically evaluated. This review evaluated PBPK-related publications to 1) identify the common applications of PBPK modeling, 2) determine ways in which models are developed, 3) establish how model quality is assessed and 4) provide a list of publicly available PBPK models for sensitive P450 and transporter substrates as well as selective inhibitors and inducers. PubMed searches were conducted using the terms PBPK and physiologically based pharmacokinetic model to collect published models. Only papers on PBPK modeling of pharmaceutical agents in humans published in English between 2008 and May 2015 were reviewed. A total of 366 PBPK-related articles met the search criteria with the number of articles published per year rising steadily. Published models were most commonly used for drug-drug interaction (DDI) predictions (28%), followed by interindividual variability and general clinical pharmacokinetic predictions (23%), formulation or absorption modeling (12%) and predicting age related changes in pharmacokinetics and disposition (10%). 106 models of sensitive substrates, inhibitors and inducers were identified. An in-depth analysis of the model development and verification revealed a lack of consistency in model development and quality assessment practices demonstrating a need for development of best-practice guidelines. The American Society for Pharmacology and Experimental Therapeutics.
Article
Human clearance is often predicted prior to clinical study from in vivo preclinical data by virtue of interspecies allometric scaling methods. The aims of this study were to determine the important molecular descriptors for the extrapolation of animal data to human clearance and further to build a model to predict human clearance by combination of animal data and the selected molecular descriptors. These important molecular descriptors selected by genetic algorithm (GA) were from five classes: quantum mechanical, shadow indices, E-state keys, molecular properties and molecular property counts. Although the data set contained many outliers determined by the conventional Mahmood method, the variation of most outliers was reduced significantly by our final support vector machine (SVM) model. The values of cross-validated correlation coefficient and root mean squared error (RMSE) for leave-one-out cross-validation (LOOCV) of the final SVM model were 0.783 and 0.305, respectively. Meanwhile, the reliability and consistency of the final model were also validated by an external test set. In conclusion, the SVM model based on the molecular descriptors selected by GA and animal data achieved better prediction performance than the Mahmood method. This approach can be applied as an improved interspecies allometric scaling method in drug research and development. This article is protected by copyright. All rights reserved.
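As a rough illustration of the interspecies allometric scaling mentioned in the abstract above, the sketch below fits simple allometry, CL = a * W^b, by least squares on log-transformed values and extrapolates to a 70 kg human. This is the textbook power-law form only; it is not the Mahmood method or the SVM model described in the paper. The function names, body weights, and clearance values are hypothetical and chosen purely for demonstration.

```python
import math


def fit_simple_allometry(weights_kg, clearances):
    """Fit CL = a * W**b by ordinary least squares in log10-log10 space."""
    xs = [math.log10(w) for w in weights_kg]
    ys = [math.log10(c) for c in clearances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # allometric exponent (slope)
    log_a = mean_y - b * mean_x   # intercept in log space
    return 10 ** log_a, b


def predict_clearance(a, b, weight_kg):
    """Extrapolate clearance to a given body weight with the fitted power law."""
    return a * weight_kg ** b


# Hypothetical animal data (body weight in kg, clearance in mL/min),
# generated from a = 2 and b = 0.75 for illustration only.
animal_weights = [0.25, 5.0, 10.0]
animal_clearances = [2.0 * w ** 0.75 for w in animal_weights]

a, b = fit_simple_allometry(animal_weights, animal_clearances)
human_cl = predict_clearance(a, b, 70.0)
print(f"a = {a:.3f}, b = {b:.3f}, predicted human CL = {human_cl:.1f} mL/min")
```

Fitting in log space keeps the problem linear, which is why simple allometry is usually presented as a straight line on a log-log plot of clearance against body weight.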
Article
Han van de Waterbeemd, Christopher Kohl – Pfizer Global Research & Development, Sandwich Laboratories, PDM (Pharmacokinetics, Dynamics and Metabolism), ipc 664, Ramsgate Road, Sandwich, Kent, UK CT13 9NJ. The prediction of the fundamental pharmacokinetic parameters clearance and volume of distribution is essential to rational compound progression in the drug discovery phase. Improvements in the accuracy and reliability of these predictions could considerably improve the chances of success in drug development in many pharmaceutical companies. Jörg Keldenich has many years of industrial experience in the modeling and prediction of ADME properties. He reviews general approaches and recent advances in the prediction of human clearance and volume from early discovery to the preclinical phase.
Article
A comprehensive analysis on the prediction of human clearance based on intravenous pharmacokinetic data from rat, dog, and monkey for approximately 400 compounds was undertaken. This data set has been carefully compiled from literature reports and expanded with some in-house determinations for plasma protein binding and rat clearance. To the authors' knowledge, this is the largest publicly available data set. The present examination offers a comparison of 37 different methods for prediction of human clearance across compounds of diverse physicochemical properties. Furthermore, this work demonstrates the application of each prediction method to each charge class of the compounds, thus presenting an additional dimension to prediction of human pharmacokinetics. In general, the observations suggest that methods employing monkey clearance values and a method incorporating differences in plasma protein binding between rat and human yield the best overall predictions as suggested by approximately 60% of compounds within 2-fold geometric mean-fold error. Other single-species scaling or proportionality methods incorporating the fraction unbound in the corresponding preclinical species for prediction of free clearance in human were generally unsuccessful.
Article
The authors present a comprehensive analysis on the estimation of volume of distribution at steady state (VD(ss)) in human based on rat, dog, and monkey data on nearly 400 compounds for which there are also associated human data. This data set, to the authors' knowledge, is the largest publicly available, has been carefully compiled from literature reports, and was expanded with some in-house determinations such as plasma protein binding data. This work offers a good statistical basis for the evaluation of applicable prediction methods, their accuracy, and some methods-dependent diagnostic tools. The authors also grouped the compounds according to their charge classes and show the applicability of each method considered to each class, offering further insight into the probability of a successful prediction. Furthermore, they found that the use of fraction unbound in plasma, to obtain unbound volume of distribution, is generally detrimental to accuracy of several methods, and they discuss possible reasons. Overall, the approach using dog and monkey data in the Øie-Tozer equation offers the highest probability of success, with an intrinsic diagnostic tool based on aberrant values (<0 or >1) for the calculated fraction unbound in tissue. Alternatively, methods based on dog data (single-species scaling) and rat and dog data (Øie-Tozer equation with 2 species or multiple regression methods) may be considered reasonable approaches while not requiring data in nonhuman primates.
Article
The objective of this study was to evaluate the performance of various allometric and in vitro-in vivo extrapolation (IVIVE) methodologies with and without plasma protein binding corrections for the prediction of human intravenous (i.v.) clearance (CL). The objective was also to evaluate the IVIVE prediction methods with animal data. Methodologies were selected from the literature. Pharmaceutical Research and Manufacturers of America member companies contributed blinded datasets from preclinical and clinical studies for 108 compounds, among which 19 drugs had i.v. clinical pharmacokinetics data and were used in the analysis. In vivo and in vitro preclinical data were used to predict CL by 29 different methods. For many compounds, in vivo data from only two species (generally rat and dog) were available and/or the required in vitro data were missing, which meant some methods could not be properly evaluated. In addition, 66 methods of predicting oral (p.o.) area under the curve (AUC(p.o.)) were evaluated for 107 compounds using rational combinations of i.v. CL and bioavailability (F), and direct scaling of observed p.o. CL from preclinical species. Various statistical and outlier techniques were employed to assess the predictability of each method. Across methods, the maximum success rate in predicting human CL for the 19 drugs was 100%, 94%, and 78% of the compounds with predictions falling within 10-fold, threefold, and twofold error, respectively, of the observed CL. In general, in vivo methods performed slightly better than IVIVE methods (at least in terms of measures of correlation and global concordance), with the fu intercept method and two-species-based allometry (rat-dog) being the best performing methods. IVIVE methods using microsomes (incorporating both plasma and microsomal binding) and hepatocytes (not incorporating binding) resulted in 75% and 78%, respectively, of the predictions falling within twofold error. IVIVE methods using other combinations of binding assumptions were much less accurate. The results for prediction of AUC(p.o.) were consistent with i.v. CL. However, the greatest challenge to successful prediction of human p.o. CL is the estimate of F in human. Overall, the results of this initiative confirmed predictive performance of common methodologies used to predict human CL. © 2011 Wiley-Liss, Inc. and the American Pharmacists Association J Pharm Sci.
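The success rates quoted above (share of predictions within twofold, threefold, or 10-fold error) and the geometric mean fold error used elsewhere on this page can be computed as in the minimal sketch below. It assumes one common definition, fold error = max(predicted/observed, observed/predicted) and GMFE = 10^(mean |log10(predicted/observed)|); the cited studies may define or aggregate these metrics differently. The function names and the sample values are hypothetical.

```python
import math


def fold_errors(predicted, observed):
    """Per-compound fold error, always >= 1 (over- and under-prediction treated alike)."""
    return [max(p / o, o / p) for p, o in zip(predicted, observed)]


def gmfe(predicted, observed):
    """Geometric mean fold error: 10 ** mean(|log10(predicted / observed)|)."""
    abs_logs = [abs(math.log10(p / o)) for p, o in zip(predicted, observed)]
    return 10 ** (sum(abs_logs) / len(abs_logs))


def fraction_within(predicted, observed, fold):
    """Fraction of predictions whose fold error does not exceed `fold`."""
    errors = fold_errors(predicted, observed)
    return sum(e <= fold for e in errors) / len(errors)


# Hypothetical predicted and observed clearance values, for illustration only.
predicted_cl = [2.0, 0.5, 10.0, 1.0]
observed_cl = [1.0, 1.0, 1.0, 1.0]

print("GMFE:", round(gmfe(predicted_cl, observed_cl), 3))
for fold in (2, 3, 10):
    share = fraction_within(predicted_cl, observed_cl, fold)
    print(f"within {fold}-fold: {share:.0%}")
```

Working with the ratio in both directions makes a twofold over-prediction and a twofold under-prediction count equally, which is the usual motivation for reporting fold error rather than a signed percentage error.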
Article
The objective of this study was to evaluate the performance of various empirical, semimechanistic and mechanistic methodologies with and without protein binding corrections for the prediction of human volume of distribution at steady state (Vss). PhRMA member companies contributed a set of blinded data from preclinical and clinical studies, and 18 drugs with intravenous clinical pharmacokinetics (PK) data were available for the analysis. In vivo and in vitro preclinical data were used to predict Vss by 24 different methods. Various statistical and outlier techniques were employed to assess the predictability of each method. There was not simply one method that predicts Vss accurately for all compounds. Across methods, the maximum success rate in predicting human Vss was 100%, 94%, and 78% of the compounds with predictions falling within tenfold, threefold, and twofold error, respectively, of the observed Vss. Generally, the methods that made use of in vivo preclinical data were more predictive than those methods that relied solely on in vitro data. However, for many compounds, in vivo data from only two species (generally rat and dog) were available and/or the required in vitro data were missing, which meant some methods could not be properly evaluated. It is recommended to initially use the in vitro tissue composition-based equations to predict Vss in preclinical species and humans, putting the assumptions and compound properties into context. As in vivo data become available, these predictions should be reassessed and rationalized to indicate the level of confidence (uncertainty) in the human Vss prediction. The top three methods that perform strongly at integrating in vivo data in this way were the Øie-Tozer, the rat-dog-human proportionality equation, and the lumped-PBPK approach. Overall, the scientific benefit of this study was to obtain greater characterization of predictions of human Vss from several methods available in the literature.
Article
PaDEL-Descriptor is software for calculating molecular descriptors and fingerprints. The software currently calculates 797 descriptors (663 1D, 2D descriptors, and 134 3D descriptors) and 10 types of fingerprints. These descriptors and fingerprints are calculated mainly using The Chemistry Development Kit. Some additional descriptors and fingerprints were added, which include atom type electrotopological state descriptors, McGowan volume, molecular linear free energy relation descriptors, ring counts, count of chemical substructures identified by Laggner, and binary fingerprints and count of chemical substructures identified by Klekota and Roth. PaDEL-Descriptor was developed using the Java language and consists of a library component and an interface component. The library component allows it to be easily integrated into quantitative structure activity relationship software to provide the descriptor calculation feature, while the interface component allows it to be used as standalone software. The software uses a Master/Worker pattern to take advantage of the multiple CPU cores that are present in most modern computers to speed up calculations of molecular descriptors. The software has several advantages over existing standalone molecular descriptor calculation software. It is free and open source, has both graphical user interface and command line interfaces, can work on all major platforms (Windows, Linux, MacOS), supports more than 90 different molecular file formats, and is multithreaded. PaDEL-Descriptor is a useful addition to the currently available molecular descriptor calculation software. The software can be downloaded at http://padel.nus.edu.sg/software/padeldescriptor.