Article, Literature Review

Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches

Abstract

The success of ligand-based virtual screening calculations is strongly influenced by the nature of target-specific structure-activity relationships, which can severely constrain the ability to recognize diverse structures with similar activity. Accordingly, the performance of similarity-based methods depends strongly on the compound class under study, and approaches of different design and complexity often produce, overall, equally good (or bad) results. However, there is often little overlap in the similarity relationships detected by different approaches, which motivates the development of alternative similarity methods. These include, among others, novel algorithms to navigate high-dimensional chemical spaces, train similarity calculations on specific compound classes, and detect remote similarity relationships.

... There is no need to physically visit a dermatologist's office to assess the condition of the skin, since smartphone applications offer this service. SkinVision is one such application; it allows users to monitor suspicious moles and informs dermatologists when in-depth examinations are needed [24]. Another similar option is Idoc24, which allows users to send photos of rashes, lesions, or unusual spots that concern them. ...
... They show that social support can be effective in reducing the consequences of "burnout syndrome", either by removing some of the predisposing conditions that lead to the development of the syndrome or by protecting the person regardless of the conditions, so that they do not reach burnout. According to other researchers, social support is necessary for establishing and maintaining both physical and mental health, regardless of the presence or absence of work-related stressors [24]. ...
... There are many different theories of leadership, proposed and practiced by specialists in various organizations and circumstances. In his publication "Leadership" (1978), J. Burns was the first to distinguish between the transactional and transformational styles of leadership (Table 2): Table 2. Differences between the transformational and transactional styles of leadership [15, 24, 25]. ...
Article
Full-text available
Tumour cell motility, which is dependent on the organization of the cytoskeleton, is considered to play an important role in the spread of malignant melanoma. Therefore, retinoids, which are modulators of cytoskeletal organization, may affect the motile activity of melanoma cells. The goal of the present study was to find similar structures (compounds) of mofarotene by the CompTox Chemistry Dashboard and to calculate their molecular properties by the Molinspiration
... The calculated Tanimoto coefficient (TC, the recognized measure of structural similarity) for the calcidiol-calcitriol pair is 0.97, while high structural similarity is recognized for TC > 0.85 [2]. Thus, according to the similar property principle (SPP) [3], these molecules should exhibit similar biological effects. In fact, it has recently been demonstrated that calcidiol exerts genomic and nongenomic effects similar to those of calcitriol, albeit at substantially higher effective concentrations [4][5][6][7]. ...
... Molecular similarity is a key concept widely used in drug discovery and design. However, molecular similarity does not always ensure similar or identical biological activity [3,22]. To the best of our knowledge, our study is the first to reveal the opposite biological effects of these vitamin D metabolites, whereas previous studies demonstrated that 25(OH)D3 exerts genomic and non-genomic effects similar to those of 1,25(OH)D3, albeit at substantially higher effective concentrations [4][5][6][7]. ...
Article
Calcidiol (25 hydroxy vitamin D3) is the major circulating metabolite of vitamin D in the human body and direct precursor of calcitriol (1,25 dihydroxy vitamin D3) with generally known hormonal activity. The biological activities of calcidiol have not been widely investigated despite its high structural similarity with calcitriol and hundreds of times higher circulating concentration. In this study, we investigated the impact of calcidiol on hypoxia-inducible transcription factors HIF-1 and HIF-2 that are highly involved in the regulation of various aspects of tumour biology and important for the survival of hypoxic tumour cells. Our study demonstrates that, unlike calcitriol, calcidiol induces the transcription of HIF-1A and EPAS1 genes and increases HIF-1/2α protein levels under normoxia. Moreover, in spite of the fact that this induction practically does not increase the transcriptional activities of HIF-1/2 under normoxia, calcidiol strongly potentiates the HIF-1/2 transcriptional activity induced by hypoxia. To the best of our knowledge, this is the first study that demonstrates the effect of calcidiol on HIF-1/2 signalling and reveals that this effect is opposing to that of calcitriol.
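The Tanimoto coefficient and the 0.85 cutoff quoted in the excerpts for this article can be illustrated with a minimal sketch. It assumes that each fingerprint is represented as a set of on-bit indices; the function names `tanimoto` and `is_highly_similar` and the toy fingerprints are hypothetical and do not correspond to any real molecule or to the authors' software.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient |A and B| / |A or B| for sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def is_highly_similar(fp_a: set, fp_b: set, threshold: float = 0.85) -> bool:
    """Apply the TC > 0.85 rule of thumb quoted in the excerpt."""
    return tanimoto(fp_a, fp_b) > threshold


# Toy fingerprints (hypothetical): 33 on bits versus 32 on bits, 32 shared.
fp_x = set(range(33))
fp_y = set(range(32))

print(round(tanimoto(fp_x, fp_y), 2))
print(is_highly_similar(fp_x, fp_y))
```

As the article itself stresses, a value above the cutoff only indicates structural resemblance under the chosen fingerprint; it does not guarantee similar biological activity.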
... [6] The main point of calculating similarity measurements lies in the "molecular similarity principle": similar molecules have similar properties/activities. [7] This powerful idea is at the core of virtual screening [8][9][10][11], hit selection [12], QSAR/QSPR modeling [13,14], many chemical space exploration methods [15,16], activity landscape description [17,18], diversity selection [19], clustering [20,21], and many more applications. ...
Article
Full-text available
The quantification of molecular similarity has been present since the beginning of cheminformatics. Although several similarity indices and molecular representations have been reported, all of them ultimately reduce to the calculation of molecular similarities of only two objects at a time. Hence, to obtain the average similarity of a set of molecules, all the pairwise comparisons need to be computed, so the computational cost scales quadratically with the number of molecules. Here we propose an exact alternative to this problem: iSIM (instant similarity). iSIM performs comparisons of multiple molecules at the same time and yields the same value as the average pairwise comparisons of molecules represented by binary fingerprints and real-value descriptors. In this work, we introduce the mathematical framework and several applications of iSIM in chemical sampling, visualization, diversity selection, and clustering.
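The idea of replacing all pairwise comparisons with a single pass over the data can be sketched for one simple case. The example below is an illustration under stated assumptions, not the authors' iSIM implementation: it uses the Russell-Rao index (shared on bits divided by fingerprint length), for which the sum of shared bits over all pairs can be obtained exactly from per-column counts. The function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations


def mean_pairwise_russell_rao(fps):
    """Reference: average Russell-Rao similarity over all pairs (quadratic in N)."""
    n_bits = len(fps[0])
    pairs = list(combinations(fps, 2))
    total = sum(sum(x & y for x, y in zip(a, b)) / n_bits for a, b in pairs)
    return total / len(pairs)


def column_sum_russell_rao(fps):
    """Same average from per-column on-bit counts only (linear in N)."""
    n, n_bits = len(fps), len(fps[0])
    col_sums = [sum(col) for col in zip(*fps)]
    # A column with k on bits contributes k*(k-1)/2 pairs that share that bit.
    coincidences = sum(k * (k - 1) // 2 for k in col_sums)
    n_pairs = n * (n - 1) // 2
    return coincidences / (n_bits * n_pairs)


# Toy binary fingerprints (hypothetical), one row per molecule.
fps = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1],
]

print(mean_pairwise_russell_rao(fps))
print(column_sum_russell_rao(fps))
```

Both functions return the same average for this index; how the approach is extended to other similarity indices and to real-valued descriptors is described in the paper itself and is not reproduced here.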
... Twelve credible objective functions covering basic molecular properties, synthetic accessibility, drug-likeness, absorption, distribution and toxicity were provided for property optimization. To retain the efficacy and novelty of the initial optimized molecule, the Similarity Constrain and Substructure Constrain functions were applied for the definition of the starting point and the annotation of important active motifs, respectively [38]. The application of the Similarity Constrain function enables the setting of the distance limitation between the generated molecule and the reference molecule based on the ECFP4 fingerprint and Tanimoto similarity metric, while the application of the Substructure Constrain function highlights the importance of the bioactivity motif. ...
Article
Full-text available
Drug discovery and development constitute a laborious and costly undertaking. The success of a drug hinges not only on good efficacy but also on acceptable absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties. Overall, up to 50% of drug development failures have been attributed to undesirable ADMET profiles. As a multi-parameter objective, the optimization of ADMET properties is extremely challenging owing to the vast chemical space and limited human expert knowledge. In this study, a freely available platform called Chemical Molecular Optimization, Representation and Translation (ChemMORT) is developed for the optimization of multiple ADMET endpoints without the loss of potency (https://cadd.nscc-tj.cn/deploy/chemmort/). ChemMORT contains three modules: Simplified Molecular Input Line Entry System (SMILES) Encoder, Descriptor Decoder and Molecular Optimizer. The SMILES Encoder can generate the molecular representation with a 512-dimensional vector, and the Descriptor Decoder is able to translate the above representation to the corresponding molecular structure with high accuracy. Based on reversible molecular representation and particle swarm optimization strategy, the Molecular Optimizer can be used to effectively optimize undesirable ADMET properties without the loss of bioactivity, which essentially accomplishes the design of inverse QSAR. The constrained multi-objective optimization of the poly(ADP-ribose) polymerase-1 inhibitor is provided as the case to explore the utility of ChemMORT.
... This calculation centers around analyzing the molecular characteristics of molecules under investigation, such as Log P and molar mass. Moreover, the assessment of molecular similarity predominantly hinges on the calculation of spatial distances by incorporating both structural and physicochemical characteristics (Eckert & Bajorath, 2007). The utilization of built-in chemical spaces of InfiniSee for molecular or 2D similarity searching of S-adenosyl-L-cysteine led to the creation of a compound library comprising 300 molecules. ...
Article
Yellow fever virus is a flavivirus with a positive-sense RNA genome that encodes a single polyprotein. Host proteases cut this polyprotein into seven nonstructural proteins including a vital NS3 protein. The present study aims to identify the most effective inhibitor against the helicase (NS3) using different advanced ligand- and structure-based computational studies. A set of 300 ligands similar to S-adenosyl-l-cysteine was selected against the helicase by a chemical structural similarity model using infiniSee. This tool screens billions of compounds through a similarity search from in-built chemical spaces (CHEMriya, Galaxi, KnowledgeSpace and REALSpace). The pharmacophore was designed from ligands in the library that showed the same features. According to the sequence of ligands, six compounds (29, 87, 99, 116, 148, and 208) were taken for pharmacophore design against the helicase protein. Subsequently, compounds from the library that showed the best shared pharmacophore features were docked using the FlexX functionality of SeeSAR and their Optibrium properties were analyzed. Afterward, their ADME was improved by replacing the unfavorable fragments, which resulted in the generation of new compounds. The selected best compounds (301, 302, 303 and 304) were docked using SeeSAR and their pharmacokinetic and toxicological properties were evaluated using SwissADME. The optimal inhibitor for yellow fever helicase was 2-amino-N-(4-(dimethylamino)thiazol-2-yl)-4-methyloxazole-5-carboxamide (302), which exhibits promising potential for drug development.
... Before the visual inspection, computational medicinal chemists often need to perform cluster analysis on the screened hits [16]. Cluster analysis incorporates molecular similarity theory into the structure-based virtual screening, which states that similar molecules should have similar biological activities [17,18]. In several cases, virtual screening results contain compounds similar to each other, making it impossible to test all the hits. ...
Article
Molecular clustering analysis has been developed to facilitate visual inspection in the process of structure-based virtual screening. However, traditional methods based on molecular fingerprints or molecular descriptors limit the accuracy of selecting active hit compounds, which may be attributed to the lack of representations of receptor structural and protein-ligand interaction during the clustering. Here, a novel deep clustering framework named ClusterX is proposed to learn molecular representations of protein-ligand complexes and cluster the ligands. In ClusterX, the graph was used to represent the protein-ligand complex, and the joint optimisation can be used efficiently for learning the cluster-friendly features. Experiments on the KLIFs database show that the model can distinguish well between the binding modes of different kinase inhibitors. To validate the effectiveness of the model, the clustering results on the virtual screening dataset further demonstrated that ClusterX achieved better or more competitive performance against traditional methods, such as SIFt and extended connectivity fingerprints. This framework may provide a unique tool for clustering analysis and prove to assist computational medicinal chemists in visual decision-making.
... Therefore, VS is typically used as the first CADD method in a pipeline as it can rapidly test a large number of compounds computationally, reducing time and cost by limiting the number of compounds that must be synthesized or purchased. VS is usually performed using structure-based methods but can also be performed via ligand-based methods if there is at least one known hit [14]. Most often, VS is performed at an ultra-high-throughput scale (millions to billions of compounds) employing previously enumerated and purchasable chemical libraries or in-house VS libraries [15]. ...
Article
Full-text available
The use of computer-aided drug design (CADD) for the identification of lead compounds in radiotracer development is steadily increasing. Traditional CADD methods, such as structure-based and ligand-based virtual screening and optimization, have been successfully utilized in many drug discovery programs and are highlighted throughout this review. First, we discuss the use of virtual screening for hit identification at the beginning of drug discovery programs. This is followed by an analysis of how the hits derived from virtual screening can be filtered and culled to highly probable candidates to test in in vitro assays. We then illustrate how CADD can be used to optimize the potency of experimentally validated hit compounds from virtual screening for use in positron emission tomography (PET). Finally, we conclude with a survey of the newest techniques in CADD employing machine learning (ML).
... Although identical structures in biological systems do not always act the same, computational techniques for drug repositioning can make use of the degrees of resemblance that exist. Chemical similarity techniques work by extracting a set of chemical properties for each drug in a group of medications, then clustering or creating networks based on the extracted features to relate the drugs directly to one another [25]. Simple chemical associations, or a search for specific biological traits (such as known drug targets) enriched in the resulting correlations, can subsequently be used to infer therapeutic repositioning prospects. ...
Chapter
Full-text available
Repurposing “old” drugs to treat both common and rare diseases is increasingly emerging as an attractive proposition due to the use of de-risked compounds, with potential for lower overall development costs and shorter development timelines. This is due to the high attrition rates, significant costs, and slow pace of new drug discovery and development. Drug repurposing is the process of finding new, more efficient uses for already-available medications. Numerous computational drug repurposing techniques exist; the three main types of computational drug-repositioning methods used for COVID-19 are network-based models, structure-based methods and artificial intelligence (AI) methods, which are used to discover novel drug–target relationships useful for new therapies. To assess how a chemical molecule can interact with its biological counterpart and to find new uses for medicines already on the market, structure-based techniques make it possible to identify small chemical compounds capable of binding macromolecular targets. In this chapter, we explain strategies for drug repurposing, discuss difficulties encountered by the repurposing community, and present drugs reported through drug repurposing. Moreover, we cover metabolic and drug discovery network resources and tools for network construction, analysis and protein–protein interaction analysis that enable drug repurposing to reach its full potential.
... Therefore, compounds with structural features similar to active inhibitors are expected to have similar inhibitory properties. [35] Despite deviations from the molecular similarity principle, similarity measures are extensively employed in drug discovery strategies. [36] Numerous studies have identified novel inhibitors of Mtb using ligand-based and receptor-based virtual screening strategies. ...
Chapter
The acid fast bacterium Mycobacterium tuberculosis (Mtb) is the causative agent of tuberculosis and is a serious global threat. The concern is further heightened by the emergence of multidrug and extensively drug-resistant strains of Mtb, which necessitates the discovery of novel inhibitors. Mtb DNA gyrase is a potential therapeutic target for antitubercular drug discovery and has been successfully targeted in numerous studies. The gyrase heterotetramer comprises two GyrA and two GyrB subunits, and catalyzes topological transitions in DNA. However, mutations in the genes encoding GyrA and GyrB are primarily responsible for resistance to fluoroquinolones, including moxifloxacin and ciprofloxacin, which necessitates the discovery of novel antagonists of Mtb gyrase. Virtual screening and molecular docking strategies are extensively employed to reduce the cost and time of drug discovery. This chapter discusses various applications of ligand-based and receptor-based virtual screening strategies, including consensus docking and pharmacophore-based screening approaches, for the identification of potent inhibitors of Mtb DNA gyrase. The case studies discussed in this chapter highlight the judicious application of virtual screening strategies in the discovery of potent antagonists of Mtb.
... Substructure information extracted from molecules has been widely used in molecular generation, property prediction, and virtual screening (Willett et al., 1998; Brown & Martin, 1996; Eckert & Bajorath, 2007; Jin et al., 2018). ECFPs (Rogers & Hahn, 2010) encode existing substructures within a circular distance from each atom in a molecule. ...
Preprint
Designing a neural network architecture for molecular representation is crucial for AI-driven drug discovery and molecule design. In this work, we propose a new framework for molecular representation learning. Our contribution is threefold: (a) demonstrating the usefulness of incorporating substructures into node-wise features from molecules, (b) designing a two-branch network consisting of a transformer and a graph neural network that are fused with asymmetric attention, and (c) not requiring heuristic features or computationally expensive information from molecules. Using 1.8 million molecules collected from the ChEMBL and PubChem databases, we pretrain our network to learn a general representation of molecules with minimal supervision. The experimental results show that our pretrained network achieves competitive performance on 11 downstream tasks for molecular property prediction.
... The structural similarity is the most frequent approach and has been applied historically, in a manual fashion by experts, although today, read-across can be performed in an automated manner using specific open access or commercial software. Overall, structural similarity is measured using tools of different complexity, such as structural keys, fingerprints, molecular descriptors and quantum similarity [3][4][5][6]. ...
Article
Full-text available
Read-across applies the principle of similarity to identify the most similar substances to represent a given target substance in data-poor situations. However, differences between the target and the source substances exist. The present study aims to screen and assess the effect of the key components in a molecule which may escape the evaluation for read-across based only on the most similar substance(s) using a new open-access software: Virtual Extensive Read-Across (VERA). VERA provides a means to assess similarity between chemicals using structural alerts specific to the property, pre-defined molecular groups and structural similarity. The software finds the most similar compounds with a certain feature, e.g., structural alerts and molecular groups, and provides clusters of similar substances while comparing these similar substances within different clusters. Carcinogenicity is a complex endpoint with several mechanisms, requiring resource intensive experimental bioassays and a large number of animals; as such, the use of read-across as part of new approach methodologies would support carcinogenicity assessment. To test the VERA software, carcinogenicity was selected as the endpoint of interest for a range of botanicals. VERA correctly labelled 70% of the botanicals, indicating the most similar substances and the main features associated with carcinogenicity.
... In cheminformatics, quantifying the similarity of two molecules is a crucial concept and a common task [41]. Its applications span a variety of domains, the majority of which are connected to medicinal chemistry, such as virtual screening [42]. HCA in the current study revealed MSS between MMV396693 and CF. ...
Article
Full-text available
Background: An innovative approach has been introduced for identifying and developing novel potent and safe anti-Babesia and anti-Theileria agents for the control of animal piroplasmosis. In the present study, we evaluated the inhibitory effects of Malaria Box (MBox) compounds (n = 8) against the growth of Babesia microti in mice and conducted bioinformatics analysis between the selected hits and the currently used antibabesial drugs, with far-reaching implications for potent combinations. Methods: A fluorescence assay was used to evaluate the in vivo inhibitory effects of the selected compounds. Bioinformatics analysis was conducted using hierarchical clustering, distance matrix and molecular weight correlation, and PubChem fingerprint. The compounds with in vivo potential efficacy were selected to search for their target in the piroplasm parasites using quantitative PCR (qPCR). Results: Screening the MBox against the in vivo growth of the B. microti parasite enabled the discovery of potent new antipiroplasm drugs, including MMV396693 and MMV665875. Interestingly, statistically significant (P < 0.05) downregulation of cysteine protease mRNA levels was observed in MMV665875-treated Theileria equi in vitro culture in comparison with untreated cultures. MMV396693/clofazimine and MMV665875/atovaquone (AV) showed maximum structural similarity (MSS) with each other. The distance matrix results indicate promising antibabesial efficacy of combination therapies consisting of either MMV665875 and AV or MMV396693 and imidocarb dipropionate (ID). Conclusions: Inhibitory and hematology assay results suggest that MMV396693 and MMV665875 are potent antipiroplasm monotherapies. The structural similarity results indicate that MMV665875 and MMV396693 have a similar mode of action as AV and ID, respectively. Our findings demonstrated that MBox compounds provide a promising lead for the development of new antibabesial therapeutic alternatives.
... Many computational methods have been applied to design and generate multiple molecular structures by chemical group replacement, yielding new compounds that are similar in skeleton to the active drug but have better activities (Aysha et al., 2014; Brown & Jacoby, 2006). These methods include virtual screening, de novo molecular design, topology similarity, pharmacophore search and shape similarity search (Eckert & Bajorath, 2007; Khedkar et al., 2007; Klebe, 2006). In particular, the main aim of chemical group replacement is to prevent the many biological, chemical, or even intellectual side effects associated with the scaffold of currently used drugs (Abd Razik et al., 2020; Mehdy et al., 2018). ...
Article
Full-text available
The pharmacotherapy of pain is an active area of investigation aimed at treatments free of side effects. This paper presents the docking of twenty-five analogues of 4-Chromanone derivatives inside the crystal structure of the μ opioid receptor to estimate the binding affinity of each derivative. A molecular modelling design approach was applied to identify the effective substitution position, generating 989 novel 4-Chromanone derivatives. The twenty most active novel 4-Chromanone derivatives, with docking affinities ranging from -9.89 to -9.34 kcal/mol, were selected as promising hit ligands, compared with a docking affinity of -6.02 kcal/mol for morphine.
... As a more promising CADD method, machine learning (ML) algorithms, such as neural networks and the transformer, are developing suitable models to predict target protein structure and discover potential compounds. Other CADD methods, including molecular docking, QSAR, and pharmacophore modeling, also benefit from machine learning algorithms [142,143]. In molecular docking, ML is used for scoring functions that translate protein-ligand interactions into descriptors. ...
Article
Full-text available
The Rat Sarcoma (RAS) family (NRAS, HRAS, and KRAS) is endowed with GTPase activity to regulate various signaling pathways in ubiquitous animal cells. As proto-oncogenes, RAS mutations can maintain activation, leading to the growth and proliferation of abnormal cells and the development of a variety of human cancers. For the fight against tumors, the discovery of RAS-targeted drugs is of high significance. On the one hand, the structural properties of the RAS protein make it difficult to find inhibitors specifically targeted to it. On the other hand, targeting other molecules in the RAS signaling pathway often leads to severe tissue toxicities due to the lack of disease specificity. However, computer-aided drug design (CADD) can help solve the above problems. As an interdisciplinary approach that combines computational biology with medicinal chemistry, CADD has brought a variety of advances and numerous benefits to drug design, such as the rapid identification of new targets and discovery of new drugs. Based on an overview of RAS features and the history of inhibitor discovery, this review provides insight into the application of mainstream CADD methods to RAS drug design.
... Virtual screening (VS) is a common starting point in many drug discovery projects and is performed using a variety of computational techniques such as docking, [1][2][3] pharmacophore modeling, [4][5][6] similarity and substructure searches, [7][8][9][10][11][12][13] and QSAR equations. [14][15][16] These techniques are typically classified as either structure-based (i. ...
Article
Full-text available
Docking‐based virtual screening (VS) is a common starting point in many drug discovery projects. The advantage of docking lies in its ability to provide reliable ligand binding modes and approximated binding free energies, two factors that are important for hit selection and optimization. We present a method for the development of target‐specific scoring functions using our recently reported Enrichment‐Optimization Algorithm (EOA). EOA derives QSAR models in the form of MLR equations by optimizing an enrichment‐like metric. Since EOA requires target‐specific active and inactive/decoy compounds, we retrieved such data for six targets from the DUD‐E database, and used them to re‐derive the weights associated with the components that make up the ChemPLP scoring function, yielding target‐specific, modified functions. We then used the original ChemPLP function in small‐scale VS experiments on the six targets and subsequently rescored the resulting poses with the modified functions. We also used the modified functions for re‐docking. In many cases, rescoring or re‐docking yielded better results in terms of AUC and EF1%. While work on additional datasets and docking tools is clearly required, we propose that the results obtained thus far hint at the potential benefits of using EOA‐based optimization for the derivation of target‐specific functions in the context of virtual screening.
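The abstract reports results as EF1% without defining it. The sketch below uses the commonly quoted definition of the enrichment factor (hit rate among the top-ranked fraction divided by the hit rate in the whole list); whether the authors computed it in exactly this way is an assumption. The function name `enrichment_factor`, the "higher score is better" convention, and the toy data are hypothetical.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top-scored fraction divided by the overall hit rate.

    Assumes higher scores are better; docking scores where lower is better
    would need to be negated before calling this function.
    """
    n = len(scores)
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("no actives in the data set")
    return (top_hits / n_top) / (total_hits / n)


# Toy screen (hypothetical): 200 compounds, 10 actives, 2 of them ranked first.
active_idx = {0, 1, 50, 60, 70, 80, 90, 100, 110, 120}
scores = [200 - i for i in range(200)]          # compound 0 has the best score
labels = [1 if i in active_idx else 0 for i in range(200)]

print(enrichment_factor(scores, labels, fraction=0.01))
```

In this toy case the top 1% contains two compounds, both active, against an overall active rate of 5%, so the ranking enriches actives twentyfold at that cutoff.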
... Results may vary depending on further issues, such as the chosen similarity criteria, the kind of target, and the class of compounds. [46,47] In cases where multiple active molecules are known, the typical strategy is to apply pharmacophore techniques. [48,49] These methods, however, typically involve the generation of 3D structures as well as conceivable conformers of all considered actives and, crucially for feasibility, of all molecules in the database being searched. ...
Article
In conventional fingerprint methods, the similarity between two molecules is calculated using the Tanimoto index as a numerical criterion. Thus, the query molecules in virtual screening should be most representative of the wanted compound class at hand. In the concept introduced here, all available active molecules form a multimolecule fingerprint in which the appearing features are weighted according to their respective frequency. The features of inactive molecules are treated likewise and the resulting values are subtracted from those of the active ones. The obtained differential multimolecule fingerprint (DMMFP) is thus specific for the respective class of compounds. To account for the noninteger representation within this fingerprint, a modified Sørensen-Dice coefficient is used to compute the similarity. Potentially active molecules yield positive scores, whereas presumably inactive ones are denoted by negative values. The concept was applied to Angiotensin-converting enzyme (ACE) inhibitors, β2-adrenoceptor ligands, leukotriene A4 hydrolase inhibitors, dopamine D3 antagonists, and cytochrome CYP2C9 substrates, for which experimental binding affinities are known and was tested against decoys from DUD-E and a further background database consisting of molecules from the dark chemical matter, which comprises compounds that appear as frequent hitters across multiple assays. Using the 166 publicly available keys of the MACCS fingerprint and the larger PubChem fingerprint, actives were recovered with very high sensitivity. Furthermore, three marketed ACE inhibitors as well as the carbonic anhydrase II inhibitor dorzolamide were detected in the dark chemical matter data set. For comparison, the DMMFP was also used with a Bayesian classifier, for which the specificity (correctly classified inactives) and likewise the accuracy was superior. 
Conversely, the similarity score produced by the Sørensen-Dice coefficient showed its potential for the early recognition of (potentially) active molecules.
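The construction described in this abstract can be sketched as follows. The differential weights follow the text (feature frequency among actives minus feature frequency among inactives). The abstract does not give the modified Sørensen-Dice formula, so the signed Dice-like score below, 2·Σ(w·x) / (Σ|w| + Σx), is purely an assumption made for illustration, as are the function names and the toy fingerprints.

```python
def differential_fingerprint(actives, inactives):
    """Per-bit frequency among actives minus per-bit frequency among inactives."""
    n_bits = len(actives[0])
    freq_act = [sum(fp[i] for fp in actives) / len(actives) for i in range(n_bits)]
    freq_inact = [sum(fp[i] for fp in inactives) / len(inactives) for i in range(n_bits)]
    return [a - b for a, b in zip(freq_act, freq_inact)]


def dice_like_score(weights, query):
    """Signed Dice-like score (assumed form, not taken from the paper)."""
    overlap = sum(w * x for w, x in zip(weights, query))
    norm = sum(abs(w) for w in weights) + sum(query)
    return 0.0 if norm == 0 else 2.0 * overlap / norm


# Toy binary fingerprints (hypothetical), four bits each.
actives = [[1, 1, 0, 0], [1, 1, 1, 0]]
inactives = [[0, 0, 1, 1], [0, 1, 1, 1]]

weights = differential_fingerprint(actives, inactives)
print(weights)
print(dice_like_score(weights, [1, 1, 0, 0]))   # active-like query
print(dice_like_score(weights, [0, 0, 1, 1]))   # inactive-like query
```

With this assumed form, a query dominated by features typical of actives receives a positive score and one dominated by features typical of inactives receives a negative score, which matches the sign behaviour described in the abstract.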
... Typical methods include: a) docking of various collections of available or synthesizable compounds against the known or predicted protein pocket(s) of interest; [17] b) family-based selection of compounds based on similar protein sequence and/or fold, [18][19] and c) selection of compounds based on 2D or 3D similarity searches to known ligands. [20][21] In the absence of a suitable molecule to be used as ¹⁹F reporter, the ¹⁹F NMR direct FAXS method can also be used to screen additional target-specific and hypothesis-driven fluorinated focused sets (fragments and/or larger molecules) by assembling them in mixtures based on predicted ¹⁹F NMR chemical shifts, to reduce the likelihood of signal overlap and to allow rapid on-the-fly deconvolution of the active mixtures. ...
Article
Full-text available
Ligand‐based ¹⁹F NMR screening is a highly effective and well‐established hit‐finding approach. The high sensitivity to protein binding makes it particularly suitable for fragment screening. Different criteria can be considered for generating fluorinated fragment libraries. One common strategy is to assemble a large, diverse, well‐designed and characterized fragment library which is screened in mixtures, generated based on experimental ¹⁹F NMR chemical shifts. Here, we introduce a complementary knowledge‐based ¹⁹F NMR screening approach, named ¹⁹Focused screening, enabling the efficient screening of putative active molecules selected by computational hit finding methodologies, in mixtures assembled and on‐the‐fly deconvoluted based on predicted ¹⁹F NMR chemical shifts. In this study, we developed a novel approach, named LEFshift, for ¹⁹F NMR chemical shift prediction using rooted topological fluorine torsion fingerprints in combination with a random forest machine learning method. A demonstration of this approach to a real test case is reported.
... A general docking approach and molecular dynamics simulation approaches were used as supplementary validation methods to investigate the potential of the predicted compounds against the prioritized targets. Both structurally close and diverse analogs can be recognized using similarity search based on the applied metrics (Eckert and Bajorath 2007). This is extremely important in the analysis of hits; a schematic representation of similarity search is shown in Fig. 10. ...
Article
Full-text available
A few decades ago, drug discovery and development were limited to a handful of medicinal chemists working in a lab with an enormous amount of testing, validation, and synthetic procedures, all contributing to considerable investments of time and wealth to get one drug into the clinic. Advances in computational techniques, combined with a boom in multi-omics data, led to the development of various bioinformatics/pharmacoinformatics/cheminformatics tools that have helped speed up the drug development process. With the advent of artificial intelligence (AI), machine learning (ML) and deep learning (DL), the conventional drug discovery process has been further rationalized. Extensive biological data, present as big data in various databases across the globe, act as the raw material for ML/DL-based approaches and help in the accurate identification of patterns and models that can be used to identify therapeutically active molecules with much smaller investments of time, workforce and wealth. In this review, we begin by introducing the general concepts in the drug discovery pipeline, followed by an outline of the fields in the drug discovery process where ML/DL can be utilized. We also introduce ML and DL along with their applications, various learning methods, and training models used to develop ML/DL-based algorithms. Furthermore, we summarize various DL-based tools existing in the public domain with their application in the drug discovery paradigm, including DL tools for identification of drug targets and drug-target interactions such as DeepCPI, DeepDTA, WideDTA, PADME, DeepAffinity, and DeepPocket.
Additionally, we have discussed various DL-based models used in protein structure prediction, de novo design of new chemical scaffolds, virtual screening of chemical libraries for hit identification, absorption, distribution, metabolism, excretion, and toxicity (ADMET) prediction, metabolite prediction, clinical trial design, and oral bioavailability prediction. In the end, we have tried to shed light on some of the successful ML/DL-based models used in the drug discovery and development pipeline while also discussing the current challenges and prospects of the application of DL tools in drug discovery and development. We believe that this review will be useful for medicinal and computational chemists searching for DL tools for use in their drug discovery projects.
... Besides, structural similarity between drug molecules can also be exploited to infer the BBB permeability of drugs. Structurally similar drug molecules are likely to show similar physicochemical properties and may bind to the same proteins (Eckert and Bajorath, 2007;Martin et al., 2002;Muegge and Mukherjee, 2016). Thus, the drug-drug similarity is closely related to the factors affecting BBB penetration. ...
Article
Motivation: Evaluating the blood-brain barrier (BBB) permeability of drug molecules is a critical step in brain drug development. Traditional methods for the evaluation require complicated in vitro or in vivo testing. Alternatively, in silico predictions based on machine learning have proved to be a cost-efficient way to complement the in vitro and in vivo methods. However, the performance of the established models has been limited by their incapability of dealing with the interactions between drugs and proteins, which play an important role in the mechanism behind the BBB penetrating behaviors. To address this limitation, we employed the relational graph convolutional network (RGCN) to handle the drug-protein interactions as well as the properties of each individual drug. Results: The RGCN model achieved an overall accuracy of 0.872, an AUROC of 0.919 and an AUPRC of 0.838 for the testing dataset with the drug-protein interactions and the Mordred descriptors as the input. Introducing drug-drug similarity to connect structurally similar drugs in the data graph further improved the testing results, giving an overall accuracy of 0.876, an AUROC of 0.926 and an AUPRC of 0.865. In particular, the RGCN model was found to greatly outperform the LightGBM base model when evaluated with the drugs whose BBB penetration was dependent on drug-protein interactions. Our model is expected to provide high-confidence predictions of BBB permeability for drug prioritization in the experimental screening of BBB-penetrating drugs. Availability: The data and the codes are freely available at https://github.com/dingyan20/BBB-Penetration-Prediction. Supplementary information: Supplementary data are available at Bioinformatics online.
... In some common examples, their binary implementations are used to quantify the similarity of binary molecular fingerprints (with the Tanimoto coefficient unquestionably being the most popular one) [3], while their continuous implementations constitute the basics of clustering algorithms [4]. The applications of molecular similarity (as expressed by pairwise similarity calculations between binary fingerprints) in ligand-based virtual screening were thoroughly explored by the groups of Jürgen Bajorath [5,6], Peter Willett [7,8], and many others, with a large body of works from the latter group dedicated to data fusion practices [9,10]. Binary similarity measures from many sub-fields were collected by Todeschini and colleagues [11], and further analyzed by our group to select ideal candidates for specific applications in metabolomics [12] and molecular design [13] studies. ...
Article
Full-text available
Extended (or n-ary) similarity indices have been recently proposed to extend the comparative analysis of binary strings. Going beyond the traditional notion of pairwise comparisons, these novel indices allow comparing any number of objects at the same time. This results in a remarkable efficiency gain with respect to other approaches, since now we can compare N molecules in O(N) instead of the common quadratic O(N²) timescale. This favorable scaling has motivated the application of these indices to diversity selection, clustering, phylogenetic analysis, chemical space visualization, and post-processing of molecular dynamics simulations. However, the current formulation of the n-ary indices is limited to vectors with binary or categorical inputs. Here, we present the further generalization of this formalism so it can be applied to numerical data, i.e. to vectors with continuous components. We discuss several ways to achieve this extension and present their analytical properties. As a practical example, we apply this formalism to the problem of feature selection in QSAR and prove that the extended continuous similarity indices provide a convenient way to discern between several sets of descriptors.
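The pairwise binary similarity measures named in this entry (the Tanimoto coefficient in the citing passage, and the Sørensen-Dice coefficient mentioned earlier in this listing) can be illustrated with a minimal sketch. It assumes each fingerprint is stored as a Python set of on-bit indices; the function names, the toy fingerprints, and the convention of returning 0.0 for two empty fingerprints are illustrative assumptions, not taken from any of the cited works.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    # Assumed convention: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


def dice(a: set, b: set) -> float:
    """Sørensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical fingerprints given as sets of on-bit positions.
fp_query = {1, 4, 7, 9}
fp_candidate = {1, 4, 8, 9, 12}

print(tanimoto(fp_query, fp_candidate))  # 3 shared bits / 6 bits in the union = 0.5
print(dice(fp_query, fp_candidate))      # 2*3 / (4 + 5), about 0.667
```

Both functions compare exactly two fingerprints at a time, which is the pairwise setting that the extended (n-ary) indices described in the abstract aim to go beyond.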
... From the results of screening using the Abbreviated Mental Test (AMT), it was found that most of the elderly, 65 people (76.4%), had good cognitive ability, while 11.7% had moderate cognitive impairment and 5.8% had severe cognitive impairment. Assessment using the Depression Anxiety Stress Scale (DASS) found that 15 people complained of depression (17.5%), which was later confirmed by the Geriatric Depression Scale (GDS) screening (Grimes & Schulz, 2002;Eckert & Bajorath, 2007). The results were not much different, i.e., 70 people with mild depression (82.4%), 13 people with moderate depression (15.3%), and 2 people with severe depression (2.3%). ...
Article
Full-text available
Mental disorders in the elderly were detected using the Abbreviated Mental Test (AMT) screening questionnaire, the Depression Anxiety Stress Scale (DASS), the Geriatric Depression Scale (GDS), the Pittsburgh Sleep Quality Index (PSQI) and the Visual Analog Scale (VAS), and/or structured interviews. The elderly who had complete screening and interview data were included in the study sample, i.e., 85 people. Of the 85 elderly participants, 65 (76.4%) had good cognitive ability, 10 (11.7%) had moderate cognitive impairment, and 5 (5.8%) had severe cognitive impairment, but their daily activities were still good. The GDS screening showed 70 people with mild depression (82.4%), 13 people with moderate depression (15.3%) and 2 people with severe depression (2.3%). The DASS screening showed 15 elderly people with depression (17.5%), 55 people with anxiety (65%) and 15 people experiencing stress (17.5%). Screening of sleep quality with the PSQI showed 60 people with disrupted sleep quality (70.5%) and 15 people with good sleep quality (17.6%). A total of 68 elderly people (80%) complained of mild pain and 17 people (20%) of moderate pain on VAS screening, with the location of the pain varying across the body and leg areas.
... Quantifying the similarity of two molecules is a key concept and a routine task in cheminformatics [18]. Its applications encompass a number of fields, mostly medicinal chemistry-related, such as virtual screening [19]. In the current study, HCA analysis revealed that MMV019881 and MMV007285 have MSS with benzene compound difference. ...
Article
Here, we evaluated the inhibitory effects on the growth of B. microti in mice of Medicines for Malaria Venture (MMV) Malaria Box compounds that had exhibited potent in vitro anti-bovine Babesia efficacy, and conducted follow-up investigations of the structural similarity between the identified potent MMV compounds and the commonly used antibabesial drugs using atom pair fingerprints (APfp). Screening the Malaria Box against the in vivo growth of the B. microti parasite helped with the discovery of new, effective anti-bovine Babesia drugs, including MMV667488, MMV007285, and MMV019881. Of note, MMV019881 exhibited the highest anti-B. microti efficacy in vivo among the screened MMV compounds. The APfp results revealed that the maximum structural similarity (MSS) was observed between MMV007285, diminazene aceturate, and imidocarb dipropionate (ID). In the same way, clofazimine (CF) and MMV667488 showed the MSS with each other. The distance matrix and molecular weight correlation findings highlight the possible antibabesial efficacy of MMV667488, ID, and CF when administered as a combination therapy. In conclusion, a new potent antibabesial drug, MMV019881, was identified in the current study. CF and MMV667488 showed the MSS with each other based on the hierarchical clustering analysis (HCA), and this relation is confirmed by the distance matrix and molecular weight correlation findings. Such combination therapy might have potential as a novel regimen for treating animal or human babesiosis.
... The molecular similarity principle states that molecules with similar structure tend to have similar properties. Indeed, the observation that common substructural fragments lead to similar biological activities can be quantified from database analysis [5], [6]. The secondary metabolites studied are shown in Figure 1. ...
Article
Full-text available
We report the bioavailability analysis for six secondary metabolites from Curatella amaricana L. with reported antiviral, antioxidant and antitumor activity. Additionally, the molecular similarity analysis of each metabolite is presented and compared with Lopinavir, Ritonavir, Darunavir, Cobicistat and Nelfinavir, which are currently in the third phase for the production of a new vaccine for SARS-CoV-2. The mode of interaction through molecular docking between each structure and the zone of action of the 3-chymotrypsin-like protease (3CLpro) is also presented. The molecular geometries of the structures were optimized at the semiempirical PM6 level. The bioavailability and molecular docking calculations were performed using the algorithms incorporated in chemoinformatic servers and AutoDock Vina. The results show that the structures studied have moderate permeability through the cell membrane and comply with Lipinski's "rule of 5". Molecular similarity was evaluated by averaging geometric parameters (3D-Shape) and electrostatic potential (ESP). The results show that most of the secondary metabolites would have a mode of action similar to that of Lopinavir, with average similarity between 0.65 and 0.73. This last idea is reinforced by the molecular docking results for the 3CLpro active site, which highlight the interaction of the molecules studied with the amino acid residues His-41, Phe-140, Gly-143, Ser-144, Cys-145, His-163, Glu-166 and His-172, with interaction free energies ranging between −7.2 kcal/mol and −9.2 kcal/mol, and with Quercetin 3-O-Alpha-L-rhamnoside showing a better affinity energy than Lopinavir.
... The presence of biological activity toward the cardiovascular system in substances structurally similar to biogenic polyamines can be explained on the basis of the principle of structural similarity [83]. ...
Article
Full-text available
Objectives. Biogenic polyamines are widely present in nature. They are characteristic of both protozoan cells and multicellular organisms. These compounds have a wide range of biological functions and are necessary for normal growth and development of cells. Violation of polyamine homeostasis can cause significant abnormalities in cell functioning, provoking various pathological processes, including oncological and neuropsychiatric diseases. The impact on the “polyamine pathway” is an attractive basis for the creation of many pharmacological agents with a diverse spectrum of action. The purpose of this review is to summarize the results of the studies devoted to understanding the biological activity of compounds of the polyamine series, comparing their biological action with action on certain molecular targets. Due to the structural diversity of this group of substances, it is impossible to fully reflect the currently available data in one review. Therefore, in this work, the main attention is paid to the derivatives, acyclic saturated polyamines. Results. The following aspects are considered: biological functionality, biosynthesis and catabolism, cell transport, and localization of biogenic polyamines in the living systems. Structural analogs and derivatives of biogenic polyamines with antitumor, neuroprotective, antiarrhythmic, antiparasitic, antibacterial, and other biological activities are represented; the relationship between biological activity and the target of exposure is reflected. It was found that the nature of the substituent, the number of cationic centers, and the length of the polyamine chain have a great influence on the nature of the effect. Conclusions. At present, the use of polyamine structures is restrained by cytotoxicity and nonspecific toxic effects on the central nervous system. 
Further research in the field of biochemistry, cell transport, and a deeper understanding of receptor interaction mechanisms will help make polyamines the basis for potential drug formulation.
... It is then at least reasonable to enquire as to what the "normal" substrates of these molecules are that happen also to allow them to transport drugs. The principle of molecular similarity (e.g., [201, 554–564]) suggests that molecules that have similar structures should tend to have similar activities, so the question then becomes "to which molecules are marketed drugs most similar"? This is a cheminformatics question, and the answer depends in part on the nature of the structural encoding, although most encodings of "actually" similar molecules show a Tanimoto similarity exceeding 0.8 or so, a number that may be used as a kind of benchmark. ...
Article
Full-text available
Over the years, my colleagues and I have come to realise that the likelihood of pharmaceutical drugs being able to diffuse through whatever unhindered phospholipid bilayer may exist in intact biological membranes in vivo is vanishingly low. This is because (i) most real biomembranes are mostly protein, not lipid, (ii) unlike purely lipid bilayers that can form transient aqueous channels, the high concentrations of proteins serve to stop such activity, (iii) natural evolution long ago selected against transport methods that just let any undesirable products enter a cell, (iv) transporters have now been identified for all kinds of molecules (even water) that were once thought not to require them, (v) many experiments show a massive variation in the uptake of drugs between different cells, tissues, and organisms, that cannot be explained if lipid bilayer transport is significant or if efflux were the only differentiator, and (vi) many experiments that manipulate the expression level of individual transporters as an independent variable demonstrate their role in drug and nutrient uptake (including in cytotoxicity or adverse drug reactions). This makes such transporters valuable both as a means of targeting drugs (not least anti-infectives) to selected cells or tissues and also as drug targets. The same considerations apply to the exploitation of substrate uptake and product efflux transporters in biotechnology. We are also beginning to recognise that transporters are more promiscuous, and antiporter activity is much more widespread, than had been realised, and that such processes are adaptive (i.e., were selected by natural evolution). The purpose of the present review is to summarise the above, and to rehearse and update readers on recent developments. These developments lead us to retain and indeed to strengthen our contention that for transmembrane pharmaceutical drug transport “phospholipid bilayer transport is negligible”.
... The precise method used to quantify molecular similarity or diversity depends upon the context of the question. Examples where a new set of compounds covering a wider range of chemical space is required are described by Eckert et al. [28]. For example, finding bioactive peptide-like molecules requires exploration of highly structurally diverse data sets [29]. ...
Thesis
Structure-based drug design (SBDD) is an essential component of many drug discovery programs. In SBDD, a large number of potential molecules are virtually screened against the known three-dimensional protein structure. Proper creation of ligand libraries and selection of target binding sites are critical for SBDD. This thesis focuses on knowledge-based approaches to improve SBDD in two aspects, the construction of the ligand libraries and analysis of the target binding sites used. First, this thesis presents ChemTreeMap, a visualization tool to explore structurally diverse molecules and mine for correlation between chemical structure and biological data. The visualization tool is applicable to a wide range of questions involving small molecule/drug binding and exploration and construction of ligand libraries. Experimental data and molecular properties can be interactively visualized with graph properties. With the help of this powerful tool, this thesis reports the findings on discriminating physicochemical properties between allosteric and orthosteric competitive molecules. It is observed that allosteric ligands are more hydrophobic, aromatic, and rigid. The result is useful to guide building new chemical libraries biased towards allosteric regulators. Thirdly, the selection of target binding sites of drug candidates needs to take into account possible interruption due to mutations which occur from non-synonymous single nucleotide polymorphisms (nsSNPs). Disease nsSNPs occur more frequently in a protein core or binding site than on the rest of the protein surface. The result can be used to infer the probability and consequence of nsSNPs in new target binding sites.
... 5,6 According to the SPP, structurally similar molecules will more likely possess similar biological activities and physicochemical properties. Despite limitations to the SPP, 7,8 such as activity cliffs that manifest when a small structural modification drastically alters the biological properties of a compound, 9 this structure−activity relationship is broadly consistent throughout the larger flat regions of activity landscapes. 10,11 Improving the structural similarity-based retrieval of biologically similar compounds will therefore benefit a multitude of drug discovery efforts. ...
Article
Full-text available
A common strategy for identifying molecules likely to possess a desired biological activity is to search large databases of compounds for high structural similarity to a query molecule that demonstrates this activity, under the assumption that structural similarity is predictive of similar biological activity. However, efforts to systematically benchmark the diverse array of available molecular fingerprints and similarity coefficients have been limited by a lack of large-scale datasets that reflect biological similarities of compounds. To elucidate the relative performance of these alternatives, we systematically benchmarked 11 different molecular fingerprint encodings, each combined with 13 different similarity coefficients, using a large set of chemical–genetic interaction data from the yeast Saccharomyces cerevisiae as a systematic proxy for biological activity. We found that the performance of different molecular fingerprints and similarity coefficients varied substantially and that the all-shortest path fingerprints paired with the Braun-Blanquet similarity coefficient provided superior performance that was robust across several compound collections. We further proposed a machine learning pipeline based on support vector machines that offered a fivefold improvement relative to the best unsupervised approach. Our results generally suggest that using high-dimensional chemical–genetic data as a basis for refining molecular fingerprints can be a powerful approach for improving prediction of biological functions from chemical structures.
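The similarity search described in this abstract, scoring each database compound against a query and ranking by the score, can be sketched as follows. The Braun-Blanquet coefficient is written here in its usual form, shared on-bits divided by the larger of the two on-bit counts. The set-based fingerprint encoding, the function names, and the three-compound library are hypothetical illustrations; this is not the benchmarking pipeline of the cited study.

```python
def braun_blanquet(a: set, b: set) -> float:
    """Braun-Blanquet coefficient: |A ∩ B| / max(|A|, |B|)."""
    larger = max(len(a), len(b))
    # Assumed convention: two empty fingerprints score 0.0.
    return len(a & b) / larger if larger else 0.0


def rank_by_similarity(query: set, library: dict, coefficient) -> list:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, coefficient(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical query and library, each fingerprint a set of on-bit positions.
query_fp = {1, 4, 7, 9}
toy_library = {
    "cpd_a": {1, 4, 8, 9, 12},
    "cpd_b": {1, 4, 7, 9},
    "cpd_c": {2, 3},
}

ranking = rank_by_similarity(query_fp, toy_library, braun_blanquet)
for name, score in ranking:
    print(name, round(score, 3))
```

Passing the coefficient as a parameter keeps the ranking step independent of the similarity measure, which is the kind of separation a fingerprint-by-coefficient comparison relies on.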
... Besides, structural similarity between drug molecules can also be exploited to infer the BBB permeability of drugs. Structurally similar drug molecules are likely to show similar physicochemical properties and may bind to the same proteins (Muegge and Mukherjee, 2016;Eckert and Bajorath, 2007;Martin et al., 2002). Thus, the drug-drug similarity is closely related to the factors affecting BBB penetration. ...
Preprint
Full-text available
The evaluation of the BBB penetrating ability of drug molecules is a critical step in brain drug development. Computational prediction based on machine learning has proved to be an efficient way to conduct the evaluation. However, performance of the established models has been limited by their incapability of dealing with the interactions between drugs and proteins, which play an important role in the mechanism behind BBB penetrating behaviors. To address this issue, we employed the relational graph convolutional network (RGCN) to handle the drug-protein (denoted by the encoding gene) relations as well as the features of each individual drug. In addition, drug-drug similarity was also introduced to connect structurally similar drugs in the graph. The RGCN model was initially trained without input of any drug features. And the performance was already promising, demonstrating the significant role of the drug-protein/drug-drug relations in the prediction of BBB permeability. Moreover, molecular embeddings from a pre-trained knowledge graph were used as the drug features to further enhance the predictive ability of the model. Finally, the best performing RGCN model was built with a large number of unlabeled drugs integrated into the graph.
... Accordingly, it only provides qualitative information. Notably, it is well known that even a minor modification to a particular structure can lead to dramatic changes in the activity [158,166,167]. ...
Article
In the drug discovery setting, undesirable ADMET properties of a pharmacophore with good predictive power obtained after a tedious drug discovery and development process may lead to late-stage attrition. Early-stage ADMET profiling has introduced a new dimension to lead development. Although several high-throughput in vitro models are available for ADMET profiling, in silico methods are gaining importance because they offer cheaper and faster predictions without requiring tedious and expensive laboratory resources. Nonetheless, in silico ADMET tools alone are not accurate and are therefore ideally used alongside in vitro and/or in vivo methods to enhance predictive power. This review summarizes the significance and challenges associated with the application of in silico tools, as well as the possible scope of in vitro models for integration to improve the ADMET predictive power of these tools.
Article
With over 10,000 new reaction protocols arising every year, only a handful of these procedures transition from academia to application. A major reason for this gap stems from the lack of comprehensive knowledge about a reaction’s scope, i.e., to which substrates the protocol can or cannot be applied. Even though chemists invest substantial effort to assess the scope of new protocols, the resulting scope tables involve significant biases, reducing their expressiveness. Herein we report a standardized substrate selection strategy designed to mitigate these biases and evaluate the applicability, as well as the limits, of any chemical reaction. Unsupervised learning is utilized to map the chemical space of industrially relevant molecules. Subsequently, potential substrate candidates are projected onto this universal map, enabling the selection of a structurally diverse set of substrates with optimal relevance and coverage. By testing our methodology on different chemical reactions, we were able to demonstrate its effectiveness in finding general reactivity trends by using a few highly representative examples. The developed methodology empowers chemists to showcase the unbiased applicability of novel methodologies, facilitating their practical applications. We hope that this work will trigger interdisciplinary discussions about biases in synthetic chemistry, leading to improved data quality.
Chapter
The chapter discusses recent advances in computer-aided drug discovery with particular emphasis on the concept and broad applications of chemical space and chemical or molecular multiverses. We also briefly describe progress on selected concepts, methodologies, resources, and applications that are part of multidisciplinary efforts in drug discovery: artificial intelligence, machine learning, virtual screening, and novel extended similarity methods for chemical space exploration. We emphasize public resources and open-source code available to the scientific community working in academia and non-profit institutions.
Article
Full-text available
The nuclear export protein 1 (XPO1) mediates the nucleocytoplasmic transport of proteins and ribonucleic acids (RNAs) and plays a prominent role in maintaining cellular homeostasis. XPO1 has emerged as a promising therapeutic approach to interfere with the lifecycle of many viruses. In our earlier study, we proved the inhibition of XPO1 as a therapeutic strategy for managing SARS-CoV-2 and its variants. In this study, we have utilized pharmacophore-assisted computational methods to identify prominent XPO1 inhibitors. After several layers of screening, a few molecules were shortlisted for further experimental validation on the in vitro SARS-CoV-2 cell infection model. It was observed that these compounds reduced spike positivity, suggesting inhibition of SARS-CoV-2 infection. The outcome of this study could be considered further for developing novel antiviral therapeutic strategies against SARS-CoV-2.
Article
Full-text available
In our quest to discover effective inhibitors against severe acute respiratory syndrome coronavirus 2 helicase, a diverse set of more than 300 naturally occurring antiviral metabolites was investigated. Employing advanced computational techniques, we initiated the selection process by analyzing and comparing the co-crystallized ligand (VXG) of the severe acute respiratory syndrome coronavirus 2 helicase protein (PDB ID: 5RMM) to identify compounds with structurally similar features and potential for comparable binding. Through structural similarity and pharmacophore research, 13 compounds that shared important characteristics with VXG were pinpointed. Subsequently, these candidates were subjected to molecular docking to identify seven compounds that demonstrated favorable energy profiles and accurate binding to the severe acute respiratory syndrome coronavirus 2 helicase. Among these, mycophenolic acid emerged as the most promising candidate. To ensure the safety and viability of the selected compounds, we conducted ADMET tests, which confirmed the favorable characteristics of mycophenolic acid, and the safety of atropine and plumbagin. Building on these results, we performed additional analyses on mycophenolic acid, including various molecular dynamics simulations. These investigations demonstrated that mycophenolic acid exhibited optimal binding to the severe acute respiratory syndrome coronavirus 2 helicase, maintaining flawless dynamics throughout the simulations. Furthermore, the Molecular Mechanics Poisson–Boltzmann Surface Area tests provided strong evidence that mycophenolic acid successfully formed a stable connection with the severe acute respiratory syndrome coronavirus 2 helicase, with a calculated free energy value of −294 kJ mol⁻¹. These encouraging findings provide a solid foundation for further research, including in vitro and in vivo studies, on the three identified compounds. 
The potential efficacy of these compounds as treatment options for coronavirus-19 warrants further exploration and may hold significant promise in the ongoing fight against the pandemic.
Article
Full-text available
Chemical space modelling is of great importance for unveiling and visualising latent information, which is critical in predictive toxicology related to the drug discovery process. While the use of traditional molecular descriptors and fingerprints may suffer from the so-called curse of dimensionality, complex networks are devoid of the typical drawbacks of coordinate-based representations. Herein, we use chemical space networks (CSNs) to analyse the case of developmental toxicity (Dev Tox), which remains a challenging endpoint because of the difficulty of gathering enough reliable data, despite its great importance for the protection of maternal and child health. Our study proved that the Dev Tox CSN has a complex non-random organisation and can thus provide a wealth of meaningful information also for predictive purposes. At a phase transition, chemical similarities highlight well-established toxicophores, such as aryl derivatives, mostly neurotoxic hydantoins, barbiturates and amino alcohols, steroids, and volatile organic compounds ether-like chemicals, which are strongly suspected of the Dev Tox onset and can thus be employed as effective alerts for prioritising chemicals before testing.
Preprint
Full-text available
The quantification of molecular similarity has been present since the beginning of cheminformatics. Although several similarity indices and molecular representations have been reported, all of them ultimately reduce to the calculation of molecular similarities of only two objects at a time. Hence, to get the average similarity of a set of molecules, all the pairwise comparisons need to be computed, so the computational cost scales quadratically with the number of molecules. Here we propose an exact alternative to this problem: iSIM (Instant Similarity). iSIM performs comparisons of multiple molecules at the same time and yields the same value as the average pairwise comparisons of molecules represented with binary fingerprints and real-value descriptors. In this work, we introduce the mathematical framework and several applications of iSIM in chemical sampling, visualization, diversity selection, and clustering.
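The contrast between quadratic pairwise averaging and a single linear pass over the data can be made concrete with a small sketch. It is not the iSIM implementation from this preprint. As an assumption chosen for illustration, it uses the Russell-Rao coefficient (shared on-bits divided by fingerprint length), for which the identity is easy to verify: a bit that is on in k of the N fingerprints is shared by exactly k(k-1)/2 pairs, so per-column sums are enough to recover the average over all pairs. Function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations
from math import comb


def mean_pairwise_russell_rao(fps: list) -> float:
    """Average Russell-Rao similarity over all pairs, O(N^2) comparisons."""
    n_bits = len(fps[0])
    pairs = list(combinations(fps, 2))
    total = sum(sum(x & y for x, y in zip(p, q)) / n_bits for p, q in pairs)
    return total / len(pairs)


def mean_russell_rao_from_columns(fps: list) -> float:
    """Same average from per-bit column sums, one pass over N fingerprints."""
    n, n_bits = len(fps), len(fps[0])
    column_sums = [sum(column) for column in zip(*fps)]
    # A bit set in k fingerprints is shared by comb(k, 2) pairs.
    shared_bits = sum(comb(k, 2) for k in column_sums)
    return shared_bits / (n_bits * comb(n, 2))


# Hypothetical 4-bit fingerprints, one list of 0/1 values per molecule.
toy_fps = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]

quadratic = mean_pairwise_russell_rao(toy_fps)
linear = mean_russell_rao_from_columns(toy_fps)
print(quadratic, linear)
```

For the toy data the column sums are 2, 2, 1 and 3, giving 1 + 1 + 0 + 3 = 5 shared bits over 3 pairs of 4-bit fingerprints, i.e. 5/12, the same value the explicit pairwise loop produces.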
Chapter
Inhibiting anomalous protein–protein interactions has led to the discovery of drugs (small molecules, precisely) that can inhibit such interactions and can prevent a variety of diseases. Computer-Aided Drug Design (CADD) helps in accurately identifying the drug candidates in a rapid and cost-effective manner. This chapter discusses the two branches of CADD—namely structure-based drug design and ligand-based drug design. The former is preferred when high-resolution structures of target proteins are available, while the latter is usually used when structural information is limited. The chapter concludes with discussing the experimental methods that have been employed alongside these computational techniques for drug discovery aspects.
Article
In the present study, the effect of a combination therapy consisting of diminazene aceturate (DA) and imidocarb dipropionate (ID) on the in vitro growth of several parasitic piroplasmids, and on Babesia microti in BALB/c mice, was evaluated using a fluorescence-based SYBR Green I test. We evaluated the structural similarities between the regularly used antibabesial medications, DA and ID, and the recently found antibabesial drugs, pyronaridine tetraphosphate, atovaquone, and clofazimine, using atom pair fingerprints (APfp). The Chou-Talalay approach was used to determine the interactions between the two drugs. A Celltac MEK-6450 computerized hematology analyzer was used to detect hemolytic anemia every 96 hours in mice infected with B. microti and in those treated with either mono- or combination therapy. According to the APfp results, DA and ID have the maximum structural similarity (MSS). DA and ID had synergistic and additive interactions against the in vitro growth of Babesia bigemina and Babesia bovis, respectively. Low dosages of DA (6.25 mg kg⁻¹) and ID (8.5 mg kg⁻¹) in conjunction with each other inhibited B. microti growth by 16.5%, 32%, and 4.5% more than 25 mg kg⁻¹ DA, 6.25 mg kg⁻¹ DA, and 8.5 mg kg⁻¹ ID monotherapies, respectively. In the blood, kidney, heart, and lung tissues of mice treated with DA/ID, the B. microti small subunit rRNA gene was not detected. The obtained findings suggest that DA/ID could be a promising combination therapy for treating bovine babesiosis. Such a combination may also overcome the potential problems of Babesia resistance and host toxicity induced by utilizing full doses of DA and ID.
Article
Purpose: The in vitro inhibitory effect of two fluoroquinolone antibiotics, norfloxacin and ofloxacin, on the growth of several Babesia and Theileria parasites was evaluated in this study, together with a bioinformatic comparison of both drugs with the commonly used antibabesial drug, diminazene aceturate (DA), and the recently identified antibabesial drugs, luteolin and pyronaridine tetraphosphate (PYR). Methods: The antipiroplasm efficacy of the screened fluoroquinolones in vitro and in vivo was assessed using a fluorescence-based SYBR Green I assay. Using atom pair signatures, we investigated the structural similarity between the fluoroquinolones and the antibabesial drugs. Results: Both fluoroquinolones significantly inhibited (P < 0.05) the in vitro growth of Babesia bovis (B. bovis), B. bigemina, B. caballi, and Theileria equi (T. equi) in a dose-dependent manner. The best inhibitory effect for both drugs was observed on the growth of T. equi. Atom pair fingerprint (APfp) results and AP Tanimoto values revealed that both fluoroquinolones, norfloxacin with luteolin, and ofloxacin with PYR, showed the maximum structural similarity (MSS). Two-drug interaction findings confirmed the synergistic interaction between these combination therapies against the in vitro growth of B. bovis and T. equi. Conclusion: This study helped in the discovery of novel potent antibabesial combination therapies consisting of norfloxacin/ofloxacin, norfloxacin/luteolin, and ofloxacin/PYR.
Article
The effect of MMV665941 on the growth of Babesia microti (B. microti) in mice was investigated in this study using a fluorescence-based SYBR Green I test. Using atom pair signatures, we investigated the structural similarity between MMV665941 and the commonly used antibabesial medicines diminazene aceturate (DA), imidocarb dipropionate (ID), or atovaquone (AV). In vitro cultures of Babesia bovis (B. bovis) and Theileria equi (T. equi) were utilized to determine the MMV665941 and AV interaction using combination ratios ranging from 0.75 IC50 MMV665941:0.75 IC50 AV to 0.50 IC50 MMV665941:0.50 IC50 AV. The combinations were prepared according to the IC50 of each drug against the in vitro growth of the tested parasite. Every 96 h, hemolytic anemia in the treated mice was monitored using a Celltac MEK-6450 computerized hematology analyzer. A single dose of 5 mg/kg MMV665941 inhibited B. microti growth from day 4 post-inoculation (p.i.) until day 12 p.i. MMV665941 caused 62.10%, 49.88%, and 74.23% inhibition of parasite growth at days 4, 6 and 8 p.i., respectively. Of note, 5 mg/kg MMV665941 resulted in quick recovery from the hemolytic anemia caused by babesiosis. The atom pair fingerprint (APfp) analysis revealed that MMV665941 and AV showed maximum structural similarity. Of note, high concentrations (0.75 IC50) of MMV665941 and AV caused synergistic inhibition of B. bovis growth. These findings suggest that MMV665941 might be a promising drug for babesiosis treatment, particularly when combined with the commonly used antibabesial drug, AV.
Article
Structural modification of natural products is an effective strategy to discover potent lead compounds with improved medicinal performance. Toosendanin (TSN), a natural limonoid with diverse pharmacological properties, was selected as the starting material for structural modification to obtain more active anti-cancer agents in the current study. A library containing 25 structurally diverse derivatives (including 12 new ones) was constructed on the basis of the structure-guided modification of TSN. A subsequent cytotoxicity assay of this library showed that compounds 14, 18, and 25 had more significant antiproliferative activity than the precursor TSN in the MDA-MB-468 cell model, as did compounds 14, 17−19, 21, and 25 in the HeLa cell model. Among them, the new derivative 29-O-(6-chloronicotinoyl)-toosendanin (25) exhibited the most potent antiproliferative activity (IC50s 0.05−0.06 μM), being more active than TSN (IC50s 0.14−0.24 μM) and even the first-line drug adriamycin (IC50s ∼ 0.07 μM) in both tested cancer cell lines. The SAR study indicated that the hemiacetal group, the 14,15-epoxy ring, 1-OH, 7-OH, 3-OAc, and 12-OAc are the essential active groups and that the 29-OH is the critical modification position of TSN for the enhancement of cytotoxicity. Compound 25, discovered among the TSN-based derivatives, might serve as a lead compound for anti-cancer chemotherapy, which may shed light on the rational design of TSN-based derivatives for obtaining more potent anti-tumor agents.
Chapter
Drug promiscuity refers to multitarget activity of a drug and can elicit two distinct or opposing actions: adverse side effects and improved therapeutic efficacy. On one hand, drug promiscuity is the source and mechanism of off-target effects; on the other hand, it forms the basis for polypharmacology-based drug repurposing, thereby a source of drug rediscovery. These opposing effects reflect two sides of a coin: positive/good/desirable drug promiscuity and negative/bad/undesirable drug promiscuity. In Chap. 13, the “good” side of drug promiscuity for drug repurposing and how the “good” side should be used for drug rediscovery have been discussed. The topic in this chapter focuses on the “bad” side of drug promiscuity, specifically the application of polypharmacology principles to predicting the drug toxicity induced by drug promiscuity, which is a critical issue in drug development.
Keywords: Drug promiscuity; Prediction; Drug toxicity; Drug promiscuity cliff; Big data
Article
Small‐molecule drugs are of significant importance to human health. The use of efficient model‐based de novo drug design methods is an option worth considering for expediting the discovery of drugs with satisfactory properties. In this article, a deep learning model is first developed for the identification of protein‐ligand complexes with high binding affinity, where the Mol2vec descriptor, the convolutional neural network, and the gate augmentation‐based Attention mechanism are used for the model construction. Then, an optimization‐based de novo drug design framework is established by integrating the deep learning model into a Mixed‐Integer NonLinear Programming (MINLP) model for drug candidate design. The optimal solution of the MINLP model is further verified by the physics‐based methods of molecular docking and molecular dynamics simulation. Finally, two case studies involving the design of anticoagulant and antitumor drug candidates are presented to highlight the wide applicability and effectiveness of the MINLP‐based de novo drug design framework.
Article
In the current investigation, the effect of ascorbic acid on the in vitro growth of several piroplasms, including Babesia bovis (B. bovis), B. bigemina, B. caballi, and Theileria equi (T. equi), as well as against B. microti in mice, was assessed. The antipiroplasm efficacy of ascorbic acid in vitro and in vivo was assessed using a fluorescence-based SYBR Green I test. Using atom pair fingerprints (APfp), we investigated the structural similarity between ascorbic acid and the commonly used antibabesial medicines, diminazene aceturate (DA) and imidocarb dipropionate (ID). In vitro cultures of B. bovis and T. equi were utilized to determine the ascorbic acid and DA interaction using the Chou–Talalay method. Ascorbic acid inhibited B. bovis, B. bigemina, T. equi, and B. caballi growth in vitro in a dose-dependent manner. The APfp results revealed that ascorbic acid and DA have a maximum structural similarity (MSS). On a T. equi culture in vitro, ascorbic acid showed a synergistic interaction with DA, with a combination index of 0.28. B. microti growth was decreased by 41% in vivo using ascorbic acid combined with a very low dosage of DA (6.25 mg kg⁻¹). The results imply that ascorbic acid/DA could be a viable combination therapy for the treatment of T. equi and that it could be utilized to overcome the resistance of Babesia parasites to full doses of the regularly used antibabesial medication, DA.
Article
Multi-parameter optimization (MPO) is a major challenge in new chemical entity (NCE) drug discovery. Recently, promising results were reported for deep learning generative models applied to de novo molecular design, but, to our knowledge, until now no report has been made of the value of this new technology for addressing MPO in an actual drug discovery project. In this study, we demonstrate the benefit of applying AI technology in a real drug discovery project. We evaluate the potential of a ligand-based de novo design technology using deep learning generative models to accelerate the discovery of lead compounds meeting 11 different biological activity objectives simultaneously. Using the initial dataset of the project, we built QSAR models for all the 11 objectives, with moderate to high performance (precision between 0.67 and 1.0 on an independent test set). Our DL-based AI de novo design algorithm, combined with the QSAR models, generated 150 virtual compounds predicted as active on all objectives. Eleven were synthesized and tested. The AI-designed compounds met 9.5 objectives on average (i.e., 86% success rate) versus 6.4 (i.e., 58% success rate) for the initial molecules measured on all objectives. One of the AI-designed molecules was active on all 11 measured objectives, and two were active on 10 objectives while being within the error margin of the assay for the last one. The AI algorithm designed compounds with functional groups which, although rare or absent in the initial dataset, turned out to be highly beneficial for the MPO.
Article
Turbo Similarity Searching (TSS) is the simplest and most recent chemical similarity searching (SS) approach, which improves the effectiveness of SS by performing a multi‐target searching. TSS has four important elements, namely structural representation, similarity coefficient, number of nearest neighbours (NNs), and fusion rule, and any changes in these elements could affect the TSS results. A previous study suggested the advantage of using large numbers of reference compounds with small fractions of the database structures to obtain a better recall in group fusion. Therefore, this study aims to investigate the effect of partial ranking on TSS utilising different fusion rules and different numbers of NNs on the ChEMBL database and to evaluate whether these observations hold in TSS. Furthermore, the objective is to observe the effect of the indirect relationship feature of TSS on the partial ranking investigation. The results showed that the effect of using partial ranking on TSS was significant. This study also found that the performance of TSS improved as the database proportions used in the fusion process decreased and by using a small number of NNs. In addition, fusion rules based on reciprocal rank positions (RKP), maximum similarity score (sMAX), and sMNZ were superior to all the other fusion rules.
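The fusion step described in this abstract can be sketched as follows. This is an illustrative reading, not the study's code: it assumes each reference compound yields a dictionary of similarity scores, interprets the reciprocal-rank rule as a sum of 1/rank, and models partial ranking with a hypothetical `top_fraction` parameter that keeps only the top part of each ranked list. All names and scores are made up for the example.

```python
def fuse_max(score_lists):
    """MAX rule: a compound's fused score is its best score over all references."""
    fused = {}
    for scores in score_lists:
        for cid, s in scores.items():
            fused[cid] = max(fused.get(cid, 0.0), s)
    return fused


def fuse_reciprocal_rank(score_lists, top_fraction=1.0):
    """Sum 1/rank over the references; only the top fraction of each list counts."""
    fused = {}
    for scores in score_lists:
        ranked = sorted(scores, key=scores.get, reverse=True)
        keep = max(1, int(len(ranked) * top_fraction))
        for rank, cid in enumerate(ranked[:keep], start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / rank
    return fused


# Hypothetical similarity scores of four database compounds to two references.
ref_a = {"c1": 0.9, "c2": 0.4, "c3": 0.7, "c4": 0.1}
ref_b = {"c1": 0.2, "c2": 0.8, "c3": 0.6, "c4": 0.3}

max_scores = fuse_max([ref_a, ref_b])
rr_full = fuse_reciprocal_rank([ref_a, ref_b])
rr_partial = fuse_reciprocal_rank([ref_a, ref_b], top_fraction=0.5)
```

With `top_fraction=0.5`, compound `c4` never enters the fused list because it is outside the top half of both rankings, which is the effect of partial ranking in this toy setting.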
Article
A deep learning (DL) method for quickly predicting surface charge density profiles (σ-profile) and cavity volumes (VCOSMO) of molecules for the COSMO-SAC model is developed. The molecular fingerprints are derived from the encoder state of a Transformer model pre-trained on the ChEMBL database, which allows transfer learning from large-scale unlabeled data and improves generalization performance by developing better molecular fingerprints for building models with significantly smaller datasets. Employing the pre-trained molecular fingerprints, a convolutional neural network (CNN) model for the σ-profile and VCOSMO prediction is trained and tested on the VT-2005 database. The obtained Transformer-CNN model presents superior performance to the GC-COSMO approach and enables the prediction of σ-profile and VCOSMO of millions of molecules in only a few minutes. Taking advantage of the model, a high-throughput solvent screening framework based on COSMO-SAC is further proposed and exemplified by searching for a sustainable solvent for the deterpenation process of citrus essential oils.
Chapter
Personalized medicine, also known as precision medicine, refers to a medical model of providing the best treatment plan for a patient according to his or her personal genomic information. The research and practice of personalized medicine have become a hot topic in current medical research, and predicting the response of cell lines to specific drugs is one of the core problems. Using computer algorithms to predict the responses of cell lines to drugs based on huge amounts of existing omics information is currently one focus of bioinformatics. A variety of predictive methods have been proposed. The paper introduces the baseline analysis data, surveys some classical prediction methods and models, and details the application of matrix decomposition, heterozygous network and deep learning to drug response prediction. Finally, some existing problems and future development trends and prospects are discussed.
Article
The multitude of potential drug targets emerging from genome sequencing demands new approaches to drug discovery. A chemogenomics strategy, which involves the generation of small-molecule compounds that can be used both as tools to probe biological mechanisms and as leads for drug-property optimization, provides a highly parallel, industrialized solution. Key to the success of this strategy is an integrated suite of chemi-informatics applications that can allow the rapid and directed optimization of chemical compounds with drug-like properties using 'just-in-time' combinatorial chemical synthesis. An effective embodiment of this process requires new computational and data-mining tools that cover all aspects of library generation, compound selection and experimental design, and work effectively on a massive scale.
Article
Molecular Informatics utilises many ideas and concepts to find relationships between molecules. The concept of similarity, where molecules may be grouped according to their biological effects or physicochemical properties has found extensive use in drug discovery. Some areas of particular interest have been in lead discovery and compound optimisation. For example, in designing libraries of compounds for lead generation, one approach is to design sets of compounds "similar" to known active compounds in the hope that alternative molecular structures are found that maintain the properties required while enhancing e.g. patentability, medicinal chemistry opportunities or even in achieving optimised pharmacokinetic profiles. Thus the practical importance of the concept of molecular similarity has grown dramatically in recent years. The predominant users are pharmaceutical companies, employing similarity methods in a wide range of applications e.g. virtual screening, estimation of absorption, distribution, metabolism, excretion and toxicity (ADME/Tox) and prediction of physicochemical properties (solubility, partitioning etc.). In this perspective, we discuss the representation of molecular structure (descriptors), methods of comparing structures and how these relate to measured properties. This leads to the concept of molecular similarity, its various definitions and uses and how these have evolved in recent years. Here, we wish to evaluate and in some cases challenge accepted views and uses of molecular similarity. Molecular similarity, as a paradigm, contains many implicit and explicit assumptions in particular with respect to the prediction of the binding and efficacy of molecules at biological receptors. The fundamental observation is that molecular similarity has a context which both defines and limits its use. The key issues of solvation effects, heterogeneity of binding sites and the fundamental problem of the form of similarity measure to use are addressed.
Article
The judges evaluated the submissions for the McMaster University High-Throughput Data-Mining and Docking Competition based on 3 criteria: identification of active compounds, percent enrichment, and overview of the competition. Using these metrics, 4 of the participating groups found meaningful enrichment, and 3 groups made perceptive comments about the general nature of the competition.
Article
Bridging chemical and biological space is the key to drug discovery and development. Typically, cheminformatics methods operate under the assumption that similar chemicals have similar biological activity. Ideally then, one could predict a drug's biological function(s) given only its chemical structure by similarity searching in libraries of compounds with known activities. In practice, effectively choosing a similarity metric is case dependent. This work compares both 2D and 3D chemical descriptors as tools for predicting the biological targets of ligand probes, on the basis of their similarity to reference molecules in a 46,000 compound, biologically annotated chemical database. Overall, we found that the 2D methods employed here outperform the 3D (88% vs 67% success) in correct target prediction. However, the 3D descriptors proved superior in cases of probes with low structural similarity to other compounds in the database (singletons). Additionally, the 3D method (FEPOPS) shows promise for providing pharmacophoric alignment of the small molecules' chemical features consistent with those seen in experimental ligand/ receptor complexes. These results suggest that querying annotated chemical databases with a systematic combination of both 2D and 3D descriptors will prove more effective than employing single methods.
Article
If properly constructed, high-dimensional (fingerprint) and low-dimensional metrics can provide equally valid presentations of chemistry-space for chemical diversity purposes. High-dimensional metrics offer the advantage of providing substantial detail regarding the topological aspects of molecular substructure but suffer the disadvantage that they can be used only for distance-based algorithms for addressing the various diversity-related tasks encountered in pharmaceutical and agrochemical industry. Low-dimensional metrics offer the advantage of enabling the use of either distance-based or cell-based algorithms but traditional molecular descriptors are often cross-correlated, provide little or no substructural information and, thus, are poor choices for chemistry-space metrics. BCUT values constitute a novel class of molecular descriptors which not only encode substructural topological (or topographical) information, but also encode atom-based information relevant to the strength of ligand-receptor interaction. We have presented an algorithm for choosing those low-dimensional metrics which best represent the diversity of a given population of compounds, an algorithm for validating the chosen metrics and cell-based algorithms using those metrics to address all of the diversity-related tasks. Work is currently in progress to develop additional metrics and a more efficient implementation of the algorithm for choosing the best chemistry-space.
Article
Several 3D approaches discussed in this volume describe methods for the analysis and quantitative description of chemical similarity. The underlying concept is that chemical similarity is reflected by similar biological activities, i.e. chemically closely related analogs should be related in their mode of action, as well as in their relative potencies. This fundamental assumption has, indeed, been used in medicinal chemistry research, and has led to many valuable drugs. However, chemical similarity may have different facets depending on whether a computer chemist or a medicinal chemist looks at the compounds. There is no argument that for maximal affinity a ligand of a biological macromolecule has to fit the binding pocket geometrically and that hydrophobic surfaces of the ligand and the binding site have to be complementary. The functional groups of the ligand need a separate consideration. For lipophilicity, there is no significant difference between, e.g. -O- and -NH- in an organic molecule; for ionization, there is a big difference whether the nitrogen atom is part of a basic group (an amine) or a neutral group (e.g. in an amide); and for binding, potency differences of several orders of magnitude may result from the exchange of the hydrogen bond acceptor -O- against a donor function -NH-.
1. Similarity as a Design Principle in Lead Optimization
Nearly all drugs result from the optimization of a lead structure. Sources of such leads are natural products from plants or microorganisms, synthetic chemicals or their intermediates, hits from (high-throughput) screening of in-house and combinatorial libraries, rational concepts from a biochemical pathway or the unexpected observation of a therapeutically useful side-effect of a drug. Most often, the biological activity of a lead structure is neither optimal with respect to its efficacy, nor with respect to specificity, bioavailability, pharmacokinetics, toxic and other side-effects.
Chemists perform more or less systematic variations of lead structures, using the experience of about 100 years of medicinal chemistry and the results of (quantitative) structure-activity relationships. The principle of bioisosteric replacement of functional groups serves as a successful optimization strategy [1–3]. Its systematic application has resulted in a broad variety of therapeutically used drugs, many of them finally having the desired combination of favorable properties. A few examples of typical but different consequences of isosteric replacement of atoms or groups are illustrated by compounds 1–3 (Fig. 1); some others with unexpected effects on biological activities are discussed in later sections of this chapter. In their attempts to optimize lead structures, medicinal chemists intuitively follow the principles of evolution. In genetic and evolutionary algorithms, randomly generated starting models (the lead structures) are reproduced involving random mutations and crossover (the chemical variation of the structures). Better models (compounds) are kept for further modification; worse ones are discarded. The biological activity, in later
Article
This contribution focuses on an assessment of errors in experimental and virtual screening. Sources of errors in high-throughput screening can be classified as logistic, measurement-related, or strategic. Biological assays formatted for high throughput are generally susceptible to small but systematic errors arising from a variety of sources, and the correction of such errors often requires the application of advanced data analysis methods. For virtual screening, chemical space design and molecular similarity analysis play crucial roles and similarity-based methods also have principal limitations, as discussed herein. In addition, the relative performance of computational screening methods, regardless of their specific features, generally displays strong compound class dependence. However, given their opportunities and limitations, experimental and computational screening can be carried out in a highly complementary manner and integrated screening strategies are thought to have significant potential in pharmaceutical research.
Article
In this article, we review the use of in vitro and in silico affinity fingerprints as novel descriptors for similarity searches in molecular databases and QSAR analyses. An affinity fingerprint for a particular molecule is constructed as a vector of either its binding affinities, docking scores or superpositioning pseudoenergies against a reference panel of proteins or small molecules. In contrast to most other molecular descriptors, affinity fingerprints are not directly derived from molecular structures. As such, they offer the possibility to detect similarities amongst molecules independent of their structural scaffolds. In this report we introduce the Flexsim-S method, an extension of our previous work on virtual affinity fingerprints. Moreover, we demonstrate that virtual affinity fingerprint methods are comparable to some popular two-dimensional descriptors in terms of correctly classifying compounds, but complementary with respect to the particular search results (hit lists).
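A minimal sketch of the affinity-fingerprint idea described in this abstract: each molecule is represented only by a vector of scores against a reference panel, and molecules are compared through those vectors rather than through their structures. The Pearson correlation used here is an assumed choice of comparison measure, not the Flexsim-S method, and the panel values and function names are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length affinity fingerprints."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def rank_by_affinity_profile(query, library):
    """Rank library compounds by profile correlation with the query, best first."""
    return sorted(library, key=lambda name: pearson(query, library[name]), reverse=True)


# Hypothetical scores (e.g. pKi or negated docking scores) against a four-protein panel.
query = [7.1, 5.0, 6.2, 4.3]
library = {
    "mol_a": [6.9, 5.2, 6.0, 4.5],  # similar profile
    "mol_b": [4.0, 7.5, 4.8, 6.9],  # roughly opposite profile
    "mol_c": [5.5, 5.4, 5.6, 5.5],  # nearly flat profile
}
ranking = rank_by_affinity_profile(query, library)
```

Because no structural descriptor enters the comparison, two compounds with unrelated scaffolds can still rank as close neighbours if their panel profiles agree.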
Article
An evaluation of a variety of structure-based clustering methods for use in compound selection is presented. The use of MACCS, Unity and Daylight 2D descriptors; Unity 3D rigid and flexible descriptors; and two in-house 3D descriptors based on potential pharmacophore points is considered. Ward's and group-average hierarchical agglomerative, Guenoche hierarchical divisive, and Jarvis-Patrick nonhierarchical clustering methods are compared. The results suggest that 2D descriptors and hierarchical clustering methods are best at separating biologically active molecules from inactives, a prerequisite for a good compound selection method. In particular, the combination of MACCS descriptors and Ward's clustering was optimal.
Article
Barnard Chemical Information Ltd.'s software products for generation of fragment-dictionary-based chemical structure fingerprints and for hierarchical and nonhierarchical clustering of large files of structures are described.
Article
There are many ways to represent a molecule's properties, including atomic-connectivity drawings, NMR spectra, and molecular orbital models. Prior methods for predicting the biological activity of compounds have largely depended on these physical representations. Measuring a compound's binding potency against a small reference panel of diverse proteins defines a very different representation of the molecule, which we call an affinity fingerprint. Statistical analysis of such fingerprints provides new insights into aspects of binding interactions that are shared among a wide variety of proteins. These analyses facilitate prediction of the binding properties of these compounds assayed against new proteins. Affinity fingerprints are reported for 122 structurally-diverse compounds using a reference panel of eight proteins that collectively are able to generate unique fingerprints for about 75% of the small organic compounds tested. Application of multivariate regression techniques to this database enables the creation of computational surrogates to represent new proteins that are surprisingly effective at predicting binding potencies. We illustrate this for two enzymes with no previously recognizable similarity to each other or to any of the reference proteins. Fitting of analogous computational surrogates to four other proteins confirms the generality of the method; when applied to a fingerprinted library of 5000 compounds, several sub-micromolar hits were correctly predicted. An affinity fingerprint database, which provides a rich source of data defining operational similarities among proteins, can be used to test theories of cryptic homology unexpected from current understanding of protein structure. Practical applications to drug design include efficient pre-screening of large numbers of compounds against target proteins using fingerprint similarities, supplemented by a small number of empirical measurements, to select promising compounds for further study.
Article
High-throughput screening has made a significant impact on drug discovery, but there is an acknowledged need for quantitative methods to analyze screening results and predict the activity of further compounds. In this paper we introduce one such method, binary kernel discrimination, and investigate its performance on two datasets; the first is a set of 1650 monoamine oxidase inhibitors, and the second a set of 101 437 compounds from an in-house enzyme assay. We compare the performance of binary kernel discrimination with a simple procedure which we call "merged similarity search", and also with a feedforward neural network. Binary kernel discrimination is shown to perform robustly with varying quantities of training data and also in the presence of noisy data. We conclude by highlighting the importance of the judicious use of general pattern recognition techniques for compound selection.
Article
High-throughput and virtual screening are important components of modern drug discovery research. Typically, these screening technologies are considered distinct approaches, as one is experimental and the other is theoretical in nature. However, given their similar tasks and goals, these approaches are much more complementary to each other than often thought. Various statistical, informatics and filtering methods have recently been introduced to foster the integration of experimental and in silico screening and maximize their output in drug discovery. Although many of these ideas and efforts have not yet proceeded much beyond the conceptual level, there are several success stories and good indications that early-stage drug discovery will benefit greatly from a more unified and knowledge-based approach to biological screening, despite the many technical advances towards even higher throughput that are made in the screening arena.
Article
Computational tools to search chemical structure databases are essential to finding leads early in a drug discovery project. Similarity methods are among the most diverse and most useful. We will present some lessons we have gathered over many years' experience with in-house methods on several therapeutic problems. The effectiveness of any similarity method can vary greatly from one biological activity to another in a way that is difficult to predict. Also, any two methods tend to select different subsets of actives from a database, so it is advisable to use several search methods where possible.
Article
In this study we evaluate how far the scope of similarity searching can be extended to identify not only ligands binding to the same target as the reference ligand(s) but also ligands of other homologous targets without initially known ligands. This "homology-based similarity searching" requires molecular representations reflecting the ability of a molecule to interact with target proteins. The Similog keys, which are introduced here as a new molecular representation, were designed to fulfill such requirements. They are based only on the molecular constitution and are counts of atom triplets. Each triplet is characterized by the graph distances and the types of its atoms. The atom-typing scheme classifies each atom by its function as H-bond donor or acceptor and by its electronegativity and bulkiness. In this study the Similog keys are investigated in retrospective in silico screening experiments and compared with other conformation-independent molecular representations. The molecules studied were taken from the MDDR database, with the activity data augmented by standardized target classification information from public protein classification databases. The MDDR molecule set was split randomly into two halves. The first half formed the candidate set. Ligands of four targets (dopamine D2 receptor, opioid delta-receptor, factor Xa serine protease, and progesterone receptor) were taken from the second half to form the respective reference sets. Different similarity calculation methods were used to rank the molecules of the candidate set by their similarity to each of the four reference sets. The accumulated counts of molecules binding to the reference target and groups of targets with decreasing homology to it were examined as a function of the similarity rank for each reference set and similarity method.
In summary, similarity searching based on Unity 2D-fingerprints or Similog keys is found to be equally effective in the identification of molecules binding to the same target as the reference set. However, the application of the Similog keys is more effective in comparison with the other investigated methods in the identification of ligands binding to any target belonging to the same family as the reference target. We attribute this superiority to the fact that the Similog keys provide a generalization of the chemical elements and that the keys are counted instead of merely noting their presence or absence in a binary form. The second most effective molecular representation is the occurrence counts of the public ISIS key fragments, which, like the Similog method, incorporate key counting as well as a generalization of the chemical elements. The results obtained suggest that ligands for a new target can be identified by the following three-step procedure: 1. Select at least one target with known ligands which is homologous to the new target. 2. Combine the known ligands of the selected target(s) to a reference set. 3. Search candidate ligands for the new targets by their similarity to the reference set using the Similog method. This clearly enlarges the scope of similarity searching from the classical application for a single target to the identification of candidate ligands for whole target families and is expected to be of key utility for further systematic chemogenomics exploration of previously well explored target families.
Article
Reduced graphs provide summary representations of chemical structures. In this work, the effectiveness of reduced graphs for similarity searching is investigated. Different types of reduced graphs are introduced that aim to summarize features of structures that have the potential to form interactions with receptors while retaining the topology between the features. Similarity searches have been carried out across a variety of different activity classes. The effectiveness of the reduced graphs at retrieving compounds with the same activity as known target compounds is compared with searching using Daylight fingerprints. The reduced graphs are shown to be effective for similarity searching and to retrieve more diverse active compounds than those found using Daylight fingerprints; they thus represent a complementary similarity searching tool.
Article
We investigate the following data mining problem from computer-aided drug design: From a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration a comparatively small batch of compounds is screened for binding activity toward this target. We employed the so-called "active learning paradigm" from Machine Learning for selecting the successive batches. Our main selection strategy is based on the maximum margin hyperplane generated by "Support Vector Machines". This hyperplane separates the current set of active from the inactive compounds and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones.
Article
The concept of compound class-specific profiling and scaling of molecular fingerprints for similarity searching is discussed and applied to newly designed fingerprint representations. The approach is based on the analysis of characteristic patterns of bits in keyed fingerprints that are set on in compounds having equivalent biological activity. Once a fingerprint profile is generated for a particular activity class, scaling factors that are weighted according to observed bit frequencies are applied to signature bit positions when searching for similar compounds. In systematic similarity search calculations over 23 diverse activity classes, profile scaling consistently increased the performance of fingerprints containing property descriptors and/or structural keys. A significant improvement of approximately 15% was observed for a new fingerprint consisting of binary encoded molecular property descriptors and structural keys. Under scaling conditions, this fingerprint, termed MP-MFP, correctly recognized on average close to 60% of all active test compounds, with only a few false positives. MP-MFP outperformed MACCS keys and other reference fingerprints. In general, optimum performance in scaling calculations was achieved at higher threshold values of the Tanimoto coefficient than in nonscaled calculations, thereby increasing the search selectivity. In general, putting relatively high weight on signature bit positions that were always, or almost always, set on was found to be the most effective scaling procedure. Analysis of class-specific search performance revealed that profile scaling of MP-MFP improved the similarity search results for each of the 23 activity classes.
Article
A novel compound classification algorithm is described that operates in binary molecular descriptor spaces and groups active compounds together in a computationally highly efficient manner. The method involves the transformation of continuous descriptor value ranges into a binary format, subsequent definition of simplified descriptor spaces, identification of consensus positions of specific compound sets in these spaces, and iterative adjustments of the dimensionality of the descriptor spaces in order to discriminate compounds sharing similar activity from others. We term this approach Dynamic Mapping of Consensus positions (DMC) because the definition of reference spaces is tuned toward specific compound classes and their dimensionality is increased as the analysis proceeds. When applied to virtual screening, sets of bait compounds are added to a large screening database to identify hidden active molecules. In these calculations, molecules that map to consensus positions after elimination of most of the database compounds are considered hit candidates. In a benchmark study on five biological activity classes, hits for randomly assembled sets of bait molecules were correctly identified in 95% of virtual screening calculations in a source database containing more than 1.3 million molecules, thus providing a measure of the sensitivity of the DMC technique.
Article
The number of compounds available for evaluation as part of the drug discovery process continues to increase. These compounds may exist physically or be stored electronically allowing screening by either actual or virtual means. This growing number of compounds has generated an increasing need for effective strategies to direct screening efforts. Initial efforts toward this goal led to the development of methods to select diverse sets of compounds for screening, methods to cluster actives into related groups of compounds, and tools to select compounds similar to actives of interest for further screening. In this work we extend these earlier efforts to exploit information about inactive compounds to help make rational decisions about which sets of compounds to include as part of a continuing screening campaign, or as part of a focused follow-up effort. This method uses the information from inactive compounds to "shave" off or deprioritize compounds similar to inactives from further consideration. This methodology can be used in two ways: first, to provide a rational means of deciding when sufficient compounds containing certain structural features have been tested and second as a tool to enhance similarity searching around known actives. Similarity searching is improved by deprioritizing compounds predicted to be inactive, due to the presence of structural features associated with inactivity.
Article
Anti-AIDS drug candidate and non-nucleoside reverse transcriptase inhibitor (NNRTI) TMC125-R165335 (etravirine) caused an initial drop in viral load similar to that observed with a five-drug combination in naïve patients and retains potency in patients infected with NNRTI-resistant HIV-1 variants. TMC125-R165335 and related anti-AIDS drug candidates can bind the enzyme RT in multiple conformations and thereby escape the effects of drug-resistance mutations. Structural studies showed that this inhibitor and other diarylpyrimidine (DAPY) analogues can adapt to changes in the NNRTI-binding pocket in several ways: (1) DAPY analogues can bind in at least two conformationally distinct modes; (2) within a given binding mode, torsional flexibility ("wiggling") of DAPY analogues permits access to numerous conformational variants; and (3) the compact design of the DAPY analogues permits significant repositioning and reorientation (translation and rotation) within the pocket ("jiggling"). Such adaptations appear to be critical for potency against wild-type and a wide range of drug-resistant mutant HIV-1 RTs. Exploitation of favorable components of inhibitor conformational flexibility (such as torsional flexibility about strategically located chemical bonds) can be a powerful drug design concept, especially for designing drugs that will be effective against rapidly mutating targets.
Article
There are several Quantitative Structure-Activity Relationship (QSAR) methods to assist in the design of compounds for medicinal use. Owing to the different QSAR methodologies, deciding which QSAR method to use depends on the composition of the system of interest and the desired results. The relationship between a compound's binding affinity/activity and its structural properties was first noted in the 1930s by Hammett and later refined by Hansch and Fujita in the mid-1960s. In 1988 Cramer and coworkers created Comparative Molecular Field Analysis (CoMFA) incorporating the three-dimensional (3D) aspects of the compounds, specifically the electrostatic fields of the compound, into the QSAR model. Hopfinger and coworkers added a further dimension to 3D-QSAR methodology in 1997 that eliminated the question of "Which conformation to use in a QSAR study?", creating 4D-QSAR. In 1999 Chemical Computing Group Inc. (CCG) developed the Binary-QSAR methodology and added novel 3D-QSAR descriptors to the traditional QSAR model allowing the 3D properties of compounds to be incorporated into the traditional QSAR model. Recently CCG released Probabilistic Receptor Potentials to calculate the substrate's atomic preferences in the active site. These potentials are constructed by fitting analytical functions to experimental properties of the substrates using knowledge-based methods. An overview of these and other QSAR methods will be discussed along with an in-depth examination of the methodologies used to construct QSAR models. Also included in this chapter is a case study of molecules used to create QSAR models utilizing different methodologies and QSAR programs.
Article
Fingerprint-based similarity searching is widely used for virtual screening when only a single bioactive reference structure is available. This paper reviews three distinct ways of carrying out such searches when multiple bioactive reference structures are available: merging the individual fingerprints into a single combined fingerprint; applying data fusion to the similarity rankings resulting from individual similarity searches; and approximations to substructural analysis. Extended searches on the MDL Drug Data Report database suggest that fusing similarity scores is the most effective general approach, with the best individual results coming from the binary kernel discrimination technique.
Article
The performance of a molecular similarity searching algorithm, which is based on atom environments, information-gain-based feature selection and a naive Bayesian classifier, was analyzed. The technique was applied to a series of diverse datasets and its performance was compared to those of alternative searching methods. Atom environments are count vectors of heavy atoms present at a topological distance from each heavy atom of a molecular structure. The technique outperformed both Unity fingerprints and binary kernel discrimination, with overall retrieval rates 10% higher. The difference in performance was attributed to the different molecular descriptors used, which capture information relevant to the similarity examined.
Article
A method for ligand-based virtual screening (LBVS), dynamic mapping of consensus positions (DMC), has been extended to take different potency levels of template compounds into account. This potency scaling technique is designed to tune search calculations toward the detection of increasingly potent hits. LBVS analysis of three different compound classes confirmed the ability of potency-scaled DMC (POT-DMC) to identify active database compounds with higher potency than conventional calculations.
Article
A primary goal of 3D similarity searching is to find compounds with similar bioactivity to a reference ligand but with different chemotypes, i.e., "scaffold hopping". However, an adequate description of chemical structures in 3D conformational space is difficult due to the high-dimensionality of the problem. We present an automated method that simplifies flexible 3D chemical descriptions in which clustering techniques traditionally used in data mining are exploited to create "fuzzy" molecular representations called FEPOPS (feature point pharmacophores). The representations can be used for flexible 3D similarity searching given one or more active compounds without a priori knowledge of bioactive conformations or pharmacophores. We demonstrate that similarity searching with FEPOPS significantly enriches for actives taken from in-house high-throughput screening datasets and from MDDR activity classes COX-2, 5-HT3A, and HIV-RT, while also scaffold or ring-system hopping to new chemical frameworks. Further, inhibitors of target proteins (dopamine 2 and retinoic acid receptor) are recalled by FEPOPS by scaffold hopping from their associated endogenous ligands (dopamine and retinoic acid). Importantly, the method excels in comparison to commonly used 2D similarity methods (DAYLIGHT, MACCS, Pipeline Pilot fingerprints) and a commercial 3D method (Pharmacophore Distance Triplets) at finding novel scaffold classes given a single query molecule.
Article
This work describes a practical strategy used at Pharmacia for identifying compounds for follow-up screening following an initial HTS campaign against targets where no 3-D structural information is available and preliminary SAR models do not exist. The approach explicitly takes into account different representations of chemistry space and identifies compounds for follow-up screening that are likely to provide the best overall coverage of the chemistry spaces considered. Specifically, the work employs hit-directed nearest-neighbor (HDNN) searching of compound databases based upon a set of "probe compounds" obtained as hits in the preliminary high-throughput screens. Four different molecular representations that generate nearly unique chemistry spaces are used. The representations include 3-D, 2-D, 2-D topological BCUTs (2-DT) and molecular fingerprints derived from substructural fragments. In the case of the BCUT representations the NN searching is distance based, while in the case of molecular fingerprints a similarity-based measure is used. Generally, the results obtained differ significantly among all four methods, that is, the sets of NN compounds have surprisingly little overlap. Moreover, in all of the four chemistry space representations, a minimum of 3- to 4-fold enrichment in actives over random screening is observed even though the actives identified in each of the sets of NNs are in large measure unique. These results suggest that use of multiple searches based upon a variety of molecular representations provides an effective way of identifying more hits in HDNN searches of chemistry spaces than can be realized with single searches.
Article
In this paper, we describe the first prospective application of the shape-comparison program ROCS (Rapid Overlay of Chemical Structures) to find new scaffolds for small molecule inhibitors of the ZipA-FtsZ protein-protein interaction, a proposed antibacterial target. The shape comparisons are made relative to the crystallographically determined, bioactive conformation of a high-throughput screening (HTS) hit. The use of ROCS led to the identification of a set of novel, weakly binding inhibitors with scaffolds presenting synthetic opportunities to further optimize biological affinity and lacking development issues associated with the HTS lead. These ROCS-identified scaffolds would have been missed using other structural similarity approaches such as ISIS 2D fingerprints. X-ray crystallographic analysis of one of the new inhibitors bound to ZipA reveals that the shape comparison approach very accurately predicted the binding mode. These experimental results validate this use of ROCS for chemotype switching or "lead hopping" and suggest that it is of general interest for lead identification in drug discovery endeavors.
Article
The optimal overlap between two molecular structures is a useful measure of shape similarity. However, it usually requires significant computation. This work describes the design of shape-fingerprints: binary bit strings that encode molecular shape. Standard measures of similarity between two shape-fingerprints are shown to be an excellent surrogate for similarity based on volume overlap but several orders of magnitude faster to compute. Consequently, shape-fingerprints can be used for clustering of large data sets, evaluating the diversity of compound libraries, as descriptors in SAR and as a prescreen for exact shape comparison against large virtual databases. Our results show that a small set of shapes can be used to build these fingerprints and that this set can be applied universally.
Article
The Support Vector Machine (SVM) is an algorithm that derives a model used for the classification of data into two categories and which has good generalization properties. This study applies the SVM algorithm to the problem of virtual screening for molecules with a desired activity. In contrast to typical applications of the SVM, we emphasize not classification but enrichment of actives by using a modified version of the standard SVM function to rank molecules. The method employs a simple and novel criterion for picking molecular descriptors and uses cross-validation to select SVM parameters. The resulting method is more effective at enriching for active compounds with novel chemistries than binary fingerprint-based methods such as binary kernel discrimination.
Article
The combination of 3D pharmacophore fingerprints and the support vector machine classification algorithm has been used to generate robust models that are able to classify compounds as active or inactive in a number of G-protein-coupled receptor assays. The models have been tested against progressively more challenging validation sets where steps are taken to ensure that compounds in the validation set are chemically and structurally distinct from the training set. In the most challenging example, we simulate a lead-hopping experiment by excluding an entire class of compounds (defined by a core substructure) from the training set. The left-out active compounds comprised approximately 40% of the actives. The model trained on the remaining compounds is able to recall 75% of the actives from the "new" lead series while correctly classifying >99% of the 5000 inactives included in the validation set.
Article
We have performed virtual screening using some very simple features, by employing the number of atoms per element as molecular descriptors but without regard to any structural information whatsoever. Surprisingly, these atom counts are able to outperform virtual-affinity-based fingerprints and Unity fingerprints in some activity classes. Although molecular weight and other biases were known in target-based virtual screening settings (docking), we report the effect of using very simple descriptors for ligand-based virtual screening, by using clearly defined biological targets and employing a large data set (>100,000 compounds) containing multiple (11) activity classes. Structure-unaware atom count vectors as descriptors in combination with the Euclidean distance measure are able to achieve "enrichment factors" over random selection of around 4 (depending on the particular class of active compounds), putting the enrichment factors reported for more sophisticated virtual screening methods in a different light. They are also able to retrieve active compounds with novel scaffolds instead of merely the expected structural analogues. The added value of many currently used virtual screening methods (calculated as enrichment factors) drops down to a factor of between 1 and 2, instead of often reported double-digit figures. The observed effect is much less profound for simple descriptors such as molecular weight and is only present in cases of atypical (larger) ligands. The current state of virtual screening is not as sophisticated as might be expected, which is due to descriptors still not being able to capture structural properties relevant to binding. This fact can partly be explained by highly nonlinear structure-activity relationships, which represent a severe limitation of the "similar property principle" in the context of bioactivity.
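The atom-count search described above can be sketched in a few lines. The example below is a hedged illustration only: it assumes a small fixed element list, represents each molecule as a dictionary of element counts, ranks database molecules by Euclidean distance to a query, and computes an enrichment factor as the hit rate in the top of the ranking divided by the hit rate in the whole database. The element list, molecule names, and counts are hypothetical, and the exact enrichment definition used in the paper is not given in the abstract.

```python
import math

# Assumed element order for the count vector (illustrative choice).
ELEMENTS = ("C", "N", "O", "S", "F", "Cl")


def atom_count_vector(counts: dict) -> list:
    """Turn {'C': 10, 'N': 2} into a fixed-length vector; missing elements count 0."""
    return [counts.get(element, 0) for element in ELEMENTS]


def euclidean(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_distance(query: dict, database: dict) -> list:
    """Return database molecule names sorted from nearest to farthest."""
    q = atom_count_vector(query)
    return sorted(
        database,
        key=lambda name: euclidean(q, atom_count_vector(database[name])),
    )


def enrichment_factor(ranked: list, actives: set, top_n: int) -> float:
    """Hit rate among the top_n ranked molecules divided by the overall hit rate."""
    hits = sum(1 for name in ranked[:top_n] if name in actives)
    return (hits / top_n) / (len(actives) / len(ranked))


# Hypothetical toy data.
query = {"C": 10, "N": 2, "O": 1}
database = {
    "mol_a": {"C": 10, "N": 2, "O": 2},
    "mol_b": {"C": 12, "N": 2, "O": 1},
    "mol_c": {"C": 20, "O": 5},
    "mol_d": {"C": 5, "S": 1},
}
ranked = rank_by_distance(query, database)
ef = enrichment_factor(ranked, actives={"mol_a", "mol_b"}, top_n=2)
```

Because the vector ignores connectivity entirely, any enrichment it achieves serves as a baseline against which structure-aware descriptors can be judged, which is the point the abstract makes.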
Article
We test the hypothesis that fusing the outputs of similarity searches based on a single bioactive reference structure and on its nearest neighbors (of unknown activity) is more effective (in terms of numbers of high-ranked active structures) than a similarity search involving just the reference structure. This turbo similarity searching approach provides a simple way to enhance the effectiveness of simulated virtual screening searches of the MDL Drug Data Report database.
Article
Multiple sequence alignment has proven to be a powerful method for creating protein and DNA sequence alignment profiles. These profiles of protein families are useful tools for identifying conserved motifs, such as the catalytic triad of the serine protease family or the seven transmembrane helices of the G-protein coupled receptor family. Ultimately, the understanding of the critical motifs within a family is useful for identifying new members of the family. Due to the complexity of protein-ligand recognition, no universally accepted method exists for clustering small molecules into families with the same or similar biological activity. A combination of the concept of multiple sequence alignment and the 1-dimensional molecular representation described earlier offers a new method for profiling sets of small molecules with the same biological activity. These small molecule profiles can isolate key commonalities within the set of bioactive compounds much like a multiple sequence alignment can isolate critical motifs within a protein family. The small molecule profiles then make useful tools for searching small molecule databases for new compounds with the same biological activity. The technique is demonstrated here using the human ether-a-go-go potassium channel and the kinase SRC.
Article
The ability to find novel bioactive scaffolds in compound similarity-based virtual screening experiments has been studied comparing Tanimoto-based, ranking-based, voting, and consensus scoring protocols. Ligand sets for seven well-known drug targets (CDK2, COX2, estrogen receptor, neuraminidase, HIV-1 protease, p38 MAP kinase, thrombin) have been assembled such that each ligand represents its own unique chemotype, thus ensuring that each similarity recognition event between ligands constitutes a scaffold hopping event. In a series of virtual screening studies involving 9969 MDDR compounds as negative controls it has been found that atom pair descriptors and 3D pharmacophore fingerprints combined with ranking, voting, and consensus scoring strategies perform well in finding novel bioactive scaffolds. In addition, often superior performance has been observed for similarity-based virtual screening compared to structure-based methods. This finding suggests that information about a target obtained from known bioactive ligands is as valuable as knowledge of the target structures for identifying novel bioactive scaffolds through virtual screening.
Article
The paper describes the generation of four types of three-dimensional molecular field descriptors or 'field points' as extrema of electrostatic, steric, and hydrophobic fields. These field points are used to define the properties necessary for a molecule to bind in a characteristic way into a specified active site. The hypothesis is that compounds showing a similar field point pattern are likely to bind at the same target site regardless of structure. The methodology to test this idea is illustrated using HIV NNRTI and thrombin ligands and validated across seven other targets. From the in silico comparisons of field point overlays, the experimentally observed binding poses of these ligands in their respective sites can be reproduced from pairwise comparisons.
Article
MAD (Mapping to Activity class-specific Descriptor value ranges) is a novel molecular similarity method that relies on the identification of activity-specific descriptors. Applying a categorical descriptor scoring function, value ranges of molecular descriptors in screening databases are compared with those in classes of active compounds and descriptors displaying significant deviations are selected. In order to identify new actives, database molecules are mapped to class-specific value ranges and ranked using a similarity function. As a mapping algorithm, MAD is distinct from many other molecular similarity and virtual screening methods. In systematic virtual screening trials, for small selection sets of only 30 database compounds, average hit and recovery rates over six activity classes ranged from about 10% to 25% and about 25% to 75%, respectively. Moreover, when mining a database of bioactive molecules many similar compounds were selected (with hit rates between about 15% and 79%). Our findings suggest that it is possible to generate compound class-directed descriptor reference spaces for molecular similarity analysis.
Article
The concept of chemical space is of fundamental importance for chemoinformatics research. It is generally thought that high-dimensional space representations are too complex for the successful application of many compound classification or virtual screening methods. Here, we show that a simple "activity-centered" distance function is capable of accurately detecting molecular similarity relationships in "raw" chemical spaces of high dimensionality.
Article
Here, we introduce the DynaMAD algorithm that is designed to map database compounds to combinations of activity-class-dependent descriptor value ranges in order to identify novel active molecules. The method combines and extends key features of two previously developed algorithms, MAD and DMC. These methods were first described as compound-mapping algorithms for large-scale virtual screening applications. DynaMAD and DMC operate in chemical spaces of stepwise increasing dimensionality. However, in contrast to DMC, which utilizes binary transformed descriptors, DynaMAD uses unmodified descriptor value distributions. The performance of these mapping methods was compared in detail in virtual screening trials on 24 different compound activity classes against a background of about 2 million database compounds. In these calculations, all three approaches produced results of considerable predictive value, and the enrichment of active molecules in small selection sets consisting of only about 20 or fewer database compounds emerged as a common feature. Furthermore, mapping methods were capable of recognizing remote molecular similarity relationships. Overall, DynaMAD performed better than MAD and DMC, producing average hit and recovery rates of 55% and 33%, respectively, over all 24 classes. Taken together, our findings suggest that dynamic compound mapping to combinations of activity-class-selective descriptor settings has significant potential for molecular similarity analysis and ligand-based virtual screening.
Article
A novel method termed MolBlaster is introduced for the evaluation of molecular similarity relationships on the basis of randomly generated fragment populations. Our motivation has been to develop a similarity method that does not depend on the use of predefined structural or property descriptors. Fragment profiles of molecules are generated by random deletion of bonds in connectivity tables and quantitatively compared using entropy-based metrics. In test calculations, MolBlaster accurately reproduced a structural key-based similarity ranking of druglike molecules.
Article
We apply a recently published method of text-based molecular similarity searching (LINGO) to standard data sets for the purpose of quantifying the accuracy of the approach. Our implementation is based on a pattern-matching finite state machine (FSM) which results in fast search times. The accuracy of LINGO is demonstrated to be comparable to that of a path-based fingerprint and offers a simple yet effective method for similarity searching.
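To make the text-based comparison concrete, the following sketch splits a SMILES string into overlapping four-character substrings and compares the substring counts of two molecules. It is a simplified assumption-laden version: the scoring follows the count-based Tanimoto-style form usually attributed to the original LINGO description, no SMILES canonicalization or ring-closure normalization is applied, and plain dictionary counting stands in for the finite state machine mentioned in the abstract. The function names are illustrative.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count all overlapping substrings of length q in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_similarity(a: str, b: str, q: int = 4) -> float:
    """Average per-substring agreement of counts over all substrings seen in a or b.

    Each substring contributes 1 - |Na - Nb| / (Na + Nb); the result lies in [0, 1].
    Assumption: inputs are already canonical SMILES produced by the same toolkit.
    """
    count_a, count_b = lingos(a, q), lingos(b, q)
    keys = set(count_a) | set(count_b)
    if not keys:
        return 0.0  # strings shorter than q yield no substrings
    total = sum(
        1.0 - abs(count_a[key] - count_b[key]) / (count_a[key] + count_b[key])
        for key in keys
    )
    return total / len(keys)


sim_same = lingo_similarity("CCCCO", "CCCCO")
sim_close = lingo_similarity("CCCCO", "CCCCN")
```

Since the comparison works directly on strings, its results depend on the SMILES writer used; consistent canonicalization of both query and database is a practical prerequisite.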
Article
Recent attempts to increase similarity search performance using molecular fingerprints have mostly focused on the evaluation of alternative similarity metrics or scoring schemes, rather than the development of new types of fingerprints. Here, we introduce a novel 2D fingerprint design (property descriptor value range-derived fingerprint or PDR-FP) that involves activity-oriented selection of property descriptors and the transformation of descriptor value ranges into a binary format such that each fingerprint bit position represents a specific value interval. The design is tailored toward multiple-template similarity searching and permits training on specific activity classes. In search calculations on 15 compound classes of increasing structural diversity, the PDR fingerprint performed better than other state-of-the-art 2D fingerprints. Among the structurally diverse classes were six compound sets with peptide character, which represent a notoriously difficult chemotype for 2D similarity searching. In these cases, PDR-FP produced promising results, whereas other fingerprint methods mostly failed. PDR-FP is specifically designed for search calculations on structurally diverse compounds, and these calculations are not influenced by molecular size effects, which represent a general problem for similarity searching using bit string representations.
Article
Conventional similarity searching of molecules compares single (or multiple) active query structures to each other in a relative framework, by means of a structural descriptor and a similarity measure. While this often works well, depending on the target, we show here that retrieval rates can be improved considerably by incorporating an external framework describing ligand bioactivity space for comparisons ("Bayes affinity fingerprints"). Structures are described by Bayes scores for a ligand panel comprising about 1000 activity classes extracted from the WOMBAT database. The comparison of structures is performed via the Pearson correlation coefficient of activity classes, that is, the order in which two structures are similar to the panel activity classes. Compound retrieval on a recently published data set could be improved by as much as 24% relative (9% absolute). Knowledge about the shape of the "bioactive chemical universe" is thus beneficial to identifying similar bioactivities. Principal component analysis was employed to further analyze activity space with the objective to define orthogonal ligand bioactive chemical space, leading to nine major (roughly orthogonal) activity axes. Employing only those nine activity classes, retrieval rates are still comparable to original Bayes affinity fingerprints; thus, the concept of orthogonal bioactive ligand chemical space was validated as being an information-rich but low-dimensional representation of bioactivity space. Correlations between activity classes are a major determinant to gauge whether the desired multitarget activity of drugs is (on the basis of current knowledge) a feasible concept because it measures the extent to which activities can be optimized independently, or only by strongly influencing one another.
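The comparison step can be sketched as a Pearson correlation between two score vectors, one entry per activity class in the panel. The score values and class count below are invented for illustration, and the Bayes model that would produce such scores is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0 here
    return cov / (sd_x * sd_y)

# Hypothetical Bayes scores of three compounds against a five-class panel.
query = [2.1, -0.4, 0.9, -1.3, 0.2]
database = {
    "cpd_1": [1.8, -0.2, 1.1, -1.0, 0.0],
    "cpd_2": [-1.5, 0.7, -0.6, 1.9, 0.1],
}
ranking = sorted(database, key=lambda name: pearson(query, database[name]), reverse=True)
print(ranking)
```

Two compounds score as similar here when the panel classes rank them in a similar way, regardless of whether their structures resemble each other directly.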
Article
This paper summarizes recent work at the University of Sheffield on virtual screening methods that use 2D fingerprint measures of structural similarity. A detailed comparison of a large number of similarity coefficients demonstrates that the well-known Tanimoto coefficient remains the method of choice for the computation of fingerprint-based similarity, despite possessing some inherent biases related to the sizes of the molecules that are being sought. Group fusion involves combining the results of similarity searches based on multiple reference structures and a single similarity measure. We demonstrate the effectiveness of this approach to screening, and also describe an approximate form of group fusion, turbo similarity searching, that can be used when just a single reference structure is available.
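The Tanimoto coefficient and the two fusion schemes can be sketched as below. Fingerprints are represented as sets of on-bit positions; the MAX fusion rule, the neighbour count `k`, and the function names are assumptions made for illustration, since the abstract does not state which fusion rule or parameters were used.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def group_fusion(references, database, fuse=max):
    """Score each database compound against all references and fuse the scores."""
    scores = {
        name: fuse(tanimoto(fp, ref) for ref in references)
        for name, fp in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def turbo_search(reference, database, k=2):
    """Approximate group fusion from one reference.

    The k nearest neighbours of the reference are treated as extra references.
    """
    first_pass = group_fusion([reference], database)
    neighbours = [database[name] for name, _ in first_pass[:k]]
    return group_fusion([reference] + neighbours, database)

# Toy fingerprints (sets of on-bit positions).
db = {"a": {1, 2, 3}, "b": {7, 8}, "c": {1, 2}}
print(group_fusion([{1, 2, 3}, {7, 8, 9}], db))
print(turbo_search({1, 2, 3}, db, k=1))
```

The size bias mentioned in the abstract is visible in the formula: the denominator grows with the number of bits set in either fingerprint, so the achievable score depends on how many bits the compared molecules set.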
Article
We studied the similarity search performance of differently designed molecular fingerprints using multiple reference structures and different search strategies. For this purpose, nine compound activity classes were assembled that exclusively consisted of molecules with different core structures and that represented different levels of intra-class structural diversity. Thus, there was a strict one-to-one correspondence between test compounds and core structures. Analysis of unique core structures was found to be a better measure of class diversity than distributions of simplified scaffolds. On increasingly diverse classes, a trainable fingerprint using a unique search strategy performed better than others tested herein. Overall, clear preferences were detected for nearest-neighbor search strategies over fingerprint-averaging techniques. Nearest-neighbor searching that relied on selecting database compounds most similar to one of the reference structures often improved compound recovery over other averaging methods, but at the cost of decreasing the ability to detect hits that were structurally distinct from reference molecules.
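The contrast between nearest-neighbor and fingerprint-averaging strategies can be made concrete with a toy example. The sketch assumes bit-vector fingerprints, a centroid obtained by averaging bit positions over the reference set, and the continuous form of the Tanimoto coefficient; none of these details are taken from the study, and the vectors are invented.

```python
def tanimoto_cont(x, y):
    """Tanimoto coefficient for real-valued vectors: xy / (xx + yy - xy)."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    denom = xx + yy - xy
    return xy / denom if denom else 0.0

def centroid(references):
    """Average the reference fingerprints position by position."""
    n = len(references)
    return [sum(column) / n for column in zip(*references)]

def score_1nn(candidate, references):
    """Nearest-neighbor score: similarity to the closest single reference."""
    return max(tanimoto_cont(candidate, ref) for ref in references)

def score_centroid(candidate, references):
    """Averaging score: similarity to the centroid of all references."""
    return tanimoto_cont(candidate, centroid(references))

# Two structurally unrelated references and a candidate identical to the first.
refs = [[1, 1, 0, 0], [0, 0, 1, 1]]
candidate = [1, 1, 0, 0]
print(score_1nn(candidate, refs), score_centroid(candidate, refs))
```

In this example the candidate matches one reference exactly and scores 1.0 under the nearest-neighbor strategy, but only 0.5 against the centroid, because averaging dilutes the features of each individual reference when the reference set is diverse.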
Article
The scope of the current work is to investigate whether structurally similar ligands bind in a similar fashion by exhaustively analyzing experimental data from the Protein Data Bank (PDB). The complete PDB was searched for pairs of structurally similar ligands binding to the same biological target. The binding sites of the pairs of proteins complexing structurally similar ligands were found to differ in 83% of the cases. The most recurrent structural change among the pairs involves different water molecule architecture. Side-chain movements were observed in half of the pairs, whereas backbone movements rarely occurred. However, the binding modes of two structurally similar ligands generally show a high degree of structural conservation. That is, a majority of the ligand pairs occupy the same region in the binding sites, providing support for the use of shape matching in the drug design process. We allow ourselves to draw general conclusions because our data set consists of ligands with drug-like physicochemical properties complexed to a broad spectrum of different protein classes.