Article
Developing structure-activity relationships (SARs) of molecules is an important approach in facilitating hit exploration in the early stage of drug discovery. Although information on millions of compounds and their bioactivities is freely available to the public, it is very challenging to infer a meaningful and novel SAR from that information. Research discussed in the present paper employed a bioactivity-centered clustering approach to group 843,845 non-inactive compounds stored in PubChem according to both structural similarity and bioactivity similarity, with the aim of mining bioactivity data in PubChem for useful SAR information. The compounds were clustered in three bioactivity similarity contexts: (1) non-inactive in a given bioassay, (2) non-inactive against a given protein, and (3) non-inactive against proteins involved in a given pathway. In each context, these small molecules were clustered according to their two-dimensional (2-D) and three-dimensional (3-D) structural similarities. The resulting 18 million clusters, named "PubChem SAR clusters", were delivered in such a way that each cluster contains a group of small molecules similar to each other in both structure and bioactivity. The PubChem SAR clusters, pre-computed using publicly available bioactivity information, make it possible to quickly navigate and narrow down the compounds of interest. Each SAR cluster can be a useful resource in developing a meaningful SAR or enable one to design or expand compound libraries from the cluster. It can also help to predict the potential therapeutic effects and pharmacological actions of less-known compounds from those of well-known compounds (i.e., drugs) in the same cluster.
Article
After a fragment-based screen, the resulting hits need to be prioritized for follow-up structure elucidation and chemistry. This paper describes a new similarity metric, Atom-Atom-Path (AAP) similarity, that is used in conjunction with the Directed Sphere Exclusion (DISE) clustering method to effectively organize and prioritize the fragment hits. The AAP similarity rewards common substructures and recognizes minimal structure differences. The DISE method is order-dependent and can be used to enrich fragments with properties of interest in the first clusters. The merit of the software is demonstrated by its application to the MAP4K4 fragment screening hits using ligand efficiency (LE) as the quality measure. The first clusters contain the hits with the highest LE. The clustering results can be easily visualized in a LE-over-clusters scatterplot with points colored by the members' similarity to the corresponding cluster seed. The scatterplot enables the extraction of preliminary SAR. The detailed structure differentiation of the AAP similarity metric is ideal for fragment-sized molecules. The order-dependent nature of the DISE clustering method results in clusters ordered by a property of interest to the teams. The combination of both allows for efficient prioritization of fragment hits for follow-up. Graphical abstract: AAP similarity computation and DISE clustering visualization.
Article
Background Hierarchical clustering is an exploratory data analysis method that reveals the groups (clusters) of similar objects. The result of the hierarchical clustering is a tree structure called dendrogram that shows the arrangement of individual clusters. To investigate the row/column hierarchical cluster structure of a data matrix, a visualization tool called ‘cluster heatmap’ is commonly employed. In the cluster heatmap, the data matrix is displayed as a heatmap, a 2-dimensional array in which the colour of each element corresponds to its value. The rows/columns of the matrix are ordered such that similar rows/columns are near each other. The ordering is given by the dendrogram which is displayed on the side of the heatmap. Results We developed InCHlib (Interactive Cluster Heatmap Library), a highly interactive and lightweight JavaScript library for cluster heatmap visualization and exploration. InCHlib enables the user to select individual or clustered heatmap rows, to zoom in and out of clusters or to flexibly modify heatmap appearance. The cluster heatmap can be augmented with additional metadata displayed in a different colour scale. In addition, to further enhance the visualization, the cluster heatmap can be interconnected with external data sources or analysis tools. Data clustering and the preparation of the input file for InCHlib is facilitated by the Python utility script inchlib_clust. Conclusions The cluster heatmap is one of the most popular visualizations of large chemical and biomedical data sets originating, e.g., in high-throughput screening, genomics or transcriptomics experiments. The presented JavaScript library InCHlib is a client-side solution for cluster heatmap exploration. InCHlib can be easily deployed into any modern web application and configured to cooperate with external tools and data sources. 
Though InCHlib is primarily intended for the analysis of chemical or biological data, it is a versatile tool whose application domain is not limited to the life sciences. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-014-0044-4) contains supplementary material, which is available to authorized users.
Article
It has been claimed that relational properties among chemical substances are at the core of chemistry. Here we show that chemical elements and a wealth of their trends can be found by the study of a relational property: the formation of binary compounds. We say that two chemical elements A and B are similar if they form binary compounds AC and BC, C being another chemical element. To allow for the richness of chemical combinations, we also included the different stoichiometric ratios for binary compounds. Hence, the more combinations with different chemical elements, and with similar stoichiometry, the more similar two chemical elements are. By studying 4,700 binary compounds using network theory and point-set topology, we obtained well-known chemical families of elements, such as alkali metals, alkaline earth metals, halogens, lanthanides, actinides, and some transition metal groups, as well as chemical patterns such as the singularity principle, the knight's move, and secondary periodicity. The methodology applied here can be extended to the study of ternary, quaternary and other compounds, as well as other chemical sets where a relational property can be defined.
Article
We recently developed a methodology to endow a finite set Q with topologies using similarity results from cluster analysis (dendrograms). In this paper we characterise the family of these topologies. We introduce a new method that generalises the previous one and allows new topologies over Q, not belonging to the former family, to be built. Both procedures ensure the existence of a topology given a dendrogram, and it is shown that, given a topology for Q mirroring similarities, a dendrogram can be associated with it.
Article
Background: Although many consensus clustering methods have been successfully used for combining multiple classifiers in many areas such as machine learning, applied statistics, pattern recognition and bioinformatics, few consensus clustering methods have been applied for combining multiple clusterings of chemical structures. It is known that no individual clustering method will always give the best results for all types of applications. So, in this paper, three voting and graph-based consensus clusterings were used for combining multiple clusterings of chemical structures to enhance the ability of separating biologically active molecules from inactive ones in each cluster. Results: The cumulative voting-based aggregation algorithm (CVAA), cluster-based similarity partitioning algorithm (CSPA) and hyper-graph partitioning algorithm (HGPA) were examined. The F-measure and Quality Partition Index method (QPI) were used to evaluate the clusterings and the results were compared to Ward’s clustering method. The MDL Drug Data Report (MDDR) dataset was used for experiments and was represented by two 2D fingerprints, ALOGP and ECFP_4. The voting-based consensus clustering method outperformed Ward’s method using the F-measure and QPI method for both ALOGP and ECFP_4 fingerprints, while the graph-based consensus clustering methods outperformed Ward’s method only for ALOGP using QPI. The Jaccard and Euclidean distance measures were the methods of choice to generate the ensembles, which give the highest values for both criteria. Conclusions: The results of the experiments show that consensus clustering methods can improve the effectiveness of chemical structure clusterings. The cumulative voting-based aggregation algorithm (CVAA) was the method of choice among consensus clustering methods.
Article
Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult, to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. In that respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity by selecting which features to employ in the process. The tool can use and calculate different kinds of features, such as structural fragments and quantitative chemical descriptors. These features can be highlighted within CheS-Mapper, which helps the chemist better understand patterns and regularities and relate the observations to established scientific knowledge. As a final function, the tool can also be used to select and export specific subsets of a given dataset for further analysis.
Article
In agglomerative hierarchical clustering, pair-group methods suffer from a problem of non-uniqueness when two or more distances between different clusters coincide during the amalgamation process. The traditional approach for solving this drawback has been to take any arbitrary criterion in order to break ties between distances, which results in different hierarchical classifications depending on the criterion followed. In this article we propose a variable-group algorithm that consists in grouping more than two clusters at the same time when ties occur. We give a tree representation for the results of the algorithm, which we call a multidendrogram, as well as a generalization of the Lance and Williams’ formula which enables the implementation of the algorithm in a recursive way.
Article
On solid growth media with limiting nitrogen source, diploid budding-yeast cells differentiate from the yeast form to a filamentous, adhesive, and invasive form. Genomic profiles of mRNA levels in Saccharomyces cerevisiae yeast-form and filamentous-form cells were compared. Disparate data types, including genes implicated by expression change, filamentation genes known previously through a phenotype, protein-protein interaction data, and protein-metabolite interaction data were integrated as the nodes and edges of a filamentation-network graph. Application of a network-clustering method revealed 47 clusters in the data. The correspondence of the clusters to modules is supported by significant coordinated expression change among cluster co-member genes, and the quantitative identification of collective functions controlling cell properties. The modular abstraction of the filamentation network enables the association of filamentous-form cell properties with the activation or repression of specific biological processes, and suggests hypotheses. A module-derived hypothesis was tested. It was found that the 26S proteasome regulates filamentous-form growth.
Article
How well do different classification methods perform in selecting the ligands of a protein target out of large compound collections not used to train the model? Support vector machines, random forest, artificial neural networks, k-nearest-neighbor classification with genetic-algorithm-optimized feature selection, trend vectors, naïve Bayesian classification, and decision tree were used to divide databases into molecules predicted to be active and those predicted to be inactive. Training and predicted activities were treated as binary. The database was generated for the ligands of five different biological targets which have been the object of intense drug discovery efforts: HIV-reverse transcriptase, COX2, dihydrofolate reductase, estrogen receptor, and thrombin. We report significant differences in the performance of the methods independent of the biological target and compound class. Different methods can have different applications; some provide particularly high enrichment, others are strong in retrieving the maximum number of actives. We also show that these methods do surprisingly well in predicting recently published ligands of a target on the basis of initial leads and that a combination of the results of different methods in certain cases can improve results compared to the most consistent method.
Article
Topoisomerase IB (Top1) is a key eukaryotic nuclear enzyme that regulates the topology of DNA during replication and gene transcription. Anticancer drugs that block Top1 are either well-characterized interfacial poisons or lesser-known catalytic inhibitor compounds. Here we describe a new class of cytotoxic redox-stable cationic Au(3+) macrocycles which, through hierarchical cluster analysis of cytotoxicity data for the lead compound, 3, were identified as either poisons or inhibitors of Top1. Two pivotal enzyme inhibition assays prove that the compounds are true catalytic inhibitors of Top1. Inhibition of human topoisomerase IIα (Top2α) by 3 was 2 orders of magnitude weaker than its inhibition of Top1, confirming that 3 is a type I-specific catalytic inhibitor. Importantly, Au(3+) is essential for both DNA intercalation and enzyme inhibition. Macromolecular simulations show that 3 intercalates directly at the 5'-TA-3' dinucleotide sequence targeted by Top1 via crucial electrostatic interactions, which include π-π stacking and an Au···O contact involving a thymine carbonyl group, resolving the ambiguity of conventional (drug binds protein) vs unconventional (drug binds substrate) catalytic inhibition of the enzyme. Surface plasmon resonance studies confirm the molecular mechanism of action elucidated by the simulations.
Article
A simple method of counting the number of possible evolutionary trees is presented. The trees are assumed to be rooted, with labelled tips but unlabelled root and unlabelled interior nodes. The method generalizes the previous work of Edwards and Cavalli-Sforza by allowing forks to be multifurcations as well as bifurcations. It makes use of a simple recurrence relation for T(n,m), the number of trees with n labelled tips and m unlabelled interior nodes. A table of the total number of trees is presented up to n = 22. There are 282,137,824 different trees having 10 tip species, and over 8.87 x 10²³ different trees having 20 tip species. The method is extended to count trees some of whose interior nodes may be labelled. The principal uses of these numbers will be to double-check algorithms and to frighten taxonomists.
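The abstract above mentions a recurrence for T(n,m) without stating it. The following is a minimal sketch, assuming the commonly cited form T(n,m) = m·T(n-1,m) + (n+m-2)·T(n-1,m-1) with T(1,0) = 1; the function names `tree_counts` and `total_trees` are illustrative and not taken from the paper.

```python
def tree_counts(max_tips):
    """Build table[n][m]: rooted trees with n labelled tips and
    m unlabelled interior nodes (multifurcations allowed)."""
    table = [[0] * (max_tips + 1) for _ in range(max_tips + 1)]
    table[1][0] = 1  # a single tip, no interior node
    for n in range(2, max_tips + 1):
        for m in range(1, n):  # at most n - 1 interior nodes
            # Assumed recurrence: the n-th tip either joins one of the
            # m existing interior nodes, or is attached through a new
            # interior node placed on one of n + m - 2 positions
            # (the n + m - 3 branches plus the position above the root).
            table[n][m] = (m * table[n - 1][m]
                           + (n + m - 2) * table[n - 1][m - 1])
    return table


def total_trees(n):
    """Total number of trees with n labelled tips, summed over m."""
    return sum(tree_counts(n)[n])


if __name__ == "__main__":
    for n in (3, 4, 10, 20):
        print(n, total_trees(n))
```

Under this assumed recurrence, the total for 10 tips agrees with the 282,137,824 quoted in the abstract, and the total for 20 tips exceeds 8.87 x 10²³.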
Article
Cruzipain (Cz) is the major cysteine protease of the protozoan Trypanosoma cruzi, the etiological agent of Chagas disease. From a 163-compound dataset, a 2D-classifier capable of identifying Cz inhibitors was obtained and applied in a virtual screening campaign on the DrugBank database, which compiles FDA-approved and investigational drugs. 54 approved drugs were selected as candidates, 4 of which were acquired and tested on Cz and T. cruzi epimastigotes. Among them, the antiparkinsonian and antidiabetic drug bromocriptine and the antiarrhythmic amiodarone showed dose-dependent inhibition of Cz and antiproliferative activity on the parasite.
Article
Topological indices (TIs) have been used to study structure-activity relationships (SAR) with respect to the physical, chemical, and biological properties of congeneric sets of molecules. Since there are many TIs and many are correlated, it is important that we identify redundancies and extract useful information from TIs into a smaller number of parameters. Moreover, it is important to determine if TIs, or parameters derived from TIs, can be used for global SAR models of diverse sets of chemicals. We calculated seventy-one TIs for three groups of molecules of increasing complexity and diversity: (a) 74 alkanes, (b) 29 alkylbenzenes, and (c) 37 polycyclic aromatic hydrocarbons (PAHs). Principal components analysis (PCA) revealed that a few principal components (PCs) could extract most of the information encoded by the seventy-one TIs. The structural basis of the first few PCs could be derived from their pattern of correlation with individual TIs. For the three sets of molecules, viz. alkanes, alkylbenzenes and PAHs, PCs were able to predict the boiling points reasonably well. Also, for the combined set of 140 chemicals consisting of the alkanes, alkylbenzenes and PAHs, the derived PCs were not as effective in predicting properties as in the case of individual classes of compounds.
Article
A disposable preoxidation technique that dramatically improves the detection and identification of volatile organic compounds (VOCs) by a colorimetric sensor array is reported. Passing a vapor stream through a tube packed with chromic acid on silica immediately before the colorimetric sensor array substantially increases the sensitivity to less-reactive VOCs and improves the limits of detection (LODs) ~300-fold, permitting the detection, identification, and discrimination of 20 commonly found indoor VOC pollutants at both their immediately dangerous to life or health (IDLH) and permissible exposure limit (PEL) concentrations. The LODs of these pollutants were on average 1.4% of their respective PELs.
Article
A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.
Article
Chemical structure curation plays an important role in cheminformatics and QSAR modeling research. Both common sense and the recent investigations described above indicate that chemical record curation should be viewed as a separate and critical component of any cheminformatics research. Treatment of mixtures is not as simple as it appears. The practice of retaining the component with the highest molecular weight or largest number of atoms is common and widely used, but not necessarily the best solution. Manual conversion of all functional groups to some standard forms is too time-consuming and could introduce additional human-dependent nonsystematic errors. ChemAxon's Standardizer is probably the best-known tool to rapidly and efficiently realize chemotype normalizations. Rigorous statistical analysis of any data set assumes that each compound is unique and thus structurally different from all other compounds.
Article
Pyramidal clustering is an extension of the hierarchical clustering method. Our aim is to describe the mathematical properties of the pyramidal clustering model. Most of them generalize the properties of the indexed hierarchies. We first extend the well-known one-to-one correspondence between ultrametrics and indexed hierarchies to the case of pyramidal structures. Then, we turn our attention to the order features of the pyramidal clustering model. It leads to the enumeration of the orders such that for a given binary pyramidal structure, each cluster is a set of consecutive elements.
Article
It is shown that the computational behaviour of a hierarchical sorting-strategy depends on three properties, which are established for five conventional strategies and four measures. The conventional strategies are shown to be simple variants of a single linear system defined by four parameters. A new strategy is defined, enabling continuous variation of intensity of grouping by variation in a single parameter. An Appendix provides specifications of computer programs embodying the new principles.
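The "single linear system defined by four parameters" can be illustrated with the update rule usually called the Lance-Williams formula. The abstract does not spell it out, so the form below, d(k, i∪j) = α_i·d(k,i) + α_j·d(k,j) + β·d(i,j) + γ·|d(k,i) − d(k,j)|, and the parameter values for each strategy are assumptions based on the standard textbook presentation. All function names are illustrative.

```python
def lance_williams(d_ki, d_kj, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance from cluster k to the merged cluster (i u j), computed
    from the three pre-merge distances and four parameters."""
    return (alpha_i * d_ki + alpha_j * d_kj
            + beta * d_ij + gamma * abs(d_ki - d_kj))


def single_link(d_ki, d_kj, d_ij):
    # Assumed parameters (0.5, 0.5, 0, -0.5): reduces to min(d_ki, d_kj).
    return lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, -0.5)


def complete_link(d_ki, d_kj, d_ij):
    # Assumed parameters (0.5, 0.5, 0, +0.5): reduces to max(d_ki, d_kj).
    return lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, 0.5)


def flexible(d_ki, d_kj, d_ij, beta):
    # Single-parameter strategy, assuming alpha_i = alpha_j = (1 - beta) / 2
    # and gamma = 0, so that beta alone tunes the intensity of grouping.
    alpha = (1.0 - beta) / 2.0
    return lance_williams(d_ki, d_kj, d_ij, alpha, alpha, beta, 0.0)


if __name__ == "__main__":
    print(single_link(2.0, 5.0, 3.0))
    print(complete_link(2.0, 5.0, 3.0))
    print(flexible(2.0, 5.0, 3.0, beta=-0.25))
```

Expressing every strategy through one update function means a clustering loop only needs the current distance matrix, never the original data points.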
Article
In this study, we propose a drug design approach which includes docking, molecular fingerprints based cluster analysis, and 'induced' descriptors based receptor-dependent 3D-QSAR. The method was shown to be very useful for screening and modeling structurally diverse data sets of pharmacological interest. Different from other receptor-dependent 3D-QSAR, no ambiguous alignments are required for the construction of the models, and the computational cost is relatively lower. Moreover, 'induced' descriptors were shown to be very powerful in "capturing" ligand-receptor intermolecular interactions. The methodology was validated for eight data sets sampled from the literature and from public databases: human sex hormone-binding globulin, human corticosteroid-binding globulin, anthrax lethal factor, HIV-1 reverse transcriptase, neuraminidase A, thrombin, trypsin, and Pneumocystis carinii dihydrofolate reductase data sets. The resulting models were interpretable; the constructed QSAR equations have high statistical significance and predictive strength; and the drug design solutions were shown to be useful for guiding ligand modification for the development of new inhibitors for a broad range of molecular targets.
Hierarchical clustering algorithms such as Ward's or complete-link are commonly used in compound selection and diversity analysis. Many such applications utilize binary representations of chemical structures, such as MACCS keys or Daylight fingerprints, and dissimilarity measures, such as the Euclidean or the Soergel measure. However, hierarchical clustering algorithms can generate ambiguous results owing to what is known in the cluster analysis literature as the ties in proximity problem, i.e., compounds or clusters of compounds that are equidistant from a compound or cluster in a given collection. Ambiguous ties can occur when clustering only a few hundred compounds, and the larger the number of compounds to be clustered, the greater the chance for significant ambiguity. Namely, as the number of "ties in proximity" increases relative to the total number of proximities, the possibility of ambiguity also increases. To ensure that there are no ambiguous ties, we show by a probabilistic argument that the number of compounds needs to be less than 2n^(1/4), where n is the total number of proximities, and the measure used to generate the proximities creates a uniform distribution without statistically preferred values. The common measures do not produce uniformly distributed proximities, but rather statistically preferred values that tend to increase the number of ties in proximity. Hence, the number of possible proximities and the distribution of statistically preferred values of a similarity measure, given a bit vector representation of a specific length, are directly related to the number of ties in proximities for a given data set. We explore the ties in proximity problem, using a number of chemical collections with varying degrees of diversity, given several common similarity measures and clustering algorithms. Our results are consistent with our probabilistic argument and show that this problem is significant for relatively small compound sets.
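To make the ties in proximity problem concrete, here is a small sketch that counts tied pairwise distances among binary fingerprints. It assumes the binary form of the Soergel distance, 1 − |A∩B| / |A∪B|, and uses five made-up 8-bit fingerprints; neither the data nor the helper names come from the paper. Exact fractions are used so that ties are not hidden by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def soergel(a, b):
    """Soergel distance between two fingerprints given as sets of
    on-bit positions (assumed binary form: 1 - |a & b| / |a | b|)."""
    union = len(a | b)
    if union == 0:
        return Fraction(0)
    return 1 - Fraction(len(a & b), union)


def tie_report(fingerprints):
    """Return (number of pairs, number of distinct distance values,
    number of pairs whose distance is shared with another pair)."""
    dists = [soergel(a, b) for a, b in combinations(fingerprints, 2)]
    counts = Counter(dists)
    tied = sum(c for c in counts.values() if c > 1)
    return len(dists), len(counts), tied


# Hypothetical fingerprints: each set lists the on-bits of an 8-bit vector.
fps = [frozenset(s) for s in (
    {0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 5, 6}, {2, 3, 5, 7}, {0, 4, 6, 7},
)]

if __name__ == "__main__":
    pairs, distinct, tied = tie_report(fps)
    print(pairs, distinct, tied)  # prints: 10 3 9
```

Even in this toy set, ten pairs produce only three distinct distance values, because short bit vectors allow few possible ratios. This is the effect the abstract attributes to statistically preferred values.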
We carried out a topological study of the Space of Chemical Elements, SCE, based on a clustering analysis of 72 elements, each one defined by a vector of 31 properties. We looked for neighborhoods, boundaries, and other topological properties of the SCE. Among the results one sees the well-known patterns of the Periodic Table and relationships such as the Singularity Principle and the Diagonal Relationship, but there appears also a robustness property of some of the better-known families of elements. Alkaline metals and Noble Gases are sets whose neighborhoods have no other elements besides themselves, whereas the topological boundary of the set of metals is formed by semimetallic elements.
Article
We have developed a visualized cluster analysis of protein-ligand interaction (VISCANA) that analyzes the pattern of the interaction of the receptor and ligand on the basis of quantum theory for virtual ligand screening. Kitaura et al. (Chem. Phys. Lett. 1999, 312, 319-324.) have proposed an ab initio fragment molecular orbital (FMO) method by which large molecules such as proteins can be easily treated with chemical accuracy. In the FMO method, a total energy of the molecule is evaluated by summation of fragment energies and interfragment interaction energies (IFIEs). In this paper, we have proposed a cluster analysis using the dissimilarity that is defined as the squared Euclidean distance between IFIEs of two ligands. Although the result of an ordered table by clustering is still a massive collection of numbers, we combine a clustering method with a graphical representation of the IFIEs by representing each data point with colors that quantitatively and qualitatively reflect the IFIEs. We applied VISCANA to a docking study of pharmacophores of the human estrogen receptor alpha ligand-binding domain (57 amino acid residues). By using VISCANA, we could classify even structurally different ligands into functionally similar clusters according to the interaction pattern of a ligand and amino acid residues of the receptor protein. In addition, VISCANA could estimate the correct docking conformation by analyzing patterns of the receptor-ligand interactions of some conformations through the docking calculation.
Article
We discuss three dissimilarity measures between dendrograms defined over the same set: the triples, partition, and cluster indices. All of them decompose the dendrograms into subsets. In the case of triples and partition indices, these subsets correspond to binary partitions containing some clusters, while in the cluster index, a novel dissimilarity method introduced in this paper, the subsets are exclusively clusters. In chemical applications, the dendrograms gather clusters that contain similarity information of the data set under study. Therefore, the cluster index is the most suitable dissimilarity measure between dendrograms resulting from chemical investigation. An application example of the three measures is shown to highlight the advantages of the cluster index over the other two methods in similarity studies. Finally, the cluster index is used to measure the differences between five dendrograms obtained when applying five common hierarchical clustering algorithms on a database of 1000 molecules.
Schummer J (1998) The chemical core of chemistry I: a conceptual approach. HYLE Int J Philos Chem 4:129-162
Bailey KD (1994) Typologies and taxonomies: an introduction to classification techniques. Sage Publications, Inc., Thousand Oaks, pp 34-63 [Lewis-Beck M (series editor): Sage University paper series on quantitative applications in the social sciences, vol 102]
Himberg J, Hyvärinen A (2001) Independent component analysis for binary data: an experimental study. In: Lee TW, Jung TP, Makeig S, Sejnowski TJ (eds) Proceedings of the international workshop on independent component analysis and blind signal separation (ICA2001), pp 552-556
Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics, volume I: alphabetical listing. Wiley-VCH, Weinheim
Graham P (1996) ANSI common Lisp. Prentice Hall, New Jersey
Felsenstein J (2004) Inferring phylogenies. Sinauer Associates Inc., Massachusetts
Restrepo G, Mesa H, Villaveces JL (2006) On the topological sense of chemical sets. J Math Chem 39:363-376