JEDA: Joint entropy diversity analysis. An information-theoretic method for choosing diverse and representative subsets from combinatorial libraries

Graduate Program in Bioinformatics and Systems Biology, Boston, MA 02215, USA.
Molecular Diversity (Impact Factor: 1.9). 09/2006; 10(3):333-9. DOI: 10.1007/s11030-006-9042-4
Source: PubMed


The joint entropy-based diversity analysis (JEDA) program is a new method for selecting representative subsets of compounds from combinatorial libraries. As in other cell-based diversity analyses, a set of chemical descriptors is used to partition the chemical space of a library of compounds; however, rather than applying a fixed metric to pick one compound from each partition, a Shannon-entropy-based scoring function embedded in a probabilistic search algorithm determines the representative subset. This approach enables the selection of compounds that are not only diverse but that also reflect the densities of chemical space occupied by the original library. Additionally, JEDA lets the user specify the size of the subset to be created, so that constraints on time and chemical reagents can be taken into account. Subsets created from a chemical library by JEDA are compared with subsets obtained using other partition-based diversity analyses, namely principal components analysis and median partitioning, on a combinatorial library derived from the Comprehensive Medicinal Chemistry database.
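To illustrate the general idea (not the authors' implementation, whose scoring function and search procedure are only summarized above), the sketch below assumes compounds have already been binned into cells of a descriptor-partitioned chemical space and selects a fixed-size subset whose cell occupancy mirrors the library's density. The score is a stand-in relative-entropy (Kullback-Leibler) term and the search is a simple Monte Carlo swap loop; all names and parameters are hypothetical.

# Hypothetical sketch of entropy-guided subset selection; not the JEDA
# reference implementation. Compounds are assumed to be pre-binned into
# cells of a descriptor-partitioned chemical space.
import math
import random
from collections import Counter

def cell_distribution(cells):
    # Normalized occupancy of each cell (an empirical probability distribution).
    counts = Counter(cells)
    total = len(cells)
    return {c: n / total for c, n in counts.items()}

def score(subset_idx, cells, library_dist):
    # Stand-in score: relative entropy (KL divergence) of the subset's cell
    # occupancy from the library's density; lower means the subset mirrors
    # how densely the library populates each region of chemical space.
    sub_dist = cell_distribution([cells[i] for i in subset_idx])
    return sum(p * math.log(p / library_dist[c]) for c, p in sub_dist.items())

def select_subset(cells, k, n_steps=10000, seed=0):
    # Simple probabilistic search: random swaps in and out of the subset,
    # keeping swaps that improve the score (greedy acceptance here; the
    # published search may also accept some worse moves).
    rng = random.Random(seed)
    library_dist = cell_distribution(cells)
    subset = rng.sample(range(len(cells)), k)
    best = score(subset, cells, library_dist)
    for _ in range(n_steps):
        candidate_in = rng.randrange(len(cells))
        if candidate_in in subset:
            continue
        trial = subset[:]
        trial[rng.randrange(k)] = candidate_in
        s = score(trial, cells, library_dist)
        if s < best:
            subset, best = trial, s
    return subset

# Usage: cells[i] is the partition (cell) label of compound i.
cells = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
print(select_subset(cells, k=4))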

Cited by:
  • "This common framework can be especially important when incorporating categorical data, such as the classification of a type of cancer, into the analysis of a continuous data set, such as mRNA expression microarrays. A variety of dimension reduction problems has already been phrased using high-dimensional information theoretic statistics (Landon and Schaus, 2006; Peng et al., 2005; Slonim et al., 2005). Notably, the maximum-dependency criterion [maximizing the mutual information (MI) between the feature set and the output] has been proposed for FS (Peng et al., 2005)."
    ABSTRACT: The study of complex biological relationships is aided by large and high-dimensional data sets whose analysis often involves dimension reduction to highlight representative or informative directions of variation. In principle, information theory provides a general framework for quantifying complex statistical relationships for dimension reduction. Unfortunately, direct estimation of high-dimensional information theoretic quantities, such as entropy and mutual information (MI), is often unreliable given the relatively small sample sizes available for biological problems. Here, we develop and evaluate a hierarchy of approximations for high-dimensional information theoretic statistics from associated low-order terms, which can be more reliably estimated from limited samples. Due to a relationship between this metric and the minimum spanning tree over a graph representation of the system, we refer to these approximations as MIST (Maximum Information Spanning Trees). The MIST approximations are examined in the context of synthetic networks with analytically computable entropies and using experimental gene expression data as a basis for the classification of multiple cancer types. The approximations result in significantly more accurate estimates of entropy and MI, and also correlate better with biological classification error than direct estimation and another low-order approximation, minimum-redundancy-maximum-relevance (mRMR). Software to compute the entropy approximations described here is available as Supplementary Material. Supplementary data are available at Bioinformatics online.
    Bioinformatics 04/2009; 25(9):1165-72. DOI:10.1093/bioinformatics/btp109 · 4.98 Impact Factor
  • ABSTRACT: Recently, Landon and Schaus [1] published a very interesting article providing an algorithm for selecting a representative subset from a molecular library. The aim of this note is to argue that some aspects discussed in the article remain unclear despite its having passed a review process; the manuscript check was not complete and the article was, possibly, published prematurely. Several concepts are not well defined or explained in the text. This note attempts to focus on some of those points, presenting them to the reader in the hope of opening the door to a more complete discussion and understanding of the article. At the same time, this is a good opportunity to recall the relevance of this new and promising method.
    Molecular Diversity 06/2007; 11(2):113-4. DOI:10.1007/s11030-007-9060-x · 1.90 Impact Factor
  • ABSTRACT: Chemical libraries or databases are collections of compounds that can be screened (virtually or experimentally) in order to discover drug candidates. These libraries vary widely in their content (description of structures, molecular descriptors, literature links, etc.) and their size (number of compounds). Over the last decade, a large number of papers have been published on the subject. In this review, we summarize these studies by introducing different types of compound collections and reviewing the main kinds of software used to manipulate them. We present the descriptors, which play a fundamental role in characterising the molecules, and describe how they are used to define the molecular filters applied before screening in order to obtain both a representation of chemical spaces and selections of subsets by diversity or similarity.
    Current Computer - Aided Drug Design 09/2008; 4(3):156-168. DOI:10.2174/157340908785747410 · 1.27 Impact Factor
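For the Bioinformatics (2009) MIST entry above, the central idea, approximating a high-dimensional joint entropy from low-order terms via a spanning tree over the pairwise mutual-information graph, can be sketched as follows. This is a Chow-Liu-style illustration with hypothetical names, not the authors' MIST software.

# Hypothetical sketch of a second-order, spanning-tree entropy approximation
# in the spirit of MIST; not the authors' implementation.
import math
from collections import Counter
from itertools import combinations

def entropy(column):
    # Shannon entropy (in nats) of one discrete variable from empirical counts.
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())

def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations.
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def tree_entropy_estimate(data):
    # Approximate H(X1..Xn) as sum_i H(Xi) minus the MI carried by the edges
    # of the maximum spanning tree of the pairwise-MI graph (Kruskal-style).
    n_vars = len(data)
    edges = sorted(((mutual_information(data[i], data[j]), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    total = sum(entropy(col) for col in data)
    for mi, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two components: keep it in the tree
            parent[ri] = rj
            total -= mi
    return total

# Usage: each inner list is one discrete variable observed over the same samples.
data = [[0, 0, 1, 1, 1, 0], [0, 0, 1, 1, 0, 0], [1, 0, 1, 0, 1, 0]]
print(tree_entropy_estimate(data))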