JEDA: Joint entropy diversity analysis. An information-theoretic method for choosing diverse and representative subsets from combinatorial libraries

Graduate Program in Bioinformatics and Systems Biology, Boston, MA 02215, USA.
Molecular Diversity (Impact Factor: 1.9). 09/2006; 10(3):333-9. DOI: 10.1007/s11030-006-9042-4
Source: PubMed


The joint entropy-based diversity analysis (JEDA) program is a new method for selecting representative subsets of compounds from combinatorial libraries. As in other cell-based diversity analyses, a set of chemical descriptors is used to partition the chemical space of a compound library; however, instead of applying the usual metrics for choosing a compound from each partition, JEDA determines a representative subset using a Shannon entropy-based scoring function implemented in a probabilistic search algorithm. This approach enables the selection of compounds that are not only diverse but also reflect the densities of chemical space occupied by the original library. Additionally, JEDA lets the user define the size of the subset, so that constraints on time and chemical reagents can be taken into account. Subsets created by JEDA are compared with subsets obtained using other partition-based diversity analyses, namely principal components analysis and median partitioning, on a combinatorial library derived from the Comprehensive Medical Chemistry Dataset.
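
The general idea of scoring a subset by the Shannon entropy of its cell occupancy can be illustrated with a minimal sketch. This is not the JEDA implementation: the function names (`assign_cells`, `occupancy_entropy`, `select_subset`), the equal-width binning, and the simple random-swap search are all assumptions made for illustration, standing in for the scoring function and probabilistic search that the abstract mentions but does not specify. Note also that this toy score only rewards even coverage of cells; it does not attempt to reproduce the density-representing behaviour claimed for JEDA.

```python
import math
import random
from collections import Counter


def assign_cells(descriptors, bins_per_axis=4):
    """Map each compound's descriptor vector to a grid cell (tuple of bin indices).

    Assumption: equal-width bins per descriptor; the paper's partitioning may differ.
    """
    n_dims = len(descriptors[0])
    lows = [min(row[d] for row in descriptors) for d in range(n_dims)]
    highs = [max(row[d] for row in descriptors) for d in range(n_dims)]
    cells = []
    for row in descriptors:
        cell = []
        for d in range(n_dims):
            span = highs[d] - lows[d]
            # A constant descriptor collapses to bin 0.
            idx = 0 if span == 0 else int((row[d] - lows[d]) / span * bins_per_axis)
            # The maximum value would land one past the last bin, so clamp it.
            cell.append(min(idx, bins_per_axis - 1))
        cells.append(tuple(cell))
    return cells


def occupancy_entropy(cells, subset):
    """Shannon entropy (in bits) of the cell-occupancy distribution of a subset."""
    counts = Counter(cells[i] for i in subset)
    total = len(subset)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_subset(cells, size, iterations=2000, seed=0):
    """Random-swap hill climb: keep a swap if the entropy score does not drop.

    Hypothetical stand-in for a probabilistic search; requires size <= len(cells).
    """
    rng = random.Random(seed)
    n = len(cells)
    subset = rng.sample(range(n), size)
    score = occupancy_entropy(cells, subset)
    for _ in range(iterations):
        pos = rng.randrange(size)
        candidate = rng.randrange(n)
        if candidate in subset:
            continue  # keep the subset free of duplicate compounds
        trial = subset[:]
        trial[pos] = candidate
        trial_score = occupancy_entropy(cells, trial)
        if trial_score >= score:
            subset, score = trial, trial_score
    return subset, score
```

The user-defined subset size described in the abstract corresponds to the `size` argument here; the library itself would be supplied as a list of numeric descriptor vectors, one per compound.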

  • "This common framework can be especially important when incorporating categorical data, such as the classification of a type of cancer, into the analysis of a continuous data set, such as mRNA expression microarrays. A variety of dimension reduction problems has already been phrased using high-dimensional information theoretic statistics (Landon and Schaus, 2006; Peng et al., 2005; Slonim et al., 2005). Notably, the maximum-dependency criterion [maximizing the mutual information (MI) between the feature set and the output] has been proposed for FS (Peng et al., 2005)."
    ABSTRACT: The study of complex biological relationships is aided by large and high-dimensional data sets whose analysis often involves dimension reduction to highlight representative or informative directions of variation. In principle, information theory provides a general framework for quantifying complex statistical relationships for dimension reduction. Unfortunately, direct estimation of high-dimensional information theoretic quantities, such as entropy and mutual information (MI), is often unreliable given the relatively small sample sizes available for biological problems. Here, we develop and evaluate a hierarchy of approximations for high-dimensional information theoretic statistics from associated low-order terms, which can be more reliably estimated from limited samples. Due to a relationship between this metric and the minimum spanning tree over a graph representation of the system, we refer to these approximations as MIST (Maximum Information Spanning Trees). The MIST approximations are examined in the context of synthetic networks with analytically computable entropies and using experimental gene expression data as a basis for the classification of multiple cancer types. The approximations result in significantly more accurate estimates of entropy and MI, and also correlate better with biological classification error than direct estimation and another low-order approximation, minimum-redundancy-maximum-relevance (mRMR). Software to compute the entropy approximations described here is available as Supplementary Material. Supplementary data are available at Bioinformatics online.
    Bioinformatics 04/2009; 25(9):1165-72. DOI:10.1093/bioinformatics/btp109 · 4.98 Impact Factor
  • ABSTRACT: Recently, Landon and Schaus [1] published a very interesting article providing an algorithm that allows the selection of a representative subset taken from a molecular library. The aim of this note is to argue that some aspects discussed in the article remain unclear despite it having passed a review process. That is because the manuscript check was not complete and, possibly, it was published prematurely. There are several concepts that are not well defined or explained in the text. This note will attempt to focus on some of those points, transmitting them to the reader with the hope that a door is being opened to a more complete discussion and understanding of the article. At the same time, this is a good opportunity to recall the relevance of this new and promising method.
    Molecular Diversity 06/2007; 11(2):113-4. DOI:10.1007/s11030-007-9060-x · 1.90 Impact Factor
  • ABSTRACT: High-throughput screening (HTS) is a well-established technology which can test up to several million compounds in a few weeks. Despite these appealing capabilities, available resources and high costs may limit the number of molecules screened, making diversity analysis a method of choice to design and prioritize screening libraries. With a constantly increasing number of molecules available for screening, chemical space has become a key concept for visualizing, analyzing, and comparing chemical libraries. In this first article, we present a new method to build delimited reference chemical subspaces (DRCS). A set of 16 million screening compounds from 73 chemical providers has been gathered, resulting in a database of 6.63 million standardized and unique molecules. These molecules have been used to create three DRCS using three different sets of chemical descriptors. A robust principal component analysis model for each space has been obtained, whereby molecules are projected in a reduced two-dimensional viewable space. The specificity of our approach is that each reduced space has been delimited by a representative contour encompassing a very large proportion of molecules and reflecting its overall shape. The methodology is illustrated by mapping and comparing various chemical libraries. Several tools used in these studies are made freely available, thus enabling any user to compute DRCS matching specific requirements.
    Journal of Chemical Information and Modeling 08/2011; 51(8):1762-74. DOI:10.1021/ci200051r · 3.74 Impact Factor