Scalable Partitioning and Exploration of Chemical Spaces Using Geometric Hashing

Department of Computational Biology, University of Southern California, Los Angeles, 90089, USA.
Journal of Chemical Information and Modeling (Impact Factor: 3.74). 01/2006; 46(1):321-33. DOI: 10.1021/ci050403o
Source: PubMed


Virtual screening (VS) has become a preferred tool to augment high-throughput screening (1) and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm (2) called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
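The hashing idea described in the abstract can be illustrated with a minimal sketch. It assumes the common random-projection scheme for Euclidean distance, h(x) = floor((a · x + b) / w), which the abstract does not spell out; the class name `EuclideanLSH`, its parameters (`n_tables`, `n_hashes`, `bucket_width`), and the default values are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np
from collections import defaultdict


class EuclideanLSH:
    """Toy locality-sensitive hash index for Euclidean distance (illustrative only)."""

    def __init__(self, dim, n_tables=8, n_hashes=4, bucket_width=4.0, seed=0):
        rng = np.random.default_rng(seed)
        # One Gaussian projection matrix (n_hashes x dim) and one offset vector per table.
        self.a = rng.normal(size=(n_tables, n_hashes, dim))
        self.b = rng.uniform(0.0, bucket_width, size=(n_tables, n_hashes))
        self.w = bucket_width
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = None

    def _keys(self, x):
        # h(x) = floor((a . x + b) / w); the n_hashes values form one bucket key per table.
        codes = np.floor((self.a @ x + self.b) / self.w).astype(int)
        return [tuple(row) for row in codes]

    def fit(self, points):
        # Store every descriptor vector's index in its bucket in each table.
        self.points = np.asarray(points, dtype=float)
        for i, x in enumerate(self.points):
            for table, key in zip(self.tables, self._keys(x)):
                table[key].append(i)
        return self

    def candidates(self, x):
        # Union of the buckets the query falls into, across all tables.
        x = np.asarray(x, dtype=float)
        found = set()
        for table, key in zip(self.tables, self._keys(x)):
            found.update(table.get(key, ()))
        return sorted(found)

    def query(self, x, k=1):
        # Rank only the candidates by exact distance; the rest of the data is never touched.
        x = np.asarray(x, dtype=float)
        cand = self.candidates(x)
        if not cand:
            return []
        dists = np.linalg.norm(self.points[cand] - x, axis=1)
        order = np.argsort(dists)[:k]
        return [(cand[j], float(dists[j])) for j in order]
```

Because only the colliding buckets are scanned, a query avoids a full pass over the data set, at the cost of possibly missing some true neighbors. In the same spirit, the size of `candidates(x)` could serve as a crude indicator of whether a compound sits in a sparse region, although that is an assumption of this sketch rather than the criterion used in the paper.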



Available from: Ting Chen
  • Source
    • "As noted in subsection III, there is extensive literature in this area, e.g. [16]. While random projection per se will not guarantee a bijection of best match in original and in lower dimensional spaces, our use of projection here is effectively a hashing method. "
    ABSTRACT: We describe many vantage points on the Baire metric and its use in clustering data, or its use in preprocessing and structuring data in order to support search and retrieval operations. In some cases, we proceed directly to clusters and do not directly determine the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We offer insights also on practical implications of precision of data measurement. As a mechanism for treating multidimensional data, including very high dimensional data, we use random projections.
    P-Adic Numbers Ultrametric Analysis and Applications 11/2011; 4(1). DOI:10.1134/S2070046612010062
  • Source
    • "Traditionally this approach has a running time that is quadratic with the number of points in the dataset, though this can be improved by use of data structures such as kd-trees. We also investigated[25] the use of an approximate nearest neighbor detection technique that runs in sublinear time, allowing this approach to be applied to large datasets. Our initial investigation focused on nearest neighbor detection using the Euclidean metric. "
    ABSTRACT: Some of the latest trends in cheminformatics, computation, and the world wide web are reviewed with predictions of how these are likely to impact the field of cheminformatics in the next five years. The vision and some of the work of the Chemical Informatics and Cyberinfrastructure Collaboratory at Indiana University are described, which we base around the core concepts of e-Science and cyberinfrastructure that have proven successful in other fields. Our chemical informatics cyberinfrastructure is realized by building a flexible, generic infrastructure for cheminformatics tools and databases, exporting "best of breed" methods as easily-accessible web APIs for cheminformaticians, scientists, and researchers in other disciplines, and hosting a unique chemical informatics education program aimed at scientists and cheminformatics practitioners in academia and industry.
    In silico biology 01/2011; 11(1-2):41-60. DOI:10.3233/CI-2008-0015
  • Source
    ABSTRACT: Libraries of chemical structures are used in a variety of cheminformatics tasks such as virtual screening and QSAR modeling and are generally characterized using molecular descriptors. When working with libraries it is useful to understand the distribution of compounds in the space defined by a set of descriptors. We present a simple approach to the analysis of the spatial distribution of the compounds in a library in general and outlier detection in particular based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, appear to follow a logistic model for any given descriptor space, which we justify theoretically for the 2D case. The method can be applied to data sets of arbitrary dimensions. The R-NN curves provide a visual method to easily detect compounds lying in a sparse region of a given descriptor space. We also present a method to numerically characterize the R-NN curves thus allowing identification of outliers in a single plot.
    Journal of Chemical Information and Modeling 07/2006; 46(4):1713-22. DOI:10.1021/ci060013h · 3.74 Impact Factor
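The neighbor-counting idea in the last abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published R-NN procedure: the function name `rnn_curves`, the choice of evenly spaced radii up to the largest pairwise distance, and the brute-force distance matrix are all assumptions made here for clarity.

```python
import numpy as np


def rnn_curves(points, n_radii=20):
    """Count, for every point, its neighbors within a series of increasing radii."""
    points = np.asarray(points, dtype=float)
    # Brute-force pairwise Euclidean distances (O(n^2) memory; fine for a small sketch).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Evenly spaced radii from just above 0 up to the largest pairwise distance.
    radii = np.linspace(0.0, dist.max(), n_radii + 1)[1:]
    # counts[i, j] = number of OTHER points within radii[j] of point i (self excluded).
    counts = (dist[:, :, None] <= radii[None, None, :]).sum(axis=1) - 1
    return radii, counts
```

Each row of `counts` is one curve. A point in a dense region gains neighbors at small radii, whereas a point in a sparse region stays near zero until the radius becomes large, so a late-rising curve flags a potential outlier.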