Scalable partitioning and exploration of chemical spaces using geometric hashing

Department of Computational Biology, University of Southern California, Los Angeles, 90089, USA.
Journal of Chemical Information and Modeling (Impact Factor: 4.07). 01/2006; 46(1):321-33. DOI: 10.1021/ci050403o
Source: PubMed

ABSTRACT Virtual screening (VS) has become a preferred tool to augment high-throughput screening(1) and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm(2) called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
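The hashing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which hash family, descriptors, or parameters the paper uses, so the code below assumes the common random-projection (p-stable) scheme for Euclidean distance. The class name `EuclideanLSH`, its methods, and all default parameter values are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy LSH index for dense descriptor vectors (illustrative assumption,
    not the paper's implementation)."""

    def __init__(self, dim, n_tables=8, n_hashes=4, bucket_width=2.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        self.points = []
        self.tables = []
        for _ in range(n_tables):
            # Each hash function is a random Gaussian direction plus an offset.
            funcs = [
                ([rng.gauss(0.0, 1.0) for _ in range(dim)],
                 rng.uniform(0.0, bucket_width))
                for _ in range(n_hashes)
            ]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, x):
        # Project x onto each direction and quantize into buckets of width w;
        # nearby points tend to receive the same tuple of bucket indices.
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, x)) + b) / self.w)
            for a, b in funcs
        )

    def add(self, x):
        """Store a point in every table and return its integer id."""
        idx = len(self.points)
        self.points.append(list(x))
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, x)].append(idx)
        return idx

    def candidates(self, q):
        """Ids of stored points sharing a bucket with q in at least one table."""
        found = set()
        for funcs, buckets in self.tables:
            found.update(buckets.get(self._key(funcs, q), ()))
        return found

    def query(self, q, k=1):
        """Approximate k nearest neighbors: rank only the colliding candidates."""
        ranked = sorted(self.candidates(q),
                        key=lambda i: math.dist(self.points[i], q))
        return ranked[:k]

    def is_sparse(self, q, min_neighbors=5):
        """Heuristic outlier flag: few colliding points suggests a sparse region."""
        return len(self.candidates(q)) < min_neighbors
```

Because a query computes exact distances only for the points that share a bucket with it, the cost depends on bucket occupancy rather than on the size of the whole data set. The `is_sparse` check is one simple way to turn the same structure into an outlier test, under the assumption that a compound with few hash collisions lies in a sparsely populated region of descriptor space.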

  • ABSTRACT: In this article we present an overview of the origin and applications of the activity landscape view of structure-activity relationship (SAR) data as conceived by Maggiora. Within this landscape, different regions exemplify different aspects of SAR trends, ranging from smoothly varying trends to discontinuous trends (also termed activity cliffs). We discuss the various definitions of landscapes and cliffs that have been proposed as well as different approaches to the numerical quantification of a landscape. We then highlight some of the landscape visualization approaches that have been developed, followed by a review of the various applications of activity landscapes and cliffs to topics in medicinal chemistry and SAR analysis.
    Wiley Interdisciplinary Reviews: Computational Molecular Science 11/2012; 2(6). DOI:10.1002/wcms.1087 · 9.04 Impact Factor
  • ABSTRACT: Local alignment-free sequence comparison arises in the context of identifying similar segments of sequences that may not be alignable in the traditional sense. We propose a randomized approximation algorithm that is both accurate and efficient. We show that under D2 and its important variant [Formula: see text] as the similarity measure, local alignment-free comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that reduces this problem to that of finding the bichromatic closest pair (BCP), allowing the properties of the underlying metric to be leveraged. Local alignment-free sequence comparison can be solved by making a quadratic number of alignment-free substring comparisons. We show both theoretically and through empirical results on simulated data that our approximation algorithm requires a subquadratic number of such comparisons and trades only a small amount of accuracy to achieve this efficiency. Therefore, our algorithm can extend the current usage of alignment-free-based methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.
    Journal of computational biology: a journal of computational molecular cell biology 07/2013; 20(7):471-85. DOI:10.1089/cmb.2012.0280 · 1.67 Impact Factor
