Arthur M. Lesk

William Penn University, Worcester, Massachusetts, United States

Publications (155) · 917.8 Total impact

  • ABSTRACT: The problem of superposition of two corresponding vector sets by minimizing their sum-of-squares error under orthogonal transformation is a fundamental task in many areas of science, notably structural molecular biology. This problem can be solved exactly using an algorithm whose time complexity grows linearly with the number of correspondences. This efficient solution has facilitated the widespread use of the superposition task, particularly in studies involving macromolecular structures. This article formally derives a set of sufficient statistics for the least-squares superposition problem. These statistics are additive. This permits a highly efficient (constant time) computation of superpositions (and sufficient statistics) of vector sets that are composed from their constituent vector sets under addition or deletion operations, where the sufficient statistics of the constituent sets are already known (that is, the constituent vector sets have been previously superposed). This results in a drastic improvement in the run time of the methods that commonly superpose vector sets under addition or deletion operations, where previously these operations were carried out ab initio (ignoring the sufficient statistics). We experimentally demonstrate the improvement our work offers in the context of protein structural alignment programs that assemble a reliable structural alignment from well-fitting (substructural) fragment pairs. A C++ library for this task is available online under an open-source license.
    Journal of computational biology: a journal of computational molecular cell biology 02/2015; 22(6). DOI:10.1089/cmb.2014.0154 · 1.67 Impact Factor
  • Source
    ABSTRACT: Motivation: Progress in protein biology depends on the reliability of results from a handful of computational techniques, structural alignments being one. Recent reviews have highlighted substantial inconsistencies and differences between alignment results generated by the ever-growing stock of structural alignment programs. The lack of consensus on how the quality of structural alignments must be assessed has been identified as the main cause for the observed differences. Current methods assess structural alignment quality by constructing a scoring function that attempts to balance conflicting criteria, mainly alignment coverage and fidelity of structures under superposition. This traditional approach to measuring alignment quality, the subject of considerable literature, has failed to solve the problem. Further development along the same lines is unlikely to rectify the current deficiencies in the field. Results: This paper proposes a new statistical framework to assess structural alignment quality and significance based on lossless information compression. This is a radical departure from the traditional approach of formulating scoring functions. It links the structural alignment problem to the general class of statistical inductive inference problems, solved using the information-theoretic criterion of minimum message length. Based on this, we developed an efficient and reliable measure of structural alignment quality, I-value. The performance of I-value is demonstrated in comparison with a number of popular scoring functions, on a large collection of competing alignments. Our analysis shows that I-value provides a rigorous and reliable quantification of structural alignment quality, addressing a major gap in the field. Availability: http://lcb.infotech.monash.edu.au/I-value Contact: arun.konagurthu@monash.edu Supplementary information: Online supplementary data are available at http://lcb.infotech.monash.edu.au/I-value/suppl.html
    Bioinformatics 09/2014; 30(17):i512-i518. DOI:10.1093/bioinformatics/btu460 · 4.62 Impact Factor
  • ABSTRACT: Atomic coordinates in the Worldwide Protein Data Bank (wwPDB) are generally reported to greater precision than the experimental structure determinations have actually achieved. By using information theory and data compression to study the compressibility of protein atomic coordinates, it is possible to quantify the amount of randomness in the coordinate data and thereby to determine the realistic precision of the reported coordinates. On average, the value of each Cα coordinate in a set of selected protein structures solved at a variety of resolutions is good to about 0.1 Å.
    Acta Crystallographica Section D Biological Crystallography 03/2014; 70(Pt 3):904-6. DOI:10.1107/S1399004713031787 · 7.23 Impact Factor
  • Fei-Yi Guo · Arthur M Lesk
    ABSTRACT: Eph receptors comprise the largest known family of receptor tyrosine kinases in mammals. They bind members of a second family, the ephrins. As both Eph receptors and ephrins are membrane bound, interactions permit unusual bidirectional cell-cell signalling. Eph receptors and ephrins each form two classes, A and B, based on sequences, structures and patterns of affinity: Class A Eph receptors bind class A ephrins, and class B Eph receptors bind class B ephrins. The only known exceptions are the receptor EphA4, which can bind ephrinB2 and ephrinB3 in addition to the ephrin-As (1); and EphB2, which can bind ephrin-A5 in addition to the ephrin-Bs (2). A crystal structure is available of the interacting domains of the EphA4-ephrin B2 complex (wwPDB entry 2WO2) (1). In this complex, the ligand-binding domain of EphA4 adopts an EphB-like conformation. To understand why other cross-class EphA receptor-ephrinB complexes do not form, we modeled hypothetical complexes between (1) EphA4-ephrinB1, (2) EphA4-ephrinB3, and (3) EphA2-ephrinB2. We identify particular residues in the interface region, the size variations of which cause steric clashes that prevent formation of the unobserved complexes. The sizes of the sidechains of residues at these positions correlate with the pattern of binding affinity. © 2013 Wiley Periodicals, Inc.
    Proteins Structure Function and Bioinformatics 03/2014; 82(3). DOI:10.1002/prot.24414 · 2.92 Impact Factor
  • ABSTRACT: Proteins are biomolecules of life. They fold into a great variety of three-dimensional (3D) shapes. Underlying these folding patterns are many recurrent structural fragments or building blocks (analogous to 'LEGO® bricks'). This paper reports an innovative statistical inference approach to discover a comprehensive dictionary of protein structural building blocks from a large corpus of experimentally determined protein structures. Our approach is built on the Bayesian and information theoretic criterion of minimum message length. To the best of our knowledge, this work is the first systematic and rigorous treatment of a very important data mining problem that arises in the cross-disciplinary area of structural bioinformatics. The quality of the dictionary we find is demonstrated by its explanatory power - any protein within the corpus of known 3D structures can be dissected into successive regions assigned to fragments from this dictionary. This induces a novel one-dimensional representation of three-dimensional protein folding patterns, suitable for application of the rich repertoire of character-string processing algorithms, for rapid identification of folding patterns of newly determined structures. This paper presents the details of the methodology used to infer the dictionary of building blocks, and is supported by illustrative examples to demonstrate its effectiveness and utility.
    2013 IEEE International Conference on Data Mining (ICDM); 12/2013
  • Arthur M. Lesk · Juliette T.J. Lecomte
    ABSTRACT: The globins are an ancient family of proteins, appearing in archaea, bacteria, and eukarya. The early determination of the crystal structures of globins, and their amino acid sequences, made possible pioneering investigations of protein evolution (at the level of sequence and of structure), the mechanism of allosteric changes, and the implication of mutations in disease. Many homologs from a wide range of species are now known, with a wide range of functions. This chapter surveys what has been learnt about this family, and what topics continue to be active in current research. It describes the basic globin structure and its variations; the taxonomic distribution of different types of globins; and the variety of known functions, focusing on the mechanism of the allosteric change in mammalian tetrameric hemoglobins and on the effects of mutations with clinical consequences.
    Protein Families, 11/2013: pages 207-235; ISBN: 9780470624227
  • Article: Comment on
    Arthur M. Lesk
    Physics of Life Reviews 03/2013; 10(1):33-34. · 9.48 Impact Factor
  • Arun S Konagurthu · Arthur M Lesk
    ABSTRACT: We have developed a concise tableau representation of protein folding patterns, based on the order and contact patterns of elements of secondary structure: helices and strands of sheet. The tableaux provide a database, derived from the protein data bank, minable for studies on the general principles of protein architecture, including investigation of the relationship between local supersecondary structure of proteins and the complete folding topology. This chapter outlines the tableaux representation of protein folding patterns and methods to use them to identify structural and substructural similarities.
    Methods in molecular biology (Clifton, N.J.) 01/2013; 932:51-9. DOI:10.1007/978-1-62703-065-6_4 · 1.29 Impact Factor
  • Arthur M Lesk
    Physics of Life Reviews 10/2012; 10(1). DOI:10.1016/j.plrev.2012.10.007 · 9.48 Impact Factor
  • Source
    Arun S Konagurthu · Arthur M Lesk · Lloyd Allison
    ABSTRACT: Secondary structure underpins the folding pattern and architecture of most proteins. Accurate assignment of the secondary structure elements is therefore an important problem. Although many approximate solutions of the secondary structure assignment problem exist, the statement of the problem has resisted a consistent and mathematically rigorous definition. A variety of comparative studies have highlighted major disagreements in the way the available methods define and assign secondary structure to coordinate data. We report a new method to infer secondary structure based on the Bayesian method of minimum message length inference. It treats assignments of secondary structure as hypotheses that explain the given coordinate data. The method seeks to maximize the joint probability of a hypothesis and the data. There is a natural null hypothesis and any assignment that cannot better it is unacceptable. We developed a program SST based on this approach and compared it with popular programs, such as DSSP and STRIDE among others. Our evaluation suggests that SST gives reliable assignments even on low-resolution structures. http://www.csse.monash.edu.au/~karun/sst.
    Bioinformatics 06/2012; 28(12):i97-105. DOI:10.1093/bioinformatics/bts223 · 4.62 Impact Factor
  • Source
    ABSTRACT: Searching for well-fitting 3D oligopeptide fragments within a large collection of protein structures is an important task central to many analyses involving protein structures. This article reports a new web server, Super, dedicated to the task of rapidly screening the protein data bank (PDB) to identify all fragments that superpose with a query under a prespecified threshold of root-mean-square deviation (RMSD). Super relies on efficiently computing a mathematical bound on the commonly used structural similarity measure, RMSD of superposition. This allows the server to filter out a large proportion of fragments that are unrelated to the query; >99% of the total number of fragments in some cases. For a typical query, Super scans the current PDB containing over 80,500 structures (with ∼40 million potential oligopeptide fragments to match) in under a minute. Super web server is freely accessible from: http://lcb.infotech.monash.edu.au/super.
    Nucleic Acids Research 05/2012; 40(Web Server issue):W334-9. DOI:10.1093/nar/gks436 · 9.11 Impact Factor
  • ABSTRACT: Research article: Quantitative sequence-function relationships in proteins based on gene ontology
  • ABSTRACT: Simple and concise representations of protein-folding patterns provide powerful abstractions for visualizations, comparisons, classifications, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features (that is, helices and strands of sheet) by vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information. Further, traditional methods of simplification crucially depend on the consistency and accuracy of external methods to assign secondary structures to protein coordinate data. Although many methods exist automatically to identify secondary structure, the impreciseness of definitions, along with errors and inconsistencies in experimental structure data, drastically limit their applicability to generate reliable simplified representations, especially for structural comparison. This article introduces a mathematically rigorous algorithm to delineate protein structure using the elegant statistical and inductive inference framework of minimum message length (MML). Our method generates consistent and statistically robust piecewise linear explanations of protein coordinate data, resulting in a powerful and concise representation of the structure. The delineation is completely independent of the approaches of using hydrogen-bonding patterns or inspecting local substructural geometry that the current methods use. Indeed, as is common with applications of the MML criterion, this method is free of parameters and thresholds, in striking contrast to the existing programs which are often beset by them. The analysis of results over a large number of proteins suggests that the method produces consistent delineation of structures that encompasses, among others, the segments corresponding to standard secondary structure. Availability: http://www.csse.monash.edu.au/~karun/pmml.
    Contact: arun.konagurthu@monash.edu; lloyd.allison@monash.edu
    Bioinformatics 07/2011; 27(13):i43-51. DOI:10.1093/bioinformatics/btr240 · 4.62 Impact Factor
  • Source
    ABSTRACT: The Tasmanian devil (Sarcophilus harrisii) is threatened with extinction because of a contagious cancer known as Devil Facial Tumor Disease. The inability to mount an immune response and to reject these tumors might be caused by a lack of genetic diversity within a dwindling population. Here we report a whole-genome analysis of two animals originating from extreme northwest and southeast Tasmania, the maximal geographic spread, together with the genome from a tumor taken from one of them. A 3.3-Gb de novo assembly of the sequence data from two complementary next-generation sequencing platforms was used to identify 1 million polymorphic genomic positions, roughly one-quarter of the number observed between two genetically distant human genomes. Analysis of 14 complete mitochondrial genomes from current and museum specimens, as well as mitochondrial and nuclear SNP markers in 175 animals, suggests that the observed low genetic diversity in today's population preceded the Devil Facial Tumor Disease outbreak by at least 100 y. Using a genetically characterized breeding stock based on the genome sequence will enable preservation of the extant genetic diversity in future Tasmanian devil populations.
    Proceedings of the National Academy of Sciences 06/2011; 108(30):12348-53. DOI:10.1073/pnas.1102838108 · 9.81 Impact Factor
  • Source
    ABSTRACT: A central tenet of structural biology is that related proteins of common function share structural similarity. This has key practical consequences for the derivation and analysis of protein structures, and is exploited by the process of "molecular sieving" whereby a common core is progressively distilled from a comparison of two or more protein structures. This paper reports a novel web server for "sieving" of protein structures, based on the multiple structural alignment program MUSTANG. "Sieved" models are generated from MUSTANG-generated multiple alignment and superpositions by iteratively filtering out noisy residue-residue correspondences, until the resultant correspondences in the models are optimally "superposable" under a threshold of RMSD. This residue-level sieving is also accompanied by iterative elimination of the poorly fitting structures from the input ensemble. Therefore, by varying the thresholds of RMSD and the cardinality of the ensemble, multiple sieved models are generated for a given multiple alignment and superposition from MUSTANG. To aid the identification of structurally conserved regions of functional importance in an ensemble of protein structures, Lesk-Hubbard graphs are generated, plotting the number of residue correspondences in a superposition as a function of its corresponding RMSD. The conserved "core" (or typically active site) shows a linear trend, which becomes exponential as divergent parts of the structure are included into the superposition. The application addresses two fundamental problems in structural biology: first, the identification of common substructures among structurally related proteins--an important problem in characterization and prediction of function; second, generation of sieved models with demonstrated uses in protein crystallographic structure determination using the technique of Molecular Replacement.
    PLoS ONE 04/2010; 5(4):e10048. DOI:10.1371/journal.pone.0010048 · 3.23 Impact Factor
  • Source
    Arun S Konagurthu · Arthur M Lesk
    ABSTRACT: Comparing and classifying protein folding patterns allows organizing the known structures, structure search and retrieval, and investigation of general principles of protein architecture. We have been developing a concise tableau representation of protein folding patterns, based on the order and contact patterns of elements of secondary structure: helices and strands of sheet (Lesk, 1995; Kamat and Lesk, 2007; Konagurthu et al., 2008). The tableaux provide a database, derived from the world-wide protein data bank, mineable in studies of protein architecture, including: (i) determination of statistical properties of secondary structure contacts in an unbiased set of protein domains, (ii) investigations of the range of, and relationships among, protein topologies, (iii) investigation of the relationship between local structure of proteins and the complete folding topology, (iv) potential for fold identification from amino acid sequence, and (v) the basis for a complete enumeration of possible protein folding patterns, which can be compared with the corpus of known structures.
    Journal of Molecular Recognition 03/2010; 23(2):253-7. DOI:10.1002/jmr.1006 · 2.34 Impact Factor
  • Source
    ABSTRACT: We report the first two complete mitochondrial genome sequences of the thylacine (Thylacinus cynocephalus), or so-called Tasmanian tiger, extinct since 1936. The thylacine's phylogenetic position within australidelphian marsupials has long been debated, and here we provide strong support for the thylacine's basal position in Dasyuromorphia, aided by mitochondrial genome sequence that we generated from the extant numbat (Myrmecobius fasciatus). Surprisingly, both of our thylacine sequences differ by 11%-15% from putative thylacine mitochondrial genes in GenBank, with one of our samples originating from a direct offspring of the previously sequenced individual. Our data sample each mitochondrial nucleotide an average of 50 times, thereby providing the first high-fidelity reference sequence for thylacine population genetics. Our two sequences differ in only five nucleotides out of 15,452, hinting at a very low genetic diversity shortly before extinction. Despite the samples' heavy contamination with bacterial and human DNA and their temperate storage history, we estimate that as much as one-third of the total DNA in each sample is from the thylacine. The microbial content of the two thylacine samples was subjected to metagenomic analysis, and showed striking differences between a wild-captured individual and a born-in-captivity one. This study therefore adds to the growing evidence that extensive sequencing of museum collections is both feasible and desirable, and can yield complete genomes.
    Genome Research 02/2009; 19(2):213-20. DOI:10.1101/gr.082628.108 · 13.85 Impact Factor
  • Source
    Biophysical Journal 02/2009; 96(3). DOI:10.1016/j.bpj.2008.12.3444 · 3.97 Impact Factor

Publication Stats

14k Citations
917.80 Total Impact Points

Institutions

  • 2014
    • William Penn University
      Worcester, Massachusetts, United States
  • 2005–2014
    • Pennsylvania State University
      • Department of Biochemistry and Molecular Biology
      University Park, Pennsylvania, United States
    • University of Vic
      Vic, Catalonia, Spain
  • 1999–2013
    • Monash University (Australia)
      • Department of Biochemistry and Molecular Biology
      Melbourne, Victoria, Australia
  • 2011
    • University of Melbourne
      • Department of Computing and Information Systems
      Melbourne, Victoria, Australia
  • 2008
    • Park University
      Parkville, Missouri, United States
  • 1993–2008
    • University of Cambridge
      • Department of Haematology
      Cambridge, England, United Kingdom
  • 2000–2003
    • Cambridge Institute for Medical Research
      • Department of Haematology
      Cambridge, England, United Kingdom
  • 1992
    • Universität Basel
      Bâle, Basel-City, Switzerland
  • 1988–1992
    • European Molecular Biology Laboratory
      Heidelberg, Baden-Württemberg, Germany
  • 1986–1989
    • MRC Harwell
      Oxford, England, United Kingdom
    • University College London
      London, England, United Kingdom
  • 1987
    • Université Paris-Sud 11
      Orsay, Île-de-France, France
  • 1980–1985
    • Fairleigh Dickinson University
      Teaneck, New Jersey, United States