[Show abstract][Hide abstract] ABSTRACT: Progress in protein biology depends on the reliability of results from a handful of computational techniques, structural alignments being one. Recent reviews have highlighted substantial inconsistencies and differences between alignment results generated by the ever-growing stock of structural alignment programs. The lack of consensus on how the quality of structural alignments must be assessed has been identified as the main cause for the observed differences. Current methods assess structural alignment quality by constructing a scoring function that attempts to balance conflicting criteria, mainly alignment coverage and fidelity of structures under superposition. This traditional approach to measuring alignment quality, the subject of considerable literature, has failed to solve the problem. Further development along the same lines is unlikely to rectify the current deficiencies in the field.
[Show abstract][Hide abstract] ABSTRACT: Atomic coordinates in the Worldwide Protein Data Bank (wwPDB) are generally reported to greater precision than the experimental structure determinations have actually achieved. By using information theory and data compression to study the compressibility of protein atomic coordinates, it is possible to quantify the amount of randomness in the coordinate data and thereby to determine the realistic precision of the reported coordinates. On average, the value of each C(α) coordinate in a set of selected protein structures solved at a variety of resolutions is good to about 0.1 Å.
[Show abstract][Hide abstract] ABSTRACT: Proteins are biomolecules of life. They fold into a great variety of
three-dimensional (3D) shapes. Underlying these folding patterns are many
recurrent structural fragments or building blocks (analogous to `LEGO bricks').
This paper reports an innovative statistical inference approach to discover a
comprehensive dictionary of protein structural building blocks from a large
corpus of experimentally determined protein structures. Our approach is built
on the Bayesian and information-theoretic criterion of minimum message length.
To the best of our knowledge, this work is the first systematic and rigorous
treatment of a very important data mining problem that arises in the
cross-disciplinary area of structural bioinformatics. The quality of the
dictionary we find is demonstrated by its explanatory power -- any protein
within the corpus of known 3D structures can be dissected into successive
regions assigned to fragments from this dictionary. This induces a novel
one-dimensional representation of three-dimensional protein folding patterns,
suitable for application of the rich repertoire of character-string processing
algorithms, for rapid identification of folding patterns of newly-determined
structures. This paper presents the details of the methodology used to infer
the dictionary of building blocks, and is supported by illustrative examples to
demonstrate its effectiveness and utility.
[Show abstract][Hide abstract] ABSTRACT: We have developed a concise tableau representation of protein folding patterns, based on the order and contact patterns of elements of secondary structure: helices and strands of sheet. The tableaux provide a database, derived from the protein data bank, minable for studies on the general principles of protein architecture, including investigation of the relationship between local supersecondary structure of proteins and the complete folding topology. This chapter outlines the tableaux representation of protein folding patterns and methods to use them to identify structural and substructural similarities.
[Show abstract][Hide abstract] ABSTRACT: Secondary structure underpins the folding pattern and architecture of most proteins. Accurate assignment of the secondary structure elements is therefore an important problem. Although many approximate solutions of the secondary structure assignment problem exist, the statement of the problem has resisted a consistent and mathematically rigorous definition. A variety of comparative studies have highlighted major disagreements in the way the available methods define and assign secondary structure to coordinate data.
We report a new method to infer secondary structure based on the Bayesian method of minimum message length inference. It treats assignments of secondary structure as hypotheses that explain the given coordinate data. The method seeks to maximize the joint probability of a hypothesis and the data. There is a natural null hypothesis and any assignment that cannot better it is unacceptable. We developed a program SST based on this approach and compared it with popular programs, such as DSSP and STRIDE among others. Our evaluation suggests that SST gives reliable assignments even on low-resolution structures.
[Show abstract][Hide abstract] ABSTRACT: Searching for well-fitting 3D oligopeptide fragments within a large collection of protein structures is an important task central to many analyses involving protein structures. This article reports a new web server, Super, dedicated to the task of rapidly screening the protein data bank (PDB) to identify all fragments that superpose with a query under a prespecified threshold of root-mean-square deviation (RMSD). Super relies on efficiently computing a mathematical bound on the commonly used structural similarity measure, RMSD of superposition. This allows the server to filter out a large proportion of fragments that are unrelated to the query; >99% of the total number of fragments in some cases. For a typical query, Super scans the current PDB containing over 80,500 structures (with ∼40 million potential oligopeptide fragments to match) in under a minute. Super web server is freely accessible from: http://lcb.infotech.monash.edu.au/super.
Nucleic Acids Research 05/2012; 40(Web Server issue):W334-9. · 8.81 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Simple and concise representations of protein-folding patterns provide powerful abstractions for visualizations, comparisons, classifications, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features-that is, helices and strands of sheet-by vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information. Further, traditional methods of simplification crucially depend on the consistency and accuracy of external methods to assign secondary structures to protein coordinate data. Although many methods exist automatically to identify secondary structure, the impreciseness of definitions, along with errors and inconsistencies in experimental structure data, drastically limit their applicability to generate reliable simplified representations, especially for structural comparison. This article introduces a mathematically rigorous algorithm to delineate protein structure using the elegant statistical and inductive inference framework of minimum message length (MML). Our method generates consistent and statistically robust piecewise linear explanations of protein coordinate data, resulting in a powerful and concise representation of the structure. The delineation is completely independent of the approaches of using hydrogen-bonding patterns or inspecting local substructural geometry that the current methods use. Indeed, as is common with applications of the MML criterion, this method is free of parameters and thresholds, in striking contrast to the existing programs which are often beset by them. The analysis of results over a large number of proteins suggests that the method produces consistent delineation of structures that encompasses, among others, the segments corresponding to standard secondary structure. AVAILABILITY: http://www.csse.monash.edu.au/~karun/pmml.
[Show abstract][Hide abstract] ABSTRACT: The Tasmanian devil (Sarcophilus harrisii) is threatened with extinction because of a contagious cancer known as Devil Facial Tumor Disease. The inability to mount an immune response and to reject these tumors might be caused by a lack of genetic diversity within a dwindling population. Here we report a whole-genome analysis of two animals originating from extreme northwest and southeast Tasmania, the maximal geographic spread, together with the genome from a tumor taken from one of them. A 3.3-Gb de novo assembly of the sequence data from two complementary next-generation sequencing platforms was used to identify 1 million polymorphic genomic positions, roughly one-quarter of the number observed between two genetically distant human genomes. Analysis of 14 complete mitochondrial genomes from current and museum specimens, as well as mitochondrial and nuclear SNP markers in 175 animals, suggests that the observed low genetic diversity in today's population preceded the Devil Facial Tumor Disease disease outbreak by at least 100 y. Using a genetically characterized breeding stock based on the genome sequence will enable preservation of the extant genetic diversity in future Tasmanian devil populations.
Proceedings of the National Academy of Sciences 06/2011; 108(30):12348-53. · 9.81 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Comparing and classifying protein folding patterns allows organizing the known structures, structure search and retrieval, and investigation of general principles of protein architecture. We have been developing a concise tableau representation of protein folding patterns, based on the order and contact patterns of elements of secondary structure: helices and strands of sheet (Lesk, 1995; Kamat and Lesk, 2007; Konagurthu et al., 2008). The tableaux provide a database, derived from the world-wide protein data bank, mineable in studies of protein architecture, including: (i) determination of statistical properties of secondary structure contacts in an unbiased set of protein domains, (ii) investigations of the range of, and relationships among, protein topologies, (iii) investigation of the relationship between local structure of proteins and the complete folding topology, (iv) potential for fold identification from amino acid sequence, and (v) the basis for a complete enumeration of possible protein folding patterns, which can be compared with the corpus of known structures.
Journal of Molecular Recognition 02/2010; 23(2):253-7. · 3.01 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: A central tenet of structural biology is that related proteins of common function share structural similarity. This has key practical consequences for the derivation and analysis of protein structures, and is exploited by the process of "molecular sieving" whereby a common core is progressively distilled from a comparison of two or more protein structures. This paper reports a novel web server for "sieving" of protein structures, based on the multiple structural alignment program MUSTANG.
"Sieved" models are generated from MUSTANG-generated multiple alignment and superpositions by iteratively filtering out noisy residue-residue correspondences, until the resultant correspondences in the models are optimally "superposable" under a threshold of RMSD. This residue-level sieving is also accompanied by iterative elimination of the poorly fitting structures from the input ensemble. Therefore, by varying the thresholds of RMSD and the cardinality of the ensemble, multiple sieved models are generated for a given multiple alignment and superposition from MUSTANG. To aid the identification of structurally conserved regions of functional importance in an ensemble of protein structures, Lesk-Hubbard graphs are generated, plotting the number of residue correspondences in a superposition as a function of its corresponding RMSD. The conserved "core" (or typically active site) shows a linear trend, which becomes exponential as divergent parts of the structure are included into the superposition.
The application addresses two fundamental problems in structural biology: first, the identification of common substructures among structurally related proteins--an important problem in characterization and prediction of function; second, generation of sieved models with demonstrated uses in protein crystallographic structure determination using the technique of Molecular Replacement.
PLoS ONE 01/2010; 5(4):e10048. · 3.53 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: We report the first two complete mitochondrial genome sequences of the thylacine (Thylacinus cynocephalus), or so-called Tasmanian tiger, extinct since 1936. The thylacine's phylogenetic position within australidelphian marsupials has long been debated, and here we provide strong support for the thylacine's basal position in Dasyuromorphia, aided by mitochondrial genome sequence that we generated from the extant numbat (Myrmecobius fasciatus). Surprisingly, both of our thylacine sequences differ by 11%-15% from putative thylacine mitochondrial genes in GenBank, with one of our samples originating from a direct offspring of the previously sequenced individual. Our data sample each mitochondrial nucleotide an average of 50 times, thereby providing the first high-fidelity reference sequence for thylacine population genetics. Our two sequences differ in only five nucleotides out of 15,452, hinting at a very low genetic diversity shortly before extinction. Despite the samples' heavy contamination with bacterial and human DNA and their temperate storage history, we estimate that as much as one-third of the total DNA in each sample is from the thylacine. The microbial content of the two thylacine samples was subjected to metagenomic analysis, and showed striking differences between a wild-captured individual and a born-in-captivity one. This study therefore adds to the growing evidence that extensive sequencing of museum collections is both feasible and desirable, and can yield complete genomes.
Genome Research 02/2009; 19(2):213-20. · 13.85 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: In 1994, two independent groups extracted DNA from several Pleistocene epoch mammoths and noted differences among individual specimens. Subsequently, DNA sequences have been published for a number of extinct species. However, such ancient DNA is often fragmented and damaged, and studies to date have typically focused on short mitochondrial sequences, never yielding more than a fraction of a per cent of any nuclear genome. Here we describe 4.17 billion bases (Gb) of sequence from several mammoth specimens, 3.3 billion (80%) of which are from the woolly mammoth (Mammuthus primigenius) genome and thus comprise an extensive set of genome-wide sequence from an extinct species. Our data support earlier reports that elephantid genomes exceed 4 Gb. The estimated divergence rate between mammoth and African elephant is half of that between human and chimpanzee. The observed number of nucleotide differences between two particular mammoths was approximately one-eighth of that between one of them and the African elephant, corresponding to a separation between the mammoths of 1.5-2.0 Myr. The estimated probability that orthologous elephant and mammoth amino acids differ is 0.002, corresponding to about one residue per protein. Differences were discovered between mammoth and African elephant in amino-acid positions that are otherwise invariant over several billion years of combined mammalian evolution. This study shows that nuclear genome sequencing of extinct species can reveal population differences not evident from the fossil record, and perhaps even discover genetic factors that affect extinction.
[Show abstract][Hide abstract] ABSTRACT: Inventories of small subgraphs in biological networks have identified commonly-recurring patterns, called motifs. The inference that these motifs have been selected for function rests on the idea that their occurrences are significantly more frequent than random.
Our analysis of several large biological networks suggests, in contrast, that the frequencies of appearance of common subgraphs are similar in natural and corresponding random networks.
Indeed, certain topological features of biological networks give rise naturally to the common appearance of the motifs. We therefore question whether frequencies of occurrences are reasonable evidence that the structures of motifs have been selected for their functional contribution to the operation of networks.
BMC Systems Biology 09/2008; 2:73. · 2.85 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: We report five new complete mitochondrial DNA (mtDNA) genomes of Siberian woolly mammoth (Mammuthus primigenius), sequenced with up to 73-fold coverage from DNA extracted from hair shaft material. Three of the sequences present the first complete mtDNA genomes of mammoth clade II. Analysis of these and 13 recently published mtDNA genomes demonstrates the existence of two apparently sympatric mtDNA clades that exhibit high interclade divergence. The analytical power afforded by the analysis of the complete mtDNA genomes reveals a surprisingly ancient coalescence age of the two clades, approximately 1-2 million years, depending on the calibration technique. Furthermore, statistical analysis of the temporal distribution of the (14)C ages of these and previously identified members of the two mammoth clades suggests that clade II went extinct before clade I. Modeling of protein structures failed to indicate any important functional difference between genomes belonging to the two clades, suggesting that the loss of clade II more likely is due to genetic drift than a selective sweep.
Proceedings of the National Academy of Sciences 07/2008; 105(24):8327-32. · 9.81 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: In designing an algorithm to find pairs of points that are within Euclidean distance d it is effective to use a screening procedure to reject most pairs of points that are far apart. A procedure based on multiple, overlapping lattices can efficiently identify close points and exclude distant ones.
Computational Biology and Chemistry 07/2008; 32(3):212-4. · 1.60 Impact Factor