Bioinformatics

Published by Oxford University Press

Online ISSN: 1367-4811,1460-2059

·

Print ISSN: 1367-4803

Articles


From data to knowledge
  • Article

June 2000

·

30 Reads

Francois Rechenmann

The-more-the-better and the-less-the-better
  • Article
  • Full-text available

October 2006

·

71 Reads

We report a novel protein domain, G8, which contains five repeated β-strand pairs and is present in some disease-related proteins such as PKHD1, KIAA1199 and TMEM2, as well as other uncharacterized proteins. Most G8-containing proteins ...

About the use of protein models

August 2002

·

110 Reads

Protein models can be of great assistance in functional genomics, as they provide the structural insights often necessary to understand protein function. Although comparative modelling is far from yielding perfect structures, this is still the most reliable method and the quality of the predictions is now well understood. Models can be classified according to their correctness and accuracy, which will impact their applicability and usefulness in functional genomics and a variety of situations. Contact: manuel.peitsch@pharma.novartis.com; manuel.peitsch@isb-sib.ch

Fig. 1. Venn diagram of datasets of the same/different folds. Set-I contains 746 420 same Fold domain pairs generated from 11 239 protein domains in SCOP. Set-II consists of 2 769 868 same Topology domain pairs generated from 14 830 protein domains in CATH. Set-III is the overlap part of Set-I and Set-II, which includes 186 359 pairs from 5105 consensus domains. Set-IV contains 13 027 960 all-to-all pairs from the 5105 consensus domains. The complement of Set-I is the different fold set for SCOP, generated by subtracting a subset of Set-I from Set-IV. The complement of Set-II is the different fold set for CATH, generated by subtracting a subset of Set-II from Set-IV. The complement of Set-III is the different fold set for Set-III and is obtained by subtracting subsets of Set-I and Set-II from Set-IV.
Fig. 2. TM-score distribution of 71 583 085 gapless comparisons among 6684 non-homologous protein structures. The continuous curve represents an EVD with the location parameter and the scale parameter being 0.1512 and 0.0242, respectively; the reduced χ² of fitting is 0.001 obtained by the 
Fig. 3. The P-value versus TM-score. The curve is a sigmoid-like Boltzmann function with reduced χ² equal to 0.0001. Inset: P-value (in logarithm scale) 
Fig. 4. The average TM-scores (with error bars) of gapless alignment matches on random structural pairs with protein length from 80 to 200 amino acids. The straight and dashed lines above TM-score = 0.2 indicate the number of random protein pairs (values on the right-hand side) needed to achieve or surpass a certain TM-score level. By doing random structure comparisons 10^2, 10^4, 10^10 and 10^16 times, one can hit a match with a TM-score ≥ 0.263, 
Fig. 5. The conditional probabilities of TM-score for proteins in the same fold and different fold families as defined by SCOP (Set-I and its complement), CATH (Set-II and its complement) and the consensus of SCOP and CATH (Set-III and its complement).


How significant is a protein structure similarity with TM-score = 0.5?

February 2010

·

483 Reads

Motivation: Protein structure similarity is often measured by root mean squared deviation, global distance test score and template modeling score (TM-score). However, the scores themselves cannot provide information on how significant the structural similarity is. A quantitative relation between the scores and conventional fold classifications is also lacking. This article aims to answer two questions: (i) What is the statistical significance of TM-score? (ii) What is the probability of two proteins having the same fold given a specific TM-score? Results: We first made an all-to-all gapless structural match on 6684 non-homologous single-domain proteins in the PDB and found that the TM-scores follow an extreme value distribution. The data allow us to assign each TM-score a P-value that measures the chance of two randomly selected proteins obtaining an equal or higher TM-score. With a TM-score at 0.5, for instance, its P-value is 5.5×10^-7, which means we need to consider at least 1.8 million random protein pairs to acquire a TM-score of no less than 0.5. Second, we examine the posterior probability of the same fold proteins from three datasets: SCOP, CATH and the consensus of SCOP and CATH. It is found that the posterior probability from different datasets has a similar rapid phase transition around TM-score = 0.5. This finding indicates that TM-score can be used as an approximate but quantitative criterion for protein topology classification, i.e. protein pairs with a TM-score >0.5 are mostly in the same fold while those with a TM-score <0.5 are mainly not in the same fold. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
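The P-value quoted in the abstract can be checked against the extreme value distribution parameters given in the Fig. 2 caption (location 0.1512, scale 0.0242). The sketch below assumes the EVD is the Gumbel (type I, maximum) form; that choice and the function names are assumptions made here for illustration, not taken from the paper.

```python
import math

# Gumbel (type I extreme value) parameters quoted in the Fig. 2 caption.
EVD_LOCATION = 0.1512
EVD_SCALE = 0.0242


def tm_score_p_value(tm_score, location=EVD_LOCATION, scale=EVD_SCALE):
    """Chance that a random pair reaches at least tm_score (assumed Gumbel tail)."""
    z = (tm_score - location) / scale
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny values.
    return -math.expm1(-math.exp(-z))


def pairs_needed(tm_score):
    """Expected number of random pairs before one reaches tm_score."""
    return 1.0 / tm_score_p_value(tm_score)


if __name__ == "__main__":
    print(f"P-value at TM-score 0.5: {tm_score_p_value(0.5):.2e}")
    print(f"random pairs needed:     {pairs_needed(0.5):.2e}")
```

Under this assumption the value at TM-score 0.5 agrees with the 5.5×10^-7 and the roughly 1.8 million pairs stated in the abstract, and the value at 0.263 is close to 10^-2, in line with the Fig. 4 caption.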

Table 1 . The performance of different MHC class II prediction algorithms on our data set 
SVM based method for predicting HLA-DRB1(*)0401 binding peptides in an antigen sequence

March 2004

·

92 Reads

Prediction of peptides binding with MHC class II allele HLA-DRB1*0401 can effectively reduce the number of experiments required for identifying helper T cell epitopes. This paper describes a support vector machine (SVM)-based method developed for identifying HLA-DRB1*0401 binding peptides in an antigenic sequence. The SVM was trained and tested on a large, clean data set consisting of 567 binders and an equal number of non-binders. The accuracy of the method was 86% when evaluated through 5-fold cross-validation. Availability: A web server, HLA-DR4Pred, based on the above approach is available at http://www.imtech.res.in/raghava/hladr4pred/ and http://bioinformatics.uams.edu/mirror/hladr4pred/ (Mirror Site). Supplementary information: http://www.imtech.res.in/raghava/hladr4pred/info.html
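The 5-fold cross-validation protocol mentioned above can be sketched as follows. The classifier is deliberately left abstract: `train_threshold` is a hypothetical toy stand-in used only to make the example runnable, not the SVM or the peptide encoding used by HLA-DR4Pred.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train, k=5, seed=0):
    """Mean held-out accuracy; `train` must return a predict(sample) callable."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        predict = train([samples[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        correct = sum(predict(samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def train_threshold(samples, labels):
    """Hypothetical toy classifier: threshold halfway between the class means."""
    mean0 = sum(s for s, y in zip(samples, labels) if y == 0) / labels.count(0)
    mean1 = sum(s for s, y in zip(samples, labels) if y == 1) / labels.count(1)
    cut = (mean0 + mean1) / 2
    return lambda s: int(s > cut)
```

With 567 binders and 567 non-binders, `k_fold_indices(1134)` yields four folds of 227 samples and one of 226; each sample is held out exactly once.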

Table 2 . Statistics on HSCS and overall scores for poly-aa and repeated peptides. 
0j.py: A software tool for low complexity proteins and protein domains

February 2001

·

103 Reads

Low complexity proteins and protein domains have sequences which appear highly non-random. Over the years, these sequences have been routinely filtered out during sequence similarity searches because interest has been focused on globular proteins, and inclusion of these domains can severely skew search results. However, early work on these proteins and more recent studies of the related area of repeated protein sequences suggest that low complexity protein domains have function and therefore are in need of further investigation. 0j.py is a new tool for demarcating low complexity protein domains more accurately than has been possible to date. The paper describes 0j.py and its use in revealing proteins with repeated and poly-amino-acid peptides. Statistical methods are then employed to examine the distribution of these proteins across species, while keyword clustering is used to suggest roles performed by proteins through the use of low complexity domains. Contact: M.Wise@ccsr.cam.ac.uk
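As a simple illustration of one class of sequence mentioned above, poly-amino-acid peptides, the following sketch reports homopolymer runs in a protein sequence. It is not the 0j.py algorithm, which the abstract does not describe; the function name and the default minimum length are assumptions.

```python
from itertools import groupby


def poly_aa_runs(sequence, min_length=5):
    """Return (residue, start, end) for each homopolymer run; end is exclusive."""
    runs = []
    position = 0
    for residue, group in groupby(sequence):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((residue, position, position + length))
        position += length
    return runs


if __name__ == "__main__":
    # A toy sequence with an eight-residue glutamine stretch.
    print(poly_aa_runs("MA" + "Q" * 8 + "LK"))
```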

MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices

November 1995

·

210 Reads

The information matrix database (IMD), a database of weight matrices of transcription factor binding sites, is developed. MATRIX SEARCH, a program which can find potential transcription factor binding sites in DNA sequences using the IMD database, is also developed and accompanies the IMD database. MATRIX SEARCH adopts a user interface very similar to that of the SIGNAL SCAN program. MATRIX SEARCH allows the user to search an input sequence with the IMD automatically, to visualize the matrix representations of sites for particular factors, and to retrieve journal citations. The source code for MATRIX SEARCH is in the 'C' language, and the program is available for unix platforms.
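Scanning a DNA sequence with a weight matrix, the operation MATRIX SEARCH performs with IMD entries, can be sketched as below. The log-odds construction, the pseudocount, the uniform background and the toy TATA-like counts are assumptions for illustration; the IMD scoring scheme itself is not given in the abstract.

```python
import math


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2-odds weights."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Hypothetical 4-column matrix built from ten aligned "TATA" sites.
    toy = log_odds_matrix([{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}])
    print(scan("GGTATAGG", toy, threshold=5.0))
```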

Fig. 1. (a) This step loads the model file (CSML or SBML) and defines the distribution and range for the parameters that users wish to estimate. (b) This step inputs the observed time-series data. Accepted formats include EDF, CSV and TSV. Functions such as smoothing and sampling are included to improve the quality of observed data for better estimation results. (c) This step pairs the model entities with observed data. An auto-map function is available to match corresponding entities and observed data with the same names. (d) A variety of settings for the particle filter and simulation are available to allow for flexibility based on the user's needs. (e) After running the particle filter algorithm, the simulation results obtained with the estimated parameters are plotted for ease of comparison between the original and fitted models. The parameters' distribution plot is also displayed.
Fig. 2. (a) Time versus seed size plot. (b) Score versus coverage. Scores close to 1 or <1 indicate a very good match between observed data and simulation results. (Please see supplementary data for experimental details.)
DA 1.0: Parameter Estimation of Biological Pathways using Data Assimilation approach

July 2010

·

80 Reads

·

Masao Nagasaki

·

Ayumu Saito

·

[...]

·

Data assimilation (DA) is a computational approach that estimates unknown parameters in a pathway model using time-course information. Particle filtering, the underlying method used, is a well-established statistical method that approximates the joint posterior distributions of parameters by using sequentially generated Monte Carlo samples. In this article, we report the release of Java-based software (DA 1.0) with an intuitive and user-friendly interface to allow users to carry out parameter estimation using DA. Availability and Implementation: DA 1.0 was developed using Java and thus would be executable on any platform installed with JDK 6.0 (not JRE 6.0) or later. DA 1.0 is freely available for academic users and can be launched or downloaded from http://da.csml.org. Contact: masao{at}ims.u-tokyo.ac.jp
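The particle-filtering idea described above can be illustrated on a toy problem. The sketch below estimates a single decay rate k in a hypothetical model x(t) = x0·exp(−k·t) from a time course; the model, the uniform prior on (0, 2), the Gaussian observation noise and the jitter step are all assumptions made here, and DA 1.0 itself is a Java application working on CSML/SBML pathway models.

```python
import math
import random


def estimate_decay_rate(observations, times, x0=10.0, n_particles=2000,
                        obs_sigma=0.5, jitter=0.01, seed=0):
    """Bootstrap particle filter for the rate k in x(t) = x0 * exp(-k * t)."""
    rng = random.Random(seed)
    # Draw the initial particles from an assumed uniform prior on (0, 2).
    particles = [rng.uniform(0.0, 2.0) for _ in range(n_particles)]
    for t, y in zip(times, observations):
        # Weight each particle by the Gaussian likelihood of the observation.
        weights = [
            math.exp(-0.5 * ((y - x0 * math.exp(-k * t)) / obs_sigma) ** 2)
            for k in particles
        ]
        if sum(weights) == 0.0:
            raise ValueError("all particles have zero likelihood")
        # Resample in proportion to the weights, then jitter to keep diversity.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        particles = [max(0.0, k + rng.gauss(0.0, jitter)) for k in particles]
    # The posterior mean serves as the point estimate.
    return sum(particles) / n_particles


if __name__ == "__main__":
    times = list(range(1, 11))
    data = [10.0 * math.exp(-0.5 * t) for t in times]  # noiseless toy data
    print(estimate_decay_rate(data, times))
```

Resampling at every step is the simplest choice; resampling only when the effective sample size drops is a common refinement.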

Von Bertalanffy 1.0: A COBRA toolbox extension to thermodynamically constrain metabolic models

January 2011

·

165 Reads

In flux balance analysis of genome scale stoichiometric models of metabolism, the principal constraints are uptake or secretion rates, the steady state mass conservation assumption and reaction directionality. Here, we introduce an algorithmic pipeline for quantitative assignment of reaction directionality in multi-compartmental genome scale models based on an application of the second law of thermodynamics to each reaction. Given experimental or computationally estimated standard metabolite species Gibbs energy and metabolite concentrations, the algorithms bound the reaction Gibbs energy, which is transformed to in vivo pH, temperature, ionic strength and electrical potential. This cross-platform MATLAB extension to the COnstraint-Based Reconstruction and Analysis (COBRA) toolbox is computationally efficient, extensively documented and open source. Availability: http://opencobra.sourceforge.net.
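The core of the directionality assignment can be illustrated with the textbook relation ΔG = ΔG° + RT·Σ νᵢ ln cᵢ: if ΔG is negative over the whole allowed concentration range, the reaction can only run forward. The sketch below is a simplified Python illustration with hypothetical names and concentration bounds; it omits the transformation to in vivo pH, ionic strength and electrical potential that the toolbox performs, and it is not the MATLAB API.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)


def reaction_gibbs_bounds(dg0, stoichiometry, conc_bounds, temperature=310.15):
    """Bounds on dG = dG0 + RT * sum(nu_i * ln c_i) given concentration ranges.

    stoichiometry maps metabolite -> coefficient (negative for substrates);
    conc_bounds maps metabolite -> (min, max) concentration in mol/L.
    """
    rt = R_KJ * temperature
    low = high = dg0
    for metabolite, nu in stoichiometry.items():
        c_min, c_max = conc_bounds[metabolite]
        terms = (nu * rt * math.log(c_min), nu * rt * math.log(c_max))
        low += min(terms)
        high += max(terms)
    return low, high


def directionality(dg0, stoichiometry, conc_bounds, temperature=310.15):
    """Classify a reaction from the sign of its Gibbs energy bounds."""
    low, high = reaction_gibbs_bounds(dg0, stoichiometry, conc_bounds, temperature)
    if high < 0:
        return "forward"
    if low > 0:
        return "reverse"
    return "reversible"


if __name__ == "__main__":
    bounds = {"A": (1e-5, 1e-2), "B": (1e-5, 1e-2)}
    print(directionality(-30.0, {"A": -1, "B": 1}, bounds))
```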

Fig. 1. ROC curves for the benchmark. Plots are shown for the new INFERNAL 1.0 with and without filters, for the old INFERNAL 0.72 and for family-pairwise searches (FPS) with blastn. CPU times are total times for all 51 family searches measured for single execution threads on 3.0 GHz Intel Xeon processors. The INFERNAL 1.0 times do not include time required for model calibration. 
Infernal 1.0: Inference of RNA Alignments

April 2009

·

442 Reads

Infernal builds consensus RNA secondary structure profiles called covariance models (CMs), and uses them to search nucleic acid sequence databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments. Availability: Source code, documentation and benchmark downloadable from http://infernal.janelia.org. Infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Contact: nawrockie,kolbed,eddys{at}janelia.hhmi.org

Fig. 1. Contact maps of protein HigA in the standard, homogeneous formulation and the Miyazawa-Jernigan formulation. Each square stands for a contact between residue i and j in the native structure of HigA. The upper left contact map represents the homogeneous energetics of the standard SBM formulation. However, the lower right contact map illustrates the possibility of weighting each native contact by Miyazawa-Jernigan factors (Miyazawa and Jernigan, 1996). 
ESBMTools 1.0: Enhanced native structure-based modeling tools

September 2013

·

115 Reads

Molecular dynamics (MD) simulations provide detailed insights into the structure and function of biomolecular systems. Thus, they complement experimental measurements by giving access to experimentally inaccessible regimes. Amongst the different MD techniques, native structure-based models (SBM) are based on energy landscape theory and the principle of minimal frustration. Typically employed in protein and RNA folding simulations, they coarse-grain the biomolecular system and/or simplify the Hamiltonian, resulting in modest computational requirements while achieving high agreement with experimental data. eSBMTools streamlines running and evaluating SBM in a comprehensive package and offers high flexibility in adding experimental or bioinformatics-derived restraints. We present a software package that allows setting up, modifying and evaluating SBM for both RNA and proteins. The implemented workflows include predicting protein complexes based on bioinformatics-derived inter-protein contact information, a standardized setup of protein folding simulations based on the common PDB format, calculating reaction coordinates and evaluating the simulation by free-energy calculations with WHAM or by phi-values. The modules interface with the molecular dynamics simulation program GROMACS. The package is open source and written in architecture-independent Python2. Availability: http://sourceforge.net/projects/esbmtools/ Contact: alexander.schug@kit.edu.

MesoRD 1.0: Stochastic Reaction-Diffusion Simulations in the Microscopic Limit

October 2012

·

80 Reads

MesoRD is a tool for simulating stochastic reaction-diffusion systems as modeled by the reaction diffusion master equation. The simulated systems are defined in the Systems Biology Markup Language with additions to define compartment geometries. MesoRD 1.0 supports scale-dependent reaction rate constants and reactions between reactants in neighbouring subvolumes. These new features make it possible to construct physically consistent models of diffusion-controlled reactions also at fine spatial discretization. Availability: MesoRD is written in C++ and licensed under the GNU general public license (GPL). MesoRD can be downloaded at http://mesord.sourceforge.net. The MesoRD homepage, http://mesord.sourceforge.net, contains detailed documentation and news about recently implemented features. Contact: johan.elf@icm.uu.se.
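A reaction-diffusion master equation treats space as a lattice of subvolumes, with molecules reacting within a subvolume and hopping between neighbours as stochastic events. The sketch below runs a Gillespie-style simulation of one species that decays (A → ∅) and diffuses on a 1D lattice. It is a minimal Python illustration under assumed rate parameters; it does not reproduce MesoRD's SBML input, its geometry handling or the scale-dependent rate constants introduced in version 1.0.

```python
import random


def simulate_rdme(counts, hop_rate, decay_rate, t_end, seed=0):
    """Gillespie simulation of A -> 0 with nearest-neighbour hops on a 1D lattice.

    counts: initial molecules per subvolume; hop_rate: per-molecule rate of
    jumping to each neighbour; decay_rate: per-molecule degradation rate.
    """
    rng = random.Random(seed)
    counts = list(counts)
    n = len(counts)
    t = 0.0
    while True:
        # Total event propensity of each subvolume (edges have one neighbour).
        propensities = [
            c * (decay_rate + ((i > 0) + (i < n - 1)) * hop_rate)
            for i, c in enumerate(counts)
        ]
        total = sum(propensities)
        if total == 0.0:
            break
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            break
        # Pick the subvolume, then the event type within it.
        i = rng.choices(range(n), weights=propensities)[0]
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
        rates = [decay_rate] + [hop_rate] * len(neighbours)
        event = rng.choices(range(len(rates)), weights=rates)[0]
        counts[i] -= 1
        if event > 0:  # a hop: the molecule reappears in the chosen neighbour
            counts[neighbours[event - 1]] += 1
    return counts


if __name__ == "__main__":
    print(simulate_rdme([100, 0, 0, 0, 0], hop_rate=1.0, decay_rate=0.1, t_end=5.0))
```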

PromFD 1.0: A computer program that predicts eukaryotic pol II promoters using strings and IMD matrices

March 1997

·

100 Reads

Motivation: A large number of new DNA sequences with virtually unknown functions are generated as the Human Genome Project progresses. Therefore, it is essential to develop computer algorithms that can predict the functionality of DNA segments according to their primary sequences, including algorithms that can predict promoters. Although several promoter-predicting algorithms are available, they have high false-positive detections and the rate of promoter detection needs to be improved further. Results: In this research, PromFD, a computer program to recognize vertebrate RNA polymerase II promoters, has been developed. Both vertebrate promoters and non-promoter sequences are used in the analysis. The promoters are obtained from the Eukaryotic Promoter Database. Promoters are divided into a training set and a test set. Non-promoter sequences are obtained from the GenBank sequence databank, and are also divided into a training set and a test set. The first step is to search out, among all possible permutations, patterns of strings 5-10 bp long that are significantly over-represented in the promoter set. The program also searches IMD (Information Matrix Database) matrices that have a significantly higher presence in the promoter set. The results of the searches are stored in the PromFD database, and the program PromFD scores input DNA sequences according to their content of the database entries. PromFD predicts promoters, reporting their locations and the location of potential TATA boxes, if found. The program can detect 71% of promoters in the training set with a false-positive rate of under 1 in every 13,000 bp, and 47% of promoters in the test set with a false-positive rate of under 1 in every 9800 bp. PromFD uses a new approach and its false-positive identification rate is better compared with other available promoter recognition algorithms. The source code for PromFD is in the 'C++' language.
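The first step described above, finding short strings that are over-represented in the promoter set relative to the non-promoter set, can be sketched as follows. The presence-frequency ratio with a pseudocount is a stand-in chosen here for simplicity; it is not a significance test, and the abstract does not specify the statistic PromFD actually uses. The toy sequences are invented.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """Count how many sequences contain each k-mer at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def over_represented(promoters, non_promoters, k, min_ratio=3.0, pseudocount=1.0):
    """Return {k-mer: ratio} for k-mers enriched in promoters by at least min_ratio."""
    pos = kmer_presence(promoters, k)
    neg = kmer_presence(non_promoters, k)
    enriched = {}
    for kmer, n_pos in pos.items():
        f_pos = (n_pos + pseudocount) / (len(promoters) + 2 * pseudocount)
        f_neg = (neg.get(kmer, 0) + pseudocount) / (len(non_promoters) + 2 * pseudocount)
        ratio = f_pos / f_neg
        if ratio >= min_ratio:
            enriched[kmer] = ratio
    return enriched


if __name__ == "__main__":
    promoters = ["GGTATAAAGG", "CCTATAAACC", "AATATAAATT"]
    background = ["GGGCGCGCGG", "CCGCGCGGCC", "ACGTACGTAC"]
    print(over_represented(promoters, background, k=6))
```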

MulBlast 1.0: A multiple alignment of BLAST output to boost protein sequence similarity analysis

January 1997

·

42 Reads

The protein sequence similarity search has become a major tool for biologists. Various efficient and rapid programs and comparison matrices have been designed and refined in order to perform the scanning task (BLAST, FASTA, Automat, etc.). However, the final step of the search, the analysis of the results, is still tedious and time consuming. In order to optimize true-positive hit screening, we have developed a program which makes a multiple alignment from the BLAST search output. Conserved sequence segments are pointed out, which makes it easier to recognize both known and new sequence patterns. The alignment allows rapid identification, at a glance, of significant similarities, protein family signatures and new sequence motifs. It is written in a format compatible with the GCG programs LineUp and ProfileMake.

Genomes OnLine Database (GOLD 1.0): a monitor of complete and ongoing genome projects world-wide

October 1999

·

73 Reads

GOLD (Genomes On Line Database) is a World Wide Web resource for comprehensive access to information regarding complete and ongoing genome projects around the world. Availability: GOLD is based at the University of Illinois at Urbana-Champaign and is available at http://geta.life.uiuc.edu/~nikos/genomes.html. It is also mirrored at the European Bioinformatics Institute at http://www.ebi.ac.uk/research/cgg/genomes.html. Contact: genomes@ebi.ac.uk

Fig. 2. The ancestral sequence predictions and the corresponding confidence level (between 0 and 100) of each character. These confidence levels have been computed according to the confidence level of the indel predictions as well as the substitution predictions. The ancestral names correspond to a concatenation of the names of the descendant species.  
Ancestors 1.0: A Web Server For Ancestral Sequence Reconstruction

October 2009

·

318 Reads

The computational inference of ancestral genomes consists of five difficult steps: identifying syntenic regions, inferring ancestral arrangement of syntenic regions, aligning multiple sequences, reconstructing the insertion and deletion history and finally inferring substitutions. Each of these steps has received a lot of attention in the past years. However, there currently exists no framework that integrates all of the different steps in an easy workflow. Here, we introduce Ancestors 1.0, a web server allowing one to easily and quickly perform the last three steps of the ancestral genome reconstruction procedure. It implements several alignment algorithms, an indel maximum likelihood solver and a context-dependent maximum likelihood substitution inference algorithm. The results presented by the server include the posterior probabilities for the last two steps of the ancestral genome reconstruction and the expected error rate of each ancestral base prediction. Availability: Ancestors 1.0 is available at http://ancestors.bioinfo.uqam.ca/ancestorWeb/.
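The last step, inferring substitutions with posterior probabilities and an expected error rate per base, can be illustrated on the smallest possible case: one ancestor with two descendant leaves. The sketch below uses the context-independent Jukes-Cantor (JC69) model with a uniform prior, which is a deliberate simplification; the server's substitution model is context-dependent and is not reproduced here.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, t):
    """JC69 probability of base b after branch length t, starting from base a."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def root_posterior(left_base, right_base, t_left, t_right):
    """Posterior over the ancestral base of two leaves under a uniform prior."""
    joint = {
        x: 0.25 * jc69_prob(x, left_base, t_left) * jc69_prob(x, right_base, t_right)
        for x in BASES
    }
    total = sum(joint.values())
    return {x: p / total for x, p in joint.items()}


def expected_error(posterior):
    """Probability that the most probable ancestral base is wrong."""
    return 1.0 - max(posterior.values())


if __name__ == "__main__":
    post = root_posterior("A", "A", 0.1, 0.1)
    print(post, expected_error(post))
```

On a larger tree the same quantities come from Felsenstein's pruning algorithm, applied column by column to the alignment.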

Fig. 1. Many of the key concepts of CMap are shown. Five maps of varying types from QTL to genetic to sequence and from two species are displayed, and more could be added. Map features range from QTLs to genetic markers to bins to genes, and correspondences based on different types of evidence are shown in varying colors (http://www.gramene.org/db/cmap).
CMap 1.01: A comparative mapping application for the Internet

August 2009

·

212 Reads

CMap is a web-based tool for displaying and comparing maps of any type and from any species. A user can compare an unlimited number of maps, view pair-wise comparisons of known correspondences, and search for maps or for features by name, species, type and accession. CMap is freely available, can run on a variety of database engines and uses only free and open software components. Availability: http://www.gmod.org/cmap Contact: kclark@cshl.edu

JaMBW 1.1: Java-based Molecular Biologists' Workbench

September 1997

·

218 Reads

A software suite, 'Java-based Molecular Biologists' Workbench' (JaMBW), has been developed in order to accomplish common bioinformatics tasks, and can be accessed at the URL: http://www.embl-heidelberg.de/JaMBW/. Java implementations are designed to operate on any computer architecture. Furthermore, Java is designed to be independent of the operating system, relying only on the availability of a Java Virtual Machine (JVM) package. JVMs were initially implemented as software, but with the creation of Java processors, they are now becoming available as hardware implementations. One can even foresee that hybrid solutions may appear, in which multiprocessor-based computers are fitted with a Java chip so that the speed of Java-based code is greatly enhanced (apparently by a factor of up to 10 times; Sun, 1996). Java code will therefore not only be completely portable—drastically reducing development costs—but will also have better performance than native C-code implementations. The superiority of the

ROC-like curves for the benchmark. Plots are shown for the new Infernal 1.1 with and without filters, for the old Infernal 1.0.2, for profile HMM searches with nhmmer (from the HMMER package included in Infernal 1.1, default parameters) and for family-pairwise-searches with BLASTN (ncbi-blast-2.2.28+, default parameters). The maximum sensitivity (not shown) for default Infernal 1.1 is 0.81 (629 of 780 true positives found), which is achieved at a false-positive rate of 0.19/Mb/query. For non-filtered Infernal, maximum sensitivity is 0.87 at 2.9 false positives per Mb per query. This indicates that at high false-positive rates the filters prevent some true positives from being found, but prevent many more false positives from being found. CPU times are total times for all 106 family searches measured for single execution threads on 3.0 GHz Intel Xeon processors. The Infernal times do not include time required for model calibration.
Infernal 1.1: 100-fold faster RNA homology searches

September 2013

·

411 Reads

Infernal builds probabilistic profiles of the sequence and secondary structure of an RNA family called covariance models (CMs) from structurally annotated multiple sequence alignments given as input. Infernal uses CMs to search for new family members in sequence databases, and to create potentially large multiple sequence alignments. Version 1.1 of Infernal introduces a new filter pipeline for RNA homology search based on accelerated profile HMM methods and HMM-banded CM alignment methods. This enables a roughly 100-fold acceleration over the previous version and roughly a 10,000-fold acceleration over exhaustive, non-filtered CM searches. Availability: Source code, documentation, and the benchmark are downloadable from http://infernal.janelia.org. Infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Documentation includes a user's guide with a tutorial, a discussion of file formats and user options, and additional details on methods implemented in the software. Contact: nawrockie@janelia.hhmi.org, eddys@janelia.hhmi.org.

Fig. 1. Illustrations of the functions of the Aho-Corasick automaton constructed on the set of samples R = {r1, r2, r3, r4, r5} = {HE, SHE, HIS, HER, HERS}. (a) Graph representation of transition function G(s, a). (b) Rejections' function F(s). (c) Output function O(s). (d) Transition of the automaton from state to state if the input text is 'ushers'.
Fig. 2. An example of the pattern bank entry. 
Table 4. Some characteristics of secondary banks in comparison with PROF_PAT
PROF_PAT 1.3: Updated database of patterns used to detect local similarities

May 2000

·

87 Reads

When analysing novel protein sequences, it is now essential to extend search strategies to include a range of 'secondary' databases. Pattern databases have become vital tools for identifying distant relationships in sequences, and hence for predicting protein function and structure. The main drawback of such methods is the relatively small representation of proteins in trial samples at the time of their construction. Therefore, a negative result of an amino acid sequence comparison with such a databank forces a researcher to search for similarities in the original protein banks. We developed a database of patterns constructed for groups of related proteins with maximum representation of amino acid sequences of SWISS-PROT in the groups. Software tools and a new method have been designed to construct patterns of protein families. Using this method, a new version of the databank of protein family patterns, PROF_PAT 1.3, has been produced. This bank is based on SWISS-PROT (r1.38) and TrEMBL (r1.11), and contains patterns of more than 13 000 groups of related proteins in a format similar to that of PROSITE. Motifs of patterns which had the minimum level of probability to be found in random sequences were selected. A flexible, fast search program accompanies the bank. The researcher can specify a similarity matrix (of the PAM, BLOSUM or another type). Variable levels of similarity can be set (permitting search strategies ranging from exact matches to increasing levels of 'fuzziness'). The Internet address for comparing sequences with the bank is: http://wwwmgs.bionet.nsc.ru/mgs/programs/prof_pat/. The local version of the bank and search programs (approximately 50 Mb) is available via ftp: ftp://ftp.bionet.nsc.ru/pub/biology/vector/prof_pat/, and ftp://ftp.ebi.ac.uk/pub/databases/prof_pat/. Another way to use it externally is to mail amino acid sequences to bachin@vector.nsc.ru for comparison with PROF_PAT 1.3.
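The Aho-Corasick automaton illustrated in Fig. 1 for the sample set {HE, SHE, HIS, HER, HERS} can be sketched in a few lines. This is a generic textbook construction (goto, failure and output functions, corresponding to the G, F and O of the caption), written here for illustration; it is not the PROF_PAT search program.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # failure link of each state
    output = [[]]   # patterns recognised on entering each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first traversal fills in failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        parent = queue.popleft()
        for ch, child in goto[parent].items():
            queue.append(child)
            f = fail[parent]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            output[child] = output[child] + output[fail[child]]
    return goto, fail, output


def search(text, patterns):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    goto, fail, output = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


if __name__ == "__main__":
    print(search("USHERS", ["HE", "SHE", "HIS", "HER", "HERS"]))
    # → [(1, 'SHE'), (2, 'HE'), (2, 'HER'), (2, 'HERS')]
```

The text is scanned once regardless of how many patterns are loaded, which is what makes the approach attractive for searching a sequence against a large pattern bank.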

Fig. 1. DAPC of simulated data (see text). ( a ) Density of individual scores on the first discriminant function, with groups represented in red and blue. ( b ) SNP contribution to the separation of the groups; the last (structured) 1000 SNPs are coloured in green; the last 4000 SNPs are represented in the main plot, while the figure corresponding to all SNPs is shown in inset. 
Jombart T, Ahmed I. Adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics 27: 3070-3071

September 2011

·

1,079 Reads

While the R software is becoming a standard for the analysis of genetic data, classical population genetics tools are being challenged by the increasing availability of genomic sequences. Dedicated tools are needed for harnessing the large amount of information generated by next-generation sequencing technologies. We introduce new tools implemented in the adegenet 1.3-1 package for handling and analyzing genome-wide single nucleotide polymorphism (SNP) data. Using a bit-level coding scheme for SNP data and parallelized computation, adegenet enables the analysis of large genome-wide SNP datasets using standard personal computers. Availability: adegenet 1.3-1 is available from CRAN: http://cran.r-project.org/web/packages/adegenet/. Information and support including a dedicated forum of discussion can be found on the adegenet website: http://adegenet.r-forge.r-project.org/. adegenet is released with a manual and four tutorials totalling over 300 pages of documentation, and distributed under the GNU General Public Licence (≥2). Contact: t.jombart@imperial.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
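The memory saving behind a bit-level coding scheme can be illustrated by packing binary SNP alleles eight to a byte. The sketch below is a generic Python illustration with hypothetical function names; adegenet is an R package and its internal storage format is not described in the abstract, so the layout shown here (least significant bit first) is an assumption.

```python
def pack_snps(alleles):
    """Pack a list of 0/1 alleles into bytes, 8 SNPs per byte (LSB first)."""
    packed = bytearray((len(alleles) + 7) // 8)
    for i, allele in enumerate(alleles):
        if allele not in (0, 1):
            raise ValueError("only binary alleles are supported in this sketch")
        if allele:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_snps(packed, n_snps):
    """Recover the first n_snps alleles from a packed byte string."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(n_snps)]


if __name__ == "__main__":
    alleles = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]
    packed = pack_snps(alleles)
    print(len(alleles), "SNPs stored in", len(packed), "bytes")
```

Compared with storing one integer per SNP, this uses one bit per biallelic site, which is the kind of reduction that makes genome-wide datasets fit in the memory of a personal computer.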

A reference database for circular dichroism spectroscopy covering fold and secondary structure space. Lees JG Bioinformatics 2006 22 1955 1962 10.1093/bioinformatics/btl327 16787970

September 2006

·

164 Reads

Motivation: Circular Dichroism (CD) spectroscopy is a long-established technique for studying protein secondary structures in solution. Empirical analyses of CD data rely on the availability of reference datasets comprised of far-UV CD spectra of proteins whose crystal structures have been determined. This article reports on the creation of a new reference dataset which effectively covers both secondary structure and fold space, and uses the higher information content available in synchrotron radiation circular dichroism (SRCD) spectra to more accurately predict secondary structure than has been possible with existing reference datasets. It also examines the effects of wavelength range, structural redundancy and different means of categorizing secondary structures on the accuracy of the analyses. In addition, it describes a novel use of hierarchical cluster analyses to identify protein relatedness based on spectral properties alone. The databases are shown to be applicable in both conventional CD and SRCD spectroscopic analyses of proteins. Hence, by combining new bioinformatics and biophysical methods, a database has been produced that should have wide applicability as a tool for structural molecular biology.

1001 Proteomes: A functional proteomics portal for the analysis of Arabidopsis thaliana accessions

March 2012

·

100 Reads

Motivation: The sequencing of over a thousand natural strains of the model plant Arabidopsis thaliana is producing unparalleled information at the genetic level for plant researchers. To enable the rapid exploitation of these data for functional proteomics studies, we have created a resource for the visualization of protein information and proteomic datasets for sequenced natural strains of A. thaliana. Results: The 1001 Proteomes portal can be used to visualize amino acid substitutions or non-synonymous single-nucleotide polymorphisms in individual proteins of A. thaliana based on the reference genome Col-0. We have used the available processed sequence information to analyze the conservation of known residues subject to protein phosphorylation among these natural strains. The substitution of amino acids in A. thaliana natural strains is heavily constrained and is likely a result of the conservation of functional attributes within proteins. At a practical level, we demonstrate that this information can be used to clarify ambiguously defined phosphorylation sites from phosphoproteomic studies. Protein sets of available natural variants are available for download to enable proteomic studies on these accessions. Together this information can be used to uncover the possible roles of specific amino acids in determining the structure and function of proteins in the model plant A. thaliana. An online portal to enable the community to exploit these data can be accessed at http://1001proteomes.masc-proteomics.org/

Wong, W.H.: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25(8), 1026-1032

March 2009

·

131 Reads

The development of RNA sequencing (RNA-Seq) makes it possible for us to measure transcription at an unprecedented precision and throughput. However, challenges remain in understanding the source and distribution of the reads, modeling the transcript abundance and developing efficient computational methods. In this article, we develop a method to deal with the isoform expression estimation problem. The count of reads falling into a locus on the genome annotated with multiple isoforms is modeled as a Poisson variable. The expression of each individual isoform is estimated by solving a convex optimization problem and statistical inferences about the parameters are obtained from the posterior distribution by importance sampling. Our results show that isoform expression inference in RNA-Seq is possible by employing appropriate statistical methods. Contact: whwong{at}stanford.edu Supplementary information: Supplementary data are available at Bioinformatics online.
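The Poisson model described above can be made concrete on a toy locus. In the sketch below, the read count of each region j is modelled as Poisson with mean l_j · Σ_i a_ji · θ_i, where a_ji marks whether region j belongs to isoform i and θ_i is the isoform's expression. The maximum-likelihood estimate is obtained with a standard EM update, chosen here for brevity as one way to maximise the concave log-likelihood; the solver, the notation and the omission of a total-read scaling factor are assumptions, and the importance-sampling inference step is not shown.

```python
def estimate_isoform_expression(counts, lengths, membership, n_iter=200):
    """EM estimate of theta in x_j ~ Poisson(l_j * sum_i a_ji * theta_i).

    counts[j]: reads in region j; lengths[j]: effective length of region j;
    membership[j][i]: 1 if region j is part of isoform i, else 0.
    Every isoform is assumed to contain at least one region.
    """
    n_regions = len(counts)
    n_iso = len(membership[0])
    theta = [1.0] * n_iso
    # Total length each isoform exposes to sequencing.
    exposure = [sum(lengths[j] * membership[j][i] for j in range(n_regions))
                for i in range(n_iso)]
    for _ in range(n_iter):
        # Expected counts under the current estimate.
        mu = [lengths[j] * sum(membership[j][i] * theta[i] for i in range(n_iso))
              for j in range(n_regions)]
        # Multiplicative update: reassign observed reads in proportion to mu.
        theta = [
            theta[i] / exposure[i] * sum(
                lengths[j] * membership[j][i] * counts[j] / mu[j]
                for j in range(n_regions) if mu[j] > 0
            )
            for i in range(n_iso)
        ]
    return theta


if __name__ == "__main__":
    # Region 0 is shared; regions 1 and 2 are unique to isoforms 0 and 1.
    membership = [[1, 1], [1, 0], [0, 1]]
    print(estimate_isoform_expression([300, 200, 100], [100, 100, 100], membership))
```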

Table 1 . Feature comparison between AMPHORA and AMPHORA2. 
Wu M, Scott AJ.. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinforma Oxf Engl 28: 1033-1034

February 2012

·

337 Reads

With the explosive growth of bacterial and archaeal sequence data, large-scale phylogenetic analyses present both opportunities and challenges. Here we describe AMPHORA2, an automated phylogenomic inference tool that can be used for high-throughput, high-quality genome tree reconstruction and metagenomic phylotyping. Compared with its predecessor, AMPHORA2 has several major enhancements and new functions: it has a greatly expanded phylogenetic marker database and can analyze both bacterial and archaeal sequences; it incorporates probability-based sequence alignment masks that improve the phylogenetic accuracy; it can analyze DNA as well as protein sequences and is more sensitive in marker identification; finally, it is over 100× faster in metagenomic phylotyping. Availability: http://wolbachia.biology.virginia.edu/WuLab/Software.html. Contact: mw4yv@virginia.edu Supplementary information: Supplementary data are available at Bioinformatics online.
