We report a novel protein domain, G8, which contains five repeated β-strand pairs and is present in some disease-related proteins such as PKHD1, KIAA1199 and TMEM2, as well as in other uncharacterized proteins. Most G8-containing proteins ...
Protein models can be of great assistance in functional genomics, as they provide the structural insights often necessary to understand protein function. Although comparative modelling is far from yielding perfect structures, it is still the most reliable method, and the quality of its predictions is now well understood. Models can be classified according to their correctness and accuracy, which determine their applicability and usefulness in functional genomics and in a variety of other situations.
Contact: manuel.peitsch@pharma.novartis.com; manuel.peitsch@isb-sib.ch
Prediction of peptides binding with MHC class II allele HLA-DRB1*0401 can effectively reduce the number of experiments required for identifying helper T cell epitopes. This paper describes
a support vector machine (SVM) based method developed for identifying HLA-DRB1*0401 binding peptides in an antigenic sequence. The SVM was trained and tested on a large, clean data set consisting of 567 binders
and an equal number of non-binders. The accuracy of the method was 86% when evaluated by 5-fold cross-validation.
Availability: A web server, HLA-DR4Pred, based on the above approach is available at http://www.imtech.res.in/raghava/hladr4pred/ and http://bioinformatics.uams.edu/mirror/hladr4pred/ (Mirror Site).
Supplementary information: http://www.imtech.res.in/raghava/hladr4pred/info.html
Low complexity proteins and protein domains have sequences which appear highly non-random. Over the years, these sequences have been routinely filtered out during sequence similarity searches because interest has been focused on globular proteins, and inclusion of these domains can severely skew search results. However, early work on these proteins and more recent studies of the related area of repeated protein sequences suggest that low complexity protein domains have function and are therefore in need of further investigation. 0j.py is a new tool for demarcating low complexity protein domains more accurately than has been possible to date. The paper describes 0j.py and its use in revealing proteins with repeated and poly-amino-acid peptides. Statistical methods are then employed to examine the distribution of these proteins across species, while keyword clustering is used to suggest roles performed by proteins through the use of low complexity domains.
Contact: M.Wise@ccsr.cam.ac.uk
The information matrix database (IMD), a database of weight matrices of transcription factor binding sites, is developed. MATRIX SEARCH, a program which can find potential transcription factor binding sites in DNA sequences using the IMD database, is also developed and accompanies the IMD database. MATRIX SEARCH adopts a user interface very similar to that of the SIGNAL SCAN program. MATRIX SEARCH allows the user to search an input sequence with the IMD automatically, to visualize the matrix representations of sites for particular factors, and to retrieve journal citations. The source code for MATRIX SEARCH is in the 'C' language, and the program is available for unix platforms.
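The kind of weight-matrix search described above can be sketched in a few lines of Python. This is only an illustration: the matrix values, the threshold and the function names are invented here and are not taken from the IMD or from the MATRIX SEARCH source, which is written in C.

```python
# Hypothetical 4-position weight matrix (one dict of base weights per position).
# The numbers are invented for illustration and do not come from the IMD.
TOY_MATRIX = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.5},
    {"A": 1.5, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.5},
    {"A": 1.5, "C": -1.0, "G": -1.0, "T": -1.0},
]


def score_window(window, matrix):
    """Sum the per-position weights for one window of the sequence."""
    # Bases other than A, C, G, T raise KeyError in this simplified sketch.
    return sum(column[base] for base, column in zip(window, matrix))


def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(sequence[start:start + width], matrix)
        if score >= threshold:
            hits.append((start, score))
    return hits


hits = scan("GGTATAAC", TOY_MATRIX, threshold=5.0)
print(hits)
```

Only the forward strand is scanned here; a fuller search would presumably also score the reverse complement.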
Data assimilation (DA) is a computational approach that estimates unknown parameters in a pathway model using time-course
information. Particle filtering, the underlying method used, is a well-established statistical method that approximates the
joint posterior distributions of parameters by using sequentially generated Monte Carlo samples. In this article, we report
the release of Java-based software (DA 1.0) with an intuitive and user-friendly interface that allows users to carry out parameter
estimation using DA.
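To make the idea of particle filtering for parameter estimation concrete, here is a minimal Python sketch. It is not the DA 1.0 implementation (which is in Java): the one-parameter decay model, the uniform prior, the Gaussian observation noise and the small jitter added after resampling are all assumptions chosen to keep the example short.

```python
import math
import random


def simulate(k, x0, dt, steps):
    """Toy model (assumed): exponential decay dx/dt = -k * x, sampled every dt."""
    return [x0 * math.exp(-k * dt * (i + 1)) for i in range(steps)]


def particle_filter(observations, x0, dt, n_particles, obs_sd, jitter_sd, rng):
    """Estimate the decay rate k from a time course by sequential resampling."""
    # Each particle is one candidate value of k, drawn from a uniform prior.
    particles = [rng.uniform(0.0, 2.0) for _ in range(n_particles)]
    for step, y in enumerate(observations, start=1):
        t = dt * step
        # Weight every particle by the Gaussian likelihood of this observation.
        weights = [
            math.exp(-0.5 * ((y - x0 * math.exp(-k * t)) / obs_sd) ** 2)
            for k in particles
        ]
        if sum(weights) == 0.0:
            weights = [1.0] * n_particles  # all weights underflowed: keep everyone
        # Resample in proportion to the weights, then jitter to keep diversity.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        particles = [max(0.0, k + rng.gauss(0.0, jitter_sd)) for k in particles]
    # The mean of the final particle cloud approximates the posterior mean.
    return sum(particles) / n_particles


rng = random.Random(0)
true_k = 0.5
clean = simulate(true_k, x0=10.0, dt=0.5, steps=20)
noisy = [x + rng.gauss(0.0, 0.2) for x in clean]
estimate = particle_filter(noisy, x0=10.0, dt=0.5, n_particles=1000,
                           obs_sd=0.2, jitter_sd=0.01, rng=rng)
print(f"estimated k = {estimate:.3f} (true value {true_k})")
```

In a real pathway model each particle would carry several parameters and the hidden state of the pathway, but the weight, resample and perturb cycle would follow the same pattern.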
Availability and Implementation: DA 1.0 was developed using Java and should therefore run on any platform with JDK 6.0 (not JRE 6.0) or later installed.
DA 1.0 is freely available for academic users and can be launched or downloaded from http://da.csml.org.
Contact: masao@ims.u-tokyo.ac.jp
In flux balance analysis of genome scale stoichiometric models of metabolism, the principal constraints are uptake or secretion rates, the steady state mass conservation assumption and reaction directionality. Here, we introduce an algorithmic pipeline for quantitative assignment of reaction directionality in multi-compartmental genome scale models based on an application of the second law of thermodynamics to each reaction. Given experimental or computationally estimated standard metabolite species Gibbs energies and metabolite concentrations, the algorithm bounds the reaction Gibbs energy, which is transformed to in vivo pH, temperature, ionic strength and electrical potential.
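The core of the second-law argument can be illustrated with a short Python sketch (the toolbox extension itself is written in MATLAB). The sketch assumes the textbook relation ΔG = ΔG° + RT Σ s_i ln c_i and deliberately ignores the transformation to in vivo pH, ionic strength and electrical potential; the function names, metabolite names and numerical values are hypothetical.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol*K)


def gibbs_bounds(dg0, stoich, conc_bounds, temperature=310.15):
    """Return (min, max) of dG = dg0 + R*T*sum(s_i * ln c_i).

    stoich maps metabolite -> stoichiometric coefficient (negative = substrate).
    conc_bounds maps metabolite -> (lowest, highest) concentration in mol/L.
    """
    rt = R * temperature
    lowest = highest = dg0
    for metabolite, s in stoich.items():
        c_min, c_max = conc_bounds[metabolite]
        a = s * math.log(c_min)
        b = s * math.log(c_max)
        # Each term is monotonic in c, so its extremes lie at the bounds.
        lowest += rt * min(a, b)
        highest += rt * max(a, b)
    return lowest, highest


def directionality(dg_min, dg_max):
    """A reaction can only proceed in the direction of negative dG."""
    if dg_max < 0:
        return "forward"
    if dg_min > 0:
        return "reverse"
    return "reversible"


# Hypothetical reaction A -> B with both metabolites between 10 uM and 10 mM.
stoich = {"A": -1, "B": 1}
bounds = {"A": (1e-5, 1e-2), "B": (1e-5, 1e-2)}
for dg0 in (-30.0, -5.0, 30.0):
    lo, hi = gibbs_bounds(dg0, stoich, bounds)
    print(dg0, round(lo, 1), round(hi, 1), directionality(lo, hi))
```

A reaction is labelled irreversible only when the whole interval lies on one side of zero, which is why wide concentration ranges leave many reactions reversible.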
This cross-platform MATLAB extension to the COnstraint-Based Reconstruction and Analysis (COBRA) toolbox is computationally efficient, extensively documented and open source.
http://opencobra.sourceforge.net.
infernal builds consensus RNA secondary structure profiles called covariance models (CMs), and uses them to search nucleic acid sequence
databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments.
Availability: Source code, documentation and benchmark downloadable from http://infernal.janelia.org. infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and
Mac OS/X.
Contact: nawrockie@janelia.hhmi.org, kolbed@janelia.hhmi.org, eddys@janelia.hhmi.org
Molecular dynamics (MD) simulations provide detailed insights into the structure and function of biomolecular systems. Thus, they complement experimental measurements by giving access to experimentally inaccessible regimes. Amongst the different MD techniques, native structure-based models (SBM) are based on energy landscape theory and the principle of minimal frustration. Typically employed in protein and RNA folding simulations, they coarse-grain the biomolecular system and/or simplify the Hamiltonian, resulting in modest computational requirements while achieving high agreement with experimental data. eSBMTools streamlines running and evaluating SBM in a comprehensive package and offers high flexibility in adding experimental or bioinformatics-derived restraints.
We present a software package that allows setting up, modifying and evaluating SBM for both RNA and proteins. The implemented workflows include predicting protein complexes based on bioinformatics-derived inter-protein contact information, a standardized setup of protein folding simulations based on the common PDB format, calculating reaction coordinates and evaluating the simulation by free-energy calculations with WHAM or by phi-values. The modules interface with the molecular dynamics simulation program GROMACS. The package is open source and written in architecture-independent Python 2.
http://sourceforge.net/projects/esbmtools/
Contact: alexander.schug@kit.edu
MesoRD is a tool for simulating stochastic reaction-diffusion systems as modeled by the reaction diffusion master equation. The simulated systems are defined in the Systems Biology Markup Language with additions to define compartment geometries. MesoRD 1.0 supports scale-dependent reaction rate constants and reactions between reactants in neighbouring subvolumes. These new features make it possible to construct physically consistent models of diffusion-controlled reactions also at fine spatial discretization.
Availability:
MesoRD is written in C++ and licensed under the GNU general public license (GPL). MesoRD can be downloaded at http://mesord.sourceforge.net. The MesoRD homepage, http://mesord.sourceforge.net, contains detailed documentation and news about recently implemented features.
Contact:
johan.elf@icm.uu.se.
Motivation:
A large number of new DNA sequences with virtually unknown functions are generated as the Human Genome Project progresses. Therefore, it is essential to develop computer algorithms that can predict the functionality of DNA segments according to their primary sequences, including algorithms that can predict promoters. Although several promoter-predicting algorithms are available, they produce many false positives, and the rate of promoter detection needs to be improved further.
Results:
In this research, PromFD, a computer program to recognize vertebrate RNA polymerase II promoters, has been developed. Both vertebrate promoters and non-promoter sequences are used in the analysis. The promoters are obtained from the Eukaryotic Promoter Database. Promoters are divided into a training set and a test set. Non-promoter sequences are obtained from the GenBank sequence databank, and are also divided into a training set and a test set. The first step is to search out, among all possible permutations, patterns of strings 5-10 bp long that are significantly over-represented in the promoter set. The program also searches for IMD (Information Matrix Database) matrices that have a significantly higher presence in the promoter set. The results of the searches are stored in the PromFD database, and the program PromFD scores input DNA sequences according to their content of the database entries. PromFD predicts promoters: their locations and the locations of potential TATA boxes, if found. The program can detect 71% of promoters in the training set with a false-positive rate of under 1 in every 13,000 bp, and 47% of promoters in the test set with a false-positive rate of under 1 in every 9800 bp. PromFD uses a new approach, and its false-positive identification rate is better than that of other available promoter recognition algorithms. The source code for PromFD is in the 'C++' language.
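The first step, finding short strings that are over-represented in the promoter set, can be sketched in Python as follows. The abstract does not state which significance test PromFD uses, so this sketch substitutes a plain frequency ratio with a pseudocount; the sequences, the cutoff and the function names are all invented for illustration.

```python
from collections import Counter


def kmer_frequencies(sequences, k):
    """Fraction of sequences that contain each k-mer at least once."""
    counts = Counter()
    for seq in sequences:
        # A set is used so that a k-mer counts once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {kmer: n / len(sequences) for kmer, n in counts.items()}


def overrepresented(promoters, background, k, min_ratio):
    """Return k-mers whose promoter/background frequency ratio >= min_ratio."""
    fg = kmer_frequencies(promoters, k)
    bg = kmer_frequencies(background, k)
    pseudo = 1.0 / (len(background) + 1)  # assumed pseudocount, avoids dividing by zero
    result = {}
    for kmer, freq in fg.items():
        ratio = freq / (bg.get(kmer, 0.0) + pseudo)
        if ratio >= min_ratio:
            result[kmer] = ratio
    return result


# Toy sequences, invented for this example.
toy_promoters = ["GCTATAAAGG", "CCTATAAATC", "TTTATAAAGC"]
toy_background = ["GCGCGCGCGC", "ACGTACGTAC", "GGGCCCGGGC"]
found = overrepresented(toy_promoters, toy_background, k=6, min_ratio=3.0)
print(found)
```

To cover the 5-10 bp range mentioned above, the same function could be called for each `k` in `range(5, 11)`.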
The protein sequence similarity search has become a major tool for biologists. Various efficient and rapid programs and comparison
matrices have been designed and refined in order to perform the scanning task (BLAST, FASTA, Automat, etc.). However, the
final step of the search, the analysis of the results, is still tedious and time consuming. In order to optimize true-positive
hit screening, we have developed a program which makes a multiple alignment from the BLAST search output. Conserved sequence
segments are pointed out. It makes the recognition of already known as well as new sequence patterns easier. It allows at
a glance a rapid identification of significant similarities, protein family signature and new sequence motifs. This alignment
is written in a compatible format for the GCG programs LineUp and ProfileMake.
GOLD (Genomes On Line Database) is a World Wide Web resource for comprehensive access to information regarding complete and ongoing genome projects around the world.
Availability:
GOLD is based at the University of Illinois at Urbana-Champaign and is available at http://geta.life.uiuc.edu/~nikos/genomes.html. It is also mirrored at the European Bioinformatics Institute at http://www.ebi.ac.uk/research/cgg/genomes.html.
Contact:
genomes@ebi.ac.uk
The computational inference of ancestral genomes consists of five difficult steps: identifying syntenic regions, inferring the ancestral arrangement of syntenic regions, aligning multiple sequences, reconstructing the insertion and deletion history and finally inferring substitutions. Each of these steps has received a lot of attention in recent years. However, there is currently no framework that integrates all of the different steps in an easy workflow. Here, we introduce Ancestors 1.0, a web server allowing one to easily and quickly perform the last three steps of the ancestral genome reconstruction procedure. It implements several alignment algorithms, an indel maximum likelihood solver and a context-dependent maximum likelihood substitution inference algorithm. The results presented by the server include the posterior probabilities for the last two steps of the ancestral genome reconstruction and the expected error rate of each ancestral base prediction.
Availability:
Ancestors 1.0 is available at http://ancestors.bioinfo.uqam.ca/ancestorWeb/.
CMap is a web-based tool for displaying and comparing maps of any type and from any species. A user can compare an unlimited number of maps, view pair-wise comparisons of known correspondences, and search for maps or for features by name, species, type and accession. CMap is freely available, can run on a variety of database engines and uses only free and open software components.
Availability: http://www.gmod.org/cmap
Contact: kclark@cshl.edu
A software suite, 'Java-based Molecular Biologists' Workbench' (JaMBW), has been developed in order to accomplish common bioinformatics tasks, and can be accessed at the URL: http://www.embl-heidelberg.de/JaMBW/. Java implementations are designed to operate on any computer architecture. Furthermore, Java is designed to be independent of the operating system, relying only on the availability of a Java Virtual Machine (JVM) package. JVMs were initially implemented as software, but with the creation of Java processors, they are now becoming available as hardware implementations. One can even foresee that hybrid solutions may appear, in which multiprocessor-based computers are fitted with a Java chip so that the speed of Java-based code is greatly enhanced (apparently by a factor of up to 10 times; Sun, 1996). Java code will therefore not only be completely portable—drastically reducing development costs—but will also have better performance than native C-code implementations. The superiority of the
Infernal builds probabilistic profiles of the sequence and secondary structure of an RNA family called covariance models (CMs) from structurally annotated multiple sequence alignments given as input. Infernal uses CMs to search for new family members in sequence databases, and to create potentially large multiple sequence alignments. Version 1.1 of Infernal introduces a new filter pipeline for RNA homology search based on accelerated profile HMM methods and HMM-banded CM alignment methods. This enables a roughly 100-fold acceleration over the previous version and roughly a 10,000-fold acceleration over exhaustive, non-filtered CM searches.
Source code, documentation, and the benchmark are downloadable from http://infernal.janelia.org. Infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Documentation includes a user's guide with a tutorial, a discussion of file formats and user options, and additional details on methods implemented in the software.
nawrockie@janelia.hhmi.org, eddys@janelia.hhmi.org.
When analysing novel protein sequences, it is now essential to extend search strategies to include a range of 'secondary' databases. Pattern databases have become vital tools for identifying distant relationships in sequences, and hence for predicting protein function and structure. The main drawback of such methods is the relatively small representation of proteins in trial samples at the time of their construction. Therefore, a negative result of an amino acid sequence comparison with such a databank forces a researcher to search for similarities in the original protein banks. We developed a database of patterns constructed for groups of related proteins with maximum representation of amino acid sequences of SWISS-PROT in the groups.
Software tools and a new method have been designed to construct patterns of protein families. Using this method, a new version of the databank of protein family patterns, PROF_PAT 1.3, has been produced. This bank is based on SWISS-PROT (r1.38) and TrEMBL (r1.11), and contains patterns of more than 13 000 groups of related proteins in a format similar to that of PROSITE. Motifs of patterns that had the minimum probability of being found in random sequences were selected. A flexible, fast search program accompanies the bank. The researcher can specify a similarity matrix (PAM, BLOSUM or another type). Variable levels of similarity can be set (permitting search strategies ranging from exact matches to increasing levels of 'fuzziness').
The Internet address for comparing sequences with the bank is: http://wwwmgs.bionet.nsc.ru/mgs/programs/prof_pat/. The local version of the bank and search programs (approximately 50 Mb) is available via ftp: ftp://ftp.bionet.nsc.ru/pub/biology/vector/prof_pat/ and ftp://ftp.ebi.ac.uk/pub/databases/prof_pat/. Alternatively, amino acid sequences can be mailed to bachin@vector.nsc.ru for comparison with PROF_PAT 1.3.
While the R software is becoming a standard for the analysis of genetic data, classical population genetics tools are being challenged by the increasing availability of genomic sequences. Dedicated tools are needed for harnessing the large amount of information generated by next-generation sequencing technologies. We introduce new tools implemented in the adegenet 1.3-1 package for handling and analyzing genome-wide single nucleotide polymorphism (SNP) data. Using a bit-level coding scheme for SNP data and parallelized computation, adegenet enables the analysis of large genome-wide SNP datasets using standard personal computers.
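The memory saving behind a bit-level coding scheme can be illustrated with a small Python sketch. It shows the general idea only, storing eight binary SNP calls per byte; it is not adegenet's actual storage format (adegenet is an R package), it does not handle missing data, and the function names are hypothetical.

```python
def pack_snps(alleles):
    """Pack a list of 0/1 allele calls into bytes, 8 SNPs per byte."""
    packed = bytearray((len(alleles) + 7) // 8)
    for i, allele in enumerate(alleles):
        if allele:
            # SNP i is stored in bit (i % 8) of byte (i // 8).
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_snps(packed, n_snps):
    """Recover the first n_snps allele calls from the packed bytes."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(n_snps)]


calls = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
packed = pack_snps(calls)
print(len(calls), "SNPs stored in", len(packed), "bytes")
```

The number of SNPs has to be kept alongside the packed bytes, because the last byte may be only partly used.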
adegenet 1.3-1 is available from CRAN: http://cran.r-project.org/web/packages/adegenet/. Information and support including a dedicated forum of discussion can be found on the adegenet website: http://adegenet.r-forge.r-project.org/. adegenet is released with a manual and four tutorials totalling over 300 pages of documentation, and distributed under the GNU General Public Licence (≥2).
t.jombart@imperial.ac.uk.
Supplementary data are available at Bioinformatics online.
Motivation:
Circular Dichroism (CD) spectroscopy is a long-established technique for studying protein secondary structures in solution. Empirical analyses of CD data rely on the availability of reference datasets comprised of far-UV CD spectra of proteins whose crystal structures have been determined. This article reports on the creation of a new reference dataset which effectively covers both secondary structure and fold space, and uses the higher information content available in synchrotron radiation circular dichroism (SRCD) spectra to more accurately predict secondary structure than has been possible with existing reference datasets. It also examines the effects of wavelength range, structural redundancy and different means of categorizing secondary structures on the accuracy of the analyses. In addition, it describes a novel use of hierarchical cluster analyses to identify protein relatedness based on spectral properties alone. The databases are shown to be applicable in both conventional CD and SRCD spectroscopic analyses of proteins. Hence, by combining new bioinformatics and biophysical methods, a database has been produced that should have wide applicability as a tool for structural molecular biology.
Motivation:
The sequencing of over a thousand natural strains of the model plant Arabidopsis thaliana is producing unparalleled information at the genetic level for plant researchers. To enable the rapid exploitation of these data for functional proteomics studies, we have created a resource for the visualization of protein information and proteomic datasets for sequenced natural strains of A. thaliana.
Results:
The 1001 Proteomes portal can be used to visualize amino acid substitutions or non-synonymous single-nucleotide polymorphisms in individual proteins of A. thaliana based on the reference genome Col-0. We have used the available processed sequence information to analyze the conservation of known residues subject to protein phosphorylation among these natural strains. The substitution of amino acids in A. thaliana natural strains is heavily constrained and is likely a result of the conservation of functional attributes within proteins. At a practical level, we demonstrate that this information can be used to clarify ambiguously defined phosphorylation sites from phosphoproteomic studies. Protein sets for the natural variants are available for download to enable proteomic studies on these accessions. Together, this information can be used to uncover the possible roles of specific amino acids in determining the structure and function of proteins in the model plant A. thaliana. An online portal to enable the community to exploit these data can be accessed at http://1001proteomes.masc-proteomics.org/
The development of RNA sequencing (RNA-Seq) makes it possible for us to measure transcription with unprecedented precision
and throughput. However, challenges remain in understanding the source and distribution of the reads, modeling the transcript
abundance and developing efficient computational methods. In this article, we develop a method to deal with the isoform expression
estimation problem. The count of reads falling into a locus on the genome annotated with multiple isoforms is modeled as a
Poisson variable. The expression of each individual isoform is estimated by solving a convex optimization problem and statistical
inferences about the parameters are obtained from the posterior distribution by importance sampling. Our results show that
isoform expression inference in RNA-Seq is possible by employing appropriate statistical methods.
Contact: whwong@stanford.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
With the explosive growth of bacterial and archaeal sequence data, large-scale phylogenetic analyses present both opportunities and challenges. Here we describe AMPHORA2, an automated phylogenomic inference tool that can be used for high-throughput, high-quality genome tree reconstruction and metagenomic phylotyping. Compared with its predecessor, AMPHORA2 has several major enhancements and new functions: it has a greatly expanded phylogenetic marker database and can analyze both bacterial and archaeal sequences; it incorporates probability-based sequence alignment masks that improve the phylogenetic accuracy; it can analyze DNA as well as protein sequences and is more sensitive in marker identification; finally, it is over 100× faster in metagenomic phylotyping.
http://wolbachia.biology.virginia.edu/WuLab/Software.html.
mw4yv@virginia.edu
Supplementary data are available at Bioinformatics online.