ABSTRACT: The workshop "Bioinformatics for Biotechnology Applications (HavanaBioinfo 2012)", held December 8-11, 2012 in Havana, aimed at exploring new bioinformatics tools and approaches for large-scale proteomics, genomics and chemoinformatics. Major conclusions of the workshop include the following: (i) development of new applications and bioinformatics tools for the analysis of proteomic repositories is crucial; current proteomic repositories contain enough data (spectra/identifications) to increase the annotations in protein databases and to generate new tools for protein identification; (ii) spectral libraries, de novo sequencing and database search tools should be combined to increase the number of protein identifications; (iii) protein probabilities and FDR are not yet sufficiently mature; (iv) computational proteomics software needs to become more intuitive, and at the same time appropriate education and training should be provided to help mass spectrometrists and experimental biologists exchange knowledge efficiently with bioinformaticians and strengthen their bioinformatics background, especially in statistics.
Journal of proteomics 01/2013; · 5.07 Impact Factor
ABSTRACT: We report the release of mzIdentML, an exchange standard for peptide and protein identification data, designed by the Proteomics Standards Initiative. The format was developed in collaboration with instrument and software vendors, and the developers of the major open-source projects in proteomics. Software implementations have been developed to enable conversion from most popular proprietary and open-source formats, and mzIdentML will soon be supported by the major public repositories. These developments enable proteomics scientists to start working with the standard for exchanging and publishing data sets in support of publications, and they provide a stable platform for bioinformatics groups and commercial software vendors to work with a single file format for identification data.
ABSTRACT: A letter published in January 2011, "The Problem with Peptide Presumption and Low Mascot Scoring", raised concerns about the reporting of peptide identifications based on mass spectrometry data with high precursor mass accuracy. We explain why we believe these concerns are unfounded.
Journal of Proteome Research 09/2011; 10(11):5272-3. · 5.06 Impact Factor
ABSTRACT: A new result report for Mascot search results is described. A greedy set cover algorithm is used to create a minimal set of proteins, which is then grouped into families on the basis of shared peptide matches. Protein families with multiple members are represented by dendrograms, generated by hierarchical clustering using the score of the nonshared peptide matches as a distance metric. The peptide matches to the proteins in a family can be compared side by side to assess the experimental evidence for each protein. If the evidence for a particular family member is considered inadequate, the dendrogram can be cut to reduce the number of distinct family members.
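The greedy set cover step mentioned in this abstract can be illustrated with a minimal sketch. It is not the Mascot implementation: the function name `greedy_protein_cover`, the input layout (a dict mapping protein accession to a set of matched peptides) and the alphabetical tie-break are all assumptions made for illustration. Greedy selection gives a small cover, which is not guaranteed to be the smallest possible one.

```python
def greedy_protein_cover(protein_to_peptides):
    """Pick proteins one at a time until every observed peptide is explained.

    protein_to_peptides: dict of accession -> set of peptide sequences
    (hypothetical input layout, assumed for this sketch).
    """
    # All peptides that still need at least one protein to explain them.
    uncovered = set()
    for peptides in protein_to_peptides.values():
        uncovered |= peptides

    chosen = []
    while uncovered:
        # Take the protein explaining the most still-uncovered peptides.
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda acc: len(protein_to_peptides[acc] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Toy data: P2 is a subset of P1, and P3/P4 both explain peptide "d".
matches = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c"},
    "P3": {"d"},
    "P4": {"c", "d"},
}
cover = greedy_protein_cover(matches)
```

In this toy case P2 is never selected because P1 already explains all of its peptides. Grouping the selected proteins into families and clustering them, as the abstract describes, would be separate later steps.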
ABSTRACT: The Human Proteome Organization Proteomics Standards Initiative has produced reporting requirements and data interchange formats for the proteomics community. The implementation of these increasingly mature formats was the main focus of this meeting, with extensions being made to many schemas to enable encoding of new data types. The endorsement of the Proteomics Standards Initiative standards by an increasing number of journals is a main driving force behind tool development, as is the recognized need to ease the process of data deposition into the public domain for the bench scientist.
ABSTRACT: The role of the Human Proteome Organisation Proteomics Standards Initiative (HUPO-PSI) is to produce and release community-accepted reporting requirements, interchange formats and controlled vocabularies for mass spectrometry proteomics and related technologies such as gel electrophoresis, column chromatography and molecular interactions. A number of significant advances were made at this workshop, with the new MS standard, mzML, being finalised prior to release on 1st June 2008 and analysisXML, which will allow protein and peptide identifications and post-translational modifications to be captured, being prepared to enter the review process this summer. The accompanying controlled vocabularies are continuing to evolve and a number of standards papers are now being finalised prior to publication.
ABSTRACT: Over the last five years, the Human Proteome Organisation Proteomics Standards Initiative (HUPO PSI) has produced and released community-accepted XML interchange formats in the fields of mass spectrometry, molecular interactions and gel electrophoresis, has led the field in the discussion of the minimum information with which such data should be annotated and is now in the process of publishing much of this information. At this 4th Spring workshop, the emphasis was on consolidating this effort, refining and improving the existing models and pushing these forward to align with more broadly encompassing efforts such as FuGE (Jones, A.R., Pizarro, A., Spellman, P., Miller, M., FuGE Working Group FuGE: Functional Genomics Experiment Object Model. OMICS 2006, 10, 179-184) and the Ontology for Biomedical Investigation (OBI). The effort to merge the existing mass spectrometry XML interchange formats, mzData and mzXML, into one single standard, mzML, yielded significant progress. The preliminary design of AnalysisXML was also extended to include several new use cases and better support for quantification information. Finally, the Molecular Interaction group discussed the development of a molecular interaction scoring system with accompanying gold standard data test sets.
ABSTRACT: Unimod is a database of protein modifications for use in mass spectrometry applications, especially protein identification and de novo sequencing. It contains accurate and verifiable values, derived from elemental compositions, for the mass differences introduced by both natural and artificial modifications.
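As a rough illustration of how a mass difference can be derived from an elemental composition, the sketch below sums monoisotopic element masses over a composition string. The string layout ("H(2) C(2) O", with counts in parentheses and negative counts for atoms lost) is modelled on the way compositions are commonly displayed and is an assumption here, as are the function name `delta_mass` and the rounded element masses in the table. It is not code from Unimod.

```python
import re

# Monoisotopic masses in Da for a few common elements (rounded values,
# listed here as an assumption for the sketch; extend as needed).
MONOISOTOPIC = {
    "H": 1.007825035,
    "C": 12.0,
    "N": 14.003074,
    "O": 15.99491463,
    "P": 30.973762,
    "S": 31.9720707,
}

# One token is an element symbol with an optional signed count, e.g. "H(-1)".
TOKEN = re.compile(r"([A-Z][a-z]?)(?:\((-?\d+)\))?")


def delta_mass(composition):
    """Sum element masses for a composition such as 'H(2) C(2) O'."""
    total = 0.0
    for token in composition.split():
        match = TOKEN.fullmatch(token)
        if match is None or match.group(1) not in MONOISOTOPIC:
            raise ValueError(f"unrecognised composition token: {token!r}")
        count = int(match.group(2)) if match.group(2) else 1
        total += count * MONOISOTOPIC[match.group(1)]
    return total


acetyl = delta_mass("H(2) C(2) O")          # acetylation
phospho = delta_mass("H O(3) P")            # phosphorylation
deamidation = delta_mass("H(-1) N(-1) O")   # deamidation (net loss of NH, gain of O)
```

Because each value is computed from the composition rather than typed in by hand, it can be checked independently, which is the sense in which such values are verifiable.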
ABSTRACT: An error tolerant mode for database matching of uninterpreted tandem mass spectrometry data is described. Selected database entries are searched without enzyme specificity, using a comprehensive list of chemical and post-translational modifications, together with a residue substitution matrix. The modifications are tested serially, to avoid the catastrophic loss of discrimination that would occur if all the permutations of large numbers of modifications in combination were possible. The new mode has been coded as an extension to the Mascot search engine, and tested against a number of liquid chromatography-tandem mass spectrometry datasets. The results show a number of additional peptide matches, but require careful interpretation. The most significant limitation of this approach is that it can only reveal new matches to proteins that already have at least one significant peptide match.
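The reason for testing modifications serially can be seen from a simple count. The sketch below is a deliberately simplified model, not the Mascot search logic: it assumes each modification is either present once or absent on a peptide and ignores site placement. The function names and the placeholder modification labels are hypothetical.

```python
from itertools import combinations


def serial_candidates(mods):
    """Unmodified peptide plus each modification tried on its own."""
    return [()] + [(mod,) for mod in mods]


def combined_candidates(mods):
    """Every subset of the modification list, from none to all at once."""
    return [
        subset
        for size in range(len(mods) + 1)
        for subset in combinations(mods, size)
    ]


# Ten placeholder modification labels.
mods = [f"mod{i}" for i in range(10)]
n_serial = len(serial_candidates(mods))      # grows linearly: n + 1
n_combined = len(combined_candidates(mods))  # grows exponentially: 2 ** n
```

Under these assumptions, ten modifications give 11 candidates per peptide when tested serially and 1024 when allowed in combination. A comprehensive modification list is far longer than ten entries, so the combined search space grows too large for a random match to be distinguished from a real one.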
ABSTRACT: The use of mass spectrometry data to search molecular sequence databases is a well-established method for protein identification. The technique can be extended to searching raw genomic sequences, providing experimental confirmation or correction of predicted coding sequences, and has the potential to identify novel genes and elucidate splicing patterns.
Trends in Biotechnology 11/2001; 19(10 Suppl):S17-22. · 9.66 Impact Factor
ABSTRACT: The public availability of a draft assembly of the human genome has enabled us to demonstrate, for the first time, the feasibility of searching a complete, unmasked eukaryotic genome using uninterpreted mass spectrometry data. A complex LC-MS/MS data set, containing peptides from at least 22 human proteins, was searched against a comprehensive, nonidentical protein database, an expressed sequence tag (EST) database, and the International Human Genome Project draft assembly of the human genome. The results from the three searches are compared in detail, and the merits of the different databases for this application are discussed. In the case of the EST database, the UniGene index provided a method of simplifying and summarising the search results. In the case of the genomic DNA, the presence of introns prevented matching of roughly one quarter of the spectra, but the technique can provide primary experimental verification of predicted coding sequences, and has the potential to identify novel coding sequences.
ABSTRACT: Several algorithms have been described in the literature for protein identification by searching a sequence database using mass spectrometry data. In some approaches, the experimental data are peptide molecular weights from the digestion of a protein by an enzyme. Other approaches use tandem mass spectrometry (MS/MS) data from one or more peptides. Still others combine mass data with amino acid sequence data. We present results from a new computer program, Mascot, which integrates all three types of search. The scoring algorithm is probability based, which has a number of advantages: (i) A simple rule can be used to judge whether a result is significant or not. This is particularly useful in guarding against false positives. (ii) Scores can be compared with those from other types of search, such as sequence homology. (iii) Search parameters can be readily optimised by iteration. The strengths and limitations of probability-based scoring are discussed, particularly in the context of high throughput, fully automated protein identification.