Article

Identification of biomarkers from mass spectrometry data using a "common" peak approach.

Department of Mathematical Analysis and Statistical Inference, Institute of Statistical Mathematics, Tokyo, Japan.
BMC Bioinformatics (impact factor: 2.75). 02/2006; 7:358. DOI:10.1186/1471-2105-7-358
Source: PubMed

ABSTRACT Proteomic data obtained from mass spectrometry have attracted great interest for the detection of early-stage cancer. However, as mass spectrometry data are high-dimensional, identification of biomarkers is a key problem.
This paper proposes the use of "common" peaks in data as biomarkers. Analysis is conducted as follows: data preprocessing, identification of biomarkers, and application of AdaBoost to construct a classification function. Informative "common" peaks are selected by AdaBoost. AsymBoost is also examined to balance false negatives and false positives. The effectiveness of the approach is demonstrated using an ovarian cancer dataset.
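The boosting step described above can be illustrated with a minimal sketch. It assumes decision stumps on single peaks as weak learners, which the abstract does not state, and it models AsymBoost with one common formulation (reweighting positives by a factor derived from a cost ratio `k` in every round), which may differ in detail from the variant used in the paper. All function names and the parameter `k` are hypothetical.

```python
import math

def stump_predict(row, j, t, sign):
    """Weak learner: predict `sign` if feature j exceeds threshold t, else -sign."""
    return sign if row[j] > t else -sign

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: one below the minimum, then midpoints.
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def normalize(w):
    total = sum(w)
    return [wi / total for wi in w]

def boost(X, y, rounds=10, k=1.0):
    """AdaBoost with stumps; labels are +1 / -1.

    k is an assumed cost ratio: k > 1 up-weights positives each round
    (asymmetric variant), while k = 1 reduces to plain AdaBoost.
    """
    w = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(rounds):
        # Asymmetric reweighting, spread evenly over the rounds.
        shift = math.log(math.sqrt(k)) / rounds
        w = normalize([wi * math.exp(yi * shift) for wi, yi in zip(w, y)])
        err, j, t, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, t, sign))
        # Standard AdaBoost update: misclassified samples gain weight.
        w = normalize([wi * math.exp(-alpha * yi * stump_predict(row, j, t, sign))
                       for row, yi, wi in zip(X, y, w)])
    return model

def predict(model, row):
    score = sum(a * stump_predict(row, j, t, s) for a, j, t, s in model)
    return 1 if score >= 0 else -1

def selected_features(model):
    """Indices of the features (peaks) used by at least one stump."""
    return sorted({j for _, j, _, _ in model})
```

In this sketch, the peaks returned by `selected_features` play the role of the informative "common" peaks picked out by boosting; raising `k` above 1 penalizes missed positives more heavily, which is one way to trade false negatives against false positives.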
Continuous covariates and discrete covariates can be used in the present approach. The difference between the result for the continuous covariates and that for the discrete covariates was investigated in detail. In the example considered here, both covariates provide a good prediction, but it seems that they provide different kinds of information. We can obtain more information on the structure of the data by integrating both results.
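To make the distinction between the two covariate types concrete, here is a rough sketch under stated assumptions: each spectrum is already reduced to a list of `(mz, intensity)` peaks, a peak counts as "common" when it falls in an m/z bin occupied in at least a given fraction of spectra, the discrete covariate is peak presence (0/1), and the continuous covariate is the strongest peak intensity in the bin. None of these definitions is given in the abstract; `bin_width`, `min_fraction`, and the function names are hypothetical.

```python
from collections import Counter

def common_peak_bins(spectra, bin_width=1.0, min_fraction=0.5):
    """Return the sorted m/z bins that contain a peak in enough spectra.

    spectra: list of peak lists, each a list of (mz, intensity) tuples.
    """
    counts = Counter()
    for peaks in spectra:
        # A set ensures each spectrum is counted once per bin.
        counts.update({int(mz // bin_width) for mz, _ in peaks})
    needed = min_fraction * len(spectra)
    return sorted(b for b, c in counts.items() if c >= needed)

def covariates(peaks, bins, bin_width=1.0):
    """Build (discrete, continuous) covariate vectors for one spectrum."""
    strongest = {}
    for mz, intensity in peaks:
        b = int(mz // bin_width)
        strongest[b] = max(strongest.get(b, 0.0), intensity)
    discrete = [1 if b in strongest else 0 for b in bins]   # peak present?
    continuous = [strongest.get(b, 0.0) for b in bins]      # peak height
    return discrete, continuous
```

Either vector could be fed to a classifier; comparing which bins each version relies on is one way to see that the two encodings may carry different information.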

  • Article: SpecAlign--processing and alignment of mass spectra datasets.
    ABSTRACT: Pre-processing of chromatographic profile or mass spectral data is an important aspect of many types of proteomics and biomarker discovery experiments. Here we present a graphical computational tool, SpecAlign, that enables simultaneous visualization and manipulation of multiple datasets. SpecAlign not only provides all common processing functions, but also uniquely implements an algorithm that enables the complete alignment of each mass spectrum within a loaded dataset. We demonstrate its utility by aligning two datasets each containing six spectra; one set was acquired prior to instrument calibration and the other following calibration. AVAILABILITY: The software is free of charge and available for download from http://ptcl.chem.ox.ac.uk/~jwong/specalign. Supports Windows operating systems including Windows 9X/NT/2000/XP.
    Bioinformatics 06/2005; 21(9):2088-90. · 5.47 Impact Factor
  • Article: Algorithms for alignment of mass spectrometry proteomic data.
    ABSTRACT: MOTIVATION: The analysis of biological samples with high-throughput mass spectrometers has increased greatly in recent years. As larger datasets are processed, it is important that the spectra are aligned to ensure that the same protein intensities are correctly identified in each sample. Without such an alignment procedure it is possible to make errors in identifying the signals from peptides with similar molecular weight. Two algorithms are provided that can improve the alignment among samples. One algorithm is designed to work with SELDI data produced from a Ciphergen instrument, and the other can be used with data in a more general format. RESULTS: The two algorithms were applied to samples drawn from a common pool of reference serum. The results indicate substantial improvement in consistently identifying peptide signals in different samples.
    Bioinformatics 08/2005; 21(14):3066-73. · 5.47 Impact Factor
  • Conference Proceeding: GroupAdaBoost for Selecting Important Genes.
    ABSTRACT: This paper proposes GroupAdaBoost as a variant of AdaBoost for statistical pattern recognition. The objective of the proposed algorithm is to solve the p ≫ n problem arising in bioinformatics. Typically, p is the number of investigated genes and n is the number of individuals in a microarray experiment that observes gene expression in order to extract a specific expression pattern related to disease status. Ordinary methods for predicting the genetic causes of diseases are apt to over-learn from a particular training dataset because of the p ≫ n problem. We observed that GroupAdaBoost gave robust performance when the number of genes was excessive. On several publicly available real datasets, we compared the results of the proposed method with those of other methods, and we conducted a small-scale simulation study to confirm the validity of the proposed method.
    Fifth IEEE International Symposium on Bioinformatic and Bioengineering (BIBE 2005), 19-21 October 2005, Minneapolis, MN, USA; 01/2005

Keywords

balance false negatives
 
biomarkers
 
classification function
 
continuous covariates
 
covariates
 
data preprocessing
 
different kinds
 
discrete covariates
 
good prediction
 
great interest
 
key problem
 
mass spectrometry
 
mass spectrometry data
 
ovarian cancer dataset
 
present approach
 
Proteomic data