Sample classification from protein mass spectrometry by 'peak probability contrasts'

Department of Statistics, Stanford University, Palo Alto, California, United States
Bioinformatics (Impact Factor: 4.98). 12/2004; 20(17):3034-44. DOI: 10.1093/bioinformatics/bth357
Source: PubMed


MOTIVATION: Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower-stage tumors, more treatable disease and ultimately higher cure rates with fewer treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers.

RESULTS: Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well as or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.
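
As a rough illustration of the idea named in the abstract, the sketch below ranks peak sites by how differently often a peak occurs in the two groups. It is not the paper's procedure: the binary presence/absence encoding, the function name `peak_probability_contrasts`, and the use of a plain difference in proportions as the "contrast" are all assumptions made for this example, and the abstract's significance assessment is not reproduced.

```python
from typing import List, Sequence, Tuple


def peak_probability_contrasts(
    peaks: Sequence[Sequence[int]],
    labels: Sequence[int],
    n_sites: int,
) -> List[Tuple[int, float, float, float]]:
    """Rank peak sites by the between-group difference in peak frequency.

    peaks[i] lists the site indices at which sample i has a peak and
    labels[i] is 0 (healthy) or 1 (diseased); both encodings are assumed
    for illustration. Returns (site, p_healthy, p_diseased, contrast)
    tuples sorted by decreasing absolute contrast.
    """
    counts = {0: [0] * n_sites, 1: [0] * n_sites}
    totals = {0: 0, 1: 0}
    for sample_peaks, label in zip(peaks, labels):
        totals[label] += 1
        for site in set(sample_peaks):  # count each site once per sample
            counts[label][site] += 1
    if totals[0] == 0 or totals[1] == 0:
        raise ValueError("both groups need at least one sample")

    rows = []
    for site in range(n_sites):
        p_healthy = counts[0][site] / totals[0]
        p_diseased = counts[1][site] / totals[1]
        rows.append((site, p_healthy, p_diseased, p_diseased - p_healthy))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows


# Toy data: four samples, three candidate peak sites.
example_peaks = [[0, 1], [0], [0, 2], [0, 2]]
example_labels = [0, 0, 1, 1]
for row in peak_probability_contrasts(example_peaks, example_labels, 3):
    print(row)
```

In the toy data, site 2 appears only in the diseased samples and so ranks first, while site 0 appears in every sample and carries no contrast.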

Available from: Trevor Hastie,
    • "Supervised principal component analysis (SPCA) is used for selection of a subset of genes with prognostic value from differentially expressed genes (Tibshirani et al., 2004). We randomly split the training (Wang's) cohort after appropriate filtering of patients into training set and testing set of the same size (the same number of individual patients). "
    ABSTRACT: Early full-term pregnancy is one of the most effective natural protections against breast cancer. To investigate this effect, we have characterized the global gene expression and epigenetic profiles of multiple cell types from normal breast tissue of nulliparous and parous women and carriers of BRCA1 or BRCA2 mutations. We found significant differences in CD44(+) progenitor cells, where the levels of many stem cell-related genes and pathways, including the cell-cycle regulator p27, are lower in parous women without BRCA1/BRCA2 mutations. We also noted a significant reduction in the frequency of CD44(+)p27(+) cells in parous women and showed, using explant cultures, that parity-related signaling pathways play a role in regulating the number of p27(+) cells and their proliferation. Our results suggest that pathways controlling p27(+) mammary epithelial cells and the numbers of these cells relate to breast cancer risk and can be explored for cancer risk assessment and prevention.
    Cell Stem Cell 06/2013; 13(1). DOI:10.1016/j.stem.2013.05.004 · 22.27 Impact Factor
    • "There is an extensive literature on mass spectrum analysis problems (Plechawska et al., 2011). One can find several techniques of peak detection and identification. A very popular approach is to use local maxima and minima. Such methods (Morris et al., 2005; Yasui et al., 2003; Tibshirani et al., 2004) usually compare local maxima with noise level. There are also methods (Zhang et al., 2007) considering the signal to noise ratio. This ratio needs to be high enough to identify a true peak with a local maximum. Such methods choose peaks with the highest intensities. Similar ideas (Mantini et al., 2007; Mantini et al., 2008) consider us"
    Medical Informatics, 03/2012; ISBN: 978-953-51-0259-5
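
The local-maximum approach described in the excerpt above can be sketched as follows. This is a minimal illustration, not the procedure of any of the cited papers: the noise level is estimated with the median absolute deviation, and the function name `detect_peaks` and the default threshold of 3 are assumptions chosen for the example.

```python
from statistics import median
from typing import List, Sequence


def detect_peaks(intensity: Sequence[float], snr_threshold: float = 3.0) -> List[int]:
    """Return indices of local maxima that rise clearly above the noise.

    Baseline = median intensity; noise = median absolute deviation from
    that baseline (an assumed, simple estimator). A point is reported when
    it is a local maximum and exceeds baseline + snr_threshold * noise.
    """
    baseline = median(intensity)
    noise = median([abs(x - baseline) for x in intensity])
    cutoff = baseline + snr_threshold * noise

    found = []
    for i in range(1, len(intensity) - 1):
        is_local_max = intensity[i] > intensity[i - 1] and intensity[i] >= intensity[i + 1]
        if is_local_max and intensity[i] > cutoff:
            found.append(i)
    return found


# Toy spectrum: small fluctuations around 1.0 with one clear peak at index 4.
spectrum = [1.0, 1.2, 0.9, 1.1, 5.0, 1.0, 0.8, 1.2, 1.1, 0.9, 1.0]
print(detect_peaks(spectrum))
```

The small bumps at indices 1 and 7 are local maxima too, but they stay below the noise-based cutoff and are therefore not reported.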
    • "Another approach that is commonly employed for the peak alignment of mass spectral data is based on hierarchical clustering and could be applied as well on NMR spectral data [9-14]. Most of these methods apply hierarchical clustering to the entire collection of all peaks from the individual spectra and "cut off" the resulting dendrogram at a suitable height to produce a number of clusters used for alignment. "
    ABSTRACT: Nuclear magnetic resonance spectroscopy (NMR) is a powerful technique to reveal and compare quantitative metabolic profiles of biological tissues. However, chemical and physical sample variations make the analysis of the data challenging, and typically require the application of a number of preprocessing steps prior to data interpretation. For example, noise reduction, normalization, baseline correction, peak picking, spectrum alignment and statistical analysis are indispensable components in any NMR analysis pipeline. We introduce a novel suite of informatics tools for the quantitative analysis of NMR metabolomic profile data. The core of the processing cascade is a novel peak alignment algorithm, called hierarchical Cluster-based Peak Alignment (CluPA). The algorithm aligns a target spectrum to the reference spectrum in a top-down fashion by building a hierarchical cluster tree from peak lists of reference and target spectra and then dividing the spectra into smaller segments based on the most distant clusters of the tree. To reduce the computational time to estimate the spectral misalignment, the method makes use of Fast Fourier Transformation (FFT) cross-correlation. Since the method returns a high-quality alignment, we can propose a simple methodology to study the variability of the NMR spectra. For each aligned NMR data point the ratio of the between-group and within-group sum of squares (BW-ratio) is calculated to quantify the difference in variability between and within predefined groups of NMR spectra. This differential analysis is related to the calculation of the F-statistic or a one-way ANOVA, but without distributional assumptions. Statistical inference based on the BW-ratio is achieved by bootstrapping the null distribution from the experimental data. The workflow performance was evaluated using a previously published dataset. Correlation maps, spectral and grey scale plots show clear improvements in comparison to other methods, and the down-to-earth quantitative analysis works well for the CluPA-aligned spectra. The whole workflow is embedded into a modular and statistically sound framework that is implemented as an R package called "speaq" ("spectrum alignment and quantitation"), which is freely available from
    BMC Bioinformatics 10/2011; 12(1):405. DOI:10.1186/1471-2105-12-405 · 2.58 Impact Factor
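
The BW-ratio mentioned in the abstract above can be illustrated for a single aligned data point. The sketch below is not the "speaq" implementation: the function names are hypothetical, and the null distribution is built by one assumed scheme, drawing values with replacement from the pooled data into groups of the original sizes.

```python
import random
from typing import Sequence


def bw_ratio(groups: Sequence[Sequence[float]]) -> float:
    """Between-group over within-group sum of squares for one data point."""
    pooled = [x for g in groups for x in g]
    grand_mean = sum(pooled) / len(pooled)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    if within == 0:
        # No within-group spread: the ratio is unbounded unless means coincide.
        return float("inf") if between > 0 else 0.0
    return between / within


def bootstrap_p_value(groups: Sequence[Sequence[float]],
                      n_boot: int = 2000, seed: int = 0) -> float:
    """Share of resampled BW-ratios at least as large as the observed one.

    Assumed null scheme: group labels carry no information, so each
    replicate redraws every group from the pooled values with replacement.
    """
    rng = random.Random(seed)
    observed = bw_ratio(groups)
    pooled = [x for g in groups for x in g]
    hits = 0
    for _ in range(n_boot):
        resampled = [[rng.choice(pooled) for _ in g] for g in groups]
        if bw_ratio(resampled) >= observed:
            hits += 1
    return (hits + 1) / (n_boot + 1)  # add-one correction avoids p = 0


# Toy intensities at one aligned data point for two groups of spectra.
toy_groups = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]
print(bw_ratio(toy_groups))
print(bootstrap_p_value(toy_groups))
```

For the toy groups the between-group sum of squares is 54 and the within-group sum of squares is 4, giving a BW-ratio of 13.5; the resampling step then indicates how rarely a ratio that large arises when group membership is ignored.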