Sample classification from protein mass spectrometry, by 'peak probability contrasts'

Department of Statistics, Stanford University, Palo Alto, California, United States
Bioinformatics (Impact Factor: 4.62). 12/2004; 20(17):3034-44. DOI: 10.1093/bioinformatics/bth357
Source: PubMed

ABSTRACT: MOTIVATION: Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower-stage tumors, more treatable disease and ultimately higher cure rates with less treatment-related morbidity. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. RESULTS: Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well as or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.
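
The abstract does not spell out how a peak's importance in discriminating between the two groups is computed. One plausible reading of 'peak probability contrast', offered here only as a minimal sketch, is that each peak site gets the intensity threshold at which the proportion of samples exceeding it differs most between the groups. The function names, the data layout and the choice of candidate thresholds below are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def peak_probability_contrast(
    healthy: Sequence[float], diseased: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, contrast) for a single peak site.

    The contrast is |P(intensity > t | diseased) - P(intensity > t | healthy)|,
    maximized over thresholds t taken from the observed intensities
    (an illustrative choice of candidates).
    """
    candidates = sorted(set(healthy) | set(diseased))
    best_t, best_c = candidates[0], 0.0
    for t in candidates:
        p_healthy = sum(x > t for x in healthy) / len(healthy)
        p_diseased = sum(x > t for x in diseased) / len(diseased)
        contrast = abs(p_diseased - p_healthy)
        if contrast > best_c:  # keep the first threshold reaching the maximum
            best_t, best_c = t, contrast
    return best_t, best_c


def rank_sites(
    sites: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> List[Tuple[str, float, float]]:
    """Rank peak sites by contrast, largest first.

    `sites` maps a hypothetical site label (e.g. an m/z value as text)
    to a (healthy intensities, diseased intensities) pair.
    """
    scored = [
        (name, *peak_probability_contrast(h, d)) for name, (h, d) in sites.items()
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)
```

Calling `rank_sites` on a dictionary of per-site intensities returns `(site, threshold, contrast)` tuples, so the most discriminating sites appear first. A classifier could then be built on the thresholded (binary) features, although that step is not sketched here.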

  • Source
    ABSTRACT: Mixture modeling of mass spectra is an approach with many potential applications, including peak detection and quantification, smoothing, de-noising, feature extraction and spectral signal compression. However, existing algorithms do not allow for automatic analyses of whole spectra. Therefore, although several papers have highlighted potential advantages of mixture modeling of mass spectra of peptide/protein mixtures and presented some preliminary results, the approach has so far not been developed to a stage that enables systematic comparisons with existing software packages for proteomic mass spectra analyses. In this paper we present an efficient algorithm for Gaussian mixture modeling of proteomic mass spectra of different types (e.g., MALDI-ToF profiling, MALDI-IMS). The main idea is automatic partitioning of the protein mass spectral signal into fragments. The obtained fragments are separately decomposed into Gaussian mixture models. The parameters of the mixture models of fragments are then aggregated to form the mixture model of the whole spectrum. We compare the proposed algorithm to existing algorithms for peak detection and demonstrate improvements in peak detection efficiency obtained by using Gaussian mixture modeling. We also show applications of the proposed algorithm to real proteomic datasets of low and high resolution.
  • Source
    ABSTRACT: Motivation: Proteomic mass spectrometry analysis is becoming routine in clinical diagnostics, for example to monitor cancer biomarkers using blood samples. However, differential proteomics and identification of peaks relevant for class separation remain challenging. Results: Here, we introduce a simple yet effective approach for identifying differentially expressed proteins using binary discriminant analysis. This approach works by data-adaptive thresholding of protein expression values and subsequent ranking of the dichotomized features using a relative entropy measure. Our framework may be viewed as a generalization of the 'peak probability contrast' approach of Tibshirani et al. (2004) and can be applied both in the two-group and the multi-group setting. Our approach is computationally fast and, in the analysis of a large-scale drug discovery test data set, achieves prediction accuracy equivalent to that of random forests. Furthermore, in the analysis of mass spectrometry data from a pancreas cancer study, we were able to identify biologically relevant and statistically predictive marker peaks unrecognized in the original study. Availability: The methodology for binary discriminant analysis is implemented in the R package binda, which is freely available under the GNU General Public License (version 3 or later) from URL . R scripts reproducing all described analyses are available from the web page .
  • Source
    ABSTRACT: Pancreatic cancer is the fourth leading cause of cancer-related deaths. Therefore, in order to improve survival rates, the development of biomarkers for early diagnosis is crucial. Recently, diabetes has been associated with an increased risk of pancreatic cancer. The aims of this study were to search for novel serum biomarkers that could be used for early diagnosis of pancreatic cancer and to identify whether diabetes was a risk factor for this disease. Blood samples were collected from 25 patients with diabetes (control) and 93 patients with pancreatic cancer (including 53 patients with diabetes), and analyzed using matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI-TOF/MS). We performed preprocessing, and various classification methods with imputation were used to replace the missing values. To validate the selection of biomarkers identified in pancreatic cancer patients, we measured biomarker intensity in pancreatic cancer patients with diabetes following surgical resection and compared our results with those from control (diabetes-only) patients. By using various classification methods, we identified the commonly splitting protein peaks as m/z 1,465, 1,206, and 1,020. In the follow-up study, in which we assessed biomarkers in pancreatic cancer patients with diabetes after surgical resection, we found that the intensities of m/z at 1,465, 1,206, and 1,020 became comparable with those of diabetes-only patients.
    Cancer informatics 01/2014; 13(Suppl 7):45-53. DOI:10.4137/CIN.S16341
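
The binary discriminant analysis abstract in the list above describes thresholding expression values and then ranking the dichotomized features with a relative entropy measure. A rough sketch of that idea follows. It assumes a fixed threshold per feature and scores each feature by summing, over groups, the Kullback-Leibler divergence between the group's peak frequency and the pooled frequency. The scoring rule, the clipping constant and all function names are illustrative assumptions and are not taken from the binda package.

```python
import math
from typing import Dict, List, Sequence, Tuple


def bernoulli_kl(p: float, q: float, eps: float = 1e-9) -> float:
    """Relative entropy KL(Bernoulli(p) || Bernoulli(q)) in nats.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def dichotomize(values: Sequence[float], threshold: float) -> List[int]:
    """Map intensities to 1 (above threshold) or 0 (at or below it)."""
    return [1 if v > threshold else 0 for v in values]


def feature_score(bits_by_group: Sequence[Sequence[int]]) -> float:
    """Sum over groups of KL(group frequency || pooled frequency)."""
    pooled = [b for group in bits_by_group for b in group]
    q = sum(pooled) / len(pooled)
    return sum(bernoulli_kl(sum(g) / len(g), q) for g in bits_by_group)


def rank_binary_features(
    features: Dict[str, Sequence[Sequence[int]]]
) -> List[Tuple[str, float]]:
    """Rank features by score, most discriminating first.

    `features` maps a hypothetical feature label to one 0/1 list per group,
    so the same code covers two-group and multi-group data.
    """
    scored = [(name, feature_score(groups)) for name, groups in features.items()]
    return sorted(scored, key=lambda row: row[1], reverse=True)
```

A feature that is present in every sample of one group and absent from the other scores close to 2 log 2 under this rule, while a feature with identical frequencies in all groups scores 0.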
