
BIOINFORMATICS

Vol. 20 no. 17 2004, pages 3034–3044

doi:10.1093/bioinformatics/bth357

Sample classification from protein mass

spectrometry, by ‘peak probability contrasts’

Robert Tibshirani1,2,∗, Trevor Hastie1,2, Balasubramanian

Narasimhan1,2, Scott Soltys3, Gongyi Shi3, Albert Koong3 and

Quynh-Thu Le3

1Department of Health, Research and Policy, 2Department of Statistics and

3Department of Radiation Oncology, Stanford University, CA 94305, USA

Received on April 19, 2004; revised on May 13, 2004; accepted on May 26, 2004

Advance Access publication June 29, 2004

ABSTRACT

Motivation: Early cancer detection has always been a major

research focus in solid tumor oncology. Early tumor detection

can theoretically result in lower stage tumors, more treatable

diseases and ultimately higher cure rates with less treatment-

related morbidities. Protein mass spectrometry is a potentially

powerful tool for early cancer detection.

We propose a novel method for sample classification from

protein mass spectrometry data. When applied to spectra from

both diseased and healthy patients, the ‘peak probability con-

trast’ technique provides a list of all common peaks among the

spectra, their statistical significance and their relative import-

ance in discriminating between the two groups. We illustrate

the method on matrix-assisted laser desorption and ionization

mass spectrometry data from a study of ovarian cancers.

Results: Compared to other statistical approaches for class

prediction, the peak probability contrast method performs as

well as or better than several methods that require the full spectra,

rather than just labelled peaks. It is also much more inter-

pretable biologically. The peak probability contrast method is

a potentially useful tool for sample classification from protein

mass spectrometry data.

Contact: tibs@stanford.edu

Supplementary Information: http://www.stat.stanford.edu/~tibs/ppc

1 INTRODUCTION

Early cancer detection has always been a major research focus

in solid tumor oncology. Early tumor detection can theoret-

ically result in lower stage tumors, more treatable diseases

and ultimately higher cure rates with less treatment-related

morbidities. Many screening approaches have therefore been

studied in solid cancers. Established screening tools for the

early detection of cancer include mammography for breast

cancer, colonoscopy for colorectal cancer, prostate-specific antigen (PSA) test for prostate cancer and pap smear for cervical cancer (Smith et al., 2003). Imaging techniques such as chest X-ray and spiral computed tomography are also used, but are limited to a tumor size detection limit of 0.5–1.0 cm (representing close to 10^9 cells) (Swenson et al., 2002).

∗To whom correspondence should be addressed.

Several serum markers have been identified through the

years but, with a few exceptions such as PSA for prostate

cancers and alpha-fetoprotein (AFP) for hepatocellular carcinomas, most have failed to be integrated into general clinical practice (Hansen and Pedersen, 1986). Therefore, it is important to identify and to interpret new methods that provide

sensitive and reliable diagnostic markers for solid cancers.

Recent advancements in proteomics have yielded novel

and promising techniques to aid in biomarker identification

(Hanash, 2003; Petricoin et al., 2002a). One such advance-

ment is the development of protein mass spectrometry and

the ability to analyze complex samples using this technique.

Surface enhanced laser desorption/ionization–time-of-flight

(SELDI–TOF) and matrix-assisted laser desorption and ion-

ization (MALDI) mass spectrometry (MS) are the two most

popular approaches presently employed for detecting quant-

itative or qualitative changes in circulating serum or plasma

proteins in relation to a pathological state such as the presence

of a solid tumor. Both represent high throughput and highly

sensitive proteomic approaches that allow protein expression

profiling of large sample sets (Hutchens and Yip, 1993;

Merchant and Weinberger, 2000). Briefly, in SELDI, proteins

of interest from biologically complex samples bind selectively

to chemically modified affinity surfaces, with non-specifically

bound impurities washed away. The retained sample is com-

plexed with an energy-absorbing molecule, and analyzed

by laser desorption/ionization time-of-flight (MS), producing

spectra of mass/charge ratio (m/z).

MALDI is similar to SELDI except that it lacks the preselection or enrichment step in which certain proteins in the sample mixture are fractionated by prebinding to different surfaces or chemical coatings. In MALDI, the samples are mixed with a crystal-forming matrix, placed on an inert metal target, and subjected to a pulsed laser beam to produce gas phase ions that traverse a field-free flight tube and then are separated by m/z ratio. There are theoretical advantages and disadvantages for each of these two approaches; however, both have been applied to cancer detection in solid tumors with reported high sensitivity and specificity using a variety of statistical analyses (Petricoin et al., 2002b; Li et al., 2002; Qu et al., 2002; Rai et al., 2002; Adam et al., 2003; Yasui et al., 2003; Wu et al., 2003).

Bioinformatics vol. 20 issue 17 © Oxford University Press 2004; all rights reserved.

Downloaded from bioinformatics.oxfordjournals.org by guest on May 24, 2011

In this paper, we propose a novel algorithm for pattern clas-

sification from protein spectra, and compare it with several

other existing techniques. We primarily focus on the compar-

ison of two diagnostic classes (e.g. healthy versus cancer),

although our method can be generalized to more than two

classes (details are given in the Appendix).

Since there are many possible approaches to this problem, it

is important to discuss the desiderata for such a procedure:

(1) It should focus on clearly detectable peaks in the spec-

tra, at least for the initial analysis. While there may

well be discriminative information in other parts of the

spectra, peaks are more likely to represent isolatable

proteins, protein fragments or peptides.

(2) The method should account for the variation in the m/z

location and heights of the same biological peak in

different spectra. The source of this variation may be

biological or technical (i.e. due to properties of the mass

spectrometer).

(3) It should give some measure of discriminatory power

for all peaks.

(4) If possible, the sample classification rule should use the

peak information in a relatively simple way and provide

a method for filtering out the less significant peaks.

Point (3) can be important in the following scenario: suppose

that other researchers, studying the same disease, find a potentially

important peak at a certain m/z value. The user would

like to assess the importance of that peak (or a nearby peak)

in their data, and hence needs an evaluation of all peaks found

in their data. Figures 1 and 2 show the main results of the PPC

method. They are explained below.

2 METHODS

2.1 Sample description

The ovarian cancer dataset was analyzed by Wu et al. (2003), and was provided by the authors. It consists of MALDI–MS spectra generated using a Micromass MALDI-R instrument on pre-treatment serum samples of 89 subjects, consisting of 42 non-cancer controls and 47 ovarian cancer patients. The MS spectra are measured at 91360 sites, spaced 0.019 Da apart and extending from 800 to 3500 Da. Following Wu et al. (2003), we log-transformed the intensities and then did a baseline subtraction using a ‘loess’ smoother with span of 1000/91360. Finally, we normalized each spectrum by a linear transformation that mapped the 10th and 90th percentiles to 0 and 1, respectively.
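As an illustration of the final normalization step only, the following minimal Python sketch maps the 10th and 90th percentiles of one spectrum to 0 and 1. The function names `percentile` and `normalize_spectrum` are hypothetical, and the percentile convention (linear interpolation between order statistics) is an assumption, since the text does not say which convention was used.

```python
def percentile(values, pct):
    """Percentile by linear interpolation between order statistics
    (an assumed convention; the paper does not specify one)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def normalize_spectrum(intensities):
    """Linear map sending the 10th percentile to 0 and the 90th to 1."""
    lo = percentile(intensities, 10)
    hi = percentile(intensities, 90)
    if hi == lo:
        raise ValueError("10th and 90th percentiles coincide")
    return [(v - lo) / (hi - lo) for v in intensities]
```

Values below the 10th percentile become negative and values above the 90th percentile exceed 1, so the map rescales spectra without clipping them.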

A flowchart of the peak probability contrast (PPC) procedure

is shown in Figure 3. We now describe the individual steps

in detail.

2.2 (a) Peak extraction

We begin with the raw MALDI spectra. In some systems, the spectrometry software provides a list of labeled peaks. These were not available for our data; hence, we developed a simple peak-finding procedure based on the ideas of Yasui et al. (2003). It looks for sites (m/z values) whose intensity is higher than that at the ±s sites surrounding it, and higher than the estimated average background at that site; here we used the value s = 100.

First we smoothed the raw spectra, as illustrated in Figure 6. For this we used a ‘supersmoother’ with a span of 0.002. This step would normally only be carried out for the MALDI data, and not for the SELDI data. It has the effect of smoothing over the isotopic envelope that is present in the MALDI data, which is helpful for the purposes of finding peak locations. However, after determining the peak locations that are important for sample classification, one should examine the raw spectra to determine the actual width and location of the primary peak in each envelope. Alternatively, one could apply a de-isotoping method to extract the primary peak from each envelope and eliminate the secondary ones. The output of a de-isotoping peak finder can be fed directly into our procedure.

We estimated that, in the smoothed spectra, peak widths were ∼0.5% of the corresponding m/z value. Hence we log-transformed the m/z values so that the peak widths were approximately constant over the entire range. This produced a roughly constant peak width of 0.005, since log(1.005) ≈ 0.005 for natural logarithms. This same approach and peak width have been used by other authors, e.g. Yasui et al. (2003). Visual examination of the individual spectra showed that this peak width was fairly reasonable for these data. A constant peak width across the m/z range also facilitates application of a clustering procedure, described next. In general, the peak width is an important adjustable parameter in our procedure. The data analyst should try to vary it, and examine the results both visually and in terms of the cross-validated misclassification (described below). In the ovarian cancer example in this paper, smaller widths such as 0.25% resulted in slightly higher error rates.

In some cases, two peaks within 0.5% are found in the same spectrum, and these are combined. Note that any peak-finding method can be used to provide peaks for the PPC procedure. The one we have used is crude, and a more refined peak-finder could yield improved classification results.
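The local-maximum rule described in this section can be sketched in Python as follows. This is a minimal sketch and not the authors' routine: the function name `find_peaks` is hypothetical, the background estimate is assumed to be supplied by the caller as one value per site, the window is simply truncated at the ends of the spectrum, and strict inequalities are assumed.

```python
def find_peaks(intensity, background, s=100):
    """Return indices of sites whose intensity is strictly higher than at
    every site within +/- s positions and higher than the background there."""
    intensity = list(intensity)
    n = len(intensity)
    peaks = []
    for i in range(n):
        lo, hi = max(0, i - s), min(n, i + s + 1)
        # Neighbouring sites in the window, excluding site i itself.
        neighbours = intensity[lo:i] + intensity[i + 1:hi]
        if not neighbours:
            continue
        if intensity[i] > max(neighbours) and intensity[i] > background[i]:
            peaks.append(i)
    return peaks
```

The loop costs on the order of n·s comparisons, which is acceptable for a sketch; a sliding-window maximum would be the natural refinement for spectra with tens of thousands of sites.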


[Figure 1 panel labels, listed in order of decreasing strength; the ‘x’ marks for other split points are omitted.]

m/z site    Split point    Proportion (normal)    Proportion (cancer)
2995.13     0.55           0.29                   0.83
1053.85     0.8            0.55                   0.15
2437.28     0.62           0.74                   0.34
1391.79     0.62           0.31                   0.7
1031.92     0.98           0.83                   0.45
945.53      1.22           0.69                   0.32
2012.51     3.08           0.64                   0.28
1853.43     0.72           0.67                   0.32
1143.15     0.55           0.64                   0.3
2096.82     1.43           0.71                   0.38
2213.81     0.91           0.1                    0.43
3238.57     0.91           0.1                    0.43
1301.62     0              0.4                    0.72
1568.96     0.61           0.79                   0.47
2112.78     0.84           0.79                   0.47
3113.47     0.51           0.26                   0.57
3346.01     0.5            0.48                   0.79
2255.32     0.67           0.69                   0.38
3196.74     0.54           0.33                   0.64
2031.01     0              0.74                   0.45
1402.45     0.68           0.71                   0.43
2127.58     0.68           0.69                   0.4
1172.89     0.55           0.67                   0.38
1868.61     1.48           0.55                   0.83
3016.31     0.49           0.55                   0.83
1628.58     0.95           0.4                    0.13
1323.24     0.67           0.67                   0.4
2728.57     1.23           0.67                   0.4
1075.61     0              0.55                   0.81
2790.57     0.49           0.19                   0.45

Fig. 1. The results of PPC method on the ovarian cancer example. Each panel shows a histogram of peak heights in the training set at one

site (m/z value in black type in top right corner), for healthy patients (green) and cancer patients (blue). Figure 2 gives details of the format.

The peaks are ordered from strongest to weakest, as measured by the difference in proportions (red type), starting in the top left corner and

moving down the left column. Only the top 30 peaks are shown, out of a total of 192 peak sites.

2.3 (b) Peak alignment via clustering

To align peaks from the set of spectra, we applied complete linkage hierarchical clustering to the collection of all 14067 peaks from the individual spectra. The clustering here is somewhat novel: it is one-dimensional, using the distance along the log(m/z) axis. This is depicted in Figure 4.

The idea is that tight clusters should represent the same

biological peak that has been horizontally shifted in different

spectra. We then extract the centroid (mean position) of each

cluster, to represent the ‘consensus’ position for that peak across all spectra.

[Figure 2 panel annotations: ‘xxx’ = other split points; 2995.13 = m/z site; 0.55 = split point for peak height; 0.29 = proportion in normal class; 0.83 = proportion in cancer class.]

Fig. 2. Exploded view of the top left panel of Figure 1, with a legend detailing the format. The vertical line shows the estimated optimal height split point. The proportions of samples in each class having peaks higher than the split point are indicated. The ‘x’s indicate the horizontal positions of split points that achieve a difference in proportion within 10% of the best at that site.

Since this clustering can be performed very quickly, a spe-

cial routine was written for this purpose. Cutting off the

dendrogram at height 0.005 produced 192 clusters with corresponding

cluster centers taken as the midpoints between the

ranges of the cluster. Since complete linkage was used, we

are guaranteed that every peak in the cluster is at most 0.005

from any other peak in that same cluster.
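The one-dimensional clustering step can be sketched in Python as follows. This is an illustrative sketch, not the special routine mentioned above, and the function names are hypothetical. It relies on the observation that, for points on a line, the complete-linkage distance between two adjacent clusters equals the span of their union, so only neighbouring clusters ever need to be compared; merging stops once the smallest such span exceeds the cut height.

```python
def cluster_peaks_1d(positions, cut_height=0.005):
    """Complete-linkage clustering of positions on a line, cut at cut_height.
    Returns clusters as sorted lists of positions, ordered left to right."""
    clusters = [[x] for x in sorted(positions)]
    while len(clusters) > 1:
        # Complete-linkage distance between neighbours = span of their union.
        spans = [clusters[i + 1][-1] - clusters[i][0]
                 for i in range(len(clusters) - 1)]
        best = min(range(len(spans)), key=lambda i: spans[i])
        if spans[best] > cut_height:
            break
        clusters[best:best + 2] = [clusters[best] + clusters[best + 1]]
    return clusters


def cluster_centers(clusters):
    """Center of each cluster, taken as the midpoint of its range."""
    return [(c[0] + c[-1]) / 2 for c in clusters]
```

By construction every cluster returned has a total span of at most `cut_height`, which is the guarantee stated in the text for complete linkage.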

2.4 (c) Search for common peaks in individual spectra

Given the list of common peaks from clustering in Step (b),

we go back to the individual spectra and record whether each

spectrum exhibits each of these common peaks. A peak in the

individual spectra is deemed to be one of the common peaks if its

center lies within 0.005, on the log(m/z) scale, of the estimated center position of

the common peak. If it is present, the height of the individual

peak in the spectrum is also recorded.

Fig. 3. Flow chart of PPC analysis:
(a) Extract peaks from individual spectra.
(b) Estimate the common set of peaks via hierarchical clustering.
(c) Search for common peaks in individual spectra.
(d) Split point estimation: find most discriminatory quantile height (split point) at each common peak site.
(e) Feature extraction for a new spectrum: for each common peak, determine if a peak higher than the split point is present in each individual spectrum.
(f) Classification: apply the nearest shrunken centroid classifier to the set of extracted features.

2.5 (d) Split point estimation for each peak

From the previous steps, we have spectrum peak heights $y_{ij}$, for observations $j = 1, 2, \ldots, n$ and sites $i = 1, 2, \ldots, m$. The sites are the centroids from a hierarchical clustering of all individual spectra peaks. If there is no peak at site $i$, we take $y_{ij} = 0$. In this step we cut the peak height at some quantile, in such a way as to maximally discriminate between the healthy and cancer samples in the training set. Basing the splits on the quantiles of all heights at a peak position, rather than absolute peak heights, is important: it accounts for the fact that peak heights can vary greatly across the m/z range. Here are the details:

• Let $q(\alpha, i)$ be the $\alpha$ quantile of the peaks $y_{ij}$ at site $i$.

• Given two groups $G_1, G_2$ of size $n_1, n_2$, let $p_{ik}(\alpha)$ be the proportion of spectra in group $k$ with a peak at site $i$ larger than $q(\alpha, i)$:

$$p_{ik}(\alpha) = \sum_{j \in G_k} I[y_{ij} > q(\alpha, i)] / n_k, \quad k = 1, 2,$$

where $I[\cdot]$ is the indicator function, equal to one if the event is true and zero otherwise.

• Choose $\hat{\alpha}(i)$ to maximize $|p_{i2}(\alpha) - p_{i1}(\alpha)|$ and set $\hat{p}_{ik} = p_{ik}[\hat{\alpha}(i)]$.
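The split-point search of Step (d) at a single site can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `quantile` and `best_split` are hypothetical, the grid of candidate values of α is supplied by the caller, quantiles are computed by linear interpolation, and ties are resolved in favour of the smallest α. None of these choices is specified in the text.

```python
def quantile(values, alpha):
    """alpha-quantile by linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * alpha
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def best_split(heights, groups, alphas):
    """heights: peak heights y_ij at one site (0 when no peak is present).
    groups: class label (1 or 2) of each spectrum.
    Returns the alpha maximizing |p_i2(alpha) - p_i1(alpha)| with its details."""
    n1, n2 = groups.count(1), groups.count(2)
    best = None
    for alpha in alphas:
        q = quantile(heights, alpha)
        p1 = sum(h > q for h, g in zip(heights, groups) if g == 1) / n1
        p2 = sum(h > q for h, g in zip(heights, groups) if g == 2) / n2
        if best is None or abs(p2 - p1) > abs(best["p2"] - best["p1"]):
            best = {"alpha": alpha, "split": q, "p1": p1, "p2": p2}
    return best
```

Because the candidate cutpoints are quantiles of the heights observed at that site, the same grid of α values can be reused at every site regardless of the absolute scale of the peaks.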

This process produced the cutpoints (red vertical lines) and class probabilities $\hat{p}_{ik}$ shown in Figure 1. The panels in the figure are arranged in decreasing strength, i.e. decreasing value of $|\hat{p}_{i2} - \hat{p}_{i1}|$. These histograms are informative in themselves. For some sites (e.g. at 1301.62 in the third left-most column), the cutpoint divides height = 0 from the rest. That is, it indicates that the presence or absence of the peak is what is important. At other sites (e.g. 2995.1 in the top left corner), the proportion of peaks above a certain height (0.55) is important for classification ability.

2.6 (e) Feature extraction for a new spectrum

From the previous steps we have a set of common peaks, and an optimal discriminating split point for the height of each peak. To do class prediction for a new spectrum, we first construct a vector of binary features for that spectrum, one for each of the common peaks. Each feature equals one if a peak with height greater than the split point is found in the new spectrum, and zero otherwise. As before, a peak is considered to correspond to a common peak if its center lies within 0.005, on the log(m/z) scale, of the position of the common peak. In Figure 1, the first feature will equal 1 if the new spectrum contains a peak at 2995.13 higher than 0.55, the second feature will equal 1 if the new spectrum contains a peak at 1053.85 higher than 0.80, and so on.
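Step (e) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function name `extract_features` is hypothetical, peaks are assumed to arrive as (m/z, height) pairs from some peak finder, natural logarithms are assumed, and when several peaks of the new spectrum fall within the tolerance of one common peak the tallest is used (the text does not say how such cases are resolved).

```python
import math


def extract_features(peaks, centers, splits, tol=0.005):
    """peaks: (m/z, height) pairs found in the new spectrum.
    centers: common peak positions on the log(m/z) scale.
    splits: split-point height for each common peak.
    Returns one binary feature per common peak."""
    features = []
    for center, split in zip(centers, splits):
        matched = [height for mz, height in peaks
                   if abs(math.log(mz) - center) <= tol]
        # Feature is 1 only if a matching peak exceeds the split point.
        features.append(1 if matched and max(matched) > split else 0)
    return features
```

For example, with the first two sites of Figure 1 (2995.13 with split 0.55, and 1053.85 with split 0.80), a spectrum with a peak of height 0.70 near 2996 and a peak of height 0.60 near 1053 yields the feature vector [1, 0].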

2.7 (f) Class prediction via nearest shrunken centroids

Here we show how to use the peak proportions $\hat{p}_{ik}$ to classify a new spectrum into class 1 (healthy) or class 2 (diseased). Given a spectrum from a new patient with peak heights $y^*_1, y^*_2, \ldots, y^*_m$, let $z^*_i = I[y^*_i > q(\hat{\alpha}(i), i)]$. This is the binary feature vector from Step (e), with a component equal to one if the spectrum has a peak above the cutpoint height at that site, and zero otherwise. We can then compare this binary profile to each of the probability centroid vectors $(\hat{p}_{11}, \hat{p}_{21}, \ldots, \hat{p}_{m1})$ and $(\hat{p}_{12}, \hat{p}_{22}, \ldots, \hat{p}_{m2})$ and predict the class that is closest in overall squared distance (or some other metric)¹. This is a kind of ‘nearest centroid’ classification. However, to select sites and potentially improve the prediction performance, we also consider shrinkage of each pair of probabilities $\hat{p}_{i1}, \hat{p}_{i2}$ towards their average.

Figure 5 shows a hypothetical example of nearest shrunken

centroid classification in action.

Before giving details, the method is illustrated by the

example shown in Table 1 (details in table caption).

Here are the details. Let $s(t, \Delta) = \mathrm{sign}(t)(|t| - \Delta)_+$, the ‘soft-threshold’ function. Here ‘+’ means positive part. The soft-threshold function translates the value $t$ towards zero by the amount $\Delta$, setting it to zero if $|t| \le \Delta$. For example, if $\Delta = 0.5$, then $s(1.2, \Delta) = 0.7$, $s(-1.2, \Delta) = -0.7$, $s(0.3, \Delta) = 0.0$. Then we set $\tilde{p}_{ik} = \bar{p}_i + s(\hat{p}_{ik} - \bar{p}_i, \Delta)$, with $\bar{p}_i = (\hat{p}_{i1} + \hat{p}_{i2})/2$.
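The soft-threshold function, the shrinkage of the class proportions and the squared-distance rule described in this section can be sketched in Python as follows. The function names are hypothetical and this is a sketch under the definitions above, not the authors' software; only squared distance is implemented.

```python
def soft_threshold(t, delta):
    """s(t, delta) = sign(t) * (|t| - delta)_+ ."""
    magnitude = max(abs(t) - delta, 0.0)
    return magnitude if t >= 0 else -magnitude


def shrink_centroids(p1, p2, delta):
    """Shrink each pair of class proportions towards their average."""
    shrunk1, shrunk2 = [], []
    for a, b in zip(p1, p2):
        mean = (a + b) / 2
        shrunk1.append(mean + soft_threshold(a - mean, delta))
        shrunk2.append(mean + soft_threshold(b - mean, delta))
    return shrunk1, shrunk2


def predict_class(z, centroid1, centroid2):
    """Return 2 if the binary feature vector z is closer (in squared
    distance) to the class 2 centroid, and 1 otherwise."""
    d1 = sum((zi - p) ** 2 for zi, p in zip(z, centroid1))
    d2 = sum((zi - p) ** 2 for zi, p in zip(z, centroid2))
    return 2 if d2 < d1 else 1
```

Applied to the first row of Table 1 (proportions 0.29 and 0.83, with a shrinkage amount of 0.19), `shrink_centroids` gives 0.48 and 0.64, matching the shrunken centroids listed there. Sites whose two shrunken values coincide add the same amount to both distances and so do not affect the prediction.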

The parameter $\Delta$ is chosen by 10-fold cross-validation. That is, we divide the samples into 10 approximately equal sized parts. For each fixed value of $\Delta$ we train the PPC algorithm on nine parts of the data and then compute the error rate in predicting the class labels of the samples in the tenth part. This is done for each of the 10 parts in turn, and the error rates are added to give the cross-validation error estimate for the value $\Delta$. This process is carried out for a grid of values of $\Delta$, to produce an error curve $\mathrm{cv}(\Delta)$. Finally, we examine this curve and choose $\Delta$ to be its minimizer $\hat{\Delta}$.
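The cross-validation loop can be sketched generically in Python as follows. Everything here is illustrative: the name `cv_error_curve` and the `fit` and `predict` callbacks are hypothetical, the folds are formed by simple interleaving rather than random assignment, and errors are accumulated as counts. The `fit` callback is assumed to carry out all training steps on the nine training parts only, so that nothing is estimated from the held-out part.

```python
def cv_error_curve(X, y, deltas, fit, predict, k=10):
    """Return {delta: number of held-out misclassifications} over k folds.
    fit(train_X, train_y, delta) -> model; predict(model, x) -> label."""
    n = len(y)
    folds = [list(range(start, n, k)) for start in range(k)]
    errors = {delta: 0 for delta in deltas}
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(n) if i not in held]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for delta in deltas:
            # Retrain from scratch on the training folds for each delta.
            model = fit(train_X, train_y, delta)
            errors[delta] += sum(predict(model, X[i]) != y[i]
                                 for i in held_out)
    return errors
```

The chosen value is then the minimizer of the curve, for example `min(errors, key=errors.get)`.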

Note that if the probabilities are shrunken so that they coincide, the site $i$ no longer contributes to the nearest centroid rule. Sample classification is then done as follows. For a test-set spectrum with peak heights $y^*_1, y^*_2, \ldots, y^*_m$, let $z^*_i = I[y^*_i > q(\hat{\alpha}(i), i)]$ and compute the distances $d_k = \sum_i (z^*_i - \tilde{p}_{ik})^2$ (or absolute value). We predict to class 2 if $d_2 < d_1$ and class 1 otherwise. Estimated class probabilities are also available, derived as in Tibshirani et al. (2003). Note that all of the training steps, including split-point estimation, are repeated within each cross-validation fold.

¹ Our software also allows the use of absolute distance or binomial log-likelihood distance.

Fig. 4. Illustration of hierarchical clustering for peak alignment and clustering. The points marked ‘x’ represent the positions of extracted peaks from the individual spectra. Complete linkage hierarchical clustering is applied to the peak positions along the log(m/z) axis, and the resulting dendrogram (clustering tree) is cut at height 0.005. In this simple illustrative example, this process produced four clusters with associated centroids indicated by a ‘c’.

2.8 False discovery rates

The simple difference in proportions for peak $i$

$$T_i = \hat{p}_{i2} - \hat{p}_{i1} \qquad (1)$$

can be used to assess the significance of the peak. False discovery rates (FDRs) (Benjamini and Hochberg, 1985; Tusher et al., 2001; Efron and Tibshirani, 2002; Storey and Tibshirani, 2003) are a useful measure for this. For a given threshold $t$ we compute the number of $T_i$ that exceed $t$ in absolute value. Then we randomly permute the class labels and apply the PPC procedure to the spectra with permuted labels, giving scores $T^{*b}_1, T^{*b}_2, \ldots, T^{*b}_m$. This process is repeated $B$ times, producing scores for $b = 1, 2, \ldots, B$. Finally, the false discovery rate is estimated by

$$\widehat{\mathrm{FDR}}(t) = \frac{\sum_{b=1}^{B} \sum_{i=1}^{m} I(|T^{*b}_i| > t)/B}{\sum_{i=1}^{m} I(|T_i| > t)}. \qquad (2)$$

The numerator is an estimate of the number of false posit-

ive peaks, and hence the ratio estimates the proportion of

false positives among the peaks called significant. We estimate

FDR(t) in this way, for a range of values of the threshold t.

From this, we find the threshold t giving a reasonably low

FDR (say 5%) and call significant all peaks i that fall beyond

this threshold. Note that the estimation of FDR is only for

descriptive purposes, and is not used formally in the sample

classification process.
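Given the observed scores and the scores from the permuted-label runs, the estimate in Equation (2) can be sketched in Python as follows. The function name `estimate_fdr` is hypothetical, the permuted scores are assumed to have been computed already by re-running the PPC procedure on each permutation, and returning 0 when no peak exceeds the threshold is a convention chosen here, not one stated in the text.

```python
def estimate_fdr(observed, permuted, t):
    """observed: scores T_i for the m sites.
    permuted: B lists of scores, one list per permutation of the labels.
    Returns the estimated FDR at threshold t."""
    called = sum(abs(score) > t for score in observed)
    if called == 0:
        return 0.0  # assumed convention when nothing is called significant
    # Average number of permuted scores beyond the threshold per permutation.
    false_positives = sum(abs(score) > t
                          for scores in permuted
                          for score in scores) / len(permuted)
    return false_positives / called
```

Evaluating this function over a grid of thresholds gives the curve from which a threshold with an acceptably low estimated FDR can be read off.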

2.9 Use of other classifiers

The features derived from the PPC method can also serve as useful inputs into other classifiers. We have chosen the nearest shrunken centroid method as our primary classifier, because of its simplicity and interpretability. But other methods have potential advantages in this context. For example, the lasso (Tibshirani, 1996; Efron et al., 2002) is a multivariate fitting method that produces a sparse set of feature weights, and could potentially improve the prediction performance of nearest shrunken centroids in this setting. A binary decision tree (Breiman et al., 1984) can find subgroups of cancer or healthy patients, defined in terms of their individual peak behavior.

3 RESULTS

3.1 Ovarian cancer MALDI dataset

The peak extraction step found a total of 14067 peaks, or an average of 158 peaks per spectrum. These were then clustered into 192 groups of peaks.


Fig. 5. Hypothetical example of nearest shrunken centroid classification in action. There are four peaks shown in the left panel: peaks 1 and 2

appear more often in the cancer group than the healthy group, while the reverse is true for peaks 3 and 4. In the middle panel the probabilities

in the two classes have been shrunken towards each other. As a result, the probabilities for peaks 1 and 3 are now equal, and those peaks will

not participate in the class prediction of a new spectrum. The right panel shows the feature vector (consisting of zeroes and ones) for a test

spectrum: it has peaks 1 and 2, but not 3 or 4. To predict the class for this spectrum, we compare its feature vector to the healthy and cancer

profiles in the middle panel, and find the closest one in squared distance. Here the closest is the cancer profile, and so the prediction is to

class ‘cancer’.

Figure 1 summarizes the peak information in the first 30

of these 192 peak clusters 1. Each box shows a histogram of

peaks at the given m/z site, in the non-cancer (green) and

cancer (blue) classes. The different sites are arranged from

strongest to weakest, starting at the top left, and moving down

the left column.

Figure 2 shows an exploded, annotated view of the top left

box. The optimal split point for each site is indicated by a

vertical red line, and the resulting proportions to the right of

that split point are shown in the red numbers in the box. For

example, the strongest site is at m/z = 2995.1, with a much

larger proportion of cancer patients having peaks above that

split point, compared to control patients (0.70 versus 0.17).

(The split at this site actually corresponds to a peak height of

0.58). Figure 6 shows an example of three spectra in each

group, at the strongest site m/z = 2995.1. Figure 7 dis-

plays the FDR, as the number of significant sites is varied.

The FDR starts to rise above 0.05 after the first 7 or 10 peaks.

Hence the strongest 10 peaks are very likely to be signi-

ficant, but we are less certain about the peaks farther down

the list.

Figure 8 shows a heatmap display of the top 7 peaks in the

89 samples.

The 10-fold cross-validated misclassification rate for the

PPC method is shown in line (1) of Table 2. The minimum

CV error is achieved with seven peaks. The cross-validated

sensitivity and specificity were 35/47 and 30/42, respectively.

In line (2), we halved the peak width to 0.0025: this seems to

hurt the prediction accuracy. In line (3), we have restricted the

splits so that the contrast represents presence versus absence

of a peak. The error rate has increased. In line (4), we have

applied the lasso to the binary features from the PPC method.

This does not seem to improve prediction accuracy of the PPC

method in this problem.


Table 1. Illustrative example of nearest centroid classifier used in PPC method

            Unshrunken centroids    Shrunken centroids
Peak No.    Normal    Cancer        Normal    Cancer      Feature vector
1           0.29      0.83          0.48      0.64        1
2           0.55      0.15          0.36      0.34        0
3           0.74      0.34          0.55      0.53        0
4           0.31      0.70          0.50      0.51        1
5           0.83      0.45          0.64      0.64        0
6           0.69      0.32          0.50      0.50        0
7           0.64      0.28          0.46      0.46        0
8           0.67      0.32          0.50      0.50        0
9           0.64      0.30          0.47      0.47        1

The raw (unshrunken) class centroids, are shown in the 2nd and 3rd columns. These

are the proportion of samples in each class with peaks higher than the optimal split

point, at each site (in red type in Fig. 1). For illustration our example has only nine peak

sites. In reality there will usually be many more (192 in our ovarian cancer example).

A typical feature vector from a new spectrum is shown in the rightmost column. This

spectrum has a peak higher than the cutpoints at sites 1, 4 and 9. A nearest centroid

classifier compares the feature vector to the two centroids, and predicts to the class

to which it is closest in squared distance. In this case we predict to class ‘Cancer’.

We can often improve upon this classifier by shrinking the centroids towards each

other by an amount $\Delta$. Here we chose $\Delta = 0.19$, producing the shrunken centroids

in columns 4 and 5. Our prediction is again based on nearest centroids, but now using

the shrunken centroids. The probabilities at peaks 5–9 have been shrunken together,

and so the prediction is based only on the first 4 peaks. In this case, the prediction is

still to class ‘Cancer’, but we have simplified the model.

Lines (5) and (6) of the table represent two of the best

performing methods among those studied by Wu et al. (2003).

Both methods start with the 15 sites having the largest

t-statistics in absolute value. The first, linear discriminant analysis

(LDA), is based on these 15 features, while the second

(SVM) is a support vector classifier. For SVM we optim-

ized over the choice of its cost parameter. Both LDA and

SVM perform worse than the PPC method here. In line (7) we

applied SVM to all sites. Its prediction performance might be

a little better than that of PPC. In line (8) we applied a discrete

wavelet transform with Daubechies compact wavelets to

each spectrum, using the Wavethresh3 package in R (Nason,

1998). We then applied the nearest shrunken centroid classifier to the

resulting wavelet coefficients. The classifier used only six

wavelet coefficients, but when transformed back to the ori-

ginal domain, it resulted in non-zero weights for all 91360

features. This procedure is analogous to the PPC method in

line (1), but uses a different feature extraction (encoding). We

see that the error rate is no better than that of PPC, and it uses

many more features.

Figure 9 shows the cross-validation error curves for the PPC

methods, as a function of the threshold parameter $\Delta$. The

number of sites is indicated along the top of the figure. We have

included the PPC/lasso method on the plot, using the number

of non-zero sites as the plotting abscissa.

Fig. 6. Left column: three spectra from cancer patients having a peak higher than 0.55 at the site m/z = 2995.1. Both the raw (black) and smoothed (red) spectra are shown. In the right column, we show three spectra from healthy patients without the peak, or whose peak is too low. The vertical dotted lines indicate the centroid 2995.1 and the outer limits for the peak position.

Fig. 7. Estimated FDR, as a function of the number of peaks called significant.

We note that the CV error for LDA and SVM/t-15 reported in Wu et al. (2003) averaged ∼12–15%, or 12–14 errors out of 89. This is far better than the results in Table 2. But in their study, these authors used all 89 samples to choose the 15 sites, and then applied cross-validation keeping the 15 sites fixed (personal communication). This produces an unrealistically

low error rate that does not accurately estimate the true test error rate.

Fig. 8. Binary heatmap of top seven training set features. Each row corresponds to one of the seven peak centroids, and each column corresponds to a training spectrum. A pixel is dark if the peak at that site exceeds the threshold determined by PPC. The rows (peaks) are ordered by decreasing strength from top to bottom.

Table 2. The results for ovarian cancer example

Method                  Cross-validation errors/89 (SE)    Number of sites
(1) PPC                 23 (1.1)                           7
(2) PPC/width 0.0025    27 (1.7)                           3
(3) PPC/pres-abs        30 (1.8)                           133
(4) PPC/lasso           25 (1.5)                           192
(5) LDA/t-15            31 (1.4)                           15
(6) SVM/t-15            27 (1.6)                           15
(7) SVM                 21 (1.4)                           91360
(8) Wavelets            26 (1.3)                           91360

Methods are (1) peak probability contrast with default peak width of 0.005; (2) PPC with peak width 0.0025; (3) PPC with splits restricted (i.e. peak present or absent); (4) lasso applied to binary features from PPC method; (5) linear discriminant analysis using the top 15 sites as ranked by the t-statistic; (6) support vector machine (SVM) using these same 15 sites; (7) support vector machine applied to all sites; and (8) nearest shrunken centroid classifier applied to wavelet coefficients.

The strongest peak used by PPC was at m/z = 2995.1.

The corresponding peak heights are shown in Table 3 and

show a strong trend towards larger peaks in the cancer spectra.

The t-statistic at m/z = 2995.1 was 3.19. Among the

91360 t-statistics, the value 3.19 ranks as only the 4196th

largest. Hence, it is not clear that screening on the value of the

t-statistics is a good way to choose features in this example.

3.2

To assess the performance of the PPC algorithm, we created

artificially ‘spiked’ spectra from our original data. First we

created artificial control and cancer datasets, each consist-

ing of approximately half of the original control and cancer

An artificial spiking experiment

Fig. 9. The 10-fold cross-validation error for PPC, as a function of

thethresholdparameter?. Thecorrespondingnumberofpeaksused

is shown along the top of the figure. We have included the PPC/lasso

methodontheplot, usingthenumberofnon-zerositesastheplotting

abscissa.

Table 3. Training set results for peak at m/z = 2995.1

            Number in quartile
            0     1     2     3     4     Total
Healthy     6     16    12    4     4     42
Cancer      3     4     8     16    16    47

Number of samples in quartiles of peak heights; quartile ‘0’ means no peak.

patients, respectively. By construction, these artificial datasets were heterogeneous but were similar to each other. We then chose two sets of five sites:

Control samples: 820.0, 1106.7, 1680.0, 2540.0, 3113.3.

Cancer samples: 1393.3, 1966.7, 2253.3, 2826.7, 3400.0.

An artificial peak was spiked into each control spectrum at the first five sites, and into each cancer spectrum at the second five sites. In each case this peak was a narrow spike of width 1. To simulate actual conditions, the position of the spike was also ‘jittered’ by up to 0.0025 from the target site.

In detail, if $h(x)$ is the intensity of the spectrum at $x = \log(m/z)$, then the height of the spectrum at $x'$ after spiking was defined to be

$$h(x') + \bar{h}(x') \cdot f, \tag{3}$$

where $\bar{h}(x')$ is the average intensity at $x'$ for all spectra, and $f$ is a fraction equal to 2, 1 or 0.5. Here $x'$ is the jittered version of $x$, i.e. $x' = x + U$, where $U$ is uniformly distributed on $[-0.0025, 0.0025]$.

Table 4. Results for the artificial spiking experiment

        10-site model                         Full model
f       No. of sites found   Test errors/45   No. of sites found   Test errors/45
2       7                    0                10                   20
1       4                    3                8                    24
0.5     3                    8                10                   21
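The spiking rule in Equation (3) can be sketched in code as follows. This is only an illustration under stated assumptions: spectra are stored on a common grid of log(m/z) values, the spike of "width 1" is taken to mean a single grid point, and the function name `spike_spectrum` and its arguments are hypothetical.

```python
import numpy as np

def spike_spectrum(x_grid, h, h_bar, site, f, jitter=0.0025, rng=None):
    """Return a spiked copy of one spectrum and the index that was spiked.

    x_grid : increasing grid of log(m/z) values (assumed common to all spectra)
    h      : intensities of one spectrum on x_grid
    h_bar  : average intensity over all spectra on x_grid
    site   : target spiking position, on the log(m/z) scale
    f      : spike size as a fraction of the average intensity
    """
    rng = np.random.default_rng() if rng is None else rng
    # Jittered position: target site plus U ~ Uniform[-jitter, jitter].
    x_jit = site + rng.uniform(-jitter, jitter)
    # Assumption: a spike of "width 1" occupies the single nearest grid point.
    k = int(np.argmin(np.abs(x_grid - x_jit)))
    spiked = np.array(h, dtype=float, copy=True)
    # Equation (3): original height plus f times the average height.
    spiked[k] = h[k] + h_bar[k] * f
    return spiked, k
```

For example, a control spectrum could be spiked at the first listed site with `spike_spectrum(x_grid, h, h_bar, np.log(820.0), f=1.0)`, repeating the call for each of the five sites assigned to that class.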

After creating the dataset, we randomly split it into a training set (2/3) and a test set (1/3). We applied the PPC procedure, giving the results in Table 4. In the 2nd and 3rd columns, the PPC model was shrunken down to its top 10 sites, and we report how many of the actual 10 spiking sites appeared in these 10 and the resulting test set error rate. By ‘appeared’, we mean that the actual spiking site was within 0.005 units of one of the centroids. Similarly, the last two columns report the results for the full (unshrunken) PPC model. This model typically had on the order of 600 centroid sites.
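The ‘appeared’ criterion above amounts to a nearest-centroid distance check. The sketch below assumes that both the spiking sites and the centroids are expressed on the same scale (taken here to be log(m/z)); the function name `count_sites_found` is hypothetical.

```python
import numpy as np

def count_sites_found(spike_sites, centroids, tol=0.005):
    """Count spiking sites lying within tol of at least one centroid.

    spike_sites, centroids : 1-D sequences on the same scale
    (assumed here to be log(m/z)).
    """
    spike_sites = np.asarray(spike_sites, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Distance from every spiking site to its nearest centroid.
    nearest = np.min(np.abs(spike_sites[:, None] - centroids[None, :]), axis=1)
    return int(np.sum(nearest <= tol))
```

With roughly 600 centroids in the full model, many positions fall within the tolerance of some centroid, which is one reason a full model can recover more spiking sites than a 10-site model.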

As expected, as we spike with smaller peaks, both the ability to detect these peaks and the ability to use them for prediction tend to decrease. However, with peaks twice as high as the average intensity (f = 1), the test error of the shrunken model is low (3/45) and the procedure finds 4 of the 10 spiking sites. The full model is better at finding the spiking sites (among others), but by retaining noisy sites it pays a price in test error.

4 DISCUSSION

Sample classification from proteomic data is often difficult because the signal intensity at each m/z point can be affected by both biological processes and variability in experimental conditions. The preprocessing steps applied to the MS output are critical for the overall analysis of the proteomic data. Peak normalization, identification and alignment can all affect the performance of class prediction using conventional classification approaches. The proposed peak probability contrast method first extracts clusters of peaks in the spectra. In other experiments in our laboratory, we found that this extraction step appears to be robust and reproducible when tested on spectra obtained from different runs using the same plasma sample. It can help to minimize experimental variability.

After extraction of peak clusters, PPC uses the resulting features in a nearest centroid classifier. These features can also serve as useful inputs to other classifiers such as a binary decision tree or a lasso model. Compared with other classifiers that operate on the raw spectra, PPC's performance is competitive, with the added advantage of generating a simpler, more interpretable set of features for further investigation. The efficiency of this method in finding a relatively small number of peak clusters for class prediction will facilitate future identification of biologically significant and relevant proteins for tumor development and progression. Discovery of these proteins will result in novel targets for cancer prevention and antitumor therapies.

The concept of the low molecular weight (LMW) serum proteome was recently introduced through the increasingly popular mass spectral-based proteomic analysis. Its importance was demonstrated by a number of studies (Petricoin et al., 2002b; Kozak et al., 2003). However, because these LMW markers were mainly identified through bioinformatic/statistical analysis, their identities remain elusive. Two other ovarian cancer studies yielded two panels of markers, each with five m/z values [534, 989, 2111, 2251 and 2465 (Petricoin et al., 2002b); 4.4k, 15.9k, 18.9k, 23.0k and 30.1k (Kozak et al., 2003)]. None of these m/z values coincides with those in the panel of markers we have shown here. This can plausibly be explained by differences in the samples, handling processes, instruments and statistical tools used by these studies. Although not necessarily straightforward, there are ways to purify those serum markers for identification through tryptic peptide mapping (Rai et al., 2002) or amino acid sequencing (Klade et al., 2001).

Software for performing the PPC analysis will be available

at http://www-stat.stanford.edu/∼tibs/PPC.

ACKNOWLEDGEMENTS

We would like to thank Baolin Wu and Hongyu Zhao for

sharing their data on ovarian cancer, and Yutaka Yasui for

providing details on his peak finding algorithm. We would

also like to thank the editor and reviewers for comments that

led to substantial improvements in the manuscript. R.T. was

partially supported by National Science Foundation Grant

DMS-9971405 and National Institutes of Health Contract

N01-HV-28183. Q.-T.L. was partially supported by the PHS

Grant Number CA67166 awarded by the National Cancer

Institute.

REFERENCES

Adam,B.-L., Qu,Y., Davis,J.W., Ward,M.D., Clements,M.A., Cazares,L.H., Semmes,O.J., Schellhammer,P.F., Yasui,Y., Feng,Z. and Wright,G.L.,Jr (2003) Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res., 63, 3609–3614.

Benjamini,Y. and Hochberg,Y. (1985) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B, 85, 289–300.

Breiman,L., Friedman,J., Olshen,R. and Stone,C. (1984) Classification and Regression Trees. Wadsworth.

Efron,B. and Tibshirani,R. (2002) Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol., 23, 70–86.

Efron,B., Hastie,T., Johnstone,I. and Tibshirani,R. (2002) Least angle regression. Technical Report, Stanford University, CA.

Hanash,S. (2003) Disease proteomics. Nature, 422, 226–232.


Hansen,H. and Pedersen,A. (1986) Tumor markers in patients with lung cancer. Chest, 89, 219–224.

Hutchens,T.W. and Yip,T.T. (1993) New desorption strategies for the mass-spectrometric analysis of macromolecules. Rapid Commun. Mass Spectrom., 7, 576–580.

Klade,C., Voss,T., Krystek,E., Ahorn,H., Zatloukal,K., Pummer,K. and Adolf,G. (2001) Identification of tumor antigens in renal cell carcinoma by serological proteome analysis. Proteomics, 1, 890–898.

Kozak,K., Amneus,M., Pusey,S., Su,F., Luong,M., Luong,S., Reddy,S. and Farias-Eisner,R. (2003) Identification of biomarkers for ovarian cancer using strong anion-exchange protein chips: potential use in diagnosis and prognosis. Proc. Natl Acad. Sci., USA, 100, 12343–12348.

Li,J., Zhang,Z., Rosenzweig,J., Wang,Y.Y. and Chan,D.W. (2002) Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin. Chem., 48, 1296–1304.

Merchant,M. and Weinberger,S.R. (2000) Recent advancements in surface-enhanced laser desorption/ionization-time of flight-mass spectroscopy. Electrophoresis, 21, 1164–1177.

Nason,G. (1998) Wavethresh3 software. Technical Report, Department of Mathematics, University of Bristol, Bristol, UK.

Petricoin,E.F., Zoon,K.C., Kohn,E.C., Barrett,J.C. and Liotta,L.A. (2002a) Clinical proteomics: translating benchside promise into bedside reality. Nat. Rev. Drug Discov., 1, 683–695.

Petricoin,E.F., Ardekani,A.M., Hitt,B.A., Levine,P.J., Fusaro,V., Steinberg,S.M., Mills,G.B., Simone,C., Fishman,D.A., Kohn,E. and Liotta,L.A. (2002b) Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359, 572–577.

Qu,Y., Adam,B.L., Yasui,Y., Ward,M.D., Cazares,L.H., Schellhammer,P.F., Feng,Z., Semmes,O.J. and Wright,G.L.,Jr (2002) Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin. Chem., 48, 1835–1843.

Rai,A.J., Zhang,Z., Rosenzweig,J., Shih,I.E.M., Pham,T., Fung,E.T., Sokoll,L.J. and Chan,D.W. (2002) Proteomic approaches to tumor marker discovery. Arch. Pathol. Lab. Med., 126, 1518–1526.

Smith,R., Cokkinides,V. and Eyre,H. (2003) American Cancer Society guidelines for the early detection of cancer, 2003. CA Cancer J. Clin., 53, 27–43.

Storey,J. and Tibshirani,R. (2003) Statistical significance for gen-

omewide studies. Proc. Natl Acad. Sci., USA, 100, 9440–9445.

Swenson,S., Jett,J., Sloan,J., Midthun,D., Hartman,T., Sykes,A.,

Aughenbaugh,G., Zink,F., Hillman,S., Noetzel,G. et al. (2002)

Screening for lung cancer with low-dose spiral computed tomo-

graphy. Am. J. Respir. Crit. Care Med., 165, 508–513.

Tibshirani,R. (1996) Regression shrinkage and selection via the

lasso. J. R. Stat. Soc. B, 58, 267–288.

Tibshirani,R., Hastie,T., Narasimhan,B. and Chu,G. (2003) Class

prediction by nearest shrunken centroids, with applications to

DNA microarrays. Stat. Sci., 18, 104–117.

Tusher,V., Tibshirani,R. and Chu,G. (2001) Significance analysis

of microarrays applied to transcriptional responses to ionizing

radiation. Proc. Natl Acad. Sci., USA, 98, 5116–5121.

Wu,B., Abbott,T., Fishman,D., McMurray,W., Mor,G., Stone,K.,

Ward,D., Williams,K. and Zhao,H. (2003) Comparison of stat-

istical methods for classification of ovarian cancer using mass

spectrometry data. Bioinformatics, 19, 1636–1643.

Yasui,Y., Pepe,M., Thompson,M.L., Adam,B.-L., Wright,G.L.,Jr,

Qu,Y., Potter,J.D., Winget,M., Thornquist,M. and Feng,Z. (2003)

A data-analytic strategy for protein biomarker discovery: pro-

filing of high-dimensional proteomic data for cancer detection.

Biostatistics, 4, 449–463.

APPENDIX: THE MULTI-CLASS CASE

The PPC procedure can be easily generalized to problems with more than two classes. In the notation of Section 2.5, let $G_k$ be the indices of the $n_k$ observations in group $k$, for $k = 1, 2, \ldots, K$. Let

$$p_{ik}(\alpha) = \sum_{j \in G_k} I[y_{ij} > q(\alpha, i)] / n_k, \quad k = 1, 2, \ldots, K,$$

and $\bar{p}_i(\alpha) = \sum_k n_k p_{ik}(\alpha) / \sum_k n_k$. We choose $\hat{\alpha}(i)$ to maximize $\sum_k |p_{ik}(\alpha) - \bar{p}_i(\alpha)|$ for each site $i$, and then set $\hat{p}_{ik} = p_{ik}[\hat{\alpha}(i)]$. Centroid shrinkage and classification then proceed exactly as in Section 2.7.
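The multi-class split selection for a single site can be sketched as below. This is a minimal illustration, not the authors' software: it assumes that q(α, i) is the α-quantile of all peak heights at site i (Section 2.5 is not reproduced here), that the maximization is done over a supplied grid of α values, and that the name `choose_split` is hypothetical.

```python
import numpy as np

def choose_split(y, groups, alphas):
    """Pick the quantile level maximizing sum_k |p_k(alpha) - p_bar(alpha)|.

    y      : peak heights at one site, one value per sample
    groups : class label of each sample
    alphas : candidate quantile levels in [0, 1]
    Returns (best_alpha, best_score, p) with p[k] the class proportions.
    """
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    best = None
    for alpha in alphas:
        # Assumption: q(alpha, i) is the alpha-quantile of heights at the site.
        q = np.quantile(y, alpha)
        # p_k(alpha): fraction of samples in class k with height above q.
        p = np.array([np.mean(y[groups == k] > q) for k in labels])
        # Overall proportion, weighting each class by its size n_k.
        p_bar = np.sum(counts * p) / np.sum(counts)
        score = np.sum(np.abs(p - p_bar))
        if best is None or score > best[1]:
            best = (alpha, score, p)
    return best
```

The returned proportions play the role of the class centroids at that site, which would then be passed to the shrinkage and classification steps.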
