# Regression Level Set Estimation Via Cost-Sensitive Classification

**ABSTRACT** Regression level set estimation is an important yet understudied learning task. It lies somewhere between regression function estimation and traditional binary classification, and in many cases is a more appropriate setting for questions posed in these more common frameworks. This note explains how estimating the level set of a regression function from training examples can be reduced to cost-sensitive classification. We discuss the theoretical and algorithmic benefits of this learning reduction, demonstrate several desirable properties of the associated risk, and report experimental results for histograms, support vector machines, and nearest neighbor rules on synthetic and real data.


2752 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 6, JUNE 2007


Clayton Scott, Member, IEEE, and Mark Davenport, Student Member, IEEE


Index Terms: Cost-sensitive classification, learning reduction, regression level set estimation, supervised learning.

I. INTRODUCTION

Consider a function $f : \mathbb{R}^d \to \mathbb{R}$ and a fixed value $\gamma \in \mathbb{R}$. The level set of $f$ at level $\gamma$ is the set

$$G^* = \{x : f(x) \ge \gamma\}.$$

In this paper, we consider the problem of estimating $G^*$ from a training sample of noisy input/output pairs $(X_i, Y_i)_{i=1}^n$. Our only assumption on the training data is that they are realizations of $(X, Y)$ such that $f$ is the regression of $Y$ on $X$, that is, $f(x) = E[Y \mid X = x]$.

The level set problem is relevant in a number of applications. Suppose for example that $X$ represents demographic information of an individual and $Y$ is income. While it may be instructive to estimate $f$, policy decisions often hinge on level sets such as those corresponding to the poverty line or certain tax brackets.

A second example is taken from medical decision making. Consider a cancer that is treated by either standard or aggressive chemotherapy, depending on a variable $Y$ that characterizes the severity of the cancer. The choice of treatment is made by comparing $Y$ to a threshold $\gamma$. This is the situation for osteosarcoma [1], where $Y$ is the percent necrosis (cell death) in the tumor after an initial round of treatment, and $\gamma = 90\%$ by convention. The problem is that measuring $Y$ involves an invasive biopsy. Suppose that $X$ is a feature vector (whose acquisition is less invasive) collected from the patient, such as gene expression levels derived from an RNA microarray. Knowledge of the regression level set would allow for accurate treatment planning without a biopsy.

These two examples represent a much larger collection of potential applications. In a wide range of regression problems, if it is worthwhile to estimate the regression function $f$, it is also worthwhile to estimate certain level sets. Moreover, these level sets may be of ultimate importance. And in many classification problems, labels are obtained by thresholding a continuous variable. Thus, estimating regression level sets may be a more appropriate framework for addressing many problems that are currently envisioned in other ways.

Manuscript received February 21, 2006; revised September 2, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Tulay Adali. This work was supported in part by the National Science Foundation under Grant No. 0240058.

C. Scott is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: cscott-at-eecs-dot-umich-dot-edu).

M. Davenport is with the Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005 USA (e-mail: md-at-rice-dot-edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2007.893758

Two naïve approaches to level set estimation are as follows. One is to use some method to estimate the regression function $f$ and then threshold at $\gamma$. Another is to apply standard binary classification to the data $(X_i, \tilde{Y}_i)_{i=1}^n$, where $\tilde{Y}_i = I_{\{Y_i \ge \gamma\}}$. However, both approaches are unsatisfying. The first violates Vapnik's maxim: When solving a given problem, try to avoid solving a more general problem as an intermediate step [2]. The second approach ignores the information conveyed by the distance of the different response values from $\gamma$.

In this paper, we pose regression level set estimation in terms of cost-sensitive classification. This approach lies somewhere between these two naïve approaches. It formulates the issue in terms of direct set estimation and thus bypasses the intermediate step of estimating $f$, while still accounting for response magnitudes. We argue that the cost-sensitive formulation provides a natural performance measure for the level set learning task.

Our approach can be described as a "learning reduction" from one supervised learning problem to another which is more fundamental or better understood. As discussed by Beygelzimer et al. [3], such reductions come with both algorithmic and theoretical benefits. From an algorithmic standpoint, we can estimate regression level sets using algorithms for cost-sensitive classification. Furthermore, as discussed below, cost-sensitive classification can be further reduced to conventional binary classification. Thus, standard methods such as support vector machines, decision trees, and nearest neighbors can be brought to bear on the problem. The ability to import and adapt existing classification algorithms is a principal advantage of our framework.

From a theoretical perspective, the analysis of algorithms for regression level set estimation can be deduced from well studied results for classification. For example, if we assume the regression function and noise are bounded, concentration inequalities like Hoeffding's can be applied as they are in the analysis of conventional classification algorithms [4], [5].

In previous work on regression level set estimation, Cavalier [6] demonstrated asymptotic minimax rates of convergence for piecewise polynomial estimators constructed with an excess mass criterion. Willett and Nowak [7], [8] also demonstrated minimax rates (for different smoothness classes) for estimators based on recursive dyadic partitions. A difference between these works and the present work is in the performance measure used to quantify the quality of an estimate. Our performance measure is the risk given by the expected misclassification cost, and its connection to the performance measures in the above cited works is spelled out in Section IV.

The paper is structured as follows. Section II reviews cost-sensitive classification and discusses the costing algorithm of [9]. Section III formally defines regression level set estimation and formulates a solution in terms of cost-sensitive classification. Section IV demonstrates several desirable properties of the risk proposed in Section III. Section V describes support vector and nearest neighbor algorithms for regression level set estimation. Section VI illustrates the proposed ideas with experiments on synthetic and real-world data. Conclusions and future work are discussed in Section VII.

II. REVIEW OF COST-SENSITIVE CLASSIFICATION

Cost-sensitive classification problems can be grouped into two kinds: those with class-dependent costs, and those with example-dependent costs. In binary classification with class-dependent costs, for example, false positives and false negatives incur fixed costs $c_0$ and $c_1$. The goal is to learn, from training data, a classifier with low expected misclassification cost (Bayes risk). This framework is appropriate when false positives and false negatives carry different penalties and was the subject of most early research on cost-sensitive classification (see [10], [11], [12], and references therein).

1053-587X/\$25.00 © 2007 IEEE

The second variant, example-dependent cost-sensitive classification, is a generalization of the class-dependent cost problem and is the relevant framework for this paper. Consider random variables $(X, Y, C) \in \mathbb{R}^d \times \{0, 1\} \times [0, \infty)$, where $X$ represents a pattern, $Y$ a class label, and $C$ is the cost associated with misclassifying $X$ when the true label is $Y$. Cost-sensitive classification seeks to minimize the expected cost (or risk)

$$R(G) = E\left[C \, I_{\{G(X) \ne Y\}}\right].$$

Here, we overload the notation $G$ to refer to both a subset of $\mathbb{R}^d$ and the classifier $G(x) = I_{\{x \in G\}}$. Clearly, this formulation of cost-sensitive classification is more general than simply assigning fixed costs $c_0$ and $c_1$ to errors from the respective classes.

By casting level set estimation as example-dependent cost-sensitive classification, we avoid the need for function estimation and may instead rely on algorithms for direct set estimation. Fortunately, there are many algorithmic strategies for example-dependent cost-sensitive classification. In some simple settings, such as the histogram classifier discussed in Section VI, direct empirical risk minimization is possible. Many other algorithms for conventional classification can be modified (in a manner specific to the algorithm) to include example-dependent costs. Support vector and nearest neighbor methods are described concretely in Section V.

Even when direct modification is not possible, Zadrozny et al. [9] provide a general "black-box" procedure for reducing cost-sensitive classification to conventional (cost-insensitive) classification. Their approach is based on the realization that minimizing the expected cost is equivalent to minimizing the probability of error for an appropriately reweighted distribution. The idea is implemented algorithmically by a strategy termed costing. The label for a test point is based on a majority vote over a finite number (determined by the user) of classifiers. Each of these classifiers is obtained by running a conventional classification algorithm on a data set obtained by resampling the original data set. Resampling is accomplished with cost-proportionate rejection sampling. The importance of costing for the present work is that it allows a reduction of regression level set estimation to conventional classification, the most fundamental and widely studied supervised learning problem.
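The costing strategy described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation of [9]: the function names are hypothetical, a 1-nearest-neighbor rule stands in for the arbitrary base classifier, labels are assumed to be coded as 0/1, and the acceptance probability is assumed to be each cost divided by the largest observed cost (which must be positive).

```python
import numpy as np

def rejection_sample(costs, rng):
    """Keep example i with probability costs[i] / max(costs)."""
    costs = np.asarray(costs, dtype=float)
    keep = rng.random(costs.shape[0]) < costs / costs.max()
    return np.flatnonzero(keep)

def one_nn_fit_predict(X_train, y_train, X_test):
    """Hypothetical stand-in base learner: 1-nearest-neighbor classifier."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

def costing_predict(X, y, costs, X_test, n_rounds=25, seed=0):
    """Majority vote over base classifiers trained on cost-proportionate resamples."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X_test.shape[0])
    used = 0
    for _ in range(n_rounds):
        idx = rejection_sample(costs, rng)
        if idx.size == 0:  # nothing accepted in this round; skip it
            continue
        votes += one_nn_fit_predict(X[idx], y[idx], X_test)
        used += 1
    # Predict 1 when at least half of the trained classifiers vote 1.
    return (votes >= 0.5 * max(used, 1)).astype(int)
```

Under this acceptance rule, each resample retains on average a fraction `costs.mean() / costs.max()` of the original examples, so a few very large costs shrink the resampled training sets considerably.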


III. APPLICATION TO REGRESSION LEVEL SET ESTIMATION

The regression level set estimation problem is stated formally as follows. Let $(X, Y) \in \mathbb{R}^d \times \mathbb{R}$ be random variables. Assume that for some (unknown) function $f : \mathbb{R}^d \to \mathbb{R}$ we have $Y = f(X) + \epsilon$, where $\epsilon$ is zero mean noise with Lebesgue density $p(\epsilon)$. Although it is not reflected in the notation, the distribution of $\epsilon$ may depend on $X$. Let $\gamma \in \mathbb{R}$ be fixed. The goal is to estimate the level set

$$G^* = \{x : f(x) \ge \gamma\}$$

using only a training sample $(X_i, Y_i)_{i=1}^n$ of realizations of $(X, Y)$.

Our proposal is to estimate $G^*$ by reducing to a cost-sensitive classification problem as follows. Define $\tilde{Y} = I_{\{Y \ge \gamma\}}$ and $C = |Y - \gamma|$, and define the risk $R(G)$ of a set $G$ to be the expected cost for cost-sensitive classification based on $(X, \tilde{Y}, C)$, as follows:

$$R(G) = E\left[C \, I_{\{G(X) \ne \tilde{Y}\}}\right] = E\left[|Y - \gamma| \, I_{\{G(X) \ne I_{\{Y \ge \gamma\}}\}}\right]. \tag{1}$$

We can now apply cost-sensitive learning algorithms to the training data $(X_i, \tilde{Y}_i, C_i)_{i=1}^n$ and the outcome will be an estimate of the regression level set $G^*$.
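The reduction amounts to a simple relabeling of the regression data, which the following Python sketch illustrates. The function names are hypothetical, and the convention that a response exactly at the level receives label 1 is an assumption; the empirical risk is the sample average of the cost over points whose set membership disagrees with the binarized label.

```python
import numpy as np

def to_cost_sensitive(y, gamma):
    """Map responses to (label, cost): label = 1{y >= gamma}, cost = |y - gamma|."""
    y = np.asarray(y, dtype=float)
    return (y >= gamma).astype(int), np.abs(y - gamma)

def empirical_risk(in_set, y, gamma):
    """Average cost over points whose set membership disagrees with their label."""
    labels, costs = to_cost_sensitive(y, gamma)
    errors = np.asarray(in_set, dtype=int) != labels
    return float(np.mean(costs * errors))
```

For example, with responses `[0.0, 2.0, 3.0]`, level `1.0`, and a candidate set containing only the first two points, the first and third points are errors with costs 1 and 2, so the empirical risk is 1.0.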

IV. PROPERTIES OF THE RISK

In this section, we give credence to the proposed reduction by demonstrating certain desirable properties of the risk and relating it to the classification risk and other metrics for the level set problem. Let $\bar{G}$ denote the complement of $G$, and let $G \Delta G^* = (G \cap \bar{G^*}) \cup (\bar{G} \cap G^*)$ denote the symmetric difference of $G$ and $G^*$. Let $P_X$ denote the distribution of $X$. The following is proven in the Appendix.

Proposition 1: The excess risk can be expressed as

$$R(G) - R(G^*) = \int_{G \Delta G^*} |f(x) - \gamma| \, dP_X. \tag{2}$$

Corollary 1: The risk $R(G)$ is minimized by the level set $G^* = \{x : f(x) \ge \gamma\}$.

Proposition 1 establishes that the error associated with a misidentified point is proportional to the distance from the regression function at that point to the target level. This makes sense because points for which $|f(x) - \gamma|$ is large should be easier to classify, and any estimate that errs on such a point should be penalized more heavily than if it erred where $|f(x) - \gamma|$ is small. If we think of a classification problem where the labels are obtained by thresholding a continuous response variable, then points for which $|f(x) - \gamma|$ is small are "almost" in the other class anyway, so it is not as problematic to misclassify them.
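A quick Monte Carlo check can make the excess-risk identity concrete. The setup below is entirely hypothetical and chosen only because the integral has a closed form: $X$ uniform on $[0, 1]$, $f(x) = x$, Gaussian noise with standard deviation 0.5, level $\gamma = 0.5$ (so the true level set is $[0.5, 1]$), and a candidate set $[0.7, 1]$. The symmetric difference is then $[0.5, 0.7)$ and the integral of $|f(x) - \gamma|$ over it is $0.5 \cdot 0.2^2 = 0.02$.

```python
import numpy as np

def risk(in_set, y, gamma):
    """Empirical cost-sensitive risk of a set given as a boolean membership array."""
    labels = y >= gamma
    return float(np.mean(np.abs(y - gamma) * (in_set != labels)))

def excess_risk_demo(n=200_000, gamma=0.5, left=0.7, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.random(n)                        # X ~ Uniform[0, 1] (assumed)
    y = x + rng.normal(0.0, 0.5, size=n)     # f(x) = x plus zero-mean noise (assumed)
    # Candidate set [left, 1] versus the true level set [gamma, 1].
    mc = risk(x >= left, y, gamma) - risk(x >= gamma, y, gamma)
    exact = 0.5 * (left - gamma) ** 2        # integral of |x - gamma| over [gamma, left)
    return mc, exact
```

With the default sample size the simulated excess risk should agree with the closed-form value up to Monte Carlo error of roughly $10^{-3}$.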

The excess risk here is similar to the excess risk in conventional classification. Let $(X, Y) \in \mathbb{R}^d \times \{0, 1\}$ and denote $\eta(x) = P(Y = 1 \mid X = x)$. Recall that we identify subsets $G \subset \mathbb{R}^d$ with classifiers $G(x) = I_{\{x \in G\}}$. In conventional classification, the risk of a classifier is defined to be $R_{\mathrm{class}}(G) = P(G(X) \ne Y)$. The Bayes classifier is a level set of $\eta$: $G^*_{\mathrm{class}} = \{x : \eta(x) \ge 1/2\}$. Furthermore, we have the formula [4]

$$R_{\mathrm{class}}(G) - R_{\mathrm{class}}(G^*_{\mathrm{class}}) = \int_{G \Delta G^*_{\mathrm{class}}} |2\eta(x) - 1| \, dP_X.$$

Conceptually, we may view conventional binary classification as a special regression level set estimation problem where the response variables have been "binarized" to 0 or 1. Conversely, from the discussion of the costing algorithm in Section II, we can think of regression level set estimation as a binary classification problem where the labels are obtained by thresholding the continuous responses and the probability mass of $X$ has been reweighted in proportion to $|f(x) - \gamma|$.

Further theoretical guarantees are possible for certain cost-sensitive classification algorithms. For example, [9] relates the performance of their costing algorithm to the performance of the underlying conventional (cost-insensitive) classification algorithm. Translating their result to our setting gives the following.

Corollary 2: Let $N = E[|Y - \gamma|]$. Let $S$ be a training sample drawn from $(X, \tilde{Y}, C)$ and let $S'$ be a sample derived from $S$ by cost-proportionate rejection sampling as described in [9]. Let $\hat{G}$ be a classification algorithm based on $S'$. If the expected¹ probability of error of $\hat{G}$ is no more than $\delta$, then the expected² value of $R(\hat{G})$ is no more than $\delta N$.

Let us now consider the connection between the excess risk $R(G) - R(G^*)$ and the performance measures studied in [6] and [7], [8]. We consider in particular the following two questions: 1) When does convergence to zero of one performance measure imply convergence of another? and 2) How useful from a practical standpoint are the various performance measures?

¹ This expectation is with respect to the random draw of $S'$.

² This expectation is with respect to the random draw of $S$.

The performance measures studied in [6] are i) the volume of the symmetric difference of the estimate and $G^*$ and ii) the Hausdorff distance between the estimate and $G^*$. In general, it is not possible to answer question 1) without imposing some conditions on the underlying distribution. A complete characterization of such conditions is beyond the scope of this paper. Roughly speaking, however, let us suppose the distribution is reasonably well-behaved in the sense that a) the boundary of $G^*$ is not too irregular, b) the distribution of $X$ is mutually bounded by Lebesgue measure on some compact set, and c) $f$ does not "flatten out" near the level $\gamma$. By a), if the Hausdorff distance tends to zero, so does the volume of the symmetric difference. By b) and equation (2), if the volume of the symmetric difference tends to zero, so does our excess risk. Conversely, by c), if the excess risk tends to zero, then the volume of the symmetric difference also tends to zero. Also by c), it can be seen that convergence of the symmetric difference to zero implies convergence of the Hausdorff distance to zero. Thus, the different performance measures are asymptotically equivalent (under the stated assumptions) in the sense that an estimator that is consistent for one is consistent for the others.

With respect to question 2), our excess risk enjoys the clear advantage that it can be minimized without access to the true level set $G^*$, which is of course unknown in practice. Furthermore, $R(G)$ can be easily estimated given sufficient data. The two performance measures of [6], in contrast, cannot be easily estimated from data because of the dependence on $G^*$.

The performance measure employed in [7] and [8] is very similar to the cost-sensitive classification risk. In particular, they consider the risk (ignoring constants)

$$\tilde{R}(G) = E\left[|Y - \gamma| \, I_{\{G(X) \ne \tilde{Y}\}}\right] - E\left[|Y - \gamma| \, I_{\{G(X) = \tilde{Y}\}}\right].$$

Conceptually, one may think of this risk as both penalizing errors and rewarding correct decisions in proportion to the distance to the regression function, whereas the cost-sensitive classification risk only penalizes errors. The two risks are related by

$$\begin{aligned}
\tilde{R}(G) &= E\left[|Y - \gamma| \, I_{\{G(X) \ne \tilde{Y}\}}\right] - E\left[|Y - \gamma| \, I_{\{G(X) = \tilde{Y}\}}\right] \\
&= E\left[|Y - \gamma| \, I_{\{G(X) \ne \tilde{Y}\}}\right] - E\left[|Y - \gamma| \left(1 - I_{\{G(X) \ne \tilde{Y}\}}\right)\right] \\
&= 2 E\left[|Y - \gamma| \, I_{\{G(X) \ne \tilde{Y}\}}\right] - E\left[|Y - \gamma|\right] \\
&= 2 R(G) - E\left[|Y - \gamma|\right].
\end{aligned}$$

Note the last term does not depend on $G$. Consequently, the two risks are effectively the same. The advantage of our risk is the connection to cost-sensitive classification and the associated algorithmic benefits discussed earlier.

V. ALGORITHMS

In this section, we describe the algorithms that are later applied in Section VI. For each class of algorithms, we describe three variants: cost-insensitive classification based on binarized response values, direct cost-sensitive classification, and regression function estimation followed by thresholding.

A. Support Vector Machines

Support vector machines (SVMs) are among the most effective methods for learning classifiers from training data [13]. Conceptually, we construct the support vector classifier in a two-step process. In the first step, we transform the $x_i \in \mathbb{R}^d$ via a mapping $\Phi : \mathbb{R}^d \to \mathcal{H}$, where $\mathcal{H}$ is a high (possibly infinite)-dimensional Hilbert space. The intuition is that we should be able to separate these classes more easily in $\mathcal{H}$ than in $\mathbb{R}^d$. For algorithmic reasons, we choose $\Phi$ so that we can compute inner products in $\mathcal{H}$ through the kernel operator $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.

In the second step, we determine a hyperplane in the induced feature space according to the max-margin principle, which states that, in the case where we can separate the two classes by a hyperplane, we should pick the hyperplane that maximizes the margin, that is, the distance between the decision boundary and the closest point to the boundary. This hyperplane is then our decision boundary. Thus, if $w \in \mathcal{H}$ and $b \in \mathbb{R}$ are the normal vector and affine shift (or bias) defining the max-margin hyperplane, then the support vector classifier is given by $g_{w,b}(x) = \operatorname{sgn}(\langle w, \Phi(x) \rangle + b)$.

The original formulation of the SVM [14], which we shall call the cost-insensitive SVM, can be stated as the following quadratic program:

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + \lambda \sum_{i=1}^n \xi_i$$

subject to

$$\tilde{y}_i \left(\langle w, \Phi(x_i) \rangle + b\right) \ge 1 - \xi_i \quad \text{for } i = 1, \ldots, n,$$

$$\xi_i \ge 0 \quad \text{for } i = 1, \ldots, n,$$

where the binary labels $\tilde{y}_i$ are coded as $\pm 1$ and $\lambda > 0$ is a parameter that controls overfitting.

A simple modification to this formulation leads to the cost-sensitive SVM

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + \lambda \sum_{i=1}^n c_i \xi_i$$

subject to

$$\tilde{y}_i \left(\langle w, \Phi(x_i) \rangle + b\right) \ge 1 - \xi_i \quad \text{for } i = 1, \ldots, n,$$

$$\xi_i \ge 0 \quad \text{for } i = 1, \ldots, n,$$

where $\lambda > 0$ is again a parameter that controls overfitting, and the $c_i$ are weights depending on the individual sample.
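To illustrate how the per-sample weights enter the objective, the following Python sketch evaluates the cost-sensitive primal objective for a given hyperplane. It is a simplified illustration under stated assumptions, not the solver used in the experiments: the feature map is taken to be the identity (a linear kernel), labels are coded as $\pm 1$, the slack variables are set to their optimal values for the given hyperplane (the hinge loss), and the function and parameter names are hypothetical.

```python
import numpy as np

def cs_svm_objective(w, b, X, y_pm, costs, lam):
    """Value of 0.5*||w||^2 + lam * sum_i c_i * xi_i for a linear classifier,
    with slacks xi_i = max(0, 1 - y_i * (w . x_i + b)); y_pm holds labels in {-1, +1}."""
    margins = y_pm * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)
    return 0.5 * float(w @ w) + lam * float(np.sum(costs * slack))
```

Setting every cost to 1 recovers the cost-insensitive objective, so the two formulations differ only in how strongly each margin violation is penalized.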

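Support vector regression (SVR) solves

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 + \lambda \sum_{i=1}^n \left| y_i - \langle w, \Phi(x_i) \rangle - b \right|_{\varepsilon}$$

where

$$\left| y_i - \langle w, \Phi(x_i) \rangle - b \right|_{\varepsilon} = \max\left\{0, \left| y_i - \langle w, \Phi(x_i) \rangle - b \right| - \varepsilon\right\}$$

is the so-called $\varepsilon$-insensitive loss function. This optimization problem is solved by considering the equivalent formulation

$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2} \|w\|^2 + \lambda \sum_{i=1}^n \left(\xi_i + \xi_i^*\right)$$

subject to

$$y_i - \langle w, \Phi(x_i) \rangle - b \le \varepsilon + \xi_i \quad \text{for } i = 1, \ldots, n,$$

$$\langle w, \Phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^* \quad \text{for } i = 1, \ldots, n,$$

$$\xi_i, \xi_i^* \ge 0 \quad \text{for } i = 1, \ldots, n.$$

See [13] for further details.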

B. Nearest Neighbors

The $k$ nearest neighbor ($k$-NN) decision rule is a classic method for classification [15]. The rule assigns a label to a test point by taking a majority vote over the labels of the $k$ training points that are closest according to a specified (usually Euclidean) metric. A cost-sensitive version is obtained by taking a "weighted" vote, where the weight assigned to a neighbor is the cost $c_i = |y_i - \gamma|$. Finally, $k$-NN regression assigns a response value to a test point by averaging the response values of the $k$ closest training points. Note that for the nearest neighbor methodology, the cost-sensitive classifier is identical to the thresholded regression estimate. Also note that all methods are equivalent when $k = 1$.

Fig. 1. Training sample of size 200 for the histogram experiment, together with the true regression function and a level set estimate.
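The equivalence between the cost-weighted vote and thresholded $k$-NN regression follows because the weighted vote favors class 1 exactly when $\sum_i (y_i - \gamma) \ge 0$ over the $k$ neighbors, that is, when their average response is at least $\gamma$. The Python sketch below illustrates both rules; the function names are hypothetical, Euclidean distance is assumed, and ties are assumed to be broken in favor of class 1.

```python
import numpy as np

def knn_indices(X_train, x, k):
    """Indices of the k training points closest to x in Euclidean distance."""
    d = np.sum((X_train - x) ** 2, axis=1)
    return np.argsort(d, kind="stable")[:k]

def knn_cs_vote(X_train, y, x, k, gamma):
    """Cost-weighted vote: each neighbor votes for its label with weight |y_i - gamma|."""
    idx = knn_indices(X_train, x, k)
    cost = np.abs(y[idx] - gamma)
    w1 = cost[y[idx] >= gamma].sum()   # total weight voting "in the level set"
    w0 = cost[y[idx] < gamma].sum()    # total weight voting "outside"
    return int(w1 >= w0)

def knn_reg_threshold(X_train, y, x, k, gamma):
    """k-NN regression estimate thresholded at gamma."""
    idx = knn_indices(X_train, x, k)
    return int(y[idx].mean() >= gamma)
```

On any data set the two functions should return the same label for every test point, up to floating-point effects when the neighbor average lies numerically at the level.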

VI. EXPERIMENTS

We consider first a simple, highly controlled simulation study using histograms, and second, experiments on real-world data using SVMs and nearest neighbor methods.

A. Simulation Study With Histograms

We compare three approaches for constructing histogram level set estimators: cost-sensitive classification, cost-insensitive classification, and regression function estimation followed by thresholding. Although histograms are very simple estimators, their simplicity does ensure a reasonably fair comparison. With more complicated models, confounding factors such as the selection of free parameters and differences in design criteria make it increasingly difficult to isolate the reasons for an algorithm's performance. Furthermore, synthetic data allows for very precise estimation of performance measures.

For this experiment, we are interested in the ? ? ? level set of ???? ? ??????? ? ????. The data are generated independently according to ? ? ????????? and ? ? ??????. Fig. 1 depicts a typical realization, as well as the true regression function and a level set estimate.

The three estimates are based on a fixed partition of ????? into 20 equally spaced bins. Each bin is assigned a label of 1 or 0 to indicate whether the bin does or does not belong to the estimate. For the cost-sensitive estimate (CS), the labels are determined by minimizing the empirical risk

$$\hat{R}_n(G) = \frac{1}{n} \sum_{i=1}^n |Y_i - \gamma| \, I_{\{G(X_i) \ne \tilde{Y}_i\}}.$$

For the cost-insensitive estimate (CI), the labels are assigned by minimizing the empirical probability of error. For the third method (REG), the regression function is estimated by a constant on each cell using an $L_1$ distortion. The constant is thus the median value of the response variables on the bin. An $L_2$ distortion was also considered, but this in fact leads to an estimate that is identical to CS, a coincidence stemming from the simplicity of the histogram estimate.
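Because the empirical risk decomposes over bins, the cost-sensitive histogram estimate can be computed one bin at a time, as in the following Python sketch. The function name is hypothetical, the inputs are assumed to lie in $[0, 1]$ for the purpose of the illustration, and empty bins are assigned label 1 by the tie-breaking rule used here; none of these choices is taken from the source.

```python
import numpy as np

def histogram_cs(x, y, gamma, n_bins=20):
    """Label each bin of [0, 1] by empirical cost-sensitive risk minimization."""
    bins = np.minimum((x * n_bins).astype(int), n_bins - 1)
    labels = np.zeros(n_bins, dtype=int)
    for j in range(n_bins):
        yj = y[bins == j]
        cost = np.abs(yj - gamma)
        cost_if_1 = cost[yj < gamma].sum()    # cost incurred if the bin is labeled 1
        cost_if_0 = cost[yj >= gamma].sum()   # cost incurred if the bin is labeled 0
        labels[j] = int(cost_if_1 <= cost_if_0)  # ties (and empty bins) go to label 1
    return labels
```

A bin receives label 1 exactly when the sum of $(Y_i - \gamma)$ over the bin is nonnegative, which is the same as thresholding the bin average at $\gamma$; this is why the squared-error histogram regression estimate coincides with CS.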

The experiment consisted of generating a training sample of size $n = 200$, computing the three estimates, and estimating their performance on a test set of size 10000. The three performance measures are the cost-sensitive risk, the probability of error, and the Lebesgue measure of the symmetric difference with respect to the true level set. The results in Table I represent averages over 10000 repetitions of the experiment and are accurate to four digits.

TABLE I

RESULTS FROM THE HISTOGRAM SIMULATION STUDY. THE REPORTED NUMBERS REPRESENT AVERAGES OVER 10000 REPETITIONS OF THE EXPERIMENT AND ARE ACCURATE TO FOUR DIGITS. THE METHODS COMPARED ARE COST-SENSITIVE (CS), COST-INSENSITIVE (CI), AND REGRESSION FOLLOWED BY THRESHOLDING (L1)

B. Real-World Data

We ran our algorithms on the benchmark data sets named "pyrim," "mpg," "housing," and "triazines." The data sets are available online with documentation.³ They contain 74, 392, 506, and 186 examples each, with dimensionalities 27, 7, 14, and 60, respectively. We randomly permuted each data set 100 times. For each permutation, we used 70% for estimating the level set and the remaining 30% for testing the estimate's performance. For the SVM methods, 40% of the data were used for training, and 30% formed a holdout set for setting free parameters. The targeted level $\gamma$ was taken to be the average of the response $Y$ across the data set.

On these data sets, we compare eight methods in all. The four SVM methods are the direct cost-sensitive SVM (SVM-CS-DIRECT), the cost-sensitive SVM via costing (SVM-CS-COSTING), the cost-insensitive SVM (SVM-CI), and SVR followed by thresholding (SVM-REG). Similarly, the four nearest neighbor methods (using $k = 3$) are denoted 3-NN-CS-DIRECT, 3-NN-CS-COSTING, 3-NN-CI, and 3-NN-REG. Recall that 3-NN-CS-DIRECT and 3-NN-REG are equivalent. Other values of $k$ were investigated, but they did not affect our conclusions. For costing, we vote over 25 resamples for the SVM and 100 for 3-NN.

To implement the support vector classifiers we used the SVM$^{light}$ package [16], while for support vector regression we employed LIBSVM [17]. In all of our SVM experiments, we used a radial basis function (Gaussian) kernel and searched for the bandwidth parameter $\sigma$ over a logarithmically spaced grid of 50 points from ???? to ???. We also searched for the regularization parameter $\lambda$ over a logarithmically spaced grid of 50 points from ???? to ???. In addition, for the SVR experiments, we searched for the width of the insensitive loss tube $\varepsilon$ over a logarithmically spaced grid of 50 points from ???? to ???.

Table II reports the estimated cost of each algorithm on each data set, averaged over all 100 permutations, along with standard deviations.

VII. CONCLUSION AND FUTURE WORK

An interesting conclusion from the synthetic data study is that the cost-sensitive approach is superior to the naïve approaches with respect to all three metrics considered: expected misclassification cost, probability of error, and measure of the symmetric difference. Thus, for classification problems where the labels are obtained by thresholding a response variable, even if the design criterion is the probability of error, it may be advantageous to incorporate cost information.

Our experiments on real-world data suggest that costing is not an effective algorithm, at least for the sample sizes we considered. This may be partially explained by the fact that cost-proportionate rejection sampling leads to sample sizes that are only a fraction of the original sample size. Furthermore, if a few costs are substantially larger than all others, then the fraction of points rejected can be quite high, leading to even smaller sample sizes.

³ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

TABLE II

EXPERIMENTAL RESULTS FOR FOUR REAL-WORLD DATA SETS. THE REPORTED NUMBERS REPRESENT AVERAGE ESTIMATED COSTS AND STANDARD DEVIATIONS OBTAINED FROM 100 PERMUTATIONS OF THE DATA INTO TRAINING AND TEST SETS

As for the other methods, the 3-NN methods outperform the SVM methods on the first data set, while the reverse is true on the fourth data set. Within each methodology (3-NN or SVM), the three competitive approaches (direct cost-sensitive, cost-insensitive, and regression followed by thresholding) do not differ in a statistically significant way on any of the four data sets. If we look at the average cost (normalized by standard error) across the four data sets and across methodologies (3-NN and SVM), the results are 2.33, 2.38, and 2.27. Thus, the cost-sensitive approach appears to have a slight edge over the cost-insensitive method, which is to be expected since it does not throw away information. On the other hand, the regression/thresholding method seems to perform at least as well as the cost-sensitive approach.

Although the cost-sensitive and regression-based methods have comparable performance, there can be a significant difference in terms of computation time. In particular, the cost-sensitive SVM has two free parameters while SVR has three. When conducting a grid search over parameter values and using a holdout or cross-validation error estimate, the increased computational complexity of SVR is substantial. This observation may extend to other algorithmic frameworks because regression is a harder problem and will often require the specification of more free parameters than classification.
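To make the comparison concrete with the grids reported in Section VI (50 candidate values per parameter), and assuming an exhaustive search over the full grid (the text does not state how the search was carried out), the number of parameter settings to evaluate is

$$50^2 = 2{,}500 \ \text{(cost-sensitive SVM)} \qquad \text{versus} \qquad 50^3 = 125{,}000 \ \text{(SVR)},$$

that is, a factor of 50 more model fits for SVR per training/holdout split.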

An interesting problem for future work is to demonstrate a general algorithmic framework for estimating multiple level sets (of the same regression function) simultaneously. One approach is to reduce the problem to multiclass cost-sensitive classification, but it would be important to constrain the estimated sets to be nested [8].

APPENDIX I

PROOF OF PROPOSITION 1

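Observe

$$\begin{aligned}
R(G) &= E\left[|Y - \gamma| \left(I_{\{X \in G, \, Y < \gamma\}} + I_{\{X \notin G, \, Y \ge \gamma\}}\right)\right] \\
&= \int_{G} \int_{-\infty}^{\gamma - f(x)} \left(\gamma - f(x) - \epsilon\right) p(\epsilon) \, d\epsilon \, dP_X + \int_{\bar{G}} \int_{\gamma - f(x)}^{\infty} \left(f(x) + \epsilon - \gamma\right) p(\epsilon) \, d\epsilon \, dP_X \\
&= \int_{G} A(x) \, dP_X + \int_{\bar{G}} B(x) \, dP_X
\end{aligned}$$

where

$$A(x) = \int_{-\infty}^{\gamma - f(x)} \left(\gamma - f(x) - \epsilon\right) p(\epsilon) \, d\epsilon, \qquad B(x) = \int_{\gamma - f(x)}^{\infty} \left(f(x) + \epsilon - \gamma\right) p(\epsilon) \, d\epsilon.$$

Therefore

$$\begin{aligned}
R(G) - R(G^*) &= \int_{G} A(x) \, dP_X + \int_{\bar{G}} B(x) \, dP_X - \int_{G^*} A(x) \, dP_X - \int_{\bar{G^*}} B(x) \, dP_X \\
&= \int_{G \cap \bar{G^*}} A(x) \, dP_X - \int_{\bar{G} \cap G^*} A(x) \, dP_X + \int_{\bar{G} \cap G^*} B(x) \, dP_X - \int_{G \cap \bar{G^*}} B(x) \, dP_X \\
&= \int_{G \cap \bar{G^*}} \left(A(x) - B(x)\right) dP_X + \int_{\bar{G} \cap G^*} \left(B(x) - A(x)\right) dP_X.
\end{aligned}$$

Now

$$\begin{aligned}
A(x) - B(x) &= \int_{-\infty}^{\infty} \left(\gamma - f(x) - \epsilon\right) p(\epsilon) \, d\epsilon \\
&= \gamma - f(x) - \int_{-\infty}^{\infty} \epsilon \, p(\epsilon) \, d\epsilon \\
&= \gamma - f(x)
\end{aligned}$$

because $\epsilon$ is zero mean. Since $x \in G^* \iff f(x) - \gamma \ge 0$, we have

$$R(G) - R(G^*) = \int_{G \Delta G^*} |f(x) - \gamma| \, dP_X$$

as desired.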

REFERENCES

[1] T.-K. Man, M. Chintagumpala, J. Visvanathan, J. Shen, L. Perlaky, J. H. M. Johnson, N. Davino, J. Murray, L. Helman, W. Meyer, T. Triche, K.-K. Wong, and C. C. Laus, "Expression profiles of osteosarcoma that can predict response to chemotherapy," Cancer Res., vol. 65, pp. 8142–8150, Sep. 2005.

[2] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[3] A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny, "Error-limiting reductions between classification tasks," in Proc. 22nd Int. Machine Learning Conf. (ICML), L. D. Raedt and S. Wrobel, Eds. New York: ACM Press, 2005.

[4] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer, 1996.

[5] A. B. Tsybakov, "Optimal aggregation of classifiers in statistical learning," Ann. Stat., vol. 32, no. 1, pp. 135–166, 2004.

[6] L. Cavalier, "Nonparametric estimation of regression level sets," Statistics, vol. 29, pp. 131–160, 1997.

[7] R. Willett and R. Nowak, "Minimax optimal level set estimation," in Proc. SPIE, Wavelets XI, San Diego, CA, Jul. 31–Aug. 4 2005, vol. 5914.

[8] R. Willett and R. Nowak, "Minimax optimal level set estimation," IEEE Trans. Image Process. 2006 [Online]. Available: http://www.ee.duke.edu/~willett/, submitted for publication