Content uploaded by Alexandre Xavier Ywata Carvalho

Author content

All content in this area was uploaded by Alexandre Xavier Ywata Carvalho on Nov 07, 2014

Content may be subject to copyright.


HOW SHOULD A MANAGER SET PRICES WHEN THE DEMAND FUNCTION IS UNKNOWN?

By ALEXANDRE X. CARVALHO AND MARTIN L. PUTERMAN

Statistics Department and Sauder School of Business, University of British Columbia

Revised - August 3, 2004

Abstract

This paper considers the problem of changing prices over time to maximize expected revenues in the presence of unknown demand distribution parameters. It provides and compares several methods that use the sequence of past prices and observed demands to set the price in the current period. A Taylor series expansion of the future reward function explicitly illustrates the tradeoff between short-term revenue maximization and future information gain and suggests a promising pricing policy referred to as a one-step look-ahead rule. An in-depth Monte Carlo study compares several different pricing strategies and shows that the one-step look-ahead rules dominate other heuristic policies and produce good short-term performance. The reasons for the observed bias of the parameter estimates are also investigated.

(Demand function estimation; adaptive control; Kalman filter; biased estimation)

1 Introduction

In January 2003, students and faculty in the Centre for Operations Excellence at the University of British Columbia began an extensive project, Martinez (2003), with Intrawest Corporation to develop approaches to set prices for ski lift tickets to increase the company's revenue. The analysis team quickly found that historical data was not sufficient to determine the effect of price changes on demand. At this point, the project focus shifted to determining data requirements and developing tools and methods to capture data to investigate the effect of price on demand. But even if relevant data had been available, the question of how management should vary its prices to maximize revenue remained.


This paper provides insights into how to do this by developing and evaluating several implementable price-setting mechanisms. Further, it measures the benefits of using these methods and provides implementable recommendations for how to use these approaches. In a nutshell, the main messages of this paper are:

• Managers can increase revenue by changing prices over time. This benefit comes through using variable prices to learn about the relationship between price and demand.

• Managers should collect and save pricing and sales data and use it to guide pricing decisions.

• Managers can increase total revenue by intermittently choosing prices in a random manner.

• Managers can increase total revenue by implementing a systematic approach to choosing prices and using it in real time.

• The Internet and its host of e-commerce tools are ideal for using the approaches proposed in this paper.

In the latter sections of this paper, we will expand on and quantify the benefits of following these recommendations.

This paper considers the following pricing problem. Each period, a manager (whom we refer to as "he") sets the price of a good (for example, a ski ticket of a particular type) and observes the total demand (skier visits) for that good during the period. He seeks to choose prices so as to maximize total expected revenue over a fixed finite horizon of length T, which might represent the length of a season or the lifetime of a good.

We assume an unlimited inventory of goods to simplify the analysis and to develop generalizable principles that may apply in wider contexts. Clearly this assumption is appropriate in the ski industry as well as in software and other information-good industries. It also applies to settings in which pricing and inventory decisions are made in separate units within the organization. From a technical perspective, by focusing on pricing only, we avoid joint optimization of inventory levels and prices and can provide clearer recommendations. We have chosen not to take a revenue management approach in which the inventory is finite and prices are set to maximize the total expected revenue from selling this inventory of goods. Conceptually, neither of these extensions would change our approach; they would only alter the dynamic programming equations on which our analysis is based.

We assume demand is stochastic and that its mean is described by a demand function which relates demand to price and possibly other factors such as season, time on market and competitive factors. We further assume that the form of the demand function is known; for example, its logarithm may be linear in price and quadratic in the day of the season. When the parameters of the demand function are also known, the choice of an optimal price reduces to a simple stochastic optimization problem. However, when the parameters are not known, which is the setting we will focus on, the manager may benefit from using variable pricing to learn them. Initially he may rely on prior information or intuition to guide pricing decisions. After several periods, he can use statistical methods to estimate the demand function and use this information to guide his price setting for the rest of the planning horizon.

The simplest and most widely used approach to pricing is to set a single price at the beginning of the planning horizon and use that price throughout the lifetime of the product. To do this well requires complete knowledge of the demand curve for the product. Alternatively, the decision maker may specify an "open-loop" price schedule which determines how to vary prices over the planning horizon independent of any information that may be acquired during the planning period. We will show that these approaches are not attractive when there is uncertainty about the demand model.

We focus on adaptive or "closed-loop" price setting, that is, the decision maker varies prices on the basis of his record of historical demands and prices. He can do this "myopically" by setting the price p_t at each period t to maximize the immediate expected revenue R_t. Obviously, at time t, he can use all past price and demand data to choose p_t. As will be illustrated in this paper, this strategy will not yield the maximum total expected revenue by the end of the planning horizon. In fact, the myopic strategy may turn out to be far from optimal. This is because the price set today will impact not only the immediate expected return R_t, but also the amount of information about the demand function that the retailer gains. Indeed, we show that prices that do not maximize immediate revenue will lead to better estimation of the demand curve and better future price decisions.

On the other hand, there is a vast body of statistical literature (see Draper and Smith, 1998) that addresses how to vary experimental conditions, which in the setting of this paper are the prices, to best learn a parametric function that relates outcomes to experimental conditions. Such optimal design approaches do not usually focus on the value of the outcomes, herein the revenue gained, during the experimental process; they use optimality criteria based on covariance matrices of parameters.

The trade-off between immediate reward maximization and learning has been extensively studied in many areas. We draw inspiration from the reinforcement learning (RL) literature, where most techniques are suited to problems in which the horizon is infinite and the state space is of high dimension. In these cases, there is great interest in finding the optimal strategy, and much attention is devoted to constructing methods that uncover the optimal policy in the limit. RL methods seek to approximate the value function with high accuracy, and many algorithms have been proposed to do this (see Anderson and Hong, 1994, Dietterich and Wang, 2003, Forbes and Andre, 2000, Sutton and Barto, 1998, Tsitsiklis, 1997).

The literature on learning and pricing appears to date back to Rothschild (1974). He represents the problem of maximizing the firm's revenue by choosing between two prices as a two-armed bandit model and shows, among other results, that the optimizing manager can choose the wrong price infinitely often. Kiefer and Nyarko (1989) study the general problem of learning and control for a linear system with a general utility function. They formulate this as an infinite-horizon discounted Bayesian dynamic program and show that a stationary optimal policy exists and that the posterior process converges with probability one, but that the limit need not be concentrated on the true parameter. Related work appears in Easley and Kiefer (1988, 1989).

Balvers and Cosimano (1990) apply the Kiefer and Nyarko framework to the specific problem of a manager who sets prices dynamically to maximize expected total revenue. They use a dynamic programming approach to "gain some insight into why it is important for the firm to learn". They derive a specific expression for the optimal price and then explore its implications. They conclude, among other results, that when varying price, an anticipated change in demand leads to a small price change while an unanticipated shift in demand leads to greater changes in price, and that the effect of learning persists into the future. Their focus is qualitative; they neither quantify the potential benefits that can be gained from learning nor show how to do it in practice.

Aviv and Pazgal (2002a and 2002b) introduce learning into a pricing model formulated by Gallego and van Ryzin (1994) in which customers arrive singly according to a Poisson process with rate depending on the price. Aviv and Pazgal (2002a) are concerned with deriving a closed-form optimal control policy, while Aviv and Pazgal (2002b) use a partially observed MDP framework to consider a model with a finite number of scenarios. Petruzzi and Dada (2002) also study learning, pricing and inventory control. In their model, the inventory level censors demand; once the inventory level is sufficiently high or demand is low enough that it is not censored, the demand function parameters can be determined with certainty and revenue can be maximized.

Lobo and Boyd (2003) consider a model similar to that in this paper. They assume demand is linear in price and derive a Taylor series approximation for the one-period difference between the expected reward under the policy that uses the true optimal price and that under the myopic policy. Although their Taylor series expansion is similar to ours, they use it for a different purpose; that is, to formulate a semi-definite convex program in which the objective function is the sum of discounted one-period Taylor series approximations. They then solve the problem over short planning horizons (10 and 20 periods) and compare the average revenue from several approaches. Their paper focuses on methodological issues; unlike ours, it does not focus on the managerial implications of learning.

Cope (2004) and Carvalho and Puterman (2004a) study a related dynamic pricing problem. Their setting is as follows: each period, the manager sets a price and observes the number of people who arrive and the number who purchase the product at that price. In these papers the focus is on estimating the relationship between the probability of purchase and the price set. Cope uses a non-parametric, non-increasing model to relate price to the probability of purchase, while Carvalho and Puterman use logistic regression. Further, Cope uses Bayesian methods while Carvalho and Puterman use maximum likelihood estimation. From a managerial perspective, the work of Cope and of Carvalho and Puterman focuses on low-demand items or settings in which prices can be changed frequently, while the focus of this paper is on settings in which demand is high or prices can be altered less frequently. Carvalho and Puterman (2004b) investigate the theoretical foundations of learning and pricing.

Raju, Narahari and Kumar (2004) study the use of reinforcement learning to set prices in a retail market consisting of multiple competitive sellers and heterogeneous buyers who have different preferences for price and lead time. They use a continuous-time queuing model for their system and explore the effect of using several different algorithms for computing optimal policies.

The effect of learning in other contexts has been considered by Scarf (1960), Azoury (1985), Lovejoy (1990), and Treharne and Sox (2002) in the inventory context and by Hu, Lovejoy and Shafer (1996) in the area of drug therapy. In the papers of Scarf, Azoury, Lovejoy, and Treharne and Sox, learning is passive in the sense that the policy selected by the decision maker does not impact the information received. In particular, in these papers the decision maker observes the demand each period regardless of the inventory level set by the decision maker. In contrast, in this paper and the remaining papers above, learning is active; that is, the policy set by the decision maker impacts the information that is received. In the newsvendor models studied by Lariviere and Porteus (1999) and Ding, Puterman and Bisi (2002), the inventory level censors demand, so when setting inventory levels the decision maker faces the tradeoff between myopic optimality and learning. In Hu, Lovejoy and Shafer (1996), the dosage level impacts both the health of the patient and subsequent parameter estimates, so the decision maker must again trade off short-term optimality with learning.

The remainder of the paper is organized as follows. In Section 2 we formulate our model, illustrate the trade-off between immediate revenue maximization and learning, and discuss demand distribution parameter updating. We present Monte Carlo simulation results evaluating the performance of several heuristic methods in Section 3. In Section 4 we describe why model parameter estimates may be biased. Conclusions and recommendations appear in the final section.

2 Model Formulation

Consider a manager who has an unlimited quantity of a product to sell at each time period t, t = 1, . . . , T, where T is finite. The demand q_t for the product is represented by a continuous random variable and is assumed to follow a log-linear model. Such a model was recommended by Kalyanam (1996) on the basis of an analysis of marketing data.

Consequently, we assume that the demand in period t is related to the price set in that period, p_t, through the equation

q_t = exp[α + βp_t + ε_t],  (1)

where α and β are unknown regression parameters, and ε_t is a random disturbance term that is normally distributed with mean 0 and variance σ². We will start by focusing on this

model, but many generalizations are possible, including letting the regression parameters vary over time with both random and systematic components, or adding other factors to the model; for example, if we assume a quadratic time trend, then (1) becomes

q_t = exp[α + βp_t + γt + ηt² + ε_t].  (2)

A further enhancement of this model would be to include an interaction between trend and price.
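The demand model above is straightforward to simulate, which is useful both for building intuition and for Monte Carlo experiments of the kind reported in Section 3. The sketch below draws demands from model (1); the parameter values are illustrative assumptions of ours, not values taken from the paper.

```python
import numpy as np

# Illustrative parameter values (assumptions for this sketch, not from the paper)
alpha, beta, sigma = 5.0, -0.1, 0.3

def simulate_demand(prices, alpha, beta, sigma, rng):
    """Draw q_t = exp(alpha + beta * p_t + eps_t) with eps_t ~ N(0, sigma^2), as in (1)."""
    prices = np.asarray(prices, dtype=float)
    eps = rng.normal(0.0, sigma, size=prices.shape)
    return np.exp(alpha + beta * prices + eps)

rng = np.random.default_rng(0)
demands = simulate_demand([10.0, 12.0, 15.0], alpha, beta, sigma, rng)
# Because eps_t is normal, E[q_t] = exp(alpha + beta * p_t + sigma^2 / 2)
expected = np.exp(alpha + beta * np.array([10.0, 12.0, 15.0]) + sigma**2 / 2)
```

Averaging many simulated demands at a fixed price recovers the lognormal mean exp(α + βp + σ²/2), which is the source of the factor M = exp(σ²/2) used below.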

We argue here that, from a modeling perspective, parametric models are preferable to non-parametric models in this finite-horizon setting. Since our primary focus is to optimize revenues during a limited number of periods, say T = 100 or T = 200, it would be difficult to obtain a reasonable approximation to a non-parametric demand function with so little data without strong prior assumptions. Further, the parametric formulation enables the user to easily include additional information in the demand equation, such as in (2) above, which would be extremely difficult in a non-parametric setting.

The manager's objective is to adaptively choose a sequence of prices {p_t : t = 1, . . . , T} to maximize the total expected revenue Σ_{t=1}^T p_t E[q_t] over T periods. We now focus on the model in (1). If the retailer knows the model parameters α and β, he can set the price p_t to the value that maximizes the revenue R_t(p_t) = p_t E[q_t] in each period t. The optimum price in this case is p*_t = −1/β, which does not depend on α, and the maximum expected revenues are Σ_{t=1}^T R*_t = Σ_{t=1}^T R_t(p*_t) = −(T/β)e^{α−1}M, where M = E[e^{ε_t}] = exp[σ²/2].
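The closed-form optimum is easy to verify numerically. The sketch below (parameter values are our own illustrative assumptions) compares a grid search over p·E[q_t] with the analytic price p* = −1/β:

```python
import numpy as np

# Illustrative parameter values (assumptions for this sketch, not from the paper)
alpha, beta, sigma = 5.0, -0.1, 0.3
M = np.exp(sigma**2 / 2)                 # M = E[exp(eps_t)]

def expected_revenue(p):
    """R_t(p) = p * E[q_t] = p * exp(alpha + beta * p) * M."""
    return p * np.exp(alpha + beta * p) * M

grid = np.linspace(0.01, 50.0, 100001)
p_grid = grid[np.argmax(expected_revenue(grid))]   # numerical optimum
p_star = -1.0 / beta                               # analytic optimum, independent of alpha
```

The grid maximizer agrees with −1/β, and the revenue at p* equals −(1/β)e^{α−1}M, the per-period term in the expression above.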

Of course, in any real setting, but especially for new products, the decision maker does not know the true parameter values α and β, so they must be estimated from the data acquired during the planning horizon. The key concept on which this paper is based is that the prices the manager sets influence both the data received and the ongoing revenues. Thus the manager wants to use prices both strategically, that is, to learn about the demand function, and optimally, to maximize immediate revenue. This paper is about this tradeoff.

The information flow in this system is linked through a feedback process. Each period, the manager uses his current estimates of the demand function parameters to set the price, then observes the demand in the period, and finally updates his estimates of the demand function parameters. He then repeats this process. The paper now proceeds along two paths: first, we discuss how to update the parameter estimates given an additional demand observation; second, we discuss how to choose a price given estimates of the demand distribution parameters.

2.1 Updating Demand Distribution Parameter Estimates

This section describes how to update the demand function parameter estimates once the demand in a particular period has been observed. We assume that at time t = 0, that is, before we set the first price p_1 and observe the demand q_1, the manager has specified a prior distribution for the regression parameters. The initial prior may be subjective, derived from product information and past history, or based on some preliminary sales data for the current product. We assume that the vector of regression coefficients θ = [α β]′ ∼ N(θ_0, σ²P_0), where θ_0 is the 2×1 prior mean vector and σ²P_0 is the 2×2 prior covariance matrix. To simplify our initial analysis, we assume σ² is known.

At period t, based on the available estimates α_{t−1} and β_{t−1}, the manager sets p_t (hopefully by methods suggested below) and observes the demand q_t. From (1), we can rewrite the model as y_t = log(q_t) = α + βp_t + ε_t. To simplify notation and allow easy generalization of results, we write y_t = θ′z_t + ε_t, where z_t = [1 p_t]′, and write the covariance of the estimates of the regression parameters as σ²P_t, where P_t is an explicit function of the prices up to and including decision epoch t.

Using properties of the normal distribution and the fact that the normal prior is conjugate to the normal likelihood, we can easily derive the posterior distribution of the parameters α and β using standard methods. It has been widely established (see, for example, Harvey, 1994, or Fahrmeir and Tutz, 1994) that the parameters of the posterior distribution are related to the parameters of the prior distribution through recursive equations, which can be expressed in matrix and vector notation as

θ_t = θ_{t−1} + P_{t−1}z_t F_t^{−1}[y_t − θ′_{t−1}z_t],  (3)

P_t = P_{t−1} − P_{t−1}z_t F_t^{−1} z′_t P_{t−1},  (4)

F_t = z′_t P_{t−1}z_t + H_t,  (5)

where, in our model, H_t = 1 and all other quantities are defined above. At time t, the variance of the estimate β_t is given by σ²P_{t,2,2}, where P_{t,2,2} is the second diagonal element of P_t. Therefore, at the end of period t, after observing q_t, we have updated estimates α_t and β_t of the parameters. These equations are referred to as the Kalman filter. They were originally developed in an engineering context for a different purpose and are widely used in time series analysis and control theory.
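The recursions (3)–(5) are straightforward to implement. A minimal sketch follows; the function and variable names are ours:

```python
import numpy as np

def kalman_update(theta, P, p_t, y_t, H_t=1.0):
    """One step of recursions (3)-(5) for y_t = theta'z_t + eps_t, with z_t = [1, p_t]'.

    theta : current 2-vector estimate of [alpha, beta]
    P     : current 2x2 matrix (the posterior covariance is sigma^2 * P)
    """
    z = np.array([1.0, p_t])
    F = z @ P @ z + H_t                                  # eq. (5), a scalar here
    theta_new = theta + P @ z * (y_t - theta @ z) / F    # eq. (3)
    P_new = P - np.outer(P @ z, z @ P) / F               # eq. (4)
    return theta_new, P_new
```

Because the prior is conjugate, running these updates over t observations reproduces the batch posterior mean (P_0^{−1} + Z′Z)^{−1}(P_0^{−1}θ_0 + Z′y), which gives a convenient correctness check.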

2.2 Choosing a Price

We investigate the trade-off between optimization and learning by focusing on the case where we have observed the demands {q_t : t = 1, . . . , T − 2} corresponding to the sequence of prices {p_t : t = 1, . . . , T − 2}, and now seek to maximize the expected revenue in the last two periods, E[R_{T−1}(p_{T−1})] + E[R_T(p_T)]. This can be formulated as a two-period dynamic program; our analysis is motivated by that observation. At the beginning of period t = T − 1, the information about θ is provided by the estimate θ_{T−2}, which is normally distributed with mean θ and covariance matrix σ²P_{T−2}. After fixing the price p_{T−1} and observing q_{T−1}, we update θ_{T−2} and P_{T−2} using the Kalman filter recursions (3)–(5). At the last period t = T, learning has no subsequent benefit, so the manager should choose a price that maximizes the immediate expected revenue in period T. Using simple calculus, the optimum price, given the estimate β_{T−1}, is p_T = −1/β_{T−1}.

The estimate β_{T−1} is normally distributed with variance σ²_{β_{T−1}} = σ²P_{T−1,2,2}. Here P_{T−1,2,2} denotes the element of the matrix P_{T−1} in the 2nd row and 2nd column. Because β_{T−1} is subject to estimation error, we expect the price p_T = −1/β_{T−1} to deviate from the true optimum −1/β. Therefore, we may anticipate some loss in revenue at time T due to the uncertainty about β. For an estimate β_{T−1}, if we use the rule p_T(β_{T−1}) = −1/β_{T−1}, the expected revenue in the last period can be written as

R*_T(p_T(β_{T−1})) = E[−(1/β_{T−1}) e^{α − β/β_{T−1} + ε_T}] = −M (1/β_{T−1}) e^{α − β/β_{T−1}},  (6)

where the expectation above is taken with respect to the noise ε_T. Obviously, R*_T(p_T(β_{T−1})) is maximized when β_{T−1} equals the true value β. Theorem 1 below provides an approximation for the expectation E[R*_T(p_T(β_{T−1}))] based on a Taylor expansion. The proof is presented in the Appendix and generalized considerably in Carvalho and Puterman (2004b).

Theorem 1 Suppose that the price in period T is set equal to p_T = −1/β_{T−1}. Then the expected revenue in period T can be approximated by

E[R*_T(p_T(β_{T−1}))] = R*_T(p_T(β)) + (1/2)(Me^{α−1}/β³) σ²_{β_{T−1}} + O((T − 2)^{−2}),  (7)

where the expectation above is calculated with respect to the distribution of the random variable β_{T−1}.

One of the crucial assumptions in Theorem 1 is that the sequence of prices {p_1, p_2, . . . , p_{T−2}} is non-random or, if it is random, that the dependence between p_t and p_{t+k} vanishes as k → ∞. In practice, when the prices are updated recursively, this hypothesis does not necessarily hold. The implications of violating this assumption are discussed in Section 4 and explored in depth in Carvalho and Puterman (2004b).

Based on the result above, the expected loss at period T associated with the uncertainty in β_{T−1} can be approximated by

−(1/2)(Me^{α−1}/β³) σ²_{β_{T−1}} > 0.  (8)

At time t = T − 1, the fixed price p_{T−1} will affect not only the immediate return R(p_{T−1}) but also the variance of β_{T−1}, σ²_{β_{T−1}} = σ²_{β_{T−1}}(p_{T−1}). Therefore, we can choose the price p_{T−1} that maximizes the expression

F_{T−1}(p_{T−1}) = p_{T−1} e^{α + βp_{T−1}} M + (1/2)(Me^{α−1}/β³) σ²_{β_{T−1}}(p_{T−1}),  (9)

which explicitly shows the trade-off between learning and maximization at time t = T − 1. Note that (9) involves the parameters α and β, which are exactly what we want to estimate; in computation we replace them by their current estimates.

When the manager seeks to maximize revenue over a horizon exceeding two periods, writing down expressions equivalent to (9) is not an easy task, and the dynamic programming formulations quickly become intractable. However, motivated by (9) and general observations about the growth of value functions over time, we propose a simple heuristic pricing policy, hereafter referred to as the one-step-ahead strategy, to optimize the total expected revenue over longer horizons. The one-step-ahead policy consists of choosing, at each time period t, the price p_t that maximizes the objective function

F̂_t(p_t) = p_t e^{α_{t−1} + β_{t−1}p_t} M_{t−1} + (G(t)/2)(M_{t−1} e^{α_{t−1}−1}/β³_{t−1}) σ²_{β_t}(p_t).  (10)

Note that we replaced the unknown regression coefficients α and β by their available estimates α_{t−1} and β_{t−1} at the beginning of period t (before observing the demand q_t). Analogously, M_{t−1} = e^{σ²_{t−1}/2}, where σ²_{t−1} is the estimate of σ² at the beginning of period t. The first term in the objective function above corresponds to immediate revenue maximization, while the second term corresponds to maximizing the information about (minimizing the variance of) the model parameters. We include a multiplicative term G(t) in the second expression to reflect the time remaining in the planning horizon. For a finite horizon T, we can make G(t) a decreasing function of t with G(T) = 0, since there is no benefit to learning at the last stage. We can assume, for example, a piecewise linear form G(t) = (T_c − t) for t < T_c and G(t) = 0 when t ≥ T_c, where T_c does not necessarily equal the horizon T. Alternatively, we can use an exponentially decaying form G(t) = Ke^{−tρ} − Ke^{−Kρ} for t < K and G(t) = 0 when t ≥ K.
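The one-step-ahead rule is simple to implement on a price grid: for each candidate price, apply the covariance update (4) to see what σ²_{β_t}(p_t) would become, then score the price with (10). A sketch follows; the names and parameter values are ours:

```python
import numpy as np

def one_step_ahead_price(theta, P, sigma2, G_t, price_grid):
    """Return the grid price maximizing the one-step-ahead objective (10)."""
    a, b = theta                          # current estimates alpha_{t-1}, beta_{t-1}
    M = np.exp(sigma2 / 2)
    best_p, best_val = price_grid[0], -np.inf
    for p in price_grid:
        z = np.array([1.0, p])
        F = z @ P @ z + 1.0                           # eq. (5)
        P_new = P - np.outer(P @ z, z @ P) / F        # eq. (4): covariance if we price at p
        var_beta = sigma2 * P_new[1, 1]               # sigma^2_{beta_t}(p)
        val = (p * np.exp(a + b * p) * M
               + (G_t / 2.0) * M * np.exp(a - 1.0) / b**3 * var_beta)
        if val > best_val:
            best_p, best_val = p, val
    return best_p
```

With G_t = 0 the information term vanishes and the rule reduces to the myopic price −1/β_{t−1}; with a large G_t the chosen price moves away from the myopic one in order to reduce the posterior variance of β.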

3 Monte Carlo Simulations

In this section, we present and discuss results of an extensive Monte Carlo simulation that investigates the performance of the one-step-ahead policy and other heuristic pricing strategies. We begin with a discussion of the simulation setup.

3.1 Simulation Design

We describe the classes of price selection rules that will be compared in the simulation study.

1. Myopic rule. The simplest strategy is the myopic rule, which at each period t sets the price p_t = −1/β_{t−1}, where β_{t−1} is the most recent estimate of the regression slope. We will see that this strategy produces prices which get "stuck" at a level far from the optimum p* = −1/β and do not benefit from learning.

2. Myopic rules with random exploration. An alternative to the myopic rule is to choose the optimum price p_t = −1/β_{t−1} with probability 1 − η_t and a random price with probability η_t (Sutton and Barto, 1998). Because learning is more important in the initial periods, we let η_t → K_0 as t → ∞, where K_0 equals zero, or a very small value in case we wish to continue experimenting indefinitely. We used an exponential decay function, η_t = K_0 + K_1 e^{−K_2 t}, with different values for the parameters K_0, K_1 and K_2. Some care must be taken here, because when implementing the proposed methodology in practice, prices must be chosen in an economically viable range. Further, since the proposed parametric model is only an approximation to the real data-generating process, the approximation may be reasonable only for a limited range of prices. In light of this, in exploration periods we draw random prices from a uniform distribution on a pre-specified interval [p_l, p_u].

3. Softmax exploration rules. An alternative to the myopic rule with random exploration is the softmax exploration rule described in Sutton and Barto (1998). The basic idea is to draw, at each time period t, the price p_t from the distribution with density

f(p_t) ∝ exp{[p_t e^{α_{t−1} + β_{t−1}p_t} e^{σ²_{t−1}/2}]/τ_t},  (11)

with τ_t → 0 as t → ∞. The density in (11) has a single mode at −1/β_{t−1} and, as τ_t → 0, it becomes more concentrated around the mode, so that, in the limit, we only select the price p_t = −1/β_{t−1} and the softmax rule becomes equivalent to the myopic policy. As before, we used τ_t = K_0 + K_1 e^{−K_2 t}.

4. Optimal design rules. Another approach to price selection is to choose a "statistically" optimal design, in terms of model estimation, during the first C periods, and then apply the myopic rule for the rest of the process. We may think of this as acting as a "statistician" from t = 1 to t = C and as an "optimizer" from t = C + 1 to t = T. Therefore, for t = 1, . . . , C, we select p_t = p_u if t is odd and p_t = p_l if t is even. From t = C + 1 to t = T, we use p_t = −1/β_{t−1}.

5. One-step look-ahead rules. Less arbitrary strategies are based on the one-step-ahead rule, which explicitly accounts for the trade-off between learning and revenue maximization through (10). As noted in Section 2, we use functions G(t) with piecewise linear, G(t) = max{T_c − t, 0}, and exponentially decaying, G(t) = max{Ke^{−tρ} − Ke^{−Kρ}, 0}, functional forms. Alternatively, to overcome the lack of consistency of the ordinary least squares estimator, discussed in Subsection 4.1, we also simulated modified versions of the one-step-ahead rules by performing random exploration with constant probability η_t = 0.01. Similarly to the myopic policy with random exploration, at the exploration stage the prices p_t were drawn from a uniform distribution on [p_l, p_u]. We refer to these strategies as unconstrained one-step-ahead rules (to differentiate them from the policies described below) or simply one-step-ahead rules.

6. Price-constrained one-step-ahead rules. Figure 4 below shows that the prices vary considerably at the beginning of the process, in order to allow for faster learning. In practice, a manager may wish to avoid such abrupt price changes. We explored this possibility by imposing a limit on period-to-period price changes: given the price p_t, the price p_{t+1} at decision epoch t + 1 is restricted to lie in the interval [p_t − 0.25p_t, p_t + 0.25p_t].

In all the above policies, we restricted the prices to lie within the range [p_l, p_u]. Therefore, if, at a certain period t, the calculated optimum price satisfied p_t > p_u, we used p_t = p_u. Analogously, if the calculated optimum price satisfied p_t < p_l, we used p_t = p_l. For the different policies described above, we tried different values for K_0, K_1, K_2, C, T_c, K and ρ, and the results reported here correspond to the configurations providing the best performance.
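The simulation loop itself is compact. The sketch below is a simplified stand-in for the full study, not a reproduction of it: it runs a myopic rule with optional ε-exploration over one horizon and averages total revenue over replicates. All parameter values, including the deliberately imperfect starting estimates, are our own assumptions.

```python
import numpy as np

def run_policy(explore_eps, T=100, alpha=5.0, beta=-0.1, sigma=0.3,
               p_lo=2.0, p_hi=18.0, seed=0):
    """Total revenue over one horizon under a myopic rule with epsilon-exploration.

    explore_eps = 0 gives the pure myopic rule; explore_eps > 0 mixes in
    uniform random prices on [p_lo, p_hi], as in strategy 2 above.
    """
    rng = np.random.default_rng(seed)
    theta = np.array([4.0, -0.2])        # deliberately imperfect starting estimates
    P = np.eye(2) * 10.0
    total = 0.0
    for _ in range(T):
        if rng.random() < explore_eps:
            p = rng.uniform(p_lo, p_hi)                      # exploration period
        else:
            p = float(np.clip(-1.0 / theta[1], p_lo, p_hi))  # myopic price, clipped
        q = np.exp(alpha + beta * p + rng.normal(0.0, sigma))  # demand from (1)
        total += p * q
        z = np.array([1.0, p])                               # Kalman update, eqs. (3)-(5)
        F = z @ P @ z + 1.0
        theta = theta + P @ z * (np.log(q) - theta @ z) / F
        P = P - np.outer(P @ z, z @ P) / F
    return total

rev_myopic = np.mean([run_policy(0.0, seed=s) for s in range(200)])
rev_explore = np.mean([run_policy(0.1, seed=s) for s in range(200)])
```

Comparing such paired averages across replicates is the kind of comparison reported in the tables that follow; we make no claim here about which configuration wins under these illustrative settings.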

For all the strategies described above, the Kalman filter updating equations (3)–(5) require initial values θ_0 = [α_0 β_0]′ for the regression parameters θ = [α β]′ and for the matrix P_0. Besides, all strategies except the optimal design rules require an initial value β_0 to set the price p_1. To avoid any bias related to wrong prior information, we assumed that we had information from two previous data points [log q_{−2} p_{−2}]′ and [log q_{−1} p_{−1}]′, with log q_{−i} = α + βp_{−i} + ε_{−i}, ε_{−i} ∼ N(0, σ²), i = 1, 2.¹ Avoiding bias in the initial values is particularly important when studying the bias in the ordinary least squares estimator. If incorrect prior information were used, one might argue that the bias observed in the estimates β̂_t is due to this misleading initial setup.

Based on the discussion above, we set the initial matrix P_0 = (Z′_{−1} Z_{−1})^{−1}, for Z_{−1} the two-by-two matrix Z_{−1} = [[1 p_{−2}]′, [1 p_{−1}]′]′, with p_{−2} and p_{−1} the same for all simulation replicates. The vector δ_0 is equal to (Z′_{−1} Z_{−1})^{−1} Z′_{−1} Y_{−1}, with Y_{−1} = [log q_{−2} log q_{−1}]′. Note that, although δ_0 is a random vector, it has expectation equal to the true vector δ and covariance matrix equal to σ²P_0, so that we are not biasing the conclusions due to wrong priors. In the simulation results presented in this paper, we fixed p_{−2} = p_u and p_{−1} = p_l. We also tried other values for p_{−2} and p_{−1}, but the conclusions remained the same. It is important to mention that all the rules considered in the simulation, including the "optimal design rule", benefited from the fact that we used correct information about δ_0 and P_0.
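A minimal sketch of this initialization (illustrative Python for the 2 × 2 case; variable names are ours):

```python
import math
import random

alpha, beta, sigma2 = 8.0, -1.5, 5.0     # true parameters used in the simulations
p_minus2, p_minus1 = 3.00, 0.167         # the two pilot prices, p_u and p_l

# simulate the two pre-sample observations: log q_{-i} = alpha + beta*p_{-i} + eps_{-i}
random.seed(1)
y = [alpha + beta * p + random.gauss(0.0, math.sqrt(sigma2))
     for p in (p_minus2, p_minus1)]

# delta_0 = (Z'Z)^{-1} Z'Y; with a square, invertible Z this is the exact line
# through the two pilot points, so E[delta_0] = delta
b0 = (y[0] - y[1]) / (p_minus2 - p_minus1)   # initial slope estimate beta_0
a0 = y[0] - b0 * p_minus2                    # initial intercept estimate alpha_0

# P_0 = (Z'Z)^{-1} for the 2x2 design Z = [[1, p_{-2}], [1, p_{-1}]]
s1 = p_minus2 + p_minus1
s2 = p_minus2 ** 2 + p_minus1 ** 2
det = 2 * s2 - s1 ** 2
P0 = [[s2 / det, -s1 / det], [-s1 / det, 2 / det]]
print(a0, b0)
```

Because the fit through two noisy points is exact, δ_0 is unbiased with covariance σ²P_0, matching the discussion above.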

Specifically for the one-step ahead strategies, we need, at each decision period t, t = 1, . . . , T, an estimate of the variance σ². Because the prior information for σ² affects only the one-step ahead strategies, we decided not to worry about wrong initial values for σ²_0. The two extra data points [log q_{−2} p_{−2}]′ and [log q_{−1} p_{−1}]′ do not provide enough degrees of freedom to estimate σ². Therefore, at time t = 1, the one-step ahead rule was based on σ²_0 = σ²/2 (a wrong prior). After fixing the price p_1 and observing log q_1, we have three data points in total, which makes it possible to obtain the first estimate σ²_1, used at the decision epoch t = 2. In fact, for t = 1, . . . , T, we can use the ordinary least squares estimator σ²_t = (1/t)[ε̂²_{−2} + ε̂²_{−1} + Σ_{k=1}^{t} ε̂²_k], where ε̂_k = y_k − α_t − β_t p_k, k = −2, −1, 1, . . . , t. To evaluate the effect of the choice of prior σ²_0, we also performed simulations with σ²_0 = 2σ², but the general conclusions did not change.

¹We used the indices p_{−2} and p_{−1}, instead of p_{−1} and p_0, to make explicit that the information is available before the first decision period t = 1.
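The variance estimator above can be sketched as follows (illustrative Python; the prices and demands are made-up toy numbers, not simulation output):

```python
def sigma2_hat(prices, logq, a_t, b_t):
    """OLS-style variance estimate from the residuals of all observations,
    with the two pre-sample points (indices -2 and -1) listed first.
    Divides by t = len(prices) - 2, as in the paper's estimator."""
    resid = [y - a_t - b_t * p for p, y in zip(prices, logq)]
    t = len(prices) - 2
    return sum(e * e for e in resid) / t

# toy usage: p_{-2}, p_{-1}, p_1, p_2 followed by made-up log-demands
prices = [3.00, 0.167, 1.0, 0.8]
logq = [3.5, 7.9, 6.4, 6.9]
print(sigma2_hat(prices, logq, 8.0, -1.5))
```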

To compare these different strategies, we performed L = 10,000 simulations for each policy and computed the cumulative revenues in each simulation:

CR(t) = Σ_{k=1}^{t} R_k(p_k),  t = 1, . . . , T. (12)

The expected cumulative revenues can then be estimated by the sample means

Ê[CR(t)] = (1/L) Σ_{l=1}^{L} Σ_{k=1}^{t} R_k(p_k). (13)
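Equations (12) and (13) can be sketched as follows (illustrative Python; a fixed-price policy stands in for the adaptive pricing rules, and the replicate count is scaled down):

```python
import math
import random

def simulate_cr(T, price, alpha=8.0, beta=-1.5, sigma2=5.0, rng=random):
    """One sample path of cumulative revenues CR(t) = sum_{k<=t} p_k * q_k
    under a fixed price (a stand-in for the paper's adaptive rules)."""
    cr, path = 0.0, []
    for _ in range(T):
        q = math.exp(alpha + beta * price + rng.gauss(0.0, math.sqrt(sigma2)))
        cr += price * q
        path.append(cr)
    return path

random.seed(0)
L, T = 200, 50
paths = [simulate_cr(T, 2.0 / 3.0) for _ in range(L)]
# sample-mean estimate of E[CR(t)], as in equation (13)
ecr = [sum(path[t] for path in paths) / L for t in range(T)]
print(ecr[-1])
```

Plotting `ecr` against t reproduces the kind of curve shown in Figures 1 and 2.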

By plotting the path of Ê[CR(t)] against t, t = 1, . . . , T, we gain insight into how these different computational strategies perform. In general, we focus on maximizing revenues over a short planning horizon (T = 100 or T = 200). On the other hand, it is also interesting to look at the paths of other measures as t tends to infinity. The sample paths for the estimate β_t, for example, provide insight into the long-run convergence of the model parameters under each of these computational methods.

3.2 Simulation Results

The simulations show that the unconstrained one-step ahead rules provide greater mean cumulative revenues Ê[CR(t)] than the other strategies. Figure 1 compares a selected one-step ahead rule, in which G(t) is piecewise linear, with the other rules. A comparison among several one-step pricing rules is shown in Figure 2.

For these simulations the parameter values are set to α = 8.0, β = −1.5 and σ² = 5.0. The optimum price in this case is p* = 0.667, which implies maximum mean cumulative revenues equal to 8,906.5 when p_t = p* for all t = 1, . . . , T. The minimum allowed price was p_l = 0.167 and the maximum allowed price was p_u = 3.00.
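These reported values can be checked directly: for the log-linear demand q = e^{α+βp+ε}, the expected revenue p·e^{α+βp}·e^{σ²/2} is maximized at p* = −1/β (a quick numerical check, not code from the paper):

```python
import math

alpha, beta, sigma2 = 8.0, -1.5, 5.0
p_star = -1.0 / beta  # revenue maximizer for log-linear demand
# expected per-period revenue at p*: p * exp(alpha + beta*p) * E[exp(eps)]
expected_revenue = p_star * math.exp(alpha + beta * p_star + sigma2 / 2.0)
print(round(p_star, 3), round(expected_revenue, 1))  # prints: 0.667 8906.5
```

This recovers both the optimal price p* = 0.667 and the figure 8,906.5 quoted above.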

After 100 periods, using the one-step ahead rule gives a relative gain of at least 3.7% over all the rules other than the one-step ahead rules. This relative gain is equal to 3.0% after 200 periods and to 2.4% after 400 periods. Note that the myopic rules performed poorly by getting "stuck" at a price level away from the optimum. The policies with optimal statistical design at the beginning of the pricing process perform better than the policies with random exploration (the myopic rule with random exploration and the softmax rule) during the initial periods. However, as the random exploration policies keep learning about the model parameters, they eventually outperform the statistical design rules.

[Figure 1: line plot of mean cumulative revenues against period t (t = 100 to 1000) for six policies: myopic policy, unconstrained one-step ahead policy, constrained one-step ahead policy, softmax policy, optimal statistical design, and myopic rule with random exploration.]

Figure 1: Comparison of expected mean cumulative revenues for several pricing policies (optimal expected revenue per period under known parameter values is 8,906).

Figure 1 also displays the mean cumulative revenues for a one-step ahead policy with price change constraints. At each time period t ∈ {2, . . . , T}, the prices were chosen by maximizing the objective function in (10) with the restriction p_{t+1} ∈ [p_t − 0.25p_t, p_t + 0.25p_t].

At the initial period t = 1, the price was only restricted to be within the interval [p_l, p_u]. We considered piecewise linear G(t) with T_c = 170 (fast learning) and T_c = 70 (slow learning). To simplify the presentation, the results for the slow-learning case are shown in Figure 2. Although none of the other rules had price change restrictions, the constrained one-step ahead policies still presented superior performance when compared to all rules other than the one-step ahead ones. As expected, the one-step ahead policy with fast learning performs better than the slow-learning one after some initial periods. To validate the analysis, we performed other simulations with different choices of model parameters and of minimum and maximum allowed prices, and the conclusions remained the same.

Figure 2 presents the mean cumulative revenues for the six one-step ahead strategies studied here. Note that there does not seem to be any significant difference among the four unconstrained policies. For t < 400, the rule with exponentially decaying G(t) and without random exploration seems to slightly outperform the others. For t > 400, the policy with piecewise linear G(t) and random exploration with η_t = 0.01 presents a somewhat better performance than the others. By imposing the price change constraint, the relative loss of the one-step ahead policies is no higher than 1.2% after 100 periods, 1.4% after 200 periods, and 1.0% after 400 periods.

[Figure 2: line plot of mean cumulative revenues against period t (t = 100 to 900) for the constrained one-step ahead policies with slow and fast learning and for four unconstrained one-step ahead policies: piecewise linear G(t); exponential G(t); random exploration (η = 0.01); exponential G(t) with random exploration.]

Figure 2: Expected mean cumulative revenues for several one-step ahead policies (optimal expected revenue per period under known parameter values is 8,906).

In Figure 3, we plot the mean paths of the estimated slope β for the different strategies. We observe that all strategies produce biased estimates of β for all t = 1, . . . , 1000. This effect is real and is supported by theory; the reason for it will be discussed in Subsection 4.1. However, for the myopic rule with random exploration and the one-step ahead rules with random exploration, the bias tends to zero as t grows, as expected from the discussion of adaptive control with random perturbation presented in Subsection 4.1. An additional strategy, which sets random prices at all periods t = 1, . . . , 1000, was also simulated and produced unbiased estimates of β. However, its revenue performance was very poor, since it never uses the resulting estimates for optimization purposes. For the one-step ahead policies, the bias is quite significant. However, because of the asymmetry of the revenue function, the loss incurred by a negative bias is not as harmful as that incurred by a positive bias. This phenomenon has been previously observed in the inventory literature, for example by Silver and Rahnama (1987).

[Figure 3: mean paths of the slope estimates against period t (t = 0 to 1000), ranging from about −1.65 to −1.45, for the optimal statistical design at initial periods, the myopic rule, the myopic rule with random exploration, the softmax policy, and the unconstrained one-step ahead policies with and without random exploration.]

Figure 3: Mean estimates for the slope coefficient β in the log-linear model.

Finally, Figure 4 shows the mean paths of the chosen prices for some of the different strategies. For almost all the policies, the prices get stuck at a fixed level after 80 periods. For the myopic rule with random exploration, although the prices are initially set at a level above the optimal price p* = 0.667, they tend to approach p* as t grows and there is more exploration of the true slope value. For the one-step ahead rule specifically, the prices tend to move up and down, with the variations around p* decreasing as more information is added. This illustrates the idea behind the one-step ahead policies: as more information is obtained, the marginal value of extra information decreases and the algorithm puts more weight on maximizing immediate revenues. Note that, although the estimates β_t are biased, as shown in Figure 3, the mean prices under the one-step ahead policies converge to levels very close to the optimal price p* = 2/3. This can be explained by the nonlinearity of the function p_t = −1/β_{t−1}, so that E[p_t] ≠ −1/E[β_{t−1}]. For the two constrained one-step ahead policies, note the smoother evolution of the chosen prices when compared to the price paths for the unconstrained one-step ahead rules. The constrained one-step ahead policy with fast learning presents higher price variation during the initial periods than the constrained one-step ahead policy with slow learning. This was expected, given the higher weight on the learning component (the second term on the right-hand side of equation (10)) in the fast-learning case.
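The effect of this nonlinearity can be illustrated with a toy computation (the sampling distribution below is assumed for illustration only, not taken from the simulations): the mean of −1/β̂ differs from −1/E[β̂]:

```python
import random

random.seed(42)
# toy sampling distribution for the slope estimate: biased downward (mean -1.6,
# while the true slope is -1.5)
draws = [random.gauss(-1.6, 0.2) for _ in range(100_000)]
mean_beta_hat = sum(draws) / len(draws)
mean_price = sum(-1.0 / b for b in draws) / len(draws)
print(mean_beta_hat)         # close to -1.6: a biased slope estimate
print(-1.0 / mean_beta_hat)  # about 0.625
print(mean_price)            # larger than -1/E[beta_hat]: E[-1/b] != -1/E[b]
```

Since −1/b is convex for b < 0, Jensen's inequality pushes the average plug-in price above −1/E[β̂], which is the mechanism behind mean prices landing near p* despite the biased slope estimates.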

4 Random Prices and Estimation Bias

This section focuses on two important technical issues: the observed bias of the regression parameter estimates in the previous section, and the derivation of the Taylor series expansion that is the basis for the one-step ahead rules in Section 3. Here we give a high-level discussion of these issues; a much deeper analysis appears in Carvalho and Puterman (2004b).

4.1 Bias in Parameter Estimates θ

We now discuss why the estimates of the regression parameters may be biased when prices are chosen adaptively. This phenomenon was observed in Figure 3, which showed that estimates of the regression parameter β did not converge to their true value.

[Figure 4: mean paths of chosen prices against period t (t = 0 to 100) for the myopic rule, the unconstrained one-step ahead policy, and the constrained one-step ahead rules with slow and fast learning.]

Figure 4: Comparison of chosen prices for several policies (optimal price p* = 1/1.5).

To understand the reason for the bias of β̂_t, the estimate of β based on t observations, consider the usual ordinary least squares estimator δ̂_t = (Z′_t Z_t)^{−1} Z′_t Y_t, where Z_t is the t × 2 design matrix [[1 p_1]′, [1 p_2]′, . . . , [1 p_t]′]′ and Y_t is the t × 1 response vector [log q_1 log q_2 . . . log q_t]′. Although the parameter vector δ = [α β]′ is recursively estimated with the Kalman filter, given the choice of prior N(δ_0, σ²P_0) employed in our simulations, the resulting estimate δ̂_t is numerically equivalent to the ordinary least squares (OLS) estimator (Z′_t Z_t)^{−1} Z′_t Y_t. According to classical regression theory (see Draper and Smith, 1998), when Z_t is fixed, the expected value of δ̂_t is

E{δ̂_t} = E{(Z′_t Z_t)^{−1} Z′_t Y_t} = (Z′_t Z_t)^{−1} Z′_t E{Y_t} = (Z′_t Z_t)^{−1} Z′_t E{Z_t δ + υ_t}, (14)

where υ_t = [ε_1 ε_2 . . . ε_t]′. Because E{υ_t} = 0, we conclude that E{δ̂_t} = δ, and hence δ̂_t is unbiased for fixed Z_t.
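This fixed-design unbiasedness is easy to confirm by simulation (illustrative Python, with an arbitrary fixed price grid):

```python
import math
import random

random.seed(7)
alpha, beta, sigma = 8.0, -1.5, math.sqrt(5.0)
prices = [0.5, 1.0, 1.5, 2.0, 2.5]  # fixed design, identical in every replicate

def ols_slope(ps, ys):
    """Slope of the OLS line fitted to (ps, ys)."""
    n = len(ps)
    pbar = sum(ps) / n
    ybar = sum(ys) / n
    num = sum((p - pbar) * (y - ybar) for p, y in zip(ps, ys))
    den = sum((p - pbar) ** 2 for p in ps)
    return num / den

slopes = []
for _ in range(20_000):
    ys = [alpha + beta * p + random.gauss(0.0, sigma) for p in prices]
    slopes.append(ols_slope(prices, ys))
mean_slope = sum(slopes) / len(slopes)
print(mean_slope)  # close to the true beta = -1.5
```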

As we discussed above, the sequence of prices p_1, . . . , p_t is usually random, and hence the design matrix Z_t is not fixed. Therefore, the classical theory for OLS estimation is not valid in this case, and we cannot ensure the unbiasedness of δ̂_t. Moreover, following the same derivation as in (14), we have

E{δ̂_t} = δ + E{(Z′_t Z_t)^{−1} Z′_t υ_t}, (15)

and if Z_t and υ_t were independent, it is easy to show that the second term in (15) would be zero and the OLS estimator would still be unbiased. However, because the price p_k set at period k depends on the estimate δ̂_{k−1}, and the estimate δ̂_{k−1} depends on the history of disturbances ε_1, . . . , ε_{k−1}, we conclude that p_k depends on ε_1, . . . , ε_{k−1}. Therefore, the random variables Z_t and υ_t are not independent, and we cannot guarantee that E{(Z′_t Z_t)^{−1} Z′_t υ_t} = 0; hence it is expected that δ̂_t is biased for finite t.
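To see how the dependence between Z_t and υ_t creates bias, consider a toy feedback rule (our own construction, deliberately simpler than the paper's pricing policies) in which each price reacts to the previous disturbance; the average OLS slope is then visibly biased:

```python
import math
import random

random.seed(11)
alpha, beta, sigma = 8.0, -1.5, math.sqrt(5.0)

def ols_slope(ps, ys):
    n = len(ps)
    pbar = sum(ps) / n
    ybar = sum(ys) / n
    num = sum((p - pbar) * (y - ybar) for p, y in zip(ps, ys))
    den = sum((p - pbar) ** 2 for p in ps)
    return num / den

slopes = []
for _ in range(20_000):
    ps, ys = [1.0], []
    for t in range(10):
        eps = random.gauss(0.0, sigma)
        ys.append(alpha + beta * ps[-1] + eps)
        if t < 9:
            # feedback: the next price reacts to the latest disturbance,
            # so the design matrix Z_t depends on the error vector
            ps.append(min(max(1.0 + 0.3 * eps, 0.2), 3.0))
    slopes.append(ols_slope(ps, ys))
mean_slope = sum(slopes) / len(slopes)
print(mean_slope)  # noticeably below the true beta = -1.5
```

With prices chosen independently of past disturbances, the same code would recover β ≈ −1.5 on average; the feedback is what breaks unbiasedness.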

Although δ̂_t is biased for finite t, one may be interested in the behavior of δ̂_t as t → ∞. As is well known in the econometrics literature, when Z_t is random, under some regularity conditions the estimate δ̂_t is strongly consistent for the true parameter δ, in the sense that δ̂_t → δ almost surely as t → ∞, as discussed in White (2001). These conditions, interpreted in the context of the pricing model, are that the random sequence of prices {p_t : t = 1, . . . , ∞} satisfies a strong Law of Large Numbers and that there exists a Δ > 0 such that the sequence of minimum eigenvalues λ_{min,t} of Z′_t Z_t satisfies λ_{min,t} > Δ for t = 1, . . . , ∞ with probability 1. However, for the one-step ahead rules, the sequence of prices {p_t : t = 1, . . . , ∞} does not satisfy either of these two conditions. In particular, because Z_t contains a column of ones and, for each simulation replicate, the prices approach a constant value as indicated in Figure 4, the smallest eigenvalue of Z′_t Z_t converges to zero as t goes to infinity. Moreover, looking at the sample paths for different simulations (not shown here), the price level to which the sequence of prices converges varies across the simulation replicates. Therefore, the Law of Large Numbers does not apply to the price sequence. We conclude that the usual conditions for consistency of the ordinary least squares estimator are not satisfied, and we cannot guarantee that δ̂_t → δ almost surely as t goes to infinity.

Some of these issues have been addressed in the adaptive control literature by Campi and Kumar (1998), Chen and Guo (1988), Kumar (1990) and Sternby (1977), who show that the parameter estimates δ̂_t → δ_∞ almost surely, where δ_∞ depends on the random path of the state variable, which in this example is the demand sequence {q_t}_{t=1}^∞. Further, δ_∞ ≠ δ. To guarantee the consistency of the estimates δ̂_t to the true parameter vector δ in adaptive control problems, Campi and Kumar (1998) and Chen and Guo (1988) suggest the addition of persistent yet infrequent random perturbations to the control law. These perturbations should be small in magnitude and sufficiently infrequent, so that they do not incur a high extra cost. Specifically for our dynamic pricing problem, the addition of random perturbations can be accomplished by choosing the price p_t according to the objective function (10) with probability 1 − η_t, and drawing p_t from a uniform distribution with probability η_t, with η_t very low. Although the addition of random experimentation guarantees the consistency of δ̂_t, it does not improve the performance of the one-step ahead rules over short horizons.

The use of biased parameter estimates also has some precedent in the inventory literature, for example in Silver and Rahnama (1987).
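A sketch of this perturbation scheme (illustrative Python; the decaying schedule η_t and the stand-in greedy price are assumptions, not the paper's choices):

```python
import random

def perturbed_price(t, greedy_price, p_l, p_u, rng=random):
    """With small probability eta_t, explore with a uniform draw on [p_l, p_u];
    otherwise use the price from the optimization rule (equation (10))."""
    eta_t = min(1.0, 1.0 / (10.0 * t))  # assumed decaying exploration schedule
    if rng.random() < eta_t:
        return rng.uniform(p_l, p_u)
    return greedy_price

random.seed(3)
prices = [perturbed_price(t, 0.8, 0.167, 3.00) for t in range(1, 1001)]
n_explore = sum(1 for p in prices if p != 0.8)
print(n_explore)  # only a handful of exploratory draws in 1000 periods
```

Because η_t decays like 1/t, the expected number of perturbations grows only logarithmically, keeping the extra revenue cost small, as the cited papers require.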

4.2 Validity of the one-step ahead rule

A crucial assumption in the derivations for Theorem 1 is that the sequence of prices {p_1, p_2, . . . , p_{T−2}} is fixed or, if it is random, that the dependence between p_t and p_{t+k} vanishes as k → ∞. However, if we employ the optimization rule recursively by maximizing the objective function F̂_t(p_t) in (10) at each period t, the optimum price p_t will be a function of the estimates α_{t−1}, β_{t−1} and σ²_{t−1}, which are random variables calculated using the sequence of prices {p_1, p_2, . . . , p_{t−1}}. Therefore, the price p_t will also be a random variable and will depend on the sequence {p_1, p_2, . . . , p_{t−1}}. If the initial estimate [α̂_0 β̂_0 σ̂²_0]′ is random, and we use the objective function in (10) to recursively update the prices, the whole sequence {p_t : t = 1, 2, . . . , T} will be random. The randomness of the price sequence compromises the derivation of Theorem 1. In fact, as discussed in Subsection 4.1 and illustrated in the simulations in Section 3, the bias of β̂_t, E{β̂_t − β}, does not converge to zero as t increases. Fortunately, although the assumption of non-randomness of {p_1, p_2, . . . , p_T} does not hold, the Taylor series approximation, on which the one-step ahead policies are based, may still be valid. In this subsection, we give an informal discussion of why the one-step ahead rules work well even though the assumptions on which they are based do not hold.

To understand the problems caused by the inconsistency of θ̂_t, consider the following approximation, based on the Taylor expansion in (18), presented in the proof of Theorem 1 in the Appendix:

E{R*_T(p_T(β_{T−1}))} ≈ R*_T(p_T(β)) + ∂_{β_{T−1}} R*_T(p_T(β)) E{β_{T−1} − β}
  + (1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) E{[β_{T−1} − β]²}. (16)

Because of the inconsistency of β_{T−1}, the term E{[β_{T−1} − β]²} in (16) is no longer equal to the variance of β_{T−1}. In this case, we have E{[β_{T−1} − β]²} = MSE_{β_{T−1}} ≠ Var(β_{T−1}), even for large T. Therefore, the approximation in (16) can be rewritten as

E{R*_T(p_T(β_{T−1}))} ≈ R*_T(p_T(β)) + ∂_{β_{T−1}} R*_T(p_T(β)) E{β_{T−1} − β}
  + (1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) MSE_{β_{T−1}},

and the objective function to be maximized in the recursive pricing procedure should be

F̂_t(p_t) = p_t e^{α_{t−1} + p_t β_{t−1}} M_{t−1} + (G(t)/2) M_{t−1} e^{(α_{t−1} − 1)} / β³_{t−1} · MSE_{β_t}(p_t), (17)

where α_{t−1}, β_{t−1} and M_{t−1} = exp[σ̂²_{t−1}/2] are the estimates for α, β and M = exp[σ²/2], based on the information available at the end of period t − 1. We write MSE_{β_t} = MSE_{β_t}(p_t) to emphasize that the mean square error at the end of period t depends on the price p_t.
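The conditional variance σ²_{β_t}(p_t) that stands in for MSE_{β_t}(p_t) can be evaluated for a candidate price from the updated design matrix (a fixed-design sketch; variable names are ours):

```python
def slope_variance(past_prices, p_candidate, sigma2):
    """Conditional variance of the slope estimate after charging p_candidate,
    treating all prices as fixed: sigma^2 times the (2,2) element of (Z'Z)^{-1}."""
    ps = list(past_prices) + [p_candidate]
    n = len(ps)
    s1 = sum(ps)
    s2 = sum(p * p for p in ps)
    det = n * s2 - s1 * s1  # det(Z'Z) for rows [1, p_i]
    return sigma2 * n / det  # (2,2) element of (Z'Z)^{-1}, scaled by sigma^2

past = [3.00, 0.167, 1.2]  # assumed price history
v_mid = slope_variance(past, 1.5, 5.0)  # candidate near the mean of past prices
v_far = slope_variance(past, 3.0, 5.0)  # candidate far from the mean
print(v_mid, v_far)  # the far candidate is more informative (smaller variance)
```

This is the learning term the one-step ahead rule trades off against immediate revenue: prices far from the historical mean shrink the slope variance faster.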

To implement the optimization/learning policy based on maximizing F̂_t(p_t) in (17), we need an expression for MSE_{β_t}(p_t), which may be very hard to obtain in explicit form. In Section 3, we implemented the one-step ahead rule in the simulations by maximizing, at each period t, t = 1, . . . , T, the objective function in (10), in which the unconditional MSE_{β_t}(p_t) is replaced by the conditional variance σ²_{β_t}(p_t).

In order to evaluate the approximation of the unconditional mean square error MSE_{β_t}(p_t) by the conditional variance σ²_{β_t}(p_t), assuming fixed prices, we can use the generated paths for β_t, t = 1, . . . , T, in the Monte Carlo experiment. Figure 5 compares the mean of the estimates of σ²_{β_t}(p_t) with the Monte Carlo estimate of the unconditional mean square error MSE_{β_t}(p_t) obtained from the simulations. The upper graph in Figure 5 shows the evolution of the two quantities over time. Note that both decay at the same rate, although σ²_{β_t}(p_t) is slightly higher than the estimated MSE_{β_t}(p_t) for all time periods. The lower graph in Figure 5 shows the scatter plot of the estimated MSE_{β_t}(p_t) versus σ²_{β_t}(p_t). According to the graph, there is an approximately linear relationship between these two measures; the corresponding regression line has slope 1.0274, intercept −0.0527 and R² = 0.9975. Therefore, the approximation MSE_{β_t}(p_t) ≈ K^{−1} σ²_{β_t}(p_t), with K very close to one, is justified empirically. These empirical results suggest that the objective function in (17) can be reasonably approximated by the objective function in (10), used in the simulations.

[Figure 5, upper panel: evolution of the mean square error from the Monte Carlo simulations and of the conditional variance assuming fixed prices, over periods t = 0 to 200; both decay together. Lower panel: scatter plot of the mean square error versus the conditional variance assuming fixed prices, with fitted regression line y = 1.0274x − 0.0527 and R² = 0.9975.]

Figure 5: Comparing the mean of the variance estimates based on fixed prices to the true mean square error.

5 Conclusions

In this paper, we have prescribed and analyzed methods for setting prices in the presence of demand function parameter uncertainty, focusing especially on short planning horizons. Our contributions are both practical and technical, with each of these aspects being important.

The key practical issue addressed by this paper arises from the fact that the demand function is never known in practice. Thus the prudent decision maker must make an explicit tradeoff between variance reduction and revenue optimization. Theorem 1 in this paper makes that tradeoff rigorous by using statistical asymptotic theory to approximate the MDP value function.

Extensive simulations produce important managerial implications and deep insights into the mathematical foundations of active learning.

The key managerial insights are:

• Myopic policies perform poorly for all planning horizons.

• Myopic policies plus occasional random price changes outperform the myopic policy over the long term but not over the short term.

• Active learning approaches (such as the one-step ahead rules) are better than all other approaches over all planning horizons.

• It is possible to constrain price changes each period and still drastically improve over myopic policies, although, of course, unconstrained policies produce greater revenue.

The bottom line is that managers should use active learning and, if that is not possible, at least be willing to experiment with some price changes to learn about the demand function. The methods in this paper suggest how to do this experimentation and would have been of use to Intrawest's management when it sought to increase revenue by varying prices.

From a technical perspective, we have observed that the adaptive rules lead to biased parameter estimates. Even though the demand function is never estimated accurately, active learning still produces good revenue streams. This also suggests that one should consider biased parameter estimates when combining estimation with optimization. Bias is present in all active learning (adaptive control); it is due to randomness in both prices and noise. Under repeated simulations, we would still get biased parameter estimates unless prices were fixed and the only source of variability was the random disturbance in the demand function. We have shown why this is the case and also why one-step ahead methods still produce excellent results.

The authors are investigating several extensions of this model:

• Empirically testing the methods of this paper in real or simulated markets.

• Including other explanatory variables in the demand function, which might be fixed (seasonal dummy variables, a time trend, day of the week) or random (competitor prices, market indicators) covariates. In particular, by regarding the constant in (1) as market size, we can view a time trend as a changing market size and investigate its implications for price choice throughout the planning horizon.

• For low-demand items, Poisson, binomial or other generalized linear models may be more appropriate demand distribution models. A first step in this direction is pursued in Carvalho and Puterman (2004a).

• Allowing model parameters to change over time following a state space model or a step-change model.

• Exploring enhanced price setting mechanisms that may yield higher revenues or have reduced bias.

• Allowing for heterogeneity in markets by using mixture models (see, for example, Hastie, Tibshirani and Friedman, 2001), where we increase the number of components as we observe more data. In this case, we expect the number of components or basis functions J to be an increasing function of the number of available observations t.


Acknowledgment

This research was partially supported by grants from NSERC and the MITACS NCE

(Canada). We wish to thank the Area Editor, Bill Lovejoy and an associate editor for

helping us align the paper with the editorial objectives of Management Science. Dan

Adelman of The University of Chicago also provided some helpful comments on an earlier

draft of this manuscript.

Appendix

Proof of Theorem 1. Using a Taylor series expansion of R*_T(p_T(·)) around the true parameter β, we have

R*_T(p_T(β_{T−1})) = R*_T(p_T(β)) + ∂_{β_{T−1}} R*_T(p_T(β)) [β_{T−1} − β]
  + (1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) [β_{T−1} − β]² + (1/6) ∂³_{β_{T−1}} R*_T(p_T(β)) [β_{T−1} − β]³
  + (1/24) ∂⁴_{β_{T−1}} R*_T(p_T(β̄)) [β_{T−1} − β]⁴,

where β̄ is located between β_{T−1} and β, and ∂^r_{β_{T−1}} denotes the r-th derivative with respect to β_{T−1}. Taking expectations with respect to the random variable β_{T−1}, we obtain

E{R*_T(p_T(β_{T−1}))} = R*_T(p_T(β)) + ∂_{β_{T−1}} R*_T(p_T(β)) E{β_{T−1} − β}
  + (1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) E{[β_{T−1} − β]²} + (1/6) ∂³_{β_{T−1}} R*_T(p_T(β)) E{[β_{T−1} − β]³}
  + (1/24) E{∂⁴_{β_{T−1}} R*_T(p_T(β̄)) [β_{T−1} − β]⁴}. (18)

The second term on the right-hand side of (18) is equal to zero, provided that β_{T−1} is unbiased for β. The third term is equal to

(1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) E{[β_{T−1} − β]²} = (1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) Var[β_{T−1}].

The fourth term in (18) is equal to zero because β_{T−1} is normally distributed, so its third central moment is zero. Finally, for the fifth term, employing Jensen's and Cauchy-Schwarz inequalities,

|E{∂⁴_{β_{T−1}} R*_T(p_T(β̄)) [β_{T−1} − β]⁴}| ≤ E{|∂⁴_{β_{T−1}} R*_T(p_T(β̄))| [β_{T−1} − β]⁴}
  ≤ E{|∂⁴_{β_{T−1}} R*_T(p_T(β̄))|²}^{1/2} E{[β_{T−1} − β]⁸}^{1/2}.

We can show that E{|∂⁴_{β_{T−1}} R*_T(p_T(β̄))|²} = O(1), so it does not diverge as the sample size (T − 1), used in the estimation of β_{T−1}, goes to infinity. Moreover, Var[β_{T−1}]^{−1/2} [β_{T−1} − β] ∼ N(0, 1), so that E{Var[β_{T−1}]^{−4} [β_{T−1} − β]⁸} = μ_8, where μ_8 is the 8-th central moment of a standard normal random variable. We then have E{[β_{T−1} − β]⁸} = Var[β_{T−1}]⁴ μ_8, and therefore

|E{∂⁴_{β_{T−1}} R*_T(p_T(β̄)) [β_{T−1} − β]⁴}| ≤ E{|∂⁴_{β_{T−1}} R*_T(p_T(β̄))|²}^{1/2} Var[β_{T−1}]² μ_8^{1/2}.

We know that Var[β_{T−1}] = σ² P_{T−1,2,2}, where P_{T−1} has the form (Z′Z)^{−1}, with Z the corresponding design matrix for the regression model in (1). If the magnitude of the rows in the design matrix Z does not change as (T − 1) goes to infinity, we have Var[β_{T−1}] = O((T − 2)^{−1}), in the sense that it goes to zero at order (T − 2)^{−1} as the sample size goes to infinity. Hence,

|E{∂⁴_{β_{T−1}} R*_T(p_T(β̄)) [β_{T−1} − β]⁴}| = O((T − 2)^{−2}),

and

E{R*_T(p_T(β_{T−1}))} = R*_T(p_T(β)) + (1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) E{[β_{T−1} − β]²} + O((T − 2)^{−2}).

By differentiating (6) twice with respect to β_{T−1}, we have

∂²_{β_{T−1}} R*_T(p_T(β_{T−1})) = −2 (M/β³_{T−1}) e^{α − β/β_{T−1}} + 4 (Mβ/β⁴_{T−1}) e^{α − β/β_{T−1}} − (Mβ²/β⁵_{T−1}) e^{α − β/β_{T−1}},

and

∂²_{β_{T−1}} R*_T(p_T(β_{T−1})) |_{β_{T−1}=β} = M e^{(α−1)} / β³.

Therefore,

E[R*_T(p_T(β_{T−1}))] = R*_T(p_T(β)) + (1/2) (M e^{(α−1)} / β³) σ²_{β_{T−1}} + O((T − 2)^{−2}),

as we wanted to show. □
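As an independent sanity check of the closed form for the second derivative evaluated at β_{T−1} = β, a central finite difference on R*_T(p_T(·)) reproduces Me^{(α−1)}/β³ (illustrative Python, using the simulation parameter values):

```python
import math

alpha, beta, sigma2 = 8.0, -1.5, 5.0
M = math.exp(sigma2 / 2.0)

def R_star(b):
    # expected revenue at the plug-in price p = -1/b: p * exp(alpha + beta*p) * M
    p = -1.0 / b
    return p * math.exp(alpha + beta * p) * M

h = 1e-4
numeric = (R_star(beta + h) - 2.0 * R_star(beta) + R_star(beta - h)) / h ** 2
closed = M * math.exp(alpha - 1.0) / beta ** 3
print(numeric, closed)  # the two agree to several digits
```

The closed form is negative (β³ < 0), confirming that p_T(β) is a local maximum of the expected revenue, which is why the second-order term in the expansion penalizes slope uncertainty.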


References

[1] C. Anderson and Z. Hong, Reinforcement Learning with Modular Neural Networks

for Control. Proceedings of NNACIP’94, the IEEE International Workshop on Neural

Networks Applied to Control and Image Processing, 1994.

[2] Y. Aviv and A. Pazgal, Pricing of Short Life-Cycle Products through Active Learning,

Technical Report, Olin School of Business, Washington University, 2002.

[3] Y. Aviv and A. Pazgal, A Partially Observed Markov Decision Process for Dynamic

Pricing, Technical Report, Olin School of Business, Washington University, 2002.

[4] K. Azoury, Bayes Solution to Dynamic Inventory Models under Unknown Demand

Distributions, Management Sci.31, 1150-1160, 1985.

[5] R. Balvers and T. Cosimano, Actively Learning about Demand and the Dynamics of

Price Adjustment, The Economic Journal,100, 882-898, 1990.

[6] M. Campi and P. Kumar, Adaptive Linear Quadratic Gaussian Control: The Cost-

Biased Approach Revisited, University of Illinois at Urbana-Champaign Technical Re-

port, http://black.csl.uiuc.edu/ prkumar, 1998.

[7] A. Carvalho and M. Puterman, Learning and Pricing in an Internet Environment with

Binomial Demands, Technical Report, Sauder School of Business, University of British

Columbia, 2004a.

[8] A. Carvalho and M. Puterman, Foundations of Learning and Pricing, Technical Report,

Sauder School of Business, University of British Columbia, 2004b.

[9] H. Chen and L. Guo, A Robust Stochastic Adaptive Controller, IEEE Transactions on

Automatic Control, 33, 1988.

[10] E. Cope, Non-parametric Strategies for Dynamic Pricing in e-Commerce, Technical

Report, Sauder School of Business, University of British Columbia, 2004.

30

[11] T. Dietterich and X. Wang, Batch Value Function Approximation via Support Vectors, in Dietterich, T. G., Becker, S., Ghahramani, Z. (Eds.), Advances in Neural Information Processing Systems 14, Cambridge, MA: MIT Press, 2003.

[12] X. Ding, M. Puterman and A. Bisi, The Censored Newsvendor and the Optimal Acquisition of Information, Operations Res., 50, 517-527, 2002.

[13] N. Draper and H. Smith, Applied Regression Analysis, Wiley Series in Probability and Statistics, 1998.

[14] D. Easley and N. Kiefer, Controlling a Stochastic Process with Unknown Parameters, Econometrica, 56, 5, 1045-1069, 1988.

[15] D. Easley and N. Kiefer, Optimal Learning with Endogenous Data, Int. Econ. Rev., 30, 4, 963-978, 1989.

[16] L. Fahrmeir and G. Tutz, Multivariate Statistical Modeling Based on Generalized Linear Models, Springer Series in Statistics, Springer-Verlag, 1994.

[17] J. Forbes and D. Andre, Real-Time Reinforcement Learning in Continuous Domain, AAAI Spring Symposium on Real-Time Autonomous Systems, 2000.

[18] G. Gallego and G. van Ryzin, Optimal Dynamic Pricing of Inventories with Stochastic Demand over Finite Horizons, Management Science, 40, 8, 999-1020, 1994.

[19] A. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press, 1994.

[20] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2001.

[21] C. Hu, W. Lovejoy and S. Shafer, Comparison of Some Suboptimal Control Policies in Medical Drug Therapy, Operations Res., 44, 696-709, 1996.

[22] K. Kalyanam, Pricing Decisions Under Demand Uncertainty: A Bayesian Mixture Model Approach, Marketing Science, 1996.

[23] N. Kiefer and Y. Nyarko, Optimal Control of an Unknown Linear Process with Learning, Int. Econ. Rev., 30, 3, 571-586, 1989.

[24] P. Kumar, Convergence of Adaptive Control Schemes Using Least-Squares Parameter Estimates, IEEE Transactions on Automatic Control, 1990.

[25] M. Lariviere and E. Porteus, Stalking Information: Bayesian Inventory Management with Unobserved Lost Sales, Management Sci., 45, 1999.

[26] E. Lehmann, Elements of Large-Sample Theory, Springer, 1999.

[27] M. Lobo and S. Boyd, Pricing and Learning with Uncertain Demand, Working Paper, 2003.

[28] W. Lovejoy, Myopic Policies for Some Inventory Models with Uncertain Demand Distributions, Management Sci., 36, 724-738, 1990.

[29] R. Martinez, Pricing in a Congestible Service Industry with a Focus on the Ski Industry, Unpublished MSc Thesis, Sauder School of Business, University of British Columbia, 2003.

[30] N. Petruzzi and M. Dada, Dynamic Pricing and Inventory Control with Learning, Naval Research Logistics, 49, 304-325, 2002.

[31] C. Raju, Y. Narahari and K. Kumar, Learning Dynamic Prices in Multi-Seller Electronic Retail Markets with Price Sensitive Customers, Stochastic Demands, and Inventory Replenishments, Indian Institute of Science Working Paper, 2004.

[32] M. Rothschild, A Two-Armed Bandit Theory of Market Pricing, J. Econ. Theor., 9, 185-202, 1974.

[33] H. Scarf, Some Remarks on Bayes Solutions to the Inventory Problem, Naval Res. Logist. Quart., 7, 591-596, 1960.

[34] E. Silver and M. Rahnama, Biased Selection of the Inventory Reorder Point when Demand Parameters are Statistically Estimated, Engr. Cost and Prod. Econ., 12, 283-292, 1987.

[35] J. Treharne and C. Sox, Adaptive Inventory Control for Non-stationary Demand and Partial Information, Management Science, 48, 607-624, 2002.

[36] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

[37] J. Sternby, On Consistency for the Method of Least Squares Using Martingale Theory, IEEE Transactions on Automatic Control, 1977.

[38] J. Tsitsiklis, An Analysis of Temporal-Diﬀerence Learning with Function Approximation, IEEE Transactions on Automatic Control, 1997.

[39] H. White, Asymptotic Theory for Econometricians, Academic Press, 2001.
