HOW SHOULD A MANAGER SET PRICES WHEN THE DEMAND
FUNCTION IS UNKNOWN?
By ALEXANDRE X. CARVALHO AND MARTIN L. PUTERMAN
Statistics Department and Sauder School of Business, University of British Columbia
Revised - August 3, 2004
Abstract
This paper considers the problem of changing prices over time to maximize expected
revenues in the presence of unknown demand distribution parameters. It provides
and compares several methods that use the sequence of past prices and observed de-
mands to set price in the current period. A Taylor series expansion of the future
reward function explicitly illustrates the tradeoff between short term revenue maxi-
mization and future information gain and suggests a promising pricing policy referred
to as a one-step look-ahead rule. An in-depth Monte Carlo study compares several
different pricing strategies and shows that the one-step look-ahead rules dominate
other heuristic policies and produce good short term performance. The reasons for
the observed bias of parameter estimates are also investigated.
(Demand function estimation; adaptive control; Kalman filter; biased estimation)
1 Introduction
In January 2003, students and faculty in the Centre for Operations Excellence at the Uni-
versity of British Columbia began an extensive project, Martinez (2003), with Intrawest
Corporation to develop approaches to set prices for ski lift tickets to increase the com-
pany’s revenue. The analysis team quickly found that historical data was not sufficient to
determine the effect of price changes on demand. At this point, the project focus shifted
to determining data requirements and developing tools and methods to capture data to
investigate the effect of price on demand. But even if relevant data had been available,
the question of how management should vary its prices to maximize revenue remained.
This paper provides insights into how to do this by developing and evaluating several implementable price-setting mechanisms. Further, it measures the benefits of using these methods and provides practical recommendations for how to use these approaches.
In a nutshell, the main messages of this paper are:
• Managers can increase revenue by changing prices over time. This benefit comes
through using variable prices to learn about the relationship between price and de-
mand.
• Managers should collect and save pricing and sales data and use it to guide pricing
decisions.
• Managers can increase total revenue by intermittently choosing prices in a random
manner.
• Managers can increase total revenue by implementing a systematic approach to choos-
ing prices and using it in real time.
• The Internet and its host of e-commerce tools are ideal for using the approaches
proposed in this paper.
In the latter sections of this paper, we will expand on and quantify the benefits from
following these recommendations.
This paper considers the following pricing problem. Each period a manager (whom we refer to as "he") sets the price of a good (for example, a ski ticket of a particular type)
and observes the total demand (skier visits) for that good during the period. He seeks to
choose prices so as to maximize total expected revenues over a fixed finite horizon of length
T which might represent the length of a season or the lifetime of a good.
We assume an unlimited inventory of goods to simplify analysis and develop generalizable principles that may apply in wider contexts. Clearly this assumption is appropriate in
the ski industry as well as in software and other information good industries. Also it applies
to settings in which pricing and inventory decisions are made in separate units within the
organization. From a technical perspective, by focussing on pricing only we avoid joint
optimization of inventory levels and prices and can provide clearer recommendations. We
have chosen not to take a revenue management approach in which the inventory is finite
and prices are set to maximize the total expected revenue from selling this inventory of
goods. Conceptually, neither of these extensions would change our approach; they would only alter the dynamic programming equations on which our analysis is based.
We assume demand is stochastic and its mean is described by a demand function which
relates demand to price and possibly other factors such as season, time on market and
competitive factors. We further assume that the demand function form is known, for
example, its logarithm may be linear in price and quadratic in the day of the season.
When the parameters of the demand function are also known, the choice of an optimal
price reduces to a simple stochastic optimization problem. However, when the parameters
are not known, which is the setting we will focus on, the manager may benefit from using
variable pricing to learn them. Initially he may rely on prior information or intuition to
guide pricing decisions. After several periods, he can use statistical methods to estimate
the demand function and use this information to guide his price setting for the rest of the
planning horizon.
The simplest and most widely used approach to pricing is to set a single price at the
beginning of the planning horizon and use that price throughout the lifetime of the prod-
uct. To do this well requires complete knowledge of the demand curve for the product.
Alternatively the decision maker may specify an ”open loop” price schedule which deter-
mines how to vary prices over the planning horizon independent of any information that
may be acquired throughout the planning period. We will show that these approaches are
not attractive when there is uncertainty about the demand model.
We focus on adaptive or ”closed-loop” price setting, that is when the decision maker
varies prices on the basis of his record of historical demand and prices chosen. He can do this
"myopically" by setting the price p_t at each period t in order to maximize the immediate expected revenue R_t. Obviously, at time t, he can use all past price and demand data to choose p_t. As will be illustrated in this paper, this strategy will not yield the maximum total expected revenues by the end of the planning horizon. In fact, this myopic strategy may turn out to be far from optimal. This is because the price set today will impact not only the immediate expected return R_t, but will also affect the amount of information about the demand function that the retailer gains. In fact, we show that prices that do not maximize immediate revenue will lead to better estimation of the demand curve and better future price decisions.
On the other hand, there is a vast body of statistical literature (see Draper and Smith, 1998) that addresses how to vary experimental conditions, which in the setting of this paper are the prices, to best learn a parametric function that relates outcomes to experimental
conditions. Such optimal design approaches do not usually focus on the impact of outcomes,
herein the revenue gained, during the experimental process and use optimality criteria based
on covariance matrices of parameters.
The trade-off between immediate reward maximization and learning has been exten-
sively studied in many areas. We draw inspiration from the reinforcement learning (RL)
literature where most of the techniques are suited for problems where the horizon is infi-
nite and the state space is of high dimension. In these cases, there is a great interest in
finding the optimal strategy, and much attention is devoted to constructing methods to uncover the optimal policy in the limit. RL methods seek to approximate the value function
with high accuracy and many algorithms have been proposed to do this (see Anderson and
Hong, 1994, Dietterich and Wang, 2003, Forbes and Andre, 2000, Sutton and Barto, 1998,
Tsitsiklis, 1997).
The literature on learning and pricing appears to date back to Rothschild (1974). He
represents the problem of maximizing the firm's revenue by choosing between two prices as a two-armed bandit model and shows, among other results, that the optimizing manager
can choose the wrong price infinitely often. Kiefer and Nyarko (1989) study the general
problem of learning and control for a linear system with a general utility function. They
formulate this as an infinite horizon discounted Bayesian dynamic program and show that a
stationary optimal policy exists and that the posterior process converges with probability
one but that the limit need not be concentrated on the true parameter. Related work
appears in Easley and Kiefer (1988,1989).
Balvers and Cosimano (1990) apply the Kiefer and Nyarko framework to the specific
problem of a manager who sets prices dynamically to maximize expected total revenue.
They use a dynamic programming approach to ”gain some insight into why it is important
for the firm to learn”. They derive a specific expression for the optimal price and then
explore its implications. They conclude among other results, that when varying price, an
anticipated change in demand leads to a small price change while an unanticipated shift in
demand leads to greater changes in price and that the effect of learning persists into the
future. Their focus is qualitative; they do not quantify the potential benefits that can be gained from using learning nor show how to do it in practice.
Aviv and Pazgal (2002a and 2002b) introduce learning into a pricing model formulated
by Gallego and van Ryzin (1994) in which customers arrive singly according to a Poisson process with rate depending on the price. Aviv and Pazgal (2002a) are concerned
with deriving a closed form optimal control policy while Aviv and Pazgal (2002b) use a
partially observed MDP framework to consider a model with a finite number of scenarios.
Petruzzi and Dada (2002) also study learning, pricing and inventory control. In their
model, the inventory level censors demand and once the inventory level is sufficiently high or
demand is low so that it is not censored, the demand function parameters can be determined
with certainty and revenue can be maximized.
Lobo and Boyd (2003) consider a similar model to that in this paper. They assume
demand is linear in price and derive a Taylor series approximation for the one-period
difference between the expected reward under the policy that uses the true optimal price
and that based on the myopic policy. Although their Taylor series expansion is similar
to ours, they use it for a different purpose; that is, to formulate a semi-definite convex
program in which the objective function is the sum of discounted one-period Taylor series
approximations. They then solve the problem over short planning horizons (10 and 20
periods) and compare the average revenue from several approaches. Their paper focusses on methodological issues; unlike ours, it does not address the managerial implications of learning.
Cope (2004) and Carvalho and Puterman (2004a) study a related dynamic pricing
problem. Their setting is as follows. Each period, the manager sets a price and observes
the number of people who arrive and the number who purchase the product at that price. In
these papers the focus is on estimating the relationship between the probability of purchase
and the price set. Cope uses a non-parametric non-increasing model to relate price to the
probability of purchase while Carvalho and Puterman use logistic regression. Further, Cope
uses Bayesian methods while Carvalho and Puterman use maximum likelihood estimation. From a managerial perspective, the work of Cope and of Carvalho and Puterman focuses on low-demand items or settings in which prices can be changed frequently, while the focus of this paper is on settings in which demand is high or prices can be altered less frequently.
Carvalho and Puterman (2004b) investigate the theoretical foundations of learning and
pricing.
Raju, Narahari and Kumar (2004) study the use of reinforcement learning to set prices
in a retail market consisting of multiple competitive sellers and heterogeneous buyers who have different preferences over price and lead time. They use a continuous time queuing model
for their system and explore the effect of using several different algorithms for computing
optimal policies.
The effects of learning in other contexts have been considered by Scarf (1960), Azoury
(1985), Lovejoy (1990), and Treharne and Sox (2002) in the inventory context and Hu,
Lovejoy and Shafer (1996) in the area of drug therapy. In the papers of Scarf, Azoury,
Lovejoy and Treharne and Sox, learning is passive in the sense that the policy selected
by the decision maker does not impact the information received. In particular, in these
papers the decision maker observes the demand each period regardless of the inventory
level set by the decision maker. In contrast in this paper and the remaining papers above,
learning is active, that is, the policy set by the decision maker impacts the information
that is received. In the newsvendor models studied by Lariviere and Porteus (1999) and Ding, Puterman and Bisi (2002), the inventory level censors demand, so when setting inventory levels the decision maker is faced with the tradeoff between myopic optimality and learning. In
Hu, Lovejoy and Shafer (1996) the dosage level impacts both the health of the patient
and subsequent parameter estimates, so the decision maker must again trade off short term
optimality with learning.
The remainder of the paper is organized as follows. In Section 2 we formulate our model,
illustrate the trade-off between immediate revenue maximization and learning and discuss
demand distribution parameter updating. We present Monte Carlo simulation results to
evaluate the performance of several heuristic methods in Section 3. In Section 4 we describe
why model parameter estimates may be biased. Conclusions and recommendations appear
in the final section.
2 Model Formulation
Consider a manager who has an unlimited quantity of a product to sell at each time period
t, t = 1, . . . , T, where T is finite. The demand q_t for the product is represented by a continuous random variable and is assumed to follow a log-linear model. Such a model was recommended by Kalyanam (1996) on the basis of an analysis of marketing data. Consequently, we assume that the demand in period t is related to the price p_t set in that period through the equation

q_t = exp[α + β p_t + ε_t],    (1)

where α and β are unknown regression parameters, and ε_t is a random disturbance term that is normally distributed with mean 0 and variance σ². We will start by focusing on this model, but many generalizations are possible, including letting the regression parameters vary over time with both random and systematic components, or adding other factors to the model; for example, if we assume a quadratic time trend in the model, then (1) becomes

q_t = exp[α + β p_t + γ t + η t² + ε_t].    (2)
A further enhancement of this model would be to include an interaction between trend and
price.
We argue here that from a modeling perspective parametric models are preferable to
non-parametric models in this finite horizon setting. Since our primary focus is to optimize revenues during a limited number of periods, say T = 100 or T = 200, it would be difficult to obtain a reasonable approximation to a non-parametric demand function with so little data without strong prior assumptions. Further, this parametric formulation enables the
user to easily include additional information in the demand equation, such as in (2) above,
which would be extremely difficult in a non-parametric setting.
The manager's objective is to adaptively choose a sequence of prices {p_t : t = 1, . . . , T} to maximize the total expected revenue ∑_{t=1}^{T} p_t E[q_t] over T periods. We now focus on the model in (1). If the retailer knows the model parameters α and β, he can set the price p_t to the value that maximizes the revenue R_t(p_t) = p_t E[q_t] in each period t. The optimum price in this case is p*_t = −1/β, which does not depend on α, and the maximum expected revenues are ∑_{t=1}^{T} R*_t = ∑_{t=1}^{T} R_t(p*_t) = −(T/β) e^{α−1} M, where M = E[e^{ε_t}] = exp[σ²/2].
Of course, in any real setting, but especially for new products, the decision maker does
not know the true parameter values α and β so they must be estimated from the data that
is acquired during the planning horizon. The key concept on which this paper is based is
that the prices the manager sets influence both the data received and the ongoing revenues.
Thus the manager wants to use prices both strategically, that is, to learn about the demand function, and optimally, to maximize immediate revenue. This paper is about this tradeoff.
The information flow in this system is linked through a feedback process. Each period,
the manager uses his current estimate of the demand function parameters to set the price,
then he observes demand in the period and finally he updates his estimates of the demand
function parameters. Then he repeats this process. The paper now proceeds along two
paths. First we discuss how to update the parameter estimates given an additional demand
observation and second we discuss how to choose a price given estimates of the demand
distribution parameters.
2.1 Updating Demand Distribution Parameter Estimates
This section describes how to update the demand function parameter estimates once the
demand in a particular period has been observed. We assume at time t = 0, that is, before we set the first price p_1 and observe the demand q_1, that the manager has specified a prior distribution for the regression parameters. The initial prior may either be subjective, derived from product information and past history, or based on some preliminary sales data for the current product. We assume that the vector of regression coefficients θ = [α β]′ ∼ N(θ_0, σ²P_0), where θ_0 is the 2×1 prior mean vector and σ²P_0 is the 2×2 prior covariance matrix. To simplify our initial analysis, we assume known σ².
At period t, based on the available estimates α_{t−1} and β_{t−1}, the manager sets p_t (hopefully by methods suggested below) and observes the demand q_t. From (1), we can rewrite the model as y_t = log(q_t) = α + β p_t + ε_t. To simplify notation and allow easy generalization of results, we write y_t = θ′z_t + ε_t, where z_t = [1 p_t]′, and we write the covariance of the estimates of the regression parameters as σ²P_t, where P_t is an explicit function of the prices up to and including decision epoch t.
Using properties of the normal distribution and the fact that the normal prior is conjugate to the normal distribution, we can easily derive the posterior distribution of the parameters α and β using standard methods. It has been widely established (see, for example, Harvey, 1994 or Fahrmeir and Tutz, 1994) that the parameters of the posterior distribution are related to the parameters of the prior distribution through recursive equations which can be expressed in matrix and vector notation as

θ_t = θ_{t−1} + P_{t−1} z_t F_t^{−1} [y_t − θ′_{t−1} z_t],    (3)

P_t = P_{t−1} − P_{t−1} z_t F_t^{−1} z_t′ P_{t−1},    (4)

F_t = z_t′ P_{t−1} z_t + H_t,    (5)

where, in our model, H_t = 1 and all other quantities are defined above. At time t, the variance of the estimate β_t is given by σ²P_{t,2,2}, where P_{t,2,2} is the second element on the diagonal of P_t. Therefore, at the end of period t, after observing q_t, we have updated estimates α_t and β_t for the parameters. These equations are referred to as the Kalman
Filter. They were originally developed in an engineering context for a different purpose
and are widely used in time series analysis and control theory.
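To make the updating step concrete, the following minimal Python/NumPy sketch implements recursions (3)–(5) for the scalar-observation model used here. It is our own illustration, not the authors' code; the helper name kalman_update and the example values are assumptions.

```python
import numpy as np

def kalman_update(theta, P, p_t, y_t, H=1.0):
    """One step of recursions (3)-(5) for y_t = alpha + beta * p_t + eps_t.

    theta : current estimate [alpha, beta]
    P     : current 2x2 scaled covariance (posterior covariance = sigma^2 * P)
    p_t   : price set in period t
    y_t   : observed log-demand log(q_t)
    """
    z = np.array([1.0, p_t])                  # regressor z_t = [1, p_t]'
    F = z @ P @ z + H                         # (5): scalar innovation factor F_t
    resid = y_t - theta @ z                   # one-step-ahead forecast error
    theta_new = theta + (P @ z) * resid / F   # (3): mean update
    P_new = P - np.outer(P @ z, z @ P) / F    # (4): covariance update
    return theta_new, P_new

# Hypothetical example: one update starting from an assumed prior.
theta0, P0 = np.array([7.5, -1.2]), np.eye(2)
theta1, P1 = kalman_update(theta0, P0, p_t=1.0, y_t=6.3)
beta_variance_scaled = P1[1, 1]               # sigma^2 * P1[1, 1] is Var(beta_1)
```

Applying kalman_update once per period, after observing log(q_t), reproduces the updating scheme described above.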
2.2 Choosing a Price
We investigate the trade-off between optimization and learning by focussing on the case where we have observed the demands {q_t : t = 1, . . . , T − 2}, corresponding to the sequence of prices {p_t : t = 1, . . . , T − 2}, and now we seek to maximize the expected revenues in the last two periods, E[R_{T−1}(p_{T−1})] + E[R_T(p_T)]. This can be formulated as a two-period dynamic program; our analysis is motivated by that observation. At the beginning of period t = T − 1, the information for θ is provided by the estimate θ_{T−2}, which is normally distributed with mean θ and covariance matrix σ²P_{T−2}. After fixing the price p_{T−1} and observing q_{T−1}, we update θ_{T−2} and P_{T−2} using the Kalman filter recursions (3)–(5). At the last period t = T, learning has no subsequent benefit, so the manager should choose a price in that period to maximize immediate expected revenue in period T. Using simple calculus, the optimum price, considering the estimate β_{T−1}, will be p_T = −1/β_{T−1}.
The estimate β_{T−1} will be normally distributed with variance σ²_{β_{T−1}} = σ²P_{T−1,2,2}. Here P_{T−1,2,2} denotes the coefficient of the matrix P_{T−1} in the 2nd row and 2nd column. Because β_{T−1} is subject to estimation error, we expect the optimum price p_T = −1/β_{T−1} to deviate from its true optimum −1/β. Therefore, we may anticipate some loss in the revenue at time T given the uncertainty about β. For an estimate β_{T−1}, if we use the rule p_T(β_{T−1}) = −1/β_{T−1}, the expected revenue in the last period can be written as

R*_T(p_T(β_{T−1})) = E[ −(1/β_{T−1}) e^{α − β/β_{T−1} + ε_T} ] = −M (1/β_{T−1}) e^{α − β/β_{T−1}},    (6)
where the expectation above is taken with respect to the noise ε_T. Obviously R*_T(p_T(β_{T−1})) is maximum when β_{T−1} is the true value β. Theorem 1 below provides an approximation for the expectation E[R*_T(p_T(β_{T−1}))] based on a Taylor expansion. The proof is presented
in the Appendix and generalized considerably in Carvalho and Puterman (2004b).
Theorem 1 Suppose that the price in period T is set equal to p_T = −1/β_{T−1}. Then the expected revenue in period T can be approximated by

E[R*_T(p_T(β_{T−1}))] = R*_T(p_T(β)) + (1/2) (M e^{α−1} / β³) σ²_{β_{T−1}} + O((T − 2)^{−2}),    (7)

where the expectation above is calculated with respect to the distribution of the random variable β_{T−1}.
One of the crucial assumptions in Theorem 1 is that the sequence of prices {p_1, p_2, . . . , p_{T−2}} is non-random, or, if it is random, that the dependence between p_t and p_{t+k} vanishes as k → ∞.
In practice, when the prices are updated recursively, this hypothesis does not necessarily
hold. The implications of violating this assumption will be discussed in Section 4 and
explored in depth in Carvalho and Puterman (2004b).
Based on the result above, the expected loss at period T associated with the uncertainty in β_{T−1} can be approximated by

−(1/2) (M e^{α−1} / β³) σ²_{β_{T−1}} > 0.    (8)

At time t = T − 1, the fixed price p_{T−1} will affect not only the immediate return R(p_{T−1}) but will also affect the variance of β_{T−1}, σ²_{β_{T−1}} = σ²_{β_{T−1}}(p_{T−1}). Therefore, we can choose the price p_{T−1} that maximizes the expression

F_{T−1}(p_{T−1}) = p_{T−1} e^{α + β p_{T−1}} M + (1/2) (M e^{α−1} / β³) σ²_{β_{T−1}}(p_{T−1}),    (9)

which explicitly shows the trade-off between learning and maximization at time t = T − 1. Note that (9) involves the parameters α and β, which are exactly what we want to estimate, but in computation we replace them by their current estimates.
When the manager seeks to maximize revenue over a horizon exceeding two periods, writing down expressions equivalent to (9) is not an easy task, and the dynamic programming formulations quickly become intractable. However, as a result of (9) and general observations about the growth of value functions over time, we propose a simple heuristic pricing policy, hereafter referred to as the one-step-ahead strategy, to optimize the total expected revenue over longer horizons. In this case, the one-step-ahead policy consists of choosing the price p_t, at each time period t, that maximizes the objective function

F̂_t(p_t) = p_t e^{α_{t−1} + p_t β_{t−1}} M_{t−1} + (G(t)/2) M_{t−1} (e^{α_{t−1} − 1} / β³_{t−1}) σ²_{β_t}(p_t).    (10)
Note that we replaced the unknown regression coefficients α and β by their available estimates α_{t−1} and β_{t−1} at the beginning of period t (before observing the demand q_t). Analogously, M_{t−1} = e^{σ²_{t−1}/2}, where σ²_{t−1} is the estimate of σ² at the beginning of period t. The first term in the objective function above corresponds to immediate revenue maximization, while the second term corresponds to maximizing the information about (minimizing the variance of) the model parameters. We include a multiplicative term G(t) in the second expression to reflect the time remaining in the planning horizon. For a finite horizon T, we can make G(t) a decreasing function of t, with G(T) = 0, since there is no benefit to learning at the last stage. We can assume, for example, a piecewise linear form G(t) = (T_c − t) for t < T_c, and G(t) = 0 when t ≥ T_c, where T_c does not necessarily equal the horizon T. Alternatively, we can use an exponentially decaying form G(t) = K e^{−tρ} − K e^{−Kρ} for t < K, and G(t) = 0 when t ≥ K.
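As a concrete illustration of the one-step-ahead rule, the sketch below maximizes an objective of the form (10) by grid search over the admissible price range, using recursion (4) to compute the conditional variance σ²_{β_t}(p_t) that each candidate price would produce. This is our own minimal Python/NumPy sketch under stated assumptions (grid search rather than a continuous optimizer, a piecewise linear G(t), hypothetical starting values), not the implementation used for the simulations reported below.

```python
import numpy as np

def one_step_ahead_price(theta, P, sigma2, t, T_c, price_grid):
    """Choose p_t by grid-maximizing an objective of the form (10).

    theta  : current estimates [alpha_{t-1}, beta_{t-1}]
    P      : current scaled covariance (posterior covariance = sigma2 * P)
    sigma2 : current estimate of the noise variance
    t, T_c : current period and the cutoff of the piecewise linear G(t)
    """
    alpha, beta = theta
    M = np.exp(sigma2 / 2.0)
    G = max(T_c - t, 0.0)                        # weight on the learning term
    best_p, best_val = None, -np.inf
    for p in price_grid:
        z = np.array([1.0, p])
        F = z @ P @ z + 1.0                      # recursion (5) with H_t = 1
        P_next = P - np.outer(P @ z, z @ P) / F  # covariance if price p were used
        var_beta = sigma2 * P_next[1, 1]         # sigma^2_{beta_t}(p_t)
        immediate = p * np.exp(alpha + beta * p) * M
        learning = 0.5 * G * M * np.exp(alpha - 1.0) / beta**3 * var_beta
        val = immediate + learning               # objective (10); learning term is <= 0
        if val > best_val:
            best_p, best_val = p, val
    return best_p

# Hypothetical usage with the price range used in Section 3.
grid = np.linspace(0.167, 3.0, 200)
p_next = one_step_ahead_price(np.array([8.0, -1.5]), np.eye(2), 5.0,
                              t=1, T_c=170, price_grid=grid)
```

Because β³ is negative here, the learning term penalizes candidate prices that would leave a large posterior variance, which is exactly the trade-off expressed by (10).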
3 Monte Carlo Simulations
In this section, we present and discuss results of an extensive Monte Carlo simulation
that investigates the performance of the one-step ahead policy and other heuristic pricing
strategies. We begin with a discussion of the simulation setup.
3.1 Simulation Design
We describe the classes of price selection rules that will be compared in the simulation
study.
1. Myopic rule. The simplest strategy is the myopic rule, which at each period t sets the price p_t = −1/β_{t−1}, where β_{t−1} is the most recent estimate of the regression slope. We will see that this strategy produces prices which get "stuck" at a level far away from the optimum p* = −1/β and do not benefit from learning.
2. Myopic rules with random exploration. An alternative to the myopic rule is to choose the optimum price p_t = −1/β_{t−1} with probability 1 − η_t and choose a random price with probability η_t (Sutton and Barto, 1998). Because learning is more important in initial periods, we let η_t → K_0 as t → ∞, where K_0 equals zero or a very small value, in case we wish to continue experimenting indefinitely. We used an exponential decay function, η_t = K_0 + K_1 e^{−K_2 t}, with different values for the parameters K_0, K_1 and K_2. Some care must be taken here, because when implementing the proposed methodology in practice, prices must be chosen in an economically viable range. Further, since the proposed parametric model is only an approximation of the real data generating process, the approximation may be reasonable only for a limited range of prices. In light of this, when exploring, we choose random prices from a uniform distribution on a pre-specified interval [p_l, p_u].
3. Softmax exploration rules. An alternative to the myopic rule with random exploration is to use the softmax exploration rule described in Sutton and Barto (1998). The basic idea is to draw, at each time period t, the price p_t from the distribution with density

f(p_t) ∝ exp{ [p_t e^{α_{t−1} + β_{t−1} p_t} e^{σ²_{t−1}/2}] / τ_t },    (11)

with τ_t → 0 as t → ∞. The density in (11) has a single mode at −1/β_{t−1} and, as τ_t → 0, it becomes more concentrated around the mode, so that, in the limit, we only select the price p_t = −1/β_{t−1}, and the softmax rule becomes equivalent to the myopic policy. As before, we used τ_t = K_0 + K_1 e^{−K_2 t}.
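One simple way to draw a price from the density (11) is to discretize the admissible price range [p_l, p_u] and sample from the normalized weights. The Python sketch below does this; the grid resolution and the temperature schedule constants are our own illustrative assumptions.

```python
import numpy as np

def softmax_price(theta, sigma2, tau, p_l=0.167, p_u=3.0, n_grid=500, rng=None):
    """Draw p_t from the density (11) on a discretized price grid."""
    rng = rng or np.random.default_rng()
    alpha, beta = theta
    grid = np.linspace(p_l, p_u, n_grid)
    # Estimated expected revenue at each candidate price.
    revenue = grid * np.exp(alpha + beta * grid) * np.exp(sigma2 / 2.0)
    logits = revenue / tau
    weights = np.exp(logits - logits.max())      # stabilize before normalizing
    weights /= weights.sum()
    return rng.choice(grid, p=weights)

# Hypothetical temperature schedule tau_t = K0 + K1 * exp(-K2 * t).
def tau_schedule(t, K0=0.5, K1=50.0, K2=0.05):
    return K0 + K1 * np.exp(-K2 * t)
```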
4. Optimal design rules. Another approach to price selection is to choose a "statistically" optimal design in terms of model estimation during the first C periods, and then apply the myopic rule for the rest of the process. We may think of this as acting as a "statistician" from t = 1 to t = C, and as an "optimizer" from t = C + 1 to t = T. Therefore, for t = 1, . . . , C, we select p_t = p_u if t is odd and p_t = p_l if t is even. From t = C + 1 to t = T, we use p_t = −1/β_{t−1}.
5. One-step look-ahead rules. Less arbitrary strategies are based on the one-step ahead rule, which explicitly accounts for the trade-off between learning and revenue maximization through (10). As noted in Section 2, we use functions G(t) with piecewise linear, G(t) = max{T_c − t, 0}, and exponentially decaying, G(t) = max{K e^{−tρ} − K e^{−Kρ}, 0}, functional forms. Alternatively, to overcome the lack of consistency of the ordinary least squares estimator, discussed in Subsection 4.1, we also simulated modified versions of the one-step ahead rules, performing random exploration with constant probability η_t = 0.01. Similarly to the myopic policy with random exploration, at the exploration stage the prices p_t were drawn from a uniform distribution on [p_l, p_u]. We refer to these strategies as unconstrained one-step ahead rules (to differentiate them from the policies described below) or simply one-step ahead rules.
6. Price constrained one-step ahead rules. Figure 4 below shows that the prices vary considerably at the beginning of the process, in order to allow for faster learning. In practice, a manager may wish to avoid such abrupt price changes. We explored this possibility by imposing a limit on period-to-period price changes. Given the price p_t, the price p_{t+1} at decision epoch t + 1 is restricted to be in the interval [p_t − 0.25p_t, p_t + 0.25p_t].
In all the above policies, we restricted the prices to be within the range [p_l, p_u]. Therefore, if, at a certain period t, the calculated optimum price is p_t > p_u, we used p_t = p_u. Analogously, if the calculated optimum price is p_t < p_l, we used p_t = p_l. In the different policies described above, we tried different values for K_0, K_1, K_2, C, T_c, K and ρ, and the results reported here correspond to the configurations providing the best performances.
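Putting these ingredients together, the sketch below runs a single closed-loop replicate of the myopic rule with random exploration, clipping all prices to [p_l, p_u] and updating the parameter estimates with the Kalman recursions after each observed demand. It is an illustrative Python sketch under our own assumptions (it reuses the hypothetical kalman_update helper sketched in Section 2.1), not the code used to produce the results below.

```python
import numpy as np

def simulate_replicate(T=200, alpha=8.0, beta=-1.5, sigma2=5.0,
                       p_l=0.167, p_u=3.0, K0=0.0, K1=0.3, K2=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array([7.0, -1.0])          # hypothetical starting estimates
    P = np.eye(2)                          # hypothetical scaled prior covariance
    revenue_path = []
    for t in range(1, T + 1):
        eta = K0 + K1 * np.exp(-K2 * t)    # exploration probability eta_t
        if rng.random() < eta:
            p = rng.uniform(p_l, p_u)      # explore: random admissible price
        else:
            p = np.clip(-1.0 / theta[1], p_l, p_u)   # exploit: myopic price
        eps = rng.normal(0.0, np.sqrt(sigma2))
        q = np.exp(alpha + beta * p + eps)           # demand from model (1)
        revenue_path.append(p * q)
        theta, P = kalman_update(theta, P, p, np.log(q))  # recursions (3)-(5)
    return np.cumsum(revenue_path)         # cumulative realized revenue CR(t)
```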
For all the strategies described above, the Kalman filter updating equations (3)–(5) require initial values θ_0 = [α_0 β_0]′ for the regression parameters θ = [α β]′ and for the matrix P_0. Besides, all strategies except the optimal design rules require an initial value β_0 to set the price p_1. To avoid any bias related to wrong prior information, we assumed that we had information from two previous data points [log q_{−2} p_{−2}]′ and [log q_{−1} p_{−1}]′, with log q_{−i} = α + β p_{−i} + ε_{−i}, ε_{−i} ∼ N(0, σ²), i = 1, 2.¹ Avoiding bias in the initial values is particularly important when studying the bias in the ordinary least squares estimator. If incorrect prior information were used, one might argue that the bias observed in the estimates β̂_t is due to this misleading initial set-up.
Based on the discussion above, we set the initial matrix P_0 = (Z′_{−1} Z_{−1})^{−1}, where Z_{−1} is the two-by-two matrix Z_{−1} = [[1 p_{−2}]′, [1 p_{−1}]′]′, with p_{−2} and p_{−1} the same for all simulation replicates. The vector θ_0 is equal to (Z′_{−1} Z_{−1})^{−1} Z′_{−1} Y_{−1}, with Y_{−1} = [log q_{−2} log q_{−1}]′. Note that, although θ_0 is a random vector, it has expectation equal to the true vector θ and covariance matrix equal to σ²P_0, so that we are not biasing the conclusions through wrong priors. In the simulation results presented in this paper, we fixed p_{−2} = p_u and p_{−1} = p_l. We also tried other values for p_{−2} and p_{−1}, but the conclusions remained the same. It is important to mention that all the rules considered in the simulation, including the "optimal design rule", benefited from the fact that we used correct information about θ_0 and P_0.
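For concreteness, the fragment below constructs θ_0 and P_0 from the two pre-sample points at prices p_u and p_l exactly as described; it is our own Python illustration with hypothetical parameter values.

```python
import numpy as np

def initialize_from_pseudo_obs(alpha, beta, sigma2, p_l=0.167, p_u=3.0, rng=None):
    """Build theta_0 and P_0 from the two pre-sample points at prices p_u and p_l."""
    rng = rng or np.random.default_rng()
    prices = np.array([p_u, p_l])                        # p_{-2} and p_{-1}
    Z = np.column_stack([np.ones(2), prices])            # 2x2 design matrix Z_{-1}
    y = alpha + beta * prices + rng.normal(0.0, np.sqrt(sigma2), size=2)  # log q_{-i}
    P0 = np.linalg.inv(Z.T @ Z)                          # scaled prior covariance
    theta0 = P0 @ Z.T @ y                                # OLS fit to the two points
    return theta0, P0

theta0, P0 = initialize_from_pseudo_obs(8.0, -1.5, 5.0)
```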
Specifically for the one-step ahead strategies, we need, at each decision period t, t = 1, . . . , T, an estimate for the variance σ². Because the prior information for σ² will affect only the one-step ahead strategies, we decided not to worry about wrong initial values for σ²_0. Having only the two extra data points [log q_{−2} p_{−2}]′ and [log q_{−1} p_{−1}]′ does not provide enough degrees of freedom to estimate σ². Therefore, at time t = 1, the one-step ahead rule was based on σ²_0 = σ²/2 (a wrong prior). After fixing the price p_1 and observing log q_1, we have three data points in total, which makes it possible to obtain the first estimate σ²_1, used at decision epoch t = 2. In fact, for t = 1, . . . , T, we can use the ordinary least squares estimator σ²_t = (1/t)[ε̂²_{−2} + ε̂²_{−1} + ∑_{k=1}^{t} ε̂²_k], where ε̂_k = y_k − α_t − β_t p_k, k = −2, −1, 1, . . . , t. To evaluate the effect of the choice of prior σ²_0, we also performed simulations with σ²_0 = 2σ², but the general conclusions did not change.

¹ We used the indices p_{−2} and p_{−1}, instead of p_{−1} and p_0, to make explicit that the information is available before the first decision period t = 1.
To compare these different strategies, we performed L = 10,000 simulations for each policy and computed the cumulative revenues in each simulation,

CR(t) = ∑_{k=1}^{t} R_k(p_k), t = 1, . . . , T.    (12)

The expected cumulative revenues can then be estimated by the sample means

Ê[CR(t)] = (1/L) ∑_{l=1}^{L} ∑_{k=1}^{t} R_k(p_k).    (13)

By plotting the path of Ê[CR(t)] against t, t = 1, . . . , T, we gain insight into how these different computational strategies perform. In general, we focus on maximizing revenues over a short planning horizon (T = 100 or T = 200). On the other hand, it is interesting to look at the path of other measures as t tends to infinity. The sample paths for the estimate β_t, for example, provide insight into the long run convergence of the model parameters under each of these computational methods.
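Estimating (13) is then a matter of averaging the stored cumulative-revenue paths across replicates; the short fragment below shows the computation, reusing the hypothetical simulate_replicate sketch above and a smaller number of replicates for illustration.

```python
import numpy as np

L, T = 1000, 200   # fewer replicates than the paper's 10,000, for illustration
cum_revenues = np.stack([simulate_replicate(T=T, seed=l) for l in range(L)])  # L x T
mean_cum_revenue = cum_revenues.mean(axis=0)   # estimate of E[CR(t)] for t = 1..T
```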
3.2 Simulation Results
The simulations show that the unconstrained one-step ahead rules provide greater mean cumulative revenues Ê[CR(t)] than the other strategies. Figure 1 provides a comparison of a selected one-step ahead rule, in which G(t) is piecewise linear, with the other rules. A comparison between several one-step ahead pricing rules is shown in Figure 2.
For these simulations the parameter values are set to α = 8.0, β = −1.5 and σ² = 5.0. The optimum price in this case is p* = 0.667, which implies that the maximum mean cumulative revenues are equal to 8,906.5 when p_t = p* for all t = 1, . . . , T. The minimum allowed price was p_l = 0.167 and the maximum allowed price was p_u = 3.00.
After 100 periods, the one-step ahead rule obtains a relative gain of at least 3.7% over all the non-one-step-ahead rules. This relative gain is equal to 3.0% after 200 periods and to 2.4% after 400 periods. Note that the myopic rules performed poorly by getting "stuck" at a price level away from the optimum. The policies with
optimal statistical design at the beginning of the pricing process perform better than the
policies with random exploration (myopic rule with random exploration and the softmax
rule) during the initial periods. However, as the random exploration policies keep learning
about the model parameters, they eventually outperform the statistical design rules.
Figure 1: Comparison of expected mean cumulative revenues for several pricing policies (myopic, unconstrained one-step ahead, constrained one-step ahead, softmax, optimal statistical design, and myopic with random exploration); optimal expected revenue per period under known parameter values is 8,906.
Figure 1 also displays the mean cumulative revenues for a one-step ahead policy with price change constraints. At each time period t ∈ {2, . . . , T}, the prices were chosen by maximizing the objective function in (10) with the restriction p_{t+1} ∈ [p_t − 0.25p_t, p_t + 0.25p_t]. At the initial period t = 1, the price was only restricted to be within the interval [p_l, p_u]. We considered piecewise linear G(t) with T_c = 170 (fast learning) and T_c = 70 (slow learning).
To simplify the presentation, the results for the slow-learning case are shown in Figure 2.
Although none of the other rules had price change restrictions, the constrained one-step
ahead policies still presented a superior performance when compared to the rules other
than the one-step ahead ones. As expected, the one-step ahead policy with fast learning performs better than the slow-learning one after some initial periods.
To validate the analysis, we performed other simulations with different choices of model
parameters, and minimum and maximum allowed prices, and the conclusions remained the
same.
Figure 2 presents the mean cumulative revenues for the six one-step ahead strategies studied here. Note that there does not seem to be any significant difference between the four unconstrained policies. For t < 400, the rule with exponentially decaying G(t) and without random exploration seems to slightly outperform the others. For t > 400, the policy with piecewise linear G(t) and random exploration with η_t = 0.01 presents a somewhat better performance than the others. By imposing the price change constraint, the relative loss in the one-step ahead policies is not higher than 1.2% after 100 periods, 1.4% after 200 periods, and 1.0% after 400 periods.
Figure 2: Expected mean cumulative revenues for several one-step ahead policies (constrained with slow and fast learning; unconstrained with piecewise linear G(t), with exponential G(t), with random exploration (η = 0.01), and with exponential G(t) and random exploration); optimal expected revenue per period under known parameter values is 8,906.
In Figure 3, we plot the mean paths of the estimated slope β for the different strategies. We observe that all strategies produce biased estimates of β for all t = 1, . . . , 1000. This effect is real and is supported by theory; the reason for it will be discussed in Subsection 4.1. However, for the myopic rule with random exploration and the one-step ahead rules with random exploration, the bias tends to go to zero as t grows, which was already expected based on the discussion about adaptive control with random perturbation presented in Subsection 4.1. An additional strategy, which sets random prices at all periods t = 1, . . . , 1000, was also simulated and produced unbiased estimates of β. However, its revenue performance was very poor, since it never uses the produced estimates for optimization purposes. For the one-step ahead policies, the bias is quite significant. However, because of asymmetry in the revenue function, the loss incurred by a negative bias is not as harmful as that incurred by a positive bias. This phenomenon has been previously observed in the inventory literature, for example by Silver and Rahnama (1987).
Figure 3: Mean estimates for the slope coefficient β in the log-linear model (optimal statistical design at initial periods, myopic rule, myopic rule with random exploration, softmax policy, and unconstrained one-step ahead policies with and without random exploration).
Finally, Figure 4 shows the mean paths of the selected prices for some of the different strategies. For almost all the policies, the prices get stuck at a fixed level after 80 periods. For the myopic rule with random exploration, although the prices are set initially at a level above the optimal price p* = 0.667, they tend to approach p* as t grows and there is more exploration about the true slope value. For the one-step ahead rule specifically, the prices tend to go up and down, with the variations around p* decreasing as more information is added. This illustrates the idea behind the one-step ahead policies: as more information is obtained, the marginal value of extra information decreases and the algorithm puts more weight on the maximization of immediate revenues. Note that, although the estimates β_t are biased, as shown in Figure 3, the mean prices in the one-step ahead policies converge to levels very close to the optimal price p* = 2/3. This can be explained by the nonlinearity of the function p_t = −1/β_{t−1}, so that E[p_t] ≠ −1/E[β_{t−1}]. For the two constrained one-step ahead policies, note the smoother evolution of the chosen prices when compared to the price paths for the unconstrained one-step ahead rules. The constrained one-step ahead policy with fast learning presents higher price variation during the initial periods than the constrained one-step ahead policy with slow learning. This was to be expected, given the higher weight on the learning component (the second term on the right-hand side of equation (10)) in the fast-learning case.
4 Random Prices and Estimation Bias
This section focuses on two important technical issues that underlie the observed bias of
the regression parameters in the previous section and the derivation of the Taylor series
expansion which is the basis for the one step ahead rules in Section 3. In it, we give a high
level discussion of these issues; a much deeper analysis appears in Carvalho and Puterman
(2004b).
4.1 Bias in Parameter Estimates θ
We now discuss why the estimates of the regression parameters may be biased when prices
are chosen adaptively. This phenomenon was observed in Figure 3 which showed that
estimates of the regression parameter β did not converge to their true value.

Figure 4: Mean paths of chosen prices for several policies (myopic rule, unconstrained one-step ahead, and constrained one-step ahead with slow and fast learning); optimal price p* = 1/1.5.
To understand the reason for the bias of β̂_t, the estimate of β based on t observations, consider the usual ordinary least squares estimator θ̂_t = (Z_t′ Z_t)^{−1} Z_t′ Y_t, where Z_t is the t × 2 design matrix [[1 p_1]′, [1 p_2]′, . . . , [1 p_t]′]′ and Y_t is the t × 1 response vector [log q_1 log q_2 . . . log q_t]′. Although the parameter vector θ = [α β]′ is recursively estimated with the Kalman filter, given the choice of prior N(θ_0, σ²P_0) employed in our simulations, the resulting estimate θ̂_t is numerically equivalent to the ordinary least squares (OLS) estimator (Z_t′ Z_t)^{−1} Z_t′ Y_t. According to classical regression theory (see Draper and Smith, 1998), when Z_t is fixed, the expected value of θ̂_t is equal to

E{θ̂_t} = E{(Z_t′ Z_t)^{−1} Z_t′ Y_t} = (Z_t′ Z_t)^{−1} Z_t′ E{Y_t} = (Z_t′ Z_t)^{−1} Z_t′ E{Z_t θ + υ_t},    (14)

where υ_t = [ε_1 ε_2 . . . ε_t]′. Because E{υ_t} = 0, we conclude that E{θ̂_t} = θ, and hence θ̂_t is unbiased for fixed Z_t.
As we discussed above, the sequence of prices p_1, . . . , p_t is usually random, and hence the design matrix Z_t is not fixed. Therefore, the classical theory for OLS estimation is not valid in this specific case, and we cannot ensure the unbiasedness of θ̂_t. Moreover, following the same derivation as in (14), we have

E{θ̂_t} = θ + E{(Z_t′ Z_t)^{−1} Z_t′ υ_t},    (15)

and if Z_t and υ_t were independent, it would be easy to show that the second term in (15) is zero and the OLS estimator would still be unbiased. However, because the price p_k set at period k depends on the estimate θ̂_{k−1}, and the estimate θ̂_{k−1} depends on the history of disturbances ε_1, . . . , ε_{k−1}, we conclude that p_k depends on ε_1, . . . , ε_{k−1}. Therefore, the random variables Z_t and υ_t are not independent and we cannot guarantee that E{(Z_t′ Z_t)^{−1} Z_t′ υ_t} = 0, so we expect θ̂_t to be biased for finite t.
Although θ̂_t is biased for finite t, one may be interested in the behavior of θ̂_t as t → ∞. As is well known in the econometrics literature, when Z_t is random, under some regularity conditions the estimate θ̂_t is strongly consistent for (i.e., converges almost surely to) the true parameter θ, in the sense that θ̂_t → θ almost surely as t → ∞, as discussed in White (2001). These conditions, interpreted in the context of the pricing model, are that the random sequence of prices {p_t : t = 1, . . . , ∞} satisfies a strong Law of Large Numbers and that there exists a ∆ > 0 such that the sequence of minimum eigenvalues λ_{min,t} of Z_t′ Z_t satisfies λ_{min,t} > ∆ for t = 1, . . . , ∞ with probability 1. However, for the one-step ahead rules the sequence of prices {p_t : t = 1, . . . , ∞} does not satisfy either of these two conditions. In particular, because Z_t contains a column of ones and, for each simulation replicate, the prices approach a constant value as indicated in Figure 4, the smallest eigenvalue of Z_t′ Z_t converges to zero as t goes to infinity. Moreover, looking at different sample paths for different simulations (not shown here), the price level to which the sequence of prices converges varies across the simulation replicates. Therefore, the Law of Large Numbers does not apply to the price sequence. We conclude that the usual conditions for consistency of the ordinary least squares estimator are not satisfied, and we cannot guarantee that θ̂_t → θ almost surely as t goes to infinity.
Some of these issues have been addressed in the adaptive control literature by Campi and Kumar (1998), Chen and Guo (1988), Kumar (1990) and Sternby (1977), who show that the parameter estimates θ̂_t converge almost surely to a limit θ_∞, where θ_∞ depends on the random path of the state variable, which in this example is the demand sequence {q_t}_{t=1}^{∞}. Further, θ_∞ ≠ θ. To guarantee the consistency of the estimates θ̂_t for the true parameter vector θ in adaptive control problems, Campi and Kumar (1998) and Chen and Guo (1988) suggest the addition of persistent yet infrequent random perturbations to the control law. These perturbations should be small in magnitude and sufficiently infrequent that they do not incur a high extra cost. Specifically for our dynamic pricing problem, the addition of random perturbations can be accomplished by choosing the price p_t according to the objective function (10) with probability 1 − η_t, and drawing p_t from a uniform distribution with probability η_t, with η_t very low. Although the addition of random experimentation guarantees the consistency of θ̂_t, it does not improve the performance of the one-step ahead rules over short horizons.

The use of biased parameter estimates also has some precedent in the inventory literature, for example in Silver and Rahnama (1987).
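The bias phenomenon and the effect of the design on it can be checked numerically with a short Monte Carlo experiment: re-estimate β by OLS along price paths generated with myopic feedback and along a fixed alternating design. The Python sketch below is our own illustration under assumed parameter values; with feedback prices the average of the OLS slope estimates typically drifts away from the true β, while it stays close to β under the fixed design.

```python
import numpy as np

def ols_slope(prices, y):
    Z = np.column_stack([np.ones(len(prices)), prices])
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

def replicate(adaptive, T=50, alpha=8.0, beta=-1.5, sigma=np.sqrt(5.0), rng=None):
    rng = rng or np.random.default_rng()
    prices = [3.0, 0.167]                         # two pre-sample prices
    y = list(alpha + beta * np.array(prices) + rng.normal(0, sigma, 2))
    for t in range(T):
        if adaptive:
            b = ols_slope(np.array(prices), np.array(y))
            p = np.clip(-1.0 / b, 0.167, 3.0) if b < 0 else 3.0  # myopic feedback
        else:
            p = 3.0 if t % 2 == 0 else 0.167      # fixed alternating design
        prices.append(p)
        y.append(alpha + beta * p + rng.normal(0, sigma))
    return ols_slope(np.array(prices), np.array(y))

rng = np.random.default_rng(1)
adaptive_mean = np.mean([replicate(True, rng=rng) for _ in range(2000)])
fixed_mean = np.mean([replicate(False, rng=rng) for _ in range(2000)])
# Under these assumptions, adaptive_mean typically differs noticeably from -1.5,
# while fixed_mean stays close to it.
```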
4.2 Validity of the one-step-ahead rule
A crucial assumption in the derivations for Theorem 1 is that the sequence of prices {p_1, p_2, . . . , p_{T−2}} is fixed, or, if it is random, that the dependence between p_t and p_{t+k} vanishes as k → ∞. However, if we employ the optimization rule recursively by maximizing the objective function F̂_t(p_t) in (10) at each period t, the optimum price p_t will be a function of the estimates α_{t−1}, β_{t−1} and σ²_{t−1}, which are random variables calculated using the sequence of prices {p_1, p_2, . . . , p_{t−1}}. Therefore, the price p_t will also be a random variable and will depend on the sequence {p_1, p_2, . . . , p_{t−1}}. If the initial estimate [α̂_0 β̂_0 σ̂²_0]′ is random, and we use the objective function in (10) to recursively update the prices, the whole sequence {p_t : t = 1, 2, . . . , T} will be random. The randomness of the price sequence compromises the derivation of Theorem 1. In fact, as discussed in Subsection 4.1 and illustrated in the simulations in Section 3, the bias of β̂_t, E{β̂_t − β}, does not converge to zero as t increases. Fortunately, although the assumption of non-randomness of {p_1, p_2, . . . , p_T} does not hold, the Taylor series approximation on which the one-step ahead policies are based may still be valid. In this subsection, we give an informal discussion of why the one-step ahead rules work well even though the assumptions on which they are based do not hold.
To understand the problems caused by the inconsistency of θ̂_t, consider the following approximation, based on the Taylor expansion in (18) presented in the proof of Theorem 1 in the Appendix:

E{R*_T(p_T(β_{T−1}))} ≈ R*_T(p_T(β)) + ∂_{β_{T−1}} R*_T(p_T(β)) E{β_{T−1} − β} + (1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) E{[β_{T−1} − β]²}.    (16)

Because of the inconsistency of β_{T−1}, the term E{[β_{T−1} − β]²} in (16) is no longer equal to the variance of β_{T−1}. In this case, we have E{[β_{T−1} − β]²} = MSE_{β_{T−1}} ≠ Var(β_{T−1}), even for large T. Therefore, the approximation in (16) can be rewritten as

E{R*_T(p_T(β_{T−1}))} ≈ R*_T(p_T(β)) + ∂_{β_{T−1}} R*_T(p_T(β)) E{β_{T−1} − β} + (1/2) ∂²_{β_{T−1}} R*_T(p_T(β)) MSE_{β_{T−1}},

and the objective function to be maximized in the recursive pricing procedure should be

F̂_t(p_t) = p_t e^{α_{t−1} + p_t β_{t−1}} M_{t−1} + (G(t)/2) M_{t−1} (e^{α_{t−1} − 1} / β³_{t−1}) MSE_{β_t}(p_t),    (17)

where α_{t−1}, β_{t−1} and M_{t−1} = exp[σ̂²_{t−1}/2] are the estimates of α, β and M = exp[σ²/2] based on the information available at the end of period t − 1. We write MSE_{β_t} = MSE_{β_t}(p_t) to emphasize that the mean square error at the end of period t depends on the price p_t.
To implement the optimization/learning policy based on maximizing F̂_t(p_t) in (17), we need an expression for MSE_{β_t}(p_t), which may be very hard to obtain in explicit form. In Section 3, we implemented the one-step ahead rule in the simulations by maximizing, at each period t, t = 1, . . . , T, the objective function in (10), where we replace the unconditional MSE_{β_t}(p_t) by the conditional σ²_{β_t}(p_t).
In order to evaluate the approximation of the unconditional mean square error MSE_{β_t}(p_t) by the conditional variance σ²_{β_t}(p_t), computed assuming fixed prices, we can use the generated paths for β_t, t = 1, . . . , T, in the Monte Carlo experiment. Figure 5 compares the average of the conditional variance estimates σ²_{β_t}(p_t) with the Monte Carlo estimate of the unconditional mean square error MSE_{β_t}(p_t) obtained from the simulations. The upper graph in Figure 5 shows the evolution of both quantities over time. Note that both decay at the same rate, although the conditional variance is slightly higher than the estimated mean square error for all time periods. The lower graph in Figure 5 shows the scatter plot of the estimated mean square error versus the conditional variance. According to the graph, there is an approximate linear relationship between these two measures; the corresponding regression line has slope 1.0274, intercept −0.0527 and R² = 0.9975. Therefore, the approximation MSE_{β_t}(p_t) ≈ K^{−1} σ²_{β_t}(p_t), with K very close to one, is justified empirically. These empirical results suggest that the objective function in (17) can be reasonably approximated by the objective function in (10) used in the simulations.
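The comparison underlying Figure 5 can be reproduced in outline by storing, for each replicate and period, both the realized estimate β_t and the conditional variance σ²P_{t,2,2} reported by the Kalman filter, and then averaging across replicates. The fragment below is our own sketch; it assumes those paths have been collected into arrays by an extended version of the simulation helpers above.

```python
import numpy as np

def compare_mse_and_conditional_variance(beta_paths, var_paths, beta_true):
    """beta_paths : L x T array of estimates beta_t from each replicate
       var_paths  : L x T array of conditional variances sigma^2 * P_t[1, 1]
       Returns the Monte Carlo MSE, the mean conditional variance, and the
       slope/intercept of the regression of one on the other."""
    mse_hat = ((beta_paths - beta_true) ** 2).mean(axis=0)   # estimate of MSE_{beta_t}
    mean_cond_var = var_paths.mean(axis=0)                   # average sigma^2_{beta_t}(p_t)
    slope, intercept = np.polyfit(mean_cond_var, mse_hat, deg=1)
    return mse_hat, mean_cond_var, slope, intercept
```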
Figure 5: Comparing the mean of the variance estimates based on fixed prices to the true mean square error (upper panel: mean square error and conditional variance over time; lower panel: scatter plot of the mean square error versus the conditional variance, with regression line y = 1.0274x − 0.0527 and R² = 0.9975).
5 Conclusions
In this paper, we have prescribed and analyzed methods for setting prices in the presence
of demand function parameter uncertainty focussing especially on short planning horizons.
Our contributions are both practical and technical with each of these aspects being impor-
tant.
The key practical issue addressed by this paper arises from the fact that the demand function is never known in practice. Thus the prudent decision maker must make an explicit tradeoff between variance reduction and revenue optimization. Theorem 1 in this
paper makes that tradeoff rigorous by using statistical asymptotic theory to approximate
the MDP value function.
Extensive simulations produce important managerial implications and deep insights into
the mathematical foundations of active learning.
The key managerial insights are:
• Myopic policies perform poorly for all planning horizons.
• Myopic policies plus occasional random price changes outperform the myopic policy over the long term but not over the short term.
• Active learning policies (such as the one-step ahead rules) are better than all other approaches over all planning horizons.
• It is possible to constrain price changes each period and still drastically improve over myopic policies, but of course unconstrained policies produce greater revenue.
The bottom line is that managers should use active learning and, if that is not possible, at least be willing to experiment with some price changes to learn about the demand function.
The methods in this paper suggest how to do this experimentation and would have been
of use to Intrawest’s management when it sought to increase revenue by varying prices.
From a technical perspective we have observed that the adaptive rules lead to biased
parameter estimates. Even though the demand function is never estimated accurately,
active learning still produces good revenue streams. This also suggests that one should
consider biased parameter estimates when combining estimation with optimization. Bias
is present in all active learning (adaptive control). This bias is due to randomness in both
prices and noise. Under repeated simulations, we would still get biased parameter estimates
unless prices were fixed and the only source of variability was the random disturbance in
the demand function. We have shown why this is the case and also why one step ahead
methods still produce excellent results.
The authors are investigating several extensions of this model.
• Empirically testing the methods of this paper in real or simulated markets.
• Including other explanatory variables in the demand function that might be fixed
(seasonal dummy variables, time trend, day of the week) or random (competitor
prices, market indicators) covariates. In particular, by regarding the constant in (1)
as market size, we can view a time trend as a changing market size and investigate
its implications on price choice throughout the planning horizon.
• For low demand items, Poisson, binomial or other generalized linear models may be
more appropriate demand distribution models. A first step in this direction is pursued
in Carvalho and Puterman (2004a).
• Allowing model parameters to change over time following a state space model or a
step-change model.
• Exploring enhanced price setting mechanisms that may yield higher revenues or have reduced bias.
• Allowing for heterogeneity in markets by using mixture models (see, for example,
Hastie, Tibshirani and Friedman, 2001), where we increase the number of components
as we observe more data. In this case, we expect that the number of components or
basis functions J will be an increasing function of the number of available observations
t.
Acknowledgment
This research was partially supported by grants from NSERC and the MITACS NCE
(Canada). We wish to thank the Area Editor, Bill Lovejoy and an associate editor for
helping us align the paper with the editorial objectives of Management Science. Dan
Adelman of The University of Chicago also provided some helpful comments on an earlier
draft of this manuscript.
Appendix
Proof of Theorem 1. By using a Taylor’s series expansion for R
∗
T
(p
T
(·)) around the true
parameter β, we have
R
∗
T
(p
T
(β
T −1
)) = R
∗
T
(p
T
(β)) + ∂
β
T −1
R
∗
T
(p
T
(β))[β
T −1
− β]
+
1
2
∂
2
β
T −1
R
∗
T
(p
T
(β))[β
T −1
− β]
2
+
1
6
∂
3
β
T −1
R
∗
T
(p
T
(β))[β
T −1
− β]
3
+
1
24
∂
4
β
T −1
R
∗
T
(p
T
(
¯
β))[β
T −1
− β]
4
,
where
¯
β is located between β
T −1
and β, and ∂
r
β
T −1
denotes the r-th derivative with respect
to
β
T −1
. Taking expectations with respect to the random variable
β
T −1
, we obtain
$$
\begin{aligned}
E\{R^*_T(p_T(\beta_{T-1}))\} = {} & R^*_T(p_T(\beta)) + \partial_{\beta_{T-1}} R^*_T(p_T(\beta))\,E\{\beta_{T-1} - \beta\} \\
& + \tfrac{1}{2}\,\partial^2_{\beta_{T-1}} R^*_T(p_T(\beta))\,E\{[\beta_{T-1} - \beta]^2\}
+ \tfrac{1}{6}\,\partial^3_{\beta_{T-1}} R^*_T(p_T(\beta))\,E\{[\beta_{T-1} - \beta]^3\} \\
& + \tfrac{1}{24}\,E\{\partial^4_{\beta_{T-1}} R^*_T(p_T(\bar{\beta}))\,[\beta_{T-1} - \beta]^4\}. \qquad (18)
\end{aligned}
$$
The second term on the right-hand side of (18) is equal to zero, provided that $\beta_{T-1}$ is
unbiased for $\beta$. The third term is equal to
$$
\tfrac{1}{2}\,\partial^2_{\beta_{T-1}} R^*_T(p_T(\beta))\,E\{[\beta_{T-1} - \beta]^2\}
= \tfrac{1}{2}\,\partial^2_{\beta_{T-1}} R^*_T(p_T(\beta))\,\mathrm{Var}[\beta_{T-1}].
$$
The fourth term in (18) is equal to zero because $\beta_{T-1}$ is normally distributed, so the
third central moment is zero. Finally, for the fifth term, employing Jensen's and Cauchy-Schwarz inequalities,
$$
\begin{aligned}
\big|E\{\partial^4_{\beta_{T-1}} R^*_T(p_T(\bar{\beta}))\,[\beta_{T-1} - \beta]^4\}\big|
& \le E\{|\partial^4_{\beta_{T-1}} R^*_T(p_T(\bar{\beta}))|\,[\beta_{T-1} - \beta]^4\} \\
& \le E\{|\partial^4_{\beta_{T-1}} R^*_T(p_T(\bar{\beta}))|^2\}^{1/2}\,E\{[\beta_{T-1} - \beta]^8\}^{1/2}.
\end{aligned}
$$
We can show that $E\{|\partial^4_{\beta_{T-1}} R^*_T(p_T(\bar{\beta}))|^2\} = O(1)$, so it does not diverge as the sample size
$(T-1)$, used in the estimation of $\beta_{T-1}$, goes to infinity. Besides, $\mathrm{Var}[\beta_{T-1}]^{-1/2}[\beta_{T-1} - \beta] \sim N(0,1)$,
so that $E\{\mathrm{Var}[\beta_{T-1}]^{-4}[\beta_{T-1} - \beta]^8\} = \mu_8$, where $\mu_8$ is the 8-th central moment of
a standard normal random variable. We then have $E\{[\beta_{T-1} - \beta]^8\} = \mathrm{Var}[\beta_{T-1}]^4 \mu_8$, and
therefore
$$
\big|E\{\partial^4_{\beta_{T-1}} R^*_T(p_T(\bar{\beta}))\,[\beta_{T-1} - \beta]^4\}\big|
\le E\{|\partial^4_{\beta_{T-1}} R^*_T(p_T(\bar{\beta}))|^2\}^{1/2}\,\mathrm{Var}[\beta_{T-1}]^2\,\mu_8^{1/2}.
$$
We know that $\mathrm{Var}[\beta_{T-1}] = \sigma^2 P_{T-1,2,2}$, where $P_{T-1}$ has the form $(Z'Z)^{-1}$ and $Z$ is the
corresponding design matrix for the regression model in (1). If the magnitudes of the rows
of the design matrix $Z$ do not change as $(T-1)$ goes to infinity, we have $\mathrm{Var}[\beta_{T-1}] =
O((T-2)^{-1})$, in the sense that it goes to zero at order $(T-2)^{-1}$ as the sample size
goes to infinity. Hence,
$$
\big|E\{\partial^4_{\beta_{T-1}} R^*_T(p_T(\bar{\beta}))\,[\beta_{T-1} - \beta]^4\}\big| = O((T-2)^{-2}),
$$
and
$$
E\{R^*_T(p_T(\beta_{T-1}))\} = R^*_T(p_T(\beta))
+ \tfrac{1}{2}\,\partial^2_{\beta_{T-1}} R^*_T(p_T(\beta))\,E\{[\beta_{T-1} - \beta]^2\} + O((T-2)^{-2}).
$$
By differentiating (6) twice with respect to $\beta_{T-1}$, we have
$$
\partial^2_{\beta_{T-1}} R^*_T(p_T(\beta_{T-1}))
= -2\,\frac{M}{\beta_{T-1}^3}\,\exp(\alpha - \beta/\beta_{T-1})
+ 4\,\frac{M\beta}{\beta_{T-1}^4}\,\exp(\alpha - \beta/\beta_{T-1})
- \frac{M\beta^2}{\beta_{T-1}^5}\,\exp(\alpha - \beta/\beta_{T-1}),
$$
and
$$
\partial^2_{\beta_{T-1}} R^*_T(p_T(\beta_{T-1}))\Big|_{\beta_{T-1}=\beta} = \frac{M e^{(\alpha-1)}}{\beta^3}.
$$
Since $E\{[\beta_{T-1} - \beta]^2\} = \mathrm{Var}[\beta_{T-1}] = \sigma^2_{\beta_{T-1}}$, we therefore have
$$
E[R^*_T(p_T(\beta_{T-1}))] = R^*_T(p_T(\beta)) + \frac{1}{2}\,\frac{M e^{(\alpha-1)}}{\beta^3}\,\sigma^2_{\beta_{T-1}} + O((T-2)^{-2}),
$$
as we wanted to show. $\Box$
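As a numerical sanity check on the order of this approximation, the short sketch below compares a Monte Carlo estimate of $E\{R^*_T(p_T(\beta_{T-1}))\}$ with the second-order expansion, using hypothetical values of $M$, $\alpha$, $\beta$ (with $\beta < 0$) and an assumed standard deviation for $\beta_{T-1}$; none of these numbers come from the paper. Consistent with the terms appearing in the proof, the expected one-period revenue is written directly as $R(p) = M p \exp(\alpha + \beta p)$ and the certainty-equivalent price as $p(b) = -1/b$, abstracting from the demand noise.

import numpy as np

rng = np.random.default_rng(0)
M, ALPHA, BETA = 1000.0, 1.0, -0.5    # hypothetical parameters, with BETA < 0
SD = 0.02                             # assumed standard deviation of the normal estimator of BETA

def revenue(p):
    """Expected one-period revenue at price p under the true parameters."""
    return M * p * np.exp(ALPHA + BETA * p)

def price(b):
    """Certainty-equivalent price for a slope estimate b."""
    return -1.0 / b

beta_hat = rng.normal(BETA, SD, size=1_000_000)     # draws of the normally distributed estimator
monte_carlo = revenue(price(beta_hat)).mean()
expansion = revenue(price(BETA)) + 0.5 * M * np.exp(ALPHA - 1.0) / BETA**3 * SD**2

print(f"Monte Carlo estimate  : {monte_carlo:.4f}")
print(f"Second-order expansion: {expansion:.4f}")
print(f"Revenue at true beta  : {revenue(price(BETA)):.4f}")

For small estimator variances the Monte Carlo value and the expansion agree closely, and both lie slightly below the revenue at the true parameter, as the negative sign of $M e^{(\alpha-1)}/\beta^3$ (for $\beta < 0$) implies.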
References
[1] C. Anderson and Z. Hong, Reinforcement Learning with Modular Neural Networks
for Control. Proceedings of NNACIP’94, the IEEE International Workshop on Neural
Networks Applied to Control and Image Processing, 1994.
[2] Y. Aviv and A. Pazgal, Pricing of Short Life-Cycle Products through Active Learning,
Technical Report, Olin School of Business, Washington University, 2002.
[3] Y. Aviv and A. Pazgal, A Partially Observed Markov Decision Process for Dynamic
Pricing, Technical Report, Olin School of Business, Washington University, 2002.
[4] K. Azoury, Bayes Solution to Dynamic Inventory Models under Unknown Demand
Distributions, Management Sci., 31, 1150-1160, 1985.
[5] R. Balvers and T. Cosimano, Actively Learning about Demand and the Dynamics of
Price Adjustment, The Economic Journal, 100, 882-898, 1990.
[6] M. Campi and P. Kumar, Adaptive Linear Quadratic Gaussian Control: The Cost-
Biased Approach Revisited, University of Illinois at Urbana-Champaign Technical Re-
port, http://black.csl.uiuc.edu/ prkumar, 1998.
[7] A. Carvalho and M. Puterman, Learning and Pricing in an Internet Environment with
Binomial Demands, Technical Report, Sauder School of Business, University of British
Columbia, 2004a.
[8] A. Carvalho and M. Puterman, Foundations of Learning and Pricing, Technical Report,
Sauder School of Business, University of British Columbia, 2004b.
[9] H. Chen and L. Guo, A Robust Stochastic Adaptive Controller, IEEE Transactions on
Automatic Control, 33, 1988.
[10] E. Cope, Non-parametric Strategies for Dynamic Pricing in e-Commerce, Technical
Report, Sauder School of Business, University of British Columbia, 2004.
[11] T. Dietterich and X. Wang, Batch Value Function Approximation via Support Vec-
tors, Forthcoming in Dietterich, T. G., Becker, S., Ghahramani, Z. (Eds.) Advances in
Neural Information Processing Systems 14, Cambridge, MA: MIT Press, 2003.
[12] X. Ding, M. Puterman and A. Bisi, The Censored Newsvendor and the Optimal
Acquisition of Information, Operations Res., 50, 517-527, 2002.
[13] N. Draper and H. Smith, Applied Regression Analysis, Wiley Series in Probability and
Statistics, 1998.
[14] D. Easley and N. Kiefer, Controlling a Stochastic Process with Unknown Parameters,
Econometrica, 56, 5, 1045-1069, 1988.
[15] D. Easley and N. Kiefer, Optimal Learning with Endogenous Data, Int. Econ. Rev.,
30, 4, 963-978, 1989.
[16] L. Fahrmeir and G. Tutz, Multivariate Statistical Modeling Based on Generalized Lin-
ear Models (Springer Series in Statistics), Springer-Verlag, 1994.
[17] J. Forbes and D. Andre, Real-Time Reinforcement Learning in Continuous Domain,
AAAI Spring Symposium on Real-Time Autonomous Systems, 2000.
[18] G. Gallego and G. van Ryzin, Optimal Dynamic Pricing of Inventories with Stochastic
Demand over Finite Horizons, Management Science, 40, 8, 999-1020, 1994.
[19] A. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter. Cam-
bridge University Press, 1994.
[20] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning - Data
Mining, Inference and Prediction. Springer, 2001.
[21] C. Hu, W. Lovejoy and S. Shafer, Comparison of Some Suboptimal Control Policies
in Medical Drug Therapy, Operations Res., 44, 696-709, 1996.
[22] K. Kalyanam, Pricing Decisions Under Demand Uncertainty: A Bayesian Mixture
Model Approach, Marketing Science, 1996.
[23] N. Kiefer and Y. Nyarko, Optimal Control of an Unknown Linear Process with Learn-
ing, Int. Econ. Rev., 30, 3, 571- 586, 1989.
[24] P. Kumar, Convergence of Adaptive Control Schemes Using Least-Squares Parameter
Estimates, IEEE Transactions on Automatic Control, 1990.
[25] M. Lariviere and E. Porteus, Stalking Information: Bayesian Inventory Management
with Unobserved Lost Sales, Management Sci., 45, 1999.
[26] E. Lehmann, Elements of Large-Sample Theory, Springer, 1999.
[27] M. Lobo and S. Boyd, Pricing and Learning with Uncertain Demand, Working Paper,
2003.
[28] W. Lovejoy, Myopic Policies for Some Inventory Models with Uncertain Demand
Distributions, Management Sci., 36, 724-738, 1990.
[29] R. Martinez, Pricing in a Congestible Service Industry with a Focus on the Ski
Industry, Unpublished MSc Thesis, Sauder School of Business, University of British
Columbia, 2003.
[30] N. Petruzzi and M. Dada, Dynamic Pricing and Inventory Control with Learning,
Naval Research Logistics, 49, 304-325, 2002.
[31] C. Raju, Y. Narahari and K. Kumar, Learning dynamic prices in multi-seller elec-
tronic retail markets with price sensitive customers, stochastic demands, and inventory
replenishments, Indian Institute of Science Working Paper, 2004.
[32] M. Rothschild, A Two-Armed Bandit Theory of Market Pricing, J. Econ. Theor.,
9, 185-202, 1974.
[33] H. Scarf, Some Remarks on Bayes Solution to the Inventory Problem, Naval Res. Logist.
Quart., 7, 591-596, 1960.
[34] E. Silver and M. Rahnama, Biased Selection of the Inventory Reorder Point when
Demand Parameters are Statistically Estimated, Engr. Cost and Prod. Econ., 12,
283-292, 1987.
[35] J. Treharne and C. Sox, Adaptive Inventory Control for Non-stationary Demand and
Partial Information, Management Science, 48, 607-624, 2002.
[36] R. Sutton and G. Barto, Reinforcement Learning. MIT Press, 2nd edition, 1998.
[37] J. Sternby, On Consistency for the Method of Least Squares Using Martingale Theory,
IEEE Transactions on Automatic Control, 1977.
[38] J. Tsitsiklis, An Analysis of Temporal-Difference Learning with Function Approxima-
tion, IEEE Transactions on Automatic Control, 1997.
[39] H. White, Asymptotic Theory for Econometricians. Academic Press, 2001.