Content uploaded by Kenneth Train

Author content

All content in this area was uploaded by Kenneth Train on Oct 03, 2015

Content may be subject to copyright.

A Recursive Estimator for Random Coeﬃcient

Models

Kenneth Train

Department of Economics

University of California, Berkeley

October 18, 2007

Abstract

This paper describes a recursive method for estimating random co-

eﬃcient models. Starting with a trial value for the moments of the

distribution of coeﬃcients in the population, draws are taken and then

weighted to represent draws from the conditional distribution for each

sampled agent (i.e., conditional on the agent’s observed dependent vari-

able.) The moments of the weighted draws are calculated and then

used as the new trial values, repeating the process to convergence. The

recursion is a simulated EM algorithm that provides a method of sim-

ulated scores estimator. The estimator is asymptotically equivalent to

the maximum likelihood estimator under speciﬁed conditions. The re-

cursive procedure is faster than maximum simulated likelihood (MSL)

with numerical gradients, easier to code than MSL with analytic gra-

dients, assures a positive deﬁnite covariance matrix for the coeﬃcients

at each iteration, and avoids the numerical diﬃculties that often oc-

cur with gradient-based optimization. The method is illustrated with a

mixed logit model of households’ choice among energy suppliers.

Keywords: Mixed logit, probit, random coeﬃcients, EM algorithm.

1

1 Introduction

Random coeﬃcient models, such as mixed logit or probit, are widely used be-

cause they parsimoniously represent the fact that diﬀerent agents have diﬀerent

preferences. The parameters of the model are the parameters of the distribu-

tion of coeﬃcients in the population. The speciﬁcations generally permit full

covariance among the random coeﬃcients. However, this full generality is sel-

dom realized in empirical applications due to the numerical diﬃculty of max-

imizing a likelihood function that contains so many parameters. As a result,

most applications tend to assume no covariance among coeﬃcients (Chen and

Cosslett, 1998, Goett et al., 2000, Hensher et al., 2005) or covariance among

only a subset of coeﬃcients (Train, 1998, Revelt and Train, 1998).1

This paper presents a procedure that facilitates estimation of random co-

eﬃcient models with full covariance among coeﬃcients. In its simplest form,

it is implemented as follows. For each sampled agent, draws are taken from

the population distribution of coeﬃcients using a trial value for the mean and

covariance of this distribution. Each draw is weighted proportionally to the

probability of the agent’s observed dependent variable under this draw. The

mean and covariance of these weighted draws over all sampled agents are then

calculated. This mean and covariance become the new trial values, and the pro-

cess is repeated to convergence. The procedure provides a method of simulated

scores estimator (Hajivassiliou and McFadden, 1998), which is asymptotically

equivalent to maximum likelihood under well-known conditions discussed be-

low. The recursive procedure constitutes a simulated EM algorithm (Dempster

et al., 1977; Ruud, 1991), which converges to a root of the score condition.

The procedure is related to the diagnostic tool described by Train (2003,

section 11.5) of comparing the conditional and unconditional densities of co-

1Restrictions on the covariances are not as benign as they might at ﬁrst appear. For exam-

ple, Louviere (2003) argues, with compelling empirical evidence, that the scale of utility (or,

equivalently, the variance of random terms over repeated choices by the same agent) varies

over people, especially in stated-preference experiments. Models without full covariance of

utility coeﬃcients imply the same scale for all people. If in fact scale varies, the variation in

scale, which does not aﬀect marginal rates of substitution (MRS), manifests erroneously as

variation in independent coeﬃcients that does aﬀect estimated MRS.

2

eﬃcients for an estimated model. In particular, to evaluate a model, draws

are taken from the conditional distribution of coeﬃcients for each agent in the

sample, and then the distribution of these draws is compared with the esti-

mated population (i.e., unconditional) distribution. If the model is correctly

speciﬁed, the two distributions should be similar, since the expectation of the

former is equal to the later. In the current paper, this concept is used as an

estimation criterion rather than a diagnostic tool.

The procedure is described and applied in the sections below. Section 2

provides the basic version under assumptions that are more restrictive than

needed but facilitate explanation and implementation. Section 3 generalizes

the basic version. Section 4 applies the procedure to data on households’

choices among energy suppliers.

2 Basic Version

Each agent faces exogenous observed explanatory variables xand observed

dependent variable(s) y. We assume in our notation that yis discrete and

xis continuous, though these assumptions can be changed with appropriate

change in notation. Let βbe a vector of random coeﬃcients that aﬀect the

agent’s outcome and are distributed over agents in the population with density

f(β|θ), where θare parameters that characterize the density, such as its mean

and covariance. For the purposes of this section, we specify fto be the normal

density, independent of x; these assumptions will be relaxed in section 3. Let

m(β) be the vector-valued function consisting of βitself and the vectorized

lower portion of (ββ), Then, by deﬁnition, θ=m(β)f(β|θ)dβ.Thatis,θ

are the unconditional moments of β.

Consider now the behavioral model. Given β, the behavioral model gives

the probability that an agent facing xhas outcome yas some function L(y|

β,x), which we assume in this section depends on coeﬃcients βand not (di-

rectly) on elements of θ. In a mixed logit model with repeated choices for each

agent, Lis a product of logits. In other models, L, which we call the kernel of

the behavioral model, takes other forms.2Since βis not known, the probability

2If all random elements of the behavioral model are captured in β, then Lis an indicator

3

of outcome yis P(y|x, θ)=L(y|β, x)f(β|θ)dβ.

The density of βcan be determined for each agent conditional on the agent’s

outcome. This conditional distribution is the distribution of βamong the

subpopulation of agents who, when faced with x,haveoutcomey.ByBayes’

identity, the conditional density is h(β|y, x, θ)=L(y|β,x)f(β|θ)/P (y|

x, θ). The moments of this conditional density are m(β)h(β|y, x, θ)dβ,and

the expectation of such moments in the population is:

M(θ)=x

yS(y|x)βm(β)h(β|y, x, θ)dβ g(x)dx

where g(x) is the density of xin the population and S(y|x)istheshareof

agents with outcome yamong those facing x.

Denote the true parameters as θ∗. At the true parameters S(y|x)=

P(y|x, θ∗), such that the expected value of the moments of the conditional

distributions equals the unconditional moments:

M(θ∗)=x

yS(y|x)βm(β)L(y|β,x)f(β|θ∗)dβ

P(y|x, θ∗)g(x)dx

=x

yβm(β)L(y|β,x)f(β|θ∗)dβg(x)dx

=xβm(β)[

y

L(y|β,x)]f(β|θ∗)dβg(x)dx

=xβm(β)f(β|θ∗)dβg(x)dx

=θ∗.

since L(y|β,x) sums to one over all possible values of y.

The estimation procedure uses a sample analog to the population expecta-

tion M(θ). The variables for sampled agents are subscripted by n=1, ..., N.

The sample average of the moments of the conditional distributions is then:

M(θ)= 1

N

nβm(β)L(yn|β,xn)

P(yn|xn,θ)f(β|θ)dβ.

This quantity is simulated as follows: (1) For each agent, take Rdraws of βfrom

f(β|θ) and label the r-th draw for agent nas βnr. (2) Calculate L(yn|βnr,x

n)

function of whether or not the observed outcome arises under that β.

4

for all draws for all agents. (3) Weight draw βnr by wnr =L(yn|βnr,xn)

1

RrL(yn|βnr,xn),

such that the weights average to one over draws for each given agent. (4)

Average the weighted moments:

˜

M(θ)=

n

r

wnr m(βnr)/N R

The estimator ˆ

θis deﬁned by ˜

M(ˆ

θ)=ˆ

θ. The recursion starts with an

initial value of θand repeatedly calculates θt+1 =˜

M(θt)untilθT+1 =θT

within a tolerance. Since the ﬁrst two moments determine the covariance, the

procedure is equivalently applied to the mean and covariance directly. Note

that the covariance in each iteration is necessarily positive deﬁnite, since it is

calculated as the covariance of weighted draws.

We ﬁrst examine the properties of the estimator and then the recursion.

2.1 Relation of estimator to maximum likelihood

Given the speciﬁcation of P(yn|xn,θ), the score can be written:

sn(θ)=∂logP(yn|xn,θ)

∂θ

=1

P(yn|xn,θ)L(yn|β, xn)∂f(β|θ)

∂θ dβ

=∂logf(β|θ)

∂θ

L(yn|β,xn)

P(yn|xn,θ)f(β|θ)dβ.

The maximum likelihood estimator is a root of nsn(θ)=0.

Let bbe the mean and Wthe covariance of the normally distributed coef-

ﬁcients, such that logf(β|b, W )=k−1

2log(|W|)−1

2(β−b)W−1(β−b).The

derivatives entering the score are:

∂logf

∂b =−W−1(β−b)

∂logf

∂W =−1

2W−1+1

2W−1[(β−b)(β−b)]W−1.

It is easy to see that nsn(θ0) = 0 for some θ0if and only if M(θ0)=θ0,such

that, in the non-simulated version, the estimator is the same as MLE.

Consider now simulation. A direct simulator of the score is

˜sn(θ)= 1

R

r

wnr

∂logf(β|θ)

∂θ .

5

A method of simulated scores estimator is a root of n˜sn(θ) = 0. As in the non-

simulated case, ˜sn(θ0)=0iﬀ ˜

M(θ0)=θ0, such that the recursive estimator

is this MSS estimator. Hajivassiliou and McFadden (1998) give properties of

MSS estimators. In our case, the score simulator is not unbiased, due to the

inverse probability that enters the weights. In this case, the MSS estimator is

consistent and asymptotically equivalent to MLE if Rrises at a rate greater

than √N.

These properties, and the requirement on the draws, are the same as

for maximum simulated likelihood (MSL; Hajivassiliou and Ruud, 1994, Lee,

1995.) However, the estimator is not the same as the MSL estimator. For MSL,

the probability is expressed as an integral over a parameter-free density, with

the parameters entering the kernel. The gradient then involves the derivatives

of the kernel rather than the derivatives of the density. That is, the coeﬃcients

are treated as functions β(θ, μ)withμhaving a parameter-free distribution.

The probability is expressed as P(y|x, θ)=L(y|β(θ, μ),x)f(μ)dμ and

simulated as ˜

P(y|x, θ)=rL(y|β(θ, μr),x)/R for draws μ1, ...μR.The

derivative of the log of this simulated probability is

˜

˜s(θ)= 1

˜

P(y|x, θ)

1

R

r

∂L(y|β(θ, μr),x)

∂θ ,

which is not numerically the same as ˜s(θ) for a ﬁnite number of draws. In

particular, the value of θthat solves n˜sn(θ) = 0 is not the same as the value

that solves n˜

˜sn(θ) = 0 and maximizes the log of the simulated likelihood

function. Either simulated score can serve as the basis for a MSS estimator,

and they are asymptotically equivalent to each other under the maintained

condition that Rrises faster than √N. The distinction is the same as for any

MSS estimator that is based on a simulated score that is not the derivative of

the log of the simulated probability.3

The simulated scores at ˆ

θprovide an estimate of the information matrix,

analogous to the BHHH estimate for standard maximum likelihood: ˆ

I=

SS/N,whereSis the NxKmatrix of K-dimensional scores for Nagents.

3An important class are the unbiased score simulators that Hajivassiliou and McFadden

(1998) discuss, which, by deﬁnition, diﬀer from the derivative of the log of the simulated

probability because the latter is necessarily biased due to the log operation.

6

The covariance matrix of the estimated parameters is then estimated as V=

ˆ

I−1/N =(SS)−1, under the maintained assumption that Rrises faster than

√N. Also, the scores can be used as a convergence criterion, using the statistic

¯sV¯s,where¯s=n˜sn/N .

2.2 Simulated EM algorithm

We can show that the recursive procedure is an EM algorithm and, as such,

is guaranteed to converge. In general, an EM algorithm is a procedure for

maximizing a likelihood function in the presence of missing data (Dempster,

et al., 1977). For sample n=1,...,N, with discrete observed sample out-

come ynand continuous missing data znfor observation n(and suppress-

ing notation for observed explanatory variables), the likelihood function is

nlog P(yn|z,θ)fn(z|θ)dz,wherefn(z|θ) is the density of the miss-

ing data for observation nwhich can depend on parameters θ. The recursion

is speciﬁed as:

θt+1 =argmaxθ

nhn(z|yn,θ

t)logP(yn,z |θ)dz

where Pis the probability-density of both the observed outcome and missing

data, and his the density of the missing data conditional on y. It is called

EM because it consists of an expectation that is maximized. The term being

maximized is the expected log-likelihood of both the outcome and the missing

data, where this expectation is over the density of the missing data conditional

on the outcome. The expectation is calculated using the previous iteration’s

value of θin hn, and the maximization to obtain the next iteration’s value is

over θin logP (yn,z |θ). This distinction between the θentering the weights

for the expectation and the θentering the log-likelihood is the key element of

the EM algorithm. Under conditions given by Boyles (1983) and Wu (1983),

this algorithm converges to a local maximum of the original likelihood function.

As with standard gradient-based methods, it is advisable to check whether the

local maximum is global, by, e.g., using diﬀerent starting values.

In the present context, the missing data are the β’s which have the same

unconditional density for all observations, such that the above notation is trans-

7

lated to zn=βnand fn(z|θ)=f(β|θ)∀n. The EM recursion becomes:

θt+1 =argmaxθ

nh(β|yn,x

n,θ

t)log[L(yn|β,xn)f(β|θ)]dβ. (1)

Since L(yn|β,xn) does not depend on θ, the recursion becomes

θt+1 =argmaxθ

nh(β|yn,x

n,θ

t)logf(β|θ)dβ (2)

The integral is approximated by simulation, giving:

θt+1 =argmaxθ

n

r

wnr(θt)logf (βnr |θ)(3)

where the weights are expressed as functions of θtsince they are calculated

from θt. Note, as stated above, that in the maximization to obtain θt+1,the

weights are ﬁxed, and the maximization is over θin f. The function being

maximized is the log-likelihood function for a sample of draws from fweighted

by w(θt). In the current section, fis the normal density, which makes this

maximization easy. In particular, for a sample of weighted draws from a normal

distribution, the maximum likelihood estimator for the mean and covariance

of the distribution is simply the mean and covariance of the weighted draws.

This is our recursive procedure.4

3 Generalization

We consider non-normal distributions, ﬁxed coeﬃcients, and parameters that

enter the kernel but not the distribution of coeﬃcients.

3.1 Non-normal distributions

For distributions that can be expressed as a transformation of normally dis-

tributed terms, the transformation can be taken in the kernel, L(y|T(β),x)

4EM algorithms have been used extensively to examine Gaussian mixture models for

cluster analysis and data mining (e.g., McLachlan and Peel, 2000.) In these models, the

density of the data is described by a mixture of Gaussian distributions, and the goal is to

estimate the mean and covariance of each Gaussian distribution and the parameters that mix

them. In our case, the data being explained are discrete outcomes rather than continuous

variables, and the Gaussian is the mixing distribution rather than the quantity that is mixed.

8

for transformation T, and all other aspects of the procedure remain the same.

The parameters of the model are still the mean and covariance of the normally

distributed terms, before transformation. Draws are taken from a normal with

given mean and covariance, weights are calculated for each draw, the mean and

covariance of the weighted draws are calculated, and the process is repeated

with the new mean and covariance. The transformation aﬀects the weights,

but nothing else. A considerable degree of ﬂexibility can be obtained in this

way. Examples include lognormals with transformation exp(β), censored nor-

mal with max(0,β), and Johnson’s SBdistribution with exp(β)/(1 + exp(β)).

The empirical application in section 4 explores the use of these kinds of trans-

formations.

For any distribution, the EM algorithm in eqn (4) states that the next

value of the parameter, θt+1,istheMLEofθfrom a sample of weighted draws

from the distribution. With a normal distribution, the MLE is the mean and

covariance of the weighted draws. For many other distributions, the same

is true, namely, that the parameters of the distribution are moments whose

MLE is the analogous moments in the sample of weighted draws. When this

is not the case, then the moments of the weighted draws are replaced with

whatever constitutes the MLE of parameters based on the weighted draws.

The equivalence of ˜sn(θ)=0and ˜

M(θ)=θarises under any fwhen ˜

Mis

deﬁned as the MLE estimator from weighted draws from f.

3.2 Fixed coeﬃcients and parameters in the kernel

The procedure can be conveniently modiﬁed to allow random coeﬃcients to

contain a systematic part that would ordinarily appear as a ﬁxed coeﬃcient

in the kernel. Let βn=Γzn+ηnwhere znis a vector of observed variables

relating to agent n, Γ is a conforming matrix, and ηnis normally distributed.

The parameters θare now Γ and the mean and covariance of η. The density of β

is denoted f(β|zn,θ) since it depends on z. The probability for observation n

is P(yn|xn,z

n,θ)=L(yn|β, xn)f(β|zn,θ)dβ, and the conditional density

of βis h(β|yn,x

n,z

n,θ

t)=L(yn|β,xn)f(β|zn,θ)/P (yn|xn,z

n,θ). The EM

9

recursion is

θt+1 =argmaxθ

nh(β|yn,x

n,z

n,θ

t)log[L(yn|β,xn)f(β|zn,θ)]dβ.

As before, Ldoes not depend on θand so drops out, giving:

θt+1 =argmaxθ

nh(β|yn,x

n,z

nθt)logf(β|zn,θ)dβ.

which is simulated by

θt+1 =argmaxθ

n

r

wnr(θt)logf (βnr |zn,θ)(4)

where wnr =L(yn|βnr,x

n)/r1

RL(yn|βnr,x

n). Given a value of θ, draws of

βnare obtained by drawing ηfrom its normal distribution and adding Γzn.The

weight for each draw of βnis determined as before, proportional to L(yn|βn,x).

Then the ML estimate of θis obtained from the sample of weighted draws.

Since βis speciﬁed as a system of linear equations with normal errors, the

MLE of the parameters is the weighted seemingly unrelated regression (SUR)

of βnon zn(e.g., Greene, 2000, section 15.4). The estimated coeﬃcients of zn

are the new value of Γ; the estimated constants are the new means of η;and

the covariance of the residuals is the new value of the covariance of η.

For ﬁxed parameters that are not implicitly part of a random coeﬃcient,

an extra step must be added to the procedure. To account for this generality,

let the kernel depend on parameters λthat do not enter the distribution of

the random β: i.e., L(y|β,x,λ). Denote the parameters as θ, λ,whereθis

still the mean and covariance of the normally distributed coeﬃcients. The EM

recursion given in eq (1) becomes:

θt+1,λ

t+1=argmaxθ,λ

nh(β|yn,x

n,θ

t,λ

t)log[L(y|β,x,λ)f(β|θ)]dβ.

Unlike before, Lnow depends on the parameters and so does not drop out.

However, Ldepends only on λ,andfdepends only on θ, such that the two

sets of parameters can be updated separately. The equivalent recursion is:

θt+1 =argmaxθ

nh(β|yn,x

n,θ

t,λ

t)logf(β|θ)dβ

10

as before and

λt+1 =argmaxλ

nh(β|yn,x

n,θ

t,λ

t)logL(y|β,x,λ)dβ.

The latter is the MLE for the kernel model on weighted observations. If,

e.g., the kernel is a logit formula, then the updated value of λis obtained by

estimating a standard (i.e., non-mixed) logit model on weighted observations,

with each draw of βproviding an observation. A more realistic situation is a

model in which the kernel is a product of GEV probabilities (McFadden, 1978),

with λbeing the nesting parameters, which are the same for all agents. The

updated values of the nesting parameters are obtained by MLE of the nested

logit kernel on the weighted observations, where the only parameters in this

estimation are the nesting parameters themselves. The parameters associated

with the random coeﬃcients are updated the same as before, as the mean and

covariance of the weighted draws.

Alternative-speciﬁc constants in discrete choice models can be handled in

the way just described. However, if the constants are the only parameters that

enter the kernel, then the contraction suggested by Berry, Levinsohn, and Pakes

(1995) can be applied rather than estimating them by ML.5For constants α,

this contraction is a recursive application of αt+1 =αt+ln(S)−ln(ˆ

S(θt,α

t)),

where Sis the sample (or population) share choosing each alternative, and

ˆ

S(θ, α) is the predicted share based on parameters θand α. This recursion

would ideally be iterated to convergence with respect to αfor each iteration of

the recursion for θ. However, it is probably eﬀective with just one updating of

αfor each updating of θ.

4 Application

We apply the procedure to a mixed logit model, using data on households’

choice among energy suppliers in stated-preference (SP) exercises. SP exercises

are often used to estimate preferences for attributes that are not exhibited in

5If the kernel is the logit formula, then the contraction gives the MLE of the constants,

since both equate sample and predicted shares for each alternative; see, e.g., Train 2003, p.

66.

11

markets or for which market data provide insuﬃcient variation for meaningful

estimation. A general description of the approach, with a review of its history

and applications, is provided by, e.g., Louviere et al. (2000). In an SP survey,

each respondent is presented with a series of choice exercises. Each exercise

consists of two or more alternatives, with attributes of each alternative de-

scribed. The respondent is asked to identify the alternative that they would

choose if facing the choice in the real world. The attributes are varied over

situations faced by each respondent as well as over respondents, to obtain the

variation that is needed for estimation.

In the current application, respondents are residential energy customers,

deﬁned as a household member who is responsible for the household’s elec-

tricity bills. Each respondent was presented with 12 SP exercises representing

choice among electricity suppliers. Each exercise consisted of four alternatives,

with the following attributes of each alternative speciﬁed: the price charged by

the supplier in cents per kWh; the length of contract that binds the customer

and supplier to that price (varying from 0 for no binding to 5 years); whether

the supplier is the local incumbent electricity company (as opposed to a en-

trant); whether, if an entrant, the supplier is a well-known company like Home

Depot (as opposed to a entrant that is not otherwise known); whether time-

of-use rates are applied, with the rates in each period speciﬁed; and whether

seasonal rates are applied, with the rates in each period speciﬁed. Choices

were obtained for 361 respondents, with nearly all respondents completing all

12 exercises. These data are described by Goett (1998). Huber and Train

(2001) used the data to compare ML and Bayesian methods for estimation of

conditional distributions of utility coeﬃcients.

The behavioral model is speciﬁed as a mixed logit with repeated choices

(Revelt and Train, 1998). Consumer nfaces Jalternatives in each of Tchoice

situations. The utility that consumer nobtains from alternative jin choice

situation tis Unjt =β

nxnjt +εnj t,wherexnjt is a vector of observed variables,

βnis random with distribution speciﬁed below, and εnjt is iid extreme value. In

each choice situation, the agent chooses the alternative with the highest utility,

and this choice is observed but not the latent utilities themselves. By specifying

εnjt to be iid, all structure in unobserved terms is captured in the speciﬁcation

12

of β

nxnjt . McFadden and Train (2000) show that any random utility choice

model can be approximated to any degree of accuracy by a mixed logit model

of this form.6

Let ynt denote consumer n’s chosen alternative in choice situation t,with

the vector yncollecting the choices in all Tsituations. Similarly, let xnbe the

collection of variables for all alternatives in all choice situation. Conditional

on β, the probability of the consumer’s observed choices is a product of logits:

L(yn|β,xn)=

t

eβxnytt

jeβxnjt .

The (unconditional) probability of the consumer’s sequence of choice is:

P(yn|xn,θ)=L(yn|β, xn)f(β|θ)dβ

where fis the density of β, which depends on parameters θ.Thisfis the

(unconditional) distribution of coeﬃcients in the population. The density of β

conditional on the choices that consumer nmade when facing variables xnis

h(β|yn,x

n,θ)=L(yn|β, xn)f(β|θ)/P (yn|xn,θ).

We ﬁrst assume that βis normally distributed with mean band covariance

W. The recursive estimation procedure is implemented as follows, with band

Wused explicitly for θ:

1. Start with trial values b0and W0.

2. For each sampled consumer, take Rdraws of β,withther-th draw for

consumer ncreated as βnr =b0+C0ηwhere C0is the lower triangular

Choleski factor of W0and ηis a vector of iid standard normal draws.

3. Calculate a weight for each draw as wnr =L(yn|βnr,x

n)/1

RrL(yn|

βnr,x

n).

4. Calculate the weighted mean and covariance of the N·Rdraws, and label

them b1and W1.

6It is important to note that McFadden and Train’s theorem is an existence result only

and does not provide guidance on ﬁnding the appropriate distribution and speciﬁcation of

variables that attains a close approximation.

13

5. Repeat steps (2)-(4) using b1and W1in lieu of b0and W0,continuingto

convergence.

The last choice situation for each respondent was not used in estimation and

instead was reserved as a “hold-out” choice to assess the predictive ability

of the estimated models. For simulation, 200 randomized Halton draws were

used for each respondent. These draws are described by, e.g., Train (2003). In

the context of mixed logit models, Bhat (2001) found that 100 Halton draws

provided greater accuracy than 1000 pseudo-random draws; his results have

been conﬁrmed by Train (2000), Munizaga and Alvarez-Diaziano (2001) and

Hensher (2001).

The estimated parameters are given in Tables 1, with standard errors cal-

culated as described above, using the simulated scores at convergence. Table 1

also contains the estimated parameters obtained by maximum simulated likeli-

hood (MSLE.) The results are quite similar. Note that the recursive estimator

(RE) treats the covariances of the coeﬃcients as parameters, while the param-

eters for MSLE are the elements of the Choleski factor of the covariance. (The

covariances are not parameters in MLE because of the diﬃculty of assuring

that the covariance matrix at each iteration is positive deﬁnite when using

gradient-based methods. By construction, the RE assures a positive deﬁnite

covariance at each iteration, since each new value is the covariance of weighted

draws.) To provide a more easily interpretable comparison, Table 2 gives the

estimated standard deviations and correlation matrix implied by the estimated

parameters for each method.

The estimated parameters were used to calculate the probability of each

respondent’s choice in their last choice situation. The results are given at the

bottom of Table 1. Two calculation methods were utilized. First, the prob-

ability was calculated by mixing over the population density of parameters

(i.e., the unconditional distribution), i.e., PnT =L(ynT |β, xnT )f(β|ˆ

θ)dβ,

where Tdenotes the last choice situation. This is the appropriate formula

to use in situations for which previous choices by each sampled agent are not

observed. RE gives an average probability of 0.3742, and MSLE gives 0.3620.

The probability is slightly higher for RE than MSLE, which indicates that RE

predicts somewhat better. The same result was observed for all the alternative

14

speciﬁcations discussed below. The second calculation mixes over the condi-

tional density for each respondent, using h(β|y, x, ˆ

θ) instead of f(β|ˆ

θ). This

formula is appropriate when previous choices of agents have been observed.

The probability is of course higher under both estimators than when using the

unconditional density, since each respondent’s previous choices provide useful

information about how they will choose in a new situation. The average prob-

ability from RE is again higher than that from MSLE. However, unlike the

unconditional probability calculation, this relation is reversed for some of the

alternative speciﬁcations discussed below.

The MSLE algorithm converged in 141 iterations and took 7 minutes, 4

seconds using analytic gradients and 3 hours, 20 minutes using numerical gra-

dients.7For RE, I deﬁned convergence as each parameter changing by less

than one-half of one percent and the convergence statistic given above being

less than 1E-4. The ﬁrst of these criteria was the more stringent in this case,

in that the second was met (at 0.82E-4) once the ﬁrst was. RE converged in

162 iterations and took 7 minutes, 59 seconds. Since RE does not require the

coding of gradients, the implication of these time comparisons is that using

RE instead of MSLE reduces either the researcher’s time in coding analytic

gradients or the computer time in using numerical gradients.

Alternative convergence criteria were explored for RE, both more relaxed

and more stringent. Using a more relaxed criterion of each parameter changing

less than one percent, estimation required 63 iterations; took 3 minutes, 1

second; and obtained a convergence statistic of 1.2E-4. When the criterion was

tightened to each parameter changing by less than one-tenth of one percent,

estimation required 609 iterations; took 29 minutes, 3 seconds; and obtained

a convergence statistic of 0.44E-4. The estimated parameters changed little

by applying the stricter criterion. Interestingly, the more relaxed criterion

7All estimation was in Gauss on a PC with a Pentium 4 processor, 3.2GHz, with 2 GB

of RAM. For MSLE, I used Gauss’ maxlik routine with my codes for the mixed logit log-

likelihood function and for analytic gradients under normally distributed coeﬃcients. For

RE, I wrote my own code; one of the advantages of the approach is the ease of coding it. I

will shortly be developing RE routines in matlab, which I will make available for downloading

from my website at http://elsa.berkeley.edu/∼train.

15

obtained parameters that were a bit closer to the MSL estimates. For example,

the mean and standard deviation of the price coeﬃcient were -0.927 and .611

after 62 iterations and -0.9954 and 0.5471 after 162 iterations, compared to the

MSL estimates of -0.939 and 0.691.

Step-sizes are compared across the algorithms by examining the iteration

log. Table 3 gives the iteration log for the mean and standard deviation of the

price coeﬃcient, which is indicative for all the parameters. The RE algorithm

moves, at ﬁrst, considerably more quickly toward the converged values than the

gradient-based MSLE algorithm. However, it later slows down and eventually

takes smaller steps than the MSLE algorithm. As Dempster et al. (1977) point

out, this is a common feature of EM algorithms. However, Ruud (1991) notes

that the algorithm’s slowness near convergence is balanced by greater numerical

stability, since it avoids the numerical problems that are often encountered

in gradient-based methods, such as overstepping the maximum and getting

“stuck” in areas of the likelihood function that are poorly approximated by a

quadradic. We observed these problems with MSLE in two of our alternative

speciﬁcations, discussed below, where new starting values were required after

the MSLE algorithm failed at the original starting values. We encountered no

such problems with RE.8

Alternative starting values were tried in each algorithm. Several diﬀerent

convergence points were found with each of the algorithms. All of them were

similar to the estimates in Table 1, and none obtained a higher log-likelihood

value. However, the fact that diﬀerent converged values were obtained indi-

cates that the likelihood function is “rippled” around the maximum. This

phenomenon is not unexpected given the large number of parameters and the

relatively small behavioral diﬀerences associated with diﬀerent combinations

of parameter values. Though this issue might constitute a warning about esti-

mation of so many parameters, restricting the parameters doesn’t necessarily

8The recursion can be used as an “accelerator” rather than an estimator, by using it

for initial iterations and then switching to MSL near convergence. This procedure takes

advantage of its larger initial steps and the avoidance of numerical problems, which usually

occur in MSL further from the maximum, while retaining the familiarity of MSL and its

larger step-sizes near convergence.

16

resolve the issue as much as mask it. In any case, the issue is the same for

MSLE and RE.

Table 4 gives statistics for several alternative speciﬁcations. The columns

in the table are for the following speciﬁcations:

1. All coeﬃcients are normally distributed. This is the speciﬁcation in Table

1 and is included here for comparison.

2. Price coeﬃcient is lognormally distributed, as −exp(βp), with βpand the

coeﬃcients of the other variables normally distributed. This speciﬁcation

assures a negative price coeﬃcient for all agents.

3. The coeﬃcients of price, TOU rates and seasonal rates are lognormally

distributed, and the other coeﬃcients are normal. This speciﬁcation as-

sures that all three price-related attributes have negative coeﬃcients for

all agents.

4. Price coeﬃcient is censored normal, min(0,β

p), with others normal. This

speciﬁcation prevents positive price coeﬃcients but allows some agents

to place no importance on price, at least in the range of prices considered

in the choice situations.

5. Price coeﬃcient is distributed as SBfrom 0 to 2, as 2exp(βp)/(1 +

exp(βp)), others normal. This distribution is bounded on both sides and

allows a variety of shapes within these bounds; see Train and Sonnier

(2005) for an application and discussion of its use.

6. The model is speciﬁed in willingness-to-pay space, using the concepts

from Sonnier et al. (2007) and Train and Weeks (2005). Utility is re-

parameterized as U=αp +αβz+εfor price pand non-price attributes

z, such that βis the agent’s willingness to pay (wtp) for attribute z.This

parameterization allows the distribution of wtp to be estimated directly.

Under the usual parameterization, the distribution of wtp is estimated

indirectly by estimating the distribution of the price and attribute coef-

ﬁcients, and deriving (or simulating) the distribution of their ratio.

17

MSLE and RE provide fairly similar estimates under all the speciﬁcations.

In cases when the estimated mean and standard deviation of the underlying

normal for the price coeﬃcient are somewhat diﬀerent, the diﬀerence is less

when comparing the mean and standard deviation of the coeﬃcient itself. For

example, in speciﬁcations (2) and (3), a ﬁfty percent diﬀerence in the estimated

mean of the underlying normal translates into less than four percent diﬀerence

in the mean of the coeﬃcient itself.

For all the speciﬁcations, the log of the simulated likelihood ( ˜

LL) is lower at

convergence with RE than with MSLE. This diﬀerence is by construction, since

the MSL estimates are those that maximize the ˜

LL, while the RE estimates are

those that set the simulated scores equal to zero with the simulated scores not

being the derivative of the ˜

LL. However, despite this diﬀerence, it would be

useful if the ˜

LL under the two methods moved in the same direction when the

speciﬁcation is changed. This is not the case. ˜

LL is higher for speciﬁcation (3)

than speciﬁcation (1) under either estimator. However, for speciﬁcation (4),

˜

LL under RE is higher while that under MSLE is lower than for speciﬁcation

(1). The change under MSLE does not necessarily provide better guidance,

since simulation error can aﬀect MSLE both in the estimates that are obtained

and the calculation of the log-likelihood at those estimates.

The average probability for the “hold-out” choice using the population den-

sity is higher under RE than MSLE for all speciﬁcations. When using the

conditional density, neither method obtains a higher average probability for all

speciﬁcations. These results were mentioned above.

For MSLE, I used numerical gradients rather than recoding the analytic

gradients. The run times in Table 4 therefore reﬂect equal amounts of recoding

time for each method. Run times are much lower for RE than MSLE when

numerical gradients are used for MSLE. With analytic gradients, MSLE would

be about the same speed as RE,9but of course would require more coding

time. As mentioned above, the ML algorithm failed for two of the speciﬁcations

9In some cases, MSLE is slower even with analytic gradients. For example, speciﬁcation

(2) was took 333 iterations in MSLE, while RE took 139. An iteration in MSLE with analytic

gradients takes about the same time as an iteration in RE, such that for speciﬁcation (2),

MSLE with analytic gradients would be slower than RE.

18

(namely, 5 and 6) when using the same starting values as for the others; these

runs were repeated with the converged values from speciﬁcation (2) used as

starting values.

5 Summary

A simple recursive estimator for random coeﬃcients is based on the fact that

the expectation of the conditional distributions of coeﬃcients is equal to the

unconditional distribution. The procedure takes draws from the unconditional

distribution at trial values for its moments, weights the draws such that they

are equivalent to draws from the conditional distributions, calculates the mo-

ments of the weighted draws, and then repeats the process with these calculated

moments, continuing until convergence. The procedure constitutes a simulated

EM algorithm and provides a method of simulated scores estimator. The es-

timator is asymptotically equivalent to MLE if the number of draws used in

simulation rises faster than √N, which is the same condition as for MSL. In

an application of mixed logit on stated-preference data, the procedure gave

estimates that are similar to those by MSL, was faster than MSL with numeri-

cal gradients, and avoided the numerical problems that MSL encountered with

some of the speciﬁcations.

19

References

Berry, S., J. Levinsohn and A. Pakes (1995), ‘Automobile prices in market

equilibrium’, Econometrica 63, 841–889.

Bhat, C. (2001), ‘Quasi-random maximum simulated likelihood estimation of

the mixed multinomial logit model’, Transportation Research B 35, 677–

693.

Boyles, R. (1983), ‘On the convergence of the em algorithm’, Journal of the

Royal Statistical Society B 45, 47–50.

Chen, H. and S. Cosslett (1998), ‘Environmental quality preference and beneﬁt

estimation in multinomial probit models: A simulation approach’, Amer-

ican Journal of Agricultural Economics 80, 512–520.

Dempster, A., N. Laird and D. Rubin (1977), ‘Maximum likelihood from incom-

plete data via the em algorithm’, Journal of the Royal Statistical Society

B39, 1–38.

Goett, A. (1998), ‘Estimating customer preferences for new pricing products’,

Electric Power Research Institute Report TR-111483, Palo Alto.

Goett, A., K. Hudson and K. Train (2000), ‘Consumers’ choice among retail

energy suppliers: The willingnes-to-pay for service attributes’, The Energy

Journal 21, 1–28.

Greene, W. (2000), Econometric Analysis, Prentice Hall, New York.

Hajivassiliou, V. and D. McFadden (1998), ‘The method of simulated scores

for the estimation of ldv models’, Econometrica 66, 863–96.

Hajivassiliou, V. and P. Ruud (1994), Classical estimation methods for ldv

models using simulation, in R.Engle and D.McFadden, eds, ‘Handbook of

Econometrics’, North-Holland, Amsterdam, pp. 2383–441.

Hensher, D. (2001), ‘The valuation of commuter travel time savings for car

drivers in new zealand: Evaluating alternative model speciﬁcations’,

Transportation 28, 101–118.

20

Hensher, D., N. Shore and K. Train (2005), ‘Households’ willingness to pay

for water service attributes’, Environmental and Resource Economics

32, 509–531.

Huber, J. and K. Train (2001), ‘On the similarity of classical and bayesian

estimates of individual mean partworths’, Marketing Letters 12, 259–269.

Lee, L. (1995), ‘Asymptotic bias in simulated maximum likelihood estimation

of discrete choice models’, Econometric Theory 11, 437–483.

Louviere, J. (2003), ‘Random utility theory-based stated preference elicitation

methods’, working paper, Faculty of Business, University of Technology,

Sydney.

Louviere, J., D. Hensher and J. Swait (2000), Stated Choice Methods: Analysis

and Applications, Cambridge University Press, New York.

McLachlan, G. and D. Peel (2000), Finite Mixture Models, John Wiley and

Sons, New York.

Munizaga, M. and R. Alvarez-Daziano (2001), ‘Mixed logit versus nested logit

and probit’, Working Paper, Departmento de Ingeniera Civil, Universidad

de Chile.

Revelt, D. and K. Train (1998), ‘Mixed logit with repeated choices’, Review of

Economics and Statistics 80, 647–657.

Ruud, P. (1991), ‘Extensions of estimation methods using the em algorithm’,

Journal of Econometrics 49, 305–341.

Sonnier, G., A. Ainslie and T. Otter (2007), ‘Hereogeneous distributions of will-

ingness to pay in choice models’, Quantitative Marketing and Economics

5(3), 313–331.

Train, K. (1998), ‘Recreation demand models with taste variation’, Land Eco-

nomics 74, 230–239.

Train, K. (2000), ‘Halton sequences for mixed logit’, Working Paper No. E00-

278, Department of Economics, University of California, Berkeley.

21

Train, K. (2003), Discrete Choice Methods with Simulation, Cambridge Uni-

versity Press, New York.

Train, K. and G. Sonnier (2005), Mixed logit with bounded distributions of cor-

related partworths, in R.Scarpa and A.Alberini, eds, ‘Applications of Sim-

ulation Methods in Environmental and Resource Economics’, Springer,

Dordrecht, pp. 117–134.

Train, K. and M. Weeks (2005), Discrte choice models in preference space

and willingness-to-pay space, in R.Scarpa and A.Alberini, eds, ‘Applica-

tions of Simulation Methods in Environmental and Resource Economics’,

Springer, Dordrecht, pp. 1–16.

Wu, C. (1983), ‘On the convergence properties of the em algorithm’, Annals of

Statistics 11, 95–103.

22

Table 1: Mixed Logit Model of Electricity Supplier Choice

All coeﬃcients normally distributed

Recursive estimator (RE), Maximum simulated likelihood estimator (MSLE)

Parameters RE MSLE

(Std errors in parentheses)

Means

1. Price -0.9954 (0.0521) -0.9393 (0.0520)

2. Contract length -0.2404 (0.0231) -0.2428 (0.0256)

3. Local utility 2.5464 (0.1210) 2.3328 (0.1337)

4. Well known co. 1.8845 (0.0742) 1.8354 (0.0104)

5. TOU rates -9.3126 (0.4571) -9.1682 (0.4400)

6. Seasonal rates -9.6898 (0.4496) -9.0710 (0.4365)

Covariances Choleski

11 0.5471 (0.0726) 0.6909 (0.0611)

21 0.0266 (0.0439) -0.0333 (0.0290)

22 0.1222 (0.0146) 0.4180 (0.0236)

31 0.9430 (0.2672) -1.6089 (0.1523)

32 0.3602 (0.1039) 0.2419 (0.1475)

33 2.8709 (0.3321) 1.4068 (0.1468)

41 0.5208 (0.1689) -0.9107 (0.1218)

42 0.2668 (0.0681) 0.1526 (0.1192)

43 1.3065 (0.3543) 0.6746 (0.1216)

44 1.1015 (0.1339) -1.0424 (0.0997)

51 4.5204 (1.2492) -4.6228 (0.4740)

52 0.2707 (0.3972) -0.1813 (0.1690)

53 7.9995 (2.4263) 1.8399 (0.1740)

54 4.5092 (1.5356) 0.3592 (0.2026)

55 45.050 (5.9201) 2.6309 (0.1631)

61 4.4860 (1.1875) -5.3688 (0.4862)

62 0.1156 (0.3933) -0.3913 (0.1302)

63 7.5672 (2.2439) 0.4850 (0.1691)

64 3.8878 (1.3968) 0.5309 (0.2054)

65 39.927 (10.578) 1.1074 (0.1544)

66 41.916 (5.2169) 1.7984 (0.1371)

Log of Sim. Likelihood -3482.93 -3423.08

Average probability of chosen alt. in last situation.

Unconditional density 0.3742 0.3620

Conditional density 0.5678 0.5632

23

Table 2: Standard deviations and correlations

Std devs Correlations

RE MSLE RE bottom, MSLE top

Price 0.740 0.691 1.000 0.079 0.748 0.589 0.819 0.921

Contract 0.350 0.419 0.103 1.000 0.172 0.145 0.033 0.006

Local util 1.694 2.151 0.752 0.608 1.000 0.736 0.822 0.736

Well known 1.050 1.547 0.671 0.728 0.735 1.000 0.578 0.511

TOU rates 6.712 5.643 0.911 0.115 0.703 0.640 1.000 0.879

Seasonal 6.474 5.827 0.937 0.051 0.690 0.572 0.919 1.000

24

Table 3: Iterations

Price coeﬃcents

Iteration Mean Std dev

RE MSLE RE MSLE

1002.449 0.2000

20.1431 0.1108 0.6650 0.1259

30.0479 0.0718 0.3641 0.1567

4-0.0553 0.0174 0.2944 0.1120

5-0.1405 0.1127 0.2657 0.2430

6-0.2136 0.0567 0.2567 0.2217

7-.2762 -0.0326 0.2575 0.1552

8-.3293 -0.0162 0.2625 0.1717

9-.3746 -0.0070 0.2702 0.1687

10 -.4132 -0.0064 0.2800 0.1514

20 -.6416 -0.3322 0.4191 0.0697

30 -0.7607 -0.5693 0.5346 0.3825

40 -0.8357 -0.7316 0.5947 0.4505

50 -0.8869 -0.7919 0.6325 0.5633

60 -0.9217 -0.8913 0.6573 0.6173

70 -0.9446 -0.9272 0.6738 0.6414

80 -0.9602 -0.9399 0.6854 0.6559

90 -0.9711 -0.9325 0.6941 0.6485

100 -0.9776 -0.9415 0.7006 0.6593

110 -0.9827 -0.9462 0.7044 0.6688

120 -0.9856 -0.9464 0.7087 0.6786

130 -0.9848 -0.9425 0.7178 0.6862

140 -0.9832 -0.9396 0.7282 0.6904

150 -0.9862 NA 0.7341 NA

160 -0.9932 NA 0.7381 NA

25

Table 4: Alternative Speciﬁcations

All Price Price Price Price WTP

normal log TOU censor SBspace

normal season normal

lognorm

Price

Underlying normal

Mean

RE -.9954 -.2441 -.1692 -1.0203 -.1761 -0.0892

MSLE -.9393 -.1655 -.2466 -.9828 -.0915 -.1125

Std dev

RE .7397 .5560 .6274 .6253 1.316 .2941

MSLE .6909 .4475 .7903 .6530 1.7964 .2444

Coeﬃcient

Mean

RE -.9954 -.9144 -1.028 -1.033 -.9335 -.9551

MSLE -.9393 -.9397 -1.068 -1.002 -.9711 -.9207

Std dev

RE .7397 .5503 .7140 .5971 .4990 .2871

MSLE .6909 .4411 .9946 .6155 .5958 .2284

Probability

for last choice

Population density

RE .3742 .3629 .3702 .3785 .3688 .3696

MSLE .3620 .3557 .3539 .3565 .3662 .3649

Conditional densities

RE .5678 .5501 .5640 .5630 .5634 .5309

MSLE .5632 .5569 .5674 .5691 .5637 .5415

Log Sim. Likelihd

RE -3482.93 -3510.81 -3467.49 -3508.84 -3474.66 -3554.66

MSLE -3423.08 -3456.63 -3420.58 -3420.21 -3424.19 -3494.48

Run time

RE 7m59s 6m45s 12m2s 11m27s 10m14s 22m32s

MSLE* 3h20m30s 7h54m34s 3h31m31s 6h26m46s 2h51m9s 4h6m5s

*Using numerical derivatives. Staring values for (5) and (6) are estimates from (2).

26