
To cite this article: João Gomes, Isabel Barão & Tiago Oliveira (2014). Is It Always Necessary to Take Sample Selection into Account? Communications in Statistics - Simulation and Computation, 43:10, 2264-2274. DOI: 10.1080/03610918.2012.762384 (http://dx.doi.org/10.1080/03610918.2012.762384). Published online: 03 Oct 2013.


Communications in Statistics—Simulation and Computation®, 43: 2264–2274, 2014

Copyright © Taylor & Francis Group, LLC

ISSN: 0361-0918 print / 1532-4141 online

DOI: 10.1080/03610918.2012.762384

Is It Always Necessary to Take Sample Selection into Account?

JOÃO GOMES,¹ ISABEL BARÃO,¹ AND TIAGO OLIVEIRA²

¹Departamento de Estatística e Investigação Operacional, Centro de Matemática e Aplicações Fundamentais, Faculdade de Ciências da Universidade de Lisboa, Portugal

²ISEGI, Universidade Nova de Lisboa, Portugal

We compare simple ordinary least squares (OLS) estimation with maximum likelihood (ML) estimation of the Tobit I and Tobit II regression models in the selected sample, and we propose a new measure to quantify the performance of OLS.

Keywords: Conditional expected values; Correlation errors; OLS; Sample selection models; Tobit.

Mathematics Subject Classification: C15; C24.

1. Introduction

The capacity to estimate and test regression models over nonrandomly chosen subsamples is unquestionably one of the most significant innovations in regression modeling. Econometric analysis of labor supply and the wage function has frequently used a sample selection model named Tobit II (Amemiya, 1984), a generalization of the standard Tobit I (Tobin, 1958). In the sample selection model (Tobit II), the possibility of sample selection bias arises whenever one examines a subsample and the unobservable factors determining inclusion in the subsample are correlated with other unobservable factors that interfere with the variable of primary interest. For details of the model, see Aguirregabiria (2009), Amemiya (1984), Greene (1981), Heckman (1978, 1979), and Vella (1998). For lack of appropriate software, these models were slow to be applied. Only after estimation became easy with Heckman's (1979) two-step method did they see increased application, even in areas beyond econometrics (Barnighausen et al., 2011; Kim and Jang, 2010; McBee, 2010; Quartey et al., 2010; Takada, 2008). Our question is whether this additional computational effort produces substantial gains over the results obtained via ordinary least squares (OLS). In order to understand the added value of maximum likelihood (ML, known to be the best but also the more expensive method) relative to OLS (not as efficient but easier to use and less expensive) for the Tobit I and II models, in the selected sample, we propose a measure that evaluates the performance of OLS. Based on that measure we identify the benefits and misuses of OLS.

Received June 4, 2011; Accepted December 21, 2012

Address correspondence to João Gomes, Departamento de Estatística e Investigação Operacional, Centro de Matemática e Aplicações Fundamentais, Faculdade de Ciências da Universidade de Lisboa, Portugal; E-mail: jjgomes@fc.ul.pt


The remainder of the article is organized as follows. In the next section, we present the most frequently cited regression models with sample selection. In Section 3, some probabilistic properties of these models are reviewed, namely their mean values, and in Section 4 we introduce the new measure, $R_{sq}$, for comparing the ML and OLS estimators of those mean values. In Section 5, we present the numerical results of a simulation study, carried out in R (R Development Core Team, 2010), of the distribution of $R_{sq}$ under different scenarios of the Tobit models, together with two examples (Toomet and Henningsen, 2008). A discussion of our work is given in Section 6.

2. Some Types of Selection

In the classical linear model, there is a linear relationship between the independent variables and the mean value of the response variable. If $Y^*_x$ is the response to a stimulus caused by $k$ characteristics associated with the individual $x$, $x \equiv (x_1, \ldots, x_k)$, that response obeys the equation $Y^*_x = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon_x = x\beta + \varepsilon_x$, where $\beta_j$, $j = 1, \ldots, k$, is the weight attributed to characteristic $j$ and $\varepsilon_x$ is a random error associated with $x$. Now suppose that we have a population that follows this model and that a sample of size $n$ is collected from it, yielding the equation system $Y^* = X^*\beta + \varepsilon$, where $X^*\beta$ is the product of an $n \times (k+1)$ matrix $X^*$ by a column vector $\beta$ with $k+1$ elements. If we impose the Gauss–Markov conditions on the model, the OLS estimators of $\beta$ are consistent and asymptotically Gaussian (Aguirregabiria, 2009). Likewise, if we impose the Gaussian distribution $N(0, \sigma)$ on $\varepsilon_x$, the ML estimators of $(\beta, \sigma)$ are strongly consistent (Amemiya, 1984). Moreover, the estimators of $\beta$ are identical under the two estimation processes.

Suppose now that for some reason we cannot observe $(X^*, Y^*)$ but instead observe a subsample $(X, Y)$ of $(X^*, Y^*)$. We call $(X^*, Y^*)$ the latent variables and $Y^* = X^*\beta + \varepsilon$ the latent model. The literature contains several types of sample selection (Amemiya, 1984). Here we adopt a more condensed approach, considering only three types of selection but using the names proposed by Amemiya:

a. Truncated Standard Tobit model (Truncated Tobit I)

b. Standard Tobit model (Tobit I)

c. Type 2 Tobit model (Tobit II)

The distinction between these three cases of selection is relatively simple. For models (a) and (b) there is a known quantity, say $T$, beyond which values of $Y^*$ are not observed. That truncation can be on the left or on the right. In this work, so as not to create ambiguity, we consider left truncation; that is, in the sample, instead of observing $(X^*, Y^*)$, one observes $(X, Y)$ where

$(X, Y) = (X^*, Y^*) : Y^* > T$ (case a)

$(X, Y) = \begin{cases} (X^*, Y^*) : Y^* > T \\ (X^*, T) : Y^* \le T \end{cases}$ (case b)

(Aguirregabiria, 2009). Therefore, the difference between the two cases is that, when $Y^* \le T$, in case (a) we have no information on the covariates, whereas in case (b) we retain the covariates but lose the information on $Y^*$. As for case (c), we need to insert a new latent variable $D^*$, the response of a linear model $D^* = X^*_2\beta_2 + \varepsilon_2$, where $X^*_2$ has the same rows as $X^*$, although not necessarily the same columns, and $\varepsilon_2$ is a random vector of uncorrelated variables with distribution $N(0, \sigma_2)$. $D^*$ is a latent variable; one observes only $D = 1$ if $D^* > T$ and $D = 0$ if $D^* \le T$. Under these conditions,

$(X, Y) = \begin{cases} (X^*, Y^*) : D^* > T \\ (X^*, T) : D^* \le T \end{cases}$ (case c)

(Aguirregabiria, 2009).¹

There are several estimation methods adequate to overcome this selection problem. While in cases (a) and (b) semi-parametric methods are well known and used (e.g., Symmetrically Trimmed Least Squares; Aguirregabiria, 2009), in case (c) Heckman's two-step method (Heckman, 1976, 1979) is still the most cited. However, the ML estimators, given their strong consistency (Amemiya, 1984), are the most valuable, as long as the assumptions are fulfilled, namely the homogeneity of variance of the random error. In spite of that, in many situations (when the objective is just to find the mean value of $Y_x$ when $x$ is not censored), OLS can be a quick and practical method to solve some sample selection cases. Our objective in this work is to compare, in each of the cases mentioned, the OLS and ML estimators of $E(Y_x)$ and to find a quick and practical way of comparing the two estimation methods.

3. The Mean Values

We have, in a latent way, a linear model $Y^*_x = x\beta + \varepsilon_x$, such that $Y^*_x$ has a probability distribution $N(\eta_x = x\beta, \sigma)$ with probability density function (p.d.f.) $f_{Y^*_x}(y, \eta_x) = \frac{1}{\sigma}\varphi\!\left(\frac{y - \eta_x}{\sigma}\right)$, where $\varphi$ is the p.d.f. of a standard Gaussian. From here on, and with no loss of generality, we consider $T = 0$.

Let us then describe each of the situations.

Let us then describe each of the situations.

3.1. Truncated Tobit I Model

Our random variable, $Y_x = Y^*_x$ if $Y^*_x > 0$ (and unobserved otherwise), has a truncated Gaussian distribution such that

$f_{Y_x}(y, \eta_x) = \frac{\exp[-(y - \eta_x)^2/2\sigma^2]}{\sqrt{2\pi}\,\sigma\, P(Y^*_x > 0)} = \frac{\varphi\!\left(\frac{y - \eta_x}{\sigma}\right)}{\sigma\left[1 - \Phi\!\left(-\frac{\eta_x}{\sigma}\right)\right]} = \frac{\varphi\!\left(\frac{y - \eta_x}{\sigma}\right)}{\sigma\,\Phi\!\left(\frac{\eta_x}{\sigma}\right)}, \quad y > 0,$

where $\Phi(\cdot)$ is the distribution function (d.f.) of a standard Gaussian. The mean value of $Y_x$, the key quantity in this work, is

$E(Y_x) = E(x\beta + \varepsilon_x \mid x\beta + \varepsilon_x > 0) = x\beta + E(\varepsilon_x \mid \varepsilon_x > -x\beta) = x\beta + \sigma\lambda(x\beta/\sigma) \quad (1)$

where $\lambda(x) = \varphi(x)/\Phi(x)$ is the inverse Mills ratio (IMR).

¹Case b (when $D^* = Y^*$) is a particular case of c.
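The truncated mean in Eq. (1) is easy to verify numerically. The sketch below (an illustration with our own function names, assuming NumPy and SciPy, not code from the article) computes $x\beta + \sigma\lambda(x\beta/\sigma)$ and checks it against a Monte Carlo average of Gaussian draws retained only when they exceed $T = 0$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def truncated_mean(eta, sigma):
    # E(Y_x) for the truncated Tobit I: eta + sigma * lambda(eta/sigma),
    # where lambda = phi/Phi is the inverse Mills ratio (IMR)
    z = eta / sigma
    return eta + sigma * norm.pdf(z) / norm.cdf(z)

eta, sigma = 0.5, 1.0
y_star = rng.normal(eta, sigma, 1_000_000)   # latent draws of Y*
mc = y_star[y_star > 0].mean()               # keep only draws with Y* > 0
print(truncated_mean(eta, sigma), mc)        # the two values agree closely
```

Evaluated at $\eta = 0$, $\sigma = 1$, the same routine returns $\lambda(0) \approx 0.798$, the mean of a standard Gaussian truncated at zero.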


3.2. Tobit I Model

In this case, $Y_x = Y^*_x$ if $Y^*_x > 0$ and $Y_x = 0$ otherwise, so $Y_x$ has a censored Gaussian distribution such that

$f_{Y_x}(y, \eta_x) = \begin{cases} \frac{1}{\sigma}\varphi\!\left(\frac{y - \eta_x}{\sigma}\right) & \text{if } y > 0 \\ \Phi\!\left(-\frac{\eta_x}{\sigma}\right) & \text{if } y = 0. \end{cases}$

In this way, $E(Y_x) = E(x\beta + \varepsilon_x \mid x\beta + \varepsilon_x > 0)\,P(Y^*_x > 0) = x\beta\,\Phi(x\beta/\sigma) + \sigma\varphi(x\beta/\sigma)$ and $E(Y_x \mid Y^*_x > 0) = E(x\beta + \varepsilon_x \mid x\beta + \varepsilon_x > 0) = x\beta + E(\varepsilon_x \mid \varepsilon_x > -x\beta) = x\beta + \sigma\lambda(x\beta/\sigma)$.

3.3. Tobit II Model

In this case, for each individual $x$ there are two types of characteristics, $x_1$ (related to the variable of interest) and $x_2$ (related to the selection variable). This model can be seen as a bivariate model. Apart from $Y^*_x = x_1\beta_1 + \varepsilon_1$ (structural equation), now with a slightly different notation, there is also $D^*_x = x_2\beta_2 + \varepsilon_2$ (index equation), both Gaussian, $N(\mu_{x_1} = x_1\beta_1, \sigma_1)$ and $N(\mu_{x_2} = x_2\beta_2, \sigma_2)$, respectively. We further assume that $\varepsilon \equiv (\varepsilon_1, \varepsilon_2)$ is $N_2(0, \Sigma)$ with $\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$.

A true observation is not $(Y^*_x, D^*_x)$ but rather $(Y_x, D_x)$, with $D_x = 1$ if $D^*_x > 0$ and $D_x = 0$ if $D^*_x \le 0$, and the following observation rule:

$Y_x = Y^*_x,\; D_x = 1$ if $D^*_x > 0$;

$Y_x = 0,\; D_x = 0$ if $D^*_x \le 0$.

We therefore have a bivariate distribution whose p.d.f. is $f_{(Y_x, D_x)}(y, d) = [P(D_x = d)]^{1-d}\,[P(D_x = d \mid y)\, f_{Y_x}(y)]^d$, where $d \in \{0, 1\}$ and $y \in \mathbb{R}$.

Since $\varepsilon_2 \mid Y_x = y \sim N\!\left(\rho\sigma_2\frac{y - \eta_{x_1}}{\sigma_1},\; \sigma_2\sqrt{1 - \rho^2}\right)$, we have

$P(D_x = 1 \mid y) = P(\varepsilon_2 > -x_2\beta_2 \mid y) = \Phi\!\left(\frac{x_2\beta_2 + \sigma_2\rho(y - x_1\beta_1)/\sigma_1}{\sigma_2\sqrt{1 - \rho^2}}\right)$

and

$f_{(Y_x, D_x)}(y, d) = [1 - \Phi(x_2\beta_2/\sigma_2)]^{1-d}\left[\Phi\!\left(\frac{x_2\beta_2 + \sigma_2\rho(y - x_1\beta_1)/\sigma_1}{\sigma_2\sqrt{1 - \rho^2}}\right)\frac{\varphi\!\left(\frac{y - x_1\beta_1}{\sigma_1}\right)}{\sigma_1}\right]^d.$

In this way, $E(Y_x) = \Phi(x_2\beta_2/\sigma_2)\,x_1\beta_1 + \rho\sigma_1\varphi(x_2\beta_2/\sigma_2)$ and $E(Y_x \mid D_x = 1) = x_1\beta_1 + \rho\sigma_1\lambda(x_2\beta_2/\sigma_2)$.

Note: The model of the index equation is a standard probit model, describing the choice $D_x = 1$ or $D_x = 0$. Therefore, a standardization restriction is required, and one usually sets $\sigma_2^2 = 1$, such that $E(Y_x) = \Phi(x_2\beta_2)\,x_1\beta_1 + \rho\sigma_1\varphi(x_2\beta_2)$ and $E(Y_x \mid D_x = 1) = x_1\beta_1 + \rho\sigma_1\lambda(x_2\beta_2)$.
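With $\sigma_2 = 1$, the selected-sample mean $E(Y_x \mid D_x = 1) = x_1\beta_1 + \rho\sigma_1\lambda(x_2\beta_2)$ can likewise be checked by simulating the bivariate error $(\varepsilon_1, \varepsilon_2)$ and averaging $Y^*_x$ over the draws with $D^*_x > 0$. A minimal sketch (our own function names, assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def tobit2_selected_mean(eta1, eta2, rho, sigma1):
    # E(Y_x | D_x = 1) = eta1 + rho*sigma1*lambda(eta2), with sigma2 = 1
    return eta1 + rho * sigma1 * norm.pdf(eta2) / norm.cdf(eta2)

eta1, eta2, rho, sigma1 = 1.0, 0.2, 0.6, 1.5
cov12 = rho * sigma1                       # sigma_12, since sigma2 = 1
cov = [[sigma1**2, cov12], [cov12, 1.0]]
e1, e2 = rng.multivariate_normal([0.0, 0.0], cov, 1_000_000).T
mc = (eta1 + e1)[(eta2 + e2) > 0].mean()   # mean of Y* on the selected draws
print(tobit2_selected_mean(eta1, eta2, rho, sigma1), mc)
```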

4. Comparing the Mean Values of the Estimators in the Selected Sample

In summary, we have an individual $x$ from the population, and we want to compare the ML and OLS estimators of the three following parameters:

Case a. $\theta^a_x = E(Y_x)$,

Case b. $\theta^b_x = E(Y_x \mid Y^*_x > 0)$,

Case c. $\theta^c_x = E(Y_x \mid D^*_x > 0)$.


4.1. Truncated Tobit I and Tobit I Models

After collecting a sample from the population, suppose that $n_1$ is the initial size of the sample and $n$ the size of the selected sample $(X, Y)$. Writing $(\hat\beta, \hat\sigma)$ and $(\hat\beta_{OLS}, \hat\sigma_{OLS})$ for the ML and OLS estimators of $(\beta, \sigma)$, respectively, $\hat\theta_{x,ML} = x\hat\beta + \hat\sigma\lambda(x\hat\beta/\hat\sigma)$ and $\hat\theta_{x,OLS} = x\hat\beta_{OLS}$ will be the estimators for $\theta^a_x$ or $\theta^b_x$. The ML estimators of $\theta^a_x$, $\theta^b_x$, and $\theta^c_x$ are asymptotically unbiased (Aguirregabiria, 2009). From now on, we consider the entire selected sample instead of the individual $x$ when calculating the mean values of the estimators. In this situation, the mean values of case (a), $\theta^a = E(Y)$, and case (b), $\theta^b = E(Y \mid Y > 0)$, are equal; let $\theta \equiv \theta^a \equiv \theta^b$. As $\hat\theta_{ML}$ is asymptotically unbiased, $E(\hat\theta_{ML}) \approx \theta = \eta + \sigma\lambda$. In the OLS estimation, let $H = X(X^TX)^{-1}X^T$ be the hat matrix resulting from the sample selection process. This matrix is the orthogonal projection matrix of $Y$ onto the subspace defined by $X$. An interesting property of $H$ is that when $\eta$ is a linear combination of the columns of $X$, then $H\eta \equiv \eta$. As mentioned above, if $Y \equiv (Y^* \mid Y^* > 0)$ and $\eta = X\beta$, then, with $\lambda = \lambda(\eta/\sigma)$, $\theta = \eta + \sigma\lambda(\eta/\sigma) \equiv \eta + \sigma\lambda$. So,

$E(\hat\theta_{OLS}) = E(X\hat\beta_{OLS}) = X(X^TX)^{-1}X^T E(Y) = \eta + \sigma H\lambda(\eta/\sigma) \equiv \eta + \sigma H\lambda.$

What information can be lost when we replace $\lambda$ by $H\lambda$? The bias of $\hat\theta_{OLS}$ is $B = \mathrm{Bias}(\hat\theta_{OLS}) = E(\hat\theta_{OLS}) - \theta = -\sigma M\lambda$, where $M = I - H$.

Let us now state some properties of the bias of the OLS estimator of $\theta$, together with two decompositions of the sums of squares in the selected sample, which will lead us to the construction of the measure of the goodness of the OLS estimator. This is the fundamental contribution of the article.

4.1.1. Properties of the sums of squares in the selected sample.

1. As the first column of $X$ is $\mathbf{1}$, then $X^TB = -\sigma X^TM\lambda = \mathbf{0}$, such that $\mathbf{1}^TB = 0$, or $\sum_i B_i = 0$.

2. Given this result, if we define $\bar\theta = \frac{\sum_i \theta_i}{n}$ and $\overline{H\theta} = \frac{\sum_i (H\theta)_i}{n}$, we can then state that

$(\theta - \bar\theta)^T(\theta - \bar\theta) = (H\theta - \overline{H\theta})^T(H\theta - \overline{H\theta}) + B^TB. \quad (2)$

3. Similarly, $(\lambda - \bar\lambda)^T(\lambda - \bar\lambda) = \lambda^TM\lambda + (H\lambda - \overline{H\lambda})^T(H\lambda - \overline{H\lambda})$, such that

$\sigma^2(\lambda - \bar\lambda)^T(\lambda - \bar\lambda) = B^TB + \sigma^2(H\lambda - \overline{H\lambda})^T(H\lambda - \overline{H\lambda}). \quad (3)$

4. With $\lambda_c = \lambda - \bar\lambda$, $\eta_c = \eta - \bar\eta$, $H\theta_c = H\theta - \overline{H\theta}$, and $\theta_c = \theta - \bar\theta$, we can state that $\theta_c = \eta_c + \sigma\lambda_c$ and $H\theta_c = \eta_c + \sigma H\lambda_c$.

5. We can represent decompositions (2) and (3) graphically as two adjacent right triangles with the same leg $\sigma M\lambda_c$ (Fig. 1).

4.1.2. One measure to evaluate the goodness of OLS. The performance of the OLS estimator of $\theta$ relative to ML can be analyzed in two ways:

• a global evaluation, comparing $\theta_c$ with $H\theta_c$,

$\frac{(\eta_c + \sigma H\lambda_c)^T(\eta_c + \sigma H\lambda_c)}{(\eta_c + \sigma\lambda_c)^T(\eta_c + \sigma\lambda_c)}; \quad (4)$

• an examination of the bias, comparing $\sigma\lambda_c$ with $\sigma H\lambda_c$,

$\frac{(H\lambda_c)^T(H\lambda_c)}{(\lambda_c)^T(\lambda_c)}. \quad (5)$

Figure 1. Two adjacent right triangles with the same-length leg $\sigma M\lambda_c$, leading to $R_{sq}$. (Color figure available online.)

The measure we propose to evaluate the goodness of the OLS estimator, $R_{sq}$, is a balance between the two analyses. It is a convex linear combination of Equations (4) and (5). The weight, $w$ ($1 - w$, resp.), of each component represents the relative length of the corresponding vector $\sigma\lambda_c$ ($\theta_c$, resp.):

$R_{sq} = w\,\frac{(H\lambda_c)^T(H\lambda_c)}{(\lambda_c)^T(\lambda_c)} + (1 - w)\,\frac{(\eta_c + \sigma H\lambda_c)^T(\eta_c + \sigma H\lambda_c)}{(\eta_c + \sigma\lambda_c)^T(\eta_c + \sigma\lambda_c)}$

where $w = \frac{\|\sigma\lambda_c\|}{\|\sigma\lambda_c\| + \|\eta_c + \sigma\lambda_c\|}$ ($\|\cdot\|$ is the Euclidean norm).

Looking at Fig. 1, $R_{sq}$ is a weighted (by the lengths of the hypotenuses) average of the ratio between the squares of the lengths of the adjacent leg and the hypotenuse in the two triangles. This measure lies between 0 and 1 and is similar to the coefficient of determination $R^2$ in the linear model. $R_{sq} \approx 1$ ($R_{sq} \approx 0$) corresponds to good (bad) performance of OLS.
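As a concrete illustration of the measure (a sketch with our own naming, where $X = [\mathbf{1}, \eta]$ stands in for the design matrix, matching the two-dimensional projection used later in the simulation study), the following computes $R_{sq}$ from a vector $\eta$ and $\sigma$:

```python
import numpy as np
from scipy.stats import norm

def rsq(eta, sigma=1.0):
    # Rsq for the (truncated) Tobit I, with the hat matrix H built from
    # X = [1, eta]; in general H comes from the full design matrix
    lam = norm.pdf(eta / sigma) / norm.cdf(eta / sigma)
    X = np.column_stack([np.ones_like(eta), eta])
    H = X @ np.linalg.solve(X.T @ X, X.T)     # orthogonal projection matrix
    lam_c = lam - lam.mean()
    eta_c = eta - eta.mean()
    hlam_c = H @ lam - (H @ lam).mean()
    theta_c = eta_c + sigma * lam_c           # theta_c = eta_c + sigma*lam_c
    htheta_c = eta_c + sigma * hlam_c         # H*theta_c
    w = np.linalg.norm(sigma * lam_c) / (
        np.linalg.norm(sigma * lam_c) + np.linalg.norm(theta_c))
    return (w * (hlam_c @ hlam_c) / (lam_c @ lam_c)
            + (1 - w) * (htheta_c @ htheta_c) / (theta_c @ theta_c))

eta = -2.5 + 0.001 * np.arange(1000)  # narrow-range scenario: lambda ~ linear
print(rsq(eta))                       # close to 1: OLS performs well here
```

Both ratios are squared norms of orthogonal projections divided by the squared norms of the projected vectors, so each lies in $[0, 1]$ and so does their convex combination.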

4.2. Tobit II Model

Let us again consider a sample of size $n_1$. In this case, we have information in the matricial vector $(X_1, X_2, Y, D)$, and we again assume that the size of the selected sample is $n$. $X_j$, $j = 1, 2$, are the model matrices for $Y$ and $D$, respectively, with dimension $n_1 \times k_j$. Writing $(\hat\beta_1, \hat\beta_2, \hat\rho, \hat\sigma)$ for the ML estimator of $(\beta_1, \beta_2, \rho, \sigma)$, $\hat\theta_{x,ML} = x_1\hat\beta_1 + \hat\sigma\hat\rho\lambda(x_2\hat\beta_2)$ will be the ML estimator for $\theta^c_x$, while $\hat\theta_{x,OLS} = x_1\hat\beta_{1,OLS}$ will be the OLS estimator.

The ML estimator of $\theta^c_x$ is also asymptotically unbiased (Aguirregabiria, 2009). As mentioned above, if $Y \equiv (Y \mid D = 1, X_1, X_2)$, $\eta_1 = X_1\beta_1$, and $\eta_2 = X_2\beta_2$ for the rows of $X_2$ where $D = 1$, then $\theta = E(Y) = \eta_1 + \rho\sigma_1\lambda(\eta_2)$. Moreover, $E(\hat\theta_{OLS}) = X_1(X_1^TX_1)^{-1}X_1^T E(Y) = \eta_1 + \rho\sigma_1 H\lambda(\eta_2)$.

Once again, what information can be lost when we replace $\lambda(\eta_2)$ by $H\lambda(\eta_2)$?

4.2.1. Bias.

$B = \mathrm{Bias}(\hat\theta_{OLS}) = E(\hat\theta_{OLS}) - \theta = \rho\sigma_1(H - I)\lambda(\eta_2) = -\rho\sigma_1 M\lambda(\eta_2).$

4.2.2. Properties of the sums of squares in the selected sample. The properties of the sums of squares are unchanged (what matters is $X_1$), except for the inclusion of the new parameter $\rho$. Also, $\lambda$ is now evaluated at $\eta_2$ instead of $\eta_1/\sigma$.

4.2.3. One measure to evaluate the goodness of OLS.

$R_{sq} = w\,\frac{(H\lambda_c)^T(H\lambda_c)}{(\lambda_c)^T(\lambda_c)} + (1 - w)\,\frac{(\eta_c + \rho\sigma H\lambda_c)^T(\eta_c + \rho\sigma H\lambda_c)}{(\eta_c + \rho\sigma\lambda_c)^T(\eta_c + \rho\sigma\lambda_c)}$

where $w = \frac{\|\rho\sigma\lambda_c\|}{\|\rho\sigma\lambda_c\| + \|\eta_c + \rho\sigma\lambda_c\|}$.

5. Numerical Results

5.1. Simulation Study

We compare the $\theta$-estimation performance of OLS relative to ML by means of $R_{sq}$, which summarizes the bias properties of OLS. $R_{sq}$ is a function of $n$, $\rho$, $\sigma$, $\eta$ ($\eta_1$, $\eta_2$), $\lambda$, and $H\lambda$. To keep the simulation simple we fix $\rho$, $\sigma$, and $n$. As $\lambda$ and $H\lambda$ are functions of $\eta$, this will be the key quantity of the simulation study. The elements of $\eta$ are simulated as equally spaced by $h = \mathrm{range}(\eta)/n$, where $n$ is the size of the selected sample.

5.1.1. Tobit I. We fix $\sigma = 1$. We consider 50 scenarios (i.e., 50 different "selected samples"), each one represented by $\eta$, to calculate $R_{sq}$. $R_i$ represents $R_{sq}$ for scenario $i$, $i = 1, \ldots, 50$. $R_i$ is calculated from a sample of size $n = 1000$ (simulating the selected sample), where the elements of $\eta_i$ begin at $-2.5\sigma$ and are separated by $h_i = \ln(1 + e^{-10 + 2i/10})$. So, scenario 1 corresponds to a sample where the range of $\eta$ is small (the $\eta_1$ elements vary between $-2.5\sigma$ and $-2.445\sigma$), while scenario 50 corresponds to a sample where $\eta_{50}$ varies between $-2.5\sigma$ and about $690\sigma$.

Remarks. We begin each vector of $\eta$'s at $-2.5\sigma$, as selected values below this are negligible in probability. Once $\eta_i$ is obtained, we calculate the IMR $\lambda \equiv \lambda(\eta_i/\sigma)$. As in our simulation design $H$ is not known, instead of $H\lambda$ (the projection of $\lambda$ onto the subspace defined by the columns of $X$) we use the least-squares line of $\lambda(\eta_i/\sigma)$ on $\eta_i$ (the orthogonal projection onto a two-dimensional subspace). Therefore, the $R_{sq}$ we obtain is just a lower bound for the actual $R_{sq}$.

The numerical results obtained for $R_i$ are represented in Fig. 2 (left), as a function of the lag $h_i$.
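The design above can be reproduced in a few lines. The sketch below (our own code, not the authors' R script) builds $\eta_i$ for a given scenario, replaces $H\lambda$ by the least-squares line of $\lambda$ on $\eta_i$, and returns the lower bound for $R_{sq}$:

```python
import numpy as np
from scipy.stats import norm

def scenario_rsq(i, n=1000, sigma=1.0):
    # Scenario i of the Tobit I design: eta starts at -2.5*sigma and is
    # equally spaced by h_i = ln(1 + exp(-10 + 2i/10))
    h = np.log(1.0 + np.exp(-10.0 + 2.0 * i / 10.0))
    eta = -2.5 * sigma + h * np.arange(n)
    lam = norm.pdf(eta / sigma) / norm.cdf(eta / sigma)
    slope, intercept = np.polyfit(eta, lam, 1)  # least-squares line for H*lambda
    hlam = intercept + slope * eta
    lam_c, eta_c = lam - lam.mean(), eta - eta.mean()
    hlam_c = hlam - hlam.mean()
    theta_c, htheta_c = eta_c + sigma * lam_c, eta_c + sigma * hlam_c
    w = np.linalg.norm(sigma * lam_c) / (
        np.linalg.norm(sigma * lam_c) + np.linalg.norm(theta_c))
    return (w * (hlam_c @ hlam_c) / (lam_c @ lam_c)
            + (1 - w) * (htheta_c @ htheta_c) / (theta_c @ theta_c))

R = [scenario_rsq(i) for i in (1, 25, 50)]  # small, medium, and large lag h_i
print(R)
```

Consistently with Fig. 2 (left), the lower bound stays high across these scenarios when $\sigma = 1$.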

5.1.2. Tobit II. In the Tobit II model, with $\rho\sigma = 1$, $R_{sq}$ is calculated as a function of $\eta_1$ and $\eta_2$ or, better, of $h_1$ and $h_2$, the lags between successive values of $\eta_1$ and $\eta_2$; a $70 \times 70$ matrix corresponding to 4900 scenarios is obtained. Each $R_{ij}$ represents $R_{sq}$ for scenario $(i, j)$, $i = 1, \ldots, 70$, $j = 1, \ldots, 70$, obtained from a sample of size 1000 of the $\eta_{1i}$'s and a sample of size 1000 of the $\eta_{2j}$'s. The elements of $\eta_{1i}$ begin at zero and are separated by $h_{1i} = \ln(1 + e^{-10 + i/10})$. The elements of $\eta_{2j}$ begin at $-2.5$ and are separated by $h_{2j} = \ln(1 + e^{-10 + j/10})$.

Remarks. We begin each vector of $\eta_2$'s at $-2.5$, as selected values below this are negligible in probability. We begin each vector of $\eta_1$'s at 0 (as $R_{ij}$ does not depend on it).

Figure 2. Three-dimensional plots of $R_{sq}$ for Tobit I (left, $\sigma = 1$) and Tobit II (right, $\rho\sigma = 1$). (Color figure available online.)

Once $\eta_{2j}$ is obtained, we calculate $\lambda(\eta_{2j})$ and then the least-squares line of $\lambda(\eta_{2j})$ on $\eta_{2j}$ instead of $H\lambda(\eta_{2j})$. Once again, the calculated $R_{ij}$ is just a lower bound for the actual $R_{ij}$. The results for $R_{ij}$ are represented in Figs. 2 (right), 3, and 4, as a function of the lags $h_i$ and $h_j$.

5.1.3. Performance of the estimators. Let us first consider the effect of $\eta_1$ and $\eta_2$. Figure 3 suggests that the choice of approach depends on the variability of $\eta_1$ and $\eta_2$ required in the analysis. OLS can perform badly, with $R_{sq} \to 0$, when the variability in $\eta_1$ is small and the variability in $\eta_2$ is large (roughly, $\mathrm{range}(\eta_1) < 2$ and $\mathrm{range}(\eta_2) > 10$), but its performance improves as the variability in $\eta_1$ increases. $R_{sq} \to 1$ provided $\mathrm{range}(\eta_2)$ is small (roughly, $\mathrm{range}(\eta_2) < 4$), or $\mathrm{range}(\eta_1)$ is large (roughly, $\mathrm{range}(\eta_1) > 13$), or $\mathrm{range}(\eta_1) \ge \mathrm{range}(\eta_2)$. Now consider the effect of $\rho$ and $\sigma$: for the range of parameter values examined, the parameters $\sigma$ (Tobit I) and $\rho\sigma$ (Tobit II) were found to have a scale effect, but the former conclusion is maintained. When $\sigma = 1$ we can conclude from the Tobit II properties that OLS performs well ($R_{sq} > .88$) in the Tobit I model (see Fig. 2, left).

Two well-known examples in the literature are "Mroz87" and "RandHIE", referred to in Toomet and Henningsen (2008).

Figure 3. Contour plot of $R_{sq}$, by $h_1$ and $h_2$, with marks for the "Mroz87" and "RandHIE" examples; zoom on the area of the examples.


5.2. Examples

5.2.1. "Mroz87". The Mroz87 data frame contains data on 753 married women, collected within the Panel Study of Income Dynamics (PSID, http://psidonline.isr.umich.edu/). Of the 753 observations, the first 428 are for women who worked for pay, i.e., with positive hours worked in 1975 (the selected sample), while the remaining 325 observations are for women who did not work for pay in 1975. These data were used by Mroz (1987) for analyzing female labor supply using a Tobit II model.

Based on the ML method we obtained $\hat\sigma = 3.1$, $\hat\rho = -0.13$, $\mathrm{range}(\hat\eta_1) = 6.1$ ($\hat\eta_{1,\min} = 1.0$, $\hat\eta_{1,\max} = 7.1$), and $\mathrm{range}(\hat\eta_2) = 2.1$ ($\hat\eta_{2,\min} = -0.7$, $\hat\eta_{2,\max} = 1.4$). Had we used these values as input to the simulation scheme of the preceding section (with $h_1 = .0061$, $h_2 = .0020$, and $n = 1000$), we would have obtained a lower bound for $R_{sq}$ of $.9999$, which can be viewed graphically in Fig. 3. Taking the ML estimates as the true values of $\sigma$, $\rho$, $\eta_1$, and $\eta_2$, we would obtain $\hat R_{sq} = .972$. The residual sums of squares of the Tobit II model using ML and OLS were 4089.3 and 4095.2, respectively, and the maximum absolute difference between the observed and fitted values was 21.25 with ML and 21.28 with OLS, which shows that in this example the performance of the two methods is similar.

5.2.2. "RandHIE". The dataset used in this example comes from the RAND Health Insurance Experiment (1974–1982) (http://en.wikipedia.org/wiki/RAND Health Insurance Experiment). Cameron and Trivedi (2005, p. 553) use these data to analyze health expenditures based on a Tobit II model. Although the initial sample had 5574 values, the selected sample has size 4281.

Figure 4. Performance regions of OLS: .3 and .9 contour lines of $R_{sq}$.


Based on ML we obtained $\hat\sigma = 1.6$, $\hat\rho = 0.74$, $\mathrm{range}(\hat\eta_1) = 5.8$ ($\hat\eta_{1,\min} = 1.2$, $\hat\eta_{1,\max} = 6.9$), and $\mathrm{range}(\hat\eta_2) = 3.9$ ($\hat\eta_{2,\min} = -0.6$, $\hat\eta_{2,\max} = 3.3$). Using simulation again (with $h_1 = .0055$, $h_2 = .0037$, and $n = 1000$), we would have obtained a lower bound for $R_{sq}$ of $.986$ (Fig. 3). Taking the ML estimates as the true values of $\sigma$, $\rho$, $\eta_1$, and $\eta_2$, we would obtain $\hat R_{sq} = .982$. The residual sums of squares of the Tobit II model using ML and OLS were 8312.2 and 8307.2, respectively, and the maximum absolute difference between the observed and fitted values was 5.94 with ML and 5.93 with OLS. Once again, the performance of the two methods is similar.

6. Discussion

Our main objective in this article was to study the behavior of the OLS estimator of $E(Y)$ in a selected sample, in the Tobit I and II models, compared to that of the ML estimator. We know the latter is the best but also the more expensive method, while the former is not as efficient but easier to use and less expensive. But are there cases where the performance of the two is similar? In which situations? Although there are examples showing that OLS can incur large bias, we have found that, for $\mathrm{range}(\eta_1)$ large, or $\mathrm{range}(\eta_2)$ small, or $\mathrm{range}(\eta_1) \ge \mathrm{range}(\eta_2)$, OLS performs well. For comparison purposes we introduced the measure $R_{sq}$. Based on $R_{sq}$ we conducted a simulation study which showed that OLS can be used, roughly, whenever the range of $\eta_1$ is bigger than 13, or the range of $\eta_2$ is less than 4, or when $\mathrm{range}(\eta_1) \ge \mathrm{range}(\eta_2)$. Care has to be taken when $\mathrm{range}(\eta_1) < 2$, because unless $\mathrm{range}(\eta_2)$ is also very small, OLS can lead to substantial bias.

The two examples confirmed these conclusions: under the conditions of the examples, the OLS estimates were very similar to the ML ones. Figure 4 summarizes the main conclusions of our study, indicating the regions where OLS can substitute for ML, for ease of use, and also where OLS is to be avoided. It represents the .3 and .9 contours of $R_{sq}$. The white region corresponds to $R_{sq} < .3$ and the darker region to $R_{sq} > .9$, respectively the "bad" and "good" performance regions for OLS. So, provided careful attention is paid to the sample selection at hand, OLS has a role to play in estimating $E(Y)$ in the selected sample for regression models with sample selection, Tobit I and II.

References

Aguirregabiria, V. (2009). Some notes on sample selection models.

Amemiya, T. (1984). Tobit models: A survey. Journal of Econometrics 24:3–61.

Barnighausen, T., Bor, J., Wandira-Kazibwe, S., Canning, D. (2011). Correcting HIV prevalence estimates for survey nonparticipation using Heckman-type selection models. Epidemiology 22:27–35.

Cameron, A. C., Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. New York: Cambridge University Press.

Greene, W. H. (1981). Sample selection bias as a specification error: Comment. Econometrica 49:795–798.

Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5:120–137.

Heckman, J. J. (1978). Dummy endogenous variables in a simultaneous equation system. Econometrica 46:931–959.

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica 47:153–161.

Kim, J., Jang, S. (2010). Dividend behavior of lodging firms: Heckman's two-step approach. International Journal of Hospitality Management 29:413–420.

McBee, M. (2010). Modeling outcomes with floor or ceiling effects: An introduction to the Tobit model. Gifted Child Quarterly 54:314–320.

Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica 55:765–799.

Quartey, G., Watson, P., Narain, Y., Daniels, S. (2010). Assessing safety signals in large observational databases using Heckman's model. Drug Safety 33:132.

R Development Core Team. (2010). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Takada, H. (2008). Voting attitude in contemporary Japan: Analysis using the Heckman two-step estimation method of the Tobit model. Sociological Theory and Methods 23:19–37.

Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica 26:24–36.

Toomet, O., Henningsen, A. (2008). Sample selection models in R: Package sampleSelection. Journal of Statistical Software 27(7):1–23.

Vella, F. (1998). Estimating models with sample selection bias: A survey. Journal of Human Resources 33:127–169.