The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models

... We apply the probit with sample selection (PSS) model, which remedies selection bias by allowing correlation between the unobservables in the selection and outcome equations (Heckman et al. 2001). The PSS model, proposed by van de Ven and van Praag (1981), adapts the Heckman model (Heckman 1976; originally designed to correct sample selection bias in linear regressions) to binary outcome variables. In the transportation domain, sample selection models have been applied for various purposes, one of the most common of which is to correct for residential self-selection effects (Cao 2009; Chen et al. 2017; van Herick & Mokhtarian 2020). ...
... To address the self-selection bias, Heckman (1976) proposed the sample selection model as a corrective method for linear regression models. Given the binary nature of the two decisions in our case (i.e., willing/unwilling to participate, respond/do not respond to the follow-up survey), we apply the analogous corrective method for discrete choice models, the probit with sample selection (PSS) model (van de Ven and van Praag 1981), to deal with the self-selection bias. ...
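For orientation, the model these excerpts refer to can be written in its usual textbook form. This is a sketch under assumed notation (the symbols $z_i$, $x_i$, $\gamma$, $\beta$, $\rho$, $\sigma$ are not taken from the excerpts), not the exact specification of any cited study.

```latex
% Selection equation: unit i is observed only when s_i = 1
s_i^{*} = z_i'\gamma + u_i, \qquad s_i = \mathbf{1}[\, s_i^{*} > 0 \,]

% Outcome equation (linear case): y_i is observed only if s_i = 1
y_i = x_i'\beta + \varepsilon_i

% Assumed joint normality of the unobservables
(u_i, \varepsilon_i) \sim N\!\left( 0,\;
  \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^{2} \end{pmatrix} \right)

% Conditional mean in the selected sample: the extra term is what
% a regression on the selected sample alone omits
E[\, y_i \mid x_i, z_i, s_i = 1 \,] = x_i'\beta + \rho\sigma\,\lambda(z_i'\gamma),
\qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}

% Probit with sample selection: binary outcome y_i = 1[x_i'\beta + \varepsilon_i > 0],
% with \sigma normalized to 1; likelihood contributions
P(y_i = 1, s_i = 1) = \Phi_2(x_i'\beta,\; z_i'\gamma;\; \rho)
P(y_i = 0, s_i = 1) = \Phi_2(-x_i'\beta,\; z_i'\gamma;\; -\rho)
P(s_i = 0) = \Phi(-z_i'\gamma)
```

Here $\phi$ and $\Phi$ are the standard normal density and distribution function, and $\Phi_2(\cdot,\cdot;\rho)$ is the standard bivariate normal distribution function with correlation $\rho$. When $\rho = 0$ the correction term vanishes and the selected sample yields unbiased estimates of $\beta$.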
Article
Declining survey response rates have increased the costs of travel survey recruitment. Recruiting respondents based on their expressed willingness to participate in future surveys, obtained from a preceding survey, is a potential solution but may exacerbate sample biases. In this study, we analyze the self-selection biases of survey respondents recruited from the 2017 U.S. National Household Travel Survey (NHTS), who had agreed to be contacted again for follow-up surveys. We apply a probit with sample selection (PSS) model to analyze (1) respondents’ willingness to participate in a follow-up survey (the selection model) and (2) their actual response behavior once contacted (the outcome model). Results verify the existence of self-selection biases, which are related to survey burden, sociodemographic characteristics, travel behavior, and item non-response to sensitive variables. We find that age, homeownership, and medical conditions have opposing effects on respondents’ willingness to participate and their actual survey participation. The PSS model is then validated using a hold-out sample and applied to the NHTS samples from various geographic regions to predict follow-up survey participation. Effect size indicators for differences between predicted and actual (population) distributions of select sociodemographic and travel-related variables suggest that the resulting samples may be most biased along age and education dimensions. Further, we summarize six model performance measures based on the PSS model structure. Overall, this study provides insight into self-selection biases in respondents recruited from preceding travel surveys. Model results can help researchers better understand and address such biases, while the nuanced application of various model measures lays a foundation for appropriate comparison across sample selection models.
... Since approved borrowers are not randomly selected from the population, estimating the effect of the main variables on the interest rate from the subset of loans (only approved loans) may introduce bias. To tackle this issue, I use the Heckman correction model to address omitted variable bias stemming from this specific sample selection problem (Heckman, 1976, 1979). ...
Preprint
With open banking, consumers take greater control over their own financial data and share it at their discretion. Using a rich set of loan application data from the largest German FinTech lender in consumer credit, this paper studies what characterizes borrowers who share data and assesses its impact on loan application outcomes. I show that riskier borrowers share data more readily, which subsequently leads to an increase in the probability of loan approval and a reduction in interest rates. The effects hold across all credit risk profiles but are the most pronounced for borrowers with lower credit scores (a higher increase in loan approval rate) and higher credit scores (a larger reduction in interest rate). I also find that standard variables used in credit scoring explain substantially less variation in loan application outcomes when customers share data. Overall, these findings suggest that open banking improves financial inclusion, and also provide policy implications for regulators engaged in the adoption or extension of open banking policies.
... Among these, the Heckman two-stage estimation approach (Heckman, 1976, 1979) is one of the most widely used. Although not extensively, the Heckman approach has also been applied in studies using the three databases analyzed in this article. ...
Article
This article outlines the main methodological implications of using Bloomberg SPLC, FactSet Supply Chain Relationships, and Mergent Supply Chain for academic purposes. These databases provide secondary data on buyer-supplier relationships that have been publicly disclosed. Despite the growing use of these databases in supply chain management (SCM) research, several potential validity and reliability issues have not been systematically and openly addressed. This article thus expounds on challenges of using these databases that are caused by (1) inconsistency between data, SCM constructs, and research questions (data fit); (2) errors caused by the databases' classifications and assumptions (data accuracy); and (3) limitations due to the inclusion of only publicly disclosed buyer-supplier relationships involving specific focal firms (data representativeness). The analysis is based on a review of previous studies using Bloomberg SPLC, FactSet Supply Chain Relationships, and Mergent Supply Chain, publicly available materials, interviews with information service providers, and the direct experience of the authors. Some solutions draw upon established methodological literature on the use of secondary data. The article concludes by providing summary guidelines and urging SCM researchers toward greater methodological transparency when using these databases.
... Consequently, omitting these observations can give rise to a selection bias that would distort the results if the analysis were restricted to the truncated sample of employed workers and inadequate employment were explained with a simple probit model. The econometric model best suited to this problem is therefore the probit model with two-step selection bias correction, following Heckman's method (Heckman, 1976, 1979; Van De Ven et al., 1981). ...
... These selected patients had higher baseline readmission rates than those who were not selected; hence, even though they received the intervention, they still had a higher chance of being readmitted than those who were not selected. To correct for the selection bias, we apply Heckman's two-step correction (Heckman 1976). See Appendix EC.5 for the complete details. ...
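As an illustration of the two-step correction mentioned in these excerpts, the following minimal Python sketch computes the inverse Mills ratio, the regressor that the second step adds to the outcome equation. It assumes the first-step probit index has already been estimated; the function names `inverse_mills_ratio` and `selected_mean_shift` are hypothetical and do not come from any cited paper or library.

```python
from statistics import NormalDist

# Standard normal distribution, used for both the density and the CDF.
_STD_NORMAL = NormalDist()


def inverse_mills_ratio(index: float) -> float:
    """Return lambda(c) = phi(c) / Phi(c) for a fitted probit index c.

    Assumption: `index` is the first-step linear predictor z'gamma for a
    unit that was selected into the sample. For very negative values the
    CDF underflows to zero, so this sketch is not numerically robust there.
    """
    return _STD_NORMAL.pdf(index) / _STD_NORMAL.cdf(index)


def selected_mean_shift(index: float, rho: float, sigma: float) -> float:
    """Return rho * sigma * lambda(index), the shift in the conditional
    mean of the outcome caused by selection under joint normality.

    `rho` (error correlation) and `sigma` (outcome error scale) are
    illustrative inputs; in practice their product is the coefficient
    estimated on the inverse Mills ratio in the second-step regression.
    """
    return rho * sigma * inverse_mills_ratio(index)


if __name__ == "__main__":
    # Units with a low selection index get a larger correction term.
    for c in (-1.0, 0.0, 1.0):
        print(c, inverse_mills_ratio(c))
```

In a full two-step procedure, the value of `inverse_mills_ratio` for each selected observation would be appended as an extra column in the outcome regression; a coefficient near zero on that column is consistent with little or no selection on unobservables.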
Preprint
Motivated by the emerging needs of personalized preventative intervention in many healthcare applications, we consider a multi-stage, dynamic decision-making problem in the online setting with unknown model parameters. To deal with the pervasive issue of small sample size in personalized planning, we develop a novel data-pooling reinforcement learning (RL) algorithm based on a general perturbed value iteration framework. Our algorithm adaptively pools historical data, with three main innovations: (i) the weight of pooling ties directly to the performance of decision (measured by regret) as opposed to estimation accuracy in conventional methods; (ii) no parametric assumptions are needed between historical and current data; and (iii) data sharing is required only via aggregate statistics, as opposed to patient-level data. Our data-pooling algorithm framework applies to a variety of popular RL algorithms, and we establish a theoretical performance guarantee showing that our pooling version achieves a regret bound strictly smaller than that of the no-pooling counterpart. We substantiate the theoretical development with empirically better performance of our algorithm via a case study in the context of post-discharge intervention to prevent unplanned readmissions, generating practical insights for healthcare management. In particular, our algorithm alleviates privacy concerns about sharing health data, which (i) opens the door for individual organizations to leverage public datasets or published studies to better manage their own patients; and (ii) provides the basis for public policy makers to encourage organizations to share aggregate data to improve population health outcomes for the broader community.
... When investigating which group an individual belongs to among the "met needs", "financial difficulty", "time constraint", and "lack of caring and support" groups, it is important to note that these can be observed only among individuals who have experienced healthcare needs. Otherwise, the results of the analysis might suffer from sample selection bias [24,25]. This study used a multivariable panel multinomial probit model with sample selection to correct for potential sample selection bias [26][27][28][29]. ...
Article
Using 68,930 observations selected from 16,535 adults in the Korea Health Panel Survey (2014–2018), this study explored healthcare barriers that prevent people from meeting their healthcare needs most severely during adulthood, and the characteristics that are highly associated with the barrier. This study derived two outcome variables: a dichotomous outcome variable on whether an individual has experienced healthcare needs, and a quadchotomous outcome variable on how an individual’s healthcare needs ended. An analysis was conducted using a multivariable panel multinomial probit model with sample selection. The results showed that the main cause of unmet healthcare needs was not financial difficulties but non-financial barriers, which were time constraints up to a certain age and the lack of caring and support after that age. People with functional limitations were at a high risk of experiencing unmet healthcare needs due to a lack of caring and support. To reduce unmet healthcare needs in South Korea, the government should focus on lowering non-financial barriers to healthcare, including time constraints and lack of caring and support. It seems urgent to strengthen the foundation of “primary care”, which is exceptionally scarce now, and to expand it to “community-based integrated care” and “people-centered care”.
Article
This paper studies semiparametric versions of the classical sample selection model (Heckman, 1976, 1979) without exclusion restrictions. We extend the analysis in Honoré and Hu (2020) by allowing for parameter heterogeneity and derive implications of this model. We also consider models that allow for heteroskedasticity and briefly discuss other extensions. The key ideas are illustrated in a simple wage regression for females. We find that the derived implications of a semiparametric version of Heckman’s classical sample selection model are consistent with the data for women with no college education, but strongly rejected for women with a college degree or more.
Article
The purposes of this study were to develop a profile of households who purchase coffee, to determine the socioeconomic determinants of the demand for coffee by US households, and to update the own‐price, cross‐price, and income elasticities of demand for US households. The profile suggests targeting wealthier households, households without children living in the household, older households, white households, and households located in the Pacific region and in New England. The own‐price elasticity of −1.93 indicates that the demand for at‐home purchases of coffee is elastic. Tea is a substitute for coffee. Given the income elasticity of 0.20, changes in household income are not likely to have sizeable impacts on at‐home coffee consumption. Nevertheless, coffee is a necessity in economic parlance. The source of data for this analysis was the NielsenIQ pertaining to 61,380 households for calendar year 2015. [EconLit Citations: D1, D9, D12].
Article
This paper investigates the within-country geographic mobility of researchers from their graduating universities and its relationship with publication productivity, paying special attention to the roles of Academicians, that is, members of the Chinese Academy of Sciences or the Chinese Academy of Engineering. Various individual and institutional characteristics of researchers in environmental science and engineering in China and their publication productivity are collected and analyzed. We find that researcher geographic mobility from graduating institutions is generally associated with individual traits such as academic rank and age, and institutional characteristics such as school ranking and location. Researchers moving further away from their graduating institutions are generally associated with lower productivity. The negative association can be explained by the fact that a significant portion of researchers stay within the same university where they received their highest degree, and these researchers are more productive than their peers who are hired from external institutions. Having an Academician in an institution is shown to be positively related to the likelihood of same-university hires and to higher publication productivity.
Article
In a two-step extremum estimation (M-estimation) framework with a finite-dimensional parameter of interest and a potentially infinite-dimensional first-step nuisance parameter, this paper proposes an averaging estimator that combines a semiparametric estimator based on a nonparametric first step and a parametric estimator which imposes parametric restrictions on the first step. The averaging weight is an easy-to-compute sample analog of an infeasible optimal weight that minimizes the asymptotic quadratic risk. Under Stein-type conditions, the asymptotic lower bound of the truncated quadratic risk difference between the averaging estimator and the semiparametric estimator is strictly less than zero for a class of data generating processes that includes both correct specification and varied degrees of misspecification of the parametric restrictions, and the asymptotic upper bound is weakly less than zero. The averaging estimator, along with an easy-to-implement inference method, is demonstrated in an example.
Article
We examine fluctuations in the predicted educational attainment of newly arrived legal U.S. immigrants between 1972 and 1999 by combining data from the U.S. Immigration and Naturalization Service with the Current Population Survey. A mid-1980s decline gave way to a noticeable improvement in the skill base of the immigrant population between 1987 and 1993. A short decline in the quality of immigrant skills—less severe than that of the mid-1980s—took place in the mid-1990s. In 1998, the trend reverses once more: The labor market quality of new legal U.S. immigrants improves. The primary sources of the fluctuations include changes in the quality and quantity of immigrants obtaining an adjustment and variations in the distribution of source regions and entry class types among new legal U.S. immigrants.