The overreliance on statistical goodness of fit and under-reliance on empirical validation in discrete choice models: A review of validation practices in the transportation academic literature.

Authors: Giancarlos Troncoso Parady, David Ory, Joan Walker

Abstract

In this article we reviewed validation practices in the peer-reviewed transportation literature of the past five years. We found that although 91% of studies reported goodness-of-fit statistics, and 66% reported some sort of policy-related inference analysis, only 17% reported any validation measure. Stronger criteria are needed to evaluate models in academia in order to increase the reliability of results and the credibility of inferences based on statistical models. We argue that model validation should be a non-negotiable part of model reporting and peer review in academic journals, and we propose a simple heuristic for choosing validation methods given available resources. At the same time, we recognize the need for proper incentives to promote better validation practices, and for tools and knowledge to do so, such as reporting guidelines and encouragement by journals of submissions that focus on validating existing models and theories, not only new, theoretically innovative models.
International Choice Modelling Conference
Kobe, Japan 19-21 August 2019
The overreliance on statistical goodness of fit and under-reliance on empirical validation in discrete choice models:
A review of validation practices in the transportation academic literature
Giancarlos Troncoso Parady The University of Tokyo
David Ory WSP
Joan Walker University of California, Berkeley
A credibility crisis in science and engineering?
Source: Baker, M. and Penny, D. (2016) ‘Is there a reproducibility crisis?’, Nature, 533 (7604)
Most published research findings are likely to be false due to factors such as lack of power of the study, small effect sizes, and great flexibility in research design, definitions, outcomes and methods.
(Ioannidis, 2005)
A credibility crisis in science and engineering?
What about the transportation field?
Demand overestimation: 30% for highway trips, 35% for transit trips (UK, 1962–1972).
Forecasts have not become more accurate over time (1969–1998) (Flyvbjerg, Skamris Holm, & Buhl, 2005).
Dependence on cross-section observational studies.
Classic scientific hypothesis testing is more difficult.
Underscores the need for proper validation practices.
Unlike the natural sciences, “there is little tradition of confronting and confirming predictions of cross-sectional models with outcomes in either back-casting or detailed before-and-after studies” (Boyce & Williams, 2015).
Demand forecasting is the “Achilles’ heel” of the transport planning model (Banister, 2002).
A credibility crisis in science and engineering?
While in practice a feedback loop exists between forecast outputs and implementation results, in the form of measurable forecasting errors, in academia such a feedback loop rarely exists.
Term definitions and research scope
Estimation: “the use of statistical analysis techniques and observed data to develop model parameters or coefficients.”
Calibration: “the adjustment of constants and other model parameters in estimated or asserted models in an effort to make the models replicate observed data for a base (calibration) year or otherwise produce more reasonable results.”
Validation: “the application of the calibrated models and comparison of the results against observed data.” This comparison is done in terms of predictive ability.
Sensitivity analysis: At the individual model level, refers to the analysis of changes in outcomes given changes in input variables (e.g., elasticities, marginal effects). At the system level, refers to the application of a model system using alternative input data or assumptions. (A minimal elasticity sketch follows below.)
System-wide validation: more common in practice.
Model-level validation: more common in research.
Scope is limited to discrete choice models in the peer-reviewed transportation literature
Source: Cambridge Systematics (2010)
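As an illustration of sensitivity analysis at the individual model level, here is a minimal sketch computing a point elasticity for a binary logit model. The alternative, attribute, and coefficient values are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical binary logit: P(transit) = 1 / (1 + exp(-(b0 + b_cost * cost))).
# The coefficient values below are made up for illustration only.
b0, b_cost = 0.5, -0.08

def p_transit(cost):
    """Choice probability of the transit alternative at a given cost."""
    return 1.0 / (1.0 + np.exp(-(b0 + b_cost * cost)))

# Point elasticity of the transit probability with respect to cost.
# For a binary logit, E = (dP/dcost) * (cost / P) = b_cost * cost * (1 - P).
cost = 10.0
p = p_transit(cost)
elasticity = b_cost * cost * (1.0 - p)
print(f"P(transit) = {p:.3f}, cost elasticity = {elasticity:.3f}")
```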
A general overview of model validation methods
[Diagram: estimation and calibration fit a model y = f(x) to observed data (x, y); validation asks whether the model’s predictions match observed outcomes.]
Validation:
• Check the predictive accuracy of a model
• Choose between alternative models
A general overview of model validation methods
Checking a model’s predictive ability: ideally we would be conducting randomized controlled trials, but…
Validation with:
• An independent sample from the same population (i.e., a new cross-section sample or a post-intervention sample): ideal, but limited by practical considerations; very little incentive to do so under the present peer-review scheme.
• A subset of the same sample (holdout, cross-validation): reduces overfitting risks, but still tied to the same data. (A minimal holdout sketch follows below.)
• A within-sample predictive check (information criteria, etc.): over-optimistic since it uses the same data, with a risk of overfitting; asymptotic equivalence with cross-validation relies on stronger distributional assumptions (Arlot & Celisse, 2009).
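A minimal sketch of holdout validation on simulated data: a binary logit is estimated on 80% of the sample and its mean log-likelihood loss is compared between the estimation and holdout subsets. The data-generating process and coefficients are fabricated for illustration; a gap between the two losses flags overfitting.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic binary choice data (illustrative): choice depends on a cost difference.
n = 2000
x = rng.normal(size=n)                    # e.g., cost difference between two alternatives
v = 0.3 - 1.2 * x                         # true utility (made-up coefficients)
y = (rng.random(n) < 1 / (1 + np.exp(-v))).astype(int)

# Holdout split: estimate on 80% of the sample, validate on the remaining 20%.
idx = rng.permutation(n)
train, hold = idx[: int(0.8 * n)], idx[int(0.8 * n):]

def neg_loglik(beta, xs, ys):
    """Negative log-likelihood of a binary logit with intercept and one attribute."""
    p = 1 / (1 + np.exp(-(beta[0] + beta[1] * xs)))
    return -np.sum(ys * np.log(p) + (1 - ys) * np.log(1 - p))

beta_hat = minimize(neg_loglik, x0=np.zeros(2), args=(x[train], y[train])).x

# Mean log-likelihood loss on estimation vs holdout data.
mll_train = neg_loglik(beta_hat, x[train], y[train]) / len(train)
mll_hold = neg_loglik(beta_hat, x[hold], y[hold]) / len(hold)
print(f"mean LL loss: estimation = {mll_train:.4f}, holdout = {mll_hold:.4f}")
```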
A general overview of model validation methods
Measure Abbrv.
Predicted vs observed outcomes PVO
Percentage of correct predictions (First Preference Recovery) FPR
% clearly right (t) %CR
% clearly wrong (t) %CW
% unclear (t) %U
Fitting factor FF
Maximum market share deviation MSD
Correlation Corr
Absolute percentage error APE
Sum of square error SSE
Root sum of square error RSSE
Mean absolute error MAE
Mean absolute percentage error MAPE
Mean squared error MSE
Root mean square error RMSE
Brier Score BS
χ² test CHISQ
Log-likelihood LL
Mean log-likelihood loss MLLL
ρ², likelihood ratio test (LR), AIC f(LL)
Performance measures:
• Direct prediction accuracy measures: directly interpretable; an objective indicator of the prediction accuracy of a model.
• Error-based measures and likelihood-based measures: scores are not directly interpretable and are only meaningful in relative terms; useful for model selection, but the best model among a set of models can still be a very bad model.
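To make the families concrete, here is a short numpy sketch computing one measure of each kind from predicted choice probabilities. The probabilities and observed choices are fabricated for illustration.

```python
import numpy as np

# Predicted choice probabilities for 5 individuals over 3 alternatives,
# plus their observed choices. All numbers are made up for illustration.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])
chosen = np.array([0, 1, 2, 1, 0])      # index of the observed choice
Y = np.eye(P.shape[1])[chosen]          # one-hot observed outcomes

# Direct prediction accuracy: first preference recovery (percent correct).
fpr = np.mean(P.argmax(axis=1) == chosen)

# Error-based: Brier score, and RMSE of predicted vs observed market shares.
brier = np.mean(np.sum((P - Y) ** 2, axis=1))
shares_rmse = np.sqrt(np.mean((P.mean(axis=0) - Y.mean(axis=0)) ** 2))

# Likelihood-based: mean log-likelihood loss of the observed choices.
mlll = -np.mean(np.log(P[np.arange(len(chosen)), chosen]))

print(f"FPR = {fpr:.2f}, Brier = {brier:.3f}, "
      f"shares RMSE = {shares_rmse:.3f}, mean LL loss = {mlll:.3f}")
```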
Validation and reporting practices in the transportation literature
Using the Web of Science Core Collection maintained by Clarivate Analytics, we reviewed validation and reporting practices in the transportation literature from the last 5 years (2014 to 2018). Articles were selected based on the following criteria:
• Peer-reviewed journal articles published between 2014 and 2018
• Analysis uses discrete choice models
• Target choice dimensions are destination choice, mode choice, and route choice
• Web of Science database fields are transportation; transportation science and technology; economics; civil engineering
• Research scope is limited to land transport and daily travel behavior (tourism, evacuation behavior, etc. were excluded)
• Articles use empirical data (studies using only numerical simulations were excluded)
• Methodological papers were included only if they use empirical data
Validation and reporting practices in the transportation academic literature
282 articles reviewed
Validation Method Abbrv. Percentage
Holdout validation HOV 52.1%
Repeated learning testing cross validation RLT 22.9%
Validation with independent sample from same population ISV 14.6%
Validation with post-intervention data PIDV 6.3%
K-fold cross validation K-CV 2.1%
Other* O 2.1%
91% reported goodness-of-fit statistics
66% reported a policy-related inference
Marginal effects, elasticities, odds ratios, value of time estimates,
marginal rates of substitution, and policy scenario simulations
17% reported a validation measure
*All indicators computed on calibration sample only
Validation and reporting practices in the transportation academic literature
How did ICMC2019 do? (Includes fields other than transportation; considers only what authors reported in the presentations.)
Of the 13 presentations in the first 2 days:
76.9% reported goodness-of-fit statistics
61.5% reported a policy-related inference
(Marginal effects, elasticities, odds ratios, value of time estimates, marginal rates of substitution, and policy scenario simulations)
23.1% reported a validation measure
Validation and reporting practices in the transportation academic literature
Evaluation measure Abbrv. % Studies
Log-likelihood LL 33.3%
Percentage of correct predictions or First Preference Recovery FPR 29.2%
Mean absolute error MAE 12.5%
Mean log-likelihood loss MLLL 12.5%
Predicted vs observed outcomes PVO 10.4%
Other functions of LL (ρ², AIC, likelihood ratio test (LR)) f(LL) 8.3%
% clearly right (t) %CR 6.3%
Mean absolute percentage error MAPE 6.3%
Root mean square error RMSE 6.3%
Absolute percentage error APE 4.2%
Chi-square CHISQ 4.2%
Sum of square error SSE 2.1%
% clearly wrong (t) %CW 2.1%
Mean squared error MSE 2.1%
Maximum market share deviation MSD 2.1%
Fitting factor FF 2.1%
Correlation Corr 2.1%
Brier Score BS 2.1%
73% of studies reported at least one likelihood-based or error-based measure
46% of studies reported at least one direct prediction accuracy measure
25% of studies reported at least one of each
*Note that some studies reported more than one measure
Recommended validation practices given available resources

Decision flow:
1. Is a randomized controlled trial possible? Yes → conduct a randomized controlled trial. No → go to 2.
2. Is there an independent dataset available? Yes → conduct validation with an independent dataset. No → go to 3.
3. Is the model too computationally intensive? No → conduct cross-validation (a minimal sketch follows below). Yes → conduct a holdout validation.

Then, is the validation data in disaggregate form?
• Yes → report: 1. predicted vs observed market shares (for route choice models, a correlation measure); 2. percentage of correct predictions; 3. a clearness of prediction measure; 4. an error-based or likelihood-based measure.
• No → report: 1. predicted vs observed market shares (for route choice models, a correlation measure); 2. an error-based performance measure.
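A minimal sketch of the cross-validation branch of the flow, assuming simulated data and a simple binary logit: the model is re-estimated on each training split and first preference recovery is scored on the held-out fold. All data and coefficients are fabricated for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic binary choice data (illustrative only).
n = 1000
x = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-(0.2 - 0.9 * x)))).astype(int)

def neg_loglik(beta, xs, ys):
    """Negative log-likelihood of a binary logit."""
    p = 1 / (1 + np.exp(-(beta[0] + beta[1] * xs)))
    return -np.sum(ys * np.log(p) + (1 - ys) * np.log(1 - p))

# 5-fold cross-validation: estimate on each training split, then score
# first preference recovery (percent correct) on the held-out fold.
k = 5
folds = np.array_split(rng.permutation(n), k)
scores = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    b = minimize(neg_loglik, np.zeros(2), args=(x[train], y[train])).x
    p_test = 1 / (1 + np.exp(-(b[0] + b[1] * x[test])))
    scores.append(np.mean((p_test > 0.5).astype(int) == y[test]))

print(f"cross-validated FPR: {np.mean(scores):.3f} (+/- {np.std(scores):.3f})")
```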
Towards better validation practices in the field
Make model validation mandatory:
Non-negotiable part of model reporting and peer-review in academic journals for any
study that provides policy recommendations.
• Cross-validation is the norm in machine learning studies.
Share benchmark datasets:
A fundamental limitation in the field is the lack of benchmark datasets and of a general culture of sharing code and data.
Incentivize validation studies:
Currently there is a lot of emphasis on theoretically innovative models.
Encourage submissions that focus on proper validation of existing models and theories.
Draw up and enforce clear reporting guidelines:
Require validation reporting in addition to detailed information on survey characteristics, such as the sampling method and a discussion of the representativeness of the data.
Efforts to improve reporting are well documented in other fields
(e.g., the STROBE statement (von Elm et al., 2007)).
Wait a minute…
“I’m not validating my model because I’m not trying to build a predictive framework. I’m trying to learn about travel behavior.”
The more orthodox the type of analysis conducted (such as the dimensions of travel behavior covered in this study), the stronger the onus of validation.
Wait a minute…
“Should every study that uses a discrete choice model conduct validation?”
In short, yes. At the very least, any article that makes policy recommendations should be subject to proper validation, given the dependence of the field on cross-section observational studies and the lack of a feedback loop in academia.
Wait a minute…
“Is what we learn about travel behavior from coefficient estimation less valuable if validation is not conducted?”
There is a myriad of reasons why some skepticism is warranted against any particular model outcome, the most obvious one being model overfitting.
Finally
Better validation practices will not solve the credibility crisis in the field, but they are a step in the right direction.
Model validation is no solution to the causality problem in the field, but we want to underscore that
the reliance on observational studies inherent to the field demands more stringent controls to
improve external validity of results.
International Choice Modelling Conference
Kobe, Japan 19-21 August 2019
gtroncoso@ut.t.u-tokyo.ac.jp
Thank you for listening.