Journal of Choice Modelling 38 (2021) 100257
Available online 5 November 2020
1755-5345/© 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
The overreliance on statistical goodness-of-fit and underreliance on model validation in discrete choice models: A review of validation practices in the transportation academic literature
Giancarlos Parady a,*, David Ory b, Joan Walker c
a Department of Urban Engineering, The University of Tokyo, Japan
b WSP, USA
c Department of Civil and Environmental Engineering, University of California, Berkeley, USA
ARTICLE INFO
Keywords:
Validation
Generalizability
Transferability
Policy inference
Transportation
Discrete choice models
ABSTRACT
An examination of model validation practices in the peer-reviewed transportation literature published between 2014 and 2018 reveals that 92% of studies reported goodness-of-fit statistics, and 64.6% reported some sort of policy-relevant inference analysis. However, only 18.1% reported validation performance measures, out of which 78% (14.2% of all studies) consisted of internal validation and 22% (4% of all studies) consisted of external validation. The proposition put forward in this paper is that the reliance on goodness-of-fit measures rather than validation performance is unwise, especially given the dependence of the transportation research field on observational (non-experimental) studies. Model validation should be a non-negotiable part of presenting a model for peer review in academic journals. For that purpose, we propose a simple heuristic to select validation methods given the resources available to the researcher.
1. Introduction
Ioannidis (2005) brought to light the issue of lack of demonstrated reproducibility in the natural sciences and, in subsequent years, the so-called "reproducibility crisis" in science has made headlines in the popular media (Baker and Penny, 2016). The transportation domain can benefit from thinking about the underlying concerns about reproducibility.¹ Though the transportation field generally does not rely on experiments that can be readily retried, to be useful, the observational studies that we generally do rely on should generalize across time and space within reasonable boundaries. To do so, we argue that more attention needs to be paid to the validity of the models used to extract information from observational studies.
The proposition put forward by this paper is that the transportation community overrelies on statistical goodness-of-fit when assessing a model's performance and policy relevance. For example, it is common in the transportation space for models to be estimated from an observational study of travel behaviour. An analyst, to continue the example, may use such a data set revealing travel mode choice decisions and strive to quantify the relative onerousness of waiting for a bus relative to riding on a bus. Coefficients from a statistical model, say linear regression or discrete choice, may suggest that waiting is three times as unpleasant as riding. A typical analyst would take comfort in robust goodness-of-fit statistics, e.g., an R-squared or pseudo rho-squared in line with similar studies, as well as statistics showing the statistical significance of the coefficients for the key variables, e.g., a t-statistic suggesting the coefficients are likely not equal to zero. It is rare for an analyst to try and replicate these findings on a new sample or even against a "holdout" from the same sample.

* Corresponding author. The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan. E-mail address: gtroncoso@ut.t.u-tokyo.ac.jp (G. Parady).
¹ Note that in experimental studies the concept of reproducibility differs from the one commonly used in observational studies, and in this article.

https://doi.org/10.1016/j.jocm.2020.100257
Received 19 February 2020; Received in revised form 24 October 2020; Accepted 31 October 2020
To understand how goodness-of-fit and validation are approached in the transportation literature, we review their use in the peer-reviewed transportation academic literature published between 2014 and 2018, focusing on models that use discrete choice. The balance of the paper is organized as follows: In Section 2, we discuss the reproducibility crisis introduced above and its applicability to the study of transportation. In Section 3, we put forth an operational definition of validation and related concepts, to organize the myriad of related terms in the field. In Sections 4 and 5 we discuss the most commonly used validation methods and performance indicators, respectively. The results from the meta-analysis we conduct of the academic literature are presented in Section 6. In Section 7 we provide recommendations for improving validation practices in the field and end with concluding thoughts in Section 8.
2. A credibility crisis in science and engineering?
In 2016, the journal Nature published the results of a survey of its readers regarding reproducibility (Baker and Penny, 2016). A key finding was that "more than 70 percent of researchers tried and failed to reproduce another scientist's experiment, and more than half have failed to reproduce their own experiments" (p. 452). Furthermore, 52 percent of respondents stated that there is a significant reproducibility crisis.
This perception echoes the argument of Ioannidis (2005), who estimated the positive predictive value of research – that is, the probability that a reported finding is true – under different criteria. Ioannidis suggested that most published research findings are likely to be false, due to factors such as lack of statistical power of the study, small effect sizes, and great flexibility in research design, definitions, outcomes, and methods.
While the study by Ioannidis focused on experimental studies in natural science, the underlying concern is of relevance to the transportation research field.
Generally speaking, the purpose of academic transportation research is to better understand (and ultimately forecast) transport-related human behaviour to better inform transportation policy design and implementation. However, conducting experiments in the transportation field is often expensive and/or disruptive to transport users. Consider, for example, running an experiment in which a new subway line is constructed just to evaluate modal shift from automobile to transit. Although there are some examples of experimental economics applications (e.g. Holguín-Veras et al., 2004), the vast majority of transportation research relies on cross-sectional observational studies, which makes hypothesis testing in the classical way difficult.
Lack of model validation increases the risk of model overfitting, where the model fits the estimation data well but performs poorly outside this estimation dataset; in other words, the model is fitted to the noise instead of the signal in the data.
Despite the limitations imposed by observational studies, impact evaluation of policies drawn from model-based academic research is rarely conducted, meaning there is little feedback, if any, on how right or wrong these models and the policy recommendations derived from them are. Altogether, these issues strongly underscore the need to incorporate into the analysis means to evaluate the validity of results. However, we will show that the academic literature has overrelied on statistical goodness of fit and widely disregarded model validation.
3. Defining validation
The meaning of the term validation differs across fields, and even within fields, and certainly in the case of transportation, there is no agreed-upon definition. As such, to organize the myriad of validation-related terms, after reviewing the validation literature in different research fields, we propose a set of definitions that are adequate for the transportation field. Fig. 1 summarizes these concepts and illustrates the relationships among them.
We will start by defining six key terms adapted from Justice et al. (1999):
• Predictive accuracy: the degree to which predicted outcomes match observed outcomes. Predictive accuracy is a function of:
  o Calibration ability²: the ability of a model or system of models to make predictions that match observed outcomes.
  o Discrimination ability: the ability of a model or system of models to discriminate between those instances with and without a particular outcome.
• Generalizability: the ability of a model, or system of models, to maintain its predictive accuracy in a different sample. The generalizability of a model is a function of:
  o Reproducibility: the ability of a model, or system of models, to maintain its predictive ability in different samples from the same population.
  o Transferability: the ability of a model, or system of models, to maintain its predictive ability in samples from different but plausibly related populations or in samples collected with different methodologies. In other fields it is also called transportability.
² Note that in the transportation field the term calibration is commonly used to refer to the adjustment of constants or other parameters to match observed outcomes.

Fig. 1. Model validation is the evaluation of the generalizability of a statistical model. The evaluation of reproducibility is called internal validation, while the evaluation of transferability is called external validation. In-sample testing is not validation.

Given that the purpose of academic transportation research is to better understand (and ultimately forecast) transport-related human behaviour to better inform transportation policy design and implementation, the usefulness of a statistical model is a function
of its predictive accuracy outside the sample used to estimate it. We can thus define model validation as the evaluation of the generalizability of a statistical model. The evaluation of reproducibility is called internal validation, while the evaluation of transferability is called external validation. Note that internal validity is very often a precondition for external validity (Justice et al., 1999).
This definition outright excludes in-sample testing, such as goodness-of-fit statistics, from being considered a form of model validation. The "optimism" of in-sample estimates of predictive accuracy, or apparent predictive accuracy, is a well-documented phenomenon (Efron, 1983; Steyerberg et al., 2001), which stems from the fact that the sample data is used for both estimation and testing. Correcting for this is the very reason internal and external validation methods have been developed.
Irrespective of the type of validation, the process itself is conceptually simple. It consists of (i) estimating a model with an estimation sample e (sometimes called training sample), (ii) applying the estimated model to a validation sample v (sometimes called testing sample), and (iii) evaluating the generalizability of the estimated model given the defined performance measure(s) of predictive accuracy (see Section 5). Predictive accuracy is usually quantified as a function of the discrepancy between predicted and observed outcomes (i.e. prediction error).
In the case of internal validation, while using independent samples is ideal, in many cases producing such data is expensive, so data-splitting methods such as holdout and cross-validation, or resampling methods such as bootstrapping, are often used. These methods will be discussed in Section 4.
Regarding the evaluation of transferability (external validation), in the transportation field a significant number of studies were conducted during the 1980s on the subject (Ortúzar and Willumsen, 2011), even though it has "dropped off the radar" in recent years (Fox et al., 2014). In transportation, transferability has been defined as the "usefulness of a model, information or theory in a new context" (Koppelman and Wilmot, 1982) and as "the ability of a model developed in one context to explain behaviour in another under the assumption that the underlying theory is equally applicable in the two contexts" (Fox et al., 2014), definitions that are consistent with the definition presented above. Past studies on transferability have not only focused on predictive accuracy but also on the stability of parameters across contexts, especially in earlier studies (Koppelman and Wilmot, 1982; Ortúzar and Willumsen, 2011; Fox et al., 2014).
In transportation, the two key dimensions of interest are temporal transferability and spatial transferability, which refer to the potential to transfer a model to different points in time and to different spatial contexts (i.e. cities, regions), respectively.
There are, of course, limits to how much a model can be transferred, and it is obvious that in transportation research models are context dependent (Ortúzar and Willumsen, 2011). As such, rather than a pass/fail test, transferability is a matter of degree (Koppelman and Wilmot, 1982) and should be bounded by reasonable limits. No one would expect, for example, a discretionary activity destination choice model estimated for Tokyo to generalize well to Los Angeles. It would be hard to argue that these populations are "plausibly related." A similar argument can be made in terms of temporal transferability, which should in general be evaluated considering the time frame of the forecast of interest. Consider, for example, a 20-year forecast, a timeframe typical of transportation forecast models. In such a case, two independent samples from the same city collected six months apart are more likely different (yet contemporary) samples from the same population than temporally different samples. That is, they are too similar to constitute a proper external validation dataset. As such, the validation effort would in fact be an internal validation test.
Significant policy interventions (e.g. the construction of a new expressway or rail line) or high-impact external events (e.g. natural disasters, economic shocks, pandemics) can also be thresholds that make a population temporally different. As such, samples from before and after such an event can be considered temporally different samples.
As an aside, Justice et al. (1999) have also discussed, in the context of prognostic assessments, the issue of methodological transferability, which refers to model performance on data collected with different methodologies (e.g. different variable definitions, survey methods, etc.). Although potentially relevant to the field, this is a largely understudied issue; it is included in this article for completeness.
While it can be argued that, as an analysis of model sensitivity, the calculation of elasticities and marginal effects is part of the model validation process (Cambridge Systematics, 2010), as measures of effect size these are key policy-related values. As such, although the distinction might seem trivial, we classify these values as part of the policy-relevant inference analysis rather than part of the validation process, which focuses, as per the definition provided above, on predictive ability.
4. Data splitting and resampling methods for validation
Data splitting and resampling methods have become common ways to conduct validation, largely due to the high costs associated with producing independent datasets to test models against. While these methods are largely used for internal validation, it is possible to adapt them for external validation, although this is not very common (e.g. Austin et al. (2016) in epidemiology, and Sanko (2017) in transport). We will now briefly introduce these methods.
4.1. Holdout validation
Holdout validation (HOV) is the simplest data splitting method. In holdout validation, the dataset is randomly split into an estimation dataset and a validation dataset (holdout). For illustration purposes, let us define $Q[y_n, \hat{y}_n]$ as a measure of prediction correctness for the nth instance, for the binary choice case, as:

$$Q(y_n, \hat{y}_n) = \begin{cases} 0 & \text{if } \hat{y}_n = y_n \\ 1 & \text{if } \hat{y}_n \neq y_n \end{cases} \tag{1}$$

where $y_n$ is the observed outcome and $\hat{y}_n$ is the predicted outcome for instance n. The holdout estimator is

$$HOV = \frac{1}{N_v} \sum_{n_v=1}^{N_v} Q\!\left(y_{n_v}, \hat{y}^{\,e}_{n_v}\right) \tag{2}$$

where $\hat{y}^{\,e}_{n_v}$ is the predicted outcome for instance $n_v$ in sample v, using the model estimated with sample e, and $N_v$ is the validation sample size. The performance measure in equation (2) is called the misclassification rate, but as will be discussed in Section 5, there is a myriad of performance measures that can be used to evaluate predictive accuracy.
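As a minimal illustration of this procedure, the holdout estimator of Eqs. (1) and (2) can be sketched in a few lines of Python. The `fit`/`predict` callables and the toy majority-class "model" below are our own illustrative stand-ins, not a method from the reviewed literature:

```python
import random

def holdout_misclassification(data, fit, predict, v_frac=0.3, seed=0):
    """Holdout (HOV) estimate of the misclassification rate, Eq. (2).

    data    : list of (x, y) pairs
    fit     : callable, estimation sample -> fitted model
    predict : callable, (model, x) -> predicted outcome
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_v = int(len(shuffled) * v_frac)          # validation sample size N_v
    validation, estimation = shuffled[:n_v], shuffled[n_v:]
    model = fit(estimation)                    # estimate on sample e
    # Q(y, y_hat) = 1 if misclassified, 0 otherwise (Eq. 1)
    errors = sum(1 for x, y in validation if predict(model, x) != y)
    return errors / n_v

# Toy usage: a "model" that always predicts the majority class of sample e.
data = [("rider", 1)] * 70 + [("rider", 0)] * 30
fit = lambda sample: max(set(y for _, y in sample),
                         key=[y for _, y in sample].count)
predict = lambda model, x: model
hov = holdout_misclassification(data, fit, predict)
# hov is close to the 30% minority share on average
```

Because the split is random, the holdout estimate itself varies from split to split, which motivates the repeated-split methods of the next subsection.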
4.2. Cross-validation methods
When the holdout process is repeated multiple times, thus generating a set of randomly split estimation-validation data pairs, we refer to the validation procedure as cross-validation (CV). The basic cross-validation estimator is defined as

$$CV = \frac{1}{B} \sum_{b=1}^{B} HOV_b \tag{3}$$

where B is the number of estimation-validation data pairs generated, and $HOV_b$ is the holdout estimator for set b. Cross-validation methods differ from one another in the way the data is split. When the data splitting considers all possible estimation sets of size $N_c$, the splitting is exhaustive; otherwise the splitting is partial. Partial data splitting methods have the advantage of being less computationally intensive than exhaustive splitting. Here we will briefly define common cross-validation methods. We refer readers to the work of Arlot and Celisse (2009) for a more comprehensive review of cross-validation practices in general.
Regarding exhaustive splitting, in the leave-one-out (LOO) approach (Stone, 1974), sometimes referred to as the jackknife approach, the estimation set size is $N_c = N - 1$ and $B = N$. That is, the model is fitted leaving out one instance per iteration, and the outcome of that single instance is predicted based on the estimated model. In the leave-p-out (LPO) approach (Shao, 1993) the estimation set size is $N_c = N - p$.
Regarding partial data splitting, in the K-fold cross-validation (KCV) approach (Geisser, 1975), data is partitioned into K mutually exclusive subsets of roughly equal size, and $B = K$. It can be seen that the particular case of N-fold cross-validation is equivalent to the leave-one-out approach. In the case of repeated K-fold cross-validation, the process is repeated R times and the CV performance measures averaged over R. In the repeated learning-testing (RLT) approach (Burman, 1989), B randomly split estimation-validation pairs are generated. This method is also called repeated holdout validation.
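A minimal K-fold cross-validation loop over Eq. (3) can be sketched as follows. The midpoint-threshold "classifier" in the usage example is our own illustrative stand-in for an actual choice model:

```python
import random

def k_fold_cv(data, fit, predict, k=5, seed=0):
    """K-fold cross-validation estimate of the misclassification rate (Eq. 3):
    each of the K mutually exclusive folds serves once as the validation
    sample while the remaining folds form the estimation sample."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]       # roughly equal sizes
    hov = []
    for b in range(k):
        estimation = [pair for j in range(k) if j != b for pair in folds[j]]
        model = fit(estimation)
        errors = sum(1 for x, y in folds[b] if predict(model, x) != y)
        hov.append(errors / len(folds[b]))
    return sum(hov) / k                              # average over B = K pairs

# Toy usage: a one-dimensional two-class problem, classified by the midpoint
# between the two class means (a stand-in for a real choice model).
def fit(sample):
    x1 = [x for x, y in sample if y == 1]
    x0 = [x for x, y in sample if y == 0]
    return (sum(x1) / len(x1) + sum(x0) / len(x0)) / 2

predict = lambda threshold, x: int(x > threshold)
data = [(x, 0) for x in (0, 1, 2)] * 20 + [(x, 1) for x in (10, 11, 12)] * 20
cv = k_fold_cv(data, fit, predict, k=5)              # 0.0: separable toy data
```

Repeated K-fold cross-validation would simply wrap this function in an outer loop over R seeds and average the R estimates.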
As an aside, it is important to note that in-sample statistics have been proposed to aid in model selection, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Stone (1977) showed the asymptotic equivalence between cross-validation (specifically, the leave-one-out method) and the AIC statistic for model selection. However, the strength of validation lies in the quasi-universality of its applicability and its robustness to violations of the assumptions necessary for these statistics to be correct (Efron and Tibshirani, 1993; Arlot and Celisse, 2009).
4.3. Bootstrapping methods
Bootstrapping validation methods were proposed by Bradley Efron to address some of the limitations of cross-validation methods. Although (leave-one-out) cross-validation gives a nearly unbiased estimate of predictive accuracy, it often exhibits unacceptably high variability, particularly when the sample size is small, whereas bootstrapping methods have been shown to be more efficient (Efron, 1983; Efron and Tibshirani, 1995). We will briefly summarize some basic bootstrapping estimators borrowing from Efron and Tibshirani (1993, 1997), to whom we refer the reader for a more extensive treatment of bootstrapping for validation.
The idea of bootstrapping is conceptually simple. In the simplest case, given a dataset of size N, a bootstrap sample is generated by randomly sampling (with replacement) N instances from the original dataset and estimating the model on this sample. A prediction error estimate for this sample can be obtained by applying the model to the original sample. This process is repeated B times, and the prediction error averaged over B to obtain the simple bootstrap prediction error estimate,

$$BS_{simple} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{N} \sum_{n=1}^{N} Q\!\left(y_n, \hat{y}^{\,b}_n\right) \tag{4}$$
Another estimator is the leave-one-out bootstrap estimator, defined as

$$BS_{loo} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{B_n} \sum_{b \in C_n} Q\!\left(y_n, \hat{y}^{\,b}_n\right) \tag{5}$$

where, given a set of B bootstrap samples, $C_n$ is the subset of bootstrap samples not containing instance n, and $B_n$ is the size of $C_n$. This estimator can be considered a smoothed version of the leave-one-out cross-validation estimator, and while it is more efficient, it has been shown to be upward-biased (Efron and Tibshirani, 1995).
A more refined approach is to obtain bootstrap estimates of the optimism of the apparent prediction error (the prediction error estimated on the same sample used to estimate the model) and correct for it. The bootstrap estimator can then be defined as

$$BS = \overline{err} + \hat{\omega} \tag{6}$$

where $\overline{err}$ is the apparent prediction error (of the original sample), defined as

$$\overline{err} = \frac{1}{N} \sum_{n=1}^{N} Q(y_n, \hat{y}_n) \tag{7}$$

and $\hat{\omega}$ is a measure of optimism. One way to estimate $\hat{\omega}$ is as the difference between the simple bootstrap estimate (Eq. (4)) and the apparent prediction error of the bootstrap samples. Another way is using the 0.632 optimism estimator, defined as

$$\hat{\omega}_{.632} = .632\left[BS_{loo} - \overline{err}\right] \tag{8}$$

Plugging Equation (8) into Equation (6) results in the 0.632 bootstrap estimator (Efron, 1983)

$$BS_{.632} = .368\,\overline{err} + .632\,BS_{loo} \tag{9}$$

This estimator mitigates the upward bias of the leave-one-out bootstrap estimate. The weights for this estimator come from the fact that the probability of any instance being present in a given bootstrap sample is approximately 0.632. Although Efron discussed the 0.632 estimator in terms of the leave-one-out case, it can be generalized to non-exhaustive cases (Steyerberg et al., 2001).
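The 0.632 bootstrap of Eqs. (5), (7) and (9) can be sketched as follows; the toy midpoint-threshold classifier and data are our own illustration, not a model from the literature:

```python
import random

def bootstrap_632(data, fit, predict, n_boot=200, seed=0):
    """0.632 bootstrap estimate of prediction error, combining the apparent
    error (Eq. 7) with the leave-one-out bootstrap BS_loo (Eq. 5) using the
    .368/.632 weights of Eq. (9)."""
    rng = random.Random(seed)
    n = len(data)
    # Apparent prediction error on the original sample (Eq. 7)
    model_full = fit(data)
    err_app = sum(predict(model_full, x) != y for x, y in data) / n
    # Leave-one-out bootstrap (Eq. 5): for each instance n, average Q over
    # the bootstrap samples that do NOT contain it (the set C_n)
    q = [[] for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        model_b = fit([data[i] for i in idx])
        for i in set(range(n)) - set(idx):           # out-of-bag instances
            x, y = data[i]
            q[i].append(predict(model_b, x) != y)
    oob = [qi for qi in q if qi]                     # instances with B_n > 0
    bs_loo = sum(sum(qi) / len(qi) for qi in oob) / len(oob)
    return 0.368 * err_app + 0.632 * bs_loo         # Eq. (9)

# Toy usage: two well-separated classes and a midpoint-threshold classifier.
def fit(sample):
    x1 = [x for x, y in sample if y == 1]
    x0 = [x for x, y in sample if y == 0]
    return (sum(x1) / len(x1) + sum(x0) / len(x0)) / 2

predict = lambda threshold, x: int(x > threshold)
data = [(x, 0) for x in (0, 1, 2)] * 20 + [(x, 1) for x in (10, 11, 12)] * 20
bs = bootstrap_632(data, fit, predict)               # 0.0 for this toy data
```

Note that the number of bootstrap replications B trades computation for the variance of the estimate, and each instance must fall out-of-bag at least once for Eq. (5) to be defined for it.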
Although bootstrapping is rather common in transportation for standard error estimation, it is rarely used for model validation. And while it has been suggested that bootstrapping is superior to cross-validation (Efron, 1983; Steyerberg et al., 2001), other studies have suggested that repeated K-fold cross-validation (Kim, 2009) or stratified cross-validation (Kohavi, 1995) are superior. The fact of the matter is that the performance of these methods depends on data characteristics such as sample size, and on the type of model used. As such, we will refrain from making a recommendation, noting the need for an empirical study on the performance of different validation methods using typical data from transportation studies, such as household travel surveys, and typical models.
5. Performance measures
In the transportation literature, several articles have been published discussing validation methods for discrete choice models in detail, from Train (1978, 1979) to Gunn and Bates (1982) and Koppelman and Wilmot (1982), and more recently de Luca and Cantarella (2009). Using these studies as a point of departure, in this section we briefly review the key performance measures reported in the transportation literature. Although the list of measures discussed here is not exhaustive, it is comprehensive in that it covers the vast majority of measures reported for internal and/or external validation in the reviewed studies.
One of the simplest ways to evaluate the predictive ability of a model is to compare the predicted market shares against the observed shares. While very simple and easy to understand, this approach does not provide a quantitative measure to evaluate the level of agreement between predictions and observations.
A quantitative measure often used is the percentage of correct predictions of a model (sometimes called the First Preference Recovery), where the alternative with the highest probability is defined as the predicted choice. Although this measure is widely reported, its use in models with more than two alternatives has been criticized, since it cannot differentiate between different probabilities assigned to a chosen alternative (de Luca and Cantarella, 2009). de Luca and Cantarella illustrate this point with a choice exercise with three alternatives a, b, c, where the chosen alternative is a. With the percentage of correct predictions measure, a model that predicts choice probabilities of 0.34, 0.33 and 0.33 for alternatives a, b, c, respectively, is equivalent to a model that predicts 0.90, 0.05, 0.05, even though the latter model assigns a considerably higher probability to the first alternative and the former is very close to a random prediction. To overcome this limitation, they proposed additional measures to evaluate "clearness of predictions," a concept that bears some resemblance to the concept of discriminative ability. They proposed the percentage of clearly right choices, the percentage of clearly wrong choices and the percentage of unclear choices.
They defined the percentage of clearly right choices as "the percentage of users in the sample whose observed choices are given a probability greater than threshold t by the model". The idea here is that a model that predicts a chosen alternative with higher probability performs better than one that does so with a lower probability. Conversely, they defined the percentage of clearly wrong choices as "the percentage of users in the sample for whom the model gives a probability greater than threshold t to a choice alternative differing from the observed one." Finally, the percentage of unclear choices is the "percentage of users for whom the model does not give a probability greater than threshold t to any choice." To be meaningful, the threshold t must be "considerably larger" than $c^{-1}$, where c is the choice set size. While there is no agreed-upon definition of what qualifies as "considerably larger," if the threshold is just marginally larger than $c^{-1}$, the test becomes useless. As a reference, when reporting the percentage of clearly right choices, de Luca and Di Pace (2015) used a threshold of 0.9 for binary choice models, while Glerum, Atasoy and Bierlaire (2014) used a threshold of 0.5 for choice models with three alternatives. When in doubt, we recommend reporting results for different threshold values. For example, de Luca and Cantarella (2009) reported values for 0.50, 0.66, 0.90 in tabular format for a pair of models with four alternatives, as well as a plot for all threshold values above 0.5.
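These three percentages reduce to simple counts for a given threshold t. A sketch, using our own illustrative data structures (a probability dictionary per individual):

```python
def clearness_of_predictions(probs, chosen, t):
    """Percentage of clearly right, clearly wrong and unclear choices
    (de Luca and Cantarella, 2009) for a threshold t.

    probs  : list of dicts, alternative -> predicted choice probability
    chosen : list of observed choices, one per individual
    """
    n = len(chosen)
    right = sum(p[c] > t for p, c in zip(probs, chosen))
    wrong = sum(any(pr > t for alt, pr in p.items() if alt != c)
                for p, c in zip(probs, chosen))
    unclear = sum(all(pr <= t for pr in p.values()) for p in probs)
    return {"clearly right": 100 * right / n,
            "clearly wrong": 100 * wrong / n,
            "unclear": 100 * unclear / n}

# The two models from the three-alternative example, each predicting one
# individual who chose alternative a, evaluated at t = 0.5:
sharp = clearness_of_predictions([{"a": 0.90, "b": 0.05, "c": 0.05}], ["a"], 0.5)
flat = clearness_of_predictions([{"a": 0.34, "b": 0.33, "c": 0.33}], ["a"], 0.5)
# sharp: 100% clearly right; flat: 100% unclear
```

This makes the contrast in the example explicit: both models score the same First Preference Recovery, but only the sharper model produces a clearly right prediction at t = 0.5.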
The concordance probability (c-statistic) is a measure of discriminative ability in the case of binary choices. This measure estimates the probability that an individual who was observed choosing alternative a is assigned a higher probability of doing so than an individual who did not. This probability is calculated as the ratio between the number of concordant pairs and the number of comparable pairs. A comparable pair is a pair of individuals where one individual chose alternative a and one did not. Such a pair is concordant if the individual who chose alternative a was assigned a higher probability of doing so than the individual who did not choose it. The c-statistic equals 0.5 if the model has no discriminative ability and 1 for perfect discriminative ability.
Several extensions have been proposed for the multinomial choice case; see Calster et al. (2012) for a review of existing measures.
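A sketch of the pairwise computation for the binary case follows. Note that the verbal definition above leaves ties (equal predicted probabilities) unspecified; the code follows the common convention of counting tied comparable pairs as half concordant:

```python
def c_statistic(probs, chose_a):
    """Concordance probability (c-statistic) for a binary choice model.

    probs   : predicted probability of choosing alternative a, per individual
    chose_a : 1 if the individual was observed choosing a, 0 otherwise
    """
    p_chose = [p for p, y in zip(probs, chose_a) if y == 1]
    p_not = [p for p, y in zip(probs, chose_a) if y == 0]
    concordant = ties = 0
    for p1 in p_chose:            # every comparable pair: one chooser of a
        for p0 in p_not:          # paired with one non-chooser
            if p1 > p0:
                concordant += 1
            elif p1 == p0:
                ties += 1        # common convention: count as half concordant
    # 0.5 = no discriminative ability, 1.0 = perfect discrimination
    return (concordant + 0.5 * ties) / (len(p_chose) * len(p_not))

# 4 comparable pairs: 3 concordant and 1 tied -> 0.875
c = c_statistic([0.9, 0.8, 0.8, 0.2], [1, 1, 0, 0])
```

The quadratic pair loop is fine for survey-sized samples; for large N the same quantity can be computed in O(N log N) by ranking the probabilities.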
The fitting factor is defined as the ratio between the sum of the predicted choice probabilities of the chosen alternatives and the number of individuals. A fitting factor of 1 indicates a perfect forecast of all choices with predicted probability of 1 (de Luca and Cantarella, 2009).
The correlation between predictions and outcomes can be used to evaluate predictive performance when dealing with continuous outcomes such as ridership levels, traffic flow, etc.
Other commonly used measures include the sum of square error (SSE), mean square error (MSE), root sum square error (RSSE), root mean square error (RMSE), mean absolute error (MAE), absolute percentage error (APE), and mean absolute percentage error (MAPE). Although there are some differences between these measures, there is no consensus regarding which measure is better. While quadratic measures like MSE, RSSE and RMSE tend to put more weight on efficiency (Troncoso Parady et al., 2017), it has been argued that absolute measures like MAE are more natural and less ambiguous error measures than quadratic indicators (Willmott and Matsuura, 2005). It has also been suggested that RMSE is more appropriate when errors are expected to be Gaussian distributed (Chai and Draxler, 2014).
While in the transportation field it is common to use aggregate outcome measures such as market shares (Train, 1979) or choice frequencies (Koppelman and Wilmot, 1982), in most cases individual outcomes can also be used. For example, the Brier score is calculated as the mean square difference between predicted individual choice probabilities and actual choices across all choices, where the observed outcome is assigned a value of 1. This score has a minimum of 0 for perfect forecasting and a maximum of 2 for the worst possible forecast (Brier, 1950). The Brier score can be thought of as a disaggregate form of the MSE for discrete outcomes. Disaggregate forms of the MAE (de Luca and Cantarella, 2009) and RMSE can be calculated in a similar manner.
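The multi-category Brier score as described can be sketched directly (the probability-dictionary data structure is our own illustration):

```python
def brier_score(probs, chosen):
    """Multi-category Brier score (Brier, 1950): mean over individuals of the
    summed squared differences between predicted choice probabilities and the
    observed outcome coded as 1 for the chosen alternative, 0 otherwise.
    0 = perfect forecast, 2 = worst possible forecast."""
    total = 0.0
    for p, c in zip(probs, chosen):
        total += sum((pr - (1.0 if alt == c else 0.0)) ** 2
                     for alt, pr in p.items())
    return total / len(chosen)

# The two bounding cases for a binary mode choice:
perfect = brier_score([{"car": 1.0, "bus": 0.0}], ["car"])   # 0.0
worst = brier_score([{"car": 0.0, "bus": 1.0}], ["car"])     # 2.0
```

The worst case reaches 2 because the squared error of the chosen alternative (probability 0 instead of 1) and of the wrongly favoured alternative (1 instead of 0) each contribute 1.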
When using choice frequencies instead of market shares, it is possible to use the χ² test as a test of consistency between predictions and observations. The null hypothesis of this test is that the observed and expected frequency distributions are the same. Unlike the other measures reviewed so far, this is a pass/fail test. Although the χ² statistic is sometimes used with market shares, it must be noted that this test is only valid for frequencies.
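Computing the statistic itself is a one-liner over observed and model-predicted counts; the mode counts below are hypothetical:

```python
def chi_square_stat(observed, expected):
    """Chi-square statistic comparing observed choice frequencies against the
    frequencies predicted by the model. Only valid for frequencies (counts),
    not market shares; compare against the chi-square critical value with
    the appropriate degrees of freedom."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy usage: three modes, observed counts vs model-predicted counts.
stat = chi_square_stat([520, 300, 180], [500, 310, 190])
# stat is about 1.65, below the 5% critical value of 5.99 for 2 degrees of
# freedom, so the hypothesis of equal distributions would not be rejected.
```

In practice one would obtain the p-value from a χ² distribution routine (e.g. `scipy.stats.chisquare`) rather than a hard-coded critical value.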
Since the likelihood is proportional to the product of individual probabilities, likelihood-based loss functions are also frequently used. The log-likelihood is a natural measure given that maximum likelihood estimators are widely used in discrete choice models. The cross-entropy measure, which is essentially the negative of the log-likelihood function, is also a commonly used loss function in machine learning.
The transferability test statistic (TTS) is a likelihood ratio test between the base model applied to the transfer data and the model estimated on the transfer data. This statistic is χ²-distributed with degrees of freedom equal to the number of parameters in the model. As with the regular χ² test, strictly speaking this is a pass/fail test: it tests whether the model parameters are equal across contexts. However, we strongly agree with Koppelman and Wilmot (1982) that while such tests are useful to alert the analyst to differences between models, these differences should be interpreted against the acceptable error in each application context. This means focusing not on whether there is a difference, but on how big the difference is.
Other likelihood-based indices are the transfer index (TI) and the transfer rho-square proposed by Koppelman and Wilmot (1982). The transfer index measures the degree to which the transferred model (a model estimated on sample e and transferred to transfer sample v) exceeds a local reference model (e.g. a market share model estimated on transfer sample v), relative to a model estimated directly on the transfer sample v. This measure has an upper bound of one, in the case where the transferred model is as accurate as the local model, and takes negative values when the transferred model is worse than the local reference model. While the transfer index is a relative measure of transferability, the transfer rho-square can be used as an absolute measure. It is the usual likelihood ratio index, but used to evaluate the performance of the transferred model against the local reference model. This index is upper-bounded by the local rho-square and has no lower bound. Negative values are interpreted in a similar manner to those of the transfer index.
Table 1 summarizes the performance measures described above and their respective equations. Measures are classified into absolute measures, relative measures, and pass/fail measures. While relative measures are useful for model selection, they do not give an absolute measure of predictive accuracy. Absolute measures, on the other hand, can be used to generate benchmark values against which researchers can evaluate, to a certain extent, the performance of their models against similar studies in the literature. It must be noted, however, that even when using absolute measures, model performance is relative to factors such as choice set size and the base market share of alternatives. As such, comparisons across different studies must be interpreted with a clear understanding of these limitations, and not as an absolute indicator of model superiority.
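For instance, two of the absolute measures discussed above, first preference recovery and the Brier score, can be sketched as follows (the probabilities are hypothetical, for illustration only):

```python
import numpy as np

def first_preference_recovery(probs, chosen):
    """FPR: percentage of individuals whose observed choice has the highest
    predicted probability. Bounds [0, 100]."""
    return 100.0 * np.mean(np.argmax(probs, axis=1) == chosen)

def brier_score(probs, chosen):
    """Multinomial Brier score against a one-hot outcome vector. Bounds [0, 2]."""
    n, m = probs.shape
    y = np.zeros((n, m))
    y[np.arange(n), chosen] = 1.0
    return np.mean(np.sum((probs - y) ** 2, axis=1))

# Hypothetical predictions for 3 individuals and 3 alternatives
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])
chosen = np.array([0, 1, 0])
print(first_preference_recovery(probs, chosen))
print(brier_score(probs, chosen))
```

Note that both depend on the choice set size and base shares: with three equally likely alternatives, even a no-information model recovers first preferences about a third of the time.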
6. Validation and reporting practices in the transportation academic literature
Having summarized the basic validation process and the most used performance indicators in the literature, this section will review
G. Parady et al.
Journal of Choice Modelling 38 (2021) 100257
8
Table 1
Definition of model validation performance measures reported in the literature.

- Predicted vs observed outcomes (PVO). Type: –. Equation: –. Bounds: –/– (a). Simple comparison of predicted and observed outcomes or market shares, usually in the form of a table or plot; no prediction accuracy statistics are calculated.

- Percentage of correct predictions or First Preference Recovery (FPR). Type: absolute. Equation: $\frac{100}{N_v}\sum_{n_v=1}^{N_v} 1[y^e_{n_v} = y_{n_v}]$. Bounds: 0/100. $y_{n_v}$ is the observed choice made by individual $n$ in validation sample $v$, and $y^e_{n_v}$ is the choice with the highest predicted probability, predicted from the model estimated on sample $e$.

- % clearly right(t) (%CR). Type: absolute. Equation: $\frac{100}{N_v}\sum_{n_v=1}^{N_v} CR_{n_v}$, where $CR_{n_v} = 1$ if $\hat{P}(y^e_{n_v}) > t$, 0 otherwise. Bounds: 0/100. $\hat{P}(y^e_{n_v})$ is the estimated choice probability of the chosen alternative for individual $n$ in validation sample $v$, predicted from the model estimated on sample $e$; $\hat{P}(!y^e_{n_v})$ is the estimated choice probability of an alternative other than the chosen one; $t$ is an arbitrary threshold.

- % clearly wrong(t) (%CW). Type: absolute. Equation: $\frac{100}{N_v}\sum_{n_v=1}^{N_v} CW_{n_v}$, where $CW_{n_v} = 1$ if $\hat{P}(!y^e_{n_v}) > t$, 0 otherwise. Bounds: 0/100.

- % unclear(t) (%U). Type: absolute. Equation: $100 - (\%CR(t) + \%CW(t))$. Bounds: 0/100.

- Fitting factor (FF). Type: absolute. Equation: $\frac{1}{N_v}\sum_{n_v=1}^{N_v} \hat{P}(y^e_{n_v})$. Bounds: 0/1.

- Concordance statistic (C). Type: absolute. Equation: $\frac{1}{N_{c1}N_{c0}}\sum_{n_{c1}=1}^{N_{c1}}\sum_{n_{c0}=1}^{N_{c0}} C(\hat{P}(y^e_{n1,c1}), \hat{P}(y^e_{n0,c0}))$, where $C(\cdot,\cdot)$ equals 1 if $\hat{P}(y^e_{n1,c1}) > \hat{P}(y^e_{n0,c0})$, 0.5 if they are equal, and 0 otherwise. Bounds: 0/1. Given a binary choice situation, $N_{c1}$ is the subset of the sample that chose alternative $c$ and $N_{c0}$ is the subset that did not; $\hat{P}(y^e_{n1,c1})$ is the probability of choosing alternative $c$ for individuals who chose it, and $\hat{P}(y^e_{n0,c0})$ is that probability for individuals who did not. The $v$ subscript is omitted for simplicity.

- Correlation (CORR). Type: absolute. Equation: $corr(s_v, s^e_v)$. Bounds: −1/1. Correlation between predicted and observed outcomes. $s_v$ is a continuous aggregate outcome measure in sample $v$ (i.e. train ridership, etc.); $s^e_v$ is a continuous aggregate outcome measure predicted from the model estimated on sample $e$.

- Error (E). Type: relative. Equation: $s^e_{v,m} - s_{v,m}$. Bounds: –/–. $s_{v,m}$ is an aggregate outcome measure in sample $v$, such as the market share of alternative $m$ (i.e. modal market share), choice frequency, etc.; $s^e_{v,m}$ is the same measure predicted from the model estimated on sample $e$; $M$ is the number of alternatives in the choice set.

- Percentage error (PE). Type: absolute. Equation: $100 \cdot \frac{s^e_{v,m} - s_{v,m}}{s_{v,m}}$. Bounds: –/–.

- Absolute percentage error (APE). Type: absolute. Equation: $100 \cdot \left|\frac{s^e_{v,m} - s_{v,m}}{s_{v,m}}\right|$. Bounds: 0/–.

- Mean absolute percentage error (MAPE). Type: absolute. Equation: $\frac{100}{M}\sum_{m=1}^{M} \left|\frac{s^e_{v,m} - s_{v,m}}{s_{v,m}}\right|$. Bounds: 0/–.

- Sum of square error (SSE). Type: relative. Equation: $\sum_{m=1}^{M} (s^e_{v,m} - s_{v,m})^2$. Bounds: 0/– (b).

- Root sum of square error (RSSE). Type: relative. Equation: $\sqrt{\sum_{m=1}^{M} (s^e_{v,m} - s_{v,m})^2}$. Bounds: 0/– (b).
the validation and reporting practices in the peer-reviewed literature and show that the transportation academic literature has over-relied on statistical goodness-of-fit and disregarded model validation to a very large extent.
Using the Web of Science Core Collection maintained by Clarivate Analytics, we reviewed validation and reporting practices in the transportation academic literature published between 2014 and 2018. Articles were selected based on the following criteria:
(1) Peer-reviewed journal articles published between 2014 and 2018
(2) Analysis uses discrete choice models
Table 1 (continued)

- Mean absolute error (MAE). Type: aggregate: relative; disaggregate: absolute. Equation: $\frac{1}{M}\sum_{m=1}^{M} \left| s^e_{v,m} - s_{v,m} \right|$. Bounds: 0/– (b, c).

- Mean squared error (MSE). Type: aggregate: relative; disaggregate: absolute. Equation: $\frac{1}{M}\sum_{m=1}^{M} (s^e_{v,m} - s_{v,m})^2$. Bounds: 0/– (b, c).

- Root mean square error (RMSE). Type: aggregate: relative; disaggregate: absolute. Equation: $\sqrt{\frac{1}{M}\sum_{m=1}^{M} (s^e_{v,m} - s_{v,m})^2}$. Bounds: 0/– (b, c).

- Brier Score (BS). Type: absolute. Equation: $\frac{1}{N_v}\sum_{n_v=1}^{N_v}\sum_{m=1}^{M} (\hat{P}(y^e_{n_v,m}) - y_{n_v,m})^2$. Bounds: 0/2 (d). $\hat{P}(y^e_{n_v,m})$ is the predicted probability that individual $n$ chooses alternative $m$, predicted from the model estimated on sample $e$; $y_{n_v,m}$ is the actual outcome variable valued 0 or 1.

- Log-likelihood (LL). Type: relative. Equation: $LL_v(\hat{\beta}_e)$. Bounds: –/0. $LL_{v,r}(\hat{\beta}_e)$ is the log-likelihood of the model estimated on data $e$ applied to the validation data $v_r$; $N_{v,r}$ is the size of the validation (holdout) sample $r$, and $R$ is the number of validation samples generated.

- Log-likelihood loss (LLL) (e). Type: relative. Equation: $\frac{1}{R}\sum_{r=1}^{R} \frac{-1}{N_{v,r}} LL_{v,r}(\hat{\beta}_e)$, for $1 \leq r \leq R$. Bounds: 0/–.

- Rho-square (likelihood ratio index) (RHOSQ). Type: absolute. Equation: $\rho^2 = 1 - \frac{LL_v(\hat{\beta}_e)}{LL_v(0)}$. Bounds: 0/1. $LL_v(\hat{\beta}_e)$ is the log-likelihood of the model estimated on data $e$ applied to the validation data $v$; $LL_v(0)$ is the log-likelihood of the model when all parameters are zero for data $v$.

- Transfer rho-square (likelihood ratio index) (T-RHOSQ). Type: absolute. Equation: $\rho^2_{transfer} = 1 - \frac{LL_v(\hat{\beta}_e)}{LL_v(MS_v)}$. Bounds: –/$\rho^2_{local}$. $LL_v(MS_v)$ is a base model estimated on validation data $v$ (i.e. a market share model); $\rho^2_{local}$ is the local rho-square of the model.

- Transfer index (TI). Type: relative. Equation: $\frac{LL_v(\hat{\beta}_e) - LL_v(MS_v)}{LL_v(\hat{\beta}_v) - LL_v(MS_v)}$. Bounds: –/1. $LL_v(\hat{\beta}_v)$ is the log-likelihood of the model estimated on the validation data $v$.

- Transferability test statistic (TTS). Type: pass/fail. Equation: $-2(LL_v(\hat{\beta}_e) - LL_v(\hat{\beta}_v))$. Bounds: 0/–.

- χ² test (CHISQ). Type: pass/fail. Equation: $\sum_{m=1}^{M} \frac{(f_m - E(f^e_{v,m}))^2}{E(f^e_{v,m})}$. Bounds: 0/–. $f_m$ is the observed choice frequency of alternative $m$ in sample $v$, and $E(f^e_{v,m})$ is the expected choice frequency predicted from the model estimated on sample $e$.

Notes: (a) Bounded for market shares. (b) In the case aggregate outcomes are market shares, upper bounds are dependent on the choice set size. (c) Upper bounds exist for the disaggregate case. (d) For the specific case of binary choices, it is common to drop the second summation sign for simplicity; in this case the upper bound is 1. (e) A simple average over $R$, or a moving average across all validation sets, can be calculated.
(3) Target choice dimensions are
➢ Destination choice
➢ Mode choice
➢ Route choice
(4) Articles that analyse other choice dimensions are considered if and only if the article includes at least one of the three target choice dimensions defined in (3).
(5) Web of Science Database search keywords are:
➢ Destination choice model
➢ Mode choice model
➢ Route choice model
(6) Web of Science Database fields are:
➢ Transportation
➢ Transportation science and technology
➢ Economics
➢ Civil engineering
(7) Research scope is limited to human land transport and daily travel behaviour (tourism, evacuation behaviour, and freight transport articles are excluded)
(8) Analysis uses empirical data from revealed preference (RP) or joint revealed/stated preference (SP-RP) studies³ (studies using numerical simulations only are excluded)
(9) Methodological papers are only included if they use empirical data
A total of 226 articles met the above criteria. Note that although the choice dimensions are destination, mode and route choice, the definition of choice settings or choice sets differs by study. For example, in the category of mode choice, in addition to the traditional way of defining choices (i.e. car, rail, bus, walk, etc.), changes in mode choice are also included in the review. Similarly, in route choice analysis, in addition to link and path choices, simpler choice settings are also included, such as riding on a bike lane or on the walkway, taking the stairs or the escalator, or using a particular detour or not. In the case of route choice models, stochastic user equilibrium (SUE) models are excluded, as discrete choice models are just a subcomponent of a larger model.
Of the 226 articles reviewed, and consistent with standard practice in the field, 92% of articles reported a goodness-of-fit statistic, and 64.6% of articles reported some kind of policy-relevant inference analysis. This means going beyond simply discussing the statistical significance and direction of the estimated parameters, and focusing instead on effect magnitudes and on estimating values that are interpretable in a policy context, such as marginal effects, elasticities, odds ratios, marginal rates of substitution, and/or policy scenario simulations (see Table 5 in the appendix for the complete table summarizing the articles reviewed in this paper). This is a welcome finding amid criticism that focusing exclusively on statistical significance is widespread in many sciences at the expense of policy-relevant inferences, discussion of effect magnitudes, and tests of power (Ziliak and McCloskey, 2007).
Only 18.1% of the reviewed articles reported model validation, of which 78% (14.2% of all studies) consisted of internal validation and 22% (4% of all studies) of external validation. Table 2 summarizes the details of these studies. Note that only studies that reported explicitly and clearly how model validation was conducted were considered.
In terms of internal validation, as illustrated in Table 3, of the studies that reported any validation practice, 56.3% used the holdout validation method, followed by repeated learning-testing (25%); these two methods account for 81.3% of the studies reporting internal validation in the literature. 12.5% of studies used an independent sample for internal validation, with the remaining 87.5% relying on sample-splitting approaches.
The bootstrap method was only used by one study (Sanko, 2017), which used bootstrapping to evaluate the effects of data newness
and sample size on the temporal transferability of demand forecasting models.
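The repeated learning-testing procedure reported by these studies can be sketched as follows (the fit/evaluate functions below are placeholders for illustration, not any reviewed study's model):

```python
import numpy as np

def repeated_learning_testing(data, fit, evaluate, est_share=0.8, runs=20, seed=0):
    """Repeated learning-testing: draw random estimation/validation splits and
    average the validation metric over the runs."""
    rng = np.random.default_rng(seed)
    n = len(data)
    cut = int(est_share * n)
    scores = []
    for _ in range(runs):
        idx = rng.permutation(n)                         # new random split each run
        model = fit(data[idx[:cut]])                     # estimate on the learning part
        scores.append(evaluate(model, data[idx[cut:]]))  # score on the holdout part
    return float(np.mean(scores))

# Placeholder "model": the sample mean; placeholder metric: mean absolute error
data = np.arange(100, dtype=float)
score = repeated_learning_testing(
    data,
    fit=lambda d: d.mean(),
    evaluate=lambda m, d: float(np.mean(np.abs(d - m))),
    runs=20,
)
print(score)
```

In practice, `fit` would estimate the discrete choice model and `evaluate` would return one of the performance measures in Table 1 (e.g. the log-likelihood loss), averaged over the runs.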
Irrespective of the type of validation, as shown in Table 4, the log-likelihood (34.1%) and the log-likelihood loss (12.2%) jointly had the highest reporting share, with 46.3% of studies reporting either one of them.⁴ The second most frequently used measures were the predicted-vs-observed simple comparison and the percentage of correct predictions, with a 24.4% share each.
One limitation of likelihood-based measures such as LL and LLL is that they fail to provide an absolute measure of predictive accuracy. This is important because gains in predictive power of a "superior" model can be, in fact, very minimal. That is, the best model among a set of models can still be a very bad model prediction-wise. A similar argument can be made in relation to other relative performance measures. While we believe that model validation is not a pass/fail test, absolute measures of predictive ability such as the percentage of correct predictions, rho-square (7.3%), or the Brier score (2.4%), among others, are useful because they can be used to generate benchmark values against which researchers can evaluate, to a certain extent, the performance of their models against similar studies in the literature. For example, in the reviewed articles, the percentage of correct predictions for destination (or location) choice models ranged between 13% and 22%, while for mode choice it ranged from 36% to 87%, with most studies reporting values above the 60% threshold. Similarly, for route choice, values ranged between 51% and 73%. However, note that since the number of
³ Given that the error component variance for designed experiments will differ from the variance in the real world, the use of stated preference (SP) surveys for forecasting and the calculation of elasticities is not recommended (Hess and Rose, 2009; de Jong, 2014). As such, SP studies were excluded from the analysis. In the case of SP-RP studies, the SP error component can be calibrated against the RP data (de Jong, 2014).
⁴ Note that, as shown in Table 2, some articles reported more than one measure.
Table 2
Summary of articles reporting validation in the literature. For each article, the table lists the model(s), dependent variables, validation type, method, and notes, and marks (●) the evaluation measures reported: PVO, FPR, %CR, %CW, E/PE/APE, SSE, MAE, MAPE, MSE, RMSE, BS, C, CHISQ, MSD, FF, LL, LLL, RHOSQ, CORR, TTS, TI, OTHER.
Zimmermann
et al. (2018)
MNL;
MXL;
MRL
AC-DC-DT-MC
Internal RLT 54% estimation,
46% validation,
13 runs.
●
Danalet et al.
(2016)
MNL; DC Internal HOV 91.5%
estimation, 9.5%
validation (past
choices used for
estimation, most
recent choice
used for
validation).
● ●
Faghih-Imani and Eluru (2015)
MNL DC Internal HOV 75% estimation,
25% validation.
● ● ●
Assi et al.
(2018)
MNL;
NN*
MC Internal RLT 75% estimation,
25% validation, 3
runs.
●
Bohluli et al.
(2014)
MNL;
NL
MC External IS Validation data is
observed
ridership data
after
introduction of
new transit
service.
● ● ● ●
Bueno et al.
(2017)
MNL MC Internal RKCV 10-fold CV, 5 000 runs.
●
Glerum et al.
(2014)
HCM MC Internal HOV 80% estimation,
20% validation.
● ● ●
Hasnine and
Habib
(2018)
HDDC MC Internal HOV 80% estimation,
20% validation.
●
Kunhikrishnan
and
Srinivasan
(2018)
MNL MC Internal HOV 70% estimation,
30% validation.
● ●
Ma et al. (2015) MNL;
MXL;
LCM
MC Internal HOV 80% estimation,
20% validation.
●
Mahmoud et al.
(2016)
MNL;
PLC
MC Internal HOV 80% estimation,
20% validation.
●
Sanko (2014) MNL MC External IS Validation
against temporal
transfer sample.
● ●
Sanko (2016) MNL MC External IS ●
Validation
against temporal
transfer sample.
Sanko (2017) MNL MC External IS Validation
against temporal
transfer sample.
●
Sanko (2018) MNL MC External IS Validation
against temporal
transfer sample.
●
Vij and Krueger
(2017)
MNL;
MXL;
LCM
MC Internal HOV 90% estimation,
10% validation.
●
Vij and Walker
(2014)
LCM MC Internal HOV 90% estimation,
10% validation.
●
Wang et al.
(2014)
NL MC Internal IS Validation data is
observed mode
shares collected
by transit
agencies.
●
Weng et al.
(2018)
MNL MC Internal IS Validation data is smart card data.
●
Gokasar and
Gunay
(2017)
MNL MC Internal HOV 75% estimation,
25% validation.
● ●
Habib et al.
(2014)
hTEV MC; △MC External IS Validation
against temporal
transfer sample.
● ● ●
Chikaraishi and
Nakayama
(2016)
MNL;
WM; QL
MC; RC Internal RLT 50% estimation,
50% validation,
100 runs.
● ●
Idris et al.
(2015)
MNL;
NL;
HCM
MC; △MC Internal HOV Validation group
is subset of
observations
with an observed
mode shift (in SP
experiment).
● ● ●
Golshani et al.
(2018)
MNL;
ANN*;
JDC
MC-DT Internal HOV 80% estimation,
20% validation.
● ●
Suel and Polak
(2017)
NL O MC Internal HOV Validation
sample is
observations
from same
sample, for the
onemonth
●
period prior to
survey week.
Faghih-Imani and Eluru (2017)
LMM;
MNL
DC; O Internal HOV For AT/DT: approx. 96% estimation, 4% validation (1 week/24 weeks). For DC: holdout of 5 000 trips (estimation was conducted with sample sizes ranging from 1 000 to 20 000).
● ● ● ●
Alizadeh et al.
(2018)
PSL;
EPSL;
IAL; LK;
NL
RC Internal HOV 80% estimation,
20% validation, 1
run.
●
Kazagli et al.
(2016)
MNL RC Internal RLT 80% estimation,
20% validation,
100 runs.
●
Lai and Bierlaire
(2015)
MNL;
PSL;
CNL
RC Internal HOV 50% estimation,
50% validation. 3
runs,
performance
measures not
averaged.
●
Li et al. (2016) MXL
(PSL;
GNL;
LK; CL,
etc)
RC Internal SSO Two non-overlapping samples were generated from the original dataset for validation.
●
Mai (2016) NRL;
RCNL
RC Internal RLT 80% estimation,
20% validation,
20 runs.
●
Mai et al. (2017) RL;
NRL;
ML;
RRM
RC Internal RLT 80% estimation,
20% validation,
40 runs.
●
Mai et al. (2015) RL; NRL RC Internal RLT 80% estimation,
20% validation,
40 runs.
●
Papola (2016) CoRUM MC Internal HOV 70% estimation,
30% validation.
● ●
Zhang et al.
(2015)
MNL RC Internal IS Validation data is
an independent
observation of 3
sites plus a new
site.
●
Zhang et al.
(2018)
NL RC Internal HOV 85% estimation,
15% validation.
● ●
Zimmermann
et al. (2017)
NRL RC Internal RLT 80% estimation,
20% validation,
20 runs.
●
Xie et al. (2016) MXL RC Internal IS New
observations
from same site.
●
Duc-Nghiem et al. (2018) MNL RC External IS Validation data is pre-post data from a site not used in calibration.
● ●
Fox et al. (2014) NL MC-DC External IS Validation against temporal transfer sample.
● ● ● ●
Forsey et al.
(2014)
O MC External IS Validation
against temporal
transfer sample.
● ● ●
Model abbreviations:
CNL: Cross-nested logit; CoRUM: Combination of RUM models; EPSL: Extended path size logit; HCM: Hybrid choice model; hTEV: Heteroscedastic tree extreme value; IAL: Independent availability logit; IBL: Instance-based learning; LCM: Latent class model; LK: Logit kernel; MNL: Binary or multinomial logit; MNLp: Binary or multinomial logit with panel data corrections; MNP: Binary or multinomial probit; MRL: Mixed recursive logit; MXL: Mixed logit; NL: Nested logit; NN*: Neural network; O: Other; PLC: Parameterized logit captivity; PSL: Path size logit; QL: q-logit; RCNL: Recursive cross-nested logit; RRM: Random regret minimization; WM: Weibit model. *Machine learning models.

Dependent variable abbreviations:
AC: Activity choice; AT: Arrival time; DT: Departure time choice/time-of-day choice; DC: Destination choice; MC: Mode choice; △MC: Mode choice change; RC: Route choice; O: Other choice dimension.

Validation method abbreviations:
HOV: Holdout validation; IS: Validation against an independent sample; RKCV: Repeated K-fold cross-validation; RLT: Repeated learning-testing; SSO: Other sample splitting method.
studies is not large, a myriad of studies with different definitions of choice variables and choice set sizes were combined to calculate these values. As such, they need to be interpreted with caution. Furthermore, as pointed out by de Luca and Cantarella (2009), this index fails to discriminate between different degrees of predicted probabilities of correct predictions. Unfortunately, "clearness of prediction" measures that do account for this are still not very widely used. The percentage of clearly right index had a share of 2.4%, while the percentage of clearly wrong index had a 0% share. Similarly, the share of studies reporting measures of model discriminatory ability was 2.4%.
Because model performance is relative to factors such as choice set size and the base market share of alternatives, comparisons across models in different studies and contexts are complicated. Such comparisons must be interpreted with a clear understanding of these limitations, and not as absolute measures of model superiority.
Although ideally all studies would be externally validated, in most cases the types of validation that a researcher can conduct are limited by the availability of data and computational resources. But there is certainly room for improvement. Researchers should strive to conduct the best validation tests possible given the resources at hand and carefully report the details of how these tests were conducted, so that other researchers can clearly understand to what extent the results presented are generalizable. In Fig. 2 we illustrate a simple heuristic to determine the recommended validation practices given available resources, as well as which measures to report. We start by acknowledging that if a randomized controlled trial were possible, it would be the best alternative. That being said, as we discussed earlier, such an experiment is extremely difficult in the field.
The existence of data from either a different but plausibly related population, or from the same population, is one of the key factors defining what kind of validation is possible. The bottom line is conducting internal validation, either via data-splitting methods or bootstrapping, in the absence of independent datasets. Unless the model is too computationally intensive, we recommend avoiding holdout validation, as it is a pessimistic estimator that makes inefficient use of the data (Kohavi, 1995). Regarding the choice between cross-validation and bootstrapping, as discussed earlier, the performance of these methods depends on data characteristics such as sample size and the type of model used. In the absence of an empirical study comparing the performance of these methods using typical data from transportation studies and commonly used models, we will refrain from making a recommendation.
In terms of which performance measures to report, rather than relying on a single indicator we recommend reporting several indicators, as shown in Fig. 2, including measures comparable across studies as well as measures of discriminative ability and clearness of prediction. When using indicators for model selection, ideally the best model will excel in all performance measures, but there is no guarantee this is so. As such, the analyst should strive to clearly explain the criteria used for model selection.
Table 4
Predictive accuracy performance measures reported in the literature by frequency.
Performance measure | Abbrv. | Frequency | Percentage
Log-likelihood/log-likelihood loss | LL/LLL | 19 | 46.3%
Percentage of correct predictions or First Preference Recovery | FPR | 10 | 24.4%
Predicted vs observed market outcomes | PVO | 10 | 24.4%
Mean absolute error | MAE | 6 | 14.6%
Root mean square error | RMSE | 4 | 9.8%
Error/Percentage error/Absolute percentage error | E/PE/APE | 3 | 7.3%
Rho-square | RHOSQ | 3 | 7.3%
Transfer index | TI | 2 | 4.9%
% clearly right (t) | %CR | 1 | 2.4%
Brier score | BS | 1 | 2.4%
Chi-square | CHISQ | 1 | 2.4%
Concordance index | C | 1 | 2.4%
Correlation | CORR | 1 | 2.4%
Fitting factor | FF | 1 | 2.4%
Mean absolute percentage error | MAPE | 1 | 2.4%
Sum of square error | SSE | 1 | 2.4%
Transferability test statistic | TTS | 1 | 2.4%
All other measures specified in Table 1 | – | 0 | 0%
Other measures not specified in Table 1 | – | 3 | 7.3%
Note: very similar measures are reported jointly.
Table 3
Internal validation methods reported in the literature by frequency.
Method | Abbrv. | Frequency | Percentage
Holdout validation | HOV | 18 | 56.3%
Repeated learning-testing | RLT | 8 | 25.0%
Validation against an independent sample | IS | 4 | 12.5%
Repeated K-fold cross-validation | RKCV | 1 | 3.1%
Other sample splitting methods | SSO | 1 | 3.1%
7. Towards better validation practices in the field
While most researchers would agree that the purpose of travel demand analysis is to make valid predictions to aid effective policy evaluation (evident in that most studies reported some sort of policy implication), we showed that there is an evident disconnect between the objective of transportation research and the current practice of research, partially evident in the low levels of validation reporting in the literature. Strong pressure to publish among academics means little incentive to conduct proper validation. While producing independent samples for validation might be prohibitively expensive, whenever possible researchers should strive to produce such data or use existing periodical surveys such as household travel surveys. In the absence of such data, data-splitting and bootstrapping methods are inexpensive and should not be very different from drawing policy-relevant inferences in terms of the effort required.
Stronger criteria are needed to evaluate models in academia to increase the reliability of results and the credibility of inferences based on statistical models. We put forth a set of recommendations aimed at improving validation practices in the field:
(1) Make validation mandatory: model validation should be a non-negotiable part of model reporting and peer-review in academic journals for any study that provides policy recommendations. At the very least, internal validation results should be reported. Conducting internal validation is the norm in machine learning studies, and there is no reason why similar standards cannot be implemented in the field. This will provide better incentives to perform validation.
(2) Share benchmark datasets: as pointed out by an anonymous reviewer at the conference where the previous version of this article was presented, a fundamental limitation in the field is the lack of benchmark datasets and of a general culture of sharing code and data. Certainly, privacy concerns as well as institutional limitations impede the free sharing of most relevant and larger datasets (i.e.
Fig. 2. Heuristic to select validation method given available resources and recommended performance measures to report.
household travel surveys, etc.), and in most cases it is out of the control of individual researchers, but efforts should be made towards the collection and opening of relevant benchmark data.
(3) Incentivize validation studies: the most prestigious journals put a lot of emphasis on theoretically innovative models. While we recognize their importance, submissions that focus on proper validation of existing models and/or theories should be encouraged by journal editors.
(4) Draw and enforce clear reporting guidelines: efforts to improve reporting are well documented in other fields. For example, in the medical field, following a consensus among researchers that the reporting of published observational studies is inadequate and hinders the assessment of the quality and generalizability of these studies, guidelines have been developed to strengthen the reporting of observational studies (e.g. the STROBE statement (von Elm et al., 2007)). Although results are mixed regarding the impact of these guidelines (as in many journals the use of the guideline is recommended but not mandatory), it is a step in the right direction. A transportation-specific guideline could be developed and properly enforced where, in addition to detailed information on survey characteristics such as sampling method and representativeness of the data, validation reporting is required. It is worth noting that guidelines for travel model validation do exist for practitioners (i.e. Cambridge Systematics, 2010), so validation and reporting guidelines specific to transportation researchers would be a welcome contribution. Proper reporting should also include policy-relevant values such as elasticities, marginal effects and other inferences that properly convey the magnitude of the effects of interest. As mentioned earlier, 64.6% of the studies reviewed reported some sort of policy-relevant inference, a welcome finding. Models with high predictive power are not very useful if no policy implications can be drawn from them.
Finally, there is one argument and two questions commonly heard in academic circles related to validation practices that we would like to address.
The argument is that "I'm not validating my model because I'm not trying to build a predictive framework; I'm trying to learn about travel behaviour." In response, we argue that the more exploratory the subject is, the lesser the onus of validation, until some critical mass of research has been conducted. On the other hand, the more orthodox the type of analysis conducted (such as the dimensions of travel behaviour covered in this study), the stronger the onus of validation.
This response motivates the following questions: "Should every study using a discrete choice model conduct validation?" and "Is what we learn about travel behaviour from coefficient estimation less valuable if validation is not conducted?"
Regarding the first question, in short, yes. At the very least, any article that makes policy recommendations should be subject to proper validation, given the issues discussed in Section 2 about policy impacts and the lack of a feedback loop in academia. There is a myriad of reasons why some scepticism is warranted against any particular model outcome, the most obvious one being model overfitting. Proper validation should thus be used to strengthen the presented results, especially given the limitations discussed in Section 2 regarding the dependence on cross-sectional studies and the difficulties associated with scientific hypothesis testing.
Regarding the second question, while coefficients are useful for policy-relevant inferences, coefficients by themselves do not inform us about model predictive ability. Policy inference analysis should be complemented with evidence on the generalizability of such inferences.
Finally, while we recognize that better validation practices will not solve the credibility crisis in the field, they are certainly a step in the right direction. More specifically, model validation is no solution to the causality problem in the field (see Brathwaite and Walker (2018) for an in-depth discussion of causality in transportation studies), but we want to underscore that the reliance on observational studies inherent to the field demands more stringent controls to improve the validity of results. Although outside the scope of the present study, such controls should also include aspects such as sample representativeness, proper model specification, and the statistical power of effects of interest, which are also critical to the validity of results.
8. Conclusion
In this article we reviewed validation practices in the peer-reviewed transportation literature published between 2014 and 2018. We found that although 92% of studies reported goodness-of-fit statistics, and 64.6% reported some sort of policy-relevant inference analysis, the share of studies reporting validation stood at 18.1%.
We argued that model validation should be a non-negotiable part of model reporting and peer-review in academic journals and proposed a simple heuristic to choose validation methods given available resources. At the same time, we recognize the need for proper incentives to promote better validation practices, and for tools and knowledge to support them, such as reporting guidelines and the encouragement by journals of submissions that focus on the validation of existing models and theories, and not only on theoretically innovative new models.
Author statement
Giancarlos Parady conceived the study, conducted the literature review and wrote the first draft. Giancarlos Parady, David Ory and Joan Walker reviewed and revised the manuscript, confirmed the analysis results, and approved the final version.
Funding
This work was supported by JSPS KAKENHI Grants Number 17K14737 and 20H02266.
Declaration of competing interest
The authors report no conflict of interest.
Acknowledgements
An earlier version of this work was presented at the 6th International Choice Modelling Conference, Kobe, Japan, August 19–21, 2019.
Appendix
Table 5
Summary of reviewed articles

Columns: No | Article | Model(s) | Dependent variable(s) | Data structure (CS / RCS / PP / TP) | Data type (RP, RP-SP) | Goodness of fit reported (●) | Policy inference reported (●) | Validation reported (●)
1 Manoj and Verma (2015b) MNL AC; MC; DT ● RP ● ●
2 Sadhu and Tiwari (2016) NL AC-DC ● RP ● ●
3 Arman et al. (2018) MXNL AC-MC ● RP ●
4 He and Giuliano (2018) MNL DC ● RP ● ●
5 Mahpour et al. (2018) HCM DC ● RP ●
6 Basu et al. (2018) NL DC ● RP ●
7 Clifton et al. (2016) MNL DC ● RP ● ●
8 Danalet et al. (2016) MNLp; DC ● RP ● ● ●
9 Deutsch-Burgner et al. (2014) MNL DC ● RP ●
10 Faghih-Imani and Eluru (2015) MNL DC ● RP ● ● ●
11 Ho and Hensher (2016) MNL DC ● RP ●
12 Huang and Levinson
(2015)
MXL DC ● RP ● ●
13 Huang and Levinson
(2017)
MXL DC ● RP ● ●
14 Shao et al. (2015) MNL DC ● RP ● ●
15 Wang et al. (2017) SCL; MNL DC ● RP ●
16 Faghih-Imani and Eluru (2017) LMM; MNL DC; AT; DT ● RP ● ● ●
17 Khan et al. (2014) MNL DC; MC ● RP ●
18 Paleti et al. (2017) MXL; JDC DCDTO ● RP ● ●
19 González et al. (2016) NL(PSL) DC-RC ● RP ● ●
20 Ma et al. (2018) CNL DTMC ● RP ●
21 Abasahl et al. (2018) NL MC ● RP ●
22 Anta et al. (2016) MNL; NL MC ● RPSP ● ●
23 Assi et al. (2018) MNL MC ● RP ● ●
24 Aziz et al. (2018) MXL MC ● RP ● ●
25 Böcker et al. (2017) MNL MC ● RP ●
26 Bohluli et al. (2014) O MC ● RP ● ● ●
27 Braun et al. (2016) MNL MC ● RPSP ●
28 Bridgelall (2014) MNL MC ● RP ●
29 Bueno et al. (2017) MNL MC ● RP ● ● ●
30 Castillo-Manzano et al. (2015) MNL MC ● RP ● ●
31 Chakour and Eluru (2014) LCM MC ● RP ● ●
32 Cherchi and Cirillo (2014) MXL MC ● RP ● ●
33 Cherchi et al. (2017) MXL MC ● RP ●
34 Chica-Olmo et al. (2018) MNL MC ● RP ● ●
35 Clark et al. (2014) MXL MC ● RP ● ●
36 Cole-Hunter et al. (2015) MNL MC ● RP ● ●
37 Collins and MacFarlane
(2018)
MNL MC ● RP ● ●
38 Danaf et al. (2014) NL MC ● RP ● ●
39 de Grange et al. (2015) O MC ● RP ● ●
40 Di Ciommo et al. (2014) MXL MC ● RP ●
41 Ding et al. (2017) HCM MC ● RP ●
42 Ding et al. (2018) HCM MCMTO ● RP ●
43 Ding et al. (2014b) O MC ● RP ●
44 Dong et al. (2016) NL MC ● RP ●
45 Nguyen-Phuoc (2018) MNL MC ● RP ● ●
46 Nguyen-Phuoc (2018) MNL MC ● RP ● ●
47 Efthymiou and Antoniou
(2017)
HCM MC ● RP
48 Ermagun and Levinson
(2017)
CNL MC ● RP ● ●
49 Ermagun and Samimi
(2015)
NL MC ● RP ● ●
50 Ermagun et al. (2015) NL; O MC ● RP ● ●
51 Fernández-Antolín et al. (2016) O MC ● RP ● ●
O MC ● RP ● ●
52 Flügel et al. (2015) CNL MC ● RPSP ●
53 Forsey et al. (2014) O MC ● RP ● ● ●
54 Fu and Juan (2017) LCM MC ● RP ●
55 Gao et al. (2016) MNL MC ● RP ●
56 Gao et al. (2017) MNL MC ● RP ● ●
57 Gerber et al. (2017) MNL;
MXL
MC ● RP ●
58 Glerum et al. (2014) HCM MC ● RP ● ● ●
59 Goel and Tiwari (2016) MNL MC ● RP ●
60 Guan and Xu (2018) MNLp MC ● RP ● ●
61 Guerra et al. (2018) MNL MC ● RP ● ●
62 Guo et al. (2018) MXL MC ● RP ●
63 Habib and Sasic (2014) O MC ● RP ● ●
64 Habib and Weiss (2014) O MC ● RP ● ●
65 Habib et al. (2014) HCM MC ● RP ●
66 Halldórsdóttir et al. (2017) JMXL MC ● RP ● ●
67 Hasnine and Habib (2018) O MC ● RP ● ● ●
68 Hasnine et al. (2018) CNL MC ● RP ● ●
69 He and Giuliano (2017) MNL MC ● RP ● ●
70 Helbich (2017) MXL MC ● RP ●
71 Helbich et al. (2014) O MC ● RP ●
72 Hensher and Ho (2016) MXL MC ● RP ● ●
73 Hsu and Saphores (2014) MNL MC ● RP ● ●
74 Hurtubia et al. (2014) LCM MC ● RP ● ●
75 Irfan et al. (2018) MNL MC ● RPSP ● ●
76 Lin et al. (2018) LCM MC ● RP ● ●
77 Ji et al. (2017) NL MC ● RP ● ●
78 Kamargianni et al. (2014) HCM MC ● RP
79 Kamruzzaman et al.
(2015)
MNL MC ● RP ● ●
80 Keyes and Crawford-Brown (2018) MNL MC ● RP ● ●
MNL MC ● RP ● ●
81 Khan et al. (2016) MXL MC ● RP ● ●
82 Kunhikrishnan and
Srinivasan (2018)
MNL MC ● RP ● ● ●
83 Yang et al. (2017) MNL MC ● RP ●
84 Larsen et al. (2018) MNL MC ● RP ● ●
85 Lee (2015) MNL MC ● RP ●
86 Lee et al. (2017) MNL MC ● RP ● ●
87 Lee et al. (2014) MNL MC ● RP ● ●
88 Li and Kamargianni
(2018)
MXNL MC ● RPSP ● ●
89 Liu et al. (2018) MNL MC ● RP ● ●
90 Liu et al. (2015) MNL MC ● RP ● ●
91 Lorenzo Varela et al.
(2018)
ML; NL MC ● RP ● ●
92 Ma et al. (2015) LCM MC ● RP ● ● ●
93 Mahmoud et al. (2016) O MC ● RP ● ● ●
94 Mattisson et al. (2018) MNL MC ● RP ●
95 Mehdizadeh et al. (2018) MXL MC ● RP ●
96 Meng et al. (2016) MNL MC ● RP ● ●
97 Minal and Ravi (2016) MXL MC ● RP ● ●
98 Mitra and Buliung (2014) MNL MC ● RP ●
99 Mitra and Buliung (2015) MNL MC ● RP ● ●
100 Mitra et al. (2015) MNL MC ● ● RP ● ●
101 Moniruzzaman and Farber
(2018)
MNL MC ● RP ● ●
102 Myung-Jin et al. (2018) MNL MC ● RP ● ●
103 Noland et al. (2014) MXL MC ● RP ●
104 Owen and Levinson (2015) MNL MC ● RP ●
105 Park et al. (2015) MNL MC ● RP
106 Park et al. (2014) MNL MC ● RP
107 Paulssen et al. (2014) HCM MC ● RP ●
108 Pike and Lubell (2018) O MC ● RP ● ●
109 Pnevmatikou et al. (2015) NL MC ● RPSP ● ●
110 Prato et al. (2017) LCM MC ● RP ● ●
111 Ramezani, Pizzo and
Deakin (2018b)
MNL MC ● RP ●
112 Ramezani, Pizzo and
Deakin (2018a)
MNL MC ● RP ● ●
113 Rashedi et al. (2017) O MC ● RPSP ● ●
114 Rubin et al. (2014) MNLp MC ● RP ●
115 Rybarczyk and Gallagher
(2014)
MNL MC ● RP ●
116 Sanko (2014) MNL MC ● RP ● ●
117 Sanko (2016) MNL MC ● RP ● ●
118 Sanko (2017) MNL MC ● RP ● ●
119 Sanko (2018) MNL MC ● RP ● ●
120 Sarkar and Chunchu
(2016)
MNL MC ● RP ● ●
121 Sarkar and Mallikarjuna
(2018)
HCM MC ● RP ● ●
122 Scheepers et al. (2016) MNL MC ● RP ●
123 Shaheen et al. (2016) MNL MC ● RP ●
124 Sharmeen and
Timmermans (2014)
MNL MC ● RP ●
125 Shirgaokar and Nurul
Habib (2018)
HCM MC ● RP ● ●
126 Singh and Vasudevan
(2018)
MNL MC ● RP ● ●
127 Soltani and Shams (2017) NL MC ● RP ●
128 Stone et al. (2014) MNL MC ● RP ● ●
129 Sun et al. (2018) MNL MC ● RP ● ●
130 Thigpen et al. (2015) O MC ● RP ● ●
131 Toşa et al. (2018) NL MC ● RP-SP ● ●
132 Venigalla and Faghri
(2015)
MNL MC ● RP ● ●
133 Verma et al. (2015) MNL MC ● RP ● ●
134 Vij and Krueger (2017) MXL MC ● RP ● ● ●
135 Vij and Walker (2014) LCM MC ● RP ● ● ●
136 Vij et al. (2017) LCM MC ● RP ● ●
137 Wang et al. (2014) NL MC ● RP ● ● ●
138 Wang et al. (2015) MNP; O MC ● RP ● ●
139 Weiss and Habib (2018) O MC ● RP ● ●
140 Weng et al. (2018) MNL; O MC ● RPSP ● ●
141 Yang et al. (2014) MNL MC ● RP ● ●
142 Yang et al. (2018) MXL MC ● RP ● ●
143 Yang et al. (2016) NL MC ● RP ● ●
144 Yazdanpanah and Hadji
Hosseinlou (2017)
HCM MC ● RP ● ●
145 Yen et al., 2018a LCM MC ● RP ● ●
146 Yen et al., 2018b MNL MC ● RP ● ●
147 Zhang et al. (2017) MNL MC ● RP ●
148 Zhao and Li (2017) MLMNL MC ● RP ●
149 Zimmermann et al. (2018) MRL MC ● RP ● ● ●
150 Zolnik (2015) MLMNL MC ● RP ●
151 Gokasar and Gunay (2017) MNL MC ● RP ● ● ●
152 Tilahun et al. (2016) MNL MC ● RP ● ●
153 Astegiano et al. (2017) MXL; CNL MC; MTO ● RP ●
154 Heinen (2016) MNL MC; O ● RP ●
155 Chikaraishi and Nakayama
(2016)
O MC; RC ● ● RP ● ● ●
156 Habib et al. (2014) O MC; △MC ● RP ● ●
157 Idris et al. (2015) HCM MC; △MC ● RPSP ● ●
158 Ahmad Termida et al.
(2016)
MXL MC ● RP ●
159 Fatmi and Habib (2017) MXL MC ● RP ●
160 Heinen and Ogilvie (2016) MNL MC ● RP ● ●
161 Mitra et al. (2017) MNL MC ● RP ● ●
162 Rahman and Baker (2018) MNL MC ● RP ● ●
163 Standen et al. (2017) MNL MC; △RC ● RP ● ●
164 Llorca et al. (2018) MNL MC; DC; TF ● RP ● ●
165 Manoj and Verma (2015a) MNL MC; O ● RP ●
166 Rahul and Verma (2018) MNL; OLS MC; TD ● RP ●
167 Popovich and Handy
(2015)
MNL; OL MC; TF ● RP ●
168 Kristoffersson et al. (2018) NL MCDC ● RP ● ●
169 Fox et al. (2014) NL MCDC ● RP ● ●
170 Ding et al. (2014a) CNL MCDT ● RP ● ●
171 Golshani et al. (2018) NN; JDC MCDT ● RP ● ●
172 Ermagun and Samimi
(2018)
JDC MCO ● RP ● ●
173 Kaplan et al. (2016) O MCO ● RP ● ●
174 Xiqun et al. (2015) NL MCO ● RP ● ●
175 Habib (2014b) O MCOMTO ● RP ● ●
176 Habib (2014a) JDC MCTD ● RP ● ●
177 Schoner et al. (2015) O MCTF ● RP ●
178 Marquet and
MirallesGuasch (2016)
MNL MTO; MC ● RP ●
179 Shen et al. (2016) MNL; NL MTO; MC ● RP ● ●
180 Khan et al. (2014) O; MNL MTO; TF; MC;
O
● RP ● ●
181 Picard et al. (2018) NL MTOMC ● RP ● ●
182 Suel and Polak (2017) NL OMC ● RP ● ●
183 Yang (2018) MXL O; DC ● RP ●
184 Daisy et al. (2018) OP; MNL O; MC ● RP ●
185 Ho and Mulley (2015) NL OMC ● RP ● ●
186 Liu et al. (2017) O;
MDCEV
OMS ● RP ● ●
187 Zhang et al. (2014) NL ORC ● RP ● ●
188 Pang and Khani (2018) MNL;
MXL
DC ● RP ●
189 Alizadeh et al. (2018) NL RC ● RP ● ●
190 Anderson et al. (2014) MXL
(PSCL)
RC ● RP ● ●
191 Baek and Sohn (2016) MNL
(PSL)
RC ● RP ●
192 Basheer et al. (2018) MNL RC ● RP ●
193 Chen and Wen (2013) MNL RC ● RP ● ●
194 Chen et al. (2018) MXL(PSL) RC ● RP ● ●
195 Li et al. (2016) MXL(PSL;
GNL;
LK; CL,
etc)
RC ● RP ● ●
196 Dalumpines and Scott
(2017)
PSL RC ● RP ●
197 Di et al. (2017) MNP RC ● RP ● ●
198 Garcia-Martinez et al. (2018) MXL RC ● RP-SP ● ●
199 Ghanayim and Bekhor
(2018)
MXL(PSL;
CL)
RC ● RP ● ●
200 Jánošíková et al. (2014) MNL RC ● RP ● ●
201 Kazagli et al. (2016) O RC ● RP ● ● ●
202 Zhang et al. (2018) MNL RC ● RP ●
203 Lai and Bierlaire (2015) CNL RC ● RP ● ●
204 Mai (2016) RCNL RC ● RP ● ●
205 Mai et al. (2017) RL RC ● RP ● ●
206 Mai et al. (2015) NRL RC ● RP ● ●
207 Moran et al. (2018) MNL
(PSL)
RC ● RP ● ●
208 Oyama and Hato (2018) O RC ● RP ●
209 Papola (2016) O MC ● RP ● ● ●
210 Prato (2014) MNL
(RRM)
RC ● RP ● ●
211 Prato et al. (2018) GMXL
(PSL)
RC ● RP ● ●
212 Raveau et al. (2014) MNL(CL) RC ● ● RP ● ●
213 Thomas and Tutert (2015) MNL RC ● RP ●
214 Zhang et al. (2018) NL RC ● ● RP ● ●
215 Yamamoto et al. (2018) NL(PSL) RC ● RP ●
216 Zhang et al. (2015) MNL RC ● RP ●
217 Zhuang et al. (2017) MNL RC ● RP ●
218 Zimmermann et al. (2017) NRL RC ● RP ● ●
219 Jánošíková et al. (2014) MNL RC ● RP ● ●
220 Xie et al. (2016) MXL RC ● RP ● ●
221 Yang (2016) NL RC ● RP ● ●
222 Duc-Nghiem et al. (2018) MNL RC ● RP ● ● ●
223 Katoshevski et al. (2015) MNL; NL RLC; DC; MC;
AC
● RP ●
224 Bhat et al. (2016) O RLCMTOD
MC
● RP ● ●
225 Tran et al. (2016) O RLCOMC ● RP ● ●
226 Sener and Reeder (2014) O TFMC ● RP ●
Model abbreviations
CL: C-logit
CoRUM: Combination of RUM models
EPSL: Extended path size logit
FMS: Flexible model structure
GMXL: Generalized mixed logit
GNL: Generalized nested logit
HCM: Hybrid choice model
HDDC: Heteroskedastic dynamic discrete choice
hTEV: Heteroscedastic tree extreme value
IAL: Independent availability logit
IBL: Instance-based learning
JDC: Joint discrete-continuous
JMXL: Joint mixed logit
LCM: Latent class model
LK: Logit kernel
LMM: Linear mixed model
MDCEV: Multiple discrete-continuous extreme value
ML: Mother logit
MLMXL: Multilevel mixed logit
MNL: Binary or multinomial logit
MNLp: MNL with panel data corrections
MPM: Multinomial probit model
MRL: Mixed recursive logit
MVMLL: Multivariate multilevel binary logit
MVPM: Multivariate probit model
MXL: Mixed logit
MXNL: Mixed nested logit
NL: Nested logit
NN*: Neural network
NRL: Nested recursive logit
O: Other extensions/generalizations
OP: Ordered probit
PL: Polarized logit
PLC: Parameterized logit captivity
PSCL: Path size correction logit
PSL: Path size logit
QL: q-logit
RCNL: Recursive cross-nested logit
RL: Recursive logit
RRM: Random regret maximization
SCL: Spatially correlated logit
SVM*: Support vector machine
WB: Weibit
*Machine learning models.
**When several models of the same dependent variable are compared, only the most general form or best performing model is listed in the Model(s) column.

Dependent variable abbreviations
AC: Activity choice
AT: Arrival time
DC: Destination choice
DT: Departure time choice/time of day choice
MC: Mode choice
MS: Modal split
MTO: Mobility tool ownership
O: Other
RC: Route choice
RLC: Residential location choice
TF: Trip frequency

Data characteristics abbreviations
CS: Cross-sectional data
RCS: Repeated cross-section, pooled data
PP: Pseudo-panel data
TP: True panel data*
SP: Stated preference
RP: Revealed preference
*In this classification, true panel data are defined as any survey that measures travel behaviour of the same subjects at two or more different points in time. The smallest time unit is a day (for example, repeated observations in the same day are classified as cross-sectional data, while a travel behaviour survey of two or more days is considered true panel data). This classification is irrespective of the way the data were handled by the analyst. Stated preference surveys with multiple choice scenarios are considered cross-sectional data.
References
Abasahl, F., Kelarestaghi, K.B., Ermagun, A., 2018. Gender gap generators for bicycle mode choice in Baltimore college campuses. Travel Behaviour and Society 11, 78–85. https://doi.org/10.1016/j.tbs.2018.01.002.
Ahmad Termida, N., Susilo, Y.O., Franklin, J.P., 2016. Observing dynamic behavioural responses due to the extension of a tram line by using panel survey. Transport. Res. Pol. Pract. 86, 78–95. https://doi.org/10.1016/j.tra.2016.02.005.
Alizadeh, H., et al., 2018. On the role of bridges as anchor points in route choice modeling. Transportation 45 (5), 1181–1206. https://doi.org/10.1007/s11116-017-9761-7.
Anderson, M.K., Nielsen, O.A., Prato, C.G., 2014. Multimodal route choice models of public transport passengers in the Greater Copenhagen Area. EURO Journal on Transportation and Logistics 6 (3), 221–245. https://doi.org/10.1007/s13676-014-0063-3.
Anta, J., et al., 2016. Influence of the weather on mode choice in corridors with time-varying congestion: a mixed data study. Transportation 43 (2), 337–355. https://doi.org/10.1007/s11116-015-9578-1.
Arlot, S., Celisse, A., 2009. A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79. https://doi.org/10.1214/09-SS054.
Arman, M.A., Khademi, N., de Lapparent, M., 2018. Women's mode and trip structure choices in daily activity-travel: a developing country perspective. Transport. Plann. Technol. 41 (8), 845–877. https://doi.org/10.1080/03081060.2018.1526931.
Assi, K.J., et al., 2018. Mode choice behavior of high school goers: evaluating logistic regression and MLP neural networks. Case Studies on Transport Policy 6 (2), 225–230. https://doi.org/10.1016/j.cstp.2018.04.006.
Astegiano, P., et al., 2017. Quantifying the explanatory power of mobility-related attributes in explaining vehicle ownership decisions. Res. Transport. Econ. 66, 2–11. https://doi.org/10.1016/j.retrec.2017.07.007.
Austin, P.C., et al., 2016. Geographic and temporal validity of prediction models: different approaches were useful to examine model performance. J. Clin. Epidemiol. 79, 76–85. https://doi.org/10.1016/j.jclinepi.2016.05.007.
Aziz, H.M.A., et al., 2018. Exploring the impact of walk–bike infrastructure, safety perception, and built environment on active transportation mode choice: a random parameter model using New York City commuter data. Transportation 45 (5), 1207–1229. https://doi.org/10.1007/s11116-017-9760-8.
Baek, J., Sohn, K., 2016. An investigation into passenger preference for express trains during peak hours. Transportation 43 (4), 623–641. https://doi.org/10.1007/s11116-015-9592-3.
Baker, M., Penny, D., 2016. Is there a reproducibility crisis? Nature 533 (7604), 452–454. https://doi.org/10.1038/533452A.
Basheer, S., Srinivasan, K.K., Sivanandan, R., 2018. Investigation of information quality and user response to real-time traffic information under heterogeneous traffic conditions. Transportation in Developing Economies 4 (2), 1–11. https://doi.org/10.1007/s40890-018-0061-5.
Basu, D., et al., 2018. Modeling choice behavior of non-mandatory tour locations in California – an experience. Travel Behaviour and Society 12, 122–129. https://doi.org/10.1016/j.tbs.2017.04.008.
Bhat, C.R., et al., 2016. On accommodating spatial interactions in a Generalized Heterogeneous Data Model (GHDM) of mixed types of dependent variables. Transp. Res. Part B Methodol. 94, 240–263. https://doi.org/10.1016/j.trb.2016.09.002.
Böcker, L., van Amen, P., Helbich, M., 2017. Elderly travel frequencies and transport mode choices in Greater Rotterdam, The Netherlands. Transportation 44 (4), 831–852. https://doi.org/10.1007/s11116-016-9680-z.
Bohluli, S., Ardekani, S., Daneshgar, F., 2014. Development and validation of a direct mode choice model. Transport. Plann. Technol. 37 (7), 649–662. https://doi.org/10.1080/03081060.2014.935571.
Brathwaite, T., Walker, J.L., 2018. Causal inference in travel demand modeling (and the lack thereof). Journal of Choice Modelling 26, 1–18. https://doi.org/10.1016/j.jocm.2017.12.001.
Braun, L.M., et al., 2016. Short-term planning and policy interventions to promote cycling in urban centers: findings from a commute mode choice analysis in Barcelona, Spain. Transport. Res. Pol. Pract. 89, 164–183. https://doi.org/10.1016/j.tra.2016.05.007.
Bridgelall, R., 2014. Campus parking supply impacts on transportation mode choice. Transport. Plann. Technol. 37 (8), 711–737. https://doi.org/10.1080/03081060.2014.959354.
Brier, G., 1950. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78 (1), 2–4.
Bueno, P.C., et al., 2017. Understanding the effects of transit benefits on employees' travel behavior: evidence from the New York–New Jersey region. Transport. Res. Pol. Pract. 99, 1–13. https://doi.org/10.1016/j.tra.2017.02.009.
Burman, P., 1989. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76 (3), 503–514.
Calster, B. Van, et al., 2012. Extending the c-statistic to nominal polytomous outcomes: the Polytomous Discrimination Index. Stat. Med. 31, 2610–2626. https://doi.org/10.1002/sim.5321.
Cambridge Systematics, 2010. Travel Model Validation and Reasonability Checking Manual, 2nd ed. Federal Highway Administration.
Castillo-Manzano, J.I., Castro-Nuño, M., López-Valpuesta, L., 2015. Analyzing the transition from a public bicycle system to bicycle ownership: a complex relationship. Transport. Res. Transport Environ. 38, 15–26. https://doi.org/10.1016/j.trd.2015.04.004.
Chai, T., Draxler, R.R., 2014. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geosci. Model Dev. (GMD) 7 (3), 1247–1250. https://doi.org/10.5194/gmd-7-1247-2014.
Chakour, V., Eluru, N., 2014. Analyzing commuter train user behavior: a decision framework for access mode and station choice. Transportation 41 (1), 211–228. https://doi.org/10.1007/s11116-013-9509-y.
Chen, D.J., Wen, Y.H., 2013. Effects of freeway mileage-based toll scheme on the short-range driver's route choice behavior. J. Urban Plann. Dev. 140 (2), 04013012. https://doi.org/10.1061/(asce)up.1943-5444.0000167.
Chen, P., Shen, Q., Childress, S., 2018. A GPS data-based analysis of built environment influences on bicyclist route preferences. International Journal of Sustainable Transportation 12 (3), 218–231. https://doi.org/10.1080/15568318.2017.1349222.
Cherchi, E., Cirillo, C., 2014. Understanding variability, habit and the effect of long period activity plan in modal choices: a day to day, week to week analysis on panel data. Transportation 41 (6), 1245–1262. https://doi.org/10.1007/s11116-014-9549-y.
Cherchi, E., Cirillo, C., Ortúzar, J. de D., 2017. Modelling correlation patterns in mode choice models estimated on multiday travel data. Transport. Res. Pol. Pract. 96, 146–153. https://doi.org/10.1016/j.tra.2016.11.021.
Chica-Olmo, J., Rodríguez-López, C., Chillón, P., 2018. Effect of distance from home to school and spatial dependence between homes on mode of commuting to school. J. Transport Geogr. 72, 1–12. https://doi.org/10.1016/j.jtrangeo.2018.07.013.
Chikaraishi, M., Nakayama, S., 2016. Discrete choice models with q-product random utilities. Transp. Res. Part B Methodol. 93, 576–595. https://doi.org/10.1016/j.trb.2016.08.013.
Clark, A.F., Scott, D.M., Yiannakoulias, N., 2014. Examining the relationship between active travel, weather, and the built environment: a multilevel approach using a GPS-enhanced dataset. Transportation 41 (2), 325–338. https://doi.org/10.1007/s11116-013-9476-3.
Clifton, K.J., et al., 2016. Development of destination choice models for pedestrian travel. Transport. Res. Pol. Pract. 94, 255–265. https://doi.org/10.1016/j.tra.2016.09.017.
Cole-Hunter, T., et al., 2015. Objective correlates and determinants of bicycle commuting propensity in an urban environment. Transport. Res. Transport Environ. 40 (2), 132–143. https://doi.org/10.1016/j.trd.2015.07.004.
Collins, P.A., MacFarlane, R., 2018. Evaluating the determinants of switching to public transit in an automobile-oriented mid-sized Canadian city: a longitudinal analysis. Transport. Res. Pol. Pract. 118, 682–695. https://doi.org/10.1016/j.tra.2018.10.014.
Daisy, N.S., Millward, H., Liu, L., 2018. Trip chaining and tour mode choice of non-workers grouped by daily activity patterns. J. Transport Geogr. 69, 150–162. https://doi.org/10.1016/j.jtrangeo.2018.04.016.
Dalumpines, R., Scott, D.M., 2017. Determinants of route choice behavior: a comparison of shop versus work trips using the Potential Path Area – Gateway (PPAG) algorithm and Path-Size Logit. J. Transport Geogr. 59, 59–68. https://doi.org/10.1016/j.jtrangeo.2017.01.003.
Danaf, M., Abou-Zeid, M., Kaysi, I., 2014. Modeling travel choices of students at a private, urban university: insights and policy implications. Case Studies on Transport Policy 2 (3), 142–152. https://doi.org/10.1016/j.cstp.2014.08.006.
Danalet, A., et al., 2016. Location choice with longitudinal WiFi data. Journal of Choice Modelling 18, 1–17. https://doi.org/10.1016/j.jocm.2016.04.003.
de Grange, L., et al., 2015. A logit model with endogenous explanatory variables and network externalities. Network. Spatial Econ. 15 (1), 89–116. https://doi.org/10.1007/s11067-014-9271-5.
de Jong, G.C., 2014. Mode choice models. In: Tavasszy, L.A., de Jong, G.C. (Eds.), Modelling Freight Transport. Elsevier.
de Luca, S., Cantarella, G.E., 2009. Validation and comparison of choice models. In: Saleh, W., Sammer, G. (Eds.), Travel Demand Management and Road User Pricing: Success, Failure and Feasibility. Ashgate Publications, pp. 37–58.
de Luca, S., Di Pace, R., 2015. Modelling users' behaviour in interurban carsharing program: a stated preference approach. Transport. Res. Pol. Pract. 71, 59–76. https://doi.org/10.1016/j.tra.2014.11.001.
Deutsch-Burgner, K., Ravualaparthy, S., Goulias, K., 2014. Place happiness: its constituents and the influence of emotions and subjective importance on activity type and destination choice. Transportation 41 (6), 1323–1340. https://doi.org/10.1007/s11116-014-9553-2.
Di Ciommo, F., et al., 2014. Exploring the role of social capital influence variables on travel behaviour. Transport. Res. Pol. Pract. 68, 46–55. https://doi.org/10.1016/j.tra.2014.08.018.
Di, X., et al., 2017. Indifference bands for boundedly rational route switching. Transportation 44 (5), 1169–1194. https://doi.org/10.1007/s11116-016-9699-1.
Ding, C., et al., 2014a. Cross-nested joint model of travel mode and departure time choice for urban commuting trips: case study in Maryland–Washington, DC region. J. Urban Plann. Dev. 141 (4), 04014036. https://doi.org/10.1061/(asce)up.1943-5444.0000238.
Ding, C., Lin, Y., Liu, C., 2014b. Exploring the influence of built environment on tour-based commuter mode choice: a cross-classified multilevel modeling approach. Transport. Res. Transport Environ. 32, 230–238. https://doi.org/10.1016/j.trd.2014.08.001.
Ding, C., et al., 2017. Exploring the influence of attitudes to walking and cycling on commute mode choice using a hybrid choice model. J. Adv. Transport. 2017, 1–8. https://doi.org/10.1155/2017/8749040.
Ding, C., et al., 2018. Joint analysis of the spatial impacts of built environment on car ownership and travel mode choice. Transport. Res. Transport Environ. 60, 28–40. https://doi.org/10.1016/j.trd.2016.08.004.
Dong, H., Ma, L., Broach, J., 2016. Promoting sustainable travel modes for commute tours: a comparison of the effects of home and work locations and employer-provided incentives. International Journal of Sustainable Transportation 10 (6), 485–494. https://doi.org/10.1080/15568318.2014.1002027.
Duc-Nghiem, N., et al., 2018. Modeling cyclists' facility choice and its application in bike lane usage forecasting. IATSS Research 42 (2), 86–95. https://doi.org/10.1016/j.iatssr.2017.06.006.
Efron, B., 1983. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78 (382), 316–331.
Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman & Hall.
Efron, B., Tibshirani, R., 1995. Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule. Division of Biostatistics, Stanford University.
Efron, B., Tibshirani, R., 1997. Improvements on cross-validation: the .632+ bootstrap method. J. Am. Stat. Assoc. 92 (438). https://doi.org/10.1080/01621459.1997.10474007.
Efthymiou, D., Antoniou, C., 2017. Understanding the effects of economic crisis on public transport users' satisfaction and demand. Transport Pol. 53, 89–97. https://doi.org/10.1016/j.tranpol.2016.09.007.
Ermagun, A., Levinson, D., 2017. Public transit, active travel, and the journey to school: a cross-nested logit analysis. Transportmetrica: Transport. Sci. 13 (1), 24–37. https://doi.org/10.1080/23249935.2016.1207723.
Ermagun, A., Samimi, A., 2015. Promoting active transportation modes in school trips. Transport Pol. 37, 203–211. https://doi.org/10.1016/j.tranpol.2014.10.013.
Ermagun, A., Samimi, A., 2018. Mode choice and travel distance joint models in school trips. Transportation 45 (6), 1755–1781. https://doi.org/10.1007/s11116-017-9794-y.
Ermagun, A., Hossein Rashidi, T., Samimi, A., 2015. A joint model for mode choice and escort decisions of school trips. Transportmetrica: Transport. Sci. 11 (3), 270–289. https://doi.org/10.1080/23249935.2014.968654.
Faghih-Imani, A., Eluru, N., 2015. Analysing bicycle-sharing system user destination choice preferences: Chicago's Divvy system. J. Transport Geogr. 44, 53–64. https://doi.org/10.1016/j.jtrangeo.2015.03.005.
Faghih-Imani, A., Eluru, N., 2017. Examining the impact of sample size in the analysis of bicycle-sharing systems. Transportmetrica: Transport. Sci. 13 (2), 139–161. https://doi.org/10.1080/23249935.2016.1223205.
Fatmi, M.R., Habib, M.A., 2017. Modelling mode switch associated with the change of residential location. Travel Behaviour and Society 9, 21–28. https://doi.org/10.1016/j.tbs.2017.07.006.
Fernández-Antolín, A., et al., 2016. Correcting for endogeneity due to omitted attitudes: empirical assessment of a modified MIS method using RP mode choice data. Journal of Choice Modelling 20, 1–15. https://doi.org/10.1016/j.jocm.2016.09.001.
Flügel, S., et al., 2015. Methodological challenges in modelling the choice of mode for a new travel alternative using binary stated choice data – the case of high speed rail in Norway. Transport. Res. Pol. Pract. 78, 438–451. https://doi.org/10.1016/j.tra.2015.06.004.
Forsey, D., et al., 2014. Temporal transferability of work trip mode choice models in an expanding suburban area: the case of York Region, Ontario. Transportmetrica: Transport. Sci. 10 (6), 469–482. https://doi.org/10.1080/23249935.2013.788100.
Fox, J., et al., 2014. Temporal transferability of models of mode-destination choice for the Greater Toronto and Hamilton Area. Journal of Transport and Land Use 7 (2), 41. https://doi.org/10.5198/jtlu.v7i2.701.
Fu, X., Juan, Z., 2017. Accommodating preference heterogeneity in commuting mode choice: an empirical investigation in Shaoxing, China. Transport. Plann. Technol. 40 (4), 434–448. https://doi.org/10.1080/03081060.2017.1300240.
Gao, J., et al., 2016. A study of traveler behavior under traffic information-provided conditions in the Beijing area. Transport. Plann. Technol. 39 (8), 768–778. https://doi.org/10.1080/03081060.2016.1231896.
Gao, Y., et al., 2017. Differences in pupils' school commute characteristics and mode choice based on the household registration system in China. Case Studies on Transport Policy 5 (4), 656–661. https://doi.org/10.1016/j.cstp.2017.07.008.
Garcia-Martinez, A., et al., 2018. Transfer penalties in multimodal public transport networks. Transport. Res. Pol. Pract. 114, 52–66. https://doi.org/10.1016/j.tra.2018.01.016.
Geisser, S., 1975. The predictive sample reuse method with applications. J. Am. Stat. Assoc. 70 (350), 320–328.
Gerber, P., et al., 2017. Cross-border residential mobility, quality of life and modal shift: a Luxembourg case study. Transport. Res. Pol. Pract. 104, 238–254. https://doi.org/10.1016/j.tra.2017.06.015.
Ghanayim, M., Bekhor, S., 2018. Modelling bicycle route choice using data from a GPS-assisted household survey. Eur. J. Transport Infrastruct. Res. 18 (2), 158–177.
Glerum, A., Atasoy, B., Bierlaire, M., 2014. Using semi-open questions to integrate perceptions in choice models. Journal of Choice Modelling 10 (1), 11–33. https://doi.org/10.1016/j.jocm.2013.12.001.
Goel, R., Tiwari, G., 2016. Access-egress and other travel characteristics of metro users in Delhi and its satellite cities. IATSS Research 39 (2), 164–172. https://doi.org/10.1016/j.iatssr.2015.10.001.
Gokasar, I., Gunay, G., 2017. Mode choice behavior modeling of ground access to airports: a case study in Istanbul, Turkey. J. Air Transport. Manag. 59, 1–7. https://doi.org/10.1016/j.jairtraman.2016.11.003.
Golshani, N., et al., 2018. Modeling travel mode and timing decisions: comparison of artificial neural networks and copula-based joint model. Travel Behaviour and Society 10, 21–32. https://doi.org/10.1016/j.tbs.2017.09.003.
González, F., Melo-Riquelme, C., de Grange, L., 2016. A combined destination and route choice model for a bicycle sharing system. Transportation 43 (3), 407–423. https://doi.org/10.1007/s11116-015-9581-6.
Guan, J., Xu, C., 2018. Are relocatees different from others? Relocatee's travel mode choice and travel equity analysis in large-scale residential areas on the periphery of megacity Shanghai, China. Transport. Res. Pol. Pract. 111, 162–173. https://doi.org/10.1016/j.tra.2018.03.011.
Guerra, E., et al., 2018. Urban form, transit supply, and travel behavior in Latin America: evidence from Mexico's 100 largest urban areas. Transport Pol. 69, 98–105. https://doi.org/10.1016/j.tranpol.2018.06.001.
Gunn, H., Bates, J., 1982. Statistical aspects of travel demand modelling. Transport. Res. Gen. 16 (5–6), 371–382. https://doi.org/10.1016/0191-2607(82)90065-6.
Guo, Y., et al., 2018. Impacts of internal migration, household registration system, and family planning policy on travel mode choice in China. Travel Behaviour and Society 13, 128–143. https://doi.org/10.1016/j.tbs.2018.07.003.
Habib, K.N., 2014a. An investigation on mode choice and travel distance demand of older people in the Nationa