Journal of Choice Modelling 38 (2021) 100257
Available online 5 November 2020
1755-5345/© 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
The overreliance on statistical goodness-of-fit and under-reliance on model validation in discrete choice models: A review of validation practices in the transportation academic literature

Giancarlos Parady a,*, David Ory b, Joan Walker c

a Department of Urban Engineering, The University of Tokyo, Japan
b WSP, USA
c Department of Civil and Environmental Engineering, University of California, Berkeley, USA
ARTICLE INFO
Keywords:
Validation
Generalizability
Transferability
Policy inference
Transportation
Discrete choice models
ABSTRACT
An examination of model validation practices in the peer-reviewed transportation literature published between 2014 and 2018 reveals that 92% of studies reported goodness-of-fit statistics, and 64.6% reported some sort of policy-relevant inference analysis. However, only 18.1% reported validation performance measures, out of which 78% (14.2% of all studies) consisted of internal validation and 22% (4% of all studies) consisted of external validation. The proposition put forward in this paper is that the reliance on goodness-of-fit measures rather than validation performance is unwise, especially given the dependence of the transportation research field on observational (non-experimental) studies. Model validation should be a non-negotiable part of presenting a model for peer-review in academic journals. For that purpose, we propose a simple heuristic to select validation methods given the resources available to the researcher.
1. Introduction
Ioannidis (2005) brought to light the issue of lack of demonstrated reproducibility in the natural sciences and, in subsequent years, the so-called "reproducibility crisis" in science has made headlines in the popular media (Baker and Penny, 2016). The transportation domain can benefit from thinking about the underlying concerns about reproducibility.¹ Though the transportation field generally does not rely on experiments that can be readily retried, to be useful, the observational studies that we generally do rely on should generalize across time and space within reasonable boundaries. To do so, we argue that more attention needs to be paid to the validity of the models used to extract information from observational studies.
The proposition put forward by this paper is that the transportation community over-relies on statistical goodness-of-fit when assessing a model's performance and policy relevance. For example, it is common in the transportation space for models to be estimated from an observational study of travel behaviour. An analyst, to continue the example, may use such a data set revealing travel mode choice decisions and strive to quantify the relative onerousness of waiting for a bus relative to riding on a bus. Coefficients from a statistical model, say linear regression or discrete choice, may suggest that waiting is three times as unpleasant as riding. A typical analyst would take comfort in robust goodness-of-fit statistics, e.g., an R-squared or pseudo-rho-squared in line with similar studies, as well as statistics showing the statistical significance of the coefficients for the key variables, e.g., a t-statistic suggesting the coefficients are likely not equal to zero. It is rare for an analyst to try and replicate these findings on a new sample or even against a "hold-out" from the same sample.

* Corresponding author. The University of Tokyo, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan.
E-mail address: gtroncoso@ut.t.u-tokyo.ac.jp (G. Parady).
¹ Note that in experimental studies the concept of reproducibility differs from the one commonly used in observational studies, and in this article.
https://doi.org/10.1016/j.jocm.2020.100257
Received 19 February 2020; Received in revised form 24 October 2020; Accepted 31 October 2020
To understand how goodness-of-fit and validation are approached in the transportation literature, we review their use in the peer-reviewed transportation academic literature published between 2014 and 2018, focusing on models that use discrete choice. The balance of the paper is organized as follows: In Section 2, we discuss the reproducibility crisis introduced above and its applicability to the study of transportation. In Section 3, we put forth an operational definition of validation and related concepts, to organize the myriad of related terms in the field. In Sections 4 and 5 we discuss the most commonly used validation methods and performance indicators, respectively. The results from the meta-analysis we conduct of the academic literature are presented in Section 6. In Section 7 we provide recommendations for improving validation practices in the field and end with concluding thoughts in Section 8.
2. A credibility crisis in science and engineering?
In 2016, the journal Nature published the results of a survey of its readers regarding reproducibility (Baker and Penny, 2016). A key finding was that "more than 70 percent of researchers tried and failed to reproduce another scientist's experiment, and more than half have failed to reproduce their own experiments" (p. 452). Furthermore, 52 percent of respondents stated that there is a significant reproducibility crisis.

This perception echoes the argument of Ioannidis (2005), who estimated the positive predictive value of research, that is, the probability that a reported finding is true, under different criteria. Ioannidis suggested that most published research findings are likely to be false, due to factors such as lack of statistical power of the study, small effect sizes, and great flexibility in research design, definitions, outcomes, and methods.

While the study by Ioannidis focused on experimental studies in natural science, the underlying concern is of relevance to the transportation research field.
Generally speaking, the purpose of academic transportation research is to better understand (and ultimately forecast) transport-related human behaviour to better inform transportation policy design and implementation. However, conducting experiments in the transportation field is often expensive and/or disruptive to transport users. Consider, for example, running an experiment in which a new subway line is constructed just to evaluate modal shift from automobile to transit. Although there are some examples of experimental economics applications (e.g. Holguín-Veras et al., 2004), the vast majority of transportation research relies on cross-sectional observational studies, which makes hypothesis testing in the classical way difficult.

Lack of validation of models increases the risk of model overfitting, where the model fits the estimation data well but performs poorly outside this estimation dataset; in other words, the model is fitted to the noise instead of the signal in the data.

Despite the limitations imposed by observational studies, impact evaluation of policies drawn based on model-based academic research is rarely conducted, meaning there is little feedback, if any, in terms of how right or how wrong these models and the policy recommendations derived from them are. Altogether, these issues strongly underscore the need to incorporate in the analysis means to evaluate the validity of results. However, we will show that the academic literature has over-relied on statistical goodness of fit and widely disregarded model validation.
3. Defining validation
The meaning of the term validation differs across fields, and even within fields, and certainly in the case of transportation, there is no agreed-upon definition. As such, to organize the myriad of validation-related terms, after reviewing the validation literature in different research fields, we propose a set of definitions that are adequate for the transportation field. Fig. 1 summarizes these concepts and illustrates the relationships among them.

Fig. 1. Model validation is the evaluation of the generalizability of a statistical model. The evaluation of reproducibility is called internal validation, while the evaluation of transferability is called external validation. In-sample testing is not validation.

We will start by defining six key terms adapted from Justice et al. (1999):

- Predictive accuracy: the degree to which predicted outcomes match observed outcomes. Predictive accuracy is a function of:
  o Calibration² ability: the ability of a model or system of models to make predictions that match observed outcomes.
  o Discrimination ability: the ability of a model or system of models to discriminate between those instances with and without a particular outcome.
- Generalizability: the ability of a model, or system of models, to maintain its predictive accuracy in a different sample. The generalizability of a model is a function of:
  o Reproducibility: the ability of a model, or system of models, to maintain its predictive ability in different samples from the same population.
  o Transferability: the ability of a model, or system of models, to maintain its predictive ability in samples from different but plausibly related populations or in samples collected with different methodologies. In other fields it is also called transportability.

² Note that in the transportation field the term calibration is commonly used to refer to the adjustment of constants or other parameters to match observed outcomes.

Given that the purpose of academic transportation research is to better understand (and ultimately forecast) transport-related
human behaviour to better inform transportation policy design and implementation, the usefulness of a statistical model is a function of its predictive accuracy outside the sample used to estimate it. We can thus define model validation as the evaluation of the generalizability of a statistical model. The evaluation of reproducibility is called internal validation, while the evaluation of transferability is called external validation. Note that internal validity is very often a precondition for external validity (Justice et al., 1999).

This definition outright excludes the classification of in-sample testing such as goodness-of-fit statistics from being considered a form of model validation. The "optimism" of in-sample estimates of predictive accuracy, or apparent predictive accuracy, is a well-documented phenomenon (Efron, 1983; Steyerberg et al., 2001), which stems from the fact that the sample data is used for both estimation and testing. Correcting for this is the very reason internal and external validation methods have been developed.
Irrespective of the type of validation, the process itself is conceptually simple. It consists of (i) estimating a model with an estimation sample e (sometimes called training sample), (ii) applying the estimated model to a validation sample v (sometimes called testing sample), and (iii) evaluating the generalizability of the estimated model given the defined performance measure(s) of predictive accuracy (see Section 5). Predictive accuracy is usually quantified as a function of the discrepancy between predicted and observed outcomes (i.e. prediction error).

In the case of internal validation, while using independent samples is ideal, in many cases producing such data is expensive, so data-splitting methods such as holdout and cross-validation, or resampling methods such as bootstrapping, are often used. These methods will be discussed in Section 4.
Regarding the evaluation of transferability (external validation), in the transportation field a significant number of studies were conducted during the 1980s on the subject (Ortúzar and Willumsen, 2011), even though it has "dropped off the radar" in recent years (Fox et al., 2014). In transportation, transferability has been defined as "the usefulness of a model, information or theory in a new context" (Koppelman and Wilmot, 1982) and as "the ability of a model developed in one context to explain behaviour in another under the assumption that the underlying theory is equally applicable in the two contexts" (Fox et al., 2014), definitions that are consistent with the definition presented above. Past studies on transferability have not only focused on predictive accuracy but also on the stability of parameters across contexts, especially in earlier studies (Koppelman and Wilmot, 1982; Ortúzar and Willumsen, 2011; Fox et al., 2014).

In transportation, the two key dimensions of interest are temporal transferability and spatial transferability, which refer to the potential to transfer a model to different points in time, and to different spatial contexts (i.e. cities, regions), respectively.
There are, of course, limits to how much a model can be transferred, and it is obvious that in transportation research models are context dependent (Ortúzar and Willumsen, 2011). As such, rather than a pass/fail test, transferability is a matter of degree (Koppelman and Wilmot, 1982) and should be bounded by reasonable limits. No one would expect, for example, a discretionary activity destination choice model estimated for Tokyo to generalize well to Los Angeles. It would be hard to argue that these populations are "plausibly related." A similar argument can be made in terms of temporal transferability, which should in general be evaluated considering the time frame of the forecast of interest. Consider, for example, a 20-year forecast, a timeframe typical of transportation forecast models. In such a case, two independent samples from the same city collected six months apart are more likely different (yet contemporary) samples from the same population than temporally different samples. That is, they are too similar to constitute a proper external validation dataset. As such, the validation effort would in fact be an internal validation test.

Significant policy interventions (i.e. the construction of a new expressway or rail line) or high-impact external events (i.e. natural disasters, economic shocks, pandemics) can also be thresholds that make a population temporally different. As such, samples from before and after such an event can be considered temporally different samples.
As an aside, Justice et al. (1999) have also discussed, in the context of prognostic assessments, the issue of methodological transferability, which refers to model performance on data collected with different methodologies (i.e. different variable definitions, survey methods, etc.). Although potentially relevant to the field, this is a largely understudied issue, but it is included in this article for completeness.

While it can be argued that, as an analysis of model sensitivity, the calculation of elasticities and marginal effects is part of the model validation process (Cambridge Systematics, 2010), as measures of effect size these are key policy-related values. As such, although the distinction might seem trivial, we classify these values as part of the policy-relevant inference analysis rather than part of the validation process, which focuses, as per the definition provided above, on predictive ability.
4. Data splitting and resampling methods for validation
Data splitting and resampling methods have become common methods to conduct validation, largely due to the high costs associated with producing independent datasets to test models against. While these methods are largely used for internal validation, it is possible to adapt them for external validation, although this is not very common (e.g. Austin et al. (2016) in epidemiology, and Sanko (2017) in transport). We will now briefly introduce these methods.
4.1. Holdout validation
Holdout validation (HOV) is the simplest data splitting method. In holdout validation, the dataset is randomly split into an estimation dataset and a validation dataset (holdout). For illustration purposes, let us define $Q[y_n, \hat{y}_n]$ as a measure of prediction correctness for the nth instance, for the binary choice case, as:

$$Q[y_n, \hat{y}_n] = \begin{cases} 0 & \text{if } y_n = \hat{y}_n \\ 1 & \text{if } y_n \neq \hat{y}_n \end{cases} \quad (1)$$

where $y_n$ is the observed outcome, and $\hat{y}_n$ is the predicted outcome for instance n. The holdout estimator is

$$HOV = \frac{1}{N_v} \sum_{n_v=1}^{N_v} Q[y_{n_v}, \hat{y}^{e}_{n_v}] \quad (2)$$

where $\hat{y}^{e}_{n_v}$ is the predicted outcome for instance n in sample v, using the model estimated with sample e, and $N_v$ is the validation sample size. The performance measure in equation (2) is called the misclassification rate, but as will be discussed in Section 5, there is a myriad of performance measures that can be used to evaluate predictive accuracy.
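To make the procedure concrete, the sketch below applies holdout validation to a binary logit; the synthetic data, the 70/30 split, and the use of scikit-learn's logistic regression are illustrative assumptions rather than recommendations from this paper.

```python
# Holdout validation sketch for a binary choice model; the synthetic data, the 70/30
# split, and scikit-learn's logistic regression are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
N = 2000
X = rng.normal(size=(N, 3))                            # e.g., cost, in-vehicle time, wait time
utility = 0.8 * X[:, 0] - 0.5 * X[:, 1] - 1.5 * X[:, 2]
y = (utility + rng.logistic(size=N) > 0).astype(int)   # observed binary choices

# (i) split into an estimation sample e and a validation sample v
X_e, X_v, y_e, y_v = train_test_split(X, y, test_size=0.3, random_state=0)

# (ii) estimate the model on sample e only
model = LogisticRegression().fit(X_e, y_e)

# (iii) apply the model to sample v and average Q[y_n, y_hat_n] (Eq. 1)
# to obtain the holdout estimator, i.e. the misclassification rate (Eq. 2)
Q = (model.predict(X_v) != y_v).astype(int)
print(f"Holdout misclassification rate: {Q.mean():.3f}")
```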
4.2. Cross-validation methods
When the holdout process is repeated multiple times, thus generating a set of randomly split estimation-validation data pairs, we refer to the validation procedure as cross-validation (CV). The basic cross-validation estimator is defined as

$$CV = \frac{1}{B} \sum_{b=1}^{B} HOV_b \quad (3)$$

where B is the number of estimation-validation data pairs generated, and $HOV_b$ is the holdout estimator for set b. Cross-validation methods differ from one another in the way the data is split. When the data splitting considers all possible estimation sets of size $N_c$, the splitting is exhaustive; otherwise the splitting is partial. Partial data splitting methods have the advantage of being less computationally intensive than exhaustive splitting. Here we will briefly define common cross-validation methods. We refer readers to the work of Arlot and Celisse (2009) for a more comprehensive review of cross-validation practices in general.

Regarding exhaustive splitting, in the leave-one-out (LOO) approach (Stone, 1974), sometimes referred to as the jackknife approach, the estimation set size is $N_c = N - 1$, and $B = N$. That is, the model is fitted leaving out one instance per iteration, and the outcome of that single instance is predicted based on the estimated model. In the leave-p-out (LPO) approach (Shao, 1993) the estimation set size is $N_c = N - p$.

Regarding partial data splitting, in the K-fold cross-validation (K-CV) approach (Geisser, 1975), data is partitioned into K mutually exclusive subsets of roughly equal size, and $B = K$. It can be seen that the particular case of N-fold cross-validation is equivalent to the leave-one-out approach. In the case of repeated K-fold cross-validation, the process is repeated R times and the CV performance measures averaged over R. In the repeated learning-testing (RLT) approach (Burman, 1989) a number B of randomly split estimation-validation pairs are generated. This method is also called repeated holdout validation.
As an aside, it is important to note that in-sample statistics have been proposed to aid in model selection, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Stone (1977) showed the asymptotic equivalence between cross-validation (specifically, the leave-one-out method) and the AIC statistic for model selection. However, the strength of validation lies in the quasi-universality of its applicability and its robustness to violations of the assumptions necessary for these statistics to be correct (Efron and Tibshirani, 1993; Arlot and Celisse, 2009).
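Building on the holdout sketch above, the following sketch computes the cross-validation estimator of Eq. (3) using a K-fold split; the synthetic binary logit data, K = 5, and the scikit-learn utilities are again illustrative assumptions.

```python
# K-fold cross-validation sketch for the estimator in Eq. (3); synthetic data,
# K = 5 folds, and scikit-learn's logistic regression are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
N = 2000
X = rng.normal(size=(N, 3))
y = (0.8 * X[:, 0] - 0.5 * X[:, 1] - 1.5 * X[:, 2] + rng.logistic(size=N) > 0).astype(int)

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=0)
hov_b = []
for est_idx, val_idx in kf.split(X):
    # estimate on the estimation folds, validate on the held-out fold
    model = LogisticRegression().fit(X[est_idx], y[est_idx])
    hov_b.append((model.predict(X[val_idx]) != y[val_idx]).mean())

cv = np.mean(hov_b)   # average of the B = K holdout estimators (Eq. 3)
print(f"{K}-fold CV misclassification rate: {cv:.3f}")
```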
4.3. Bootstrapping methods
Bootstrapping validation methods were proposed by Bradley Efron to address some of the limitations of cross-validation methods. Although (leave-one-out) cross-validation gives a nearly unbiased estimate of predictive accuracy, it often exhibits unacceptably high variability, particularly when sample size is small, whereas bootstrapping methods have been shown to be more efficient (Efron, 1983; Efron and Tibshirani, 1995). We will briefly summarize some basic bootstrapping estimators borrowing from Efron and Tibshirani (1993, 1997), to whom we refer the reader for a more extensive treatment of bootstrapping for validation.
The idea of bootstrapping is conceptually simple. In the simplest case, given a dataset of size n, a bootstrap sample is generated by randomly sampling (with replacement) n instances from the original dataset and estimating the model on this sample. A prediction error estimate for this sample can be obtained by applying the model to the original sample. This process is repeated B times, and the prediction error averaged over B to obtain the simple bootstrap prediction error estimate,

$$BS_{simple} = \frac{1}{B} \sum_{b=1}^{B} \sum_{n=1}^{N} \frac{Q[y_n, \hat{y}^{b}_{n}]}{N} \quad (4)$$
Another estimator is the leave-one-out bootstrap estimator, defined as

$$BS_{loo} = \frac{1}{N} \sum_{n=1}^{N} \sum_{b \in C_n} \frac{Q[y_n, \hat{y}^{b}_{n}]}{B_n} \quad (5)$$

where, given a set of B bootstrap samples, $C_n$ is the subset of bootstrap samples not containing instance n, and $B_n$ is the size of $C_n$. This estimator can be considered a smoothed version of the leave-one-out cross-validation estimator, and while being more efficient, it has been shown to be upward-biased (Efron and Tibshirani, 1995).
A more refined approach is to get bootstrap estimates of the optimism of the apparent prediction error (the prediction error estimated on the same sample used to estimate the model) and correct for it. The bootstrap estimator can then be defined as

$$BS = \overline{err} + \hat{\omega} \quad (6)$$

where $\overline{err}$ is the apparent prediction error (of the original sample) defined as

$$\overline{err} = \frac{1}{N} \sum_{n=1}^{N} Q[y_n, \hat{y}_n] \quad (7)$$

and $\hat{\omega}$ is a measure of optimism. One way to estimate $\hat{\omega}$ is as the difference between the simple bootstrap estimate (Eq. (4)) and the apparent prediction error of the bootstrap samples. Another way is using the 0.632 optimism estimator defined as

$$\hat{\omega}_{.632} = 0.632\,[BS_{loo} - \overline{err}] \quad (8)$$

Plugging Equation (8) into Equation (6) results in the 0.632 bootstrap estimator (Efron, 1983)

$$BS_{.632} = 0.368\,\overline{err} + 0.632\,BS_{loo} \quad (9)$$

This estimator mitigates the upward bias of the leave-one-out bootstrap estimate. The weights for this estimator come from the fact that the probability of any instance being present in a bootstrap sample is approximately 0.632. Although Efron discussed the 0.632 estimator in terms of the leave-one-out case, it can be generalized to non-exhaustive cases (Steyerberg et al., 2001).
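The sketch below is one possible implementation of the leave-one-out bootstrap and the 0.632 estimator (Eqs. (5), (8) and (9)) for a misclassification-rate loss; the synthetic data, B = 200 replications, and the logistic regression model are illustrative assumptions.

```python
# 0.632 bootstrap sketch for the misclassification rate (Eqs. 4-9); synthetic data,
# scikit-learn logistic regression, and B = 200 replications are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
N = 1000
X = rng.normal(size=(N, 3))
y = (0.8 * X[:, 0] - 0.5 * X[:, 1] - 1.5 * X[:, 2] + rng.logistic(size=N) > 0).astype(int)

# Apparent error: model estimated and evaluated on the same (original) sample (Eq. 7)
full_model = LogisticRegression().fit(X, y)
err_apparent = (full_model.predict(X) != y).mean()

# Leave-one-out bootstrap (Eq. 5): each observation is predicted only by models
# estimated on bootstrap samples that do not contain it
B = 200
errors = [[] for _ in range(N)]
for _ in range(B):
    idx = rng.integers(0, N, size=N)           # bootstrap sample (with replacement)
    model = LogisticRegression().fit(X[idx], y[idx])
    out = np.setdiff1d(np.arange(N), idx)      # instances left out of this bootstrap sample
    wrong = model.predict(X[out]) != y[out]
    for i, w in zip(out, wrong):
        errors[i].append(w)

bs_loo = np.mean([np.mean(e) for e in errors if e])

# 0.632 bootstrap estimator (Eqs. 8-9)
bs_632 = 0.368 * err_apparent + 0.632 * bs_loo
print(f"apparent={err_apparent:.3f}, loo-bootstrap={bs_loo:.3f}, 0.632={bs_632:.3f}")
```

As expected from the discussion above, the apparent error is typically the most optimistic of the three values, the leave-one-out bootstrap the most pessimistic, and the 0.632 estimate lies in between.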
Although bootstrapping is rather common in transportation for standard error estimation, it is rarely used for model validation. And while it has been suggested that bootstrapping is superior to cross-validation (Efron, 1983; Steyerberg et al., 2001), other studies have suggested that repeated K-fold cross-validation (Kim, 2009) or stratified cross-validation (Kohavi, 1995) are superior. The fact of the matter is that the performance of these methods is dependent on data characteristics such as sample size, and on the type of model used. As such, we will refrain from making a recommendation, noting the need for an empirical study on the performance of different validation methods using typical data from transportation studies, such as household travel surveys, and typical models.
5. Performance measures
In the transportation literature several articles have been published discussing in detail validation methods for discrete choice models, from Train (1978, 1979) to Gunn and Bates (1982) and Koppelman and Wilmot (1982), and more recently de Luca and Cantarella (2009). Using these studies as a point of departure, in this section we briefly review the key performance measures reported in the transportation literature. Although the list of measures discussed here is not exhaustive, it is comprehensive in that it covers the vast majority of measures reported for internal and/or external validation in the reviewed studies.

One of the simplest ways to evaluate the predictive ability of a model is to compare the predicted market shares against the observed shares. While very simple and easy to understand, this approach does not provide a quantitative measure to evaluate the level of agreement between predictions and observations.
A quantitative measure often used is the percentage of correct predictions of a model (sometimes called the First Preference Recovery), where the alternative with the highest probability is defined as the predicted choice. Although this measure is widely reported, its use in models with more than two alternatives has been criticized, since it cannot differentiate between different probabilities assigned to a chosen alternative (de Luca and Cantarella, 2009). de Luca and Cantarella illustrate this point with a choice exercise with three alternatives a, b, c where the chosen alternative is a. With the percentage of correct predictions measure, a model that predicts choice ratios of 0.34, 0.33 and 0.33 for alternatives a, b, c, respectively, is equivalent to a model that predicts 0.90, 0.05, 0.05, even though the latter model assigns a considerably higher probability to the first alternative and the former is very close to a random prediction. To overcome this limitation, they proposed additional measures to evaluate "clearness of predictions," a concept that bears some resemblance to the concept of discriminative ability. They proposed the percentage of clearly right choices, the percentage of clearly wrong choices and the percentage of unclear choices.

They defined the percentage of clearly right choices as the percentage of users in the sample whose observed choices are given a probability greater than threshold t by the model. The idea here is that a model that predicts a chosen alternative with higher probability performs better than one that does so with a lower probability. Conversely, they defined the percentage of clearly wrong choices as "the percentage of users in the sample for whom the model gives a probability greater than threshold t to a choice alternative differing from the observed one." Finally, the percentage of unclear choices is the percentage of users for whom the model does not give a probability greater than threshold t to any choice. To be meaningful, the threshold t must be considerably larger than $c^{-1}$, where c is the choice set size. While there is no agreed-upon definition of what qualifies as "considerably larger," if the threshold is just marginally larger than $c^{-1}$, the test becomes useless. As a reference, when reporting the percentage of clearly right choices, de Luca and Di Pace (2015) used a threshold of 0.9 for binary choice models, while Glerum, Atasoy and Bierlaire (2014) used a threshold of 0.5 for choice models with three alternatives. When in doubt, we recommend reporting results for different threshold values. For example, de Luca and Cantarella (2009) reported values for 0.50, 0.66 and 0.90 in tabular format for a pair of models with four alternatives, as well as a plot for all threshold values above 0.5.
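As a minimal illustration, the sketch below computes the First Preference Recovery and the clearness-of-prediction measures from a matrix of predicted choice probabilities; the probabilities, observed choices, and the threshold t = 0.5 are made-up numbers chosen only to show the mechanics.

```python
# First Preference Recovery and "clearness of prediction" measures computed from a
# matrix of predicted choice probabilities; probabilities, observed choices, and the
# threshold t = 0.5 below are made-up numbers for illustration.
import numpy as np

P = np.array([[0.60, 0.30, 0.10],      # predicted probabilities, one row per individual
              [0.20, 0.45, 0.35],
              [0.34, 0.33, 0.33],
              [0.05, 0.05, 0.90]])
chosen = np.array([0, 1, 0, 2])         # index of the observed (chosen) alternative
t = 0.5                                 # threshold, must be considerably larger than 1/c

fpr = 100 * np.mean(P.argmax(axis=1) == chosen)

p_chosen = P[np.arange(len(chosen)), chosen]                    # probability of the chosen alternative
p_other_max = np.where(~np.eye(P.shape[1], dtype=bool)[chosen], P, 0).max(axis=1)

pct_clearly_right = 100 * np.mean(p_chosen > t)
pct_clearly_wrong = 100 * np.mean(p_other_max > t)
pct_unclear = 100 - pct_clearly_right - pct_clearly_wrong

print(f"FPR={fpr:.1f}%  %CR={pct_clearly_right:.1f}%  "
      f"%CW={pct_clearly_wrong:.1f}%  %U={pct_unclear:.1f}%")
```

In this toy example all four choices are "correct" (FPR = 100%), but only half of them are clearly right at t = 0.5, which is exactly the distinction the clearness measures are meant to capture.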
Concordance probability (c-statistic) is a measure of discriminative ability, in the case of binary choices. This measure estimates the
probability that an individual who was observed choosing alternative a is assigned a higher probability than an individual who did not.
This probability is calculated as the ratio between the number of concordant pairs and comparable pairs. A comparable pair is a pair of
individuals where one individual chose alternative a, and one did not. Such a pair is concordant if the individual who chose alternative
a was assigned a higher probability of doing so than the individual who did not choose it. If the model has no discriminative ability the c-statistic equals 0.5, and it equals 1 for perfect discriminative ability.

Several extensions have been proposed for the multinomial choice case; see Calster et al. (2012) for a review of existing measures.
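A minimal sketch of the c-statistic for a binary choice is shown below; the predicted probabilities and observed choices are made-up values, and ties are counted as 0.5, following the definition summarized in Table 1 below.

```python
# Concordance (c-) statistic sketch for a binary choice: share of comparable pairs in
# which the chooser of alternative a receives a higher predicted probability than a
# non-chooser (ties count 0.5). Probabilities and outcomes are illustrative.
import numpy as np

p_a = np.array([0.80, 0.55, 0.60, 0.65, 0.20, 0.40])   # predicted P(choose a)
y_a = np.array([1,    1,    0,    1,    0,    0   ])   # 1 if alternative a was chosen

p_chose = p_a[y_a == 1]       # individuals who chose a
p_not = p_a[y_a == 0]         # individuals who did not

# compare every chooser with every non-chooser (all comparable pairs)
comparisons = p_chose[:, None] - p_not[None, :]
c_stat = (np.sum(comparisons > 0) + 0.5 * np.sum(comparisons == 0)) / comparisons.size
print(f"c-statistic: {c_stat:.3f}")   # 0.5 = no discrimination, 1.0 = perfect
```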
The tting factor is dened as the ratio between the sum of predicted choice probabilities of the chosen alternatives and the number
of individuals. A tting factor of 1 indicates a perfect forecast of all choices with predicted probability of 1 (de Luca and Cantarella,
2009).
The correlation between predictions and outcomes can be used to evaluate predictive performance when dealing with continuous outcomes such as ridership levels, traffic flow, etc.
Other commonly used measures include the sum of square error (SSE), mean square error (MSE), root sum square error (RSSE), root mean square error (RMSE), mean absolute error (MAE), absolute percentage error (APE), and mean absolute percentage error (MAPE). Although there are some differences between these measures, there is no consensus regarding which measure is better. While quadratic measures like MSE, RSSE and RMSE tend to place heavier weight on efficiency (Troncoso Parady et al., 2017), it has been argued that absolute measures like MAE are more natural and less ambiguous error measures than quadratic indicators (Willmott and Matsuura, 2005). It has also been suggested that RMSE is more appropriate when errors are expected to be Gaussian distributed (Chai and Draxler, 2014).
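For the aggregate error measures, a short sketch is given below; the observed and predicted mode shares are made-up numbers used only to illustrate the calculations.

```python
# Aggregate error measures computed on predicted vs observed market shares; the
# three-mode shares below are made-up numbers for illustration.
import numpy as np

observed = np.array([0.55, 0.30, 0.15])    # observed shares: car, transit, walk
predicted = np.array([0.50, 0.33, 0.17])   # shares predicted on the validation sample

diff = predicted - observed
mae = np.mean(np.abs(diff))                       # mean absolute error
rmse = np.sqrt(np.mean(diff ** 2))                # root mean square error
mape = 100 * np.mean(np.abs(diff) / observed)     # mean absolute percentage error
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  MAPE={mape:.1f}%")
```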
While in the transportation field it is common to use aggregate outcome measures such as market shares (Train, 1979) or choice frequencies (Koppelman and Wilmot, 1982) when computing such measures, in most cases individual outcomes can also be used. For example, the Brier score is calculated as the mean square difference between predicted individual choice probabilities and actual choices across all choices. To calculate this difference, the observed outcome is assigned a value of 1. This score has a minimum of 0 for perfect forecasting, and a maximum of 2 for the worst possible forecast (Brier, 1950). The Brier score can be thought of as a disaggregate form of the MSE for discrete outcomes. Disaggregate forms of the MAE (de Luca and Cantarella, 2009) and RMSE can be calculated in a similar manner.
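The following sketch computes the disaggregate Brier score from a matrix of predicted probabilities and a one-hot coding of the observed choices; the numbers are again made up for illustration.

```python
# Disaggregate Brier score sketch: mean squared difference between the predicted
# probability vector and a one-hot coding of the observed choice; the probability
# matrix and choices are made-up numbers for illustration.
import numpy as np

P = np.array([[0.60, 0.30, 0.10],
              [0.20, 0.45, 0.35],
              [0.05, 0.05, 0.90]])
chosen = np.array([0, 1, 2])

Y = np.zeros_like(P)
Y[np.arange(len(chosen)), chosen] = 1.0        # observed outcomes coded 0/1

brier = np.mean(np.sum((P - Y) ** 2, axis=1))  # 0 = perfect forecast, 2 = worst possible
print(f"Brier score: {brier:.3f}")
```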
When using choice frequencies instead of market shares, it is possible to use the $\chi^2$ test as a test of consistency between predictions and observations. The null hypothesis of this test is that observed and expected frequency outcome distributions are the same. Different from the other measures reviewed so far, this is a pass/fail test. Although the $\chi^2$ statistic is sometimes used with market shares, it must be noted that this test is only valid for frequencies.
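A minimal sketch of the test using SciPy's chisquare function is shown below; the observed and predicted frequencies are made-up counts, and the model generating them is assumed, not taken from any study reviewed here.

```python
# Chi-square consistency test between observed and predicted choice frequencies;
# scipy's chisquare is used, and the frequencies are made-up counts for illustration.
import numpy as np
from scipy.stats import chisquare

observed_freq = np.array([420, 310, 270])           # observed choice counts per alternative
predicted_freq = np.array([400.0, 330.0, 270.0])    # expected counts from the estimated model

# Note: the test requires frequencies (counts), not market shares
stat, p_value = chisquare(f_obs=observed_freq, f_exp=predicted_freq)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
```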
Since the likelihood is proportional to the product of individual probabilities, likelihood-based loss functions are also frequently
used. The log-likelihood is a natural measure given that maximum likelihood estimators are widely used in discrete choice models. The
cross-entropy measure, which is essentially the negative of the log-likelihood function, is also a commonly used loss function in
machine learning.
The transferability test statistic (TTS) is a likelihood ratio test between the base model applied to the transfer data and the model estimated on the transfer data. This statistic is $\chi^2$-distributed with degrees of freedom equal to the number of parameters in the model. As with the regular $\chi^2$ test, strictly speaking this is a pass/fail test. It tests whether the model parameters are equal across contexts. However, we strongly agree with Koppelman and Wilmot (1982) in that while such tests are useful to alert the analyst to differences between models, these differences should be interpreted against the acceptable error in each application context. This means focusing not on whether there is a difference or not, but on how big this difference is.
Other likelihood-based indices are the transfer index (TI) and the transfer rho-square proposed by Koppelman and Wilmot (1982).
The transfer index (TI) measures the degree to which the transferred model (a model estimated on sample e and transferred to transfer
sample v) exceeds a local reference model (e.g. a market share model estimated on transfer sample v) relative to a model estimated
directly on the transfer sample v. This measure has an upper-bound of one, in the case where the transferred model is as accurate as the
local model, and takes negative values when the model is worse than the local reference model. While the transfer index is a relative
measure of transferability, the transfer rho-square can be used as an absolute measure. This is the usual likelihood ratio index but used
to evaluate performance of the transferred model against the local reference model. This index is upper-bounded by the local
rho-square and has no lower bound. Negative values are interpreted in a similar manner to the transfer index.
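To illustrate how these likelihood-based transferability measures relate to one another, the sketch below computes the transfer rho-square, the transfer index, and the TTS from a set of log-likelihood values; all numbers, including the number of parameters, are made up for illustration, and the formulas are those summarized in Table 1 below.

```python
# Likelihood-based transferability measures (transfer rho-square, transfer index, TTS)
# computed from log-likelihood values; the numbers below are made-up for illustration.
from scipy.stats import chi2

ll_transferred = -1450.0   # LL_v(beta_e): model estimated on e, applied to transfer sample v
ll_local = -1400.0         # LL_v(beta_v): model re-estimated on the transfer sample v
ll_reference = -1800.0     # LL_v(MS_v): local reference (e.g., market share) model
n_params = 8               # number of model parameters

transfer_rho_sq = 1 - ll_transferred / ll_reference
transfer_index = (ll_transferred - ll_reference) / (ll_local - ll_reference)

# Transferability test statistic: chi-square distributed with n_params degrees of freedom
tts = 2 * (ll_local - ll_transferred)
p_value = chi2.sf(tts, df=n_params)
print(f"transfer rho^2={transfer_rho_sq:.3f}, TI={transfer_index:.3f}, "
      f"TTS={tts:.1f} (p={p_value:.4f})")
```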
Table 1 summarizes the performance measures described above and their respective equations. Measures are classified into absolute measures, relative measures, and pass/fail measures. While relative measures are useful for model selection, they do not give an absolute measure of predictive accuracy. Absolute measures, on the other hand, can be used to generate benchmark values against which researchers can evaluate, to a certain extent, the performance of their models against similar studies in the literature. It must be noted, however, that even when using absolute measures, model performance is relative to factors such as choice set size and the base market share of alternatives. As such, comparisons across different studies must be interpreted with a clear understanding of these limitations, and not as an absolute indicator of model superiority.

Table 1
Definition of model validation performance measures reported in the literature (measure, abbreviation, type, equation, bounds, and notes).

- Predicted vs observed outcomes (PVO). Simple comparison of predicted and observed outcomes or market shares, usually in the form of a table or plot; no prediction accuracy statistics are calculated. Bounds: none in general (a).
- Percentage of correct predictions or First Preference Recovery (FPR). Absolute; bounds [0, 100]. $FPR = \frac{100}{N_v}\sum_{n_v=1}^{N_v} \mathbb{1}[\hat{y}^{e}_{n_v} = y_{n_v}]$, where $y_{n_v}$ is the observed choice made by individual n in validation sample v, and $\hat{y}^{e}_{n_v}$ is the choice with the highest predicted probability, predicted from the model estimated on sample e.
- % clearly right (t) (%CR). Absolute; bounds [0, 100]. $\%CR = \frac{100}{N_v}\sum_{n_v=1}^{N_v} \mathbb{1}[\hat{P}(y^{e}_{n_v}) > t]$, where $\hat{P}(y^{e}_{n_v})$ is the estimated choice probability of the chosen alternative for individual n in validation sample v, predicted from the model estimated on sample e, and t is an arbitrary threshold.
- % clearly wrong (t) (%CW). Absolute; bounds [0, 100]. $\%CW = \frac{100}{N_v}\sum_{n_v=1}^{N_v} \mathbb{1}[\hat{P}(!y^{e}_{n_v}) > t]$, where $\hat{P}(!y^{e}_{n_v})$ is the estimated choice probability of an alternative other than the chosen one.
- % unclear (t) (%U). Absolute; bounds [0, 100]. $\%U = 100 - (\%CR(t) + \%CW(t))$.
- Fitting factor (FF). Absolute; bounds [0, 1]. $FF = \frac{1}{N_v}\sum_{n_v=1}^{N_v} \hat{P}(y^{e}_{n_v})$.
- Concordance statistic (C). Absolute; bounds [0, 1]. Given a binary choice situation, $C = \frac{1}{N_{c1}N_{c0}}\sum_{n_{c1}=1}^{N_{c1}}\sum_{n_{c0}=1}^{N_{c0}} C(\hat{P}(y^{e}_{n_1,c1}), \hat{P}(y^{e}_{n_0,c0}))$, where the pairwise score equals 1 if $\hat{P}(y^{e}_{n_1,c1}) > \hat{P}(y^{e}_{n_0,c0})$, 0.5 if they are equal, and 0 otherwise. $N_{c1}$ is the subset of the sample that chose alternative c, $N_{c0}$ is the subset that did not, $\hat{P}(y^{e}_{n_1,c1})$ is the probability of choosing alternative c for individuals that chose it, and $\hat{P}(y^{e}_{n_0,c0})$ is the probability of choosing alternative c for individuals that did not choose it. The v subscript is omitted for simplicity.
- Correlation (CORR). Absolute; bounds [-1, 1]. $corr(s_v, \hat{s}^{e}_{v})$, the correlation between predicted and observed outcomes, where $s_v$ is a continuous aggregate outcome measure in sample v (i.e. train ridership, etc.) and $\hat{s}^{e}_{v}$ is the same measure predicted from the model estimated on sample e.
- Error (E). Relative. $\hat{s}^{e}_{v,m} - s_{v,m}$, where $s_{v,m}$ is an aggregate outcome measure in sample v, such as the market share of alternative m (i.e. modal market share) or its choice frequency, $\hat{s}^{e}_{v,m}$ is the same measure predicted from the model estimated on sample e, and M is the number of alternatives in the choice set.
- Percentage error (PE). Absolute. $100 \cdot (\hat{s}^{e}_{v,m} - s_{v,m})/s_{v,m}$.
- Absolute percentage error (APE). Absolute; lower bound 0. $100 \cdot |\hat{s}^{e}_{v,m} - s_{v,m}|/s_{v,m}$.
- Mean absolute percentage error (MAPE). Absolute; lower bound 0. $\frac{100}{M}\sum_{m=1}^{M} |\hat{s}^{e}_{v,m} - s_{v,m}|/s_{v,m}$.
- Sum of square error (SSE). Relative; lower bound 0 (b). $\sum_{m=1}^{M}(\hat{s}^{e}_{v,m} - s_{v,m})^2$.
- Root sum of square error (RSSE). Relative; lower bound 0 (b). $\sqrt{\sum_{m=1}^{M}(\hat{s}^{e}_{v,m} - s_{v,m})^2}$.
- Mean absolute error (MAE). Aggregate: relative; disaggregate: absolute. Lower bound 0 (b, c). $\frac{1}{M}\sum_{m=1}^{M}|\hat{s}^{e}_{v,m} - s_{v,m}|$.
- Mean squared error (MSE). Aggregate: relative; disaggregate: absolute. Lower bound 0 (b, c). $\frac{1}{M}\sum_{m=1}^{M}(\hat{s}^{e}_{v,m} - s_{v,m})^2$.
- Root mean square error (RMSE). Aggregate: relative; disaggregate: absolute. Lower bound 0 (b, c). $\sqrt{\frac{1}{M}\sum_{m=1}^{M}(\hat{s}^{e}_{v,m} - s_{v,m})^2}$.
- Brier score (BS). Absolute; bounds [0, 2] (d). $BS = \frac{1}{N_v}\sum_{n_v=1}^{N_v}\sum_{m=1}^{M}(\hat{P}(y^{e}_{n_v,m}) - y_{n_v,m})^2$, where $\hat{P}(y^{e}_{n_v,m})$ is the predicted probability that individual n chooses alternative m, predicted from the model estimated on sample e, and $y_{n_v,m}$ is the actual outcome variable valued 0 or 1.
- Log-likelihood (LL). Relative; upper bound 0. $LL_v(\hat{\beta}_e)$, the log-likelihood of the model estimated on data e applied to the validation data v.
- Log-likelihood loss (LLL) (e). Relative; lower bound 0. $LLL = \frac{1}{R}\sum_{r=1}^{R}\left(-\frac{LL_{v,r}(\hat{\beta}_e)}{N_{v,r}}\right)$, where $LL_{v,r}(\hat{\beta}_e)$ is the log-likelihood of the model estimated on data e applied to validation data $v_r$, $N_{v,r}$ is the size of the validation (holdout) sample r, and R is the number of validation samples generated.
- Rho-square (likelihood ratio index) (RHOSQ). Absolute; bounds [0, 1]. $\rho^2 = 1 - LL_v(\hat{\beta}_e)/LL_v(0)$, where $LL_v(0)$ is the log-likelihood of the model when all parameters are zero for data v.
- Transfer rho-square (likelihood ratio index) (T-RHOSQ). Absolute; upper bound $\rho^2_{local}$. $\rho^2_{transfer} = 1 - LL_v(\hat{\beta}_e)/LL_v(MS_v)$, where $LL_v(MS_v)$ is the log-likelihood of a base model estimated on validation data v (i.e. a market share model) and $\rho^2_{local}$ is the local rho-square of the model.
- Transfer index (TI). Relative; upper bound 1. $TI = \frac{LL_v(\hat{\beta}_e) - LL_v(MS_v)}{LL_v(\hat{\beta}_v) - LL_v(MS_v)}$, where $LL_v(\hat{\beta}_v)$ is the log-likelihood of the model estimated on the validation data v.
- Transferability test statistic (TTS). Pass/fail; lower bound 0. $TTS = 2(LL_v(\hat{\beta}_v) - LL_v(\hat{\beta}_e))$.
- $\chi^2$ test (CHISQ). Pass/fail; lower bound 0. $\sum_{m=1}^{M} \frac{(f_m - E(f^{e}_{v,m}))^2}{E(f^{e}_{v,m})}$, where $f_m$ is the observed choice frequency of alternative m in sample v and $E(f^{e}_{v,m})$ is the expected choice frequency predicted from the model estimated on sample e.

Notes: (a) bounded for market shares; (b) in the case aggregate outcomes are market shares, upper bounds are dependent on the choice set size; (c) upper bounds exist for the disaggregate case; (d) for the specific case of binary choices it is common to drop the second summation sign for simplicity, in which case the upper bound is 1; (e) a simple average over R, or a moving average across all validation sets, can be calculated.
6. Validation and reporting practices in the transportation academic literature
Having summarized the basic validation process and the most used performance indicators in the literature, this section will review
the validation and reporting practices in the peer-reviewed literature and show that the transportation academic literature has over-relied on statistical goodness-of-fit and disregarded model validation to a very large extent.
Using the Web of Science Core Collection maintained by Clarivate Analytics we reviewed validation and reporting practices in the
transportation academic literature published between 2014 and 2018. Articles were selected based on the following criteria:
(1) Peer-reviewed journal articles published between 2014 and 2018
(2) Analysis uses discrete choice models
(3) Target choice dimensions are:
    - Destination choice
    - Mode choice
    - Route choice
(4) Articles that analyse other choice dimensions are considered if and only if the article includes at least one of the three target choice dimensions defined in (3).
(5) Web of Science Database search keywords are:
    - Destination choice model
    - Mode choice model
    - Route choice model
(6) Web of Science Database fields are:
    - Transportation
    - Transportation science and technology
    - Economics
    - Civil engineering
(7) Research scope is limited to human land transport and daily travel behaviour (tourism, evacuation behaviour, and freight transport articles are excluded)
(8) Analysis uses empirical data from revealed preference (RP) or joint revealed/stated preference (SP-RP) studies.³ (Studies using numerical simulations only are excluded)
(9) Methodological papers are only included if they use empirical data

³ Given that the error component variance for designed experiments will differ from the variance in the real world, the use of stated preference (SP) surveys for forecasting and calculation of elasticities is not recommended (Hess and Rose, 2009; de Jong, 2014). As such, SP studies were excluded from the analysis. In the case of SP-RP studies, the SP error component can be calibrated against the RP data (de Jong, 2014).
A total of 226 articles met the above criteria. Note that although the choice dimensions are destination, mode and route choice, the definitions of choice settings or choice sets differ by study. For example, in the category of mode choice, in addition to the traditional way to define choices (i.e. car, rail, bus, walk, etc.), changes in mode choice are also included in the review. Similarly, in route choice analysis, in addition to link and path choices, simpler choice settings are also included, such as riding on a bike lane or on the walkway, taking the stairs or the escalator, using a particular detour or not, etc. In the case of route choice models, stochastic user equilibrium (SUE) models are excluded, as discrete choice models are just a subcomponent of a larger model.
Of the 226 articles reviewed, and consistent with standard practice in the field, 92% of articles reported a goodness-of-fit statistic. 64.6% of articles reported some kind of policy-relevant inference analysis. This means going beyond simply discussing the statistical significance and direction of the estimated parameters, and focusing instead on effect magnitudes and on estimating values that are interpretable in a policy context, such as marginal effects, elasticities, odds ratios, marginal rates of substitution, and/or policy scenario simulations (see Table 5 in the appendix section for the complete table summarizing the articles reviewed in this paper). This is a welcome finding amid criticism that focusing exclusively on statistical significance is widespread in many sciences at the expense of policy-relevant inferences, discussion of effect magnitudes, and tests of power (Ziliak and McCloskey, 2007).
Only 18.1% of the reviewed articles reported model validation, out of which 78% (14.2% of all studies) consisted of internal validation and 22% (4% of all studies) consisted of external validation. Table 2 summarizes the details of these studies. Note that only studies that reported explicitly and in a clear way how model validation was conducted were considered.

Table 2
Summary of articles reporting validation in the literature (article | model | dependent variables | validation type | validation method | notes). Evaluation measures considered: PVO, FPR, %CR, %CW, E/PE/APE, SSE, MAE, MAPE, MSE, RMSE, BS, C, CHISQ, MSD, FF, LL, LLL, RHOSQ, CORR, TTS, TI, OTHER.

Zimmermann et al. (2018) | MNL; MXL; MRL | AC-DC-DT-MC | Internal | RLT | 54% estimation, 46% validation, 13 runs.
Danalet et al. (2016) | MNL | DC | Internal | HOV | 91.5% estimation, 9.5% validation (past choices used for estimation, most recent choice used for validation).
Faghih-Imani and Eluru (2015) | MNL | DC | Internal | HOV | 75% estimation, 25% validation.
Assi et al. (2018) | MNL; NN* | MC | Internal | RLT | 75% estimation, 25% validation, 3 runs.
Bohluli et al. (2014) | MNL; NL | MC | External | IS | Validation data is observed ridership data after introduction of new transit service.
Bueno et al. (2017) | MNL | MC | Internal | R-K-CV | 10-fold CV, 5 000 runs.
Glerum et al. (2014) | HCM | MC | Internal | HOV | 80% estimation, 20% validation.
Hasnine and Habib (2018) | HDDC | MC | Internal | HOV | 80% estimation, 20% validation.
Kunhikrishnan and Srinivasan (2018) | MNL | MC | Internal | HOV | 70% estimation, 30% validation.
Ma et al. (2015) | MNL; MXL; LCM | MC | Internal | HOV | 80% estimation, 20% validation.
Mahmoud et al. (2016) | MNL; PLC | MC | Internal | HOV | 80% estimation, 20% validation.
Sanko (2014) | MNL | MC | External | IS | Validation against temporal transfer sample.
Sanko (2016) | MNL | MC | External | IS | Validation against temporal transfer sample.
Sanko (2017) | MNL | MC | External | IS | Validation against temporal transfer sample.
Sanko (2018) | MNL | MC | External | IS | Validation against temporal transfer sample.
Vij and Krueger (2017) | MNL; MXL; LCM | MC | Internal | HOV | 90% estimation, 10% validation.
Vij and Walker (2014) | LCM | MC | Internal | HOV | 90% estimation, 10% validation.
Wang et al. (2014) | NL | MC | Internal | IS | Validation data is observed mode shares collected by transit agencies.
Weng et al. (2018) | MNL | MC | Internal | IS | Validation data is smart card data.
Gokasar and Gunay (2017) | MNL | MC | Internal | HOV | 75% estimation, 25% validation.
Habib et al. (2014) | hTEV | MC; MC | External | IS | Validation against temporal transfer sample.
Chikaraishi and Nakayama (2016) | MNL; WM; QL | MC; RC | Internal | RLT | 50% estimation, 50% validation, 100 runs.
Idris et al. (2015) | MNL; NL; HCM | MC; MC | Internal | HOV | Validation group is subset of observations with an observed mode shift (in SP experiment).
Golshani et al. (2018) | MNL; ANN*; JDC | MC-DT | Internal | HOV | 80% estimation, 20% validation.
Suel and Polak (2017) | NL | O-MC | Internal | HOV | Validation sample is observations from same sample, for the one-month period prior to survey week.
Faghih-Imani and Eluru (2017) | LMM; MNL | DC; O | Internal | HOV | For AT/DT: approx. 96% estimation, 4% validation (1 week/24 weeks). For DC: holdout of 5 000 trips (estimation was conducted with sample sizes ranging from 1 000 to 20 000).
Alizadeh et al. (2018) | PSL; EPSL; IAL; LK; NL | RC | Internal | HOV | 80% estimation, 20% validation, 1 run.
Kazagli et al. (2016) | MNL | RC | Internal | RLT | 80% estimation, 20% validation, 100 runs.
Lai and Bierlaire (2015) | MNL; PSL; CNL | RC | Internal | HOV | 50% estimation, 50% validation, 3 runs, performance measures not averaged.
Li et al. (2016) | MXL (PSL; GNL; LK; CL, etc.) | RC | Internal | SS-O | Two non-overlapping samples were generated from original dataset for validation.
Mai (2016) | NRL; RCNL | RC | Internal | RLT | 80% estimation, 20% validation, 20 runs.
Mai et al. (2017) | RL; NRL; ML; RRM | RC | Internal | RLT | 80% estimation, 20% validation, 40 runs.
Mai et al. (2015) | RL; NRL | RC | Internal | RLT | 80% estimation, 20% validation, 40 runs.
Papola (2016) | CoRUM | MC | Internal | HOV | 70% estimation, 30% validation.
Zhang et al. (2015) | MNL | RC | Internal | IS | Validation data is an independent observation of 3 sites plus a new site.
Zhang et al. (2018) | NL | RC | Internal | HOV | 85% estimation, 15% validation.
Zimmermann et al. (2017) | NRL | RC | Internal | RLT | 80% estimation, 20% validation, 20 runs.
Xie et al. (2016) | MXL | RC | Internal | IS | New observations from same site.
Duc-Nghiem et al. (2018) | MNL | RC | External | IS | Validation data is pre-post data from site not used in calibration.
Fox et al. (2014) | NL | MC-DC | External | IS | Validation against temporal transfer sample.
Forsey et al. (2014) | O | MC | External | IS | Validation against temporal transfer sample.

Model abbreviations: CNL: cross-nested logit; CoRUM: combination of RUM models; EPSL: extended path size logit; HCM: hybrid choice model; hTEV: heteroscedastic tree extreme value; IAL: independent availability logit; IBL: instance-based learning; LCM: latent class model; LK: logit kernel; MNL: binary or multinomial logit; MNLp: binary or multinomial logit with panel data corrections; MNP: binary or multinomial probit; MRL: mixed recursive logit; MXL: mixed logit; NL: nested logit; NN*: neural network; O: other; PLC: parameterized logit captivity; PSL: path size logit; QL: q-logit; RCNL: recursive cross-nested logit; RRM: random-regret maximization; WB: Weibit model. *Machine learning models.
Dependent variable abbreviations: AC: activity choice; AT: arrival time; DT: departure time choice/time of day choice; DC: destination choice; MC: mode choice; MC: mode choice change; RC: route choice; O: other choice dimension.
Validation method abbreviations: HOV: holdout validation; IS: validation against an independent sample; R-K-CV: repeated K-fold cross-validation; RLT: repeated learning-testing; SS-O: other sample splitting method.
In terms of internal validation, as illustrated in Table 3, of the studies that reported any validation practice, 56.3% used the holdout
validation method, followed by repeated learning-testing (25%). These two methods add up to 81.3% of the studies reporting internal
validation in the literature. 12.5% of studies used an independent sample for internal validation, with the remaining 87.5% relying on
sample splitting approaches.
The bootstrap method was only used by one study (Sanko, 2017), which used bootstrapping to evaluate the effects of data newness
and sample size on the temporal transferability of demand forecasting models.
Irrespective of the type of validation, as shown in Table 4, the log-likelihood (34.1%) and the log-likelihood loss (12.2%) jointly had the highest reporting share, with 46.3% of studies reporting either one of them.⁴ The second most frequently used measures were the predicted-vs-observed simple comparison and the percentage of correct predictions, with a 24.4% share each.

⁴ Note that, as shown in Table 2, some articles reported more than one measure.
One limitation of likelihood-based measures such as LL and LLL is that they fail to provide an absolute measure of predictive accuracy. This is important because gains in predictive power of a "superior" model can be, in fact, very minimal. That is, the best model among a set of models can still be a very bad model prediction-wise. A similar argument can be made in relation to other relative performance measures. While we believe that model validation is not a pass/fail test, absolute measures of predictive ability such as the percentage of correct predictions, rho-square (7.3%), or the Brier score (2.4%), among others, are useful as they can be used to generate benchmark values against which researchers can evaluate, to a certain extent, the performance of their models against similar studies in the literature. For example, in the reviewed articles, the percentage of correct predictions for destination (or location) choice models ranged between 13% and 22%, while for mode choices it ranged from 36% to 87%, with most studies reporting values above the 60% threshold. Similarly, for route choice, values ranged between 51% and 73%. However, note that since the number of
studies is not large, a myriad of studies with different definitions of choice variables and choice set sizes were combined to calculate these values. As such, they need to be interpreted with caution. Furthermore, as pointed out by de Luca and Cantarella (2009), this index fails to discriminate between different degrees of predicted probabilities of correct predictions. Unfortunately, "clearness of prediction" measures that do account for this are still not very widely used. The percentage of clearly right index had a share of 2.4%, while the percentage of clearly wrong index had a 0% share. Similarly, the share of studies reporting measures of model discriminatory ability was 2.4%.
Due to the fact that model performance is relative to factors such as choice set size and the base market share of alternatives, comparisons across models in different studies and contexts are complicated. Such comparisons must be interpreted with a clear understanding of these limitations, and not as absolute measures of model superiority.
Although ideally all studies would be externally validated, in most cases the types of validation that a researcher can conduct are limited by the availability of data and computational resources. But there is certainly room for improvement. Researchers should strive to conduct the best validation tests possible given the resources at hand and carefully report the details of how these tests were conducted, so that other researchers can clearly understand to what extent the results presented are generalizable. In Fig. 2 we illustrate a simple heuristic to determine the recommended validation practices given available resources, as well as what measures to report. We start by acknowledging that if a randomized controlled trial is possible, it would be the best alternative. That being said, as we discussed earlier, such an experiment is extremely difficult in the field.

The existence of data from either a different but plausibly related population, or from the same population, is one of the key factors defining what kind of validation is possible. The bottom line is conducting internal validation, either via data-splitting methods or bootstrapping, in the absence of independent datasets. Unless the model is too computationally intensive, we recommend avoiding holdout validation as it is a pessimistic estimator that makes inefficient use of the data (Kohavi, 1995). Regarding the choice between cross-validation and bootstrapping, as discussed earlier, the performance of these methods is dependent on data characteristics such as sample size, and on the type of model used. In the absence of an empirical study comparing the performance of these methods using typical data from transportation studies and commonly used models, we will refrain from making a recommendation.

In terms of what performance measures to report, rather than relying on a single indicator we recommend reporting several indicators as shown in Fig. 2, which include measures comparable across studies, and measures of discriminative ability and clearness of prediction. When using indicators for model selection, ideally the best model will excel in all performance measures, but there is no guarantee this is so. As such, the analyst should strive to clearly explain the criteria used for model selection.

Fig. 2. Heuristic to select validation method given available resources and recommended performance measures to report.
Table 4
Predictive accuracy performance measures reported in the literature by frequency.
Performance measure Abbrv. Frequency Percentage
Log-likelihood/log-likelihood loss LL/LLL 19 46.3%
Percentage of correct predictions or First Preference Recovery FPR 10 24.4%
Predicted vs observed market outcomes PVO 10 24.4%
Mean absolute error MAE 6 14.6%
Root mean square error RMSE 4 9.8%
Error/Percentage error/Absolute percentage error E/PE/APE 3 7.3%
Rho-Square RHOSQ 3 7.3%
Transfer index TI 2 4.9%
% clearly right (t) %CR 1 2.4%
Brier Score BS 1 2.4%
Chi-square CHISQ 1 2.4%
Concordance index C 1 2.4%
Correlation CORR 1 2.4%
Fitting factor FF 1 2.4%
Mean absolute percentage error MAPE 1 2.4%
Sum of square error SSE 1 2.4%
Transferability test statistic TTS 1 2.4%
All other measures specied in Table 1 0 0%
Other measures not specied in Table 1 3 7.3%
Very similar measures are reported jointly.
Table 3
Internal validation methods reported in the literature by frequency.
Method Abbrv. Frequency Percentage
Holdout validation HOV 18 56.3%
Repeated learning-testing RLT 8 25.0%
Validation against an independent sample IS 4 12.5%
Repeated K-fold cross-validation R-K-CV 1 3.1%
Other sample splitting methods SS-O 1 3.1%
7. Towards better validation practices in the field
While most researchers would agree that the purpose of travel demand analysis is to make valid predictions to aid effective policy evaluation (reflected in the fact that most studies reported some sort of policy implication), we showed that there is a clear disconnect between this objective and current research practice, partially evident in the low levels of validation reporting in the literature. Strong pressure to publish among academics means there is little incentive to conduct proper validation. While producing independent samples for validation might be prohibitively expensive, whenever possible researchers should strive to produce such data or use existing periodical surveys such as household travel surveys. In the absence of such data, data-splitting and bootstrapping methods are inexpensive and should not require substantially more effort than drawing policy-relevant inferences.
Stronger criteria are needed to evaluate models in academia in order to increase the reliability of results and the credibility of inferences based on statistical models. We put forward the following recommendations aimed at improving validation practices in the field:
(1) Make validation mandatory: model validation should be a non-negotiable part of model reporting and peer review in academic journals for any study that provides policy recommendations. At the very least, internal validation results should be reported. Conducting internal validation is the norm in machine learning studies, and there is no reason why similar standards cannot be implemented in the field. This would provide better incentives to perform validation.
(2) Share benchmark datasets: as pointed out by an anonymous reviewer at the conference where a previous version of this article was presented, a fundamental limitation in the field is the lack of benchmark datasets and of a general culture of sharing code and data. Certainly, privacy concerns as well as institutional limitations impede the free sharing of most relevant and larger datasets (e.g. household travel surveys), and in most cases the decision is out of the hands of individual researchers, but efforts should be made towards the collection and opening of relevant benchmark data.
Fig. 2. Heuristic to select validation method given available resources and recommended performance measures to report.
(3) Incentivize validation studies: the most prestigious journals place strong emphasis on theoretically innovative models. While we recognize the importance of such innovation, journal editors should also encourage submissions that focus on the proper validation of existing models and theories.
(4) Draw up and enforce clear reporting guidelines: efforts to improve reporting are well documented in other fields. For example, in the medical field, following a consensus among researchers that the reporting of published observational studies is inadequate and hinders the assessment of the quality and generalizability of these studies, guidelines have been developed to strengthen the reporting of observational studies (e.g. the STROBE statement (von Elm et al., 2007)). Although results are mixed regarding the impact of these guidelines (in many journals their use is recommended but not mandatory), they are a step in the right direction. A transportation-specific guideline could be developed and properly enforced that requires, in addition to detailed information on survey characteristics such as sampling method and representativeness of the data, the reporting of validation results. It is worth noting that guidelines for travel model validation do exist for practitioners (e.g. Cambridge Systematics, 2010), so validation and reporting guidelines specific to transportation researchers would be a welcome contribution. Proper reporting should also include policy-relevant values such as elasticities, marginal effects and other inferences that properly convey the magnitude of the effects of interest (a minimal sketch of an elasticity calculation is given after this list). As mentioned earlier, 64.6% of the studies reviewed reported some sort of policy-relevant inference, a welcome finding. Models with high predictive power are not very useful if no policy implications can be drawn from them.
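As a concrete example of the policy-relevant quantities mentioned in recommendation (4), the sketch below computes the standard direct point elasticity of a multinomial logit choice probability with respect to a linearly entering attribute, E = β_k · x_k · (1 − P). The function name and the numbers in the usage line are purely illustrative assumptions.

```python
# Minimal sketch (illustrative only): direct point elasticity of an MNL choice
# probability with respect to a linearly entering attribute of the same
# alternative, E = beta_k * x_k * (1 - P). All numbers below are hypothetical.

def mnl_direct_elasticity(beta_k, x_k, p):
    """Elasticity of P(alternative i) with respect to attribute k of alternative i,
    evaluated at a given observation; works element-wise on arrays as well."""
    return beta_k * x_k * (1.0 - p)

# e.g. an in-vehicle time coefficient of -0.05/min, a 30-minute transit ride and a
# predicted transit probability of 0.4 give a point elasticity of -0.9
print(mnl_direct_elasticity(-0.05, 30.0, 0.4))
```

Sample-averaged (aggregate) elasticities can then be obtained by evaluating the same expression observation by observation and averaging, probability-weighted if desired.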
Finally, we would like to address one argument and two questions commonly heard in academic circles regarding validation practices.
The argument is that "I'm not validating my model because I'm not trying to build a predictive framework; I'm trying to learn about travel behaviour." In response, we argue that the more exploratory the subject is, the weaker the onus of validation, at least until some critical mass of research has been conducted. Conversely, the more orthodox the type of analysis conducted (such as the dimensions of travel behaviour covered in this study), the stronger the onus of validation.
This response motivates the following questions: "Should every study using a discrete choice model conduct validation?" and "Is what we learn about travel behaviour from coefficient estimation less valuable if validation is not conducted?"
Regarding the rst question, in short, yes. At the very least, any article that makes policy recommendations should be subject to
proper validation given the issues discussed in Section 2 about policy impacts and the lack of a feedback loop in academia. There is a
myriad of reasons why some scepticism is warranted against any particular model outcome, the most obvious one being model
overtting. So proper validation should be used to strengthen presented results, especially given the limitations discussed in Section 2
regarding the dependence on cross-sectional studies, and difculties associated with scientic hypothesis testing.
Regarding the second question, while coefficients are useful for policy-relevant inferences, coefficients by themselves do not inform us about a model's predictive ability. Policy inference analysis should be complemented with evidence on the generalizability of such inferences.
Finally, while we recognize that better validation practices alone will not solve the credibility crisis in the field, they are certainly a step in the right direction. More specifically, model validation is no solution to the causality problem in the field (see Brathwaite and Walker (2018) for an in-depth discussion of causality in transportation studies), but we want to underscore that the reliance on observational studies inherent to the field demands more stringent controls to improve the validity of results. Although out of the scope of the present study, such controls should also cover aspects such as sample representativeness, proper model specification, and the statistical power of effects of interest, which are also critical to the validity of results.
8. Conclusion
In this article we reviewed validation practices in the peer-reviewed transportation literature published between 2014 and 2018. We found that although 92% of studies reported goodness-of-fit statistics and 64.6% reported some sort of policy-relevant inference analysis, the share of studies reporting validation stood at only 18.1%.
We argued that model validation should be a non-negotiable part of model reporting and peer review in academic journals and proposed a simple heuristic for choosing validation methods given available resources. At the same time, we recognize the need for proper incentives to promote better validation practices and for tools and knowledge to support them, such as reporting guidelines and the encouragement by journals of submissions that focus on the validation of existing models and theories, and not only on new, theoretically innovative models.
Author statement
Giancarlos Parady conceived the study, conducted the literature review and wrote the first draft.
Giancarlos Parady, David Ory and Joan Walker reviewed and revised the manuscript, confirmed the analysis results, and approved the final version.
Funding
This work was supported by JSPS KAKENHI Grant Numbers 17K14737 and 20H02266.
Declaration of competing interest
The authors report no conflict of interest.
Acknowledgements
An earlier version of this work was presented at the 6th International Choice Modelling Conference, Kobe, Japan, August 19–21, 2019.
Appendix
Table 5
Summary of reviewed articles
No Article Model(s) Dependent variable(s) Data structure (CS, RCS, PP, TP) Data type (RP, RP-SP) Goodness-of-fit reported Policy inference reported Validation reported
1 Manoj and Verma (2015b) MNL AC; MC; DT RP
2 Sadhu and Tiwari (2016) NL AC-DC RP
3 Arman et al. (2018) MXNL AC-MC RP
4 He and Giuliano (2018) MNL DC RP
5 Mahpour et al. (2018) HCM DC RP
6 Basu et al. (2018) NL DC RP
7 Clifton et al. (2016) MNL DC RP
8 Danalet et al. (2016) MNLp; DC RP
9 Deutsch-Burgner et al.
(2014)
MNL DC RP
10 Faghih-Imani and Eluru
(2015)
MNL DC RP
11 Ho and Hensher (2016) MNL DC RP
12 Huang and Levinson
(2015)
MXL DC RP
13 Huang and Levinson
(2017)
MXL DC RP
14 Shao et al. (2015) MNL DC RP
15 Wang et al. (2017) SCL; MNL DC RP
16 Faghih-Imani and Eluru
(2017)
LMM;
MNL
DC; AT; DT RP
17 Khan et al. (2014) MNL DC; MC RP
18 Paleti et al. (2017) MXL; JDC DC-DT-O RP
19 González et al. (2016) NL(PSL) DC-RC RP
20 Ma et al. (2018) CNL DT-MC RP
21 Abasahl et al. (2018) NL MC RP
22 Anta et al. (2016) MNL; NL MC RP-SP
23 Assi et al. (2018) MNL MC RP
24 Aziz et al. (2018) MXL MC RP
25 Böcker et al. (2017) MNL MC RP
26 Bohluli et al. (2014) O MC RP
27 Braun et al. (2016) MNL MC RP-SP
28 Bridgelall (2014) MNL MC RP
29 Bueno et al. (2017) MNL MC RP
30 Castillo-Manzano et al.
(2015)
MNL MC RP
31 Chakour and Eluru (2014) LCM MC RP
32 Cherchi and Cirillo (2014) MXL MC RP
33 Cherchi et al. (2017) MXL MC RP
34 Chica-Olmo et al. (2018) MNL MC RP
35 Clark et al. (2014) MXL MC RP
36 Cole-Hunter et al. (2015) MNL MC RP
37 Collins and MacFarlane
(2018)
MNL MC RP
38 Danaf et al. (2014) NL MC RP
39 de Grange et al. (2015) O MC RP
40 Di Ciommo et al. (2014) MXL MC RP
41 Ding et al. (2017) HCM MC RP
42 Ding et al. (2018) HCM MC-MTO RP
43 Ding et al. (2014b) O MC RP
44 Dong et al. (2016) NL MC RP
45 Nguyen-Phuoc (2018) MNL MC RP
46 Nguyen-Phuoc (2018) MNL MC RP
47 Efthymiou and Antoniou
(2017)
HCM MC RP
48 Ermagun and Levinson
(2017)
CNL MC RP
49 Ermagun and Samimi
(2015)
NL MC RP
50 Ermagun et al. (2015) NL; O MC RP
51 Fernández-Antolín et al. (2016) O MC RP
O MC RP
52 Flügel et al. (2015) CNL MC RP-SP
53 Forsey et al. (2014) O MC RP
54 Fu and Juan (2017) LCM MC RP
55 Gao et al. (2016) MNL MC RP
56 Gao et al. (2017) MNL MC RP
57 Gerber et al. (2017) MNL;
MXL
MC RP
58 Glerum et al. (2014) HCM MC RP
59 Goel and Tiwari (2016) MNL MC RP
60 Guan and Xu (2018) MNLp MC RP
61 Guerra et al. (2018) MNL MC RP
62 Guo et al. (2018) MXL MC RP
63 Habib and Sasic (2014) O MC RP
64 Habib and Weiss (2014) O MC RP
65 Habib et al. (2014) HCM MC RP
66 Halldórsdóttir et al. (2017) JMXL MC RP
67 Hasnine and Habib (2018) O MC RP
68 Hasnine et al. (2018) CNL MC RP
69 He and Giuliano (2017) MNL MC RP
70 Helbich (2017) MXL MC RP
71 Helbich et al. (2014) O MC RP
72 Hensher and Ho (2016) MXL MC RP
73 Hsu and Saphores (2014) MNL MC RP
74 Hurtubia et al. (2014) LCM MC RP
75 Irfan et al. (2018) MNL MC RP-SP
76 Lin et al. (2018) LCM MC RP
77 Ji et al. (2017) NL MC RP
78 Kamargianni et al. (2014) HCM MC RP
79 Kamruzzaman et al.
(2015)
MNL MC RP
80 Keyes and
Crawford-Brown (2018)
MNL MC RP
81 Khan et al. (2016) MXL MC RP
82 Kunhikrishnan and
Srinivasan (2018)
MNL MC RP
83 Yang et al. (2017) MNL MC RP
84 Larsen et al. (2018) MNL MC RP
85 Lee (2015) MNL MC RP
86 Lee et al. (2017) MNL MC RP
87 Lee et al. (2014) MNL MC RP
88 Li and Kamargianni
(2018)
MXNL MC RP-SP
89 Liu et al. (2018) MNL MC RP
90 Liu et al. (2015) MNL MC RP
91 Lorenzo Varela et al.
(2018)
ML; NL MC RP
92 Ma et al. (2015) LCM MC RP
93 Mahmoud et al. (2016) O MC RP
94 Mattisson et al. (2018) MNL MC RP
95 Mehdizadeh et al. (2018) MXL MC RP
96 Meng et al. (2016) MNL MC RP
97 Minal and Ravi (2016) MXL MC RP
98 Mitra and Buliung (2014) MNL MC RP
99 Mitra and Buliung (2015) MNL MC RP
100 Mitra et al. (2015) MNL MC RP
101 Moniruzzaman and Farber
(2018)
MNL MC RP
102 Myung-Jin et al. (2018) MNL MC RP
103 Noland et al. (2014) MXL MC RP
104 Owen and Levinson (2015) MNL MC RP
105 Park et al. (2015) MNL MC RP
106 Park et al. (2014) MNL MC RP
107 Paulssen et al. (2014) HCM MC RP
108 Pike and Lubell (2018) O MC RP
109 Pnevmatikou et al. (2015) NL MC RP-SP
110 Prato et al. (2017) LCM MC RP
111 Ramezani, Pizzo and
Deakin (2018b)
MNL MC RP
112 Ramezani, Pizzo and
Deakin (2018a)
MNL MC RP
113 Rashedi et al. (2017) O MC RP-SP
114 Rubin et al. (2014) MNLp MC RP
115 Rybarczyk and Gallagher
(2014)
MNL MC RP
116 Sanko (2014) MNL MC RP
117 Sanko (2016) MNL MC RP
118 Sanko (2017) MNL MC RP
119 Sanko (2018) MNL MC RP
120 Sarkar and Chunchu
(2016)
MNL MC RP
121 Sarkar and Mallikarjuna
(2018)
HCM MC RP
122 Scheepers et al. (2016) MNL MC RP
123 Shaheen et al. (2016) MNL MC RP
124 Sharmeen and
Timmermans (2014)
MNL MC RP
125 Shirgaokar and Nurul
Habib (2018)
HCM MC RP
126 Singh and Vasudevan
(2018)
MNL MC RP
127 Soltani and Shams (2017) NL MC RP
128 Stone et al. (2014) MNL MC RP
129 Sun et al. (2018) MNL MC RP
130 Thigpen et al. (2015) O MC RP
131 Toşa et al. (2018) NL MC RP-SP
132 Venigalla and Faghri
(2015)
MNL MC RP
133 Verma et al. (2015) MNL MC RP
134 Vij and Krueger (2017) MXL MC RP
135 Vij and Walker (2014) LCM MC RP
136 Vij et al. (2017) LCM MC RP
137 Wang et al. (2014) NL MC RP
138 Wang et al. (2015) MNP; O MC RP
139 Weiss and Habib (2018) O MC RP
140 Weng et al. (2018) MNL; O MC RP-SP
141 Yang et al. (2014) MNL MC RP
142 Yang et al. (2018) MXL MC RP
143 Yang et al. (2016) NL MC RP
144 Yazdanpanah and Hadji
Hosseinlou (2017)
HCM MC RP
145 Yen et al., 2018a LCM MC RP
146 Yen et al., 2018b MNL MC RP
147 Zhang et al. (2017) MNL MC RP
148 Zhao and Li (2017) ML-MNL MC RP
149 Zimmermann et al. (2018) MRL MC RP
150 Zolnik (2015) ML-MNL MC RP
151 Gokasar and Gunay (2017) MNL MC RP
152 Tilahun et al. (2016) MNL MC RP
153 Astegiano et al. (2017) MXL; CNL MC; MTO RP
154 Heinen (2016) MNL MC; O RP
155 Chikaraishi and Nakayama
(2016)
O MC; RC RP
156 Habib et al. (2014) O MC; MC RP
157 Idris et al. (2015) HCM MC; MC RP-SP
158 Ahmad Termida et al.
(2016)
MXL MC RP
159 Fatmi and Habib (2017) MXL MC RP
160 Heinen and Ogilvie (2016) MNL MC RP
161 Mitra et al. (2017) MNL MC RP
162 Rahman and Baker (2018) MNL MC RP
163 Standen et al. (2017) MNL MC; RC RP
164 Llorca et al. (2018) MNL MC; DC; TF RP
165 Manoj and Verma (2015a) MNL MC; O RP
166 Rahul and Verma (2018) MNL; OLS MC; TD RP
167 Popovich and Handy
(2015)
MNL; OL MC; TF RP
168 Kristoffersson et al. (2018) NL MC-DC RP
169 Fox et al. (2014) NL MC-DC RP
170 Ding et al. (2014a) CNL MC-DT RP
171 Golshani et al. (2018) NN; JDC MC-DT RP
172 Ermagun and Samimi
(2018)
JDC MC-O RP
173 Kaplan et al. (2016) O MC-O RP
174 Xiqun et al. (2015) NL MC-O RP
175 Habib (2014b) O MC-O-MTO RP
176 Habib (2014a) JDC MC-TD RP
177 Schoner et al. (2015) O MC-TF RP
178 Marquet and
Miralles-Guasch (2016)
MNL MTO; MC RP
179 Shen et al. (2016) MNL; NL MTO; MC RP
180 Khan et al. (2014) O; MNL MTO; TF; MC;
O
RP
181 Picard et al. (2018) NL MTO-MC RP
182 Suel and Polak (2017) NL O-MC RP
183 Yang (2018) MXL O; DC RP
184 Daisy et al. (2018) OP; MNL O; MC RP
185 Ho and Mulley (2015) NL O-MC RP
186 Liu et al. (2017) O;
MDCEV
O-MS RP
187 Zhang et al. (2014) NL O-RC RP
188 Pang and Khani (2018) MNL;
MXL
DC RP
189 Alizadeh et al. (2018) NL RC RP
190 Anderson et al. (2014) MXL
(PSCL)
RC RP
191 Baek and Sohn (2016) MNL
(PSL)
RC RP
192 Basheer et al. (2018) MNL RC RP
193 Chen and Wen (2013) MNL RC RP
194 Chen et al. (2018) MXL(PSL) RC RP
195 Li et al. (2016) MXL(PSL;
GNL;
LK; CL,
etc)
RC RP
196 Dalumpines and Scott
(2017)
PSL RC RP
197 Di et al. (2017) MNP RC RP
198 Garcia-Martinez et al.
(2018)
MXL RC RP-SP
199 Ghanayim and Bekhor
(2018)
MXL(PSL;
CL)
RC RP
200 Jánošíková et al. (2014) MNL RC RP
201 Kazagli et al. (2016) O RC RP
202 Zhang et al. (2018) MNL RC RP
203 Lai and Bierlaire (2015) CNL RC RP
204 Mai (2016) RCNL RC RP
205 Mai et al. (2017) RL RC RP
206 Mai et al. (2015) NRL RC RP
207 Moran et al. (2018) MNL
(PSL)
RC RP
208 Oyama and Hato (2018) O RC RP
209 Papola (2016) O MC RP
210 Prato (2014) MNL
(RRM)
RC RP
211 Prato et al. (2018) GMXL
(PSL)
RC RP
212 Raveau et al. (2014) MNL(CL) RC RP
213 Thomas and Tutert (2015) MNL RC RP
214 Zhang et al. (2018) NL RC RP
215 Yamamoto et al. (2018) NL(PSL) RC RP
216 Zhang et al. (2015) MNL RC RP
217 Zhuang et al. (2017) MNL RC RP
218 Zimmermann et al. (2017) NRL RC RP
219 Jánošíková et al. (2014) MNL RC RP
220 Xie et al. (2016) MXL RC RP
221 Yang (2016) NL RC RP
222 Duc-Nghiem et al. (2018) MNL RC RP
223 Katoshevski et al. (2015) MNL; NL RLC; DC; MC;
AC
RP
224 Bhat et al. (2016) O RLC-MTO-D-
MC
RP
225 Tran et al. (2016) O RLC-O-MC RP
226 Sener and Reeder (2014) O TF-MC RP
Model abbreviations
CL: C-logit
CoRUM: Combination of RUM models
EPSL: Extended path size logit
FMS: Flexible model structure
GMXL: Generalized mixed logit
GNL: Generalized nested logit
HCM: Hybrid choice model
HDDC: Heteroskedastic dynamic discrete choice
hTEV: Heteroscedastic tree extreme value
IAL: Independent availability logit
IBL: Instance-based learning
JDC: Joint discrete continuous
JMXL: Joint mixed logit
LCM: Latent class model
LK: Logit kernel
LMM: Linear mixed model
MDCEV: Multiple discrete continuous extreme value
ML: Mother logit
MLMXL: Multilevel mixed logit
MNL: Binary or multinomial logit
MNLp: MNL with panel data corrections
MPM: Multinomial probit model
MRL: Mixed recursive logit
MVMLL: Multivariate multilevel binary logit
MVPM: Multivariate probit model
MXL: Mixed logit
MXNL: Mixed nested logit
NL: Nested logit
NN*: Neural network
NRL: Nested recursive logit
O: Other extensions/generalizations
OP: Ordered probit
PL: Polarized logit
PLC: Parameterized logit captivity
PSCL: Path size correction logit
PSL: Path size logit
QL: q-logit
RCNL: Recursive cross nested logit
RL: Recursive logit
RRM: Random-regret maximization
SCL: Spatially correlated logit
SVM*: Support vector machine
WB: Weibit
*Machine learning models
**When several models for the same dependent variable are compared, only the most general or best performing model is listed in the Model(s) column.

Dependent variable abbreviations
AC: Activity choice
AT: Arrival time
DC: Destination choice
DT: Departure time choice/Time of day choice
MC: Mode choice
MS: Modal split
MTO: Mobility tool ownership
O: Other
RC: Route choice
RLC: Residential location choice
TF: Trip frequency

Data characteristics abbreviations
CS: Cross-sectional data
RCS: Repeated cross section, pooled data
PP: Pseudo-panel data
TP: True panel data*
RP: Revealed preference
SP: Stated preference
*In this classification, true panel data are defined as any survey that measures the travel behaviour of the same subjects at two or more different points in time, with a day as the smallest time unit (for example, repeated observations within the same day are classified as cross-sectional data, while a travel behaviour survey covering two or more days is considered true panel data). This classification is irrespective of the way the data was handled by the analyst. Stated preference surveys with multiple choice scenarios are considered cross-sectional data.
References
Abasahl, F., Kelarestaghi, K.B., Ermagun, A., 2018. Gender gap generators for bicycle mode choice in Baltimore college campuses. Elsevier Travel Behaviour and
Society 11, 7885. https://doi.org/10.1016/j.tbs.2018.01.002. November 2017.
Ahmad Termida, N., Susilo, Y.O., Franklin, J.P., 2016. Observing dynamic behavioural responses due to the extension of a tram line by using panel survey. Elsevier Ltd
Transport. Res. Pol. Pract. 86, 7895. https://doi.org/10.1016/j.tra.2016.02.005.
Alizadeh, H., et al., 2018. On the role of bridges as anchor points in route choice modeling. Springer US Transportation 45 (5), 11811206. https://doi.org/10.1007/
s11116-017-9761-7.
Anderson, M.K., Nielsen, O.A., Prato, C.G., 2014. Multimodal route choice models of public transport passengers in the Greater Copenhagen Area. Springer Berlin
Heidelberg EURO Journal on Transportation and Logistics 6 (3), 221245. https://doi.org/10.1007/s13676-014-0063-3.
Anta, J., et al., 2016. Influence of the weather on mode choice in corridors with time-varying congestion: a mixed data study. Transportation 43 (2), 337355. https://
doi.org/10.1007/s11116-015-9578-1.
Arlot, S., Celisse, A., 2009. A survey of cross-validation procedures for model selection, 4, 4079. https://doi.org/10.1214/09-SS054.
Arman, M.A., Khademi, N., de Lapparent, M., 2018. Womens mode and trip structure choices in daily activity-travel: a developing country perspective. Taylor &
Francis Transport. Plann. Technol. 41 (8), 845877. https://doi.org/10.1080/03081060.2018.1526931.
Assi, K.J., et al., 2018. Mode choice behavior of high school goers: evaluating logistic regression and MLP neural networks. Elsevier Case Studies on Transport Policy 6
(2), 225230. https://doi.org/10.1016/j.cstp.2018.04.006.
Astegiano, P., et al., 2017. Quantifying the explanatory power of mobility-related attributes in explaining vehicle ownership decisions. Res. Transport. Econ. 66, 211.
https://doi.org/10.1016/j.retrec.2017.07.007.
Austin, P.C., et al., 2016. ‘Geographic and temporal validity of prediction models : different approaches were useful to examine model performance. Elsevier Inc
J. Clin. Epidemiol. 79, 7685. https://doi.org/10.1016/j.jclinepi.2016.05.007.
Aziz, H.M.A., et al., 2018. Exploring the impact of walk–bike infrastructure, safety perception, and built-environment on active transportation mode choice: a random
parameter model using New York City commuter data. Springer US Transportation 45 (5), 12071229. https://doi.org/10.1007/s11116-017-9760-8.
Baek, J., Sohn, K., 2016. An investigation into passenger preference for express trains during peak hours. Springer US Transportation 43 (4), 623641. https://doi.org/
10.1007/s11116-015-9592-3.
Baker, M., Penny, D., 2016. Is there a reproducibility crisis? Nature 533 (7604), 452454. https://doi.org/10.1038/533452A.
Basheer, S., Srinivasan, K.K., Sivanandan, R., 2018. Investigation of information quality and user response to real-time traffic information under heterogeneous traffic
conditions. Transportation in Developing Economies 4 (2), 111. https://doi.org/10.1007/s40890-018-0061-5. Springer International Publishing.
Basu, D., et al., 2018. Modeling choice behavior of non-mandatory tour locations in California an experience. Travel Behaviour and Society 12, 122129. https://
doi.org/10.1016/j.tbs.2017.04.008.
Bhat, C.R., et al., 2016. On accommodating spatial interactions in a Generalized Heterogeneous Data Model (GHDM) of mixed types of dependent variables. Transp.
Res. Part B Methodol. 94, 240263. https://doi.org/10.1016/j.trb.2016.09.002. Elsevier Ltd.
Böcker, L., van Amen, P., Helbich, M., 2017. Elderly travel frequencies and transport mode choices in Greater Rotterdam, The Netherlands. Transportation 44 (4),
831852. https://doi.org/10.1007/s11116-016-9680-z.
Bohluli, S., Ardekani, S., Daneshgar, F., 2014. Development and validation of a direct mode choice model. Transport. Plann. Technol. 37 (7), 649662. https://doi.
org/10.1080/03081060.2014.935571. Taylor & Francis.
Brathwaite, T., Walker, J.L., 2018. Causal inference in travel demand modeling (and the lack thereof). Elsevier Ltd Journal of Choice Modelling 26, 118. https://doi.
org/10.1016/j.jocm.2017.12.001. June 2017.
Braun, L.M., et al., 2016. Short-term planning and policy interventions to promote cycling in urban centers: findings from a commute mode choice analysis in
Barcelona, Spain. Transport. Res. Pol. Pract. 89, 164183. https://doi.org/10.1016/j.tra.2016.05.007. Elsevier Ltd.
Bridgelall, R., 2014. Campus parking supply impacts on transportation mode choice. Transport. Plann. Technol. 37 (8), 711737. https://doi.org/10.1080/
03081060.2014.959354. Taylor & Francis.
Brier, G., 1950. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78 (1), 24.
Bueno, P.C., et al., 2017. Understanding the effects of transit benets on employeestravel behavior: evidence from the New York-New Jersey region. Transport. Res.
Pol. Pract. 99, 113. https://doi.org/10.1016/j.tra.2017.02.009. Elsevier Ltd.
Burman, P., 1989. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76 (3), 503514.
Calster, B. Van, et al., 2012. Extending the c-statistic to nominal polytomous outcomes: the Polytomous Discrimination Index. Stat. Med. 31, 26102626. https://doi.
org/10.1002/sim.5321.
Cambridge Systematics, 2010. Travel Model Validation and Reasonability Checking Manual, 2nd. Federal Highway Administration.
Castillo-Manzano, J.I., Castro-Nuño, M., López-Valpuesta, L., 2015. Analyzing the transition from a public bicycle system to bicycle ownership: a complex
relationship. Transport. Res. Transport Environ. 38, 1526. https://doi.org/10.1016/j.trd.2015.04.004.
Chai, T., Draxler, R.R., 2014. Root mean square error (RMSE) or mean absolute error (MAE)? -Arguments against avoiding RMSE in the literature. Geosci. Model Dev.
(GMD) 7 (3), 12471250. https://doi.org/10.5194/gmd-7-1247-2014.
Chakour, V., Eluru, N., 2014. Analyzing commuter train user behavior: a decision framework for access mode and station choice. Transportation 41 (1), 211228.
https://doi.org/10.1007/s11116-013-9509-y.
Chen, D.J., Wen, Y.-H., 2013. 'Effects of freeway mileage-based toll scheme on the short-range drivers' route choice behavior. J. Urban Plann. Dev. 140 (2),
04013012 https://doi.org/10.1061/(asce)up.1943-5444.0000167.
Chen, P., Shen, Q., Childress, S., 2018. A GPS data-based analysis of built environment influences on bicyclist route preferences. International Journal of Sustainable
Transportation 12 (3), 218231. https://doi.org/10.1080/15568318.2017.1349222. Taylor & Francis.
Cherchi, E., Cirillo, C., 2014. Understanding variability, habit and the effect of long period activity plan in modal choices: a day to day, week to week analysis on panel
data. Transportation 41 (6), 12451262. https://doi.org/10.1007/s11116-014-9549-y.
Cherchi, E., Cirillo, C., Ortúzar, J. de D., 2017. Modelling correlation patterns in mode choice models estimated on multiday travel data. Transport. Res. Pol. Pract. 96,
146153. https://doi.org/10.1016/j.tra.2016.11.021. Elsevier Ltd.
Chica-Olmo, J., Rodríguez-López, C., Chillón, P., 2018. Effect of distance from home to school and spatial dependence between homes on mode of commuting to
school. J. Transport Geogr. 72 (August), 112. https://doi.org/10.1016/j.jtrangeo.2018.07.013. Elsevier.
Chikaraishi, M., Nakayama, S., 2016. Discrete choice models with q-product random utilities. Transp. Res. Part B Methodol. 93, 576595. https://doi.org/10.1016/j.
trb.2016.08.013. Elsevier Ltd.
Clark, A.F., Scott, D.M., Yiannakoulias, N., 2014. Examining the relationship between active travel, weather, and the built environment: a multilevel approach using a
GPS-enhanced dataset. Transportation 41 (2), 325338. https://doi.org/10.1007/s11116-013-9476-3.
Clifton, K.J., et al., 2016. Development of destination choice models for pedestrian travel. Transport. Res. Pol. Pract. 94, 255265. https://doi.org/10.1016/j.
tra.2016.09.017. Elsevier Ltd.
Cole-Hunter, T., et al., 2015. Objective correlates and determinants of bicycle commuting propensity in an urban environment. Transport. Res. Transport Environ. 40
(2), 132143. https://doi.org/10.1016/j.trd.2015.07.004. Elsevier Ltd.
Collins, P.A., MacFarlane, R., 2018. Evaluating the determinants of switching to public transit in an automobile-oriented mid-sized Canadian city: a longitudinal
analysis. Transport. Res. Pol. Pract. 118 (September), 682695. https://doi.org/10.1016/j.tra.2018.10.014. Elsevier.
Daisy, N.S., Millward, H., Liu, L., 2018. Trip chaining and tour mode choice of non-workers grouped by daily activity patterns. Elsevier J. Transport Geogr. 69,
150162. https://doi.org/10.1016/j.jtrangeo.2018.04.016. November 2017.
Dalumpines, R., Scott, D.M., 2017. Determinants of route choice behavior: a comparison of shop versus work trips using the Potential Path Area - gateway (PPAG)
algorithm and Path-Size Logit. J. Transport Geogr. 59, 5968. https://doi.org/10.1016/j.jtrangeo.2017.01.003. Elsevier Ltd.
Danaf, M., Abou-Zeid, M., Kaysi, I., 2014. ‘Modeling travel choices of students at a private, urban university: insights and policy implications, Case Studies on
Transport Policy. World Conference on Transport Research Society 2 (3), 142152. https://doi.org/10.1016/j.cstp.2014.08.006.
Danalet, A., et al., 2016. Location choice with longitudinal WiFi data. Journal of Choice Modelling 18, 117. https://doi.org/10.1016/j.jocm.2016.04.003. Elsevier.
de Grange, L., et al., 2015. A logit model with endogenous explanatory variables and network externalities. Network. Spatial Econ. 15 (1), 89116. https://doi.org/
10.1007/s11067-014-9271-5.
de Jong, G.C., 2014. Mode choice models. In: Tavasszy, L.A., de Jong, G.C. (Eds.), Modelling Freight Transport (Elsevier).
de Luca, S., Cantarella, G.E., 2009. Validation and comparison of choice models. In: Saleh, W., Sammer, G. (Eds.), Travel Demand Management and Road User Pricing:
Success, Failure and Feasibility. Ashgate publications, pp. 3758.
de Luca, S., Di Pace, R., 2015. 'Modelling users' behaviour in inter-urban carsharing program: a stated preference approach. Transport. Res. Pol. Pract. 71, 5976.
https://doi.org/10.1016/j.tra.2014.11.001. Elsevier Ltd.
Deutsch-Burgner, K., Ravualaparthy, S., Goulias, K., 2014. Place happiness: its constituents and the inuence of emotions and subjective importance on activity type
and destination choice. Transportation 41 (6), 13231340. https://doi.org/10.1007/s11116-014-9553-2.
Di Ciommo, F., et al., 2014. 'Exploring the role of social capital influence variables on travel behaviour. Transport. Res. Pol. Pract. 68, 4655. https://doi.org/
10.1016/j.tra.2014.08.018. Elsevier Ltd.
Di, X., et al., 2017. Indifference bands for boundedly rational route switching. Transportation 44 (5), 11691194. https://doi.org/10.1007/s11116-016-9699-1.
Springer US.
Ding, C., et al., 2014a. 'Cross-Nested joint model of travel mode and departure time choice for urban commuting trips: case study in Maryland–Washington, DC
region. J. Urban Plann. Dev. 141 (4), 04014036 https://doi.org/10.1061/(asce)up.1943-5444.0000238.
Ding, C., Lin, Y., Liu, C., 2014b. Exploring the influence of built environment on tour-based commuter mode choice: a cross-classified multilevel modeling approach.
Transport. Res. Transport Environ. 32, 230238. https://doi.org/10.1016/j.trd.2014.08.001. Elsevier Ltd.
Ding, C., et al., 2017. Exploring the influence of attitudes to walking and cycling on commute mode choice using a hybrid choice model. J. Adv. Transport. 2017, 18.
https://doi.org/10.1155/2017/8749040.
Ding, C., et al., 2018. Joint analysis of the spatial impacts of built environment on car ownership and travel mode choice. Transport. Res. Transport Environ. 60,
2840. https://doi.org/10.1016/j.trd.2016.08.004. Elsevier Ltd.
Dong, H., Ma, L., Broach, J., 2016. Promoting sustainable travel modes for commute tours: a comparison of the effects of home and work locations and employer-
provided incentives. International Journal of Sustainable Transportation 10 (6), 485494. https://doi.org/10.1080/15568318.2014.1002027. Taylor & Francis.
Duc-Nghiem, N., et al., 2018. 'Modeling cyclists' facility choice and its application in bike lane usage forecasting, IATSS Research. International Association of Traffic
and Safety Sciences 42 (2), 8695. https://doi.org/10.1016/j.iatssr.2017.06.006.
Efron, B., 1983. 'Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78 (382), 316331.
Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman & Hall.
Efron, B., Tibshirani, R., 1995. Cross-Validation and the Bootstrap : Estimating the Error Rate of a Prediction Rule. Division of Biostatistics, Stanford University.
Efron, B., Tibshirani, R., 1997. 'Improvements on cross-validation: the .632+ bootstrap method. J. Am. Stat. Assoc. (438), 92. https://doi.org/10.1080/
01621459.1997.10474007.
Efthymiou, D., Antoniou, C., 2017. 'Understanding the effects of economic crisis on public transport users' satisfaction and demand. Transport Pol. 53 (August 2015),
8997. https://doi.org/10.1016/j.tranpol.2016.09.007. Elsevier.
Ermagun, A., Levinson, D., 2017. Public transit, active travel, and the journey to school: a cross-nested logit analysis. Transportmetrica: Transport. Sci. 13 (1), 2437.
https://doi.org/10.1080/23249935.2016.1207723.
Ermagun, A., Samimi, A., 2015. Promoting active transportation modes in school trips. Transport Pol. 37, 203211. https://doi.org/10.1016/j.tranpol.2014.10.013.
Elsevier.
Ermagun, A., Samimi, A., 2018. ‘Mode choice and travel distance joint models in school trips. Transportation 45 (6), 17551781. https://doi.org/10.1007/s11116-
017-9794-y. Springer US.
Ermagun, A., Hossein Rashidi, T., Samimi, A., 2015. A joint model for mode choice and escort decisions of school trips. Transportmetrica: Transport. Sci. 11 (3),
270289. https://doi.org/10.1080/23249935.2014.968654.
Faghih-Imani, A., Eluru, N., 2015. 'Analysing bicycle-sharing system user destination choice preferences: Chicago's Divvy system. J. Transport Geogr. 44, 5364.
https://doi.org/10.1016/j.jtrangeo.2015.03.005. Elsevier Ltd.
Faghih-Imani, A., Eluru, N., 2017. Examining the impact of sample size in the analysis of bicycle-sharing systems. Transportmetrica: Transport. Sci. 13 (2), 139161.
https://doi.org/10.1080/23249935.2016.1223205.
Fatmi, M.R., Habib, M.A., 2017. Modelling mode switch associated with the change of residential location. Travel Behaviour and Society 9 (August), 2128. https://
doi.org/10.1016/j.tbs.2017.07.006. Elsevier.
Fernández-Antolín, A., et al., 2016. Correcting for endogeneity due to omitted attitudes: empirical assessment of a modified MIS method using RP mode choice data.
Journal of Choice Modelling 20, 115. https://doi.org/10.1016/j.jocm.2016.09.001. Elsevier.
Flügel, S., et al., 2015. Methodological challenges in modelling the choice of mode for a new travel alternative using binary stated choice data - the case of high speed
rail in Norway. Transport. Res. Pol. Pract. 78, 438451. https://doi.org/10.1016/j.tra.2015.06.004.
Forsey, D., et al., 2014. Temporal transferability of work trip mode choice models in an expanding suburban area: the case of York Region, Ontario. Transportmetrica:
Transport. Sci. 10 (6), 469482. https://doi.org/10.1080/23249935.2013.788100.
Fox, J., et al., 2014. Temporal transferability of models of mode-destination choice for the Greater Toronto and Hamilton Area. Journal of Transport and Land Use 7
(2), 41. https://doi.org/10.5198/jtlu.v7i2.701.
Fu, X., Juan, Z., 2017. ‘Accommodating preference heterogeneity in commuting mode choice: an empirical investigation in Shaoxing, China. Transport. Plann.
Technol. 40 (4), 434448. https://doi.org/10.1080/03081060.2017.1300240. Taylor & Francis.
Gao, J., et al., 2016. 'A study of traveler behavior under traffic information-provided conditions in the Beijing area. Transport. Plann. Technol. 39 (8), 768778.
https://doi.org/10.1080/03081060.2016.1231896. Taylor & Francis.
Gao, Y., et al., 2017. 'Differences in pupils' school commute characteristics and mode choice based on the household registration system in China. Case Studies on
Transport Policy 5 (4), 656661. https://doi.org/10.1016/j.cstp.2017.07.008. Elsevier.
Garcia-Martinez, A., et al., 2018. ‘Transfer penalties in multimodal public transport networks. Transport. Res. Pol. Pract. 114 (January), 5266. https://doi.org/
10.1016/j.tra.2018.01.016. Elsevier.
Geisser, S., 1975. The predictive sample reuse method with applications. J. Am. Stat. Assoc. 70 (350), 320328.
Gerber, P., et al., 2017. ‘Cross-border residential mobility, quality of life and modal shift: a Luxembourg case study. Transport. Res. Pol. Pract. 104, 238254. https://
doi.org/10.1016/j.tra.2017.06.015. Elsevier Ltd.
Ghanayim, M., Bekhor, S., 2018. Modelling bicycle route choice using data from a GPS-assisted household survey. Eur. J. Transport Infrastruct. Res. 18 (2), 158177.
Glerum, A., Atasoy, B., Bierlaire, M., 2014. Using semi-open questions to integrate perceptions in choice models. Journal of Choice Modelling 10 (1), 1133. https://
doi.org/10.1016/j.jocm.2013.12.001. Elsevier.
Goel, R., Tiwari, G., 2016. ‘Access-egress and other travel characteristics of metro users in Delhi and its satellite cities. IATSS Research. The Authors 39 (2), 164172.
https://doi.org/10.1016/j.iatssr.2015.10.001.
Gokasar, I., Gunay, G., 2017. Mode choice behavior modeling of ground access to airports: a case study in Istanbul, Turkey. J. Air Transport. Manag. 59, 17. https://
doi.org/10.1016/j.jairtraman.2016.11.003. Elsevier Ltd.
Golshani, N., et al., 2018. Modeling travel mode and timing decisions: comparison of articial neural networks and copula-based joint model. Elsevier Travel
Behaviour and Society 10, 2132. https://doi.org/10.1016/j.tbs.2017.09.003. October 2017.
González, F., Melo-Riquelme, C., de Grange, L., 2016. A combined destination and route choice model for a bicycle sharing system. Transportation 43 (3), 407423.
https://doi.org/10.1007/s11116-015-9581-6.
Guan, J., Xu, C., 2018. 'Are relocatees different from others? Relocatees' travel mode choice and travel equity analysis in large-scale residential areas on the periphery
of megacity Shanghai, China. Transport. Res. Pol. Pract. 111 (January), 162173. https://doi.org/10.1016/j.tra.2018.03.011. Elsevier.
Guerra, E., et al., 2018. 'Urban form, transit supply, and travel behavior in Latin America: evidence from Mexico's 100 largest urban areas. Transport Pol. 69 (June),
98105. https://doi.org/10.1016/j.tranpol.2018.06.001. Elsevier Ltd.
Gunn, H., Bates, J., 1982. Statistical aspects of travel demand modelling. Transport. Res. Gen. 16 (56), 371382. https://doi.org/10.1016/0191-2607(82)90065-6.
Guo, Y., et al., 2018. ‘Impacts of internal migration, household registration system, and family planning policy on travel mode choice in China. Travel Behaviour and
Society 13 (June), 128143. https://doi.org/10.1016/j.tbs.2018.07.003. Elsevier.
Habib, K.N., 2014a. An investigation on mode choice and travel distance demand of older people in the Nationa