
William Becker*, Paolo Paruolo and Andrea Saltelli

Variable Selection in Regression Models

Using Global Sensitivity Analysis

https://doi.org/10.1515/jtse-2018-0025

Received August 31, 2018; accepted February 16, 2021

Abstract: Global sensitivity analysis is primarily used to investigate the effects of

uncertainties in the input variables of physical models on the model output. This

work investigates the use of global sensitivity analysis tools in the context of

variable selection in regression models. Specifically, a global sensitivity measure

is applied to a criterion of model fit, hence defining a ranking of regressors by

importance; a testing sequence based on the ‘Pantula-principle’ is then applied to

the corresponding nested submodels, obtaining a novel model-selection method.

The approach is demonstrated on a growth regression case study, and on a number

of simulation experiments, and it is found competitive with existing approaches to

variable selection.

Keywords: model selection, Monte Carlo, sensitivity analysis, simulation

JEL classification: C52, C53

1 Introduction

Model selection in regression analysis is a central issue, both in theory and in

practice. Related fields include multiple testing (Bittman et al. 2009; Romano and

Wolf 2005), pre-testing (Leeb and Poetscher 2006), information criteria (Hjort and

Claeskens 2003; Liu and Yang 2011), model selection based on Lasso (Brunea 2008), model averaging (Claeskens and Hjort 2003), stepwise regression (Miller 2002), risk inflation in prediction (Foster and George 1994), and directed acyclic graphs and causality discovery (Freedman and Humphreys 1999).¹

Information and views set out in this paper are those of the authors and do not necessarily reflect the ones of the institutions of affiliation.

*Corresponding author: William Becker, European Commission, Joint Research Centre, Ispra, VA, Italy. E-mail: william.becker@bluefoxdata.eu. https://orcid.org/0000-0002-6467-4472
Paolo Paruolo, European Commission, Joint Research Centre, Ispra, VA, Italy. E-mail: paolo.paruolo@ec.europa.eu. https://orcid.org/0000-0002-3982-4889
Andrea Saltelli, Open Evidence Research, Universitat Oberta de Catalunya, Barcelona, Spain. E-mail: andrea.saltelli@gmail.com. https://orcid.org/0000-0003-4222-6975

J. Time Ser. Econom. 2021; 13(2): 187–233

Open Access. © 2021 William Becker et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.

Model choice is also of primary concern in many areas of applied econometrics,

as witnessed for example by the literature on growth regression (Sala-i-Martin 1997).

Controlling for the right set of covariates is central in the analysis of policy impact

evaluations; this is embodied in the assumption of unconfoundedness (Imbens and

Wooldridge 2009). In economic forecasting, model selection is the main alternative

to model averaging (Hjort and Claeskens 2003).

The analysis of the effects of pretesting on parameter estimation has a long

tradition in econometrics (Danilov and Magnus 2004) and in this context Magnus

and Durbin (1999) and co-authors proposed the weighted average least squares

estimator (WALS) and compared it with model averaging for growth empirics

(Magnus, Powell, and Prufer 2010).

Model selection is a major area of investigation also in time-series econometrics

(Phillips 1997, 2003). The so-called London School of Economics (LSE) methodology

has played a prominent role in this area, advocating the general-to-specific (GETS) approach to model selection; see Castle, Doornik, and Hendry (2011), Hendry and Krolzig (2005), and references therein. In a widely cited paper, Hoover and Perez (1999)

(hereafter HP) ‘mechanized’—i.e. translated—the GETS approach into an algorithm

for model selection and they then tested the performance of the HP algorithm

on a set of time-series regression experiments, constructed along the lines of

Lovell (1983).

Model selection is also related to the issue of regression coefficients’ robustness

(i.e. lack of sensitivity) to the omission/inclusion of additional variables. Leamer

(1983) proposed extreme bound analysis, i.e. to report the range of possible

parameter estimates of the coefﬁcient of interest when varying the additional

regressors included in the analysis, as an application of sensitivity analysis to

econometrics. Other applications of sensitivity analysis to econometrics include the

¹ Model selection is also associated with current rules of thumb on the maximum number of regression parameters to consider. This literature appears to have been initiated by Freedman (1983), who considered the case of a first screening regression with 50 regressors and 100 data points, where regressors that are significant at the 25% significance level are kept in a second regression. Freedman showed that the second regression is troublesome when one acts as if the screening regression had not been performed and the ratio of the number of observations to the number of regressors in the screening regression is kept in a fixed proportion as the number of observations diverges. This study was followed by Freedman and Pee (1989) and Freedman, Pee, and Midthune (1992), who defined the rule of thumb that the number of observations per regressor should be at least 4; this rule is included in Harrell (2001), who suggested to have it at least equal to 10.


local sensitivity to model misspecification developed in Magnus and Vasnev (2007) and Magnus (2007).²

On the other hand, sensitivity analysis originated in the natural sciences, and is

generally defined as ‘the study of how the uncertainty in the output of a mathematical model or system (numerical or otherwise) can be apportioned to different sources of uncertainty in its inputs’ (Saltelli, Tarantola, and Campolongo 2000). The

term global sensitivity analysis (GSA) is used to refer to sensitivity analysis

approaches that fully explore the space of uncertainties, as opposed to ‘local’

methods which are only valid at a nominal point (Saltelli and Annoni 2010). The

main tools used in GSA are based on a decomposition of the variance of the model

output (Sobol’ 1993).

Despite these several uses of sensitivity analysis in econometrics, the present authors are not aware of systematic applications of the techniques of global sensitivity analysis to

the problem of model selection in regression. With this in mind, the present paper

explores the application of variance-based measures of sensitivity to model

selection.

This paper aims to answer the question: “Can GSA methods help in model

selection in practice?”, rather than to propose a single algorithm aimed at dominating all alternatives. To this purpose, a simple algorithm is considered as a

representative of a novel GSA approach; the new algorithm is found to perform

rather well when compared with alternatives. This shows how GSA methods can

indeed bring a useful contribution to this field.

In particular, a widely-used measure in the GSA literature, called the ‘total sensitivity index’, is employed to rank regressors in terms of their importance in a

regression model. The information on the ordering of regressors given by GSA

methods appears to be somewhat complementary to the one based on t-ratios

employed in the GETS approach; this suggests considering orderings of regressors that combine the two. Based on this insight, a GSA algorithm is constructed which combines the two rankings.

The proposed GSA representative algorithm uses the ordering of the regressors

via GSA or the t-ratios within a testing strategy based on the ‘Pantula-principle’,

see Pantula (1989). For any ordering of the regressors, this amounts to a single

sequence of tests against the full model, starting from the most restricted submodel

² They show that local sensitivity measures provide complementary information with respect to standard diagnostic tests for misspecification, i.e. that the two types of statistics are asymptotically independent. In SA, a local measure of sensitivity is one focused on a precise point in the space of the input factors, e.g. a partial derivative of the output with respect to an input. With a global measure of sensitivity, the influence of a given input on the output is averaged both over the distribution of the input factor itself and over the distributions of all the remaining factors, see Saltelli, Andres, and Homma (1993).


to the most unrestricted one.³

This implies both a reduction in the number of tests

for each given ordering (with an associated saving of computing time) and favorable control of the size of the testing sequence. The present application of the ‘Pantula-principle’ appears novel in the context of model selection.

The GSA algorithm is tested here using several case studies. A detailed

investigation of the performance of the GSA algorithm is first performed using the

simulation experiments of HP, who defined a set of Data Generating Processes

(DGPs) based on real economic data. Simulating these DGPs, one can record how

often the algorithm recovers the variables that are included in the DGP. This is

compared to the results of HP’s GETS algorithm, as well as those of the Autometrics

GETS package (Pretis, Reade, and Sucarrat 2018).

In order to further compare the GSA approach to a wider set of model selection

procedures, the DGPs in Deckers and Hanck (2014, hereafter DH) are also considered; this allows

a direct comparison with a number of procedures. Finally, the algorithm is applied

to a growth regression case study which is also taken from the same paper.

Overall, results point to the possible usefulness of GSA methods in model selection algorithms. When comparing the optimized GSA and HP algorithms, the GSA method appears to be able to reduce the failure rate in recovering the underlying data generating process from approximately 5% to 1%—a fivefold reduction. When some of the regressors are weak, the recovery of the exact DGP does not appear to be

improved by the use of GSA methods.

Comparing the GSA algorithm to a wider set of approaches considered in DH,

the results are competitive with alternatives, in the sense that the GSA algorithm is

not dominated by alternative algorithms in the Monte Carlo (MC) simulations. In

the empirical application on growth regression, not surprisingly, it identifies

similar variables to those found by other methods. While these results do not prove

the GSA approach to dominate other existing approaches, they show that the GSA

approach is not dominated by any single alternative, and that it has the potential to contribute to improving existing algorithms; the present study may hence pave the way for future advances.

The rest of the paper is organized as follows. Section 2 defines the problem of interest and introduces GSA and variance-based measures. Section 3 presents some

theoretical properties of orderings based on the total sensitivity index, while

Section 4 presents the GSA algorithm. Results are reported in Sections 5 and 6,

where the former is a detailed investigation on datasets generated following the

paper of Hoover and Perez (1999), and the latter is a comparison with a wide range

of model selection procedures on simulated data sets and on a growth regression,

following Deckers and Hanck (2014). Section 7 concludes. Three appendices report

³ This sequence can still be interpreted as compliant with the GETS principle.


proofs of the propositions in the paper, details on the DGP design in HP and a

discussion about the identifiability of DGPs. Finally, this paper follows the notational conventions in Abadir and Magnus (2002).

2 Model Selection and Global Sensitivity Analysis

This section presents the setup of the problem, and introduces global sensitivity

analysis. The details of the proposed algorithm are deferred to Section 4.

2.1 Model Selection in Regression

Consider n data points in a standard multiple regression model with p regressors of the form

    y = X_1 β_1 + … + X_p β_p + ε = X β + ε,    (1)

where y = (y_1, …, y_n)′ is n×1, X = (X_1, …, X_p) is n×p, X_i ≔ (x_{i,1}, …, x_{i,n})′ is n×1, β = (β_1, …, β_p)′ is p×1, and ε is an n×1 random vector with distribution N(0, σ²I_n). The symbol ′ indicates transposition.

Equation (1) describes both the model and the DGP. In the model, the coefficients β_i are parameters to be estimated given the observed data Z = (y, X). Each DGP is described by Eq. (1) with β_i set at some numerical values, here indicated as β_{0i}, and collected in the vector β_0 = (β_{01}, …, β_{0p})′.

Some of the true β_{0i} may be 0, corresponding to irrelevant regressors. Let T ≔ {i ∈ J : β_{0,i} ≠ 0} be the set of all relevant regressor indices in the DGP, with r_0 elements, where J ≔ {1, …, p} indicates the set of the first p integers. Let also M ≔ J \ T indicate the set of all regressor indices for irrelevant regressors.⁴

Equation (1) also formally nests dynamic specifications, as detailed in Appendix B below; in this case the X_i contain lagged dependent variables, and (1) is generated recursively.

Imposing the restriction β_i = 0 for some regressors i, one obtains a submodel⁵ of model (1). Each submodel can be characterized by a set a, a ⊆ J, containing the indices of the included regressors. For instance, a = {1, 5, 9} indicates the submodel including regressors numbered 1, 5, 9. The model without any restriction of the form β_i = 0 is called the general unrestricted model (GUM).

⁴ Here J \ T denotes the set difference J \ T ≔ {i : i ∈ J, i ∉ T}; sums over empty sets are understood to be equal to 0.
⁵ In the paper ‘submodel’ and ‘specification’ are used as synonyms.


Alternatively, the same information on submodel a can be represented by a p×1 vector γ_a = (γ_1, …, γ_p)′, with j-th coordinate γ_j taking value 1 (respectively 0) to indicate the inclusion (respectively exclusion) of regressor j in the specification, i.e. γ_j = 1(j ∈ a), where 1(⋅) is the indicator function.⁶ The GUM corresponds to γ equal to ı, a vector of all 1s. γ_T corresponds to the best selection of regressors, i.e. the same one as in the DGP; in the following the notation γ_T = γ_0 is also used.

Let Γ be the set of all p×1 vectors of indicators γ, Γ = {0, 1}^p. Note that there are 2^p different specifications, i.e. 2^p possible γ vectors in Γ. When p = 40, as in some experiments in Section 5, the number of specifications is 2^p ≈ 1.0995 ⋅ 10^12, a very large number. This is why an exhaustive search of submodels is infeasible in many practical cases, and model selection techniques focus on a search over a limited set of submodels Γ_s ⊂ Γ.

Each submodel can be written as model (1) under the restriction

    β = H_γ ϕ,    (2)

where H_γ contains the columns of an identity matrix I_p corresponding to elements γ_i equal to 1 within γ. Specification (2) is referred to as the ‘γ submodel’ in the following. Also the ‘true’ vector β_0 has representation β_0 = H_0 ϕ_0, where H_0 is a simplified notation for H_0 = H_{γ_T} = H_{γ_0}.

The least squares estimator of β in submodel γ can be written as

    β̂_γ = H_γ (H_γ′ X′ X H_γ)^{−1} H_γ′ X′ y.    (3)

The problem of interest is to retrieve T, or the corresponding γ_T, given the observed data Z = (y, X), i.e. to identify the DGP.⁷
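In code, the estimator in eq. (3) amounts to selecting columns of the identity matrix. A minimal numpy sketch (our own illustration, not the authors' implementation; the constant/de-meaning convention of footnote 7 is omitted for brevity):

```python
import numpy as np

def submodel_ols(y, X, gamma):
    """Least-squares estimate of beta in the gamma-submodel, eq. (3):
    beta = H_gamma * phi with H_gamma the columns of I_p where gamma == 1,
    so excluded regressors get a coefficient of exactly zero."""
    p = X.shape[1]
    H = np.eye(p)[:, np.asarray(gamma, dtype=bool)]   # H_gamma
    XH = X @ H
    phi_hat = np.linalg.solve(XH.T @ XH, XH.T @ y)    # LS within the submodel
    return H @ phi_hat                                # p-vector beta-hat_gamma

# Example: DGP y = 2*x1 - x3 + noise; fit the submodel a = {1, 3}
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.standard_normal(200)
beta_hat = submodel_ols(y, X, [1, 0, 1, 0])
```

The returned vector has zeros in the excluded positions, matching the restriction β = H_γ ϕ.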

2.2 GSA Approach

General-to-specific (GETS) approaches such as the algorithm used by HP (described

in detail in Section 5) use t-ratios to rank regressors in order of importance, which

guides the selection of the set of submodels Γ_s. This study proposes instead to decompose the selection of models into two stages:

(i) define an ordering of regressors based on their importance;

⁶ Similarly, the notation a_γ ≔ {i_1, …, i_{k_γ}}′ is used to indicate the index set corresponding to some vector γ.
⁷ All empirical models are assumed to contain the constant; this is imposed implicitly by de-meaning the y and X_i vectors. Hence in the following, the ‘empty set of regressors’ refers to the regression model with only the constant.


(ii) use a sequence of p tests that compare the GUM with submodels which contain the first h most important regressors, starting from h = 0, 1, 2, … and ending at the first submodel r for which the null hypothesis is not rejected.

In this paper the ordering in (i) based on the t-ratios is complemented with a

variance-based measure of importance from GSA, called the ‘total sensitivity

index’. The proposed algorithm, called the ‘GSA algorithm’, combines this new

ranking with the ranking by t-ratios.

The testing sequence is defined based on this new ranking; a ‘bottom-up’ selection process is adopted, which builds candidate models by adding regressors in descending order of importance. This ‘bottom-up’ selection process follows the ‘Pantula principle’ and has well-defined theoretical properties, see e.g. Paruolo (2001), and it can still be interpreted as a GETS procedure.

The total sensitivity index in GSA is based on a systematic exploration of the space of the inputs to measure their influence on the system output, as is commonly practiced in mathematical modeling in the natural sciences and engineering. It provides a global measure of the influence of each input to a system.⁸ Reviews of global sensitivity analysis methods used therein are given in Saltelli et al. (2012), Norton (2015), Becker and Saltelli (2015), Wei, Lu, and Song (2015).⁹ The total sensitivity index is a variance-based measure of sensitivity; such measures are the analogue of the analysis of variance, see Archer, Saltelli, and Sobol (1997).¹⁰

Given the sample data Z = (y, X), consider the γ submodel, see eqs. (1), (2) and (3). Let q(γ) indicate the BIC measure of model fit of this submodel,

    q(γ) = log σ̂²_γ + k_γ c_n,   with c_n ≔ log(n)/n.¹¹

Remark that q is a continuous random variable that depends on the discretely-valued γ. The idea is to apply the total sensitivity index using q as output, with γ as input. Although the BIC is used here as q, the measure of model fit, other consistent information criteria or the maximized log-likelihood could be used instead.
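A minimal sketch of q(γ), under the assumption that σ̂²_γ is the maximum-likelihood residual variance RSS/n and k_γ the number of included regressors (our reading of the formula; the de-meaning convention follows footnote 7):

```python
import numpy as np

def q_bic(y, X, gamma):
    """q(gamma) = log(sigma2_hat) + k_gamma * log(n)/n, where sigma2_hat
    is the ML residual variance RSS/n of the gamma-submodel and k_gamma
    the number of included regressors; y and X are de-meaned so the
    all-zero gamma is the 'empty set of regressors' (constant only)."""
    n = len(y)
    g = np.asarray(gamma, dtype=bool)
    yc = y - y.mean()
    if g.any():
        Xg = X[:, g] - X[:, g].mean(axis=0)
        coef, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
        resid = yc - Xg @ coef
    else:
        resid = yc
    sigma2 = resid @ resid / n
    return np.log(sigma2) + g.sum() * np.log(n) / n

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X[:, 0] + 0.2 * rng.standard_normal(100)
q_true = q_bic(y, X, [1, 0, 0])    # the DGP specification
q_empty = q_bic(y, X, [0, 0, 0])   # empty set of regressors
```

Lower q is better fit; here the DGP specification scores well below the empty model.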

⁸ The ‘mechanistic’ models in these disciplines are mostly principle-based, possibly involving the solution of some kind of (differential) equation or optimization problem, and the output—being the result of a deterministic calculation—does not customarily include an error term.
⁹ Recent applications of these methods to the quality of composite indicators are given in Paruolo, Saltelli, and Saisana (2013) and Becker et al. (2017).
¹⁰ Variance-based methods explore the entire distribution of each factor.
¹¹ q can be taken to be any consistent information criterion, where consistent information criteria replace log n with some other increasing function f(n) of n with the property c_n = f(n)/n → 0. Here the fact that n c_n diverges is not used in the proofs. Note that q(γ) is a function of Z, but this is not indicated in the notation for simplicity.


The objective is to capture both the main effects and the interaction effects of the input factors on the output q, see Saltelli et al. (2012). The following section defines the total sensitivity index.

2.3 Sensitivity Measures

Let E indicate the empirical expectation over Γ, i.e. E(h(γ)) ≔ (#Γ)^{−1} Σ_{γ∈Γ} h(γ) for any function h. Let also V indicate the variance operator associated with E, V(h) ≔ E(h²) − (E(h))².

The γ vector is partitioned into two components γ_i and γ_{−i}, where γ_{−i} contains all elements in γ except γ_i. Let E(⋅|b) and V(⋅|b) (respectively E(⋅) and V(⋅)) indicate the conditional (respectively marginal) expectation and variance operators with respect to a partition (a, b) of γ, where a and b are taken equal to γ_i and to γ_{−i}.

Two commonly-accepted variance-based measures are reviewed here: the ‘first-order sensitivity index’ S_i, Sobol’ (1993), and the ‘total-order sensitivity index’ S_Ti, Homma and Saltelli (1996); both rely on decomposing the variance of the output, V = V(q), into portions attributable to inputs or sets of inputs.

The first-order index measures the contribution to V = V(q) of varying the i-th input alone, and it is defined as S_i = V(E(q|γ_i))/V. This index can be seen as the application of Karl Pearson’s correlation ratio η², see Pearson (1905), to the present context. This corresponds to seeing the effect of including or not including a regressor, but averaged over all possible combinations of the other regressors. However, this measure does not account for interactions with the inclusion/exclusion of other regressors; hence it is not used in the present paper.

Instead, here the focus is placed on the total effect index, which is defined by Homma and Saltelli (1996) as

    S_Ti = E(V(q|γ_{−i}))/V = 1 − V(E(q|γ_{−i}))/V.    (4)

In the following, the numerator of S_Ti is indicated as σ²_Ti = E(V(q|γ_{−i})), and the shorthand S_T for S_Ti is often used.

Examining σ²_Ti, one can notice that the inner term, V(q|γ_{−i}), is the variance of q due to the inclusion/exclusion of regressor i, conditional on a given combination γ_{−i} of the remaining regressors. The outer expectation then averages over all values of γ_{−i}; this quantity is then standardized by V to give the fraction of total output variance caused by the inclusion of x_i. The second expression shows that S_Ti is 1 minus the first-order effect for γ_{−i}.
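Under the empirical expectation over Γ (uniform on all 2^p inclusion vectors), both indices can be computed exactly by enumeration for small p. The following sketch uses a toy output q(γ) of our own (not from the paper) to illustrate that S_Ti picks up interactions that S_i misses, and that both vanish for an input q does not depend on:

```python
import itertools
import numpy as np

# Toy output q(gamma) on Gamma = {0,1}^3: gamma_3 is irrelevant,
# gamma_1 and gamma_2 interact through the 0.5*g1*g2 term.
def q(g):
    return -2.0 * g[0] - 1.0 * g[1] + 0.5 * g[0] * g[1]

Gamma = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
qv = np.array([q(g) for g in Gamma])
V = qv.var()   # V(q) under the empirical (uniform) law on Gamma

def S_first(i):
    """First-order index S_i = V(E(q | gamma_i)) / V."""
    cond_means = [qv[Gamma[:, i] == v].mean() for v in (0.0, 1.0)]
    return np.var(cond_means) / V

def S_total(i):
    """Total index S_Ti = E(V(q | gamma_{-i})) / V: the variance from
    flipping input i, averaged over the 2^(p-1) settings of the others."""
    others = [j for j in range(3) if j != i]
    key = Gamma[:, others]
    settings = list(itertools.product([0.0, 1.0], repeat=2))
    total = 0.0
    for fixed in settings:
        mask = (key == np.array(fixed)).all(axis=1)
        total += qv[mask].var()
    return total / len(settings) / V
```

For this q, S_T1 exceeds S_1 (it absorbs the g1·g2 interaction), while S_3 = S_T3 = 0 since γ_3 never affects q.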

These measures are based on the standard variance decomposition formula, or ‘law of total variance’ (Billingsley 1995, Problem 34.10(b)). In the context of GSA, these decomposition formulae are discussed in Archer, Saltelli, and Sobol (1997), Saltelli and Tarantola (2002), Sobol’ (1993), Brell, Li, and Rabitz (2010). For further reading about GSA in its original setting, see Saltelli et al. (2012).

2.4 Estimation of the Total Sensitivity Index

In order to calculate the total sensitivity measure S_Ti one should be able to compute q(γ) for all γ ∈ Γ (i.e. estimate all possible submodels of the GUM), which is unfeasible or undesirable. Instead, S_Ti can be estimated from a random subset of Γ, i.e. a sample of models. The estimation of S_Ti is performed using an estimator and a structured sample constructed as in Jansen (1999), which is a widely used method in GSA.

Specifically, generate a random draw of γ in Γ, say γ*; then consider the element γ*^{(i)} with all elements equal to γ* except for the i-th coordinate, which is switched from 0 to 1 or vice-versa, γ*^{(i)}_i = 1 − γ*_i. Doing this for each coordinate i generates p pairs of γ vectors, γ* and γ*^{(i)}, that differ only in the coordinate i. This is then used to calculate q(γ) and apply an estimator of Jansen (1999).

This process can be formalized as follows: initialize ℓ at 1, then:
1. Generate a random draw of γ, where γ is a p-length vector with each element randomly selected from {0, 1}. Denote this by γ_ℓ.
2. Evaluate q_ℓ = q(γ_ℓ).
3. Take the i-th element of γ_ℓ, and switch it to 0 if it is equal to 1, and to 1 if it is 0. Denote this new vector with inverted i-th element as γ_ℓ^{(i)}.
4. Evaluate q_{iℓ} = q(γ_ℓ^{(i)}).
5. Repeat steps 3 and 4 for i = 1, 2, …, p.
6. Repeat steps 1–5 N times, i.e. for ℓ = 1, 2, …, N.

The estimators for σ²_Ti and V are then defined as in Jansen (1999), see also Saltelli et al. (2010):

    σ̂²_Ti = (1/(4N)) Σ_{ℓ=1}^N (q_{iℓ} − q_ℓ)²,
    V̂ = (1/(N−1)) Σ_{ℓ=1}^N (q_ℓ − q̄)²,    (5)

where q̄ = (1/N) Σ_{ℓ=1}^N q_ℓ. This delivers the following plug-in estimator for S_T,

    Ŝ_Ti = σ̂²_Ti / V̂.
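The sampling scheme of steps 1–6 and the estimators in eq. (5) can be sketched as follows (a toy deterministic criterion stands in for the BIC of a fitted submodel; names are ours, not the authors'):

```python
import numpy as np

def jansen_ST(qfun, p, N, rng):
    """Estimate all total indices S_Ti by the structured design of steps 1-6:
    for each of N random gamma draws, flip one coordinate at a time and
    apply the Jansen-type estimators of eq. (5)."""
    q_base = np.empty(N)
    q_flip = np.empty((N, p))
    for ell in range(N):
        g = rng.integers(0, 2, size=p)        # step 1: random gamma in {0,1}^p
        q_base[ell] = qfun(g)                 # step 2
        for i in range(p):                    # steps 3-5: flip coordinate i
            g_i = g.copy()
            g_i[i] = 1 - g_i[i]
            q_flip[ell, i] = qfun(g_i)
    sigma2_T = ((q_flip - q_base[:, None]) ** 2).sum(axis=0) / (4 * N)  # eq. (5)
    V = q_base.var(ddof=1)                    # (N-1)-denominator sample variance
    return sigma2_T / V                       # plug-in estimator for S_Ti

# Toy deterministic criterion: input 4 is irrelevant, inputs 1 and 2 interact.
def q_toy(g):
    return -2.0 * g[0] - 1.5 * g[1] - 1.0 * g[2] + 0.5 * g[0] * g[1]

rng = np.random.default_rng(0)
ST = jansen_ST(q_toy, p=4, N=128, rng=rng)    # ST[3] is exactly 0 here
```

Since flipping the irrelevant fourth input never changes q_toy, its estimated numerator is exactly zero, while the other inputs get strictly positive estimates.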

Readers familiar with sensitivity analysis may notice that the estimator in (5) differs by a factor of 2 from the estimator quoted in Saltelli et al. (2010). The reason for this is given in Appendix A.¹²

Ŝ_Ti is an accurate estimator for S_Ti as the number N of models increases;¹³ hence, the following discussion is based on the behavior of S_Ti.

3 Properties of Orderings Based on S_Ti

This section investigates the theoretical properties of orderings of variables in a regression model based on S_T, and shows that these orderings satisfy the following minimal requirement: when the true regressors in T (those included in the DGP) and the irrelevant ones in M are uncorrelated, the ordering of regressors based on S_T separates the true from the irrelevant regressors in large samples.

Recall that S_Ti = σ²_Ti/V = E(V(q|γ_{−i}))/V, see (4). The large-n properties of S_Ti are studied under the following regularity assumptions.

Assumption 1 (Assumptions on the DGP). The variables w_t ≔ (y_t, x_{1,t}, …, x_{p,t}, ϵ_t)′ are stationary with finite second moments, and satisfy a law of large numbers for large n, i.e. the second sample moments of w_t converge in probability to Σ, the variance-covariance matrix of w_t.

Notice that these requirements are minimal, and they are satisfied by the HP DGPs as well as the DH DGPs. The following theorem shows that, for large n, a scree plot of the ordered S_Ti allows one to separate the relevant regressors from the irrelevant ones when true and irrelevant regressors are uncorrelated.

Theorem 2 (Ordering based on S_Ti works for uncorrelated regressors in M and T). Let Assumption 1 hold and assume that the covariance Σ_{ℓj} between x_ℓ and x_j equals 0 for all j ∈ T and ℓ ∈ M. Define (S_T(1), S_T(2), …, S_T(p)) as the set of S_Ti values in decreasing order, with S_T(1) ≥ S_T(2) ≥ ⋯ ≥ S_T(p). Then as n → ∞ one has

    (S_T(1), S_T(2), …, S_T(p)) →_p (c_(1), c_(2), …, c_(r_0), 0, …, 0),

where (c_(1), c_(2), …, c_(r_0)) is the ordered set of the c_i > 0 values in decreasing order, with

    c_i ≔ (1/(4 ⋅ 2^{p−1})) Σ_{γ_{−i} ∈ Γ_{−i}} log[ (σ² + Σ_{h,j ∈ T\a_{γ(i,1)}} β_{0,h} Σ_{hj.b_{γ(i,1)}} β_{0,j}) / (σ² + Σ_{h,j ∈ T\a_{γ(i,0)}} β_{0,h} Σ_{hj.b_{γ(i,0)}} β_{0,j}) ],    (6)

see Appendix A for the definition of the relevant quantities in eq. (6). Hence the ordered S_Ti values separate the block of true regressors in T in the first r_0 positions from the irrelevant ones in M in the last p − r_0 positions of (S_T(1), S_T(2), …, S_T(p)).

Proof. See Lemma 6 in Appendix A. □

¹² A heuristic reason for this is that the method involves an exploration of models, with equal probability to select γ_i = 0 or γ_i = 1. Note that in analyses with continuous variables, it is usually advisable to use low-discrepancy sequences due to their space-filling properties, see Sobol’ (1967), which give faster convergence with increasing N. However, since γ can only take binary values for each element, low-discrepancy sequences offer no obvious advantage over (pseudo-)random numbers.
¹³ For instance, it is consistent for S_Ti for increasing N thanks to the Law of Large Numbers for i.i.d. sequences applied to its numerator and denominator.

Given the above, one may hence expect this result to apply to other more general situations. However, this turns out not to be necessarily the case. The results in Appendix A also show that one can build examples with correlated regressors across T and M, where the ordering of regressors based on S_T fails to separate the sets of true and irrelevant regressors in large samples.¹⁴

In the end, the question of whether the ordering based on S_T can help in selecting regressors is an empirical matter. Section 5 explores the frequency with which this happens in practice, based on simulated data from various DGPs.
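As an illustration of the theorem's claim (our own toy simulation, not one of the paper's experiments), one can simulate a DGP with uncorrelated regressors, estimate S_Ti with the Jansen design of Section 2.4 using a BIC-type q(γ) as output, and check that the top-ranked regressors are the relevant ones:

```python
import numpy as np

rng = np.random.default_rng(42)

# DGP with p = 8 mutually uncorrelated regressors; T = {1, 2, 3} (indices 0-2)
n, p = 500, 8
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -0.8, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta0 + rng.standard_normal(n)

def q_bic(g):
    """q(gamma) = log(RSS/n) + k * log(n)/n for the gamma-submodel (de-meaned)."""
    g = np.asarray(g, dtype=bool)
    yc = y - y.mean()
    if g.any():
        Xg = X[:, g] - X[:, g].mean(axis=0)
        coef, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
        resid = yc - Xg @ coef
    else:
        resid = yc
    return np.log(resid @ resid / n) + g.sum() * np.log(n) / n

# Jansen-type estimate of S_Ti with q(gamma) as output (Section 2.4 design)
N = 128
q_base = np.empty(N)
q_flip = np.empty((N, p))
for ell in range(N):
    g = rng.integers(0, 2, size=p)
    q_base[ell] = q_bic(g)
    for i in range(p):
        g_i = g.copy()
        g_i[i] = 1 - g_i[i]
        q_flip[ell, i] = q_bic(g_i)
ST = ((q_flip - q_base[:, None]) ** 2).sum(axis=0) / (4 * N) / q_base.var(ddof=1)

ranking = np.argsort(-ST)   # regressors by decreasing estimated S_Ti
top3 = set(ranking[:3])     # should coincide with the relevant set T
```

Excluding a relevant regressor moves log σ̂²_γ by an amount bounded away from zero, while flipping an irrelevant one only moves the penalty term, so the scree-plot gap appears already at moderate n.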

4 Construction of the Algorithm

In order to construct an algorithm to perform model selection based on S_T, an initial investigation was performed to understand to what extent the ranking of regressors provided by S_T is complementary to that given by t-ratios. These experiments are based on the MC design by HP; details of these experiments are reported in Section 5. However, since the results provide the basis of the GSA algorithm, they are also summarized here.

In short, 11 different datasets were simulated following the approach and underlying DGPs defined by HP. For each DGP, the regressors were ordered using both S_T and the t-ratios. Then a metric was used which measures the success of each ranking in assigning the highest ranks to the regressors in the DGP. This gives a measure of the utility of each ranking in correctly identifying the DGP. It was found that, first, S_T gave overall better rankings than t-ratios, but for some DGPs t-ratios were still more effective.

¹⁴ Worked-out examples of this are available from the authors upon request.


This result pointed to the fact that the two measures are in some way complementary, and motivated the GSA algorithm proposed here, which combines the search paths obtained using the t-ratios and the S_T measures, and then selects the best model between the two resulting specifications. The combined procedure is expected to be able to reap the advantages of both orderings. For simplicity, this algorithm is called the GSA algorithm, despite the fact that it exploits both the orderings based on GSA and on the t-ratios. The rest of this section contains a description of the GSA algorithm in its basic form and with two modifications.

4.1 The Basic Algorithm

The procedure involves ranking the regressors by t-ratios or S_T, then adopting the ‘bottom up’ approach following the ‘Pantula principle’, where candidate models are built by successively adding regressors in order of importance. The steps are as follows.
1. Order all regressors by method m (i.e. either the t-ratios or S_T).
2. Define the initial candidate model as the empty set of regressors (i.e. one with only the constant term).
3. Add to the candidate model the highest-ranking regressor (that is not already in the candidate model).
4. Perform an F test, comparing the validity of the candidate model to that of the GUM.
5. If the p-value of the F test in step 4 is below a given significance level α, go to step 3 (continue adding regressors); otherwise, go to step 6.
6. Since the F-test has not rejected the model in step 4, this is the selected model γ(m).
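The steps above can be sketched as follows (our illustration, not the authors' code; the F-statistic convention assumes a constant is always included, and scipy is assumed available for the F distribution):

```python
import numpy as np
from scipy import stats

def rss(y, Xsub):
    """Residual sum of squares of OLS on Xsub plus a constant."""
    n = len(y)
    Z = np.column_stack([np.ones(n), Xsub]) if Xsub.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ coef
    return r @ r

def pantula_select(y, X, order, alpha=0.05):
    """Steps 1-6: starting from the empty model, add regressors in the
    given importance order; F-test each candidate against the GUM and
    return the first candidate the test does not reject."""
    n, p = X.shape
    rss_gum = rss(y, X)
    df_u = n - p - 1                       # GUM residual degrees of freedom
    for h in range(p + 1):
        included = list(order[:h])
        if h == p:
            return included                # reached the GUM itself
        rss_h = rss(y, X[:, included])
        F = ((rss_h - rss_gum) / (p - h)) / (rss_gum / df_u)
        pval = stats.f.sf(F, p - h, df_u)  # F-test of the p-h exclusions
        if pval >= alpha:
            return included                # step 6: candidate not rejected

rng = np.random.default_rng(7)
n, p = 300, 6
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 4] + rng.standard_normal(n)
order = [0, 4, 1, 2, 3, 5]   # a hypothetical importance ordering (S_T or |t|)
selected = pantula_select(y, X, order, alpha=0.05)
```

Because every candidate is tested against the GUM, the procedure stops as soon as the excluded regressors are jointly insignificant, which is the single testing sequence of the Pantula principle.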

In the following, the notation γ(t) (respectively γ(S)) is used to denote the model selected by this algorithm when t-ratios (respectively S_T) are used for the ordering. Note that candidate variables are added starting from an empty specification; this is hence a ‘bottom up’ approach induced by the ‘Pantula principle’.

One can observe that this ‘bottom up’ approach is in line with the GETS philosophy of model selection; in fact it corresponds to the nesting of models known as the ‘Pantula-principle’ in cointegration rank determination, see Johansen (1996). Every model in the sequence is compared with the GUM, and hence the sequence of tests can be interpreted as an implementation of the GETS philosophy. Moreover, it can be proved that, for large sample sizes, the sequence selects the smallest true model in the sequence with probability equal to 1 − α, where α is the size of each test. Letting α tend to 0 as the sample size gets large, one can prove that this delivers a true model with probability tending to 1.¹⁵

As a last step, the final choice of regressors γ̂ is chosen between γ(t) and γ(S) as the one with the fewest regressors (since both models have been declared valid by the F-test). If the number of regressors is the same, but the regressors are different, the choice is made using the BIC.

The GSA algorithm depends on some key constants; the significance level of the F-test, α, is a truly ‘sensitive’ parameter, in that varying it strongly affects the algorithm's performance. Of the remaining constants in the algorithm, N, the number of points in the GSA sampling, can be increased to improve accuracy; in practice it was found that N = 128 provided good results, and further increases made little difference.

In the following two subsections, two extensions to the basic algorithm are outlined, with the reasoning explained.

4.2 Adaptive-α

Varying α essentially dictates how 'strong' the effect of a regressor must be for it to be included in the final model: a high α value will tend to include more variables, whereas a low value will cut out variables more harshly. The difficulty is that some DGPs require a low α for accurate identification of the true regressors in T, whereas others require higher values. Hence, there may exist no single value of α that is suitable for the identification of all DGPs.

A proposed modification to deal with this problem is to use an 'adaptive α', α_ϕ, which is allowed to vary depending on the data. This is based on the observation that the F-test returns a high p-value p_H (typically of the order 0.2–0.6) when the proposed model is a superset of the DGP, but when one or more of the regressors in T are missing from the proposed model, the p-value will generally be low, p_L (of the order 10⁻³, say). The values of p_H and p_L will vary depending on the DGP and data set, making it difficult to find a single value of α which will yield good results across all DGPs. However, for a given DGP and data set, the p_H and p_L values are easy to identify.

Therefore, it is proposed to use a value of α_ϕ such that, for each data set,

α_ϕ = p_L + ϕ(p_H − p_L)    (7)

where p_H is taken as the p-value resulting from considering a candidate model with all regressors that have S_Ti > 0.01 against the GUM, and p_L is taken as the p-value from considering the empty set of regressors against the GUM. The reasoning behind the definition of p_H is that it represents a candidate model which will contain the DGP regressors with a high degree of confidence. Here ϕ is a tuning parameter that essentially determines how far between p_L and p_H the cutoff should be. Figure 1 illustrates this on a data set sampled from DGP 6B. Note that α_ϕ is used in the F-test both for the t-ranked regressors and for those ordered by S_T.

¹⁵ See for instance Paruolo (2001). Recall that any model whose set of regressors contains the DGP is 'true'.
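As a concrete illustration, Eq. (7) is straightforward to compute. The sketch below is hypothetical helper code, not the paper's implementation; the function name and the example values are assumptions, with magnitudes of the order mentioned in the text.

```python
def adaptive_alpha(p_L, p_H, phi):
    """Eq. (7): alpha_phi = p_L + phi * (p_H - p_L).

    p_L : p-value of the empty model tested against the GUM (typically small).
    p_H : p-value, against the GUM, of the model keeping all regressors
          with S_Ti > 0.01 (typically moderate).
    phi : tuning parameter in [0, 1] locating the cutoff between p_L and p_H.
    """
    return p_L + phi * (p_H - p_L)

# Illustrative values only (assumed, not from the paper)
alpha_phi = adaptive_alpha(p_L=1e-3, p_H=0.4, phi=0.2)  # approx. 0.0808
```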

4.3 Skipping Regressors

To correct situations where the ordering of the regressors is imperfect, a different extension of the algorithm is to test discarding 'weak' regressors from the selected model. Here, a weak regressor is defined as one with a value of S_T lower than a certain threshold, which is set at 0.2. When step 6 is reached, if weak regressors exist in the selected model, they are removed one at a time, each time performing an F-test. If the F-test is satisfied, the regressor is discarded; otherwise it is retained. This approach is used instead of an exhaustive search over the combinations of remaining regressors, because occasionally there may still be too many regressors left to make this feasible.

4.4 Full GSA Algorithm

Adding the extensions discussed in the previous two sections results in the final

full algorithm, which can be described as follows.

Figure 1: p-values from the F-test comparing candidate models to the GUM in a sample from DGP 6B, for the six highest-ranked regressors (x-axis: regressor index in order of importance using S_T; y-axis: p-value). Here ϕ = 0.2 and the adaptive cutoff α_ϕ is marked as a dotted line.

1. Obtain two orderings of regressors, one by the t-ratios and the other by S_T, using the BIC as the output (penalized measure of model fit) q.

2. Obtain p_H as the p-value resulting from considering a candidate model with all regressors that have S_Ti > 0.01 against the GUM, and p_L as the p-value from considering the empty set of regressors against the GUM. Calculate α_ϕ using (7), which is used in all subsequent tests.

3. Define the initial candidate model as the empty set of regressors (i.e. one with only the constant term).

4. Add to the candidate model the regressor with the highest S_T (that is not already in the candidate model).

5. Perform an F-test, comparing the validity of the candidate model to that of the GUM.

6. If the p-value of the F-test in step 5 is below α_ϕ, go to step 4 (continue adding regressors); otherwise, go to step 7.

7. Since the F-test has not rejected the model in step 5, this is the selected model γ(m).

8. Identify any remaining 'weak' regressors as those with S_T < 0.2. Try removing these one at a time: if removing a regressor satisfies the F-test, it is discarded; otherwise it is retained. Repeat this procedure for all weak regressors.

9. Repeat steps 3–8, except use the ordering based on t-ratios, rather than on S_T.

10. Compare the final model selected by S_T with the final model selected by the t-ratios, choosing the one with the fewest regressors (since both have satisfied the F-test). If both final specifications have the same number of regressors, choose the specification with the lowest BIC.
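The selection loop in steps 3–8 can be sketched as follows. This is a schematic Python rendering, not the authors' code: `pvalue_vs_gum` stands in for the F-test comparing a candidate subset of regressors to the GUM, and the toy example at the bottom is purely illustrative.

```python
def gsa_select(ordering, S_T, pvalue_vs_gum, alpha_phi, weak_thresh=0.2):
    """Forward selection against the GUM (steps 3-8 of the full algorithm).

    ordering      : regressor indices, from most to least important.
    S_T           : dict mapping index -> total sensitivity index (step 8).
    pvalue_vs_gum : callable(subset) -> p-value of the F-test comparing
                    the model with `subset` regressors to the GUM.
    alpha_phi     : adaptive significance level from Eq. (7).
    """
    model = []                                  # step 3: start empty
    for idx in ordering:                        # step 4: add by importance
        model.append(idx)
        if pvalue_vs_gum(model) >= alpha_phi:   # steps 5-6: F-test vs GUM
            break                               # step 7: model not rejected
    # Step 8: try dropping 'weak' regressors (S_T below the threshold)
    for idx in [i for i in model if S_T.get(i, 0.0) < weak_thresh]:
        trial = [i for i in model if i != idx]
        if pvalue_vs_gum(trial) >= alpha_phi:
            model = trial
    return model

# Toy example: the "DGP" needs regressors {0, 2}; any superset passes the test.
pv = lambda subset: 0.4 if {0, 2} <= set(subset) else 1e-4
selected = gsa_select([0, 1, 2], {0: 0.5, 1: 0.05, 2: 0.4}, pv, alpha_phi=0.05)
# -> [0, 2]: regressor 1 is added on the way up but dropped in step 8 as weak
```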

In the following section the performance of the algorithm is examined on some benchmark test cases, with and without the extensions introduced in the previous sections. In the following, S_T-full indicates the full procedure as described above; S_T-no-skip refers to the same procedure without the skipping extension (i.e. without step 8); finally, S_T-simple is the version without step 8 and also without step 2 (the adaptive α). For S_T-simple a fixed value of α is used.

5 The Experiments of Hoover and Perez

This section tests the GSA algorithm on a suite of DGP simulation experiments developed by HP. These experiments consider a possibly dynamic regression equation with n = 139 observations and exogenous variables, fixed across experiments, taken from real-world, stationary, macroeconomic time series, in an attempt to represent typical macroeconomic data. Several papers have used HP's experiments to test the performance of other methods (Castle, Doornik, and Hendry 2011; Hendry and Krolzig 1999). HP's experiments are of varying degrees of difficulty for model-search algorithms. Details on the design of the HP DGPs are reported in Appendix B.

The features of HP's experiments prompt a number of considerations. First, because the sample size is limited and fixed, consistency of model-selection algorithms cannot be the sole performance criterion. Secondly, some of the DGPs in HP's experiments are characterized by a low signal-to-noise ratio for some coefficients; the corresponding regressors are labeled 'weak'. This situation makes it very difficult for statistical procedures to discover whether the corresponding regressors should be included or not. This raises the question of how to measure selection performance in this context.

This paper observes that, in the case of weak regressors, one can measure the performance of model-selection algorithms also with respect to a simplified DGP, which contains the subset of regressors with a sufficiently high signal-to-noise ratio; this is called the 'Effective DGP' (EDGP). The definition of the EDGP is made operational using the 'parametricness index' introduced in Liu and Yang (2011); this concept is described in detail in Appendix C. For full transparency, results are also presented relative to the original DGPs in cases where the EDGP is different.

5.1 Orderings Based on t and GSA

As mentioned in Section 4, the DGPs of HP were used as the basis for an initial investigation into the comparative rankings of S_T and the t-ratios. Here, these numerical experiments are described in more detail.

For each of the 11 DGPs under investigation, N_R = 500 replications of Z were generated; on each sample, regressors were ranked by the t-ratios and by S_T, using N = 128 in (5). Both for the t-ratio ranking and the S_T ranking, the ordering is from the best-fitting regressor to the worst-fitting one.

In order to measure how successful the two methods were in ranking regressors, the following measure δ of minimum relative covering size is defined. Indicate by φ_0 = {i_1, …, i_{r_0}} the set containing the positions i_j of the true regressors in the list i = 1, …, p; i.e. for each j one has γ_{0,i_j} = 1. Recall also that r_0 is the number of elements in φ_0. Next, for a generic replication j, let φ_ℓ^(m) = {i_1^(m), …, i_ℓ^(m)} be the set containing the first ℓ positions i_j^(m) induced by the ordering of method m, m equal to t or S_T. Let b_j^(m) = min{ℓ : φ_0 ⊆ φ_ℓ^(m)} be the minimum number of elements ℓ for which φ_ℓ^(m) contains all the true regressors. Observe that b_j^(m) is well defined, because at least for ℓ = p one always has φ_0 ⊆ φ_p^(m) = {1, …, p}. δ is defined to equal b_j^(m) divided by its minimum; this corresponds to the (relative) minimum number of elements in the ordering m that covers the set of true regressors.

Observe that, by construction, one has r_0 ≤ b_j^(m) ≤ p, and that one wishes b_j^(m) to be as small as possible; ideally one would like to have b_j^(m) = r_0. Hence for δ_j^(m) defined as b_j^(m)/r_0 one has 1 ≤ δ_j^(m) ≤ p/r_0. Finally, δ^(m) is defined as the average of δ_j^(m) over j = 1, …, N_R, i.e. δ^(m) = (1/N_R) Σ_{j=1}^{N_R} δ_j^(m).

For example, if the regressors, ranked in descending order of importance by method m in replication j, were x_3, x_12, x_21, x_11, x_4, x_31, …, and the true DGP were x_3, x_11, the measure δ_j would be 2; in fact the smallest ranked set containing x_3 and x_11 has four elements, b_j^(m) = 4, while r_0 = 2.
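Under the definitions above, b_j^(m) and δ_j^(m) are easy to compute. The following sketch (hypothetical function name, not the authors' code) reproduces the worked example:

```python
def covering_stats(ranking, true_set):
    """Return (b, delta): b is the minimum number of top-ranked regressors
    needed to cover all true regressors, and delta = b / r0.

    ranking  : regressor labels ordered from most to least important.
    true_set : labels of the regressors actually in the DGP (all must
               appear somewhere in `ranking` for b to be well defined).
    """
    r0 = len(true_set)
    b = next(ell for ell in range(r0, len(ranking) + 1)
             if set(true_set) <= set(ranking[:ell]))
    return b, b / r0

# Example from the text: ranking x3, x12, x21, x11, x4, x31 and DGP {x3, x11}
b, delta = covering_stats([3, 12, 21, 11, 4, 31], {3, 11})  # b = 4, delta = 2.0
```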

The results over the N_R = 500 replications are summarized in Table 1. Overall, S_T appears to perform better than the t-ordering. For some DGPs (such as DGPs 2 and 5) both approaches perform well (δ = 1, indicating a correct ranking for all 500 data sets). There are other DGPs where the performance is significantly different. In particular, the t-ratio ordering is comparatively deficient on DGPs 3 and 6A, whereas S_T performs worse on DGP 8. This suggests that there are some DGPs in which S_T may offer an advantage over the t-ratios in terms of ranking regressors in order of importance. This implies that a hybrid approach, using both measures, may yield a more efficient method of regressor selection.

5.2 Measures of Performance

The performance of algorithms was measured by HP via the number of times the algorithm selected the DGP as the final specification. Here use is made of measures of performance similar to the ones in HP, as well as of additional ones proposed in Castle, Doornik, and Hendry (2011).

Recall that γ_T = γ_0 is the true set of included regressors, and let γ̂_j indicate the one produced by a generic algorithm in replication j = 1, …, N_R. Define r_j to be the number of correct inclusions of components in the vector γ̂_j, i.e. the number of regression indices i for which γ̂_{j,i} = γ_{0,i} = 1, so that r_j = Σ_{i=1}^{p} 1(γ̂_{j,i} = γ_{0,i} = 1). Recall that r_0 indicates the number of true regressors.

Table 1: Values of δ (average over the 500 data replications per DGP), using the t-test and S_T. Mean refers to the average across DGPs. Comparatively poor rankings are in boldface. [The numerical entries of the table are not recoverable from this extraction.]

The following exhaustive and mutually exclusive categories of results can be defined:
– C_1: exact matches;
– C_2: the selected model is correctly specified, but it is larger than necessary, i.e. it contains all relevant regressors as well as irrelevant ones;
– C_3: the selected model is incorrectly specified (misspecified), i.e. it lacks relevant regressors.

C_1 matches correspond to the case when γ̂_j coincides with γ_T = γ_0; the corresponding frequency C_1 is computed as C_1 = (1/N_R) Σ_{j=1}^{N_R} 1(γ̂_j = γ_T). The frequency of C_2 cases is given by C_2 = (1/N_R) Σ_{j=1}^{N_R} 1(γ̂_j ≠ γ_0, r_j = r_0). Finally, C_3 cases are the residual category, and the corresponding frequency is C_3 = 1 − C_1 − C_2.¹⁶
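These frequencies can be computed directly from the selection vectors. The sketch below uses a hypothetical 0/1 list encoding of γ̂_j and γ_0; the function name and example data are assumptions for illustration only.

```python
def classification_freqs(gamma_hats, gamma_0):
    """Frequencies of C1 (exact match), C2 (correct but larger) and
    C3 (misspecified) over a set of replications.

    gamma_hats : list of 0/1 selection vectors, one per replication.
    gamma_0    : 0/1 vector of the true DGP.
    """
    r0, n = sum(gamma_0), len(gamma_hats)
    C1 = sum(g == gamma_0 for g in gamma_hats) / n
    # C2: not an exact match, but all r0 true regressors are included
    C2 = sum(g != gamma_0 and
             sum(a * b for a, b in zip(g, gamma_0)) == r0
             for g in gamma_hats) / n
    return C1, C2, 1.0 - C1 - C2

gamma_0 = [1, 0, 1, 0]
hats = [[1, 0, 1, 0],   # exact match                -> C1
        [1, 1, 1, 0],   # superset of the DGP        -> C2
        [1, 0, 0, 0]]   # missing a true regressor   -> C3
C1, C2, C3 = classification_freqs(hats, gamma_0)  # each equals 1/3
```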

The performance can be further evaluated through measures taken from Castle, Doornik, and Hendry (2011), known as potency and gauge. First, the retention rate p̃_i of the i-th variable is defined as p̃_i = (1/N_R) Σ_{j=1}^{N_R} 1(γ̂_{j,i} = 1). Then, potency and gauge are defined as follows:

potency = (1/r_0) Σ_{i: β_{0,i} ≠ 0} p̃_i,    gauge = (1/(p − r_0)) Σ_{i: β_{0,i} = 0} p̃_i.

Potency therefore measures the average frequency of inclusion of regressors belonging to the DGP, while gauge measures the average frequency of inclusion of regressors not belonging to the DGP. An ideal performance is thus represented by a potency value of 1 and a gauge of 0.
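A direct implementation of the retention rate, potency and gauge, in the same hypothetical 0/1 encoding used above (an illustrative sketch, not the authors' code):

```python
def potency_gauge(gamma_hats, gamma_0):
    """Potency and gauge (Castle, Doornik and Hendry 2011).

    The retention rate of regressor i is the fraction of replications
    in which it is selected; potency averages it over the true
    regressors, gauge over the irrelevant ones.
    """
    n, p, r0 = len(gamma_hats), len(gamma_0), sum(gamma_0)
    retention = [sum(g[i] for g in gamma_hats) / n for i in range(p)]
    potency = sum(r for r, t in zip(retention, gamma_0) if t == 1) / r0
    gauge = sum(r for r, t in zip(retention, gamma_0) if t == 0) / (p - r0)
    return potency, gauge

gamma_0 = [1, 0, 1, 0]
hats = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 0, 0]]
pot, gau = potency_gauge(hats, gamma_0)  # potency = 5/6, gauge = 1/6
```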

In calculating these measures, HP chose to discard MC replications for which a preliminary application of the battery of misspecification tests defined in (15) in Appendix A reported a rejection.¹⁷ This choice is called in the following the 'pre-search elimination' of MC replications.

¹⁶ C_1 corresponds to Category 1 in HP; C_2 corresponds to Category 2 + Category 3 − Category 1 in HP; finally, C_3 corresponds to Category 4 in HP.

¹⁷ The empirical percentage of samples that were discarded in this way was found to be proportional to the significance level α. This fact, however, did not significantly influence the number of C_1 catches. Hence the HP procedure was allowed to discard replications as in the original version. For the GSA algorithm no pre-search elimination was performed.

5.3 Benchmark

The performance of HP's algorithm is taken as a benchmark. The original MATLAB code for generating data from HP's experiments was downloaded from HP's home page.¹⁸ The original scripts were then updated to run on the current version of MATLAB. A replication of the results in Tables 4, 6 and 7 in HP is reported in the first panel of Table 2, using nominal significance levels of α = 1, 5, 10% and N_R = 10³ replications. The results do not appear to be significantly different from the ones reported in HP.

When checking the original code, a coding error was found in the original HP script for the generation of the AR series u_t in Eq. (13): it produced simulations of a moving-average process of order 1, MA(1), with MA parameter 0.75, instead of an AR(1) with AR parameter 0.75.¹⁹ The script was hence modified to produce u_t as an AR(1) with AR parameter 0.75; this is called the 'modified script' in the following.
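To see the difference the coding slip makes, compare the two recursions below. This is a hypothetical Python sketch, not the HP MATLAB script: with the same shock sequence, the two series coincide for the first two observations and then diverge, since the AR(1) feeds back the entire past while the MA(1) carries only one lagged shock.

```python
def simulate_ar1(eps, rho=0.75):
    """Intended DGP: u_t = rho * u_{t-1} + eps_t."""
    u, out = 0.0, []
    for e in eps:
        u = rho * u + e
        out.append(u)
    return out

def simulate_ma1(eps, theta=0.75):
    """What the original script effectively produced:
    u_t = eps_t + theta * eps_{t-1}."""
    out, prev = [], 0.0
    for e in eps:
        out.append(e + theta * prev)
        prev = e
    return out

eps = [1.0, -0.5, 2.0, 0.3, -1.2]   # fixed shocks for reproducibility
ar = simulate_ar1(eps)               # [1.0, 0.25, 2.1875, ...]
ma = simulate_ma1(eps)               # [1.0, 0.25, 1.625, ...]
```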

Re-running the DGP simulation experiments using this modified script, the results in the second panel of Table 2 were obtained; for this set of simulations N_R = 10⁴ replications were used. Comparing the first and second panels of the table for the same nominal significance level α, one observes a significant increase in C_1 catches for DGPs 2 and 7. One reason for this may be that, when the modified script is employed, the regression model is well specified, i.e. it contains the DGP as a special case.²⁰ Table 2 documents how HP's algorithm depends on α, the significance level chosen in the test R in (15).

5.4 Alternative Algorithms

This section presents results using the performance measures introduced in Section 5.2. The results compare the three variations of the S_T algorithm with the modified HP code. To compare with a similar but more recent GETS implementation, the Autometrics package 'gets' (see Pretis, Reade, and Sucarrat 2018) is also added as an additional algorithm in the comparison.

¹⁸ http://www.csus.edu/indiv/p/perezs/Data/data.htm.

¹⁹ This means that the results reported in HP for DGPs 2, 3, 7, 8 and 9 refer to a misspecified model. The MA process can be inverted to obtain an AR(∞) representation; substituting from the y_t equation as before, one finds that the DGP contains an infinite number of lags of the dependent variable and of the x*_it variables, with exponentially decreasing coefficients. The entertained regression model with four lags of the dependent variable and two lags of the x*_it variables can be considered an approximation to the DGP.

²⁰ This finding is similar to the one reported in Hendry and Krolzig (1999), Section 6; they re-ran HP's experiments using PcGets, and document similar increases in C_1 catches for DGPs 2 and 7 for their modified algorithms. Hence, it is possible that this result is driven by the correction of the script for the generation of the AR series.

The performance is measured with respect to the true DGP, or with respect to the Effective DGP (EDGP) that one can hope to recover given the signal-to-noise ratio. Because the HP, GSA and Autometrics algorithms depend on tunable constants, results are given for various values of these constants.

The procedure employed to define the EDGP is discussed in Appendix C; it implies that the only EDGPs differing from the true DGP are DGP 6 and DGP 9. DGP 6 contains regressors 3 and 11, but regressor 3 is weak and hence EDGP 6 contains only regressor 11. DGP 9 contains regressors 3, 11, 21, 29 and 37, but regressors 3 and 21 are weak and are dropped from the corresponding EDGP 9. More details are given in Appendix C.

Both the HP algorithm and Autometrics depend on the significance level α, whereas the GSA algorithm depends on the threshold ϕ (which controls α_ϕ) for S_T-no-skip and S_T-full, and on α for S_T-simple. Because the values of α and ϕ can seriously affect the performance of the algorithms, a fair comparison of the performance of the algorithms may be difficult, especially since the true parameter values will not be known in practice. To deal with this problem, the performance of the algorithms was measured at a number of parameter values within a plausible range.

Table 2: Percentages of Category 1 matches C_1 for different values of α. Original script: data generated by the original script, N_R = 10³ replications. The frequencies are not statistically different from the ones reported in HP (Tables 4, 6, 7). Modified script: data from the modified script for the generation of the AR series, N_R = 10⁴ replications. [The numerical entries of the table are not recoverable from this extraction.]

This allowed two ways of comparing the algorithms: first, the 'optimized' performance, corresponding to the value of α or ϕ that produced the highest C_1 score, averaged over the 11 DGPs. This can be viewed as the 'potential performance'. In practice, the optimization was performed with a grid search on α and ϕ with N_R = 10³ replications, averaging across DGPs.

Secondly, a qualitative comparison was performed between the algorithms by comparing their average performance over the range of parameter values. This latter comparison gives some insight into the more realistic situation, where the optimum parameter values are not known.

5.5 Results for Optimal Values of Tuning Coefficients

Table 3 shows the classification results in terms of C_1 matches, as well as the potency and gauge measures, for all algorithms at their optimal parameter values, using N_R = 10⁴. Note that the value of α = 4 × 10⁻⁴ for Autometrics represents the lowest value of α that it was possible to assign without errors occurring due to singular matrices, likely due to issues with numerical precision. Results for the S_T algorithm are shown with and without the extensions discussed in Section 4. Recovery of the true specification is here understood in the EDGP sense.

The C_1 column measures the percentage frequency with which the algorithms identified the EDGP. One notable fact is that the performance of the HP algorithm is vastly improved (compared to the results in HP's original paper) simply by setting α to a better value, in this case α = 4 × 10⁻⁴; compare with Table 2.

The comparison shows that with the full S_T algorithm, the correct classification rate (C_1) is 98.9%, compared with 94.3% for HP and 88.6% for Autometrics. It is presumed that if it were possible to reduce the value of α for Autometrics further, the mean C_1 value would increase still further. However, it was not possible to test this conjecture. Removing the 'skipping' extension, the average performance falls to 96.7%, and further to 92.6% without the adaptive-α feature.

Examining the DGPs individually, the GSA algorithm performs well on all DGPs, although there are slightly lower C_1 values for DGPs 3 and 6A (around 96%). For HP, these differences are more marked, with C_1 = 62% for DGP 3 and C_1 = 85.3% for DGP 7. Autometrics also has lower success rates: 74% for DGP 3 and 87% for DGP 7.