
William Becker*, Paolo Paruolo and Andrea Saltelli

Variable Selection in Regression Models

Using Global Sensitivity Analysis

https://doi.org/10.1515/jtse-2018-0025

Received August 31, 2018; accepted February 16, 2021

Abstract: Global sensitivity analysis is primarily used to investigate the effects of

uncertainties in the input variables of physical models on the model output. This

work investigates the use of global sensitivity analysis tools in the context of

variable selection in regression models. Specifically, a global sensitivity measure

is applied to a criterion of model fit, hence defining a ranking of regressors by

importance; a testing sequence based on the ‘Pantula-principle’ is then applied to

the corresponding nested submodels, obtaining a novel model-selection method.

The approach is demonstrated on a growth regression case study, and on a number

of simulation experiments, and it is found competitive with existing approaches to

variable selection.

Keywords: model selection, Monte Carlo, sensitivity analysis, simulation

JEL classification: C52, C53

1 Introduction

Model selection in regression analysis is a central issue, both in theory and in

practice. Related fields include multiple testing (Bittman et al. 2009; Romano and

Wolf 2005), pre-testing (Leeb and Poetscher 2006), information criteria (Hjort and

Claeskens 2003; Liu and Yang 2011), model selection based on Lasso (Brunea 2008), model averaging (Claeskens and Hjort 2003), stepwise regression (Miller 2002), risk inflation in prediction (Foster and George 1994), and directed acyclic graphs and causality discovery (Freedman and Humphreys 1999).¹

Information and views set out in this paper are those of the authors and do not necessarily reflect the ones of the institutions of affiliation.

*Corresponding author: William Becker, European Commission, Joint Research Centre, Ispra, VA, Italy. E-mail: william.becker@bluefoxdata.eu. https://orcid.org/0000-0002-6467-4472
Paolo Paruolo, European Commission, Joint Research Centre, Ispra, VA, Italy. E-mail: paolo.paruolo@ec.europa.eu. https://orcid.org/0000-0002-3982-4889
Andrea Saltelli, Open Evidence Research, Universitat Oberta de Catalunya, Barcelona, Spain. E-mail: andrea.saltelli@gmail.com. https://orcid.org/0000-0003-4222-6975

J. Time Ser. Econom. 2021; 13(2): 187–233

Open Access. © 2021 William Becker et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.

Model choice is also of primary concern in many areas of applied econometrics,

as witnessed for example by the literature on growth regression (Sala-i-Martin 1997).

Controlling for the right set of covariates is central in the analysis of policy impact

evaluations; this is embodied in the assumption of unconfoundedness (Imbens and

Wooldridge 2009). In economic forecasting, model selection is the main alternative

to model averaging (Hjort and Claeskens 2003).

The analysis of the effects of pretesting on parameter estimation has a long

tradition in econometrics (Danilov and Magnus 2004) and in this context Magnus

and Durbin (1999) and co-authors proposed the weighted average least squares

estimator (WALS) and compared it with model averaging for growth empirics

(Magnus, Powell, and Prufer 2010).

Model selection is a major area of investigation also in time-series econometrics

(Phillips 1997, 2003). The so-called London School of Economics (LSE) methodology

has played a prominent role in this area, advocating the general-to-specific (GETS) approach to model selection; see Castle, Doornik, and Hendry (2011), Hendry and Krolzig (2005), and references therein. In a widely cited paper, Hoover and Perez (1999)

(hereafter HP) ‘mechanized’—i.e. translated—the GETS approach into an algorithm

for model selection and they then tested the performance of the HP algorithm

on a set of time-series regression experiments, constructed along the lines of

Lovell (1983).

Model selection is also related to the issue of regression coefficients’ robustness

(i.e. lack of sensitivity) to the omission/inclusion of additional variables. Leamer

(1983) proposed extreme bound analysis, i.e. to report the range of possible

parameter estimates of the coefﬁcient of interest when varying the additional

regressors included in the analysis, as an application of sensitivity analysis to

econometrics. Other applications of sensitivity analysis to econometrics include the

¹ Model selection is also associated with current rules of thumb on the maximum number of regression parameters to consider. This literature appears to have been initiated by Freedman (1983), who considered the case of a first screening regression with 50 regressors and 100 data points, where regressors that are significant at the 25% significance level are kept in a second regression. Freedman showed that the second regression is troublesome when one acts as if the screening regression had not been performed and the ratio of the number of observations to the number of regressors in the screening regression is kept in a fixed proportion as the number of observations diverges. This study was followed by Freedman and Pee (1989) and Freedman, Pee, and Midthune (1992), who defined the rule of thumb that the number of observations per regressor should be at least 4; this rule is included in Harrell (2001), who suggested to have it at least equal to 10.


local sensitivity to model misspecification developed in Magnus and Vasnev (2007) and Magnus (2007).²

On the other hand, sensitivity analysis originated in the natural sciences, and is

generally defined as ‘the study of how the uncertainty in the output of a mathematical model or system (numerical or otherwise) can be apportioned to different sources of uncertainty in its inputs’ (Saltelli, Tarantola, and Campolongo 2000). The

term global sensitivity analysis (GSA) is used to refer to sensitivity analysis

approaches that fully explore the space of uncertainties, as opposed to ‘local’

methods which are only valid at a nominal point (Saltelli and Annoni 2010). The

main tools used in GSA are based on a decomposition of the variance of the model

output (Sobol’ 1993).

Despite these several uses of sensitivity analysis in econometrics, the present authors are not aware of systematic applications of the techniques of global sensitivity analysis to

the problem of model selection in regression. With this in mind, the present paper

explores the application of variance-based measures of sensitivity to model

selection.

This paper aims to answer the question: “Can GSA methods help in model

selection in practice?”, rather than to propose a single algorithm aimed at dominating all alternatives. To this purpose, a simple algorithm is considered as a

representative of a novel GSA approach; the new algorithm is found to perform

rather well when compared with alternatives. This shows how GSA methods can

indeed bring a useful contribution to this field.

In particular, a widely-used measure in the GSA literature, called the ‘total sensitivity index’, is employed to rank regressors in terms of their importance in a

regression model. The information on the ordering of regressors given by GSA

methods appears to be somewhat complementary to the one based on t-ratios

employed in the GETS approach; this suggests considering orderings of regressors that combine the two. Based on this insight, a GSA algorithm is constructed which combines the two rankings.

The proposed GSA representative algorithm uses the ordering of the regressors

via GSA or the t-ratios within a testing strategy based on the ‘Pantula-principle’,

see Pantula (1989). For any ordering of the regressors, this amounts to a single

sequence of tests against the full model, starting from the most restricted submodel

² They show that local sensitivity measures provide complementary information with respect to standard diagnostic tests for misspecification, i.e. that the two types of statistics are asymptotically independent. In SA, a local measure of sensitivity is one focused on a precise point in the space of the input factors, e.g. a partial derivative of the output with respect to an input. With a global measure of sensitivity, the influence of a given input on the output is averaged both over the distribution of the input factor itself and over the distributions of all the remaining factors, see Saltelli, Andres, and Homma (1993).


to the most unrestricted one.³

This implies both a reduction in the number of tests

for each given ordering (with an associated saving of computing time) and favorable control of the size of the testing sequence. The present application of the ‘Pantula-principle’ appears novel in the context of model selection.

The GSA algorithm is tested here using several case studies. A detailed

investigation of the performance of the GSA algorithm is first performed using the

simulation experiments of HP, who defined a set of Data Generating Processes

(DGPs) based on real economic data. Simulating these DGPs, one can record how

often the algorithm recovers the variables that are included in the DGP. This is

compared to the results of HP’s GETS algorithm, as well as those of the Autometrics

GETS package (Pretis, Reade, and Sucarrat 2018).

In order to further compare the GSA approach to a wider set of model selection

procedures, the DGPs in Deckers and Hanck (2014, hereafter DH) are also considered; this allows

a direct comparison with a number of procedures. Finally, the algorithm is applied

to a growth regression case study which is also taken from the same paper.

Overall, results point to the possible usefulness of GSA methods in model selection algorithms. When comparing the optimized GSA and HP algorithms, the GSA method appears to be able to reduce the failure rate in recovering the underlying data generating process from approximately 5% to 1%—a fivefold reduction. When some of the regressors are weak, the recovery of the exact DGP does not appear to be

improved by the use of GSA methods.

Comparing the GSA algorithm to a wider set of approaches considered in DH,

the results are competitive with alternatives, in the sense that the GSA algorithm is

not dominated by alternative algorithms in the Monte Carlo (MC) simulations. In

the empirical application on growth regression, not surprisingly, it identifies

similar variables to those found by other methods. While these results do not prove

the GSA approach to dominate other existing approaches, they show that the GSA

approach is not dominated by any single alternative, and that it has the potential to contribute to improving existing algorithms; the present study may hence pave the way for future advances.

The rest of the paper is organized as follows. Section 2 defines the problem of interest and introduces GSA and variance-based measures. Section 3 presents some

theoretical properties of orderings based on the total sensitivity index, while

Section 4 presents the GSA algorithm. Results are reported in Sections 5 and 6,

where the former is a detailed investigation on datasets generated following the

paper of Hoover and Perez (1999), and the latter is a comparison with a wide range

of model selection procedures on simulated data sets and on a growth regression,

following Deckers and Hanck (2014). Section 7 concludes. Three appendices report

³ This sequence can still be interpreted as compliant with the GETS principle.


proofs of the propositions in the paper, details on the DGP design in HP and a

discussion about the identifiability of DGPs. Finally, this paper follows the notational conventions in Abadir and Magnus (2002).

2 Model Selection and Global Sensitivity Analysis

This section presents the setup of the problem, and introduces global sensitivity

analysis. The details of the proposed algorithm are deferred to Section 4.

2.1 Model Selection in Regression

Consider n data points in a standard multiple regression model with p regressors of the form

    y = X_1 β_1 + … + X_p β_p + ε = X β + ε,    (1)

where y = (y_1, …, y_n)′ is n×1, X = (X_1, …, X_p) is n×p, X_i ≔ (x_{i,1}, …, x_{i,n})′ is n×1, β = (β_1, …, β_p)′ is p×1, and ε is an n×1 random vector with distribution N(0, σ²I_n). The symbol ′ indicates transposition.

Equation (1) describes both the model and the DGP. In the model, the coefficients β_i are parameters to be estimated given the observed data Z = (y, X). Each DGP is described by Eq. (1) with β_i set at some numerical values, here indicated as β_{0i}, and collected in the vector β_0 = (β_{01}, …, β_{0p})′.

Some of the true β_{0i} may be 0, corresponding to irrelevant regressors. Let T ≔ {i ∈ J : β_{0,i} ≠ 0} be the set of all relevant regressor indices in the DGP, with r_0 elements, where J ≔ {1, …, p} indicates the set of the first p integers. Let also M ≔ J \ T indicate the set of all regressor indices for irrelevant regressors.⁴

Equation (1) also formally nests dynamic specifications, as detailed in Appendix B below; in this case the X_i contain lagged dependent variables, and (1) is generated recursively.

Imposing the restriction β_i = 0 for some regressors i, one obtains a submodel⁵ of model (1). Each submodel can be characterized by a set a, a ⊆ J, containing the indices of the included regressors. For instance, a = {1, 5, 9} indicates the submodel including regressors numbered 1, 5, 9. The model without any restriction of the form β_i = 0 is called the general unrestricted model (GUM).

⁴ Here J \ T denotes the set difference J \ T ≔ {i : i ∈ J, i ∉ T}; sums over empty sets are understood to be equal to 0.
⁵ In the paper ‘submodel’ and ‘specification’ are used as synonyms.


Alternatively, the same information on submodel a can be represented by a p×1 vector γ_a = (γ_1, …, γ_p)′, with j-th coordinate γ_j taking value 1 (respectively 0) to indicate the inclusion (respectively exclusion) of regressor j in the specification, i.e. γ_j = 1(j ∈ a), where 1(⋅) is the indicator function.⁶ The GUM corresponds to γ equal to ı, a vector of all 1s. γ_T corresponds to the best selection of regressors, i.e. the same one as in the DGP; in the following the notation γ_T = γ_0 is also used.

Let Γ be the set of all p×1 vectors of indicators γ, Γ = {0, 1}^p. Note that there are 2^p different specifications, i.e. 2^p possible γ vectors in Γ. When p = 40, as in some experiments in Section 5, the number of specifications is 2^p ≈ 1.0995 ⋅ 10^12, a very large number. This is why an exhaustive search of submodels is infeasible in many practical cases, and model selection techniques focus on a search over a limited set of submodels Γ_s ⊂ Γ.

Each submodel can be written as model (1) under the restriction

    β = H_γ ϕ,    (2)

where H_γ contains the columns of an identity matrix I_p corresponding to elements γ_i equal to 1 within γ. Specification (2) is referred to as the ‘γ submodel’ in the following. Also the ‘true’ vector β_0 has representation β_0 = H_0 ϕ_0, where H_0 is a simplified notation for H_0 = H_{γ_T} = H_{γ_0}.

The least squares estimator of β in submodel γ can be written as

    β̂_γ = H_γ (H_γ′ X′ X H_γ)^{−1} H_γ′ X′ y.    (3)

The problem of interest is to retrieve T, or the corresponding γ_T, given the observed data Z = (y, X), i.e. to identify the DGP.⁷
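In code, the estimator in eq. (3) amounts to selecting columns of the identity matrix. A minimal numpy sketch (our own illustration, not the authors' implementation; the constant/de-meaning convention of footnote 7 is omitted for brevity):

```python
import numpy as np

def submodel_ols(y, X, gamma):
    """Least-squares estimate of beta in the gamma-submodel, eq. (3):
    beta = H_gamma * phi with H_gamma the columns of I_p where gamma == 1,
    so excluded regressors get a coefficient of exactly zero."""
    p = X.shape[1]
    H = np.eye(p)[:, np.asarray(gamma, dtype=bool)]   # H_gamma
    XH = X @ H
    phi_hat = np.linalg.solve(XH.T @ XH, XH.T @ y)    # LS within the submodel
    return H @ phi_hat                                # p-vector beta-hat_gamma

# Example: DGP y = 2*x1 - x3 + noise; fit the submodel a = {1, 3}
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.standard_normal(200)
beta_hat = submodel_ols(y, X, [1, 0, 1, 0])
```

The returned vector has zeros in the excluded positions, matching the restriction β = H_γ ϕ.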

2.2 GSA Approach

General-to-specific (GETS) approaches such as the algorithm used by HP (described

in detail in Section 5) use t-ratios to rank regressors in order of importance, which

guides the selection of the set of submodels Γ_s. This study proposes instead to decompose the selection of models into two stages:

(i) define an ordering of regressors based on their importance;

⁶ Similarly, the notation a_γ ≔ {i_1, …, i_{k_γ}}′ is used to indicate the index set corresponding to some vector γ.
⁷ All empirical models are assumed to contain the constant; this is imposed implicitly by de-meaning the y and X_i vectors. Hence in the following, the ‘empty set of regressors’ refers to the regression model with only the constant.


(ii) use a sequence of p tests that compare the GUM with submodels which contain the first h most important regressors, starting from h = 0, 1, 2, … and ending at the first submodel r for which the null hypothesis is not rejected.

In this paper the ordering in (i) based on the t-ratios is complemented with a

variance-based measure of importance from GSA, called the ‘total sensitivity

index’. The proposed algorithm, called the ‘GSA algorithm’, combines this new

ranking with the ranking by t-ratios.

The testing sequence is defined based on this new ranking; a ‘bottom-up’ selection process is adopted, which builds candidate models by adding regressors in descending order of importance. This ‘bottom-up’ selection process follows the ‘Pantula principle’ and has well-defined theoretical properties, see e.g. Paruolo (2001), and it can still be interpreted as a GETS procedure.

The total sensitivity index in GSA is based on a systematic exploration of the space of the inputs to measure their influence on the system output, as is commonly practiced in mathematical modeling in the natural sciences and engineering. It provides a global measure of the influence of each input to a system.⁸ Reviews of global sensitivity analysis methods used therein are given in Saltelli et al. (2012), Norton (2015), Becker and Saltelli (2015), Wei, Lu, and Song (2015).⁹ The total sensitivity index is a variance-based measure of sensitivity; such measures are the analogue of the analysis of variance, see Archer, Saltelli, and Sobol (1997).¹⁰

Given the sample data Z = (y, X), consider the γ submodel, see eqs. (1), (2) and (3). Let q(γ) indicate the BIC measure of model fit of this submodel,

    q(γ) = log σ̂²_γ + k_γ c_n,   with c_n ≔ log(n)/n.¹¹

Remark that q is a continuous random variable that depends on the discretely-valued γ. The idea is to apply the total sensitivity index using q as output, with γ as input. Although the BIC is used here as q, the measure of model fit, other consistent information criteria or the maximized log-likelihood could be used instead.
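A minimal sketch of q(γ), under the assumption that σ̂²_γ is the maximum-likelihood residual variance RSS/n and k_γ the number of included regressors (our reading of the formula; the de-meaning convention follows footnote 7):

```python
import numpy as np

def q_bic(y, X, gamma):
    """q(gamma) = log(sigma2_hat) + k_gamma * log(n)/n, where sigma2_hat
    is the ML residual variance RSS/n of the gamma-submodel and k_gamma
    the number of included regressors; y and X are de-meaned so the
    all-zero gamma is the 'empty set of regressors' (constant only)."""
    n = len(y)
    g = np.asarray(gamma, dtype=bool)
    yc = y - y.mean()
    if g.any():
        Xg = X[:, g] - X[:, g].mean(axis=0)
        coef, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
        resid = yc - Xg @ coef
    else:
        resid = yc
    sigma2 = resid @ resid / n
    return np.log(sigma2) + g.sum() * np.log(n) / n

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X[:, 0] + 0.2 * rng.standard_normal(100)
q_true = q_bic(y, X, [1, 0, 0])    # the DGP specification
q_empty = q_bic(y, X, [0, 0, 0])   # empty set of regressors
```

Lower q is better fit; here the DGP specification scores well below the empty model.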

⁸ The ‘mechanistic’ models in these disciplines are mostly principle-based, possibly involving the solution of some kind of (differential) equation or optimization problem, and the output—being the result of a deterministic calculation—does not customarily include an error term.
⁹ Recent applications of these methods to the quality of composite indicators are given in Paruolo, Saltelli, and Saisana (2013) and Becker et al. (2017).
¹⁰ Variance-based methods explore the entire distribution of each factor.
¹¹ q can be taken to be any consistent information criterion, where consistent information criteria replace log n with some other increasing function f(n) of n with the property c_n = f(n)/n → 0. Here the fact that n c_n diverges is not used in the proofs. Note that q(γ) is a function of Z, but this is not indicated in the notation for simplicity.


The objective is to capture both the main effects and the interaction effects of the input factors on the output q, see Saltelli et al. (2012). The following section defines the total sensitivity index.

2.3 Sensitivity Measures

Let E indicate the empirical expectation over Γ, i.e. E(h(γ)) ≔ (#Γ)^{−1} Σ_{γ∈Γ} h(γ) for any function h. Let also V indicate the variance operator associated with E, V(h) ≔ E(h²) − (E(h))².

The γ vector is partitioned into two components γ_i and γ_{−i}, where γ_{−i} contains all elements in γ except γ_i. Let E(⋅|b) and V(⋅|b) (respectively E(⋅) and V(⋅)) indicate the conditional (respectively marginal) expectation and variance operators with respect to a partition (a, b) of γ, where a and b are taken equal to γ_i and to γ_{−i}.

Two commonly-accepted variance-based measures are reviewed here: the ‘first-order sensitivity index’ S_i, Sobol’ (1993), and the ‘total-order sensitivity index’ S_Ti, Homma and Saltelli (1996); both rely on decomposing the variance of the output, V = V(q), into portions attributable to inputs or sets of inputs.

The first-order index measures the contribution to V = V(q) of varying the i-th input alone, and it is defined as S_i = V(E(q|γ_i))/V. This index can be seen as the application of Karl Pearson’s correlation ratio η², see Pearson (1905), to the present context. This corresponds to seeing the effect of including or not including a regressor, but averaged over all possible combinations of the other regressors. However, this measure does not account for interactions with the inclusion/exclusion of other regressors; hence it is not used in the present paper.

Instead, here the focus is placed on the total effect index, which is defined by Homma and Saltelli (1996) as

    S_Ti = E(V(q|γ_{−i}))/V = 1 − V(E(q|γ_{−i}))/V.    (4)

In the following, the numerator of S_Ti is indicated as σ²_Ti = E(V(q|γ_{−i})), and the shorthand S_T for S_Ti is often used.

Examining σ²_Ti, one can notice that the inner term, V(q|γ_{−i}), is the variance of q due to the inclusion/exclusion of regressor i, conditional on a given combination γ_{−i} of the remaining regressors. The outer expectation then averages over all values of γ_{−i}; this quantity is then standardized by V to give the fraction of total output variance caused by the inclusion of x_i. The second expression shows that S_Ti is 1 minus the first-order effect for γ_{−i}.
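Under the empirical expectation over Γ (uniform on all 2^p inclusion vectors), both indices can be computed exactly by enumeration for small p. The following sketch uses a toy output q(γ) of our own (not from the paper) to illustrate that S_Ti picks up interactions that S_i misses, and that both vanish for an input q does not depend on:

```python
import itertools
import numpy as np

# Toy output q(gamma) on Gamma = {0,1}^3: gamma_3 is irrelevant,
# gamma_1 and gamma_2 interact through the 0.5*g1*g2 term.
def q(g):
    return -2.0 * g[0] - 1.0 * g[1] + 0.5 * g[0] * g[1]

Gamma = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
qv = np.array([q(g) for g in Gamma])
V = qv.var()   # V(q) under the empirical (uniform) law on Gamma

def S_first(i):
    """First-order index S_i = V(E(q | gamma_i)) / V."""
    cond_means = [qv[Gamma[:, i] == v].mean() for v in (0.0, 1.0)]
    return np.var(cond_means) / V

def S_total(i):
    """Total index S_Ti = E(V(q | gamma_{-i})) / V: the variance from
    flipping input i, averaged over the 2^(p-1) settings of the others."""
    others = [j for j in range(3) if j != i]
    key = Gamma[:, others]
    settings = list(itertools.product([0.0, 1.0], repeat=2))
    total = 0.0
    for fixed in settings:
        mask = (key == np.array(fixed)).all(axis=1)
        total += qv[mask].var()
    return total / len(settings) / V
```

For this q, S_T1 exceeds S_1 (it absorbs the g1·g2 interaction), while S_3 = S_T3 = 0 since γ_3 never affects q.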

These measures are based on the standard variance decomposition formula, or ‘law of total variance’ (Billingsley 1995, Problem 34.10(b)). In the context of GSA, these decomposition formulae are discussed in Archer, Saltelli, and Sobol (1997), Saltelli and Tarantola (2002), Sobol’ (1993), Brell, Li, and Rabitz (2010). For further reading about GSA in its original setting, see Saltelli et al. (2012).

2.4 Estimation of the Total Sensitivity Index

In order to calculate the total sensitivity measure S_Ti one should be able to compute q(γ) for all γ ∈ Γ (i.e. estimate all possible submodels of the GUM), which is unfeasible or undesirable. Instead, S_Ti can be estimated from a random subset of Γ, i.e. a sample of models. The estimation of S_Ti is performed using an estimator and a structured sample constructed as in Jansen (1999), which is a widely used method in GSA.

Specifically, generate a random draw of γ in Γ, say γ*; then consider the element γ*^{(i)} with all elements equal to γ* except for the i-th coordinate, which is switched from 0 to 1 or vice-versa, γ*^{(i)}_i = 1 − γ*_i. Doing this for each coordinate i generates p pairs of γ vectors, γ* and γ*^{(i)}, that differ only in the coordinate i. This is then used to calculate q(γ) and apply an estimator of Jansen (1999).

This process can be formalized as follows: initialize ℓ at 1, then:
1. Generate a random draw of γ, where γ is a p-length vector with each element randomly selected from {0, 1}. Denote this by γ_ℓ.
2. Evaluate q_ℓ = q(γ_ℓ).
3. Take the i-th element of γ_ℓ, and switch it to 0 if it is equal to 1, and to 1 if it is 0. Denote this new vector with inverted i-th element as γ_ℓ^{(i)}.
4. Evaluate q_{iℓ} = q(γ_ℓ^{(i)}).
5. Repeat steps 3 and 4 for i = 1, 2, …, p.
6. Repeat steps 1–5 N times, i.e. for ℓ = 1, 2, …, N.

The estimators for σ²_Ti and V are then defined as in Jansen (1999), see also Saltelli et al. (2010):

    σ̂²_Ti = (1/(4N)) Σ_{ℓ=1}^N (q_{iℓ} − q_ℓ)²,
    V̂ = (1/(N−1)) Σ_{ℓ=1}^N (q_ℓ − q̄)²,    (5)

where q̄ = (1/N) Σ_{ℓ=1}^N q_ℓ. This delivers the following plug-in estimator for S_T,

    Ŝ_Ti = σ̂²_Ti / V̂.
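The sampling scheme of steps 1–6 and the estimators in eq. (5) can be sketched as follows (a toy deterministic criterion stands in for the BIC of a fitted submodel; names are ours, not the authors'):

```python
import numpy as np

def jansen_ST(qfun, p, N, rng):
    """Estimate all total indices S_Ti by the structured design of steps 1-6:
    for each of N random gamma draws, flip one coordinate at a time and
    apply the Jansen-type estimators of eq. (5)."""
    q_base = np.empty(N)
    q_flip = np.empty((N, p))
    for ell in range(N):
        g = rng.integers(0, 2, size=p)        # step 1: random gamma in {0,1}^p
        q_base[ell] = qfun(g)                 # step 2
        for i in range(p):                    # steps 3-5: flip coordinate i
            g_i = g.copy()
            g_i[i] = 1 - g_i[i]
            q_flip[ell, i] = qfun(g_i)
    sigma2_T = ((q_flip - q_base[:, None]) ** 2).sum(axis=0) / (4 * N)  # eq. (5)
    V = q_base.var(ddof=1)                    # (N-1)-denominator sample variance
    return sigma2_T / V                       # plug-in estimator for S_Ti

# Toy deterministic criterion: input 4 is irrelevant, inputs 1 and 2 interact.
def q_toy(g):
    return -2.0 * g[0] - 1.5 * g[1] - 1.0 * g[2] + 0.5 * g[0] * g[1]

rng = np.random.default_rng(0)
ST = jansen_ST(q_toy, p=4, N=128, rng=rng)    # ST[3] is exactly 0 here
```

Since flipping the irrelevant fourth input never changes q_toy, its estimated numerator is exactly zero, while the other inputs get strictly positive estimates.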

Readers familiar with sensitivity analysis may notice that the estimator in (5) differs by a factor of 2 from the estimator quoted in Saltelli et al. (2010). The reason for this is given in Appendix A.¹²

Ŝ_Ti is an accurate estimator for S_Ti as the number N of models increases;¹³ hence, the following discussion is based on the behavior of S_Ti.

3 Properties of Orderings Based on S_Ti

This section investigates the theoretical properties of orderings of variables in a regression model based on S_T, and shows that these orderings satisfy the following minimal requirement: when the true regressors in T (those included in the DGP) and the irrelevant ones in M are uncorrelated, the ordering of regressors based on S_T separates the true from the irrelevant regressors in large samples.

Recall that S_Ti = σ²_Ti/V = E(V(q|γ_{−i}))/V, see (4). The large-n properties of S_Ti are studied under the following regularity assumptions.

Assumption 1 (Assumptions on the DGP). The variables w_t ≔ (y_t, x_{1,t}, …, x_{p,t}, ϵ_t)′ are stationary with finite second moments, and satisfy a law of large numbers for large n, i.e. the second sample moments of w_t converge in probability to Σ, the variance-covariance matrix of w_t.

Notice that these requirements are minimal, and they are satisfied by the HP DGPs as well as the DH DGPs. The following theorem shows that, for large n, a scree plot of the ordered S_Ti allows one to separate the relevant regressors from the irrelevant ones when true and irrelevant regressors are uncorrelated.

Theorem 2 (Ordering based on S_Ti works for uncorrelated regressors in M and T). Let Assumption 1 hold and assume that the covariance Σ_{ℓj} between x_ℓ and x_j equals 0 for all j ∈ T and ℓ ∈ M. Define (S_T(1), S_T(2), …, S_T(p)) as the set of S_Ti values in decreasing order, with S_T(1) ≥ S_T(2) ≥ ⋯ ≥ S_T(p). Then as n → ∞ one has

    (S_T(1), S_T(2), …, S_T(p)) →_p (c_(1), c_(2), …, c_(r_0), 0, …, 0),

where (c_(1), c_(2), …, c_(r_0)) is the ordered set of the c_i > 0 values in decreasing order, with

    c_i ≔ (1/(4 ⋅ 2^{p−1})) Σ_{γ_{−i} ∈ Γ_{−i}} log[ (σ² + Σ_{h,j ∈ T\a_{γ(i,1)}} β_{0,h} Σ_{hj.b_{γ(i,1)}} β_{0,j}) / (σ² + Σ_{h,j ∈ T\a_{γ(i,0)}} β_{0,h} Σ_{hj.b_{γ(i,0)}} β_{0,j}) ],    (6)

see Appendix A for the definition of the relevant quantities in eq. (6). Hence the ordered S_Ti values separate the block of true regressors in T in the first r_0 positions from the irrelevant ones in M in the last p − r_0 positions of (S_T(1), S_T(2), …, S_T(p)).

Proof. See Lemma 6 in Appendix A. □

¹² A heuristic reason for this is that the method involves an exploration of models, with equal probability to select γ_i = 0 or γ_i = 1. Note that in analyses with continuous variables, it is usually advisable to use low-discrepancy sequences due to their space-filling properties, see Sobol’ (1967), which give faster convergence with increasing N. However, since γ can only take binary values for each element, low-discrepancy sequences offer no obvious advantage over (pseudo-)random numbers.
¹³ For instance, it is consistent for S_Ti for increasing N thanks to the Law of Large Numbers for i.i.d. sequences applied to its numerator and denominator.

Given the above, one may hence expect this result to apply to other more general situations. However, this turns out not to be necessarily the case. The results in Appendix A also show that one can build examples with correlated regressors across T and M, where the ordering of regressors based on S_T fails to separate the sets of true and irrelevant regressors in large samples.¹⁴

In the end, the question of whether the ordering based on S_T can help in selecting regressors is an empirical matter. Section 5 explores the frequency with which this happens in practice, based on simulated data from various DGPs.
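As an illustration of the theorem's claim (our own toy simulation, not one of the paper's experiments), one can simulate a DGP with uncorrelated regressors, estimate S_Ti with the Jansen design of Section 2.4 using a BIC-type q(γ) as output, and check that the top-ranked regressors are the relevant ones:

```python
import numpy as np

rng = np.random.default_rng(42)

# DGP with p = 8 mutually uncorrelated regressors; T = {1, 2, 3} (indices 0-2)
n, p = 500, 8
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -0.8, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta0 + rng.standard_normal(n)

def q_bic(g):
    """q(gamma) = log(RSS/n) + k * log(n)/n for the gamma-submodel (de-meaned)."""
    g = np.asarray(g, dtype=bool)
    yc = y - y.mean()
    if g.any():
        Xg = X[:, g] - X[:, g].mean(axis=0)
        coef, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
        resid = yc - Xg @ coef
    else:
        resid = yc
    return np.log(resid @ resid / n) + g.sum() * np.log(n) / n

# Jansen-type estimate of S_Ti with q(gamma) as output (Section 2.4 design)
N = 128
q_base = np.empty(N)
q_flip = np.empty((N, p))
for ell in range(N):
    g = rng.integers(0, 2, size=p)
    q_base[ell] = q_bic(g)
    for i in range(p):
        g_i = g.copy()
        g_i[i] = 1 - g_i[i]
        q_flip[ell, i] = q_bic(g_i)
ST = ((q_flip - q_base[:, None]) ** 2).sum(axis=0) / (4 * N) / q_base.var(ddof=1)

ranking = np.argsort(-ST)   # regressors by decreasing estimated S_Ti
top3 = set(ranking[:3])     # should coincide with the relevant set T
```

Excluding a relevant regressor moves log σ̂²_γ by an amount bounded away from zero, while flipping an irrelevant one only moves the penalty term, so the scree-plot gap appears already at moderate n.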

4 Construction of the Algorithm

In order to construct an algorithm to perform model selection based on S_T, an initial investigation was performed to understand to what extent the ranking of regressors provided by S_T is complementary to that given by t-ratios. These experiments are based on the MC design by HP; details of these experiments are reported in Section 5. However, since the results provide the basis of the GSA algorithm, they are also summarized here.

In short, 11 different datasets were simulated following the approach and underlying DGPs defined by HP. For each DGP, the regressors were ordered using both S_T and the t-ratios. Then a metric was used which measures the success of each ranking in assigning the highest ranks to the regressors in the DGP. This gives a measure of the utility of each ranking in correctly identifying the DGP. It was found that, first, S_T gave overall better rankings than t-ratios, but for some DGPs t-ratios were still more effective.

¹⁴ Worked-out examples of this are available from the authors upon request.


This result pointed to the fact that the two measures are in some way complementary, and motivated the GSA algorithm proposed here, which combines the search paths obtained using the t-ratios and the S_T measures, and then selects the best model between the two resulting specifications. The combined procedure is expected to be able to reap the advantages of both orderings. For simplicity, this algorithm is called the GSA algorithm, despite the fact that it exploits both the orderings based on GSA and on the t-ratios. The rest of this section contains a description of the GSA algorithm in its basic form and with two modifications.

4.1 The Basic Algorithm

The procedure involves ranking the regressors by t-ratios or S_T, then adopting the ‘bottom up’ approach following the ‘Pantula principle’, where candidate models are built by successively adding regressors in order of importance. The steps are as follows.
1. Order all regressors by method m (i.e. either the t-ratios or S_T).
2. Define the initial candidate model as the empty set of regressors (i.e. one with only the constant term).
3. Add to the candidate model the highest-ranking regressor (that is not already in the candidate model).
4. Perform an F test, comparing the validity of the candidate model to that of the GUM.
5. If the p-value of the F test in step 4 is below a given significance level α, go to step 3 (continue adding regressors); otherwise, go to step 6.
6. Since the F-test has not rejected the model in step 4, this is the selected model γ(m).
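The steps above can be sketched as follows (our illustration, not the authors' code; the F-statistic convention assumes a constant is always included, and scipy is assumed available for the F distribution):

```python
import numpy as np
from scipy import stats

def rss(y, Xsub):
    """Residual sum of squares of OLS on Xsub plus a constant."""
    n = len(y)
    Z = np.column_stack([np.ones(n), Xsub]) if Xsub.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ coef
    return r @ r

def pantula_select(y, X, order, alpha=0.05):
    """Steps 1-6: starting from the empty model, add regressors in the
    given importance order; F-test each candidate against the GUM and
    return the first candidate the test does not reject."""
    n, p = X.shape
    rss_gum = rss(y, X)
    df_u = n - p - 1                       # GUM residual degrees of freedom
    for h in range(p + 1):
        included = list(order[:h])
        if h == p:
            return included                # reached the GUM itself
        rss_h = rss(y, X[:, included])
        F = ((rss_h - rss_gum) / (p - h)) / (rss_gum / df_u)
        pval = stats.f.sf(F, p - h, df_u)  # F-test of the p-h exclusions
        if pval >= alpha:
            return included                # step 6: candidate not rejected

rng = np.random.default_rng(7)
n, p = 300, 6
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 4] + rng.standard_normal(n)
order = [0, 4, 1, 2, 3, 5]   # a hypothetical importance ordering (S_T or |t|)
selected = pantula_select(y, X, order, alpha=0.05)
```

Because every candidate is tested against the GUM, the procedure stops as soon as the excluded regressors are jointly insignificant, which is the single testing sequence of the Pantula principle.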

In the following, the notation γ(t) (respectively γ(S)) is used to denote the model selected by this algorithm when t-ratios (respectively S_T) are used for the ordering. Note that candidate variables are added starting from an empty specification; this is hence a ‘bottom up’ approach induced by the ‘Pantula principle’.

One can observe that this ‘bottom up’ approach is in line with the GETS philosophy of model selection; in fact it corresponds to the nesting of models known as the ‘Pantula-principle’ in cointegration rank determination, see Johansen (1996). Every model in the sequence is compared with the GUM, and hence the sequence of tests can be interpreted as an implementation of the GETS philosophy. Moreover, it can be proved that, for large sample sizes, the sequence selects the smallest true model in the sequence with probability equal to 1 − α, where α is the size of each test. Letting α tend to 0 as the sample size gets large, one can prove that this delivers a true model with probability tending to 1.¹⁵

As a last step, the final choice of regressors γ̂ is chosen between γ(t) and γ(S) as the one with the fewest regressors (since both models have been declared valid by the F-test). If the number of regressors is the same, but the regressors are different, the choice is made using the BIC.

The GSA algorithm depends on some key constants; the significance level of the F-test, α, is a truly ‘sensitive’ parameter, in that varying it strongly affects the algorithm's performance. Of the remaining constants in the algorithm, N, the number of points in the GSA sampling, can be increased to improve accuracy; in practice it was found that N = 128 provided good results, and further increases made little difference.

In the following two subsections, two extensions to the basic algorithm are outlined, with the reasoning explained.

4.2 Adaptive-α

Varying α essentially dictates how 'strong' the effect of a regressor must be for it to be included in the final model: a high α value will tend to include more variables, whereas a low value will cut out variables more harshly. The difficulty is that some DGPs require a low α for accurate identification of the true regressors in T, whereas others require higher values. Hence, there may exist no single value of α that is suitable for the identification of all DGPs.

A proposed modification to deal with this problem is to use an 'adaptive α', α_ϕ, which is allowed to vary depending on the data. This is based on the observation that the F-test returns a high p-value p_H (typically of the order 0.2–0.6) when the proposed model is a superset of the DGP, but when one or more of the regressors in T are missing from the proposed model, the p-value will generally be low, p_L (of the order 10⁻³, say). The values of p_H and p_L will vary depending on the DGP and data set, making it difficult to find a single value of α which will yield good results across all DGPs. However, for a given DGP and data set, the p_H and p_L values are easy to identify.

Therefore, it is proposed to use a value of α_ϕ such that, for each data set,

α_ϕ = p_L + ϕ(p_H − p_L)    (7)

where p_H is taken as the p-value resulting from considering a candidate model with all regressors that have S_Ti > 0.01 against the GUM, and p_L is taken as the p-value from considering the empty set of regressors against the GUM. The reasoning behind the definition of p_H is that it represents a candidate model which will contain the DGP regressors with a high degree of confidence. Here ϕ is a tuning parameter that essentially determines how far between p_L and p_H the cutoff should be. Figure 1 illustrates this on a data set sampled from DGP 6B. Note that α_ϕ is used in the F-test both for the t-ranked regressors and for those ordered by S_T.

¹⁵ See for instance Paruolo (2001). Recall that any model whose set of regressors contains the DGP is 'true'.
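As a concrete illustration, Eq. (7) is straightforward to compute. The sketch below is hypothetical helper code, not the paper's implementation; the function name and the example values are assumptions, with magnitudes of the order mentioned in the text.

```python
def adaptive_alpha(p_L, p_H, phi):
    """Eq. (7): alpha_phi = p_L + phi * (p_H - p_L).

    p_L : p-value of the empty model tested against the GUM (typically small).
    p_H : p-value, against the GUM, of the model keeping all regressors
          with S_Ti > 0.01 (typically moderate).
    phi : tuning parameter in [0, 1] locating the cutoff between p_L and p_H.
    """
    return p_L + phi * (p_H - p_L)

# Illustrative values only (assumed, not from the paper)
alpha_phi = adaptive_alpha(p_L=1e-3, p_H=0.4, phi=0.2)  # approx. 0.0808
```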

4.3 Skipping Regressors

To correct situations where the ordering of the regressors is imperfect, a different extension of the algorithm is to test discarding 'weak' regressors from the selected model. Here, a weak regressor is defined as one with a value of S_T lower than a certain threshold, which is set at 0.2. When step 6 is reached, if weak regressors exist in the selected model, they are removed one at a time, each time performing an F-test. If the F-test is satisfied, the regressor is discarded; otherwise it is retained. This approach is used instead of an exhaustive search over the combinations of remaining regressors, because occasionally there may still be too many regressors left to make this feasible.

4.4 Full GSA Algorithm

Adding the extensions discussed in the previous two sections results in the final

full algorithm, which can be described as follows.

Figure 1: p-values from the F-test comparing candidate models to the GUM in a sample from DGP 6B, for the six highest-ranked regressors (x-axis: regressor index in order of importance using S_T; y-axis: p-value). Here ϕ = 0.2 and the adaptive cutoff α_ϕ is marked as a dotted line.

1. Obtain two orderings of regressors, one by the t-ratios and the other by S_T, using the BIC as the output (penalized measure of model fit) q.

2. Obtain p_H as the p-value resulting from considering a candidate model with all regressors that have S_Ti > 0.01 against the GUM, and p_L as the p-value from considering the empty set of regressors against the GUM. Calculate α_ϕ using (7), which is used in all subsequent tests.

3. Define the initial candidate model as the empty set of regressors (i.e. one with only the constant term).

4. Add to the candidate model the regressor with the highest S_T (that is not already in the candidate model).

5. Perform an F-test, comparing the validity of the candidate model to that of the GUM.

6. If the p-value of the F-test in step 5 is below α_ϕ, go to step 4 (continue adding regressors); otherwise, go to step 7.

7. Since the F-test has not rejected the model in step 5, this is the selected model γ(m).

8. Identify any remaining 'weak' regressors as those with S_T < 0.2. Try removing these one at a time: if removing a regressor satisfies the F-test, it is discarded; otherwise it is retained. Repeat this procedure for all weak regressors.

9. Repeat steps 3–8, except use the ordering based on t-ratios, rather than on S_T.

10. Compare the final model selected by S_T with the final model selected by the t-ratios, choosing the one with the fewest regressors (since both have satisfied the F-test). If both final specifications have the same number of regressors, choose the specification with the lowest BIC.
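The selection loop in steps 3–8 can be sketched as follows. This is a schematic Python rendering, not the authors' code: `pvalue_vs_gum` stands in for the F-test comparing a candidate subset of regressors to the GUM, and the toy example at the bottom is purely illustrative.

```python
def gsa_select(ordering, S_T, pvalue_vs_gum, alpha_phi, weak_thresh=0.2):
    """Forward selection against the GUM (steps 3-8 of the full algorithm).

    ordering      : regressor indices, from most to least important.
    S_T           : dict mapping index -> total sensitivity index (step 8).
    pvalue_vs_gum : callable(subset) -> p-value of the F-test comparing
                    the model with `subset` regressors to the GUM.
    alpha_phi     : adaptive significance level from Eq. (7).
    """
    model = []                                  # step 3: start empty
    for idx in ordering:                        # step 4: add by importance
        model.append(idx)
        if pvalue_vs_gum(model) >= alpha_phi:   # steps 5-6: F-test vs GUM
            break                               # step 7: model not rejected
    # Step 8: try dropping 'weak' regressors (S_T below the threshold)
    for idx in [i for i in model if S_T.get(i, 0.0) < weak_thresh]:
        trial = [i for i in model if i != idx]
        if pvalue_vs_gum(trial) >= alpha_phi:
            model = trial
    return model

# Toy example: the "DGP" needs regressors {0, 2}; any superset passes the test.
pv = lambda subset: 0.4 if {0, 2} <= set(subset) else 1e-4
selected = gsa_select([0, 1, 2], {0: 0.5, 1: 0.05, 2: 0.4}, pv, alpha_phi=0.05)
# -> [0, 2]: regressor 1 is added on the way up but dropped in step 8 as weak
```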

In the following section the performance of the algorithm is examined on some benchmark test cases, with and without the extensions introduced in the previous sections. In the following, S_T-full indicates the full procedure as described above; S_T-no-skip refers to the same procedure without the skipping extension (i.e. without step 8); finally, S_T-simple is the version without step 8 and also without step 2 (the adaptive α). For S_T-simple a fixed value of α is used.

5 The Experiments of Hoover and Perez

This section tests the GSA algorithm on a suite of DGP simulation experiments developed by HP. These experiments consider a possibly dynamic regression equation with n = 139 observations and exogenous variables, fixed across experiments, taken from real-world, stationary, macroeconomic time series, in an attempt to represent typical macroeconomic data. Several papers have used HP's experiments to test the performance of other methods (Castle, Doornik, and Hendry 2011; Hendry and Krolzig 1999). HP's experiments are of varying degrees of difficulty for model-search algorithms. Details on the design of the HP DGPs are reported in Appendix B.

The features of HP's experiments prompt a number of considerations. First, because the sample size is limited and fixed, consistency of model-selection algorithms cannot be the sole performance criterion. Secondly, some of the DGPs in HP's experiments are characterized by a low signal-to-noise ratio for some coefficients; the corresponding regressors are labeled 'weak'. This situation makes it very difficult for statistical procedures to discover whether the corresponding regressors should be included or not. This raises the question of how to measure selection performance in this context.

This paper observes that, in the case of weak regressors, one can measure the performance of model-selection algorithms also with respect to a simplified DGP, which contains the subset of regressors with a sufficiently high signal-to-noise ratio; this is called the 'Effective DGP' (EDGP). The definition of the EDGP is made operational using the 'parametricness index' introduced in Liu and Yang (2011); this concept is described in detail in Appendix C. For full transparency, results are also presented relative to the original DGPs in cases where the EDGP is different.

5.1 Orderings Based on t and GSA

As mentioned in Section 4, the DGPs of HP were used as the basis for an initial investigation into the comparative rankings of S_T and the t-ratios. Here, these numerical experiments are described in more detail.

For each of the 11 DGPs under investigation, N_R = 500 replications of Z were generated; on each sample, regressors were ranked by the t-ratios and by S_T, using N = 128 in (5). Both for the t-ratio ranking and the S_T ranking, the ordering is from the best-fitting regressor to the worst-fitting one.

In order to measure how successful the two methods were in ranking regressors, the following measure δ of minimum relative covering size is defined. Indicate by φ_0 = {i_1, …, i_{r_0}} the set containing the positions i_j of the true regressors in the list i = 1, …, p; i.e. for each j one has γ_{0,i_j} = 1. Recall also that r_0 is the number of elements in φ_0. Next, for a generic replication j, let φ_ℓ^(m) = {i_1^(m), …, i_ℓ^(m)} be the set containing the first ℓ positions i_j^(m) induced by the ordering of method m, m equal to t or S_T. Let b_j^(m) = min{ℓ : φ_0 ⊆ φ_ℓ^(m)} be the minimum number of elements ℓ for which φ_ℓ^(m) contains all the true regressors. Observe that b_j^(m) is well defined, because at least for ℓ = p one always has φ_0 ⊆ φ_p^(m) = {1, …, p}. δ is defined to equal b_j^(m) divided by its minimum; this corresponds to the (relative) minimum number of elements in the ordering m that covers the set of true regressors.

Observe that, by construction, one has r_0 ≤ b_j^(m) ≤ p, and that one wishes b_j^(m) to be as small as possible; ideally one would like to have b_j^(m) = r_0. Hence for δ_j^(m) defined as b_j^(m)/r_0 one has 1 ≤ δ_j^(m) ≤ p/r_0. Finally, δ^(m) is defined as the average of δ_j^(m) over j = 1, …, N_R, i.e. δ^(m) = (1/N_R) Σ_{j=1}^{N_R} δ_j^(m).

For example, if the regressors, ranked in descending order of importance by method m in replication j, were x_3, x_12, x_21, x_11, x_4, x_31, …, and the true DGP were x_3, x_11, the measure δ_j would be 2; in fact the smallest ranked set containing x_3 and x_11 has four elements, b_j^(m) = 4, while r_0 = 2.
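Under the definitions above, b_j^(m) and δ_j^(m) are easy to compute. The following sketch (hypothetical function name, not the authors' code) reproduces the worked example:

```python
def covering_stats(ranking, true_set):
    """Return (b, delta): b is the minimum number of top-ranked regressors
    needed to cover all true regressors, and delta = b / r0.

    ranking  : regressor labels ordered from most to least important.
    true_set : labels of the regressors actually in the DGP (all must
               appear somewhere in `ranking` for b to be well defined).
    """
    r0 = len(true_set)
    b = next(ell for ell in range(r0, len(ranking) + 1)
             if set(true_set) <= set(ranking[:ell]))
    return b, b / r0

# Example from the text: ranking x3, x12, x21, x11, x4, x31 and DGP {x3, x11}
b, delta = covering_stats([3, 12, 21, 11, 4, 31], {3, 11})  # b = 4, delta = 2.0
```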

The results over the N_R = 500 replications are summarized in Table 1. Overall, S_T appears to perform better than the t-ordering. For some DGPs (such as DGPs 2 and 5) both approaches perform well (δ = 1, indicating a correct ranking for all 500 data sets). There are other DGPs where the performance is significantly different. In particular, the t-ratio ordering is comparatively deficient on DGPs 3 and 6A, whereas S_T performs worse on DGP 8. This suggests that there are some DGPs in which S_T may offer an advantage over the t-ratios in terms of ranking regressors in order of importance. This implies that a hybrid approach, using both measures, may yield a more efficient method of regressor selection.

5.2 Measures of Performance

The performance of algorithms was measured by HP via the number of times the algorithm selected the DGP as the final specification. Here use is made of measures of performance similar to the ones in HP, as well as of additional ones proposed in Castle, Doornik, and Hendry (2011).

Recall that γ_T = γ_0 is the true set of included regressors, and let γ̂_j indicate the one produced by a generic algorithm in replication j = 1, …, N_R. Define r_j to be the number of correct inclusions of components in the vector γ̂_j, i.e. the number of regression indices i for which γ̂_{j,i} = γ_{0,i} = 1, so that r_j = Σ_{i=1}^{p} 1(γ̂_{j,i} = γ_{0,i} = 1). Recall that r_0 indicates the number of true regressors.

Table 1: Values of δ (average over the 500 data replications per DGP), using the t-test and S_T. Mean refers to the average across DGPs. Comparatively poor rankings are in boldface. [The numerical entries of the table are not recoverable from this extraction.]

The following exhaustive and mutually exclusive categories of results can be defined:
– C_1: exact matches;
– C_2: the selected model is correctly specified, but it is larger than necessary, i.e. it contains all relevant regressors as well as irrelevant ones;
– C_3: the selected model is incorrectly specified (misspecified), i.e. it lacks relevant regressors.

C_1 matches correspond to the case when γ̂_j coincides with γ_T = γ_0; the corresponding frequency C_1 is computed as C_1 = (1/N_R) Σ_{j=1}^{N_R} 1(γ̂_j = γ_T). The frequency of C_2 cases is given by C_2 = (1/N_R) Σ_{j=1}^{N_R} 1(γ̂_j ≠ γ_0, r_j = r_0). Finally, C_3 cases are the residual category, and the corresponding frequency is C_3 = 1 − C_1 − C_2.¹⁶
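These frequencies can be computed directly from the selection vectors. The sketch below uses a hypothetical 0/1 list encoding of γ̂_j and γ_0; the function name and example data are assumptions for illustration only.

```python
def classification_freqs(gamma_hats, gamma_0):
    """Frequencies of C1 (exact match), C2 (correct but larger) and
    C3 (misspecified) over a set of replications.

    gamma_hats : list of 0/1 selection vectors, one per replication.
    gamma_0    : 0/1 vector of the true DGP.
    """
    r0, n = sum(gamma_0), len(gamma_hats)
    C1 = sum(g == gamma_0 for g in gamma_hats) / n
    # C2: not an exact match, but all r0 true regressors are included
    C2 = sum(g != gamma_0 and
             sum(a * b for a, b in zip(g, gamma_0)) == r0
             for g in gamma_hats) / n
    return C1, C2, 1.0 - C1 - C2

gamma_0 = [1, 0, 1, 0]
hats = [[1, 0, 1, 0],   # exact match                -> C1
        [1, 1, 1, 0],   # superset of the DGP        -> C2
        [1, 0, 0, 0]]   # missing a true regressor   -> C3
C1, C2, C3 = classification_freqs(hats, gamma_0)  # each equals 1/3
```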

The performance can be further evaluated through measures taken from Castle, Doornik, and Hendry (2011), known as potency and gauge. First, the retention rate p̃_i of the i-th variable is defined as p̃_i = (1/N_R) Σ_{j=1}^{N_R} 1(γ̂_{j,i} = 1). Then, potency and gauge are defined as follows:

potency = (1/r_0) Σ_{i: β_{0,i} ≠ 0} p̃_i,    gauge = (1/(p − r_0)) Σ_{i: β_{0,i} = 0} p̃_i.

Potency therefore measures the average frequency of inclusion of regressors belonging to the DGP, while gauge measures the average frequency of inclusion of regressors not belonging to the DGP. An ideal performance is thus represented by a potency value of 1 and a gauge of 0.
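A direct implementation of the retention rate, potency and gauge, in the same hypothetical 0/1 encoding used above (an illustrative sketch, not the authors' code):

```python
def potency_gauge(gamma_hats, gamma_0):
    """Potency and gauge (Castle, Doornik and Hendry 2011).

    The retention rate of regressor i is the fraction of replications
    in which it is selected; potency averages it over the true
    regressors, gauge over the irrelevant ones.
    """
    n, p, r0 = len(gamma_hats), len(gamma_0), sum(gamma_0)
    retention = [sum(g[i] for g in gamma_hats) / n for i in range(p)]
    potency = sum(r for r, t in zip(retention, gamma_0) if t == 1) / r0
    gauge = sum(r for r, t in zip(retention, gamma_0) if t == 0) / (p - r0)
    return potency, gauge

gamma_0 = [1, 0, 1, 0]
hats = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 0, 0]]
pot, gau = potency_gauge(hats, gamma_0)  # potency = 5/6, gauge = 1/6
```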

In calculating these measures, HP chose to discard MC replications for which a preliminary application of the battery of misspecification tests defined in (15) in Appendix A reported a rejection.¹⁷ This choice is called in the following the 'pre-search elimination' of MC replications.

¹⁶ C_1 corresponds to Category 1 in HP; C_2 corresponds to Category 2 + Category 3 − Category 1 in HP; finally, C_3 corresponds to Category 4 in HP.

¹⁷ The empirical percentage of samples that were discarded in this way was found to be proportional to the significance level α. This fact, however, did not significantly influence the number of C_1 catches. Hence the HP procedure was allowed to discard replications as in the original version. For the GSA algorithm no pre-search elimination was performed.

5.3 Benchmark

The performance of HP's algorithm is taken as a benchmark. The original MATLAB code for generating data from HP's experiments was downloaded from HP's home page.¹⁸ The original scripts were then updated to run on the current version of MATLAB. A replication of the results in Tables 4, 6 and 7 in HP is reported in the first panel of Table 2, using nominal significance levels of α = 1, 5, 10% and N_R = 10³ replications. The results do not appear to be significantly different from the ones reported in HP.

When checking the original code, a coding error was found in the original HP script for the generation of the AR series u_t in Eq. (13): it produced simulations of a moving-average process of order 1, MA(1), with MA parameter 0.75, instead of an AR(1) with AR parameter 0.75.¹⁹ The script was hence modified to produce u_t as an AR(1) with AR parameter 0.75; this is called the 'modified script' in the following.
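To see the difference the coding slip makes, compare the two recursions below. This is a hypothetical Python sketch, not the HP MATLAB script: with the same shock sequence, the two series coincide for the first two observations and then diverge, since the AR(1) feeds back the entire past while the MA(1) carries only one lagged shock.

```python
def simulate_ar1(eps, rho=0.75):
    """Intended DGP: u_t = rho * u_{t-1} + eps_t."""
    u, out = 0.0, []
    for e in eps:
        u = rho * u + e
        out.append(u)
    return out

def simulate_ma1(eps, theta=0.75):
    """What the original script effectively produced:
    u_t = eps_t + theta * eps_{t-1}."""
    out, prev = [], 0.0
    for e in eps:
        out.append(e + theta * prev)
        prev = e
    return out

eps = [1.0, -0.5, 2.0, 0.3, -1.2]   # fixed shocks for reproducibility
ar = simulate_ar1(eps)               # [1.0, 0.25, 2.1875, ...]
ma = simulate_ma1(eps)               # [1.0, 0.25, 1.625, ...]
```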

Re-running the DGP simulation experiments using this modified script, the results in the second panel of Table 2 were obtained; for this set of simulations N_R = 10⁴ replications were used. Comparing the first and second panels of the table for the same nominal significance level α, one observes a significant increase in C_1 catches for DGPs 2 and 7. One reason for this may be that, when the modified script is employed, the regression model is well specified, i.e. it contains the DGP as a special case.²⁰ Table 2 documents how HP's algorithm depends on α, the significance level chosen in the test R in (15).

5.4 Alternative Algorithms

This section presents results using the performance measures introduced in Section 5.2. The results compare the three variations of the S_T algorithm with the modified HP code. To compare with a similar but more recent GETS implementation, the Autometrics package 'gets' (see Pretis, Reade, and Sucarrat 2018) is also added as an additional algorithm in the comparison.

¹⁸ http://www.csus.edu/indiv/p/perezs/Data/data.htm.

¹⁹ This means that the results reported in HP for DGPs 2, 3, 7, 8 and 9 refer to a misspecified model. The MA process can be inverted to obtain an AR(∞) representation; substituting from the y_t equation as before, one finds that the DGP contains an infinite number of lags of the dependent variable and of the x*_it variables, with exponentially decreasing coefficients. The entertained regression model with four lags of the dependent variable and two lags of the x*_it variables can be considered an approximation to the DGP.

²⁰ This finding is similar to the one reported in Hendry and Krolzig (1999), Section 6; they re-ran HP's experiments using PcGets, and document similar increases in C_1 catches for DGPs 2 and 7 for their modified algorithms. Hence, it is possible that this result is driven by the correction of the script for the generation of the AR series.

The performance is measured with respect to the true DGP, or with respect to the Effective DGP (EDGP) that one can hope to recover given the signal-to-noise ratio. Because the HP, GSA and Autometrics algorithms depend on tunable constants, results are given for various values of these constants.

The procedure employed to define the EDGP is discussed in Appendix C; it implies that the only EDGPs differing from the true DGP are DGP 6 and DGP 9. DGP 6 contains regressors 3 and 11, but regressor 3 is weak and hence EDGP 6 contains only regressor 11. DGP 9 contains regressors 3, 11, 21, 29 and 37, but regressors 3 and 21 are weak and are dropped from the corresponding EDGP 9. More details are given in Appendix C.

Both the HP algorithm and Autometrics depend on the significance level α, whereas the GSA algorithm depends on the threshold ϕ (which controls α_ϕ) for S_T-no-skip and S_T-full, and on α for S_T-simple. Because the values of α and ϕ can seriously affect the performance of the algorithms, a fair comparison of the performance of the algorithms may be difficult, especially since the true parameter values will not be known in practice. To deal with this problem, the performance of the algorithms was measured at a number of parameter values within a plausible range.

Table 2: Percentages of Category 1 matches C_1 for different values of α. Original script: data generated by the original script, N_R = 10³ replications. The frequencies are not statistically different from the ones reported in HP (Tables 4, 6, 7). Modified script: data from the modified script for the generation of the AR series, N_R = 10⁴ replications. [The numerical entries of the table are not recoverable from this extraction.]

This allowed two ways of comparing the algorithms: first, the 'optimized' performance, corresponding to the value of α or ϕ that produced the highest C_1 score, averaged over the 11 DGPs. This can be viewed as the 'potential performance'. In practice, the optimization was performed with a grid search on α and ϕ with N_R = 10³ replications, averaging across DGPs.

Secondly, a qualitative comparison was performed between the algorithms by comparing their average performance over the range of parameter values. This latter comparison gives some insight into the more realistic situation, where the optimum parameter values are not known.

5.5 Results for Optimal Values of Tuning Coefficients

Table 3 shows the classification results in terms of C_1 matches, as well as the potency and gauge measures, for all algorithms at their optimal parameter values, using N_R = 10⁴. Note that the value of α = 4 × 10⁻⁴ for Autometrics represents the lowest value of α that it was possible to assign without errors occurring due to singular matrices, likely due to issues with numerical precision. Results for the S_T algorithm are shown with and without the extensions discussed in Section 4. Recovery of the true specification is here understood in the EDGP sense.

The C_1 column measures the percentage frequency with which the algorithms identified the EDGP. One notable fact is that the performance of the HP algorithm is vastly improved (compared to the results in HP's original paper) simply by setting α to a better value, in this case α = 4 × 10⁻⁴; compare with Table 2.

The comparison shows that with the full S_T algorithm, the correct classification rate (C_1) is 98.9%, compared with 94.3% for HP and 88.6% for Autometrics. It is presumed that if it were possible to reduce the value of α for Autometrics further, the mean C_1 value would increase still further. However, it was not possible to test this conjecture. Removing the 'skipping' extension, the average performance falls to 96.7%, and further to 92.6% without the adaptive-α feature.

Examining the DGPs individually, the GSA algorithm performs well on all DGPs, although there are slightly lower C_1 values for DGPs 3 and 6A (around 96%). For HP, these differences are more marked, with C_1 = 62% for DGP 3 and C_1 = 85.3% for DGP 7. Autometrics also has lower success rates: 74% for DGP 3 and 87% for DGP 7.