# The Hannan-Quinn Proposition for Linear Regression

Abstract

We consider the variable selection problem in linear regression. Suppose that we have a set of random variables $X_1,\ldots,X_m,Y,\epsilon$ such that $Y=\sum_{k\in\pi}\alpha_kX_k+\epsilon$ with $\pi\subseteq\{1,\ldots,m\}$ and $\alpha_k\in\mathbb{R}$ unknown, and $\epsilon$ is independent of any linear combination of $X_1,\ldots,X_m$. Given $n$ examples $\{(x_{i,1},\ldots,x_{i,m},y_i)\}_{i=1}^n$ independently emitted from $(X_1,\ldots,X_m,Y)$, we wish to estimate the true $\pi$ using information criteria of the form $H+(k/2)d_n$, where $H$ is the likelihood with respect to $\pi$ multiplied by $-1$, and $\{d_n\}$ is a positive real sequence. If $d_n$ is too small, we cannot obtain consistency because of overestimation. For autoregression, Hannan and Quinn proved that, in their setting of $H$ and $k$, the rate $d_n=2\log\log n$ is the minimum satisfying strong consistency. This paper proves the statement affirmatively for linear regression as well, which has a completely different setting.

International Journal of Statistics and Probability; Vol. 1, No. 2; 2012

ISSN 1927-7032 E-ISSN 1927-7040

Published by Canadian Center of Science and Education


Joe Suzuki1

1Department of Mathematics, Osaka University, Osaka, Japan

Correspondence: Joe Suzuki, Department of Mathematics, Osaka University, 1-1 Machikaneyama-cho, Toyonaka, Osaka 560-0043, Japan. Tel: 81-6-6850-5315. E-mail: suzuki@math.sci.osaka-u.ac.jp

Received: September 5, 2012 Accepted: September 26, 2012 Online Published: October 17, 2012

doi:10.5539/ijsp.v1n2p179 URL: http://dx.doi.org/10.5539/ijsp.v1n2p179


Keywords: Hannan-Quinn, linear regression, the law of iterated logarithms, strong consistency, information criteria, model selection

1. Introduction

We consider model selection based on information criteria such as AIC (Akaike's information criterion) and BIC (Bayesian information criterion).

For example, given independently and identically distributed (i.i.d.) random variables $\{\epsilon_i\}_{i=-\infty}^{\infty}$ and nonnegative reals $\{\alpha_i\}_{i=1}^{k}$ ($k\ge 0$), we can define random variables $\{X_i\}_{i=-\infty}^{\infty}$ such that

$$X_i=\sum_{j=1}^{k}\alpha_jX_{i-j}+\epsilon_i$$

(autoregression). Suppose that we wish to know the minimum true $k$ as well as the values of $\{\alpha_i\}_{i=1}^{k}$ from a number of examples $\{x_i\}_{i=1}^{n}$ ($n\ge 1$) emitted from $\{X_i\}_{i=1}^{n}$. Then, one way to estimate the order $k$ is to prepare a positive real sequence $\{d_n\}_{n=1}^{\infty}$ and to choose the $k$ minimizing the information criterion

$$n\log S_k+\frac{k}{2}d_n$$

with respect to $d_n$, where $S_k$ is the estimated variance based on the Yule-Walker algorithm.

The sequence $\{d_n\}_{n=1}^{\infty}$ balances the fit of the examples to the model against the simplicity of the model. If $d_n$ is too small or too large, the estimated model will be overestimated or underestimated, respectively. The information criteria are called AIC and BIC when $d_n=2$ and $d_n=\log n$, respectively.
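As an illustration (not from the paper), the criterion $n\log S_k+(k/2)d_n$ can be computed for nested models by least squares; the data-generating step and all variable names below are my own:

```python
import numpy as np

def criterion(y, X, k, d_n):
    """Information criterion n*log(S_k) + (k/2)*d_n for the model that
    regresses y on the first k columns of X (k = 0 means no regressors)."""
    n = len(y)
    if k == 0:
        resid = y
    else:
        Xk = X[:, :k]
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
    S_k = np.sum(resid ** 2)           # empirical square error
    return n * np.log(S_k) + 0.5 * k * d_n

rng = np.random.default_rng(0)
n, m, true_k = 500, 5, 2
X = rng.normal(size=(n, m))
y = 1.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)   # true order is 2

for name, d_n in [("AIC", 2.0), ("BIC", np.log(n)),
                  ("Hannan-Quinn", 2.0 * np.log(np.log(n)))]:
    best = min(range(m + 1), key=lambda k: criterion(y, X, k, d_n))
    print(name, "selects k =", best)
```

With a strong signal and this sample size, all three penalties typically recover $k=2$; the differences among them only matter for the (finitely or infinitely often) overestimation behavior studied below.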

In this paper, we consider consistency of model selection: the estimation is weakly or strongly consistent if the true model is obtained as $n\to\infty$ in probability or almost surely, respectively. For autoregression, Hannan and Quinn (1979) proved strong consistency for $d_n=(2+\epsilon)\log\log n$ with arbitrary $\epsilon>0$, based on the law of the iterated logarithm. They also showed the converse: $d_n=(2-\epsilon)\log\log n$ does not satisfy the property.

For linear regression, we can draw a similar scenario: given random variables $\{X_i\}_{i=1}^{m}$ and $\epsilon$ that is independent of any linear combination of $\{X_i\}_{i=1}^{m}$, we can define

$$Y=\sum_{j=1}^{k}\alpha_jX_j+\epsilon,$$


where $0\le k\le m$ and $\{\alpha_j\}_{j=1}^{k}$ are reals. We wish to know the minimum true $k$ as well as the values of $\{\alpha_j\}_{j=1}^{k}$ from $n$ examples $\{[y_i,x_{i,1},\ldots,x_{i,m}]\}_{i=1}^{n}$ independently emitted from $(Y,X_1,\ldots,X_m)$. Similarly, we can define information criteria

$$n\log S_k+\frac{k}{2}d_n$$

such as $d_n=2$ (AIC) and $d_n=\log n$ (BIC), where $S_k$ is the empirical square error over the $n$ examples.

However, it has not been known whether $d_n=(2+\epsilon)\log\log n$ with arbitrary $\epsilon>0$ achieves strong consistency for linear regression; in fact, no proof had been given for the proposition. Wu and Zen (1999) showed that $d_n=c\log\log n$ with some $c>0$ realizes strong consistency, but they obtained neither the exact value of $c$ nor any converse result.

On the other hand, for the problem of classification rules, which has many applications such as Markov order estimation, data mining, and pattern recognition, Suzuki (2006) proved the Hannan-Quinn proposition.

The main purpose of this paper is to prove the Hannan-Quinn proposition for linear regression. We do not assume the noise to be normal in the final result.

Section 2 gives preliminaries for linear regression, such as idempotent matrices and eigenspaces. In Section 3, we derive the asymptotic error probability of model selection in linear regression when information criteria are applied, which is an important step toward proving the main result. In Section 4, we give a proof of the Hannan-Quinn proposition for linear regression. Section 5 summarizes the results of this paper and gives a future problem.

Throughout the paper, we denote by $X(\Omega)$ the image $\{X(\omega)\,|\,\omega\in\Omega\}$ of a random variable $X:\Omega\to\mathbb{R}$, where $\Omega$ is the underlying sample space.

2. Linear Regression

Let $X_1,\ldots,X_m$ be random variables, $\epsilon\sim N(0,\sigma^2)$ a normal random variable with expectation zero and variance $\sigma^2>0$, and

$$Y:=\sum_{j=1}^{p}\alpha_jX_j+\epsilon,$$

where $\alpha:=[\alpha_1,\ldots,\alpha_p]^T\in\mathbb{R}^p$ ($0\le p\le m$). We assume that $\epsilon$ is independent of any linear combination of $X_1,\ldots,X_m$.

Suppose that we do not know the values of the order $p$ and the coefficients $\alpha$, and that we are given $n$ independently emitted examples

$$z^n:=\{[y_i,x_{i,1},\ldots,x_{i,m}]\}_{i=1}^{n}$$

with

$$y_i\in Y(\Omega),\quad [x_{i,1},\ldots,x_{i,m}]\in X_1(\Omega)\times\cdots\times X_m(\Omega),$$

where $\{[x_{1,j},\ldots,x_{n,j}]\}_{j=1}^{m}$ are assumed to be linearly independent. If we define

$$X_p:=\begin{bmatrix}x_{1,1}&\cdots&x_{1,p}\\ \vdots&\ddots&\vdots\\ x_{n,1}&\cdots&x_{n,p}\end{bmatrix},\quad y:=\begin{bmatrix}y_1\\ \vdots\\ y_n\end{bmatrix},\quad \epsilon:=\begin{bmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{bmatrix},$$

we can write $y=X_p\alpha+\epsilon$. Suppose that we estimate $p$ by $q$ ($0\le q\le m$). If we wish to minimize the quantity $\sum_{i=1}^{n}\bigl(y_i-\sum_{j=1}^{q}\hat\alpha_{j,q}x_{i,j}\bigr)^2$ given the $n$ examples, then $\hat\alpha_q=[\hat\alpha_{1,q},\ldots,\hat\alpha_{q,q}]^T:=(X_q^TX_q)^{-1}X_q^Ty$ is the exact solution (minimum square error estimation), where

$$X_q:=\begin{bmatrix}x_{1,1}&\cdots&x_{1,q}\\ \vdots&\ddots&\vdots\\ x_{n,1}&\cdots&x_{n,q}\end{bmatrix}.$$
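As a quick sanity check (my own sketch, not part of the paper), the normal-equation solution $(X_q^TX_q)^{-1}X_q^Ty$ agrees with a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 200, 3
Xq = rng.normal(size=(n, q))                 # design matrix with q regressors
y = Xq @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Normal-equation solution, as in the text
alpha_hat = np.linalg.solve(Xq.T @ Xq, Xq.T @ y)

# Reference: numpy's generic least-squares solver
alpha_ref, *_ = np.linalg.lstsq(Xq, y, rcond=None)
print(np.allclose(alpha_hat, alpha_ref))     # the two solutions agree
```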

Suppose $p\le q$. If we define $P_q:=X_q(X_q^TX_q)^{-1}X_q^T$, we have

$$P_q^2=P_q$$


and

$$(I-P_q)^2=I-P_q,$$

so that the square error is expressed as

$$S_q:=\sum_{i=1}^{n}\Bigl(y_i-\sum_{j=1}^{q}\hat\alpha_{j,q}x_{i,j}\Bigr)^2=\|y-X_q\hat\alpha_q\|^2=\|(I-P_q)y\|^2=y^T(I-P_q)y.$$

Similarly, for $q=p$, with $P_p:=X_p(X_p^TX_p)^{-1}X_p^T$ and $\hat\alpha_p=[\hat\alpha_{1,p},\ldots,\hat\alpha_{p,p}]^T:=(X_p^TX_p)^{-1}X_p^Ty$, the square error is expressed as

$$S_p=y^T(I-P_p)y.$$

Thus, the difference between the square errors is

$$S_p-S_q=y^T(I-P_p)y-y^T(I-P_q)y=y^T(P_q-P_p)y.$$

On the other hand, we have

$$P_q^T=(X_q^T)^T\{(X_q^TX_q)^{-1}\}^TX_q^T=X_q\{(X_q^TX_q)^T\}^{-1}X_q^T=P_q$$

and $P_p^T=P_p$. From $P_qX_p=X_p$ and $P_pX_p=X_p$, we obtain

$$P_qP_p=P_qX_p(X_p^TX_p)^{-1}X_p^T=X_p(X_p^TX_p)^{-1}X_p^T=P_p$$

and

$$P_pP_q=P_p^TP_q^T=(P_qP_p)^T=P_p^T=P_p.$$

Thus, not just for $P_p$ and $I-P_p$ but also for $P_q-P_p$, the property

$$(P_q-P_p)^2=P_q^2-P_qP_p-P_pP_q+P_p^2=P_q-P_p$$

holds. Square matrices satisfying this property are called idempotent matrices (Chatterjee & Hadi, 1988).

In general, for an idempotent matrix $P\in\mathbb{R}^{n\times n}$, the inner product $(Px,(I-P)x)=0$ for any $x=Px+(I-P)x\in\mathbb{R}^n$, so that the eigenspaces are

1) $V_1:=\{Px\,|\,x\in\mathbb{R}^n\}$ with $\dim(V_1)=\mathrm{rank}(P)$, and

2) $V_0:=\{(I-P)x\,|\,x\in\mathbb{R}^n\}$ with $\dim(V_0)=n-\mathrm{rank}(P)$.

Since the eigenvalues are one and zero, the multiplicity of eigenvalue one is the same as the trace. Notice that for $(X_q^TX_q)=[y_{jk}]$ and $(X_q^TX_q)^{-1}=[z_{jk}]$,

$$\mathrm{trace}(P_q)=\mathrm{trace}\bigl(X_q(X_q^TX_q)^{-1}X_q^T\bigr)=\sum_{i=1}^{n}\sum_{j=1}^{q}\sum_{k=1}^{q}x_{ij}z_{jk}x_{ik}=\sum_{j=1}^{q}\sum_{k=1}^{q}y_{kj}z_{jk}=\sum_{k=1}^{q}1=q,$$

and $\mathrm{trace}(P_p)=p$, so that we have the following table.

| $P$ | $\mathrm{trace}(P)$ | $\dim(V_1)$ | $\dim(V_0)$ | $\mathrm{rank}(P)$ |
|---|---|---|---|---|
| $P_p$ | $p$ | $p$ | $n-p$ | $p$ |
| $I-P_p$ | $n-p$ | $n-p$ | $p$ | $n-p$ |
| $P_q-P_p$ | $q-p$ | $q-p$ | $n-q+p$ | $q-p$ |
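These identities are easy to verify numerically; a small sketch (random design, my own variable names) checking idempotency and the traces in the table above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 50, 2, 4
Xq = rng.normal(size=(n, q))
Xp = Xq[:, :p]                          # first p columns, as in the text

P = lambda X: X @ np.linalg.inv(X.T @ X) @ X.T
Pp, Pq = P(Xp), P(Xq)

for M, tr in [(Pp, p), (np.eye(n) - Pp, n - p), (Pq - Pp, q - p)]:
    assert np.allclose(M @ M, M)        # idempotent
    assert np.isclose(np.trace(M), tr)  # trace equals the rank in the table
print("all idempotency and trace checks passed")
```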

3. Error Probability in Model Selection

3.1 Overestimation

Proposition 1. If $p<q$, then $\dfrac{S_p-S_q}{S_p/n}$ asymptotically obeys the $\chi^2$ distribution with $q-p$ degrees of freedom.

Proof. Given $X_p$, we choose an orthogonal matrix $U=[u_1,\ldots,u_n]$ of $I-P_p$ so that $U_1=\langle u_1,\ldots,u_{n-p}\rangle$ and $U_0=\langle u_{n-p+1},\ldots,u_n\rangle$ are the eigenspaces of eigenvalues one and zero, respectively. Notice that

$$(I-P_p)y=y-(X_p\alpha+P_p\epsilon)=\epsilon-P_p\epsilon=(I-P_p)\epsilon.\qquad(1)$$


For $j=1,\ldots,n-p$, multiplying both sides by $u_j^T$ from the left, we get a normal random variable

$$z_j:=u_j^Ty=u_j^T\epsilon.$$

Since the expectation and variance of each $\epsilon_i$ are zero and $\sigma^2$ (independently), and

$$u_j^Tu_k=\begin{cases}1,&j=k,\\0,&j\neq k,\end{cases}$$

we have $E[z_j]=0$ and

$$E[z_jz_k]=E[u_j^T\epsilon\cdot u_k^T\epsilon]=\sigma^2u_j^Tu_k=\begin{cases}\sigma^2,&j=k,\\0,&j\neq k.\end{cases}$$

Thus, from the strong law of large numbers, with probability one as $n\to\infty$,

$$\frac{1}{n}S_p=\frac{1}{n}\sum_{j=1}^{n-p}z_j^2\to\sigma^2.\qquad(2)$$

On the other hand, given $X_q$, we choose an orthogonal matrix $V=[v_1,\ldots,v_n]$ of $P_q-P_p$ so that $V_1=\langle v_1,\ldots,v_{q-p}\rangle$ and $V_0=\langle v_{q-p+1},\ldots,v_n\rangle$ are the eigenspaces of eigenvalues one and zero, respectively. Notice that from (1), we have

$$(P_q-P_p)y=P_q(I-P_p)y=P_q(I-P_p)\epsilon=(P_q-P_p)\epsilon.$$

For $j=1,\ldots,q-p$, multiplying both sides by $v_j^T$ from the left, we get a normal random variable

$$r_j:=v_j^Ty=v_j^T\epsilon.$$

Since the expectation and variance of each $\epsilon_i$ are zero and $\sigma^2$ (independently), and

$$v_j^Tv_k=\begin{cases}1,&j=k,\\0,&j\neq k,\end{cases}$$

we have $E[r_j]=0$ and

$$E[r_jr_k]=E[v_j^T\epsilon\cdot v_k^T\epsilon]=\sigma^2v_j^Tv_k=\begin{cases}\sigma^2,&j=k,\\0,&j\neq k.\end{cases}$$

Hence, as $n\to\infty$,

$$\frac{S_p-S_q}{\sigma^2}=\sum_{j=1}^{q-p}\frac{r_j^2}{\sigma^2}\sim\chi^2_{q-p},\qquad(3)$$

where we applied the fact that the sum of squares of $q-p$ independent random variables with the standard normal distribution obeys the $\chi^2$ distribution with $q-p$ degrees of freedom. Equations (2) and (3) imply Proposition 1.
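Proposition 1 can be checked by simulation (my own sketch, not from the paper): with normal noise, the statistic $(S_p-S_q)/(S_p/n)$ should have approximately the mean $q-p$ and variance $2(q-p)$ of a $\chi^2_{q-p}$ distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q, trials = 400, 1, 3, 2000

def rss(X, y):
    # residual sum of squares of the least-squares regression of y on X
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ beta
    return r @ r

stats = []
for _ in range(trials):
    Xq = rng.normal(size=(n, q))
    Xp = Xq[:, :p]                        # true model uses the first p columns
    y = Xp @ np.ones(p) + rng.normal(size=n)
    Sp, Sq = rss(Xp, y), rss(Xq, y)
    stats.append((Sp - Sq) / (Sp / n))
stats = np.array(stats)
print(stats.mean(), stats.var())          # close to q - p = 2 and 2(q - p) = 4
```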

In the sequel, for $\pi\subseteq\{1,\ldots,m\}$, we write the square error of $\{X_j\}_{j\in\pi}$ and $Y$ as $S(\pi)$, and put

$$L(z^n,\pi):=n\log S(\pi)+\frac{k(\pi)}{2}d_n$$

with $k(\pi)=|\pi|$, given $z^n=\{[y_i,x_{i,1},\ldots,x_{i,m}]\}_{i=1}^{n}$. Let $\pi^*\subseteq\{1,\ldots,m\}$ be the true $\pi$.

Theorem 1. For $\pi\supset\pi^*$, the probability of $L(z^n,\pi)<L(z^n,\pi^*)$ is

$$\int_{n\{1-\exp[-\frac{k(\pi)-k(\pi^*)}{2n}d_n]\}}^{\infty}f_{k(\pi)-k(\pi^*)}(x)\,dx,$$

where $f_l$ is the probability density function of the $\chi^2$ distribution with $l$ degrees of freedom.

Proof. Notice that

$$2\{L(z^n,\pi)-L(z^n,\pi^*)\}=2n\log\frac{S(\pi)}{S(\pi^*)}+\{k(\pi)-k(\pi^*)\}d_n=2n\log\Bigl(1-\frac{S(\pi^*)-S(\pi)}{S(\pi^*)}\Bigr)+\{k(\pi)-k(\pi^*)\}d_n,$$


so that

$$L(z^n,\pi)<L(z^n,\pi^*)\iff\frac{S(\pi^*)-S(\pi)}{S(\pi^*)/n}>n\Bigl\{1-\exp\Bigl[-\frac{k(\pi)-k(\pi^*)}{2n}d_n\Bigr]\Bigr\}.\qquad(4)$$

From Proposition 1, we obtain Theorem 1.

3.2 Underestimation

Hereafter, we do not assume that $\epsilon$ is normally distributed.

Theorem 2. For $\pi\not\supseteq\pi^*$, $L(z^n,\pi)>L(z^n,\pi^*)$ with probability one as $n\to\infty$.

Proof. Suppose $q<p$. Given $X_p$, we choose an orthogonal matrix $W:=[w_1,\ldots,w_n]$ of $P_p-P_q$ so that $W_1=\langle w_1,\ldots,w_{p-q}\rangle$ and $W_0=\langle w_{p-q+1},\ldots,w_n\rangle$ are the eigenspaces of eigenvalues one and zero, respectively. If we define $t_i:=\sum_{k=q+1}^{p}x_{i,k}\alpha_k+\epsilon_i$ and

$$s_j:=\sum_{i=1}^{n}w_{ij}y_i=\sum_{i=1}^{n}w_{ij}\Bigl(\sum_{k=1}^{p}x_{ik}\alpha_k+\epsilon_i\Bigr)=\sum_{i=1}^{n}w_{ij}\Bigl(\sum_{k=q+1}^{p}x_{ik}\alpha_k+\epsilon_i\Bigr)=(w_j,t)$$

for $w_j=[w_{1j},\ldots,w_{nj}]^T$ and $t=[t_1,\ldots,t_n]^T$, then $\|w_j\|^2=\sum_{i=1}^{n}w_{ij}^2=1$, and $\|t\|^2/n=\sum_{i=1}^{n}t_i^2/n$ converges to a positive constant with probability one; otherwise, $\epsilon=-\sum_{k=q+1}^{p}X_k\alpha_k$ with probability one (a contradiction). If $w_j$ and $t$ were orthogonal, then $t$ would have to be of the form

$$\Bigl(\sum_{k=1}^{q}x_{1k}\beta_k,\ldots,\sum_{k=1}^{q}x_{nk}\beta_k\Bigr)$$

for some $[\beta_1,\ldots,\beta_q]\in\mathbb{R}^q$, which means that $\epsilon=\sum_{k=1}^{q}X_k\beta_k-\sum_{k=q+1}^{p}X_k\alpha_k$ with probability one (a contradiction).

Hence,

$$\frac{1}{n}(S_q-S_p)=\frac{1}{n}\sum_{j=q+1}^{p}s_j^2=\sum_{j=q+1}^{p}\|w_j\|^2\,\frac{\|t\|^2}{n}\cos^2(w_j,t)\qquad(5)$$

converges to a positive value, which implies the theorem when $\pi\subset\pi^*$. Suppose that $\pi\not\supseteq\pi^*$ and $\pi\not\subset\pi^*$. In the same way, since (5) converges to a positive value even for $q=|\pi\cap\pi^*|$, we have

$$\lim_{n\to\infty}\frac{1}{n}\{S(\pi\cap\pi^*)-S(\pi^*)\}>0.\qquad(6)$$

Furthermore, if we replace $\pi^*$ by $\pi\cap\pi^*$, then, from a discussion similar to that of Theorem 1, we have

$$\lim_{n\to\infty}\frac{1}{n}\{S(\pi)-S(\pi\cap\pi^*)\}=0.\qquad(7)$$

Statements (6) and (7) imply the theorem.
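The positive limit in the underestimation case can be seen numerically (my own sketch, not from the paper): with independent standard normal regressors and a dropped coefficient $\alpha_2=-0.8$, the limit of $(S_q-S_p)/n$ works out to $\alpha_2^2=0.64$.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 2, 1                            # fit only q < p of the true regressors
alpha = np.array([1.0, -0.8])

def rss(X, y):
    # residual sum of squares of the least-squares regression of y on X
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ beta
    return r @ r

for n in [100, 1000, 10000]:
    Xp = rng.normal(size=(n, p))
    y = Xp @ alpha + rng.normal(size=n)
    gap = (rss(Xp[:, :q], y) - rss(Xp, y)) / n
    print(n, gap)                      # approaches 0.64 as n grows
```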

4. Proof of the Hannan-Quinn Proposition

In this section, we do not assume that $\epsilon\sim N(0,\sigma^2)$, but only that the $\epsilon_i$ are independently and identically distributed with expectation zero and variance $\sigma^2$.

Proposition 2. If $q>p$, then with probability one,

$$1\le\limsup_{n\to\infty}\Bigl\{\frac{S_p-S_q}{S_p/n}\Bigm/\log\log n\Bigr\}\le q-p.\qquad(8)$$

Proof. The notation is the same as in Proposition 1, and let $p+1\le j\le q$.

Let $\lambda_1,\ldots,\lambda_{q-p}$ and $[\beta_{1,1},\ldots,\beta_{1,q}]^T,\ldots,[\beta_{q-p,1},\ldots,\beta_{q-p,q}]^T$ be the nonzero eigenvalues and corresponding unit eigenvectors of

$$X_{p,q}:=\Bigl(\frac{1}{n}X_q^TX_q\Bigr)^{-1}-\begin{bmatrix}\bigl(\frac{1}{n}X_p^TX_p\bigr)^{-1}&0\\0&0\end{bmatrix}.$$


Then, from

$$n(S_p-S_q)=n\epsilon^T(P_q-P_p)\epsilon=(\epsilon^TX_q)X_{p,q}(\epsilon^TX_q)^T=\sum_{j=1}^{q-p}\lambda_j\Bigl(\sum_{k=1}^{q}\beta_{j,k}\sum_{i=1}^{n}x_{i,k}\epsilon_i\Bigr)^2$$

and $E[r_j^2]=E[\epsilon_i^2]=\sigma^2$, we require $\lambda_j=\bigl[\frac{1}{n}\sum_{i=1}^{n}\bigl(\sum_{k=1}^{q}\beta_{jk}x_{ik}\bigr)^2\bigr]^{-1}$; thus,

$$v_{ij}=\frac{\sum_k\beta_{jk}x_{ik}}{\sqrt{\sum_{i=1}^{n}\bigl(\sum_l\beta_{jl}x_{il}\bigr)^2}}.$$

Since $X_{p,q}$ converges to a constant matrix as $n\to\infty$ and $\lim_{n\to\infty}\beta_{jk}$ exists with probability one, so does

$$\gamma_{jk}:=\lim_{n\to\infty}\frac{\beta_{jk}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\sum_l\beta_{jl}x_{il}\bigr)^2}}.$$

Let $Z_i:=\sum_k\gamma_{jk}x_{ik}\epsilon_i/\sigma$. Then, $E[\sum_{i=1}^{n}Z_i]=0$, $E[\sum_{i=1}^{n}Z_i^2]=n$, and $\{Z_i\}_{i=1}^{n}$ are independent. From the law of the iterated logarithm (Stout, 1974), we have

$$\limsup_{n\to\infty}\frac{\sum_i\sqrt{n}\,v_{ij}\epsilon_i/\sigma}{\sqrt{n\log\log n}}=\limsup_{n\to\infty}\frac{\sum_i\bigl(\sum_k\gamma_{jk}x_{ik}\bigr)\epsilon_i/\sigma}{\sqrt{n\log\log n}}=\limsup_{n\to\infty}\frac{\sum_{i=1}^{n}Z_i}{\sqrt{n\log\log n}}=1,$$

namely,

$$\limsup_{n\to\infty}\frac{r_j}{\sigma\sqrt{\log\log n}}=1$$

with probability one. Since

$$\limsup_{n\to\infty}\frac{r_j^2}{\sigma^2\log\log n}\le\limsup_{n\to\infty}\sum_{j=p+1}^{q}\frac{r_j^2}{\sigma^2\log\log n}\le\sum_{j=p+1}^{q}\limsup_{n\to\infty}\frac{r_j^2}{\sigma^2\log\log n},$$

and from (2) and (3), we have (8) with probability one.

The following inequality is useful in deriving the final result:

$$\frac{1}{2}\{k(\pi)-k(\pi^*)\}d_n-\frac{1}{4n}\bigl[\{k(\pi)-k(\pi^*)\}d_n\bigr]^2\le n\Bigl[1-\exp\Bigl\{-\frac{k(\pi)-k(\pi^*)}{2n}d_n\Bigr\}\Bigr]\le\frac{1}{2}\{k(\pi)-k(\pi^*)\}d_n.\qquad(9)$$
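Both bounds of the inequality above follow from $t-t^2/2\le 1-e^{-t}\le t$ with $t=\{k(\pi)-k(\pi^*)\}d_n/(2n)$; a numerical spot check (my own sketch):

```python
import math

def middle(n, dk, d_n):
    # n * [1 - exp(-dk * d_n / (2n))], the middle term of the inequality
    return n * (1.0 - math.exp(-dk * d_n / (2.0 * n)))

for n in [10, 100, 10000]:
    d_n = 2.0 * math.log(math.log(n))      # the critical rate 2 log log n
    for dk in [1, 3]:                      # dk plays the role of k(pi)-k(pi*)
        upper = 0.5 * dk * d_n
        lower = upper - (dk * d_n) ** 2 / (4.0 * n)
        assert lower <= middle(n, dk, d_n) <= upper
print("both bounds hold at all sampled points")
```

Note how tight the upper bound is for large $n$: the middle term behaves like $\frac{1}{2}\{k(\pi)-k(\pi^*)\}d_n$, which is exactly how the threshold in (4) is compared with $(q-p)\log\log n$ below.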

Theorem 3. For $d_n:=(2+\epsilon)\log\log n$ ($\epsilon>0$), $L(z^n,\pi)>L(z^n,\pi^*)$ with probability one.

Proof. From Theorem 2, the error for $\pi\not\supseteq\pi^*$ is almost surely zero as long as $d_n/n\to0$ ($n\to\infty$), so that we only need to consider the case $\pi^*\subset\pi$. However, $d_n=(2+\epsilon)\log\log n$ with $\epsilon>0$ implies that the left-hand side of (9) is strictly larger than $(q-p)\log\log n$ with $p=k(\pi^*)$ and $q=k(\pi)$ for large $n$, which, from Proposition 2 and (4), implies Theorem 3.

Theorem 4. For $d_n:=(2-\epsilon)\log\log n$ ($\epsilon>0$), $L(z^n,\pi)\le L(z^n,\pi^*)$ with nonzero probability as $n\to\infty$ for $\pi$ such that $k(\pi)=k(\pi^*)+1$.

Proof. $d_n=(2-\epsilon)\log\log n$ with $\epsilon>0$ implies that the right-hand side of (9) is strictly smaller than $(q-p)\log\log n$ with $p=k(\pi^*)$ and $q=k(\pi)=p+1$ for large $n$, which, from Proposition 2 and (4), implies Theorem 4.

For example, suppose $p=0$ and $q=1$. Then, $X_{0,1}=\bigl(\frac{1}{n}\sum_{i=1}^{n}x_{i1}^2\bigr)^{-1}$, $v_{i1}=\dfrac{x_{i1}}{\sqrt{\sum_{h=1}^{n}x_{h1}^2}}$, and $\gamma_{11}=E[X_1^2]^{-1/2}$. In this case,

$$S_0-S_1=\frac{\bigl(\sum_{i=1}^{n}x_{i1}\epsilon_i\bigr)^2}{\sum_{h=1}^{n}x_{h1}^2}\quad\text{and}\quad\frac{S_0}{n}=\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2.$$

Thus, with probability one,

$$\frac{S_0-S_1}{S_0/n}=\frac{n\bigl(\sum_{i=1}^{n}x_{i1}\epsilon_i\bigr)^2}{\sum_{h=1}^{n}x_{h1}^2\sum_{l=1}^{n}\epsilon_l^2}$$

exceeds $(1+\epsilon)\log\log n$ at most finitely many times, but exceeds $(1-\epsilon)\log\log n$ infinitely many times with nonzero probability, so that the model selection procedure gives wrong results at most finitely many times under $d_n=(2+\epsilon)\log\log n$.


5. Conclusion

We proved that the Hannan-Quinn proposition is true for linear regression as well as for autoregression (Hannan & Quinn, 1979) and for classification (Suzuki, 2006): the minimum rate of $d_n$ satisfying strong consistency is $(2+\epsilon)\log\log n$ for arbitrary $\epsilon>0$.

Future problems include finding strong consistency conditions that cover all the cases, including linear regression, autoregression, and classification. Making clear why the same $d_n=2\log\log n$ is the crucial rate for all those problems would be the first step toward solving this problem.

References

Chatterjee, S., & Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. New York: John Wiley & Sons.

Hannan, E. J., & Quinn, B. G. (1979). The Determination of the Order of an Autoregression. Journal of the Royal Statistical Society, Series B, 41, 190-195.

Stout, W. (1974). Almost Sure Convergence. New York: Academic Press.

Suzuki, J. (2006). On Strong Consistency of Model Selection in Classification. IEEE Transactions on Information Theory, 52(11), 4767-4774. http://dx.doi.org/10.1109/TIT.2006.883611

Wu, Y., & Zen, M. (1999). A strongly consistent information criterion for linear model selection based on M-estimation. Probability Theory and Related Fields, 113, 599-625. http://dx.doi.org/10.1007/s004400050219
