# The Hannan-Quinn Proposition for Linear Regression

International Journal of Statistics and Probability; Vol. 1, No. 2; 2012
ISSN 1927-7032 E-ISSN 1927-7040
The Hannan-Quinn Proposition for Linear Regression
Joe Suzuki1
1Department of Mathematics, Osaka University, Osaka, Japan
Correspondence: Joe Suzuki, Department of Mathematics, Osaka University, 1-1 Machikaneyama-cho, Toyonaka,
Osaka 560-0043, Japan. Tel: 81-6-6850-5315. E-mail: suzuki@math.sci.osaka-u.ac.jp
Received: September 5, 2012 Accepted: September 26, 2012 Online Published: October 17, 2012
doi:10.5539/ijsp.v1n2p179 URL: http://dx.doi.org/10.5539/ijsp.v1n2p179
Abstract

We consider the variable selection problem in linear regression. Suppose that we have a set of random variables $X_1,\dots,X_m,Y,\epsilon$ ($m\ge 1$) such that $Y=\sum_{k\in\pi}\alpha_kX_k+\epsilon$ with $\pi\subseteq\{1,\dots,m\}$ and reals $\{\alpha_k\}_{k=1}^{m}$, assuming that $\epsilon$ is independent of any linear combination of $X_1,\dots,X_m$. Given $n$ examples $\{(x_{i,1},\dots,x_{i,m},y_i)\}_{i=1}^{n}$ independently emitted from $(X_1,\dots,X_m,Y)$, we wish to estimate the true $\pi$ based on information criteria of the form $H+(k/2)d_n$, where $H$ is the likelihood with respect to $\pi$ multiplied by $-1$, and $\{d_n\}$ is a positive real sequence. If $d_n$ is too small, we cannot obtain consistency because of overestimation. For autoregression, Hannan and Quinn proved that the rate $d_n=2\log\log n$ is the minimum satisfying strong consistency. This paper proves the statement in the affirmative for linear regression, which has a completely different setting. Thus far, there was no proof of the proposition, although $d_n=c\log\log n$ for some $c>0$ was shown to be sufficient.

Keywords: Hannan-Quinn, linear regression, the law of iterated logarithms, strong consistency, information criteria, model selection
1. Introduction
We consider model selection based on information criteria such as AIC (Akaike’s information criterion) and BIC
(Bayesian information criterion).
For example, given independently and identically distributed (i.i.d.) random variables $\{\epsilon_i\}_{i=-\infty}^{\infty}$ and nonnegative reals $\{\alpha_i\}_{i=1}^{k}$ ($k\ge 0$), we can define random variables $\{X_i\}_{i=-\infty}^{\infty}$ such that
$$X_i=\sum_{j=1}^{k}\alpha_jX_{i-j}+\epsilon_i$$
(autoregression). Suppose that we wish to know the minimum true $k$ as well as the values of $\{\alpha_i\}_{i=1}^{k}$ from a number of examples $\{x_i\}_{i=1}^{n}$ ($n\ge 1$) emitted from $\{X_i\}_{i=1}^{n}$. Then, one way to estimate the order $k$ is to prepare a positive real sequence $\{d_n\}_{n=1}^{\infty}$ and to choose the $k$ minimizing the information criterion
$$n\log S_k+\frac{k}{2}d_n$$
with respect to $k$, where $S_k$ is the estimated variance based on the Yule-Walker algorithm.
The sequence $\{d_n\}_{n=1}^{\infty}$ balances fitness of the examples to the model against simplicity of the model. If $d_n$ is too small or too large, the model will be overestimated or underestimated, respectively. The information criteria are called AIC and BIC if $d_n=2$ and $d_n=\log n$, respectively.

In this paper, we consider consistency of model selection: the estimation is weakly and strongly consistent if the true model is obtained as $n\to\infty$ in probability and almost surely, respectively. For autoregression, Hannan and Quinn (1979) proved strong consistency for $d_n=(2+\epsilon)\log\log n$ with arbitrary $\epsilon>0$, based on the law of iterated logarithms. They also showed the converse: $d_n=(2-\epsilon)\log\log n$ does not satisfy the property.
For linear regression, we can draw a similar scenario: given random variables $\{X_i\}_{i=1}^{m}$ and $\epsilon$ independent of any linear combination of $\{X_i\}_{i=1}^{m}$, we can define
$$Y=\sum_{j=1}^{k}\alpha_jX_j+\epsilon,$$
where $0\le k\le m$ and $\{\alpha_j\}_{j=1}^{k}$ are reals. We wish to know the minimum true $k$ as well as the values of $\{\alpha_j\}_{j=1}^{k}$ from $n$ examples $\{[y_i,x_{i1},\dots,x_{im}]\}_{i=1}^{n}$ independently emitted from $(Y,X_1,\dots,X_m)$. Similarly, we can define information criteria
$$n\log S_k+\frac{k}{2}d_n$$
such as $d_n=2$ (AIC) and $d_n=\log n$ (BIC), where $S_k$ is the empirical square error of the $n$ examples.

However, currently, we do not know whether $d_n=(2+\epsilon)\log\log n$ with arbitrary $\epsilon>0$ achieves strong consistency for linear regression. In fact, no proof was given for the proposition. Wu and Zen (1999) showed that $d_n=c\log\log n$ with some $c>0$ realizes strong consistency. However, they obtained neither the exact value of $c$ nor any converse result.
On the other hand, for the problem of classification rules, which has many applications such as Markov order estimation, data mining, and pattern recognition, Suzuki (2006) proved the Hannan-Quinn proposition.

The main purpose of this paper is to prove the Hannan-Quinn proposition for linear regression. We do not assume the noise to be normal in the final result.

Section 2 gives preliminaries for linear regression such as idempotent matrices and eigenspaces. In Section 3, we derive the asymptotic error probability of model selection in linear regression when information criteria are applied, which will be an important step in proving the main result. In Section 4, we give a proof of the Hannan-Quinn proposition for linear regression. Section 5 summarizes the results in this paper and gives a future problem.

Throughout the paper, we denote by $X(\Omega)$ the image $\{X(\omega)\mid\omega\in\Omega\}$ of a random variable $X:\Omega\to\mathbb{R}$, where $\Omega$ is the underlying sample space.
2. Linear Regression
Let $X_1,\dots,X_m$ be random variables, $\epsilon\sim N(0,\sigma^2)$ a normal random variable with expectation zero and variance $\sigma^2>0$, and
$$Y:=\sum_{j=1}^{p}\alpha_jX_j+\epsilon,$$
where $\alpha:=[\alpha_1,\dots,\alpha_p]^T\in\mathbb{R}^p$ ($0\le p\le m$). We assume that $\epsilon$ is independent of any linear combination of $X_1,\dots,X_m$.

Suppose we do not know the values of the order $p$ and the coefficients $\alpha$, and that we are given $n$ independently emitted examples
$$z^n:=\{[y_i,x_{i,1},\dots,x_{i,m}]\}_{i=1}^{n}$$
with
$$y_i\in Y(\Omega),\quad [x_{i,1},\dots,x_{i,m}]\in X_1(\Omega)\times\dots\times X_m(\Omega),$$
where $\{[x_{1,j},\dots,x_{n,j}]\}_{j=1}^{m}$ are assumed to be linearly independent. If we define
$$X_p:=\begin{bmatrix}x_{1,1}&\dots&x_{1,p}\\ \vdots&\ddots&\vdots\\ x_{n,1}&\dots&x_{n,p}\end{bmatrix},\quad y:=\begin{bmatrix}y_1\\ \vdots\\ y_n\end{bmatrix},\quad \epsilon:=\begin{bmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{bmatrix},$$
we can write $y=X_p\alpha+\epsilon$. Suppose that we estimate $p$ by $q$ ($0\le q\le m$). If we wish to minimize the quantity $\sum_{i=1}^{n}(y_i-\sum_{j=1}^{q}\hat\alpha_{j,q}x_{i,j})^2$ given the $n$ examples, then
$$\hat\alpha_q=[\hat\alpha_{1,q},\dots,\hat\alpha_{q,q}]^T:=(X_q^TX_q)^{-1}X_q^Ty$$
is the exact solution (minimum square error estimation), where
$$X_q:=\begin{bmatrix}x_{1,1}&\dots&x_{1,q}\\ \vdots&\ddots&\vdots\\ x_{n,1}&\dots&x_{n,q}\end{bmatrix}.$$
Suppose $p\le q$. If we define $P_q:=X_q(X_q^TX_q)^{-1}X_q^T$, we have
$$P_q^2=P_q$$
and
$$(I-P_q)^2=I-P_q,$$
so that the square error is expressed by
$$S_q:=\sum_{i=1}^{n}\Bigl(y_i-\sum_{j=1}^{q}\hat\alpha_{j,q}x_{i,j}\Bigr)^2=\|y-X_q\hat\alpha_q\|^2=\|(I-P_q)y\|^2=y^T(I-P_q)y.$$
Similarly, for $P_p:=X_p(X_p^TX_p)^{-1}X_p^T$ and $\hat\alpha_p=[\hat\alpha_{1,p},\dots,\hat\alpha_{p,p}]^T:=(X_p^TX_p)^{-1}X_p^Ty$, the square error with $q=p$ is expressed by
$$S_p=y^T(I-P_p)y.$$
Thus, the difference between the square errors is
$$S_p-S_q=y^T(I-P_p)y-y^T(I-P_q)y=y^T(P_q-P_p)y.$$
On the other hand, we have
$$P_q^T=(X_q^T)^T\{(X_q^TX_q)^{-1}\}^TX_q^T=X_q\{(X_q^TX_q)^T\}^{-1}X_q^T=P_q$$
and $P_p^T=P_p$. From $P_qX_p=X_p$ and $P_pX_p=X_p$, we obtain
$$P_qP_p=P_qX_p(X_p^TX_p)^{-1}X_p^T=X_p(X_p^TX_p)^{-1}X_p^T=P_p$$
and
$$P_pP_q=P_p^TP_q^T=(P_qP_p)^T=P_p^T=P_p.$$
Thus, not just for $P_p$ and $I-P_p$ but also for $P_q-P_p$, the property
$$(P_q-P_p)^2=P_q^2-P_qP_p-P_pP_q+P_p^2=P_q-P_p$$
holds. Square matrices satisfying this property are called idempotent matrices (Chatterjee & Hadi, 1988).
In general, for an idempotent matrix $P\in\mathbb{R}^{n\times n}$, the inner product $(Px,(I-P)x)=0$ for any $x=Px+(I-P)x\in\mathbb{R}^n$, so that the eigenspaces are

1) $V_1:=\{Px\mid x\in\mathbb{R}^n\}$ with $\dim(V_1)=\mathrm{rank}(P)$, and

2) $V_0:=\{(I-P)x\mid x\in\mathbb{R}^n\}$ with $\dim(V_0)=n-\mathrm{rank}(P)$.

Since the eigenvalues are one and zero, the multiplicity of eigenvalue one is the same as the trace. Notice that for $(X_q^TX_q)=[y_{jk}]$ and $(X_q^TX_q)^{-1}=[z_{jk}]$,
$$\mathrm{trace}(P_q)=\mathrm{trace}(X_q(X_q^TX_q)^{-1}X_q^T)=\sum_{i=1}^{n}\sum_{j=1}^{q}\sum_{k=1}^{q}x_{ij}z_{jk}x_{ik}=\sum_{j=1}^{q}\sum_{k=1}^{q}y_{kj}z_{jk}=\sum_{k=1}^{q}1=q,$$
and $\mathrm{trace}(P_p)=p$, so that we have the following table.

| $P$ | $\mathrm{trace}(P)$ | $\dim(V_1)$ | $\dim(V_0)$ | $\mathrm{rank}(P)$ |
|---|---|---|---|---|
| $P_p$ | $p$ | $p$ | $n-p$ | $p$ |
| $I-P_p$ | $n-p$ | $n-p$ | $p$ | $n-p$ |
| $P_q-P_p$ | $q-p$ | $q-p$ | $n-q+p$ | $q-p$ |
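These identities are easy to confirm numerically. A minimal sketch (an illustration, assuming a random Gaussian design with $p=2$ true columns inside $q=4$ candidate columns) verifies idempotence, $P_qP_p=P_p$, the traces in the table above, and $S_p-S_q=y^T(P_q-P_p)y$:

```python
# Numerically check the Section 2 identities for projection matrices.
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 2, 4
Xq = rng.standard_normal((n, q))      # columns 1..q; X_p is the first p of them
Xp = Xq[:, :p]
y = Xp @ np.array([1.0, -1.0]) + rng.standard_normal(n)

def proj(A):
    """Orthogonal projection onto the column space of A."""
    return A @ np.linalg.inv(A.T @ A) @ A.T

Pp, Pq = proj(Xp), proj(Xq)
I = np.eye(n)

assert np.allclose(Pq @ Pq, Pq)                     # idempotence
assert np.allclose(Pq @ Pp, Pp)                     # P_q P_p = P_p (P_q X_p = X_p)
assert np.allclose((Pq - Pp) @ (Pq - Pp), Pq - Pp)  # P_q - P_p is idempotent
assert np.isclose(np.trace(Pq), q) and np.isclose(np.trace(Pp), p)

Sp = float(y @ (I - Pp) @ y)                        # square error, order p
Sq = float(y @ (I - Pq) @ y)                        # square error, order q
assert np.isclose(Sp - Sq, float(y @ (Pq - Pp) @ y))
print("all Section 2 identities verified numerically")
```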
3. Error Probability in Model Selection
3.1 Overestimation
Proposition 1 If $p<q$, then $\dfrac{S_p-S_q}{S_p/n}$ asymptotically obeys the $\chi^2$ distribution with $q-p$ degrees of freedom.

Proof. Given $X_p$, we choose an orthogonal matrix $U=[u_1,\dots,u_n]$ of $I-P_p$ so that $U_1=\langle u_1,\dots,u_{n-p}\rangle$ and $U_0=\langle u_{n-p+1},\dots,u_n\rangle$ are the eigenspaces of eigenvalues one and zero, respectively. Notice that
$$(I-P_p)y=y-(X_p\alpha+P_p\epsilon)=\epsilon-P_p\epsilon=(I-P_p)\epsilon.\qquad(1)$$
For $j=1,\dots,n-p$, multiplying by $u_j^T$ on both sides from the left, we get a normal random variable
$$z_j:=u_j^Ty=u_j^T\epsilon.$$
Since the expectation and variance of each $\epsilon_i$ are zero and $\sigma^2$ (independent), and
$$u_j^Tu_k=\begin{cases}1,&j=k,\\0,&j\ne k,\end{cases}$$
we have $E[z_j]=0$ and
$$E[z_jz_k]=E[u_j^T\epsilon\cdot u_k^T\epsilon]=\sigma^2u_j^Tu_k=\begin{cases}\sigma^2,&j=k,\\0,&j\ne k.\end{cases}$$
Thus, from the strong law of large numbers, with probability one as $n\to\infty$,
$$\frac{1}{n}S_p=\frac{1}{n}\sum_{j=1}^{n-p}z_j^2\to\sigma^2.\qquad(2)$$
On the other hand, given $X_q$, we choose an orthogonal matrix $V=[v_1,\dots,v_n]$ of $P_q-P_p$ so that $V_1=\langle v_1,\dots,v_{q-p}\rangle$ and $V_0=\langle v_{q-p+1},\dots,v_n\rangle$ are the eigenspaces of eigenvalues one and zero, respectively. Notice that from (1), we have
$$(P_q-P_p)y=P_q(I-P_p)y=P_q(I-P_p)\epsilon=(P_q-P_p)\epsilon.$$
For $j=1,\dots,q-p$, multiplying by $v_j^T$ on both sides from the left, we get a normal random variable
$$r_j:=v_j^Ty=v_j^T\epsilon.$$
Since the expectation and variance of each $\epsilon_i$ are zero and $\sigma^2$ (independent), and
$$v_j^Tv_k=\begin{cases}1,&j=k,\\0,&j\ne k,\end{cases}$$
we have $E[r_j]=0$ and
$$E[r_jr_k]=E[v_j^T\epsilon\cdot v_k^T\epsilon]=\sigma^2v_j^Tv_k=\begin{cases}\sigma^2,&j=k,\\0,&j\ne k.\end{cases}$$
Hence, as $n\to\infty$,
$$\frac{S_p-S_q}{\sigma^2}=\sum_{j=1}^{q-p}\frac{r_j^2}{\sigma^2}\to\chi^2_{q-p},\qquad(3)$$
where the fact that the square sum of $q-p$ independent random variables with the standard normal distribution obeys the $\chi^2$ distribution with $q-p$ degrees of freedom has been applied. Equations (2) and (3) imply Proposition 1. $\Box$
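Proposition 1 can be illustrated by simulation. The sketch below (illustrative parameters: $n=200$, $p=1$, $q=3$, 2000 trials, standard normal design and noise) checks that the statistic $(S_p-S_q)/(S_p/n)$ has approximately the mean and variance of the $\chi^2$ distribution with $q-p=2$ degrees of freedom:

```python
# Monte-Carlo check of Proposition 1: the overestimation statistic is
# approximately chi-squared with q - p degrees of freedom.
import numpy as np

rng = np.random.default_rng(2)
n, p, q, trials = 200, 1, 3, 2000

def proj(A):
    return A @ np.linalg.inv(A.T @ A) @ A.T

stats = []
for _ in range(trials):
    Xq = rng.standard_normal((n, q))
    Xp = Xq[:, :p]
    y = 0.7 * Xq[:, 0] + rng.standard_normal(n)   # true model uses X_1 only
    Pp, Pq = proj(Xp), proj(Xq)
    Sp = float(y @ y - y @ Pp @ y)
    Sq = float(y @ y - y @ Pq @ y)
    stats.append((Sp - Sq) / (Sp / n))

stats = np.array(stats)
# chi^2 with 2 degrees of freedom has mean 2 and variance 4
print(stats.mean(), stats.var())
```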
In the sequel, for $\pi\subseteq\{1,\dots,m\}$, we write the square error of $\{X_j\}_{j\in\pi}$ and $Y$ by $S(\pi)$, and put
$$L_\pi(z^n):=n\log S(\pi)+\frac{k(\pi)}{2}d_n$$
with $k(\pi)=|\pi|$, given $z^n=\{[y_i,x_{i,1},\dots,x_{i,m}]\}_{i=1}^{n}$. Let $\pi^*\subseteq\{1,\dots,m\}$ be the true $\pi$.

Theorem 1 For $\pi\supsetneq\pi^*$, the probability of $L_\pi(z^n)<L_{\pi^*}(z^n)$ is asymptotically
$$\int_{n\{1-\exp[-\frac{k(\pi)-k(\pi^*)}{2n}d_n]\}}^{\infty}f_{k(\pi)-k(\pi^*)}(x)\,dx,$$
where $f_l$ is the probability density function of the $\chi^2$ distribution with $l$ degrees of freedom.
Proof. Notice that
$$2\{L_\pi(z^n)-L_{\pi^*}(z^n)\}=2n\log\frac{S(\pi)}{S(\pi^*)}+\{k(\pi)-k(\pi^*)\}d_n=2n\log\Bigl(1-\frac{S(\pi^*)-S(\pi)}{S(\pi^*)}\Bigr)+\{k(\pi)-k(\pi^*)\}d_n,$$
so that
$$L_\pi(z^n)<L_{\pi^*}(z^n)\iff\frac{S(\pi^*)-S(\pi)}{S(\pi^*)/n}>n\Bigl\{1-\exp\Bigl[-\frac{k(\pi)-k(\pi^*)}{2n}d_n\Bigr]\Bigr\}.\qquad(4)$$
From Proposition 1, we obtain Theorem 1. $\Box$
3.2 Underestimation

Hereafter, we do not assume $\epsilon$ to be normally distributed.

Theorem 2 For $\pi\not\supseteq\pi^*$, $L_\pi(z^n)>L_{\pi^*}(z^n)$ with probability one as $n\to\infty$.

Proof. Suppose $q<p$. Given $X_p$, we choose an orthogonal matrix $W:=[w_1,\dots,w_n]$ of $P_p-P_q$ so that $W_1=\langle w_1,\dots,w_{p-q}\rangle$ and $W_0=\langle w_{p-q+1},\dots,w_n\rangle$ are the eigenspaces of eigenvalues one and zero, respectively. If we define $t_i:=\sum_{k=q+1}^{p}x_{i,k}\alpha_k+\epsilon_i$ and
$$s_j:=\sum_{i=1}^{n}w_{ij}y_i=\sum_{i=1}^{n}w_{ij}\Bigl(\sum_{k=1}^{p}x_{ik}\alpha_k+\epsilon_i\Bigr)=\sum_{i=1}^{n}w_{ij}\Bigl(\sum_{k=q+1}^{p}x_{ik}\alpha_k+\epsilon_i\Bigr)=(w_j,t)$$
for $w_j=[w_{1j},\dots,w_{nj}]^T$ and $t=[t_1,\dots,t_n]^T$, then $\|w_j\|^2=\sum_{i=1}^{n}w_{ij}^2=1$, and $\|t\|^2/n=\sum_{i=1}^{n}t_i^2/n$ converges to a positive constant with probability one. Otherwise, $\epsilon=-\sum_{k=q+1}^{p}X_k\alpha_k$ with probability one (contradiction). If $w_j$ and $t$ were orthogonal, then $t$ would have to be of the form
$$\Bigl(\sum_{k=1}^{q}x_{1k}\beta_k,\dots,\sum_{k=1}^{q}x_{nk}\beta_k\Bigr)$$
for some $[\beta_1,\dots,\beta_q]\in\mathbb{R}^q$, which would mean that $\epsilon=\sum_{k=1}^{q}X_k\beta_k-\sum_{k=q+1}^{p}X_k\alpha_k$ (contradiction). Hence,
$$\frac{1}{n}(S_q-S_p)=\frac{1}{n}\sum_{j=q+1}^{p}s_j^2=\sum_{j=q+1}^{p}\|w_j\|^2\,\frac{\|t\|^2}{n}\cos^2(w_j,t)\qquad(5)$$
converges to a positive value, which implies the theorem when $\pi\subsetneq\pi^*$. Suppose $\pi\not\supseteq\pi^*$ in general. In the same way, since (5) converges to a positive value even for $q=|\pi\cap\pi^*|$, we have
$$\lim_{n\to\infty}\frac{1}{n}\{S(\pi\cap\pi^*)-S(\pi^*)\}>0.\qquad(6)$$
Furthermore, if we replace $\pi^*$ by $\pi\cap\pi^*$, from a similar discussion as in Theorem 1, we have
$$\lim_{n\to\infty}\frac{1}{n}\{S(\pi)-S(\pi\cap\pi^*)\}=0.\qquad(7)$$
The statements (6) and (7) imply the theorem. $\Box$
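The mechanism behind Theorem 2 can be sketched numerically (an illustration with independent standard normal regressors, true order $p=2$, and the underestimated model keeping only $X_1$; all values are assumptions of this sketch): the per-sample gap $(S_q-S_p)/n$ settles near a positive constant, so the $n\log S(\pi)$ term eventually dominates any $o(n)$ penalty.

```python
# The underestimation gap (S_q - S_p)/n converges to a positive constant.
import numpy as np

rng = np.random.default_rng(3)
gaps = []
for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, 2))
    y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.standard_normal(n)   # true p = 2

    def rss(A):
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return float(r @ r)

    Sq = rss(X[:, :1])     # underestimated model: X_1 only (q = 1)
    Sp = rss(X)            # true model: X_1 and X_2
    gaps.append((Sq - Sp) / n)
    print(n, gaps[-1])     # approaches alpha_2^2 * E[X_2^2] = 0.64 here
```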
4. Proof of the Hannan-Quinn Proposition

In this section, we do not assume that $\epsilon\sim N(0,\sigma^2)$ but only that the $\epsilon_i$ are independently and identically distributed with expectation zero and variance $\sigma^2$.

Proposition 2 If $q>p$, with probability one,
$$1\le\limsup_{n\to\infty}\Bigl\{\frac{S_p-S_q}{S_p/n}\Big/\log\log n\Bigr\}\le q-p.\qquad(8)$$

Proof. The notation is similar to that of Proposition 1, and let $p+1\le j\le q$.

Let $\lambda_1,\dots,\lambda_{q-p}$ and $[\beta_{1,1},\dots,\beta_{1,q}]^T,\dots,[\beta_{q-p,1},\dots,\beta_{q-p,q}]^T$ be the nonzero eigenvalues and corresponding unit eigenvectors of
$$X_{p,q}:=\Bigl(\frac{1}{n}X_q^TX_q\Bigr)^{-1}-\begin{bmatrix}\bigl(\frac{1}{n}X_p^TX_p\bigr)^{-1}&0\\0&0\end{bmatrix}.$$
Then, from
$$n(S_p-S_q)=n\,\epsilon^T(P_q-P_p)\epsilon=(\epsilon^TX_q)X_{p,q}(\epsilon^TX_q)^T=\sum_{j=1}^{q-p}\lambda_j\Bigl(\sum_{k=1}^{q}\beta_{j,k}\sum_{i=1}^{n}x_{i,k}\epsilon_i\Bigr)^2$$
and $E[r_j^2]=E[\epsilon_i^2]=\sigma^2$, we require $\lambda_j=\bigl[\sum_{i=1}^{n}\bigl(\sum_{k=1}^{q}\beta_{jk}x_{ik}\bigr)^2\bigr]^{-1}$; thus,
$$v_{ij}=\frac{\sum_k\beta_{jk}x_{ik}}{\sqrt{\sum_{i=1}^{n}\bigl(\sum_l\beta_{jl}x_{il}\bigr)^2}}.$$
Since $X_{p,q}$ converges to a constant matrix as $n\to\infty$ and $\lim_{n\to\infty}\beta_{jk}$ exists with probability one, so does
$$\gamma_{jk}:=\lim_{n\to\infty}\frac{\beta_{jk}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\sum_l\beta_{jl}x_{il}\bigr)^2}}.$$
Let $Z_i:=\sum_k\gamma_{jk}x_{ik}\epsilon_i$. Then, $E[\sum_{i=1}^{n}Z_i]=0$, $E[\sum_{i=1}^{n}Z_i^2]=n\sigma^2$, and $\{Z_i\}_{i=1}^{n}$ are independent. From the law of iterated logarithms (Stout, 1974), we have
$$\limsup_{n\to\infty}\frac{\sum_i\sqrt{n}\,v_{ij}\epsilon_i}{\sigma\sqrt{n\log\log n}}=\limsup_{n\to\infty}\frac{\sum_i\bigl(\sum_k\gamma_{jk}x_{ik}\bigr)\epsilon_i}{\sigma\sqrt{n\log\log n}}=\limsup_{n\to\infty}\frac{\sum_{i=1}^{n}Z_i}{\sigma\sqrt{n\log\log n}}=1,$$
namely,
$$\limsup_{n\to\infty}\frac{r_j}{\sigma\sqrt{\log\log n}}=1$$
with probability one. Since
$$\limsup_{n\to\infty}\frac{r_j^2}{\sigma^2\log\log n}\le\limsup_{n\to\infty}\sum_{j=p+1}^{q}\frac{r_j^2}{\sigma^2\log\log n}\le\sum_{j=p+1}^{q}\limsup_{n\to\infty}\frac{r_j^2}{\sigma^2\log\log n},$$
and from (2) and (3), we have (8) with probability one. $\Box$
The following inequality is useful in the derivation of the final result:
$$\frac{1}{2}\{k(\pi)-k(\pi^*)\}d_n-\frac{1}{4n}\bigl[\{k(\pi)-k(\pi^*)\}d_n\bigr]^2\le n\Bigl[1-\exp\Bigl\{-\frac{k(\pi)-k(\pi^*)}{2n}d_n\Bigr\}\Bigr]\le\frac{1}{2}\{k(\pi)-k(\pi^*)\}d_n.\qquad(9)$$
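Inequality (9) is the elementary bound $x-x^2/2\le 1-e^{-x}\le x$ scaled by $n$, with $x=\{k(\pi)-k(\pi^*)\}d_n/(2n)$ (the stated lower bound is slightly weaker than $x-x^2/2$, hence also valid). A quick numeric sanity check over illustrative values of $n$ and $\Delta k=k(\pi)-k(\pi^*)$:

```python
# Verify inequality (9) for a few sample values of n and dk.
import math

for n in [10, 100, 10000]:
    for dk in [1, 3]:
        d_n = 2 * math.log(math.log(n))          # the Hannan-Quinn rate
        x = dk * d_n / (2 * n)
        mid = n * (1 - math.exp(-x))             # the middle term of (9)
        upper = dk * d_n / 2
        lower = dk * d_n / 2 - (dk * d_n) ** 2 / (4 * n)
        assert lower <= mid <= upper
print("inequality (9) holds for the sampled values")
```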
Theorem 3 For $d_n:=(2+\epsilon)\log\log n$ ($\epsilon>0$), $L_\pi(z^n)>L_{\pi^*}(z^n)$ for every $\pi\ne\pi^*$ with probability one as $n\to\infty$.

Proof. From Theorem 2, the error for $\pi\not\supseteq\pi^*$ is almost surely zero as long as $d_n/n\to0$ ($n\to\infty$), so that we only need to consider the case $\pi\supsetneq\pi^*$. However, $d_n=(2+\epsilon)\log\log n$ with $\epsilon>0$ implies that the left hand side of (9) is strictly larger than $(q-p)\log\log n$ with $p=k(\pi^*)$ and $q=k(\pi)$ for large $n$, which from Proposition 2 and (4) implies Theorem 3. $\Box$

Theorem 4 For $d_n:=(2-\epsilon)\log\log n$ ($\epsilon>0$), $L_\pi(z^n)\le L_{\pi^*}(z^n)$ with nonzero probability as $n\to\infty$ for $\pi$ such that $k(\pi)=k(\pi^*)+1$.

Proof. $d_n=(2-\epsilon)\log\log n$ with $\epsilon>0$ implies that the right hand side of (9) is strictly smaller than $(q-p)\log\log n$ with $p=k(\pi^*)$ and $q=k(\pi)=p+1$ for large $n$, which from Proposition 2 and (4) implies Theorem 4. $\Box$
For example, suppose $p=0$ and $q=1$. Then, $X_{0,1}=\bigl(\frac{1}{n}\sum_{i=1}^{n}x_{i1}^2\bigr)^{-1}$, $v_{i1}=\dfrac{x_{i1}}{\sqrt{\sum_{h=1}^{n}x_{h1}^2}}$, and $\gamma_{11}=E[X_1^2]^{-1/2}$. In this case,
$$S_0-S_1=\frac{\bigl(\sum_{i=1}^{n}x_{i1}\epsilon_i\bigr)^2}{\sum_{h=1}^{n}x_{h1}^2}\quad\text{and}\quad\frac{S_0}{n}=\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2.$$
Thus, with probability one,
$$\frac{S_0-S_1}{S_0/n}=\frac{n\bigl(\sum_{i=1}^{n}x_{i1}\epsilon_i\bigr)^2}{\sum_{h=1}^{n}x_{h1}^2\sum_{l=1}^{n}\epsilon_l^2}$$
exceeds $(1+\epsilon)\log\log n$ at most finitely many times, while it exceeds $(1-\epsilon)\log\log n$ infinitely many times with nonzero probability, so that the model selection procedure makes a wrong decision at most finitely many times.
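The behavior in this $p=0$, $q=1$ example can be watched along a single simulated path (a sketch assuming standard normal $X_1$ and $\epsilon$; the horizon and seed are arbitrary): the statistic stays of order $\log\log n$, which is consistent with a $\log n$ penalty (BIC) sitting far above the boundary and a constant penalty (AIC) sitting below it.

```python
# Track the p=0, q=1 statistic n*(sum x_i e_i)^2 / (sum x_i^2 * sum e_i^2)
# along one path and compare it with log log n.
import numpy as np

rng = np.random.default_rng(4)
N = 100000
x = rng.standard_normal(N)
e = rng.standard_normal(N)
cx2, ce2, cxe = np.cumsum(x * x), np.cumsum(e * e), np.cumsum(x * e)

for n in [100, 1000, 10000, 100000]:
    stat = n * cxe[n - 1] ** 2 / (cx2[n - 1] * ce2[n - 1])
    print(n, round(stat, 3), round(np.log(np.log(n)), 3))
```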
5. Conclusion
We proved that the Hannan-Quinn proposition is true for linear regression as well as for autoregression (Hannan & Quinn, 1979) and for classification (Suzuki, 2006): the minimum rate of $d_n$ satisfying strong consistency is $(2+\epsilon)\log\log n$ for arbitrary $\epsilon>0$.

Future problems include finding strong consistency conditions that cover all the cases, including linear regression, autoregression, and classification. Making clear why the same $d_n=2\log\log n$ is the crucial rate for all of those problems would be the first step toward solving the problem.
References
Chatterjee, S., & Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. New York: John Wiley & Sons.

Hannan, E. J., & Quinn, B. G. (1979). The Determination of the Order of an Autoregression. Journal of the Royal Statistical Society, Series B, 41, 190-195.

Stout, W. (1974). Almost Sure Convergence. New York: Academic Press.

Suzuki, J. (2006). On Strong Consistency of Model Selection in Classification. IEEE Transactions on Information Theory, 52(11), 4767-4774. http://dx.doi.org/10.1109/TIT.2006.883611

Wu, Y., & Zen, M. (1999). A strongly consistent information criterion for linear model selection based on M-estimation. Probability Theory and Related Fields, 113, 599-625. http://dx.doi.org/10.1007/s004400050219