International Journal of Computer Mathematics, 2014
http://dx.doi.org/10.1080/00207160.2014.964998
Inferential studies for a flexible linear regression model for
interval-valued variables
A. Blanco-Fernández and G. González-Rodríguez
Department of Statistics and Operational Research, University of Oviedo, 33007 Asturias, Spain
(Received 2 December 2013; revised version received 23 April 2014; second revision received 25 July 2014;
accepted 6 September 2014)
Inferential studies for the regression coefficients of a linear model for interval-valued random variables are addressed. Confidence sets and hypothesis tests are investigated and solved through asymptotic and bootstrap techniques. The inferences are based on the least-squares estimators of the model, which have been shown to be coherent with the interval arithmetic defining the model and to verify good statistical properties. Theoretical results assure the validity of the procedures. Moreover, some simulation studies and examples are considered to show the empirical behaviour and the practical applicability of the inferences.
Keywords: bootstrap; inferences; interval data; linear regression; set arithmetic
2010 AMS Subject Classifications: 62J05; 62A86; 62J86
1. Introduction
Interval-valued data arise in various experimental scenarios. Sometimes a real random variable is imprecisely observed due to, for example, an inexact measurement device, a subjective perception of its values, or a requirement of confidentiality on the data. In these cases, the experimental outcomes are recorded as the intervals containing the precise values of the variable corresponding to each individual; see, for instance, [28,31]. Censoring, grouping and rounding processes also produce intervals [36]. The fluctuation of the values of a magnitude over a period of time can also be modelled by means of the min–max range of the values registered during that period [12,16,24]. Within symbolic data analysis (SDA) intervals are considered to summarize information stored in large data sets [46,9,10,33]. These situations may appear in multiple scientific areas: medicine, economics, psychology, environments, meteorology, etc. Classical statistical procedures have been extended to cope with intervals: measures of central tendency and variability [30], parameter estimation [34], limit theorems [2], regression and classification problems [4,12,15,24], to name but a few.
The statistical analysis of interval-valued data depends on their nature. When the intervals represent a lack of knowledge about an underlying real variable, the uncertainty should be propagated [3]. However, when there is no underlying variable (or it is not of interest in the statistical process), the statistical techniques are developed and interpreted as constrained two-dimensional approaches [24]. Thus, the statistical conclusions refer to the interval-valued variables themselves, which are the random elements modelling the experiment on target. This paper follows this latter approach.
*Corresponding author. Email: blancoangela@uniovi.es
© 2014 Taylor & Francis
Downloaded by [UOV University of Oviedo] at 01:21 17 October 2014
There exist several alternatives to deal with interval-valued variables. On one hand, the SDA has extensively been addressed [5,6,9,14]. Symbolic interval variables are mainly used for modelling aggregated data or interval descriptions of technical specifications. Thus, intervals represent groups of observations, and a uniform distribution for all the individual observations in the interval is assumed. With this idea, several statistical methods have been developed. Most of these SDA techniques are implemented in the SODAS (Symbolic Object Data Analysis Software) [14] and the R package ISDA [1]. Focusing on regression problems, alternative linear models for symbolic data (sometimes represented by intervals, but also by lists, sets, or histograms) have been proposed [4,10,33]. The symbolic regression problems are usually solved separately for real-valued variables associated with the intervals, such as the lower and upper limits, or the midpoints and ranges [4,10]. Solving the regression estimation by means of classical techniques does not prevent anomalous results such as forecast intervals whose lower bounds are larger than their upper ones. In [33] non-negativity conditions on the regression parameters are included in the estimation process to overcome this shortcoming. However, in this case the estimation is solved by means of numerical optimization methods, and no analytical expressions for the regression estimators are obtained. More recent works on SDA extend the regression problems to histogram- or modal-valued symbolic variables [13]. They all provide several alternatives to estimate the proposed models. However, since the regression models are not formalized in a well-defined probabilistic context, the estimation process reduces to a fitting problem, and the study of statistical properties of the estimators and inferential studies about the models make no sense in this setting.
An alternative approach for interval regression is based on the formalization of a linear
interval-valued function to relate the interval-valued variables themselves, as a natural
generalization of the classical linear models between real-valued variables [8,12,20,24,26]. This
approach considers the intervals as a whole, and no distribution for the individual points on each
interval is assumed. Moreover, these models are formalized in a probability scenario, so that
statistical properties and inferences for the models can be investigated.
The aim of this work is to develop inferential studies for a flexible linear model for
intervals introduced in [8]. Such a model is formalized in terms of the so-called interval arithmetic.
It has been shown to be flexible and versatile, while keeping the coherency with the interval
structure. Analytic expressions for least-squares (LS) estimators for the regression parameters of
the model are shown [8]. The limit distributions of the conveniently normalized expressions of
those regression estimators have been analysed and employed to develop asymptotic confidence
intervals for the parameters of the model [7].
Due to the lack of realistic parametric models for intervals, asymptotic or bootstrap techniques
are generally applied in inferences [3,7,23]. Asymptotic results require very large samples to
obtain reliable confidence sets, so bootstrap procedures are generally advisable. Bootstrap
confidence sets and hypothesis testing procedures for the regression coefficients of the model in [8]
have not been investigated yet. This work fills these gaps. A bootstrap procedure to construct
confidence sets for the parameters is investigated. Moreover, hypothesis tests on the regression
coefficients are presented and solved through asymptotic and bootstrap techniques.
The rest of the paper is organized as follows: in Section 2 some preliminary concepts about the interval framework and the considered linear model are presented, showing the LS estimators of the regression coefficients. The confidence sets and the hypothesis tests for the parameters of the model are developed in Sections 3 and 4, respectively. In Section 5 some simulation studies are developed, in order to show the performance of the methods and to compare both asymptotic and bootstrap approaches. Section 6 is devoted to the practical application of the procedures on some classical interval data sets. Finally, Section 7 includes some concluding remarks and future directions. Technical proofs are provided in the appendix.
2. Notation and linear regression model M for intervals
Interval-valued data are formalized as elements of the space
$$\mathcal{K}_c(\mathbb{R}) = \{[a,b] : a, b \in \mathbb{R},\ a \le b\}.$$
Any $A \in \mathcal{K}_c(\mathbb{R})$ can be characterized by $(\mathrm{mid}\,A, \mathrm{spr}\,A) \in \mathbb{R} \times \mathbb{R}^+$, where $\mathrm{mid}\,A = (\sup A + \inf A)/2$ is the midpoint (or centre), and $\mathrm{spr}\,A = (\sup A - \inf A)/2$ is the spread (or radius) of $A$. The notation $A = [\mathrm{mid}\,A \pm \mathrm{spr}\,A]$ will be used. That enables the embedding of the space $\mathcal{K}_c(\mathbb{R})$ into the subspace $\mathbb{R} \times \mathbb{R}^+$ of $\mathbb{R}^2$, which allows one to apply many classical properties on $\mathbb{R}^2$ for the statistical treatment of intervals. Nevertheless, it will always be necessary to guarantee that the resulting elements remain in the subspace $\mathbb{R} \times \mathbb{R}^+$, so that they are associated with well-defined intervals.
The Minkowski addition and the product by scalars form the natural arithmetic on $\mathcal{K}_c(\mathbb{R})$. That is, $A + B = \{a + b : a \in A, b \in B\}$ and $\lambda A = \{\lambda a : a \in A\}$, for all $A, B \in \mathcal{K}_c(\mathbb{R})$ and $\lambda \in \mathbb{R}$. The space $(\mathcal{K}_c(\mathbb{R}), +, \cdot)$ is not linear but semi-linear, or conical, due to the lack of a symmetric element with respect to the addition: $A + (-A) = [-l, l]$ with $l > 0$, unless $A = \{a\}$ is a singleton. Moreover, the expression $A + (-1)B$ generally differs from the natural difference $A - B$, since it does not fulfil the addition/subtraction simplification, i.e. $(A + (-1)B) + B \ne A$ in general. To partially overcome this situation, the so-called Hukuhara difference $A -_H B$ is defined as the interval $C$ such that $A = B + C$. It verifies that $A -_H A = \{0\}$ and $(A -_H B) + B = A$. However, it does not exist for every pair of intervals; for instance, if $A = [1,2]$ and $B = [0,4]$, the only way to get $A = B + C$ (so that $C = A -_H B$) is $C = [1,-2] \notin \mathcal{K}_c(\mathbb{R})$. The interval $A -_H B$ exists if, and only if, $\mathrm{spr}\,B \le \mathrm{spr}\,A$.
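The arithmetic just described can be illustrated with a minimal Python sketch. The class `Interval` and its method names are hypothetical, introduced here only for illustration; they are not taken from the software accompanying the paper. Intervals are stored through their (mid, spr) representation, so the Minkowski sum adds midpoints and spreads, and the Hukuhara difference exists only when the spread of the subtrahend does not exceed that of the minuend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Compact interval stored as (mid, spr), with spr >= 0 (illustrative class)."""
    mid: float
    spr: float

    def __post_init__(self):
        if self.spr < 0:
            raise ValueError("spread must be non-negative")

    @classmethod
    def from_bounds(cls, lo, hi):
        if lo > hi:
            raise ValueError("lower bound exceeds upper bound")
        return cls((lo + hi) / 2, (hi - lo) / 2)

    @property
    def bounds(self):
        return (self.mid - self.spr, self.mid + self.spr)

    def __add__(self, other):
        # Minkowski addition: midpoints add, spreads add.
        return Interval(self.mid + other.mid, self.spr + other.spr)

    def scale(self, lam):
        # Product by a scalar: the spread is multiplied by |lam|.
        return Interval(lam * self.mid, abs(lam) * self.spr)

    def hukuhara_diff(self, other):
        """Return C with self = other + C, or None when no such interval exists."""
        if other.spr > self.spr:
            return None
        return Interval(self.mid - other.mid, self.spr - other.spr)


A = Interval.from_bounds(1, 2)
B = Interval.from_bounds(0, 4)

print((A + A.scale(-1)).bounds)        # A + (-A) is [-l, l], not {0}
print((A + B.scale(-1)) + B == A)      # False: A + (-1)B does not simplify
print(A.hukuhara_diff(B))              # None, since spr B > spr A
print(B.hukuhara_diff(A).bounds)       # B -_H A exists, since spr A <= spr B
```

Storing (mid, spr) rather than the two bounds is a design choice that makes the existence condition of the Hukuhara difference a single comparison of spreads.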
Any interval $A$ can be alternatively represented through the so-called canonical decomposition, given by $A = \mathrm{mid}\,A\,[1 \pm 0] + \mathrm{spr}\,A\,[0 \pm 1]$. It allows the consideration of the mid and spr components of $A$ separately within the interval arithmetic.
The interval data-generating process is formalized through random intervals. Given a probability space $(\Omega, \mathcal{A}, P)$, a mapping $X : \Omega \to \mathcal{K}_c(\mathbb{R})$ is said to be a random interval associated with $(\Omega, \mathcal{A}, P)$ if $(\mathrm{mid}\,X, \mathrm{spr}\,X) : \Omega \to \mathbb{R}^2$ is a real-valued random vector such that $\mathrm{spr}\,X \ge 0$ almost surely (a.s.)-[P].
The expected value of a random interval $X$ is usually defined in terms of the well-known Aumann expectation, which satisfies that
$$E(X) = [E(\mathrm{mid}\,X) \pm E(\mathrm{spr}\,X)], \tag{1}$$
whenever $\mathrm{mid}\,X, \mathrm{spr}\,X \in L^1$.
Given $\{X_i\}_{i=1}^n$ a simple random sample obtained from $X$, an estimator of the population mean is defined in terms of the interval arithmetic as $\overline{X} = (X_1 + \cdots + X_n)/n$. It is coherent with the Aumann expectation in Equation (1) in the sense of Strong Laws of Large Numbers [2].
2.1 The linear regression model M for intervals
Let $X, Y : \Omega \to \mathcal{K}_c(\mathbb{R})$ be two random intervals associated with $(\Omega, \mathcal{A}, P)$. The so-called model M between $X$ and $Y$ introduced in [8] is formalized as follows:
$$Y = \alpha\,\mathrm{mid}\,X\,[1 \pm 0] + \beta\,\mathrm{spr}\,X\,[0 \pm 1] + \gamma\,[1 \pm 0] + \varepsilon, \tag{2}$$
where $\alpha, \beta, \gamma \in \mathbb{R}$ and $\varepsilon : \Omega \to \mathcal{K}_c(\mathbb{R})$ is an interval-valued error such that $E(\varepsilon|X) = [-\delta, \delta] \in \mathcal{K}_c(\mathbb{R})$, with $\delta \ge 0$. The use of the canonical decomposition of the regressor $X$ in Equation (2) allows us to relate the mid and spr components of $Y$ in terms of $X$ by means of different regression coefficients $\alpha$ and $\beta$, respectively. Namely, it holds that $\mathrm{mid}\,Y = \alpha\,\mathrm{mid}\,X + \gamma + \mathrm{mid}\,\varepsilon$ and $\mathrm{spr}\,Y = |\beta|\,\mathrm{spr}\,X + \mathrm{spr}\,\varepsilon$, where $\delta = E(\mathrm{spr}\,\varepsilon|X)$. Thereafter, the second-order moments of the mid and spr components of the random intervals involved in the linear model (2) are assumed to be finite, and the variances strictly positive.
The regression function associated with the model (2) can be written as
$$E(Y|X) = \alpha X^M + \beta X^S + B, \tag{3}$$
where $B = [\gamma - \delta, \gamma + \delta] \in \mathcal{K}_c(\mathbb{R})$, $X^M = \mathrm{mid}\,X\,[1 \pm 0]$ and $X^S = \mathrm{spr}\,X\,[0 \pm 1]$. Since $X^S = -X^S$, it is clear that the regression coefficient affecting the spreads, $\beta$, can be assumed without loss of generality to be non-negative.
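The estimation process consists in finding suitable values $(\hat\alpha, \hat\beta) \in \mathbb{R} \times \mathbb{R}^+$ and $\hat B \in \mathcal{K}_c(\mathbb{R})$ (so that $\hat\gamma = \mathrm{mid}\,\hat B$ and $\hat\delta = \mathrm{spr}\,\hat B$) as estimates of the corresponding regression parameters. The LS criterion based on a generic $L_2$-type metric defined on the space $\mathcal{K}_c(\mathbb{R})$ has been applied in [8] to estimate (3). The associated minimization problem has been solved over a suitable feasible set assuring the existence of the interval-valued residuals and the coherency of the solutions with the interval arithmetic. Analytic expressions for the regression estimators have been obtained. They can be expressed in terms of classical moments for the mid and spr variables of the intervals as follows:
$$\hat\alpha = \frac{\hat\sigma_{\mathrm{mid}\,X,\mathrm{mid}\,Y}}{\hat\sigma^2_{\mathrm{mid}\,X}}, \qquad \hat\beta = \min\left(\hat s_0, \max\left(0, \frac{\hat\sigma_{\mathrm{spr}\,X,\mathrm{spr}\,Y}}{\hat\sigma^2_{\mathrm{spr}\,X}}\right)\right) \quad\text{and} \tag{4}$$
$$\hat B = \left[(\overline{\mathrm{mid}\,Y} - \hat\alpha\,\overline{\mathrm{mid}\,X}) \pm (\overline{\mathrm{spr}\,Y} - \hat\beta\,\overline{\mathrm{spr}\,X})\right], \tag{5}$$
where $\hat s_0 = \min\{\mathrm{spr}\,Y_i/\mathrm{spr}\,X_i : \mathrm{spr}\,X_i \ne 0\}$ and the overline denotes the sample mean.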
The estimator for the parameter $\alpha$ agrees with the ordinary LS estimator for the classical linear model between the mid variables, but this is not the case for $\hat\beta$. The term $\hat s_0$ determines an upper bound of the regression estimator $\hat\beta$ in Equation (4) which assures the existence of the residuals of the estimated model, $\hat\varepsilon_i = Y_i -_H (\hat\alpha X_i^M + \hat\beta X_i^S)$, for all $i = 1, \ldots, n$. If $\mathrm{spr}\,X_i = 0$ for all $i = 1, \ldots, n$, that is, all the sample intervals $X_i$ are reduced to real values, then $\hat s_0 = \infty$.
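As a rough illustration of Equations (4) and (5), the estimators can be computed directly from the lists of midpoints and spreads. The function name `fit_model_m` and the toy data below are assumptions made for this sketch; the moments use the 1/n normalization, which does not affect the ratios, and the sample variances are assumed strictly positive, as in the model.

```python
import math


def _mean(v):
    return sum(v) / len(v)


def _cov(u, v):
    # Sample covariance with 1/n normalization (the ratio in (4) is unaffected).
    mu, mv = _mean(u), _mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def fit_model_m(mid_x, spr_x, mid_y, spr_y):
    """Return (alpha_hat, beta_hat, (mid_B, spr_B)) following Equations (4)-(5)."""
    alpha = _cov(mid_x, mid_y) / _cov(mid_x, mid_x)
    # s0 bounds beta_hat so that all Hukuhara residuals exist.
    ratios = [sy / sx for sx, sy in zip(spr_x, spr_y) if sx != 0]
    s0 = min(ratios) if ratios else math.inf
    beta = min(s0, max(0.0, _cov(spr_x, spr_y) / _cov(spr_x, spr_x)))
    mid_b = _mean(mid_y) - alpha * _mean(mid_x)
    spr_b = _mean(spr_y) - beta * _mean(spr_x)
    return alpha, beta, (mid_b, spr_b)


# Toy data (hypothetical): mid Y = 2 mid X + 1 and spr Y = spr X + 0.5 exactly.
mid_x = [1.0, 2.0, 3.0, 4.0]
spr_x = [1.0, 2.0, 1.0, 2.0]
mid_y = [3.0, 5.0, 7.0, 9.0]
spr_y = [1.5, 2.5, 1.5, 2.5]

alpha_hat, beta_hat, b_hat = fit_model_m(mid_x, spr_x, mid_y, spr_y)
print(alpha_hat, beta_hat, b_hat)
```

When the unconstrained spread slope exceeds the smallest ratio spr Y_i / spr X_i, the `min` with `s0` is what keeps every residual spread non-negative.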
Since no realistic parametric model for describing the distribution of a random interval has been widely accepted until now, the exact distributions for the regression estimators are not available. Thus, inferential studies in the interval setting are often developed by means of asymptotic or bootstrap approaches. Regarding model M, in [7] the limit distributions of the normalized estimators $\hat\alpha$ and $\hat\beta$ in Equation (4) are shown. They have been employed to construct asymptotic confidence sets for the regression coefficients of the model. As usual, asymptotic techniques provide accurate results for large sample sizes, but they often lose accuracy for small or moderate samples. In these cases, bootstrap methods are generally known to improve the results [18,38].
3. Confidence sets for the regression parameters: a bootstrap approach
Since the linear model M is formalized for both a response $Y$ and a regressor $X$ being random (interval-valued) variables, and the sample data are obtained through a simple random sample-generation process from the random pair $(X, Y)$, the paired bootstrap is applied to the generation of bootstrap samples from the model [17].
The existence of separate analytic expressions for the estimators allows us to construct confidence sets for the regression coefficients of the model individually.
For $\alpha$, the identification of both the parameter and the estimate with the respective terms in the classical linear model for $(\mathrm{mid}\,X, \mathrm{mid}\,Y)$ enables us to apply classical bootstrap results. Let $\{X_i, Y_i\}_{i=1}^n$ be a random sample obtained from $(X, Y)$ verifying the linear model (2), and let $\hat\alpha$ be the LS estimator of $\alpha$ computed from that sample. From its expression in Equation (4), $\hat\alpha$ can be computed from the random sample of midpoints, $\{\mathrm{mid}\,X_i, \mathrm{mid}\,Y_i\}_{i=1}^n$. By re-sampling with replacement from this set, a bootstrap sample $\{\mathrm{mid}\,X_i^*, \mathrm{mid}\,Y_i^*\}_{i=1}^n$ is obtained, and the bootstrap estimate $\hat\alpha^*$ is computed. Let $1 - \rho \in (0,1)$ be a fixed confidence level. Various methods can be used to construct bootstrap confidence intervals for regression parameters such as $\alpha$ in a classical linear model (see [38] for a detailed review). Namely,
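Let $K_{BOOT}$ be the distribution function of the bootstrap estimator $\hat\alpha^*$, i.e. $K_{BOOT}(x) = P^*(\hat\alpha^* \le x)$, $x \in \mathbb{R}$. The percentile bootstrap confidence set for $\alpha$ is defined as
$$CI_P(\alpha)_{1-\rho} = \left[K_{BOOT}^{-1}\left(\frac{\rho}{2}\right),\ K_{BOOT}^{-1}\left(1 - \frac{\rho}{2}\right)\right]. \tag{6}$$
Let $H_{BOOT}$ be the distribution function of $n^l(\hat\alpha^* - \hat\alpha)$, for a fixed constant $l$. The usual choice is $l = 1/2$, and the hybrid bootstrap confidence set for $\alpha$ is
$$CI_H(\alpha)_{1-\rho} = \left[\hat\alpha - \frac{1}{\sqrt{n}} H_{BOOT}^{-1}\left(1 - \frac{\rho}{2}\right),\ \hat\alpha - \frac{1}{\sqrt{n}} H_{BOOT}^{-1}\left(\frac{\rho}{2}\right)\right]. \tag{7}$$
Let us consider the standardized statistic $R = (\hat\alpha - \alpha)/\hat\sigma_{\hat\alpha}$, with $\hat\sigma_{\hat\alpha}$ being an estimator of the standard deviation of $\hat\alpha$. Its bootstrap version is defined as $R^* = (\hat\alpha^* - \hat\alpha)/\hat\sigma^*_{\hat\alpha}$. Let $G_{BOOT}$ be the distribution function of $R^*$. The t-bootstrap confidence set for $\alpha$ is defined as
$$CI_t(\alpha)_{1-\rho} = \left[\hat\alpha - \hat\sigma_{\hat\alpha}\, G_{BOOT}^{-1}\left(1 - \frac{\rho}{2}\right),\ \hat\alpha - \hat\sigma_{\hat\alpha}\, G_{BOOT}^{-1}\left(\frac{\rho}{2}\right)\right]. \tag{8}$$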
The theoretical consistency of the proposed bootstrap confidence sets for the regression parameter $\alpha$ is assured from the consistency of each of the bootstrap distribution functions with respect to the corresponding sample distribution function. Given $\hat K(x) = P(\hat\alpha \le x)$ the sample distribution function of the estimator $\hat\alpha$, in [38] it is shown that $\lim_{n\to\infty} \rho_\infty(K_{BOOT}, \hat K) = 0$ a.s.-[P], where $\rho_\infty(f, g) = \sup\{|f(x) - g(x)| : x \in \mathbb{R}\}$ for $f, g$ bounded real-valued functions, so $K_{BOOT}$ is consistent (or $\rho_\infty$-consistent). Analogous results are shown for $H_{BOOT}$, with respect to $\hat H$ [$\hat H(x) = P(\sqrt{n}(\hat\alpha - \alpha) \le x)$], and for $G_{BOOT}$, with respect to $\hat G$ [$\hat G(x) = P(R \le x)$].
In practice, the percentiles of the bootstrap distribution functions can be computed by means of the Monte Carlo method, from the empirical distribution of the bootstrap estimator $\hat\alpha^*$ obtained with $B$ independent bootstrap re-samples.
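A minimal Monte Carlo sketch of the three confidence sets (6)-(8) follows. Everything named here (`ols_slope`, `slope_se`, `quantile`, `bootstrap_cis_alpha`, the toy data) is hypothetical and written only for illustration. In particular, the classical standard error of the LS slope is used as one possible choice for the estimator of the standard deviation of the slope estimator; the paper does not fix that choice in this section.

```python
import math
import random


def ols_slope(x, y):
    """LS slope of y on x, i.e. the form of alpha-hat in Equation (4)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


def slope_se(x, y):
    """Classical standard error of the LS slope (an assumed choice of sigma-hat)."""
    n, mx, my = len(x), sum(x) / len(x), sum(y) / len(y)
    b = ols_slope(x, y)
    rss = sum((yi - my - b * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(rss / (n - 2) / sum((xi - mx) ** 2 for xi in x))


def quantile(values, p):
    """Empirical p-quantile (generalized inverse of the empirical d.f.)."""
    s = sorted(values)
    return s[min(len(s) - 1, max(0, math.ceil(p * len(s)) - 1))]


def bootstrap_cis_alpha(mid_x, mid_y, rho=0.05, n_boot=1000, seed=0):
    rng = random.Random(seed)
    n = len(mid_x)
    a_hat, se_hat = ols_slope(mid_x, mid_y), slope_se(mid_x, mid_y)
    a_star, r_star = [], []
    while len(a_star) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # paired re-sampling
        xs = [mid_x[i] for i in idx]
        ys = [mid_y[i] for i in idx]
        if len(set(xs)) < 3:                         # skip degenerate re-samples
            continue
        a = ols_slope(xs, ys)
        a_star.append(a)
        r_star.append((a - a_hat) / slope_se(xs, ys))
    h_star = [math.sqrt(n) * (a - a_hat) for a in a_star]
    lo, hi = rho / 2, 1 - rho / 2
    root_n = math.sqrt(n)
    return {
        "percentile": (quantile(a_star, lo), quantile(a_star, hi)),
        "hybrid": (a_hat - quantile(h_star, hi) / root_n,
                   a_hat - quantile(h_star, lo) / root_n),
        "t": (a_hat - se_hat * quantile(r_star, hi),
              a_hat - se_hat * quantile(r_star, lo)),
    }


# Toy midpoints (hypothetical data with true slope 2).
_rng = random.Random(1)
mid_x = [_rng.gauss(1, 1) for _ in range(50)]
mid_y = [2 * x + 1 + _rng.gauss(0, 1) for x in mid_x]
cis = bootstrap_cis_alpha(mid_x, mid_y)
for name, (lower, upper) in cis.items():
    print(name, round(lower, 3), round(upper, 3))
```

With l = 1/2 the hybrid set is the reflection of the percentile set around the sample estimate, which gives a quick sanity check on any implementation.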
Analogous developments can be done to construct confidence sets for the regression parameter $\beta$ from the estimator $\hat\beta$ given in Equation (4). The bootstrap estimator of $\beta$ is defined as
$$\hat\beta^* = \min\left(\hat s_0^*, \max\left(0, \frac{\hat\sigma_{\mathrm{spr}\,X^*,\mathrm{spr}\,Y^*}}{\hat\sigma^2_{\mathrm{spr}\,X^*}}\right)\right), \tag{9}$$
where $\{\mathrm{spr}\,X_i^*, \mathrm{spr}\,Y_i^*\}_{i=1}^n$ is a bootstrap sample obtained by re-sampling from $\{\mathrm{spr}\,X_i, \mathrm{spr}\,Y_i\}_{i=1}^n$ and $\hat s_0^* = \min_{i=1,\ldots,n}\{\mathrm{spr}\,Y_i^*/\mathrm{spr}\,X_i^* : \mathrm{spr}\,X_i^* \ne 0\}$. Obviously, the consistency results for classical bootstrap regression estimators are not directly applicable to $\hat\beta^*$ in Equation (9).
We will focus the study on the asymptotic performance of the hybrid confidence set for $\beta$. In [7] it has been shown that the sample statistic $T_n = \sqrt{n}(\hat\beta - \beta)$ has a different asymptotic distribution depending on the theoretical scenario of the model. This result entails important difficulties in proving the consistency of the corresponding bootstrap statistic $T_n^* = \sqrt{n}(\hat\beta^* - \hat\beta)$. In Theorem 3.1 the result is shown for one of the theoretical situations of the linear model. The proof is in the appendix.
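Theorem 3.1 Let $X, Y : \Omega \to \mathcal{K}_c(\mathbb{R})$ be two random intervals so that $Y = \alpha X^M + \beta X^S + \gamma[1 \pm 0] + \varepsilon$, with $E(\varepsilon|X) = [-\delta, \delta]$, $\delta \ge 0$, and $0 < \sigma^2_{\mathrm{spr}\,X}, \sigma^2_{\mathrm{spr}\,\varepsilon} < \infty$. Let $l = \lim_{x \to 0^+} P((\mathrm{spr}\,X > 0) \cap (\mathrm{spr}\,\varepsilon \le x\,\mathrm{spr}\,X))/x^2$. If $\beta \ne 0$ (so $\beta > 0$) and $l = 0$, then
$$T_n^* = \sqrt{n}(\hat\beta^* - \hat\beta) \xrightarrow{\mathcal{L}} T \equiv N\left(0, \frac{\sigma^2_{\mathrm{spr}\,\varepsilon}}{\sigma^2_{\mathrm{spr}\,X}}\right) \quad \text{a.s.-}[P], \text{ as } n \to \infty.$$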
Under the assumptions of Theorem 3.1, it has been shown that the sample statistic $T_n$ converges to the same distribution, i.e.
$$T_n \xrightarrow{\mathcal{L}} T, \text{ as } n \to \infty. \tag{10}$$
The value of $l$ affects the limit distribution of the sample term $\sqrt{n}(\hat s_0 - \beta)$. Specifically, if $l = 0$, it diverges to $\infty$. This condition is included in order to assure that the term $\hat s_0$ does not converge to a non-degenerate distribution at rate $\sqrt{n}$. Thus, a mixture of distributions for $\sqrt{n}(\hat\beta - \beta)$ is avoided, which would be very difficult to handle in practice. It can be shown that $l = 0$ if the distribution function of the real random variable $\mathrm{spr}\,\varepsilon/\mathrm{spr}\,X$ converges to 0 in a neighbourhood of that point faster than the real function $h(x) = x^2$. If usual parametric distribution models for the non-negative variables $\mathrm{spr}\,\varepsilon$ and $\mathrm{spr}\,X$ are considered, such as chi-square or log-normal models, the condition is fulfilled.
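Theorem 3.1 assures the asymptotic consistency of the hybrid bootstrap confidence set for the regression parameter $\beta$ at $(1 - \rho)$-confidence level, given by
$$CI_H(\beta)_{1-\rho} = \left[\hat\beta - \frac{1}{\sqrt{n}}(H^{\beta}_{BOOT})^{-1}\left(1 - \frac{\rho}{2}\right),\ \hat\beta - \frac{1}{\sqrt{n}}(H^{\beta}_{BOOT})^{-1}\left(\frac{\rho}{2}\right)\right], \tag{11}$$
with $H^{\beta}_{BOOT}(x) = P^*(T_n^* \le x)$, $x \in \mathbb{R}$.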
The existence of a different limit distribution for $T_n = \sqrt{n}(\hat\beta - \beta)$ depending on the theoretical situation of $X$ and $Y$ limits the practical applicability of the inferences based on this statistic. An alternative asymptotic confidence set for $\beta$ has been constructed by using the partial information provided by the sample term
$$\hat\beta_S = \frac{\hat\sigma_{\mathrm{spr}\,X,\mathrm{spr}\,Y}}{\hat\sigma^2_{\mathrm{spr}\,X}}. \tag{12}$$
A bootstrap confidence set for $\beta$ can then be constructed by considering the bootstrap statistic
$$T_n^{S*} = \sqrt{n}(\hat\beta_S^* - \hat\beta_S),$$
where $\hat\beta_S^* = \hat\sigma_{\mathrm{spr}\,X^*,\mathrm{spr}\,Y^*}/\hat\sigma^2_{\mathrm{spr}\,X^*}$ is computed from the bootstrap sample of spreads of the intervals $\{\mathrm{spr}\,X_i^*, \mathrm{spr}\,Y_i^*\}_{i=1}^n$. As a result, an alternative hybrid confidence set for $\beta$ at $(1 - \rho)$-confidence level is given by
$$CI^S_H(\beta)_{1-\rho} = \left[\hat\beta_S - \frac{1}{\sqrt{n}}(H^{\beta_S}_{BOOT})^{-1}\left(1 - \frac{\rho}{2}\right),\ \hat\beta_S - \frac{1}{\sqrt{n}}(H^{\beta_S}_{BOOT})^{-1}\left(\frac{\rho}{2}\right)\right], \tag{13}$$
with $H^{\beta_S}_{BOOT}(x) = P^*(T_n^{S*} \le x)$, $x \in \mathbb{R}$. The inferences based on $\hat\beta_S$ defined in Equation (12) are completely applicable in practice, since the limit distribution of $\sqrt{n}(\hat\beta_S - \beta)$ is shown to be the same under all the population conditions [7]. Besides, the consistency of this bootstrap confidence set is guaranteed for any population scenario by applying classical results, analogously to the bootstrap method for the parameter $\alpha$.
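The two hybrid confidence sets for the spread coefficient differ only in the estimator that is re-sampled, so a single routine can sketch both (11) and (13). The function names (`beta_hat`, `beta_s`, `hybrid_ci`) and the toy spreads below are assumptions made for illustration; each set is centred at, and the bootstrap statistic is recentred at, the corresponding sample estimate.

```python
import math
import random


def _slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


def beta_s(spr_x, spr_y):
    """Unconstrained spread slope, Equation (12)."""
    return _slope(spr_x, spr_y)


def beta_hat(spr_x, spr_y):
    """Constrained LS estimator of beta, Equation (4)."""
    s0 = min((sy / sx for sx, sy in zip(spr_x, spr_y) if sx != 0),
             default=math.inf)
    return min(s0, max(0.0, _slope(spr_x, spr_y)))


def quantile(values, p):
    s = sorted(values)
    return s[min(len(s) - 1, max(0, math.ceil(p * len(s)) - 1))]


def hybrid_ci(estimator, spr_x, spr_y, rho=0.05, n_boot=1000, seed=0):
    """Hybrid bootstrap confidence set for beta based on `estimator`."""
    rng = random.Random(seed)
    n = len(spr_x)
    est = estimator(spr_x, spr_y)
    t_star = []
    while len(t_star) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # paired re-sampling of spreads
        xs = [spr_x[i] for i in idx]
        ys = [spr_y[i] for i in idx]
        if len(set(xs)) < 2:                         # slope undefined: skip
            continue
        t_star.append(math.sqrt(n) * (estimator(xs, ys) - est))
    root_n = math.sqrt(n)
    return (est - quantile(t_star, 1 - rho / 2) / root_n,
            est - quantile(t_star, rho / 2) / root_n)


# Toy spreads (hypothetical): spr Y = spr X + spr eps, with spr eps >= 1.
_rng = random.Random(2)
spr_x = [_rng.gauss(0, 1) ** 2 for _ in range(60)]
spr_y = [sx + _rng.gauss(0, 1) ** 2 + 1 for sx in spr_x]

ci_constrained = hybrid_ci(beta_hat, spr_x, spr_y)   # Equation (11)
ci_partial = hybrid_ci(beta_s, spr_x, spr_y)         # Equation (13)
print(ci_constrained, ci_partial)
```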
4. Hypothesis testing on the regression parameters: asymptotic and bootstrap approaches
The parameters $\alpha$ and $\beta$ in model (2) represent the strength of the relationship between the response interval $Y$ and the mid and the spr component of the regressor $X$, respectively. Testing the explanatory power of $\mathrm{mid}\,X$ and/or $\mathrm{spr}\,X$ consists in testing that the corresponding coefficient is equal to 0. In general it is possible to test the null hypothesis
$$H_0 : \alpha = \alpha_0 \tag{14}$$
against the alternative
$$H_1 : \alpha \ne \alpha_0$$
for a certain constant value $\alpha_0 \in \mathbb{R}$. Analogously for $\beta$:
$$H_0 : \beta = \beta_0, \qquad H_1 : \beta \ne \beta_0, \tag{15}$$
where $\beta_0 \ge 0$ without loss of generality.
Hypotheses (14) and (15) can be tested on the basis of the available sample information by
means of asymptotic or bootstrap techniques.
4.1 Asymptotic hypothesis testing
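To test (14), the following test statistic will be used:
$$S_n^0 = \sqrt{n}(\hat\alpha - \alpha_0). \tag{16}$$
It fulfils that $S_n^0 \xrightarrow{\mathcal{L}} N(0, \sigma^2_{\mathrm{mid}\,\varepsilon}/\sigma^2_{\mathrm{mid}\,X})$ as $n \to \infty$ when $H_0$ (14) is true. Thus, an approach to test (14) is defined in Proposition 4.1. The proof is collected in the appendix.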
Proposition 4.1 In testing the null hypothesis (14) at a nominal significance level $\rho \in (0, 1)$, the test consisting in rejecting $H_0$ when
$$|S_n^0| > c_{\rho/2},$$
where $c_{\rho/2}$ is the upper $\rho/2$-quantile of the $N(0, \sigma^2_{\mathrm{mid}\,\varepsilon}/\sigma^2_{\mathrm{mid}\,X})$ distribution, is asymptotically correct, i.e. $P(\text{reject } H_0 \mid H_0 \text{ true}) \to \rho$ as $n \to \infty$. Moreover, it is asymptotically consistent, that is, the probability of rejecting $H_0$ (14) when the alternative is true tends to 1 as $n$ tends to $\infty$.
To test (15) asymptotically it is necessary to define the test statistic by using partial information from the estimation process, in order to avoid the possibility of a different limit distribution for the statistic depending on the theoretical situation. Let us consider
$$T_n^{S0} = \sqrt{n}(\hat\beta_S - \beta_0), \tag{17}$$
where $\hat\beta_S$ is defined in Equation (12). It holds that $T_n^{S0} \xrightarrow{\mathcal{L}} N(0, \sigma^2_{\mathrm{spr}\,\varepsilon}/\sigma^2_{\mathrm{spr}\,X})$ as $n \to \infty$ under $H_0$ (15). Based on these results, an approach to test (15) is described in Proposition 4.2. The proof is analogous to that of Proposition 4.1, by changing the role of $\alpha$ and $\beta$ in the computations.
Proposition 4.2 In testing the null hypothesis (15) at a nominal significance level $\rho \in (0, 1)$, the test consisting in rejecting $H_0$ when
$$|T_n^{S0}| > t_{\rho/2},$$
where $t_{\rho/2}$ is the upper $\rho/2$-quantile of the $N(0, \sigma^2_{\mathrm{spr}\,\varepsilon}/\sigma^2_{\mathrm{spr}\,X})$ distribution, is asymptotically correct, i.e. $P(\text{reject } H_0 \mid H_0 \text{ true}) \to \rho$ as $n \to \infty$. Moreover, it is asymptotically consistent, that is, the probability of rejecting $H_0$ (15) when the alternative is true tends to 1 as $n$ tends to $\infty$.
The usual transformation of the test statistics by incorporating the estimated variances can be done when the theoretical variances involved in the limit distributions of $S_n^0$ and/or $T_n^{S0}$ are unknown.
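The transformation with estimated variances can be sketched as follows. The function `asymptotic_slope_test` is a hypothetical helper written for this illustration: it plugs the residual variance of the fitted line and the sample variance of the regressor into the limit variance, which is one natural choice rather than necessarily the one used by the authors. Applied to midpoints it corresponds to the statistic in (16), and applied to spreads to the statistic in (17).

```python
import math
import random
from statistics import NormalDist


def asymptotic_slope_test(x, y, coef0):
    """Return (slope, statistic, studentized statistic, two-sided p-value)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n / var_x
    # Plug-in estimate of the error variance from the LS residuals.
    var_eps = sum(((b - my) - slope * (a - mx)) ** 2 for a, b in zip(x, y)) / n
    stat = math.sqrt(n) * (slope - coef0)       # S_n^0 or T_n^{S0}
    z = stat / math.sqrt(var_eps / var_x)       # approximately N(0,1) under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return slope, stat, z, p_value


# Toy midpoints (hypothetical data with true slope 2).
_rng = random.Random(3)
mid_x = [_rng.gauss(1, 1) for _ in range(200)]
mid_y = [2 * a + 1 + _rng.gauss(0, 1) for a in mid_x]

slope, stat, z, p_value = asymptotic_slope_test(mid_x, mid_y, coef0=2.0)
print(round(slope, 3), round(z, 3), round(p_value, 3))
```

Rejecting when the p-value falls below the nominal level is equivalent to comparing the absolute studentized statistic with the upper quantile of the standard normal distribution.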
These asymptotic tests work suitably only for very large samples (see Section 5). For this reason we propose a bootstrap approach in the next section.
4.2 Bootstrap hypothesis testing
To test (14), the new variable $Z_M = \mathrm{mid}\,Y - \hat\alpha\,\mathrm{mid}\,X + \alpha_0\,\mathrm{mid}\,X$ is defined. From $\{\mathrm{mid}\,X_i, \mathrm{mid}\,Y_i\}_{i=1}^n$, the sampled midpoints of the random sample of intervals $\{X_i, Y_i\}_{i=1}^n$, the bootstrap population $\{\mathrm{mid}\,X_i, Z_{M_i}\}_{i=1}^n$ fulfilling the null hypothesis (14) is computed. Then, a bootstrap sample of size $n$, $\{\mathrm{mid}\,X_i^*, Z_{M_i}^*\}_{i=1}^n$, is drawn with replacement from the bootstrap population, and the bootstrap statistic
$$S_n^{0*} = \sqrt{n}(\hat\alpha^*_{(X,Z)} - \alpha_0),$$
where $\hat\alpha^*_{(X,Z)} = \hat\sigma_{\mathrm{mid}\,X^*, Z_M^*}/\hat\sigma^2_{\mathrm{mid}\,X^*}$, is defined. Following well-known results on bootstrap methods for classical regression problems, it is easy to show that the asymptotic distribution under the null hypothesis of the bootstrap statistic $S_n^{0*}$ equals that of $S_n^0$ in (16) (see [19]).
Proposition 4.3 In testing the null hypothesis (14) at a nominal significance level $\rho \in (0, 1)$, the test consisting in rejecting $H_0$ when
$$|S_n^0| > c^*_{\rho/2},$$
with $c^*_{\rho/2}$ being the upper $\rho/2$-quantile of the distribution of $S_n^{0*}$, is asymptotically correct and consistent.
When applying bootstrap techniques, the bootstrap distribution can be approximated through the Monte Carlo method, and the bootstrap p-value of the test can then be computed as the empirical proportion of values of the bootstrap statistic that are greater, in absolute value, than the sample statistic [38].
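The bootstrap test for the midpoint coefficient can be sketched along the lines described above: build the null-compliant variable, re-sample pairs, and take the proportion of bootstrap statistics exceeding the sample one in absolute value. The function `bootstrap_test_alpha` and the toy data are hypothetical and serve only as an illustration under these assumptions.

```python
import math
import random


def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


def bootstrap_test_alpha(mid_x, mid_y, alpha0, n_boot=1000, seed=0):
    """Return (sample statistic S_n^0, bootstrap p-value) for H0: alpha = alpha0."""
    rng = random.Random(seed)
    n = len(mid_x)
    a_hat = ols_slope(mid_x, mid_y)
    stat = math.sqrt(n) * (a_hat - alpha0)
    # Z_M = mid Y - alpha_hat mid X + alpha0 mid X satisfies H0 by construction.
    z_m = [y - a_hat * x + alpha0 * x for x, y in zip(mid_x, mid_y)]
    exceed = done = 0
    while done < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [mid_x[i] for i in idx]
        zs = [z_m[i] for i in idx]
        if len(set(xs)) < 2:                    # slope undefined: skip re-sample
            continue
        s_star = math.sqrt(n) * (ols_slope(xs, zs) - alpha0)
        exceed += abs(s_star) > abs(stat)
        done += 1
    return stat, exceed / n_boot


# Toy midpoints (hypothetical data with true slope 2).
_rng = random.Random(4)
mid_x = [_rng.gauss(1, 1) for _ in range(50)]
mid_y = [2 * x + 1 + _rng.gauss(0, 1) for x in mid_x]

stat_true, p_true = bootstrap_test_alpha(mid_x, mid_y, alpha0=2.0)
stat_far, p_far = bootstrap_test_alpha(mid_x, mid_y, alpha0=5.0)
print(round(p_true, 3), round(p_far, 3))
```

The same scheme applies to the spread coefficient by replacing midpoints with spreads and the LS slope with the unconstrained estimator of Equation (12).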
Let us consider now the bootstrap resolution of the hypothesis testing for the regression parameter $\beta$ in Equation (15). Given the drawbacks of practical applicability of the asymptotic distribution for the statistics $T_n$ and $T_n^*$, we propose to employ the bootstrap version of the test statistic $T_n^{S0}$ defined in Equation (17). The bootstrap method shown previously for $\alpha$ can be analogously developed to test (15) based on $T_n^{S0}$. Classical bootstrap regression results guarantee both the asymptotic correctness and the consistency of the test.
Algorithms for the practical application of the inferences in real case-studies have been implemented in Matlab/Octave. They can be freely downloaded from http://bellman.ciencias.uniovi.es/SMIRE/IntervalLM.html. It is important to remark that analytic expressions for both the confidence sets and the rejection criteria of the hypothesis tests are obtained, so that the computational complexity of the algorithms is equivalent to classical regression inferences.
5. Simulation studies
In order to check the empirical behaviour of the proposed asymptotic and bootstrap inferences, some simulations are developed in this section. Let $X, \varepsilon : \Omega \to \mathcal{K}_c(\mathbb{R})$ be two random intervals playing the roles of the independent variable and the error in a model, respectively. Two linear models with the structure of model M (2) are investigated. The random intervals $X$ and $\varepsilon$ are defined through their (mid, spr)-parametrization.
Model M1: Let $\mathrm{mid}\,X \sim N(1,2)$, $\mathrm{spr}\,X \sim \chi^2_1$ and $\mathrm{mid}\,\varepsilon \sim N(0,1)$, $\mathrm{spr}\,\varepsilon \sim \chi^2_1 + 1$ be independent random variables, and let $Y$ be a random interval defined by a linear model in terms of $X$ as
$$Y = 2\,\mathrm{mid}\,X\,[1 \pm 0] + \mathrm{spr}\,X\,[0 \pm 1] + [1 \pm 0] + \varepsilon. \tag{18}$$
Model M2: Considering the same random error interval $\varepsilon$ as above, let $\mathrm{spr}\,X \sim \chi^2_1$ and $\mathrm{mid}\,X = \mathrm{spr}\,X + Z$, with $Z \sim N(0,1)$ independent of $\mathrm{spr}\,X$. The interval linear model defining $Y$ is
$$Y = -3\,\mathrm{mid}\,X\,[1 \pm 0] + 2\,\mathrm{spr}\,X\,[0 \pm 1] + [1 \pm 0] + \varepsilon. \tag{19}$$
In this situation, there is dependence between $\mathrm{mid}\,X$ and $\mathrm{spr}\,X$.
It is easy to check that the models (18) and (19) verify the theoretical conditions of
Theorem 3.1. Let us fix a significance level ρ=0.05. Based on the generation of k=10,000
random samples of different sample sizes nfrom the preceding theoretical situations, several
simulation studies have been developed.
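One way to generate samples from the two simulated models is sketched below, working directly with the (mid, spr) components. The function `simulate_model` is hypothetical, and the sketch assumes that N(1,2) denotes mean 1 and variance 2; if the second parameter is a standard deviation instead, the corresponding line must be changed.

```python
import math
import random


def simulate_model(n, alpha, beta, gamma, dependent_mid=False, seed=0):
    """Draw n intervals (X_i, Y_i) and return their mid and spr components."""
    rng = random.Random(seed)
    mid_x, spr_x, mid_y, spr_y = [], [], [], []
    for _ in range(n):
        sx = rng.gauss(0, 1) ** 2                 # chi-square with 1 d.f.
        if dependent_mid:
            mx = sx + rng.gauss(0, 1)             # model M2: mid X = spr X + Z
        else:
            mx = rng.gauss(1, math.sqrt(2))       # model M1 (assumes variance 2)
        mid_eps = rng.gauss(0, 1)
        spr_eps = rng.gauss(0, 1) ** 2 + 1        # chi-square(1) + 1
        mid_x.append(mx)
        spr_x.append(sx)
        mid_y.append(alpha * mx + gamma + mid_eps)
        spr_y.append(abs(beta) * sx + spr_eps)
    return mid_x, spr_x, mid_y, spr_y


m1 = simulate_model(100, alpha=2, beta=1, gamma=1, seed=10)
m2 = simulate_model(100, alpha=-3, beta=2, gamma=1, dependent_mid=True, seed=11)
print(len(m1[0]), min(m1[3]), min(m2[3]))
```

Repeating this generation k times for each sample size, and applying the confidence sets or tests to every replicate, gives empirical coverages and rejection rates of the kind reported in this section.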
First, the graphical representations of the empirical distributions of the normalized regression estimators $\hat\alpha$ and $\hat\beta$ for the two models are shown, jointly with the theoretical asymptotic distributions (solid lines), in Figures 1 and 2, respectively. In both cases, the empirical distributions approximate the theoretical probability function. Besides, it is shown that the larger the sample size $n$, the better the approximation. This result agrees with the convergences in law of the regression estimators recalled in preceding sections.
The empirical coverages of the bootstrap confidence sets obtained in Section 3for αand βin
models M1and M2have been computed. The results are collected in Table 1. The average width
of the 10,000 confidence sets in each case is also shown (values in parentheses in Table 1). It is
obtained that the coverages for all the cases are closer to the nominal level 0.95 as the number
of observations nincreases, as expected according to the theoretical consistency results for the
bootstrap statistic distributions presented in Section 3. In the case of αit is shown that, whereas
the three confidence sets have a similar average width, the coverage of the t-bootstrap confi-
dence set is the best one, followed by the hybrid and the percentile ones. This fact agrees with
Downloaded by [UOV University of Oviedo] at 01:21 17 October 2014
10 A. Blanco-Fernández and G. González-Rodríguez
–4 0 4
0
0.2
0.6
n= 30
–5 0 5
0
0.2
0.4
a b
n= 30
–4 0 4
0
0.2
0.6
n= 100
–5 0 5
0
0.2
0.4
n= 100
Figure 1. Empirical distributions of the normalized regression estimators ˆαand ˆ
βfor n=30 and n=100 in model M1,
and theoretical distributions (solid line).
–4 –2 0 2 4
0
0.4
0.8
–5 0 5
0
0.2
0.4
–4 –2 0 2 4
0
0.4
0.8
–5 0 5
0
0.2
0.4
n= 30 n= 30
n= 100 n= 100
a b
Figure 2. Empirical distributions of the normalized regression estimators ˆαand ˆ
βfor n=30 and n=100 in model M2,
and theoretical distributions (solid line).
the different speed convergence of the three alternatives [38]. In the case of β, the convergence
of the two confidence sets is quite different. Since CIH(β) is based on the regression estimator
ˆ
βgiven in Equation (4), the conditions guaranteeing the existence of residuals included in the
estimation process affect its coverage: the convergence to the nominal value is clearly slower
than the corresponding to CIS
H(β), based on classical results. In contrast, the inclusion of the
constraints makes the confidence sets CIH(β) being narrower in average. However, the differ-
ences in the average width of both intervals tend to vanish as the sample size increases, whereas
the differences in coverage remain large, specially in model M1. Thus, although the theoretical
Downloaded by [UOV University of Oviedo] at 01:21 17 October 2014
International Journal of Computer Mathematics 11
Table 1. Empirical coverage of the 95%-bootstrap confidence sets for αand β
(averaged width of the CIs in parentheses).
α β
Model nCIP(α) CIH(α) CIt(α) CIH(β) CIS
H(β)
M110 0.9289 0.9352 0.9313 0.8835 0.9341
(1.0252) (1.1025) (1.0312) (0.9856) (1.1420)
30 0.9301 0.9418 0.9374 0.8911 0.9378
(0.8521) (0.8456) (0.8225) (0.6647) (0.7425)
50 0.9360 0.9458 0.9466 0.9061 0.9410
(0.6257) (0.6198) (0.6057) (0.5462) (0.6124)
100 0.9460 0.9465 0.9476 0.9052 0.9452
(0.4832) (0.4801) (0.4768) (0.3650) (0.4023)
200 0.9475 0.9485 0.9494 0.9123 0.9476
(0.2965) (0.2896) (0.2854) (0.2578) (0.2777)
M210 0.9253 0.9239 0.9363 0.8926 0.9382
(1.0668) (1.0521) (1.1594) (2.0832) (2.7429)
30 0.9334 0.9336 0.9404 0.9136 0.9446
(0.4658) (0.4596) (0.4709) (0.7853) (0.8820)
50 0.9420 0.9428 0.9438 0.9258 0.9486
(0.3472) (0.3489) (0.3396) (0.5450) (0.6097)
100 0.9458 0.9466 0.9483 0.9285 0.9492
(0.2347) (0.2401) (0.2365) (0.3645) (0.4022)
200 0.9485 0.9489 0.9495 0.9387 0.9496
(0.1630) (0.1674) (0.1638) (0.2574) (0.2771)
Table 2. Percentage of rejections of the null hypotheses in simulations.

                         H0: α = α0                  H0: β = β0
Model              n     Asymptotic   Bootstrap     Asymptotic   Bootstrap
M1                 10    6.81         6.67          8.65         6.89
(α0 = 2, β0 = 1)   30    6.78         6.61          8.42         6.85
                   50    6.21         5.82          6.47         6.32
                   100   5.86         5.20          6.25         5.54
                   200   5.78         5.19          5.91         5.28
                   500   5.44         5.13          5.42         5.22
M2                 10    7.01         6.88          7.96         7.01
(α0 = −3, β0 = 2)  30    6.71         6.02          6.90         6.43
                   50    5.91         5.44          6.02         5.51
                   100   5.45         5.21          5.74         5.32
                   200   5.27         5.15          5.34         5.19
                   500   5.21         5.09          5.18         5.11
consistency of the two confidence sets for $\beta$ is guaranteed, $\mathrm{CI}^S_H(\beta)$ behaves better
than $\mathrm{CI}_H(\beta)$ empirically. Contrary to what happens with the confidence sets based on asymptotic
distributions (see [7]), the empirical coverages of the bootstrap confidence sets are accurate even
for small and moderate sample sizes, except for $\mathrm{CI}_H(\beta)$, whose coverage is significantly lower
than the nominal level in general.
On the other hand, to show that the empirical size of the tests for the regression parameters
proposed in Section 4 converges to the nominal one, the percentages of rejections of the
hypotheses $H_0: \alpha = \alpha_0$ and $H_0: \beta = \beta_0$, where $\alpha_0$ and $\beta_0$ are the known theoretical values in
models $M_1$ and $M_2$, respectively, have been computed. The results are gathered in Table 2. The
percentages of rejections of all the tests approach the nominal significance
level $100\rho\% = 5\%$ as $n$ increases. These findings show empirically the correctness of
the tests, and are consistent with the theoretical results in Section 4. In general, the percentages of
rejections of the bootstrap tests are closer to the nominal level than those of the corresponding asymptotic
tests for the same sample size. However, these differences are not very large, and both
techniques reject slightly more often than the nominal level for small samples, giving accurate approximations for large sample sizes.
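The kind of size study summarised in Table 2 can be mimicked in miniature. The following is only a sketch under stated assumptions: a classical simple linear regression with Gaussian errors stands in for the mid component of the interval model, and the function names, the error distribution and the number of replications are illustrative choices, not the simulation design of Section 5.

```python
import math
import random
from statistics import NormalDist


def ls_slope_and_se(x, y):
    """Least-squares slope and its estimated standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(rss / (n - 2) / sxx)


def empirical_size(n, alpha0, gamma=1.0, level=0.05, reps=2000, seed=0):
    """Percentage of rejections of H0: slope = alpha0 when H0 is true."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - level / 2)  # asymptotic (normal) critical value
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [gamma + alpha0 * xi + rng.gauss(0, 1) for xi in x]  # data generated under H0
        slope, se = ls_slope_and_se(x, y)
        if abs(slope - alpha0) / se > crit:
            rejections += 1
    return 100 * rejections / reps


for n in (10, 30, 100):
    print(n, empirical_size(n, alpha0=2.0))
```

In this toy setting the asymptotic test also rejects more often than the nominal 5% for small n and moves towards 5% as n grows, which is the qualitative pattern reported in Table 2.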
6. Empirical results
The practical applicability of the proposed inferences for model M is illustrated with some
examples. Two classical interval data sets are considered.
Example 1 The experiment deals with the measurement of the daily fluctuation of the systolic
and the diastolic blood pressures (BPs) of the patients in the Nephrology Unit of Valle del Nalón
Hospital located in Asturias, Spain. A sample data set of 59 patients has been provided by the
hospital, from a population of 3000 patients who are hospitalized per year in the hospital unit (see
Table 3). Physicians focused their medical interest on the fluctuation of the BPs of their patients,
so they collected the experimental data by registering only the lowest and highest measurements
of those magnitudes among the readings taken at different moments of a day (either manually by
a nurse, or by 24-hour ambulatory monitoring) [35]. The study of the relationship between
the two types of BP is an important task, extensively investigated in medical research [22,29].
Previous works on interval linear models have used this case-study to illustrate some
regression analysis methods, based on different samples of patients [4,8,10,23,25].
The relation between systolic and diastolic BP is a novel, theoretically attractive means of
investigating arterial stiffness [37]. For a given increase in diastolic BP, systolic BP is expected
to increase to a limited extent in a compliant artery, whereas the increase will be greater in a stiff
artery. The opposite holds for the increase in diastolic BP for a given increase in systolic BP,
which can be considered as a measure of arterial compliance. On the basis of these principles,
the Ambulatory Arterial Stiffness Index (AASI) has been developed. It is defined as one minus
the linear regression slope of diastolic BP on systolic BP. Such a measure of arterial stiffness
Table 3. Daily systolic (X) and diastolic (Y) blood pressure fluctuations of a sample of patients.
X Y X Y X Y
[11.8,17.3] [6.3,10.2] [11.9,21.2] [4.7,9.3] [9.8,16.0] [4.7,10.8]
[10.4,16.1] [7.1,10.8] [12.2,17.8] [7.3,10.5] [9.7,15.4] [6.0,10.7]
[13.1,18.6] [5.8,11.3] [12.7,18.9] [7.4,12.5] [8.7,15.0] [4.7,8.6]
[10.5,15.7] [6.2,11.8] [11.3,21.3] [5.2,11.2] [14.1,25.6] [7.7,15.8]
[12.0,17.9] [5.9,9.4] [14.1,20.5] [6.9,13.3] [10.8,14.7] [6.2,10.7]
[10.1,19.4] [4.8,11.6] [9.9,16.9] [5.3,10.9] [11.5,19.6] [6.5,11.7]
[10.9,17.4] [6.0,11.9] [12.6,19.7] [6.0,9.8] [9.9,17.2] [4.2,8.6]
[12.8,21.0] [7.6,12.5] [9.9,20.1] [5.5,12.1] [11.3,17.6] [5.7,9.5]
[9.4,14.5] [4.7,10.4] [8.8,22.1] [3.7,9.4] [11.4,18.6] [4.6,10.3]
[14.8,20.1] [8.8,13.0] [11.3,18.3] [5.5,8.5] [14.5,21.0] [10.0,13.6]
[11.1,19.2] [5.2,9.6] [9.4,17.6] [5.6,12.1] [12.0,18.0] [5.9,9.0]
[11.6,20.1] [7.4,13.3] [10.2,15.6] [5.0,9.4] [10.0,16.1] [5.4,10.4]
[10.2,16.7] [3.9,8.4] [10.3,15.9] [5.2,9.5] [15.9,21.4] [9.9,12.7]
[10.4,16.1] [5.5,9.8] [10.2,18.5] [6.3,11.8] [13.8,22.1] [7.0,11.8]
[10.6,16.7] [4.5,9.5] [11.1,19.9] [5.7,11.3] [8.7,15.2] [5.0,9.5]
[11.2,16.2] [6.2,11.6] [13.0,18.0] [6.4,12.1] [12.0,18.8] [5.3,10.5]
[13.6,20.1] [6.7,12.2] [10.3,16.1] [5.5,9.7] [9.5,16.6] [5.4,10.0]
[9.0,17.7] [5.2,10.4] [12.5,19.2] [5.9,10.1] [9.2,17.3] [4.5,10.7]
[11.6,16.8] [5.8,10.9] [9.7,18.2] [5.4,10.4] [8.3,14.0] [4.5,9.1]
[9.8,15.7] [5.0,11.1] [12.7,22.6] [5.7,10.1]
provides prognostic information on cardiovascular mortality, and it is unquestionably expected
to have a great success in the scientific and medical community [37].
By modelling the BP daily fluctuations by means of interval-valued variables, the linear
model M can be used to relate both pressures and to compute the corresponding slope and stiffness
index. Let $Y$ = systolic BP daily fluctuation and $X$ = diastolic BP daily fluctuation. The
estimated linear model M from the sample data in Table 3 is
$$\hat{Y} = 0.9656\,X^M + 0.7589\,X^S + 6.6283\,[1 \pm 0] + [-1.6137, 1.6137]. \tag{20}$$
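To make the interval arithmetic behind Equation (20) concrete, the sketch below evaluates the fitted model for one regressor interval. It assumes the usual reading of the model in [8], namely that $X^M = [\mathrm{mid}\,X \pm 0]$, $X^S = [0 \pm \mathrm{spr}\,X]$, and that the last term is the symmetric interval $[-1.6137, 1.6137]$; the function name and the input interval $[6.0, 10.0]$ are hypothetical.

```python
def predict_interval(x_low, x_high, alpha, beta, gamma, delta):
    """Evaluate alpha*X^M + beta*X^S + gamma*[1 +/- 0] + [-delta, delta].

    Assumes beta >= 0 and delta >= 0, so that the spread terms simply add up.
    """
    mid_x = (x_low + x_high) / 2
    spr_x = (x_high - x_low) / 2
    mid_y = alpha * mid_x + gamma   # only X^M and the intercept move the midpoint
    spr_y = beta * spr_x + delta    # only X^S and the error term contribute spread
    return mid_y - spr_y, mid_y + spr_y


# Coefficients taken from Equation (20); the regressor interval is illustrative.
low, high = predict_interval(6.0, 10.0, alpha=0.9656, beta=0.7589,
                             gamma=6.6283, delta=1.6137)
print(round(low, 4), round(high, 4))
```

The midpoint of the prediction depends only on $\alpha$ and the intercept, and its spread only on $\beta$ and the error term, which is why the two slopes can be studied separately.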
Since $\alpha \neq \beta$ in general, two different regression coefficients serve as slopes of diastolic BP on
systolic BP, corresponding to the mid and spr components of the variables, respectively. Thus,
two different stiffness indexes can be defined in this case: $\mathrm{AASI}_M$, from the relationship between
the variables mid $X$ and mid $Y$, and $\mathrm{AASI}_S$, on the basis of the slope of spr $X$ on spr $Y$. Since
$1 - \mathrm{AASI}_M = \alpha$ and $1 - \mathrm{AASI}_S = \beta$, inferences on the regression coefficients
translate immediately into results on the corresponding stiffness index. Several medical
studies report a mean AASI between 0.31 and 0.56, and the normal value of
AASI is estimated as 0.5 for middle-aged people [37]. Equivalently, the corresponding
slope of the regression of diastolic vs. systolic BP is estimated as 0.5, ranging between 0.44 and 0.69. We
can check whether the considered population of patients fulfils these medical descriptions on the
basis of the available sample data set.
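The translation between AASI values and slopes used above is a one-line computation. A minimal sketch (the function names are illustrative):

```python
def slope_from_aasi(aasi):
    """AASI is defined as one minus the regression slope, so slope = 1 - AASI."""
    return 1.0 - aasi


def slope_range_from_aasi_range(aasi_low, aasi_high):
    # A larger AASI corresponds to a smaller slope, so the end points swap.
    return slope_from_aasi(aasi_high), slope_from_aasi(aasi_low)


low, high = slope_range_from_aasi_range(0.31, 0.56)
print(round(low, 2), round(high, 2))  # 0.44 0.69
```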
On one hand, the 95%-bootstrap confidence sets for the regression parameters shown in
Section 3 are constructed, based on 1000 bootstrap iterations. For $\alpha$, the computation of
Equations (6)–(8) gives
$$\mathrm{CI}_P(\alpha)_{0.95} = [0.6751, 1.2127],$$
$$\mathrm{CI}_H(\alpha)_{0.95} = [0.7185, 1.2561],$$
$$\mathrm{CI}_t(\alpha)_{0.95} = [0.6861, 1.2377].$$
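The first two intervals are related in a simple way. The reported end points are consistent with $\mathrm{CI}_H(\alpha)$ being the reflection of $\mathrm{CI}_P(\alpha)$ about the point estimate $\hat\alpha = 0.9656$, which is the usual relation between a percentile and a hybrid (basic) bootstrap interval. Since Equations (6)–(7) are not reproduced here, this reading is an assumption, and the helper below is only an illustrative check.

```python
def reflect_interval(interval, estimate):
    """Reflect a percentile-type interval about the point estimate."""
    low, high = interval
    return 2 * estimate - high, 2 * estimate - low


alpha_hat = 0.9656            # slope of the mid component in Equation (20)
ci_p = (0.6751, 1.2127)       # reported percentile interval
ci_h = reflect_interval(ci_p, alpha_hat)
print(tuple(round(v, 4) for v in ci_h))
```

The result agrees with the reported $\mathrm{CI}_H(\alpha)_{0.95} = [0.7185, 1.2561]$, and the two intervals necessarily have the same width.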
In the case of $\beta$, the two investigated alternatives (11) and (13) lead to the following confidence
sets:
$$\mathrm{CI}_H(\beta)_{0.95} = [0.6231, 1.1871],$$
$$\mathrm{CI}^S_H(\beta)_{0.95} = [0.4254, 1.1871].$$
They are very similar because, for the available sample data, the estimate of $\beta$
in Equation (20) coincides with the sample term $\hat\beta_S$. However, in practical situations in which
$\hat\beta \neq \hat\beta_S$, the confidence sets $\mathrm{CI}_H(\beta)_{1-\rho}$ and $\mathrm{CI}^S_H(\beta)_{1-\rho}$ might be significantly different (see
Example 2).
Hypothesis tests for the regression parameters of the model (and so for the corresponding
AASI) can also be solved by following the results presented in Section 4. Any fixed
value $\alpha_0 \in \mathbb{R}$ can be chosen to test whether $\alpha$ equals it or not (analogously for $\beta$). In general,
those real values are fixed depending on the specific strengths of the relationship between the
response interval $Y$ and the mid and spr components of the regressor $X$ which are of interest
in the case-study. In regression analysis it is usually convenient to test first the explanatory
power of the independent variable on the response by means of the model. This is done by
checking whether the corresponding regression parameter vanishes. In this case, it is possible to test
the explanatory power of both components mid $X$ and spr $X$ of the interval $X$ on $Y$ by means of the
linear model M by testing $H_0: \alpha = 0$ and $H_0: \beta = 0$, respectively. The asymptotic and the bootstrap
testing methods proposed in Section 4 provide a p-value < .001 for the two null hypotheses. As
a conclusion, the explanatory power of the interval $X$ on $Y$ by means of the model M is significant
through both components mid $X$ and spr $X$. Let us test the normal value of 0.50 for AASI found
in medical studies. Considering the corresponding index between the midpoints of the intervals,
it is equivalent to test $H_0: \alpha = 0.50$. Both the asymptotic and the bootstrap testing procedures
reject the null hypothesis with a p-value < .01.
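For readers who want to experiment with this kind of test, the following sketch computes a two-sided bootstrap p-value for $H_0: \alpha = \alpha_0$ in a classical regression of midpoints. It is a generic paired-bootstrap scheme with the resampled statistic centred at $\hat\alpha$, offered under the assumption that it is a reasonable stand-in; it is not claimed to be the exact procedure of Section 4, and all names are hypothetical.

```python
import math
import random


def ls_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_pvalue(x, y, alpha0, n_boot=1000, seed=0):
    """Two-sided bootstrap p-value for H0: slope = alpha0 (paired resampling)."""
    rng = random.Random(seed)
    n = len(x)
    alpha_hat = ls_slope(x, y)
    observed = abs(math.sqrt(n) * (alpha_hat - alpha0))
    exceed = 0
    done = 0
    while done < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        if max(xb) == min(xb):
            continue  # degenerate resample: the slope is undefined, draw again
        yb = [y[i] for i in idx]
        # Centring at alpha_hat makes the resampled statistic mimic the null law.
        stat = abs(math.sqrt(n) * (ls_slope(xb, yb) - alpha_hat))
        exceed += stat >= observed
        done += 1
    return exceed / n_boot
```

Applied to the midpoints of a sample such as the one in Table 3, `bootstrap_pvalue(mid_x, mid_y, 0.5)` would return the bootstrap analogue of the p-value quoted above for this simplified scheme.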
Since large values of AASI are associated with an increased risk of cardiovascular mortality
in hypertensive patients [37], we might also check whether the population of patients in the hospital
does not exceed the upper normal estimated value. It can be done by testing $H_0: \alpha \le 0.69$.
By simple modifications of the proposed tests for $\alpha$ in Section 4, it is possible to define one-sided
rejection regions for the null hypothesis $H_0: \alpha \le 0.69$. It is rejected at the usual nominal
significance levels.
As a conclusion, the investigated population of patients in this case-study does not fulfil
the medical specifications estimated in previous studies. Inferential results indicate that in this
population the values of AASI are generally greater than the normal ones.
Example 2 A well-known interval data set introduced by Ichino [27] is considered to show an
alternative application of the inferences for the model M. This data set is widely employed in
the literature to illustrate practical aspects of statistical methods for intervals [11,31]. It refers to
various classes of oils described by several interval-valued variables. The simple linear model M
can be formalized to estimate the linear relationship between any pair of variables. For example,
let $X$ = iodine value and $Y$ = saponification value of an oil. The sample data set corresponding
to $X$ and $Y$ is gathered in Table 4. The estimated model M between $Y$ and $X$ from that sample is
$$\hat{Y} = -0.1355\,X^M + 0.500\,X^S + 203.1074\,[1 \pm 0] + [-4.7188, 4.7188]. \tag{21}$$
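The coefficients in Equation (21) can be recomputed from the data of Table 4. The sketch below does so under assumptions that are not spelled out in this section: $\hat\alpha$ and $\hat\beta_S$ are taken as the classical least-squares slopes of the mid and spr components, the feasibility bound is taken as $\hat{s}_0 = \min\{\mathrm{spr}\,Y_i / \mathrm{spr}\,X_i : \mathrm{spr}\,X_i \neq 0\}$, the constrained estimate as $\hat\beta = \min\{\hat{s}_0, \max\{0, \hat\beta_S\}\}$, and the intercept and error spread are obtained from sample means. The function and key names are illustrative.

```python
from statistics import mean

# Table 4: (iodine value X, saponification value Y), each given as (inf, sup).
OILS = {
    "Linseed": ((170, 204), (118, 196)),
    "Perilla": ((192, 208), (188, 197)),
    "Cotton": ((99, 113), (189, 198)),
    "Sesame": ((104, 116), (187, 193)),
    "Camellia": ((80, 82), (189, 193)),
    "Olive": ((79, 90), (187, 196)),
    "Beef": ((40, 48), (190, 199)),
    "Hog": ((53, 77), (190, 202)),
}


def mid(interval):
    return (interval[0] + interval[1]) / 2


def spr(interval):
    return (interval[1] - interval[0]) / 2


def ls_slope(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def fit_model_m(pairs):
    mid_x = [mid(x) for x, _ in pairs]
    mid_y = [mid(y) for _, y in pairs]
    spr_x = [spr(x) for x, _ in pairs]
    spr_y = [spr(y) for _, y in pairs]
    alpha = ls_slope(mid_x, mid_y)
    beta_s = ls_slope(spr_x, spr_y)            # unconstrained slope of the spreads
    # Largest beta for which every residual spr Y_i - beta * spr X_i stays >= 0.
    s0 = min(sy / sx for sx, sy in zip(spr_x, spr_y) if sx > 0)
    beta = min(s0, max(0.0, beta_s))           # constrained estimate
    gamma = mean(mid_y) - alpha * mean(mid_x)
    delta = mean(spr_y) - beta * mean(spr_x)   # half-width of the error term
    return {"alpha": alpha, "beta_S": beta_s, "s0": s0,
            "beta": beta, "gamma": gamma, "delta": delta}


fit = fit_model_m(list(OILS.values()))
for name, value in fit.items():
    print(f"{name:7s} {value:.4f}")
```

Under these assumptions the computed values agree with Equation (21) to the reported precision, and the constraint is active: the unconstrained slope of the spreads is about 2.044, while the bound is 0.5, attained by the Sesame and Hog classes.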
The asymptotic techniques are not applied in this case-study since the sample size is small. The
95%-bootstrap confidence sets for the regression parameters $\alpha$ and $\beta$ based on 1000 iterations
are
$$\mathrm{CI}_P(\alpha)_{0.95} = [-0.3327, 0.0029],$$
$$\mathrm{CI}_H(\alpha)_{0.95} = [-0.2738, 0.0618],$$
$$\mathrm{CI}_t(\alpha)_{0.95} = [-0.2183, 0.0602],$$
and
$$\mathrm{CI}_H(\beta)_{0.95} = [0.4375, 0.8425],$$
$$\mathrm{CI}^S_H(\beta)_{0.95} = [0.9212, 3.9306],$$
respectively. In this case, the two confidence sets for $\beta$ are quite different, since
the estimate $\hat\beta$ in Equation (21) equals the feasible upper bound $\hat{s}_0 = 0.500 < \hat\beta_S = 2.044$. This
Table 4. Oil data set.
Oil class Iodine value Saponification value
Linseed [170,204] [118,196]
Perilla [192,208] [188,197]
Cotton [99,113] [189,198]
Sesame [104,116] [187,193]
Camellia [80,82] [189,193]
Olive [79,90] [187,196]
Beef [40,48] [190,199]
Hog [53,77] [190,202]
fact makes the confidence set based on $\hat\beta$ narrower, but its coverage may be lower than the
nominal level 95%, according to the simulation studies developed in Section 5 (see Table 1).
To study the explanatory power of mid $X$ and spr $X$ to model $Y$ by means of the linear model
M, the hypothesis $H_0: \alpha = 0$ is not rejected by the bootstrap test (bootstrap p-value = 0.185
from 1000 iterations). Thus, it is concluded that mid $X$ does not contribute significantly to explaining
linearly the interval $Y$. Nevertheless, $H_0: \beta = 0$ is rejected with a bootstrap p-value < 0.01, so
that spr $X$ has a significant linear explanatory power on $Y$.
7. Conclusions
The main objective of this work has been to complete the inferential studies for the regres-
sion parameters of the flexible linear regression model M for interval-valued data presented in
[8]. On one hand, the construction of confidence sets for the coefficients by means of boot-
strap techniques has been tackled. On the other hand, the resolution of hypothesis testing for the
regression parameters has been addressed. Both asymptotic and bootstrap techniques have been
applied. Whereas the theoretical validity of the asymptotic methods is deduced from the limit
distributions of the regression estimators shown in [7], the bootstrap procedures are theoretically
supported by some new results, in general or in certain population conditions, depending on
the chosen estimator. The empirical behaviour of all the investigated methods is illustrated by
simulations and practical applications.
It is important to recall that the inferential studies for the parameter $\beta$ require careful attention,
since the asymptotic distribution of the estimator is sometimes not usable in practical applications
of the procedures. To avoid these inconveniences, an alternative process based on partial
information from the estimation process is also proposed.
Once the inferences for the regression parameters are solved, the inferential study of the inter-
val linear model M might be extended in some directions. For instance, it may be interesting to
check the adequacy of the linear regression for modelling the relationship between the involved
random intervals, as a previous step to the estimation and the inferential developments. This can
be investigated through a linearity test.
The extension of the simple linear model to the multiple case is also a target to be addressed.
Some initial studies on set-arithmetic multivariate linear models for interval-valued variables
have already been developed [20,21]. Numerical optimization techniques are used to estimate the
multiple models, so numerical constrained statistical inference methods should be applied in this
case, leading to numeric approximate solutions [21]. When only two interval-valued variables
are involved in the regression problem, the analytic results for the inferences of the simple linear
model M investigated in this work are more efficient.
It is finally remarked that alternative estimation procedures could be applied to estimate the
models for intervals, for instance those based on likelihood functions (see [32] for a recent proposal
for symbolic data). However, for interval models based on set arithmetic, the internal structure of
the arithmetic, as well as the already-mentioned lack of parametric distributions for intervals that
have been shown to be realistic enough, entail limitations for those techniques. This is an open
problem subject to the determination of suitable distribution models for random intervals.
Acknowledgments
The authors wish to thank the referees and the associate editor handling the manuscript for their very helpful suggestions
and comments to improve the work. The research has been partially supported by the Spanish Government Grant
MTM2013-44212-P and by the Principality of Asturias through SV-PA-13-ECOEMP-66 Grant. Their financial support
is gratefully acknowledged.
References
[1] R. J. de Almeida and R. Andrade, Package ‘ISDA.R’. Available at cran.r-project.org/web/packages/ISDA.R/ISDA.R.pdf.
[2] Z. Artstein and R.A. Vitale, A strong law of large numbers for random compact sets, Ann. Probab. 3 (1975), pp.
879–882.
[3] A. Beresteanu and F. Molinari, Asymptotic properties for a class of partially identified models, Econometrica 76(4)
(2008), pp. 763–814.
[4] L. Billard and E. Diday, Regression analysis for interval-valued data, in Data Analysis, Classification and Related
Methods, H.A.L. Kiers, J.P. Rasson, P.J.F. Groenen, and M. Schader, eds., Springer, Heidelberg, 2000, pp. 369–374.
[5] L. Billard and E. Diday, From the statistics of data to the statistics of knowledge: Symbolic data analysis, J. Am.
Statist. Assoc. 98 (2003), pp. 470–487.
[6] L. Billard and E. Diday, Symbolic Data Analysis, Wiley, New York, 2006.
[7] A. Blanco-Fernández, A. Colubi and G. González-Rodríguez, Confidence sets in a linear regression model for
interval data, J. Statist. Plan. Inference 142(6) (2012), pp. 1320–1329.
[8] A. Blanco-Fernández, N. Corral and G. González-Rodríguez, Estimation of a flexible simple linear model for
interval data based on set arithmetic, Comput. Stat. Data Anal. 55(9) (2011), pp. 2568–2578.
[9] H.H. Bock, Symbolic data, in Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical
Information from Complex Data, H.H. Bock and E. Diday, eds., Springer, Heidelberg, 2000, pp. 39–53.
[10] F.A.T. de Carvalho, E.A. Lima Neto and C.P. Tenorio, A new method to fit a linear regression model for interval-
valued data, in Advances in Artificial Intelligence, Lecture Notes in Computer Science, Vol. 3238, S. Biundo, T.
Frühwirth, and G. Palm, eds., Springer, Berlin, Heidelberg, 2004, pp. 295–306.
[11] T. Denoeux and M. Masson, Multidimensional scaling of interval-valued dissimilarity data, Pattern Recognit. Lett.
21 (2000), pp. 83–92.
[12] P. Diamond, Least squares fitting of compact set-valued data, J. Math. Anal. Appl. 147 (1990), pp. 531–544.
[13] S. Dias and P. Brito, A new linear regression model for histogram-valued variables, Proceedings of the 58th World
Statistics Congress, Dublin, 2011.
[14] E. Diday and M. Noirhomme-Fraiture, Symbolic Data Analysis and the SODAS Software, Wiley, New York, 2008.
[15] P. D’Urso, Linear regression analysis for fuzzy/crisp input and fuzzy/crisp output data, Comput. Stat. Data Anal.
42 (2003), pp. 47–72.
[16] P. D’Urso and P. Giordani, A least squares approach to principal component analysis for interval valued data,
Chemometr. Intell. Lab. Syst. 70 (2004), pp. 179–192.
[17] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, Society for Industrial and Applied Mathematics,
Philadelphia, 1982.
[18] B. Efron and R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, New York, 1993.
[19] D.A. Freedman, Bootstrapping regression models, Ann. Stat. 9 (1981), pp. 1218–1228.
[20] M. García Bárzana, A. Colubi, and E. Kontoghiorghes, A flexible multiple linear regression model for interval data,
Proceeding of the 8th International Conference on Computational Management Science, Neuchatel, Switzerland,
2011, p. 12.
[21] M. García Bárzana, A. Colubi, and E. Kontoghiorghes, Estimating the parameters of a interval-valued multiple
regression model, Proceeding of the 5th International Conference of the ERCIM Working Group on Computing &
Statistics, Oviedo, Spain, 2012, p. 106.
[22] B. Gavish, I.Z. Ben-Dov, and M. Bursztyn, The linear relationship between systolic and diastolic blood pressure
monitored over 24 h: Assessment and correlates, J. Hypertension 26 (2008), pp. 199–209.
[23] M.A. Gil, G. González-Rodríguez, A. Colubi and M. Montenegro, Testing linear independence in linear models
with interval-valued data, Comput. Stat. Data Anal. 51 (2007), pp. 3002–3015.
[24] M.A. Gil, A. Lubiano, M. Montenegro, and M.T. López-García, Least squares fitting of an affine function and
strength of association for interval-valued data, Metrika 56 (2002), pp. 97–111.
[25] P. Giordani, Lasso-based linear regression for interval-valued data, Proceedings of the 58th World Statistical
Congress, Dublin, 2011.
[26] G. González-Rodríguez, A. Blanco, N. Corral, and A. Colubi, Least squares estimation of linear regression models
for convex compact random sets, Adv. Data Anal. Classif. 1 (2007), pp. 67–81.
[27] M. Ichino, General metrics for mixed features – the cartesian space theory for pattern recognition, Proceeding of
the 1988 IEEE International Conference on Systems, Man, and Cybernetics, Beijing, China, Vol. 1, International
Academic Publishers, Beijing, 1988, pp. 494–497.
[28] G.R. Jahanshahloo, F. Hosseinzadeh Lotfi, M. Rostamy Malkhalifeh, and M. Ahadzadeh Namin, A generalized
model for data envelopment analysis with interval data, Appl. Math. Model. 33 (2008), pp. 3237–3244.
[29] W.B. Kannel, T. Gordon and M.J. Schwartz, Systolic versus diastolic blood pressure and risk of coronary heart
disease, Am. J. Cardiol. 27 (1971), pp. 335–346.
[30] R. Körner, On the variance of fuzzy random variables, Fuzzy Sets Syst. 92 (1997), pp. 83–93.
[31] C.N. Lauro and F. Palumbo, Principal component analysis for non-precise data, in New Developments in Classifi-
cation and Data Analysis. Studies in Classification, Data Analysis and Knowledge Organization. Part II, M. Vichi,
P. Monari, S. Mignani, and A. Montanari, eds., Springer, Heidelberg, 2005, pp. 173–184.
[32] J. Le-Rademacher and L. Billard, Likelihood functions and some maximum likelihood estimators for symbolic data,
J. Statist. Plan. Inference 141 (2011), pp. 1593–1602.
[33] E.A. Lima Neto and F.A.T. de Carvalho, Constrained linear regression models for symbolic interval-valued variables,
Comput. Stat. Data Anal. 54 (2010), pp. 333–347.
[34] M.A. Lubiano and M.A. Gil, Estimating the expected value of fuzzy random variables in random samplings from finite
populations, Statist. Pap. 40(3) (1999), pp. 277–295.
[35] T.G. Pickering, J.E. Hall, L.J. Appel, B.E. Falkner, J. Graves, M.N. Hill, D.W. Jones, T. Kurtz, S.G. Sheps, and E.J. Roccella,
Recommendations for blood pressure measurement in humans and experimental animals. Part 1: Blood pressure
measurement in humans, Circulation 111 (2005), pp. 697–716.
[36] C. Rivero and T. Valdes, An algorithm for robust linear estimation with grouped data, Comput. Stat. Data Anal. 53
(2008), pp. 255–271.
[37] G. Schillaci and G. Parati, Ambulatory arterial stiffness index: Merits and limitations of a simple surrogate measure of
arterial compliance, J. Hypertension 26 (2008), pp. 182–185.
[38] J. Shao and D. Tu, The Jackknife and Bootstrap, Springer, New York, 1995.
[39] M.S. Srivastava and V.K. Srivastava, Asymptotic distribution of least squares estimator and a test statistic in linear
regression models, Econ. Lett. 21 (1986), pp. 173–176.
Appendix
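Proof of Theorem 3.1 From the bootstrap sample of spreads of the intervals $\{\mathrm{spr}\,X_i^*, \mathrm{spr}\,Y_i^*\}_{i=1}^n$, let us define the
following expression:
$$\hat{s}_1^* = \Big(\prod_{i=1}^n I[\mathrm{spr}\,X_i^* = 0]\Big) \max\{0, \hat\beta_S^*\} + \Big(1 - \prod_{i=1}^n I[\mathrm{spr}\,X_i^* = 0]\Big)\,\beta + g^{-1}\Big(\Big(1 - \prod_{i=1}^n I[\mathrm{spr}\,X_i^* = 0]\Big) \min_{i=1,\dots,n} Z_i^*\Big), \tag{A1}$$
where $g: [0, \infty) \to [0, 1)$ is a continuous and strictly increasing function such that $g(0) = 0$ and $\lim_{x\to\infty} g(x) = 1$, and
$\{Z_i^*\}_{i=1}^n$ are resampled from $\{Z_i\}_{i=1}^n$ defined as
$$Z_i = \begin{cases} g\big(\mathrm{spr}\,\varepsilon_i / \mathrm{spr}\,X_i\big) & \text{if } \mathrm{spr}\,X_i \neq 0, \\ 1 & \text{if } \mathrm{spr}\,X_i = 0. \end{cases}$$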
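The bootstrap statistic $T_n^* = \sqrt{n}(\hat\beta^* - \hat\beta)$ can be written as
$$T_n^* = \min\big\{\sqrt{n}(\hat{s}_1^* - \hat\beta),\ \max\{-\sqrt{n}\,\hat\beta,\ \sqrt{n}(\hat\beta_S^* - \hat\beta)\}\big\}. \tag{A2}$$
The convergence in law of the three sample terms involved in the expression (A2) is studied separately in the
following. First, since the estimator $\hat\beta$ is strongly consistent w.r.t. the parameter $\beta$ [8], it is immediate to conclude that
$$-\sqrt{n}\,\hat\beta \xrightarrow{n\to\infty} -\infty \cdot \beta = -\infty \quad \text{a.s.-}[P]. \tag{A3}$$
Let us consider now the term $\sqrt{n}(\hat\beta_S^* - \hat\beta)$. By adding and subtracting the estimator $\hat\beta_S$ we obtain
$$\sqrt{n}(\hat\beta_S^* - \hat\beta) = \sqrt{n}(\hat\beta_S^* - \hat\beta_S) + \sqrt{n}(\hat\beta_S - \hat\beta). \tag{A4}$$
On one hand, the well-known result on the consistency of the bootstrap distribution associated with the LS estimator of
the classical linear regression model between the spr variables of $X$ and $Y$ allows us to assure that $\sqrt{n}(\hat\beta_S^* - \hat\beta_S)$ converges
in law a.s.-[P] to the same distribution as the sample version $\sqrt{n}(\hat\beta_S - \beta)$, which is $N_1 \equiv N(0, \sigma^2_{\mathrm{spr}\,\varepsilon} / \sigma^2_{\mathrm{spr}\,X})$.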
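On the other hand, simple computations on $\sqrt{n}(\hat\beta_S - \hat\beta)$, substituting $\hat\beta$ by its expression in (4), lead to
$$\sqrt{n}(\hat\beta_S - \hat\beta) = \max\big\{\sqrt{n}(\hat\beta_S - \hat{s}_1),\ \min\{\sqrt{n}\,\hat\beta_S, 0\}\big\}, \tag{A5}$$
where $\hat{s}_1$ is the sample version of $\hat{s}_1^*$ in Equation (A1). It is straightforward to show that $\min\{\sqrt{n}\,\hat\beta_S, 0\} \xrightarrow{n\to\infty} 0$ a.s.-[P].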
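The "simple computations" leading to (A5) can be written out in two steps. The derivation below is a sketch that assumes the expression in (4) has the form $\hat\beta = \min\{\hat{s}_1, \max\{0, \hat\beta_S\}\}$, which is the form suggested by (A2) and (A5); it uses only two elementary identities for real numbers.

```latex
% Assumption: Equation (4) reads \hat\beta = \min\{\hat s_1, \max\{0, \hat\beta_S\}\}.
\begin{align*}
\hat\beta_S - \hat\beta
  &= \hat\beta_S - \min\{\hat s_1,\ \max\{0, \hat\beta_S\}\} \\
  &= \max\{\hat\beta_S - \hat s_1,\ \hat\beta_S - \max\{0, \hat\beta_S\}\}
     && \text{since } c - \min\{a, b\} = \max\{c - a,\ c - b\} \\
  &= \max\{\hat\beta_S - \hat s_1,\ \min\{\hat\beta_S, 0\}\}
     && \text{since } c - \max\{0, c\} = \min\{c, 0\}.
\end{align*}
% Multiplying by \sqrt{n} > 0 commutes with max and min, which gives (A5).
```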
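Moreover,
$$\sqrt{n}(\hat\beta_S - \hat{s}_1) = \Big(\prod_{i=1}^n I[\mathrm{spr}\,X_i = 0]\Big) \min\{\sqrt{n}\,\hat\beta_S, 0\} + \Big(1 - \prod_{i=1}^n I[\mathrm{spr}\,X_i = 0]\Big) \sqrt{n}(\hat\beta_S - \beta) - \sqrt{n}\, g^{-1}\Big(\Big(1 - \prod_{i=1}^n I[\mathrm{spr}\,X_i = 0]\Big) \min_{i=1,\dots,n} Z_i\Big). \tag{A6}$$
In [7] it is proved that the third term in Equation (A6) diverges to $-\infty$. Moreover, it is also shown that
$\prod_{i=1}^n I[\mathrm{spr}\,X_i = 0] \xrightarrow{P} 0$ as $n \to \infty$, so the first term in Equation (A6) converges in probability to 0. Finally, it is straightforward
to show that the second term in Equation (A6) converges in law to $N_1$. Thus, $\sqrt{n}(\hat\beta_S - \hat{s}_1) \to -\infty$ as $n \to \infty$. As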
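a result, in Equation (A5) it is obtained that $\sqrt{n}(\hat\beta_S - \hat\beta) \xrightarrow{n\to\infty} 0$. Then in Equation (A4) we have that
$$\sqrt{n}(\hat\beta_S^* - \hat\beta) \xrightarrow{L} N_1 \quad \text{a.s.-}[P]. \tag{A7}$$
It remains to study the convergence of the term $\sqrt{n}(\hat{s}_1^* - \hat\beta)$ in Equation (A2). It fulfils that
$$\sqrt{n}(\hat{s}_1^* - \hat\beta) = \Big(\prod_{i=1}^n I[\mathrm{spr}\,X_i^* = 0]\Big) \max\big\{-\sqrt{n}\,\hat\beta,\ \sqrt{n}(\hat\beta_S^* - \hat\beta)\big\} + \Big(1 - \prod_{i=1}^n I[\mathrm{spr}\,X_i^* = 0]\Big) \sqrt{n}(\beta - \hat\beta) + \sqrt{n}\, g^{-1}\Big(\Big(1 - \prod_{i=1}^n I[\mathrm{spr}\,X_i^* = 0]\Big) \min_{i=1,\dots,n} Z_i^*\Big) = A_1 + A_2 + A_3. \tag{A8}$$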
Analogously to the corresponding sample expression, it is straightforward to show that $\prod_{i=1}^n I[\mathrm{spr}\,X_i^* = 0] \xrightarrow{P} 0$ a.s.-[P].
By joining this result with the convergences shown in Equations (A3) and (A7) it is obtained that $A_1 \xrightarrow{P} 0$ a.s.-[P].
From Equation (10) it is immediate to conclude that $A_2 \xrightarrow{L} (-1)N_1$. Finally, the convergence of $A_3$ is investigated. By
following computations analogous to those in the proof of Theorem 4.2 in [7], the distribution function of $A_3$ can be written
as
$$F_{A_3}(x) = P\Big(\min_{i=1,\dots,n} Z_i^* \le g\big(x/\sqrt{n}\big)\Big),$$
for all $x > 0$, and $F_{A_3}(x) = 0$ for $x \le 0$. Since $\min_{i=1,\dots,n} Z_i^*$ is computed through a random resampling with replacement
from $\{Z_i\}_{i=1}^n$, it is obvious that $\min_{i=1,\dots,n} Z_i^* \ge \min_{i=1,\dots,n} Z_i$. Thus,
$$P\Big(\min_{i=1,\dots,n} Z_i^* \le g\big(x/\sqrt{n}\big)\Big) \le P\Big(\min_{i=1,\dots,n} Z_i \le g\big(x/\sqrt{n}\big)\Big),$$
where the right-side term has been shown to tend to 0. Therefore, $A_3$ diverges to $\infty$. As a result, in Equation (A8) it is
concluded that
$$\sqrt{n}(\hat{s}_1^* - \hat\beta) \xrightarrow{L} \infty. \tag{A9}$$
By joining the results shown in Equations (A3), (A7) and (A9), it is finally obtained from Equation (A2) that $T_n^* \xrightarrow{L} N_1$.
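Proof of Proposition 4.1 Assume that $H_0$ is true. Since $S_n^0 \xrightarrow{L} N(0, \sigma^2_{\mathrm{mid}\,\varepsilon} / \sigma^2_{\mathrm{mid}\,X})$ when $n \to \infty$, as shown in [39], it
is directly obtained that
$$\lim_{n\to\infty} P\big(|S_n^0| \le c_{\rho/2} \mid \alpha = \alpha_0\big) = 1 - \rho.$$
Thus, $C_\rho = \{(X_1, \dots, X_n) : |S_n^0| > c_{\rho/2}\}$ defines an asymptotic rejection region of $H_0$ for the nominal significance
level $\rho$. Let us suppose now that $H_0$ is not true, i.e. $\alpha = \alpha_1 \neq \alpha_0$. The test statistic $S_n^0 = \sqrt{n}(\hat\alpha - \alpha_0)$ can be expressed as
$$S_n^0 = \sqrt{n}(\hat\alpha - \alpha_1) + \sqrt{n}(\alpha_1 - \alpha_0).$$
Under $H_1: \alpha = \alpha_1$, the first term has a limit distribution $N(0, \sigma^2_{\mathrm{mid}\,\varepsilon} / \sigma^2_{\mathrm{mid}\,X})$, whereas the second term tends to $\infty$ if
$\alpha_1 > \alpha_0$, or to $-\infty$ if $\alpha_1 < \alpha_0$, as $n$ tends to $\infty$. Thus, $|S_n^0| \xrightarrow{n\to\infty} \infty$, and so
$$\lim_{n\to\infty} P\big(|S_n^0| > c_{\rho/2} \mid \alpha = \alpha_1\big) = 1.$$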
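The two limits in this proof can be illustrated numerically. The sketch below replaces the test statistic by its normal approximation, that is, it assumes $S_n^0 \sim N(\sqrt{n}(\alpha_1 - \alpha_0), \sigma^2)$ with $\sigma^2 = \sigma^2_{\mathrm{mid}\,\varepsilon} / \sigma^2_{\mathrm{mid}\,X}$, and takes $c_{\rho/2}$ as the upper $\rho/2$ quantile of $N(0, \sigma^2)$. The function name and the numerical values are illustrative.

```python
import math
from statistics import NormalDist


def asymptotic_power(n, alpha1, alpha0, sigma, level=0.05):
    """Approximate P(|S_n| > c) when S_n ~ N(sqrt(n) * (alpha1 - alpha0), sigma^2)."""
    z = NormalDist()
    c = sigma * z.inv_cdf(1 - level / 2)       # critical value under H0
    shift = math.sqrt(n) * (alpha1 - alpha0)   # drift of the statistic under H1
    upper = 1 - z.cdf((c - shift) / sigma)     # P(S_n > c)
    lower = z.cdf((-c - shift) / sigma)        # P(S_n < -c)
    return upper + lower


for n in (10, 50, 200, 1000):
    print(n, round(asymptotic_power(n, alpha1=0.6, alpha0=0.5, sigma=1.0), 3))
```

With $\alpha_1 = \alpha_0$ the function returns the nominal level for every $n$, and for any fixed $\alpha_1 \neq \alpha_0$ it increases to 1 as $n$ grows, mirroring the two statements of the proposition.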
Downloaded by [UOV University of Oviedo] at 01:21 17 October 2014
... (iv) Joint global and local structure discriminant analysis (JGLDA) [46]: for linear dimension reduction, it preserves the local intrinsic structure, which characterizes the geometric properties of similarity and diversity of data by two quadratic functions. (v) Flexible linear regression classification (FLRC) [47]: the inferences are based on the least-squares estimators of the model which have been shown to be coherent with the interval arithmetic defining the model and to verify good statistical properties. (vi) Discriminative least-squares regression (DLSR) [34]: DLSR is to embed class label information into the LSR formulation such that the distances between classes can be enlarged. ...
Article
Full-text available
The traditional label relaxation regression (LRR) algorithm directly fits the original data without considering the local structure information of the data. While the label relaxation regression algorithm of graph regularization takes into account the local geometric information, the performance of the algorithm depends largely on the construction of graph. However, the traditional graph structures have two defects. First of all, it is largely influenced by the parameter values. Second, it relies on the original data when constructing the weight matrix, which usually contains a lot of noise. This makes the constructed graph to be often not optimal, which affects the subsequent work. Therefore, a discriminative label relaxation regression algorithm based on adaptive graph (DLRR_AG) is proposed for feature extraction. DLRR_AG combines manifold learning with label relaxation regression by constructing adaptive weight graph, which can well overcome the problem of label overfitting. Based on a large number of experiments, it can be proved that the proposed method is effective and feasible.
Article
Some hypothesis tests for analyzing the degree of overlap between the expected value of random intervals are provided. For this purpose, a suitable measure to quantify the overlapping grade between intervals is considered on the basis of the Szymkiewicz-Simpson coefficient defined for general sets. It can be seen as a kind of likeness index to measure the mutual information between two intervals. On one hand, an estimator for the proposed degree of overlap between intervals is provided and its strong consistency is analyzed. On the other hand, two tests are also proposed in this framework: a one-sample test to examine the degree of overlap between the expected value of a random interval and a given interval, and a two-sample test to check the degree of overlap between the expected value of two random intervals. To solve such hypothesis tests, two statistics are suggested and their limit distributions are studied by considering both asymptotic and bootstrap techniques. Their power has been also explored by means of local alternatives. In addition, some simulation studies are carried out to investigate the behavior of the proposed approaches. Finally, the performance of the tests is also reported in a real-life application.
Article
Energy planning is crucial to regional sustainable development, it contributes to dealing with electricity demand and supply effectively and tackling air-pollution control in a long-term view. However, the planning is complicated with various factor interrelationships and uncertainties. In this paper, an inexact Bi-level optimization method based on provincial scale hybrid renewable and non-renewable energy planning is developed. This method incorporates Analytic Hierarchy Process based on induced ordered weighted averaging operator and demand side management policies (IOWA-AHP-DSM), interval linear programming (ILP), and bi-level programming method (BLP) into electric power system (EPS) to optimize energy planning and air pollution control. A case study with both environmental and economic objects in Shanxi Province, China, are involved to demonstrate the availability of this method. Seven renewable energy proportion scenarios (0%, 5%, 10%, 15%, 20%, 25% and 30%) are set in this study. Results show that as the proportion increases, the amount of power generation and capacity expansion from natural gas and renewable energy resources increases, while the amount of power from coal and oil, the pollutants emissions and the trading volume of SO2 decreases. According to the satisfaction degrees of these solutions, results show that it meets both goals when the proportion is 20%.
Conference Paper
Full-text available
When observations in large data sets are aggregated into smaller more manageable data sizes, the resulting classifications of observations invariably involve symbolic data. In this paper, covariance and correlation functions are introduced for interval-valued symbolic data. These and their associated terms are then used to fit linear regression models to such data. The methods are illustrated with an example from cardiology.
Article
A multiple interval-valued linear regression model considering all the cross-relationships between the mids and spreads of the intervals has been introduced recently. A least-squares estimation of the regression parameters has been carried out by transforming a quadratic optimization problem with inequality constraints into a linear complementary problem and using Lemke’s algorithm to solve it. Due to the irrelevance of certain cross-relationships, an alternative estimation process, the LASSO (Least Absolut Shrinkage and Selection Operator), is developed. A comparative study showing the differences between the proposed estimators is provided.
Article
Preface. Vectors of Random Variables. Multivariate Normal Distribution. Linear Regression: Estimation and Distribution Theory. Hypothesis Testing. Confidence Intervals and Regions. Straight-Line Regression. Polynomial Regression. Analysis of Variance. Departures from Underlying Assumptions. Departures from Assumptions: Diagnosis and Remedies. Computational Algorithms for Fitting a Regression. Prediction and Model Selection. Appendix A. Some Matrix Algebra. Appendix B. Orthogonal Projections. Appendix C. Tables. Outline Solutions to Selected Exercises. References. Index.
Conference Paper
Symbolic Data Analysis: Histogram-valued variables. In classical data analysis, each individual takes one "value" on each descriptive variable. Symbolic Data Analysis generalizes this framework by allowing each individual or class of individuals to take a finite set of values (quantitative multi-valued variables), a finite set of categories (qualitative multi-valued variables), an interval (interval-valued variable) or a distribution on each variable (modal-valued variables), which can be a histogram (histogram-valued variables). Interval-valued variables can be regarded as a particular case of histogram-valued variables if we consider histograms with only one interval, with probability one. The variable Y is a random histogram-valued variable if to each observation j there corresponds a probability or frequency distribution on a finite set of sub-intervals, Y(j). For observation j, Y(j) can be represented by a histogram (reference to Billard).
Article
The construction of confidence sets for the parameters of a flexible simple linear regression model for interval-valued random sets is addressed. For that purpose, the asymptotic distribution of the least-squares estimators is analyzed. A simulation study is conducted to investigate the performance of those confidence sets. In particular, the empirical coverages are examined for various interval linear models. The applicability of the procedure is illustrated by means of a real-life case study.