All content in this area was uploaded by Ana Belén Ramos-Guajardo on May 10, 2018
Two-sample similarity test for the
expected value of random intervals
Ana B. Ramos-Guajardo and Ángela Blanco-Fernández
Abstract The degree of similarity between the expectations of two random intervals is studied by means of a hypothesis testing procedure. For this purpose, a similarity measure for intervals is introduced, based on the so-called Jaccard index for convex sets. The measure ranges from 0 (if the intervals are not similar at all, i.e., if they do not overlap) to 1 (if the intervals are equal). A test statistic is proposed and its limit distribution is analyzed by means of asymptotic and bootstrap techniques. Simulation studies are carried out to examine the behaviour of the approach.
1 Introduction
Interval data arise in experimental studies involving ranges, fluctuations, subjective perceptions, censored data, grouped data, and so on [6, 7, 10]. Random intervals (RIs) have been shown to model and handle this kind of data suitably in different settings [2, 3, 11, 12].
The Aumann expectation of an RI is also an interval. Inferences concerning the Aumann expectation and, especially, hypothesis tests for the expected value of random intervals have been previously developed in the literature [5, 9]. Additionally, tests relaxing strict equalities have also been developed, for instance inclusion tests for the Aumann expectation of RIs [13] and similarity tests for the expected value of an RI and a previously fixed interval [14].
The aim of this work is to develop a two-sample test for the similarity of the expectations of two RIs. The similarity measure to be considered is based on the classical Jaccard similarity coefficient for convex sets [8], which can be seen as the ratio of the Lebesgue measure of the intersection to the Lebesgue measure of the union [15]. A statistic to solve the test is introduced, and its asymptotic and bootstrap limit distributions are theoretically analyzed. The bootstrap techniques allow us to approximate the sampling distribution of the statistic in practice, since the asymptotic one depends, in general, on unknown parameters. Finally, simulation studies are carried out to show the empirical behaviour of the procedure.

Department of Statistics and Operational Research, University of Oviedo, C/Calvo Sotelo, s/n, 33007, Oviedo, Spain.
blancoangela@uniovi.es, ramosana@uniovi.es
2 Preliminary concepts
From now on, let $\mathcal{K}_c(\mathbb{R})$ denote the family of non-empty closed and bounded intervals in $\mathbb{R}$. An interval $A \in \mathcal{K}_c(\mathbb{R})$ can be characterized by either its (mid, spr) representation (i.e., $A = [\operatorname{mid}A \pm \operatorname{spr}A]$, with $\operatorname{mid}A \in \mathbb{R}$ the midpoint or centre and $\operatorname{spr}A \ge 0$ the spread or radius of $A$) or its (inf, sup) representation (i.e., $A = [\inf A, \sup A]$).
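The usual interval arithmetic is based on Minkowski addition and the product by a scalar. In terms of the (mid, spr) representation it is expressed as $A_1 + \lambda A_2 = [(\operatorname{mid}A_1 + \lambda \operatorname{mid}A_2) \pm (\operatorname{spr}A_1 + |\lambda| \operatorname{spr}A_2)]$, for $A_1, A_2 \in \mathcal{K}_c(\mathbb{R})$ and $\lambda \in \mathbb{R}$.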
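The arithmetic above can be illustrated with a minimal Python sketch. The names `Interval`, `add_scaled` and `bounds` are hypothetical helpers introduced only for this illustration; they are not part of the original work.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Interval [mid - spr, mid + spr]; spr is assumed to be non-negative."""
    mid: float
    spr: float


def add_scaled(a: Interval, b: Interval, lam: float) -> Interval:
    """Return a + lam * b: mids combine linearly, spreads use |lam|."""
    return Interval(a.mid + lam * b.mid, a.spr + abs(lam) * b.spr)


def bounds(a: Interval) -> tuple[float, float]:
    """Convert the (mid, spr) representation to (inf, sup)."""
    return a.mid - a.spr, a.mid + a.spr


# [0, 2] + (-2) * [1.5, 2.5] = [0, 2] + [-5, -3] = [-5, -1]
result = add_scaled(Interval(1.0, 1.0), Interval(2.0, 0.5), -2.0)
print(result, bounds(result))
```

Note that the spread uses $|\lambda|$, so multiplying by a negative scalar reflects the interval without producing a negative radius.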
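The Lebesgue measure of $A \in \mathcal{K}_c(\mathbb{R})$ is given by $\lambda(A) = 2\operatorname{spr}A$. Obviously, the Lebesgue measure of the empty set is $\lambda(\emptyset) = 0$. In addition, the Lebesgue measure of the intersection between $A$ and $B$, $\lambda(A \cap B)$, for any $A, B \in \mathcal{K}_c(\mathbb{R})$ can be expressed as follows [15]:
\[
\lambda(A \cap B) = \max\bigl\{0,\ \min\bigl\{2\operatorname{spr}A,\ 2\operatorname{spr}B,\ \operatorname{spr}A + \operatorname{spr}B - |\operatorname{mid}A - \operatorname{mid}B|\bigr\}\bigr\}. \tag{1}
\]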
A measure of the degree of similarity between two intervals $A, B \in \mathcal{K}_c(\mathbb{R})$ can be defined according to the Jaccard coefficient [8] as
\[
S(A, B) = \frac{\lambda(A \cap B)}{\lambda(A \cup B)}. \tag{2}
\]
This similarity measure fulfils that $S(A, B) = 0$ iff $\lambda(A \cap B) = 0$, $S(A, B) = 1$ iff $A = B$, and $S(A, B) \in (0, 1)$ otherwise. As an example, the similarity measure of two intervals $A$ and $B$ is 1/2 whenever one interval is contained in the other and the length of the larger one is twice the length of the smaller one.
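Expressions (1) and (2) translate directly into code. The following is a minimal Python sketch; the function names are hypothetical, and the union length is computed as $\lambda(A) + \lambda(B) - \lambda(A \cap B)$, which is the standard inclusion-exclusion identity rather than a formula stated in the text.

```python
def intersection_length(mid_a, spr_a, mid_b, spr_b):
    """Lebesgue measure of the intersection of two intervals, as in (1)."""
    return max(0.0, min(2 * spr_a, 2 * spr_b,
                        spr_a + spr_b - abs(mid_a - mid_b)))


def jaccard_similarity(mid_a, spr_a, mid_b, spr_b):
    """Jaccard similarity (2); assumes at least one spread is positive."""
    inter = intersection_length(mid_a, spr_a, mid_b, spr_b)
    union = 2 * spr_a + 2 * spr_b - inter  # inclusion-exclusion
    return inter / union


# A = [0, 4] and B = [1, 3]: B lies inside A and A is twice as long.
print(jaccard_similarity(2.0, 2.0, 2.0, 1.0))  # 0.5
# Disjoint intervals [0, 1] and [2, 3] have similarity 0.
print(jaccard_similarity(0.5, 0.5, 2.5, 0.5))  # 0.0
```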
Random variables modelling situations in which the outcomes are intervals in $\mathcal{K}_c(\mathbb{R})$ are called random intervals (RIs). Given a probability space $(\Omega, \mathcal{A}, P)$, an RI is a Borel measurable mapping $X : \Omega \to \mathcal{K}_c(\mathbb{R})$ w.r.t. the well-known Hausdorff metric on $\mathcal{K}_c(\mathbb{R})$ [11]. Equivalently, $X$ is an RI if both $\operatorname{mid}X, \operatorname{spr}X : \Omega \to \mathbb{R}$ are real-valued random variables and $\operatorname{spr}X \ge 0$ a.s.-$[P]$.
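Whenever $\operatorname{mid}X, \operatorname{spr}X \in L^1(\Omega, \mathcal{A}, P)$, it is possible to define the Aumann expected value of $X$ [1]. In terms of classical expectations it is expressed as $E([\operatorname{mid}X \pm \operatorname{spr}X]) = [E(\operatorname{mid}X) \pm E(\operatorname{spr}X)]$. Let $\{X_i\}_{i=1}^n$ be a simple random sample of $X$. The corresponding sample expectation of $X$ is defined coherently in terms of the interval arithmetic as $\overline{X}_n = (1/n)\sum_{i=1}^n X_i$, and it fulfils $\overline{X}_n = [\overline{\operatorname{mid}X} \pm \overline{\operatorname{spr}X}]$, where $\overline{\operatorname{mid}X}$ and $\overline{\operatorname{spr}X}$ are the sample means of the midpoints and the spreads, respectively.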
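As a small illustration of the sample expectation, the sketch below averages the mids and the spreads of a sample of intervals. The function name `sample_aumann_mean` is a hypothetical helper, and NumPy is assumed to be available.

```python
import numpy as np


def sample_aumann_mean(mids, sprs):
    """Sample Aumann mean as a (mid, spr) pair: average mids and spreads."""
    mids = np.asarray(mids, dtype=float)
    sprs = np.asarray(sprs, dtype=float)
    if mids.shape != sprs.shape or np.any(sprs < 0):
        raise ValueError("need equally long inputs and non-negative spreads")
    return float(mids.mean()), float(sprs.mean())


# Intervals [0.5, 1.5], [1, 3] and [1.5, 4.5] average to [1, 3].
mid_bar, spr_bar = sample_aumann_mean([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])
print(mid_bar - spr_bar, mid_bar + spr_bar)  # 1.0 3.0
```

The result agrees with averaging the lower and the upper endpoints separately, as expected from the Minkowski arithmetic.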
3 Similarity test for the expected values of two RIs
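Let $(\Omega, \mathcal{A}, P)$ be a probability space, and $X, Y : \Omega \to \mathcal{K}_c(\mathbb{R})$ be two RIs such that $\operatorname{spr}E(X) > 0$ and $\operatorname{spr}E(Y) > 0$. Some mild conditions are assumed to guarantee the existence of the involved moments and to avoid trivial situations (as, for instance, the singularity of the covariance matrix). Thus, $X$ and $Y$ are supposed to belong to the following class of random intervals:
\[
\mathcal{P} = \Bigl\{ X : \Omega \to \mathcal{K}_c(\mathbb{R}) \;\Big|\; \sigma^2_{\operatorname{mid}X} < \infty,\ 0 < \sigma^2_{\operatorname{spr}X} < \infty \ \wedge\ \bigl(\operatorname{Cov}(\operatorname{mid}X, \operatorname{spr}X)\bigr)^2 \neq \sigma^2_{\operatorname{mid}X}\,\sigma^2_{\operatorname{spr}X} \Bigr\}.
\]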
Given $d \in [0, 1]$, the aim is to test
\[
H_0 : S(E(X), E(Y)) \ge d \quad \text{vs.} \quad H_1 : S(E(X), E(Y)) < d. \tag{3}
\]
The alternative one-sided and two-sided tests (that is, those analyzing if the Jaccard index of the expectations equals $d$ or if it is greater than or equal to $d$) could be analogously studied. We focus our attention on (3) since it seems to be the most appealing for practical applications. From (1) and (2) it is straightforward to show that the null hypothesis of the test (3) can be equivalently expressed as
\[
H_0 : \max\bigl\{ d\,\operatorname{spr}E(Y) - \operatorname{spr}E(X),\ d\,\operatorname{spr}E(X) - \operatorname{spr}E(Y),\ (1+d)\,|\operatorname{mid}E(X) - \operatorname{mid}E(Y)| + (d-1)\bigl(\operatorname{spr}E(X) + \operatorname{spr}E(Y)\bigr) \bigr\} \le 0. \tag{4}
\]
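The step from (1) and (2) to (4) can be written out as follows. This is a sketch of the omitted derivation, with the shorthand $A = E(X)$, $B = E(Y)$ and under the additional assumption $d \in (0, 1]$, so that the right-hand side below is positive and the outer maximum with 0 in (1) can be dropped.

```latex
\begin{align*}
S(A,B) \ge d
 &\iff \lambda(A\cap B) \ge d\,\bigl(\lambda(A)+\lambda(B)-\lambda(A\cap B)\bigr) \\
 &\iff (1+d)\,\lambda(A\cap B) \ge 2d\,(\operatorname{spr}A+\operatorname{spr}B) \\
 &\iff \min\bigl\{2\operatorname{spr}A,\ 2\operatorname{spr}B,\
       \operatorname{spr}A+\operatorname{spr}B-|\operatorname{mid}A-\operatorname{mid}B|\bigr\}
       \ge \tfrac{2d}{1+d}\,(\operatorname{spr}A+\operatorname{spr}B).
\end{align*}
% Each of the three arguments of the minimum gives one term of (4):
\begin{align*}
2(1+d)\operatorname{spr}A \ge 2d(\operatorname{spr}A+\operatorname{spr}B)
 &\iff d\,\operatorname{spr}B-\operatorname{spr}A \le 0, \\
2(1+d)\operatorname{spr}B \ge 2d(\operatorname{spr}A+\operatorname{spr}B)
 &\iff d\,\operatorname{spr}A-\operatorname{spr}B \le 0, \\
(1+d)\bigl(\operatorname{spr}A+\operatorname{spr}B-|\operatorname{mid}A-\operatorname{mid}B|\bigr)
 \ge 2d(\operatorname{spr}A+\operatorname{spr}B)
 &\iff (1+d)\,|\operatorname{mid}A-\operatorname{mid}B|
       +(d-1)(\operatorname{spr}A+\operatorname{spr}B) \le 0.
\end{align*}
```

All three inequalities hold simultaneously exactly when their maximum is non-positive, which is the form given in (4).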
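The resolution of the test is addressed below by considering an asymptotic approach. Let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be two samples of random intervals being independent and identically distributed as $X$ and $Y$, respectively. The test statistic is defined as
\[
T_n = \sqrt{n}\,\max\bigl\{ d\,\operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{X}_n,\ d\,\operatorname{spr}\overline{X}_n - \operatorname{spr}\overline{Y}_n,\ (1+d)\,\bigl|\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{Y}_n\bigr| + (d-1)\bigl(\operatorname{spr}\overline{X}_n + \operatorname{spr}\overline{Y}_n\bigr) \bigr\}. \tag{5}
\]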
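The statistic in (5) depends on the data only through the four sample means, as the following minimal Python sketch shows. The function name `similarity_statistic` and its argument layout (separate arrays of mids and spreads) are assumptions made for this illustration, and NumPy is assumed to be available.

```python
import numpy as np


def similarity_statistic(mid_x, spr_x, mid_y, spr_y, d):
    """Statistic T_n of (5) for two samples of equal size n."""
    n = len(mid_x)
    mx, sx = np.mean(mid_x), np.mean(spr_x)  # sample Aumann mean of X
    my, sy = np.mean(mid_y), np.mean(spr_y)  # sample Aumann mean of Y
    terms = (
        d * sy - sx,
        d * sx - sy,
        (1 + d) * abs(mx - my) + (d - 1) * (sx + sy),
    )
    return float(np.sqrt(n) * max(terms))


# Degenerate samples with means [0, 4] and [0, 6], whose similarity is 2/3:
# with d = 2/3 the statistic sits on the boundary of the null hypothesis.
t = similarity_statistic([2.0] * 4, [2.0] * 4, [3.0] * 4, [3.0] * 4, 2 / 3)
print(abs(t) < 1e-9)  # True
```

Large positive values of the statistic point towards the alternative in (3), since at least one of the three inequalities in (4) is then violated by the sample means.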
From now on, let us consider the bivariate normal distributions $Z = (z_1, z_2)^T \equiv N_2(0, \Sigma_1)$ and $U = (u_1, u_2)^T \equiv N_2(0, \Sigma_2)$, where $\Sigma_1$ is the covariance matrix for the random vector $(\operatorname{mid}X, \operatorname{spr}X)$ and $\Sigma_2$ is the corresponding one for $(\operatorname{mid}Y, \operatorname{spr}Y)$. The limit distribution of the statistic $T_n$ under $H_0$ is analyzed in the following result.
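Theorem 1. For $n \in \mathbb{N}$, let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be simple random samples from $X$ and $Y$, respectively. Let $T_n$ be defined as in (5). If $X, Y \in \mathcal{P}$, then:
a) Whenever $\operatorname{spr}E(X) = d\,\operatorname{spr}E(Y)$ and $\operatorname{mid}E(X) - \operatorname{mid}E(Y) = (1-d)\operatorname{spr}E(Y)$, it is fulfilled that
\[
T_n \xrightarrow{\mathcal{L}} \max\{ d u_2 - z_2,\ (1+d)(z_1 - u_1) + (d-1)(z_2 + u_2) \}. \tag{6}
\]
b) Whenever $\operatorname{spr}E(X) = d\,\operatorname{spr}E(Y)$ and $-\operatorname{mid}E(X) + \operatorname{mid}E(Y) = (1-d)\operatorname{spr}E(Y)$, it is fulfilled that
\[
T_n \xrightarrow{\mathcal{L}} \max\{ d u_2 - z_2,\ (1+d)(u_1 - z_1) + (d-1)(z_2 + u_2) \}. \tag{7}
\]
c) Whenever $d\,\operatorname{spr}E(X) = \operatorname{spr}E(Y)$ and $\operatorname{mid}E(X) - \operatorname{mid}E(Y) = \frac{1-d}{d}\operatorname{spr}E(Y)$, it is fulfilled that
\[
T_n \xrightarrow{\mathcal{L}} \max\{ d z_2 - u_2,\ (1+d)(z_1 - u_1) + (d-1)(z_2 + u_2) \}. \tag{8}
\]
d) Whenever $d\,\operatorname{spr}E(X) = \operatorname{spr}E(Y)$ and $-\operatorname{mid}E(X) + \operatorname{mid}E(Y) = \frac{1-d}{d}\operatorname{spr}E(Y)$, it is fulfilled that
\[
T_n \xrightarrow{\mathcal{L}} \max\{ d z_2 - u_2,\ (1+d)(u_1 - z_1) + (d-1)(z_2 + u_2) \}. \tag{9}
\]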
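Proof. The statistic $T_n$ can be equivalently expressed as $T_n = \sqrt{n}\,\max\{A, B, C\}$, where
\[
\begin{aligned}
A &= d\bigl(\operatorname{spr}\overline{Y}_n - \operatorname{spr}E(Y)\bigr) + \bigl(d\,\operatorname{spr}E(Y) - \operatorname{spr}E(X)\bigr) + \bigl(\operatorname{spr}E(X) - \operatorname{spr}\overline{X}_n\bigr), \\
B &= d\bigl(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X)\bigr) + \bigl(d\,\operatorname{spr}E(X) - \operatorname{spr}E(Y)\bigr) + \bigl(\operatorname{spr}E(Y) - \operatorname{spr}\overline{Y}_n\bigr), \\
C &= (1+d)\,\bigl|\operatorname{mid}\overline{X}_n - \operatorname{mid}E(X) + \operatorname{mid}E(X) - \operatorname{mid}E(Y) + \operatorname{mid}E(Y) - \operatorname{mid}\overline{Y}_n\bigr| \\
  &\quad + (d-1)\bigl(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X) + \operatorname{spr}E(X) + \operatorname{spr}E(Y) - \operatorname{spr}E(Y) + \operatorname{spr}\overline{Y}_n\bigr).
\end{aligned}
\]
Since $|t| = \max\{t, -t\}$, the third term splits into a positive form (with $\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{Y}_n$ in place of the absolute value) and a negative form (with $\operatorname{mid}\overline{Y}_n - \operatorname{mid}\overline{X}_n$).
a) If $\operatorname{spr}E(X) = d\,\operatorname{spr}E(Y)$ and $\operatorname{mid}E(X) - \operatorname{mid}E(Y) = (1-d)\operatorname{spr}E(Y)$, the second term and the negative form of the third term diverge in probability to $-\infty$ as $n \to \infty$ by the Central Limit Theorem and Slutsky's theorem. Finally, the Continuous Mapping Theorem and the Central Limit Theorem for real variables lead to (6).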
Similar reasoning applies to the other three situations:
b) If $\operatorname{spr}E(X) = d\,\operatorname{spr}E(Y)$ and $-\operatorname{mid}E(X) + \operatorname{mid}E(Y) = (1-d)\operatorname{spr}E(Y)$, the second term and the positive form of the third term diverge in probability to $-\infty$ as $n \to \infty$;
c) If $d\,\operatorname{spr}E(X) = \operatorname{spr}E(Y)$ and $\operatorname{mid}E(X) - \operatorname{mid}E(Y) = \frac{1-d}{d}\operatorname{spr}E(Y)$, the first term and the negative form of the third term diverge in probability to $-\infty$ as $n \to \infty$;
d) If $d\,\operatorname{spr}E(X) = \operatorname{spr}E(Y)$ and $-\operatorname{mid}E(X) + \operatorname{mid}E(Y) = \frac{1-d}{d}\operatorname{spr}E(Y)$, the first term and the positive form of the third term diverge in probability to $-\infty$ as $n \to \infty$.
□
Remark 1. As in the real framework, situations under $H_0$ other than the ones shown in Theorem 1 (which are the 'worst' or 'limit' situations under $H_0$) lead the statistic $T_n$ to converge weakly to a limit distribution which is stochastically bounded by one of those provided in the theorem.
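Since the limit distribution of $T_n$ depends on $X$ and $Y$, we can consider the following $(X, Y)$-dependent distribution for the theoretical analysis of the testing procedure (see [14]):
\[
\begin{aligned}
T_n' = \max\Bigl\{\,
 & \sqrt{n}\,\bigl( d(\operatorname{spr}\overline{Y}_n - \operatorname{spr}E(Y)) + \operatorname{spr}E(X) - \operatorname{spr}\overline{X}_n \bigr) + \min\bigl\{0,\ n^{1/4}(\operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{X}_n)\bigr\}, \\
 & \sqrt{n}\,\bigl( d(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X)) + \operatorname{spr}E(Y) - \operatorname{spr}\overline{Y}_n \bigr) + \min\bigl\{0,\ n^{1/4}(\operatorname{spr}\overline{X}_n - \operatorname{spr}\overline{Y}_n)\bigr\}, \\
 & \sqrt{n}\,\bigl( (1+d)(\operatorname{mid}\overline{X}_n - \operatorname{mid}E(X) + \operatorname{mid}E(Y) - \operatorname{mid}\overline{Y}_n) + (d-1)(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X) + \operatorname{spr}\overline{Y}_n - \operatorname{spr}E(Y)) \bigr) \\
 & \qquad + \min\bigl\{0,\ n^{1/4}(\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{Y}_n)\bigr\}, \\
 & \sqrt{n}\,\bigl( (1+d)(\operatorname{mid}E(X) - \operatorname{mid}\overline{X}_n - \operatorname{mid}E(Y) + \operatorname{mid}\overline{Y}_n) + (d-1)(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X) + \operatorname{spr}\overline{Y}_n - \operatorname{spr}E(Y)) \bigr) \\
 & \qquad + \min\bigl\{0,\ n^{1/4}(\operatorname{mid}\overline{Y}_n - \operatorname{mid}\overline{X}_n)\bigr\}
 \,\Bigr\}.
\end{aligned}
\tag{10}
\]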
As in [14], the inclusion of $\min\{0,\ n^{1/4}(\operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{X}_n)\}$ (and of the analogous terms for the mids) in $T_n'$ is useful to determine which terms of its expression are relevant in each situation considered under $H_0$. The consistency and the power of the test are shown in Theorem 2.
Theorem 2. Let $\alpha \in [0, 1]$ and $k_{1-\alpha}$ be the $(1-\alpha)$-quantile of the asymptotic distribution of $T_n'$. If $H_0$ in (4) is true, then it is satisfied that
\[
\limsup_{n \to \infty} P(T_n' > k_{1-\alpha}) \le \alpha,
\]
and the equality is achieved whenever the conditions in a), b), c) or d) of Theorem 1 are fulfilled. In addition, if $H_0$ is not true, then
\[
\lim_{n \to \infty} P(T_n' > k_{1-\alpha}) = 1.
\]
As an immediate consequence of Theorem 2, the test which rejects $H_0$ in (4) at the significance level $\alpha$ whenever $T_n' > k_{1-\alpha}$ is asymptotically efficient and consistent.
3.1 Bootstrap test
Since the asymptotic limit distribution is not easy to handle in practice, a residual bootstrap approach is proposed. Let $X$ and $Y$ be two RIs such that $\operatorname{spr}E(X) > 0$ and $\operatorname{spr}E(Y) > 0$, and let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be two simple random samples drawn from $X$ and $Y$, respectively. Let us consider bootstrap samples for $X$ and $Y$, i.e. $\{X_i^*\}_{i=1}^n$ and $\{Y_i^*\}_{i=1}^n$ being chosen randomly and with replacement from $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$, respectively. The bootstrap statistic is based on the expression of $T_n'$ and it is defined as follows:
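\[
\begin{aligned}
T_n^* = \max\Bigl\{\,
 & \sqrt{n}\,\bigl( d(\operatorname{spr}\overline{Y^*}_n - \operatorname{spr}\overline{Y}_n) + \operatorname{spr}\overline{X}_n - \operatorname{spr}\overline{X^*}_n \bigr) + \min\bigl\{0,\ n^{1/4}(\operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{X}_n)\bigr\}, \\
 & \sqrt{n}\,\bigl( d(\operatorname{spr}\overline{X^*}_n - \operatorname{spr}\overline{X}_n) + \operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{Y^*}_n \bigr) + \min\bigl\{0,\ n^{1/4}(\operatorname{spr}\overline{X}_n - \operatorname{spr}\overline{Y}_n)\bigr\}, \\
 & \sqrt{n}\,\bigl( (1+d)(\operatorname{mid}\overline{X^*}_n - \operatorname{mid}\overline{X}_n + \operatorname{mid}\overline{Y}_n - \operatorname{mid}\overline{Y^*}_n) + (d-1)(\operatorname{spr}\overline{X^*}_n - \operatorname{spr}\overline{X}_n + \operatorname{spr}\overline{Y^*}_n - \operatorname{spr}\overline{Y}_n) \bigr) \\
 & \qquad + \min\bigl\{0,\ n^{1/4}(\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{Y}_n)\bigr\}, \\
 & \sqrt{n}\,\bigl( (1+d)(\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{X^*}_n + \operatorname{mid}\overline{Y^*}_n - \operatorname{mid}\overline{Y}_n) + (d-1)(\operatorname{spr}\overline{X^*}_n - \operatorname{spr}\overline{X}_n + \operatorname{spr}\overline{Y^*}_n - \operatorname{spr}\overline{Y}_n) \bigr) \\
 & \qquad + \min\bigl\{0,\ n^{1/4}(\operatorname{mid}\overline{Y}_n - \operatorname{mid}\overline{X}_n)\bigr\}
 \,\Bigr\}.
\end{aligned}
\tag{11}
\]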
The different asymptotic distributions of $T_n^*$ are, almost surely, the ones provided in Theorem 1 for $T_n$ under the same conditions, and the consistency of the bootstrap procedure is straightforwardly derived. The distribution of $T_n^*$ is approximated in practice by means of the Monte Carlo method.
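The Monte Carlo approximation can be sketched in Python as follows. Several choices here are assumptions rather than details given in the text: the p-value is taken as the proportion of bootstrap values $T_n^*$ exceeding the observed $T_n$ of (5), whole intervals (mid and spr together) are resampled independently within each sample, and the factor $(d-1)$ is applied to the spread increments in both mid-based terms, by analogy with (10). The function names and the `seed` argument are hypothetical.

```python
import numpy as np


def bootstrap_statistic(mx, sx, my, sy, mxb, sxb, myb, syb, n, d):
    """T*_n from the sample means (mx, sx, my, sy) and bootstrap means (*b)."""
    root, quart = np.sqrt(n), n ** 0.25
    spr_part = (d - 1) * (sxb - sx + syb - sy)  # spread increments, as in (10)
    return max(
        root * (d * (syb - sy) + sx - sxb) + min(0.0, quart * (sy - sx)),
        root * (d * (sxb - sx) + sy - syb) + min(0.0, quart * (sx - sy)),
        root * ((1 + d) * (mxb - mx + my - myb) + spr_part)
        + min(0.0, quart * (mx - my)),
        root * ((1 + d) * (mx - mxb + myb - my) + spr_part)
        + min(0.0, quart * (my - mx)),
    )


def bootstrap_pvalue(mid_x, spr_x, mid_y, spr_y, d, n_boot=1000, seed=0):
    """Monte Carlo p-value for H0: S(E(X), E(Y)) >= d (assumed decision rule)."""
    rng = np.random.default_rng(seed)
    mid_x, spr_x, mid_y, spr_y = (
        np.asarray(a, dtype=float) for a in (mid_x, spr_x, mid_y, spr_y)
    )
    n = len(mid_x)
    mx, sx, my, sy = mid_x.mean(), spr_x.mean(), mid_y.mean(), spr_y.mean()
    # Observed statistic T_n as in (5).
    t_obs = np.sqrt(n) * max(
        d * sy - sx, d * sx - sy, (1 + d) * abs(mx - my) + (d - 1) * (sx + sy)
    )
    exceed = 0
    for _ in range(n_boot):
        ix = rng.integers(0, n, size=n)  # resample intervals of X
        iy = rng.integers(0, n, size=n)  # resample intervals of Y
        t_star = bootstrap_statistic(
            mx, sx, my, sy,
            mid_x[ix].mean(), spr_x[ix].mean(),
            mid_y[iy].mean(), spr_y[iy].mean(),
            n, d,
        )
        exceed += int(t_star > t_obs)
    return exceed / n_boot
```

Under this decision rule, $H_0$ would be rejected at level $\alpha$ whenever the returned value is below $\alpha$. Using the index arrays `ix` and `iy` keeps each resampled interval's mid and spr paired, which preserves any dependence between the two components.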
4 Simulations
The empirical behaviour of the bootstrap test is shown by simulation. Two different situations are considered: in the first one, the mid and spr components of the two independent RIs $X$ and $Y$ are independently generated. In the second one, those components are allowed to have a certain level of dependence on each other. The two situations are described as follows:
Situation 1: $\operatorname{mid}X \equiv N(2, 5)$, $\operatorname{spr}X \equiv U(1, 3)$; $\operatorname{mid}Y \equiv N(3, 5)$, $\operatorname{spr}Y \equiv U(1, 5)$.
Situation 2: $\operatorname{mid}X \equiv U(2, 6)$, $\operatorname{spr}X \equiv \operatorname{mid}X / 2$; $\operatorname{mid}Y = \operatorname{spr}Y \equiv U(1, 5)$.
With $d = 2/3$, it is straightforward to show that Situation 1 satisfies conditions b) of Theorem 1, since $E(X) = [0, 4]$ and $E(Y) = [0, 6]$, and that Situation 2 satisfies conditions a), since $E(X) = [2, 6]$ and $E(Y) = [0, 6]$. Besides, $S(E(X), E(Y)) = 2/3$ in both cases.
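The stated similarity of 2/3 can be checked with exact arithmetic. The sketch below uses the means implied by the two situations (a normal distribution centred at its first parameter, and $(a+b)/2$ for $U(a, b)$); the function name `jaccard` is a hypothetical helper that applies (1) and (2).

```python
from fractions import Fraction


def jaccard(mid_a, spr_a, mid_b, spr_b):
    """Exact Jaccard similarity of two intervals with integer mid and spr."""
    inter = max(0, min(2 * spr_a, 2 * spr_b,
                       spr_a + spr_b - abs(mid_a - mid_b)))
    return Fraction(inter, 2 * spr_a + 2 * spr_b - inter)


# (mid E(X), spr E(X), mid E(Y), spr E(Y)) for each situation.
situations = {
    "Situation 1": (2, 2, 3, 3),  # E(X) = [0, 4], E(Y) = [0, 6]
    "Situation 2": (4, 2, 3, 3),  # E(X) = [2, 6], E(Y) = [0, 6]
}
d = Fraction(2, 3)
for name, (mx, sx, my, sy) in situations.items():
    print(name,
          "| S =", jaccard(mx, sx, my, sy),
          "| spr E(X) == d * spr E(Y):", sx == d * sy,
          "| mid E(X) - mid E(Y) =", mx - my,
          "| (1 - d) * spr E(Y) =", (1 - d) * sy)
```

In both cases the spread equality holds and the absolute difference of the mids equals $(1-d)\operatorname{spr}E(Y)$; the two situations differ only in the sign of $\operatorname{mid}E(X) - \operatorname{mid}E(Y)$.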
The bootstrap test proposed in Section 3.1 has been run for 10000 simulations with 1000 bootstrap replications each to test $H_0 : S(E(X), E(Y)) \ge 2/3$ vs. $H_1 : S(E(X), E(Y)) < 2/3$, for several significance levels $\alpha$ and different sample sizes. Results are gathered in Table 1. They show that the empirical sizes of the test are in both cases quite close to the nominal significance levels even for moderate sample sizes. Specifically, the approximation to the nominal significance level is more conservative in the first situation than in the second one. The slight differences observed between the two situations may be due to the different nature of the distributions.
Table 1 Empirical size (in %) of the two-sample similarity bootstrap test in Situations 1 and 2

              Situation 1              Situation 2
  n \ 100·α   1      5      10         1      5      10
  10          2.27   6.88   10.64      2.38   7.36   11.85
  30          1.60   4.60    9.54      1.89   5.73   10.31
  50          1.35   4.96   10.59      1.44   5.32   10.46
  100         1.27   5.06   10.35      1.22   5.18   10.24
  200         0.95   4.89    9.88      1.1    5.12    9.8
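Samples from the two situations can be generated as in the following sketch. One point is an assumption: the second parameter of $N(\cdot, 5)$ is read here as a variance, so the standard deviation passed to NumPy is $\sqrt{5}$; the text does not state which convention is used. The function name `simulate_situation` is hypothetical.

```python
import numpy as np


def simulate_situation(situation, n, rng):
    """Draw n intervals of X and of Y as (mid_x, spr_x, mid_y, spr_y) arrays."""
    if situation == 1:
        # Assumption: N(m, 5) denotes variance 5, i.e. sd = sqrt(5).
        mid_x = rng.normal(2.0, np.sqrt(5.0), n)
        spr_x = rng.uniform(1.0, 3.0, n)
        mid_y = rng.normal(3.0, np.sqrt(5.0), n)
        spr_y = rng.uniform(1.0, 5.0, n)
    elif situation == 2:
        mid_x = rng.uniform(2.0, 6.0, n)
        spr_x = mid_x / 2.0          # spread fully determined by the mid
        mid_y = rng.uniform(1.0, 5.0, n)
        spr_y = mid_y.copy()         # mid and spread coincide
    else:
        raise ValueError("situation must be 1 or 2")
    return mid_x, spr_x, mid_y, spr_y


rng = np.random.default_rng(0)
mid_x, spr_x, mid_y, spr_y = simulate_situation(2, 100, rng)
print(mid_x.mean(), spr_x.mean(), mid_y.mean(), spr_y.mean())
```

An empirical size such as those in Table 1 would then be estimated by repeating the generation and the bootstrap test many times and recording the proportion of rejections.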
Finally, a small empirical study of the power of the proposed test has been carried out. Specifically, $\operatorname{mid}X$ in Situation 1 has been chosen to have distributions $N(1, 5)$, $N(0, 5)$ and $N(-1, 5)$, respectively. In these cases, the bootstrap approach for $\alpha = .05$ and $n = 50$ leads to p-values of .153, .381 and .692, respectively and, therefore, the power of the test approaches 1 as the distribution of $X$ moves further away from the null hypothesis.
5 Conclusions and open problems
A hypothesis test for checking the similarity between the expected values of two RIs has been introduced. A test statistic has been proposed and its limit distribution has been analyzed by means of both asymptotic and bootstrap techniques. Some simulation studies have been carried out to show the suitability of the bootstrap approach for moderate/large sample sizes.
As future work, theoretical and empirical comparisons between different
similarity indexes should be developed. The power of the proposed test may
also be theoretically analyzed as well as the sensitivity of the test when
different distributions are chosen. Other versions of the test statistic involving
the covariance matrix can be studied. Finally, the proposed test could be
extended to more than two RIs and to the fuzzy framework.
Acknowledgements The research in this paper is partially supported by the Spanish
National Grant MTM2013-44212-P, and the Regional Grant FC-15-GRUPIN-14-005. Their
financial support is gratefully acknowledged.
References
1. Aumann RJ (1965) Integrals of set-valued functions. J Math Anal Appl 12:1–12
2. Blanco-Fernández A, Corral N, González-Rodríguez G (2011) Estimation of a flexible simple linear model for interval data based on set arithmetic. Comput Stat Data An 55(9):2568–2578
3. Ferraro MB, Coppi R, González-Rodríguez G, Colubi A (2010) A linear regression model for imprecise response. Int J Approx Reason 51(7):759–770
4. Gil MA, González-Rodríguez G, Colubi A, Montenegro M (2007) Testing linear independence in linear models with interval-valued data. Comput Stat Data An 51(6):3002–3015
5. González-Rodríguez G, Colubi A, Gil MA (2012) Fuzzy data treated as functional data: A one-way ANOVA test approach. Comput Stat Data An 56(4):943–955
6. Horowitz JL, Manski CF (2006) Identification and estimation of statistical functionals
using incomplete data. J Econom 132:445–459
7. Hudgens MG (2005) On nonparametric maximum likelihood estimation with interval
censoring and left truncation. J Roy Stat Soc: Ser B 67:573–587
8. Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37:547–579
9. Körner R (2000) An asymptotic α-test for the expectation of random fuzzy variables. J Stat Plan Infer 83:331–346
10. Magnac T, Maurin E (2008) Partial identification in monotone binary models: discrete
regressors and interval data. Rev Econ Stud 75:835–864
11. Matheron G (1975) Random Sets and Integral Geometry. Wiley, New York
12. Molchanov I (2005) Theory of Random Sets. Springer, London
13. Ramos-Guajardo AB, Colubi A, González-Rodríguez G (2014) Inclusion degree tests for the Aumann expectation of a random interval. Inf Sci 288(20):412–422
14. Ramos-Guajardo AB (2015) Similarity Test for the Expectation of a Random Interval
and a Fixed Interval. In: Grzegorzewski P, Gagolewski M, Hryniewicz O, Gil MA
(eds) Strengthening Links Between Data Analysis and Soft Computing. Advances in
Intelligent Systems and Computing 315:175–182
15. Shawe-Taylor J, Cristianini N (2004) Kernel Methods for Pattern Analysis. Cambridge
University Press, Cambridge