All content in this area was uploaded by Ana Belén Ramos-Guajardo on May 10, 2018

Two-sample similarity test for the expected value of random intervals

Ana B. Ramos-Guajardo and Ángela Blanco-Fernández

Abstract The similarity degree between the expectations of two random intervals is studied by means of a hypothesis testing procedure. For this purpose, a similarity measure for intervals is introduced based on the so-called Jaccard index for convex sets. The measure ranges from 0 (if both intervals are not similar at all, i.e., if they do not overlap) to 1 (if both intervals are equal). A test statistic is proposed and its limit distribution is analyzed by considering asymptotic and bootstrap techniques. Some simulation studies are carried out to examine the behaviour of the approach.

1 Introduction

Interval data derive from experimental studies involving ranges, fluctuations, subjective perceptions, censored data, grouped data, and so on [6, 7, 10]. Random intervals (RIs) have been shown to model and handle this kind of data suitably in different settings [2, 3, 11, 12].

The Aumann expectation of an RI is also an interval, and inferences concerning the Aumann expectation and, especially, hypothesis tests for the expected value of random intervals have been previously developed in the literature [5, 9]. Additionally, tests relaxing strict equalities have also been carried out, for instance inclusion tests for the Aumann expectation of RIs [13] and similarity tests for the expected value of an RI and a previously fixed interval [14].

The aim of this work is to develop a two-sample test for the similarity of the expectations of two RIs. The similarity measure to be considered is based on the classical Jaccard similarity coefficient for convex sets [8], which can be seen as the ratio of the Lebesgue measure of the intersection interval to the Lebesgue measure of the union interval [15]. A statistic to solve the test is introduced, and its asymptotic and bootstrap limit distributions are theoretically analyzed. The development of bootstrap techniques allows us to approximate the sampling distribution of the statistic in practice, since the asymptotic one depends on unknown parameters in general. Finally, simulation studies are developed to show the empirical behaviour of the procedure.

Department of Statistics and Operational Research, University of Oviedo, C/Calvo Sotelo, s/n, 33007, Oviedo, Spain.
blancoangela@uniovi.es, ramosana@uniovi.es

2 Preliminary concepts

From now on, let $\mathcal{K}_c(\mathbb{R})$ denote the family of non-empty closed and bounded intervals in $\mathbb{R}$. An interval $A \in \mathcal{K}_c(\mathbb{R})$ can be characterized by either its (mid, spr) representation (i.e., $A = [\mathrm{mid}\,A \pm \mathrm{spr}\,A]$, with $\mathrm{mid}\,A \in \mathbb{R}$ the mid-point or centre and $\mathrm{spr}\,A \ge 0$ the spread or radius of $A$) or its (inf, sup) representation (i.e., $A = [\inf A, \sup A]$).

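The usual interval arithmetic is based on Minkowski addition and the product by a scalar. It is expressed in terms of the (mid, spr) representation as $A_1 + \lambda A_2 = [(\mathrm{mid}\,A_1 + \lambda\,\mathrm{mid}\,A_2) \pm (\mathrm{spr}\,A_1 + |\lambda|\,\mathrm{spr}\,A_2)]$, for $A_1, A_2 \in \mathcal{K}_c(\mathbb{R})$ and $\lambda \in \mathbb{R}$.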

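The Lebesgue measure of $A \in \mathcal{K}_c(\mathbb{R})$ is given by $\lambda(A) = 2\,\mathrm{spr}\,A$. Obviously, the Lebesgue measure of the empty set is $\lambda(\emptyset) = 0$. In addition, the Lebesgue measure of the intersection between $A$ and $B$, $\lambda(A \cap B)$, for any $A, B \in \mathcal{K}_c(\mathbb{R})$ can be expressed as follows [15]:

\[
\lambda(A \cap B) = \max\big\{0,\ \min\{2\,\mathrm{spr}\,A,\ 2\,\mathrm{spr}\,B,\ \mathrm{spr}\,A + \mathrm{spr}\,B - |\mathrm{mid}\,A - \mathrm{mid}\,B|\}\big\}. \tag{1}
\]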

A measure of the degree of similarity between two intervals $A, B \in \mathcal{K}_c(\mathbb{R})$ can be defined according to the Jaccard coefficient [8] as

\[
S(A, B) = \frac{\lambda(A \cap B)}{\lambda(A \cup B)}. \tag{2}
\]

This similarity measure fulfils that $S(A, B) = 0$ iff $A \cap B = \emptyset$, $S(A, B) = 1$ iff $A = B$, and $S(A, B) \in (0, 1)$ iff $A \cap B \neq \emptyset$ and $A \neq B$. As an example, the similarity measure of two intervals $A$ and $B$ is 1/2 whenever one interval is contained in the other and the length of the larger one is twice the length of the smaller one.
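Expressions (1) and (2) can be evaluated directly from the (mid, spr) representation. The following is a minimal Python sketch, not the authors' implementation: the function names are illustrative, the length of the union is taken as $\lambda(A) + \lambda(B) - \lambda(A \cap B)$, and the value returned for two degenerate (single-point) intervals is a convention assumed here.

```python
def intersection_length(mid_a, spr_a, mid_b, spr_b):
    """Lebesgue measure of the intersection of A and B, following expression (1)."""
    return max(0.0, min(2 * spr_a, 2 * spr_b,
                        spr_a + spr_b - abs(mid_a - mid_b)))


def jaccard_similarity(mid_a, spr_a, mid_b, spr_b):
    """Jaccard index S(A, B) of two intervals given in (mid, spr) form."""
    inter = intersection_length(mid_a, spr_a, mid_b, spr_b)
    # Assumption: length of the union = lambda(A) + lambda(B) - lambda(A n B).
    union = 2 * spr_a + 2 * spr_b - inter
    if union == 0:
        # Assumed convention for two single-point intervals (not in the paper).
        return 1.0 if mid_a == mid_b else 0.0
    return inter / union


# A = [0, 2] and B = [1, 2]: B lies inside A and is half as long.
print(jaccard_similarity(1.0, 1.0, 1.5, 0.5))  # 0.5
```

Working in (mid, spr) form keeps the computation free of case distinctions on the relative position of the endpoints, since (1) already covers disjoint, nested and partially overlapping intervals.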

Random variables modelling those situations in which intervals in $\mathcal{K}_c(\mathbb{R})$ are provided as outcomes are called random intervals (RIs). Given a probability space $(\Omega, \mathcal{A}, P)$, an RI is a Borel measurable mapping $X : \Omega \to \mathcal{K}_c(\mathbb{R})$ w.r.t. the well-known Hausdorff metric on $\mathcal{K}_c(\mathbb{R})$ [11]. Equivalently, $X$ is an RI if both $\mathrm{mid}\,X, \mathrm{spr}\,X : \Omega \to \mathbb{R}$ are real-valued random variables and $\mathrm{spr}\,X \ge 0$ a.s.-[P].

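Whenever $\mathrm{mid}\,X, \mathrm{spr}\,X \in L^1(\Omega, \mathcal{A}, P)$, it is possible to define the Aumann expected value of $X$ [1]. In terms of classical expectations it is expressed as $E([\mathrm{mid}\,X \pm \mathrm{spr}\,X]) = [E(\mathrm{mid}\,X) \pm E(\mathrm{spr}\,X)]$. Let $\{X_i\}_{i=1}^n$ be a simple random sample of $X$. The corresponding sample expectation of $X$ is defined coherently in terms of the interval arithmetic as $\overline{X}_n = (1/n)\sum_{i=1}^n X_i$, and it fulfils $\overline{X}_n = [\overline{\mathrm{mid}\,X}_n \pm \overline{\mathrm{spr}\,X}_n]$, that is, its mid-point and spread are the sample means of the mid-points and spreads.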
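As an illustration of the sample expectation, the sketch below averages a list of intervals given by their end-points. It is only a minimal example under the definitions above; the helper names `to_mid_spr` and `sample_aumann_mean` are hypothetical and not taken from the paper.

```python
from statistics import fmean


def to_mid_spr(inf, sup):
    """Convert the (inf, sup) representation of an interval to (mid, spr)."""
    if sup < inf:
        raise ValueError("an interval needs inf <= sup")
    return (inf + sup) / 2, (sup - inf) / 2


def sample_aumann_mean(intervals):
    """Sample mean of a non-empty list of (inf, sup) intervals, as (mid, spr)."""
    if not intervals:
        raise ValueError("at least one interval is required")
    pairs = [to_mid_spr(lo, hi) for lo, hi in intervals]
    # Averaging mids and spreads separately matches the Minkowski-based mean.
    return fmean([m for m, _ in pairs]), fmean([s for _, s in pairs])


# (1/2)([0, 2] + [2, 6]) = [1, 4], i.e. mid 2.5 and spread 1.5.
print(sample_aumann_mean([(0.0, 2.0), (2.0, 6.0)]))  # (2.5, 1.5)
```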

3 Similarity test for the expected values of two RIs

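Let $(\Omega, \mathcal{A}, P)$ be a probability space, and $X, Y : \Omega \to \mathcal{K}_c(\mathbb{R})$ be two RIs such that $\mathrm{spr}\,E(X) > 0$ and $\mathrm{spr}\,E(Y) > 0$. Some mild conditions are assumed to guarantee the existence of the involved moments and to avoid trivial situations (as, for instance, the singularity of the covariance matrix). Thus, $X$ and $Y$ are supposed to belong to the following class of random intervals:

\[
\mathcal{P} = \Big\{ X : \Omega \to \mathcal{K}_c(\mathbb{R}) \ \Big|\ \sigma^2_{\mathrm{mid}\,X} < \infty,\ 0 < \sigma^2_{\mathrm{spr}\,X} < \infty \ \wedge\ \big(\mathrm{Cov}(\mathrm{mid}\,X, \mathrm{spr}\,X)\big)^2 \neq \sigma^2_{\mathrm{mid}\,X}\,\sigma^2_{\mathrm{spr}\,X} \Big\}.
\]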

Given $d \in [0, 1]$, the aim is to test

\[
H_0 : S(E(X), E(Y)) \ge d \quad \text{vs.} \quad H_1 : S(E(X), E(Y)) < d. \tag{3}
\]

The alternative one-sided and two-sided tests (that is, those analyzing if the Jaccard index of the expectations equals $d$ or if it is greater than or equal to $d$) could be analogously studied. We focus our attention on (3) since it seems to be the most appealing for practical applications. From (1) and (2) it is straightforward to show that the null hypothesis of the test (3) can be equivalently expressed as

\[
H_0 : \max\big\{ d\,\mathrm{spr}\,E(Y) - \mathrm{spr}\,E(X),\ \ d\,\mathrm{spr}\,E(X) - \mathrm{spr}\,E(Y),\ \ (1 + d)\,|\mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y)| + (d - 1)\,(\mathrm{spr}\,E(X) + \mathrm{spr}\,E(Y)) \big\} \le 0. \tag{4}
\]
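The step from (3) to (4) can be sketched as follows for $d \in (0, 1]$. The shorthand $A = E(X)$, $B = E(Y)$ and $\Delta = \mathrm{mid}\,A - \mathrm{mid}\,B$ is introduced here only for this derivation, and the length of the union is taken as $\lambda(A \cup B) = \lambda(A) + \lambda(B) - \lambda(A \cap B)$.

```latex
\begin{aligned}
S(A, B) \ge d
 &\iff \lambda(A \cap B) \ge d\,\bigl(2\,\mathrm{spr}\,A + 2\,\mathrm{spr}\,B - \lambda(A \cap B)\bigr) \\
 &\iff (1 + d)\,\lambda(A \cap B) \ge 2d\,(\mathrm{spr}\,A + \mathrm{spr}\,B).
\end{aligned}
% The right-hand side is positive, so \lambda(A \cap B) must be positive and,
% by (1), equal to the minimum of its three arguments. The inequality then has
% to hold for each argument separately:
\begin{aligned}
(1 + d)\,2\,\mathrm{spr}\,A \ge 2d\,(\mathrm{spr}\,A + \mathrm{spr}\,B)
 &\iff d\,\mathrm{spr}\,B - \mathrm{spr}\,A \le 0, \\
(1 + d)\,2\,\mathrm{spr}\,B \ge 2d\,(\mathrm{spr}\,A + \mathrm{spr}\,B)
 &\iff d\,\mathrm{spr}\,A - \mathrm{spr}\,B \le 0, \\
(1 + d)\,(\mathrm{spr}\,A + \mathrm{spr}\,B - |\Delta|) \ge 2d\,(\mathrm{spr}\,A + \mathrm{spr}\,B)
 &\iff (1 + d)\,|\Delta| + (d - 1)\,(\mathrm{spr}\,A + \mathrm{spr}\,B) \le 0.
\end{aligned}
```

The three inequalities hold simultaneously exactly when their maximum is non-positive, which is the form given in (4). When the intervals do not overlap, $|\Delta| \ge \mathrm{spr}\,A + \mathrm{spr}\,B$ and the third expression is at least $2d\,(\mathrm{spr}\,A + \mathrm{spr}\,B) > 0$, so (4) fails, in agreement with $S(A, B) = 0 < d$.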

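The resolution of the test is addressed below by considering an asymptotic approach. Let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be two samples of random intervals being independent and identically distributed as $X$ and $Y$, respectively. The test statistic is defined as

\[
T_n = \sqrt{n}\,\max\big\{ d\,\mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,\overline{X}_n,\ \ d\,\mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,\overline{Y}_n,\ \ (1 + d)\,|\mathrm{mid}\,\overline{X}_n - \mathrm{mid}\,\overline{Y}_n| + (d - 1)\,(\mathrm{spr}\,\overline{X}_n + \mathrm{spr}\,\overline{Y}_n) \big\}. \tag{5}
\]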
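The statistic in (5) only needs the sample means of the mid-points and spreads of both samples. A minimal Python sketch is given below; the function name and the input layout (four equal-length lists) are assumptions made for illustration.

```python
from math import sqrt
from statistics import fmean


def similarity_statistic(mid_x, spr_x, mid_y, spr_y, d):
    """T_n as in (5), for two samples of equal size n given by mids and spreads."""
    n = len(mid_x)
    if n == 0 or not (len(spr_x) == len(mid_y) == len(spr_y) == n):
        raise ValueError("all four inputs need the same positive length")
    mx, sx = fmean(mid_x), fmean(spr_x)  # mid and spr of the sample mean of X
    my, sy = fmean(mid_y), fmean(spr_y)  # mid and spr of the sample mean of Y
    return sqrt(n) * max(
        d * sy - sx,
        d * sx - sy,
        (1 + d) * abs(mx - my) + (d - 1) * (sx + sy),
    )


# Four copies of X = [0, 4] and Y = [0, 6] with d = 0.5: the largest of the
# three terms is 0.5 * 3 - 2 = -0.5, so T_n = sqrt(4) * (-0.5) = -1.0.
print(similarity_statistic([2.0] * 4, [2.0] * 4, [3.0] * 4, [3.0] * 4, 0.5))
```

Negative values arise when the sample means satisfy the inequalities in (4) strictly, while large positive values point towards the alternative hypothesis.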

From now on, let us consider the bivariate normal distributions $Z = (z_1, z_2)^T \equiv \mathcal{N}_2(0, \Sigma_1)$ and $U = (u_1, u_2)^T \equiv \mathcal{N}_2(0, \Sigma_2)$, where $\Sigma_1$ is the covariance matrix for the random vector $(\mathrm{mid}\,X, \mathrm{spr}\,X)$ and $\Sigma_2$ is the corresponding one for $(\mathrm{mid}\,Y, \mathrm{spr}\,Y)$. The limit distribution of the statistic $T_n$ under $H_0$ is analyzed in the following result.

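Theorem 1. For $n \in \mathbb{N}$, let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be simple random samples from $X$ and $Y$, respectively. Let $T_n$ be defined as in (5). If $X, Y \in \mathcal{P}$, then:

a) Whenever $\mathrm{spr}\,E(X) = d\,\mathrm{spr}\,E(Y)$ and $\mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y) = (1 - d)\,\mathrm{spr}\,E(Y)$, it is fulfilled that

\[
T_n \xrightarrow{\ \mathcal{L}\ } \max\{ d\,u_2 - z_2,\ (1 + d)(z_1 - u_1) + (d - 1)(z_2 + u_2) \}. \tag{6}
\]

b) Whenever $\mathrm{spr}\,E(X) = d\,\mathrm{spr}\,E(Y)$ and $-\mathrm{mid}\,E(X) + \mathrm{mid}\,E(Y) = (1 - d)\,\mathrm{spr}\,E(Y)$, it is fulfilled that

\[
T_n \xrightarrow{\ \mathcal{L}\ } \max\{ d\,u_2 - z_2,\ (1 + d)(u_1 - z_1) + (d - 1)(z_2 + u_2) \}. \tag{7}
\]

c) Whenever $d\,\mathrm{spr}\,E(X) = \mathrm{spr}\,E(Y)$ and $\mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y) = \frac{1 - d}{d}\,\mathrm{spr}\,E(Y)$, it is fulfilled that

\[
T_n \xrightarrow{\ \mathcal{L}\ } \max\{ d\,z_2 - u_2,\ (1 + d)(z_1 - u_1) + (d - 1)(z_2 + u_2) \}. \tag{8}
\]

d) Whenever $d\,\mathrm{spr}\,E(X) = \mathrm{spr}\,E(Y)$ and $-\mathrm{mid}\,E(X) + \mathrm{mid}\,E(Y) = \frac{1 - d}{d}\,\mathrm{spr}\,E(Y)$, it is fulfilled that

\[
T_n \xrightarrow{\ \mathcal{L}\ } \max\{ d\,z_2 - u_2,\ (1 + d)(u_1 - z_1) + (d - 1)(z_2 + u_2) \}. \tag{9}
\]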

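Proof. The statistic $T_n$ can be equivalently expressed as $T_n = \sqrt{n}\,\max\{A, B, C\}$, where

\[
\begin{aligned}
A &= d\,(\mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,E(Y)) + (d\,\mathrm{spr}\,E(Y) - \mathrm{spr}\,E(X)) + (\mathrm{spr}\,E(X) - \mathrm{spr}\,\overline{X}_n), \\
B &= d\,(\mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,E(X)) + (d\,\mathrm{spr}\,E(X) - \mathrm{spr}\,E(Y)) + (\mathrm{spr}\,E(Y) - \mathrm{spr}\,\overline{Y}_n), \\
C &= (1 + d)\,\big| \mathrm{mid}\,\overline{X}_n - \mathrm{mid}\,E(X) + \mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y) + \mathrm{mid}\,E(Y) - \mathrm{mid}\,\overline{Y}_n \big| \\
  &\quad + (d - 1)\,\big( \mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,E(X) + \mathrm{spr}\,E(X) + \mathrm{spr}\,E(Y) - \mathrm{spr}\,E(Y) + \mathrm{spr}\,\overline{Y}_n \big).
\end{aligned}
\]

a) If $\mathrm{spr}\,E(X) = d\,\mathrm{spr}\,E(Y)$ and $\mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y) = (1 - d)\,\mathrm{spr}\,E(Y)$, the second term and the negative form of the third term (the branch of the absolute value in $C$ taken with a minus sign) diverge in probability to $-\infty$ as $n \to \infty$ by the Central Limit and Slutsky's theorems. Finally, the Continuous Mapping and the Central Limit Theorems for real variables lead to (6).

Similar reasonings can be taken into account in the other three situations:

b) If $\mathrm{spr}\,E(X) = d\,\mathrm{spr}\,E(Y)$ and $-\mathrm{mid}\,E(X) + \mathrm{mid}\,E(Y) = (1 - d)\,\mathrm{spr}\,E(Y)$, the second term and the positive form of the third term diverge in probability to $-\infty$ as $n \to \infty$;

c) If $d\,\mathrm{spr}\,E(X) = \mathrm{spr}\,E(Y)$ and $\mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y) = \frac{1 - d}{d}\,\mathrm{spr}\,E(Y)$, the first term and the negative form of the third term diverge in probability to $-\infty$ as $n \to \infty$;

d) If $d\,\mathrm{spr}\,E(X) = \mathrm{spr}\,E(Y)$ and $-\mathrm{mid}\,E(X) + \mathrm{mid}\,E(Y) = \frac{1 - d}{d}\,\mathrm{spr}\,E(Y)$, the first term and the positive form of the third term diverge in probability to $-\infty$ as $n \to \infty$. □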

Remark 1. As in the real framework, situations under $H_0$ other than the ones shown in Theorem 1 (which are the 'worst' or 'limit' situations under $H_0$) lead the statistic $T_n$ to converge weakly to a limit distribution which is stochastically bounded by one of those provided in the theorem.

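Since the limit distribution of $T_n$ depends on $X$ and $Y$, we can consider the following $(X, Y)$-dependent statistic for the theoretical analysis of the testing procedure (see [14]):

\[
\begin{aligned}
T'_n = \max\Big\{
 & \sqrt{n}\,\big( d\,(\mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,E(Y)) + \mathrm{spr}\,E(X) - \mathrm{spr}\,\overline{X}_n \big) + \min\{0,\ n^{1/4}(\mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,\overline{X}_n)\}, \\
 & \sqrt{n}\,\big( d\,(\mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,E(X)) + \mathrm{spr}\,E(Y) - \mathrm{spr}\,\overline{Y}_n \big) + \min\{0,\ n^{1/4}(\mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,\overline{Y}_n)\}, \\
 & \sqrt{n}\,\big( (1 + d)(\mathrm{mid}\,\overline{X}_n - \mathrm{mid}\,E(X) + \mathrm{mid}\,E(Y) - \mathrm{mid}\,\overline{Y}_n) \\
 & \qquad + (d - 1)(\mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,E(X) + \mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,E(Y)) \big) + \min\{0,\ n^{1/4}(\mathrm{mid}\,\overline{X}_n - \mathrm{mid}\,\overline{Y}_n)\}, \\
 & \sqrt{n}\,\big( (1 + d)(\mathrm{mid}\,E(X) - \mathrm{mid}\,\overline{X}_n - \mathrm{mid}\,E(Y) + \mathrm{mid}\,\overline{Y}_n) \\
 & \qquad + (d - 1)(\mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,E(X) + \mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,E(Y)) \big) + \min\{0,\ n^{1/4}(\mathrm{mid}\,\overline{Y}_n - \mathrm{mid}\,\overline{X}_n)\}
 \Big\}.
\end{aligned}
\tag{10}
\]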

As in [14], the inclusion of $\min\{0,\ n^{1/4}(\mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,\overline{X}_n)\}$ (and likewise for the mids) in $T'_n$ is useful to determine which terms of its expression are relevant in each situation considered under $H_0$. The consistency and the power of the test are shown in Theorem 2.

Theorem 2. Let $\alpha \in [0, 1]$ and $k_{1-\alpha}$ be the $(1 - \alpha)$-quantile of the asymptotic distribution of $T'_n$. If $H_0$ in (4) is true, then it is satisfied that

\[
\limsup_{n \to \infty} P(T'_n > k_{1-\alpha}) \le \alpha,
\]

and the equality is achieved whenever the conditions in a), b), c) or d) in Theorem 1 are fulfilled. In addition, if $H_0$ is not true, then

\[
\lim_{n \to \infty} P(T'_n > k_{1-\alpha}) = 1.
\]

As an immediate consequence of Theorem 2, the test which rejects $H_0$ in (4) at the significance level $\alpha$ whenever $T'_n > k_{1-\alpha}$ is asymptotically efficient and consistent.

3.1 Bootstrap test

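Since the asymptotic limit distribution is not easy to handle in practice, a residual bootstrap approach is proposed. Let $X$ and $Y$ be two RIs such that $\mathrm{spr}\,E(X) > 0$ and $\mathrm{spr}\,E(Y) > 0$, and let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be two simple random samples drawn from $X$ and $Y$, respectively. Let us consider bootstrap samples for $X$ and $Y$, i.e. $\{X_i^*\}_{i=1}^n$ and $\{Y_i^*\}_{i=1}^n$ being chosen randomly and with replacement from $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$, respectively. The bootstrap statistic is based on the expression of $T'_n$ and it is defined as follows:

\[
\begin{aligned}
T^*_n = \max\Big\{
 & \sqrt{n}\,\big( d\,(\mathrm{spr}\,\overline{Y^*}_n - \mathrm{spr}\,\overline{Y}_n) + \mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,\overline{X^*}_n \big) + \min\{0,\ n^{1/4}(\mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,\overline{X}_n)\}, \\
 & \sqrt{n}\,\big( d\,(\mathrm{spr}\,\overline{X^*}_n - \mathrm{spr}\,\overline{X}_n) + \mathrm{spr}\,\overline{Y}_n - \mathrm{spr}\,\overline{Y^*}_n \big) + \min\{0,\ n^{1/4}(\mathrm{spr}\,\overline{X}_n - \mathrm{spr}\,\overline{Y}_n)\}, \\
 & \sqrt{n}\,\big( (1 + d)(\mathrm{mid}\,\overline{X^*}_n - \mathrm{mid}\,\overline{X}_n + \mathrm{mid}\,\overline{Y}_n - \mathrm{mid}\,\overline{Y^*}_n) \\
 & \qquad + (d - 1)(\mathrm{spr}\,\overline{X^*}_n - \mathrm{spr}\,\overline{X}_n + \mathrm{spr}\,\overline{Y^*}_n - \mathrm{spr}\,\overline{Y}_n) \big) + \min\{0,\ n^{1/4}(\mathrm{mid}\,\overline{X}_n - \mathrm{mid}\,\overline{Y}_n)\}, \\
 & \sqrt{n}\,\big( (1 + d)(\mathrm{mid}\,\overline{X}_n - \mathrm{mid}\,\overline{X^*}_n + \mathrm{mid}\,\overline{Y^*}_n - \mathrm{mid}\,\overline{Y}_n) \\
 & \qquad + (d - 1)(\mathrm{spr}\,\overline{X^*}_n - \mathrm{spr}\,\overline{X}_n + \mathrm{spr}\,\overline{Y^*}_n - \mathrm{spr}\,\overline{Y}_n) \big) + \min\{0,\ n^{1/4}(\mathrm{mid}\,\overline{Y}_n - \mathrm{mid}\,\overline{X}_n)\}
 \Big\}.
\end{aligned}
\tag{11}
\]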

The different asymptotic distributions of $T^*_n$ are (almost surely) the ones provided in Theorem 1 for $T_n$, under the same conditions, and the consistency of the bootstrap procedure is straightforwardly derived. The distribution of $T^*_n$ is approximated in practice by means of the Monte Carlo method.
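The Monte Carlo approximation can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function names, the default number of replications and the seed are assumptions; the bootstrap p-value is taken here as the proportion of replicates $T^*_n$ that are at least as large as the observed $T_n$ from (5), which the text does not spell out; and the $(d - 1)$ part of the fourth component of (11) is computed from spreads, mirroring the third component and (10).

```python
import random
from math import sqrt
from statistics import fmean


def mean_pair(sample):
    """(mid, spr) of the sample mean of a list of (mid, spr) pairs."""
    return fmean([m for m, _ in sample]), fmean([s for _, s in sample])


def t_n(mx, sx, my, sy, n, d):
    """Observed statistic, following (5)."""
    return sqrt(n) * max(d * sy - sx, d * sx - sy,
                         (1 + d) * abs(mx - my) + (d - 1) * (sx + sy))


def t_star(bmx, bsx, bmy, bsy, mx, sx, my, sy, n, d):
    """Bootstrap statistic, following (11); b* are bootstrap sample means."""
    root, quart = sqrt(n), n ** 0.25
    spr_part = (d - 1) * (bsx - sx + bsy - sy)  # assumed spread-based, as in (10)
    return max(
        root * (d * (bsy - sy) + sx - bsx) + min(0.0, quart * (sy - sx)),
        root * (d * (bsx - sx) + sy - bsy) + min(0.0, quart * (sx - sy)),
        root * ((1 + d) * (bmx - mx + my - bmy) + spr_part)
        + min(0.0, quart * (mx - my)),
        root * ((1 + d) * (mx - bmx + bmy - my) + spr_part)
        + min(0.0, quart * (my - mx)),
    )


def bootstrap_pvalue(x, y, d, n_boot=1000, seed=0):
    """Return (T_n, p-value) for equal-size samples x, y of (mid, spr) pairs."""
    n = len(x)
    if n == 0 or len(y) != n:
        raise ValueError("x and y need the same positive length")
    rng = random.Random(seed)
    (mx, sx), (my, sy) = mean_pair(x), mean_pair(y)
    observed = t_n(mx, sx, my, sy, n, d)
    exceed = 0
    for _ in range(n_boot):
        # Resample each sample with replacement, independently of the other.
        bmx, bsx = mean_pair(rng.choices(x, k=n))
        bmy, bsy = mean_pair(rng.choices(y, k=n))
        if t_star(bmx, bsx, bmy, bsy, mx, sx, my, sy, n, d) >= observed:
            exceed += 1
    return observed, exceed / n_boot
```

Under this convention, $H_0$ would be rejected at level $\alpha$ when the returned p-value falls below $\alpha$, for instance `bootstrap_pvalue(x, y, d=2/3)[1] < 0.05`.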

4 Simulations

The empirical behaviour of the bootstrap test is shown by simulation. Two different situations are considered: in the first one the mid and spr components of the two independent RIs $X$ and $Y$ are independently generated. In the second situation, those components are allowed to have a certain level of dependence on each other. The two situations are described as follows:

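Situation 1: $\mathrm{mid}\,X \equiv \mathcal{N}(2, 5)$, $\mathrm{spr}\,X \equiv U(1, 3)$; $\mathrm{mid}\,Y \equiv \mathcal{N}(3, 5)$, $\mathrm{spr}\,Y \equiv U(1, 5)$.

Situation 2: $\mathrm{mid}\,X \equiv U(2, 6)$, $\mathrm{spr}\,X \equiv \mathrm{mid}\,X / 2$; $\mathrm{mid}\,Y = \mathrm{spr}\,Y \equiv U(1, 5)$.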

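It is straightforward to show that, for $d = 2/3$, Situation 1 satisfies conditions b) of Theorem 1, since $\mathrm{mid}\,E(Y) - \mathrm{mid}\,E(X) = 1 = (1 - d)\,\mathrm{spr}\,E(Y)$, and Situation 2 satisfies conditions a), since $\mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y) = 1$. Besides, $S(E(X), E(Y)) = 2/3$ in both cases.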
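The stated value of the similarity can be checked by hand from (1) and (2), using the means of the generating distributions and reading the first parameter of $\mathcal{N}(\cdot, \cdot)$ as the mean (an assumption about the notation):

```latex
% Situation 1: E(X) = [2 \pm 2] = [0, 4], E(Y) = [3 \pm 3] = [0, 6]
\lambda(E(X) \cap E(Y)) = \min\{4,\ 6,\ 2 + 3 - |2 - 3|\} = 4, \qquad
\lambda(E(X) \cup E(Y)) = 4 + 6 - 4 = 6, \qquad
S(E(X), E(Y)) = \tfrac{4}{6} = \tfrac{2}{3}.

% Situation 2: E(X) = [4 \pm 2] = [2, 6], E(Y) = [3 \pm 3] = [0, 6]
\lambda(E(X) \cap E(Y)) = \min\{4,\ 6,\ 2 + 3 - |4 - 3|\} = 4, \qquad
\lambda(E(X) \cup E(Y)) = 4 + 6 - 4 = 6, \qquad
S(E(X), E(Y)) = \tfrac{4}{6} = \tfrac{2}{3}.
```

In both situations $\mathrm{spr}\,E(X) = 2 = d\,\mathrm{spr}\,E(Y)$ and $|\mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y)| = 1 = (1 - d)\,\mathrm{spr}\,E(Y)$ for $d = 2/3$, so the expectations lie on the boundary of $H_0$. The two situations differ in the sign of $\mathrm{mid}\,E(X) - \mathrm{mid}\,E(Y)$, which is negative in Situation 1 and positive in Situation 2.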

The bootstrap test proposed in Section 3.1 has been run for 10000 simulations with 1000 bootstrap replications each to test $H_0 : S(E(X), E(Y)) \ge 2/3$ vs. $H_1 : S(E(X), E(Y)) < 2/3$, for several significance levels $\alpha$ and different sample sizes. Results are gathered in Table 1. They show that the empirical sizes of the test are in both cases quite close to the expected nominal significance levels even for moderate sample sizes. Specifically, the approximation to the nominal significance level is more conservative in the first situation than in the second one. The slight differences observed between the two situations may be due to the diverse nature of the distributions.

Table 1 Empirical size (in %) of the two-sample similarity bootstrap test in Situations 1 and 2

              Situation 1               Situation 2
  n \ 100·α     1      5      10          1      5      10
  10          2.27   6.88   10.64       2.38   7.36   11.85
  30          1.60   4.60    9.54       1.89   5.73   10.31
  50          1.35   4.96   10.59       1.44   5.32   10.46
  100         1.27   5.06   10.35       1.22   5.18   10.24
  200         0.95   4.89    9.88       1.1    5.12    9.8

Finally, a small empirical study to show the power of the proposed test has been developed. Specifically, $\mathrm{mid}\,X$ in Situation 1 has been chosen to have distributions $\mathcal{N}(1, 5)$, $\mathcal{N}(0, 5)$ and $\mathcal{N}(-1, 5)$, respectively. In these cases, the bootstrap approach for $\alpha = .05$ and $n = 50$ leads to p-values of .153, .381 and .692, respectively, and, therefore, in this case the power of the test approaches 1 as the distribution of $X$ moves further away from the null hypothesis.

5 Conclusions and open problems

A hypothesis test for checking the similarity between the expected values of two RIs has been introduced. A test statistic has been proposed and its limit distribution has been analyzed by means of both asymptotic and bootstrap techniques. Some simulation studies have been carried out to show the suitability of the bootstrap approach for moderate/large sample sizes.

As future work, theoretical and empirical comparisons between different similarity indexes should be developed. The power of the proposed test may also be theoretically analyzed, as well as the sensitivity of the test when different distributions are chosen. Other versions of the test statistic involving the covariance matrix can be studied. Finally, the proposed test could be extended to more than two RIs and to the fuzzy framework.

Acknowledgements The research in this paper is partially supported by the Spanish National Grant MTM2013-44212-P, and the Regional Grant FC-15-GRUPIN-14-005. Their financial support is gratefully acknowledged.

References

1. Aumann RJ (1965) Integrals of set-valued functions. J Math Anal Appl 12:1–12
2. Blanco-Fernández A, Corral N, González-Rodríguez G (2011) Estimation of a flexible simple linear model for interval data based on set arithmetic. Comput Stat Data An 55(9):2568–2578
3. Ferraro MB, Coppi R, González-Rodríguez G, Colubi A (2010) A linear regression model for imprecise response. Int J Approx Reason 51(7):759–770
4. Gil MA, González-Rodríguez G, Colubi A, Montenegro M (2007) Testing linear independence in linear models with interval-valued data. Comput Stat Data An 51(6):3002–3015
5. González-Rodríguez G, Colubi A, Gil MA (2012) Fuzzy data treated as functional data: A one-way ANOVA test approach. Comput Stat Data An 56(4):943–955
6. Horowitz JL, Manski CF (2006) Identification and estimation of statistical functionals using incomplete data. J Econom 132:445–459
7. Hudgens MG (2005) On nonparametric maximum likelihood estimation with interval censoring and left truncation. J Roy Stat Soc: Ser B 67:573–587
8. Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37:547–579
9. Körner R (2000) An asymptotic α-test for the expectation of random fuzzy variables. J Stat Plan Infer 83:331–346
10. Magnac T, Maurin E (2008) Partial identification in monotone binary models: discrete regressors and interval data. Rev Econ Stud 75:835–864
11. Matheron G (1975) Random Sets and Integral Geometry. Wiley, New York
12. Molchanov I (2005) Theory of Random Sets. Springer, London
13. Ramos-Guajardo AB, Colubi A, González-Rodríguez G (2014) Inclusion degree tests for the Aumann expectation of a random interval. Inf Sci 288(20):412–422
14. Ramos-Guajardo AB (2015) Similarity Test for the Expectation of a Random Interval and a Fixed Interval. In: Grzegorzewski P, Gagolewski M, Hryniewicz O, Gil MA (eds) Strengthening Links Between Data Analysis and Soft Computing. Advances in Intelligent Systems and Computing 315:175–182
15. Shawe-Taylor J, Cristianini N (2004) Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge