Two-sample similarity test for the
expected value of random intervals
Ana B. Ramos-Guajardo and Ángela Blanco-Fernández
Abstract The degree of similarity between the expectations of two random intervals is studied by means of a hypothesis testing procedure. For this purpose, a similarity measure for intervals is introduced based on the so-called Jaccard index for convex sets. The measure ranges from 0 (if the intervals are not similar at all, i.e., if they do not overlap) to 1 (if the intervals are equal). A test statistic is proposed and its limit distribution is analyzed by considering asymptotic and bootstrap techniques. Some simulation studies are carried out to examine the behaviour of the approach.
1 Introduction
Interval data derive from experimental studies involving ranges, fluctuations, subjective perceptions, censored data, grouped data, and so on [6, 7, 10]. Random intervals (RIs) have been shown to model and handle this kind of data suitably in different settings [2, 3, 11, 12].
The Aumann expectation of an RI is also an interval. Inferences concerning the Aumann expectation and, especially, hypothesis tests for the expected value of random intervals have been previously developed in the literature [5, 9]. Additionally, tests relaxing strict equalities have also been carried out as, for instance, inclusion tests for the Aumann expectation of RIs [13], or similarity tests for the expected value of an RI and a previously fixed interval [14].
The aim of this work is to develop a two-sample test for the similarity of the expectations of two RIs. The similarity measure to be considered is based on the classical Jaccard similarity coefficient for convex sets [8], which can be seen as the ratio of the Lebesgue measure of the intersection interval to the Lebesgue measure of the union interval [15]. A statistic to solve the test is introduced, and its asymptotic and bootstrap limit distributions are theoretically analyzed. The development of bootstrap techniques allows us to approximate the sampling distribution of the statistic in practice, since the asymptotic one depends on unknown parameters in general. Finally, simulation studies are developed to show the empirical behaviour of the procedure.
Department of Statistics and Operational Research, University of Oviedo, C/Calvo Sotelo, s/n, 33007, Oviedo, Spain.
blancoangela@uniovi.es, ramosana@uniovi.es
2 Preliminary concepts
From now on, let $\mathcal{K}_c(\mathbb{R})$ denote the family of non-empty closed and bounded intervals in $\mathbb{R}$. An interval $A \in \mathcal{K}_c(\mathbb{R})$ can be characterized by either its (mid, spr) representation (i.e., $A = [\operatorname{mid}A \pm \operatorname{spr}A]$, with $\operatorname{mid}A \in \mathbb{R}$ the midpoint or centre and $\operatorname{spr}A \ge 0$ the spread or radius of $A$) or its (inf, sup) representation (i.e., $A = [\inf A, \sup A]$).
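The usual interval arithmetic is based on the Minkowski addition and the product by a scalar. It is expressed in terms of the (mid, spr) representation as $A_1 + \lambda A_2 = [(\operatorname{mid}A_1 + \lambda \operatorname{mid}A_2) \pm (\operatorname{spr}A_1 + |\lambda| \operatorname{spr}A_2)]$, for $A_1, A_2 \in \mathcal{K}_c(\mathbb{R})$ and $\lambda \in \mathbb{R}$.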
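The Lebesgue measure of $A \in \mathcal{K}_c(\mathbb{R})$ is given by $\lambda(A) = 2\operatorname{spr}A$. Obviously, the Lebesgue measure of the empty set is $\lambda(\emptyset) = 0$. In addition, the Lebesgue measure of the intersection between $A$ and $B$, $\lambda(A \cap B)$, for any $A, B \in \mathcal{K}_c(\mathbb{R})$ can be expressed as follows [15]:
\[
\lambda(A \cap B) = \max\big\{0, \min\{2\operatorname{spr}A,\; 2\operatorname{spr}B,\; \operatorname{spr}A + \operatorname{spr}B - |\operatorname{mid}A - \operatorname{mid}B|\}\big\}. \tag{1}
\]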
A measure of the degree of similarity between two intervals $A, B \in \mathcal{K}_c(\mathbb{R})$ can be defined according to the Jaccard coefficient [8] as
\[
S(A, B) = \frac{\lambda(A \cap B)}{\lambda(A \cup B)}. \tag{2}
\]
This similarity measure fulfils that $S(A, B) = 0$ iff $A \cap B = \emptyset$, $S(A, B) = 1$ iff $A = B$, and $S(A, B) \in (0, 1)$ iff $A \cap B \neq \emptyset$ and $A \neq B$. As an example, the similarity measure of two intervals $A$ and $B$ is 1/2 whenever one interval is contained in the other and the length of the longer one is twice the length of the shorter one.
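A minimal Python sketch of formulas (1) and (2), with intervals given in (mid, spr) form. The function names are illustrative, and raising an error when both intervals reduce to single points (where (2) becomes 0/0) is an assumption of this sketch, not something stated in the paper.

```python
def intersection_length(mid_a, spr_a, mid_b, spr_b):
    """Lebesgue measure of the intersection of A and B, as in equation (1)."""
    overlap = spr_a + spr_b - abs(mid_a - mid_b)
    return max(0.0, min(2 * spr_a, 2 * spr_b, overlap))


def jaccard(mid_a, spr_a, mid_b, spr_b):
    """Jaccard similarity S(A, B) of equation (2)."""
    inter = intersection_length(mid_a, spr_a, mid_b, spr_b)
    # lambda(A union B) = lambda(A) + lambda(B) - lambda(A intersect B)
    union = 2 * spr_a + 2 * spr_b - inter
    if union == 0:
        # Assumption of this sketch: treat the 0/0 case as an error.
        raise ValueError("S(A, B) is undefined when both intervals are single points")
    return inter / union
```

For instance, `jaccard(2, 1, 2, 2)` compares [1, 3] with [0, 4]: the intersection has length 2 and the union has length 4, so the similarity is 0.5.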
Random variables modelling those situations in which intervals in $\mathcal{K}_c(\mathbb{R})$ are provided as outcomes are called random intervals (RIs). Given a probability space $(\Omega, \mathcal{A}, P)$, an RI is a Borel measurable mapping $X: \Omega \to \mathcal{K}_c(\mathbb{R})$ w.r.t. the well-known Hausdorff metric on $\mathcal{K}_c(\mathbb{R})$ [11]. Equivalently, $X$ is an RI if both $\operatorname{mid}X, \operatorname{spr}X: \Omega \to \mathbb{R}$ are real-valued random variables and $\operatorname{spr}X \ge 0$ a.s.-$[P]$.
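Whenever $\operatorname{mid}X, \operatorname{spr}X \in L^1(\Omega, \mathcal{A}, P)$, it is possible to define the Aumann expected value of $X$ [1]. In terms of classical expectations it is expressed as $E([\operatorname{mid}X \pm \operatorname{spr}X]) = [E(\operatorname{mid}X) \pm E(\operatorname{spr}X)]$. Let $\{X_i\}_{i=1}^n$ be a simple random sample of $X$. The corresponding sample expectation of $X$ is defined coherently in terms of the interval arithmetic as $\overline{X} = (1/n)\sum_{i=1}^n X_i$, and it fulfils $\overline{X} = [\overline{\operatorname{mid}X} \pm \overline{\operatorname{spr}X}]$.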
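The sample expectation can be sketched in a few lines of Python with NumPy: averaging the mids and the spreads separately gives the same interval as averaging the endpoints. The function names below are hypothetical and chosen only for this illustration.

```python
import numpy as np


def sample_mean_interval(mids, sprs):
    """Sample Aumann mean of intervals [mid_i +/- spr_i], returned as (mid, spr)."""
    mids = np.asarray(mids, dtype=float)
    sprs = np.asarray(sprs, dtype=float)
    if np.any(sprs < 0):
        raise ValueError("spreads must be non-negative")
    return float(mids.mean()), float(sprs.mean())


def to_inf_sup(mid, spr):
    """Convert the (mid, spr) representation to (inf, sup)."""
    return mid - spr, mid + spr
```

With the two intervals [0, 2] and [2, 6], i.e. mids (1, 4) and spreads (1, 2), the sample mean is [2.5 ± 1.5] = [1, 4], which is also the interval whose endpoints are the averages of the original endpoints.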
3 Similarity test for the expected values of two RIs
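Let $(\Omega, \mathcal{A}, P)$ be a probability space, and $X, Y: \Omega \to \mathcal{K}_c(\mathbb{R})$ be two RIs such that $\operatorname{spr}E(X) > 0$ and $\operatorname{spr}E(Y) > 0$. Some mild conditions are assumed to guarantee the existence of the involved moments and to avoid trivial situations (as, for instance, the singularity of the covariance matrix). Thus, $X$ and $Y$ are supposed to belong to the following class of random intervals:
\[
\mathcal{P} = \Big\{ X: \Omega \to \mathcal{K}_c(\mathbb{R}) \;\Big|\; \sigma^2_{\operatorname{mid}X} < \infty,\; 0 < \sigma^2_{\operatorname{spr}X} < \infty,\; \big(\operatorname{Cov}(\operatorname{mid}X, \operatorname{spr}X)\big)^2 \neq \sigma^2_{\operatorname{mid}X}\, \sigma^2_{\operatorname{spr}X} \Big\}.
\]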
Given $d \in [0, 1]$, the aim is to test
\[
H_0: S(E(X), E(Y)) \ge d \quad \text{vs.} \quad H_1: S(E(X), E(Y)) < d. \tag{3}
\]
The alternative one-sided and two-sided tests (that is, those analyzing if the Jaccard index of the expectations equals $d$ or if it is greater than or equal to $d$) could be analogously studied. We focus our attention on (3) since it seems to be the most appealing for practical applications. From (1) and (2) it is straightforward to show that the null hypothesis of the test (3) can be equivalently expressed as
\[
H_0: \max\big\{ d\operatorname{spr}E(Y) - \operatorname{spr}E(X),\; d\operatorname{spr}E(X) - \operatorname{spr}E(Y),\; (1 + d)\,|\operatorname{mid}E(X) - \operatorname{mid}E(Y)| + (d - 1)\,(\operatorname{spr}E(X) + \operatorname{spr}E(Y)) \big\} \le 0. \tag{4}
\]
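The equivalence between (3) and (4) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the check is restricted to $d > 0$ and positive spreads, away from the boundary where rounding could matter.

```python
def h0_criterion(mid_ex, spr_ex, mid_ey, spr_ey, d):
    """Left-hand side of (4); H0 holds iff the returned value is <= 0."""
    return max(
        d * spr_ey - spr_ex,
        d * spr_ex - spr_ey,
        (1 + d) * abs(mid_ex - mid_ey) + (d - 1) * (spr_ex + spr_ey),
    )


def jaccard(mid_a, spr_a, mid_b, spr_b):
    """Jaccard similarity from (1) and (2); assumes positive spreads."""
    inter = max(0.0, min(2 * spr_a, 2 * spr_b,
                         spr_a + spr_b - abs(mid_a - mid_b)))
    return inter / (2 * spr_a + 2 * spr_b - inter)


# E(X) = [0, 4] and E(Y) = [0, 6], whose Jaccard similarity is 2/3.
for d in (0.5, 0.8):
    in_h0_via_4 = h0_criterion(2, 2, 3, 3, d) <= 0
    in_h0_via_3 = jaccard(2, 2, 3, 3) >= d
    print(d, in_h0_via_4, in_h0_via_3)
```

Both formulations agree: the pair of expectations satisfies $H_0$ for $d = 0.5$ and violates it for $d = 0.8$.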
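The resolution of the test is addressed below by considering an asymptotic approach. Let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be two samples of random intervals being independent and identically distributed as $X$ and $Y$, respectively. The test statistic is defined as
\[
T_n = \sqrt{n}\, \max\big\{ d\operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{X}_n,\; d\operatorname{spr}\overline{X}_n - \operatorname{spr}\overline{Y}_n,\; (1 + d)\,|\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{Y}_n| + (d - 1)\,(\operatorname{spr}\overline{X}_n + \operatorname{spr}\overline{Y}_n) \big\}. \tag{5}
\]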
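A minimal sketch of how the statistic in (5) could be computed from two samples of equal size given in (mid, spr) form. The function name and the input layout (four arrays) are assumptions made for illustration.

```python
import numpy as np


def similarity_statistic(mid_x, spr_x, mid_y, spr_y, d):
    """Statistic T_n of equation (5) for two samples of n intervals each."""
    mid_x, spr_x, mid_y, spr_y = (
        np.asarray(a, dtype=float) for a in (mid_x, spr_x, mid_y, spr_y)
    )
    n = mid_x.size
    if not (spr_x.size == mid_y.size == spr_y.size == n):
        raise ValueError("all four arrays must have the same length n")
    # Sample means of the mids and spreads of X and Y.
    mx, sx, my, sy = mid_x.mean(), spr_x.mean(), mid_y.mean(), spr_y.mean()
    return float(np.sqrt(n) * max(
        d * sy - sx,
        d * sx - sy,
        (1 + d) * abs(mx - my) + (d - 1) * (sx + sy),
    ))
```

Negative values of the statistic indicate that the sample means satisfy the inequality in (4), while large positive values point towards $H_1$.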
From now on, let us consider the bivariate normal distributions $Z = (z_1, z_2)^T \equiv N_2(0, \Sigma_1)$ and $U = (u_1, u_2)^T \equiv N_2(0, \Sigma_2)$, where $\Sigma_1$ is the covariance matrix of the random vector $(\operatorname{mid}X, \operatorname{spr}X)$ and $\Sigma_2$ is the corresponding one for $(\operatorname{mid}Y, \operatorname{spr}Y)$. The limit distribution of the statistic $T_n$ under $H_0$ is analyzed in the following result.
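Theorem 1. For $n \in \mathbb{N}$, let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be simple random samples from $X$ and $Y$, respectively. Let $T_n$ be defined as in (5). If $X, Y \in \mathcal{P}$, then:
a) Whenever $\operatorname{spr}E(X) = d\operatorname{spr}E(Y)$ and $\operatorname{mid}E(X) - \operatorname{mid}E(Y) = (1 - d)\operatorname{spr}E(Y)$, it is fulfilled that
\[
T_n \xrightarrow{\;\mathcal{L}\;} \max\{d u_2 - z_2,\; (1 + d)(z_1 - u_1) + (d - 1)(z_2 + u_2)\}. \tag{6}
\]
b) Whenever $\operatorname{spr}E(X) = d\operatorname{spr}E(Y)$ and $-\operatorname{mid}E(X) + \operatorname{mid}E(Y) = (1 - d)\operatorname{spr}E(Y)$, it is fulfilled that
\[
T_n \xrightarrow{\;\mathcal{L}\;} \max\{d u_2 - z_2,\; (1 + d)(u_1 - z_1) + (d - 1)(z_2 + u_2)\}. \tag{7}
\]
c) Whenever $d\operatorname{spr}E(X) = \operatorname{spr}E(Y)$ and $\operatorname{mid}E(X) - \operatorname{mid}E(Y) = \frac{1 - d}{d}\operatorname{spr}E(Y)$, it is fulfilled that
\[
T_n \xrightarrow{\;\mathcal{L}\;} \max\{d z_2 - u_2,\; (1 + d)(z_1 - u_1) + (d - 1)(z_2 + u_2)\}. \tag{8}
\]
d) Whenever $d\operatorname{spr}E(X) = \operatorname{spr}E(Y)$ and $-\operatorname{mid}E(X) + \operatorname{mid}E(Y) = \frac{1 - d}{d}\operatorname{spr}E(Y)$, it is fulfilled that
\[
T_n \xrightarrow{\;\mathcal{L}\;} \max\{d z_2 - u_2,\; (1 + d)(u_1 - z_1) + (d - 1)(z_2 + u_2)\}. \tag{9}
\]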
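Proof. The statistic $T_n$ can be equivalently expressed as $T_n = \sqrt{n}\max\{A, B, C\}$, where
\[
\begin{aligned}
A &= d\,(\operatorname{spr}\overline{Y}_n - \operatorname{spr}E(Y)) + (d\operatorname{spr}E(Y) - \operatorname{spr}E(X)) + (\operatorname{spr}E(X) - \operatorname{spr}\overline{X}_n), \\
B &= d\,(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X)) + (d\operatorname{spr}E(X) - \operatorname{spr}E(Y)) + (\operatorname{spr}E(Y) - \operatorname{spr}\overline{Y}_n), \\
C &= (1 + d)\,\big|\operatorname{mid}\overline{X}_n - \operatorname{mid}E(X) + \operatorname{mid}E(X) - \operatorname{mid}E(Y) + \operatorname{mid}E(Y) - \operatorname{mid}\overline{Y}_n\big| \\
  &\quad + (d - 1)\,\big(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X) + \operatorname{spr}E(X) + \operatorname{spr}E(Y) - \operatorname{spr}E(Y) + \operatorname{spr}\overline{Y}_n\big).
\end{aligned}
\]
a) If $\operatorname{spr}E(X) = d\operatorname{spr}E(Y)$ and $\operatorname{mid}E(X) - \operatorname{mid}E(Y) = (1 - d)\operatorname{spr}E(Y)$, the second term and the negative form of the third term (the one in which the absolute value is taken with negative sign) diverge in probability to $-\infty$ as $n \to \infty$ by the Central Limit and Slutsky's theorems. Finally, the Continuous Mapping and the Central Limit Theorems for real variables lead to (6).
Similar reasoning applies in the other three situations:
b) If $\operatorname{spr}E(X) = d\operatorname{spr}E(Y)$ and $-\operatorname{mid}E(X) + \operatorname{mid}E(Y) = (1 - d)\operatorname{spr}E(Y)$, the second term and the positive form of the third term diverge in probability to $-\infty$ as $n \to \infty$;
c) If $d\operatorname{spr}E(X) = \operatorname{spr}E(Y)$ and $\operatorname{mid}E(X) - \operatorname{mid}E(Y) = \frac{1 - d}{d}\operatorname{spr}E(Y)$, the first term and the negative form of the third term diverge in probability to $-\infty$ as $n \to \infty$;
d) If $d\operatorname{spr}E(X) = \operatorname{spr}E(Y)$ and $-\operatorname{mid}E(X) + \operatorname{mid}E(Y) = \frac{1 - d}{d}\operatorname{spr}E(Y)$, the first term and the positive form of the third term diverge in probability to $-\infty$ as $n \to \infty$.
□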
Remark 1. As in the real framework, other situations under $H_0$ different from the ones shown in Theorem 1 (which are the 'worst' or 'limit' situations under $H_0$) lead the statistic $T_n$ to converge weakly to a limit distribution which is stochastically bounded by one of those provided in the theorem.
Since the limit distribution of $T_n$ depends on $X$ and $Y$, we can consider the following $(X, Y)$-dependent distribution for the theoretical analysis of the testing procedure (see [14]):
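\[
\begin{aligned}
T'_n = \max\Big\{ & \sqrt{n}\,\big( d\,(\operatorname{spr}\overline{Y}_n - \operatorname{spr}E(Y)) + \operatorname{spr}E(X) - \operatorname{spr}\overline{X}_n \big) + \min\{0,\, n^{1/4}(\operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{X}_n)\}, \\
& \sqrt{n}\,\big( d\,(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X)) + \operatorname{spr}E(Y) - \operatorname{spr}\overline{Y}_n \big) + \min\{0,\, n^{1/4}(\operatorname{spr}\overline{X}_n - \operatorname{spr}\overline{Y}_n)\}, \\
& \sqrt{n}\,\big( (1 + d)(\operatorname{mid}\overline{X}_n - \operatorname{mid}E(X) + \operatorname{mid}E(Y) - \operatorname{mid}\overline{Y}_n) + (d - 1)(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X) + \operatorname{spr}\overline{Y}_n - \operatorname{spr}E(Y)) \big) \\
& \qquad + \min\{0,\, n^{1/4}(\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{Y}_n)\}, \\
& \sqrt{n}\,\big( (1 + d)(\operatorname{mid}E(X) - \operatorname{mid}\overline{X}_n - \operatorname{mid}E(Y) + \operatorname{mid}\overline{Y}_n) + (d - 1)(\operatorname{spr}\overline{X}_n - \operatorname{spr}E(X) + \operatorname{spr}\overline{Y}_n - \operatorname{spr}E(Y)) \big) \\
& \qquad + \min\{0,\, n^{1/4}(\operatorname{mid}\overline{Y}_n - \operatorname{mid}\overline{X}_n)\} \Big\}.
\end{aligned} \tag{10}
\]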
As in [14], the inclusion of the terms $\min\{0, n^{1/4}(\operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{X}_n)\}$ (and analogously for the mids) in $T'_n$ is useful to determine which terms of its expression are relevant in each situation considered under $H_0$. The consistency and the power of the test are shown in Theorem 2.
Theorem 2. Let $\alpha \in [0, 1]$ and let $k_{1-\alpha}$ be the $(1 - \alpha)$-quantile of the asymptotic distribution of $T'_n$. If $H_0$ in (4) is true, then it is satisfied that
\[
\limsup_{n \to \infty} P(T'_n > k_{1-\alpha}) \le \alpha,
\]
and the equality is achieved whenever conditions in a), b), c) and d) in Theorem 1 are fulfilled. In addition, if $H_0$ is not true, then
\[
\lim_{n \to \infty} P(T'_n > k_{1-\alpha}) = 1.
\]
As an immediate consequence of Theorem 2, the test which rejects $H_0$ in (4) at the significance level $\alpha$ whenever $T'_n > k_{1-\alpha}$ is asymptotically efficient and consistent.
3.1 Bootstrap test
Since the asymptotic limit distribution is not easy to handle in practice, a residual bootstrap approach is proposed. Let $X$ and $Y$ be two RIs such that $\operatorname{spr}E(X) > 0$ and $\operatorname{spr}E(Y) > 0$, and let $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ be two simple random samples drawn from $X$ and $Y$, respectively. Let us consider bootstrap samples for $X$ and $Y$, i.e., $\{X_i^*\}_{i=1}^n$ and $\{Y_i^*\}_{i=1}^n$ being chosen randomly and with replacement from $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$, respectively. The bootstrap statistic is based on the expression of $T'_n$ and it is defined as follows:
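\[
\begin{aligned}
T^*_n = \max\Big\{ & \sqrt{n}\,\big( d\,(\operatorname{spr}\overline{Y^*}_n - \operatorname{spr}\overline{Y}_n) + \operatorname{spr}\overline{X}_n - \operatorname{spr}\overline{X^*}_n \big) + \min\{0,\, n^{1/4}(\operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{X}_n)\}, \\
& \sqrt{n}\,\big( d\,(\operatorname{spr}\overline{X^*}_n - \operatorname{spr}\overline{X}_n) + \operatorname{spr}\overline{Y}_n - \operatorname{spr}\overline{Y^*}_n \big) + \min\{0,\, n^{1/4}(\operatorname{spr}\overline{X}_n - \operatorname{spr}\overline{Y}_n)\}, \\
& \sqrt{n}\,\big( (1 + d)(\operatorname{mid}\overline{X^*}_n - \operatorname{mid}\overline{X}_n + \operatorname{mid}\overline{Y}_n - \operatorname{mid}\overline{Y^*}_n) + (d - 1)(\operatorname{spr}\overline{X^*}_n - \operatorname{spr}\overline{X}_n + \operatorname{spr}\overline{Y^*}_n - \operatorname{spr}\overline{Y}_n) \big) \\
& \qquad + \min\{0,\, n^{1/4}(\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{Y}_n)\}, \\
& \sqrt{n}\,\big( (1 + d)(\operatorname{mid}\overline{X}_n - \operatorname{mid}\overline{X^*}_n + \operatorname{mid}\overline{Y^*}_n - \operatorname{mid}\overline{Y}_n) + (d - 1)(\operatorname{spr}\overline{X^*}_n - \operatorname{spr}\overline{X}_n + \operatorname{spr}\overline{Y^*}_n - \operatorname{spr}\overline{Y}_n) \big) \\
& \qquad + \min\{0,\, n^{1/4}(\operatorname{mid}\overline{Y}_n - \operatorname{mid}\overline{X}_n)\} \Big\}.
\end{aligned} \tag{11}
\]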
The different asymptotic distributions of $T^*_n$ are (almost surely) the ones provided in Theorem 1 for $T_n$, under the same conditions, and the consistency of the bootstrap procedure is straightforwardly derived. The distribution of $T^*_n$ is approximated in practice by means of the Monte Carlo method.
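The Monte Carlo approximation can be sketched in Python with NumPy as follows. Several choices here are assumptions of the sketch rather than statements of the paper: the p-value is taken as the proportion of bootstrap replicates $T^*_n$ that are at least as large as the observed $T_n$ in (5), each interval is resampled as a (mid, spr) pair, the fourth term of the bootstrap statistic uses the spreads of $Y$ in parallel with (10), and the function names are hypothetical.

```python
import numpy as np


def t_n(mid_x, spr_x, mid_y, spr_y, d):
    """Observed statistic T_n of equation (5); inputs are NumPy arrays."""
    n = len(mid_x)
    mx, sx, my, sy = mid_x.mean(), spr_x.mean(), mid_y.mean(), spr_y.mean()
    return np.sqrt(n) * max(d * sy - sx, d * sx - sy,
                            (1 + d) * abs(mx - my) + (d - 1) * (sx + sy))


def bootstrap_p_value(mid_x, spr_x, mid_y, spr_y, d, n_boot=1000, seed=None):
    """Monte Carlo bootstrap p-value for H0: S(E(X), E(Y)) >= d (a sketch)."""
    rng = np.random.default_rng(seed)
    mid_x, spr_x, mid_y, spr_y = (np.asarray(a, dtype=float)
                                  for a in (mid_x, spr_x, mid_y, spr_y))
    n = len(mid_x)  # both samples are assumed to have the same size n
    mx, sx, my, sy = mid_x.mean(), spr_x.mean(), mid_y.mean(), spr_y.mean()
    root_n, quart_n = np.sqrt(n), n ** 0.25
    t_obs = t_n(mid_x, spr_x, mid_y, spr_y, d)
    exceed = 0
    for _ in range(n_boot):
        # Resample whole intervals: the same index picks a mid and its spread.
        ix = rng.integers(0, n, size=n)
        iy = rng.integers(0, n, size=n)
        mxb, sxb = mid_x[ix].mean(), spr_x[ix].mean()
        myb, syb = mid_y[iy].mean(), spr_y[iy].mean()
        spr_part = (d - 1) * (sxb - sx + syb - sy)
        t_star = max(
            root_n * (d * (syb - sy) + sx - sxb) + min(0.0, quart_n * (sy - sx)),
            root_n * (d * (sxb - sx) + sy - syb) + min(0.0, quart_n * (sx - sy)),
            root_n * ((1 + d) * (mxb - mx + my - myb) + spr_part)
            + min(0.0, quart_n * (mx - my)),
            root_n * ((1 + d) * (mx - mxb + myb - my) + spr_part)
            + min(0.0, quart_n * (my - mx)),
        )
        exceed += int(t_star >= t_obs)
    return exceed / n_boot
```

Under this convention, $H_0$ would be rejected at level $\alpha$ when the returned p-value is below $\alpha$.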
4 Simulations
The empirical behaviour of the bootstrap test is shown by simulation. Two different situations are considered: in the first one the mid and spr components of the two independent RIs $X$ and $Y$ are independently generated. In the second situation, those components are allowed to have a certain level of dependence on each other. The two situations are described as follows:
Situation 1: $\operatorname{mid}X \equiv N(2, 5)$, $\operatorname{spr}X \equiv U(1, 3)$; $\operatorname{mid}Y \equiv N(3, 5)$, $\operatorname{spr}Y \equiv U(1, 5)$.
Situation 2: $\operatorname{mid}X \equiv U(2, 6)$, $\operatorname{spr}X \equiv \operatorname{mid}X/2$; $\operatorname{mid}Y = \operatorname{spr}Y \equiv U(1, 5)$.
It is straightforward to show that the theoretical situation 1 satisfies the con-
ditions a) of Theorem 1, and the situation 2 is under conditions b). Besides,
S(E(X), E(Y)) = 2/3 in both cases.
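As a quick check of the value 2/3, the expectations can be worked out from the stated distributions. The only facts used are that a $U(a, b)$ variable has mean $(a + b)/2$ and that the first parameter of $N(\cdot, \cdot)$ is the mean; formulas (1) and (2) then give:

```latex
\begin{align*}
\text{Situation 1: } & E(X) = [2 \pm 2] = [0, 4], \qquad E(Y) = [3 \pm 3] = [0, 6], \\
& \lambda(E(X) \cap E(Y)) = \max\{0, \min\{4,\, 6,\, 2 + 3 - |2 - 3|\}\} = 4, \\
& \lambda(E(X) \cup E(Y)) = 4 + 6 - 4 = 6, \qquad S(E(X), E(Y)) = 4/6 = 2/3. \\[4pt]
\text{Situation 2: } & E(X) = [4 \pm 2] = [2, 6], \qquad E(Y) = [3 \pm 3] = [0, 6], \\
& \lambda(E(X) \cap E(Y)) = \max\{0, \min\{4,\, 6,\, 2 + 3 - |4 - 3|\}\} = 4, \\
& \lambda(E(X) \cup E(Y)) = 4 + 6 - 4 = 6, \qquad S(E(X), E(Y)) = 4/6 = 2/3.
\end{align*}
```

In both situations $\operatorname{spr}E(X) = 2 = d\operatorname{spr}E(Y)$ with $d = 2/3$, so the expectations lie on the boundary of $H_0$.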
The bootstrap test proposed in Section 3.1 has been run for 10000 simulations with 1000 bootstrap replications each to test $H_0: S(E(X), E(Y)) \ge 2/3$ vs. $H_1: S(E(X), E(Y)) < 2/3$, for several significance levels $\alpha$ and different sample sizes. Results are gathered in Table 1. They show that the empirical sizes of the test are in both cases quite close to the expected nominal significance levels even for moderate sample sizes. Specifically, the approximation to the nominal significance level is more conservative in the first situation than in the second one. The slight differences observed between the two situations may be due to the diverse nature of the distributions.
Table 1 Empirical size of the two-sample similarity bootstrap test in Situations 1 and 2

              Situation 1               Situation 2
n \ 100·α     1       5       10        1       5       10
10            2.27    6.88    10.64     2.38    7.36    11.85
30            1.60    4.60    9.54      1.89    5.73    10.31
50            1.35    4.96    10.59     1.44    5.32    10.46
100           1.27    5.06    10.35     1.22    5.18    10.24
200           .95     4.89    9.88      1.1     5.12    9.8
Finally, a small empirical study to show the power of the proposed test has been developed. Specifically, $\operatorname{mid}X$ in Situation 1 has been chosen to have distributions N(1,5), N(0,5) and N(1,5), respectively. In these cases, the bootstrap approach for $\alpha = .05$ and $n = 50$ leads to p-values of .153, .381 and .692, respectively, and, therefore, in this case the power of the test approaches 1 as the distribution of $X$ moves further away from the null hypothesis.
5 Conclusions and open problems
A hypothesis test for checking the similarity between the expected values of two RIs has been introduced. A test statistic has been proposed and its limit distribution has been analyzed by means of both asymptotic and bootstrap techniques. Some simulation studies have been carried out to show the suitability of the bootstrap approach for moderate/large sample sizes.
As future work, theoretical and empirical comparisons between different
similarity indexes should be developed. The power of the proposed test may
also be theoretically analyzed as well as the sensitivity of the test when
different distributions are chosen. Other versions of the test statistic involving
the covariance matrix can be studied. Finally, the proposed test could be
extended to more than two RIs and to the fuzzy framework.
Acknowledgements The research in this paper is partially supported by the Spanish
National Grant MTM2013-44212-P, and the Regional Grant FC-15-GRUPIN-14-005. Their
financial support is gratefully acknowledged.
References
1. Aumann RJ (1965) Integrals of set-valued functions. J Math Anal Appl 12:1–12
2. Blanco-Fernández A, Corral N, González-Rodríguez G (2011) Estimation of a flexible simple linear model for interval data based on set arithmetic. Comput Stat Data An 55(9):2568–2578
3. Ferraro MB, Coppi R, González-Rodríguez G, Colubi A (2010) A linear regression model for imprecise response. Int J Approx Reason 51(7):759–770
4. Gil MA, González-Rodríguez G, Colubi A, Montenegro M (2007) Testing linear independence in linear models with interval-valued data. Comput Stat Data An 51(6):3002–3015
5. González-Rodríguez G, Colubi A, Gil MA (2012) Fuzzy data treated as functional data: A one-way ANOVA test approach. Comput Stat Data An 56(4):943–955
6. Horowitz JL, Manski CF (2006) Identification and estimation of statistical functionals using incomplete data. J Econom 132:445–459
7. Hudgens MG (2005) On nonparametric maximum likelihood estimation with interval censoring and left truncation. J Roy Stat Soc: Ser B 67:573–587
8. Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37:547–579
9. Körner R (2000) An asymptotic α-test for the expectation of random fuzzy variables. J Stat Plan Infer 83:331–346
10. Magnac T, Maurin E (2008) Partial identification in monotone binary models: discrete regressors and interval data. Rev Econ Stud 75:835–864
11. Matheron G (1975) Random Sets and Integral Geometry. Wiley, New York
12. Molchanov I (2005) Theory of Random Sets. Springer, London
13. Ramos-Guajardo AB, Colubi A, González-Rodríguez G (2014) Inclusion degree tests for the Aumann expectation of a random interval. Inf Sci 288(20):412–422
14. Ramos-Guajardo AB (2015) Similarity Test for the Expectation of a Random Interval and a Fixed Interval. In: Grzegorzewski P, Gagolewski M, Hryniewicz O, Gil MA (eds) Strengthening Links Between Data Analysis and Soft Computing. Advances in Intelligent Systems and Computing 315:175–182
15. Shawe-Taylor J, Cristianini N (2004) Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge