All content in this area was uploaded by Ana Belén Ramos-Guajardo on May 10, 2018
Independent k-sample equality
distribution test based on the fuzzy
representation
Angela Blanco-Fernández and Ana B. Ramos-Guajardo
Abstract Classical tests for the equality of distributions of real-valued
random variables are widely applied in Statistics. When the normality
assumption for the variables fails, non-parametric techniques are to be
considered: the Mann-Whitney, Wilcoxon, Kruskal-Wallis and Friedman tests,
among other alternatives. Fuzzy representations of real-valued random
variables have recently been shown to describe the statistical behaviour of
the variables effectively. Indeed, the expected value of certain fuzzy
representations fully characterizes the distribution of the variable. The aim
of this paper is to use this characterization to test the equality of
distribution of two or more real-valued random variables, as an alternative
to classical procedures. The inferential problem is solved through a
parametric test for the equality of expectations of fuzzy-valued random
variables. Theoretical results on inference for fuzzy random variables
support the validity of the test. In addition, simulation studies and
practical applications illustrate the empirical performance of the method.
1 Introduction
The development of statistical methods for fuzzy random variables has
increased considerably in recent decades, building on the seminal ideas on
fuzziness by Zadeh [16]. In some situations, experimental data are not
precise observations represented by fixed categories or point-valued real
numbers and modelled by real-valued variables. The outcomes of the experiment
may be imprecise or fuzzy, in the sense that they are represented not by a
single point value but by a set of values, an interval, or even a function.
Imprecise experimental data are effectively modelled by means of fuzzy-valued
variables. In addition to the imprecision of the data, the randomness of the
data generation process leads to the formalization of fuzzy-valued random
variables (FRVs).

Department of Statistics and Operational Research, University of Oviedo,
C/Calvo Sotelo, s/n, 33007, Oviedo, Spain.
blancoangela@uniovi.es · ramosana@uniovi.es

Statistical, probabilistic and inferential methods for fuzzy random
variables have been extensively investigated in the literature. It is
important to remark that fuzzy data can be viewed from two different
perspectives, usually called the ontic and epistemic views of fuzzy data, and
the statistical treatment of the variables differs radically between them. In
brief, the epistemic approach considers fuzzy data as imprecise observations
or descriptions of crisp (but unknown) quantities. Statistical methods aim to
draw conclusions about the original real-valued variable, and they generally
transfer the imprecision to methods and results [4, 5, 10, 11].
Alternatively, ontic fuzzy data are treated as precise entities representing
the outcomes of the experiment, belonging to the corresponding space of
functions instead of the space of real numbers. In this case, statistical
methods try to mimic classical techniques in order to draw conclusions
directly about the fuzzy-valued variables modelling the experiment
[2, 6, 7, 12, 13]. Further discussion of the two frameworks can be found in
[2] and [5].
Besides their own statistical analysis, fuzzy-valued random variables in the
ontic perspective have also proved to be a powerful tool for obtaining
statistical conclusions about classical real-valued random variables
[1, 3, 8, 9]. Exploratory and inferential studies for real random variables
have been developed through the so-called fuzzy representation of the
variable, defined, roughly speaking, by applying a fuzzy operator to the
original variable and fuzzifying its values. The key idea is that imprecision
is not introduced into the data gratuitously; rather, this transformation is
very effective for certain statistical purposes. The aim of this paper is to
extend this line of research to test the equality of two or more real-valued
distributions based on the fuzzy representation of the variables.
The rest of the paper is organized as follows. In Section 2, the main
concepts concerning fuzzy random variables and the concept of fuzzy
representation of a real-valued variable are recalled. The inferential
studies on the equality of real-valued distributions are presented in
Section 3, together with theoretical and empirical results on the proposed
tests. Finally, some conclusions and open problems are discussed in
Section 4.
2 Preliminaries
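Let $\mathcal{F}_c(\mathbb{R})$ denote the class of fuzzy sets
$U:\mathbb{R}\to[0,1]$ such that $U_\alpha\in\mathcal{K}_c(\mathbb{R})$ for
all $\alpha\in[0,1]$, where $\mathcal{K}_c(\mathbb{R})$ is the family of all
non-empty closed and bounded intervals of $\mathbb{R}$. The $\alpha$-levels
of $U$ are defined as $U_\alpha=\{x\in\mathbb{R}\mid U(x)\ge\alpha\}$ if
$\alpha\in(0,1]$, and $U_0$ is the closure of the support of $U$.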
The usual arithmetic between fuzzy sets is based on Zadeh's extension
principle [16]. It agrees levelwise with the Minkowski addition and the
product by scalars for real intervals. Given
$U, V\in\mathcal{F}_c(\mathbb{R})$ and $\lambda\in\mathbb{R}$, $U+V$ and
$\lambda U$ are defined such that
$(U+V)_\alpha=U_\alpha+V_\alpha=\{u+v : u\in U_\alpha,\ v\in V_\alpha\}$ and
$(\lambda U)_\alpha=\lambda U_\alpha=\{\lambda u : u\in U_\alpha\}$, for all
$\alpha\in[0,1]$.
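The levelwise arithmetic can be illustrated with a short sketch. It assumes, purely for illustration, that a fuzzy set is stored as a list of (inf, sup) pairs, one per α on a fixed grid; the names `Fuzzy`, `add` and `scale` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# A fuzzy set on a fixed alpha grid: one (inf, sup) pair per alpha-level.
Fuzzy = List[Tuple[float, float]]


def add(u: Fuzzy, v: Fuzzy) -> Fuzzy:
    """Levelwise Minkowski sum: (U + V)_alpha = U_alpha + V_alpha."""
    return [(a + c, b + d) for (a, b), (c, d) in zip(u, v)]


def scale(lam: float, u: Fuzzy) -> Fuzzy:
    """Levelwise product by a scalar; a negative scalar swaps the endpoints."""
    return [(min(lam * a, lam * b), max(lam * a, lam * b)) for a, b in u]


# Triangular fuzzy numbers sampled at alpha = 0, 0.5, 1.
u = [(1.0, 3.0), (1.5, 2.5), (2.0, 2.0)]
v = [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5)]
print(add(u, v))       # [(1.0, 4.0), (1.75, 3.25), (2.5, 2.5)]
print(scale(-1.0, u))  # [(-3.0, -1.0), (-2.5, -1.5), (-2.0, -2.0)]
```

Note that `scale(-1.0, u)` is not an additive inverse of `u`: adding it to `u` yields an interval around 0 rather than the crisp 0, which is why the space is only a cone and not a linear space.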
The space $\mathcal{F}_c(\mathbb{R})$ can be embedded into a convex and
closed cone of $L^2(\{-1,1\}\times[0,1])$ by means of the support function
[12], defined for any $U\in\mathcal{F}_c(\mathbb{R})$ as
$s_U:\{-1,1\}\times[0,1]\to\mathbb{R}$ such that
$s_U(u,\alpha)=\sup_{v\in U_\alpha}\langle u,v\rangle$. It is important to
note that, although this embedding provides good operational properties, the
statistical processing of fuzzy sets cannot be directly transferred to
$L^2(\{-1,1\}\times[0,1])$; it must always be guaranteed that the results
remain within the cone.
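In order to measure distances between fuzzy sets, the family of metrics
$D_\theta^\varphi$ in $\mathcal{F}_c(\mathbb{R})$ ([14]) is defined as
$$D_\theta^\varphi(U,V)=\sqrt{\int_{(0,1]}\Big[(\mathrm{mid}\,U_\alpha-\mathrm{mid}\,V_\alpha)^2+\theta\,(\mathrm{spr}\,U_\alpha-\mathrm{spr}\,V_\alpha)^2\Big]\,d\varphi(\alpha)},$$
with $\theta>0$, where $\varphi$ is associated with a bounded density measure
with positive mass in $(0,1]$, and $\mathrm{mid}\,U_\alpha$ and
$\mathrm{spr}\,U_\alpha$ are the mid-point and the radius of the interval
$U_\alpha\in\mathcal{K}_c(\mathbb{R})$, respectively, i.e.
$U_\alpha=[\mathrm{mid}\,U_\alpha\pm\mathrm{spr}\,U_\alpha]$ for all
$\alpha\in[0,1]$.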
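As an illustration of the mid/spread metric, the sketch below approximates the distance numerically. It assumes that φ is the Lebesgue measure on (0,1] and that θ = 1; neither choice is fixed by the text, and the function name `d_theta_phi` is hypothetical. The integral is approximated with the trapezoidal rule on the α grid.

```python
import math
from typing import List, Sequence, Tuple

Fuzzy = List[Tuple[float, float]]  # (inf, sup) per alpha on a fixed grid


def d_theta_phi(u: Fuzzy, v: Fuzzy, alphas: Sequence[float],
                theta: float = 1.0) -> float:
    """Approximate D_theta^phi(U, V), assuming phi = Lebesgue measure."""
    vals = []
    for (a, b), (c, d) in zip(u, v):
        d_mid = (a + b) / 2.0 - (c + d) / 2.0   # difference of mid-points
        d_spr = (b - a) / 2.0 - (d - c) / 2.0   # difference of radii
        vals.append(d_mid ** 2 + theta * d_spr ** 2)
    # Trapezoidal rule over the alpha grid.
    integral = sum(
        (alphas[k + 1] - alphas[k]) * (vals[k] + vals[k + 1]) / 2.0
        for k in range(len(alphas) - 1)
    )
    return math.sqrt(integral)


alphas = [0.0, 0.5, 1.0]
two = [(2.0, 2.0)] * 3    # the crisp number 2 seen as a fuzzy set
five = [(5.0, 5.0)] * 3   # the crisp number 5
print(d_theta_phi(two, five, alphas))  # 3.0
```

For crisp values the spreads vanish and the metric reduces to the usual absolute difference, as the example shows.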
Let $(\Omega,\mathcal{A},P)$ be a probability space. A mapping
$\mathcal{X}:\Omega\to\mathcal{F}_c(\mathbb{R})$ is a random fuzzy set (RFS)
(or random fuzzy variable) if it is Borel-measurable with respect to
$\mathcal{B}_{D_\theta^\varphi}$, the $\sigma$-field generated by the
topology induced by the metric $D_\theta^\varphi$ on the space
$\mathcal{F}_c(\mathbb{R})$.
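The central tendency of an RFS is usually measured by the Aumann expectation
of $\mathcal{X}$. If
$\max\{\|\inf\mathcal{X}_0\|,\|\sup\mathcal{X}_0\|\}\in L^1(\Omega,\mathcal{A},P)$,
it is defined as the unique fuzzy set
$E(\mathcal{X})\in\mathcal{F}_c(\mathbb{R})$ such that
$$E(\mathcal{X})_\alpha=\text{Kudo-Aumann's integral of }\mathcal{X}_\alpha=[E(\inf\mathcal{X}_\alpha),\,E(\sup\mathcal{X}_\alpha)],$$
for all $\alpha\in[0,1]$. Given $\{\mathcal{X}_i\}_{i=1}^n$ a simple random
sample of size $n$ from $\mathcal{X}$, the associated sample mean is defined
as $\overline{\mathcal{X}}_n=(1/n)\sum_{i=1}^n\mathcal{X}_i$.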
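Because the expectation acts on the infimum and the supremum of each α-level separately, the sample mean can be computed level by level. The following is a minimal sketch under the same illustrative convention as before (a fuzzy set stored as a list of (inf, sup) pairs on a fixed α grid); `sample_mean` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

Fuzzy = List[Tuple[float, float]]  # (inf, sup) per alpha on a fixed grid


def sample_mean(sample: Sequence[Fuzzy]) -> Fuzzy:
    """Levelwise sample mean: average the infima and the suprema separately."""
    n = len(sample)
    n_levels = len(sample[0])
    return [
        (sum(x[k][0] for x in sample) / n, sum(x[k][1] for x in sample) / n)
        for k in range(n_levels)
    ]


# Two triangular fuzzy numbers sampled at alpha = 0, 0.5, 1.
x1 = [(0.0, 2.0), (0.5, 1.5), (1.0, 1.0)]
x2 = [(2.0, 6.0), (3.0, 5.0), (4.0, 4.0)]
print(sample_mean([x1, x2]))  # [(1.0, 4.0), (1.75, 3.25), (2.5, 2.5)]
```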
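Let $\gamma:\mathbb{R}\to\mathcal{F}_c(\mathbb{R})$ be the mapping
transforming each $x\in\mathbb{R}$ into the fuzzy set $\gamma(x)$ whose
$\alpha$-levels are given by
$$\gamma(x)_\alpha=\Big[f_L(x)-g_L(x)(1-\alpha)^{1/h_L(x)},\ f_R(x)+g_R(x)(1-\alpha)^{1/h_R(x)}\Big],$$
for all $\alpha\in[0,1]$, where $f_L,f_R:\mathbb{R}\to\mathbb{R}$,
$f_L\le f_R$, $g_L,g_R:\mathbb{R}\to[0,\infty)$ and
$h_L,h_R:\mathbb{R}\to(0,\infty)$ are Borel-measurable functions.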
Given $X:\Omega\to\mathbb{R}$ a real random variable associated with
$(\Omega,\mathcal{A},P)$, it is straightforward to show that the mapping
$\gamma\circ X:\Omega\to\mathcal{F}_c(\mathbb{R})$,
$\omega\mapsto\gamma(X(\omega))$, is a random fuzzy set. It is called the
$\gamma$-fuzzy representation of $X$ [8]. One of the main statistical
advantages of this fuzzification process is the possibility of handling
real-valued distributions, which are generally complicated to treat in the
classical framework, through the powerful statistical techniques for random
fuzzy variables available in the current literature.
Several statistical problems for $X$ have already been solved by means of
this technique [1, 3, 8, 9]. Different fuzzy operators $\gamma$ are
considered, depending on which information about $X$ is to be characterized.
In particular, the whole distribution of $X$ can be characterized through the
expected value of certain fuzzy representations. The fuzzy operator
$\gamma^\xi$ is defined as
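$$\gamma^\xi(x)=\mathbf{1}_{\{x\}}+\mathrm{sig}(x-x_0)\,\gamma_f\!\left(\frac{|x-x_0|}{a}\right),$$
where $\xi\in\Theta=\{(x_0,a,f)\mid x_0\in\mathbb{R},\ a\in\mathbb{R}^+,\ f:[0,+\infty)\to[0,1]\text{ injective and continuous}\}$,
$\mathrm{sig}(z)$ denotes the sign of $z\in\mathbb{R}$ and
$\gamma_f:[0,+\infty)\to\mathcal{F}_c(\mathbb{R})$ is an auxiliary
(fuzzy-valued) functional defined by
$$(\gamma_f(x))_\alpha=\begin{cases}[0,\ B(x)-C(x)\,\alpha] & \text{if } 0\le\alpha\le f(x),\\ [0,\ A(x)(1-\alpha)] & \text{if } f(x)<\alpha\le 1,\end{cases}$$
for all $\alpha\in[0,1]$ and $x\ge 0$, where
$$A(x)=\frac{x^2}{1-f(x)},\qquad B(x)=\frac{x^2}{f(x)}\qquad\text{and}\qquad C(x)=\frac{x^2\,(1-f(x))}{f(x)^2}.$$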
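As a quick consistency check on this definition (a short derivation added for illustration, not taken from the original text), the two branches of the upper endpoint agree at $\alpha=f(x)$ whenever $0<f(x)<1$:

```latex
\begin{align*}
B(x)-C(x)\,f(x) &= \frac{x^2}{f(x)}-\frac{x^2\,(1-f(x))}{f(x)} = x^2,\\
A(x)\,\bigl(1-f(x)\bigr) &= \frac{x^2}{1-f(x)}\,\bigl(1-f(x)\bigr) = x^2 .
\end{align*}
```

The upper endpoint therefore decreases continuously from $B(x)=x^2/f(x)$ at $\alpha=0$, through $x^2$ at $\alpha=f(x)$, down to $0$ at $\alpha=1$, so the $\alpha$-levels are nested intervals as required.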
The parameter triple $\xi=(E(X),\,1,\,(0.6^x+0.001)/1.001)$ provides a good
exploratory analysis of $X$, as well as the characterization of its
distribution, since two real random variables $X$ and $Y$ are identically
distributed if, and only if,
$E(\gamma^\xi\circ X)=E(\gamma^\xi\circ Y)$ (see [3]).
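The construction of the fuzzy representation can be sketched in a few lines. Two assumptions are made here and should be checked against [3] and [8]: the third component of ξ is read as f(y) = (0.6^y + 0.001)/1.001 (a linear reading of "0.6x" would leave [0,1] for large arguments), and the argument of the auxiliary functional is |x − x0|/a, since that functional is only defined on [0, +∞). The names `f` and `gamma_xi` are hypothetical.

```python
from typing import List, Sequence, Tuple

Fuzzy = List[Tuple[float, float]]  # (inf, sup) per alpha on a fixed grid


def f(y: float) -> float:
    """Assumed reading of f: (0.6**y + 0.001) / 1.001, with values in (0, 1]."""
    return (0.6 ** y + 0.001) / 1.001


def gamma_xi(x: float, x0: float, a: float = 1.0,
             alphas: Sequence[float] = tuple(i / 10 for i in range(11))) -> Fuzzy:
    """Alpha-levels of 1_{x} + sig(x - x0) * gamma_f(|x - x0| / a)."""
    y = abs(x - x0) / a
    fy = f(y)
    if y == 0.0 or fy >= 1.0:          # sig(0) = 0: the value stays crisp
        return [(x, x) for _ in alphas]
    big_a = y * y / (1.0 - fy)
    big_b = y * y / fy
    big_c = y * y * (1.0 - fy) / (fy * fy)
    levels = []
    for alpha in alphas:
        upper = big_b - big_c * alpha if alpha <= fy else big_a * (1.0 - alpha)
        # gamma_f has levels [0, upper]; left of x0 the sign turns them into
        # [-upper, 0] before the crisp value x is added.
        levels.append((x, x + upper) if x > x0 else (x - upper, x))
    return levels


# Fuzzy representation of the observation x = 2 with x0 = 0 and a = 1.
for alpha, level in zip((i / 10 for i in range(11)), gamma_xi(2.0, 0.0)):
    print(round(alpha, 1), level)
```

Observations to the right of x0 are fuzzified to the right and observations to the left of x0 to the left; in both cases the 1-level is the crisp value itself.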
3 Testing the equality of real-valued distributions
Let $(\Omega,\mathcal{A},P)$ be a probability space and let
$X_1,X_2,\ldots,X_k:\Omega\to\mathbb{R}$ be $k$ real random variables. The
aim is to test whether or not the distributions of the $k$ variables differ
significantly from each other. Thus, the hypothesis test to be solved is:
$$\begin{cases}H_0: X_1\overset{d}{\sim}X_2\overset{d}{\sim}\cdots\overset{d}{\sim}X_k\\[2pt] H_1: \exists\, i,j\in\{1,\ldots,k\}\ \text{s.t.}\ X_i\overset{d}{\nsim}X_j\end{cases}\tag{1}$$
Following previous results on the fuzzy representation of the variables, it
is immediate to note that the hypothesis test (1) can be equivalently written
in terms of the expected values of the corresponding $\gamma^\xi$-fuzzy
representations of the variables, as follows:
$$\begin{cases}H_0: E(\gamma^\xi\circ X_1)=E(\gamma^\xi\circ X_2)=\cdots=E(\gamma^\xi\circ X_k)\\[2pt] H_1: \exists\, i,j\in\{1,\ldots,k\}\ \text{s.t.}\ E(\gamma^\xi\circ X_i)\ne E(\gamma^\xi\circ X_j)\end{cases}\tag{2}$$
Whenever the normality assumption for the distributions of the variables is
not guaranteed, the classical test (1) is solved through non-parametric
techniques; the k-sample Kolmogorov-Smirnov test and the Kruskal-Wallis
method are some of the well-known alternatives. In contrast, the equality of
expectations of fuzzy random variables is tested through parametric
techniques, which have been shown to be asymptotically consistent. Let us
define $\mathcal{X}_i=\gamma^\xi\circ X_i$, the $\gamma^\xi$-fuzzy
representation of the real random variable $X_i$, for $i=1,\ldots,k$. Given
a simple random sample $\{X_{ij}\}_{j=1}^{n_i}$ from the real random variable
$X_i$, for each $i=1,\ldots,k$, it is immediate to see that
$\{\mathcal{X}_{ij}=\gamma^\xi\circ X_{ij}\}_{j=1}^{n_i}$ is a simple random
sample from the random fuzzy variable $\mathcal{X}_i$, $i=1,\ldots,k$.
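By following ideas from [9], the test statistic to be considered to solve (2)
from the information provided by the random samples
$\{\mathcal{X}_{ij}\}_{j=1}^{n_i}$, $i=1,\ldots,k$, is defined as follows:
$$T_n=\sum_{i=1}^{k}n_i\Big[D_\theta^\varphi\big(\overline{\mathcal{X}}_{i\cdot},\overline{\mathcal{X}}_{\cdot\cdot}\big)\Big]^2,\tag{3}$$
where $\overline{\mathcal{X}}_{i\cdot}=\frac{1}{n_i}\sum_{j=1}^{n_i}\mathcal{X}_{ij}$
for each $i=1,\ldots,k$,
$\overline{\mathcal{X}}_{\cdot\cdot}=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i}\mathcal{X}_{ij}$,
and $n=n_1+\ldots+n_k$ is the overall sample size. The consistency of the
testing procedure based on the test statistic (3) is supported by the
following asymptotic result.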
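Theorem 1. ([9]) If $n_i\to\infty$ and $n_i/n\to p_i>0$ as $n\to\infty$, and
$\mathcal{X}_i$ is non-degenerate for some $i\in\{1,\ldots,k\}$, then, if
$H_0$ is true,
$$T_n\ \xrightarrow[n\to\infty]{\ \mathcal{L}\ }\ \sum_{i=1}^{k}\bigg(\Big\|Z_i-\sum_{l=1}^{k}\alpha_{li}Z_l\Big\|_\theta^\varphi\bigg)^2,\tag{4}$$
where $Z_1,\ldots,Z_k$ are independent centered Gaussian processes in
$L^2(\{-1,1\}\times[0,1])$ whose covariances are equal to
$\mathrm{cov}(s_{\mathcal{X}_i})$, respectively, and
$\alpha_{li}=\sqrt{p_l/p_i}\,\big/\sum_{r=1}^{k}(p_r/p_i)$, $i=1,\ldots,k$.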
Proposition 1. To test
$H_0:E(\gamma^\xi\circ X_1)=E(\gamma^\xi\circ X_2)=\cdots=E(\gamma^\xi\circ X_k)$
at the nominal significance level $\rho\in[0,1]$, $H_0$ should be rejected
whenever $T_n>z_\rho$, where $z_\rho$ is the $100(1-\rho)$-quantile of the
distribution of the limit expression in (4).
The limit distribution in (4) depends on the population covariances
$\mathrm{cov}(s_{\mathcal{X}_i})$, $i=1,\ldots,k$, which are usually unknown
in practice. In such situations, that distribution can be approximated by
Monte Carlo simulations.
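Whichever approximation of the null distribution is used, the observed value of the statistic (3) is needed first. The sketch below computes it for groups of fuzzy observations stored as lists of (inf, sup) pairs on a uniform α grid. It assumes θ = 1 and φ equal to the Lebesgue measure, which the text does not fix, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Fuzzy = List[Tuple[float, float]]  # (inf, sup) per alpha on a uniform grid


def mean(sample: Sequence[Fuzzy]) -> Fuzzy:
    """Levelwise sample mean of fuzzy observations."""
    n = len(sample)
    return [(sum(x[k][0] for x in sample) / n, sum(x[k][1] for x in sample) / n)
            for k in range(len(sample[0]))]


def dist2(u: Fuzzy, v: Fuzzy, theta: float = 1.0) -> float:
    """Squared mid/spread distance (phi = Lebesgue assumed, trapezoidal rule)."""
    vals = []
    for (a, b), (c, d) in zip(u, v):
        d_mid = (a + b) / 2.0 - (c + d) / 2.0
        d_spr = (b - a) / 2.0 - (d - c) / 2.0
        vals.append(d_mid ** 2 + theta * d_spr ** 2)
    h = 1.0 / (len(vals) - 1)                    # uniform grid on [0, 1]
    return h * (sum(vals) - (vals[0] + vals[-1]) / 2.0)


def statistic(groups: Sequence[Sequence[Fuzzy]]) -> float:
    """T_n = sum_i n_i * D^2(group mean i, overall mean)."""
    overall = mean([x for g in groups for x in g])
    return sum(len(g) * dist2(mean(g), overall) for g in groups)


def crisp(x: float, n_levels: int = 11) -> Fuzzy:
    """A real number seen as a degenerate fuzzy set (toy input only)."""
    return [(float(x), float(x))] * n_levels


groups = [[crisp(x) for x in (1, 2, 3)], [crisp(x) for x in (2, 3, 4)]]
print(round(statistic(groups), 6))  # 1.5
```

With crisp inputs the spreads vanish and the statistic reduces to the classical between-group sum of squares, here 3(2 − 2.5)² + 3(3 − 2.5)² = 1.5, which gives a simple sanity check before feeding in genuine fuzzy representations.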
As an alternative to the asymptotic approach, a bootstrap testing procedure
to solve (2) is proposed, which is always applicable in practice. Let
$\{\mathcal{X}^*_{ij}\}_{j=1}^{n_i}$ be a bootstrap sample of
$\{\mathcal{X}_{ij}\}_{j=1}^{n_i}$, for each $i=1,\ldots,k$, i.e.
$\{\mathcal{X}^*_{ij}\}_{j=1}^{n_i}$ is drawn at random and with replacement
from $\{\mathcal{X}_{ij}\}_{j=1}^{n_i}$. The bootstrap statistic is defined
as follows:
$$T^*_n=\sum_{i=1}^{k}n_i\Big[D_\theta^\varphi\big(\overline{\mathcal{X}}^*_{i\cdot}+\overline{\mathcal{X}}_{\cdot\cdot},\ \overline{\mathcal{X}}_{i\cdot}+\overline{\mathcal{X}}^*_{\cdot\cdot}\big)\Big]^2.\tag{5}$$
By applying the Bootstrap Central Limit Theorem, it can be shown that
$T^*_n$ converges in law to the same limit as $T_n$ in (3) when $H_0$ is true
(see [9]). Consequently, the bootstrap distribution of $T^*_n$ approximates
that of $T_n$ under $H_0$, and the following test resolution holds.
Proposition 2. To test
$H_0:E(\gamma^\xi\circ X_1)=E(\gamma^\xi\circ X_2)=\cdots=E(\gamma^\xi\circ X_k)$
at the significance level $\rho\in[0,1]$, $H_0$ should be rejected whenever
$T_n>z^*_\rho$, where $z^*_\rho$ is the $100(1-\rho)$-quantile of the
bootstrap distribution of $T^*_n$.
The bootstrap statistic $T^*_n$ is defined coherently in terms of the
arithmetic between fuzzy values and the distances between them: moving the
means to the opposite side of the distance avoids any subtraction of fuzzy
sets. The corresponding quantile to solve the test can be easily approximated
by re-sampling.
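The re-sampling scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes θ = 1, φ equal to the Lebesgue measure and a uniform α grid, it reports the bootstrap p-value as the proportion of bootstrap statistics at least as large as the observed statistic (so that rejecting when the p-value is below ρ corresponds to comparing the observed statistic with the bootstrap quantile), and all names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Fuzzy = List[Tuple[float, float]]  # (inf, sup) per alpha on a uniform grid


def mean(sample: Sequence[Fuzzy]) -> Fuzzy:
    n = len(sample)
    return [(sum(x[k][0] for x in sample) / n, sum(x[k][1] for x in sample) / n)
            for k in range(len(sample[0]))]


def add(u: Fuzzy, v: Fuzzy) -> Fuzzy:
    return [(a + c, b + d) for (a, b), (c, d) in zip(u, v)]


def dist2(u: Fuzzy, v: Fuzzy, theta: float = 1.0) -> float:
    """Squared mid/spread distance (phi = Lebesgue assumed, trapezoidal rule)."""
    vals = []
    for (a, b), (c, d) in zip(u, v):
        d_mid = (a + b) / 2.0 - (c + d) / 2.0
        d_spr = (b - a) / 2.0 - (d - c) / 2.0
        vals.append(d_mid ** 2 + theta * d_spr ** 2)
    h = 1.0 / (len(vals) - 1)
    return h * (sum(vals) - (vals[0] + vals[-1]) / 2.0)


def statistic(groups: Sequence[Sequence[Fuzzy]]) -> float:
    """Observed statistic: sum_i n_i * D^2(group mean i, overall mean)."""
    overall = mean([x for g in groups for x in g])
    return sum(len(g) * dist2(mean(g), overall) for g in groups)


def bootstrap_statistic(groups: Sequence[Sequence[Fuzzy]],
                        rng: random.Random) -> float:
    """One bootstrap replicate; each group is re-sampled with replacement."""
    boot = [[rng.choice(g) for _ in g] for g in groups]
    overall = mean([x for g in groups for x in g])
    overall_star = mean([x for b in boot for x in b])
    return sum(
        len(g) * dist2(add(mean(b), overall), add(mean(g), overall_star))
        for g, b in zip(groups, boot)
    )


def bootstrap_p_value(groups: Sequence[Sequence[Fuzzy]],
                      n_boot: int = 1000, seed: int = 0) -> float:
    rng = random.Random(seed)
    t_obs = statistic(groups)
    hits = sum(bootstrap_statistic(groups, rng) >= t_obs for _ in range(n_boot))
    return hits / n_boot


def crisp(x: float, n_levels: int = 11) -> Fuzzy:
    """A real number seen as a degenerate fuzzy set (toy input only)."""
    return [(float(x), float(x))] * n_levels


far = [[crisp(x) for x in range(0, 5)], [crisp(x) for x in range(10, 15)]]
print(bootstrap_p_value(far, n_boot=200))  # 0.0
```

In the toy example the two groups are so far apart that no bootstrap replicate can reach the observed statistic, hence the p-value of 0. In an actual application the inputs would be the fuzzy representations of the observations rather than crisp values.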
3.1 Simulation studies
In order to illustrate the empirical behaviour of the proposed testing
procedures, some simulations are shown. Let $\delta>0$ and
$Z_i\sim\mathcal{N}(0,1)$, $i=1,2,3$. We define $X_1=Z_1$, $X_2=Z_2$ and
$X_3=\delta Z_3$. It is immediate to check that
$H_0:X_1\overset{d}{\sim}X_2\overset{d}{\sim}X_3$ holds when $\delta=1$.
Moreover, $E(X_1)=E(X_2)=E(X_3)$ for all $\delta>0$. However,
$\mathrm{Var}(X_3)$ increases and differs more and more from
$\mathrm{Var}(X_1)=\mathrm{Var}(X_2)$ as $\delta$ increases.
A total of 10,000 random samples from $\{X_i\}_{i=1}^{3}$ are generated, for
different sample sizes $n_1,n_2,n_3$, respectively. For each case, the
corresponding samples of the fuzzy representations $\gamma^\xi\circ X_i$ are
constructed, with $\xi=(E(X),\,1,\,(0.6^x+0.001)/1.001)$, and the bootstrap
test is run with $B=1000$ bootstrap replications. The percentage of
rejections of the null hypothesis (2) (and so of (1)) over the 10,000
iterations of the test is computed. The results are compared with the
classical non-parametric Kruskal-Wallis (KW) method to test the equality of
real-valued distributions. Table 1 contains the results for different values
of $\delta$. Some comments can be made. First, the rejection rate of the
proposed bootstrap test approaches the nominal significance level when $H_0$
is true ($\delta=1$) as the sample size increases, which agrees with the
theoretical correctness of the method. Under $H_0$, the classical KW test
is slightly closer to the nominal level than the bootstrap test. However,
the fuzzy-based bootstrap test is always consistent, which is not the case
for the classical one. The Kruskal-Wallis method does not detect the
departure from $H_0$ as $\delta$ increases, whereas the bootstrap method
does so effectively even for small and moderate samples.
Table 1 Percentage of rejections (bootstrap fuzzy test / classical KW test)
(n1, n2, n3)     δ = 1        δ = 2         δ = 4         δ = 6
(30,30,30)       3.80/4.69    34.16/5.70    34.46/6.81    35.86/7.45
(30,50,100)      4.22/5.09    90.36/2.46    71.62/2.52    83.32/2.62
(100,100,100)    4.62/5.13    90.52/5.59    72.22/7.20    88.45/7.68
(100,150,200)    4.68/5.04    97.78/4.13    87.64/4.92    99.48/5.15
(500,500,500)    4.80/5.11    99.98/5.88    100/6.96      100/7.89
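One plausible reading of the low KW rejection rates in Table 1 is that the Kruskal-Wallis statistic is rank-based and mainly sensitive to location shifts, whereas in this design the three variables share the same centre and differ only in scale. The sketch below reproduces one sample of the design and computes the KW statistic in plain Python. It assumes continuous data without ties and uses the chi-square reference distribution with 2 degrees of freedom, whose upper tail is exp(−h/2); the function names and the seed are illustrative choices.

```python
import math
import random
from typing import List, Sequence


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic, assuming there are no ties."""
    pooled = sorted((x, g) for g, grp in enumerate(groups) for x in grp)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(pooled, start=1):
        rank_sums[g] += rank
    n = len(pooled)
    between = sum(r * r / len(grp) for r, grp in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * between - 3.0 * (n + 1)


def chi2_tail_2df(h: float) -> float:
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-h / 2.0)


def simulate(delta: float, sizes=(100, 100, 100), seed: int = 1) -> List[List[float]]:
    """One sample of the design: X1 = Z1, X2 = Z2, X3 = delta * Z3."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]
    x2 = [rng.gauss(0.0, 1.0) for _ in range(sizes[1])]
    x3 = [delta * rng.gauss(0.0, 1.0) for _ in range(sizes[2])]
    return [x1, x2, x3]


for delta in (1.0, 2.0, 4.0, 6.0):
    h = kruskal_wallis_h(simulate(delta))
    print(f"delta={delta}: H={h:.3f}, approximate p-value={chi2_tail_2df(h):.4f}")
```

A single sample only illustrates the mechanics; reproducing the rejection percentages of Table 1 would require repeating the procedure over many samples.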
3.2 Practical applications
Now that both the theoretical and the empirical correctness of the proposed
fuzzy-based bootstrap testing procedure have been shown, the technique is
ready to be applied in practice. Let us consider the sample dataset Energy
Efficiency from the UCI Repository (see [15]). It contains information about
the heating load of houses as well as different features of the houses such
as orientation, roof area, wall area, glazing area, etc. The aim is to test
whether the heating load of the houses differs significantly depending on one
of those features. For instance, if we consider $X_i$ = heating load in a
house with orientation $i$, $i=1,2,3,4$, the hypothesis
$H_0:X_1\overset{d}{\sim}X_2\overset{d}{\sim}X_3\overset{d}{\sim}X_4$ is not
rejected at the usual significance levels, since the p-values obtained with
the classical KW test and the fuzzy bootstrap test are .9941 and .8741,
respectively. Nevertheless, for $Y_j$ = heating load in a house with glazing
area $j$, $j=1,2,3$, the hypothesis
$H_0:Y_1\overset{d}{\sim}Y_2\overset{d}{\sim}Y_3$ is rejected with p-values
$4.31\cdot10^{-13}$ and 0, respectively.
4 Conclusions
The statistical analysis of fuzzy-valued random sets is widely recognized as
a powerful technique to develop descriptive and inferential studies in
experimental scenarios carried out with a certain degree of imprecision.
Additionally, fuzzy-based techniques can also be applied to solve statistical
problems for real-valued random variables through the fuzzy representation of
the variables. In this work, the problem of testing the equality of two or
more real-valued distributions is addressed. Simulations and applications
show that the proposed techniques are a good alternative to classical
methods: the bootstrap test approaches the nominal level under the null
hypothesis and detects differences in scale that the Kruskal-Wallis test
misses.
The effect of using different fuzzy operators to define the fuzzy
representation of the variables on the results of the test could be further
investigated. In addition, the extension to other classical statistical
methods, such as discriminant or regression analysis, is still to be
developed.
Acknowledgements The research in this paper is partially supported by the Spanish
National Grant MTM2013-44212-P, and the Regional Grant FC-15-GRUPIN-14-005. Their
financial support is gratefully acknowledged.
References
1. Blanco-Fernández A, Ramos-Guajardo AB, Colubi A (2013) Fuzzy representations
of real-valued random variables: applications to exploratory and inferential
studies. Metron 71:245–259
2. Blanco-Fernández A, Casals RM, Colubi A, Corral N, García-Bárzana M, Gil MA,
González-Rodríguez G, López MT, Lubiano MA, Montenegro M, Ramos-Guajardo
AB, de la Rosa de Sáa S, Sinova B (2014) A distance-based statistical analysis
of fuzzy number-valued data. Int J Approx Reason 55:1487–1501
3. Colubi A (2009) Statistical inference about the means of fuzzy random
variables: Applications to the analysis of fuzzy- and real-valued data. Fuzzy
Set Syst 160:344–356
4. Couso I, Dubois D (2009) On the variability of the concept of variance for
fuzzy random variables. IEEE Trans Fuzzy Sys 17(5):1070–1080
5. Couso I, Dubois D (2014) Statistical reasoning with set-valued information:
Ontic vs. epistemic views. Int J Approx Reason 55:1502–1518
6. Diamond P, Kloeden P (1994) Metric spaces of fuzzy sets. World Scientific,
Singapore
7. Gil MA, López-Díaz M, Ralescu DA (2006) Overview on the development of fuzzy
random variables. Fuzzy Set Syst 157(19):2546–2557
8. González-Rodríguez G, Colubi A, Gil MA (2006) A fuzzy representation of
random variables: An operational tool in exploratory analysis and hypothesis
testing. Comp Stat Data An 51:163–176
9. González-Rodríguez G, Colubi A, Gil MA, Lubiano MA (2012) A new way of
quantifying the symmetry of a random variable: Estimation and hypothesis
testing. J Stat Plann Inf 142:3061–3072
10. Kwakernaak H (1978) Fuzzy random variables - I. Definitions and theorems.
Inf Sci 15:1–29
11. Huellermeier E (2014) Learning from imprecise and fuzzy observations: data
disambiguation through generalized loss minimization. Int J Approx Reason
55:1519–1534
12. Körner R, Näther W (2002) On the variance of random fuzzy variables. In:
Bertoluzza C, Gil MA, Ralescu DA (eds) Analysis and management of fuzzy data.
Physica-Verlag, Heidelberg, pp 22–39
13. Puri ML, Ralescu DA (1986) Fuzzy random variables. J Math Anal Appl
114:409–422
14. Trutschnig W, González-Rodríguez G, Colubi A, Gil MA (2009) A new family of
metrics for compact, convex (fuzzy) sets based on a generalized concept of mid
and spread. Inf Sci 179:3964–3972
15. Tsanas A, Xifara A (2012) Accurate quantitative estimation of energy
performance of residential buildings using statistical machine learning tools.
Energy and Buildings 49:560–567
16. Zadeh LA (1975) The concept of a linguistic variable and its application to
approximate reasoning. Inf Sci 8:199–249