Independent k-sample equality distribution test based on the fuzzy representation
Angela Blanco-Fernández and Ana B. Ramos-Guajardo
Abstract Classical tests for the equality of distributions of real-valued random variables are widely applied in Statistics. When the normality assumption for the variables fails, non-parametric techniques such as the Mann-Whitney, Wilcoxon, Kruskal-Wallis and Friedman tests are usually considered. Fuzzy representations of real-valued random variables have recently been shown to describe the statistical behaviour of the variables effectively. Indeed, the expected value of certain fuzzy representations fully characterizes the distribution of the variable. The aim of this paper is to use this characterization to test the equality of distributions of two or more real-valued random variables, as an alternative to classical procedures. The inferential problem is solved through a parametric test for the equality of expectations of fuzzy-valued random variables. Theoretical results on inference for fuzzy random variables support the validity of the test, and simulation studies and practical applications illustrate its empirical performance.
1 Introduction
The development of statistical methods for fuzzy random variables has grown considerably in recent decades, building on the seminal ideas on fuzziness by Zadeh [16]. In some situations, experimental data are not precise observations that can be represented by fixed categories or point-valued real numbers and modelled by real-valued variables. The outcomes of the experiment might be imprecise or fuzzy, in the sense that they are represented not by a single point value but by a set of values, an interval, or even a function. Imprecise experimental data are effectively modelled by means of fuzzy-valued variables. In addition to the imprecision of the data, the randomness of the data-generation process leads to the formalization of fuzzy-valued random variables (FRVs).

Department of Statistics and Operational Research, University of Oviedo, C/Calvo Sotelo, s/n, 33007, Oviedo, Spain. blancoangela@uniovi.es · ramosana@uniovi.es
Statistical, probabilistic and inferential methods for fuzzy random variables have been investigated in depth in the literature. It is important to remark that fuzzy data can be viewed from two different perspectives, usually called the ontic and epistemic views, and the statistical treatment of the variables differs radically between them. In short, the epistemic approach considers fuzzy data as imprecise observations or descriptions of crisp (but unknown) quantities. Statistical methods then aim to draw conclusions about the original real-valued variable, and they generally transfer the imprecision to the methods and results [4, 5, 10, 11]. Alternatively, ontic fuzzy data are treated as precise entities representing the outcomes of the experiment, belonging to the corresponding space of functions instead of the space of real numbers. In this case, statistical methods try to mimic classical techniques in order to draw conclusions directly about the fuzzy-valued variables modelling the experiment [2, 6, 7, 12, 13]. Further discussion of the two frameworks can be found in [2] and [5].
Besides being objects of statistical analysis in their own right, fuzzy-valued random variables in the ontic perspective have also proved to be a powerful tool for drawing statistical conclusions about classical real-valued random variables [1, 3, 8, 9]. Exploratory and inferential studies for real random variables have been developed through the so-called fuzzy representation of the variable, defined, roughly speaking, by applying a fuzzy operator to the original variable and thereby fuzzifying its values. The key idea is that imprecision is not introduced into the data gratuitously; rather, the transformation is very effective for certain statistical purposes. The aim of this paper is to extend this line of research to test the equality of two or more real-valued distributions based on the fuzzy representation of the variables.

The rest of the paper is organized as follows. In Section 2, the main concepts concerning fuzzy random variables and the fuzzy representation of a real-valued variable are recalled. The inferential studies on the equality of real-valued distributions are presented in Section 3, together with theoretical and empirical results on the proposed tests. Finally, some conclusions and open problems are discussed in Section 4.
2 Preliminaries
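Let $\mathcal{F}_c(\mathbb{R})$ denote the class of fuzzy sets $U:\mathbb{R}\to[0,1]$ such that $U_\alpha\in\mathcal{K}_c(\mathbb{R})$ for all $\alpha\in[0,1]$, where $\mathcal{K}_c(\mathbb{R})$ is the family of all non-empty closed and bounded intervals of $\mathbb{R}$. The $\alpha$-levels of $U$ are defined as $U_\alpha=\{x\in\mathbb{R}\mid U(x)\ge\alpha\}$ if $\alpha\in(0,1]$, and $U_0$ is the closure of the support of $U$.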
The usual arithmetic between fuzzy sets is based on Zadeh's extension principle [16]. It agrees levelwise with the Minkowski addition and the product by scalars for real intervals. Given $U,V\in\mathcal{F}_c(\mathbb{R})$ and $\lambda\in\mathbb{R}$, $U+V$ and $\lambda U$ are defined such that $(U+V)_\alpha=U_\alpha+V_\alpha=\{u+v : u\in U_\alpha,\ v\in V_\alpha\}$ and $(\lambda U)_\alpha=\lambda U_\alpha=\{\lambda u : u\in U_\alpha\}$, for all $\alpha\in[0,1]$.
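The levelwise arithmetic can be illustrated with a short sketch. It assumes, purely for illustration, that a fuzzy set is stored as a list of (inf, sup) pairs over a common finite grid of $\alpha$ values; the function names and the example data are hypothetical and do not come from the paper.

```python
def add_levels(U, V):
    """Minkowski sum, level by level: [a, b] + [c, d] = [a + c, b + d]."""
    return [(ul + vl, uu + vu) for (ul, uu), (vl, vu) in zip(U, V)]


def scale_levels(lam, U):
    """Product by a scalar, level by level; a negative scalar swaps the end-points."""
    return [(min(lam * lo, lam * hi), max(lam * lo, lam * hi)) for lo, hi in U]


# Two fuzzy sets described on the same grid of three alpha values (assumed data).
U = [(0.0, 4.0), (1.0, 3.0), (2.0, 2.0)]
V = [(1.0, 2.0), (1.0, 2.0), (1.5, 1.5)]
print(add_levels(U, V))        # levelwise sums of the intervals
print(scale_levels(-2.0, U))   # end-points swap because the scalar is negative
```

Note that multiplying by $-1$ and adding does not cancel a non-degenerate interval, which is why the space is a cone rather than a linear space.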
The space $\mathcal{F}_c(\mathbb{R})$ can be embedded into a convex and closed cone of $L^2(\{-1,1\}\times[0,1])$ by means of the support function [12], defined for any $U\in\mathcal{F}_c(\mathbb{R})$ as $s_U:\{-1,1\}\times[0,1]\to\mathbb{R}$ such that $s_U(u,\alpha)=\sup_{v\in U_\alpha}\langle u,v\rangle$. It is important to note that, although this embedding provides good operational properties, the statistical processing of fuzzy sets cannot be directly transferred to $L^2(\{-1,1\}\times[0,1])$; it must always be guaranteed that the results remain inside the cone.
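In order to measure distances between fuzzy sets, the family of metrics $D_\theta^\varphi$ in $\mathcal{F}_c(\mathbb{R})$ [14] is defined as

$$D_\theta^\varphi(U,V)=\sqrt{\int_{(0,1]}\left[(\operatorname{mid}U_\alpha-\operatorname{mid}V_\alpha)^2+\theta\,(\operatorname{spr}U_\alpha-\operatorname{spr}V_\alpha)^2\right]d\varphi(\alpha)},$$

with $\theta>0$, where $\varphi$ is associated with a bounded density measure with positive mass in $(0,1]$, and $\operatorname{mid}U_\alpha$ and $\operatorname{spr}U_\alpha$ are the mid-point and the radius of the interval $U_\alpha\in\mathcal{K}_c(\mathbb{R})$, respectively, i.e. $U_\alpha=[\operatorname{mid}U_\alpha\pm\operatorname{spr}U_\alpha]$ for all $\alpha\in[0,1]$.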
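As a rough numerical illustration of this metric, the sketch below approximates $D_\theta^\varphi$ on a finite grid. It assumes that $\varphi$ is the Lebesgue measure on $(0,1]$, that $\theta=1/3$ by default, and that both fuzzy sets are given as lists of (inf, sup) pairs on the same equally spaced $\alpha$ grid; these choices and the function name are illustrative assumptions, not prescriptions from the paper.

```python
import math


def d_theta(U, V, theta=1.0 / 3.0):
    """Approximate D_theta^phi(U, V), with phi taken as Lebesgue measure (assumption).

    U and V are lists of (inf, sup) alpha-levels on the same equally spaced grid,
    so the integral over (0, 1] is replaced by an average over the grid.
    """
    total = 0.0
    for (ul, uu), (vl, vu) in zip(U, V):
        d_mid = (ul + uu) / 2.0 - (vl + vu) / 2.0   # difference of mid-points
        d_spr = (uu - ul) / 2.0 - (vu - vl) / 2.0   # difference of radii
        total += d_mid ** 2 + theta * d_spr ** 2
    return math.sqrt(total / len(U))


# Two crisp values embedded as degenerate fuzzy sets: the distance is |1 - 4|.
print(d_theta([(1.0, 1.0)] * 5, [(4.0, 4.0)] * 5))
```

For degenerate (crisp) sets the spread term vanishes, so the metric reduces to the usual distance between real numbers.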
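Let $(\Omega,\mathcal{A},P)$ be a probability space. A mapping $\mathcal{X}:\Omega\to\mathcal{F}_c(\mathbb{R})$ is a random fuzzy set (RFS) (or random fuzzy variable) if it is Borel-measurable with respect to $\mathcal{B}_{D_\theta^\varphi}$, the $\sigma$-field generated by the topology induced by the metric $D_\theta^\varphi$ on the space $\mathcal{F}_c(\mathbb{R})$.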
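The central tendency of an RFS $\mathcal{X}$ is usually measured by the Aumann expectation of $\mathcal{X}$. If $\max\{|\inf\mathcal{X}_0|,|\sup\mathcal{X}_0|\}\in L^1(\Omega,\mathcal{A},P)$, it is defined as the unique fuzzy set $E(\mathcal{X})\in\mathcal{F}_c(\mathbb{R})$ such that

$$(E(\mathcal{X}))_\alpha=\text{Kudo-Aumann's integral of }\mathcal{X}_\alpha=[E(\inf\mathcal{X}_\alpha),\,E(\sup\mathcal{X}_\alpha)],$$

for all $\alpha\in[0,1]$. Given a simple random sample $\{\mathcal{X}_i\}_{i=1}^n$ of size $n$ from $\mathcal{X}$, the associated sample mean is defined as $\overline{\mathcal{X}}_n=(1/n)\sum_{i=1}^n\mathcal{X}_i$.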
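Let $\gamma:\mathbb{R}\to\mathcal{F}_c(\mathbb{R})$ be the mapping transforming each $x\in\mathbb{R}$ into the fuzzy set $\gamma(x)$ whose $\alpha$-levels are given by

$$(\gamma(x))_\alpha=\left[f_L(x)-g_L(x)(1-\alpha)^{1/h_L(x)},\ f_R(x)+g_R(x)(1-\alpha)^{1/h_R(x)}\right],$$

for all $\alpha\in[0,1]$, where $f_L,f_R:\mathbb{R}\to\mathbb{R}$ with $f_L\le f_R$, $g_L,g_R:\mathbb{R}\to[0,\infty)$ and $h_L,h_R:\mathbb{R}\to(0,\infty)$ are Borel-measurable functions.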
Given a real random variable $X:\Omega\to\mathbb{R}$ associated with $(\Omega,\mathcal{A},P)$, it is straightforward to show that the mapping $\gamma\circ X:\Omega\to\mathcal{F}_c(\mathbb{R})$, $\omega\mapsto\gamma(X(\omega))$, is a random fuzzy set. It is called the $\gamma$-fuzzy representation of $X$ [8]. One of the main statistical advantages of this fuzzification process is the possibility of handling real-valued distributions, which are generally complicated to treat in the classical framework, through the powerful statistical techniques for random fuzzy variables available in the current literature.
Several statistical problems for $X$ have already been solved by means of this technique [1, 3, 8, 9]. Different fuzzy operators $\gamma$ are considered, depending on the information about $X$ that is to be characterized.
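It is possible to characterize the whole distribution of $X$ through the expected value of certain fuzzy representations. The fuzzy operator $\gamma^{\xi}$ is defined as

$$\gamma^{\xi}(x)=\mathbb{1}_{\{x\}}+\operatorname{sig}(x-x_0)\,\gamma_f\!\left(\frac{|x-x_0|}{a}\right),$$

where $\xi\in\Theta=\{(x_0,a,f)\mid x_0\in\mathbb{R},\ a\in\mathbb{R}^+,\ f:[0,+\infty)\to[0,1]\text{ injective and continuous}\}$, $\operatorname{sig}(z)$ denotes the sign of $z\in\mathbb{R}$ and $\gamma_f:[0,+\infty)\to\mathcal{F}_c(\mathbb{R})$ is an auxiliary (fuzzy-valued) functional defined by

$$(\gamma_f(x))_\alpha=\begin{cases}[0,\ B(x)-C(x)\,\alpha] & \text{if } 0\le\alpha\le f(x),\\ [0,\ A(x)(1-\alpha)] & \text{if } f(x)<\alpha\le 1,\end{cases}$$

for all $\alpha\in[0,1]$ and $x\ge 0$, where

$$A(x)=\frac{x^2}{1-f(x)},\qquad B(x)=\frac{x^2}{f(x)}\qquad\text{and}\qquad C(x)=\frac{x^2(1-f(x))}{f(x)^2}.$$

The two branches agree at $\alpha=f(x)$, where both upper end-points equal $x^2$. The triple parameter $\xi=(E(X),1,(0.6^{x}+0.001)/1.001)$ provides a good exploratory analysis of $X$, as well as the characterization of its distribution, since two real random variables $X$ and $Y$ are identically distributed if, and only if, $E(\gamma^{\xi}\circ X)=E(\gamma^{\xi}\circ Y)$ (see [3]).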
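To make the definition of the fuzzy operator concrete, the following sketch evaluates single $\alpha$-levels of $\gamma^{\xi}(x)$. It assumes $f(t)=(0.6^{t}+0.001)/1.001$ and $a=1$ as in the triple above, takes $x_0$ as a user-supplied number, and treats $t=0$ as the degenerate set $\{0\}$; the function names are hypothetical.

```python
def f_default(t):
    """f(t) = (0.6**t + 0.001) / 1.001, the injective map of the suggested triple."""
    return (0.6 ** t + 0.001) / 1.001


def gamma_f_upper(t, alpha, f=f_default):
    """Upper end-point of the alpha-level [0, .] of gamma_f(t), for t >= 0."""
    if t == 0.0:
        return 0.0                       # assumption: gamma_f(0) is the crisp set {0}
    ft = f(t)
    if alpha <= ft:
        B = t * t / ft
        C = t * t * (1.0 - ft) / ft ** 2
        return B - C * alpha
    A = t * t / (1.0 - ft)               # only reached when ft < alpha <= 1, so ft < 1
    return A * (1.0 - alpha)


def gamma_xi_level(x, alpha, x0=0.0, a=1.0, f=f_default):
    """alpha-level (inf, sup) of gamma^xi(x) = 1_{x} + sig(x - x0) * gamma_f(|x - x0| / a)."""
    u = gamma_f_upper(abs(x - x0) / a, alpha, f)
    # A positive sign extends the level to the right of x, a negative one to the left.
    return (x, x + u) if x >= x0 else (x - u, x)


print(gamma_xi_level(3.0, 0.5, x0=1.0))   # level lying to the right of x = 3
print(gamma_xi_level(-1.0, 0.5, x0=1.0))  # level lying to the left of x = -1
```

The side on which the level extends encodes the position of $x$ relative to $x_0$, and its length grows with the distance $|x-x_0|$.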
3 Testing the equality of real-valued distributions
Let $(\Omega,\mathcal{A},P)$ be a probability space and let $X_1,X_2,\ldots,X_k:\Omega\to\mathbb{R}$ be $k$ real random variables. The aim is to test whether or not the distributions of the $k$ variables differ significantly from each other. Thus, the hypothesis test to be solved is:

$$\begin{cases}H_0: X_1\overset{d}{=}X_2\overset{d}{=}\ldots\overset{d}{=}X_k\\ H_1: \exists\, i,j\in\{1,\ldots,k\}\ \text{s.t.}\ X_i\overset{d}{\neq}X_j\end{cases}\qquad(1)$$
Following previous results on the fuzzy representation of the variables, the hypothesis test (1) can be equivalently written in terms of the expected values of the corresponding $\gamma^{\xi}$-fuzzy representations of the variables, as follows:

$$\begin{cases}H_0: E(\gamma^{\xi}\circ X_1)=E(\gamma^{\xi}\circ X_2)=\ldots=E(\gamma^{\xi}\circ X_k)\\ H_1: \exists\, i,j\in\{1,\ldots,k\}\ \text{s.t.}\ E(\gamma^{\xi}\circ X_i)\neq E(\gamma^{\xi}\circ X_j)\end{cases}\qquad(2)$$
Whenever the normality assumption for the distributions of the variables is not guaranteed, the classical test (1) is solved through non-parametric techniques; the k-sample Kolmogorov-Smirnov test and the Kruskal-Wallis method are some of the well-known alternatives. In contrast, the equality of expectations of fuzzy random variables is tested through parametric techniques, which have been shown to be asymptotically consistent. Let $\mathcal{X}_i=\gamma^{\xi}\circ X_i$ denote the $\gamma^{\xi}$-fuzzy representation of the real random variable $X_i$, for $i=1,\ldots,k$. Given a simple random sample $\{X_{ij}\}_{j=1}^{n_i}$ from the real random variable $X_i$, for each $i=1,\ldots,k$, it is immediate to see that $\{\mathcal{X}_{ij}=\gamma^{\xi}\circ X_{ij}\}_{j=1}^{n_i}$ is a simple random sample from the random fuzzy variable $\mathcal{X}_i$, $i=1,\ldots,k$.
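Following ideas from [9], the test statistic considered to solve (2) from the information provided by the random samples $\{\mathcal{X}_{ij}\}_{j=1}^{n_i}$, $i=1,\ldots,k$, is defined as follows:

$$T_n=\sum_{i=1}^{k}n_i\,D_\theta^\varphi\left(\overline{\mathcal{X}}_{i\cdot},\overline{\mathcal{X}}_{\cdot\cdot}\right)^2,\qquad(3)$$

where $\overline{\mathcal{X}}_{i\cdot}=\frac{1}{n_i}\sum_{j=1}^{n_i}\mathcal{X}_{ij}$ for each $i=1,\ldots,k$, $\overline{\mathcal{X}}_{\cdot\cdot}=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i}\mathcal{X}_{ij}$, and $n=n_1+\ldots+n_k$ is the overall sample size. The consistency of the testing procedure based on the test statistic (3) is supported by the following asymptotic result.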
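Theorem 1. ([9]) If $n_i\to\infty$, $n_i/n\to p_i>0$ as $n\to\infty$, and $\mathcal{X}_i$ is non-degenerate for some $i\in\{1,\ldots,k\}$, then, if $H_0$ is true,

$$T_n\ \xrightarrow[n\to\infty]{\mathcal{L}}\ \sum_{i=1}^{k}\left(\Bigl\|Z_i-\sum_{l=1}^{k}\alpha_{li}Z_l\Bigr\|_{\theta}^{\varphi}\right)^{2},\qquad(4)$$

where $Z_1,\ldots,Z_k$ are independent centered Gaussian processes in $L^2(\{-1,1\}\times[0,1])$ whose covariances are equal to $\operatorname{cov}(s_{\mathcal{X}_i})$, respectively, and $\alpha_{li}=\sqrt{p_l/p_i}\,\big/\sum_{r=1}^{k}(p_r/p_i)$, $i=1,\ldots,k$.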
Proposition 1. To test $H_0:E(\gamma^{\xi}\circ X_1)=E(\gamma^{\xi}\circ X_2)=\ldots=E(\gamma^{\xi}\circ X_k)$ at the nominal significance level $\rho\in[0,1]$, $H_0$ should be rejected whenever $T_n>z_\rho$, where $z_\rho$ is the $100(1-\rho)$-quantile of the distribution of the limit expression in (4).
The limit distribution in (4) depends on the population covariances $\operatorname{cov}(s_{\mathcal{X}_i})$, $i=1,\ldots,k$, which are usually unknown in practice. In such situations, that distribution can be approximated by Monte Carlo simulations.
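The statistic in (3) is straightforward to evaluate once the fuzzy-valued samples are available. The sketch below is a minimal illustration under several assumptions that are not taken from the paper: fuzzy sets are stored as lists of (inf, sup) pairs on a common equally spaced $\alpha$ grid, $\varphi$ is the Lebesgue measure, $\theta=1/3$ by default, and all function names are hypothetical.

```python
def level_mean(sample):
    """Levelwise sample mean of fuzzy sets stored as lists of (inf, sup) pairs."""
    n = len(sample)
    return [(sum(F[k][0] for F in sample) / n, sum(F[k][1] for F in sample) / n)
            for k in range(len(sample[0]))]


def dist2(U, V, theta):
    """Squared D_theta^phi distance, averaging over the alpha grid (phi assumed Lebesgue)."""
    total = 0.0
    for (ul, uu), (vl, vu) in zip(U, V):
        d_mid = (ul + uu - vl - vu) / 2.0
        d_spr = (uu - ul - vu + vl) / 2.0
        total += d_mid ** 2 + theta * d_spr ** 2
    return total / len(U)


def t_statistic(groups, theta=1.0 / 3.0):
    """T_n = sum_i n_i * D(group mean_i, overall mean)^2 for a list of fuzzy samples."""
    means = [level_mean(g) for g in groups]
    grand = level_mean([F for g in groups for F in g])
    return sum(len(g) * dist2(m, grand, theta) for g, m in zip(groups, means))


def crisp(x, m=3):
    """Degenerate fuzzy set {x} on a grid of m alpha values (toy data for the demo)."""
    return [(x, x)] * m


groups = [[crisp(0.0), crisp(2.0)], [crisp(4.0), crisp(6.0)]]
print(t_statistic(groups))
```

With degenerate sets the statistic reduces to the between-group sum of squares of the underlying real values, which gives a quick sanity check of the implementation.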
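As an alternative to the asymptotic approach, a bootstrap testing procedure to solve (2) is proposed, which is always applicable in practice. Let $\{\mathcal{X}^*_{ij}\}_{j=1}^{n_i}$ be a bootstrap sample of $\{\mathcal{X}_{ij}\}_{j=1}^{n_i}$, for each $i=1,\ldots,k$, i.e. $\{\mathcal{X}^*_{ij}\}_{j=1}^{n_i}$ is drawn randomly and with replacement from $\{\mathcal{X}_{ij}\}_{j=1}^{n_i}$. The bootstrap statistic is defined as follows:

$$T_n^*=\sum_{i=1}^{k}n_i\,D_\theta^\varphi\left(\overline{\mathcal{X}^*_{i\cdot}}+\overline{\mathcal{X}}_{\cdot\cdot},\ \overline{\mathcal{X}}_{i\cdot}+\overline{\mathcal{X}^*_{\cdot\cdot}}\right)^2,\qquad(5)$$

where $\overline{\mathcal{X}^*_{i\cdot}}$ and $\overline{\mathcal{X}^*_{\cdot\cdot}}$ denote the group means and the overall mean of the bootstrap samples.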
By applying the Bootstrap Central Limit Theorem, it can be shown that $T_n^*$ converges in law to the same limit as the statistic $T_n$ in (3) when $H_0$ is true (see [9]). Consequently, the bootstrap distribution of $T_n^*$ approximates that of $T_n$ under $H_0$, and the following test resolution holds.
Proposition 2. To test $H_0:E(\gamma^{\xi}\circ X_1)=E(\gamma^{\xi}\circ X_2)=\ldots=E(\gamma^{\xi}\circ X_k)$ at the significance level $\rho\in[0,1]$, $H_0$ should be rejected whenever $T_n>z^*_\rho$, where $z^*_\rho$ is the $100(1-\rho)$-quantile of the bootstrap distribution of $T_n^*$.
The bootstrap statistic $T_n^*$ is defined coherently in terms of the arithmetic between fuzzy values and the distances between them. The corresponding quantile needed to solve the test can be easily approximated by re-sampling.
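The re-sampling scheme can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: $\varphi$ is the Lebesgue measure approximated on a midpoint grid of `m` values of $\alpha$, $\theta=1/3$, $a=1$, $x_0$ is replaced by the pooled sample mean because $E(X)$ is unknown, and the result is reported as a bootstrap p-value (the proportion of $T_n^*$ values at least as large as $T_n$) rather than as a quantile. Each fuzzy value is stored through its mid-points and radii, which are additive under the Minkowski sum and the product by positive scalars, so that means and sums can be computed coordinate-wise. All function names are hypothetical.

```python
import random


def f_map(t):
    """f(t) = (0.6**t + 0.001) / 1.001, the map used in the text's choice of xi."""
    return (0.6 ** t + 0.001) / 1.001


def upper(t, alpha):
    """Upper end-point of the alpha-level [0, .] of gamma_f(t), for t >= 0."""
    if t == 0.0:
        return 0.0
    ft = f_map(t)
    if alpha <= ft:
        return t * t / ft - t * t * (1.0 - ft) / ft ** 2 * alpha   # B - C * alpha
    return t * t / (1.0 - ft) * (1.0 - alpha)                       # A * (1 - alpha)


def fuzzify(x, x0, alphas, a=1.0):
    """gamma^xi(x) stored as [mid_1, ..., mid_m, spr_1, ..., spr_m]."""
    t = abs(x - x0) / a
    sign = 1.0 if x >= x0 else -1.0
    half = [upper(t, al) / 2.0 for al in alphas]
    return [x + sign * h for h in half] + half


def mean_vec(vectors):
    """Coordinate-wise mean, i.e. the levelwise mean in mid/spr coordinates."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dist2(u, v, theta):
    """Squared distance, averaging over the alpha grid (phi assumed Lebesgue)."""
    m = len(u) // 2
    return sum((u[k] - v[k]) ** 2 + theta * (u[m + k] - v[m + k]) ** 2
               for k in range(m)) / m


def bootstrap_test(samples, B=200, theta=1.0 / 3.0, m=20, seed=0):
    """Return (T_n, bootstrap p-value) for a list of real-valued samples."""
    rng = random.Random(seed)
    alphas = [(k + 0.5) / m for k in range(m)]
    pooled_raw = [x for s in samples for x in s]
    x0 = sum(pooled_raw) / len(pooled_raw)   # assumption: pooled mean stands in for E(X)
    groups = [[fuzzify(x, x0, alphas) for x in s] for s in samples]
    means = [mean_vec(g) for g in groups]
    grand = mean_vec([v for g in groups for v in g])
    t_obs = sum(len(g) * dist2(mi, grand, theta) for g, mi in zip(groups, means))
    exceed = 0
    for _ in range(B):
        boot = [[rng.choice(g) for _ in g] for g in groups]   # resample within groups
        b_grand = mean_vec([v for g in boot for v in g])
        t_star = 0.0
        for g, bg, mi in zip(groups, boot, means):
            lhs = [p + q for p, q in zip(mean_vec(bg), grand)]   # boot group mean + overall mean
            rhs = [p + q for p, q in zip(mi, b_grand)]           # group mean + boot overall mean
            t_star += len(g) * dist2(lhs, rhs, theta)
        exceed += t_star >= t_obs
    return t_obs, exceed / B


demo_rng = random.Random(1)
x = [demo_rng.gauss(0.0, 1.0) for _ in range(50)]
y = [demo_rng.gauss(3.0, 1.0) for _ in range(50)]
print(bootstrap_test([x, y], B=100))
```

Rejecting $H_0$ when the p-value is below $\rho$ is equivalent to comparing $T_n$ with the corresponding quantile of the simulated $T_n^*$ values. Larger values of `B` and `m` refine the approximation at a higher computational cost.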
3.1 Simulation studies
In order to illustrate the empirical behaviour of the proposed testing procedures, some simulations are shown. Let $\delta>0$ and $Z_i\sim\mathcal{N}(0,1)$, $i=1,2,3$. We define $X_1=Z_1$, $X_2=Z_2$ and $X_3=\delta Z_3$. It is immediate to check that $H_0:X_1\overset{d}{=}X_2\overset{d}{=}X_3$ holds when $\delta=1$. Moreover, $E(X_1)=E(X_2)=E(X_3)$ for all $\delta>0$. However, $Var(X_3)=\delta^2$ increases with $\delta$ and differs more and more from $Var(X_1)=Var(X_2)=1$ as $\delta$ increases.
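The simulated scenario can be reproduced with a few lines of code. The sketch below only generates the three samples and checks their empirical means and variances; the function name, the seed and the sample sizes are illustrative assumptions.

```python
import random
import statistics


def simulate(delta, sizes, seed=0):
    """Draw samples of X1 = Z1, X2 = Z2 and X3 = delta * Z3 with Z_i ~ N(0, 1)."""
    rng = random.Random(seed)
    scales = (1.0, 1.0, delta)
    return [[s * rng.gauss(0.0, 1.0) for _ in range(n)] for s, n in zip(scales, sizes)]


samples = simulate(delta=2.0, sizes=(5000, 5000, 5000), seed=1)
for i, s in enumerate(samples, start=1):
    # Means stay close to 0 for every group; only the variance of X3 scales with delta**2.
    print(i, round(statistics.mean(s), 3), round(statistics.variance(s), 3))
```

Because the three variables share the same centre and differ only in scale, procedures that mainly react to location shifts have little to detect in this design.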
A total of 10,000 random samples from $\{X_i\}_{i=1}^{3}$ are generated for different sample sizes $n_1,n_2,n_3$, respectively. For each case, the corresponding samples of the fuzzy representations $\gamma^{\xi}\circ X_i$ are constructed, with $\xi=(E(X),1,(0.6^{x}+0.001)/1.001)$, and the bootstrap test is run with $B=1000$ bootstrap replications. The percentage of rejections of the null hypothesis (2) (and so of (1)) over the 10,000 iterations of the test is computed. The results are compared with the classical non-parametric Kruskal-Wallis (KW) method to test the equality of real-valued distributions. Table 1 contains the results for different values of $\delta$. Some remarks can be made. First, the rejection rate of the proposed bootstrap test approaches the nominal significance level when $H_0$ is true ($\delta=1$) as the sample size increases, which agrees with the theoretical correctness of the method. Under $H_0$, the classical KW test is slightly closer to the nominal level than the bootstrap test. However, the fuzzy-based bootstrap test is consistent in all the situations considered, which is not the case for the classical one. The Kruskal-Wallis method does not detect the departure from $H_0$ (when $\delta$ increases), whereas the bootstrap method does so effectively, even for small and moderate samples.
Table 1 Percentage of rejections (bootstrap fuzzy test/classical KW test)

(n1, n2, n3)      δ = 1        δ = 2         δ = 4         δ = 6
(30,30,30)        3.80/4.69    34.16/5.70    34.46/6.81    35.86/7.45
(30,50,100)       4.22/5.09    90.36/2.46    71.62/2.52    83.32/2.62
(100,100,100)     4.62/5.13    90.52/5.59    72.22/7.20    88.45/7.68
(100,150,200)     4.68/5.04    97.78/4.13    87.64/4.92    99.48/5.15
(500,500,500)     4.80/5.11    99.98/5.88    100/6.96      100/7.89
3.2 Practical applications
Now that both the theoretical and the empirical correctness of the proposed fuzzy-based bootstrap testing procedure have been shown, the technique can be applied in practice. Let us consider the sample dataset Energy Efficiency from the UCI Repository (see [15]). It contains information about the heating load of houses as well as different features of the houses such as orientation, roof area, wall area, glazing area, etc. The aim is to test whether the heating load of the houses differs significantly depending on one of those features. For instance, if we consider $X_i$ = heating load in a house with orientation $i$, $i=1,2,3,4$, the hypothesis $H_0:X_1\overset{d}{=}X_2\overset{d}{=}X_3\overset{d}{=}X_4$ is not rejected at the usual significance levels, since the p-values obtained with the classical KW test and the fuzzy bootstrap test are .9941 and .8741, respectively. Nevertheless, for $Y_j$ = heating load in a house with glazing area $j$, $j=1,2,3$, the hypothesis $H_0:Y_1\overset{d}{=}Y_2\overset{d}{=}Y_3$ is rejected, with p-values $4.31\cdot10^{-13}$ and 0, respectively.
4 Conclusions
The statistical analysis of fuzzy-valued random sets is widely recognized as a powerful technique for developing descriptive and inferential studies in experimental scenarios affected by a certain degree of imprecision. Additionally, fuzzy-based techniques can also be applied to solve statistical problems for real-valued random variables through the fuzzy representation of the variables. In this work, the problem of testing the equality of two or more real-valued distributions is addressed. Simulations and applications show that the proposed techniques are a good alternative to classical methods: in the simulated scenarios, the bootstrap test detects differences in distribution that the Kruskal-Wallis method misses.

The effect on the test results of using different fuzzy operators to define the fuzzy representation of the variables could be further investigated. Besides, the extension to other classical statistical methods, such as discriminant or regression analysis, is still to be developed.
Acknowledgements The research in this paper is partially supported by the Spanish
National Grant MTM2013-44212-P, and the Regional Grant FC-15-GRUPIN-14-005. Their
financial support is gratefully acknowledged.
References
1. Blanco-Fernández A, Ramos-Guajardo AB, Colubi A (2013) Fuzzy representations of real-valued random variables: applications to exploratory and inferential studies. Metron 71:245–259
2. Blanco-Fernández A, Casals RM, Colubi A, Corral N, García-Bárzana M, Gil MA, González-Rodríguez G, López MT, Lubiano MA, Montenegro M, Ramos-Guajardo AB, de la Rosa de Sáa S, Sinova B (2014) A distance-based statistical analysis of fuzzy number-valued data. Int J Approx Reason 55:1487–1501
3. Colubi A (2009) Statistical inference about the means of fuzzy random variables: Applications to the analysis of fuzzy- and real-valued data. Fuzzy Set Syst 160:344–356
4. Couso I, Dubois D (2009) On the variability of the concept of variance for fuzzy random variables. IEEE Trans Fuzzy Sys 17(5):1070–1080
5. Couso I, Dubois D (2014) Statistical reasoning with set-valued information: Ontic vs. epistemic views. Int J Approx Reason 55:1502–1518
6. Diamond P, Kloeden P (1994) Metric spaces of fuzzy sets. World Scientific, Singapore
7. Gil MA, López-Díaz M, Ralescu DA (2006) Overview on the development of fuzzy random variables. Fuzzy Set Syst 157(19):2546–2557
8. González-Rodríguez G, Colubi A, Gil MA (2006) A fuzzy representation of random variables: An operational tool in exploratory analysis and hypothesis testing. Comp Stat Data An 51:163–176
9. González-Rodríguez G, Colubi A, Gil MA, Lubiano MA (2012) A new way of quantifying the symmetry of a random variable: Estimation and hypothesis testing. J Stat Plann Inf 142:3061–3072
10. Kwakernaak H (1978) Fuzzy random variables - I. Definitions and theorems. Inf Sci 15:1–29
11. Hüllermeier E (2014) Learning from imprecise and fuzzy observations: data disambiguation through generalized loss minimization. Int J Approx Reason 55:1519–1534
12. Körner R, Näther W (2002) On the variance of random fuzzy variables. In: Bertoluzza C, Gil MA, Ralescu DA (eds) Analysis and management of fuzzy data. Physica-Verlag, Heidelberg:22–39
13. Puri ML, Ralescu DA (1986) Fuzzy random variables. J Math Anal Appl 114:409–422
14. Trutschnig W, González-Rodríguez G, Colubi A, Gil MA (2009) A new family of metrics for compact, convex (fuzzy) sets based on a generalized concept of mid and spread. Inf Sci 179:3964–3972
15. Tsanas A, Xifara A (2012) Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings 49:560–567
16. Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning. Inf Sci 8:199–249