Article

Fuzziness in data analysis: Towards accuracy and robustness


Abstract

The first aim is to emphasize the use of fuzziness in data analysis to capture information that has traditionally been disregarded, at a cost in the precision of the conclusions. Fuzziness can be considered at various stages of the data analysis process, but the main target in this paper is fuzziness in the data. Depending on the nature of the fuzzy data or the purpose for which they are handled, different approaches should be applied. We attempt to contribute to the clarification of this difference, focusing on the so-called ontic approach in contrast to the epistemic approach. The second aim is to underline the need to consider robust methods that reduce the misleading impact of outliers in fuzzy data analysis. We propose trimming as a general and intuitive method to discard outliers. We exemplify this approach with the case of the ontic fuzzy trimmed mean/variance and highlight the differences with the epistemic case. All the discussions and developments are illustrated by means of a case study concerning the perception of lengths by men and women.

... To fix them, we can first calculate an initial robust estimator of location. In particular, we use the impartial trimmed mean of Cuesta-Albertos, J.A. and Fraiman, R., as in Colubi, A. and González-Rodríguez, G. (2015), for this purpose. Then, the tuning constants in the loss function are selected from the distribution of the distances between the observations and this initial estimate. ...
... Two sample sizes n will represent the even (n = 100) and odd (n = 101) cases. In each situation, using as starting value the impartial trimmed mean for functional data (see Cuesta-Albertos, J.A. and Fraiman, R., 2007; Colubi, A. and González-Rodríguez, G., 2015) with trimming proportion 0.5, the tuning parameter for the Huber loss function has been selected as commented in Remark 3.1 and the corresponding Huber M-estimate has been computed. ...
... We use a more refined algorithm, obtained by adapting the concentration algorithm of Rousseeuw, P.J. and Van Driessen, K. (2006) to our setting. Our adaptation is similar to the recent adaptation for the case of fuzzy-valued random variables in Colubi, A. and González-Rodríguez, G. (2015). By including the sample observations in the set of starting values of our algorithm, we can guarantee that its solution is at least as good as the approximation with the simple algorithm of Cuesta-Albertos, J.A. and Fraiman, R. ...
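As a rough illustration of the concentration idea in this excerpt, the sketch below iterates a C-step for a trimmed mean: keep the h observations closest to the current center, then recompute the center as their mean. This is an assumption-based analogue, not the authors' actual algorithm; fuzzy or functional observations are assumed to be discretized into vectors, and `trim_frac` and `n_iter` are illustrative parameter names.

```python
import numpy as np

def concentration_trimmed_mean(X, trim_frac=0.2, n_iter=20, rng=None):
    """Hypothetical C-step iteration for a trimmed mean.

    X: (n, d) array, each row a discretized fuzzy/functional observation.
    Returns the mean over the h = ceil(n * (1 - trim_frac)) retained rows.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    h = int(np.ceil(n * (1.0 - trim_frac)))
    center = X[rng.integers(n)]                  # start from a sample observation
    for _ in range(n_iter):
        d = np.linalg.norm(X - center, axis=1)   # distances to current center
        keep = np.argsort(d)[:h]                 # concentrate on the h closest
        new_center = X[keep].mean(axis=0)
        if np.allclose(new_center, center):
            break
        center = new_center
    return center
```

Including every sample observation in the set of starting values, as the excerpt suggests, would amount to looping this routine over all rows of X and keeping the solution with the smallest trimmed sum of squared distances.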
Article
Full-text available
M-estimators of location are widely used robust estimators of the center of univariate or multivariate real-valued data. This paper aims to study M-estimates of location in the framework of functional data analysis. To this end, recent developments for robust nonparametric density estimation by means of M-estimators are considered. These results can also be applied in the context of functional data analysis and allow to state conditions for the existence and uniqueness of location M-estimates in this setting. Properties of these functional M-estimators are investigated. In particular, their consistency is shown and robustness is studied by means of their breakdown point and their influence function. The finite-sample performance of the M-estimators is explored by simulation. The M-estimators are also empirically compared to trimmed means for functional data.
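For intuition, a location M-estimate under a Huber loss can be computed by iteratively reweighted averaging. The sketch below is a minimal illustration of that standard scheme, not the paper's exact procedure; functional data are assumed discretized into vectors and the tuning constant `k` is illustrative.

```python
import numpy as np

def huber_location(X, k=1.345, n_iter=50, tol=1e-8):
    """IRLS for a Huber M-estimate of location; rows of X are observations."""
    center = np.median(X, axis=0)                 # robust starting value
    for _ in range(n_iter):
        d = np.linalg.norm(X - center, axis=1)
        w = np.where(d <= k, 1.0, k / np.maximum(d, 1e-12))  # Huber weights
        new_center = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(new_center - center) < tol:
            break
        center = new_center
    return center
```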
... Different robust location alternatives for fuzzy numbers have already been proposed in the literature (see e.g. [1,2,3,4,5]). Among them, the adaptations of trimmed means and of M-estimators of location to the fuzzy number-valued setting can be highlighted due to their importance and success for real-valued random variables. ...
... • The sample fuzzy trimmed mean [3] is the fuzzy number $\frac{1}{h}\sum_{j \in \widehat{E}_{x_n}} \widetilde{x}_j$, where $\widehat{E}_{x_n}$ denotes the corresponding sample trimming region, that is, ...
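The excerpt truncates the definition of the trimming region. A plausible reconstruction, following the impartial trimming literature (an assumption, not the excerpt's literal continuation), is that $\widehat{E}_{x_n}$ is the best-concentrated h-subset with respect to the metric $D$:

```latex
\widehat{E}_{x_n} \in \operatorname*{arg\,min}_{E \subseteq \{1,\dots,n\},\; |E| = h}\;
\sum_{j \in E} D\!\Big(\widetilde{x}_j,\; \tfrac{1}{h}\sum_{i \in E} \widetilde{x}_i\Big)^{2}
```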
Chapter
Full-text available
Several location measures have already been proposed in the literature in order to summarize the central tendency of a random fuzzy number in a robust way. Among them, fuzzy trimmed means and fuzzy M-estimators of location extend two successful approaches from the real-valued settings. The aim of this work is to present an empirical comparison of different location estimators, including both fuzzy trimmed means and fuzzy M-estimators, to study their differences in finite sample behaviour.
... With the aim of avoiding an excessive influence of the measurement units on the outputs, due to the lack of scale equivariance unless ρ is a power function, the tuning parameters will be selected based on the distribution of the distances to the center. That is, we first compute an initial robust estimator of location (e.g., the impartial trimmed mean as in Colubi and González-Rodríguez [2] or, if p = 1, the 1-norm median in Sinova et al. [11]) and then the distances between each observation and this initial estimate are calculated. Our recommendation is to use the 1-norm median as the initial estimate when analyzing random fuzzy numbers, since its computation is not complex and it provides a good initial estimate whether or not there are outliers in the sample. ...
... Our recommendation is to use the 1-norm median as the initial estimate when analyzing random fuzzy numbers, since its computation is not complex and it provides a good initial estimate whether or not there are outliers in the sample. The impartial trimmed mean (see Colubi and González-Rodríguez [2]) presents the disadvantage of requiring the trimming proportion to be fixed 'a priori'; moreover, in case there are no outliers, the initial estimate could be somewhat far from the real center of the sample distribution. Throughout this paper, the tuning parameters a, b and c will be chosen as the median, the 75th and the 85th percentiles of those distances, following Kim and Scott's suggestion [9]. ...
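The quoted selection rule is easy to state in code. The sketch below is illustrative only; the initial estimate is approximated by a coordinate-wise median rather than the actual 1-norm median, and the function name is hypothetical.

```python
import numpy as np

def hampel_tuning_constants(X):
    """Pick a, b, c from the distance distribution, per the quoted rule."""
    center0 = np.median(X, axis=0)            # stand-in for the 1-norm median
    d = np.linalg.norm(X - center0, axis=1)   # distances to the initial estimate
    a, b, c = np.percentile(d, [50, 75, 85])
    return a, b, c
```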
Chapter
Full-text available
The Aumann-type mean is probably the best-known measure for the location of a random fuzzy set. Despite its numerous probabilistic and statistical properties, it inherits from the mean of a real-valued random variable the high sensitivity to outliers or data changes. Several alternatives extending the concept of median to the fuzzy setting have already been proposed in the literature. Recently, the adaptation of location M-estimators has also been tackled. The expression of fuzzy-valued location M-estimators as weighted means under mild conditions allows us to guarantee that these measures take values in the space of fuzzy sets. It has already been shown that these conditions hold for the Huber and Hampel families of loss functions. In this paper, the strong consistency and the maximum finite sample breakdown point when the Tukey biweight (or bisquare) loss function is chosen are analyzed. Finally, a real-life example will illustrate the influence of the choice of the loss function on the outputs.
... Over the last decades, the statistical analysis of imprecise-valued random variables has attracted great interest from both the epistemic and the ontic viewpoints (see, for extensive comparative discussions, e.g., [2,5]). These random variables associate outcomes of a random experiment (modelled through probability spaces) with elements in generalized spaces, such as the space of compact real intervals, the space of convex and compact subsets of R^p, the space of fuzzy numbers, or the space of convex and compact p-dimensional fuzzy sets. ...
... These random variables associate outcomes of a random experiment (modelled through probability spaces) with elements in generalized spaces, such as the space of compact real intervals, the space of convex and compact subsets of R^p, the space of fuzzy numbers, or the space of convex and compact p-dimensional fuzzy sets. From the 'ontic' perspective, the one considered in this paper, (fuzzy) set-valued data are considered as whole entities (see, e.g., [1,2,3,4]), in contrast to the epistemic approach, which considers (fuzzy) set-valued data as imprecise measurements of precise data (see, e.g., [5,6,9,14]). ...
Article
Full-text available
The space of nonempty convex and compact (fuzzy) subsets of R^p, K_c(R^p), has traditionally been used to handle imprecise data. Its elements can be characterized via the support function, which agrees with the usual Minkowski addition and naturally embeds K_c(R^p) into a cone of a separable Hilbert space. The support function embedding holds interesting properties, but it lacks an intuitive interpretation for imprecise data. As a consequence, it is not easy to identify the elements of the image space that correspond to sets in K_c(R^p). Moreover, although the Minkowski addition is very natural when p = 1, if p > 1 the shapes obtained when two sets are aggregated are apparently unrelated to the original sets, because the addition tends to convexify. An alternative and more intuitive functional representation is introduced in order to circumvent these difficulties. The imprecise data are modeled by using star-shaped sets on R^p. These sets are characterized through a center and the corresponding polar coordinates, which have a clear interpretation in terms of location and imprecision, and lead to a natural directional extension of the Minkowski addition. The structures required for a meaningful statistical analysis from the so-called ontic perspective are introduced, and how to determine the representation in practice is discussed.
... The particular case of Hilbert space-valued trimmed means for fuzzy number-valued data, made possible by the embedding of the space of fuzzy values into a closed convex cone of a Hilbert space of functions, was first considered by Colubi and González-Rodríguez (2015) in terms of the metric D_θ. Nevertheless, this notion can be extended to deal with the generic metric D in this paper. ...
... In Colubi and González-Rodríguez (2015), the Fast Minimum Covariance Determinant (for short, Fast-MCD) algorithm, which is an adaptation of the well-known k-means algorithm, was rewritten to compute the trimmed mean of a fuzzy-valued sample. We now present an adaptation of this algorithm that tries to avoid, as much as possible, the iterative process being trapped in a local minimum: ...
Article
Full-text available
Different approaches to robustly measure the location of data associated with a random experiment have been proposed in the literature, with the aim of avoiding the high sensitivity to outliers or data changes typical for the mean. In particular, M-estimators and trimmed means have been studied in general spaces, and can be used to handle Hilbert-valued data. Both alternatives are of interest due to their success in the classical framework. Since fuzzy set-valued data can be identified with a convex cone of a separable Hilbert space, the previous concepts have been recently applied to the one-dimensional fuzzy case. The aim of this paper is to extend M-estimators and trimmed means to p-dimensional fuzzy set-valued data, and to theoretically prove that they inherit robustness from the real settings. Some of such theoretical results are more general and directly apply to Hilbert-valued estimators and, in consequence, to functional data. A real-life example will also be included to illustrate the computation and behaviour of these estimators under contamination.
... In order to guarantee that the conclusions of a study remain valid even if there are some 'contaminated observations' among the collected data, robust location measures for fuzzy number-valued data sets have been proposed (see e.g. [3,4,5,6,7,8]). Fuzzy trimmed means, Huber and Hampel fuzzy M-estimates, the 1-norm median and the wabl/ldev/rdev median have been empirically compared by means of their bias, variance and mean squared error in [9]. As in the classical settings, it has been shown that there is no uniformly optimal location estimator. ...
... Let θ ∈ (0, +∞). For any β ∈ (0, 1), the sample fuzzy trimmed mean [6] is the fuzzy number $\frac{1}{h}\sum_{j \in \widehat{E}_{x_n}} \widetilde{x}_j$, where $\widehat{E}_{x_n}$ denotes the corresponding sample trimming region, that is, ...
Article
Full-text available
... This work proposed a general and intuitive method to discard outliers [8]. The absence of some important factors must be taken into account to deal with the lack of precision of the economic target, and on that basis fuzziness can achieve better results than treating as exact data that function separately. ...
... The first case (uncertainty is reducible) is referred to as "epistemic uncertainty" [15,16], "lack of knowledge" [17], "subjective" [18], "reducible", "type B" [19] or "type 2" uncertainty [20]. The second case (uncertainty is not reducible) is designated as "aleatory" [15,16], "ontological" [21] or "ontic" uncertainty [22], "variability" [17], "stochastic" [18], "irreducible", "type A" [19] or "type 1" uncertainty [20]. In the remainder of this paper the expressions epistemic and aleatory uncertainty are used, as they appear to be the most commonly applied in the literature. ...
Article
Full-text available
Methods for managing uncertainty and fuzziness caused by a turbulent and volatile corporate environment play an important role in ensuring the long-term competitiveness of producing companies. It is often difficult for practitioners to choose the optimal approach for modelling existing uncertainties in a meaningful way. This contribution provides a guideline for the classification of uncertain information and fuzzy data based on a flowchart, and proposes suitable modelling methods for each characterized uncertainty. In addition, a measure of modelability, the degree to which an uncertain or fuzzy parameter can be modelled, is proposed. The method is based on a literature review comprising a discussion of the terms uncertainty and fuzziness.
... Although this approach is fast, the final result is not precise, since the whole dataset has not been processed. Precision is an important aspect that must not be ignored in fuzzy clustering [20]. ...
Article
Clustering plays an important role in mining big data, both as a modeling technique and as a preprocessing step in many data mining implementations. Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing each data record to belong to more than one cluster to some degree. However, a serious challenge in fuzzy clustering is the lack of scalability. Massive datasets in emerging fields such as geosciences, biology and networking require parallel and distributed computations with high performance to solve real-world problems. Although some clustering methods have already been adapted to execute on big data platforms, their execution time grows sharply for large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering method named BigFCM is proposed and designed for the Hadoop distributed data platform. Based on the map-reduce programming model, it exploits several mechanisms, including an efficient caching design, to achieve several orders of magnitude reduction in execution time. Extensive evaluation over multi-gigabyte datasets shows that BigFCM is scalable while preserving the quality of the clustering.
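The core FCM iteration that a distributed variant such as BigFCM must parallelize is compact. The following single-machine sketch of the standard update equations (fuzzifier `m`; all names illustrative) may help fix ideas; it is a plain reference implementation, not BigFCM itself.

```python
import numpy as np

def fcm(X, k, m=2.0, n_iter=100, tol=1e-6, rng=None):
    """Plain fuzzy c-means: alternate membership and centroid updates."""
    rng = np.random.default_rng(rng)
    n = len(X)
    U = rng.dirichlet(np.ones(k), size=n)         # memberships, rows sum to 1
    V = None
    for _ in range(n_iter):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]    # weighted centroids
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                  # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```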
... In [40] a median for random fuzzy sets is defined. In addition, a robust population central location measure and a trimming approach have been proposed in [5] in order to reduce the impact of outliers in fuzzy data analysis. ...
Chapter
Full-text available
Sensory analysis entails subjective valuations provided by qualified experts, which in most cases are given by means of a real value. However, personal valuations usually present an uncertainty in their meaning which is difficult to capture with a unique value. In this work some statistical techniques to deal with this kind of information are presented. The methodology is illustrated through a case study in which some tasters have been asked to use trapezoidal fuzzy numbers to express their perceptions regarding the quality of the so-called Gamonedo blue cheese. In order to establish an agreement between the tasters, a weighted summary measure of the information collected is described. This leads to assigning a weight to each expert depending on the influence they have when the weighted mean is computed. An example of the real-life application is also provided.
... The absence of some important factors must be taken into account to deal with the lack of precision of the economic target, and on that basis fuzziness can achieve better results than the exactness of the data. This thesis examines fuzzy querying as a human-consistent and friendly way of retrieving information, reflecting real human intentions and preferences expressed in natural language and represented with the use of fuzzy logic and possibility theory [8]. Bipolar queries can accommodate the user's intentions and preferences involving some sort of required, desired, mandatory and optional elements. ...
Article
Full-text available
This paper analyzes the key elements of fuzzy logic and shows that, through rational, behavioral and neo-classical economics, it is possible to develop models using the Q. E. methodology. Therefore, it is plausible to apply the contemporaneous Q. E. methodology in combination with rationality and the behavioral approach. Fuzzy logic and generativity are the sources of this mechanism for the production of the appropriate models.
... In recent years, the advantages of modeling human perceptions or expert opinions have been demonstrated through what are known as fuzzy sets or fuzzy data. Fuzzy data are functional data with values on the scale [0,1], which help to represent the continuous nature of opinions (from completely agree to completely disagree) [9,10]. Fuzzy logic is able to deal with imprecise and uncertain information, and it has been proposed as useful in clinical decision making, considering that symptoms and diseases are "fuzzy" in nature. ...
Article
Full-text available
Introduction: The mortality risk in children admitted to Pediatric Intensive Care Units (PICU) is usually estimated by means of validated scales, which only include objective data among their items. Human perceptions may also add relevant information for prognosticating the risk of death, and the tool to use this subjective data is fuzzy logic. The objective of our study was to develop a mathematical model to predict mortality risk based on the subjective perception of PICU staff and to evaluate its accuracy compared to validated scales. Methods: A prospective observational study in two PICUs (one in Spain and another in Latvia) was performed. Children were consecutively included regardless of the cause of admission over a two-year period. A fuzzy set program was developed for the PICU staff to record their subjective assessment of the patients' mortality risk, expressed through a short range and a long range, both between 0% and 100%. Pediatric Index of Mortality 2 (PIM2) and Therapeutic Intervention Scoring System 28 (TISS28) were also prospectively calculated for each patient. Subjective and objective predictions were compared using logistic regression analysis. To assess the prognostication ability of the models, a stratified B-random K-fold cross-validation was performed. Results: Five hundred ninety-nine patients were included, 308 in Spain (293 survivors, 15 nonsurvivors) and 291 in Latvia (282 survivors, 9 nonsurvivors). The best logistic classification model for subjective information was the one based on MID (midpoint of the short range), whereas for objective information it was the one based on PIM2. Mortality estimation performance was 86.3% for PIM2, 92.6% for MID, and the combination of MID and PIM2 reached 96.4%. Conclusions: Subjective assessment was as useful as validated scales to estimate the risk of mortality. A hybrid model including fuzzy information and probabilistic scales (PIM2) seems to increase the accuracy of prognosticating mortality in the PICU.
Chapter
Full-text available
In this study, the dynamic efficiency change of the CAYKUR production plants was analyzed. The application used the Hatami-Marbini, Tavana & Emrouznejad (HTE, 2012) model, a fuzzy Malmquist Total Factor Productivity Index (MTFPI) method based on FDEA for which only a limited number of empirical studies have been made. The HTE model offers dynamic efficiency measurement under the assumption of variable returns to scale.
Article
Most of the distances used for fuzzy data are based on the well-known Euclidean distance. In detail, a fuzzy number can be characterized by centers and spreads, and the most common distances between fuzzy numbers are essentially defined as a weighted sum of the squared Euclidean distances between the centers and the spreads. In the multivariate case, the Euclidean distance does not take into account the correlation structure between variables. For this reason, the Mahalanobis distance has been introduced, which involves the corresponding covariance matrix between the variables. A generalization of that distance to the fuzzy framework is proposed. It is shown to be useful in different contexts and, in particular, in a clustering approach. As a result, non-spherical clusters, which generally are not recognized by means of Euclidean-type distances, can be recognized by means of the suggested distance. Clustering applications are reported in order to check the adequacy of the proposed approach.
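One plausible way to realize this idea is sketched below, under the assumption that each fuzzy observation is coded by its centers and spreads stacked into one feature vector; the covariance of these features then replaces the identity matrix of the Euclidean case. The coding and names are illustrative, not the paper's exact construction.

```python
import numpy as np

def fuzzy_mahalanobis(F, i, j):
    """Mahalanobis-type distance between fuzzy observations i and j.

    F: (n, q) array; each row stacks the centers and spreads of one
       observation across all variables (an illustrative coding).
    """
    S = np.cov(F, rowvar=False)        # covariance of the center/spread features
    S_inv = np.linalg.pinv(S)          # pseudo-inverse for numerical safety
    diff = F[i] - F[j]
    return float(np.sqrt(diff @ S_inv @ diff))
```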
Article
Empirical trimmed means have been studied in general spaces and, in particular, they have been applied to the one-dimensional fuzzy case. They provide a competing robust estimation procedure of the central tendency for fuzzy number-valued data, but they are not the only way to define a trimmed mean in this space. The aim is to adapt trimmed means defined on the basis of certain depth functions to the framework of fuzzy number-valued data and to compare their behaviour with that of empirical fuzzy trimmed means. The first idea for evaluating the depth of a fuzzy number-valued observation consists of applying an existing functional depth to the expression of such an observation as a function. The second alternative introduces a depth function specifically defined for fuzzy numbers. The empirical performance of both proposals is analyzed.
Article
This work focuses on robust clustering of data affected by imprecision. The imprecision is managed in terms of fuzzy sets. The clustering process is based on the fuzzy and possibilistic approaches. In both approaches the observations are assigned to the clusters by means of membership degrees. In fuzzy clustering the membership degrees express the degrees of sharing of the observations to the clusters. In contrast, in possibilistic clustering the membership degrees are degrees of typicality. These two sources of information are complementary because the former helps to discover the best fuzzy partition of the observations while the latter reflects how well the observations are described by the centroids and, therefore, is helpful to identify outliers. First, a fully possibilistic k-means clustering procedure is suggested. Then, in order to exploit the benefits of both the approaches, a joint possibilistic and fuzzy clustering method for fuzzy data is proposed. A selection procedure for choosing the parameters of the new clustering method is introduced. The effectiveness of the proposal is investigated by means of simulated and real-life data.
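The typicality degrees of the possibilistic approach follow a different update rule from the sum-to-one fuzzy memberships. A minimal sketch of the standard possibilistic update (with illustrative scale parameters `eta`; this is the generic rule, not the paper's exact estimator) is:

```python
import numpy as np

def possibilistic_memberships(X, V, eta, m=2.0):
    """Typicality of each point for each cluster; no sum-to-one constraint.

    X: (n, d) data, V: (k, d) centroids, eta: (k,) cluster scale parameters.
    """
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return 1.0 / (1.0 + (d2 / eta[None, :]) ** (1.0 / (m - 1.0)))
```

Outliers are far from every centroid, so all their typicality degrees are close to zero, which is exactly the property the abstract exploits to identify them.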
Chapter
One of the most common spaces to model imprecise data through (fuzzy) sets is that of convex and compact (fuzzy) subsets in \(\mathbb {R}^p\). The properties of compactness and convexity allow the identification of such elements by means of the so-called support function, through an embedding into a functional space. This embedding satisfies certain valuable properties, however it is not always intuitive. Recently, an alternative functional representation has been considered for the analysis of imprecise data based on the star-shaped sets theory. The alternative representation admits an easier interpretation in terms of ‘location’ and ‘imprecision’, as a generalized idea of the concepts of mid-point and spread of an interval. A comparative study of both functional representations is made, with an emphasis on the structures required for a meaningful statistical analysis from the ontic perspective.
Chapter
Full-text available
“Understanding and managing cybersecurity requires not only investing in technology but also considering its non-technical social and human aspects. Because cybersecurity is now a strategic necessity for digitalizing businesses.” A. Asiltürk
Article
Fuzzy sets generalize the concept of sets by considering that elements belong to a class (or fulfil a property) with a degree of membership (or certainty) ranging between 0 and 1. Fuzzy sets have been used in diverse areas to model gradual transitions as opposed to abrupt changes. In econometrics and statistics, this has been especially relevant in clustering, regression discontinuity designs, and imprecise data modelling, to name but a few. Although membership functions vary between 0 and 1 as probabilities do, the nature of the imprecision captured by fuzzy sets is usually different from stochastic uncertainty. The aim is to illustrate the advantages of combining fuzziness, imprecision, or partial knowledge with randomness through various key methodological problems. Emphasis will be placed on the management of non-precise data modelled through (fuzzy) sets. Software to apply the reviewed methodology will be suggested. Some open problems that could be of future interest will be discussed.
Article
Full-text available
The notion of fuzzy random variable has been introduced to model random mechanisms generating imprecisely-valued data which can be properly described by means of fuzzy sets. Probabilistic aspects of these random elements have been deeply discussed in the literature. However, the statistical analysis of fuzzy random variables has not received so much attention, in spite of the fact that the implications of this analysis range over many fields, including Medicine, Sociology, Economics, and so on. A summary of the fundamentals of fuzzy random variables is presented. Then, some related 'parameters' associated with the distribution of these variables are defined. Inferential procedures concerning these 'parameters' are described. Some recent results related to linear models for fuzzy data are finally reviewed.
Article
Full-text available
Real-life data associated with experimental outcomes are not always real-valued. In particular, opinions, perceptions, ratings, etc. are often assumed to be imprecise in nature, especially when they come from human valuations. Fuzzy numbers have long been considered to provide a convenient scale to express these imprecise data. In analyzing fuzzy data from a statistical perspective, one finds two key obstacles, namely, the nonlinearity associated with the usual arithmetic on fuzzy data and the lack of suitable models and limit results for the distribution of fuzzy-valued statistics. These obstacles can frequently be bypassed by using an appropriate metric between fuzzy data, the notion of random fuzzy set, and a bootstrapped central limit theorem for general space-valued random elements. This paper aims to review these key ideas, and a methodology for the statistical analysis of fuzzy number data which has been developed over the last years.
Article
Full-text available
Very large (VL) data or big data are any data that you cannot load into your computer's working memory. This is not an objective definition, but a definition that is easy to understand and one that is practical, because there is a dataset too big for any computer you might use; hence, this is VL data for you. Clustering is one of the primary tasks used in the pattern recognition and data mining communities to search VL databases (including VL images) in various applications, and so, clustering algorithms that scale well to VL data are important and useful. This paper compares the efficacy of three different implementations of techniques aimed to extend fuzzy c-means (FCM) clustering to VL data. Specifically, we compare methods that are based on 1) sampling followed by noniterative extension; 2) incremental techniques that make one sequential pass through subsets of the data; and 3) kernelized versions of FCM that provide approximations based on sampling, including three proposed algorithms. We use both loadable and VL datasets to conduct the numerical experiments that facilitate comparisons based on time and space complexity, speed, quality of approximations to batch FCM (for loadable data), and assessment of matches between partitions and ground truth. Empirical results show that random sampling plus extension FCM, bit-reduced FCM, and approximate kernel FCM are good choices to approximate FCM for VL data. We conclude by demonstrating the VL algorithms on a dataset with 5 billion objects and presenting a set of recommendations regarding the use of different VL FCM clustering schemes.
Article
Full-text available
New measures of skewness for real-valued random variables are proposed. The measures are based on a functional representation of real-valued random variables. Specifically, the expected value of the transformed random variable can be used to characterize the distribution of the original variable. Firstly, estimators of the proposed skewness measures are analyzed. Secondly, asymptotic tests for symmetry are developed. The tests are consistent for both discrete and continuous distributions. Bootstrap versions improving the empirical results for moderated and small samples are provided. Some simulations illustrate the performance of the tests in comparison to other methods. The results show that our procedures are competitive and have some practical advantages.
Article
Full-text available
In quantifying the central tendency of the distribution of a random fuzzy number (or fuzzy random variable in Puri and Ralescu's sense), the most usual measure is the Aumann-type mean, which extends the mean of a real-valued random variable and preserves its main properties and behavior. Although such a behavior has very valuable and convenient implications, ‘extreme’ values or changes of data entail too much influence on the Aumann-type mean of a random fuzzy number. This strong influence motivates the search for a more robust central tendency measure. In this respect, this paper aims to explore the extension of the median to random fuzzy numbers. This extension is based on the 1-norm distance and its adequacy will be shown by analyzing its properties and comparing its robustness with that of the mean both theoretically and empirically.
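For trapezoidal or otherwise discretized fuzzy numbers, this median admits a level-wise characterization: at each α-level, take the sample medians of the interval endpoints. The sketch below follows that characterization on a common α-grid; treat the representation details as an assumption-based illustration rather than the paper's exact formulation.

```python
import numpy as np

def one_norm_median(inf_curves, sup_curves):
    """Level-wise median of fuzzy numbers given on a common alpha grid.

    inf_curves, sup_curves: (n, L) arrays with the lower/upper endpoints of
    the alpha-cuts of n fuzzy numbers at L alpha levels.
    Returns the endpoint curves of the median fuzzy number.
    """
    med_inf = np.median(inf_curves, axis=0)   # median of lower endpoints
    med_sup = np.median(sup_curves, axis=0)   # median of upper endpoints
    return med_inf, med_sup
```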
Article
Full-text available
A procedure to test hypotheses about the population variance of a fuzzy random variable is analyzed. The procedure is based on the theory of UH-statistics. The variance is defined in terms of a general metric to quantify the variability of the fuzzy values about its (fuzzy) mean. An asymptotic one-sample test in a wide setting is developed and a bootstrap test, which is more suitable for small and moderate sample sizes, is also studied. Moreover, the power function of the asymptotic procedure through local alternatives is analyzed. Some simulations showing the empirical behavior and consistency of both tests are carried out. Finally, some illustrative examples of the practical application of the proposed tests are presented.
Article
Full-text available
Fuzzy random variables possess several interpretations. Historically, they were proposed either as a tool for handling linguistic label information in statistics or to represent uncertainty about classical random variables. Accordingly, there are two different approaches to the definition of the variance of a fuzzy random variable. In the first one, the variance of the fuzzy random variable is defined as a crisp number, which makes it easier to handle in further processing. In the second case, the variance is defined as a fuzzy interval, thus offering a gradual description of our incomplete knowledge about the variance of an underlying, imprecisely observed, classical random variable. In this paper, we also discuss another view of fuzzy random variables, which comes down to a set of random variables induced by a fuzzy relation describing an ill-known conditional probability. This view leads to yet another definition of the variance of a fuzzy random variable in the context of the theory of imprecise probabilities. The new variance is a real interval, which achieves a compromise between both previous definitions in terms of representation simplicity. Our main objective is to demonstrate, with the help of simple examples, the practical significance of these definitions of variance induced by various existing views of fuzzy random variables.
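For reference, the first (crisp) definition mentioned above is usually a Fréchet-type variance with respect to a chosen metric D on the space of fuzzy values; in the notation assumed here,

```latex
\operatorname{Var}(X) \;=\; \mathbb{E}\!\left[\, D\big(X,\ \mathbb{E}[X]\big)^{2} \,\right],
```

where the expectation of X is taken in the Aumann sense. The fuzzy-valued and interval-valued variances discussed in the abstract arise instead from propagating the imprecision of X through the classical variance.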
Article
Full-text available
Testing methods are introduced in order to determine whether there is some ‘linear’ relationship between imprecise predictor and response variables in a regression analysis. The variables are assumed to be interval-valued. Within this context, the variables are formalized as compact convex random sets, and an interval arithmetic-based linear model is considered. Then, a suitable equivalence for the hypothesis of linear independence in this model is obtained in terms of the mid-spread representations of the interval-valued variables. That is, in terms of some moments of random variables. Methods are constructed to test this equivalent hypothesis; in particular, the one based on bootstrap techniques will be applicable in a wide setting. The methodology is illustrated by means of a real-life example, and some simulation studies are considered to compare techniques in this framework.
Article
Full-text available
In previous studies we have stated that the well-known bootstrap techniques are a valuable tool in testing statistical hypotheses about the means of fuzzy random variables, when these variables are supposed to take on a finite number of different values, these values being fuzzy subsets of the one-dimensional Euclidean space. In this paper we show that the one-sample method of testing about the mean of a fuzzy random variable can be extended to general ones (more precisely, to those whose range is not necessarily finite and whose values are fuzzy subsets of a finite-dimensional Euclidean space). This extension is immediately developed by combining some tools in the literature, namely, bootstrap techniques on Banach spaces, a metric between fuzzy sets based on the support function, and an embedding of the space of fuzzy random variables into a Banach space which is based on the support function.
Article
Full-text available
One of the most important aspects of the (statistical) analysis of imprecise data is the usage of a suitable distance on the family of all compact, convex fuzzy sets, one that is not too hard to calculate and that reflects the intuitive meaning of fuzzy sets. On the basis of expressing the metric of Bertoluzza et al. [C. Bertoluzza, N. Corral, A. Salas, On a new class of distances between fuzzy numbers, Mathware Soft Comput. 2 (1995) 71–84] in terms of the midpoints and spreads of the corresponding intervals, we construct new families of metrics on the family of all d-dimensional compact convex sets as well as on the family of all d-dimensional compact convex fuzzy sets. It is shown that these metrics not only fulfill many good properties but are also easy to calculate and easy to manage for statistical purposes, and therefore useful from the practical point of view.
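For fuzzy numbers (d = 1), a member of this family can be written directly in terms of the mids and spreads of the α-cuts; with a weight θ > 0 and Lebesgue measure on [0,1], a representative form (notation assumed here for illustration) is:

```latex
D_{\theta}^{2}(A, B) \;=\; \int_{0}^{1} \Big[ \big(\operatorname{mid} A_{\alpha} - \operatorname{mid} B_{\alpha}\big)^{2}
\;+\; \theta\, \big(\operatorname{spr} A_{\alpha} - \operatorname{spr} B_{\alpha}\big)^{2} \Big] \, d\alpha .
```

The weight θ tunes the relative importance of imprecision (spread) against location (mid), which is what makes the family convenient for statistical purposes.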
Article
Full-text available
The supervised classification of fuzzy data obtained from a random experiment is discussed. The data generation process is modelled through random fuzzy sets which, from a formal point of view, can be identified with certain function-valued random elements. First, one of the most versatile discriminant approaches in the context of functional data analysis is adapted to the specific case of interest. In this way, discriminant analysis based on nonparametric kernel density estimation is discussed. In general, this criterion is shown not to be optimal and to require large sample sizes. To avoid such inconveniences, a simpler approach which eludes the density estimation by considering conditional probabilities on certain balls is introduced. The approaches are applied to two experiments; one concerning fuzzy perceptions and linguistic labels and another one concerning flood analysis. The methods are tested against linear discriminant analysis and random K-fold cross validation.
Article
Full-text available
The use of the fuzzy scale of measurement to describe an important number of observations from real-life attributes or variables is first explored. In contrast to other well-known scales (like nominal or ordinal), a wide class of statistical measures and techniques can be properly applied to analyze fuzzy data. This fact is connected with the possibility of identifying the scale with a special subset of a functional Hilbert space. The identification can be used to develop methods for the statistical analysis of fuzzy data by considering techniques in functional data analysis and vice versa. In this respect, an approach to the FANOVA test is presented and analyzed, and it is later particularized to deal with fuzzy data. The proposed approaches are illustrated by means of a real-life case study.
Article
Full-text available
In this paper we define the concept of a normal fuzzy random variable and we prove the following representation theorem: Every normal fuzzy random variable equals the sum of its expected value and a mean zero random vector.
Article
Full-text available
The minimum covariance determinant (MCD) method of Rousseeuw (1984) is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now, applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size: one about a production process at Philips with n = 677 objects and p = 9 variables, and a data set from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call 'selective iteration' and 'nested extensions'. For small data sets FAST-MCD typically finds the exact MCD, whereas for larger data sets it gives more accurate results than existing algorithms.
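The engine of FAST-MCD is the concentration step ("C-step"): starting from any h-subset, compute its mean and covariance, then keep the h points with the smallest resulting Mahalanobis distances; this can only decrease the covariance determinant. A compact sketch of one C-step (illustrative, without the paper's 'selective iteration' and 'nested extensions' refinements) follows.

```python
import numpy as np

def c_step(X, subset, h):
    """One concentration step of FAST-MCD on data X given a current h-subset."""
    mu = X[subset].mean(axis=0)
    S = np.cov(X[subset], rowvar=False)
    S_inv = np.linalg.pinv(S)
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # squared Mahalanobis
    return np.argsort(d2)[:h]                          # new, better h-subset
```

Iterating `c_step` until the subset stops changing terminates in finitely many steps, since the determinant decreases monotonically along the iterations.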
Article
In information processing tasks, sets may have a conjunctive or a disjunctive reading. In the conjunctive reading, a set represents an object of interest and its elements are subparts of the object, forming a composite description. In the disjunctive reading, a set contains mutually exclusive elements and refers to the representation of incomplete knowledge. It does not model an actual object or quantity, but partial information about an underlying object or a precise quantity. This distinction between what we call ontic vs. epistemic sets remains valid for fuzzy sets, whose membership functions, in the disjunctive reading are possibility distributions, over deterministic or random values. This paper examines the impact of this distinction in statistics. We show its importance because there is a risk of misusing basic notions and tools, such as conditioning, distance between sets, variance, regression, etc. when data are set-valued. We discuss several examples where the ontic and epistemic points of view yield different approaches to these concepts.
Article
In this paper, a model-free approach for data mining in engineering is presented. The numerical approach is based on artificial neural networks. Recurrent neural networks for fuzzy data are developed to identify and predict complex dependencies from uncertain data. Uncertain structural processes obtained from measurements or numerical analyses are used to identify the time-dependent behavior of engineering structures. Structural action and response processes are treated as fuzzy processes. The identification of uncertain dependencies between structural action and response processes is realized by recurrent neural networks for fuzzy data. Algorithms for signal processing and network training are presented. The new recurrent neural network approach is verified by a fuzzy fractional rheological material model. An application for the identification and prediction of time-dependent structural behavior under dynamic loading is presented.
Article
A new linear regression model for an interval-valued response and a real-valued explanatory variable is presented. The approach is based on the interval arithmetic. Comparisons with previous methods are discussed. The new linear model is theoretically analyzed and the regression parameters are estimated. Some properties of the regression estimators are investigated. Finally, the performance of the procedure is illustrated using both a real-life application and simulation studies.
Article
This short paper discusses the contributions made to the featured section on Low Quality Data. We further refine the distinction between the ontic and epistemic views of imprecise data in statistics. We also question the extent to which likelihood functions can be viewed as belief functions. Finally we comment on the data disambiguation effect of learning methods, connecting it to data reconciliation problems.
Article
In possibilistic clustering objects are assigned to clusters according to the so-called membership degrees taking values in the unit interval. Differently from fuzzy clustering, it is not required that the sum of the membership degrees of an object to all clusters is equal to one. This is very helpful in the presence of outliers, which are usually assigned to the clusters with membership degrees close to zero. Unfortunately, a drawback of the possibilistic approach is the tendency to produce coincident clusters. A remedy is to add a repulsion term among prototypes in the loss function forcing the prototypes to be far 'enough' from each other. Here, a possibilistic clustering algorithm with repulsion constraints for imprecise data, managed in terms of fuzzy sets, is introduced. Applications to synthetic and real fuzzy data are considered in order to analyze how the proposed clustering algorithm works in practice.
Article
In standard regression analysis the relationship between the (response) variable and a set of (explanatory) variables is investigated. In the classical framework the response is affected by probabilistic uncertainty (randomness) and, thus, treated as a random variable. However, the data can also be subjected to other kinds of uncertainty such as imprecision. A possible way to manage all of these uncertainties is represented by the concept of fuzzy random variable (FRV). The most common class of FRVs is the LR family (LR FRV), which allows us to express every FRV in terms of three random variables, namely, the center, the left spread and the right spread. In this work, limiting our attention to the LR FRV class, we consider the linear regression problem in the presence of one or more imprecise random elements. The procedure for estimating the model parameters and the determination coefficient are discussed and the hypothesis testing problem is addressed following a bootstrap approach. Furthermore, in order to illustrate how the proposed model works in practice, the results of a real-life example are given together with a comparison with those obtained by applying classical regression analysis.
Article
Product planning is one of four important processes in new product development (NPD) using quality function deployment (QFD), which is a widely used customer-driven approach. In our opinion, the first problem to be solved is how to incorporate both qualitative and quantitative information regarding relationships between customer requirements (CRs) and engineering characteristics (ECs) as well as those among ECs into the problem formulation. Owing to the typical vagueness or imprecision of functional relationships in a product, product planning is becoming more difficult, particularly in a fuzzy environment. In this paper, an asymmetric fuzzy linear regression approach is proposed to estimate the functional relationships for product planning based on QFD. Firstly, by integrating the least-squares regression into fuzzy linear regression, a pair of hybrid linear programming models with asymmetric triangular fuzzy coefficients are developed to estimate the functional relationships for product planning under uncertainties. Secondly, using the basic concept of fuzzy regression, asymmetric triangular fuzzy coefficients are extended to asymmetric trapezoidal fuzzy coefficients, and another pair of hybrid linear programming models with asymmetric trapezoidal fuzzy coefficients is proposed. The main advantage of these hybrid-programming models is to integrate both the property of central tendency in least squares and the possibilistic property in fuzzy regression. Next, the illustrated example shows that trapezoidal fuzzy number coefficients have more flexibility to handle a wider variety of systematic uncertainties and ambiguities that cannot be modeled efficiently using triangular number fuzzy coefficients. Both asymmetric triangular and trapezoidal fuzzy number coefficients can be applicable to a much wider variety of design problems where uncertain, qualitative, and fuzzy relationships are involved than when symmetric triangular fuzzy numbers are used. Finally, future research direction is also discussed.
Article
A linear regression model with imprecise response and p real explanatory variables is analyzed. The imprecision of the response variable is functionally described by means of certain kinds of fuzzy sets, the LR fuzzy sets. The LR fuzzy random variables are introduced to model usual random experiments when the characteristic observed on each result can be described with fuzzy numbers of a particular class, determined by 3 random values: the center, the left spread and the right spread. In fact, these constitute a natural generalization of the interval data. To deal with the estimation problem the space of the LR fuzzy numbers is proved to be isometric to a closed and convex cone of R3 with respect to a generalization of the most used metric for LR fuzzy numbers. The expression of the estimators in terms of moments is established, their limit distribution and asymptotic properties are analyzed and applied to the determination of confidence regions and hypothesis testing procedures. The results are illustrated by means of some case-studies.
Article
In this paper we define the concepts of fuzzy random variable and the expectation of a fuzzy random variable. The new definition of expectation generalizes the integral of a set-valued function. We derive some properties of these new concepts. By considering a suitable generalization of the Hausdorff metric, we derive the Lebesgue-dominated convergence type theorem.
Article
In this paper, we propose an iterative algorithm for multiple regression with fuzzy variables. While using the standard least-squares criterion as a performance index, we pose the regression problem as a gradient-descent optimisation. The separation of the evaluation of the gradient and the update of the regression variables makes it possible to avoid undue complication of analytical formulae for multiple regression with fuzzy data. The origins of fuzzy input data are traced back to the fundamental concept of information granulation and an example FCM-based granulation method is proposed and illustrated by some numerical examples. The proposed multiple regression algorithm is applied to one-, three- and nine-dimensional synthetic data sets as well as the 13-dimensional Boston Housing dataset from the machine learning repository. The algorithm's performance is illustrated by the corresponding plots of convergence of regression parameters and the values of the prediction error of the resulting regression model. General comments on the numerical complexity of the proposed algorithm are also provided.
Article
Statistical data are frequently not precise numbers but more or less non-precise, also called fuzzy. Measurements of continuous variables are always fuzzy to a certain degree. Therefore histograms and generalized classical statistical inference methods for univariate fuzzy data have to be considered. Moreover Bayesian inference methods in the situation of fuzzy a priori information and fuzzy data are discussed.
Article
A generalized simple linear regression statistical/probabilistic model in which both input and output data can be fuzzy subsets of Rp is dealt with. The regression model is based on a fuzzy-arithmetic approach and it considers the possibility of fuzzy-valued random errors. Specifically, the least-squares estimation problem in terms of a versatile metric is addressed. The solutions are established in terms of the moments of the involved random elements by employing the concept of support function of a fuzzy set. Some considerations concerning the applicability of the model are made.
Article
This paper introduces a new approach to regression analysis based on a fuzzy extension of belief function theory. For a given input vector x, the method provides a prediction of the value of the output variable y in the form of a fuzzy belief assignment (FBA), defined as a collection of fuzzy sets of values with associated masses of belief. The output FBA is computed using a nonparametric, instance-based approach: training samples in the neighborhood of x are considered as sources of partial information on the response variable; the pieces of evidence are discounted as a function of their distance to x, and pooled using Dempster's rule of combination. The method can cope with heterogeneous training data, including numbers, intervals, fuzzy numbers and, more generally, fuzzy belief assignments, a convenient formalism for modeling unreliable and imprecise information provided by experts or multi-sensor systems. The performance of the method is compared to that of standard regression techniques using several simulated data sets.
Article
The results obtained in part I of the paper are specialized to the case of discrete fuzzy random variables. A more intuitive interpretation is given of the notion of fuzzy random variables. Algorithms are derived for determining expectations, fuzzy probabilities, fuzzy conditional expectations and fuzzy conditional probabilities related to discrete fuzzy random variables. These algorithms are applied to illustrative examples. A sample application to a medical diagnosis problem is briefly discussed.
Article
When we have only interval ranges of sample values x1, …, xn, what is the interval of possible values for the variance V of these values? There are quadratic-time algorithms for computing the exact lower bound $\underline{V}$ on the variance of interval data, and for computing the exact upper bound $\overline{V}$ under reasonable, easily verifiable conditions. The problem is that in real life we often make additional measurements. In traditional statistics, if we have a new measurement result, we can modify the value of the variance in constant time. In contrast, previously known algorithms for processing interval data required that, once a new data point is added, we start from the very beginning. In this paper, we describe new algorithms for statistical processing of interval data, in which adding a data point requires only O(n) computational steps.
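Both endpoints of the variance range can be bracketed naively: the variance is a convex function of (x1, …, xn), so its maximum over the box of intervals is attained at a vertex, while its minimum is a convex minimization over the box. The sketch below (illustrative only, exponential in n for the upper bound) contrasts with the quadratic-time and O(n)-update algorithms the paper is about; names are hypothetical.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def variance_bounds(lo, hi):
    """Bounds on the variance when each x_i is only known to lie in [lo_i, hi_i].

    Upper bound: variance is convex, so its maximum over the box is at a
    vertex (endpoint enumeration; exponential, for illustration only).
    Lower bound: convex minimization over the box, so a local solver suffices.
    """
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    var = lambda x: np.var(np.asarray(x))        # population variance
    v_max = max(var(v) for v in itertools.product(*zip(lo, hi)))
    x0 = (lo + hi) / 2.0                         # start at interval midpoints
    res = minimize(var, x0, bounds=list(zip(lo, hi)))
    return res.fun, v_max
```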
Article
Genetic fuzzy systems (GFS) are based on the use of genetic algorithms for designing fuzzy systems, and for providing them with learning and adaptation capabilities. In this context, fuzzy sets represent linguistic granules of information, contained in the antecedents and consequents of the rules, whereas the data used in the genetic learning is assumed to be crisp. GFS seldom deal with fuzzy-valued data. In this paper we address this problem, and propose a set of techniques that can be incorporated into different GFS in order to learn a knowledge base (KB) from interval and fuzzy data for regression problems. Details will be given about the representation of non-standard data with fuzzy sets, about the needed changes in the reasoning method of the fuzzy rule-based system, and also about a new generalization of the mean squared error to vague data. In addition, we will show that the learning process requires a genetic algorithm that must be capable of optimizing a multicriteria fitness function, containing both crisp and interval-valued criteria. Lastly, we benchmark our procedures with some machine learning related datasets and a real-world problem of marketing, and the techniques proposed here are shown to improve the generalization properties of other KBs obtained from crisp training data.
Article
Conventional fuzzy regression using possibilistic concepts allows the identification of models from uncertain data sets. However, some limitations still exist. This paper deals with a revisited approach to possibilistic fuzzy regression methods. Indeed, a new modified fuzzy linear model form is introduced, where the identified model output can envelop all the observed data and ensure a total inclusion property. Moreover, this model output can have any kind of spread tendency. In this framework, the identification problem is reformulated according to a new criterion that assesses the model fuzziness independently of the collected data distribution. The potential of the proposed method with regard to the conventional approach is illustrated by simulation examples.
Article
Several models for simple least-squares fitting of fuzzy-valued data are developed. Criteria are given for when fuzzy data sets can be fitted to the models, and analogues of the normal equations are derived.
Article
Since fuzzy data can be regarded as possibility distributions, fuzzy data analysis by possibilistic linear models is proposed in this paper. Possibilistic linear systems are defined by the extension principle. Fuzzy parameter estimation is discussed in possibilistic linear systems, and possibilistic linear models are employed for fuzzy data analysis with non-fuzzy inputs and fuzzy outputs defined by fuzzy numbers. The estimated possibilistic linear system can be obtained by solving a linear programming problem. This approach can be regarded as fuzzy interval analysis.
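The linear programming formulation mentioned at the end can be sketched concretely. Below is a minimal, assumption-laden version of Tanaka-style possibilistic regression with symmetric triangular coefficients and crisp data; `h` is the inclusion level and all names are illustrative, not the paper's notation.

```python
import numpy as np
from scipy.optimize import linprog

def possibilistic_regression(X, y, h=0.0):
    """Fit y_i ~ a.x_i with triangular fuzzy coefficients (a_j, c_j), c_j >= 0.

    Minimizes the total spread sum_i c.|x_i| subject to each y_i being covered:
        a.x_i - (1-h) c.|x_i| <= y_i <= a.x_i + (1-h) c.|x_i|.
    X should include a column of ones if an intercept is wanted.
    """
    n, p = X.shape
    A = np.abs(X)
    # decision variables z = [a (free), c (>= 0)]
    cost = np.concatenate([np.zeros(p), A.sum(axis=0)])
    A_ub = np.vstack([np.hstack([-X, -(1 - h) * A]),    # coverage from above
                      np.hstack([ X, -(1 - h) * A])])   # coverage from below
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * p + [(0, None)] * p
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:p], res.x[p:]                         # centers a, spreads c
```

The fitted fuzzy output for a new x is then the triangular number with center a.x and spread c.|x|, which by construction envelops all training observations at level h.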
Article
A clustering method to group independent fuzzy random variables observed on a sample by focusing on their expected values is developed. The procedure is iterative and based on the p-value of a multi-sample bootstrap test. Thus, it simultaneously takes into account fuzziness and stochastic variability. Moreover, an objective stopping criterion leading to statistically equal groups different from each other is provided. Some simulations to show the performance of this inferential approach are included. The results are illustrated by means of a case study.
Article
A method is proposed for estimating the parameters in a parametric statistical model when the observations are fuzzy and are assumed to be related to underlying crisp realizations of a random sample. This method is based on maximizing the observed-data likelihood defined as the probability of the fuzzy data. It is shown that the EM algorithm may be used for that purpose, which makes it possible to solve a wide range of statistical problems involving fuzzy data. This approach, called the fuzzy EM (FEM) method, is illustrated using three classical problems: normal mean and variance estimation from a fuzzy sample, multiple linear regression with crisp inputs and fuzzy outputs, and univariate finite normal mixture estimation from fuzzy data.
Article
In this study, we present a comprehensive comparative analysis of kernel-based fuzzy clustering and fuzzy clustering. Kernel-based clustering has emerged as an interesting and quite visible alternative in fuzzy clustering; however, the effectiveness of this extension vis-à-vis some generic methods of fuzzy clustering has neither been discussed in a complete manner nor has the performance of the clustering been quantified through a convincing comparative analysis. Our focal objective is to understand the performance gains and the importance of parameter selection for kernelized fuzzy clustering. Generic Fuzzy C-Means (FCM) and Gustafson–Kessel (GK) FCM are compared with two typical generalizations of kernel-based fuzzy clustering: one with prototypes located in the feature space (KFCM-F) and the other where the prototypes are distributed in the kernel space (KFCM-K). Both generalizations are studied when dealing with the Gaussian kernel, while KFCM-K is also studied with the polynomial kernel. Two criteria are used in evaluating the performance of the clustering method and the resulting clusters, namely classification rate and reconstruction error. Through carefully selected experiments involving synthetic and Machine Learning repository (http://archive.ics.uci.edu/beta/) data sets, we demonstrate that the kernel-based FCM algorithms produce a marginal improvement over standard FCM and GK for most of the analyzed data sets. It has been observed that the kernel-based FCM algorithms are in a number of cases highly sensitive to the selection of specific values of the kernel parameters.
Article
Simple and multiple linear regression models are considered between variables whose “values” are convex compact random sets in $${\mathbb{R}^p}$$ , (that is, hypercubes, spheres, and so on). We analyze such models within a set-arithmetic approach. Contrary to what happens for random variables, the least squares optimal solutions for the basic affine transformation model do not produce suitable estimates for the linear regression model. First, we derive least squares estimators for the simple linear regression model and examine them from a theoretical perspective. Moreover, the multiple linear regression model is dealt with and a stepwise algorithm is developed in order to find the estimates in this case. The particular problem of the linear regression with interval-valued data is also considered and illustrated by means of a real-life example.
Article
In this paper we propose a robust fuzzy linear regression model based on the Least Median Squares–Weighted Least Squares (LMS–WLS) estimation procedure. The proposed model is general enough to deal with data contaminated by outliers due to measurement errors or extracted from highly skewed or heavy tailed distributions. We also define suitable goodness of fit indices useful to evaluate the performances of the proposed model. The effectiveness of our model in reducing the outliers influence is shown by using applicative examples, based both on simulated and real data, and by a simulation study.
Article
In this paper we develop a discussion on the mathematical formalization of the concept of fuzzy random variable. This discussion is mainly focused on finding an adequate notion of measurability to be coherent with the notions on the space these random elements take values.
Article
The aim of this paper is to cluster units (objects) described by interval-valued information by adopting an unsupervised neural network approach. By considering a suitable distance measure for interval data, self-organizing maps to deal with interval-valued data are suggested. The technique, called midpoint-radius self-organizing maps (MR-SOMs), recovers the underlying structure of interval-valued data by using both the midpoint (center) and the radius (a measure of the interval width) information. In order to show how the MR-SOMs method works, an illustrative application to telecommunication market segmentation is described.
Article
The Fuzzy k-Means clustering model (FkM) is a powerful tool for classifying objects into a set of k homogeneous clusters by means of the membership degrees of an object in a cluster. In FkM, for each object, the sum of the membership degrees in the clusters must be equal to one. Such a constraint may cause meaningless results, especially when noise is present. To avoid this drawback, it is possible to relax the constraint, leading to the so-called Possibilistic k-Means clustering model (PkM). In particular, attention is paid to the case in which the empirical information is affected by imprecision or vagueness. This is handled by means of LR fuzzy numbers. An FkM model for LR fuzzy data is firstly developed and a PkM model for the same type of data is then proposed. The results of a simulation experiment and of two applications to real world fuzzy data confirm the validity of both models, while providing indications as to some advantages connected with the use of the possibilistic approach.
Article
In this paper a robust fuzzy k-means clustering model for interval-valued data is introduced. The peculiarity of the proposed model is its capability to manage anomalous interval-valued data by reducing the effects of such outliers in the clustering model. In the interval case, the concept of anomalous data involves both the center and the width (the radius) of an interval. In order to show how our model works, the results of a simulation experiment and an application to real interval-valued data are discussed.
Article
A fuzzy clustering model for fuzzy data is proposed. The model is based on a ‘weighted’ dissimilarity measure for comparing pairs of fuzzy data, composed by two distances, the so-called center (mode) distance and spread distance. The peculiarity of the proposed fuzzy clustering model is the objective estimation, incorporated in the clustering procedure, of suitable weights concerning the distance measures of the center and the spreads of the fuzzy data. In this way, the model objectively tunes the influence of the two components of the fuzzy data (center and spreads) for computing the mode and spread centroids in the fuzzy partitioning process. In order to show the performance of the proposed clustering algorithm, a simulation study and two illustrative applications are discussed.