Article · PDF available

Fuzzy Partitioning of Quantitative Attribute Domains by a Cluster Goodness Index

Abstract and Figures

The problem of mining association rules for fuzzy quantitative items was introduced and an algorithm proposed in [7]. However, the algorithm assumes that fuzzy sets are given. In this paper we propose a method to find the fuzzy sets for each quantitative attribute in a database by using clustering techniques. We present a scheme for finding the optimal partitioning of a data set during the clustering process regardless of the clustering algorithm used. More specifically, we present an approach for evaluation of clustering partitions so as to find the best number of clusters for each specific data set. This is based on a goodness index, which assesses the most compact and well-separated clusters. We use these clusters to classify each quantitative attribute into fuzzy sets and define their membership functions. These steps are combined into a concise algorithm for finding the fuzzy sets. Finally, we describe the results of using this approach to generate association rules from a real-life dataset. The results show that a higher number of interesting rules can be discovered, compared to partitioning the attribute values into equal-sized sets.
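The selection loop the abstract describes, clustering each quantitative attribute for several candidate cluster counts and keeping the count whose partition scores best, can be sketched as follows. Both the goodness index used here (mean within-cluster spread divided by minimum centroid separation, smaller is better) and the plain 1-D k-means routine are illustrative stand-ins, not the paper's exact definitions.

```python
def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means: returns centroids and their member values, sorted."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[idx].append(v)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    pairs = sorted(zip(centroids, groups))
    return [c for c, _ in pairs], [g for _, g in pairs]

def goodness(centroids, groups):
    """Compactness / separation ratio: lower = tighter, better-separated."""
    n = sum(len(g) for g in groups)
    spread = sum(abs(v - c) for c, g in zip(centroids, groups) for v in g)
    sep = min(b - a for a, b in zip(centroids, centroids[1:]))
    return (spread / n) / sep

def best_k(values, k_range=range(2, 10)):
    """Scan candidate cluster counts; keep the best-scoring partition."""
    scores = {}
    for k in k_range:
        centroids, groups = kmeans_1d(values, k)
        if any(not g for g in groups):   # degenerate partition: skip
            continue
        scores[k] = goodness(centroids, groups)
    return min(scores, key=scores.get)

ages = [18, 19, 20, 21, 22, 40, 41, 42, 43, 44, 65, 66, 67, 68, 70]
print(best_k(ages))  # → 3: the three natural age groups in this toy data
```

The same scan works with any clustering algorithm in place of `kmeans_1d`, which is the point the abstract makes: the evaluation step is independent of how the partitions are produced.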
[Figure: Goodness index vs. number of clusters (2-9) for the attribute Age.]

[Figure: Membership functions (low), (middle), (high) defined over [MinValue, MaxValue], with cluster radii r1, r2, r3 and boundary distances d1+, d2-, d2+, d3-.]

[Figure: Goodness index vs. number of clusters (2-9) for the attributes IncHead and IncFam.]

[Figure: (a) Average support and (b) number of frequent itemsets vs. minimum support (0.1-0.5), clustering method vs. quantile method.]

[Figure: (a) Average confidence and (b) number of interesting rules vs. minimum confidence (0.2-0.9), clustering method vs. quantile method.]
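One plausible way to turn sorted cluster centers into (low)/(middle)/(high) membership functions over [MinValue, MaxValue], as in the membership-function figure, is to shoulder the end sets (degree 1 out to the domain bounds) and make interior sets triangles that peak at their own center and fall to 0 at the neighboring centers. This is a sketch of that construction, not necessarily the paper's exact shapes.

```python
def make_membership(centers, min_value, max_value):
    """One membership function per sorted cluster center: shouldered end
    sets, triangular interior sets; adjacent degrees sum to 1 between
    centers."""
    centers = sorted(centers)

    def mu(i, x):
        c = centers[i]
        left = centers[i - 1] if i > 0 else min_value
        right = centers[i + 1] if i < len(centers) - 1 else max_value
        if x <= c:
            if i == 0:                       # leftmost set: shoulder
                return 1.0
            return max(0.0, (x - left) / (c - left))
        if i == len(centers) - 1:            # rightmost set: shoulder
            return 1.0
        return max(0.0, (right - x) / (right - c))

    return [lambda x, i=i: mu(i, x) for i in range(len(centers))]

# Hypothetical centers for an Age attribute:
low, mid, high = make_membership([20, 42, 67], 0, 100)
print(low(20), mid(31), high(42))  # 1.0 0.5 0.0
```

A value midway between two centers belongs to both sets with degree 0.5, which avoids the "sharp boundary" effect of crisp intervals.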
... While the other two streams (FFD/FDd and FPA) are attracting more and more attention (Huhtala & Karkkainen et al., 1998a, 1998b; Chen & Wei et al., 2001; Wei & Chen et al., 2002; Wang & Shen et al., 2002), the stream of fuzzy association rules (FAR) accounts for most of the existing efforts and continues to attract considerable attention from researchers and practitioners. The FAR research and applications center on issues of partitioning quantitative data domains, fuzzy taxonomies, FAR with linguistic hedges, fuzziness-related interestingness measures, and degrees of fuzzy implication, e.g., Lee & Hyung (1997), Kuok & Fu et al. (1998), Cai & Fu et al. (1998), Wei & Chen et al. (1999), Hong & Kuo (1999a, 1999b), Gyenesei (2000a, 2000b, 2001), Shu & Tsang et al. (2000), Dubois & Hullermeier et al. (2001), Ishibuchi & Nakashima et al. (2001), Hullermeier (2001a, 2001b), Bosc & Pivert (2001). ...
... Fuzzy sets defined on the domains are used to deal with the "sharp boundary" problem in partitioning (Wu, 1999; Mazlack, 2000; Chien & Lin et al., 2001; Gyenesei, 2001); such sets are usually expressed in the form of labels or linguistic terms. For example, for the attribute Age, fuzzy sets such as Young, Middle and Old may be defined on its domain U_Age. ...
... presented in section 3.2. Subsequently, with these extended measures incorporated, several mining algorithms have been proposed as extensions of the conventional one, such as the method by Lee & Hyung (1997), the FTDA method by Kuok & Fu et al. (1998), the algorithm by Hong & Kuo (1999a, 1999b), the fuzzy extensions by Gyenesei (2000a, 2001), the SQL-based fuzzy extended method by Shu & Tsang et al. (2001), and the work by Chan & Au (2001). ...
Chapter
Full-text available
Associations reflect relationships among items in databases, and have been widely studied in the fields of knowledge discovery and data mining. Recent years have witnessed many efforts on discovering fuzzy associations, aimed at coping with fuzziness in knowledge representation and decision support processes. This chapter focuses on associations of three kinds: association rules, functional dependencies and pattern associations. Accordingly, it overviews major fuzzy logic extensions. Primary attention is paid (1) to fuzzy association rules in dealing with partitioning quantitative data domains, crisp taxonomic belongings, and linguistically modified rules, (2) to various fuzzy mining measures from different perspectives such as interestingness, statistics and logic implication, (3) to fuzzy/partially satisfied functional dependencies for handling data closeness and noise tolerance, and (4) to time-series data patterns that are associated with partial degrees.
... This method also belongs to the unsupervised methods of fuzzy clustering, in which the optimum number of clusters is determined by repeatedly running the clustering algorithm with different numbers of clusters. The optimum number of clusters is obtained as the optimum point of a well-defined validity index (Gyenesei, 2000). Ansari et al. (2009) used the GG algorithm for clustering the seismic catalog of Iran. ...
Article
Determination of seismic sources is the first step in probabilistic seismic hazard analysis (PSHA); however, this step, especially in low-seismicity regions, is often controversial. In the conventional PSHA procedure, determination of seismic sources is merely based on the subjective judgments of experts, and in many cases there are great differences among proposed seismic models in a specific region. As a result, one important source of uncertainty in PSHA is the determination of seismic sources. In this article, by combining fuzzy clustering analysis and Monte Carlo simulation, an objective method for determination and probabilistic modeling of seismic sources is presented. By clustering the spatial locations of earthquakes, it is possible to specify the extent of each seismic source in an objective way. A cluster quality index is used to identify the optimum number of clusters. The density and spread of events in each cluster determine the geometrical shape of the seismic sources. Moreover, in this article a method is proposed to construct spatial probability density functions (PDFs) of earthquake locations based on the results of fuzzy clustering analysis. The spatial PDF of earthquakes can be used for the generation of synthetic events in Monte Carlo simulation. The Azarbaijan region, with its varied seismotectonics and generally high seismicity, is used as an important area of seismicity in which to develop and demonstrate the application and capability of fuzzy clustering analysis in specifying seismic sources. The PSHA is performed for the city of Tabriz, and a comprehensive comparison is made between the results of conventional PSHA, ordinary Monte Carlo hazard analysis, and the proposed method. The results indicate there is an objective relationship between observed seismicity and seismotectonic evidence in the region.
Moreover, the distribution of synthetic events is highly correlated with the observed seismicity, seismotectonic, and geological information of the region.
... The statistical measures were then verified. Our interpretation of these results is that fuzzy subsets built from the linguist's knowledge of everyday vocabulary are not suited to the composition of brand names. It would therefore be interesting to conduct new analyses in which the fuzzy subsets are built automatically through fuzzy partitioning, as presented in [FWS + 98] and [Gye00b]. Secondly, our algorithms could also be used to extract trends in the evolution of brand-name composition, in order to refine the results obtained with the method of [LAS97] for discovering trends in textual databases. These analyses should then be deepened using temporal constraints of the kind presented in the next part of this thesis. Discussion: In this part, we presented a complete and efficient approach for extracting fuzzy sequential patterns, enabling the processing of numerical data sequences such as demographic data or sensor readings, whereas existing algorithms could extract only part of the information available in quantitative databases. ...
Article
The large amount of data stored in many areas, as well as the diversity of their formats and origins, makes manual analysis or knowledge discovery impossible. For this reason, various communities have been interested for several years in the design and implementation of tools that can automatically extract knowledge from such large databases. Nowadays this work aims at handling heterogeneity of data, format and quality. Our own work is part of this research axis. More particularly, we consider the context of frequent pattern discovery from data ordered as sequences. Until now, such patterns, called sequential patterns, could be extracted only from sequence databases containing symbolic and perfect data, i.e. databases consisting of binary information, or data that can be processed as binary, with no missing values. We therefore propose several improvements to frequent sequence discovery techniques in order to take into account heterogeneous, incomplete or uncertain data, while minimizing possible information loss. Thus, the work described in this thesis consists of the implementation of a global framework for fuzzy sequential pattern discovery in numerical quantitative data, the definition of soft temporal constraints allowing flexibility for the user and ranking of discovered patterns, and finally the implementation of two approaches for sequential pattern discovery from incomplete data.
... So one must resort to trying every possible K value (which is often computationally infeasible), or to guessing. For example, a researcher might: for each K-means analysis, guess what K to supply as an input parameter; after each hierarchical clustering, guess where to cut a dendrogram to determine K; and, for either technique, guess which of several, often-conflicting goodness measures yields the "best" K (Jain and Dubes 1988, Hartigan 1975, Kaufman and Rousseeuw 1990, Dunn 1974, Gyenesei 2000, Milligan et al. 1983, Halkidi et al. 2000, Turenne 2000). In the model-based framework, one hypothesizes a mixture of underlying probability distributions generating the data, with each mixture component representing a different cluster (Dubes 1987, Fraley and Raftery 1998). ...
Article
Clustering can be a valuable tool for analyzing large datasets, such as in e-commerce applications. Anyone who clusters must choose how many item clusters, K, to report. Unfortunately, one must guess at K or some related parameter. Elsewhere we introduced a strongly supported heuristic, RSQRT, which predicts K as a function of the attribute or item count, depending on attribute scales. We conducted a second analysis in which we sought confirmation of the heuristic, analyzing data sets from the UCI machine learning benchmark repository. For the 25 studies where sufficient detail was available, we again found strong support. Also, in a side-by-side comparison of 28 studies, the RSQRT-predicted K and the Bayesian information criterion (BIC)-predicted K are the same. RSQRT has a lower cost of O(log log n), versus O(n^2) for BIC, and is more widely applicable. Using RSQRT prospectively could be much better than merely guessing.
... A solution to this problem is to run the clustering algorithm repeatedly with different numbers of clusters and different initial guesses of centroids, and then to compare the results with a well-defined validity index. This approach to clustering is usually referred to as 'unsupervised clustering' (Gyenesei, 2000). ...
Article
Identification and classification of different seismotectonic provinces with similar characteristics in a region of interest is one of the most important subjects in seismic hazard studies. This task is usually done through subjective interpretations based on geological and seismotectonic information. Seismic data is one of the most important sources of information, and visual inspection of this data is a traditional way of identifying seismotectonic provinces. Pattern recognition of historical and instrumental seismic data in a non-subjective way provides more robust results and is a more suitable tool for extracting useful knowledge from a huge amount of data. In this study, the applicability and usefulness of an unsupervised fuzzy clustering algorithm for identifying hidden patterns in the historical and instrumental seismic catalog of Iran is examined through a comparison between the results of such an analysis and the proposed models for seismotectonic provinces of Iran. The clustering method used in this study is based on a fuzzy modification of maximum likelihood estimation and has the capability to detect elliptical clusters of variable size. Moreover, fuzzy hypervolume and partition density indexes are used as performance indexes for selecting the best number of clusters. The comparison between the results of the clustering analyses and the seismotectonic models of Iran reveals that it is possible to partition the spatially distributed epicenters of earthquake events into distinct units. These partition units, or clusters, are generally in good agreement with the proposed seismotectonic provinces of Iran and show major seismotectonic features of the Iranian Plateau in addition to some hidden information. This kind of analysis provides a mathematical basis for seismological interpretations of seismic activity.
Moreover, comparisons of the clustering results among historical data, the combination of historical and instrumental data, and major earthquakes with magnitude greater than 5.0 show that the best results are achieved by clustering the major events (i.e. Mw > 5.0).
... More generally, fuzzy subset elicitation methods [1,13] are techniques that are explicitly designed to provide fuzzy subsets describing the data. Some of them involve interaction with a human expert; others are based on partitioning methods [12,6,7]. Many belong to the parametric framework, i.e. consist in deciding on a desired form for the membership function, e.g. ...
Conference Paper
Full-text available
This paper considers the task of constructing fuzzy prototypes for numerical data in order to characterize the data subgroups obtained after a clustering step. The proposed solution is motivated by the aim of describing prototypes with a richer representation than point-based methods, and of providing a characterization of the groups that captures not only the common features of the data belonging to a group, but also their specificity. It transposes a method originally designed for fuzzy data to numerical data, based on a prior computation of typicality degrees defined according to concepts used in cognitive science and psychology. The paper discusses the construction of prototypes and how their desired semantics and properties can guide the selection of the various operators involved in the construction process.
Article
Full-text available
The article presents a method of forming associative rules from the database of a SIEM system for detecting cyber incidents, based on the theory of fuzzy sets and methods of data mining. On the basis of the conducted analysis, a conclusion was made about the expediency of detecting cyber incidents in special information and communication systems (SICS) by applying rule-oriented methods. The necessity of applying data mining technologies, in particular methods of forming associative rules to supplement the knowledge base (KB) of the SIEM system with the aim of improving its characteristics in the process of detecting cyber incidents, is substantiated. For the effective application of cyber incident detection models built on the basis of the theory of fuzzy sets, the use of fuzzy associative rule search methods is proposed, which allow processing heterogeneous data about cyber incidents and are transparent for perception. The mathematical apparatus for forming fuzzy associative rules is considered and examples of its application are given. In order to increase the effectiveness of methods for searching for fuzzy associative rules in the SIEM database, it is proposed to use weighting coefficients of attributes that characterize the degree of their importance in a fuzzy rule. A formal statement of the problem of forming fuzzy associative rules with weighted attributes, used for the identification of cyber incidents, is given. A scheme of their formation and application for the identification of cyber incidents is proposed. The method of forming fuzzy associative rules with weighted attributes from the SIEM database is given. The problem of determining the weighting coefficients of the relative importance of SIEM system DB attributes is formulated, and a method for its solution is proposed.
The statement of the problem of finding sets of elements that have a weighted fuzzy support no less than a given threshold, and that are used to form fuzzy associative rules with weighted attributes, is given. Methods for its solution are proposed.
Article
Full-text available
The article presents a method for forming fuzzy associative rules with weighted attributes from the database (DB) of a SIEM system, in order to replenish its knowledge base (KB) for more effective detection of cyber incidents that arise during the operation of special information and communication systems (SICS). Problems that reduce the effectiveness of existing methods for forming associative rules from the analysis of information stored in cyber-defense system databases are considered. Publications devoted to methods that attempt to eliminate these problems are analyzed. The main idea for eliminating the shortcomings inherent in known methods is formulated: finding a compromise between reducing the running time of the computational algorithm that implements the method in practice and reducing the information loss resulting from its operation. An improved method for mining associative rules from SIEM system databases is proposed, based on the theory of fuzzy sets and linguistic terms. The problem of finding fuzzy associative rules with weighted attributes is formulated. The mathematical apparatus underlying the implementation of the method is described. An algorithm is proposed for finding frequent itemsets that include the values of cyber-incident features and the classes to which they belong, implementing the first stage of the proposed method. The structure of the test datasets used for training and testing cyber-defense systems is analyzed, and based on the results a conclusion is drawn about the possibility of improving the considered algorithm. A graphical illustration of the idea of improving the frequent-itemset search algorithm is given, and the essence of the improvement is described. An improved frequent-itemset search algorithm for the considered method is proposed, and its main advantages are listed.
Article
This paper presents a formal definition of stable peers, a novel method to separate stable peers from all peers, and an analysis of the session sequences of stable peers in P2P (peer-to-peer) systems. This study uses KAD, a P2P file-sharing system with several million simultaneous users, as an example and draws some significant conclusions: (1) large numbers of peers with very short session times usually have few sessions; (2) stable peers make up about 0.6% of all peers; (3) 70% of stable peers have very long total session times built up from a large number of sessions, with large differences between session times; (4) the remaining 30% of stable peers, whose average session time is 1.8 times that of the former, have long total session times, a small number of sessions and high availability. We believe that these two types of stable peers can be used for different functions to solve the churn problem in hierarchical P2P systems.
Article
Full-text available
Associations, as specific forms of knowledge, reflect relationships among items in databases, and have been widely studied in the fields of knowledge discovery and data mining. Recent years have witnessed many efforts on discovering fuzzy associations, aimed at coping with fuzziness in knowledge representation and decision support processes. This paper focuses on associations of three kinds, namely, association rules, functional dependencies and pattern associations, and overviews major fuzzy logic extensions accordingly.
Conference Paper
Full-text available
During the last ten years, data mining, also known as knowledge discovery in databases, has established its position as a prominent and important research area. Mining association rules is one of the important research problems in data mining. Many algorithms have been proposed to find association rules in large databases containing both categorical and quantitative attributes. We generalize this to the case where some attributes are given weights to reflect their importance to the user. In this paper, we introduce the problem of mining weighted quantitative association rules based on a fuzzy approach. Using the fuzzy set concept, the discovered rules are more understandable to a human. We propose two different definitions of weighted support: with and without normalization. In the normalized case, a subset of a frequent itemset may not be frequent, and we cannot generate candidate k-itemsets simply from the frequent (k-1)-itemsets. We tackle this problem by using the concept of a z-potential frequent subset for each candidate itemset. We give an algorithm for mining such quantitative association rules. Finally, we describe the results of using this approach on a real-life dataset.
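The weighted fuzzy support described above can be sketched as follows. This is one plausible reading of the non-normalized definition, not the paper's exact formula: per-transaction membership degrees of the itemset's items are combined by product, averaged over all transactions, and scaled by the mean weight of the items involved. The attribute names and weights in the example are hypothetical.

```python
def weighted_support(transactions, itemset, weights):
    """transactions: list of dicts mapping fuzzy item -> degree in [0, 1].
    Combine degrees by product per transaction, average over transactions,
    then scale by the mean importance weight of the itemset's items."""
    total = 0.0
    for t in transactions:
        deg = 1.0
        for item in itemset:
            deg *= t.get(item, 0.0)   # absent item contributes degree 0
        total += deg
    mean_weight = sum(weights[i] for i in itemset) / len(itemset)
    return mean_weight * total / len(transactions)

# Hypothetical fuzzy transactions over Age/Income linguistic terms:
transactions = [
    {"age_young": 0.8, "income_low": 0.7},
    {"age_young": 0.4, "income_low": 0.9},
    {"age_old": 1.0, "income_high": 0.6},
]
weights = {"age_young": 0.9, "income_low": 0.5,
           "age_old": 0.9, "income_high": 1.0}
print(weighted_support(transactions, ["age_young", "income_low"], weights))  # ≈ 0.215
```

With weighting, a subset of a frequent itemset need not itself be frequent (a low-weight subset can score below a high-weight superset), which is exactly why the abstract's normalized case needs the z-potential frequent subset device during candidate generation.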
Article
Full-text available
This study reports on a method for carrying out fuzzy classification without a priori assumptions on the number of clusters in the data set. Assessment of cluster validity is based on performance measures using hypervolume and density criteria. An algorithm is derived from a combination of the fuzzy K-means algorithm and fuzzy maximum-likelihood estimation. The unsupervised fuzzy partition-optimal number of classes algorithm performs well in situations of large variability of cluster shapes, densities, and number of data points in each cluster. The algorithm was tested on different classes of simulated data, and on a real data set derived from sleep EEG signals.
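For scalar (one-dimensional) attributes, the hypervolume criterion mentioned above reduces to a sum of fuzzy standard deviations, since each cluster's fuzzy covariance matrix is 1×1. A minimal sketch under that simplification; the memberships and centers would come from whatever fuzzy clustering step precedes it.

```python
def fuzzy_hypervolume_1d(values, centers, memberships, m=2.0):
    """memberships[i][k]: degree of values[k] in cluster i.
    Returns the sum of fuzzy standard deviations over clusters;
    lower values indicate a more compact partition."""
    total = 0.0
    for i, c in enumerate(centers):
        num = sum((u ** m) * (x - c) ** 2
                  for u, x in zip(memberships[i], values))
        den = sum(u ** m for u in memberships[i])
        total += (num / den) ** 0.5     # fuzzy std deviation of cluster i
    return total

values = [0.0, 1.0, 10.0, 11.0]
centers = [0.5, 10.5]
u = [[1, 1, 0, 0], [0, 0, 1, 1]]        # crisp memberships for illustration
print(fuzzy_hypervolume_1d(values, centers, u))  # 1.0 (two std devs of 0.5)
```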
Article
In this paper a new cluster validity index is introduced, which assesses the average compactness and separation of fuzzy partitions generated by the fuzzy c-means algorithm. To compare the performance of this new index with a number of known validation indices, the fuzzy partitioning of two data sets was carried out. Our validation performed favorably in all studies, even in those where other validity indices failed to indicate the true number of clusters within each data set.
Article
Validation of fuzzy partitions induced through c-shells clustering is considered. The classical validity measures based on fuzzy partition alone are shown to be inadequate in capturing the shell sub-structure imposed by the shell clustering algorithm. Therefore, performance measures specifically designed for c-shells clustering are considered. Through examples, the new set of indices are shown to be capable of validating the structure characterized by the shell clustering algorithms. The issues related to classical cluster validity versus individual cluster validity are also discussed.
Article
The authors present a fuzzy validity criterion based on a validity function which identifies compact and separate fuzzy c-partitions without assumptions as to the number of substructures inherent in the data. This function depends on the data set, geometric distance measure, distance between cluster centroids and more importantly on the fuzzy partition generated by any fuzzy algorithm used. The function is mathematically justified via its relationship to a well-defined hard clustering validity function, the separation index for which the condition of uniqueness has already been established. The performance of this validity function compares favorably to that of several others. The application of this validity function to color image segmentation in a computer color vision system for recognition of IC wafer defects which are otherwise impossible to detect using gray-scale image processing is discussed
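The validity function described here matches the well-known Xie-Beni index: total membership-weighted within-cluster scatter divided by n times the minimum squared distance between cluster centroids, with lower values indicating compact, well-separated partitions. A compact scalar-data sketch:

```python
def xie_beni(values, centers, memberships, m=2.0):
    """Xie-Beni validity index for a fuzzy partition of scalar data.
    memberships[i][k]: degree of values[k] in cluster i.  Lower is better."""
    n = len(values)
    # membership-weighted within-cluster scatter
    scatter = sum((memberships[i][k] ** m) * (values[k] - c) ** 2
                  for i, c in enumerate(centers) for k in range(n))
    # minimum squared separation between any two centroids
    min_sep = min((a - b) ** 2
                  for i, a in enumerate(centers) for b in centers[:i])
    return scatter / (n * min_sep)

values = [0.0, 1.0, 10.0, 11.0]
centers = [0.5, 10.5]
u = [[1, 1, 0, 0], [0, 0, 1, 1]]        # crisp memberships for illustration
print(xie_beni(values, centers, u))     # 1.0 / (4 * 100) = 0.0025
```

Because the index depends on the fuzzy partition itself (not just the data and centroids), it can compare partitions produced by any fuzzy clustering algorithm, which is the property the abstract emphasizes.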