Chapter

Mining Frequent Patterns, Associations, and Correlations

Abstract

This chapter introduces the basic concepts of frequent patterns, associations, and correlations and studies how they can be mined efficiently. How to judge whether the patterns found are interesting is also discussed. Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set. For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. Moreover, it helps in data classification, clustering, and other data mining tasks. Thus, frequent pattern mining has become an important data mining task and a focused theme in data mining research. The discovery of frequent patterns, associations, and correlation relationships among huge amounts of data is useful in selective marketing, decision analysis, and business management. A popular area of application is market basket analysis, which studies customers' buying habits by searching for itemsets that are frequently purchased together (or in sequence).
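To make the support/confidence framework above concrete, the following is a minimal sketch of counting itemset support and rule confidence over a toy transaction database; the items, transactions, and thresholds are invented purely for illustration and are not taken from the chapter.

```python
from itertools import combinations

# Hypothetical market-basket transactions (each transaction is a set of items).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Estimated P(consequent | antecedent) over the transaction database."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

# Enumerate all 2-itemsets whose support meets a minimum threshold.
items = sorted(set().union(*transactions))
min_sup = 0.4
frequent_pairs = [
    pair for pair in combinations(items, 2) if support(pair, transactions) >= min_sup
]

print(frequent_pairs)
print(confidence({"milk"}, {"bread"}, transactions))  # rule: milk => bread
```

On this toy data, confidence(milk ⇒ bread) evaluates to 0.75, i.e. 75% of the transactions containing milk also contain bread, while support({milk, bread}) is 0.6.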

... By drawing from information technology, whose data mining techniques have been developed specifically to analyse large datasets, it is within Association Rules that one can find methods suitable for identifying combinations of species linked with specific ecosystems and environmental conditions. One such technique is the Market Basket Analysis, or MBA (Han and Kamber, 2006), which is mainly used in marketing to find associations between items in vast datasets, such as those created by sales from retail stores (e.g. Griva et al., 2018). ...
... Like S, a minimum threshold is also set up to define how likely the correspondence between species and habitats should be to make the rule interesting. The interestingness of a rule in MBA can be understood as its usefulness or novelty of the information provided (Han and Kamber, 2006). The thresholds for both parameters are arbitrary and can be adapted depending on the data and the amount of information required. ...
... This difference could be attributed to the approach to the generation of species combinations as indicators. The MBA Apriori algorithm uses a candidate generation, or bottom-up, approach by creating increasingly larger combinations (Agrawal and Srikant, 1994; Han and Kamber, 2006), which seems to be more efficient, and therefore more adequate for large amounts of data without the need to pre-select species or limit the number of species in a combination. The adaptation of the current indicspecies package to include such an approach to species combinations, particularly with the combinespecies function, could help reduce the computational load of this step. ...
Article
Ecological monitoring research relies heavily on signals to detect ecosystem changes, making the selection of indicators a crucial methodological requirement. Over the years, individual species and species assemblages have been widely used, thereby, giving rise to reference methods that support the detection of ecological indicators. One such method, the Indicator Value Analysis (IndVal), has been adapted to identify not only species but also combinations of species, assuming collective responses to environmental factors. However, the IndVal method requires a pre-selection of species before performing the analysis, especially in the case of large datasets (e.g. high species richness), when it becomes ineffective. Species pre-selection might introduce subjectivity and a bias into the database, which can cause possible impacts on the final set of indicators. To address these issues, the authors propose the use of Market Basket Analysis (MBA) – a data mining method – which is mathematically similar to IndVal but designed to handle large amounts of data. Both methods were applied to select indicators from gradually larger datasets of Soil Surface Dwelling Arthropods from the Brazilian Amazon, using threshold-dependent indices to assess concordance between results. In general, the results obtained by applying both methods were found to be similar, with an average Jaccard's distance of 0.432 (±0.346) and an average True Skill Statistic of 0.991 (±0.012). As expected, MBA was able to select ecological indicators without species pre-selection as well as from datasets where IndVal had been unsuccessful. In such cases, and by means of objective association rules, the authors demonstrate that MBA could be used to pre-select ecological indicators, which can then be further processed and summarized with the IndVal method. In this study, the authors briefly outline the potential of MBA to complement IndVal and discuss advantages and disadvantages of using MBA for ecological indicators (pre-) selection.
... Algorithms based on FIM and ARM analyse the buying habits of customers by finding associations between the items they place in "shopping baskets" and, using this knowledge of associations, retailers can then develop marketing strategies to enhance customer development. These algorithms remain popular today and formed the basis for many recent studies including [50], [57] and [62], and have dedicated chapters in popular data mining books including [29], [72] and [68]. ...
... Association Rule Mining (ARM), proposed in [3] and [4], is one of the most popular data mining techniques for MBA today thanks to the substantial growth in research and practical applications that leverage the mathematical framework and various algorithms for finding associations between independent items in a single transaction [68], [62]. Whilst considerable research has been conducted on ARM algorithms, the three most popular remain Apriori, ECLAT and FP-Growth, with Apriori still considered the benchmark and widely used [29], [62], [50]. A review of previous studies pertaining to these algorithms is provided in Section 2.3.1. ...
... The task of finding frequent itemsets is typically achieved by using an ARM algorithm; however, shortlisting the set of all frequent items to find the best itemsets to promote can be tricky and is context-dependent [72], [29], [9]. This is particularly true in large databases like those typically found in grocery retail [72], [29]. The shortlisting of frequent itemsets to find the best itemsets to promote cannot be achieved by FIM/ARM algorithms alone, and this prompted a need to include further analytical elements to achieve this task [50], [41]. ...
Article
Full-text available
Targeted promotions in retail are becoming increasingly popular, particularly in the UK grocery retail sector, where competition is stiff and consumers remain price sensitive. Given this, a targeted promotion algorithm is proposed to enhance the effectiveness of promotions by retailers. The algorithm leverages a mathematical model for optimising the items to target and fuzzy c-means clustering for finding the best customers to target. Tests using simulations with real-life consumer scanner panel data from the UK grocery retail sector show that the algorithm performs well in finding the best items and customers to target whilst eliminating "false positives" (targeting customers who do not buy a product) and reducing "false negatives" (not targeting customers who could buy). The algorithm also shows better performance when compared to a similar published framework, particularly in handling "false positives" and "false negatives". The paper concludes by discussing managerial and research implications, and highlights applications of the model to other fields.
... The purpose of these methods is to identify interesting connections between data sets. The established data linkages reveal frequently occurring common patterns within the data collection [30]. ...
... Furthermore, the evaluation includes the assessment of precision, which indicates the accuracy of predictions, and recall, which measures the model's ability to correctly identify relevant instances [30]. The primary objective is to optimize these models, ensuring that they closely correspond with the training data and enhance their predictive capabilities for practical use cases. ...
Article
Full-text available
The research addresses the escalating challenge of cyberbullying in the Philippines, a concern magnified by widespread social media use. A dataset of 146,661 tweets is analyzed using a pre-trained natural language processing model tailored to detect derogatory Filipino terms. The methodology is designed to preprocess data for clarity and analyze derogatory phrases, using 23 key terms that indicate cyberbullying. Through quantitative analysis, specific patterns of derogatory term co-occurrence are uncovered. The research specifically focuses on Filipino digital discourse, uncovering patterns of derogatory language usage, which is unique to this context. Combining data mining and machine learning techniques, including Frequent Pattern (FP)-growth for pattern identification, cosine similarity for phrase correlation, and a classification technique, the research achieves an accuracy rate of 97.91%. To assess the model's reliability and precision, a 10-fold cross-validation is utilized. Moreover, by examining specific tweets, the analysis highlights the alignment between automated classifications and human judgment. The co-occurrence of derogatory terms, identified through methods like FP-growth and cosine similarity, reveals underlying cyberbullying narratives that are not immediately obvious. This approach validates the high accuracy of the models and emphasizes the importance of a comprehensive framework for detecting cyberbullying in a linguistically and culturally specific context. The findings substantiate the effectiveness of the targeted approach, providing essential insights for developing cyberbullying prevention strategies. Furthermore, the research enriches the literature on digital discourse analysis and online harassment prevention by addressing cyberbullying patterns and behaviors. Importantly, the research offers valuable guidance for policymakers in crafting more effective online safety measures in the Philippines.
... The key difference between association rules mining and collaborative filtering is that in association rules mining we aim to find global or shared preferences across all users rather than finding an individual's preference like in collaborative filtering-based techniques [27]. ...
... Finally, the lift is a correlation measure used to discover and exclude the weak rules that have high confidence. Equation 7 shows that the lift measure is calculated by dividing the confidence by the unconditional probability of the consequent [27]. ...
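For reference, the lift measure described in these excerpts is conventionally defined as the rule's confidence divided by the unconditional probability of the consequent, which is equivalent to comparing the joint occurrence of A and B against what independence would predict (lift > 1 indicates positive correlation, lift < 1 negative correlation):

```latex
\mathrm{lift}(A \Rightarrow B)
  = \frac{\mathrm{confidence}(A \Rightarrow B)}{P(B)}
  = \frac{P(A \cup B)}{P(A)\,P(B)}
```

Here P(A ∪ B) follows the market-basket convention used elsewhere on this page, namely the probability that a transaction contains both A and B.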
Article
Full-text available
This paper introduces a frequent pattern mining framework for recommender systems (FPRS) - a novel approach to address the items' cold-start problem. This difficulty occurs when a new item hits the system, and properly handling such a situation is one of the key success factors of any deployment. The article proposes several strategies to combine collaborative and content-based filtering methods with frequent items mining and agglomerative clustering techniques to mitigate the cold-start problem in recommender systems. The experiments evaluated the developed methods against several quality metrics on three benchmark datasets. The conducted study confirmed the usefulness of FPRS in providing apt outcomes even for cold items. The presented solution can be integrated with many different approaches and further extended to make up a complete and standalone RS.
... It helps us to discover the associations among items using every distinct transaction in large databases. The key difference between association rules mining and collaborative filtering is that in association rules mining we aim to find global or shared preferences across all users rather than finding an individual's preference like in collaborative filtering-based techniques [59][60][61]. ...
... Finally, the lift is a correlation measure used to discover and exclude the weak rules that have high confidence. Equation 31 shows that the lift measure is calculated by dividing the confidence by the unconditional probability of the consequent [59,61]. ...
Chapter
Full-text available
Recommender systems play a key role in many branches of the digital economy. Their primary function is to select the most relevant services or products to users' preferences. The article presents selected recommender algorithms and their most popular taxonomy. We review the evaluation techniques and the most important challenges and limitations of the discussed methods. We also introduce Factorization Machines and Association Rules-based recommender system (FMAR) that addresses the problem of efficiency in generating recommendations while maintaining quality. Keywords: Recommendation systems, Collaborative filtering, Memory-based techniques, Model-based techniques, Content-based filtering, Hybrid filtering
... It helps us to discover the associations among items using every distinct transaction in large databases. The key difference between association rules mining and collaborative filtering is that in association rules mining we aim to find global or shared preferences across all users rather than finding an individual's preference like in collaborative filtering-based techniques [38], [39], [40]. ...
... Finally, the lift is a correlation measure used to discover and exclude the weak rules that have high confidence. Equation 7 shows that the lift measure is calculated by dividing the confidence by the unconditional probability of the consequent [40], [38], [42]. ...
... Frequent itemsets (FIs) can be mined from transaction databases through one of the traditional algorithms that can be generally grouped into two methods [3]: Apriori-based method, which is used for generating and filtering candidate itemsets such as Apriori algorithm [4], and tree-based method that is normally used for building FP-tree and then mining FIs from the FP-tree such as FP-Growth [5], TRR [6], PrePost+ [7], FIN [8], dFIN [9], and negFIN [10] algorithms. Since Apriori-based methods depend on continuous scanning of the database to generate multiple candidate itemsets, they require high I/O. ...
... Figure 4(b) shows P2, and Figure 4(c) shows the tree after phase 2. The outcome of phases 3-4 is P2 as Figure 4(d) shows. Phase 5: rules that have Conf ≥ 0.7 are generated as shown in Table 9. (Note) At T2, only 2 rules are interesting (rules 15 and 16, green color), 14 rules are not interesting, 12 rules (rules 3-14) undergo a change in Conf value (red color), and 2 rules (rules 1-2) experience no change in Conf value. ...
Article
Full-text available
Frequent itemset mining is the most important step of association rule mining. It plays a very important role in incremental data environments. The massive volume of data creates an imminent need to design incremental algorithms for maximal frequent itemset mining in order to handle incremental data over time. In this study, we propose an incremental maximal frequent itemset mining algorithm that integrates a subjective interestingness criterion during the process of mining. The proposed framework is designed to deal with incremental data, which usually come at different times. It extends the FP-Max algorithm, which is based on the FP-Growth method, by pushing interestingness measures during maximal frequent itemset mining, and performs dynamic and early pruning to discard uninteresting frequent itemsets in order to avoid uninteresting rule generation. The framework was implemented and tested on public databases, and the results found are promising.
... To investigate the effect of key behavioral factors on diabetes management, we employed two methods for analysis, namely, hypothesis testing [17], and association rule mining [18]. The purpose of hypothesis testing was to learn from the data whether statistical differences exist in diabetes management with respect to different categories of behavioral factors. ...
... non-interrupted sleep and lower glycemic variability), confidence, which describes the probability that event B occurred given event A, and lift, which describes the dependence and/or correlation between events A and B. When lift is greater than 1, then A and B are positively correlated and the occurrence of one implies the occurrence of the other. Per [18], the formulas for each metric are as follows: ...
Conference Paper
The prevalence of personal health data from wearable devices enables new opportunities to understand the impact of behavioral factors on health. Unlike consumer devices that are often auxiliary, such as Fitbit and Garmin, wearable medical devices like continuous glucose monitoring (CGM) devices and insulin pumps are becoming critical in diabetes care to minimize the occurrence of adverse glycemic events. Joint analysis of CGM and insulin pump data can provide unparalleled insights on how to modify treatment regimen to improve diabetes management outcomes. In this paper, we employ a data-driven approach to study the relationship between key behavioral factors and proximal diabetic management indicators. Our dataset includes an average of 161 days of time-matched CGM and insulin pump data from 34 subjects with Type 1 Diabetes (T1D). By employing hypothesis testing and association mining, we observe that smaller meals and insulin doses are associated with better glycemic outcomes compared to larger meals and insulin doses. Meanwhile, the occurrence of interrupted sleep is associated with poorer glycemic outcomes. This paper introduces a method for inferring disrupted sleep from wearable diabetes-device data and provides a baseline for future research on sleep quality and diabetes. This work also provides insights for development of decision-support tools for improving short- and long-term outcomes in diabetes care.
... Model One of the most popular approaches to data mining, and binary data classification in particular, is logistic regression (LR) [13]. Logistic regression is a popular statistical method for analyzing data with binary and proportional answer formats among academics and statisticians. ...
Article
Full-text available
Data mining is an effective method that uses sophisticated tools and strategies to sift through massive databases in search of useful patterns. Its usefulness extends to many fields, including medicine. We used AdaBoost (ADaB), Logistic Regression (LR), K-Nearest Neighbors (KNN), and Random Forest (RF) as predictive models in our investigation. Evaluation using methods like cross-validation and random sampling allowed us to concentrate on improving accuracy. The following medical datasets were used: the Heart Disease (HD) dataset, the Breast Cancer Wisconsin (BCW) dataset, and a Covid-19 dataset. We set out to improve the accuracy of disease predictions and push the boundaries of medical data analysis. After completing the assessment procedure with the training data, the performance measurements showed that the highest accuracy was achieved with LR (83%) for HD, KNN (97%) for BCW, and LR (98%) for Covid-19.
... Regression analysis is a statistical methodology that is most often used for numeric prediction, hence the two terms tend to be used synonymously [11]. This method visualizes the distribution trends of data. ...
Article
Full-text available
The rapid development of telecommunications services is increasingly attracting millions of users due to the convenience of interaction, promotion and communication. The abundance of daily transaction information has led to the creation of large data sources that are collected over time. This data source is a valuable resource for analyzing and understanding user habits and needs, devising a strategy to maintain and attract potential customers. Therefore, it is necessary to have a suitable system capable of collecting, storing and analyzing large datasets with efficient performance. In this article, we introduce Florus, a big data framework based on Lakehouse architecture, which can tackle these challenges. By applying this framework, we are able to propose an approach to analyzing customer behaviors in the telecommunication industry with a large dataset. Our work focuses on specific analysis of a huge volume of data presented in tables of different schemas, reflecting the business operation over time. Clustering based on the Bisecting K-Means algorithm will support the exploration of customer segments varying in density and complexity, and then characterize them into homogeneous groups to gain a better understanding of the market demand. Furthermore, the enterprise can forecast the revenue income at different levels, which can be applied to every customer. The work was tested with the Gradient Boosted Tree at the end of a data enriching and transformation pipeline. Overall, this work highlights the potential of Florus in supporting customer analysis experiments. Implementing the framework would significantly enhance our ability to conduct comprehensive analyses across the entire customer lifecycle.
... Kulczynski measure: The Kulczynski measure [33,34] is a metric proposed by the Polish mathematician S. Kulczynski. It calculates the average of the confidences of two itemsets A and B. ...
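A brief formal restatement of the averaged-confidence description above (standard definition):

```latex
\mathrm{Kulc}(A, B) = \frac{1}{2}\Bigl(P(A \mid B) + P(B \mid A)\Bigr)
                    = \frac{1}{2}\Bigl(\mathrm{conf}(B \Rightarrow A) + \mathrm{conf}(A \Rightarrow B)\Bigr)
```

Because it averages two conditional probabilities, the measure does not depend on the number of transactions that contain neither A nor B, which makes it useful on sparse transaction data.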
Article
Full-text available
Association rules mining is one of the most relevant techniques in data mining, aimed at identifying interesting connections and associations among groups of items or products within extensive transactional databases. However, this technique can yield too many rules, among which some are irrelevant and/or redundant, and may thus present obstacles for the decision-maker. This highlights the importance and challenge of evaluating extracted knowledge to define the most interesting association rules. In order to address this issue, we present a constraint programming approach to evaluate the relevance of association rules. Our MeAR-CP approach involves filtering association rules using metrics such as IR, Cosine, Lift and Kulc as constraints solved by the Choco constraint programming tools, and we propose our own metric called 'Score'. The experiments are conducted on various datasets from the UCI Machine Learning Repository. We evaluate both running time and the resulting rules. The results obtained from our experiments underscore the effectiveness of our approach in reducing irrelevant and redundant rules within an effective timeframe.
... A defining feature of the Apriori algorithm is the Apriori property (Han et al., 2012). ...
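For readers unfamiliar with the term, the Apriori property is the antimonotonicity of support: every nonempty subset of a frequent itemset must itself be frequent, so any candidate containing an infrequent subset can be pruned without counting. In symbols, for a minimum support threshold σ:

```latex
X \subseteq Y \;\Longrightarrow\; \mathrm{support}(Y) \le \mathrm{support}(X),
\qquad\text{hence}\qquad
\mathrm{support}(X) < \sigma \;\Longrightarrow\; \mathrm{support}(Y) < \sigma .
```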
Thesis
Full-text available
In recent years, the use of virtual assistants and voice user interfaces has become a latent part of modern living. Unseen to the user are the various artificial intelligence and natural language processing technologies, the vast datasets, and the linguistic insights that underpin such tools. The technologies supporting them have chiefly targeted widely used spoken languages, leaving sign language users at a disadvantage. One important reason why sign languages are unsupported by such tools is a requirement of the underpinning technologies for a comprehensive description of the language. Sign language processing technologies endeavour to bridge this technology inequality. Recent approaches to sign language processing have shifted to the domain of machine learning. The principal challenge facing this method is the comparatively small sign language corpora available for training machine learning models. Such corpora are typically 10,000 times smaller than their spoken language equivalents. This study produces a statistical model which may be used in future hybrid learning approaches for sign language processing tasks. In doing so, this research explores the emerging patterns of non-manual articulation concerning grammatical classes in Irish Sign Language (ISL). Specifically, this study focuses on head movement, body movement, eyebrows, eyegaze, eye aperture, and cheek movement, in relation to the grammatical classes listed in the Auslan corpus annotation guidelines. The experimental method applied here is a novel implementation of an association rules mining approach to a sign language dataset. This method is transferable to other corpus-based analyses of sign languages. The study analyses the articulation of various non-manual features across grammatical classes. The dataset, a subset of the Signs of Ireland (SOI) corpus, contains Non-Manual Feature (NMF) annotations and has been further annotated, as part of this study, to include grammatical class data across 2,989 signs. The dataset is further refactored and refined according to the knowledge discovery on data process before it is subjected to an association rules mining approach. Results from the exploratory analysis, and a lexical frequency analysis, provide new statistical insights related to the distribution of grammatical classes and of NMFs in ISL. Meanwhile, an association rules analysis identifies patterns between grammatical classes and various non-manual articulations. One such pattern discovery is the strong correlation between various NMFs and depicting verbs. Indeed, this study reports that the more lexicalised a sign is, the less likely it is to use NMFs. This study also reports on patterns discovered between non-manual articulators, and finally, patterns discovered for constructed actions. This research provides novel contributions to the field of sign language linguistics and sign language processing. Firstly, a contribution to the understanding of ISL at the lexical level through new statistical insights. Secondly, through a transferable and novel application of the association rules mining method to sign language corpus data. Thirdly, through the production of two assets: (1) a statistical model applicable to future machine learning approaches, and (2) supplementary annotations to the SOI corpus.
... In contrast to collaborative filtering, which aims to find the individual preferences of each user, frequent pattern mining seeks to discover global or shared preferences across all users [62]. Each association rule consists of an antecedent and a consequent, both of which are a list of items. ...
Article
Full-text available
Recommender systems (RS) are substantial for online shopping or digital content services. However, due to some data characteristics or insufficient historical data, they may encounter considerable difficulties impacting the quality of their recommendations. This study introduces the clustering-based frequent pattern mining framework for recommender systems (Clustering-based FPRS) - a novel RS constituting several recommendation strategies leveraging agglomerative clustering and FP-growth algorithms. The developed strategies combine the generated frequent itemsets with collaborative- and content-filtering methods to address the cold-start problem, which occurs whenever a new user or item enters the system. In such cases, the RS has limited information about the new user or object. Thus, the recommendations may be inaccurate. The experimental evaluation on several benchmark datasets showed that Clustering-based FPRS is superior to the state of the art and could effectively alleviate the cold-start problem.
... Basket analysis is one of the key analytical tasks in retail business (Hossain, Sattar, & Paul, 2019;Long & Zhu, 2012), and its purpose is to discover associations between product items contained in a large number of customer shopping transactions (Agrawal, Imielinski, & Swami, 1993;Han, Kamber, & Pei, 2012). The patterns or rules of associations can be used in various applications (Saputra, Rahayu, & Hariguna, 2023), e.g., product placement, product sales, product recommendation, inventory control, ...
Article
Basket analysis is a prevailing technique to help retailers uncover patterns and associations of sold products in customer shopping transactions. However, as the size of transaction databases grows, the traditional basket analysis techniques and systems become less effective because of two issues in the applications of the big data age: data scalability and flexibility to adapt different application tasks. This paper proposes a scalable distributed frequent itemset mining (ScaDistFIM) algorithm for basket analysis on big transaction data to solve these two problems. ScaDistFIM is performed in two stages. The first stage uses the FP-Growth algorithm to compute the local frequent itemsets from each random subset of the distributed transaction dataset, and all random subsets are computed in parallel. The second stage uses an approximation method to aggregate all local frequent itemsets to the final approximate set of frequent itemsets where the support values of the frequent itemsets are estimated. We further elaborate on implementing the ScaDistFIM algorithm and a flexible basket analysis system using Spark SQL queries to demonstrate the system’s flexibility in real applications. The experiment results on synthetic and real-world transaction datasets demonstrate that compared to the Spark FP-Growth algorithm, the ScaDistFIM algorithm can achieve time savings of at least 90% while ensuring nearly 100% accuracy. Hence, the ScaDistFIM algorithm exhibits superior scalability. On dataset GenD with 1 billion records, the ScaDistFIM algorithm requires only 360 s to achieve 100% precision and recall. In contrast, due to memory limitations, Spark FP-Growth cannot complete the computation task.
... Effective decision-making comes from drawing conclusions based on accurate data and facts [8]. Data mining is a method used to process or discover information within a collection of data [9]. Data mining involves searching for desired trends or patterns in large databases to facilitate future decision-making [10]. ...
Article
The sales of pajama products on i_docraft have not yet leveraged data mining algorithms to analyze transactional data for optimizing sales. To avoid underperforming pajama models and determine which pajama models sell well, the utilization of the Apriori algorithm is necessary. The Apriori algorithm can discern these patterns based on transactional data. This study conducts a transactional data analysis using data mining with the Apriori algorithm. By employing this algorithm, the most frequently sold pajama products can be identified, allowing for prioritization of these models and the development of marketing strategies for other types of pajamas based on a comparison of their strengths and commonly high sales figures. The processed data yields association rules for concurrently sold pajama items. Based on the results of the final association rules meeting both predetermined minimum support and confidence criteria, for instance, if a product with item code 7 (Cherrypie Nightdress) is purchased, then a product with item code 17 (3 in 1 Lotso Set) will likely be bought, with a support value of 22.58% and a confidence value of 100%.
... The Apriori algorithm initially generates frequent itemsets without proceeding with the exploration of the numerical space made up of all candidates and subsequently derives strict rules. The fundamental theorem of the Apriori algorithm is as follows: if a set of objects is frequent, all its subsets are also frequent (Han et al., 2012). ...
... Apriori (Agrawal et al., 1993) and FP-Growth (Han et al., 2000) are two algorithms commonly implemented in software programs to extract frequent itemsets. For example, {m1, m2} is a frequent itemset in the tabular context of Table 2, having a frequency of 4. To discover the interesting frequent pattern of attributes, two measures of support and confidence are defined to evaluate the strength of association rules (Han et al., 2012). ...
Article
Volunteered geographic information (VGI) provides geometric and descriptive sources of geospatial data. VGI exchange, reuse, and integration are serious challenges due to the subjective contribution process, lack of organization, and redundancy. This study aims to enhance the quality of VGI semantic data by presenting a new approach to integrating and formalizing the VGI semantic knowledge using formal concept analysis. The proposed approach is assessed using the building tags in OpenStreetMap (OSM) and CityGML. The alignment process discovers the conceptual overlap between the categories of Amenity (Others), Office, and Man‐Made in Map Features (OSM) and Business and Trade, Recreation, Sport, and Industry in AbstractBuilding (CityGML). The k‐means clustering of the results illustrated that class, usage/function, address, wheelchair, and website/wikidata/wikipedia are significant attributes to describe building categories. Moreover, results showed that the analysis of frequent itemsets and cluster characteristics provides significant information about custom tags in OSM's editing tools.
... However, this approach of relying alone on HE limits the capability of the model to achieve diversity and maximum predictive accuracy. No single classification algorithm is considered optimal for all cases, and only by combining various single classifiers can classification performance be improved [20], [21]. The main goal of ensembles is to improve generalization and diversity among the models to deal with dataset variance, and only HTE can achieve this better because it uses a diverse set of base classifiers [14], [22], [23]. ...
Article
Full-text available
Web-based learning technologies of educational institutions store a massive amount of interaction data which can be helpful to predict students' performance through the aid of machine learning algorithms. With this, various researchers focused on studying ensemble learning methods as it is known to improve the predictive accuracy of traditional classification algorithms. This study proposed an approach for enhancing the performance prediction of different single classification algorithms by using them as base classifiers of homogeneous ensembles (bagging and boosting) and heterogeneous ensembles (voting and stacking). The model utilized various single classifiers such as multilayer perceptron or neural networks (NN), random forest (RF), naïve Bayes (NB), J48, JRip, OneR, logistic regression (LR), k-nearest neighbor (KNN), and support vector machine (SVM) to determine the base classifiers of the ensembles. In addition, the study made use of the University of California Irvine (UCI) open-access student dataset to predict students' performance. The comparative analysis of the model's accuracy showed that the best-performing single classifier's accuracy increased further from 93.10% to 93.68% when used as a base classifier of a voting ensemble method. Moreover, results in this study showed that voting heterogeneous ensemble performed slightly better than bagging and boosting homogeneous ensemble methods.
... The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability, P(B|A). That is [7]: support(A⇒B) = P(A ∪ B) and confidence(A⇒B) = P(B|A). ...
Article
Full-text available
These days, language creates hindrances to realizing the advantages of the Information Technology revolution in India. There is therefore a need for adequate measures to perform natural language processing (NLP) through computer processing, so that computer-based systems can be used by people through a natural language like Hindi. This paper presents a new Word Sense Disambiguation method based on mining association rules. The method mines association rules between the sense of an ambiguous word and its context, using the word's various senses present in the Hindi WordNet. The sense of the ambiguous word is assigned by choosing the sense to which the most association rules refer. In this paper, we are concerned with association rule mining for word sense disambiguation for the Hindi language and with finding the correct sense for a given Hindi word. We explore the multiple meanings of a Hindi word with the help of the Hindi WordNet prepared by IIT Bombay. The experimental results show that the method has high precision. Keywords: Word Sense Disambiguation, WSD Approaches, Hindi WordNet, Mining association rules.

I. WORD SENSE DISAMBIGUATION. In natural language processing, word sense disambiguation (WSD) is the problem of determining which "sense" (meaning) of a word is activated by the use of the word in a particular context, a process which appears to be largely unconscious in people. WSD aims to make the computer automatically choose the correct meaning of each word, and it is still the biggest problem at the vocabulary level in NLP. Words can have different senses; some words have multiple meanings, which is called polysemy [1]. Word sense disambiguation (WSD) is thus the problem of determining in which sense a word having a number of distinct senses is used in a given sentence. For example, a sentence such as "महिला एक छाता के साथ आदमी को मारा" is ambiguous for humans (the woman hit the man using an umbrella, or she hit a man who was carrying an umbrella), and a sentence such as "आम आम आदमी की परिधि से बाहर है" is ambiguous for a computer in day-to-day communication (here आम has two meanings: the fruit mango and the common man) [2].

II. WSD APPROACHES. As in all natural language processing, there are two main approaches to WSD: deep approaches and shallow approaches. Deep approaches presume access to a comprehensive body of world knowledge, such as "दया एक सात्विक भावना है" (compassion is a virtuous feeling) or "दया भुवनेश्वर के पास से बहती है" (the Daya flows near Bhubaneswar); here दया is ambiguous between the senses "compassion" and "name of a river", and deep approaches use such knowledge to determine in which sense the word is used. These approaches are not very successful in practice, mainly because such a body of knowledge does not exist in a computer-readable format outside of very limited domains [3]. Shallow approaches do not try to understand the text. They just consider the surrounding words, using information such as: if दया has words like भावना or दुख nearby, it is probably used in the sense of "compassion"; if दया has words like बहती or भुवनेश्वर nearby, it is probably used in the sense of "river". These rules can be automatically derived by the computer using a training corpus of words tagged with their word senses. Our paper is based on shallow approaches; this kind of approach, while theoretically not as powerful as deep approaches, gives superior results in practice, due to the computer's limited world knowledge [4].

III. HINDI WORDNET. The Hindi WordNet is a system for bringing together different lexical and semantic relations between Hindi words.
... The aim of the present research was to introduce a new term frequency with a Gaussian technique (TF-G) and to compare its text classification accuracy with the accuracy of bag-of-words (BoW) [10], term frequency (TF), term frequencyinverse document frequency (TF-IDF) [2], and term frequency-inverse corpus document frequency (TF-ICF) [11]. The ML techniques compared in this study were decision tree (DT) [12], naïve Bayes (NB) [13], support vector machine (SVM) [14], random forest (RF) [15], and multilayer perceptron (MLP) [16][17] techniques to enhance the accuracy of text classification. The datasets in this study were derived from clinical notes in Thai from the outpatient department (OPD) of Khon Kaen Rajanagarindra Psychiatric Hospital, from Thai text of customer reviews for Burger King, Pizza Hut, and Sizzler restaurants [18], and from English text in the tweets of travelers that use US airline services [19]. ...
Article
Full-text available
This paper proposes a new term frequency with a Gaussian technique (TF-G) to classify the risk of suicide from Thai clinical notes and to perform sentiment analysis based on Thai customer reviews and English tweets of travelers that use US airline services. This research compared TF-G with term weighting techniques based on Thai text classification methods from previous researches, including the bag-of-words (BoW), term frequency (TF), term frequency-inverse document frequency (TF-IDF), and term frequency-inverse corpus document frequency (TF-ICF) techniques. Suicide risk classification and sentiment analysis were performed with the decision tree (DT), naïve Bayes (NB), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP) techniques. The experimental results showed that TF-G is appropriate for feature extraction to classify the risk of suicide and to analyze the sentiments of customer reviews and tweets of travelers. The TF-G technique was more accurate than BoW, TF, TF-IDF and TF-ICF for term weighting in Thai suicide risk classification, for term weighting in sentiment analysis of Thai customer reviews for Burger King, Pizza Hut, and Sizzler restaurants, and for the sentiment analysis of English tweets of travelers using US airline services.
... Various methods can be adopted for investigating the selection and arrangement of the identified attributes in the analysed texts. Following the frequent patterns theory proposed by Han et al. [2012], the selection can be modelled using the itemset pattern. This means that each identified attribute, along with its frequency in the texts, is detected. ...
Thesis
Full-text available
The computational analysis of argumentation strategies is substantial for many downstream applications. It is required for nearly all kinds of text synthesis, writing assistance, and dialogue-management tools. While various tasks have been tackled in the area of computational argumentation, such as argumentation mining and quality assessment, the task of the computational analysis of argumentation strategies in texts has so far been overlooked. This thesis principally approaches the analysis of the strategies manifested in the persuasive argumentative discourses that aim for persuasion as well as in the deliberative argumentative discourses that aim for consensus. To this end, the thesis presents a novel view of argumentation strategies for the above two goals. Based on this view, new models for pragmatic and stylistic argument attributes are proposed, new methods for the identification of the modelled attributes have been developed, and a new set of strategy principles in texts according to the identified attributes is presented and explored. Overall, the thesis contributes to the theory, data, method, and evaluation aspects of the analysis of argumentation strategies. The models, methods, and principles developed and explored in this thesis can be regarded as essential for promoting the applications mentioned above, among others.
... This is the same concept, which is borrowed from data mining [54]. We observed that an optimal n = 3 (or even n = 2) dimensionalities is sufficiently good to attain a high AUC (Figures 4 and 5) [55]. Scaling up the present methodology to higher order (multiple peaks analyses) is straightforward and readily applicable for more complex analyses as demonstrated in actual biological systems (e.g., bulk water, protein hydration layer [23,38]) (Figure 5). ...
Article
Full-text available
Low-field nuclear magnetic resonance (NMR) relaxometry is an attractive approach from point-of-care testing medical diagnosis to in situ oil-gas exploration. One of the problems, however, is the inherently long relaxation time of the (liquid) samples (and hence a low signal-to-noise ratio), which causes unnecessarily long repetition times. In this work, a new class of methodology is presented for rapid and accurate object classification using NMR relaxometry with the aid of machine learning. It is demonstrated that the sensitivity and specificity of the classification are substantially improved with a higher order of (pseudo-)dimensionality (e.g., 2D or multi-dimensional). This new methodology (termed 'Clustering NMR') may be extremely useful for rapid and accurate object classification (in less than a minute) using low-field NMR.
... The study applied the SVM algorithm to the datasets using techniques like One-vs.-All [55] and GridSearchCV to build an SVM classifier for multi-class classification efficiently. Figure 1 presents the approximate procedure used in the study. ...
Article
Full-text available
Although averting a seismic disturbance and its physical, social, and economic disruption is practically impossible, using the advancements in computational science and numerical modeling shall equip humanity to predict its severity, understand the outcomes, and equip for post-disaster management. Many buildings exist amidst the developed metropolitan areas, which are senile and still in service. These buildings were also designed before establishing national seismic codes or without the introduction of construction regulations. In that case, risk reduction is significant for developing alternatives and designing suitable models to enhance the existing structure’s performance. Such models will be able to classify risks and casualties related to possible earthquakes through emergency preparation. Thus, it is crucial to recognize structures that are susceptible to earthquake vibrations and need to be prioritized for retrofitting. However, each building’s behavior under seismic actions cannot be studied through performing structural analysis, as it might be unrealistic because of the rigorous computations, long period, and substantial expenditure. Therefore, it calls for a simple, reliable, and accurate process known as Rapid Visual Screening (RVS), which serves as a primary screening platform, including an optimum number of seismic parameters and predetermined performance damage conditions for structures. In this study, the damage classification technique was studied, and the efficacy of the Machine Learning (ML) method in damage prediction via a Support Vector Machine (SVM) model was explored. The ML model is trained and tested separately on damage data from four different earthquakes, namely Ecuador, Haiti, Nepal, and South Korea. Each dataset consists of varying numbers of input data and eight performance modifiers. Based on the study and the results, the ML model using SVM classifies the given input data into the belonging classes and accomplishes the performance on hazard safety evaluation of buildings.
... IoT is not a single technology; it can be the convergence of heterogeneous technologies associated with different engineering domains that will be used to connect all objects through the Internet for remote sensing and control [3]. The learning management system does not provide special tools for teachers to track and grade students as a whole, or all the activities carried out by students, to assess the structure and content of learning materials as well as concerns in the learning process [4][5][6]. ...
Article
Full-text available
The Internet of Things (IoT) as it is known today has grown rapidly; through its integration, every object can be connected via the internet. The sophistication of the IoT system also has an impact on the world of education, which currently still uses manual systems; the smart technology of IoT has changed the existing system. The purpose of this paper is to analyze and collect data about IoT that is currently developing at school. IoT can be used as a learning management system where web-based learning is the main focus. The method used is a literature review of journals related to IoT and web-based learning for smart school systems. The findings obtained are the integration of IoT with web-based learning, namely administrative services, service management, data analysis and learning services. These results can be used in education.
... The focus of our research in this paper is on dynamic association rule mining. Association rule mining [10,11] can help to find interesting relationships among data items based on the frequency of their co-occurrences and has been used in decision-making in various areas. Extended from association rule mining, dynamic association rule mining [12][13][14] provides more information about such relationships by considering temporal features of data. ...
Article
Full-text available
Dynamic rule mining can discover time-dependent association rules and provide more accurate descriptions about the relationship among items at different time periods and temporal granularities. However, users still face some challenges in analyzing and choosing reliable rules from the rules identified by algorithms, because of the large number of rules, the dynamic nature of rules across different time periods and granularities and the opacity of the relationship between rules and raw data. In this paper, we present our work on the development of DART, a visual analytics system for dynamic association rule mining, to help analysts gain a better understanding of rules and algorithms. DART allows users to explore rules at different time granularities (e.g., per hour, per day, per month, etc.) and with different time periods (e.g., daily, weekly, yearly, etc.), and to examine rules at multiple levels of detail, including investigating temporal patterns of a set of rules, comparing multiple rules, and evaluating a rule with raw data. Two case studies are used to show the functions and features of DART in analyzing business data and public safety data.
... We then search for (k + 1)-patterns and once all those frequent patterns are found we search for (k + 2)-patterns, and so on until all frequent patterns have been found. Classical examples of BFS methods can be Apriori-like algorithms first suggested in [5,6] and based on ideas from the so-called Apriori Principle [51]: "All nonempty subsets of a frequent itemset must also be frequent". BFS-like algorithms generate all k-patterns in each k-th iteration and move to the next k + 1 step only after exploring the entire k-th search space. ...
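As a rough illustration of the BFS (levelwise) search described above, the sketch below generates k-item candidates from frequent (k-1)-itemsets and prunes any candidate containing an infrequent subset before counting; the function and the example data are illustrative only, a simplified sketch rather than an optimized Apriori implementation.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_sup):
    """Levelwise (BFS) frequent-itemset search: generate k-item candidates from
    frequent (k-1)-itemsets, prune by the Apriori principle, then count supports."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c / n >= min_sup}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        # Count step: one scan of the database per level.
        frequent = {
            c for c in candidates
            if sum(c <= t for t in transactions) / n >= min_sup
        }
        all_frequent |= frequent
        k += 1
    return all_frequent

# Example: itemsets occurring in at least half of the (made-up) transactions.
print(apriori_frequent_itemsets(
    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 0.5))
```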
Article
Full-text available
Pattern mining is a powerful tool for analysing big datasets. Temporal datasets include time as an additional parameter. This leads to complexity in algorithmic formulation, and it can be challenging to process such data quickly and efficiently. In addition, errors or uncertainty can exist in the timestamps of data, for example in manually recorded health data. Sometimes we wish to find patterns only within a certain temporal range. In some cases real-time processing and decision-making may be desirable. All these issues increase algorithmic complexity, processing times and storage requirements. In addition, it may not be possible to store or process confidential data on public clusters or the cloud that can be accessed by many people. Hence it is desirable to optimise algorithms for standalone systems. In this paper we present an integrated approach which can be used to write efficient codes for pattern mining problems. The approach includes: (1) cleaning datasets with removal of infrequent events, (2) presenting a new scheme for time-series data storage, (3) exploiting the presence of prior information about a dataset when available, (4) utilising vectorisation and multicore parallelisation. We present two new algorithms, FARPAM (FAst Robust PAttern Mining) and FARPAMp (FARPAM with prior information about prior uncertainty, allowing faster searching). The algorithms are applicable to a wide range of temporal datasets. They implement a new formulation of the pattern searching function which reproduces and extends existing algorithms (such as SPAM and RobustSPAM), and allows for significantly faster calculation. The algorithms also include an option of temporal restrictions in patterns, which is available neither in SPAM nor in RobustSPAM. The searching algorithm is designed to be flexible for further possible extensions. The algorithms are coded in C++, and are highly optimised and parallelised for a modern standalone multicore workstation, thus avoiding security issues connected with transfers of confidential data onto clusters. FARPAM has been successfully tested on a publicly available weather dataset and on a confidential adult social care dataset, reproducing results obtained by previous algorithms in both cases. It has been profiled against the widely used SPAM algorithm (for sequential pattern mining) and RobustSPAM (developed for datasets with errors in time points). The algorithm outperforms SPAM by up to 20 times and RobustSPAM by up to 6000 times. In both cases the new algorithm has better scalability.
... Minimum support and confidence constraints, first introduced by [1], were born out of practical constraints, particularly in the retail sector where only the rules that passed a predefined level of interestingness were considered. However, others [5,10] noted that in some applications it is important to find all rules, including rare rules, and hence setting minimum thresholds is not always easy or applicable. ...
Article
Full-text available
The ability for grocery retailers to have a single view of customers across all their grocery purchases remains elusive and has become increasingly important in recent years (especially in the United Kingdom) where competition has intensified, shopping habits and demographics have changed and price sensitivity has increased following the 2008 recession. Numerous studies have been conducted on understanding independent items that are frequently bought together (association rule mining/frequent itemsets) with several measures proposed to aggregate item support and rule confidence with varying levels of accuracy as these measures are highly context dependent. Uninorms were used as an alternative measure to aggregate support and confidence in analysing market basket data using the UK grocery retail sector as a case study. Experiments were conducted on consumer panel data with the aim of comparing the uninorm against three other popular measures (Jaccard, Cosine and Conviction). It was found that the uninorm outperformed other models on its adherence to the fundamental monotonicity property of support in market basket analysis (MBA). Future work will include the extension of this analysis to provide a generalised model for market basket analysis.
... The bottom layer of the pyramid, the data layer, represents low level data, which can be stored into different, distributed, heterogeneous IT systems or even into so-called data lakes [9,10]. The low level data can then be processed and aggregated, for example, by applying data mining techniques [11], in order to generate information, which is represented through the second layer of the pyramid. This information describes interesting, previously unknown patterns in the data. ...
Article
Full-text available
Background/Objectives: The emergence of antimicrobial resistance (AMR) due to the misuse and overuse of antibiotics has become a critical threat to global public health. There is a dire need to forecast AMR to understand the underlying mechanisms of resistance for the development of effective interventions. This paper explores the capability of machine learning (ML) methods, particularly unsupervised learning methods, to enhance the understanding and prediction of AMR. It aims to determine the patterns from AMR gene data that are clinically relevant and, in public health, capable of informing strategies. Methods: We analyzed AMR gene data in the PanRes dataset by applying unsupervised learning techniques, namely K-means clustering and Principal Component Analysis (PCA). These techniques were applied to identify clusters based on gene length and distribution according to resistance class, offering insights into the resistance genes' structural and functional properties. Data preprocessing, such as filtering and normalization, was conducted prior to applying machine learning methods to ensure consistency and accuracy. Our methodology included the preprocessing of data and reduction of dimensionality to ensure that our models were both accurate and interpretable. Results: The unsupervised learning models highlighted distinct clusters of AMR genes, with significant patterns in gene length, including their associated resistance classes. Further dimensionality reduction by PCA allows for clearer visualizations of relationships among gene groupings. These patterns provide novel insights into the potential mechanisms of resistance, particularly the role of gene length in different resistance pathways. Conclusions: This study demonstrates the potential of ML, specifically unsupervised approaches, to enhance the understanding of AMR. The identified patterns in resistance genes could support clinical decision-making and inform public health interventions. However, challenges remain, particularly in integrating genomic data and ensuring model interpretability. Further research is needed to advance ML applications in AMR prediction and management.
Chapter
The evolution of telecommunications has led to a profound transformation in the realm of communication, revolutionizing how this industry mines customer behavior for business outcomes. The analysis of users' historical activities has become of paramount importance in driving strategic decision-making to enhance customer experiences and to recommend ways of attracting customers more effectively. While demand is growing, existing telecom data analytics either use small datasets or provide results only at a highly abstract level of analysis. When the number of customers increases significantly, it becomes impractical to customize services for each customer under the same approach. This paper provides a comprehensive examination of the challenges, needs, and solutions associated with the analysis of user data in the telecom domain. We focus on three key user data analysis problems: user clustering, user classification, and revenue prediction derived from user insights. With Florus, our proposed big data framework, we carried out telecom customer behavior analysis on a large dataset. The experimental results demonstrate promising performance and the framework's potential for long-term use.
Chapter
In this chapter, we address the problem of inducing chronicles from a sequence or a collection of temporal sequences. This important problem is motivated by the need to identify pieces of information that recur in temporal sequences. The chapter shows that a chronicle can be used to abstract the information contained in several sequences. We start by showing that a sequence can be represented by a chronicle (up to a time translation), bridging the space of sequences and the space of chronicles. Then, we address the problem of summarizing a collection of sequences by a unique chronicle. This abstraction of a collection of sequences benefits from results on the spaces of chronicles, and more precisely on the semi-lattice space of pairwise flush chronicles, which allows such an abstraction to be identified as a kind of least general generalisation of the sequences. Finally, we focus on mining frequent chronicles from a collection of sequences using the framework of formal concept analysis. Keywords: Temporal abstraction, Chronicle mining, Formal concept analysis
Chapter
An important phase in software development is effort prediction, which is significant for project planning, control, and budgeting. Many researchers have developed common software effort estimation models known as algorithmic models. These models need accurately estimated input parameters, namely lines of code and complexity, yet accurate estimation of these features during the initial phases of the software life cycle is quite difficult. This issue of algorithmic models can be handled with non-algorithmic models, which are based on soft computing techniques such as Genetic Programming, Fuzzy Sets, and Artificial Neural Networks (ANN). Many researchers have proposed models based on ANNs, but we did not find any estimation method focused on feature selection to remove the negative impact of irrelevant information. In this study, features with high information gain are selected to train the multilayer perceptron network FITNET. Experiments with two- and three-fold cross-validation on three benchmark datasets show that the New ANN (NANN) trained on the selected features makes more effective predictions than an ANN trained on all features. We compared performance using five metrics, MAR, MMRE, MdMRE, PRED(25), and MSE, to show that the approach performs better across different metrics.
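FITNET is a MATLAB function and the study's exact information-gain computation is not given, so the following Python sketch is only an analogue under stated substitutions: scikit-learn's mutual_info_regression stands in for information gain, MLPRegressor stands in for FITNET, the data are synthetic, and the number of selected features is arbitrary.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Placeholder effort-estimation data: rows = projects, columns = candidate features.
X = rng.normal(size=(60, 12))
y = X[:, 0] * 3 + X[:, 3] * 2 + rng.normal(scale=0.5, size=60)  # synthetic effort

# Score each feature by mutual information with the target (an information-gain
# analogue for continuous targets) and keep the top-k features.
mi = mutual_info_regression(X, y, random_state=0)
top_k = np.argsort(mi)[::-1][:4]
X_sel = X[:, top_k]

# Compare a network trained on all features vs. only the selected features.
for name, data in [("all features", X), ("selected features", X_sel)]:
    mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    scores = cross_val_score(mlp, data, y, cv=3, scoring="neg_mean_squared_error")
    print(f"{name}: mean MSE = {-scores.mean():.3f}")
```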
Article
Existing trajectory clustering algorithms use only position information in trajectory segmentation, which makes the selection of segment points unreliable. Meanwhile, the adopted distance metrics were originally designed to compare whole trajectories, which leads to inaccurate similarity between trajectory segments and hence causes risks in subsequent clustering. Moreover, the execution time of clustering increases significantly with the amount of data. To address these issues, the direction of velocity is introduced to improve the discrimination of segment points. Next, the shared nearest neighbor (SNN) similarity and the Trajectory-Hausdorff distance are combined to construct the similarity matrix, overcoming the limitations of existing distance measures. Then, based on an R-tree index strategy, neighboring trajectory segments are extracted and stored to speed up segment indexing. Finally, the Atlantic hurricane and elk datasets verify that the proposed algorithm can not only improve clustering efficiency but also extract the trajectory model accurately.
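The paper's Trajectory-Hausdorff distance and SNN similarity are more elaborate than what is shown here; the sketch below only computes the standard symmetric Hausdorff distance between two placeholder trajectory segments, as one building block from which such a similarity matrix could be assembled.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

# Two placeholder trajectory segments as (x, y) point sequences.
seg_a = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0], [3.0, 0.2]])
seg_b = np.array([[0.0, 1.0], [1.0, 1.1], [2.0, 0.9], [3.0, 1.0]])

def hausdorff(u, v):
    """Symmetric Hausdorff distance between two point sets:
    the larger of the two directed Hausdorff distances."""
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

print(f"Hausdorff distance between segments: {hausdorff(seg_a, seg_b):.3f}")
```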
Article
Full-text available
Several knowledge discovery techniques are used in the health care sector. Among these are association rules, which provide quick access to patterns. However, classic algorithms can generate many patterns or fail to identify rare cases that are relevant to healthcare professionals. This study identified asymmetric associative patterns in health-related data using the Health Association Rules (HAR) algorithm. We used a combined strategy of six metrics, with filtering, selection, and contradiction-elimination steps, to find patterns and identify possible rare cases. The proposed solution used adjustment mechanisms to increase the quality of the patterns with the knowledge of health professionals. A survey of 597 studies identified the primary needs and problems of associative patterns in the health care context. The HAR algorithm identified the characteristics with the strongest cause-and-effect relationships. The experiments were carried out on 13 datasets, where we identified the most pertinent patterns for the datasets without losing relevant knowledge.
Article
Electric power distribution systems face outages that prevent them from serving customers. Short-term outages are known as momentary outages, and their causes are not usually recorded in the outage dataset. However, frequent occurrences of momentary outages may lead to a long-term permanent outage, which can significantly reduce system reliability. Unlike previous works, which focused on diagnosing and predicting permanent outages, this paper proposes data-mining-based approaches to identify the most probable causes of momentary outages. To achieve this goal, the outage dataset, sub-transmission substation load data, and historical weather data are processed and integrated. Then, association rules that describe the antecedents leading to the different causes of permanent and momentary outages are derived using the Apriori algorithm. The frequent itemsets of momentary outages are also obtained. Based on the momentary outage rules and frequent itemsets, two procedures are proposed to find similarities between permanent and momentary outages and thereby identify the most probable causes of momentary outages. Knowing the cause of momentary outages, the operator can reduce the probability of permanent outages occurring. Results of applying the proposed approaches to real data from a test distribution system show that the expected energy not supplied of the distribution system can be decreased by more than 18%.
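The outage, load, and weather features used in the paper are not reproduced here; the sketch below only shows how Apriori-style rule mining of the kind described could look, assuming the mlxtend library and a few illustrative, invented condition labels.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Placeholder outage records: each record lists the conditions observed
# alongside an outage (feature names are illustrative only).
records = [
    ["high_wind", "rain", "tree_contact"],
    ["high_wind", "tree_contact"],
    ["heat_wave", "overload"],
    ["high_wind", "rain", "tree_contact"],
    ["heat_wave", "overload", "equipment_age_high"],
]

# One-hot encode the transactions for the Apriori implementation.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(records), columns=te.columns_)

# Mine frequent itemsets and derive rules above a confidence threshold.
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```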
Article
Many architects encounter problems when adopting green building technologies and are unfamiliar with the benefits of green buildings. This study conducted two-stage data mining on 354 green building projects in Taiwan in order to address issues in the preliminary design phase of projects, such as technology adoption, green building grades, and construction costs. The study used association rules to explore the associations between different types and grades of green buildings and the technologies they adopt. Moreover, a prediction model based on an artificial neural network was constructed to predict the grades and costs of green building projects. The results indicate that different types and grades of green buildings rely on different green building technologies. In particular, a high-grade green building places more emphasis on technologies for air conditioning, CO2 reduction, and indoor environments. The choice of green building technologies is affected by building types, regulations, costs, climate conditions, and geographic restrictions. This study also found that the accuracy of the artificial neural network in predicting green building grades and costs can reach above 80%. The systematic data mining method constructed herein can effectively assist architects and building owners in reducing preliminary design time and costs and improving the success rates of green building projects. It is expected that the proposed approach can be adjusted in the future for other regions with different climates, or for their corresponding green building rating tools, so as to construct more suitable applications.
Article
A time series is a set of observations of a specific variable documented over time; the time intervals may be hours, days, weeks, months, seasons, or years. The identification of relevant attributes has a major impact on any prediction and is a challenging task. In particular, time series analysis of data sets measured on various scales requires special attention to attribute selection. The proposed eARIMA method is applied to daily time series data of different scales. The model parameters [p, q, d] are identified automatically, and the parameter 'd' varies across the seven attributes of the data set. The attribute weightage factor identified for each dimension highlights the features that drive the prediction outcome. The weighted data objects are clustered using segment k-means clustering, yielding better accuracy.
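The eARIMA method itself is not specified in enough detail here to reproduce, so the sketch below only shows one common way to identify ARIMA orders automatically: a small grid search over (p, d, q) scored by AIC with statsmodels, applied to a synthetic daily series.

```python
import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Placeholder daily series; in practice this would be one attribute of the data set.
rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(size=200)))

# Grid-search small (p, d, q) orders and keep the model with the lowest AIC,
# a simple stand-in for automatic parameter identification.
best_order, best_aic = None, np.inf
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(y, order=(p, d, q)).fit()
    except Exception:
        continue  # skip orders that fail to converge
    if fit.aic < best_aic:
        best_order, best_aic = (p, d, q), fit.aic

print(f"selected order {best_order} with AIC {best_aic:.1f}")
```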
Conference Paper
Smart grids have been introduced to address power distribution system challenges. In conventional power distribution systems, when a power outage happens, the maintenance team tries to find the outage cause and mitigate it; afterwards, some information is documented in a dataset called the outage dataset. If the team can estimate the outage cause before searching for it, the restoration time will be reduced. In line with smart grid concepts, an association rule-based method is presented in this paper to find the outage cause. To do this, we first combined the outage, load, and weather datasets and extracted features. Then, for every cause, the records are labelled as the main class or others. The association rules are extracted and evaluated. Through these rules, one can determine whether the outage happened because of a fault in a certain piece of equipment. Doing so, alongside the use of smart devices, may lead to enhanced reliability.
Article
Full-text available
Nowadays, Formal Concept Analysis (FCA) plays a crucial role in computational mathematics by separating information from noise through the construction of lattices. FCA consists of a number of mathematical algorithms and tools that can be used for data analysis, data visualization, and finding interesting patterns by reducing voluminous data without any loss of information. This paper focuses on two techniques by which large datasets can be reduced through the removal of noise, producing easily readable data that can be visualized by lattice construction using ConExp. First, although a large number of attributes may be present in the data sets, only attributes of interest are considered, using the attribute subset selection technique from data mining. Second, the data are reduced through dimensionality reduction using matrix factorization, which generates a number of concepts to be mined and analysed.
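As a hedged illustration of the FCA machinery this abstract relies on, the sketch below implements the two derivation operators (intent and extent) on a tiny, invented formal context; a formal concept is any pair of sets closed under both operators, and lattice-building tools such as ConExp enumerate all such pairs.

```python
# A tiny formal context: objects x attributes as a boolean incidence relation.
# Object and attribute names are illustrative only.
context = {
    "g1": {"a", "b"},
    "g2": {"a", "b", "c"},
    "g3": {"b", "c"},
}
attributes = {"a", "b", "c"}

def intent(objects):
    """Attributes shared by every object in the given set."""
    sets = [context[g] for g in objects]
    return set.intersection(*sets) if sets else set(attributes)

def extent(attrs):
    """Objects that possess every attribute in the given set."""
    return {g for g, a in context.items() if attrs <= a}

# A formal concept is a pair (extent, intent) closed under both derivations.
A = {"g1", "g2"}
B = intent(A)
print("intent:", B, "extent of that intent:", extent(B))
```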
Conference Paper
Full-text available
The amount of data created has increased considerably in recent years, and data mining is therefore used across disciplines to extract information from these datasets. Attribute selection, which refers to identifying the attributes of a dataset that contribute most to the result, is an important stage of data mining. The platform used for data mining also affects the performance of the task. This study aims to evaluate the performance of a new attribute selection method, Two Stage Correlation Based Attribute Selection (TSCBAS), proposed in our previous work. To this end, SVM and Random Forest classification algorithms are applied to the bank marketing data set from the UCI machine learning repository on two different data mining platforms, Spark and R. The dataset was separated into training and testing data by 5-fold cross-validation. According to the results, SVM showed better classification performance than Random Forest both on the raw dataset and on the dataset created with TSCBAS. In addition, Spark achieved better runtimes than R. The results also confirm the importance of the attribute selection process.
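The bank marketing data set and the TSCBAS method are not reproduced here; the sketch below only mirrors the evaluation setup, comparing an SVM and a Random Forest with 5-fold cross-validation on a synthetic stand-in dataset using scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the bank marketing data set.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),  # SVMs benefit from scaled inputs
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```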
Article
Full-text available
Over the last decade, the infrastructure supporting the smart city has lived alongside, and been surpassed by, the rise of social media. The tremendous growth of both mobile devices and social media users has given rise to a new kind of service in the so-called location-based social networks (LBSNs). In this new scenario, the term crowdsensing refers to sharing data collected by sensing humans with the aim of measuring phenomena of common interest. Crowd-sourced location data provide the ability to study, for the first time, the movement of individuals in urban environments. In this paper, we address the problem of monitoring crowds, their whereabouts and movement, which can assist decision making in education, emergency training, urban planning, traffic engineering, etc. Specifically, two-phase density-based analysis for collectives and crowds (2PD-CC) is a novel methodology over public data in LBSNs, which combines density-based clustering, outlier detection, and topic modeling over a region under study to detect, predict, and explain abnormal group behavior. In order to validate the methodology and its potential application to full-scale problems, an experiment over Twitter data was performed in the city of Madrid.
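The 2PD-CC methodology combines several phases; the sketch below illustrates only the density-based clustering and outlier-detection part, running DBSCAN over invented geotagged points so that dense hot-spots become clusters and sparse points are flagged as outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)
# Placeholder geotagged posts: two dense hot-spots plus scattered background noise.
hotspot_1 = rng.normal(loc=[40.4168, -3.7038], scale=0.002, size=(100, 2))
hotspot_2 = rng.normal(loc=[40.4300, -3.6900], scale=0.002, size=(80, 2))
noise = rng.uniform(low=[40.38, -3.75], high=[40.46, -3.65], size=(40, 2))
points = np.vstack([hotspot_1, hotspot_2, noise])

# DBSCAN groups dense regions into clusters and labels sparse points as -1 (outliers).
labels = DBSCAN(eps=0.005, min_samples=10).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, outliers: {(labels == -1).sum()}")
```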
Chapter
Smart homes generate a vast amount of measurement data from smart meters and devices. These data have all the velocity and veracity characteristics needed to be called Big Data. Meter data analytics holds tremendous potential for utilities to understand customers' energy consumption patterns, and allows them to manage, plan, and optimize the operation of the power grid efficiently. In this paper, we propose a unified architecture that enables innovative operations for near-real-time processing of large, fine-grained energy consumption data. Specifically, we propose an Internet of Things (IoT) big data analytics system that makes use of fog computing to address the challenges of complexity and resource demands for near-real-time data processing, storage, and classification analysis. The design architecture and requirements of the proposed framework are illustrated in this paper, while the analytics components are validated using datasets acquired from real homes.
Article
Full-text available
This work proposes a technique to predict the revisit of patients who have repeatedly attempted suicide. The technique uses factors relating to suicide and attempted suicide that are collected from medical treatment information. The proposed technique considers the probability distribution of patients' attempted-suicide dates in order to determine a threshold for classifying the patients into three categories of revisit duration: (i) low, (ii) medium, and (iii) high. In addition, this work proposes a feature filtering method that selects a set of significant factors from the suicide and self-harm surveillance report (RP. 506S) of Khon Kaen Rajanagarindra Psychiatric Hospital to perform the classification. There are 10,112 patients who used the services more than once. The filtering is performed before the threshold is determined using a Gaussian function. The experimental results show that the proposed technique is superior to the baseline for every learning algorithm, i.e. (i) k-NN, (ii) SVM, (iii) random forest, and (iv) neural networks. In addition, the results obtained from random forest are particularly promising: the best performance (in terms of F-measure) is 91.10%.
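The paper's exact Gaussian-based threshold selection is not given, so the sketch below is only a simplified stand-in: it fits a Gaussian (mean and standard deviation) to synthetic revisit intervals and uses the mean plus or minus one standard deviation as the two cut points between the low, medium, and high classes.

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder revisit intervals (days between consecutive visits per patient).
intervals = rng.gamma(shape=2.0, scale=60.0, size=1000)

# Fit a Gaussian to the intervals and derive two cut points (mean +/- one std),
# a simplified stand-in for the paper's Gaussian-based threshold selection.
mu, sigma = intervals.mean(), intervals.std()
low_cut, high_cut = mu - sigma, mu + sigma

def revisit_class(days):
    """Classify a revisit duration as 'low', 'medium', or 'high'."""
    if days < low_cut:
        return "low"
    if days > high_cut:
        return "high"
    return "medium"

labels = [revisit_class(d) for d in intervals]
print({c: labels.count(c) for c in ("low", "medium", "high")})
```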