Chapter

Data and Visual Analytics for Emerging Databases

... Hence, data science [33] is in demand. In general, a data science solution applies the following to big data for data analytics [5,22]: ...
Article
High volumes of wide varieties of valuable data of different veracity (e.g., imprecise and uncertain data) can be easily generated or collected at a high velocity for various knowledge-based and intelligent information & engineering systems in many real-life situations. Embedded in these big data are valuable knowledge and useful information, which can be discovered by data science solutions. As a popular data science task, frequent pattern mining aims to discover implicit, previously unknown and potentially useful information and valuable knowledge in terms of sets of frequently co-occurring items. Many of the existing frequent pattern mining algorithms use a transaction-centric mining approach to find frequent patterns from precise data. However, there are situations in which an item-centric mining approach is more appropriate, and there are also situations in which data are imprecise and uncertain. In this article, we present an item-centric algorithm for mining frequent patterns from big uncertain data. Evaluation results show the effectiveness of our algorithm in item-centric mining of frequent patterns from big uncertain data.
Chapter
Huge amounts of useful data are easily generated and gathered currently at a rapid rate from a broad range of rich data sources in numerous applications and services in the real world. Data science applies database techniques, scientific and engineering methods, mathematical and statistical models, data mining algorithms, and/or machine learning tools to manage data, extract the useful information and discover the new knowledge from these big data. This explains why data science for big data applications and services has become a fundamental technology in providing novel solutions in various areas in business, engineering, health, humanities, natural sciences, social sciences, etc. (e.g., healthcare, manufacturing, social life). Usually, data science focuses on big data management, analytics and visualization. Once big data are managed (i.e., captured, curated, managed and processed), big data are analyzed with an aim to discover interesting knowledge and information, which is usually presented in text or table form. Consistent with a proverb that “a picture is worth a thousand words”, big data visualization as well as visual analytics helps to reveal and explain the discovered interesting knowledge and information. In this paper, we present (a) big data management with focus on information fusion and the data lake; (b) big data analytics and mining, with focus on frequent patterns; as well as (c) big data visualization with focus on a few visual analytic systems for visualizing big data and mined frequent patterns. For illustration, we discuss these three aspects of data science on coronavirus disease 2019 (COVID-19) data. This highlights some important aspects of data science for big data analyses, services, and smart data.
Chapter
High volumes of wide varieties of valuable data of different veracities can be easily generated or collected at a high velocity from various big data applications and services. Embedded in these big data are valuable knowledge and useful information, which can be discovered by data science solutions. As a popular data science task, frequent pattern mining aims to discover implicit, previously unknown and potentially useful information and valuable knowledge in terms of sets of frequently co-occurring items. Many of the existing frequent pattern mining algorithms return large numbers of frequent patterns, of which only a small portion may be of interest to users. In this paper, we present a constrained mining algorithm that allows crowds of users to collaboratively vote for their interesting patterns. Such an algorithm takes the benefits of crowdsourcing, crowdvoting and collaborative filtering for the data analytics and mining of popular constrained frequent patterns from big data applications and services.
Chapter
High volumes of wide varieties of valuable data of different veracity can be easily generated or collected at a high velocity from various big data applications and services. A rich source of these big data is the Internet of Things (IoT), which can be viewed as a network of sensors, mobile devices, wearable devices, and other “things” that are capable of operating within the existing Internet infrastructure. As a popular data science task, frequent pattern mining aims to discover implicit, previously unknown and potentially useful information and valuable knowledge, in terms of sets of frequently co-occurring items, embedded in these big data. Existing frequent pattern mining algorithms mostly run serially on a single local computer or in distributed and parallel environments on computer clusters, grids, or clouds. Many of these algorithms return large numbers of frequent patterns, of which only some may be of interest to the user. In this paper, we present a constrained big data mining algorithm that (i) focuses the mining on those frequent patterns that are of interest to the users and (ii) runs in an edge computing environment, in which computation is performed at the edges of the computing network.
Chapter
Advances in technology and the growing popularity of the Internet of Things (IoT) in many applications have produced huge volumes of data at a high velocity. These valuable big data can be of a wide variety or different veracity. Embedded in these big data are useful information and valuable knowledge. This leads to data science, which aims to apply big data analytics to mine implicit, previously unknown and potentially useful information from big data. As a popular data analytic task, frequent itemset mining discovers knowledge about sets of frequently co-occurring items in the big data. Such a task has drawn attention in both academia and industry, partially due to its practicality in various real-life applications. Existing mining approaches mostly use serial, distributed or parallel algorithms to mine the data horizontally (i.e., on a transaction basis). In this paper, we present an alternative big data analytic approach. Specifically, our scalable algorithm uses the MapReduce programming model that runs in a Spark environment to mine the data vertically (i.e., on an item basis). Evaluation results show the effectiveness of our algorithm in big data analytics of frequent itemsets.
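The difference between horizontal (transaction-based) and vertical (item-based) layouts can be illustrated with a minimal single-machine sketch. It is not the Spark/MapReduce algorithm of the paper; the function names `to_vertical` and `support` and the toy database are assumptions made for illustration only.

```python
from collections import defaultdict


def to_vertical(transactions):
    """Convert a horizontal database (tid -> items) to a vertical one (item -> tids)."""
    vertical = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def support(vertical, itemset):
    """Support of an itemset = size of the intersection of its items' tid sets."""
    tidsets = [vertical.get(item, set()) for item in itemset]
    return len(set.intersection(*tidsets)) if tidsets else 0


# Hypothetical toy database: transaction id -> set of items.
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"b", "c"},
    4: {"a", "b", "c"},
}

vertical = to_vertical(transactions)
print(vertical["b"])                    # tids of the transactions containing "b"
print(support(vertical, {"a", "c"}))    # number of transactions containing both
```

In the vertical layout, counting an itemset needs only the tid sets of its items, not another scan over all transactions.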
Conference Paper
Frequent Induced Subgraph Mining (FISM) is currently an active research direction in various application domains such as biological, chemical, and social networks. A number of FISM approaches have been proposed over the years. However, existing methods take a long time to execute because they perform numerous subgraph isomorphism (SI) operations, an NP-hard problem, to count the frequency of subgraphs in a graph database. In this paper, we propose kFISM, a new sampling-based method for top-k Frequent Induced Subgraph Mining from a graph database. To avoid SI operations in kFISM, we present a measure, indFreq, to compute the frequency of subgraphs. kFISM executes a biased random walk-based sampling over fixed-size vertex-induced subgraphs so that the potentially frequent subgraphs are visited with high probability. We evaluate the execution time and accuracy of finding our desired types of subgraphs using kFISM on a real-life dataset. We observe that our proposed method outperforms the state-of-the-art approach in execution time and accuracy.
Article
Many existing data mining algorithms search for interesting patterns in transactional databases of precise data. However, there are situations in which data are uncertain. Items in each transaction of these probabilistic databases of uncertain data are usually associated with existential probabilities, which express the likelihood of these items being present in the transaction. When compared with mining from precise data, the search space for mining from uncertain data is much larger due to the presence of the existential probabilities. This problem is worsened as we move into the era of Big data. Furthermore, in many real-life applications, users may be interested in only a tiny portion of this large search space for Big data mining. Without providing opportunities for users to express the interesting patterns to be mined, many existing data mining algorithms return numerous patterns, of which only some are interesting. In this article, we propose an algorithm that allows users to express their interest in terms of constraints, uses the MapReduce model to mine uncertain Big data for frequent patterns that satisfy the user-specified anti-monotone and monotone constraints, and balances the load.
Conference Paper
Frequent Itemset Mining (FIM) is one of the most well known techniques to extract knowledge from data. The combinatorial explosion of FIM methods becomes even more problematic when they are applied to Big Data. Fortunately, recent improvements in the field of parallel programming already provide good tools to tackle this problem. However, these tools come with their own technical challenges, e.g., balanced data distribution and inter-communication costs. In this paper, we investigate the applicability of FIM techniques on the MapReduce platform. We introduce two new methods for mining large datasets: Dist-Eclat focuses on speed while BigFIM is optimized to run on really large datasets. In our experiments we show the scalability of our methods.
Conference Paper
Frequent itemset mining plays an essential role in the mining of many different patterns. Most existing frequent itemset mining algorithms return the mined results--namely, frequent itemsets--in the form of textual lists. However, the use of visual representation can enhance the user understanding of the inherent relations in a collection of frequent itemsets. In this paper, we propose an effective visualizer, called WiFIsViz, to display the mined frequent itemsets. WiFIsViz provides users with an overview and details about the itemsets. Moreover, this visualizer is also equipped with several interactive features for effective visualization of the frequent itemsets mined from various real-life applications.
Conference Paper
A number of vertical mining algorithms have been proposed recently for association mining, which have been shown to be very effective and usually outperform horizontal approaches. The main advantage of the vertical format is support for fast frequency counting via intersection operations on transaction ids (tids) and automatic pruning of irrelevant data. The main problem with these approaches arises when intermediate results of vertical tid lists become too large for memory, thus affecting the algorithm's scalability. In this paper we present a novel vertical data representation called Diffset, which only keeps track of differences in the tids of a candidate pattern from its generating frequent patterns. We show that diffsets drastically cut down the size of memory required to store intermediate results. We show how diffsets, when incorporated into previous vertical mining methods, increase the performance significantly.
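The diffset idea can be illustrated with a small sketch: instead of storing the tids of a candidate, store only the tids that its generating pattern has and the candidate lacks, and recover the support by subtraction. The helper names and the toy tid sets below are hypothetical, and the sketch covers only the bookkeeping, not a full mining algorithm.

```python
def diffset(tids_parent, tids_other):
    """Tids in which the parent pattern occurs but the extending item does not."""
    return tids_parent - tids_other


def support_from_diffset(parent_support, diff):
    """Support of the extended pattern = parent support minus diffset size."""
    return parent_support - len(diff)


# Hypothetical tid sets of three single items.
tids_a = {1, 2, 4, 5, 6}
tids_b = {1, 2, 3, 4, 6}
tids_c = {1, 3, 4, 5}

# First level: diffsets of ab and ac relative to a.
d_ab = diffset(tids_a, tids_b)                         # tids with a but not b
d_ac = diffset(tids_a, tids_c)                         # tids with a but not c
sup_ab = support_from_diffset(len(tids_a), d_ab)

# Next level: the diffset of abc relative to ab is derived from diffsets alone.
d_abc = d_ac - d_ab
sup_abc = support_from_diffset(sup_ab, d_abc)

# Cross-check against plain tid-set intersection.
print(sup_ab == len(tids_a & tids_b))
print(sup_abc == len(tids_a & tids_b & tids_c))
```

When patterns are dense, the diffsets tend to be much smaller than the tid sets they replace, which is the memory saving the abstract refers to.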
Article
As frequent pattern mining plays an essential role in many knowledge discovery and data mining (KDD) tasks, numerous algorithms for finding frequent patterns have been proposed over the past 15 years. However, most of these algorithms return the mining results in the form of textual lists containing frequent patterns showing those frequently occurring sets of items. It is well known that "a picture is worth a thousand words". The use of visual representation can enhance the user's understanding of the inherent relations in a collection of frequent patterns. In this paper, we develop a simple yet useful visual analytic tool for supporting frequent pattern mining called FpVAT. Such a visual analytic tool consists of two modules: One module gives users an overview so that they can derive insight from a massive amount of raw data; another module enables users to perform analytical reasoning on the mining results via interactive visual interfaces so that users can detect the expected frequent patterns and discover the unexpected frequent patterns. As a visual analytic tool, our FpVAT is equipped with several interactive features for effective visual support in the data analysis and KDD process for various real-life applications.
Article
Since inception, association rule mining (ARM) has become one of the core data-mining tasks and has attracted tremendous interest among researchers and practitioners. ARM is undirected or unsupervised data mining over variable-length data, and it produces clear, understandable results. It has an elegantly simple problem statement: to find the set of all subsets of items or attributes that frequently occur in many database records or transactions, and additionally, to extract rules on how a subset of items influences the presence of another subset.
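The two halves of the problem statement, finding how often an itemset occurs and measuring how strongly one subset implies another, can be sketched with the standard support and confidence measures. The function names and the market-basket data below are illustrative assumptions, not taken from the article.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Defined as support(antecedent and consequent together) / support(antecedent).
    """
    base = support_count(transactions, antecedent)
    if base == 0:
        return 0.0
    return support_count(transactions, antecedent | consequent) / base


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support_count(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

A rule is typically reported only when both its support and its confidence meet user-chosen thresholds.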
Article
Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets, and then forming conditional implication rules among them. We present efficient algorithms for the discovery of frequent itemsets, which forms the compute-intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sublattices, which can be solved in memory. Efficient lattice traversal techniques are presented which quickly identify all the long frequent itemsets and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining improvements of more than an order of magnitude for our test databases.
Conference Paper
In recent years, social media has played a huge role in how we share and communicate our thoughts and opinions. This information can be very valuable for companies and governments, as it can be used to analyze public mood and opinion, which is a very powerful tool. In this paper, we present a system that mines social media content from a platform such as Twitter for predicting future outcomes. Specifically, it uses chatter from Twitter to predict box office revenue of movies by extracting features such as tweets and their sentiments. Then, by using these features, our system constructs a polynomial regression model for predicting box office revenue. Experimental results show the effectiveness of our system in mining social media and predicting box office revenue.
Conference Paper
We describe a new voting method for social network environments. We propose a sequence of continuous support through a social network after electing representatives, or exemplars, in the network, which differs from typical majority voting. In other words, this paper proposes a method of electing representatives that uses a network clustering approach to count votes. In the network structure, the messages sent from each node reflect its influence on or importance to the representative, and these can be readjusted and sent back to each node. The representatives can then be clustered, and selectivity can be decided through the graph edges. In our experiments, our algorithm outperformed conventional approaches on a synthetic social network dataset as well as a real dataset.
Article
There is a perfect storm of the use of cloud computing and the growth of the Internet of Things (IoT). IoT is about processing data that comes from devices in some way that's meaningful, and cloud computing is about leveraging data from centralized computing and storage. Growth rates of both can easily become unmanageable. We have some problems to solve. In addition, alternatives to placing everything in the public cloud are being considered because the public cloud, in some cases, no longer makes sense.
Chapter
Collaborative filtering uses data mining and analysis to develop a system that helps users make appropriate decisions in real-life applications by removing redundant information and providing valuable information to users. Data mining aims to extract from data the implicit, previously unknown and potentially useful information such as association rules that reveal relationships between frequently co-occurring patterns in the antecedent and consequent parts of association rules. This chapter presents an algorithm called CF-Miner for collaborative filtering with an association rule miner. The CF-Miner algorithm first constructs bitwise data structures to capture important contents in the data. It then finds frequent patterns from the bitwise structures. Based on the mined frequent patterns, the algorithm forms association rules. Finally, the algorithm ranks the mined association rules to recommend appropriate merchandise products, goods or services to users. Evaluation results show the effectiveness of CF-Miner in using association rule mining in collaborative filtering.
Conference Paper
Typically, an application or website shows people's comments in a list format, ordered chronologically or as a log of recommendations. However, such a list is difficult to grasp because a reader cannot read and absorb the countless comments on a topic at a glance. Grasping the information at a glance therefore requires considerable effort to pick out only the important information. In this paper, we design and develop a visualization tool that can present, at a glance, a number of reviews containing comments on a movie. Reviews are assumed to be extracted from Amazon and IMDb, both of which provide subjective information. The tool visualizes sentiment analysis of the reviews, based on a pre-made sentiment dictionary, together with objective information about a movie. Our proposed system can search for and display one or more movies. Users can determine the relationship between movies by clustering the sentiment of positive/negative reviews and the movies' factors. In the future, by covering all reviews on Amazon for a variety of movies and other products, it could be used as a tool to help users make rational choices.
Conference Paper
In this paper, we describe the visualization of SNS (social networking service, specifically Twitter) data for analyzing the spatial-temporal distribution of social anxiety. We prepare training data collected from Twitter using an open API (twitter4j); the data indicate whether the person who posted a Tweet (a message posted on Twitter) is anxious or not. From these data, we construct a dictionary of word frequencies using KOMORAN, a Korean morphological analysis library. We then design a classifier based on the Naive Bayes method and estimate the degree of anxiety of Tweets that include spatial-temporal information. We visualize these estimates in a web application as a map and a word cloud. With the spatial-temporal data visualized in this way, we can analyze public opinion about a variety of social events.
Chapter
High volumes of a wide variety of data can be easily generated at a high velocity in many real-life applications. Implicitly embedded in these big data is previously unknown and potentially useful knowledge such as frequently occurring sets of items, merchandise, or events. Different algorithms have been proposed for either retrieving information about the data or mining the data to find frequent sets, which are usually presented in a lengthy textual list. As “a picture is worth a thousand words”, the use of visual representations can enhance user understanding of the inherent relationships among the mined frequent sets. However, many of the existing visualizers were not designed to visualize these mined frequent sets. This book chapter presents an interactive next-generation visual analytic system. The system enables the management, visualization, and advanced analysis of the original big data and the frequent sets mined from the data.
Conference Paper
As an important data mining task, frequent pattern mining has drawn attention from many researchers. This has led to the development of many frequent pattern mining algorithms, which include Apriori-based, tree-based, and hyperlinked array structure-based algorithms, as well as vertical mining algorithms. Although these algorithms are efficient and popular, they also suffer from some drawbacks. To tackle these drawbacks, we present in this paper an alternative algorithm called B-mine that uses a bitwise approach to mine frequent patterns. Evaluation results show the space- and time-efficiency of B-mine for frequent pattern mining, as well as the practicality of B-mine for social network analysis and knowledge discovery from social networks.
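The abstract does not describe B-mine's data structure, so the following is only a generic sketch of what a bitwise approach to support counting can look like: each item is assumed to get an integer bitmap with one bit per transaction, and the support of an itemset is the number of set bits in the bitwise AND of its items' bitmaps. All names and data are hypothetical.

```python
def build_bitmaps(transactions):
    """Map each item to an integer whose bit i is set if transaction i contains it."""
    bitmaps = {}
    for position, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << position)
    return bitmaps


def bitwise_support(bitmaps, itemset):
    """Support = number of set bits in the AND of the items' bitmaps."""
    items = list(itemset)
    if not items:
        return 0
    combined = bitmaps.get(items[0], 0)
    for item in items[1:]:
        combined &= bitmaps.get(item, 0)
    return bin(combined).count("1")


# Hypothetical transactions; list position doubles as the bit position.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
bitmaps = build_bitmaps(transactions)

print(bin(bitmaps["a"]))
print(bitwise_support(bitmaps, {"a", "b"}))
```

A bitmap costs one bit per transaction per item, and the AND plus bit count replaces set intersection, which is where a space and time benefit could come from.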
Conference Paper
As a data mining task, sequential pattern mining provides valuable information about frequent patterns of users over time. For instance, frequent sequential patterns can be applied to analyze user clickstreams for the determination of web navigation patterns, genome sequences, and customer purchasing patterns. In many real-life situations, data to be mined are continuously changing. Moreover, these data are streaming at a high velocity, which makes it impractical to store all these data in memory. Hence, to handle these situations, we propose three stream mining algorithms to first find frequent sequential patterns. The algorithms then form statistical models, which are stored as Markov chains or transition matrices capturing frequent sequential patterns mined so far, to predict future user clickstreams (e.g., the web page the user will visit next). Experimental results show the efficiency and prediction accuracy of our proposed Markov chain-based sequential stream mining algorithms in clickstream prediction.
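How a transition matrix can predict the next page can be sketched with a first-order Markov model built from page-to-page transition counts. This is a simplified illustration under assumed names and data, not one of the three stream mining algorithms proposed in the paper.

```python
from collections import defaultdict


def build_transition_counts(sessions):
    """Count page -> next-page transitions over all sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, page):
    """Return the most frequently observed successor of a page, or None if unseen."""
    followers = counts.get(page)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical clickstream sessions.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "reviews"],
    ["home", "about"],
    ["products", "cart", "checkout"],
]

counts = build_transition_counts(sessions)
print(predict_next(counts, "home"))
print(predict_next(counts, "products"))
```

Because the model keeps only counts, it can be updated as new sessions arrive without retaining the sessions themselves, which suits the streaming setting described above.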
Conference Paper
As a popular data mining task, frequent pattern mining discovers implicit, previously unknown and potentially useful knowledge in the form of sets of frequently co-occurring items or events. Many existing data mining algorithms return long textual lists of frequent patterns to users, which may not be easily comprehensible. As a picture is worth a thousand words, having a visual means for humans to interact with computers would be beneficial. This is where human-computer interaction (HCI) research meets data mining research. In particular, the popular HCI task of data and result visualization could help data miners to visualize the original data and to analyze the mined results (in the form of frequent patterns). In this paper, we present a few systems for data and visual analytics of frequent patterns, which integrate (i) data analytics and mining with (ii) data and result visualization.
Conference Paper
In the current era of big data, high volumes of valuable data can be easily collected and generated. Social networks are examples of generating sources of these big data. Users (or social entities) in these social networks are often linked by some interdependency such as friendship or “following” relationships. As these big social networks keep growing, there are situations in which individual users or businesses want to find those frequently followed groups of social entities so that they can follow the same groups. In this paper, we present a big data analytics solution that uses the MapReduce model to mine social networks for discovering groups of frequently followed social entities. Evaluation results show the efficiency and practicality of our big data analytics solution in discovering “following” patterns from social networks.
Article
Social networking sites (e.g., Facebook, Google+, and Twitter) have become popular for sharing valuable knowledge and information among social entities (e.g., individual users and organizations), who are often linked by some interdependency such as friendship. As social networking sites keep growing, there are situations in which a user wants to find those frequently followed groups of social entities so that he can follow the same groups. In this article, we present (i) a space-efficient bitwise data structure for capturing interdependency among social entities; (ii) a time-efficient data mining algorithm that makes the best use of our proposed data structure for serial discovery of groups of frequently followed social entities; and (iii) another time-efficient data mining algorithm for concurrent computation and discovery of groups of frequently followed social entities in parallel so as to handle high volumes of social network data. Evaluation results show the efficiency and practicality of our data structure and social network data mining algorithms.
Conference Paper
Over the past few years, social network sites (e.g., Facebook, Twitter, Weibo) have become very popular. These sites have been used for sharing knowledge and information among users. Nowadays, it is not unusual for a user to have many friends (e.g., hundreds or even thousands of friends) in these social networks. In general, social networks consist of social entities that are linked by some interdependency such as friendship. As social networks keep growing, it is not unusual for a user to want to find those frequently followed groups of social entities in the networks so that he can follow the same groups. In this paper, we propose (i) a space-efficient bitwise data structure to capture interdependency among social entities and (ii) a time-efficient data mining algorithm that makes the best use of our proposed data structure to discover groups of friends who are frequently followed by social entities in the social networks. Evaluation results show the efficiency of our data structure and mining algorithm.
Conference Paper
Eclat is a classical algorithm for mining frequent itemsets, which is based on vertical layout databases. It differs greatly from algorithms based on horizontal layout databases, such as Apriori and FP-Growth. In order to improve the efficiency of mining frequent itemsets from massive datasets, the parallel algorithm MREclat, based on the Map/Reduce framework, is presented. The algorithm also overcomes the problem of insufficient memory and computational capability when mining frequent itemsets from massive datasets. In this paper, the idea of MREclat is introduced and the performance of the algorithm is studied. The experimental results show that MREclat has high scalability and good speedup.
Conference Paper
Frequent pattern mining algorithms aim to find sets of frequently co-occurring items. Visual representation of the mining results is more comprehensible to users than the traditional long textual list of frequent patterns. Existing visualizers mostly show frequent patterns as graphs in a two-dimensional space with (x,y)-coordinates. Nowadays, in a collaborative environment, it is not uncommon for users to have face-to-face meetings when they show the graphs visualizing frequent patterns. In these situations, the viewing orientation of the graphs plays an important role as different orientations positively or negatively impact the graph legibility. A legible right-side-up graph to one user may become an illegible upside-down graph towards another user. In this paper, we propose a visualizer that uses a radial layout—which is orientation free—to show frequent patterns. Having such a visualizer is beneficial in the collaborative environment.
Conference Paper
Frequent itemset mining aims to discover implicit, previously unknown and potentially useful knowledge—in the form of sets of frequently co-occurring items—that are embedded in data. Many algorithms developed in the early days mined frequent itemsets from traditional transaction databases of precise data such as shoppers' market basket data, in which the contents of databases are known. However, we are living in an uncertain world, in which uncertain data can be found in many real-life applications. Hence, in recent years, researchers have paid more attention to frequent itemset mining from probabilistic datasets of uncertain data. In this paper, we present some algorithms for mining frequent itemsets from these probabilistic datasets.
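A measure widely used for itemsets in probabilistic datasets is expected support: the sum, over all transactions, of the product of the existential probabilities of the itemset's items. The abstract does not name the measure the presented algorithms use, so the sketch below assumes expected support, assumes that items within a transaction are independent, and uses hypothetical names and data.

```python
from math import prod


def expected_support(transactions, itemset):
    """Sum over transactions of the product of the items' existential probabilities.

    Each transaction maps an item to its existential probability; a transaction
    that lacks any item of the itemset contributes zero.
    """
    total = 0.0
    for transaction in transactions:
        if all(item in transaction for item in itemset):
            total += prod(transaction[item] for item in itemset)
    return total


# Hypothetical probabilistic dataset of uncertain data.
transactions = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.6, "c": 0.8},
    {"a": 1.0, "b": 0.4, "c": 0.2},
]

print(expected_support(transactions, {"a"}))
print(expected_support(transactions, {"a", "b"}))
```

Under this measure, an itemset would be considered frequent when its expected support meets a user-specified minimum threshold.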
Article
Many parallelization techniques have been proposed to enhance the performance of Apriori-like frequent itemset mining algorithms. Characterized by both map and reduce functions, MapReduce has emerged and excels in the mining of datasets of terabyte scale or larger in either homogeneous or heterogeneous clusters. Minimizing the scheduling overhead of each map-reduce phase and maximizing the utilization of nodes in each phase are keys to successful MapReduce implementations. In this paper, we propose three algorithms, named SPC, FPC, and DPC, to investigate effective implementations of the Apriori algorithm in the MapReduce framework. DPC dynamically combines candidates of various lengths and outperforms both the straightforward algorithm SPC and the fixed passes combined counting algorithm FPC. Extensive experimental results also show that all three algorithms scale up linearly with respect to dataset sizes and cluster sizes.
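One candidate-counting pass of Apriori in the map/reduce style can be simulated in plain Python: the map step emits a `(candidate, 1)` pair for every candidate contained in a transaction, and the reduce step sums the pairs per candidate and keeps those meeting the minimum support. This is a single-process sketch with hypothetical names and data; it does not reproduce SPC, FPC, or DPC, and it uses no MapReduce framework.

```python
from collections import defaultdict


def map_phase(transaction, candidates):
    """Emit (candidate, 1) for each candidate itemset contained in the transaction."""
    for candidate in candidates:
        if candidate <= transaction:
            yield candidate, 1


def reduce_phase(pairs, min_support):
    """Sum the emitted counts per candidate and keep the frequent ones."""
    totals = defaultdict(int)
    for candidate, count in pairs:
        totals[candidate] += count
    return {c: n for c, n in totals.items() if n >= min_support}


def count_pass(transactions, candidates, min_support):
    """Run one simulated map-reduce pass over all transactions."""
    emitted = (pair for t in transactions for pair in map_phase(t, candidates))
    return reduce_phase(emitted, min_support)


# Hypothetical data: four transactions and three candidate 2-itemsets.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
candidates = [frozenset("ab"), frozenset("ac"), frozenset("bc")]

frequent = count_pass(transactions, candidates, min_support=3)
print(frequent)
```

In a real cluster each pass like this carries scheduling overhead, which is why combining candidates of several lengths into one pass, as the abstract describes, can pay off.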
Conference Paper
Mining for association rules between items in a large database of sales transactions has been described as an important database mining problem. In this paper we present an efficient algorithm for mining association rules that is fundamentally different from known algorithms. Compared to previous algorithms, our algorithm not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases. We have performed extensive experiments and compared the performance of our algorithm with one of the best existing algorithms. It was found that for large databases, the CPU overhead was reduced by as much as a factor of four and I/O was reduced by almost an order of magnitude. Hence this algorithm is especially suitable for very large size databases.
Conference Paper
Skyline query is an effective method to process large-sized multi-dimensional data sets as it can pinpoint the target data so that dominated data (say, 95% of data) can be efficiently excluded as unnecessary data objects. However, most of the conventional skyline algorithms were developed to handle numerical data. Thus, most of the text data were excluded from being processed by the algorithms. In this paper, we pioneer an entirely new domain for skyline query—namely, the categorical data—with which the corresponding ranking measures for the skyline queries are developed. We tested our proposed algorithm using the ACM Computing Classification System.
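For context, the conventional numerical skyline that the abstract contrasts with can be sketched as follows: a point is in the skyline if no other point dominates it, where one point dominates another if it is no worse in every dimension and strictly better in at least one. The sketch assumes that smaller values are better and uses a simple quadratic scan; it does not implement the categorical ranking measures proposed in the paper, and the data are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q everywhere and strictly better somewhere.

    Assumes smaller values are better in every dimension.
    """
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach) pairs.
hotels = [(50, 8), (60, 3), (80, 2), (70, 5), (90, 9)]

print(skyline(hotels))
```

Here `(70, 5)` is excluded because `(60, 3)` is both cheaper and closer, which illustrates how dominated objects are pruned from the result.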
Conference Paper
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods.
Conference Paper
Weather plays an important role in many areas such as agriculture. Having a clean set of agro-meteorological data removes doubt about weather-derived models and ensures confidence in decisions supported by these models. In this paper, we present our design and development of a prototype system for detecting abnormal weather observations. This system is applicable to real-life quality control and assurance of reliable and error-free agro-meteorological data. It does so by checking the internal validity, as well as the temporal and spatial consistency, of each weather observation. In addition to having the ability to detect abnormal observations and control the quality of weather data, our system also has the capability to estimate values for temporal and spatial weather parameters. Having such a capability is helpful in replacing incorrect data and filling missing values. Moreover, in this paper, we discuss the challenges and our solutions to problems related to the management of weather observations such as data gaps, format incompatibilities, and integration issues.
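A minimal sketch of what such checks might look like for a single temperature series follows. It covers only a range check (internal validity), a step check (temporal consistency), and gap filling by linear interpolation. The thresholds, function names, and sample readings are assumptions for illustration, not values from the paper, and spatial consistency across stations is not covered.

```python
from typing import List, Optional


def flag_abnormal(temps: List[Optional[float]],
                  low: float = -60.0, high: float = 60.0,
                  max_step: float = 10.0) -> List[bool]:
    """Flag readings outside [low, high] or jumping more than max_step.

    Assumed thresholds; None marks a missing reading and is never flagged.
    """
    flags = []
    previous = None  # last reading that passed both checks
    for value in temps:
        if value is None:
            flags.append(False)
            continue
        bad = not (low <= value <= high)
        if not bad and previous is not None and abs(value - previous) > max_step:
            bad = True
        flags.append(bad)
        if not bad:
            previous = value
    return flags


def fill_gaps(temps: List[Optional[float]],
              flags: List[bool]) -> List[Optional[float]]:
    """Replace missing or flagged readings by linear interpolation.

    Gaps at either end of the series have no two trusted neighbours
    and are left as None.
    """
    trusted = [v if (v is not None and not f) else None
               for v, f in zip(temps, flags)]
    result = list(trusted)
    known = [i for i, v in enumerate(trusted) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            delta = trusted[right] - trusted[left]
            result[i] = trusted[left] + delta * (i - left) / (right - left)
    return result


# Hypothetical hourly temperatures (degrees Celsius): one spike, one gap.
READINGS = [10.0, 11.0, 45.0, None, 14.0]
FLAGS = flag_abnormal(READINGS)
CLEANED = fill_gaps(READINGS, FLAGS)
```

In this example the 45.0 reading passes the range check but fails the step check, so it is replaced together with the missing value by estimates interpolated between 11.0 and 14.0.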
Conference Paper
Recently, a significant number of parallel and distributed algorithms have been proposed to mine frequent patterns (FP) from large and/or distributed databases. Among them, parallelization of the FP-growth algorithm using the FP-tree has proved to be highly efficient. However, FP-tree-based techniques suffer from two major limitations: the need for multiple database scans (i.e., high I/O cost) and high inter-processor communication cost during the mining phase. Therefore, we propose a novel tree structure, called PP-tree (Parallel Pattern tree), that significantly reduces the I/O cost by capturing the database contents with a single scan and facilitates efficient FP-growth mining on it with reduced inter-processor communication overhead. Our parallel algorithm works independently at each local site and locally generates global frequent patterns, which are merged at the final stage. The experimental results show that parallel and distributed FP mining with the PP-tree outperforms other state-of-the-art algorithms.
Conference Paper
Frequent itemset mining (FIM) is a useful tool for discovering frequently co-occurring items. Since its inception, a number of significant FIM algorithms have been developed to speed up mining performance. Unfortunately, when the dataset size is huge, both the memory use and computational cost can still be prohibitively expensive. In this work, we propose to parallelize the FP-Growth algorithm (we call our parallel algorithm PFP) on distributed machines. PFP partitions computation in such a way that each machine executes an independent group of mining tasks. Such partitioning eliminates computational dependencies between machines, and thereby communication between them. Through an empirical study on a large dataset of 802,939 Web pages and 1,021,107 tags, we demonstrate that PFP can achieve virtually linear speedup. Besides scalability, the empirical study demonstrates that PFP is promising for supporting query recommendation for search engines.
Conference Paper
Since its introduction, frequent itemset mining has been the subject of numerous studies. However, most of them return frequent itemsets in the form of textual lists. The common cliché that “a picture is worth a thousand words” advocates that visual representation can enhance user understanding of the inherent relations in a collection of objects such as frequent itemsets. Many visualization systems have been developed to visualize raw data or mining results. However, most of these systems were not designed for visualizing frequent itemsets. In this paper, we propose a frequent itemset visualizer (FIsViz). FIsViz provides many useful features so that users can effectively see and obtain implicit, previously unknown, and potentially useful information that is embedded in data of various real-life applications.
Conference Paper
In this paper, we propose an efficient algorithm, called TD-FP-Growth (shorthand for Top-Down FP-Growth), to mine frequent patterns. TD-FP-Growth searches the FP-tree in top-down order, as opposed to the bottom-up order of the previously proposed FP-Growth. The advantage of the top-down search is that it does not generate conditional pattern bases and sub-FP-trees, thus saving a substantial amount of time and space. We extend TD-FP-Growth to mine association rules by applying two new pruning strategies: one is to push multiple minimum supports and the other is to push the minimum confidence. Experiments show that these algorithms and strategies are highly effective in reducing the search space.
Conference Paper
Methods for efficient mining of frequent patterns have been studied extensively by many researchers. However, the previously proposed methods still encounter some performance bottlenecks when mining databases with different data characteristics, such as dense vs. sparse, long vs. short patterns, memory-based vs. disk-based, etc. In this study, we propose a simple and novel hyper-linked data structure, H-struct, and a new mining algorithm, H-mine, which takes advantage of this data structure and dynamically adjusts links in the mining process. A distinct feature of this method is that it has very limited and precisely predictable space overhead and runs very fast in a memory-based setting. Moreover, it can be scaled up to very large databases by database partitioning, and when the data set becomes dense, (conditional) FP-trees can be constructed dynamically as part of the mining process. Our study shows that H-mine has high performance on various kinds of data, outperforms the previously developed algorithms in different settings, and is highly scalable in mining large databases. This study also proposes a new data mining methodology, space-preserving mining, which may have a strong impact on the future development of efficient and scalable data mining methods.
Article
We consider the problem of mining association rules on a shared-nothing multiprocessor. We present three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information. The best algorithm exhibits near-perfect scaleup behavior, yet requires only minimal overhead compared to the current best serial algorithm.
Article
In a vertical representation of a market-basket database, each item is associated with a column of values representing the transactions in which it is present. The association-rule mining algorithms that have been recently proposed for this representation show performance improvements over their classical horizontal counterparts, but are either efficient only for certain database sizes, or assume particular characteristics of the database contents, or are applicable only to specific kinds of database schemas. We present here a new vertical mining algorithm called VIPER, which is general-purpose, making no special requirements of the underlying database. VIPER stores data in compressed bit-vectors called "snakes" and integrates a number of novel optimizations for efficient snake generation, intersection, counting and storage. We analyze the performance of VIPER for a range of synthetic database workloads.
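The vertical representation described here can be illustrated with uncompressed bit-vectors, using Python integers as bitmaps. This sketch does not implement VIPER's compressed "snakes" or its optimizations. It only shows how support counting reduces to bit-vector intersection, and the function names and sample data are assumptions.

```python
from typing import Dict, Iterable, List


def to_vertical(transactions: List[Iterable[str]]) -> Dict[str, int]:
    """Map each item to a bit-vector; bit t is set if transaction t contains it."""
    vertical: Dict[str, int] = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            vertical[item] = vertical.get(item, 0) | (1 << tid)
    return vertical


def support(vertical: Dict[str, int], itemset: Iterable[str]) -> int:
    """Support of a non-empty itemset = number of set bits in the intersection."""
    items = list(itemset)
    if not items:
        raise ValueError("itemset must contain at least one item")
    bits = vertical.get(items[0], 0)
    for item in items[1:]:
        bits &= vertical.get(item, 0)  # intersect the transaction-id sets
    return bin(bits).count("1")


# Hypothetical market-basket database with transaction ids 0..4.
BASKETS = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"],
           ["a", "d", "e"], ["a", "b", "c"]]
VERTICAL = to_vertical(BASKETS)
```

Item `a` occurs in transactions 0, 2, 3 and 4, so `VERTICAL["a"]` is `0b11101`, and `support(VERTICAL, ["a", "d"])` counts the two transactions (2 and 3) that contain both items without rescanning the horizontal database.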
Mining frequent patterns from IoT devices with fog computing
  • P Braun
  • A Cuzzocrea
  • C K Leung
  • A G M Pazdor
  • S K Tanbeer
Big data analytics of social network data: who cares most about you on Facebook? In: Highlighting the Importance of Big Data Management and Analysis for Various Applications
  • C K Leung
  • F Jiang
  • T W Poon
  • P.-E Crevier
Mining ‘following’ patterns from big sparse social networks
  • C K Leung
  • Dela Cruz
  • E M Cook
  • T L Jiang