Conference Paper

A comprehensive study on periodicity mining algorithms

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

No full-text available


... Time series datasets are now common in application areas such as economics, social sciences, epidemiology, medicine, and physical sciences: for instance, a person's heart rate measured every minute, air temperature or wind readings taken every hour, stock rates recorded at the middle and end of every day, and so on. Mining patterns from time series data is usually referred to as periodic pattern mining [4][5][6][7]. The periodic patterns in the mining process are usually categorized as symbol, sequence (partial), and full-cycle (segment) periodic patterns [8][9][10][11]. Symbol periodicity happens when there is a single event that reappears in the time series after a specific time period. ...
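As a rough illustration of symbol periodicity, the sketch below checks, for a given period, which symbols reappear at the same offset in every cycle. The function name `symbol_periodicity` and the confidence measure (the fraction of expected positions that hold the symbol) are assumptions made for this example and are not taken from the cited work.

```python
from collections import defaultdict


def symbol_periodicity(series, period, min_conf=1.0):
    """Return (symbol, offset, confidence) triples for symbols that recur
    every `period` positions, starting at position `offset`.

    Confidence is an assumed measure: occurrences at the expected
    positions divided by the number of expected positions.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    n = len(series)
    results = []
    for offset in range(min(period, n)):
        positions = range(offset, n, period)
        counts = defaultdict(int)
        for i in positions:
            counts[series[i]] += 1
        for sym, c in counts.items():
            conf = c / len(positions)
            if conf >= min_conf:
                results.append((sym, offset, conf))
    return results


# 'a' and 'b' repeat perfectly with period 3; 'c' is disturbed once by 'd'.
perfect = symbol_periodicity("abcabdabc", 3)
```

Lowering `min_conf` (for example to 0.6) would also report the symbol `c` at offset 2, which matches two of its three expected positions.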
Article
Finding flexible periodic patterns in a time series database is nontrivial due to the irregular occurrence of unimportant events, which makes it intractable or computationally intensive for large datasets. There exist various solutions based on Apriori, projection, tree, and other techniques to mine these patterns. However, the existence of a constant size tree structure, i.e., a suffix tree, with extra information in memory throughout the mining process, redundant and invalid pattern generation, limited types of mined flexible periodic patterns, and repeated traversal over the tree data structure for pattern discovery result in unacceptable space and time complexity. In order to overcome these issues, we introduce an efficient approach called HOVA-FPPM, based on the Apriori approach with hashed occurrence vectors, to find all types of flexible periodic patterns. We do not rely on a complex tree structure; rather, we manage the necessary information in a hash table for efficient lookup during the mining process. We measured the performance of our proposed approach and compared the results with the baseline approach, i.e., FPPM. The results show that our approach requires less time and space, regardless of the data size or period value.
Conference Paper
As an active subfield of Automated Machine Learning, automated structural analysis focuses on extracting structural information, such as periodicity, from the data automatically, enabling automated data cleaning and feature extraction. Little research, however, has been done on periodicity mining from numeric data that contain noise and missing points. In this paper, we present a practical and innovative framework to close this gap. To validate our approach, we carry out detailed simulation studies and real data analyses. The experimental results show that our framework is more robust to data granularity, with better accuracy and computational efficiency, compared with baseline methods. Moreover, the results imply that our proposed method is insensitive to data jitters, noise points and missing signal points.
Conference Paper
This paper focuses on the problem of mining high utility episodes from complex event sequences. Episode mining, one of the fundamental problems of sequential pattern mining, has been continuously drawing attention over the past decade. Meanwhile, there is also tremendous interest in the problem of high utility mining. Recently, the problem of high utility episode mining comes into view from the interface of these two research areas. Although prior work [12] has proposed algorithm UP-Span to tackle this problem, their method suffers from several performance drawbacks. To that end, firstly, we explicitly interpret the high utility episode mining problem as a complete traversal of the lexicographic prefix tree. Secondly, under the framework of lexicographic prefix tree, we examine the original UP-Span algorithm and present several improvements on it. In addition, we propose several clever strategies from a practical perspective and obtain much tighter utility upper bounds of a given node. Based on these optimizations, an efficient algorithm named TSpan is presented for fast high utility episode mining using tighter upper bounds, which reduces huge search space over the prefix tree. Extensive experiments on both synthetic and real-life datasets demonstrate that TSpan outperforms the state-of-the-art in terms of both search space and running time significantly.
Article
We present a new method for the understandable description of local temporal relationships in multivariate data, called Time Series Knowledge Mining (TSKM). We define the Time Series Knowledge Representation (TSKR) as a new language for expressing temporal knowledge in time interval data. The patterns have a hierarchical structure, with levels corresponding to the temporal concepts duration, coincidence, and partial order. The patterns are very compact, but offer details for each element on demand. In comparison with related approaches, the TSKR is shown to have advantages in robustness, expressivity, and comprehensibility. The search for coincidence and partial order in interval data can be formulated as instances of the well known frequent itemset problem. Efficient algorithms for the discovery of the patterns are adapted accordingly. A novel form of search space pruning effectively reduces the size of the mining result to ease interpretation and speed up the algorithms. Human interaction is used during the mining to analyze and validate partial results as early as possible and guide further processing steps. The efficacy of the methods is demonstrated using two real life data sets. In an application to sports medicine the results were recognized as valid and useful by an expert of the field.
Article
Mining patterns in a market-basket dataset is a well-stated problem. There are a number of approaches to deal with this problem. Different types of patterns may be present in a dataset. An interesting one is patterns that hold seasonally, which are called calendar-based patterns. Earlier methods require periods to be specified by the user. We present here a method which is able to extract different types of periodic patterns that may exist in a temporal market-basket dataset without requiring the user to specify the periods in advance. We consider the time-stamps as a hierarchical data structure and then extract different types of patterns. The algorithm can detect both wholly and partially periodic patterns. Although we have applied our approach to a market-basket dataset, the approach can be used for any event related dataset where the events are associated with time intervals.
Conference Paper
Periodicity detection in time series data is a challenging problem of great importance in many applications. Most previous work focused on mining synchronous periodic patterns and did not recognize the misaligned presence of a pattern due to the intervention of random noise. In this paper, we propose a more flexible model of asynchronous periodic pattern that may be present only within a subsequence and whose occurrences may be shifted due to disturbance. Two parameters, min_rep and max_dis, are employed to specify the minimum number of repetitions that is required within each segment of nondisrupted pattern occurrences and the maximum allowed disturbance between any two successive valid segments. Upon satisfying these two requirements, the longest valid subsequence of a pattern is returned. A two-phase algorithm is devised to first generate potential periods by distance-based pruning, followed by an iterative procedure to derive and validate candidate patterns and locate the longest valid subsequence. We also show that this algorithm can not only provide linear time complexity with respect to the length of the sequence but also achieve space efficiency.
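To make the roles of min_rep and max_dis concrete, here is a heavily simplified sketch for a single symbol with a known period. It is not the two-phase algorithm described in the abstract: it uses a quadratic chaining step, measures length in repetitions, and the function name `longest_valid_subsequence` and the segment representation are assumptions made for illustration.

```python
def longest_valid_subsequence(positions, period, min_rep, max_dis):
    """Return the largest number of repetitions in a chain of valid segments.

    A segment is a maximal run of positions spaced exactly `period` apart;
    it is valid if it holds at least `min_rep` repetitions. Two successive
    segments may be chained if the gap (disturbance) between the end of the
    first and the start of the second is between 0 and `max_dis`.
    """
    pos_set = set(positions)
    segments = []  # (start, exclusive_end, repetitions), sorted by start
    for start in sorted(pos_set):
        if start - period in pos_set:
            continue  # not the first position of a run
        reps = 1
        while start + reps * period in pos_set:
            reps += 1
        if reps >= min_rep:
            segments.append((start, start + reps * period, reps))

    # best[j] = most repetitions in a valid chain that ends with segment j
    best = [0] * len(segments)
    for j, (start_j, _, reps_j) in enumerate(segments):
        best[j] = reps_j
        for i in range(j):
            gap = start_j - segments[i][1]
            if 0 <= gap <= max_dis:
                best[j] = max(best[j], best[i] + reps_j)
    return max(best, default=0)


# Runs at 0..6 and 10..19 are one position apart and can be chained;
# the run at 40..46 is too far away (disturbance 18 > max_dis).
reps = longest_valid_subsequence(
    [0, 3, 6, 10, 13, 16, 19, 40, 43, 46], period=3, min_rep=3, max_dis=5
)
```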
Conference Paper
Frequent episode discovery is a popular framework for mining data available as a long sequence of events. An episode is essentially a short ordered sequence of event types and the frequency of an episode is some suitable measure of how often the episode occurs in the data sequence. Recently, we proposed a new frequency measure for episodes based on the notion of non-overlapped occurrences of episodes in the event sequence, and showed that such a definition, in addition to yielding computationally efficient algorithms, has some important theoretical properties in connecting frequent episode discovery with HMM learning. This paper presents some new algorithms for frequent episode discovery under this non-overlapped occurrences-based frequency definition. The algorithms presented here are better (by a factor of N, where N denotes the size of episodes being discovered) in terms of both time and space complexities when compared to existing methods for frequent episode discovery. We show through some simulation experiments that our algorithms are very efficient. The new algorithms presented here have arguably the least possible orders of space and time complexities for the task of frequent episode discovery.
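The non-overlapped frequency of a serial episode can be illustrated with a simple single-pass count: scan the event sequence, advance through the episode whenever the next expected event type is seen, and restart once the episode completes. This is only a minimal sketch of the counting idea, assuming a serial episode given as a tuple of event types; the function name `count_non_overlapped` is hypothetical and the sketch does not reproduce the algorithms of the paper.

```python
def count_non_overlapped(events, episode):
    """Count non-overlapped occurrences of a serial episode in `events`.

    A single pointer `k` tracks the next event type the episode is
    waiting for; a completed occurrence resets the pointer, so counted
    occurrences never interleave.
    """
    if not episode:
        raise ValueError("episode must contain at least one event type")
    count = 0
    k = 0
    for e in events:
        if e == episode[k]:
            k += 1
            if k == len(episode):
                count += 1
                k = 0
    return count


# Two complete A -> B -> C occurrences; the trailing "AB" is incomplete.
freq = count_non_overlapped("ABCABCAB", ("A", "B", "C"))
```

The pass takes time linear in the length of the event sequence and constant extra space per episode.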
Conference Paper
Most studies on sequential pattern mining are mainly focused on time point-based event data. Few research efforts have elaborated on mining patterns from time interval-based event data. However, in many real applications, events usually persist for an interval of time. Since the relationships among event time intervals are intrinsically complex, mining time interval-based patterns in large databases is a challenging problem. In this paper, a novel approach, named incision strategy, and a new representation, called coincidence representation, are proposed to simplify the processing of complex relations among event intervals. Then, an efficient algorithm, CTMiner (Coincidence Temporal Miner), is developed to discover frequent time interval-based patterns. The algorithm also employs two pruning techniques to reduce the search space effectively. Furthermore, experimental results show that CTMiner is not only efficient and scalable but also outperforms state-of-the-art algorithms.
Conference Paper
We consider the question of finding an approximate period in a given string S of length n. Let S′ be a periodic string closest to S under some distance metric. We consider this distance the error of the periodic string, and seek the smallest period that generates a string with this distance to S. In this paper we consider the Hamming and swap distance metrics. In particular, if S is the given string, and S′ is the closest periodic string to S under the Hamming distance, and if that distance is k, we develop an $O(nk \log\log n)$ algorithm that constructs the smallest period that defines such a periodic string S′. We call that string the approximate period of S under the Hamming distance. We further develop an $O(n^2)$ algorithm that constructs the approximate period under the swap distance. Finally, we show an $O(n \log n)$ algorithm for finite alphabets, and an $O(n \log^3 n)$ algorithm for infinite alphabets, that approximates the number of mismatches in the approximate period of the string.
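The definition of an approximate period under the Hamming distance can be illustrated with a brute-force search. This is not the algorithm from the abstract: it simply tries every period length and runs in quadratic time. It assumes that a periodic string must repeat its period at least twice, so candidate lengths range from 1 to n // 2; the function name `approximate_period_hamming` is hypothetical.

```python
from collections import Counter


def approximate_period_hamming(s):
    """Return (period, errors): the smallest period whose repetition is a
    closest periodic string to `s` under the Hamming distance.

    For a fixed length p, the closest p-periodic string takes the most
    common symbol in each residue class modulo p, so the number of
    mismatches is the count of symbols that disagree with that majority.
    """
    n = len(s)
    if n < 2:
        raise ValueError("need at least two symbols")
    best = None
    for p in range(1, n // 2 + 1):
        period = []
        errors = 0
        for r in range(p):
            column = s[r::p]
            sym, cnt = Counter(column).most_common(1)[0]
            period.append(sym)
            errors += len(column) - cnt
        # Strict '<' keeps the smallest period among equally good ones.
        if best is None or errors < best[1]:
            best = ("".join(period), errors)
    return best


# One corrupted symbol ('d' instead of 'c') in a string of period "abc".
result = approximate_period_hamming("abcabcabdabc")
```

In this example both length 3 and length 6 reach one mismatch, and the strict comparison returns the shorter period.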
Conference Paper
Since mining frequent patterns from transactional databases involves an exponential mining space and generates a huge number of patterns, efficient discovery of a user-interest-based frequent pattern set becomes the first priority for a mining algorithm. In many real-world scenarios it is often sufficient to mine a small interesting representative subset of frequent patterns. Temporal periodicity of pattern appearance can be regarded as an important criterion for measuring the interestingness of frequent patterns in several applications. A frequent pattern is said to be periodic-frequent if it appears in the database at a regular interval given by the user. In this paper, we introduce a novel concept of mining periodic-frequent patterns from transactional databases. We use an efficient tree-based data structure, called Periodic-frequent pattern tree (PF-tree in short), that captures the database contents in a highly compact manner and enables a pattern growth mining technique to generate the complete set of periodic-frequent patterns in a database for user-given periodicity and support thresholds. The performance study shows that mining periodic-frequent patterns with PF-tree is time and memory efficient and highly scalable as well.
Conference Paper
The mining of closed sequential patterns has attracted researchers for its capability of using compact results to preserve the same expressive power as conventional mining. However, existing studies only focus on time point-based data. Few research efforts have elaborated on discovering closed sequential patterns from time interval-based data, where each data item persists for a period of time. Mining closed time interval-based patterns, also called closed temporal patterns, is an arduous problem since the pairwise relationships between two interval-based events are intrinsically complex. In this paper, an efficient algorithm, CEMiner, is developed to discover closed temporal patterns from interval-based data. CEMiner employs some optimization techniques to effectively reduce the search space. The experimental results on both synthetic and real datasets indicate that CEMiner not only significantly outperforms the prior interval-based mining algorithms in terms of execution time but also possesses graceful scalability. The experiment conducted on a real dataset shows the practicability of time interval-based closed pattern mining.
Article
The search for periodicity in time-series databases has a number of applications and is an interesting data mining problem. Because real-world datasets are mostly noisy and rarely exhibit perfect periodicity, the problem is not trivial. Periodicity detection is a very common step in time series mining algorithms, which often try to discover periodic signals with no time limit. We propose an algorithm that uses an FP-tree for finding symbol, partial, and full periodicity in time series. The algorithm is designed to run in O(kN) time, where N is the length of the input sequence and k is the length of the periodic pattern. We show that our algorithm is fixed-parameter tractable with respect to a fixed symbol set size and a fixed length of input sequences. Experimental results on both synthetic and real data from different domains show that our algorithm is time-efficient and noise-resilient. A comparison with some current algorithms demonstrates the applicability and effectiveness of the proposed algorithm.
Conference Paper
The concept of episodes was introduced for discovering the useful and interesting temporal patterns from the sequential data. Over the years, many episode mining strategies have been suggested, which can be roughly classified into two classes: Apriori-based breadth-first algorithms and projection-based depth-first algorithms. As we know, both kinds of algorithms are level-wise pattern growth methods, so that they have higher computational overhead due to level-wise growth iteration. In addition, their mining time will increase with the increase of sequence length. In the paper, we propose a novel two-phase strategy to discover frequent closed episodes. That is, in phase I, we present a level-wise shrinking mechanism, based on maximal duration episodes, to find the candidate frequent closed episodes from the episodes with the same 2-neighboring episode prefix, and in phase II, we compare the candidates with different prefixes to discover the final frequent closed episodes. The advantage of the suggested mining strategy is it can reduce mining time due to narrowing episode mapping range when doing closure judgment. Experiments on simulated and real datasets demonstrate that the suggested strategy is effective and efficient.
Conference Paper
This paper investigates the partial periodic behavior of the frequent patterns in a transactional database, and introduces a new class of user-interest-based patterns known as chronic-frequent patterns. Informally, a frequent pattern is said to be chronic if it has a sufficient number of cyclic repetitions in a database. The proposed patterns can provide useful information to the users in many real-life applications. An example is finding chronic diseases in a medical database. The chronic-frequent patterns satisfy the anti-monotonic property. This property makes the pattern mining practicable in real-world applications. The existing pattern growth techniques that are meant to discover frequent patterns cannot be used for finding the chronic-frequent patterns. The reason is that the tree structure employed by these techniques captures only the frequency and disregards the periodic behavior of the patterns. We introduce another pattern-growth algorithm which employs an alternative tree structure, called Chronic-Frequent pattern tree (CFP-tree), to capture both frequency and periodic behavior of the patterns. Experimental results show that the proposed patterns can provide useful information and our algorithm is efficient.
Conference Paper
Periodic-frequent patterns are an important class of regularities that exist in a transactional database. Informally, a frequent pattern is said to be periodic-frequent if it appears at a regular interval specified by the user (i.e., periodically) in a database. A pattern-growth algorithm, called PFP-growth, has been proposed in the literature to discover the patterns. This algorithm constructs a tid-list for a pattern and performs a complete search on the tid-list to determine whether the corresponding pattern is a periodic-frequent or a non-periodic-frequent pattern. In very large databases, the tid-list of a pattern can be very long. As a result, the task of performing a complete search over a pattern’s tid-list can make the pattern mining a computationally expensive process. In this paper, we have made an effort to reduce the computational cost of mining the patterns. In particular, we apply greedy search on a pattern’s tid-list to determine the periodic interestingness of a pattern. Greedy search lets us prune the non-periodic-frequent patterns with a sub-optimal solution while finding the periodic-frequent patterns with the globally optimal solution, thus reducing the computational cost of mining the patterns without missing any knowledge pertaining to the periodic-frequent patterns. We introduce two novel pruning techniques, and extend them to improve the performance of PFP-growth. We call the resulting algorithm PFP-growth++. Experimental results show that PFP-growth++ is runtime efficient and highly scalable as well.
Article
Periodic-frequent patterns (or itemsets) are an important class of regularities that exist in a transactional database. Finding these patterns involves discovering all frequent patterns that satisfy the user-specified maximum periodicity constraint. This constraint controls the maximum inter-arrival time of a pattern in a database. The time complexity to measure periodicity of a pattern is O(n), where n represents the number of timestamps at which the corresponding pattern has appeared in a database. As n usually represents a high value in voluminous databases, determining the periodicity of every candidate pattern in the itemset lattice makes the periodic-frequent pattern mining a computationally expensive process. This paper introduces a novel approach to address this problem. Our approach determines the periodic interestingness of a pattern by adopting greedy search. The basic idea of our approach is to discover all periodic-frequent patterns by eliminating aperiodic patterns based on suboptimal solutions. The best and worst case time complexities of our approach to determine the periodic interestingness of a frequent pattern are O(1) and O(n), respectively. We introduce two pruning techniques and propose a pattern-growth algorithm to find these patterns efficiently. Experimental results show that our algorithm is runtime efficient and highly scalable as well.
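The idea of checking the maximum inter-arrival time with an early exit can be sketched as follows. This is an illustrative check rather than the paper's greedy procedure: the function name `is_periodic`, the parameters `db_start` and `db_end`, and the choice to count the gaps at the database boundaries as inter-arrival times are assumptions made for this example.

```python
def is_periodic(timestamps, max_per, db_start=0, db_end=None):
    """Return True if no inter-arrival time exceeds `max_per`.

    `timestamps` must be sorted in ascending order. The scan stops at the
    first gap that violates the constraint, so an aperiodic pattern can
    be rejected after a single comparison (best case O(1)), while a
    periodic pattern needs the full O(n) scan.
    """
    prev = db_start
    for ts in timestamps:
        if ts - prev > max_per:
            return False  # early exit: the pattern is already aperiodic
        prev = ts
    # Assumed convention: the gap to the end of the database also counts.
    if db_end is not None and db_end - prev > max_per:
        return False
    return True


periodic = is_periodic([1, 3, 6, 8, 10], max_per=3, db_end=10)
aperiodic = is_periodic([1, 3, 9, 10], max_per=3, db_end=10)
```

In the second call the scan stops at the gap between timestamps 3 and 9 without reading the rest of the list.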
Article
Periodic pattern mining in time series databases is one of the most interesting data mining problems and appears frequently in many real-life applications. Some of the existing approaches find fixed length periodic patterns by using a suffix tree structure, i.e., they are unable to mine flexible patterns. One of the existing approaches generates periodic patterns by skipping intermediate events, i.e., flexible patterns, using an apriori based sequential pattern mining approach. Since apriori based approaches suffer from the issues of a huge amount of candidate generation and a large percentage of false pattern pruning, we propose an efficient algorithm, FPPM (Flexible Periodic Pattern Mining), using a suffix trie data structure. The proposed algorithm can capture more effective variable length flexible periodic patterns by neglecting unimportant or undesired events and considering only the important events in an efficient way. To the best of our knowledge, ours is the first approach that simultaneously handles various starting positions throughout the sequences, flexibility among events in the mined patterns, and interactive tuning of period values on the go. Complexity analysis of the proposed approach and comparison with existing approaches, along with analytical comparison on various issues, have been performed. In addition, extensive experimental analyses are conducted to evaluate the performance of the proposed FPPM algorithm using real-life datasets. The proposed approach outperforms existing algorithms in terms of processing time, scalability, and quality of mined patterns.
Article
Mining high utility episode rules in complex event sequences has emerged as an important topic in data mining because the utility-based episode rules generated may provide important insights that facilitate decision making for expert and intelligent systems. Although one may employ previous methods in this research area to indirectly construct utility-based episode rules, they typically lack efficiency and effectiveness for real-world applications. In this paper, we develop a novel methodology to directly generate high utility episode rules during the mining process, which is the first work addressing the issue of utility-based episode rule mining. Our goal is to simultaneously resolve the difficulties of the previously reported methods for frequent episode mining and utility-based episode mining. An algorithm called UBER-Mine (Utility-Based Episode Rules) and a structure named UR-Tree (Utility Rule Tree) are proposed to mine efficiently the complete set of high utility episode rules in complex event sequences. In short, UBER-Mine is based on an extended downward closure property, which can efficiently discover utility-based episode rules. On the other hand, UR-Tree can maintain important event information without producing candidate episodes to further accelerate the mining process. Results on both real and synthetic datasets show that UBER-Mine with UR-Tree has good scalability on large datasets and runs over 100 times faster than the basic UBER-Mine and the current best high utility episode mining algorithm. Furthermore, by proposing a high-utility episode-rule model called IV-UBER (InVestment by Utility-Based Episode Rules), we further demonstrate the effectiveness of our method for mining high utility-based episode rules on a real-world application for stock investment. The experimental results show that our proposed IV-UBER method outperforms several state-of-the-art algorithms in terms of both precision and annualized return for investment.
Article
Periodic patterns and cyclic patterns have been used to discover recurring patterns in sequence databases. Toroslu (2003) proposed cyclically repeated pattern (CRP) mining, in which a new parameter called repetition support is considered in the mining process. In a data sequence, the occurrence of a subsequence must satisfy a single user-specified minimum repetition support. However, in real-life applications, items may occur at various frequencies in a database. The rare item problem may occur when all items are set to a single minimum repetition support. To solve this problem, we included the concept of multiple minimum supports to enable users to specify the multiple minimum item repetition support (MIR) according to the natures of items. In this paper, we first redefined CRPs based on the MIR and original form of the sequence minimum support. A new algorithm, rep-PrefixSpan, was developed for discovering a complete set of CRPs in sequence databases. The experimental results indicate that the proposed approach exhibits performance superior to that of conventional CRP mining. The proposed method can be applied in many application domains including customer purchase behavior, web logging, and stock analyses.
Article
Frequent episode discovery is one of the methods used for temporal pattern discovery in sequential data. An episode is a partially ordered set of nodes with each node associated with an event type. For more than a decade, algorithms existed for episode discovery only when the associated partial order is total (serial episode) or trivial (parallel episode). Recently, the literature has seen algorithms for discovering episodes with general partial orders. In frequent pattern mining, the threshold beyond which a pattern is inferred to be interesting is typically user-defined and arbitrary. One way of addressing this issue in the pattern mining literature has been based on the framework of statistical hypothesis testing. This paper presents a method of assessing statistical significance of episode patterns with general partial orders. A method is proposed to calculate thresholds, on the non-overlapped frequency, beyond which an episode pattern would be inferred to be statistically significant. The method is first explained for the case of injective episodes with general partial orders. An injective episode is one where event-types are not allowed to repeat. Later it is pointed out how the method can be extended to the class of all episodes. The significance threshold calculations for general partial order episodes proposed here also generalize the existing significance results for serial episodes. Through simulation studies, the usefulness of these statistical thresholds in pruning uninteresting patterns is illustrated.
Article
Many previous approaches to frequent episode discovery only accept simple sequences. Although a recent approach has been able to find frequent episodes from complex sequences, the discovered sets are neither condensed nor accurate. This paper investigates the discovery of condensed sets of frequent episodes from complex sequences. We adopt a novel anti-monotonic frequency measure based on non-redundant occurrences, and define a condensed set, nDaCF (the set of non-derivable approximately closed frequent episodes) within a given maximal error bound of support. We then introduce a series of effective pruning strategies, and develop a method, nDaCF-miner, for discovering nDaCF sets. Experimental results show that, when the error bound is somewhat high, the discovered nDaCF sets are two orders of magnitude smaller than complete sets, and nDaCF-miner is more efficient than previous mining approaches. In addition, the nDaCF sets are more accurate than the sets found by previous approaches.
Conference Paper
Frequent episode mining (FEM) is an interesting research topic in data mining with wide range of applications. However, the traditional framework of FEM treats all events as having the same importance/utility and assumes that a same type of event appears at most once at any time point. These simplifying assumptions do not reflect the characteristics of scenarios in real applications and thus the useful information of episodes in terms of utilities such as profits is lost. Furthermore, most studies on FEM focused on mining episodes in simple event sequences and few considered the scenario of complex event sequences, where different events can occur simultaneously. To address these issues, in this paper, we incorporate the concept of utility into episode mining and address a new problem of mining high utility episodes from complex event sequences, which has not been explored so far. In the proposed framework, the importance/utility of different events is considered and multiple events can appear simultaneously. Several novel features are incorporated into the proposed framework to resolve the challenges raised by this new problem, such as the absence of anti-monotone property and the huge set of candidate episodes. Moreover, an efficient algorithm named UP-Span (Utility ePisodes mining by Spanning prefixes) is proposed for mining high utility episodes with several strategies incorporated for pruning the search space to achieve high efficiency. Experimental results on real and synthetic datasets show that UP-Span has excellent performance and serves as an effective solution to the new problem of mining high utility episodes from complex event sequences.
Conference Paper
This research paper focuses on data mining in time series and its applications on financial data. Data-mining attempts to analyze time series and extract valuable information about pattern periodicity, which might be concealed by substantial amounts of unformatted, random information. Such information, however, is of great importance as it can be used to forecast future behavior. In this paper, a new methodology is introduced aiming to utilize Suffix Arrays in data mining instead of the commonly used data structure Suffix Trees. Although Suffix Arrays, normally, require high storage capacity, the algorithm proposed allows them to be constructed in linear time. The methodology is also extended to detect repeated patterns in time series with time complexity of. This, combined with the capability of external storage, creates a critical advantage, for an overall efficient data mining and analysis regarding the construction of time series data structure and periodicity detection. The test results, presented below demonstrate the applicability and effectiveness of the proposed technique.
Article
Frequent episode discovery is a popular framework for pattern discovery from sequential data. It has found many applications in domains like alarm management in telecommunication networks, fault analysis in the manufacturing plants, predicting user behavior in web click streams and so on. In this paper, we address the discovery of serial episodes. In the episodes context, there have been multiple ways to quantify the frequency of an episode. Most of the current algorithms for episode discovery under various frequencies are apriori-based level-wise methods. These methods essentially perform a breadth-first search of the pattern space. However currently there are no depth-first based methods of pattern discovery in the frequent episode framework under many of the frequency definitions. In this paper, we try to bridge this gap. We provide new depth-first based algorithms for serial episode discovery under non-overlapped and total frequencies. Under non-overlapped frequency, we present algorithms that can take care of span constraint and gap constraint on episode occurrences. Under total frequency we present an algorithm that can handle span constraint. We provide proofs of correctness for the proposed algorithms. We demonstrate the effectiveness of the proposed algorithms by extensive simulations. We also give detailed run-time comparisons with the existing apriori-based methods and illustrate scenarios under which the proposed pattern-growth algorithms perform better than their apriori counterparts.
Article
Partial periodic pattern mining is an important issue in the field of data mining due to its practical applications. A partial periodic pattern consists of some periodic and non-periodic events in a specific period length, and is repeated with high frequency in an event sequence. In the past, a max-subpattern hit set algorithm was developed to discover partial periodic patterns, but its drawback is that it spends a large amount of time calculating frequency counts from redundant candidate nodes. In this study, we therefore adopt an efficient encoding strategy to speed up the processing of period segments in an event sequence, and combine it with the projection method to quickly find the partial periodic patterns in the recursive process. Finally, the experimental results show the superior performance of the proposed approach.
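The definition of a partial periodic pattern can be made concrete with a short Python sketch. It counts, by brute force, how many period segments of an event sequence match a pattern in which `*` marks a non-periodic (don't-care) position. This is only the counting step implied by the definition, not the max-subpattern hit set algorithm or the encoding and projection approach of the abstract; the function name and the `*` convention are assumptions.

```python
from typing import Tuple


def pattern_frequency(sequence: str, pattern: str, dont_care: str = "*") -> Tuple[int, int]:
    """Count the period segments of `sequence` that match `pattern`.

    The period length is len(pattern). The sequence is cut into consecutive,
    non-overlapping segments of that length; a trailing partial segment is
    ignored. Returns (matching_segments, total_segments).
    """
    period = len(pattern)
    if period == 0:
        raise ValueError("pattern must not be empty")
    total = len(sequence) // period
    matches = 0
    for k in range(total):
        segment = sequence[k * period:(k + 1) * period]
        # A position matches if the pattern has a don't-care or the same event.
        if all(p == dont_care or p == s for p, s in zip(pattern, segment)):
            matches += 1
    return matches, total


if __name__ == "__main__":
    # Segments of length 3: abc, abd, aec, abc
    print(pattern_frequency("abcabdaecabc", "a*c"))
```

Dividing the first returned value by the second gives the pattern's frequency ratio, which can then be compared against a user-chosen threshold.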
Article
The goal of analyzing a time series database is to find whether and how frequently a periodic pattern is repeated within the series. Periodic pattern mining is the problem that concerns temporal regularity. However, most of the existing algorithms have a major limitation in mining patterns of interest to users: they can mine only patterns of a specific length in which all the events follow one another sequentially in exact positions. There are, however, scenarios where a pattern can be flexible, that is, it may be interesting and can be mined by neglecting any number of unimportant events in between important events, giving the pattern a variable length. Moreover, existing algorithms can detect only a specific type of periodicity in various time series databases and require interaction from the user to determine periodicity. In this paper, we propose an algorithm for periodic pattern mining in time series databases which does not rely on the user for the period value or period type of the pattern and can detect all types of periodic patterns at the same time; these flexibilities are missing in existing algorithms. The proposed algorithm enables the user to generate different kinds of patterns by skipping intermediate events in a time series database and to find the periodicity of the patterns within the database. It is an improvement over pattern generation using a suffix tree, because suffix tree based algorithms have a weakness in this particular area of pattern generation. Compared with the existing algorithms, the proposed algorithm improves the generation of different kinds of interesting patterns and detects whether the generated pattern is periodic or not. 
We have tested the performance of our algorithm on both synthetic and real life data from different domains and found a large number of interesting event sequences which were missing in existing algorithms and the proposed algorithm was efficient enough in generating and detecting periodicity of flexible patterns on both types of data.
Article
Periodic pattern detection in time-ordered sequences is an important data mining task, which discovers in the time series all patterns that exhibit temporal regularities. Periodic pattern mining has a large number of applications in real life; it helps understanding the regular trend of the data along time, and enables the forecast and prediction of future events. An interesting related and vital problem that has not received enough attention is to discover outlier periodic patterns in a time series. Outlier patterns are defined as those which are different from the rest of the patterns; outliers are not noise. While noise does not belong to the data and it is mostly eliminated by preprocessing, outliers are actual instances in the data but have exceptional characteristics compared with the majority of the other instances. Outliers are unusual patterns that rarely occur, and, thus, have lesser support (frequency of appearance) in the data. Outlier patterns may hint toward discrepancy in the data such as fraudulent transactions, network intrusion, change in customer behavior, recession in the economy, epidemic and disease biomarkers, severe weather conditions like tornados, etc. We argue that detecting the periodicity of outlier patterns might be more important in many sequences than the periodicity of regular, more frequent patterns. In this paper, we present a robust and time efficient suffix tree-based algorithm capable of detecting the periodicity of outlier patterns in a time series by giving more significance to less frequent yet periodic patterns. Several experiments have been conducted using both real and synthetic data; all aspects of the proposed approach are compared with the existing algorithm InfoMiner; the reported results demonstrate the effectiveness and applicability of the proposed approach.
Conference Paper
The problem of finding association rules from a dataset is to find all possible associations that hold among the items, given a minimum support and confidence. This involves finding frequent sets first and then the association rules that hold within the items in the frequent sets. In temporal datasets, where the time at which a transaction takes place is important, we may find sets of items that are frequent in certain time intervals but not frequent throughout the dataset. These frequent sets may give rise to interesting rules, but these cannot be discovered if we calculate the supports of the item sets in the usual way. We call these frequent sets locally frequent. Normally these locally frequent sets are periodic in nature. We propose a modification to the Apriori algorithm to compute locally frequent sets, periodic frequent sets, and periodic association rules.
Article
Frequent episode discovery is a popular framework for temporal pattern discovery in event streams. An episode is a partially ordered set of nodes with each node associated with an event type. Currently algorithms exist for episode discovery only when the associated partial order is total order (serial episode) or trivial (parallel episode). In this paper, we propose efficient algorithms for discovering frequent episodes with unrestricted partial orders when the associated event-types are unique. These algorithms can be easily specialized to discover only serial or parallel episodes. Also, the algorithms are flexible enough to be specialized for mining in the space of certain interesting subclasses of partial orders. We point out that frequency alone is not a sufficient measure of interestingness in the context of partial order mining. We propose a new interestingness measure for episodes with unrestricted partial orders which, when used along with frequency, results in an efficient scheme of data mining. Simulations are presented to demonstrate the effectiveness of our algorithms. Keywords: Episode mining, General partial order, Non-overlapped count, Bidirectional evidence
Article
Periodic pattern mining or periodicity detection has a number of applications, such as prediction, forecasting, detection of unusual activities, etc. The problem is not trivial because the data to be analyzed are mostly noisy and different periodicity types (namely symbol, sequence, and segment) are to be investigated. Accordingly, we argue that there is a need for a comprehensive approach capable of analyzing the whole time series or a subsection of it, one that effectively handles different types of noise (to a certain degree) and at the same time is able to detect different types of periodic patterns; combining these under one umbrella is by itself a challenge. In this paper, we present an algorithm which can detect symbol, sequence (partial), and segment (full cycle) periodicity in time series. The algorithm uses a suffix tree as the underlying data structure; this allows us to design the algorithm such that its worst-case complexity is O(k·n²), where k is the maximum length of periodic pattern and n is the length of the analyzed portion (whole or subsection) of the time series. The algorithm is noise resilient; it has been successfully demonstrated to work with replacement, insertion, deletion, or a mixture of these types of noise. We have tested the proposed algorithm on both synthetic and real data from different domains, including protein sequences. The conducted comparative study demonstrates the applicability and effectiveness of the proposed algorithm; it is generally more time-efficient and noise-resilient than existing algorithms.
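Symbol periodicity, the simplest of the three types named above, can be illustrated with a brute-force Python sketch. It does not use a suffix tree and is not the algorithm of the abstract; it only checks one candidate (symbol, period, start position) triple and reports the fraction of expected positions at which the symbol actually occurs. The function name and the notion of "confidence" used here are assumptions made for illustration.

```python
def symbol_periodicity_confidence(series: str, symbol: str, period: int, start: int) -> float:
    """Fraction of positions start, start+period, start+2*period, ...
    (within the series) at which `symbol` occurs.

    A value of 1.0 means the symbol is perfectly periodic for this period
    and start position; lower values indicate noise or weaker periodicity.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    positions = range(start, len(series), period)
    expected = len(positions)
    if expected == 0:
        return 0.0
    actual = sum(1 for i in positions if series[i] == symbol)
    return actual / expected


if __name__ == "__main__":
    series = "abcabbabcb"
    # 'a' is expected at positions 0, 3, 6, 9 and found at 0, 3, 6.
    print(symbol_periodicity_confidence(series, "a", period=3, start=0))
    # 'b' is expected at positions 1, 4, 7 and found at all of them.
    print(symbol_periodicity_confidence(series, "b", period=3, start=1))
```

Enumerating every symbol, period, and start position this way is expensive, which is the kind of cost that index structures such as suffix trees are meant to reduce.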
Article
Discovering patterns with great significance is an important problem in data mining discipline. An episode is defined to be a partially ordered set of events for consecutive and fixed-time intervals in a sequence. Most of previous studies on episodes consider only frequent episodes in a sequence of events (called simple sequence). In real world, we may find a set of events at each time slot in terms of various intervals (hours, days, weeks, etc.). We refer to such sequences as complex sequences. Mining frequent episodes in complex sequences has more extensive applications than that in simple sequences. In this paper, we discuss the problem on mining frequent episodes in a complex sequence. We extend previous algorithm MINEPI to MINEPI+ for episode mining from complex sequences. Furthermore, a memory-anchored algorithm called EMMA is introduced for the mining task. Experimental evaluation on both real-world and synthetic data sets shows that EMMA is more efficient than MINEPI+.
Article
We study the problem of mining association rules and related time intervals, where an association rule holds either in all or some of the intervals. To restrict to meaningful time intervals, we use calendar schemas and their calendar-based patterns. A calendar schema example is (year, month, day) and a calendar-based pattern within the schema is (∗,3,15), which represents the set of time intervals each corresponding to the 15th day of a March. Our focus is finding efficient algorithms for this mining problem by extending the well-known Apriori algorithm with effective pruning techniques. We evaluate our techniques via experiments.
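The calendar-based pattern (∗,3,15) from the abstract can be sketched in Python as a tuple matched against dates. This is only an illustration of the pattern notion under the (year, month, day) schema; the names `WILDCARD`, `matches`, and `select_transactions` are hypothetical, and the Apriori extension and pruning techniques of the paper are not shown.

```python
from datetime import date
from typing import FrozenSet, List, Tuple

WILDCARD = "*"  # stands for "any value" in a calendar-based pattern


def matches(pattern: Tuple, d: date) -> bool:
    """True if date d falls in a time interval covered by `pattern`.

    `pattern` is a (year, month, day) tuple whose fields are either an
    integer or WILDCARD, following the schema in the abstract.
    """
    fields = (d.year, d.month, d.day)
    return all(p == WILDCARD or p == f for p, f in zip(pattern, fields))


def select_transactions(
    transactions: List[Tuple[date, FrozenSet[str]]], pattern: Tuple
) -> List[FrozenSet[str]]:
    """Keep only the item sets of transactions whose date matches the pattern."""
    return [items for d, items in transactions if matches(pattern, d)]


if __name__ == "__main__":
    march_15 = (WILDCARD, 3, 15)  # the 15th day of March in any year
    print(matches(march_15, date(2001, 3, 15)))
    print(matches(march_15, date(2001, 4, 15)))
```

Support and confidence of a rule could then be computed over the selected transactions, interval by interval, to decide whether the rule holds in all or only some of the covered intervals.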
Article
Previous sequential pattern mining studies have dealt with either point-based event sequences or interval-based event sequences. In some applications, however, event sequences may contain both point-based and interval-based events. These sequences are called hybrid event sequences. Since the relationships among both kinds of events are more diversiform, the information obtained by discovering patterns from these events is more informative. In this study we introduce a hybrid temporal pattern mining problem and develop an algorithm to discover hybrid temporal patterns from hybrid event sequences. We carry out an experiment using both synthetic and real stock price data to compare our algorithm with the traditional algorithms designed exclusively for mining point-based patterns or interval-based patterns. The experimental results indicate that the efficiency of our algorithm is satisfactory. In addition, the experiment also shows that the predicting power of hybrid temporal patterns is higher than that of point-based or interval-based patterns.
Conference Paper
Periodic-Frequent patterns are an important class of regularities that exist in a transactional database. A pattern is periodic-frequent if it satisfies both minimum support (minsup) and maximum periodicity (maxprd) constraints. Minsup constraint controls the minimum number of transactions that a pattern must cover in a database. Maxprd constraint controls the maximum duration between the two transactions below which a pattern should reoccur in a database. In the literature an approach has been proposed to extract periodic-frequent patterns using single minsup and single maxprd constraints. However, real-world databases are mostly non-uniform in nature containing both frequent and relatively infrequent (or rarely) occurring items. Researchers are making efforts to propose improved approaches for extracting frequent patterns that contain rare items as they contain useful knowledge. For mining periodic patterns that contain frequent and rare items we have to specify low minsup and high maxprd. It is difficult to mine periodic-frequent patterns because the low minsup and high maxprd can cause combinatorial explosion. In this paper we propose an improved approach which facilitates the user to specify different minsup and maxprd values for each pattern depending upon the items within it. Also, we present an efficient pattern growth approach and a methodology to dynamically specify maxprd for each pattern. Experimental results show that the proposed approach is efficient.
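The two constraints can be illustrated with a small Python sketch that computes the support and the periodicity of a pattern and then applies minsup and maxprd. It assumes integer transaction identifiers starting at 1 and, following a common convention in this literature, also counts the gap before the first occurrence and after the last one; both are assumptions, since the abstract does not spell them out. The function names are hypothetical, and the pattern-growth approach with per-pattern thresholds is not shown.

```python
from typing import Iterable, List, Set, Tuple


def support_and_periodicity(
    transactions: List[Tuple[int, Set[str]]], pattern: Iterable[str]
) -> Tuple[int, int]:
    """Return (support, periodicity) of `pattern`.

    `transactions` is a non-empty list of (tid, items) sorted by tid.
    Support is the number of transactions containing the pattern.
    Periodicity is the largest gap between consecutive occurrences,
    including the gaps from tid 0 to the first occurrence and from the
    last occurrence to the final tid (an assumed convention).
    """
    wanted = set(pattern)
    last_tid = transactions[-1][0]
    tids = [tid for tid, items in transactions if wanted <= items]
    if not tids:
        return 0, last_tid
    boundaries = [0] + tids + [last_tid]
    periods = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return len(tids), max(periods)


def is_periodic_frequent(transactions, pattern, minsup: int, maxprd: int) -> bool:
    """True if the pattern satisfies both the minsup and maxprd constraints."""
    sup, per = support_and_periodicity(transactions, pattern)
    return sup >= minsup and per <= maxprd


if __name__ == "__main__":
    db = [(1, {"a", "b"}), (2, {"a", "c"}), (3, {"b"}),
          (4, {"a", "b"}), (5, {"c"}), (6, {"a", "b", "c"})]
    print(support_and_periodicity(db, {"a"}))
    print(is_periodic_frequent(db, {"a", "b"}, minsup=3, maxprd=2))
```

With a single global pair of thresholds, a rare but regular item is easily lost, which is the motivation the abstract gives for item-dependent minsup and maxprd values.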
Conference Paper
In this paper, we define a new research problem for mining approximate repeating patterns (ARP) with gap constraints, where the appearance of a pattern is subject to an approximate matching, which is very common in biological sciences. To solve the problem, we propose an ArpGap (Approximate repeating pattern mining with Gap constraints) algorithm with three major components for approximate repeating pattern mining: (1) a data-driven pattern generation approach to avoid generating unnecessary patterns; (2) a back-tracking pattern search process to discover approximate occurrences of a pattern under gap constraints; and (3) an Apriori-like deterministic pruning approach to progressively prune patterns and cease the search process if necessary. Experimental results on synthetic and real-world protein sequences assert that ArpGap is efficient in terms of memory consumption and computational cost.
Conference Paper
Many events repeat themselves as the time goes by. For example, an institute pays its employees on the first day of every month. However, events may not repeat with a constant span of time. In the payday example here, the span of time between each two consecutive paydays ranges between 28 and 31 days. As a result, regularity, or temporal pattern, has to be captured with a use of granularities (such as day, week, month, and year), oftentimes multiple granularities. This paper defines the above patterns, and proposes a number of pattern discovery algorithms. To focus on the basics, the paper assumes that a list of events with their timestamps is given, and the algorithms try to find patterns for the events. All of the algorithms repeat two possibly interleaving steps, with the first step generating possible (called candidate) patterns, and the second step verifying if candidate patterns satisfy some user-given requirements. The algorithms use pruning techniques to reduce the number of candidate patterns, and adopt a data structure to efficiently implement the second step. Experiments show that the pruning techniques and the data structure are quite effective.
Conference Paper
Temporal periodicity of itemset appearance can be regarded as an important criterion for measuring the interestingness of itemsets in several applications. A frequent itemset is said to be periodic-frequent in a database if it appears at a regular interval given by the user. In this paper, we propose a concept of the approximate periodicity of each itemset. Moreover, a new tree-based data structure, called ITL-tree (Interval Transaction-ids List tree), is proposed. Our tree structure maintains an approximation of the occurrence information in a highly compact manner for periodic-frequent itemset mining. Pattern-growth mining is used to generate all periodic-frequent itemsets by a bottom-up traversal of the ITL-tree for user-given periodicity and support thresholds. The performance study shows that our data structure is very efficient for mining periodic-frequent itemsets with approximate periodicity results.
Conference Paper
An efficient algorithm with a worst-case time complexity of O(n logn) is proposed for detecting seasonal (calendar-based) periodicities of patterns in temporal datasets. Hierarchical data structures are used for representing the timestamps associated with the data. This representation facilitates the detection of different types of seasonal periodicities viz. yearly periodicities, monthly periodicities, daily periodicities etc. of patterns in the temporal dataset. The algorithm is tested with real-life data and the results are given.
Conference Paper
Periodic pattern mining aims to discover valid periodic patterns in a time-related dataset. Previous studies mostly concern synchronous periodic patterns, and many methods for mining periodic patterns have been proposed in the literature. Nevertheless, asynchronous periodic pattern mining has recently been receiving more and more attention. In this paper, we propose an efficient linked structure and the OEOP algorithm to discover all kinds of valid segments in each single event sequence. Then, referring to the general model of asynchronous periodic pattern mining proposed by Huang and Chang, we combine the valid segments found by OEOP into 1-patterns with multiple events, multiple patterns with multiple events, and asynchronous periodic patterns. In addition, we implement these algorithms on two real datasets. The experimental results show that these algorithms have good performance and scalability.
Conference Paper
In business applications, there has been tremendous interest in analysing customers' repeated purchase behaviour. Recently, the concepts of periodic pattern and cyclic pattern have been used to discover recurring patterns from customer sequence databases. Toroslu (2003) proposed cyclic pattern mining, which introduces a new parameter, named repetition support, into the mining process. In a customer sequence, the occurrence of a subsequence must satisfy a single user-specified minimum repetition support. In real-life applications, however, different items may have different frequencies in the database. If all items are set to have the same minimum repetition support, it may cause the rare item problem. To solve this problem, we include the concept of multiple minimum supports (MMS) to allow users to specify multiple minimum item repetition supports (MIR) according to the natures of items. In this paper, we first redefine cyclic sequential patterns based on MIR and the original form of customer minimum support. A new algorithm, rep-PrefixSpan, is developed to discover the complete set of cyclic sequential patterns from a sequence database. The experimental result shows that the proposed approach achieves more preferable findings than conventional cyclic pattern mining.
Conference Paper
The goal of discovering association rules is to discover all possible associations that accomplish certain restrictions (minimum support and confidence and interesting). However, it is possible to find interesting associations with a high confidence level but with little support. This problem is caused by the way support is calculated, as the denominator represents the total number of transactions in a time period when the involved items may have not existed. If, on the other hand, we limit the total transactions to the ones belonging to the items' lifetime, those associations would be now discovered, as they would count on enough support. Another difficulty is the large number of rules that could be generated, for which many solutions have been proposed. Using age as an obsolescence factor for rules helps reduce the number of rules to be presented to the user. In this paper we expand the notion of association rules incorporating time to the frequent itemsets discovered. The concept of temporal support is introduced and, as an example, the known algorithm Apriori is modified to incorporate the temporal notions.
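The effect of restricting the denominator to the items' lifetime can be shown with a short Python sketch. It assumes that an item's lifetime runs from its first to its last occurrence and that an itemset's lifetime is the intersection of its items' lifetimes; the abstract does not define these precisely, so both are assumptions, as is the function name `temporal_support`. The modified Apriori algorithm itself is not shown.

```python
from typing import Iterable, List, Set, Tuple


def temporal_support(transactions: List[Tuple[int, Set[str]]], itemset: Iterable[str]) -> float:
    """Support of `itemset` relative to the transactions in its lifetime.

    `transactions` is a list of (timestamp, items). The lifetime of an item
    is assumed to span its first to last occurrence; the lifetime of the
    itemset is the intersection of its items' lifetimes.
    """
    wanted = set(itemset)
    if not wanted:
        raise ValueError("itemset must not be empty")
    starts, ends = [], []
    for item in wanted:
        times = [t for t, items in transactions if item in items]
        if not times:
            return 0.0  # the item never occurs
        starts.append(min(times))
        ends.append(max(times))
    lo, hi = max(starts), min(ends)
    if lo > hi:
        return 0.0  # the items' lifetimes do not overlap
    in_lifetime = [items for t, items in transactions if lo <= t <= hi]
    hits = sum(1 for items in in_lifetime if wanted <= items)
    return hits / len(in_lifetime)


if __name__ == "__main__":
    db = [(1, {"a"}), (2, {"a"}), (3, {"a"}), (4, {"a"}),
          (5, {"a", "x"}), (6, {"x", "y"}), (7, {"a"}), (8, {"x", "y"})]
    classical = sum(1 for _, items in db if {"x", "y"} <= items) / len(db)
    print(classical, temporal_support(db, {"x", "y"}))
```

In this toy database the itemset {x, y} appears in 2 of 8 transactions overall, but in 2 of the 3 transactions that fall within its lifetime, so a support threshold it fails under the classical definition may be met under the temporal one.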
Conference Paper
The problem of partial periodic pattern mining in a discrete data sequence is to find subsequences that appear periodically and frequently in the data sequence. Two essential subproblems are the efficient mining of frequent patterns and the automatic discovery of periods that correspond to these patterns. Previous methods for this problem in event sequence databases assume that the periods are given in advance or require additional database scans to compute periods that define candidate patterns. In this work, we propose a new structure, the abbreviated list table (ALT), and several efficient algorithms to compute the periods and the patterns, that require only a small number of passes. A performance study is presented to demonstrate the effectiveness and efficiency of our method.
Conference Paper
Researchers have been endeavoring to discover concise sets of episode rules instead of complete sets in sequences. Existing approaches, however, are not able to process complex sequences and can not guarantee the accuracy of resulting sets due to the violation of anti-monotonicity of the frequency metric. In some real applications, episode rules need to be extracted from complex sequences in which multiple items may appear in a time slot. This paper investigates the discovery of concise episode rules in complex sequences. We define a concise representation called non-derivable episode rules and formularize the mining problem. Adopting a novel anti-monotonic frequency metric, we then develop a fast approach to discover non-derivable episode rules in complex sequences. Experimental results demonstrate that the utility of the proposed approach substantially reduces the number of rules and achieves fast processing.
Conference Paper
Temporal periodicity of patterns can be regarded as an important criterion for measuring the interestingness of frequent patterns in several applications. A frequent pattern can be said periodic-frequent if it appears at a regular interval. In this paper, we introduce the problem of mining the top-k periodic frequent patterns i.e. the periodic patterns with the k highest support. An efficient single-pass algorithm using a best-first search strategy without support threshold, called MTKPP (Mining Top-K Periodic-frequent Patterns), is proposed. Our experiments show that our proposal is efficient.