Frequent Pairs in Data Streams: Exploiting Parallelism and Skew.
DOI: 10.1109/ICDMW.2011.87 Conference: Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, Vancouver, BC, Canada, December 11, 2011
We introduce the Pair Streaming Engine (PairSE) that detects frequent pairs in a data stream of transactions. Our algorithm finds the most frequent pairs with high probability, and gives tight bounds on their frequency. It is particularly space efficient for skewed distribution of pair supports, confirmed for several real-world datasets. Additionally, the algorithm parallelizes easily, which opens up for real-time processing of large transactions. Unlike previous algorithms we make no assumptions on the order of arrival of transactions and pairs. Our algorithm builds upon approaches for frequent items mining in data streams. We show how to efficiently scale these approaches to handle large transactions. We report experimental results showcasing precision and recall of our method. In particular, we find that often our method achieves excellent precision, returning identical upper and lower bounds on the supports of the most frequent pairs.
Available from: Konstantin Kutzkov
[Show abstract] [Hide abstract]
ABSTRACT: A straightforward approach to frequent pairs mining in transactional streams is to generate all pairs occurring in transactions and apply a frequent items mining algorithm to the resulting stream. The well-known counter based algorithms Frequent and Space-Saving are known to achieve a very good approximation when the frequencies of the items in the stream adhere to a skewed distribution.
Motivated by observations on real datasets, we present a general technique for applying Frequent and Space-Saving to transactional data streams for the case when the transactions considerably vary in their lengths. Despite of its simplicity, we show through extensive experiments that our approach is considerably more efficient and precise than the naïve application of Frequent and Space-Saving.
Data provided are for informational purposes only. Although carefully collected, accuracy cannot be guaranteed. The impact factor represents a rough estimation of the journal's impact factor and does not reflect the actual current impact factor. Publisher conditions are provided by RoMEO. Differing provisions from the publisher's actual policy or licence agreement may be applicable.