Article

StreamDD: Stream clustering with Drift Detection

Article
Full-text available
In the case of real-world data streams, the underlying data distribution will not be static; it is subject to variation over time, which is the primary reason for concept drift. Concept drift poses severe problems to the accuracy of a model in online learning scenarios. A recurring concept is a particular case of concept drift in which concepts already seen in the past reappear as the stream evolves. This problem has not yet been studied in the context of stream clustering. This paper proposes a novel algorithm for identifying recurring concepts in data stream clustering. When a concept recurs, the most closely matching model is retrieved from a repository and reused. The algorithm has minimal memory requirements and works online with the stream. Some concepts and definitions already familiar from concept-recurrence studies in stream classification have been redefined for clustering. Experiments conducted on real and synthetic data streams reveal that the proposed algorithm has the potential to identify recurring concepts.
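The retrieval step lends itself to a small illustration. Below is a minimal, hypothetical Python sketch (not the authors' algorithm) of a bounded model repository that archives past clustering models as centroid sets and returns the closest stored model when a concept appears to recur; the index-wise centroid matching and the distance threshold are simplifying assumptions.

```python
# Hedged sketch (not the authors' implementation): a bounded repository that
# stores past clustering models and, on suspected recurrence, retrieves the
# stored model whose centroids best match the current summary.
import numpy as np

class ModelRepository:
    def __init__(self, max_models=20):
        self.models = []          # list of centroid arrays, one per stored concept
        self.max_models = max_models

    def store(self, centroids):
        """Archive the centroids of the current clustering model."""
        if len(self.models) >= self.max_models:
            self.models.pop(0)    # drop the oldest model to bound memory
        self.models.append(np.asarray(centroids, dtype=float))

    def best_match(self, centroids, threshold=1.0):
        """Return the stored model closest to the given centroids, or None."""
        centroids = np.asarray(centroids, dtype=float)
        best, best_dist = None, float("inf")
        for m in self.models:
            if m.shape != centroids.shape:
                continue
            # average distance between corresponding centroids (simplistic matching)
            d = np.linalg.norm(m - centroids, axis=1).mean()
            if d < best_dist:
                best, best_dist = m, d
        return best if best_dist <= threshold else None
```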
Article
Full-text available
Streaming data are increasingly present in real-world applications such as sensor measurements, satellite data feed, stock market, and financial data. The main characteristics of these applications are the online arrival of data observations at high speed and the susceptibility to changes in the data distributions due to the dynamic nature of real environments. The data stream mining community still faces some primary challenges and difficulties related to the comparison and evaluation of new proposals, mainly due to the lack of publicly available high quality non-stationary real-world datasets. The comparison of stream algorithms proposed in the literature is not an easy task, as authors do not always follow the same recommendations, experimental evaluation procedures, datasets, and assumptions. In this paper, we mitigate problems related to the choice of datasets in the experimental evaluation of stream classifiers and drift detectors. To that end, we propose a new public data repository for benchmarking stream algorithms with real-world data. This repository contains the most popular datasets from literature and new datasets related to a highly relevant public health problem that involves the recognition of disease vector insects using optical sensors. The main advantage of these new datasets is the prior knowledge of their characteristics and patterns of changes to adequately evaluate new adaptive algorithms. We also present an in-depth discussion about the characteristics, reasons, and issues that lead to different types of changes in data distribution, as well as a critical review of common problems concerning the current benchmark datasets available in the literature.
Article
Full-text available
Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. In this context, several data stream clustering algorithms have been proposed to perform unsupervised learning. Nevertheless, data stream clustering imposes several challenges to be addressed, such as dealing with nonstationary, unbounded data that arrive in an online fashion. The intrinsic nature of stream data requires the development of algorithms capable of performing fast and incremental processing of data objects, suitably addressing time and memory limitations. In this article, we present a survey of data stream clustering algorithms, providing a thorough discussion of the main design components of state-of-the-art algorithms. In addition, this work addresses the temporal aspects involved in data stream clustering, and presents an overview of the usually employed experimental methodologies. A number of references are provided that describe applications of data stream clustering in different domains, such as network intrusion detection, sensor networks, and stock market analysis. Information regarding software packages and data repositories is also provided to help researchers and practitioners. Finally, some important issues and open questions that can be the subject of future research are discussed.
Article
Full-text available
An important problem in processing large data streams is detecting changes in the underlying distribution that generates the data. The challenge in designing change detection schemes is making them general, scalable, and statistically sound. In this paper, we take a general, information-theoretic approach to the change detection problem, which works for multidimensional as well as categorical data. We use relative entropy, also called the Kullback-Leibler distance, to measure the difference between two given distributions. The KL-distance is known to be related to the optimal error in determining whether the two distributions are the same and draws on fundamental results in hypothesis testing. The KL-distance also generalizes traditional distance measures in statistics, and has invariance properties that make it ideally suited for comparing distributions. Our scheme is general; it is nonparametric and requires no assumptions on the underlying distributions. It employs a statistical inference procedure based on the theory of bootstrapping, which allows us to determine whether our measurements are statistically significant. The scheme is also quite flexible from a practical perspective; it can be implemented using any spatial partitioning scheme that scales well with dimensionality. In addition to providing change detections, our method generalizes Kulldorff's spatial scan statistic, allowing us to quantitatively identify specific regions in space where large changes have occurred. We provide a detailed experimental study that demonstrates the generality and efficiency of our approach with different kinds of multidimensional datasets, both synthetic and real.
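As an illustration of this general idea, the following Python sketch estimates the KL distance between a reference window and a current window by discretizing both onto common histogram bins, and uses bootstrap resampling from the pooled data to judge significance. It is restricted to one-dimensional numeric data for brevity (the paper's scheme handles multidimensional data via spatial partitioning), and the bin count, bootstrap size, and significance level are illustrative.

```python
# Hedged sketch of the general idea: KL distance between two windows over
# shared histogram bins, with a bootstrap test of significance.
import numpy as np

def kl_distance(p, q, eps=1e-9):
    p = p + eps
    q = q + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def detect_change(reference, current, n_bins=20, n_boot=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=n_bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    observed = kl_distance(p.astype(float), q.astype(float))

    # Bootstrap: under the null hypothesis both windows come from the same
    # distribution, so resample from the pooled data and recompute the distance.
    pooled = np.concatenate([reference, current])
    null_dists = []
    for _ in range(n_boot):
        a = rng.choice(pooled, size=len(reference), replace=True)
        b = rng.choice(pooled, size=len(current), replace=True)
        pa, _ = np.histogram(a, bins=edges)
        qb, _ = np.histogram(b, bins=edges)
        null_dists.append(kl_distance(pa.astype(float), qb.astype(float)))
    p_value = float(np.mean(np.array(null_dists) >= observed))
    return observed, p_value, p_value < alpha
```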
Article
Full-text available
An emerging problem in data streams is the detection of concept drift. This problem is aggravated when the drift is gradual over time. In this work we define a method for detecting concept drift, even in the case of slow gradual change. It is based on the estimated distribution of the distances between classification errors. The proposed method can be used with any learning algorithm in two ways: as a wrapper around a batch learning algorithm, or implemented inside an incremental and online algorithm. The experimental results compare our method (EDDM) with a similar one (DDM); the latter uses the error rate instead of the distance error rate.
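A rough Python sketch of this distance-between-errors idea (in the spirit of EDDM, not its reference implementation) is given below; the warning and drift thresholds and the minimum number of errors are illustrative values.

```python
# Hedged sketch: track the running mean and standard deviation of the distance
# (in examples) between consecutive misclassifications; when mean + 2*std drops
# well below its recorded maximum, errors are getting closer together and drift
# is suspected. Thresholds are illustrative.
class DistanceErrorDriftSketch:
    WARNING, DRIFT, STABLE = "warning", "drift", "stable"

    def __init__(self, alpha=0.95, beta=0.90, min_errors=30):
        self.alpha, self.beta, self.min_errors = alpha, beta, min_errors
        self.reset()

    def reset(self):
        self.t = 0                 # examples seen since the last error
        self.n_errors = 0
        self.mean = 0.0            # running mean of distances between errors
        self.m2 = 0.0              # running sum of squared deviations (Welford)
        self.max_level = 0.0

    def update(self, is_error):
        self.t += 1
        if not is_error:
            return self.STABLE
        d = self.t                 # distance to the previous error
        self.t = 0
        self.n_errors += 1
        delta = d - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (d - self.mean)
        std = (self.m2 / self.n_errors) ** 0.5
        level = self.mean + 2.0 * std
        self.max_level = max(self.max_level, level)
        if self.n_errors < self.min_errors or self.max_level == 0.0:
            return self.STABLE
        ratio = level / self.max_level
        if ratio < self.beta:
            return self.DRIFT
        if ratio < self.alpha:
            return self.WARNING
        return self.STABLE
```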
Conference Paper
Full-text available
Most of the work in Machine Learning assumes that examples are generated at random according to some stationary probability distribution. In this work we study the problem of learning when the distribution that generates the examples changes over time. We present a method for detection of changes in the probability distribution of examples. The idea behind the drift detection method is to monitor the online error rate of a learning algorithm, looking for significant deviations. The method can be used as a wrapper over any learning algorithm. In most problems, a change affects only some regions of the instance space, not the instance space as a whole. In decision models that fit different functions to regions of the instance space, like Decision Trees and Rule Learners, the method can be used to monitor the error in regions of the instance space, with the advantage of fast model adaptation. In this work we present experiments using the method as a wrapper over a decision tree and a linear model, and in each internal node of a decision tree. The results obtained in controlled experiments using artificial data and a real-world problem show good performance in detecting drift and in adapting the decision model to the new concept.
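The error-rate monitoring described here can be sketched in a few lines; the following is a simplified DDM-style monitor, not the original implementation, and the warm-up length is an assumption.

```python
# Hedged sketch of the drift detection idea described above: monitor the online
# error rate p and its standard deviation s; when p + s rises well above the
# best level recorded so far, signal a warning or a drift.
class ErrorRateDriftSketch:
    WARNING, DRIFT, STABLE = "warning", "drift", "stable"

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, is_error):
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = (p * (1 - p) / self.n) ** 0.5
        if self.n < self.min_samples:
            return self.STABLE
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s      # remember the best level seen
        if p + s >= self.p_min + 3 * self.s_min:
            return self.DRIFT                  # significant degradation
        if p + s >= self.p_min + 2 * self.s_min:
            return self.WARNING                # possible drift: buffer new data
        return self.STABLE
```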
Conference Paper
Full-text available
Clustering is an important task in mining evolving data streams. Besides the limited-memory and one-pass constraints, the nature of evolving data streams implies the following requirements for stream clustering: no assumption on the number of clusters, discovery of clusters with arbitrary shape, and the ability to handle outliers. While many clustering algorithms for data streams have been proposed, they offer no solution to the combination of these requirements. In this paper, we present DenStream, a new approach for discovering clusters in an evolving data stream. The "dense" micro-cluster (named core-micro-cluster) is introduced to summarize the clusters with arbitrary shape, while the potential core-micro-cluster and outlier micro-cluster structures are proposed to maintain and distinguish the potential clusters and outliers. A novel pruning strategy is designed based on these concepts, which guarantees the precision of the weights of the micro-clusters with limited memory. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our approach.
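The micro-cluster summaries at the heart of this approach can be illustrated with a small sketch. The Python class below is a simplification, not DenStream itself: it keeps a fading weight, a weighted linear sum, and a weighted squared sum per micro-cluster, decayed by 2^(-λ·Δt); the decay rate is illustrative.

```python
# Hedged sketch of a decaying micro-cluster summary in the spirit described
# above: weight, linear sum, and squared sum of absorbed points, all faded
# over time so that older observations count for less.
import numpy as np

class MicroCluster:
    def __init__(self, point, timestamp, decay=0.25):
        point = np.asarray(point, dtype=float)
        self.decay = decay
        self.last_update = timestamp
        self.weight = 1.0
        self.linear_sum = point.copy()
        self.squared_sum = point ** 2

    def _fade(self, timestamp):
        # exponential decay 2^(-lambda * dt) applied to all statistics
        factor = 2.0 ** (-self.decay * (timestamp - self.last_update))
        self.weight *= factor
        self.linear_sum *= factor
        self.squared_sum *= factor
        self.last_update = timestamp

    def insert(self, point, timestamp):
        point = np.asarray(point, dtype=float)
        self._fade(timestamp)
        self.weight += 1.0
        self.linear_sum += point
        self.squared_sum += point ** 2

    def center(self):
        return self.linear_sum / self.weight

    def radius(self):
        # RMS deviation of absorbed points from the current center
        variance = self.squared_sum / self.weight - self.center() ** 2
        return float(np.sqrt(np.maximum(variance, 0.0).sum()))
```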
Book
Full-text available
This book is downloadable from http://www.irisa.fr/sisthem/kniga/. Many monitoring problems can be stated as the problem of detecting a change in the parameters of a static or dynamic stochastic system. The main goal of this book is to describe a unified framework for the design and the performance analysis of algorithms for solving these change detection problems. The book also contains the key mathematical background necessary for this purpose. Finally, links with the analytical redundancy approach to fault detection in linear systems are established. We call an abrupt change any change in the parameters of the system that occurs either instantaneously or at least very fast with respect to the sampling period of the measurements. Abrupt changes by no means refer to changes of large magnitude; on the contrary, in most applications the main problem is to detect small changes. Moreover, in some applications, the early warning of small (and not necessarily fast) changes is of crucial interest in order to avoid the economic or even catastrophic consequences that can result from an accumulation of such small changes. For example, small faults arising in the sensors of a navigation system can result, through the underlying integration, in serious errors in the estimated position of the plane. Another example is the early warning of small deviations from the normal operating conditions of an industrial process. The early detection of slight changes in the state of the process makes it possible to plan more adequately the periods during which the process should be inspected and possibly repaired, and thus to reduce operating costs.
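One classical scheme in this family, introduced in Page's "Continuous Inspection Schemes" cited below, is the cumulative-sum (CUSUM) test. A minimal one-sided Python sketch follows; the drift allowance and alarm threshold are illustrative, not values prescribed by the book.

```python
# Hedged sketch of a one-sided CUSUM test: accumulate deviations of each
# observation above a reference mean and signal a change when the cumulative
# sum exceeds a threshold. Parameters below are illustrative.
class CusumSketch:
    def __init__(self, target_mean, drift=0.5, threshold=5.0):
        self.target_mean = target_mean
        self.drift = drift          # slack: deviations smaller than this are ignored
        self.threshold = threshold  # alarm level for the cumulative sum
        self.cumsum = 0.0

    def update(self, x):
        self.cumsum = max(0.0, self.cumsum + (x - self.target_mean) - self.drift)
        if self.cumsum > self.threshold:
            self.cumsum = 0.0       # restart the statistic after an alarm
            return True             # change detected
        return False
```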
Article
Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding and adaptation. Data analysis has revealed that machine learning in a concept drift environment will yield poor results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, a high-quality, instructive review of current research developments and trends in the concept drift field is needed. In addition, due to the rapid development of concept drift research in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in the literature. This paper reviews over 130 high-quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aimed at handling concept drift. Concept drift related research directions are also covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.
Article
In our society, many fields produce large numbers of data streams, and mining interesting knowledge and patterns from continuous data streams has become a problem that must be solved. Unlike conventional classification algorithms, data stream classification algorithms have to adjust their classification models as the data stream changes because of concept drift, whereas conventional classification models remain static once trained. To address this problem, a dynamic extreme learning machine for data stream classification (DELM) is proposed. DELM uses an online learning mechanism to train an ELM as the base classifier and trains a double-hidden-layer structure to improve the performance of the ELM. When an alert about concept drift is raised, more hidden-layer nodes are added to the ELM to improve the generalization ability of the classifier. If the value measuring concept drift reaches the upper limit or the accuracy of the ELM falls to a low level, the current classifier is deleted and the algorithm trains a new classifier on new data so as to learn the new concept. The experimental results show that DELM improves classification accuracy and can adapt to a new concept in a short time.
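The growth-on-alert policy can be illustrated with a toy extreme learning machine. The Python sketch below is not the authors' DELM code: it is a single-hidden-layer ELM whose hidden layer is widened with fresh random nodes when a mild drift alert fires and rebuilt from scratch on a severe one; the layer sizes, seeds, and the severity rule are assumptions.

```python
# Hedged sketch (not the authors' DELM): a minimal ELM whose hidden layer can
# be widened on a drift alert. Output weights are recomputed by least squares.
import numpy as np

class GrowableELM:
    def __init__(self, n_inputs, n_outputs, n_hidden=20, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_inputs, self.n_outputs = n_inputs, n_outputs
        self.W = self.rng.normal(size=(n_inputs, n_hidden))   # random input weights
        self.b = self.rng.normal(size=n_hidden)                # random biases
        self.beta = np.zeros((n_hidden, n_outputs))            # output weights

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        # Y is expected as a 2-D array (n_samples, n_outputs), e.g. one-hot labels
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ Y    # least-squares output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

    def add_hidden_nodes(self, k):
        # widen the hidden layer with fresh random nodes, keeping existing ones
        self.W = np.hstack([self.W, self.rng.normal(size=(self.n_inputs, k))])
        self.b = np.concatenate([self.b, self.rng.normal(size=k)])
        self.beta = np.vstack([self.beta, np.zeros((k, self.n_outputs))])

def on_drift_alert(model, X_recent, Y_recent, severe, grow_by=10):
    """Grow the hidden layer on a mild alert; rebuild the model on a severe one."""
    if severe:
        model = GrowableELM(model.n_inputs, model.n_outputs)
    else:
        model.add_hidden_nodes(grow_by)
    return model.fit(X_recent, Y_recent)
```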
Article
Incremental and online learning algorithms are increasingly relevant in the data mining context because of the growing necessity to process data streams. In this context, the target function may change over time, an inherent problem of online learning known as concept drift. In order to handle concept drift regardless of the learning model, we propose new methods that monitor the performance metrics measured during the learning process and trigger drift signals when a significant variation is detected. To monitor this performance, we apply probability inequalities that assume only independent, univariate and bounded random variables, obtaining theoretical guarantees for the detection of such distributional changes. Some common restrictions for online change detection as well as relevant types of change (abrupt and gradual) are considered. Two main approaches are proposed: the first involves moving averages and is more suitable for detecting abrupt changes; the second follows a widespread intuitive idea to deal with gradual changes using weighted moving averages. The simplicity of the proposed methods, together with their computational efficiency, makes them very advantageous. We use a Naïve Bayes classifier and a Perceptron to evaluate the performance of the methods over synthetic and real data.
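A minimal sketch of the moving-average variant, assuming a performance indicator bounded in [0, 1] (such as the 0/1 prediction error), is shown below; it compares the mean over a recent window with the overall mean using a Hoeffding-style bound, with an illustrative window size and confidence rather than the authors' settings.

```python
# Hedged sketch: signal drift when the recent mean of a [0,1]-bounded indicator
# exceeds the overall mean by more than a Hoeffding-style bound.
import math
from collections import deque

class HoeffdingDriftSketch:
    def __init__(self, window=100, delta=0.001):
        self.window = deque(maxlen=window)
        self.delta = delta
        self.total_sum = 0.0
        self.total_n = 0

    def update(self, value):
        """value must lie in [0, 1], e.g. 1 for a misclassification, 0 otherwise."""
        self.total_sum += value
        self.total_n += 1
        self.window.append(value)
        n1, n2 = self.total_n, len(self.window)
        if n2 < self.window.maxlen:
            return False
        overall_mean = self.total_sum / n1
        recent_mean = sum(self.window) / n2
        # Hoeffding bound on the difference of two means of [0,1]-bounded variables
        eps = math.sqrt(0.5 * (1.0 / n1 + 1.0 / n2) * math.log(1.0 / self.delta))
        if recent_mean - overall_mean > eps:
            # reset global statistics so monitoring restarts on the new concept
            self.total_sum = sum(self.window)
            self.total_n = n2
            return True
        return False
```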
Article
Detecting changes of concepts, such as a change of customer preference for telecom services, is very important in terms of prediction and decision applications in dynamic environments. In particular, for case-based reasoning systems, it is important to know when and how concept drift can effectively assist decision makers to perform smarter maintenance operations at an appropriate time. This paper presents a novel method for detecting concept drift in a case-based reasoning system. Rather than measuring the actual case distribution, we introduce a new competence model that detects differences through changes in competence. Our competence-based concept detection method requires no prior knowledge of case distribution and provides statistical guarantees on the reliability of the changes detected, as well as meaningful descriptions and quantification of these changes. This research concludes that changes in data distribution do reflect upon competence. Eight sets of experiments under three categories demonstrate that our method effectively detects concept drift and highlights drifting competence areas accurately. These results directly contribute to the research that tackles concept drift in case-based reasoning, and to competence model studies.
Conference Paper
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
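The seeding technique reads directly as a short routine. The Python sketch below follows the description above (pick the first center uniformly at random, then each further center with probability proportional to its squared distance to the nearest chosen center); it is meant as an illustration rather than a reference implementation.

```python
# Hedged sketch of distance-squared-weighted seeding for k-means.
import numpy as np

def kmeans_pp_seeds(X, k, rng=None):
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]          # first center: uniform at random
    for _ in range(1, k):
        # squared distance from each point to its closest already-chosen center
        d2 = np.min(
            np.stack([np.sum((X - c) ** 2, axis=1) for c in centers]), axis=0
        )
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

# Example: seed centers for a toy 2-D dataset and hand them to any k-means loop.
# X = np.random.default_rng(0).normal(size=(500, 2))
# init = kmeans_pp_seeds(X, k=5)
```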
Article
Clustering streaming data requires algorithms that are capable of updating clustering results for the incoming data. As data is constantly arriving, time for processing is limited. Clustering has to be performed in a single pass over the incoming data and within the possibly varying inter-arrival times of the stream. Likewise, memory is limited, making it impossible to store all data. For clustering, we are faced with the challenge of maintaining a current result that can be presented to the user at any given time. In this work, we propose a parameter-free algorithm that automatically adapts to the speed of the data stream. It makes best use of the time available under the current constraints to provide a clustering of the objects seen up to that point. Our approach incorporates the age of the objects to reflect the greater importance of more recent data. For efficient and effective handling, we introduce the ClusTree, a compact and self-adaptive index structure for maintaining stream summaries. Additionally we present solutions to handle very fast streams through aggregation mechanisms and propose novel descent strategies that improve the clustering result on slower streams as long as time permits. Our experiments show that our approach is capable of handling a multitude of different stream characteristics for accurate and scalable anytime stream clustering.
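The anytime-insertion idea can be illustrated independently of the full index structure. The Python sketch below is not the ClusTree itself: it descends a hierarchy of cluster summaries toward the closest leaf but, if the per-object time budget expires, parks the object as a buffered aggregate at the node reached so far; the node layout and time budget are assumptions.

```python
# Hedged sketch of anytime insertion into a hierarchy of cluster summaries:
# greedy descent toward the closest child until the time budget runs out.
import time
import numpy as np

class Node:
    def __init__(self, center, children=None):
        self.center = np.asarray(center, dtype=float)
        self.children = children or []      # empty list means leaf
        self.buffer = []                    # objects parked here when time ran out

def anytime_insert(root, x, time_budget_s):
    x = np.asarray(x, dtype=float)
    deadline = time.perf_counter() + time_budget_s
    node = root
    while node.children:
        if time.perf_counter() >= deadline:
            node.buffer.append(x)           # park the object; merge it later
            return node
        # greedy descent: follow the child whose summary center is closest
        node = min(node.children, key=lambda c: np.linalg.norm(c.center - x))
    node.buffer.append(x)                   # reached a leaf within the budget
    return node
```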
Aggarwal: A Framework for Clustering Evolving Data Streams. In: Proceedings of the 2003 VLDB Conference.
Ackermann, M., Lammersen, C., Märtens, M., Raupach, C., Sohler, C., Swierkot, K.: StreamKM++: A Clustering Algorithm for Data Streams.
Page: Continuous Inspection Schemes.