Conference Paper

Data stream generation through real concept's interpolation

... The datasets used in the stream generation process are presented in Table 1. The generation procedure is based on the temporal interpolation of normalized random projections of the original data set into a transformed problem space, which, depending on the interpolation, may generate sudden drifts (nearest-neighbor interpolation) and incremental drifts (cubic interpolation) [47]. ...
... The purpose of the third experiment was to evaluate the performance of sdde on streams generated from real concepts using the random projection-based concept drift injector proposed by Komorniczak et al. [50]. The method converts real static datasets into data streams with concept drifts of the nearest-neighbor and cubic interpolation types, which correspond to sudden and incremental drifts, respectively. ...
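The interpolation idea described above can be sketched in a few lines. This is a simplified illustration, not the injector of Komorniczak et al.: the real procedure works on normalized random projections of the original dataset, whereas here two fixed projection matrices stand in for the start and end concepts, and a smoothstep curve stands in for cubic interpolation.

```python
import random

def random_projection(n_features, n_components, rng):
    """One random projection matrix (list of rows), a stand-in for a 'concept'."""
    return [[rng.gauss(0, 1) for _ in range(n_components)]
            for _ in range(n_features)]

def blend(a, b, w):
    """Element-wise convex combination of two projection matrices."""
    return [[(1 - w) * x + w * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def concept_at(t, start, end, kind="nearest"):
    """Projection in effect at time t in [0, 1].

    'nearest' switches abruptly at t = 0.5 -> sudden drift;
    'smooth' blends along a cubic smoothstep -> incremental drift.
    """
    if kind == "nearest":
        return start if t < 0.5 else end
    w = 3 * t ** 2 - 2 * t ** 3  # cubic smoothstep weight
    return blend(start, end, w)
```

A stream would then be produced by projecting each incoming sample through `concept_at(t, ...)` for its timestamp `t`, so the effective feature space changes abruptly or gradually depending on the interpolation kind.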
Article
Among the difficulties being considered in data stream processing, a particularly interesting one is the phenomenon of concept drift. Methods of concept drift detection are frequently used to eliminate its negative impact on the quality of classification in the environment of evolving concepts. This article proposes Statistical Drift Detection Ensemble (sdde), a novel method of concept drift detection. The method uses drift magnitude and conditioned marginal covariate drift measures, analyzed by an ensemble of detectors whose members focus on random subspaces of the stream’s features. The proposed detector was compared with state-of-the-art methods on both synthetic data streams and semi-synthetic streams generated from real-world concepts. A series of computer experiments and a statistical analysis of the results, covering both classification accuracy and drift detection errors, were carried out and confirmed the effectiveness of the proposed method.
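The random-subspace idea can be sketched as follows. This is a deliberately simplified stand-in, not the authors' implementation: each member here compares per-feature chunk means against a reference chunk, whereas the real sdde members analyze drift magnitude and conditioned marginal covariate drift measures.

```python
import random

def chunk_means(chunk, features):
    """Per-feature means of a chunk (list of rows) restricted to a subspace."""
    return [sum(row[f] for row in chunk) / len(chunk) for f in features]

class SubspaceDriftEnsemble:
    """Each member watches a random feature subspace; the ensemble
    signals drift by majority vote over the members' alarms."""

    def __init__(self, n_features, n_members=5, subspace_size=2,
                 threshold=0.5, seed=0):
        rng = random.Random(seed)
        self.subspaces = [sorted(rng.sample(range(n_features), subspace_size))
                          for _ in range(n_members)]
        self.threshold = threshold
        self.reference = None  # per-member reference means

    def process(self, chunk):
        means = [chunk_means(chunk, s) for s in self.subspaces]
        if self.reference is None:
            self.reference = means  # first chunk only sets the reference
            return False
        alarms = [any(abs(m - r) > self.threshold for m, r in zip(cur, ref))
                  for cur, ref in zip(means, self.reference)]
        drift = sum(alarms) > len(alarms) / 2
        if drift:
            self.reference = means  # rebuild the reference after a drift
        return drift
```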
... The datasets used in the stream generation process are presented in Table 1. The generation procedure is based on the temporal interpolation of normalized random projections of the original data set into a given subspace, which, depending on the interpolation, may generate sudden drifts (nearest-neighbor interpolation) and incremental drifts (cubic interpolation) [30]. Finally, the experiments were also carried out using 9 real benchmark data streams [13,60], presented in Table 2. Multiclass streams have been binarized and the longest possible fragments have been cut from them, ensuring both classes are present in data chunks containing 250 instances. ...
Preprint
Full-text available
One of the significant problems of streaming data classification is the occurrence of concept drift, consisting of a change in the probabilistic characteristics of the classification task. This phenomenon destabilizes the performance of the classification model and seriously degrades its quality. An appropriate strategy counteracting this phenomenon is required to adapt the classifier to the changing probabilistic characteristics. One of the significant problems in implementing such a solution is access to data labels, which is usually costly. To minimize the expenses related to this process, learning strategies based on semi-supervised learning are proposed, e.g., employing active learning methods that indicate which of the incoming objects are worth labeling to improve the classifier's performance. This paper proposes a novel chunk-based method for non-stationary data streams based on classifier ensemble learning and an active learning strategy considering a limited budget, which can be successfully applied to any data stream classification algorithm. The proposed method has been evaluated through computer experiments using both real and generated data streams. The results confirm the advantage of the proposed algorithm over state-of-the-art methods.
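Budget-constrained label selection on a chunk can be sketched as follows. This is an illustrative uncertainty-sampling variant, not the paper's exact strategy; `chunk_probs` stands for the probability output of any binary classifier.

```python
def select_for_labeling(chunk_probs, budget):
    """Pick indices of the instances the classifier is least certain about,
    spending at most `budget` labels per chunk.

    chunk_probs: list of P(class 1) for each instance in the chunk.
    """
    # uncertainty = distance of the posterior from the 0.5 decision boundary
    ranked = sorted(range(len(chunk_probs)),
                    key=lambda i: abs(chunk_probs[i] - 0.5))
    return ranked[:budget]
```

Only the selected instances would be sent to an oracle for labeling and then used to update the ensemble, keeping the labeling cost per chunk bounded.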
Article
Full-text available
The significant growth of interconnected Internet‐of‐Things (IoT) devices, the use of social networks, along with the evolution of technology in different domains, lead to a rise in the volume of data generated continuously from multiple systems. Valuable information can be derived from these evolving data streams by applying machine learning. In practice, several critical issues emerge when extracting useful knowledge from these potentially infinite data, mainly because of their evolving nature and high arrival rate, which implies an inability to store them entirely. In this work, we provide a comprehensive survey that discusses the research constraints and the current state‐of‐the‐art in this vibrant framework. Moreover, we present an updated overview of the latest contributions proposed in different stream mining tasks, particularly classification, regression, clustering, and frequent patterns.
Chapter
Full-text available
The nature of the analysed data may cause difficulty in many practical data mining tasks. This work focuses on two important research topics associated with data analysis, i.e., data stream classification and data analysis with imbalanced class distributions. We propose a novel classification method, employing a classifier selection approach, which can update its model when new data arrive. The proposed approach has been evaluated on the basis of computer experiments carried out on a diverse pool of non-stationary data streams. Their results confirmed the usefulness of the proposed concept, which can outperform state-of-the-art classifier selection algorithms, especially in the case of highly imbalanced data streams.
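Chunk-wise classifier selection can be sketched as follows. This is a minimal illustration, not the paper's method: each pool member is scored on the newest labelled chunk with balanced accuracy, a metric robust to class imbalance, and the winner serves predictions until the next chunk arrives.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; robust to class imbalance."""
    classes = set(y_true)
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def select_classifier(pool, chunk_X, chunk_y):
    """Return the pool member scoring best on the newest labelled chunk.

    Pool members are modelled as callables mapping a sample to a label.
    """
    return max(pool, key=lambda clf: balanced_accuracy(
        chunk_y, [clf(x) for x in chunk_X]))
```

On an imbalanced chunk, a member that always predicts the majority class scores only 0.5 here despite high plain accuracy, so the selection favours members that also recognise the minority class.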
Conference Paper
Full-text available
When a new concept drift detection method is proposed, a common way to show its benefits is to use a classifier in an evaluation where, each time the new algorithm detects a change, the current classifier is replaced by a new one. Accuracy in this setting is considered a good measure of the quality of the change detector. In this paper we claim that this is not a good evaluation methodology, and we show how a non-change detector can improve the accuracy of the classifier in this setting. We claim that this is due to the existence of temporal dependence in the data, and we propose not to evaluate concept drift detectors using only classifiers.
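The non-change detector used in this argument is trivial to sketch (illustrative; the period is arbitrary): it ignores the data entirely and still triggers periodic classifier resets, which can raise accuracy on streams with temporal dependence.

```python
class NoChangeDetector:
    """Signals 'drift' every `period` instances without looking at the data."""

    def __init__(self, period=60):
        self.period = period
        self.seen = 0

    def add_element(self, _value):
        """Ingest one observation; True means 'reset the classifier now'."""
        self.seen += 1
        return self.seen % self.period == 0
```

That such a detector can win under the replace-on-detection protocol is exactly why accuracy alone is a misleading quality measure for change detectors.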
Article
Full-text available
Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Conference Paper
Full-text available
We present a new approach for dealing with distribution change and concept drift when learning from data sequences that may vary with time. We use sliding windows whose size, instead of being fixed a priori, is recomputed online according to the rate of change observed from the data in the window itself. This relieves the user or programmer from having to guess a time-scale for change. Contrary to many related works, we provide rigorous guarantees of performance, as bounds on the rates of false positives and false negatives. Using ideas from data stream algorithmics, we develop a time- and memory-efficient version of this algorithm, called ADWIN2. We show how to combine ADWIN2 with the Naïve Bayes (NB) predictor, in two ways: one, using it to monitor the error rate of the current model and declare when revision is necessary and, two, putting it inside the NB predictor to maintain up-to-date estimations of conditional probabilities in the data. We test our approach using synthetic and real data streams and compare them to both fixed-size and variable-size window strategies with good results.
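The core window-shrinking mechanism can be sketched as follows. This is a fixed-threshold caricature of ADWIN, not the published algorithm: the real method derives its cut threshold from a Hoeffding-style bound and checks candidate split points with an efficient exponential-histogram structure.

```python
class SimpleAdaptiveWindow:
    """Keeps a window of recent values; repeatedly drops the older part
    whenever the means of two sub-windows differ by more than `threshold`."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.window = []

    def add(self, value):
        """Insert a value; return True if a change was detected."""
        self.window.append(value)
        changed = False
        shrunk = True
        while shrunk and len(self.window) > 1:
            shrunk = False
            for cut in range(1, len(self.window)):
                old, new = self.window[:cut], self.window[cut:]
                gap = abs(sum(old) / len(old) - sum(new) / len(new))
                if gap > self.threshold:
                    self.window = new  # discard the outdated sub-window
                    changed = shrunk = True
                    break
        return changed
```

After a detected change, only the recent sub-window survives, so the window size adapts to the observed rate of change instead of being fixed a priori.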
Article
stream-learn is a Python package compatible with scikit-learn and developed for drifting and imbalanced data stream analysis. Its main component is a stream generator, which allows producing a synthetic data stream that may incorporate each of the three main concept drift types (i.e., sudden, gradual and incremental drift) in their recurring or non-recurring version, as well as static and dynamic class imbalance. The package allows conducting experiments following established evaluation methodologies (i.e., Test-Then-Train and Prequential). Besides, estimators adapted for data stream classification have been implemented, including both simple classifiers and state-of-the-art chunk-based and online classifier ensembles. The package utilises its own implementations of prediction metrics for imbalanced binary classification tasks to improve computational efficiency.
Article
Many researchers working on classification problems evaluate the quality of developed algorithms based on computer experiments. The conclusions drawn from them are usually supported by statistical analysis and a chosen experimental protocol. Statistical tests are widely used to confirm whether the considered methods significantly outperform reference classifiers. Usually, the tests are applied to stratified datasets, which raises the question of whether the data folds used for classification are really randomly drawn and whether the statistical analysis supports robust conclusions. Unfortunately, some scientists do not realize the real meaning of the obtained results and overinterpret them. They do not see that inappropriate use of such analytical tools may lead them into a trap. This paper aims to show the weaknesses of commonly used experimental protocols and to discuss whether we can really trust such evaluation methodology, whether all presented evaluations are fair, and whether it is possible to manipulate experimental results using well-known statistical evaluation methods. We will show that it is possible to select only those results that confirm the experimenter's expectations. We will try to show what can be done to avoid such likely unethical behavior. At the end of this work, we formulate recommendations on improving an experimental protocol to design a fair experimental classifier evaluation.
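The kind of cherry-picking the paper warns about can be demonstrated with a synthetic illustration (not taken from the paper): two classifiers with identical true accuracy are "compared" over many random seeds, and a motivated experimenter reports only the most favourable run.

```python
import random

def simulated_accuracy_gap(seed, n=100, p=0.8):
    """Accuracy difference between two classifiers that are equally good:
    each per-sample decision is an independent coin flip with success rate p."""
    rng = random.Random(seed)
    acc_a = sum(rng.random() < p for _ in range(n)) / n
    acc_b = sum(rng.random() < p for _ in range(n)) / n
    return acc_a - acc_b

# repeat the 'experiment' over many seeds, as if re-drawing the folds
gaps = [simulated_accuracy_gap(seed) for seed in range(200)]
mean_gap = sum(gaps) / len(gaps)  # honest summary: close to zero
best_gap = max(gaps)              # the run a cherry-picker would report
```

The mean gap hovers near zero, as it should for two equivalent methods, while the single best run shows a sizeable "advantage", which is why reporting one favourable fold assignment is misleading.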
Article
Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary that a high-quality, instructive review of current research developments and trends in the concept drift field is conducted. In addition, due to the rapid development of concept drift research in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in the literature. This paper reviews over 130 high-quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aiming at handling concept drift. Also, concept drift related research directions are covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.
Jesus L. Lobo et al. Drift detection over non-stationary data streams using evolving spiking neural networks. In Intelligent Distributed Computing XII, pages 82–94, Cham, 2018. Springer International Publishing.
Albert Bifet et al. New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 139–148, 2009.
Rafael Gonzalez, Richard Woods, and Barry Masters. Digital Image Processing, third edition. Journal of Biomedical Optics, 14:029901, March 2009.