Applied Intelligence
https://doi.org/10.1007/s10489-022-03826-4
Processing data stream with chunk-similarity model selection
Pawel Ksieniewicz¹ (pawel.ksieniewicz@pwr.edu.pl)
¹ Department of Systems and Computer Networks, Wroclaw University of Science and Technology, Wybrzeże Stanisława Wyspiańskiego 27, Wroclaw, 50-370, Poland
Accepted: 29 May 2022
©The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
The classification of data streams susceptible to the concept drift phenomenon has been a field of intensive research for many years. One of the dominant strategies of the proposed solutions is the application of classifier ensembles with the member classifiers validated on their actual prediction quality. This paper proposes a new ensemble method, the Covariance-signature Concept Selector, which, like state-of-the-art solutions, uses both the model accumulation paradigm and the detection of changes in the data posterior probability, but in an integrated procedure. However, instead of ensemble fusion, it performs a static classifier selection, where an assessment of model similarity to the currently processed data chunk serves as a concept selector. The proposed method was subjected to a series of computer experiments assessing its temporal complexity and its efficiency in classifying streams with synthetic and real concepts. The conducted experimental analysis supports concluding that the proposal has an advantage over state-of-the-art methods on the identified pool of problems and a high potential for practical applications.
Keywords: Data stream · Classifier selection · Classification · Pattern recognition
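Although the full text specifies the actual concept signature and selection procedure, the general mechanism can be illustrated with a short sketch. The code below is only an illustration under two assumptions that are not confirmed by this preview: that the signature of a chunk is its feature covariance matrix, and that similarity is measured by the Frobenius norm of the signature difference.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    class ConceptSelectorSketch:
        """Sketch: accumulate one model per concept, select by signature similarity."""

        def __init__(self, base=GaussianNB, threshold=1.0):
            self.base, self.threshold = base, threshold
            self.models, self.signatures = [], []

        def _signature(self, X):
            # Assumed concept signature of a chunk: its feature covariance matrix.
            return np.cov(X, rowvar=False)

        def partial_fit(self, X, y, classes=None):
            sig = self._signature(X)
            dists = [np.linalg.norm(sig - s) for s in self.signatures]
            if dists and min(dists) <= self.threshold:
                # Known concept: update the most similar model and its signature.
                i = int(np.argmin(dists))
                self.models[i].partial_fit(X, y)
                self.signatures[i] = 0.5 * (self.signatures[i] + sig)
            else:
                # Unseen concept: accumulate a fresh model.
                self.models.append(self.base().partial_fit(X, y, classes=classes))
                self.signatures.append(sig)
            return self

        def predict(self, X):
            # Static selection: a single model, the most similar to the current chunk.
            sig = self._signature(X)
            dists = [np.linalg.norm(sig - s) for s in self.signatures]
            return self.models[int(np.argmin(dists))].predict(X)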
1 Introduction
An almost trivial introduction, but at the same time one hard to deny, is the often overused statement that the modern world is filled with data. The difficult beginning of the third decade of the 21st century has irreversibly transferred the central axis of human existence to the global network of computer systems, which, almost as in W. Gibson's Neuromancer, spans the vast majority of our everyday lives. Most of the time, we organize and process subsequent portions of data in e-mails, instant messages, remote video calls, documents, reports, and notifications, only to spend our free time accepting more portions of data in the form of Netflix or YouTube materials tailored to our taste, posts from our social bubble on Twitter, or music served by Spotify. In such times, Machine Learning drifts from its intuitive meaning into a marketing slogan, eagerly taken up by companies such as Google or Amazon, which is supposed to be a magic panacea
for understanding the amount of data we produce and
receive every day.
The reality behind this slogan is a much more prosaic complex of difficulties. Typical, classic recognition models target stationary problems, i.e., those describing a particular unchanging concept represented by a finite set of labeled problem instances [1]. The existence of highly numerous data sets, which do not fit entirely in the memory of a computer system, justifies the development of the inductive learning paradigm with incremental learning [2], most often using an iterative model-update procedure based on upcoming data batches. In the case of high-dimensional data, prone to the curse of dimensionality [3], methods of feature selection and extraction are used [4], aimed at reducing the difficulty of the analyzed problem. In the case of data with a high cost of label acquisition, active and semi-supervised learning paradigms are used [5, 6], allowing the identification of the objects most challenging for the recognition model, thanks to which it is possible to reduce the involvement of human experts in the field. All these methods fit into the common domain of Big Data [7], offering solutions for problems characterized by a large amount of data (volume). They also have to deal with the high speed of data processing (velocity), a great variety of data affecting the reliability of labeling (veracity), and the potential value for the end-user of the system.
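For illustration, the chunk-based incremental scheme mentioned above reduces to a test-then-train loop over data batches; scikit-learn exposes partial_fit for such iterative updates. The data source and chunking below are illustrative assumptions, not a prescribed protocol.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import GaussianNB

    # Illustrative data split into 100 chunks processed strictly in order.
    X, y = make_classification(n_samples=50000, random_state=0)
    chunks = zip(np.array_split(X, 100), np.array_split(y, 100))

    clf, scores = GaussianNB(), []
    for i, (X_c, y_c) in enumerate(chunks):
        if i > 0:  # test-then-train: first evaluate on the unseen chunk ...
            scores.append(accuracy_score(y_c, clf.predict(X_c)))
        clf.partial_fit(X_c, y_c, classes=np.unique(y))  # ... then update
    print(np.mean(scores))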
References
Article
Machine learning models assume that data is drawn from a stationary distribution. However, in practice, challenges are imposed on models that need to make sense of fast-evolving data streams, where the content of data is changing and evolving over time. This change between the distribution of the training data seen so far and the distribution of newly arriving data is called concept drift. It is of utmost importance to detect concept drifts to maintain the accuracy and reliability of online classifiers. Reactive drift detectors monitor the performance of the underlying machine learning model. That is, to detect a drift, feedback on the classifier output has to be given to the drift detector, known as prequential evaluation. In many real-life scenarios, immediate feedback on classifier output is not possible. Thus, drift detection is delayed and gets out of context. Moreover, the drift detector output is a binary answer as to whether there is a drift or not. However, it is equally important to explain the source of drift. In this paper, we present the Statistical Drift Detection Method (SDDM), which can detect drifts by monitoring the change of data distribution without the need for feedback on classifier output. Moreover, the detection is quantified and the source of drift is identified. We empirically evaluate our method against the state-of-the-art on both synthetic and real-life data sets. SDDM outperforms other related approaches by producing a smaller number of false positives and false negatives.
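The SDDM procedure itself is described in the cited article; as a generic point of reference (not the authors' method), a label-free distribution-monitoring check of the kind discussed above can be sketched with a per-feature two-sample Kolmogorov-Smirnov test:

    import numpy as np
    from scipy.stats import ks_2samp

    def drift_score(reference, current, alpha=0.01):
        # Fraction of features whose distribution differs significantly
        # between the reference window and the current chunk; no labels needed.
        pvals = [ks_2samp(reference[:, j], current[:, j]).pvalue
                 for j in range(reference.shape[1])]
        return np.mean(np.array(pvals) < alpha)  # close to 1.0 -> likely drift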
Article
Learning from data streams in the presence of concept drift is among the biggest challenges of contemporary machine learning. Algorithms designed for such scenarios must take into account the potentially unbounded size of data, its constantly changing nature, and the requirement for real-time processing. Ensemble approaches for data stream mining have gained significant popularity due to their high predictive capabilities and effective mechanisms for alleviating concept drift. In this paper, we propose a new ensemble method named Kappa Updated Ensemble (KUE). It is a combination of online and block-based ensemble approaches that uses the Kappa statistic for dynamic weighting and selection of base classifiers. In order to achieve a higher diversity among base learners, each of them is trained using a different subset of features and updated with new instances with a given probability following a Poisson distribution. Furthermore, we update the ensemble with new classifiers only when they contribute positively to the improvement of the quality of the ensemble. Finally, each base classifier in KUE is capable of abstaining from taking part in voting, thus increasing the overall robustness of KUE. An extensive experimental study shows that KUE is capable of outperforming state-of-the-art ensembles on standard and imbalanced drifting data streams while having a low computational complexity. Moreover, we analyze the use of Kappa vs. accuracy as the criterion to select and update the classifiers, the contribution of the abstaining mechanism, the contribution of the diversification of classifiers, and the contribution of the hybrid architecture to updating the classifiers in an online manner.
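A minimal sketch of the Kappa-based weighting idea described above might look as follows; the actual KUE update, diversification and abstaining rules are given in the cited paper, and clipping non-positive Kappa to an abstaining zero weight is an assumption of this sketch:

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def kappa_weighted_vote(ensemble, X_chunk, y_chunk, X_new):
        # Weight each member by its Kappa on the last labelled chunk;
        # members with non-positive Kappa abstain (weight zero).
        weights = np.array([max(0.0, cohen_kappa_score(y_chunk, clf.predict(X_chunk)))
                            for clf in ensemble])
        votes = np.array([clf.predict(X_new) for clf in ensemble])
        classes = np.unique(votes)
        support = np.array([[weights[votes[:, i] == c].sum() for c in classes]
                            for i in range(X_new.shape[0])])
        return classes[support.argmax(axis=1)]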
Article
stream-learn is a Python package compatible with scikit-learn and developed for drifting and imbalanced data stream analysis. Its main component is a stream generator, which allows producing a synthetic data stream that may incorporate each of the three main concept drift types (i.e., sudden, gradual and incremental drift) in their recurring or non-recurring version, as well as static and dynamic class imbalance. The package allows conducting experiments following established evaluation methodologies (i.e., Test-Then-Train and Prequential). Besides, estimators adapted for data stream classification have been implemented, including both simple classifiers and state-of-the-art chunk-based and online classifier ensembles. The package utilises its own implementations of prediction metrics for imbalanced binary classification tasks to improve computational efficiency.
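Based on the package's documented interface (argument names may differ between versions), a typical experiment could be set up as follows:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score
    from strlearn.streams import StreamGenerator
    from strlearn.evaluators import TestThenTrain

    # Synthetic stream of 200 chunks with two concept drifts.
    stream = StreamGenerator(n_chunks=200, chunk_size=250, n_drifts=2)
    evaluator = TestThenTrain(metrics=(accuracy_score,))
    evaluator.process(stream, GaussianNB())
    print(evaluator.scores)  # per-chunk quality, per classifier and metric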
Conference Paper
Despite the fact that real-life data streams may often be characterized by dynamic changes in the prior class probabilities, there is a scarcity of articles trying to clearly describe and classify this problem, as well as to suggest new methods dedicated to resolving this issue. The following paper aims to fill this gap by proposing a novel data stream taxonomy defined in the context of prior class probability and by introducing the Dynamic Statistical Concept Analysis (DSCA), a prior probability estimation algorithm. The proposed method was evaluated using computer experiments carried out on 100 synthetically generated data streams with various class imbalance characteristics. The obtained results, supported by statistical analysis, confirmed the usefulness of the proposed solution, especially in the case of discrete dynamically imbalanced data streams (DDIS).
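The DSCA algorithm itself is specified in the cited paper; for context only, the quantity it targets, i.e., the chunk-level prior class probability, can be naively approximated from predicted labels when true labels are unavailable:

    import numpy as np

    def estimate_priors(clf, X_chunk, classes, smoothing=1.0):
        # Naive prior estimate: smoothed frequencies of predicted labels.
        pred = clf.predict(X_chunk)
        counts = np.array([(pred == c).sum() for c in classes], dtype=float)
        counts += smoothing  # Laplace smoothing avoids zero priors
        return counts / counts.sum()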
Article
The imbalanced data classification remains a vital problem. The key is to find methods that classify both the minority and the majority class correctly. The paper presents a classifier ensemble for classifying binary, non-stationary and imbalanced data streams, where the Hellinger distance is used to prune the ensemble. The paper includes an experimental evaluation of the method. The first experiment checks the impact of the base classifier type on the quality of the classification. In the second experiment, the Hellinger Distance Weighted Ensemble (HDWE) method is compared to selected state-of-the-art methods using a statistical test with two base classifiers. The method was profoundly tested on many imbalanced data streams, and the obtained results proved the HDWE method's usefulness.
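The Hellinger distance used for pruning is a standard measure between discrete distributions; for two histograms P and Q it can be computed as below (how HDWE applies it to prune ensemble members is described in the cited article):

    import numpy as np

    def hellinger(p, q):
        # Hellinger distance between two discrete distributions,
        # H(P, Q) = (1 / sqrt(2)) * ||sqrt(P) - sqrt(Q)||_2, bounded in [0, 1].
        p, q = np.asarray(p, float), np.asarray(q, float)
        diff = np.sqrt(p / p.sum()) - np.sqrt(q / q.sum())
        return np.sqrt(np.sum(diff ** 2)) / np.sqrt(2)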
Article
In the diversity of contemporary decision-making tasks, where the data is no longer static and changes over time, data stream processing has become an important issue in the field of pattern recognition. In addition, most real problems are not balanced, representing their classes in disproportionate ratios. The following paper proposes the Prior Imbalance Compensation method, which modifies, on the fly, the predictions made by the base classifier, aiming at mapping the prior probability in the statistics of assigned classes. It is intended to be a less computationally complex competitor to popular algorithms such as SMOTE, which solve this problem by oversampling the training set. The proposed method has been tested using computer experiments on a set of various data streams, leading to promising results and suggesting its usefulness in solving this type of problem.
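The exact compensation rule is given in the cited paper; the general idea of correcting a classifier's outputs for a shifted prior can be sketched with standard posterior reweighting, which is an assumption of this sketch rather than the authors' formula:

    import numpy as np

    def reweight_posteriors(proba, train_priors, target_priors):
        # Rescale posteriors by the ratio of target to training priors,
        # then renormalize: p'(y|x) ~ p(y|x) * pi_target(y) / pi_train(y).
        adjusted = proba * (np.asarray(target_priors) / np.asarray(train_priors))
        return adjusted / adjusted.sum(axis=1, keepdims=True)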
Article
This work aims to connect two rarely combined research directions, i.e., non-stationary data stream classification and data analysis with skewed class distributions. We propose a novel framework employing stratified bagging for training base classifiers to integrate data preprocessing and dynamic ensemble selection methods for imbalanced data stream classification. The proposed approach has been evaluated based on computer experiments carried out on 135 artificially generated data streams with various imbalance ratios, label noise levels, and types of concept drift, as well as on two selected real streams. Four preprocessing techniques and two dynamic selection methods, used on both the bagging classifiers and the base estimators levels, were considered. Experimentation results showed that, for highly imbalanced data streams, dynamic ensemble selection coupled with data preprocessing could outperform online and chunk-based state-of-the-art methods.
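Of the several components the framework couples, the stratified bagging step alone, i.e., bootstrap sampling that preserves the chunk's class proportions in every bag, can be sketched as follows (illustrative only, not the paper's full framework):

    import numpy as np

    def stratified_bootstrap(X, y, random_state=None):
        # Draw a bootstrap sample separately within each class, so the
        # class proportions of the chunk are preserved in every bag.
        rng = np.random.default_rng(random_state)
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c),
                                         size=(y == c).sum(), replace=True)
                              for c in np.unique(y)])
        return X[idx], y[idx]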
Article
Due to the variety of modern real-life tasks, where the analyzed data is often not a static set, data stream mining has gained a substantial focus in the machine learning community. The main property of such systems is the large amount of data arriving in a sequential manner, which creates an endless stream of objects. Taking into consideration limited resources such as memory and computational power, it is widely accepted that each instance can be processed at most once and is not remembered, making reevaluation impossible. In the following work, we focus on the data stream classification task, where the parameters of a classification model may vary over time, so the model should be able to adapt to the changes. This requires a forgetting mechanism, ensuring that outdated samples will not impact the model. The most popular approaches are based on so-called windowing, which requires storing a batch of objects; when new examples arrive, the least relevant ones are forgotten. Objects in a new window are used to retrain the model, which is cumbersome, especially for online learners, and contradicts the principle of processing each object at most once. Therefore, this work employs the inbuilt forgetting mechanism of neural networks. Additionally, to reduce the need for expensive (sometimes even impossible) object labeling, we focus on active learning, which asks for labels only for interesting examples, crucial for appropriate model upgrading. The characteristics of the proposed methods were evaluated on the basis of computer experiments performed over a diverse pool of data streams. Their results confirmed the usefulness of the proposed strategy.
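The budget-limited labeling strategy discussed above can be illustrated by plain uncertainty sampling, i.e., asking for a label only when the model's confidence falls below a threshold; this is a generic sketch, not the cited method:

    import numpy as np

    def select_queries(clf, X_chunk, confidence=0.75):
        # Indices of instances uncertain enough to be worth labeling.
        proba = clf.predict_proba(X_chunk)
        return np.flatnonzero(proba.max(axis=1) < confidence)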