Conference Paper

Prior Probability Estimation in Dynamically Imbalanced Data Streams


Abstract

Although real-life data streams are often characterized by dynamic changes in the prior class probabilities, there is a scarcity of articles that clearly describe and classify this problem or that suggest new methods dedicated to resolving it. The following paper aims to fill this gap by proposing a novel data stream taxonomy defined in the context of prior class probability and by introducing the Dynamic Statistical Concept Analysis (DSCA) prior probability estimation algorithm. The proposed method was evaluated in computer experiments carried out on 100 synthetically generated data streams with various class imbalance characteristics. The obtained results, supported by statistical analysis, confirmed the usefulness of the proposed solution, especially in the case of discrete dynamically imbalanced data streams (DDIS).
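To make the estimation task concrete, the sketch below tracks class priors chunk by chunk with simple exponential smoothing. It only illustrates what prior probability estimation in a chunk-processed stream can look like; the smoothed_prior_estimates helper, the smoothing factor, and the toy stream are assumptions made for the example, not the DSCA algorithm described in the paper.

import numpy as np

def smoothed_prior_estimates(chunk_labels, n_classes, alpha=0.3):
    """Track class prior estimates over a chunk-based stream.

    chunk_labels: iterable of 1-D integer label arrays (one per chunk).
    alpha: smoothing factor; larger values react faster to prior drift.
    Returns an array of shape (n_chunks, n_classes) with running estimates.
    """
    estimate = np.full(n_classes, 1.0 / n_classes)  # start from a uniform prior
    history = []
    for y in chunk_labels:
        observed = np.bincount(y, minlength=n_classes) / len(y)
        estimate = (1 - alpha) * estimate + alpha * observed
        history.append(estimate.copy())
    return np.array(history)

# Toy stream: the minority-class prior rises abruptly half-way through.
rng = np.random.default_rng(0)
chunks = [rng.choice(2, size=200, p=[0.9, 0.1]) for _ in range(50)]
chunks += [rng.choice(2, size=200, p=[0.6, 0.4]) for _ in range(50)]
print(smoothed_prior_estimates(chunks, n_classes=2)[[0, 49, 50, 99]])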


... Changes are a natural consequence of analyzing endlessly incoming data. As a result of daily physiological and seasonal cycles and of shifting trends in social media, we will observe drifts both in the concept and in the rate of data imbalance [7,8]. Concept drifts describe the change of the posterior probability in the data stream over time. ...
... for each s_i in S = {s_1, ..., s_nmax} do ...
Article
Among the difficulties considered in data stream processing, a particularly interesting one is the phenomenon of concept drift. Concept drift detection methods are frequently used to eliminate its negative impact on classification quality in environments of evolving concepts. This article proposes the Statistical Drift Detection Ensemble (SDDE), a novel method of concept drift detection. The method uses drift magnitude and conditioned marginal covariate drift measures, analyzed by an ensemble of detectors whose members focus on random subspaces of the stream's features. The proposed detector was compared with state-of-the-art methods on both synthetic data streams and semi-synthetic streams generated from real-world concepts. A series of computer experiments and a statistical analysis of the results, covering both classification accuracy and drift detection errors, were carried out and confirmed the effectiveness of the proposed method.
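As a rough illustration of the subspace-ensemble idea mentioned above (and not of the SDDE detector itself), the sketch below lets each ensemble member watch one random feature subspace and vote for drift when the current chunk's mean moves far from a reference chunk. The class name, thresholding rule, and voting scheme are assumptions made purely for the example.

import numpy as np

class SubspaceDriftVoting:
    """Toy drift detection by majority vote over random feature subspaces."""

    def __init__(self, n_members=20, subspace_size=3, threshold=4.0, seed=0):
        self.n_members = n_members
        self.subspace_size = subspace_size
        self.threshold = threshold
        self.rng = np.random.default_rng(seed)

    def fit_reference(self, X_ref):
        # Draw one random subspace per member and store reference statistics.
        n_features = X_ref.shape[1]
        self.subspaces_ = [
            self.rng.choice(n_features, self.subspace_size, replace=False)
            for _ in range(self.n_members)
        ]
        self.ref_mean_ = X_ref.mean(axis=0)
        self.ref_sem_ = X_ref.std(axis=0) / np.sqrt(len(X_ref)) + 1e-12
        return self

    def drift_detected(self, X_chunk):
        # Each member votes when its subspace mean shifted by more than
        # `threshold` reference standard errors; majority decides.
        votes = 0
        for idx in self.subspaces_:
            shift = np.abs(X_chunk[:, idx].mean(axis=0) - self.ref_mean_[idx]) / self.ref_sem_[idx]
            votes += int(shift.mean() > self.threshold)
        return votes > self.n_members // 2

det = SubspaceDriftVoting().fit_reference(np.random.default_rng(1).normal(size=(500, 10)))
print(det.drift_detected(np.random.default_rng(2).normal(loc=0.5, size=(500, 10))))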
... An illustration of the stationary imbalanced stream is presented in Fig. 6a. Dynamically imbalanced stream: a less common type of imbalanced data, impossible to obtain in static datasets, is data imbalanced dynamically. In this case, the class distribution is not constant throughout the stream but changes over time, resulting in a dynamically imbalanced stream (DIS), among which two categories can be distinguished [25]: ...
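A minimal way to picture the two DIS categories is to let the minority-class prior either switch abruptly (a discrete DIS) or oscillate smoothly (a continuous DIS) along the stream. The sketch below generates such label sequences; the specific prior trajectories and the dis_prior helper are illustrative assumptions, not definitions taken from the cited taxonomy.

import numpy as np

def dis_prior(chunk_idx, n_chunks, mode="continuous"):
    """Minority-class prior for chunk `chunk_idx` of a dynamically imbalanced stream.

    "continuous": the prior oscillates smoothly over the stream.
    "discrete":   the prior switches abruptly between two levels.
    Values and shapes are invented for illustration only.
    """
    if mode == "continuous":
        return 0.25 + 0.20 * np.sin(2 * np.pi * chunk_idx / n_chunks)
    return 0.10 if chunk_idx < n_chunks // 2 else 0.40

rng = np.random.default_rng(1)
n_chunks, chunk_size = 100, 250
stream_labels = [
    rng.choice(2, size=chunk_size,
               p=[1 - dis_prior(i, n_chunks), dis_prior(i, n_chunks)])
    for i in range(n_chunks)
]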
... The research carried out with the use of the stream-learn module also took up the construction of a framework combining Dynamic Classifier Selection with data preprocessing to classify highly imbalanced data streams with concept drift [32]. Thanks to the dynamically imbalanced stream generator, the first work on prior probability prediction in continuously and discretely dynamically imbalanced data streams was also undertaken [25]. In addition, it enabled the first work on the detection of fake news interpreted not as a stationary problem but as a data stream [33]. ...
Article
stream-learn is a Python package compatible with scikit-learn and developed for the drifting and imbalanced data stream analysis. Its main component is a stream generator, which allows producing a synthetic data stream that may incorporate each of the three main concept drift types (i.e., sudden, gradual and incremental drift) in their recurring or non-recurring version, as well as static and dynamic class imbalance. The package allows conducting experiments following established evaluation methodologies (i.e., Test-Then-Train and Prequential). Besides, estimators adapted for data stream classification have been implemented, including both simple classifiers and state-of-the-art chunk-based and online classifier ensembles. The package utilises its own implementations of prediction metrics for imbalanced binary classification tasks to improve computational efficiency.
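The snippet below sketches typical stream-learn usage as I read it from the package documentation: a synthetic imbalanced stream is generated and a scikit-learn classifier is evaluated with the Test-Then-Train protocol. Parameter values are arbitrary, and the exact signatures (e.g. of StreamGenerator and TestThenTrain) should be checked against the installed version.

from sklearn.naive_bayes import GaussianNB
from strlearn.streams import StreamGenerator
from strlearn.evaluators import TestThenTrain
from strlearn.metrics import balanced_accuracy_score

# A synthetic stream with one concept drift and a static 90/10 class imbalance;
# per the abstract above, dynamic imbalance is also supported by the generator.
stream = StreamGenerator(n_chunks=100, chunk_size=250, n_drifts=1,
                         weights=[0.9, 0.1], random_state=42)

# Test-Then-Train evaluation: each chunk is first used for testing, then for training.
evaluator = TestThenTrain(metrics=(balanced_accuracy_score,))
evaluator.process(stream, GaussianNB())
print(evaluator.scores.shape)  # expected: (n_classifiers, n_chunks - 1, n_metrics)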
... Many of the proposed methods employ classifier ensembles [4] and dynamic selection of classifiers [5]. Moreover, various methods are based on concept drift detectors [6] and on prior probability estimation [7], which indicates when the data characteristics are changing. ...
Chapter
Full-text available
Contemporary man is addicted to digital media and tools supporting his daily activities, which causes a massive increase of incoming data, both in volume and in frequency. Due to the observed trend, unsupervised machine learning methods for data stream clustering have become a popular research topic over the last years. At the same time, semi-supervised constrained clustering is rarely considered in data stream clustering. To address this gap in the field, the authors propose adaptations of k-means constrained clustering algorithms for employing them in imbalanced data stream clustering. In this work, the proposed algorithms were evaluated in a series of experiments concerning synthetic and real data clustering, which verified their ability to adapt to occurring concept drifts. Keywords: Data streams, Pair-wise constrained clustering, Imbalanced data
... In both of these cases, the distribution of objects in the feature space changes, but in virtual drift it does not affect the decision boundary of the model. A particular case of such a situation is drift in the prior probability [11], which ostensibly does not change the class definitions but, by changing their counts, often leads to noticeable changes in the classifier's decisions and its measured quality [12]. This topic is discussed more widely in the recently popular subfield of imbalanced stream processing [13][14][15]. ...
Article
Full-text available
The classification of data streams susceptible to the concept drift phenomenon has been a field of intensive research for many years. One of the dominant strategies among the proposed solutions is the application of classifier ensembles whose member classifiers are validated on their actual prediction quality. This paper proposes a new ensemble method – the Covariance-signature Concept Selector – which, like state-of-the-art solutions, uses both the model accumulation paradigm and the detection of changes in the data posterior probability, but in an integrated procedure. However, instead of ensemble fusion, it performs static classifier selection, in which an assessment of model similarity to the currently processed data chunk serves as a concept selector. The proposed method was subjected to a series of computer experiments assessing its temporal complexity and its efficiency in classifying streams with synthetic and real concepts. The conducted experimental analysis shows the advantage of this proposal over state-of-the-art methods in the identified pool of problems and its high potential for practical applications.
... Class imbalance is a common problem in the data stream mining domain (Wu et al., 2014; Aminian et al., 2019). Here streams can have a fixed imbalance ratio, or the ratio may evolve over time (Komorniczak et al., 2021). Furthermore, class imbalance combined with concept drift poses novel and unique challenges (Brzeziński and Stefanowski, 2017; Sun et al., 2021). ...
Preprint
Full-text available
Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures for evaluating these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. In this way, we propose the first standardized approach to conducting experiments on imbalanced data streams, which other researchers can use to create a trustworthy and fair evaluation of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
Conference Paper
With the processing of data streams come inevitable challenges, such as changes in the prior (class drift) and posterior (concept drift) probability distributions over the processing time. Both of these phenomena have a negative impact on classification quality. Heavily imbalanced problems, often typical for real-world applications, bring additional processing difficulties. Classifiers are often biased towards the majority class and have difficulty identifying instances of categories described by a lower number of objects. The following article proposes the Prior Probability Assisted Classifier (2PAC), a method aiming to improve the classification quality of heavily imbalanced data streams with dynamic changes by using the estimated prior probability value and correcting the classifier's decisions for batch predictions. The presented extensive computer experiments, supported by statistical analysis, show the ability of the proposed method to improve classification quality.
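A generic way to use an estimated prior for correcting batch predictions is the standard prior-shift rescaling of the model's class probabilities, sketched below. This is not the exact correction rule of 2PAC, only an illustration of the general mechanism; the function name and example numbers are assumptions.

import numpy as np

def prior_corrected_predictions(proba, train_prior, estimated_prior):
    """Rescale a classifier's class probabilities towards an estimated prior.

    proba:            (n_samples, n_classes) probabilities from the base model.
    train_prior:      class prior seen by the model during training.
    estimated_prior:  prior estimated for the current chunk of the stream.
    Applies the standard prior-shift correction p(y|x) * pi_new(y) / pi_old(y),
    renormalised per sample.
    """
    weights = np.asarray(estimated_prior) / np.asarray(train_prior)
    corrected = proba * weights
    return corrected / corrected.sum(axis=1, keepdims=True)

# Example: a model trained on a 90/10 split scoring a chunk that is roughly 60/40.
proba = np.array([[0.80, 0.20], [0.55, 0.45]])
print(prior_corrected_predictions(proba, [0.9, 0.1], [0.6, 0.4]))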
Chapter
Nowadays, swarm intelligence shows high accuracy when solving difficult problems, including image processing problems. Image edge detection is a complex optimization problem due to high-resolution images involving large matrices of pixels. The current work describes several environment-sensitive models involving swarm intelligence. The agents' sensitivity is used to guide the swarm towards the best solution. Both theoretical general guidance and a practical example for a particular swarm are included. The quality of the results is measured using several known measures. Keywords: Swarm intelligence, Image processing, Image edge detection
Article
Full-text available
Online class imbalance learning is a new learning problem that combines the challenges of both online learning and class imbalance learning. It deals with data streams having very skewed class distributions. This type of problems commonly exists in real-world applications, such as fault diagnosis of real-time control monitoring systems and intrusion detection in computer networks. In our earlier work, we defined class imbalance online, and proposed two learning algorithms OOB and UOB that build an ensemble model overcoming class imbalance in real time through resampling and time-decayed metrics. In this paper, we further improve the resampling strategy inside OOB and UOB, and look into their performance in both static and dynamic data streams. We give the first comprehensive analysis of class imbalance in data streams, in terms of data distributions, imbalance rates and changes in class imbalance status. We find that UOB is better at recognizing minority-class examples in static data streams, and OOB is more robust against dynamic changes in class imbalance status. The data distribution is a major factor affecting their performance. Based on the insight gained, we then propose two new ensemble methods that maintain both OOB and UOB with adaptive weights for final predictions, called WEOB1 and WEOB2. They are shown to possess the strength of OOB and UOB with good accuracy and robustness.
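The time-decayed class-size metric mentioned above can be written as a simple recursive update; the sketch below follows the form w_k <- theta * w_k + (1 - theta) * [y_t = k], with theta as a forgetting factor. The initialisation and the value of theta are assumptions for the example rather than the paper's settings.

import numpy as np

def time_decayed_class_sizes(labels, n_classes, theta=0.99):
    """Time-decayed class proportions for tracking imbalance status online.

    labels: iterable of integer class labels arriving one by one.
    theta:  decay factor; values close to 1 forget old examples slowly.
    Returns the trace of estimates after each example.
    """
    w = np.full(n_classes, 1.0 / n_classes)
    trace = []
    for y in labels:
        indicator = np.zeros(n_classes)
        indicator[y] = 1.0
        w = theta * w + (1 - theta) * indicator
        trace.append(w.copy())
    return np.array(trace)

# Example: the minority class becomes more frequent in the second half.
rng = np.random.default_rng(0)
labels = np.concatenate([rng.choice(2, 1000, p=[0.9, 0.1]),
                         rng.choice(2, 1000, p=[0.5, 0.5])])
print(time_decayed_class_sizes(labels, n_classes=2)[[0, 999, 1999]])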
Article
Full-text available
Nowadays, with the advance of technology, many applications generate huge amounts of data streams at very high speed. Examples include network traffic, web click streams, video surveillance, and sensor networks. Data stream mining has become a hot research topic. Its goal is to extract hidden knowledge/patterns from continuous data streams. Unlike traditional data mining where the dataset is static and can be repeatedly read many times, data stream mining algorithms face many challenges and have to satisfy constraints such as bounded memory, single-pass, real-time response, and concept-drift detection. This paper presents a comprehensive survey of the state-of-the-art data stream mining algorithms with a focus on clustering and classification because of their ubiquitous usage. It identifies mining constraints, proposes a general model for data stream mining, and depicts the relationship between traditional data mining and data stream mining. Furthermore, it analyzes the advantages as well as limitations of data stream algorithms and suggests potential areas for future research.
Article
Full-text available
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.
Conference Paper
Full-text available
We consider strategies for building classifier ensembles for non-stationary environments where the classification task changes during the operation of the ensemble. Individual classifier models capable of online learning are reviewed. The concept of "forgetting" is discussed. Online ensembles and strategies suitable for changing environments are summarized.
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
Article
In the diversity of contemporary decision-making tasks, where the data is no longer static and changes over time, data stream processing has become an important issue in the field of pattern recognition. In addition, most real problems are not balanced, representing their classes in varying disproportions. The following paper proposes the Prior Imbalance Compensation method, which modifies on-the-fly the predictions made by the base classifier, aiming to map the prior probability onto the statistics of the assigned classes. It is intended as a less computationally complex competitor to popular algorithms such as SMOTE, which solve this problem by oversampling the training set. The proposed method has been tested in computer experiments on a set of various data streams, leading to promising results that suggest its usefulness in solving this type of problem.
Article
In many applications of information systems learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements for algorithms to incrementally process incoming examples while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drifts. Out of several new proposed stream algorithms, ensembles play an important role, in particular for non-stationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.
Article
Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.
Article
The relative abilities of two dimensioned statistics, the root-mean-square error (RMSE) and the mean absolute error (MAE), to describe average model-performance error are examined. The RMSE is of special interest because it is widely reported in the climatic and environmental literature; nevertheless, it is an inappropriate and misinterpreted measure of average error. RMSE is inappropriate because it is a function of three characteristics of a set of errors, rather than of one (the average error). RMSE varies with the variability within the distribution of error magnitudes and with the square root of the number of errors (n^(1/2)), as well as with the average-error magnitude (MAE). Our findings indicate that MAE is a more natural measure of average error, and (unlike RMSE) is unambiguous. Dimensioned evaluations and inter-comparisons of average model-performance error, therefore, should be based on MAE.
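A tiny numeric check makes the argument tangible: two error sets with identical MAE can have different RMSE once the spread of error magnitudes differs. The values below are invented purely for illustration.

import numpy as np

# Two error sets with the same MAE but different variability:
# RMSE grows with the spread of error magnitudes, MAE does not.
errors_uniform = np.array([2.0, 2.0, 2.0, 2.0])   # all errors equal
errors_spread = np.array([0.5, 0.5, 0.5, 6.5])    # same mean magnitude, one outlier

for e in (errors_uniform, errors_spread):
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
# MAE stays at 2.00 in both cases; RMSE rises from 2.00 to about 3.28 for the spread errors.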
Article
Classification is an important data analysis tool that uses a model built from historical data to predict class labels for new observations. More and more applications are featuring data streams, rather than finite stored data sets, which are a challenge for traditional classification algorithms. Concept drifts and skewed distributions, two common properties of data stream applications, make the task of learning in streams difficult. The authors aim to develop a new approach to classify skewed data streams that uses an ensemble of models to match the distribution over under-samples of negatives and repeated samples of positives.
Article
Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. This paper describes and evaluates VFDT, an anytime system that builds decision trees using constant memory and constant time per example. VFDT can incorporate tens of thousands of examples per second using off-the-shelf hardware. It uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a conventional learner. We study VFDT's properties and demonstrate its utility through an extensive set of experiments on synthetic data. We apply VFDT to mining the continuous stream of Web access data from the whole University of Washington main campus.
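The Hoeffding bound underlying VFDT can be stated compactly: with probability 1 - delta, the true mean of a variable with range R stays within epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the mean observed over n examples. The sketch below evaluates this bound for a few sample sizes; the helper name and the chosen delta are assumptions for the example.

import numpy as np

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    After n observations of a variable with range R, the true mean lies
    within epsilon of the observed mean with probability 1 - delta.
    VFDT-style learners split a node once the observed gap between the two
    best attributes exceeds this bound.
    """
    return np.sqrt(value_range ** 2 * np.log(1.0 / delta) / (2.0 * n))

# Example: information gain on a binary class has range 1; delta = 1e-7.
for n in (200, 1000, 5000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))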
J. Gantz and D. Reinsel, "The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east," IDC iView: IDC Analyze the Future, vol. 2007, no. 2012, pp. 1-16, 2012.