Article

Stream-learn — open-source Python library for difficult data stream batch analysis


Abstract

stream-learn is a Python package, compatible with scikit-learn, developed for the analysis of drifting and imbalanced data streams. Its main component is a stream generator, which allows producing synthetic data streams that may incorporate each of the three main concept drift types (i.e., sudden, gradual and incremental drift) in their recurring or non-recurring versions, as well as static and dynamic class imbalance. The package allows conducting experiments following established evaluation methodologies (i.e., Test-Then-Train and Prequential). In addition, estimators adapted for data stream classification have been implemented, including both simple classifiers and state-of-the-art chunk-based and online classifier ensembles. The package uses its own implementations of prediction metrics for imbalanced binary classification tasks to improve computational efficiency.


... The following section describes the goals of the planned experiments and the experimental setup of the conducted research. All experiments and methods were implemented in the Python programming language using the scikit-learn [46], stream-learn [47], scikit-multiflow [48] and numpy [49] libraries. To simplify calculations in the extensive computer experiments, the Gaussian Naïve Bayes classifier was used as the base classifier of the meta-estimator (3.1). ...
... Each stream was replicated ten times with a different random state. For the purpose of this experiment, synthetic data streams were generated by the StreamGenerator from the stream-learn package [47]. ...
... Each of the streams was replicated ten times with a different random state. These streams were likewise generated with the stream-learn [47] package. ...
Article
Among the difficulties considered in data stream processing, a particularly interesting one is the phenomenon of concept drift. Methods of concept drift detection are frequently used to eliminate its negative impact on the quality of classification in the environment of evolving concepts. This article proposes the Statistical Drift Detection Ensemble (sdde), a novel method of concept drift detection. The method uses drift magnitude and conditioned marginal covariate drift measures, analyzed by an ensemble of detectors whose members focus on random subspaces of the stream's features. The proposed detector was compared with state-of-the-art methods on both synthetic data streams and semi-synthetic streams generated from real-world concepts. A series of computer experiments and a statistical analysis of the results, covering both classification accuracy and drift detection errors, were carried out and confirmed the effectiveness of the proposed method.
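The idea of a detector ensemble working on random feature subspaces can be sketched as follows. This is a deliberately simplified stand-in (mean-shift voting with a fixed sensitivity threshold), not the sdde algorithm itself; the thresholds and chunk sizes are illustrative assumptions:

```python
import numpy as np

def make_detectors(n_features, n_detectors, subspace_size, rng):
    """Each detector watches a random subset of feature indices."""
    return [rng.choice(n_features, subspace_size, replace=False)
            for _ in range(n_detectors)]

def detect_drift(chunks, detectors, delta=0.5):
    """Majority vote: a chunk is flagged when most detectors see the mean
    of their subspace jump by more than `delta` since the previous chunk."""
    flags = [False]
    prev = [chunks[0][:, idx].mean() for idx in detectors]
    for X in chunks[1:]:
        cur = [X[:, idx].mean() for idx in detectors]
        votes = sum(abs(c - p) > delta for c, p in zip(cur, prev))
        flags.append(votes > len(detectors) / 2)
        prev = cur
    return flags

rng = np.random.default_rng(1)
chunks = [rng.normal(0, 1, (200, 8)) for _ in range(10)]   # stationary concept
chunks += [rng.normal(2, 1, (200, 8)) for _ in range(5)]   # sudden drift at chunk 10

flags = detect_drift(chunks, make_detectors(8, 7, 3, rng))
print(flags.index(True))  # prints 10
```

Randomizing the subspaces diversifies the detectors, so a drift affecting only part of the feature space can still win the vote.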
... One of the most promising data stream classification research directions, which usually employs chunk-based data processing, is the classifier ensemble approach [Krawczyk et al., 2017]. Its advantage is that the classifier ensemble can easily adapt to the concept drift using different updating strategies [Kuncheva, 2004]: ...
... In general, SR is the number of samples needed to obtain p percent of the maximum performance achieved on the subsequent task. The stream-learn library [Ksieniewicz and Zyblewski, 2020] was employed to generate the synthetic data containing three types of concept drift: abrupt, gradual, and incremental, all generated with recurring or unique concepts. We tested parameters such as chunk sizes and the stream length for each type of concept drift. ...
... This repository also contains detailed results of all experiments. The stream-learn [Ksieniewicz and Zyblewski, 2020] implementation of the ensemble models was utilized with Gaussian Naïve Bayes and CART as base classifiers, taken from the scikit-learn library [Pedregosa et al., 2011]. Detailed information about the packages used is provided in the YAML file with the specification of the Anaconda environment. ...
Article
Full-text available
Modern analytical systems must process streaming data and correctly respond to data distribution changes. The phenomenon of changes in data distributions is called concept drift, and it may harm the quality of the models used. Additionally, the possibility of concept drift appearing means that the algorithms used must be ready for continuous adaptation of the model to the changing data distributions. This work focuses on non-stationary data stream classification, where a classifier ensemble is used. To keep the ensemble model up to date, new base classifiers are trained on the incoming data blocks and added to the ensemble while, at the same time, outdated models are removed from it. One of the problems with this type of model is achieving a fast reaction to changes in data distributions. We propose the new Chunk Adaptive Restoration framework, which can be adapted to any block-based data stream classification algorithm. The proposed algorithm adjusts the data chunk size when concept drift is detected to minimize the impact of the change on the predictive performance of the model. The experimental research, backed up with statistical tests, has proven that Chunk Adaptive Restoration significantly reduces the model's restoration time.
... Experiments were carried out using both synthetic and real datasets. The stream-learn library [1] was employed to generate the synthetic data containing three types of concept drift: abrupt, gradual, and incremental, all generated with recurring or unique concepts. We tested parameters such as chunk sizes and the stream length for each type of concept drift. ...
... This repository also contains detailed results of all experiments. The stream-learn [1] implementation of the ensemble models was utilized with Gaussian Naïve Bayes and CART as base classifiers from sklearn [83]. Detailed information about the packages used is provided in the YAML file with the specification of the conda environment. ...
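The chunk-size adjustment at the heart of the framework described above can be illustrated with a toy controller: shrink the chunk after a detected drift so the model retrains on fresh data more often, then grow it back once the stream is stable. The shrink and growth factors here are illustrative assumptions, not the paper's exact Chunk Adaptive Restoration rule:

```python
def chunk_size_schedule(drift_flags, base=500, shrink=0.25, growth=2.0):
    """Illustrative chunk-size controller: collapse the chunk size after a
    detected drift, then restore it geometrically up to the base size."""
    size = base
    sizes = []
    for drifted in drift_flags:
        if drifted:
            size = max(int(base * shrink), 1)   # react: small, frequent updates
        else:
            size = min(int(size * growth), base)  # recover toward the base size
        sizes.append(size)
    return sizes

# One drift detected in the second chunk of a five-chunk stream.
print(chunk_size_schedule([False, True, False, False, False], base=400))
# -> [400, 100, 200, 400, 400]
```

Smaller chunks right after a drift shorten the restoration period; growing back limits the per-chunk evaluation overhead once the concept is stable.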
Preprint
Full-text available
Modern analytical systems must be ready to process streaming data and correctly respond to data distribution changes. The phenomenon of changes in data distributions is called concept drift, and it may harm the quality of the models used. Additionally, the possibility of concept drift appearing means that the algorithms used must be ready for continuous adaptation of the model to the changing data distributions. This work focuses on non-stationary data stream classification, where a classifier ensemble is used. To keep the ensemble model up to date, new base classifiers are trained on the incoming data blocks and added to the ensemble while, at the same time, outdated models are removed from it. One of the problems with this type of model is achieving a fast reaction to changes in data distributions. We propose a new Chunk Adaptive Restoration framework that can be adapted to any block-based data stream classification algorithm. The proposed algorithm adjusts the data chunk size when concept drift is detected to minimize the impact of the change on the predictive performance of the model. The conducted experimental research, backed up with statistical tests, has proven that Chunk Adaptive Restoration significantly reduces the model's restoration time.
... There is a lack of a large, diverse collection of real data streams in which concept drift appearances are additionally marked. Therefore, to properly evaluate the proposed methods, a set of simulated data was generated using the stream-learn library [16]. Each data stream has 10000 chunks with 250 instances. ...
... We present in Table 1 the parameters used for generating the data. More details about each of the parameters are specified in [16]. ...
Preprint
Full-text available
Recently, continual learning has received a lot of attention. One of the significant problems is the occurrence of concept drift, which consists of changing probabilistic characteristics of the incoming data. In the case of the classification task, this phenomenon destabilizes the model's performance and negatively affects the achieved prediction quality. Most current methods apply statistical learning and similarity analysis over the raw data. However, similarity analysis in streaming data remains a complex problem due to time limitations, non-precise values, fast decision speed, scalability, etc. This article introduces a novel method for monitoring changes in the probabilistic distribution of multi-dimensional data streams. As a measure of the rapidity of changes, we analyze the popular Kullback-Leibler divergence. During the experimental study, we show how to use this metric to predict the concept drift occurrence and understand its nature. The obtained results encourage further work on the proposed methods and their application in real tasks where predicting the future appearance of concept drift plays a crucial role, such as predictive maintenance.
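Monitoring a stream with the Kullback-Leibler divergence, as described above, amounts to comparing the empirical distributions of consecutive chunks; a sketch on a one-dimensional stream (the bin edges and chunk sizes are arbitrary choices, not the paper's setup):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence KL(p || q) between histograms."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def chunk_divergences(chunks, bins):
    """KL divergence between histograms of consecutive chunks; a spike
    suggests a change in the underlying distribution."""
    hists = [np.histogram(c, bins=bins)[0] for c in chunks]
    return [kl_divergence(hists[i], hists[i - 1]) for i in range(1, len(hists))]

rng = np.random.default_rng(2)
bins = np.linspace(-6, 6, 25)
chunks = [rng.normal(0, 1, 1000) for _ in range(5)]   # stable concept
chunks += [rng.normal(3, 1, 1000) for _ in range(3)]  # drifted concept

divs = chunk_divergences(chunks, bins)
print(int(np.argmax(divs)))  # the spike sits at the boundary between concepts
```

Note that KL is asymmetric and unbounded, which is why a small smoothing constant is added to empty bins before normalizing.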
... Additionally, to facilitate the visual analysis of the obtained results, figures presenting the accumulated difference between the actual and estimated prior probability value in each data chunk are presented. All experiments were implemented in the Python programming language, based on the scikit-learn [15] and stream-learn [16] APIs, and can be replicated according to the code published in the GitHub repository 1 . ...
Conference Paper
Despite the fact that real-life data streams may often be characterized by dynamic changes in the prior class probabilities, there is a scarcity of articles trying to clearly describe and classify this problem as well as suggest new methods dedicated to resolving this issue. The following paper aims to fill this gap by proposing a novel data stream taxonomy defined in the context of prior class probability and by introducing the Dynamic Statistical Concept Analysis (DSCA) – a prior probability estimation algorithm. The proposed method was evaluated using computer experiments carried out on 100 synthetically generated data streams with various class imbalance characteristics. The obtained results, supported by statistical analysis, confirmed the usefulness of the proposed solution, especially in the case of discrete dynamically imbalanced data streams (DDIS).
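A simple way to track dynamically changing priors, in the spirit of (though much simpler than) the DSCA estimator mentioned above, is to exponentially smooth the per-chunk class frequencies. The smoothing constant and stream shape below are illustrative assumptions:

```python
import numpy as np

def estimate_priors(chunk_labels, n_classes=2, alpha=0.3):
    """Exponentially smoothed estimate of the class prior in each chunk --
    an illustrative stand-in for a dedicated prior-estimation algorithm."""
    prior = np.full(n_classes, 1.0 / n_classes)
    history = []
    for y in chunk_labels:
        freq = np.bincount(y, minlength=n_classes) / len(y)
        prior = (1 - alpha) * prior + alpha * freq  # EWMA update
        history.append(prior.copy())
    return history

rng = np.random.default_rng(3)
# Dynamically imbalanced stream: the minority share drifts from 50% to 10%.
shares = np.linspace(0.5, 0.1, 20)
chunk_labels = [rng.binomial(1, s, 500) for s in shares]

history = estimate_priors(chunk_labels)
print(round(history[0][1], 2), round(history[-1][1], 2))
```

The smoothed estimate lags the true prior slightly, which is the usual trade-off between noise suppression and reaction speed to prior drift.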
... A significant dimension of developing intelligent systems is their ability to continuously learn and adapt to new conditions in their environment. According to Bouchachia et al. [41], these systems must incorporate adaptable learning algorithms and continuous adaptation processes, making them capable of responding to new conditions as part of their learning process, just like any intelligent living organism that learns incrementally and dynamically from any changes in its environment [42]. In order to enable the above-mentioned behavior, ML models should be periodically re-trained when new data are available, thus adjusting their behavior when new patterns are detected. ...
Article
Full-text available
Energy management is crucial for various activities in the energy sector, such as effective exploitation of energy resources, reliability in supply, energy conservation, and integrated energy systems. In this context, several machine learning and deep learning models have been developed during the last decades focusing on energy demand and renewable energy source (RES) production forecasting. However, most forecasting models are trained using batch learning, ingesting all data to build a model in a static fashion. The main drawback of models trained offline is that they tend to mis-calibrate after launch. In this study, we propose a novel, integrated online (or incremental) learning framework that recognizes the dynamic nature of learning environments in energy-related time-series forecasting problems. The proposed paradigm is applied to the problem of energy forecasting, resulting in the construction of models that dynamically adapt to new patterns of streaming data. The evaluation process is realized using a real use case consisting of an energy demand and a RES production forecasting problem. Experimental results indicate that online learning models outperform offline learning models by 8.6% in the case of energy demand and by 11.9% in the case of RES forecasting in terms of mean absolute error (MAE), highlighting the benefits of incremental learning.
... A concept drift detection ensemble and an exemplary experiment were prepared to ensure that the proposed stream generation method is competent in generating streams with research potential. The ensemble was developed using the scikit-learn library and the sample experiment using the stream-learn [11] package. ...
... The tested algorithms and the experimental environment were implemented in the Python programming language, using the scikit-learn [16] and stream-learn [17] libraries. The repository allowing for the replication of the presented results is available on GitHub 5 . ...
Chapter
Full-text available
Contemporary man is addicted to digital media and tools supporting his daily activities, which causes a massive increase of incoming data, both in volume and frequency. Due to the observed trend, unsupervised machine learning methods for data stream clustering have become a popular research topic over the last years. At the same time, semi-supervised constrained clustering is rarely considered in data stream clustering. To address this gap in the field, the authors propose adaptations of k-means constrained clustering algorithms for employing them in imbalanced data stream clustering. In this work, the proposed algorithms were evaluated in a series of experiments concerning synthetic and real data clustering, which verified their ability to adapt to occurring concept drifts.

Keywords: Data streams, Pair-wise constrained, Clustering, Imbalanced data
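The constrained-clustering setting above rests on must-link and cannot-link pairs; a COP-KMeans-style feasibility check for the assignment step might look like this (a generic sketch of the technique, not the authors' stream adaptation):

```python
def violates(point_idx, cluster, assignments, must_link, cannot_link):
    """Assigning `point_idx` to `cluster` is invalid if it would separate a
    must-link pair or join a cannot-link pair, given current assignments."""
    for a, b in must_link:
        other = b if a == point_idx else a if b == point_idx else None
        # An unassigned partner imposes no constraint yet.
        if other is not None and assignments.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == point_idx else a if b == point_idx else None
        if other is not None and assignments.get(other) == cluster:
            return True
    return False

assignments = {0: 0, 1: 1}
must_link = [(0, 2)]      # samples 0 and 2 must share a cluster
cannot_link = [(1, 3)]    # samples 1 and 3 must not

print(violates(2, 1, assignments, must_link, cannot_link))  # True: splits (0, 2)
print(violates(3, 0, assignments, must_link, cannot_link))  # False
```

In the full algorithm this check runs inside the nearest-centroid assignment loop: a point takes the closest cluster whose assignment does not violate any constraint.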
... Dashed lines present actual values, while the red line shows its trend smoothed by the median filter. Additionally, in the form of a dotted line in the background, a concept change curve is presented, determined according to the concept sigmoid spacing parameter of the synthetic stream generator available in the stream-learn package [50]. ...
Article
Full-text available
The classification of data streams susceptible to the concept drift phenomenon has been a field of intensive research for many years. One of the dominant strategies of the proposed solutions is the application of classifier ensembles with the member classifiers validated on their actual prediction quality. This paper proposes a new ensemble method – the Covariance-signature Concept Selector – which, like state-of-the-art solutions, uses both the model accumulation paradigm and the detection of changes in the data posterior probability, but in an integrated procedure. However, instead of ensemble fusion, it performs a static classifier selection, where the assessment of model similarity to the currently processed data chunk serves as a concept selector. The proposed method was subjected to a series of computer experiments assessing its temporal complexity and efficiency in classifying streams with synthetic and real concepts. The conducted experimental analysis allows concluding the advantage of this proposal over state-of-the-art methods in the identified pool of problems and its high potential in practical applications.
... The most important is real data, because it allows testing methods on data streams that at least try to mimic the difficulty of real-life classification problems. In addition, the data were supplemented by a large set of synthetically generated streams, provided by two data generators: MOA [92] and stream-learn [96]. The data streams used in the experimental evaluation are two-class problems with full object labeling. ...
Article
Full-text available
One of the most critical data analysis tasks is streaming data classification, where we may also observe the concept drift phenomenon, i.e., changes in the decision model's probabilistic characteristics. From a practical point of view, we may face this type of task in banking, medicine, or cybersecurity, to enumerate only a few. A vital characteristic of these problems is that the classes we are interested in (e.g., fraudulent transactions, threats, or serious diseases) are usually infrequent, which hinders the classification system design. The paper presents a novel algorithm, DSCB (Deterministic Sampling Classifier with weighted Bagging), which employs data preprocessing methods and a weighted bagging technique to classify non-stationary imbalanced data streams. It builds models based on an incoming data chunk, but it also takes previously arrived instances into account. The proposed approach has been evaluated based on a wide range of computer experiments carried out on real and artificially generated data streams with various imbalance ratios, label noise levels, and concept drift types. The results confirmed that the weighted bagging ensemble coupled with data preprocessing could outperform state-of-the-art methods.
... Reproducible research is the key towards the advancement of the machine learning community. If you want your method to have an impact, always provide the source code on GitHub and use popular frameworks such as MOA (Bifet et al., 2010b), River (Montiel et al., 2020), and Streamlearn (Ksieniewicz and Zyblewski, 2022). This will make sure that other researchers can use your classifier, as well as that it can be easily embedded in existing frameworks, such as our testbed. ...
Preprint
Full-text available
Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures on how to evaluate these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. In this way, we propose the first standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create a trustworthy and fair evaluation of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
... The analysis was based on six types of synthetic streams generated using streamlearn module [9], replicated  times for stability of the achieved results. All experiments can be replicated according to the Python source code, available on GitHub repository 1 . ...
Chapter
Real data streams, in addition to the possibility of concept drift occurrence, can often display a high imbalance ratio. Another important problem with real classification tasks, often overlooked in the literature, is the cost of obtaining labels. This work aims to connect three rarely combined research directions, i.e., data stream classification, imbalanced data classification, and limited access to labels. For this purpose, the behavior of the desisc-sb framework, proposed by the authors in earlier works for the classification of highly imbalanced data streams, was examined under the scenario of limited label access. Experiments conducted on synthetic and real streams confirmed the potential of using desisc-sb to classify highly imbalanced data streams even in the case of low label availability.
... The stream-learn module [63] was used to conduct all experiments. It is a complete set of tools that helps to process data streams. ...
Preprint
Full-text available
Imbalanced data classification remains a vital problem. The key is to find methods that classify both the minority and the majority class correctly. The paper presents a classifier ensemble for classifying binary, non-stationary and imbalanced data streams, where the Hellinger distance is used to prune the ensemble. The paper includes an experimental evaluation of the method. The first experiment checks the impact of the base classifier type on the quality of the classification. In the second experiment, the Hellinger Distance Weighted Ensemble (HDWE) method is compared to selected state-of-the-art methods using a statistical test with two base classifiers. The method was profoundly tested on many imbalanced data streams, and the obtained results proved the HDWE method's usefulness.
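The Hellinger distance used for pruning can be computed directly from two discrete distributions; the pruning rule sketched in the comment below is an illustrative simplification, not the exact HDWE criterion:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions
    (0 = identical, 1 = disjoint support)."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Pruning idea (a sketch, not the exact HDWE rule): drop ensemble members
# whose training-chunk class distribution lies furthest from the
# distribution of the newest chunk.
old_chunk_dist = [0.9, 0.1]   # imbalanced concept a member was trained on
new_chunk_dist = [0.5, 0.5]   # current, balanced concept

d = hellinger(old_chunk_dist, new_chunk_dist)
print(round(d, 3))  # prints 0.325
```

Unlike KL divergence, the Hellinger distance is symmetric and bounded in [0, 1], which makes it convenient as a comparable pruning score across chunks.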
... The neural network layers contained 2, 200, and 1 neurons, respectively. The networks were trained using the test-then-train approach [24] on 150 replicable data streams generated with the stream-learn library [27]. Each data stream contained 1000 chunks and was built of two recurring data distributions with six sudden concept drifts. ...
Chapter
Streaming data analysis is currently a rapidly growing research direction. One of the serious problems hindering data stream classification is the fact that during the exploitation of the model, its probabilistic characteristics may change. This phenomenon is called concept drift. Until today, multiple methods have been proposed to overcome its negative influence on model performance during learning in dynamic environments. This work introduces a new streaming data classifier based on a dropout technique that can significantly reduce model restoration time and performance loss, and can improve the model's overall score in the presence of recurring concept drifts. The usefulness of the proposed algorithm is evaluated based on an extensive experimental study and backed up with a thorough statistical analysis.
... The entire experimental evaluation was implemented using Python libraries: the scikit-learn [23] module for the implementation of the two base classifiers and all feature reduction methods, the stream-learn [24] module for data stream processing, the calculation of evaluation metrics and the employed classifier ensembles, and the scikit-multiflow [25] module for the implementation of the Hoeffding Tree. ...
Chapter
Using fake news as a political or economic tool is not new, but the scale of its use is currently alarming, especially on social media. The authors of misinformation try to influence the users' decisions, both in the economic and political sphere. The facts of using disinformation during elections are well known. Currently, two fake news detection approaches dominate. The first approach, the so-called fact or news checker, is based on the knowledge and work of volunteers; the second employs artificial intelligence algorithms for news analysis and manipulation detection. In this work, we focus on using machine learning methods to detect fake news. However, unlike most approaches, we treat incoming messages as stream data, taking into account the possibility of concept drift occurring, i.e., changes appearing in the probabilistic characteristics of the classification model during the exploitation of the classifier. The developed methods have been evaluated based on computer experiments on benchmark data, and the obtained results prove their usefulness for the problem under consideration. The proposed solutions are part of the distributed platform developed by the H2020 SocialTruth project consortium.
... The experimental evaluation was carried out by implementing the three considered methods consistently with the scikit-learn [10] API, according to the Test-Then-Train evaluation methodology, using synthetic data streams obtained from the generator implemented in the stream-learn module [8]. Source code for the experiments as well as detailed results are available in the public GitHub repository 1 . ...
Chapter
A significant problem when building classifiers based on data streams is information about the correct label. Most algorithms assume access to this information without any restrictions. Unfortunately, this is not possible in practice, because objects can arrive very quickly and labeling all of them is impossible, or we have to pay for providing the correct label (e.g., to a human expert). Hence, methods based on partially labeled data, including methods based on an active learning approach, are becoming increasingly popular, i.e., approaches in which the learning algorithm itself decides which of the objects are interesting enough to effectively improve the quality of the predictive model. In this paper, we propose a new method of active learning for data stream classifiers. Its quality has been compared with benchmark solutions based on a large number of test streams, and the obtained results prove the usefulness of the proposed method, especially in the case of a low budget dedicated to the labeling of incoming objects.
... To evaluate the proposed framework, 90 artificial data streams with various characteristics were generated using the stream-learn Python library [64]. Each data stream is composed of fifty thousand instances (200 chunks, 250 instances each) described by 8 informative features, and contains a single concept drift (in the 100th data chunk). ...
Article
(Free access until the end of October 2020: https://authors.elsevier.com/a/1blq25a7-GjBOl) This work aims to connect two rarely combined research directions, i.e., non-stationary data stream classification and data analysis with skewed class distributions. We propose a novel framework employing stratified bagging for training base classifiers to integrate data preprocessing and dynamic ensemble selection methods for imbalanced data stream classification. The proposed approach has been evaluated based on computer experiments carried out on 135 artificially generated data streams with various imbalance ratios, label noise levels, and types of concept drift, as well as on two selected real streams. Four preprocessing techniques and two dynamic selection methods, used on both the bagging classifier and base estimator levels, were considered. Experimentation results showed that, for highly imbalanced data streams, dynamic ensemble selection coupled with data preprocessing could outperform online and chunk-based state-of-the-art methods.
... The experiments were carried out in the Python environment using the scikit-learn [31], stream-learn [32], imbalanced-learn [33] and scikit-multiflow [34] libraries, together with own implementations of the modified AWE and AUE methods. The source code of the used algorithms, as well as the experimental procedure, is published in a public repository on GitHub (https://github.com/w4k2/imbalancedstream-ensembles). ...
Article
Full-text available
In the era of a large number of tools and applications that constantly produce massive amounts of data, their processing and proper classification is becoming both increasingly hard and important. This task is hindered by changes in the distribution of data over time, called concept drift, and by the emergence of a disproportion between classes, such as in the detection of network attacks or in fraud detection problems. In the following work, we propose methods to modify existing stream processing solutions, Accuracy Weighted Ensemble (AWE) and Accuracy Updated Ensemble (AUE), which have demonstrated their effectiveness in adapting to time-varying class distributions. The introduced changes are aimed at improving their quality in the binary classification of imbalanced data. The proposed modifications include aggregate metrics, such as F1-score, G-mean and balanced accuracy score, in the calculation of the member classifier weights, which affects their composition and final prediction. Moreover, the impact of data sampling on the algorithm's effectiveness was also checked. Complex experiments were conducted to define the most promising modification type, as well as to compare the proposed methods with existing solutions. Experimental evaluation shows an improvement in the quality of classification compared to the underlying algorithms and other solutions for processing imbalanced data streams.
... All tests were carried out using 24 generated streams and 30 real streams (Table 1). The generated data come from the stream-learn [12] generator. These generated streams differ in the level of imbalance: 10%, 20% and 30%. ...
Chapter
The classification of imbalanced data streams is gaining more and more interest. However, apart from the problem that one of the classes is not well represented, there are problems typical for data stream classification, such as limited resources, lack of access to the true labels, and the possibility of the occurrence of concept drift. The possibility of concept drift appearing enforces the design of an adaptation mechanism in the method. In this article, we propose the OCEIS classifier (One-Class support vector machine classifier Ensemble for Imbalanced data Stream). The main idea is to supply the committee with one-class classifiers trained on clustered data for each class separately. The results obtained from experiments carried out on synthetic and real data show that the proposed method achieves results at a similar level to the state-of-the-art methods it was compared with.
... The whole experimental evaluation was performed in Python, using the scikit-learn API [20] to implement the cws method, and is publicly available in the git repository 1 . As metrics for the conducted analysis, due to the imbalanced nature of the classification problem, three aggregate measures (balanced accuracy score, F1-score, and G-mean) and the three base measures constituting their calculation (precision, recall, and specificity) were applied, using their implementations included in the stream-learn package [17]. ...
Chapter
Learning from imbalanced datasets is a challenging task for standard classification algorithms. In general, there are two main approaches to solving the problem of imbalanced data: algorithm-level and data-level solutions. This paper deals with the second approach. In particular, it presents a new proposition for calculating the weighted score function used in the integration phase of a multiple classifier system. The presented research includes an experimental evaluation over multiple, open-source, highly imbalanced datasets, comparing the proposed algorithm with three other approaches in the context of six performance measures. Comprehensive experimental results show that the proposed algorithm achieves better performance measures than the other ensemble methods on highly imbalanced datasets.
... The evaluation in each of the experiments is based on 5 metrics commonly used to assess the quality of classification for imbalanced problems. These are the F1 score [15], precision and recall [13], G-mean [11] and balanced accuracy score (bac) [3], according to the stream-learn [10] implementation. All experiments have been implemented in Python and can be repeated using the code on GitHub 1 . ...
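The imbalance-oriented metrics named in this snippet are all derived from the binary confusion matrix; a self-contained sketch of the standard definitions (not the stream-learn implementation itself):

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity and the aggregates F1, G-mean and
    balanced accuracy, computed from confusion-matrix counts, with the
    minority class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity, TPR
    specificity = tn / (tn + fp)     # TNR
    f1 = 2 * precision * recall / (precision + recall)
    gmean = (recall * specificity) ** 0.5
    bac = (recall + specificity) / 2
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "g-mean": gmean, "bac": bac}

# A 10%-minority chunk of 100 samples where half the minority is missed:
# plain accuracy is 90%, but the recall-sensitive aggregates reveal the bias.
m = binary_metrics(tp=5, fp=5, fn=5, tn=85)
print({k: round(v, 3) for k, v in m.items()})
```

Because G-mean and balanced accuracy average over both class-wise rates, a classifier that ignores the minority class cannot score well on them, unlike on plain accuracy.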
Chapter
Imbalanced data analysis remains one of the critical challenges in machine learning. This work aims to adapt the concept of Dynamic Classifier Selection (dcs) to the pattern classification task with a skewed class distribution. Two methods, using the similarity (distance) to the reference instances and the class imbalance ratio to select the most confident classifier for a given observation, have been proposed. Both approaches come in two modes, one based on the k-Nearest Oracles (knora) and the other also considering those cases where the classifier makes a mistake. The proposed methods were evaluated based on computer experiments carried out on datasets with a high imbalance ratio. The obtained results and statistical analysis confirm the usefulness of the proposed solutions.
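Dynamic Classifier Selection in the k-Nearest-Oracles style mentioned above can be sketched as follows; the validation set, classifier pool, and shrinking-neighbourhood fallback are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def knora_select(query, X_val, y_val, predictions, k=5):
    """k-Nearest-Oracles-style selection (illustrative): keep the classifiers
    that are correct on all k validation neighbours of the query; if none
    qualifies, shrink the neighbourhood until at least one does.
    `predictions[i, j]` is classifier j's prediction for validation sample i."""
    order = np.argsort(np.linalg.norm(X_val - query, axis=1))
    while k > 0:
        idx = order[:k]
        correct = (predictions[idx] == y_val[idx, None]).all(axis=0)
        if correct.any():
            return np.flatnonzero(correct)
        k -= 1
    return np.arange(predictions.shape[1])  # fall back to the whole pool

# Two hypothetical pool members: one correct everywhere, one always wrong.
X_val = np.array([[0.0], [1.0], [2.0], [3.0]])
y_val = np.array([0, 0, 1, 1])
predictions = np.column_stack([y_val, 1 - y_val])  # clf 0 perfect, clf 1 inverted

print(knora_select(np.array([1.5]), X_val, y_val, predictions, k=3))
```

In the imbalanced setting, the neighbourhood or the competence test can additionally be weighted by the class imbalance ratio, which is the direction the paper's methods take.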
Conference Paper
With the processing of data streams come inevitable challenges, such as changes in the prior (class drift) and posterior (concept drift) probability distribution over the processing time. Both these phenomena have a negative impact on the quality of the classification. Heavily imbalanced problems, which are typical for many real-world applications, bring additional processing difficulties. Classifiers are often biased towards the majority class and have difficulty identifying instances of categories described by a lower number of objects. The following article proposes a Prior Probability Assisted Classifier (2PAC), a method aiming to improve the classification quality of heavily imbalanced data streams with dynamic changes by using the estimated prior probability value and the correction of the classifier's decision for batch predictions. The presented extensive computer experiments, supported by statistical analysis, show the ability of the proposed method to improve the classification quality.
Chapter
Nowadays, swarm intelligence shows high accuracy while solving difficult problems, including image processing problems. Image edge detection is a complex optimization problem due to high-resolution images involving large matrices of pixels. The current work describes several environment-sensitive models involving swarm intelligence. The agents' sensitivity is used to guide the swarm towards the best solution. Both general theoretical guidance and a practical example for a particular swarm are included. The quality of the results is measured using several known measures. Keywords: Swarm intelligence, Image processing, Image edge detection
Chapter
With the advancement of internet technologies, network traffic monitoring and cyber-attack detection are becoming more and more important for critical infrastructure. Unfortunately, there are still relatively few works in the literature that interpret the available benchmark data as data streams and take into account the dynamic characteristics of network packet tracking. The following work proposes an approach to generating data streams from the IoT-23 dataset and evaluates the resulting streams based on their suitability for use in network intrusion detection research.
Chapter
One of the current challenges for supervised classification methods is to obtain acceptable values of the performance measures on imbalanced datasets. There is a significant disproportion in the number of objects from different class labels in datasets with a high imbalance ratio. This article analyzes a clustering and weighted scoring algorithm based on estimating the number of clusters, which considers the minimum number of objects from the minority class label in each cluster. This algorithm uses a distance metric when determining the value of the score function. Therefore, this article aims to analyze the impact of the choice of distance metric on the values of six classification performance measures. The performed experiments show that the Euclidean distance allows obtaining the best classification results for imbalanced datasets.
Chapter
Imbalanced datasets are still a major challenge in data mining and machine learning. Various machine learning methods and their combinations are considered to improve the quality of classification on imbalanced datasets. This paper presents an approach in which clustering and a weighted scoring function based on geometric space are used. In particular, we propose a significant modification to our earlier algorithm: automatic estimation of the number of clusters and of the minimum number of objects in a particular cluster. The proposed algorithm was compared with our earlier proposal and state-of-the-art algorithms using highly imbalanced datasets. The performed experiments show that the proposed modification is statistically better than the original algorithm for a larger number of reference classifiers.
Article
Imbalanced data classification remains a vital problem. The key is to find methods that classify both the minority and majority class correctly. The paper presents a classifier ensemble for classifying binary, non-stationary and imbalanced data streams, in which the Hellinger distance is used to prune the ensemble. The paper includes an experimental evaluation of the method. The first experiment checks the impact of the base classifier type on the quality of the classification. In the second experiment, the Hellinger Distance Weighted Ensemble (hdwe) method is compared to selected state-of-the-art methods using a statistical test with two base classifiers. The method was thoroughly tested on many imbalanced data streams, and the obtained results proved its usefulness.
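The Hellinger distance used for pruning can be computed directly between two discrete distributions (e.g., a component classifier's class distribution versus that of the current chunk). A minimal sketch; how the distributions are estimated is up to the method:

```python
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.
    Ranges from 0 (identical) to 1 (disjoint support)."""
    return sqrt(sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q))) / sqrt(2)
```

Being bounded and symmetric, it gives a convenient pruning criterion: components whose distribution drifts too far from the current chunk can be dropped or down-weighted.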
Chapter
Full-text available
The nature of the analysed data may make many practical data mining tasks difficult. This work focuses on two important research topics associated with data analysis, i.e., data stream classification and data analysis with imbalanced class distributions. We propose a novel classification method, employing a classifier selection approach, which can update its model when new data arrives. The proposed approach has been evaluated in computer experiments carried out on a diverse pool of non-stationary data streams. Their results confirmed the usefulness of the proposed concept, which can outperform state-of-the-art classifier selection algorithms, especially in the case of highly imbalanced data streams.
Article
Full-text available
scikit-multiflow is a framework for learning from data streams and multi-output learning in Python. Conceived to serve as a platform to encourage the democratization of stream learning research, it provides multiple state-of-the-art learning methods, data generators and evaluators for different stream learning problems, including single-output, multi-output and multi-label. scikit-multiflow builds upon popular open source frameworks including scikit-learn, MOA and MEKA. Development follows the FOSS principles. Quality is enforced by complying with PEP8 guidelines, using continuous integration and functional testing. The source code is available at https://github.com/scikit-multiflow/scikit-multiflow. © 2018 Jacob Montiel, Jesse Read, Albert Bifet and Talel Abdessalem.
Article
Full-text available
Imbalanced-learn is an open-source python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced dataset frequently encountered in machine learning and pattern recognition. The implemented state-of-the-art methods can be categorized into 4 groups: (i) under-sampling, (ii) over-sampling, (iii) combination of over- and under-sampling, and (iv) ensemble learning methods. The proposed toolbox only depends on numpy, scipy, and scikit-learn and is distributed under MIT license. Furthermore, it is fully compatible with scikit-learn and is part of the scikit-learn-contrib supported project. Documentation, unit tests as well as integration tests are provided to ease usage and contribution. The toolbox is publicly available in GitHub: https://github.com/scikit-learn-contrib/imbalanced-learn.
Article
Full-text available
Despite more than two decades of continuous development learning from imbalanced data is still a focus of intense research. Starting as a problem of skewed distributions of binary tasks, this topic evolved way beyond this conception. With the expansion of machine learning and data mining, combined with the arrival of big data era, we have gained a deeper insight into the nature of imbalanced learning, while at the same time facing new emerging challenges. Data-level and algorithm-level methods are constantly being improved and hybrid approaches gain increasing popularity. Recent trends focus on analyzing not only the disproportion between classes, but also other difficulties embedded in the nature of data. New real-life problems motivate researchers to focus on computationally efficient, adaptive and real-time methods. This paper aims at discussing open issues and challenges that need to be addressed to further develop the field of imbalanced learning. Seven vital areas of research in this topic are identified, covering the full spectrum of learning from imbalanced data: classification, regression, clustering, data streams, big data analytics and applications, e.g., in social media and computer vision. This paper provides a discussion and suggestions concerning lines of future research for each of them.
Article
Full-text available
Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without a clear understanding of the biases, and corresponding identification of chance or base-case levels of the statistic. Using these measures, a system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance (Informedness), and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally, we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
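In the binary case, Informedness reduces to TPR + TNR - 1 (Youden's J) and Markedness to its dual PPV + NPV - 1. A sketch from confusion-matrix counts (assuming a non-degenerate matrix, i.e., no zero row or column):

```python
def informedness_markedness(tp, fp, fn, tn):
    """Binary Informedness (TPR + TNR - 1) and Markedness (PPV + NPV - 1)."""
    tpr = tp / (tp + fn)          # recall / sensitivity
    tnr = tn / (tn + fp)          # inverse recall / specificity
    ppv = tp / (tp + fp)          # precision
    npv = tn / (tn + fn)          # inverse precision
    return tpr + tnr - 1, ppv + npv - 1
```

Both measures are zero for chance-level prediction regardless of class skew, which is exactly the bias correction the abstract argues Recall, Precision and F-Measure lack.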
Article
Full-text available
Online class imbalance learning is a new learning problem that combines the challenges of both online learning and class imbalance learning. It deals with data streams having very skewed class distributions. This type of problems commonly exists in real-world applications, such as fault diagnosis of real-time control monitoring systems and intrusion detection in computer networks. In our earlier work, we defined class imbalance online, and proposed two learning algorithms OOB and UOB that build an ensemble model overcoming class imbalance in real time through resampling and time-decayed metrics. In this paper, we further improve the resampling strategy inside OOB and UOB, and look into their performance in both static and dynamic data streams. We give the first comprehensive analysis of class imbalance in data streams, in terms of data distributions, imbalance rates and changes in class imbalance status. We find that UOB is better at recognizing minority-class examples in static data streams, and OOB is more robust against dynamic changes in class imbalance status. The data distribution is a major factor affecting their performance. Based on the insight gained, we then propose two new ensemble methods that maintain both OOB and UOB with adaptive weights for final predictions, called WEOB1 and WEOB2. They are shown to possess the strength of OOB and UOB with good accuracy and robustness.
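The time-decayed class-size metric that drives the resampling in OOB/UOB can be sketched as an exponential update of per-class proportions; θ is the decay factor and the function name is illustrative:

```python
def update_class_sizes(sizes, y, theta=0.9):
    """One online step of w_k(t) = theta * w_k(t-1) + (1 - theta) * [y_t == k],
    where `sizes` maps each class label k to its current decayed proportion."""
    return {k: theta * w + (1.0 - theta) * (1.0 if y == k else 0.0)
            for k, w in sizes.items()}
```

Comparing the decayed minority and majority proportions tells the ensemble, in real time, how aggressively to oversample or undersample the incoming example.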
Article
Full-text available
More than 15 years have passed since the F-measure was first introduced to evaluation tasks of information extraction technology at the Fourth Message Understanding Conference (MUC-4) in 1992. Recently, I sometimes see confusion with the definition of the F-measure, which seems to be triggered by a lack of background knowledge about how it was derived. Since I was not involved in the process of introducing or devising the F-measure, I might not be the best person to explain this, but I hope this note will be of some help to those who are wondering what the F-measure really is. This introduction is devoted to providing brief but sufficient information on the F-measure.
Article
Full-text available
Most streaming decision models evolve continuously over time, run in resource-aware environments, and detect and react to changes in the environment generating data. One important issue, not yet convincingly addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. This paper proposes a general framework for assessing predictive stream learning algorithms. We defend the use of prequential error with forgetting mechanisms to provide reliable error estimators. We prove that, in stationary data and for consistent learning algorithms, the holdout estimator, the prequential error and the prequential error estimated over a sliding window or using fading factors, all converge to the Bayes error. The use of prequential error with forgetting mechanisms reveals to be advantageous in assessing performance and in comparing stream learning algorithms. It is also worthwhile to use the proposed methods for hypothesis testing and for change detection. In a set of experiments in drift scenarios, we evaluate the ability of a standard change detection algorithm to detect change using three prequential error estimators. These experiments point out that the use of forgetting mechanisms (sliding windows or fading factors) are required for fast and efficient change detection. In comparison to sliding windows, fading factors are faster and memoryless, both important requirements for streaming applications. Overall, this paper is a contribution to a discussion on best practice for performance assessment when learning is a continuous process, and the decision models are dynamic and evolve over time.
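The prequential error with a fading factor α described above has a simple recursive form, E_i = S_i / B_i with S_i = L_i + αS_{i-1} and B_i = 1 + αB_{i-1}, which is memoryless in contrast to a sliding window. A direct sketch over a sequence of 0/1 losses:

```python
def prequential_error(losses, alpha=0.99):
    """Fading-factor prequential error over a loss sequence:
    E_i = S_i / B_i, with S_i = L_i + alpha*S_{i-1} and B_i = 1 + alpha*B_{i-1}."""
    s = b = 0.0
    errors = []
    for loss in losses:
        s = loss + alpha * s   # decayed sum of losses
        b = 1.0 + alpha * b    # decayed count of examples
        errors.append(s / b)
    return errors
```

With α = 1 this degenerates to the plain running mean; α < 1 makes old losses fade, which is what enables fast change detection on drifting streams.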
Chapter
Full-text available
The most challenging applications of knowledge discovery involve dynamic environments where data continuously flows at high speed and exhibits non-stationary properties. In this chapter we discuss the main challenges and issues when learning from data streams, covering the most relevant topics in knowledge discovery from data streams: incremental learning, cost-performance management, change detection, and novelty detection. We present illustrative algorithms for these learning tasks, and a real-world application illustrating the advantages of stream processing. The chapter ends with some open issues that emerge from this new research area.
Conference Paper
Full-text available
Ensemble methods have recently garnered a great deal of attention in the machine learning community. Techniques such as Boosting and Bagging have proven to be highly effective but require repeated resampling of the training data, making them inappropriate in a data mining context. The methods presented in this paper take advantage of plentiful data, building separate classifiers on sequential chunks of training points. These classifiers are combined into a fixed-size ensemble using a heuristic replacement strategy. The result is a fast algorithm for large-scale or streaming data that classifies as well as a single decision tree built on all the data, requires approximately constant memory, and adjusts quickly to concept drift.
Conference Paper
Full-text available
In this paper we study the problem of constructing accurate block-based ensemble classifiers from time evolving data streams. AWE is the best-known representative of these ensembles. We propose a new algorithm called Accuracy Updated Ensemble (AUE), which extends AWE by using online component classifiers and updating them according to the current distribution. Additional modifications of weighting functions solve problems with undesired classifier excluding seen in AWE. Experiments with several evolving data sets show that, while still requiring constant processing time and memory, AUE is more accurate than AWE.
Article
Full-text available
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. ...
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
Article
Full-text available
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naïve Bayes classifiers at the leaves. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.
Article
Full-text available
Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications including credit card fraud protection, target marketing, network intrusion detection, etc. Conventional knowledge discovery tools are facing two challenges, the overwhelming volume of the streaming data, and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
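The expected-accuracy weighting in such chunk-based ensembles is commonly expressed as w_i = MSE_r - MSE_i, where MSE_r is the mean squared error of a classifier predicting randomly according to the class priors of the most recent chunk. A hedged sketch of that rule (argument names are illustrative):

```python
def awe_weight(probs_true_class, class_priors):
    """Chunk-ensemble component weight w_i = MSE_r - MSE_i.
    probs_true_class: probability the component assigned to the true class of
    each evaluation example in the latest chunk.
    class_priors: observed class probabilities p(c) on that chunk."""
    mse_i = sum((1 - p) ** 2 for p in probs_true_class) / len(probs_true_class)
    mse_r = sum(p * (1 - p) ** 2 for p in class_priors)  # random-guess baseline
    return mse_r - mse_i
```

Components no better than random guessing get a non-positive weight and are effectively excluded, which is how the ensemble tracks concept drift across chunks.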
Conference Paper
Despite the fact that real-life data streams may often be characterized by the dynamic changes in the prior class probabilities, there is a scarcity of articles trying to clearly describe and classify this problem as well as suggest new methods dedicated to resolving this issue. The following paper aims to fill this gap by proposing a novel data stream taxonomy defined in the context of prior class probability and by introducing the Dynamic Statistical Concept Analysis (DSCA) – prior probability estimation algorithm. The proposed method was evaluated using computer experiments carried out on 100 synthetically generated data streams with various class imbalance characteristics. The obtained results, supported by statistical analysis, confirmed the usefulness of the proposed solution, especially in the case of discrete dynamically imbalanced data streams (DDIS).
Chapter
Using fake news as a political or economic tool is not new, but the scale of its use is currently alarming, especially on social media. The authors of misinformation try to influence users' decisions, both in the economic and political sphere. The use of disinformation during elections is well documented. Currently, two fake news detection approaches dominate. The first, the so-called fact or news checker, is based on the knowledge and work of volunteers; the second employs artificial intelligence algorithms for news analysis and manipulation detection. In this work, we focus on using machine learning methods to detect fake news. However, unlike most approaches, we treat incoming messages as stream data, taking into account the possibility of concept drift, i.e., changes appearing in the probabilistic characteristics of the classification model during the exploitation of the classifier. The developed methods have been evaluated in computer experiments on benchmark data, and the obtained results prove their usefulness for the problem under consideration. The proposed solutions are part of the distributed platform developed by the H2020 SocialTruth project consortium.
Article
This work aims to connect two rarely combined research directions, i.e., non-stationary data stream classification and data analysis with skewed class distributions. We propose a novel framework employing stratified bagging for training base classifiers to integrate data preprocessing and dynamic ensemble selection methods for imbalanced data stream classification. The proposed approach has been evaluated in computer experiments carried out on 135 artificially generated data streams with various imbalance ratios, label noise levels, and types of concept drift, as well as on two selected real streams. Four preprocessing techniques and two dynamic selection methods, used at both the bagging classifier and base estimator levels, were considered. Experimentation results showed that, for highly imbalanced data streams, dynamic ensemble selection coupled with data preprocessing could outperform online and chunk-based state-of-the-art methods.
Chapter
Learning from the non-stationary imbalanced data stream is a serious challenge to the machine learning community. There is a significant number of works addressing the issue of classifying non-stationary data stream, but most of them do not take into consideration that the real-life data streams may exhibit high and changing class imbalance ratio, which may complicate the classification task. This work attempts to connect two important, yet rarely combined, research trends in data analysis, i.e., non-stationary data stream classification and imbalanced data classification. We propose a novel framework for training base classifiers and preparing the dynamic selection dataset (DSEL) to integrate data preprocessing and dynamic ensemble selection (DES) methods for imbalanced data stream classification. The proposed approach has been evaluated on the basis of computer experiments carried out on 72 artificially generated data streams with various imbalance ratios, levels of label noise and types of concept drift. In addition, we consider six variations of preprocessing methods and four DES methods. Experimentation results showed that dynamic ensemble selection, even without the use of any data preprocessing, can outperform a naive combination of the whole pool generated with the use of preprocessing methods. Combining DES with preprocessing further improves the obtained results.
Chapter
From one year to the next, ever vaster amounts of data are being created in different fields of application. A great deal of those sources require real-time processing and analysis, which leads to increased interest in the streaming data classification field of machine learning. It is not rare that many of those applications deal with somehow skewed or imbalanced data. In this paper, we analyze the use of smote oversampling algorithm variations in learning patterns from imbalanced data streams using different incremental learning ensemble algorithms.
Article
Due to the variety of modern real-life tasks, where the analyzed data is often not a static set, data stream mining has gained substantial focus in the machine learning community. The main property of such systems is the large amount of data arriving in a sequential manner, which creates an endless stream of objects. Taking into consideration the limited resources of memory and computational power, it is widely accepted that each instance can be processed at most once and is not remembered, making re-evaluation impossible. In the following work, we focus on the data stream classification task, where parameters of a classification model may vary over time, so the model should be able to adapt to the changes. This requires a forgetting mechanism, ensuring that outdated samples will not impact the model. The most popular approaches are based on so-called windowing, requiring storage of a batch of objects; when new examples arrive, the least relevant ones are forgotten. Objects in a new window are used to retrain the model, which is cumbersome, especially for online learners, and contradicts the principle of processing each object at most once. Therefore, this work employs the inbuilt forgetting mechanism of neural networks. Additionally, to reduce the need for expensive (sometimes even impossible) object labeling, we focus on active learning, which asks for labels only for interesting examples, crucial for appropriate model updating. Characteristics of the proposed methods were evaluated in computer experiments performed over a diverse pool of data streams. Their results confirmed the suitability of the proposed strategy.
Article
With a plethora of available classification performance measures, choosing the right metric for the right task requires careful thought. To make this decision in an informed manner, one should study and compare general properties of candidate measures. However, analysing measures with respect to complete ranges of their domain values is a difficult and challenging task. In this study, we attempt to support such analyses with a specialized visualisation technique, which operates in a barycentric coordinate system using a 3D tetrahedron. Additionally, we adapt this technique to the context of imbalanced data and put forward a set of measure properties, which should be taken into account when examining a classification performance measure. As a result, we compare 22 popular measures and show important differences in their behaviour. Moreover, for parametric measures such as the Fβ and IBAα(G-mean), we analytically derive parameter thresholds that pinpoint the changes in measure properties. Finally, we provide an online visualisation tool that can aid the analysis of measure variability throughout their entire domains.
Article
In many applications of information systems learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements for algorithms to incrementally process incoming examples while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drifts. Out of several new proposed stream algorithms, ensembles play an important role, in particular for non-stationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.
Article
Distributed data mining (DDM) deals with the problem of analyzing distributed, possibly multi-party data by paying attention to the computing, communication, storage, and human factors-related issues in a distributed environment. Unlike the conventional off-the-shelf centralized data mining products, DDM systems are based on fundamentally distributed algorithms that do not necessarily require centralization of data and other resources. DDM technology is finding an increasing number of applications in many domains. Examples include data driven pervasive applications for mobile and embedded devices, grid-based large scale scientific and business data analysis, security and defense related applications involving analysis of multi-party possibly privacy-sensitive data, and peer-to-peer data stream mining in sensor and file-sharing networks. This talk will focus on peer-to-peer (P2P) distributed data stream mining and monitoring. It will first discuss the foundation of approximate and exact P2P algorithms for data analysis. Then it will present a class of P2P algorithms for eigen-analysis and clustering in detail. The talk will end with a discussion on the future directions of research on P2P data mining.
Article
Adding examples of the majority class to the training set can have a detrimental effect on the learner's behavior: noisy or otherwise unreliable examples from the majority class can overwhelm the minority class. The paper discusses criteria to evaluate the utility of classifiers induced from such imbalanced training sets, gives an explanation of the poor behavior of some learners under these circumstances, and suggests as a solution a simple technique called one-sided selection of examples. 1 Introduction The general topic of this paper is learning from examples described by pairs [x, c(x)], where x is a vector of attribute values and c(x) is the corresponding concept label. For simplicity, we consider only problems where c(x) is either positive or negative, and all attributes are continuous. Since Fisher (1936), this task has received plenty of attention from statisticians as well as from researchers in artificial neural networks, AI, and ML. A typical scenario assumes the e...
Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning
  • Lemaître
Weighted aging classifier ensemble for the incremental drifted data streams
  • Woźniak
  • R M O Cruz
  • L G Hafemann
  • R Sabourin
  • G D C Cavalcanti
R.M.O. Cruz, L.G. Hafemann, R. Sabourin, G.D.C. Cavalcanti, DESlib: A Dynamic ensemble selection library in Python, arXiv preprint arXiv:1802.04967.
Weighted aging classifier ensemble for the incremental drifted data streams
  • M Woźniak
  • A Kasprzak
M. Woźniak, A. Kasprzak, P. Cal, Weighted aging classifier ensemble for the incremental drifted data streams, in: H.L. Larsen, M.J. Martin-Bautista, M.A. Vila, T. Andreasen, H. Christiansen (Eds.), Flexible Query Answering Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 579-588.
  • R A Baeza-Yates
  • B Ribeiro-Neto
R.A. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc, Boston, MA, USA, 1999.
The truth of the F-measure, Teach. Tutor. Mater.
  • Y Sasaki
Y. Sasaki, The truth of the F-measure, Teach. Tutor. Mater.