Project

Compound data stream classification methods based on unsupervised and active learning

Goal: The project concerns machine learning algorithms for data stream classification. The primary objective in designing such systems is to achieve the highest efficiency, understood as a trade-off between accuracy and processing time. To achieve this goal, the model of the system has to be adapted to the specifics of the task. In the case of streaming data classification we should take into account that:
(a) the characteristics of data may change over time, which is called Concept Drift,
(b) the computational speed of the system must be high enough to allow efficient processing of large amounts of information in an acceptably short time.
In the project we plan to develop a number of algorithms that ensure high resistance of the classification system to the aforementioned concept drift. We also plan to investigate the applicability of algorithms based on distributed and parallel programming paradigms in order to ensure high processing speed for streaming data.
In conclusion, we define the following project objectives:
1. Developing new methods of supervised and unsupervised concept drift detection, along with the respective classification algorithms dedicated to stream processing;
2. Developing new classifier models, especially classifier ensembles, along with the respective adaptive learning algorithms that continuously adjust classifier parameters to the changing characteristics of the data stream;
3. Developing machine learning algorithms that use distributed and parallel programming paradigms.
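
To make the notion of supervised concept drift detection from objective 1 more concrete, here is a minimal sketch of an error-rate detector in the spirit of the well-known DDM scheme (Gama et al., 2004). It is an illustration only and is not one of the methods developed in this project; the class name, the parameter defaults and the synthetic error stream are all assumptions made for the example.

```python
import math


class DDMLikeDetector:
    """Illustrative error-rate drift detector in the spirit of DDM.

    It tracks the running error rate p of a classifier and its standard
    deviation s, remembers the point where p + s was lowest, and signals
    a warning or a drift when p + s rises far enough above that minimum.
    """

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples      # instances to see before testing
        self.warning_level = warning_level  # multiples of s_min for a warning
        self.drift_level = drift_level      # multiples of s_min for a drift
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed 1 for a misclassified instance and 0 for a correct one.

        Returns 'stable', 'warning' or 'drift'.
        """
        self.n += 1
        self.p += (error - self.p) / self.n  # incremental mean of the errors
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()  # start collecting statistics for the new concept
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


def first_drift_index(errors, detector):
    """Return the index of the first instance at which drift is signalled."""
    for i, error in enumerate(errors):
        if detector.update(error) == "drift":
            return i
    return None


# Synthetic stream of classifier mistakes (an assumption for the demo):
# 10% errors for the first 500 instances, then 50% errors afterwards.
errors = [1 if i % 10 == 0 else 0 for i in range(500)]
errors += [1 if i % 2 == 0 else 0 for i in range(500)]

drift_at = first_drift_index(errors, DDMLikeDetector())
print("drift signalled at instance", drift_at)
```

In a streaming classifier, a 'warning' would typically trigger training of a backup model and a 'drift' would trigger replacing the current model. Note that a detector of this kind needs the true label of every instance, which is exactly the assumption that the unsupervised and active learning approaches in this project try to relax.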

Project log

Michal Wozniak
added a research item
Imbalanced data classification remains an important topic, and over the past decades many works have been devoted to this field of study. A growing number of real-life imbalanced class problems has inspired researchers to come up with new, better-performing solutions. Various techniques are employed, such as data handling approaches, algorithm-level approaches, active learning approaches, and kernel-based methods, to name only a few. This work aims at applying novel dynamic selection methods to imbalanced data classification problems. The experiments carried out on several benchmark datasets confirm their high performance.
Michal Wozniak
added 2 research items
Big data is usually described by the so-called 5Vs (Volume, Velocity, Variety, Veracity, Value). Business success in the big data era strongly depends on smart analytical software that can help to make efficient decisions (Value for the enterprise). Therefore, decision support software should take into account, in particular, that we deal with massive data (Volume) and that data usually arrives continuously in the form of a so-called data stream (Velocity). Unfortunately, most traditional data analysis methods are not ready to efficiently analyze the fast-growing number of stored records. Additionally, one should also consider a phenomenon appearing in data streams called concept drift, which means that the parameters of the model in use change over time, which could dramatically decrease the quality of the analytical model. This work focuses on the classification task, which is very popular in many practical applications such as fraud detection, network security, or medical diagnosis. We propose how to detect changes in the data stream using a combined concept drift detection model. The experimental evaluations show that this is an interesting direction, which encourages us to use it in practical applications.
Michal Wozniak
added a project reference
Bartosz Krawczyk
added 2 research items
In many applications of information systems, learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements for algorithms to incrementally process incoming examples while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drifts. Among the many newly proposed stream algorithms, ensembles play an important role, in particular for non-stationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.
Activity recognition is one of the emerging trends in the domain of mining ubiquitous environments. It assumes that we can recognize the current action undertaken by the monitored subject on the basis of the outputs of a set of associated sensors. Often different combinations of smart devices are used, thus creating an Internet of Things. Such data arrive continuously during the operation time of the sensors and require online processing in order to keep real-time track of the activity currently being undertaken. This forms a natural data stream problem with the potential presence of changes in the arriving data. Therefore, we require an efficient online machine learning system that can offer high recognition rates and adapt to drifts and shifts in the stream. In this paper we propose an efficient and lightweight adaptive ensemble learning system for real-time activity recognition. We use a weighted modification of the Naïve Bayes classifier that can swiftly adapt itself to the current state of the stream without the need for an external concept drift detector. To tackle the multi-class nature of the activity recognition problem we propose to use a one-vs-one decomposition to form a committee of simpler and diverse learners. We introduce a novel weighted combination for one-vs-one decomposition that can adapt itself over time. Additionally, to limit the cost of supervision we propose to enhance our classification system with an active learning paradigm that selects only the most important objects for labeling and works under a constrained budget. Experiments carried out on six data streams gathered from ubiquitous environments show that the proposed active and adaptive ensemble offers excellent classification accuracy with a low requirement for access to true class labels.
Bartosz Krawczyk
added an update
New paper on active and adaptive ensemble learning published in Knowledge-Based Systems:
B. Krawczyk, Active and adaptive ensemble learning for online activity recognition from data streams, Knowledge-Based Systems (2017), https://doi.org/10.1016/j.knosys.2017.09.032
 
Bartosz Krawczyk
added an update
Two new conference papers on learning with limited class label availability from drifting data streams (co-authored respectively with Michal Wozniak and Lukasz Korycki):
1. B. Krawczyk, M. Woźniak: Online Query by Committee for Active Learning from Drifting Data Streams. IJCNN 2017: 2120-2127.
2. Ł. Korycki, B. Krawczyk: Combining Active Learning and Self-Labeling for Data Stream Mining. CORES 2017: 481-490.
 
Bartosz Krawczyk
added a research item
Most data stream learning methods assume that the true class of an incoming instance is available right after it has been processed. However, the assumption that we have unlimited access to class labels is unrealistic and is directly connected with a very high labeling cost. This is a driving force behind the growing development of methods that require reduced or no access to class labels. Among several potential directions, active learning emerges as a promising solution, as it allows the most valuable instances to be selected from the stream while using as few label queries as possible. Despite numerous proposals of active learning methods for static data, this domain is still developing for data streams. Here, the non-stationary nature of data must be taken into consideration and the proposed algorithms must accommodate potential occurrences of concept drift. In this paper we propose a Query by Committee active learning strategy that is adapted to online learning from drifting data streams. A decision regarding a label query is made by an ensemble of classifiers instead of a single learner, leading to improved instance selection. We present four different approaches for online Query by Committee and evaluate their usefulness on the basis of the accuracy obtained with limited budgets and the ability to handle concept drift. We introduce Budget Loss of Accuracy, a novel measure for evaluating active learning algorithms. Finally, we investigate the relationships between the efficacy of Query by Committee models and the diversity of the underlying ensembles. Based on a thorough experimental investigation we are able to show the usefulness of the proposed algorithms for reducing labeling effort in learning from drifting data streams.
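
The general idea of a budgeted Query by Committee decision can be sketched as follows. This is a minimal illustration under stated assumptions, not any of the four approaches from the paper: the vote-entropy disagreement measure, the threshold value, the `ThresholdMember` toy learner and the synthetic one-dimensional stream are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes (0 means full agreement)."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


class ThresholdMember:
    """Hypothetical 1-D committee member: predicts 1 when x > threshold."""

    def __init__(self, threshold, rate=0.5):
        self.threshold = threshold
        self.rate = rate

    def predict(self, x):
        return int(x > self.threshold)

    def learn(self, x, y):
        # On a mistake, move the threshold towards the misclassified point.
        if self.predict(x) != y:
            self.threshold += self.rate * (x - self.threshold)


class BudgetedQbC:
    """Asks for a label only when the committee disagrees and budget allows."""

    def __init__(self, committee, budget=0.2, threshold=0.5):
        self.committee = committee
        self.budget = budget        # max fraction of instances to be labelled
        self.threshold = threshold  # min vote entropy that triggers a query
        self.seen = 0
        self.queried = 0

    def should_query(self, x):
        self.seen += 1
        # Refuse if one more query would exceed the labelling budget.
        if (self.queried + 1) / self.seen > self.budget:
            return False
        votes = [member.predict(x) for member in self.committee]
        if vote_entropy(votes) >= self.threshold:
            self.queried += 1
            return True
        return False


def run_stream(xs, label_of, qbc):
    """Process the stream; only queried instances are labelled and learned."""
    for x in xs:
        if qbc.should_query(x):
            y = label_of(x)  # the costly call to the labelling oracle
            for member in qbc.committee:
                member.learn(x, y)


# Synthetic demo (assumption): true concept is y = 1 when x > 0.5.
committee = [ThresholdMember(0.3), ThresholdMember(0.5), ThresholdMember(0.7)]
qbc = BudgetedQbC(committee, budget=0.2, threshold=0.5)
xs = [(i * 37 % 100) / 100 for i in range(200)]
run_stream(xs, lambda x: int(x > 0.5), qbc)
print("labels requested:", qbc.queried, "of", qbc.seen)
```

The budget check runs before the committee votes, so the fraction of labelled instances never exceeds the budget at any point in the stream. To handle concept drift, a real system would also have to replace or reweight committee members over time, which this sketch leaves out.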
Bartosz Krawczyk
added an update
Our new survey on data preprocessing algorithms for data stream mining, authored by Sergio Ramírez-Gallego, me, Salvador García, Michal Wozniak and Francisco Herrera, has just appeared in the Neurocomputing journal.
This article includes:
  • A thorough survey on recent advances in data preprocessing algorithms, including feature selection, instance selection and discretization, designed specifically for mining data streams.
  • A detailed experimental study comparing the performance of most important algorithms in these fields on a set of diverse streaming benchmarks, together with analysis of their performance and discussion of their areas of applicability.
  • A discussion on emerging challenges in this field, together with potential and promising future research directions.
  • A software package with all of the analyzed algorithms, available as an easy-to-use plug-in for the popular MOA environment.
 
Bartosz Krawczyk
added an update
Our new survey on ensemble learning for data stream analysis, authored by me, Leandro L. Minku, João Gama, Jerzy Stefanowski and Michal Wozniak, has just appeared in the Information Fusion journal.
This paper contains:
  • A comprehensive survey of ensemble approaches for various data stream mining problems, including supervised classification, regression, imbalanced streams, active and semi-supervised approaches, novelty detection, one-class classification, multi-label data etc.
  • A detailed taxonomy of ensemble algorithms for discussed data stream mining tasks.
  • A thorough discussion of open research problems and lines of future research.
 
Michal Wozniak
added a project goal