Imbalanced data classification remains an important topic, and over the past decades many works have been devoted to this field of study. A growing number of real-life imbalanced class problems has inspired researchers to develop new solutions with better performance. Various techniques are employed, such as data handling approaches, algorithm-level approaches, active learning approaches, and kernel-based methods, to name only a few. This work applies a novel dynamic selection method to imbalanced data classification problems. Experiments carried out on several benchmark datasets confirm its high performance.
Big data is usually described by the so-called 5Vs (Volume, Velocity, Variety, Veracity, Value). Business success in the big data era strongly depends on smart analytical software that helps to make efficient decisions (Value for the enterprise). Decision support software should therefore take into account, in particular, that we deal with massive data (Volume) and that data usually arrives continuously in the form of a so-called data stream (Velocity). Unfortunately, most traditional data analysis methods are not ready to efficiently analyze the fast-growing amount of stored records. Additionally, one should consider a phenomenon appearing in data streams called concept drift, which means that the parameters of the model in use change over time, and this can dramatically decrease the quality of the analytical model. This work focuses on the classification task, which is common in many practical applications such as fraud detection, network security, and medical diagnosis. We propose detecting changes in the data stream using a combined concept drift detection model. The experimental evaluation shows that this is a promising direction, which encourages us to use it in practical applications.
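The idea of combining drift detectors can be illustrated with a minimal sketch. The abstract does not specify how the detectors are combined, so everything below is an assumption made for illustration: the two member detectors (an error-rate monitor in the spirit of the well-known DDM test and a simple two-window comparison), the quorum rule, and all class, function, and parameter names are hypothetical and are not the authors' model.

```python
import math
from collections import deque


class ErrorRateDetector:
    """DDM-style monitor (illustrative): flags drift when the running error
    rate rises well above the best level observed so far."""

    def __init__(self, drift_level=3.0, min_samples=30):
        self.drift_level = drift_level
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0                      # running error rate
        self.best_p = float("inf")        # error rate at the best point
        self.best_s = float("inf")        # its standard deviation

    def update(self, error):
        """Feed 1 for a misclassification, 0 otherwise; return True on drift."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(max(0.0, self.p * (1.0 - self.p)) / self.n)
        if self.n < self.min_samples:
            return False
        if self.p + s < self.best_p + self.best_s:
            self.best_p, self.best_s = self.p, s
        return self.p + s > self.best_p + self.drift_level * self.best_s


class WindowDetector:
    """Compares the error rate of a sliding window with a reference window."""

    def __init__(self, window=50, threshold=0.2):
        self.window = window
        self.threshold = threshold
        self.reset()

    def reset(self):
        self.reference = []               # first `window` outcomes after reset
        self.recent = deque(maxlen=self.window)

    def update(self, error):
        if len(self.reference) < self.window:
            self.reference.append(error)
            return False
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False
        ref_rate = sum(self.reference) / self.window
        recent_rate = sum(self.recent) / self.window
        return recent_rate - ref_rate > self.threshold


class CombinedDetector:
    """Signals drift when at least `quorum` members agree, then resets them.
    The quorum rule is an assumption, not the combination from the paper."""

    def __init__(self, detectors, quorum):
        self.detectors = detectors
        self.quorum = quorum

    def update(self, error):
        votes = sum(d.update(error) for d in self.detectors)
        if votes >= self.quorum:
            for d in self.detectors:
                d.reset()
            return True
        return False


def synthetic_errors():
    """Deterministic toy stream: 10% errors, then 50% errors from step 500."""
    before = [int(t % 10 == 9) for t in range(500)]
    after = [int(t % 2 == 1) for t in range(500, 1000)]
    return before + after


def detect_drifts(errors, detector):
    """Return the time steps at which the detector signals drift."""
    return [t for t, e in enumerate(errors) if detector.update(e)]
```

Calling `detect_drifts(synthetic_errors(), CombinedDetector([ErrorRateDetector(), WindowDetector()], quorum=2))` returns the steps at which both members agree that the error rate has changed. Requiring agreement trades some detection delay for fewer false alarms, which is one plausible motivation for combining detectors.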
In many applications of information systems, learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements: algorithms must process incoming examples incrementally while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drifts. Among the newly proposed stream algorithms, ensembles play an important role, in particular for non-stationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations, and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.
Activity recognition is one of the emerging trends in the domain of mining ubiquitous environments. It assumes that we can recognize the current action undertaken by the monitored subject on the basis of the outputs of a set of associated sensors. Often different combinations of smart devices are used, thus creating an Internet of Things. Such data arrive continuously while the sensors operate and require online processing in order to keep real-time track of the activity currently being undertaken. This forms a natural data stream problem with the potential presence of changes in the arriving data. Therefore, we require an efficient online machine learning system that can offer high recognition rates and adapt to drifts and shifts in the stream. In this paper we propose an efficient and lightweight adaptive ensemble learning system for real-time activity recognition. We use a weighted modification of the Naïve Bayes classifier that can swiftly adapt itself to the current state of the stream without the need for an external concept drift detector. To tackle the multi-class nature of the activity recognition problem, we propose to use a one-vs-one decomposition to form a committee of simpler and diverse learners. We introduce a novel weighted combination for one-vs-one decomposition that can adapt itself over time. Additionally, to limit the cost of supervision, we propose to enhance our classification system with the active learning paradigm, so that only the most important objects are selected for labeling and the system works under a constrained budget. Experiments carried out on six data streams gathered from ubiquitous environments show that the proposed active and adaptive ensemble offers excellent classification accuracy while requiring little access to true class labels.
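A one-vs-one committee with weights that adapt over time can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the weighting rule, so the exponentially decayed accuracy used here is a guess, and the base learner is a simple forgetting nearest-centroid model standing in for the weighted Naïve Bayes classifier. All class and parameter names are hypothetical.

```python
from itertools import combinations


class ForgettingCentroid:
    """Stand-in binary learner (not the paper's weighted Naive Bayes):
    predicts the class whose exponentially forgotten mean is nearest."""

    def __init__(self, pair, forgetting=0.1):
        self.pair = pair
        self.forgetting = forgetting
        self.means = {}                   # class label -> running mean vector

    def learn(self, x, y):
        if y not in self.means:
            self.means[y] = list(x)
            return
        mean = self.means[y]
        for i, value in enumerate(x):
            mean[i] += self.forgetting * (value - mean[i])

    def predict(self, x):
        if not self.means:
            return self.pair[0]           # arbitrary guess before any training

        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.means[label]))

        return min(self.means, key=squared_distance)


class WeightedOneVsOne:
    """One binary learner per class pair, combined by weighted voting."""

    def __init__(self, classes, decay=0.9):
        self.decay = decay
        self.learners = {p: ForgettingCentroid(p) for p in combinations(classes, 2)}
        self.weights = {p: 1.0 for p in self.learners}

    def predict(self, x):
        votes = {}
        for pair, learner in self.learners.items():
            winner = learner.predict(x)
            votes[winner] = votes.get(winner, 0.0) + self.weights[pair]
        return max(votes, key=votes.get)

    def learn(self, x, y):
        # Only the pairs that contain the true class are competent for x.
        for pair, learner in self.learners.items():
            if y not in pair:
                continue
            correct = 1.0 if learner.predict(x) == y else 0.0   # test-then-train
            # Assumed rule: weight = exponentially decayed recent accuracy.
            self.weights[pair] = self.decay * self.weights[pair] + (1 - self.decay) * correct
            learner.learn(x, y)
```

For example, after repeatedly calling `learn` with one two-dimensional point per activity, `predict` returns the activity whose pairwise learners collect the largest total weight. Because each weight tracks recent accuracy, a pairwise learner that degrades after a drift loses influence without an explicit drift detector.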
Most data stream learning methods assume that the true class of an incoming instance is available right after it has been processed. However, the assumption that we have unlimited access to class labels is unrealistic and entails a very high labeling cost. This is a driving force behind the growing development of methods that require reduced or no access to class labels. Among several potential directions, active learning emerges as a promising solution, as it allows the most valuable instances to be selected from the stream while using as few label queries as possible. Despite numerous proposals of active learning methods for static data, this domain is still developing for data streams. Here, the non-stationary nature of data must be taken into consideration, and proposed algorithms must accommodate potential occurrences of concept drift. In this paper we propose a Query by Committee active learning strategy that is adapted to online learning from drifting data streams. The decision regarding a label query is made by an ensemble of classifiers instead of a single learner, leading to improved instance selection. We present four different approaches for online Query by Committee and evaluate their usefulness on the basis of the accuracy obtained with limited budgets and their ability to handle concept drift. We introduce Budget Loss of Accuracy, a novel measure for evaluating active learning algorithms. Finally, we investigate the relationships between the efficacy of Query by Committee models and the diversity of the underlying ensembles. Based on a thorough experimental investigation, we show the usefulness of the proposed algorithms for reducing labeling effort in learning from drifting data streams.
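The core decision in budgeted Query by Committee can be sketched as below. This is a generic illustration and not one of the four approaches from the paper: vote entropy as the disagreement measure, the fixed threshold, and the simple budget check (fraction of queried instances so far) are all assumptions, as are the function and class names.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Disagreement of a committee: entropy of the distribution of its votes."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


class BudgetedQueryByCommittee:
    """Decides whether to request the true label of an incoming instance."""

    def __init__(self, budget=0.2, threshold=0.5):
        self.budget = budget          # maximum fraction of instances to label
        self.threshold = threshold    # minimum disagreement that triggers a query
        self.seen = 0
        self.queried = 0

    def should_query(self, votes):
        """`votes` holds one predicted label per committee member."""
        self.seen += 1
        within_budget = self.queried / self.seen < self.budget
        if within_budget and vote_entropy(votes) > self.threshold:
            self.queried += 1
            return True
        return False
```

In a stream loop, each committee member would predict the incoming instance, `should_query` would be called with those predictions, and the label would be requested (and the committee updated) only when it returns `True`. A unanimous committee has zero vote entropy, so confidently classified instances never consume budget.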