Bartosz Krawczyk
Virginia Commonwealth University | VCU · Department of Computer Science

PhD

About

214
Publications
72,500
Reads
5,891
Citations
Introduction
Bartosz Krawczyk is an assistant professor in the Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA, where he heads the Machine Learning and Stream Mining Lab. He obtained his M.Sc. and Ph.D. degrees from Wroclaw University of Science and Technology, Poland, in 2012 and 2015, respectively. Dr. Krawczyk's current research interests include machine learning, data streams, ensemble learning, class imbalance, and explainable artificial intelligence.
Additional affiliations
March 2016 - present
Wroclaw University of Science and Technology
Position
  • Professor (Assistant)
October 2013 - February 2016
Wroclaw University of Science and Technology
Position
  • Research Assistant

Publications (214)
Conference Paper
Full-text available
Modern machine learning systems need to be able to cope with constantly arriving and changing data. Two main areas of research dealing with such scenarios are continual learning and data stream mining. Continual learning focuses on accumulating knowledge and avoiding forgetting, assuming information once learned should be stored. Data stream mining...
Article
Full-text available
Despite over two decades of progress, imbalanced data is still considered a significant challenge for contemporary machine learning models. Modern advances in deep learning have further magnified the importance of the imbalanced data problem, especially when learning from images. Therefore, there is a need for an oversampling method that is specifi...
Article
Full-text available
Data streams are potentially unbounded sequences of instances arriving over time to a classifier. Designing algorithms that are capable of dealing with massive, rapidly arriving information is one of the most dynamically developing areas of machine learning. Such learners must be able to deal with a phenomenon known as concept drift, where the data...
Article
Full-text available
Continual learning from streaming data sources becomes more and more popular due to the increasing number of online tools and systems. Dealing with dynamic and everlasting problems poses new challenges for which traditional batch-based offline algorithms turn out to be insufficient in terms of computational time and predictive performance. One of t...
Article
Full-text available
Continuous learning from streaming data is among the most challenging topics in contemporary machine learning. In this domain, learning algorithms must not only be able to handle massive volumes of rapidly arriving data, but also adapt themselves to potential emerging changes. The phenomenon of the evolving nature of data streams is known as concept...
Preprint
Full-text available
Deep learning models memorize training data, which hurts their ability to generalize to under-represented classes. We empirically study a convolutional neural network's internal representation of imbalanced image data and measure the generalization gap between a model's feature embeddings in the training and test sets, showing that the gap is wider...
Preprint
Full-text available
Machine learning (ML) is playing an increasingly important role in rendering decisions that affect a broad range of groups in society. ML models inform decisions in criminal justice, the extension of credit in banking, and the hiring practices of corporations. This posits the requirement of model fairness, which holds that automated decisions shoul...
Preprint
Full-text available
Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures on how to evaluate these algorithms. This work presents a...
Preprint
Full-text available
Mining data streams poses a number of challenges, including the continuous and non-stationary nature of data, the massive volume of information to be processed and constraints put on the computational resources. While there are a number of supervised solutions proposed for this problem in the literature, most of them assume that access to the ground...
Conference Paper
Structural concept complexity, class overlap, and data scarcity are some of the most important factors influencing the performance of classifiers under class imbalance conditions. When these effects were uncovered in the early 2000s, understandably, the classifiers on which they were demonstrated belonged to the classical rather than Deep Learning...
Article
Full-text available
Data stream classification is one of the most vital areas of contemporary machine learning, as many real-life problems generate data continuously and in large volumes. However, most research in this area focuses on vector-based representations, which are unsuitable for capturing properties of more complex multi-dimensional structures, such as im...
Chapter
Generative Adversarial Networks (GANs) are among the most popular contemporary machine learning algorithms. Despite remarkable successes in their developments, existing GANs cannot offer the appropriate tools to monitor their performance in a continual learning scenario when data distribution changes. We propose a complete framework for monitoring...
Chapter
Lifelong learning models should be able to efficiently aggregate knowledge over a long-term time horizon. Comprehensive studies focused on incremental neural networks have shown that these models tend to struggle with remembering previously learned patterns. This issue known as catastrophic forgetting has been widely studied and addressed by severa...
Preprint
Full-text available
Structural concept complexity, class overlap, and data scarcity are some of the most important factors influencing the performance of classifiers under class imbalance conditions. When these effects were uncovered in the early 2000s, understandably, the classifiers on which they were demonstrated belonged to the classical rather than Deep Learning...
Preprint
Full-text available
Learning from imbalanced data is among the most challenging areas in contemporary machine learning. This becomes even more difficult when considered in the context of big data that calls for dedicated architectures capable of high-performance processing. Apache Spark is a highly efficient and popular architecture, but it poses specific challenges for...
Conference Paper
Full-text available
Learning from imbalanced data poses significant challenges for the classifier. This becomes even more difficult when dealing with multi-class problems. Here relationships among classes are no longer well-defined and it is easy to lose performance on one of the classes while gaining on others. In recent years this topic has gained increased interest...
Article
Motivation: X-ray crystallography was used to produce nearly 90% of protein structures. These efforts were supported by numerous sequence-based tools that accurately predict crystallizable proteins. However, protein structures vary widely in their quality, typically measured with resolution and R-free. This impacts the ability to use these structure...
Article
Full-text available
Drifting data streams and multi-label data are both challenging problems. Multi-label instances may simultaneously be associated with many labels and classifiers must predict the complete set of labels. Learning from data streams requires algorithms able to learn from potentially unbounded data that is constantly changing. When multi-label data arr...
Preprint
Full-text available
Learning from data streams is among the contemporary challenges in the machine learning domain, which is frequently plagued by the class imbalance problem. In non-stationary environments, ratios among classes, as well as their roles (majority and minority) may change over time. The class imbalance is usually alleviated by balancing classes with resampl...
Chapter
Classification of imbalanced data is one of the most challenging aspects of machine learning. Despite over two decades of progress there is still a need for developing new techniques capable of overcoming numerous difficulties embedded in the nature of imbalanced datasets. In this paper, we propose Locally Linear Support Vector Machines (LL-SVMs) for eff...
Conference Paper
Learning from data streams is among the contemporary challenges in the machine learning domain, which is frequently plagued by the class imbalance problem. In non-stationary environments, ratios among classes, as well as their roles (majority and minority) may change over time. The class imbalance is usually alleviated by balancing classes with res...
Preprint
Full-text available
Despite over two decades of progress, imbalanced data is still considered a significant challenge for contemporary machine learning models. Modern advances in deep learning have magnified the importance of the imbalanced data problem. The two main approaches to address this issue are based on loss function modifications and instance resampling. Ins...
Preprint
Full-text available
Modern machine learning systems need to be able to cope with constantly arriving and changing data. Two main areas of research dealing with such scenarios are continual learning and data stream mining. Continual learning focuses on accumulating knowledge and avoiding forgetting, assuming information once learned should be stored. Data stream mining...
Preprint
Full-text available
Continual learning from data streams is among the most important topics in contemporary machine learning. One of the biggest challenges in this domain lies in creating algorithms that can continuously adapt to arriving data. However, previously learned knowledge may become outdated, as streams evolve over time. This phenomenon is known as concept d...
Conference Paper
Full-text available
Continual learning from data streams is among the most important topics in contemporary machine learning. One of the biggest challenges in this domain lies in creating algorithms that can continuously adapt to arriving data. However, previously learned knowledge may become outdated, as streams evolve over time. This phenomenon is known as concept d...
Article
Despite more than two decades of progress, learning from imbalanced data is still considered as one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the class skew impact are no longer feasible due to the volume of datasets. Add...
Preprint
Full-text available
Learning from data streams is among the most vital fields of contemporary data mining. The online analysis of information coming from those potentially unbounded data sources allows for designing reactive up-to-date models capable of adjusting themselves to continuous flows of data. While a plethora of shallow methods have been proposed for simpler...
Preprint
Full-text available
Continual learning from streaming data sources becomes more and more popular due to the increasing number of online tools and systems. Dealing with dynamic and everlasting problems poses new challenges for which traditional batch-based offline algorithms turn out to be insufficient in terms of computational time and predictive performance. One of t...
Preprint
Full-text available
Continuous learning from streaming data is among the most challenging topics in contemporary machine learning. In this domain, learning algorithms must not only be able to handle massive volumes of rapidly arriving data, but also adapt themselves to potential emerging changes. The phenomenon of the evolving nature of data streams is known as co...
Article
Learning from imbalanced data is among the most popular topics in contemporary machine learning. However, the vast majority of attention in this field is given to binary problems, while their much more difficult multiclass counterparts are relatively unexplored. Handling data sets with multiple skewed classes poses various challenges and calls...
Article
Purpose: To present a Machine Learning pipeline for automatically relabeling anatomical structure sets in the Digital Imaging and Communications in Medicine (DICOM) format to a standard nomenclature that will enable data abstraction for research and quality improvement. Methods: DICOM structure sets from approximately 1,200 lung and prostate cancer...
Conference Paper
Full-text available
Learning from imbalanced data and data stream mining are among the most popular areas in contemporary machine learning. There is a strong interplay between these domains, as data streams are frequently characterized by skewed distributions. However, most existing works focus on binary problems, omitting significantly more challenging multi-class imb...
Article
Full-text available
The imbalanced data classification is one of the most crucial tasks facing modern data analysis. Especially when combined with other difficulty factors, such as the presence of noise, overlapping class distributions, and small disjuncts, data imbalance can significantly impact the classification performance. Furthermore, some of the data difficulty...
Preprint
Full-text available
The imbalanced data classification is one of the most crucial tasks facing modern data analysis. Especially when combined with other difficulty factors, such as the presence of noise, overlapping class distributions, and small disjuncts, data imbalance can significantly impact the classification performance. Furthermore, some of the data difficulty...
Article
Full-text available
Learning from data streams in the presence of concept drift is among the biggest challenges of contemporary machine learning. Algorithms designed for such scenarios must take into account the potentially unbounded size of data, its constantly changing nature, and the requirement for real-time processing. Ensemble approaches for data stream minin...
Conference Paper
Full-text available
Most machine learning methods work under the assumption that classes have a roughly balanced number of instances. However, in many real-life problems we may have some types of instances appearing predominantly more frequently than the others which causes a bias towards the majority class during classifier training. This becomes even more challengin...
Conference Paper
Full-text available
Learning from data streams is one of the most promising and challenging domains in modern machine learning. Proliferating online data sources provide us access to real-time knowledge we have never had before. At the same time, new obstacles emerge and we have to overcome them in order to fully and effectively utilize the potential of the data. Proh...
Conference Paper
Full-text available
Data stream mining is among the most contemporary branches of machine learning. The potentially infinite sources give us many opportunities and at the same time pose new challenges. To properly handle streaming data we need to improve our well-established methods, so they can work with dynamic data and under strict constraints. Supervised streaming...
Article
Full-text available
In multi-label learning, data may simultaneously belong to more than one class. When multi-label data arrives as a stream, the challenges associated with multi-label learning are joined by those of data stream mining, including the need for algorithms that are fast and flexible, able to match both the speed and evolving nature of the stream. This p...
Article
Full-text available
Multi-label classification is one of the most dynamically growing fields of machine learning, due to its numerous real-life applications in solving problems that can be described by multiple labels at the same time. While most works in this field focus on proposing novel and accurate classification algorithms, the issue of the computational comp...
Conference Paper
Full-text available
Learning from data streams is among the most vital contemporary fields in machine learning and data mining. Streams pose new challenges to learning systems, due to their volume and velocity, as well as ever-changing nature caused by concept drift. The vast majority of works for data streams assume a fully supervised learning scenario, having an unrestr...
Article
Full-text available
Learning good-performing classifiers from data with easily separable classes is not usually a difficult task for most algorithms. However, problems affecting classifier performance may arise when samples from different classes share similar characteristics or are overlapped, since the boundaries of each class may not be clearly defined. In order...
Article
Full-text available
Instance reduction techniques are data preprocessing methods originally developed to enhance the nearest neighbor rule for standard classification. They reduce the training data by selecting or generating representative examples of a given problem. These algorithms have been designed and widely analyzed in multi-class problems providing very compet...
Chapter
Learning from imbalanced data is still considered one of the most challenging areas of machine learning. Among the plethora of methods dedicated to alleviating the challenge of skewed distributions, the two most distinct ones are data-level sampling and cost-sensitive learning. The former modifies the training set by either removing majority instances o...
Preprint
Full-text available
The paper describes how the new technologies and data they generate are transforming medicine. It stresses the uniqueness of heterogeneous medical data and the ways of dealing with them. It lists different sources that generate big medical data, their security, legal and ethical issues, as well as machine learning/AI methods of dealing with them. A...
Article
Full-text available
Designing efficient algorithms for mining massive high-speed data streams has become one of the contemporary challenges for the machine learning community. Such models must display the highest possible accuracy and ability to swiftly adapt to any kind of changes, while at the same time being characterized by low time and memory complexities. However, l...
Article
Currently, knowledge discovery in databases is an essential first step when identifying valid, novel and useful patterns for decision making. There are many real-world scenarios, such as bankruptcy prediction, option pricing or medical diagnosis, where the classification models to be learned need to fulfill restrictions of monotonicity (i.e. the ta...
Article
Full-text available
Imbalanced data classification remains a focus of intense research, mostly due to the prevalence of data imbalance in various real-life application domains. A disproportion among objects from different classes may significantly affect the performance of standard classification models. The first problem is the high imbalance ratios that pose a serio...
Preprint
Full-text available
Currently, knowledge discovery in databases is an essential step to identify valid, novel and useful patterns for decision making. There are many real-world scenarios, such as bankruptcy prediction, option pricing or medical diagnosis, where the classification models to be learned need to fulfil restrictions of monotonicity (i.e. the target class l...
Conference Paper
Full-text available
The class imbalance problem is a pervasive issue in many real-world domains. Oversampling methods that inflate the rare class by generating synthetic data are amongst the most popular techniques for resolving class imbalance. However, they concentrate on the characteristics of the minority class and use them to guide the oversampling process. By co...
Article
In this paper we deal with the problem of addressing multi-class problems with decomposition strategies. Based on the divide-and-conquer principle, a multi-class problem is divided into a number of easier to solve sub-problems. In order to do so, binary decomposition is considered to be the most popular approach. However, when using this strategy w...
Article
Full-text available
Mining data streams is among the most vital contemporary topics in machine learning. Such a scenario requires adaptive algorithms that are able to process constantly arriving instances, adapt to potential changes in data, use limited computational resources, as well as be robust to any atypical events that may appear. Ensemble learning has proven itself...
Conference Paper
Full-text available
Learning from imbalanced data is a challenge that the machine learning community has been facing over the last decades, due to its ever-growing presence in real-life problems. While there is a significant number of works addressing the issue of handling binary and skewed datasets, its multi-class counterpart has not received as much attention. This problem is m...
Conference Paper
Full-text available
High-speed data streams are potentially infinite sequences of rapidly arriving instances that may be subject to the concept drift phenomenon. Hence, dedicated learning algorithms must be able to update themselves with new data and provide an accurate prediction in a limited amount of time. This requirement was considered as prohibitive for using evolut...
Chapter
Medical data mining problems are usually characterized by examples of some of the classes appearing more frequently. Such a learning difficulty is known as the imbalanced classification problem. This contribution analyzes the application of algorithms for tackling multi-class imbalanced classification in the field of vertebral column diseases classifi...
Article
Natural Language Processing plays a key role in man-machine interactions, allowing computers to understand and analyze human language. One of its more challenging subdomains is Word Sense Disambiguation, the task of automatically identifying the intended sense (or concept) of an ambiguous word based on the context in which the word is used. This re...
Conference Paper
Data stream mining is among the most vital contemporary data science challenges. In this work we concentrate on the issue of actual availability of true class labels. The assumption that the ground truth for each instance becomes known right after processing it is far from being realistic, due to usually high costs connected with its acquisition. Activ...
Article
The recognition of coral species based on underwater texture images poses a significant difficulty for machine learning algorithms, due to the three following challenges embedded in the nature of this data: 1) datasets do not include information about the global structure of the coral; 2) several species of coral have very similar characteristics; a...
Chapter
Researchers in the topic of imbalanced classification have proposed throughout the years a large number of different approaches to address this issue. To keep on developing this area of study, it is of extreme importance to make these methods available for the research community. This allows for a double advantage: (1) to analyze in depth the featu...
Chapter
Class imbalance is present in many real-world classification datasets and consists in a disproportion of the number of examples of the different classes in the problem. This issue is known to hinder the performance of classifiers due to their accuracy oriented design, which usually causes the minority class to be overlooked. In this chapter the foun...
Chapter
New developments in computation have allowed an explosion for both data generation and storage. The high value that is hidden within this large volume of data has attracted more and more researchers to address the topic of Big Data analytics. The main difference between addressing Big Data applications and carrying out traditional DM tasks is scala...
Chapter
Algorithm-level solutions can be seen as an alternative approach to data pre-processing methods for handling imbalanced datasets. Instead of focusing on modifying the training set in order to combat class skew, this approach aims at modifying the classifier learning procedure itself. This requires an in-depth understanding of the selected learning a...
Chapter
Dealing with multi-class problems is a hard issue, which becomes more severe in the presence of imbalance. When facing multi-majority and multi-minority classes, it is not straightforward to acknowledge a priori which ones should be stressed during the learning stage, as it was done in the binary case study. Additionally, most of the techniques pro...
Chapter
The first mechanism to address the problem of imbalanced learning was the use of sampling methods. They consist of modifying a set of imbalanced data using different procedures to provide a balanced or more adequate data distribution to the subsequent learning tasks. In the specialized literature, many studies have shown that, for several types of...
Chapter
Cost-sensitive learning is an aspect of algorithm-level modifications for class imbalance. Here, instead of using a standard error-driven evaluation (or 0–1 loss function), a misclassification cost is being introduced in order to minimize the conditional risk. By strongly penalizing mistakes on some classes, we improve their importance during class...
Chapter
In this chapter existing ensemble solutions for the class imbalance problems are reviewed. In Data Science, classifier ensembles, that is, the combination of several classifiers into a single one, are known to improve the accuracy in comparison with the usage of a single classifier. However, ensemble learning techniques by themselves are neither ab...
Chapter
One of the most successful data preprocessing techniques used is the reduction of the data dimensionality by means of feature selection and/or feature extraction. The key idea is to simplify the data by replacing the original features with newly created ones that extract the main information or simply select a subset of the original set. Although this topic h...
Chapter
Although class imbalance is often pointed out as a determinant factor for degradation in classification performance, there are situations in which good performance can be achieved even in the presence of severe class imbalance. The identification of situations where the class imbalance is a complicating factor is an important research question. These...
Chapter
Nowadays, the availability of large volumes of data and the widespread use of tools for the proper extraction of knowledge information has become very frequent, especially in large corporations. This fact has transformed the data analysis by orienting it towards certain specialized techniques included under the umbrella of Data Science. In summary,...
Chapter
Most of the research in class imbalance is carried out in standard (binary or multi-class) classification problems. However, in recent years, researchers have addressed new classification frameworks beyond standard classification in different aspects. Several variations of the class imbalance problem appear within these frameworks. This chapter review...
Chapter
Mining data streams is one of the most vital fields in contemporary ML. An increasing number of real-world problems are characterized by both volume and velocity of data, as well as by evolving characteristics. Learning from data streams assumes that new instances arrive continuously and that their properties may change over time due to a phenomeno...