Jean Paul Barddal

  • PhD
  • Professor at Pontifical Catholic University of Paraná

About

96
Publications
74,200
Reads
2,415
Citations
Introduction
I'm a professor with the Graduate Program in Informatics (PPGIa) of the Pontifical Catholic University of Paraná, located in Curitiba, Brazil. My projects cover data mining, machine learning, and pattern recognition. My focus is mainly on the processing and mining of data streams, more specifically on the classification, regression, clustering, and feature selection tasks. If you're interested in collaborating or pursuing a degree on any of these topics, please feel free to drop me a message at jean.barddal@ppgia.pucpr.br.
Current institution
Pontifical Catholic University of Paraná
Current position
  • Professor
Additional affiliations
February 2019 - present
Pontifical Catholic University of Paraná
Position
  • Professor
April 2017 - August 2017
Pontifical Catholic University of Paraná
Position
  • Lecturer
Description
  • Lecturer on Big Data Stream Mining for the Big Data & Analytics lato sensu course of the Polytechnic School
October 2017 - present
4KST
Position
  • Researcher
Education
February 2015 - November 2018
Pontifical Catholic University of Paraná
Field of study
  • Computer Science
March 2014 - February 2015
February 2010 - December 2013

Publications

Publications (96)
Conference Paper
Full-text available
The ubiquity of data streams has been encouraging the development of new incremental and adaptive learning algorithms. Data stream learners must be fast, memory-bounded, but mainly, tailored to adapt to possible changes in the data distribution, a phenomenon named concept drift. Recently, several works have shown the impact of a so far nearly negle...
Article
Full-text available
Data stream mining is a fast growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely f...
Article
Full-text available
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integr...
Article
Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally as data arrives. On top of that, due to the inherent evolving nature of data streams, it is expected that algorithms undergo both concept drifts and evolutions, which must be taken into account by the clust...
Conference Paper
Full-text available
This work presents two different voting strategies for ensemble learning on data streams based on pairwise combination of component classifiers. Despite efforts to build a diverse ensemble, there is always some degree of overlap between component classifiers models. Our voting strategies are aimed at using these overlaps to support ensemble predict...
Conference Paper
Full-text available
Fuels are crucial for any country's development and economy, impacting various sectors such as transportation, industry, and electricity generation. Accurate prediction of monthly fuel demand can improve supply chain management, strategic decision-making, and financial planning for businesses while helping governments develop decarbonization polic...
Conference Paper
Pre-trained language models (LMs) have been used in several scenarios and data mining tasks due to their good-quality representations and their use readiness. Although LMs constitute a significant gain in usability, they are frequently utilized statically over time, meaning that these models can suffer from concept drift and semantic shift, which c...
Chapter
The proliferation of textual data on the Internet presents a unique opportunity for institutions and companies to monitor public opinion about their services and products. Given the rapid generation of such data, the text stream mining setting, which handles sequentially arriving, potentially infinite text streams, is often more suitable than tradi...
Preprint
Full-text available
In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in proce...
Article
Society produces textual data online in several ways, e.g., via reviews and social media posts. Therefore, numerous researchers have been working on discovering patterns in textual data that can indicate people’s opinions, interests, etc. Most tasks regarding natural language processing are addressed using traditional machine learning methods...
Article
Full-text available
This work introduces the representation ensemble learning algorithm, a novel approach for generating diverse unsupervised representations rooted in the principles of self-taught learning. The ensemble comprises convolutional autoencoders (CAEs) learned in an unsupervised manner, fostering diversity via a loss function designed to penalize similar C...
Article
Initially supported by Twitter, hashtags are now used on several social media platforms. Hashtags are helpful for tagging, tracking, and grouping posts on similar topics. In this paper, based on a hashtag stream regarding the hashtag #mybodymychoice, we analyze hashtag drifts over time using concepts from graph analysis and textual data streams usi...
Conference Paper
Full-text available
Social media has been a data source for various applications, given its characteristic of working as a social sensor. Many applications in several areas, such as brand reputation and online opinion monitoring, use this valuable resource to understand the users of services and products. This paper describes an application in the soccer domain, consi...
Conference Paper
Full-text available
The growing use of digital communication platforms has given rise to various criminal activities, such as grooming and drug dealing, which pose significant challenges to law enforcement and forensic experts. This paper presents a supervised keyphrase extraction approach to detect relevant information in high-volume chat logs involving grooming and...
Article
Full-text available
High dimension, low sample size (HDLSS) problems are numerous among real-world applications of machine learning. From medical images to text processing, traditional machine learning algorithms are usually unsuccessful in learning the best possible concept from such data. In a previous work, we proposed a dissimilarity-based approach for multi-view...
Chapter
Energy consumption reduction is an increasing trend in machine learning given its relevance in socio-ecological importance. Consequently, it is important to quantify how real-time learning algorithms tailored for data streams and edge computing behave in terms of accuracy, processing time, memory usage, and energy consumption. In this work, we brin...
Conference Paper
Full-text available
Feature extraction regards transforming unstructured or semi-structured data into structured data that can be used as input for classification and sentiment analysis algorithms, among other applications. This task becomes even more challenging and relevant when textual data becomes available over time as a continuous data stream since the lexicon a...
Preprint
Full-text available
Mining data streams is one of the main topics of study in the machine learning area due to its application in many knowledge areas. One of the major challenges in mining data streams is concept drift, which requires the learner to discard the current concept and adapt to a new one. Ensemble-based drift detection algorithms have been used successfully to the cl...
Preprint
Data streams are often defined as large amounts of data flowing continuously at high speed. Moreover, these data are likely subject to changes in data distribution, known as concept drift. Given all the reasons mentioned above, learning from streams is often online and under restrictions of memory consumption and run-time. Although many classificat...
Article
This paper introduces a novel method for classifier pool generation in which a two-level strategy explores diversity in both data complexity and classifier decision spaces. The rationale is to induce pool members using data subsets representing subproblems with different difficulties while promoting diversity in classifiers’ decisions. Two possible...
Preprint
This paper presents a deep learning approach for image retrieval and pattern spotting in digital collections of historical documents. First, a region proposal algorithm detects object candidates in the document page images. Next, deep learning models are used for feature extraction, considering two distinct variants, which provide either real-value...
Preprint
Full-text available
This work describes different strategies to generate unsupervised representations obtained through the concept of self-taught learning for facial emotion recognition (FER). The idea is to create complementary representations promoting diversity by varying the autoencoders' initialization, architecture, and training data. SVM, Bagging, Random Forest...
Preprint
Full-text available
Computer vision-based parking lot management methods have been extensively researched owing to their flexibility and cost-effectiveness. To evaluate such methods, authors often employ publicly available parking lot image datasets. In this study, we surveyed and compared robust publicly available image datasets specifically crafted to test compu...
Article
Full-text available
Computer vision-based parking lot management methods have been extensively researched owing to their flexibility and cost-effectiveness. To evaluate such methods, authors often employ publicly available parking lot image datasets. In this study, we surveyed and compared robust publicly available image datasets specifically crafted to test compu...
Conference Paper
In this paper, we present an exploratory study conducted to evaluate the impact of temporal dependence modeling on time series forecasting with Data Stream Mining (DSM) techniques. DSM algorithms have been used successfully in many domains that exhibit continuous generation of non-stationary data. However, the use of DSM in time series is rare sinc...
Chapter
Hierarchical data stream classification inherits the properties and constraints of hierarchical classification and data stream classification concomitantly. Therefore, it requires novel approaches that (i) can handle class hierarchies, (ii) can be updated over time, and (iii) are computationally light-weighted regarding processing time and memory u...
Article
Full-text available
The classification task usually works with flat and batch learners, assuming problems as stationary and without relations between class labels. Nevertheless, several real-world problems do not assume these premises, i.e., data have labels organized hierarchically and are made available in streaming fashion, meaning that their behavior can drift ove...
Article
Concept drift in process mining (PM) is a challenge as classical methods assume processes are in a steady-state, i.e., events share the same process version. We conducted a systematic literature review on the intersection of these areas, and thus, we review concept drift in PM and bring forward a taxonomy of existing techniques for drift detection...
Preprint
Concept drift in process mining (PM) is a challenge as classical methods assume processes are in a steady-state, i.e., events share the same process version. We conducted a systematic literature review on the intersection of these areas, and thus, we review concept drift in process mining and bring forward a taxonomy of existing techniques for drif...
Chapter
This paper presents a novel tool for detecting drifts in process models. The tool targets the challenge of defining the better parameter configuration for detecting drifts by providing an interactive user interface. Using this interface, the user can quickly change the parameters and verify how the process evolved. The process evolution is presente...
Chapter
Collaborative Filtering (CF) is one of the most successful techniques in recommender systems. Most CF scenarios depict positive-only implicit feedback, which means that negative feedback is unavailable. Therefore, One-Class Collaborative Filtering (OCCF) techniques have been tailored to tackling these scenarios. Nonetheless, several OCCF models stil...
Article
Recommender systems uncover relationships between users and items, thus allowing personalized recommendations. Nonetheless, users’ preferences may change over time, the so-called concept drifts; or new users and items may appear, making the recommender system unable to accurately map the relationship between users and items due to the cold start pr...
Preprint
Full-text available
This paper describes a classifier pool generation method guided by the diversity estimated on the data complexity and classifier decisions. First, the behavior of complexity measures is assessed by considering several subsamples of the dataset. The complexity measures with high variability across the subsamples are selected for posterior pool adapt...
Conference Paper
This paper describes a classifier pool generation method guided by the diversity estimated on the data complexity and classifier decisions. First, the behavior of complexity measures is assessed by considering several subsamples of the dataset. The complexity measures with high variability across the subsamples are selected for posterior pool adapt...
Conference Paper
Full-text available
Adaptive recommender systems are increasingly showing their importance as profiling is a dynamic problem. Their goal is to update recommendation models as new interactions take place, thus swiftly adapting to drifts in the user’s behavior and desires, and item’s audience. However, existing recommendation algorithms usually do not perform well durin...
Conference Paper
This paper proposes a hybrid ensemble learning approach that combines statistical and data stream mining algorithms to obtain better forecasting performance in multiple time series prediction problems. Although some multiple time series algorithms perform surprisingly well in a variety of domains, it is well-known that no one is dominant for every...
Preprint
Adaptive recommender systems are increasingly showing their importance as profiling is a dynamic problem. Their goal is to update recommendation models as new interactions take place, thus swiftly adapting to drifts in the user's behavior and desires, and item's audience. However, existing recommendation algorithms usually do not perform well durin...
Preprint
Full-text available
Mining data streams is a challenge per se. It must be ready to deal with an enormous amount of data and with problems not present in batch machine learning, such as concept drift. Therefore, applying a batch-designed technique, such as dynamic selection of classifiers (DCS) also presents a challenge. The dynamic characteristic of ensembles that dea...
Article
The financial credibility of a person is a factor used to determine whether a loan should be approved or not, and this is quantified by a ‘credit score,’ which is calculated using a variety of factors, including past performance on debt obligations, profiling, amongst others. Machine learning has been widely applied to automate the development of e...
Article
Decision trees are a widely used family of methods for learning predictive models from both batch and streaming data. Despite depicting positive results in a multitude of applications, incremental decision trees continuously grow in terms of nodes as new data becomes available, i.e., they eventually split on all features available, and also multipl...
Conference Paper
Full-text available
An end-to-end solution for handwritten numeral string recognition is proposed, in which the numeral string is considered as composed of objects automatically detected and recognized by a YoLo-based model. The main contribution of this paper is to avoid heuristic-based methods for string preprocessing and segmentation, the need for task-oriented cla...
Chapter
Full-text available
Artificial Intelligence is a branch of Computer Science that replicates human intelligence in computers for specific activities. Machine learning, in turn, is a subfield of Artificial Intelligence situated at the intersection of Computer Science, Statistics, and Mathematics, which aims to discover patterns from historical data...
Preprint
Full-text available
An end-to-end solution for handwritten numeral string recognition is proposed, in which the numeral string is considered as composed of objects automatically detected and recognized by a YoLo-based model. The main contribution of this paper is to avoid heuristic-based methods for string preprocessing and segmentation, the need for task-oriented cla...
Article
Full-text available
In this paper, we provide an overview of how Data Analytics, Big Data, and Machine Learning may assist the judicial system by providing insightful information to citizens, police, lawyers, and judges, in a fast and accurate way. We conduct a bidirectional analysis between Law and Predictive Analytics applying the deductive method and bibliographic...
Article
Full-text available
Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., V...
Article
Full-text available
The Publisher regrets an error in the spelling of the family name of the sixth author. The correct spelling is Bernhard Pfahringer, as it appears in the author list above.
Conference Paper
Full-text available
Data stream mining targets the learning of predictive models that evolve over time according to changes in arriving data. Throughout the years, several approaches have been tailored to create and continuously update predictive models from these streams, and from these, Hoeffding Trees became a popular choice for learning decision trees from data st...
Conference Paper
Full-text available
Learning from data streams is a hot topic in machine learning that targets the learning and update of predictive models as data becomes available for both training and query. Due to their simplicity and convincing results in a multitude of applications, Hoeffding Trees are, by far, the most widely used family of methods for learning decision trees...
Article
Data streams are prone to various forms of concept drift over time including, for instance, changes to the relevance of features. This specific kind of drift is known as feature drift and requires techniques tailored not only to determine which features are the most important but also to take advantage of them. Feature selection has been studied an...
Article
Full-text available
Feature selection targets the identification of which features of a dataset are relevant to the learning task. It is also widely known and used to improve computation times, reduce computation requirements, and to decrease the impact of the curse of dimensionality and enhancing the generalization rates of classifiers. In data streams, classifiers s...
Chapter
Full-text available
Extracting useful patterns from data is a challenging task that has been extensively investigated by both machine learning researchers and practitioners for many decades. This task becomes even more problematic when data is presented as a potentially unbounded sequence, the so-called data streams. Albeit most of the research on data stream mining f...
Article
Full-text available
Learning from ephemeral data streams has garnered the interest of both researchers and practitioners towards adaptive learning techniques. Despite the convincing results obtained thus far, most of the current research still overlooks that the relevance of features may change throughout the learning process. Scenarios where features become - or ceas...
Conference Paper
Full-text available
Fintechs are technology companies that, in contrast to traditional banks, are engaged in digital solutions for payment, money transfers, and real-time notifications. Taking advantage of digital means of communication, most of the service interactions between fintechs and customers occurs via chats or posts in social media. In this work, our goal is...
Conference Paper
Full-text available
The financial market is one of the major consumers of data mining techniques, and the main reason is their efficiency to analyze complex data. One important trait shared between most financial applications is class imbalance. Since traditional classification methods assume nearly balanced classes and equal misclassification costs, they usually fail...
Conference Paper
Full-text available
Data stream mining is a hot topic in the machine learning community that tackles the problem of learning and updating predictive models as new data becomes available over time. Even though several new methods are proposed every year, most focus on the classification task and overlook the regression task. In this paper, we propose an adaptation to t...
Conference Paper
Full-text available
Feature selection has been studied and shown to improve classifier performance in standard batch data mining but is mostly unexplored in data stream mining. Feature selection becomes even more important when the relevant subset of features changes over time, as the underlying concept of a data stream drifts. This specific kind of drift is known as...
Conference Paper
Full-text available
Peer-to-peer (P2P) lending is a global trend of financial markets that allow individuals to obtain and concede loans without having financial institutions as a strong proxy. As many real-world applications, P2P lending presents an imbalanced characteristic, where the number of creditworthy loan requests is much larger than the number of non-creditw...
Article
Full-text available
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests alg...
Conference Paper
Full-text available
Extracting useful knowledge from data streams is problematic, mainly due to changes in their data distribution, a phenomenon named concept drift. Recently, studies have shown that most of existing algorithms for learning from data streams do not encompass techniques for a specific kind of drift: feature drifts. Feature drifts occur when features be...
Conference Paper
Full-text available
The ever increasing data generation confronts both practitioners and researchers on handling massive and sequentially generated amounts of information, the so-called data streams. In this context, a lot of effort has been put on the extraction of useful patterns from streaming scenarios. Learning from data streams embeds a variety of problems, and...
Conference Paper
Trust and reputation mechanisms are part of the logical protection of intelligent agents, preventing malicious agents from acting egotistically or with the intention to damage others. Several studies in Psychology, Neurology and Anthropology claim that emotions are part of human's decision making process. However, there is a lack of understanding...
Conference Paper
Full-text available
Mining data streams is of the utmost importance due to its appearance in many real-world situations, such as: sensor networks, stock market analysis and computer networks intrusion detection systems. Data streams are, by definition, potentially unbounded sequences of data that arrive intermittently at rapid rates. Extracting useful knowledge from d...
Conference Paper
Full-text available
The increased use of data mining algorithms reflects the need for automatic extraction of knowledge from large volumes of data. This work presents a temporal data mining algorithm that discovers frequent Association Rules from timestamped data. These rules are named Cause-Effect Rules, each represented by a multiset of unordered events (Cause) foll...
Conference Paper
Full-text available
Data stream mining is an active area of research that poses challenging research problems. In the latter years, a variety of data stream clustering algorithms have been proposed to perform unsupervised learning using a two-step framework. Additionally, dealing with non-stationary, unbounded data streams requires the development of algorithms capabl...
Conference Paper
Full-text available
Learning from data streams requires efficient algorithms capable of deriving a model according to the arrival of new instances. Data streams are by definition unbounded sequences of data that are possibly non-stationary, i.e., they may undergo changes in data distribution, a phenomenon named concept drift. Concept drifts force streaming learning alg...
Conference Paper
Full-text available
Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally. On top of that, due to the inherent evolving nature of data streams, it is expected that these algorithms manage to quickly adapt to both concept drifts and the appearance and disappearance of clusters. Ne...
Article
Full-text available
Mining data streams is one of the main topics of study in the machine learning area due to its application in many knowledge areas. One of the major challenges in mining data streams is concept drift, which requires the learner to discard the current concept and adapt to a new one. Ensemble-based drift detection algorithms have been used successfully to the cl...
Conference Paper
Full-text available
Traditional prediction algorithms assume that the underlying concept is stationary, i.e., no changes are expected to happen during the deployment of an algorithm that would render it obsolete. However, in many real-world scenarios, changes in the data distribution, namely concept drifts, are expected to occur due to variations in the hidden contex...
Conference Paper
Full-text available
In this paper, we present a new ensemble method, the Scale-free Network Classifier (SFNClassifier), that is conceived as a dynamic sized scale-free network. In Data Stream Mining, ensemble-based approaches have been proposed to enhance accuracy and allow fast recovery from concept drift. However, these approaches are based on both update and pollin...
Conference Paper
In this paper, we present a new ensemble method, the Scale-free Network Classifier (SFNClassifier), that is conceived as a dynamic sized scale-free network. In Data Stream Mining, ensemble-based approaches have been proposed to enhance accuracy and allow fast recovery from concept drift. However, these approaches are based on both update and pollin...
Conference Paper
Full-text available
Mining data streams has become one of the major topics of study in the machine learning area given its application in many knowledge areas. One of the major challenges in mining data streams is concept drift, which makes the learner discard the concept learned and adapt to the new one. Nevertheless, the majority of the algorithms in the state of the art is lim...
Article
Full-text available
One of the main challenges for companies in the rail transport sector is managing the occupancy of the rail network and maximizing the number of trains in circulation while maintaining safety, which in turn maximizes the revenue generated and minimizes fuel consumption. This article presents a computational solution with a high level of abstraction...

Questions

Questions (4)
Question
Hi there,
I'm working with feature selection and I'm curious about possible ways to determine the number of features (n) to be selected.
In my experiments, the optimal value of n heavily depends on the data set. Several papers use unclear rules of thumb such as sqrt(n) or log_2 n, but I couldn't find any reasonable justification for such choices.
Any insights on this?
Kind Regards.
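For concreteness, the two rules of thumb mentioned in the question can be written out as below. This is only a sketch of the arithmetic, not a justification for either rule. It assumes that the n inside sqrt(n) and log_2 n refers to the total number of features in the data set; the function name `rule_of_thumb_sizes` is hypothetical.

```python
import math


def rule_of_thumb_sizes(n_features: int) -> dict:
    """Return candidate subset sizes from the sqrt and log2 heuristics.

    Assumption: n_features is the total number of features available,
    and each heuristic is rounded to the nearest integer, with at least
    one feature always kept.
    """
    if n_features < 1:
        raise ValueError("n_features must be a positive integer")
    return {
        "sqrt": max(1, round(math.sqrt(n_features))),
        "log2": max(1, round(math.log2(n_features))),
    }


# Example: candidate subset sizes for a data set with 100 features.
sizes = rule_of_thumb_sizes(100)
print(sizes)
```

Both values depend only on the dimensionality, not on the data itself, which is consistent with the observation that the best subset size varies from one data set to another.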
Question
I would like to apply some transformations to my dataset using kernel functions; however, I'm not sure which kernels are available for categorical features.
Question
I am working with feature selection and I would like to add new features, each of which is redundant to an existing one. Nevertheless, I would like this redundancy to be non-linear. Any ideas?
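One possible way to set this up, offered only as an illustrative sketch and not as an established or documented method, is to append a deterministic non-linear function of an existing column. The new column carries no information beyond the original, yet a linear measure such as Pearson correlation may not reveal the dependence. The helper names `add_redundant_feature` and `pearson`, the toy data, and the choice of squaring as the transform are all assumptions made for this example.

```python
import math


def add_redundant_feature(rows, source_idx, transform):
    """Return copies of the rows with transform(row[source_idx]) appended."""
    return [list(row) + [transform(row[source_idx])] for row in rows]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data set: column 0 is symmetric around zero, column 1 is arbitrary.
data = [[-2.0, 0.3], [-1.0, 1.1], [0.0, 0.7], [1.0, 0.2], [2.0, 0.9]]

# The new last column is x**2, fully determined by column 0.
augmented = add_redundant_feature(data, 0, lambda x: x ** 2)

original = [row[0] for row in augmented]
redundant = [row[-1] for row in augmented]
print(pearson(original, redundant))
```

On this symmetric toy column the linear correlation between the original and the squared feature vanishes even though the new feature is a pure function of the old one. Other transforms (for instance `math.tanh` or `math.sin`) can be passed in the same way.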
Question
I want to test some feature selection algorithms I am proposing and I would like to add redundant features to my datasets.
Is there any clever (and documented) way of adding redundant features into a dataset?
Cheers,
