
Tomas Horvath
Eötvös Loránd University · Data Science and Engineering
PhD.
About
99 Publications
23,891 Reads
1,042 Citations
Introduction
Dr. Tomáš Horváth has been the head of the Data Science and Engineering Department at the Eötvös Loránd University in Budapest, Hungary, since September 2016. He received his MSc and PhD degrees at the Pavol Jozef Šafárik University in Košice, Slovak Republic, in 2002 and 2008, respectively. He was on a post-doc internship at the Information Systems and Machine Learning lab of the University of Hildesheim, Germany, from 2009 to 2012. From 2015 to 2016 he held a post-doc grant at the Department of Computer Science, University of São Paulo in São Carlos, Brazil. His research interests include relational learning techniques, pattern mining, recommender systems and personalization. Recently, he has been focusing his work on meta-learning techniques and automated machine learning approaches.
Additional affiliations
October 2012 - May 2015
October 2009 - September 2012
Publications (99)
There are several machine learning algorithms addressing the class imbalance problem, requiring standardized metrics for adequate performance evaluation. This paper reviews several metrics for imbalanced learning in binary and multi-class problems. We emphasize considering class separability, imbalance ratio, and noise when choosing suitable metrics. A...
In this work, we propose a novel method for generating inter-lingual document representations using neural network concept compression. The presented approach is intended to improve the quality of content-based multilingual document recommendation and information retrieval systems by creating a language-independent representation. The main idea is...
Multi-class imbalance problems are non-standard derivative data science problems. These problems are associated with the skewness in the underlying data distribution, which, in turn, raises numerous issues for conventional machine learning techniques. To address the lack of data in imbalance problems, we can either collect new data or oversample th...
Automatic essay scoring (AES) models based on neural networks (NN) have had a lot of success. However, research has shown that NN-based AES models have robustness issues, such that the output of a model changes easily with small changes in the input. We proposed to use keyword-based lexical substitution using BERT that generates new essays (adversa...
With a shortage in the human workforce and increasing ecological challenges, the application of Internet of Things and Artificial Intelligence technologies in the agricultural sector, also known as precision farming, has become very popular in recent years. Agriculture is, however, a domain with specific challenges which AI engineers and data scientists usual...
In this work, we present a multi-modal model for commercial product classification that combines features extracted by multiple neural network models from textual (CamemBERT and FlauBERT) and visual data (SE-ResNeXt-50), using simple fusion techniques. The proposed method significantly outperformed the unimodal models' performance and the reported...
In this study, we propose feature extraction for multimodal meme classification using Deep Learning approaches. A meme is usually a photo or video with text shared by the young generation on social media platforms that expresses a culturally relevant idea. Since they are an efficient way to express emotions and feelings, a good classifier that can...
Automatic evaluation of essays (AES), also called automatic essay scoring, has become a severe problem due to the rise of online learning and evaluation platforms such as Coursera, Udemy, Khan Academy, and so on. Researchers have recently proposed many techniques for automatic evaluation. However, many of these techniques use hand-crafted features...
Pig farming worldwide is largely characterized by closed, large-scale housing technology. These systems are driven by resource efficiency. This means producing the right quantity and quality of pork in a short time with efficient use of resources. In intensive technologies, humans control almost completely and thus greatly influence the lives of pi...
Accurate particle identification is an ongoing task at the European Organization for Nuclear Research, known as CERN, where the challenge remains that targeted particles/events represent tiny minorities against the overwhelming presence of common particles such as protons. This paper presents a directed undersampling using an active learning met...
Robots working in unstructured environments must be capable of sensing and interpreting their surroundings. One of the main obstacles of deep-learning-based models in the field of robotics is the lack of domain-specific labeled data for different industrial applications. In this article, we propose a sim2real transfer learning method based on domai...
In this paper, we proposed Linear Concept Approximation, a novel multilingual document representation approach for the task of multilingual document representation and recommendation. The main idea is in creating representations by using mappings to align monolingual representation spaces using linear concept approximation, that in turn will enhanc...
In this work, we proposed a novel approach to derive inter-lingual document representations. The introduced methods aim to enhance the quality of content-based Multilingual Document Recommendation and information retrieval Systems. The main idea centers around creating inter-lingual representations by using mappings to align monolingual representat...
In this study, we present a multimodal emotion recognition architecture that uses both feature-level attention (sequential co-attention) and modality attention (weighted modality fusion) to classify emotion in art. The proposed architecture helps the model to focus on learning informative and refined representations for both feature extraction and...
Accurate particle identification is an ongoing task at the European Organization for Nuclear Research, known as CERN, where the challenge remains that targeted particles/events represent tiny minorities against the overwhelming presence of common particles such as protons. This paper presents a directed undersampling using an active learning met...
Emotions are very important in dealing with human decisions, interactions, and cognitive processes. Art is an imaginative human creation that should be appreciated, thought-provoking, and elicits an emotional response. The automatic recognition of emotions triggered by art is of considerable importance. It can be used to categorize artworks accordi...
Initializing the hyper-parameters (HPs) of machine learning (ML) techniques has become an important step in the area of automated ML (AutoML). The main premise in HP initialization is that an HP setting that performs well for a certain dataset(s) will also be suitable for a similar dataset. Thus, evaluation of similarities of datasets based on their cha...
The Convolutional Neural Network (CNN) has become one of the most popular techniques in image classification. Usually CNN models are trained on a large amount of data, but this paper discusses CNN usage under data shortage and class imbalance. The study is conducted on a small dataset of vine leaf images on a classification task with fi...
Generative adversarial networks (GANs) could be used efficiently for image and video generation when labeled training data is available in bulk. In general, building a good machine learning model requires a reasonable amount of labeled training data. However, there are areas such as the biomedical field where the creation of such a dataset is time-...
Federated learning (FL) is an emerging branch of machine learning research that examines methods for training models over geographically separated, unbalanced and non-IID data. In FL, on non-convex problems, as in single node training, the almost exclusively used method is mini-batch gradient descent. In this work we examine the effect of using...
The European Union has created a way of classifying wines to make life easier for consumers when choosing the product that most appeals to them; this classification may require control that is hampered by the distancing of production. The automation of control processes is a good way out, but this comes up against the difficulty of computational m...
Recently, abstractive text summarization has achieved success in switching from linear models via sparse and handcrafted features to nonlinear neural network models via dense inputs. This success comes from the application of deep learning models on natural language processing tasks where these models are capable of modeling intricate patterns in d...
This paper explores approaches to minimizing the effort involved in creating annotated instances when training supervised automatic short answer scoring (ASAS) systems. Training supervised ASAS systems requires a large number of annotated instances, and annotating a training set is a time-consuming, expensive and tedious task. To address t...
Estimation of the attention that a blog post is expected to receive is an important text mining task with potential applications in various domains, such as online advertisement or early recognition of highly influential fake news. In the blog feedback prediction task, the number of comments is used as a proxy for the attention. Although factorizatio...
With the development of sophisticated e-learning platforms, educational recommender systems and automatic essay evaluation are becoming an important feature in e-learning systems. Most of the works in educational recommendation techniques are focused on recommending learning materials or learning activities to the learners. In this paper, we propos...
Background: Classification of EEG signals is the common theoretical background of various EEG-related recognition tasks, such as the recognition of symptoms of diseases. We consider these tasks as time series classification tasks for which models based on dynamic time warping (DTW) are popular and effective (Dau et al., 2018; Buza et al., 2015). Acc...
Automated Essay Evaluation (AEE) uses a set of features to evaluate and score students' essay solutions. Most of the features, like lexical similarity, syntax, vocabulary and shallow content, have been addressed, but evaluating students' essays using the semantics and context of the essay has not been addressed well. To address the issues related to the s...
Automated essay evaluation systems use machine learning models to predict the score for an essay. For this, a training essay set is required, which is usually created by humans and requires time-consuming effort. A popular choice for scoring is a nearest neighbor model, which requires on-line computation of nearest neighbors to a given essay. This is, howe...
Educational assessment plays a central role in the teaching-learning process as a tool for evaluating students' knowledge of the concepts associated with the learning objectives. The evaluation and scoring of essay answers is a process that, besides being costly in terms of time spent by teachers, may lead to inequities due to the difficulty in app...
Automated essay evaluation systems use machine learning models to predict the score for an essay. For this, a training essay set is required, which is usually created by humans and requires time-consuming effort. The popular choice for scoring is the nearest neighbor model, which requires on-line computation of nearest neighbors to a given essay. This is...
One of the main current applications of Intelligent Systems is Recommender systems (RS). RS can help users to find relevant items in huge information spaces in a personalized way. Several techniques have been investigated for the development of RS. Among them are Swarm Intelligence (SI) techniques, which are an emerging trend with various applica...
Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possibilities for these hyperparameter configurations, and their complex interactions, it is common to use optimization techniques to find settings that lead to high predict...
This chapter describes a new group of predictive learning algorithms – search‐based and optimization‐based algorithms – which allow us to deal efficiently with more complex classification tasks. Decision tree induction algorithms (DTIAs) induce models with a tree‐shaped decision structure where each internal node is associated with one or more pred...
This chapter describes the three current fields of data analytics that are attracting a great deal of attention due to their wide application in different domains: text mining, social network analysis (SNA) and recommendation systems. Text mining is a very active area of data analytics. Text mining is an important part of several other tasks, like...
Classification is one of the most common tasks in analytics, and the most common in predictive analytics. In addition to being the most common classification task, binary classification is the simplest classification task. This chapter illustrates how the data are distributed in the data set. The main concern of most data analytic applications is t...
Predictive tasks are divided between classification tasks and regression tasks. This chapter focusses mainly on regression. It describes the concepts that are meaningful for both regression and classification, namely generalization and model validation. The chapter also describes some of the most popular regression methods. The methods described ar...
This chapter explores the advanced subjects in predictive analytics. The individual classifiers whose predictions will be combined will be referred to here as the “base” classifiers. Each base classifier can be induced using the same, original, training set, or parts of the original training set. Two important requirements in developing ensembles w...
This chapter focuses on how a data set can be described by descriptive statistics and by visualization techniques for single attributes and pairs of attributes. It presents several univariate and bivariate statistical formulae and data visualization techniques. The chapter describes the different scale types that exist to describe data. There are t...
This chapter presents a cheat sheet of descriptive analytics. The main purpose of descriptive analytics is to understand the data, providing relevant knowledge for future decisions in the project development. It presents the main aspects of the univariate methods, that is, methods used to summarize a single attribute. It also presents a summary of b...
This chapter presents an important family of techniques for descriptive tasks. They can describe a data set by partitioning it, so that objects in the same group are similar to each other. These “clustering” techniques have been developed and extensively used to partition data sets into groups. Clustering techniques use only predictive attributes t...
This chapter discusses frequent itemset mining, describing the three main approaches: Apriori, Eclat and frequent pattern growth (FP‐Growth). Frequent pattern mining methods were developed to deal with very large data sets recorded in hypermarkets and social media sites. The chapter discusses the min_sup threshold, a hyper‐parameter with high i...
This chapter discusses the aspects of data quality and describes the preprocessing techniques frequently used in data analytics. The quality of a dataset strongly affects the results of a data quality project. The chapter also discusses the techniques for data‐type conversions, a necessary operation when the values of a predictive attribute need to...
This introduction presents an overview of key concepts discussed in the subsequent chapters of the book. The book describes two real‐world problems from different areas as an introduction to the different subjects. It explains the multi‐layer perceptron neural networks and k‐means. The book explores the methodologies for planning and developing pro...
This chapter explores a project that relates to the CRoss‐Industry Standard Process for Data Mining (CRISP‐DM) methodology. The data used, entitled “Polish companies bankruptcy data”, can be obtained from the UCI machine learning repository, which is easily accessible on the web. The chapter presents a cheat sheet on predictive algorithms. Investors, banks and...
This chapter describes simple multivariate methods from the three data analysis approaches – frequency, visualization and statistical. The multivariate frequency values can be computed independently for each attribute. The chapter explores how multivariate data can be visually represented in different ways and the main benefits of each of these alt...
Automated essay evaluation (AEE) serves not only as a tool to assess, evaluate and score essays but also helps to save time, effort and money without lowering the quality of goals and objectives of educational assessment. Even if the field has been developing since the 1960s and various algorithms and approaches have been proposed to implement A...
One of the main current applications of intelligent systems is recommender systems (RS). RS can help users to find relevant items in huge information spaces in a personalized way. Several techniques have been investigated for the development of RS. Among them are evolutionary computational (EC) techniques, which are an emerging trend with various ap...
Hyper-parameter tuning is one of the crucial steps in the successful application of machine learning algorithms to real data. In general, the tuning process is modeled as an optimization problem for which several methods have been proposed. For complex algorithms, the evaluation of a hyper-parameter configuration is expensive and their runtime is s...
Supervised classification is the most studied task in Machine Learning. Among the many algorithms used for this task, Decision Tree algorithms are a popular choice, since they are robust and efficient to construct. Moreover, they have the advantage of producing comprehensible models and satisfactory accuracy levels in several application domains. Li...
Collaborative-filtering (CF) techniques have been used successfully for student performance prediction; however, the research was conducted mainly on large and very sparse matrices representing (student, task, performance score) triples. This work investigates the usability of CF techniques in student performance prediction for small universities or courses...
This paper introduces a new algorithm for computing concept lattices from very sparse large-scale formal contexts (input data) where the number of attributes per object is small. The algorithm consists of two steps: generate a diagram of a formal context and compute the concept lattice of the formal context using the diagram built in the previous s...
Ground penetrating radar (GPR) is a non-destructive method to scan the shallow subsurface for detecting buried objects like pipes, cables, ducts and sewers. Such buried objects cause hyperbola-shaped reflections in the radargram images obtained by GPR. Originally, those radargram images were interpreted manually by human experts in an expensive and time...
Rating prediction is a well-known recommendation task aiming to predict a user’s rating for those items which were not rated yet by her. Predictions are computed from users’ explicit feedback, i.e. their ratings provided on some items in the past. Another type of feedback are user reviews provided on items which implicitly express users’ opinions o...
In this paper, an experimental comparison of publicly available algorithms for computing intents of all formal concepts and mining frequent closed itemsets is provided. Experiments are performed on real data sets from UCI Machine Learning Repository and FIMI Repository. Results of experiments are discussed at the end of the paper.
Ground penetrating radar (GPR) is used to scan the shallow subsurface for detecting buried objects like pipes without corrupting the road surface. Buried objects are represented by hyperbola branches on GPR radargram images. As the manual interpretation of such radargrams is expensive and time consuming, an important goal in this field is to automatize...
GPR is a nondestructive method to scan the subsurface. On the resulting radargrams, originally interpreted manually in a time consuming process, one can see hyperbolas corresponding to buried objects. For accelerating the interpretation a machine shall be enabled to recognize hyperbolas on radargrams autonomously. One possibility is the combination...
Ground Penetrating Radar (GPR) is a widely used technique for detecting buried objects in subsoil. Exact localization of buried objects is required, e.g. during environmental reconstruction works to both accelerate the overall process and to reduce overall costs. Radar measurements are usually visualized as images, so-called radargrams, that contai...
Predicting student performance (PSP) is the problem of predicting how well a student will perform on a given task. It has gained more attention from the educational data mining community recently. Previous works show that good results can be achieved by casting PSP as a rating prediction problem in recommender systems, where students, tasks and p...
Ensembles constitute one of the most prominent classes of hybrid prediction models. One basically assumes that different models compensate each other's errors if one combines them in an appropriate way. Often, a large number of various prediction models are available. However, many of them may share similar error characteristics, which highly depress...
Formal Concept Analysis aims at finding clusters (concepts) with given properties in data. Most techniques of concept analysis require a dense matrix with no missing values on the input. However, real data are often incomplete or inaccurate due to the noise or other unforeseen reasons. This paper focuses on using matrix factorization methods to com...
Predicting student performance (PSP), one of the tasks in Student Modeling, has been taken into account by the educational data mining community recently. Previous works show that good results can be achieved by casting PSP as a rating prediction task in recommender systems, where students, tasks and performance scores are mapped to users, items and...
Historically, student performance prediction has been approached with regression models. For instance, the KDD Cup 2010 used the root mean squared error (RMSE) as an evaluation criterion. This is appropriate when the goal is to predict student marks or how well they will perform in a given exercise. Since in many datasets the target variable is bi...
Monotone prediction problems, in which the target variable is non-decreasing given an increase of the explanatory variables, have become more popular in many problem settings which fulfill the so-called monotonicity constraint, namely, if an object is better in all attributes than another one then it should not be classified lower. Recent a...
This work proposes a novel approach, personalized forecasting, to take into account the sequential effect in predicting student performance (PSP). Instead of using all historical data as other methods in PSP do, the proposed methods only use the information of the individual student for forecasting his/her own performance. Moreover, these metho...
Recommender systems are widely used in many areas, especially in e-commerce. Recently, they have also been applied in technology enhanced learning, such as recommending resources (e.g. papers, books,...) to the learners (students). In this study, we propose using state-of-the-art recommender system techniques for predicting student performance. We intro...
Predicting student performance (PSP) is one of the educational data mining tasks, where we would like to know how much knowledge the students have gained and whether they can perform the tasks (or exercises) correctly. Since the student’s knowledge improves and accumulates over time, the sequential (temporal) effect is important information for PSP...
Recommender systems are widely used in many areas, especially in e-commerce. Recently, they are also applied in e-learning for recommending learning objects (e.g. papers) to students. This chapter introduces state-of-the-art recommender system techniques which can be used not only for recommending objects like tasks/exercises to the students b...
Collaborative filtering - one of the recommendation techniques - has been applied for e-learning recently. This technique makes an assumption that each user rates an item once. However, in an educational environment, each student may perform a task (problem) several times. Thus, applying original collaborative filtering for student's task recommen...
We present models, methods, implementations and experiments with a system enabling personalized web search for many users with different preferences. The system consists of a web information extraction part, a text search engine, a middleware supporting top-k answers and a user interface for querying and evaluation of search results. We integrate s...
This paper focuses on a formal model of user preference learning for content-based recommender systems. First, some fundamental and special requirements to user preference learning are identified and proposed. Three learning tasks are introduced as the exact, the order preserving and the iterative user preference learning tasks. The first two tasks...
We propose a model of a middleware system enabling personalized web search for users with different preferences. We integrate both inductive and deductive tasks to find user preferences and consequently best objects. The model is based on modeling preferences by fuzzy sets and fuzzy logic. We present the model-theoretic semantics for fuzzy descript...
We focus on replacing human processing of web resources by automated processing. On an experimental system we identify uncertainty issues making this process difficult for automated processing and try to minimize human intervention. In particular we focus on uncertainty issues in a Web content mining system and a user preference mining system. We conc...
Web search heuristics based on Fagin's threshold algorithm assume we have the user profile in the form of particular attribute ordering and a fuzzy aggregation function representing the user combining function. Having these, there are sufficient algorithms for searching top-k answers. Finding particular attribute ordering and aggregation for a use...
Uncertainty querying of large data can be solved by providing top-k answers according to a user fuzzy ranking/scoring function. Usually different users have different fuzzy scoring functions – a user preference model. The main goal of this paper is to assign a user a preference model automatically. To achieve this we decompose user’s fuzzy ranking funct...
The new direction of the research in the field of data mining is the development of methods to handle imperfection (uncertainty, vagueness, imprecision,...). The main interest in this research is focused on probability models. Besides these there is an extensive study of the phenomena of imperfection in fuzzy logic. In this paper we concentrate esp...
We propose a model of a middleware system enabling personalized web search for users with different preferences. We integrate both inductive and deductive tasks to find user preferences and consequently best objects. The model is based on modeling preferences by fuzzy sets and fuzzy logic. We present the model-theoretic semantics for fuzzy descript...
We are interested in replacing human processing of web resources by automated processing. Based on an experimental system we identify uncertainty issues which make this process difficult for automated processing. We show these uncertainty issues are connected with Web content mining and user preference mining. We conclude with a discussion...
We present a middleware system, UPRE, enabling personalized web search for users with different preferences. The input for UPRE is user evaluation of some objects on a scale from the worst to the best. Our model is inspired by existing models of distributed middleware search. We use both inductive and deductive tasks to find user preferences and conseq...