Grigorios Tsoumakas

Grigorios Tsoumakas
Aristotle University of Thessaloniki | AUTH · Department of Informatics

BSc Computer Science, MSc Artificial Intelligence, PhD Computer Science

About

198
Publications
75,946
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
11,451
Citations
Additional affiliations
June 2019 - August 2019
IBM
Position
  • Academic Visitor
April 2013 - present
Aristotle University of Thessaloniki
Position
  • Professor (Assistant)
August 2007 - April 2013
Aristotle University of Thessaloniki
Position
  • Lecturer
Education
January 2001 - June 2005
Aristotle University of Thessaloniki
Field of study
  • Computer Science
October 1999 - September 2000
The University of Edinburgh
Field of study
  • Artificial Intelligence
October 1994 - June 1999
Aristotle University of Thessaloniki
Field of study
  • Computer Science

Publications

Publications (198)
Preprint
Full-text available
Multi-label classification is a challenging task, particularly in domains where the number of labels to be predicted is large. Deep neural networks are often effective at multi-label classification of images and textual data. When dealing with tabular data, however, conventional machine learning algorithms, such as tree ensembles, appear to outperf...
Article
Current research in yes/no question answering (QA) focuses on transfer learning techniques and transformer-based models. Models trained on large corpora are fine-tuned on tasks similar to yes/no QA, and then the captured knowledge is transferred for solving the yes/no QA task. Most previous studies use existing similar tasks, such as natural langua...
Article
Full-text available
In critical situations involving discrimination, gender inequality, economic damage, and even the possibility of casualties, machine learning models must be able to provide clear interpretations of their decisions. Otherwise, their obscure decision-making processes can lead to socioethical issues as they interfere with people’s lives. Random forest...
Preprint
Topic-controllable summarization is an emerging research area with a wide range of potential applications. However, existing approaches suffer from significant limitations. First, there is currently no established evaluation metric for this task. Furthermore, existing methods built upon recurrent architectures, which can significantly limit their p...
Article
Full-text available
Artificial Intelligence (AI) is having an enormous impact on the rise of technology in every sector. Indeed, AI-powered systems are monitoring and deciding on sensitive economic and societal issues. The future is moving towards automation, and we must not prevent it. Many people, though, have opposing views because of the fear of uncontrollable AI...
Preprint
Full-text available
Dimensionality reduction (DR) is a popular method for preparing and analyzing high-dimensional data. Reduced data representations are less computationally intensive and easier to manage and visualize, while retaining a significant percentage of their original information. Aside from these advantages, these reduced representations can be difficult o...
Article
Full-text available
Predicting drug-target interactions (DTI) via reliable computational methods is an effective and efficient way to mitigate the enormous costs and time of the drug discovery process. Structure-based drug similarities and sequence-based target protein similarities are the commonly used information for DTI prediction. Among numerous computational meth...
Preprint
Full-text available
The discovery of drug-target interactions (DTIs) is a very promising area of research with great potential. In general, the identification of reliable interactions among drugs and proteins can boost the development of effective pharmaceuticals. In this work, we leverage random walks and matrix factorization techniques towards DTI prediction. In par...
Preprint
Disfluency detection is a critical task in real-time dialogue systems. However, despite its importance, it remains a relatively unexplored field, mainly due to the lack of appropriate datasets. At the same time, existing datasets suffer from various issues, including class imbalance issues, which can significantly affect the performance of the mode...
Article
Full-text available
Online hate speech is a recent problem in our society that is rising at a steady pace by leveraging the vulnerabilities of the corresponding regimes that characterise most social media platforms. This phenomenon is primarily fostered by offensive comments, either during user interaction or in the form of a posted multimedia context. Nowadays, giant...
Article
Full-text available
This is the french translation of "Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3), 1-13. 10.4018/jdwm.2007070101"
Article
Full-text available
The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary widely used in biomedical knowledge systems, particularly for semantic indexing of scientific literature. As the MeSH hierarchy evolves through annual version updates, some new descriptors are introduced that were not previously available. This paper explores the conceptual pr...
Preprint
Full-text available
Bayesian Active Learning has had significant impact to various NLP problems, but nevertheless it's application to text summarization has been explored very little. We introduce Bayesian Active Summarization (BAS), as a method of combining active learning methods with state-of-the-art summarization models. Our findings suggest that BAS achieves bett...
Article
Class imbalance is an inherent characteristic of multi-label data that hinders most multi-label learning methods. One efficient and flexible strategy to deal with this problem is to employ sampling techniques before training a multi-label learning model. Although existing multi-label sampling approaches alleviate the global imbalance of multi-label...
Chapter
Energy production using renewable sources exhibits inherent uncertainties due to their intermittent nature. Nevertheless, the unified European energy market promotes the increasing penetration of renewable energy sources (RES) by the regional energy system operators. Consequently, RES forecasting can assist in the integration of these volatile ener...
Article
Full-text available
In the era of Big Data, the digitization of texts and the advancements in Artificial Intelligence (AI) and Natural Language Processing (NLP) are enabling the automatic analysis of literary works, allowing us to delve into the structure of artifacts and to compare, explore, manage and preserve the richness of our written heritage. This paper propose...
Article
Full-text available
Manual classification of works of literature with genre/form concepts is a time-consuming task requiring domain expertise. Building automated systems based on language understanding can help humans to achieve this work faster and more consistently. Towards this direction, we present a case study on automatic classification of Greek literature books...
Article
Zero-shot learning constitutes a variant of the broader category of weakly supervised learning algorithms. Its main asset is the possibility of identifying entities for which no training data are provided in advance. Under this extreme scenario, conventional supervised learning methods cannot operate properly, while consumption of human resources f...
Conference Paper
Full-text available
The use of machine learning rapidly increases in high-risk scenarios where decisions are required, for example in healthcare or industrial monitoring equipment. In crucial situations, a model that can offer meaningful explanations of its decision-making is essential. In industrial facilities, the equipment's well-timed maintenance is vital to ensur...
Preprint
Full-text available
Energy production using renewable sources exhibits inherent uncertainties due to their intermittent nature. Nevertheless, the unified European energy market promotes the increasing penetration of renewable energy sources (RES) by the regional energy system operators. Consequently, RES forecasting can assist in the integration of these volatile ener...
Chapter
The constant evolution of Medical Subject Headings (MeSH) vocabulary and specifically the changes in its descriptors brings forth a number of issues that need automation. The main one being that changed descriptors often lack proper ground truth articles. Therefore, the learning models which demand strong supervision are not directly applicable, se...
Preprint
In this document, we report an analysis of the Public MeSH Note field of the new descriptors introduced in the MeSH thesaurus between 2006 and 2020. The aim of this analysis was to extract information about the previous status of these new descriptors as Supplementary Concept Records. The Public MeSH Note field contains information in semi-structur...
Preprint
Full-text available
We propose a novel approach to summarization based on Bayesian deep learning. We approximate Bayesian summary generation by first extending state-of-the-art summarization models with Monte Carlo dropout and then using them to perform multiple stochastic forward passes. This method allows us to improve summarization performance by simply using the m...
Preprint
The purpose of citation indexes and metrics is intended to be a measure for scientific innovation and quality for researchers, journals, and institutions. However, those metrics are often prone to abuse and manipulation by excessive and unethical self-citations induced by authors, reviewers, editors, or journals. Identifying whether there are or no...
Preprint
Full-text available
In drug discovery, identifying drug-target interactions (DTIs) via experimental approaches is a tedious and expensive procedure. Computational methods efficiently predict DTIs and recommend a small part of potential interacting pairs for further experimental confirmation, accelerating the drug discovery process. Area under the precision-recall curv...
Article
Full-text available
Distantly-supervised relation extraction (RE) is an effective method to scale RE to large corpora but suffers from noisy labels. Existing approaches try to alleviate noise through multi-instance learning and by providing additional information but manage to recognize mainly the top frequent relations, neglecting those in the long-tail. We propose R...
Preprint
Full-text available
Artificial Intelligence (AI) has a tremendous impact on the unexpected growth of technology in almost every aspect. AI-powered systems are monitoring and deciding about sensitive economic and societal issues. The future is towards automation, and it must not be prevented. However, this is a conflicting viewpoint for a lot of people, due to the fear...
Preprint
Full-text available
In critical situations involving discrimination, gender inequality, economic damage, and even the possibility of casualties, machine learning models must be able to provide clear interpretations for their decisions. Otherwise, their obscure decision-making processes can lead to socioethical issues as they interfere with people's lives. In the afore...
Preprint
Full-text available
The use of machine learning rapidly increases in high-risk scenarios where decisions are required, for example in healthcare or industrial monitoring equipment. In crucial situations, a model that can offer meaningful explanations of its decision-making is essential. In industrial facilities, the equipment's well-timed maintenance is vital to ensur...
Preprint
Entity Linking (EL) seeks to align entity mentions in text to entries in a knowledge-base and is usually comprised of two phases: candidate generation and candidate ranking. While most methods focus on the latter, it is the candidate generation phase that sets an upper bound to both time and accuracy performance of the overall EL system. This work'...
Preprint
Distantly-supervised relation extraction (RE) is an effective method to scale RE to large corpora but suffers from noisy labels. Existing approaches try to alleviate noise through multi-instance learning and by providing additional information, but manage to recognize mainly the top frequent relations, neglecting those in the long-tail. We propose...
Preprint
Full-text available
The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary widely used in biomedical knowledge systems, particularly for semantic indexing of scientific literature. As the MeSH hierarchy evolves through annual version updates, some new descriptors are introduced that were not previously available. This paper explores the conceptual pr...
Preprint
Full-text available
Predicting drug-target interactions (DTI) via reliable computational methods is an effective and efficient way to mitigate the enormous costs and time of the drug discovery process. Structure-based drug similarities and sequence-based target protein similarities are the commonly used information for DTI prediction. Among numerous computational meth...
Preprint
Full-text available
In the medical domain, a Systematic Literature Review (SLR) attempts to collect all empirical evidence, that fit pre-specified eligibility criteria, in order to answer a specific research question. The process of preparing an SLR consists of multiple tasks that are labor-intensive and time-consuming, involving large monetary costs. Technology-assis...
Article
We present a novel divide-and-conquer method for the neural summarization of long documents. Our method exploits the discourse structure of the document and uses sentence similarity to split the problem into an ensemble of smaller summarization problems. In particular, we break a long document and its summary into multiple source-target pairs, whic...
Article
Full-text available
Web crawlers account for more than a third of the total web traffic and they are threatening the security, privacy and veracity of web applications and their users. Businesses in finance, ticketing, and publishing, as well as websites with rich and unique content are the ones mostly affected by their actions. To deal with this problem, we present a...
Conference Paper
Full-text available
We present the systems we submitted for the shared tasks of the Workshop on Scholarly Document Processing at EMNLP 2020. Our approaches to the tasks are focused on exploiting large Transformer models pre-trained on huge corpora and adapting them to the different shared tasks. For tasks 1A and 1B of CL-SciSumm we are using different variants of the...
Preprint
Full-text available
Interpretable machine learning is an emerging field providing solutions on acquiring insights into machine learning models' rationale. It has been put in the map of machine learning by suggesting ways to tackle key ethical and societal issues. However, existing techniques of interpretable machine learning are far from being comprehensible and expla...
Conference Paper
Although numerous applications that have been developed during the last years produce vast amounts of data, the inability to obtain their ground truth target values has triggered the appearance of several new machine learning (ML) variants that tackle such phenomena. The main reasons why this happens are the evolutionary nature that characterizes t...
Article
In this work, we propose a method for the automated refinement of subject annotations in biomedical literature at the level of concepts. Semantic indexing and search of biomedical articles in MEDLINE/PubMed are based on semantic subject annotations with MeSH descriptors that may correspond to several related but distinct biomedical concepts. Such s...
Preprint
Keyword extraction is an important document process that aims at finding a small set of terms that concisely describe a document's topics. The most popular state-of-the-art unsupervised approaches belong to the family of the graph-based methods that build a graph-of-words and use various centrality measures to score the nodes (candidate keywords)....
Preprint
Full-text available
Online hate speech is a newborn problem in our modern society which is growing at a steady rate exploiting weaknesses of the corresponding regimes that characterise several social media platforms. Therefore, this phenomenon is mainly cultivated through such comments, either during users' interaction or on posted multimedia context. Nowadays, giant...
Preprint
In this work, we propose a method for the automated refinement of subject annotations in biomedical literature at the level of concepts. Semantic indexing and search of biomedical articles in MEDLINE/PubMed are based on semantic subject annotations with MeSH descriptors that may correspond to several related but distinct biomedical concepts. Such s...
Chapter
Class-imbalance is an inherent characteristic of multi-label data which affects the prediction accuracy of most multi-label learning methods. One efficient strategy to deal with this problem is to employ resampling techniques before training the classifier. Existing multi-label sampling methods alleviate the (global) imbalance of multi-label datase...
Preprint
Full-text available
We present a novel divide-and-conquer method for the neural summarization of long documents. Our method exploits the discourse structure of the document and uses sentence similarity to split the problem into an ensemble of smaller summarization problems. In particular, we break a long document and its summary into multiple source-target pairs, whic...
Chapter
Technological breakthroughs on smart homes, self-driving cars, health care and robotic assistants, in addition to reinforced law regulations, have critically influenced academic research on explainable machine learning. A sufficient number of researchers have implemented ways to explain indifferently any black box model for classification tasks. A...
Chapter
We propose SUSIE, a novel summarization method that can work with state-of-the-art summarization models in order to produce structured scientific summaries for academic articles. We also created PMC-SA, a new dataset of academic publications, suitable for the task of structured summarization with neural networks. We apply SUSIE combined with three...
Chapter
The field of question answering has gained greater attention with the rise of deep neural networks. More and more approaches adopt paradigms which are based primarily on the powerful language representations models and transfer learning techniques to build efficient learning models which are able to outperform current state of the art systems. Endo...
Article
Full-text available
Keyphrase extraction is a textual information processing task concerned with the automatic extraction of representative and characteristic phrases from a document that express all the key aspects of its content. Keyphrases constitute a succinct conceptual summary of a document, which is very useful in digital information management systems for sema...
Book
This book constitutes the proceedings of the 23rd International Conference on Discovery Science, DS 2020, which took place during October 19-21, 2020. The conference was planned to take place in Thessaloniki, Greece, but had to change to an online format due to the COVID-19 pandemic. The 26 full and 19 short papers presented in this volume were car...
Article
Class imbalance is an intrinsic characteristic of multi-label data. Most of the labels in multi-label data sets are associated with a small number of training examples, much smaller compared to the size of the data set. Class imbalance poses a key challenge that plagues most multi-label learning methods. Ensemble of Classifier Chains (ECC), one of...
Preprint
Full-text available
Towards a future where machine learning systems will integrate into every aspect of people's lives, researching methods to interpret such systems is necessary, instead of focusing exclusively on enhancing their performance. Enriching the trust between these systems and people will accelerate this integration process. Many medical and retail banking...
Article
Identifying drug-target interactions is crucial for drug discovery. Despite modern technologies used in drug screening, experimental identification of drug-target interactions is an extremely demanding task. Predicting drug-target interactions in silico can thereby facilitate drug discovery as well as drug repositioning. Various machine learning mo...
Preprint
Full-text available
Technological breakthroughs on smart homes, self-driving cars, health care and robotic assistants, in addition to reinforced law regulations, have critically influenced academic research on explainable machine learning. A sufficient number of researchers have implemented ways to explain indifferently any black box model for classification tasks. A...
Article
Full-text available
Food sales prediction is concerned with estimating future sales of companies in the food industry, such as supermarkets, groceries, restaurants, bakeries and patisseries. Accurate short-term sales prediction allows companies to minimize stocked and expired products inside stores and at the same time avoid missing sales. This paper reviews existing...
Preprint
Full-text available
We propose SUSIE, a novel summarization method that can work with state-of-the-art summarization models in order to produce structured scientific summaries for academic articles. We also created PMC-SA, a new dataset of academic publications, suitable for the task of structured summarization with neural networks. We apply SUSIE combined with three...
Preprint
Automated keyphrase extraction is a crucial textual information processing task regarding the most types of digital content management systems. It concerns the selection of representative and characteristic phrases from a document that express all aspects related to its content. This article introduces the task of keyphrase extraction and provides...
Preprint
Full-text available
Class-imbalance is an inherent characteristic of multi-label data which affects the prediction accuracy of most multi-label learning methods. One efficient strategy to deal with this problem is to employ resampling techniques before training the classifier. Existing multilabel sampling methods alleviate the (global) imbalance of multi-label dataset...
Article
Full-text available
Active learning is an iterative supervised learning task where learning algorithms can actively query an oracle, i.e. a human annotator that understands the nature of the problem, to obtain the ground truth. The motivation behind this approach is to allow the learner to interactively choose the data it will learn from, which can lead to significant...
Article
The figures found in biomedical literature are a vital part of biomedical research, education and clinical decision. The multitude of their modalities and the lack of corresponding meta-data, constitute search and information retrieval a difficult task. We introduce novel multi-label modality classification approaches for biomedical figures without...
Article
Biomedical question answering (QA) is a challenging task that has not been yet successfully solved, according to results on international benchmarks, such as BioASQ. Recent progress on deep neural networks has led to promising results in domain independent QA, but the lack of large datasets with biomedical question-answer pairs hinders their succes...