Prodromos Malakasiotis

Athens University of Economics and Business | AUEB · Department of Informatics

Doctor of Philosophy

60 Publications
12,969 Reads
2,874 Citations

Publications (60)
Article
Full-text available
Prompting Large Language Models (LLMs) performs impressively in zero- and few-shot settings. Hence, small and medium-sized enterprises (SMEs) that can afford neither the cost of creating large task-specific training datasets nor the cost of pretraining their own LLMs are increasingly turning to third-party services that allow them to prompt LLMs....
Preprint
Full-text available
We propose the use of conversational GPT models for easy and quick few-shot text classification in the financial domain using the Banking77 dataset. Our approach involves in-context learning with GPT-3.5 and GPT-4, which minimizes the technical expertise required and eliminates the need for expensive GPU computing while yielding quick and accurate...
Preprint
In the era of billion-parameter-sized Language Models (LMs), start-ups have to follow trends and adapt their technology accordingly. Nonetheless, there are open challenges since the development and deployment of large models come with a need for high computational resources and have economic consequences. In this work, we follow the steps of the...
Preprint
Full-text available
Non-hierarchical sparse attention Transformer-based models, such as Longformer and Big Bird, are popular approaches to working with long documents. There are clear benefits to these approaches compared to the original Transformer in terms of efficiency, but Hierarchical Attention Transformer (HAT) models are a vastly understudied alternative. We de...
Preprint
Full-text available
We study the effect of seven data augmentation (DA) methods in factoid question answering, focusing on the biomedical domain, where obtaining training instances is particularly difficult. We experiment with data from the BioASQ challenge, which we augment with training instances obtained from an artificial biomedical machine reading comprehension d...
Preprint
Full-text available
Publicly traded companies are required to submit periodic reports with eXtensible Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags...
Preprint
Full-text available
We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in...
Preprint
Full-text available
Interpretability or explainability is an emerging research field in NLP. From a user-centric point of view, the goal is to build models that provide proper justification for their decisions, similar to those of humans, by requiring the models to satisfy additional constraints. To this end, we introduce a new application on legal text where, contrar...
Article
Full-text available
Interpretability or explainability is an emerging research field in NLP. From a user-centric point of view, the goal is to build models that provide proper justification for their decisions, similar to those of humans, by requiring the models to satisfy additional constraints. To this end, we introduce a new application on legal text where, contrar...
Preprint
Full-text available
Major scandals in corporate history have urged the need for regulatory compliance, where organizations need to ensure that their controls (processes) comply with relevant laws, regulations, and policies. However, keeping track of the constantly changing legislation is difficult, thus organizations are increasingly adopting Regulatory Technology (Re...
Preprint
Full-text available
We investigate contract element extraction. We show that LSTM-based encoders perform better than dilated CNNs, Transformers, and BERT in this task. We also find that domain-specific WORD2VEC embeddings outperform generic pre-trained GLOVE embeddings. Morpho-syntactic features in the form of POS tag and token shape embeddings, as well as context-awa...
Preprint
Full-text available
Although BERT is widely used by the NLP community, little is known about its inner workings. Several attempts have been made to shed light on certain aspects of BERT, often with contradicting conclusions. A much raised concern focuses on BERT's over-parameterization and under-utilization issues. To this end, we propose a novel approach to fine-tune...
Preprint
Full-text available
BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the p...
Article
Full-text available
BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the p...
Preprint
Full-text available
Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications and presents interesting challenges. First, not all labels are well represented in the training set, due to the very large label set and the skewed label distributions of LMTC datasets. Also, label hierarchies and differences in hum...
Article
Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications and presents interesting challenges. First, not all labels are well represented in the training set, due to the very large label set and the skewed label distributions of LMTC datasets. Also, label hierarchies and differences in hum...
Preprint
Full-text available
Transformer-based language models, such as BERT and its variants, have achieved state-of-the-art performance in several downstream natural language processing (NLP) tasks on generic benchmark datasets (e.g., GLUE, SQUAD, RACE). However, these models have mostly been applied to the resource-rich English language. In this paper, we present GREEK-BERT...
Conference Paper
Full-text available
We investigate contract element extraction. We show that LSTM-based encoders perform better than dilated CNNs, Transformers, and BERT in this task. We also find that domain-specific WORD2VEC embeddings outperform generic pre-trained GLOVE embeddings. Morpho-syntactic features in the form of POS tag and token shape embeddings, as well as context-awa...
Preprint
Full-text available
We propose SumQE, a novel Quality Estimation model for summarization based on BERT. The model addresses linguistic quality aspects that are only indirectly captured by content-based approaches to summary evaluation, without involving comparison with human references. SumQE achieves very high correlations with human ratings, outperforming simpler mo...
Preprint
Full-text available
We consider Large-Scale Multi-Label Text Classification (LMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, annotated with ~4.3k EUROVOC labels, which is suitable for LMTC, few- and zero-shot learning. Experimenting with several neural classifiers, we show that BIGRUs with label-wise attention perform bett...
Conference Paper
Full-text available
We consider the task of Extreme Multi-Label Text Classification (XMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, the European Union’s public document database, annotated with concepts from EUROVOC, a multidisciplinary thesaurus. The dataset is substantially larger than previous EURLEX datasets and suita...
Preprint
Full-text available
We consider the task of Extreme Multi-Label Text Classification (XMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, the European Union's public document database, annotated with concepts from EUROVOC, a multidisciplinary thesaurus. The dataset is substantially larger than previous EURLEX datasets and suita...
Article
We consider the task of Extreme Multi-Label Text Classification (XMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, the European Union's public document database, annotated with concepts from EUROVOC, a multidisciplinary thesaurus. The dataset is substantially larger than previous EURLEX datasets and suita...
Article
Full-text available
Experimenting with a dataset of approximately 1.6M user comments from a Greek news sports portal, we explore how a state-of-the-art RNN-based moderation method can be improved by adding user embeddings, user type embeddings, user biases, or user type biases. We observe improvements in all cases, with user embeddings leading to the biggest performan...
Article
Full-text available
Experimenting with a new dataset of 1.6M user comments from a Greek news portal and existing datasets of English Wikipedia comments, we show that an RNN outperforms the previous state of the art in moderation. A deep, classification-specific attention mechanism further improves the overall performance of the RNN. We also compare against a CNN and a...
Conference Paper
Full-text available
We propose a document retrieval method for question answering that represents documents and questions as weighted centroids of word embeddings and reranks the retrieved documents with a relaxation of Word Mover's Distance. Using biomedical questions and documents from BIOASQ, we show that our method is competitive with PUBMED. With a top-k approxim...
Preprint
We propose a document retrieval method for question answering that represents documents and questions as weighted centroids of word embeddings and reranks the retrieved documents with a relaxation of Word Mover's Distance. Using biomedical questions and documents from BIOASQ, we show that our method is competitive with PUBMED. With a top-k approxim...
Article
Full-text available
This article provides an overview of the first BIOASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BIOASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-u...
Conference Paper
Full-text available
Question answering systems aim to find answers to natural language questions by searching in document collections (e.g., repositories of scientific articles or the entire Web) and/or structured data (e.g., databases, ontologies). Strictly speaking, the answer to a question might sometimes be simply 'yes' or 'no', a named entity, or a set of named e...
Conference Paper
Full-text available
This paper describes the system submitted for the Sentiment Analysis in Twitter Task of SEMEVAL 2014 and specifically the Message Polarity Classification subtask. We used a 2-stage pipeline approach employing a linear SVM classifier at each stage and several features including morphological features, POS-tag-based features and lexicon-based featur...
Conference Paper
We introduce the BIOASQ suite, a set of open-source Web tools for the creation, assessment and community-driven improvement of question answering benchmarks. The suite comprises three main tools: (1) the annotation tool supports the creation of benchmarks per se. In particular, this tool allows a team of experts to create questions and answers as w...
Conference Paper
This paper describes the systems with which we participated in the task Sentiment Analysis in Twitter of SEMEVAL 2013 and specifically the Message Polarity Classification. We used a 2-stage pipeline approach employing a linear SVM classifier at each stage and several features including BOW features, POS-based features and lexicon-based features. We...
Conference Paper
Full-text available
We present a method that paraphrases a given sentence by first generating candidate paraphrases and then ranking (or classifying) them. The candidates are generated by applying existing paraphrasing rules extracted from parallel corpora. The ranking component considers not only the overall quality of the rules that produced each candidate, but also...
Article
Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most...
Conference Paper
Full-text available
The subject of this demonstration is natural language interaction, focusing on adaptivity and profiling of the dialogue management and the generated output (text and speech). These are demonstrated in a museum guide use-case, operating in a simulated environment. The main technical innovations presented are the profiling model, the dialogue...
Conference Paper
Full-text available
This paper presents three methods that can be used to recognize paraphrases. They all employ string similarity measures applied to shallow abstractions of the input sentences, and a Maximum Entropy classifier to learn how to combine the resulting features. Two of the methods also exploit WordNet to detect synonyms and one of them also exploits a de...
Conference Paper
This paper describes AUEB’s participation in TAC 2009. Specifically, we participated in the textual entailment recognition track for which we used string similarity measures applied to shallow abstractions of the input sentences, and a Maximum Entropy classifier to learn how to combine the resulting features. We also exploited WordNet to detect syn...
Article
Full-text available
This paper describes AUEB's participation in TAC 2008. Specifically, we participated in the summarization and textual entailment recognition tracks. For the former we trained a Support Vector Regression model that is used to rank the summary's candidate sentences; and for the latter we used a Maximum Entropy classifier along with string similar...
Article
Full-text available
We present the system that we submitted to the 3rd Pascal Recognizing Textual Entailment Challenge. It uses four Support Vector Machines, one for each subtask of the challenge, with features that correspond to string similarity measures operating at the lexical and shallow syntactic level.
Conference Paper
We present MiniCount, the first efficient sound and complete algorithm for finding maximally contained rewritings of conjunctive queries with count, using conjunctive views with count and conjunctive views without aggregation. An efficient and scalable solution to this problem yields significant benefits for data warehousing and decision support sy...
