
Jelena Mitrović
- PhD
- CAROLL Research Group Leader at University of Passau
About
Publications: 94
Reads: 18,691
Citations: 1,003
Introduction
Current institution
Additional affiliations
September 2015 - October 2015
Publications (94)
The paper presents a language-dependent model for classifying statements as ironic or non-ironic. The model uses various language resources: morphological dictionaries, a sentiment lexicon, a lexicon of markers and a WordNet-based ontology. This approach uses various features: antonymous pairs obtained using the reasoning rules over the Serbia...
This paper surveys ontological modeling of rhetorical concepts, developed for use in argument mining and other applications of computational rhetoric, projecting their future directions. We include ontological models of argument schemes applying Rhetorical Structure Theory (RST); the RhetFig proposal for modeling; the related RetFig Ontology of Rhe...
This paper presents our submission for the SemEval shared task 6, sub-task A on the identification of offensive language. Our proposed model, C-BiGRU, combines a Convolutional Neural Network (CNN) with a bidirectional Recurrent Neural Network (RNN). We utilize word2vec to capture the semantic similarities between words. This composition allows us...
We discuss ontological modeling of legal terminology in SUMO (Pease, 2001) in combination with the lexico-semantic database WordNet (Fellbaum, 1998). Formal systems that allow for automated semantic interpretation of law supported by lexical resources can provide solutions to many tasks related to legal reasoning. We wish to formalize legal issues...
Abusive language detection is an unsolved and challenging problem for the NLP community. Recent literature suggests various approaches to distinguish between different language phenomena (e.g., hate speech vs. cyberbullying vs. offensive language) and factors (degree of explicitness and target) that may help to classify different abusive language p...
We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual...
Natural Language Processing (NLP) has revolutionized the analysis of textual data, offering new avenues for understanding different textual content. This paper uses NLP techniques to analyse letters written by Ivo Andrić, a Yugoslavian writer, diplomat, and recipient of the Nobel Prize for Literature. Employing Named Entity Recogni...
Rhetorical figures play an important role in our communication. They are used to convey subtle, implicit meaning, or to emphasize statements. We notice them in hate speech, fake news, and propaganda. By improving the systems for computational detection of rhetorical figures, we can also improve tasks such as hate speech and fake news detection, sen...
We introduce Krony-PT, a compression technique of GPT2 (Radford et al., 2019) based on Kronecker Products. We specifically target the MLP layers of each transformer layer, and systematically compress the feed forward layer matrices to various degrees. We introduce a modified Van Loan decomposition to initialize the new factors, and also introd...
Supervised machine learning often encounters concept drift, where the data distribution changes over time, degrading model performance. Existing drift detection methods focus on identifying these shifts but often overlook the challenge of acquiring labeled data for model retraining after a shift occurs. We present the Strategy for Drift Sampling (S...
The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Relying solely on benchmark performance scores only allows for a weak assessment of model similarity. Thus, in th...
The process of annotating data within the legal sector is filled with distinct challenges that differ from other fields, primarily due to the inherent complexities of legal language and documentation. The initial task usually involves selecting an appropriate raw dataset that captures the intricate aspects of legal texts. Following this, extracting...
Rhetorical figures play a major role in our everyday communication as they make text more interesting, more memorable, or more persuasive. Therefore, it is important to computationally detect rhetorical figures to fully understand the meaning of a text. We provide a comprehensive overview of computational approaches to lesser-known rhetorical figur...
Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of th...
Web search is a crucial technology for the digital economy. However, because it is dominated by a few gatekeepers focused on commercial success, web publishers have to optimize their content for these gatekeepers, resulting in a closed ecosystem of search engines as well as the risk of publishers sacrificing quality. To encourage an open search ecosystem and off...
The calculation of semantic similarity is an important task in Natural Language Processing (NLP). There is a growing interest in this task in the research community, especially following the advent of new, ever-evolving neural architectures. However, this technique has not been explored in-depth in the realm of automatic processing of legal data, t...
Introduction/purpose: With the development of information technologies, the Internet and social networks, the amount of collected data grows year by year at high speed. Data processing and analysis is becoming a necessity without which quality business decisions cannot be made. With the development of data science, tools for processing unstructured...
Language Models (LMs) have shown state-of-the-art performance in Natural Language Processing (NLP) tasks. Downstream tasks such as Named Entity Recognition (NER) or Part-of-Speech (POS) tagging are known to suffer from data imbalance issues, specifically in terms of the ratio of positive to negative examples, and class imbalance. In this paper, we...
There is evidence that specific segments of the population were hit particularly hard by the Covid-19 pandemic (e.g., people with a migration background). In this context, the impact and role played by online platforms in facilitating the integration or fragmentation of public debates and social groups is a recurring topic of discussion. This is wh...
Extracting information from academic PDF documents is crucial for numerous indexing, retrieval, and analysis use cases. Choosing the best tool to extract specific content elements is difficult because many, technically diverse tools are available, but recent performance benchmarks are rare. Moreover, such benchmarks typically cover only a few conte...
The present paper focuses on the presentation and discussion of aspects of OFFENSIVE LANGUAGE linguistic annotation, including the creation, annotation practice, curation, and evaluation of an OFFENSIVE LANGUAGE annotation taxonomy scheme, that was first proposed in Lewandowska-Tomaszczyk et al. (2021). An extended offensive language ontology compr...
Media has a substantial impact on the public perception of events. A one-sided or polarizing perspective on any topic is usually described as media bias. One way in which bias can be introduced in news articles is by altering word choice. Biased word choices are not always obvious, nor do they exhibit high context-dependency. Hence, detecting bi...
Contemporary public debates are often characterized by structural and substantial dissonances. This paper is concerned with normative and empirical evaluations of these dissonances and makes contributions on both levels. We argue that agonistic pluralism provides an insightful, yet often dismissed, theoretical perspective on the matter of political...
To date, the number of studies that address the generalization of argument models is still relatively small. In this study, we extend our stacking model from argument identification to an argument unit classification task. Using this model, and for each of the learned tasks, we address three real-world scenarios concerning the model robustness over...
Background: After the first COVID-19 vaccine appeared, there has been a growing tendency to determine public attitudes toward it automatically. In particular, it has been important to find the reasons for vaccine hesitancy, since it was directly correlated with pandemic protraction. Natural language processing (NLP) and public health researchers hav...
Background: Since the first COVID-19 vaccine appeared, there has been a growing tendency to automatically determine public attitudes toward it. In particular, it was important to find the reasons for vaccine hesitancy, since it was directly correlated with pandemic protraction. Natural language processing (NLP) and public health researchers have t...
In the current world, individuals are faced with decision making problems and opinion formation processes on a daily basis. Nevertheless, answering a comparative question by retrieving documents based only on traditional measures (such as TF-IDF and BM25) does not always satisfy the need. In this paper, we propose a multi-layer architecture to answ...
GRhOOT, the German RhetOrical OnTology, is a domain ontology of 110 rhetorical figures in the German language. The overall goal of building an ontology of rhetorical figures in German is not only the formal representation of different rhetorical figures, but also allowing for their easier detection, thus improving sentiment analysis, argument minin...
The COVID-19 pandemic has brought health problems that concern individuals, the state, and the whole world. The information available on social networks, which were used more frequently and intensively during the pandemic than before, may contain hidden knowledge that can help to better address some problems and apply protective measures more adequa...
This paper examines several widespread assumptions about artificial intelligence, particularly machine learning, that are often taken as factual premises in discussions on the future of patent law in the wake of ‘artificial ingenuity’. The objective is to draw a more realistic and nuanced picture of the human-computer interaction in solving technic...
In this paper, we present our submission from the team CAROLL_Passau for subtask 1A of the HASOC 2021 workshop. Our presented model, C-BiGRU, is composed of a Convolutional Neural Network (CNN) together with a bidirectional Recurrent Neural Network (RNN). We utilized word embeddings to allow our model to apprehend the correlation between words in t...
The main focus of the paper is the definitional revision and enrichment of offensive language typology, making reference to publicly available offensive language datasets and testing them on available pretrained lexical embedding systems. We review over 60 available corpora and compare tagging schemas applied there while making an attempt to explai...
We present an effective way to create a dataset from relevant channels and groups of the messenger service Telegram, to detect clusters in this network, and to find influential actors. Our focus lies on the network of German COVID-19 sceptics that formed on Telegram along with growing restrictions meant to prevent the spreading of COVID-19. We crea...
Argument identification is the cornerstone of a complete argument mining pipeline. Furthermore, it is the essential key for a wide spectrum of applications such as decision making, assisted writing, and legal counselling. Nevertheless, most existing argument mining approaches are limited to a single, specific domain. The problem of building a robus...
The paper examines a set of assumptions about artificial intelligence, particularly machine learning, often taken as factual premises in discussions on the future of patent law in the wake of ‘artificial ingenuity’. The objective is to draw a more realistic and nuanced picture of the human-computer interaction in solving technical problem...
In the current world, individuals are faced with decision making problems and opinion formation processes on a daily basis, for example when debating or choosing between two similar products. However, answering a comparative question by retrieving documents based only on traditional measures (such as TF-IDF and BM25) does not always satisfy the need. T...
Media has a substantial impact on public perception of events, and, accordingly, the way media presents events can potentially alter the beliefs and views of the public. One of the ways in which bias in news articles can be introduced is by altering word choice. Such a form of bias is very challenging to identify automatically due to the high conte...
This paper presents our work on the refinement and improvement of the Serbian language part of Hurtlex, a multilingual lexicon of words to hurt. We pay special attention to adding multi-word expressions that can be seen as abusive, as such lexical entries are very important in obtaining good results in a plethora of abusive language detection tasks...
In this paper, we introduce HateBERT, a re-trained BERT model for abusive language detection in English. The model was trained on RAL-E, a large-scale dataset of Reddit comments in English from communities banned for being offensive, abusive, or hateful that we have collected and made available to the public. We present the results of a detailed co...
The main tool of a lawyer is their language. Legal prose is bound by writing styles, especially in Germany. These styles ensure that, inter alia, judgments are written in a structured and comprehensive way. The writing style used for German judgments is called Urteilsstil and consists of several subcomponents. These subcomponents should be classifiable wi...
This paper describes a neural network (NN) model that was used for participating in the OffensEval, Task 12 of the SemEval 2020 workshop. The aim of this task is to identify offensive speech in social media, specifically in tweets. The model we used, C-BiGRU, is composed of a Convolutional Neural Network (CNN) along with a bidirectional Recurrent N...
The Common European Framework of Reference (CEFR) provides generic guidelines for the evaluation of language proficiency. Nevertheless, for automated proficiency classification systems, different approaches for different languages are proposed. Our paper evaluates and extends the results of an approach to Automatic Essay Scoring proposed as a part...
This paper is about the Greek and Serbian multiword expressions (MWEs) that belong to the rhetorical figure simile. We use a corpus-driven crowdsourcing method to identify the most commonly used similes in Serbian and Greek. We attempt a first comparison of the two sets of data and discuss issues of simile encoding in lexical resources that are use...
In this paper, we introduce the architecture used for our PAN@CLEF-2019 author profiling participation. In this task, we had to predict if the author of 100 tweets was a bot, a female human, or a male human user. This task is proposed from a multilingual perspective, for English and Spanish. We handled this task in two steps, using different featur...
In the 2015 migration crisis, thousands of refugees and migrants crossed the border to Hungary, Austria and Germany. The movements of these people are reflected in social media, especially on Twitter. In this paper we present a dataset of 3275 Tweets from the months September and October 2015. These Tweets are annotated regarding their relevance of...
As part of the shared task of GermEval 2018 we developed a system that is able to detect offensive speech in German tweets. To increase the size of the existing training set we made an application for gathering trending tweets in Germany. This application also assists in manual annotation of those tweets. The main part of the training data consists...
Automation in law can have far reaching advantages in providing direct access to justice. Simple to use applications could help in querying legal concerns and obtaining a preliminary analysis. Legal professionals could use those applications for help with case research and for detecting edge conditions, inequality and loopholes (Ashley, 2017). The...
https://www.uni-hildesheim.de/~linde002/wnlex2018_poster_abstracts.pdf
Research related to rhetorical figures and their automatic processing for the Serbian language started with building the Ontology of Rhetorical Figures (Mladenović & Mitrović, 2013) which gives a formal description of 98 rhetorical figures and allows for their automatic processing. An overview of the way this ontology was built and evaluated will b...
http://typo.uni-konstanz.de/parseme/images/Meeting/2016-09-26-Dubrovnik-meeting/WG1-MITROVIC-MARKANTONATOU-MLADENOVIC-KRSTEV-POSTER.pdf
The aim of this paper is to show a language-independent process of creating a new semantic relation between adjectives and nouns in wordnets. The existence of such a relation is expected to improve the detection of figurative language and sentiment analysis (SA). The proposed method uses an annotated corpus to explore the semantic knowledge contain...
This paper presents a process of building a Sentiment Analysis Framework for Serbian (SAFOS). We created a hybrid method that uses a sentiment lexicon and Serbian WordNet (SWN) synsets assigned with sentiment polarity scores in the process of feature selection. As the use of stemming for morphologically rich languages (MRLs) may result in loss or g...
Poster presented at the 2nd General Meeting of The IC1207 COST Action, PARSEME, an interdisciplinary scientific network devoted to the role of multi-word expressions (MWEs) in parsing.
In this paper we present a set of tools that will help developers of wordnets not only to increase the number of synsets but also to ensure their quality, thus preventing them from becoming obsolete too soon. We discuss where the dangers lie in WordNet production and how they were faced in the case of the Serbian WordNet. Developed tools fall...
In this paper we present a set of new additions and functionalities to recently introduced software tools and techniques that will help researchers in the area of semantics and especially developers of wordnets. The motivation lies in our wish to get an on-line, fully comprehensive, modular, multiuser and safe system for further development of th...
The paper presents RetFig, a formal domain ontology of rhetorical figures for Serbian. This ontology is one of the necessary steps in developing tools for Natural Language Processing in the Serbian language, especially for tools pertinent to discourse analysis, sentiment analysis and opinion mining. The RetFig ontology was developed taking into acc...
The goal of this paper is to point out the importance of crowdsourcing and to present some of the most successful projects that function on the basis of this management model, which originated in the business world but has found its way into the world of culture and science. The ways in which crowdsourcing systems function are explor...
The paper presents details of the project “Europeana libraries: Aggregating digital content from Europe’s libraries”, with a special focus on the participation of the University library “Svetozar Markovic” in it. This CIP-Best Practice Network ICT-PSP project brought together 24 institutions including some of Europe’s leading research libraries from...
The paper presents innovative technologies that can enable significant improvements in the development and promotion of digital collections of local heritage materials. Optical character recognition in the digitization of local heritage collections is a bottleneck that, because of its heavy demands on human resources, poses a significant problem for implementation, ...
LIBER, CERL and CELN have teamed up with the Europeana Foundation to implement the project "Europeana libraries". This CIP ICT PSP project started in January 2011, and by 2013 a brand new library aggregator of Europeana will be operational and 5 million new items will be added to the digital portal of the European cultural heritage. To achi...
Questions
Question (1)
I am interested in possible methods of evaluation of the collected data.