Vuk Batanović

  • PhD
  • Researcher at University of Belgrade

About

Publications: 27
Reads: 8,133
Citations: 213
Introduction
I am an NLP researcher with a particular interest in short-text processing. So far, my work has focused mostly on semantic tasks such as semantic similarity and sentiment analysis. My webpage: https://vukbatanovic.github.io/
Current institution
University of Belgrade
Current position
  • Researcher
Additional affiliations
April 2018 - present
Innovation Center of the School of Electrical Engineering, University of Belgrade
Position
  • Researcher
February 2018 - present
School of Electrical Engineering, University of Belgrade
Position
  • Lecturer
Description
  • Created the teaching materials, gave lectures and practical demonstrations, and supervised student projects within the Data Mining course (Computer Science and Information Technology MSc program) and the new Natural Language Processing course (Software Engineering MSc program and the "Master 4.0: Advanced information technologies in the digital transformation" MSc program).
May 2017 - December 2017
University of Belgrade
Position
  • Lecturer
Description
  • Created the teaching materials, gave lectures and practical demonstrations, and supervised student projects within the Machine Learning course at the Intelligent Systems PhD study program.
Education
January 2012 - December 2020
University of Belgrade
Field of study
  • Natural Language Processing
October 2010 - October 2011
University of Belgrade
Field of study
  • Computer Science and Engineering
October 2006 - October 2010
University of Belgrade
Field of study
  • Computer Science and Engineering

Publications

Conference Paper
In this paper we present SETimes.SR – a gold standard dataset for Serbian, annotated with regard to document, sentence, and token segmentation, morphosyntax, lemmas, dependency syntax, and named entities. We describe the annotation layers and provide a basic statistical overview of them, and we discuss the method of encoding them in the CoNLL and t...
Article
Choosing a comprehensive and cost-effective way of articulating and annotating the sentiment of a text is not a trivial task, particularly when dealing with short texts, in which sentiment can be expressed through a wide variety of linguistic and rhetorical phenomena. This problem is especially conspicuous in resource-limited settings and languages...
Conference Paper
This paper presents an overview of the open access datasets in Serbian that have been manually annotated for the tasks of semantic textual similarity and short-text sentiment classification. In addition, it describes several kinds of statistical models that have been trained and evaluated on these datasets and discusses their results.
Conference Paper
Cross-Level Semantic Similarity (CLSS) is a measure of the level of semantic overlap between texts of different lengths. Although this problem was formulated almost a decade ago, research on it has been sparse, and limited exclusively to the English language. In this paper, we present the first CLSS dataset in another language, in the form of CLSS....
Article
Code comments are one of the most useful forms of documentation and metadata for understanding software implementation. Previous research on code comment classification has focused only on comments in English, typically extracted from a few programming languages. This paper addresses the problem of code comment classification not only in the monoli...
Conference Paper
In terms of natural language processing, Serbian belongs to low-resource languages, with a small number of available datasets and tools. In this paper, we present a novel poem classification corpus in the Serbian language, in multi-label and multiclass variants. We provide an analysis of this dataset and then use it to train and evaluate machine le...
Conference Paper
In this paper, we examine the effectiveness of lemmatizing texts in Serbian and Croatian using a pre-trained large language model fine-tuned on the task of string edit prediction. We define lemmatization as a tagging task, where each word-lemma transformation is represented as a string edit tag which encodes the necessary prefix and suffix alterati...
Article
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, drawing on some sort of corpus is increasingly resorted...
Conference Paper
This paper presents the Serbian datasets developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development-AVANTES, intended for the study of Cross-Level Semantic Similarity (CLSS). CLSS measures the level of semantic overlap between texts of different lengths, and it also refers to the problem of establishing...
Conference Paper
Code comments have become an increasingly important kind of software development metadata, due to the possibilities of automated code comment analysis and generation. Different downstream tasks inherently prioritize certain kinds of code comments over others, making it necessary to properly define and identify different comment classes. In this pap...
Conference Paper
Addresses represent a crucial type of textual data for real estate companies. In order to identify, fix, or remove incorrect entries, we categorize addresses into one of six predefined classes. In this context, we explore the effects of different text processing and classification methods. The best results are obtained by using non-linear classifie...
Conference Paper
Morphosyntactic definitions (MSD) are conventional codes that specify lexical and formal properties of words in context. An annotated Serbian example is given in Table 1, where the text is read vertically, one word per line, with annotations added horizontally. MSD are especially important in the case of Slavic languages, where they carry complex i...
Thesis
Statistical approaches to natural language processing typically require significant amounts of annotated data, and often various auxiliary language tools as well, which limits their application in resource-limited settings. This dissertation presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources...
Conference Paper
The openness of language resources and tools is of great importance for increasing the quality and speed of development of technologies for the computational processing of natural languages. This paper presents open resources for processing Serbian. It describes manually annotated corpora, as well as a broader range of tools and computational models, including a web service that enables their...
Conference Paper
Rapid Integrated Assessment (RIA) is a United Nations Development Programme procedure involving a comparison between a country’s development policy documents and the UN-defined Sustainable Development Goals. In this paper, we present the Serbian AutoRIA system that automates this procedure in Serbian, a resource-limited yet morphologically rich lan...
Conference Paper
With the rapid development and increasing accessibility of natural language processing (NLP) techniques, the exploitation of NLP inside electronic lexicography is on the rise. Textual datasets manually annotated with linguistic information are a backbone of the currently dominating paradigm in NLP based on supervised machine learning. However, develo...
Conference Paper
In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in Co...
Conference Paper
Although the task of semantic textual similarity (STS) has gained in prominence in the last few years, annotated STS datasets for model training and evaluation, particularly those with fine-grained similarity scores, remain scarce for languages other than English, and practically non-existent for minor ones. In this paper, we present the Serbian Se...
Article
An open issue in the sentiment classification of texts written in Serbian is the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled texts. In this paper, we assess the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Data...
Conference Paper
Sentiment classification of texts written in Serbian is still an under-researched topic. One of the open issues is how the different forms of morphological normalization affect the performances of different sentiment classifiers and which normalization procedure is optimal for this task. In this paper we assess and compare the impact of lemmatizers...
Conference Paper
Collecting data for sentiment analysis in resource-limited languages carries a significant risk of sample selection bias, since the small quantities of available data are most likely not representative of the whole population. Ignoring this bias leads to less robust machine learning classifiers and less reliable evaluation results. In this paper we...
Article
This paper presents POST STSS, a method of determining short-text semantic similarity in which part-of-speech tags are used as indicators of the deeper syntactic information usually extracted by more advanced tools like parsers and semantic role labelers. Our model employs a part-of-speech weighting scheme and is based on a statistical bag-of-words...
Article
This paper outlines and categorizes ways of using syntactic information in a number of algorithms for determining the semantic similarity of short texts. We consider the use of word order information, part-of-speech tagging, parsing and semantic role labeling. We analyze and evaluate the effects of syntax usage on algorithm performance by utilizing...
Conference Paper
This paper presents and categorizes ways of using syntactic information in a number of algorithms for determining the semantic similarity of short texts. The performance of the algorithms was evaluated using the results of a paraphrase detection test on the Microsoft Research Paraphrase corpus. Of all the described algorithms and approaches to using syntactic...
Article
Measuring the semantic similarity of short texts is a noteworthy problem since short texts are widely used on the Internet, in the form of product descriptions or captions, image and webpage tags, news headlines, etc. This paper describes a methodology which can be used to create a software system capable of determining the semantic similarity of t...
Conference Paper
This paper describes a software system that rates the degree of semantic similarity between two given short texts in Serbian. It explains the basic principles on which the system operates, as well as the stages of its development and evaluation. It also describes the procedure for generating the paraphrase corpus on which the evaluation was performed. Finally, it analyzes the results of the e...
Conference Paper
This paper describes a software system, developed at the School of Electrical Engineering in Belgrade, for learning the material of the Expert Systems course. The software was developed as an educational system intended for students and is used for teaching in undergraduate and master's studies. The system allows step-by-step review of examples and problems, organized by the topics they belong to, and...