Špela Vintar

University of Ljubljana · Department of Translation

Professor

Publications: 53
Reads: 4,779
Citations: 428
Citations since 2017: 98 (14 research items)
Introduction
Špela Vintar currently works at the Department of Translation, University of Ljubljana. Špela does research in Machine Translation, Computational Linguistics and Semantics. Her recent project is 'DigiLing - Trans-European e-Learning Hub for Digital Linguistics'. Her current project is 'TermFrame'.
Additional affiliations
September 2015 - September 2016
University of Ljubljana
Position
  • Professor (Full)
September 1998 - present
University of Ljubljana
Position
  • Professor (Associate)

Publications (53)
Preprint
Full-text available
We present an experiment in extracting adjectives which express a specific semantic relation using word embeddings. The results of the experiment are then thoroughly analysed and categorised into groups of adjectives exhibiting formal or semantic similarity. The experiment and analysis are performed for English and Croatian in the domain of karstol...
Article
We describe the creation of a knowledge base in the field of karstology using the frame-based approach. Apart from providing a new multilingual resource using manually annotated definitions as the source of structured information, the main focus is on exploring text mining methods to identify targeted knowledge structures in specialised corpora. Th...
Conference Paper
Full-text available
We present an innovative approach to the representation of domain-specific knowledge which combines traditional concept-oriented terminography with knowledge frames and augments linguistic data with images, videos, interactive graphs and maps. The interface is simple and intuitive, prompting the user to enter a query term in any of the three languag...
Article
Full-text available
We describe a systematic and data-driven approach to karst terminology where knowledge from different textual sources is structured into a comprehensive multilingual knowledge representation. The approach is based on a domain model which is constructed in line with the frame-based approach to terminology and the analytical geomorphological method o...
Article
Based on a monolingual Croatian corpus, we analysed the combinatory potential of key terms in order to determine the conceptual features relevant for categorising karst phenomena. The corpus results pointed to the important role of adjectives in defining geomorphological concepts and classifying them within a taxonomy. With the proposed mod...
Conference Paper
Full-text available
We present the NetViz terminology visualization tool and apply it to the domain modeling of karstology, a subfield of geography studying karst phenomena. The developed tool allows for high-performance online network visualization where the user can upload the terminological data in a simple CSV format, define the nodes (terms, categories), edges (r...
Conference Paper
Full-text available
We report an experiment aimed at extracting words expressing a specific semantic relation using intersections of word embeddings. In a multilingual frame-based domain model, specific features of a concept are typically described through a set of non-arbitrary semantic relations. In karstology, our domain of choice, which we are exploring through a co...
Preprint
Full-text available
We report an experiment aimed at extracting words expressing a specific semantic relation using intersections of word embeddings. In a multilingual frame-based domain model, specific features of a concept are typically described through a set of non-arbitrary semantic relations. In karstology, our domain of choice, which we are exploring through a co...
Conference Paper
Full-text available
Article
Full-text available
The digital age brings dramatic changes to language and communication; its effects can be seen in the ways we use language, the channels we use to communicate and the manners in which ideas are spread. From the other end of the spectrum, our linguistic behaviour, communications and knowledge are transformed into data which can be used or bought to...
Chapter
Comparable corpora may comprise different types of single-word and multi-word phrases that can be considered as reciprocal translations, which may be beneficial for many different natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that allow utilising comparable corpora in order to (1) i...
Chapter
Full-text available
We report on a series of experiments aimed at improving the machine translation of ambiguous lexical items by using WordNet-based unsupervised Word Sense Disambiguation (WSD) and comparing its results to three MT systems. Our experiments are performed for the English-Slovene language pair using UKB, a freely available graph-based word sense disambi...
Conference Paper
Full-text available
We explore definitions in the domain of karstology from a cross-language perspective with the aim of comparing the cognitive frames underlying defining strategies in Croatian and English. The experiment involved the semi-automatic extraction of definition candidates from our corpora, manual selection of valid examples, identification of functional...
Article
This paper addresses lexical creativity and applies corpus-based methods to, firstly, identify potentially creative lexemes and, secondly, compare translations into Slovene from different source languages (English, German, French, and Italian) with texts originally written in Slovene. The primary resource for our work is the Spook corpus of transla...
Article
Slovene Sign Language (SZJ) has as yet received little attention from linguists. This article presents some basic facts about SZJ, its history, current status, and a description of the Slovene Sign Language Corpus and Pilot Grammar (SIGNOR) project, which compiled and annotated a representative corpus of SZJ. Finally, selected quantitative data ext...
Article
Full-text available
This paper presents an analysis of the extraction of single-word and multi-word term candidates, which we carried out with the LUIZ term extractor on the KoRP corpus in order to prepare a terminological database for public relations. We focus on two aspects in more detail: (a) the extracted single-word nominal term candidates, whose list we compare with ...
Conference Paper
Full-text available
We report on the project of compiling the first corpus of the Slovene Sign Language. The paper describes the procedures of data collection, the decisions regarding informant selection and plans for transcription and annotation. We outline the particularities of the Slovene situation, especially the high variability of the language, issues concernin...
Conference Paper
Full-text available
We report on a series of experiments aimed at improving the machine translation of ambiguous lexical items by using wordnet-based unsupervised Word Sense Disambiguation (WSD) and comparing its results to three MT systems. Our experiments are performed for the English-Slovene language pair using UKB, a freely available graph-based word sense disambi...
Article
Full-text available
This paper describes a series of experiments conducted to group similar words using context features derived from a corpus. The goal is to find an approach that would be suitable for cleaning the fuzzy WordNet synsets obtained by automatic translation of Serbian synsets into Slovene. Similar techniques have been used successfully by a number of res...
Article
The paper describes LUIZ, a bilingual term recognition system that has been developed for the Slovene-English language pair. The system is a hybrid term extractor using morphosyntactic patterns and statistical ranking to propose domain-specific expressions for each of the two languages, whereupon translation equivalents between the languages are id...
Conference Paper
Full-text available
The paper presents an innovative approach to extract Slovene definition candidates from domain-specific corpora using morphosyntactic patterns, automatic terminology recognition and semantic tagging with wordnet senses. First, a classification model was trained on examples from Slovene Wikipedia which was then used to find well-formed definitions a...
Conference Paper
Full-text available
The paper presents a set of approaches to extend the automatically created Slovene wordnet with nominal multiword expressions. In the first approach, multiword expressions from Princeton WordNet are translated with a technique that is based on word alignment and lexico-syntactic patterns. This is followed by extracting new terms from a monolingual cor...
Article
Full-text available
Contacts between cultures are a driving force of technological, scientific and linguistic development, where a culturally or economically more advanced region "feeds" its neighbouring regions. The Austro-Hungarian Empire was a multi-cultural environment where this transfer can be observed through, among other processes, translation. The study fo...
Conference Paper
Full-text available
The paper presents the design concept of the VoiceTRAN Communicator that integrates speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the project is to build a robust speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slovenian-Englis...
Article
Full-text available
The paper describes evaluation resources for concept-based, cross-lingual information retrieval in the medical domain. All resources were constructed in the context of the MuchMore project and are freely available through the project website. Available resources include: a bilingual, parallel document collection of German and English medical scient...
Article
Full-text available
In this paper we present a multi-layered approach to document annotation that allows for the structural integration of linguistic and semantic annotations produced by various language technology tools and using knowledge encoded in different domain ontologies as needed for semantic web applications.
Article
We present a framework for concept-based cross-language information retrieval in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data. Documents and queries are annotated with multiple layers of linguistic information...
Conference Paper
Full-text available
We present an approach to using ontologies as interlingua in cross-language information retrieval in the medical domain. Our approach is based on using the Unified Medical Language System (UMLS) as the primary ontology. Documents and queries are annotated with multiple layers of linguistic information (part-of-speech tags, lemmas, phrase chunks). B...
Article
Full-text available
The paper describes a set of experiments aimed at identifying and evaluating context features and machine learning methods to identify medical semantic relations in texts. We use manually constructed lists of pairs of MeSH-classes that represent specific relations, and a linguistically and semantically annotated corpus of medical abstracts to explo...
Article
Full-text available
We explore and evaluate the usefulness of semantic annotation, particularly semantic relations, in cross-language information retrieval in the medical domain. As the baseline for automatic semantic annotation we use UMLS, which specifies semantic relations between medical concepts. We developed two methods to improve the accuracy and yield of relat...
Article
We present a framework for concept-based cross-language information retrieval in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data. Documents and queries are annotated with multiple layers of linguistic information....
Article
Full-text available
An important aspect of word sense disambiguation is the evaluation of different methods and parameters. Unfortunately, there is a lack of test sets for evaluation, specifically for languages other than English and even more so for specific domains like medicine. Given that our work focuses on English as well as German text in the medical domain, we...
Article
Full-text available
We present a framework for concept-based, cross-lingual information retrieval (CLIR) in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data, whereby documents and queries are annotated with multiple layers of linguist...
Article
Full-text available
The paper describes an XML annotation format and tool developed within the MUCHMORE project. The annotation scheme was designed specifically for the purposes of Cross-Lingual Information Retrieval in the medical domain so as to allow both efficient and flexible access to layers of information. We use a parallel English-German corpus of medical abst...
Article
We present a framework for concept-based, cross-lingual information retrieval (CLIR) in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data, whereby documents and queries are annotated with multiple layers of linguist...
Article
Full-text available
In many scientific, technological or political fields terminology and the production of up-to-date reference works is lagging behind, which causes problems for translators and results in inconsistent translations. Experience gained in various projects involving parallel corpora shows that automatic extraction of terms and terminological collocations...
Article
We propose a framework for multi-track annotation of text corpora for terminological purposes. For most corpus-based research-tasks, several levels of linguistic and non-linguistic information must be...
Article
Full-text available
In many scientific, technological or political fields terminology and the production of up-to-date reference works is lagging behind, which causes problems for translators and results in inconsistent translations. Parallel corpora of texts already translated can be used as a resource for automatic extraction of terms and terminological collocations....
Article
Full-text available
Various efforts have been made to develop tools and methods dedicated to the automatic processing of multilingual terminology databases. For that purpose, multilingual parallel corpora have been used as a basic resource. However, most of the neologisms in technical and scientific domains are realised by multiword terms that are rarely i...
Conference Paper
Full-text available
This paper describes an original hybrid system that extracts multiword unit candidates from part-of-speech tagged corpora. While classical hybrid systems manually define local part-of-speech patterns that lead to the identification of well-known multiword units (mainly compound nouns), we automatically identify relevant syntactical patterns from th...
Article
Full-text available
Named Entity Recognition. The paper deals with Named Entities in German and Slovene texts. We first describe Named Entities from a linguistic viewpoint, where a typology of Named Entities and a theoretical framework is given. The second part of the paper describes different methods of Named Entity Recognition (NER). Since names, like all nouns, are...
Article
The paper presents the concept design of the VoiceTRAN Communicator that integrates speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the project is to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slov...
Article
Full-text available
The paper describes an innovative approach to expanding the domain coverage of wordnet by exploiting multiple resources. In the experiment described here we are using a large monolingual Slovene corpus of texts from the domain of informatics to harvest terminology from, and a parallel English-Slovene corpus and an online dictionary as bilingual res...
Article
Full-text available
In statistical term extraction systems the identification and selection of nested term candidates often presents a challenge. The paper presents an implementation and evaluation of C-value, a heuristic that ranks and/or discards nested terms according to their stability in the corpus. The method was tested for English and Slovene, for both the over...
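As a rough illustration of the C-value heuristic mentioned in the abstract above, the following is a minimal sketch of the commonly cited textbook formulation, not the implementation evaluated in the paper. A term that never occurs inside a longer candidate scores log2(length) × frequency. A nested term has the mean frequency of its containing candidates subtracted from its own frequency first. The function names and the toy frequencies are invented for this example, and the full algorithm's bookkeeping for multiply nested terms is deliberately left out.

```python
import math


def is_nested(shorter, longer):
    """Return True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n, m = len(shorter), len(longer)
    if n >= m:
        return False
    return any(longer[i:i + n] == shorter for i in range(m - n + 1))


def c_value(freqs):
    """Score candidate terms with a simplified C-value.

    freqs maps each candidate term (a tuple of words) to its corpus frequency.
    Single-word candidates score 0 here because log2(1) == 0; the measure is
    normally applied to multi-word candidates.
    """
    scores = {}
    for term, freq in freqs.items():
        # Frequencies of all longer candidates that contain this term.
        containers = [f for other, f in freqs.items() if is_nested(term, other)]
        weight = math.log2(len(term))
        if containers:
            scores[term] = weight * (freq - sum(containers) / len(containers))
        else:
            scores[term] = weight * freq
    return scores


# Hypothetical frequencies, made up for illustration only.
toy_freqs = {
    ("karst", "spring"): 10,
    ("intermittent", "karst", "spring"): 4,
    ("submarine", "karst", "spring"): 2,
}
scores = c_value(toy_freqs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

In this toy input the nested candidate "karst spring" is discounted by the average frequency of the two longer terms that contain it, which is the "stability" idea the abstract refers to.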

Question (1)
Question
In Slovenia, the Personal Data Protection Officer recently issued a verdict for the freely available general language corpus Nova beseda, which prohibits the appearance of personal names in the corpus (or rather, as the explanation goes, the corpus search facility that will display a concordance of any given name+surname). Their argument is that a name is an item of personal data which can be linked to a concrete individual, hence all text corpora should be anonymized for names of (living? they do not say...) persons. Does anyone have any relevant experience or reference about such legal issues elsewhere? Experience from other EU countries would be particularly welcome...

Projects (4)
Project
TermFrame is a 3-year research project in which we explore Karst terminology using the frame-based approach. The final result is the TermFrame knowledge base containing terms, definitions, concept frames, images, videos and maps of Karst phenomena. The data in the knowledge base was obtained from large representative text collections containing books, articles, PhD theses and lexicons about karst. The relevant elements of knowledge were identified either manually through multiple levels of annotation, or automatically using advanced text mining and natural language processing methods.
Archived project
Basic info: sloWNet is a lexical database of Slovene in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by semantic and lexical relations, such as HYPONYMY and ANTONYMY. sloWNet is based on Princeton WordNet and has been created automatically from different types of existing resources, such as bilingual dictionaries, parallel corpora and Wikipedia. The current version of sloWNet is 3.1 (last change May 7, 2015), which contains 43,460 synsets and 71,803 literals, 33,546 of which have been manually validated. More info: http://lojze.lugos.si/darja/research/slownet/