
Špela Vintar
University of Ljubljana · Department of Translation
Professor
About
53
Publications
4,779
Reads
428
Citations
Citations since 2017
Introduction
Špela Vintar currently works at the Department of Translation, University of Ljubljana. Špela does research in Machine Translation, Computational Linguistics and Semantics. Her recent project is 'DigiLing - Trans-European e-Learning Hub for Digital Linguistics'. Her current project is 'TermFrame'.
Additional affiliations
September 2015 - September 2016
September 1998 - present
Publications (53)
We present an experiment in extracting adjectives which express a specific semantic relation using word embeddings. The results of the experiment are then thoroughly analysed and categorised into groups of adjectives exhibiting formal or semantic similarity. The experiment and analysis are performed for English and Croatian in the domain of karstol...
We describe the creation of a knowledge base in the field of karstology using the frame-based approach. Apart from providing a new multilingual resource using manually annotated definitions as the source of structured information, the main focus is on exploring text mining methods to identify targeted knowledge structures in specialised corpora. Th...
We present an innovative approach to the representation of domain-specific knowledge which combines traditional concept-oriented terminography with knowledge frames and augments linguistic data with images, videos, interactive graphs and maps. The interface is simple and intuitive, prompting the user to enter a query term in any of the three languag...
We describe a systematic and data-driven approach to karst terminology where knowledge from different textual sources is structured into a comprehensive multilingual knowledge representation. The approach is based on a domain model which is constructed in line with the frame-based approach to terminology and the analytical geomorphological method o...
Based on a monolingual Croatian corpus, we analysed the combinatory potential of key terms in order to determine the conceptual features relevant for categorising karst phenomena. The corpus results pointed to the important role of adjectives in defining geomorphological concepts and classifying them within a taxonomy. With the proposed mod...
We present the NetViz terminology visualization tool and apply it to the domain modeling of karstology, a subfield of geography studying karst phenomena. The developed tool allows for high-performance online network visualization where the user can upload the terminological data in a simple CSV format, define the nodes (terms, categories), edges (r...
We report an experiment aimed at extracting words expressing a specific semantic relation using intersections of word embeddings. In a multilingual frame-based domain model, specific features of a concept are typically described through a set of non-arbitrary semantic relations. In karstology, our domain of choice which we are exploring through a co...
We report an experiment aimed at extracting words expressing a specific semantic relation using intersections of word embeddings. In a multilingual frame-based domain model, specific features of a concept are typically described through a set of non-arbitrary semantic relations. In karstology, our domain of choice which we are exploring through a co...
The digital age brings dramatic changes to language and communication; its effects can be seen in the ways we use language, the channels we use to communicate and the manners in which ideas are spread. From the other end of the spectrum, our linguistic behaviour, communications and knowledge are transformed into data which can be used or bought to...
Comparable corpora may comprise different types of single-word and multi-word phrases that can be considered as reciprocal translations, which may be beneficial for many different natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that allow utilising comparable corpora in order to (1) i...
We report on a series of experiments aimed at improving the machine translation of ambiguous lexical items by using WordNet-based unsupervised Word Sense Disambiguation (WSD) and comparing its results to three MT systems. Our experiments are performed for the English-Slovene language pair using UKB, a freely available graph-based word sense disambi...
We explore definitions in the domain of karstology from a cross-language perspective with the aim of comparing the cognitive frames underlying defining strategies in Croatian and English. The experiment involved the semi-automatic extraction of definition candidates from our corpora, manual selection of valid examples, identification of functional...
This paper addresses lexical creativity and applies corpus-based methods to, firstly, identify potentially creative lexemes and, secondly, compare translations into Slovene from different source languages (English, German, French, and Italian) with texts originally written in Slovene. The primary resource for our work is the Spook corpus of transla...
Slovene Sign Language (SZJ) has as yet received little attention from linguists. This article presents some basic facts about SZJ, its history, current status, and a description of the Slovene Sign Language Corpus and Pilot Grammar (SIGNOR) project, which compiled and annotated a representative corpus of SZJ. Finally, selected quantitative data ext...
In this paper we present an analysis of the extraction of single-word and multi-word term candidates, which we carried out with the LUIZ term extractor on the KoRP corpus in order to prepare a terminological database for public relations. We focus in more detail on two aspects: (a) the extracted single-word nominal term candidates, whose list we compare with...
We report on the project of compiling the first corpus of the Slovene Sign Language. The paper describes the procedures of data collection, the decisions regarding informant selection and plans for transcription and annotation. We outline the particularities of the Slovene situation, especially the high variability of the language, issues concernin...
We report on a series of experiments aimed at improving the machine translation of ambiguous lexical items by using wordnet-based unsupervised Word Sense Disambiguation (WSD) and comparing its results to three MT systems. Our experiments are performed for the English-Slovene language pair using UKB, a freely available graph-based word sense disambi...
This paper describes a series of experiments conducted to group similar words using context features derived from a corpus. The goal is to find an approach that would be suitable for cleaning the fuzzy WordNet synsets obtained by automatic translation of Serbian synsets into Slovene. Similar techniques have been used successfully by a number of res...
The paper describes LUIZ, a bilingual term recognition system that has been developed for the Slovene-English language pair. The system is a hybrid term extractor using morphosyntactic patterns and statistical ranking to propose domain-specific expressions for each of the two languages, whereupon translation equivalents between the languages are id...
The paper presents an innovative approach to extract Slovene definition candidates from domain-specific corpora using morphosyntactic patterns, automatic terminology recognition and semantic tagging with wordnet senses. First, a classification model was trained on examples from Slovene Wikipedia which was then used to find well-formed definitions a...
The paper presents a set of approaches to extend the automatically created Slovene wordnet with nominal multiword expressions. In the first approach multiword expressions from Princeton WordNet are translated with a technique that is based on word alignment and lexico-syntactic patterns. This is followed by extracting new terms from a monolingual cor...
Contacts between cultures are a driving force of technological, scientific and linguistic development, where a culturally or economically more advanced region "feeds" its neighbouring regions. The Austro-Hungarian Empire was a multi-cultural environment where this transfer can be observed through – among other processes – translation. The study fo...
The paper presents the design concept of the VoiceTRAN Communicator that integrates speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the project is to build a robust speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slovenian-Englis...
The paper describes evaluation resources for concept-based, cross-lingual information retrieval in the medical domain. All resources were constructed in the context of the MuchMore project and are freely available through the project website. Available resources include: a bilingual, parallel document collection of German and English medical scient...
In this paper we present a multi-layered approach to document annotation that allows for the structural integration of linguistic and semantic annotations produced by various language technology tools and using knowledge encoded in different domain ontologies as needed for semantic web applications.
We present a framework for concept-based cross-language information retrieval in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data. Documents and queries are annotated with multiple layers of linguistic information...
We present an approach to using ontologies as interlingua in cross-language information retrieval in the medical domain. Our approach is based on using the Unified Medical Language System (UMLS) as the primary ontology. Documents and queries are annotated with multiple layers of linguistic information (part-of-speech tags, lemmas, phrase chunks). B...
The paper describes a set of experiments aimed at identifying and evaluating context features and machine learning methods to identify medical semantic relations in texts. We use manually constructed lists of pairs of MeSH-classes that represent specific relations, and a linguistically and semantically annotated corpus of medical abstracts to explo...
We explore and evaluate the usefulness of semantic annotation, particularly semantic relations, in cross-language information retrieval in the medical domain. As the baseline for automatic semantic annotation we use UMLS, which specifies semantic relations between medical concepts. We developed two methods to improve the accuracy and yield of relat...
We present a framework for concept-based cross-language information retrieval in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data. Documents and queries are annotated with multiple layers of linguistic information....
An important aspect of word sense disambiguation is the evaluation of different methods and parameters. Unfortunately, there is a lack of test sets for evaluation, specifically for languages other than English and even more so for specific domains like medicine. Given that our work focuses on English as well as German text in the medical domain, we...
We present a framework for concept-based, cross-lingual information retrieval (CLIR) in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data, whereby documents and queries are annotated with multiple layers of linguist...
The paper describes an XML annotation format and tool developed within the MUCHMORE project. The annotation scheme was designed specifically for the purposes of Cross-Lingual Information Retrieval in the medical domain so as to allow both efficient and flexible access to layers of information. We use a parallel English-German corpus of medical abst...
We present a framework for concept-based, cross-lingual information retrieval (CLIR) in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data, whereby documents and queries are annotated with multiple layers of linguist...
In many scientific, technological or political fields terminology and the production of up-to-date reference works is lagging behind, which causes problems to translators and results in inconsistent translations. Experience gained in various projects involving parallel corpora shows that automatic extraction of terms and terminological collocations...
We propose a framework for multi-track annotation of text corpora for terminological purposes. For most corpus-based research-tasks, several levels of linguistic and non-linguistic information must be...
In many scientific, technological or political fields terminology and the production of up-to-date reference works is lagging behind, which causes problems to translators and results in inconsistent translations. Parallel corpora of texts already translated can be used as a resource for automatic extraction of terms and terminological collocations....
Various efforts have been made for the development of tools and methods dedicated to the automatic processing of multilingual terminology databases. For that purpose, multilingual parallel corpora have been used as a basic resource. However, most of the neologisms in technical and scientific domains are realised by multiword terms that are rarely i...
This paper describes an original hybrid system that extracts multiword unit candidates from part-of-speech tagged corpora. While classical hybrid systems manually define local part-of-speech patterns that lead to the identification of well-known multiword units (mainly compound nouns), we automatically identify relevant syntactical patterns from th...
Named Entity Recognition. The paper deals with Named Entities in German and Slovene texts. We first describe Named Entities from a linguistic viewpoint, where a typology of Named Entities and a theoretical framework is given. The second part of the paper describes different methods of Named Entity Recognition (NER). Since names, like all nouns, are...
The paper presents the concept design of the VoiceTRAN Communicator that integrates speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the project is to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slov...
The paper describes an innovative approach to expanding the domain coverage of wordnet by exploiting multiple resources. In the experiment described here we are using a large monolingual Slovene corpus of texts from the domain of informatics to harvest terminology from, and a parallel English-Slovene corpus and an online dictionary as bilingual res...
In statistical term extraction systems the identification and selection of nested term candidates often presents a challenge. The paper presents an implementation and evaluation of C-value, a heuristic that ranks and/or discards nested terms according to their stability in the corpus. The method was tested for English and Slovene, for both the over...
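The C-value heuristic mentioned in the last abstract can be illustrated with a minimal sketch. It assumes the commonly cited formulation (log2 of the term length times the term frequency, reduced by the mean frequency of longer candidates that contain the term); whether the paper's implementation matches this exactly is not stated in the truncated abstract. The function name `c_value`, the helper `contains` and the toy frequencies are hypothetical, chosen for illustration only.

```python
import math


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous sub-sequence of a strictly longer term."""
    n = len(shorter)
    if len(longer) <= n:
        return False
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freqs):
    """Score multi-word term candidates.

    freqs maps each candidate (a tuple of words) to its corpus frequency.
    Assumed formulation: log2(length) * frequency for non-nested terms; for
    nested terms the frequency is first reduced by the mean frequency of the
    longer candidates containing them.
    """
    scores = {}
    for term, freq in freqs.items():
        container_freqs = [f for other, f in freqs.items() if contains(other, term)]
        if container_freqs:
            adjusted = freq - sum(container_freqs) / len(container_freqs)
        else:
            adjusted = freq
        scores[term] = math.log2(len(term)) * adjusted
    return scores


# Toy, made-up frequencies (not taken from any corpus in the paper).
toy_freqs = {
    ("soft", "contact", "lens"): 5,
    ("contact", "lens"): 7,
    ("floating", "point"): 4,
}
toy_scores = c_value(toy_freqs)
for term, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

In this toy input "contact lens" occurs only twice outside "soft contact lens", so its score drops well below that of the longer term. Note that under this formulation single-word candidates always score zero, because log2(1) = 0.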
Question (1)
In Slovenia, the Personal Data Protection Officer recently issued a verdict for the freely available general language corpus Nova beseda, which prohibits the appearance of personal names in the corpus (or rather, as the explanation goes, the corpus search facility that will display a concordance of any given name+surname). Their argument is that a name is an item of personal data which can be linked to a concrete individual, hence all text corpora should be anonymized for names of (living? they do not say...) persons. Does anyone have any relevant experience or reference about such legal issues elsewhere? Experience from other EU countries would be particularly welcome...
Projects (4)
TermFrame is a 3-year research project in which we explore karst terminology using the frame-based approach. The final result is the TermFrame knowledge base containing terms, definitions, concept frames, images, videos and maps of karst phenomena.
The data in the knowledge base was obtained from large representative text collections containing books, articles, PhD theses and lexicons about karst. The relevant elements of knowledge were identified either manually through multiple levels of annotation, or automatically using advanced text mining and natural language processing methods.
Basic info
sloWNet is a lexical database of Slovene in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by semantic and lexical relations, such as HYPONYMY and ANTONYMY. sloWNet is based on Princeton WordNet and has been created automatically from different types of existing resources, such as bilingual dictionaries, parallel corpora and Wikipedia.
The current version of sloWNet is 3.1 (last change May 7, 2015), which contains 43,460 synsets and 71,803 literals, 33,546 of which have been manually validated.
More info: http://lojze.lugos.si/darja/research/slownet/
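The synset structure described for sloWNet can be pictured with a small sketch. This is only an assumed in-memory representation, not the actual sloWNet file format or API: the `Synset` class, the `hypernym_chain` helper, the identifiers and the two toy entries are all hypothetical. The hypernym link used here is the inverse direction of the HYPONYMY relation mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Synset:
    """One concept: an id, a part of speech, its synonymous literals and outgoing relations."""
    synset_id: str
    pos: str
    literals: list[str]
    # relation name -> ids of target synsets, e.g. {"hypernym": ["toy-animal"]}
    relations: dict[str, list[str]] = field(default_factory=dict)


def hypernym_chain(synsets: dict[str, Synset], synset_id: str) -> list[str]:
    """Follow the first hypernym link upwards and return the ids visited, nearest first."""
    chain = []
    seen = {synset_id}
    current = synsets[synset_id]
    while current.relations.get("hypernym"):
        parent_id = current.relations["hypernym"][0]
        if parent_id in seen or parent_id not in synsets:
            break  # stop on cycles or dangling links
        seen.add(parent_id)
        chain.append(parent_id)
        current = synsets[parent_id]
    return chain


# Toy entries with made-up ids; "pes" (dog) and "žival" (animal) are ordinary Slovene words,
# not quoted from the sloWNet data.
toy_synsets = {
    "toy-animal": Synset("toy-animal", "n", ["žival"]),
    "toy-dog": Synset("toy-dog", "n", ["pes"], {"hypernym": ["toy-animal"]}),
}
print(hypernym_chain(toy_synsets, "toy-dog"))
```

Keeping relations as a name-to-targets mapping lets further relation types, such as antonymy, be added without changing the class.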