
Dan Ioan Tufis
- Professor
- Managing Director at Romanian Academy
About
- Publications: 200
- Reads: 39,386
- Citations: 2,599
Additional affiliations
August 2002 - present
Research Institute for Artificial Intelligence Bucharest, Romanian Academy
Position
- Academician, Professor
Publications (200)
Transformer models produce advanced text representations that have been used to break through the hard challenge of natural language understanding. Using the Transformer’s attention mechanism, which acts as a language learning memory, trained on tens of billions of words, a word sense disambiguation (WSD) algorithm can now construct a more faithful...
This paper presents the design and evolution of the RELATE platform. It provides a high-performance environment for natural language processing activities, built specifically for the Romanian language. Initially developed for text processing, it has recently been updated to integrate audio processing tools. Technical details are provided with regard...
This paper introduces the USPDATRO dataset. This is a speech dataset, in the Romanian language, constructed from open data, focusing on under-represented voice types (children, young and old people, and female voices). The paper covers the methodology behind the dataset construction, specific details regarding the dataset, and evaluation of existin...
The transparency of the judicial process and the consistency of judicial decisions can be improved through their publication. Access to jurisprudence is of paramount importance both for law professionals (judges, lawyers, law students) and for the larger public. However, public access must ensure the preservation of privacy for people involved, in...
With the rise of bidirectional encoder representations from Transformer models in natural language processing, the speech community has adopted some of their development methodologies. Therefore, the Wav2Vec models were introduced to reduce the data required to obtain state-of-the-art results. This work leverages this knowledge and improves the per...
Since the previous META-NET report, there have been significant improvements (e.g., creation of a large Romanian national corpus, steady progress in written language technologies (LT), construction of a national LT portal for the Romanian language, etc.), but things are far from what they should be. Support for LT and AI through national programmes...
Natural language processing (NLP) has become a vital requirement in a wide range of applications, including machine translation, information retrieval, and text classification. The development and evaluation of NLP models for various languages have received significant attention in recent years, but there has been relatively little work done on com...
Large-scale pre-trained language representation and its promising performance in various downstream applications have become an area of interest in the field of natural language processing (NLP). There has been huge interest in further increasing the model’s size in order to outperform the best previously obtained performances. However, at some poi...
The article reports on research and development pursued by the Research Institute for Artificial Intelligence "Mihai Draganescu" of the Romanian Academy in order to narrow the gaps identified by the in-depth analysis of the European languages carried out in the META-NET white papers, published by Springer in 2012. Except for English, all the European languages ne...
As transfer learning from large-scale pre-trained language models has become prevalent in Natural Language Processing, running these models in computationally constrained environments remains a challenging problem that has yet to be addressed. Several solutions including knowledge distillation, network quantization or network pruning have been proposed; however,...
One of the fundamental functionalities for accepting a socially assistive robot is its communication capabilities with other agents in the environment. In the context of the ROBIN project, situational dialogue through voice interaction with a robot was investigated. This paper presents different speech recognition experiments with deep neural netwo...
Ensuring proper punctuation and letter casing is a key pre-processing step towards applying complex natural language processing algorithms. This is especially significant for textual sources where punctuation and casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text messages and micro-blogging...
Automatically learned vector representations of words, also known as "word embeddings", are becoming a basic building block for more and more natural language processing algorithms. There are different ways and tools for constructing word embeddings. Most of the approaches rely on raw texts, the construction items being the word occurrences and/or...
EuroVoc is a multilingual thesaurus that was built for organizing the legislative documentary of the European Union institutions. It contains thousands of categories at different levels of specificity and its descriptors are targeted by legal texts in almost thirty languages. In this work we propose a unified framework for EuroVoc classification on...
Automatic speech to speech translation is known to be highly beneficial in enabling people to directly communicate with each other when they do not share a common language. This work presents a modular system for Romanian to English and English to Romanian speech translation created by integrating four families of components in a cascaded manner: (...
The paper describes the micro-world-based dialog manager which was developed in the ROBIN project. The manager was designed to be loaded into the Pepper robot, to be used in real-world scenarios, and to interface with real-time automatic speech recognition and synthesis for the Romanian language. A strict requirement for the development of the dialog manager was...
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade h...
Extracting parallel units (e.g. sentences or phrases) from comparable corpora in order to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. There are experiments that convincingly show how parallel sentences extracted from comparable corpora are able to improve statistical machine tran...
The availability of parallel corpora is limited, especially for under-resourced languages and narrow domains. On the other hand, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches to identify these documents from the Web are needed for the purpose of automaticall...
This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents a wealth of methods and open tools for building comparable corpora from the Web, evaluating comparability and extracting parallel data that c...
We investigate the usability of the CoRoLa corpus for generating high-quality vector representations of words for the Romanian language. Different model parameters are tested and model quality is compared in three test cases: two word analogy data sets and a word similarity correlation with human judgment. Furthermore, we prove that CoRoLa provides s...
This chapter surveys methods for iterative enhancement, the task of improving the annotation of corpora, potentially over several iterations. Within iterative enhancement, the way to speed up the annotation process is by reducing the amount of time needed for annotation correction. We thus discuss annotation error detection, broadly characterizing...
Over the last twenty years or so, the approaches to part-of-speech tagging based on machine learning techniques have been developed or ported to provide high-accuracy morpho-lexical annotation for an increasing number of languages. Given the large number of morpho-lexical descriptors for a morphologically complex language, one has to consider ways...
Lexical resources have taken different forms throughout time. Nowadays, they occur in forms serving several types of users: the traditional human user, the modern human users and the computer. This last type of user explains the appearance of lexical knowledge representation forms, with semantic networks as one of the most frequently used forms. We...
The project addresses one of the sensitive problems of contemporary society: information security and the importance of data protection, especially in strategic domains such as defence and national security, health, education, discoveries in science and technology, the banking and financial system, etc., ...
This paper presents work on collecting comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. The objective of this work was to gather texts from the same domains and genres and with a similar level of c...
The article presents experiments on mining Wikipedia for extracting SMT useful sentence pairs in three language pairs. Each extracted sentence pair is associated with a cross-lingual lexical similarity score based on which, several evaluations have been conducted to estimate the similarity thresholds which allow the extraction of the most useful da...
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress a...
This second edition of The Oxford Handbook of Computational Linguistics has been substantially revised, updated, and expanded. Alongside updated accounts of the topics covered in the first edition, it includes 17 new chapters on subjects such as deep learning, word representation, semantic role labelling, translation technology, opinion mining and...
There are more than 60 wordnets worldwide; the Romanian wordnet is among those that are maintained and further developed. Begun within the BalkaNet project and further enriched in various (application oriented) projects, it was used in word sense disambiguation, machine translation and question answering with promising results. We present here the...
Speech to speech (S2S) translation is a complex process designed to enable the communication between individuals that speak different languages and it represents a valuable contribution to (1) science, (2) cross-cultural interaction and (3) global business. Through S2S, a text spoken in one language is automatically recognized, translated and synth...
The project on the Romanian wordnet has been under continuous development for more than 10 years now. It has been in constant use in many projects and applications which determined, to a large extent, the content and coverage of various lexical domains. The article presents the most recent developments of the Romanian wordnet and offers quantitativ...
Recent advances in Multilingual Machine Translation and in Speech Processing, coupled with the unprecedented computing power increase of mobile devices, served by faster communication means, made possible the implementation of operational Speech to Speech (S2S) translation systems on smart phones and tablets. Through S2S, a text spoken in one langu...
This article reports on mass experiments supporting the idea that data extracted from strongly comparable corpora may successfully be used to build statistical machine translation systems of reasonable translation quality for in-domain new texts. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-E...
Standard methods for part-of-speech tagging suffer from data sparseness when used on highly inflectional languages (which require large lexical tagset inventories). For this reason, a number of alternative methods have been proposed over the years. One of the most successful methods used for this task, called Tiered Tagging (Tufiş, 1999), exploits...
Brazil is one of the largest producers of eucalyptus that is used for manufacturing pulp and paper; this contributes directly to the issue of carbon emissions. Reforestation of eucalyptus appears as a viable alternative for mitigating these carbon emissions, leveraging their high productivity to that of other leading countries in the market, such a...
The paper presents the system developed by RACAI for the IWSLT 2012 competition, TED task, MT track, Romanian to English translation. We describe the starting baseline phrase-based SMT system, the experiments conducted to adapt the language and translation models and our post-translation cascading system designed to improve the translation without...
Cyberspace is populated with valuable information sources, expressed in about 1500 different languages and dialects. Yet, for the vast majority of Web surfers this wealth of information is practically inaccessible or meaningless. Recent advancements in cross-lingual information retrieval, multilingual summarization, cross-lingual question answ...
The phrase-based translation approach has overcome several drawbacks of the word-based translation methods and has proved to significantly improve the quality of translated output. However, it shows less improvement when translating between languages with very different syntax and morphology, especially when the translation direction is from a language...
With the ever-growing volume of information on the web, the traditional search engines, returning hundreds or thousands of documents per query, become more and more demanding on the user's patience in satisfying his/her information needs. Question Answering in Open Domains is a top research and development topic in current language technology. Unlik...
According to Osgood's “Semantic Differential” theory, the connotative meaning of most adjectives can be rated on a scale, the ends of which are antonymic adjectives. Such a pair of antonymic adjectives is called a factor. Osgood and his colleagues found that most of the variance in the text affecting judgment was explained by only three major facto...
This article describes an annotation of the synonymy sets in Princeton WordNet2.0, in line with the principles of Osgood’s “Semantic Differential” theory. According to this theory, connotative meaning of most adjectives can be rated on a scale, the ends of which are antonymic adjectives. Such a pair of antonymic adjectives is called a factor. The m...
This paper documents the participation of the Research Institute for Artificial Intelligence in the CLEF 2010 ResPubliQA lab. We answered questions in Romanian and English from Romanian documents of Acquis Communautaire and the European Parliament Proceedings. We extend the report from the previous ResPubliQA participation by introducing multi-fact...
This paper reports on the construction and testing of a new Question Answering (QA) system, implemented as a workflow which builds on several web services developed at the Research Institute for Artificial Intelligence (RACAI). The evaluation of the system has been independently done by the organizers of the Romanian-Romanian task of the ResPubliQA...
Lack of sufficient linguistic resources and parallel corpora for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The solution proposed in this paper is to exploit the fact that non-parallel bi- or multilingual text resources are much more widely available than parallel translation d...
We describe a new method for sentiment load annotation of the synsets of a wordnet, along the principles of Osgood's "Semantic Differential" theory and extending the Kamp and Marx calculus, by taking into account not only the WordNet structure but also the SUMO/MILO (Niles & Pease, 2001) and DOMAINS (Bentivogli et al., 2004) knowledge sources. We d...
Currently, research infrastructures are being designed and established in many disciplines, all partly to address the problem that they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools the CLARIN initiative has been funded since 2008 to overcome many of the integration an...
Science, like other domains of human culture and civilization, benefits in its development from two important categories of personalities, which I would generically call "creators" and "catalysts", respectively. Creators, be they discoverers or innovators, greater or lesser, are those who, by the power of thought, widen or deepen the knowledg...
Lately, there seems to be a growing acceptance of the idea that multilingual lexical ontologies might be the key towards aligning different views on the semantic atomic units to be used in characterizing the general meaning of various and multilingual documents. Comparing performances of word sense disambiguation systems is a difficult evaluation t...
The paper discusses experiments, results, applications and further developments in tagging a highly inflectional language, based on multiple register diversified language models. The texts are accurately disambiguated in terms of a large tagset (611 tags) in two linear-time processing steps (tiered processing). The underlying tagger simultaneously...
We describe the results of a short-term SEE-ERAnet project the aim of which was to investigate the feasibility of machine translation (MT) research and development for several South Slavic and Balkan languages, more precisely Romanian, Bulgarian, Slovene, Greek and Serbian. For these languages MT systems are scarce and for some of them even non-exis...
Multilingual technologies, which to a large extent are language independent, provide a powerful support for easier building of annotated linguistic resources for languages where such resources are scarce or missing. All these technologies require parallel corpora in order to achieve their ends. Parallel texts encode extremely valuable linguistic kn...
This paper presents a simple and effective method for extraction of translation equivalents from parallel corpora. Experiments were conducted on Orwell's "1984" parallel corpus with translations available in six CEE languages, all of them being aligned to the English original. Six bilingual lexicons X-English (En) were extracted, where X stan...
The paper describes a statistical approach to automatic extraction of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation for the two algorithms is presented in some detail in terms of precision, recall coverage and processing time. The co...
The term corpus as used here refers to a collection of spoken or written texts encoded into a specific machine readable format. Corpora are used in language engineering to gather both qualitative and quantitative real language evidence. Qualitative evidence consists of examples which can be used for the construction of computational lexicons, gramm...
Sentence and word alignment are prerequisite tasks for any system concerning statistical machine translation. Although they seem very different, both sentence and word alignments require approximately the same features to discriminate between positive and negative examples of alignments. We present a solution that can align the sentences and the wo...
The development of the Romanian WordNet began in 2001 within the framework of the European project BalkaNet which aimed at building core WordNets for 5 new
Lexical alignment is one of the most challenging tasks in processing and exploiting parallel texts. There are numerous applications that may benefit from an accurate multilingual lexical alignment of bi- and multi-language corpora. We describe in this paper a hypothesis-testing approach to the problem of automatic extraction of translation equivalen...
This article describes two different word sense disambiguation (WSD) systems, one applicable to parallel corpora and requiring aligned wordnets and the other one, knowledge poorer, albeit more relevant for real applications, relying on unsupervised learning methods and only monolingual data (text and wordnet). Comparing performances of word sense d...
We describe the results of a short-term SEE-ERAnet project the aim of which was to investigate the feasibility of machine translation (MT) research and development for several South Slavic and Balkan languages. The major tasks of the project were: compilation of a multilingual parallel corpus for the concerned languages, the XML mark-up of the corp...
Lexical ontologies are fundamental resources for any linguistic application with wide coverage. The reference lexical ontology is the ensemble made of Princeton WordNet, a huge semantic network, and SUMO&MILO ontology, the concepts of which are labelling each synonymic series of Princeton WordNet. This lexical ontology was developed for English lan...
This paper describes the participation of the Research Institute for Artificial Intelligence of the Romanian Academy (RACAI) in the Multiple Language Question Answering Main Task at the CLEF 2008 competition. We present our Question Answering system answering Romanian questions from Romanian Wikipedia documents focusing on the implementation detail...
In this volume we publish some papers presented at the Exploratory Workshop on Natural Language Computation (EWNLC 2008), entitled "From Natural Language to Soft Computing: New Paradigms in Artificial Intelligence", during May 15-17, 2008, Baile Felix, Oradea, Romania. This workshop was financed by the Romanian Ministry of Education, Research and...
We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first method applies language-independent statistical techniques followed by a linguistic filtering, while the second appro...
It is known that POS tagging is not very accurate for unknown words (words which the POS tagger has not seen in the training corpora). Thus, a first step to improve the tagging accuracy would be to extend the coverage of the tagger's learned lexicon. It turns out that, through the use of a simple procedure, one can extend this lexicon without using...
Nowadays, there are hundreds of Natural Language Processing applications and resources for different languages that are developed and/or used, almost exclusively with a few but notable exceptions, by their creators. Assuming that the right to use a particular application or resource is licensed by the rightful owner, the user is faced with...
Language ambiguity, as produced by humans, is often unnoticed and, as such, is most of the time involuntary. In its original context, a sentence might be very clear with respect to the producer's intentions, but if it contains some unnoticed ambiguities (obliterated by the context), when put in another context, might convey a very different meaning,...
With the worldwide expansion of the social web, subjectivity analysis has lately become one of the main research focuses in the area of intelligent information retrieval. Being able to find out what people feel about a specific topic, be it a marketed product, a public person or a political issue, represents a very interesting application for a large cl...
Natural language ambiguity is a well-known challenge for the NLP community in improving the performance of computer programs when dealing with human languages. Language ambiguity, as produced by humans, is often unnoticed and, as such, is most of the time involuntary. However, in many cases ambiguity is purposely used for various reasons. In an...
This paper presents a pattern-based question answering system for the Romanian-Romanian task of the Multiple Language Question Answering (QA@CLEF) track of the CLEF 2007 campaign. We show that working with a good Boolean searching engine and using question type driven answer extraction heuristics, one can achieve acceptable results (30% overall acc...
The paper reports on recent experiments in cross-lingual document processing (with a case study for Bulgarian-English-Romanian language pairs) and brings evidence on the benefits of using linguistic ontologies for achieving, with a high level of accuracy, difficult tasks in NLP such as word alignment, word sense disambiguation, document classificat...
Parallel corpora encode extremely valuable linguistic knowledge, the revealing of which is facilitated by the recent advances in multilingual corpus linguistics. The linguistic decisions made by the human translators in order to faithfully convey the meaning of the source text can be traced and used as evidence on linguistic facts which, in a monol...
This paper describes the development of a Question Answering (QA) system and its evaluation results in the Romanian-English cross-lingual track organized as part of the CLEF 2006 campaign. The development stages of the cross-lingual Question Answering system are described incrementally throughout the paper, at the same time pinpointing the pr...
This article introduces an unsupervised word sense disambiguation algorithm that is inspired by the lexical attraction models of Yuret (1998). It is based on the assumption that the meanings of the words that form a sentence can be best assigned by constructing an interpretation of the whole sentence. This interpretation is facilitated by a dep...