Jorge Vivaldi
Pompeu Fabra University | UPF · University Institute of Applied Linguistics (IULA)

PhD

About

69
Publications
10,794
Reads
576
Citations

Publications (69)
Article
Full-text available
This paper presents MHeTRep, a multilingual medical terminology, and the methodology followed for its compilation. The multilingual terminology is organised into one vocabulary per language. All the terms in the collection are semantically tagged with a tagset corresponding to the top categories of the SNOMED CT ontology. When possible, the individ...
Article
Full-text available
Even though many NLP resources and tools claim to be domain independent, their application to specific tasks is restricted to some specific domain; otherwise their performance degrades notably. As the accuracy of NLP resources drops heavily when applied in environments different from those in which they were built, tuning to the new environment is needed. T...
Chapter
In this paper, we provide an overview of the ninth annual edition of the CLEF eHealth evaluation lab. CLEF eHealth 2021 continues our evaluation resource building efforts around easing and supporting patients, their next of kin, health care professionals, and health scientists in understanding, accessing, and authoring electronic health inform...
Chapter
Full-text available
Motivated by the ever-increasing difficulties faced by laypeople in retrieving and digesting valid and relevant information to make health-centred decisions, the CLEF eHealth lab series has offered shared tasks to the community in the fields of Information Extraction (IE), management, and Information Retrieval (IR) since 2013. These tasks have attr...
Chapter
Named Entity Recognition in the clinical domain, and in languages other than English, faces several difficulties: the absence of complete dictionaries, the informality of texts, the polysemy of terms, the lack of agreement on entity boundaries, and the scarcity of corpora and other available resources. We present a Named Entity Recognition me...
Conference Paper
Named Entity Recognition in the clinical domain, and in languages other than English, faces several difficulties: the absence of complete dictionaries, the informality of texts, the polysemy of terms, the lack of agreement on entity boundaries, and the scarcity of corpora and other available resources. We present a Named Entity Recognition me...
Article
Full-text available
A semantic tagger that detects relevant entities in Arabic medical documents and tags them with their appropriate semantic class is presented. The system takes advantage of a multilingual framework covering four languages (Arabic, English, French, and Spanish), so that resources available for each language can be used to improve the resul...
Conference Paper
Full-text available
Identifying the certainty of events is an important text mining problem. In particular, biomedical texts report medical conditions or findings that might be factual, hedged, or negated. Identifying negation and its scope over a term of interest determines whether a finding is actually reported, and is a challenging task. Not much work has been per...
Conference Paper
A semantic tagger that detects relevant entities in medical documents and tags them with their appropriate semantic class is presented. In the experiments described in this paper, the tagset consists of the six most frequent classes in the SNOMED CT taxonomy (SN). The system uses six binary classifiers, and two combination mechanisms are presente...
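The abstract mentions six binary classifiers plus combination mechanisms. As a rough illustration of one possible mechanism of this kind (not the paper's actual implementation), the sketch below lets each per-class binary classifier score a span and keeps the most confident class; all names and the threshold are hypothetical.

```python
# Illustrative combination of per-class binary classifiers: the class whose
# classifier returns the highest confidence wins. Not the paper's method.

from typing import Callable, Dict

# Hypothetical per-class scorers: map a token span to a confidence in [0, 1].
Scorer = Callable[[str], float]

def combine_max_confidence(span: str, scorers: Dict[str, Scorer]) -> str:
    """Assign the semantic class whose binary classifier is most confident."""
    scores = {label: scorer(span) for label, scorer in scorers.items()}
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Fall back to 'unknown' when no classifier fires above the threshold.
    return best_label if best_score >= 0.5 else "unknown"

# Toy usage with dummy scorers standing in for trained models.
dummy_scorers = {
    "disorder":  lambda s: 0.9 if "itis" in s else 0.1,
    "procedure": lambda s: 0.8 if s.endswith("ectomy") else 0.1,
}
print(combine_max_confidence("appendicitis", dummy_scorers))  # disorder
```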
Article
Full-text available
This paper presents the work carried out towards enlarging WordNet MCR in the linguistics domain, and discusses problematic situations caused by the structure of WordNet and the inherent characteristics of the domain. The approach employed in this paper is descriptive, explaining how maintaining the original structure of WordNet might affect domain-specifi...
Data
Full-text available
This paper presents the work carried out towards enlarging WordNet MCR in the linguistics domain, and discusses problematic situations caused by the structure of WordNet and the inherent characteristics of the domain. The approach employed in this paper is descriptive, explaining how maintaining the original structure of WordNet might affect domain-specifi...
Chapter
Full-text available
In this chapter it is shown that certain grammatical features, besides the lexicon, have a strong potential to differentiate specialized texts from non-specialized texts. A tool incorporating these features has been developed and trained with machine learning techniques based on association rules, using two sub-corpora (specialized vs. non-spec...
Conference Paper
Domain terms are a useful means for tuning both resources and NLP processors to domain-specific tasks. This paper proposes an improved method for obtaining terms from potentially any domain, using the Wikipedia graph structure as a knowledge source.
Article
Full-text available
Automatic systems for detecting and measuring textual similarity are being developed in order to apply them to different tasks in the field of Natural Language Processing (NLP). Currently, these systems use surface linguistic features or statistical information; few researchers use deep linguistic information. In this work, we p...
Article
The aim of the tweet contextualization INEX (Initiative for the Evaluation of XML retrieval) task at CLEF 2013 (Conference and Labs of the Evaluation Forum) is to build a system that automatically provides information related to different tweets, that is, a summary that explains a specific tweet. In this article, our strategy and results are pres...
Article
This article explores a statistical, language-independent methodology for the construction of taxonomies of specialized domains from noisy corpora. In contrast to proposals that exploit linguistic information by searching for lexico-syntactic patterns that tend to express the hypernymy relation, our methodology relies entirely upon the distribution...
Article
Word co-occurrence graphs have been used in computational linguistics mainly for word sense disambiguation and induction, but until very recently, not for the extraction of hypernymy relations, where the methodology most often applied is the use of lexico-syntactic patterns. In this paper, we show that it is possible to use word co-occurrence stati...
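As a minimal illustration of the kind of word co-occurrence graph this abstract refers to (the window size, tokenisation, and edge weighting here are assumptions, not the paper's settings):

```python
# Toy word co-occurrence graph: nodes are words, edges count how often two
# words appear within the same sliding window of a tokenised sentence.

from collections import defaultdict

def cooccurrence_graph(sentences, window=3):
    """Build {word: {neighbour: count}} from tokenised sentences."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                a, b = tokens[i], tokens[j]
                if a != b:
                    graph[a][b] += 1
                    graph[b][a] += 1
    return graph

corpus = [["heart", "failure", "treatment"], ["heart", "disease", "treatment"]]
g = cooccurrence_graph(corpus)
print(dict(g["heart"]))  # {'failure': 1, 'treatment': 2, 'disease': 1}
```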
Book
Full-text available
We are delighted to present the proceedings of CHAT 2012. Altogether, 7 papers were selected for presentation (4 regular papers and 3 short papers). The workshop papers cover various topics in automated approaches to terminology extraction and the creation of terminology resources, compiling multilingual terminology, ensuring interoperabili...
Conference Paper
Full-text available
A scientific vocabulary is a set of terms that designate scientific concepts. This set of lexical units can be used in several applications, ranging from the development of terminological dictionaries and machine translation systems to the development of lexical databases and beyond. Even though automatic term recognition systems have existed since the 80s...
Article
Full-text available
This paper describes ongoing work on the construction of a new treebank for Spanish, the IULA Treebank. This new resource will contain about 60,000 richly annotated sentences as an extension of the already existing IULA Technical Corpus, which is only PoS-tagged. In this paper we focus on describing the work done to define the annotation...
Article
Full-text available
Due to the increase in the number and depth of analyses required over text, such as entity recognition, POS tagging, and syntactic analysis, in-line annotation has become impractical. In Natural Language Processing (NLP), some emphasis has been placed on finding an annotation method to solve this problem. One possibility is the standoff annotati...
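As a small illustration of the standoff idea mentioned above, the sketch below keeps the base text untouched and stores each annotation layer as character offsets into it; the layer names and labels are invented for the example.

```python
# Standoff annotation sketch: the base text is stored once, unmodified,
# and every analysis layer refers to it only by character offsets.

from dataclasses import dataclass

@dataclass
class Annotation:
    start: int   # character offset into the base text (inclusive)
    end: int     # character offset (exclusive)
    layer: str   # e.g. "pos", "entity"
    label: str

text = "Dr. Smith prescribed aspirin."
annotations = [
    Annotation(4, 9, "entity", "PERSON"),
    Annotation(21, 28, "entity", "DRUG"),
    Annotation(10, 20, "pos", "VERB"),
]

# Any layer can be projected back onto the text without modifying it.
for a in annotations:
    print(f"{a.layer}:{a.label} -> {text[a.start:a.end]!r}")
```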
Article
Applying WordNet to terminology is still an open issue. Given the increasing importance of WordNet for natural language processing tools (information retrieval and extraction tools), we would like to address its extension with Languages for Specific Purposes. This paper aims to discuss some issues in terminological enrichment, paying attention t...
Article
The tweet contextualization INEX task at CLEF 2012 consists of developing a system that, given a tweet, can provide some context about the subject of the tweet, in order to help the reader understand it. This context should take the form of a readable summary, not exceeding 500 words, composed of passages from a provided Wikipedia corpus....
Conference Paper
In this paper, our strategy and results for the INEX@QA 2011 question-answering task are presented. In this task, a set of 50 documents is provided by the search engine Indri, using some queries. The initial queries are titles associated with tweets. Reformulation of these queries is carried out using terminological and named-entity information....
Conference Paper
Full-text available
Compiling Languages for Specific Purposes (LSP) corpora is a task fraught with difficulties (mainly time and human effort), because it is not easy to discern between specialized and non-specialized text. The aim of this work is to study automatic specialized vs. non-specialized sentence differentiation. The experiments are car...
Conference Paper
Full-text available
Terms are usually defined as lexical units that designate concepts of a thematically restricted domain. Their detection is useful for a number of purposes, such as building (terminological) dictionaries, text indexing, automatic translation, improving automatic summarisation systems and, in general, any task that involves domain-specific c...
Article
In this paper we present a new approach for obtaining the terminology of a given domain using the category and page structures of Wikipedia in a domain- and language-independent way. The idea is to take advantage of the category graph of Wikipedia, starting with a set of categories that we associate with the domain. After obtaining the full set of categ...
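A rough sketch of the general idea, with a toy in-memory category graph standing in for Wikipedia (the seed categories, depth limit, and data below are illustrative, not the paper's actual procedure):

```python
# Breadth-first walk over a category graph: starting from seed categories
# associated with the domain, visit subcategories and collect the pages
# attached to each visited category as term candidates. A real system
# would query a Wikipedia dump or API instead of these toy dicts.

from collections import deque

subcats = {  # category -> subcategories (toy data)
    "Medicine": ["Cardiology", "Oncology"],
    "Cardiology": [],
    "Oncology": [],
}
pages = {    # category -> page titles, our term candidates (toy data)
    "Medicine": ["Physician"],
    "Cardiology": ["Myocardial infarction", "Arrhythmia"],
    "Oncology": ["Chemotherapy"],
}

def domain_terms(seeds, max_depth=2):
    """Collect page titles reachable from the seed categories."""
    seen, terms = set(seeds), set()
    queue = deque((s, 0) for s in seeds)
    while queue:
        cat, depth = queue.popleft()
        terms.update(pages.get(cat, []))
        if depth < max_depth:
            for sub in subcats.get(cat, []):
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return terms

print(sorted(domain_terms(["Medicine"])))
```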
Conference Paper
Full-text available
In this paper we present a new approach for obtaining the terminology of a given domain using the category and page structures of Wikipedia in a language-independent way. The idea is to take advantage of the category graph of Wikipedia, starting with a top category that we identify with the name of the domain. After obtaining the full set of categories...
Conference Paper
Full-text available
In this paper we present REG, a graph-based approach to a fundamental problem of Natural Language Processing: the automatic summarization of documents. The algorithm models a document as a graph in order to obtain weighted sentences. We applied this approach to the INEX@QA 2010 (question-answering) task. To do so, we extracted the terms from the queri...
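As an illustration of this family of graph-based summarizers (not REG itself), the sketch below connects sentences by word overlap and runs a few PageRank-style iterations to weight them; the damping factor and similarity measure are assumptions.

```python
# Graph-based sentence weighting: sentences are nodes, edges carry word
# overlap, and power iteration yields PageRank-style sentence weights.

def sentence_weights(sentences, damping=0.85, iters=20):
    n = len(sentences)
    toks = [set(s.lower().split()) for s in sentences]
    # Edge weight = word overlap between two sentences.
    w = [[len(toks[i] & toks[j]) if i != j else 0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) or 1 for row in w]           # avoid division by zero
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n +
                damping * sum(rank[j] * w[j][i] / out[j] for j in range(n))
                for i in range(n)]
    return rank

doc = ["the heart pumps blood",
       "blood carries oxygen",
       "the weather was nice"]
print(max(zip(sentence_weights(doc), doc)))  # highest-weighted sentence
```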
Article
Full-text available
Abstract: We present a term extraction system that uses Wikipedia as a source of semantic information. The system has been tested on a Spanish medical corpus. We compare the results obtained with a module of a hybrid term extractor and with an equivalent module that uses Wikipedia. The results show that this resource...
Article
Full-text available
This paper presents a new algorithm for the automatic summarization of specialized texts, specifically in the medical domain, which combines linguistic and statistical strategies. The novelty of the article lies in the proper combination of these strategies, in order to show that hybrid systems can obtain better results than...
Conference Paper
Full-text available
Computational terminology has evolved notably since the advent of computers. Regarding the extraction of terms in particular, a large number of resources have been developed, from very general tools to much more specific acquisition methodologies. Such acquisition methodologies range from using simple linguistic patterns or frequency counting...
Conference Paper
This paper presents a series of tools for the extraction of specialized corpora from the web and their subsequent analysis, mainly with statistical techniques. It is an integrated system of original as well as standard tools, with a modular design that facilitates its integration into different systems. The first part of the paper describes the...
Conference Paper
Full-text available
In this article we present a hybrid approach to the automatic summarization of Spanish medical texts. There are many systems for automatic summarization using statistics or linguistics, but only a few combine both techniques. Our idea is that, to produce a good summary, we need to use the linguistic aspects of texts, but we should also benef...
Article
Full-text available
Term extraction may be defined as a text mining activity whose main purpose is to obtain all the terms included in a text of a given domain. Since the eighties, and mainly due to rapid scientific advances as well as the evolution of communication systems, there has been a growing interest in obtaining the terms found in written documents. A...
Article
Full-text available
The main goal of this paper is to present a first approach to the automatic detection of conceptual relations between two terms in specialised written text. Previous experiments based on manual analysis led the authors to implement an automatic query strategy combining the term candidates proposed by an extractor with a list of...
Article
Full-text available
In the past twenty years, much effort has been devoted to the development of ontologies and term bases for different fields. All this work has been done separately or with little integration. The GENOMA-KB is a project whose main aim is to integrate, at least, both resources. In this paper, the most relevant aspects of the project are presented. Each...
Article
Full-text available
An ontology, usually understood as a particular representation of a given domain, will become an essential item in the information retrieval system we aim to build. Our research activities are developed within the communicative terminology framework; that is, we mainly deal with units actually found in specialized discourse. Bearing in mind this th...
Article
Full-text available
Summary of the PhD thesis presented at the Technical University of Catalonia in June 2001, under the supervision of Horacio Rodríguez Hontoria and Maria Teresa Cabré Castellví.
Article
Two different reasons suggest that combining the performance of several term extractors could lead to an improvement in overall system accuracy. On the one hand, there is no clear agreement on whether to follow statistical, linguistic or hybrid approaches for (semi-) automatic term extraction. On the other hand, combining different knowledge source...
Conference Paper
Term extraction is the task of automatically detecting, from textual corpora, lexical units that designate concepts in thematically restricted domains (e.g. medicine). Current systems for term extraction integrate linguistic and statistical cues to perform the detection of terms. The best results have been obtained when some kind of combination of...
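One simple combination scheme of the kind the two abstracts above point to, shown purely as an illustration (the extractor outputs and the voting threshold below are hypothetical):

```python
# Voting combination of term extractor outputs: keep a candidate when at
# least `min_votes` of the extractors propose it.

from collections import Counter

def vote_combine(candidate_lists, min_votes=2):
    """Keep candidates proposed by at least `min_votes` extractors."""
    votes = Counter()
    for candidates in candidate_lists:
        votes.update(set(candidates))        # one vote per extractor
    return {term for term, v in votes.items() if v >= min_votes}

linguistic  = ["myocardial infarction", "blood pressure", "the patient"]
statistical = ["myocardial infarction", "blood pressure", "p value"]
hybrid      = ["myocardial infarction", "heart rate"]

print(vote_combine([linguistic, statistical, hybrid]))
# {'myocardial infarction', 'blood pressure'}
```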
Article
Full-text available
Specialised texts contain both polylexical and monolexical terminological units. Monolexical terms are not treated in most current extraction systems, mainly due to their high degree of polysemy. While this is largely true, in a specialised domain such as medicine it requires further qualification. In this paper we discuss the requirements posed by term...
Article
Full-text available
It is well known that many languages make use of neo-classical compounds, and that some domains with a very long tradition, like medicine, make intensive use of such morphemes. This phenomenon has been widely studied for different languages, with the common result that a relatively small number of morphemes allows the detection of a large number of s...
Article
Full-text available
1 Introduction. For some decades now, the study of language using corpus linguistics techniques has been growing in importance. Nowadays, the use of corpora is considered a basic resource for practically any study related to linguistics. This is because corpora are a resource that...
Article
Full-text available
Researchers in terminology very often need to know which terms are included in a given LSP corpus. One possibility is to run a term extractor, but such a tool provides only term candidates, not validated terms. A term validation process is therefore mandatory, and it is not always easy or affordable. A different option...
Article
Full-text available
Some approaches to automatic terminology extraction from corpora rely on existing semantic resources to guide the detection of terms. Most of these systems exploit specialised resources, like UMLS in the medical domain, while a few try to take advantage of general-purpose semantic resources, like EuroWordNet (EWN). As the term extraction...
Article
Full-text available
This paper presents a language-independent methodology for automatically extracting bilingual lexicon entries from the web without the need for resources such as parallel or comparable corpora, POS tagging, or an initial bilingual lexicon. It is suitable for specialized domains where bilingual lexicon entries are scarce. The input for the process is a...
Article
Full-text available
The GENOMA-KB knowledge base includes four independent modules: a textual database, a factual database, a terminological database, and an ontology. In this paper we briefly introduce the main features of each module and highlight the process of enlarging both the term base and the ontology.
Article
Full-text available
In this paper, we present an approach to the automatic extraction of conceptual structures from unorganized collections of documents using large-scale lexical regularities in text. The technique maps a term to a constellation of other terms that captures the essential meaning of the term in question. The methodology is language-independent; it invo...
Article
Full-text available
The central project carried out at the University Institute of Applied Linguistics (IULA) of Pompeu Fabra University is the Specialized Languages corpus. Within the framework of this project, which involves five specialized domains (law, economics, computer science, the environment, and medicine) and five languages (Catalan, Spanish, French, English, and...
Article
Full-text available
In this paper, we analyse the main ontologies with the aim of drawing a general picture of one of the most widely used tools for structuring knowledge. First, we present a broad description of the five ontologies most widespread among the scientific community devoted to information management. Next, we briefly review...
Article
Full-text available
This paper presents the working criteria that have been followed during the 10 years in which the IULA corpus has been built. It describes the state of the corpus data, the lexical resources used for processing the data (dictionaries and tagsets), and the tools built or adapted. Special attention is devoted to the document...
Article
Full-text available
The main research project developed at the University Institute of Applied Linguistics (IULA) of Pompeu Fabra University is the Specialized Languages project, which brings together all the researchers belonging to this centre. It is within this research framework that the construction of a multilingual corpus...
Article
Full-text available
From the appearance of TERMINO in 1990 up to the present day, a series of projects have been carried out to design different types of automatic terminology extractors, but despite the large number of studies under way along these lines, the results are not entirely satisfactory. This article presents an analysis of the main...
Article
Full-text available
Editing texts by computer has brought both advantages and disadvantages. While it has made careful text editing very accessible, it has also given rise to a whole series of problems, one of which is the difficulty of sharing resources. On the other hand, linguistics needs to use large corpora in order to know with greater precision...
Article
Full-text available
A first version of this study was presented at one of the Seminars on Artificial Intelligence organized by the Departament de Llenguatges i Sistemes Informàtics of the Universitat Politècnica de Catalunya, March 1991.
