Koldo Gojenola
University of the Basque Country | UPV/EHU · Computer Languages and Systems
PhD, Computer Science
About
105 Publications
10,219 Reads
1,040 Citations
Additional affiliations
September 2001 - present
Escuela Universitaria de Ingeniería Técnica Industrial de Bilbao
Position: Permanent teacher and researcher
January 2001 - present
Publications (105)
Background:
Electronic Clinical Narratives (ECNs) store valuable information about an individual's health. However, little open-source data is available. Besides, ECNs can be structurally heterogeneous, ranging from documents with explicit section headings or titles to unstructured notes. This lack of structure complicates building automatic systems an...
Background
Unlike diseases, automatic recognition of disabilities has not received the same attention in the area of medical NLP. Progress in this direction is hampered by obstacles like the lack of annotated corpora. Neural architectures learn to translate sequences from spontaneous representations into their corresponding standard representations...
The utilization of clinical reports for various secondary purposes, including health research and treatment monitoring, is crucial for enhancing patient care. Natural Language Processing (NLP) tools have emerged as valuable assets for extracting and processing relevant information from these reports. However, the availability of specialized languag...
Providing high quality explanations for AI predictions based on machine learning is a challenging and complex task. To work well it requires, among other factors: selecting a proper level of generality/specificity of the explanation; considering assumptions about the familiarity of the explanation beneficiary with the AI task under consideration; r...
This chapter landscapes the field of Language Technology (LT) and language-centric AI by assembling a comprehensive state-of-the-art of basic and applied research in the area. It sketches all recent advances in AI, including the most recent deep learning neural technologies. The chapter brings to light not only where language-centric AI as a whole...
Introduction
Atrial fibrillation (AF) is the most prevalent arrhythmia in the world and it is associated with a high rate of cardiovascular morbidity and mortality. A higher burden of atrial fibrillation is related to more adverse cardiovascular events and the rhythm control strategy (maintenance or recovery of sinus rhythm) is particularly indicat...
Background
Nowadays, with the digitalization of healthcare systems, huge amounts of clinical narratives are available. However, despite the wealth of information contained in them, interoperability and extraction of relevant information from documents remains a challenge.
Objective
This work presents an approach towards automatically standardizing...
This work describes the formalization of a word structure grammar that represents the complex morphological and morphosyntactic information embedded within the word forms of an agglutinative language (Basque), giving a comprehensive linguistic description of the main morphological phenomena, such as affixation, derivation, and composition, and also...
Background:
Automatic extraction of morbid disease or conditions contained in Death Certificates is a critical process, useful for billing, epidemiological studies and comparison across countries. The fact that these clinical documents are written in regular natural language makes the automatic coding process difficult because, often, spontaneous...
The creation of a semantic oriented lexicon of positive and negative words is often the first step to analyze the sentiment of a corpus. Various methods can be employed to create a lexicon: supervised and unsupervised. Until now, methods employed to create Basque polarity lexicons were unsupervised. The aim of this paper is to present the construct...
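The supervised construction mentioned above can be pictured with a minimal sketch (my own illustration, not the paper's method): given a corpus of documents labelled as positive or negative, each word is scored by how strongly it is associated with one class. The variable names and the toy Basque-like tokens below are hypothetical.

```python
from collections import Counter

def build_polarity_lexicon(labelled_docs, min_count=5):
    """Score each word by its association with positive vs. negative documents.

    labelled_docs: iterable of (tokens, label) pairs, label in {"pos", "neg"}.
    Returns a dict word -> polarity score in [-1, 1] (positive means positive polarity).
    """
    pos_counts, neg_counts = Counter(), Counter()
    for tokens, label in labelled_docs:
        target = pos_counts if label == "pos" else neg_counts
        target.update(set(tokens))          # document frequency, not raw term frequency
    lexicon = {}
    for word in set(pos_counts) | set(neg_counts):
        p, n = pos_counts[word], neg_counts[word]
        if p + n >= min_count:
            lexicon[word] = (p - n) / (p + n)
    return lexicon

# Toy usage with invented tokens, purely illustrative:
docs = [(["ona", "polita"], "pos"), (["txarra", "ona"], "neg")] * 5
print(sorted(build_polarity_lexicon(docs).items()))
```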
In this work, we have analysed the effects of negation on the semantic orientation in Basque. The analysis shows that negation markers can strengthen, weaken or have no effect on sentiment orientation of a word or a group of words. Using the Constraint Grammar formalism, we have designed and evaluated a set of linguistic rules to formalize these th...
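A rough Python analogue of such rules, not the actual Constraint Grammar implementation, might flip the polarity of words inside the scope of a negation marker. The marker list, the scope window, and the token polarities below are assumptions for illustration only; the Basque negation particle "ez" is real.

```python
NEGATION_MARKERS = {"ez"}      # hypothetical marker list; the real rules are Constraint Grammar constraints

def apply_negation(tokens, polarities, scope=3):
    """Flip the polarity of up to `scope` words following a negation marker.

    This models only the simplest case; the paper's rules also cover
    strengthening, weakening and no-effect contexts.
    """
    adjusted = list(polarities)
    for i, tok in enumerate(tokens):
        if tok.lower() in NEGATION_MARKERS:
            for j in range(i + 1, min(i + 1 + scope, len(tokens))):
                adjusted[j] = -adjusted[j]
    return adjusted

tokens = ["filma", "ez", "da", "ona"]                  # "the film is not good"
print(apply_negation(tokens, [0.0, 0.0, 0.0, 0.8]))    # -> [0.0, 0.0, -0.0, -0.8]
```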
The goal of the Universal Dependencies project, situated within Natural Language Processing, is to adapt the dependency-based treebanks created for many languages to the same standard annotation scheme. This article presents the Basque treebank that has been automatically adapted to that model. How that adaptation work was carri...
Background and objectives:
Electronic health records (EHRs) convey vast and valuable knowledge about dynamically changing clinical practices. Indeed, clinical documentation entails the inspection of a massive number of records across hospitals and hospital sections. The goal of this study is to provide an efficient framework that will help clinician...
Background:
Electronic Health Records (EHRs) are written using spontaneous natural language. Often, terms do not match standard terminology like the one available through the International Classification of Diseases (ICD).
Objective:
Information retrieval and exchange can be improved using standard terminology. Our aim is to render diagnostic te...
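As a minimal illustration of this kind of term normalization (not the system described above), spontaneous diagnostic terms can be matched against standard descriptions by string similarity. The mini-catalogue and the cutoff below are assumptions; the sketch only uses the Python standard library.

```python
import difflib

# Hypothetical mini-catalogue: code -> standard description (real systems use the full ICD).
ICD_CATALOGUE = {
    "I48": "atrial fibrillation and flutter",
    "E11": "type 2 diabetes mellitus",
}

def normalize_term(term, catalogue=ICD_CATALOGUE, cutoff=0.5):
    """Return the code whose standard description is closest to the free-text term."""
    descriptions = {desc: code for code, desc in catalogue.items()}
    match = difflib.get_close_matches(term.lower(), descriptions, n=1, cutoff=cutoff)
    return descriptions[match[0]] if match else None

print(normalize_term("atrial fibrilation"))   # misspelled input -> "I48"
```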
Systems for opinion and sentiment analysis rely on different resources: a lexicon, annotated corpora and constraints (morphological, syntactic or discursive), depending on the nature of the language or text type. In this respect, Basque is a language with fewer linguistic resources and tools than other languages, like English or Spanish. The aim...
Objective:
The goal of this study is to investigate entity recognition within Electronic Health Records (EHRs) focusing on Spanish and Swedish. Of particular importance is a robust representation of the entities. In our case, we utilized unsupervised methods to generate such representations.
Methods:
The significance of this work stands on its e...
The goal of the Deteami project is to develop tools that make clinicians aware of adverse drug reactions stated in the electronic health records of the clinical digital history. The records produced in hospitals are a valuable though nearly unexplored source of information, among other reasons because they are hard to access due to privacy and confidentia...
Objective: To tackle the extraction of adverse drug reaction events in electronic health records. The challenge stands in inferring a robust prediction model from highly unbalanced data. According to our manually annotated corpus, only 6% of the drug-disease entity pairs trigger a positive adverse drug reaction event and this low ratio makes machin...
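One generic way to address the 6% positive ratio mentioned above (one option among several, not necessarily the model used in the paper) is to reweight the classes during training. The sketch below uses synthetic data as a stand-in for the drug-disease pair features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for drug-disease pair features: roughly 6% positive ADR events.
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 0.06).astype(int)
X[y == 1] += 1.0                      # make positives weakly separable

# class_weight="balanced" rescales the loss so rare positives are not ignored.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.predict(X).mean())          # fraction of pairs predicted positive
```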
We outline an Adverse Drug Reaction (ADR) extraction system for Electronic Health Records (EHRs) written in Spanish. The goal of the system is to assist experts on pharmacy in making the decision of whether a patient suffers from one or more ADRs. The core of the system is a predictive model inferred from a manually tagged corpus that counts on bo...
Nowadays, opinion texts play an important role: people read opinions before engaging in an activity, buying a product or making a decision. However, the amount of opinion text is increasing rapidly and reading all opinions about a subject is unfeasible. ‘Sentiment analysis’ is a part of Natural Language Processing whose aim is to process opinion te...
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Ma...
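For readers unfamiliar with the annotation scheme, a CoNLL-U token line has ten tab-separated columns. The snippet below uses an invented English toy line (not taken from any treebank) and simply splits it into named fields.

```python
# The ten CoNLL-U columns used by Universal Dependencies treebanks.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

line = "2\tdogs\tdog\tNOUN\t_\tNumber=Plur\t3\tnsubj\t_\t_"
token = dict(zip(FIELDS, line.split("\t")))
print(token["form"], token["deprel"], "-> head", token["head"])
```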
This paper presents a study in sentiment analysis which exploits information of the relational discourse structure in a Basque corpus consisting of literature reviews. The QWN-PPV method was employed to label all the texts at element level and the Rhetorical Structure Theory (RST) was used to extract discourse structure information. The preliminary...
Abstract: This article presents a set of experiments carried out with the aim of improving the results of statistical syntactic parsers for Basque. The work examines different techniques: i) tree transformations, ii) parser stacking, and iii) combination of the outputs of different parser models. All the results have been obtained on the treeba...
The goal of this FP7 European project is to contribute to the advancement of quality machine translation by pursuing an approach that further relies on semantics, deep parsing and linked open data. © 2015 Sociedad Española para el Procesamiento del Lenguaje Natural.
This project addresses extraction of medical concepts relationship in scientific documents, medical records and general information on the Internet, in several languages by using advanced Natural Language Processing and Information Retrieval techniques and tools. The project aims to show, through two use cases, the benefits of the application of la...
The advances achieved in natural language processing make it possible to automatically mine information from electronically created documents. Many natural language processing methods that extract information from texts make use of annotated corpora, but these are scarce in the clinical domain due to legal and ethical issues. In this paper we prese...
The goal of this paper is to classify Medical Records (MRs) by their diagnostic terms (DTs) according to the International Classification of Diseases Clinical Modification (ICD-9-CM). The challenge we face is twofold: (i) to treat the natural and non-standard language in which doctors express their diagnostics and (ii) to perform a large-scale clas...
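A common baseline for this kind of large-scale coding of noisy diagnostic terms, offered only as an illustration and not as the paper's method, is a character n-gram representation with a linear classifier, which is tolerant of the spelling variation of spontaneous language. The tiny training set below is invented; the two ICD-9-CM codes are real.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented training set: diagnostic term -> ICD-9-CM code.
terms = ["fibrilacion auricular", "fibrilación auricular paroxística",
         "diabetes mellitus tipo 2", "diabetes tipo II"]
codes = ["427.31", "427.31", "250.00", "250.00"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # robust to typos and accents
    LinearSVC(),
)
model.fit(terms, codes)
print(model.predict(["fibrilacion auricular paroxistica"]))   # expected: ['427.31']
```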
Abstract: Hospitals attached to the Spanish Ministry of Health are currently using the International Classification of Diseases 9 Clinical Modification (ICD9-CM) to classify health discharge records. Nowadays, this work is manually done by experts. This paper tackles the automatic classification of real Discharge Records in Spanish following the IC...
This paper presents experiments with WordNet semantic classes to improve dependency parsing. We study the effect of semantic classes in three dependency parsers, using two types of constituency-to-dependency conversions of the English Penn Treebank. Overall, we can say that the improvements are small and not significant using automatic POS tags,...
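The "semantic classes" referred to here can be approximated with WordNet supersenses (lexicographer file names); this is only one possible realisation, sketched with NLTK and assuming the WordNet data has been downloaded.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def semantic_class(word, pos=wn.NOUN):
    """Return the WordNet supersense of the word's first sense, e.g. 'noun.animal'."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0].lexname() if synsets else None

for w in ["dog", "idea"]:
    print(w, "->", semantic_class(w))
# dog -> noun.animal, idea -> noun.cognition
```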
This paper presents the results of the IxaMed team at the SemEval-2014 Shared Task 7 on Analyzing Clinical Texts. We have developed three different systems based on: a) exact match, b) a general-purpose morphosyntactic analyzer enriched with the SNOMED CT terminology content, and c) a perceptron sequential tagger based on a Global Linear Mo...
Abstract: The network of hospitals that make up the Spanish health system uses the International Classification of Diseases, Clinical Modification (ICD9-CM) to code hospital discharge records. Nowadays, this work is done by hand by experts. This article addresses the problem of automatically classifying real discharge rec...
The aim of this work is to infer a model able to extract cause-effect relations between drugs and diseases. A two-level system is proposed. The first level carries out a shallow analysis of Electronic Health Records (EHRs) in order to identify medical concepts such as drug brand-names, substances, diseases, etc. Next, all the combination pairs f...
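The second level described above operates on candidate pairs; a schematic version of the pair-generation step, with invented entity tuples rather than the project's actual representation, could look like this.

```python
from itertools import product

# Entities as (type, text, token position) found by a hypothetical first-level analysis.
entities = [("drug", "amoxicillin", 4), ("drug", "ibuprofen", 12),
            ("disease", "rash", 9), ("disease", "gastritis", 20)]

drugs = [e for e in entities if e[0] == "drug"]
diseases = [e for e in entities if e[0] == "disease"]

# Every drug-disease combination is a candidate event; simple features such as
# token distance can then feed the second-level classifier.
candidates = [
    {"drug": d[1], "disease": s[1], "distance": abs(d[2] - s[2])}
    for d, s in product(drugs, diseases)
]
for c in candidates:
    print(c)
```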
This work tackles Electronic Health Record (EHR) classification according to their Diagnostic Terms (DTs) following the standard International Classification of Diseases-Clinical Modification (ICD-9-CM). To do so, we explore text mining relying on a wide variety of data from both standard catalogues, such as the ICD-9-CM and SNOMED-CT; and, what it...
This paper presents an annotation tool that detects entities in the biomedical domain. By enriching the lexica of the Freeling analyzer with bio-medical terms extracted from dictionaries and ontologies as SNOMED CT, the system is able to automatically detect medical terms in texts. An evaluation has been performed against a manually tagged corpus f...
This paper reports on the first shared task on statistical parsing of morphologically rich languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs...
This paper presents a dependency parsing system, presented as BASQUETEAM at the SPMRL’2013 Shared Task, based on the analysis of each morphological feature of the languages. Once the specific relevance of each morphological feature is calculated, this system uses the most significant of them to create a series of analyzers using two freely availabl...
This paper presents a set of experiments performed on parsing Basque, a morphologically rich and agglutinative language, studying the effect of using the morphological analyzer for Basque together with the morphological disambiguation module, in contrast to using the gold standard tags taken from the treebank. The objective is to obtain a first est...
This paper shows the applicability of a general Information Extraction technology developed for the extraction of conceptual and factual knowledge from texts, to the specific domain of biomedicine. The rule-based system previously developed for the KYOTO Project is used to extract biomedical events involving proteins or genes from texts annotated i...
This paper presents a set of experiments for the detection and correction of syntactic errors, exploring two alternative approaches. The first one uses an error grammar which combines a robust morphosyntactic analyser and two groups of finite-state transducers (one for the description of syntactic error patterns and the other for the correction of...
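The finite-state error patterns mentioned here can be pictured as regular expressions over morphosyntactic tags. The tag set, the tagged string and the pattern below are invented stand-ins for the actual transducers, offered only to make the idea concrete.

```python
import re

# Token/TAG sequence produced by a hypothetical morphosyntactic analyser.
tagged = "etxe/NOUN-ABS ederra/ADJ-ABS dira/AUX-PL"

# Invented error pattern: a singular absolutive noun phrase followed by a plural auxiliary.
ERROR_PATTERN = re.compile(r"\S+/NOUN-ABS \S+/ADJ-ABS (\S+)/AUX-PL")

match = ERROR_PATTERN.search(tagged)
if match:
    print("possible agreement error near:", match.group(1))
```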
This paper presents the introduction of WordNet semantic classes in a dependency parser, obtaining improvements on the full Penn Treebank for the first time. We tried different combinations of some basic semantic classes and word sense disambiguation algorithms. Our experiments show that selecting the adequate combination of semantic features on de...
We present a system for the detection of agreement errors in Basque, a language with agglutinative morphology and free order of the main sentence constituents. Due to their complexity, agreement errors are one of the most frequent error types found in written texts. As the constituents concerning agreement can appear in any order in the sentence, w...
We present a set of experiments on dependency parsing of the Basque Dependency Treebank (BDT). The present work has examined several directions that try to explore the rich set of morphosyntactic features in the BDT: i) experimenting the impact of morphological features, ii) application of dependency tree transformations, iii) application of a two-...
This paper presents a set of experiments performed on parsing the Basque Dependency Treebank. We have applied feature propagation to dependency parsing, experimenting the propagation of several morphosyntactic feature values. In the experiments we have used the output of a parser to enrich the input of a second parser. Both parsers have been generated by Maltparser, a freely data-dri...
We present a study of the impact of morphological and syntactic ambiguity in the process of grammatical error detection. We will present three different systems that have been devised with the objective of detecting grammatical errors in Basque and will examine the influence of ambiguity in their results. We infer that the ambiguity rate in the...
Summary: The Kyoto project builds a language-independent information system for a specific domain (environment, ecology and diversity) based on a language-independent ontology that will be linked to Wordnets in seven languages. Keywords: Information Extraction. Abstract: The KYOTO project will construct a langua...
This work presents the development of a system that detects incorrect uses of complex postpositions in Basque, an agglutinative language. Error detection in complex postpositions is interesting because: 1) the context of detection is limited to a few words; 2) it implies the interaction of multiple levels of linguistic processing (morphology,...
This article presents the first steps taken towards obtaining a statistical syntactic parser for Basque. The system is based on a treebank syntactically annotated with dependencies and on the adaptation of the deterministic parser of Nivre et al. (2007), which, by means of shift/reduce analysis and a system b...
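The shift/reduce analysis mentioned here can be illustrated with a bare-bones transition loop (arc-standard style), driven by a trivial hand-written action sequence instead of the learned classifier used in that line of work; the toy sentence and actions are my own.

```python
def parse(words, decide):
    """Greedy arc-standard parsing: decide() returns 'SHIFT', 'LEFT' or 'RIGHT'."""
    stack, buffer, arcs = [], list(range(len(words))), []
    while buffer or len(stack) > 1:
        action = decide(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT" and len(stack) >= 2:      # head = top, dependent = second
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT" and len(stack) >= 2:     # head = second, dependent = top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        else:
            break
    return arcs

# Toy run: "gizonak etxea erosi du" ("the man has bought the house").
words = ["gizonak", "etxea", "erosi", "du"]
actions = iter(["SHIFT", "SHIFT", "SHIFT", "SHIFT", "RIGHT", "LEFT", "LEFT"])
print(parse(words, lambda s, b: next(actions)))   # arcs all headed by "erosi" (index 2)
```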
This work presents the migration process of a syntactic grammar of Basque from one formalism to another. Due to differences in the formalisms and the kind of grammars, it is not possible to make a direct translation. As a consequence, the construction of a new grammar by a linguist must start almost from scratch. For this reason we devised an exper...
This article describes the different steps in the construction of EPEC (Reference Corpus for the Processing of Basque). EPEC is a corpus of standard written Basque that has been manually tagged at different levels (morphology, surface syntax, phrases) and is currently being hand tagged at deep syntax level following the Dependency Structure-based S...
This paper presents the design and development of a system for the detection and correction of syntactic errors in free texts. The system is composed of three main modules: a) a robust syntactic analyser, b) a compiler that will translate error processing rules, and c) a module that coordinates the results of the analyser, applying different combi...
In this paper we present a framework for dealing with linguistic annotations. Our aim is to establish a flexible and extensible infrastructure which follows a coherent and general representation scheme. This proposal provides us with a well-formalized basis for the exchange of linguistic information. We use TEI-P4 conformant feature structure...
This paper explores the viability of porting lexico-syntactic information from English to Basque in order to make PP attachment decisions. Basque is a free constituent order language where PPs in a multiple-verb sentence can be attached to any of the verbs. We compared a system trained in non-ambiguous (single verb) Basque sentences with anoth...
In this paper we present EULIA, a tool which has been designed for dealing with the linguistic annotated corpora generated by a set of different linguistic processing tools. The objective of EULIA is to provide a flexible and extensible environment for creating, consulting, visualizing, and modifying documents generated by existing linguistic tools...
This paper describes the representation of Basque Multiword Lexical Units and the automatic processing of Multiword Expressions. After discussing and stating which kind of multiword expressions we consider to be processed at the current stage of the work, we present the representation schema of the corresponding lexical units in a general-purpose l...
This paper explores a crosslingual approach to the PP attachment problem. We built a large dependency database for English based on an automatic parse of the BNC, and Reuters (sports and finances sections). The Basque attachment decisions are taken based on the occurrence frequency of the translations of the Basque (verb-noun) pairs in the English...
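The attachment decision sketched above reduces to comparing corpus frequencies of the translated (verb, noun) pairs; a toy version with an invented count table (standing in for the dependency database built from the BNC and Reuters) might be:

```python
# Hypothetical counts of (verb, noun) dependencies harvested from an English corpus.
PAIR_COUNTS = {("eat", "fork"): 120, ("see", "fork"): 3}

def attach(candidate_verbs_en, noun_en, counts=PAIR_COUNTS):
    """Pick the verb whose translated (verb, noun) pair is most frequent in English."""
    return max(candidate_verbs_en, key=lambda v: counts.get((v, noun_en), 0))

# Which verb should the translated noun "fork" attach to?
print(attach(["eat", "see"], "fork"))    # -> "eat"
```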
This article presents a robust syntactic analyser for Basque and the different modules it contains. Each module is structured in different analysis layers for which each layer takes the information provided by the previous layer as its input; thus creating a gradually deeper syntactic analysis in cascade. This analysis is carried out using the Cons...
Funding body: MCyT (PROFIT Project: FIT-150500-2002-411).
We present the application of finite state technology (FST) to several kinds of linguistic processing of Basque, which can serve as a representative of agglutinative languages and languages with free order of constituents. Three main tools will be described in this context: a morphological analyzer, a morphosyntactic disambiguation tool and a surfa...
In this chapter we describe a computational grammar for Basque, and the first results obtained using it in the process of automatically acquiring subcategorization information about verbs and their associated sentence elements (arguments and adjuncts).
In section 1 we describe the Basque syntax and the grammar we have developed for its treatment. T...
In this paper we present the design of a digital resource which will be used as a repository of information of linguistic errors. As a first step in the design of this database, we made a classification of possible errors. This classification is based on information contained in Basque grammars (Alberdi et al., 2001; Zubiri, 1994) and our previous exp...
This paper presents the methodology followed in the construction of a surface-based morphosyntactic parsing grammar as well as the results obtained. It is based on the Constraint Grammar formalism which we find suitable for our project of analysing unrestricted texts. Besides, we will present a description of the main types of morphosyntactic ambigu...
This paper presents a parsing system for the detection of syntactic errors. It combines a robust partial parser which obtains the main sentence components and a finite-state parser used for the description of syntactic error patterns. The system has been tested on a corpus of real texts, containing both correct and incorrect sentences, with promisi...
In this paper we present a program library conceived and implemented to represent and manipulate the information exchanged in the process of integration of NLP tools. It is currently used to integrate the tools developed for Basque processing during the last ten years at our research group. In our opinion, the program library is general enough to b...
This paper presents the design and implementation of a finite-state syntactic grammar of Basque that has been used with the objective of extracting information about verb subcategorization instances from newspaper texts. After a partial parser has built basic syntactic units such as noun phrases, prepositional phrases, and sentential complements, a...
In this article we present the work carried out on the automatic extraction of information about the occurrence of complements and adjuncts for a set of 1,400 verbs from a newspaper corpus of one and a half million words. The results have been evaluated, obtaining satisfactory precision and coverage. These data will be us...
This work presents the development and implementation of a full morphological analyzer for Basque, an agglutinative language. Several problems (phrase structure inside word-forms, noun ellipsis, multiplicity of values for the same feature and the use of complex linguistic representations) have forced us to go beyond the morphological segmentation o...
This paper presents a robust parsing system for unrestricted Basque texts. It analyzes a sentence in two stages: a unification-based parser builds basic syntactic units such as NPs, PPs, and sentential complements, while a finite-state parser performs syntactic disambiguation and filtering of the results. The system has been applied to the acquisit...