Jordi Porta

Centro de Estudios de la Real Academia Española / Universidad Autónoma de Madrid · Computational Linguistics Area

PhD

About

37
Publications
4,608
Reads
130
Citations
Citations since 2017
12 Research Items
72 Citations
[Chart: citations per year, 2017 to 2023; vertical axis 0 to 14]
Additional affiliations
September 1999 - present
Universidad Autónoma de Madrid
Position
  • Professor (Associate)

Publications (37)
Conference Paper
Full-text available
This paper describes the techniques used by the system presented at the TweetLID shared task for Twitter language identification. The system is based on Support Vector Machines and Rational Kernels. An algorithm for multilanguage labeling is described. Its evaluation and application to Sociolinguistics are also included.
Conference Paper
Full-text available
DSLRAE is a hierarchical classifier for similar written languages and varieties based on maximum-entropy (maxent) classifiers. In the first level, the text is classified into a language group using a simple token-based maxent classifier. At the second level, a group-specific maxent classifier is applied to classify the text as one of the languages...
Conference Paper
Full-text available
The growing size of corpora poses some technological challenges to their management. To reduce some of the problems arising in processing corpora of a few billion (10^9) words, a shared-memory multithreaded version of MapReduce has been introduced into a corpus backend. Results on indexing very large corpora and computing basic statistics in this para...
Article
Full-text available
One of the aims of Assistive Technologies is to help people with disabilities to communicate with others and to provide means of access to information. As an aid to Deaf people, we present in this work a production-quality rule-based machine system for translating from Spanish to Spanish Sign Language (LSE) glosses, which is a necessary precursor t...
Conference Paper
Full-text available
This paper presents a linguistic approach based on weighted-finite state transducers for the lexical normalisation of Spanish Twitter messages. The system developed consists of transducers which are applied to out-of-vocabulary tokens. Transducers implement linguistic models of variation which generate sets of candidates according to a lexicon. A s...
Article
Full-text available
We present an automatic discourse particle (DM) tagger developed using manual annotation and machine learning. The tagger has been developed on a dataset of financial letters, where human annotators following a specific annotation guide have reached a 0.897 inter-annotator agreement (IAA) rate. With the annotated dataset, a prototype has been develope...
Article
Full-text available
Abstract: The emergence and rise of digitally mediated communication, especially of so-called social networks, calls for automated analytical capabilities to extract information and patterns from massive, weakly or poorly structured data in order to predict future trends, actions and events. This field attracts the i...
Preprint
Full-text available
This paper summarizes the main findings of the ADoBo 2021 shared task, proposed in the context of IberLef 2021. In this task, we invited participants to detect lexical borrowings (coming mostly from English) in Spanish newswire texts. This task was framed as a sequence classification problem using BIO encoding. We provided participants with an anno...
Article
Full-text available
This paper summarizes the main findings of the ADoBo 2021 shared task, proposed in the context of IberLef 2021. In this task, we invited participants to detect lexical borrowings (coming mostly from English) in Spanish newswire texts. This task was framed as a sequence classification problem using BIO encoding. We provided participants with an anno...
Preprint
Full-text available
We present the results of the CAPITEL-EVAL shared task, held in the context of the IberLEF 2020 competition series. CAPITEL-EVAL consisted of two subtasks: (1) Named Entity Recognition and Classification and (2) Universal Dependency parsing. For both, the source data was a newly annotated corpus, CAPITEL, a collection of Spanish articles in the new...
Article
Full-text available
We present the results of the CAPITEL-EVAL shared task, held in the context of the IberLEF 2020 competition series. CAPITEL-EVAL consisted of two subtasks: (1) Named Entity Recognition and Classification and (2) Universal Dependency parsing. For both, the source data was a newly annotated corpus, CAPITEL, a collection of Spanish articles in the new...
Conference Paper
Full-text available
We present the results of the CAPITEL-EVAL shared task, held in the context of the IberLEF 2020 competition series. CAPITEL-EVAL consisted of two subtasks: (1) Named Entity Recognition and Classification and (2) Universal Dependency parsing. For both, the source data was a newly annotated corpus, CAPITEL, a collection of Spanish articles in the new...
Conference Paper
Full-text available
Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover,...
Conference Paper
Full-text available
This paper describes the system presented at the MEDDOCAN (Medical Document Anonymization) task. The system consists of a candidate generator which uses a PoS-tagger, and a candidate classifier based on a convolutional neural network which uses three channels and pretrained word embeddings to represent the sequence of words to be classified and it...
Conference Paper
Full-text available
Online dictionaries try to include search capabilities to meet most users' needs. Although users are not always aware of how to use dictionaries effectively, sometimes it is the interface that fails to provide friendly access to the dictionary information. This work aims at lowering the barrier in supporting onomasiological and semasiological...
Thesis
Full-text available
This thesis addresses several aspects of automatic translation from Castilian Spanish to Spanish Sign Language (LSE), two typologically distant languages that lack the linguistic resources needed for statistical approaches to translation. For this reason, a rule-based approach grounded on contrastive grammatical studies on both languages is...
Conference Paper
Full-text available
A system for the analysis of Old Spanish word forms using weighted finite-state transducers is presented. The system uses previously existing resources such as a modern lexicon, a phonological transcriber and a set of rules implementing the evolution of Spanish from the Middle Ages. The results obtained in all datasets show significant improvements...
Conference Paper
Full-text available
This paper presents a finite-state computational model for Spanish Sign Language nominal morphology with particular attention to the treatment of morphological alternations. A computational morphology consists of a lexicon, a rewrite rules component relating lexical representations to surface forms, and a morphotactic component. All these compon...
Conference Paper
Deaf people cannot properly access the speech information stored in any kind of recording format (audio, video, etc.). We present a system that provides subtitling and Spanish Sign Language representation capabilities so that the Spanish Deaf population can access such speech content. The system is composed of a speech recognition module, a m...
Conference Paper
Watch a demo at: http://www.youtube.com/watch?v=ctmvCHguJKM An on-line Spanish-Spanish Sign Language (LSE) translation system is presented in which Spanish speech content is translated into LSE to provide Spanish deaf people access to speech information. It is cloud-based, built over a speech recognition module, a transfer-based machine translat...
Article
Full-text available
Iberia is a synchronic corpus of scientific Spanish designed mainly for terminological studies. In this paper, we describe its design and the infrastructure for its acquisition, processing and exploitation, including mark-up, linguistic annotation, indexing and the user interface. Two pre-processing tasks affecting a large number of words are descr...
Conference Paper
This paper presents the first results of the integration of a Spanish-to-LSE Machine Translation (MT) system into an e-learning platform. Most e-learning platforms provide speech-based contents, which makes them inaccessible to the Deaf. To solve this issue, we have developed an MT system that translates Spanish speech-based contents into LSE. To t...
Article
Full-text available
This work presents the linguistic basis used in a Spanish Sign Language (LSE) synthesizer. The fundamental aspects covered are the phonology of LSE and an approach to describing signed messages. Regarding phonology, we describe the phonological parameters used, the phonological model on which this work is based...
Article
Full-text available
This paper presents the work-in-progress in the development of an automatic term recognition (ATR) system built around the Corpus Científico-Técnico (CCT). Terms are modeled using three non-correlated dimensions: unithood, domainhood and usage, applied to a set of n-grams automatically extracted from the corpus. These dimensions are combined with a...
Conference Paper
NLP systems with monolithic grammars have to deal with several sources of non-determinism (i.e. ambiguity). This is particularly true of broad-coverage unification-based grammars where all dimensions of linguistic information are interleaved as theories such as HPSG propose. This paper shows how the search space of the parser can be pruned by the int...
Conference Paper
Full-text available
This paper presents Latch, a system for PoS disambiguation and partial parsing that has been developed for Spanish. In this system, chunks can be recognized and can be referred to like ordinary words in the disambiguation process. This way, sentences are simplified so that the disambiguator can operate interpreting a chunk as a word and chunk head...
Article
Full-text available
This article describes the tools and resources developed in the Department of Computational Linguistics of the Real Academia Española for the linguistic annotation of the CREA and CORDE corpora. In addition to elaborating on the classic approach to low-level linguistic processing of texts of very diverse nature and origin, the article contributes...
