
David Tomás - PhD, Lecturer at University of Alicante
About
101 Publications · 12,016 Reads · 728 Citations
Publications (101)
In the realm of data sharing, effective data discovery is critical for fostering collaboration and innovation across organizations. Data spaces merge the interoperability and secure sharing features of data ecosystems with the transactional and economic aspects of data markets, enabling seamless data exchange among data providers and consumers whil...
Objectives This study investigates gender biases in Artificial Intelligence (AI) perceptions among university students. It focuses on assessing self-perceptions regarding knowledge, impact, and support, with a specific emphasis on identifying any significant gender differences. The main hypotheses are focused on the existence of gender disparities...
Cognitive decline is a natural part of aging, often resulting in reduced cognitive abilities. In some cases, however, this decline is more pronounced, typically due to disorders such as Alzheimer's disease. Early detection of anomalous cognitive decline is crucial, as it can facilitate timely professional intervention. While medical data can help i...
Cognitive decline is a natural process that occurs as individuals age. Early diagnosis of anomalous decline is crucial for initiating professional treatment that can enhance the quality of life of those affected. To address this issue, we propose a multimodal model capable of predicting Mild Cognitive Impairment and cognitive scores. The TAUKADIAL...
Within research on data-intensive software development, a significant opportunity is emerging around data spaces. The main challenge is enabling researchers to create spin-offs that transfer research results within data spaces, beyond traditional approaches such as licensing or patenting. To address this challenge, we have combined the renowned Lea...
This study presents a pioneering approach that leverages advanced sensing technologies and data processing techniques to enhance the process of clinical documentation generation during medical consultations. By employing sophisticated sensors to capture and interpret various cues such as speech patterns, intonations, or pauses, the system aims to a...
Table retrieval involves providing a ranked list of relevant tables in response to a search query. A critical aspect of this process is computing the similarity between tables. Recent Transformer-based language models have been effectively employed to generate word embedding representations of tables for assessing their semantic similarity. However...
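As a hedged illustration of the embedding-based table similarity idea referred to in this line of work (not the specific model of the paper), tables can be serialized to text, encoded with a pre-trained Transformer, and ranked by cosine similarity to the query table. The model name and the simple row-wise serialization below are assumptions:

```python
# Illustrative sketch of embedding-based table similarity (not the paper's exact model).
from sentence_transformers import SentenceTransformer, util

def serialize_table(table):
    """Flatten a table (dict of column -> values) into a single text string."""
    return " ; ".join(f"{col}: {' '.join(map(str, vals))}" for col, vals in table.items())

# Hypothetical query table and candidate corpus.
query = {"country": ["Spain", "France"], "capital": ["Madrid", "Paris"]}
candidates = [
    {"nation": ["Italy", "Portugal"], "capital city": ["Rome", "Lisbon"]},
    {"player": ["Messi", "Ronaldo"], "goals": [30, 25]},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # pre-trained Transformer encoder
query_emb = model.encode(serialize_table(query), convert_to_tensor=True)
cand_embs = model.encode([serialize_table(t) for t in candidates], convert_to_tensor=True)

scores = util.cos_sim(query_emb, cand_embs)[0]   # cosine similarity per candidate table
ranking = sorted(enumerate(scores.tolist()), key=lambda x: -x[1])
print(ranking)                                   # ranked list of (table index, score)
```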
The growing significance of sensor data for the development of information technology services faces obstacles due to disparate data presentations and non-adherence to FAIR principles. This paper introduces a novel approach for sensor data gathering and retrieval. The proposal leverages large language models to convert sensor data into FAIR-complian...
The Internet of Things generates vast data volumes via diverse sensors, yet its potential remains unexploited for innovative data-driven products and services. Limitations arise from sensor-dependent data handling by manufacturers and user companies, hindering third-party access and comprehension. Initiatives like the European Data Act aim to enabl...
The current tourism landscape intertwines physical and digital realms, giving rise to smart tourism. This paradigm shift leverages technology, generating large amounts of data from sensors and mobile devices, offering vast opportunities for strategic tourism planning and management. Smart tourism destinations (STDs) are pivotal in this context, uti...
Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata, without considering the data contained in the publication in the form of tables. Moreover, scientists require more elabor...
Smart monitoring and surveillance systems have become one of the fundamental areas in the context of security applications in Smart Cities. In particular, video surveillance for Human Activity Recognition (HAR) applied to the recognition of potential offenders and to the detection and prevention of violent acts is a challenging task that is still u...
In this paper, we propose a pipeline for analyzing audio recordings of both aphasic and healthy patients. The pipeline can transcribe and distinguish between patients and the interviewer. To evaluate the pipeline’s effectiveness, we conducted a manual review of the initial frames of one hundred randomly selected samples and achieved a 94% accuracy...
The progress of automatic scene analysis techniques for homes and the development of assisted living systems is vital to help different kinds of people, such as the elderly or visually impaired individuals, who require special care in their daily lives. In this work, we introduce the usage of SwinBERT, a powerful and recent video captioning model f...
The possibility that social networks offer to attach audio, video, and images to textual information has led many users to create messages with multimodal irony. Over the last years, a series of approaches have emerged trying to leverage all these formats to address the problem of multimodal irony detection. The question that the present work tries...
This paper presents a machine learning-based classifier for detecting points of interest through the combined use of images and text from social networks. This model exploits the transfer learning capabilities of the neural network architecture CLIP (Contrastive Language-Image Pre-Training) in multimodal environments using image and text. Different...
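A hedged sketch of the general transfer-learning recipe this abstract refers to: frozen CLIP image and text features feeding a lightweight classifier. The checkpoint name, the hypothetical labelled posts and the logistic-regression head are illustrative assumptions, not necessarily the paper's configuration:

```python
# Sketch: frozen CLIP image+text features + a simple classifier head (illustrative setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import LogisticRegression

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def post_embedding(image_path, caption):
    """Concatenate CLIP image and text embeddings for one social-network post."""
    image = Image.open(image_path)
    with torch.no_grad():
        img_feat = model.get_image_features(**processor(images=image, return_tensors="pt"))
        txt_feat = model.get_text_features(**processor(text=[caption], return_tensors="pt",
                                                       padding=True, truncation=True))
    return torch.cat([img_feat, txt_feat], dim=-1).squeeze(0).numpy()

# Hypothetical labelled posts: (image file, caption, point-of-interest label).
posts = [("cathedral.jpg", "visiting the old cathedral", "monument"),
         ("beach.jpg", "sunset at the beach", "beach")]
X = [post_embedding(path, caption) for path, caption, _ in posts]
y = [label for _, _, label in posts]

clf = LogisticRegression(max_iter=1000).fit(X, y)  # lightweight head on frozen CLIP features
print(clf.predict([post_embedding("cathedral.jpg", "visiting the old cathedral")]))
```

Keeping CLIP frozen and training only the small head is what makes this a transfer-learning setup: the multimodal representation is reused as-is and only the task-specific mapping is learned.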
This paper presents a new approach to retrieve and further integrate tabular datasets (collections of rows and columns) using union and join operations. In this work, both processes were carried out using a similarity measure based on contextual word embeddings, which allows finding semantically similar tables and overcoming the recall problem of lex...
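A minimal sketch of the column-matching step that such embedding-based union/join operations rely on: encode column headers with contextual embeddings and pair up columns whose cosine similarity exceeds a threshold. The encoder name and the 0.5 threshold are assumptions for illustration:

```python
# Sketch: semantic column matching for table union, using contextual embeddings (illustrative).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def match_columns(cols_a, cols_b, threshold=0.5):
    """Return (col_a, col_b, score) pairs whose embedding similarity exceeds the threshold."""
    emb_a = model.encode(cols_a, convert_to_tensor=True)
    emb_b = model.encode(cols_b, convert_to_tensor=True)
    sims = util.cos_sim(emb_a, emb_b)
    matches = []
    for i, col_a in enumerate(cols_a):
        j = int(sims[i].argmax())
        if float(sims[i][j]) >= threshold:
            matches.append((col_a, cols_b[j], float(sims[i][j])))
    return matches

# Hypothetical headers from two open-data tables describing the same kind of entity.
table_a = ["municipality", "population", "mean income"]
table_b = ["city name", "inhabitants", "average salary"]
print(match_columns(table_a, table_b))  # columns that can be aligned before a union
```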
Irony is nowadays a pervasive phenomenon in social networks. The multimodal functionalities of these platforms (i.e., the possibility to attach audio, video, and images to textual information) are increasingly leading their users to employ combinations of information in different formats to express their ironic thoughts. The present work focuses on...
Table retrieval is the task of answering a search query with a ranked list of tables that are considered as relevant to that query. Computing table similarity is a critical part of this process. Current Transformer-based language models have been successfully used to obtain word embedding representations of the tables to calculate their semantic si...
Social media is an interesting source of information, especially when physical sensors are not available. In this paper, we explore several methodologies for the geolocation of multimodal information (image and text) from social networks. To this end, we use pre-trained neural network models for the classification of images and their associated text...
In this paper we propose a deep architecture to predict dementia, a disease which affects around 55 million people worldwide and, in some cases, leaves them dependent on others. To this end, we have used the DementiaBank dataset, which includes audio recordings and their transcriptions from healthy people and people with dementia. Different...
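As a hedged, much-simplified stand-in for the text branch of such a deep architecture, a transcript-only baseline could look as follows (toy data; DementiaBank itself has its own access conditions):

```python
# Toy baseline for dementia detection from interview transcripts (not the paper's deep architecture).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical transcripts; real experiments would use DementiaBank data.
transcripts = [
    "the boy is um taking the the cookie from the jar",
    "the boy reaches for the cookie jar while the stool tips over",
    "there is a a thing I don't remember the word",
    "the mother is washing dishes and the sink is overflowing",
]
labels = ["dementia", "control", "dementia", "control"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(transcripts, labels)                       # word and bigram features capture disfluencies
print(clf.predict(["the boy um the jar the the cookie"]))
```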
Nowadays, open data has become a prominent information source for creating value-added products and services. In fact, open data portal initiatives have been adopted by most governments to supply their public sector information, usually in the form of tabular data such as spreadsheets or CSV files. Most open data portals allow reusers to retrieve...
Social media is one of the data sources that could provide more information or potential knowledge in almost any field of application. One of the main challenges of machine learning and big data is to solve the difficulty involved in the identification, classification, and, in general, the processing of all this data to extract useful information f...
One of the main issues related to unsupervised machine learning is the cost of processing and extracting useful information from large datasets. In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture in multimodal environments (image and text) from social media. For thi...
Nowadays, the emergence of large-scale and highly distributed cyber-physical systems (CPSs) in applications including Internet of things (IoT), cloud computing, mobility, Big Data, and sensor networks involves that architecture models have to work in an open and highly dynamic world. This fact increasingly highlights the importance of designing rea...
This paper describes a group activity concerning the topic of climate change, designed to introduce the concepts of sustainable development into a Robotic Engineering degree. The purpose of this activity was to make students reflect on the impact of their work on the planet as future engineers by asking them to design an environmentally friendly...
More and more governments around the world are publishing tabular open data, mainly in formats such as CSV or XLS(X). These datasets are mostly individually published, i.e. each publisher exposes its data on the Web without considering potential relationships with other datasets (from its own or from other publishers). As a result, reusing several...
Post-truth is a term that describes a distorting phenomenon that aims to manipulate public opinion and behavior. One of its key engines is the spread of Fake News. Nowadays most news is rapidly disseminated in written language via digital media and social networks. Therefore, to detect fake news it is becoming increasingly necessary to apply Artifi...
In this work we analyse the programming skills of fourth-year students of the degree in Computer Engineering at the University of Alicante, with the aim of identifying common gaps in their training that have not been adequately addressed over the course of the degree. To this end, we present an assessment...
The abundance of digital media information coming from different sources, completely redefines approaches to media content production management and distribution for all contexts (i.e. technical, business and operational). Such content includes descriptive information (i.e. metadata) about an asset (e.g. a movie, song or game), as well as playable...
SAM is a social media platform that enhances the experience of watching video content in a conventional living room setting, with a service that lets the viewer use a second screen (such as a smart phone) to interact with content, context and communities related to the main video content. This article describes three key functionalities used in the...
Educational Data Mining (EDM) is a research field that focuses on the application of data mining, machine learning, and statistical methods to detect patterns in large collections of educational data. Different machine learning techniques have been applied in this field over the years, but it is only recently that Deep Learning has gained increasi...
This paper presents the first large-scale investigation of the users and uses of WorldCat.org, the world's largest bibliographic database and global union catalog. Using a mixed-methods approach involving focus group interviews with 120 participants, an online survey with 2,918 responses, and an analysis of transaction logs of approximately 15 mill...
This document presents a description of the Entity Linking system known as REDES, which took part in the Entity Discovery and Linking (EDL) track of the TAC Knowledge Base Population (KBP) 2016 challenge. The system developed is the result of the collaboration among different research projects, in particular SAM, from which specific modules were r...
Today's generation of Internet-connected devices has changed the way users are interacting with media, shifting their role from passive and unidirectional to proactive and interactive. Under this new role, users are able to comment or rate a TV show and search for information regarding characters, facts, multimedia content or any other related ma...
This paper presents an approach to entity linking in the domain of Social TV on two different knowledge bases: Wikipedia and our own ontology of media assets. We provide insights into the main challenges posed by this task, together with a description of different tools and related projects in the field. Since the system described is part of a plat...
This paper presents an opinion mining approach in the domain of Social TV using two different contexts: Twitter user messages for Spanish and English, as well as movie reviews. The main goal of this paper is to study the benefits of opinion mining approaches using ranking skip-gram techniques for processing user feedback. To carry out this study i...
This article presents a method for recommending scientific articles taking into consideration their degree of generality or specificity. This approach is based on the idea that people with less expertise in a specific topic prefer to read more general articles as an introduction to it, while people with more expertise prefer to read more specific articles....
Today's generation of Internet devices has changed how users are interacting with media, from passive and unidirectional users to proactive and interactive. Users can use these devices to comment or rate a TV show and search for related information regarding characters, facts or personalities. This phenomenon is known as second screen. This paper d...
Social media services offer a wide range of opportunities for businesses and developers to exploit the vast amount of information and user-generated content produced via social media. In addition, the notion of TV second screen usage -- the interleaved usage of TV and smart devices such as smartphones -- appears ever more prominent, with viewers co...
Social networking apps, sites and technologies offer a wide range of opportunities for businesses and developers to exploit the vast amount of information and user-generated content produced through social networking. In addition, the notion of second screen TV usage appears more influential than ever, with viewers continuously seeking further info...
This article introduces the EU-co-funded project “Socialising Around Media” (SAM). The project is developing an ecosystem for creating and presenting “second screen” experiences e.g. on smartphones to accompany “first screen” TV content. SAM second screen experiences can provide both additional supplemental content to users and allow them to engage...
This work introduces a new approach to the aspect-based sentiment analysis task. Its main purpose is to automatically assign the correct polarity to the aspect term in a phrase. It is a probabilistic automaton where each state consists of all the nouns, adjectives, verbs and adverbs found in an annotated corpus. Each one of them contains the number o...
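The following is only a simplified, hedged sketch of the counting intuition behind such a model (per-word polarity counts from an annotated corpus, aggregated around the aspect term); it does not reproduce the probabilistic automaton itself, and the window size is an assumption:

```python
# Hedged simplification of the count-based idea: per-word polarity counts learned from an
# annotated corpus, aggregated over the words surrounding an aspect term.
from collections import Counter, defaultdict

def train_counts(annotated):
    """annotated: list of (tokens, polarity). Returns word -> Counter of polarities."""
    counts = defaultdict(Counter)
    for tokens, polarity in annotated:
        for tok in tokens:
            counts[tok.lower()][polarity] += 1
    return counts

def aspect_polarity(tokens, aspect, counts, window=3):
    """Score the polarity of `aspect` from the words in a window around it."""
    idx = tokens.index(aspect)
    context = tokens[max(0, idx - window): idx] + tokens[idx + 1: idx + 1 + window]
    score = Counter()
    for tok in context:
        score.update(counts.get(tok.lower(), {}))
    return score.most_common(1)[0][0] if score else "neutral"

corpus = [("the battery life is excellent".split(), "positive"),
          ("terrible battery and awful screen".split(), "negative")]
counts = train_counts(corpus)
print(aspect_polarity("the battery is excellent today".split(), "battery", counts))
```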
In this paper we describe Fénix, a data model for exchanging information between Natural Language Processing applications. The format proposed is intended to be flexible enough to cover both current and future data structures employed in the field of Computational Linguistics. The Fénix architecture is divided into four separate layers: conceptual, l...
On-line Social Networks have increased their popularity rapidly since their creation, providing a huge amount of data which can be leveraged to extract useful information related to commercial and social human behaviours. One of the most useful pieces of information that can be extracted is geographical information. This paper shows an approach to detect the geog...
The main objective of this project is based on the need to reconsider the classical HLT philosophy to adapt it not only to the currently available resources (unstructured data with multimodality, multilinguality and different levels of formality) but also to the real needs of the final users. In order to reach this objective it is necessary to incl...
Wikipedia has become one of the most important sources of information available all over the world. However, the categorization of Wikipedia articles is not standardized and the searches are mainly performed on keywords rather than concepts. In this paper we present an application that builds a hierarchical structure to organize all Wikipedia entri...
This paper presents an application for medicinal plants prescription based on text classification techniques. The system receives as an input a free text describing the symptoms of a user, and retrieves a ranked list of medicinal plants related to those symptoms. In addition, a set of links to Wikipedia are also provided, enriching the information...
The geographical focus of a document identifies the relevant locations mentioned in text. This paper presents a corpus-based approach to detecting the geographical focus in documents. Unlike other approaches that rely solely on geographical information, our proposal employs all the textual information included in the corpus, under the assumptio...
The objective of this project is based on the need to rethink the classical HLT philosophy in order to adapt it both to the currently available sources (unstructured data with multimodality, multilinguality and different degrees of formality) and to the real needs of final users. To achieve this objective it is necessar...
Treatment of the spatial dimension in text and its application to information retrieval. Abstract: An emerging project focused on toponym disambiguation and the detection of the geographical focus in text. The aim is to improve the performance of geographic information retrieval systems. The problems addressed are described...
In this article we will describe the design and implementation of Jane, an efficient hierarchical phrase-based (HPB) toolkit developed at RWTH Aachen University. The system has been used by RWTH at several international evaluation campaigns, including ...
This project is focused on toponym disambiguation and geographical focus identification in text. The goal is to improve the performance of geographic information retrieval systems. This paper describes the problems faced, working hypothesis, tasks proposed and goals currently achieved. © 2012 Sociedad Española para el Procesamiento del Lenguaje Nat...
This article presents a minimally supervised approach to question classification on fine-grained taxonomies. We have defined an algorithm that automatically obtains lists of weighted terms for each class in the taxonomy, thus identifying which terms are highly related to the classes and are highly discriminative between them. These lists have then...
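To make the idea of class-specific weighted term lists concrete, here is a hedged toy sketch with an assumed log-ratio weighting; the paper's actual weighting scheme and its minimally supervised acquisition step are not reproduced:

```python
# Sketch of classification with per-class weighted term lists (one simple, assumed weighting).
from collections import Counter, defaultdict
import math

def build_term_lists(examples):
    """examples: list of (question, class). Weight = log of class-relative term frequency."""
    class_tf, global_tf = defaultdict(Counter), Counter()
    for text, label in examples:
        toks = text.lower().split()
        class_tf[label].update(toks)
        global_tf.update(toks)
    weights = defaultdict(dict)
    for label, tf in class_tf.items():
        for term, count in tf.items():
            weights[label][term] = math.log(1 + count / global_tf[term])
    return weights

def classify(question, weights):
    toks = question.lower().split()
    scores = {label: sum(terms.get(t, 0.0) for t in toks) for label, terms in weights.items()}
    return max(scores, key=scores.get)

seed = [("what city hosted the olympics", "LOCATION"),
        ("who wrote the odyssey", "HUMAN"),
        ("when did the war end", "DATE")]
weights = build_term_lists(seed)
print(classify("what city is the capital of spain", weights))
```

Terms that are frequent in one class but rare elsewhere receive high weights, so the lists are both related to their class and discriminative between classes, which is the property the abstract describes.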
In this paper we present a complete system for the treatment of both geographical and temporal dimensions in text and its application to information retrieval. This system has been evaluated in both the GeoTime task of the 8th and 9th NTCIR workshop in the years 2010 and 2011 respectively, making it possible to compare the system to contemporary ap...
This paper presents the QALL-ME Framework, a reusable architecture for building multi- and cross-lingual Question Answering (QA) systems working on structured data modelled by an ontology. It is released as free open source software with a set of demo components and extensive documentation, which makes it easy to use and adapt. The main characteris...
Many users employ vague geographical expressions to query Information Retrieval systems. These fuzzy entities appear neither in gazetteers nor in geographical databases. Searches such as “Ski resorts in north-central Spain” or “Restaurants near the Teatro Real of Madrid” often do not get the expected results, mainly due to the difficulty of...
In this paper, we introduce a kernel-based approach to question classification. We employed a kernel function based on latent semantic information acquired from Wikipedia. This kernel allows including external semantic knowledge into the supervised learning process. We obtained a highly effective question classifier combining this knowledge with a...
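A hedged sketch of how a latent-semantic kernel can plug into an SVM question classifier: an LSA space learned from unlabeled background text (standing in here for the Wikipedia-derived knowledge) is used as a precomputed kernel. Toy data; the linear kernel over LSA vectors is an illustrative choice rather than the paper's exact kernel:

```python
# Sketch: latent-semantic kernel for question classification with an SVM (illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

background = ["madrid is the capital city of spain",
              "homer wrote the odyssey and the iliad",
              "the second world war ended in 1945",
              "paris is a large city in france"]
questions = ["what city is the capital of france",
             "who wrote the iliad",
             "when did the second world war start"]
labels = ["LOCATION", "HUMAN", "DATE"]

vectorizer = TfidfVectorizer().fit(background)                 # vocabulary from unlabeled text
lsa = TruncatedSVD(n_components=3, random_state=0).fit(vectorizer.transform(background))

def latent(texts):
    return lsa.transform(vectorizer.transform(texts))          # project into the semantic space

X = latent(questions)
K_train = X @ X.T                                              # linear kernel in the latent space
svm = SVC(kernel="precomputed").fit(K_train, labels)

X_test = latent(["what city is the capital of italy"])
print(svm.predict(X_test @ X.T))                               # kernel between test and train
```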
An emerging project focused on the detection and interpretation of metaphors with unsupervised methods. We present the characterization of the metaphor problem in Natural Language Processing, the theoretical foundations of the project, and the first results.
This article presents an approach to automatic question classification in Spanish and Catalan. The classification system is based on the SVM algorithm and on the use of different kernel functions, employing only surface textual features that allow building a system easily adaptable to different languages....
This paper presents our research related to automatic expected answer type and named entity annotation tasks in a question answering context. We present the initial step of our research, in which we created the annotation guidelines. We therefore show and justify the tag set employed in the annotation of a collection of questions, and finally, diff...
The analysis and creation of annotated corpora are fundamental for implementing natural language processing solutions based on machine learning. In this paper we present a parallel corpus of 4500 questions in Spanish and English on the touristic domain, obtained from real users. With the aim of training a question answering system, the questions were...
This paper presents a machine learning approach to question classification. We have defined a kernel function based on latent semantic information acquired from unlabeled data. This kernel allows including external semantic knowledge into the supervised learning process. We have combined this knowledge with a bag-of-words approach by means of compo...
This paper presents the QALL-ME benchmark, a multilingual resource of annotated spoken requests in the tourism domain, freely available for research purposes. The languages currently involved in the project are Italian, English, Spanish and German. It introduces a semantic annotation scheme for spoken information access requests, specifically deriv...
As in the previous QA@CLEF competition, two separate groups at the University of Alicante participated this year using different approaches. This paper describes the work of Alicante 1 group. We have continued with the research line established in the past competition, where the main goal was to obtain a fully data-driven system based on machine le...
In this paper we present a novel multiple-taxonomy question classification system, facing the challenge of assigning categories in multiple taxonomies to natural language questions. We applied our system to category search on faceted information. The system provides a natural language interface to faceted information, detecting the categories reque...
This paper describes the development of an English corpus of factoid TREC-like question-answer pairs. The corpus obtained consists of more than 70,000 samples, each one containing the following information: a question, its question type, an exact answer to the question, and the different context levels (sentence, paragraph and document) where the answ...
Question Answering is a major research topic at the University of Alicante. For this reason, this year two groups participated in the QA@CLEF track using different approaches. In this paper we describe the work of Alicante 2 group. This paper describes AliQAn, a monolingual open-domain Question Answering (QA) System developed in the Department of L...
Question classification is one of the first tasks carried out in a Question Answering system. In this paper we present a multilingual question classification system based on machine learning techniques. We use Support Vector Machines to classify the questions. All the features needed to train and test this method are automatically extracted through...
As in the previous QA@CLEF track, two separate groups at the University of Alicante participated this year using different approaches. This paper describes the work of Alicante 1 group. We have continued with the research line established in the past competition, where the main goal was to obtain a fully data-driven system based on machine learnin...
Question Classification (QC) is usually the first stage in a Question Answering system. This paper presents a multilingual SVM-based question classification system aiming to be language and domain independent. For this purpose, we use only surface text features. The system has been tested on the TREC QA track question set, obtaining encouraging res...
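One hedged way to realise language-independent surface features is character n-grams, as in the toy sketch below; this feature choice and the sample questions are illustrative assumptions, not necessarily the exact setup of the paper:

```python
# Sketch: multilingual question classification with an SVM over surface character n-grams.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical mixed English/Spanish training questions with coarse classes.
train = [("what is the capital of france", "LOCATION"),
         ("cual es la capital de francia", "LOCATION"),
         ("who discovered penicillin", "HUMAN"),
         ("quien descubrio la penicilina", "HUMAN"),
         ("when did the war end", "DATE"),
         ("cuando termino la guerra", "DATE")]
texts, labels = zip(*train)

# Character n-grams need no tokenizer, stemmer or stop-word list, so the same pipeline
# can be retrained on any language without modification.
clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["quien escribio el quijote", "where is the eiffel tower"]))
```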
As Question Answering is a major research topic at the University of Alicante, this year two separate groups participated in the QA@CLEF track using different approaches. This paper describes the work of Alicante 1 group. Thinking of future developments, we have designed a modular framework based on XML that will easily let us integrate, combine an...
This paper describes the development of an image retrieval system that combines probabilistic and ontological information. The process is divided into two different stages: indexing and retrieval. Three information flows have been created, each with a different kind of information: word forms, stems and stemmed bigrams. The final result combines...
This paper describes the participation of the University of Alicante (UA) in CLEF 2005 image retrieval task. For this purpose we used an image retrieval system based on probabilistic information combined with ontological information and a feedback technique. Several information streams are created using different sources: stems, words and bigrams;...
This paper presents a multilingual approach to Question Classification based on machine learning, using language independent features. This way we obtain a system flexible and easily adaptable to new languages. Using a parallel corpus in English and Spanish, we test the performance of the system with three different techniques: Support Vector Machi...
This paper describes the novelties introduced in the Question Answering system developed in the Natural Language Processing and Information Systems Group at the University of Alicante for the QA@CLEF 2005 campaign with respect to our previous participations. Thinking of future developments, this year we have designed a modular framework based on XML...
This article presents a multilingual approach to question classification based on machine learning, employing language-independent learning features. This makes the system flexible and easily adaptable to new languages. On a parallel corpus of questions in English and Spanish, we compare...
This talk presents the basic characteristics of a Question Answering system, understood as one of the main contributions of language technologies to information search and retrieval, and focuses in particular on describing the contribution that semantics can provide to this kind of system...
In this paper we present a complete system for the treatment of the geographical dimension in the text and its application to information retrieval. This system has been evaluated in the GeoTime task of the 9th NTCIR workshop, making it possible to compare the system with other current approaches to the topic. In order to participate in this task...
This paper describes the development of an English corpus of factoid TREC-like question-answer pairs. The corpus obtained consists of more than 70,000 samples, each one containing the following information: a question, its question type, an exact answer to that question, the different context levels (sentence, paragraph and document) where...