Abel Gómez’s research while affiliated with Universitat Oberta de Catalunya and other places


Publications (49)


Number of data papers published between 2015 and 2023 evaluated in the sample; 2023 has been evaluated up to June.
Diversity of collection and annotation team types across the analyzed data papers.
Collection: diversity of collection process types.
Annotation: diversity of annotation process types.
Overall results of the informed dimensions. The Social concerns and Profile of the collection targets dimensions have been evaluated only on datasets gathered from or describing people (16.5% of the sample). The Speech context dimension has been assessed only on datasets representing natural language (5.15% of the sample). Annotation dimensions have been assessed only on datasets created through an annotation process (42.28% of the sample). In these cases, the percentage reflects the occurrence of those dimensions relative to the number of papers that should declare them.


On the Readiness of Scientific Data Papers for a Fair and Transparent Use in Machine Learning
  • Article
  • Full-text available

January 2025

·

16 Reads

Scientific Data

·

Abel Gómez

·

To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. In addition, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. In this sense, academic institutions’ adoption of these practices has encouraged researchers to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how this broader scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers of different domains, assessing their coverage and trends in the requested dimensions and comparing them to those from an ML-focused venue (NeurIPS D&B), which publishes papers describing datasets. As a result, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data’s preparedness for its transparent and fairer use in ML technologies.



Figure 1. Number of data papers published between 2015 and 2023 evaluated in the sample; 2023 has been evaluated up to June.
Figure 7. Funding information in publishers' metadata, and extracted using text analysis over the papers' acknowledgements section.
Figure 8. Data collection and preparation workflow.
On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning

April 2024

·

73 Reads

To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. In addition, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. In this sense, the adoption of these practices by academic institutions has encouraged researchers to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how this scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers of different domains, assessing their completeness and coverage of the requested dimensions, and trends in recent years, putting special emphasis on the most and least documented dimensions. As a result, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data's preparedness for its transparent and fairer use in ML technologies.


Target dimensions of the extraction approach
DataDoc Analyzer: A Tool for Analyzing the Documentation of Scientific Datasets

October 2023

·

58 Reads

·

4 Citations

Recent public regulatory initiatives and relevant voices in the ML community have identified the need to document datasets according to several dimensions to ensure the fairness and trustworthiness of machine learning systems. In this sense, data-sharing practices in the scientific field have been quickly evolving in recent years, with more and more research works publishing technical documentation together with the data for replicability purposes. However, this documentation is written in natural language, and its structure, content focus, and composition vary, making it challenging to analyze. We present DataDoc Analyzer, a tool for analyzing the documentation of scientific datasets by extracting the details of the main dimensions required to analyze fairness and potential biases. We believe that our tool could help improve the quality of scientific datasets, aid dataset curators during the documentation process, and be a helpful tool for empirical studies on the overall quality of the datasets used in the ML field. The tool implements an ML pipeline that uses Large Language Models at its core for information retrieval. DataDoc is open-source, and a public demo is published online.




Fig. 1. A simple blockchain
Fig. 3. A blockchain metamodel (extracted and adapted from [23])
Comparison of four main enterprise blockchain platforms
Blockchain Technologies in the Design and Operation of Cyber-Physical Systems

February 2023

·

188 Reads

·

3 Citations

A blockchain is an open, distributed ledger that can record transactions between two parties in an efficient, verifiable, and permanent way. Once recorded in a block, the transaction data cannot be altered retroactively. Moreover, smart contracts can be put in place to ensure that any new data added to the blockchain respects the terms of an agreement between the involved parties. As such, the blockchain becomes the single source of truth for all stakeholders in the system. These characteristics make blockchain technology especially useful in the context of Industry 4.0, which is distributed in nature but has important requirements of trust and accountability among the large number of devices involved in the collaboration. In this chapter, we will see concrete scenarios where cyber-physical systems (CPSs) can benefit from blockchain technology, especially focusing on how blockchain works in practice and the design and architectural trade-offs to keep in mind when adopting this technology for both the design and operation of CPSs.
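The core property the abstract relies on, that data recorded in a block cannot be altered retroactively, comes from hash-chaining: each block stores the hash of its predecessor, so any tampering breaks the chain. A minimal, purely illustrative sketch of that idea (not code from the chapter; all names are invented for the example):

```python
import hashlib
import json

def block_hash(block):
    # Hash the block's contents (index, data, previous hash) deterministically.
    payload = json.dumps(
        {"index": block["index"], "data": block["data"], "prev": block["prev"]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def add_block(chain, data):
    # Each new block stores the hash of its predecessor, linking the chain.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "data": data, "prev": prev})
    return chain

def is_valid(chain):
    # Recomputing the hashes exposes any retroactive modification.
    return all(
        chain[i]["prev"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = []
add_block(chain, {"from": "A", "to": "B", "amount": 10})
add_block(chain, {"from": "B", "to": "C", "amount": 4})
print(is_valid(chain))           # True: the chain is consistent
chain[0]["data"]["amount"] = 99  # tamper with an earlier transaction
print(is_valid(chain))           # False: the altered block no longer matches
```

Real blockchains add consensus, signatures, and proof-of-work or similar mechanisms on top of this linking scheme; the sketch only shows why retroactive edits are detectable.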



Figure 4: UI Overview
Figure 5: Tool's hints feature example
DescribeML: A Tool for Describing Machine Learning Datasets

October 2022

·

854 Reads

·

17 Citations

Datasets play a central role in the training and evaluation of machine learning (ML) models. But they are also the root cause of many undesired model behaviors, such as biased predictions. To overcome this situation, the ML community is proposing a data-centric cultural shift, where data issues are given the attention they deserve, for instance, proposing standard descriptions for datasets. In this sense, and inspired by these proposals, we present a model-driven tool to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. Our tool aims to facilitate any ML initiative to leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The tool is implemented with the Langium workbench as a Visual Studio Code plugin and published as open source.


AIDOaRt: AI-augmented Automation for DevOps, a Model-based Framework for Continuous Development in Cyber-Physical Systems

October 2022

·

244 Reads

·

29 Citations

Microprocessors and Microsystems

The advent of complex Cyber-Physical Systems (CPSs) creates the need for more efficient engineering processes. Recently, DevOps promoted the idea of considering a closer continuous integration between system development (including its design) and operational deployment. Although their use is still limited, Artificial Intelligence (AI) techniques are suitable candidates for improving such system engineering activities (cf. AIOps). In this context, AIDOaRt is a large European collaborative project that aims at providing AI-augmented automation capabilities to better support the modelling, coding, testing, monitoring, and continuous development of CPSs. The project proposes to combine Model Driven Engineering principles and techniques with AI-enhanced methods and tools for engineering more trustable CPSs. The resulting framework will 1) enable the dynamic observation and analysis of system data collected at both runtime and design time and 2) provide dedicated AI-augmented solutions that will then be validated in concrete industrial cases. This paper describes the main research objectives and underlying paradigms of the AIDOaRt project. It also introduces the conceptual architecture and proposed approach of the AIDOaRt overall solution. Finally, it reports on the actual project practices and discusses the current results and future plans.


Citations (38)


... In conclusion, the data used to perform this study is composed of the list of 4041 data papers, enriched with the extracted dimensions and the results of the topic analysis. As Giner et al. 52 report in their approach, language models have a tendency to hallucinate and, despite the different strategies implemented to reduce this issue, we need to keep this in mind when looking at the data. In that sense, their study reported preliminary accuracy metrics for each dimension: 88.26% for the uses, 70% for the collection, and 81.25% for the annotation dimension. ...

Reference:

On the Readiness of Scientific Data Papers for a Fair and Transparent Use in Machine Learning
DataDoc Analyzer: A Tool for Analyzing the Documentation of Scientific Datasets

... Among the others, the grid search strategy [30] involves specifying finite parameter values and evaluating the Cartesian product, but it suffers from the dimensionality problem due to exponential evaluations as parameter count increases. To avoid any bias in the implementation, we exploit the Surprise dedicated function to optimize the selected algorithm 6 . ...
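The excerpt above describes exhaustive grid search over the Cartesian product of parameter values. A minimal generic sketch of that strategy, using a toy objective in place of a real recommender's validation error (this is an illustration, not the Surprise library's dedicated optimizer):

```python
import itertools

def grid_search(evaluate, param_grid):
    # Evaluate every combination in the Cartesian product of parameter values
    # and keep the one with the lowest error. The number of evaluations grows
    # exponentially with the number of parameters (the dimensionality problem).
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for a model's validation error (e.g., RMSE);
# by construction it is minimized at lr=0.01, reg=0.1.
def toy_error(p):
    return (p["lr"] - 0.01) ** 2 + (p["reg"] - 0.1) ** 2

grid = {"lr": [0.001, 0.005, 0.01], "reg": [0.02, 0.1, 0.5]}
best, score = grid_search(toy_error, grid)
print(best)  # {'lr': 0.01, 'reg': 0.1}
```

With 3 values per parameter and 2 parameters this costs 9 evaluations; with 10 parameters it would cost 3^10, which is why random or Bayesian search is often preferred in higher dimensions.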

A model-based framework for IoT systems in wastewater treatment plants.
  • Citing Article
  • January 2023

The Journal of Object Technology

... This situation has prompted the interest of regulatory agencies and the ML community in general in developing data best practices, such as building proper dataset documentation. Public regulatory initiatives, such as the European AI Act and the AI Bill of Rights, as well as relevant scientific works 4,5 , have proposed general guidelines for developing standard dataset documentation. Such proposals identify a number of dimensions, such as the dataset's provenance or potential social issues, that could influence how the dataset is used and the quality and generalization of the ML models trained with it. ...

A domain-specific language for describing machine learning datasets

Journal of Computer Languages

... The BC SC functionality, embedded with attack detection, identification, and reconfiguration conditions, undertakes the mitigation response in case of cyberattacks (Masood, Hasan, Vassiliou, & Lestas, 2022). Moreover, the SC also ensures that, when new data is added to the BC, it respects the agreement terms between the parties involved (Gómez, Joubert, & Cabot, 2023). The SC is embedded in the BC, which eliminates third-party systems to save administration costs and improve system efficiency (Gupta, et al., 2020). ...

Blockchain Technologies in the Design and Operation of Cyber-Physical Systems

... This work is developed and demonstrated within the European AIDOaRt project 1 [13,14]. The project aims to support systems engineering and continuous delivery activities, namely requirements engineering, modeling, coding, testing, deployment, and monitoring, with AI-augmented, automated MDE and development operations. ...

AIDOaRt: AI-augmented Automation for DevOps, a Model-based Framework for Continuous Development in Cyber-Physical Systems

Microprocessors and Microsystems

... Table 3 lists the final list of selected papers aligned with the publication venue. Particularly, 18 papers are added by query selection [2,9,11,37,41,42,45,52,60,62,63,66,71,72,80] and four are added due to snowballing [26,43,55,61]. The papers leverage the MDE body of knowledge, techniques, and practices [13,21]. ...

DescribeML: A Tool for Describing Machine Learning Datasets

... Meanwhile, Clean Architecture is an architecture intended to make the code of a project more structured and tidy by dividing the code's concerns into several layers (Fajri, 2022). Clean Architecture helps ensure that changes in one layer do not affect other layers, so the application is easier to test, maintain, and update (Giner-Miguelez, 2022; Mandal, 2022; Yang, 2022). ...

Enabling Content Management Systems as an Information Source in Model-Driven Projects

Lecture Notes in Business Information Processing

... The functional state of each component is represented as the nominal state, whereas the failure states are derived from it using a lightweight extension of the SMDs meta-model. The extension is achieved using the DAM profile, which has stereotypes and tag values that extend the states representing the nominal component (DaStep) (Bernardi et al. 2022). The lightweight extension of SMDs using DaStep represents the system's failure and error states, as well as the components' transitions and triggers. ...

DICE simulation: a tool for software performance assessment at the design stage

Automated Software Engineering

... Through a high-level abstraction, Model-Driven Engineering (MDE) can provide a unique means for representing many aspects of heterogeneous systems all in one place thanks to modeling languages, specifically domain-specific ones (DSMLs). To tackle such heterogeneity, it is essential to look at every system sub-component as a black box, where both the physical characteristics and the software that manages them are highly linked [5]. These sub-systems can be designed, developed, tested, and analyzed independently, and later, they can be integrated to form a fully functioning system. ...

Model-driven development of asynchronous message-driven architectures with AsyncAPI

Software and Systems Modeling
