
Javier de la Rosa
- PhD
- Research Developer at Stanford University
About
Publications: 44
Reads: 25,431
Citations: 354
Introduction
Half Computer Scientist, half Digital Humanist
Additional affiliations
September 2012 - July 2016
September 2011 - September 2012
Publications (44)
The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute pos...
Analyzing poetry with automatic tools has great potential for improving verse-related research. Over the last few decades, this field has expanded notably and a large number of tools aiming at analyzing various aspects of poetry have been developed. However, the concrete connection between these tools and traditional scholars investigating poetry a...
The increasing application of computational methods to the literature of the Spanish Golden Age has revealed the necessity of automating the modernization of its texts to facilitate seamless comparison and analysis. This study pioneers the employment of Natural Language Processing (NLP) techniques for the transformation of Spanish Golden Age texts...
Over the last nine years, a forty-seven-person digital humanities project has explored the feasibility of computer-assisted paleography for Syriac. That is, could one use big data, visual analytics, and recent advances in the digital analysis of handwriting to better understand Syriac manuscripts and Syriac manuscript culture? Last June we launched...
In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against...
The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In multilingual settings, the problem is exacerbated as scansion and rhyme systems only exist for individual languages, making comparative studies very challenging and time-consuming. In this work, we present Alberti, the fir...
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technolog...
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putti...
The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a...
In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-bas...
One stream of work in the digital humanities focuses on interoperability processes and the description of traditional concepts using computer-readable languages. In the case of literary studies, there has been some research into these topics, but the complexity of the knowledge domain remains an issue. This complexity is based on the different inte...
In this paper, we present a quantitative approach to Spanish poetry and versification based on the application of our own automatic metrical tool, Rantanplan, to the complete poetic works of four early modern Spanish poets. All of the poetry of these four representative authors—Garcilaso de la Vega (1503–1536), Fernando de Herrera (1534–1597), Luis...
The splitting of words into stressed and unstressed syllables is the foundation for the scansion of poetry, a process that aims at determining the metrical pattern of a line of verse within a poem. Intricate language rules and their exceptions, as well as poetic licenses exerted by the authors, make calculating these patterns a nontrivial task. Som...
The first edition of the IberLEF 2021 shared task on automatic detection of borrowings (ADoBo) focused on detecting lexical borrowings that appeared in the Spanish press and that have recently been imported into the Spanish language. In this work, we tested supplementary training on intermediate labeled-data tasks (STILTs) from part of speech (POS)...
The development of the network of ontologies of the ERC POSTDATA Project brought to light some deficiencies in terms of completeness in the currently available European poetry corpora. To tackle the issue in the realm of the Spanish poetic tradition, our approach consisted in designing a set of tools that any scholar could use to automatically enri...
The use of artificial intelligence and natural language processing techniques has increased considerably in the last few decades. Historically, the focus has been primarily on texts expressed in prose form, leaving mostly aside figurative or poetic expressions of language due to their rich semantics and syntactic complexity. The creation and analy...
In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for bot...
The automatic metric analysis (commonly referred to as scansion) of Spanish poetry is not a trivial problem since it combines the nuances of the language, the different poetic traditions related to melodic patterns, and the personal stylistic preferences and intentions of the author. In this paper, we explore two alternative algorithmic approaches...
The immersive installation Postcatalog (https://youtu.be/nuc-LUW2W3s), developed ad hoc for the exhibition Unchained Catalogs organized by the Vice-Rectorate of Culture of the University of Malaga (12-17-2020, 01-29-2021), addresses two lines of inquiry. Firstly, the project delves into how to make high-dimensional spaces derived from computational image processing physically ex...
In Tackling the Toolkit, we focus on the methodological innovations, challenges, obstacles and even shortcomings associated with applying quantitative methods to poetry specifically and poetics more broadly. Using tools including natural language processing, web ontologies, similarity detection devices and machine learning, our contributors explore...
In this paper, we compare automated metrical pattern identification systems available for Spanish against extensive experiments done by fine-tuning language models trained on the same task. Despite being initially conceived as a model suitable for semantic tasks, our results suggest that BERT-based models retain enough structural information to per...
Using one lexicographical tool, the Diccionario Crítico Etimológico Castellano e Hispánico (DECH), and two corpora, the HathiTrust’s digital library and the Google Books Ngrams, we tracked the occurrence of thousands of loanwords in Spanish to describe their use, origin, and historical context. In doing so, we used computational methodologies to pa...
A landmark work of the Spanish Golden Age and forerunner of the so-called picaresque novel, The Life of Lazarillo de Tormes and of His Fortunes and Adversities remains an anonymous text. Although distinguished scholars have tried to attribute it to different authors based on a variety of criteria, a consensus has yet to be reached. The list of ca...
Has human beauty always been perceived in the same manner? We used a set of 120,000 paintings from different periods to analyze human faces between the 13th and the 20th centuries in order to establish whether there has been a single canon of beauty (that would maximize reproduction probabilities) or whether this has changed over time. Our study sh...
In this article we propose a set of methodologies to study emerging reading practices in narratives developing simultaneously in various media. We have taken the data left by readers of the Spanish-Argentinian project Orsai, in the form of blog comments, download rates, and print-run volumes, as “reading traces.” We believe these traces shed much light on...
In this article we propose an approach to the study of art history based on the geography of Hispanic Baroque art by digital means that showcase the multiplicity of possible places of art. Our study advances four elements of a digital geography of art (communities, semantic maps, areas, and flows), a methodology that can be expanded in future Digital Hu...
This paper presents SylvaDB, a graph database management system designed to be used by people with no technical knowledge. SylvaDB is based on flexible schema definitions and has been developed taking into account the need to deal with semantic information. It relies on the mathematical notion of property graph. SylvaDB is an open source project an...
Computer methodology to study the evolution of human faces through art paintings
The authors analyze the network of Hispanic baroque paintings from 1550 to 1850. They divide the dataset of 11,443 works from Spain and Latin America into 25-year periods in order to study the evolution of the paintings' 211 descriptors. The analysis shows that most of the paintings are linked through genre and theme and that religious Christian th...
This paper presents the results of a multi-disciplinary collaboration in Digital Humanities that focuses on the multi-scale analysis of the network of Baroque paintings in the territories of the Hispanic Monarchy from the 16th through the 18th centuries. We apply graph analysis and visualizations as well as natural language analysis over a databa...
This paper presents the results of a multidisciplinary collaboration in Digital Humanities that focuses on the multi-scale analysis of the network of Baroque paintings in the territories of the Hispanic Monarchy from the 16th through the 18th centuries. We apply graph analysis and visualizations as well as natural language analysis over a database...