Javier de la Rosa

  • PhD
  • Research Developer at Stanford University

About

Publications: 44
Reads: 25,431
Citations: 354
Current institution
Stanford University
Current position
  • Research Developer
Additional affiliations
September 2012 - July 2016
Western University
Position
  • Research Assistant
September 2011 - September 2012
Western University
Position
  • Research Associate

Publications (44)
Preprint
Full-text available
The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute pos...
Article
Full-text available
Analyzing poetry with automatic tools has great potential for improving verse-related research. Over the last few decades, this field has expanded notably and a large number of tools aiming at analyzing various aspects of poetry have been developed. However, the concrete connection between these tools and traditional scholars investigating poetry a...
Article
Full-text available
The increasing application of computational methods to the literature of the Spanish Golden Age has revealed the necessity of automating the modernization of its texts to facilitate seamless comparison and analysis. This study pioneers the employment of Natural Language Processing (NLP) techniques for the transformation of Spanish Golden Age texts...
Article
Full-text available
Over the last nine years, a forty-seven-person digital humanities project has explored the feasibility of computer-assisted paleography for Syriac. That is, could one use big data, visual analytics, and recent advances in the digital analysis of handwriting to better understand Syriac manuscripts and Syriac manuscript culture? Last June we launched...
Preprint
Full-text available
In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against...
Preprint
Full-text available
The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In multilingual settings, the problem is exacerbated as scansion and rhyme systems only exist for individual languages, making comparative studies very challenging and time-consuming. In this work, we present Alberti, the fir...
Article
Full-text available
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technolog...
Preprint
Full-text available
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technolog...
Preprint
Full-text available
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putti...
Preprint
Full-text available
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technolog...
Article
Full-text available
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technolog...
Preprint
Full-text available
The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a...
Preprint
Full-text available
In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-bas...
Article
Full-text available
The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a...
Chapter
Full-text available
One stream of work in the digital humanities focuses on interoperability processes and the description of traditional concepts using computer-readable languages. In the case of literary studies, there has been some research into these topics, but the complexity of the knowledge domain remains an issue. This complexity is based on the different inte...
Chapter
Full-text available
In this paper, we present a quantitative approach to Spanish poetry and versification based on the application of our own automatic metrical tool, Rantanplan, to the complete poetic works of four early modern Spanish poets. All of the poetry of these four representative authors—Garcilaso de la Vega (1503–1536), Fernando de Herrera (1534–1597), Luis...
Article
Full-text available
The splitting of words into stressed and unstressed syllables is the foundation for the scansion of poetry, a process that aims at determining the metrical pattern of a line of verse within a poem. Intricate language rules and their exceptions, as well as the poetic license exercised by authors, make calculating these patterns a nontrivial task. Som...
Preprint
Full-text available
The first edition of the IberLEF 2021 shared task on automatic detection of borrowings (ADoBo) focused on detecting lexical borrowings that appeared in the Spanish press and that have recently been imported into the Spanish language. In this work, we tested supplementary training on intermediate labeled-data tasks (STILTs) from part of speech (POS)...
Conference Paper
Full-text available
The development of the network of ontologies of the ERC POSTDATA Project brought to light some deficiencies in terms of completeness in the currently available European poetry corpora. To tackle the issue in the realm of the Spanish poetic tradition, our approach consisted in designing a set of tools that any scholar could use to automatically enri...
Article
Full-text available
The use of artificial intelligence and natural language processing techniques has increased considerably in the last few decades. Historically, the focus has been primarily on texts expressed in prose form, leaving mostly aside figurative or poetic expressions of language due to their rich semantics and syntactic complexity. The creation and analy...
Preprint
Full-text available
In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for bot...
Article
Full-text available
The automatic metric analysis (commonly referred to as scansion) of Spanish poetry is not a trivial problem since it combines the nuances of the language, the different poetic traditions related to melodic patterns, and the personal stylistic preferences and intentions of the author. In this paper, we explore two alternative algorithmic approaches...
Conference Paper
Full-text available
The immersive installation Postcatálogo (https://youtu.be/nuc-LUW2W3s), created ad hoc for the exhibition Catálogos desencadenados organized by the Vice-Rectorate of Culture of the University of Málaga (17-12-2020, 29-01-2021), addresses two lines of inquiry. First, it seeks to answer the question: how can one make physically experienceable...
Conference Paper
Full-text available
The immersive installation Postcatalog, developed ad hoc for the exhibition Unchained Catalogs organized by the Vice-Rectorate of Culture of the University of Malaga (12-17-2020, 01-29-2021), addresses two lines of inquiry. Firstly, the project delves into how to make high-dimensional spaces derived from computational image processing physically ex...
Book
Full-text available
In Tackling the Toolkit, we focus on the methodological innovations, challenges, obstacles and even shortcomings associated with applying quantitative methods to poetry specifically and poetics more broadly. Using tools including natural language processing, web ontologies, similarity detection devices and machine learning, our contributors explore...
Conference Paper
In this paper, we compare automated metrical pattern identification systems available for Spanish against extensive experiments done by fine-tuning language models trained on the same task. Despite being initially conceived as a model suitable for semantic tasks, our results suggest that BERT-based models retain enough structural information to per...
Preprint
Full-text available
In this paper, we compare automated metrical pattern identification systems available for Spanish against extensive experiments done by fine-tuning language models trained on the same task. Despite being initially conceived as a model suitable for semantic tasks, our results suggest that BERT-based models retain enough structural information to per...
Article
Full-text available
Using one lexicographical tool, the Diccionario Crítico Etimológico Castellano e Hispánico (DECH), and two corpora, the HathiTrust’s digital library and the Google Books Ngrams, we tracked the occurrence of thousands of loanwords in Spanish to describe their use, origin, and historical context. In doing so, we used computational methodologies to pa...
Article
Full-text available
A masterpiece of the Spanish Golden Age and forerunner of the so-called picaresque novel, The Life of Lazarillo de Tormes and of His Fortunes and Adversities remains an anonymous text. Although distinguished scholars have tried to attribute it to different authors based on a variety of criteria, a consensus has yet to be reached. The list of ca...
Article
Full-text available
Has human beauty always been perceived in the same manner? We used a set of 120,000 paintings from different periods to analyze human faces between the 13th and the 20th centuries in order to establish whether there has been a single canon of beauty (that would maximize reproduction probabilities) or whether this has changed over time. Our study sh...
Article
Full-text available
In this article we propose a set of methodologies to study emerging reading practices in narratives developing simultaneously in various media. We have taken the data left by readers of the Spanish-Argentinian project Orsai, in the form of blog comments, download rates, and print-run volumes, as “reading traces.” We believe these traces shed much light on...
Article
In this article we propose an approach to the study of art history based on a digital geography of Hispanic Baroque art that showcases the multiplicity of possible places of art. Our study advances four elements of a digital geography of art (communities, semantic maps, areas, and flows), a methodology that can be expanded in future Digital Hu...
Conference Paper
Full-text available
This paper presents SylvaDB, a graph database management system designed to be used by people with no technical knowledge. SylvaDB is based on flexible schema definitions and has been developed taking into account the need to deal with semantic information. It relies on the mathematical notion of a property graph. SylvaDB is an open source project an...
Article
Full-text available
The authors analyze the network of Hispanic baroque paintings from 1550 to 1850. They divide the dataset of 11,443 works from Spain and Latin America into 25-year periods in order to study the evolution of the paintings' 211 descriptors. The analysis shows that most of the paintings are linked through genre and theme and that religious Christian th...
Conference Paper
Full-text available
This paper presents the results of a multi-disciplinary collaboration in Digital Humanities that focuses on the multi-scale analysis of the network of Baroque paintings in the territories of the Hispanic Monarchy from the 16th through the 18th centuries. We apply graph analysis and visualizations as well as natural language analysis over a databa...
Conference Paper
This paper presents the results of a multidisciplinary collaboration in Digital Humanities that focuses on the multi-scale analysis of the network of Baroque paintings in the territories of the Hispanic Monarchy from the 16th through the 18th centuries. We apply graph analysis and visualizations as well as natural language analysis over a database...
