Laurent Romary's scientific contributions

Publications (14)

Preprint
Full-text available
The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data...
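As a rough illustration of the raw input such pipelines start from, the sketch below streams the plain-text records of a Common Crawl WET archive. It assumes the warcio library; the archive file name is a placeholder, not a real Common Crawl segment.

```python
from warcio.archiveiterator import ArchiveIterator

# Stream extracted-text ("conversion") records from a gzipped WET file.
# The file name below is a placeholder.
with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":  # WET stores page text as 'conversion' records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            # (url, text) would then go through language identification and filtering
```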
Conference Paper
Full-text available
In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools. The overall goal of the reconstruction of threads is to be able to provide value to the collaborator in various use cases, such as highlight...
Preprint
Full-text available
We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for several mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and...
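For a feel of the output side, here is a minimal sketch of loading pretrained ELMo weights with AllenNLP's Elmo module; the option and weight file names are placeholders for whichever OSCAR-trained checkpoints one has downloaded.

```python
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: substitute the options/weights of an OSCAR-trained ELMo.
options_file = "oscar_fr_elmo_options.json"
weight_file = "oscar_fr_elmo_weights.hdf5"
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

character_ids = batch_to_ids([["Le", "chat", "dort", "."]])
with torch.no_grad():
    # A (batch, tokens, 1024) tensor of contextualized embeddings.
    embeddings = elmo(character_ids)["elmo_representations"][0]
```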
Preprint
Full-text available
The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful kinds of information for several natural language processing tasks and applications. Moreover, no large-scale French c...
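Named-entity layers of this kind are commonly distributed in a CoNLL-style column format; a small reader for that layout (token assumed in the first tab-separated column, NE tag in the last, blank lines separating sentences) might look like this:

```python
def read_conll_ner(path):
    """Parse a CoNLL-style file: one token per line, NE tag assumed in the
    last tab-separated column, blank lines marking sentence boundaries."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():  # sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            cols = line.split("\t")
            tokens.append(cols[0])
            tags.append(cols[-1])
    if tokens:  # flush a final sentence with no trailing blank line
        sentences.append((tokens, tags))
    return sentences
```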
Preprint
Full-text available
Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. Aiming to address this issue for French,...
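The released model is straightforward to try via the Hugging Face transformers library ("camembert-base" is the published checkpoint name); a small fill-mask probe:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForMaskedLM.from_pretrained("camembert-base")

inputs = tokenizer("Le camembert est <mask> !", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top 5 candidate tokens for the masked position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```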
Conference Paper
Full-text available
We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, aiming to understand and make unstructured information accessible, represent the building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on t...
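Grobid-quantities is typically run as a service; assuming a local instance on its documented default port, a measurement-extraction call is a single POST. The endpoint path and response fields below follow the project's REST documentation, so verify them against your deployed version.

```python
import requests

# Default deployment assumed at localhost:8060; adjust to your instance.
resp = requests.post(
    "http://localhost:8060/service/processQuantityText",
    data={"text": "The reaction ran at 300 °C for 2 hours under 1.5 bar."},
)
resp.raise_for_status()
for measurement in resp.json().get("measurements", []):
    print(measurement.get("type"), measurement)
```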
Article
Full-text available
Common Crawl is a considerably large, heterogeneous multilingual corpus comprising crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each containing many documents written in a wide variety of languages. Even though each document has a metadata block associated to i...
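The language classification step in OSCAR-style processing of this corpus is done with fastText; using its publicly released lid.176.bin identification model, routing a document to its language bucket reduces to one prediction call:

```python
import fasttext

# lid.176.bin is fastText's public 176-language identification model.
model = fasttext.load_model("lid.176.bin")
labels, probs = model.predict("Dies ist ein deutscher Satz.", k=1)
print(labels[0].replace("__label__", ""), float(probs[0]))  # e.g. de 0.99
```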
Article
Full-text available
This article tackles the issue of integrating heterogeneous archival sources in one single data repository, namely the EHRI portal, whose aim is to support Holocaust research by providing online access to information about dispersed sources relating to the Holocaust (http://portal.ehri-project.eu). In this case, the problem at hand is to combine da...
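Much of the source material in such archival aggregation arrives as EAD finding aids; a minimal lxml sketch for pulling unit titles out of an EAD 2002 file (the input file name is a placeholder) gives a sense of the parsing involved:

```python
from lxml import etree

# EAD 2002 documents use the namespace below; the input file is a placeholder.
EAD_NS = {"ead": "urn:isbn:1-931666-22-9"}
tree = etree.parse("finding_aid.xml")
for title in tree.xpath("//ead:unittitle", namespaces=EAD_NS):
    print(" ".join(title.itertext()).strip())
```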

Citations

... See Appendix A for the details of all models. We use the same vocabulary Ṽ for all models, which consists of the 30,000 most common words in the OSCAR corpus (Ortiz Suárez et al., 2020). We set the number of sentences we sample from OSCAR to calculate the decontextualised embeddings, i.e. ...
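A frequency-cutoff vocabulary like the Ṽ described in this snippet takes a few lines with collections.Counter; this sketch assumes simple whitespace tokenisation:

```python
from collections import Counter

def build_vocab(lines, size=30_000):
    # Count whitespace tokens and keep the `size` most frequent ones.
    counts = Counter(tok for line in lines for tok in line.split())
    return [word for word, _ in counts.most_common(size)]
```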
... We train and evaluate two tagging model types: a BERT token classification model and a contextual string embeddings model [14], referred to later as "Flair". The BERT tagger uses locale-specific variants of the BERT transformer: Gbert for German [15], Camembert for French [16] and Beto for Spanish [17]. A single linear layer is added at the output, and the whole network is fine-tuned for the tagging task. ...
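The "single linear layer added at the output" corresponds to the standard token-classification head; with the Hugging Face transformers library (checkpoint name and label count here are illustrative), the setup is:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# The ForTokenClassification head is exactly one linear layer over the encoder
# outputs; num_labels is task-specific (placeholder value here).
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForTokenClassification.from_pretrained("camembert-base", num_labels=17)
# Fine-tuning then updates the whole network, encoder included, on tagged data.
```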
... As discussed earlier in this section, such data can be scarce, particularly in the low-resource setting. We therefore decided to extend the work done in CCNet, CCMatrix (Schwenk et al., 2021b), and others like OSCAR (Ortiz Suárez et al., 2019). In this section, we describe our end-to-end process for both curating and cleaning monolingual data. ...
... This allowed us to adapt to the specificity of the targeted scientific domains. Recognition of new measurement units is also a key concern [32,33] in order to extract all quantitative entities. It is essential to recognise measurement units that are not present in the OTR, while also linking them to the corresponding quantity concepts of the OTR. ...
... Good praxis also requires documentation of the rules, along with encoding guidelines and examples. ODD is the perfect choice for this (Bauman 2019; Romary and Riondet 2018). ODD files, being TEI files, are also easily processable, a very useful feature whose value will become apparent below. ...
... Galeazzi & Di Giuseppantonio Di Franco (2017) pointed out humanobject interaction is an important aspect of 3D visualisation and argued we should link various datasets and provide suitable, useful access to researchers and practitioners. Alliez et al. (2017) however, recommend providing the full data resolution of the 3D model, dynamic lighting, measuring features, non-photorealistic lighting, cut-through sections, maps and sections from the 3D model, a dynamic camera, volume calculation at different layers, exploded views, space wrapping for enhanced visibility and inspection, including an option for transparent rendering. ...