Thierry Poibeau's research while affiliated with Ecole Normale Supérieure de Paris and other places

Publications (166)

Conference Paper
Full-text available
We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a mo...
Conference Paper
Full-text available
We present a novel neural model for modern poetry generation in French. The model consists of two pretrained neural models that are fine-tuned for the poem generation task. The encoder of the model is RoBERTa-based, while the decoder is based on GPT-2. This way the model can benefit from the superior natural language understanding performance...
Preprint
A central quest of probing is to uncover how pre-trained models encode a linguistic property within their representations. An encoding, however, might be spurious, i.e., the model might not rely on it when making predictions. In this paper, we try to find encodings that the model actually uses, introducing a usage-based probing setup. We first choos...
Preprint
Although transformer-based Neural Language Models demonstrate impressive performance on a variety of tasks, their generalization abilities are not well understood. They have been shown to perform strongly on subject-verb number agreement in a wide array of settings, suggesting that they learned to track syntactic dependencies during their training...
Conference Paper
Full-text available
The style and language of an author evolve over time, but how and to what extent? Is evolution linear or is it more erratic? In stylometry, those questions are often addressed with hierarchical clustering. Hierarchical clustering is popular in Digital Humanities to classify texts by degree of similarity. When texts can be ordered chronologically,...
Poster
Full-text available
It is well known that the idiolect (the language of an individual) evolves over time. However, there is a lack of quantitative studies on this topic, due to the lack of large corpora (but see Barlow 2013; Mollin 2009; Petré et al. 2019 for a few examples). To study what is specific in an idiolect and how it evolves over a lifetime, we assembled, cl...
Article
Full-text available
The Corpus for Idiolectal Research (CIDRE) is a collection of fiction works from 11 prolific 19th-century French authors (4 women, 7 men; 22-62 works/author; total of 37 million words). Every work is dated with the year it was written. Using programming scripts, the works have been gathered from open source platforms, for example La Bibliothèque él...
Article
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarit...
Preprint
Full-text available
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models using both a multi-dialectal and a transfer learning approach. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
Article
In spite of the increasingly large textual datasets humanities researchers are confronted with, and the need for automatic tools to extract information from them, we observe a lack of communication and diverging goals between the communities of Natural Language Processing (NLP) and Digital Humanities (DH). This contrasts with the wealth of potentia...
Article
Full-text available
This article focuses on an experiment aimed at extracting information from text in order to automatically feed databases in the field of archaeology. The first experiments concerned a set of books: the Cartes archéologiques de la Gaule (CAG). Knowledge transfer and meaning evolution phenomena were observed when thesauri were examined, since the sam...
Article
Multi-view learning makes use of diverse models arising from multiple sources of input or different feature subsets for the same task. For example, a given natural language processing task can combine evidence from models arising from character, morpheme, lexical, or phrasal views. The most common strategy with multi-view learning, especially popul...
Preprint
Full-text available
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity...
Conference Paper
Full-text available
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models using both a multi-dialectal and a transfer learning approach. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
Chapter
In this contribution, we report on a computational corpus-based study to analyse the semantic evolution of words over time. Though semantic change is complex and not well suited to analytical manipulation, we believe that computational modelling is a crucial tool to study this phenomenon. This study consists of two parts. In the first one, our aim...
Article
Full-text available
Linguistic typology aims to capture structural and semantic variation across the world’s languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human labeled resources. We present an extensive literature survey on the use of typol...
Article
British philosopher and reformer Jeremy Bentham (1748-1832) left over 60,000 folios of unpublished manuscripts. The Bentham Project, at University College London, is creating a TEI version of the manuscripts, via crowdsourced transcription verified by experts. We present here an interface to navigate these largely unedited ma...
Preprint
Addressing the cross-lingual variation of grammatical structures and meaning categorization is a key challenge for multilingual Natural Language Processing. The lack of resources for the majority of the world's languages makes supervised learning not viable. Moreover, the performance of most algorithms is hampered by language-specific biases and th...
Book
How do infants learn a language? Why and how do languages evolve? How do we understand a sentence? This book explores these questions using recent computational models that shed new light on issues related to language and cognition. The chapters in this collection propose original analyses of specific problems and develop computational models that...
Article
This article presents an attempt to apply efficient parsing methods based on recursive neural networks to languages for which very few resources are available. We propose an original approach based on multilingual word embeddings acquired from different languages so as to determine the best language combination for learning. The approach yields com...
Conference Paper
Full-text available
This paper introduces UDLex, a computational framework for the automatic extraction of argument structures for several languages. By exploiting the versatility of the Universal Dependency annotation scheme, our system acquires subcategorization frames directly from a dependency parsed corpus, regardless of the input language. It thus uses a univer...
Chapter
Full-text available
We investigate the notions of continuity and interaction in linguistic models. There is now a quite rich tradition of work based on the hypothesis that natural language is not a discrete model. Instead, continuous models consider word sense, grammar rules and categories as continuous notions: some words are hard to categorize and some rules do not...
Conference Paper
Full-text available
Human languages have multiple strategies that allow us to discriminate objects in a vast variety of contexts. Colours have been extensively studied from this point of view. In particular, previous research in artificial language evolution has shown how artificial languages may emerge based on specific strategies to distinguish colours. Still, it ha...
Conference Paper
University College London (UCL) owns a large corpus of the philosopher and social reformer Jeremy Bentham (1748-1832). Until recently, these papers were for the most part untranscribed, so that very few people had access to the corpus to evaluate its content and its value. The corpus is now being digitized and transcribed thanks to a large number o...
Conference Paper
Full-text available
Text analysis methods widely used in digital humanities often involve word co-occurrence, e.g. concept co-occurrence networks. These methods provide a useful corpus overview, but cannot determine the predicates that relate co-occurring concepts. Our goal was identifying propositions expressing the points supported or opposed by participants in inte...
Article
Translating without understanding: What kind of semantics for machine translation? The translation activity involves understanding the text to be translated so as to transpose the main ideas as precisely as possible in the target language. It is largely assumed that the first generations of machine translation systems (1950-1965) failed because of...
Book
While annotating, visualizing and querying corpora for morphosyntactic and syntactic phenomena raises many methodological problems and requires suitable tools, the same is true, perhaps even more so, for semantic-pragmatic phenomena such as reference and referential transitions. On the one hand...
Article
This paper deals with the exploration of genre-specific phraseology, more precisely, of "sequential patterns" expressing clichés. In our approach, a sequential pattern (or "motif") is a specific and regular lexical and grammatical configuration. This unit is automatically detected; genre-specific motifs are calculated on the basis of two alternativ...
Book
The contributions in this volume of the journal Nouvelles perspectives en sciences sociales examine the value of computer-assisted textual analysis, the various software tools available, and their possible contributions to social science research, in particular the avenues opened toward qualitative modeling...
Article
Full-text available
Research units in archaeology often manage large and precious archives containing various documents, including reports on fieldwork, scholarly studies and reference books. These archives are of course invaluable, recording decades of work, but are generally hard to consult and access. In this context, digitizing full text documents is not enough: i...
Conference Paper
Full-text available
It is now commonplace to observe that we are facing a deluge of online information. Researchers have of course long acknowledged the potential value of this information since digital traces make it possible to directly observe, describe and analyze social facts, and above all the co-evolution of ideas and communities over time. However, most online...
Conference Paper
Full-text available
An English entity linking (EL) workflow is presented, which combines the annotations of five public open source EL services. The annotations are combined through a weighted voting scheme inspired by the ROVER method, which had not been previously tested on EL outputs. The combined results improved over each individual system's results, as evaluated...
Conference Paper
Full-text available
Entity Linking (EL) systems' performance is uneven across corpora or depending on entity types. To help overcome this issue, we propose an EL workflow that combines the outputs of several open source EL systems, and selects annotations via weighted voting. The results are displayed on a UI that allows the users to navigate the corpus and to evaluat...
Conference Paper
Full-text available
Experiments on the emergence of a shared language in a population of agents usually rely on the control of the complexity by the experimenter. In this article we show how agents provided with the autotelic principle, a system by which agents can regulate their own development, progressively develop an emerging language evolving from one word to mul...
Article
This article is about the notion of the instrumental subject, or more generally about sentences in which the subject is the instrument and not the agent of the action expressed by the verb. We show that this phenomenon calls into question the boundaries between semantic roles and syntactic functions. It also reveals the complexity of acceptability...
Article
We propose a new method for extracting multi-word terms from scientific publications. Our strategy combines two approaches: a first list of "candidate" terms is extracted on the basis of frequency and specificity criteria. This list is then ranked according to the position...
Article
Many studies in cognitive linguistics have analyzed the semantics of over, notably the semantics associated with over as a preposition. Most of them generally conclude that over is polysemic and that this polysemy is to be described thanks to a semantic radial network, showing the relationships between the different meanings of the word. What we wo...
Article
Full-text available
In this paper we describe our contribution to the PoliInformatics 2014 Challenge on the 2007-2008 financial crisis. We propose a state of the art technique to extract information from texts and provide different representations, giving first a static overview of the domain and then a dynamic representation of its main evolutions. We show that this...
Article
This paper re-investigates a lexical acquisition system initially developed for French. We show that, interestingly, the architecture of the system reproduces and implements the main components of Optimality Theory. However, we formulate the hypothesis that some of its limitations are mainly due to a poor representation of the constraints used. Fina...
Article
Full-text available
This paper investigates the evolution of the computational linguistics domain through a quantitative analysis of the ACL Anthology (containing around 12,000 papers published between 1985 and 2008). Our approach combines complex system methods with natural language processing techniques. We reconstruct the socio-semantic landscape of the domain by i...
Conference Paper
We propose a new method to extract keywords from texts and categorize these keywords according to their informational value, derived from the analysis of the argumentative goal of the sentences they appear in. The method is applied to the ACL Anthology corpus, containing papers on the computational linguistic domain published between 1980 a...
Article
Automatic language processing for the social sciences: Elements of reflection based on recent experiences. Most textual data available today enable us to see the social sciences in a new light, as texts contain an abundance of as yet unexploited information. The difficulty consists in accessing the right information, in “standardizing” it and then i...
Article
This talk presented a method for the tool-assisted analysis of textual variation within literary texts.
Article
The Workshop on Language, Cognition and Computational Models was held in Paris at Ecole Normale Supérieure (ENS) and at the Institut des Systèmes Complexes de Paris-Ile de France (ISC-PIF), on May 28th and 29th 2013. The goal of this event was to provide a venue for the multidisciplinary discussion of theoretical and practical research for computat...
Article
This paper examines to what extent the massive availability of textual data in digital form has recently changed the way people carry out research in linguistics. Several subfields of the domain require large amounts of attested data: here, we primarily consider the case of corpus linguistics and natural language processing. We consider recent bre...
Book
The nature and amount of information needed for learning a natural language, and the underlying mechanisms involved in this process, are the subject of much debate: how is the knowledge of language represented in the human brain? Is it possible to learn a language from usage data only, or is some sort of innate knowledge and/or bias needed to boost...
Book
Questions related to language acquisition have been of interest for many centuries, as children seem to acquire a sophisticated capacity for processing language with apparent ease, in the face of ambiguity, noise and uncertainty. However, with recent advances in technology and cognitive-related research it is now possible to conduct large-scale com...
Book
Information extraction (IE) and text summarization (TS) are powerful technologies for finding relevant pieces of information in text and presenting them to the user in condensed form. The ongoing information explosion makes IE and TS critical for successful functioning within the information society. These technologies face particular challenges du...
Chapter
Automatic text summarization, the computer-based production of condensed versions of documents, is an important technology for the information society. Without summaries it would be practically impossible for human beings to get access to the ever growing mass of information available online. Although research in text summarization is over 50 years...
Conference Paper
This paper introduces a novel method for joint unsupervised acquisition of verb subcategorization frame (SCF) and selectional preference (SP) information. Treating SCF and SP induction as a multi-way co-occurrence problem, we use multi-way tensor factorization to cluster frequent verbs from a large corpus according to their syntactic and semantic be...
Article
Full-text available
This paper investigates cultural dynamics in social media by examining the proliferation and diversification of clear-cut pieces of content: quoted texts. In line with the pioneering work of Leskovec et al. and Simmons et al. on meme dynamics, we investigate in depth the transformations that quotations published online undergo during their diffusi...
Book
Full-text available
Information extraction (IE) and text summarization (TS) are powerful technologies for finding relevant pieces of information in text and presenting them to the user in condensed form. The ongoing information explosion makes IE and TS critical for successful functioning within the information society. These technologies face particular challenges d...
Article
Full-text available
We introduce ANALEC, a tool whose aim is to bring together corpus annotation, visualization and query management. Our main idea is to provide a unified and dynamic way of annotating textual data. ANALEC allows researchers to dynamically build their own annotation scheme and use the possibilities of scheme revision, data querying and graphical visua...
Article
Full-text available
We would like to propose a new model of meaning construction based on language comprehension considered as a dynamic process during which the meaning of each linguistic unit and the global meaning of the sentence are determined simultaneously. This model, which may be called "gestalt compositionality," is radically opposed to the classic compositio...
Chapter
Full-text available
Given a corpus of financial news items labelled according to the market reaction following their publication, we investigate ‘contemporaneous’ and forward-looking stock price movements. Our approach is to provide a pool of relevant textual features to a machine learning algorithm to detect substantial stock price variations. Our two working hypothes...
Conference Paper
This paper presents a novel method for the computation of word meaning in context. We make use of a factorization model in which words, together with their window-based context words and their dependency relations, are linked to latent dimensions. The factorization model allows us to determine which dimensions are important for a particular context...