Thierry Poibeau's research while affiliated with Ecole Normale Supérieure de Paris and other places
What is this page?
This page lists the scientific contributions of an author who either does not have a ResearchGate profile or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
Publications (166)
We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a mo...
We present a novel neural model for modern poetry generation in French. The model consists of two pretrained neural models that are fine-tuned for the poem generation task. The encoder of the model is a RoBERTa based one while the decoder is based on GPT-2. This way the model can benefit from the superior natural language understanding performance...
A central quest of probing is to uncover how pre-trained models encode a linguistic property within their representations. An encoding, however, might be spurious, i.e., the model might not rely on it when making predictions. In this paper, we try to find encodings that the model actually uses, introducing a usage-based probing setup. We first choos...
Although transformer-based Neural Language Models demonstrate impressive performance on a variety of tasks, their generalization abilities are not well understood. They have been shown to perform strongly on subject-verb number agreement in a wide array of settings, suggesting that they learned to track syntactic dependencies during their training...
The style and language of an author evolves over time, but how and to what extent? Is evolution linear or is it more erratic? In stylometry, those questions are often addressed with hierarchical clustering. Hierarchical clustering is popular in Digital Humanities to classify texts by degree of similarity. When texts can be ordered chronologically,...
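The clustering step described in this abstract can be illustrated with a toy example: each text is reduced to a vector of function-word frequencies, and the vectors are grouped agglomeratively. This is a minimal sketch with invented data and a naive single-linkage routine, not the tooling used in the study.

```python
# Minimal stylometric clustering sketch: texts as function-word
# frequency profiles, grouped by naive single-linkage agglomeration.
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(profiles, n_clusters):
    # Start with each text in its own cluster; repeatedly merge the
    # two closest clusters (minimum pairwise distance) until only
    # n_clusters remain.
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Toy "texts": relative frequencies of three frequent function words.
profiles = [
    [0.051, 0.032, 0.020],   # early work
    [0.050, 0.033, 0.021],   # early work
    [0.030, 0.045, 0.041],   # late work
    [0.031, 0.044, 0.040],   # late work
]
clusters = single_linkage(profiles, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

When the texts are ordered chronologically, inspecting which works end up in the same cluster is what lets such studies ask whether stylistic evolution is linear or erratic.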
It is well known that the idiolect (the language of an individual) evolves over time. However, there is a lack of quantitative studies on this topic, due to the lack of large corpora (but see Barlow 2013; Mollin 2009; Petré et al. 2019 for a few examples). To study what is specific in an idiolect and how it evolves over a lifetime, we assembled, cl...
The Corpus for Idiolectal Research (CIDRE) is a collection of fiction works from 11 prolific 19th-century French authors (4 women, 7 men; 22-62 works/author; total of 37 million words). Every work is dated with the year it was written. Using programming scripts, the works have been gathered from open source platforms, for example La Bibliothèque él...
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarit...
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models, using both multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
In spite of the increasingly large textual datasets humanities researchers are confronted with, and the need for automatic tools to extract information from them, we observe a lack of communication and diverging goals between the communities of Natural Language Processing (NLP) and Digital Humanities (DH). This contrasts with the wealth of potentia...
This article focuses on an experiment aimed at extracting information from text in order to automatically feed databases in the field of archaeology. The first experiments concerned a set of books: the Cartes archéologiques de la Gaule (CAG). Knowledge transfer and meaning evolution phenomena were observed when thesauri were examined, since the sam...
Multi-view learning makes use of diverse models arising from multiple sources of input or different feature subsets for the same task. For example, a given natural language processing task can combine evidence from models arising from character, morpheme, lexical, or phrasal views. The most common strategy with multi-view learning, especially popul...
We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity...
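The evaluation protocol such a benchmark typically supports is a rank correlation between model similarity scores and human ratings. A minimal sketch with invented word pairs and scores (and a tie-free Spearman implementation, not the benchmark's official scorer):

```python
# Spearman rank correlation between human similarity ratings and
# hypothetical model scores for a handful of word pairs.
def rank(values):
    # Ascending ranks; assumes no ties (enough for this toy data).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(xs, ys):
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical human ratings vs. model scores for four word pairs,
# e.g. (car, automobile) down to (cup, war).
human = [9.1, 7.3, 2.0, 0.5]
model = [0.85, 0.30, 0.60, 0.10]
print(round(spearman(human, model), 3))
```

A higher correlation means the model's similarity judgments better track the human annotations for that language.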
In this contribution, we report on a computational corpus-based study to analyse the semantic evolution of words over time. Though semantic change is complex and not well suited to analytical manipulation, we believe that computational modelling is a crucial tool to study this phenomenon. This study consists of two parts. In the first one, our aim...
Linguistic typology aims to capture structural and semantic variation across the world’s languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human labeled resources. We present an extensive literature survey on the use of typol...
British philosopher and reformer Jeremy Bentham (1748-1832) left over 60,000 folios of unpublished manuscripts. The Bentham Project, at University College London, is creating a TEI version of the manuscripts, via crowdsourced transcription verified by experts. We present here an interface to navigate these largely unedited ma...
Addressing the cross-lingual variation of grammatical structures and meaning categorization is a key challenge for multilingual Natural Language Processing. The lack of resources for the majority of the world's languages makes supervised learning not viable. Moreover, the performance of most algorithms is hampered by language-specific biases and th...
How do infants learn a language? Why and how do languages evolve? How do we understand a sentence? This book explores these questions using recent computational models that shed new light on issues related to language and cognition. The chapters in this collection propose original analyses of specific problems and develop computational models that...
This article presents an attempt to apply efficient parsing methods based on recursive neural networks to languages for which very few resources are available. We propose an original approach based on multilingual word embeddings acquired from different languages so as to determine the best language combination for learning. The approach yields com...
This paper introduces UDLex, a computational framework for the automatic extraction of argument structures for several languages. By exploiting the versatility of the Universal Dependency annotation scheme, our system acquires subcategorization frames directly from a dependency parsed corpus, regardless of the input language. It thus uses a univer...
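The frame-extraction idea can be sketched on a single hand-written dependency parse: for each verb, a subcategorization frame is read off as the set of core dependency relations of its dependents. This illustrates the general principle only, not the UDLex system itself; the sentence and the relation inventory below are simplified.

```python
# Reading a subcategorization frame off a Universal Dependencies
# style parse. Each token: (id, form, head, deprel).
sentence = [
    (1, "She", 2, "nsubj"),
    (2, "gave", 0, "root"),
    (3, "him", 2, "iobj"),
    (4, "a", 5, "det"),
    (5, "book", 2, "obj"),
]
VERB_IDS = {2}   # ids of verbal tokens (from POS tags in practice)
CORE = {"nsubj", "obj", "iobj", "ccomp", "xcomp", "obl"}

frames = {}
for vid in VERB_IDS:
    form = next(form for i, form, *_ in sentence if i == vid)
    # Collect core relations of the verb's direct dependents.
    deps = sorted(rel for i, f, head, rel in sentence
                  if head == vid and rel in CORE)
    frames[form] = "+".join(deps)
print(frames)
```

Because the relation labels are the same across UD treebanks, the same loop applies unchanged to any language with a UD parse, which is the language-independence the abstract describes.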
We investigate the notions of continuity and interaction in linguistic models. There is now a quite rich tradition of work based on the hypothesis that natural language is not a discrete model. Instead, continuous models consider word sense, grammar rules and categories as continuous notions: some words are hard to categorize and some rules do not...
Human languages have multiple strategies that allow us to discriminate objects in a vast variety of contexts. Colours have been extensively studied from this point of view. In particular, previous research in artificial language evolution has shown how artificial languages may emerge based on specific strategies to distinguish colours. Still, it ha...
University College London (UCL) owns a large corpus of the philosopher and social reformer Jeremy Bentham (1748-1832). Until recently, these papers were for the most part untranscribed, so that very few people had access to the corpus to evaluate its content and its value. The corpus is now being digitized and transcribed thanks to a large number o...
Text analysis methods widely used in digital humanities often involve word co-occurrence, e.g. concept co-occurrence networks. These methods provide a useful corpus overview, but cannot determine the predicates that relate co-occurring concepts. Our goal was identifying propositions expressing the points supported or opposed by participants in inte...
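The co-occurrence baseline that this passage contrasts with can be sketched in a few lines: concepts sharing a sentence are linked, with edge weights counting shared sentences. Corpus and concept list are invented for illustration.

```python
# Build a tiny concept co-occurrence network: concepts appearing
# in the same sentence get an edge, weighted by sentence count.
from itertools import combinations
from collections import Counter

concepts = {"parliament", "reform", "prison", "law"}
sentences = [
    "the parliament debated the reform of the law",
    "the reform of the prison system",
    "a new law on prison inspection",
]

edges = Counter()
for sentence in sentences:
    present = sorted(concepts & set(sentence.split()))
    for a, b in combinations(present, 2):
        edges[(a, b)] += 1

for (a, b), w in sorted(edges.items()):
    print(a, b, w)
```

Note what the network does not contain: the predicate linking the concepts ("debated", "opposed", ...), which is exactly the information the abstract's proposition-extraction goal targets.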
Translating without understanding: What kind of semantics for machine translation?
The translation activity involves understanding the text to be translated so as to transpose the main ideas as precisely as possible in the target language. It is largely assumed that the first generations of machine translation systems (1950-1965) failed because of...
While annotating, visualizing and querying corpora for morphosyntactic and syntactic phenomena raises many methodological problems and requires dedicated tools, the same is true, perhaps even more so, of semantic-pragmatic phenomena such as reference and referential transitions. On the one hand...
This paper deals with the exploration of genre-specific phraseology, more precisely, of "sequential patterns" expressing clichés. In our approach, a sequential pattern (or "motif") is a specific and regular lexical and grammatical configuration. This unit is automatically detected; genre-specific motifs are calculated on the basis of two alternativ...
The contributions in this volume of the journal Nouvelles perspectives en sciences sociales examine the value of computerized textual analysis, the various software packages available, their possible contributions to research in the social sciences, and in particular the avenues opened towards qualitative modeling...
Research units in archaeology often manage large and precious archives containing various documents, including reports on fieldwork, scholarly studies and reference books. These archives are of course invaluable, recording decades of work, but are generally hard to consult and access. In this context, digitizing full text documents is not enough: i...
It is now commonplace to observe that we are facing a deluge of online information. Researchers have of course long acknowledged the potential value of this information since digital traces make it possible to directly observe, describe and analyze social facts, and above all the co-evolution of ideas and communities over time. However, most online...
An English entity linking (EL) workflow is presented, which combines the annotations of five public open source EL services. The annotations are combined through a weighted voting scheme inspired by the ROVER method, which had not been previously tested on EL outputs. The combined results improved over each individual system's results, as evaluated...
Entity Linking (EL) systems' performance is uneven across corpora or depending on entity types. To help overcome this issue, we propose an EL workflow that combines the outputs of several open source EL systems, and selects annotations via weighted voting. The results are displayed on a UI that allows the users to navigate the corpus and to evaluat...
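A ROVER-style weighted vote over entity-linking outputs can be sketched as follows; the system names, weights, and acceptance threshold are hypothetical, not those of the workflow described above.

```python
# Combine entity-linking annotations from several systems by
# weighted voting: each system votes for an entity per mention,
# and a mention is kept if the winning entity's weight share
# reaches the threshold.
from collections import defaultdict

def combine_annotations(annotations, weights, threshold=0.5):
    """annotations: {system_name: {mention_span: entity_id}}."""
    total = sum(weights.values())
    votes = defaultdict(lambda: defaultdict(float))
    for system, ann in annotations.items():
        for span, entity in ann.items():
            votes[span][entity] += weights[system]
    combined = {}
    for span, tally in votes.items():
        entity, score = max(tally.items(), key=lambda kv: kv[1])
        if score / total >= threshold:
            combined[span] = entity
    return combined

# Three hypothetical EL systems annotating two mentions.
annotations = {
    "sysA": {("Paris", 0): "Q90", ("Bentham", 1): "Q60887"},
    "sysB": {("Paris", 0): "Q90"},
    "sysC": {("Paris", 0): "Q167646"},  # disagrees: Paris, Texas
}
weights = {"sysA": 0.5, "sysB": 0.3, "sysC": 0.2}
result = combine_annotations(annotations, weights)
print(result)
```

Weights would in practice be derived from each system's measured accuracy per corpus or entity type, which is how such a combination can smooth out the uneven performance the abstract mentions.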
Experiments on the emergence of a shared language in a population of agents usually rely on the control of the complexity by the experimenter. In this article we show how agents provided with the autotelic principle, a system by which agents can regulate their own development, progressively develop an emerging language evolving from one word to mul...
This article is about the notion of the instrumental subject, or more generally about sentences in which the subject is the instrument and not the agent of the action expressed by the verb. We show that this phenomenon calls into question the boundaries between semantic roles and syntactic functions. It also reveals the complexity of acceptability...
We propose a new method for extracting multi-word terms from scientific publications. Our strategy combines two approaches: a first list of "candidate" terms is extracted on the basis of frequency and specificity criteria; this list is then ranked according to the position...
Many studies in cognitive linguistics have analyzed the semantics of over, notably the semantics associated with over as a preposition. Most of them generally conclude that over is polysemic and that this polysemy is to be described thanks to a semantic radial network, showing the relationships between the different meanings of the word. What we wo...
In this paper we describe our contribution to the PoliInformatics 2014 Challenge on the 2007-2008 financial crisis. We propose a state of the art technique to extract information from texts and provide different representations, giving first a static overview of the domain and then a dynamic representation of its main evolutions. We show that this...
This paper re-investigates a lexical acquisition system initially developed for French. We show that, interestingly, the architecture of the system reproduces and implements the main components of Optimality Theory. However, we formulate the hypothesis that some of its limitations are mainly due to a poor representation of the constraints used. Fina...
This paper investigates the evolution of the computational linguistics domain through a quantitative analysis of the ACL Anthology (containing around 12,000 papers published between 1985 and 2008). Our approach combines complex system methods with natural language processing techniques. We reconstruct the socio-semantic landscape of the domain by i...
We propose a new method to extract keywords from texts and categorize these keywords according to their informational value, derived from the analysis of the argumentative goal of the sentences they appear in. The method is applied to the ACL Anthology corpus, containing papers on the computational linguistics domain published between 1980 a...
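The idea of weighting keywords by the argumentative role of their sentences can be sketched crudely: sentences opening with goal-stating cue phrases boost the words they contain. The cue list, boost value, and content-word filter below are invented for illustration, not the paper's method.

```python
# Score keywords higher when they occur in "goal-stating" sentences,
# detected here by a crude cue-phrase list.
from collections import Counter

CUES = ("we propose", "we present", "our goal")

def keyword_scores(sentences, boost=2.0):
    scores = Counter()
    for sentence in sentences:
        s = sentence.lower()
        weight = boost if s.startswith(CUES) else 1.0
        for word in s.split():
            if len(word) > 4:          # crude content-word filter
                scores[word] += weight
    return scores

sentences = [
    "We propose a parsing method for historical texts",
    "The corpus contains historical newspapers",
]
scores = keyword_scores(sentences)
print(scores["parsing"], scores["corpus"])
```

A real system would replace the cue list with a classifier over argumentative zones, but the ranking principle, keywords inheriting the informational value of their sentences, is the same.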
Automatic language processing for the social sciences: elements of reflection based on recent experiences
Most textual data available today enable us to see the social sciences in a new light, as texts contain an abundance of as yet unexploited information. The difficulty consists in accessing the right information, in “standardizing” it and then i...
This talk presented a method for the tool-assisted analysis of textual variation within literary texts.
The Workshop on Language, Cognition and Computational Models was held in Paris at Ecole Normale Supérieure (ENS) and at the Institut des Systèmes Complexes de Paris-Ile de France (ISC-PIF), on May 28th and 29th 2013. The goal of this event was to provide a venue for the multidisciplinary discussion of theoretical and practical research for computat...
This paper examines to what extent the massive availability of textual data in digital form has recently changed the way people carry out research in linguistics. Several subfields of the domain require large amounts of attested data: here, we primarily consider the case of corpus linguistics and natural language processing. We consider recent bre...
The nature and amount of information needed for learning a natural language, and the underlying mechanisms involved in this process, are the subject of much debate: how is the knowledge of language represented in the human brain? Is it possible to learn a language from usage data only, or is some sort of innate knowledge and/or bias needed to boost...
Questions related to language acquisition have been of interest for many centuries, as children seem to acquire a sophisticated capacity for processing language with apparent ease, in the face of ambiguity, noise and uncertainty. However, with recent advances in technology and cognitive-related research it is now possible to conduct large-scale com...
Information extraction (IE) and text summarization (TS) are powerful technologies for finding relevant pieces of information in text and presenting them to the user in condensed form. The ongoing information explosion makes IE and TS critical for successful functioning within the information society. These technologies face particular challenges du...
Automatic text summarization, the computer-based production of condensed versions of documents, is an important technology for the information society. Without summaries it would be practically impossible for human beings to get access to the ever growing mass of information available online. Although research in text summarization is over 50 years...
This paper introduces a novel method for joint unsupervised acquisition of verb subcategorization frame (SCF) and selectional preference (SP) information. Treating SCF and SP induction as a multi-way co-occurrence problem, we use multi-way tensor factorization to cluster frequent verbs from a large corpus according to their syntactic and semantic be...
This paper investigates cultural dynamics in social media by examining the proliferation and diversification of clearly-cut pieces of content: quoted texts. In line with the pioneering work of Leskovec et al. and Simmons et al. on meme dynamics, we investigate in depth the transformations that quotations published online undergo during their diffusi...
We introduce ANALEC, a tool whose aim is to bring together corpus annotation, visualization and query management. Our main idea is to provide a unified and dynamic way of annotating textual data. ANALEC allows researchers to dynamically build their own annotation scheme and use the possibilities of scheme revision, data querying and graphical visua...
We would like to propose a new model of meaning construction based on language comprehension considered as a dynamic process during which the meaning of each linguistic unit and the global meaning of the sentence are determined simultaneously. This model, which may be called "gestalt compositionality," is radically opposed to the classic compositio...
Given a corpus of financial news items labelled according to the market reaction following their publication, we investigate ‘contemporaneous’ and forward-looking stock price movements. Our approach is to provide a pool of relevant textual features to a machine learning algorithm to detect substantial stock price variations. Our two working hypothes...
This paper presents a novel method for the computation of word meaning in context. We make use of a factorization model in which words, together with their window-based context words and their dependency relations, are linked to latent dimensions. The factorization model allows us to determine which dimensions are important for a particular context...
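As a rough analogue of such latent-dimension models, one can factor a word-by-context co-occurrence matrix with a truncated SVD and compare words in the reduced space. This uses a plain SVD rather than the paper's factorization model, and the counts are invented for illustration.

```python
# Factor a small word-by-context co-occurrence matrix into latent
# dimensions with a truncated SVD, then compare words in that space.
import numpy as np

words = ["bank", "river", "money", "loan"]
contexts = ["water", "shore", "interest", "credit"]
# Hypothetical co-occurrence counts.
M = np.array([
    [2.0, 1.0, 3.0, 4.0],   # bank: mixes both senses
    [5.0, 4.0, 0.0, 0.0],   # river
    [0.0, 0.0, 5.0, 4.0],   # money
    [0.0, 0.0, 4.0, 5.0],   # loan
])

U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                      # keep two latent dimensions
W = U[:, :k] * S[:k]       # word vectors in latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# In latent space, "money" ends up closer to "loan" than to "river".
print(cosine(W[2], W[3]) > cosine(W[2], W[1]))
```

The contextualization step the abstract describes goes further: the latent dimensions relevant to a particular occurrence are reweighted by its observed context words and dependency relations, so an ambiguous word like "bank" is pulled toward one sense or the other.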