About
238
Publications
103,098
Reads
3,936
Citations
Additional affiliations
August 1987 - September 1993
August 1982 - July 1987
September 1993 - present
Publications (238)
Using large language models (LLMs), computers are able to generate a written text in response to a user request. As this pervasive technology can be applied in numerous contexts, this study analyses the written style of one LLM called GPT by comparing its generated speeches with those of the recent US presidents. To achieve this objective, the Sta...
Recently several large language models (LLMs) have demonstrated their capability to generate a message in response to a user request. Such scientific breakthroughs promote new perspectives but also some fears. The main focus of this study is to analyze the written style of one LLM called ChatGPT 3.5 by comparing its generated messages with those of...
Over the past sixty-six years, eight presidents successively headed the Fifth French Republic (de Gaulle, Pompidou, Giscard d'Estaing, Mitterrand, Chirac, Sarkozy, Hollande, Macron). After presenting the corpus of their speeches (9,202 texts and more than 20 million labelled words), the style of each of them will be characterized by their vocabulary (l...
Generative AI proposes several large language models (LLMs) to automatically generate a message in response to users' requests. Such scientific breakthroughs promote new writing assistants but with some fears. The main focus of this study is to analyze the written style of one LLM called ChatGPT by comparing its generated messages with those of the...
The automatic assignment of a text to one or more predefined categories presents multiple applications. In this context, the current study focuses on author attribution in which the true author of a doubtful text must be identified. This analysis focuses on the style of sixty-six French comedies in verse written by seventeen supposed authors during...
Playwrights and screenwriters compose dialogues with characters from both genders. Assuming that men and women speak or write differently, can a great author take account of this difference? Previous studies have ascertained some stylistic markers that can discriminate between men and women either in writing or oral productions. The main aim of thi...
This study analyzes the style and content of 25,590 tweets sent by eight candidates during the French presidential election (from 1 September 2021 to 24 April 2022). During this campaign, the candidates have used Twitter intensively to motivate their supporters, putting forward some propositions or to announce their presence on a TV show or at a ra...
This paper presents the results of a new monitoring project of the US presidential election with the aim of establishing computer-based tools to track the popularity of the two main candidates. The innovative methods that have been designed and developed allow us to extract the frequency of search queries sent to numerous search engines, social media, a...
To determine the author of a text's gender, various feature types have been suggested (e.g., function words, n‐gram of letters, etc.) leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k nearest‐neighbors, support vector ma...
This study analyzes the stylistic and rhetorical aspects of Trump’s tweets, focusing mainly on tweets sent during his electoral and presidential period. As stylometric findings should be based on a comparative basis, a corpus of tweets written by Obama during his presidency, and those posted by the White House account during the same period are emp...
Presidential speeches indicate the government's intentions and justifications supported by a dedicated style and rhetoric oscillating between explanation and controversy. Over a period of sixty years, can we observe stylistic variations by the different French presidents of the Fifth Republic (1958-2018)? Based on official transcripts of all their...
This paper presents the results of a new monitoring project of the US presidential elections with the aim of establishing computer-based tools to track in real time the popularity or awareness of candidates. The innovative methods designed and developed allow us to extract the frequency of queries sent to numerous search engines by US Internet user...
Presidential speeches indicate the government’s intentions and justifications supported by a dedicated style and rhetoric oscillating between explanation and controversy. Over a period of 60 years, can we observe stylistic variations by the different French presidents of the Fifth Republic (1958–2018)? Based on official transcripts of all their all...
As this book is related to stylometric text classification, this first chapter presents a general introduction to the fundamental problems and questions in this domain. After exposing the diversity of stylometric problems, this chapter defines the scope and limits of the different stylometric applications. As writing style is the main object, a lis...
As a first approach, it is assumed that stylistic markers can be detected by considering words, or more precisely, the most frequent ones. This chapter explores several other ways to define useful stylistic traces left by the author. Instead of considering only isolated words, one can explore the usefulness of short sequences of words (called word n...
The evaluation of a text classification model is an essential task to measure its effectiveness and to compare it with previously suggested approaches or with some possible variants. Various situations might lead to different measures beginning with the simple accuracy rate to a more detailed analysis based on precision–recall values. Moreover, pro...
This second chapter presents some background information that will be useful throughout the entire book. In particular, this chapter presents and precisely defines the notions of word-type, word-token, and lemma. The mathematical notation used in the entire book is also presented and commented on. A running example (the Federalist Papers) is described...
This third chapter focuses on the problem of authorship attribution, with leading models proposed and discussed in the humanities community. As the first approach, the Delta model is described with concrete examples extracted from the Federalist Papers corpus. Thereby the reader will have a precise and detailed description of one of the most well-k...
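The Delta model described in this chapter can be illustrated compactly. Below is a minimal, hypothetical Python sketch of the classical Burrows' Delta (the mean absolute difference of z-scored frequent-word frequencies between a disputed text and each candidate profile); the exact variant, feature list, and corpus used in the book may differ, and all names here are illustrative.

```python
from collections import Counter

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def burrows_delta(disputed, candidates, vocab):
    """Burrows' Delta: mean absolute difference of z-scored word
    frequencies between the disputed text and each candidate profile.
    `disputed` is a token list; `candidates` maps author -> token list.
    A lower Delta means a more similar style."""
    profiles = {a: relative_freqs(t, vocab) for a, t in candidates.items()}
    n = len(profiles)
    # Mean and standard deviation of each word's frequency across candidates.
    means = [sum(p[i] for p in profiles.values()) / n for i in range(len(vocab))]
    sds = [max((sum((p[i] - means[i]) ** 2 for p in profiles.values()) / n) ** 0.5,
               1e-9)
           for i in range(len(vocab))]
    dz = relative_freqs(disputed, vocab)
    z_disp = [(dz[i] - means[i]) / sds[i] for i in range(len(vocab))]
    deltas = {}
    for author, p in profiles.items():
        z_auth = [(p[i] - means[i]) / sds[i] for i in range(len(vocab))]
        deltas[author] = sum(abs(a - b) for a, b in zip(z_auth, z_disp)) / len(vocab)
    return deltas
```

On a toy corpus where author A over-uses "the" and author B over-uses "of", a disputed text resembling A receives the lower Delta, which is the attribution decision rule.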
As many stylometric applications must first learn and represent the distinctive style of different categories or authors, several machine learning algorithms have been suggested to solve the authorship attribution or profiling issues. This sixth chapter presents four important models. Based on vector space representation, the k-nearest neighbors (k...
With US political speeches, this chapter illustrates some notions and concepts presented in the first two parts of this book. The studied corpus is composed of 233 annual State of the Union (SOTU) speeches written under 43 presidencies from Washington (Jan. 8th, 1790) to Trump (Feb. 4th, 2020). Moreover, 58 inaugural allocutions uttered by 40 presi...
In this chapter several authorship attribution methods are applied to identify the true author behind the Elena Ferrante penname. To achieve this objective, a corpus must be carefully generated as shown in the first section. Next, four authorship models with known high effectiveness levels must be selected and applied to resolve this authorship pro...
Some well-known models have been explained in the previous chapter, but various advanced approaches have been suggested. Related to the humanities, the Zeta test focuses on terms used recurrently by one author and mainly ignored by the others. Selecting stylistic markers based on this criterion, the model builds a graph showing the similarities...
In this second chapter presenting stylometric applications, the social networks, and more precisely Twitter, are the source of our dataset. To explore new forms of communication, this chapter explores the distinct linguistic characteristics related to Twitter compared to the traditional oral or written form. For example, the frequency of mentions (...
As presented in the previous chapters, stylometric models and applications are located at the crossroads of several domains such as applied linguistics, statistics, and computer science. This position is not unique but, in a broader view, it corresponds to digital humanities, a field largely open to many relevant research directions and useful appl...
Contemporary research into recurrent neural networks (RNNs) focuses on deep architectures that can discover long-range dependencies in textual data. However, whether this property can help in authorship attribution tasks remains an open question and is dependent on the dataset size, which is intrinsically limited in this field. As a result, this pa...
This book presents methods and approaches used to identify the true author of a doubtful document or text excerpt. It provides a broad introduction to all text categorization problems (like authorship attribution, psychological traits of the author, detecting fake news, etc.) grounded in stylistic features. Specifically, machine learning models as...
This is a report on the tenth edition of the Conference and Labs of the Evaluation Forum (CLEF 2019), held from September 9--12, 2019, in Lugano, Switzerland.
CLEF was a four-day event combining a Conference and an Evaluation Forum.
The Conference featured keynotes by Bruce Croft, Yair Neuman, and Miguel Martinez, and presentation of peer reviewe...
This chapter describes the lessons learnt from the ad hoc track at CLEF in the years 2000 to 2009. This contribution focuses on Information Retrieval (IR) for languages other than English (monolingual IR), as well as bilingual IR (also termed “cross-lingual”; the request is written in one language and the searched collection in another), and multil...
The name Paul appears in 13 epistles, but is he the real author? According to different biblical scholars, the number of letters really attributed to Paul varies from 4 to 13, with a majority agreeing on seven. This article proposes to revisit this authorship attribution problem by considering two effective methods (Burrows' Delta, Labbé's intertex...
This book constitutes the refereed proceedings of the 10th International Conference of the CLEF Association, CLEF 2019, held in Lugano, Switzerland, in September 2019.
The conference has a clear focus on experimental information retrieval with special attention to the challenges of multimodality, multilinguality, and interactive search ranging from...
This paper describes and evaluates two neural models for gender profiling on the PAN@CLEF 2017 tweet collection. The first model is a character-based Convolutional Neural Network (CNN) and the second an Echo State Network-based (ESN) recurrent neural network with various features. We applied these models to the gender profiling task of the PAN17 ch...
Elena Ferrante is a pen name known worldwide, authoring novels such as the bestseller My Brilliant Friend. A recent study indicates that the true author behind these books is probably Domenico Starnone. This study aims to select a set of approved authorship methods and appropriate feature sets to prove, with as much certainty as possible, that this...
Based on n text excerpts, the authorship linking task is to determine a way to link pairs of documents written by the same person together. This problem is closely related to authorship attribution questions and its solution can be used in the author clustering task. However, no training information is provided and the solution must be unsupervised...
Elena Ferrante is a pen name known across the globe, authoring novels such as the bestseller My Brilliant Friend. However, there have been no genuinely scientific studies to determine the true author of these books. With this in mind, this study aims to select a set of approved authorship methods and appropriate feature sets to reveal, with as much...
Text categorization domain proposes many applications and a classical one is to determine the true author of a document, literary excerpt, threatening email, legal testimony, etc. Recently a tetralogy called My Brilliant Friend has been published under the pen-name Elena Ferrante, first in Italian and then translated into several languages. Various...
This paper describes and evaluates an unsupervised author clustering model called Spatium. The proposed strategy can be adapted without any difficulty to different natural languages (such as Dutch, English, and Greek) and it can be applied to different text genres (newspaper articles, reviews, excerpts of novels, etc.). As features, we suggest usin...
Distributed language representation (deep learning) has been applied successfully in different applications in natural language processing. Using this model, we propose and implement two new authorship attribution classifiers. In this perspective, a vector-space representation can be generated for each author or disputed text according to words and...
To study the evolution of the rhetoric and language style of the American presidents from 1789 to 2017, we have analyzed all the annual State of the Union (SOTU) and inaugural addresses. Those speeches present the intentions and indicate the legislative priorities of the Chief of the Executive. Based on this relatively fixed form, this analysis cor...
UMUSE: Monitoring of the online demand (search volume) for keywords from the US Election 2016
This present paper examines the style and rhetoric of the two main candidates (Hillary Clinton & Donald Trump) during the 2016 US presidential election. To achieve this objective, this study analyzes the oral communication form based on interviews and TV debates both during the primaries and the general election. The speeches delivered during the...
Determining some demographics about the author of a document (e.g., gender, age) has attracted many studies during the last decade. To solve this author profiling task, various classification models have been proposed based on stylistic features (e.g., function word frequencies, n-gram of letters or words, POS distributions), as well as various voc...
This present paper examines the verbal style and rhetoric of the candidates of the 2016 US presidential primary elections. To achieve this objective, this study analyzes the oral communication forms used by the candidates during the TV debates. When considering the most frequent lemmas, the candidates can be split into two groups, one using more fr...
This paper describes and evaluates an unsupervised and effective authorship verification model called Spatium-L1. As features, we suggest using the 200 most frequent terms of the disputed text (isolated words and punctuation symbols). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was wr...
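The distance at the core of the Spatium-L1 model just described can be sketched in a few lines. This hypothetical sketch shows only the L1 distance over a fixed vocabulary of frequent terms; the impostor-comparison step that produces the final verification decision is omitted, and the function name and data are illustrative.

```python
def spatium_l1(profile_a, profile_b, vocab):
    """L1 (Manhattan) distance between two relative-frequency profiles,
    restricted to a fixed vocabulary (e.g., the 200 most frequent terms
    of the disputed text, including punctuation symbols). A smaller
    distance indicates more similar word-usage profiles."""
    return sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in vocab)
```

A profile here is simply a mapping from term to its relative frequency in one text; terms absent from a profile contribute their full weight in the other profile to the distance.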
This is a report on the sixth edition of the Conference and Labs of the Evaluation Forum (CLEF 2015), held in early September 2015, in Toulouse, France. CLEF was a four-day event combining a Conference and an Evaluation Forum. The focus of the Conference is "Experimental IR" as carried out in the CLEF Labs and other evaluation forums; it featured k...
This paper analyses the vocabulary growth over the State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 US presidents). Because the context and the content of these speeches are fixed, this corpus presents a snapshot of the situation of the country on an annual basis. Based on this corpus, this study evaluates the fitness of...
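Vocabulary-growth analyses of this kind commonly evaluate power-law models such as Heaps' (Herdan's) law; whether that is the exact family of models assessed in this paper is not stated in the truncated abstract, so the sketch below is a hedged illustration only, with purely illustrative constants.

```python
def heaps_vocab(n_tokens, k=44.0, beta=0.49):
    """Heaps' (Herdan's) law: estimated vocabulary size V(n) = k * n**beta
    after n running tokens. The constants k and beta must be fitted to a
    given corpus; the defaults here are illustrative, not values fitted
    to the State of the Union addresses."""
    return k * n_tokens ** beta
```

With beta < 1 the vocabulary keeps growing with corpus size but ever more slowly, which is the qualitative behavior such fitness evaluations test against observed type counts.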
Different computational models have been proposed to automatically determine the most probable author of a disputed text (authorship attribution). These models can be viewed as special approaches in the text categorization domain. In this perspective, in a first step we need to determine the most effective features (words, punctuation symbols, part...
In authorship attribution, various distance-based metrics have been proposed to determine the most probable author of a disputed text. In this paradigm, a distance is computed between each author profile and the query text. These values are then employed only to rank the possible authors. In this article, we analyze their distribution and show that...
Based on State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 presidents), this paper describes and evaluates different text representation strategies. To determine the most important words of a given text, the term frequencies (tf) or the tf idf weighting scheme can be applied. Recently, latent Dirichlet allocation (LDA) ha...
This paper describes a clustering and authorship attribution study over the State of the Union addresses from 1790 to 2014 (224 speeches delivered by 41 presidents). To define the style of each presidency, we have applied a principal component analysis (PCA) based on the part-of-speech (POS) frequencies. From Roosevelt (1934), each president tends...
This book constitutes the refereed proceedings of the 6th International Conference of the CLEF Initiative, CLEF 2015, held in Toulouse, France, in September 2015.
The 31 full papers and 20 short papers presented were carefully reviewed and selected from 68 submissions. They cover a broad range of issues in the fields of multilingual and multimodal...
This paper gives an overview of the HisDoc project, which aims at developing adaptable tools to support cultural heritage preservation by making historical documents, particularly medieval documents, electronically available for access via the Internet. HisDoc consists of three major components. The first component is image analysis. It has two mai...
The Cultural Heritage in CLEF 2013 lab comprised three tasks: multilingual ad-hoc retrieval and semantic enrichment in 13 languages (Dutch, English, German, Greek, Finnish, French, Hungarian, Italian, Norwegian, Polish, Slovenian, Spanish, and Swedish), Polish ad-hoc retrieval and the interactive task, which studied user behavior via log analysis a...
The authorship attribution (AA) problem can be viewed as a categorization problem. To determine the most effective features to discriminate between different authors, we have evaluated six independent feature-scoring selection functions (information gain, pointwise mutual information, odds ratio, χ2, DIA, and the document frequency (df)). To compar...
This paper presents and evaluates a collaborative attribution strategy based on six authorship attribution schemes representing the two main paradigms used in authorship studies. Based on very frequent words as features, the classical paradigm (or similarity-based methods) proposes to compute an intertextual distance between the disputed text and t...
This article tackles the task of retrieving very short documents via even shorter queries. The problem on hand may relate to the retrieval of tweets, image and table captions, short text messages (SMS) and sponsored retrieval among others. In such cases, document and/or query expansion using thesauri and other external resources (e.g., Wikipedia) u...
Our first objective in participating in FIRE evaluation campaigns is to analyze the retrieval effectiveness of various indexing and search strategies when dealing with corpora written in Hindi, Bengali and Marathi languages. As a second goal, we have developed new and more aggressive stemming strategies for both Marathi and Hindi languages during t...
This paper presents and analyzes the experiments done at the University of Neuchatel for both the multilingual and Polish CHiC tasks at CLEF 2013. Within these two tasks, our experiments explore the problems that arise when dealing with short text descriptions expressed in various languages having a richer morphology than English. For the multilingual task, eac...
Our goal in participating in FIRE 2011 evaluation campaign is to analyse and evaluate the retrieval effectiveness of our implemented retrieval system when using Marathi language. We have developed a light and an aggressive stemmer for this language as well as a stopword list. In our experiment seven different IR models (language model, DFR-PL2, DFR...
Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight terms (character n-gram, word, stem, lemma or sequence of them) which characterize a document. We then show how these Z score values...
In this article we propose a technique for computing a standardized Z score capable of defining the specific vocabulary found in a text (or part thereof) compared to that of an entire corpus. Assuming that the term occurrence follows a binomial distribution, this method is then applied to weight terms (words and punctuation symbols in the current s...
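The standardized Z score under the binomial assumption described in these abstracts can be written directly from the stated model. The sketch below is a minimal, hypothetical rendering (function and parameter names are illustrative): a term's document count is compared to its expected count under the corpus-wide occurrence probability.

```python
import math

def z_score(term_count_doc, doc_len, term_count_corpus, corpus_len):
    """Standardized Z score of a term in a document, assuming term
    occurrences follow a binomial distribution with the corpus-wide
    probability p. Large positive values mark terms specific to the
    document (over-used); large negative values mark under-used terms."""
    p = term_count_corpus / corpus_len        # corpus-wide probability
    expected = doc_len * p                    # expected count in the document
    sd = math.sqrt(doc_len * p * (1.0 - p))   # binomial standard deviation
    return (term_count_doc - expected) / sd
```

For example, a term appearing 30 times in a 100-token document, while representing 10% of a 1,000-token corpus, is expected 10 times with standard deviation 3, giving a strongly positive Z score.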
The first objective of this paper is to carry out three experiments intended to evaluate authorship attribution methods based on three test-collections available in three different languages (English, French, and German). In the first we represent and categorize 52 text excerpts written by nine authors and taken from 19th century English novels. In th...
As participants in this CLEF evaluation campaign, our first objective is to propose and evaluate various indexing and search strategies for the CHiC corpus, in order to compare the retrieval effectiveness across different IR models. Our second objective is to measure the relative merit of various stemming strategies when used for the French and En...
This paper describes and evaluates different IR models and search strategies for digitized manuscripts. Written during the thirteenth century, these manuscripts were digitized using an imperfect recognition system with a word error rate of around 6%. Having access to the internal representation during the recognition stage, we were able to produce...
Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight terms characterizing a document (or a sample of texts). We then show how these Z score values can be used to derive an efficient cat...
This paper evaluates degradation in retrieval effectiveness when working with a noisy text corpus. We first start with a clean version of the collection, then a second for which the recognition error rate is about 5%, and a third of about 20%. In our experiments we evaluate six IR models based on three text representations (word-based, n-gram, trunc-n...
In this paper we describe current search technologies available on the web, explain underlying difficulties and show their limits, related to either current technologies or to the intrinsic properties of all natural languages. We then analyze the effectiveness of freely available machine translation services and demonstrate that under certain condi...
Based on different writing style definitions, various authorship attribution schemes have been proposed to identify the real author of a given text or text excerpt. In this article we analyze the relative performance of word types or lemmas assigned to represent styles and texts. As a second objective we compare two authorship attribution approac...
This paper evaluates the retrieval effectiveness degradation when facing a noisy text corpus. Using a test-collection with the clean text, another version with around a 5% recognition error rate, and a third with a 20% error rate, we have evaluated six IR models based on three text representations (bag-of-words, n-grams, trunc-n) as we...
In this paper, we present the authorship attribution problem. As text representation, recent studies suggest using a small set of function or very frequent words (50 or 100). On this basis, we can apply either the principal component analysis (PCA) or the correspondence analysis (CA) to visualize the relationships between text surrogates. Using the...
This article describes and evaluates various information retrieval models used to search document collections written in English through submitting queries written in various other languages, either members of the Indo-European family (English, French, German, and Spanish) or radically different language groups such as Chinese. This evaluation meth...
The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world’s 20 most spoken languages and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Ret...
This article describes a US political corpus comprising 245 speeches given by senators John McCain and Barack Obama during the years 2007–2008. We present the main characteristics of this collection and compare the common English words most frequently used by these political leaders with ordinary usage (Brown corpus). We then discuss and compare ce...
This paper describes some methods to automatically extract terms or sequences of terms closely reflecting the content of a corpus or a Web site, in comparison with a given corpus. The frequency of occurrences or the rank of the most frequent terms may provide a first overview. The suggested method is based on the term distribution according to a bino...
The default implementation in Lucene, an open-source search engine, is the well-known vector-space model with tf idf weighting. The objective of this paper is to propose and evaluate additional techniques that can be adapted to this search model, in order to meet the particular needs of domain-specific information retrieval (IR). In this paper, we s...
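The tf idf weighting mentioned here can be sketched in its textbook form. Note this is a hypothetical illustration of raw term frequency times log(N/df), not Lucene's exact scoring formula (which adds length normalization and other factors); names are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook tf-idf weights for a list of tokenized documents:
    weight(t, d) = tf(t, d) * log(N / df(t)), where N is the number of
    documents and df(t) the number of documents containing term t.
    Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

A term occurring in every document gets weight zero (log of 1), while rarer terms are boosted, which is the intuition behind tf idf's discrimination power.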
This chapter presents the fundamental concepts of Information Retrieval (IR) and shows how this domain is related to various aspects of NLP. After explaining some of the underlying and often hidden assumptions and problems of IR, we present the notion of indexing. Indexing is the cornerstone of various classical IR paradigms (Boolean, vector-space,...
The Information Retrieval (IR) (Manning et al., 2008) domain can be viewed, to a certain extent, as a successful applied domain of NLP. The speed and scale of Web take-up around the world have been made possible by freely available and effective search engines. These tools are used by around 85% of Web surfers when looking for some specific informa...
In this brief communication, we evaluate the use of two stopword lists for the English language (one comprising 571 words and another with 9) and compare them with a search approach accounting for all word forms. We show that through implementing the original Okapi form or certain ones derived from the Divergence from Randomness (DFR) paradigm, sig...
This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strate...
This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on Czech test-collection, we have designed and evaluated two stemming approaches, a light and a more aggressive one. We have compared them with a no stemming scheme as well as a language-independent approach (n-gram). To evaluate the suggested...
The main goal of this paper is to describe and evaluate different indexing and stemming strategies for the Farsi (Persian) language. For this Indo-European language we have suggested a stopword list and a light stemmer. We have compared this stemmer to an indexing strategy in which the stemming procedure was omitted, with or without stopword list remo...
This paper compares and illustrates the use of manually and automatically assigned descriptors on German documents extracted from the GIRT Corpus. A second objective is to analyze the usefulness of both specialized and general thesauri to automatically enhance queries. To illustrate our results we use different search models such as a vector space mod...