Jacques Savoy

Université de Neuchâtel | UniNE · Institut d'informatique (IIUN)

Dr. rer. pol., University of Fribourg (Switzerland)

About

226 Publications
82,697 Reads
3,570 Citations
Citations since 2016
41 Research Items
1234 Citations
[Chart: citations per year, 2016–2022]
Additional affiliations
September 1993 - present
Université de Neuchâtel
Position
  • Professor (Full)
August 1987 - September 1993
Université de Montréal
Position
  • Professor (Assistant)
August 1982 - July 1987
Université de Fribourg
Position
  • PhD Student

Publications

Publications (226)
Chapter
This paper presents the results of a new monitoring project of the US presidential election with the aim of establishing computer-based tools to track the popularity of the two main candidates. The innovative methods that have been designed and developed allow us to extract the frequency of search queries sent to numerous search engines, social media, a...
Article
Full-text available
To determine the gender of a text's author, various feature types have been suggested (e.g., function words, n-grams of letters, etc.), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k nearest-neighbors, support vector ma...
Article
Full-text available
This study analyzes the stylistic and rhetorical aspects of Trump’s tweets, focusing mainly on tweets sent during his electoral and presidential period. As stylometric findings should be based on a comparative basis, a corpus of tweets written by Obama during his presidency, and those posted by the White House account during the same period are emp...
Preprint
Full-text available
Presidential speeches indicate the government's intentions and justifications supported by a dedicated style and rhetoric oscillating between explanation and controversy. Over a period of sixty years, can we observe stylistic variations by the different French presidents of the Fifth Republic (1958-2018)? Based on official transcripts of all their...
Article
Full-text available
This paper presents the results of a new monitoring project of the US presidential elections with the aim of establishing computer-based tools to track in real time the popularity or awareness of candidates. The designed and developed innovative methods allow us to extract the frequency of queries sent to numerous search engines by US Internet user...
Article
Full-text available
Presidential speeches indicate the government’s intentions and justifications supported by a dedicated style and rhetoric oscillating between explanation and controversy. Over a period of 60 years, can we observe stylistic variations by the different French presidents of the Fifth Republic (1958–2018)? Based on official transcripts of all their all...
Chapter
As this book is related to stylometric text classification, this first chapter presents a general introduction to the fundamental problems and questions in this domain. After exposing the diversity of stylometric problems, this chapter defines the scope and limits of the different stylometric applications. As writing style is the main object, a lis...
Chapter
As a first approach, it is assumed that stylistic markers can be detected by considering words, or more precisely, the most frequent ones. This chapter explores several other ways to define useful stylistic traces left by the author. Instead of considering only isolated words, one can explore the usefulness of short sequences of words (called word n...
Chapter
The evaluation of a text classification model is an essential task to measure its effectiveness and to compare it with previously suggested approaches or with some possible variants. Various situations might lead to different measures beginning with the simple accuracy rate to a more detailed analysis based on precision–recall values. Moreover, pro...
Chapter
This second chapter presents some background information that will be useful throughout the entire book. In particular, this chapter presents and defines precisely the notions of word-type, word-token, and lemma. The mathematical notation used in the entire book is also presented and commented on. A running example (the Federalist Papers) is described...
Chapter
This third chapter focuses on the problem of authorship attribution, with leading models proposed and discussed in the humanities community. As the first approach, the Delta model is described with concrete examples extracted from the Federalist Papers corpus. Thereby the reader will have a precise and detailed description of one of the most well-k...
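The Delta model summarized above lends itself to a compact sketch: standardize each very frequent word's relative frequency against corpus-wide statistics, then attribute the disputed text to the author profile with the smallest mean absolute z-score difference. The word list, frequencies, and author profiles below are invented toy numbers, not the Federalist data:

```python
def z_scores(rel_freqs, mean, std):
    """Standardize each word's relative frequency against
    corpus-wide mean and standard deviation."""
    return {w: (rel_freqs[w] - mean[w]) / std[w] for w in rel_freqs}

def delta(z_author, z_text, words):
    """Burrows' Delta: mean absolute difference of z-scores
    over a fixed list of very frequent words."""
    return sum(abs(z_author[w] - z_text[w]) for w in words) / len(words)

# Toy data: corpus-wide mean/std of relative frequencies (invented).
words = ["the", "of", "to"]
mean = {"the": 0.060, "of": 0.030, "to": 0.025}
std = {"the": 0.010, "of": 0.005, "to": 0.005}

hamilton = z_scores({"the": 0.070, "of": 0.033, "to": 0.024}, mean, std)
madison = z_scores({"the": 0.055, "of": 0.028, "to": 0.028}, mean, std)
disputed = z_scores({"the": 0.068, "of": 0.032, "to": 0.025}, mean, std)

# Smaller Delta = closer style; attribute to the nearest profile.
scores = {"Hamilton": delta(hamilton, disputed, words),
          "Madison": delta(madison, disputed, words)}
print(min(scores, key=scores.get))
```

In practice the word list contains tens to hundreds of the most frequent words, and the mean/std come from the whole reference corpus rather than from two profiles.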
Chapter
As many stylometric applications must first learn and represent the distinctive style of different categories or authors, several machine learning algorithms have been suggested to solve the authorship attribution or profiling issues. This sixth chapter presents four important models. Based on vector space representation, the k-nearest neighbors (k...
Chapter
With US political speeches, this chapter illustrates some notions and concepts presented in the first two parts of this book. The studied corpus is composed of 233 annual State of the Union (SOTU) speeches written under 43 presidencies from Washington (Jan. 8th, 1790) to Trump (Feb. 4th, 2020). Moreover, 58 inaugural allocutions uttered by 40 presi...
Chapter
In this chapter several authorship attribution methods are applied to identify the true author behind the Elena Ferrante penname. To achieve this objective, a corpus must be carefully generated as shown in the first section. Next, four authorship models with known high effectiveness levels must be selected and applied to resolve this authorship pro...
Chapter
Some well-known models have been explained in the previous chapter, but various advanced approaches have been suggested. Related to the humanities, the Zeta test focuses on terms used recurrently by one author and mainly ignored by the others. Selecting stylistic markers based on this criterion, the model builds a graph showing the similarities...
Chapter
In this second chapter presenting stylometric applications, social networks, and more precisely Twitter, are the source of our dataset. To explore new forms of communication, this chapter examines the distinct linguistic characteristics of Twitter compared to the traditional oral or written forms. For example, the frequency of mentions (...
Chapter
As presented in the previous chapters, stylometric models and applications are located at the crossroads of several domains such as applied linguistics, statistics, and computer science. This position is not unique but, in a broader view, it corresponds to digital humanities, a field largely open to many relevant research directions and useful appl...
Preprint
Full-text available
Contemporary research into recurrent neural networks (RNNs) focuses on deep architectures which can discover long-range dependencies in textual data. However, whether this property can help in authorship attribution tasks remains an open question and depends on the dataset size, which is intrinsically limited in this field. As a result, this pa...
Book
This book presents methods and approaches used to identify the true author of a doubtful document or text excerpt. It provides a broad introduction to all text categorization problems (like authorship attribution, psychological traits of the author, detecting fake news, etc.) grounded in stylistic features. Specifically, machine learning models as...
Article
Full-text available
This is a report on the tenth edition of the Conference and Labs of the Evaluation Forum (CLEF 2019), held from September 9--12, 2019, in Lugano, Switzerland. CLEF was a four-day event combining a Conference and an Evaluation Forum. The Conference featured keynotes by Bruce Croft, Yair Neuman, and Miguel Martinez, and presentation of peer reviewe...
Chapter
Full-text available
This chapter describes the lessons learnt from the ad hoc track at CLEF in the years 2000 to 2009. This contribution focuses on Information Retrieval (IR) for languages other than English (monolingual IR), as well as bilingual IR (also termed “cross-lingual”; the request is written in one language and the searched collection in another), and multil...
Article
Full-text available
The name Paul appears in 13 epistles, but is he the real author? According to different biblical scholars, the number of letters really attributed to Paul varies from 4 to 13, with a majority agreeing on seven. This article proposes to revisit this authorship attribution problem by considering two effective methods (Burrows' Delta, Labbé's intertex...
Book
This book constitutes the refereed proceedings of the 10th International Conference of the CLEF Association, CLEF 2019, held in Lugano, Switzerland, in September 2019. The conference has a clear focus on experimental information retrieval with special attention to the challenges of multimodality, multilinguality, and interactive search ranging from...
Conference Paper
Full-text available
This paper describes and evaluates two neural models for gender profiling on the PAN@CLEF 2017 tweet collection. The first model is a character-based Convolutional Neural Network (CNN) and the second an Echo State Network-based (ESN) recurrent neural network with various features. We applied these models to the gender profiling task of the PAN17 ch...
Presentation
Full-text available
This paper describes and evaluates two neural models for gender profiling on the PAN@CLEF 2017 tweet collection. The first model is a character-based Convolutional Neural Network (CNN) and the second an Echo State Network-based (ESN) recurrent neural network with various features. We applied these models to the gender profiling task of the PAN1...
Article
Full-text available
Elena Ferrante is a pen name known worldwide, authoring novels such as the bestseller My Brilliant Friend. A recent study indicates that the true author behind these books is probably Domenico Starnone. This study aims to select a set of approved authorship methods and appropriate feature sets to prove, with as much certainty as possible, that this...
Article
Based on n text excerpts, the authorship linking task is to determine a way to link pairs of documents written by the same person together. This problem is closely related to authorship attribution questions and its solution can be used in the author clustering task. However, no training information is provided and the solution must be unsupervised...
Article
Full-text available
Elena Ferrante is a pen name known across the globe, authoring novels such as the bestseller My Brilliant Friend. However, there have been no genuinely scientific studies to determine the true author of these books. With this in mind, this study aims to select a set of approved authorship methods and appropriate feature sets to reveal, with as much...
Article
Full-text available
The text categorization domain proposes many applications, and a classical one is to determine the true author of a document, literary excerpt, threatening email, legal testimony, etc. Recently a tetralogy called My Brilliant Friend has been published under the pen-name Elena Ferrante, first in Italian and then translated into several languages. Various...
Conference Paper
This paper describes and evaluates an unsupervised author clustering model called Spatium. The proposed strategy can be adapted without any difficulty to different natural languages (such as Dutch, English, and Greek) and it can be applied to different text genres (newspaper articles, reviews, excerpts of novels, etc.). As features, we suggest usin...
Article
Distributed language representation (deep learning) has been applied successfully in different applications in natural language processing. Using this model, we propose and implement two new authorship attribution classifiers. In this perspective, a vector-space representation can be generated for each author or disputed text according to words and...
Article
Full-text available
To study the evolution of the rhetoric and language style of the American presidents from 1789 to 2017, we have analyzed all the annual State of the Union (SOTU) and inaugural addresses. Those speeches present the intentions and indicate the legislative priorities of the Chief of the Executive. Based on this relatively fixed form, this analysis cor...
Conference Paper
Full-text available
Article
Full-text available
This paper examines the style and rhetoric of the two main candidates (Hillary Clinton & Donald Trump) during the 2016 US presidential election. To achieve this objective, this study analyzes the oral communication form based on interviews and TV debates both during the primaries and the general election. The speeches delivered during the...
Article
Full-text available
Determining some demographics about the author of a document (e.g., gender, age) has attracted many studies during the last decade. To solve this author profiling task, various classification models have been proposed based on stylistic features (e.g., function word frequencies, n-gram of letters or words, POS distributions), as well as various voc...
Article
This paper examines the verbal style and rhetoric of the candidates of the 2016 US presidential primary elections. To achieve this objective, this study analyzes the oral communication forms used by the candidates during the TV debates. When considering the most frequent lemmas, the candidates can be split into two groups, one using more fr...
Article
This paper describes and evaluates an unsupervised and effective authorship verification model called Spatium-L1. As features, we suggest using the 200 most frequent terms of the disputed text (isolated words and punctuation symbols). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was wr...
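The distance at the heart of Spatium-L1 is an L1 (Manhattan) norm over relative frequencies of the most frequent terms of the disputed text. A minimal sketch, with invented toy texts standing in for the real documents and impostor set:

```python
from collections import Counter

def profile(text, vocab):
    """Relative frequency of each vocabulary term in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[t] / total for t in vocab]

def l1_distance(p, q):
    """Manhattan (L1) distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Toy data: the vocabulary would normally be the ~200 most frequent
# terms of the disputed text (isolated words and punctuation symbols).
disputed = "the cat sat on the mat and the dog sat too"
candidate = "the dog sat on the mat and the cat sat too"
impostor = "quantum fields permeate spacetime near black holes"

vocab = ["the", "sat", "on", "and"]
d_cand = l1_distance(profile(disputed, vocab), profile(candidate, vocab))
d_imp = l1_distance(profile(disputed, vocab), profile(impostor, vocab))

# A much smaller distance to the candidate than to the impostors
# supports the hypothesis that the candidate wrote the disputed text.
print(d_cand < d_imp)
```

The actual verification decision compares the candidate's distance against those of a set of impostor authors rather than a single threshold.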
Article
Full-text available
This is a report on the sixth edition of the Conference and Labs of the Evaluation Forum (CLEF 2015), held in early September 2015, in Toulouse, France. CLEF was a four-day event combining a Conference and an Evaluation Forum. The focus of the Conference is "Experimental IR" as carried out in the CLEF Labs and other evaluation forums; it featured k...
Article
This paper analyses the vocabulary growth over the State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 US presidents). Because the context and the content of these speeches are fixed, this corpus presents a snapshot of the situation of the country on an annual basis. Based on this corpus, this study evaluates the fitness of...
Article
Full-text available
Different computational models have been proposed to automatically determine the most probable author of a disputed text (authorship attribution). These models can be viewed as special approaches in the text categorization domain. In this perspective, in a first step we need to determine the most effective features (words, punctuation symbols, part...
Article
Full-text available
In authorship attribution, various distance-based metrics have been proposed to determine the most probable author of a disputed text. In this paradigm, a distance is computed between each author profile and the query text. These values are then employed only to rank the possible authors. In this article, we analyze their distribution and show that...
Article
Based on State of the Union addresses from 1790 to 2014 (225 speeches delivered by 42 presidents), this paper describes and evaluates different text representation strategies. To determine the most important words of a given text, the term frequencies (tf) or the tf idf weighting scheme can be applied. Recently, latent Dirichlet allocation (LDA) ha...
Article
This paper describes a clustering and authorship attribution study over the State of the Union addresses from 1790 to 2014 (224 speeches delivered by 41 presidents). To define the style of each presidency, we have applied a principal component analysis (PCA) based on the part-of-speech (POS) frequencies. From Roosevelt (1934), each president tends...
Book
This book constitutes the refereed proceedings of the 6th International Conference of the CLEF Initiative, CLEF 2015, held in Toulouse, France, in September 2015. The 31 full papers and 20 short papers presented were carefully reviewed and selected from 68 submissions. They cover a broad range of issues in the fields of multilingual and multimodal...
Chapter
Full-text available
This paper gives an overview of the HisDoc project, which aims at developing adaptable tools to support cultural heritage preservation by making historical documents, particularly medieval documents, electronically available for access via the Internet. HisDoc consists of three major components. The first component is image analysis. It has two mai...
Conference Paper
Full-text available
The Cultural Heritage in CLEF 2013 lab comprised three tasks: multilingual ad-hoc retrieval and semantic enrichment in 13 languages (Dutch, English, German, Greek, Finnish, French, Hungarian, Italian, Norwegian, Polish, Slovenian, Spanish, and Swedish), Polish ad-hoc retrieval and the interactive task, which studied user behavior via log analysis a...
Conference Paper
Full-text available
The authorship attribution (AA) problem can be viewed as a categorization problem. To determine the most effective features to discriminate between different authors, we have evaluated six independent feature-scoring selection functions (information gain, pointwise mutual information, odds ratio, χ2, DIA, and the document frequency (df)). To compar...
Article
Full-text available
This paper presents and evaluates a collaborative attribution strategy based on six authorship attribution schemes representing the two main paradigms used in authorship studies. Based on very frequent words as features, the classical paradigm (or similarity-based methods) proposes to compute an intertextual distance between the disputed text and t...
Article
This article tackles the task of retrieving very short documents via even shorter queries. The problem on hand may relate to the retrieval of tweets, image and table captions, short text messages (SMS) and sponsored retrieval among others. In such cases, document and/or query expansion using thesauri and other external resources (e.g., Wikipedia) u...
Chapter
Full-text available
Our first objective in participating in FIRE evaluation campaigns is to analyze the retrieval effectiveness of various indexing and search strategies when dealing with corpora written in Hindi, Bengali and Marathi languages. As a second goal, we have developed new and more aggressive stemming strategies for both Marathi and Hindi languages during t...
Article
This paper presents and analyzes the experiments done at the University of Neuchatel for both the multilingual and Polish CHiC tasks at CLEF 2013. Within these two tasks, our experiments explore the problems that arise with short text descriptions expressed in various languages having a richer morphology than English. For the multilingual task, eac...
Chapter
Our goal in participating in the FIRE 2011 evaluation campaign is to analyse and evaluate the retrieval effectiveness of our implemented retrieval system when using the Marathi language. We have developed a light and an aggressive stemmer for this language as well as a stopword list. In our experiment seven different IR models (language model, DFR-PL2, DFR...
Article
Full-text available
Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight terms (character n-gram, word, stem, lemma or sequence of them) which characterize a document. We then show how these Z score values...
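The Z score described above follows directly from the binomial assumption: a term's count in the subset is compared to its expected count under the corpus-wide occurrence probability. A short sketch, with invented token counts:

```python
import math

def z_score(term_count_sub, sub_size, term_count_corpus, corpus_size):
    """Standardized Z score of a term in a subset, assuming its
    occurrences follow a binomial distribution with the corpus-wide
    probability p = term_count_corpus / corpus_size."""
    p = term_count_corpus / corpus_size
    expected = sub_size * p
    variance = sub_size * p * (1 - p)
    return (term_count_sub - expected) / math.sqrt(variance)

# Invented counts: the term appears 40 times in a 10,000-token
# subset but only 100 times in a 1,000,000-token corpus.
z = z_score(40, 10_000, 100, 1_000_000)
print(round(z, 2))  # a large positive Z: term is specific to the subset
```

Large positive values flag terms over-used in the subset (its specific vocabulary); large negative values flag under-used terms.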
Article
Full-text available
In this article we propose a technique for computing a standardized Z score capable of defining the specific vocabulary found in a text (or part thereof) compared to that of an entire corpus. Assuming that the term occurrence follows a binomial distribution, this method is then applied to weight terms (words and punctuation symbols in the current s...
Article
Full-text available
The first objective of this paper is to carry out three experiments intended to evaluate authorship attribution methods based on three test-collections available in three different languages (English, French, and German). In the first, we represent and categorize 52 text excerpts written by nine authors and taken from 19th century English novels. In th...
Article
As participants in this CLEF evaluation campaign, our first objective is to propose and evaluate various indexing and search strategies for the CHiC corpus, in order to compare the retrieval effectiveness across different IR models. Our second objective is to measure the relative merit of various stemming strategies when used for the French and En...
Conference Paper
Full-text available
This paper describes and evaluates different IR models and search strategies for digitized manuscripts. Written during the thirteenth century, these manuscripts were digitized using an imperfect recognition system with a word error rate of around 6%. Having access to the internal representation during the recognition stage, we were able to produce...
Conference Paper
Full-text available
Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight terms characterizing a document (or a sample of texts). We then show how these Z score values can be used to derive an efficient cat...
Conference Paper
Full-text available
This paper evaluates degradation in retrieval effectiveness when working with a noisy text corpus. We first start with a clean version of the collection, and then a second for which the recognition error rate is about 5% and a third of 20%. In our experiments we evaluate six IR models based on three text representations (word-based, n-gram, trunc-n...
Conference Paper
Full-text available
In this paper we describe current search technologies available on the web, explain underlying difficulties and show their limits, related to either current technologies or to the intrinsic properties of all natural languages. We then analyze the effectiveness of freely available machine translation services and demonstrate that under certain condi...
Conference Paper
Full-text available
Article
Full-text available
This paper evaluates the degradation in retrieval effectiveness when faced with a noisy text corpus. Using a test-collection with the clean text, another version with around a 5% recognition error rate, and a third with a 20% error rate, we have evaluated six IR models based on three text representations (bag-of-words, n-grams, trunc-n) as we...
Article
In this paper, we present the authorship attribution problem. As text representation, recent studies suggest using a small set of function or very frequent words (50 or 100). On this basis, we can apply either the principal component analysis (PCA) or the correspondence analysis (CA) to visualize the relationships between text surrogates. Using the...
Article
This article describes and evaluates various information retrieval models used to search document collections written in English through submitting queries written in various other languages, either members of the Indo-European family (English, French, German, and Spanish) or radically different language groups such as Chinese. This evaluation meth...
Article
Full-text available
The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world’s 20 most spoken languages and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Ret...
Article
Full-text available
This article describes a US political corpus comprising 245 speeches given by senators John McCain and Barack Obama during the years 2007–2008. We present the main characteristics of this collection and compare the common English words most frequently used by these political leaders with ordinary usage (Brown corpus). We then discuss and compare ce...
Article
Full-text available
This paper describes some methods to automatically extract terms or sequences of terms closely reflecting the content of a corpus or a Web site by comparison with a given corpus. The frequency of occurrences or the rank of the most frequent terms may provide a first overview. The suggested method is based on the terms' distribution according to a bino...
Conference Paper
Full-text available
The default implementation in Lucene, an open-source search engine, is the well-known vector-space model with tf idf weighting. The objective of this paper is to propose and evaluate additional techniques that can be adapted to this search model, in order to meet the particular needs of domain-specific information retrieval (IR). In this paper, we s...
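The tf idf weighting named above as Lucene's default can be sketched in its textbook form (this is not Lucene's actual scoring code, and the toy documents are invented):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term in each document by tf * idf, with
    idf = log(N / df): terms appearing in few documents get
    higher weights, ubiquitous terms get weight zero."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy tokenized collection of three documents.
docs = [["ir", "model", "ranking"],
        ["ir", "evaluation"],
        ["stemming", "ir", "model"]]
w = tf_idf(docs)

# "ir" occurs in every document, so its idf (and weight) is zero,
# while "ranking" occurs in one document and is weighted log(3).
print(w[0]["ir"], w[0]["ranking"])
```

Lucene's scoring additionally applies length normalization and other factors on top of this basic scheme.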
Article
Full-text available
Article
Full-text available
This chapter presents the fundamental concepts of Information Retrieval (IR) and shows how this domain is related to various aspects of NLP. After explaining some of the underlying and often hidden assumptions and problems of IR, we present the notion of indexing. Indexing is the cornerstone of various classical IR paradigms (Boolean, vector-space,...