UJM at CLEF in Author Identification
Notebook for PAN at CLEF 2014
Jordan Fréry1, Christine Largeron1, and Mihaela Juganaru-Mathieu2
1Laboratoire Hubert Curien, Université de Lyon, F-42023, Saint-Etienne, France
2Institut H. Fayol, École Nationale Supérieure des Mines, F-42023 Saint-Etienne, France
jordan.frery@gmail.com, christine.largeron@univ-st-etienne.fr, mathieu@emse.fr
Abstract This article describes our proposal for the Author Identification task of the PAN CLEF 2014 Challenge. We adopt a machine learning approach based on several representations of the texts and on optimized decision trees that take various attributes as input and are learnt separately for each training corpus. Our method ranked 2nd overall, with an AUC of 70.7% and a C@1 of 68.4%, and between the 1st and the 6th place on the six corpora.
1 Introduction
The Author Identification (AI) task of the CLEF-PAN Challenge consists in solving a large set of problems of the following form: given a set A of sample texts, all written by a single author, and a questioned document u, determine whether u was written by the author of A. The difficulties of this task are various: the lack of data per author (sometimes A contains only one text) and languages that we do not know or are not able to understand.
We decided to represent the documents in different vector spaces and by various types of features:
– length of the sentences,
– variety of the vocabulary,
– words, character n-grams, word n-grams,
– punctuation marks.
For each feature, we considered two numerical values: a mean and a counter. Another global counter was also used. Because we are not able to indicate in advance which features are the most important, we used decision trees, an adapted version of CART, to learn a decision model suited to each kind of document. Thus, each corpus, defined by a language and a genre, has its own learned tree.
So, our proposal is based on:
– vector space models and attributes that represent the documents as effectively as possible,
– the formulation of the Author Verification problem as a supervised classification problem,
– the evaluation of this approach on the different groups of problems provided in the challenge.
Section 2 describes the vector spaces that we chose to represent the documents. Section 3 is dedicated to the methodological approach. Section 4 presents the experiments and the results obtained on the training set and in the challenge. We finish with some conclusions and perspectives.
2 Textual representation
A problem inside a corpus consists of a given set A of documents written by the same author and another document u whose author is unknown. The aim is to decide whether u has the same author as the documents d_i in A.
2.1 Vector space models
In order to represent the textual documents as vectors, we use different vector space models. The first one is the well known term frequency-inverse document frequency (tf-idf) weighting scheme introduced by Salton [1]. This model is very efficient at isolating terms (words or characters) that are frequent in one document and not in the others. A document d in a corpus A is represented as a vector of weights d = (w_1, ..., w_j, ..., w_|T|), where the weight w_j of the term t_j in d corresponds to the product of the term frequency tf_j of t_j in d by the inverse document frequency idf(j), defined by:

idf(j) = log( |A| / |{d ∈ A : t_j ∈ d}| )
This representation can be defined for terms corresponding either to words or char-
acters. Moreover, in order to take into account the variety of the style and vocabulary,
we consider representations based on the punctuation, length of phrases and diversity
of the vocabulary as detailed in Table 1.
Representation space          Term / model                                                   Comparison method
R1  Character 8-grams         tf-idf                                                         cosine similarity
R2  Character 3-grams         tf-idf                                                         correlation coefficient
R3  Word 2-grams              tf-idf                                                         correlation coefficient
R4  Word 1-grams              tf-idf without the 30% most frequent words                     correlation coefficient
R5  Word 1-grams              tf-idf without stop words                                      correlation coefficient
R6  Phrases                   mean and standard deviation of words per sentence              correlation coefficient
R7  Vocabulary diversity      number of distinct terms divided by total word occurrences     Euclidean distance
R8  Punctuation               average of punctuation marks ("," ";" ":" "(" ")" "!" "?") per sentence   cosine similarity
Table 1. List of representation spaces and comparison measures
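As a minimal sketch (assuming scikit-learn's TfidfVectorizer and placeholder documents, not necessarily our exact settings), the tf-idf based spaces R1, R3 and R5 of Table 1 could be built as follows:

# Illustrative construction of some tf-idf representation spaces of Table 1.
# The document contents below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["first known text ...", "second known text ...", "unknown text ..."]

# R1: character 8-grams weighted by tf-idf
r1 = TfidfVectorizer(analyzer="char", ngram_range=(8, 8))
X1 = r1.fit_transform(documents)

# R3: word 2-grams weighted by tf-idf
r3 = TfidfVectorizer(analyzer="word", ngram_range=(2, 2))
X3 = r3.fit_transform(documents)

# R5: word unigrams weighted by tf-idf, stop words removed
r5 = TfidfVectorizer(analyzer="word", stop_words="english")
X5 = r5.fit_transform(documents)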
2.2 Documents comparison
Our approach requires comparing all documents inside a corpus using the cosine similarity, the Euclidean distance or the correlation coefficient. These measures are normalized: between 0 and 1 for the Euclidean distance and the cosine similarity, and between -1 and 1 for the correlation coefficient. For two documents represented as vectors d_i and d_j, the cosine similarity cos(d_i, d_j) is defined as follows:

cos(d_i, d_j) = (d_i · d_j) / (||d_i|| ||d_j||)

The cosine similarity equals 1 when the documents have the same representation. Conversely, if two documents are highly different, the cosine similarity will tend towards 0. The correlation coefficient corrcoef(d_i, d_j) between two documents is given by:

corrcoef(d_i, d_j) = C_ij / sqrt(C_ii C_jj)

where C_ij denotes the covariance between the documents d_i and d_j.
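As an illustration (a minimal sketch, not tied to our implementation), the two comparison measures can be computed for dense vectors with numpy:

# Comparison measures defined above, for documents given as dense numpy vectors.
import numpy as np

def cosine_similarity(di, dj):
    # cos(di, dj) = di . dj / (||di|| ||dj||)
    return float(np.dot(di, dj) / (np.linalg.norm(di) * np.linalg.norm(dj)))

def correlation_coefficient(di, dj):
    # corrcoef(di, dj) = C_ij / sqrt(C_ii * C_jj), with C the covariance matrix
    c = np.cov(di, dj)
    return float(c[0, 1] / np.sqrt(c[0, 0] * c[1, 1]))

# Example with two short placeholder vectors
a = np.array([1.0, 0.5, 0.0, 2.0])
b = np.array([0.9, 0.4, 0.1, 1.8])
print(cosine_similarity(a, b), correlation_coefficient(a, b))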
Table 1 presents the different representation spaces and the measures we used to
compare the documents belonging to a corpus. In our methodological approach, we
extract two attributes for each representation space of Table 1 in order to represent the
unknown documents.
3 Methodological approach
Given a corpus P containing all the documents having the same language and the same type, we have |P| problems to solve and, for each problem, there are one or several documents written by the same author and one document u whose author is unknown. Thus, the dataset of the supervised learning problem contains all the unknown documents of one corpus, described by 17 attributes and by the class, which has two modalities: SameAuthor or DifferentAuthor. In supervised learning, models are learnt by splitting the dataset into two subsets. The first one, called the learning set, is used to learn the model, in our case a decision tree. The second subset, called the test set, is used to evaluate the model. The decision tree learnt during the learning step is used to predict the class of each unknown document corresponding to a problem. The quality of the decision rules is evaluated by computing the correct classification rate or the area under the ROC curve (AUC) obtained by comparing the predicted class and the true class of the unknown documents belonging to the test set. The accuracy of the models depends largely on the predictive power of the attributes. This leads us to define two attributes per representation space and a global attribute.
3.1 Attributes definition
We use a dissimilarity counter method that we designed during our experiments on the PAN 2013 Author Identification corpora and which yielded very good results [2]. We chose to reuse it for PAN 2014 in a modified version. Note that this method only works for problems with at least two known texts (|A| >= 2).
Let P be the set of problems provided for one corpus, where A_p is the set of documents written by one author and u_p is the unknown document of a problem p, p = 1, ..., |P|:

P = {(A_p, u_p), p = 1, ..., |P|}
For each document u_p corresponding to a given problem, and for each representation space R_v, v ∈ {1, ..., 8}, we calculate two attributes count_v(u_p) and mean_v(u_p) as follows:

count_v(u_p) = |{ d_i ∈ A_p : min{ s(d_i, d_j), d_j ∈ A_p \ {d_i} } < s(d_i, u_p) }|

mean_v(u_p) = (1 / |A_p|) × Σ_{d_i ∈ A_p} s(d_i, u_p)
These two attributes are computed for each representation space. Consequently, since v ∈ {1, ..., 8}, we have 16 attributes. A last attribute, TOT_count(u_p), is built to have a more global representation:

TOT_count(u_p) = Σ_{v=1}^{8} count_v(u_p)

Finally we have 17 attributes describing each unknown document belonging to a problem provided for one corpus composed of the documents having the same language and the same genre.
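A minimal sketch of this attribute extraction, assuming a generic similarity function s and, for each representation space, the vectors of the known documents A_p and of the unknown document u_p (an illustration rather than our exact code):

# count_v, mean_v and TOT_count for one unknown document u_p.
# Requires at least two known documents per problem (|A_p| >= 2).

def count_attribute(A, u, s):
    # count_v(u_p): number of known documents d_i whose minimal similarity with
    # the other known documents is smaller than their similarity with u_p
    n = 0
    for i, di in enumerate(A):
        min_intra = min(s(di, dj) for j, dj in enumerate(A) if j != i)
        if min_intra < s(di, u):
            n += 1
    return n

def mean_attribute(A, u, s):
    # mean_v(u_p): average similarity between u_p and the known documents
    return sum(s(di, u) for di in A) / len(A)

def build_attributes(spaces, s_by_space):
    # spaces: list of (A_v, u_v) pairs, one per representation space v = 1..8
    # s_by_space: the comparison measure associated with each space (Table 1)
    attributes = []
    for (A, u), s in zip(spaces, s_by_space):
        attributes += [count_attribute(A, u, s), mean_attribute(A, u, s)]
    attributes.append(sum(attributes[0::2]))  # TOT_count: sum of the 8 counters
    return attributes  # 17 values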
3.2 Decision tree classifier
For the Author Verification task, we used the Classification and Regression Trees (CART) algorithm, which constructs binary trees using the features and thresholds that yield the largest information gain at each node [3]. The trees are built using each corpus of the training set separately, in such a way as to obtain one tree per corpus. We train the classifier with the attributes detailed previously plus the true label of the given unknown documents. At each step, the attribute that best splits the set of unknown documents into the two classes is chosen using the Gini impurity. To avoid overfitting, we apply post-pruning, which consists in building the tree that classifies the training set perfectly and then pruning it [4].
For each problem of the corpus, the decision tree has the following information for
the unknown document:
– count_v(u_p), v ∈ {1, ..., 8}
– mean_v(u_p), v ∈ {1, ..., 8}
– TOT_count(u_p)
– class(u_p), the true label of the problem
These data allow us to build rules that classify 100% of the training problems correctly. In order to handle overfitting, we remove all leaves covering less than 5% of the total number of problems, so as to keep more general rules. Moreover, we decided not to answer problems that were not significant enough. The rule we set is that when the probability for a text to be written by the same author is between 0.4 and 0.6, we change the probability to 0.5, meaning that we choose not to answer this problem. So finally there are three modalities for the class: SameAuthor, DifferentAuthor or undefined.
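A minimal sketch of the per-corpus classification step with scikit-learn's CART implementation; the min_samples_leaf setting is only an approximation of the post-pruning described above, and column 1 of predict_proba is assumed to correspond to the SameAuthor class:

# One decision tree per corpus, trained on the 17 attributes, with abstention.
from sklearn.tree import DecisionTreeClassifier

def train_corpus_tree(X_train, y_train):
    # Avoid leaves covering fewer than 5% of the problems, mimicking the
    # removal of small leaves from the fully grown tree.
    n = len(y_train)
    tree = DecisionTreeClassifier(criterion="gini",
                                  min_samples_leaf=max(1, int(0.05 * n)))
    return tree.fit(X_train, y_train)

def predict_with_abstention(tree, X):
    # Probability of the SameAuthor class; answers in [0.4, 0.6] become 0.5,
    # i.e. the problem is left unanswered.
    proba = tree.predict_proba(X)[:, 1]
    proba[(proba >= 0.4) & (proba <= 0.6)] = 0.5
    return proba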
4 Experimentation and results
For the learning step, the implementation was done in Python. We used the scikit-learn library³ for the n-gram representations and for CART.
4.1 Learning
The experimentation has been made on the training corpus, which contains 696 problems labelled as DE, DR, GR, EN, EE or SP, where D stands for Dutch (DE, DR), GR for Greek, SP for Spanish and E for English (EE, EN). We have essays and reviews for Dutch (DE, DR) and essays and novels for English (EE, EN). For the experimentation, we performed a 10-fold cross-validation on each group of problems in order to evaluate the quality of the decision trees on the training set.
Table 2 shows, for each corpus, the number of problems and the area under the ROC curve (AUC) obtained on the training dataset.
Corpus     EN    EE    DR    DE    SP    GR
Problems   100   200   100   96    100   100
AUC        89%   70%   68%   91%   77%   76%
Table 2. 10-fold cross-validation on the training corpus
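A minimal sketch of this evaluation step, assuming X is an array with the 17 attributes per problem and y the binary labels (1 = SameAuthor):

# 10-fold cross-validation AUC on one training corpus, as in Table 2.
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def corpus_auc(X, y, folds=10):
    tree = DecisionTreeClassifier(criterion="gini")
    proba = cross_val_predict(tree, X, y, cv=folds, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)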
The following tree is the one learnt for the English essays corpus, where "samples" is the number of problems remaining at a node; there are 200 problems to treat. [Decision tree figure; feature legend: X[5] = mean_R5(u_p), X[0] = mean_R3(u_p), X[1] = mean_R2(u_p), X[15] = mean_R6(u_p), X[4] = mean_R1(u_p), X[10] = count_R7(u_p), X[16] = mean_R8(u_p).]
³ http://scikit-learn.org
4.2 Evaluation
The evaluation of the decision trees built during the learning step was done during the competition. Table 3 contains the official results of PAN 14 in Author Identification for our team, computed by the organizers of the challenge.
Corpus                  EN     EE     DR     DE     SP     GR
AUC                     61%    72%    60%    90%    77%    68%
C@1                     59%    71%    58%    90%    75%    64%
Time (min)              3:10   0:54   0:08   0:29   1:00   0:57
Final rank (AUC, C@1)   7/13   1/13   6/13   2/13   4/13   7/12
Rank (exec. time)       3/13   3/13   3/13   4/13   3/13   3/12
Table 3. Challenge evaluation results
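For reference, the evaluation measures can be sketched as follows; the C@1 formula follows the definition used at PAN (unanswered problems, here those with probability exactly 0.5, receive partial credit proportional to the accuracy on answered ones), and the final score is assumed to be the product AUC × C@1:

# Sketch of the challenge measures from the predictions with abstention.
from sklearn.metrics import roc_auc_score

def c_at_1(y_true, proba):
    # Answers with probability exactly 0.5 are considered unanswered.
    n = len(y_true)
    answered = [(p, t) for p, t in zip(proba, y_true) if p != 0.5]
    n_u = n - len(answered)
    n_c = sum(1 for p, t in answered if (p > 0.5) == bool(t))
    return (n_c + n_u * n_c / n) / n

def final_score(y_true, proba):
    return roc_auc_score(y_true, proba) * c_at_1(y_true, proba)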
5 Conclusion
With overall scores of 0.707 for AUC and 0.684 for C@1, we obtain a final score (AUC × C@1) of 0.484, which is the second best submission. As shown in Table 3, we obtain the 1st rank on the English essays corpus, 2nd on the Dutch essays corpus and 4th on the Spanish corpus. For these corpora, the results we obtained were consistent with the ones we had while training our decision trees. However, we lost a lot of accuracy on the English novels corpus (nearly 30% of loss). We would need to study the evaluation corpus to understand why we had such a loss of accuracy. Moreover, our approach is not time-consuming, as shown in Table 3.
During this challenge, we saw that the most difficult part was to gather features that complement each other. The use of CART allows us to identify good predictive features. However, we have not tried all possibilities for text representation. Moreover, building efficient features, like the counter method, highly improves the accuracy of CART for some corpora.
References
1. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, USA (1983)
2. Juola, P., Stamatatos, E.: Overview of the Author Identification Task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) Working Notes Papers of the CLEF 2013 Evaluation Labs (2013)
3. Breiman, L., Friedman, J., Olshen, R.A., Stone, C.J.: Classification and Regression Trees (1984)
4. Quinlan, J.R.: Simplifying decision trees. Int. J. Man-Mach. Stud. 27(3), 221–234 (Sep 1987)