Capturing Style in Author and Document Representation
Enzo Terreau (a), Antoine Gourru (b,*) and Julien Velcin (b)
(a) Université de Lyon, Lyon 2, ERIC UR3083
(b) Laboratoire Hubert Curien, UMR CNRS 5516, Saint-Etienne, France
Abstract. A wide range of Deep Natural Language Processing (NLP) models integrate continuous and low-dimensional representations of words and documents. Surprisingly, very few models study representation learning for authors. These representations can be used for many NLP tasks, such as author identification and classification, or in recommendation systems. A strong limitation of existing works is that they do not explicitly capture writing style, making them hardly applicable to literary data. We therefore propose a new architecture based on the Variational Information Bottleneck (VIB) that learns embeddings for both authors and documents with a stylistic constraint. Our model fine-tunes a pre-trained document encoder. We stimulate the detection of writing style by adding predefined stylistic features, making the representation axes interpretable with respect to writing style indicators. We evaluate our method on three datasets: a literary corpus extracted from the Gutenberg Project, the Blog Authorship Corpus and IMDb62, on which we show that it matches or outperforms strong recent baselines in authorship attribution while capturing the authors' stylistic aspects much more accurately.
1 Introduction
Deep models for Natural Language Processing are usually based on Transformers, and they rely on latent intermediate representations. These representations are usually built in a self-supervised manner on a language modeling task, such as Masked Language Modeling (MLM) [6] or auto-regressive training [3]. They constitute a good feature space for solving downstream tasks, for example classification or generation, even though some of those tasks are still difficult to handle with prompt-based generative models like ChatGPT [24]. Additionally, some efforts have been made to benefit from large pre-trained models to represent documents [4, 25] and even authors, with contributions like Usr2Vec [2], Aut2Vec [10], and DGEA [12]. The main drawback of these models is that they were shown by [35] to mainly focus on topics rather than on stylistic features of the text. Yet, capturing writing style can be of much interest for some applications.
When working with literary data or in forensic investigation [38], practitioners are generally interested in detecting similarities in writing style regardless of the topics covered by the authors. Author style can be defined as every writing choice made independently of semantic content, often studied through various linguistic and syntactic features. As demonstrated by [35], most author embedding techniques rely on the semantic content of documents: a poem and a piece of fiction both about flowers will be placed close together in the latent space, regardless of their strong differences in sentence construction, structure, etc.
* Corresponding Author. Email: antoine.gourru@univ-st-etienne.fr.
Figure 1. Author and book representations from R-PGD.
We present a 2D projection with T-SNE of VADES document and author embeddings on R-PGD. Books are represented with diamonds, authors with dots. The bigger the dot, the bigger the learnt author variance.
As an answer to these limitations, we propose a new model that builds a representation space capturing writing style, by using stylistic metrics as additional input features. We follow [19] and leverage the Variational Information Bottleneck (VIB) framework [1], which was shown to outperform classical pointwise contrastive training. More precisely, we propose to use it to fine-tune a pre-trained document encoder (such as [4]) and author representations on an authorship attribution task. This is, to our knowledge, the first time this framework is applied to author representation learning. We then add an additional term to the objective function to enforce that the representations capture stylistic features. We name this new model VADES. Using pre-trained models allows us to benefit from accurate intermediate text representations, built on ready-to-use language resources. In Figure 1, we present a subset of authors from the Project Gutenberg and the representations of the documents they wrote. The size of an author's vector is proportional to its variance, learnt by using the VIB framework. As expected, some productions that are stylistic outliers for their author (e.g., Thus Spake Zarathustra from Nietzsche) lie closer in the representation space to books of the same genre. More precisely, our model allows 1) capturing author and document style, 2) building an interpretable representation space usable by researchers in linguistics and literature as well as the public at large, 3) predicting stylistic features, such as readability indexes and NER frequencies, more accurately than existing neural methods, and 4) accurately identifying a document's author, even for documents unseen during training.
After a presentation of related work, we introduce the theoretical foundations of the VIB framework; we then describe our model and how it is optimized. In the last section, we present experimental results on two tasks: author identification and stylistic feature prediction. Our experiments demonstrate that our model outperforms or matches existing author embedding methods, in addition to being able to infer representations for unseen documents, measure the semantic uncertainty of authors and documents, and capture author stylistic information.
2 Related Works
2.1 Author Embedding Models
Word embedding, popularized by [22], was then extended to document embedding by the same authors. More recent works [4] propose different aggregation functions of word embeddings, based on LSTMs, Transformers, and Deep Averaging Networks, to build (short) document-level representations. The aggregation is learnt through classification or document pairing. More recently, [25] proceeded in a similar way by fine-tuning a BERT model [6].
There are also specific works focusing on author embeddings. The Author Topic Model (ATM) [26] is a hierarchical graphical model, optimized through Gibbs sampling. It produces a distribution over jointly learnt topic factors that can be used as author features. Aut2vec [10] learns representations of authors and documents that can separate true observed pairs from negatively sampled (document, author) pairs. The distance between two representations feeds an activation function producing the probability that the pair is observed in the corpus. This approach concatenates two sub-models: the Link Info model, which takes pairs of collaborating authors, and the Content Info model, which uses (author, document) pairs. It cannot infer representations of unseen documents and authors: the embeddings are parameters of an embedding layer. The Usr2vec model [2] learns author representations from pre-trained word vectors. The authors use the same objective as [22], adding an author id to learn the representations.
2.2 Writing Style-oriented Embedding Models
While there is no consensual definition of writing style, it has always been a widely addressed research topic. In computational linguistics, the approach of [16] is often cited as a reference and gives the following definition: “Style is, on a surface level, very obviously detectable as the choice between items in a vocabulary, between types of syntactical constructions, between the various ways a text can be woven from the material it is made of.”, and the author further concludes on the “impossibility of drawing a clean line between meaning and style”. That is why style is commonly defined as every writing choice carrying no semantic information.
Based on this definition, it is hard, if not impossible, to produce a clean annotated dataset classifying different writing styles. The workaround in most studies is to identify the stylistic features that are most useful for associating an author with their production. This line of work starts in the 19th century with [21] and the most basic features (e.g., word and punctuation frequencies, hapax legomena, average sentence length). More recent works focus on function word frequencies [41], hybrid variables such as character n-grams [34, 31], or even Part-Of-Speech (POS) and Named Entity Recognition (NER) tag frequencies, using authorship prediction as evaluation.
Several methods try to use these stylistic features to learn document representations. For example, [20] use Doc2Vec on documents of character trigrams annotated with their position in the word and whether they contain punctuation (NGRAM Doc2Vec). According to the authors, this allows capturing both content and writing style. In another work, word and POS tag embeddings are learnt together before passing them through a CNN to get a sentence representation [14]. These sentence representations are then fed into an LSTM with a final attention layer to compute the document representation. This model is trained on the authorship attribution task.

Figure 2. VADES in one picture.
We draw a single representation z_d using the reparametrization trick. Author means and variances are trainable parameters (embedding layers). L_VADES computes the probability of the author/document pair being observed, plus a regularization term and a stylometric feature-based loss, see Eq. 5.
Some works claim to capture this information in an unsupervised manner. DBert-ft [13] fine-tunes DistilBERT on the authorship attribution task, assuming that an author's writing style must be consistent across their documents, and thus that this task allows building a “stylometric latent space” when the model is trained on a reference set. Yet, for all the above models, no author representation is explicitly learnt.
3 Our model: VADES
3.1 Goal and VIB Framework
We deal with a set of documents, such as literature or blog posts. We assume each document is written by one author. Each document of index d is preprocessed to extract a vector z_d^f of r = 300 stylistic features, following [35].
Our goal is threefold: i) we want to build author and document representations in the same space R^r such that their proximity captures their stylistic similarity (Figure 1), ii) we want to learn a measure of variability in style for each document and author, and iii) we want our model to incorporate an off-the-shelf pre-trained text encoder such as Sentence-BERT or USE to benefit from their complex language understanding, fine-tuned on the dataset at hand using the objective we have just defined. To do that, we build an architecture based on the Variational Information Bottleneck (VIB) framework.
The VIB framework is a variational extension of the Information Bottleneck principle [36], proposed by [1]. The general objective, for a set of observations x with labels y and latent representations z of these observations, is:

$$\arg\max_{z} \; I(z, y) - \beta\, I(z, x), \qquad (1)$$

where I is the well-known Mutual Information measure, defined as:

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx\, dy. \qquad (2)$$

The Information Bottleneck aims at maximally compressing the information in z, such that z remains highly informative regarding the labels, i.e. z can be used to predict the labels y. With y being a set of relevant stylistic features, we would like to maximize the stylistic information captured by the representation, while minimizing the semantic one. β ≥ 0 is a hyper-parameter that controls the balance between the two sub-objectives.
In this approach, p(z|x) (the “encoding law”) is defined by modeling choices. Most of the time, the mutual information is intractable. We then obtain a lower bound of Eq. 1 by using variational approximations, thanks to [1]:

$$\mathcal{L}_{vib} = \mathbb{E}\left[\log q(y \mid z)\right] - \beta\, KL\left(p(z \mid x) \,\|\, q(z)\right), \qquad (3)$$

where q(y|z) is a variational approximation of p(y|z) and q(z) approximates p(z). Maximizing Eq. 3 leads to increasing Eq. 1.
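To make this objective concrete, here is a minimal PyTorch sketch of (the negative of) Eq. 3 for a binary prediction task, under the common choices of a Gaussian encoding law and a standard normal prior q(z). This is our own illustration, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def vib_loss(logits, labels, mu, sigma, beta):
    """Negative of Eq. 3: -E[log q(y|z)] + beta * KL(p(z|x) || q(z)).

    logits: scores for q(y|z), computed from a sampled z
    mu, sigma: parameters of the encoding law p(z|x) = N(mu, diag(sigma^2))
    The prior q(z) is taken as the standard normal N(0, I).
    """
    nll = F.binary_cross_entropy_with_logits(logits, labels)
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I)
    kl = 0.5 * (sigma.pow(2) + mu.pow(2) - 1.0 - sigma.pow(2).log()).sum(-1).mean()
    return nll + beta * kl
```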
3.2 VIB for Embedding with Stylistic Constraints
[23] propose to use this framework to learn probabilistic representations of images. They leverage an instance of the framework based on siamese networks with a (soft) contrastive loss objective, separating positive observed pairs of images (y = 1) from negative examples (y = 0). We extend this model to document and author embedding with a stylistic constraint. Each author a (resp. document d) is associated with a stochastic representation z_a (resp. z_d) that is unobserved (i.e., latent). Additionally, each document is associated with a stylistic feature vector z_d^f that is beforehand extracted from the corpus with usual NLP toolkits. We assume that the dimensions of z_a, z_d and z_d^f are the same (r).
We build a set of pairs (a, d) with label y_a = 1 if a wrote d. We additionally draw k negative pairs (a, d) for each observed pair, associated with label y_a = 0, where a is not an author of d. The encoding laws p(z|x) for authors and documents are normal laws. To capture stylistic information, we also build a set of document pairs (d, d′) with label y_f = 1 when d′ = d, and we draw k negative pairs (d, d′), d′ ≠ d, for each observed pair, associated with label y_f = 0. These pairs are used to train the stylistic objective: the representation z_d of a document should be close to its feature vector z_d^f.
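As an illustration, the construction of the author pairs can be sketched as follows (a minimal Python sketch; the `author_of` mapping and the id formats are our own assumptions):

```python
import random

def build_author_pairs(documents, authors, author_of, k):
    """One observed (author, document) pair with y_a = 1 per document,
    plus k negatively sampled pairs with y_a = 0."""
    pairs = []
    for d in documents:
        a = author_of[d]
        pairs.append((a, d, 1))                           # observed pair
        for neg in random.sample([x for x in authors if x != a], k):
            pairs.append((neg, d, 0))                     # k negative pairs
    return pairs
```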
We learn the following parameters for each author a: a mean µ_a and a diagonal covariance matrix with diagonal σ_a² (these are embedding layers). For a document d, we use a trainable text encoder to map the document's content to a vector d_0 ∈ R^{r_0}. We then build the document mean µ_d = f(d_0) ∈ R^r and a diagonal covariance matrix with diagonal σ_d² = g(d_0) ∈ R^r. As we will show later, the dimension r should match the number of stylistic features to gain in comprehension of the learnt space, but the text encoder can output vectors of any dimension (here, r_0). Following [1, 23], f and g are neural networks. We give more details on f, g (the “encoding functions”), and the text encoder later.
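In code, the author parameters reduce to two embedding layers. The sketch below stores log-variances so that σ_a stays positive; this positivity trick is our assumption, the paper only states that the means and variances are embedding layers:

```python
import torch
import torch.nn as nn

class AuthorGaussians(nn.Module):
    """Trainable mean and diagonal variance per author."""
    def __init__(self, n_authors, r):
        super().__init__()
        self.mu = nn.Embedding(n_authors, r)
        self.log_sigma2 = nn.Embedding(n_authors, r)   # log-variances

    def forward(self, author_ids):
        mu = self.mu(author_ids)
        sigma = torch.exp(0.5 * self.log_sigma2(author_ids))
        return mu, sigma
```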
Following [23], the probability of a label is given by the soft contrastive loss:

$$q(y_a = 1 \mid z_a, z_d) = \sigma\left(-c_a \lVert z_a - z_d \rVert_2 + e_a\right)$$
$$q(y_f = 1 \mid z_d, z_d^f) = \sigma\left(-c_f \lVert z_d - z_d^f \rVert_2 + e_f\right), \qquad (4)$$

where σ is the sigmoid function, c_a, c_f > 0 and e_a, e_f ∈ R. We introduce an additional parameter α ∈ [0, 1] to control the importance given to the features and to the authorship prediction objective. We can define the loss function (to minimize) based on the VIB framework as follows:

$$\begin{aligned}
\mathcal{L} = \; &-(1-\alpha)\, \mathbb{E}_{p(z_a \mid x_a),\, p(z_d \mid x_d)}\left[\log q(y_a \mid z_a, z_d)\right] \\
&- \alpha\, \mathbb{E}_{p(z_d \mid x_d)}\left[\log q(y_f \mid z_d, z_d^f)\right] \\
&+ \beta\left(KL(p(z_a \mid x_a) \,\|\, q(z_a)) + KL(p(z_d \mid x_d) \,\|\, q(z_d))\right)
\end{aligned} \qquad (5)$$
Here, α = 0 produces representations that predict the author-document relation well but do not capture the stylistic features of the documents, as shown by [35]. With α = 1, on the contrary, the model simply brings document embeddings closer to their feature vectors. Hence, the value of α needs to be carefully tuned for the dataset at hand, depending on whether the corpus is writing-style specific, using domain knowledge.
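A minimal sketch of Eqs. 4 and 5 for representations already sampled from the encoding laws follows; batching the author pairs and the feature pairs into a single call is our simplification:

```python
import torch
import torch.nn.functional as F

def soft_contrastive_logits(z1, z2, c, e):
    """Logit of q(y = 1 | z1, z2) = sigmoid(-c * ||z1 - z2||_2 + e) (Eq. 4)."""
    return -c * torch.norm(z1 - z2, dim=-1) + e

def vades_loss(z_a, z_d, z_f, y_a, y_f, c_a, e_a, c_f, e_f, alpha, beta, kl_a, kl_d):
    """Eq. 5: weighted author and feature terms plus the KL regularizers
    (kl_a, kl_d are the closed-form KL terms of the two encoding laws)."""
    author_nll = F.binary_cross_entropy_with_logits(
        soft_contrastive_logits(z_a, z_d, c_a, e_a), y_a)
    feature_nll = F.binary_cross_entropy_with_logits(
        soft_contrastive_logits(z_d, z_f, c_f, e_f), y_f)
    return (1 - alpha) * author_nll + alpha * feature_nll + beta * (kl_a + kl_d)
```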
Eventually, computing the expected values in Eq. 5 is intractable for a wide range of encoders. We therefore approximate them by sampling L examples per observation (here, a (document, author, feature vector) triplet), following p(z|x) as done in [23]. We get (the same goes for feature vector/document pairs):

$$\mathbb{E}\left[\log q(y_a \mid z_a, z_d)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \log q\left(y_a \mid z_a^{(l)}, z_d^{(l)}\right) \qquad (6)$$

We then use the reparametrization trick, following what is done in VAEs [17]:

$$z_a^{(l)} = \mu_a + \sigma_a \odot \epsilon, \qquad z_d^{(l)} = \mu_d + \sigma_d \odot \epsilon, \qquad \text{with } \epsilon \sim \mathcal{N}(0, I)$$
This loss can now be minimized using backpropagation. In Figure 2,
we show a schematic representation of our model, called VADES for
Variational Author and Document Representations with Style.
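The Monte Carlo estimator of Eq. 6 with the reparametrization trick can be sketched as follows (L = 10 follows the paper's setting; the vectorized sampling is our own choice):

```python
import torch

def sample_z(mu, sigma, L):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so that gradients flow back to mu and sigma."""
    eps = torch.randn(L, *mu.shape)
    return mu.unsqueeze(0) + sigma.unsqueeze(0) * eps       # (L, batch, r)

def mc_expected_log_q(mu_a, sigma_a, mu_d, sigma_d, c, e, L=10):
    """1/L * sum_l log q(y_a = 1 | z_a^(l), z_d^(l)) (Eq. 6)."""
    z_a, z_d = sample_z(mu_a, sigma_a, L), sample_z(mu_d, sigma_d, L)
    logits = -c * torch.norm(z_a - z_d, dim=-1) + e         # (L, batch)
    return torch.sigmoid(logits).log().mean(dim=0)          # average over L
```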
3.3 Encoding Functions and Choice of the Encoder
The entry block of our model for documents is a text encoder, mapping a document in natural language to a vector in R^{r_0}. Many deep architectures could be used here and trained from scratch. Nevertheless, we propose to use a pre-trained text encoder.
Models pre-trained on large datasets are now easily available online (e.g., https://huggingface.co/models). They have proved successful on many NLP tasks with a simple fine-tuning phase (the only constraint being to avoid catastrophic forgetting). Additionally, the VIB framework naturally accommodates a pre-trained text encoder, as shown by [19]. The encoder's output should then be mapped to the document mean and variance. Both [19, 12] map the text encoder output to the document's mean (the f function) and variance (the g function) using a Multi-Layer Perceptron (MLP). This approach is simple and fast. In our experiments, we build f and g as two-layer MLPs with tanh and linear activations, with the same input and intermediate dimensions (r_0). Note that the output dimension of f and g should be the same as the number of stylistic features (r).
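A sketch of these two heads follows; the batch normalization and dropout used in the experiments of Section 5 are omitted for brevity, and the dimensions are assumptions (r_0 = 512 matching the USE output, r = 300 stylistic features):

```python
import torch.nn as nn

def make_head(r0, r):
    """Two-layer MLP mapping the encoder output (dim r0) to a document
    mean or variance vector (dim r): tanh activation, then linear."""
    return nn.Sequential(nn.Linear(r0, r0), nn.Tanh(), nn.Linear(r0, r))

f = make_head(512, 300)   # document mean
g = make_head(512, 300)   # document (log-)variance
```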
Several constraints arise regarding the pre-trained encoder itself. We would like our model to be able to capture stylistic information from a given document. As shown in [5, 35], state-of-the-art models trained on large datasets already capture complex grammatical and syntactic notions in their representations, and therefore have the explanatory power required for our objective. Moreover, our model must be able to deal with long texts, as it will be used in a literary context, processing novels, dramas and essays, where writing style matters the most. This is a serious problem: for example, the widely used BERT model is limited to 512 tokens. Alternative models such as [39] allow applying Transformers to long documents. To circumvent this issue, we use the Deep Averaging Network implementation of the Universal Sentence Encoder (USE) from [4]. It has several advantages over the latter works: it imposes no length constraint, it is faster than transformer-based methods, and it outperforms Sentence-BERT on stylistic feature prediction [35]. Testing other encoder models is left to future work. Finally, note that our model is language agnostic (as it depends on an out-of-the-box text encoder) and can infer representations for unseen documents.
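For reference, the DAN variant of USE can be loaded from TensorFlow Hub as sketched below. This is a standard usage snippet, not the authors' training code, and the fine-tuning details are omitted:

```python
import tensorflow_hub as hub

# Deep Averaging Network variant of the Universal Sentence Encoder (512-d output).
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = use(["Sentences or full documents of any length can be encoded."])
print(vectors.shape)  # (1, 512)
```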
4 Authorship Attribution Datasets
4.1 IMDb Corpus
The IMDb (Internet Movie Database) corpus is one of the most widely used for the authorship attribution task. It was introduced by [32] and is composed of 271,000 movie reviews from 22,116 online users. However, most works are evaluated on a reduction of this dataset to only 62 authors with 1,000 texts each (IMDb62). We thus benchmark our model on IMDb62. As shown later, the task of authorship attribution on this corpus is more or less solved, due to the low number of authors.
4.2 Project Gutenberg Dataset
The Project Gutenberg is a multilingual library of more than 60,000 e-books whose U.S. copyright has expired. It is freely available and was started in 1971. We gathered the corpus using [11]. Most of the books are classical novels, dramas, essays, etc., from different eras, which is relevant when studying writing style and represents our context of application quite well. To keep as many authors as possible, we randomly sample 10 texts from each author having at least such a production, leaving 664 authors in our Reduced Project Gutenberg Dataset (R-PGD), 10 times more than IMDb62. To be able to deal with such long works, we only keep the first 200 sentences of each book.
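This reduction can be sketched as follows (a hypothetical helper; the corpus format from [11] and the sentence segmentation are assumed):

```python
import random

def reduce_pgd(books_by_author, n_texts=10, n_sentences=200):
    """Keep authors with at least n_texts books, sample n_texts books each,
    and truncate every book (a list of sentences) to its first n_sentences."""
    reduced = {}
    for author, books in books_by_author.items():
        if len(books) >= n_texts:
            sample = random.sample(books, n_texts)
            reduced[author] = [book[:n_sentences] for book in sample]
    return reduced
```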
4.3 Blog Authorship Corpus
This dataset is composed of 681,288 posts from 19,320 authors, gathered in the early 2000s by [30]. There are approximately 35 posts and 7,250 words per user. We keep only 500 bloggers with at least 50 blog posts to build our reduced version of the Blog Authorship Corpus (R-BAC). This dataset is also used in several authorship attribution benchmarks, keeping only the top 10 or 50 authors with the most productions; we also test our model on these extractions of the corpus.
These last two datasets (R-PGD and R-BAC) represent two common uses of author embedding (classic literature and web analysis) with a large number of authors. Usual datasets for authorship attribution (CCAT50, NYT, IMDb62) contain far fewer classes and are further from our context of a web-extracted corpus (from Blogger or Wordpress, for example). They are also stylistically and structurally different, allowing us to evaluate our approach on various textual formats. For each dataset, we perform an 80/20 stratified train-test split.
5 Experiments
5.1 Parameter Setting and Competitors
In this section, we present implementation details for our method and competitors. For the encoding functions f and g, we use the architectures presented in the previous section, with batch normalization, dropout equal to 0.2 and L2 regularization (1e-5). Dataset statistics are given in Table 1 and grid-search parameters are detailed in Table 2.

| Dataset | Authors | Avg. Tokens | Avg. Texts |
|---------|---------|-------------|------------|
| IMDb62  | 62      | 341 (±223)  | 1000 (±0)  |
| BAC10   | 10      | 91 (±184)   | 2350 (±639) |
| BAC50   | 50      | 98 (±167)   | 1466 (±562) |
| R-BAC   | 500     | 243 (±342)  | 50 (±0)    |
| R-PGD   | 664     | 2315 (±961) | 10 (±0)    |

Table 1. Descriptive statistics for the 3 datasets and their decompositions. BAC: Blog Authorship Corpus, PGD: Project Gutenberg Dataset.

| Hyperparameter           | Grid                       |
|--------------------------|----------------------------|
| # negative pairs         | {1, 5, 10, 20}             |
| Monte Carlo sampling (L) | {1, 5, 10, 20}             |
| Learning rate            | {1e-2, 1e-3, 1e-4, 1e-5}   |
| β                        | {1e-1, 1e-2, ..., 1e-12}   |
| Feature loss             | {L2, Cross-Entropy}        |

Table 2. Grid search used for hyperparameter selection.

For L, we obtain a good trade-off between accuracy and speed with L = 10, as we quickly reach a performance plateau when increasing its value. We can summarize the tuning of α as follows:
- α = 0 implies no feature loss and no stylistic information,
- α = 0.5 gives the same importance to the feature loss and the author loss,
- α = 0.9 pushes the feature loss to boost style detection.
We train the model for 15 epochs on R-PGD and R-BAC, and for 5 epochs on IMDb62, BAC10 and BAC50, as the number of authors is around ten times smaller. We use a partition of two V100 GPUs. On a single GPU, training the model on the R-PGD dataset takes around 10 hours. In the following section, we report the results for the best version of VADES only. As an ablation study, to justify the use of both the VIB framework and the stylistic features, we compare our model with and without these components (respectively called VADES no-VIB and VADES (α = 0)). The code is available on GitHub and will be shared if the paper is accepted. All the datasets are available online.
We compare our model with several baselines. We use [20] (NGRAM Doc2Vec) and a simple average-based version of USE [4] (a document representation is built from the average of its sentence encodings, and an author representation is the average of their documents). We also compare our approach to DBert-ft [13], a document embedding method where DistilBERT is fine-tuned on the authorship attribution task; the author embeddings are built by averaging the representations of the documents an author wrote. We use the parameters detailed in the authors' implementation (https://github.com/hayj/DBert-ft).
5.2 Evaluation Tasks
We first evaluate the baselines and VADES regarding how well each method captures writing style. As writing style is a complex and still debated notion, there is no supervised dataset to evaluate how well a model can grasp it.

Figure 3. Correlation score between the ith embedding coordinate and the ith stylistic feature for the VADES representation on R-PGD. A few values in the Punctuation category are null as they were not found anywhere in the corpus. (Feature families on the x-axis: Function words, Indexes, Letters, NER, Numbers, Punctuation, Structural, TAG.)

Figure 4. The ith embedding axis against the ith stylistic feature for each author representation, for a selection of 4 given features (including “?” frequency by sentence (217th), Flesch Reading Ease Score (232nd) and Average Word Length (1st)). We can see the correlation between each feature and its respective embedding axis.
We therefore use a proxy task that consists in predicting stylistic features from the latent representations. We follow the experimental protocol of [35]. The stylistic features are extracted using the spaCy word and sentence tokenizers, POS tagger and Named Entity Recognition, spaCy English stopwords and the NLTK CMU Dictionary. For each author, we aim to predict the value of all stylistic features from their embedding. Each feature is standardized before regression. We use an SVR with a Radial Basis Function (RBF) kernel, as it offers both quick training time and the best results among the kernels tried in our experiments. We evaluate models using the Mean Squared Error (MSE) following a 10-fold cross-validation scheme.
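A sketch of this evaluation for a single stylistic feature, using scikit-learn (our reading of the protocol; array shapes are assumptions):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def style_mse(embeddings, feature):
    """10-fold cross-validated MSE of an RBF-kernel SVR predicting one
    standardized stylistic feature from the author embeddings."""
    y = StandardScaler().fit_transform(np.asarray(feature).reshape(-1, 1)).ravel()
    scores = cross_val_score(SVR(kernel="rbf"), embeddings, y,
                             cv=10, scoring="neg_mean_squared_error")
    return -scores.mean()
```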
Secondly, we perform authorship attribution, the task of predicting the author of a given document. We compare VADES with several other authorship attribution methods, even though they do not necessarily perform representation learning. Each dataset is split into train and test sets with an 80/20 ratio. For our model, we repeated the evaluation scheme 5 times. For embedding methods without a classification head, we associate each document with its most plausible author using cosine similarity. We evaluate these results using accuracy (the percentage of correctly predicted authors out of all data points).
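The cosine-similarity attribution rule amounts to the following sketch:

```python
import numpy as np

def attribute_authors(doc_emb, author_emb):
    """Assign each document to its most plausible author: the one whose
    embedding has the highest cosine similarity with the document's."""
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    a = author_emb / np.linalg.norm(author_emb, axis=1, keepdims=True)
    return (d @ a.T).argmax(axis=1)    # predicted author index per document
```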
5.3 Results on capturing writing style
As explained earlier, we use the author embeddings to perform regression and predict each stylistic feature. As shown in Table 4, a simple logistic regression on these stylistic features alone reaches decent scores in authorship attribution, close to those of the Universal Sentence Encoder, a state-of-the-art method in sentence embedding. As the features contain strictly no topic information, this demonstrates how good a proxy of writing style they are. Thus, a model able to capture them is able to capture writing style.
Results on the style MSE metric are shown in Table 3. As expected, our model easily outperforms every baseline on all axes. DBert-ft, trained only on the authorship attribution objective, performs the worst. Even though this approach is based on fine-tuned language models, which already capture syntactic and grammatical notions [5], this is not the information that seems to be retained by the network when trained on the authorship attribution task. This is consistent with what was shown in [35]: such models may mainly focus on semantic information to predict the author-document relation. Interestingly, we observe that a simple average of USE representations performs quite well, which confirms that it successfully captures complex linguistic concepts. VADES is guided by the feature loss to do so.
On a qualitative note, we present two additional visualisations to underline the strong advantage of VADES for linguistic and stylistic applications. In Figure 5, we present a T-SNE 2D projection of the books of the R-PGD dataset, colored by their publication year. A clear color gradient appears, demonstrating that our model can grasp the evolution of writing style through the last centuries. Figure 1 shows a toy example of a T-SNE 2D projection of well-known authors from the R-PGD dataset and their books (we use α = 0.5). The objects are distributed in the space across clear author-specific clusters. The most interesting observation relates to documents lying outside their author's cluster: Thus Spake Zarathustra: A Book for All and None by Nietzsche is a philosophical poem, closer to Hugo, while the rest of his production is mostly essays. The same conclusion goes for The Power of Darkness by Tolstoï, a five-act drama, whose embedding is closer to Shakespeare than to Tolstoï's novels. The version of Hamlet presented here is fully commented, and is thus closer to the analytical and philosophical works of Nietzsche and Plutarch, as shown on the figure. We also represent the variance learnt by the model through the size of the author dot. Hugo, who wrote famous novels as well as poetry and dramas, has a greater variance than other authors.
Average MSE regression score along with standard deviation (SVR model) on the R-PGD dataset:

| Embedding | Letters | Numbers | Structural | Punctuation | Func. words | TAG | NER | Indexes |
|---|---|---|---|---|---|---|---|---|
| Content-Info | 0.67 (0.17) | 0.88 (0.12) | 0.55 (0.19) | 0.68 (0.16) | 0.72 (0.19) | 0.65 (0.17) | 0.74 (0.14) | 0.50 (0.16) |
| Ngram Doc2Vec | 0.63 (0.20) | 0.88 (0.12) | 0.51 (0.20) | 0.58 (0.21) | 0.68 (0.19) | 0.59 (0.19) | 0.71 (0.14) | 0.45 (0.15) |
| USE | 0.61 (0.27) | 0.86 (0.09) | 0.34 (0.18) | 0.59 (0.26) | 0.65 (0.24) | 0.45 (0.29) | 0.65 (0.17) | 0.27 (0.15) |
| DBert-ft | 0.79 (0.16) | 0.92 (0.09) | 0.65 (0.15) | 0.82 (0.17) | 0.84 (0.13) | 0.74 (0.14) | 0.84 (0.08) | 0.60 (0.14) |
| VADES no-VIB (0.5) | 0.55 (0.23) | 0.67 (0.11) | 0.32 (0.14) | 0.66 (0.27) | 0.58 (0.21) | 0.44 (0.27) | 0.62 (0.16) | 0.24 (0.14) |
| VADES (0.0) | 0.84 (0.24) | 0.91 (0.12) | 0.66 (0.13) | 0.85 (0.18) | 0.91 (0.15) | 0.71 (0.23) | 0.88 (0.09) | 0.61 (0.16) |
| VADES (0.5) | 0.50 (0.22) | 0.60 (0.11) | 0.28 (0.14) | 0.62 (0.27) | 0.53 (0.21) | 0.40 (0.27) | 0.58 (0.15) | 0.20 (0.11) |
| VADES (0.9) | **0.47 (0.22)** | **0.53 (0.10)** | **0.26 (0.13)** | **0.59 (0.28)** | **0.50 (0.21)** | **0.39 (0.26)** | **0.56 (0.15)** | **0.19 (0.10)** |

Average MSE regression score along with standard deviation (SVR model) on the R-BAC dataset:

| Embedding | Letters | Numbers | Structural | Punctuation | Func. words | TAG | NER | Indexes |
|---|---|---|---|---|---|---|---|---|
| Content-Info | 0.80 (0.15) | 0.85 (0.07) | 0.62 (0.23) | 0.92 (0.09) | 0.87 (0.12) | 0.90 (0.05) | 0.93 (0.07) | 0.70 (0.29) |
| Ngram Doc2Vec | 0.77 (0.16) | 0.88 (0.05) | 0.67 (0.16) | 0.78 (0.13) | 0.84 (0.12) | 0.82 (0.09) | 0.86 (0.11) | 0.67 (0.13) |
| USE | 0.67 (0.25) | 0.83 (0.05) | 0.45 (0.20) | 0.78 (0.17) | 0.81 (0.17) | 0.63 (0.21) | 0.80 (0.17) | 0.38 (0.18) |
| DBert-ft | 1.05 (0.09) | 1.05 (0.07) | 1.01 (0.05) | 0.98 (0.22) | 1.05 (0.09) | 0.95 (0.19) | 0.91 (0.20) | 1.03 (0.07) |
| VADES (0.9) | **0.52 (0.23)** | **0.55 (0.09)** | **0.31 (0.17)** | **0.76 (0.22)** | **0.67 (0.20)** | **0.57 (0.20)** | **0.73 (0.18)** | **0.32 (0.20)** |

Table 3. Feature prediction on R-PGD and R-BAC. MSE score (standard deviation in parentheses) on the prediction of stylistic features from author embeddings using SVR. The 300 stylistic features are grouped by families. Best scores for each axis in bold. Our model (α value in parentheses) performs best with α = 0.9.
| Approach | IMDb62 (62 authors) | BAC (10 authors) | BAC (50 authors) |
|---|---|---|---|
| Stylistic features + LR | 88.2 (0.1) | 40.9 (0.2) | 28.4 (0.2) |
| LDA+Hellinger* [7] | 82 | 52.5 | 18.3 |
| Impostors* [18] | x | 35.4 | 22.6 |
| Word Level TF-IDF* | 91.4 | x | x |
| CNN-Char* [27] | 91.7 | 61.2 | 49.4 |
| C.Att + Sep.Rec.* [33] | 91.8 | x | x |
| Token-SVM* [32] | 92.5 | x | x |
| SCAP* [9] | 94.8 | 48.6 | 41.6 |
| Cont. N-gram* [29] | 94.8 | 61.3 | 52.8 |
| (C+W+POS)/LM* [15] | 95.9 | x | x |
| N-gram + Style* [28] | 95.9 | x | x |
| N-gram CNN* [40] | x | 63.7 | 53.1 |
| Syntax CNN* [40] | *96.2* | 64.1 | 56.7 |
| DBert-ft [13] | **96.7 (0.2)** | *64.3 (0.2)* | *58.5 (0.2)* |
| BertAA* [8] | 93.0 | **65.4** | **59.7** |
| VADES no-VIB (0.5) | 91.3 (0.1) | 60.9 (0.2) | 50.2 (0.2) |
| VADES (0.0) | 94.9 (0.2) | 62.6 (0.2) | 52.4 (0.2) |
| VADES (0.1) | 95.6 (0.2) | 63.8 (0.2) | 53.8 (0.2) |

Table 4. Authorship attribution accuracy on IMDb62 and the Blog Authorship Corpus. Results with * are gathered from other papers; x marks a result missing for a given dataset. Best model in bold, second best in italics. We compare our model (α value in parentheses) with several authorship attribution models; it competes with state-of-the-art models while learning meaningful representations of writing style for documents and authors.
Figure 5. Book representations from Project Gutenberg and their writing period. We sampled 10 R-PGD books per decade starting in 1650 and present here a 2D T-SNE projection of their VADES embeddings, colored by publication period (roughly 1700 to 2000).
5.4 Interpretability of the Representation Space
As we use the L2 distance between document representations and stylistic feature vectors, each of the 300 embedding axes corresponds to one given stylistic feature. The soft contrastive loss allows enforcing the L2 constraint (bringing document embeddings and stylistic feature vectors closer) while being more flexible than a simple regression loss. When experimenting with the latter, the task turned out to be too hard and detrimental to both authorship attribution scores and the writing style loss.
In Figure 3, we show the Pearson correlation score between the ith stylistic feature and the corresponding embedding axis. For each feature, this correlation is always higher than with any other embedding coordinate. To further illustrate the interpretability of the embedding space, Figure 4 shows, for a selection of 4 stylistic features, the representation value of the matching coordinate for each author. The representation space learnt by VADES is interpretable in terms of writing style. In the context of a multidisciplinary project involving several researchers in literature and linguistics, this is a significant added value.
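The diagonal correlations of Figure 3 can be computed as sketched below (both arrays of shape (n_authors, 300) are assumptions matching the 300 stylistic features):

```python
import numpy as np
from scipy.stats import pearsonr

def axiswise_correlations(embeddings, features):
    """Pearson r between the i-th embedding coordinate and the i-th
    stylistic feature, for every axis i."""
    return np.array([pearsonr(embeddings[:, i], features[:, i])[0]
                     for i in range(embeddings.shape[1])])
```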
5.5 Results on the Authorship Attribution Task
Results on the authorship attribution task for IMDb62 and the Blog Authorship Corpus are presented in Table 4, against state-of-the-art solutions (not necessarily embedding models). On both datasets, our model ranks in the top 4, outperforming recent competitors even though authorship attribution is not its main task. Our model is outpaced by Syntax CNN [40] and by DBert-ft [13] and BertAA [8], two variants of BERT fine-tuned on the authorship attribution task. As shown by [8], BERT and DistilBERT are really tailored for balanced datasets with short texts such as IMDb62 and the Blog Authorship Corpus. The DBert-ft model splits every document into 512-token chunks during training, building an even bigger corpus with an important improvement, but this is hardly reproducible with our feature loss. BertAA feeds documents encoded by a fine-tuned BERT, together with a set of stylistic features and of the most frequent bi-grams and tri-grams, to a logistic regression. This clearly allows it to perform better on the Blog Authorship Corpus, as this dataset mixes several genres and styles, whereas IMDb62 concerns only movie reviews. This confirms our use of stylistic features. Syntax CNN encodes each sentence of a document separately along with its syntax. Unfortunately, this model was hardly reproducible and could not be tested on feature regression using intermediate representations. For VADES, lower values of α reach the best accuracy in authorship attribution on these datasets. The additional information brought by the stylistic features benefits authorship attribution when texts are longer.
5.6 Ablation Study and Effect of α
We here compare our model to its no-VIB variant and to the variant without feature loss. Both variations underperform on both tasks. First, the VIB paradigm offers more versatility than fixed document and author representations, which is key to grasp a complex notion such as writing style. Then, the feature loss brings additional information for authorship prediction, as shown by BertAA, which uses it to improve BERT classification results; here, our framework enables using it directly for document and author embeddings. In Figure 6, we evaluate the influence of α, which balances the importance given to the author loss and the feature loss, on both feature regression and authorship attribution. Adding just a little stylistic feature information (α = 0.1) improves the accuracy of our model in authorship attribution: it forces the model to extract discriminant stylistic information from the input. Surprisingly, a symmetric phenomenon appears when shutting down the author loss (α = 1): the style score deteriorates, as authors tend to use a consistent writing style across their documents. Thus, bringing a writer closer to their document representations also helps capturing their writing habits ([13] call this “intra-author consistency”).
Figure 6. Effect of α. We plot the evolution of the style evaluation metric (average MSE score) and of the accuracy with the α parameter for R-PGD.
6 Conclusion
In this article, we presented VADES, a new author and document embedding method that leverages stylistic features. It has several advantages over existing works: it easily integrates any pre-trained text encoder, it allows comparing authors and documents of any length (e.g., for authorship attribution), and it builds an interpretable representation space by incorporating stylistic features widely used in computational linguistics. It is also able to infer representations for unseen documents, unlike most prior approaches. We demonstrated that VADES outperforms existing embedding baselines in stylistic feature prediction, often by a large margin, while staying competitive in authorship attribution.
In further experiments, we will incorporate modern text encoders, such as LLaMA [37]. They are much more difficult to adapt to this task, but as most recent Large Language Models are trained in an autoregressive way, they might have the expressive power needed to grasp the stylistic aspects of authors' productions.
References
[1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy,
‘Deep variational information bottleneck’, Proceedings of the Interna-
tional Conference on Learning Representations (ICLR), (2017).
[2] Silvio Amir, Glen Coppersmith, Paula Carvalho, Mario J Silva, and
Byron C Wallace, ‘Quantifying mental health from social media with
neural user embeddings’, in Proceedings of the Machine Learning for
Healthcare Conference, pp. 306–321, (2017).
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah,
Jared Kaplan, et al., ‘Language models are few-shot learners’, CoRR,
abs/2005.14165, (2020).
[4] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, et al., 'Universal sentence encoder for English', in Proceedings of the 2018 Conference on EMNLP: System Demonstrations, pp. 169–174, (2018).
[5] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D.
Manning, ‘What does BERT look at? an analysis of bert’s attention’,
in Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP, volume abs/1906.04341,
(2019).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova,
‘Bert: Pre-training of deep bidirectional transformers for language un-
derstanding’, in Proceedings of the 2019 Conference of the North Amer-
ican Chapter of the ACL: Human Language Technologies, Volume 1
(Long and Short Papers), pp. 4171–4186, (2019).
[7] Sara El Manar El Bouanani and Ismail Kassou, 'Authorship analysis studies: A survey', International Journal of Computer Applications, 86, (December 2013).
[8] Maël Fabien, Esau Villatoro-Tello, Petr Motlicek, and Shantipriya
Parida, ‘BertAA : BERT fine-tuning for authorship attribution’, in Pro-
ceedings of the 17th International Conference on Natural Language
Processing (ICON), pp. 127–137, Indian Institute of Technology Patna,
Patna, India, (December 2020). NLP Association of India (NLPAI).
[9] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and
Sokratis Katsikas, ‘Source code author identification based on n-gram
author profiles’, in Artificial Intelligence Applications and Innovations,
pp. 508–515, Boston, MA, (2006). Springer US.
[10] Soumyajit Ganguly, Manish Gupta, Vasudeva Varma, Vikram Pudi,
et al., ‘Author2vec: Learning author representations by combining con-
tent and link information’, in Proceedings of the 25th International
Conference Companion on World Wide Web, pp. 49–50. International
World Wide Web Conferences Steering Committee, (2016).
[11] Martin Gerlach and Francesc Font-Clos, 'A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics', Entropy, 22(1), 126, (2020).
[12] Antoine Gourru, Julien Velcin, Christophe Gravier, and Julien Jacques,
‘Dynamic gaussian embedding of authors’, in Proceedings of the 2022
The Web Conference on World Wide Web. International World Wide
Web Conferences Steering Committee, (2022).
[13] Julien Hay, Bich-Lien Doan, Fabrice Popineau, and Ouassim Ait El-
hara, ‘Representation learning of writing style’, in Proceedings of the
Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 232–
243, Online, (November 2020). ACL.
[14] Fereshteh Jafariakinabad and Kien A Hua, ‘Style-aware neural model
with application in authorship attribution’, in 2019 18th IEEE Interna-
tional Conference On Machine Learning And Applications (ICMLA),
pp. 325–328. IEEE, (2019).
[15] Jaap Kamps, Giannis Tsakonas, Yannis Manolopoulos, Lazaros Iliadis, and Ioannis Karydis, 'Research and Advanced Technology for Digital Libraries', in Proceedings of the 21st International Conference on Theory and Practice of Digital Libraries, Thessaloniki, Greece, (2017).
[16] Jussi Karlgren, ‘The wheres and whyfores for studying textual genre
computationally’, AAAI Technical Report (7), 68–70, (2004).
[17] Diederik P Kingma and Max Welling, 'Auto-encoding variational Bayes', Proceedings of the International Conference on Learning Representations (ICLR), (2014).
[18] Moshe Koppel and Yaron Winter, 'Determining if two documents are written by the same author', Journal of the Association for Information Science and Technology, 65(1), 178–187, (2014).
[19] Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson,
‘Variational information bottleneck for effective low-resource fine-
tuning’, in International Conference on Learning Representations,
(2021).
[20] Suraj Maharjan, Deepthi Mave, Prasha Shrestha, et al., 'Jointly learning author and annotated character N-gram embeddings: A case study in literary text', International Conference RANLP, (2019).
[21] T. C. Mendenhall, ‘The characteristic curves of composition’, Science,
ns-9, (1887).
[22] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff
Dean, ‘Distributed representations of words and phrases and their com-
positionality’, in Advances in neural information processing systems,
pp. 3111–3119, (2013).
[23] Seong Joon Oh, Kevin Murphy, Jiyan Pan, Joseph Roth, Florian
Schroff, and Andrew Gallagher, ‘Modeling uncertainty with hedged in-
stance embedding’, in Proceedings of the International Conference on
Learning Representations, (2019).
[24] Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michi-
hiro Yasunaga, and Diyi Yang, ‘Is chatgpt a general-purpose natural
language processing task solver?’, arXiv:2302.06476, (2023).
[25] Nils Reimers and Iryna Gurevych, ‘Sentence-bert: Sentence embed-
dings using siamese bert-networks’, Proceedings of the International
Conference on EMNLP, (2019).
[26] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic
Smyth, ‘The author-topic model for authors and documents’, in Pro-
ceedings of the 20th conference on Uncertainty in artificial intelligence,
pp. 487–494, (2004).
[27] Sebastian Ruder, Parsa Ghaffari, and John G. Breslin, ‘Character-level
and multi-channel convolutional neural networks for large-scale author-
ship attribution’, CoRR,abs/1609.06686, (2016).
[28] Yunita Sari, Mark Stevenson, and Andreas Vlachos, ‘Topic or Style ?
Exploring the Most Useful Features for Authorship Attribution’, 27th
International conference on computational lingustics, 343–353, (2018).
[29] Yunita Sari, Andreas Vlachos, and Mark Stevenson, ‘Continuous n-
gram representations for authorship attribution’, in Proceedings of the
15th Conference of the European Chapter of the ACL: Volume 2, Short
Papers, pp. 267–273, Valencia, Spain, (April 2017). ACL.
[30] Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pen-
nebaker, ‘Effects of age and gender on blogging.’, in AAAI Spring Sym-
posium: Computational Approaches to Analyzing Weblogs, pp. 199–
205. AAAI, (2006).
[31] Roy Schwartz, Oren Tsur, Ari Rappoport, and Moshe Koppel, ‘Author-
ship attribution of micro-messages’, in Proceedings of the 2013 Con-
ference on EMNLP, Seattle, Washington, USA, (October 2013). ACL.
[32] Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert, ‘Authorship At-
tribution with Topic Models’, Computational Linguistics,40(2), 269–
310, (06 2014).
[33] Wei Song, Chen Zhao, and Lizhen Liu, ‘Multi-task learning for au-
thorship attribution via topic approximation and competitive attention’,
IEEE Access,7, 177114–177121, (2019).
[34] Efstathios Stamatatos, ‘On the robustness of authorship attribution
based on character n-gram features’, Journal of Law and Policy,21,
421–439, (01 2013).
[35] Enzo Terreau, Antoine Gourru, and Julien Velcin, ‘Writing style author
embedding evaluation’, in Proceedings of the 58th Annual Meeting of
the ACL, 2nd Workshop on Evaluation and Comparison of NLP Sys-
tems, pp. 84–93, (2021).
[36] Naftali Tishby, Fernando C Pereira, and William Bialek, ‘The infor-
mation bottleneck method’, The 37th annual Allerton Conference on
Communication, Control, and Computing, 368–377, (1999).
[37] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet,
Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman
Goyal, Eric Hambro, Faisal Azhar, et al., ‘Llama: Open and effi-
cient foundation language models’, arXiv preprint arXiv:2302.13971,
(2023).
[38] Min Yang and Kam-Pui Chow, 'Authorship attribution for forensic investigation with thousands of authors', in ICT Systems Security and Privacy Protection, pp. 339–350, Berlin, Heidelberg, (2014). Springer Berlin Heidelberg.
[39] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, et al., 'Big bird: Transformers for longer sequences', Advances in Neural Information Processing Systems, 33, 17283–17297, (2020).
[40] Richong Zhang, Zhiyuan Hu, Hongyu Guo, and Yongyi Mao, ‘Syntax
encoding with application in authorship attribution’, in Proceedings of
the 2018 Conference on EMNLP, pp. 2742–2753, Brussels, Belgium,
(October-November 2018). ACL.
[41] Ying Zhao and Justin Zobel, ‘Effective and scalable authorship attri-
bution using function words’, in Information Retrieval Technology, pp.
174–189, Berlin, Heidelberg, (2005). Springer Berlin Heidelberg.