Interpreting Burrows’s Delta:
Geometric and Probabilistic Foundations
Shlomo Argamon
Linguistic Cognition Laboratory
Department of Computer Science
Illinois Institute of Technology
Chicago, IL 60616
argamon@iit.edu
September 19, 2006
Abstract
While Burrows’s intuitive and elegant “Delta” measure has proven to be extremely useful for authorship attribution, a theoretical understanding of its operation has remained somewhat obscure. In this paper, I address this issue by introducing a geometric interpretation of Delta, which further allows us to interpret Delta as a probabilistic ranking principle. This interpretation gives us a better understanding of the method’s fundamental assumptions and potential limitations, as well as leading to several well-founded variations and extensions.
1 Introduction
In his 2001 Busa Award lecture, John F. Burrows (2003) proposed a new measure for authorship
attribution which he termed “Delta”, defined as:
the mean of the absolute differences between the z-scores for a set of word-variables in
a given text-group and the z-scores for the same set of word-variables in a target text.
(Burrows, 2002)
The measure assumes some set of comparison texts is given, with respect to which z-scores are computed (based on the mean and standard deviation of word frequencies in the comparison corpus). The Delta measure is then computed between the target text and each of a set of candidate texts (generally comprising the comparison corpus), and the target is attributed to the author of the candidate text with the lowest Delta score.
A number of literary authorship studies (Burrows, 2002, 2003; Hoover, 2004a, 2004b, 2005) have shown the Delta measure to be exceptionally useful for authorship attribution, even with a large number of candidate authors (as long as genre is controlled for). However, while Delta is a powerful new tool in the arsenal of the computational stylist, why it works so well has remained somewhat obscure. As Hoover (2005) states:
In spite of the fact that Burrows’s Delta is simple and intuitively reasonable, it, like
previous statistical authorship attribution techniques, and like Hoover’s alterations,
lacks any compelling theoretical justification.
The purpose of this paper is to partially fill this lacuna by examining a geometric interpretation of Burrows’s original Delta measure. As we will see, this interpretation allows Delta to be related to some well-understood notions in the theory of probability, and leads directly to several well-founded extensions of the Delta measure, as well as a better understanding of the method’s assumptions and potential limitations. In an upcoming sequel, we will address the effectiveness of several different Delta variants derived from this interpretation, by empirical tests on several authorship attribution tasks.
2 The Geometry of Delta
2.1 Delta as a nearest-neighbor classifier
We proceed by first examining the algebraic structure of Burrows’s Delta metric, and then considering how it may be interpreted geometrically as a kind of ‘distance measure’, where the ‘nearest’ authorship candidate is chosen for attribution.
Call the set of $n$ words of interest (usually the most frequent) for computing Delta $\{w_i\}$, defining $f_i(D)$ as $w_i$'s frequency in document $D$, $\mu_i$ its mean frequency in the comparison corpus, and $\sigma_i$ its standard deviation. Then the z-score for $w_i$ in document $D$ is
$$z(f_i(D)) = \frac{f_i(D) - \mu_i}{\sigma_i}$$
We thus have the following mathematical expression for the Delta measure (‘average absolute difference of the z-scores’) between documents $D$ and $D'$:
$$\Delta(D, D') = \frac{1}{n}\sum_{i=1}^{n} \left| z(f_i(D)) - z(f_i(D')) \right|$$
This formula can be simplified algebraically as follows:
$$\begin{aligned}
\Delta(D, D') &= \frac{1}{n}\sum_{i=1}^{n} \left| z(f_i(D)) - z(f_i(D')) \right| \\
&= \frac{1}{n}\sum_{i=1}^{n} \left| \frac{f_i(D) - \mu_i}{\sigma_i} - \frac{f_i(D') - \mu_i}{\sigma_i} \right| \\
&= \frac{1}{n}\sum_{i=1}^{n} \frac{\left| f_i(D) - f_i(D') \right|}{\sigma_i}
\end{aligned}$$
This simplification shows that Delta does not actually depend on the mean frequencies in the comparison set, but may be viewed as a normalized difference measure between frequencies in $D$ and in $D'$. We further note that averaging is just multiplication of the sum by a constant factor of $\frac{1}{n}$, which depends only on the number of words considered, and thus is irrelevant when using Delta as a ranking metric. In the remainder of this paper, therefore, we will use the following simplified formula as equivalent to Burrows’s Delta:
$$\Delta_B^{(n)}(D, D') = \sum_{i=1}^{n} \frac{1}{\sigma_i}\left| f_i(D) - f_i(D') \right|$$
i.e., the sum of the standard-deviation-normalized absolute differences of the word frequencies. Note that $n$ is the number of frequent words used, and the subscript $B$ indicates equivalence to Burrows’s original Delta formula (we will develop several variants later on).
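To make the computation concrete, the following is a minimal Python sketch of $\Delta_B^{(n)}$ and the resulting attribution rule; the toy frequencies and variable names are illustrative, not data from the original studies:

```python
import numpy as np

def delta_b(f_d, f_d2, sigma):
    """Simplified Burrows's Delta: sum of sigma-normalized absolute differences."""
    return np.sum(np.abs(f_d - f_d2) / sigma)

# Toy comparison corpus: rows = candidate texts, columns = relative frequencies
# of the n most frequent words.
corpus = np.array([[0.050, 0.031, 0.022],   # candidate by author A
                   [0.043, 0.035, 0.019],   # candidate by author B
                   [0.061, 0.028, 0.025]])  # candidate by author C
sigma = corpus.std(axis=0)                  # per-word spread in the comparison corpus
target = np.array([0.048, 0.032, 0.021])    # text of unknown authorship

scores = [delta_b(target, cand, sigma) for cand in corpus]
print("attribute to candidate", int(np.argmin(scores)))  # lowest Delta wins
```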
Figure 1: Graphs showing the structure of the distance measure implied by Burrows’s Delta in two dimensions ($\Delta_B^{(2)}$). Each graph shows a space assuming two words are used for computing the measure, where the coordinates of each point in the space correspond to the frequencies of these words in a possible text. The diamonds show the ‘iso-Delta’ lines, such that all points on each diamond have the same value for Delta. (a) The standard deviation is 1 for the frequencies of both words. (b) The standard deviation for the word on the $x$ axis is 2, while the other is 1. (c) The standard deviation for the word on the $x$ axis is $\frac{1}{2}$, while the other is 1.
This formulation shows clearly that Delta ranks authorship candidates $D'$ by a kind of distance from the test text $D$, where each dimension of difference (word frequency) is scaled by a factor of $\frac{1}{\sigma_i}$ (i.e., small differences count more for dimensions with less ‘spread’). Thus, Delta may be viewed as an axis-weighted form of ‘nearest neighbor’ classification (Wettschereck, Aha, & Mohri, 1997), where a test document is classified the same as the known document at the smallest ‘distance’.
2.2 Manhattan and Euclidean distance
A deeper understanding of Delta is obtained by considering each comparison of target text $D$ and given text $D'$ as a point in a multi-dimensional geometric space, where the differences between word frequencies give the point's coordinates in the space. Mathematically, we use lower-case deltas $\delta_i$ to indicate, for each word $w_i$, the difference between $w_i$'s frequencies in the two texts: $\delta_i = f_i(D) - f_i(D')$. (We defer for now consideration of the effect of conversion to z-scores.)
Every point in this $n$-dimensional difference space corresponds to one possible set of differences between target and given texts over all words of interest. The Delta measure assigns a numeric score to each such difference point, allowing the points to be ranked for likelihood of similarity or difference of authorship. To make this more concrete, Figure 1(a) depicts the scores assigned by the simplest Delta function, where all standard deviations are 1, to points in a 2-dimensional space (i.e., one where only two words are used in the Delta calculation). The figure shows several iso-Delta diamonds, i.e., sets of points where Delta values are equal. In three dimensions the lines become planes and the diamonds octahedra; in higher dimensions the picture becomes rather more complex.
For comparison, Figure 1(b) shows iso-Delta lines where $\sigma_1 = \frac{1}{2}$ and $\sigma_2 = 1$, while Figure 1(c) shows iso-Delta lines where $\sigma_1 = 2$ and $\sigma_2 = 1$.
Figure 2: Graphs showing the structure of the distance measure implied by axis-parallel quadratic Delta in two dimensions ($\Delta_{Q\perp}^{(2)}$). Each graph shows a space assuming two words are used for computing the measure, where the coordinates of each point in the space correspond to the frequencies of these words in a possible text. The ellipses show the ‘iso-Delta’ lines, such that all points on each ellipse have the same value for Delta. (a) The standard deviation is 1 for the frequencies of both words. (b) The standard deviation for the word on the $x$ axis is 2, while the other is 1. (c) The standard deviation for the word on the $x$ axis is $\frac{1}{2}$, while the other is 1.
As these graphs clearly show, the effect of dividing by $\sigma_i$ is to rescale the scoring in the direction of each of the axes, making differences in some directions more salient than in others.
Note that regardless of the rescaling of the axes, the distance of a given point in the difference space from the origin is computed as the sum of its distances along each of the axes from the origin. Such a distance measure has been termed “Manhattan distance” by analogy to measuring driving (or walking) distance in Manhattan (or any other grid of city streets) by summing distances in each of the principal directions. Another, possibly more natural, measure of distance would be the straight-line ‘Euclidean’ distance, in which the iso-Delta curves would be circles, as in Figure 2(a).
Using Euclidean distance gives the distance formula (derived from the Pythagorean Theorem):
$$\sqrt{\sum_{i=1}^{n} \frac{1}{\sigma_i^2}\left( f_i(D) - f_i(D') \right)^2}$$
Instead of the sum of absolute differences, we have the square root of the sum of the squared differences. Note that if the standard deviations (for rescaling) vary between the axes, the circular iso-Delta curves become axis-parallel ellipses, as shown in Figure 2(b) for $\sigma_1 = \frac{1}{2}$ and Figure 2(c) for $\sigma_1 = 2$ (compare to Figure 1). Again, note that since Delta is to be used as a ranking principle (and so we only care about relative value), we can simplify the formula by removing the square root, giving as our first variant the quadratic Delta formula:
$$\Delta_{Q\perp}^{(n)}(D, D') = \sum_{i=1}^{n} \frac{1}{\sigma_i^2}\left( f_i(D) - f_i(D') \right)^2$$
The ‘Q’ in the subscript denotes the use of a quadratic function in the sum, while the ‘$\perp$’ symbol indicates that the scaling is in line with the perpendicular axes (we will return to this point in Section 4).
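A corresponding Python sketch of the axis-parallel quadratic Delta (same illustrative conventions as the earlier sketch); since the square root is monotonic, ranking by this quantity matches ranking by the Euclidean distance itself:

```python
import numpy as np

def delta_q_perp(f_d, f_d2, sigma):
    """Axis-parallel quadratic Delta: sum of variance-normalized squared differences."""
    return np.sum((f_d - f_d2) ** 2 / sigma ** 2)
```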
3 Delta as a Probabilistic Ranking Principle
3.1 Quadratic Delta
The quadratic Delta formula just introduced leads us to a deeper conception of why such a ranking principle for authorship candidates can make sense. Consider first the case of using just a single word-frequency. In this case,
$$\Delta_{Q\perp}^{(1)}(D, D') = \frac{1}{\sigma_1^2}\left( f_1(D) - f_1(D') \right)^2,$$
where the author of the $D'$ giving the smallest value will be attributed as $D$'s author. It can be straightforwardly shown mathematically that an identical decision process is given by choosing the highest probability value given by a Gaussian distribution¹ with mean $f_1(D')$ and standard deviation $\sigma_1$, whose probability density function is given as:
$$G_{\langle f_1(D'),\, \sigma_1 \rangle}(f_1(D)) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{1}{\sigma_1^2}\left( f_1(D) - f_1(D') \right)^2} = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\Delta_{Q\perp}^{(1)}(D, D')}$$
This is because the normalizing factor $\frac{1}{\sqrt{2\pi}\,\sigma_1}$ is a constant (and so does not change candidate rankings), and exponentiating minus Delta just inverts candidate rankings (turning minimizing into maximizing), since exponentiation is a monotonic operator (its outputs have the same ordering as its inputs).
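A quick numeric check (with made-up values) that minimizing $\Delta_{Q\perp}^{(1)}$ selects the same candidate as maximizing the density written above:

```python
import numpy as np

rng = np.random.default_rng(0)
f_target = 0.05                      # frequency of the word in the target text
f_cands = rng.uniform(0.0, 0.1, 5)   # frequencies in five candidate texts
sigma = 0.01                         # spread estimated from the comparison corpus

delta = (f_target - f_cands) ** 2 / sigma ** 2         # one-word quadratic Delta
dens = np.exp(-delta) / (np.sqrt(2 * np.pi) * sigma)   # density as written above

assert np.argmin(delta) == np.argmax(dens)             # same candidate either way
```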
This is easily generalized, so that minimizing the $n$-dimensional quadratic Delta
$$\Delta_{Q\perp}^{(n)}(D, D') = \sum_i \frac{1}{\sigma_i^2}\left( f_i(D) - f_i(D') \right)^2$$
may be seen to be equivalent to maximizing a probability according to the $n$-dimensional Gaussian distribution²:
$$G_{\langle \vec{f}(D'),\, \vec{\sigma} \rangle}(\vec{f}(D)) = \frac{1}{\sqrt{(2\pi)^n}\,\prod_i \sigma_i}\, e^{-\sum_i \frac{1}{\sigma_i^2}\left( f_i(D) - f_i(D') \right)^2} = \frac{1}{\sqrt{(2\pi)^n}\,\prod_i \sigma_i}\, e^{-\Delta_{Q\perp}^{(n)}(D, D')}$$
What this equivalence means is that the use of quadratic Delta for choosing an authorship candidate amounts to choosing the highest-probability candidate, where the frequency of each indicator word $w_i$, in texts written by the author of $D'$, is assumed to be randomly distributed (in the abstract $n$-dimensional word frequency space) with probabilities given by a Gaussian distribution with mean $f_i(D')$ and standard deviation $\sigma_i$.
¹Named after mathematician Johann Carl Friedrich Gauss (1777–1855).
²The arrow notation in $\vec{f}$ and $\vec{\sigma}$ indicates that these quantities are vectors containing the values of $f$ and $\sigma$ for all words $w_i$; i.e., $\vec{f}(D) = \langle f_1(D), f_2(D), \cdots, f_n(D) \rangle$. Also see the Appendix.
Figure 3: Gaussian probability distributions. (a) One-dimensional Gaussian distributions for several
values of standard deviation σ. (b) Two-dimensional Gaussian distribution (probability shown
along the vertical axis) where both standard deviations are 1. Note that horizontal cross-sections
are circles. (c) Two-dimensional Gaussian distribution where one standard deviation is 1 and the
other is 2. Note that horizontal cross-sections are ellipses.
Thus a main assumption of this method is that, for each candidate author, the author's candidate document $D'$ is taken as the ‘prototype’ for all documents by that author, while the potential ‘spread’ of the word frequency distribution for other documents is fixed as the overall spread for that word in the entire background set. In addition, every indicator word's frequency is assumed to be statistically independent of every other indicator word's frequency. (This second assumption is indeed rather questionable, and is dealt with in more detail in Section 4.)
More specifically, in this case the distribution is taken to be a Gaussian distribution over $n$ independent variables (the $n$ indicator word frequencies). This distribution (also called the ‘normal’ distribution) is the one most commonly used for modeling probabilistic phenomena, for two main reasons. The first is the distribution's convenient mathematical structure (for example, the projection of an $n$-dimensional Gaussian distribution onto any linear combination of a subset of variables is also a Gaussian distribution). The second is the Central Limit Theorem, which states that the distribution of the sum of $k$ identically-distributed independent random variables will approach a Gaussian distribution as $k$ increases, regardless of the type of distribution of the individual variables (so long as the variance of the sum remains finite).
3.2 Linear Delta
While all of this discussion of the quadratic Delta variant is well and good, we may wonder what this approach can say about Burrows's original Delta. In fact, it can be shown that attribution by Burrows's Delta function is also equivalent to a probabilistic attribution principle, just using a different underlying probability distribution function.
Figure 4: Laplace probability distributions. (a) One-dimensional Laplace distributions for several values of the scale parameter $b$. (b) Two-dimensional Laplace distribution (probability shown along the vertical axis) where both standard deviations are 1. Note that horizontal cross-sections are diamonds. (c) Two-dimensional Laplace distribution where one standard deviation is 1 and the other is $\frac{1}{2}$. Note that horizontal cross-sections are elongated diamonds.
Recall that Burrows's Delta between texts $D$ and $D'$ may be computed by the function:
$$\Delta_B^{(n)}(D, D') = \sum_{i=1}^{n} \frac{1}{\sigma_i}\left| f_i(D) - f_i(D') \right|$$
In the same manner as we showed that distance ranking of authorship candidates by quadratic Delta is equivalent to probability ranking using a Gaussian distribution, we can show that distance ranking by Burrows's Delta is equivalent to probability ranking using what is known as the Laplace distribution³. The Laplace distribution for one variable is given by the equation (Evans, Hastings, & Peacock, 2000, p. 117):
$$L_{\langle a, b \rangle}(x) = \frac{1}{2b}\, e^{-\frac{|x - a|}{b}}$$
This distribution has two parameters: $a$, indicating the median, and $b$, indicating the amount of ‘spread’ in the distribution. Graphs of this density function (centered around 0) are given in Figure 4(a) for several values of $b$.
The multivariate form (assuming independent variables) is given by the formula:
$$L_{\langle \vec{a}, \vec{b} \rangle}(\vec{x}) = \frac{1}{\prod_i 2b_i}\, e^{-\sum_i \frac{1}{b_i}|x_i - a_i|}$$
Graphs of the 2-dimensional Laplace density function with $b_1 = b_2 = 1$ are given in Figure 4(b), and with $b_2 = 2$ in Figure 4(c). The connection to Burrows's Delta is the same in form as the connection of the Gaussian density function to quadratic Delta, seen by substituting $\sigma_i$ for $b_i$, $f_i(D')$ for $a_i$, and $f_i(D)$ for $x_i$:
$$\frac{1}{\prod_i 2\sigma_i}\, e^{-\sum_i \frac{1}{\sigma_i}|f_i(D) - f_i(D')|} = \frac{1}{\prod_i 2\sigma_i}\, e^{-\Delta_B^{(n)}(D, D')}$$
³Named after mathematician and physicist Pierre-Simon Laplace (1749–1827).
Note that the fraction in front ($\frac{1}{\prod_i 2\sigma_i}$) is a constant for all $D$ and $D'$, and so will not affect any ranking, and that exponentiating minus Delta just reverses the ranking (so we seek a maximum probability instead of a minimum distance).
There are several important differences between the Laplace distribution and the Gaussian distribution that are relevant to us. Most significantly, while the mean of the distribution is in fact equal to the parameter $a$, the standard deviation of the Laplace distribution is not $b_i$, but rather is given by $\sqrt{2}\,b_i$. If $a_i$ and $b_i$ are to be estimated from $w_i$'s frequencies in the comparison corpus, the best (maximum likelihood) estimates are (Evans et al., 2000, p. 120):
$$a_i = \mathrm{median}(\langle f_i(D_1), f_i(D_2), \cdots, f_i(D_m) \rangle)$$
$$b_i = \frac{1}{m} \sum_{j=1}^{m} \left| f_i(D_j) - a_i \right|$$
The median of a set of numbers is that number such that half of the set are higher and half are lower.
The implications are crucial for properly establishing Burrows's Delta as a well-founded probabilistic ranking principle. As discussed above with respect to the quadratic variation of Delta, the most straightforward interpretation is that parameters estimated from the comparison corpus are taken to describe the probability distribution of word frequencies for different authors (whose distributions just vary by ‘location’, i.e., $f_i(D')$). If so, then we would want to interpret Burrows's Delta as a principle for ranking authorship candidates based on assuming that:

- Word frequencies for a given author are randomly distributed according to a multivariate Laplace distribution;
- All authors have the same $b$ parameters for such distributions, varying only by the distribution mean;
- Such parameters can be best estimated from a varied comparison corpus of texts.
Using z-scores, which divide by the standard deviations $\sigma_i$ in the comparison corpus, does not precisely allow such an interpretation, since the Laplace distribution does not divide by $\sigma_i$ but rather by $b_i$. So to establish such an interpretation, instead of using z-scores the method should divide by the $b_i$ estimates given above, i.e., the average absolute deviation of comparison text word frequencies from the median word frequency. This gives us the alternative linear Delta formulation:
$$\Delta_L^{(n)}(D, D') = \sum_{i=1}^{n} \frac{1}{b_i}\left| f_i(D) - f_i(D') \right|$$
with $b_i$ defined as above (average absolute deviation from the median). This function retains the form of Burrows's original Delta, but bases it more firmly on a probabilistic ranking principle. (It may still be the case that the difference in practice between Burrows's Delta and linear Delta is negligible.)
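A minimal Python sketch of the linear Delta, with $a_i$ and $b_i$ estimated from a comparison corpus as just described (toy data, illustrative names):

```python
import numpy as np

def laplace_params(corpus):
    """Estimate a_i (median) and b_i (mean absolute deviation from the median)
    for each word, from a documents-by-words frequency matrix."""
    a = np.median(corpus, axis=0)
    b = np.mean(np.abs(corpus - a), axis=0)
    return a, b

def delta_l(f_d, f_d2, b):
    """Linear Delta: absolute differences scaled by the Laplace b_i estimates."""
    return np.sum(np.abs(f_d - f_d2) / b)

corpus = np.array([[0.050, 0.031, 0.022],
                   [0.043, 0.035, 0.019],
                   [0.061, 0.028, 0.025]])
_, b = laplace_params(corpus)
target = np.array([0.048, 0.032, 0.021])
print(np.argmin([delta_l(target, cand, b) for cand in corpus]))
```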
Figure 5: Comparison of Gaussian and Laplace probability distributions, showing a Gaussian density function with $\mu = 0$ and $\sigma = 1$ and a Laplace density function with $a = 0$ and $b = \frac{1}{\sqrt{2}}$ (such that its standard deviation is also 1). (a) From $x = -4$ to $x = 4$. (b) Detail view of the range $x = 2$ to $x = 5$, showing the heavier tail of the Laplace distribution.
3.3 Contrasting the Gaussian and Laplace distributions
Both the Gaussian and Laplace distributions have useful properties for our purposes; the choice
between them for any given problem is largely empirical. We should also note that as both of
them are unbounded, i.e., they give any number some non-zero probability of occurrence, neither
is accurate for describing word frequencies, strictly speaking, as word frequencies cannot be less
than 0 nor greater than 1. However, they may still provide useful models for word frequencies.
A useful way of understanding the difference between the Gaussian and Laplace distributions is by viewing them as error laws, that is, as describing the random distribution of errors in measurement about some true value. J. M. Keynes (1911) showed that, assuming positive and negative errors to be equally likely:

- If the most probable true value is the arithmetic mean (i.e., average) of the noisy observations, then the Gaussian distribution is the most likely correct error law;
- If the most probable true value is the median of the noisy observations, then the Laplace distribution is the most likely correct error law.
The mean of a set of numbers is more strongly affected by far-away values (‘outliers’) than the median, hence the Laplace distribution is more stable (and so generally to be preferred) when such outliers are more common. Intuitively, we can see this also in a comparison of the distributions' shapes (Figure 5), where we see two main differences:

- The Laplace distribution is more ‘peaked’ than the Gaussian, that is, it gives more probability to values very close to the mean, whereas the Gaussian distribution allows more spread; and
- The Laplace distribution is more ‘heavy-tailed’ than the Gaussian, in that it also gives more probability to values very far from the mean.
Thus the Gaussian distribution allows for more mid-range spread around the mean than the Laplace, while the Laplace allows for more far-away ‘outliers’ to occur. If we expect the texts of a given author to have similar frequencies for all common words, with a medium amount of spread but almost no ‘outliers’, then a Gaussian distribution will be more appropriate. However, if we think that the frequencies will be very tightly gathered around the center, but that there is a greater (though still low) likelihood that some texts by the author will have a few highly atypical word frequencies, then the Laplace distribution may be a more appropriate choice.
We emphasize again that the choice of proper distribution for modeling word frequencies for
authorship attribution is entirely an empirical one, which we will address in the sequel.
4 Interdependence among Word Frequencies
The understanding that Delta (in the various incarnations presented thus far) corresponds to choosing the author that assigns the highest probability to the target text (under certain assumptions) goes a long way toward giving the method theoretical justification, as well as elucidating its underlying theoretical (in this case probabilistic) assumptions. One of those assumptions, however, is particularly troubling: the assumption that word frequencies are utterly independent of each other. This assumption is clearly false in general, and so raises a significant theoretical difficulty for the general use of the Delta method⁴. It is the purpose of this section to address this problem by examining how versions of Delta could be developed that relax the strong assumption of independence. Such methods will stand on a firmer theoretical foundation, and so possibly lead to more easily justified results (provided that empirical evidence also supports their efficacy).
4.1 The Gaussian distribution
We begin, for mathematical simplicity, with the Gaussian distribution. If two random variables (in this case, the frequencies of different frequent words in a random text) are not independent, then they have non-zero covariance, where the covariance $\sigma_{ij}$ of the frequencies of words $w_i$ and $w_j$ is defined as:
$$\sigma_{ij} = \mathrm{Exp}_D\left[ (f_i(D) - \mu_i)(f_j(D) - \mu_j) \right]$$
that is, the expected value (over all possible random texts $D$) of $f_i$'s deviation from its mean times $f_j$'s deviation from its mean. If the two words occur independently, then the covariance will be zero. Naturally, since we do not have access to all possible random texts, we must estimate the covariance from given data in the comparison corpus $C$, so in what follows we use the estimate:
$$\sigma_{ij} = \frac{1}{|C|} \sum_{D \in C} (f_i(D) - \mu_i)(f_j(D) - \mu_j)$$
It is easy to see that the variance of a single variable is simply its covariance with itself:
$$\sigma_i^2 = \frac{1}{|C|} \sum_{D \in C} (f_i(D) - \mu_i)^2$$
⁴Even given overwhelming empirical evidence to be adduced for Delta's efficacy, the underlying gap between the method's theoretical assumptions and the known facts about language could still raise serious doubts about its validity in novel cases, were the discrepancy not explained.
Mathematically, we arrange the covariances among all the variables in a covariance matrix, a square matrix defined as:
$$S = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{pmatrix}$$
In the case that all the variables are independent, all elements of the matrix other than those on the diagonal are zero. When there are statistical dependencies, however, some (perhaps many or all) of the non-diagonal elements will be non-zero. Note that the diagonal elements (variances) are always positive, while covariances between different variables may be negative (i.e., when one variable is high, the other tends to be low). Correspondingly, we represent the relevant word frequencies in a given document $D$ by a vector of all the frequencies:
$$\vec{f}(D) = \begin{pmatrix} f_1(D) \\ f_2(D) \\ \vdots \\ f_n(D) \end{pmatrix}$$
This representation allows us to use the powerful tools of linear algebra (Meyer, 2000) to compactly represent and manipulate complex expressions (a brief explanation of basic operations is given in the Appendix).
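In Python, $S$ can be estimated directly from a documents-by-words frequency matrix; a sketch (bias=True matches the $\frac{1}{|C|}$ normalization used above, rather than numpy's default $\frac{1}{|C|-1}$):

```python
import numpy as np

# Toy comparison corpus: rows = documents, columns = word frequencies.
corpus = np.array([[0.050, 0.031, 0.022],
                   [0.043, 0.035, 0.019],
                   [0.061, 0.028, 0.025],
                   [0.055, 0.030, 0.021]])

# np.cov expects variables in rows, so transpose; bias=True divides by |C|.
S = np.cov(corpus.T, bias=True)
print(S)   # diagonal: variances sigma_i^2; off-diagonal: covariances sigma_ij
```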
The matrix equivalent of the reciprocal of the variance ($\frac{1}{\sigma^2}$) is the inverse of the covariance matrix, denoted $S^{-1}$, and defined such that $S^{-1}S = I$, where $I$ is the identity matrix, which has all ones on the diagonal and zeros elsewhere. To take a simple case in two dimensions, suppose that the covariance between words $w_1$ and $w_2$ is $\sigma_{12} = \sigma_{21} = 0.5$, with variances $\sigma_1^2 = 1$ and $\sigma_2^2 = 2$. Then the covariance matrix $S$ is
$$S = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 2 \end{pmatrix}$$
and its inverse $S^{-1}$ is
$$S^{-1} = \begin{pmatrix} 1.143 & -0.286 \\ -0.286 & 0.571 \end{pmatrix}$$
The corresponding quadratic Delta function, allowing for dependent variables, is:
$$\begin{aligned}
\Delta_{Q\angle}^{(2)}(D, D') &= (\vec{f}(D) - \vec{f}(D'))^{T} S^{-1} (\vec{f}(D) - \vec{f}(D')) \\
&= 1.143\,(f_1(D) - f_1(D'))^2 \\
&\quad - 0.286\,(f_1(D) - f_1(D'))(f_2(D) - f_2(D')) \\
&\quad - 0.286\,(f_2(D) - f_2(D'))(f_1(D) - f_1(D')) \\
&\quad + 0.571\,(f_2(D) - f_2(D'))^2
\end{aligned}$$
(The subscript ‘$Q\angle$’ indicates a quadratic function, ‘$Q$’, that is non-axis-parallel, ‘$\angle$’.)
Graphical depictions of this function are given in Figure 6. As the figure shows, when variables covary, the iso-Delta ellipses (or, in higher dimensions, ellipsoids) are rotated relative to the axes. The direction and amount of rotation depend on the amount of covariance.
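The two-dimensional example can be checked numerically; a short sketch (the matrix $S$ is the one from the text above):

```python
import numpy as np

S = np.array([[1.0, 0.5],
              [0.5, 2.0]])
S_inv = np.linalg.inv(S)   # approx. [[1.143, -0.286], [-0.286, 0.571]]

def delta_q_angle(f_d, f_d2, S_inv):
    """Non-axis-parallel quadratic Delta: (f - f')^T S^{-1} (f - f')."""
    d = f_d - f_d2
    return d @ S_inv @ d

print(np.round(S_inv, 3))
print(delta_q_angle(np.array([0.05, 0.03]), np.array([0.04, 0.02]), S_inv))
```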
Figure 6: Two-dimensional quadratic ranking functions with dependent variables; standard deviations $\sigma_1 = 1$ and $\sigma_2 = 2$ and covariance $\sigma_{12} = \sigma_{21} = \frac{1}{2}$. (a) Iso-Delta curves; note that the ellipses are rotated 45° due to the covariance. (b) The Gaussian probability distribution.
To generalize, when $S^{-1}$ exists (see below), the $n$-dimensional non-axis-parallel quadratic Delta function is given by
$$\Delta_{Q\angle}^{(n)}(D, D') = (\vec{f}(D) - \vec{f}(D'))^{T} S^{-1} (\vec{f}(D) - \vec{f}(D')) = \sum_i \sum_j (f_j(D) - f_j(D'))\,(S^{-1})_{ij}\,(f_i(D) - f_i(D'))$$
where $(S^{-1})_{ij}$ denotes the $i,j$th element of the matrix $S^{-1}$. Note that this choice of non-axis-parallel Delta function is not arbitrary. Choosing a document $D'$ to minimize this function directly corresponds to choosing one to maximize the probability given by the standard multivariate Gaussian distribution with mean at $\vec{f}(D')$ and covariance matrix $S$:
$$G_{\langle \vec{f}(D'),\, S \rangle}(\vec{f}(D)) = \frac{1}{\sqrt{(2\pi)^n \det(S)}}\, e^{-(\vec{f}(D) - \vec{f}(D'))^{T} S^{-1} (\vec{f}(D) - \vec{f}(D'))} = \frac{1}{\sqrt{(2\pi)^n \det(S)}}\, e^{-\Delta_{Q\angle}^{(n)}(D, D')}$$
where $\det(S)$ denotes the determinant of the matrix $S$ (see Meyer, 2000).
This is all well and good, but there is a fly in the ointment: $S$ does not always have an inverse, and such a situation will clearly play havoc with the formulae presented above.
Figure 7: Illustrations of subspaces spanned by small numbers of points (i.e., comparison texts). (a) Two points in a two-dimensional space can give variation along both axes, but span a ‘natural’ space (the line segment connecting them) with no ‘thickness’, regardless of how the points are placed. Hence two points can only span a 1-dimensional subspace. (b) Three non-collinear points in a two-dimensional space give a triangle, which has ‘thickness’ in all directions, and so span a 2-dimensional subspace. (c) Even if embedded in three dimensions, such that the points have different locations on all three axes, three points still only form a two-dimensional triangle regardless of where they are located, and so can only span a 2-dimensional space.
In fact, if there are fewer texts in the comparison corpus than there are dimensions (i.e., relevant words) considered, we are guaranteed that $S$ has no inverse! Roughly speaking, this is because $S$ only has an inverse if the set of points used to estimate it (i.e., the comparison texts, viewed as points in an $n$-dimensional space) has thickness in every single direction in the $n$-dimensional space (technically, this is equivalent to the matrix having ‘full rank’). For example, consider the case of two relevant words and two comparison texts (Figure 7(a)). While in each axis-parallel direction the set of two points has thickness, along the direction perpendicular to the line joining the two points the set perforce has zero thickness. That is, the two points define a line, which has lower dimension (one) than the full space (two). In general, a set of $m$ points can define a space of at most $m - 1$ dimensions, and hence if the number of word-variables $n$ is greater than $m - 1$, the covariance matrix will not be invertible.
However, all is not lost. As Figure 7 shows for the low-dimensional case, it is still possible to use a Gaussian distribution as an authorship ranking principle by first rotating the space to align with the ‘natural’ axes defined by the estimated covariance matrix $S$, and then ranking according to a lower-dimensional Gaussian probability distribution. Mathematically, this is accomplished by eigenvalue decomposition of the matrix $S$. This decomposition factors $S$ into a product of three matrices, as $S = EDE^{T}$, where $D$ is a diagonal matrix (zero except on the diagonal) and $E$ is a square matrix called the eigenvector matrix. Geometrically, $E$ represents an $n$-dimensional rotation of the space around the origin such that after the rotation, the variables corresponding to the new axes are statistically independent. The values along the diagonal in $D$ (the eigenvalues) are then the estimated variances of the distribution for each of those new composite variables. In the case that $S$ is non-invertible, some of the eigenvalues will be zero, meaning that the distribution is flat in the corresponding directions. Let us call the number of non-zero eigenvalues (the number of ‘thick’ dimensions) $k$. By using just the $k$ columns of $E$ that correspond to non-zero eigenvalues, we can both rotate the space to accord with the ‘natural’ axes of the distribution and project points into a $k$-dimensional space, all of whose dimensions have some ‘thickness’.
The idea, then, is to do an eigenvalue decomposition of $S$ (using a standard software library) to get $E$ and $D$. Then we remove the columns of $E$ that correspond to zero eigenvalues in $D$ to get $E_k$, and remove the zero-eigenvalue columns and rows of $D$ to get $D_k$. Note that for any $n$-dimensional vector $\vec{x}$, computing the vector $\vec{y} = E_k^{T}\vec{x}$ rotates $\vec{x}$ into the natural axes of the space and then projects it into the $k$-dimensional ‘thick’ space, transforming the $n$-dimensional vector $\vec{x}$ into a $k$-dimensional vector $\vec{y}$. In this new $k$-dimensional space, the elements $d_{ii}$ along the diagonal of $D_k$ are just the variances $\sigma_i^2$ of each new variable. Thus, in the new space, we can just use $\Delta_{Q\perp}^{(k)}$ as a ranking principle (equivalent to Gaussian distribution-based ranking), and get a probabilistic ranking method that takes into account dependence among the variables. We also note that in the case where $S$ has full rank, i.e., no eigenvalue is zero, this method will give identical results to directly using $S^{-1}$ as above.
Thus, given matrices $D_k$ and $E_k$ derived from $S$ by eigenvalue decomposition, we define the non-axis-parallel quadratic Delta function as:
$$\Delta_{Q\angle}^{(n)}(D, D') = (\vec{f}(D) - \vec{f}(D'))^{T} E_k D_k^{-1} E_k^{T} (\vec{f}(D) - \vec{f}(D'))$$
with the equivalent Gaussian distribution:
$$G_{\langle \vec{f}(D'),\, S \rangle}(\vec{f}(D)) = \frac{1}{\sqrt{(2\pi)^n \det(D_k)}}\, e^{-(\vec{f}(D) - \vec{f}(D'))^{T} E_k D_k^{-1} E_k^{T} (\vec{f}(D) - \vec{f}(D'))} = \frac{1}{\sqrt{(2\pi)^n \det(D_k)}}\, e^{-\Delta_{Q\angle}^{(n)}(D, D')}$$
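A sketch of this procedure in Python, using numpy's symmetric eigendecomposition (illustrative; the tolerance used to decide which eigenvalues count as ‘zero’ is an assumption):

```python
import numpy as np

def reduced_eig(S, tol=1e-10):
    """Eigendecompose the covariance matrix S and drop near-zero eigenvalues,
    returning E_k (n x k eigenvector matrix) and d_k (the k nonzero eigenvalues)."""
    d, E = np.linalg.eigh(S)   # eigh is appropriate for symmetric matrices
    keep = d > tol
    return E[:, keep], d[keep]

def delta_q_angle(f_d, f_d2, E_k, d_k):
    """Quadratic Delta computed in the rotated, projected 'thick' subspace."""
    y = E_k.T @ (f_d - f_d2)     # rotate and project the difference vector
    return np.sum(y ** 2 / d_k)  # axis-parallel quadratic Delta in k dimensions
```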
4.2 The Laplace distribution
Unfortunately, no such mathematically and computationally tractable treatment exists for multivariate Laplace distributions with correlated variables. Indeed, there is no single universally accepted multivariate generalization of the Laplace distribution. Some possibly applicable multivariate extensions of the Laplace distribution do exist (Kotz, Kozubowski, & Podgórski, 2001), but since such generalizations are neither unique nor straightforward, they are outside the scope of this article.
A heuristic (i.e., not probabilistically well-founded) extension of the Laplace distribution is based on using the eigenvalue decomposition trick presented above to transform document vectors onto an (empirically) independent set of variables. The idea is to take each document vector $\vec{f}(D)$ and convert it to a (lower-dimensional) vector $\vec{g}(D)$, defined by:
$$\vec{g}(D) = E_k^{T} \vec{f}(D)$$
where $E_k$ is the reduced-dimension eigenvector matrix computed from the comparison set, as described in Section 4.1 above. All document vectors (in the comparison and test sets) are transformed in this way; then the standard (independent variable) linear Delta function $\Delta_L$ can be applied to the $\vec{g}(D)$ vectors, as the variables are now (assumed to be) independent, as in the sketch below. Keep in mind that the parameters $a_i$ and $b_i$ for $\Delta_L$ are to be calculated over the variables in $\vec{g}(D)$, rather than $\vec{f}(D)$.
Note that this method is only approximate, as the eigenvector method for transforming the vec-
tors to independent variables is based on assuming that the true distribution is actually Gaussian.
As noted above, an exact solution using any of the currently available multivariate generalizations
of the Laplace distribution would be excessively complicated for our purposes here.
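A sketch of the heuristic as just described (everything here is illustrative):

```python
import numpy as np

def rotated_linear_delta(corpus, f_d, f_d2, tol=1e-10):
    """Heuristic Laplace-based Delta: rotate/project via the eigenvectors of the
    covariance matrix, then apply linear Delta on the transformed variables."""
    S = np.cov(corpus.T, bias=True)
    d, E = np.linalg.eigh(S)
    E_k = E[:, d > tol]                        # keep only 'thick' directions
    g_corpus = corpus @ E_k                    # transform comparison documents
    g1, g2 = E_k.T @ f_d, E_k.T @ f_d2         # transform the two texts compared
    a = np.median(g_corpus, axis=0)            # medians of the new variables
    b = np.mean(np.abs(g_corpus - a), axis=0)  # spread estimates b_i on them
    return np.sum(np.abs(g1 - g2) / b)
```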
4.3 Relationship to PCA
The sort of eigenvector decomposition described in this section highlights some of the similarities and differences between the Delta method and the older method of using principal components analysis (PCA). Briefly, PCA uses eigenvector decomposition to project points in a high-dimensional space into a low-dimensional (usually 2-dimensional) space, while losing as little information as possible about the variability in the data (see Binongo & Smith, 1999, for a more detailed exposition). Applications of PCA for authorship attribution (Burrows, 1992; Binongo & Smith, 1999; Baayen, van Halteren, Neijt, & Tweedie, 2002) vary in their details, but the basic idea is to obtain a 2-dimensional visualization of the relative positions of the $n$-dimensional representations of the comparison documents and the target document, and then to see (a) whether comparison documents by different authors are divided neatly into different regions of the space, and (b) whether the target can be clearly associated with one of those regions (corresponding to a particular attribution of its authorship).
Use of PCA can thus also be viewed as a form of nearest-neighbor classification in a transformed space, where the transformation is the rotation and projection to a particular 2-dimensional space (the PC space), defined by the two main principal components. Measuring distance between points in this space can be viewed as a probabilistic ranking principle as discussed above (Sec. 3), which assumes that:

- The probability distribution is Gaussian;
- Variances are equal in all directions (if the PC space is not scaled according to the eigenvalues); and
- Only distances in the PC space are significant; all others can be safely ignored.
These characteristics of the PCA method may account (singly or together) for the fact that Burrows's Delta often works better in practice than PCA, despite its strong assumption of word frequency independence. The Laplace distribution may be a better approximation than the Gaussian of the true distribution of word frequencies; accounting for differences in frequency variation among different words may be critical; and two dimensions may simply not be enough to capture the true structure of the space (though the success of linear discrimination methods (Diederich, Kindermann, Leopold, & Paass, 2000; Baayen et al., 2002) argues against this last notion). Use of the (approximate) Laplace-based method given immediately above may therefore enable more accurate attribution by combining the benefits of PCA with those of Delta.
5 Discussion
We have shown how Burrows’s Delta measure for authorship attribution may be viewed as an
approximation to ranking candidates by probability according to an estimated Laplace distribution.
This view leads directly to some theoretically well-founded extensions and generalizations of the
method based on using Gaussian distributions in place of Laplace distributions, as well as correcting
for statistical correlations between the various word frequencies being used. The choice among these
various methods, of course, is an empirical question, which we will address in the sequel to this
article, by applying these methods to several previously examined authorship problems.
In addition to giving several justifiable variants of the method, this view of Delta also gives a clearer idea of its assumptions and likely limitations. The method clearly assumes that word frequencies for all authors are distributed with similar spreads, differing only in the central value (which is taken from a relevant comparison document). In cases where only one or two documents are available from a given author, this assumption is virtually unavoidable; however, when more documents are available, they should be used to adjust the estimates of the likely spread in frequencies of the various words under consideration. More significantly, this assumption appears to fundamentally limit use of the method to cases where all the samples (from all authors) are of essentially the same textual variety; otherwise, we would expect the word frequency distributions over the comparison set to be a mixture of several disparate distributions, one for each genre found in the set, thus potentially biasing results depending on the variety of the test text.
Acknowledgments
Thanks to David Hoover, Mark Olsen, and the anonymous reviewers for their readings and helpful
comments on earlier drafts of this article.
References
Baayen, H., van Halteren, H., Neijt, A., & Tweedie, F. (2002). An experiment in authorship attribution. In JADT 2002: Journées internationales d'Analyse statistique des données textuelles.
Binongo, J. N. G., & Smith, M. W. A. (1999). The application of principal components analysis to stylometry. Literary and Linguistic Computing, 14, 445–465.
Burrows, J. (1992). Not unless you ask nicely: The interpretative nexus between analysis and information. Literary and Linguistic Computing, 7, 91–109.
Burrows, J. (2002). ‘Delta’: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
Burrows, J. (2003). Questions of authorship: Attribution and beyond; a lecture delivered on the occasion of the Roberto Busa Award ACH-ALLC 2001, New York. Computers and the Humanities, 37(1), 5–32.
Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2000). Authorship attribution with support vector machines. Applied Intelligence.
Evans, M., Hastings, N. A. J., & Peacock, J. B. (2000). Statistical distributions. New York: Wiley.
Hoover, D. L. (2004a). Delta prime? Literary and Linguistic Computing, 19(4), 477–495.
Hoover, D. L. (2004b). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4), 453–475.
Hoover, D. L. (2005). Delta, Delta Prime, and modern American poetry: Authorship attribution theory and method. In Proceedings of the 2005 ALLC/ACH Conference. Victoria, BC.
Keynes, J. M. (1911). The principal averages and the laws of error which lead to them. Journal of the Royal Statistical Society, 74, 322–328.
Kotz, S., Kozubowski, T. J., & Podgórski, K. (2001). The Laplace distribution and generalizations: A revisit with applications to communications, economics, engineering, and finance. Boston: Birkhäuser.
Meyer, C. D. (2000). Matrix analysis and applied linear algebra. Philadelphia: Society for Industrial and Applied Mathematics.
Wettschereck, D., Aha, D. W., & Mohri, T. (1997). A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11(1–5), 273–314.
A Some Key Concepts in Linear Algebra
The mathematics of linear algebra studies analytical methods deriving from the problem of simultaneously solving multiple algebraic equations in multiple unknowns. The theory is also useful for geometric reasoning, based on Cartesian⁵ principles of analytic geometry. This appendix must perforce be a rather superficial overview; an excellent comprehensive text is Meyer (2000).
A.1 Matrices and vectors
The two key notions are the vector and the matrix. A vector represents a point in a multi-dimensional space as a list of $n$ numbers, each number giving a position along one of $n$ notional axes (on some arbitrary scale). A vector may also be viewed as an arrow from the origin to its position. Conventionally, we notate a vector as $\vec{x}$, and view it as a vertical column of its elements:
$$\vec{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix}$$
A fundamental operation between vectors is the scalar product, notated $\vec{x}^{T}\vec{y}$, defined to be the sum of the products of the corresponding elements of the vectors (the operation is only defined if both vectors have the same number of elements). The scalar product of a vector with itself is just the square of the vector's length (in Euclidean terms). The scalar product of two different vectors can be shown to be equal to the cosine of the angle between the vectors times the product of the two vectors' lengths. Thus, holding length equal, vectors that point in roughly the same direction will have a larger scalar product than vectors that point in different directions; the scalar product of vectors at right angles to one another is always 0.
A matrix, on the other hand, represents a linear transformation of the $n$-dimensional space into a (possibly different) $m$-dimensional space, and may be viewed as comprising an $m \times n$ array of numbers:
$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{pmatrix}$$
A matrix is viewed as a transformation of the space via the operation of matrix multiplication, where a matrix is used as a function to move points (i.e., vectors) in the space. In matrix multiplication, a new $m$-dimensional vector $\vec{z}$ is constructed from a given $n$-dimensional vector $\vec{x}$ by taking the scalar product of each row of the matrix (viewed as a vector) with $\vec{x}$ to form each element of the new vector $\vec{z}$. In mathematical terms:
$$\vec{z} = A\vec{x} \qquad \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_m \end{pmatrix} = \begin{pmatrix} A_{11}x_1 + A_{12}x_2 + \cdots + A_{1n}x_n \\ A_{21}x_1 + A_{22}x_2 + \cdots + A_{2n}x_n \\ \vdots \\ A_{m1}x_1 + A_{m2}x_2 + \cdots + A_{mn}x_n \end{pmatrix}$$
⁵After its originator, French mathematician and philosopher René Descartes (1596–1650).
Multiplication of an $m \times n$ matrix $A$ by an $n \times k$ matrix $B$ is defined similarly, giving an $m \times k$ matrix $C$, notated $C = AB$, where each column of $C$ is the product of $A$ with the corresponding column of $B$.
If both dimensions of a matrix are equal ($m = n$), then the matrix is square, and it transforms points in $n$-dimensional space to other points in $n$-dimensional space. A special kind of square matrix is the identity matrix, notated $I$, whose elements are all zero except for those along the diagonal, which are all 1 (the dimensionality is assumed known). It serves the same function in linear algebra as the number 1 does in ordinary algebra. Some square matrices $A$ have an inverse $A^{-1}$, defined such that $A^{-1}A = I$; such a matrix is called invertible, while other matrices are termed singular.
A square matrix can also be used to define an arbitrary quadratic function over points in the $n$-dimensional space, by generalizing the notion of scalar product:
$$\vec{x}^{T}A\vec{x} = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}\, x_i x_j = A_{11}x_1^2 + A_{12}x_1x_2 + \cdots + A_{1n}x_1x_n + A_{21}x_2x_1 + \cdots + A_{nn}x_n^2$$
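For instance, a small numpy check (with arbitrary values) that the matrix expression and the explicit double sum agree:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

quad_matrix = x @ A @ x                 # x^T A x via matrix products
quad_sum = sum(A[i, j] * x[i] * x[j]    # the expanded double sum
               for i in range(3) for j in range(3))
assert np.isclose(quad_matrix, quad_sum)
```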
A.2 Eigenvalue decomposition
If we view an $n \times n$ matrix $A$ as a transformation on an $n$-dimensional vector space (via the matrix product $\vec{y} = A\vec{x}$), we may ask: “Are there any vectors whose direction does not change when multiplied by $A$ (though their length might)?” That is, is there any vector $\vec{x}$ such that $A\vec{x} = \lambda\vec{x}$ for some value of $\lambda$? This question leads to an extremely fruitful area of linear algebra, via the twin notions of the eigenvector and the eigenvalue. The essential definition is this:

Given an $n \times n$ matrix $A$, an $n$-dimensional vector $\vec{x}$ with at least one non-zero element is an eigenvector of $A$ with associated eigenvalue $\lambda \neq 0$ if $A\vec{x} = \lambda\vec{x}$.
The eigenvectors of a matrix thus form, in a sense, the fundamental ‘modes’ of the matrix, viewed as a vector-space transformation. If $A$ is invertible (i.e., if $A^{-1}$ exists), then it has $n$ non-zero eigenvalues (hence $n$ eigenvectors); if it is non-invertible, then there will be fewer non-zero eigenvalues and associated eigenvectors.
Note that multiplying any eigenvector by a number (i.e., stretching it) will also give an eigenvector. Thus we can standardize eigenvectors to all have unit length, such that for any eigenvector $\vec{e}$, we have $\vec{e}^{T}\vec{e} = 1$. To avoid confusion, we will call these unit eigenvectors. Also note that any two eigenvectors $\vec{e}_i$ and $\vec{e}_j$ of a symmetric matrix (such as the covariance matrix $S$) corresponding to distinct eigenvalues will be orthogonal to each other (that is, perpendicular), such that their dot-product is zero: $\vec{e}_i^{T}\vec{e}_j = 0$.
Thus, we may define the eigenvector matrix $E(A)$ (or simply $E$ when not ambiguous) for an $n \times n$ matrix $A$ with $m \leq n$ eigenvectors to be the $n \times m$ matrix whose columns are $A$'s unit eigenvectors. If $A$ is invertible (i.e., has $n$ eigenvectors), then $E(A)$ is a square matrix such that $E(A)^{T}E(A) = I$; that is, its transpose is its inverse (this follows directly from the two properties noted just above). It can then be shown that $E^{T}AE = D$, where $D$ is a diagonal matrix (one all of whose elements are zero except along the diagonal) whose diagonal elements are the eigenvalues of $A$.
... Early 20th century approaches relied on univariate analyses [98,99,101], before more sophisticated and successful multivariate techniques were introduced [75]. The period up until the late 1990s, was focused largely on attempts to define and identify features that quantify writing style -known as stylometry [93], before the introduction of an important class of multivariate techniques known as similarity-based models [9,23,35]. The last two decades or so, have seen the rise of machine learning approaches [1,2,31,86] and more recently a growing interest in so called deep learning (i.e., neural network-based) methods [15,16]. ...
... Score-based methods for estimating LRs are prevalent across different types of forensic evidence [19,30,40], ([85], Ch. 7 & 8). Score-based methods are an appealing choice for text evidence because many authorship attribution studies have relied on distancebased methods [9,33,48,56,57,59,60,89,90,92] and it is a standard approach for the problem [34,86,93]. These distance measures are easily modelled using score-based methods. ...
... Some assumptions are made about the data distribution by distance-based measures. For instance, Burrows's Delta [23] -the most well-known metric in stylometric and authorship studies -assumes a double exponential (or Laplace) distribution of data [9]. Many other distance measures, such as Euclidian distance and cosine-based distance [61], suppose a normal distribution of data. ...
Article
This study compares score- and feature-based methods for estimating forensic likelihood ratios for text evidence. Three feature-based methods built on different Poisson-based models with logistic regression fusion are introduced and evaluated: a one-level Poisson model, a one-level zero-inflated Poisson model and a two-level Poisson-gamma model. These are compared with a score-based method that employs the cosine distance as a score-generating function. The two types of methods are compared using the same data (i.e., documents attributable to 2,157 authors) and the same features set, which is a bag-of-words model using the 400 most frequently occurring words. Their performances are evaluated via the log-likelihood ratio cost (Cllr) and its composites: discrimination (Cllrmin) and calibration (Cllrcal) cost. The results show that 1) the feature-based methods outperform the score-based method by a Cllr value of 0.14~0.2 when their best results are compared and 2) a feature selection procedure can further improve performance for the feature-based methods. Some distinctive performance characteristics associated with likelihood ratios produced using the feature-based methods are described, and their implications will be discussed with real forensic casework in mind.
... Hoover, 2004a et.al. in [11] proved that literary texts like novels, poems are most suitable for this method and produced significant results. Argamon (2008) in [47] explained the working principle of Delta theoretically and demonstrated that Delta can be seen as a axis-weighted form of nearest neighbour classification where the unidentified text is allocate to the nearest class instead of the nearest training text. Hoover et.al. in [11] worked on thorough research of disparity of Delta. ...
... Hoover, 2004a et.al. in [11] proved that literary texts like novels, poems are most suitable for this method and produced significant results. Argamon (2008) in [47] explained the working principle of Delta theoretically and demonstrated that Delta can be seen as a axis-weighted form of nearest neighbour classification where the unidentified text is allocate to the nearest class instead of the nearest training text. Hoover et.al. in [11] worked on thorough research of disparity of Delta. ...
Article
Full-text available
The goal of Authorship Attribution (AA) is to take decisions about an author of unique chunk of text. AA is a variant of classification problem and it differs from classical text classification problem. It is a text analysis process that has different techniques namely, Authorship Identification, Authorship Profiling, Authorship verification, Authorship clustering, Authorship Diarization, and Plagiarism Detection. In this study a brief survey of Authorship attribution technique are presented. AA is a process, given a document and a set of candidate authors, determine who among them wrote the document. Plenty of accessible electronic writings are available in different areas on web and sometimes the authors of the text are unknown. The primary goal of the survey is to address various stylomeric features used and Attribution techniques based on the text corpus written by different authors.
... La mesure la plus utilisée est appelée le delta de Burrows (Burrows, 2002) qui consiste à comparer chaque document par l'usage des mots les plus fréquents observés dans le corpus. Il existe différentes variantes de cette mesure décrites par Jannidis et al. (2015) et Argamon (2008). ...
Thesis
La modélisation des utilisateurs est une étape essentielle lorsqu'il s'agit de recommander des produits et proposer des services automatiquement. Les réseaux sociaux sont une ressource riche et abondante de données utilisateur (p. ex. liens partagés, messages postés) permettant de modéliser leurs intérêts et préférences. Dans cette thèse, nous proposons d'exploiter les articles d'actualité partagés sur les réseaux sociaux afin d'enrichir les modèles existants avec une nouvelle caractéristique textuelle : le style écrit. Cette thèse, à l'intersection des domaines du traitement automatique du langage naturel et des systèmes de recommandation, porte sur l'apprentissage de la représentation du style et de son application à la recommandation d'articles d'actualité. Dans un premier temps, nous proposons une nouvelle méthode d'apprentissage de la représentation du texte visant à projeter tout document dans un espace stylométrique de référence. L'hypothèse testée est qu'un tel espace peut être généralisé par un ensemble suffisamment large d'auteurs de référence, et que les projections vectorielles des écrits d'un auteur « nouveau » seront proches, d'un point de vue stylistique, des écrits d'un sous-ensemble consistant de ces auteurs de référence. Dans un second temps, nous proposons d'exploiter la représentation stylométrique du texte pour la recommandation d'articles d'actualité en la combinant à d'autres représentations (p. ex. thématique, lexicale, sémantique). Nous cherchons à identifier les caractéristiques les plus complémentaires pouvant permettre une recommandation d'articles plus pertinente et de meilleure qualité. L'hypothèse ayant motivé ces travaux est que les choix de lecture des individus sont non seulement influencés par le fond (p. ex. le thème des articles d'actualité, les entités mentionnées), mais aussi par la forme (c.-à-d. le style pouvant, par exemple, être descriptif, satirique, composé d'anecdotes personnelles, d'interviews). Les expérimentations effectuées montrent que non seulement le style écrit joue un rôle dans les préférences de lecture des individus, mais aussi que, lorsqu'il est combiné à d'autres caractéristiques textuelles, permet d'augmenter la précision et la qualité des recommandations en termes de diversité, de nouveauté et de sérendipité.
... Nous avons donc utilisé ici une mesure de distance connue pour son efficacité pour les tâches d'attribution d'auteur (51,52) : le delta de Burrows (53). Une revue récente à grande échelle (21) a notamment montré que, combinée à une normalisation euclidienne de la longueur des vecteurs, cette mesure de distance donnait les meilleures performances pour la langue française (21), revue utilisant des méthodes de classification telles que la classification hiérarchique avec la méthode de Ward. ...
Negative Results
Full-text available
Avant-propos à Florian Cafiero, Jean-Baptiste Camps. Molière n'a probablement pas écrit ses pièces. Traduction française de l'article de Cafiero et Camps corrigé selon les procédures de la publication scientifique. En effet, cet article n'avait pas bénéficié de la relecture par les pairs lors de sa première parution. Cette relecture est possible grâce aux "Matériaux supplémentaires" que MM. Cafiero et Camps ont mis en ligne sur le site de la revue. Ces matériaux permettent de suivre leur raisonnement et de refaire leurs calculs. Résumé corrigé : "Comme pour Shakespeare, il existe un débat passionné à propos de Molière, un acteur supposé sans culture qui, selon certains, ne peut pas avoir écrit les chefs d'oeuvre qui lui sont attribués. Depuis une vingtaine d'années, la thèse centenaire selon laquelle Pierre Corneille serait le véritable auteur s'est largement répandue. Nous utilisons un corpus de comédies en vers par les principaux auteurs contemporains de Corneille et Molière. L'analyse porte sur le lexique, les rimes, les mots, les affixes, certaines séquences morphosyntaxiques et mots outils. Elle démontre qu'un auteur, parmi les plus grands de l'époque - P. Corneille - aurait écrit les pièces présentées sous le nom de Molière."
... Three commonly used distance functions in authorship analysis-Euclidean, Manhattan and Cosine distances [11,19,28,31,46] -are tested as measures for scores. Fig. 4 illustrates these three distance measures in a two-dimensional space. ...
Article
The likelihood ratio paradigm for quantifying the strength of evidence has been researched in many fields of forensic science. Within this paradigm, score-based approaches for estimating likelihood ratios are becoming more prevalent in the forensic science literature. In this study, a score-based approach for estimating likelihood ratios is implemented for linguistic text evidence. Text data are represented via a bag-of-words model with the Z-score normalised relative frequencies of selected most-frequent words (the number of the most-frequent words = N), and the Euclidean, Manhattan and Cosine distance measures are trialled as the score-generating functions for comparing paired text samples. The score-to-likelihood-ratio conversion model was built using a common source method, and the best fitting model was selected from the parametric models of the Normal, Log-normal, Gamma and Weibull distributions. With the Amazon Product Data Authorship Verification Corpus, two groups of documents (each group including documents of approximately 700, 1,400 and 2,100 words) were synthesised for each author, allowing 720 same-author comparisons and 517,680 different-author comparisons to test the validity of the system. A series of experiments was conducted using combinations of the following conditions: the three score functions, the different values of N for the feature vector and the different document lengths. The validity of the system was assessed using the log-likelihood-ratio cost (Cllr), and the strength of the derived likelihood ratios was charted in the form of Tippett plots. It was demonstrated that 1) the Cosine measure consistently outperforms the other measures—the best performance is achieved with N = 260, regardless of the document length (e.g., Cllr values of 0.70640, 0.45314 and 0.30692, respectively, for 700, 1,400 and 2,100 words)—and 2) the derived likelihood ratios are very well calibrated irrespective of the distance measures and document lengths. A follow-up experiment showed that the described score-based approach is relatively robust and stable for a limited quantity of background data. The derived likelihood ratios that were estimated separately to the three distance measures were logistic regression fused; and the fusion achieved a further improvement in performance—for example, a Cllr of 0.23494 for 2,100 words. This study demonstrates the possibility of designing likelihood ratio–based systems that discriminate between same-author and different-author documents.
... From a contemporary perspective, however, the most important advance was arguably Shlomo Argamon's interpretation of Delta's key principle. Argamon (2008) pointed out that the Delta measure that Burrows had stumbled on by intuition was actually equivalent to measuring the Manhattan distance between two vectors. As such, the entire method could be seen as an instance of nearest neighbour classification, a special case of the popular k-nearest neighbour classifier with k = 1. ...
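A minimal sketch of this equivalence (not code from the cited works): Delta is the mean absolute difference of z-scores, i.e. a scaled Manhattan distance, and attribution is 1-nearest-neighbour under it. The author names and frequency values below are invented for illustration.

import numpy as np

# Rows: candidate documents; columns: relative frequencies of frequent words.
candidates = {
    "author_A": np.array([0.051, 0.032, 0.017]),
    "author_B": np.array([0.044, 0.038, 0.021]),
}
target = np.array([0.049, 0.034, 0.018])

# z-score each word against the comparison corpus (here, the candidates).
corpus = np.vstack(list(candidates.values()))
mu, sigma = corpus.mean(axis=0), corpus.std(axis=0)

def delta(x, y):
    # Mean absolute difference of z-scores = Manhattan distance / n.
    zx, zy = (x - mu) / sigma, (y - mu) / sigma
    return float(np.mean(np.abs(zx - zy)))

# 1-nearest-neighbour attribution: the candidate with the lowest Delta wins.
print(min(candidates, key=lambda a: delta(candidates[a], target)))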
Book
https://versologie.cz/versification-authorship Contemporary stylometry uses different methods to figure out a poem’s author based on features like the frequencies of words and character n-grams. However, there is one potential textual fingerprint it tends to ignore: versification. Using poetic corpora in three different languages (Czech, German and Spanish), this book asks whether versification features like rhythm patterns and types of rhyme can help determine authorship. It then tests its findings on two real-life unsolved literary mysteries. In the first, we distinguish the parts of the verse play The Two Noble Kinsmen written by William Shakespeare from those by his co-author, John Fletcher. In the second, we seek to solve a case of suspected forgery. How authentic was a group of poems first published as the work of the 19th-century Russian author Gavriil Stepanovich Batenkov?
Article
Computational methods often produce large amounts of data about texts, which create theoretical and practical challenges for textual interpretation. How can we make claims about texts, when we cannot read every text or analyze every piece of data produced? This article draws on rhetorical and literary theories of textual interpretation to develop a hermeneutical theory for gaining insight about texts with large amounts of computational data. It proposes that computational data about texts can be thought of as analytical lenses that make certain textual features salient. Analysts can read texts with these lenses, and argue for interpretations by arguing for how the analyses of many pieces of data support a particular understanding of text(s). By focusing on validating an understanding of the corpus rather than explaining every piece of data, we allow space for close reading by the human reader, focus our contributions on the humanistic insight we can gain from our corpora, and make it possible to glean insight in a way that is feasible for the limited human reader while still having strategies to argue for (or against) certain interpretations. This theory is demonstrated with an analysis of academic writing using stylometry methods, by offering a view of knowledge-making processes in the disciplines through a close analysis of function words.
Article
This article reports on a research project in which we used computational stylometric methods to examine the textual and stylometric characteristics of Zsigmond Móricz's letters written to his wife and to others between 1902 and 1913. This experiment was the first stylometric undertaking of the Digital Humanities Centre of the Petőfi Irodalmi Múzeum (Petőfi Literary Museum). The corpus is based on the digital scholarly edition of the letters in the museum's Móricz special collection and contains 478 letters (220,268 words). We applied an R package, Stylo, together with distance-based methods (Classic Delta and Eder's Simple Delta) to analyse the above-mentioned characteristics. The results were visualised in two ways: with cluster analysis (on a dendrogram) and with principal component analysis. The classification of the letters was successful, although only the combined application of the two visualisation methods led to a result. We were able to show that there are stylometrically measurable differences between the Móricz letters written to Janka and those written to others.
Preprint
Full-text available
Which of the Pauline letters were really written by the Apostle Paul? This authorship attribution study tackles the question with the aid of the General Imposters framework as implemented in the R-package stylo. The assumptions are that Rom, 1-2 Cor, Gal, Phil, 1 Thess and Phlm are authentic, and that the non-Pauline texts in the NT form a good corpus of distractor authors. The study uses as text representations {1,2,3}-grams of Greek words, {1,2,3}-grams of Greek letters, {1,2,3}-grams of Strong's numbers, and some variations of part-of-speech tags with morphological information. These representations were combined with the distance measures Cosine, Entropy and Canberra. The combinations were tested for attributive success, and only those with a very small area of uncertainty were selected for the analysis, so that on theoretical grounds the results are expected to be highly significant. The results of the imposters method show clear authorial signs of Paul for all seven contested Pauline letters. Further research is needed to corroborate these results, but they could attract interest in the field of theology. Code available at https://github.com/jnussbaum/authorship-attribution
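A minimal sketch of the General Imposters idea described above, not the stylo implementation: over many trials, sample a random subset of features and a random set of imposter documents, and count how often the questioned text falls closer to the candidate author than to every imposter. A Delta-like Manhattan distance stands in for the Cosine, Entropy and Canberra measures mentioned; all vectors are invented.

import numpy as np

rng = np.random.default_rng(1)

def gi_score(questioned, candidate_docs, imposter_docs, n_trials=200, feat_frac=0.5):
    # Proportion of trials in which the questioned text is nearer to the
    # candidate's documents than to any sampled imposter document.
    n_feats = questioned.shape[0]
    hits = 0
    for _ in range(n_trials):
        feats = rng.choice(n_feats, size=max(1, int(feat_frac * n_feats)), replace=False)
        n_imps = min(10, len(imposter_docs))
        imps = imposter_docs[rng.choice(len(imposter_docs), size=n_imps, replace=False)]
        d_cand = min(np.abs(questioned[feats] - d[feats]).mean() for d in candidate_docs)
        d_imp = min(np.abs(questioned[feats] - d[feats]).mean() for d in imps)
        hits += d_cand < d_imp
    return hits / n_trials  # a score near 1.0 supports the candidate's authorship

# Hypothetical z-scored feature vectors (50 features per document).
questioned = rng.normal(size=50)
candidate_docs = rng.normal(size=(3, 50))
imposter_docs = rng.normal(size=(30, 50))
print(gi_score(questioned, candidate_docs, imposter_docs))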
Conference Paper
Full-text available
This paper reports an experiment in authorship attribution that reveals considerable authorial structure in texts written by authors with very similar backgrounds and training, with genre and topic strictly controlled for. We interpret our results as supporting the hypothesis that authors have 'textual fingerprints', at least for texts produced by authors who are not consciously changing their style of writing across texts. This study has also taught us that discriminant analysis is a more appropriate technique than principal components analysis for predicting the authorship of an unknown (held-out) text on the basis of known (training) texts whose authorial provenance is available. Finally, standard discriminant analysis can be enhanced considerably by using an entropy-based weighting scheme of the kind used in latent semantic analysis (Landauer et al., 1998).
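A minimal sketch, over an invented term-document count matrix, of the entropy-based (log-entropy) weighting scheme used in latent semantic analysis (Landauer et al., 1998): terms spread evenly across documents receive a global weight near zero, while terms concentrated in few documents receive a weight near one.

import numpy as np

# Invented term-document count matrix: rows are terms, columns are documents.
counts = np.array([[4, 0, 2],
                   [1, 1, 1],
                   [0, 5, 0]], dtype=float)

n_docs = counts.shape[1]
gf = counts.sum(axis=1, keepdims=True)          # global term frequencies
p = np.where(counts > 0, counts / gf, 1.0)      # placeholder 1.0 makes p*log(p) = 0

# Global weight: 1 + sum_j p_ij log(p_ij) / log(n).
entropy_weight = 1.0 + (p * np.log(p)).sum(axis=1) / np.log(n_docs)

# Combine with a local log weighting, as is common in LSA preprocessing.
weighted = entropy_weight[:, None] * np.log(counts + 1.0)
print(np.round(weighted, 3))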
Book
The accompanying CD-ROM contains a searchable copy of the textbook and all solutions, additional references, thumbnail sketches and photographs of mathematicians, and a history of linear algebra and computing.
Article
John F. Burrows has proposed Delta, a simple new measure of textual difference, as a tool for authorship attribution, and has shown that it has great potential, especially in attribution problems where the possible authors are numerous and difficult to limit by traditional methods. In tests on prose, Delta has performed nearly as well as it did on Burrows's verse texts. A series of further tests using automated methods, however, shows that two modified methods of calculating Delta and three alternatives to or transformations of Delta produce results that are even more accurate. Four of these five new measures produce much better results than Delta, both on a very diverse group of 104 novels and on a group of forty-four shorter contemporary literary-critical texts. Although further testing is needed, Delta and its modifications should prove valuable and effective tools for authorship attribution.
Article
This paper is a companion to my ‘Questions of authorship: attribution and beyond’, in which I sketched a new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship. The main emphasis of that paper was not on the new procedure but on the broader consequences of our increasing sophistication in making such comparisons and the increasing (although never absolute) reliability of our inferences about authorship. My present objects, accordingly, are to give a more complete account of the procedure itself; to report the outcome of an extensive set of trials; and to consider the strengths and limitations of the new procedure. The procedure offers a simple but comparatively accurate addition to our current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length. It is of even greater value as a method of reducing the field of likely candidates for texts of as little as 100 words in length. Not unexpectedly, it works least well with texts of a genre uncharacteristic of their author and, in one case, with texts far separated in time across a long literary career. Its possible use for other classificatory tasks has not yet been investigated.
Article
Computer-based evidence, especially when it incorporates statistical analysis, is too often regarded with special deference or special scepticism. It is better assessed upon its merits, like any other application of inductive logic. The nexus between the inductive process and the information available is studied in two paradigmatic attempts to interpret sets of statistically based distinctions between different texts.