The Rest of the Story:
Distilling Meaning from Stylistic Variation
Department of Computer Science
Illinois Institute of Technology
Chicago, IL 60645
Department of Computer Science
Ramat Gan, Israel
January 29, 2007
As we have seen, there are many diverse concepts of what we mean when we speak of the ‘style’ of
a work; we may speak, for example, of an individual’s speciﬁc style of artistic expression [23, 87], of
style as a means of implicitly contextualizing a work as part of a certain genre [26, 81], of style as a
means of conveying emotion, or of style as a concept organizing how observers tend to interpret
a work [87, 26]. In this chapter, we explore automated methods for stylistic analysis of written texts
(computational stylistics), formulated as having roughly two research goals: (a) develop methods
to automatically distinguish texts with a certain stylistic character from those of another, and
(b) distill an interpretable representation of the diﬀerence between such stylistic characters. We
seek an inclusive view of the nature of style, endeavoring to consider models and methods covering
the individual style of a genius as well as the generic style of a collective, style’s communicative
functions as well as its social determinants, and the intentions of the author as well as the potential
reaction of the reader.
We approach this question from a semantic and pragmatic perspective, taking the ‘style’ of a
text to cover the broad range of meanings that lie beyond what is conventionally thought of as the
text’s ‘content’. Such content may be thought of as the ‘denotational meaning’ of the text, roughly
covered by the “Six Ws” of journalism: Who, What, Where, Why, When, and How. By ‘style’,
then, we mean pretty much everything else that we can know from a text about the communicative
act that it embodies. The basic idea is that the text’s style may be deﬁned as how the author
chose to express her content, from among a very large space of possible ways of doing so. We
contrast, therefore, the how of a text (style) from the what (content). Style, thus construed, may
include interpersonal aspects of meaning such as aﬀect and social roles, sociolinguistic categories
such as genre and register, idiolectic aspects such as author identity and personality, and specifics
of the individual speech act such as the medium of transmission and the purpose of the text. These
notions, fleshed out in Section 2 below, form a rather diverse set of characteristics; we argue that
they can be usefully considered together for the purposes of computational textual analysis.
The diversity of types of stylistic textual characteristics is reﬂected in the large number of
possible applications for style analysis that are already being explored. Current areas of application
include authorship attribution and proﬁling [76, 19, 54, 12, 72, 86, 28, 4, 89], genre-based text
classiﬁcation and retrieval [52, 53, 33], sentiment analysis [90, 78], and spam/scam ﬁltering [1, 59,
79]. Other potential applications include criminal and national security forensics [21, 74], mining of
customer feedback [16, 73], and aiding humanities scholarship [67, 47, 49, 6]. Automated stylistic
analysis thus promises new tools that may help with the ever-increasing number of texts available
in all topics and application domains.
A key research question is what textual features to use for modeling style. While topic-based
text categorization typically uses models based on “bags of content words”, style is somewhat more
elusive. We start from the intuitive notion that style is indicated by features representing the
author’s choice of one mode of expression from a set of equivalent modes for a given content. At
the surface level, this may be expressed by a wide variety of possible features of a text: choice
of particular words, syntactic structures, discourse strategy, or all of the above and more. The
underlying causes of such variation are similarly heterogeneous, including the genre, register, or
purpose of the text, as well as the educational background, social status, and personality of the
author and audience. What all these dimensions of variation have in common, though, is an
independence from the ‘topic’ or ‘content’ of the text, which may be considered to be those objects
and events that it refers to (as well as their properties and relations as described in the text). As
well, textual features of style (as opposed to content) tend to function mostly in the aggregate—no
single occurrence of a word or syntactic structure indicates style, but rather an aggregate preference
for certain choices in a text rather than others.
Most computational stylistics work to date has been based on hand-selected sets of content-
independent features such as function words [76, 67, 91], parts-of-speech and syntactic struc-
tures , and clause/sentence complexity measures [96, 27]; also see the survey in . While
new developments in machine learning and computational linguistics have enabled larger numbers
of features to be generated for stylistic analysis, it is still diﬃcult to articulate strong linguistic mo-
tivations for any preferred input feature set that relates it directly to particular stylistic concerns.
Rather, the general methodology that has developed is to ﬁnd as large a set of topic-independent
textual features as possible and use them as input to a generic learning algorithm (preferably one
resistant to overﬁtting, and possibly including some feature selection). Some interesting and eﬀec-
tive feature sets have been found in this way, such as [52, 58]; function words have also proven to
be surprisingly eﬀective on their own [72, 4, 9]. In the long term, however, a clear foundation in
a linguistic theory of meaning will be needed to gain true insight into the nature of the stylistic
dimension(s) under study. We hope in this chapter to provide a tentative framework for such study,
and show how some useful insights can be gleaned from computational stylistic analyses.
2 Style and the Communicative Act
To understand what may constitute a meaningful characterization of stylistic variation, as derived
from computational analysis of texts, we ﬁrst discuss the nature of the ‘communicative act’ which a
text embodies. Among the components of the communicative act, we may identify and distinguish
diverse aspects of what we call ‘style’. The objective is to model how various factors constituting
the communicative act influence (in various ways) the composition of a text. Given such a model,
the goal of computational stylistics is to determine, to the extent possible, the speciﬁc stylistic
aspects of the communicative act embodied in a given text, or of the class of such acts embodied
in a corpus of texts. Again, by ‘stylistic’ we mean those aspects that are (relatively) independent
of the Content of the text.

Figure 1: Schematic diagram of our model of the communicative act.
The most fundamental participants in any communicative act are the Author (equivalently, the
‘writer’ or ‘speaker’) who is the source of the text, and the Audience (equivalently, the ‘reader’,
‘listener’, or ‘recipient’) to whom the text is conveyed. Of course, when analyzing a non-dialogic
text, the text itself gives no possible access to the nature of the actual Audience, hence we are
interested in the Audience that the Author had in mind; it is that ‘intended’ Audience to which we
refer. The identities and characteristics of the Author and Audience can aﬀect the nature of the
text communicated between them—hence the classical stylistics problem of authorship attribution.
Three other factors are also directly implicated in the composition of the text. First, a text is
seen to encode some Content, normally conceived of as some sort of propositional content denoted
by the text. For example, the declarative clause “The red book is on the table,” denotes the
proposition that a book which is red and whose identity is assumed known, is on a table whose
identity is also assumed known; the case of requests, such as “Please give me the book,” is more
obscure, but can be thought of as representing a potential future event of the Audience giving a
book whose identity is known to the Author. Second, in addition to such propositional Content,
the Purpose of the communication affects the form of the text, by expressing its Content in a speech
act designed to accomplish that purpose [11, 84], which will specify to some extent the grammatical
and rhetorical form of the text. The overall form and texture of an essay geared to informing a
student about the eﬀects of monetary policy on inﬂation will diﬀer noticeably from that of one
geared to convincing voters of the advantages of a particular policy, even though the Content of
both may be very similar. Third, the Medium through which the text is expressed can aﬀect its
composition in both overt and subtle ways. To take two modern extremes, the Medium of a book
affords greater possibilities of depth and options for variety of expression than does the Medium of
an instant message (IM), while the implicit intimacy and speed of sending an IM offers possibilities
that writing a book may lack. Certainly, texts in these two Media are distinctive based on a few
key surface features (e.g., text length, certain abbreviations, syntactic ‘well-formedness’).
Finally, in addition to these ﬁve direct inﬂuences on the characteristics of a particular text, we
must also consider the larger Context in which the communicative act takes place. This includes its
direct social context, the organization(s) or institution(s) sanctioning the communication, which
may impose constraints on the form of the text (in the extreme case, through speciﬁc editorial
guidelines). We may also consider the less-direct inﬂuence imposed by the context of the enclosing
culture, as mediated through various overlapping communities: communities of practice [92, 48],
formed of people engaged together in a given activity (e.g., a project team or a club), communities
of association, formed of people sharing location and resources (e.g., a neighborhood), and discourse
communities [46, 88], comprising a group of people with a shared textual history.
Without entering into existing debates on the exact nature of such communal groupings, their
relationships with each other, and the precise ways in which they influence textual form, we may identify
(in a relatively theory-neutral fashion) three overarching notions of mediating factors for a context’s
eﬀect on the form of communicative act realized as a text. First, we can identify what we may
call an ontology, or theory of what is [25, 41, 42]. Diﬀerent cultural contexts may entail diﬀerent
commitments to the sorts of objects that exist, how they are referred to in the language, and
the taxonomic relationships that may obtain between them. Examples of ontological differences
include the variation in food items available in different regions, and in other everyday
items—modern urbanites are more likely to refer to automobile care than to farm animal
care. As well, specialized fields such as science or metaphysics may reify certain abstract concepts
and processes, treating them similarly to tangible objects in their discourse.
Second, we should consider the role of an ideology, which establishes a set of possible social
roles and assumed relationships between them, of prestige and power. It is the ideological
context that provides the background for the interpersonal relationship(s) between the Author and
the Audience which will aﬀect the nature of the text, in terms of its formality, politeness, level of
epistemic commitment, and so on.
Third, we must also consider the intertext, or background of all pre- and co-existing texts that
may have inﬂuenced the composition of the text under consideration. This may be by direct quo-
tation or citation, by the incorporation of similar thematic elements or rhetorical structures, or
by the use of particular phraseology. Much or even most such intertextual reference may not be
conscious, but rather incorporation of textual elements ‘floating’ in a background knowledge of a
body of texts. A community of discourse is distinguished primarily by its having a particular inter-
textual history to which its texts make reference; prime examples are various traditional religious
communities whose religious texts refer to the community’s acknowledged canonical texts.
Figure 1 gives a schematic diagram of this model. In these terms, the (ambitious) goal of our
inclusive notion of stylistic analysis is, given a text, to glean as much as possible about the various
components of the communicative act it embodies, apart from its Content. Authorship attribution
and profiling, identifying the genre of a text (i.e., its purpose and place in some community),
and determining social power relations realized in a body of texts thus all involve forms of stylistic
analysis in our terms. Certainly there are correlations between various components—particular au-
thors will have idiosyncratic topical preferences, certain topics are largely embedded in communal
discourses that come with particular stylistic commitments (e.g., scholarly disciplines), diﬀerent
media may be more or less appropriate for various types of content or purpose, and so on. Further-
more, how a text expresses its purpose, for example, depends on its particular content and context.
Thus no one factor can be considered in isolation, apart from all the others. However, in the context
of a specific research question, one or another aspect may usefully be considered separately,
provided that appropriate caution is exercised in drawing general conclusions.
Our model should perhaps be understood in the context of systemic functional models of register,
context, and culture, such as those of Martin, Halliday, Gregory, and Fawcett. For
example, in terms of systemic register theory, our notion of Content parallels Halliday’s “ﬁeld”
(though Ontology is surely also relevant), while the conjunction of Author and Audience determines
Gregory’s “personal tenor” (included with Purpose in Halliday’s “tenor”), and the Medium is
essentially the “mode”. Genre effects are embedded in our model as realizations of different
Purposes within a discourse community (or community of practice); the communal dimension
foregrounds aspects of Context in genre analysis—as Halliday and Martin (among others) have
noted, ideology is a key player in the construction of genre. Note that we are not proposing here
any new detailed linguistic theory of these phenomena, but merely suggest that a model such as
ours will be useful for organizing research into automated style analysis.
3 Computational Stylistics
Research in computational stylistics seeks eﬀective models of language style by applying machine
learning algorithms to stylistically meaningful features. The roots of the ﬁeld go back to the
studies of Mendenhall  and Mascol [65, 66] in the late 19th century on the use of word-length
statistics for determining authorship. In the 20th century, the foundations of such ‘stylometric
analysis’ were further advanced by Yule’s statistical studies of word-length and part-of-speech
distributions in literary prose [97, 96], and Mosteller and Wallace’s seminal authorship study of
The Federalist Papers, based on analyzing frequencies of function words. Due to the high
cost of computing and analyzing such features before the wide availability of powerful computers,
stylometrics researchers had, until recently, traditionally sought relatively simple statistically valid
models of stylistic distinctions, based on a small number (dozens, at most) of easily-computed
textual statistics, such as word frequencies, phrase-type frequencies, or sentence-complexity measures.
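Mendenhall-style word-length statistics are simple to reproduce. The following sketch (the function name and naive tokenization are our own, for illustration) computes the relative frequency of each word length in a text:

```python
from collections import Counter

def word_length_distribution(text):
    """Relative frequency of each word length, as in Mendenhall-style stylometry."""
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}

dist = word_length_distribution("The quick brown fox jumps over the lazy dog")
# nine words; lengths 3, 4, and 5 occur 4, 2, and 3 times respectively
```

Comparing such distributions across candidate authors was the basis of those earliest attribution studies.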
More recent research on machine-learning techniques for text classiﬁcation, on the other hand,
has developed more sophisticated learning algorithms which can use combinations of many thou-
sands of features to classify documents according to topic (see Sebastiani’s  excellent survey).
Working systems that have been developed use a variety of modern machine learning techniques
such as Naïve Bayes [60, 61], Winnow, and Support Vector Machines. Recent work on
Figure 2: Generic text categorization system architecture.
applying machine learning and statistical methods for text classiﬁcation to stylometric features
for style analysis has achieved useful techniques for authorship attribution [3, 86, 7], genre analy-
sis [17, 67, 10, 30], and other applications [47, 39, 55].
Text categorization is a key problem in the field of machine learning. The idea is that
we are given two or more classes of documents and we need to find some formula (usually called
a ‘model’) that reflects statistical differences between the classes and that can then be used to
classify a new document. For example, we might wish to classify a document as being about one of
a number of possible topics, as having been written by a man or a woman, as having been written
by one of a given set of candidate authors and so forth.
Figure 2 depicts the basic architecture of a text categorization system in which we are given
examples of two classes of documents, Class A and Class B. The first step, document representation,
involves deﬁning a set of text features which might potentially be useful for categorizing texts in a
given corpus (for example, words that are neither too common nor too rare) and then representing
each text as a vector in which entries represent (some non-decreasing function of) the frequency of
each feature in the text. Optionally, one may then use various criteria for reducing the dimension
of the feature vectors [95, 35].
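As a minimal sketch of this representation step (the feature list and whitespace tokenization are illustrative assumptions, not the features of any particular study), each text becomes a vector of relative feature frequencies:

```python
from collections import Counter

def represent(text, features):
    """Map a text to a vector of relative frequencies of the given features."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[f] / n for f in features]

features = ["the", "of", "and", "to"]        # a toy feature set
vec = represent("the cat sat on the mat and slept", features)
# 8 tokens: "the" twice, "and" once -> [0.25, 0.0, 0.125, 0.0]
```

A non-decreasing transform such as log(1 + count) may replace the raw frequency, as noted above.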
Once documents have been represented as vectors, there are a number of learning algorithms
that can be used to construct models that distinguish between vectors representing documents in
Class A and vectors representing documents in Class B. Yang compares and assesses some of
the most promising algorithms, which include k-nearest-neighbor, neural nets, Winnow, Support
Vector Machines, etc. One particular class of learned model which is easy to understand and
analyze, and which we use here, is the linear separator. The basic idea is that each feature x_i is
assigned a weight w_c,i for each possible text class c; these weights collectively form a weight vector
w_c for each class. The dot product of the weight vector w_c with a text’s feature vector x gives the
text’s score for class c; the class with the highest score is assigned to the text. A number of different
methods are extant for computing such weight vectors, including Naive Bayes [61, 71], Support
Vector Machines (SVMs), Bayesian logistic regression, and so forth, each with somewhat
diﬀerent properties. Generally speaking, however, many algorithms give similar performance in
text categorization, provided that features are chosen well.
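The scoring rule for a learned linear separator can be sketched as follows (the weight values are invented purely for illustration):

```python
def classify(x, weight_vectors):
    """Assign the class whose weight vector has the highest dot product with x."""
    scores = {c: sum(w_i * x_i for w_i, x_i in zip(w, x))
              for c, w in weight_vectors.items()}
    return max(scores, key=scores.get), scores

# Hypothetical weights for two classes over three features.
weights = {"A": [1.0, -0.5, 0.2], "B": [-0.3, 0.8, 0.1]}
label, scores = classify([0.4, 0.1, 0.0], weights)
# A scores 0.35, B scores -0.04, so the text is assigned to class A
```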
Linear separator learning algorithms which do not require feature independence and are robust
to presence of irrelevant features (such as Winnow and SVMs), have tended to work quite well for
stylistic categorization problems. The main current research issue in the ﬁeld is the question of what
kinds of textual features are good style discriminators, especially with the use of algorithms that
can eﬀectively deal with very large numbers of such features. Features for stylistic discrimination
must be invariant as to topic but vary with the speciﬁc stylistic dimension under study. Our results
and those of others [4, 7, 33, 39] have shown that using just relative frequencies of several hundred
function words often gives excellent results, while adding syntactic, semantic, or complexity-based
features can sometimes be advantageous as well. Often, a great deal of insight into the underlying
stylistic dimension being studied can be found by using semantically-derived features, as we will
see in some of the studies described below.
4 Author Proﬁling
Our ﬁrst case study is from an analysis of a large corpus of blogs to see if writing style and
topic preferences vary with age and sex of the author, and what this variation may mean. In
this study, we applied two diﬀerent machine-learning algorithms: Bayesian multinomial logistic
regression (BMR) and multi-class balanced real-valued Winnow (WIN [62, 24]), to construct
classiﬁcation models for author age and for author sex.
4.1 The corpus
The corpus for this study included all blogs on blogger.com in mid-August 2004 that had both
author-provided indication of sex and age and at least 200 occurrences of common English words.
The unit of analysis was the collected writing of each blogger from the blog’s inception date until
harvest; we do not distinguish between diﬀerent posts by a given blogger. Each blog was labeled
for sex and age based on the blogger’s self-identiﬁcation. For purposes of analysis, formatting and
non-English text was removed from each blog. To enable reliable age categorization (since a blog
can span several years of writing), all blogs for boundary ages (ages 18-22 and 28-32) were removed.
Each blogger was categorized by age at time of harvest: “10s” (ages 13-17), “20s” (ages 23-27), and
“30s” (ages 33-47), and also by sex: “male” and “female”. To decouple sex from age effects, the
numbers of blogs of each sex within each age category were equalized by randomly deleting surplus
blogs from the larger sex category. The ﬁnal corpus contained 19,320 blogs (8,240 in 10s, 8,086 in
20s, and 2,994 in 30s), comprising a total of 681,288 posts and over 140 million words; there were
35 posts and 7300 words per blog on average.
4.2 Classiﬁcation accuracy
Each blog was represented by a vector containing the frequencies in the blog of a set of 377 common
function words, as well as of the 1000 words with highest information gain for age and sex,
respectively (as computed on a holdout set).
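Information gain for a word can be computed as the reduction in class-label entropy obtained by splitting the blogs on whether they contain the word. A self-contained sketch (the toy documents and labels are invented, not drawn from the blog corpus):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """IG of the feature 'document contains word' with respect to the labels."""
    has = [lab for doc, lab in zip(docs, labels) if word in doc]
    lacks = [lab for doc, lab in zip(docs, labels) if word not in doc]
    n = len(labels)
    split = sum(len(part) / n * entropy(part) for part in (has, lacks) if part)
    return entropy(labels) - split

docs = [{"lol", "school"}, {"lol", "music"}, {"work", "politics"}, {"family", "work"}]
labels = ["10s", "10s", "30s", "30s"]
ig = information_gain(docs, labels, "lol")   # "lol" splits the classes perfectly -> 1.0
```

Words are then ranked by this quantity and the top 1000 retained as features.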
Classification accuracies in ten-fold cross-validation for author age (over all three age
classes) were 77.4% (BMR) and 75.0% (WIN). Results for author sex were accuracies of 79.3%
(BMR) and 80.5% (WIN), consistent with classiﬁcation studies on author sex in other corpora [28,
4]. When one takes into account that self-identified sex and age information given by bloggers may
often be inaccurate and that blogs commonly include much quoted text, these results might be
considered surprisingly high, clearly showing that author sex and age are indicated by word usage.
4.3 Factor analysis
What do such differences in language use mean, however? First, let us consider the 1000 most
frequent words in the corpus overall and how diﬀerent classes of bloggers diﬀer in their use. These
1000 words contain 323 diﬀerent function words and 677 diﬀerent content words, accounting for
59.4% and 21.7%, respectively, of all word occurrences. In order to understand underlying patterns
of language variation, we considered naturally occurring word classes.
Function words can be divided straightforwardly into a number of grammatical classes, such
as PersonalPronouns, Articles, Conjunctions, and so forth. For content words, a straightforward
way to identify natural word classes for a given corpus is to perform factor analysis. A maximum
likelihood factor analysis with equimax rotation and Kaiser normalization  on the rate of use
of each of the 677 most frequent content words yields twenty coherent factors that depict diﬀerent
content-related themes, each containing between 13 and 32 words. We assigned intuitive names to
the factors:: Conversation, At Home, Family, Time, Work, PastActions, Games, Internet, Location,
Fun, Food/Clothes, Poetic, Books/Movies, Religion, Romance, Swearing, Politics, Music, School,
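The grouping effect of such an analysis can be illustrated on synthetic data. The sketch below uses principal-axis loadings on a toy “corpus” driven by two hidden themes as a simple stand-in for the maximum-likelihood factor analysis with equimax rotation used in the study; all data, dimensions, and variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows are blogs, columns are usage rates of frequent content words
# (677 in the study; 6 here), driven by two hidden themes.
themes = rng.normal(size=(500, 2))
true_loadings = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0, 0.5, 0.5, 0.5]])
X = themes @ true_loadings + 0.05 * rng.normal(size=(500, 6))

# Principal-axis loadings: eigenvectors of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top = eigvecs[:, np.argsort(eigvals)[::-1][:2]]    # two strongest factors
groups = np.argmax(np.abs(top), axis=1)
# Words 0-2 load on one factor and words 3-5 on the other,
# recovering the two themes as coherent word groups.
```

In the study, each recovered group of co-varying words is then inspected and given an intuitive name.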
Frequencies of the twenty factors’ usage for each age and each sex, as well as the same data
for function words broken down by parts of speech, indicate meaningful diﬀerences in both content
and style among bloggers of diﬀerent ages. Use of words associated with Family, Religion, Politics,
Business, and Internet increases signiﬁcantly (p <0.001) with age, while use of words associated
with Conversation, AtHome, Fun, Romance, Music, School, and Swearing decreases signiﬁcantly
(p <0.001) with age. Use of other factors either does not vary monotonically or shows no significant trend with age.
In addition, use of PersonalPronouns, Conjunctions, and AuxiliaryVerbs decreases signiﬁcantly
(p <0.001) with age, while use of Articles and Prepositions increases signiﬁcantly (p <0.001) with
age. Style and content effects are highly correlated: multiple regression indicates that
controlling for style effects essentially eliminates content effects and vice versa.
These results uniformly suggest that younger writers are more focused on personal issues, while
older writers have a more external focus on social issues. Two notes of caution are in order.
Interpretation must not ignore the fact that since this study is synchronic, we cannot separate
generational eﬀects from age eﬀects. Moreover, since there are fewer older bloggers, they may
represent an atypical demographic as early adopters of technology.
Another key ﬁnding was that the parts of speech, Articles and Prepositions, which are used
signiﬁcantly more by older bloggers are also used signiﬁcantly more (p <0.001) by male bloggers,
while the parts of speech, PersonalPronouns, Conjunctions, and AuxiliaryVerbs, which are used
signiﬁcantly more by younger bloggers are also used signiﬁcantly more (p <0.001) by female blog-
gers. Similarly, the factors Religion, Politics, Business, and Internet, which are used signiﬁcantly
more by older bloggers are also used signiﬁcantly more (p <0.001) by male bloggers, while the
factors Conversation, AtHome, Fun, Romance, and Swearing, which are used signiﬁcantly more by
younger bloggers are also used signiﬁcantly more (p <0.001) by female bloggers. (The exceptions to
this pattern are Family, used more by older bloggers and by females, Music, used more by younger
bloggers and by males, and School, for which there is no signiﬁcant diﬀerence between male and
female usage.) As in the case of age, multiple regression indicates that controlling for style eﬀects
essentially eliminates content effects and vice versa.
These results suggest a single underlying factor distinguishing inner- and outer-directed com-
munication (both style and topic) that can explain both sex-linked and age-linked variation in
language use. Indeed, as we have seen, these variations are pronounced enough that they can be
exploited to predict the age and sex of a blog author with reasonably high accuracy.
5 Authorship Veriﬁcation
The second problem we consider is that of authorship veriﬁcation. In the authorship veriﬁcation
problem, we are given examples of the writing of a single author and are asked to determine if
given texts were or were not written by this author. As a categorization problem, veriﬁcation is
signiﬁcantly more diﬃcult than attribution and little, if any, work has been performed on it in the
learning community. When we wish to determine if a text was written by one of several known
authors, it is sufficient to use their respective known writings, to construct a model distinguishing
them, and to test the unknown text against the model [76, 91, 82, 18]. If, on the other hand, we
need to determine if a text was written by Author A or not, it is very diﬃcult, if not impossible, to
assemble an exhaustive, or even representative, sample of not-A. The situation in which we suspect
that a given author may have written some text but do not have an exhaustive list of alternative
candidates is a common one.
The particular authorship verification problem we will consider here is a genuine literary conundrum.
We are given two nineteenth-century collections of Jewish rabbinic responsa written
in a combination of Hebrew and Aramaic. The first, RP (Rav Pe‘alim), includes 509 documents
authored by an Iraqi rabbinic scholar known as Ben Ish Chai. The second, TL (Torah Lishmah),
includes 524 documents that Ben Ish Chai claims to have found in an archive. There is ample
historical reason to believe that he in fact authored the manuscript but did not wish to take credit
for it for personal reasons. What do the texts tell us?
The ﬁrst thing we do is to ﬁnd four more collections of responsa written by four other authors
working in roughly the same area during (very) roughly the same period. These texts are Zivhei
Zedeq (ZZ; Iraq, nineteenth century), Sho’el veNish’al (SN; Tunisia, nineteenth century), Darhei
No‘am (DN; Egypt, seventeenth century), and Ginat Veradim (GV; Egypt, seventeenth century).
We begin by checking whether we are able to distinguish one collection from another using standard
text categorization techniques. We select a list of lexical features as follows: the 200 most frequent
words in the corpus are selected, and all those that are deemed content words are eliminated
manually. We are left with 130 features. After pre-processing the text as in the previous experiment,
we constructed vectors of length 130 in which each element represented the relative frequency (nor-
malized by document length) of each feature. We then used Balanced Winnow as our learner to
distinguish pairwise between the various collections. Five-fold cross-validation experiments yield
accuracy of greater than 95% for each pair. In particular, we are able to distinguish between RP
and TL with accuracy of 98.5%.
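A minimal two-class balanced Winnow, sketched below, conveys the flavor of the learner (the study used a multi-class, real-valued variant; the parameters, update rule details, and toy data here are illustrative assumptions):

```python
class BalancedWinnow:
    """Minimal two-class balanced Winnow with multiplicative updates."""

    def __init__(self, n_features, alpha=1.5, beta=0.5, threshold=1.0):
        self.wp = [1.0] * n_features    # positive weight vector
        self.wn = [1.0] * n_features    # negative weight vector
        self.alpha, self.beta, self.threshold = alpha, beta, threshold

    def score(self, x):
        return sum((p - n) * xi for p, n, xi in zip(self.wp, self.wn, x))

    def fit(self, X, y, epochs=20):     # labels in {+1, -1}
        for _ in range(epochs):
            for x, label in zip(X, y):
                pred = 1 if self.score(x) >= self.threshold else -1
                if pred != label:       # mistake-driven multiplicative update
                    up = self.alpha if label == 1 else self.beta
                    down = self.beta if label == 1 else self.alpha
                    for i, xi in enumerate(x):
                        if xi > 0:
                            self.wp[i] *= up ** xi
                            self.wn[i] *= down ** xi
        return self

# Toy "documents": feature 0 frequent in collection 1, feature 1 in collection 2.
X = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, -1, -1]
model = BalancedWinnow(2).fit(X, y)
# model.score(x) >= 1.0 indicates collection 1, otherwise collection 2
```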
One might thus be led to conclude that RP and TL are by diﬀerent authors. It is still possible,
however, that in fact only a small number of features are doing all the work of distinguishing between
them. The situation in which an author will use a small number of features in a consistently diﬀerent
way between works is typical. These diﬀerences might result from thematic diﬀerences between the
works, from diﬀerences in genre or purpose, from chronological stylistic drift, or from deliberate
attempts by the author to mask his or her identity.
In order to test whether the differences found between RP and TL reflect relatively shallow
differences that can be expected between two works of the same author, or reflect deeper differences
Figure 3: Accuracy (y-axis) on training data of learned models comparing TL to other collections
as best features are eliminated, ﬁve per iteration (x-axis). Dotted line on bottom is RP vs. TL.
that can be expected between two diﬀerent authors, we invented a new technique that we call
unmasking [56, 57] that works as follows. We begin by learning models to distinguish TL from each
of the other authors including RP. As noted, such models are quite eﬀective. In each case, we then
eliminate the five highest-weighted features and learn a new model. We iterate this procedure ten
times. The depth of diﬀerence between a given pair can then be gauged by the rate with which
results degrade as good features are eliminated.
The results (shown in Figure 3) could not be more glaring. For TL versus each author other
than RP, we are able to distinguish with gradually degrading eﬀectiveness as the best features
are dropped. But for TL versus RP, the eﬀectiveness of the models drops right oﬀ a shelf. This
indicates that just a few features, possibly deliberately inserted as a ruse or possibly a function of
slightly diﬀering purposes assigned to the works, distinguish between the works. We have shown
elsewhere that the evidence offered in Figure 3 is sufficient to conclude that the author of RP
and the author of TL are one and the same: Ben Ish Chai.
For example, the frequency (per 10000 words) of the word zeh [= this] in RP is 80 and in TL
is 116. A cursory glance at the texts is enough to establish why this is the case: the author of TL
ended every responsum with the phrase vehayah zeh shalom [= this shall be farewell], thus artiﬁcially
inﬂating the frequency of these words. Indeed the presence or absence of this phrase alone is enough
to allow highly accurate classiﬁcation of a given responsum as either RP or TL. Once features of
this sort are eliminated, however, the works become indistinguishable—a phenomenon which does
not occur when we compare TL to each of the other collections. In other words, many features can
be used to distinguish TL from works in our corpus other than RP, but only a few distinguish TL
from RP. Most features distribute similarly in RP and TL. A wonderfully illustrative example of
this is the abbreviation vkhu’ [= etc.], the respective frequencies of which in the various corpora are
as follows: TL:29 RP:28 SV:4 GV:4 DN:41 ZZ:77. Note that this similarity is unlikely to be due
to regional and chronological dependencies, since GV and DN have widely diﬀering values, though
they were both written in seventeenth century Egypt.
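The per-10,000-word normalization behind these counts is simple to compute; a trivial sketch, using an invented five-word token list rather than the actual corpora:

```python
def freq_per_10k(token, tokens):
    """Frequency of `token` per 10,000 running words of text."""
    return 10_000 * tokens.count(token) / len(tokens)

# invented toy text: 2 occurrences of "zeh" in 5 words
sample = ["vehayah", "zeh", "shalom", "zeh", "shalom"]
print(freq_per_10k("zeh", sample))  # -> 4000.0
```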
6 Scientiﬁc Rhetoric
A third type of text that we have studied is the peer-reviewed scientific article. Our motivation in
doing so is to see what, if any, stylistic realization there may be of likely methodological differences
between different scientific fields. Our analysis here thus relates to questions of genre, specifically
to the interaction between Content and Purpose within a certain disciplinary Context.
Scientiﬁc ﬁelds form clear communities of discourse, with a shared textual history in textbooks
and the scientiﬁc literature; they are also communities of practice (or closely intertwined sets of
such), in that scientists in a given ﬁeld undertake similar/related research activities in the pursuit
of communally-deﬁned research goals. For these reasons, we might expect that the language used
in articles from one discipline differs from that in another; our goal is to verify this intuition, and
to see if analysis of any diﬀerences found can shed any light on how scientists in diﬀerent ﬁelds
construe their arguments in diﬀerent ways.
6.1 Scientiﬁc methodologies
A key distinction which has recently been identiﬁed by historians and philosophers of science [22,
29, 70, 83] is between experimental sciences such as physics, which attempt to formulate general
predictive laws, and so rely heavily on repeatable series of controlled experiments which test hy-
potheses, and historical sciences such as geology, which study specific contingent past phenomena in
an attempt to find unifying explanations for effects caused by those phenomena. Reasoning
in historical sciences thus consists largely of 'synthetic' reconstructive reasoning (retrodiction),
as compared with more explicitly predictive reasoning from causes to possible effects characteristic
of experimental science [13, 29, 38, 93]. We summarize here results of our earlier studies [8, 2] on
articles in several ﬁelds of experimental and historical sciences, which show how analysis of style
differences can give insights into how scientific rhetoric reflects methodological differences among the sciences.
The two studies we will discuss analyze a corpus of recent (2003) articles drawn from twelve
peer-reviewed journals in both historical and experimental sciences; the numbers of articles used
from each journal and their average (preprocessed) lengths in words are given in Table 1. The journals are:
Journal of Geology (geology, historical) includes research on the full range of geological principles
including geophysics, geochemistry, sedimentology, geomorphology, petrology, plate tectonics,
volcanology, structural geology, mineralogy, and planetary sciences.
Journal of Metamorphic Geology (geology, historical) focuses on metamorphic studies, from the
scale of individual crystals to that of lithospheric plates.
Biological Journal of the Linnean Society (evolutionary biology, historical) publishes work on or-
ganic evolution in a broad sense, particularly research unifying concepts of evolutionary biol-
ogy with evidence from genetics, systematics, biogeography, or ecology.
Journal of Human Evolution (evolutionary biology, historical) covers all aspects of human evolu-
tion, including both work on human/primate fossils and comparative studies of living species.
Palaeontologia Electronica (paleontology, historical) publishes papers in all branches of paleontol-
ogy as well as related biological or paleontologically-related disciplines.
Quaternary Research (paleontology, historical) publishes research in diverse areas in the earth and
biological sciences which examine the Quaternary period of the Earth's history (from roughly
1.6 million years ago to the present).
(Footnote: Metamorphism refers to changes in mineral assemblage and texture in rocks that have been subjected
to temperatures and pressures different from those under which they originally formed.)
Physics Letters A (physics, experimental) publishes research in a wide range of areas, including:
condensed matter physics, theoretical physics, nonlinear science, statistical physics, mathe-
matical and computational physics, atomic, molecular and cluster physics, plasma and ﬂuid
physics, optics, biological physics and nanoscience.
Physical Review Letters (physics, experimental) also covers a wide range of physics research, in-
cluding: gravitation and astrophysics, elementary particles and ﬁelds, nuclear physics, atomic,
molecular, and optical physics, nonlinear dynamics, ﬂuid dynamics, plasma and beam physics,
and condensed matter physics.
Journal of Physical Chemistry A (physical chemistry, experimental) publishes chemical research at
the level of molecules (including dynamics, spectroscopy, gaseous clusters, molecular beams,
kinetics, atmospheric and environmental physical chemistry, molecular structure, bonding,
quantum chemistry, and general theory).
Journal of Physical Chemistry B (physical chemistry, experimental) publishes research on materi-
als (including nanostructures, micelles, macro-molecules, statistical mechanics and thermody-
namics of condensed matter, biophysical chemistry, and general physical chemistry), as well
as studies on the structure and properties of surfaces and interfaces.
Heterocycles (organic chemistry, experimental) publishes research in the areas of organic, pharma-
ceutical, analytical, and medicinal chemistry of heterocyclic compounds.
Tetrahedron (organic chemistry, experimental) publishes general experimental and theoretical research
results in the field of organic chemistry and applications in related disciplines.
6.2 Functional lexical features
To analyze rhetorically-relevant features of these texts, we use features derived based on the theory
of Systemic Functional Grammar (SFG), a functional approach to linguistic analysis. SFG
models the grammar of a language by a network of choices of meanings that can be expressed,
and so all lexical and structural choices are represented as the realizations of particular semantic
and contextual meanings. The theory takes a primarily sociological view of language, and has
developed largely in the context of its use by applied linguists for literary/genre analysis and for
studying language learning. (See Butler for an excellent overview of SFG and its relation to
other functional linguistic theories.)

Table 1: Journals used in the studies, with the number of articles and average length (in words) per journal.

Journal                                     # Art.   Avg. Len.
Journal of Geology                              93       4891
Journal of Metamorphic Geology                 108       5024
Biological Journal of the Linnean Society      191       4895
Journal of Human Evolution                     169       4223
Palaeontologia Electronica                     111       4132
Quaternary Research                            113       2939
Physics Letters A                              132       2339
Physical Review Letters                        114       2545
Journal of Physical Chemistry A                121       4865
Journal of Physical Chemistry B                 71       5269
Heterocycles                                   231       3580
Tetrahedron                                    151       5057
Brieﬂy put, SFG construes language as a set of interlocking choices for expressing meanings,
with more general choices constraining the possible speciﬁc choices. A choice at one level may open
up further choices at other levels, choices that may not otherwise be possible; e.g., English does
not allow one to distinguish between third-person plural references to conscious or non-conscious
individuals—only one pronoun “them” is available. Furthermore, any speciﬁc choice of lexical item
or syntactic structure is determined by choices from multiple systems at once, as, e.g., the choice
between “I” and “me” is determined by the independent choice governing the pronoun’s syntactic
role as either a subject or an object.
For our purposes, it suffices to assign attributes to relevant lexical items, where each such
attribute takes on a value from a system taxonomy defined by SFG. For simplicity, we require
that each such taxonomy be a tree, having a unique root, with entry conditions restricted to
single options or conjunctions of options. This simplifies computational issues, though it only
approximates the full SFG representation, which essentially allows system networks to be general
AND/OR graphs.
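One simple way to realize such a tree taxonomy computationally is a child-to-parent map plus a lexicon from words to leaf options; a hit on a leaf then also counts toward every ancestor node. The mini-taxonomy and lexicon below are invented for illustration, covering only the three top-level options of the CONJUNCTION system discussed later:

```python
# Toy CONJUNCTION taxonomy as a child -> parent map (tree rooted at
# CONJUNCTION), plus an invented illustrative lexicon of leaf options.
PARENT = {
    "Elaboration": "CONJUNCTION",
    "Extension": "CONJUNCTION",
    "Enhancement": "CONJUNCTION",
}
LEXICON = {
    "furthermore": "Extension",
    "i.e.": "Elaboration",
    "then": "Enhancement",
}

def taxonomy_counts(tokens):
    """Count realizations of each taxonomy node; a hit on a leaf also
    counts toward every ancestor up to the root."""
    counts = {}
    for tok in tokens:
        node = LEXICON.get(tok)
        while node is not None:
            counts[node] = counts.get(node, 0) + 1
            node = PARENT.get(node)
    return counts

print(taxonomy_counts(["furthermore", "then", "and", "i.e.", "then"]))
```

Cumulating counts up the tree is what lets a linear model place weight on a general option (e.g., CONJUNCTION itself) as well as on its more delicate suboptions.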
Such taxonomies enable a straightforward 'variational' interpretation of a linear classification
model, in which certain nodes in the taxonomy are features indicating one document class, and other
nodes are features indicating the other document class. The idea is to ﬁnd all oppositions, where an
opposition is deﬁned as a pair of sibling nodes where one indicates one class and the other indicates
the other class. For example, if CONJUNCTION/Extension is indicative of class A and
CONJUNCTION/Enhancement of class B, we would have the opposition:

Condition       Class A     Class B
CONJUNCTION     Extension   Enhancement
The oppositions given by such analysis give direct information about linguistic differences between
the two document classes, in that the two classes have differing preferences about how to express the
parent 'concept'. In the example, Class A prefers to conjoin items by Extension, likely indicating
a higher density of more-or-less independent information units, whereas Class B prefers conjoining
items by Enhancement, indicating a more closely focused structure dealing with a smaller number
of independent information units.
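Given per-node weights from a learned linear model, extracting oppositions reduces to a scan over sibling pairs with opposite-sign weights. A sketch, with the weights and taxonomy fragment invented for illustration (positive weight taken to indicate class A, negative class B):

```python
from itertools import combinations

# child -> parent map for two invented system fragments
PARENT = {
    "Elaboration": "CONJUNCTION",
    "Extension": "CONJUNCTION",
    "Enhancement": "CONJUNCTION",
    "Modalization": "MODALITY",
    "Modulation": "MODALITY",
}
# invented model weights: sign encodes which class a node indicates
weights = {
    "Extension": 0.8,      # indicates class A
    "Enhancement": -0.6,   # indicates class B
    "Modalization": 0.3,
    "Modulation": 0.2,     # same sign as Modalization: no opposition
}

def oppositions(weights, parent):
    """Return (condition, classA-node, classB-node) triples for sibling
    pairs whose weights point to different classes."""
    out = []
    for a, b in combinations(weights, 2):
        siblings = parent.get(a) is not None and parent.get(a) == parent.get(b)
        if siblings and weights[a] * weights[b] < 0:
            pair = (a, b) if weights[a] > 0 else (b, a)
            out.append((parent[a],) + pair)
    return out

print(oppositions(weights, PARENT))
```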
The remainder of this subsection outlines the main system networks which we have used for
analyzing textual style. They are divided into three categories, denoting the general ‘stylistic
goals’ that these textual features relate to: Cohesion, referring to how a text is constructed to
‘hang together’, Assessment, meaning how a text construes propositions as statements of belief,
obligation, or necessity, contextualizing them in the larger discourse, and Appraisal, how the text
adjudges the quality of various objects or events. The relevant taxonomies are only summarized
here due to space considerations; a fuller description can be found in the literature.
Cohesion refers to linguistic resources that enable language to connect to its larger context, both tex-
tual and extratextual. Such resources include a wide variety of referential modalities (pronomi-
nal reference, deictic expressions, ellipsis, and more), as well as lexical repetition and variation, and
diﬀerent ways of linking clauses together. How an author uses these various cohesive resources is an
indication of how the author organizes concepts and relates them to each other. Within cohesion,
we consider here only conjunctions, which are easiest to deal with computationally. Automated
coreference resolution [14, 77], for example, is a very diﬃcult unsolved problem.
Words and phrases that conjoin clauses (such as ‘and’, ‘while’, and ‘in other words’) are or-
ganized in SFG in the CONJUNCTION system network. Types of CONJUNCTION serve to
link a clause with its textual context, by denoting how the given clause expands on some aspect
of its preceding context [68, p. 519–528]. The three top-level options of CONJUNCTION are
Elaboration, Extension, and Enhancement, deﬁned as:
• Elaboration: Deepening the content in its context by exempliﬁcation or refocusing (for ex-
ample, in other words, i.e.);
• Extension: Adding new related information, perhaps contrasting with the current information
(and, furthermore, on the other hand);
• Enhancement: Qualifying the context by circumstance or logical connection (and then, because).

Generally speaking, assessment may be defined as "contextual qualification of the epistemic or
rhetorical status of events or propositions represented in a text”. Examples include assessment of
the likelihood of a proposition, the typicality of an event, the desirability of some fact, or its scope
of validity. Two important systems in SFG that address assessment are MODALITY, enabling
expression of typicality and necessity of some fact or event, and COMMENT, enabling assessment
of the writer’s stance with respect to an assertion in the text.
The system of MODALITY enables one to qualify events or entities in the text according
to their likelihood, typicality, or necessity. Syntactically, MODALITY may be realized in a text
through a modal verb (e.g., 'can', 'might', 'should', 'must'), an adverbial adjunct (e.g., 'probably',
'preferably'), or use of a projective clause (e.g., "I think that...", "It is necessary that..."). Each
word or phrase expressing MODALITY has a value for each of four attributes:
• Type: What kind of modality is being expressed?
– Modalization: How ‘typical’ is it? (probably, seldom)
– Modulation: How ‘necessary’ is it? (ought to, allowable)
• Value: What degree of the relevant modality scale is being averred?
– Median: The ‘normal’ amount. (likely, usually)
– Outer: An extreme (either high or low) amount. (maybe, always)
• Orientation: Relation of the modality expressed to the speaker/writer.
– Objective: Modality expressed irrespective of the speaker/writer. (maybe, always)
– Subjective: Modality expressed relative to the speaker/writer. (We think..., I require...)
• Manifestation: How is the modal assessment related to the event being assessed?
– Implicit: Modality realized 'in-line' by an adjunct or modal auxiliary. (preferably...)
– Explicit: Modality realized by a projective verb, with the nested clause being assessed.
(It is preferable..., It is possible...)
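These four attributes can be carried as a simple record per lexical marker. The tiny lexicon below is invented; its classifications follow the examples given above but are illustrative, not exhaustive:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Modality:
    type: str           # Modalization | Modulation
    value: str          # Median | Outer
    orientation: str    # Objective | Subjective
    manifestation: str  # Implicit | Explicit

# invented sample entries, classified per the attribute definitions above
MODAL_LEXICON = {
    "probably": Modality("Modalization", "Median", "Objective", "Implicit"),
    "must":     Modality("Modulation",   "Outer",  "Objective", "Implicit"),
    "I think":  Modality("Modalization", "Median", "Subjective", "Explicit"),
}

print(MODAL_LEXICON["must"])
```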
The system of COMMENT provides a resource for the writer to ‘comment’ on the status of
a message with respect to textual and interactive context in a discourse. Comments are usually
realized as adjuncts in a clause and may appear initially, medially, or ﬁnally. We use the eight
categories of COMMENT listed by Matthiessen: Admissive, message is an admission (e.g., 'we
concur...’), Assertive, emphasis of reliability (e.g., ‘Certainly...’), Desiderative, desirability of the
content (e.g., ‘Unfortunately...’), Evaluative, judgment of the actors involved (e.g., ‘Sensibly...’),
Predictive, coherence with predictions (e.g., ‘As expected...’), Presumptive, dependence on other
assumptions (e.g., ‘I suppose...’), Tentative, assessing the message as tentative (e.g., ‘Tentatively...’),
and Validative, assessing scope of validity (e.g., ‘In general...’).
Finally, appraisal denotes how language is used to adopt or express an attitude of some kind
towards some target. For example, in "I found the movie quite monotonous", the speaker
adopts a negative Attitude (“monotonous”) towards “the movie” (the appraised object). Note that
attitudes come in diﬀerent types; for example, ‘monotonous’ describes an inherent quality of the
appraised object, while ‘loathed’ would describe an emotional reaction of the writer. The overall
type and orientation of appraisal expressed in the text about an object gives a picture of how
the writer wishes the reader to view it (modulo sarcasm, of course). To date, we have developed
a lexicon for appraisal adjectives as well as relevant modiﬁers (such as ‘very’ or ‘sort of’). The
two main attributes of appraisal, as used in this work, are Attitude, giving the kind of appraisal
being expressed, and Orientation, giving whether the appraisal is positive (good, beautiful, nice) or
negative (bad, ugly, evil). (There are also other attributes of appraisal, discussed in the Appendix.)
The three main types of Attitude are: affect, relating to the speaker/writer's emotional state (e.g.,
‘happy’, ‘sad’), appreciation, expressing evaluation of supposed intrinsic qualities of an object (e.g.,
‘tall’, ‘complex’), and judgment, expressing social evaluation (e.g., ‘brave’, ‘cowardly’).
6.3 Experimental and historical science
The ﬁrst study we will discuss compares writing styles between experimental and historical science
journals. We first considered whether a difference between the two types of science could be identified,
examining the 10-fold cross-validation accuracy of models built by an SVM (SMO with a linear
kernel) for classifying articles as "experimental" or "historical", using SFG features of EXPANSION,
COMMENT, and MODALITY over the entire corpus of articles from 12 journals. Average
accuracy was 81.6%. To calibrate results, we then ran the same discrimination test for all 462
diﬀerent partitions of the twelve journals into two groups of six journals each. This gave a mean
accuracy of only 65.4% (range 55.4% to 81.6%), indicating that the division into experimental and
historical sciences is well supported by diﬀerences in style between writing in diﬀerent journals.
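The count of 462 is simply the number of unordered ways to split twelve journals into two groups of six, C(12,6)/2 = C(11,5); enumerating them (with placeholder journal names) confirms it:

```python
from itertools import combinations

journals = [f"J{i}" for i in range(12)]   # placeholder names

# Fix journals[0] in the first group so each unordered 6/6 split is
# generated exactly once: choose its 5 companions from the other 11.
partitions = []
for rest in combinations(journals[1:], 5):
    group1 = {journals[0], *rest}
    group2 = set(journals) - group1
    partitions.append((group1, group2))

print(len(partitions))  # -> 462
```

Running the same classifier over every such split yields the null distribution of accuracies against which the experimental/historical division is compared.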
Given this result, we can now consider what consistent pattern of distinguishing features, if any,
emerges. That is, what features can be said to consistently indicate either historical or experimental
science articles. To do this, we ran SMO with a linear kernel on all training data for each pair of
a historical science journal with an experimental science journal (36 pairs in all), and ranked the
features by their weight for one or the other journal in the weight vector computed by SMO. We
summarize here the main results; for more detail, see our earlier studies.
First, in the system of Expansion, we see an opposition between Extension, which is an indica-
tor for historical science articles, and Enhancement, an indicator for experimental science articles.
This implies that historical science articles generally have a higher density of separate informa-
tional items, whereas experimental science articles tend to have fewer discrete information items,
though the information items they do have may have their meaning deepened or qualiﬁed by
informationally-related clauses. This may reﬂect diﬀering principles of rhetorical organization—
experimental scientists preferring a single coherent ‘story line’ focused on enhancements of a small
number of focal propositions, with historical scientists preferring a multifocal 'landscape' of connected
propositions. This supports the hypothesis that contrasts the contextual examination of varied
and highly particular entities by historical science with a more universalist, hence narrowly focused,
examination of generic entities by experimental science.
Such methodological distinctions between kinds of science are further supported by preferences
for types of Comment. Validative and Admissive Comments are indicators for historical science
articles, compared to a very strong and consistent indication of Predictive Comments for experimental
science articles. The latter result is a clear consequence of the experimental
scientist's focus on predictive accuracy. Historical science, on the other hand, evinces a rhetorical
need (via Validative Comments) to explicitly delineate the scope of validity of diﬀerent assertions,
likely as a consequence of synthetic thinking about complex and ambiguous webs of past causation.
An Admissive comment marks a clause as the opinion (perhaps strongly held) of the
author; this too appears indicative of a more hedged and explicitly comparative argumentation.
Finally, we may consider some aspects of variation in expressions of Modality between the
two classes of articles. The primary opposition is in modality Type. Experimental science writing
has a preference for using Modulation (assessing what ‘must’ or ‘is able’ to happen), which is
consistent with a focus on prediction and manipulation of nature. Concurrently, historical science
writing shows a preference for Modalization (assessing ‘likelihood’ or ‘usuality’), consistent with the
outlook of an observer who usually cannot directly manipulate or replicate outcomes, and therefore
(i) cannot make unqualified statements of what must (or must not) happen, and (ii) uses
the method of "multiple working hypotheses".
These results show how variations in language use between articles from diﬀerent disciplines can
be directly linked with the particular modes of reasoning posited by philosophers for these diﬀerent
kinds of science. Stylistic text analysis thus can lend some empirical weight to the argument for a
multiplicity of methods in science, rather than a single monolithic "scientific method".
6.4 Geology and paleontology
In another study, we consider whether stylistic differences between articles in geological and
paleontological journals may be found, and if so, what they may mean. As above, we applied the SMO
system to learn classification models and measured accuracy by 10-fold cross-validation.

Table 2: Oppositions from the twenty highest-ranked systemic features in geology and paleontology
articles, from the model learned using function words plus all systemic features.

Condition                   Geology            Paleontology
ORIENTATION                 Positive           Negative
JUDGEMENT/SocialEsteem      ORIENT/Positive    ORIENT/Negative
JUDGEMENT/SocialSanction    ORIENT/Positive    ORIENT/Negative
ATTITUDE/Judgement          ORIENT/Positive    ORIENT/Negative
ATTITUDE/Affect             ORIENT/Positive    ORIENT/Negative
APPRECIATION                ReactionQuality    CompositionComplexity
SOCIALSANCTION              Propriety          Veracity
COMMENT                     Assertive          Validative
ENHANCEMENT                 SpatioTemporal     CausalConditional

We found that using Conjunction, Modality, and Assessment features resulted in low classification accuracies
(all below 68%), while Appraisal features gave a higher 77.4%. Using function words together with
all the systemic feature types gave the highest accuracy of 87.5%, higher than using just function
words at 84.9% (p < 0.05). Together, these results indicate that while paleontology and geology
have similar preferences for rhetorical structure (measured by Conjunction) and epistemic commitments
(measured by Assessment), there are definite stylistic differences, a good portion of which
relate to the use of evaluative language.
To understand better the diﬀerences between geological and paleontological language, we next
consider oppositions among the top twenty systemic features from the model constructed for the two
article classes using the full feature set, shown in Table 2. Appraisal is the most important, yielding
the largest boost in classiﬁcation power, as noted above, and accordingly generating many highly
ranked oppositions. ORIENTATION is most important overall—geologists appear to prefer Positive
appraisal, while paleontologists prefer Negative. This opposition is also seen within JUDGEMENT
and ATTITUDE. Such appraisal often appears when describing the results of the current, or of
previous, research. Geology appears to prefer positive appraisal, stressing the c ooperative and
incremental nature of the research enterprise, as in, e.g., “...collectively and consistently point
to a single conclusion...” On the other hand, paleontology tends to prefer negative orientation,
seeming to stress inadequacies of the evidence or of previous work, as, for example, in, “...records
are unfortunately much more fragmented...”. As well, we see cases where a researcher will discredit
previous work based on new evidence, as in ”...the approach taken is fundamentally ﬂawed.” It
seems that, in a sense, geologists more often express positive views of previous work as they often
apparently view their work as adding to it, while paleontologists are more often negative, seeing
themselves as replacing old ‘truths’ with new ones.
Next, oppositions in APPRECIATION indicate a distinction between a geological focus on
Reaction (i.e., the eﬀect of the object on an observer) and a paleontological focus on Composition
(i.e., qualities of how the object is put together). This may indicate that paleontologists are more
concerned with analyzing conﬁgurations of complex, multi-part entities (fossils of various sorts),
whereas geologists tend somewhat towards more qualitative evaluations of specimens.
A similar distinction is seen in SOCIALSANCTION and in COMMENT. In SOCIALSANC-
TION, we see geologists more concerned with Propriety, i.e., how a methodology or a piece of
evidence may ﬁt with others, whereas paleontologists are more concerned with Veracity, in terms
of how reliable particular methods or bits of evidence are on their own.
Similarly, we see two COMMENT types descriptive of geological prose: Assertive COMMENTs
(e.g., “There is surely more to it...”), and Desiderative COMMENTs (e.g., “In doing so, we hope to
deduce”), which is consistent with the apparent focus of geologists on ‘ﬁtting in’ noted above. Pale-
ontologists, on the other hand, tend more to use Validative COMMENTs, expanding or contracting
the scope of validity of a claim (e.g., ”Holocene shells generally lack...”), and Presumptive COM-
MENTs, evaluating new claims in light of general background knowledge (e.g., ”...which apparently
are linked with...”).
Finally, the single opposition we ﬁnd within the CONJUNCTION system is in ENHANCE-
MENT, where geology prefers SpatioTemporal, while paleontology prefers CausalConditional. Ge-
ological uses of SpatioTemporal conjunction tend to describe rock conﬁgurations and development
over time, as well as discourse organization (in a sort of descriptive ‘story-line’). Paleontologists,
however, are more often explicit about hypothesized causal links between specimens and historical
events (as in, e.g., ”...perhaps as a result of migration...”).
In this chapter we have presented a view of style as a semantics of text that points to various non-
propositional aspects of the communicative act that the text realizes. This view subsumes previous
work in computational stylistics, such as authorship attribution and genre analysis, as well as
suggesting possible new avenues for research, such as automatic detection of rhetorical purpose,
or identiﬁcation of ideological facets, as encoded in a given text. How to properly deal with the
inevitable interdependencies among textual realizations of the various aspects of the communicative
act remains the central open research question.
Even so, isolation of certain aspects for investigation can yield insight, as the results of the
several studies described here show. Discrimination analysis of controlled corpora with properly
chosen feature sets enables semantic/pragmatic characterization of stylistic diﬀerences between
groups of text that can be linked back to aspectual diﬀerences in the relevant communicative acts.
Our schema of the communicative act may also prove fruitful for organizing stylistic analyses in
expressive media other than text. That characteristics of the Author and Audience influence
style is clear in all media, though the specific expressive correlates and relevant characteristics
may vary. Harold Cohen, in his chapter in this volume, has forcefully made the point that
understanding an artist's choice of Medium for a particular artistic Purpose is essential to a true
understanding of artistic style. Similarly, aspects of musical style can best be understood by relating
them to Purposive goals of eliciting emotional responses or serving particular social functions
(accompanying dances or the like). Context is also relevant—in all media, the artistic/expressive
history and background, the analogue of the intertext, is certainly a major inﬂuence on style; social
ideology is an important inﬂuence; and more-or-less elaborated medium-speciﬁc ontologies may also
be relevant. We suggest therefore that this schema, or something like it, could serve as a framework
to integrate, to some extent, work on stylistic analysis in disparate media.
References

I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. Spyropoulos. An evaluation
of naive Bayesian anti-spam filtering. In Proc. of the Workshop on Machine Learning in
the New Information Age, 2000.
 S. Argamon, J. Dodick, and P. Chase. Language use reﬂects scientiﬁc methodology: A corpus-
based study of peer-reviewed journal articles. Scientometrics, 2007. to appear.
 S. Argamon, M. Koppel, and G. Avneri. Routing documents according to style. In Proc. Int’l
Workshop on Innovative Internet Information Systems, Pisa, Italy, 1998.
 S. Argamon, M. Koppel, J. Fine, and A. R. Shimony. Gender, genre, and writing style in
formal written texts. Text, 23(3), 2003.
S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler. Age- and sex-linked language variation
in the blogosphere. Journal of Computer Mediated Communication, 2007. (submitted).
S. Argamon and M. Olsen. Toward meaningful computing. Communications of the ACM,
S. Argamon, M. Sarić, and S. S. Stein. Style mining of electronic messages for multiple author
discrimination. In Proc. ACM Conference on Knowledge Discovery and Data Mining, 2003.
 S. Argamon, C. Whitelaw, P. Chase, S. Dhawle, N. Garg, S. R. Hota, and S. Levitan. Stylis-
tic text classiﬁcation using functional lexical features. Journal of the American Society of
Information Science, 2007. (in press).
 Shlomo Argamon and Shlomo Levitan. Measuring the usefulness of function words for author-
ship attribution. In Proceedings of the 2005 ACH/ALLC Conference, Victoria, BC, Canada,
S. Argamon-Engelson, M. Koppel, and G. Avneri. Style-based text categorization: What
newspaper am I reading? In Proc. of AAAI Workshop on Learning for Text Categorization,
pages 1–4, 1998.
J.L. Austin. How to do things with words. Oxford University Press, Oxford, 1976.
 R. Harald Baayen, Hans van Halteren, and Fiona Tweedie. Outside the cave of shadows: Using
syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing,
V. R. Baker. The pragmatic roots of American Quaternary geology and geomorphology.
Geomorphology, 16:197–215, 1996.
 D. Bean and E. Riloﬀ. Unsupervised learning of contextual role knowledge for coreference
resolution. Proc. of HLT/NAACL, pages 297–304, 2004.
 Y. L. Ben-David. Shevet mi-Yehudah (in Hebrew). (no publisher listed), Jerusalem, 2002.
 Michael J. Berry and Gordon Linoﬀ. Data Mining Techniques: For Marketing, Sales, and
Customer Support. John Wiley & Sons, Inc., New York, NY, USA, 1997.
D. Biber. Dimensions of register variation: A cross-linguistic comparison. Cambridge Univer-
sity Press, Cambridge, 1995.
 J. Burrows. ‘Delta’: A measure of stylistic diﬀerence and a guide to likely authorship. Literary
and Linguistic Computing, 17(3):267–287, 2002.
 John F. Burrows. Computation into Criticism: A Study of Jane Austen’s Novels and an
Experiment in Method. Clarendon Press, Oxford, 1987.
 C.S. Butler. Structure and function: A Guide To Three Major structural-functional Theories.
John Benjamins, 2003.
 Carole E. Chaski. Linguistic authentication and reliability. In National Conference on Science
and the Law, San Diego, California, 1999. National Institute of Justice.
 C. E. Cleland. Methodological and epistemic diﬀerences between historical science and exper-
imental science. Philosophy of Science, 2002.
 Harold Cohen. Style as emergence (from what?). In S. Argamon, K. Burns, and S. Dubnov,
editors, The Structure of Style: Algorithmic Approaches to Manner and Meaning, pages X–
XX. Springer, 2007.
 Ido Dagan, Yael Karov, and Dan Roth. Mistake-driven learning in text categorization. In
Claire Cardie and Ralph Weischedel, editors, Proceedings of EMNLP-97, 2nd Conference on
Empirical Methods in Natural Language Processing, pages 55–63, Providence, US, 1997. As-
sociation for Computational Linguistics, Morristown, US.
 R.G. D’Andrade. The Development of Cognitive Anthropology. Cambridge University Press,
 Roger Dannenberg. Synthesizing musical style. In S. Argamon, K. Burns, and S. Dubnov,
editors, The Structure of Style: Algorithmic Approaches to Manner and Meaning, pages X–
XX. Springer, 2007.
 O. de Vel. Mining e-mail authorship. In Workshop on Text Mining, ACM International
Conference on Knowledge Discovery and Data Mining, Boston, MA, 2000.
 O. de Vel, M. Corney, A. Anderson, and G. Mohay. Language and gender author cohort analysis
of e-mail for computer forensics. In Proc. Digital Forensic Research Workshop, Syracuse, NY, 2002.
J. Diamond. Guns, Germs and Steel: The Fates of Human Societies. W.W. Norton, New York, 1997.
 M. Dimitrova, A. Finn, N. Kushmerick, and B. Smyth. Web genre visualization. In Proc.
Conference on Human Factors in Computing Systems, 2002.
 Shlomo Dubnov. Perception of music as an information processing experience. In S. Argamon,
K. Burns, and S. Dubnov, editors, The Structure of Style: Algorithmic Approaches to Manner
and Meaning, pages X–XX. Springer, 2007.
R.P. Fawcett. Cognitive Linguistics and Social Interaction: Towards an Integrated Model of a Systemic Functional Grammar and the Other Components of a Communicating Mind.
Aidan Finn, Nicholas Kushmerick, and Barry Smyth. Genre classification and domain transfer for information filtering. In Fabio Crestani, Mark Girolami, and Cornelis J. van Rijsbergen, editors, Proceedings of ECIR-02, 24th European Colloquium on Information Retrieval Research, Glasgow, UK, 2002. Springer Verlag, Heidelberg, DE.
J.R. Firth. A synopsis of linguistic theory 1930–1955. In F.R. Palmer, editor, Selected Papers of J.R. Firth 1952–1959. Longman, 1968.
G. Forman. An extensive empirical study of feature selection metrics for text classification.
Journal of Machine Learning Research, 3(7-8):1289–1305, 2003.
A. Genkin, D.D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 2006.
 R.L. Gorsuch. Factor Analysis. L. Erlbaum Associates, 1983.
 S. J. Gould. Evolution and the triumph of homology, or, why history matters. American
Scientist, Jan.-Feb.:60–69, 1986.
 Neil Graham and Graeme Hirst. Segmenting a document by stylistic character. In Workshop
on Computational Approaches to Style Analysis and Synthesis, 18th International Joint
Conference on Artificial Intelligence, Acapulco, 2003.
M. Gregory. Aspects of varieties differentiation. Journal of Linguistics, 3:177–198, 1967.
J.J. Gumperz and S.C. Levinson. Rethinking Linguistic Relativity. Cambridge University Press, 1996.
I. Hacking. Historical Ontology. Harvard University Press, Cambridge, MA, 2002.
 M. A. K. Halliday and R. Hasan. Cohesion in English. Longman, London, 1976.
 M.A.K. Halliday. Language as Social Semiotic: The Social Interpretation of Language and Meaning.
Edward Arnold, 1978.
Michael A. K. Halliday. Introduction to Functional Grammar. Edward Arnold, second edition, 1994.
 J. Harris. The Idea of Community in the Study of Writing. College Composition and
Communication, 40(1):11–22, 1989.
 D. I. Holmes. The evolution of stylometry in humanities scholarship. Literary and Linguistic
Computing, 13(3):111–117, 1998.
J. Holmes and M. Meyerhoff. The Community of Practice: Theories and methodologies in language and gender research. Language in Society, 28(2):173–183, 2000.
 David Hoover. Frequent word sequences and statistical stylistics. Literary and Linguistic
Computing, 17:157–180, 2002.
 T. Joachims. Text categorization with support vector machines: Learning with many relevant
features. In Machine Learning: ECML-98, Tenth European Conference on Machine Learning,
pages 137–142, 1998.
T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and
A. Smola, editors, Advances in Kernel Methods—Support Vector Learning. MIT Press, 1999.
 J. Karlgren. Stylistic Experiments for Information Retrieval. PhD thesis, SICS, 2000.
Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. Automatic detection of text genre. In
Philip R. Cohen and Wolfgang Wahlster, editors, Proceedings of the Thirty-Fifth Annual
Meeting of the Association for Computational Linguistics and Eighth Conference of the
European Chapter of the Association for Computational Linguistics, pages 32–38, Somerset,
New Jersey, 1997. Association for Computational Linguistics.
 B. Kjell and O. Frieder. Visualization of literary style. In IEEE International Conference on
Systems, Man and Cybernetics, pages 656–661, Chicago, 1992.
 M. Koppel, S. Argamon, and A. R. Shimoni. Automatically categorizing written texts by
author gender. Literary and Linguistic Computing, 17(4), 2003.
M. Koppel, D. Mughaz, and J. Schler. Text categorization for authorship verification. In 8th Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL, 2004.
M. Koppel and J. Schler. Authorship verification as a one-class classification problem. In Proc. of Int’l Conf. on Machine Learning, Banff, Canada, 2004.
Moshe Koppel, Navot Akiva, and Ido Dagan. A corpus-independent feature set for style-based text categorization. In Workshop on Computational Approaches to Style Analysis and Synthesis, 18th International Joint Conference on Artificial Intelligence, Acapulco, 2003.
Nicholas Kushmerick. Learning to remove internet advertisement. In Oren Etzioni, Jörg P. Müller, and Jeffrey M. Bradshaw, editors, Proceedings of the Third International Conference on Autonomous Agents (Agents’99), pages 175–181, Seattle, WA, USA, 1999. ACM Press.
Ken Lang. NewsWeeder: learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331–339. Morgan Kaufmann Publishers Inc., San Mateo, CA, USA, 1995.
 D.D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval.
Proceedings of ECML-98, 10th European Conference on Machine Learning, 1398:4–15, 1998.
N. Littlestone. Learning when irrelevant attributes abound. In Proceedings of the Twenty-Eighth Annual Symposium on Foundations of Computer Science, pages 68–77, October 1987.
J. R. Martin. English Text: System and Structure. Benjamins, Amsterdam, 1992.
 J. R. Martin and P. R. R. White. The Language of Evaluation: Appraisal in English. Palgrave,
London, 2005. (http://grammatics.com/appraisal/).
C. Mascol. Curves of Pauline and Pseudo-Pauline Style I. Unitarian Review, 30:452–460, 1888.
C. Mascol. Curves of Pauline and Pseudo-Pauline Style II. Unitarian Review, 30:539–546, 1888.
 R. A. J. Matthews and T. V. N. Merriam. Distinguishing literary styles using neural networks,
chapter 8. IOP publishing and Oxford University Press, 1997.
 Christian Matthiessen. Lexico-grammatical cartography: English systems. International Lan-
guage Sciences Publishers, 1995.
 E. Mayr. Evolution and the Diversity of Life. Harvard University Press, Cambridge, 1976.
 E. Mayr. How biology diﬀers from the physical sciences. In Evolution at the Crossroads: The
New Biology and the New Philosophy of Science, pages 43–46. MIT Press, Cambridge, 1985.
A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, 752, 1998.
A. McEnery and M. Oakes. Authorship studies/textual statistics, pages 234–248. Marcel Dekker.
Vicki McKinney, Kanghyun Yoon, and Fatemeh “Mariam” Zahedi. The measurement of web-customer satisfaction: An expectation and disconfirmation approach. Info. Sys. Research.
 G. McMenamin. Forensic Linguistics: Advances in Forensic Stylistics. CRC Press, 2002.
T.C. Mendenhall. Characteristic Curves of Composition. Moore & Langen, printers and binders, 1887.
 F. Mosteller and D. L. Wallace. Inference and Disputed Authorship: The Federalist. Series in
behavioral science: Quantitative methods edition. Addison-Wesley, Massachusetts, 1964.
 V. Ng. Learning noun phrase anaphoricity to improve coreference resolution: Issues in repre-
sentation and optimization. Proceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics (ACL), pages 152–159, 2004.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification
using machine learning techniques. In Proceedings of EMNLP, 2002.
Jon Patrick. The Scamseek project: Text mining for financial scams on the internet. In S.J. Simoff and G.J. Williams, editors, Proc. 3rd Australasian Data Mining Conf., pages 33–38.
 J. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector
Machines. Microsoft Research Technical Report MSR-TR-98-14, Redmond, WA, 1998.
 Ehud Reiter. Generating texts in diﬀerent styles. In S. Argamon, K. Burns, and S. Dubnov,
editors, The Structure of Style: Algorithmic Approaches to Manner and Meaning, pages X–
XX. Springer, 2007.
 J. Rudman. The state of authorship attribution studies: Some problems and solutions.
Computers and the Humanities, 31(4):351–365, 1997.
 J. L. Rudolph and J. Stewart. Evolution and the nature of science: On the historical discord
and its implication for education. Journal of Research in Science Teaching, 35:1069–1089, 1998.
J.R. Searle. Expression and Meaning: Studies in the Theory of Speech Acts. Cambridge University Press, 1979.
 Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing
Surveys, 34(1), 2002.
Efstathios Stamatatos, Nikos Fakotakis, and George K. Kokkinakis. Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4):471–495, 2000.
 George Stiny. Whose style is it? In S. Argamon, K. Burns, and S. Dubnov, editors, The
Structure of Style: Algorithmic Approaches to Manner and Meaning, pages X–XX. Springer, 2007.
 John M. Swales. Genre Analysis. Cambridge University Press, 1990.
 V.I. Torvik, M. Weeber, D.R. Swanson, and N.R. Smalheiser. A probabilistic similarity metric
for Medline records: A model for author name disambiguation. Journal of the American
Society for Information Science and Technology, 56(2):140–158, 2005.
Peter D. Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the ACL (ACL’02), pages
417–424, Philadelphia, Pennsylvania, 2002.
F. Tweedie, S. Singh, and D. Holmes. Neural network applications in stylometry: The Federalist Papers. Computers and the Humanities, 30(1):1–10, 1996.
E. Wenger. Communities of Practice: Learning, Meaning, and Identity. Cambridge University Press, 1998.
 W. Whewell. History of the Inductive Sciences. John W. Parker, London, 1837.
Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1):69–90, 1999.
Y. Yang and J.O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 412–420, 1997.
 G. U. Yule. Statistical Study of Literary Vocabulary. Cambridge University Press, 1944.
 G.U. Yule. On sentence length as a statistical characteristic of style in prose with application
to two cases of disputed authorship. Biometrika, 30:363–390, 1938.