
Abstract

The computational analysis of the style of natural language texts, computational stylistics, seeks to develop automated methods to (1) effectively distinguish texts with one stylistic character from those of another, and (2) give a meaningful representation of the differences between textual styles. Such methods have many potential applications in areas including criminal and national security forensics, customer relations management, spam/scam filtering, and scholarly research. In this chapter, we propose a framework for research in computational stylistics, based on a functional model of the communicative act. We illustrate the utility of this framework via several case studies.
The Rest of the Story:
Distilling Meaning from Stylistic Variation
Shlomo Argamon
Department of Computer Science
Illinois Institute of Technology
Chicago, IL 60645
Moshe Koppel
Department of Computer Science
Bar-Ilan University
Ramat Gan, Israel
January 29, 2007
1 Introduction
As we have seen, there are many diverse concepts of what we mean when we speak of the ‘style’ of
a work; we may speak, for example, of an individual’s specific style of artistic expression [23, 87], of
style as a means of implicitly contextualizing a work as part of a certain genre [26, 81], of style as a
means of conveying emotion [26], or of style as a concept organizing how observers tend to interpret
a work [87, 26]. In this chapter, we explore automated methods for stylistic analysis of written texts
(computational stylistics), formulated as having roughly two research goals: (a) develop methods
to automatically distinguish texts with a certain stylistic character from those of another, and
(b) distill an interpretable representation of the difference between such stylistic characters. We
seek an inclusive view of the nature of style, endeavoring to consider models and methods covering
the individual style of a genius as well as the generic style of a collective, style’s communicative
functions as well as its social determinants, and the intentions of the author as well as the potential
reaction of the reader.
We approach this question from a semantic and pragmatic perspective, taking the ‘style’ of a
text to cover the broad range of meanings that lie beyond what is conventionally thought of as the
text’s ‘content’. Such content may be thought of as the ‘denotational meaning’ of the text, roughly
covered by the “Six Ws” of journalism: Who, What, Where, Why, When, and How. By ‘style’,
then, we mean pretty much everything else that we can know from a text about the communicative
act that it embodies. The basic idea is that the text’s style may be defined as how the author
chose to express her content, from among a very large space of possible ways of doing so. We
contrast, therefore, the how of a text (style) from the what (content). Style, thus construed, may
include interpersonal aspects of meaning such as affect and social roles, sociolinguistic categories
such as genre and register, idiolectic aspects such as author identity and personality, and specifics
of the individual speech act such as the medium of transmission and the purpose of the text. These
notions, fleshed out in Section 2 below, form a rather diverse set of characteristics; we argue that
they can be usefully considered together for the purposes of computational textual analysis.
The diversity of types of stylistic textual characteristics is reflected in the large number of
possible applications for style analysis that are already being explored. Current areas of application
include authorship attribution and profiling [76, 19, 54, 12, 72, 86, 28, 4, 89], genre-based text
classification and retrieval [52, 53, 33], sentiment analysis [90, 78], and spam/scam filtering [1, 59,
79]. Other potential applications include criminal and national security forensics [21, 74], mining of
customer feedback [16, 73], and aiding humanities scholarship [67, 47, 49, 6]. Automated stylistic
analysis thus promises new tools that may help with the ever-increasing number of texts available
in all topics and application domains.
A key research question is what textual features to use for modeling style. While topic-based
text categorization typically uses models based on “bags of content words”, style is somewhat more
elusive. We start from the intuitive notion that style is indicated by features representing the
author’s choice of one mode of expression from a set of equivalent modes for a given content. At
the surface level, this may be expressed by a wide variety of possible features of a text: choice
of particular words, syntactic structures, discourse strategy, or all of the above and more. The
underlying causes of such variation are similarly heterogeneous, including the genre, register, or
purpose of the text, as well as the educational background, social status, and personality of the
author and audience. What all these dimensions of variation have in common, though, is an
independence from the ‘topic’ or ‘content’ of the text, which may be considered to be those objects
and events that it refers to (as well as their properties and relations as described in the text). As
well, textual features of style (as opposed to content) tend to function mostly in the aggregate—no
single occurrence of a word or syntactic structure indicates style, but rather an aggregate preference
for certain choices in a text rather than others.
Most computational stylistics work to date has been based on hand-selected sets of content-
independent features such as function words [76, 67, 91], parts-of-speech and syntactic struc-
tures [86], and clause/sentence complexity measures [96, 27]; also see the survey in [52]. While
new developments in machine learning and computational linguistics have enabled larger numbers
of features to be generated for stylistic analysis, it is still difficult to articulate strong linguistic mo-
tivations for any preferred input feature set that relates it directly to particular stylistic concerns.
Rather, the general methodology that has developed is to find as large a set of topic-independent
textual features as possible and use them as input to a generic learning algorithm (preferably one
resistant to overfitting, and possibly including some feature selection). Some interesting and effec-
tive feature sets have been found in this way, such as [52, 58]; function words have also proven to
be surprisingly effective on their own [72, 4, 9]. In the long term, however, a clear foundation in
a linguistic theory of meaning will be needed to gain true insight into the nature of the stylistic
dimension(s) under study. We hope in this chapter to provide a tentative framework for such study,
and show how some useful insights can be gleaned from computational stylistic analyses.
2 Style and the Communicative Act
To understand what may constitute a meaningful characterization of stylistic variation, as derived
from computational analysis of texts, we first discuss the nature of the ‘communicative act’ which a
text embodies. Among the components of the communicative act, we may identify and distinguish
diverse aspects of what we call ‘style’. The objective is to model how various factors constituting
the communicative act influence (in various ways) the composition of a text. Given such a model,
the goal of computational stylistics is to determine, to the extent possible, the specific stylistic
aspects of the communicative act embodied in a given text, or of the class of such acts embodied
in a corpus of texts. Again, by ‘stylistic’ we mean those aspects that are (relatively) independent
of ‘content’.

Figure 1: Schematic diagram of our model of the communicative act (Author and Audience
communicating Content for a Purpose, within a Context characterized by its ontology, ideology,
and intertext).
The most fundamental participants in any communicative act are the Author (equivalently, the
‘writer’ or ‘speaker’) who is the source of the text, and the Audience (equivalently, the ‘reader’,
‘listener’, or ‘recipient’) to whom the text is conveyed. Of course, when analyzing a non-dialogic
text, the text itself gives no possible access to the nature of the actual Audience, hence we are
interested in the Audience that the Author had in mind; it is that ‘intended’ Audience to which we
refer. The identities and characteristics of the Author and Audience can affect the nature of the
text communicated between them—hence the classical stylistics problem of authorship attribution,
for example.
Three other factors are also directly implicated in the composition of the text. First, a text is
seen to encode some Content, normally conceived of as some sort of propositional content denoted
by the text. For example, the declarative clause “The red book is on the table,” denotes the
proposition that a book which is red and whose identity is assumed known, is on a table whose
identity is also assumed known; the case of a request, such as “Please give me the book,” is more
obscure, but can be thought of as representing a potential future event of the Audience giving a
book whose identity is known to the Author. Second, in addition to such propositional Content,
the Purpose of the communication affects the form of the text, by expressing its Content in a speech
act designed to accomplish that purpose [11, 84], which will specify to some extent the grammatical
and rhetorical form of the text. The overall form and texture of an essay geared to informing a
student about the effects of monetary policy on inflation will differ noticeably from that of one
geared to convincing voters of the advantages of a particular policy, even though the Content of
both may be very similar. Third, the Medium through which the text is expressed can affect its
composition in both overt and subtle ways. To take two modern extremes, the Medium of a book
affords greater possibilities of depth and options for variety of expression than does the Medium of
an instant message (IM), while the implicit intimacy and speed of sending an IM offers possibilities
that writing a book may lack. Certainly, texts in these two Media are distinctive based on a few
key surface features (e.g., text length, certain abbreviations, syntactic ‘well-formedness’).
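As a minimal sketch of extracting such surface features (the abbreviation list and feature names here are our own illustrative choices, not taken from the study):

```python
import re

# Hypothetical examples of IM-style abbreviations; any real feature set
# would be derived from corpus data rather than hand-picked like this.
IM_ABBREVS = {"lol", "brb", "btw", "omg", "u", "r"}

def medium_features(text):
    """Compute a few surface features of the kind that can separate Media."""
    words = text.split()
    n = max(len(words), 1)
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    return {
        "length": len(words),
        "abbrev_rate": sum(w.lower().strip(".,!?") in IM_ABBREVS for w in words) / n,
        "avg_sentence_len": len(words) / sentences,
    }

im = medium_features("omg lol brb u there?")
book = medium_features("The committee convened at noon. Deliberations continued well into the evening.")
```

Even on these toy inputs, the IM-like text scores far higher on abbreviation rate, illustrating how a handful of cheap features can distinguish Media.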
Finally, in addition to these five direct influences on the characteristics of a particular text, we
must also consider the larger Context in which the communicative act takes place. This includes its
direct social context, the organization(s) or institution(s) sanctioning the communication, which
may impose constraints on the form of the text (in the extreme case, through specific editorial
guidelines). We may also consider the less-direct influence imposed by the context of the enclosing
culture, as mediated through various overlapping communities: communities of practice [92, 48],
formed of people engaged together in a given activity (e.g., a project team, or a club), communities
of association, formed of people sharing location and resources (e.g., a neighborhood), and discourse
communities [46, 88], comprising a group of people with a shared textual history.
Without entering into existing debates on the exact nature of such communal groupings, their
relationships with each other, and the precise ways in which they influence textual form, we may identify
(in a relatively theory-neutral fashion) three overarching notions of mediating factors for a context’s
effect on the form of communicative act realized as a text. First, we can identify what we may
call an ontology, or theory of what is [25, 41, 42]. Different cultural contexts may entail different
commitments to the sorts of objects that exist, how they are referred to in the language, and
the taxonomic relationships that may obtain between them. Examples of ontological differences
include the variation in food items available in different regions, and in other everyday
items: modern urbanites are more likely to refer to automobile care than to farm animal
care. As well, specialized fields such as science or metaphysics may reify certain abstract concepts
and processes, treating them similarly to tangible objects in their discourse.
Second, we should consider the role of an ideology, which establishes a set of possible social
roles and assumed relationships between them, of prestige and power [63]. It is the ideological
context that provides the background for the interpersonal relationship(s) between the Author and
the Audience which will affect the nature of the text, in terms of its formality, politeness, level of
epistemic commitment, and so on.
Third, we must also consider the intertext, or background of all pre- and co-existing texts that
may have influenced the composition of the text under consideration. This may be by direct quo-
tation or citation, by the incorporation of similar thematic elements or rhetorical structures, or
by the use of particular phraseology. Much or even most such intertextual reference may not be
conscious, but rather incorporation of textual elements ‘floating’ in a background knowledge of a
body of texts. A community of discourse is distinguished primarily by its having a particular intertextual
history to which its texts make reference; prime examples are various traditional religious
communities whose religious texts refer to the community’s acknowledged canonical texts.
Figure 1 gives a schematic diagram of this model. In these terms, the (ambitious) goal of our
inclusive notion of stylistic analysis is, given a text, to glean as much as possible about the various
components of the communicative act it embodies, apart from its Content. Thus authorship
attribution and profiling, identifying the genre of a text (i.e., its purpose and place in some community),
and determining social power relations realized in a body of texts all involve forms of stylistic
analysis in our terms. Certainly there are correlations between various components—particular au-
thors will have idiosyncratic topical preferences, certain topics are largely embedded in communal
discourses that come with particular stylistic commitments (e.g., scholarly disciplines), different
media may be more or less appropriate for various types of content or purpose, and so on. Further-
more, how a text expresses its purpose, for example, depends on its particular content and context.
Thus no one factor can be considered in isolation, apart from all the others. However, in the context
of a specific research question, consideration of one or another aspect may be usefully considered
separately, provided that appropriate caution be exercised in drawing general conclusions.
Our model should perhaps be understood in the context of systemic functional models of register,
context, and culture, such as Martin [63], Halliday [44], Gregory [40], and Fawcett [32]. For
example, in terms of systemic register theory, our notion of Content parallels Halliday’s “field”
(though Ontology is surely also relevant), while the conjunction of Author and Audience determines
Gregory’s “personal tenor” (included with Purpose in Halliday’s “tenor”), and the Medium is
essentially the “mode”. Genre effects are embedded in our model as realizations of different
Purposes within a discourse community (or community of practice); the communal dimension
foregrounds aspects of Context in genre analysis. As Halliday and Martin (among others) have
noted, ideology is a key player in the construction of genre. Note that we are not proposing here
any new detailed linguistic theory of these phenomena, but merely suggest that a model such as
ours will be useful for organizing research into automated style analysis.
3 Computational Stylistics
Research in computational stylistics seeks effective models of language style by applying machine
learning algorithms to stylistically meaningful features. The roots of the field go back to the
studies of Mendenhall [75] and Mascol [65, 66] in the late 19th century on the use of word-length
statistics for determining authorship. In the 20th century, the foundations of such ‘stylometric
analysis’ were further advanced by Yule’s statistical studies of word-length and part-of-speech
distributions in literary prose [97, 96], and Mosteller and Wallace’s seminal authorship study of
The Federalist Papers [76], based on analyzing frequencies of function words. Due to the high
cost of computing and analyzing such features before the wide availability of powerful computers,
stylometrics researchers had, until recently, sought relatively simple, statistically valid
models of stylistic distinctions based on a small number (dozens, at most) of easily computed
textual statistics, such as word frequencies [76], phrase-type frequencies [12], or sentence-complexity
measures.

More recent research on machine-learning techniques for text classification, on the other hand,
has developed more sophisticated learning algorithms which can use combinations of many thou-
sands of features to classify documents according to topic (see Sebastiani’s [85] excellent survey).
Working systems that have been developed use a variety of modern machine learning techniques
such as Naïve Bayes [60, 61], Winnow [24], and Support Vector Machines [50]. Recent work applying
machine learning and statistical methods for text classification to stylometric features for style
analysis has yielded useful techniques for authorship attribution [3, 86, 7], genre analysis [17, 67,
10, 30], and other applications [47, 39, 55].

Figure 2: Generic text categorization system architecture. Documents of each class (here A and B)
are represented as feature vectors <x1,…,xN>, labeled with their class, and passed to a learning
algorithm.
Text categorization is a key problem in the field of machine learning [85]. The idea is that
we are given two or more classes of documents and we need to find some formula (usually called
a “model”) that reflects statistical differences between the classes and that can then be used to
classify a new document. For example, we might wish to classify a document as being about one of
a number of possible topics, as having been written by a man or a woman, as having been written
by one of a given set of candidate authors and so forth.
Figure 2 depicts the basic architecture of a text categorization system in which we are given
examples of two classes of documents, Class A and Class B. The first step, document representation,
involves defining a set of text features which might potentially be useful for categorizing texts in a
given corpus (for example, words that are neither too common nor too rare) and then representing
each text as a vector in which entries represent (some non-decreasing function of) the frequency of
each feature in the text. Optionally, one may then use various criteria for reducing the dimension
of the feature vectors [95, 35].
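As a minimal sketch of this representation step (the feature list and the particular non-decreasing transform below are illustrative choices, not prescribed by any one system):

```python
import math
import re
from collections import Counter

def represent(text, feature_words):
    """Map a text to a vector of damped relative frequencies of the
    feature words; log damping is one common non-decreasing transform."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [math.log(1 + 10000 * counts[w] / n) for w in feature_words]

features = ["the", "of", "and", "to", "in"]  # a tiny illustrative feature set
vec = represent("The book is on the table, and the table is in the room.", features)
```

In a real system the feature list would contain hundreds of function words or selected content words, and dimension reduction could then be applied to the resulting vectors.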
Once documents have been represented as vectors, there are a number of learning algorithms
that can be used to construct models that distinguish between vectors representing documents in
Class A and vectors representing documents in Class B. Yang [94] compares and assesses some of
the most promising algorithms, which include k-nearest-neighbor, neural nets, Winnow, Support
Vector Machines, etc. One particular class of learned model which is easy to understand and
analyze, and which we use here, is the linear separator. The basic idea is that each feature x_i is
assigned a weight w_i(c) for each possible text class c; these weights collectively form a weight
vector w(c) for each class. The dot product of the weight vector w(c) with a text’s feature vector
x gives the text’s score for class c; the class with the highest score is assigned to the text. A
number of different
methods are extant for computing such weight vectors, including Naive Bayes [61, 71], Support
Vector Machines (SVMs) [51], Bayesian logistic regression [36], and so forth, each with somewhat
different properties. Generally speaking, however, many algorithms give similar performance in
text categorization, provided that features are chosen well.
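A toy sketch of such a multi-class linear separator (the weights below are made up for illustration, not learned):

```python
import numpy as np

def classify(x, weight_vectors):
    """Score each class c by the dot product w(c) . x and return the argmax."""
    scores = {c: float(np.dot(w, x)) for c, w in weight_vectors.items()}
    return max(scores, key=scores.get), scores

# Hypothetical weights for two classes over three features.
W = {"A": np.array([0.9, -0.2, 0.1]),
     "B": np.array([-0.3, 0.8, 0.4])}
label, scores = classify(np.array([1.0, 0.5, 0.2]), W)
```

Here class A scores 0.9 − 0.1 + 0.02 = 0.82 against 0.18 for class B, so the text is assigned to A; the learning algorithms named above differ only in how they compute the weight vectors.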
Linear separator learning algorithms which do not require feature independence and are robust
to presence of irrelevant features (such as Winnow and SVMs), have tended to work quite well for
stylistic categorization problems. The main current research issue in the field is the question of what
kinds of textual features are good style discriminators, especially with the use of algorithms that
can effectively deal with very large numbers of such features. Features for stylistic discrimination
must be invariant as to topic but vary with the specific stylistic dimension under study. Our results
and those of others [4, 7, 33, 39] have shown that using just relative frequencies of several hundred
function words often gives excellent results, while adding syntactic, semantic, or complexity-based
features can sometimes be advantageous as well. Often, a great deal of insight into the underlying
stylistic dimension being studied can be found by using semantically-derived features, as we will
see in some of the studies described below.
4 Author Profiling
Our first case study is from an analysis of a large corpus of blogs to see if writing style and
topic preferences vary with age and sex of the author [5], and what this variation may mean. In
this study, we applied two different machine-learning algorithms: Bayesian multinomial logistic
regression (BMR [36]) and multi-class balanced real-valued Winnow (WIN [62, 24]), to construct
classification models for author age and for author sex.
4.1 The corpus
The corpus for this study included all blogs harvested in mid-August 2004 that had both
author-provided indication of sex and age and at least 200 occurrences of common English words.
The unit of analysis was the collected writing of each blogger from the blog’s inception date until
harvest; we do not distinguish between different posts by a given blogger. Each blog was labeled
for sex and age based on the blogger’s self-identification. For purposes of analysis, formatting and
non-English text was removed from each blog. To enable reliable age categorization (since a blog
can span several years of writing), all blogs for boundary ages (ages 18-22 and 28-32) were removed.
Each blogger was categorized by age at time of harvest: “10s” (ages 13-17), “20s” (ages 23-27), and
“30s” (ages 33-47), and also by sex: “male” and “female”. To decouple sex from age effects, the
numbers of blogs of each sex within each age category were equalized by randomly deleting surplus
blogs from the larger sex category. The final corpus contained 19,320 blogs (8,240 in 10s, 8,086 in
20s, and 2,994 in 30s), comprising a total of 681,288 posts and over 140 million words; there were
35 posts and 7300 words per blog on average.
4.2 Classification accuracy
Each blog was represented by a vector containing the frequencies in the blog of the above-mentioned
377 function words as well as of the 1000 words with highest information gain for age and sex,
respectively (as computed on the holdout set).
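A sketch of information-gain scoring over binary word presence (a simplified variant for illustration; the study's exact computation and holdout protocol are not reproduced here):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(word, docs, labels):
    """IG of a word's presence/absence for predicting the class label."""
    h = entropy(labels)
    n = len(labels)
    for part in ([lab for d, lab in zip(docs, labels) if word in d],
                 [lab for d, lab in zip(docs, labels) if word not in d]):
        if part:
            h -= len(part) / n * entropy(part)
    return h

# Tiny made-up example: documents as word sets, labeled by age group.
docs = [{"school", "mom"}, {"school", "fun"}, {"work", "family"}, {"work", "politics"}]
labels = ["10s", "10s", "30s", "30s"]
vocab = sorted({w for d in docs for w in d})
ranked = sorted(vocab, key=lambda w: information_gain(w, docs, labels), reverse=True)
```

Words whose presence perfectly predicts the class (here "school" and "work") get maximal gain and would be retained as features.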
Classification accuracies in ten-fold cross-validation for author age (over all three age
classes) were 77.4% (BMR) and 75.0% (WIN). Results for author sex were accuracies of 79.3%
(BMR) and 80.5% (WIN), consistent with classification studies on author sex in other corpora [28,
4]. When one takes into account that self-reported sex and age information from bloggers may often
be inaccurate and that blogs commonly include much quoted text, these results might be considered
surprisingly high, clearly showing that author sex and age are indicated by word usage.
4.3 Factor analysis
What do such differences in language use mean, however? First, let us consider the 1000 most
frequent words in the corpus overall and how different classes of bloggers differ in their use. These
1000 words contain 323 different function words and 677 different content words, accounting for
59.4% and 21.7%, respectively, of all word occurrences. In order to understand underlying patterns
of language variation, we considered naturally occurring word classes.
Function words can be divided straightforwardly into a number of grammatical classes, such
as PersonalPronouns, Articles, Conjunctions, and so forth. For content words, a straightforward
way to identify natural word classes for a given corpus is to perform factor analysis. A maximum
likelihood factor analysis with equimax rotation and Kaiser normalization [37] on the rate of use
of each of the 677 most frequent content words yields twenty coherent factors that depict different
content-related themes, each containing between 13 and 32 words. We assigned intuitive names to
the factors: Conversation, At Home, Family, Time, Work, PastActions, Games, Internet, Location,
Fun, Food/Clothes, Poetic, Books/Movies, Religion, Romance, Swearing, Politics, Music, School,
and Business.
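As a very rough stand-in for that analysis (the study used maximum-likelihood factor analysis with equimax rotation and Kaiser normalization; here we merely take leading eigenvectors of a word-rate correlation matrix on synthetic data to show how co-varying words group into factors):

```python
import numpy as np

def rough_factors(rates, k=2):
    """Loadings of the top-k eigenvectors of the correlation matrix of
    per-document word rates; words loading together suggest a 'factor'.
    (A crude surrogate for proper ML factor analysis with rotation.)"""
    corr = np.corrcoef(rates, rowvar=False)
    vals, vecs = np.linalg.eigh(corr)
    return vecs[:, np.argsort(vals)[::-1][:k]]  # shape: (n_words, k)

rng = np.random.default_rng(0)
themes = rng.normal(size=(200, 2))  # two latent themes across 200 blogs
# Six synthetic "words": four driven by theme 0, two by theme 1.
rates = np.hstack([themes[:, [0]] @ np.ones((1, 4)),
                   themes[:, [1]] @ np.ones((1, 2))])
rates += 0.1 * rng.normal(size=rates.shape)  # per-word noise
loadings = rough_factors(rates)
```

On this synthetic data, the first four words load heavily on one factor and the last two on the other, mirroring how thematically related content words cluster in the real corpus.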
Frequencies of the twenty factors’ usage for each age and each sex, as well as the same data
for function words broken down by parts of speech, indicate meaningful differences in both content
and style among bloggers of different ages. Use of words associated with Family, Religion, Politics,
Business, and Internet increases significantly (p <0.001) with age, while use of words associated
with Conversation, AtHome, Fun, Romance, Music, School, and Swearing decreases significantly
(p <0.001) with age. Use of other factors either does not vary monotonically or shows no significant
variation with age.
In addition, use of PersonalPronouns, Conjunctions, and AuxiliaryVerbs decreases significantly
(p <0.001) with age, while use of Articles and Prepositions increases significantly (p <0.001) with
age. Style and content effects are highly correlated: use of multiple regressions indicates that
controlling for style effects essentially eliminates content effects and vice versa.
These results uniformly suggest that younger writers are more focused on personal issues, while
older writers have a more external focus on social issues. Two notes of caution are in order.
Interpretation must not ignore the fact that since this study is synchronic, we cannot separate
generational effects from age effects. Moreover, since there are fewer older bloggers, they may
represent an atypical demographic as early adopters of technology.
Another key finding was that the parts of speech (Articles and Prepositions) used significantly
more by older bloggers are also used significantly more (p <0.001) by male bloggers, while the
parts of speech (PersonalPronouns, Conjunctions, and AuxiliaryVerbs) used significantly more by
younger bloggers are also used significantly more (p <0.001) by female bloggers. Similarly, the
factors Religion, Politics, Business, and Internet, which are used significantly more by older
bloggers, are also used significantly more (p <0.001) by male bloggers, while the factors
Conversation, AtHome, Fun, Romance, and Swearing, which are used significantly more by younger
bloggers, are also used significantly more (p <0.001) by female bloggers. (The exceptions to
this pattern are Family, used more by older bloggers and by females; Music, used more by younger
bloggers and by males; and School, for which there is no significant difference between male and
female usage.) As in the case of age, multiple regression indicates that controlling for style effects
essentially eliminates content effects and vice versa.
These results suggest a single underlying factor distinguishing inner- and outer-directed com-
munication (both style and topic) that can explain both sex-linked and age-linked variation in
language use. Indeed, as we have seen, these variations are pronounced enough that they can be
exploited to predict the age and sex of a blog author with reasonably high accuracy.
5 Authorship Verification
The second problem we consider is that of authorship verification. In the authorship verification
problem, we are given examples of the writing of a single author and are asked to determine if
given texts were or were not written by this author. As a categorization problem, verification is
significantly more difficult than attribution and little, if any, work has been performed on it in the
learning community. When we wish to determine if a text was written by one of several known
authors, it is sufficient to use their respective known writings to construct a model distinguishing
them and to test the unknown text against the model [76, 91, 82, 18]. If, on the other hand, we
need to determine if a text was written by Author A or not, it is very difficult, if not impossible, to
assemble an exhaustive, or even representative, sample of not-A. The situation in which we suspect
that a given author may have written some text but do not have an exhaustive list of alternative
candidates is a common one.
The particular authorship verification problem we will consider here is a genuine literary co-
nundrum. We are given two nineteenth century collections of Jewish rabbinic responsa written
in a combination of Hebrew and Aramaic. The first, RP (Rav Pe‘alim), includes 509 documents
authored by an Iraqi rabbinic scholar known as Ben Ish Chai. The second, TL (Torah Lishmah),
includes 524 documents that Ben Ish Chai claims to have found in an archive. There is ample
historical reason to believe that he in fact authored the manuscript but did not wish to take credit
for it for personal reasons [15]. What do the texts tell us?
The first thing we do is to find four more collections of responsa written by four other authors
working in roughly the same area during (very) roughly the same period. These texts are Zivhei
Zedeq (ZZ; Iraq, nineteenth century), Sho’el veNish’al (SN; Tunisia, nineteenth century), Darhei
No‘am (DN; Egypt, seventeenth century), and Ginat Veradim (GV; Egypt, seventeenth century).
We begin by checking whether we are able to distinguish one collection from another using standard
text categorization techniques. We select a list of lexical features as follows: the 200 most frequent
words in the corpus are selected and all those that are deemed content-words are eliminated man-
ually. We are left with 130 features. After pre-processing the text as in the previous experiment,
we constructed vectors of length 130 in which each element represented the relative frequency (nor-
malized by document length) of each feature. We then used Balanced Winnow as our learner to
distinguish pairwise between the various collections. Five-fold cross-validation experiments yield
accuracy of greater than 95% for each pair. In particular, we are able to distinguish between RP
and TL with accuracy of 98.5%.
One might thus be led to conclude that RP and TL are by different authors. It is still possible,
however, that in fact only a small number of features are doing all the work of distinguishing between
them. The situation in which an author will use a small number of features in a consistently different
way between works is typical. These differences might result from thematic differences between the
works, from differences in genre or purpose, from chronological stylistic drift, or from deliberate
attempts by the author to mask his or her identity.
In order to test whether the differences found between RP and TL reflect relatively shallow
differences that can be expected between two works of the s ame author or reflect deeper differences
Figure 3: Accuracy (y-axis) on training data of learned models comparing TL to other collections
as best features are eliminated, five per iteration (x-axis). Dotted line on bottom is RP vs. TL.
that can be expected between two different authors, we invented a new technique that we call
unmasking [56, 57] that works as follows. We begin by learning models to distinguish TL from each
of the other authors including RP. As noted, such models are quite effective. In each case, we then
eliminate the five highest-weighted features and learn a new model. We iterate this procedure ten
times. The depth of difference between a given pair can then be gauged by the rate with which
results degrade as good features are eliminated.
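The unmasking loop itself is simple to sketch. The version below substitutes a nearest-centroid linear classifier for the learners actually used in the experiments, purely to keep the example self-contained; the essential logic of repeatedly removing the most discriminating features and recording the resulting accuracy curve is the same:

```python
def class_centroids(X, y):
    """Per-class feature means for binary labels 0/1."""
    d = len(X[0])
    sums = {0: [0.0] * d, 1: [0.0] * d}
    counts = {0: 0, 1: 0}
    for xi, yi in zip(X, y):
        counts[yi] += 1
        for j in range(d):
            sums[yi][j] += xi[j]
    return ([s / counts[0] for s in sums[0]],
            [s / counts[1] for s in sums[1]])

def nearest_centroid_accuracy(X, y, c0, c1, active):
    """Training accuracy of nearest-centroid classification over active features."""
    correct = 0
    for xi, yi in zip(X, y):
        d0 = sum((xi[j] - c0[j]) ** 2 for j in active)
        d1 = sum((xi[j] - c1[j]) ** 2 for j in active)
        pred = 1 if d1 < d0 else 0  # ties go to class 0
        correct += int(pred == yi)
    return correct / len(X)

def unmasking_curve(X, y, iterations=10, drop_per_iter=5):
    """Accuracy per iteration as the most separating features are removed."""
    active = set(range(len(X[0])))
    curve = []
    for _ in range(iterations):
        if len(active) <= drop_per_iter:
            break
        c0, c1 = class_centroids(X, y)
        curve.append(nearest_centroid_accuracy(X, y, c0, c1, active))
        # Drop the features with the largest between-class separation,
        # analogous to eliminating the highest-weighted model features.
        ranked = sorted(active, key=lambda j: abs(c1[j] - c0[j]), reverse=True)
        active -= set(ranked[:drop_per_iter])
    return curve
```

On data where a single feature does all the separating work (the RP-vs.-TL situation), the curve collapses after the first iteration; where many features contribute (genuinely different authors), it degrades gradually.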
The results (shown in Figure 3) could not be more glaring. For TL versus each author other
than RP, we are able to distinguish with gradually degrading effectiveness as the best features
are dropped. But for TL versus RP, the effectiveness of the models drops right off a shelf. This
indicates that just a few features, possibly deliberately inserted as a ruse or possibly a function of
slightly differing purposes assigned to the works, distinguish between the works. We have shown
elsewhere [57] that the evidence offered in Figure 3 is sufficient to conclude that the author of RP
and the author of TL are one and the same: Ben Ish Chai.
For example, the frequency (per 10000 words) of the word zeh [= this] in RP is 80 and in TL
is 116. A cursory glance at the texts is enough to establish why this is the case: the author of TL
ended every responsum with the phrase vehayah zeh shalom [= this shall be farewell], thus artificially
inflating the frequency of these words. Indeed the presence or absence of this phrase alone is enough
to allow highly accurate classification of a given responsum as either RP or TL. Once features of
this sort are eliminated, however, the works become indistinguishable—a phenomenon which does
not occur when we compare TL to each of the other collections. In other words, many features can
be used to distinguish TL from works in our corpus other than RP, but only a few distinguish TL
from RP. Most features distribute similarly in RP and TL. A wonderfully illustrative example of
this is the abbreviation vkhu [= etc.], the respective frequencies of which in the various corpora are
as follows: TL:29 RP:28 SV:4 GV:4 DN:41 ZZ:77. Note that this similarity is unlikely to be due
to regional and chronological dependencies, since GV and DN have widely differing values, though
they were both written in seventeenth-century Egypt.
6 Scientific Rhetoric
A third type of text that we have studied is peer-reviewed scientific articles. Our motivation in
doing so is to see what stylistic realization, if any, there may be of likely methodological differences
between different scientific fields. Our analysis here thus relates to questions of genre, specifically
to the interaction between Content and Purpose within a certain disciplinary Context.
Scientific fields form clear communities of discourse, with a shared textual history in textbooks
and the scientific literature; they are also communities of practice (or closely intertwined sets of
such), in that scientists in a given field undertake similar/related research activities in the pursuit
of communally-defined research goals. For these reasons, we might expect the language used
in articles from one discipline to differ from that in another; our goal is to verify this intuition, and
to see whether analysis of any differences found can shed light on how scientists in different fields
construe their arguments in different ways.
6.1 Scientific methodologies
A key distinction which has recently been identified by historians and philosophers of science [22,
29, 70, 83] is between experimental sciences such as physics, which attempt to formulate general
predictive laws, and so rely heavily on repeatable series of controlled experiments which test
hypotheses, and historical sciences such as geology, which study specific contingent past phenomena in
an attempt to find unifying explanations for effects caused by those phenomena (cf. [69]). Reasoning
in historical sciences thus consists largely of 'synthetic' reconstructive reasoning (retrodiction),
as compared with the more explicitly predictive reasoning from causes to possible effects characteristic
of experimental science [13, 29, 38, 93]. We summarize here results of our earlier studies [8, 2] on
articles in several fields of experimental and historical science, which show how analysis of style
differences can give insights into how scientific rhetoric reflects methodological differences among
the sciences.
The two studies we will discuss analyze a corpus of recent (2003) articles drawn from twelve
peer-reviewed journals in both historical and experimental sciences; the numbers of articles used
from each journal and their average (preprocessed) lengths in words are given in Table 1. The journals are:
Journal of Geology (geology, historical) includes research on the full range of geological principles
including geophysics, geochemistry, sedimentology, geomorphology, petrology, plate tectonics,
volcanology, structural geology, mineralogy, and planetary sciences.
Journal of Metamorphic Geology (geology, historical) focuses on metamorphic studies, from the
scale of individual crystals to that of lithospheric plates.
Biological Journal of the Linnean Society (evolutionary biology, historical) publishes work on or-
ganic evolution in a broad sense, particularly research unifying concepts of evolutionary biol-
ogy with evidence from genetics, systematics, biogeography, or ecology.
Journal of Human Evolution (evolutionary biology, historical) covers all aspects of human evolu-
tion, including both work on human/primate fossils and comparative studies of living species.
Palaeontologia Electronica (paleontology, historical) publishes papers in all branches of paleontol-
ogy as well as related biological or paleontologically-related disciplines.
Quaternary Research (paleontology, historical) publishes research in diverse areas in the earth and
biological sciences which examine the Quaternary period of the Earth's history (from roughly
1.6 million years ago to the present).
(Metamorphism refers to changes in mineral assemblage and texture in rocks that have been subjected
to temperatures and pressures different from those under which they originally formed.)
Physics Letters A (physics, experimental) publishes research in a wide range of areas, including:
condensed matter physics, theoretical physics, nonlinear science, statistical physics, mathe-
matical and computational physics, atomic, molecular and cluster physics, plasma and fluid
physics, optics, biological physics and nanoscience.
Physical Review Letters (physics, experimental) also covers a wide range of physics research, in-
cluding: gravitation and astrophysics, elementary particles and fields, nuclear physics, atomic,
molecular, and optical physics, nonlinear dynamics, fluid dynamics, plasma and beam physics,
and condensed matter physics.
Journal of Physical Chemistry A (physical chemistry, experimental) publishes chemical research at
the level of molecules (including dynamics, spectroscopy, gaseous clusters, molecular beams,
kinetics, atmospheric and environmental physical chemistry, molecular structure, bonding,
quantum chemistry, and general theory).
Journal of Physical Chemistry B (physical chemistry, experimental) publishes res earch on materi-
als (including nanostructures, micelles, macro-molecules, statistical mechanics and thermody-
namics of condensed matter, biophysical chemistry, and general physical chemistry), as well
as studies on the structure and properties of surfaces and interfaces.
Heterocycles (organic chemistry, experimental) publishes research in the areas of organic, pharma-
ceutical, analytical, and medicinal chemistry of heterocyclic compounds.
Tetrahedron (organic chemistry, experimental) publishes general experimental and theoretical
research results in the field of organic chemistry and applications in related disciplines, especially
bio-organic chemistry.
6.2 Functional lexical features
To analyze rhetorically-relevant features of these texts, we use features derived from the theory
of Systemic Functional Grammar (SFG), a functional approach to linguistic analysis [45]. SFG
models the grammar of a language by a network of choices of meanings that can be expressed [68],
and so all lexical and structural choices are represented as the realizations of particular semantic
and contextual meanings. The theory takes a primarily sociological view of language, and has
developed largely in the context of its use by applied linguists for literary/genre analysis and for
studying language learning. (See Butler [20] for an excellent overview of SFG and its relation to
other functional linguistic theories.)

Table 1: Journals used in the studies, with number of articles and average length (in words) per journal.

Journal                                     # Art.   Avg. Len.
Journal of Geology                              93       4891
Journal of Metamorphic Geology                 108       5024
Biological Journal of the Linnean Society      191       4895
Journal of Human Evolution                     169       4223
Palaeontologia Electronica                     111       4132
Quaternary Research                            113       2939
Physics Letters A                              132       2339
Physical Review Letters                        114       2545
Journal of Physical Chemistry A                121       4865
Journal of Physical Chemistry B                 71       5269
Heterocycles                                   231       3580
Tetrahedron                                    151       5057
Briefly put, SFG construes language as a set of interlocking choices for expressing meanings,
with more general choices constraining the possible specific choices. A choice at one level may open
up further choices at other levels, choices that may not otherwise be possible; e.g., English does
not allow one to distinguish between third-person plural references to conscious or non-conscious
individuals—only one pronoun “them” is available. Furthermore, any specific choice of lexical item
or syntactic structure is determined by choices from multiple systems at once, as, e.g., the choice
between “I” and “me” is determined by the independent choice governing the pronoun’s syntactic
role as either a subject or an object.
For our purposes, it suffices to assign attributes to relevant lexical items, where each such
attribute takes on a value from a system taxonomy defined by SFG. For simplicity, we require
that each such taxonomy is a tree, having a unique root, with entry conditions that are only a
single option or a conjunction of options. This simplifies computational issues, though it only
approximates the full SFG representation, which essentially allows system networks to be general
AND/OR graphs; see [68] for a full discussion.
Such taxonomies enable a straightforward 'variational' interpretation of a linear classification
model, in which certain nodes in the taxonomy are features indicating one document class, and other
nodes are features indicating the other document class. The idea is to find all oppositions, where an
opposition is defined as a pair of sibling nodes where one indicates one class and the other indicates
the other class. For example, if CONJUNCTION/Extension is indicative of class A and
CONJUNCTION/Enhancement of class B, we would have the opposition:

Condition       Class A      Class B
CONJUNCTION     Extension    Enhancement
The oppositions given by such analysis give direct information about linguistic differences between
two document classes, in that the two classes have differing preferences about how to express the
parent 'concept'. In the example, Class A prefers to conjoin items by Extension, likely indicating
a higher density of more-or-less independent information units, whereas Class B prefers conjoining
items by Enhancement, indicating a more closely focused structure dealing with a smaller number
of independent information units.
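Given a tree-shaped taxonomy and per-node class indications (e.g., signed weights from a learned linear model), oppositions can be enumerated mechanically. A minimal sketch, with hypothetical data structures (a child-to-parent map and a node-to-weight map):

```python
from collections import defaultdict

def find_oppositions(parent, weight):
    """Return (parent, class-A node, class-B node) triples for sibling pairs
    whose weights indicate opposite document classes."""
    children = defaultdict(list)
    for node, par in parent.items():
        children[par].append(node)
    oppositions = []
    for par, sibs in children.items():
        for i in range(len(sibs)):
            for j in range(i + 1, len(sibs)):
                a, b = sibs[i], sibs[j]
                # Opposite signs mean the siblings indicate different classes.
                if weight.get(a, 0) * weight.get(b, 0) < 0:
                    pos = a if weight[a] > 0 else b
                    neg = b if pos == a else a
                    oppositions.append((par, pos, neg))
    return oppositions

# The CONJUNCTION example from the text: Extension indicates class A
# (positive weight), Enhancement indicates class B (negative weight).
# The numeric weights here are invented for illustration.
parent = {"Elaboration": "CONJUNCTION",
          "Extension": "CONJUNCTION",
          "Enhancement": "CONJUNCTION"}
weight = {"Extension": 1.3, "Enhancement": -0.8}
```

Running `find_oppositions(parent, weight)` recovers the single CONJUNCTION opposition shown above; Elaboration, carrying no weight, participates in none.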
The remainder of this subsection outlines the main system networks which we have used for
analyzing textual style. They are divided into three categories, denoting the general ‘stylistic
goals’ that these textual features relate to: Cohesion, referring to how a text is constructed to
‘hang together’, Assessment, meaning how a text construes propositions as statements of belief,
obligation, or necessity, contextualizing them in the larger discourse, and Appraisal, how the text
adjudges the quality of various objects or events. The relevant taxonomies are only summarized
here due to space considerations; a fuller description can be found in [8].
6.2.1 Cohesion
Cohesion refers to linguistic resources that enable language to connect to its larger context, both tex-
tual and extratextual [43]. Such resources include a wide variety of referential modalities (pronomi-
nal reference, deictic expressions, ellipsis, and more), as well as lexical repetition and variation, and
different ways of linking clauses together. How an author uses these various cohesive resources is an
indication of how the author organizes concepts and relates them to each other. Within cohesion,
we consider here only conjunctions, which are the easiest to deal with computationally; automated
coreference resolution [14, 77], for example, remains a very difficult unsolved problem.
Words and phrases that conjoin clauses (such as ‘and’, ‘while’, and ‘in other words’) are or-
ganized in SFG in the CONJUNCTION system network. Types of CONJUNCTION serve to
link a clause with its textual context, by denoting how the given clause expands on some aspect
of its preceding context [68, p. 519–528]. The three top-level options of CONJUNCTION are
Elaboration, Extension, and Enhancement, defined as:
Elaboration: Deepening the content in its context by exemplification or refocusing (for ex-
ample, in other words, i.e.);
Extension: Adding new related information, perhaps contrasting with the current information
(and, furthermore, on the other hand);
Enhancement: Qualifying the context by circumstance or logical connection (and then, be-
cause, similarly).
6.2.2 Assessment
Generally speaking, assessment may be defined as “contextual qualification of the epistemic or
rhetorical status of events or propositions represented in a text”. Examples include assessment of
the likelihood of a proposition, the typicality of an event, the desirability of some fact, or its scope
of validity. Two important systems in SFG that address assessment are MODALITY, enabling
expression of typicality and necessity of some fact or event, and COMMENT, enabling assessment
of the writer’s stance with respect to an assertion in the text.
The system of MODALITY enables one to qualify events or entities in the text according
to their likelihood, typicality, or necessity. Syntactically, MODALITY may be realized in a text
through a modal verb (e.g., 'can', 'might', 'should', 'must'), an adverbial adjunct (e.g., 'probably',
'preferably'), or use of a projective clause (e.g., "I think that...", "It is necessary that..."). Each
word or phrase expressing MODALITY has a value for each of four attributes:
Type: What kind of modality is being expressed?
    Modalization: How 'typical' is it? (probably, seldom)
    Modulation: How 'necessary' is it? (ought to, allowable)

Value: What degree of the relevant modality scale is being averred?
    Median: The 'normal' amount. (likely, usually)
    Outer: An extreme (either high or low) amount. (maybe, always)

Orientation: Relation of the modality expressed to the speaker/writer.
    Objective: Modality expressed irrespective of the speaker/writer. (maybe, always)
    Subjective: Modality expressed relative to the speaker/writer. (We think..., I require...)

Manifestation: How is the modal assessment related to the event being assessed?
    Implicit: Modality realized 'in-line' by an adjunct or modal auxiliary. (preferably...)
    Explicit: Modality realized by a projective verb, with the nested clause being assessed.
    (It is preferable..., It is possible...)
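In practice, such attribute values can be assigned to lexical items by lexicon lookup. The miniature lexicon below is purely illustrative; the attribute values attached to these three words are assumptions made for the example, not entries from the project's actual lexicon:

```python
from collections import Counter

# Hypothetical miniature MODALITY lexicon; a real lexicon covers far more items.
MODALITY_LEXICON = {
    "probably": {"Type": "Modalization", "Value": "Median",
                 "Orientation": "Objective", "Manifestation": "Implicit"},
    "maybe":    {"Type": "Modalization", "Value": "Outer",
                 "Orientation": "Objective", "Manifestation": "Implicit"},
    "must":     {"Type": "Modulation", "Value": "Outer",
                 "Orientation": "Objective", "Manifestation": "Implicit"},
}

def modality_profile(tokens):
    """Count occurrences of each (attribute, value) pair over a token stream."""
    profile = Counter()
    for t in tokens:
        for attr, val in MODALITY_LEXICON.get(t.lower(), {}).items():
            profile[(attr, val)] += 1
    return profile
```

The resulting (attribute, value) counts, normalized by document length, serve as classification features in the same way as the function-word frequencies above.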
The system of COMMENT provides a resource for the writer to ‘comment’ on the status of
a message with respect to textual and interactive context in a discourse. Comments are usually
realized as adjuncts in a clause and may appear initially, medially, or finally. We use the eight
categories of COMMENT listed by Matthiessen [68]: Admissive, message is an admission (e.g., 'we
concur...'), Assertive, emphasis of reliability (e.g., 'Certainly...'), Desiderative, desirability of the
content (e.g., 'Unfortunately...'), Evaluative, judgment of the actors involved (e.g., 'Sensibly...'),
Predictive, coherence with predictions (e.g., 'As expected...'), Presumptive, dependence on other
assumptions (e.g., 'I suppose...'), Tentative, assessing the message as tentative (e.g., 'Tentatively...'),
and Validative, assessing scope of validity (e.g., 'In general...').
6.2.3 Appraisal
Finally, appraisal denotes how language is used to adopt or express an attitude of some kind
towards some target [64]. For example, in “I found the movie quite monotonous”, the speaker
adopts a negative Attitude (“monotonous”) towards “the movie” (the appraised object). Note that
attitudes come in different types; for example, ‘monotonous’ describes an inherent quality of the
appraised object, while ‘loathed’ would describe an emotional reaction of the writer. The overall
type and orientation of appraisal expressed in the text about an object gives a picture of how
the writer wishes the reader to view it (modulo sarcasm, of course). To date, we have developed
a lexicon for appraisal adjectives as well as relevant modifiers (such as ‘very’ or ‘sort of’). The
two main attributes of appraisal, as used in this work, are Attitude, giving the kind of appraisal
being expressed, and Orientation, giving whether the appraisal is positive (good, beautiful, nice) or
negative (bad, ugly, evil). (There are also other attributes of appraisal, discussed in the Appendix.)
The three main types of Attitude are: affect, relating to the speaker/writer's emotional state (e.g.,
‘happy’, ‘sad’), appreciation, expressing evaluation of supposed intrinsic qualities of an object (e.g.,
‘tall’, ‘complex’), and judgment, expressing social evaluation (e.g., ‘brave’, ‘cowardly’).
6.3 Experimental and historical science
The first study we will discuss compares writing styles between experimental and historical science
journals [2]. We first considered if a difference between the types of science could be identified,
examining the 10-fold cross-validation accuracy of models built by an SVM (SMO with a linear
kernel [80]) for classifying articles as "experimental" or "historical", using SFG features of
EXPANSION, COMMENT, and MODALITY over the entire corpus of articles from 12 journals. Average
accuracy was 81.6%. To calibrate results, we then ran the same discrimination test for all 462
different partitions of the twelve journals into two groups of six journals each. This gave a mean
accuracy of only 65.4% (range 55.4% to 81.6%), indicating that the division into experimental and
historical sciences is well supported by differences in style between writing in different journals.
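The 462 calibration partitions arise from counting unordered splits of twelve journals into two groups of six: C(12,6)/2 = 924/2 = 462. A sketch of the enumeration (the per-partition classification runs themselves are omitted):

```python
from itertools import combinations

def two_way_partitions(items):
    """All unordered splits of `items` into two equal halves."""
    items = list(items)
    half = len(items) // 2
    first = items[0]  # fix one item in the first half to avoid double counting
    for combo in combinations(items[1:], half - 1):
        group_a = [first, *combo]
        group_b = [x for x in items if x not in group_a]
        yield group_a, group_b

journals = list(range(12))  # placeholder labels for the twelve journals
splits = list(two_way_partitions(journals))
print(len(splits))  # 462
```

Running the same discrimination test over every such split, and comparing the experimental/historical split's accuracy against the resulting distribution, is what supports the claim that this particular division is stylistically privileged.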
Given this result, we can now consider what consistent pattern of distinguishing features, if any,
emerges; that is, which features consistently indicate either historical or experimental science
articles. To do this, we ran SMO with a linear kernel on all training data for each pair of
a historical science journal with an experimental science journal (36 pairs in all), and ranked the
features by their weight for one or the other journal in the weight vector computed by SMO. We
summarize here the main results; for more detail, see [2].
First, in the system of Expansion, we see an opposition between Extension, which is an indica-
tor for historical science articles, and Enhancement, an indicator for experimental science articles.
This implies that historical science articles generally have a higher density of separate informa-
tional items, whereas experimental science articles tend to have fewer discrete information items,
though the information items they do have may have their meaning deepened or qualified by
informationally-related clauses. This may reflect differing principles of rhetorical organization:
experimental scientists preferring a single coherent 'story line' focused on enhancements of a small
number of focal propositions, and historical scientists preferring a multifocal 'landscape' of
connected propositions. This supports the hypothesized contrast between the contextual examination
of various and highly unique entities by historical science and a more universalist, hence narrowly
focused, examination of generic entities by experimental science.
Such methodological distinctions between kinds of science are further supported by preferences
for types of Comment. Validative and Admissive Comments are indicators
for historical science articles, compared to a very strong and consistent indication of Predictive
Comments for experimental science articles. The latter result is a clear consequence of the experimental
scientist's focus on predictive accuracy. Historical science, on the other hand, evinces a rhetorical
need (via Validative Comments) to explicitly delineate the scope of validity of different assertions,
likely as a consequence of synthetic thinking [13] about complex and ambiguous webs of past
causation [22]. An Admissive comment marks a clause as the opinion (perhaps strongly held) of the
author; this too appears indicative of a more hedged and explicitly comparative style of argumentation.
Finally, we may consider some aspects of variation in expressions of Modality between the
two classes of articles. The primary opposition is in modality Type. Experimental science writing
has a preference for using Modulation (assessing what ‘must’ or ‘is able’ to happen), which is
consistent with a focus on prediction and manipulation of nature. Concurrently, historical science
writing shows a preference for Modalization (assessing ‘likelihood’ or ‘usuality’), consistent with the
outlook of an observer who usually cannot directly manipulate or replicate outcomes, and who
therefore (i) cannot make unqualified statements of what must (or must not) happen, and (ii) uses
the method of "multiple working hypotheses".
These results show how variations in language use between articles from different disciplines can
be directly linked with the particular modes of reasoning posited by philosophers for these different
kinds of science. Stylistic text analysis thus can lend some empirical weight to the argument for a
multiplicity of methods in science, rather than a single monolithic "scientific method".
6.4 Geology and paleontology
In another study [8], we consider if stylistic differences between articles in geological and paleon-
tological journals may be found, and if so, what they may mean. As above, we applied the SMO
system to learn classification models and measured accuracy by 10-fold cross-validation. We found
that using Conjunction, Modality, and Assessment features resulted in low classification accuracies
(all below 68%), while Appraisal features gave a higher 77.4%. Using function words together with
all the systemic feature types gave the highest accuracy of 87.5%, higher than using just function
words at 84.9% (p < 0.05). Together, these results indicate that while paleontology and geology
have similar preferences for rhetorical structure (measured by Conjunction) and epistemic
commitments (measured by Assessment), there are definite stylistic differences, a good portion of which
relate to the use of evaluative language.

Table 2: Oppositions from the twenty highest-ranked systemic features in geology and paleontology
articles, from the model learned using function words plus all systemic features.

Condition                      Geology            Paleontology
ORIENTATION                    Positive           Negative
JUDGEMENT/SocialEsteem         ORIENT/Positive    ORIENT/Negative
JUDGEMENT/SocialSanction       ORIENT/Positive    ORIENT/Negative
ATTITUDE/Judgement             ORIENT/Positive    ORIENT/Negative
ATTITUDE/Affect                ORIENT/Positive    ORIENT/Negative
APPRECIATION                   ReactionQuality    CompositionComplexity
                               ReactionImpact     CompositionBalance
SOCIALSANCTION                 Propriety          Veracity
COMMENT                        Assertive          Validative
                               Desiderative       Presumptive
ENHANCEMENT                    SpatioTemporal     CausalConditional
To understand better the differences between geological and paleontological language, we next
consider oppositions among the top twenty systemic features from the model constructed for the two
article classes using the full feature set, shown in Table 2. Appraisal is the most important, yielding
the largest boost in classification power, as noted above, and accordingly generating many highly
ranked oppositions. ORIENTATION is most important overall—geologists appear to prefer Positive
appraisal, while paleontologists prefer Negative. This opposition is also seen within JUDGEMENT
and ATTITUDE. Such appraisal often appears when describing the results of the current, or of
previous, research. Geology appears to prefer positive appraisal, stressing the c ooperative and
incremental nature of the research enterprise, as in, e.g., “...collectively and consistently point
to a single conclusion...” On the other hand, paleontology tends to prefer negative orientation,
seeming to stress inadequacies of the evidence or of previous work, as, for example, in, “...records
are unfortunately much more fragmented...”. As well, we see cases where a researcher will discredit
previous work based on new evidence, as in "...the approach taken is fundamentally flawed." It
seems that, in a sense, geologists more often express positive views of previous work as they often
apparently view their work as adding to it, while paleontologists are more often negative, seeing
themselves as replacing old ‘truths’ with new ones.
Next, oppositions in APPRECIATION indicate a distinction between a geological focus on
Reaction (i.e., the effect of the object on an observer) and a paleontological focus on Composition
(i.e., qualities of how the object is put together). This may indicate that paleontologists are more
concerned with analyzing configurations of complex, multi-part entities (fossils of various sorts),
whereas geologists tend somewhat towards more qualitative evaluations of specimens.
A similar distinction is seen in SOCIALSANCTION and in COMMENT. In SOCIALSANC-
TION, we see geologists more concerned with Propriety, i.e., how a methodology or a piece of
evidence may fit with others, whereas paleontologists are more concerned with Veracity, in terms
of how reliable particular methods or bits of evidence are on their own.
Similarly, we see two COMMENT types descriptive of geological prose: Assertive COMMENTs
(e.g., "There is surely more to it..."), and Desiderative COMMENTs (e.g., "In doing so, we hope to
deduce..."), which is consistent with the apparent focus of geologists on 'fitting in' noted above.
Paleontologists, on the other hand, tend more to use Validative COMMENTs, expanding or contracting
the scope of validity of a claim (e.g., "Holocene shells generally lack..."), and Presumptive
COMMENTs, evaluating new claims in light of general background knowledge (e.g., "...which apparently
are linked with...").
Finally, the single opposition we find within the CONJUNCTION system is in ENHANCE-
MENT, where geology prefers SpatioTemporal, while paleontology prefers CausalConditional. Ge-
ological uses of SpatioTemporal conjunction tend to describe rock configurations and development
over time, as well as discourse organization (in a sort of descriptive ‘story-line’). Paleontologists,
however, are more often explicit about hypothesized causal links between specimens and historical
events (as in, e.g., "...perhaps as a result of migration...").
7 Discussion
In this chapter we have presented a view of style as a semantics of text that points to various
non-propositional aspects of the communicative act that the text realizes. This view subsumes previous
work in computational stylistics, such as authorship attribution and genre analysis, as well as
suggesting possible new avenues for research, such as automatic detection of rhetorical purpose,
or identification of ideological facets, as encoded in a given text. How to properly deal with the
inevitable interdependencies among textual realizations of the various aspects of the communicative
act remains the central open research question.
Even so, isolation of certain aspects for investigation can yield insight, as the results of the
several studies described here show. Discrimination analysis of controlled corpora with properly
chosen feature sets enables semantic/pragmatic characterization of stylistic differences between
groups of text that can be linked back to aspectual differences in the relevant communicative acts.
Our schema of the communicative act may also prove fruitful for organizing stylistic analyses in
expressive media other than text. That characteristics of the Author and Audience influence
style is clear in all media, though the specific expressive correlates and relevant characteristics
may vary. Harold Cohen, in his chapter in this volume [23], has forcefully made the point that
understanding an artist's choice of Medium for a particular artistic Purpose is essential to a true
understanding of artistic style. Similarly, aspects of musical style can be best understood by relating
them to Purposive goals of eliciting emotional responses [31] or serving particular social functions
(accompanying dances or the like). Context is also relevant—in all media, the artistic/expressive
history and background, the analogue of the intertext, is certainly a major influence on style; social
ideology is an important influence; and more-or-less elaborated medium-specific ontologies may also
be relevant. We suggest therefore that this schema, or something like it, could serve as a framework
to integrate, to some extent, work on stylistic analysis in disparate media.
References

[1] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. Spyropoulos. An evaluation
of Naive Bayesian anti-spam filtering. In Proc. of the Workshop on Machine Learning in
the New Information Age, 2000.
[2] S. Argamon, J. Dodick, and P. Chase. Language use reflects scientific methodology: A corpus-
based study of peer-reviewed journal articles. Scientometrics, 2007. to appear.
[3] S. Argamon, M. Koppel, and G. Avneri. Routing documents according to style. In Proc. Int’l
Workshop on Innovative Internet Information Systems, Pisa, Italy, 1998.
[4] S. Argamon, M. Koppel, J. Fine, and A. R. Shimony. Gender, genre, and writing style in
formal written texts. Text, 23(3), 2003.
[5] S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler. Age- and sex-linked language
variation in the blogosphere. Journal of Computer Mediated Communication, 2007. (submitted).
[6] S. Argamon and M. Olsen. Toward meaningful computing. Communications of the ACM,
49(4):33–35, 2006.
[7] S. Argamon, M. Šarić, and S. S. Stein. Style mining of electronic messages for multiple author
discrimination. In Proc. ACM Conference on Knowledge Discovery and Data Mining, 2003.
[8] S. Argamon, C. Whitelaw, P. Chase, S. Dhawle, N. Garg, S. R. Hota, and S. Levitan. Stylis-
tic text classification using functional lexical features. Journal of the American Society of
Information Science, 2007. (in press).
[9] S. Argamon and S. Levitan. Measuring the usefulness of function words for authorship
attribution. In Proceedings of the 2005 ACH/ALLC Conference, Victoria, BC, Canada,
June 2005.
[10] S. Argamon-Engelson, M. Koppel, and G. Avneri. Style-based text categorization: What
newspaper am I reading? In Proc. of AAAI Workshop on Learning for Text Categorization,
pages 1–4, 1998.
[11] J. L. Austin. How to Do Things with Words. Oxford University Press, Oxford, 1976.
[12] R. H. Baayen, H. van Halteren, and F. Tweedie. Outside the cave of shadows: Using
syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing,
7:91–109, 1996.
[13] V. R. Baker. The pragmatic roots of American Quaternary geology and geomorphology.
Geomorphology, 16:197–215, 1996.
[14] D. Bean and E. Riloff. Unsupervised learning of contextual role knowledge for coreference
resolution. Proc. of HLT/NAACL, pages 297–304, 2004.
[15] Y. L. Ben-David. Shevet mi-Yehudah (in Hebrew). (no publisher listed), Jerusalem, 2002.
[16] M. J. Berry and G. Linoff. Data Mining Techniques: For Marketing, Sales, and
Customer Support. John Wiley & Sons, New York, NY, USA, 1997.
[17] D. Biber. Dimensions of register variation: A cross-linguistic comparison. Cambridge Univer-
sity Press, Cambridge, 1995.
[18] J. Burrows. ‘Delta’: A measure of stylistic difference and a guide to likely authorship. Literary
and Linguistic Computing, 17(3):267–287, 2002.
[19] John F. Burrows. Computation into Criticism: A Study of Jane Austen’s Novels and an
Experiment in Method. Clarendon Press, Oxford, 1987.
[20] C.S. Butler. Structure and Function: A Guide to Three Major Structural-Functional Theories.
John Benjamins, 2003.
[21] Carole E. Chaski. Linguistic authentication and reliability. In National Conference on Science
and the Law, San Diego, California, 1999. National Institute of Justice.
[22] C. E. Cleland. Methodological and epistemic differences between historical science and exper-
imental science. Philosophy of Science, 2002.
[23] Harold Cohen. Style as emergence (from what?). In S. Argamon, K. Burns, and S. Dubnov,
editors, The Structure of Style: Algorithmic Approaches to Manner and Meaning, pages X–
XX. Springer, 2007.
[24] Ido Dagan, Yael Karov, and Dan Roth. Mistake-driven learning in text categorization. In
Claire Cardie and Ralph Weischedel, editors, Proceedings of EMNLP-97, 2nd Conference on
Empirical Methods in Natural Language Processing, pages 55–63, Providence, US, 1997. As-
sociation for Computational Linguistics, Morristown, US.
[25] R.G. D'Andrade. The Development of Cognitive Anthropology. Cambridge University Press, 1995.
[26] Roger Dannenberg. Synthesizing musical style. In S. Argamon, K. Burns, and S. Dubnov,
editors, The Structure of Style: Algorithmic Approaches to Manner and Meaning, pages X–
XX. Springer, 2007.
[27] O. de Vel. Mining e-mail authorship. In Workshop on Text Mining, ACM International
Conference on Knowledge Discovery and Data Mining, Boston, MA, 2000.
[28] O. de Vel, M. Corney, A. Anderson, and G. Mohay. Language and gender author cohort analysis
of e-mail for computer forensics. In Proc. Digital Forensic Research Workshop, Syracuse, NY, 2002.
[29] J. Diamond. Guns, Germs and Steel: The Fates of Human Societies. W.W. Norton, New
York, 2002.
[30] M. Dimitrova, A. Finn, N. Kushmerick, and B. Smyth. Web genre visualization. In Proc.
Conference on Human Factors in Computing Systems, 2002.
[31] Shlomo Dubnov. Perception of music as an information processing experience. In S. Argamon,
K. Burns, and S. Dubnov, editors, The Structure of Style: Algorithmic Approaches to Manner
and Meaning, pages X–XX. Springer, 2007.
[32] R.P. Fawcett. Cognitive Linguistics and Social Interaction: Towards an Integrated Model of a
Systemic Functional Grammar and the Other Components of a Communicating Mind. Groos, 1980.
[33] Aidan Finn, Nicholas Kushmerick, and Barry Smyth. Genre classification and domain transfer
for information filtering. In Fabio Crestani, Mark Girolami, and Cornelis J. van Rijsbergen, ed-
itors, Proceedings of ECIR-02, 24th European Colloquium on Information Retrieval Research,
Glasgow, UK, 2002. Springer Verlag, Heidelberg, DE.
[34] J.R. Firth. A synopsis of linguistic theory 1930-1955. In F.R. Palmer, editor, Selected Papers
of J.R. Firth 1952-1959. Longman, 1968.
[35] G. Forman. An extensive empirical study of feature selection metrics for text classification.
Journal of Machine Learning Research, 3(7-8):1289–1305, 2003.
[36] A. Genkin, D.D. Lewis, and D. Madigan. Large-scale bayesian logistic regression for text
categorization. Technometrics, 2006.
[37] R.L. Gorsuch. Factor Analysis. L. Erlbaum Associates, 1983.
[38] S. J. Gould. Evolution and the triumph of homology, or, why history matters. American
Scientist, Jan.-Feb.:60–69, 1986.
[39] Neil Graham and Graeme Hirst. Segmenting a document by stylistic character. In Workshop
on Computational Approaches to Style Analysis and Synthesis, 18th International Joint
Conference on Artificial Intelligence, Acapulco, 2003.
[40] M. Gregory. Aspects of varieties differentiation. Journal of Linguistics, 3:177–198, 1967.
[41] J.J. Gumperz and S.C. Levinson. Rethinking Linguistic Relativity. Cambridge University
Press, 1996.
[42] I. Hacking. Historical ontology. Harvard University Press Cambridge, Mass, 2002.
[43] M. A. K. Halliday and R. Hasan. Cohesion in English. Longman, London, 1976.
[44] M.A.K. Halliday. Language as Social Semiotic: The Social Interpretation of Language and Meaning.
Edward Arnold, 1978.
[45] Michael A. K. Halliday. Introduction to Functional Grammar. Edward Arnold, second edition,
1994.
[46] J. Harris. The Idea of Community in the Study of Writing. College Composition and
Communication, 40(1):11–22, 1989.
[47] D. I. Holmes. The evolution of stylometry in humanities scholarship. Literary and Linguistic
Computing, 13(3):111–117, 1998.
[48] J. Holmes and M. Meyerhoff. The community of practice: Theories and methodologies
in language and gender research. Language in Society, 28(2):173–183, 2000.
[49] David Hoover. Frequent word sequences and statistical stylistics. Literary and Linguistic
Computing, 17:157–180, 2002.
[50] T. Joachims. Text categorization with support vector machines: Learning with many relevant
features. In Machine Learning: ECML-98, Tenth European Conference on Machine Learning,
pages 137–142, 1998.
[51] T. Joachims. Making large-scale SVM learning practical. In B. Scolkopf, C. Burges, and
A. Smola, editors, Advances in Kernel Methods—Support Vector Learning. MIT Press, 1999.
[52] J. Karlgren. Stylistic Experiments for Information Retrieval. PhD thesis, SICS, 2000.
[53] Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. Automatic detection of text genre. In
Philip R. Cohen and Wolfgang Wahlster, editors, Proceedings of the Thirty-Fifth Annual
Meeting of the Association for Computational Linguistics and Eighth Conference of the
European Chapter of the Association for Computational Linguistics, pages 32–38, Somerset,
New Jersey, 1997. Association for Computational Linguistics.
[54] B. Kjell and O. Frieder. Visualization of literary style. In IEEE International Conference on
Systems, Man and Cybernetics, pages 656–661, Chicago, 1992.
[55] M. Koppel, S. Argamon, and A. R. Shimoni. Automatically categorizing written texts by
author gender. Literary and Linguistic Computing, 17(4), 2003.
[56] M. Koppel, D. Mughaz, and J. Schler. Text categorization for authorship verification. In
8th Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL, 2004.
[57] M. Koppel and J. Schler. Authorship verification as a one-class classification problem. In Proc.
of Int’l Conf. on Machine Learning, Banff, Canada, 2004.
[58] Moshe Koppel, Navot Akiva, and Ido Dagan. A corpus-independent feature set for style-
based text categorization. In Workshop on Computational Approaches to Style Analysis and
Synthesis, 18th International Joint Conference on Artificial Intelligence, Acapulco, 2003.
[59] Nicholas Kushmerick. Learning to remove internet advertisement. In Oren Etzioni, Jörg P.
Müller, and Jeffrey M. Bradshaw, editors, Proceedings of the Third International Conference
on Autonomous Agents (Agents’99), pages 175–181, Seattle, WA, USA, 1999. ACM Press.
[60] Ken Lang. NewsWeeder: learning to filter netnews. In Proceedings of the 12th International
Conference on Machine Learning, pages 331–339. Morgan Kaufmann publishers Inc.: San
Mateo, CA, USA, 1995.
[61] D.D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval.
Proceedings of ECML-98, 10th European Conference on Machine Learning, 1398:4–15, 1998.
[62] N. Littlestone. Learning when irrelevant attributes abound. In Proceedings of the
Twenty-Eighth Annual Symposium on Foundations of Computer Science, pages 68–77, Oc-
tober 1987.
[63] J. R. Martin. English Text: System and Structure. Benjamin’s, Amsterdam, 1992.
[64] J. R. Martin and P. R. R. White. The Language of Evaluation: Appraisal in English. Palgrave,
London, 2005.
[65] C. Mascol. Curves of Pauline and Pseudo-Pauline Style I. Unitarian Review, 30:452–460, 1888.
[66] C. Mascol. Curves of Pauline and Pseudo-Pauline Style II. Unitarian Review, 30:539–546, 1888.
[67] R. A. J. Matthews and T. V. N. Merriam. Distinguishing literary styles using neural networks,
chapter 8. IOP publishing and Oxford University Press, 1997.
[68] Christian Matthiessen. Lexico-grammatical cartography: English systems. International Lan-
guage Sciences Publishers, 1995.
[69] E. Mayr. Evolution and the Diversity of Life. Harvard University Press, Cambridge, 1976.
[70] E. Mayr. How biology differs from the physical sciences. In Evolution at the Crossroads: The
New Biology and the New Philosophy of Science, pages 43–46. MIT Press, Cambridge, 1985.
[71] A. McCallum and K. Nigam. A comparison of event models for naive bayes text classification.
AAAI-98 Workshop on Learning for Text Categorization, 752, 1998.
[72] A. McEnery and M. Oakes. Authorship studies/textual statistics, pages 234–248. Marcel
Dekker, 2000.
[73] Vicki McKinney, Kanghyun Yoon, and Fatemeh "Mariam" Zahedi. The measurement of
web-customer satisfaction: An expectation and disconfirmation approach. Info. Sys. Research,
13(3):296–315, 2002.
[74] G. McMenamin. Forensic Linguistics: Advances in Forensic Stylistics. CRC Press, 2002.
[75] T.C. Mendenhall. Characteristic Curves of Composition. Moore & Langen, printers and
binders, 1887.
[76] F. Mosteller and D. L. Wallace. Inference and Disputed Authorship: The Federalist. Addison-
Wesley Series in Behavioral Science: Quantitative Methods. Addison-Wesley, Massachusetts, 1964.
[77] V. Ng. Learning noun phrase anaphoricity to improve coreference resolution: Issues in repre-
sentation and optimization. Proceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics (ACL), pages 152–159, 2004.
[78] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification
using machine learning techniques. In Proceedings of EMNLP, 2002.
[79] Jon Patrick. The ScamSeek project: Text mining for financial scams on the internet. In S.J.
Simoff and G.J. Williams, editors, Proc. 3rd Australasian Data Mining Conference, pages 33–38,
Cairns, 2004.
[80] J. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector
Machines. Microsoft Research Technical Report MSR-TR-98-14, Redmond, WA, 1998.
[81] Ehud Reiter. Generating texts in different styles. In S. Argamon, K. Burns, and S. Dubnov,
editors, The Structure of Style: Algorithmic Approaches to Manner and Meaning, pages X–
XX. Springer, 2007.
[82] J. Rudman. The state of authorship attribution studies: Some problems and solutions.
Computers and the Humanities, 31(4):351–365, 1997.
[83] J. L. Rudolph and J. Stewart. Evolution and the nature of science: On the historical discord
and its implication for education. Journal of Research in Science Teaching, 35:1069–1089, 1998.
[84] J.R. Searle. Expression and Meaning: Studies in the Theory of Speech Acts. Cambridge
University Press, 1989.
[85] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing
Surveys, 34(1), 2002.
[86] Efstathios Stamatatos, Nikos Fakotakis, and George K. Kokkinakis. Automatic text catego-
rization in terms of genre, author. Computational Linguistics, 26(4):471–495, 2000.
[87] George Stiny. Whose style is it? In S. Argamon, K. Burns, and S. Dubnov, editors, The
Structure of Style: Algorithmic Approaches to Manner and Meaning, pages X–XX. Springer,
2007.
[88] John M. Swales. Genre Analysis. Cambridge University Press, 1990.
[89] V.I. Torvik, M. Weeber, D.R. Swanson, and N.R. Smalheiser. A probabilistic similarity metric
for Medline records: A model for author name disambiguation. Journal of the American
Society for Information Science and Technology, 56(2):140–158, 2005.
[90] Peter D. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised
classification of reviews. In Proceedings 40th Annual Meeting of the ACL (ACL’02), pages
417–424, Philadelphia, Pennsylvania, 2002.
[91] F. Tweedie, S. Singh, and D. Holmes. Neural network applications in stylometry: The Feder-
alist Papers. Computers and the Humanities, 30(1):1–10, 1996.
[92] E. Wenger. Communities of Practice: Learning, Meaning, and Identity. Cambridge University
Press, 1999.
[93] W. Whewell. History of the Inductive Sciences. John W. Parker, London, 1837.
[94] Y. Yang. An Evaluation of Statistical Approaches to Text Categorization. Information
Retrieval, 1(1):69–90, 1999.
[95] Y. Yang and J.O. Pedersen. A comparative study on feature selection in text categorization.
In Proceedings of the Fourteenth International Conference on Machine Learning, pages 412–420,
1997.
[96] G. U. Yule. Statistical Study of Literary Vocabulary. Cambridge University Press, 1944.
[97] G.U. Yule. On sentence length as a statistical characteristic of style in prose with application
to two cases of disputed authorship. Biometrika, 30:363–390, 1938.