Conference Paper

Shallow Text Analysis and Machine Learning for Authorship Attribution.

Conference: Computational Linguistics in the Netherlands 2004, Selected Papers from the Fifteenth CLIN Meeting, December 17, Leiden Centre for Linguistics
Source: DBLP

ABSTRACT Current advances in shallow parsing and machine learning allow us to use results from these fields in a methodology for Authorship Attribution. We report on experiments with a corpus that consists of newspaper articles about national current affairs by different journalists from the Belgian newspaper De Standaard. Because the documents are in a similar genre, register, and range of topics, token-based (e.g., sentence length) and lexical features (e.g., vocabulary richness) can be kept roughly constant over the different authors. This allows us to focus on the use of syntax-based features as possible predictors for an author's style, as well as on those token-based features that are predictive of author style more than of topic or register. These style characteristics are not under the author's conscious control and are therefore good clues for Authorship Attribution. Machine Learning methods (TiMBL and the WEKA software package) are used to select informative combinations of syntactic, token-based and lexical features and to predict authorship of unseen documents. The combination of these features can be considered an implicit profile that characterizes the style of an author.
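As a rough illustration of the kind of features the abstract names, the sketch below computes one token-based feature (mean sentence length) and one lexical feature (type-token ratio, used here as a simple stand-in for vocabulary richness) and attributes a text to the author of the nearest training document. It is not the authors' implementation: the function names, the naive regex tokenization, and the 1-nearest-neighbour rule (loosely in the spirit of memory-based learners such as TiMBL) are all assumptions, and the syntax-based features that the paper focuses on are omitted because they would require a shallow parser.

```python
import math
import re


def style_features(text):
    """Return (mean sentence length in tokens, type-token ratio).

    Hypothetical helper: sentence splitting and tokenization are naive
    regex approximations, not the preprocessing used in the paper.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    if not sentences or not tokens:
        return (0.0, 0.0)
    mean_sentence_length = len(tokens) / len(sentences)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return (mean_sentence_length, type_token_ratio)


def nearest_author(train, text):
    """Return the author of the training document closest in feature space.

    `train` is a list of (author, document) pairs. Features are left
    unscaled here, so sentence length dominates the Euclidean distance;
    a real experiment would normalize or weight the features.
    """
    target = style_features(text)
    best_author, _ = min(
        ((author, math.dist(style_features(doc), target)) for author, doc in train),
        key=lambda pair: pair[1],
    )
    return best_author


if __name__ == "__main__":
    # Toy data invented for this sketch, not taken from the corpus.
    train = [
        ("A", "Short one. Tiny bit. Very brief."),
        ("B", "This is a considerably longer sentence that keeps going on "
              "and on without stopping for quite a while."),
    ]
    print(nearest_author(train, "Small words. Few here."))
```

In practice the feature vector would be much larger and the informative feature combinations would be selected by the learner rather than fixed by hand, as the abstract describes.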

  • ABSTRACT: In this paper we will focus on the scribal variation in manually copied medieval texts. Using a lazy machine learning technique, we will argue that it is possible to discriminate between scribes, implying that they did adapt texts when copying them. Consequently, we will assess to what extent scribal interventions compromise our ability to detect the original authorship of medieval texts. It will be shown that, if the right features and weighting methods are used, the automated discrimination of both copyists and authors is possible for medieval texts. The case studies presented suggest that scribes only corrupted the original texts in a shallow and superficial way, leaving authorial features generally intact on deeper levels. This result will be of interest for research into e.g. contemporary newspaper articles when trying to detect editorial interventions.
  • ABSTRACT: Style is an integral part of natural language in written, spoken or machine-generated forms. Humans have been dealing with style in language since the beginnings of language itself, but computers and machine processes have only recently begun to process natural language styles. Automatic processing of styles poses two interrelated challenges: classification and transformation. There have been recent advances in corpus classification, automatic clustering and authorship attribution along many dimensions, but little work related directly to writing styles and even less on transformation. In this paper we examine relevant literature to define and operationalize a notion of "style" which we employ to designate style markers usable in classification machines. A measurable reading of these markers also helps guide style transformation algorithms. We demonstrate the concept by showing a detectable stylistic shift in a sample piece of text relative to a target corpus. We present ongoing work in building a comprehensive style recognition and transformation system and discuss our results.
  • ABSTRACT: In this paper we will stress-test a recently proposed technique for computational authorship verification, "unmasking", which has been well received in the literature. The technique envisages an experimental set-up commonly referred to as "authorship verification", a task generally deemed more difficult than so-called "authorship attribution". We will apply the technique to authorship verification across genres, an extremely complex text categorization problem that so far has remained unexplored. We focus on five representative contemporary English-language authors. For each of them, the corpus under scrutiny contains several texts in two genres (literary prose and theatre plays). Our research confirms that unmasking is an interesting technique for computational authorship verification, especially yielding reliable results within the genre of (larger) prose works in our corpus. Authorship verification, however, proves much more difficult in the theatrical part of the corpus.
    English Studies 05/2012; DOI:10.1080/0013838X.2012.668793


Available from
May 30, 2014