Article

Ant Colony Optimisation for Stylometry: The Federalist Papers


Abstract

This paper describes the use of Ant Colony Optimisation for the classification of works of disputed authorship, in this case the Federalist Papers. Classification accuracy was 79.1%, which compares reasonably well with previous work on the same data set using neural networks and genetic algorithms. Although statistical approaches have performed much better than this, the advantage of a rule-based approach is the ability to produce readily intelligible criteria for the classification decisions made.
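The paper itself is not reproduced here, but the abstract emphasises that a rule-based approach yields readily intelligible classification criteria. A minimal sketch of what such an if-then rule list over function-word rates could look like (the marker words and thresholds below are invented for illustration and are not the rules reported in the paper):

```python
# Hypothetical illustration of an intelligible rule list for the Federalist
# problem; the marker words and thresholds are invented, not those of the paper.

def word_rate(text, word):
    """Occurrences of `word` per 1000 tokens."""
    tokens = text.lower().split()
    return 1000.0 * tokens.count(word) / max(len(tokens), 1)

def classify(text):
    # Each rule is readable on its own: "IF rate('upon') > 2 THEN Hamilton".
    if word_rate(text, "upon") > 2.0:
        return "Hamilton"
    if word_rate(text, "whilst") > 0.2:
        return "Madison"
    return "Madison"   # default class

print(classify("upon the whole , and upon reflection , the point stands"))
```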


... Modifications or extensions: using a pseudo-random proportional transition rule with parameter q0 [12, 52, 80] (Chen, Chen & He, 2006; Liu, Abbas & McKay, 2004; Wang & Feng, 2004). Open questions: this transition rule has the advantage of giving the user explicit control over the exploitation versus exploration trade-off, but it requires choosing a good value for the parameter q0 empirically. ...
... Modifications or extensions: a self-adaptive parameter; different procedures to update the pheromone as a function of rule quality [52, 59, 48, 80] (Liu, Abbas & McKay, 2004; Martens et al.; Smalton & Freitas; Wang & Feng, 2004). Open questions: it is important to change the original Ant-Miner's pheromone update formula in order to cope better with low-quality rules. We treat trail evaporation as a form of learning factor. ...
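The pseudo-random proportional transition rule with parameter q0 discussed in these excerpts can be sketched in its generic Ant Colony System form (not tied to any particular Ant-Miner variant): with probability q0 the ant greedily picks the best term, otherwise it samples proportionally, so q0 directly trades exploitation against exploration.

```python
import random

def choose_term(candidates, tau, eta, q0=0.9, beta=1.0):
    """Pseudo-random proportional rule (Ant Colony System style).

    candidates : indices of terms still usable in the rule under construction
    tau, eta   : dicts mapping term index -> pheromone / heuristic value
    q0         : exploitation probability, chosen empirically by the user
    """
    scores = {t: tau[t] * (eta[t] ** beta) for t in candidates}
    if random.random() < q0:                      # exploitation: greedy choice
        return max(scores, key=scores.get)
    total = sum(scores.values())                  # biased exploration: sample
    r, acc = random.uniform(0.0, total), 0.0
    for t in candidates:
        acc += scores[t]
        if acc >= r:
            return t
    return candidates[-1]

tau = {0: 1.0, 1: 1.0, 2: 1.0}
eta = {0: 0.9, 1: 0.3, 2: 0.1}
print(choose_term([0, 1, 2], tau, eta, q0=0.8))
```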
Chapter
Full-text available
In our approach we want to ensure the good performance of Ant- Miner by applying the well-known (from the ACO algorithm) two pheromone updating rules: local and global, and the main pseudo-random proportional rule, which provides appropriate mechanisms for search space: exploitation and exploration. Now we can utilize an improved expression of this classification rule discovery system as an Ant-Colony-Miner. Further modifications are connected with the simplicity of the heuristic function used in the standard Ant-Miner. We propose to employing a new heuristic function based on quantitative, not qualitative parameters used during the classification process. The main transition rule will be changed dynamically as a result of the simple frequency analysis of the number of cases from the point of view characteristic partitions. This simplified heuristic function will be compensated by the pheromone update in different degrees, which helps ants to collaborate and is a good stimulant on ants’ behavior during the rule construction. The comparative study will be conducted using 5 data sets from the UCI Machine Learning repository.
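The local and global pheromone updating rules referred to above follow the usual Ant Colony System pattern; a minimal sketch, assuming the rule quality Q in [0, 1] is used as the reinforcement signal (the chapter's exact formulas may differ):

```python
def local_update(tau, term, xi=0.1, tau0=1.0):
    # Applied while a rule is being built: slightly evaporate the pheromone of
    # the chosen term so that later ants are nudged towards unexplored terms.
    tau[term] = (1.0 - xi) * tau[term] + xi * tau0

def global_update(tau, best_rule_terms, quality, rho=0.1):
    # Applied once per iteration to the terms of the best rule found,
    # reinforcing them in proportion to the rule quality Q in [0, 1].
    for term in best_rule_terms:
        tau[term] = (1.0 - rho) * tau[term] + rho * quality

tau = {0: 1.0, 1: 1.0, 2: 1.0}
local_update(tau, 0)
global_update(tau, [0, 2], quality=0.8)
print(tau)
```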
... The field of stylometry (or authorship recognition) has been used to great effect by historians and literary detectives to identify the authors of the Federalist Papers, Civil War letters, and Shakespeare's plays (Klarreich 2003; Oakes 2004). While stylometric methods existed before computers and artificial intelligence techniques, the field is currently dominated by AI techniques such as neural networks and statistical pattern recognition. ...
... 85 papers were published anonymously in the late 18th century to persuade the people of New York to ratify the American Constitution. The authorship of 73 of the texts was undisputed but who authored the remaining 12 was heavily contested (Oakes 2004). In order to discover who wrote the unknown papers, researchers have turned to analyzing the writing style of the known authors and comparing it to that of the papers of unknown authorship. ...
... The results of this study are based upon the participation of 15 individual authors. This is significantly larger than most studies in the field which generally deal with 2-4 authors (Tweedie, Singh, & Holmes 1996; Clark & Hannon 2007; Oakes 2004; Celikel & Dalkilic 2004). There were three basic elements to their participation. ...
Conference Paper
Practical Attacks Against Authorship Recognition Techniques (Michael Brennan)
The use of statistical AI techniques in authorship recognition (or stylometry) has contributed to literary and historical breakthroughs. These successes have led to the use of these techniques in criminal investigations and prosecutions. However, few have studied adversarial attacks and their devastating effect on the robustness of existing classification methods. This paper presents three key contributions to address this shortcoming. First, it uses human subjects to empirically validate the claim of high accuracy for current techniques (without attacks) by reproducing results for three representative stylometric methods. Secondly, it presents a framework for adversarial attacks including obfuscation attacks, where a subject attempts to hide their identity, and imitation attacks, where a subject attempts to frame another subject by imitating their writing style. Finally, it demonstrates that both attacks work well. The obfuscation attack reduces the effectiveness of the techniques to the level of random guessing and the imitation attack succeeds with 68-91% probability depending on the stylometric technique used. These results are made more significant by the fact that the experimental subjects were unfamiliar with stylometric techniques, without specialized knowledge in linguistics, and spent little time on the attacks (approximately 30-40 minutes).
... Popular lexical style markers or stylometric lexical features used in various studies are vocabulary richness (Holmes, 1992, 1994), such as hapax legomena (Holmes, 1992; Stamatatos, 2009), and token-based features, such as word length and sentence length (Stamatatos, 2009). Works in stylometric analysis or computational stylistics (Oakes, 2004) focus on two main areas: authorship attribution and text genre detection (Stamatatos et al., 2000). Argamon (2007) defines authorship attribution as the problem of determining the writer of an anonymous text that surfaced in old literary works on unknown authors (Stamatatos, 2008). ...
... Another typical model is the case of the Federalist Papers, 12 of which were claimed to have been written by both Alexander Hamilton and James Madison. Stylometric analysis using function words, such as prepositions, conjunctions, and articles as discriminators, revealed that the papers were authored by Madison alone (Mosteller and Wallace, 1984; Holmes and Forsyth, 1995; Oakes, 2004). ...
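Function-word discriminators of the kind used in the Federalist studies are simple relative frequencies; a rough sketch that assigns a disputed text to the author with the nearest mean profile (the marker words here are illustrative, not the carefully screened Mosteller-Wallace set):

```python
from collections import Counter

# Illustrative marker words; the Mosteller-Wallace study screened a much
# larger set of function words.
MARKERS = ["upon", "whilst", "while", "on", "by", "to", "enough"]

def profile(text):
    """Rates per 1000 tokens of each marker word."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [1000.0 * counts[w] / n for w in MARKERS]

def nearest_author(disputed, author_samples):
    """Assign the disputed text to the author whose mean profile is closest."""
    p = profile(disputed)
    def distance(texts):
        profiles = [profile(t) for t in texts]
        mean = [sum(col) / len(profiles) for col in zip(*profiles)]
        return sum((a - b) ** 2 for a, b in zip(p, mean)) ** 0.5
    return min(author_samples, key=lambda author: distance(author_samples[author]))
```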
Article
The Quranic oath is God's way of emphasizing the importance or truthfulness of a concept. Oaths are multifaceted, rich expressions, in which a single oath contains a line of meaning and a variety of aspects. This study proposes a new stylometric model for detecting apparent and narrative oaths. Toward this end, two types of application-specific features from a stylometric perspective—structural and content-specific features—were examined. The stylometric features were extracted, and a Bayesian network was constructed to model such features. The stylometric model of oaths was then evaluated through a series of machine-learning experiments using various classifiers: the Bayesian network, a decision tree, instance-based learning, and a neural network. These classification experiments focused on applying stylometric features in apparent and narrative oaths. The experiments covered two datasets: the entire Quran and the smaller dataset of Juz' 'Amma. The results led to two main conclusions. First, stylometric application-specific features are best used in their entirety—both structural and content-specific—rather than as two separate entities. Second, applying stylometric features was more significant in Juz' 'Amma, in which 40% of its surahs (chapters) contain oath statements. Finally, the stylometric model was extended to oath style detection using three additional stylometric features—syntactic, character, and lexical—and was analyzed using a statistical approach.
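The classifier comparison described in this abstract can be prototyped with off-the-shelf learners; a hedged sketch using scikit-learn, with GaussianNB standing in for the paper's Bayesian network and a feature matrix X and label vector y assumed to be already extracted:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y):
    # X: one row of structural + content-specific features per text unit
    # y: 1 if the unit contains an oath statement, 0 otherwise
    models = {
        "bayes": GaussianNB(),                            # stand-in for the Bayesian network
        "tree": DecisionTreeClassifier(random_state=0),
        "instance": KNeighborsClassifier(n_neighbors=3),  # instance-based learning
        "neural": MLPClassifier(max_iter=1000, random_state=0),
    }
    return {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```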
... However, so far, research into this issue has been limited. The authors of [102] were the first to explore the possibility of computationally obfuscating the (most likely) author of the disputed Federalist Papers (see e.g., [155]). They attempted to hide the author's identity by neutralising 14 of the most informative words per thousand words in the texts. ...
Thesis
Full-text available
This PhD thesis examines the viability of a text mining approach for supporting cybercrime investigations pertaining to online child protection. The main contributions of this dissertation are as follows. A systematic study of different aspects of methodological design of a state-of-the-art text mining approach is presented to assess its scalability towards a large, imbalanced and linguistically noisy social media dataset. In this framework, three key automatic text categorisation tasks are examined, namely the feasibility to (i) identify a social network user's age group and gender based on textual information found in only one single message; (ii) aggregate predictions on the message level to the user level without neglecting potential clues of deception and detect false user profiles on social networks and (iii) identify child sexual abuse media among thousands of other, legal media, including adult pornography, based on their filename. Finally, a novel approach is presented that combines age group predictions with advanced text clustering techniques and unsupervised learning to identify online child sex offenders' grooming behaviour. The methodology presented in this thesis was extensively discussed with law enforcement to assess its forensic readiness. Additionally, each component was evaluated on actual child sex offender data. Despite the challenging characteristics of these text types, the results show high degrees of accuracy for false profile detection, identifying grooming behaviour and child sexual abuse media identification.
... Authorship analysis also has a significant impact in the fields related to literary science. Many AA techniques have been developed to infer the disputed authorship of historical documents such as Civil War letters [Klarreich 2003], Shakespeare's plays [Klarreich 2003], the Federalist Papers [Oakes 2004; Tschuggnall and Specht 2014; Nasir et al. 2014], and the classic French literary mystery "Le Roman de Violette" [Boukhaled and Ganascia 2014]. AA techniques are also used to quantify the performance of literary translators, since the best translators will not have their own writing style reflected in the translated works [Almishari et al. 2014]. ...
Article
Full-text available
Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a body of exponentially exploding textual data. It extracts an author's identity and sociolinguistic characteristics based on the reflected writing styles in the text. It is an essential process for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for authorship analysis. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization and authorship verification with the Twitter, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the bag-of-lexical-n-grams, Latent Dirichlet Allocation, Latent Semantic Analysis, PVDM, PVDBOW, and word2vec representations.
... More recently, technological advances and interdisciplinary collaborations have expanded the methodologies used for literary scholarship, enabling investigations into more complex topics of authorship (e.g. The Federalist Papers: Adair, 1944; Mosteller and Wallace, 1963; Rokeach et al., 1970; Holmes and Forsyth, 1995; Tweedie et al., 1996; Bosch and Smith, 1998; Fung, 2003; Collins et al., 2004; Oakes, 2004; Rudman, 2005) and character psychology (Zunshine, 2010; Vermeule, 2011). ...
Article
Full-text available
Deliberate differences in how authors represent characters have been a core area of literary investigation since the dawn of literary theory. Here, we focus on epistolary literature, where authors consciously attempt to create different character styles through a series of documents such as letters. Previous studies suggest that the linguistic gestalt of an author's style—the author's 'writeprint'—can be extracted from the various characters of an epistolary novel, but it is unclear whether individual characters themselves also have distinct writeprints. We examine Samuel Richardson's Clarissa, lauded as a watershed example of the epistolary novel, using a recently developed and highly successful authorship attribution technique to determine (1) whether Richardson can construct distinct character writeprints, and (2) if so, which linguistic features he manipulated to do so. We find that while there are not as many distinct character writeprints as characters, Richardson does appear to have signature features he alters to create distinct character styles—and few of these features are the function word or abstract syntactic features typically comprising author writeprints. We discuss implications for other questions about character identity in Clarissa and character writeprint analysis more generally.
... In the authorship attribution task, the objective is to identify the authors of texts whose identity is lacking [65]. The dataset employed here comprises books written by eight authors: Arthur Conan Doyle (ACD), Bram Stoker (BRS), Charles Dickens (CHD), Thomas Hardy (THH), Pelham Grenville Wodehouse (PGW), Hector Hugh Munro (HHM) and Herman Melville (HME). ...
Article
Full-text available
Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems proved useful to create several language models. Despite the large amount of studies devoted to represent texts with physical models, only a limited number of studies have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex networks methods that are able to improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, the performance obtained with hybrid approaches outperformed the results obtained when only traditional or networked methods were used. Because the proposed model is generic, the framework devised here could be straightforwardly used to study similar textual applications where the topology plays a pivotal role in the description of the interacting agents.
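The networked text representation referred to here is typically a word adjacency (co-occurrence) graph whose topological measurements supplement conventional features; a minimal sketch with networkx (the measures chosen are for illustration only):

```python
import networkx as nx

def cooccurrence_network(tokens, window=1):
    """Word adjacency network: link words occurring within `window` positions."""
    g = nx.Graph()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            g.add_edge(w, tokens[j])
    return g

def topological_features(g):
    degrees = dict(g.degree())
    return {
        "n_nodes": g.number_of_nodes(),
        "mean_degree": sum(degrees.values()) / g.number_of_nodes(),
        "clustering": nx.average_clustering(g),
    }

g = cooccurrence_network("to be or not to be that is the question".split())
print(topological_features(g))
```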
... Significantly, authors of novels (such as Wodehouse) are separated from those of scientific works (such as Darwin), which indicates that the difference in style may be reflected in the use of words of distinct consistency indices. In summary, consistency indices are useful to detect authorship, and they can now be combined with other conventional methods [34,57,58,59,60] to enhance accuracy rates in distinguishing authors. ...
Article
In this paper we have quantified the consistency of word usage in written texts represented by complex networks, where words were taken as nodes, by measuring the degree of preservation of the node neighborhood. Words were considered highly consistent if the authors used them with the same neighborhood. When ranked according to the consistency of use, the words obeyed a log-normal distribution, in contrast to Zipf's law that applies to the frequency of use. Consistency correlated positively with the familiarity and frequency of use, and negatively with ambiguity and age of acquisition. An inspection of some highly consistent words confirmed that they are used in very limited semantic contexts. A comparison of consistency indices for eight authors indicated that these indices may be employed for author recognition. Indeed, as expected, authors of novels could be distinguished from those who wrote scientific texts. Our analysis demonstrated the suitability of the consistency indices, which can now be applied in other tasks, such as emotion recognition.
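The consistency index can be approximated by how much of a word's co-occurrence neighbourhood is preserved between two texts; a simplified sketch using a Jaccard overlap (the paper's exact index may be defined differently):

```python
def neighbourhood(tokens, word, window=1):
    """Set of words co-occurring with `word` within `window` positions."""
    neigh = set()
    for i, w in enumerate(tokens):
        if w == word:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neigh.update(tokens[lo:i] + tokens[i + 1:hi])
    return neigh

def consistency(tokens_a, tokens_b, word, window=1):
    na = neighbourhood(tokens_a, word, window)
    nb = neighbourhood(tokens_b, word, window)
    if not na and not nb:
        return 0.0
    return len(na & nb) / len(na | nb)   # Jaccard overlap of the neighbourhoods
```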
... This makes the discovered rules easier for the user to interpret, since now the interpretation of each rule is independent from all the other discovered rules. Although some modifications to the Ant-Miner algorithm have already been proposed [2][3][4], to the best of our knowledge, an unordered rule set modification to the original Ant-Miner algorithm is an area of research that has not yet been explored. ...
Conference Paper
The Ant-Miner algorithm, first proposed by Parpinelli and colleagues, applies an ant colony optimization heuristic to the classification task of data mining to discover an ordered list of classification rules. In this paper we present a new version of the Ant-Miner algorithm, which we call Unordered Rule Set Ant-Miner, that produces an unordered set of classification rules. The proposed version was evaluated against the original Ant-Miner algorithm in six public-domain datasets and was found to produce comparable results in terms of predictive accuracy. However, the proposed version has the advantage of discovering more modular rules, i.e., rules that can be interpreted independently from other rules - unlike the rules in an ordered list, where the interpretation of a rule requires knowledge of the previous rules in the list. Hence, the proposed version facilitates the interpretation of discovered knowledge, an important point in data mining.
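The practical difference between an ordered list and an unordered rule set shows up at prediction time, where each rule matches independently and conflicts must be resolved; a small sketch that resolves conflicts by rule quality (a common choice, not necessarily the paper's exact scheme):

```python
def predict_unordered(case, rules, default_class):
    """Each rule is (conditions, predicted_class, quality); conditions map an
    attribute name to a required value. Rules are interpreted independently."""
    matching = [r for r in rules
                if all(case.get(a) == v for a, v in r[0].items())]
    if not matching:
        return default_class
    # Conflict resolution: pick the matching rule of highest quality.
    return max(matching, key=lambda r: r[2])[1]

rules = [({"upon_rate": "high"}, "Hamilton", 0.9),
         ({"whilst_rate": "high"}, "Madison", 0.8)]
print(predict_unordered({"upon_rate": "high", "whilst_rate": "low"}, rules, "Madison"))
```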
... Recently, Galea (Galea and Shen, 2006) proposed a few modifications to Ant-Miner. Other modifications (Oakes, 2004; Martens et al., 2006) cope with the problem of attributes having ordered categorical values; some of them improve the flexibility of the rule representation language. Finally, more sophisticated modifications have been proposed to discover multi-label classification rules (Chan and Freitas, 2006) and to investigate fuzzy classification rules (Galea and Shen, 2006). ...
Conference Paper
Full-text available
In this paper, a novel rule discovery system that utilizes the Ant Colony Optimization (ACO) is presented. The ACO is a metaheuristic inspired by the behavior of real ants, where they search optimal solutions by considering both local heuristic and previous knowledge, observed by pheromone changes. In our approach we want to ensure the good performance of Ant-Miner by applying the new versions of heuristic functions in a main rule. We want to emphasize the role of the heuristic function by analyzing the influence of different propositions of these functions to the performance of Ant-Miner. The comparative study will be done using the 5 data sets from the UCI Machine Learning repository.
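The baseline against which new heuristic functions are compared is the original Ant-Miner term heuristic, based on the entropy of the class distribution among the cases covered by an attribute-value term; a sketch in generic notation (the per-term normalisation used in Ant-Miner is omitted here):

```python
import math

def term_entropy(class_counts):
    """Entropy of the class distribution among cases covered by one term."""
    total = sum(class_counts)
    h = 0.0
    for c in class_counts:
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def heuristic(class_counts, n_classes):
    # Original Ant-Miner heuristic: larger when the covered cases are purer.
    # (Normalisation over all candidate terms is omitted in this sketch.)
    return math.log2(n_classes) - term_entropy(class_counts)

print(heuristic([8, 2], n_classes=2))   # fairly pure term
print(heuristic([5, 5], n_classes=2))   # uninformative term -> 0.0
```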
... They use features such as style markers (average sentence or word length, total number of function words, vocabulary richness, etc.) and structural attributes (availability of signatures, number of attachments, etc.). There are also several alternative learning paradigms for authorship attribution, e.g., Khmelev and Tweedie [21] considering learning models for authorship attribution tasks using Markov chains of characters, or Oakes [32] using a kind of swarm intelligence simulation technique called Ant Colony Optimization. Combination vectors are used for authorship attribution (e.g. ...
Article
Full-text available
In this paper, we provide several alternatives to the classical Bag-Of-Words model for automatic authorship attribution. To this end, we consider linguistic and writing style information such as grammatical structures to construct different document representations. Furthermore we describe two techniques to combine the obtained representations: combination vectors and ensemble based meta-classification. Our experiments show the viability of our approach.
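The two combination strategies can be sketched in a few lines: concatenating the different document representations into one combination vector, or letting classifiers trained on separate representations vote at a meta level; an illustrative simplification that assumes integer class labels and pre-built feature matrices:

```python
import numpy as np

def combination_vector(representations):
    """Concatenate per-document feature matrices (rows in the same order)."""
    return np.hstack(representations)

def ensemble_predict(classifiers, test_representations):
    """Majority vote over classifiers trained on different representations;
    assumes integer class labels."""
    votes = np.array([clf.predict(X)
                      for clf, X in zip(classifiers, test_representations)])
    # Column-wise majority vote across the classifiers.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```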
Chapter
In this chapter we present some concepts pertaining to a hybrid approach to classification and clustering. Hybridization amounts to combining standard algorithms, such as those generating decision rules and decision trees, with nonstandard ones, e.g., those based on ant colony optimization (ACO) concepts.
Chapter
In this paper, we develop a user-centric privacy framework for quantitatively assessing the exposure of personal information in open settings. Our formalization addresses key challenges posed by such open settings, such as the necessity of user- and context-dependent privacy requirements. As a sanity check, we show that hard non-disclosure guarantees are impossible to achieve in open settings. In the second part, we provide an instantiation of our framework to address the identity disclosure problem, leading to the novel notion of d-convergence to assess the linkability of identities across online communities. Since user-generated text content plays a major role in linking identities between Online Social Networks, we further extend this linkability model to assess the effectiveness of countermeasures against linking authors of text content by their writing style. We experimentally evaluate both of these instantiations by applying them to suitable data sets: we provide a large-scale evaluation of the linkability model on a collection of 15 million comments collected from the Online Social Network Reddit, and evaluate the effectiveness of four semantics-retaining countermeasures and their combinations on the Extended-Brennan-Greenstadt Adversarial Corpus. Through these evaluations we validate the notion of d-convergence for assessing the linkability of entities in our Reddit data set and explore the practical impact of countermeasures on the importance of standard writing style features on identifying authors.
Article
The use of stylometry, authorship recognition through purely linguistic means, has contributed to literary, historical, and criminal investigation breakthroughs. Existing stylometry research assumes that authors have not attempted to disguise their linguistic writing style. We challenge this basic assumption of existing stylometry methodologies and present a new area of research: adversarial stylometry. Adversaries have a devastating effect on the robustness of existing classification methods. Our work presents a framework for creating adversarial passages including obfuscation, where a subject attempts to hide her identity; imitation, where a subject attempts to frame another subject by imitating his writing style; and translation, where original passages are obfuscated with machine translation services. This research demonstrates that manual circumvention methods work very well while automated translation methods are not effective. The obfuscation method reduces the techniques' effectiveness to the level of random guessing and the imitation attempts succeed up to 67% of the time depending on the stylometry technique used. These results are more significant given the fact that experimental subjects were unfamiliar with stylometry, were not professional writers, and spent little time on the attacks. This article also contributes to the field by using human subjects to empirically validate the claim of high accuracy for four current techniques (without adversaries). We have also compiled and released two corpora of adversarial stylometry texts to promote research in this field with a total of 57 unique authors. We argue that this field is important to a multidisciplinary approach to privacy, security, and anonymity.
Article
In digital forensics, questions often arise about the authors of documents: their identity, demographic background, and whether they can be linked to other documents. The field of stylometry uses linguistic features and machine learning techniques to answer these questions. While stylometry techniques can identify authors with high accuracy in non-adversarial scenarios, their accuracy is reduced to random guessing when faced with authors who intentionally obfuscate their writing style or attempt to imitate that of another author. While these results are good for privacy, they raise concerns about fraud. We argue that some linguistic features change when people hide their writing style and by identifying those features, stylistic deception can be recognized. The major contribution of this work is a method for detecting stylistic deception in written documents. We show that using a large feature set, it is possible to distinguish regular documents from deceptive documents with 96.6% accuracy (F-measure). We also present an analysis of linguistic features that can be modified to hide writing style.
Article
Thesis (M.Sc.) (Computer Science), University of Pretoria, 2006. Includes a summary and bibliographical references (leaves 135-149).
Article
Full-text available
The Federalist Papers, twelve of which are claimed by both Alexander Hamilton and James Madison, have long been used as a testing-ground for authorship attribution techniques despite the fact that the styles of Hamilton and Madison are unusually similar. This paper assesses the value of three novel stylometric techniques by applying them to the Federalist problem. The techniques examined are a multivariate approach to vocabulary richness, analysis of the frequencies of occurrence of sets of common high-frequency words, and use of a machine-learning package based on a 'genetic algorithm' to seek relational expressions characterizing authorial styles. All three approaches produce encouraging results to what is acknowledged to be a difficult problem.
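Vocabulary-richness measures of the kind entering the multivariate approach are simple token statistics; a small sketch of two classic ones, the type-token ratio and the proportion of hapax legomena:

```python
from collections import Counter

def richness(tokens):
    counts = Counter(tokens)
    n = len(tokens)                                   # number of tokens
    v = len(counts)                                   # vocabulary size (types)
    v1 = sum(1 for c in counts.values() if c == 1)    # hapax legomena
    return {"type_token_ratio": v / n, "hapax_proportion": v1 / v}

print(richness("the man saw the dog and the man smiled".split()))
```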
Conference Paper
Full-text available
This paper utilizes Ant-Miner - the first Ant Colony algorithm for discovering classification rules - in the field of web content mining, and shows that it is more effective than C5.0 in two sets of BBC and Yahoo web pages used in our experiments. It also investigates the benefits and dangers of several linguistics-based text preprocessing techniques to reduce the large numbers of attributes associated with web content mining.
Article
Full-text available
Ant-based algorithms or ant colony optimization (ACO) algorithms have been applied successfully to combinatorial optimization problems. More recently, Parpinelli and colleagues applied ACO to data mining classification problems, where they introduced a classification algorithm called Ant_Miner. In this paper, we present an improvement to Ant_Miner (we call it Ant_Miner3). The proposed version was tested on two standard problems and performed better than the original Ant_Miner algorithm.
Article
Full-text available
The paper proposes an algorithm for data mining called Ant-Miner (ant-colony-based data miner). The goal of Ant-Miner is to extract classification rules from data. The algorithm is inspired by both research on the behavior of real ant colonies and some data mining concepts as well as principles. We compare the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, in six public domain data sets. The results provide evidence that: 1) Ant-Miner is competitive with CN2 with respect to predictive accuracy, and 2) the rule lists discovered by Ant-Miner are considerably simpler (smaller) than those discovered by CN2.
Article
Neural Networks have recently been a matter of extensive research and popularity. Their application has increased considerably in areas in which we are presented with a large amount of data and we have to identify an underlying pattern. This paper will look at their application to stylometry. We believe that statistical methods of attributing authorship can be coupled effectively with neural networks to produce a very powerful classification tool. We illustrate this with an example of a famous case of disputed authorship, The Federalist Papers. Our method assigns the disputed papers to Madison, a result which is consistent with previous work on the subject.
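Coupling statistical feature extraction with a neural classifier, as described above, can be prototyped quickly; a hedged sketch with scikit-learn's MLPClassifier on toy function-word rates (the original study used its own network and feature set):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy training data: rows are papers of known authorship, columns are rates of
# two marker words (e.g. 'upon' and 'whilst'); 0 = Hamilton, 1 = Madison.
X = np.array([[3.0, 0.1], [2.8, 0.0], [0.4, 0.9], [0.2, 1.1]])
y = np.array([0, 0, 1, 1])

clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
clf.fit(X, y)

# Classify a disputed paper described by the same toy features.
print(clf.predict([[0.3, 0.8]]))
```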
Conference Paper
Ant-based algorithms or ant colony optimization (ACO) algorithms have been applied successfully to combinatorial optimization problems. More recently, Parpinelli and colleagues applied ACO to data mining classification problems, where they introduced a classification algorithm called Ant_Miner. In this paper, we present an improvement to Ant_Miner (we call it Ant_Miner3). The proposed version was tested on two standard problems and performed better than the original Ant_Miner algorithm.