Figure 6
Herodotus' Histories (circles) versus Iliad (crosses) and Odyssey (boxes). First two principal components for full set of tri-grams among 10,000-character samples.
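For readers who want to reproduce a plot of this kind, here is a minimal sketch using scikit-learn. The corpora below are randomly generated stand-ins (the study's actual Greek texts and preprocessing are not reproduced): each 10,000-character sample is represented by its relative trigram frequencies over the full trigram set and projected onto the first two principal components.

```python
from collections import Counter
import random

import numpy as np
from sklearn.decomposition import PCA

# Stand-in corpora: random strings instead of the Greek texts used in the study.
random.seed(0)
fake_text = lambda n: "".join(random.choice("abcde ") for _ in range(n))
texts = {"herodotus": fake_text(30_000), "iliad": fake_text(30_000), "odyssey": fake_text(30_000)}

def samples(text, size=10_000):
    """Consecutive non-overlapping fixed-size character samples."""
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]

def trigram_freqs(sample):
    """Relative frequency of every overlapping character trigram."""
    counts = Counter(sample[i:i + 3] for i in range(len(sample) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

rows = [(label, trigram_freqs(s)) for label, text in texts.items() for s in samples(text)]
vocab = sorted({g for _, f in rows for g in f})             # full set of trigrams
X = np.array([[f.get(g, 0.0) for g in vocab] for _, f in rows])
pcs = PCA(n_components=2).fit_transform(X)                  # one (PC1, PC2) point per sample
```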


Source publication
Article
Capturing repetitive sound in a feature for authorship and stylistic analysis is of great interest. In this paper, we present the functional n-gram as a feature well suited to the analysis of poetry and other sound-sensitive material, working toward a stylistics based on sound rather than text. Using Support Vector Machines (SVM) for text...
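A minimal sketch of the kind of pipeline the abstract describes, with invented training strings: relative character-trigram frequencies fed to a linear SVM. The tf-idf weighting is disabled so the features are plain relative frequencies; the original work's exact feature selection is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Invented toy data; real inputs would be text samples per author.
docs = ["first training sample by author a", "second training sample by author a",
        "a training sample by author b", "another training sample by author b"]
labels = ["A", "A", "B", "B"]

clf = make_pipeline(
    # relative character-trigram frequencies (idf weighting turned off)
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3), use_idf=False, norm="l1"),
    LinearSVC(),
)
clf.fit(docs, labels)
print(clf.predict(["an unseen sample by author b"]))  # predicted author label
```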

Similar publications

Article
Hyperspectral remote sensing has a strong capacity for information expression, so it provides good support for classification. Many methods have been proposed to address hyperspectral data classification problems. However, most of them are committed to spectral feature extraction, which wastes some valuable information and yields poor classif...
Article
The analytical model (AM) of suspension force in a bearingless flywheel machine has model-mismatch problems due to magnetic saturation and rotor eccentricity. A numerical modeling method based on the differential evolution (DE) extreme learning machine (ELM) is proposed in this paper. The representative input and output sample sets are obtained by f...
Conference Paper
Physical activity recognition is an emerging research field in machine learning and computer vision. The main theme of human physical activity recognition is to identify the different daily activities of human beings using wearable sensors. As the relevant features at the surface of the raw data play a significant role in the accuracies of class...
Article
The leaves of plants carry rich information for plant recognition. In general, agriculture experts extract this information from the leaves. Since the leaves contain useful features for recognising various types of plants, these features can be extracted and used by automatic image recognition algorithms to classify plant species....
Article
This study aimed at evaluating the synergistic use of Sentinel-1 and Sentinel-2 data combined with the Support Vector Machines (SVMs) machine learning classifier for mapping land use and land cover (LULC) with emphasis on wetlands. In this context, the added value of spectral information derived from the Principal Component Analysis (PCA), Minimum...

Citations

... Nevertheless, rhythmic or prosodic features have also been employed in authorship analysis of prose. However, these works usually consist of the study of word repetitions, such as anaphora (the repetition of a word, or a sequence of words, from a previous sentence at the beginning of a new sentence) (Lagutina et al., 2021), or are based on mapping the texts into the corresponding sequences of sounds before extracting the n-grams, as in the research by Forstall and Scheirer (2010), where the authors employ the CMU Pronouncing Dictionary for the conversion. Finally, syllables have been used as base units in other AId works (Sidorov, 2018) and more generally in other NLP tasks, such as poem generation (Zugarini et al., 2019). ...
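The conversion step mentioned here is easy to reproduce in outline. Below is a minimal sketch assuming NLTK's copy of the CMU Pronouncing Dictionary (the cited paper's exact preprocessing is not specified here): words are mapped to ARPAbet phoneme sequences, and phoneme n-grams are then counted.

```python
from collections import Counter

from nltk.corpus import cmudict  # requires: nltk.download("cmudict")

pron = cmudict.dict()  # word -> list of possible ARPAbet phoneme sequences

def phoneme_ngrams(words, n=2):
    """Map known words to their first listed pronunciation, then count phoneme n-grams."""
    phones = [p for w in words if w in pron for p in pron[w][0]]
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))

print(phoneme_ngrams("sing in me muse and through me tell the story".split()))
```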
Article
It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions but also in many prose works. Such metric patterns were based on so-called syllabic quantity, that is, on the length of the syllables involved, and there is substantial evidence suggesting that certain authors had a preference for certain metric patterns over others. In this research we investigate the possibility of employing syllabic quantity as a basis for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts. We test the impact of these features on the authorship attribution task when combined with other topic-agnostic features. Our experiments, carried out on three different datasets using support vector machines (SVMs), show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
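As a rough illustration of rhythmic features of this kind, here is a minimal sketch assuming each text has already been scanned into a string of syllable quantities; the actual Latin scansion procedure used by the authors is not reproduced, and the input below is invented.

```python
from collections import Counter

def quantity_ngrams(quantities: str, n: int = 4) -> Counter:
    """n-gram counts over a quantity string, 'L' = long syllable, 'S' = short."""
    return Counter(quantities[i:i + n] for i in range(len(quantities) - n + 1))

scanned = "LSSLSSLLLSLLSSL"            # invented scansion of one sentence
features = quantity_ngrams(scanned)    # rhythmic feature counts, fed to an SVM downstream
```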
... The robustness of the proposed features in different domains is tested on cross-topic, multi-topic, single-topic, cross-genre, multi-genre, and single-genre datasets [20-24]. Feature selection and dimensionality reduction techniques are widely applied to these datasets [6-10, 25-27]. Although these methodologies applied to AA tasks are useful, they are mostly restricted to the general text classification methodology and do not specifically address the heterogeneous nature of AA tasks. ...
... Such varying characteristics of domains are addressed in the AA literature in terms of genre and topics. Genre- or topic-independent feature sets [28-33], increased diversity of features, domain adaptation [32, 34], and instance sampling methods [26, 27, 35-38] have been proposed for the robustness of AA in heterogeneous domains. However, they are not widely known and are not easily adaptable to existing classification approaches. ...
Thesis
In this thesis, we propose a scaling algorithm using multivariate analysis for authorship attribution across document types with heterogeneous properties. The scaling algorithm is inspired by the idea of removing the non-variable background used to capture moving objects in image recognition systems. The algorithm consists of two steps: determining the source-based common features of the documents in different topics and genres, and removing these common features from the document vector to uncover the style of the authors. Authorship attribution differs from other types of text classification in terms of text processing techniques. The topic, genre, and target audience affect the author's word choice, causing the author's style to blur. In this context, the author's documents are scaled according to the type to which each document belongs, and the similarity between documents by the same author or by different authors is exposed. In the thesis, classification-based accuracy measurements were made using term and character sequences on different types of documents, such as e-mails, blogs, micro messages, newspaper articles, and novel excerpts. The proposed scaling algorithm achieves the highest accuracy regardless of topic, feature set, and genre in any dataset in classification-based authorship attribution. In addition, scaling on only the term or character sequence features in the cross-domain and cross-genre datasets is highly competitive with complex text processing techniques obtained by linguistic analysis.
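A minimal sketch of one reading of this two-step idea, assuming the "common features" of a domain are approximated by the per-domain mean feature vector (the background-removal analogy); the thesis' exact scaling formulation may differ, and the data below is random.

```python
import numpy as np

def remove_domain_background(X: np.ndarray, domains: list[str]) -> np.ndarray:
    """Subtract each domain's mean feature vector from that domain's documents."""
    X = X.astype(float).copy()
    for d in set(domains):
        idx = [i for i, dom in enumerate(domains) if dom == d]
        X[idx] -= X[idx].mean(axis=0)   # what remains should reflect author style, not domain
    return X

rng = np.random.default_rng(0)
X = rng.random((6, 100))                                      # 6 documents, 100 n-gram features
domains = ["email", "email", "blog", "blog", "news", "news"]
X_scaled = remove_domain_background(X, domains)
```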
... Function words, such as stopwords, which are less topic-correlated features, have been shown to be effective for AA [14]. Character n-grams perform better than token bi-grams [15,16,17]. Character n-grams also capture phonetic information inside words, while token bi-grams capture repetitive sequences that are helpful for stylistic analysis. ...
... Word endings have been noted to contribute to the success of character n-grams in stylometric analysis [48]; however, in the Lithuanian language word endings are typically longer than the bigrams or trigrams commonly used as general features (we used bi-grams only), thus a separate feature type for Lithuanian word endings seems useful. ...
Conference Paper
The Internet can be misused by cyber criminals as a platform to conduct illegitimate activities (such as harassment, cyberbullying, and incitement of hate or violence) anonymously. As a result, authorship analysis of anonymous texts on the Internet (such as emails and forum comments) has attracted significant attention in the digital forensic and text mining communities. The main problem is the large number of possible authors, which hinders the effective identification of the true author. We interpret open-class author attribution as a process of expert recommendation, where the decision support system returns a list of suspected authors for further analysis by forensics experts rather than a single prediction result, thus reducing the scale of the problem. We describe the task formally and present algorithms for constructing the suspected author list. For evaluation we propose using a simple Winner-Takes-All (WTA) metric as well as a set of gain-discount model based metrics from the information retrieval domain (mean reciprocal rank, discounted cumulative gain, and rank-biased precision). We also propose the List Precision (LP) metric as an extension of WTA for evaluating the usability of the suspected author list. For experiments, we use our own dataset of Internet comments in the Lithuanian language and consider the use of language-specific (Lithuanian) lexical features together with general lexical features derived from the English language. For classification we use a one-class Support Vector Machine (SVM) classifier. The results of the experiments show that the usability of open-class author attribution can be improved considerably by using a set of language-specific lexical features together with general lexical features, while the proposed method can be used to reduce the number of suspected authors, thus alleviating the work of forensic linguists.
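The ranking metrics named in the abstract are standard and easy to state in code. A minimal sketch of two of them follows (the paper's own List Precision metric is its own proposal and is not reproduced here); the suspect names are invented.

```python
def winner_takes_all(ranked: list[str], true_author: str) -> float:
    """1.0 only if the true author tops the suspect list."""
    return 1.0 if ranked and ranked[0] == true_author else 0.0

def reciprocal_rank(ranked: list[str], true_author: str) -> float:
    """Credit discounted by the true author's position in the list."""
    return 1.0 / (ranked.index(true_author) + 1) if true_author in ranked else 0.0

suspects = ["user42", "user7", "user13"]      # invented list from the classifier
print(winner_takes_all(suspects, "user7"))    # 0.0
print(reciprocal_rank(suspects, "user7"))     # 0.5
```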
... The motivation for such a feature is that some authors might have a preference for certain expressions composed of two or more words in sequence, the probability of which is captured by n-grams of these words. Further, Forstall and Scheirer [56], [57] have argued that character-level n-grams serve as useful proxies for phonemes and express the sound of words, another facet of language that can be quantified as a feature for authorship attribution. ...
... SVM was first applied to authorship attribution through the works of de Vel et al. [46] for e-mails and Diederich et al. [47] for German newspapers. Many subsequent works highlighted the overall success of SVM in classifying authors (see: [1], [9], [53], [57], [63], [64], [73], [106], [110]-[112], [148], [159], [161], [168], [174]), making it a dominant classification strategy in the field. The accuracy of approaches that use both n-grams and SVM is further discussed in a recent report from Stamatatos [176], who has investigated the question of whether n-grams remain effective for cross-topic authorship attribution. ...
... It is also possible to extract more information from limited sample sizes by looking for primitive sound features. Forstall and Scheirer introduced the concept of the functional n-gram [57], which, applied at the character level, is an n-gram-based feature that describes the most frequent sound-oriented information in a text. Similar to function words, functional n-grams are those n-grams that are elements of most of the lexicon, necessitating their use. ...
Article
The veil of anonymity provided by smartphones with pre-paid SIM cards, public Wi-Fi hotspots, and distributed networks like Tor has drastically complicated the task of identifying users of social media during forensic investigations. In some cases, the text of a single posted message will be the only clue to an author's identity. How can we accurately predict who that author might be when the message may never exceed 140 characters on a service like Twitter? For the past 50 years, linguists, computer scientists and scholars of the humanities have been jointly developing automated methods to identify authors based on the style of their writing. All authors possess peculiarities of habit that influence the form and content of their written works. These characteristics can often be quantified and measured using machine learning algorithms. In this article, we provide a comprehensive review of the methods of authorship attribution that can be applied to the problem of social media forensics. Further, we examine emerging supervised learning-based methods that are effective for small sample sizes, and provide step-by-step explanations for several scalable approaches as instructional case studies for newcomers to the field. We argue that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multimodal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.
... Our work in authorship and stylistic analysis has considered the importance of phonetic style markers, with the observation that sound plays a fundamental role in an author's style, particularly for poets. To capture sound information, we have developed a feature that we call a functional n-gram (Forstall and Scheirer, 2010), whereby the power of the Zipfian distribution (Zipf, 1949) is realized by selecting the n-grams that occur most frequently as features, while preserving their relative probabilities as the actual feature element. By using more primitive, sound-oriented features—namely, character-level n-grams—we are able to build accurate classifiers with the functional n-gram approach. ...
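A minimal sketch of the feature construction as described in this passage, with an invented stand-in corpus: the k most frequent character n-grams become the feature set, and each sample's relative frequencies over those n-grams are the feature values.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def functional_ngram_vector(sample: str, corpus: str, k: int = 50, n: int = 3):
    """Feature values are the sample's relative probabilities of the corpus' top-k n-grams."""
    top = [g for g, _ in Counter(char_ngrams(corpus, n)).most_common(k)]
    counts = Counter(char_ngrams(sample, n))
    total = max(sum(counts.values()), 1)
    return [counts[g] / total for g in top]

corpus = "arma virumque cano troiae qui primus ab oris"   # invented stand-in corpus
vector = functional_ngram_vector(corpus[:20], corpus)     # one sample's feature vector
```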
Article
In this study, we use computational methods to evaluate and quantify philological evidence that an eighth century CE Latin poem by Paul the Deacon was influenced by the works of the classical Roman poet Catullus. We employ a hybrid feature set composed of n-gram frequencies for linguistic structures of three different kinds—words, characters, and metrical quantities. This feature set is evaluated using a one-class support vector machine approach. While all three classes of features prove to have something to say about poetic style, the character-based features prove most reliable in validating and quantifying the subjective judgments of the practicing Latin philologist. Word-based features were most useful as a secondary refining tool, while metrical data were not yet able to improve classification. As these features are developed in ongoing work, they are simultaneously being incorporated into an existing online tool for allusion detection in Latin poetry.
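As a rough illustration of the one-class setup described in this abstract, the sketch below uses scikit-learn's OneClassSVM on random stand-in vectors; the real feature values (word, character, and metrical n-gram frequencies) and hyperparameters would come from the study's data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
catullus = rng.normal(0.0, 1.0, size=(40, 200))    # stand-in feature vectors for the known class
questioned = rng.normal(0.2, 1.0, size=(1, 200))   # stand-in vector for a questioned poem

model = OneClassSVM(kernel="rbf", nu=0.1).fit(catullus)
print(model.predict(questioned))            # +1: consistent with the class, -1: outlier
print(model.decision_function(questioned))  # signed distance from the learned boundary
```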
... Previously, we have used character- and word-level functional n-grams to compare Homer's two epic poems to one another and to later written text (Forstall & Scheirer, 2010a). We have also adapted the functional n-gram to metrical data (Forstall, Jacobson, & Scheirer, 2010; Forstall & Scheirer, 2010b). ...
... Continuing our study [2, 3] of repetitive sound and its relationship to style in poetry, this talk introduces a variety of statistical features found to be useful descriptors of Latin elegiac couplets (for background, see [4]). Using computational statistical methods, we have undertaken a broad survey of Latin elegiac poets. ...
... Our functional n-gram [2], when applied at the character level, is a feature that describes the most frequent sound-oriented information in a text. The values taken on by distinct functional n-grams have been found to vary by poet and meter. ...
Article
Continuing our study [2, 3] of repetitive sound and its relationship to style in poetry, this talk introduces a variety of statistical features found to be useful descriptors of Latin elegiac couplets (for background, see [4]). Using computational statistical methods, we have undertaken a broad survey of Latin elegiac poets. The elegiac meter is used for a variety of themes, most notably love [1, p. 322]. The elegiac couplet is a pair of two different one-line "verses":

– ⏔ | – ⏔ | – ⏔ | – ⏔ | – ⏑⏑ | – ⏓
– ⏔ | – ⏔ | – ‖ – ⏑⏑ | – ⏑⏑ | ⏓

In the above, "–" represents a long syllable, "⏑" a short syllable, "⏔" either one long syllable or two shorts, and "⏓" either one long syllable or a short. The...
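The scheme above can be checked mechanically. A minimal sketch, assuming input lines have already been scanned into "L"/"S" quantity strings (the strings below are invented): each biceps position expands to "one long or two shorts" and each anceps to "either length", giving a regular expression per verse.

```python
import re

# long = "L", short = "S"; a biceps position admits one long or two shorts,
# the final anceps admits a single syllable of either length
BICEPS, ANCEPS = "(?:L|SS)", "[LS]"
HEXAMETER  = re.compile("".join(["L", BICEPS] * 4 + ["L", "SS", "L", ANCEPS]) + "$")
PENTAMETER = re.compile("".join(["L", BICEPS] * 2 + ["L"] + ["L", "SS"] * 2 + [ANCEPS]) + "$")

def is_elegiac_couplet(hex_line: str, pent_line: str) -> bool:
    """Check two scanned quantity strings against the couplet scheme."""
    return bool(HEXAMETER.match(hex_line)) and bool(PENTAMETER.match(pent_line))

# Invented quantity strings that fit the scheme:
print(is_elegiac_couplet("LSSLLLLLLLSSLL", "LSSLSSLLSSLSSL"))  # True
```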
Preprint
It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions but also in many prose works. Such metric patterns were based on so-called syllabic quantity, i.e., on the length of the involved syllables, and there is substantial evidence suggesting that certain authors had a preference for certain metric patterns over others. In this research we investigate the possibility of employing syllabic quantity as a basis for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts. We test the impact of these features on the authorship attribution task when combined with other topic-agnostic features. Our experiments, carried out on three different datasets using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
Article
Establishing authorship of online texts is fundamental to combat cybercrimes. Unfortunately, text length is limited on some platforms, making the challenge harder. We aim at identifying the authorship of Twitter messages limited to 140 characters. We evaluate popular stylometric features, widely used in literary analysis, and specific Twitter features like URLs, hashtags, replies or quotes. We use two databases with 93 and 3957 authors, respectively. We test varying sized author sets and varying amounts of training/test texts per author. Performance is further improved by feature combination via automatic selection. With a large amount of training Tweets (>500), a good accuracy (Rank-5>80%) is achievable with only a few dozens of test Tweets, even with several thousands of authors. With smaller sample sizes (10-20 training Tweets), the search space can be diminished by 9-15% while keeping a high chance that the correct author is retrieved among the candidates. In such cases, automatic attribution can provide significant time savings to experts in suspect search. For completeness, we report verification results. With few training/test Tweets, the EER is above 20-25%, which is reduced to <15% if hundreds of training Tweets are available. We also quantify the computational complexity and time permanence of the employed features.