Fig 3 - uploaded by Chris Forstall
Source publication
In this study, we use computational methods to evaluate and quantify philological evidence that an eighth-century CE Latin poem by Paul the Deacon was influenced by the works of the classical Roman poet Catullus. We employ a hybrid feature set composed of n-gram frequencies for linguistic structures of three different kinds—words, characters, and m...
Context in source publication
Context 1
... probabilities of the same n-gram (should it exist) for specific authors or literary groups. This type of style marker is very well suited to our case study, where certain word sequences are common to a particular group (Catullus and his elegiac successors) but uncommon or non-existent in the work of other groups. To compute the actual feature, a count C(target) of the chosen n-gram sequence must first be calculated. Following this, a second count C(total) must be calculated from all other n-gram sequences starting with the same word w1 and bounded by w1 and wn. The final probability feature is given by C(target) / C(total). If the sequence exists, it will have a very low probability; otherwise, it receives a probability of 0. In most texts, the target n-gram should not exist. This feature can also be computed in a 'commutative' fashion (for a bi-gram: 'w1 w2' or 'w2 w1'). The resulting probability features can be used to augment the existing functional n-gram features to train a more accurate one-class SVM. 4.4 A third feature type, derived from syllabic quantity, attempted to quantify larger-scale patterns in the poets' use of meter. Paul wrote in elegiac couplets, in imitation of his classical predecessors, despite the fact that in the intervening centuries the prosody of spoken Latin had changed greatly. A couplet is composed of two lines with slightly different prescriptions; in each, the number and quantities ('lengths') of syllables must conform to one of a very limited set of patterns (see Fig. 3). Barring rare exceptions, there are sixteen possible forms for the first line and four for the second. The quantity of a syllable depends upon the length of the vowel and the disposition of the following consonants (including those in ...
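The count-based probability feature described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual implementation; the sample tokens, target bigram, and function names are all assumptions.

```python
from collections import Counter

def word_ngrams(tokens, n):
    # all contiguous word n-grams in a token sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def probability_feature(tokens, target):
    # probability of the target n-gram among all n-grams that
    # begin with the same word; 0 when the target never occurs
    n = len(target)
    counts = Counter(word_ngrams(tokens, n))
    hits = counts[tuple(target)]
    if hits == 0:
        return 0.0
    denom = sum(c for gram, c in counts.items() if gram[0] == target[0])
    return hits / denom

tokens = "odi et amo quare id faciam fortasse requiris odi et video".split()
print(probability_feature(tokens, ("odi", "et")))  # → 1.0 (both bigrams starting with 'odi' are 'odi et')
```

The 'commutative' variant would simply add the counts for the reversed bigram to both numerator and denominator before dividing.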
Citations
... This indicates the increasingly close relationship between linguistics and humanities research, as well as the growing application of linguistic tools and theories to research on literature (e.g. Noe 1988; Hoover 2001; Forstall, Jacobson, and Scheirer 2011; Zheng and Jin 2022) and text analysis (e.g. Holmes and Forsyth 1995; Koppel, Argamon, and Shimoni 2002; Gorman 2019). ...
Digital humanities (DH) is an emerging interdisciplinary academic field that has gained prominence in recent decades. This study explores the evolution of topics, research impact, and attractiveness of DH through the lens of the journal Digital Scholarship in the Humanities (DSH), a leading platform for DH research, from 1986 to 2023 (in three phases: 1986-2003, 2004-2014, and 2015-2023). The study also examines the role of linguistic research in DH. The results reveal that: (1) the primary themes and trends in DH research have evolved from text encoding and analysis to critical studies of technology, infrastructure, and knowledge production; (2) the citation patterns demonstrate the growing influence and recognition of DH within the humanities and computer sciences; (3) European and North American scholars have dominated DH networks, but new centers and scholars are emerging in Asia, South America, and Oceania; and (4) linguistics-related publications have given less attention to specific linguistic features but have provided vital intellectual support for DH. This study provides a data-based perspective on the development and direction of DH, and demonstrates the value of linguistic methods for mapping scholarly fields over time. Based on these findings, suggestions are made for scholars interested in DH.
... This means that all the authors seen at testing time are also known at training time [42]. Most authorship attribution applications and experimental effort are directed towards the closed-set task [13]. ...
The anonymity afforded by social media has increased its popularity among users of all ages, and the availability of public Wi-Fi networks has facilitated access to a vast variety of online content, including social media applications. Although anonymity and ease of access make these platforms a convenient means of communication, it is difficult to protect their vulnerable users against sexual predators. An automated identification system that can attribute predators to their texts would make a solution more attainable. In this survey, we provide a review of the methods of pedophile attribution used in social media platforms. We examine the effect of the size of the suspect set and the length of the text on the task of attribution. Moreover, we review the most-used datasets, features, classification techniques and performance measures for attributing sexual predators. We found that few studies have proposed tools to mitigate the risk of online sexual predators, and that none of them can provide suspect attribution. Finally, we list several open research problems.
... In addition, graph structures seem to be an appropriate way to model intertextuality at the community level (Romanello 2016; Rockmore et al. 2018). Intertextuality modelling of classical literature widely supports cultural studies such as quantitative literary criticism and stylometry (Forstall et al. 2011; Burns et al. 2021). Existing studies of Chinese literature have been limited to detection methods (Liang et al. 2021; Yu et al. 2022) and shallow studies of intertextual texts on small corpora (Sturgeon 2018a; Sturgeon 2018b; Deng et al. 2022), falling short of macroanalysis (Jockers 2013) of Chinese culture. ...
... Based on the consensus of Chinese philosophy (Feng and Bodde 1948), we selected the keystone works as the benchmarks for each school. We first calculated the average intertextual score between a book and the keystone works of each school. ...
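The averaging step described above lends itself to a small sketch. The school names stand in for the study's benchmark schools, but the scores and the tie-breaking rule are hypothetical placeholders, not the study's actual data or method.

```python
# hypothetical intertextual scores between one book and the
# keystone works of each school (higher = closer relationship)
keystone_scores = {
    "Confucianism": [0.62, 0.55, 0.70],
    "Taoism": [0.31, 0.28],
    "Mohism": [0.12, 0.09, 0.15],
}

def school_affinity(scores_by_school):
    # average the book-to-keystone scores per school, then pick the best
    averages = {school: sum(s) / len(s) for school, s in scores_by_school.items()}
    best = max(averages, key=averages.get)
    return averages, best

averages, best = school_affinity(keystone_scores)
print(best)  # → Confucianism
```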
... The schools in ancient China were constantly evolving as scholars reshaped previous theories. As acknowledged in the history of Chinese philosophy (Feng and Bodde 1948), the original Taoist philosophy inspired the Taoist religion and Wei Jin metaphysics, while Neo-Confucianism inherited the theories of Confucianism. This section validates these evolutionary paths of Taoism and Confucianism quantitatively. ...
Recognized as one of the cradles of human civilization, ancient China nurtured the longest continuous academic traditions and humanistic spirits, which continue to impact today’s society. With an unprecedented large-scale corpus spanning 3000 years, this paper presents a quantitative analysis of cultural evolution in ancient China. Millions of intertextual associations are identified and modelled with a hierarchical framework via deep neural network and graph computation, thus allowing us to answer three progressive questions quantitatively: (1) What is the interaction between individual scholars and philosophical schools? (2) What are the vicissitudes of schools in ancient Chinese history? (3) How did ancient China develop a cross-cultural exchange with an externally introduced religion such as Buddhism? The results suggest that the proposed hierarchical framework for intertextuality modelling can provide sound suggestions for large-scale quantitative studies of ancient literature. An online platform is developed for custom data analysis within this corpus, which encourages researchers and enthusiasts to gain insight into this work. This interdisciplinary study inspires the re-understanding of ancient Chinese culture from a digital humanities perspective and prompts the collaboration between humanities and computer science.
... AId methodologies can be applied even to literary pieces whose authorship is well-known and certain, in order to find possible stylistic influences from other authors; for example, the goal of Forstall et al. (2011) is to verify a supposed influence by Catullus on the poetry of Paul the Deacon. ...
... In this landscape the idea of employing prosodic features is not a new one. Of course, their most natural use is in studies focused on poetry, such as in the study of Neidorf et al. (2019) on the Old English verse tradition, or in the already cited investigation by Forstall et al. (2011) on the supposed influence of Catullus on Paul the Deacon's writings. Nevertheless, rhythmic or prosodic features have also been employed in authorship analysis of prose text. ...
It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on so-called syllabic quantity, i.e., on the length of the involved syllables, and there is substantial evidence suggesting that certain authors had a preference for certain metric patterns over others. In this research we investigate the possibility to employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts. We test the impact of these features on the authorship attribution task when combined with other topic-agnostic features. Our experiments, carried out on three different datasets, using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
... To the best of my knowledge, this approach has only ever been tested sporadically. Two studies, conducted with small samples of Latin poetry (Forstall, Jacobson and Scheirer 2011) ...
https://versologie.cz/versification-authorship
Contemporary stylometry uses different methods to figure out a poem’s author based on features like the frequencies of words and character n-grams. However, there is one potential textual fingerprint it tends to ignore: versification. Using poetic corpora in three different languages (Czech, German and Spanish), this book asks whether versification features like rhythm patterns and types of rhyme can help determine authorship. It then tests its findings on two real-life unsolved literary mysteries. In the first, we distinguish the parts of the verse play The Two Noble Kinsmen written by William Shakespeare from those by his co-author, John Fletcher. In the second, we seek to solve a case of suspected forgery. How authentic was a group of poems first published as the work of the 19th-century Russian author Gavriil Stepanovich Batenkov?
... Computational analysis of literary intertextuality is typically treated as an information retrieval problem, as in the previous section. Here we consider an alternative framework of studying intertextuality through anomaly detection (Forstall et al., 2011). For this approach, we train word embeddings on highly restricted corpora, so that the resulting models capture aspects of authorial style. ...
... lacuna has been noted by Plecháč et al. (2019), who have begun to treat metrical features as a first-class citizen of the stylometry universe, and who are among the few researchers applying the most recent multivariate and machine-learning techniques. One final paper that deserves attention is Forstall et al. (2011), which examines metre, specifically in Latin poetry. Unfortunately, the methods used by this team treated syllable quantities in Latin simplistically as an n-gram (a string of n symbols, where each symbol may only be long or short), and they were unable to achieve useful separation between authors on this basis. ...
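Treating scanned quantities as a plain symbol string, as criticized above, amounts to something like the following sketch. The quantity string ('L' for a long syllable, 'S' for a short one) is illustrative, not a verified scansion of any particular line.

```python
from collections import Counter

def quantity_ngrams(scansion, n=3):
    # character n-grams over a string of syllable quantities
    return Counter(scansion[i:i + n] for i in range(len(scansion) - n + 1))

# hypothetical quantity string for one hexameter line
line = "LSSLSSLLLLLSSLL"
print(quantity_ngrams(line).most_common(2))
```

Any off-the-shelf n-gram classifier can then consume these counts, which is exactly the simplification the passage above finds inadequate for separating authors.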
This paper demonstrates that metre is a privileged indicator of authorial style in classical Latin hexameter poetry. Using only metrical features, pairwise classification experiments are performed between 5 first-century authors (10 comparisons) using four different machine-learning models. The results showed a two-label classification accuracy of at least 95% with samples as small as ten lines and no greater than eighty lines (up to around 500 words). These sample sizes are an order of magnitude smaller than those typically recommended for BOW ('bag of words') or n-gram approaches, and the reported accuracy is outstanding. Additionally, this paper explores the potential for novelty (forgery) detection, or 'one-class classification'. An analysis of the disputed Aldine Additamentum (Sil. Ital. Puni. 8:144-225) concludes (p=0.0013) that the metrical style differs significantly from that of the rest of the poem.
... Finally, we compare the performance of our models to that of human, professional poets. NLP for poetry has mainly focused on stylistic analysis (Hayward 1996; Kaplan and Blei 2007; He et al. 2007; Fang, Lo, and Chinn 2009; Greene, Bodrumlu, and Knight 2010; Genzel, Uszkoreit, and Och 2010; Kao and Jurafsky 2012) and poetry generation (Manurung, Ritchie, and Thompson 2012; Zhang and Lapata 2014; Ghazvininejad et al. 2016). Research on stylistics has focused on the features that make a poem come across as poetic (Kao and Jurafsky 2012); on quantifying poetic devices such as rhyme and meter (Hayward 1996; Greene, Bodrumlu, and Knight 2010; Genzel, Uszkoreit, and Och 2010); on evidence of intertextuality and how to prove stylistic influence between authors (Forstall, Jacobson, and Scheirer 2011); or on authorship and style attribution (Kaplan and Blei 2007; He et al. 2007; Fang, Lo, and Chinn 2009). These studies are examples of detecting statistical regularities in poetic language, potentially helping us to better understand and categorize poetic literature (Fabb 2006). ...
One dimension of modernist poetry is introducing entities in surprising contexts, such as wheelbarrow in Bob Dylan’s feel like falling in love with the first woman I meet/ putting her in a wheelbarrow. This paper considers the problem of teaching a neural language model to select poetic entities, based on local context windows. We do so by fine-tuning and evaluating language models on the poetry of American modernists, both on seen and unseen poets, and across a range of experimental designs. We also compare the performance of our poetic language model to human, professional poets. Our main finding is that, perhaps surprisingly, modernist poetry differs most from ordinary language when entities are concrete, like wheelbarrow, and while our fine-tuning strategy successfully adapts to poetic language in general, outperforming professional poets, the biggest error reduction is observed with concrete entities.
... Functional n-grams are short (typically syllable-length) substrings of natural language text (for example, the substring 'ab' in the sentence 'Abel elaborated about his intentions.'), which have proven useful in previous analyses of both English and Latin literary style, and for authorship attribution, as works by the same author tend to have similar phonetic profiles [9, 22–25]. In verse corpora, patterns of functional n-gram usage can reflect poetic sound play and aural effects. ...
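A functional n-gram profile of the kind described can be approximated with plain character n-gram frequencies. This is a simplified sketch under that assumption, not the cited authors' pipeline; the function name and normalization choices are illustrative.

```python
from collections import Counter

def char_ngram_profile(text, n=2):
    # relative frequencies of character n-grams over lowercased
    # letters and spaces: a crude phonetic profile of the text
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    return {g: c / len(grams) for g, c in Counter(grams).items()}

profile = char_ngram_profile("Abel elaborated about his intentions.")
print(round(profile["ab"], 4))  # → 0.0857 ('ab' occurs three times)
```

Comparing such profiles across texts (e.g. by cosine similarity) is one common way to test whether two works share a phonetic fingerprint.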
... In this case, the variety of features complements and enhances the more established focus on word usage and distribution, incorporating in addition phonetic, formulaic, rhythmic and metrical elements. In doing so, we exploit features that are known to play an important role in the specific tradition (for example, nominal compounds), as well as validate the extension of features that have proven useful for studying traditions in other languages (for example, functional n-grams and sense-pauses) to Old English 9,23,34 . In our analysis of Cynewulf, we show that a corpus-specific feature (nominal compounds) can be combined with a general-purpose stylometric technique (unsupervised learning with character n-grams) to provide broad-based support for the Cynewulfian authorship of Andreas. ...
The corpus of Old English verse is an indispensable source for scholars of the Indo-European tradition, early Germanic culture and English literary history. Although it has been the focus of sustained literary scholarship for over two centuries, Old English poetry has not been subjected to corpus-wide computational profiling, in part because of the sparseness and extreme fragmentation of the surviving material. Here we report a detailed quantitative analysis of the whole corpus that considers a broad range of features reflective of sound, metre and diction. This integrated examination of fine-grained features enabled us to identify salient stylistic patterns, despite the inherent limitations of the corpus. In particular, we provide quantitative evidence consistent with the unitary authorship of Beowulf and the Cynewulfian authorship of Andreas, shedding light on two longstanding questions in Old English philology. Our results demonstrate the usefulness of high-dimensional stylometric profiling for fragmentary literary traditions and lay the foundation for future studies of the cultural evolution of English literature.
... Of greater interest is how the spurious work differs from authentic writings and how its composition was influenced by the larger tradition. Recent studies have begun to repurpose stylometry to answer such literary critical questions (10, 35–39). Much of this research relies on the suitability of techniques of authorship attribution for addressing broader literary questions (40). ...
... Functional n-grams are short, syllable-length strings of characters, which can reflect ingrained authorial style and capture patterns of sound in poetry. Analysis of functional n-grams has proven useful for authorship attribution studies and has addressed literary questions in the postclassical reception history of the Roman poet Catullus (37). Although critics have long paid attention to specific aural effects and sound play in poetry, systematic studies have been infeasible without computational tabulation of n-grams. ...
... A primary challenge in the analysis of the citation database is the length of individual entries, many of which include only a few sentences. To generate meaningful feature statistics, we aggregated multiple citations into "bins" randomly and analyzed each bin as if it were a single passage (37). We set the bin size at 35 sentences, which was the minimum passage length for which we obtained consistent results (SI Appendix, Fig. S8). ...
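The binning procedure above can be sketched as follows. The entry contents, seed, and function name are placeholders; only the 35-sentence bin size comes from the passage itself.

```python
import random

def bin_citations(entries, bin_size=35, seed=0):
    # randomly order the citation entries, pool their sentences,
    # then cut the pool into fixed-size bins; each bin is later
    # analyzed as if it were a single passage
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    pool = [sent for entry in entries for sent in entry]
    return [pool[i:i + bin_size]
            for i in range(0, len(pool) - bin_size + 1, bin_size)]

# hypothetical entries, each a short list of sentences
entries = [[f"entry {i}, sentence {j}" for j in range(3)] for i in range(40)]
bins = bin_citations(entries)
print(len(bins), len(bins[0]))  # → 3 35
```

Leftover sentences that do not fill a whole bin are dropped here, which keeps every bin the same length at the cost of discarding a little data.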
Significance
Famous works of literature can serve as cultural touchstones, inviting creative adaptations in subsequent writing. To understand a poem, play, or novel, critics often catalog and analyze these intertextual relationships. The study of such relationships is challenging because intertextuality can take many forms, from direct quotation to literary imitation. Here, we show that techniques from authorship attribution studies, including stylometry and machine learning, can shed light on inexact literary relationships involving little explicit text reuse. We trace the evolution of features not tied to individual words across diverse corpora and provide statistical evidence to support interpretive hypotheses of literary critical interest. The significance of this approach is the integration of quantitative and humanistic methods to address aspects of cultural evolution.