Conference Paper

On the Feasibility of Automated Detection of Allusive Text Reuse

... The most widely used tools for the detection of Latin intertextuality, such as Tesserae and Diogenes, rely on lexical matching of repeated words or phrases (Coffee et al., 2012, 2013; Heslin, 2019). In addition to these core methods, other research has explored the use of sequence alignment (Chaudhuri et al., 2015; Chaudhuri and Dexter, 2017), semantic matching (Scheirer et al., 2016), and hybrid approaches (Moritz et al., 2016; Manjavacas et al., 2019) for Latin intertextual search, complementing related work on English (Smith et al., 2014; Zhang et al., 2014; Barbu and Trausan-Matu, 2017). Much NLP research on historical text reuse, including previous applications of Latin word embeddings, has focused on the Bible and other religious texts (Lee, 2007; Moritz et al., 2016; Bjerva and Praet, 2016; Manjavacas et al., 2019). As such, there is a clear need for enhanced computational methods for classical Latin literature. ...
... Following attempts to train word2vec models on unlemmatized corpora of Latin literature shortly after the method's introduction (Bamman; Bjerva and Praet, 2015) and the inclusion of Latin in large-scale multilingual releases of FastText and BERT (Grave et al., 2018; Devlin et al., 2019), in the past year there has been increased interest in the systematic optimization and evaluation of Latin embeddings. Spurred by the recent EvaLatin challenge (Sprugnoli et al., 2020), a number of Latin models have been trained for use in lemmatization and part-of-speech tagging (Bacon, 2020; Celano, 2020; Straka and Straková, 2020; Stoeckel et al., 2020), complementing new literary applications to Biblical text reuse and neo-Latin philosophy (Manjavacas et al., 2019; Bloem et al., 2020). In addition, Sprugnoli et al. (2019) recently introduced a synonym selection dataset, based on the TOEFL benchmark for English, which they used to evaluate word2vec and FastText models trained on the LASLA corpus of Latin literature. ...
... The best known of these tools (TRACER, Tesserae, and Passim) have also been tested for Latin: one of the most recent experiments is that of Franzini, Passarotti, Moritz and Büchler (2018), which offers a thorough exploration of HTRD (Historical Text Reuse Detection) tools. As for cosine similarity, Manjavacas et al. (2019) approached allusive text reuse detection on a Latin Biblical corpus from an Information Retrieval standpoint: making extensive use of cosine similarity scores and word embeddings, they found that custom query algorithms for automatic allusion detection were consistently outperformed by simpler TF-IDF models, and that cosine similarity can provide a sound basis for investigating textual reuse. Other studies, such as Bär et al. (2012) and Sturgeon (2018), employed cosine similarity and TF-IDF scores, among other approaches, for text reuse and similarity detection with good results, both on contemporary-language corpora (the former was tested on the METER corpus and the Webis Crowd Paraphrase corpus) and on historical-language corpora (the latter worked on an Early Chinese corpus). ...
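As a concrete illustration of the TF-IDF and cosine-similarity baseline these studies refer to, the sketch below uses scikit-learn on a toy corpus; the documents, the query, and all parameter choices are placeholders, not any of the original experimental setups.

```python
# Minimal sketch of a TF-IDF + cosine-similarity baseline for text reuse
# retrieval, in the spirit of the approaches discussed above. The toy
# "documents" and "query" below are placeholders, not the original corpora.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "arma virumque cano troiae qui primus ab oris",
    "bella per emathios plus quam civilia campos",
    "in nova fert animus mutatas dicere formas corpora",
]
query = ["arma que et virum cano"]  # candidate allusion (toy example)

vectorizer = TfidfVectorizer(analyzer="word")
doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform(query)

# Rank candidate sources by cosine similarity to the query passage.
scores = cosine_similarity(query_vec, doc_matrix)[0]
for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(rank, round(float(scores[idx]), 3), documents[idx])
```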
... The importance of lemmatisation when measuring cosine similarity for textual similarity is also confirmed by Manjavacas et al. (2019), who report that "lemmatization boosts the performance of nearly all models". ...
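Where that lemmatisation step sits in such a pipeline can be sketched as follows; the tiny lemma dictionary is purely hypothetical, standing in for a real Latin lemmatizer, and the point is only that inflected forms are collapsed before vectorization.

```python
# Illustrative only: a toy lemma dictionary standing in for a real Latin
# lemmatizer. Lemmatization happens before the TF-IDF / cosine-similarity
# step, collapsing inflected forms onto a shared lemma.
LEMMA_MAP = {"arma": "arma", "virumque": "vir", "virum": "vir",
             "cano": "cano", "canimus": "cano"}

def lemmatize(text: str) -> str:
    return " ".join(LEMMA_MAP.get(tok, tok) for tok in text.lower().split())

texts = ["Arma virumque cano", "virum canimus"]
print([lemmatize(t) for t in texts])   # ['arma vir cano', 'vir cano']
```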
Conference Paper
This paper describes the first experiments towards tracking the complex and international network of text reuse within the Early Modern (XV-XVII centuries) community of Neo-Latin humanists. Our research, conducted within the framework of the TransLatin project, aims at gaining more evidence on the topic of textual similarities and semiconscious reuse of literary models. It consists of two experiments carried out in two main research fields (Information Retrieval and Stylometry), as a means to a better understanding of the complex and subtle literary mechanisms underlying the drama production of Modern Age authors and their transnational network of relations. The experiments led to the construction of networks of works and authors that fashion different patterns of similarity and models of evolution and interaction between texts.
... In this context, an important question, which has however rarely been approached in previous research, addresses the level of agreement that expert annotators may reach. Still, as has been noted before (Manjavacas et al., 2019b), inter-annotator agreement studies of intertextuality are rare. The present paper starts by addressing this research question, but moves beyond it and further aims towards an examination and understanding of the contextual factors that may affect inter-annotator agreement in intertextuality research. ...
... The current version of this digital edition already contains a total of 6,689 manually identified biblical references. On the basis of the available annotations, we fine-tuned two text reuse retrieval algorithms: one using the local alignment algorithm Smith-Waterman (Smith and Waterman, 1981), with a bias towards verbatim cases, and another based on the Soft-Cosine similarity measure (Sidorov et al., 2014), which takes lexical similarity into account using word embeddings and has been used in a previous study on Bernard (Manjavacas et al., 2019b). These algorithms were then applied to the remaining dataset in order to find references potentially overlooked by the editors. ...
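For readers unfamiliar with the first of these two components, a compact word-level Smith-Waterman sketch is given below; the match, mismatch, and gap scores are arbitrary illustrative values, not those used in the edition described above, and only the score is computed (a full implementation would also keep traceback pointers to recover the aligned spans).

```python
# Word-level Smith-Waterman local alignment (illustrative scoring values).
# Returns the best local alignment score between two token sequences,
# which can be used to flag near-verbatim reuse such as biblical quotations.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

source = "non est propheta sine honore nisi in patria sua".split()
target = "propheta sine honore nisi in patria".split()
print(smith_waterman(source, target))  # 12 with the toy scores above
```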
... As many parallels involve repetition or adaptation of short phrases, the study of Latin intertextuality is well-suited to computational approaches. Several foundational tools for corpus and intertextual search, including Diogenes, Tesserae, and TRACER, are now standard resources in the field (Heslin, 2019; Coffee et al., 2012; Moritz et al., 2016), and computational intertextual criticism of Latin literature continues to be an active topic of research (Bernstein et al., 2015; Burns, 2017; Dexter et al., 2017; Forstall & Scheirer, 2019; Manjavacas et al., 2019). As part of our ongoing work on the quantitative study of Latin intertextuality, we have created a benchmark dataset of intertextual parallels in Latin epic, which can be used for thorough and consistent evaluation of different search methods. ...
... For an excellent overview, see Thompson (1993, p. 248); see also Jauss (1985). On the possibilities of automated detection of allusive text reuse, see Manjavacas, Long and Kestemont (2019). ...
Article
Text similarity analysis entails studying identical and closely similar text passages across large corpora, with a particular focus on intentional and unintentional borrowing patterns. At a larger scale, detecting repeated passages takes on added importance, as the same text can convey different meanings in different contexts. This approach offers numerous benefits, enhancing intellectual and literary scholarship by simplifying the identification of textual overlaps. Consequently, scholars can focus on the theoretical aspects of reception with an expanded corpus of evidence at their disposal. This article adds to the expanding field of historical text reuse, applying it to intellectual history and showcasing its utility in examining reception, influence, popularity, authorship attribution, and the development of tools for critical editions. Focused on the works and various editions of Bernard Mandeville (1670–1733), the research applies comparative text similarity analysis to explore his borrowing habits and the reception of his works. Systematically examining text reuses across several editions of Mandeville’s works, it provides insights into the evolution of his output and influences over time. The article adopts a forward-looking perspective in historical research, advocating for the integration of archival and statistical evidence. This is illustrated through a detailed examination of the attribution of Publick Stews to Mandeville. Analysing cumulative negative evidence of borrowing patterns suggests that Mandeville might not have been the author of the piece. However, the article aims not to conclude the debate but rather to open it up, underscoring the importance of taking such evidence into consideration. Additionally, it encourages scholars to incorporate text reuse evidence when exploring other cases in early modern scholarship. This highlights the adaptability and scalability of text similarity analysis as a valuable tool for advancing literary studies and intellectual history.
... In the design of TRACER, Büchler et al. (2014) have addressed this subtlety of text reuse in literary texts by giving users access to a wide array of Information Retrieval (IR) algorithms, as well as direct access to the tool's output at each step of the processing chain. More recent studies have investigated the usefulness of sentence and word embeddings, especially with respect to detecting these more allusive forms of text reuse (Manjavacas et al., 2019; Liebl and Burghardt, 2020), finding that they do not bring substantial advantages over traditional IR techniques. ...
Article
Text Reuse reveals meaningful reiterations of text in large corpora. Humanities researchers use text reuse to study, e.g., the posterior reception of influential texts or to reveal evolving publication practices of historical media. This research is often supported by interactive visualizations which highlight relations and differences between text segments. In this paper, we build on earlier work in this domain. We present impresso Text Reuse at Scale, to our knowledge the first interface which integrates text reuse data with other forms of semantic enrichment to enable a versatile and scalable exploration of intertextual relations in historical newspaper corpora. The Text Reuse at Scale interface was developed as part of the impresso project and combines powerful search and filter operations with close and distant reading perspectives. We integrate text reuse data with enrichments derived from topic modeling, named entity recognition and classification, language and document type detection, as well as a rich set of newspaper metadata. We report on historical research objectives and common user tasks for the analysis of historical text reuse data and present the prototype interface together with the results of a user evaluation.
... Existing research has explored matching words or stems (Coffee et al., 2012) as well as methods that focus on semantics (Scheirer et al., 2014). Additionally, techniques that combine both lexical and semantic elements have been examined, where semantic understanding is established through word embeddings (Manjavacas et al., 2019) or via the (Ancient Greek) WordNet (Bizzoni et al., 2014). While the majority of preceding studies have concentrated on detecting text reuse in the Bible and various religious texts, Burns et al. (2021) focus on Classical Latin literature. ...
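One way to picture such a lexical-plus-semantic combination is the schematic scorer below, which blends exact lexical overlap with an embedding-based cosine similarity; the random vectors, the toy vocabulary, and the weighting are placeholders rather than the method of any of the studies cited above.

```python
# Schematic hybrid score: exact lexical overlap (Jaccard) blended with an
# embedding-based cosine similarity. Vectors here are random placeholders;
# a real system would load trained Latin (or Greek) word embeddings.
import numpy as np

rng = np.random.default_rng(0)
EMB = {w: rng.normal(size=50) for w in
       "arma vir cano bellum canto dux".split()}

def sent_vec(tokens):
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def hybrid_score(a, b, alpha=0.5):
    a, b = a.split(), b.split()
    jaccard = len(set(a) & set(b)) / len(set(a) | set(b))
    va, vb = sent_vec(a), sent_vec(b)
    cos = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))
    return alpha * jaccard + (1 - alpha) * cos

print(hybrid_score("arma vir cano", "bellum dux canto"))
```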
Preprint
Intertextual allusions hold a pivotal role in Classical Philology, with Latin authors frequently referencing Ancient Greek texts. Until now, the automatic identification of these intertextual references has been constrained to monolingual approaches, seeking parallels solely within Latin or Greek texts. In this study, we introduce SPhilBERTa, a trilingual Sentence-RoBERTa model tailored for Classical Philology, which excels at cross-lingual semantic comprehension and identification of identical sentences across Ancient Greek, Latin, and English. We generate new training data by automatically translating English texts into Ancient Greek. Further, we present a case study, demonstrating SPhilBERTa's capability to facilitate automated detection of intertextual parallels. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.
... The resulting models were not tuned or evaluated for Latin. Manjavacas et al. (2019) applied fastText to the same data to create embeddings for the task of semantic information retrieval, also without tuning, finding that more basic BOW methods outperform it and that fastText outperforms Word2vec. The only study we are aware of that includes an evaluation of Latin word embeddings is by Sprugnoli et al. (2019), who create lemma embeddings from a manually annotated corpus of Classical Latin, the 1.7M token Opera Latina corpus, which includes manually created lemmatization. ...
Conference Paper
We address the problem of creating and evaluating quality Neo-Latin word embeddings for the purpose of philosophical research, adapting the Nonce2Vec tool to learn embeddings from Neo-Latin sentences. This distributional semantic modeling tool can learn from tiny data incrementally, using a larger background corpus for initialization. We conduct two evaluation tasks: definitional learning of Latin Wikipedia terms, and learning consistent embeddings from 18th century Neo-Latin sentences pertaining to the concept of mathematical method. Our results show that consistent Neo-Latin word embeddings can be learned from this type of data. While our evaluation results are promising, they do not reveal to what extent the learned models match domain expert knowledge of our Neo-Latin texts. Therefore, we propose an additional evaluation method, grounded in expert-annotated data, that would assess whether learned representations are conceptually sound in relation to the domain of study.
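The incremental setup described above (a larger background corpus for initialization, then learning from tiny data) can be loosely illustrated with gensim's vocabulary-update mechanism; this is not Nonce2Vec itself, and the corpora below are toy placeholders.

```python
# Not Nonce2Vec: a rough gensim sketch of the underlying idea of initializing
# embeddings on a background corpus and then updating them incrementally from
# a handful of new (e.g. Neo-Latin) sentences.
from gensim.models import Word2Vec

background = [["arma", "virumque", "cano"],
              ["bella", "per", "emathios", "campos"]] * 50
tiny_new_data = [["methodus", "mathematica", "demonstrat"]]

model = Word2Vec(background, vector_size=50, min_count=1, epochs=5)
model.build_vocab(tiny_new_data, update=True)          # add new vocabulary
model.train(tiny_new_data, total_examples=len(tiny_new_data), epochs=5)
print(model.wv["methodus"][:5])
```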
Article
This paper explores the reception of classical works in Early Modern Britain during the hand press era, between the 1470s and 1790s. It investigates canon formation, knowledge transmission, and the integration of digital archives in quantitative book history. The study quantitatively maps changing perceptions of the classical canon across time, offering a panoramic view of ‘shifting canons’. The analysis is based on three data archives: the English Short Title Catalogue (ESTC), Early English Books Online (EEBO), and Eighteenth Century Collections Online (ECCO). We conclude that we can observe a “canonization” of the set of classical authors printed in Early Modern England, which is reflected in a significant loss of diversity in publications, despite a general increase of the publication of classical works. Preferences also shift, with ancient Greek authors of the early centuries gaining significantly more space in the 18th century. This finding however is balanced by the observation that the circulation of Ancient Greek editions in the original language does not increase during this time. This multidimensional approach contributes to a comprehensive understanding of the reception of Classics in Early Modern Britain, shedding light on cultural and intellectual transformations.
Article
The increasing capacities of large language models (LLMs) have been shown to present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, by automating complex qualitative tasks otherwise typically carried out by human researchers. While numerous benchmarking studies have assessed the analytic prowess of LLMs, there is less focus on operationalizing this capacity for inference and hypothesis testing. Addressing this challenge, a systematic framework is argued for here, building on mixed methods quantitizing and converting design principles, and feature analysis from linguistics, to transparently integrate human expertise and machine scalability. Replicability and statistical robustness are discussed, including how to incorporate machine annotator error rates in subsequent inference. The approach is discussed and demonstrated in over a dozen LLM-assisted case studies, covering nine diverse languages, multiple disciplines and tasks, including analysis of themes, stances, ideas, and genre compositions; linguistic and semantic annotation, interviews, text mining and event cause inference in noisy historical data, literary social network construction, metadata imputation, and multimodal visual cultural analytics. Using hypothesis-driven topic classification instead of “distant reading” is discussed. The replications among the experiments also illustrate how tasks previously requiring protracted team effort or complex computational pipelines can now be accomplished by an LLM-assisted scholar in a fraction of the time. Importantly, the approach is not intended to replace, but to augment and scale researcher expertise and analytic practices. With these opportunities in sight, qualitative skills and the ability to pose insightful questions have arguably never been more critical.
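As a small illustration of the point about incorporating machine annotator error rates into subsequent inference, the snippet below applies a Rogan-Gladen-style correction to an observed proportion; this is a generic textbook adjustment offered for intuition, not the framework's own formula, and the numbers are invented.

```python
# Illustration only (not the paper's own formula): correcting an observed
# positive rate for known annotator error rates, in the spirit of feeding
# machine-annotator reliability into downstream inference.
def corrected_prevalence(p_obs, sensitivity, specificity):
    # Rogan-Gladen-style correction; assumes sensitivity + specificity > 1.
    return (p_obs + specificity - 1) / (sensitivity + specificity - 1)

print(corrected_prevalence(p_obs=0.30, sensitivity=0.9, specificity=0.85))  # 0.2
```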
Article
This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks—but that their use makes the interpretation of the value of the coefficient even harder.
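For concreteness, Cohen's kappa, one of the coefficients surveyed above, can be computed by hand as follows; the two annotators and their toy allusion/no labels are invented for illustration (scikit-learn's cohen_kappa_score gives the same value).

```python
# Cohen's kappa for two annotators on toy binary labels:
# kappa = (observed agreement - expected agreement) / (1 - expected agreement)
from collections import Counter

def cohen_kappa(ann1, ann2):
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["allusion", "no", "allusion", "no", "allusion", "no"]
ann2 = ["allusion", "no", "no", "no", "allusion", "allusion"]
print(round(cohen_kappa(ann1, ann2), 3))  # 0.333 for these toy labels
```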
Article
We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) but still have much in common: for example, words “play” and “game” are different but related. When there is no similarity between features then our soft similarity measure is equal to the standard similarity. For this, we generalize the well-known cosine similarity measure in VSM by introducing what we call “soft cosine measure”. We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider for VSM a new feature space consisting of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that our soft cosine measure provides better performance in our case study: entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.
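The core of the proposal is compact enough to sketch directly: given a feature-to-feature similarity matrix S, the soft cosine replaces each dot product with a bilinear form. The matrix and vectors below are toy values; with S equal to the identity the measure reduces to the ordinary cosine.

```python
# Soft cosine between feature vectors a and b given a feature-to-feature
# similarity matrix S: sum_ij S_ij a_i b_j, normalized by the corresponding
# "soft" norms. With S = identity this reduces to the ordinary cosine.
import numpy as np

def soft_cosine(a, b, S):
    num = a @ S @ b
    den = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return float(num / den)

# Toy example: features 0 and 1 (say, "play" and "game") are partially similar.
S = np.array([[1.0, 0.6, 0.0],
              [0.6, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
a = np.array([1.0, 0.0, 1.0])   # document mentioning "play"
b = np.array([0.0, 1.0, 1.0])   # document mentioning "game"
print(soft_cosine(a, b, S), soft_cosine(a, b, np.eye(3)))  # 0.8 vs 0.5
```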
Article
Tesserae is a web-based tool for automatically detecting allusions in Latin poetry. Although still in the start-up phase, it already is capable of identifying significant numbers of known allusions, as well as similar numbers of allusions previously unnoticed by scholars. In this article, we use the tool to examine allusions to Vergil’s Aeneid in the first book of Lucan’s Civil War. Approximately 3,000 linguistic parallels returned by the program were compared with a list of known allusions drawn from commentaries. Each was examined individually and graded for its literary significance, in order to benchmark the program’s performance. All allusions from the program and commentaries were then pooled in order to examine broad patterns in Lucan’s allusive techniques which were largely unapproachable without digital methods. Although Lucan draws relatively constantly from Vergil’s generic language in order to maintain the epic idiom, this baseline is punctuated by clusters of pointed allusions, in which Lucan frequently subverts Vergil’s original meaning. These clusters not only attend the most significant characters and events but also play a role in structuring scene transitions. Work is under way to incorporate the ability to match on word meaning, phrase context, as well as metrical and phonological features into future versions of the program.
Article
The TREC-8 Question Answering track was the first large-scale evaluation of domain-independent question answering systems. This paper summarizes the results of the track by giving a brief overview of the different approaches taken to solve the problem. The most accurate systems found a correct response for more than 2/3 of the questions. Relatively simple bag-of-words approaches were adequate for finding answers when responses could be as long as a paragraph (250 bytes), but more sophisticated processing was necessary for more direct responses (50 bytes).
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
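In gensim's implementation, the two architectures proposed above correspond to the sg flag (0 for CBOW, 1 for skip-gram); the corpus below is a toy placeholder used only to show the call signatures.

```python
# The two word2vec architectures map onto gensim's sg flag:
# sg=0 trains CBOW, sg=1 trains skip-gram. Corpus here is a toy placeholder.
from gensim.models import Word2Vec

toy_corpus = [["rex", "regina", "regnum"], ["miles", "bellum", "gladius"]] * 100
cbow = Word2Vec(toy_corpus, vector_size=50, window=3, sg=0, min_count=1)
skipgram = Word2Vec(toy_corpus, vector_size=50, window=3, sg=1, min_count=1)
print(cbow.wv.most_similar("rex", topn=2))
print(skipgram.wv.most_similar("rex", topn=2))
```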
Chapter
This chapter touches upon several issues in the calculation and assessment of inter-annotator agreement. It gives an introduction to the theory behind agreement coefficients and examples of their application to linguistic annotation tasks. Specific examples explore variation in annotator performance due to heterogeneous data, complex labels, item difficulty, and annotator differences, showing how global agreement coefficients may mask these sources of variation, and how detailed agreement studies can give insight into both the annotation process and the nature of the underlying data. The chapter also reviews recent work on using machine learning to exploit the variation among annotators and learn detailed models from which accurate labels can be inferred. I therefore advocate an approach where agreement studies are not used merely as a means to accept or reject a particular annotation scheme, but as a tool for exploring patterns in the data that are being annotated.
Article
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
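Because subword information lets the model produce vectors for unseen inflected forms, which matters for morphologically rich languages such as Latin, a minimal gensim FastText sketch is shown below; the toy corpus, the n-gram range, and the out-of-vocabulary form are illustrative assumptions.

```python
# FastText represents each word as a bag of character n-grams, so vectors can
# be produced even for out-of-vocabulary inflected forms. Toy corpus only.
from gensim.models import FastText

toy_corpus = [["cano", "canimus", "canit"], ["amo", "amamus", "amat"]] * 100
model = FastText(toy_corpus, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6)
# "cantamus" never occurs in the corpus, but shares n-grams with seen forms.
print(model.wv["cantamus"][:5])
print(model.wv.similarity("canimus", "cantamus"))
```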
Article
Texts propagate through many social networks and provide evidence for their structure. We describe and evaluate efficient algorithms for detecting clusters of reused passages embedded within longer documents in large collections. We apply these techniques to two case studies: analyzing the culture of free reprinting in the nineteenth-century United States and the development of bills into legislation in the U.S. Congress. Using these divergent case studies, we evaluate both the efficiency of the approximate local text reuse detection methods and the accuracy of the results. These techniques allow us to explore how ideas spread, which ideas spread, and which subgroups shared ideas.
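The general shingling-and-indexing idea behind this kind of passage-level reuse detection can be sketched as follows; this is a rough illustration of the approach rather than the paper's exact pipeline, and the documents are toy examples.

```python
# Rough sketch of the shingling idea behind passage-level reuse detection:
# index word n-grams, then treat documents sharing several n-grams as
# candidate reuse pairs for finer alignment. Not the paper's exact pipeline.
from collections import defaultdict
from itertools import combinations

def shingles(tokens, n=4):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

docs = {
    "d1": "the quick brown fox jumps over the lazy dog".split(),
    "d2": "a quick brown fox jumps over the sleeping cat".split(),
    "d3": "completely unrelated sentence about congressional bills".split(),
}

index = defaultdict(set)                    # n-gram -> documents containing it
for doc_id, tokens in docs.items():
    for sh in shingles(tokens, n=4):
        index[sh].add(doc_id)

pair_counts = defaultdict(int)              # shared-shingle counts per pair
for doc_ids in index.values():
    for pair in combinations(sorted(doc_ids), 2):
        pair_counts[pair] += 1

print({pair: c for pair, c in pair_counts.items() if c >= 2})
```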
Article
In literary study, intertextuality refers to the reuse of text, where new meaning or novel stylistic effects have been generated. Most typically in the digital humanities, algorithms for intertextual analysis search for approximate lexical correspondence that can be described as paraphrase. In this article, we look at a complementary approach that more closely captures the behavior of the reader when faced with meaningful connections between texts in the absence of words that have the same form or stem, which constrains the match to semantics. The technique we employ for identifying such semantic intertextuality is the popular natural language processing strategy of semantic analysis. Unlike the typical scenario for semantic analysis, where a corpus of long form documents is available, we examine the far more limited textual fragments that embody intertextuality. We are primarily concerned with texts from antiquity, where small phrases or passages often form the locus of comparison. In this vein, we look at a specific case study of established parallels between book 1 of Lucan’s Civil War and all of Vergil’s Aeneid. Applying semantic analysis over these texts, we are able to recover parallels that lexical matching cannot, as well as discover new and interesting thematic matches between the two works.
Article
The study of intertextuality, or how authors make artistic use of other texts in their works, has a long tradition, and has in recent years benefited from a variety of applications of digital methods. This article describes an approach for detecting the sorts of intertexts that literary scholars have found most meaningful, as embodied in the free Tesserae website http://tesserae.caset.buffalo.edu/. Tests of Tesserae Versions 1 and 2 showed that word-level n-gram matching could recall a majority of parallels identified by scholarly commentators in a benchmark set. But these versions lacked precision, so that the meaningful parallels could be found only among long lists of those that were not meaningful. The Version 3 search described here adds a second stage scoring system that sorts the found parallels by a formula accounting for word frequency and phrase density. Testing against a benchmark set of intertexts in Latin epic poetry shows that the scoring system overall succeeds in ranking parallels of greater significance more highly, allowing site users to find meaningful parallels more quickly. Users can also choose to adjust both recall and precision by focusing only on results above given score levels. As a theoretical matter, these tests establish that lemma identity, word frequency, and phrase density are important constituents of what make a phrase parallel a meaningful intertext.
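A heavily simplified scorer in the spirit of the frequency-and-density ranking described above is sketched below; it is schematic only, with invented frequencies and spans, and the actual Tesserae Version 3 formula differs in its details.

```python
# Schematic only: a scorer that rewards rare shared words and penalizes wide
# spacing between them, in the spirit of the frequency-and-density ranking
# described above. The exact Tesserae formula differs in its details.
import math

def schematic_score(shared, freq, span_a, span_b):
    """shared: matched lemmas; freq: corpus frequency per lemma;
    span_a/span_b: distance in words between matches in each phrase."""
    rarity = sum(1.0 / freq[w] for w in shared)
    return math.log(rarity / (span_a + span_b))

freq = {"litus": 40, "harena": 15, "unda": 120}   # invented frequencies
print(schematic_score({"litus", "harena"}, freq, span_a=3, span_b=5))
```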
Chapter
Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, thus adding a complication to text-reuse detection. Furthermore, it increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning techniques to the text. In Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when considering a verse—such as book Genesis, Chapter 1, Verse 1—that is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in the wording, but not in the meaning. Secondly, this chapter explains and evaluates a way of extracting paradigmatic relations. However, as regards historical languages, there is a lack of language resources (for example, WordNet) that makes non-literal text re-use and paraphrases much more difficult to identify. These differences are present in the form of replacements, corrections, varying writing styles, etc. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a “single run” detection to an iterative process by using the acquired relations to run a new task.
Article
We present here a method for automatically discovering several classes of text reuse across different languages, from the most similar (translations) to the most oblique (literary allusions). Allusions are an important subclass of reuse because they involve the appropriation of isolated words and phrases within otherwise unrelated sentences, so that traditional methods of identifying reuse including topical similarity and translation models do not apply. To evaluate this work we have created (and publicly released) a test set of literary allusions between John Milton's Paradise Lost and Vergil's Aeneid; we find that while the baseline discovery of translations (55.0% F-measure) far surpasses the discovery of allusions (4.8%), its ability to expedite the traditional work of humanities scholars makes it a line of research strongly worth pursuing.
Article
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
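A miniature version of this latent semantic indexing pipeline, a term-document count matrix reduced by truncated SVD, is sketched below with scikit-learn; the toy corpus and the choice of two components are illustrative only.

```python
# Latent semantic indexing in miniature: build a term-document matrix and
# project it onto a small number of singular-value components. The toy corpus
# and the choice of 2 components are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "ship boat ocean voyage",
    "boat ocean sailing voyage",
    "planet galaxy star orbit",
    "star orbit telescope astronomy",
]
X = CountVectorizer().fit_transform(docs)          # term-document counts
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)                 # documents in latent space
print(doc_vectors.round(2))
```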
Conference Paper
We describe here a method for automatically identifying word sense variation in a dated collection of historical books in a large digital library. By leveraging a small set of known translation book pairs to induce a bilingual sense inventory and labeled training data for a WSD classifier, we are able to automatically classify the Latin word senses in a 389 million word corpus and track the rise and fall of those senses over a span of two thousand years. We evaluate the performance of seven different classifiers both in a tenfold test on 83,892 words from the aligned parallel corpus and on a smaller, manually annotated sample of 525 words, measuring both the overall accuracy of each system and how well that accuracy correlates (via mean square error) to the observed historical variation.
Conference Paper
We propose a computational model of text reuse tailored for ancient literary texts, available to us often only in small and noisy samples. The model takes into account source alternation patterns, so as to be able to align even sentences with low surface similarity. We demonstrate its ability to characterize text reuse in the Greek New Testament.
Article
Categorization and taxonomy are topical issues in intertextuality studies. Instead of increasing the number of overlapping or contradictory definitions (often established with reference to limited databases) which exist even for key concepts such as "allusion" or "quotation", we propose an electronically implemented data-driven approach based on the isolation, analysis and description of a number of relevant parameters such as general text relation, marking for quotation, modification etc. If a systematic parameter analysis precedes discussions of possible correlations and the naming of feature bundles as composite categories, a dynamic approach to categorization emerges which does justice to the varied and complex phenomena in this field. The database is the HyperHamlet corpus, a chronologically and generically wide-ranging collection of Hamlet references that confront linguistic and literary researchers with a comprehensive range of formal and stylistic issues. Its multi-dimensional encodings and search facilities provide the indispensable 'freedom from the analytic limits of hardcopy', as Jerome McGann put it. The methodological and heuristic gains include a more complete description of possible parameter settings, a clearer recognition of multiple parameter settings (as implicit in existing genre definitions), a better understanding of how parameters interact, descriptions of disregarded literary phenomena that feature unusual parameter combinations and, finally, descriptive labels for the most polysemous areas that may clarify matters without increasing taxonomical excess.
Book
Class-tested and coherent, this textbook teaches classical and web information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures.
David Bamman and Gregory Crane. 2008. The logic and discovery of textual allusion. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data.

Marco Büchler. 2013. Informationstechnische Aspekte des Historical Text Re-use. Ph.D. thesis, Universität Leipzig.

Alexander Budanitsky and Graeme Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, volume 2, page 2.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957-966.

Dawid Laszuk. 2017. Python implementation of Empirical Mode Decomposition algorithm.

Stefano Minozzi. 2010. The Latin WordNet project. In Akten des 15. Internationalen Kolloquiums zur Lateinischen Linguistik, pages 707-716, Innsbruck. Institut für Sprachen und Literaturen der Universität Innsbruck, Bereich Sprachwissenschaft.

Maria Moritz, Johannes Hellrich, and Sven Buechel. 2018. A Method for Human-Interpretable Paraphrasticality Prediction. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 113-118.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.