Chapter

Towards a Historical Text Re-use Detection


Abstract

Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, thus adding a complication to text re-use detection. Furthermore, it increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning techniques to the text. In the Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when considering a verse—such as the book of Genesis, Chapter 1, Verse 1—that is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in the wording, but not in the meaning. Secondly, this chapter explains and evaluates a way of extracting paradigmatic relations. However, as regards historical languages, there is a lack of language resources (for example, WordNet) that makes non-literal text re-use and paraphrases much more difficult to identify. These differences are present in the form of replacements, corrections, varying writing styles, etc. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a "single run" detection to an iterative process by using the acquired relations to run a new task.
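To make the verse-based setup concrete, the sketch below pairs the same verse identifier across two editions and scores each pair with a simple token-overlap measure. It is an illustrative toy, not the chapter's actual detection pipeline; the edition texts, the verse key and the 0.5 threshold are invented for the example.

```python
# Toy sketch of verse-aligned re-use scoring across two Bible editions.
# Edition texts, the verse key and the 0.5 threshold are invented examples.

def tokens(text):
    """Lowercase word tokens with surrounding punctuation stripped."""
    return [w.strip(".,;:!?") for w in text.lower().split()]

def overlap(a, b):
    """Jaccard overlap of the two token sets (1.0 = identical vocabulary)."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

edition_a = {("Genesis", 1, 1): "In the beginning God created the heaven and the earth."}
edition_b = {("Genesis", 1, 1): "In the beginning God created the heavens and the earth."}

# Because the editions are verse-aligned, each shared key is treated as a
# candidate paraphrase pair and scored.
for key in edition_a.keys() & edition_b.keys():
    score = overlap(edition_a[key], edition_b[key])
    print(key, round(score, 2), "re-use" if score >= 0.5 else "no match")
```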


... While stylometry focuses on authorship attribution and stylistic analysis at the macro-level, as demonstrated by the stylo package, text reuse detection tools offer a micro-level approach, detecting individual instances of text reuse such as quotations and allusions among texts. Specifically, TRACER (Büchler, 2013; Büchler et al., 2014) is a text reuse detection tool that has been successfully applied to study intertextuality in ancient Greek (Buechler et al., 2008; Büchler et al., 2010), Latin (Franzini et al., 2018b), Coptic texts (Miyagawa, 2022, 2021; Miyagawa et al., 2018), Classical Tibetan (Almogi et al., 2019), German (Franzini et al., 2018a), etc. ...
... We use cosine similarity as the similarity metric. For text reuse detection, we use TRACER (Büchler, 2013; Büchler et al., 2014), which has been successfully applied to study intertextuality in various ancient language corpora. It provides a Java implementation to detect different types of text reuse such as quotations, allusions and idioms. ...
... However, despite the increasing interest and adoption of text reuse analysis in digital humanities, a systematic approach to thoroughly explore an author's entire body of work, as we have done, has not been extensively undertaken until now. Although we believe that working with methods of text reuse is a natural step in building critical editions (as we demonstrate in this paper), and a natural [...] There is a growing body of work on text reuse in digital humanities: see Gladstone and Cooney (2020); Roe, Olsen and Morrissey (2022); O'Neill et al. (2021); Cordell (2015); Büchler et al. (2014). For ARTFL and related work see Horton, Roe and Olsen (2010); Cooney et al. (2008); Roe (2012); Edelstein, Morrissey and Roe (2013); Roe (2018). For newspapers and virality, see Salmi et al. (2021); Smith, Cordell and Dillon (2013). ...
... Naturally, the sequences found cover a wide range of uses. Taking inspiration from a similar version created by the eTrap project, we created a taxonomy of the text reuse (Figure 3) found by the BLAST method (Büchler et al. 2014). The sequences can be mostly explained under a small number of categories. ...
Article
Full-text available
Text similarity analysis entails studying identical and closely similar text passages across large corpora, with a particular focus on intentional and unintentional borrowing patterns. At a larger scale, detecting repeated passages takes on added importance, as the same text can convey different meanings in different contexts. This approach offers numerous benefits, enhancing intellectual and literary scholarship by simplifying the identification of textual overlaps. Consequently, scholars can focus on the theoretical aspects of reception with an expanded corpus of evidence at their disposal. This article adds to the expanding field of historical text reuse, applying it to intellectual history and showcasing its utility in examining reception, influence, popularity, authorship attribution, and the development of tools for critical editions. Focused on the works and various editions of Bernard Mandeville (1670–1733), the research applies comparative text similarity analysis to explore his borrowing habits and the reception of his works. Systematically examining text reuses across several editions of Mandeville’s works, it provides insights into the evolution of his output and influences over time. The article adopts a forward-looking perspective in historical research, advocating for the integration of archival and statistical evidence. This is illustrated through a detailed examination of the attribution of Publick Stews to Mandeville. Analysing cumulative negative evidence of borrowing patterns suggests that Mandeville might not have been the author of the piece. However, the article aims not to conclude the debate but rather to open it up, underscoring the importance of taking such evidence into consideration. Additionally, it encourages scholars to incorporate text reuse evidence when exploring other cases in early modern scholarship. This highlights the adaptability and scalability of text similarity analysis as a valuable tool for advancing literary studies and intellectual history.
... Methods for TRD are shaped by the disciplines in which they emerged. Since text reuse in literary texts is often more subtle than the mere repetition of words (e.g., in the case of paraphrase, allusion, translation, or parody), researchers strive to go beyond lexical similarities in order to capture affinities in syntax, content, or metrical structure (Büchler et al., 2014; Scheirer et al., 2016; Moritz and Steding, 2018). In the design of TRACER, Büchler et al. (2014) have addressed this subtlety of text reuse in literary texts by giving users access to a wide array of Information Retrieval (IR) algorithms, as well as direct access to the tool's output at each step of the processing chain. ...
... Since text reuse in literary texts is often more subtle than the mere repetition of words (e.g., in the case of paraphrase, allusion, translation, or parody), researchers strive to go beyond lexical similarities in order to capture affinities in syntax, content, or metrical structure (Büchler et al., 2014; Scheirer et al., 2016; Moritz and Steding, 2018). In the design of TRACER, Büchler et al. (2014) have addressed this subtlety of text reuse in literary texts by giving users access to a wide array of Information Retrieval (IR) algorithms, as well as direct access to the tool's output at each step of the processing chain. More recent studies have investigated the usefulness of sentence and word embeddings, especially with respect to detecting these more allusive forms of text reuse (Manjavacas et al., 2019; Liebl and Burghardt, 2020), finding that they do not bring substantial advantages over traditional IR techniques. ...
Article
Full-text available
Text Reuse reveals meaningful reiterations of text in large corpora. Humanities researchers use text reuse to study, e.g., the posterior reception of influential texts or to reveal evolving publication practices of historical media. This research is often supported by interactive visualizations which highlight relations and differences between text segments. In this paper, we build on earlier work in this domain. We present impresso Text Reuse at Scale, to our knowledge the first interface which integrates text reuse data with other forms of semantic enrichment to enable a versatile and scalable exploration of intertextual relations in historical newspaper corpora. The Text Reuse at Scale interface was developed as part of the impresso project and combines powerful search and filter operations with close and distant reading perspectives. We integrate text reuse data with enrichments derived from topic modeling, named entity recognition and classification, language and document type detection, as well as a rich set of newspaper metadata. We report on historical research objectives and common user tasks for the analysis of historical text reuse data and present the prototype interface together with the results of a user evaluation.
... Detecting plagiarism in documents is similar to identifying text reuse in historical documents [21]. These plagiarism detection methods typically narrow the search space of all documents using vector similarity and then perform lexical analysis on the smaller subset of documents [22]. ...
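The two-stage idea mentioned here (a cheap vector-similarity pass that narrows the candidate set, followed by stricter lexical analysis) can be sketched roughly as follows; the documents, the 0.3 cut-off and the trigram check are invented placeholders, not the methods of [21] or [22].

```python
# Rough sketch of a two-stage filter: vector similarity narrows the search
# space, then a lexical n-gram check runs on the smaller subset.
# Documents and thresholds are invented for illustration.
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def word_ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

query = "in the beginning god created the heaven and the earth"
corpus = {
    "doc1": "in the beginning god created the heavens and the earth",
    "doc2": "an entirely unrelated sentence about something else",
}

# Stage 1: keep only documents whose bag-of-words vector is close to the query.
candidates = {d: t for d, t in corpus.items() if cosine(bow(query), bow(t)) > 0.3}

# Stage 2: lexical analysis (shared word trigrams) on the reduced set only.
for doc, text in candidates.items():
    print(doc, len(word_ngrams(query) & word_ngrams(text)), "shared trigrams")
```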
Article
Full-text available
Text reuse is of fundamental importance in humanities research, as near-verbatim pieces of text in different documents provide invaluable information about the historical spread, evolution of ideas and composition of cultural artifacts. Traditionally, scholars have studied text reuse at a very small scale, for example, when comparing the writings of two philosophers; however, modern digitized corpora spanning entire centuries promise to revolutionize humanities research through the detection of previously unobserved large-scale patterns. This paper presents insights from ReceptionReader, a system for large-scale text reuse analysis over almost all known 18th-century books, articles, and newspapers. The system implements a data management pipeline for billions of text reuse instances and supports analysis tasks based on database queries (e.g., retrieving the most reused quotes from queried documents). The paper describes the principled and extensive evaluations across different normalization levels, query execution engines, and queries of interest that led to an optimized system—and offers insights from the observed trade-offs and how they were resolved to fit specific requirements. In summary, the paper explains how, for our system, (1) the row-store engine (MariaDB Aria) with denormalized relations emerged as the optimal choice for front-end interfaces, while (2) big data processing (Apache Spark) proved irreplaceable for data preprocessing.
... While NLP has undoubtedly given Biblical scholars a new method for biblical analysis, the Bible in and of itself is also an invaluable corpus for computational linguistics research. Buchler et al. used seven English translations of the Bible to investigate the techniques behind the historical text re-use detection process and to examine algorithms for paraphrase detection [5]. The Bible provides a good test bed for paraphrase detection as there exist several different translations all stemming from the same origin. ...
Preprint
Question answering (QA) has significantly benefitted from deep learning techniques in recent years. However, domain-specific QA remains a challenge due to the significant amount of data required to train a neural network. This paper studies the answer sentence selection task in the Bible domain and answers questions by selecting relevant verses from the Bible. For this purpose, we create a new dataset, BibleQA, based on Bible trivia questions and propose three neural network models for our task. We pre-train our models on a large-scale QA dataset, SQuAD, and investigate the effect of transferring weights on model accuracy. Furthermore, we also measure the model accuracies with different answer context lengths and different Bible translations. We find that transfer learning yields a noticeable improvement in model accuracy. We achieve relatively good results with shorter context lengths, whereas longer context lengths decrease model accuracy. We also find that using a more modern Bible translation in the dataset has a positive effect on the task.
... See also Li and Mullen (2020). 6 https://www.etrap.eu/research/tracer/. See also Büchler et al. (2014); Franzini et al. (2018). 7 For an excellent overview of Intertextuality see Samoyault (2005). ...
Article
Full-text available
Text reuse, encompassing direct citations, paraphrases and allusions, represents a key aspect of intertextuality – a concept central to literary theory since the 1960s. This paper highlights how computational methods, particularly automatic text-reuse detection, can illuminate the complex system of intertextual exchange that informs 18th-century literary culture, focusing on significant works like the Encyclopédie and Voltaire's correspondence. By employing advanced techniques such as sequence alignment and social network analysis, we uncover hidden patterns of influence, citation strategies and the subtle interplay between originality and imitation in Enlightenment literature. The paper also considers the implications of these findings for modern understandings of authorship, originality and textuality, drawing connections to contemporary digital humanities practices. The paper ultimately aims to recontextualise the Enlightenment as a period of intense intertextual productivity, where the reuse of texts was not merely a scholarly exercise but a dynamic and essential component of literary creation.
... 11 https://www.etrap.eu/research/tracer/. See also Büchler et al. (2014) and Franzini et al. (2019). 12 https://github.com/tesserae/tesserae/. See also Coffee et al. (2013). ...
Article
Full-text available
The European Research Council's ModERN project (Modelling Enlightenment: reassembling networks of modernity through data-driven research) is a pioneering five-year research initiative. This programme seeks to redefine the conventional understanding of 18th-century literary history by employing advanced data-modelling and analysis techniques. By developing a comprehensive corpus of 18th-century French texts and leveraging a range of data-science methodologies such as text-reuse detection and network analysis, the project aims to uncover novel research avenues and provide fresh insights into early-modern French print culture and its intertextual dynamics. In this report, we discuss some theoretical points underlying our research; we explain the choices made in constructing our corpus and their implications; and we present some case studies to show the potential of our research and the most prudent methodologies to adopt.
... The second scenario is concerned with the discovery and analysis of re-used text passages within a collection of texts (76). This oral or written reproduction of textual content is called text re-use (43). Deliberate text re-use appears in the form of direct quotes and phrases like winged words and wisdom sayings. ...
Thesis
Full-text available
Translation alignment is an essential task in Digital Humanities and Natural Language Processing, and it aims to link words/phrases in the source text with their translation equivalents in the translation. In addition to its importance in teaching and learning historical languages, translation alignment builds bridges between ancient and modern languages through which various linguistics annotations can be transferred. This thesis focuses on word-level translation alignment applied to historical languages in general and Ancient Greek and Latin in particular. As the title indicates, the thesis addresses four interdisciplinary aspects of translation alignment. The starting point was developing Ugarit, an interactive annotation tool to perform manual alignment aiming to gather training data to train an automatic alignment model. This effort resulted in more than 190k accurate translation pairs that I used for supervised training later. Ugarit has been used by many researchers and scholars also in the classroom at several institutions for teaching and learning ancient languages, which resulted in a large, diverse crowd-sourced aligned parallel corpus allowing us to conduct experiments and qualitative analysis to detect recurring patterns in annotators’ alignment practice and the generated translation pairs. Further, I employed the recent advances in NLP and language modeling to develop an automatic alignment model for historical low-resourced languages, experimenting with various training objectives and proposing a training strategy for historical languages that combines supervised and unsupervised training with mono- and multilingual texts. Then, I integrated this alignment model into other development workflows to project cross-lingual annotations and induce bilingual dictionaries from parallel corpora. Evaluation is essential to assess the quality of any model. To ensure employing the best practice, I reviewed the current evaluation procedure, defined its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold standard datasets and support quantitative and qualitative evaluation of translation alignment models. Besides, I designed and implemented visual analytics tools and reading environments for parallel texts and proposed various visualization approaches to support different alignment-related tasks employing the latest advances in information visualization and best practice. Overall, this thesis presents a comprehensive study that includes manual and automatic alignment techniques, evaluation methods and visual analytics tools that aim to advance the field of translation alignment for historical languages.
... Like most other wordnets, the motivation behind this project is to perform automatic analysis of texts, including: classic uses in NLP, word similarity tasks, classification of texts, and enhancing the performance of information retrieval. One of the major motivations behind the construction of the Coptic wordnet in particular was to use the hierarchies for text reuse in TRACER (Büchler et al., 2014), but applications for searching and hyperlemmatization using senses (discussed further in Kučera (2007)) are conceivable as well. The currently available NLP pipeline for Coptic already offers lemmatization to base dictionary entries, but automatically linking word forms to wordnet entries could make comparisons of automatically analyzed texts to existing texts in Coptic, as well as other languages with aligned wordnets, much easier. ...
Article
Full-text available
With the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millennium CE. We present our approach to constructing the wordnet which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our on-going/future work.
... The previous automatic detection methods for text-level intertextuality aimed to discover similar phrases or sequences through lexical matching (Lee 2007; Coffee et al. 2012a; Coffee et al. 2012b; Ganascia et al. 2014; Forstall et al. 2015), which is insufficient and rigid in semantic modelling. Non-literal features like synonyms (Büchler et al. 2014; Moritz et al. 2016) and rhythm (Neidorf et al. 2019) also imply intertextuality, yet they require language-specific design. Topic modelling lends a hand to passage-level modelling (Scheirer et al. 2016), while its dependence on expert annotation limits its generalization to diverse corpora. ...
Article
Full-text available
Being recognized among the cradles of human civilization, ancient China nurtured the longest continuous academic traditions and humanistic spirits, which continue to impact today’s society. With an unprecedented large-scale corpus spanning 3000 years, this paper presents a quantitative analysis of cultural evolution in ancient China. Millions of intertextual associations are identified and modelled with a hierarchical framework via deep neural network and graph computation, thus allowing us to answer three progressive questions quantitatively: (1) What is the interaction between individual scholars and philosophical schools? (2) What are the vicissitudes of schools in ancient Chinese history? (3) How did ancient China develop a cross-cultural exchange with an externally introduced religion such as Buddhism? The results suggest that the proposed hierarchical framework for intertextuality modelling can provide sound suggestions for large-scale quantitative studies of ancient literature. An online platform is developed for custom data analysis within this corpus, which encourages researchers and enthusiasts to gain insight into this work. This interdisciplinary study inspires the re-understanding of ancient Chinese culture from a digital humanities perspective and prompts the collaboration between humanities and computer science.
... Text reuse detection using computational methods has become a common practice in digital humanities (Gladstone & Cooney, 2020; Vesanto et al., 2017; Salmi et al., 2020; Büchler et al., 2014; Citron & Ginsparg, 2015; Cordell, 2015; Lee, 2007; Mullen, 2016; Smith et al., 2013) and makes it possible to cover a vast number of works and a greater extent of shared passages despite noisy data. While text reuse does not necessarily imply influence, it provides valuable information for understanding reception at scale. ...
Article
The Reception Reader is a web tool for studying text reuse in the Early English Books Online (EEBO-TCP) and Eighteenth Century Collections Online (ECCO) data. Users can: 1) explore a visual overview of the reception of a work, or its incoming connections, across time based on shared text segments, 2) interactively survey the details of connected documents, and 3) examine the context of reused text for “close reading”. We show examples of how the tool streamlines research and exploration tasks, and discuss the utility and limitations of the user interface along with its current data sources.
... Text reuse detection using computational methods has become a common practice in digital humanities (Gladstone & Cooney, 2020; Vesanto et al., 2017; Salmi et al., 2020; Büchler et al., 2014; Citron & Ginsparg, 2015; Cordell, 2015; Lee, 2007; Mullen, 2016; Smith et al., 2013) and makes it possible to cover a vast number of works and a greater extent of shared passages despite noisy data. While text reuse does not necessarily imply influence, it provides valuable information for understanding reception at scale. ...
Preprint
Full-text available
The Reception Reader is a web tool for studying text reuse in the Early English Books Online (EEBO-TCP) and Eighteenth Century Collections Online (ECCO) data. Users can: 1) explore a visual overview of the reception of a work, or its incoming connections, across time based on shared text segments, 2) interactively survey the details of connected documents, and 3) examine the context of reused text for "close reading". We show examples of how the tool streamlines research and exploration tasks, and discuss the utility and limitations of the user interface along with its current data sources.
... Like the WCopyFind program, which was created to detect plagiarism and only later found a less forensic application, Tracer, another piece of software used in the quantitative study of the Italian versions of Quo vadis, also had a different purpose initially: to study all the intertextual relations between texts or groups of texts, using a large number of very advanced algorithms that take into account a wide range of different textual elements: word and letter n-grams, within and across automatically detected sentence boundaries, for various values of n, with the optional use of synonym dictionaries (Buechler et al., 2014). Tracer has yet to prove its usefulness for the tasks it was created for; it has, however, already proved very useful in comparing translations of the same text, specifically in studies of the Italian translations of Sienkiewicz's international bestseller. ...
... Although film is first and foremost a visual medium, the dialogue additionally offers a verbal avenue of analysis (cf. Kozloff, 2000), Klarer (1998, p. 54) [...] (Berti et al., 2013; Büchler et al., 2014; Scheirer et al., 2014; Forstall et al., 2015; Bamman & Crane, 2018), less so in the field of English studies and hardly at all in Shakespeare scholarship. Although the topic has been and continues to be examined extensively with qualitative-hermeneutic methods (cf., for instance, the edited volume by Maxwell & Rumbold, 2018), only a few studies employ quantitative, digital methods, e.g. ...
... The retrieval of textual contexts that describe similar experiences is a text mining task that features two main differences from text reuse and plagiarism detection (Alzahrani et al., 2012;Büchler et al., 2014). First, non-native speakers use different vocabulary, as well as different grammatical constructions, to describe the same experience. ...
Article
The experiences of murdered victims of Nazi persecutions perished with them. This article discusses how text and data mining technology has helped to recover fragments of these lost experiences out of 2,500 oral history interviews with survivors. This gave rise to Let them Speak, a data edition of Holocaust testimonies. The first part situates the challenge of revealing lost experiences in historiography, and argues that the experience of murdered victims can be reconstructed through the collective experience. The second part shows how text and data mining techniques assisted the author to identify some pieces of the collective experience. The third part presents how web technology and visualization are used to render pieces of the collective experience as testimonial fragments of the Holocaust.
... Previous research on text reuse detection in literary texts has extensively explored methods such as n-gram matching (Büchler et al., 2014) and sequence alignment algorithms (Lee, 2007; Smith et al., 2014). In such approaches, fuzzier forms of intertextual links are accounted for through the use of edit distance comparisons or the inclusion of abstract linguistic information such as word lemmata or part-of-speech tags, and lexical semantic relationships extracted from WordNet. ...
Preprint
The detection of allusive text reuse is particularly challenging due to the sparse evidence on which allusive references rely, commonly based on no or very few shared words. Arguably, lexical semantics can be resorted to, since uncovering semantic relations between words has the potential to increase the support underlying the allusion and alleviate the lexical sparsity. A further obstacle is the lack of evaluation benchmark corpora, largely due to the highly interpretative character of the annotation process. In the present paper, we aim to elucidate the feasibility of automated allusion detection. We approach the matter from an Information Retrieval perspective in which referencing texts act as queries and referenced texts as relevant documents to be retrieved, and estimate the difficulty of benchmark corpus compilation by a novel inter-annotator agreement study on query segmentation. Furthermore, we investigate to what extent the integration of lexical semantic information derived from distributional models and ontologies can aid retrieving cases of allusive reuse. The results show that (i) despite low agreement scores, using manual queries considerably improves retrieval performance with respect to a windowing approach, and that (ii) retrieval performance can be moderately boosted with distributional semantics.
... Text reuse (TR) can be summarily described as the written repetition or borrowing of text and can take different forms. Büchler et al. (2014) separate syntactic TR, such as (near-)verbatim quotations or idiomatic expressions, from semantic TR, which can manifest itself as a paraphrase, an allusion or other loose reproduction. The study of quotation is key to any philological examination of a text, as it is not only indicative of the intellectual and cultural endowment of an author, but may shed light on the sources used, the relation between works and literary influence. ...
Conference Paper
Full-text available
This article describes a computational text reuse study on Latin texts designed to evaluate the performance of TRACER, a language-agnostic text reuse detection engine. As a case study, we use the Index Thomisticus as a gold standard to measure the performance of the tool in identifying text reuse between Thomas Aquinas' Summa contra Gentiles and his sources.
... While NLP has undoubtedly given Biblical scholars a new method for biblical analysis, the Bible in and of itself is also an invaluable corpus for computational linguistics research. Buchler et al. used seven English translations of the Bible to investigate the techniques behind the historical text re-use detection process and to examine algorithms for paraphrase detection [5]. The Bible provides a good test bed for paraphrase detection as there exist several different translations all stemming from the same origin. ...
Conference Paper
Full-text available
Question answering (QA) has significantly benefitted from deep learning techniques in recent years. However, domain-specific QA remains a challenge due to the significant amount of data required to train a neural network. This paper studies the answer sentence selection task in the Bible domain and answers questions by selecting relevant verses from the Bible. For this purpose, we create a new dataset, BibleQA, based on Bible trivia questions and propose three neural network models for our task. We pre-train our models on a large-scale QA dataset, SQuAD, and investigate the effect of transferring weights on model accuracy. Furthermore, we also measure the model accuracies with different answer context lengths and different Bible translations. We find that transfer learning yields a noticeable improvement in model accuracy. We achieve relatively good results with shorter context lengths, whereas longer context lengths decrease model accuracy. We also find that using a more modern Bible translation in the dataset has a positive effect on the task.
Preprint
We propose a method for efficiently finding all parallel passages in a large corpus, even if the passages are not quite identical due to rephrasing and orthographic variation. The key ideas are the representation of each word in the corpus by its two most infrequent letters, finding matched pairs of strings of four or five words that differ by at most one word and then identifying clusters of such matched pairs. Using this method, over 4600 parallel pairs of passages were identified in the Babylonian Talmud, a Hebrew-Aramaic corpus of over 1.8 million words, in just over 30 seconds. Empirical comparisons on sample data indicate that the coverage obtained by our method is essentially the same as that obtained using slow exhaustive methods.
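A very condensed sketch of the representation step described above might look like the following; the toy corpus, the run length of four words and the one-mismatch rule stand in for the paper's actual parameters and its clustering stage.

```python
# Sketch of the core representation: each word is reduced to its two most
# infrequent letters, and short runs of these codes are compared, allowing
# one mismatched position. Corpus and parameters are illustrative only.
from collections import Counter

corpus = [
    "in the beginning god created the heaven and the earth",
    "in the beginning god made the heaven and the earth",
]

letter_freq = Counter(c for doc in corpus for c in doc if c.isalpha())

def code(word):
    """The word's two least frequent letters, in a canonical order."""
    return "".join(sorted(set(word), key=lambda c: letter_freq[c])[:2])

def coded_runs(doc, n=4):
    codes = [code(w) for w in doc.split()]
    return [(i, tuple(codes[i:i + n])) for i in range(len(codes) - n + 1)]

def near_match(run_a, run_b):
    """Two runs match if they differ in at most one position."""
    return sum(x != y for x, y in zip(run_a, run_b)) <= 1

for i, run_a in coded_runs(corpus[0]):
    for j, run_b in coded_runs(corpus[1]):
        if near_match(run_a, run_b):
            print("candidate parallel passage at word positions", i, "and", j)
```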
Article
Full-text available
This article presents some of the challenges that emerged during the early phases of the Modern project, a five-year research programme funded by the ERC (European Research Council) that adopts a new data-driven approach to studying the literary history of the Enlightenment. Starting from a large corpus of early modern French texts, the authors detail the various stages of building intertextual networks from the output of text reuse algorithms. From harmonizing the corpus and its metadata to training a neural network to filter out 'noisy' passages, the article proposes a pragmatic processing pipeline for similar projects working on large collections of digitized texts, while highlighting both the promises and the perils of large-scale literary research.
Article
The New Joyce Studies indicates the variety and energy of research on James Joyce since the year 2000. Essays examine Joyce's works and their reception in the light of a larger set of concerns: a diverse international terrain of scholarly modes and methodologies, an imperilled environment, and crises of racial justice, to name just a few. This is a Joyce studies that dissolves early visions of Joyce as a sui generis genius by reconstructing his indebtedness to specific literary communities. It models ways of integrating masses of compositional and publication details with literary and historical events. It develops hybrid critical approaches from posthuman, medical, and queer methodologies. It analyzes the nature and consequences of its extension from Ireland to mainland Europe, and to Africa and Latin America. Examining issues of copyright law, translation, and the history of literary institutions, this volume seeks to use Joyce's canonical centrality to inform modernist studies more broadly.
Thesis
Full-text available
This study sees itself as part of the larger field of intertextuality studies and examines, using the latest digital text reuse technology, the reuse of biblical texts in the writings of two Late Antique Christian authors from Egypt, the abbots Shenoute and Besa. It explores, on the basis of selected writings by these authors, the advantages and limitations of using digital methods to study the form and function of biblical intertexts in monastic literature written in Coptic. In particular, it seeks answers to a number of specific research questions, e.g., the extent to which quotations are faithful to the original biblical sources, the influence of quotation-introducing formulae or the question how digital technologies can be used to facilitate intertextuality studies. To pursue these topics, after an introduction to the life and works of both authors (Chapter 1), it describes the state of research for intertextuality studies with a focus on biblical and Early Christian and Coptic studies, and Shenoute in particular (Chapter 2). Text reuse detection technology, which was developed in the field of computer science, is introduced, and its practical applications are described. Specific focus is placed on the history of text reuse detection and the subtle differences between intertextuality and text reuses. In addition, the current progress in studies on intertextuality in Shenoute’s works is explored. Digital text reuse technology is described in detail (Chapter 3), in particular the technology that underpins the processing mechanisms employed by the latest text reuse detection software, TRACER, and pre-processing features such as optical character recognition, Unicode conversion, tokenization, lemmatization, and part-of-speech tagging. The case study for examining the application of digital text reuse technology has to focus on a limited selection of biblical texts, and specifically, on the best attested and most well-known book of the Old Testament, the Book of Psalms. Chapter 4 presents the philological and codicological information on the corpora used, the Sahidic translation of the Psalms, and the selected works by Shenoute and Besa, while Chapter 5 is dedicated to the case study and its results. It analyzes text reuses newly identified by TRACER, discusses instances of idiomatic text reuse and the question of quotation-introducing signals. In summary, this study confirms observations by previous research that the monastic authors built on the audience’s collective memory of the Bible by blending biblical phrases and concepts with their own monastic ideals. For the purpose of recontextualizing the source texts and fitting them to the current situation, unmarked changes may be applied, mostly of a grammatical nature. An interesting difference between the two monastic authors may be noted in their use of quotation-introducing signals, which merits further exploration, as does the question of the relation between the introduction of a quotation and its faithfulness. Finally, it needs to be stressed that ongoing and future digitization of the corpus of monastic authors and Coptic literature in general will very much widen the scope of digital text reuse methods and lead to new research questions and discoveries.
Article
Full-text available
How can computational methods illuminate the relationship between a leading intellectual, and their lifetime library membership? We report here on an international collaboration that explored the interrelation between the reading record and the publications of the British philosopher and economist John Stuart Mill, focusing on his relationship with the London Library, an independent lending library of which Mill was a member for 32 years. Building on detailed archival research of the London Library’s lending and book donation records, a digital library of texts borrowed, and publications produced was assembled, which enabled natural language processing approaches to detect textual reuse and similarity, establishing the relationship between Mill and the Library. Text mining the books Mill borrowed and donated against his published outputs demonstrates that the collections of the London Library influenced his thought, transferred into his published oeuvre, and featured in his role as political commentator and public moralist. We reconceive archival library issue registers as data for triangulating against the growing body of digitized historical texts and the output of leading intellectual figures. We acknowledge, however, that this approach is dependent on the resources and permissions to transcribe extant library registers, and on access to previously digitized sources. Related copyright and privacy restrictions mean our approach is most likely to succeed for other leading eighteenth- and nineteenth-century figures.
Chapter
Text reuse measurement is important for both LIS and literary studies, where it is mainly used to study influence between authors. Although projects such as Tesserae have already adopted computational methods for investigating text reuse in Latin poetry, its potential applications to the rich collections of English poetry have not been realized. This research proposes a modified version of the Tesserae Project’s measure based on the insight embodied in TF–IDF to study English poetry. Using the Irish poet Yeats’ relationship to five English Romantic poets as a test case, three parallel experiments were conducted in order to evaluate the suitability of this method for English poetry. The results show that this new method is effective in measuring text reuse in English poetry, and the TF–IDF based modification is more sensitive to known cases of text reuse than the original method. This method can also be adopted to noncanonical literary works in the future, providing an example of the significance of LIS for digital humanities.
Chapter
The Word Formation Latin (WFL) project has been awarded a Marie Curie Individual Fellowship to create a language resource consisting of a derivational morphological lexicon of the Latin language, which connects lexical elements on the basis of word formation rules. In WFL, lexemes are segmented and analysed into their derivational morphological components, in order to establish relationships between them on the basis of derivational or compounding processes. This chapter illustrates the methodology lying behind WFL, its lexical basis and the choices taken in order to maintain consistency in the resource, when encountering a number of cases raising theoretical issues. The online graphical query system used to access WFL is described, as well as a few examples of linguistic investigations that are made easier through the use of such a resource.
Article
Full-text available
The digital collections of newspapers have given rise to a growing interest in studying them with computational methods. This article contributes to this discussion by presenting a method for detecting text reuse in a large corpus of digitized texts. Empirically, the article is based on the corpus of newspapers and journals from the collection of the National Library of Finland. Often, digitized repositories offer only partial views of what actually was published in printed form. The Finnish collection is unique, however, since it covers all published issues up to the year 1920. This article has a two-fold objective: methodologically, it explores how computational methods can be developed so that text reuse can be effectively identified; empirically, the article concentrates on how the circulation of texts developed in Finland from the late eighteenth century to the early twentieth century and what this reveals about the transformation of public discourse in Finland. According to our results, the reuse of texts was an integral part of the press throughout the studied period, which, on the other hand, was part of a wider transnational practice.
Chapter
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.
Chapter
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.
Article
Full-text available
This paper overviews 17 plagiarism detectors that have been evaluated within the sixth international competition on plagiarism detection at PAN 2014. We report on their performances for the two tasks source retrieval and text alignment of external plagiarism detection. For the third year in a row, we invite software submissions instead of run submissions for this task, which allows for cross-year evaluations. Moreover, we introduce new performance measures for text alignment to shed light on new aspects of detection performance.
Article
Full-text available
The aim of this work is to formally describe the most general principles operating in language that currently are reliably observable, broken down to the most essential ones. These formal descriptions are open to changes and additions, allowing for a later integration of new language-independent mechanisms, but also for improvement of previously introduced ones and the discarding of formalizations which were proved wrong or inappropriate. In order to support the formal description, an additional and more practical aim of this work is to examine several specific areas such as the computation of word similarity, ambiguity or morphology.
Conference Paper
Full-text available
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of “roughly the same” and “roughly contained.” The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints
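Ignoring the random-sampling (sketching) step that makes the paper's approach efficient, the underlying set-intersection definitions can be written down directly; the shingle width and sample texts below are arbitrary choices for illustration.

```python
# Direct (non-sampled) computation of resemblance r(A, B) and containment
# c(A, B) over word shingles; shingle width and texts are arbitrary examples.

def shingles(text, w=4):
    """Set of contiguous w-word shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def containment(a, b, w=4):
    """How much of A appears inside B."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0

doc_a = "in the beginning god created the heaven and the earth"
doc_b = "the scribe recorded that in the beginning god created the heaven and the earth"
print(round(resemblance(doc_a, doc_b), 2), round(containment(doc_a, doc_b), 2))
```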
Article
Full-text available
Fast similarity search is important for time-sensitive applications. These include both enterprise and web scenarios, where typos, misspellings, and noise need to be removed in an efficient way, in order to improve data quality, or to find all information of interest to the user. This paper presents a new algorithm called Fast Similarity Search (FastSS) that performs an exhaustive similarity search in a dictionary, based on the edit distance model of string similarity. The algorithm uses deletions to model the edit distance. For a dictionary containing n words of average length m, and given a maximum number of spelling errors k, FastSS uses a deletion dictionary of size O(nmk). At search time each query is mutated to generate a deletion neighborhood of size O(mk), which is compared to the indexed deletion dictionary. As a deletion neighborhood is smaller than a neighborhood using deletions, insertions and replacements, this contributes to a faster search. FastSS looks up misspellings in a time which is independent of n for a hash-based index, or logarithmic in the size of the dictionary for a tree-based one. FastSS has been evaluated and compared with NR-grep, a keyword tree, dynamic programming, n-grams, and neighborhood generation.
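The deletion-neighbourhood idea can be sketched in a few lines. This is not the authors' implementation; the tiny dictionary, k = 1 and the plain hash-map index are assumptions made for the example.

```python
# Sketch of FastSS-style indexing: every dictionary word is indexed under all
# variants obtained by deleting up to k characters; a query's own deletion
# variants are then looked up. Dictionary, query and k are invented examples.
from itertools import combinations

def deletion_variants(word, k=1):
    """All strings obtained from `word` by deleting at most k characters."""
    variants = {word}
    for d in range(1, k + 1):
        for positions in combinations(range(len(word)), d):
            variants.add("".join(c for i, c in enumerate(word) if i not in positions))
    return variants

def build_index(dictionary, k=1):
    index = {}
    for word in dictionary:
        for v in deletion_variants(word, k):
            index.setdefault(v, set()).add(word)
    return index

def lookup(query, index, k=1):
    """Candidate dictionary words within roughly k edits of the query."""
    hits = set()
    for v in deletion_variants(query, k):
        hits |= index.get(v, set())
    return hits

index = build_index({"beginning", "heaven", "earth"}, k=1)
print(lookup("begining", index, k=1))   # misspelled query -> {'beginning'}
```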
Conference Paper
Full-text available
The automatic detection of shared content in written documents –which includes text reuse and its unacknowledged commitment, plagiarism– has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in those cases where the number of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which has proved to be quite effective in many applications. As this approach normally becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of texts, substituting a word by its length, which provides three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and fast comparison. We experimentally show, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
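A minimal version of the length-encoding trick might look as follows; capping lengths at 9 (so the alphabet stays at nine symbols) and the 5-gram width are assumptions made for the sketch, and the two sentences are invented.

```python
# Sketch of word-length encoding for pre-selection: each word is replaced by
# its length (capped at 9 here, an assumption), and documents are compared via
# n-grams over the encoded sequence. Texts are invented examples.
def length_encode(text):
    return [min(len(w), 9) for w in text.split()]

def length_ngrams(text, n=5):
    codes = length_encode(text)
    return {tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)}

source = "in the beginning god created the heaven and the earth"
suspicious = "in the beginning god created the heavens and the earth"
shared = length_ngrams(source) & length_ngrams(suspicious)
print(len(shared), "shared length 5-grams (pre-selection signal)")
```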
Conference Paper
Full-text available
Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered as similar if they are mapped on the same hash key. In terms of runtime performance this principle is unequaled, while being unaffected by dimensionality concerns at the same time. Similarity hashing is applied with great success for near similarity search in large document collections, and it is considered a key technology for near-duplicate detection and plagiarism analysis. This paper reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyze the performance of hash-based search methods, and we explain the rationale of their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study between different hash-based search methods.
Conference Paper
Full-text available
Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers because of its importance in Web search. Local text reuse occurs when sentences, facts or passages, rather than whole documents, are reused and modified. Detecting this type of reuse can be the basis of new tools for text analysis. In this paper, we introduce a new approach to detecting local text reuse and compare it to other approaches. This comparison involves a study of the amount and type of reuse that occurs in real documents, including TREC newswire and blog collections.
Conference Paper
Full-text available
Identifying duplicate texts is important in many areas like plagiarism detection, information retrieval, text summarization, and question answering. Current approaches are mostly surface-oriented (or use only shallow syntactic representations) and see each text only as a token list. In this work however, we describe a deep, semantically oriented method based on semantic networks which are derived by a syntactico-semantic parser. Semantically identical or similar semantic networks for each sentence of a given base text are efficiently retrieved by using a specialized index. In order to detect many kinds of paraphrases the semantic networks of a candidate text are varied by applying inferences: lexico-semantic relations, relation axioms, and meaning postulates. Important phenomena occurring in difficult duplicates are discussed. The deep approach profits from background knowledge, whose acquisition from corpora is explained briefly. The deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee a high recall for texts which are not fully parsable. The evaluation shows that the combined approach preserves recall and increases precision considerably in comparison to traditional shallow methods.
Article
Full-text available
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with Moss, a widely-used plagiarism detection service.
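A simplified winnowing sketch (not the Moss implementation) is shown below: hash every character k-gram, slide a window of w consecutive hashes, and keep the rightmost minimum of each window as a fingerprint; k, w and the sample verses are chosen arbitrarily.

```python
# Simplified winnowing: hash all k-grams, slide a window of w hashes and keep
# the rightmost minimal hash per window as a fingerprint. Within one Python
# run, the built-in hash() is consistent, which is all this toy needs.
def winnow(text, k=5, w=4):
    text = "".join(text.lower().split())                 # drop whitespace
    hashes = [hash(text[i:i + k]) for i in range(len(text) - k + 1)]
    selected = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        j = max(p for p, h in enumerate(window) if h == min(window))
        selected.add((i + j, window[j]))                  # (position, hash)
    return {h for _, h in selected}

a = winnow("In the beginning God created the heaven and the earth.")
b = winnow("In the beginning God created the heavens and the earth.")
print(len(a & b), "shared fingerprints out of", len(a | b))
```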
Article
Full-text available
Feature selection is a central problem in data analysis that has received a significant amount of attention from several disciplines, such as machine learning or pattern recognition. However, most of the research has been addressed towards supervised tasks, paying little attention to unsupervised learning. In this paper, we introduce an unsupervised feature selection method for symbolic clustering tasks. Our method is based upon the assumption that, in the absence of class labels, we can deem as irrelevant those features that exhibit low dependencies with the rest of the features. Experiments with several data sets demonstrate that the proposed approach is able to detect completely irrelevant features and that, additionally, it removes other features without significantly hurting the performance of the clustering algorithm. Keywords: feature selection, clustering, data preprocessing.
Article
In this paper, we present various visualizations for the Text Re-use found between texts of a collection to support humanists in answering a broad palette of research questions. When juxtaposing all texts of a corpus in the form of tuples, we propose the Text Re-use Grid as a distant reading method that emphasizes text tuples with systematic or repetitive Text Re-use. In contrast, the Text Re-use Browser allows for close reading of the Text Re-use between the two texts of a tuple. Additionally, we present Sentence Alignment Flows to improve the readability for Text Variant Graphs on sentence level that are used to compare various text editions to each other. Finally, we portray findings of the humanists of our project using the proposed visualizations.
Article
An interesting relationship was demonstrated between rounding algorithms used for rounding fractional solutions of LPs and vector solutions of SDPs, on the one hand, and the construction of locality sensitive hash functions for interesting classes of objects, on the other. Thus, rounding algorithms yielded new constructions of locality sensitive hash functions that were not previously known. Conversely, locality sensitive hash functions lead to rounding algorithms.
Conference Paper
Texts propagate through many social networks and provide evidence for their structure. We present efficient algorithms for detecting clusters of reused passages embedded within longer documents in large collections. We apply these techniques to analyzing the culture of reprinting in the United States before the Civil War. Without substantial copyright enforcement, stories, poems, news, and anecdotes circulated freely among newspapers, magazines, and books. From a collection of OCR'd newspapers, we extract a new corpus of reprinted texts, explore the geographic spread and network connections of different publications, and analyze the time dynamics of different genres.
Article
Bell System Technical Journal, also pp. 623-656 (October)
Conference Paper
Detecting text reuse is a fundamental requirement for a variety of tasks and applications, ranging from journalistic text reuse to plagiarism detection. Text reuse is traditionally detected by computing similarity between a source text and a possibly reused text. However, existing text similarity measures exhibit a major limitation: They compute similarity only on features which can be derived from the content of the given texts, thereby inherently implying that any other text characteristics are negligible. In this paper, we overcome this traditional limitation and compute similarity along three characteristic dimensions inherent to texts: content, structure, and style. We explore and discuss possible combinations of measures along these dimensions, and our results demonstrate that the composition consistently outperforms previous approaches on three standard evaluation datasets, and that text reuse detection greatly benefits from incorporating a diverse feature set that reflects a wide variety of text characteristics.
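The compositional idea can be illustrated with three deliberately simple stand-in measures, one per dimension; the function-word list, the individual measures and the weights below are placeholders, not the features evaluated in the paper.

```python
# Illustrative composition of similarity scores along content, structure and
# style dimensions; measures and weights are invented stand-ins.
FUNCTION_WORDS = {"the", "and", "of", "in", "to", "a", "that", "it", "is", "was"}

def content_sim(a, b):                       # shared vocabulary (Jaccard)
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def structure_sim(a, b):                     # ratio of sentence counts
    na, nb = max(a.count("."), 1), max(b.count("."), 1)
    return min(na, nb) / max(na, nb)

def style_sim(a, b):                         # overlap of function-word usage
    fa = {w for w in a.lower().split() if w in FUNCTION_WORDS}
    fb = {w for w in b.lower().split() if w in FUNCTION_WORDS}
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

def combined(a, b, weights=(0.6, 0.2, 0.2)):
    scores = (content_sim(a, b), structure_sim(a, b), style_sim(a, b))
    return sum(w * s for w, s in zip(weights, scores))

print(round(combined("In the beginning God created the heaven and the earth.",
                     "In the beginning God created the heavens and the earth."), 2))
```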
Article
For the purposes of the present discussion, the term structure will be used in the following non-rigorous sense: A set of phonemes or a set of data is structured in respect to some feature, to the extent that we can form in terms of that feature some organized system of statements which describes the members of the set and their interrelations (at least up to some limit of complexity). In this sense, language can be structured in respect to various independent features. And whether it is structured (to more than a trivial extent) in respect to, say, regular historical change, social intercourse, meaning, or distribution — or to what extent it is structured in any of these respects — is a matter decidable by investigation. Here we will discuss how each language can be described in terms of a distributional structure, i.e. in terms of the occurrence of parts (ultimately sounds) relative to other parts, and how this description is complete without intrusion of other features such as history or meaning. It goes without saying that other studies of language — historical, psychological, etc.—are also possible, both in relation to distributional structure and independently of it.
Article
An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly-available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition to describing the corpus itself, we explore a number of issues that arose in defining guidelines for the human raters.
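For orientation, a trivial word-overlap baseline for the binary paraphrase judgement might be sketched as follows; the sentence pairs and the 0.5 decision threshold are invented for illustration and are not drawn from the corpus described above.

    # Minimal sketch of a word-overlap baseline for binary paraphrase judgement on
    # labelled sentence pairs (text1, text2, gold label). Example pairs and the
    # 0.5 threshold are illustrative assumptions.

    def overlap(s1, s2):
        a, b = set(s1.lower().split()), set(s2.lower().split())
        return len(a & b) / max(len(a), len(b))

    pairs = [
        ("God created the heaven and the earth.",
         "God made the heavens and the earth.", 1),
        ("God created the heaven and the earth.",
         "Noah built an ark of gopher wood.", 0),
    ]

    correct = sum((overlap(s1, s2) >= 0.5) == bool(gold) for s1, s2, gold in pairs)
    print(f"baseline accuracy: {correct}/{len(pairs)}")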
Conference Paper
Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity -- resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance -- are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated these algorithms into RECAP, a prototype information flow analysis tool. Our experimental results with RECAP indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.
Conference Paper
A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for two objects x, y, Pr_{h∈F}[h(x) = h(y)] = sim(x, y), where sim(x, y) ∈ [0, 1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A, B) = |A ∩ B| / |A ∪ B|. We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for: (1) a collection of vectors, with the distance between vectors u and v measured by θ(u, v)/π, where θ(u, v) is the angle between u and v; this yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to min-wise independent permutations for estimating set similarity; and (2) a collection of distributions on n points in a metric space, with the distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P, Q) ≤ E_{h∈F}[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P, Q).
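The cosine-similarity construction can be sketched in a few lines: each hash function is a random hyperplane, and the fraction of hyperplanes on which two vectors disagree estimates θ(u, v)/π. The dimension, number of hyperplanes and example vectors below are arbitrary illustrative choices.

    # Minimal sketch of random-hyperplane hashing for cosine similarity:
    # Pr[bits disagree] = theta(u, v) / pi.
    import math
    import random

    def random_hyperplanes(dim, n_planes, seed=0):
        rng = random.Random(seed)
        return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    def sketch(vec, planes):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return [1 if sum(p_i * v_i for p_i, v_i in zip(p, vec)) >= 0 else 0
                for p in planes]

    def estimated_angle(s1, s2):
        # Fraction of disagreeing bits estimates theta(u, v) / pi.
        disagree = sum(b1 != b2 for b1, b2 in zip(s1, s2)) / len(s1)
        return disagree * math.pi

    u, v = [1.0, 2.0, 0.0], [2.0, 1.0, 0.0]
    planes = random_hyperplanes(dim=3, n_planes=2048)
    est = estimated_angle(sketch(u, planes), sketch(v, planes))
    true = math.acos(4.0 / 5.0)  # cos(theta) = (u.v)/(|u||v|) = 4/5 here
    print(f"estimated angle {est:.3f} vs true angle {true:.3f}")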
Conference Paper
This paper describes an approach to automatically align fragments of texts of two documents in different languages. A text fragment is a list of continuous sentences and an aligned pair of fragments consists of two fragments in two documents, which are content-wise related. Cross-lingual similarity between fragments of texts is estimated based on models of divergence from randomness. A set of aligned fragments based on the similarity scores are selected to provide an alignment between sections of the two documents. Similarity measures based on divergence show strong performance in the context of cross-lingual fragment alignment in the performed experiments.
Conference Paper
With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of "facts", statements, rumors, and opinions, and to track their development. Several techniques have been previously proposed for detecting such text reuse between different sources, however these techniques have been tested against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate on the scale of the web, are significantly more accurate than standard web search for finding text reuse, and provide a richer representation for tracking the information flow.
Conference Paper
A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. The approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. Variational approximations based on Kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. The models are demonstrated by analyzing the OCR'ed archives of the journal Science from 1880 through 2000.
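As a hedged sketch of the state space idea, using the standard dynamic topic model notation rather than anything introduced in this chapter: the natural parameters of each topic drift between time slices under Gaussian noise, and word probabilities are obtained by mapping them through the softmax.

    \beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}\!\left(\beta_{t-1,k}, \sigma^{2} I\right),
    \qquad
    p(w \mid \beta_{t,k}) = \frac{\exp(\beta_{t,k,w})}{\sum_{w'} \exp(\beta_{t,k,w'})}

Here \beta_{t,k} collects the natural parameters of topic k at time slice t; inference over these chains is what the Kalman filter and wavelet regression approximations target.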
Conference Paper
We propose a computational model of text reuse tailored for ancient literary texts, available to us often only in small and noisy samples. The model takes into account source alternation patterns, so as to be able to align even sentences with low surface similarity. We demonstrate its ability to characterize text reuse in the Greek New Testament.
Article
We define and study the notion of min-wise independent families of permutations. We say that a family F ⊆ S_n (the symmetric group) is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr[min{π(X)} = π(x)] = 1/|X|. In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept—we present the solutions to some of them and we list the rest as open problems.
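In practice the permutations are approximated; the following MinHash sketch uses a universal hash family with random parameters as a stand-in for truly min-wise independent permutations, and the word sets are invented for illustration.

    # Minimal sketch of MinHash: the probability that two sets share the same
    # minimum hash value equals their Jaccard similarity. A universal hash family
    # with random parameters stands in for min-wise independent permutations.
    import random

    PRIME = 2_147_483_647  # a large prime for the universal hash family

    def make_hash_functions(k, seed=0):
        rng = random.Random(seed)
        return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

    def minhash_signature(items, hash_funcs):
        hashed = [hash(x) % PRIME for x in items]
        return [min((a * h + b) % PRIME for h in hashed) for a, b in hash_funcs]

    def estimated_jaccard(sig1, sig2):
        return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

    A = set("god created the heaven and the earth".split())
    B = set("god made the heaven and the earth".split())
    funcs = make_hash_functions(k=512)
    print("estimate:", estimated_jaccard(minhash_signature(A, funcs),
                                         minhash_signature(B, funcs)))
    print("exact:   ", len(A & B) / len(A | B))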
Article
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
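Assuming the NLTK interface to WordNet (nltk.corpus.wordnet), a minimal query of synonym sets and their hypernym relations might look as follows; the query word is an arbitrary example, and nltk.download('wordnet') must have been run once beforehand.

    # Minimal sketch of querying WordNet synsets and semantic relations via NLTK.
    from nltk.corpus import wordnet as wn

    for syn in wn.synsets("heaven"):
        # Each synset is one lexicalized concept with its synonymous lemmas.
        print(syn.name(), syn.lemma_names(), "-", syn.definition())
        # Hypernyms expose the "is-a" relation linking synonym sets.
        for hyper in syn.hypernyms():
            print("   hypernym:", hyper.name())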
Article
We address the problem of sentence alignment for monolingual corpora, a phenomenon distinct from alignment in parallel corpora. Aligning large comparable corpora automatically would provide a valuable resource for learning of text-to-text rewriting rules. We incorporate context into the search for an optimal alignment in two complementary ways: learning rules for matching paragraphs using topic structure and further refining the matching through local alignment to find good sentence pairs. Evaluation shows that our alignment method outperforms state-of-the-art systems developed for the same task.
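The local-alignment refinement can be illustrated with a Smith-Waterman-style dynamic program over sentence-similarity scores; the word-overlap similarity, gap penalty and example paragraphs below are assumptions for the sketch, not the scoring used in the cited work.

    # Minimal sketch of local alignment over sentence-similarity scores to pick
    # out well-matching sentence pairs between two comparable paragraphs.

    def sim(s1, s2):
        a, b = set(s1.lower().split()), set(s2.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0

    def local_align(sents1, sents2, gap=-0.2, threshold=0.3):
        n, m = len(sents1), len(sents2)
        H = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = H[i - 1][j - 1] + (sim(sents1[i - 1], sents2[j - 1]) - threshold)
                H[i][j] = max(0.0, match, H[i - 1][j] + gap, H[i][j - 1] + gap)
        return H  # peaks in H mark locally well-aligned sentence runs

    para1 = ["In the beginning God created the heaven and the earth.",
             "And the earth was without form, and void."]
    para2 = ["In the beginning God made the heavens and the earth.",
             "Darkness was upon the face of the deep."]
    H = local_align(para1, para2)
    print(max(max(row) for row in H))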
On the resemblance and containment of documents. In: Compression and complexity of sequences (SEQUENCES '97)
  • A Z Broder
Broder AZ (1997) On the resemblance and containment of documents. In: Compression and Complexity of Sequences (SEQUENCES '97). IEEE Computer Society, Los Alamitos, pp 21-29
Aspects of an infrastructure for eHumanities
  • M Büchler
  • V Boehlke
  • G Heyer
Büchler M, Boehlke V, Heyer G (2011) Aspects of an infrastructure for eHumanities. In: Proceedings of Supporting Digital Humanities 2011
One step closer to paraphrase detection on historical texts: about the quality of text re-use techniques and the ability to learn paradigmatic relations
  • P R Burns
  • G Crane
  • M Mueller
  • G Heyer
  • M Büchler
Burns PR, Crane G, Mueller M, Heyer G, Büchler M (2011) One step closer to paraphrase detection on historical texts: about the quality of text re-use techniques and the ability to learn paradigmatic relations. In: Proceedings of the 2011 Chicago Colloquium on Digital Humanities and Computer Science, Chicago, 2012
What do you do with a million books? D-Lib Magazine 12:3. doi:10.1045/march2006-crane
  • G Crane
Crane G (2006) What do you do with a million books? D-Lib Magazine 12:3. doi:10.1045/march2006-crane. ISSN: 1082-9873. http://www.dlib.org/dlib/march06/crane/03crane.html
Analyse von Bedeutungsveränderungen in diachronen Textkorpora [Analysis of meaning change in diachronic text corpora]
  • G Heyer
Heyer G (2009) Analyse von Bedeutungsveränderungen in diachronen Textkorpora [Analysis of meaning change in diachronic text corpora]. Technical report, Natural Language Processing Group, University of Leipzig, Germany, February 2009. Talk given at the research seminar, Leipzig, Germany
Sequence alignment and similarity in biology and the humanities
  • R Horton
  • L Henderson
Horton R, Henderson L (2010) Sequence alignment and similarity in biology and the humanities. J Chicago Colloq Digit Humanit Comput Sci