Project

TRACER

Goal: TRACER is an automatic text reuse detection machine geared towards the identification of text reuse in historical texts.
Homepage available at: http://www.etrap.eu/research/tracer/


Project log

Greta Franzini
added a research item
This article describes a computational text reuse study on Latin texts designed to evaluate the performance of TRACER, a language-agnostic text reuse detection engine. As a case study, we use the Index Thomisticus as a gold standard to measure the performance of the tool in identifying text reuse between Thomas Aquinas' Summa contra Gentiles and his sources.
Greta Franzini
added 2 research items
Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, thus adding a complication to text-reuse detection. Furthermore, it increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning techniques to the text. In Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when considering a verse (such as book Genesis, Chapter 1, Verse 1) that is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in the wording, but not in the meaning. Secondly, this chapter explains and evaluates a way of extracting paradigmatic relations. However, as regards historical languages, there is a lack of language resources (for example, WordNet) that makes non-literal text re-use and paraphrases much more difficult to identify. These differences are present in the form of replacements, corrections, varying writing styles, etc. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a "single run" detection to an iterative process by using the acquired relations to run a new task.
Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, thus adding complication to text-reuse detection. Furthermore, it increases the chances of redundancy in a digital library. In Natural Language Processing it is crucial to remove these redundancies before we can apply any kind of machine learning techniques to the text. In Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. Identification can be accomplished by way of automatic or semi-automatic methods. Text re-use algorithms, however, are of squared complexity and call for higher computational power. The present paper addresses this issue of complexity, with a particular focus on its algorithmic implications and solutions.
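To illustrate the "squared complexity" mentioned in the abstract above, here is a minimal Python sketch of a naive all-pairs comparison. It is not TRACER's implementation: the function names, the word-bigram features, the Jaccard measure and the 0.5 threshold are all assumptions chosen for illustration. The point is only that comparing every text unit with every other one requires n * (n - 1) / 2 comparisons, which grows quadratically with the number of units.

```python
from itertools import combinations


def word_bigrams(text):
    """Return the set of lowercased word bigrams in a text unit."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_reuse_pairs(units, threshold=0.5):
    """Compare every unit with every other unit (hypothetical helper).

    Performs n * (n - 1) / 2 comparisons for n units, i.e. quadratic growth.
    Returns the pairs scoring at or above the threshold and the comparison count.
    """
    features = [word_bigrams(u) for u in units]
    comparisons = 0
    hits = []
    for i, j in combinations(range(len(units)), 2):
        comparisons += 1
        score = jaccard(features[i], features[j])
        if score >= threshold:
            hits.append((i, j, score))
    return hits, comparisons


# Illustrative input only: two near-identical verse renderings and one other verse.
units = [
    "in the beginning god created the heaven and the earth",
    "in the beginning god created the heavens and the earth",
    "and the earth was without form and void",
]
hits, comparisons = naive_reuse_pairs(units)
```

With three units the loop runs three times; with 30,000 units it would run roughly 450 million times, which is why the complexity question matters for large collections.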
Greta Franzini
added an update
An updated version of the TRACER user manual, version 1.2, is now available for download at: http://www.etrap.eu/wp-content/uploads/2017/05/TRACER-user-manual-v1_2.pdf
The user manual also reflects new TRACER developments, so you might want to ensure you're running the latest version.
Among the many updates, the most important are:
- A revised pre-processing section following the addition of TreeTagger and Stanford CoreNLP conversion scripts. The latest version of TRACER contains scripts to automatically import lemma lists produced with the TreeTagger and Stanford CoreNLP morphological analysers and convert them into the required TRACER processing format. This means less formatting work for the user.
- A revised post-processing section following a new implementation of the text reuse results visualisation. In previous versions of TRACER, the user had to semi-automatically create TRAViz visualisations by moving files across folders. In the latest version of TRACER, visualisations are automatically computed and stored for every detection task.
- Layout: the user manual has a new layout for an improved reading experience.
The user manual is a work in progress, so any feedback is most welcome.
 
Greta Franzini
added an update
TRACER now integrates the Latin parameter file of LMU's TreeTagger for higher lemmatisation and Part-of-Speech tagging accuracy. TreeTagger is available at: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
 
Emily Franzini
added an update
TRACER is an automatic text reuse detection machine geared towards the identification of text reuse in historical texts. Homepage available at: http://www.etrap.eu/research/tracer/
 
Greta Franzini
added a project goal
TRACER is an automatic text reuse detection machine geared towards the identification of text reuse in historical texts.