
Tracking Textual Similarities in Neo-Latin Drama Networks

Andrea Peverelli (1), Marieke van Erp (2) and Jan Bloemendal (1)
(1) Huygens ING
(2) KNAW Humanities Cluster
Amsterdam, Netherlands
andrea.peverelli@huygens.knaw.nl,
jan.bloemendal@huygens.knaw.nl,
marieke.van.erp@dh.huc.knaw.nl
Abstract
This paper describes the first experiments towards tracking the complex and international network of text reuse within the
Early Modern (XV-XVII centuries) community of Neo-Latin humanists. Our research, conducted within the framework
of the TransLatin project, aims at gaining more evidence on the topic of textual similarities and semi-conscious reuse of
literary models. It consists of two experiments drawing on two main research fields (Information Retrieval and
Stylometry), as a means to a better understanding of the complex and subtle literary mechanisms underlying the drama
production of Early Modern authors and their transnational network of relations. The experiments led to the construction of
networks of works and authors that fashion different patterns of similarity and models of evolution and interaction between texts.
Keywords: text reuse, textual similarity, transnationality, Neo-Latin, drama
1. Introduction
One of the defining characteristics of the early Modern
Era, burgeoning from the Italian Renaissance period, is
the wide international network of exchanges between
writers of different nationalities that bears the Latin
name of respublica literaria (Republic of Letters) [1]. Authors
feel part of a wider, universal, intellectual community,
and their authorial signal can and must be read
especially in light of its complex network of interde-
pendent exchanges with other peers.
Tracking instances of textual reuse and similarities between
works thus comes as the prime reflection of the
complexities of the respublica literaria. Relations between
authors are primarily expressed through writing
and through an intense, ongoing discussion on the reuse of
models (that is, the hotly debated topic of imitatio).
Besides the vast literary production in the various natu-
ral languages of Europe, the early Modern Age is char-
acterised by a wide, if barely known, production in
Latin from the first sparks of Humanism in the Italian
Peninsula during the XIV century. The production of
poetry and prose in Latin increased dramatically, and
drama began to be involved in this process in quantities
never seen (at least 10,000 works are known from this
period). The Low Countries were at the forefront of
this revitalisation, thanks especially to the outstanding
work of Erasmus.
Early Modern Latin (or “Neo-Latin”) was very different from the Latin written and spoken in the preceding medieval centuries. According to Bloemendal and Norland (2013), Neo-Latin was characterised by: “a shift [from the Middle Ages] in the use from a pragmatic one (if necessary, new words could be coined, even ‘unclassical’ ones, and syntactic means could be used as seemed fit), to a principled one, which should aim at writing ‘classical’ Latin morphologically and syntactically”. This, paired with the Humanist method of “going ad fontes” (i.e. adhering strictly to the original classical texts), brought early Modern Latin closer to the classical standards than ever before. Comparison with classical authors is therefore particularly meaningful in our research as a means of clustering authors within common ancestries.
[1] The concept appears for the first time in an epistolary exchange between the humanists Francesco Barbaro and Poggio Bracciolini at the start of the XV century. See (van Miert, 2018) for a recent reading on the topic.
Whether it stems from conscious text reuse or from coincidental resemblance, textual similarity can be viewed in a twofold manner, based on its presence or absence. When present, it is a measure of the closeness between two texts, so that one of them can be read in relation to the other; when absent, it represents their degree of distance (or “dissimilarity”), which is as important as its counterpart. Moreover, dissimilarity can be a criterion for further inquiries: as a standard measure between two texts, to state their closeness under different literary aspects (style, content, space and time, etc.); or as a marker for a more subtle closeness to be found in a common ancestry back in time or in another unrelated place, in the form of a predecessor, or “pre-text”, shared by both texts. The concept of pre-text within evolutionary networks is well explained in the paper The structure and evolution of story networks, by which our research was heavily inspired. According to Karsdorp and Van den Bosch (2016): “Story networks consist of stories and links between stories that represent pre-textual relationships. We make the simplifying assumption that stories that are more similar to each other are more likely to stand in a pre-textual relationship than stories that are more distant”. The focus of their paper is on “storied retellings”, well-defined story frames for which heavy text reuse is taken as a starting point.
Our research question is formulated within the framework of the TransLatin Project, which investigates this very notion: what is the extent of the process of imitation and reception within Neo-Latin drama? Are any authors connected at all? Which ones serve as the strongest pre-texts (in literary terms: “models”) for the others? To answer these questions, and while being aware of the wide range of tools for text reuse detection, we took our first steps towards a thorough investigation of similarity networks through two different methodologies: Cosine Similarity and Bootstrap Consensus Trees.
The contributions of this paper are as follows:
• CURRENS: a new tool for the pre-processing of Latin texts;
• insights into reuse in Neo-Latin drama;
• new applications of known methodologies, drawn from Information Retrieval and Stylometry, to the topic of textual similarities.
The remainder of this paper is structured as follows. In Section 2 we review related work. In Section 3 we explain the criteria that we followed for preparing our corpus. In Section 4 we detail the experimental setup. In Section 5 we discuss the results. Finally, in Section 6 we consider future steps and draw conclusions about the whole experiment.
2. Related Work
In the last decade, several tools have been made available for tracking text reuse in historical languages through text alignment or feature extraction. The best known of these tools (TRACER [2], Tesserae [3] and Passim [4]) have also been tested for Latin: one of the most recent experiments is that of Franzini, Passarotti, Moritz and Büchler (2018), in which a thorough exploration of HTRD (Historical Text Reuse Detection) tools can be found. As for Cosine Similarity, Manjavacas et al. (2019) approached allusive textual reuse detection on a Latin Biblical corpus from an Information Retrieval standpoint: through extensive use of Cosine Similarity scores and Word Embeddings, they found that custom query algorithms for automatic allusion detection were consistently outperformed by simpler TF-IDF models and that Cosine Similarity can prove a sound basis for investigating textual reuse. Other studies, such as (Bär et al., 2012) and (Sturgeon, 2018), employed Cosine Similarity and TF-IDF scores, among other approaches, in text reuse and similarity detection with good results, both for contemporary language corpora (the former, which was tested on the METER corpus and the Webis Crowd Paraphrase corpus) and historical language corpora (the latter, which worked on an Early Chinese corpus). As for stylometric approaches, Stylometry has been used extensively for textual similarity and reuse detection. Some experiments have also been conducted on historical languages, especially Latin (Eder, 2016) and Ancient Greek (Gorman and Gorman, 2016).
[2] Last visited: 16/1/2021
[3] Last visited: 16/1/2021
[4] Last visited: 16/1/2021
3. Corpus Preparation
Our corpus was assembled considering three parallel tracks, attempting to cover the main aspects of a literary corpus:
• Topical aspect: works pertaining to the same subject;
• Authorial aspect: works from the same author;
• Diatopic and diachronic aspects: works from different times and places.
Our aim for this initial exploration is to set up a stable pipeline and a gold standard to expand upon in the future.
Our corpus thus contains 47 works in total, sub-divided as follows.
15 works from early Modern Neo-Latin drama, comprising 8 works on the topic of the “Joseph play” (to satisfy the first aspect), 3 same-author clusters (to satisfy the second aspect), and a range of 4 different nationalities and places of publication (to satisfy the third aspect): 3 authors from Germany, 1 from Poland, 1 from England and 7 from the Netherlands, thus keeping our particular focus on Dutch writing. The diachronic aspect is satisfied by the range of these works, spanning from 1510 to 1639.
Furthermore, to track the very first models of our Modern-era drama corpus and to serve as an additional counter-check for the clustering in Section 4.3, we added the 6 works of the Latin playwright Terence, 20 by Plautus (the known 21st is heavily fragmented and could not serve our purpose) and 6 dramas of certain attribution by Seneca, the authenticity of whose corpus is still highly debated.
Since our Neo-Latin drama texts come from OCR of centuries-old prints, they contain errors and imprecisions that can severely impact the processing of a text (van Strien et al., 2020). Furthermore, we needed our texts to be devoid of any unnecessary information (e.g. verse numbers and character abbreviations), presenting just the bare tokenised script. We thus cleaned the texts in our corpus, following a common pipeline of text manipulation for Latin texts:
• Cleaning OCR errors;
• Replacing punctuation;
• Changing everything to lower case;
• Normalising Latin-related spelling issues (such as V into U and J into I);
• Replacing para-textual annotation (e.g. characters speaking, line number, verse type).
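The surface-level steps listed above can be sketched as follows. This is a minimal illustration, not the CURRENS implementation: the function name normalise_line and the regular expressions for speaker abbreviations and verse numbers are assumptions made for this example, and OCR error correction is left out because it depends on the individual print.

```python
from __future__ import annotations

import re
import string

# Hypothetical patterns for para-textual residue: a speaker abbreviation such as
# "IOS." at the start of a line, and bare verse numbers. Real editions vary.
SPEAKER = re.compile(r"^[A-Z]{2,5}\.\s*")
VERSE_NUMBER = re.compile(r"\b\d+\b")
# Map every ASCII punctuation mark to a space so that tokens stay separated.
PUNCTUATION = str.maketrans({ch: " " for ch in string.punctuation})


def normalise_line(line: str) -> list[str]:
    """Return the lower-cased, spelling-normalised tokens of one line of a play."""
    line = SPEAKER.sub("", line)            # drop the speaking character's label
    line = VERSE_NUMBER.sub(" ", line)      # drop verse or line numbers
    line = line.translate(PUNCTUATION)      # replace punctuation with spaces
    line = line.lower()                     # lower-case everything
    line = line.replace("v", "u").replace("j", "i")  # V into U, J into I
    return line.split()


tokens = normalise_line("IOS. Quid vis, Juda? 15")
```

Lower-casing before the V/U and J/I replacement means that a single pair of substitutions covers both upper- and lower-case spellings.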
A final layer of cleaning involved manipulating the actual content of the texts:
• Stopword filtering, based on the Perseus Project list, which we heavily modified and expanded;
• Non-semantic word filtering (conjunctions, subjunctions, pronouns, auxiliaries, some very common adverbs);
• Lemmatisation.
The steps in this final layer were only implemented in the Cosine Similarity part of our analysis (Subsection 4.2).
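A toy sketch of this content-level layer is given below. The stopword set and the lemma dictionary are tiny hand-made stand-ins (assumptions for this example), not the expanded Perseus list or the LemLat lemmatiser, and the choice to lemmatise before filtering is also an assumption.

```python
# Illustrative stand-ins only: a real run would load a full stopword list
# and call a proper Latin lemmatiser instead of a lookup table.
STOPWORDS = {"et", "in", "non", "ut", "iam", "sum", "res"}
LEMMAS = {
    "deorum": "deus",
    "deo": "deus",
    "fratres": "frater",
    "fratrem": "frater",
    "est": "sum",
}


def lemmatise_and_filter(tokens):
    """Map tokens to lemmas, then drop stopwords and function words."""
    lemmas = [LEMMAS.get(token, token) for token in tokens]  # unknown tokens pass through
    return [lemma for lemma in lemmas if lemma not in STOPWORDS]


content_lemmas = lemmatise_and_filter(["deorum", "et", "fratres", "est", "heu"])
```

Lemmatising first lets an inflected auxiliary such as "est" be caught through its lemma "sum", while an interjection such as "heu" survives because it is not in the stopword set.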
This whole process was carried out automatically using our custom-built program CURRENS, which builds upon the tokeniser and enclitics exception list from the CLTK pipeline and the LemLat lemmatiser, amended with in-house developed modules and expanded stopwords from the Perseus project [5]. CURRENS is available on GitHub [6]. The results of the pre-processing can be seen in Table 1.
4. Experimental Setup and Analysis
In this section, we present our experimental setup and analysis. The experimental setup sketches our general approach to analysing Neo-Latin texts. We then explain how we use Cosine Similarity (Section 4.2) and Stylometry (Section 4.3), the latter by constructing a Bootstrap Consensus Tree, to compare different texts, and what these different analysis methods bring.
4.1. Experimental Setup
Our first analysis is inspired by (Karsdorp and Van den Bosch, 2016): we calculate the cosine similarity for every pair of texts in our corpus and produce a sparse correlation matrix, in order to express the closeness between texts and authors in terms of vector representation in an n-dimensional space model. This served as a basis to build a network representation that revealed a pattern of evolution shaped by “PA-TA” (preference-based and temporal-based) attractiveness, basically a heavy-tailed, mostly chronological distribution of similarities that resembles real-life evolutionary growth networks (thus confirming the findings of (Karsdorp and Van den Bosch, 2016)).
Our second analysis involves a stylometric approach (Eder, 2017). We computed a Delta-distance Bootstrap Consensus Tree and produced the Principal Components Analysis (PCA) for our corpus of 47 works. Combining the results from these analyses, we drew another network that revealed a new and unexpected clustering, unveiling previously unknown similarities. The analysis was also able to correctly identify classical models and age- or generation-defined clusters, as found in previous literary inquiries, thus confirming the general structure and the evolution of Early Modern Neo-Latin drama.
[5] http://www.perseus.tufts.edu/hopper/
[6] https://github.com/AndrewPeverells/CURRENS
4.2. Cosine Similarity
As a first step in our twofold experiment, we calculated the cosine similarity scores for each pair of texts in our corpus, which needed a final layer of manipulation. We lemmatised the texts, since calculating the cosine similarity between tokenised corpora, for a highly inflected language such as Latin, would produce too many false negatives. For example, the tokens “Deus” and “Deorum” would be kept separate and would not contribute towards the final cosine similarity score, when they are clearly the same word (at the “lemma” level) realised in two different ways (at the “token” level). On the other hand, to prevent the inflation of the final score due to false positives, we eliminated most of the lemmas that do not carry highly informative semantic content and that usually occur as textual invariants: stopwords and function words, together with some very frequent Latin words, such as a few adverbs (ut, iam, saepe...), nouns (res) and verbs (mostly auxiliaries and derivatives: habeo, sum, fio, possum...) [7]. However, we kept the interjections, which are an important part of theatrical writing.
Firstly, we transformed the lemmatised corpus into a Word Vector Space model; secondly, we turned it into a matrix of TF-IDF features; finally, we computed the actual cosine similarity scores for our texts. We then generated the co-occurrence tables for every text, for a more in-depth explanation of the word likeness between works, divided into spreadsheets according to a precise rationale: one group for the general co-occurrences between the two sub-corpora (Neo-Latin Modern plays and Classical plays), and the other group for the texts with the highest cosine similarity scores. Every spreadsheet is accompanied by a second sub-sheet bearing some general statistics for the particular pair analysed: type-token ratio (TTR), mean word length and lemma dispersion.
[7] The importance of lemmatisation in cosine similarity score measuring for textual similarity is also confirmed by (Manjavacas et al., 2019), as “lemmatization boosts the performance of nearly all models”.
Figure 1: Cosine Similarity scores heatmap.
4.3. Stylometry
For the second part of the experiment, we opted for a stylometric analysis through the R package stylo (Eder et al., 2016). We went back a step in the corpus preparation procedure to keep the stopwords/function words in the texts, as they are a vital part of every stylometric analysis, and we kept our corpus tokenised. We then produced a Bootstrap Consensus Tree, testing different parameter settings.
• As for the distance measure, we chose Eder’s Delta (Evert et al., 2017), which is particularly suited for highly inflected languages and word vectors that are not too long (the texts in our corpus very rarely exceed 15,000 tokens);
• As for the most frequent words (MFWs) analysed, we ran several trials and found that the clustering began to fall off at around 500 MFWs, gradually reuniting every work in a single branch. We thus chose a comfortable plateau of 200 MFWs, which showed a meaningful branching of the clusters;
• As for other important parameters, we kept the consensus strength of our tree at 0.5, left the pronouns out, and employed no culling of the MFWs.
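To make the role of the MFW parameter concrete, the sketch below computes a Delta-style distance over the n most frequent words. Note that it implements the classic Burrows's Delta (mean absolute difference of z-scored relative frequencies), swapped in here for simplicity; it is not Eder's Delta and does not reproduce the stylo package or its bootstrap consensus procedure. The function names, the use of the population standard deviation, and the toy token lists are assumptions for this example.

```python
import statistics
from collections import Counter


def most_frequent_words(corpus, n):
    """Return the n most frequent words over all texts in the corpus."""
    total = Counter()
    for tokens in corpus.values():
        total.update(tokens)
    return [word for word, _ in total.most_common(n)]


def delta_matrix(corpus, n_mfw):
    """Burrows's Delta between every pair of texts, using the n_mfw top words."""
    vocab = most_frequent_words(corpus, n_mfw)
    names = list(corpus)
    freqs = {}
    for name in names:
        counts = Counter(corpus[name])
        freqs[name] = [counts[word] / len(corpus[name]) for word in vocab]
    # z-score each word's relative frequency across the texts
    z_scores = {name: [] for name in names}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in names]
        mean = statistics.mean(column)
        sd = statistics.pstdev(column)
        for name in names:
            z_scores[name].append((freqs[name][i] - mean) / sd if sd else 0.0)
    return {
        (a, b): sum(abs(x - y) for x, y in zip(z_scores[a], z_scores[b])) / len(vocab)
        for a in names
        for b in names
    }


# Invented token lists; function words are kept, as stylometry requires.
corpus = {
    "text_a": "et in non et in et".split(),
    "text_b": "et in non et in et".split(),
    "text_c": "non non non et in in".split(),
}
distances = delta_matrix(corpus, n_mfw=3)
```

A consensus tree would then be obtained by repeating a clustering of such distance matrices over a range of MFW values and keeping only the groupings that recur in at least the chosen consensus fraction (0.5 in the setup above) of the runs.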
5. Results and Discussion
Karsdorp and Van den Bosch (2016) propose that the evolution of textual networks is based upon two dimensions: temporal attractiveness (TA), the principle by which authors tend to prefer more recent models, and model-based attractiveness (MA), which involves elements from the context (such as topic or the importance of an author). Our networks follow these two principles.
From our cosine similarity experiment (see fig.1), we can see that texts tend to follow a TA evolutionary pattern, with works that are closer in time as their highest-scoring models. Moreover, another key element emerges in the earlier stages of the network: a first cluster is clearly visible, composed of authors (Macropedius-Crocus-Gnapheus) of Dutch origin and active in the Netherlands. This shows the relative importance of the spatial aspect, which is closely related to the temporal one, thus transforming the TA into a T-SA (temporal-spatial attractiveness). However, this model of T-SA holds especially for the initial elements of the corpus, while the probability that works stray from their closest neighbours as models increases as the network evolves. This is due to the growing importance of context (MA): as time passes, authors are given more choice for their inspiration.
Another crucial aspect is that of hubs, or, in our case, key turning points. We drew a graph from our cosine data (see fig.2), introducing a minimum threshold of 0.3 to filter out the weakest scores. The resulting (out-degree) network, displaying the outgoing edges, clearly shows that some texts serve as central hubs of reuse, or “models”: works from a first, earlier period (1510-1556) appear to be strong models for later authors, and a clear-cut clustering also stands out, with one very tight group (Crocus-Macropedius-Diether) and another cluster (Macropedius-Gnapheus-Foxe) that looks loosely connected to the first one. Moreover, each cluster has a central hub that serves as its cornerstone: in the first one, Diether is well connected both to works from an earlier period and to later texts, while in the other one Foxe serves this purpose. In general, Crocus, Macropedius, Diether and Foxe were the highest-scoring authors, both in cosine similarity and out-degree values, so we can (relatively safely) assume their importance and renown in the greater respublica literaria.
From the data gathered in the second part of the experiment, we can draw some new and complementary considerations. We drew a Bootstrap Consensus Tree with our full corpus (also comprising Plautus, Terence and Seneca) through a built-in algorithm from the R package stylo (fig.3). Three main aspects stand
out. Firstly, the authorial signature is very strong: all three groups of same-author works were correctly clustered together. This holds regardless of the first aspect that we wanted to investigate: topic seems to be completely irrelevant to this kind of analysis, as Joseph plays are mixed with non-Joseph plays without any discernible pattern. This is particularly interesting because the pre-eminence of the authorial aspect over topic was not clear in the first part of the experiment: in the cosine similarity scores, same-topic texts sometimes outscored same-author clusters (as in the case of the Joseph by Schonaeus, which scored very low when compared to the Cunae, another of his works), while sometimes works from the same author had a higher score than same-topic works from other authors (as in the case of the Joseph and the Hecastus, both from
Macropedius). Secondly, the T-SA dimension is maintained, but with new and interesting additions: the algorithm automatically drew two very distinct clusters, separating the XVI century works from the XVII century ones (although Bidermann seems to be an exception). This goes hand in hand with a third consideration, regarding classical authors: Plautus was set apart from everything else, while Terence seems to have a stronger influence on the XVI century cluster and Seneca on the XVII century one. This clear-cut subdivision is confirmed by literary studies on the matter. Bloemendal and Norland (2013) identify a three-stage evolution of Dutch Neo-Latin drama: a first stage (roughly 1500-1550) that serves as a proving ground for new authors who proved very influential in later periods; a second one (1550-1600), characterised by the use of Terence as the primary model; and, finally, a third one, more akin to Baroque literature, that shifted heavily towards a more Senecan style. There are still two notable exceptions in our Consensus Tree: Bidermann (1615) seems to fit better in the XVI century cluster, and the Adelphoe was the only Terentian work that was separated from the others in all of our tests, appearing closer to authors in the XVII century cluster. The first anomaly may be due to Bidermann’s Jesuit background: within the XVII century cluster, only one (Libenus) out of three authors belongs to the Jesuit Order, which is much more concentrated in the XVI century cluster. The second anomaly, the Adelphoe by Terence, still needs more evidence.
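The thresholded out-degree network discussed in this section can be sketched as follows. The edge direction (from the earlier text, read as the potential model, to the later one) and the decision to skip same-year pairs are assumptions made for this example; the work labels, years and scores are invented placeholders, not results from the experiment.

```python
from collections import defaultdict


def out_degree_network(scores, years, threshold=0.3):
    """Build directed edges earlier -> later for pairs scoring at least threshold."""
    edges = defaultdict(list)
    for (a, b), score in scores.items():
        if score < threshold or years[a] == years[b]:
            continue  # weak link, or no temporal direction can be assigned
        source, target = (a, b) if years[a] < years[b] else (b, a)
        edges[source].append((target, score))
    return dict(edges)


def out_degrees(edges, works):
    """Number of outgoing edges per work; hubs are the works with high values."""
    return {work: len(edges.get(work, [])) for work in works}


# Placeholder data for illustration only.
years = {"work_A": 1510, "work_B": 1544, "work_C": 1592}
scores = {
    ("work_A", "work_B"): 0.45,
    ("work_C", "work_A"): 0.35,
    ("work_B", "work_C"): 0.10,
}
edges = out_degree_network(scores, years)
degrees = out_degrees(edges, years)
```

With these placeholder values only the earliest work has outgoing edges, which is the pattern a "model" text would show; the pair scoring below the threshold contributes no edge.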
6. Conclusion and future work
This paper describes a preliminary exploration towards building a functioning pipeline for assessing textual similarities in Neo-Latin texts. Through this experiment we wanted to test textual similarity and reuse through the three main aspects of literary works: the authorial aspect (works from the same authors, thus tracking internal reuse); the topical aspect (works pertaining to the same topic, thus tracking similarities throughout a same-topic sub-corpus); and the diatopic-diachronic aspect (thus tracking the reuse of other authors’ texts through time and space and the identification of “models”). We hence built our test corpus in such a way that it covered each of the three aspects we wanted to investigate, also including the works of classical drama authors (Plautus, Terence and Seneca) as a counter-check for the second part of our experiment.
Although our focus is on drama pieces and on Neo-Latin, this pipeline can be applied to any kind of Latin text, as its parameters are the same. This is thanks, especially, to our tool, CURRENS, which can be used to pre-process a Latin work in a customised fashion, depending on which module is needed in one’s analysis. We employed it in its entirety to generate clean, tokenised texts to work upon, and tweaked its modules in order to get two types of data from our initial corpus: one raw and tokenised, as presented in the original texts (the full 47, comprising classical authors); and one lemmatised and stripped of stopwords and function words (just the 15 Neo-Latin texts of the XVI-XVII centuries that we gathered as an initial exploration). The latter was used in the first part of our experiment, involving cosine similarity, for which we employed a TF-IDF vector space model to calculate the score for every pair of texts in our corpus. From these results, we built a network showing the out-degree values for the processed texts, and a heatmap showing the correlation distribution between the 15 samples.
Figure 2: Out-Degree Network of the Neo-Latin works.
Figure 3: Bootstrap Consensus Tree of 47 Latin and Neo-Latin Dramas using 2-202 Most Frequent Words, Eder’s Delta distance, Consensus 0.5.
From this, two noteworthy results stand out:
• The evolution of similarity patterns in our corpus tends to follow a T-SA/MA model (temporal-spatial attractiveness / model-based attractiveness): in the early stages of the network, authors tend to prefer texts closer in time and space as their models of reuse (T-SA), while, as time passes, the dispersion of this preferential attachment increases dramatically, with authors preferring other texts based on more aleatory reasons such as topic, style or vicinity (MA).
• The emergence of text clusters is modelled around central hubs (our “models”), represented by particularly successful authors: our corpus, in particular, split into two clear-cut clusters, with Andreas Diether in one and John Foxe in the other serving as central hubs of reuse, well connected with both authors from the early period and authors from later stages, around which the other texts seem to gravitate. This shows the relative importance of these authors and of the end of the early period of our network (1544-1556) as a testing ground for later literary imitation.
The other type of processed data (raw, tokenised text) produced with CURRENS was instead used for the second part of our experiment. We generated a BCT (Bootstrap Consensus Tree) of all 47 works that make up our corpus, combining the Neo-Latin works and the texts from classical authors as a counter-check for the clustering method that we employed (Eder’s Delta, 0.5 consensus strength, 200 MFWs). From this, we could draw further considerations:
• The authorial signal is stronger than the topical aspect. Internal style within same-author clusters prevails over features of same-topic style. The cosine similarity experiment gave mixed results in this regard.
• T-SA is maintained, but with new clusterings that define an age-dependent evolution of style: the algorithm automatically recognised two very distinct groups, one in the XVI century and one in the XVII century, with classical authors arranged as clear models (Terence for the first group and Seneca for the second; Plautus was set apart as too distant). This is confirmed by literary studies, which report a similar generation-like evolution of Neo-Latin drama and model selection.
• Two main exceptions stand out: the alien presence of the Adelphoe by Terence in the XVII century group and that of Bidermann (1615) in the XVI century cluster. These need more evidence.
We thus answered the original questions: we proved that the process of imitation and reception within Neo-Latin drama is extensive and that it happened on many layers (spatiality, temporality, modality); we tracked connections between authors and checked the reuse of models, both contemporary and ancient; finally, we proved the existence of hubs of reuse, thus gaining more insight into the importance of some authors in the Early Modern Era and the reflection of classical drama writers on this very age.
As a further step to improve our model of textual similarity for Neo-Latin texts, we plan to build on the basis we have set, as well as to employ new methodologies in our next experiment. First of all, the expansion of our corpus with new Neo-Latin texts from the XVI-XVII centuries will be a constant background operation, as the TransLatin Project moves forward and enables more texts to be digitised and analysed. Secondly, a word embeddings analysis of our corpus will be conducted, to improve upon the foundations of the cosine similarity experiment that we already conducted. Finally, as a different approach, we would like to implement a topic modelling analysis to better investigate the topical aspect of our pipeline and gain a deeper understanding of how textual reuse works in conjunction with topic variation.
The code for this paper is available on GitHub at https://github.com/AndrewPeverells/Translatin
7. Acknowledgements
This research is conducted within the framework of
the TransLatin project, funded by the Dutch Research
Council (NWO).
8. Bibliographical References
Bär, D., Zesch, T., and Gurevych, I. (2012). Text reuse detection using a composition of text similarity measures. In Proceedings of COLING 2012, pages 167–184.
Bloemendal, J. and Norland, H. (2013). Neo-Latin Drama in Early Modern Europe. Brill.
Eder, M., Rybicki, J., and Kestemont, M. (2016). Stylometry with R: a package for computational text analysis. The R Journal, 8(1).
Eder, M. (2016). A bird’s-eye view of early modern Latin: Distant reading, network analysis, and style variation. Early Modern Studies After the Digital Turn, page 63.
Eder, M. (2017). Visualization in stylometry: cluster analysis using networks. Digital Scholarship in the Humanities, 32(1):50–64.
Evert, S., Proisl, T., Jannidis, F., Reger, I., Pielström, S., Schöch, C., and Vitt, T. (2017). Understanding and explaining delta measures for authorship attribution. Digital Scholarship in the Humanities, 32.
Gorman, V. B. and Gorman, R. J. (2016). Approaching questions of text reuse in Ancient Greek using computational syntactic stylometry. Open Linguistics, 2(1).
Karsdorp, F. and Van den Bosch, A. (2016). The structure and evolution of story networks. Royal Society Open Science, 3(6):160071.
Manjavacas, E., Long, B., and Kestemont, M. (2019). On the feasibility of automated detection of allusive text reuse. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 104–114.
Sturgeon, D. (2018). Digital approaches to text reuse in the early Chinese corpus. Journal of Chinese Literature and Culture, 5(2):186–213.
van Miert, D. (2018). Towards a conceptual history of the Republic of Letters in the modern period. Cultural History Seminar.
van Strien, D., Beelen, K., Ardanuy, M. C., Hosseini, K., McGillivray, B., and Colavizza, G. (2020). Assessing the impact of OCR quality on downstream NLP tasks. In Proceedings of ICAART 2020.
| Authors/Titles | Year/Age | Tokens in text | Types (distinct words) | Lemmas (after cleaning) | Type/token ratio (TTR, %) | Mean word length (chars) |
|---|---|---|---|---|---|---|
| Macropedius *Asotus* | 1510 | 11,450 | 7,936 | 4,724 | 41.26 | 5.39 |
| Gnapheus *Acolastus* | 1529 | 8,823 | 5,964 | 3,646 | 41.32 | 5.17 |
| Crocus *Joseph* | 1535 | 6,964 | 4,702 | 2,756 | 39.57 | 5.09 |
| Macropedius *Hecastus* | 1539 | 12,577 | 8,958 | 4,575 | 36.38 | 5.33 |
| Macropedius *Joseph* | 1544 | 12,013 | 7,116 | 4,479 | 37.28 | 5.45 |
| Diether *Joseph* | 1544 | 16,475 | 9,609 | 6,430 | 39.03 | 5.49 |
| Foxe *Christus Triumphans* | 1556 | 10,082 | 6,840 | 4,251 | 42.16 | 5.33 |
| Simonides *Joseph Castus* | 1587 | 9,628 | 6,915 | 4,418 | 45.89 | 5.35 |
| Schonaeus *Joseph* | 1592 | 12,420 | 8,282 | 3,745 | 30.15 | 5.13 |
| Schonaeus *Cunae* | 1596 | 5,504 | 3,423 | 2,237 | 40.64 | 5.35 |
| Bidermann *Joseph* | 1615 | 18,129 | 11,170 | 6,001 | 33.10 | 5.20 |
| Heinsius *Herodes* | 1632 | 9,280 | 7,430 | 4,379 | 47.19 | 5.55 |
| Libenus *Joseph Venditus* | 1634 | 5,484 | 4,582 | 2,777 | 50.64 | 5.41 |
| Grotius *Sophomponeas* | 1635 | 6,907 | 5,261 | 3,694 | 53.48 | 5.39 |
| Libenus *Joseph Agnitus* | 1639 | 6,284 | 4,807 | 3,165 | 50.37 | 5.44 |
| Terence | II century B.C. | | | | | |
| *Adelphoe* | | 8,711 | 4,310 | 2,745 | 31.52 | 4.72 |
| *Andria* | | 8,413 | 4,362 | 2,709 | 32.20 | 4.80 |
| *Eunuchus* | | 9,010 | 5,608 | 2,932 | 32.54 | 4.82 |
| *Heauton* | | 8,832 | 4,529 | 2,812 | 31.84 | 4.77 |
| *Hecyra* | | 7,301 | 4,264 | 2,390 | 32.74 | 4.81 |
| *Phormio* | | 8,971 | 4,394 | 2,900 | 32.33 | 4.73 |
| Plautus | III century B.C. | | | | | |
| *Amphitruo* | | 8,425 | | 2,749 | 32.63 | 4.90 |
| *Asinaria* | | 14,747 | 9,223 | 4,508 | 30.57 | 4.91 |
| *Aulularia* | | 3,929 | 2,315 | 1,627 | 41.41 | 4.96 |
| *Bacchides* | | 10,030 | 8,862 | 3,307 | 32.97 | 4.96 |
| *Captivi* | | 8,350 | 4,044 | 2,883 | 34.53 | 4.91 |
| *Casina* | | 7,271 | 4,173 | 2,520 | 34.66 | 4.74 |
| *Cistellaria* | | 5,397 | 3,124 | 2,057 | 38.11 | 4.85 |
| *Curculio* | | 2,300 | 4,818 | 1,124 | 48.87 | 4.87 |
| *Epidicus* | | 6,546 | 5,005 | 2,327 | 35.55 | 4.87 |
| *Menaechmi* | | 9,133 | 6,159 | 2,879 | 31.52 | 4.85 |
| *Mercator* | | 8,766 | 6,002 | 2,915 | 33.25 | 4.86 |
| *Miles Gloriosus* | | 12,811 | 7,226 | 3,886 | 30.33 | 4.98 |
| *Mostellaria* | | 9,777 | 5,600 | 3,081 | 31.51 | 4.84 |
| *Poenulus* | | 10,858 | 8,579 | 3,486 | 32.11 | 4.93 |
| *Pseudolus* | | 11,369 | 7,281 | 3,579 | 31.48 | 4.85 |
| *Rudens* | | 11,450 | 8,642 | 3,543 | 30.94 | 4.93 |
| *Stichus* | | 6,394 | 3,998 | 2,477 | 38.74 | 5.02 |
| *Trinummus* | | 9,834 | 6,554 | 3,262 | 33.17 | 4.96 |
| *Truculentus* | | 8,226 | 5,028 | 2,942 | 35.76 | 4.88 |
| *Persa* | | 7,954 | 4,556 | 2,699 | 33.93 | 4.73 |
| Seneca | I century C.E. | | | | | |
| *Hercules Furens* | | 3,592 | 2,879 | 2,335 | 65.01 | 5.57 |
| *Hercules Oetaeus* | | 10,292 | 7,818 | 4,290 | 41.68 | 5.42 |
| *Medea* | | 6,349 | 4,957 | 3,034 | 47.79 | 5.53 |
| *Oedipus* | | 5,792 | 4,709 | 3,439 | 59.38 | 5.56 |
| *Thyestes* | | 6,220 | 4,360 | 3,410 | 54.82 | 5.46 |
| *Troades* | | 6,698 | 5,235 | 3,520 | 52.55 | 5.50 |
| Overall | | 419,861 | 271,609 | 53,812 | 12.82 | 5.10 |

Table 1: Dataset statistics. The titles of the plays are in italics.
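As a reading aid for Table 1, the following minimal Python sketch shows one way the per-play statistics could be computed. It rests on an assumption inferred from the figures rather than stated in the text: the TTR column appears to equal the number of distinct lemmas divided by the number of tokens, expressed as a percentage (for example, 4,724 / 11,450 ≈ 41.26 for Asotus). The function names and the toy lemma dictionary are hypothetical and stand in for whatever tokenizer and lemmatizer were actually used.

```python
from typing import Dict, List, Optional


def ttr_percent(distinct_items: int, tokens: int) -> float:
    """Type/token ratio as a percentage: distinct items per 100 tokens."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return 100.0 * distinct_items / tokens


def text_stats(tokens: List[str],
               lemma_of: Optional[Dict[str, str]] = None) -> Dict[str, float]:
    """Compute the Table 1 style statistics for one tokenized text.

    `lemma_of` is a hypothetical word-form -> lemma lookup; forms that are
    missing from it are treated as their own lemma.
    """
    lemma_of = lemma_of or {}
    distinct_lemmas = {lemma_of.get(t, t) for t in tokens}
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "lemmas": len(distinct_lemmas),
        # Assumption: TTR is computed on lemmas, as the table figures suggest.
        "ttr": ttr_percent(len(distinct_lemmas), len(tokens)),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
    }


# Toy example with five Latin word forms and a hand-made lemma lookup.
toy_tokens = "amat amant amo puella puellam".split()
toy_lemmas = {"amat": "amo", "amant": "amo", "puellam": "puella"}
toy = text_stats(toy_tokens, toy_lemmas)
```

Applied to the counts reported in the table, `ttr_percent(4724, 11450)` rounds to 41.26 and `ttr_percent(53812, 419861)` rounds to 12.82, matching the Asotus and Overall rows. The low overall value reflects the fact that lemmas recur across plays, so the ratio falls as the corpus grows.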