Short corpus report AOP
to appear in Corpora 15 (3), 2020
Multilingualism in Greater Poland court records (1386-1446): Tagging discourse boundaries
and code-switching
Matylda Włodarczyk, Adam Mickiewicz University in Poznań
Joanna Kopaczyk, University of Glasgow
Michał Kozak, Poznań Supercomputing and Networking Center
Abstract
The report introduces the Electronic Repository of Greater Poland Oaths, eROThA, 1386-1446,
a digitisation project of a diplomatic edition of mediaeval land court oaths recorded in
Latin and Old Polish, resulting in a small, lightly tagged specialised bilingual corpus.¹ We
present the background, aims, design and methodology behind the project. We also discuss
the problems and limitations entrenched in turning a printed diplomatic edition into a
machine-readable diplomatic edition equipped with a new interpretative layer sensitive
to the switches between Latin and Old Polish. In addition to automatic annotation of
code-switched items on the basis of typographic characteristics of the printed edition, flexible
coding of recurrent language and discourse boundary phenomena has been introduced
manually to account for linguistically ambiguous or neutral forms. The project offers a fully
multilingual corpus, as well as customised Polish-only and Latin-only datasets, and enables
filtered metadata searches in the online front-end. Overall, the report presents a methodology
for constructing multilingual corpora in the context of legal cultures in medieval Central
Europe that may be extrapolated to datasets originating in other periods and regions.
Keywords: code-switching, tagging, Old Polish, mediaeval land courts, Greater Poland
1. Introduction: Building a multilingual historical corpus
Recent corpus compilation efforts display two tendencies: on the one hand, striking a balance
between representativeness and size in the spirit of big data, and, on the other, acknowledging
the significance and utility of small specialised datasets. The project summarised in this paper
represents the latter trend, which has been particularly distinct in historical corpus linguistics
(Claridge, 2008; Kytö, 2012: 1512-1514). Historical datasets are fragmentary by nature and
pose very specific research questions which may be addressed by means of specialised
databases. The recent interest in historical multilingualism in particular has been fuelled by
small and medium-sized specialised corpora and it is on this basis that new theoretical and
methodological frameworks are being developed (see, among others, Schendl and Wright,
2011 eds., Nurmi et al. 2017 eds., Peikola et al. 2017 eds., Pahta et al. eds. 2018). This work
has mostly relied on texts written in administrative, legal and religious contexts, especially in
the pre-Renaissance and Early Modern periods.² The characteristic feature of such texts across
Europe, alongside the use of Latin as the traditional and prestigious written medium, is that
they contain glimpses of the vernaculars, also in the spoken domain (e.g. Ingham and Marcus,
2016 on Anglo-French and Latin; Stam, 2017 on Irish and Latin), and, inevitably,
code-switching (henceforth CS) from and to Latin (or more languages in some cases; e.g.
Kopaczyk, 2013; Lazar, 2016).

¹ The paper and the free electronic database are a result of a research project funded by the National Science
Centre in Poland (OPUS No. 2014/13/B/HS2/00644 [https://projekty.ncn.gov.pl/index.php?s=11260]).

In order to achieve systematic ways of investigating and interpreting historical
multilingualism, the field is in the process of developing guidelines and digitisation policies
for the presentation of multilingual primary material. In this respect, scholars have
emphasised a growing need for tagging solutions that are entrenched in broader frameworks
of multilingual communication via the written medium. Such solutions have already been
adopted for the languages with most extensive digital documentation, such as English
(Tyrkkö et al. 2017). In the case of the languages whose histories have not been so well
documented, resources are frequently lacking to support the application of more sophisticated
semi- or fully automatised tagging of language boundaries and switches. Importantly for our
project, there are no databases in a fully modern digital format that could be used as reliable
control corpora for automatised language recognition in mediaeval texts containing Latin and
Old Polish. Although useful electronic reference works exist, none of these had been designed
specifically for the purpose of studying CS between Latin and the vernacular.³ Compared to
the existing diachronic corpora that include Old Polish, the novelty of the eROThA tagging
scheme lies in (1) TEI-compliant marking of multilingual elements; (2) separate marking of
the linguistically ambiguous discourse boundary elements (e.g. visual diamorphs); (3)
incorporation of visual cues in the tagging of CS.
It is due to the Latin–Polish bilingualism in the legal administrative domain in the late
fourteenth and early fifteenth century that the earliest extant passages in Old Polish survive
within the Latin text. This bilingual practice is richly attested in the records of land courts all
over the Polish-speaking territory and the source texts have been published in printed editions
(e.g. Kuraszkiewicz and Wolff, 1950), but not in an electronic format. Our project
concentrates on the Greater Poland land court records, since the monumental five-volume
diplomatic edition of these texts by Kowalewicz and Kuraszkiewicz (1959-1981; Section 2
below) uses some typographic marking of CS that may be used as the basis for automatic
annotation. Thus, it offers an ideal starting point to develop a methodology for constructing
multilingual corpora in the context of legal cultures in medieval Central Europe.
In this paper, we present the background (Section 2), aims, design and methodology
behind the Electronic Repository of Greater Poland Oaths (eROThA) database – an electronic
corpus based on the diplomatic printed edition of the Greater Poland court oaths
(http://rotha.ehum.psnc.pl). In Section 3, we also discuss the problems and limitations
entrenched in the task of digitising a printed diplomatic edition into a proper digital edition in
the modern sense (e.g. Honkapohja et al. 2009; Vanhoutte and van den Branden, 2009). Then
we present the steps taken to transform the printed edition into a digital edition (3.2), followed
by a detailed overview of the objectives and outcomes of the project (3.3). In Section 3.4, we
focus on the mark-up and annotation schemes that were used to represent CS along with the
manual tagging of the so-called triggers, which simultaneously mark discourse and language
boundaries. A short summary closes the report (Section 4).

² E.g. the following projects: Medieval Nordic Text Archive (MENOTA; Oslo at
http://www.menota.org/EN_forside.xhtml); Bilingualism in Medieval Ireland: Language choice as a part of
intellectual culture (Galway and Utrecht; e.g. Bisagni and Warntjes, 2007; Stam, 2017).
³ In particular, the eFontes dictionary (Elektroniczny korpus łaciny średniowiecznej na ziemiach polskich,
“Electronic Corpus of Medieval Latin in Polish Lands”) and Słownik pojęciowy języka staropolskiego
(“Old Polish Conceptual Dictionary”) proved very useful as reference works for the study of the land
book data. Cf. Pastuch et al. 2018 for an overview of electronic resources for historical and
contemporary Polish.

2. Latin and the vernacular in the land court
In the land courts of Greater Poland and beyond,⁴ Latin held the position of the language of
the record well beyond the medieval period (Bedos-Rezak, 1996).⁵ In the oral litigation
procedure, however, at least in the late fourteenth and the first three decades of the fifteenth
century, the vernacular also had a crucial part to play, as showcased in the land books.⁶ As a
result, the administrative and court record displays ‘Chancery bilingualism’ (Adamska,
2013).⁷ In the land courts where defendants were subject to the so-called compurgation ritual,
the vernacular was the vehicle for the defendant’s plea. The ritual involved giving an oath of
denial (in the vernacular) in response to the accusation, or the account of the complaint
witnesses (cf. Zvi and Smith, 1996: 50-51). The multilingual context in which the oath (in
Latin: rot(h)a) is embedded constitutes the focus of the eROThA project. In the record,
individual cases which typically begin with the administrative details provided in Latin
(usually the names and provenance of the involved parties) are followed by a rota in Old
Polish most likely produced orally in court by the defendant’s witnesses. In addition, oaths
may be followed by further Latin sections concerning the dates of subsequent trials or further
procedural information (Trawińska, 2014: 97).
The edited record containing over 6,330 civil and criminal cases for six locations in
Greater Poland (Kopaczyk et al. 2016: 20-21; Table 1) spans six decades (1386-1446;
Kowalewicz and Kuraszkiewicz, 1959-1981),⁸ although the regional and temporal
representation is uneven (for details see Kopaczyk et al. 2016). The printed oaths have been
employed for the study of grapho–phonemic correspondences, Old Polish syntax and
morphology (e.g. Krążyńska, 2010; Słoboda, 2012),⁹ as well as legal history (Rymaszewski,
2008). According to the palaeographic analysis conducted by the editors, over 200 scribal
hands were involved in record keeping. The eROThA project has transformed the edition into
a modern open access multipurpose digital resource for linguists and historians, incorporating
high resolution facsimiles. Our aim is to enhance and support the existing scholarship and
inspire new lines of inquiry in these fields and beyond, e.g. into palaeographic detail. The
following section presents the aims and design of the project at length.
⁴ Land and town court records (acta terrestria et castrensia) surviving from other Polish-speaking territories
contain the vernacular next to Latin well into the eighteenth century (Wąsowicz, 1975; Kulecki, 2008).
⁵ The time span of the Latin land books (180 volumes) of Greater Poland is 1396-1791 (Trawińska 2009: 345).
⁶ The editors indicate that only the earliest land books include Polish oaths (Kowalewicz and Kuraszkiewicz,
1959: 10), while in the later ones (books XV and XV for the 1440s), oaths were overwhelmingly recorded in
Latin.
⁷ In Central Europe, both land law and canon law regulated real estate of the nobility and ecclesiastical
properties respectively, while ‘several varieties of ‘German’ law of the inhabitants of towns and villages [...] had
been created according to German models’ (Adamska, 2013: 354). The situation was similar in Hungary and the
Bohemian Crowns. There may have been analogous mediaeval manor courts in the English-speaking territories,
albeit earlier on (the thirteenth century; Zvi and Smith, 1996: 50-51).
⁸ A selection from c. 55 land books was diplomatically edited by a palaeographer, Kowalewicz, and a historical
linguist, Kuraszkiewicz.
⁹ More recent work has reassessed the usefulness of this material for dialect studies (see Trawińska, 2009 and
references therein). The revision is in line with a growing volume of work on historical specialised registers
which stresses the independence of written text and the need to study it as linguistic data in its own right, rather
than as a reflection of spoken language (Jucker and Pahta eds. 2011).

3. Aims and design of the eROThA project
In addition to various research topics that have already been explored based on data samples,
the main interest of our project falls on a systematic recognition of multilingual features in the
linguistically mixed texts of the oaths. The project’s aim is to enable access to language
contact features via searches in a freely available web-based engine and to facilitate research
on the modes and constraints on CS operating on different linguistic and discourse levels, for
different scribes, courts, and over time. To achieve these aims, annotation of multilingual
features had to be implemented while digitising the printed edition. However, as we have
shown in our studies (Kopaczyk et al. 2016; Włodarczyk et al. forthcoming a; Włodarczyk
and Adamczyk, forthcoming b), the complexity of CS between Latin and Polish is so rich that
no currently available annotation scheme would be able to capture all its features. Instead, we
include the tagging of CS phenomena on the highest level of linguistic and discourse
organisation, as well as on the level of syntax and lexicon, albeit excluding personal and place
names where CS is frequently present on the level of morphology (Kucała, 1974).
There is a growing conviction in the field of corpus building that digitisation projects
based on printed diplomatic or critical editions of primary data should try to reconcile the
requirements of linguistic corpora and those of digital editions (Vanhoutte and van den
Branden, 2009; Honkapohja et al. 2009; Marttila, 2013). A digitised static edition is not
enough; instead researchers should aim to provide a multipurpose digital resource that
involves modularity and dynamic architecture. In this way, open-endedness and combinability
are achieved: these are the prerequisites for true interdisciplinarity. Honkapohja et al. state
that by constructing modular and adaptable networks of texts open to new layers of
annotation, corpus compilers may be liberated ‘from the chains of “what has been edited”’
and enabled ‘to add texts from original sources with reasonable effort, effectively becoming
digital editors themselves’ (2009: 467). Although the eROThA project does not undertake any
major revision of the interpretations proposed by the printed edition, it follows the concept of
an open-ended digital edition in that (1) it offers a lightly annotated corpus of data that can be
used by linguists and non-linguists; (2) it provides ‘an analytical descriptive model’ (Marttila,
2013: 3) of CS on the discourse level; as well as (3) ‘an archive of research material in the
form of the intermediary facsimiles and transcriptions, which may (…) be used to produce
different kinds of editions in the future’ (Marttila, 2013: 4).
3.1. The unit of analysis: Oaths as individual cases
In the eROThA project, corpus design and data presentation formats are based on the
analytical unit of discourse and the schema of representation of the communicative event,
inherited from the printed edition. In the printed edition, an individual compurgation event,
including an oath (a single one as a rule; in a few cases – a number of consecutive oaths
formulated in connection to a single case) was selected for presentation. This decision was
most likely determined by the monolingual focus: in selecting the Polish discourse element as
the unit of manuscript representation, the editors aimed to foreground the earliest record of
extended utterances in the vernacular.¹⁰ Oaths were thus identified in the books and paired with
Latin introductions, rendering two discourse elements (the Latin and the Polish) that
constitute a single ‘unit’, or one oath-taking (compurgation) event. The Polish components of
such an event were diplomatically transcribed and then enhanced with editorial transcription –
a quasi-translation into standardised Old Polish (achieved mostly by reducing spelling and
morphological variation) (see Kopaczyk et al. 2016: 22; Figure 1 for an illustration of the
presentation mode in the edition). In terms of organising the records, Kowalewicz and
Kuraszkiewicz used the division into scribal hands in each individual location. In effect, the
chronology and the ordering of individual books, or even of the leaves within the books, are
not consistently followed, as precedence is given to the grouping by scribal hand.

¹⁰ This is the most likely explanation, which goes hand in hand with monolingual ideologies and the myth of
national languages prevalent in linguistic and literary studies of the twentieth century, cf. Tyler (ed. 2011).

Importantly for our purposes, a unit of presentation is equipped with information about
the source, date of origin and scribe. These features have been transformed into the metadata
used to describe the oaths in the eROThA database. Extracting the metadata automatically was
not a straightforward process, as relevant information was not recorded in a consistent manner
in the printed edition. For example, the information on scribal hands was only given once in
the main text: in a separate paragraph (heading) followed by further units by this hand. Thus,
the eROThA metadata entry had to be constructed out of the relevant pieces of information in
a carefully designed automatic selection process. As a result, each entry consists of the
individual unit number (reference to the printed edition), the location of the court, date of the
record, source information (book and leaf number) and scribal hand.
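The way a scribal-hand heading applies to every unit that follows it can be pictured with a minimal Python sketch. It is an illustration only and not the project's own parser: the names `OathMetadata` and `attach_scribal_hands`, the field names and the `("hand", ...)`/`("unit", ...)` input stream are assumptions introduced here, and the sample values are those of unit Gn.330.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class OathMetadata:
    """One metadata entry per oath unit (field names are illustrative)."""
    unit_number: int    # reference to the unit in the printed edition
    location: str       # location of the court
    year: int           # date of the record
    book: str           # source: land book
    leaf: str           # source: leaf number
    scribal_hand: str   # hand label taken from the preceding heading


Item = Tuple[str, Union[str, dict]]


def attach_scribal_hands(items: Iterable[Item], location: str) -> Iterator[OathMetadata]:
    """Carry the most recent scribal-hand heading forward to the units after it."""
    current_hand = None
    for kind, payload in items:
        if kind == "hand":
            current_hand = payload
        elif kind == "unit":
            if current_hand is None:
                raise ValueError("unit found before any scribal-hand heading")
            yield OathMetadata(location=location, scribal_hand=current_hand, **payload)


# Hypothetical input stream: one heading followed by a unit written by that hand.
stream = [
    ("hand", "PISARZ 19"),
    ("unit", {"unit_number": 330, "year": 1430, "book": "GnZ4", "leaf": "117"}),
]
records = list(attach_scribal_hands(stream, location="Gniezno"))
```

Keeping the current hand as running state mirrors the layout of the printed edition, where the hand is stated once and is implicit for the units below it.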
3.2. From the printed edition to the TEI P5 scheme
In the first step, the edited volumes were converted to text files using a standard Optical
Character Recognition (OCR) process resulting in html files with adjacent css files.¹¹ This
process was sensitive to the structure of the text (divisions into the Latin introduction, the
Polish oath, standardised editorial text and the footnotes) and its typographic format (font
size, italics, etc.). Then we implemented a chain of grammars using ANTLR 4 (Parr, 2013)
and the Java programming language, transforming the OCR output into standardized TEI P5
format, encoded in UTF-8.¹² The implemented parser was designed to combine and divide the
text according to the metadata and to convert an individual unit into a separate xml file in the
TEI P5 scheme. A sample TEI header is presented in Figure 1 below.
<tei:teiHeader>
  <tei:fileDesc>
    <tei:titleStmt>
      <tei:title xml:lang="pol">Rota 330, Księga Ziemska 4, Karta 117, Gniezno</tei:title>
      <tei:title xml:lang="eng">Rotha 330, Land Court Book 4, Page 117, Gniezno</tei:title>
    </tei:titleStmt>
    <tei:publicationStmt>
      <tei:publisher>
        <tei:orgName>ROThA</tei:orgName>
      </tei:publisher>
      <tei:date when="2019-02-20"/>
    </tei:publicationStmt>
    <tei:sourceDesc>
      <tei:biblStruct>
        <tei:monogr>
          <tei:title xml:lang="pol">Rota 330, Księga Ziemska 4, Karta 117, Gniezno</tei:title>
          <tei:title xml:lang="eng">Rotha 330, Land Court Book 4, Page 117, Gniezno</tei:title>
          <tei:author role="author">
            <tei:persName sameAs="Per.19">PISARZ 19</tei:persName>
          </tei:author>
          <tei:textLang mainLang="lat-med"/>
          <tei:textLang mainLang="pol-old" otherLangs="pol"/>
          <tei:imprint>
            <tei:pubPlace>Gniezno</tei:pubPlace>
            <tei:date when="1430"/>
            <tei:biblScope unit="issue">330</tei:biblScope>
            <tei:biblScope unit="volume">GnZ4</tei:biblScope>
            <tei:biblScope unit="page" from="117" to="117"/>
          </tei:imprint>
        </tei:monogr>
      </tei:biblStruct>
      <tei:msDesc>
        <tei:msIdentifier>
          <tei:repository>ROThA</tei:repository>
          <tei:idno>Gn.330</tei:idno>
        </tei:msIdentifier>
      </tei:msDesc>
    </tei:sourceDesc>
  </tei:fileDesc>
</tei:teiHeader>
Figure 1. Sample TEI header

¹¹ OCR and Fine Reader procedures.
¹² http://www.tei-c.org

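A header of this kind can be read back into a flat metadata record with standard XML tooling. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the Gn.330 header. It assumes that the `tei:` prefix is bound to the TEI namespace `http://www.tei-c.org/ns/1.0` (the figure does not show the declaration), and the function name `read_header` and the keys of the returned dictionary are illustrative choices, not part of the eROThA code.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Shortened header; the namespace declaration on the root is an assumption.
SAMPLE_HEADER = """<tei:teiHeader xmlns:tei="http://www.tei-c.org/ns/1.0">
  <tei:fileDesc>
    <tei:sourceDesc>
      <tei:biblStruct>
        <tei:monogr>
          <tei:author role="author">
            <tei:persName sameAs="Per.19">PISARZ 19</tei:persName>
          </tei:author>
          <tei:imprint>
            <tei:pubPlace>Gniezno</tei:pubPlace>
            <tei:date when="1430"/>
            <tei:biblScope unit="volume">GnZ4</tei:biblScope>
            <tei:biblScope unit="page" from="117" to="117"/>
          </tei:imprint>
        </tei:monogr>
      </tei:biblStruct>
      <tei:msDesc>
        <tei:msIdentifier>
          <tei:repository>ROThA</tei:repository>
          <tei:idno>Gn.330</tei:idno>
        </tei:msIdentifier>
      </tei:msDesc>
    </tei:sourceDesc>
  </tei:fileDesc>
</tei:teiHeader>"""


def read_header(xml_text: str) -> dict:
    """Extract the unit identifier, court location, year, book and scribe."""
    root = ET.fromstring(xml_text)
    imprint = root.find(".//tei:imprint", TEI_NS)
    return {
        "id": root.findtext(".//tei:msIdentifier/tei:idno", namespaces=TEI_NS),
        "location": imprint.findtext("tei:pubPlace", namespaces=TEI_NS),
        "year": imprint.find("tei:date", TEI_NS).get("when"),
        "volume": imprint.findtext("tei:biblScope[@unit='volume']", namespaces=TEI_NS),
        "scribe": root.findtext(".//tei:author/tei:persName", namespaces=TEI_NS),
    }


header = read_header(SAMPLE_HEADER)
```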
Apart from the metadata, the printed edition also defined the regions of representation for the
digital version: the Latin introduction, the Polish oath (transliteration into Old Polish) and its
standardised version. The above-mentioned parser has converted the three regions into three
separate annotated text regions within one TEI P5 file (see Section 3.3 below). Rich
TEI-schema tags were implemented in the Latin and Old Polish sections. In addition to marking
out CS, many other features of the text, such as glosses, additions, deletions and omissions,
are captured in the schema. The three text regions of Gn.330 are presented in Figure 2 below.
Additionally, the parser has linked each file with a high quality photograph of the relevant
manuscript leaf (or leaves), featuring a red square around the corresponding source text. The
images are available in the eROThA portal¹³ together with the automatically generated text
files which are compatible with any computational tool used for linguistic analysis (e.g.
AntConc).
<tei:teiHeader>
  <tei:fileDesc>
    <tei:titleStmt>
      <tei:title xml:lang="pol">Rota 330, Księga Ziemska 4, Karta 117, Gniezno</tei:title>
      <tei:title xml:lang="eng">Rotha 330, Land Court Book 4, Page 117, Gniezno</tei:title>
    </tei:titleStmt>
    <tei:publicationStmt>
      <tei:publisher>
        <tei:orgName>ROThA</tei:orgName>
      </tei:publisher>
      <tei:date when="2019-02-20"/>
    </tei:publicationStmt>
    <tei:sourceDesc>
      <tei:biblStruct>
        <tei:monogr>
          <tei:title xml:lang="pol">Rota 330, Księga Ziemska 4, Karta 117, Gniezno</tei:title>
          <tei:title xml:lang="eng">Rotha 330, Land Court Book 4, Page 117, Gniezno</tei:title>
          <tei:author role="author">
            <tei:persName sameAs="Per.19">PISARZ 19</tei:persName>
          </tei:author>
          <tei:textLang mainLang="lat-med"/>
          <tei:textLang mainLang="pol-old" otherLangs="pol"/>
          <tei:imprint>
            <tei:pubPlace>Gniezno</tei:pubPlace>
            <tei:date when="1430"/>
            <tei:biblScope unit="issue">330</tei:biblScope>
            <tei:biblScope unit="volume">GnZ4</tei:biblScope>
            <tei:biblScope unit="page" from="117" to="117"/>
          </tei:imprint>
        </tei:monogr>
      </tei:biblStruct>
      <tei:msDesc>
        <tei:msIdentifier>
          <tei:repository>ROThA</tei:repository>
          <tei:idno>Gn.330</tei:idno>
        </tei:msIdentifier>
      </tei:msDesc>
    </tei:sourceDesc>
  </tei:fileDesc>
</tei:teiHeader>
Figure 2: Rich TEI-schema tags

¹³ The images were provided by the National Archives in Poznań, one of the project partners. For instance,
Rota 330 from Gniezno (ID = Gn.330) is available at https://rotha.ehum.psnc.pl/breeze/Gn.330

Although the printed edition provided the basic text input and structure, the digital edition has
the added value of (1) annotating multilingual passages and enabling searches either in a
selected monolingual section, or in the full multilingual database; and (2) opening the primary
sources to new interpretations by linking the transcriptions with high-quality manuscript
images.
3.3. CS at the discourse boundary
In terms of CS tagging, the project concentrates on the level of discourse, and within
individual discourse elements on clausal, phrasal and lexical switches. This kind of analysis is
supported by the layout employed in the printed edition, whereby a division is consistently
maintained between the Latin introduction (the witness list) and the Polish oath. Within each
of the discourse chunks, the printed edition renders lower-level CS elements (into Polish or
into Latin) in italics. Such formatting is used for both inter- and intrasentential CS, as well as
single lexical items, and translates easily into TEI marking <foreign> in an automatised
parsing process which was implemented during OCR design and correction (see the workflow
below). Nevertheless, in some cases the typographic marking of CS elements was incomplete,
so the parser complemented it with Apache Tika language detection.¹⁴
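The mapping from italics to <foreign> can be illustrated with a short Python sketch. It is a stand-in for the idea only: the project's parser was built with ANTLR grammars in Java and was complemented by language detection, whereas the sketch uses a regular expression and assumes that the OCR html marks italics with plain `<i>` elements. The function name `italics_to_foreign` and the placeholder strings are invented for the example; the language codes `lat-med` and `pol-old` are those of the TEI files.

```python
import re

# Assumption: the OCR html marks italics with plain <i>...</i> elements.
ITALIC = re.compile(r"<i>(.*?)</i>", re.DOTALL)

# In a Latin region the italicised items are Old Polish, and vice versa.
OTHER_LANGUAGE = {"lat-med": "pol-old", "pol-old": "lat-med"}


def italics_to_foreign(region_html: str, region_lang: str) -> str:
    """Wrap every italicised span of a region in a tei:foreign element."""
    foreign_lang = OTHER_LANGUAGE[region_lang]

    def wrap(match):
        return f'<tei:foreign xml:lang="{foreign_lang}">{match.group(1)}</tei:foreign>'

    return ITALIC.sub(wrap, region_html)


print(italics_to_foreign("latin words <i>polish words</i> latin words", "lat-med"))
# prints: latin words <tei:foreign xml:lang="pol-old">polish words</tei:foreign> latin words
```

Because the language of the switch is derived from the language of the surrounding region, the same rule covers switches in both directions.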
There are three regions in the basic unit of presentation (the individual rota): the Latin
introduction (and the optional Latin chunks that followed an oath), the Polish oath, and the
original editorial translation into Old Polish (separate TEI sections <div>s, see Figure 2 in
Section 3.2). In both Latin and Old Polish <div>s the items or phrases which are code-
switched (italicised in the printed edition, or recognized by Apache Tika as a different
language) were automatically tagged as <foreign>. The tag is dynamic in nature, as it encodes
switches from Latin into Polish and from Polish into Latin, depending on the <div>. The
items at the boundary of discourse and language switch (triggers)¹⁵ were also tagged as
<foreign>, but with the additional attribute ‘source’ pointing to the type of the trigger. The
attribute has the value ‘INDEPENDENT trigger’ if the trigger is visually independent in the
manuscript and the value ‘POLISH transliteration’ if it belongs visually to the Polish oath (the
text region of the Polish oath is denoted as ‘transliteration’). The boundary elements which did
not stand out visually from the Latin introduction were not marked, even if their semantics
indicated flagging or another transitional function.¹⁶ All boundary elements, whether tagged
or not, remain part of the Latin introduction <div>. This approach has allowed for creating
datasets which are Latin only (excluding the CS), or Polish only, while at the same time
remaining sensitive to the special nature and the dynamic positioning of the trigger.

¹⁴ https://tika.apache.org/1.17/detection.html

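How a monolingual dataset can be derived from a region tagged in this way is sketched below in Python. The sketch assumes the TEI namespace `http://www.tei-c.org/ns/1.0` and treats any <foreign> element carrying a ‘source’ attribute as a trigger; the function names, the `keep_triggers` switch and the placeholder text are illustrative and do not reproduce the project's implementation.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
FOREIGN = f"{{{TEI}}}foreign"

# Placeholder region: one ordinary switch and one trigger (text is invented).
SAMPLE_DIV = (
    f'<tei:div xmlns:tei="{TEI}" xml:lang="lat-med">latin one '
    '<tei:foreign xml:lang="pol-old">polish item</tei:foreign> latin two '
    '<tei:foreign xml:lang="pol-old" source="INDEPENDENT trigger">trigger</tei:foreign>'
    '</tei:div>'
)


def _collect(elem, keep_triggers, out):
    """Gather text, skipping code-switched elements (and triggers unless kept)."""
    out.append(elem.text or "")
    for child in elem:
        is_foreign = child.tag == FOREIGN
        is_trigger = is_foreign and child.get("source") is not None
        if not is_foreign or (is_trigger and keep_triggers):
            _collect(child, keep_triggers, out)
        out.append(child.tail or "")  # text after the element stays in the region


def monolingual_text(div, keep_triggers=False):
    """Return the text of a region without its code-switched material."""
    out = []
    _collect(div, keep_triggers, out)
    return " ".join("".join(out).split())


div = ET.fromstring(SAMPLE_DIV)
latin_only = monolingual_text(div)                       # "latin one latin two"
with_trigger = monolingual_text(div, keep_triggers=True)  # "latin one latin two trigger"
```

Treating the trigger as an optional inclusion reflects its special status: it sits in the Latin region but may be counted with either language, or with neither.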
The three regions described above (see also 3.1), and the <foreign>-marked sections
were used in the indexing phase of TEI documents in the Solr search engine working in the
background. In effect, the users can search within any of the three text regions, while the
exclusion of the multilingual elements or triggers is an additional option they may choose.
The following metadata filters may be applied: location, scribal hand and year. The entire
design (the website with TEI indexing, a user-friendly display and search options) has been
implemented in the Java programming language with the use of the Play Framework. We provide a
host of download functions: from TEI files of whole collections, through
individual rota units, to particular text layers of a given unit. Additionally, our interface
allows the user to export full context search results in tab-separated value files (.tsv),¹⁷ which
can be further processed by corpus linguistic tools.
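A tab-separated export of search hits with their metadata might be produced along the following lines. This is a Python sketch with the standard `csv` module: the column names, the function `hits_to_tsv` and the sample hit are hypothetical and are not taken from the eROThA interface.

```python
import csv
import io

# Hypothetical column layout: metadata filters plus the hit in its context.
COLUMNS = ["unit_id", "location", "year", "scribal_hand", "left", "match", "right"]


def hits_to_tsv(hits):
    """Serialise search hits (dictionaries keyed by COLUMNS) as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for hit in hits:
        writer.writerow([hit[column] for column in COLUMNS])
    return buffer.getvalue()


sample_hit = {
    "unit_id": "Gn.330", "location": "Gniezno", "year": 1430,
    "scribal_hand": "PISARZ 19",
    "left": "left context", "match": "hit", "right": "right context",
}
tsv_text = hits_to_tsv([sample_hit])
```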
¹⁵ Triggers may be visually independent of any of the sections of the manuscript, thus constituting a special
status element which may be linguistically ambiguous (cf. the notion of homophonous diamorphs in Muysken,
2000 and visual diamorphs in Wright, 2001; for a detailed discussion of the trigger element, see Włodarczyk and
Adamczyk forthcoming c). Even though xml does not handle overlaps, it is still possible to preserve the dynamic
nature of the trigger by deploying this special annotation.
¹⁶ The strategies of boundary marking in the Latin introduction require a thorough investigation, ideally based on
a specific visual-semantic-pragmatic typology.
¹⁷ .tsv stands for tab-separated values, a simple text format for storing data in a tabular structure where each
field is separated from the next by the tab character.

3.4. Project workflow
In brief, the eROThA project involved the following interconnected stages:
1) Correction (the first round in Fine Reader training mode) of the ‘dirty’ OCR, including:
- implementation of the parser-readable scheme
- manual tagging of the discourse boundary element (see 3.3 above and Włodarczyk et
al. 2017)
- manual correction of the content
- manual correction of the html scheme
Although the OCR process was conducted by means of specialised software in a learning
mode and followed a relatively painstaking training period (c. 1-2 rota units for 50-100
batches of 5 scans), the accuracy of the recognition was not completely satisfactory. The
crude OCR procedure produced errors in about 10%-15% of the word tokens, but as the
training process was dynamic, at the end of a training session (per file consisting of 10-15
pages) the error level was lower and a manual correction was conducted on the residue of the
incorrect tokens. Detailed analyses of the mistakes that had to be corrected manually were not
performed, but three factors seem to have affected the accuracy. First of all, the quality of the
printed source was not consistent and the pages with the lowest print quality produced the
least accurate outcomes. Secondly, the accuracy of the OCR process was much poorer for the
region which included many special characters (i.e. the Old Polish transliteration) and better
for the regions containing transcription and spelling modernisation (Latin introduction and
standardised Polish sections). Thirdly, there was a group of graphemes which were confused
despite repeated training attempts: these included ‘long s’ <ſ> (U+017F) vs. <f>; <u> vs.
<n>; <y>, <ÿ> (U+00FF) and <ẏ> (U+1E8F); and the digraphs <ni> vs. <m>, <ci> vs.
<d>, <li> vs. <h>, <nr> vs. <rn> and others. Division marks (|), brackets and hyphens were
frequently not read correctly or omitted by the software. More typical OCR mistakes involved
small <l> confused with the numeral <1> and capital <R> frequently recognized as <B>.
Finally, the OCR process had to be preceded by a manual division of scan pages into regions
which were then ordered manually according to the requirements of the parser. In some cases,
in the transfer of the corrected file to the xml format, the regions were unexpectedly reordered.
This was later amended manually in an xml editor.
2) Correction (the second round, in html printouts): to increase inter-coder reliability, the
coders switched datasets to verify tagging decisions, content and scheme accuracy. The
coders cross-checked their decisions and discussed disagreements (c. 5%). In general, the so-
called borderline cases, where the visual marking of the trigger was ambiguous, were
problematic. In such cases, the coders followed a strict policy of treating the trigger as
independent only if it occurred in a new line or was preceded by an intentional gap that
separated it from the Latin introduction.
3) Parser verification focusing on the section, language and CS scheme accuracy
4) Final correction round: minor consistency amendments
5) xml correction round: final adjustments (line breaks, word divisions, etc.) and manual
correction of the TEI scheme.
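The grapheme confusions reported for the OCR stage lend themselves to a simple aid for manual correction: for a token that is not in a word list, propose the readings obtained by swapping one confusable item. The Python sketch below is an illustration of that idea and was not part of the project workflow described above; the word list, the sample token and the function names are invented, and the grouping of <y>, <ÿ> and <ẏ> into one set is an assumption.

```python
# Confusion groups reported for the OCR stage (single characters and digraphs).
CONFUSION_GROUPS = [
    ("ſ", "f"), ("u", "n"), ("y", "ÿ", "ẏ"),
    ("ni", "m"), ("ci", "d"), ("li", "h"), ("nr", "rn"),
    ("l", "1"), ("R", "B"),
]

# Map every member of a group to the other members it may be confused with.
SWAPS = {}
for group in CONFUSION_GROUPS:
    for member in group:
        SWAPS.setdefault(member, set()).update(g for g in group if g != member)


def variants(token):
    """All readings obtained by swapping one confusable item at one position."""
    found = set()
    for source, targets in SWAPS.items():
        start = token.find(source)
        while start != -1:
            for target in targets:
                found.add(token[:start] + target + token[start + len(source):])
            start = token.find(source, start + 1)
    return found


def suggest(token, lexicon):
    """Propose corrections for a token that the (hypothetical) word list lacks."""
    if token in lexicon:
        return []
    return sorted(v for v in variants(token) if v in lexicon)


# Invented example: <ni> read in place of <m>.
print(suggest("doniini", {"domini"}))
# prints: ['domini']
```

Only single substitutions are generated, so a token with two misread graphemes would still need to be corrected by hand.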
3.5. Project outcomes
The digital edition of the Kowalewicz and Kuraszkiewicz Greater Poland Court Oaths
complements and enhances the available resources for the study of mediaeval Latin-
vernacular CS through up-to-date and robust methodological decisions and design. The
project offers:
1) A searchable version of the oaths which is sensitive to their multilingual character. Old
Polish and Latin are presented in TEI-coded sections, with internal automatic tagging of
<foreign> elements within each of the sections (inter- and intrasentential CS and single
lexical items).
2) Metadata filtering in online searches. The user is able to define the region and language
selection, exclude either of the languages, or focus on CS elements exclusively.
3) Flexible coding of the discourse boundary material.
4) High-quality images of the original manuscripts. In this way, the primary material is open
to new research questions which can be asked of the material according to various criteria,
such as:
- the unit of presentation/analysis
- scribal hand
- transliteration (e.g. treatment of abbreviated items; capitalisation) and font choice
- transcription
- modernised version
5) A user-friendly presentation of the editorial divisions into regions and languages in the
manuscript scans and in the TEI coding (cf. also section and Figure 2 above). A thin line
separates each region in the image and different font colours are used to display different
languages. Other TEI elements are rendered with clear formatting, such as underlining,
strike-throughs and superscripting.
6) Automatic export of editorial regions and search results to text files compatible with basic
linguistic software such as AntConc or WordSmith.
18
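How a language-specific plain-text dataset might be derived from the TEI coding can be sketched as follows. This is an illustration under stated assumptions, not the platform's export routine: the fragment, the use of tei:div and tei:seg as section containers, and the placeholder tokens (latin1, polish1, etc.) are all hypothetical. Only the tei:foreign element and the language codes lat-med and pol-old come from the project description.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment: container elements and tokens are placeholders.
SAMPLE = (
    f'<tei:div xmlns:tei="{TEI_NS}">'
    '<tei:seg xml:lang="lat-med">latin1 latin2 '
    '<tei:foreign xml:lang="pol-old">polish1</tei:foreign> latin3</tei:seg>'
    '<tei:seg xml:lang="pol-old">polish2 polish3 '
    '<tei:foreign xml:lang="lat-med">latin4</tei:foreign> polish4</tei:seg>'
    '</tei:div>'
)


def text_in_language(element, lang, inherited=None):
    """Collect the words whose nearest xml:lang ancestor equals `lang`."""
    current = element.get(XML_LANG, inherited)
    words = []
    if current == lang and element.text:
        words.extend(element.text.split())
    for child in element:
        words.extend(text_in_language(child, lang, current))
        # Text after a child's closing tag belongs to the parent element.
        if current == lang and child.tail:
            words.extend(child.tail.split())
    return words


root = ET.fromstring(SAMPLE)
latin_only = text_in_language(root, "lat-med")
polish_only = text_in_language(root, "pol-old")
print(" ".join(latin_only))
print(" ".join(polish_only))
```

In this sketch a switched item is assigned to its own language rather than to the language of the surrounding section. Filtering by section language alone would be the other obvious design and would keep embedded switches in place.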
The database was checked for compatibility with AntConc 3.4.3m (Anthony 2014) in the txt
and xml format. No coding compatibility issues occurred while searching in the tagged and
untagged display options. For some purposes, the search strings had to be supported by the
use of wildcards, for example *<tei:foreign xml:lang=* in a search for the <foreign> tag (we
elaborate on the search options in “How to read and search results” on the eROThA platform).
As a result, all the elements tagged automatically (i.e. the italicized items from the edition)
and manually (triggers) could be extracted for further work. The simple sorting option (by the 5th,
6th and 7th word to the left) allowed for ordering the <foreign> strings according to the
section in which they occurred and the direction of the switches (<tei:foreign xml:lang="lat-med">
for Latin in the Old Polish text and <tei:foreign xml:lang="pol-old"> for Old Polish in
the Latin text). Unsorted search results are presented in Figure 3. In addition, by means of the
‘file view’ option (activated from Concordance or file list), TEI schemes identical to those
presented in Figures 1 and 2 may be displayed in AntConc.
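The wildcard search for the <foreign> tag has a rough scripted counterpart, sketched below with a regular expression applied to the tagged text as a plain string, much as a concordancer reads it. The sample string, its tei:seg containers and the placeholder tokens are hypothetical; the two language codes and their interpretation follow the description above. The pattern assumes that <foreign> elements are not nested and carry no attribute other than xml:lang.

```python
import re
from collections import Counter

# Matches one non-nested <tei:foreign xml:lang="..."> ... </tei:foreign> span.
FOREIGN = re.compile(
    r'<tei:foreign\s+xml:lang="(?P<lang>[^"]+)"\s*>(?P<text>.*?)</tei:foreign>',
    re.DOTALL,
)

# Direction of the switch, as signalled by the language code of the tag.
DIRECTION = {
    "lat-med": "Latin in Old Polish text",
    "pol-old": "Old Polish in Latin text",
}


def find_switches(tagged_text):
    """Return (language code, switched string) pairs in document order."""
    return [
        (m.group("lang"), " ".join(m.group("text").split()))
        for m in FOREIGN.finditer(tagged_text)
    ]


# Hypothetical tagged export; the tokens are placeholders, not corpus data.
tagged = (
    '<tei:seg xml:lang="lat-med">latin1 '
    '<tei:foreign xml:lang="pol-old">polish1 polish2</tei:foreign> latin2</tei:seg>\n'
    '<tei:seg xml:lang="pol-old">polish3 '
    '<tei:foreign xml:lang="lat-med">latin3</tei:foreign> polish4 '
    '<tei:foreign xml:lang="lat-med">latin4\nlatin5</tei:foreign></tei:seg>'
)

found = find_switches(tagged)
counts = Counter(DIRECTION.get(lang, lang) for lang, _ in found)
for direction, n in counts.items():
    print(f"{direction}: {n}")
```

For files with nested or more richly attributed elements, an XML parser would be a safer choice than a regular expression.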
18 The application of WordSmith Tools (trial version of word list and concordance) resulted in some compatibility
issues affecting character encoding. These issues were partially resolved by means of adjustments in language
setting features. However, as the AntConc software, which is open access, involved no such issues, it is
recommended for corpus-analytic work on the eROThA files.
Figure 3: AntConc search sample
4. Summary
The significance of the coexistence of different languages in written texts of the past has
recently come to the fore not only in historical linguistics, but also in communication studies
(Garrison et al. eds. 2013). In particular, the interfaces of Latin and the vernaculars in
mediaeval and Early Modern Europe have become an attractive new research area. Projects
involving both hitherto unexplored and already known data have been undertaken; many of
them exploit administrative and legal genres. It is in these contexts that the position of Latin
remained strong, or even dominant, well into the Early Modern era, although the vernaculars also
had a place in them. Latin–vernacular bi- and multilingualism
is attested all over Europe. Various research projects thus involve different languages and
locations but the scribes, usually professional clerks trained in prominent European centres of
learning, remain a unifying feature. However, the degree to which the coexistence of Latin
and the vernaculars in different texts shows similarities or differences across space, time and
genres remains to be explored.
Overall, not unlike other corpora of mediaeval specialised texts that aim to shed light
on the underlying interplay of Latin and the vernaculars, i.e. the extent of multilingualism and
switches, the digitised edition of the land court oaths from Greater Poland expands our
knowledge of language variation, language contact and language change. As the digitised
eROThA provides easy access to high-quality manuscript images, one direction of study that may
bring further advances in this respect is a new, technology-assisted scribal hand analysis.
Such an endeavour might further add to our knowledge of the role of bi-/multilingual users in
the conventionalisation of specialised registers and genres, as well as more generally, in
language change, the diffusion of change in multilingual settings and the role of multilingual
scribes as agents of variation and change.
Primary sources
Elektroniczny Korpus Łaciny Średniowiecznej na Ziemiach Polskich (Electronic Corpus of
Medieval Latin in Polish Lands), http://scriptores.ijp-pan.krakow.pl/fontes/efontes/run.cgi/first_form
Kowalewicz, H. and W. Kuraszkiewicz (eds.). 1959-1981. Wielkopolskie Roty Sądowe XIV–XV
Wieku [The Greater Poland Court Oaths of the 14th-15th Century], vol. 1, Roty
poznańskie [The Poznań oaths], vol. 2, Roty pyzdrskie [The Pyzdry oaths], vol. 3, Roty
kościańskie [The Kościan oaths], vol. 4, Roty kaliskie [The Kalisz oaths], vol. 5, A, Roty
gnieźnieńskie [The Gniezno oaths], B, Roty konińskie [The Konin oaths]. Warszawa, Poznań,
Wrocław, Kraków and Gdańsk: Państwowe Wydawnictwo Naukowe.
Słownik Pojęciowy Języka Staropolskiego (Old Polish Conceptual Dictionary)
http://spjs.ijp.pan.pl/spjs/strona/opisProjektu
References
Adamska, A. 2013. ‘Latin and three vernaculars in East Central Europe from the point of
view of the history of social communication’ in M. Garrison, A. Órban and M. Mostert (eds.)
Spoken and Written Language. Relations between Latin and the Vernacular Languages in the
Earlier Middle Ages, pp. 325-364. Turnhout: Brepols.
Anthony, L. 2014. ‘AntConc (Version 3.4.3) [Computer software]’. Tokyo, Japan: Waseda
University. Available from http://www.laurenceanthony.net/software
Bedos-Rezak, B. 1996. ‘Secular administration’ in F.A.C. Mantello and A.G. Rigg (eds.)
Medieval Latin. An Introduction and Bibliographical Guide, pp. 195-229. Washington: The
Catholic University of America Press.
Bisagni, J., and I. Warntjes. 2007. ‘Latin and Old Irish in the Munich Computus: A
reassessment and further evidence’, Ériu 57, pp. 1–33.
Claridge, C. 2008. ‘Historical corpora’, in A. Lüdeling and M. Kytö (eds.) Corpus
Linguistics: An International Handbook. Vol. 1., pp. 242–259. Berlin/New York: Mouton de
Gruyter.
Garrison, M., A. Órban, and M. Mostert (eds.). 2013. Spoken and Written Language:
Relations between Latin and the Vernacular Languages in the Earlier Middle Ages. Turnhout:
Brepols.
Honkapohja, A., S. Kaislaniemi, and V. Marttila. 2009. ‘Digital editions for corpus
linguistics: Representing manuscript reality in electronic corpora’ in A. H. Jucker, D. Schreier
and M. Hundt (eds.), Corpora: Pragmatics and Discourse, pp. 451-475. Amsterdam: Rodopi.
Ingham, R., and I. Marcus. 2016. ‘Vernacular bilingualism in professional spaces, 1200 to
1400’ in A. Classen (ed.) Multilingualism in the Middle Ages and Early Modern Age:
Communication and Miscommunication in the Premodern World, pp. 145-165. Berlin: De
Gruyter.
Kopaczyk, J. 2013. ‘Code-switching in the records of a Scottish Brotherhood in early modern
Poland-Lithuania’, Poznań Studies in Contemporary Linguistics 49 (3), pp. 281-319.
Kopaczyk, J., M. Włodarczyk, and E. Adamczyk. 2016. ‘Medieval multilingualism in Poland:
Creating a corpus of Greater Poland Court Oaths (ROThA)’, Studia Anglica Posnaniensia 51
(3), pp. 9-35.
Krążyńska, Z. 2010. ‘Średniowieczne techniki rozbudowywania zdań (na przykładzie
wielkopolskich rot sądowych)’ [Medieval techniques of syntactic elaboration (based on the
Greater Poland court oaths)], Kwartalnik Językoznawczy (3–4), pp. 1-16.
Kucała, M. 1974. ‘Łacińska fleksja rzeczowników polskich w tekstach średniowiecznych’
[Latin inflections of Polish nouns in medieval texts] in J. Kuryłowicz and J. Safarewicz (eds.)
Studia Indoeuropejskie [Indo-European Studies], pp. 91–96. Wrocław: Ossolineum.
Kulecki, M. 2008. ‘Zespoły ksiąg sądów szlacheckich I instancji –
wprowadzenie’ [Collections of 1st grade land courts – introduction] in D. Lewandowska (ed.)
Archiwum Główne Akt Dawnych w Warszawie. Informator o zasobie archiwalnym [The
Central Archives of Historical Records in Warsaw. Survey of the Collections], pp. 95-101.
Warszawa: Archiwum Główne Akt Dawnych.
Kuraszkiewicz, W., and A. Wolff. 1950. Zapiski i Roty Polskie XV-XVI Wieku z Ksiąg
Sądowych Ziemi Warszawskiej [Records and Polish Oaths of the 15th-16th c. from the Land
Books of the Warsaw Area]. Kraków: Prace Komisji Językowej PAU nr 36.
Kytö, M. 2012. ‘New perspectives, theories and methods: Corpus linguistics’ in A. Bergs and
L. Brinton (eds.) English Historical Linguistics: An International Handbook. Vol. 2, pp.
1509-31. Berlin: De Gruyter Mouton.
Lazar, M. 2016. ‘Grenzüberschreitungen: Stadtbücher aus der Westslovakei, Schlesien und
Kleinpolen und Interpretationen ihrer Mehrsprachigkeit’ Paper presented at Workshop Fontes
Iuris Lusatiae Superioris Vetustissimi. Kraków 2016.
Muysken, P. 2000. Bilingual Speech: A Typology of Code-Mixing. Cambridge: Cambridge
University Press.
Nurmi, A., T. Rütten, and P. Pahta (eds.). 2017. Challenging the Myth of Monolingual
Corpora: Multilingualism in English Corpora. Amsterdam: Brill/Rodopi.
Pahta, P., and A. H. Jucker (eds.). 2011. Communicating Early English Manuscripts.
Cambridge: Cambridge University Press.
Pahta, P., J. Skaffari, and L. Wright (eds.) 2018. Multilingual Practices in Language History.
English and Beyond. Berlin: De Gruyter.
Parr, T. 2013. The Definitive ANTLR 4 Reference. San Francisco: The Pragmatic Bookshelf,
Raleigh.
Pastuch, M., B. Duda, K. Lisczyk, B. Mitrenga, J. Przyklenk, and K. Sujkowska-Sobisz.
2018. ‘Digital Humanities in Poland from the perspective of the historical linguist of the
Polish language: Achievements, needs, demands’, Digital Scholarship in the Humanities 33
(4), pp. 857–873. https://doi.org/10.1093/llc/fqy008
Peikola, M., A. Mäkilähde, H. Salmi, M.-L. Varila, and J. Skaffari (eds.). 2017. Verbal and
Visual Communication in Early English Texts (Utrecht Studies in Medieval Literacy 37).
Turnhout: Brepols.
Rymaszewski, Z. 2008. Z Badań nad Organizacją Sądów Prawa Polskiego w Średniowieczu.
Woźny Sądowy. [From the Research on the Organisation of Polish Law Courts in the Middle
Ages. Court Usher]. Warszawa: Akademia Leona Koźmińskiego.
Schendl, H., and L. Wright (eds.). 2011. Code-switching in Early English. Berlin: Walter de
Gruyter.
Słoboda, A. 2012. Liczebnik w Grupie Nominalnej Średniowiecznej Polszczyzny. Semantyka i
Składnia [Numeral in the Nominal Group of Mediaeval Polish. Semantics and Syntax].
Poznań: Wydawnictwo Rys.
Stam, N. 2017. A Typology of Code-switching in the Commentary to the Félire Óengusso.
Utrecht: LOT publications.
Trawińska, M. 2009. ‘Cechy dialektalne wielkopolskich rot sądowych w świetle badań nad
rękopisem poznańskiej księgi ziemskiej’ [Dialect features of the Greater Poland court oaths:
The analysis of the Poznań municipal book manuscript], Prace Filologiczne LVI, pp. 345-
360.
Trawińska, M. 2014. Rękopis Najstarszej Poznańskiej Księgi Ziemskiej (1386-1400).
[Manuscript of the Oldest Poznań Land Book (1386-1400)]. Warszawa and Poznań:
Wydawnictwo Rys.
Tyler, E. M. (ed.). 2011. Conceptualizing Multilingualism in England c. 800-1250. Turnhout:
Brepols.
Tyrkkö, J., A. Nurmi, and J. Tuominen. 2017. ‘Semi-automatic discovery of code-switching
from English historical corpora: Methods and challenges’ in A. Nurmi, T. Rütten and P. Pahta
(eds.) Challenging the Myth of Monolingual Corpora: Multilingualism in English Corpora,
pp. 172-199. Leiden: Brill.
Wąsowicz, M. 1975. ‘Księgi ziemskie i grodzkie (acta terrestria et castrensia) —
wprowadzenie ogólne’ [Land and town books (acta terrestria et castrensia) — a general
introduction] in J. Karwasińska (ed.) Archiwum Główne Akt Dawnych w Warszawie.
Przewodnik po zespołach. I. Archiwa dawnej Rzeczypospolitej [The Central Archives of
Historical Records in Warsaw. Survey of the Collections. I Archives of the Former Republic
of Poland], pp. 147–155. Warszawa: Archiwum Główne Akt Dawnych.
Wright, L. 2011. ‘On variation in medieval mixed-language business writing’ in H. Schendl
and L. Wright (eds.) Code-Switching in Early English, pp. 191–218. Berlin/Boston: De
Gruyter Mouton.
Włodarczyk, M., E. Adamczyk, and O. Makarova. Forthcoming a. ‘Code-switching and
literalisation in provincial court books (libri terrestres): Evidence from the Electronic
Repository of Greater Poland Oaths (1386-1446)’, in M. Lazar, and W. Carls (eds.), Das
Sächsisch-Magdeburgische Recht als Kulturelles Bindeglied zwischen den Rechtsordnungen
Ost- und Mitteleuropas. Bestandsaufnahme und Perspektiven der Forschung (Reihe „IVS
SAXONICO-MAIDEBVRGENSE IN ORIENTE“ 8), Berlin: Mouton de Gruyter.
Włodarczyk, M., and E. Adamczyk. Forthcoming b. ‘Constraints on embedded multilingual
practices in the Electronic Repository of Greater Poland Oaths (1386-1446)’.
Włodarczyk, M., and E. Adamczyk. Forthcoming c. ‘Metalinguistic and visual cues to the co-
occurrence of Latin and Old Polish in the Electronic Repository of Greater Poland Oaths, 1386-
1446 (eROThA)’, in M. Włodarczyk, J. Tyrkkö, J. Kopaczyk, and E. Adamczyk (eds.),
Multilingualism Meets Multimodality: Historical and Modern Contexts.
Razi, Z., and R. M. Smith. 1996. ‘The origins of the English manorial court rolls as the written
record’ in Z. Razi and R. M. Smith (eds.) Medieval Society and the Manor Court, pp. 36-68.
Oxford: Clarendon Press.