On the question of the acquisition of literary texts and the method of their processing for the needs of independent literary research

RICHARD ZMĚLÍK
Palacky University in Olomouc
ORCID: https://orcid.org/0000-0002-5414-4574
E-mail: richard.zmelik@upol.cz
Abstract: The study deals with the acquisition of digital literary data, specifically prose texts of Czech literature, intended to serve independent scientific research in the context of digital humanities, or computational literary studies. In the first part, we focus on selected available foreign textual databases, which we characterize with respect to the stated goal, i.e. the existence of a digital data collection that is internally structured and machine-readable. We then focus on the Czech environment, in the context of which we present the emerging database of prose texts of Czech literature. We describe its basic structure, the advantage of such structuring, and concrete examples of possible use of the database in statistical analysis of literary texts. We conclude that in the context of the current development of DH we can expect an increasing demand not only for specialized web applications of digital literary corpora, but especially for access to such or similar databases, as these allow for highly variable and individual research.
Key words: digital literary studies, computational literary studies, literary corpora, literary databases
In the context of the current progressive development of digital humanities, a discipline that is generally focused on the functional interconnection of digital technologies and the humanities or social sciences, one of the important issues that emerges concerns the availability of reliable machine-readable structured data, especially in the field of literary studies, where these approaches are still strongly reflected and applied mainly in the foreign research context. The basis of such data is formed primarily by digitized literary texts, which would form a solid basis not only for partial analyses carried out by means of machine processing, but also for larger and more representative (from a literary-historical point of view) digital literary corpora, which, in addition to the nowadays standard tools used by corpus linguistics (e.g. etc.), would be able to offer specialized tools that would be meaningfully usable primarily for literary research, and secondarily for linguistic or other research, e.g. historical research.
From the few examples we have available today we know that the basic condition for creating a set of meaningful and useful tools for mining a literary corpus is primarily a functional connection between the literary science task and the real programming output.2 However, what precedes this cooperation, or rather what it necessarily relies on, is the relevant digital data in the form of digitized literary texts, preferably in the form of structured and machine-readable files. Let us look at a few selected examples offered by the foreign and Czech environment. Currently, the most accessible digital database of literary texts is Project Gutenberg, which offers wide access to English-language literary texts (https://archive.org/details/gutenberg). These can be retrieved for machine processing in an open application environment, from where texts can be either simply manually downloaded or copied, or whole text files can be retrieved by web scraping3.
1 This publication was created with the support of FPVC2022/20.
Bohemistyka, 2025(1): 139–154 ISSN 1642–9893 ISSN (online) 2956–4425
DOI: http://doi.org/10.14746/bo.2025.1.8
Bohemistyka, 2025 (1) © The Author(s). Published by: Adam Mickiewicz University in Poznań, Open Access article, distributed under the terms of the CC Attribution-NonCommercial-NoDerivatives 4.0 International licence (BY-NC-ND, https://creativecommons.org/licenses/by-nc-nd/4.0/)
2 Although I believe that the above stated sequence is self-evident and should not be understood in reverse order, we present this information here deliberately because where DH methods are not yet fully developed in a literary-scientific context, which is related to the critical approach to DH, questions may arise over the possibilities of adequate implementation of literary-scientific requirements in a programming context. In a scientific environment, the requirements for specific applications and software should always be primarily based on the methodological and theoretical requirements of the discipline, i.e., in this case, the literary science context. Necessary constraints on the formulation of certain requirements arise
The Project Gutenberg eBook, Pride and Prejudice, by
Jane Austen, Edited by R. W. (Robert William) Chapman
This eBook is for the use of anyone anywhere at no cost and
with almost no restrictions whatsoever. You may copy it,
give it away or re-use it under the terms of the Project
Gutenberg License included with this eBook or online at
www.gutenberg.org
Title: Pride and Prejudice
Author: Jane Austen
Editor: R. W. (Robert William) Chapman
Release Date: May 9, 2013 [eBook #42671]
Language.
Character set encoding: ISO-8859-1
***START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE***
E-text prepared by Greg Weeks, Jon Hurst, Mary Meehan, and
the Online Distributed Proofreading Team (http://www.pgdp.net)
from page images generously made available by
Internet Archive (https://archive.org)
Note: Project Gutenberg also has an HTML version of this
file which includes the original illustrations.
See 42671-h.htm or 42671-h.zip:
(http://www.gutenberg.org/files/42671/42671-h/42671-h.htm)
or
(http://www.gutenberg.org/files/42671/42671-h.zip)
Images of the original pages are available through
Internet Archive. See
http://archive.org/stream/novelstextbasedo02austuoft#page/n23/mode/2up
Fig. 1: Sample OCR text of Jane Austen: Pride and Prejudice in the Gutenberg Library (https://ia903107.us.archive.org/8/items/prideandprejudic42671gut/42671-8.txt)
Another advantage of this digital library project is that the database is available in Python as an installable library (see https://pypi.org/project/gutenbergpy). Once the library is loaded, it is possible to work with the book titles immediately. The following example uses simple code to demonstrate loading the text of Jane Austen’s Pride and Prejudice in Python and displaying the first 5000 characters of the OCR format.
Fig. 2: Strings that are irrelevant to the text analysis are deleted from the text (see lines 11–17) in the left panel. The retrieval of a specific text is done via an ID that is identical to the eBook number in the Gutenberg database (cf. line 10 in Fig. 1 and line 5 here in the left panel)
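The cleaning step described in the caption can be sketched in plain Python. This is a minimal illustration and not the code shown in Fig. 2: the function name strip_gutenberg_boilerplate and the marker patterns are assumptions, and the sample string is a tiny stand-in for the downloaded file. With the gutenbergpy package, the raw text would presumably come from a call such as gutenbergpy.textget.get_text_by_id(42671); that call is an assumption about the package's interface and is not executed here.

```python
import re

# Markers that delimit the literary text in Project Gutenberg plain-text files.
# The wording varies slightly between files, so the patterns are kept loose.
START_RE = re.compile(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S)
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return only the text between the START and END markers, if they exist."""
    start = START_RE.search(raw)
    body = raw[start.end():] if start else raw
    end = END_RE.search(body)
    if end:
        body = body[:end.start()]
    return body.strip()


# Tiny stand-in for a downloaded file (hypothetical; the real file is far longer).
sample = (
    "Title: Pride and Prejudice\n"
    "Author: Jane Austen\n"
    "***START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE***\n"
    "It is a truth universally acknowledged...\n"
    "***END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE***\n"
    "License text follows here.\n"
)

text = strip_gutenberg_boilerplate(sample)
print(text[:5000])  # display at most the first 5000 characters
```

The result is an ordinary Python string, so the metadata lines visible in Fig. 1 no longer interfere with counting or tokenization.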
naturally in the context of humanities. For example, until recently it was unthink-
able to require a machine algorithm to interpret a continuous literary text, but this is
beginning to change with the advent of AI. The next question, of course, is to what
extent such an interpretation is currently professionally valid.
3 These are methods of downloading web content using a specially written
program that searches for content by structural html tags.
The modified text can already be treated as a data type (here a string), which is important for further machine analysis. The Gutenberg project, together with its Python module, is a unique project that makes it possible to analyse a large number of English literary titles and thus to perform a representative distant reading analysis, one with a more comprehensive predictive value reflecting the situation in English written prose over a longer period of time. As can be seen from the example, however, this is not structured data. The metadata are directly part of the text, and it is therefore necessary to filter them out for further work (author’s name, place and year of publication, publisher’s notes, etc.). How can such a text be further processed? Specifically, it can be an analysis of lexicon, word richness, motifs, themes, but also a stylometric analysis showing the statistical distances between the texts under study and, of course, many other criteria. Some of them will be briefly demonstrated here.
Another current major project for a dedicated digital database is the DraCor project at the University of Potsdam. It focuses on the dramatic texts of selected German (485 texts) and Russian (80 texts) plays written between 1730 and 1930. The specificity of this project lies in the fact that it is not a simple database of digitized texts, but a much more sophisticated digital environment that works with corpora through so-called API approaches, which operate on data provided in the JSON format, a format that is, among other things, suitable for data and metadata storage and structuring (see below). As the authors state,
DraCor is not primarily a website. DraCor is a showcase for the concept of Programmable Corpora. [...] DraCor is an ecosystem. You can connect to it on different levels. The corpora can be used freely, even independently of the platform itself. For ease of use, there is an interface providing access to specific slices of the corpora, i.e. research data that can be used directly in your own work [...] DraCor aims to create an interface between traditional and digital literary studies. The extent to which you can involve DraCor in your research on European drama (or literature in general), depends on your level of technical expertise (or support) (https://dracor.org/doc/what-is-dracor#).
The information given in the project description is of interest primarily because it demonstrates how today’s digital corpus-database environments are being built, and in particular what their nature is. In other words, users are not reliant on installed corpus search tools, as we are used to with the “traditional” corpora that have been developed since the 1990s, but can design (i.e. program) them themselves according to their own research needs. Therefore, the authors draw attention to the necessary technical (i.e. programming) skills or support. This is an indisputable advantage of a corpus-database conceived in this way, including the possibility to further extend it. This current trend can also be observed in other digital corpora; if we limit ourselves to the Czech environment, we can mention the Czech National Corpus (https://www.korpus.cz) or the Czech Verse Corpus (https://versologie.cz/v2/web_content/corpus.php).
This approach is also progressive compared to other foreign projects. A list of other digital literary corpora can be found in the CLARIN research infrastructure database, which serves as a platform for linguistic, social and cultural data (https://www.clarin.eu/resource-families/literary-corpora). For the sake of comparison, we select a few examples offered here. For example, The Complete Corpus of Anglo-Saxon Poetry (https://sacred-texts.com/neu/ascp) offers a simple, albeit extensive, digitized database of Anglo-Saxon verse texts; similarly, the Collection of older original Estonian-language works of fiction is a collection of digitized literary texts of Estonian provenance. The project also offers the possibility of browsing through older editions and a timeline of historical events that form the context of Estonian works. The advantage of the project, as its authors state, is that it currently includes all 19th century Estonian artistic literature, which is important not only in terms of preserving historical texts and documents, but can also serve as a complete textual resource for systematic research. Other corpora, such as Classics of English and American Literature in Finnish (CEAL) (https://korp.csc.fi/shibboleth-ds/index.html?https%3A%2F%2Fkorp.csc.fi%2Fkorp%2F%3Fsaved_params%3D1705922341337) or Classics of Finnish Literature, Kielipankki Version (https://korp.csc.fi/korp/#?prequery_within=sentence&cqp=%5B%5D&corpus=skk_aho,skk_canth,skk_finne,skk_jarnefelt,skk_kailas,skk_lassila,skk_linnankoski,skk_kramsu,skk_lehtonen,skk_leino,skk_pakkala,skk_siljo,skk_sodergran,skk_wilkuna), contain some selected tools for searching the corpus, which is made possible by processing the textual data at the level of syntactic parsing and morphological tagging. If we look at other corpora, the majority of them are either implemented as digital collections of literary texts (and also of other texts, such as essays), or the texts are processed at the level of parsing and morphological annotation. It is therefore probably important to distinguish between a digital collection, which is the first example, and corpora that already have certain corpus tools for searching.
There are numerous similar projects today. We would like to mention one of them here, because it is also related to the Czech environment. It is the European Literary Text Collection (ELTeC) project (https://distantreading.github.io/ELTeC), which, as the name implies, focuses on digitized collections of literary works of European literatures, which it offers in the form of web content. The condition for the creation of a specific digital collection is that it must contain exactly 100 book titles. Focusing on the Czech collection (https://distantreading.github.io/ELTeC/cze/index.html), the first thing that is somewhat surprising is the selection of the texts themselves. The first problem for a possible systematic analysis is the non-representativeness of the database. The texts that are part of it are mostly not part of the Czech literary canon; in many cases they are marginal and less known or unknown 19th century texts, while in some cases, on the contrary, they are texts representing the literary canon of 19th century Czech literature. However, the question of canonicity might not be so binding in the final analysis if there were enough texts representing the relevant historical developmental stages of Czech literature. We have in mind (sub)sets of texts relating, for example, to the 1830s, 1840s, 1850s and to other decades of the 19th century. Instead, these are mostly randomly selected texts from the period 1855–1920. Another problem is that most authors are represented here by a single text, while Alois Jirásek, for example, has three. Such a selection de facto prevents more meaningful comparisons, the results of which would have more valid cognitive value for modelling the literary-historical processes of Czech literature.
Other resources that concentrate a disproportionately larger number of Czech literary works include the Kramerius database managed by the National Library of the Czech Republic (https://kramerius.nkp.cz/kramerius/welcome.do;jsessionid=EDA92C4CCCFBB6ED5585E62C91C37BD3). This database can be used to search a truly large number of literary documents, but even using this environment has its pitfalls. Texts are collected here in PDF format, and it is only ever possible to download a batch of at most 20 pages. As a result, it is then necessary to merge the individual PDF files into a single file, perform OCR and then check, clean and save the text in a format that can be further processed by machine. The newer version of this database, Kramerius5, does offer a text transcription of the displayed PDF, i.e. a specific page from the document, but again it is necessary to manually copy the OCR part and de facto assemble the whole work in pieces, which is very tedious and inefficient work. Not even the digital library of the Moravian Library offers a fundamentally different approach. It should be noted, however, that both these institutions are limited by copyright and are not primarily oriented towards the creation of databases for scientific purposes, i.e. datasets that could be further processed in the context of digital humanities research. However, we mention both resources here because their collection constitutes the largest database of Czech literary texts in the Czech Republic.
The situation is different in the case of the Corpus of Czech Verse, which is implemented at the Institute for Czech Literature of the Academy of Sciences of the Czech Republic. In addition to the web interface (https://versologie.cz/v2/web_content/tools.php?lang=cz), where it is possible to search according to selected versological criteria, it also offers open data (https://github.com/versotym/corpusCzechVerse) for further individual processing.
We are following a similar path in the Literary Cartographic and Quantitative Models of Czech Novels from the 19th to 21st Century project (https://korpusprozy.com), providing a structured and machine-readable dataset of texts and other metadata for independent research as part of the globally shared open data trend. In doing so, we aim not only to meet the generally shared call for open and accessible data, but also, as the abovementioned review of Czech text databases has shown to be necessary, to provide researchers with a free and structured database of literary texts for their professional work, which can become the basis for individually focused and independent research work.
The default format in which text data is stored and provided is the JSON text format, in which both text and metadata are structured into a dictionary data type (see Fig. 3). The basic principle of this data type is that it stores pairs written in the manner {key: value}. If we look at the processing structure of each literary text (see Fig. 3) we can see that a dictionary can contain another dictionary, etc., which makes this data type very suitable for structuring and hierarchizing. In this case, the value associated with the TITLE key is itself a dictionary that contains a series of keys and values.
This formatted data can be worked with completely independently, taking into account the wide variability of possible research tasks. Here we demonstrate several such examples that illustrate the different possibilities of processing such structured data. The following list (see Fig. 4) is a basic listing of works, authors, number of tokens and lemmas in each text in a summary table.
{
  "TITLE" : {
    "AUTHOR" : author name,
    "BORN" : date of author's birth,
    "DEATH" : date of author's death,
    "1. PUB PUB" : first publication of the title,
    "ACTUAL PUB" : actual publication,
    "TEXT" : text of title (tokens),
    "LEMMA" : lemmas,
    "MORPHO TAGS" : morphological token tags
  }
}
Fig. 3: Processing structure of each work
Fig. 4: List of the first 42 works of Czech prose writers of the 19th century with the size of each text in number of tokens and lemmas. The database currently contains 74 titles of Czech prose from the 19th century to the 21st century.
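The nested structure of Fig. 3 and a summary listing of the kind shown in Fig. 4 can be sketched as follows. This is only an illustration under stated assumptions: the record is a hypothetical miniature, the key FIRST_PUB is a stand-in for the first-publication field, the token and lemma lists are invented placeholders, and "number of lemmas" is interpreted here as the number of distinct lemmas.

```python
import json

# Hypothetical record in the shape of Fig. 3. "FIRST_PUB" is an assumed key
# name, and the token, lemma and tag lists are invented placeholders.
raw = """
{
  "Povídky malostranské": {
    "AUTHOR": "Jan Neruda",
    "FIRST_PUB": 1878,
    "TEXT": ["Byl", "krásný", "den", ",", "krásný", "den", "."],
    "LEMMA": ["být", "krásný", "den", ",", "krásný", "den", "."],
    "MORPHO TAGS": ["V", "A", "N", "Z", "A", "N", "Z"]
  }
}
"""

# json.loads turns the text into a nested dictionary: title -> record.
database = json.loads(raw)


def summary_rows(db):
    """One row per title: (title, author, token count, distinct lemma count)."""
    return [
        (title, rec["AUTHOR"], len(rec["TEXT"]), len(set(rec["LEMMA"])))
        for title, rec in db.items()
    ]


rows = summary_rows(database)
for row in rows:
    print(*row, sep=" | ")
```

Because every value is reached through its key, the same loop works unchanged whether the dictionary holds one title or the whole database.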
Another possibility offered by the dataset is a restricted listing of works that meet a certain condition, for example a time restriction to texts that were first published between 1830 and 1880 (see Fig. 5).
Fig. 5: Table of titles that match the given condition, i.e. were first published between 1830 and 1880. Below the table, the total number of texts retrieved and the size of such a corpus in token counts are given.4
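A selection of this kind can be sketched in a few lines. The miniature database, the key name FIRST_PUB and the function name are hypothetical, and the year range is assumed to be inclusive at both ends.

```python
# Hypothetical miniature database in the shape of Fig. 3 (illustrative values only).
database = {
    "Title A": {"AUTHOR": "Author X", "FIRST_PUB": 1836, "TEXT": ["a", "b", "c"]},
    "Title B": {"AUTHOR": "Author Y", "FIRST_PUB": 1878, "TEXT": ["d", "e"]},
    "Title C": {"AUTHOR": "Author Y", "FIRST_PUB": 1895, "TEXT": ["f"]},
}


def select_by_first_publication(db, start, end):
    """Return (title, record) pairs first published in [start, end], sorted by year."""
    hits = [(t, r) for t, r in db.items() if start <= r["FIRST_PUB"] <= end]
    return sorted(hits, key=lambda pair: pair[1]["FIRST_PUB"])


selection = select_by_first_publication(database, 1830, 1880)
for title, record in selection:
    print(record["FIRST_PUB"], record["AUTHOR"], title, len(record["TEXT"]))

# Totals of the kind reported below the table in Fig. 5.
total_tokens = sum(len(record["TEXT"]) for _, record in selection)
print(f"{len(selection)} texts, {total_tokens} tokens")
```

Any other condition (author, text length, a tag in the morphological layer) can be substituted for the year test without changing the rest of the sketch.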
As we can see, the output is a table of literary texts sorted by the year of the first publication of the respective texts. The particular program that we use to access the database allows us to save this selection as a single TXT text file containing the texts of all the filtered titles; of course, it is up to each user to customize their own program. It is equally possible to save the selection in a custom JSON format, as an Excel spreadsheet, etc. The filtered texts can be further analysed in any way, e.g. within the framework of methods standardly used in NLP. This example also shows that each user can generate their own text files (corpora) and perform various statistical measurements between them. This generation can be varied in any way. In addition to the time range, custom sub-corpora can be defined, e.g. by author names. The following figure is an example of a simple storage of all prose texts by Jan Neruda that are part of the database. As can be seen, the specificity of this format is its machine readability. Currently, it is perhaps one of the most widely used formats for storing and exchanging data in the digital environment of the web.
Fig. 6: Example of saving text data into JSON format.
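Saving an author-based sub-corpus can be sketched with the standard library alone. This is not the program used by the project: the miniature database, the file names and the function subcorpus_by_author are assumptions, and the token lists are placeholders.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical miniature database; the token lists are placeholders.
database = {
    "Arabesky": {"AUTHOR": "Jan Neruda", "TEXT": ["první", "text"]},
    "Povídky malostranské": {"AUTHOR": "Jan Neruda", "TEXT": ["druhý", "text"]},
    "Marinka": {"AUTHOR": "Karel Hynek Mácha", "TEXT": ["třetí", "text"]},
}


def subcorpus_by_author(db, author):
    """Keep only the records whose AUTHOR field equals the given name."""
    return {title: rec for title, rec in db.items() if rec["AUTHOR"] == author}


neruda = subcorpus_by_author(database, "Jan Neruda")

out_dir = Path(tempfile.mkdtemp())  # temporary folder, used only for this sketch

# ensure_ascii=False keeps Czech diacritics readable in the saved file.
json_path = out_dir / "neruda.json"
json_path.write_text(json.dumps(neruda, ensure_ascii=False, indent=2), encoding="utf-8")

# The same selection as one plain-text file, with a blank line between titles.
txt_path = out_dir / "neruda.txt"
txt_path.write_text(
    "\n\n".join(" ".join(rec["TEXT"]) for rec in neruda.values()), encoding="utf-8"
)
```

Reading the JSON file back with json.loads restores the same dictionary, which is what makes the format convenient for exchanging sub-corpora.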
4 The database can be downloaded after registration here: https://korpusprozy.com.
Fig. 7: Considering the size of some of Karel Hynek Mácha’s prose, we calculated
the relatedness between the texts with respect to the 100 most frequent words.
From each text in the database, the 100 most frequent lemmas were selected
and a common set of lemmas was created. For each lemma in this set, the
relative frequency that the lemma has in each text was calculated and then
a dendrogram was constructed. The distances between texts are expressed in
the graph by the length of the y-axis.
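The feature construction described in the caption can be sketched as follows. The caption does not name the distance measure or the clustering method, so the Manhattan distance used here is an assumption, as are the function names and the toy lemma lists. The resulting pairwise distances are the kind of input that a hierarchical clustering routine (for example SciPy's scipy.cluster.hierarchy.linkage) would take to draw a dendrogram.

```python
from collections import Counter
from itertools import combinations


def top_lemmas(lemmas, n):
    """The n most frequent lemmas of one text."""
    return [lemma for lemma, _ in Counter(lemmas).most_common(n)]


def relative_frequencies(lemmas, vocabulary):
    """Relative frequency of every vocabulary lemma within one text."""
    counts = Counter(lemmas)
    return [counts[lemma] / len(lemmas) for lemma in vocabulary]


def distance_table(texts, n=100):
    """texts: dict title -> list of lemmas. Returns (vocabulary, distances)."""
    # Common set: union of the top-n lemmas of all texts.
    vocabulary = sorted({l for lemmas in texts.values() for l in top_lemmas(lemmas, n)})
    profiles = {t: relative_frequencies(lemmas, vocabulary) for t, lemmas in texts.items()}
    # Manhattan distance between frequency profiles (an assumed choice of metric).
    distances = {
        (a, b): sum(abs(x - y) for x, y in zip(profiles[a], profiles[b]))
        for a, b in combinations(sorted(profiles), 2)
    }
    return vocabulary, distances


# Toy lemma lists standing in for real texts.
texts = {
    "T1": ["a", "a", "a", "b", "b", "c"],
    "T2": ["a", "a", "a", "b", "b", "c"],
    "T3": ["c", "c", "c", "d", "d", "a"],
}
vocabulary, distances = distance_table(texts, n=2)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

Identical texts end up at distance zero, and texts that share few frequent lemmas end up far apart, which is what the branch heights in a dendrogram express.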
Thus, as can be seen from this data arrangement, the potential analyses that such an essentially simple structure allows are many. For example, one can measure the percentage of word types, build one’s own concordance or collocation searches, perform a number of statistical analyses of the text, e.g. measuring word richness, entropy, extensiveness, or concentration of texts (see Mistrík, 1968, pp. 40–52), detecting the so-called thematic concentration of texts (see Čech, 2016; Čech, Popescu, Altman, 2014, pp. 13–29), sentiment analysis5 etc., or detecting the degree of similarity between texts using one of the stylometric methods (see Warmer-Colan, 2024). The following graph is an example of a so-called dendrogram, which shows the distances between titles in the whole existing database.
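Two of the simplest measures from this family can be sketched directly on a token list. These are generic textbook definitions (type-token ratio and Shannon entropy of the word distribution), offered as an assumption about what such measurements might look like; they are not necessarily the formulas used in the works cited above.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Share of distinct word forms among all tokens (a basic richness measure)."""
    return len(set(tokens)) / len(tokens)


def shannon_entropy(tokens):
    """Entropy, in bits, of the relative-frequency distribution of the tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy token list standing in for the TEXT field of one title.
tokens = ["a", "b", "a", "c"]
ttr = type_token_ratio(tokens)
entropy = shannon_entropy(tokens)
print(ttr, entropy)
```

Both values depend strongly on text length, so comparisons between titles are usually made on samples of equal size.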
Fig. 8: PCA analysis of a sub-corpus consisting of 40 selected texts (see Fig. 5).
5 Sentiment analysis is a way of measuring the emotional load of texts, for
example using special word lists or databases.
Similarly, the relatedness between texts can be modelled with the use of principal component analysis (PCA), which transforms multidimensional values into a two-dimensional representation that results in clusters of texts that are closest to each other. In the above graph, we can clearly observe the clusters of selected texts (see Fig. 8) distributed according to their affiliation to each author. The analysis was
performed on the 100 most frequent words, but its criteria can be
chosen according to different categories, e.g., word frequency in the
different speech bands of the narratives, emotional load (so-called
sentiment analysis), keywords, sentence lengths, types of n-grams,
etc. As we can see in the graph, within the set of 40 texts that meet our
criterion above, i.e. were first published in 1830–1880, the most
distant clusters are prose works by Karel Hynek Mácha and Neruda’s
Tales of the Lesser Town. Especially in the case of Mácha, we can
additionally observe their more pronounced internal diversification,
which is due to the greater distances between the individual texts in
Mácha’s sub-corpus.
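The projection step can be illustrated with a small dependency-free sketch. It is not the procedure used to produce Fig. 8: the function names and the toy data are assumptions, and the eigenvectors are found by simple power iteration with deflation, which is adequate for illustration only. In practice an established implementation (for example sklearn.decomposition.PCA) would normally be preferred.

```python
import math


def center(rows):
    """Subtract the column means so every feature has mean zero."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already centered rows."""
    n, m = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(m)] for i in range(m)]


def leading_eigenvector(matrix, iterations=500):
    """Power iteration; the uneven start vector avoids trivial symmetric cases."""
    m = len(matrix)
    v = [float(i + 1) for i in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in this matrix
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project each row (one text's feature vector) onto the first two components."""
    centered = center(rows)
    cov = covariance(centered)
    m = len(cov)
    components = []
    for _ in range(2):
        v = leading_eigenvector(cov)
        eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(m) for j in range(m))
        components.append(v)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(m)] for i in range(m)]
    return [[sum(r[j] * c[j] for j in range(m)) for c in components] for r in centered]


# Toy feature vectors; in the study each row would hold word frequencies of one text.
features = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
coords = pca_2d(features)
for point in coords:
    print([round(x, 3) for x in point])
```

Each text then becomes a point in the plane, and texts with similar frequency profiles fall close together, which is how author clusters become visible.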
The criteria for working with such a database, which will of course be constantly updated and expanded, are similar to those for working with the Gutenberg library or the DraCor corpus; its user must have certain technical skills, or the support of experts who will be able to extract relevant information from such a dataset. For users who are used to standard ways of working with digital corpora, this project also provides a web interface (https://korpusprozy.com) with a number of functionalities for corpus search. However, it is important to note that any web application with corpus tools is necessarily limited to certain tools. By contrast, machine-readable and structured data allows for individual research and the development of specific tools for data mining. This can be observed at some foreign universities, which also engage in such practices directly in their teaching.6 With the growing influence of digital humanities, there will be an increasing demand not only for special applications but also for specialized databases that allow researchers to conduct highly variable and independent research.
Translated from Czech by Josef Línek
References:
MISTRÍK, Josef. (1968). Stylistics of the Slovak language. Košice: Slovak Pedagogical Publishing House in Bratislava.
ČECH, Radek. (2016). Thematic concentration of text in Czech. Prague: Institute of Formal and Applied Linguistics.
ČECH, Radek; POPESCU, Ioan-Iovitz & ALTMAN, Gabriel. (2014). Methods of quantitative analysis of (not only) poetic texts. Olomouc: Palacky University in Olomouc.
DEFUS, A. (2024). What is stylometry? Available from: https://nauka.uj.edu.pl/aktualnosci/-/journal_content/56_INSTANCE_Sz8leL0jYQen/74541952/141176992.
WARMER-COLAN, A. (2024). Stylometry Methods and Practices. Available from: https://guides.temple.edu/stylometryfordh/home.
E-references
Project Gutenberg. Available from: https://archive.org/details/gutenberg.
DraCor. Available from: https://dracor.org.
Czech National Corpus. Prague: Institute of the Czech National Corpus FF UK. Available from: https://www.korpus.cz.
Corpus of Czech verse. Available from: https://versologie.cz/v2/web_content/corpus.php.
Literary Corpora. Available from: https://www.clarin.eu/resource-families/literary-corpora.
The Complete Corpus of Anglo-Saxon Poetry. Available from: https://sacred-texts.com/neu/ascp.
Korp – The Language Bank of Finland. Available from: https://korp.csc.fi/shibboleth-ds/index.html?https%3A%2F%2Fkorp.csc.fi%2Fkorp%2F%3Fsaved_params%3D1705922341337.
Distant Reading. Available from: https://distantreading.github.io/ELTeC.
Kramer. Available from: https://kramerius.nkp.cz/kramerius/Welcome.do;jsessionid=EDA92C4CCCFBB6ED5585E62C91C37BD3.
Literary Cartographic and Quantitative Models of Czech Novels from the 19th to 21st Century. Available from: https://korpusprozy.com.
6 Cf. Statistical methods for studying literature using R at the University of Missouri-Kansas City (https://daedalus.umkc.edu/StatisticalMethods/index.html) or Matthew L. Jockers’ Text Analysis with R for Students of Literature (2014).