Acta Linguistica Petropolitana. 2020. Vol. 16.2. P. 160–180
DOI 10.30842/alp2306573716207
* : 18-512-
76002 _ « -
: », EraNet Rus Plus grant/Swiss National Science Foundation
IZRPZ0_177557/1 (TraCeBa project, https://traceba.net/); SNF100015_176378/1
(‘Ill-bred sons’, family and friends: tracing the multiple affiliations of Balkan Slavic).
Automatic language profiling of a dialect speaker:
the case of the Timok variety spoken
in the village of Berčinovac (Eastern Serbia)
A. L. Makarova
University of Zurich (Switzerland); anastasia.makarova@uzh.ch
D. V. Konior
Institute for Linguistic Studies, Russian Academy of Sciences, St. Petersburg;
dsuetina@yandex.ru
T. Vuković
University of Zurich (Switzerland); teodora.vukovic2@uzh.ch
A. N. Sobolev
Institute for Linguistic Studies, Russian Academy of Sciences, St. Petersburg;
sobolev@staff.uni-marburg.de
O. Winistörfer
University of Zurich (Switzerland); olivier-andreas.winistoerfer@uzh.ch
Abstract. In a previously published paper [Konior et al. 2019], which thematically
led up to the present article, we explored the possibility of developing a quantitative
tool for assessing the intrasystemic dialectal coherence and the degree of dialectal
authenticity (preservation) of a particular variety of Slavic (and, more broadly, Balkan)
dialectal speech. To do so, we manually analysed and counted all cases of the presence
or absence of specific phonemes, direct and indirect object reduplication, ways of
expressing peripheral case meanings, the presence of a postpositive article, and some
other language features. The data were extracted from the “Linguistic Atlas of Eastern
Serbia and Western Bulgaria” [SAOSWB]; the idiolect of a native speaker of the Timok
dialect spoken in the village of Berčinovac (near the town of Knjaževac in the Zaječar
district, Eastern Serbia) was chosen for analysis. Subsequently, the following question
arose: how can modern technologies for automatic text processing increase the
efficiency of dialectologists’ work, and what technical obstacles must be overcome
in this regard? In this article, we present a method for the (semi-)automatic analysis
of phonetic and morphosyntactic features in a dialect text with the use of morphological
annotation (the tagger model is based on the ReLDI tagger [Ljubešić et al. 2016])
and custom Python scripts. An algorithm that searches for some important dialect
features is described and exemplified. In trying to imitate and automate historical and
structural linguistic analysis, we open a discussion of the advantages and disadvantages
of computer analysis of dialect data as compared with manual analysis. In the future,
the automatic method is expected to be helpful in managing larger amounts of dialect data.
Keywords: statistical methods in linguistics, machine text analysis, linguistic
profiling, dialect speakers, Balkan Slavic languages, Serbian dialects, Timok dialect,
idiolect of a dialect speaker, village of Berčinovac, Eastern Serbia.
1.
-
— (
, «-
» « -
») 1 (
).
[ . 2019],
, , -
« » ( «-
») ( — )
. , -
[ . 2019: 30–31], -
: -
, -
?
.
: -
,
1
. [, 2020].
, -
. 2 -
, ; 3
,
-
. 4, -
, -
. . -
, .
« : -
» (18–20 2018 ., )
« -
»
-
.
(1906 . .), 2 .
. [-
. 2019]. . , 1990- ., -
« -
» [ 1998] ( — SAOSWB
). -
, -
, , ,
-
(4453 ). -
( 20 . ) , -
-
. :
2 —
- , -
. - -
, -
(
) —
.
1) ; 2) -
, « »; 3) -
; 4) -
.
, -
()
3 ( -
[Vuković et al. 2019] ReLDI-
[Ljubešić et al. 2016]; )
Python-. SAOSWB -
« » 4, -
, , , 2015–2017 .
—
(«») 5:
1) . *tj : *vtje > vee ‘’,
*svtja > svea ‘’, e;
2) . *dj : *medja >
mea ‘’, *tjudje > uo ‘’, *vidj- > vi- ‘-’;
3)
:
*sn > sn ‘’, *vš > vška ‘’, *dsky > dska ‘’;
*-v (takv ‘()’); *dn > dn ‘’;
4) :
. (IO) (POSS): i na
tuj ovcu ( .. ..) se toj dade prvo i venc
3 -
, .
4 , -
: http://balksrv2012.sanu.ac.rs/webdict/timok/index.
5
-
, a -
[ . 2019].
‘ (. «») ’; tg u mo-
je znae i na mojega tatu ( .. ..) odnela
vodenicu ‘, , []
’;
. -
: ot sviu ( ..) ostala samo glava ‘ -
’, orala sam ss pluk ( ..)
‘ ’, on bil u vojsku ( ..) ‘
’;
. () ():
toj na nas ( 1.) priala baba jedna ‘ -
’;
. -
: pokraj u ( . 3.) ‘ ’, ss ega
( .3.) ‘ ’, da peemo leb u u ( . 3.) ‘
’;
5) : dojde do
nas voda-ta (-..) do ovdeka ‘ ’;
6) . -
po :
pomlad ‘, , ’;
7) :
. : tebe te stra (2. 2. ..)
‘ ’;
. : tep ti je dobro (2. 2. ..3
) ‘ ’;
8)
: sad u # priam ( .1 #
..1) ‘ ’, . sad u da priam (-
.1 ..1) [ . 2019: 21–22].
-
, -
. , -
, -
.
2. .
2.1.
SAOSWB
:
, -
, -
.
:
(, dan / dan / dan / dn ‘’).
.txt OCR Transkribus; -
. -
, ()
(, -
) ,
(. 1).
2.2.
. -
,
(
) (). -
. ,
« » ,
.
ReLDI, -
6. ,
MULTEXT-East V5, -
6 https://github.com/bravethea/Torlak-ReLDI-Tagger-2019
: ,
žena ‘, ’ Ncfsny: Noun, common, femi-
nine, singular, nominative, animate (yes).
,
-v, -t -n ( -
). , ženata ‘[] ’
Ncfsny-t, (-t) -
-ta. 1
.
1.
Table 1. Examples of the transcription and additional annotation
pa        pa        Cc       pa
dójde     dOjde     Vma3s    doći
na        na        Sa       ma
nás       nAs       Pp1-pa   mi
vodáta    vodAta    Npmsa-t  voda
doovdéka  doovdEka  Rgp      dovde
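
The annotation layout in Table 1 lends itself to simple scripted processing. The following Python sketch is not the authors' script. It assumes that each row holds four whitespace-separated fields (original transcription, normalized transcription, tag, lemma) and that a postpositive article is recorded as a final -v, -t or -n appended to the MULTEXT-East tag, as in Ncfsny-t. The names Token, parse_row and split_msd are hypothetical.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    original: str    # transcription as printed, e.g. "vodáta"
    normalized: str  # stress marked by a capital vowel, e.g. "vodAta"
    msd: str         # MULTEXT-East style tag, possibly with an article suffix
    lemma: str


# Article markers appended to the tag (assumed inventory: -v, -t, -n).
ARTICLE_SUFFIXES = {"v", "t", "n"}


def parse_row(line: str) -> Token:
    """Split one annotated row into its four fields."""
    original, normalized, msd, lemma = line.split()
    return Token(original, normalized, msd, lemma)


def split_msd(msd: str) -> tuple[str, Optional[str]]:
    """Separate the article suffix from the tag.

    'Ncfsny-t' -> ('Ncfsny', 't'); a tag such as 'Pp1-pa', where the
    hyphen is an ordinary MSD placeholder, is returned unchanged.
    """
    base, sep, tail = msd.rpartition("-")
    if sep and base and tail in ARTICLE_SUFFIXES:
        return base, tail
    return msd, None


if __name__ == "__main__":
    row = parse_row("vodáta vodAta Npmsa-t voda")
    print(row.lemma, split_msd(row.msd))
```

Keeping the article marker outside the tag proper means that the remaining string can still be read position by position as an ordinary MULTEXT-East tag.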
3. .
-
,
() , -
. « »,
, -
. -
, -
.
: ,
, / -
[Birkner 2015; Dash 2018].
.
1 ( . *tj), 2 ( . *dj),
3 (
). -
,
. -
, -
.
-
, , , -
,
- . « » -
MULTEXT-east [Erjavec et al. 2003] 7. -
, -
[Dash, Hussain 2013],
, -
.
, , -
,
c , , dž, ,
,
*tj *dj ,
, . -
*dj. , -
: ( -
) dž ( ). , 2,
( -
) . -
-
.
-
, /dž, -
. *tj
( )
7 http://nl.ijs.si/ME/V6/msd/html/msd-hbs.html
. / (-
*tj) , -
,
.
, ,
8:
*tj, *dj -
, -
. SAOSWB
( ) *tti *tj
*sn *dn « -
» ( — *došl
*jedn 9).
3
« -
»: «» (*dn,
dan) — —
«» ( ii) -l . . . .,
.
8 (vs -
) [Vuković et al. 2020].
9 , — ,
, , -
, -
, , , -
.
2. *dj:
Table 2. Reflexes of *dj: dialect vs standard realization
*dj > dž *dj >
lédža ‘’
prédžu ‘’
róena ‘’
govée ‘’
80 % 20 %
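
A share such as the 80 % vs 20 % split reported in Table 2 can be computed along the following lines. This is a minimal sketch under stated assumptions, not the procedure used in the study: it presupposes that the tokens continuing *dj have already been selected, treats any token containing dž as a dialect realization and everything else as a standard one, and runs on a toy word list (rođena and goveđe are assumed standard spellings), so its output does not reproduce the figures in the table. The function names are hypothetical.

```python
from collections import Counter


def reflex_of_dj(token: str) -> str:
    """Label a token that continues *dj (selection is assumed to be done upstream)."""
    # Lower-casing removes the capital letters that mark stress (e.g. "lEdža").
    return "dialect" if "dž" in token.lower() else "standard"


def shares(tokens):
    """Return the percentage of dialect and standard realizations."""
    counts = Counter(reflex_of_dj(t) for t in tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100 * counts[label] / total for label in ("dialect", "standard")}


if __name__ == "__main__":
    # Toy sample: two forms with dž and two assumed standard spellings.
    sample = ["lédža", "prédžu", "rođena", "goveđe"]
    print(shares(sample))
```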
3. :
Table 3. Reflexes of the reduced vowels:
dialect vs standard realization
('vzdn', 'RGP', 'vzdan')
('dna', 'Ncmsa', 'dan')
('dOšo', 'Vmp-sm', 'doći')
('dOšo', 'Vmp-sm', 'doći')
('dOšl', 'Vmp-sm', 'doći')
('dAn', 'Ncmsn', 'dan')
('Išal', 'Vmp-sm', 'ići')
88 % 12 %
12 %, — 88 %. -
(37 % 63 %
), , ,
— -
.
( )
. 4 -
, -
( ).
4. -
10 .
4. -
. -
, -
( ),
.
10 « , , -
na ‘, ’ ( -
100 % „-
“ )» [ . 2019: 22].
4.
Table 4. Analytic case-marking
4.
tg u mojE znAnje i na mojEga
tAtu ( . .) odnEla
vodenIcu
‘, , []
’.
,
.
100 % 0 %
4.
boluvAla sam nEšto u glAvu
( ..)
‘ - ’.
a mI smo si u selO ( ..)
‘ ’.
ne znAm kOje gOdine (...
..) bIlo
‘ , ’.
99 % 1 %
4.
pa dOjde na nAs ( 1.) vodAta
doovdEka
‘ ’
tOj na nAs ( 1.) priAla bAba
jednA
‘
’.
p nIšta, On dadE mEne 11 parU, jA
njEmu (.3.) nEšto
‘ , ,
-’
80 % 20 %
4.
s njU / ss njU (c . 3.) ‘ ’
pOkraj njU ( . 3.) ‘
’
s njEga / ss njEga (c .3.)
‘ ’
s nAs (c 1.) ‘ ’
ss njI (c 3.) ‘ ’
,
.
100 % 0 %
11 mEne (-
*men > mene
*men > meni),
: mEne — Pp1-sa 'ja'.
4. -
() () 12.
4. -
.
, -
( ), -
.
5.
. , , —
, -
.
, -
. ,
115 (). -
(« -
», . coreference annotation [Deemter, Kibble 1999]), -
, ,
.
6. ( -
po ).
: 'pOmladu'.
() .
7.
( , -
, ,
).
-
-
. -
[Escher 2021] ,
-
. ,
« » / -
. , , ( ) -
. , -
12 () -
.
/ ,
().
-
,
. /
,
,
. -
, -
— jA ga vIdim tUj mOjega Iu
(1. .. ..1 ... ..)
‘ (. «») ’ —
. , -
, .
7. -
.
.
8. -
( 5).
5. :
Table 5. Conjunctive particle: dialect vs. standard realization
Ekaj sAd u prIam ( .1
..1)
‘, ’.
on e da poglEda (.3. .3
..3) ‘ ’
37 % 63 %
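
The contrast in Table 5 (future auxiliary followed directly by a present-tense verb vs. auxiliary + da + present-tense verb) can be searched for in tagged text roughly as follows. This is a hedged sketch, not the authors' script. It assumes tokens given as (form, tag, lemma) triples, that the future auxiliary carries a tag beginning with Va and the lemma hteti, and that present-tense main verbs carry tags beginning with Vmr; the two sample sentences, including the forms ću, će and their tags, are assumed reconstructions for illustration only.

```python
def classify_futures(sentence):
    """Label each future auxiliary in a list of (form, tag, lemma) triples.

    'dialect'  : auxiliary + present-tense verb (no conjunctive particle)
    'standard' : auxiliary + da + present-tense verb
    Auxiliaries followed by anything else are skipped.
    """
    labels = []
    for i, (_form, tag, lemma) in enumerate(sentence):
        if not (tag.startswith("Va") and lemma == "hteti"):
            continue
        following = sentence[i + 1 : i + 3]
        if following and following[0][1].startswith("Vmr"):
            labels.append("dialect")
        elif (
            len(following) == 2
            and following[0][0].lower() == "da"
            and following[1][1].startswith("Vmr")
        ):
            labels.append("standard")
    return labels


if __name__ == "__main__":
    without_da = [("sAd", "Rgp", "sad"), ("ću", "Var1s", "hteti"),
                  ("prIčam", "Vmr1s", "pričati")]
    with_da = [("on", "Pp3msn", "on"), ("će", "Var3s", "hteti"),
               ("da", "Cs", "da"), ("poglEda", "Vmr3s", "pogledati")]
    print(classify_futures(without_da), classify_futures(with_da))
```

A window of two tokens after the auxiliary is a deliberate simplification: intervening clitics or adverbs would need a wider or tag-aware search.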
4.
-
, (
)
, -
, . . . , -
-
, « -
»
, -
.
1.
Figure 1. Dialect profile of the informant
based on the results of the automatic analysis
« ».
. ,
3, , -
- -
. -
.
2.
Figure 2. Dialect profile of the informant based on the results of the manual analysis
5.
, /
-
, ,
. , -
.
: , -
( -
/ , -
). : ,
-
. -
, , -
.
, , , -
. -
-
( ) —
.
« » [Vuković et al. 2019], -
SAOSWB, -
(«»),
, -
.
, -
(,
, ).
- -
.
— ,
. -
-
,
13. -
. -
, -
,
-
. --
() [Goedertier et al. 2000], -
,
.
13 , ,
https://www.clarin.eu/resource-families/spoken-corpora.
. , -
, -
« ».
1, 2, 3 — , — , — , —
, — , — , — , —
, — , — , —
, — , — , SAOSWB — -
, — -
, — , — .
. 2019 — . . , . . , . . .
( -
) //
. . 2019. 58. . 17–33. DOI: 10.17223/19986645/58/2.
, 2020 — . , . . .
(
) // . .
2020. 66. C. 158–176. DOI: 10.17223/19986645/66/9.
1998 — . . .
// . . (. .).
. . 5. .: , 1998. . 106–167.
References
Birkner 2015 — V. Birkner. The advantages and disadvantages of employing corpus
evidence in sociolinguistic studies. The Teacher Magazine. 2015. Vol. 2. P. 11–17.
Dash 2012 — N. S. Dash. Etymological Annotation: a New Concept of Corpus Anno-
tation. Proceedings of the 34th All India Conference of Linguists (34-AICL). Shill-
ong, India, 2012. P. 100–104.
Dash, Arulmozi 2018 — N. S. Dash, S. Arulmozi. Limitations of language corpora.
N. Dash, S. Arulmozi. History, features, and typology of language corpora. Singa-
pore: Springer Singapore, 2018. P. 259–272.
Dash, Hussain 2013 — N. S. Dash, M. M. Hussain. Designing a Generic Scheme for Et-
ymological Annotation: a New Type of Language Corpora Annotation. P. Bhattacha-
rayya, K.-S. Choi (eds.). Proceedings of the 11th Workshop on Asian Language Re-
sources. Nagoya: Asian Federation of Natural Language Processing, 2013. P. 64–71.
Deemter, Kibble 1999 — K. van Deemter, R. Kibble. What is coreference, and what
should coreference annotation be? A. Bagga, B. Baldwin, S. Shelton (eds.). Pro-
ceedings of the Workshop on Coreference and Its Applications. Stroudsburg, PA:
Association for Computational Linguistics, 1999. P. 90–96.
Erjavec et al. 2003 — T. Erjavec, C. Krstev, V. Petkevic, K. Simov, M. Tadic, D. Vitas.
The MULTEXT-east morphosyntactic specifications for Slavic languages.
T. Erjavec, D. Vitas (eds.). Proceedings of the Workshop on Morphological Processing
of Slavic Languages, EACL 2003. Stroudsburg, PA: Association for Computational
Linguistics, 2003. P. 25–32.
Escher 2021 — A. L. Escher. Double argument marking in Timok dialect texts (in Bal-
kan Slavic context). Zeitschrift für Slawistik. Forthcoming.
Goedertier et al. 2000 — W. Goedertier, S. Goddijn, J.-P. Martens. Orthographic tran-
scription of the spoken Dutch corpus. M. Gavrilidou, G. Carayannis, S. Markanto-
natou, S. Piperidis, G. Stainhouer (eds.). Proceedings of the Second International
Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece.
Athens: National Technical University of Athens Press, 2000. P. 909–914.
Konior et al. 2019 — D. V. Konior, A. L. Makarova, A. N. Sobolev. Statisticheskiy metod
yazykovogo profilirovaniya nositelya dialekta (na materiale vostochnoserbskogo
idioma sela Berchinovats) [Quantitative method of language profiling of a dialect
speaker (based on the material of the East Serbian idiom of the village of Bercinovac)].
Tomsk State University Journal of Philology. 2019. No. 58. P. 17–33.
Ljubešić et al. 2016 — N. Ljubešić, F. Klubička, Ž. Agić, I.-P. Jazbec. New Inflectional
Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian
and Serbian. N. Calzolari, Kh. Choukri, Th. Declerck, S. Goggi, M. Grobelnik,
B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (eds.).
Proceedings of the Tenth International Conference on Language Resources and
Evaluation (LREC 2016). Paris: European Language Resources Association, 2016.
P. 4264–4270.
Sikimić, Sobolev 2020 — B. Sikimić, A. N. Sobolev. Processy divergentcii v razdelennom
gosudarstvennoy granitcey zapadnoyuzhnoslavyanskom dialekte (na materiale
sovremennoy dialektnoy rechi Vostochnoy Serbii i Zapadnoy Bolgarii)
[Divergence Processes in the West South Slavic Dialect Divided by the State Border
(Based on the Modern Dialect Speech of Eastern Serbia and Western Bulgaria)].
Tomsk State University Journal of Philology. 2020. No. 66. P. 158–176. DOI:
10.17223/19986645/66/9.
Sobolev 1998 — A. N. Sobolev. O dialektologicheskom atlase Vostochnoy Serbii
i Zapadnoy Bolgarii [On the dialectological atlas of Eastern Serbia and Western
Bulgaria]. G. P. Klepikova (ed.). Issledovaniya po slavyanskoy dialektologii
[Studies in Slavic Dialectology]. Iss. 5. Moscow: Institute of Slavic Studies RAS, 1998.
P. 106–167.
Vuković et al. 2019 — T. Vuković, N. Muheim, O. Winistörfer, I. Simko, A. Makarova,
S. Bradjan. Corpora and Processing Tools for Non-Standard Contemporary and
Diachronic Balkan Slavic. I. Temnikova, I. Nikolova, N. Konstantinova (eds.).
Proceedings of the Student Research Workshop associated with The 12th International
Conference on Recent Advances in Natural Language Processing (RANLP 2019).
Shoumen: Incoma, 2019. P. 62–68.
Vuković et al. 2020 — T. Vuković, B. Sonnenhauser, A. Escher. Degrees of
non-standardness. Feature-based analysis of variation in a Torlak dialect corpus. Manuscript.
Sources
SAOSWB — A. N. Sobolev. Sprachatlas Ostserbiens und Westbulgariens. Bd. I. Pro-
blemstellung, Materialen und Kommentare, Kartenanalyse. Bd. II. Sprachkarten.
Bd. III. Texte. Marburg; Lahn: Biblion Verlag, 1998.