Poster

Infrastructure of the Polish Learner Corpus PoLKo

Abstract

Please find the video presentation here: https://www.youtube.com/watch?v=fJL7CbWXAHQ

In recent years, learner corpora have become increasingly popular as a source for analysing L2 learners’ language (Gilquin, Granger, & Paquot, 2007: 322–323). However, in comparison with national corpora, which are available for many languages, relatively few learner corpora exist for languages other than English (Štindlová, 2013: 62–65). In particular, there is still no learner corpus that could support analyses of the language of non-native speakers of Polish (Zasina, 2019). Therefore, the goal of our poster is to present the first attempt to compile such a corpus and to report on the ongoing project. The primary goal of the project is to collect learners’ writings in Polish as a foreign language at various levels of language proficiency. The collected material will serve as a basis for analysing learners’ language, identifying the most common language errors, creating classroom materials, and improving modern teaching methods. In the first step, we will collect all available electronic texts in order to gather a sizeable amount of starting material in the shortest possible time. In the second step, we intend to focus on hand-written texts and the rules for transcribing them into a computer-readable format, and on balancing the entire corpus in terms of first language and language level (CEFR). In the initial phase, the corpus will be prepared in the TeiTok environment (Janssen, 2016), where it can be easily modified during its creation. TeiTok will also be used for transcribing texts and managing all collected learners’ writings.

Motivation

Learner corpora have become popular as a source for analysing L2 learners’ language (Gilquin, Granger, & Paquot, 2007: 322–323)
Relatively few learner corpora for languages other than English (https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html)
Lack of a learner corpus of Polish (Zasina, 2019)
Need for a modern source of learning materials
Polish is taught all over the world, on all continents

Acknowledgement

We would like to express our gratitude to Alexandr Rosen for the extraordinary technical support and to Urszula Sajkowska from the Linguae Mundi Foundation.

Adrian Jan Zasina (Charles University, adrian.zasina@ff.cuni.cz) & Elżbieta Kaczmarska (University of Warsaw, e.h.kaczmarska@uw.edu.pl)

Polish language
About 50 million speakers (Achtelik et al., 2018)
One of the 25 most commonly spoken languages in the world (Council for the Polish Language, 2007)
The 15th language according to the Power Language Index (Chan, 2016)
(Ministerstwo Spraw Zagranicznych, 2014)

Project goal

The primary goal of the project is to collect learners’ writings in Polish as a foreign language at various levels of language proficiency.
The collected material will be a basis for:
analysing the L2 learners’ language
identifying the most common language errors
creating classroom materials
improving modern teaching methods

Corpus PoLKo

Corpus building in the TeiTok environment (Janssen, 2016)
TeiTok used for text transcription and managing all collected learners’ writings
TeiTok used as a search engine
Metadata divided into two groups:
information on a respondent (e.g. age, sex, L1)
information on a text (e.g. title, word count, topic)
Morphosyntactic annotation by the MorphoDiTa tagger with the Polish language model
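
The two metadata groups listed above could be modelled roughly as follows. This is only an illustrative sketch and says nothing about how TeiTok actually stores metadata: the class and field names, the `cefr_level` field (the poster does not list it as a metadata item, although the abstract mentions balancing by language level), the `balance_table` helper, and the sample records are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Respondent:
    """Information on a respondent (fields follow the poster's examples)."""
    age: int
    sex: str
    l1: str  # first language


@dataclass(frozen=True)
class TextInfo:
    """Information on a text (fields follow the poster's examples)."""
    title: str
    word_count: int
    topic: str
    cefr_level: str  # assumption: not named on the poster as a metadata field


@dataclass(frozen=True)
class CorpusText:
    """One learner text with both metadata groups attached."""
    respondent: Respondent
    text: TextInfo


def balance_table(texts):
    """Sum word counts per (L1, CEFR level) cell to inspect corpus balance."""
    table = Counter()
    for item in texts:
        table[(item.respondent.l1, item.text.cefr_level)] += item.text.word_count
    return table


# Invented sample records, for illustration only.
sample = [
    CorpusText(Respondent(23, "F", "Czech"), TextInfo("Moja rodzina", 120, "family", "A2")),
    CorpusText(Respondent(31, "M", "Czech"), TextInfo("Wakacje", 80, "travel", "A2")),
    CorpusText(Respondent(27, "F", "German"), TextInfo("Moje miasto", 200, "city", "B1")),
]

for (l1, level), words in sorted(balance_table(sample).items()):
    print(f"{l1:<8}{level:<4}{words}")
```

Keeping respondent and text information in separate records mirrors the two-group split and makes it straightforward to tabulate the corpus by first language and proficiency level.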

Error taxonomy

According to the error taxonomy used by the State Commission for the Certification of Proficiency in Polish as a Foreign Language (Markowski, 2008), we distinguish several error levels:
grammatical
lexical
stylistic
spelling
punctuation
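
As a rough illustration of how the five error levels listed above might be used to identify the most common error types, the sketch below encodes them as an enumeration and counts annotations per level. The `ErrorLevel` and `most_common_levels` names, the (learner form, label) annotation format, and the sample annotations are hypothetical; the poster does not describe the project's actual annotation scheme.

```python
from collections import Counter
from enum import Enum


class ErrorLevel(Enum):
    """The five error levels named in the taxonomy."""
    GRAMMATICAL = "grammatical"
    LEXICAL = "lexical"
    STYLISTIC = "stylistic"
    SPELLING = "spelling"
    PUNCTUATION = "punctuation"


def most_common_levels(annotations):
    """Count error annotations per level, most frequent first.

    `annotations` is an iterable of (learner_form, label) pairs; a label
    outside the taxonomy raises ValueError.
    """
    counts = Counter(ErrorLevel(label) for _, label in annotations)
    return counts.most_common()


# Invented annotations, for illustration only.
sample_errors = [
    ("ksiązka", "spelling"),
    ("widzę kot", "grammatical"),
    ("robić błąd", "lexical"),
    ("rzeka pisane jako żeka", "spelling"),
]

for level, count in most_common_levels(sample_errors):
    print(level.value, count)
```

Using an enumeration rather than free-text labels means that a mistyped level name is rejected at annotation time instead of silently creating a new category.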

[Figures: Corpus searching; Text view]

Project websites

http://utkl.ff.cuni.cz/teitok/polko/
https://www.researchgate.net/project/The-Polish-Learner-Corpus
http://slawistyka.uw.edu.pl/pl/the-polish-learner-corpus/

Corpus data collection

Students’ texts from the School and Foundation Linguae Mundi
Private students’ texts

References

Achtelik et al. (2018). Nauczanie i promocja języka polskiego w świecie. Diagnoza – stan – perspektywy. Katowice: Wydawnictwo Uniwersytetu Śląskiego.
Chan, K. L. (2016). Power Language Index. Which are the world’s most influential languages? Retrieved from http://www.kailchan.ca/wp-content/uploads/2016/12/Kai-Chan_Power-Language-Index-full-report_2016_v2.pdf
Council for the Polish Language. (2007). The Polish Language. Retrieved from http://www.rjp.pan.pl/images/stories/pliki/broszury/jp_angielski.pdf
Gilquin, G., Granger, S., & Paquot, M. (2007). Learner corpora: The missing link in EAP pedagogy. Journal of English for Academic Purposes, 6(4), 319–335. doi: 10.1016/j.jeap.2007.09.007
Janssen, M. (2016). TEITOK: Text-Faithful Annotated Corpora. In N. Calzolari et al. (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 4037–4043). Portorož: ELRA.
Markowski, A. (2008). Kultura języka polskiego. Teoria. Zagadnienia leksykalne. Warszawa: Wydawnictwo Naukowe PWN.
Ministerstwo Spraw Zagranicznych. (2014). Atlas polskiej obecności za granicą [Atlas of Polish presence abroad]. Retrieved from https://issuu.com/msz.gov.pl/docs/atlas_polskiej_obecnosci_za_granica
Zasina, A. J. (2019). Podejście korpusowe w nauczaniu języka polskiego jako obcego na przykładzie rzeczownikowych alternacji ó:o. In K. Zioło-Pużuk (Ed.), Panorama glottodydaktyki polonistycznej. Wyzwania, pytania, kierunki (pp. 181–199). Warszawa: Wydawnictwo Naukowe Uniwersytetu Kardynała Stefana Wyszyńskiego.