The KiezDeutsch Korpus (KiDKo) Release 1.0
Ines Rehbein, Sören Schalowski and Heike Wiese
Potsdam University
German Department, SFB 632 “Information Structure”
irehbein|heike.wiese@uni-potsdam.de
Abstract
This paper presents the first release of the KiezDeutsch Korpus (KiDKo), a new language resource with multiparty spoken dialogues
of Kiezdeutsch, a newly emerging language variety spoken by adolescents from multiethnic urban areas in Germany. The first release
of the corpus includes the transcriptions of the data as well as a normalisation layer and part-of-speech annotations. In the paper, we
describe the main features of the new resource and then focus on automatic POS tagging of informal spoken language. Our tagger
achieves an accuracy of nearly 97% on KiDKo. While we did not succeed in further improving the tagger using ensemble tagging, we
present our approach to using the tagger ensembles for identifying error patterns in the automatically tagged data.
Keywords: spoken language corpora; urban youth language; Kiezdeutsch
1. Introduction
Linguistically annotated corpora are an essential basis for
(quantitative) studies of language variation. However, most
language resources are based on canonical written lan-
guage, often from the newspaper domain, while only few
corpora exist which are large enough for investigating vari-
ation in spoken language. The reasons for this are obvi-
ous. Written text is easy to come by in an already digi-
tised format, whereas the creation of spoken language cor-
pora requires time-consuming preprocessing. Besides the
highly cost-intensive transcription process, applying auto-
matic preprocessing tools like POS taggers and syntactic
parsers to spoken language also results in a substantially
lower accuracy than the one we can expect for canonical,
written text, as these tools are usually trained on data from
a written register.
This decrease in accuracy is partly due to data sparseness,
caused by the high number of different pronunciation vari-
ants for each canonical lexical form. In addition, we ob-
serve elements not typically used in written language and
thus not known to the preprocessing tools. For instance, in
spoken language we find a great number of filled pauses
like uh, uhm, backchannel signals (hm, m-hm), question
tags (ne, wa, gell) and interjections. Many morphologi-
cal and syntactic structures typical for spoken language are
also not covered by the training data, which again leads
to a decrease in tagging accuracy. Examples are cliticisa-
tions, exclamations, verbless utterances, or non-canonical
word order, for instance verb-second word order in subordi-
nate sentences with weil (because). In nonstandard dialects,
there will be additional lexical and grammatical character-
istics that might cause problems, such as specific lexemes,
different inflectional patterns or syntactic options. Further-
more, the different distribution of lexical elements in the
(written) training data and in spoken language results in er-
roneous tagger predictions. Finally, when working with in-
formal spoken data, we also have to deal with abandoned
utterances, unfinished words, and repairs.
The contribution of our paper is threefold. First of all, we
present a new resource for general investigations of spo-
ken, informal youth language and, in particular, for inves-
tigations of language use in monolingual and multilingual
urban settings. Second, the new corpus provides training
data for the development or adaptation of POS taggers for
informal spoken language. Finally, we present our efforts
to improve a POS tagger for spoken, informal German and
to automatically detect tagging errors in the corpus.
2. Kiezdeutsch – the data
Kiezdeutsch (’hood German’) is a new variety of German
emerging in multiethnic urban neighbourhoods (Wiese,
2009; Wiese, 2013). This urban dialect is characteris-
tic of informal peer-group conversations among adoles-
cents, and is spoken across multilingual and monolingual
speakers and different heritage language backgrounds. The
linguistically highly diverse context in which it emerges,
with its wealth of language contact opportunities, makes
Kiezdeutsch more open to variation and innovation and results in special
linguistic dynamics. Kiezdeutsch thus offers special access to ongoing
tendencies of language development and change in contemporary German.
The lexical and grammatical features that make it interest-
ing for linguistic investigations at the same time also con-
stitute a challenge for automatic annotation. As a new,
emerging dialect, Kiezdeutsch shows characteristic features
at phonological/phonetic, lexical, and grammatical levels,
such as some non-canonical pronunciation patterns (e.g.,
coronalisation of [ç]), the development of new particles, the
integration of new loan words from other languages, some
non-canonical inflections, variations in the use of functional
categories such as articles and pronouns, and new word or-
der options (for overviews cf., e.g. Wiese (2009; 2013),
Auer (2013), and references therein).
(1) and (2) give some linguistic examples from the cor-
pus material,¹ illustrating the occurrence of bare NPs for
local expressions ((1); in contrast to Standard German
¹ Capitalisation indicates main stress; speakers’ codes include information on corpus part (first two letters: in this case, all data is from the multiethnic main corpus, “Mu”), gender (last but one letter: all speakers are male, “M”), and family/heritage language (last letter, in the examples above: “A” for Arabic, and “D” for German).
Figure 1: Screenshot of a KiDKo sample, a short dialogue between 3 speakers (MuH9WT, SPK3, SPK5) in EXMARaLDA (non-verbal layer (nv), transcription (v), normalisation (norm) and POS). English transliteration: MuH9WT: Why were you today not school? “Why weren’t you at school today?” SPK3: Why should I? “Why should I?” MuH9WT: I ask PTCL just. “I’m just asking.” SPK5: We wanted together chill. “We wanted to chill together.”
PP[DP[NP]]), coronalisation ((1); isch instead of Standard
German ich), and the option to use two constituents (1) or
none (2) before the finite verb in declarative main clauses
(in addition to the option of using exactly one constituent,
which would lead to canonical verb-second word order).
(1) GEStern    isch   war   KUdamm
    yesterday  I      was   Kudamm
    “Yesterday I was (at the) Kudamm.” [KiDKo, MuH25MA]
(2) brauchst   du    VIER   alter
    need       you   four   old.one
    “You need four of those, man!”
    (= parts for building virtual cars in a computer game) [KiDKo, MuH11MD]
The data was collected in the first phase of project B6
”Grammatical reduction and information structural prefer-
ences in a contact variety of German: Kiezdeutsch” as part
of the SFB (Collaborative Research Centre) 632 ”Informa-
tion Structure” in Potsdam. It contains spontaneous peer-
group dialogues of adolescents from multiethnic Berlin-
Kreuzberg (around 266,000 tokens) and a supplementary
corpus with adolescent speakers from monoethnic Berlin-
Hellersdorf (around 111,000 tokens, excluding punctua-
tion). On the normalisation layer where punctuation is in-
cluded, the token counts add up to around 359,000 tokens
(main corpus) and 149,000 tokens (supplementary corpus).
For a more detailed description of the data see (Wiese et al.,
2012). The current, second and final, phase of the project
is dedicated to corpus compilation including annotation.
2.1. Corpus architecture
The current version of the corpus contains the audio signals
aligned with transcriptions. The data was transcribed us-
ing an adapted version of the transcription inventory GAT
2 (Selting et al., 1998), also called GAT minimal transcript,
which includes information on primary accent and pauses.
Release 1.0 of KiDKo also includes a level of orthographic
normalisation where non-canonical pronunciations, punc-
tuation, and capitalisation are transferred to Standard Ger-
man spelling, as well as a layer of annotation for part-of-
speech tags (Section 3.).²
The normalisation layer is necessary for different reasons.
First, the normalised version of the data allows users to
search for all pronunciation variants of a particular word
and thus increases the usability of the corpus. Second, it
provides the input for automatic POS tagging; working on the normalised
forms considerably reduces the number of unknown words in the data and
thus increases tagging accuracy. The
normalised version of the data, however, should be con-
sidered as an annotation and thus as an interpretation of
the data. Often, missing context information or poor au-
dio quality (caused by noisy environments) complicate the
transcription and license different possible interpretations
of the same audio sequence. Here, the normalisation layer
makes explicit what has been understood by the transcriber
and thus can be considered as a poor man’s target hy-
pothesis where decisions made during the transcription be-
come more transparent (also see Hirschmann et al. (2007),
Reznicek et al. (2010) for a discussion of the importance of
target hypotheses for the analysis of learner language).
Figure 1 shows an example transcript from KiDKo in the
transcription tool EXMARaLDA (Schmidt, 2012), display-
ing the transcription and the normalisation layer, the POS
tags and a layer for non-verbal information. Uppercase let-
ters on the transcription layer mark the main accent of the
² Please note that the normalisation does not transfer the data into canonical structures. We do not change nonstandard patterns, e.g., in such domains as inflection or word order. The normalised layer also includes disfluencies, repetitions, and abandoned utterances.
Figure 2: Screenshot of a KiDKo sample with German-Turkish code-mixing (English transliteration: SPK1: Every day me have bought. “Every day I bought one for myself.” Always for myself own one box have bought. “I always bought my own box.”)
utterance. The equals sign is used to encode the tight con-
tinuation of a word form with a following form, where one
of the two forms (or both of them) is reduced (e.g. such
cliticisations as in ’warst =e’ (warst du) “were you”). The
(-) marks a silent pause of short length.
Since the data for the main corpus was recorded in con-
versations with many multilingual speakers, it also in-
cludes some code-mixing and code-switching of German
with other heritage languages, mostly Turkish. For those
passages, the Turkish part has been transcribed and trans-
lated (Figure 2). On the (German) transcription and nor-
malisation layer, the utterance has been marked as foreign
language material. We provide a Turkish transcription layer
(tr) that captures nonstandard pronunciation, but does not
mark the main accent of the utterance. The Turkish nor-
malisation layer (trnorm) translates this into Standard Turk-
ish. In addition, we provide a literal German translation
(trdtwwue) and a free translation (trdtue).
2.2. Corpus access and future work
We plan to release the POS tagged version of the corpus
in spring 2014. Due to legal constraints, the audio files
will have restricted access and can only be consulted locally,
while the transcribed and annotated version of the corpus
will be available over the internet via ANNIS (Zeldes et al.,
2009).³
In the near future, we will augment the corpus with a flat
syntactic analysis and topological field information (Drach,
1937; Höhle, 1998). The new layers will enable users to
³ ANNIS (ANNotation of Information Structure) is a corpus search and visualisation interface which allows the user to formulate complex search queries which can combine multiple layers of annotation. (http://www.sfb632.uni-potsdam.de/annis/)
conduct corpus searches for complex syntactic phenomena.
In the remainder of the paper we focus on the challenges of
automatic POS tagging of spoken language and report our
efforts to improve the tagger and to identify error patterns
in the automatically tagged data.
3. POS tagging
The procedure for adding a POS annotation layer to KiDKo
is as follows. First, the data is transcribed. Then, we
automatically add the normalisation layer by copying the
transcriptions to a separate layer and automatically correct-
ing spelling and frequent pronunciation variants based on
a dictionary lookup. Then the normalisation is checked by
the transcriber and remaining errors are corrected manu-
ally. Afterwards, the normalisation is automatically POS
tagged, using a CRF-based tagger developed for the annota-
tion of Kiezdeutsch (Rehbein and Schalowski, To appear)⁴
and manually corrected in a post-processing phase.
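To make the normalisation step more concrete, the following minimal sketch shows a dictionary-based lookup of the kind described above; the dictionary entries and function names are purely illustrative and do not reproduce the actual KiDKo pipeline or its dictionary.

```python
# Minimal sketch of dictionary-based normalisation: frequent pronunciation
# variants are mapped to Standard German forms, everything else is kept
# unchanged and later checked manually by the transcriber.
# The entries below are illustrative examples, not the project's dictionary.
NORMALISATION_DICT = {
    "isch": "ich",    # coronalised pronunciation of "ich"
    "nich": "nicht",
    "ma": "mal",
}

def normalise_utterance(tokens):
    """Copy the transcription layer and replace known variants."""
    return [NORMALISATION_DICT.get(tok.lower(), tok) for tok in tokens]

print(normalise_utterance(["GEStern", "isch", "war", "KUdamm"]))
# -> ['GEStern', 'ich', 'war', 'KUdamm']
```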
The tagger is based on the CRFSuite package (Okazaki,
2007) and uses features like word form, word length, or
the number of upper case letters or digits in a word. In
addition, we use prefix/suffix features (the first/last n characters
of the input word form) as well as feature templates
which generate new features of word ngrams where the in-
put word form is combined with preceding and following
word forms. To address the unknown word problem in our
data, we add features from LDA word clusters (Chrupała,
2011) learned on untagged Twitter data and an automati-
cally created dictionary which was harvested from the Huge
⁴ The annotation scheme we use is an extended version of the Stuttgart-Tübingen Tagset (STTS) (Schiller et al., 1999) with 11 new tags tailored to the annotation of spoken discourse. Our annotators achieved an inter-annotator agreement of 0.975 (Fleiss’ κ) on KiDKo data using the extended tagset.
German Corpus (HGC) (Fitschen, 2004) which had been
POS tagged using the Treetagger (Schmid, 1995).
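As an illustration of the kind of feature extraction just described, the sketch below builds a per-token feature dictionary with word form, word length, upper-case/digit counts, affixes, neighbouring word forms, and a cluster ID; the feature names and the toy cluster mapping are assumptions for the example, not the tagger’s actual feature set.

```python
# Illustrative per-token feature extraction in the spirit of the CRF tagger
# described above. The resulting dicts could be fed to a CRF toolkit such as
# CRFsuite; feature names and the cluster lookup are made up for the example.

def token_features(sent, i, clusters=None, affix_len=3):
    """Return a feature dict for token i of a tokenised sentence."""
    w = sent[i]
    feats = {
        "word": w.lower(),
        "length": len(w),
        "n_upper": sum(c.isupper() for c in w),
        "n_digit": sum(c.isdigit() for c in w),
        "prefix": w[:affix_len].lower(),
        "suffix": w[-affix_len:].lower(),
    }
    # simple ngram templates combining the token with its neighbours
    if i > 0:
        feats["prev_word"] = sent[i - 1].lower()
        feats["prev+cur"] = sent[i - 1].lower() + "_" + w.lower()
    if i < len(sent) - 1:
        feats["next_word"] = sent[i + 1].lower()
        feats["cur+next"] = w.lower() + "_" + sent[i + 1].lower()
    # word-cluster ID (e.g. from LDA or Brown clusters) against data sparseness
    if clusters is not None:
        feats["cluster"] = clusters.get(w.lower(), "UNK")
    return feats

sentence = ["brauchst", "du", "VIER", "alter"]
features = [token_features(sentence, i, clusters={"alter": "c42"})
            for i in range(len(sentence))]
```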
Our tagger achieves an accuracy of 95.8% on the nor-
malised transcripts when trained on a small training set with
10,682 tokens, and of 96.9% when trained on a larger train-
ing set (66,043 tokens; 5-fold cross validation).
The accuracy of the tagger is in the same range as state-
of-the-art taggers on newspaper text. However, the results
might be a bit too optimistic as we also tag silent pauses
and foreign as well as uninterpretable material, which are
all unambiguous and occur with a high frequency in the
corpus. To give a more realistic assessment of the tag
quality, we exclude punctuation, silent pauses and unin-
terpretable/foreign language material from the evaluation
and compare the KiDKo results to results achieved by our
tagger when trained and tested on the TIGER treebank
(Brants et al., 2002), using the data split from the CoNLL-
2006 shared task.⁵ Results show that the impact of silent
pauses and foreign/uninterpretable material on tagging ac-
curacy is quite low, but that the large number of punctuation marks,
owing to the shorter utterance lengths in the corpus, has a considerable
influence on tagging accuracy. Removing punctuation results in a decrease
in accuracy of 1.4% for KiDKo, whereas the accuracy on TIGER only
decreases from 98.3% to 98.0%.
Figure 3 shows the learning curve for our best tagger, the
CRF tagger. In the beginning, the curve is quite steep up to
a training size of around 50,000 tokens. After that, adding
more training data does not have such a strong effect on
accuracy any more.
Figure 3: Learning curve for the CRF tagger (5-fold cross
validation)
⁵ In the experiments we also use LDA word clusters from Twitter. Replacing those by word clusters learned from the HGC gives a small improvement of around 0.1%.
Tagger / combination                     with punc   w/o punc
Brill                                    94.4        91.8
Treetagger                               95.1        92.8
Stanford                                 95.3        93.5
Hunpos                                   95.6        93.6
CRF                                      96.9        95.5
majority vote                            96.4        94.8
stacking (brill, crf, hun, stan, tree)   96.8        95.4
stacking (brill, hun, stan, tree)        96.8        95.3
stacking (hun, stan, tree)               96.8        95.4
stacking (hun, stan)                     96.8        95.4

Table 1: Baseline and ensemble results for different taggers and tagger
combinations, using majority vote and stacking a CRF tagger with the output
of the baseline taggers (5-fold cross validation on the training set; the
second column shows results excluding punctuation)
3.1. Ensemble tagging
It has often been shown that combining different taggers
and either using a simple majority vote or stacking a tag-
ger with POS tags predicted by other taggers does improve
tagging results (Brill and Wu, 1998; M`
arquez et al., 1999;
Søgaard, 2010).
Thus, we tried to improve tagging accuracy by combining
the output of five different taggers. The taggers used in our
experiments are
- the Brill tagger (Brill, 1992)
- the Stanford tagger (Toutanova and Manning, 2000)
- the Hunpos tagger⁶
- the Treetagger (Schmid, 1995)
- our CRF-based tagger⁷
We tried two different approaches. In the first one, we used
a simple majority vote. In the second approach, we trained
a new CRF-based classifier, using the output of the five dif-
ferent taggers as additional features. The results are shown
in Table 1.
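The two combination schemes can be sketched as follows; the tagger names, example tokens and tags are made up for illustration, and the stacking part only shows how the baseline predictions are turned into extra features for the second classifier, not the actual setup used for KiDKo.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one tag sequence per tagger, all of equal length."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]

def stacking_features(tokens, predictions, i):
    """Features for token i: the word form plus each baseline tagger's tag."""
    feats = {"word": tokens[i].lower()}
    for name, tags in predictions.items():
        feats["tag_" + name] = tags[i]
    return feats

tokens = ["gestern", "ich", "war", "Kudamm"]
preds = {"hunpos":   ["ADV", "PPER", "VAFIN", "NN"],
         "stanford": ["ADV", "PPER", "VAFIN", "NE"],
         "tree":     ["ADV", "PPER", "VAFIN", "NE"]}
print(majority_vote(list(preds.values())))  # -> ['ADV', 'PPER', 'VAFIN', 'NE']
print(stacking_features(tokens, preds, 3))  # word plus one tag feature per tagger
```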
Surprisingly, we were not able to improve over our best
baseline tagger (CRF: 96.9%). Results for classifier stack-
ing are a bit higher than for the simple majority vote, but
still below the results of the CRF tagger. We suspect that the
gap in accuracy between our best tagger and the other sys-
tems is too large for the highest-scoring system to benefit
from the output of the other taggers.
⁶ The Hunpos tagger is an open source reimplementation of the TnT tagger (https://code.google.com/p/hunpos).
⁷ http://www.chokkan.org/software/crfsuite/
        ALL    w/o punc   NE     PRF    PTKZU   PTKVZ   VAINF   VVFIN   VVIMP
CRF     96.9   95.5       89.8   71.3   84.8    93.5    77.3    94.5    89.0
2STEP   97.2   96.0       92.3   82.6   88.2    100.0   89.6    95.2    90.7

Table 2: Improvements for all tags (ALL: with punctuation, w/o punc: excluding
punctuation) and for individual POS tags (NE: proper name, PRF: reflexive
pronoun, PTKZU: infinitive particle “zu”, PTKVZ: separated verb particle,
VAINF: auxiliary infinitive, VVFIN: finite full verb, VVIMP: imperative full verb)
3.2. Improved tagging with linguistically
motivated features
Our error analysis shows that the tagger often mistakes
proper names for nouns and vice versa. Other frequent er-
rors are the confusion of finite and infinite verbs, of per-
sonal and reflexive pronouns, and of demonstrative pro-
nouns and determiners.⁸
Most of this is not really surprising, as the distinction be-
tween nouns and proper names is also problematic on a
theoretical level, and some of the decisions made in the
annotation guidelines seem to be arbitrary (see Schiller et
al. (1999), pp. 15.) Discriminating between finite and in-
finite verbs, however, is easy for human annotators even in
cases where surface forms are identical, as, e.g., for some
verb forms inflected for 2PL and infinitives. In order to
accomplish the task, the tagger needs more global context.
Example 3 illustrates this. In 3 a) the finite, plural form
machen (do) should be assigned the VVFIN tag, while in b)
and c) machen is infinite and should be tagged as VVINF.
(3) a. weil     sie    Hausaufgaben   machen   .
       because  they   homework       do.2PL   .
       “because they are doing their homework.”

    b. weil     sie    Hausaufgaben   machen   muss   .
       because  she    homework       do.INF   must   .
       “because she has to do her homework.”

    c. weil     sie    Hausaufgaben   machen   nicht   mag     .
       because  she    homework       do.INF   not     likes   .
       “because she doesn’t like to do homework.”
The left context in these three examples is exactly the same.
The only clue is the modal verb in the right context in b)
(muss) and c) (mag). While in b) the direct adjacency of the
two word forms enables the tagger to use this information,
in c) the modal verb is out of range, resulting in the false
prediction for machen as a finite verb.⁹
Due to the semi-free word order in German, we often ob-
serve cases like the one above where global information is
not locally accessible. We thus use a two-step approach
where in the first step we assign POS tags to the text, using
our best baseline system. In the second step we extract new
features from the output of the first tagger and train a sec-
ond classifier, adding linguistically motivated clues from
the left and right context.
⁸ These errors are not specific to informal, spoken youth language but also occur when tagging newspaper text.
⁹ It is, of course, possible to train tagging models utilising a larger context window. This, however, usually results in sparse data problems and thus in a lower accuracy.
3.2.1. Finite vs. infinite verbs
To better distinguish between finite and infinite verb forms,
we search in the right context of each token for a verb, start-
ing from the end of the sentence. If we find one, we add the
POS for this verb predicted by the CRF tagger as a new
feature.¹⁰ This feature is added for each token.
The left context feature is only added for tokens that have
been identified as a verb form in the first step. For all other
tokens, this feature is set to null. Starting from the token
we want to tag, we search the left context for either another
verb form or for a subordinating conjunction, a relative or
interrogative pronoun. If we find one, we add the tag as a
new feature.
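A minimal sketch of these two context features, operating on the tags predicted in the first pass, could look as follows; the tag sets follow STTS, while the feature names and the helper function are illustrative rather than the actual implementation.

```python
# Sketch of the two-step context features for the finite/infinite verb
# distinction described above, working on first-pass POS predictions.

VERB_TAGS = {"VVFIN", "VVINF", "VVIMP", "VVIZU", "VVPP",
             "VAFIN", "VAINF", "VAIMP", "VAPP",
             "VMFIN", "VMINF", "VMPP"}
# subordinating conjunctions, relative and interrogative pronouns (STTS)
LEFT_TRIGGER_TAGS = {"KOUS", "PRELS", "PRELAT", "PWS", "PWAT", "PWAV"}

def verb_context_features(first_pass_tags, i):
    feats = {}
    # right context: search for a verb, starting from the end of the sentence
    for j in range(len(first_pass_tags) - 1, i, -1):
        if first_pass_tags[j] in VERB_TAGS:
            feats["right_verb_tag"] = first_pass_tags[j]
            break
    # left context: only for tokens tagged as a verb in the first step
    if first_pass_tags[i] in VERB_TAGS:
        feats["left_trigger_tag"] = "null"
        for j in range(i - 1, -1, -1):
            tag = first_pass_tags[j]
            if tag in VERB_TAGS or tag in LEFT_TRIGGER_TAGS:
                feats["left_trigger_tag"] = tag
                break
    return feats

# "weil sie Hausaufgaben machen nicht mag ." with (possibly wrong) first-pass tags
tags = ["KOUS", "PPER", "NN", "VVFIN", "PTKNEG", "VMFIN", "$."]
print(verb_context_features(tags, 3))
# -> {'right_verb_tag': 'VMFIN', 'left_trigger_tag': 'KOUS'}
```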
3.2.2. Personal vs. reflexive pronouns
To help the tagger make a more informed decision when
identifying reflexive pronouns, we add a new feature for
each of the following word forms: dich, dir, euch, mich, mir,
sich, uns. These forms are ambiguous between a reflexive
and an irreflexive reading. We thus search the clausal con-
text for another pronoun agreeing in person with the first
form.
(4) a. Ich   habe   mich           geschnitten   .
       I     have   myself.REFL    cut           .
       “I have cut myself.”

    b. Sie   hat    mich           geküsst   .
       she   has    myself.IRREFL  kissed    .
       “She has kissed me.”

    c. Hab    mich           geschnitten   .
       have   myself.REFL    cut           .
       “Have cut myself.”
In 4 a) we would find the pronoun ich (I) which is in agree-
ment with mich (myself). We thus add a new feature RFLX.
In 4 b), the pronoun sie (she) does not agree with mich and
thus the feature value is set to null. Our new feature does
not fire in elliptical contexts (4 c) where the relevant infor-
mation is not present in the surface structure. To capture
these cases, we would need a morphological analysis of the
verb. However, the accuracy of morphological tools on in-
formal spoken language is not as good as on Standard Ger-
man text. We thus did not follow up on this approach but
left it to future work.
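The agreement check can be sketched as follows; the agreement table is deliberately simplified (and the search runs over the whole utterance rather than the clause), and the names are illustrative, not the actual feature code.

```python
# Sketch of the agreement feature for candidate reflexive forms: if another
# pronoun in the utterance agrees in person with the candidate, the feature
# fires (RFLX), otherwise it is set to null. The table is simplified.

AGREEING_PRONOUNS = {
    "mich": {"ich"}, "mir": {"ich"},
    "dich": {"du"}, "dir": {"du"},
    "uns": {"wir"}, "euch": {"ihr"},
    "sich": {"er", "sie", "es", "man"},
}

def reflexive_feature(tokens, i):
    """Return 'RFLX' if a person-agreeing pronoun occurs in the context."""
    cand = tokens[i].lower()
    if cand not in AGREEING_PRONOUNS:
        return None            # feature only defined for the ambiguous forms
    others = (t.lower() for j, t in enumerate(tokens) if j != i)
    if any(t in AGREEING_PRONOUNS[cand] for t in others):
        return "RFLX"
    return "null"

print(reflexive_feature(["Ich", "habe", "mich", "geschnitten", "."], 2))  # RFLX
print(reflexive_feature(["Sie", "hat", "mich", "geküsst", "."], 2))       # null
print(reflexive_feature(["Hab", "mich", "geschnitten", "."], 1))          # null, as in (4c)
```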
¹⁰ The STTS distinguishes 12 verb tags: V(V|A|M)INF (full/auxiliary/modal infinitives), V(V|A|M)FIN (full/auxiliary/modal finite verbs), V(V|A|M)PP (full/auxiliary/modal past participles), V(V|A)IMP (full/auxiliary imperatives), VVIZU (infinitive with zu).
3.2.3. Nouns vs. proper names
To see if we can further improve the accuracy for nouns and
proper names, we also add features extracted from Brown
clusters learned on unannotated data from Twitter.
3.2.4. Results
Table 2 gives results for the two-step approach. We ob-
served a modest improvement of 0.3% (0.5% when exclud-
ing punctuation) over our best baseline system. While these
numbers do not seem very impressive, the detailed results
for individual POS tags (Table 2) show that our new fea-
tures did increase accuracy for reflexive pronouns by more
than 11%. For infinite auxiliaries, the increase is also sub-
stantial with more than 12%. In addition, we observe a
small, but positive effect on most verb tags and also on the
identification of separated verb particles. The Brown clus-
ter features improved POS accuracy for proper names by
2.5%, showing that the hierarchical clustering adds com-
plementary information not already captured by the LDA
cluster features.
4. Error detection
Our POS accuracy, now in the range of 96-97%, is quite
good, considering that we are dealing with a non-canonical
variety of spoken language. However, as our goal is to build
a new resource for linguistic research, the remaining error
rate of 3-4% is still too high. Unfortunately, we do not have
the funds necessary for a complete manual correction of
the whole corpus, let alone for double annotation.
We thus have to find efficient ways to identify errors in the
tagger output and to correct these.
In this section, we describe our approach to automatic er-
ror detection where we use the predictions of the different
ensemble taggers (Section 3.1.) to identify tagging errors.
4.1. Related work
Most work on (semi-)automatic POS error detection has fo-
cussed on identifying errors in POS assigned by human
annotators where variation in word-POS assignments in
the corpus can be caused either by ambiguous word forms
which, depending on the context, can belong to different
word classes, or by erroneous annotator decisions (Eskin,
2000; van Halteren, 2000; Květoň and Oliva, 2002; Dickinson and Meurers, 2003; Loftsson, 2009).
The variation n-gram algorithm (Dickinson and Meurers,
2003) allows users to identify potentially incorrect tagger
predictions by looking at the variation in the assignment of
POS tags to a particular word ngram. The algorithm pro-
duces a ranked list of varying tagger decisions that have
to be processed by a human annotator. Potential tagger
errors are positioned at the top of the list. Later work
(Dickinson, 2006) extends this approach and explores the
possibilities of automatic correction of the detected errors.
Eskin (2000) describes a method for error identification us-
ing anomaly detection. Anomalies in this approach are de-
fined as elements coming from a different distribution than
the one in the data at hand.
Květoň and Oliva (2002) present an approach to error detection
based on a semi-automatically compiled list of impossible
ngrams. Instances of these ngrams in the data are
assumed to be tagging errors.
        tokens   candidates   true errors   all errors   % found
train   66,024   4,120        986           1,840        53.6
dev     16,530   1,228        267           437          61.1
test    20,472   1,797        558           788          70.8

Table 3: Number of error candidates identified by disagreements in the
ensemble tagger predictions (true errors among the candidates, total number
of tagging errors in the data, and the percentage of all errors found)
Loftsson (2009) evaluates different methods for error de-
tection, using the method of Dickinson and Meurers (2003)
as well as an ensemble of five POS taggers, showing that
both approaches allow for the successful identification of
POS errors and increase tagging accuracy.
All these approaches are tailored towards identifying hu-
man annotation errors and cannot be applied to our setting,
where we have to detect systematic errors made by auto-
matic POS taggers. Thus, we cannot rely on anomalies or
impossible ngrams in the data, as the errors made by the
taggers are consistent and, furthermore, our corpus of non-
canonical spoken language includes many structures which
are considered impossible in Standard German.
Rocio et al. (2007) address the problem of finding sys-
tematic errors in POS tagger predictions. Their method is
based on a modified multiword unit extraction algorithm
that extracts cohesive sequences of tags from the corpus.
These sequences are then sorted manually into linguisti-
cally sound ngrams and potential errors. This approach
hence focusses on correcting large, automatically anno-
tated corpora. It successfully identifies (a small number
of) incorrectly tagged high-frequency sequences in the text
which are often based on tokenisation errors. The more
diverse errors due to lexical ambiguity, which we have to
deal with in our data, however, are not captured by this ap-
proach.
4.2. Using tagger ensembles for error detection
We follow Loftsson (2009) and use the predictions of
the different ensemble taggers described above to iden-
tify POS errors in the corpus. We use the same train-
ing/development/test set split as described in Section 3. In
the training data, our tagger ensembles agree on 61,904 out
of 66,024 instances. For 4,120 tokens, the ensemble tag-
gers’ decisions diverge (Table 3). Out of these 4,120 in-
stances, 986 were in fact errors, which gives us an error
detection precision of 23.9%. For the development and test
data, the precision is 21.7 and 33.0, respectively. This is
somewhat higher than the precision of 16.6% reported by
Loftsson (2009) for the Icelandic tagger ensemble, mean-
ing that we have to look at a smaller number of instances to
correct the same amount of errors in our data.
The ensemble tagger approach succeeds in detecting more
than 50% of all errors in the data, with reasonable effort.
After manually correcting those instances, the POS tag ac-
curacy in the corpus increases up to 98.7% (development
set) and to 99.0% on the test set.
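The candidate extraction itself is straightforward; the sketch below flags every token for which the taggers’ predictions diverge, with made-up tagger names and tags for illustration.

```python
# Sketch of the error-candidate extraction used above: any token for which
# the ensemble taggers do not all agree is flagged for manual checking.

def disagreement_candidates(tokens, predictions):
    """predictions: dict mapping tagger name -> tag sequence for `tokens`.

    Returns (index, token, {tagger: tag}) triples where the taggers diverge.
    """
    candidates = []
    for i, tok in enumerate(tokens):
        tags = {name: seq[i] for name, seq in predictions.items()}
        if len(set(tags.values())) > 1:
            candidates.append((i, tok, tags))
    return candidates

tokens = ["gestern", "ich", "war", "Kudamm"]
preds = {"crf":    ["ADV", "PPER", "VAFIN", "NE"],
         "hunpos": ["ADV", "PPER", "VAFIN", "NN"],
         "tree":   ["ADV", "PPER", "VAFIN", "NE"]}
for i, tok, tags in disagreement_candidates(tokens, preds):
    print(i, tok, tags)   # only "Kudamm" is flagged for manual checking
```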
5. Increasing POS accuracy to over 99%
To attain our goal of creating a high-quality annotated cor-
pus, we follow a second approach to identifying POS errors
        tokens   candidates   true errors   all errors   resulting acc.
train   66,024   7,472        505           854          99.5
dev     16,530   2,022        108           213          99.4
test    20,472   2,104        66            207          99.3

Table 4: Correcting ambiguous word forms (true errors among the candidates,
total number of remaining tagging errors, and POS accuracy after correction)
in the tagger output. While our last approach relied on the
judgements of automatic taggers, this time we make use of
the manually annotated training data used to develop the
taggers.
From the training data, we extract word forms where our
best tagger frequently made mistakes. We use a threshold
of 5, meaning that we extract all word forms that have been
assigned an incorrect POS tag at least 5 times in the train-
ing data. This threshold can, of course, be adjusted accord-
ing to the quality requirements and resources available for
manual correction.
Setting the threshold to 5, we extract a list of 72 differ-
ent word forms from the training data. As we already cor-
rected those instances where the different taggers disagreed
in their judgements, we now only have to look at instances
where all five taggers predicted the same tag. This gives
us 7,472 instances for the training set and a bit more than
2,000 instances for the development and test set (Table 4).
This means that we have to manually check around 10% of
all instances in the different sets, which can be done quite
efficiently by providing the annotators with a tool that high-
lights these instances, sorted by word form.
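A sketch of this heuristic is given below; the variable names and data layout are assumptions for illustration, and the threshold corresponds to the value of 5 used above.

```python
# Sketch of the frequency-based heuristic described above: collect word forms
# that the best tagger mistagged at least `threshold` times in the training
# data, then flag all corpus instances of those forms for manual checking
# (restricted to tokens on which the ensemble taggers agreed).
from collections import Counter

def frequent_error_forms(tokens, gold, predicted, threshold=5):
    """Word forms mistagged at least `threshold` times in the training data."""
    errors = Counter(tok.lower()
                     for tok, g, p in zip(tokens, gold, predicted) if g != p)
    return {form for form, n in errors.items() if n >= threshold}

def candidates_for_checking(tokens, ensemble_agreed, suspicious_forms):
    """Indices of agreed-on tokens whose word form is on the suspicious list."""
    return [i for i, tok in enumerate(tokens)
            if ensemble_agreed[i] and tok.lower() in suspicious_forms]
```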
Most of the instances are, in fact, correct. Only around
3-7% of these error candidates are real POS errors. How-
ever, after applying this simple heuristic, the overall POS
accuracy in the corpus increases up to 99.5% (training set),
99.3% (development set) and 99.3% (test set).
These numbers are achievable if the annotators are well
trained and always assign the correct POS tag. This as-
sumption is, of course, overly optimistic. However, our
inter-annotator agreement of 0.975 (Fleiss’ κ) for three hu-
man annotators on a subset of the corpus showed that POS
annotation on such informal spoken language can be done
with good reliability, and thus potential annotator errors are
not expected to have a crucial impact on the final POS ac-
curacy in the corpus.
6. Conclusions and Future Work
We presented KiDKo, a new, POS annotated corpus for in-
vestigations of informal youth language and of language
variation in monolingual and multilingual urban settings.
Release 1.0 of the corpus includes the transcriptions, a nor-
malisation layer and POS annotations, as well as the tran-
scription and translation of Turkish language material from
code-mixing and -switching. The corpus will be made
freely available for research purposes.
In future work, we will augment the corpus with a shallow
syntactic analysis and topological field information.
7. Acknowledgements
This work was supported by a grant from the German Research
Foundation (DFG) awarded to SFB 632 “Information Structure” of
Universität Potsdam, Humboldt-Universität zu Berlin and Freie
Universität Berlin, Project B6: “The
Kiezdeutsch Korpus (KiDKo)”. We acknowledge the work
of our transcribers and annotators, Anne Junghans, Banu
Hueck, Charlotte Pauli, Emiel Visser, Franziska Rohland,
Jana Kiolbassa, Julia Kostka, Marlen Leisner, Nadine Lest-
mann, Nadja Reinhold, Oli Bunk, and Sophie Hamm. Ad-
ditional researchers involved in the first phase of the project
were Ulrike Freywald, Tiner Özçelik, and Katharina Mayr,
who contributed to gathering the linguistic material, com-
piling the corpus data, and organising first transcriptions.
We would also like to thank the anonymous reviewers for
helpful comments.
8. References
Auer, P. (2013). Ethnische Marker im Deutschen zwis-
chen Varietät und Stil. In Deppermann, A., editor, Das
Deutsch der Migranten [IDS Yearbook 2012], pages 9–
40. Berlin, New York: de Gruyter.
Brants, S., Dipper, S., Hansen, S., Lezius, W., and Smith,
G. (2002). The TIGER treebank. In Proceedings of the
First Workshop on Treebanks and Linguistic Theories,
pages 24–42.
Brill, E. and Wu, J. (1998). Classifier combination for
improved lexical disambiguation. In Proceedings of the
36th Annual Meeting of the Association for Computa-
tional Linguistics and 17th International Conference on
Computational Linguistics - Volume 1, ACL ’98.
Brill, E. (1992). A simple rule-based part of speech tagger.
In 3rd conference on Applied natural language process-
ing (ANLC’92), Trento, Italy.
Chrupała, G. (2011). Efficient induction of probabilistic
word classes with LDA. In Proceedings of 5th Interna-
tional Joint Conference on Natural Language Process-
ing, pages 363–372, Chiang Mai, Thailand, November.
Asian Federation of Natural Language Processing.
Dickinson, M. and Meurers, D. W. (2003). Detecting er-
rors in part-of-speech annotation. In 10th Conference of
the European Chapter of the Association for Computa-
tional Linguistics (EACL-03).
Dickinson, M. (2006). From detecting errors to automati-
cally correcting them. In Annual Meeting of The Euro-
pean Chapter of The Association of Computational Lin-
guistics (EACL-06), Trento, Italy.
Drach, E. (1937). Grundgedanken der Deutschen Sat-
zlehre.
Eskin, E. (2000). Automatic corpus correction with
anomaly detection. In 1st Conference of the North Amer-
ican Chapter of the Association for Computational Lin-
guistics (NAACL), Seattle, Washington.
Fitschen, A. (2004). Ein computerlinguistisches Lexikon
als komplexes System. Ph.D. thesis, Institut für Maschinelle
Sprachverarbeitung der Universität Stuttgart.
Hirschmann, H., Doolittle, S., and Lüdeling, A. (2007).
Syntactic annotation of non-canonical linguistic struc-
tures. In Proceedings of Corpus Linguistics 2007, Birm-
ingham, UK.
Höhle, T. (1998). Der Begriff ”Mittelfeld”, Anmerkungen über die
Theorie der topologischen Felder. In Akten des Siebten Internationalen
Germanistenkongresses 1985, pages 329–340, Göttingen, Germany.
Květoň, P. and Oliva, K. (2002). (Semi-)Automatic detection of errors
in PoS-tagged corpora. In 19th International Conference on
Computational Linguistics (COLING-02).
Loftsson, H. (2009). Correcting a POS-tagged corpus us-
ing three complementary methods. In Proceedings of the
12th Conference of the European Chapter of the ACL
(EACL 2009), Athens, Greece, March.
Màrquez, L., Rodríguez, H., Carmona, J., and Montolio, J. (1999).
Improving POS tagging using machine-learning techniques. In Proceedings
of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural
Language Processing and Very Large Corpora, pages 53–62.
Okazaki, N. (2007). CRFsuite: a fast implementation of
Conditional Random Fields (CRFs).
Rehbein, I. and Schalowski, S. (To appear). STTS goes
Kiez – Experiments on annotating and tagging urban
youth language. Journal for Language Technology and
Computational Linguistics.
Reznicek, M., Walter, M., Schmidt, K., Lüdeling, A., Hirschmann, H.,
Krummes, C., and Andreas, T. (2010). Das Falko-Handbuch: Korpusaufbau
und Annotationen. Institut für deutsche Sprache und Linguistik,
Humboldt-Universität zu Berlin, Berlin.
Rocio, V., Silva, J., and Lopes, G. (2007). Detection
of strange and wrong automatic part-of-speech tagging.
In Proceedings of the 13th Portuguese Conference on Progress in
Artificial Intelligence, EPIA’07.
Schiller, A., Teufel, S., and Thielen, C. (1999). Guidelines
für das Tagging deutscher Textkorpora mit STTS. Technical report,
Universität Stuttgart, Universität Tübingen.
Schmid, H. (1995). Improvements in part-of-speech tag-
ging with an application to German. In ACL SIGDAT-
Workshop.
Schmidt, T. (2012). EXMARaLDA and the FOLK tools.
In The 8th International Conference on Language Re-
sources and Evaluation (LREC-12), Istanbul, Turkey.
Selting, M., Auer, P., Barden, B., Bergmann, J.,
Couper-Kuhlen, E., Günthner, S., Quasthoff, U., Meier, C.,
Schlobinski, P., and Uhmann, S. (1998).
Gesprächsanalytisches Transkriptionssystem (GAT).
Linguistische Berichte, 173:91–122.
Søgaard, A. (2010). Simple semi-supervised training of
part-of-speech taggers. In Proceedings of the ACL 2010
Conference Short Papers, ACLShort ’10, pages 205–
208, Stroudsburg, PA, USA. Association for Computa-
tional Linguistics.
Toutanova, K. and Manning, C. D. (2000). Enriching the
knowledge sources used in a maximum entropy part-of-
speech tagger. In Proceedings of the conference on Em-
pirical methods in natural language processing and very
large corpora, EMNLP ’00, Hong Kong.
van Halteren, H. (2000). The detection of inconsistency in
manually tagged text. In Proceedings of the COLING-
2000 Workshop on Linguistically Interpreted Corpora,
Centre Universitaire, Luxembourg, August.
Wiese, H., Freywald, U., Schalowski, S., and Mayr,
K. (2012). Das KiezDeutsch-Korpus. Spontansprach-
liche Daten Jugendlicher aus urbanen Wohngebieten.
Deutsche Sprache, 2(40):97–123.
Wiese, H. (2009). Grammatical innovation in multiethnic
urban Europe: New linguistic practices among adoles-
cents. Lingua, 119:782–806.
Wiese, H. (2013). What can new urban dialects tell
us about internal language dynamics? The power of
language diversity. In Abraham, W. and Leiss, E.,
editors, Dialektologie in neuem Gewand. Zu Mikro-
/Varietätenlinguistik, Sprachenvergleich und Universal-
grammatik, number 19, pages 207–245.
Zeldes, A., Ritz, J., Lüdeling, A., and Chiarcos, C. (2009).
ANNIS: A search tool for multi-layer annotated corpora.
In Corpus Linguistics 2009.