Norm clusters in written Norwegian
Helge Dyvik
University of Bergen
Written Norwegian comes in two standards, Bokmål and Nynorsk, each
allowing much variation in stem forms and inflections. Actual texts show
intricate interdependencies among the alternative options. This chapter reports
on the use of correspondence analysis and implicational analysis applied to a
newspaper corpus in order to chart the patterns of variation. The aim is to test
the utility of the method itself, and the study therefore concentrates on a small
selection of phenomena. The long-term aim of a future extended project is to
monitor actual usage in relation to the official standards, thereby identifying
possible emerging subvarieties. This will be of value in the maintenance of
the official written standards. The results indicate that the approach can yield
valuable insights.
1. Introduction¹
The two written standards for Norwegian, Bokmål and Nynorsk, are linguistically
exceptional, not only because there are two of them, but also because each of them
allows an unusual amount of variation among alternative stem forms² and
inflectional endings. This is a result of the official language policy of the 20th century,
where the original idea was to pave the way for the merger of the two standards by
allowing extensive variation while excluding many traditional forms. This policy
failed and has been abandoned, and much of the variation that was allowed in
the official norms, but never taken up in actual usage, has been removed. But
still the policy has left its mark on the language as it is actually written, in the
form of a fair amount of variation. The variation is correlated in rather complex
1. I am grateful to the Norwegian Language Council and the Faculty of Humanities,
University of Bergen, for funding the research reported here. About 90% of the funding came from
the Language Council.
2. The variations in stem forms, such as hånd vs. hand ‘hand’, are frequently referred to as
‘spelling variants’. This is slightly misleading since they are not alternative ways of spelling the
same sequence of phonemes, but rather capture alternative spoken variants.
ways with stylistic and sociolinguistic variables, e.g. on a scale from ‘traditional’
or ‘moderate’ to ‘radical’, or on a scale from ‘high style’ to ‘folksy’. The choice
between those two scales for classification may also be politically loaded, since
some would claim that ‘radical’ forms should not be seen as confined to ‘folksy’
style. Also, the stylistic perceptions change over time; thus, to some the stylistic
effect of extensive use of ‘radical’ forms in Bokmål today (e.g. extensive feminine
gender), at least in texts of an abstract, discursive nature, may be to make the
text look quaint and old-fashioned rather than modern, reminiscent of the
social-democratic, reform-optimistic fifties.
Present language policy is to base the official norms for Bokmål and
Nynorsk, not on a future goal, but on developments in observed usage, which
hence needs to be systematically examined. One important difference between
the official, prescribed norms and the operative norms of actual usage is that
while the former impose few constraints on how alternative forms are combined
in a text, actual usage displays intricate patterns of dependency among the
alternatives.³ These dependencies frequently seem to take the form of unilateral or
bilateral implications. For example, one dimension of variation in Bokmål
concerns the past and past participle endings in a subclass of weak verbs, which
may be either -et (traditional, moderate) or -a (radical): kastet or kasta ‘threw’,
‘thrown’. Similarly, while Bokmål has three grammatical genders, a two-gender
option is possible, since all feminine nouns may optionally (and individually)
be masculine. Apart from agreement, gender is reflected in the ending of the
definite singular. Thus a noun like hytte ‘cabin’ in the definite singular may have
either the form hytten masc. (traditional, moderate) or hytta fem. (radical, or in
the case of this word, neutral).⁴ A Bokmål dictionary describes these options on
a word-by-word level, thereby imposing no constraints on the combination of
choices across lemmas in a text. However, in actual usage the choice of a-forms
3. In practical teaching of the norm, however, there tends to be a strong emphasis on
conveying information about a number of such dependencies, in order to achieve consistency
within a subnorm.
4. Since gender by definition is a matter of agreement, the use of the ending -a does not
imply feminine gender with absolute necessity. There is a common variety of Bokmål where -a
occurs on some nouns, while there is no masculine/feminine distinction in the agreeing forms
(e.g. the indefinite article en(m) in en jente ‘a girl’ occurs together with jenta ‘the girl’), and
constructions that would force such agreement are avoided (e.g. postposed possessive in phrases
like jenta mi(f) ‘my girl’, min(m) jente with preposed possessive being used instead). In this
variety the -en/-a variation is simply a case of allomorphy within common gender. However,
this fact does not reduce the value of the -en/-a variation as an indicator of subnorm. For
simplicity, we will continue to refer to the -a-forms as ‘feminines’ in this chapter.
of weak verbs is generally perceived as more radical than the choice of feminine
forms of most nouns. A pattern resulting from this is that a text using the form
kasta will very probably use the feminine form hytta as well, while hytta is
perceived as stylistically compatible with both kasta and kastet. If this turns out to
be the actual pattern, then there is a unilateral implication between kasta and
hytta: in the domain of Bokmål texts, kasta implies hytta, but not vice versa.
Conversely, the more traditional masculine hytten will then imply the
traditional kastet, but not vice versa. In a pattern like this, then, the implying,
mutually exclusive, forms – kasta and hytten – emerge as more ‘marked’ or ‘special’,
while the implied forms – kastet and hytta – emerge as more ‘unmarked’ or
‘neutral’, typical of more texts.
The research questions addressed in the present chapter concern the extent
to which the form choices in actual texts display implicational patterns of this
kind, and, as a corollary, the extent to which more or less well-defined
subvarieties of Bokmål and Nynorsk may be identified.⁵ Among earlier attempts to specify
the properties of subvarieties of Bokmål is the work of Koenraad De Smedt and
Victoria Rosén (Rosén & De Smedt 2000, De Smedt & Rosén 2000) in
connection with the SCARRIE proofreading project. Their experimental system allowed
users to choose among five subvarieties, thus constraining the set of proposals
made by the system. While De Smedt and Rosén’s individuation of subvarieties
was informant-based, the aim of the present project is to derive information about
subvarieties from corpus data.
2. A pilot study
A ‘morphosyntactic word’ (MSW) is taken to be a combination of a lemma form
and a set of values of morphological categories which the lemma can realize. Thus,
〈hytte +Noun +Def +Sg〉 is the MSW which can be realized as either hytta or
hytten in Bokmål. A MSW with more than one possible realization may be called
a ‘variable MSW’. Each variable MSW may be seen as a ‘dimension’ along which
texts may vary (the ordering of forms within each dimension is not important –
it may be taken to be alphabetical). The official norms for Bokmål and Nynorsk
(in conjunction with possible non-official forms in actual usage) will then define
a multidimensional space, and a given text will be located as a point within the
5. The phenomena studied here are confined to cases of variation in inflectional endings,
although the approach is easily extended to form variation in general, such as the variation
between the stem forms hånd and hand ‘hand’, etc.
subspace defined by the variable MSWs occurring within it (or as a set of points if
at least one variable MSW is inconsistently realized).⁶
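The notion of a text as a point (or a set of points) in the space of variable MSWs can be made concrete with a minimal sketch. The data structures, the function name `locate` and the two-MSW inventory below are illustrative assumptions, not part of the project's actual tooling.

```python
from itertools import product

# Hypothetical toy inventory: each variable MSW (one dimension) with its
# possible realizations. The tuples mimic the <lemma +Cat +Feat> notation.
HYTTE = ("hytte", "Noun", "Def", "Sg")
KASTE = ("kaste", "Verb", "Past")
VARIABLE_MSWS = {
    HYTTE: ("hytta", "hytten"),
    KASTE: ("kasta", "kastet"),
}

def locate(text_forms):
    """Return (dimensions, points) for one text.

    text_forms maps each variable MSW occurring in the text to the set of
    forms the text actually uses. A consistent text occupies one point;
    a text realizing some MSW inconsistently occupies several.
    """
    dims = sorted(text_forms)  # fix an order for the dimensions
    for msw in dims:
        unknown = set(text_forms[msw]) - set(VARIABLE_MSWS[msw])
        if unknown:
            raise ValueError(f"forms {unknown} do not realize {msw}")
    # One coordinate per dimension; the Cartesian product enumerates
    # every combination of the forms found in the text.
    points = set(product(*(sorted(text_forms[m]) for m in dims)))
    return dims, points
```

A text using only hytta but both kasta and kastet would, under this sketch, occupy two points, differing only along the kaste dimension.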
Since the official norms involve few constraints on the combination of
possible forms within a text, they suggest as a null hypothesis that texts will spread out
approximately evenly throughout the space. However, experience tells us that this
is not the case, as indicated above. To the extent that implicational relations of the
exemplified kind hold, certain combinations of forms will not occur, thus leaving
the space empty in certain regions, e.g. in the region where the forms hytten and
kasta are combined. Consequently our expectation is that the texts will tend to
form more or less clearly defined clusters within the space. I will refer to such
clusters as ‘norm clusters’. Relatively clearly defined norm clusters within the total
text universe will then approach the status of emerging subvarieties of Bokmål
and Nynorsk. Hence empirically based charting of norm clusters will provide
important information for further work towards the standardization of written
Norwegian.
I have conducted a pilot study based on newspaper texts in order to
investigate the viability of an approach along the lines suggested. In the preliminary
investigation reported here I concentrate on a small set of linguistic
phenomena, subjecting the extracted material to two kinds of analysis: correspondence
analysis in order to visualize the clustering of forms and texts, and a more direct
analysis of the implicational relations among the forms, visualized as directed
graphs.
3. The texts
The pilot study is based on texts extracted from the web editions of four
Bokmål and six Nynorsk newspapers.⁷ Two newspapers, Bergens Tidende and
6. A multidimensional space may be an unaccustomed concept to some readers. Since three
dimensions are as many as we can easily visualize, we may exemplify with an imagined cube
c whose breadth dimension has the points handa – handen – hånda – hånden, whose depth
dimension has the points framtida – framtiden – fremtida – fremtiden, and whose height
dimension has the points kona – konen. A text with the forms framtida, hånda, kona will then
correspond to a given point within c, a text with fremtiden, hånden, kona to another point, etc.
while we will expect certain other locations in c not to be filled by any texts, e.g. the location
of the combination framtida, handen, konen.
7. I am indebted to Knut Hofland, Uni Digital, for collecting, structuring and
morphologically tagging the corpus, and for organizing it into texts according to the criteria described
below.
Klassekampen, occur in both lists because they contain articles in both language
varieties.⁸ The Bokmål newspapers are:
– Aftenposten, an Oslo-based daily newspaper of nation-wide distribution, and
one of Norway’s two largest newspapers in terms of circulation.
– Bergens Tidende, a Bergen-based daily newspaper and the major newspaper in
Western Norway, with about 35% of Aftenposten’s circulation.
– Dagbladet, an Oslo-based daily newspaper of nation-wide distribution and
about 45% of Aftenposten’s circulation.
– Klassekampen, an Oslo-based daily newspaper of nation-wide distribution
and around 5% of Aftenposten’s circulation.
The Nynorsk newspapers are:
– Bergens Tidende (see above).
– Dag og Tid, an Oslo-based weekly newspaper of nation-wide distribution and
around 3% of Aftenposten’s circulation.
– Hallingdølen, a local newspaper published in Ål in Hallingdal, with about 4%
of Aftenposten’s circulation.
– Klassekampen (see above).
– Nationen, an Oslo-based daily newspaper of nation-wide distribution and
around 6% of Aftenposten’s circulation.
– Sogn Avis, a local newspaper published in Leikanger in Sogn, with about 4%
of Aftenposten’s circulation.
The higher number of Nynorsk newspapers was chosen in order to compensate to
some extent for the smaller text volume in the Nynorsk sources. According to the
criteria specified below a selection of 55,000 Bokmål articles and 35,000 Nynorsk
articles was made, yielding a Bokmål corpus of 22 million words and a Nynorsk
corpus of 12 million words. The articles are taken from the period between the
years 2000 and 2009, but the overwhelming majority is from 2008 and 2009.
As the aim of the project is to study the clustering of texts in the space
defined by the variable MSWs, an appropriate ‘text’ concept had to be defined.
The individual newspaper article is too small to yield sufficient data for
informative comparison with other texts. Therefore, based on the assumption that the
two most important determinants of the form options chosen are the
journalist/author and the newspaper, a ‘text’ is defined for present purposes as the sum of
8. This is also true of Nationen, although only the Nynorsk parts have been sampled in this
project.
what a given author has written in a given newspaper. For each newspaper (or
for the Bokmål & Nynorsk parts of the paper, respectively, in the case of
newspapers using both Bokmål & Nynorsk) the 20 most productive authors were then
selected, yielding a text inventory of 80 Bokmål and 120 Nynorsk texts. These
are the texts that constitute the corpus of 22 + 12 = 34 million words mentioned
above. In the graphs each text is coded by an index identifying the author, with
the initials of the newspaper suffixed. Thus, ‘71KK’ is the text consisting of what
author no. 71 has written in the Klassekampen part of the corpus. The texts were
automatically tagged with lemma forms and morphological categories by means
of the Oslo-Bergen Tagger.⁹
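The ‘text’ concept just defined (four Bokmål parts × 20 authors = 80 texts, six Nynorsk parts × 20 authors = 120 texts) might be sketched as follows. The record fields `paper`, `variety`, `author` and `tokens`, and the choice of pooled token count as the measure of ‘most productive’, are assumptions made for illustration; the chapter does not state how productivity was measured.

```python
from collections import defaultdict

def build_texts(articles, top_n=20):
    """Pool articles into 'texts': all tokens by one author in one
    newspaper part (newspaper + language variety), keeping only the
    top_n most productive authors of each part.

    `articles` is an iterable of dicts with the hypothetical keys
    'paper', 'variety', 'author' and 'tokens' (a list of word forms).
    """
    pooled = defaultdict(list)
    for art in articles:
        key = (art["paper"], art["variety"], art["author"])
        pooled[key].extend(art["tokens"])

    # Rank authors within each newspaper part by pooled token count
    # (assumption: productivity = number of tokens written).
    by_part = defaultdict(list)
    for (paper, variety, author), tokens in pooled.items():
        by_part[(paper, variety)].append((len(tokens), author))

    texts = {}
    for (paper, variety), ranked in by_part.items():
        for _, author in sorted(ranked, reverse=True)[:top_n]:
            texts[(author, paper, variety)] = pooled[(paper, variety, author)]
    return texts
```

Under this sketch a key such as `(71, "KK", "bm")` plays the role of the text code ‘71KK’.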
4. Data extraction and processing
The lexical database Norsk Ordbank, in which (officially and non-officially)
possible inflectional forms are registered, allows the automatic identification of
variable MSWs in a text. Based on this information, for each phenomenon studied
all forms of variable MSWs in each text – i.e. all word forms that could have been
different according to Norsk Ordbank – were registered in a table with the text ids
along one axis and the forms along the other.¹⁰ Distances between texts based on
the differences between their rows of forms, and distances between forms based
on the differences between their columns of texts, could now be calculated by
means of correspondence analysis, of which the implementation in the program
R (Baayen 2008: 128 ff.) was used, or more precisely the method corres.fnc in the
module languageR.¹¹ In correspondence analysis the row and column maps (i.e.
the plottings of texts and word forms, respectively) are superimposed on each
other. The result is represented as a common multidimensional space in which
both texts and word forms are distributed. The necessarily two-dimensional
diagrams below show the projection of this space on its two most informative
dimensions, i.e. the two dimensions responsible for most of the information
about distances within the multidimensional space. The most common forms,
typical of most texts, will tend to occur close to the centre of the diagram, while
more rare forms tend to occur more peripherally, and near opposite edges to
9. See Johannesen et al. this volume.
10. Second elements of compounds that are listed in the lexical resources are also included
in the data.
11. I am indebted to Øystein Reigem, Uni Digital, and Christer Johansson, University of
Bergen, for help in carrying out the correspondence analysis.
the extent that they tend to be mutually exclusive. Therefore, as we move from a
peripheral form towards the centre, the sequence of forms which we pass on our
way roughly suggests an implicational relation holding between peripheral, rare
forms and more central and widespread forms, which are ‘implied’ in the sense
that choice of a more peripheral form (usually) implies the choice of a more
central form along the same line, but not vice versa.
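For readers who want to see what such an analysis computes, the following is a minimal sketch of textbook correspondence analysis via a singular value decomposition. It is not the corres.fnc implementation used in the study; the function name, the use of NumPy and the toy counts are assumptions made for illustration.

```python
import numpy as np

def correspondence_analysis(counts):
    """Textbook CA of a texts-by-forms count table.

    Returns principal coordinates for rows (texts) and columns (forms)
    and the share of total inertia carried by each dimension.
    Assumes no row or column of the table sums to zero.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (texts)
    c = P.sum(axis=0)                    # column masses (forms)
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sing) / np.sqrt(r)[:, None]      # text coordinates
    cols = (Vt.T * sing) / np.sqrt(c)[:, None]   # form coordinates
    inertia = sing ** 2
    return rows, cols, inertia / inertia.sum()

# Hypothetical counts: rows are texts, columns are the forms
# hytta, hytten, kasta, kastet.
toy = [[10, 0, 8, 1],
       [9, 1, 7, 0],
       [0, 12, 1, 9],
       [1, 10, 0, 11]]
rows, cols, share = correspondence_analysis(toy)
```

Plotting the first two columns of `rows` and `cols` in one diagram gives the superimposed map described above, and `share[0] + share[1]` corresponds to the percentage of information reported for the two plotted dimensions.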
Several caveats are in order here. In the first place, it is important to bear in
mind that much information about distances may be hidden in non-projected
dimensions, somewhat like the actual distances between the visible stars in a
night sky. In the second place, the dimensions of the space calculated are the
result of the correspondence analysis’ attempt to structure the information
about distances as simply as possible, which means that the dimensions do not
correspond exactly to the imagined dimensions defined by the MSWs, as briefly
discussed in 2 above. In the third place, and as a corollary of this, the clustering
of forms and texts will not only be influenced by the strictly linguistic
parameters of choice among a set of alternative inflectional forms, but also by
corresponding vocabulary: texts with a shared topic, and hence a shared vocabulary,
will tend to be closer together than texts with different topics, other things being
equal. However, the latter problem is to some extent counteracted by limiting
the analysis to high-frequency items.
Since the positions of the forms in the correspondence analysis diagrams only
give rough hints about implicational relations among the forms, primarily because
differences in vocabulary across texts also influence the plots of texts and forms,
some of the implicational relations (IRs) have in addition been subjected to a more
precise analysis whose results are represented as directed graphs.¹² The IR-analysis
is based on the following normalizations and assumptions:
1. Quite a few texts are ‘inconsistent’ in the sense that for some MSW, more than
one of its alternative forms occur in the text. This may partly be the result
of tagging errors, partly the result of quotations in the text, and partly real
inconsistencies. In such cases the alternatives tend to have clearly different
frequencies of occurrence, and I have therefore made the simplifying move of
disregarding the form with the lowest frequency of occurrence in such cases,
classifying the texts according to the most frequent form expressing a given
MSW. Thus, an ‘a-text’ will be a text in which a is the most frequent expression
of the relevant MSW.
12. I am indebted to Øystein Reigem, Uni Digital, for implementing the implication analysis
and its graphics according to gradually developing specifications.
2. A form a is said to imply a form b if all a-texts which contain some form of
the MSW of which b is a possible expression, are also b-texts. For example,
consider MSW1 〈hytte +Noun +Def +Sg〉 (‘cabin’) with the possible
forms hytten masc. and hytta fem., and MSW2 〈gate +Noun +Def +Sg〉
(‘street’) with the possible forms gaten masc. and gata fem. hytten is said
to imply gaten if and only if all hytten-texts containing MSW2 are also
gaten-texts. This is compatible with the converse not holding, i.e. that some
gaten-texts which also contain MSW1 are hytta-texts. If so, the implication is
unilateral; if not, it is bilateral.
3. I assume that the implicational relation is transitive, i.e. that if a implies b and
b implies c, then a implies c. This extends the relation to form pairs which
never occur together in the same text in the corpus, and it is hence somewhat
risky and probably leads to some spurious properties of the graphs, in
particular given the sparseness of some of the relevant data. Nevertheless I find
reason to assume that the graphs generally give a valid picture of the situation.
It should be stressed that these normalizations and assumptions apply to the
IR-analysis only, and not to the correspondence analysis.
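The three points above can be sketched in code as follows. This is an illustrative reconstruction, not the implementation used for the graphs; the function names and the nested-dictionary input are assumptions, as is the requirement that at least one a-text must contain the MSW of b before a direct implication is recorded (the text does not say how the vacuous case is treated).

```python
from collections import Counter
from itertools import product

def classify(texts):
    """Point 1: represent each MSW in a text by its most frequent form.
    `texts` maps text id -> MSW -> {form: count} (hypothetical layout);
    ties are broken by insertion order."""
    return {tid: {msw: Counter(forms).most_common(1)[0][0]
                  for msw, forms in msws.items()}
            for tid, msws in texts.items()}

def implications(texts):
    chosen = classify(texts)
    msw_of = {form: msw for msws in texts.values()
              for msw, forms in msws.items() for form in forms}
    implied = set()
    # Point 2: a implies b iff every a-text containing b's MSW is a b-text.
    for a, b in product(msw_of, repeat=2):
        if msw_of[a] == msw_of[b]:
            continue
        relevant = [c for c in chosen.values()
                    if c.get(msw_of[a]) == a and msw_of[b] in c]
        # Assumption: require at least one co-occurrence for a direct link.
        if relevant and all(c[msw_of[b]] == b for c in relevant):
            implied.add((a, b))
    # Point 3: close the relation under transitivity.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(implied), repeat=2):
            if b == c and a != d and (a, d) not in implied:
                implied.add((a, d))
                changed = True
    return implied

# Hypothetical toy corpus of three texts.
toy_texts = {
    "1KK": {"hytte": {"hytta": 5}, "kaste": {"kasta": 3, "kastet": 1}},
    "2DB": {"hytte": {"hytta": 4}, "kaste": {"kastet": 6}},
    "3AP": {"hytte": {"hytten": 7}, "kaste": {"kastet": 2}},
}
```

On this toy corpus `implications(toy_texts)` yields exactly the two unilateral implications of the introductory example: kasta implies hytta, and hytten implies kastet. Pairs `(a, b)` and `(b, a)` both present would correspond to forms placed in the same oval of the graph.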
5. Case I: Feminine nouns in Bokmål
I first consider the choice of feminine vs. masculine gender¹³ for the set of Bokmål
nouns allowing this option. This set comprises all nouns that may be feminine in
the language, i.e. feminine gender is optional in Bokmål. From this set I select the
subset whose lemmas have a frequency of 500 or above in the 22 million word Bokmål
part of the corpus. This yields an inventory of 60 lemmas.
5.1 Correspondence analysis
Figure 1, in which the individual items are too small to be readable, is included
in order to show the shape of the ‘cloud’ resulting from a correspondence analysis
of the varying definite singular forms of the 60 nouns and the texts in which they
occur. The two maximally informative dimensions are shown in Figure 1, but
they give only 33.9% of the total information about distances in the
multidimensional space (x-axis: 21%, y-axis: 12.9%). Still, some clear patterns emerge. As later
figures zooming in on parts of the plot show, the radical feminine a-forms
distribute towards the left in the plot, while the moderate masculine en-forms distribute
13. See Footnote 4 above.
towards the right. Furthermore, with some overlaps we find the texts from the four
newspapers distributed from left to right in the order Klassekampen – Dagbladet –
Aftenposten – Bergens Tidende.
Figure 1. The ‘cloud’ displaying the distribution of masc. and fem. noun forms
The pointed shape of the ‘cloud’ towards the right is noteworthy. Dense
clusters occur when relatively many texts share a number of choices; this is the
intended meaning of the term ‘norm cluster’. Choices which less consistently
are correlated with other choices across texts lead to a more diffuse
distribution. A claim which is sometimes made is that moderate Bokmål is more of a
real, identifiable subvariety of Bokmål than radical Bokmål, which may be more
of an abstraction in the sense that it refers to the sum of possible departures
from moderate Bokmål without itself being a subvariety about which language
users tend to have consistent intuitions. The pointed shape towards the
moderate -en end of the plot, and the higher density within it, is compatible with such
a claim, although it must of course be borne in mind that the plot covers only
one phenomenon and considers only a limited number of texts. It should also
be noted that low frequency in itself typically leads to a more sparse distribution
because there will be fewer shared cooccurrence partners across texts among
low-frequency forms. This is therefore an alternative possible explanation of the
sparseness in the left-hand -a end of the plot.
Figure 2 is a close-up of the right-hand side of the plot, in which we find texts
from Bergens Tidende (BT), Aftenposten (AP) and a few from Dagbladet (DB).
Figure 2. The rightmost part of the plot in Figure 1
At the far right we see the forms jenten ‘the girl’ masc. and konen ‘the wife’/‘the
woman’ masc. These are special for Bergens Tidende and reflect a consistent
two-gender variety of Bokmål (which means that the gender here should properly
be called ‘common’ rather than ‘masculine’). As we move leftwards towards the
centre, we find the feminine alternative jenta among the first a-forms to show up.
kona and klokka ‘the clock’ also occur to the right of the centre, in the denser part
of the plot (but outside Figure 2; see, however, Figure 4). This suggests that konen
and jenten are marked forms at the top of the implicational scale: a text with
these forms will most probably have all the other relevant nouns in the masculine,
too. Within the denser norm cluster we find the a-forms jenta and kona
occurring together with mostly en-forms of other nouns, such as ulykken ‘the accident’,
etterforskningen ‘the investigation’, turneringen ‘the tournament’, finanskrisen ‘the
financial crisis’, døren ‘the door’, avisen ‘the newspaper’, natten ‘the night’, luften
‘the air’, kvinnen ‘the woman’ etc. The last noun is worth noticing, since it shows
that biological gender has limited relevance here. The nouns kvinne and kone
have different stylistic properties, and the feminine form kvinna is perceived as
markedly more radical than kona. The form kvinna occurs near the middle of the
bottom left quadrant in Figure 1. It should also be noted that the total frequency
of the form kvinna in the corpus is 14, as against 3,049 for kvinnen. The numbers
for the forms of ‘kone’ are 80 for konen and 934 for kona.
Figure 3 is a close-up of a region in the far left of the plot.
Figure 3. The leftmost part of the plot in Figure 1
[Labels visible in the figure: forms utviklinga, avisa, framtida, handa, høyresida, ordninga, venstresida, kirka, tida; texts 8KK, 77KK, 73KK]
In this part of the plot only the newspaper Klassekampen is represented.¹⁴ At
the left edge we expect to find marked forms, typical of relatively few texts, and
probably implying a-forms of other relevant nouns as well. However, the
sparseness of this part of the plot indicates that there is less consistency in the choice
of a-forms across the texts departing from the norm in the direction of -a than
there is in the choice of en-forms across the texts near the other edge (Figure 2).
The nouns in Figure 3 are ordninga ‘the arrangement’, venstresida ‘the (political)
left’, høyresida ‘the (political) right’, kirka ‘the church’, utviklinga ‘the development’,
tida ‘the time’, avisa ‘the newspaper’, framtida ‘the future’, handa ‘the hand’, where
the leftmost members tend to denote abstract concepts typical of texts of
discursive type, which is compatible with the common impression that these are the last
nouns to get the radical a-forms.
14. This does not preclude that the forms may occur in the other newspapers as well, but
other choices pull those texts further to the right in the plot.
Figure 4 is a close-up of the central region of the plot in Figure 1.
Figure 4. The central part of the plot in Figure 1
It is noticeable that the Klassekampen texts occur exclusively on the left-hand
side of the middle line, although also quite close to it, while the Aftenposten and
Bergens Tidende texts occur exclusively on the right-hand side. Only Dagbladet
distributes on both sides, placing the newspaper in the central region where we
expect to find the forms that are typical of most of the texts. However, the central
part of the plot is not the maximally dense part, indicating that the texts deviating
to the right are more consistent in their choices than the others are, thus forming
a norm cluster.
The nouns with a-forms in the central region are typically concretes or
other words characteristic of everyday language, such as skylda ‘the blame’, uka
‘the week’, hånda ‘the hand’, gata ‘the street’, natta ‘the night’, døra ‘the door’,
klokka ‘the clock’, and kona ‘the wife’. en-forms in Figure 4 comprise abstracts
typical of discursive prose stretching into the left part, such as venstresiden ‘the
(political) left’, utfordringen ‘the challenge’, sannheten ‘the truth’, utviklingen
‘the development’, makten ‘the power’, løsningen ‘the solution’, etterforskningen
‘the investigation’, høyresiden ‘the (political) right’, ordningen ‘the arrangement’
(corroborating the impression that the en-form is most persistent in such words
even in texts where other nouns get -a), and further to the right muligheten
‘the possibility’, pressen ‘the press’, behandlingen ‘the treatment’, undersøkelsen
‘the investigation’, nyheten ‘the piece of news’, framtiden ‘the future’, kirken ‘the
church’, stillingen ‘the position’, tiden ‘the time’, årsaken ‘the cause’, moren ‘the
mother’, and jakten ‘the hunt’.
en-forms even further to the left than the area shown in Figure 4, in the bottom
left quadrant, are historien ‘the history’, ytringsfriheten ‘the freedom of expression’,
teksten ‘the text’, virkeligheten ‘the reality’, offentligheten ‘the public sphere’, and
forestillingen ‘the performance’/‘the idea’. For ytringsfriheten, teksten and virkeligheten
the corresponding a-forms do not occur at all in the corpus, while the numbers of
occurrences in the other three cases are: forestillingen: 757, forestillinga: 7, historien:
2,540, historia: 16, offentligheten: 828, offentligheta: 2. The reason why the strongly
dominant en-forms in these cases still do not occur more centrally in the diagram is
probably related to vocabulary: these ‘intellectual’ concepts seem to be more typical
of the Klassekampen and Dagbladet journalists in their vicinity than they are of the
writers in Aftenposten and Bergens Tidende.
5.2 Implication analysis
As described in Section 4 above I have performed a more direct analysis of the
implicational relations between the form choices in the corpus as a whole,
disregarding the distribution across different newspapers. In the graphs that follow, the
forms which imply each other mutually according to the criteria in 4 are placed
within the same oval, while unilateral implications are marked with arrows between
ovals. The frequency of the forms is roughly indicated by the thickness of the oval
line, according to a logarithmic scale (log5). When the oval contains more than one
form, the thickness of the line indicates average frequency.
Figure 5 shows the bottom of the implicational graph, directly or indirectly
dominated by all other forms, both a-forms and en-forms. These forms, then,
emerge as the maximally ‘unmarked’ forms, expected to occur across the text
universe.
Figure 5. The bottom of the implicational graph with maximally ‘unmarked’ forms
The forms in Figure 5 are all en-forms, with the exception of kona. Also, many
of them denote abstracts typical of discursive prose.
As indicated in Section 4 the graph has a few spurious properties resulting
from the low frequency of some forms in combination with our assumption of
transitivity of the implicational relation. For example, the rare forms utfordringa
and konen occur high up in the graph (see Figures 7 and 8), and hence appear
to imply their alternatives utfordringen and kona in Figure 5, which obviously is
not actually the case. The same is the case with the forms utstillingen, treningen,
stillingen, kirken, kvinnen and datteren in Figure 5, whose alternatives also occur
higher up in the graph. I assume that less sparse data would have moved these
forms out of the bottom oval into positions dominated by only subsets of the rest
of the forms.
Figure 6 shows the left part of the graph, with kvinna ‘the woman’ at the top,
indicating its position as a strongly marked choice implying most other a-forms.
Unexpectedly it also dominates the en-forms natten ‘the night’ and tiden ‘the
time’ (which could have been seen more easily in the undivided version of the
graph divided up between Figures 6, 7 and 8 for reasons of space), which
indicates that the implicational relations found for low-frequency items (kvinna has
Figure 6. The left part of the implication graph
14 occurrences) must be taken with a grain of salt – there is exactly one text in
which the form kvinna cooccurs with natten, and the same is the case with kvinna
and tiden. Similar probably spurious implications occur with løsninga ‘the
solution’, utfordringa ‘the challenge’, stillinga ‘the position’, utstillinga ‘the exhibition’,
målinga ‘the measuring’, meldinga ‘the message’, treninga ‘the exercise’ and dattera
‘the daughter’ in Figure 7. For the more frequent and hence more reliable a-cases
in Figures 6 and 7 we notice that a-forms of more abstract words tend to imply
a-forms of more concrete or everyday words. Thus, makta ‘the power’, grensa ‘the
limit’/‘the border’, kirka ‘the church’, and further down venstresida ‘the left’,
høyresida ‘the right’, skylda ‘the blame’ and finanskrisa ‘the financial crisis’ dominate the
Figure 7. The middle part of the implication graph
more everyday concepts lua ‘the air’, avisa ‘the newspaper’, døra ‘the door’, gata
‘the street’, boka ‘the book’, framtida ‘the future’, and even further down uka ‘the
week’ and jenta ‘the girl’. Some of the exceptions to this pattern should probably be
attributed to stylistic properties associated with individual words. We may notice
that some of the a-forms of concretes that occur higher up and hence appear to
be more marked choices denote female humans, such as kvinna ‘the woman’ and
dattera ‘the daughter’. Another example is mora ‘the mother’, which does not occur
in the graph because it does not seem to cooccur with any of the other 59 nouns
considered.
Figures 7 and 8 show the en-forms, which, as expected, tend to display the same implicational hierarchy as the a-forms, but turned upside-down. On the en-side the everyday words are on top: if you choose the en-form of concretes like gaten ‘the street’, boken ‘the book’, døren ‘the door’, klokken ‘the clock’, avisen ‘the newspaper’, luften ‘the air’ etc., then you are likely to choose also the en-forms of more abstract nouns like venstresiden ‘the left’, høyresiden ‘the right’, finanskrisen ‘the financial crisis’, makten ‘the power’, grensen ‘the limit’/‘the border’, etc. At the very top (apart from the spurious løsninga) we find the rare forms jenten ‘the girl’ and konen ‘the wife’. It is interesting that nouns denoting female humans stand apart from other concretes, but in two diametrically opposed ways. There are two subclasses of them, each occurring high up in its own implicational hierarchy: those whose en-forms are clearly marked choices (jenten, konen), and those whose a-forms are clearly marked choices (kvinna, dattera, mora).
Figure 8. The right part of the implication graph
6. Case II: Weak verbs in Bokmål
The largest and most productive class of weak verbs in Bokmål has the alternative endings -et (traditional, moderate) and -a (radical) in past and past participle forms. Thus, the MSWs 〈kaste +Verb +Past〉 and 〈kaste +Verb +PastPart〉 both have the alternative forms kastet and kasta. I have registered all past and past participle forms of verbs of this class whose lemmas have a frequency of occurrence equal to or greater than 500 in the 22 million word Bokmål part of the corpus. This yields an inventory of 58 verbs.
Figure 9 shows the shape of the ‘cloud’ resulting from a correspondence analysis of this material projected on the two most informative dimensions, which are jointly responsible for 31% of the information about distances in the space (x-axis: 18.6%; y-axis: 12.4%).
Figure 9. The ‘cloud’ displaying the distribution of et- and a-forms of weak verbs
In the plot in Figure 9, unlike in the masculine/feminine case, there is no transitional area where the two form categories mingle. The oval encloses only et-forms (with the exception of the form rykka ‘moved quickly’ and a couple of peripheral forms mentioned below), while all the forms outside the oval are a-forms (except a few forms which are neither, for verbs which allow further options, such as lagde, past tense of lage ‘make’). This suggests that there may be less of an implicational hierarchy among the verbal et-forms or a-forms than there was in the case of the nouns: the tendency is to use either one or the other ending consistently, irrespective of verb. Still, the a-forms spread out much more sparsely than the et-forms. This sparseness is therefore probably not the result of less consistency in the choice of a-forms as against et-forms across the texts, but rather the result of the extremely low frequency of the a-forms. The strong tendency is that the a-forms have fewer than ten occurrences, while the et-forms have a three-digit number of occurrences in the corpus. The a-forms therefore cooccur with a much lower number of the other forms than do the et-forms, a circumstance which gives rise to a greater distance between them in the plot, since the a-forms will share fewer cooccurrence partners than the et-forms.
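The effect of low frequency on distances can be sketched with the chi-square distance between column profiles, which is the distance that correspondence analysis approximates in its plots. The table layout (texts as rows, forms as columns), the function name and all counts below are assumptions made for illustration; they are not corpus data.

```python
from math import sqrt

def chi2_distance(table, col_a, col_b):
    """Chi-square distance between two column profiles of a texts-by-forms count table.

    table: list of rows (one per text), each a list of counts (one per form).
    Every row and both columns must have a non-zero total.
    """
    total = sum(sum(row) for row in table)
    row_mass = [sum(row) / total for row in table]
    col_sum_a = sum(row[col_a] for row in table)
    col_sum_b = sum(row[col_b] for row in table)
    dist2 = 0.0
    for row, mass in zip(table, row_mass):
        # Profile value: the share of a form's occurrences that falls in this text.
        diff = row[col_a] / col_sum_a - row[col_b] / col_sum_b
        # Differences are weighted by the inverse of the text's mass.
        dist2 += diff * diff / mass
    return sqrt(dist2)

# Invented counts. Columns: kastet, hentet (frequent et-forms), kasta, henta (rare a-forms).
toy_table = [
    [10, 9, 0, 0],
    [12, 11, 0, 0],
    [8, 9, 0, 0],
    [11, 10, 0, 0],
    [1, 1, 2, 0],   # the only text with 'kasta'
    [1, 0, 0, 3],   # the only text with 'henta'
]

d_frequent = chi2_distance(toy_table, 0, 1)  # kastet vs hentet
d_rare = chi2_distance(toy_table, 2, 3)      # kasta vs henta
```

In this toy table the two rare a-forms occur in different texts and therefore end up far apart, whereas the two frequent et-forms are spread over the same texts in similar proportions and lie close together.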
With one exception, a single Dagbladet text, Klassekampen is the only newspaper whose texts occur outside the oval.¹⁵ Among the Klassekampen texts, only one is located near the upper left corner, as indicated by the leftmost arrow in Figure 9. The next Klassekampen text is located at the rightmost arrow. The forms in the sparse area from the left inwards are: jobba ‘worked’, laga ‘made’, bekrefta ‘confirmed’, handla ‘shopped’/‘acted’, henta ‘fetched’, snakka ‘talked’, venta ‘waited’/‘expected’, samla ‘collected’, overraska ‘surprised’, mista ‘lost’, endra ‘changed’, regna ‘calculated’/‘rained’, sikra ‘secured’, flytta ‘moved’, varsla ‘notified’, fjerna ‘removed’. Thus the indications are that one single writer among the 80 Bokmål writers is mainly responsible for this a-form protuberance from the central area in the plot.
Figure 10 shows the central and densest part of the plot in Figure 9.¹⁶ All four newspapers are densely represented in this region (although with Klassekampen near the edge of the et-area), indicating that the et-forms constitute a clear norm cluster for this corpus, and the only cluster in the verbal -et/-a space.
As shown in Figure 9, the et-forms (and two a-forms) also have a protuberance into the top right quadrant. The three newspapers Aftenposten, Bergens Tidende and Dagbladet are all represented here, but not Klassekampen. Apparently this departure from the main cluster is not explained by the choice of inflectional forms, but by vocabulary. The forms from the top down are: scoret, scora ‘scored’, mista ‘lost’, trent ‘exercised’, rykte ‘moved quickly’,¹⁷ byttet ‘changed’, laget ‘made’, reddet ‘saved’, sørget ‘secured’/‘grieved’, virket ‘worked’/‘seemed’, skuffet ‘disappointed’, klarte ‘managed’, sikret ‘secured’, havnet ‘ended up’, hentet ‘fetched’, ledet ‘led’. As this vocabulary already suggests, the writers represented in this area are sports journalists. The absence of Klassekampen fits well with the fact that this newspaper does not have sports pages. The reason why the sports terminology leads to this kind of departure from the main cluster may be that the sports pages use comparatively little of the rest of the vocabulary of the language, and possibly also, conversely, that the verbs typical of the sports pages are not frequent in other text types.
Figure 10. The central region of the plot in Figure 9
15. This does not exclude the possibility of a-forms in other texts, but if so, the preponderance of other forms still places such texts within the oval.
16. When a form occurs twice in the plot, one occurrence is the past tense form and the other the past participle form. The past tense forms are printed in bolder type.
17. The forms trent and rykte, a past participle and a past tense form, respectively, do not end in -a or -et, but are included because the verbs in question also allow the -a/-et inflection.
As expected from the fact that et- and a-forms hardly mingle at all in the plot, the implicational analysis of this material indicates no clear implicational hierarchy among these forms across different verbs. The tendency is for a text to use one or the other ending consistently across verbs.
In order to inspect patterns of cooccurrence across the noun and verb forms of cases I and II, the two tables were combined into one and the result subjected to correspondence analysis again. The resulting plot has the same general shape as Figure 1, with roughly the same distribution of the noun forms, but now with the verb forms interspersed. All the verbal a-forms occur peripherally, and the vast majority on the left side, in the general area of the more marked nominal a-forms (kvinna, mora etc.). As we move towards the centre, the et-forms start showing up at the same time as the nominal a-forms typical of everyday language (gata, uka, natta, skylda etc.), i.e. as we enter the area shown in Figure 4. This supports the impression that the choice of verbal a-forms clusters with the choice of the most markedly ‘radical’ nominal a-forms.
7. Case III: Infinitives in Nynorsk
Infinitives in Nynorsk may end in -a or in -e. Unlike the case of feminine and masculine nouns in Bokmål, in this case the official norm to some extent prescribes the distribution of the two endings across verbs, according to three options: (i) consistent -a in all verbs, (ii) consistent -e in all verbs, (iii) ‘split infinitive’ (‘kløyvd infinitiv’). Option (iii) involves -a in some infinitives and -e in others, based on historically rooted patterns of variation in certain dialects in the Eastern and middle part of Norway, excluding most of North Norway (see e.g. Faarlund et al. 1997: 476 f.; Skjekkeland 1997: 69).
The historical explanation is related to syllable quantity. In Modern Norwegian, with the exception of a small dialectal area in Gudbrandsdalen, accented syllables are always long, i.e. they have a long vowel, or a short vowel plus a long consonant or consonant cluster.¹⁸ Accented syllables in Old Norse could also be short, with a short vowel plus a short consonant. In modern dialects such syllables generally have been lengthened either by lengthening of the vowel (typical of the West) or by lengthening of the consonant (typical of the East). But already in late Old Norse, before the changes in syllable quantity, we see evidence that unstressed [a] was reduced to an [e]- or [æ]-like sound after long syllables, but not after short ones, in manuscripts from the Eastern part of the country. The split infinitive is a reflex in modern dialects of this quantitatively conditioned reduction.
However, after the lengthening of the old short accented syllables there is no quantitative conditioning of this variation from a synchronic point of view. This means that unless you have either the split infinitive in your own dialect or expert knowledge of Old Norse and language history, there is no way to predict that bite = ‘bite’ (with originally long /i:/) should have -e while vita = ‘know’ (with originally short /i/) should have -a according to the rules of the split infinitive. As a consequence, the split infinitive option in written Nynorsk is recommended only for people who have this phenomenon in their own dialect. As long as such writers distribute the -a and the -e according to their own dialect they are within the official norm. This still leaves room for some variation, since the split infinitive dialects are not consistent among themselves as to how many of the originally short-syllabic verbs get -a rather than -e in the infinitive. Hence charting the variation in Nynorsk infinitives is also of some interest.
18. This is the traditional analysis. There are alternative phonological analyses of these phenomena which I will not go into for present purposes.
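The three options, and the point that the split infinitive cannot be derived from the modern form alone, can be sketched as a small lookup in Python. This is only an illustration: the function name, the stem-plus-ending simplification and the two-entry lexicon are assumptions, and the only facts used are the historical quantities given above for bite and vita.

```python
# Historical syllable quantity of the root, as stated in the text for these two verbs.
# A real lexicon would need an entry for every verb; this one is purely illustrative.
HISTORICAL_QUANTITY = {
    "bit": "long",   # bite 'bite', originally long /i:/
    "vit": "short",  # vita 'know', originally short /i/
}

def infinitive(stem, option, lexicon=HISTORICAL_QUANTITY):
    """Form an infinitive under one of the three options: 'a', 'e' or 'split'."""
    if option in ("a", "e"):
        # Options (i) and (ii): the same ending for every verb.
        return stem + option
    if option == "split":
        # Option (iii): the ending depends on information that is not
        # recoverable from the modern form, so it has to be looked up.
        quantity = lexicon.get(stem)
        if quantity is None:
            raise KeyError(f"no historical quantity recorded for {stem!r}")
        return stem + ("a" if quantity == "short" else "e")
    raise ValueError(f"unknown option: {option!r}")

split_forms = [infinitive(stem, "split") for stem in ("bit", "vit")]  # ['bite', 'vita']
```

The lookup failing for any verb outside the lexicon mirrors the situation of a writer without the split infinitive in their own dialect: without the historical information there is nothing to base the choice on.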
7.1 Correspondence analysis
I have registered all infinitival forms of verbs of this class whose lemmas have a frequency of occurrence equal to or greater than 500 in the 12 million word Nynorsk part of the corpus. This yields an inventory of 83 verb lemmas. Figure 11 shows the shape of the ‘cloud’ plotting the e- and a-infinitives in the six Nynorsk newspapers. The two dimensions in the graph contain 48.7% of the information about distances in the space (x-axis: 41.9%, y-axis: 6.8%).
In Figure 11 most of the a-infinitives cluster densely in the far left, while most of the e-infinitives spread out vertically in the far right. In the left-hand a-cluster we find all the writers of Sogn Avis, exactly half of the writers of Dag og Tid, 8 out of the 20 writers of Bergens Tidende, 6 out of the 20 writers of Klassekampen, 2 out of the 20 writers of Nationen, and none from Hallingdølen. Thus, across the six newspapers 38% of the writers use consistent a-infinitives.
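The 38% figure can be reproduced with a short calculation, under one assumption that is not stated in this passage: that Sogn Avis, Dag og Tid and Hallingdølen are each represented by 20 writers, like the three newspapers for which the number is given.

```python
# Writers in the left-hand a-cluster per newspaper, as (in_cluster, total).
# The totals of 20 for Sogn Avis, Dag og Tid and Hallingdølen are ASSUMED.
a_cluster = {
    "Sogn Avis": (20, 20),        # "all the writers"
    "Dag og Tid": (10, 20),       # "exactly half"
    "Bergens Tidende": (8, 20),
    "Klassekampen": (6, 20),
    "Nationen": (2, 20),
    "Hallingdølen": (0, 20),      # "none"
}

in_cluster = sum(n for n, _ in a_cluster.values())
all_writers = sum(total for _, total in a_cluster.values())
share = in_cluster / all_writers

print(f"{in_cluster} of {all_writers} writers = {share:.1%}")
# prints: 46 of 120 writers = 38.3%
```

Under that assumption the count is 46 out of 120 writers, which rounds to the 38% reported.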
The remaining writers mostly cluster with the e-forms, the most notable exception being Hallingdølen. Figure 12 shows part of the lower right quadrant of Figure 11, with most of the Hallingdølen writers spreading out between the two clusters.
Figure 11. The ‘cloud’ displaying the distribution of e- and a-infinitives
Figure 12. Part of the lower right quadrant of Figure 11, showing writers between the main clusters
Hallingdølen is published in an area where the dialects use the split infinitive, but the newspaper does not impose the use of the split infinitive on its journalists. The plot indicates that the writers of Hallingdølen, plus a couple from Nationen and Dag og Tid, use the split infinitive to varying degrees, from almost consistent e-forms through a gradually increasing number of a-forms.
The rules of the split infinitive, prescribing -a in originally short-syllabic verbs, have consequences for the vertical distribution of the forms. Thus the lower left quadrant of Figure 11, shown in Figure 13, shows a ‘tail’ of a-forms departing from the main cluster in the direction of the Hallingdølen writers, and they are all short-syllabic with the single exception of styrka ‘strengthen’.
Figure 13. The lower left quadrant of Figure 11; a tail of the a-cluster
There are two forms of the infinitive vera = ‘be’, vera and væra, shown in Figure 13, with the latter apparently mostly restricted to the users of the split infinitive, as appears from its position near the centre of the graph.
On the right-hand side the split infinitive writers lead to a corresponding preponderance of long-syllabic forms in the lower half of the e-cluster, with most of the short-syllabic forms in the upper half. However, it seems that other factors also influence the distribution of the e-forms. It is not immediately clear why the e-forms spread out so much more in the vertical dimension than the a-forms (see Figure 11), but it seems likely that it is related to the distribution of vocabulary across writers. There is a noticeable tendency for the e-infinitives in the upper and lower halves to belong to different semantic fields. The verbs in the upper region, from the top, are (with originally long-syllabic forms marked with an asterisk, since the split infinitive favours short-syllabic forms in this region, making long-syllabic forms more marked): lese ‘read’, *skrive ‘write’, *snakke ‘talk’, leve ‘live’, *kjenne ‘feel’, fortelje ‘tell’, *tenkje ‘think’, spele ‘play’, spørje ‘ask’, vite ‘know’, vere ‘be’, *lære ‘learn’/‘teach’, *kalle ‘call’, gjere ‘do’, lage ‘make’, tene ‘earn’/‘serve’, komme ‘come’, *vise ‘show’, *høyre ‘hear’, sitje ‘sit’, velje ‘choose’, oppleve ‘experience’, seie ‘say’. Thus, the majority of these verbs are related to the general sphere of communication or intellectual activities. The verbs in the lower three quarters of the lower half, on the other hand, are, from the bottom upwards (this time with originally short-syllabic forms asterisked): køyre ‘drive’, bygge ‘build’, koste ‘cost’, vurdere ‘evaluate’, styrke ‘strengthen’, legge ‘lay’, sikre ‘secure’, starte ‘start’, innføre ‘import’/‘introduce’, byggje ‘build’, satse ‘invest’, etablere ‘establish’, melde ‘report’, søkje ‘apply’, auke ‘increase’, redusere ‘reduce’, løyse ‘solve’, rekne ‘calculate’, *betale ‘pay’, opne ‘open’, kjøpe ‘buy’, hente ‘fetch’, flytte ‘move’, jobbe ‘work’, *selje ‘sell’, liggje ‘lie’ (in a position), vente ‘wait’/‘expect’, samle ‘collect’, gjennomføre ‘carry through’, drive ‘run’ (transitive), sende ‘send’, leggje ‘lay’, følgje ‘follow’, skaffe ‘provide’, møte ‘meet’, bruke ‘use’, arbeide ‘work’, *klare ‘manage’, greie ‘manage’, utvikle ‘develop’. Thus, the general sphere here seems to be industry, trade and economy.
The vertical distribution distinguishing two semantic fields among the e-forms indicates that the writers tend to specialize in certain domains, such as culture vs. economic news. This accords well with the fact that all the Dag og Tid writers and most of the Klassekampen writers within the e-cluster are located in its upper half. Both these newspapers have an emphasis on culture and political commentary. The majority of the e-cluster writers of Nationen, a newspaper with an emphasis on regional politics and agriculture, are located in the lower region, while the Bergens Tidende writers within the e-cluster spread out evenly between the two regions. This indicates that the journalists of this major newspaper can be more specialized than is possible for the journalists of the much smaller Sogn Avis, which dominates the left-hand a-cluster. This circumstance may be the reason why we do not see a corresponding division of the a-cluster into semantically motivated subregions: each journalist in a small newspaper has to cover several domains, and some domains may also be covered to a lesser extent.
7.2 Implication analysis
The semantic factors discussed above to some extent obscure the patterns of variation within the choices of e-forms and a-forms of the infinitives. We have therefore subjected the data to an implication analysis of the kind described in Section 4 above. The analysis reveals an implicational pattern among the users of the split infinitive concerning the choice of infinitive forms. Figure 14 shows the right half of the resulting graph, comprising the a-forms. The left half, comprising the e-forms, is almost exactly like the right half turned upside down.
Figure 14. Implicational relations among the a-infinitives
Most of the originally short-syllabic a-forms occur below the large set of a-forms with mutual implications. The pattern indicates a relatively clear ranking of the short-syllabic a-forms used by writers who use the split infinitive. The form vera ‘be’ at the bottom indicates that at least this form is used by all such writers, and some will use only this a-form while using -e in all other infinitives. The form gjera ‘do’ above it indicates that some will use only these two a-forms, etc. This is matched by the fact that vere dominates gjere etc. at the top of the other half of the graph (not shown here), i.e. if a writer chooses vere, then all infinitives are e-forms.
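The ranking described above can be read as an implicational scale: a writer who uses the a-form at one level is expected to use the a-forms at all lower levels as well. The following Python sketch checks a writer's set of a-forms against such a ranking. The function name is invented, and only the first two positions of the ranking (vera, then gjera) come from the text; the third entry is a hypothetical placeholder.

```python
def fits_scale(a_forms, ranking):
    """True if the a-forms a writer uses are exactly the first k items of the ranking.

    a_forms: the a-infinitives found in one writer's texts.
    ranking: a-infinitives ordered from the bottom of the graph upwards.
    """
    used = set(a_forms)
    k = len(used)
    return used == set(ranking[:k])

# Bottom-up ranking. 'vera' and 'gjera' follow the text;
# the position of 'vita' is a hypothetical placeholder.
RANKING = ["vera", "gjera", "vita"]

print(fits_scale(["vera"], RANKING))           # True: only the bottom form
print(fits_scale(["vera", "gjera"], RANKING))  # True: the two lowest forms
print(fits_scale(["gjera"], RANKING))          # False: gjera without vera
```

A writer with no a-forms at all also fits the scale trivially, which corresponds to the consistent e-infinitive option.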
The graph must be read with the caveats mentioned in Sections 4 and 5.2 above. Still, it strongly suggests a rather consistent pattern of implications across split-infinitive writers with regard to the most frequent verbs. Breaking this pattern would then probably be perceived as a violation of the operative norm for Nynorsk. At the same time it is clearly unrealistic to expect non-expert writers without this phenomenon in their own dialects to acquire mastery of this system, including the implicational relationships. This might be taken as an argument against keeping the split infinitive option as part of the standard language, assuming that a standard in a strict sense is the aim.
8. Conclusion
The primary goal of the pilot project reported here was to test the utility of correspondence analysis and implication analysis in the investigation of norm clusters in Bokmål and Nynorsk. The techniques will have to be applied to a much larger and more representative corpus, and to a much wider range of linguistic phenomena, before fairly safe conclusions about the situation of written Norwegian and its possible emergent subvarieties can be reached. Still, the analyses indicate that the approach is able to yield plausible insights about clusterings and implicational relationships. The following tendencies were among the ones indicated:
– In the alternation between masculine and feminine gender (in the sense of Footnote 4) of nouns in Bokmål, the choice of masculine gender forms the densest cluster, indicating that this choice may be correlated with more consistency across texts, i.e. a more clearly defined subnorm, than the alternative choice.
– There is an implicational relationship between the choice of feminine gender for nouns denoting abstract concepts from the academic sphere and the choice of feminine gender for concretes and everyday words, where the former choice implies the latter, but not vice versa.
– In the alternation between -et and -a in weak verbs in Bokmål the et-forms form a solid norm cluster, while pervasive use of a-forms seems to be typical of a tiny fraction of the writers involved.
– The use of a-forms of weak verbs clusters with the choice of feminine gender for abstract words from the academic sphere.
– In the alternation between -a and -e in infinitives in Nynorsk the writers using the split infinitive show a fair amount of variation, but still display a comparatively clear implicational relationship in their choice of ending for individual verbs. The complexity of this system is not conducive to having it captured by clear normative rules.
These results in themselves are not very novel or surprising. But they corroborate and provide a further articulation of common existing assumptions, and this is done on the basis of empirical corpus data rather than on that of intuitive judgments. This supports the conclusion that the approach tested here, applied to a much larger corpus, may provide useful information about emerging norm patterns for the long-term work towards the adaptation of the written standards to developments in the operative norms revealed in real texts.
References
Baayen, R.H. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge: CUP.
Faarlund, Jan Terje, Lie, Svein & Vannebo, Kjell Ivar. 1997. Norsk referansegrammatikk. Oslo: Universitetsforlaget.
Rosén, Victoria & De Smedt, Koenraad. 2000. *Er korrekturlesningsevnen di god? Resultater fra SCARRIE. In Nordlyd: Tromsø University Working Papers on Language and Linguistics 28, Olaf Jansen Westvik, Toril Swan, Endre Mørck & Ove Lorentz (eds), 214–228. Tromsø: University of Tromsø.
De Smedt, Koenraad & Rosén, Victoria. 2000. Automatic proofreading for Norwegian: The challenges of lexical and grammatical variation. In NODALIDA’99: Proceedings from the 12th “Nordiske datalingvistikkdager”, Trondheim, December 9–10, 1999, Torbjørn Nordgård (ed), 206–215. Trondheim: NTNU.
Norsk Ordbank. 〈http://www.hf.uio.no/iln/om/organisasjon/edd/forsking/norsk-ordbank/〉 (20 March, 2011).
SCARRIE. 〈http//ling.b.uib.no/projects/scarrie/〉 (19 March, 2011).
Skjekkeland, Martin. 1997. Dei norske dialektane. Kristiansand: Høyskoleforlaget.