Determining Dutch dialect phylogeny using bayesian inference

Peter Dekker

Bachelor’s thesis (7.5 ECTS)

BSc Artificial Intelligence, Universiteit Utrecht

Supervisors: Alexis Dimitriadis and Martin Everaert

July 31, 2014

Abstract

In this thesis, bayesian inference is used to determine the phylogeny of Dutch

dialects. Bayesian inference is a computational method that can be used to

calculate which phylogenetic tree has the highest probability, given the data.

Dialect data from the Reeks Nederlandse Dialectatlassen, a corpus of words in

several Dutch dialects, serves as input for the bayesian algorithm. The data was

aligned and converted to phonological features. The trees generated by bayesian

inference were evaluated by comparing them with an existing dialect map by

Daan and Blok.

Contents

1 Introduction
1.1 Bayesian inference
1.2 Earlier research
1.3 Applying bayesian inference to the Dutch dialects
2 Method
2.1 Data
2.2 Alignment
2.3 Phonological mapping
2.4 Bayesian inference
2.5 Evaluation
3 Results
3.1 Netherlandic dialects
3.1.1 Equal rate variation model
3.1.2 Gamma-distributed rate variation model
3.2 Comparison: Belgian dialects
4 Discussion
5 Conclusion
6 Literature
7 Appendix
7.1 Tree of Netherlandic dialects – equal rate variation
7.2 Tree of Netherlandic dialects – gamma rate variation
7.3 Tree of Belgian dialects – equal rate variation
7.4 Tree of Belgian dialects – gamma rate variation
7.5 Dialect map by Daan and Blok (1969)

1 Introduction

How are languages related? Languages are genetically related if they share a

single ancestor from which they derive (Campbell, 1998). To prove a common

ancestor, an array of methods can be applied. The phylogeny, or evolutionary

relationship, of languages can be viewed as a tree, where a branching shows that

two languages B and C derived from an ancestor A.

1.1 Bayesian inference

In recent years, computational methods have made their entrance into historical linguistics. One of them is bayesian phylogenetic inference. This method is based on Bayes’ law: the probability of a hypothesis H_x given an observed phenomenon E can be computed from the probability of the phenomenon given the hypothesis.

\[
P(H_x \mid E) = \frac{P(H_x) \cdot P(E \mid H_x)}{\sum_{k=1}^{n} P(H_k) \cdot P(E \mid H_k)}
\]
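As a small numeric illustration of Bayes’ law, the sketch below computes the posterior probability of each hypothesis in a finite set. The three hypotheses and their numbers are hypothetical and only serve to show how the denominator normalizes the result.

```python
def posterior(priors, likelihoods):
    """Return P(H_k | E) for every hypothesis k, following Bayes' law."""
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    # Numerator of Bayes' law for every hypothesis: P(H_k) * P(E | H_k)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: the total probability of the phenomenon E
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("the phenomenon has probability zero under all hypotheses")
    return [j / evidence for j in joint]


# Hypothetical example: three candidate trees with equal prior probability
priors = [1 / 3, 1 / 3, 1 / 3]
likelihoods = [0.02, 0.01, 0.01]  # P(E | H_k), made-up values
print(posterior(priors, likelihoods))
```

With equal priors, the posterior is proportional to the likelihood, so the first hypothesis receives half of the probability mass.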

In our case, a phylogenetic tree is the hypothesis. A tree is a tuple ω = (τ, υ, φ) with (Larget and Simon, 1999):

• a tree topology τ
• a vector of branch lengths υ associated with topology τ. Each branch length in υ represents the distance between two adjacent nodes in τ.
• a substitution model φ, which determines the probability that a certain element of a language changes into a certain other element.

The tree topology τ is defined as:

• a set of vertices V
• a set of edges E ⊆ V × V
• the graph that E describes on V is connected
• there are no cycles.
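The definition of a tree topology can be checked mechanically. The following is a minimal sketch, assuming the edges are treated as undirected; the function name `is_tree` and the example vertices are illustrative and do not come from the thesis.

```python
from collections import defaultdict


def is_tree(vertices, edges):
    """Check that the undirected graph (vertices, edges) is connected and has no cycles."""
    vertices = set(vertices)
    if not vertices:
        return False
    # A connected graph without cycles on n vertices has exactly n - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    neighbours = defaultdict(set)
    for a, b in edges:
        if a not in vertices or b not in vertices or a == b:
            return False
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Depth-first search from an arbitrary vertex to test connectivity.
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == vertices


# Ancestor A with two descendants B and C
print(is_tree({"A", "B", "C"}, [("A", "B"), ("A", "C")]))              # True
print(is_tree({"A", "B", "C"}, [("A", "B"), ("A", "C"), ("B", "C")]))  # False
```

Checking the edge count first means that connectivity alone is enough to rule out cycles.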

Branch lengths show how much distance there is between languages: a longer

branch means that more substitutions have been made between the input strings.

The substitution model determines how likely the change from a character in

the input string to a certain other character is.

The question is: what is the most probable tree to describe the linguistic data

X? Our application of Bayes’ law becomes (Ronquist and Huelsenbeck, 2003):

\[
f(\omega \mid X) = \frac{f(\omega) \cdot f(X \mid \omega)}{f(X)}
\]

f(ω) is the prior distribution, containing the a priori probabilities of the different trees. f(X|ω) is the likelihood function, which returns the probability of the data given a tree. f(X) is the total probability of the data. f(ω|X) is the posterior distribution, containing the probabilities of all the trees given the data.

Assumptions have to be made about the prior probabilities of trees, because they are generally unknown. Calculating the total probability of the data f(X) means summing over all possible tree topologies τ and integrating over all possible values of υ and φ. Because the number of possible topologies grows very rapidly with the number of languages, the posterior probability distribution cannot be calculated directly (Huelsenbeck et al., 2001).

To address this issue, bayesian algorithms use a technique called Markov Chain Monte Carlo (MCMC) sampling. MCMC approximates the posterior distribution by drawing samples from it, instead of calculating it exactly. The prior probability distribution is determined by a Dirichlet distribution, inferring the prior probability of a tree from the data itself (Ronquist et al., 2011). The algorithm starts off with a random tree. Every generation, a small random change of the parameter values is proposed. It is accepted or rejected with a probability given by the Metropolis-Hastings algorithm (Larget and Simon, 1999). Every s generations (where s is the sample frequency), the current tree of the chain is saved to the posterior sample. After a sufficient number of generations, the posterior sample should approximate the real posterior probability distribution (Huelsenbeck et al., 2001). From this sample, a best tree can be selected, according to the desired criteria.
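The sampling loop can be sketched on a toy problem. The code below is not what a phylogenetic program does internally: it assumes a hypothetical state space of only three "trees" with made-up unnormalized posterior weights, and a symmetric proposal, so that the Metropolis-Hastings acceptance probability reduces to the ratio of the weights.

```python
import random
from collections import Counter


def metropolis_sample(weights, generations, sample_freq, seed=0):
    """Sample states in proportion to their (unnormalized) weights.

    weights maps each state to a positive number proportional to its
    posterior probability; the normalizing constant is never needed.
    """
    rng = random.Random(seed)
    states = list(weights)
    current = rng.choice(states)  # start from a random state
    samples = []
    for generation in range(1, generations + 1):
        # Symmetric proposal: jump to one of the other states, chosen uniformly.
        proposal = rng.choice([s for s in states if s != current])
        # Accept with probability min(1, weight ratio); otherwise stay put.
        ratio = weights[proposal] / weights[current]
        if rng.random() < min(1.0, ratio):
            current = proposal
        # Every sample_freq generations, save the current state.
        if generation % sample_freq == 0:
            samples.append(current)
    return samples


weights = {"T1": 6.0, "T2": 3.0, "T3": 1.0}  # hypothetical values
samples = metropolis_sample(weights, generations=20000, sample_freq=20)
frequencies = Counter(samples)
print({state: frequencies[state] / len(samples) for state in weights})
```

The relative frequencies in the sample should come close to 0.6, 0.3 and 0.1, although the total probability of the data was never computed.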

In principle, the input for the bayesian method could be any kind of linguistic

data. When applied to languages, it is common to use Swadesh lists: lists of basic words which are unlikely to be borrowed. A linguist manually assigns each word in a language to a cognate class. Crucial in the cognate classification is the Neogrammarian hypothesis, which states that sound changes are

regular. Regularity means that if a sound in a word changed into another sound,

it will do so in every other word. Two words can only be cognate, if the common

ancestor can be reached from both cognate candidates by a number of known

sound changes (Campbell, 1998). Generally, in phylogenetic linguistics, a string

of the cognate classiﬁcations of every word in a language serves as the input for

the bayesian algorithm. For dialects, words in the diﬀerent dialects are likely to

be cognates (Dunn, 2008). If cognacy is assumed, manual classiﬁcation can be

omitted and an objective measure of distance between the phonological forms

in diﬀerent dialects can be used.

1.2 Earlier research

In earlier research, the bayesian method has been applied to Bulgarian dialects

(Prokić et al., 2011). The focus in the research was on vowel change. The

consonants were dropped and the vowels were classiﬁed into a limited number

of classes to reduce computational cost. A broader approach will be used in this

thesis, because consonants may also amount to important distinctions between

dialect groups.

1.3 Applying bayesian inference to the Dutch dialects

I would like to evaluate the bayesian method when used on the Dutch dialects. My research question is: how well is bayesian inference suited to determine the phylogeny of Dutch dialects?

2 Method

The input for the bayesian inference is a string for every dialect, which uniquely

describes that dialect. A corpus of words and their translations into different dialects was used as the basis for the input string. Each word was aligned

with its counterparts in diﬀerent dialects. All the aligned words for a dialect

were concatenated. The concatenated strings were converted to phonological

features. The resulting feature strings were used as input for the bayesian

inference.

2.1 Data

The dialect data was taken from the Reeks Nederlandse Dialectatlassen (RND). This is a corpus of transcribed speech in Dutch and Frisian dialects from the whole Dutch language area: the Netherlands, a neighbouring area in Germany, Flanders and the north of France. The corpus was recorded between 1925 and 1982. A selection of 166 words and 363 dialects has been made from this corpus and digitized (Heeringa, 2001). An interesting addition to the digital version is the Plautdietsch dialect from Protasovo, Siberia. This dialect goes back to 16th century Mennonites who migrated via Eastern Europe to Siberia. It maintained its Dutch character in Slavic surroundings (Nieuweboer, 1998).

The data was written in X-SAMPA, an ASCII version of the International Phonetic Alphabet (IPA). The data was converted to IPA, represented in Unicode, using the cxs2ipa script (Theiling, 2008).¹ The dialect data was split into two

subsets: one set of 269 Netherlandic (and neighbouring German) dialects and

one set of 94 Belgian (and neighbouring French) dialects. It is interesting to

use the parameters from the Netherlandic data set on the Belgian data, to see

whether the setting of the parameters is generally applicable to diﬀerent data

sets.

¹ The script was modified in order to convert the æ properly as well.

2.2 Alignment

Figure 1: Alignment of translations of the lemma are. The sound classes (colors of the phones) enable comparable sounds to be matched.

In order to compare which sounds differ in the translations of a lemma in different dialects, the words need to be aligned: comparable sounds are put in the same column (Figure 1). Making an alignment assumes that the words in different dialects are cognates. For most, but not all, words in the RND, this is the case. For example, the lemma chickens has entries which look like Dutch kippen and entries which look like German Hühner. Aligning these with each other is less informative (Figure 2).

Figure 2: Alignment of translations of the lemma chickens. The entries are not cognate: there are entries which look like kippen and entries which look like Hühner. Still, they are aligned based on their phonological characteristics.

For some dialects, there was more than one translation for a certain lemma. In these cases, the first one was chosen as the only translation. There are also lemmas where the alignment may have been disturbed by morphological rather than phonological differences. For example, the lemma sore throat has entries which look like keelpijn (throat-sore) and other entries which look like pijn in de keel (sore-in-the-throat). The sounds are roughly the same, only the order of the stems is different. This is not fully reflected in the alignment (Figure 3).

Figure 3: Alignment of translations of the lemma sore throat. The phonological similarity is not fully reflected in the alignment, because of the morphological difference in order. The part pin in both words is however aligned.

The alignment is done using LingPy (List and Moran, 2013). This is a program

for multiple sequence alignment, which means that all words are aligned with

each other at the same time. LingPy matches phones, by classifying them into

a number of sound classes (List, 2012). Phones in the same sound class have

the highest probability of matching.
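The idea of matching phones through sound classes can be illustrated with a much simpler technique than the one LingPy uses. The sketch below is a plain pairwise Needleman-Wunsch alignment, not LingPy's multiple alignment; the sound classes and the scores (2 for identical phones, 1 for phones in the same class, -1 otherwise and for gaps) are assumptions made for this example only.

```python
# Hypothetical, very coarse sound classes (not LingPy's own models).
SOUND_CLASS = {
    "p": "P", "b": "P", "t": "T", "d": "T", "k": "K", "g": "K",
    "m": "M", "n": "N", "h": "H", "x": "H",
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V", "y": "V",
}
GAP = -1  # assumed gap penalty


def score(a, b):
    """Score a pair of phones: identical > same sound class > different."""
    if a == b:
        return 2
    class_a, class_b = SOUND_CLASS.get(a), SOUND_CLASS.get(b)
    if class_a is not None and class_a == class_b:
        return 1
    return -1


def align(seq_a, seq_b):
    """Globally align two phone sequences; gaps are returned as '-'."""
    n, m = len(seq_a), len(seq_b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * GAP
    for j in range(1, m + 1):
        table[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                table[i - 1][j] + GAP,
                table[i][j - 1] + GAP,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                table[i][j] == table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + GAP:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


row_a, row_b = align(list("kat"), list("kt"))
print(row_a)
print(row_b)
```

Both returned rows always have the same length, so they can be read column by column in the same way as the alignments in the figures.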

Before the alignment, the data was tokenized: phones were grouped with their diacritic signs to form one token. The list of possible tokens was based on Hoppenbrouwers and Hoppenbrouwers (1988), who supply a list of IPA tokens, based on the RND data, and map those to phonological features. Still, some combinations of vowels or consonants with diacritics that occur in my RND data were missing from the list. The diacritics in these combinations were omitted, because there is no phonological mapping for such tokens either. As a result, the omitted diacritic signs are shown as a ? in the alignment, which means that they can be aligned with any phone. This does not seem to have decreased the quality of the alignment substantially.
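The tokenization step can be sketched as follows. The token list and the set of diacritics below are small hypothetical stand-ins for the list by Hoppenbrouwers and Hoppenbrouwers (1988), and the exact procedure used in the thesis may differ; the sketch only shows how a phone absorbs a diacritic when the combination is listed and how an unlisted diacritic becomes a ?.

```python
LENGTH = "\u02d0"  # IPA length mark
NASAL = "\u0303"   # combining tilde (nasalization)

# Hypothetical subset of the token list: only these combinations are known.
KNOWN_TOKENS = {"b", "l", "u", "m", "\u0259", "n", "u" + LENGTH, "n" + LENGTH}
DIACRITICS = {LENGTH, NASAL}


def tokenize(word):
    """Split an IPA string into tokens of a phone plus its listed diacritics."""
    tokens = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in DIACRITICS:
            # Diacritic whose combination with the previous phone is not listed.
            tokens.append("?")
            i += 1
            continue
        token = ch
        i += 1
        # Absorb following diacritics while the combined token is in the list.
        while (i < len(word) and word[i] in DIACRITICS
               and token + word[i] in KNOWN_TOKENS):
            token += word[i]
            i += 1
        tokens.append(token)
    return tokens


print(tokenize("blu" + LENGTH + "m\u0259n"))  # long u is kept as one token
print(tokenize("m" + LENGTH + "i"))           # unlisted long m gives 'm', '?'
```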

Standard Dutch   k   I   p   -   @   m   E   i   n   b   l   u   m   @   -
Standard German  h   y   -   n   @   m   A   i   n   b   l   u:  m   @   n
Midsland         h   E   -   n:  -   m   i   -   -   b   l   u   m   @   n

Figure 4: Concatenation of the lemmas chickens, my and flowers for three dialects. The real concatenated strings are far longer: they contain 166 lemmas.

Figure 5: The feature strings which serve as input for the bayesian inference. This is the result of converting the concatenations from Figure 4 to phonological features.

After the alignment, all aligned words for a dialect were concatenated with each

other, resulting in a long string of all words for that dialect (Figure 4).

2.3 Phonological mapping

One possibility would be to use the concatenated string of all aligned words directly as the input string for a dialect. The positions in the alignment, the phones, would then be the features on the basis of which a dialect can be compared with other dialects. However, the symbol alphabet of all phones used in the RND is too big to be computationally feasible. Furthermore, it is desirable that the algorithm takes into account that phones that are phonologically close to each other can change into each other more easily than phones that are further apart.

For these two reasons, the aligned phone strings were converted to an array of phonological features. Hoppenbrouwers and Hoppenbrouwers (1988) provide a mapping from each character to 21 binary phonological features, which was used. The result is a long string of 0’s and 1’s for every dialect (Figure 5).
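The conversion from aligned tokens to a feature string can be sketched as below. The feature table is hypothetical and uses only three features per token instead of the 21 features of Hoppenbrouwers and Hoppenbrouwers (1988). Encoding a gap as a run of - and an unknown token as a run of ? is likewise an assumption of this sketch.

```python
N_FEATURES = 3

# Hypothetical feature table: (vocalic, nasal, voiced) for a few tokens.
FEATURES = {
    "p": "000",
    "b": "001",
    "m": "011",
    "n": "011",
    "u": "101",
}


def encode(aligned_words):
    """Concatenate the aligned words of one dialect and map tokens to features."""
    parts = []
    for word in aligned_words:
        for token in word:
            if token == "-":
                # A gap in the alignment stays a gap in every feature position.
                parts.append("-" * N_FEATURES)
            else:
                # Tokens without a mapping are encoded as missing data.
                parts.append(FEATURES.get(token, "?" * N_FEATURES))
    return "".join(parts)


# Two aligned "words" of three columns each
print(encode([["b", "u", "m"], ["p", "-", "n"]]))
```

Because every token occupies the same number of positions, the feature strings of all dialects keep the column structure of the alignment.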

2.4 Bayesian inference

The program used to execute the bayesian inference was MrBayes (Ronquist and Huelsenbeck, 2003). As described in the introduction, an MCMC analysis starts from a randomly chosen tree and proposes small random changes to this tree. MrBayes runs two different MCMC analyses at the same time, starting from two different randomly chosen trees. By measuring how well the two analyses have converged on the same result, it is possible to get an indication whether a stable posterior probability distribution has been reached. The algorithm was run for 1,000,000 generations. The sample frequency was set to 20, which means that every 20 generations, the current tree of the chain is saved.

The likelihood is determined by two model components: the substitution model and the rate variation model. Together they provide the probability of the data, given a certain tree. The substitution model determines the chance that a certain character changes into another character. We have only two characters (0 and 1), so the only state changes are 0 → 1 and 1 → 0. A substitution model with equal probability for every state change was used. The rate variation model determines how the chance of a state change varies between features. For example, a rate variation model could state that letters at the end of a word have a higher chance of changing than letters in the middle (Prokić et al., 2011).
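For two states with equal change probabilities, the chance that a feature differs between the two ends of a branch has a simple closed form in the standard symmetric two-state Markov model: 1/2 − 1/2·e^(−2v), where v is the branch length in expected changes per feature. The sketch below assumes this textbook model and equal rates for all features; it is not taken from the thesis, and the program used may parameterize its model differently.

```python
import math


def p_change(branch_length):
    """Probability that a binary feature differs across a branch."""
    return 0.5 - 0.5 * math.exp(-2.0 * branch_length)


def site_likelihood(state_a, state_b, branch_length):
    """Joint probability of the two end states, with both start states equally likely."""
    p = p_change(branch_length)
    return 0.5 * (p if state_a != state_b else 1.0 - p)


def pair_log_likelihood(string_a, string_b, branch_length):
    """Log-likelihood of two equally long 0/1 strings joined by one branch."""
    if len(string_a) != len(string_b):
        raise ValueError("feature strings must have the same length")
    return sum(math.log(site_likelihood(a, b, branch_length))
               for a, b in zip(string_a, string_b))


# Two short feature strings that differ in one of six positions
for v in (0.05, 0.2, 1.0):
    print(v, pair_log_likelihood("010011", "010111", v))
```

Comparing the printed values for different branch lengths shows how the likelihood function prefers some branch lengths over others for the same data.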

Two rate variation models were tried: an equal rate variation model and a

gamma-distributed rate variation model. In an equal rate variation model, every

feature has the same chance of changing. In a gamma-distributed rate variation

model, the bayesian algorithm infers from the data which features change more

often than others. It categorizes the features in rate classes of higher or lower

probability of change, according to a gamma distribution.
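The settings described in this section could be passed to MrBayes in a NEXUS file. The sketch below generates such a file from Python. The thesis does not show its input file, so the data type (`restriction`, for 0/1 data), the helper name `nexus_with_mrbayes_block` and the layout are assumptions; only the number of generations, the sample frequency and the two rate variation models come from the text.

```python
def nexus_with_mrbayes_block(feature_strings, rates="equal",
                             ngen=1_000_000, samplefreq=20):
    """Build NEXUS text: a data matrix followed by a MrBayes command block.

    feature_strings maps a dialect name to its string of 0, 1, - and ?.
    """
    if rates not in ("equal", "gamma"):
        raise ValueError("rates must be 'equal' or 'gamma'")
    lengths = {len(s) for s in feature_strings.values()}
    if len(lengths) != 1:
        raise ValueError("all feature strings must have the same length")
    nchar = lengths.pop()
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(feature_strings)} nchar={nchar};",
        # Assumption: the binary features are entered as restriction (0/1) data.
        "  format datatype=restriction gap=- missing=?;",
        "  matrix",
    ]
    for name, string in feature_strings.items():
        # Taxon names may not contain spaces, so replace them by underscores.
        lines.append(f"  {name.replace(' ', '_')}  {string}")
    lines += [
        "  ;",
        "end;",
        "begin mrbayes;",
        f"  lset rates={rates};",
        f"  mcmc ngen={ngen} samplefreq={samplefreq} nruns=2;",
        "  sumt;",
        "end;",
    ]
    return "\n".join(lines) + "\n"


example = {"Standard Dutch": "010011", "Midsland": "0100-1"}  # made-up strings
print(nexus_with_mrbayes_block(example, rates="gamma"))
```

Switching between the two rate variation models only changes the `lset` line, which makes it easy to run both analyses on identical data.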

2.5 Evaluation

A dialect map by Daan and Blok (1969) is used as the gold standard to evaluate

the results of the bayesian analysis. The map is based on the perception of

speakers. In a questionnaire, people from villages in the Dutch language area

were asked which dialects from other villages were (almost) the same. Arrows

could be drawn between villages with roughly the same dialect. Daan and Blok’s

map is based on this arrow method, combined with some linguistic knowledge,

in cases where the arrow method did not match the known insights (Daan and

Blok, 1969).

Finally, the bayesian algorithm with the same settings was applied to the Belgian

dialects. The results were also compared with Daan and Blok’s map.

3 Results

The output of the bayesian inference is a set of trees, each with their own

probability. The consensus tree is a tree that tries to reconcile all trees. If the

branching is contradictory between trees, the consensus tree places the branch at

a lower level (Dunn, 2008). The trees were shown graphically using the FigTree

(Rambaut, 2013) program.
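How a consensus tree reconciles contradicting trees can be sketched with a majority rule: a group of dialects is kept only if it occurs in more than half of the sampled trees. This is a simplified illustration that assumes a majority-rule consensus and represents each tree as a set of clades; the dialect labels A to D are hypothetical.

```python
from collections import Counter


def majority_clades(tree_sample, threshold=0.5):
    """Return the clades that occur in more than `threshold` of the trees.

    Every tree is a set of clades; a clade is a frozenset of leaf names.
    """
    if not tree_sample:
        return {}
    counts = Counter(clade for tree in tree_sample for clade in tree)
    total = len(tree_sample)
    return {clade: count / total
            for clade, count in counts.items()
            if count / total > threshold}


AB = frozenset("AB")
CD = frozenset("CD")
ABC = frozenset("ABC")
ABD = frozenset("ABD")

# Four sampled trees that agree on {A, B} but contradict each other above it.
sample = [{AB, ABC}, {AB, ABD}, {AB, CD}, {AB, ABC}]
print(majority_clades(sample))
```

Only the group of A and B survives. The contradicting higher-order groups are dropped, so C and D end up unresolved at a lower level of the tree.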

3.1 Netherlandic dialects

3.1.1 Equal rate variation model

The resulting consensus tree correctly shows groups of dialects that are connected locally, but generally does not show higher-order grouping between the local groups. This probably happens because the sampled trees disagree on the higher-order branchings, so that the consensus tree cannot decide between them and places the dialects at a lower level. The MCMC analysis has also not converged optimally, even after 1,000,000 generations.

Hardly any false groupings are made. Dialects that are grouped in the tree, are

generally also in one group on Daan’s map. Sometimes dialects are linked with

a dialect that is just across the border of a diﬀerent group on the map, but still

geographically close. This is visible in the grouping of the dialects of Zeeuws-

Vlaanderen with some neighbouring dialects from Noord-Brabant (Figure 6).

No strange groupings between diﬀerent parts of the country are made. The

price for this accuracy is that a lot of dialects remain ungrouped or are only

connected with their direct neighbours (Figure 7).

Figure 6: Equal model. The dialects Clinge, Lamswaarde and Groenendijk

from Zeeuws-Vlaanderen have been grouped together. They all belong to the

Zeeuws group on the map. The geographically close Zundert, Roosendaal and

Ossendrecht dialects have been grouped together in the tree, although they

belong to the diﬀerent Noord-Brabant group on the map.

Some distinguishing groups can be seen, which correspond with groups on Daan

and Blok’s map. The dialects of Groningen (Figure 8) and southern Dutch

Limburg (Figure 9) form groups which correspond with Daan and Blok’s map.

The dialects of northern Noord-Holland and the islands of Texel and Vlieland

form one group, as the map would predict (Figure 10). There is a branch of the

tree that splits into Frisian dialects and Frisian city dialects (Figure 11). It is

good to see this clear division but still close connection between Frisian dialects

and Frisian city dialects. The Frisian city dialects are dialects which originate

from Frisian, but have been inﬂuenced by the dialects from Holland in the 16th

century (Jansen, 2002).

It is clear that Daan’s Utrecht-Alblasserwaard group is not well-visible in the

tree. Many dialects are clustered with other groups. Utrecht and Amersfoort

are unresolved (Figure 7).

Dialects from eastern Noord-Brabant have been connected, but are not connected with dialects from the west of Noord-Brabant, with which they form one group on Daan’s map. The dialects are however closely connected with two dialects from Zuid-Gelderland, a related, but different group on the map (Figure 12).

Figure 7: Equal model. A lot of dialects have not been grouped: dialects

from the Utrecht-Alblasserwaard group like Utrecht and Amersfoort are on the

same level as eastern dialects like Beilen and Emmen. Other dialects have been

clustered into small groups: for example Goirle, Oirschot and Loon op Zand.

Figure 8: Equal model. The dialects of Groningen have been grouped according

to Daan’s map.

Figure 9: Equal model. The dialects of southern Dutch Limburg form a well-

divided group that is coherent with the map.

Figure 10: Equal model. The dialects of northern Noord-Holland are grouped

with the dialects from the islands of Texel and Vlieland, as Daan predicts. The

group is on the same level with a totally diﬀerent, but also coherent group, that

of Zeeland. The long branch length of Protasovo is remarkable and signiﬁes a

large distance compared to the other dialects.

Figure 11: Equal model. The dialects of Friesland. The Frisian dialects (red)

and Frisian city dialects (green) are related, but it is clear that there is a division.

Figure 12: Equal model. Dialects from the eastern side of the Noord-Brabant

group (red) have been grouped with dialects from the river region (green). Al-

though these groups are related, it is remarkable that the Noord-Brabant di-

alects match with dialects from a diﬀerent group, whereas they do not match

with dialects from the western part of the same Noord-Brabant group.

The tree lacks some higher-order grouping. The dialects of the southern Nether-

lands are shown as a family in Daan’s map using red shades. The Low-Saxon

dialects of the eastern and northern Netherlands are shown as a family using

green shades. These higher-order groupings are however not visible in the tree

(Figure 13).

Figure 13: Equal model. The red group is a mix of dialects from the Utrecht-

Alblasserwaard group (Oudewater, Soest, Driebergen, Polsbroek) and Zuid-

Holland (Berkel, Wateringen, Nieuwveen, Langeraar, Warmond, Zoetermeer).

Maybe the border between these groups is not really clear-cut, as Daan and

Blok (1969) state. A second observation is that there is insufficient higher-order grouping. The red group of western-central dialects is at the same level

as the two blue groups of northeastern (Low-Saxon) dialects. These two blue

groups would be expected to be on a diﬀerent level, together with other Low-

Saxon dialects.

The Protasovo (Plautdietsch) dialect has a very long branch, which shows that

it diﬀers a lot from the other dialects (Figure 10). This seems reasonable, given

that it is a form of Dutch that has not been in contact with other Dutch dialects

for centuries.

In conclusion, the dialects that have been grouped together form groups that are coherent with Daan and Blok. The groups have however not been combined into higher-order groups that show relations between dialect regions. This reduces the explanatory power of the tree.

3.1.2 Gamma-distributed rate variation model

The consensus tree of the gamma-distributed rate variation model shows the same pattern as the consensus tree of the equal rate variation model. There are some differences in the groupings: some are improvements, others are degradations. It is not clear whether these small differences are caused by the different rate variation models, since differences also occurred across different executions of the same rate variation model.

The groups of southern Dutch Limburg and Groningen are also salient in this

consensus tree. Some groupings from Daan’s map are better under the gamma

model. The dialects of Twente have been grouped together under this model,

whereas they were spread across diﬀerent groups in the equal model (Figure

14). The dialect groups of Noord-Holland and Zuid-Holland are related, and this is shown to a greater extent in this tree (Figure 15).

Figure 14: Gamma model. The dialects of Twente (and the directly neighbour-

ing places in Germany) form one group under the gamma model, whereas they

were spread across several groups in the equal model.

The distinction between Frisian dialects and Frisian city dialects is still shown, but this time the Frisian dialects are shown as a subgroup of the Frisian city dialects (Figure 16). This is not correct from a historical point of view, since the Frisian city dialects split off from the Frisian dialects. However, from a distance point of view, it is less remarkable. The Frisian city dialects are closer to the other Dutch dialects at the root of the tree, because they have been influenced by the dialects from Holland.

Figure 15: Gamma model. Dialects of Noord-Holland and Zuid-Holland have been combined as one group under the gamma model.

An interesting result is that the dialect of Katwijk aan Zee, a coastal place in

Zuid-Holland, is grouped with the dialects of Zeeland, a diﬀerent group further

to the south (Figure 17). Apparently there are some shared characteristics

between these coastal areas.

There are also dialects that were grouped in the consensus tree of the equal

model, but are unresolved under the gamma model. Examples are the places

Oldemarkt and Steenwijk.

It is hard to say whether the gamma or the equal model is better. The gamma

model shows a few interesting groups that the equal model does not show, but

it also leaves dialects ungrouped which the equal model grouped. Furthermore,

some diﬀerences can occur across diﬀerent executions of the same model and

are not caused by the model choice.

3.2 Comparison: Belgian dialects

The Belgian data was kept apart to see whether the method works for a diﬀerent

data set as well. The Belgian data was processed in the same way as the

Netherlandic data and the bayesian algorithm was run with the same parameters

(1,000,000 generations, sample frequency 20).

Again, an equal rate variation model and a gamma-distributed rate variation model were tried.

In the equal rate variation model, three important groups are seen. Only a few small groups are present and few dialects remain unresolved. This could mean there is less contradiction between the different trees than in the results of the Netherlandic data set. Also, the convergence between the runs is better.

Figure 16: Gamma model. The Frisian dialects are shown as a subgroup of the Frisian city dialects.

Figure 17: Gamma model. Katwijk aan Zee, a coastal place in the Zuid-Holland dialect region, is grouped with dialects from the Zeeland group, further to the south.

The ﬁrst group contains places from the east of Belgium (Figure 18). Most

dialects belong to the Limburg group on Daan’s map, two belong to the Brabant

group and one belongs to the group of dialects between Brabant and Limburg.

This group in the tree seems to be a coherent group of Limburg dialects with

some other dialects which are geographically very close.

Figure 18: Equal model. This subtree contains dialects from the east of Belgium.

The red dialects belong to the Limburg group, the green dialects belong to the

Brabant group, the yellow dialect belongs to the group of dialects between

Brabant and Limburg.

The second big group consists solely of Brabant dialects (Figure 19). There is

a division into subgroups that are geographically close to each other. There

are only two small groups of Brabant dialects that are not included in this big

group and are connected separately in the tree. The grouping is stronger than

in the Netherlandic tree.

Figure 19: Equal model. This subtree consists of the Brabant dialects.

The third group consists of dialects from the west of Belgium and northern France. The dialects fall roughly into three groups from Daan’s map: Western Flemish, Eastern Flemish and dialects between Western and Eastern Flemish. As can be seen in Figure 20, the tree is nicely subdivided into these three groups.

Figure 20: Equal model. This subtree contains dialects from the west of Bel-

gium. The red dialects belong to the Western Flemish group, the green dialects

belong to the Eastern Flemish group, the yellow dialects belong to the group of

dialects between the Western and Eastern Flemish dialects.

The consensus tree under the gamma model shows the same three main groups.

The only diﬀerence is that some dialects have split from the bigger groups and

formed a smaller group.

4 Discussion

The data from the RND which was used seems to have given a reliable set of

basic words. However, there is no guarantee that the words are used as often in

one area as in another area. The closest approximation to a list of words that

is used in every area would be a Swadesh list.

The alignment has been done using a system of sound classes, which gives good results. The quality of the alignments could possibly become even higher. To focus on phonology and filter out morphological effects, the stems of compound words could all be put in the same order. Furthermore, lemmas which contain words from different cognate sets could be split into several lemmas: one for every cognate set. Finally, more combinations of phones and their diacritics could be added to the token list. The diacritics that were not listed in Hoppenbrouwers and Hoppenbrouwers (1988) were not taken into account in the current alignment.

As a summary of the tree sample, consensus trees were used. A characteristic of the consensus tree is that it is not guaranteed to be a real tree from the sample, but a reconciliation of the trees. Contradicting branchings are resolved by placing a branch at a lower level. Other tree summaries are the maximum probability tree (Nichols and Warnow, 2008) and the maximum clade credibility tree (Dunn, 2008). Both methods pick a tree which exists in the tree sample: the tree with the highest probability or the tree with the highest sum of probabilities of the branchings, respectively. The phylogenetic program that was used (MrBayes) is not designed for the creation of these trees. It was possible to obtain a maximum probability tree topology, but without branch lengths. Furthermore, it should be possible to create a maximum clade credibility tree with an external program, but this did not succeed. For these reasons, only consensus trees were used in my analysis.

Although bayesian inference is a quantitative method, which draws conclusions

from large amounts of data, the evaluation of the method in this thesis was

done qualitatively. Ideally, an objective measure of distance between a bayesian

inference tree and Daan’s dialect map would be used. Both representations

would then have to be converted to the same format. Zhang and Shasha (1989) propose an algorithm for computing the edit distance between trees. Implementing this algorithm and processing the tree data in such a way that it could be read by the algorithm could be a direction for future research. It would also have to

be assessed whether the edit distance between language trees coincides with a

linguistic feeling of similarity between language trees.

In earlier quantitative dialect research (Heeringa and Nerbonne, 2006), the evaluation of the tree was also done qualitatively. However, new methods are being

applied to visualize the data in such a way that it is easier to do a human com-

parison with the gold standard. Nerbonne et al. (2011) present the Gabmap

package, which has, among other features, the possibility to project a dialect

tree onto a map.

Gamma-distributed and equal rate variation models were evaluated. The dif-

ferences in the resulting trees of the models were not very large. It seems that

for the current input format, long strings of phonological features, the choice of

the rate variation model is not of utmost importance.

The results for the bayesian inference on the Belgian dialects were better than

the results on the Netherlandic dialects. The bayesian analyses for the Belgian

dialects had better convergence rates than the analyses for the Netherlandic

dialects. It must be noted that the number of Belgian dialects was smaller than

the number of Netherlandic dialects (94 vs. 269 dialects), but they ran the same

number of generations. It may be that the Netherlandic analyses should have

run for more generations, to compensate for the large number of dialects. This

is however made unattractive by the long running times of the algorithm. There

could also be other reasons for the better performance on the Belgian data set.

For example, it could be the case that the Belgian data set had clearer divisions

between the dialects, making it easier to generate a tree.

5 Conclusion

The application of bayesian inference to the Netherlandic dialects performed well on local groups. Distinctive groups known from common linguistic theory were visible. The grouping was very accurate: hardly any groupings were made that were inconsistent with the dialect map by Daan and Blok. However, many dialects remained ungrouped. Also, local groups were not joined with other groups into higher-order families. This limited the explanatory power of the results.

The bayesian inference for the Belgian dialects gave surprisingly good results. Almost all dialects were grouped and higher-order grouping was apparent. There were a few large groups, each containing dialects from a bounded geographical area, e.g. the east of Belgium. These large groups were divided into smaller groups, which mostly followed the dialect groups from the dialect map by Daan and Blok.

All in all, bayesian inference seems to be a good addition to the tools used to determine the phylogeny of dialects. The performance of the method is not consistent enough for it to serve as the only method to create a dialect tree: in this thesis, it performed better on the Belgian dialects than on the Netherlandic dialects. However, once the results have been validated against a dialect map for the researched area, insights from bayesian inference can contribute to a fuller picture of dialect kinship. For example, even if a bayesian inference tree returns no higher-order groupings, local groupings (as in Figure 17) can give interesting clues about relationships between dialects.

6 Literature

Campbell, L. (1998). Historical linguistics: An introduction. MIT Press.

Daan, J. and Blok, D. (1969). Van Randstad tot Landrand: Toelichting bij de kaart: Dialecten en Naamkunde. Bijdragen en mededelingen der Dialectenkommissie van de Koninklijke Nederlandse Akademie van Wetenschappen. Noord-Hollandsche Uitgevers Maatschappij.

Dunn, M. (2008). Language phylogenies (in press). Routledge handbook of historical linguistics. http://pubman.mpdl.mpg.de/pubman/item/escidoc:1851319:5/component/escidoc:1851318/dunn-phylogenetic-approaches.pdf.

Heeringa, W. (2001). De selectie en digitalisatie van dialecten en woorden uit de Reeks Nederlandse Dialectatlassen. TABU, Bulletin voor Taalwetenschap, 31(1/2):61–103.

Heeringa, W. and Nerbonne, J. (2006). De analyse van taalvariatie in het Nederlandse dialectgebied: methoden en resultaten op basis van lexicon en uitspraak. Nederlandse Taalkunde, 11(3):18–257.

Hoppenbrouwers, C. and Hoppenbrouwers, G. (1988). De featurefrequentiemethode en de classificatie van Nederlandse dialecten. TABU, Bulletin voor Taalwetenschap, Jaargang 18, nummer 2, 1988. Retrieved from http://urd.let.rug.nl/nerbonne/papers/inferring-sound-changes-Prokic-et-al-2011-Diachronica.pdf on 15-06-2014.

Huelsenbeck, J., Ronquist, F., Nielsen, R., and Bollback, J. (2001). Bayesian

inference of phylogeny and its impact on evolutionary biology. Science,

294(5550):2310–2314.

Jansen, M. (2002). De dialecten van Ameland en Midsland in vergelijking met het Stadsfries. Us Wurk. Tydskrift foar frisistyk, 51:128–152. Retrieved from http://depot.knaw.nl/9683 on 25-06-2014.

Larget, B. and Simon, D. (1999). Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Molecular Biology and Evolution, 16:750–759.

List, J.-M. (2012). Multiple sequence alignment in historical linguistics: A sound class based approach. In Proceedings of ConSOLE XIX, pages 241–260.

List, J.-M. and Moran, S. (2013). An open source toolkit for quantitative historical linguistics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, August 4-9, Sofia, Bulgaria, pages 13–18.

Nerbonne, J., Colen, R., Gooskens, C., Kleiweg, P., and Leinonen, T. (2011). Gabmap – a web application for dialectology. Dialectologia, Special Issue II:65–89.

Nichols, J. and Warnow, T. (2008). Tutorial on computational linguistic phylogeny. Language and Linguistics Compass, 2(5):760–820.

Nieuweboer, R. (1998). The Altai dialect of Plautdiitsh (West-Siberian Mennonite Low German). Master’s thesis, University of Groningen.

Prokić, J., Gray, R., and Nerbonne, J. (2011). Inferring sound changes using Bayesian MCMC. Submitted to Diachronica, 1/2011. Retrieved from http://urd.let.rug.nl/nerbonne/papers/inferring-sound-changes-Prokic-et-al-2011-Diachronica.pdf on 15-06-2014.

Rambaut, A. (2013). FigTree 1.4.1: Tree figure drawing tool.

Ronquist, F. and Huelsenbeck, J. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19:1572–1574.

Ronquist, F., Huelsenbeck, J., and Teslenko, M. (2011). Draft MrBayes version 3.2 manual: Tutorials and model summaries. Retrieved from http://mrbayes.sourceforge.net/mb3.2_manual.pdf on 25-06-2014.

Theiling, H. (2008). cxs2ipa: An X-SAMPA to IPA converter. Retrieved from http://www.theiling.de/ipa/ on 16-06-2014.

Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.

7 Appendix

7.1 Tree of Netherlandic dialects – equal rate variation

The tree spans two pages.

7.2 Tree of Netherlandic dialects – gamma rate variation

The tree spans two pages.

7.3 Tree of Belgian dialects – equal rate variation

[Figure: tree of the Belgian dialects under the equal rate variation model; scale bar 0.06. The branching structure is not recoverable from the extracted text. Leaf labels, in extraction order: Tienen, Kalken, Kinrooi, Bekegem, Nazareth, Diest, Geraardsbergen, Overpelt, Gent, Kieldrecht, Kapelle-Broek, Gits, Kortrijk, Thisselt, Kampenhout, Itegem, Eupen, Oostende, Hingene, Hondegem, Overijse, Brugge, Warhem, Houthulst, Houthalen, Balen, Reninge, Zevendonk, Steenbeek, Geel, Wijnegem, Grimbergen, Assenede, Waregem, Moerbeke, Heldergem, Vreren, Ingooigem, Baelen, Lebbeke, Boom, Essen, Lochristi, Hekelgem, Werchter, Velm, Oostkamp, Arendonk, Lot, Raeren, Woesten, Diepenbeek, Wemmel, Zolder, Ronse, Gierle, Oelegem, Kalmthout, Blankenberge, Zwevegem, Aubel, Moerkerke, Alveringem, Humbeek, Roeselare, Oostkerke, Damme, Wingene, Zomergem, Bree, Boutersem, Zelzate, Meerhout, Lippelo, Nukerke, Beveren, Bellegem, Aarschot, s-Gravenvoeren, Zandvliet, Middelkerke, Herselt, Nieuwkerke, Gistel, Aalst, Buggenhout, Vertrijk, Moorslede, Veurne, Bollezeele, Lauw, Bottelare, Rijkevorsel, Mechelen.]

7.4 Tree of Belgian dialects – gamma rate variation

[Figure: tree of the Belgian dialects under the gamma rate variation model; scale bar 0.2. The branching structure is not recoverable from the extracted text. Leaf labels, in extraction order: Mechelen, Zomergem, Grimbergen, Middelkerke, Damme, Bellegem, Gierle, Aarschot, Bekegem, Heldergem, Steenbeek, Oostkerke, Diest, Raeren, Oelegem, Hondegem, Kortrijk, Zelzate, Vreren, Wingene, Nieuwkerke, Reninge, Roeselare, Moerkerke, Houthalen, Nazareth, Thisselt, Boutersem, Itegem, Hekelgem, Balen, Ingooigem, Werchter, Houthulst, Bottelare, Wijnegem, Blankenberge, Velm, Woesten, Kalken, Kinrooi, Baelen, Hingene, Assenede, Wemmel, Boom, Overijse, Kieldrecht, Rijkevorsel, Geraardsbergen, Kalmthout, Lochristi, Lot, Lebbeke, Aalst, Humbeek, Diepenbeek, Warhem, Zolder, Kampenhout, Aubel, Kapelle-Broek, Essen, Arendonk, Herselt, Gent, Bollezeele, Buggenhout, Zwevegem, Lauw, Zandvliet, Oostende, Bree, Moerbeke, Meerhout, Geel, Eupen, Lippelo, Alveringem, Brugge, Oostkamp, Nukerke, Ronse, Overpelt, Veurne, Tienen, Waregem, Beveren, Zevendonk, Gistel, s-Gravenvoeren, Vertrijk, Gits, Moorslede.]

7.5 Dialect map by Daan and Blok (1969)

The first page shows the map, the second page shows the legend.
