
Networks uncover hidden lexical borrowing in Indo-European language evolution

doi: 10.1098/rspb.2010.1917
Proc. R. Soc. B 2011 278, 1794-1803, first published online 24 November 2010
This journal is © 2011 The Royal Society
Downloaded from rspb.royalsocietypublishing.org on May 16, 2011
Shijulal Nelson-Sathi¹, Johann-Mattis List², Hans Geisler²,
Heiner Fangerau³, Russell D. Gray⁴, William Martin¹
and Tal Dagan¹,*
¹ Institute of Botany III, and ² Faculty of Philosophy, Heinrich-Heine University Düsseldorf, Germany
³ Institute of the History, Philosophy and Ethics of Medicine, Ulm University, Germany
⁴ Department of Psychology, University of Auckland, Auckland 1142, New Zealand
Language evolution is traditionally described in terms of family trees with ancestral languages splitting
into descendent languages. However, it has long been recognized that language evolution also entails hori-
zontal components, most commonly through lexical borrowing. For example, the English language was
heavily influenced by Old Norse and Old French; eight per cent of its basic vocabulary is borrowed.
Borrowing is a distinctly non-tree-like process—akin to horizontal gene transfer in genome evolution—
that cannot be recovered by phylogenetic trees. Here, we infer the frequency of hidden borrowing
among 2346 cognates (etymologically related words) of basic vocabulary distributed across 84 Indo-
European languages. The dataset includes 124 (5%) known borrowings. Applying the uniformitarian
principle to inventory dynamics in past and present basic vocabularies, we find that 1373 (61%) of the
cognates have been affected by borrowing during their history. Our approach correctly identified 117
(94%) known borrowings. Reconstructed phylogenetic networks that capture both vertical and horizontal
components of evolutionary history reveal that, on average, eight per cent of the words of basic vocabulary
in each Indo-European language were involved in borrowing during evolution. Basic vocabulary is often
assumed to be relatively resistant to borrowing. Our results indicate that the impact of borrowing is far
more widespread than previously thought.
Keywords: community structure; lateral transfer; phylogenetics
Genome evolution and language evolution have a lot in
common. Both processes entail evolving elements—
genes or words—that are inherited from ancestors to
their descendants. The parallels between biological and
linguistic evolution were evident both to Charles
Darwin, who briefly addressed the topic of language
evolution in The origin of species [1], and to the linguist
August Schleicher, who in an open letter to Ernst
Haeckel discussed the similarities between language
classification and species evolution [2]. Computational
methods that are currently used to reconstruct genome
phylogenies can also be used to reconstruct evolutionary
trees of languages [3,4]. However, approaches to
language phylogeny that are based on bifurcating trees
recover vertical inheritance only [3,5–7], neglecting
the horizontal component of language evolution
(borrowing). Horizontal interactions during language
evolution can range from the exchange of just a few
words to deep interference [8]. In previous investigations,
which focused only on the component of
language evolution that is described by a bifurcating
tree [3,5–7], the extent of borrowing might therefore
have been overlooked.
Lexical borrowing is the transfer of a word from a
donor language to a recipient language as a result of a cer-
tain kind of contact between the speakers of the two
languages [9]. This is one of the most common types of
interaction between languages. Lexical borrowing can
be reciprocal or unidirectional, and occurs at variable
rates during evolution. Factors affecting the rate of lexical
borrowing during evolution include the intensity of contact
between the speakers of the respective languages,
the genetic or typological closeness of the languages
(which facilitates the inclusion of foreign words), the
number of bi- or multilingual speakers in the respective
linguistic communities, or a combination thereof
[10,11]. For example, English has been heavily influenced
throughout its history by other languages such as Old
Norse and Old French [12]; it has been estimated that
8 per cent of its basic vocabulary is borrowed from
those languages [13]. Icelandic, on the other hand, has
preserved most of its original words [14].
A key part of inferences in historical linguistics is the
identification of cognate sets. These are sets of words
from different languages that are etymologically related.
The words in a cognate set are derived from a single
common ancestral form that was present in an ancestral
language. Cognate judgement is an arduous enterprise
since it includes the complete evolutionary reconstruction
of all words in the sampled languages for a certain
concept. Historical linguists usually make use of an
in-depth analysis of structural resemblances between the
word forms, looking for sound correspondences in
specific environments. The identification of a cognate is
thus much more than just a hunt for resemblant forms
or ‘lookalikes’. Only a set of words that have regular
sound correspondences provides good evidence for genealogical
relatedness and thus only these words can be
grouped into a single cognate set (COG). For example,
the concept ‘tooth’ has a cognate set that unites English
tooth, German Zahn, Italian dente and French dent as
etymologically related (figure 1). However, similar
word forms can arise not only by inheritance, but also
by lexical borrowing. Unfortunately, the further we go
back in time, the more difficult it becomes to distinguish
inheritance from transfer, and reconstructed COGs may
include hidden borrowing events that are erroneously
coded as vertical inheritance.

*Author for correspondence (
Electronic supplementary material is available at
10.1098/rspb.2010.1917 or via
Proc. R. Soc. B (2011) 278, 1794–1803
Published online 24 November 2010
Received 6 September 2010
Accepted 3 November 2010
This journal is © 2010 The Royal Society
Lexical borrowing is a non-tree-like evolutionary event
that cannot be reconstructed using phylogenetic trees that
are common in evolutionary biology [15,16]. Linguists
have long been aware of the problems that borrowing
introduces. At about the same time that Darwin
suggested the tree metaphor for the evolution of species
in 1859 [1], August Schleicher introduced the family
tree to linguistics [17]. A few years later, his model was
rejected by several scholars arguing against the use of a
simple tree model to describe the evolution of languages,
which they noted to be reticulated by nature [18,19].
Other non-tree-like models were proposed by linguists
to study language evolution—including waves [18,20]
and networks [21]—but they lacked either quantitative
parameters, historical dimensions or both. At the other
extreme, quantitative estimates for language divergence
lacked an explicit model to explain language relatedness
[22,23]. Apart from some sporadic attempts to visualize
language evolution of specific words by a combination
of a bifurcating family tree with the non-tree-like
component superimposed on it [24], linguists have, for
lack of better alternatives, largely stuck to the tree
model, while emphasizing its inadequacies.
Phylogenetic methods that were developed to take into
account horizontal transfer of genes during microbial
evolution offer an alternative model for the horizontal
aspects of language evolution. Recent years have witnessed
several applications of reticulated trees and split
networks to language evolution [25–28], yet none of
these have either specifically uncovered borrowing
events or delivered an estimate for the borrowing frequency
during language evolution. Here, we apply
phylogenetic networks to recover the frequency of
hidden borrowings during the evolution of Indo-
European languages using the criterion of word inventory
dynamics over time, proposing a general model for
language evolution that includes both vertical and
horizontal components of word transfer during evolution.
Here, we used two publicly available cognate datasets: Dyen
[29] and Tower of Babel (ToB) [30]. For the analysis, all
COGs in both datasets are converted into a binary pres-
ence/absence pattern (PAP). A PAP within the Dyen
dataset includes 84 digits; if a cognate set includes one
or more words from language $i$, then digit $x_i$
in its corresponding pattern is ‘1’; otherwise, it is ‘0’. The same
conversion method is used for the ToB dataset where the
PAPs include 73 digits.
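
The conversion described above can be illustrated with a minimal Python sketch. The function name, the dictionary layout and the second toy cognate set are assumptions made for illustration only; the ‘tooth’ entry follows the example in figure 1.

```python
from typing import Dict, List, Set


def cognate_sets_to_paps(cognate_sets: Dict[str, Set[str]],
                         languages: List[str]) -> Dict[str, str]:
    """Map each cognate set to a binary presence/absence pattern (PAP).

    Digit i of a pattern is '1' if the cognate set contains at least one
    word from languages[i], and '0' otherwise.
    """
    paps = {}
    for cog_id, languages_with_word in cognate_sets.items():
        paps[cog_id] = "".join(
            "1" if language in languages_with_word else "0"
            for language in languages
        )
    return paps


# Toy input (hypothetical layout): four languages instead of 84 or 73.
languages = ["English", "German", "Italian", "French"]
cognate_sets = {
    "tooth": {"English", "German", "Italian", "French"},
    "hypothetical_romance_only": {"Italian", "French"},  # invented example
}
paps = cognate_sets_to_paps(cognate_sets, languages)
```

For the real datasets the same routine would simply be run with the 84 (Dyen) or 73 (ToB) language labels in a fixed order.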
[Figure 1: tree diagram with Proto-Germanic (German, English) and Latin (Italian dente, French dent).]
Figure 1. Etymological reconstruction of the concept ‘tooth’. The English and German word forms have descended from the
Proto-Germanic ancestor [52]. The Italian and French words are descendants of Latin, and the Proto-Germanic and Latin
forms stem from Proto-Indo-European [43,53].

(b) Shared COGs network
The number of shared COGs between each language pair is
calculated as the number of cognate sets in which both
languages are present. A division of the network into modules
is based on maximizing a modularity function defined as the
number of edges within a community minus the expected
number of edges [31]. Initially, an optimal division into
two components is found by maximizing this function over
all possible divisions by using spectral optimization, which
is based on the leading eigenvector of the matching modularity
matrix. To further subdivide the network into more than
two modules, additional subdivisions are made, each time
comparing the contribution of the new subdivision with the
general modularity score of the entire network. This process
is carried out until there are no additional subdivisions that
will increase the modularity of the network as a whole [31].
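
The modularity criterion can be sketched in a few lines of Python. This is not the spectral method used in the study: for readability, the sketch below swaps the leading-eigenvector optimization for an exhaustive search over all bisections, which is feasible only for toy networks. The function names and the six-vertex example graph are assumptions.

```python
from itertools import combinations


def modularity(adj, membership):
    """Newman modularity Q of a division of a (weighted) undirected network.

    Q sums, over vertex pairs in the same module, the observed edge weight
    minus the weight expected from the vertex degrees, scaled by 2m.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if membership[i] == membership[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


def best_bisection(adj):
    """Exhaustively search for the two-module division with maximal Q."""
    n = len(adj)
    best_q, best_membership = 0.0, [0] * n  # Q = 0 for the undivided network
    # Vertex 0 stays in module 0 so mirrored divisions are not visited twice.
    for size in range(1, n):
        for group in combinations(range(1, n), size):
            membership = [1 if v in group else 0 for v in range(n)]
            q = modularity(adj, membership)
            if q > best_q:
                best_q, best_membership = q, membership
    return best_q, best_membership


# Toy network: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1
q, membership = best_bisection(adj)
```

A further subdivision would be accepted only if it raises the modularity of the network as a whole, as described in the text.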
(c) Reference trees
Language trees were inferred by a Bayesian approach using
MRBAYES [32] as detailed by Gray & Atkinson [3]. In
addition, neighbour-joining (NJ) trees [33] were recon-
structed from Hamming distances using SPLITSTREE [34].
A reference tree with English internal to the Germanic
clade was produced manually from the Bayesian tree. A ran-
domized reference tree for the Dyen dataset was produced by
randomizing the language names in the Bayesian reference
tree. Trees are available in Newick format at http://www.
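
As a rough illustration of the distance input to the NJ reconstruction, Hamming distances between languages can be computed from their presence/absence profiles as follows. This is a sketch under assumptions: the profiles are toy data, the labels are hypothetical, and raw mismatch counts are used because the text does not state whether the distances were normalized.

```python
def hamming(profile_a, profile_b):
    """Number of cognate sets present in one language but not in the other."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same cognate sets")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def distance_matrix(profiles):
    """Pairwise Hamming distances keyed by (language, language)."""
    names = sorted(profiles)
    return {(a, b): hamming(profiles[a], profiles[b])
            for a in names for b in names}


# Toy profiles: one character per cognate set, '1' = present in the language.
profiles = {"L1": "11010", "L2": "11001", "L3": "00111"}
distances = distance_matrix(profiles)
```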
(d) Borrowing models and the minimal lateral network
In the loss-only (LO) model, all COGs are assumed to have
originated at the root of the reference tree. The loss events for
each COG are estimated by using a binary recursive PERL
algorithm that scans the reference tree and infers the mini-
mum number of losses [35]. When a COG is absent in a
whole clade, a single loss event is inferred in the common
ancestor of that clade. In the single-origin (SO) model,
each cognate is assumed to have originated at its first occur-
rence on the reference tree. A binary recursive algorithm
scans the reference tree from root to tips to identify the
first ancestral node that is the common ancestor of all cog-
nate ‘present’ cases.
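
The LO and SO inferences can be sketched as two short recursions in Python. This is not the PERL implementation cited above; the nested-tuple tree encoding, the function names and the four-language toy tree are assumptions made for illustration.

```python
def leaves(tree):
    """Set of leaf labels below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def count_losses(tree, present):
    """Minimum number of losses below `tree`, assuming the cognate set
    existed at the root of `tree` (the loss-only assumption)."""
    if not (leaves(tree) & present):
        return 1  # the whole clade lacks the cognate: one loss at its ancestor
    if isinstance(tree, str):
        return 0
    return sum(count_losses(child, present) for child in tree)


def single_origin(tree, present):
    """Scan from root to tips for the most recent node that is a common
    ancestor of all languages in which the cognate set is present."""
    if isinstance(tree, str):
        return tree
    for child in tree:
        if present <= leaves(child):
            return single_origin(child, present)
    return tree


# Toy reference tree (hypothetical topology, not the Bayesian tree).
tree = (("English", "German"), ("Italian", "French"))

romance_only = {"Italian", "French"}   # congruent with the tree
patchy = {"English", "French"}         # incongruent with the tree

lo_losses_romance = count_losses(tree, romance_only)
so_origin_romance = single_origin(tree, romance_only)
so_losses_romance = count_losses(so_origin_romance, romance_only)
lo_losses_patchy = count_losses(tree, patchy)
so_origin_patchy = single_origin(tree, patchy)
```

In this toy case the tree-congruent set needs one loss under LO and none under SO, whereas the patchy set is pushed back to the root under both models.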
In the BOR1 model, each cognate is allowed to have two
word origins, where one is a borrowing. A preliminary origin
is inferred as in the SO model, followed by a renewed search for a
cognate origin in each of the two clades branching from the
preliminary origin node. If the hypothetical taxonomic unit
that was inferred as the preliminary origin has no cognate
‘absent’ descendants, the cognate is inferred to have an SO.
Once the nodes of the two origins are set, losses are inferred
as in the LO model.
We tested additional models allowing four, eight and 16
origins, where one is an origin, and the rest are borrowings.
These are implemented in the same way as in the BOR1
model, except that the origin search is iterated. For example,
a search for origins under the BOR3 model entails (i) a
search for a preliminary origin (as in the SO model), (ii) a
search for the next origin in descendants (as in the BOR1
model) and, (iii) for each next origin, another search. If an
origin has no cognate-absent descendants, the number of ori-
gins inferred is smaller than the maximum allowed. Ancestral
vocabulary size at a certain internal node is inferred as the
total COG origins that were inferred to occur at that node.
The distributions of ancestral and modern vocabulary
sizes were compared by using the Wilcoxon non-parametric
test [36].
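
The iterated origin search can be sketched as follows. This is one possible reading of the description above, not the original implementation: it assumes a strictly binary reference tree encoded as nested tuples, and it assumes that the allowance of origins is halved at each split, so that limits of 2, 4, 8 and 16 origins correspond to BOR1, BOR3, BOR7 and BOR15. All names are hypothetical.

```python
def leaves(tree):
    """Set of leaf labels below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def single_origin(tree, present):
    """Most recent common ancestor of all 'present' languages (SO model)."""
    if isinstance(tree, str):
        return tree
    for child in tree:
        if present <= leaves(child):
            return single_origin(child, present)
    return tree


def infer_origins(tree, present, max_origins):
    """Origins of one cognate set when up to `max_origins` are allowed.

    A node with no cognate-absent descendants is kept as a single origin,
    so fewer origins than the maximum may be returned.
    """
    node = single_origin(tree, present)
    if max_origins < 2 or isinstance(node, str) or leaves(node) <= present:
        return [node]
    origins = []
    for child in node:  # both children contain at least one 'present' leaf
        present_in_child = present & leaves(child)
        if present_in_child:
            origins += infer_origins(child, present_in_child, max_origins // 2)
    return origins


# Toy eight-language reference tree with placeholder labels.
tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))

bor1_origins = infer_origins(tree, {"A", "E", "F"}, max_origins=2)
so_like = infer_origins(tree, {"A", "B"}, max_origins=2)
bor3_origins = infer_origins(tree, {"A", "C", "E", "G"}, max_origins=4)
```

The number of borrowing events inferred for a cognate set is then the number of returned origins minus one.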
The minimal lateral network (MLN) [37] is calculated for
each dataset by the allowance model that was statistically
accepted by the test described above. The MLN comprises
the reference tree, with additional information of the vocabu-
lary size in all internal nodes. Lateral cognate sharing among
internal and external nodes is summarized in a 167 167
matrix that includes all tree nodes, where a
of laterally shared COGs between nodes iand j. The MLN
is then depicted by an in-house script using MATLAB.
(a) Community structure in the network of shared cognate sets
For the study of evolution by borrowing, we analysed two
independent, publicly available collections of cognate sets
from Indo-European languages. Both datasets comprise
words from individual languages or dialects correspond-
ing to concepts that are included in Swadesh lists [38].
Basic concepts are expressed by simple words rather
than compounds or phrases and contain names for body
parts, pronouns, common verbs and numerals, but
exclude technological words and words related to specific
ecologies or habitats. Words expressing basic concepts are
supposed to exist in all languages and thus may serve as a
tertium comparationis for language comparison [39].
Moreover, basic concepts are rarely replaced by other
words, either through external (lexical borrowing) or
internal factors (semantic shift) [13,16].
The Dyen dataset [29] includes word forms for 84
languages (including Greek, Armenian, Celtic, Romance,
Germanic, Slavic, Albanian and Indo-Iranian languages)
corresponding to 200 basic vocabulary concepts [39]
sorted into 2346 COGs [3]. While obvious borrowings
were excluded in the original Dyen dataset [29], we
used an edited version where 124 marked borrowings
are coded into their respective COGs [25]. Detailed rein-
spection of Romance cognates revealed an additional six
hidden borrowings [40] (electronic supplementary
material, table S1).
The second dataset is based on etymological diction-
aries and Swadesh lists published by the ToB project
[30]. It is based on word forms for 110 basic vocabulary
items for a total of 98 languages from which we extracted
73 contemporary ones, including languages from the
Celtic, Romance, Germanic, Slavic, Albanian and Indo-
Iranian branches of Indo-European, sorted into 722
COGs. Detectable borrowings were excluded in the orig-
inal database; however, a recent detailed screening
revealed five undetected borrowings within Romance
languages [40].
A network analysis of the distribution of cognate word
forms across Indo-European languages should provide
new insights into the frequency and distribution of bor-
rowing in Indo-European language history. Networks
are mathematical structures used to model pairwise
relations between entities. The entities are called vertices
and they are linked by edges that represent the connec-
tions or interactions between the vertices. A network
of $N$ vertices can be fully defined by the matrix $A = [a_{ij}]_{N \times N}$,
with $a_{ij} \neq 0$ if a link exists between nodes $i$ and $j$, and
$a_{ij} = 0$ otherwise. In the study of Indo-European languages,
each language is represented by a vertex, $i$, whereas the elements
of the matrix, $A$, correspond to the number of shared cognate sets
between language pairs, $a_{ij}$. Cognate sharing can result either from
vertical inheritance or from borrowing.
For network reconstruction, cognate sets were con-
verted into a binary format of PAPs for each COG in
each language [3]. For the 2346 COGs in the Dyen data-
set [29], 1169 different PAPs were observed, of which
942 (80%) are unique and 227 are recurring
(figure 2a). Closely related languages typically share the
most frequent PAPs. For example, Panjabi and Lahnda,
two Indian languages, share 78 cognates that are unique
to both languages. The ToB dataset includes 532 different
PAPs, none of which are unique (electronic supplemen-
tary material, figure S1). The frequency of shared
COGs among languages in the main branches uncovers
components of both inheritance and borrowing.
The binary PAPs of the Dyen COGs are readily
assorted into an 84 × 84 matrix representation of the cognate-sharing
network that consists of vertices (languages)
connected by edges (shared cognates); the edge weights
are the number of shared cognates per vertex pair.
There are 3486 edges in the network, so every pair of
vertices is connected, thereby forming a ‘clique’ in network
terms (figure 2b). Some groups of languages are
more strongly interconnected among themselves than
with others in the cognate-sharing network, thereby
forming communities.
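
A minimal sketch of how the weighted cognate-sharing matrix can be assembled from the PAPs is given below. The function names and the three toy patterns are assumptions; the final line only checks the arithmetic that a fully connected network of 84 vertices has 84 × 83 / 2 = 3486 edges.

```python
def shared_cog_matrix(paps):
    """Weighted adjacency matrix: entry [i][j] counts the cognate sets in
    which languages i and j are both present (the diagonal stays 0)."""
    n = len(paps[0])
    matrix = [[0] * n for _ in range(n)]
    for pap in paps:
        present = [i for i, digit in enumerate(pap) if digit == "1"]
        for i in present:
            for j in present:
                if i != j:
                    matrix[i][j] += 1
    return matrix


def count_edges(matrix):
    """Number of language pairs that share at least one cognate set."""
    n = len(matrix)
    return sum(1 for i in range(n) for j in range(i + 1, n) if matrix[i][j] > 0)


# Toy patterns for four languages (invented for illustration).
toy_paps = ["1111", "0011", "1100"]
matrix = shared_cog_matrix(toy_paps)
n_edges = count_edges(matrix)

# Edge count of a clique on 84 vertices, as reported for the Dyen network.
clique_edges_84 = 84 * 83 // 2
```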
We examined the community structure in the network
by division into modules [31,41]. Modules correspond to
‘natural’ groups within a network, that is, groups of ver-
tices that are more highly connected to each other than
they are to other vertex sets. With only two exceptions,
the nine modules calculated within the cognate-sharing
network correspond exactly to the main branches of
Indo-European languages. One exception concerns the
Armenian dialects Adapazar (Armenian List in Dyen dataset
[42]) and eastern modern Armenian (Armenian Mod
in Dyen dataset [42]), which are grouped with the
Greek languages into one module. This is because Armenian
shares significantly (p < 0.01, using the Wilcoxon
test) more cognates with the Greek languages (30 ± 2,
n = 5) than with the other languages (22 ± 3, n = 79).
This module has been independently recognized by lin-
guists [43]. The other exception is the split of both Irish
dialects from Celtic (figure 2c). The same network-
based analysis of the ToB dataset yields only four
modules: (i) Slavic and Albanian; (ii) Armenian, Greek,
Celtic, Germanic and Romance; (iii) Indo; and (iv)
Iranian (electronic supplementary material, figure S2).
Language communities that do not correspond to
monophyletic clades in the tree are the result of patchy
COG distributions that could not be reconciled with the
phylogenetic tree. For example, Romani, which branches
with Indo-Iranian languages, shares 25 COGs with
Modern Greek, such as the COGs for ‘flower’ (Modern
Greek: λουλούδι (louloudi); Romani: lulugi) and ‘because’
(Modern Greek: επειδή (epeide); Romani: epidhi). Since
the Romani dialect in the Dyen dataset [29] is a variety
spoken in Greece [42], these are probably borrowed
from Greek to Romani.
[Figure 2: (a) cognate presence/absence patterns; (b) shared COGs matrix with colour bar; (c) modules.]
Figure 2. Modules in the shared COGs network. (a) A graphic representation of cognate PAPs. Languages are sorted by their
order on the reference phylogenetic tree [3]. COGs are sorted by their size in ascending order. A presence case of a certain
COG in a certain language is coloured in blue if the COG pattern is congruent with the tree branching patterns and red
otherwise. (b) A matrix representation of the shared COGs network in Indo-European languages. Cells in the matrix are
edges in the network. Edges are colour-coded by the frequency of shared cognates according to the colour bar at the
bottom. The languages in the matrix are sorted by order of appearance in the phylogenetic tree on the left. (c) Modules
within the shared COGs network. Languages included in the same module are coloured in the same colour.

(b) Borrowing frequency during Indo-European language evolution
In the Dyen dataset, there are 1391 (59%) patchily distributed
PAPs that are incongruent with the tree
branching pattern (figure 2a). In principle, such patchy
COG distributions could arise solely through indepen-
dent parallel evolution, through vertical inheritance
from the common ancestor of all languages and differen-
tial loss of lexica during language evolution, or via lexical
borrowing among languages. The first possibility seems
sufficiently unlikely as to exclude a priori. There is no
clear estimation for the frequency of parallel evolution
during language evolution, but we can assume that it is
rather rare and cannot, therefore, be used to explain the
distribution pattern of all patchy COGs. If we invoke
the second scenario to explain all COGs of patchy distri-
bution, then the result is a common ancestral language
that includes each and every COG existing in contempor-
ary languages. In order to entertain such a claim, one
would have to assume that the proto-language employed
many different, but redundant, words for the same basic
concepts, far more than every known contemporary
language. This runs contrary to uniformitarianism, a
key principle in historical sciences such as geology,
biology and linguistics, which states that processes in
the past should not be assumed to differ fundamentally
from those observed today [44,45]. Hence, if ancient
and modern languages were of similar nature, then the
number of words that were used to express fundamental
concepts (basic vocabulary size) in ancestral languages
should be similar to that used in contemporary languages.
This principle can be used to infer the minimum amount
of lexical borrowing in Indo-European languages that is
required in order to bring the distribution of basic voca-
bulary size in ancestral languages into agreement with
that of contemporary languages.
This network method to address non-tree-like patterns
of shared characters requires the use of a reference tree
[37]. Here, we use a phylogenetic tree reconstructed by
a Bayesian approach [3]. First, we designate an evolutionary
scenario that uses vertical inheritance and loss only
(the LO model), according to which the current COG distribution
is governed solely by loss. Each ancestral language contains
all cognates present in its descendants, and
vocabulary size hence becomes progressively larger back
through time (figure 3a). Note that a loss event applies
only to the sample of basic vocabulary and does not
mean a loss from the language as a whole. With the
Dyen dataset [29] and the reference tree, the common
Indo-European ancestor would have had a vocabulary
size of 2346 for basic words, expressing 200 basic concepts.
This estimate is 11 times larger than the average
basic vocabulary size in our sample (p = 1.05 × 10^[…],
using the Wilcoxon test). Such large vocabulary sizes
are indeed unrealistic, but so is the assumption that new
words do not arise during language evolution. In the
SO model, we allow new words to arise over time, placing
the word origin at the most parsimonious place, that is, the
common ancestor of all COG-present cases (figure 3b).
This model results in smaller ancestral vocabularies of
up to 317 COGs, but these are still significantly larger
than the contemporary vocabularies (p = 1.65 × 10^[…],
using the Wilcoxon test). The SO model entails an average
of three losses per COG (electronic supplementary
material, table S2).
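
The comparison of ancestral and contemporary vocabulary sizes can be illustrated with a small rank-based test in Python. This is a sketch under assumptions: the text only names ‘the Wilcoxon test’, and the unpaired rank-sum variant is assumed here because ancestral and contemporary languages are different sets of nodes. The p-value uses the normal approximation without tie correction, and all vocabulary sizes below are invented toy numbers.

```python
import math


def rank_sum_p_value(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    pooled = sorted(
        [(value, "a") for value in sample_a] + [(value, "b") for value in sample_b]
    )
    rank_sum_a = 0.0
    position = 0
    while position < len(pooled):
        # Find the run of tied values and give each the average rank.
        end = position
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[position][0]:
            end += 1
        average_rank = (position + end) / 2 + 1  # ranks start at 1
        n_a_in_run = sum(1 for _, label in pooled[position:end + 1] if label == "a")
        rank_sum_a += average_rank * n_a_in_run
        position = end + 1
    n_a, n_b = len(sample_a), len(sample_b)
    u_stat = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_stat - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


# Invented vocabulary sizes, for illustration only.
contemporary = [200, 205, 210, 212, 215, 220]
ancestral_inflated = [400, 800, 1200, 1600, 2000, 2346]  # LO-like behaviour
ancestral_similar = [198, 207, 209, 214, 216, 221]       # BOR1-like behaviour

p_inflated = rank_sum_p_value(ancestral_inflated, contemporary)
p_similar = rank_sum_p_value(ancestral_similar, contemporary)
```

Under the uniformitarian criterion, a model would be accepted when this comparison shows no significant difference between the two distributions.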
Thus, we either have to embrace the untenable
assumption that ancestral vocabulary sizes were fun-
damentally different in the past than they are today
or, preferably, we have to allow some amount of borrow-
ing during evolution. We start by allowing only one
borrowing event per COG, the BOR1 model. This
model allows each COG to have two origins in the refer-
ence tree, one of which is by borrowing from any source
(figure 3c). The result of this model is reduced ancestral
vocabularies during the early evolution of languages,
and an overall ancestral vocabulary size distribution that
is not significantly different from that of contemporary
languages (p = 0.61, using the Wilcoxon test). Of the
total Dyen COGs, 918 (39%) are monophyletic, hence
their distribution is readily explained by an SO, while the
remaining 1373 (61%) are patchy enough to infer two origins
(one borrowing event). This frequency translates to
an average rate of 0.6 borrowing events per COG
during Indo-European language evolution.

[Figure 3: panels (a) loss only, (b) single origin, (c) BOR1, (d) BOR3; axes: basic vocabulary size, number of languages.]
Figure 3. Inference of borrowing frequency by ancestral vocabulary
size. (a–d) Schematic (left) and dynamics of ancestral
and contemporary vocabulary size (right) under the different
borrowing models. The fraction of interquartile range in the
different models is as follows. Loss only: 2.92; origin only:
1.93; BOR1: 0.12; BOR3: −0.86. Green triangles, origin;
red circles, loss; green circles, word presence; blue line, contemporary
languages; red line, ancestral languages.
If we allow up to three borrowings per COG (the
BOR3 model; figure 3d), inferred ancestral vocabulary
shrinks towards sizes that are again significantly different
from modern ones, but this time are smaller than those of
contemporary languages (p = 4.43 × 10^[…], using the
Wilcoxon test); that is, too much borrowing and not
enough vertical descent are incurred from the standpoint
of ancestral vocabulary sizes. Furthermore, under the
BOR3 model, the average number of inferred word
losses per COG is less than 1. But loss of COGs within
basic vocabulary occurs quite frequently in language evol-
ution [7], hence the BOR3 model is also unrealistic in
that sense. Additional models allowing up to 15 borrow-
ings per COG result in even smaller ancestral
vocabulary sizes (electronic supplementary material,
figure S3). Hence, ancestral basic vocabulary sizes
demand borrowings to keep them realistically small, but
too much borrowing makes them unrealistically small.
Testing the present evolutionary models with the help
of a reference tree that is inferred from the same data
might bias the inference of origin and loss events. How-
ever, using the Bayesian approach to reconstruct the
tree yields the majority signal in the data. If the majority
of COGs evolve mainly by vertical inheritance, then the
tree is expected to be a reliable representation of the
language phylogeny [46]. High frequency of borrowing
events may mask the vertical signal and lead to less
reliable reconstruction. To test the robustness of our bor-
rowing frequency estimates, we repeated our analysis
using various reference trees. Use of an alternative phylogenetic
tree reconstructed by NJ [33] results in the same
BOR1 model (p = 0.7, using the Wilcoxon test; electronic
supplementary material, figure S3). In both
reference trees, English is basal to the Germanic clade.
However, this position is debated among linguists, and
traditional classifications put English inside that clade
[12,47]. To test the influence of the English position
within the tree on our borrowing assessment, we tested
all models using a reference tree with English in an
internal position. Using that reference tree also yielded
the BOR1 model (p = 0.78, using the Wilcoxon test),
with all other models rejected (α = 0.05). Using a
random phylogenetic tree eliminates all patterns of
vertically inherited COGs and accordingly results in the
BOR15 model (p = 0.16, using the Wilcoxon test;
electronic supplementary material, figure S4).
Performing the same tests on the ToB dataset yielded
higher borrowing frequencies, with BOR3 being the
only statistically accepted model (p = 0.59, using the
Wilcoxon test; electronic supplementary material, figure
S5). Inference by this model results in 155 COGs of
SO, 181 COGs of two origins, 307 COGs of three origins
and 79 COGs of four origins. Hence, in 567 (79%) of the
722 COGs, we detected one or more borrowing event.
The average rate of borrowing events per COG during
language evolution in the ToB dataset is 1.4 (electronic
supplementary material, table S2). The higher borrowing
rate inferred for the ToB dataset in comparison to the
Dyen dataset might have to do with differences in their
reconstruction. The cognate judgements in ToB are
based on a deeper etymological reconstruction in com-
parison to the Dyen dataset. This results in more words
that are distributed over fewer cognate sets, which leads
to patchy COG distribution patterns that are frequently
incongruent with the phylogenetic tree.
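
The borrowing rates quoted for the two datasets follow directly from the reported counts, as the short calculation below shows. The variable names are hypothetical; the counts are those given in the text, with each origin beyond the first counted as one borrowing event.

```python
# Dyen dataset under the BOR1 model: each patchy COG has one borrowing event.
dyen_total_cogs = 2346
dyen_cogs_with_borrowing = 1373
dyen_rate = dyen_cogs_with_borrowing / dyen_total_cogs

# ToB dataset under the BOR3 model: number of COGs by number of origins.
tob_cogs_by_origins = {1: 155, 2: 181, 3: 307, 4: 79}
tob_total_cogs = sum(tob_cogs_by_origins.values())
tob_cogs_with_borrowing = tob_total_cogs - tob_cogs_by_origins[1]
tob_borrowing_events = sum(
    (n_origins - 1) * n_cogs for n_origins, n_cogs in tob_cogs_by_origins.items()
)
tob_rate = tob_borrowing_events / tob_total_cogs
tob_share_percent = 100 * tob_cogs_with_borrowing / tob_total_cogs

print(f"Dyen: {dyen_rate:.2f} borrowing events per COG")
print(f"ToB: {tob_cogs_with_borrowing} of {tob_total_cogs} COGs "
      f"({tob_share_percent:.0f}%), {tob_rate:.2f} events per COG")
```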
The sample of languages is crucial for the distinction
between COG origin by birth or borrowing because
what may seem to be a word birth within a given
sample of languages in our data could in fact be a borrow-
ing event from a non-sampled language. How severe is the
effect of external borrowing on our results? If we assume
the extreme case, for example, that all COGs in the data-
set originated by borrowing from external languages, then
we have to add one borrowing event to the average rate for
each COG. In that case, the average borrowing rate would
increase from 0.6 to 1.6 events per COG using the Dyen
dataset. However, this extreme scenario is unlikely
because it entails the assumption that the Indo-European
groups sampled here lacked the wherewithal to invent
even one new COG. Nonetheless, external borrowing
has almost certainly had an effect on these data. Although
we currently lack a dataset that would allow us to quantify
the rate of external borrowing, if we assume that it is
similar to the internal borrowing rate within our sample,
the overall borrowing rate would be double our current
estimate. Again we stress that the borrowing frequency
inferred from the present sample of languages using
our method delivers a minimum value (a conservative
lower bound).
Another aspect of the data sample used in our analysis
is the collection of cognates. Here, we study the dynamics
of vocabulary size during evolution through the proxy of
basic vocabulary (i.e. the Swadesh list). However, origin
and loss of words in the COGs sample can occur by
semantic shift where the word is present in the language
but absent from the sample. It is possible that different
meaning collections evolve under regimens different
from the ones described here. Application of similar
methods to study vocabulary size dynamics over time
using different cognate datasets will help to clarify
this issue.
Notwithstanding certain amounts of cognate misjud-
gements and parallel evolution [48] resulting in tree-
incompatible COG distributions, our inference uncovers
abundant, and hitherto unrecognized, borrowing during
the evolution of the Indo-European languages.
Scholars usually agree that nouns are more easily bor-
rowed than verbs [49]. When classified according to the
English gloss, the Dyen dataset includes 887 (53%) cog-
nate sets corresponding to nouns within basic vocabulary
and 766 (46%) cognate sets corresponding to verbs. A
total of 503 (53%) nominal cognate sets and 450 (47%)
verbal cognate sets were identified as including hidden
borrowing events. A comparison of these frequencies
shows that there is no significant difference in borrowing
frequencies between nouns and verbs (p = 0.4, using the
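The noun and verb counts above can be arranged as a 2 x 2 table (borrowed versus not borrowed) and compared directly. The name of the test is cut off in the text, so the sketch below assumes a Pearson chi-square test without continuity correction; that choice, and all function and variable names, are assumptions made for illustration.

```python
import math

# Counts quoted in the text for the Dyen dataset.
noun_total, verb_total = 887, 766
noun_borrowed, verb_borrowed = 503, 450

# Rows: nouns, verbs. Columns: with hidden borrowing, without.
observed = [
    [noun_borrowed, noun_total - noun_borrowed],
    [verb_borrowed, verb_total - verb_borrowed],
]


def pearson_chi_square(table):
    """Pearson chi-square statistic for a contingency table (no continuity correction)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    grand_total = sum(row_sums)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_sums[i] * col_sums[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic


chi2 = pearson_chi_square(observed)
# With one degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.3f}, p = {p_value:.2f}")
```

Under these assumptions the p-value comes out close to the reported 0.4, which is consistent with no significant difference between the two word classes.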
(c) Minimal lateral networks of Indo-European
COG distributions that do not map exactly onto the phy-
logenetic tree, with borrowing constrained by ancestral
vocabulary size only, constitute the MLN [37]. The MLN
reconstructed from the Dyen dataset consists of 167 ver-
tices, of which 84 are contemporary and 83 are ancestral
languages (internal nodes in the reference tree). The ver-
tices are interconnected either by the branches of the
reference tree, representing vertical inheritance, or by lat-
eral edges, representing horizontal transfer (figure 4a).
The internal and external vertices in the MLN for the
broad sample of COGs are linked by 666 lateral edges.
The connectivity (number of edges per vertex) within
the MLN ranges between 0 and 21 edges per language,
with a median of 7 (figure 4b). The most highly con-
nected node is Ossetic (21 edges), an east Iranian
language, which is connected with Indo-Iranian, Greek
and Slavic languages. Lateral edges connected to external
nodes correspond to comparatively recent borrowing
events. On average, 8 ± 7% of COGs per language are
involved in recent borrowing (electronic supplementary
material, table S3). This result suggests that English, at
an 8 per cent borrowing rate [13], is not exceptional; it is
merely the most studied language. The clustering coeffi-
cient of the MLN is 0.22, and the mean shortest path is
3.128 edges. Combined with the high level of clustering,
this means that the MLN forms a small-world network.
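The three network statistics quoted above (connectivity, clustering coefficient and mean shortest path) can all be computed from an adjacency list. The sketch below uses a hypothetical four-vertex toy graph, not the MLN itself. It assumes the average local clustering coefficient and unweighted path lengths, because the text does not state which definitions were used.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy graph: vertex -> set of neighbours (undirected).
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def connectivity(graph):
    """Number of edges attached to each vertex."""
    return {vertex: len(neighbours) for vertex, neighbours in graph.items()}


def mean_clustering(graph):
    """Average local clustering coefficient; vertices with < 2 neighbours count as 0."""
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        linked = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2 * linked / (k * (k - 1))
    return total / len(graph)


def mean_shortest_path(graph):
    """Mean breadth-first distance over all connected ordered vertex pairs."""
    total, pairs = 0, 0
    for source in graph:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in graph[current]:
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs


print(connectivity(toy_graph))
print(round(mean_clustering(toy_graph), 3), round(mean_shortest_path(toy_graph), 3))
```

A network is usually called small-world when its clustering is high relative to a random graph of the same size while its mean shortest path remains short, which is the combination reported for the MLN.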
The edge weight distribution within the MLN is
characterized by a majority of small edge weights. Of
the total edges, 422 (63%) are of a single laterally
shared COG, while edges of multiple COGs are rare
(figure 4c). The two heaviest lateral edges include an
edge between Slovene and the remaining Slavic languages
(28 COGs), and an edge between Romanian and the
remaining Romance languages (19 COGs). These lateral
edges uncover a certain kind of language change that
results from the same evolutionary process. Both Slovene
and Romanian, being heavily influenced by neighbouring
languages, underwent a process of linguistic revival start-
ing from the early 19th century, in which the original
Figure 4. The MLN of Indo-European languages. (a) An MLN for 84 contemporary languages reconstructed under the BOR1
model. Vertical edges are indicated in grey, with both the width and the shading of the edge shown proportional to the number
of inferred vertically inherited COGs along the edge (see the scale). The lateral network is indicated by edges that do not map
onto the vertical component, with the number of cognates per edge indicated in colour (see the scale). Lateral edges that link
ancestral nodes represent laterally shared COGs among the descendent languages of the connected nodes, whose distribution
pattern could not be explained by origin and LO under the ancestral vocabulary size constraint. The two heaviest edges of
Slovene (Slavic) and Romanian (Romance) are marked by an arrow. (b) Distribution of connectivity, the number of one-
edge-distanced neighbours for each vertex, in the network. (c) Frequency distribution of edge weight in the lateral component
of the network.
Table 1. Reconstructed borrowing events. The origin node
that includes the reinserted borrowing is shaded in light grey.
edge type origin node
number of reinserted
traits that had been lost during long periods of contact
were artificially reintroduced into the languages by the
speakers in order to bring them back to a stage of earlier
‘purity’ [50,51]. Before the 19th century, Slovene com-
prised several dialects spoken in the Alpine provinces of
the Austrian Empire, which were dominated by German
and Italian. Romanian, on the other hand, was heavily
influenced by neighbouring Slavic and Greek varieties,
with which it formed the so-called Balkan Sprachbund.
Along with the nationalist movements in Europe starting
from the end of the 18th century, both languages were
successively ‘purified’ by replacing the loanwords of
non-Slavic or non-Romance origin with ‘native’ words
from Slavic or Romance languages, respectively [50,51].
This process is somewhat different from the process of
borrowing as it was defined in the beginning of this
paper. It nonetheless illustrates additional horizontal
complexities in the processes of language evolution that
are readily detected in the MLN.
The comparison between the edges reconstructed using
the two reference trees that differ in their English position
supplies a few interesting observations regarding the appli-
cability of our approach to detect borrowing events. While
both reference trees yielded the same borrowing model
(i.e. the same overall borrowing rates), there are 23 lateral
edges connecting to English in the basal position and
only 15 lateral edges connecting to English in the internal
position. A closer inspection of the COGs in which the lateral
edges connecting to English were detected revealed that
edges inferred only with English in the basal position could
not be verified as borrowings by traditional historical
linguistics. Thus, using different reference trees with the
same COG distribution patterns does
not much affect the resulting borrowing model, but it
may increase the accuracy of concrete predictions made
by this approach (see electronic supplementary material,
table S4 for detailed etymological reconstruction of the
COGs). Consequently, the borrowing inference accuracy
in our approach is expected to increase with the accuracy
of the reference tree.
The MLN inferred from the ToB dataset shows similar
network characteristics, with the ancestors of Indian and
Iranian clades found also as highly connected nodes and
a majority (676; 76%) of single laterally shared COGs
(electronic supplementary material, figure S6).
Of the total 666 edges in the MLN reconstructed for
the Dyen dataset, 148 (22%) edges connect between two
external nodes—that is, between two contemporary
languages. The 301 (45%) edges that connect between
an internal node and an external node represent COGs
that are shared between a group and an outlier. The 217
(33%) edges that connect between two internal nodes rep-
resent COGs that are common to two different groups, yet
their distribution pattern could not be explained by vertical
inheritance alone under the vocabulary size criterion. As a
control to see whether our method is inferring spurious
borrowing, we examined the edges within cognates that
included the 124 reinserted borrowing events. In seven
cognates, the algorithm detected no borrowings, while in
all other 117 (94%) cognates a borrowing event was
inferred. In 59 (48%), the reinserted borrowing language
was inferred as an external node. In the remaining 58
(47%), reinserted borrowing languages were inferred
within descendants of an internal node (table 1).
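The percentages in this paragraph follow directly from the quoted counts. A short consistency check is sketched below; the dictionary keys are descriptive labels chosen here, not terms from the study.

```python
# Lateral edges in the MLN for the Dyen dataset, by the type of node they connect.
total_edges = 666
edge_types = {
    "external-external": 148,
    "internal-external": 301,
    "internal-internal": 217,
}
# The three edge types should account for every lateral edge.
assert sum(edge_types.values()) == total_edges
edge_shares = {kind: round(100 * n / total_edges) for kind, n in edge_types.items()}

# Control: cognates containing the reinserted borrowing events.
reinserted = 124
undetected = 7
as_external_node = 59
within_internal_node = 58
detected = as_external_node + within_internal_node
# Detected and undetected cases should add up to all reinserted events.
assert detected + undetected == reinserted

recovery = {
    "detected": round(100 * detected / reinserted),
    "as external node": round(100 * as_external_node / reinserted),
    "within internal node": round(100 * within_internal_node / reinserted),
}

print(edge_shares)
print(recovery)
```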
The data can address the issue of whether words are
exchanged more frequently within than between main
branches of Indo-European. We can compare the prob-
ability of a certain language to be laterally connected
with languages that are either from the same main branch
or from different main branches of the Indo-European
languages. With the exception of the Armenian branch,
the probability for a lateral edge within the branch (internal
edge) is considerably higher than between branches (exter-
nal edge). Furthermore, lateral edge weights are
significantly larger in internal lateral edges than in external
lateral edges (table 2). Hence, lexical borrowing in Indo-
European languages is much more frequent among
languages within the same branch in comparison to
languages from different branches. This provides new
evidence for the existence of certain cultural barriers to
lexical borrowing during language evolution [10].
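The within-branch versus between-branch comparison depends on how edge counts are normalised. The sketch below follows the normalisation described in the note to table 2 (internal edges per node inside the group, external edges per node outside it), applied to a hypothetical toy network. The node names, group assignments and edges are invented for illustration and are not taken from the MLN.

```python
# Hypothetical toy data: node -> branch label, plus lateral edges between nodes.
group_of = {
    "g1": "Greek", "g2": "Greek", "g3": "Greek",
    "s1": "Slavic", "s2": "Slavic",
    "i1": "Iranian",
}
lateral_edges = [("g1", "g2"), ("g2", "g3"), ("g1", "s1"), ("s1", "s2"), ("g3", "i1")]


def lateral_edge_frequencies(group, group_of, lateral_edges):
    """Return (internal, external) lateral edge frequencies for one group.

    Internal: edges with both ends in the group, per node inside the group.
    External: edges with exactly one end in the group, per node outside it.
    Assumes the group is neither empty nor the whole network.
    """
    inside = sum(1 for label in group_of.values() if label == group)
    outside = len(group_of) - inside
    internal = external = 0
    for a, b in lateral_edges:
        a_in = group_of[a] == group
        b_in = group_of[b] == group
        if a_in and b_in:
            internal += 1
        elif a_in or b_in:
            external += 1
    return internal / inside, external / outside


for branch in ("Greek", "Slavic"):
    print(branch, lateral_edge_frequencies(branch, group_of, lateral_edges))
```

A group whose internal frequency exceeds its external frequency exchanges words more readily within the branch than across branches, which is the pattern reported for every branch except Armenian.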
Table 2. Lateral edge (LE) frequencies between and within groups in the MLN.

                 LE frequency    median LE
group       n    int     ext     int   ext    p-value
Greek       9    1.22    0.25    2     1      <0.05
Armenian    3    0       0.17    0     1      n.a.
Celtic      13   1.61    0.29    2     1      0.05
Romance     31   2.45    0.36    1     1      0.05
Germanic    29   2.37    0.44    1     1      0.05
Slavic      31   2.35    0.64    1     1      0.05
Albanian    9    1.55    0.18    4     1      0.05
Indic       21   3.33    0.68    2     1      0.05
Iranian     14   2.35    0.75    2     1      0.05

n: number of languages within group.
LE frequency: for internal edges (int), number of internal edges per number of nodes within the group; for external edges (ext), number of external
edges per number of nodes outside the group.
median LE: range of median number of COGs per lateral edge.
p-value: one-sided Kolmogorov–Smirnov test for lateral edge distribution.

The study was supported by the German Federal Ministry of
Education and Research (S.N.S., J.M.L., H.G., T.D. and
W.M.) and the European Research Council (W.M.). We are
thankful to Frank Kressing, Matthis Krischel, Thorsten
Halling and Sven Sommerfeld for helpful discussions, and
to Dan Graur for his help in refining the manuscript. We
thank Liat Shavit-Grievink for her help in phylogenetic
1 Darwin, C. 1859 On the origin of species by means of natural
selection, or, the preservation of favoured races in the struggle
for life. London, UK: John Murray. See http://www.nla.
2 Schleicher, A. 1863 Die Darwinsche Theorie und die
Sprachwissenschaft offenes Sendschreiben an Herrn Dr.
Ernst Häckel, 3rd edn (1873). Weimar, Germany: Böhlau.
3 Gray, R. D. & Atkinson, Q. D. 2003 Language-tree
divergence times support the Anatolian theory of
Indo-European origin. Nature 426, 435–439. (doi:10.
4 Pagel, M. 2009 Human language as a culturally
transmitted replicator. Nat. Rev. Genet. 10, 405–415.
5 Dunn, M., Terrill, A., Reesink, G., Foley, R. A. &
Levinson, S. C. 2005 Structural phylogenetics and the
reconstruction of ancient language history. Science 309,
2072–2075. (doi:10.1126/science.1114615)
6 Lansing, J. S. et al. 2007 Coevolution of languages and
genes on the island of Sumba, eastern Indonesia. Proc.
Natl Acad. Sci. USA 104, 16 022–16 026. (doi:10.1073/
7 Pagel, M., Atkinson, Q. D. & Meade, A. 2007 Frequency of
word-use predicts rates of lexical evolution throughout
Indo-European history. Nature 449, 717–720. (doi:10.
8 Thomason, S. G. 2001 Language contact: an introduction.
Edinburgh, UK: Edinburgh University Press.
9 Trask, R. L. 2000 The dictionary of historical and compara-
tive linguistics. Edinburgh, UK: Edinburgh University
10 Thomason, S. & Kaufman, T. 1988 Language contact,
creolization, and genetic linguistics. Berkeley, CA: Univer-
sity of California Press.
11 Aikhenvald, A. Y. 2006 Grammars in contact: a cross-
linguistic perspective. In Grammars in contact: a cross-
linguistic typology (eds A. Y. Aikhenvald & R. M.
Dixon), pp. 1–66. Oxford, UK: Oxford University Press.
12 Fox, A. 1995 Linguistic reconstruction: an introduction
to theory and method. Oxford, UK: Oxford University
13 Embleton, S. 2000 Lexicostatistics/glottochronology:
from Swadesh to Sankoff to Starostin to future
horizons. In Time depth in historical linguistics (eds
C. Renfrew, A. McMahon & L. Trask), pp. 143–165.
Cambridge, UK: The McDonald Institute for
Archaeological Research.
14 Bergsland, K. & Vogt, H. 1962 On the validity of glotto-
chronology. Curr. Anthropol. 3, 115–153. (doi:10.1086/
15 Boyd, R., Borgerhoff, M. M., Durham, W. H. &
Richerson, P. J. 1997 Are cultural phylogenies
possible? In Human by nature, between biology and the
social sciences (eds P. Weingart, P. J. Richerson, S. D.
Mitchell & S. Maasen), pp. 355–386. Mahwah, NJ:
16 Atkinson, Q. D. & Gray, R. D. 2006 How old is the Indo-
European language family? Illumination or more moths
to the flame? In Phylogenetic methods and the prehistory of
languages (eds P. Forster & C. Renfrew), pp. 91–109.
Cambridge, UK: McDonald Institute for Archaeological
17 Schleicher, A. 1853 Die ersten Spaltungen des indoger-
manischen Urvolkes. Allgemeine Monatsschrift für
Wissenschaft und Literatur, September, 786–787.
18 Schmidt, J. 1872 Die Verwantschaftsverhältnisse der indoger-
manischen Sprachen. Weimar, Germany: Hermann
19 Schuchardt, H. 1922 Über die Klassifikation der roma-
nischen Mundarten. In Hugo Schuchardt-Brevier. Ein
Vademekum der allgemeinen Sprachwissenschaft. Als Festgabe
zum 80. Geburtstag des Meisters zusammengestellt und einge-
leitet von Leo Spitzer (ed. L. Spitzer), pp. 144–166. Halle,
Germany: Max Niemeyer.
20 Hirt, H. 1905 Die Indogermanen. Ihre Verbreitung, ihre
Urheimat und ihre Kultur, vol. 1. Strassburg, France:
21 Bonfante, G. I. 1931 I dialetti indoeuropei. Annali del
R. Istituto Orientale di Napoli 4, 69–185.
22 Dyen, I., James, A. T. & Cole, J. W. L. 1967 Language
divergence and estimated word retention rate. Language
43, 150–171. (doi:10.2307/411390)
23 Ringe, D. A. 1992 On calculating the factor of chance in
language comparison. Trans. Am. Phil. Soc. 82, 1–110.
24 Southworth, F. C. 1964 Family-tree diagrams. Language
40, 557–565. (doi:10.2307/411938)
25 Bryant, D., Filimon, F. & Gray, R. D. 2005 Untangling
our past: languages, trees, splits and networks. In
The evolution of cultural diversity: phylogenetic approaches
(eds R. Mace, C. Holden & S. Shennan), pp. 67–84.
London, UK: UCL Press.
26 Nakhleh, L., Ringe, D. & Warnow, T. 2005 Perfect phy-
logenetic networks: a new methodology for
reconstructing the evolutionary history of natural
languages. Language 81, 382–420. (doi:10.1353/lan.
27 McMahon, A., Heggarty, P., McMahon, R. & Slaska, N.
2005 Swadesh sublists and the benefits of borrowing: an
Andean case study. Trans. Phil. Soc. 103, 147–170.
28 Ben Hamed, M. & Wang, F. 2006 Stuck in the forest: trees,
networks and Chinese dialects. Diachronica 23, 29–60.
29 Dyen, I., Kruskal, J. B. & Black, P. 1997 Comparative
Indo-European database: file IEdata1. See http://www.
30 Starostin, G. 2008 Tower of Babel: an etymological database
project. See
31 Newman, M. E. J. 2003 The structure and function of
complex networks. SIAM Rev. 45, 167–256. (doi:10.
32 Ronquist, F. & Huelsenbeck, J. P. 2003 MRBAYES 3:
Bayesian phylogenetic inference under mixed models.
Bioinformatics 19, 1572–1574. (doi:10.1093/bioinfor-
33 Saitou, N. & Nei, M. 1987 The neighbor-joining
method. A new method for reconstructing phylogenetic
trees. Mol. Biol. Evol. 4, 406–425.
34 Huson, D. H. & Bryant, D. 2006 Application of phyloge-
netic networks in evolutionary studies. Mol. Biol. Evol.
23, 254–267. (doi:10.1093/molbev/msj030)
35 Dagan, T. & Martin, W. 2007 Ancestral genome sizes
specify the minimum rate of lateral gene transfer during
prokaryote evolution. Proc. Natl Acad. Sci. USA 104,
870–875. (doi:10.1073/pnas.0606318104)
36 Zar, J. H. 1999 Biostatistical analysis, 4th edn. Englewood
Cliffs, NJ: Pearson Prentice-Hall.
37 Dagan, T., Artzy-Randrup, Y. & Martin, W. 2008
Modular networks and cumulative impact of lateral
transfer in prokaryote genome evolution. Proc. Natl
Acad. Sci. USA 105, 10 039–10 044. (doi:10.1073/
38 Swadesh, M. 1955 Towards greater accuracy in lexicosta-
tistic dating. Int. J. Am. Linguist. 21, 121–137. (doi:10.
39 Swadesh, M. 1952 Lexicostatistic dating of prehistoric
ethnic contacts: with special reference to North American
Indians and Eskimos. Proc. Am. Phil. Soc. 96, 452–463.
40 Geisler, H. & List, J.-M. 2011 Beautiful trees on unstable
ground: notes on the data problem in lexicostatistics. In
Die Ausbreitung des Indogermanischen. Thesen aus Sprach-
wissenschaft, Archäologie und Genetik, Akten der
Arbeitstagung der Indogermanischen Gesellschaft Würzburg,
24–26 September 2009 (ed. H. Hettrich). Wiesbaden,
Germany: Reichert.
41 Girvan, M. & Newman, M. E. J. 2002 Community struc-
ture in social and biological networks. Proc. Natl Acad.
Sci. USA 12, 7821–7826.
42 Dyen, I., Kruskal, J. B. & Black, P. 1992 An Indoeuro-
pean classification: a lexicostatistical experiment. Trans.
Am. Phil. Soc. 82, 3–132.
43 Mallory, J. P. & Adams, D. Q. 2006 The Oxford introduc-
tion to Proto-Indo-European and the Proto-Indo-European
world. Oxford, UK: Oxford University Press.
44 Wells, R. S. 1973 Uniformitarianism in linguistics. In
Dictionary of the history of ideas (ed. P. Wiener), pp.
423–431. New York, NY: Scribner.
45 Christy, C. 1983 Uniformitarianism in linguistics. Amster-
dam, The Netherlands: John Benjamins.
46 Greenhill, S. J., Currie, T. E. & Gray, R. D. 2009 Does
horizontal transmission invalidate cultural phylogenies?
Proc. R. Soc. B 276, 2299–2306. (doi:10.1098/rspb.
47 Lewis, P. M. 2009 Ethnologue: languages of the world,
16th edn. Dallas, TX: SIL International. See http://
48 Garrett, A. 2006 Convergence in the formation of
Indo-European subgroups: phylogeny and chronology.
In Phylogenetic methods and the prehistory of languages
(eds P. Forster & C. Renfrew), pp. 139–151. Cambridge,
UK: McDonald Institute for Archaeological Research.
49 Hock, H. H. & Joseph, B. D. 2009 Language
history, language change and language relationship.
In An introduction to historical and comparative
linguistics, 2nd edn. Berlin, Germany: Mouton de
50 Auty, R. 1963 The formation of the Slovene literary
language against the background of the Slavonic national
revival. SEER 41, 391–402.
51 Mallinson, G. 1988 The Romance languages in. In
The Romance languages (eds M. Harris & V. Nigel),
pp. 391–419. London, UK: Croom Helm.
52 Orel, V. 2003 A handbook of Germanic etymology. Leiden,
The Netherlands: Brill.
53 Soukhanov, A. H. 1992 The American heritage dictionary
of the English language. Boston, MA: Mifflin.
... Such data provide only holistic evolutionary hints of languages without fully considering linguistic compositions, including lexical and phonemic systems, which may portray distinct evolutionary processes. The evolution of lexical systems, such as loss or gain of core vocabulary, can trace language divergence [12]. In comparison, the evolution of phonemic systems is more complicated. ...
... In contrast, the Neighbour-Net for mtDNA in Fig. 2b clearly illustrates an East-West geographic polarization, indicating two major IE populations in matrilineages: Indo-Iranian and European. Due to the limited lexical borrowings in the Dunn's lexical dataset [12], the Neighbour-Net for lexicon thus appeared to better approximate a tree-like structure with fewer reticulations than the phonemic Neighbour-Net. The clustering groups for languages based on lexicon were consistent with traditional linguistic classifications. ...
Full-text available
In opposite to the Mother Tongue Hypothesis, the Father Tongue Hypothesis states that humans tend to speak their fathers' language, based on a stronger correlation of languages to paternal lineages (Y-chromosome) than to maternal lineages (mitochondria). To reassess these two competing hypotheses, we conducted a genetic-linguistic study of 34 modern Indo-European (IE) populations. In this study, genetic histories of paternal and maternal migrations in these IE populations were elucidated using phylogenetic networks of Y-chromosomal and mitochondrial DNA haplogroups, respectively. Unlike previous studies, we quantitatively characterized the languages based on lexical and phonemic systems, separately. We showed that genetic and linguistic distances are significantly correlated with each other and that both are correlated with geographic distances among these populations. However, when controlling for geographic factors, only the correlation between the distances of paternal and lexical characteristics and between those of maternal and phonemic remained. These unbalanced correlations reconciled the two seemingly conflicting hypotheses.
... Supplementary data is available at Journal of Language Evolution online. Notes 1. Compare, for example, Gray et al. (2007), Nelson-Sathi et al. (2011), Jordan (2011), Geisler and List (2013), Bickel et al. (2015), Bouckaert et al. (2018), Cathcart et al. (2020) . 2. For example, in the Chirila database grammatical information (Bowern 2016), seventy-five of ninety-two languages with information have ergative marking on at least some nouns, though only twenty-seven regularly have it for pronouns. ...
Bayesian phylogenetic methods have been gaining traction and currency in historical linguistics, as their potential for uncovering elements of language change is increasingly understood. Here, we demonstrate a proof of concept for using ancestral state reconstruction methods to reconstruct changes in morphology. We use a simple Brownian motion model of character evolution to test how splits in ergative marking evolve across Pama-Nyungan, a large family of Australian languages. We are able to recover linguistically plausible paths of change, as well as rejecting implausible paths. The results of these analyses elucidate constraints on changes that have led to extensive synchronic variation in an interlocking morphological system. They further provide evidence of an ergative–accusative split traceable to Proto-Pama-Nyungan.
... Within the linguistic field, perhaps the two most extensive examinations of this matter are Gray et al. (2010) and Wichmann et al. (2011), both of which focus on three techniques with which the treelikeness of datasets can be measured: NeighborNets, d (delta) scores, and Q-residuals (see below). Other noteworthy studies on this include Nelson-Sathi et al. (2010), where minimal lateral networks were used to visualize incompatibilities in a reference tree. Also, in a recent article, Verkerk (2019) applied a Bayesian technique called the 'multiple topologies method' to explore nontree-like language history using material from four language families. ...
Full-text available
In recent years, techniques such as Bayesian inference of phylogeny have become a standard part of the quantitative linguistic toolkit. While these tools successfully model the tree-like component of a linguistic dataset, real-world datasets generally include a combination of tree-like and nontree-like signals. Alongside developing techniques for modeling nontree-like data, an important requirement for future quantitative work is to build a principled understanding of this structural complexity of linguistic datasets. Some techniques exist for exploring the general structure of a linguistic dataset, such as NeighborNets, δ scores, and Q-residuals; however, these methods are not without limitations or drawbacks. In general, the question of what kinds of historical structure a linguistic dataset can contain and how these might be detected or measured remains critically underexplored from an objective, quantitative perspective. In this article, we propose TIGER values, a metric that estimates the internal consistency of a genetic dataset, as an additional metric for assessing how tree-like a linguistic dataset is. We use TIGER values to explore simulated language data ranging from very tree-like to completely unstructured, and also use them to analyze a cognate-coded basic vocabulary dataset of Uralic languages. As a point of comparison for the TIGER values, we also explore the same data using δ scores, Q-residuals, and NeighborNets. Our results suggest that TIGER values are capable of both ranking tree-like datasets according to their degree of treelikeness, as well as distinguishing datasets with tree-like structure from datasets with a nontree-like structure. Consequently, we argue that TIGER values serve as a useful metric for measuring the historical heterogeneity of datasets. Our results also highlight the complexities in measuring treelikeness from linguistic data, and how the metrics approach this question from different perspectives.
... Whereas the degree of uncertainty can be intuitively posited to increase as we trace the states of languages back to the past, the tree model, when used for this process, simultaneously reduces the degree of freedom by repeatedly merging nodes into a parent. Horizontal transmission elicits a degree of freedom that cannot be currently modeled without imposing some strong assumptions (Daumé 2009;Nelson-Sathi et al. 2010). ...
Full-text available
A major pursuit within the study of language evolution is to advance understanding of the historical behavior of typological features. Previous studies have identified at least three factors that determine the typological similarity of a pair of languages: (1) vertical stability, (2) horizontal diffusibility, and (3) universality. Of these factors, the first two are of particular interest. Although observed data are affected by all three factors to a greater or lesser degree, previous studies have not jointly modeled them in a straightforward manner. Here, we propose a solution that is derived from the field of cultural anthropology. We present a simple and extensible Bayesian autologistic model to jointly infer the three factors from observed data. Although a large number of missing values in the data set pose serious difficulties for statistical modeling, the proposed model can robustly estimate these parameters as well as missing values. Applying missing value imputation to indirectly evaluate the estimated parameters, we quantitatively demonstrated that they were meaningful. In conclusion, we briefly compare our findings with those of previous studies and discuss future directions.
... Until now, phylogenetic networks were primarily visualization tools. Recently, however, these methods have been applied to languages to estimate the amount of lexical borrowing in Indo-European languages (Nelson-Sathi et al., 2011), and to quantify the amount of conflicting signal in various linguistic and cultural datasets (Gray et al., 2010). Perhaps the best way to analyze the complex descent patterns found in places like Vanuatu and Polynesia is to use these recently developed phylogenetic network methods rather than phylogenetic trees (Gray et al., 2010). ...
Full-text available
Language phylogenies are a potentially powerful way to answer questions about how languages and cultures evolve. Recently, phylogenetic methods have been applied to a range of questions about the evolution of human languages and cultures. This article reviews the historical background of these approaches and provides a detailed methodological overview. Three different applications of phylogenetic methods are discussed: how language phylogenies can be used to test population dispersal hypotheses, to investigate processes in language evolution, and to infer patterns in cultural evolution. The article discusses briefly some controversies over the use of these methods before closing with some future prospects.
... If numeral features were borrowed across ancient Indo-European languages and subsequently inherited by the descendant languages we observe today, the evolutionary history of these features would be better captured by a phylogenetic network rather than a phylogenetic tree. However, recent discussion and simulation studies (Nunn et al. 2006;Greenhill et al. 2009;Currie et al. 2010;Gray et al. 2010;Nelson-Sathi 2010) indicates that phylogenetic comparative methods are robust to realistic levels of borrowing. It remains to be seen whether the technical advantages of productive numeral systems are more easily borrowed than other typological features. ...
Full-text available
Numerals have fascinated and mystified linguists, mathematicians and lay persons alike for centuries. The productive use of numerals (in languages where this happens) exploits recursivity to give rise to what we call the ‘the number line’. While the smaller numerals 1–10 have enjoyed intense scrutiny, the typological study of the formation of the higher numerals has received comparatively less attention. This article contains a comprehensive typological account of how languages in the Indo-European language family code numerals beyond 10 (10–99, 100s, 1,000s), the morphemes involved, and how these are ordered. We use this dataset from eighty-one Indo-European languages with phylogenetic comparative methods to propose diachronic reconstructions of these patterns in the Proto-Indo-European language. Our findings indicate that small numerals (11–19) show the widest cross-linguistic variation, and that higher numerals exhibit more consistency in both component parts and their ordering. Additionally, we show statistical evidence of correlations between the ordering of base and atom morphemes and other word order patterns (noun-postposition, noun-genitive, and verb-object order).
Full-text available
Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example for this kind of evidence are phonological and phonotactic clues that are especially useful for the detection of recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with the support of Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in mono-lingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages from different families, featuring a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in mono-lingual borrowing detection. While the general results appear largely unsatisfying at a first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by one donor language alone. Our results show that phonological and phonotactic clues derived from monolingual language data alone are often not sufficient to detect borrowings when using them in isolation. Based on our detailed findings, however, we express hope that they could prove to be useful in integrated approaches that take multi-lingual information into account.
Full-text available
Based on the number of words per meaning across the Indo-European Swadesh list, Pagel et al. (2007) suggest that frequency of use is a general mechanism of linguistic evolution. We test this claim using within-language change. From the IDS ( Key & Comrie 2015 ) we compiled a comparative word list of 1,147 cognate pairs for Classical Latin and Modern Spanish, and 1,231 cognate pairs for Classical and Modern Greek. We scored the amount of change for each cognate pair in the two language histories according to a novel 6-point scale reflecting increasing levels of change from regular sound change to external borrowing. We find a weak negative correlation between frequency of use and lexical change for both the Latin-Spanish and Classical-Modern Greek language developments, but post hoc tests reveal that low frequency of use of borrowed words drive these patterns, casting some doubt on frequency of use as a general mechanism of language change.
This paper presents and discusses regular correspondences between Uralic geminate items and Yukaghiric, together with proposed sound change laws and both new and modified older cognate suggestions (twenty-four nouns and eight verbs). Geminate items were found to contain surprisingly stable, relatively unchanging vowels in Yukaghiric relative to the Proto-Uralic form. The results suggest that degemination, which took place in all cases except a few forms that can be explained otherwise, was an early process in Yukaghiric and occurred after or while many vowel changes had already taken place in the Yukaghiric vocabulary. The data show that the relationship between Uralic and Yukaghiric is more extensive than previously believed. Some very early possible sound changes are discussed. Furthermore, a correspondence between Proto-Uralic *-ü- and Late Proto-Yukaghiric *-ö- has been found. It is also shown that early suffixation of Uralic-like stems in Yukaghir has produced several modern words through grammaticalization.
We present a new open source software tool called BEASTling, designed to simplify the preparation of Bayesian phylogenetic analyses of linguistic data using the BEAST 2 platform. BEASTling transforms comparatively short and human-readable configuration files into the XML files used by BEAST to specify analyses. By taking advantage of Creative Commons-licensed data from the Glottolog language catalog, BEASTling allows the user to conveniently filter datasets using names for recognised language families, to impose monophyly constraints so that inferred language trees are backward compatible with Glottolog classifications, or to assign geographic location data to languages for phylogeographic analyses. Support for the emerging cross-linguistic linked data format (CLDF) permits easy incorporation of data published in cross-linguistic linked databases into analyses. BEASTling is intended to make the power of Bayesian analysis more accessible to historical linguists without strong programming backgrounds, in the hopes of encouraging communication and collaboration between those developing computational models of language evolution (who are typically not linguists) and relevant domain experts. © 2017 Maurits et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The German linguists Johannes Schmidt (1843–1901) and Hugo Schuchardt (1842–1927) sought to answer many questions relating to the development of Indo-European languages, which are all believed to be descended from a single common ancestor. Schmidt's Verwandtschaftsverhältnisse was originally published in 1872 and Schuchardt's Über die Lautgesetze followed in 1885; here they are reissued together in one volume. Schmidt's work developed the 'wave model' of language change, to which Schuchardt also subscribed. According to this theory, linguistic innovations spread outwards concentrically like waves, which become progressively weaker as time elapses and the distance from their point of origin increases. Since later changes may not cover the same area, there may be no sharp boundaries between neighbouring languages or dialects. This theory stood in opposition to the tree model and the doctrine of sound laws propounded by the Neogrammarian school of linguists, which is roundly critiqued in Schuchardt's contribution.
A new method called the neighbor-joining method is proposed for reconstructing phylogenetic trees from evolutionary distance data. The principle of this method is to find pairs of operational taxonomic units (OTUs [= neighbors]) that minimize the total branch length at each stage of clustering of OTUs starting with a starlike tree. The branch lengths as well as the topology of a parsimonious tree can quickly be obtained by using this method. Using computer simulation, we studied the efficiency of this method in obtaining the correct unrooted tree in comparison with that of five other tree-making methods: the unweighted pair group method of analysis, Farris's method, Sattath and Tversky's method, Li's method, and Tateno et al.'s modified Farris method. The new, neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods.
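The clustering procedure summarised in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original authors' code: it uses the commonly taught Q-matrix form of the pair-selection criterion, which is generally regarded as equivalent to minimising total branch length at each step. The function name `neighbor_joining`, the internal node labels `node0`, `node1`, and so on, and the five-taxon distance matrix are all invented for the example.

```python
from itertools import combinations


def neighbor_joining(labels, matrix):
    """Return the edges (parent, child, branch_length) of an unrooted tree."""
    # d[a][b] holds the current distance between active nodes a and b.
    d = {a: {b: matrix[i][j] for j, b in enumerate(labels) if i != j}
         for i, a in enumerate(labels)}
    edges = []
    next_id = 0
    while len(d) > 2:
        n = len(d)
        # Net divergence of each active node: sum of its distances.
        r = {a: sum(d[a].values()) for a in d}

        def q(pair):
            x, y = pair
            return (n - 2) * d[x][y] - r[x] - r[y]

        # Join the pair of neighbors with the smallest Q value.
        a, b = min(combinations(d, 2), key=q)
        dab = d[a][b]
        len_a = 0.5 * dab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        u = f"node{next_id}"
        next_id += 1
        edges += [(u, a, len_a), (u, b, len_b)]

        # Distances from the new internal node to every remaining node.
        d[u] = {}
        for c in list(d):
            if c in (a, b, u):
                continue
            d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - dab)

        # Retire the two joined nodes.
        del d[a], d[b]
        for c in d:
            d[c].pop(a, None)
            d[c].pop(b, None)

    # Two nodes remain; connect them with the final branch.
    a, b = d
    edges.append((a, b, d[a][b]))
    return edges


# Illustrative additive distances between five taxa (toy values).
taxa = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(taxa, dist)
for parent, child, length in tree:
    print(parent, child, length)
```

Because the toy distances are additive (each equals a path length in some tree), the recovered branch lengths reproduce the input distances exactly. Real data will only approximate this.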