Illustration of the MLN method. A: Two cognate sets for “to count” in three Germanic and three Romance languages. The English word is a known borrowing from Old French. The original reflex of Proto-Germanic *tal- is still preserved in English “to tell,” but its original meaning has shifted under the influence of the borrowing from Old French, and it is thus not listed in this sample. B: The loss-only scenario assumes that the cognate set with reflexes of Latin originated in the root and was then lost independently in both German and Danish. C: The two-gain scenario infers two separate origins of the cognate sets. The pattern is thus suggestive of lateral transfer, and one lateral transfer event is inferred. This is marked by the link drawn between the two nodes where the characters first originate. D: Combination of scenarios for both cognate sets based on the loss-only scenario in B. Note that this scenario forces us to assume that the ancestor of the Germanic languages had two words expressing the concept “to count.” While this is not improbable per se, cases of inferred overwhelming amounts of synonymy are suspicious in language history. E: Combination of scenarios for both cognate sets based on the two-gain scenario in C. This scenario is preferred by the MLN method, since the number of synonyms in the ancestral languages is in balance with the modern languages. Note that the inference does not tell us which language is the real donor (which is Old French). According to our model, it could be any of the three Romance languages. For this reason, the edge is drawn between the ancestor of all languages.
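The comparison of panels D and E can be made concrete with a small sketch. It counts how many words for “to count” each ancestral node would carry under the loss-only and the two-gain scenarios. The flat tree topology, the node names, the cognate set labels, and the helper functions are assumptions introduced for illustration; they are not taken from the source publication or from any library.

```python
# Toy tree; the flat topology is an assumption made for illustration.
TREE = {
    "Root": ["Germanic", "Romance"],
    "Germanic": ["English", "German", "Danish"],
    "Romance": ["French", "Italian", "Spanish"],
}

# Modern languages that carry a reflex of each cognate set (panel A).
COGNATE_SETS = {
    "Latin set": {"English", "French", "Italian", "Spanish"},
    "*tal- set": {"German", "Danish"},
}


def leaves_below(node):
    """Return the set of leaf languages in the subtree rooted at `node`."""
    children = TREE.get(node)
    if not children:
        return {node}
    return set().union(*(leaves_below(child) for child in children))


def ancestral_presence(origins, carriers):
    """Internal nodes that must carry a cognate set, given its origin nodes.

    A node carries the set if it lies at or below an origin node and at
    least one of its descendant leaves still has a reflex.
    """
    present = set()

    def walk(node, inherited):
        inherited = inherited or node in origins
        if node in TREE:  # internal node
            if inherited and leaves_below(node) & carriers:
                present.add(node)
            for child in TREE[node]:
                walk(child, inherited)

    walk("Root", False)
    return present


def vocabulary_sizes(scenario):
    """Count how many cognate sets each ancestral node carries."""
    sizes = {node: 0 for node in TREE}
    for name, origins in scenario.items():
        for node in ancestral_presence(origins, COGNATE_SETS[name]):
            sizes[node] += 1
    return sizes


# Panel D: the Latin set originates once, in the root.
loss_only = {"Latin set": {"Root"}, "*tal- set": {"Germanic"}}
# Panel E: the Latin set originates twice, in Romance and in English.
two_gain = {"Latin set": {"Romance", "English"}, "*tal- set": {"Germanic"}}

sizes_loss_only = vocabulary_sizes(loss_only)
sizes_two_gain = vocabulary_sizes(two_gain)
print(sizes_loss_only)
print(sizes_two_gain)
```

Under the loss-only scenario the Germanic ancestor ends up with two synonyms, whereas the two-gain scenario leaves it with one, matching the modern languages. In this toy the root carries no word in the two-gain scenario only because just two cognate sets are modelled.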

Source publication
Article
Like biological species, languages change over time. As noted by Darwin, there are many parallels between language evolution and biological evolution. Insights into these parallels have also undergone change in the past 150 years. Just like genes, words change over time, and language evolution can be likened to genome evolution accordingly, but wha...

Citations

... Instead, I am proposing a mycelial turn in the conceptualisation and modelling of language evolution, i.e. a move towards a unified model of language evolution that considers all languages to be shaped by vertical and horizontal processes to a greater or lesser extent. This shift has occurred to some extent on a conceptual level in the work of Labov (1994a, 1994b); Croft (2000); Mufwene (2001); Bromham, Hua, Algy, & Meakins (2020) and others; and in evolutionary modelling by Bryant & Moulton (2004); Kalyan & François (2018); List et al. (2013) and others. I will build on this work with a different proposal about how to reconcile horizontal and vertical models of language evolution, thereby integrating contact and inheritance processes into a single model. ...
Article
The aim of this column is not to provide another binarised account of the evolution of contact languages (i.e. rhizomic, network structure, horizontal, variable) versus non-contact languages (i.e. arborescent, hierarchical structure, vertical, invariant). Instead, I am proposing a mycelial turn in the conceptualisation and modelling of language evolution, i.e. a move towards a unified model of language evolution that considers all languages to be shaped by vertical and horizontal processes to a greater or lesser extent. I propose a way to reconcile horizontal and vertical models of language evolution thereby integrating contact and inheritance processes into a single model.
... Another approach to borrowed word detection is to construct phylogenetic models of language families based on wordlists, including also intruder languages that are not necessarily part of the language family. Observed discrepancies in the model, in particular lexical items that detract from hierarchical family relations and contribute instead to lateral transfers, are likely due to borrowed words (List et al., 2014; Delz, 2014). ...
... Similar to the prevalence of multilingual approaches to borrowing detection in classical historical linguistics, most recent attempts to detect borrowings automatically have also been based on comparative rather than monolingual evidence. Various authors have tried to detect borrowings by searching for phylogenetic conflicts [18][19][20][21][22][23][24]. Other approaches identify similar words in unrelated languages [25][26][27]. ...
Article
Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example for this kind of evidence are phonological and phonotactic clues that are especially useful for the detection of recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with the support of Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in mono-lingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages from different families, featuring a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in mono-lingual borrowing detection. While the general results appear largely unsatisfying at a first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by one donor language alone. Our results show that phonological and phonotactic clues derived from monolingual language data alone are often not sufficient to detect borrowings when using them in isolation. Based on our detailed findings, however, we express hope that they could prove to be useful in integrated approaches that take multi-lingual information into account.
... The above tree-based approach, which involves a certain root and divergence of languages from that root, was introduced by the German philologist August Schleicher [12]. This has remained the mainstay of the comparative linguistics field to date. ...
Preprint
Traditionally linguists have organized languages of the world as language families modelled as trees. In this work we take a contrarian approach and question the tree-based model that is rather restrictive. For example, the affinity that Sanskrit independently has with languages across Indo-European languages is better illustrated using a network model. We can say the same about inter-relationship between languages in India, where the inter-relationships are better discovered than assumed. To enable such a discovery, in this paper we have made use of instance-based learning techniques to assign language labels to words. We vocalize each word and then classify it by making use of our custom linguistic distance metric of the word relative to training sets containing language labels. We construct the training sets by making use of word clusters and assigning a language and category label to that cluster. Further, we make use of clustering coefficients as a quality metric for our research. We believe our work has the potential to usher in a new era in linguistics. We have limited this work for important languages in India. This work can be further strengthened by applying Adaboost for classification coupled with structural equivalence concepts of social network analysis.
... The goal of this article is to explicitly investigate these datasets, because as both these papers and the results in Section 4 show, evidence for reticulation is still present despite removal of known loans. We know that languages are full of loanwords in general (Tadmor 2009); Nelson-Sathi et al. (2011) and List et al. (2013) have shown undetected borrowings to be present in earlier Indo-European lexical datasets, and I side with Lee & Hasegawa (2011) in agreeing that detecting borrowings can be very difficult (see List et al. 2016: 7 for perspectives on automatic loanword identification). Aside from borrowing, there are other processes, such as incomplete lineage sorting, that result in non-tree-like signal. ...
... The Luvian-Lycian-Umbrian-Oscan clade found in the minority tree is probably caused by missing data for a highly overlapping set of concepts in these four languages, but it serves as a good illustration of the multiple topologies method, i.e. that such a distinctive pattern is picked up on by only one of the trees. For the Indo-European analysis, no clear indication of the role of reticulation is found (despite earlier findings such as Nelson-Sathi et al. 2011 and List et al. 2013). This is not the case for Japonic and Sinitic, as is discussed next. ...
Article
Recent applications of phylogenetic methods to historical linguistics have been criticized for assuming a tree structure in which ancestral languages differentiate and split up into daughter languages, while language evolution is inherently non-tree-like (François 2014; Blench 2015: 32–33). This article attempts to contribute to this debate by discussing the use of the multiple topologies method (Pagel & Meade 2006a) implemented in BayesPhylogenies (Pagel & Meade 2004). This method is applied to lexical datasets from four different language families: Austronesian (Gray, Drummond & Greenhill 2009), Sinitic (Ben Hamed & Wang 2006), Indo-European (Bouckaert et al. 2012), and Japonic (Lee & Hasegawa 2011). Evidence for multiple topologies is found in all families except, surprisingly, Austronesian. It is suggested that reticulation may arise from a number of processes, including dialect chain break-up, borrowing (both shortly after language splits and later on), incomplete lineage sorting, and characteristics of lexical datasets. It is shown that the multiple topologies method is a useful tool to study the dynamics of language evolution.
... The same dataset was subsequently analyzed by List (2015, 2016). Further analysis on the Sinitic languages using a different dataset is performed by List et al. (2014). iii. ...
... While Schmidt remained very vague in his criticism, Schuchardt was more concrete, pointing in particular to the problem of diffusion between very closely related languages: "We connect the branches and twigs of the family tree with countless horizontal lines and it ceases to be a tree" (Schuchardt 1900: 9). While Schuchardt's observations were based on his deep knowledge of the Romance languages, Schmidt drew his conclusions from a thorough investigation of shared cognate words in the major branches of Indo-European. In this investigation, he found patterns of words that were in a strong "patchy distribution" (see List et al. 2014), that is, a distribution that showed many gaps across the languages under investigation, with only a few (if any) patterns that could be found across all languages. One seemingly surprising fact was, for example, that while Greek and Sanskrit shared about 39 of cognate vocabulary (according to Schmidt's count; see Geisler & List 2013) and Greek and Latin shared 53, Latin and Sanskrit shared only 8. ...
Book
There are important reasons to be sceptical of the accuracy and usefulness of the family-tree model in historical linguistics. That model assumes that every linguistic innovation applies to a language considered as an undifferentiated whole, a point with no “width”. But this assumption makes it impossible to use a tree to model the partial diffusion of an innovation within a language community (“internal diffusion”), or the diffusion of an innovation across language communities (“external diffusion”). These limitations have long been noticed by historical linguists (Schmidt 1872, Schuchardt 1900); but they become glaringly obvious in the cases discussed by Ross (1988) and François (2014) under the heading of “linkages” – i.e., language families that arise through the diversification, in situ, of a dialect network. The articles in this special issue all contribute towards addressing this problem, from a range of perspectives. ______________ Siva Kalyan, Alexandre François & Harald Hammarström (eds), 2019. "Understanding language genealogy: Alternatives to the tree model". Special issue of "Journal of Historical Linguistics" 9/1.
... In a related vein, computational phylogenetics and population genetics are offering new methods for measuring large-scale borrowing across language families and within populations of speakers (e.g. List et al., 2013;Meakins et al., under review). ...
Chapter
The transfer of morphology is of interest to the field of language contact because it occurs less frequently than lexical transfer. Morphology is borrowed less often than vocabulary (Section 2), and is generally only derived from the dominant of the two interacting languages in insertional code-switching (Section 3). In contrast, morphology from both source languages is maintained in some mixed languages (Section 4). Finally it is argued that pidgin and creole languages contain relatively little morphology compared with their lexifier languages, although this claim is controversial (Section 5). These generalizations about morphology apply to varying degrees to different types of morphology. Derivational morphology is generally more likely to undergo transfer or be maintained than inflectional morphology, and inherent inflection is generally more resilient than contextual inflection. This borrowability hierarchy is based on the degree to which a morpheme relates to other parts of the clause outside its maximal projection, e.g. contextual inflection, such as agreement markers and case morphology, is exceptionally fragile in contact situations due to its syntactically dependent nature. Given the sensitivity of morphology to language contact, it can be used as a litmus test to gauge the relative strengths of interacting languages. For example, in borrowing and code-switching, one language is more dominant, as defined by the presence of inflectional morphology. On the other hand, the maintenance of inflectional morphology from both languages in some mixed languages suggests a relatively equal weighting given to both languages, with neither language being definitively stronger.
... Ancestral state reconstruction For our study, we tested three different established algorithms, namely (1) Maximum Parsimony (MP) reconstruction using the Sankoff algorithm (Sankoff, 1975), (2) the minimal lateral network (MLN) approach (Dagan et al., 2008) as a variant of Maximum Parsimony in which parsimony weights are selected with the help of the vocabulary size criterion (List et al., 2014b, 2014c), and (3) Maximum Likelihood (ML) reconstruction as implemented in the software BayesTraits (Pagel and Meade, 2014). These algorithms are described in detail below. ...
... The MLN approach was originally developed for the detection of lateral gene transfer events in evolutionary biology (Dagan et al., 2008). In this form, it was also applied to linguistic data (Nelson-Sathi et al., 2011), and later substantially modified (List et al., 2014b, 2014c). While the original approach was based on very simple gain-loss-mapping techniques, the improved version uses weighted parsimony on presence-absence data of cognate set distributions. ...
... The ratio between gains and losses follows from the experience with the MLN approach, which is presented in more detail below and which essentially tests different gain-loss scenarios for their suitability to explain a given dataset. In all published studies in which the MLN approach was tested (List et al., 2014b, 2014c; List, 2015), the best gain-loss ratio reported was 2:1. ...
Article
Current efforts in computational historical linguistics are predominantly concerned with phylogenetic inference. Methods for ancestral state reconstruction have only been applied sporadically. In contrast to phylogenetic algorithms, automatic reconstruction methods presuppose phylogenetic information in order to explain what has evolved when and where. Here we report a pilot study exploring how well automatic methods for ancestral state reconstruction perform in the task of onomasiological reconstruction in multilingual word lists, where algorithms are used to infer how the words evolved along a given phylogeny, and reconstruct which cognate classes were used to express a given meaning in the ancestral languages. Comparing three different methods, Maximum Parsimony, Minimal Lateral Networks, and Maximum Likelihood on three different test sets (Indo-European, Austronesian, Chinese) using binary and multi-state coding of the data as well as single and sampled phylogenies, we find that Maximum Likelihood largely outperforms the other methods. At the same time, however, the general performance was disappointingly low, ranging between 0.66 (Chinese) and 0.79 (Austronesian) for the F-Scores. A closer linguistic evaluation of the reconstructions proposed by the best method and the reconstructions given in the gold standards revealed that the majority of the cases where the algorithms failed can be attributed to problems of independent semantic shift (homoplasy), to morphological processes in lexical change, and to wrong reconstructions in the independently created test sets that we employed.
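The weighted-parsimony idea behind the MLN variant described in the excerpts above can be sketched with the Sankoff recursion on a single presence/absence character. This is a minimal illustration, not the implementation used in the cited studies: the function name, the toy tree, and the convention of charging one gain when the character is present at the root are assumptions. The excerpts report a best gain-loss ratio of 2:1 without spelling out how it maps onto the two weights, so the weight values used here are arbitrary.

```python
import math


def sankoff_cost(tree, root, leaf_states, gain, loss):
    """Minimal weighted-parsimony cost for one presence/absence character.

    States: 0 = absent, 1 = present. A 0->1 change costs `gain`,
    a 1->0 change costs `loss`.
    Assumption: a character present at the root is charged one gain
    for its origin.
    """
    step = {(0, 0): 0.0, (0, 1): gain, (1, 0): loss, (1, 1): 0.0}

    def costs(node):
        """Return [cost if node is absent, cost if node is present]."""
        if node not in tree:  # leaf: only the observed state is allowed
            observed = leaf_states[node]
            return [0.0 if s == observed else math.inf for s in (0, 1)]
        child_costs = [costs(child) for child in tree[node]]
        return [
            sum(min(step[(s, t)] + cc[t] for t in (0, 1)) for cc in child_costs)
            for s in (0, 1)
        ]

    absent, present = costs(root)
    return min(absent, present + gain)


# Hypothetical toy data: a flat Germanic/Romance tree and one cognate set
# found in English and in all three Romance languages.
tree = {
    "Root": ["Germanic", "Romance"],
    "Germanic": ["English", "German", "Danish"],
    "Romance": ["French", "Italian", "Spanish"],
}
latin_set = {
    "English": 1, "German": 0, "Danish": 0,
    "French": 1, "Italian": 1, "Spanish": 1,
}

cost_equal = sankoff_cost(tree, "Root", latin_set, gain=1, loss=1)
cost_costly_gain = sankoff_cost(tree, "Root", latin_set, gain=3, loss=1)
print(cost_equal, cost_costly_gain)
```

With equal weights the cheapest explanation uses two gains (Romance ancestor and English); when gains are three times as costly as losses, a single origin at the root followed by two losses becomes cheaper. Varying the weights in this way is what lets a method of this kind test different gain-loss scenarios against a dataset.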
... We hope, however, that we could help readers to understand what they should keep in mind if they want to carry out sequence comparison analyses on their own. Additional questions will be answered in an interactive tutorial supplemented with this paper, and for deeper questions going beyond the pure application of sequence comparison algorithms, such as additional analyses (e.g., the minimal lateral network method for borrowing detection, List et al. 2014, or an algorithm for the detection of partial cognates, List et al. 2016b), routines for plotting and data visualization, or customization routines for user-defined sound-class models, we recommend the readers to turn to the extensive online documentation of the LingPy package (http://lingpy.org). ...
Article
With increasing amounts of digitally available data from all over the world, manual annotation of cognates in multilingual word lists becomes more and more time-consuming in historical linguistics. Using available software packages to pre-process the data prior to manual analysis can drastically speed up the process of cognate detection. Furthermore, it allows us to get a quick overview on data which has not yet been intensively studied by experts. LingPy is a Python library which provides a large arsenal of routines for sequence comparison in historical linguistics. With LingPy, linguists can not only automatically search for cognates in lexical data, they can also align the automatically identified words, and output them in various forms, which aim at facilitating manual inspection. In this tutorial, we will briefly introduce the basic concepts behind the algorithms employed by LingPy, and then illustrate in concrete workflows, how automatic sequence comparison can be applied to multilingual word lists. The goal is to provide the readers with all information they need to (a) carry out cognate detection and alignment analyses in LingPy, (b) select the appropriate algorithms for the appropriate task, (c) evaluate how well automatic cognate detection algorithms perform compared to experts, and (d) export their data into various formats useful for additional analyses or data sharing. While basic knowledge of the Python language is useful for all analyses, our tutorial is structured in such a way that scholars with basic knowledge of computing can follow through all steps as well.
... This point was well recognized by early promoters of cross-disciplinary dialogue between evolutionary biology and historical linguistics (Morpurgo Davies, 1975), such as Charles Darwin, August Schleicher, and Charles Lyell (Lyell, 1863; Schleicher, 1869; Darwin, 1871). For example, Schleicher's analogy between borrowing from a foreign language and biological cross-breeding did not imply the same mechanism for both, yet both have the effect of confounding attempts to represent evolutionary history as a bifurcating phylogeny (List et al., 2014). Yet the same solutions may apply to both processes, regardless of their mechanistic origin, such as representation of relationships as a network rather than a tree. ...
Article
What role does speaker population size play in shaping rates of language evolution? There has been little consensus on the expected relationship between rates and patterns of language change and speaker population size, with some predicting faster rates of change in smaller populations, and others expecting greater change in larger populations. The growth of comparative databases has allowed population size effects to be investigated across a wide range of language groups, with mixed results. One recent study of a group of Polynesian languages revealed greater rates of word gain in larger populations and greater rates of word loss in smaller populations. However, that test was restricted to 20 closely related languages from small Oceanic islands. Here, we test if this pattern is a general feature of language evolution across a larger and more diverse sample of languages from both continental and island populations. We analyzed comparative language data for 153 pairs of closely-related sister languages from three of the world's largest language families: Austronesian, Indo-European, and Niger-Congo. We find some evidence that rates of word loss are significantly greater in smaller languages for the Indo-European comparisons, but we find no significant patterns in the other two language families. These results suggest either that the influence of population size on rates and patterns of language evolution is not universal, or that it is sufficiently weak that it may be overwhelmed by other influences in some cases. Further investigation, for a greater number of language comparisons and a wider range of language features, may determine which of these explanations holds true.