Ramon Ferrer-i-Cancho

Universitat Politècnica de Catalunya | UPC · Department of Computer Science

Computer Scientist
    Projects (3)
    Developing a mathematical theory of word order based on abstract principles and their predictions.
    A family of information theoretic models to shed light on the origins of language patterns (Zipf's law for word frequencies, Zipf's law of abbreviation, Zipf's law of meaning distribution, the principle of contrast,...), vocabulary learning biases,... The scope of these models is not only human language but also animal communication and genomes at different levels of organization.
    Research Items (105)
    Project - Word order theory
    An answer to the mystery at Physics Central
    In his pioneering research, G. K. Zipf observed that more frequent words tend to have more meanings, and showed that the number of meanings of a word grows as the square root of its frequency. He derived this relationship from two assumptions: that words follow Zipf's law for word frequencies (a power law dependency between frequency and rank) and Zipf's law of meaning distribution (a power law dependency between number of meanings and rank). Here we show that a single assumption on the joint probability of a word and a meaning suffices to infer Zipf's meaning-frequency law or relaxed versions. Interestingly, this assumption can be justified as the outcome of a biased random walk in the process of mental exploration.
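The square-root relation follows from the two rank laws in one step. The lines below are a sketch, not the derivation of the paper: the exponents $\alpha$ and $\gamma$ are symbols introduced here for illustration, both laws are assumed to share the same frequency rank $r$, and $\alpha \approx 1$, $\gamma \approx 1/2$ are the values commonly attributed to Zipf.

```latex
% f: word frequency, \mu: number of meanings, r: frequency rank
f \propto r^{-\alpha}, \qquad \mu \propto r^{-\gamma}
\;\Longrightarrow\; r \propto f^{-1/\alpha}
\;\Longrightarrow\; \mu \propto f^{\gamma/\alpha},
\qquad \frac{\gamma}{\alpha} \approx \frac{1/2}{1} = \frac{1}{2}.
```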
    Project - Word order theory
    After a very long process, the article
    Gómez-Rodríguez, C. & Ferrer-i-Cancho, R. (2017). The scarcity of crossing dependencies: a direct outcome of a specific constraint? Physical Review E 96, 062304.
    is finally out.
    You can download it from here http://dx.doi.org/10.1103/PhysRevE.96.062304
    Project - Word order theory
    50 days' free access to the first statistically rigorous evidence that crossing dependencies in languages are small in number
    Project - Word order theory
    A satellite of Evolang XII (Torun, Poland), 16 April 2018. Deadline for submissions: 22 December 2017.
    Project - Information theoretic models of natural communication
    The article "Optimization models of natural communication" has appeared online in the Journal of Quantitative Linguistics.
    The article reviews the psychological foundations and the various predictions of a family of information theoretic models of language that were originally put forward to shed light on the origins of Zipf's law for word frequencies. The article also shows connections between these models and standard information theory and provides an alternative view to Piantadosi's "Zipf's word frequency law in natural language: A critical review and future directions" (Psychonomic Bulletin & Review).
    Project - Word order theory
    Ferrer-i-Cancho, R. (2017). Towards a theory of word order. Comment on "Dependency distance: a new perspective on syntactic patterns in natural language" by Haitao Liu et al. Physics of Life Reviews 21, 218-220.
    A comment on "Neurophysiological dynamics of phrase-structure building during sentence processing" by Nelson et al (2017), Proceedings of the National Academy of Sciences USA 114(18), E3669-E3678.
    Entropy is a fundamental property of a repertoire. Here, we present an efficient algorithm to estimate the entropy of types with the help of Zhang's estimator. The algorithm takes advantage of the fact that the number of different frequencies in a text is in general much smaller than the number of types. We justify the convenience of the algorithm by means of an analysis of the statistical properties of texts from more than 1000 languages. Our work opens up various possibilities for future research.
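The idea of grouping types by frequency can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the simple plug-in (maximum likelihood) entropy estimator rather than Zhang's estimator, and the function names are hypothetical. It shows that summing over distinct frequencies gives the same value as summing over types.

```python
from collections import Counter
import math


def plugin_entropy_by_type(tokens):
    """Plug-in entropy (bits) computed with one term per type."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def plugin_entropy_by_spectrum(tokens):
    """Same estimator, but with one term per distinct frequency.

    spectrum[f] is the number of types that occur exactly f times,
    so each distinct frequency is processed once and weighted by
    how many types share it.
    """
    counts = Counter(tokens)
    n = sum(counts.values())
    spectrum = Counter(counts.values())
    return -sum(m * (f / n) * math.log2(f / n) for f, m in spectrum.items())


# Toy sample (assumption: any sequence of hashable tokens works).
sample = "a a b b c d".split()
h_type = plugin_entropy_by_type(sample)
h_spectrum = plugin_entropy_by_spectrum(sample)
```

In this toy sample there are four types but only two distinct frequencies, so the second function evaluates two terms instead of four.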
    The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally. Information theory gives us tools at hand to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings: Firstly, word entropies display relatively narrow, unimodal distributions. There is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication. Languages are held in a narrow range by two fundamental pressures: word learnability and word expressivity, with a potential bias towards expressivity. Secondly, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around ca. three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across languages of the world.
    In a course of the degree in computer science, the programming project has changed from individual work to team work, tentatively in pairs (pair programming). Students have full freedom to team up, with minimal intervention from teachers. The analysis of the resulting pairs indicates that students do not tend to associate with students of similar academic performance, maybe because general cognitive parameters do not govern the choice of academic partners. Pair programming seems to give great results, so future research in this field should focus precisely on how these pairs are formed, underpinning the mechanisms of human social interactions.
    The minimization of the length of syntactic dependencies is a well-established principle of word order and the basis of a mathematical theory of word order. Here we complete that theory from the perspective of information theory, adding a competing word order principle: the maximization of predictability of a target element. These two principles are in conflict: to maximize the predictability of the head, the head should appear last, which maximizes the costs with respect to dependency length minimization. The implications of such a broad theoretical framework to understand the optimality, diversity and evolution of the six possible orderings of subject, object and verb are reviewed.
    The syntactic structure of a sentence can be modelled as a tree, where vertices correspond to words and edges indicate syntactic dependencies. It has been claimed recurrently that the number of edge crossings in real sentences is small. However, a baseline or null hypothesis has been lacking. Here we quantify the amount of crossings of real sentences and compare it to the predictions of a series of baselines. We conclude that crossings are really scarce in real sentences. Their scarcity is unexpected by the hubiness of the trees. Indeed, real sentences are close to linear trees, where the potential number of crossings is maximized.
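Counting crossings in a sentence is straightforward once each dependency is written as a pair of word positions. The following is a minimal sketch, assuming 1-based positions and the usual definition that two edges cross when exactly one endpoint of one edge lies strictly between the endpoints of the other; the function name and the toy trees are hypothetical.

```python
from itertools import combinations


def count_crossings(edges):
    """Number of pairs of edges that cross when drawn above the sentence.

    Each edge is a pair of word positions; direction is ignored.
    Edges that share a vertex never cross under this definition.
    """
    spans = [tuple(sorted(edge)) for edge in edges]
    crossings = 0
    for (a, b), (c, d) in combinations(spans, 2):
        if a < c < b < d or c < a < d < b:
            crossings += 1
    return crossings


# A 5-word tree where the edges 1-3 and 2-4 cross.
tree_with_crossing = [(1, 3), (2, 4), (3, 4), (4, 5)]
# A star tree with the hub at position 3: all edges share the hub.
star_tree = [(3, 1), (3, 2), (3, 4), (3, 5)]
```

A star tree yields zero crossings under any ordering of its words, because all of its edges share the hub.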
    Here we study polysemy as a potential learning bias in vocabulary learning in children. We employ a massive set of transcriptions of conversations between children and adults in English, to analyze the evolution of mean polysemy in the words produced by children whose ages range between 10 and 60 months. Our results show that mean polysemy in children increases over time in two phases, i.e. a fast growth till the 31st month followed by a slower tendency towards adult speech. In contrast, no dependency with time is found in adults. This suggests that children have a preference for non-polysemous words in their early stages of vocabulary acquisition. Our hypothesis is twofold: (a) polysemy is a standalone bias or (b) polysemy is a side-effect of other biases. Interestingly, the bias for low polysemy described above weakens when controlling for syntactic category (noun, verb, adjective or adverb). The pattern of the evolution of polysemy suggests that both hypotheses may apply to some extent, and that (b) would originate from a combination of the well-known preference for nouns and the lower polysemy of nouns with respect to other syntactic categories.
    The pioneering research of G.K. Zipf on the relationship between word frequency and other word features led to the formulation of various linguistic laws. Here we focus on a couple of them: the meaning-frequency law, i.e. the tendency of more frequent words to be more polysemous, and the law of abbreviation, i.e. the tendency of more frequent words to be shorter. Here we evaluate the robustness of these laws in contexts where they have not been explored yet to our knowledge. The recovery of the laws again in new conditions provides support for the hypothesis that they originate from abstract mechanisms.
    Vocalizations, and less often gestures, have been the object of linguistic research for decades. However, the development of a general theory of communication with human language as a particular case requires a clear understanding of the organization of communication through other means. Infochemicals are chemical compounds that carry information and are employed by small organisms that cannot emit acoustic signals of an optimal frequency to achieve successful communication. Here, we investigate the distribution of infochemicals across species when they are ranked by their degree or the number of species with which they are associated (because they produce them or are sensitive to them). We evaluate the quality of the fit of different functions to the dependency between degree and rank by means of a penalty for the number of parameters of the function. Surprisingly, a double Zipf (a Zipf distribution with two regimes, each with a different exponent) is the model yielding the best fit although it is the function with the largest number of parameters. This suggests that the worldwide repertoire of infochemicals contains a core which is shared by many species and is reminiscent of the core vocabularies found for human language in dictionaries or large corpora.
    A short review of similarities between dolphins and humans with the help of quantitative linguistics and information theory.
    Here we sketch a new derivation of Zipf's law for word frequencies based on optimal coding. The structure of the derivation is reminiscent of Mandelbrot's random typing model but it has multiple advantages over random typing: (1) it departs from realistic cognitive pressures (2) it does not require fine tuning of parameters and (3) it sheds light on the origins of other statistical laws of language and thus can lead to a compact theory of linguistic laws. Our findings suggest that the recurrence of Zipf's law in human languages could originate from pressure for easy and fast communication.
    Identifying universal principles underpinning diverse natural systems is a key goal of the life sciences. A powerful approach in addressing this goal has been to test whether patterns consistent with linguistic laws are found in nonhuman animals. Menzerath's law is a linguistic law that states that, the larger the construct, the smaller the size of its constituents. Here, to our knowledge, we present the first evidence that Menzerath's law holds in the vocal communication of a nonhuman species. We show that, in vocal sequences of wild male geladas (Theropithecus gelada), construct size (sequence size in number of calls) is negatively correlated with constituent size (duration of calls). Call duration does not vary significantly with position in the sequence, but call sequence composition does change with sequence size and most call types are abbreviated in larger sequences. We also find that intercall intervals follow the same relationship with sequence size as do calls. Finally, we provide formal mathematical support for the idea that Menzerath's law reflects compression, the principle of minimizing the expected length of a code. Our findings suggest that a common principle underpins human and gelada vocal communication, highlighting the value of exploring the applicability of linguistic laws in vocal systems outside the realm of language.
    Words that are used more frequently tend to be shorter. This statement is known as Zipf's law of abbreviation. Here we perform the widest investigation of the presence of the law to date. In a sample of 1262 texts and 986 different languages (about 13% of the world's language diversity), a negative correlation between word frequency and word length is found in all cases. In line with Zipf's original proposal, we argue that this universal trend is likely to derive from fundamental principles of information processing and transfer.
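A check of this kind can be sketched on a toy text. The example below makes several assumptions that are not taken from the study: it uses Kendall's tau-a as the correlation measure, measures word length in characters, and runs on an invented one-line text; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Pairs tied on either variable count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs


# Invented toy text (assumption: whitespace tokenization is good enough here).
text = "the cat sat on the mat and the dog sat on the extraordinary carpet"
counts = Counter(text.split())
words = list(counts)
frequencies = [counts[w] for w in words]
lengths = [len(w) for w in words]

tau = kendall_tau_a(frequencies, lengths)
```

A negative `tau` means that more frequent words tend to be shorter in the sample, which is the direction predicted by the law of abbreviation.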
    The minimum linear arrangement problem on a network consists of finding the minimum sum of edge lengths that can be achieved when the vertices are arranged linearly. Although there are algorithms to solve this problem on trees in polynomial time, they have remained theoretical and have not been implemented in practical contexts to our knowledge. Here we use one of those algorithms to investigate the growth of this sum as a function of the size of the tree in uniformly random trees. We show that this sum is bounded above by its value in a star tree. We also show that the mean edge length grows logarithmically in optimal linear arrangements, in stark contrast to the linear growth that is expected on optimal arrangements of star trees or on random linear arrangements.
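For very small trees the problem can be solved by exhaustive search, which makes the definition concrete. This sketch is a brute-force illustration only and is not one of the polynomial-time algorithms mentioned above; the function names are hypothetical and vertices are assumed to be labelled 0 to n-1.

```python
from itertools import permutations


def sum_edge_lengths(edges, position):
    """Sum of |position[u] - position[v]| over all edges."""
    return sum(abs(position[u] - position[v]) for u, v in edges)


def min_linear_arrangement(n, edges):
    """Minimum sum of edge lengths over all n! orderings (small n only)."""
    best = None
    for perm in permutations(range(n)):
        # perm[place] is the vertex placed at that position in the line.
        position = {vertex: place for place, vertex in enumerate(perm)}
        total = sum_edge_lengths(edges, position)
        if best is None or total < best:
            best = total
    return best


n = 5
path_tree = [(i, i + 1) for i in range(n - 1)]  # linear tree
star_tree = [(0, i) for i in range(1, n)]       # hub 0 linked to every other vertex

path_minimum = min_linear_arrangement(n, path_tree)
star_minimum = min_linear_arrangement(n, star_tree)
```

For five vertices the path reaches the smallest possible sum, n - 1, while the star needs more, which is consistent with the star tree acting as an upper bound.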
    Crossing syntactic dependencies have been observed to be infrequent in natural language, to the point that some syntactic theories and formalisms disregard them entirely. This leads to the question of whether the scarcity of crossings in languages arises from an independent and specific constraint on crossings. We provide statistical evidence suggesting that this is not the case, as the proportion of dependency crossings in a wide range of natural language treebanks can be accurately estimated by a simple predictor based on the local probability that two dependencies cross given their lengths. The relative error of this predictor never exceeds 5% on average, whereas a baseline predictor assuming a random ordering of the words of a sentence incurs a relative error that is at least 6 times greater. Our results suggest that the low frequency of crossings in natural languages originates neither from hidden knowledge of language nor from the undesirability of crossings per se, but arises as a mere side effect of the principle of dependency length minimization.
    Word order evolution has been hypothesized to be constrained by a word order permutation ring: transitions involving orders that are closer in the permutation ring are more likely. The hypothesis can be seen as a particular case of Kauffman's adjacent possible in word order evolution. Here we consider the problem of the association of the six possible orders of S, V and O to yield a couple of primary alternating orders as a window to word order evolution. We evaluate the suitability of various competing hypotheses to predict one member of the couple from the other with the help of information theoretic model selection. Our ensemble of models includes a six-way model that is based on the word order permutation ring (Kauffman's adjacent possible) and another model based on the dual two-way of standard typology, which reduces word order to basic order preferences (e.g., a preference for SV over VS and another for SO over OS). Our analysis indicates that the permutation ring yields the best model when favoring parsimony strongly, providing support for Kauffman's general view and a six-way typology.
    More than 30 years ago, Shiloach published an algorithm to solve the minimum linear arrangement problem for undirected trees. Here we fix a small error in the original version of the algorithm.
    A commentary on the article "Large-scale evidence of dependency length minimization in 37 languages" by Futrell, Mahowald & Gibson (PNAS 2015 112 (33) 10336-10341).
    In a recent article, Christiansen and Chater (2015) present a fundamental constraint on language, i.e. a now-or-never bottleneck that arises from our fleeting memory, and explore its implications, e.g., chunk-and-pass processing, outlining a framework that promises to unify different areas of research. Here we explore additional support for this constraint and suggest further connections from quantitative linguistics and information theory.
    The syntactic structure of sentences exhibits a striking regularity: dependencies tend to not cross when drawn above the sentence. Here we investigate two competing hypotheses for the origins of non-crossing dependencies. The traditional hypothesis is that the low frequency of dependency crossings arises from an independent principle of syntax that reduces crossings practically to zero. An alternative to this view is the hypothesis that crossings are a side effect of dependency lengths. According to this view, sentences with shorter dependency lengths should tend to have fewer crossings. We recast the traditional view as a null hypothesis where one of the variables, i.e. the number of crossings, is mean independent of the other, i.e. the sum of dependency lengths. The alternative view is then a positive correlation between these two variables. In spite of the rough estimation of dependency crossings that this sum provides, we are able to reject the traditional view in the majority of languages considered. The alternative hypothesis can lead to a more parsimonious theory of syntax.
    Languages across the world exhibit Zipf's law of abbreviation, namely more frequent words tend to be shorter. The generalized version of the law, namely an inverse relationship between the frequency of a unit and its magnitude, holds also for the behaviors of other species and the genetic code. The apparent universality of this pattern in human language and its ubiquity in other domains calls for a theoretical understanding of its origins. We generalize the information theoretic concept of mean code length as a mean energetic cost function over the probability and the magnitude of the symbols of the alphabet. We show that the minimization of that cost function and a negative correlation between probability and the magnitude of symbols are intimately related.
    It is well known that the length of a syntactic dependency determines its online memory cost. Thus, the problem of the placement of a head and its dependents (complements or modifiers) that minimizes online memory is equivalent to the problem of the minimum linear arrangement of a star tree. However, how that length is translated into cognitive cost is not known. This study shows that the online memory cost is minimized when the head is placed at the center, regardless of the function that transforms length into cost, provided only that this function is strictly monotonically increasing. Online memory defines a quasi-convex adaptive landscape with a single central minimum if the number of elements is odd and two central minima if that number is even. We discuss various aspects of the dynamics of word order of subject (S), verb (V) and object (O) from a complex systems perspective and suggest that word orders tend to evolve by swapping adjacent constituents from an initial or early SOV configuration that is attracted towards a central word order by online memory minimization. We also suggest that the stability of SVO is due to at least two factors, the quasi-convex shape of the adaptive landscape in the online memory dimension and online memory adaptations that avoid regression to SOV. Although OVS is also optimal for placing the verb at the center, its low frequency is explained by its long distance to the seminal SOV in the permutation space.
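The central-placement result can be checked numerically for small cases. The sketch below is an illustration, not the proof: it assumes positions 1 to n, takes the cost of a dependency to be g applied to the distance between head and dependent, and tries three arbitrary strictly increasing choices of g. All names are hypothetical.

```python
import math


def star_cost(n, head, g):
    """Total cost of a star tree with the head at position `head` (1..n)."""
    # math.fsum returns the correctly rounded sum, so mirror-image
    # placements give exactly equal costs.
    return math.fsum(g(abs(i - head)) for i in range(1, n + 1) if i != head)


def best_head_positions(n, g):
    """All head positions that achieve the minimum total cost."""
    costs = {head: star_cost(n, head, g) for head in range(1, n + 1)}
    lowest = min(costs.values())
    return [head for head, cost in costs.items() if cost == lowest]


# Three strictly increasing cost functions, chosen only for illustration.
cost_functions = {
    "linear": lambda d: d,
    "quadratic": lambda d: d ** 2,
    "logarithmic": lambda d: math.log(1 + d),
}

odd_case = {name: best_head_positions(5, g) for name, g in cost_functions.items()}
even_case = {name: best_head_positions(6, g) for name, g in cost_functions.items()}
```

With five elements every cost function selects the single central position, and with six elements each selects the two central positions, matching the single and double minima described above.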
    The Sulawesi endemic species Macaca maura has been listed in the IUCN Red List as Endangered (A2cd) since 1996, mainly due to habitat disturbance and fragmentation. Nowadays, residual populations have increasingly been relegated to the karst areas of the island's Southern District. The main goal of this project is to point out preferences and strategies adopted by the studied group (N individuals = 31) in the use of this rare habitat, in order to contribute to the understanding of functional aspects of the spatial niche of the species. Hutcheson's diversity t-test on 34 vegetation plots yielded significant differences between the 'ground forest' and the 'karst tower forest' (t = –6.31; p < 0.05). The t test showed significant differences also in terms of tree abundance (t = 4.76; p < 0.05), average tree DBH (t = –2.96; p < 0.05) and average canopy closure (t = –4.61; p < 0.05), suggesting that the two forests could perhaps be considered as distinct micro-habitats. Preliminary analyses on habitat use revealed preferential use of 'ground forest' by all group members during selected budget activities (locomotion; feeding and foraging; social behaviour ; resting). On the other hand, no differences were found in selective use of the two habitats between sex classes (8 adult females, 4 adult males) or age classes (6 young adults, 6 old adults). As a multidisciplinary study design, the group's feeding ecology and health status will also be analysed for possible ecological correlates to habitat-use. In fact, a comparison between food selection and nutritional composition of preferred foods is currently being performed, in addition to enteric parasite analyses on stool samples (collected in accordance with Directive 2010/63/EU). Project results will help to understand whether the macaques' dispersed presence in the area of karst formations, which is hardly accessible by humans, is an ideal condition for the survival of the species.
    Here we respond to some comments by Alday concerning headedness in linguistic theory and the validity of the assumptions of a mathematical model for word order. For brevity, we focus only on two assumptions: the unit of measurement of dependency length and the monotonicity of the cost of a dependency as a function of its length. We also revise the implicit psychological bias in Alday's comments. Notwithstanding, Alday is indicating the path for linguistic research with his unusual concerns about parsimony from multiple dimensions.
    A family of information theoretic models of communication was introduced more than a decade ago to explain the origins of Zipf's law for word frequencies. The family is based on a combination of two information theoretic principles: maximization of mutual information between forms and meanings and minimization of form entropy. The family also sheds light on the origins of three other patterns: the principle of contrast, a related vocabulary learning bias and the meaning-frequency law. Here two important components of the family, namely the information theoretic principles and the energy function that combines them linearly, are reviewed from the perspective of psycholinguistics, language learning, information theory and synergetic linguistics. The minimization of this linear function resembles a sort of agnostic information theoretic model selection that might be tuned by self-organization.
    Menzerath's law, the tendency of Z, the mean size of the parts, to decrease as X, the number of parts, increases, is found in language, music and genomes. Recently, it has been argued that the presence of the law in genomes is an inevitable consequence of the fact that Z = Y/X, which would imply that Z scales with X as Z ~ 1/X. That scaling is a very particular case of the Menzerath-Altmann law that has been rejected by means of a correlation test between X and Y in genomes, where X is the number of chromosomes of a species, Y its genome size in bases and Z the mean chromosome size. Here we review the statistical foundations of that test and consider three non-parametric tests based upon different correlation metrics and one parametric test to evaluate if Z ~ 1/X in genomes. The most powerful test is a new non-parametric test based upon the correlation ratio, which is able to reject Z ~ 1/X in ten out of eleven taxonomic groups. Rather than a fact, Z ~ 1/X is a baseline that real genomes do not meet. The view of the Menzerath-Altmann law as inevitable is seriously flawed.
    We study the correlations in the connectivity patterns of large scale syntactic dependency networks. These networks are induced from treebanks: their vertices denote word forms which occur as nuclei of dependency trees. Their edges connect pairs of vertices if at least two instance nuclei of these vertices are linked in the dependency structure of a sentence. We examine the syntactic dependency networks of seven languages. In all these cases, we consistently obtain three findings. Firstly, clustering, i.e., the probability that two vertices which are linked to a common vertex are linked on their part, is much higher than expected by chance. Secondly, the mean clustering of vertices decreases with their degree; this finding suggests the presence of a hierarchical
    Animal acoustic communication often takes the form of complex sequences, made up of multiple distinct acoustic units. Apart from the well-known example of birdsong, other animals such as insects, amphibians, and mammals (including bats, rodents, primates, and cetaceans) also generate complex acoustic sequences. Occasionally, such as with birdsong, the adaptive role of these sequences seems clear (e.g. mate attraction and territorial defence). More often however, researchers have only begun to characterise – let alone understand – the significance and meaning of acoustic sequences. Hypotheses abound, but there is little agreement as to how sequences should be defined and analysed. Our review aims to outline suitable methods for testing these hypotheses, and to describe the major limitations to our current and near-future knowledge on questions of acoustic sequences. This review and prospectus is the result of a collaborative effort between 43 scientists from the fields of animal behaviour, ecology and evolution, signal processing, machine learning, quantitative linguistics, and information theory, who gathered for a 2013 workshop entitled, ‘Analysing vocal sequences in animals’. Our goal is to present not just a review of the state of the art, but to propose a methodological framework that summarises what we suggest are the best practices for research in this field, across taxa and across disciplines. We also provide a tutorial-style introduction to some of the most promising algorithmic approaches for analysing sequences. We divide our review into three sections: identifying the distinct units of an acoustic sequence, describing the different ways that information can be contained within a sequence, and analysing the structure of that sequence. Each of these sections is further subdivided to address the key questions and approaches in that area. 
    We propose a uniform, systematic, and comprehensive approach to studying sequences, with the goal of clarifying research terms used in different fields, and facilitating collaboration and comparative studies. Allowing greater interdisciplinary collaboration will facilitate the investigation of many important questions in the evolution of communication and sociality.
    The use of null hypotheses (in a statistical sense) is common in hard sciences but not in theoretical linguistics. Here the null hypothesis that the low frequency of syntactic dependency crossings is expected by an arbitrary ordering of words is rejected. It is shown that this would require star dependency structures, which are both unrealistic and too restrictive. The hypothesis of the limited resources of the human brain is revisited. Stronger null hypotheses taking into account actual dependency lengths for the likelihood of crossings are presented. Those hypotheses suggest that crossings are likely to reduce when dependencies are shortened. A hypothesis based on pressure to reduce dependency lengths is more parsimonious than a principle of minimization of crossings or a grammatical ban that is totally dissociated from the general and non-linguistic principle of economy.
    The syntactic structure of a sentence can be modeled as a tree where vertices are words and edges indicate syntactic dependencies between words. It is well-known that those edges normally do not cross when drawn over the sentence. Here a new null hypothesis for the number of edge crossings of a sentence is presented. That null hypothesis takes into account the length of the pair of edges that may cross and predicts the relative number of crossings in random trees with a small error, suggesting that a ban of crossings or a principle of minimization of crossings are not needed in general to explain the origins of non-crossing dependencies. Our work paves the way for more powerful null hypotheses for investigating the origins of non-crossing dependencies in nature.
    According to Zipf's meaning-frequency law, words that are more frequent tend to have more meanings. Here it is shown that a linear dependency between the frequency of a form and its number of meanings is found in a family of models of Zipf's law for word frequencies. This is evidence for a weak version of the meaning-frequency law. Interestingly, that weak law (a) is not an inevitable property of the assumptions of the family and (b) is found at least in the narrow regime where those models exhibit Zipf's law for word frequencies.
    Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. In order to have as homogeneous sources as possible, we analyze some of the longest literary texts ever written, comprising four different languages, with different levels of morphological complexity. In all cases Zipf's law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf's law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable.
    It has been hypothesized that early stages of language have left traces of simpler forms of language, for instance, in child language (Bickerton, 1990; Jackendoff, 1999). Word learning biases in children (Saxton, 2010) suggest constraints that human language had to meet at its very origin. Here a candidate for a new (to the best of our knowledge) learning bias is investigated: a global preference for words with a small number of meanings unraveled by the analysis of massive electronic corpora in English and Dutch…
    Vocabulary learning by children is characterized by many biases. When encountering a new word, children, as well as adults, are biased towards assuming that it means something totally different from the words that they already know. To the best of our knowledge, the first mathematical proof of the optimality of this bias is presented here. First, it is shown that this bias is a particular case of the maximization of mutual information between words and meanings. However, this is not the only information theoretic principle constraining communication. Second, the optimality is proven within a more general information theoretic framework where mutual information maximization competes with other information theoretic principles. The bias is a prediction from modern information theory.
    Little is known about why SOV order is initially preferred and then discarded or recovered. Here we present a framework for understanding these and many related word order phenomena: the diversity of dominant orders, the existence of free word orders, the need for alternative word orders and word order reversions and cycles in evolution. Under that framework, word order is regarded as a multiconstraint satisfaction problem in which at least two constraints are in conflict: online memory minimization and maximum predictability.
    It is well known that the length of a syntactic dependency determines its online memory cost. Thus, the problem of the placement of a head and its dependents (complements or modifiers) that minimizes online memory is equivalent to the problem of the minimum linear arrangement of a star tree. However, how that length is translated into a cognitive cost is not known. Here it is shown that the online memory cost is minimized when the head is placed at the center, regardless of the function that transforms length into cost, provided only that this function is strictly monotonically increasing. Online memory defines a quasi-convex adaptive landscape with a single central minimum if the number of elements is odd and two central minima if that number is even. We discuss various aspects of the dynamics of word order of subject (S), verb (V) and object (O) from a complex systems perspective. We suggest that word orders tend to evolve by swapping adjacent constituents from an initial or early SOV configuration that is attracted towards a central word order by online memory minimization. We also suggest that the stability of SVO is due to at least two factors, the quasi-convex shape of the adaptive landscape in the online memory dimension and online memory adaptations to avoid regression to SOV. Although OVS is also optimal for placing the verb at the center, its low frequency is explained by its long distance to the seminal SOV in the permutation space.
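The central-placement claim in the abstract above can be checked by brute force on small star trees. The sketch below is only an illustration: the function names and the example cost functions are assumptions, and integer-valued costs are used so that ties between the two central positions compare exactly (floating-point costs would need a tolerance).

```python
def star_cost(n, head_pos, g=lambda d: d):
    """Total cost of a star tree with n vertices placed at positions 1..n,
    with the head at head_pos and g mapping dependency length to cost."""
    return sum(g(abs(head_pos - p)) for p in range(1, n + 1) if p != head_pos)


def optimal_head_positions(n, g=lambda d: d):
    """Return every head position that attains the minimum total cost."""
    costs = {pos: star_cost(n, pos, g) for pos in range(1, n + 1)}
    best = min(costs.values())
    return [pos for pos, cost in costs.items() if cost == best]


if __name__ == "__main__":
    # Two strictly increasing cost functions (illustrative choices).
    for g in (lambda d: d, lambda d: d ** 3):
        print(optimal_head_positions(5, g), optimal_head_positions(6, g))
```

With an odd number of elements the single central position is returned, and with an even number the two central positions tie, for both cost functions tried here.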
    A key aim in biology and psychology is to identify fundamental principles underpinning the behavior of animals, including humans. Analyses of human language and the behavior of a range of non-human animal species have provided evidence for a common pattern underlying diverse behavioral phenomena: Words follow Zipf's law of brevity (the tendency of more frequently used words to be shorter), and conformity to this general pattern has been seen in the behavior of a number of other animals. It has been argued that the presence of this law is a sign of efficient coding in the information theoretic sense. However, no strong direct connection has been demonstrated between the law and compression, the information theoretic principle of minimizing the expected length of a code. Here, we show that minimizing the expected code length implies that the length of a word cannot increase as its frequency increases. Furthermore, we show that the mean code length or duration is significantly small in human language, and also in the behavior of other species in all cases where agreement with the law of brevity has been found. We argue that compression is a general principle of animal behavior that reflects selection for efficiency of coding.
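The link between compression and the law of brevity stated above can be illustrated with a small exhaustive search. This is a sketch under simplifying assumptions: the multiset of available lengths is fixed in advance, frequencies are integer counts, and the function names are hypothetical. It is not the derivation used in the article.

```python
from itertools import permutations


def total_length(freqs, lengths):
    """Sum of frequency times length, proportional to the expected code length."""
    return sum(f * l for f, l in zip(freqs, lengths))


def best_assignments(freqs, lengths):
    """Try every assignment of the given lengths to the types and return
    the minimum total length with all assignments that attain it."""
    costs = {perm: total_length(freqs, perm) for perm in set(permutations(lengths))}
    best = min(costs.values())
    return best, sorted(p for p, c in costs.items() if c == best)


def never_longer_when_more_frequent(freqs, lengths):
    """True if no type is both strictly more frequent and strictly longer
    than another type."""
    pairs = list(zip(freqs, lengths))
    return all(not (fi > fj and li > lj) for fi, li in pairs for fj, lj in pairs)


if __name__ == "__main__":
    best, winners = best_assignments([50, 30, 15, 5], [1, 2, 3, 4])
    print(best, winners)
```

Every minimizing assignment found this way gives shorter (or equal) lengths to more frequent types, which is the weak ordering that the abstract describes.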
    It has been hypothesized that the rather small number of crossings in real syntactic dependency trees is a side-effect of pressure for dependency length minimization. Here we answer a related important research question: what would be the expected number of crossings if the natural order of a sentence was lost? We show that this number depends only on the number of vertices of the dependency tree (the sentence length) and the second moment of vertex degrees. The expected number of crossings is minimum for a star tree (crossings are impossible) and maximum for a linear tree (the number of crossings is of the order of the square of the sequence length).
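The dependence on sentence length and the second moment of degrees can be checked numerically. The closed form used below, E[C] = (n/6)(n - 1 - <k^2>), is a reconstruction from a counting argument (the number of edge pairs sharing no vertex is (n/2)(n - 1 - <k^2>), and each such pair crosses with probability 1/3 in a uniformly random ordering); it is not quoted from the abstract. Function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, permutations


def crossings(edges, position):
    """Count edge crossings when vertex v is placed at position[v] on a line."""
    count = 0
    for (a, b), (c, d) in combinations(edges, 2):
        if len({a, b, c, d}) < 4:
            continue  # edges that share a vertex cannot cross
        lo1, hi1 = sorted((position[a], position[b]))
        lo2, hi2 = sorted((position[c], position[d]))
        if lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1:
            count += 1
    return count


def expected_crossings_brute_force(n, edges):
    """Average the number of crossings over all n! linear arrangements."""
    total, arrangements = 0, 0
    for perm in permutations(range(n)):
        total += crossings(edges, perm)
        arrangements += 1
    return Fraction(total, arrangements)


def expected_crossings_formula(n, edges):
    """Closed form based on n and the second moment of vertex degrees."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    second_moment = Fraction(sum(k * k for k in degree), n)
    return Fraction(n, 6) * (n - 1 - second_moment)


if __name__ == "__main__":
    path = [(0, 1), (1, 2), (2, 3), (3, 4)]   # linear tree
    star = [(0, 1), (0, 2), (0, 3), (0, 4)]   # star tree
    for tree in (path, star):
        print(expected_crossings_brute_force(5, tree), expected_crossings_formula(5, tree))
```

For five vertices the star tree gives zero expected crossings and the linear tree gives one, in line with the minimum and maximum cases named in the abstract.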
    Constant entropy rate (conditional entropies must remain constant as the sequence length increases) and uniform information density (conditional probabilities must remain constant as the sequence length increases) are two information theoretic principles that are argued to underlie a wide range of linguistic phenomena. Here we revise the predictions of these principles in the light of Hilberg's law on the scaling of conditional entropy in language and related laws. We show that constant entropy rate (CER) and two interpretations for uniform information density (UID), full UID and strong UID, are inconsistent with these laws. Strong UID implies CER but the reverse is not true. Full UID, a particular case of UID, leads to costly uncorrelated sequences that are totally unrealistic. We conclude that CER and its particular cases are incomplete hypotheses about the scaling of conditional entropies.
    Networks of interconnected nodes have long played a key role in Cognitive Science, from artificial neural networks to spreading activation models of semantic memory. Recently, however, a new Network Science has been developed, providing insights into the emergence of global, system-scale properties in contexts as diverse as the Internet, metabolic reactions, and collaborations among scientists. Today, the inclusion of network theory into Cognitive Sciences, and the expansion of complex-systems science, promises to significantly change the way in which the organization and dynamics of cognitive and behavioral processes are understood. In this paper, we review recent contributions of network theory at different levels and domains within the Cognitive Sciences.
    Here tree dependency structures are studied from three different perspectives: their degree variance (hubiness), the mean dependency length and the number of dependency crossings. Bounds that reveal pairwise dependencies among these three metrics are derived. Hubiness (the variance of degrees) plays a central role: the mean dependency length is bounded below by hubiness while the number of crossings is bounded above by hubiness. Our findings suggest that the online memory cost of a sentence might be determined not just by the ordering of words but also by the hubiness of the underlying structure. The 2nd moment of degree plays a crucial role that is reminiscent of its role in large complex networks.
    Mixing dependency lengths from sequences of different length is a common practice in language research. However, the empirical distribution of dependency lengths of sentences of the same length differs from that of sentences of varying length and the distribution of dependency lengths depends on sentence length for real sentences and also under the null hypothesis that dependencies connect vertices located in random positions of the sequence. This suggests that certain results, such as the distribution of syntactic dependency lengths mixing dependencies from sentences of varying length, could be a mere consequence of that mixing. Furthermore, differences in the global averages of dependency length (mixing lengths from sentences of varying length) for two different languages do not simply imply a priori that one language optimizes dependency lengths better than the other because those differences could be due to differences in the distribution of sentence lengths and other factors.
    It is well-known that word frequencies arrange themselves according to Zipf's law. However, little is known about the dependency between the parameters of the law and the complexity of a communication system. Many models of the evolution of language assume that the exponent of the law remains constant as the complexity of a communication system increases. Using longitudinal studies of child language, we analysed the word rank distribution for the speech of children and adults participating in conversations. The adults typically included family members (e.g., parents) or the investigators conducting the research. Our analysis of the evolution of Zipf's law yields two main unexpected results. First, in children the exponent of the law tends to decrease over time while this tendency is weaker in adults, thus suggesting this is not a mere mirror effect of adult speech. Second, although the exponent of the law is more stable in adults, their exponents fall below 1 which is the typical value of the exponent assumed in both children and adults. Our analysis also shows a tendency of the mean length of utterances (MLU), a simple estimate of syntactic complexity, to increase as the exponent decreases. The parallel evolution of the exponent and a simple indicator of syntactic complexity (MLU) supports the hypothesis that the exponent of Zipf's law and linguistic complexity are inter-related. The assumption that Zipf's law for word ranks is a power-law with a constant exponent of one in both adults and children needs to be revised.
    Here we improve the mathematical arguments of Baixeries et al (BioSystems 107(3) (2012) 167–173). The corrections do not alter the conclusion that the random breakage model yields an insufficient fit to the scaling of mean chromosome length as a function of chromosome number in real genomes.
    The importance of statistical patterns of language has been debated over decades. Although Zipf's law is perhaps the most popular case, recently, Menzerath's law has begun to be involved. Menzerath's law manifests in language, music and genomes as a tendency of the mean size of the parts to decrease as the number of parts increases in many situations. This statistical regularity emerges also in the context of genomes, for instance, as a tendency of species with more chromosomes to have a smaller mean chromosome size. It has been argued that the instantiation of this law in genomes is not indicative of any parallel between language and genomes because (a) the law is inevitable and (b) non-coding DNA dominates genomes. Here mathematical, statistical and conceptual challenges of these criticisms are discussed. Two major conclusions are drawn: the law is not inevitable and languages also have a correlate of non-coding DNA. However, the wide range of manifestations of the law in and outside genomes suggests that the striking similarities between non-coding DNA and certain linguistic units could be anecdotal for understanding the recurrence of that statistical law.
    Power laws pervade statistical physics and complex systems, but, traditionally, researchers in these fields have paid little attention to properly fitting these distributions. Who has not seen (or even shown) a log-log plot of a completely curved line pretending to be a power law? Recently, Clauset et al. have proposed a method to decide if a set of values of a variable has a distribution whose tail is a power law. The key to their procedure is the identification of the minimum value of the variable for which the fit holds, which is selected as the value for which the Kolmogorov-Smirnov distance between the empirical distribution and its maximum-likelihood fit is minimum. However, it has been shown that this method can reject the power-law hypothesis even in the case of power-law simulated data. Here we propose a simpler selection criterion, which is illustrated with the more involved case of discrete power-law distributions.
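As a small illustration of the maximum-likelihood step mentioned above, the sketch below estimates the exponent of a continuous power-law tail for a given minimum value. It is a simplification: the abstract concerns discrete distributions and a specific criterion for selecting the minimum value, neither of which is reproduced here, and the function names and synthetic data are assumptions.

```python
import math


def mle_exponent(values, x_min):
    """Maximum-likelihood exponent of a continuous power law p(x) ~ x**(-alpha)
    fitted to the values that are at least x_min.
    Returns (alpha_hat, number_of_tail_points)."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if not tail or log_sum == 0.0:
        raise ValueError("tail is empty or degenerate for this x_min")
    return 1.0 + len(tail) / log_sum, len(tail)


def synthetic_power_law(n, alpha, x_min=1.0):
    """Deterministic sample: inverse-transform values at evenly spaced quantiles."""
    return [x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0)) for i in range(n)]


if __name__ == "__main__":
    data = synthetic_power_law(1000, alpha=2.5)
    print(mle_exponent(data, x_min=1.0))
```

On the synthetic sample the estimate is close to the exponent used to generate it; on real data the result depends strongly on the choice of the minimum value, which is the issue the abstract addresses.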
    Words follow the law of brevity, i.e. more frequent words tend to be shorter. From a statistical point of view, this qualitative definition of the law states that word length and word frequency are negatively correlated. Here the recent finding of patterning consistent with the law of brevity in Formosan macaque vocal communication (Semple, Hsu, & Agoramoorthy, 2010, Biology Letters, 6: 469–471) is revisited. It is shown that the negative correlation between mean duration and frequency of use in the vocalizations of Formosan macaques is not an artefact of the use of a mean duration for each call type instead of the customary ‘word’ length of studies of the law in human language. The key point demonstrated is that the total duration of calls of a particular type increases with the number of calls of that type. The finding of the law of brevity in the vocalizations of these macaques therefore defies a trivial explanation.
    Long-range correlations are found in symbolic sequences from human language, music and DNA. Determining the span of correlations in dolphin whistle sequences is crucial for shedding light on their communicative complexity. Dolphin whistles share various statistical properties with human words, i.e. Zipf's law for word frequencies (namely that the probability of the $i$th most frequent word of a text is about $i^{-\alpha}$) and a parallel of the tendency of more frequent words to have more meanings. The finding of Zipf's law for word frequencies in dolphin whistles has been the topic of an intense debate on its implications. One of the major arguments against the relevance of Zipf's law in dolphin whistles is that it is not possible to distinguish the outcome of a die rolling experiment from that of a linguistic or communicative source producing Zipf's law for word frequencies. Here we show that statistically significant whistle-whistle correlations extend back to the 2nd previous whistle in the sequence using a global randomization test and to the 4th previous whistle using a local randomization test. None of these correlations are expected from a die rolling experiment or from other simple explanations of Zipf's law for word frequencies, such as Simon's model, that produce sequences of unpredictable elements.
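A global randomization test of the kind mentioned above can be sketched as follows. The test statistic (mutual information between elements a fixed distance apart), the number of shuffles and all names are assumptions made for illustration; the abstract does not specify them, and the local randomization test is not reproduced.

```python
import math
import random
from collections import Counter


def lagged_mutual_information(seq, lag):
    """Mutual information (in bits) between elements that are lag positions apart."""
    pairs = list(zip(seq[:-lag], seq[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())


def global_randomization_p_value(seq, lag, trials=1000, seed=0):
    """Estimate how often a shuffle of the whole sequence yields a statistic
    at least as large as the observed one."""
    rng = random.Random(seed)
    observed = lagged_mutual_information(seq, lag)
    shuffled = list(seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if lagged_mutual_information(shuffled, lag) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids a zero p-value


if __name__ == "__main__":
    sequence = list("AB" * 50)  # strictly alternating toy sequence
    print(lagged_mutual_information(sequence, 1))
    print(global_randomization_p_value(sequence, 1, trials=200))
```

Shuffling destroys the order of the sequence while preserving the frequency of each element, so a small p-value indicates dependence beyond what element frequencies alone produce.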
    Parallels of Zipf's law of brevity, the tendency of more frequent words to be shorter, have been found in bottlenose dolphins and Formosan macaques. Although these findings suggest that behavioral repertoires are shaped by a general principle of compression, common marmosets and golden-backed uakaris do not exhibit the law. However, we argue that the law may be impossible or difficult to detect statistically in a given species if the repertoire is too small, a problem that could be affecting golden-backed uakaris, and show that the law is present in a subset of the repertoire of common marmosets. We suggest that the visibility of the law will depend on the subset of the repertoire under consideration or the repertoire size.
    The relationship between the size of the whole and the size of the parts in language and music is known to follow Menzerath-Altmann law at many levels of description (morphemes, words, sentences...). Qualitatively, the law states that the larger the whole, the smaller its parts, e.g., the longer a word (in syllables) the shorter its syllables (in letters or phonemes). This patterning has also been found in genomes: the longer a genome (in chromosomes), the shorter its chromosomes (in base pairs). However, it has been argued recently that mean chromosome length is trivially a pure power function of chromosome number with an exponent of -1. The functional dependency between mean chromosome size and chromosome number in groups of organisms from three different kingdoms is studied. The fit of a pure power function yields exponents between -1.6 and 0.1. It is shown that an exponent of -1 is unlikely for fungi, gymnosperm plants, insects, reptiles, ray-finned fishes and amphibians. Even when the exponent is very close to -1, adding an exponential component is able to yield a better fit with regard to a pure power-law in plants, mammals, ray-finned fishes and amphibians. The parameters of Menzerath-Altmann law in genomes deviate significantly from a power law with a -1 exponent with the exception of birds and cartilaginous fishes.
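The "trivial" exponent of -1 discussed above arises when genome size does not vary with chromosome number, since mean chromosome length is then genome size divided by chromosome number. The sketch below fits a pure power function by least squares on logarithms to synthetic data of that kind. The fitting procedure, the function name and the numbers are assumptions for illustration; the abstract does not state how the fits were obtained.

```python
import math


def fit_power_function(xs, ys):
    """Least-squares fit of log(y) against log(x) for the model y = a * x**b.
    Returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


if __name__ == "__main__":
    genome_size = 1000.0                      # hypothetical constant genome size
    chromosome_numbers = [2, 4, 8, 16, 32]
    mean_lengths = [genome_size / n for n in chromosome_numbers]
    print(fit_power_function(chromosome_numbers, mean_lengths))
```

When genome size is held constant the fitted exponent is -1, so exponents far from -1 in real data indicate that genome size changes with chromosome number.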
    Recently, a random breakage model has been proposed to explain the negative correlation between mean chromosome length and chromosome number that is found in many groups of species and is consistent with Menzerath-Altmann law, a statistical law that defines the dependency between the mean size of the whole and the number of parts in quantitative linguistics. Here, the central assumption of the model, namely that genome size is independent from chromosome number, is reviewed. This assumption is shown to be unrealistic from the perspective of chromosome structure and the statistical analysis of real genomes. A general class of random models, including that random breakage model, is analyzed. For any model within this class, a power law with an exponent of -1 is predicted for the expectation of the mean chromosome size as a function of chromosome length, a functional dependency that is not supported by real genomes. The random breakage model and its variants keeping genome size and chromosome number independent raise no serious objection to the relevance of correlations consistent with Menzerath-Altmann law across taxonomic groups and the possibility of a connection between human language and genomes through that law.
    Recently, it has been claimed that a linear relationship between a measure of information content and word length is expected from word length optimization and it has been shown that this linearity is supported by a strong correlation between information content and word length in many languages (Piantadosi et al 2011 Proc. Nat. Acad. Sci. 108 3825). Here, we study in detail some connections between this measure and standard information theory. The relationship between the measure and word length is studied for the popular random typing process where a text is constructed by pressing keys at random from a keyboard containing letters and a space behaving as a word delimiter. Although this random process does not optimize word lengths according to information content, it exhibits a linear relationship between information content and word length. The exact slope and intercept are presented for three major variants of the random typing process. A strong correlation between information content and word length can simply arise from the units making a word (e.g., letters) and not necessarily from the interplay between a word and its context as proposed by Piantadosi and co-workers. In itself, the linear relation does not entail the results of any optimization process.
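The linearity described above can be seen in a stripped-down version of random typing. The sketch assumes one particular variant (each keystroke is a space with probability `p_space` and otherwise one of `n_letters` equally likely letters, with empty words allowed) and uses the unconditional surprisal of a word as a stand-in for information content; the measure and the three variants analysed in the article may differ, and all names are hypothetical.

```python
import math


def word_probability(length, n_letters, p_space):
    """Probability that the next word is one specific word of the given length."""
    return ((1 - p_space) / n_letters) ** length * p_space


def surprisal_bits(length, n_letters, p_space):
    """Surprisal -log2 p(w) of a specific word of the given length."""
    return -math.log2(word_probability(length, n_letters, p_space))


def slope_and_intercept(n_letters, p_space):
    """Surprisal as a linear function of length: slope * length + intercept."""
    return math.log2(n_letters / (1 - p_space)), -math.log2(p_space)


if __name__ == "__main__":
    slope, intercept = slope_and_intercept(26, 0.2)
    for length in range(1, 6):
        print(length, surprisal_bits(length, 26, 0.2), slope * length + intercept)
```

Under these assumptions the surprisal grows linearly with word length even though nothing is being optimized, which is the point the abstract makes about random typing.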
    It is known that chromosome number tends to decrease as genome size increases in angiosperm plants. Here the relationship between number of parts (the chromosomes) and size of the whole (the genome) is studied for other groups of organisms from different kingdoms. Two major results are obtained. First, the finding of relationships of the kind “the more parts the smaller the whole” as in angiosperms, but also relationships of the kind “the more parts the larger the whole”. Second, these dependencies are not linear in general. The implications of the dependencies between genome size and chromosome number are two-fold. First, they indicate that arguments against the relevance of the finding of negative correlations consistent with Menzerath-Altmann law (a linguistic law that relates the size of the parts with the size of the whole) in genomes are seriously flawed. Second, they unravel the weakness of a recent model of chromosome lengths based upon random breakage that assumes that chromosome number and genome size are independent.
    Scaling laws are ubiquitous in nature, and they pervade neural, behavioral and linguistic activities. A scaling law suggests the existence of processes or patterns that are repeated across scales of analysis. Although the variables that express a scaling law can vary from one type of activity to the next, the recurrence of scaling laws across so many different systems has prompted a search for unifying principles. In biological systems, scaling laws can reflect adaptive processes of various types and are often linked to complex systems poised near critical points. The same is true for perception, memory, language and other cognitive phenomena. Findings of scaling laws in cognitive science are indicative of scaling invariance in cognitive mechanisms and multiplicative interactions among interdependent components of cognition.
    It shows the age ranges of the target children considered for our analysis, explains the rationale behind the choice of the different cut-offs, shows results not included in the main article (based upon lower cut-offs for normalization by prefix and also the normalization by random sampling, which is not used for the main article), compares the fit of a fixed ( ) versus a free and summarizes the range of variation of the exponent . (PDF)
    Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2,...) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts - constructed by concatenating random characters including blanks behaving as word delimiters - exhibit a Zipf's law-like word rank distribution. In this article, we examine the flaws of such putative good fits of random texts. We demonstrate - by means of three different statistical tests - that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text. Our findings are valid for both the simplest random texts composed of equally likely characters as well as more elaborate and realistic versions where character probabilities are borrowed from a real text. The good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages.
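The two ingredients of the comparison above, a rank-frequency table and a random text with blanks acting as word delimiters, can be sketched in a few lines. The alphabet, the probability of a blank and the function names are assumptions for illustration, and the statistical tests of the article are not reproduced.

```python
import random
from collections import Counter


def rank_frequency(words):
    """Return (rank, word, frequency) triples, most frequent word first.
    Ties are broken alphabetically so that the output is deterministic."""
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]


def random_text(n_chars, alphabet="ab", p_space=0.2, seed=0):
    """Random typing: each character is a blank with probability p_space,
    otherwise a letter drawn uniformly from the alphabet."""
    rng = random.Random(seed)
    chars = [" " if rng.random() < p_space else rng.choice(alphabet)
             for _ in range(n_chars)]
    return "".join(chars).split()  # split() drops empty words


if __name__ == "__main__":
    for rank, word, freq in rank_frequency(random_text(10000))[:5]:
        print(rank, word, freq)
```

Plotting frequency against rank on logarithmic axes for such a text and for a real text is the starting point of the comparison; the abstract's claim is that the apparent agreement does not survive proper testing.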
    We show that dolphin whistle types tend to be used in specific behavioral contexts, which is consistent with the hypothesis that dolphin whistles have some sort of “meaning”. Besides, in some cases, it can be shown that the behavioral context in which a whistle tends to occur or not occur is shared by different individuals, which is consistent with the hypothesis that dolphins are communicating through whistles. Furthermore, we show that the number of behavioral contexts significantly associated with a certain whistle type tends to grow with the frequency of the whistle type, a pattern that is reminiscent of a law of word meanings stating, as a tendency, that the higher the frequency of a word, the higher its number of meanings. Our findings indicate that the presence of Zipf's law in dolphin whistle types cannot be explained with enough detail by a simplistic die rolling experiment.
    We show that the law of brevity, i.e. the tendency of words to shorten as their frequency increases, is also found in dolphin surface behavioral patterns. As far as we know, this is the first evidence of the law in another species, suggesting that coding efficiency is not unique to humans. © 2008 Wiley Periodicals, Inc. Complexity, 2009.
    It has been argued that the actual distribution of word frequencies could be reproduced or explained by generating a random sequence of letters and spaces according to the so-called intermittent silence process. The same kind of process could reproduce or explain the counts of other kinds of units from a wide range of disciplines. Taking the linguistic metaphor, we focus on the frequency spectrum, i.e., the number of words with a certain frequency, and the vocabulary size, i.e., the number of different words of text generated by an intermittent silence process. We derive and explain how to calculate accurately and efficiently the expected frequency spectrum and the expected vocabulary size as a function of the text size.
    Quantitative linguistics has provided us with a number of empirical laws that characterise the evolution of languages and competition amongst them. In terms of language usage, one of the most influential results is Zipf's law of word frequencies. Zipf's law appears to be universal, and may not even be unique to human language. However, there is ongoing controversy over whether Zipf's law is a good indicator of complexity. Here we present an alternative approach that puts Zipf's law in the context of critical phenomena (the cornerstone of complexity in physics) and establishes the presence of a large scale "attraction" between successive repetitions of words. Moreover, this phenomenon is scale-invariant and universal -- the pattern is independent of word frequency and is observed in texts by different authors and written in different languages. There is evidence, however, that the shape of the scaling relation changes for words that play a key role in the text, implying the existence of different "universality classes" in the repetition of words. These behaviours exhibit striking parallels with complex catastrophic phenomena.
    Menzerath-Altmann law is a general law of human language stating, for instance, that the longer a word, the shorter its syllables. With the metaphor that genomes are words and chromosomes are syllables, we examine if genomes also obey the law. We find that longer genomes tend to be made of smaller chromosomes in organisms from three different kingdoms: fungi, plants, and animals. Our findings suggest that genomes self-organize under principles similar to those of human language. © 2009 Wiley Periodicals, Inc. Complexity, 2010
    In this paper, we propose a mathematical framework for studying word order optimization. The framework relies on the well-known positive correlation between cognitive cost and the Euclidean distance between the elements (e.g. words) involved in a syntactic link. We study the conditions under which a certain word order is more economical than an alternative word order by proposing a mathematical approach. We apply our methodology to two different cases: (a) the ordering of subject (S), verb (V) and object (O), and (b) the covering of a root word by a syntactic link. For the former, we find that SVO and its symmetric, OVS, are more economical than OSV, SOV, VOS and VSO at least 2/3 of the time. For the latter, we find that uncovering the root word is more economical than covering it at least 1/2 of the time. With the help of our framework, one can explain some Greenbergian universals. Our findings provide further theoretical support for the hypothesis that the limited resources of the brain introduce biases toward certain word orders. Our theoretical findings could inspire or illuminate future psycholinguistics or corpus linguistics studies.
    This article is a critical analysis of Michael Cysouw's comment "Linear Order as a Predictor of Word Order Regularities."
    It is widely assumed that long-distance dependencies between elements are a unique feature of human language. Here we review recent evidence of long-distance correlations in sequences produced by non-human species and discuss two evolutionary scenarios for the evolution of human language in the light of these findings. Though applying their methodological framework, we conclude that some of Hauser, Chomsky and Fitch's central claims on language evolution are put into question to a different degree within each of those scenarios.
    Until recently, models of communication have explicitly or implicitly assumed that the goal of a communication system is just maximizing the information transfer between signals and 'meanings'. Recently, it has been argued that a natural communication system not only has to maximize this quantity but also has to minimize the entropy of signals, which is a measure of the cognitive cost of using a word. The interplay between these two factors, i.e. maximization of the information transfer and minimization of the entropy, has been addressed previously using a Monte Carlo minimization procedure at zero temperature. Here we derive analytically the globally optimal communication systems that result from the interaction between these factors. We discuss the implications of our results for previous studies within this framework. In particular we prove that the emergence of Zipf's law using a Monte Carlo technique at zero temperature in previous studies indicates that the system had not reached the global optimum.
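The two quantities in play above, the information transfer between signals and meanings and the entropy of signals, can be computed for a small signal-meaning matrix. The sketch makes several assumptions that are not stated in the abstract: the joint probability of a signal and a meaning is taken to be proportional to a binary association matrix, and the two quantities are combined as cost = -lam * I(S,R) + (1 - lam) * H(S) with a hypothetical weight `lam` between 0 and 1.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def signal_entropy_and_information(matrix):
    """Rows are signals and columns are meanings.
    Returns (H(S), I(S,R)) for a joint distribution proportional to the matrix."""
    total = sum(sum(row) for row in matrix)
    p_signal = [sum(row) / total for row in matrix]
    p_meaning = [sum(col) / total for col in zip(*matrix)]
    p_joint = [a / total for row in matrix for a in row]
    h_signal = entropy(p_signal)
    information = h_signal + entropy(p_meaning) - entropy(p_joint)
    return h_signal, information


def cost(matrix, lam):
    """Assumed trade-off between information transfer and signal entropy."""
    h_signal, information = signal_entropy_and_information(matrix)
    return -lam * information + (1 - lam) * h_signal


if __name__ == "__main__":
    one_to_one = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    one_signal = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    for m in (one_to_one, one_signal):
        print(signal_entropy_and_information(m), cost(m, 0.5))
```

The one-to-one mapping maximizes information transfer at the price of maximal signal entropy, while a single signal for all meanings has zero entropy and transfers no information; these are the two extremes between which the trade-off operates.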
    Here we study the arrangement of vertices of trees in a 1-dimensional Euclidean space when the Euclidean distance between linked vertices is minimized. We conclude that links are unlikely to cross when drawn over the vertex sequence. This finding suggests that the uncommonness of crossings in the trees specifying the syntactic structure of sentences could be a side-effect of minimizing the Euclidean distance between syntactically related words. As far as we know, nobody has provided a successful explanation of such a surprisingly universal feature of languages that was discovered in the 60s of the past century by Hays and Lecerf. On the one hand, support for the role of distance minimization in avoiding edge crossings comes from statistical studies showing that the Euclidean distance between syntactically linked words of real sentences is minimized or constrained to a small value. On the other hand, that distance is considered a measure of the cost of syntactic relationships in various frameworks. By cost, we mean the amount of computational resources needed by the brain. The absence of crossings in syntactic trees may be universal just because all human brains have limited resources.
    Here we study the sequences of surface behavioral patterns of dolphins (Tursiops sp.) and find long-term correlations. We show that the long-term correlations are not of a trivial nature, i.e. they cannot be explained by the repetition of the same surface behavior many times in a row. Our findings suggest that dolphins have a long collective memory extending back at least to the 7th past behavior. As far as we know, this is the first evidence of long-term correlations in the behavior of a non-human species.
    Here, we study a communication model where signals associate to stimuli. The model assumes that signals follow Zipf's law and the exponent of the law depends on a balance between maximizing the information transfer and saving the cost of signal use. We study the effect of tuning that balance on the structure of signal-stimulus associations. The model starts from two recent results. First, the exponent grows as the weight of information transfer increases. Second, a rudimentary form of language is obtained when the network of signal-stimulus associations is almost connected. Here, we show the existence of a sudden destruction of language once a critical balance is crossed. The model shows that maximizing the information transfer through isolated signals and language are in conflict. The model proposes a strong reason for not finding large exponents in complex communication systems: language is in danger. Besides, the findings suggest that human words may need to be ambiguous to keep language alive. Interestingly, the model predicts that large exponents should be associated to decreased synaptic density. It is not surprising that the largest exponents correspond to schizophrenic patients since, in the spirit of Feinberg's hypothesis, decreased synaptic density may lead to schizophrenia. Our findings suggest that the exponent of Zipf's law is intimately related to language and that it could be used to detect anomalous structure and organization of the brain.
    Here we present a new model for Zipf's law in human word frequencies. The model defines the goal and the cost of communication using information theory. The model shows a continuous phase transition from a no-communication phase to a perfect-communication phase. Scaling consistent with Zipf's law is found at the boundary between phases. The exponents are consistent with minimizing the entropy of words. The model differs from a previous model [Ferrer i Cancho, Solé, Proc. Natl. Acad. Sci. USA 100, 788–791 (2003)] in two aspects. First, it assumes that the probability of experiencing a certain stimulus is controlled by the internal structure of the communication system rather than by the probability of experiencing it in the 'outside' world, which makes it especially suitable for the speech of schizophrenics. Second, the exponent α predicted for the frequency versus rank distribution is in a range where α>1, which may explain that of some schizophrenics and some children, with α=1.5-1.6. Among the many models for Zipf's law, none explains Zipf's law for that particular range of exponents. In particular, two simplistic models fail to explain that particular range of exponents: intermittent silence and Simon's model. We support the view that Zipf's law in a communication system may maximize the information transfer under constraints.
    We analyze here a particular kind of linguistic network where vertices represent words and edges stand for syntactic relationships between words. The statistical properties of these networks have been recently studied and various features such as the small-world phenomenon and a scale-free distribution of degrees have been found. Our work focuses on four classes of words: verbs, nouns, adverbs and adjectives. Here, we use spectral methods to sort vertices. We show that the ordering clusters words of the same class. For nouns and verbs, the cluster size distribution clearly follows a power-law distribution that cannot be explained by a null hypothesis. Long-range correlations are found between vertices in the ordering provided by the spectral method. The findings support the use of spectral methods for detecting community structure.
    Although many species possess rudimentary communication systems, humans seem to be unique with regard to making use of syntax and symbolic reference. Recent approaches to the evolution of language formalize why syntax is selectively advantageous compared with isolated signal communication systems, but do not explain how signals naturally combine. Even more recent work has shown that if a communication system maximizes communicative efficiency while minimizing the cost of communication, or if a communication system constrains ambiguity in a non-trivial way while a certain entropy is maximized, signal frequencies will be distributed according to Zipf's law. Here we show that such communication principles give rise not only to signals that have many traits in common with the linking words in real human languages, but also to a rudimentary sort of syntax and symbolic reference.
    Words in humans follow the so-called Zipf's law. More precisely, the word frequency spectrum follows a power function whose typical exponent is β≈2, but significant variations are found. We hypothesize that the full range of variation reflects our ability to balance the goal of communication, i.e. maximizing the information transfer, and the cost of communication, imposed by the limitations of the human brain. We show that the higher the importance of satisfying the goal of communication, the higher the exponent. Here, assuming that words are used according to their meaning, we explain why variation in β should be limited to a particular domain. On the one hand, we explain a non-trivial lower bound at about β=1.6 for communication systems neglecting the goal of communication. On the other hand, we find a sudden divergence of β if a certain critical balance is crossed. At the same time, a sharp transition to maximum information transfer and, unfortunately, maximum communication cost is found. Consistent with the upper bound of real exponents, the maximum finite value predicted is about β=2.4. It is convenient for human language not to cross the transition and to remain in a domain where information transfer is high but at a reasonable cost. Therefore, only a particular range of exponents should be found in human speakers. The exponent contains information about the balance between cost and communicative efficiency.
    Here, assuming a general communication model where objects map to signals, a power function for the distribution of signal frequencies is derived. The model relies on the satisfaction of the receiver (hearer) communicative needs when the entropy of the number of objects per signal is maximized. Evidence of power distributions in a linguistic context (some of them with exponents clearly different from the typical β≈2 of Zipf's law) is reviewed and expanded. We support the view that Zipf's law reflects some sort of optimization but following a novel realistic approach where signals (e.g. words) are used according to the objects (e.g. meanings) they are linked to. Our results strongly suggest that many systems in nature use non-trivial strategies for easing the interpretation of a signal. Interestingly, constraining just the number of interpretations of signals does not lead to scaling.
    We study the Euclidean distance between syntactically linked words in sentences. The average distance is significantly small and is a very slowly growing function of sentence length. We consider two nonexcluding hypotheses: (a) the average distance is minimized and (b) the average distance is constrained. Support for (a) comes from the significantly small average distance real sentences achieve. The strength of the minimization hypothesis decreases with the length of the sentence. Support for (b) comes from the very slow growth of the average distance versus sentence length. Furthermore, (b) predicts, under ideal conditions, an exponential distribution of the distance between linked words, a trend that can be identified in real sentences.
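    The quantity studied in the abstract above can be illustrated with a minimal sketch. It assumes that a sentence is given as a list of links between 1-based word positions and that the distance of a link is the absolute difference of the two positions; the example sentence links and the function names are hypothetical. The second function computes the average distance over all possible pairs of positions, which serves as a simple random baseline.

```python
from itertools import combinations
from statistics import mean


def mean_dependency_distance(edges):
    """Average |i - j| over syntactic links given as pairs of word positions."""
    if not edges:
        raise ValueError("sentence has no syntactic links")
    return mean(abs(i - j) for i, j in edges)


def mean_distance_all_pairs(n):
    """Average |i - j| over every pair of distinct positions in a sentence of n words.

    This is the expected link length when a link joins two positions chosen
    uniformly at random; it equals (n + 1) / 3.
    """
    if n < 2:
        raise ValueError("need at least two words")
    return mean(abs(i - j) for i, j in combinations(range(1, n + 1), 2))


# Hypothetical 6-word sentence with illustrative (not annotated) links.
edges = [(1, 2), (2, 3), (2, 4), (4, 6), (5, 6)]
observed = mean_dependency_distance(edges)
baseline = mean_distance_all_pairs(6)
print(f"observed mean distance: {observed:.2f}, random baseline: {baseline:.2f}")
```

    In this toy case the observed mean lies below the baseline, which is the kind of comparison hypothesis (a) relies on; a real test would of course use annotated treebank sentences.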
    Many languages are spoken on Earth. Despite their diversity, many robust language universals are known to exist. All languages share syntax, i.e., the ability of combining words for forming sentences. The origin of such traits is an issue of open debate. By using recent developments from the statistical physics of complex networks, we show that different syntactic dependency networks (from Czech, German, and Romanian) share many nontrivial statistical patterns such as the small world phenomenon, scaling in the distribution of degrees, and disassortative mixing. Such previously unreported features of syntax organization are not a trivial consequence of the structure of sentences, but an emergent trait at the global scale.
    The emergence of a complex language is one of the fundamental events of human evolution, and several remarkable features suggest the presence of fundamental principles of organization. These principles seem to be common to all languages. The best known is the so-called Zipf's law, which states that the frequency of a word decays as a (universal) power law of its rank. The possible origins of this law have been controversial, and its meaningfulness is still an open question. In this article, the early hypothesis of Zipf of a principle of least effort for explaining the law is shown to be sound. Simultaneous minimization in the effort of both hearer and speaker is formalized with a simple optimization process operating on a binary matrix of signal-object associations. Zipf's law is found in the transition between referentially useless systems and indexical reference systems. Our finding strongly suggests that Zipf's law is a hallmark of symbolic reference and not a meaningless feature. The implications for the evolution of language are discussed. We explain how language evolution can take advantage of a communicative phase transition.
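    A minimal sketch of how speaker and hearer effort could be measured on a binary matrix of signal-object associations follows. The abstract above does not spell out the formulas, so everything here is an assumption of the sketch: objects are taken to be equally likely, a speaker picks uniformly among the signals linked to an object, speaker effort is the signal entropy H(S), hearer effort is the conditional entropy H(R|S), and the two are combined as a weighted sum with a hypothetical weight lam.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def efforts(a):
    """Return (H(S), H(R|S)) for a binary matrix a[i][j] (signal i, object j)."""
    n, m = len(a), len(a[0])
    # Number of signals linked to each object.
    omega = [sum(a[i][j] for i in range(n)) for j in range(m)]
    if any(w == 0 for w in omega):
        raise ValueError("every object needs at least one signal")
    # Assumed joint probability: p(s_i, r_j) = (1/m) * a_ij / omega_j.
    joint = [[a[i][j] / (m * omega[j]) for j in range(m)] for i in range(n)]
    p_signal = [sum(row) for row in joint]
    h_s = entropy(p_signal)
    h_joint = entropy(p for row in joint for p in row)
    return h_s, h_joint - h_s  # H(R|S) = H(S, R) - H(S)


def cost(a, lam):
    """Assumed combined cost: lam * hearer effort + (1 - lam) * speaker effort."""
    h_s, h_r_given_s = efforts(a)
    return lam * h_r_given_s + (1 - lam) * h_s


one_signal = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]   # a single signal names everything
one_to_one = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # each object has its own signal
for name, matrix in (("one signal", one_signal), ("one-to-one", one_to_one)):
    h_s, h_rs = efforts(matrix)
    print(f"{name}: H(S)={h_s:.3f} bits, H(R|S)={h_rs:.3f} bits")
```

    The two extreme matrices show the conflict: a single signal costs the speaker nothing but leaves the hearer maximally uncertain, while a one-to-one mapping does the opposite. An optimization over matrices for intermediate values of lam is left out of this sketch.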
    Selection, Tinkering, and Emergence in Complex Networks - Crossing the Land of Tinkering cite: "Modularity can arise in two ways: by parcellation or by integration. Parcellation consists of the differential elimination of cross-interactions involving different parts of the system. Instead, if the network is originally formed by many independent, disconnected parts, it is conceivable that modularity arises by differential integration of those independent characters serving a common functional role."
    Summary of the Basic Features that Relate and Distinguish Different Types of Complex Networks, Both Natural and Artificial
    Property | Proteomics | Ecology | Language | Technology
    Tinkering | Gene duplication and recruitation | Local assemblages from regional species pools and priority effects | Creation of words from already established ones | Reutilization of modules and components
    Hubs | Cellular signaling genes (e.g., p53) | Omnivorous and most abundant species | Function words | Most used components
    What can be optimized? | Communication speed and linking cost | Unclear | Communication speed with restrictions | Minimize development effort within constraints
    Failures | Small phenotypic effect of random mutations | Loss of only a few species-specific functions | Maintenance of expression and communication | Loss of functionality
    Attacks | Large alterations of cell-cycle and apoptosis (e.g., cancer) | Many coextinctions and loss of several ecosystems functions | Agrammatism (i.e., great difficulties for building complex sentences) | Avalanches of changes and large development costs
    Redundancy and degeneracy | Redundant genes rapidly lost | R minimized and D restricted to non-keystone species | Great D | Certain degree of R but no D
    Here different characteristic features of complex nets, as well as their behavior under different sources of perturbation, are considered.
    A large number of complex networks, both natural and artificial, share the presence of highly heterogeneous, scale-free degree distributions. A few mechanisms for the emergence of such patterns have been suggested, optimization not being one of them. In this letter we present the first evidence for the emergence of scaling (and smallworldness) in software architecture graphs from a well-defined local optimization process. Although the rules that define the strategies involved in software engineering should lead to a tree-like structure, the final net is scale-free, perhaps reflecting the presence of conflicting constraints unavoidable in a multidimensional optimization process. The consequences for other complex networks are outlined. Comment: 6 pages, 2 figures. Submitted to Europhysics Letters. Additional material is available at http://complex.upc.es/~sergi/software.htm
    Certain word types of natural languages - conjunctions, articles, prepositions and some verbs - have a very low or very grammatically marked semantic contribution. They are usually named functional categories or relational items. Recently, the possibility of considering prepositions as simple parametrical variations of semantic features instead of categorial features or as the irrelevance of such categorial features has been pointed out. The discussion about such particles has been and still is widespread and controversial. Nonetheless, there is no quantitative evidence of such semantic weakness and no satisfactory evidence against the coexistence of categorial requirements and the fragility of the semantic aspects. This study aims to quantify the semantic contribution of particles and presents some corpora-based results for English that suggest that such weakness and its relational uncertainty come from the categorial irrelevance mentioned before.
    Random-text models have been proposed as an explanation for the power law relationship between word frequency and rank, the so-called Zipf's law. They are generally regarded as null hypotheses rather than models in the strict sense. In this context, recent theories of language emergence and evolution assume this law as a priori information with no need of explanation. Here, random texts and real texts are compared through (a) the so-called lexical spectrum and (b) the distribution of words having the same length. It is shown that real texts fill the lexical spectrum much more efficiently and regardless of the word length, suggesting that the meaningfulness of Zipf's law is high.
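    The kind of random text discussed in the abstract above can be sketched as follows. This is an illustrative generator, not the procedure used in the study: characters are drawn independently, a space appears with an assumed probability p_space, and the remaining characters are chosen uniformly from a small hypothetical alphabet. The helper functions compute the rank-frequency list, the lexical spectrum (how many distinct words occur exactly f times) and the word frequencies grouped by word length.

```python
import random
from collections import Counter, defaultdict


def random_text(n_chars, alphabet="abcde", p_space=0.2, seed=0):
    """Generate words by typing n_chars random characters (letters or spaces)."""
    rng = random.Random(seed)
    chars = [
        " " if rng.random() < p_space else rng.choice(alphabet)
        for _ in range(n_chars)
    ]
    return "".join(chars).split()


def rank_frequency(words):
    """Word frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(words).values(), reverse=True)


def lexical_spectrum(words):
    """Map each frequency f to the number of distinct words occurring f times."""
    return Counter(Counter(words).values())


def frequencies_by_length(words):
    """Map each word length to the sorted frequencies of the words of that length."""
    groups = defaultdict(list)
    for word, freq in Counter(words).items():
        groups[len(word)].append(freq)
    return {length: sorted(freqs, reverse=True) for length, freqs in groups.items()}


words = random_text(20000)
ranked = rank_frequency(words)
spectrum = lexical_spectrum(words)
by_length = frequencies_by_length(words)
print("tokens:", len(words), "types:", len(ranked))
print("top 5 frequencies:", ranked[:5])
```

    Running the same three helpers on a tokenized real text would give the two sides of the comparison described in the abstract.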
    Several factors play a role during the replication and transmission of RNA viruses. First, as a consequence of their enormous mutation rate, complex mixtures of genomes are generated immediately after infection of a new host. Secondly, differences in growth and competition rates drive the selection of certain genetic variants within an infected host. Thirdly, and no less important, random sampling occurs at the moment of viral infectious passage from an infected to a healthy host. In addition, the availability of hosts also influences the fate of a given viral genotype. When new hosts are scarce, different viral genotypes might infect the same host, adding extra complexity to the competition among genetic variants. We have employed a two-fold approach to analyse the role played by each of these factors in the evolution of RNA viruses. First, we have derived a model that takes into account all the preceding factors. This model employs the classic Lotka-Volterra competition equations but also incorporates the effect of mutation during RNA replication, the effect of stochastic sampling at the moment of infectious passage among hosts, and the effect of the type of infection (single, coinfection or superinfection). Secondly, the predictions of the model have been tested in an in vitro evolution experiment. Both theoretical and experimental results show that in infection passages with coinfection viral fitness increased more than in single infections. In contrast, infection passages with superinfection did not differ from single infections. The coinfection frequency also affected the outcome: the larger the proportion of viruses coinfecting a host, the larger the increase in fitness observed.
    Words in human language interact in sentences in non-random ways, and allow humans to construct an astronomic variety of sentences from a limited number of discrete units. This construction process is extremely fast and robust. The co-occurrence of words in sentences reflects language organization in a subtle manner that can be described in terms of a graph of word interactions. Here, we show that such graphs display two important features recently found in a disparate number of complex systems. (i) The so-called small-world effect. In particular, the average distance between two words, d (i.e. the average minimum number of links to be crossed from an arbitrary word to another), is shown to be approximately 2-3, even though the human brain can store many thousands of words. (ii) A scale-free distribution of degrees. The known pronounced effects of disconnecting the most connected vertices in such networks can be identified in some language disorders. These observations indicate some unexpected features of language organization that might reflect the evolutionary and social history of lexicons and the origins of their flexibility and combinatorial nature.