Preprint

The role of grammar in transition-probabilities of subsequent words in English text


Abstract

Sentence formation is a highly structured, history-dependent, and sample-space reducing (SSR) process. While the first word in a sentence can be chosen from the entire vocabulary, the freedom of choosing subsequent words typically becomes more and more constrained by grammar and context as the sentence progresses. This sample-space reducing property offers a natural explanation of Zipf's law in word frequencies; however, it fails to capture the structure of the word-to-word transition probability matrices of English text. Here we adopt the view that grammatical constraints (such as subject--predicate--object) locally re-order the words in sentences that are sampled with an SSR word generation process. We demonstrate that superimposing grammatical structure, as a local word re-ordering (permutation) process, on a sample-space reducing process is sufficient to explain both word frequencies and word-to-word transition probabilities. We compare the quality of the grammatically ordered SSR model in reproducing several test statistics of real texts with other text generation models, such as the Bernoulli model, the Simon model, and the Monkey typewriting model.
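The two ingredients named in the abstract, an SSR word generation process and a local re-ordering step, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: words are represented by their ranks 1..n_words, the SSR step is the plain uniform one, and the "grammar" is a hypothetical stand-in that swaps neighbouring words with probability swap_prob. The abstract does not specify the authors' permutation process, so local_reorder and all function names here are assumptions.

```python
import random
from collections import Counter


def ssr_sentence(n_words, rng):
    """Sample one sentence as a plain SSR run over word ranks 1..n_words.

    The first word is uniform over the whole vocabulary; each later word is
    uniform over the ranks strictly below the current one, until rank 1.
    """
    state = rng.randint(1, n_words)
    sentence = [state]
    while state > 1:
        state = rng.randint(1, state - 1)
        sentence.append(state)
    return sentence


def local_reorder(sentence, swap_prob, rng):
    """Hypothetical 'grammar': swap neighbouring words with probability swap_prob."""
    words = list(sentence)
    i = 0
    while i < len(words) - 1:
        if rng.random() < swap_prob:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip ahead so no word is moved twice
        else:
            i += 1
    return words


def generate_text(n_sentences, n_words, swap_prob, seed=0):
    """Return a list of sentences, each a list of word ranks."""
    rng = random.Random(seed)
    return [
        local_reorder(ssr_sentence(n_words, rng), swap_prob, rng)
        for _ in range(n_sentences)
    ]


def word_counts(text):
    """Count how often each word rank occurs in the whole text."""
    return Counter(word for sentence in text for word in sentence)


def transition_counts(text):
    """Count (word, next word) pairs within sentences."""
    return Counter(
        pair for sentence in text for pair in zip(sentence, sentence[1:])
    )
```

Because the re-ordering only permutes words inside a sentence, it cannot change the word frequencies of a given text; it only changes which word follows which. With swap_prob = 0 every transition goes from a higher to a lower rank, whereas any positive swap_prob also produces transitions in the opposite direction.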

Article
Sample space reducing (SSR) processes offer a simple analytical way to understand the origin and ubiquity of power-laws in many path-dependent complex systems. SSR processes show a wide range of applications, ranging from fragmentation processes and language formation to search and cascading processes. Here we argue that they also offer a natural framework to understand stationary distributions of generic driven non-equilibrium systems that are composed of a driving- and a relaxing process. We show that the statistics of driven non-equilibrium systems can be derived from the understanding of the nature of the underlying driving process. For constant driving rates exact power-laws emerge with exponents that are related to the driving rate. If driving rates become state-dependent, or if they vary across the life-span of the process, the functional form of the state-dependence determines the statistics. Constant driving rates lead to exact power-laws, a linear state-dependence function yields exponential or Gamma distributions, a quadratic function produces the normal distribution. Logarithmic and power-law state dependence leads to log-normal and stretched exponential distribution functions, respectively. Also Weibull, Gompertz and Tsallis-Pareto distributions arise naturally from simple state-dependent driving rates. We discuss a simple physical example of consecutive elastic collisions that exactly represents an SSR process.
Article
It has been shown recently that a specific class of path-dependent stochastic processes, which reduce their sample space as they unfold, lead to exact scaling laws in frequency and rank distributions. Such Sample Space Reducing processes (SSRP) offer an alternative new mechanism to understand the emergence of scaling in countless processes. The corresponding power law exponents were shown to be related to noise levels in the process. Here we show that the emergence of scaling is not limited to the simplest SSRPs, but holds for a huge domain of stochastic processes that are characterized by non-uniform prior distributions. We demonstrate mathematically that in the absence of noise the scaling exponents converge to $-1$ (Zipf's law) for almost all prior distributions. As a consequence it becomes possible to fully understand targeted diffusion on weighted directed networks and its associated scaling laws in node visit distributions. The presence of cycles can be properly interpreted as playing the same role as noise in SSRPs and, accordingly, determines the scaling exponents. The result that Zipf's law emerges as a generic feature of diffusion on networks, regardless of its details, and that the exponent of visiting times is related to the amount of cycles in a network could be relevant for a series of applications in traffic-, transport- and supply chain management.
Article
This article studies the emergence of ambiguity in communication through the concept of logical irreversibility and within the framework of Shannon's information theory. This leads us to a precise and general expression of the intuition behind Zipf's vocabulary balance in terms of a symmetry equation between the complexities of the coding and the decoding processes that imposes an unavoidable amount of logical uncertainty in natural communication. Accordingly, the emergence of irreversible computations is required if the complexities of the coding and the decoding processes are balanced in a symmetric scenario, which means that the emergence of ambiguous codes is a necessary condition for natural communication to succeed.
Article
Complex systems are often inherently non-ergodic and non-Markovian, in which case Shannon entropy loses its applicability. In particular, accelerating, path-dependent, and aging random walks offer an intuitive picture for these non-ergodic and non-Markovian systems. It was shown that the entropy of non-ergodic systems can still be derived from three of the Shannon-Khinchin axioms while violating the fourth, the so-called composition axiom. The corresponding entropy is of the form $S_{c,d} \sim \sum_i \Gamma(1+d,1-c\ln p_i)$ and depends on two system-specific scaling exponents, $c$ and $d$. This entropy contains many recently proposed entropy functionals as special cases, including Shannon and Tsallis entropy. It was shown that this entropy is relevant for a special class of non-Markovian random walks. In this work we generalize these walks to a much wider class of stochastic systems that can be characterized as 'aging' systems. These are systems whose transition rates between states are path- and time-dependent. We show that for particular aging walks $S_{c,d}$ is again the correct extensive entropy. Before the central part of the paper we review the concept of $(c,d)$-entropy in a self-contained way.
Article
We investigate the origin of Zipf's law for words in written texts by means of a stochastic dynamic model for text generation. The model incorporates both features related to the general structure of languages and memory effects inherent to the production of long coherent messages in the communication process. It is shown that the multiplicative dynamics of our model lead to rank-frequency distributions in quantitative agreement with empirical data. Our results give support to the linguistic relevance of Zipf's law in human language.
Article
Zipf's law seems to be ubiquitous in human languages and appears to be a universal property of complex communicating systems. Following the early proposal made by Zipf concerning the presence of a tension between the efforts of speaker and hearer in a communication system, we introduce evolution by means of a variational approach to the problem based on Kullback's Minimum Discrimination of Information Principle. Therefore, using a formalism fully embedded in the framework of information theory, we demonstrate that Zipf's law is the only expected outcome of an evolving, communicative system under a rigorous definition of the communicative tension described by Zipf.
Article
The emergence of a complex language is one of the fundamental events of human evolution, and several remarkable features suggest the presence of fundamental principles of organization. These principles seem to be common to all languages. The best known is the so-called Zipf's law, which states that the frequency of a word decays as a (universal) power law of its rank. The possible origins of this law have been controversial, and its meaningfulness is still an open question. In this article, the early hypothesis of Zipf of a principle of least effort for explaining the law is shown to be sound. Simultaneous minimization in the effort of both hearer and speaker is formalized with a simple optimization process operating on a binary matrix of signal-object associations. Zipf's law is found in the transition between referentially useless systems and indexical reference systems. Our finding strongly suggests that Zipf's law is a hallmark of symbolic reference and not a meaningless feature. The implications for the evolution of language are discussed. We explain how language evolution can take advantage of a communicative phase transition.
Article
Significance: Many complex systems reduce their flexibility over time in the sense that the number of options (possible states) diminishes over time. We show that rank distributions of the visits to these states that emerge from such processes are exact power laws with an exponent −1 (Zipf’s law). When noise is added to such processes, meaning that from time to time they can also increase the number of their options, the rank distribution remains a power law, with an exponent that is related to the noise level in a remarkably simple way. Sample-space-reducing processes provide a new route to understand the phenomenon of scaling and provide an alternative to the known mechanisms of self-organized criticality, multiplicative processes, or preferential attachment.
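The noiseless statement above (visits follow an exact power law with exponent −1) can be checked with exact rational arithmetic in a small Python sketch. The assumptions are mine, not taken from the abstract: states are labelled 1..n_states, each run starts uniformly on all states, and every step jumps uniformly to a strictly lower state. The function name is hypothetical and the noisy case is not covered.

```python
from fractions import Fraction


def ssr_visit_probabilities(n_states):
    """Exact probability that one noiseless SSR run visits each state i.

    Assumption: the run starts uniformly on 1..n_states and then always jumps
    to a uniformly chosen state strictly below the current one.
    """
    visit = [Fraction(0)] * (n_states + 1)  # index 0 is unused
    inflow = Fraction(0)  # sum over j > i of visit[j] / (j - 1)
    for i in range(n_states, 0, -1):
        # State i is reached either as the starting state or from above.
        visit[i] = Fraction(1, n_states) + inflow
        if i > 1:
            # From state i, each of the i - 1 lower states is equally likely.
            inflow += visit[i] / (i - 1)
    return visit[1:]
```

Under these assumptions the recursion gives a visit probability of exactly 1/i for state i, which is Zipf's law in the state label. Using Fraction instead of floats keeps the comparison exact rather than approximate.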
Article
The formation of sentences is a highly structured and history-dependent process. The probability of using a specific word in a sentence strongly depends on the 'history' of word usage earlier in that sentence. We study a simple history-dependent model of text generation assuming that the sample-space of word usage reduces along sentence formation, on average. We first show that the model explains the approximate Zipf law found in word frequencies as a direct consequence of sample-space reduction. We then empirically quantify the amount of sample-space reduction in the sentences of 10 famous English books, by analysis of corresponding word-transition tables that capture which words can follow any given word in a text. We find a highly nested structure in these transition tables and show that this 'nestedness' is tightly related to the power law exponents of the observed word frequency distributions. With the proposed model, it is possible to understand that the nestedness of a text can be the origin of the actual scaling exponent and that deviations from the exact Zipf law can be understood by variations of the degree of nestedness on a book-by-book basis. On a theoretical level, we are able to show that in the case of weak nesting, Zipf's law breaks down in a fast transition. Unlike previous attempts to understand Zipf's law in language the sample-space reducing model is not based on assumptions of multiplicative, preferential or self-organized critical mechanisms behind language formation, but simply uses the empirically quantifiable parameter 'nestedness' to understand the statistics of word frequencies. © 2015 The Author(s) Published by the Royal Society. All rights reserved.
Article
Significance: The maximum entropy principle (MEP) states that for many statistical systems the entropy that is associated with an observed distribution function is a maximum, given that prior information is taken into account appropriately. Usually systems where the MEP applies are simple systems, such as gases and independent processes. The MEP has found thousands of practical applications. Whether a MEP holds for complex systems, where elements interact strongly and have memory and path dependence, remained unclear over the past half century. Here we prove that a MEP indeed exists for complex systems and derive the generalized entropy. We find that it belongs to the class of the recently proposed $(c,d)$-entropies. The practical use of the formalism is shown for a path-dependent random walk.
Article
Ulam has defined a history-dependent random sequence of integers $\{a_n\}$ by the recursion $a_{n+1} = a_n + a_{X(n)}$, $a_1 = 1$, $P\{X(n) = K\} = n^{-1}$, $K = 1, 2, \ldots, n$. It is shown that the expectation of $a_n$ is asymptotic to $\exp(2\sqrt{n})$ and that the expectation of $a_n^2$ is asymptotic to $\exp\left[\sqrt{2(5+\sqrt{17})}\,\sqrt{n}\right]$. The methods of generating functions and steepest descent are used.
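The recursion above is easy to simulate, and taking expectations on both sides gives $E[a_{n+1}] = E[a_n] + \frac{1}{n}\sum_{k=1}^{n} E[a_k]$, which can be iterated exactly. The Python sketch below does both; the function names are hypothetical and the code is only an illustration of the recursion, not of the generating-function method used in the cited work.

```python
import random
from fractions import Fraction


def ulam_sequence(n_terms, rng):
    """Sample a_1..a_n_terms with a_{n+1} = a_n + a_{X(n)}, X(n) uniform on 1..n."""
    a = [1]  # a[0] holds a_1
    for n in range(1, n_terms):
        k = rng.randint(1, n)  # X(n), uniform on {1, ..., n}
        a.append(a[n - 1] + a[k - 1])
    return a


def ulam_expectations(n_terms):
    """Exact E[a_1]..E[a_n_terms] from E[a_{n+1}] = E[a_n] + (1/n) * sum_{k<=n} E[a_k]."""
    expect = [Fraction(1)]
    running_sum = Fraction(1)  # sum of E[a_1]..E[a_n]
    for n in range(1, n_terms):
        nxt = expect[-1] + running_sum / n
        expect.append(nxt)
        running_sum += nxt
    return expect
```

To compare with the stated growth rate, one can inspect math.log(float(ulam_expectations(n)[-1])) against 2 * math.sqrt(n) for increasing n; the two need not agree closely at small n, since the asymptotic statement leaves room for slowly varying prefactors.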
Article
Ulam has defined a history-dependent random sequence by the recursion $X_{n+1} = X_n + X_{U(n)}$, where $(U(n);\, n \ge 1)$ is a sequence of independent random variables with $U(n)$ uniformly distributed on $\{1, \ldots, n\}$ and $X_1 = 1$. We introduce a new class of continuous-time history-dependent random processes regulated by Poisson processes. The simplest of these, a univariate process regulated by a homogeneous Poisson process, replicates in continuous time the essential properties of Ulam's sequence, and greatly facilitates its analysis. We consider several generalizations and extensions of this, including bivariate and multivariate coupled history-dependent processes, and cases when the dependence on the past is not uniform. The analysis of the discrete-time formulations of these models would be at the very least an extremely formidable project but we determine the asymptotic growth rates of their means and higher moments with relative ease.
Article
It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf's law observed in natural languages such as English. The facts that the frequency of occurrence of a word is almost an inverse power law function of its rank and that the exponent of this inverse power law is very close to 1 are largely due to the transformation from the word's length to its rank, which stretches an exponential function to a power law function.
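The length-to-rank argument can be made concrete with a small Python sketch. It assumes the simplest random-typing setup, which the abstract does not spell out: n_letters letter keys plus one space key, all equally likely, with words defined as runs of letters between spaces. Function names are hypothetical.

```python
import math
from fractions import Fraction


def word_probability(length, n_letters):
    """Probability that the keystrokes after a space spell one given word of
    `length` letters followed by a space (all n_letters + 1 keys equally likely)."""
    return Fraction(1, (n_letters + 1) ** (length + 1))


def rank_range(length, n_letters):
    """First and last rank of the words of `length` letters when words are
    ranked by probability (shorter words first, ties broken arbitrarily)."""
    shorter = sum(n_letters ** k for k in range(1, length))
    return shorter + 1, shorter + n_letters ** length


def rank_exponent(n_letters):
    """Exponent alpha in p ~ rank^(-alpha) implied by p ~ (n_letters + 1)^(-length)
    and rank ~ n_letters^length."""
    return math.log(n_letters + 1) / math.log(n_letters)
```

Word probability falls off exponentially with length, while the number of words, and hence the rank, grows exponentially with length. Eliminating the length leaves a power law in rank, and under these assumptions rank_exponent(26) comes out only slightly above 1.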
Harremoës P. and F. Topsøe (2001) Maximum Entropy Fundamentals. Entropy 3: 191-226.