Structural Complexity of Simplified Chinese Characters

Yanru WANG1, Xinying Chen2*
ABSTRACT. In this paper, we study, from a synergetic perspective, the relationship between the
structural complexity and the frequency of Chinese characters. We measured the structural
complexity of Chinese characters by both the number of strokes and the number of components.
We then tested whether the relationship between the characters’ structural complexity and their
frequency fits Zipf’s law by analyzing the 3,061 most frequent Chinese characters from the Chinese
Characters’ Frequency Dictionary. The results show that, under both measurements, the relationship
between structural complexity and character frequency abides by the Zipf-Mandelbrot law.
Keywords: Zipf’s law, structural complexity, Chinese characters, components, strokes, frequency
1. INTRODUCTION. Existing studies of Chinese morphology discuss the morphemic
combination of characters into words (Packard 2000) and the placement of aspectual markers
(Stallings 1975). However, one topic in this area has rarely been explored: the structural
complexity of Chinese characters.
Little theoretical research has addressed this question, and most studies of the structural
complexity of Chinese characters have been driven by natural language processing applications
and Chinese teaching (Bunke & Wang 1997). Although a few exceptions have tried to discuss the
structural complexity of Chinese characters from a theoretical point of view, they commonly lack
sufficient data and a solid theoretical standpoint (Wang 2007).
This study addresses the question of how to measure the structural complexity of Chinese
characters by using the methods that were established and tested by Köhler and colleagues (2005)
and Wang (2011) for morphological research on inflected languages.
The synergetic linguistics proposed by Köhler and colleagues (2005) regards language as a
dynamic, self-organizing and self-regulating system and provides a linguistic framework built on
that foundation. The synergetic-linguistic model was built and tested on German for a lexical
subsystem by Köhler (1986). The model describes and, in a way, explains the relationships between
different language features, such as structural complexity, number of meanings, and frequency of
language units, and it has been shown to be applicable to many languages. For Chinese, Wang (2011)
showed that the relationship between the polysemy and the length of Chinese words fits this model.
Wang (2014) also studied the relationship between word length, polysemy, polytextuality, and
frequency in Chinese according to this model, and the results further supported the model’s
applicability to the Chinese lexical subsystem.
Since this model has been shown to apply to many other linguistic features of Chinese, we
hypothesized that more complex characters would be less frequent because of the tendency to
minimize effort in language production, which is similar to the principle of least effort proposed
by Zipf (1949), and that the relation between the frequency and complexity of characters should
abide by Zipf’s law.
However, Chinese linguists still do not agree on whether the number of strokes (referred to as
NS) or the number of components (referred to as NC) is the better measurement of the structural
complexity of Chinese characters. The question was put forward as early as the 1950s by Du
(1954), who himself made contradictory statements on whether strokes or components should be the
basic units of Chinese characters. It was not until the 1990s that Chinese linguists developed a
systematic theory of the structural units of Chinese characters. Su (1994) proposed that, in order
to analyze the structure of modern Chinese characters, scholars should set up a new theory with
three levels in the structure of Chinese characters: stroke, component, and complete character,
among which the component is the core. However, traditional Chinese linguists did not point out
which of the three units is most suitable for measuring the structural complexity of Chinese
characters.
In quantitative linguistics, Bohn proposed using NS to measure the structural complexity of
Chinese characters (Liu & Huang 2012), while Altmann (2004) proposed a universal procedure for
measuring script complexity by assigning numerical values to different types of elementary units
and connections and then adding up all the values of a given sign. Liu and Huang (2012) argued
that, for measuring the structural complexity of Chinese characters, both Bohn’s proposal and
Altmann’s proposal can serve certain research purposes; however, neither method interprets the
structural complexity of Chinese characters from the perspectives of linguistics and philology.
Although Altmann’s method is universally applicable to different scripts, it cannot describe
the inner structure of Chinese characters as well as NS and NC do. Therefore, we used NS and NC
as the measurements of structural complexity in this study. Our goals are twofold: to test our
hypothesis that the relationship between the frequency and structural complexity of Chinese
characters fits Zipf’s law, and to gain insight into measuring the structural complexity of
characters by comparing the results from the two measurements.
2. DATA AND MATERIALS. Our data are derived from the Chinese Characters’ Frequency Dictionary,1
which was built from a corpus of People’s Daily, a well-known newspaper in China. It contains
the 3,061 most frequent Chinese characters and their frequencies in the corpus, which together
account for 99.43% of the whole corpus.
Other materials that we used as standards to measure the complexity of the 3,061 Chinese
characters include: the Dictionary of Chinese Characters’ Information,2 which presents basic
information (pinyin, number of strokes, and components) on commonly used Chinese characters;
The Table of Basic Components of Chinese Characters,3 which lists 560 basic components of Chinese
characters; and The Modern Commonly Used Standardized Chinese Dictionary,4 which also presents
basic information (origin, simplification approach, pinyin, number of strokes, and components)
on the characters it contains.
3. EXPERIMENTS AND RESULTS. In his two books, The Psycho-Biology of Language (Zipf 1935)
and Human Behavior and the Principle of Least Effort (Zipf 1949), George Kingsley Zipf
proposed that the word frequency distribution follows a power law, now well known as
Zipf’s law. The law originally takes the form below:
(1) $f = C / r$
In this equation, ‘f’ and ‘r’ represent word frequency and the rank order of that frequency,
respectively, and ‘C’ is a constant that depends on the corpus in question. Later, because some
researchers found that the word frequency distribution deviated slightly from the original Zipf
distribution at both ends, a modified function was proposed (Mandelbrot 1953). This is the
well-known Zipf-Mandelbrot law shown below, which describes empirical data in linguistics more
accurately.
(2) $P_r = C (r + a)^{-b}$
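As a brief aside on how equations (1) and (2) relate, the following worked step is a sketch in the notation of equation (2). Reading $P_r$ as the counterpart of the frequency ‘f’ in equation (1) is an interpretive assumption rather than something stated above. Taking logarithms of equation (2) gives a straight line in $\log (r + a)$ with slope $-b$:

$$\log P_r = \log C - b \log (r + a)$$

Setting $a = 0$ and $b = 1$ gives $P_r = C / r$, the form of equation (1), so the two extra parameters only allow the curve to bend away from the pure power law.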
Although Zipf’s law was first discovered in linguistics, it has been shown to be applicable in
many scientific fields, such as bibliometrics, economics, sociology, and informatics.
In linguistics, this law has been tested by many scholars and shown to be applicable to various
languages, such as Korean (Choi 2000), Greek (Hatzigeorgiu et al. 2001), Spanish (Ha et al. 2006),
French (Ha et al. 2006), Irish (Ha et al. 2006), Latin (Ha et al. 2006), and Indian languages
(Jayaram & Vidya 2008). Moreover, it was tested in the ancient language Meroitic, used in
southern Sudan a thousand years ago (Smith 2008).
It has also been tested in Chinese. Wang and colleagues (2009) used Dream of the
Red Chamber 红楼梦, Selected Works of Mao Tse-tung 毛泽东选集, and Selected Works of
Deng Xiaoping 邓小平文选 as their research corpus and demonstrated the applicability of Zipf’s
distribution law to Chinese. Guan and colleagues (1995) used Chinese word frequency statistics
shared on the Internet and found that modern Chinese conforms to Zipf’s law at the level of
characters, words, and so on. Recently, Chen and colleagues (2012) investigated Chinese
literature from the Tang Dynasty (A.D. 618–A.D. 907) to the present at both the word level and
the character level, and they found that although the Chinese language has changed dramatically
over time, Chinese word frequency and character frequency have always abided by the
Zipf-Mandelbrot law and the Menzerath-Altmann law.
Unlike previous studies, which mainly focused on the level of words or complete characters,
we look into the inner structure of Chinese characters. To test the hypothesis put forward in
the first section, we carried out curve-fitting experiments using the Zipf-Mandelbrot equation
below, with ‘f’ being the frequency of Chinese characters and ‘N’ being the NS or NC of each given
Chinese character.
(3) $f = C (N + a)^{-b}$
3.1 EXPERIMENT. The experiments were carried out in the following steps:
First, we calculated the NS and NC of all 3,061 characters according to the materials
discussed in section 2.
Second, we added up the cumulative frequency of the Chinese characters that share the same NS or
NC and then calculated the average frequency by dividing the cumulative frequency by the
number of Chinese characters (NCC) sharing the same NS or NC.
Third, we carried out curve-fitting experiments using the data obtained in the previous two steps.
Finally, we examined the results of step 3, discussed whether the relationship between the
structural complexity of Chinese characters and their frequency fits Zipf’s law, and drew
conclusions.
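The second step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `average_frequency_by_complexity`, the record layout, and the toy values are all hypothetical and are not taken from the dictionaries described in section 2.

```python
from collections import defaultdict


def average_frequency_by_complexity(records):
    """Group characters by a complexity value (NS or NC).

    `records` is an iterable of (character, complexity, frequency) tuples.
    Returns {complexity: (cumulative_frequency, ncc, average_frequency)},
    with keys in ascending order of complexity.
    """
    totals = defaultdict(float)  # cumulative frequency per complexity value
    counts = defaultdict(int)    # number of characters (NCC) per value
    for _char, complexity, freq in records:
        totals[complexity] += freq
        counts[complexity] += 1
    return {
        n: (totals[n], counts[n], totals[n] / counts[n])
        for n in sorted(totals)
    }


# Hypothetical toy data: placeholder labels, not real characters or counts.
toy = [("A", 1, 0.04), ("B", 2, 0.02), ("C", 2, 0.01), ("D", 3, 0.006)]
table = average_frequency_by_complexity(toy)
for n, (cumulative, ncc, average) in table.items():
    print(n, ncc, round(cumulative, 4), round(average, 4))
```

The average frequency in the last column, not the cumulative one, is the quantity that would be passed to the fitting step.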
3.2 RESULTS. Tables 1 and 2 show the data on Chinese characters that we used for the
curve-fitting experiments.
<INSERT TABLE 1 ABOUT HERE>
<INSERT TABLE 2 ABOUT HERE>
According to Tables 1 and 2, NS and NC range from 1 to 23 and from 1 to 7, respectively. The
most complex Chinese character in terms of NS is guan (pottery used for containing things or
drawing water), with 23 strokes. The most complex Chinese characters in terms of NC are
jiang (region, area, or border; limit; to draw boundaries), ning (to change from gas to
liquid or from liquid to solid; to stay attentive and focused on something), and chan (to
vibrate; the vibration of something), each with 7 components.
As shown in Figures 1 and 2, the cumulative frequency of Chinese characters first increases
and then decreases as NS or NC increases, which seems to contradict our hypothesis that
more complex characters are less frequent. This may be due to the fact that characters of
different structural complexity are not evenly distributed. As shown in Figures 3 and 4, as
structural complexity increases, the number of characters follows a parabola-like curve: it first
rises and then falls. This tendency may override Zipf’s law and cause the cumulative frequency
of Chinese characters to follow a similar curve as NS or NC changes.
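The argument in this paragraph can be made explicit with one identity. The symbols here are illustrative shorthand, not notation from the tables: for a given complexity value $N$, let $F(N)$ be the cumulative frequency, $n(N)$ the number of characters (NCC), and $\bar{f}(N)$ their average frequency. By the definition of an average,

$$F(N) = n(N) \cdot \bar{f}(N)$$

So even if $\bar{f}(N)$ falls steadily with $N$, $F(N)$ can still rise wherever $n(N)$ grows faster than $\bar{f}(N)$ shrinks.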
<INSERT FIGURE 1 ABOUT HERE>
<INSERT FIGURE 2 ABOUT HERE>
<INSERT FIGURE 3 ABOUT HERE>
<INSERT FIGURE 4 ABOUT HERE>
To reduce the influence of these curves, we used the average frequency of characters with the
same NS or NC, instead of the cumulative frequency, in the fitting experiments.
The fitting results for NS are a = -0.5251, b = 1.0339, C = 0.0024, with a high coefficient of
determination of R² = 0.9787; the results for NC are a = 0.2626, b = 1.3949, C = 0.0015, with
R² = 0.9804. Both fits are satisfactory. The results are shown in Figures 5 and 6.
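To make the reported parameters concrete, the sketch below evaluates equation (3) with the fitted values of a, b, and C over the observed ranges of NS (1 to 23) and NC (1 to 7). The helper names `zipf_mandelbrot` and `r_squared` are hypothetical, and the coefficient of determination is computed here on the raw frequency scale, which is an assumption: the scale used in the original fitting is not stated.

```python
def zipf_mandelbrot(n, c, a, b):
    """Equation (3): f = C * (N + a) ** (-b)."""
    return c * (n + a) ** (-b)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (raw scale)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Parameter values as reported in the text for the two measurements.
NS_FIT = {"c": 0.0024, "a": -0.5251, "b": 1.0339}
NC_FIT = {"c": 0.0015, "a": 0.2626, "b": 1.3949}

# Fitted curves over the observed ranges of NS and NC.
ns_curve = [zipf_mandelbrot(n, **NS_FIT) for n in range(1, 24)]
nc_curve = [zipf_mandelbrot(n, **NC_FIT) for n in range(1, 8)]

for n, value in enumerate(ns_curve, start=1):
    print(n, f"{value:.6f}")
```

With the average frequencies from Tables 1 and 2 as `observed`, `r_squared(observed, ns_curve)` would give a value comparable to the reported one only if the same scale and data were used.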
<INSERT FIGURE 5 ABOUT HERE>
<INSERT FIGURE 6 ABOUT HERE>
4. DISCUSSION AND CONCLUSION. As the results in the last section show, the relationship
between structural complexity and frequency abides by Zipf’s law when structural
complexity is measured by either NS or NC and frequency is the average frequency of
characters sharing the same structural complexity. This once again supports the universal
applicability of Zipf’s law.
As mentioned in the first section, which measurement of the structural complexity of
Chinese characters is better remains under discussion. In this study, we used two measurements,
NS and NC. Both measurements fit Zipf’s law with a high coefficient of determination.
Although the results still need to be tested with larger corpora or with texts covering
more genres, our results suggest that both measurements are suitable for linguistic studies,
especially quantitative ones.
Although our study did not show a distinct advantage of the stroke measurement over the
component measurement, taking NS as the measurement of the structural complexity of Chinese
characters still has clear advantages in linguistic studies. As stated by Altmann (2004), the
complexity of things is not an inherent property of the things themselves but a property of how
people interpret their structure. Although both NS and NC can reflect the inner structure of
Chinese characters as perceived by (Chinese) people, NS is the more practical measurement of
complexity for two reasons. First, the definition of a stroke is clear and uncontroversial,
whereas the definition of a component is not. Linguists commonly agree that a stroke is the basic
unit of the structure of Chinese characters, a continuous line written in a single movement,
whereas Ban and Zhang (2004) listed as many as eight representative definitions of a component.
Second, because the stroke is clearly defined, the standards for distinguishing strokes are
straightforward, while the standards for splitting a character into components are less explicit.
Even the two most authoritative component standards issued by the Chinese State Language
Committee are considered by Wang and Huang (2013) to be impractical and to have an unacceptably
large number of components.
In conclusion, the relationship between structural complexity, measured by either NS or NC,
and frequency abides by Zipf’s law. According to our research, both NS and NC are suitable
measurements of the complexity of Chinese characters. Future research is still needed to
determine whether NS or NC is the better choice for linguistic research.
REFERENCES
Altmann, G. (2004). Script complexity. Glottometrics, 8, 68-74.
Ban, J. Q., & Zhang, Y. J. (2004). Definitions of Chinese character component. Journal of
Yangzhou University: Humanities and Social Sciences, 8(4), 62-65. [班吉庆, & 张亚军.
(2004). 汉字部件的定义. 扬州大学学报: 人文社会科学版, 8(4), 62-65.]
Bunke, H., & Wang, P. S. (Eds.). (1997). Handbook of Character Recognition and Document
Image Analysis. Singapore: World Scientific.
Chen, Q., Guo, J., & Liu, Y. (2012). A statistical study on Chinese word and character usage in
literatures from the Tang Dynasty to the present. Journal of Quantitative Linguistics, 19(3),
232-248.
Choi, S. W. (2000). Some statistical properties and Zipf’s law in Korean text corpus. Journal of
Quantitative Linguistics, 7(1), 19-30.
Du, D. Y. (1954). The strange organization of square-shaped characters. Studies of the Chinese
Language, (12), 27. [杜定友. (1954). 方块字的怪组织. 中国语文, (12), 27.]
Guan, Y., Wang, X. L., & Zhang, K. (1995). Frequency-frequency rank relation of language unit
in computational language model of modern Chinese. Journal of Chinese Information
Processing, 13(2). [关毅, 王晓龙, & 张凯. (1995). 现代汉语计算语言模型中语言单位的频度-频级关系.
中文信息学报, 13(2).]
Ha, L. Q., Stewart, D. W., Hanna, P. J., & Smith, F. J. (2006). Zipf and type-token rules for the
English, Spanish, Irish and Latin languages. Web Journal of Formal, Computational and
Cognitive Linguistics, 1(8), 1-12.
Hatzigeorgiu, N., Mikros, G., & Carayannis, G. (2001). Word length, word frequencies and
Zipf’s law in the Greek language. Journal of Quantitative Linguistics, 8(3), 175-185.
Jayaram, B. D., & Vidya, M. N. (2008). Zipf’s law for Indian languages. Journal of Quantitative
Linguistics, 15(4), 293-317.
Köhler, R. (2005). Synergetic linguistics. In: Köhler, R., Altmann, G., and Piotrowski, R. G.
(Eds.), Quantitative Linguistics. An International Handbook. Berlin: de Gruyter.
Liu, H. T., & Huang, W. (2012). Quantitative linguistics: state of the art, theories and methods.
Journal of Zhejiang University (Humanities and Social Sciences), 42(2), 178-192. [刘海
. (2012). 计量语言学的现状, 理论与方法. 浙江大学学报 (人文社会科学), 42(2), 178-192.]
Mandelbrot, B. (1953). An informational theory of the statistical structure of language. In:
Jackson, B. (Eds.), Communication theory. Woburn, MA: Butterworth.
Packard, J. (Eds.). (2000). The Morphology of Chinese: A Linguistic and Cognitive Approach.
Cambridge: Cambridge University Press.
Smith, R. D. (2008). Investigation of the Zipf-plot of the extinct Meroitic language. arXiv
preprint arXiv:0808.2904.
Stallings, W. (1975). The morphology of Chinese characters: a survey of models and applications.
Computers and the Humanities, 9(1), 13-24.
Su, P. C. (1994). Outline of Modern Chinese Characterology. Peking: Peking University Press.
[苏培成. (1994). 现代汉字学纲要. 北京: 北京大学出版社.]
Wang, G. A. (Eds.). (2007). A Handbook for 1,000 Basic Chinese Characters. Hong Kong: The
Chinese University Press.
Wang, L. (2011). Polysemy and word length in Chinese. Glottometrics, 22, 73-84.
Wang, L. (2014). Synergetic studies on some properties of lexical structures in Chinese. Journal
of Quantitative Linguistics, 21(2), 177-197.
Wang, D. P., & Huang, W. L. (2013). Thoughts about two Chinese character component
standards. Journal of Chinese Information Processing, 27(2), 74-78. [王道平, & 黄文丽.
(2013). 关于两个汉字部件规范的一点思考. 中文信息学报, 27(2), 74-78.]
Wang, Y., Liu, Y. F., & Chen, Q. H. (2009). Zipf’s word frequency distribution in Chinese
literature works. Journal of Beijing Normal University (Natural Science), 45(4), 424-427.
[王洋, 刘宇凡, & 陈清华. (2009). 汉语言文学作品中词频的 Zipf 分布. 北京师范大学学报 (自然科学版),
45(4), 424-427.]
Zipf, G. K. (1935). The Psycho-Biology of Language: An Introduction to Dynamic Philology.
Cambridge, MA: M.I.T. Press.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Cambridge, MA:
Addison-Wesley.
1 Chinese Characters’ Frequency Dictionary was downloaded from the website of International
R & D Center for Chinese Education. URL:
http://nlp.blcu.edu.cn/downloads/download-resources/50.html. Retrieved 2013-1-30.
2 Dictionary of Chinese Characters’ Information was also downloaded from the website of
International R & D Center for Chinese Education. URL:
http://nlp.blcu.edu.cn/downloads/download-resources/25.html. Retrieved 2013-1-30.
3 The Table of Basic Components of Chinese Characters was issued by State Language Work
Committee in 1997.
4 Zhang, W. Y. (2006). The Modern Commonly Used Standardized Chinese Dictionary. Xi’an:
Shaanxi People’s Education Press. [张万有. (2006). 现代常用汉字规范字典. 西安: 陕西人民教育出版社.]
... For example, there is evidence from dozens of languages that overall, shorter words are used more frequently (Börstell et al., 2016;Casas et al., 2019;Corral & Serra, 2020;Hernández-Fernández et al., 2016King & Wedel, 2020;Mahowald et al., 2018;Piantadosi et al., 2011;Shu et al., 2003;Sigurd et al., 2004;Strauss et al., 2007, pp. 277-294;Teahan et al., 2000;Wang & Chen, 2015), as well as 986 translations of The Universal Declaration of Human Rights (Bentz & Ferrer-i-Cancho, 2015). Similarly, ...
... languages appear efficient in this regard(Gibson et al., 2019) with evidence of Zipf's Law of Abbreviation so far having been demonstrated in corpora-based analyses of a variety of languages(Börstell et al., 2016;Casas et al., 2019;Hernández-Fernández et al., 2016 King & Wedel, 2020;Mahowald et al., 2018;Piantadosi et al., 2011; Strauss et al., 2007, pp. 277-294;Tamaoka & Kiyama, 2013;Teahan et al., 2000;Wang & Chen, 2015) as well as 986 individual translations of The Universal ...
... The law abbreviation, i.e. the tendency of more frequent words to be shorter (Zipf 1935), holds practically in every language where it has been tested (Bentz, Ferrer-i-Cancho 2015) and it is a robust pattern, i.e. it holds regardless of the magnitude: letters (Bentz, Ferrer--i-Cancho 2015), duration in time (Hernández-Fernández et al. 2019), number of strokes in Japanese kanjis or Chinese characters (Sanada 2008;Wang, Chen 2015). Thus, the law can be generalized as a tendency of more frequency types to have a greater magnitude. ...
Preprint
A short review of similarities between dolphins and humans with the help of quantitative linguistics and information theory.
... . It holds across language families (Bentz and Ferrer-i-Cancho, 2016;Koplenig et al., 2022;Levshina, 2022;Meylan and Griffiths, 2021;Piantadosi et al., 2011), writing systems (Sanada, 2008;Wang and Chen, 2015) and modalities (Börstell et al., 2016;Torre et al., 2019), and also when word length in characters is replaced by word duration in time . Furthermore, the number of species where a parallel of this law has been confirmed is growing over time (Semple et al., 2022) . ...
Preprint
Full-text available
Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal, in the sense that it has the potential for being exceptionless or with a number of exceptions that is vanishingly small compared to the number of languages on Earth. Since Zipf's pioneering research, this law has been viewed as a manifestation of a universal principle of communication, i.e. the minimization of word lengths, to reduce the effort of communication. Here we revisit the concordance of written language with the law of abbreviation. Crucially, we provide wider evidence that the law holds also in speech (when word length is measured in time), in particular in 46 languages from 14 linguistic families. Agreement with the law of abbreviation provides indirect evidence of compression of languages via the theoretical argument that the law of abbreviation is a prediction of optimal coding. Motivated by the need of direct evidence of compression, we derive a simple formula for a random baseline indicating that word lengths are systematically below chance, across linguistic families and writing systems, and independently of the unit of measurement (length in characters or duration in time). Our work paves the way to measure and compare the degree of optimality of word lengths in languages.
... De esta forma, se ha verificado empíricamente en corpus escritos de casi mil idiomas de ochenta familias lingüísticas diferentes [17]. También se ha observado en los logogramas del japonés y chino, los cuales tienden a estar formados por menor número de trazos en función de su frecuencia [165,33]. La ley de Brevedad también se ha observado acústicamente cuando el tamaño de la palabra se mide en duración temporal [74,66,181] e incluso evidencias recientes sugieren que esta ley también aplica a la comunicación acústica de otros primates [81]. ...
Thesis
Full-text available
Linguistic laws constitute one of the cornerstones, quantitatively measurable, of modern cognitive sciences and linguistics, and have been intensively researched during last century, mainly in written corpora. The conclusions reached from the study of statistical patterns of language are therefore, biased by the segmentation used, and characteristics such as the variability on the energy and duration of linguistic units are lost in the uniformity of the written transcription. In this thesis we examine whether linguistic laws hold with respect to the physical manifestations of linguistic units in segmented speech, and later, using a lexical independent segmentation method, we recover those equivalent statistical patterns at timescales even below the phonemic level, where cognitive process does not operate. Using well-known corporas, we first verify that acoustically transcribed durations of linguistic units at several scales comply with a Log-Normal distribution, and we quantitatively justify this observation using a stochastic generative model that only assumes lognormalty in the lower scale. Then, we find that classical linguistic laws appear stronger when using physical units than in their symbolic counterpart, supporting the hypothesis that statistical laws in language have a physical origin. We show that Herdan-Heaps’ law has to hold the same exponent regarding of the units of study, we propose for the first time a precise mathematical formulation of Zipf’s law of abbreviation, which we show to be connected to optimal compression principles in information theory and a mathematical derivation of Menzerath-Altmann’s law which also highlights an additional regime where the law is inverted. Finally, we apply the threshold method to the speech signal, without the need of a segmentation, recovering linguistic laws at infraphonemic levels and showing more evidences that the speech may be produced by a system operating close to a critical point. 
These methods support the idea that linguistic laws found in written corpora may be, at least in part, a byproduct of dynamics that emerge from non-cognitive levels and, in any case, it pave the way for new comparative studies in animal communication or the analysis of signals of unknown code.
... It qualitatively states that the more frequently a word is used, the 'shorter' that word tends to be [44,45,62]. Word size has been measured in terms of number of characters, according to which the law has been verified empirically in written corpora from almost a thousand languages of 80 different linguistic families [63], and similarly logograms tend to be made of fewer strokes in both Japanese and Chinese [64,65]. The law has also been observed acoustically when word size is measured in terms of word time duration [8,66,67], and recent evidence even suggests that this law also holds in the acoustic communication of other primates [68]. ...
Article
Full-text available
Physical manifestations of linguistic units include sources of variability due to factors of speech production which are by definition excluded from counts of linguistic symbols. In this work, we examine whether linguistic laws hold with respect to the physical manifestations of linguistic units in spoken English. The data we analyse come from a phonetically transcribed database of acoustic recordings of spontaneous speech known as the Buckeye Speech corpus. First, we verify with unprecedented accuracy that acoustically transcribed durations of linguistic units at several scales comply with a lognormal distribution, and we quantitatively justify this ‘lognormality law’ using a stochastic generative model. Second, we explore the four classical linguistic laws (Zipf’s Law, Herdan’s Law, Brevity Law and Menzerath–Altmann’s Law (MAL)) in oral communication, both in physical units and in symbolic units measured in the speech transcriptions, and find that the validity of these laws is typically stronger when using physical units than in their symbolic counterpart. Additional results include (i) coining a Herdan’s Law in physical units, (ii) a precise mathematical formulation of Brevity Law, which we show to be connected to optimal compression principles in information theory and allows to formulate and validate yet another law which we call the size- rank law or (iii) a mathematical derivation of MAL which also highlights an additional regime where the law is inverted. Altogether, these results support the hypothesis that statistical laws in language have a physical origin.
... This theoretical approaches strongly suggest that it might be possible to provide a coherent and parsimonious explanation for the laws we have examined in this article and other laws such as Zipf's law for word frequencies [44] or Menzerath's law [45]. The need for an abstract standpoint is not only suggested by our analyses but also by patterning consistent with these laws in human language in different conditions, e.g., sign language [46], Kanji or Chinese characters [47,48], and also in animal communication [12,49,50]. ...
Preprint
Full-text available
The pioneering research of G. K. Zipf on the relationship between word frequency and other word features led to the formulation of various linguistic laws. The most popular is Zipf's law for word frequencies. Here we focus on two laws that have been studied less intensively: the meaning-frequency law, i.e. the tendency of more frequent words to be more polysemous, and the law of abbreviation, i.e. the tendency of more frequent words to be shorter. In a previous work, we tested the robustness of these Zipfian laws for English, roughly measuring word length in number of characters and distinguishing adult from child speech. In the present article, we extend our study to other languages (Dutch and Spanish) and introduce two additional measures of length: syllabic length and phonemic length. Our correlation analysis indicates that both the meaning-frequency law and the law of abbreviation hold overall in all the analyzed languages.
... This theoretical approaches strongly suggest that it might be possible to provide a coherent and parsimonious explanation for the laws we have examined in this article and other laws such as Zipf's law for word frequencies [44] or Menzerath's law [45]. The need for an abstract standpoint is not only suggested by our analyses but also by patterning consistent with these laws in human language in different conditions, e.g., sign language [46], Kanji or Chinese characters [47,48], and also in animal communication [12,49,50]. ...
Article
Full-text available
The pioneering research of G. K. Zipf on the relationship between word frequency and other word features led to the formulation of various linguistic laws. The most popular is Zipf's law for word frequencies. Here we focus on two laws that have been studied less intensively: the meaning-frequency law, i.e. the tendency of more frequent words to be more polysemous, and the law of abbreviation, i.e. the tendency of more frequent words to be shorter. In a previous work, we tested the robustness of these Zipfian laws for English, roughly measuring word length in number of characters and distinguishing adult from child speech. In the present article, we extend our study to other languages (Dutch and Spanish) and introduce two additional measures of length: syllabic length and phonemic length. Our correlation analysis indicates that both the meaning-frequency law and the law of abbreviation hold overall in all the analyzed languages.
Article
Full-text available
Two language laws have been identified as consistent patterns shaping animal behaviour, both acting on the organizational level of communicative systems. Zipf's law of brevity describes a negative relationship between behavioural length and frequency. Menzerath's law defines a negative correlation between the number of behaviours in a sequence and average length of the behaviour composing it. Both laws have been linked with the information-theoretic principle of compression, which tends to minimize code length. We investigated their presence in a case study of male chimpanzee sexual solicitation gesture. We failed to find evidence supporting Zipf's law of brevity, but solicitation gestures followed Menzerath's law: longer sequences had shorter average gesture duration. Our results extend previous findings suggesting gesturing may be limited by individual energetic constraints. However, such patterns may only emerge in sufficiently large datasets. Chimpanzee gestural repertoires do not appear to manifest a consistent principle of compression previously described in many other close-range systems of communication. Importantly, the same signallers and signals were previously shown to adhere to these laws in subsets of the repertoire when used in play; highlighting that, in addition to selection on the signal repertoire, ape gestural expression appears shaped by factors in the immediate socio-ecological context.
Preprint
Two language laws have been identified as manifestations of universal principles of animal behaviour, both acting on the organisation of numerous vocal and behavioural communicative systems. Zipf's law of brevity describes a negative relationship between behavioural length and frequency of behaviour. Menzerath's law defines a negative correlation between the number of behaviours in a sequence and the average length of the behaviours composing it. Both laws have been linked with the information-theoretic principle of compression, which tends to minimise code length. We investigate the presence of these two laws in the repertoire of chimpanzee sexual solicitation gestures. We find that chimpanzee solicitation gestures do not follow either Zipf's law of brevity or Menzerath's law consistently. For the second time in ape gestural communication, evidence supporting Zipf's law of brevity was absent, and, here, the presence of Menzerath's law appears individually driven. Ape gesture does not appear to manifest a principle of compression or pressure for efficiency that has been previously proposed to be universal. Importantly, the same signals were shown to adhere to these laws when used in a different behavioural context, highlighting that signallers consider signalling efficiency broadly, and diverse factors play important roles in shaping investment in signal production.
Article
Studies testing linguistic laws outside language have provided important insights into the organization of biological systems. For example, patterns consistent with Zipf's law of abbreviation (which predicts a negative relationship between word length and frequency of use) have been found in the vocal and non-vocal behaviour of a range of animals, and patterns consistent with Menzerath's law (according to which longer sequences are made up of shorter constituents) have been found in primate vocal sequences, and in genes, proteins and genomes. Both laws have been linked to compression, the information-theoretic principle of minimizing code length. Here, we present the first test of these laws in animal gestural communication. We initially did not find the negative relationship between gesture duration and frequency of use predicted by Zipf's law of abbreviation, but this relationship was seen in specific subsets of the repertoire. Furthermore, a pattern opposite to that predicted was seen in one subset of gestures: whole body signals. We found a negative correlation between number and mean duration of gestures in sequences, in line with Menzerath's law. These results provide the first evidence that compression underpins animal gestural communication, and highlight an important commonality between primate gesturing and language.
Article
The aim of this paper is to report for the first time the 1000 most common words and lemmas of Modern Greek and some of their quantitative characteristics. The word frequency list is based on the Hellenic National Corpus (HNC), a corpus of Modern Greek consisting of about 13 million words of written text. In particular, we investigate how Zipf’s law applies to both the 1000 most common words and the 1000 most common lemmas. In addition, we examine the frequency distribution of the grammatical categories in the 1000 most common words and lemmas, as well as the average word length in the whole HNC and the growth of the average word length as a function of the number of the most common words.
Article
In this paper, we carried out a statistical analysis of Chinese corpora from the Tang, Song, Yuan, Ming and Qing Dynasties, as well as from modern times. We found that character and word frequencies change over time such that the word frequency always abides by the Zipf-Mandelbrot law $p(r) = C(r + r_0)^{-\gamma}$, while the character frequency follows the Menzerath-Altmann law $P(r) = A e^{-ar} r^{-b}$. In the case of the character frequency distribution, the exponential property increases and the power-law feature declines over time. We also found that more and more compound words have been created since the Tang Dynasty. Single-character words are distributed unevenly across the whole word frequency distribution, with more of them concentrated in the earlier period and decaying exponentially.
Book
This groundbreaking study dispels the common belief that Chinese 'doesn't have words' but instead 'has characters'. Jerome Packard's book provides a comprehensive discussion of the linguistic and cognitive nature of Chinese words. It shows that Chinese, far from being 'morphologically impoverished', has a different morphological system because it selects different 'settings' on parameters shared by all languages. The analysis of Chinese word formation therefore enhances our understanding of word universals. Packard describes the intimate relationship between words and their components, including how the identities of Chinese morphemes are word-driven, and offers new insights into the evolution of morphemes based on Chinese data. Models are offered for how Chinese words are stored in the mental lexicon and processed in natural speech, showing that much of what native speakers know about words occurs innately in the form of a hard-wired, specifically linguistic 'program' in the brain.
Chapter
Synergetics is a special type of systems theoretical modelling whose specific characteristic is the treatment of the spontaneous rise and development of structures. The exponents of this interdisciplinary approach in the field of linguistics have shown that synergetics is also compatible with the functional analytic models and explanatory approaches of quantitative linguistics. It provides concepts which are applicable to the phenomena of self-regulation and self-organisation as they are investigated in quantitative linguistics. Like other self-organising systems, language is characterised by the presence of cooperative and competitive processes which, together with the external forces of biology, psychology, physics, the social system and others, form the dynamics of the system.
Article
This paper reports on a test of some aspects of the synergetic-linguistic model on data from Chinese. The lexical model, which is based on word length, polysemy, frequency and polytextuality, is confronted with data from the People’s Daily news corpus. The results of this study demonstrate once more the cross-linguistic validity of the model with respect to a highly analytic language.
Article
An account of discoveries pertaining to linguistic change, presenting many problems to the psychologist whose interest lies in speech-behavior or meaning. Harvard Book List (edited) 1955 #268