All content in this area was uploaded by Xinying Chen on May 27, 2017
STRUCTURAL COMPLEXITY OF SIMPLIFIED CHINESE CHARACTERS
Yanru WANG1, Xinying Chen2*
ABSTRACT. In this paper, we study, from a synergetic perspective, the relationship between the
structural complexity and the frequency of Chinese characters. We measured the structural
complexity of Chinese characters by both the number of strokes and the number of
components. We then tested whether the relationship between Chinese characters’ structural
complexity and their frequency fits Zipf’s law by analyzing the 3,061 most frequent Chinese
characters from the Chinese Characters’ Frequency Dictionary. The results show that, under both
measurements, the relationship between the structural complexity of Chinese characters and their
frequency abides by the Zipf-Mandelbrot law.
Keywords: Zipf’s law, structural complexity, Chinese characters, components, strokes, frequency
1. INTRODUCTION. Existing studies of Chinese morphology discuss the morphemic
combination of characters into words (Packard 2000) and the placement of aspectual markers
(Stallings 1975). However, one topic in this area has rarely been explored: the structural
complexity of Chinese characters.
Theoretical research has seldom addressed this question, and most studies of
the structural complexity of Chinese characters were driven by interest in natural language
processing applications and Chinese teaching (Bunke & Wang 1997). Although a few
exceptions have tried to discuss the structural complexity of Chinese characters from a theoretical
point of view, they commonly lack sufficient data and a solid theoretical standpoint
(Wang 2007).
This study addresses the question of how to measure the structural complexity of Chinese
characters by using the methods established and tested by Köhler and colleagues (2005) and
Wang (2011) for morphological research on inflected languages.
The synergetic linguistics proposed by Köhler and colleagues (2005) regards language as a
self-organizing and self-regulating dynamic system and provides a linguistic
framework built on that foundation. The synergetic-linguistic model was built and tested on
German for the lexical subsystem by Köhler (1986). The model describes and, in a way,
explains the relationships between different language features, such as
structural complexity, number of meanings, and frequency of language units, and it has
proved applicable to many languages. For Chinese, Wang (2011) showed that the
relationship between Chinese words’ polysemy and word length fits this model. Wang (2014)
also studied the relationship between word length, polysemy, polytextuality, and frequency in
Chinese according to this model, and the results further confirmed the model’s applicability to the
Chinese lexical subsystem.
According to this model, which has proved applicable to many other linguistic features
of Chinese, we hypothesized that more complex characters would be less frequent because of the
minimization of effort in language production, which is similar to the principle of least effort
proposed by Zipf (1949), and that the relation between the frequency and complexity of characters
should abide by Zipf’s law.
However, Chinese linguists still do not agree on whether
the number of strokes (referred to as NS) or the number of components (referred to as NC) is the
better measurement of the structural complexity of Chinese characters. The question was raised
as early as the 1950s by Du (1954), who himself made contradictory statements on
whether strokes or components should be the basic units of Chinese characters. It was not
until the 1990s that Chinese linguists developed a systematic theory of the structural units of
Chinese characters. Su (1994) proposed that, in order to analyze the structure of modern Chinese
characters, scholars should set up a new theory in which there are three levels in the structure of
Chinese characters: the stroke, the component, and the complete character, among which
the component is the core. However, traditional Chinese linguists did not point out which of the
three units is most suitable for measuring the structural complexity of Chinese
characters.
In quantitative linguistics, Bohn proposed using NS to measure the structural
complexity of Chinese characters (Liu & Huang 2012), while Altmann (2004) proposed a universal
procedure for measuring script complexity by assigning numerical values to different types of
elementary units and connections and then adding up all the values for a given sign. Liu and
Huang (2012) argued that, when it comes to measuring the structural complexity of Chinese
characters, both Bohn’s proposal and Altmann’s proposal can serve certain research purposes;
however, neither method provides an interpretation of the structural complexity of Chinese
characters from the perspectives of linguistics and philology.
Although Altmann’s method is universally applicable to different scripts, it cannot
describe the inner structure of Chinese characters as well as NS and NC do. Therefore, we used
NS and NC as the measurements of structural complexity in this study. Our goals are
twofold: to test our hypothesis that the frequency-structural complexity relationship of Chinese
characters fits Zipf’s law, and to gain insight into measuring the structural complexity of
characters by comparing the results from the two measurements.
2. DATA AND MATERIALS. Our data are derived from the Chinese Characters’ Frequency Dictionary,1
which was built from a corpus of People’s Daily, a well-known newspaper in China. It contains
the 3,061 most frequent Chinese characters and their frequencies in the corpus, which
together account for 99.43% of the whole corpus.
Other materials that we used as standards to measure the complexity of the 3,061 Chinese
characters include: the Dictionary of Chinese Characters’ Information,2 which presents basic
information, namely the pinyin, number of strokes, and components, for commonly used Chinese
characters; The Table of Basic Components of Chinese
Characters,3 which lists 560 basic components of Chinese characters; and The Modern
Commonly Used Standardized Chinese Dictionary,4 which also presented basic information
(origin, simplification approach, pinyin, number of strokes, and components) of characters in the
dictionary.
3. EXPERIMENTS AND RESULTS. In his two books, The Psycho-Biology of Language (Zipf 1935)
and Human Behavior and the Principle of Least Effort (Zipf 1949), George Kingsley Zipf
proposed that the word frequency distribution follows a power law, now well known as
Zipf’s law. The law originally takes the form below:
(1) f=C/r
In this equation, ‘f’ and ‘r’ represent the word frequency and the rank of that frequency,
respectively, and ‘C’ is a constant that depends on the corpus in question. Later, because
some researchers found that the word frequency distribution deviated slightly from the
original Zipf distribution at both ends, a modified function was proposed (Mandelbrot 1953).
The result is the famous Zipf-Mandelbrot law below, which describes empirical data in
linguistics more accurately.
(2) $P_r = C(r+a)^{-b}$
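To make the difference between equations (1) and (2) concrete, the following minimal Python sketch evaluates both forms side by side. The function names and the numeric values of C, a, and b are assumptions chosen for illustration only; they are not taken from any corpus or from this study.

```python
def zipf(rank, C):
    """Original Zipf form, equation (1): f = C / r."""
    return C / rank


def zipf_mandelbrot(rank, C, a, b):
    """Zipf-Mandelbrot form, equation (2): P_r = C * (r + a) ** (-b)."""
    return C * (rank + a) ** (-b)


# Illustrative constants only (assumed, not estimated from data).
C = 0.1
for r in (1, 2, 10, 100):
    print(
        r,
        zipf(r, C),
        zipf_mandelbrot(r, C, a=0.0, b=1.0),  # reduces to equation (1)
        zipf_mandelbrot(r, C, a=2.7, b=1.1),  # flattened head, steeper tail
    )
```

With a = 0 and b = 1 the Zipf-Mandelbrot form coincides with the original Zipf form, which is why the two extra parameters can be read as corrections at the two ends of the distribution.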
Although Zipf’s law was first discovered in linguistics, it has proved applicable in
many other scientific fields, such as bibliometrics, economics, sociology, and informatics.
In linguistics, the law has been tested by many scholars and found applicable to various
languages, such as Korean (Choi 2000), Greek (Hatzigeorgiu et al. 2001), English, Spanish,
Irish, and Latin (Ha et al. 2006), and Indian languages
(Jayaram & Vidya 2008). Moreover, it was tested on an ancient language, Meroitic, used in
southern Sudan a thousand years ago (Smith 2008).
It has also been tested on the Chinese language. Wang and colleagues (2009) used Dream of the
Red Chamber 红楼梦, Selected Works of Mao Tse-tung 毛泽东选集, and Selected Works of
Deng Xiaoping 邓小平文选 as their research corpus and confirmed the applicability of Zipf’s
law to Chinese. Guan and colleagues (1995) used Chinese word frequency statistics
shared on the Internet and found that modern Chinese conforms to Zipf’s law at the
level of characters, words, and so on. Recently, Chen and colleagues (2012) investigated Chinese
literature from the Tang Dynasty (A.D. 618–A.D. 907) to the present at both the word level and
the character level, and they found that although the Chinese language changed dramatically over
time, Chinese word frequency and character frequency always abide by the Zipf-Mandelbrot law and
the Menzerath-Altmann law.
Unlike previous studies, which mainly focused on the level of words or complete
characters, we look into the inner structure of Chinese characters. To test the
hypothesis put forward in the first section, we carried out curve-fitting experiments
using the equation of the Zipf-Mandelbrot law, given below, with ‘f’ being the frequency of
Chinese characters and ‘N’ being the NS or NC of each given Chinese character.
(3) $f = C(N+a)^{-b}$
3.1 EXPERIMENT. The experiments were carried out in the following steps:
First, we calculated the NS and NC of all 3,061 characters according to the materials
discussed in section 2.
Second, we summed the frequencies of the Chinese characters that share the same NS or
NC to obtain the cumulative frequency, and then calculated the average frequency by dividing
the cumulative frequency by the number of Chinese characters (NCC) sharing the same NS or NC.
Third, we carried out curve-fitting experiments using the data obtained in the previous two steps.
Finally, we examined the results of step 3, discussed whether the relationship between the
structural complexity of Chinese characters and their frequency fits Zipf’s law, and drew
conclusions.
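The second step above can be sketched in a few lines of Python. This is only an illustration of the grouping logic, not the procedure actually used in the study: the function name, the record layout, and the toy frequencies are all assumptions, and the frequency values are invented rather than taken from the Chinese Characters’ Frequency Dictionary.

```python
from collections import defaultdict


def average_frequency_by_complexity(records):
    """Group (character, complexity, frequency) records by complexity.

    Returns {complexity: (ncc, cumulative_frequency, average_frequency)},
    where ncc is the number of characters sharing that complexity value.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _char, complexity, freq in records:
        totals[complexity] += freq
        counts[complexity] += 1
    return {
        n: (counts[n], totals[n], totals[n] / counts[n])
        for n in sorted(totals)
    }


# Hypothetical toy records: (character, NS, relative frequency).
# The frequencies are invented for illustration only.
toy = [
    ("一", 1, 0.0120),
    ("人", 2, 0.0080),
    ("了", 2, 0.0100),
    ("大", 3, 0.0040),
    ("上", 3, 0.0060),
    ("工", 3, 0.0020),
]
result = average_frequency_by_complexity(toy)
for ns, (ncc, cumulative, average) in result.items():
    print(ns, ncc, round(cumulative, 4), round(average, 4))
```

The same function works for NC by passing the number of components in place of the number of strokes.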
3.2 RESULTS. Tables 1 and 2 show the data on Chinese characters that we used for the
curve-fitting experiments.
<INSERT TABLE 1 ABOUT HERE>
<INSERT TABLE 2 ABOUT HERE>
According to Tables 1 and 2, NS ranges from 1 to 23 and NC from 1 to 7. The
most complex Chinese character in terms of NS is ‘罐’ guan (pottery used for containing things or
drawing water), with 23 strokes. The most complex Chinese characters in terms of NC are
‘疆’ jiang (region, area, or border; limit; to draw boundaries), ‘凝’ ning (to change from gas to
liquid or from liquid to solid; to stay attentive and focused on something), and ‘颤’ chan (to
vibrate; the vibration of something), each with 7 components.
As shown in Figures 1 and 2, the cumulative frequency of Chinese characters first increases
and then decreases as NS or NC increases, which seems to contradict our hypothesis that
more complex characters are less frequent. This may be due to the fact that characters
of different structural complexity are not evenly distributed. As shown in Figures 3 and 4, as
the characters’ structural complexity increases, the number of characters follows a
parabola-like curve. This tendency may override Zipf’s law, causing the cumulative frequency
of Chinese characters to follow a similar parabola-like curve as NS or NC changes.
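The interplay described in this paragraph can be written out explicitly. The symbols below are introduced here for illustration and are not the authors’ notation: $F(N)$ is the cumulative frequency of all characters with complexity $N$, $n(N)$ is the number of such characters (NCC), $f_c$ is the frequency of character $c$, and $\bar{f}(N)$ is their average frequency.

```latex
F(N) \;=\; \sum_{c \,:\, \mathrm{complexity}(c) = N} f_c \;=\; n(N)\,\bar{f}(N),
\qquad
\bar{f}(N) \;=\; \frac{F(N)}{n(N)}
```

Under this decomposition, a monotonically decreasing $\bar{f}(N)$ is compatible with a cumulative frequency $F(N)$ that first rises and then falls, as long as $n(N)$ grows quickly enough at small $N$. Dividing by $n(N)$ removes that factor.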
<INSERT FIGURE 1 ABOUT HERE>
<INSERT FIGURE 2 ABOUT HERE>
<INSERT FIGURE 3 ABOUT HERE>
<INSERT FIGURE 4 ABOUT HERE>
In order to reduce the influence of these parabola-like curves, we used the average frequency of
characters with the same NS or NC, instead of the cumulative frequency, in the fitting experiments.
The fitting results are a=-0.5251, b=1.0339, C=0.0024 for NS and a=0.2626,
b=1.3949, C=0.0015 for NC, with coefficients of determination R²=0.9787 and R²=0.9804,
respectively. Both fitting results are satisfactory. The results are shown in Figures 5 and 6.
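The paper does not state which fitting routine was used, so the following is only one possible sketch of how parameters of equation (3) and a coefficient of determination could be obtained. It assumes a simple strategy that is not necessarily the authors’: scan candidate values of a on a grid, and for each one estimate b and C by ordinary least squares on log f against log(N + a). The function names, the grid, and the synthetic data are all assumptions; the data are generated from known parameters rather than taken from Tables 1 and 2.

```python
import math


def fit_zipf_mandelbrot(ns, freqs, a_grid):
    """Fit f = C * (N + a) ** (-b) by scanning a and regressing in log space."""
    best = None
    log_f = [math.log(f) for f in freqs]
    mean_y = sum(log_f) / len(log_f)
    for a in a_grid:
        if min(ns) + a <= 0:
            continue  # log(N + a) would be undefined
        x = [math.log(n + a) for n in ns]
        mean_x = sum(x) / len(x)
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, log_f))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, log_f))
        if best is None or sse < best[0]:
            best = (sse, a, -slope, math.exp(intercept))
    _, a, b, C = best
    return a, b, C


def r_squared(ns, freqs, a, b, C):
    """Coefficient of determination of the fitted curve on the original scale."""
    pred = [C * (n + a) ** (-b) for n in ns]
    mean_f = sum(freqs) / len(freqs)
    ss_res = sum((f - p) ** 2 for f, p in zip(freqs, pred))
    ss_tot = sum((f - mean_f) ** 2 for f in freqs)
    return 1.0 - ss_res / ss_tot


# Synthetic average frequencies generated from assumed parameters (not the paper's data).
ns = list(range(1, 24))
true_a, true_b, true_C = 0.25, 1.4, 0.002
freqs = [true_C * (n + true_a) ** (-true_b) for n in ns]

a_grid = [i / 100 for i in range(-90, 201)]  # candidate shifts from -0.90 to 2.00
a, b, C = fit_zipf_mandelbrot(ns, freqs, a_grid)
print(a, b, C, r_squared(ns, freqs, a, b, C))
```

Fitting in log space weights relative errors equally across all values of N, whereas a nonlinear least-squares fit on the original scale is dominated by the largest frequencies, so the two approaches can yield somewhat different parameters on real data.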
<INSERT FIGURE 5 ABOUT HERE>
<INSERT FIGURE 6 ABOUT HERE>
4. DISCUSSION AND CONCLUSION. As the results in the last section show, the relationship
between structural complexity and frequency abides by Zipf’s law, with structural
complexity measured by both NS and NC and frequency taken as the average frequency of
characters sharing the same structural complexity. This once again supports the universal
applicability of Zipf’s law.
As mentioned in the first section, which measurement of the structural complexity of
Chinese characters is better remains an ongoing discussion. In this study, we used two
measurements, NS and NC. Both measurements turned out to fit Zipf’s law with a high coefficient
of determination. Although the results still need to be tested with a larger corpus or with texts
covering more genres, our results suggest that both measurements are suitable for linguistic
studies, especially quantitative linguistic studies.
Although our study did not show a distinct advantage of the stroke measurement over the
component measurement, taking NS as the structural complexity measurement of Chinese
characters still has clear advantages in linguistic studies. As stated by Altmann (2004),
the complexity of things is not an inherent property of the things themselves, but a property of
how people interpret their structure. Although both NS and NC can reflect the inner structure of
Chinese characters as perceived by (Chinese) people, NS is a more practical measurement of
complexity for two reasons. First, the definition of the stroke is clear and uncontroversial, while
the definition of the component is not. Linguists commonly agree that the stroke is the basic unit
of the structure of Chinese characters, a continuous line written in a single movement,
whereas the definition of the component is quite controversial: Ban and Zhang (2004) listed as
many as eight representative definitions of the component. Second, the standards for identifying
strokes are quite straightforward thanks to this clear definition, while the standards for splitting
a character into components are not so explicit. Even the two most authoritative component
standards issued by the Chinese State Language Committee are considered by Wang and Huang
(2013) to be impractical and to have an unacceptably large number of components.
In conclusion, the relationship between structural complexity, measured by both NS and NC,
and frequency abides by Zipf’s law. According to our research, both NS and NC are proper
measurements of the complexity of Chinese characters. Future research is still needed to
determine whether NS or NC is the better choice for linguistic research.
REFERENCES
Altmann, G. (2004). Script complexity. Glottometrics, 8, 68-74.
Ban, J. Q., & Zhang, Y. J. (2004). Definitions of Chinese character component. Journal of
Yangzhou University: Humanities and Social Sciences, 8(4), 62-65. [班吉庆, & 张亚军.
(2004). 汉字部件的定义. 扬州大学学报: 人文社会科学版, 8(4), 62-65.]
Bunke, H., & Wang, P. S. (Eds.). (1997). Handbook of Character Recognition and Document
Image Analysis. Singapore: World Scientific.
Chen, Q., Guo, J., & Liu, Y. (2012). A statistical study on Chinese word and character usage in
literatures from the Tang Dynasty to the present. Journal of Quantitative Linguistics, 19(3),
232-248.
Choi, S. W. (2000). Some statistical properties and Zipf’s law in Korean text corpus. Journal of
Quantitative Linguistics, 7(1), 19-30.
Du, D. Y. (1954). The strange organization of square-shaped characters. Studies of the Chinese
Language, (12), 27. [杜定友. (1954). 方块字的怪组织. 中国语文, (12), 27.]
Guan, Y., Wang, X. L., & Zhang, K. (1995). Frequency-frequency rank relation of language unit
in computational language model of modern Chinese. Journal of Chinese Information
Processing, 13(2). [关毅, 王晓龙, & 张凯. (1995). 现代汉语计算语言模型中语言单位的频度-频级关系.
中文信息学报, 13(2).]
Ha, L. Q., Stewart, D. W., Hanna, P. J., & Smith, F. J. (2006). Zipf and type-token rules for the
English, Spanish, Irish and Latin languages. Web Journal of Formal, Computational and
Cognitive Linguistics, 1(8), 1-12.
Hatzigeorgiu, N., Mikros, G., & Carayannis, G. (2001). Word length, word frequencies and
Zipf’s law in the Greek language. Journal of Quantitative Linguistics, 8(3), 175-185.
Jayaram, B. D., & Vidya, M. N. (2008). Zipf’s law for Indian languages. Journal of Quantitative
Linguistics, 15(4), 293-317.
Köhler, R. (2005). Synergetic linguistics. In: Köhler, R., Altmann, G., and Piotrowski, R. G.
(Eds.), Quantitative Linguistics. An International Handbook. Berlin: de Gruyter.
Liu, H. T., & Huang, W. (2012). Quantitative linguistics: state of the art, theories and methods.
Journal of Zhejiang University (Humanities and Social Sciences), 42(2), 178-192. [刘海涛, & 黄伟.
(2012). 计量语言学的现状, 理论与方法. 浙江大学学报 (人文社会科学版), 42(2), 178-192.]
Mandelbrot, B. (1953). An informational theory of the statistical structure of language. In:
Jackson, B. (Eds.), Communication theory. Woburn, MA: Butterworth.
Packard, J. (2000). The Morphology of Chinese: A Linguistic and Cognitive Approach.
Cambridge: Cambridge University Press.
Smith, R. D. (2008). Investigation of the Zipf-plot of the extinct Meroitic language. arXiv
preprint arXiv:0808.2904.
Stallings, W. (1975). The morphology of Chinese characters: a survey of models and applications.
Computers and the Humanities, 9(1), 13-24.
Su, P. C. (1994). Outline of Modern Chinese Characterology. Peking: Peking University Press.
[苏培成. (1994). 现代汉字学纲要. 北京: 北京大学出版社.]
Wang, G. A. (Eds.). (2007). A Handbook for 1,000 Basic Chinese Characters. Hong Kong: The
Chinese University Press.
Wang, L. (2011). Polysemy and word length in Chinese. Glottometrics, 22, 73-84.
Wang, L. (2014). Synergetic studies on some properties of lexical structures in Chinese. Journal
of Quantitative Linguistics, 21(2), 177-197.
Wang, D. P., & Huang, W. L. (2013). Thoughts about two Chinese character component
standards. Journal of Chinese Information Processing, 27(2), 74-78. [王道平, & 黄文丽.
(2013). 关于两个汉字部件规范的一点思考. 中文信息学报, 27(2), 74-78.]
Wang, Y., Liu, Y. F., & Chen, Q. H. (2009). Zipf’s word frequency distribution in Chinese
literature works. Journal of Beijing Normal University (Natural Science), 45(4), 424-427.
[王洋, 刘宇凡, & 陈清华. (2009). 汉语言文学作品中词频的 Zipf 分布. 北京师范大学学报 (自然科学版),
45(4), 424-427.]
Zipf, G. K. (1935). The Psycho-Biology of Language: An Introduction to Dynamic Philology.
Cambridge, MA: M.I.T. Press.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Cambridge, MA:
Addison-Wesley.
1 Chinese Characters’ Frequency Dictionary was downloaded from the website of International
R & D Center for Chinese Education. URL:
http://nlp.blcu.edu.cn/downloads/download-resources/50.html. (Retrieved 2013-1-30)
2 Dictionary of Chinese Characters’ Information was also downloaded from the website of
International R & D Center for Chinese Education. URL:
http://nlp.blcu.edu.cn/downloads/download-resources/25.html. (Retrieved 2013-1-30)
3 The Table of Basic Components of Chinese Characters was issued by State Language Work
Committee in 1997.
4 Zhang, W. Y. (2006). The Modern Commonly Used Standardized Chinese Dictionary. Xi’an:
Shaanxi People’s Education Press. [张万有. (2006). 现代常用汉字规范字典. 西安: 陕西人民教育出版社.]