Cover of Astounding Science Fiction (February 1957; edited by John W. Campbell, Jr.) featuring the fictional characters M. Dane, H. Penrose, and S. von Ohlmhorst. In the far background, a mural found amidst the University ruins shows a 'heroic-sized Martian' handling a 'theodolite'-like apparatus. Illustration by Frank Kelly Freas; reprinted after Wikipedia (2018) public domain image.

The article will focus on H. Beam Piper’s classical story Omnilingual (1957). This Piper-esque writing has entered the records of the science fiction prose for the ‘Martian’ periodic table of elements, being synonymous with a scientific ‘Rosetta-like stone’ in the decipherment area. The work, while having a search potential in text analysis and sty...

... published as a novelette in Astounding Science Fiction magazine (February 1957; Figure 1), subsequently known as Analog, was later collected in Federation (1981), a compilation of short stories by HBP. Omnilingual deals with a human survey party, archaeologists included, looking for clues and/or indigenous relics among the ruins of a very ancient Martian city. ...
... 2000, or Gastwirth, 2017). Gini's coefficient is the space between the Lorenz curve and the straight line joining <0;1> in the two-dimensional coordinate system (Gini, 1921; cf. Ceriani & Verme, 2012). The Lorenz curve is obtained by cumulatively adding the relative frequencies, beginning from the lowest and proceeding to the highest (Popescu, Altmann et al., 2009, p. 56, Fig. 3.11). Since this space constitutes an area, one needs to compute all the individual areas between the two lines. Even so, easily computable approximations are at our disposal. One of them is given ...
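The construction described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the approximation the excerpt goes on to give (which is elided here): the function name `gini_from_frequencies` is hypothetical, the individual areas are summed with the trapezoid rule, and the result is scaled as one minus twice the area under the Lorenz curve, which is the conventional normalization.

```python
def gini_from_frequencies(freqs):
    """Gini coefficient from word frequencies (illustrative sketch).

    Assumption: G = 1 - 2 * (area under the Lorenz curve), with the
    area summed piecewise by the trapezoid rule.
    """
    values = sorted(freqs)          # lowest to highest, as for the Lorenz curve
    total = sum(values)
    n = len(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive frequency")

    cumulative = 0.0                # running sum of relative frequencies
    previous = 0.0                  # Lorenz value at the previous step
    area_under = 0.0
    for v in values:
        cumulative += v / total
        # trapezoid between two neighbouring Lorenz points, width 1/n
        area_under += (previous + cumulative) / (2 * n)
        previous = cumulative
    return 1.0 - 2.0 * area_under


if __name__ == "__main__":
    print(gini_from_frequencies([5, 5, 5, 5]))    # perfectly even distribution
    print(gini_from_frequencies([0, 0, 0, 10]))   # all mass on one item
```

With this scaling, a perfectly even frequency distribution yields 0, and the value grows as the frequencies concentrate on fewer items.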
... frequency distribution of words in a text, whether ranked or presented as a spectrum, displays a number of properties which can be measured, compared and tested. One of the many measures developed in recent years is the so-called lambda indicator, defined on the basis of Euclidean distances between neighbouring ranked frequencies (cf. Popescu et al., 2011, p. 3). It may be relativized in such a way that text size does not have an apparent influence. Still, under the premises, it can be approximated in a simple form (cf. Popescu & Altmann, 2015). As the underlying Euclidean distance can be approximated ...
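As a rough illustration of the idea, the sketch below sums the Euclidean distances between neighbouring ranked frequencies and then relativizes the sum by text size. The excerpt does not show the formula itself, so the form used here, Λ = L · log10(N) / N with L = Σ sqrt((f_i − f_{i+1})² + 1), is an assumption taken from the commonly cited definition; the function name `lambda_indicator` is hypothetical, and the simplified approximation mentioned in the excerpt is not reproduced.

```python
import math


def lambda_indicator(freqs):
    """Lambda indicator of a frequency list (illustrative sketch).

    Assumption: Lambda = L * log10(N) / N, where L is the summed
    Euclidean distance between neighbouring ranked frequencies
    (rank step = 1) and N is the text size in tokens.
    """
    ranked = sorted(freqs, reverse=True)   # rank-frequency sequence
    n_tokens = sum(ranked)
    if n_tokens <= 0:
        raise ValueError("need at least one token")

    # arc length between neighbouring points (r, f_r) and (r + 1, f_{r+1})
    arc_length = sum(
        math.sqrt((a - b) ** 2 + 1) for a, b in zip(ranked, ranked[1:])
    )
    return arc_length * math.log10(n_tokens) / n_tokens


if __name__ == "__main__":
    print(lambda_indicator([4, 3, 2, 1]))
```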
... to lambda, the results are not easy to interpret; there seems to be a passage with high figures (Sections 9-12), which means that the distances between neighbouring ranked words are constantly high; this is broken by Section 13, where the definite article 'the' substantially prevails over all the other words. This part of the novelette focuses on the description of the Martian premises, where objects and events are treated from different viewpoints, and many speculations are made. ...


... The idea for the essay emerged from research in symbolic and writing systems and semiotics in general (see Melka, 2008; Melka and Místecký, 2020; Melka and Schoch, 2020; Zörnig and Melka, 2014). A number of obstacles, present during the process of interpretation and deciphering of unknown verbal and/or nonverbal information, certainly raise parallels for the far more complex and difficult task: that of identifying and retrieving messages from nonterrestrial sources. ...
This article discusses from a semiotic perspective one important variable, anthropocentrism, present in various proposed messages intended to communicate with offworld intelligences. Our review of different scenarios reveals embedded flaws to various degrees. This should not be a reason for desisting in the pursuit of SETI-style (Search for Extraterrestrial Intelligence) passive "listening" or active "messaging" programs; however, in tandem with such SETI-style programs, a robust and efficient strategy for potential contact should be developed as well. Such a strategy will require adequate time for major structural improvements in both the semiotic and technological realms rather than attempted last-minute adjustments carried out, for instance, when a SETI-style program claims success and contact seems imminent. The realization that humans often have great difficulties in interpreting their own cultural products and experiences (especially, the long-forgotten ones), as well as the communicative abilities of nonhuman residents on Earth, is deemed a critical aspect that must be overcome in order to undertake successful hypothetical communication with extraterrestrial intelligences (ETIs). Furthermore, we believe it is pertinent to raise the issue of the modality that any particular ETIs might utilize or recognize as a communication system. Arguably, the widely held assumption (often unstated) that ETIs will recognize and respond positively to either visual or auditory communication (where auditory communication is often encoded in visual graphic forms, such as writing systems), in many cases coded in and transmitted via electromagnetic waves or some other medium, is simply a form of anthropocentrism at a fundamental level.
... The use of MultiAzterTest is not limited to readability assessment; it can also be used for text analysis, profiling or stylometrics. Text analysis has been used in other research areas such as textbook analysis (Aguirregoitia Martinez et al., 2020), fake news detection and classification (Choudhary and Arora, 2020), authorship attribution (Hou and Huang, 2020), misogyny identification (Fersini et al., 2020), register analysis (Argamon, 2019), analysis of literature (Melka and Místecký, 2019), plagiarism detection (Foltýnek et al., 2019), analysis of the writing differences of women and men (Cocciu et al., 2018), analysis of the narratives in schizophrenia (Willits et al., 2018) or detection of dementia (da Cunha, 2015). ...
Readability assessment is the task of determining how difficult or easy a text is or which level/grade it has. Traditionally, language-dependent readability formulae have been used, but these formulae take few text characteristics into account. However, Natural Language Processing (NLP) tools that assess the complexity of texts are able to measure more features and can be adapted to different languages. In this paper, we present the MultiAzterTest tool: (i) an open source NLP tool which analyzes texts on over 125 measures of cohesion, language, and readability for English, Spanish and Basque, but whose architecture is designed to be easily adapted to other languages; (ii) readability assessment classifiers that improve the performance of Coh-Metrix in English, Coh-Metrix-Esp in Spanish and ErreXail in Basque; (iii) a web tool. MultiAzterTest obtains 90.09 % in accuracy when classifying into three reading levels (elementary, intermediate, and advanced) in English and 95.50 % in Basque and 90 % in Spanish when classifying into two reading levels (simple and complex) using a SMO classifier. Using cross-lingual features, MultiAzterTest also obtains competitive results above all in a complex vs simple distinction.
... Stylometric techniques, measuring the distribution of linguistic phenomena, essentially deal with two types of issues (Tuldava, 2004, p. 141): individual or functional styles and authorship identification. Recently, a body of stylometric investigations has flourished (e.g., Kernot et al., 2019; Melka & Místecký, 2020). Corpus stylistics has previously been defined as applying corpus linguistic methods to the analysis of literary texts (Mahlberg, 2014). ...
... For quantitative linguists in stylometrics; first, going through stylometric analyses, we would find that their research objects are not restricted to literature, either, ranging from authorial style of literary writers to that of political figures (Kernot et al., 2019; Kubát & Čech, 2016; Melka & Místecký, 2020). Similar to what the authors have suggested in terms of corpus stylistics, this indicates the vigour of stylometrics as an independent discipline of study. ...
... Equipped with diverse quantitative techniques, quantitative linguistics concerns itself with various phenomena, structures and structural properties of language in order to discover the governing laws and driving forces behind these phenomena and dynamics of language evolution (Liu, 2017). It has focused extensively on synchronic variability of linguistic features in stylistic analysis (Liu & Xiao, 2019; Melka & Místecký, 2019), genre analysis (Hou et al., 2014), authorship attribution (Chen et al., 2012), comparative analysis of different speakers (Y. Zhang, 2014), esp. ...
... Substantially dependent on the text-length (Kubát et al., 2014; Melka & Místecký, 2019), TTR is used to analyse longitudinally annual QCM (for the first research question), but not to compare QCM with BNC (for the second research question). The count states: ...
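For orientation, the type-token ratio is conventionally the number of distinct word forms (types) divided by the number of running words (tokens). The sketch below assumes that standard definition, since the formula in the excerpt is elided; the helper names `tokenize` and `type_token_ratio` and the very simple regular-expression tokenizer are illustrative choices, not the cited study's procedure.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (simplistic)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """TTR = number of types / number of tokens (standard definition, assumed)."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    sample = tokenize("The cat and the dog")
    print(type_token_ratio(sample))
```

Because frequent words recur more and more as a text grows, the ratio tends to fall with length, which is why it is only comparable across texts of similar size.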
... Borrowed from information theory, Shannon's entropy (H) measures uncertainty or diversity (Manning & Schütze, 1999; Liu, 2016; Shannon, 1948, 1951). In linguistics, entropy expresses the degree of vocabulary dispersion, also interpreted as its monotony (Kubát et al., 2014; Liu, 2017; Melka & Místecký, 2019). The smaller the H is, the more concentrated the vocabulary is and the less rich the vocabulary is. ...
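A minimal sketch of the measure, assuming the usual form H = −Σ p_i · log2(p_i) over the relative frequencies of word types; the base-2 logarithm and the function name `shannon_entropy` are assumptions made here for illustration, and the cited study may normalize differently.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the word-type distribution of a token list."""
    counts = Counter(tokens)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one token")
    # H = -sum(p * log2(p)) over the relative frequency p of each type
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    print(shannon_entropy(["a", "b", "c", "d"]))   # evenly spread vocabulary
    print(shannon_entropy(["a", "a", "a", "a"]))   # fully concentrated vocabulary
```

A token list that repeats a single word gives the minimum value, while one that spreads its tokens evenly over many types gives a high value, matching the reading of a small H as a concentrated vocabulary.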
Queen's English (QE), a linguistic symbol of the royal or upper class, is a particular variety or an aristocratic form of English. However, QE has been dethroned by a surprising finding that it shifted phonologically towards common people's English (CE) between the 1950s-1980s, arousing a debate on its existence. Based upon Queen's Christmas Messages (1952-2018) and BNC, this study quantitatively investigated whether QE has experienced diachronic changes and drifted towards CE. Our PCA analysis shows QE's fluctuating lexical richness, increasing lexical complexity and synthetism, and steady syntactic features during the six decades. Piecewise regression and statistical results indicate 1) QE is drifting towards CE in lexical richness and complexity between the 1950s-1980s; 2) QE exhibits an interaction between a "drifting force" and a "deviating force" towards or from CE between the 1950s-1980s in syntactic features; 3) QE maintains a synthetic form distinct from the analytical one of CE over the 66 years. These phenomena are likely related to the collapsing social structure between the 1950s-1980s, identity building in Queen's early reign and age factor. This study is the first to quantify the drift of QE towards CE lexically and syntactically, which may shed some light on quantitative investigation of diachronic language changes.
The study examined contemporary Chinese novels that depicted the social changes in rural and urban areas. We analyzed the texts using several quantitative methods of stylometry (lexical richness, activity, descriptivity, nominality, cluster analysis of the most frequent words) and corpus linguistics (keywords). We focused on the differences (both in style and in content) between contemporary fiction depicting life in Chinese rural and urban areas. The results revealed that rural stories are more dynamic (active), focusing more on the plot with a simpler description and a smaller vocabulary.
It has been qualitatively evidenced that Daojing and Dejing, the two sections of DaoDejing, are independent of each other in content, style, and authorship. However, this claim remains controversial because, owing to methodological challenges to quantitative approaches, there has been scant quantitative study that probes into their syntactic structure to evidence the independence. To address this problem, the present study employed two indexes of quantitative linguistics, activity and descriptivity, to capture the syntactic features of the inner structure of Daojing and Dejing and to test the claimed independence. Results of this study reveal that (1) the chi-square test shows that both sections are significantly active texts; (2) the u-test and nonparametric two-related-sample test show that the two texts do not differ from each other significantly either holistically or by the 300-character segments in terms of activity and descriptivity; and (3) the activity sequences (Q-sequence) in the two sections can be both better fitted by two different functions, viz. the beta-function and the Morse function. Moreover, with the development of the text, a synergic relationship is found in the distribution of verbs and adjectives, indicative of a dynamic and complex self-regulating process of the text development. To conclude, contrary to the claim of prior studies that they were written by different authors, our stylometric study has quantitatively demonstrated that Daojing and Dejing are not independent of each other stylistically, especially in terms of activity and descriptivity.
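The two indexes named in this abstract can be illustrated with a short sketch. It assumes the commonly used definitions, activity Q = V / (V + A) and descriptivity D = A / (V + A), where V and A are the counts of verbs and adjectives; the abstract does not spell these out, so both the formulas and the function name `activity_descriptivity` are assumptions, and part-of-speech tagging is left outside the example.

```python
def activity_descriptivity(verb_count, adjective_count):
    """Return (activity, descriptivity) from verb and adjective counts.

    Assumption: Q = V / (V + A) and D = A / (V + A), so Q + D = 1.
    """
    total = verb_count + adjective_count
    if verb_count < 0 or adjective_count < 0 or total == 0:
        raise ValueError("counts must be non-negative and not both zero")
    return verb_count / total, adjective_count / total


if __name__ == "__main__":
    q, d = activity_descriptivity(verb_count=30, adjective_count=10)
    print(q, d)
```

Under these definitions a text with Q above 0.5 has more verbs than adjectives and would be read as active rather than descriptive.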
This study investigates the extent to which English translations of Chinese Wuxia fiction and Western heroic literature in modern English are stylistically similar through stylometric analyses. It adds to literary translation research by highlighting possible stylistic connections between heroic literature in the East and that in the West, clues that may help understand the current reception of Wuxia translations. It also contributes to stylometric studies by introducing the stylistic panorama, a novel concept proposed to describe the stylistic picture of a (translated) text in a relatively holistic and functional way. Examining six English translations of Wuxia novels and twelve chivalric stories and heroic fantasies in modern English, the study finds that the Wuxia translations differ from the two Western subgenres in stylistic panoramas built by formal features (dispersion of word lengths and average sentence length), as well as the most frequent words (MFW) and the MFW-sequences. Such differences have foregrounded the unique stylistic features (richer Wuxia-specific vocabularies and shorter paragraph lengths) of these translations, which has contributed in part to their favorable reception among English-speaking readers. It is hoped that this study will encourage new applications for the concept of stylistic panoramas in future stylometric studies.
The issue of authorship attribution has long been considered and continues to be a popular topic. Because of advances in digital computers, this field has experienced rapid developments in the last decade. In this article, a survey of recent advances in authorship attribution in text mining is presented. This survey focuses on authorship attribution methods that are statistically or computationally supported as opposed to traditional literary approaches. The main aspects covered include the changes in research topics over time, basic feature metrics, machine learning techniques, and the advantages and disadvantages of each approach. Moreover, the corpus size, number of candidates, data imbalance, and result description, all of which pose challenges in authorship attribution, are discussed to inform future work. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Text Mining >Authorship attribution in text mining.
The goal of this chapter is to determine stylometric features and keywords of the selected texts produced by the candidates for the 2018 Czech presidential election, and to interpret whether these may have had any impact upon the final results. The stylometric indexes researched include MATTR (moving-average type-token ratio), ATL (average token length), TC (thematic concentration), STC (secondary thematic concentration), Q (activity), and VD (verb distances); finally, a keyword analysis for two chosen candidatesʼ programmes is carried out. The outcomes of the analyses show that each candidate adopts a special strategy to influence his electorate and that this strategy can be captured via stylometric methods.
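Of the indexes listed, MATTR lends itself to a compact illustration: a window of fixed size is moved through the text one token at a time, the type-token ratio is computed in each window, and the ratios are averaged. The sketch below assumes that standard procedure; the function name `mattr` and the default window of 50 tokens are arbitrary illustrative choices, not settings reported in the chapter.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio (illustrative sketch).

    Assumption: the window slides by one token and each window's TTR
    is its number of distinct tokens divided by the window size.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")

    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(mattr(["a", "a", "b", "b"], window=2))
```

Because every window has the same length, the resulting value is far less sensitive to overall text size than the plain type-token ratio.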