Article

A Queuing-Theory Model of Word Frequency Distributions

Authors: Munro

Abstract

This paper describes a novel model for term frequency distributions that is derived from queuing theory. It is compared with Poisson distributions in terms of how well the models describe the observed distributions of terms, and it is demonstrated that a model for term frequency distributions based on queue utilisation generally gives a better fit. It is further demonstrated that the ratio of the fit/error between the Poisson and queue utilisation distributions may be used as a good indication of how interesting a word is in relation to the topic of the discourse. A number of possible reasons for this are discussed.
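The full text is not available here, so the exact form of the queue-utilisation model is not reproduced. As a rough illustration of the comparison the abstract describes, the sketch below (Python) fits a term's per-document counts with a Poisson distribution and, as an assumed stand-in for the queue-utilisation model, with the geometric distribution that describes the steady-state M/M/1 queue length; the log-likelihood difference then plays the role of the fit ratio used to score how topical a word is. Function and variable names are illustrative.

    # Illustrative sketch only: the geometric (M/M/1 queue-length) form is an
    # assumption standing in for the paper's queue-utilisation model.
    import math

    def poisson_loglik(counts):
        lam = sum(counts) / len(counts)              # MLE of the Poisson rate
        if lam == 0:
            return 0.0
        return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

    def utilisation_loglik(counts):
        mean = sum(counts) / len(counts)
        rho = mean / (1.0 + mean)                    # MLE of the "utilisation" parameter
        if rho == 0:
            return 0.0
        return sum(math.log(1 - rho) + k * math.log(rho) for k in counts)

    def keyword_score(term, documents):
        """How much better the queue-style model fits than the Poisson."""
        counts = [doc.count(term) for doc in documents]   # per-document frequencies
        return utilisation_loglik(counts) - poisson_loglik(counts)

    docs = [t.split() for t in ("the queue fills and the queue drains",
                                "the model of the queue", "a poisson day")]
    print(keyword_score("queue", docs))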


... The fitting of various models to real distributions of words in natural language texts was studied, e.g., in [1]. An interesting queuing-theory model of word frequency distributions was developed by Munro in [19]. Özbey et al. [21] proposed a framework to explain the nonlinear behavior of low frequencies on the log-log scale. ...
Article
Full-text available
Analysis of the shape of a Laplacian spectrogram is a new line of research in graph spectral clustering. More precisely, we observed that (properly normalized) plots of the eigenvalues of sub-Laplacians characterizing different groups of documents differ in their shape. Thus, by computing the distance between these plots, we can solve the problem of clustering and of classifying new observations. This idea is developed in a number of our papers and, as such, can be considered a pioneering approach to cluster analysis. In an attempt to answer why it is so useful, in this paper we consider the hypothesis that the shape of a spectrogram can be attributed to the writing style of the authors of the document group in the cluster. We explore this hypothesis for several models of word distribution. In particular, we assume that the writing style is reflected in the word distribution of the texts of an author or a group of authors. We check whether changing the distribution parameters of the widely accepted log-normal word distribution model in fact changes the Laplacian eigenvalue spectrogram in such a way as to distinguish between document groups. We found that variation of each of the distribution parameters indeed leads to distinct groups of documents. These findings justify the use of Laplacian spectrograms to distinguish (cluster or classify) groups of documents.
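The abstract does not spell out how a group's Laplacian spectrogram is built; the sketch below assumes one common construction (a cosine-similarity graph over a group's bag-of-words vectors and the eigenvalues of its symmetric normalized Laplacian) and compares two groups by resampling their sorted spectra onto a common grid. All names and the distance measure are illustrative.

    # Assumed construction, for illustration only: spectrum of the symmetric
    # normalized Laplacian of a cosine-similarity graph over one document group.
    import numpy as np

    def spectrogram(doc_term_matrix):
        norms = np.maximum(np.linalg.norm(doc_term_matrix, axis=1, keepdims=True), 1e-12)
        x = doc_term_matrix / norms
        w = x @ x.T                                  # cosine similarities as edge weights
        np.fill_diagonal(w, 0.0)
        d = np.maximum(w.sum(axis=1), 1e-12)
        lap = np.eye(len(w)) - (w / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
        return np.sort(np.linalg.eigvalsh(lap))

    def spectrogram_distance(spec_a, spec_b, points=50):
        """Mean absolute difference between spectra resampled to a common grid."""
        grid = np.linspace(0.0, 1.0, points)
        fa = np.interp(grid, np.linspace(0.0, 1.0, len(spec_a)), spec_a)
        fb = np.interp(grid, np.linspace(0.0, 1.0, len(spec_b)), spec_b)
        return float(np.abs(fa - fb).mean())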
... Lexical features are based on the employed vocabulary and its richness, and on statistical information about the linguistic elements present in a text sample. They can specify averages, frequencies of occurrence [44], or distributions of selected elements [45], which explains their inherently continuous nature. Within this group, character features are distinguished, which refer to single characters (letters and digits) or their groups, of either fixed or varied length (but not necessarily forming words [46]). ...
Article
Full-text available
Typically, discretisation procedures are implemented as part of the initial pre-processing of data, before knowledge mining is employed. This means that conclusions and observations are based on reduced data, as discretisation usually discards some information. The paper presents a different approach, taking advantage of discretisation executed after data mining. In the described study, decision rules were first induced from real-valued features. Secondly, the data sets were discretised. In the third step, using the categories found for the attributes, the conditions included in the inferred rules were translated into the discrete domain. The properties and performance of the rule classifiers were tested in the domain of stylometric analysis of texts, where writing styles were defined through quantitative attributes of a continuous nature. The performed experiments show that the proposed processing leads to sets of rules with significantly reduced sizes while maintaining the quality of predictions, and allows many data discretisation methods to be tested at acceptable computational cost.
... Features describing styles need to refer to elements that are not easily imitated or common to many authors, and that reflect individual linguistic preferences, whether conscious or subconscious, observable in many samples [3,34]. Typically, either lexical or syntactic descriptors are exploited: the former provide statistical characteristics such as average word length, average sentence length, frequencies of usage for characters, words or phrases, and distributions of all these averages and frequencies [44], while syntactic markers refer to punctuation marks and the way in which they organise the structure of the text into units of sentences and paragraphs [7]. These descriptors need to be calculated over many examples, using a sufficiently wide corpus; otherwise they would be unreliable [39]. ...
Article
Full-text available
The performance of a classification system of any type can suffer from irrelevant or redundant data contained in the characteristic features that describe objects of the universe. To estimate the relevance of attributes and select their subset for a constructed classifier, typically either a filter, wrapper, or embedded approach is implemented. The paper presents a combined wrapper framework in which, in a pre-processing step, a ranking of variables is established by a simple wrapper model employing a sequential backward search procedure. Next, another predictor exploits this resulting ordering of features in their reduction. The proposed methodology is illustrated firstly for a binary classification task of authorship attribution from the stylometric domain, and then, for additional verification, for a waveform dataset from the UCI machine learning repository.
... Collocation trends and ratios of up to three terms were collected, but were found to give only a slight improvement in the overall results. In (Munro, 2003a) it was demonstrated that a model of term self-collocation across documents is well described in terms of queuing theory, and that the relationship between a queuing-theory and a Poisson model is a good indication of function. An attribute representing this relationship also increased the overall accuracy, especially that of the Classifiers. ...
Article
This thesis describes a methodology for the computational learning and classification of a Systemic Functional Grammar. A machine learning algorithm is developed that allows the structure of the classifier learned to be a representation of the grammar. Within Systemic Functional Linguistics, Systemic Functional Grammar is a model of language that has explicitly probabilistic distributions and overlapping categories. Mixture modeling is the most natural way to represent this, so the algorithm developed is one of the few machine learners that extends mixture modeling to supervised learning, retaining the desirable property that it is also able to discover intrinsic unlabelled categories. As a Systemic Functional Grammar includes theories of context, syntax, semantics, function and lexis, it is a particularly difficult concept to learn, and this thesis presents the first attempt to infer and apply a truly probabilistic Systemic Functional Grammar. Because of this, the machine learning algorithm is benchmarked against a collection of state-of-the-art learners on some well-known data sets. It is shown to be comparably accurate and particularly good at discovering and exploiting attribute correlation, and in this way it can also be seen as a linearly scalable solution to the Naïve Bayes attribute independence assumption. With a focus on function at the level of form, the methodology is shown to infer an accurate functional grammar that classifies with above 90% accuracy, even across registers of text that are fundamentally very different from the one that was learned on. The discovery of unlabelled functions occurred with a high level of sophistication, and so the proposed methodology has very broad potential as an analytical and/or classification tool in a functional approach to Computational Linguistics and Natural Language Processing.
Chapter
Estimation of relevance for attributes can be gained by means of their ranking, which, by calculated weights, puts variables into a specific order. A ranking of features can be exploited not only at the stage of data pre-processing, but also in post-processing exploration of the properties of obtained solutions. The chapter is dedicated to research on weighting condition attributes and decision rules inferred within the Classical Rough Set Approach, based on a ranking and the numbers of intervals found for features during supervised discretisation. The rule classifiers tested were employed within the stylometric analysis of texts for the task of binary authorship attribution with balanced data.
Article
Computational stylistics, or stylometry, is the study of writing styles. Through linguistic analysis it yields observations on stylistic characteristics of authors, expressed in terms of quantifiable measures. These measures can be exploited for the characterisation of writers, for finding similarities and differentiating features amongst their styles, for authorship attribution, and for recognition of documents based not on their topic, which is so common, but on their style. Stylistic analysis belongs to text mining, data mining, and information retrieval, but also to pattern recognition [4].
Article
Constructing a set of characteristic features for supervised classification is a task which can be considered preliminary to the intended purpose, just a step to take on the way; yet, given its significance and bearing on the outcome, and the level of difficulty and computational costs involved, the problem has evolved over time to constitute a field of intense study in itself. We can use statistics, available expert domain knowledge, or specialised procedures; we can analyse the set of all accessible features and reduce it backward, or we can examine features one by one and select them forward. The process of sequential selection can be conditioned by the performance of a classification system, while exploiting a wrapper model, and the observations with respect to the selected variables can result in the assignment of weights and a ranking. The chapter illustrates the weighting of features with the procedures of sequential backward and forward selection for rule and connectionist classifiers employed in the stylometric task of authorship attribution.
Chapter
Computational stylistics focuses on a description and quantifiable expression of the linguistic styles of written documents and their authors that enable their characterisation, comparison, and attribution. Characterisation of a text and its author can yield information about educational experiences and social background, but also about the author's gender, which can be exploited within the automatic categorisation of texts. This is an example of a classification task with uncertain and incomplete knowledge. Therefore, techniques from the artificial intelligence area are particularly well suited to handle the problem. The paper presents research on the application of an ANN-based classifier in recognition of the author's gender for literary texts, with some considerations on the performance of the classifier when a reduction of characteristic features based on elements of frequency analysis is attempted.
Keywords: computational stylistics, text mining, text categorisation, feature selection, ANN classifier
Conference Paper
Full-text available
Repetition is very common. Adaptive language models, which allow probabilities to change or adapt after seeing just a few words of a text, were introduced in speech recognition to account for text cohesion. Suppose a document mentions Noriega once. What is the chance that he will be mentioned again? If the first instance has probability p, then under standard (bag-of-words) independence assumptions, two instances ought to have probability p², but we find the probability is actually closer to p/2. The first mention of a word obviously depends on frequency, but surprisingly, the second does not. Adaptation depends more on lexical content than frequency; there is more adaptation for content words (proper nouns, technical terminology and good keywords for information retrieval), and less adaptation for function words, clichés and ordinary first names.
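A minimal sketch of the estimate discussed above: p is taken as the fraction of documents mentioning a word at least once, and the empirical chance of a second mention is compared with the independence prediction p² and the observed adaptation rate. Function and variable names are illustrative.

    # Sketch of the adaptation estimate: compare Pr(x >= 2) against p*p and p/2.
    def adaptation(word, documents):
        df1 = sum(1 for doc in documents if doc.count(word) >= 1)
        df2 = sum(1 for doc in documents if doc.count(word) >= 2)
        n = len(documents)
        p = df1 / n
        return {
            "p": p,                                        # Pr(at least one mention)
            "Pr(x>=2)": df2 / n,                           # observed two-or-more rate
            "independence p^2": p * p,                     # bag-of-words prediction
            "adaptation Pr(x>=2|x>=1)": df2 / df1 if df1 else 0.0,
        }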
Article
Full-text available
The paper combines a comprehensive account of the probabilistic model of retrieval with new systematic experiments on TREC Programme material. It presents the model from its foundations through its logical development to cover more aspects of retrieval data and a wider range of system functions. Each step in the argument is matched by comparative retrieval tests, to provide a single coherent account of a major line of research. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations. Part 1 covers the foundations and the model development for document collection and relevance data, along with the test apparatus. Part 2 covers the further development and elaboration of the model, with extensive testing, and briefly considers other environment conditions and tasks, and model training, concluding with comparisons with other approaches and an overall assessment. Data and results tables for both parts are given in Part 1. Key results are summarised in Part 2.
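The best-known weighting to come out of this line of probabilistic-model work is the Okapi BM25 function; a compact version is sketched below as background, with conventional default parameters (k1, b) rather than the settings of the experiments reported in the paper.

    # BM25 sketch (background to the probabilistic model; defaults k1=1.2, b=0.75
    # are conventional, not necessarily those used in the reported experiments).
    import math
    from collections import Counter

    def bm25_score(query_terms, doc_tokens, doc_freqs, n_docs, avg_len, k1=1.2, b=0.75):
        tf = Counter(doc_tokens)
        score = 0.0
        for t in query_terms:
            df = doc_freqs.get(t, 0)
            if df == 0 or t not in tf:
                continue
            idf = math.log((n_docs - df + 0.5) / (df + 0.5))      # relevance weight
            length_norm = 1 - b + b * len(doc_tokens) / avg_len
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        return score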
Article
Full-text available
Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a "bag-of-words" assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Gamma distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
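As a small illustration of the comparison described above, the sketch below fits a single Poisson and a Negative Binomial (the Gamma-mixture special case) to a word's per-document counts and reports both log-likelihoods; method-of-moments estimation is used for brevity and is not necessarily the fitting procedure of the paper.

    # Single Poisson vs. Negative Binomial (Gamma mixture of Poissons), fitted by
    # method of moments for brevity; an illustration, not the paper's procedure.
    import math

    def poisson_ll(counts):
        lam = sum(counts) / len(counts)
        return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

    def negbin_ll(counts):
        n = len(counts)
        m = sum(counts) / n
        v = sum((k - m) ** 2 for k in counts) / n
        if v <= m:                             # no overdispersion: same as Poisson
            return poisson_ll(counts)
        r = m * m / (v - m)                    # shape of the Gamma mixing density
        p = r / (r + m)
        return sum(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                   + r * math.log(p) + k * math.log(1 - p) for k in counts)

    counts = [0, 0, 0, 1, 0, 7, 2, 0, 0, 5]    # toy per-document frequencies
    print(poisson_ll(counts), negbin_ll(counts))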
Article
Full-text available
We describe here an algorithm for detecting subject boundaries within text based on a statistical lexical similarity measure.
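The paper's statistical similarity measure is not reproduced here; as a generic illustration of the idea, the sketch below computes cosine similarity between adjacent fixed-size blocks of tokens and treats the deepest similarity valleys as candidate subject boundaries.

    # Generic illustration only (not the paper's measure): low cosine similarity
    # between neighbouring token blocks suggests a possible subject boundary.
    import math
    from collections import Counter

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in a if t in b)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    def boundary_scores(tokens, block=100):
        blocks = [Counter(tokens[i:i + block]) for i in range(0, len(tokens), block)]
        return [1.0 - cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]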
Article
Every day, millions of people use the internet to answer questions. Unfortunately, at present, there is no simple and successful means to consistently accomplish this goal. One common approach is to enter a few terms from a question into a Web search system and scan the resulting pages for the answer, a laborious process. To address this need, a question answering (QA) system was created to find and extract answers from a corpus. This system contains three parts: a parser for generating question queries and categories, a passage retrieval element, and an information extraction (IE) component. The extraction method was designed to elicit answers from passages collected by the information retrieval engine. The subject of this paper is the information extraction component. It is based on the premise that information related to the answer will be found many times in a large corpus like the Web.
Article
This paper considers the pattern of occurrences of words in text as part of an attempt to develop formal rules for identifying those indicative of content and thereby suitable for use as index terms. A probabilistic model was proposed which, with a suitable fitting of parameters, could account for the occupancy distribution of most words, both index terms and non-index terms. The parameters take quite different values for the two classes. In this model each abstract was considered to receive word occurrences in a Poisson process. Abstracts can then be divided into classes, such that all abstracts within a given class receive word occurrences at the same average rate. The appearance of a particular number of occurrences of some word within an abstract then serves to give information, in a Bayesian sense, on the class membership of that abstract. It is of central interest to determine the minimum number of classes that can account for the occupancy distribution of each word. Though more testing needs to be done, it may be concluded that the distribution of a very large majority of words can be accounted for by assuming three or fewer classes.
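To make the class idea concrete, the sketch below fits a K-class Poisson mixture to a word's per-abstract counts with EM; fitting K = 1, 2, 3 and comparing the resulting likelihoods (e.g. by an information criterion) is one way to ask how many classes a word's occupancy distribution needs. The estimation details are illustrative and not taken from the paper.

    # EM for a K-class Poisson mixture over per-abstract counts of one word.
    # Illustrative only; the paper's own estimation procedure is not reproduced.
    import math

    def poisson_pmf(k, lam):
        if lam <= 0:
            return 1.0 if k == 0 else 0.0
        return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

    def fit_poisson_mixture(counts, k=3, iters=200):
        n = len(counts)
        top = max(counts) or 1
        lams = [top * (j + 1) / (k + 1) + 1e-6 for j in range(k)]   # spread initial rates
        weights = [1.0 / k] * k
        for _ in range(iters):
            # E-step: responsibility of each class for each count
            resp = []
            for c in counts:
                probs = [weights[j] * poisson_pmf(c, lams[j]) for j in range(k)]
                total = sum(probs) or 1e-300
                resp.append([p / total for p in probs])
            # M-step: re-estimate mixing weights and per-class rates
            for j in range(k):
                rj = sum(r[j] for r in resp)
                weights[j] = rj / n
                lams[j] = sum(r[j] * c for r, c in zip(resp, counts)) / (rj or 1e-300)
        return weights, lams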
Article
In an earlier study, we presented a query key goodness scheme, which can be used to separate good query keys from bad ones. The scheme is based on the relative average term frequency (RATF) values of query keys. In the present paper, we tested the effectiveness of the scheme in Finnish-to-English cross-language retrieval in several experiments. Query keys were weighted and queries were reduced based on the RATF values of keys. The tests were carried out in TREC and CLEF document collections using the InQuery retrieval system. The TREC tests indicated that the best RATF-based queries delivered substantial and statistically significant performance improvements, and performed as well as syn-structured queries, which have been shown to be effective in many CLIR studies. The CLEF tests indicated the limitations of the use of RATF in CLIR. However, the best RATF-based queries also performed better than baseline queries in the CLEF collection.
Article
The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
Article
Automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods. In this paper, we present a simple rule-based part of speech tagger which automatically acquires its rules and tags with accuracy comparable to stochastic taggers. The rule-based tagger has many advantages over these taggers, including: a vast reduction in stored information required, the perspicuity of a small set of meaningful rules, ease of finding and implementing improvements to the tagger, and better portability from one tag set, corpus genre or language to another. Perhaps the biggest contribution of this work is in demonstrating that the stochastic method is not the only viable method for part of speech tagging. The fact that a simple rule-based tagger that automatically learns its rules can perform so well should offer encouragement for researchers to further explore rule-based tagging, searching for a better and more expressive set of rule templates and other variations on the simple but effective theme described below.
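A greatly simplified sketch of the transformation-based learning idea: start from a most-frequent-tag lexicon baseline, then greedily learn rules of the single template "change tag A to B when the previous tag is Z", each scored by its net error reduction on a hand-tagged training sequence. Names are illustrative, and the real tagger uses a much richer set of rule templates.

    # Greatly simplified transformation-based (error-driven) learning with one
    # rule template: "change tag A to B when the previous tag is Z".
    def baseline_tags(words, lexicon, default="NN"):
        return [lexicon.get(w, default) for w in words]

    def learn_rules(words, gold, lexicon, max_rules=10):
        tags = baseline_tags(words, lexicon)
        rules = []
        for _ in range(max_rules):
            fixes, breaks = {}, {}
            for i in range(1, len(words)):
                a, z = tags[i], tags[i - 1]
                if a != gold[i]:                       # rule (a -> gold[i] | prev z) fixes this
                    fixes[(a, gold[i], z)] = fixes.get((a, gold[i], z), 0) + 1
                else:                                  # any rule rewriting a here breaks it
                    breaks[(a, z)] = breaks.get((a, z), 0) + 1
            best, best_score = None, 0
            for (a, b, z), f in fixes.items():
                score = f - breaks.get((a, z), 0)      # net error reduction
                if score > best_score:
                    best, best_score = (a, b, z), score
            if best is None:
                break
            a, b, z = best
            for i in range(1, len(words)):             # apply the rule, then learn the next
                if tags[i] == a and tags[i - 1] == z:
                    tags[i] = b
            rules.append(best)
        return rules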
Numerical Recipes in C
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical Recipes in C. Cambridge University Press, 2nd edition.