Article

ITRI-02-08: Measuring corpus homogeneity using a range of measures for inter-document distance

Authors: Gabriela Cavaglia

Abstract

With the ever more widespread use of corpora in language research, it is becoming increasingly important to be able to describe and compare corpora. The analysis of corpus homogeneity is preliminary to any quantitative approach to corpus comparison. We describe a method for text analysis based only on document-internal linguistic features, and a set of related homogeneity measures based on inter-document distance. We present a preliminary experiment to validate the hypothesis that a homogeneous corpus requires a smaller training subcorpus for an NLP system than a heterogeneous corpus does.
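
Although the report's exact feature set and distance formula are not given here, a minimal sketch of the general idea, assuming simple bag-of-words features and a χ²-style distance between word-frequency vectors, might look as follows (the function names and the toy corpus are illustrative only, not taken from the report):

```python
from collections import Counter
from itertools import combinations

def word_freqs(text):
    """Document-internal features: here, just a bag-of-words count."""
    return Counter(text.lower().split())

def chi2_distance(f1, f2):
    """A chi-square-style distance between two word-frequency vectors."""
    n1, n2 = sum(f1.values()), sum(f2.values())
    dist = 0.0
    for w in set(f1) | set(f2):
        o1, o2 = f1[w], f2[w]
        # expected counts if both documents drew from the same distribution
        e1 = (o1 + o2) * n1 / (n1 + n2)
        e2 = (o1 + o2) * n2 / (n1 + n2)
        dist += (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
    return dist

def homogeneity(docs):
    """One possible homogeneity score: the mean pairwise inter-document
    distance (a lower mean distance = a more homogeneous corpus)."""
    freqs = [word_freqs(d) for d in docs]
    pairs = list(combinations(freqs, 2))
    return sum(chi2_distance(a, b) for a, b in pairs) / len(pairs)

corpus = ["the cat sat on the mat",
          "the dog sat on the rug",
          "stock prices fell sharply in early trading"]
print(homogeneity(corpus))
```
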

... Kilgarriff and Rose (1998) compare Spearman's S with χ². Cavaglia (2002) uses relative entropy, χ² and G². We base our measure on χ² because it performs well in comparative experiments (Cavaglia 2002; Rose and Haddock 1997), as long as each of the individual frequency values is greater than or equal to 5 and the sample size is large enough (Dunning 1993). On the other hand, our aim of testing the homogeneity hypothesis requires a more fine-grained tool than reporting the χ² statistic as a homogeneity measure. ...
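
As a concrete illustration of the constraint quoted above, here is a hedged sketch of a two-corpus χ² comparison that simply drops any word whose expected count falls below 5; the helper name and the threshold handling are mine, not the cited authors'.

```python
from collections import Counter

def chi2_corpus_comparison(tokens_a, tokens_b, min_expected=5):
    """Chi-square over shared word types, ignoring words whose expected
    counts violate the >= 5 validity condition mentioned above."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    na, nb = sum(fa.values()), sum(fb.values())
    chi2, used = 0.0, 0
    for w in set(fa) | set(fb):
        total = fa[w] + fb[w]
        ea = total * na / (na + nb)   # expected count in corpus A
        eb = total * nb / (na + nb)   # expected count in corpus B
        if ea < min_expected or eb < min_expected:
            continue                  # too rare for the chi-square approximation
        chi2 += (fa[w] - ea) ** 2 / ea + (fb[w] - eb) ** 2 / eb
        used += 1
    return chi2, used                 # statistic and number of word types used

# Invented token lists, repeated so that expected counts clear the threshold
stat, n_types = chi2_corpus_comparison("the cat sat on the mat".split() * 20,
                                       "the dog ran in the park".split() * 20)
print(stat, n_types)
```
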
Article
Full-text available
We motivate the need for dataset profiling in the context of evaluation, and show that textual datasets differ in ways that challenge assumptions about the applicability of techniques. We set out some criteria for useful profiling measures. We argue that distribution patterns of frequent words are useful in profiling genre, and report on a series of experiments with χ²-based measures on the TIPSTER collection, and on textual intranet data. Findings show substantial differences in the distribution of very frequent terms across datasets.
... There have been various approaches to corpus comparison (e.g. Dunning, 1993; Rose and Kilgarriff, 1998; Cavaglia, 2002; McInnes, 2004). We compare our result with more common approaches, including chi-square (χ²), as recommended by Kilgarriff (2001), and log-likelihood ratio (LLR), as recommended by Rayson and Garside (2000). ...
... Habert and colleagues (Folch et al., 2000; Beaudouin et al., 2001) have been developing a workstation for specifying subcorpora according to text type, using Biber-style analyses amongst others. In Kilgarriff (2001) we present a first pass at quantifying similarity between corpora and Cavaglia (2002) continues this line of work. As mentioned above, Sekine (1997) and Gildea (2001) are two papers which directly address the relation between NLP systems and text type; one further such item is (Roland et al., 2000). ...
Article
Full-text available
The web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is a fabulous linguists' playground. The Special Issue explores ways in which this dream is being realised.
Chapter
Collaborative learning has grown more popular as a form of instruction in recent decades, with a significant number of studies demonstrating its benefits from many perspectives of theory and methodology. However, it has also been demonstrated that effective collaborative learning does not occur spontaneously without orchestrating collaborative learning groups according to the provision of favourable group criteria. Researchers have investigated different foundations and strategies to form such groups. However, the group criteria semantic information, which is essential for classifying groups, has not been explored. To capture the group criteria semantic information, we propose a novel Natural Language Processing (NLP) approach, namely using pre-trained word embeddings. Through our approach, we could automatically form homogeneous and heterogeneous collaborative learning groups based on students' knowledge levels expressed in assessments. Experiments utilising a dataset from a university programming course are used to assess the performance of the proposed approach.
Chapter
Today, collaborative learning has become quite central as a method for learning, and over the past decades, a large number of studies have demonstrated the benefits from various theoretical and methodological perspectives. This study proposes a novel approach that utilises Natural Language Processing (NLP) methods, particularly pre-trained word embeddings, to automatically create homogeneous or heterogeneous groups of students in terms of knowledge and knowledge gaps expressed in assessments. The two different ways of creating groups serve two different pedagogical purposes: (1) homogeneous group formation based on students’ knowledge can support and make teachers’ pedagogical activities such as feedback provision more time-efficient, and (2) the heterogeneous groups can support and enhance collaborative learning. We evaluate the performance of the proposed approach through experiments with a dataset from a university course in programming didactics.
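
Neither chapter abstract spells out the grouping algorithm, so the following is only a sketch under strong assumptions: each student's assessment answer is represented by the average of pre-trained word vectors, and pairs are formed greedily by maximal (homogeneous) or minimal (heterogeneous) cosine similarity. The toy embedding table, student names and answers are invented for illustration; in practice the vectors would come from a pre-trained model such as word2vec or GloVe.

```python
import numpy as np

# Hypothetical pre-trained embeddings (tiny, hand-made vectors for illustration).
EMB = {
    "loop":      np.array([0.9, 0.1, 0.0]),
    "recursion": np.array([0.8, 0.2, 0.1]),
    "array":     np.array([0.7, 0.3, 0.0]),
    "poetry":    np.array([0.0, 0.1, 0.9]),
    "grammar":   np.array([0.1, 0.2, 0.8]),
}

def answer_vector(answer):
    """Average the embeddings of the known words in a student's answer."""
    vecs = [EMB[w] for w in answer.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def form_pairs(answers, homogeneous=True):
    """Greedily pair students: most similar answers for homogeneous groups,
    least similar answers for heterogeneous groups."""
    vecs = {s: answer_vector(a) for s, a in answers.items()}
    remaining, pairs = set(vecs), []
    while len(remaining) > 1:
        s = remaining.pop()
        pick = (max if homogeneous else min)(
            remaining, key=lambda t: cosine(vecs[s], vecs[t]))
        remaining.remove(pick)
        pairs.append((s, pick))
    return pairs

answers = {"ana": "loop array", "bo": "recursion loop",
           "cy": "poetry grammar", "di": "grammar"}
print(form_pairs(answers, homogeneous=True))    # knowledge-alike pairs
print(form_pairs(answers, homogeneous=False))   # knowledge-diverse pairs
```
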
Article
The statistical NLP and IR literatures tend to make a "homogeneity assumption" about the distribution of terms, either by adopting a "bag of words" model, or in their treatment of function words. In this paper we develop a notion of homogeneity detection to a level of statistical significance, and conduct a series of experiments on different datasets, to show that the homogeneity assumption does not generally hold. We show that it also does not hold for function words. Importantly, datasets and document collections are found not to be neutral with respect to the property of homogeneity, even for function words. The homogeneity assumption is defeated substantially even for collections known to contain similar documents, and more drastically for diverse collections. We conclude that it is statistically unreasonable to assume that word distribution within a corpus is homogeneous. Because homogeneity findings differ substantially between different collections, we argue for the use of homogeneity measures as a means of profiling datasets.
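
One plausible way to operationalise such a test (a sketch, not necessarily the authors' procedure) is to split a collection randomly into two halves and apply a χ² test over its most frequent words, rejecting homogeneity when the statistic exceeds the critical value at the chosen significance level:

```python
import random
from collections import Counter
from scipy.stats import chi2  # critical value at the chosen significance level

def homogeneity_test(docs, n_words=50, alpha=0.05, seed=0):
    """Randomly split the collection in half and compare frequent-word counts.
    Returns (statistic, critical value, True if homogeneity is rejected)."""
    rng = random.Random(seed)
    docs = docs[:]
    rng.shuffle(docs)
    half_a, half_b = docs[: len(docs) // 2], docs[len(docs) // 2 :]
    fa = Counter(w for d in half_a for w in d.lower().split())
    fb = Counter(w for d in half_b for w in d.lower().split())
    na, nb = sum(fa.values()), sum(fb.values())
    top = [w for w, _ in (fa + fb).most_common(n_words)]
    stat = 0.0
    for w in top:
        total = fa[w] + fb[w]
        ea, eb = total * na / (na + nb), total * nb / (na + nb)
        stat += (fa[w] - ea) ** 2 / ea + (fb[w] - eb) ** 2 / eb
    critical = chi2.ppf(1 - alpha, df=len(top) - 1)  # approximate df
    return stat, critical, stat > critical
```
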
Conference Paper
Full-text available
The basic statistical tools used in computational and corpus linguistics to capture distributional information have not changed much in the past 20 years even though many standard tools have been proved to be inadequate. In this demo (SMR-Cmp), we adopt the new tool of Square-Mean-Root (SMR) similarity, which measures the evenness of distribution between contrastive corpora, to extract lexical variations. The result based on one case study shows that the novel approach outperforms traditional statistical measures, including chi-square (χ²) and log-likelihood ratio (LLR). Keywords: Square-Mean-Root evenness, SMR similarity, corpus comparison, chi-square, log-likelihood ratio.
Article
In this work we study the influence of corpus homogeneity on corpus-based NLP system performance. Experiments are performed on both stochastic language models and an EBMT system translating from Japanese to English with a large bicorpus, in order to reassess the assumption that using only homogeneous data tends to make system performance go up. We describe a method to represent corpus homogeneity as a distribution of similarity coefficients based on a cross-entropic measure investigated in previous works. We show that beyond minimal sizes of training data the excessive elimination of heterogeneous data proves prejudicial in terms of both perplexity and translation quality: excessively restricting the training data to a particular domain may be prejudicial in terms of in-domain system performance, and heterogeneous, out-of-domain data may in fact contribute to better system performance.
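
The paper's similarity coefficient is cross-entropic; the sketch below only shows the general shape of such a measure, under the simplifying assumption of an add-one-smoothed unigram model (the authors' actual model and coefficient may differ):

```python
import math
from collections import Counter

def unigram_model(train_docs):
    """Add-one-smoothed unigram model estimated from the training subcorpus."""
    counts = Counter(w for d in train_docs for w in d.lower().split())
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(prob, doc):
    """Bits per word of the document under the model: lower = more similar."""
    words = doc.lower().split()
    return -sum(math.log2(prob(w)) for w in words) / len(words)

train = ["the yen rose against the dollar", "shares closed higher in tokyo"]
model = unigram_model(train)
print(cross_entropy(model, "the dollar fell against the yen"))  # closer to the training domain
print(cross_entropy(model, "the cat chased a ball of wool"))    # further from it
```
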
Article
Full-text available
The increasing use of methods in natural language processing (NLP) which are based on huge corpora requires that the lexical, morpho-syntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associated tools for text calibration or "profiling" within the ELRA benchmark called "Contribution to the construction of contemporary French corpora", based on multivariate analysis of linguistic features. We have integrated these tools within a modular architecture based on a generic model that allows us, on the one hand, to annotate the corpus flexibly with the output of NLP and statistical tools and, on the other hand, to trace the results of these tools back through the annotation layers to the primary textual data. This allows us to justify our interpretations.
Article
Full-text available
Corpus linguistics lacks strategies for describing and comparing corpora. Currently most descriptions of corpora are textual, and questions such as ‘what sort of a corpus is this?’, or ‘how does this corpus compare to that?’ can only be answered impressionistically. This paper considers various ways in which different corpora can be compared more objectively. First we address the issue, ‘which words are particularly characteristic of a corpus?’, reviewing and critiquing the statistical methods which have been applied to the question and proposing the use of the Mann-Whitney ranks test. Results of two corpus comparisons using the ranks test are presented. Then, we consider measures for corpus similarity. After discussing limitations of the idea of corpus similarity, we present a method for evaluating corpus similarity measures. We consider several measures and establish that a χ²-based one performs best. All methods considered in this paper are based on word and ngram frequencies; the strategy is defended.
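
As a worked illustration of the ranks-test idea (a sketch, not the paper's exact experimental setup): count a candidate word's frequency in equal-sized chunks of each corpus and compare the two samples with the Mann-Whitney U test; the counts below are invented.

```python
from scipy.stats import mannwhitneyu

# Frequency of one candidate word in ten equal-sized chunks of each corpus
# (invented counts, purely for illustration).
freq_in_corpus1_chunks = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]
freq_in_corpus2_chunks = [4, 6, 5, 7, 3, 6, 5, 4, 6, 5]

u_stat, p_value = mannwhitneyu(freq_in_corpus1_chunks, freq_in_corpus2_chunks,
                               alternative="two-sided")
print(u_stat, p_value)  # a small p-value suggests the word is characteristic of one corpus
```
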
Conference Paper
Full-text available
Very large corpora are increasingly exploited to improve Natural Language Processing (NLP) systems. This, however, implies that the lexical, morpho-syntactic and syntactic homogeneity of the data used are mastered. This control in turn requires the development of tools aimed at text calibration or profiling. We are implementing such profiling tools and developing an associated methodology within the ELRA benchmark named Contribution to the construction of corpora of contemporary French. The first ...
Article
Full-text available
Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and the statistical significance of results has not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately, rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.
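
For the composite-term (bigram) detection mentioned above, the likelihood-ratio statistic reduces to the familiar G² over a 2×2 contingency table; a small sketch with invented counts:

```python
import math

def g_squared(k11, k12, k21, k22):
    """Dunning-style log-likelihood ratio (G^2) for a 2x2 contingency table.
    For a bigram (w1, w2): k11 = count(w1 w2), k12 = count(w1 followed by
    something other than w2), k21 = count(not-w1 followed by w2),
    k22 = count of all remaining bigrams."""
    total = k11 + k12 + k21 + k22
    def term(observed, row_total, col_total):
        expected = row_total * col_total / total
        return observed * math.log(observed / expected) if observed > 0 else 0.0
    r1, r2 = k11 + k12, k21 + k22
    c1, c2 = k11 + k21, k12 + k22
    return 2 * (term(k11, r1, c1) + term(k12, r1, c2) +
                term(k21, r2, c1) + term(k22, r2, c2))

# Invented counts for an illustrative bigram in a small corpus:
print(g_squared(k11=30, k12=970, k21=570, k22=998430))
```
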
Article
Full-text available
As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. We propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties.
Article
Similarities and differences between speech and writing have been the subject of innumerable studies, but until now there has been no attempt to provide a unified linguistic analysis of the whole range of spoken and written registers in English. In this widely acclaimed empirical study, Douglas Biber uses computational techniques to analyse the linguistic characteristics of twenty-three spoken and written genres, enabling identification of the basic, underlying dimensions of variation in English. In Variation Across Speech and Writing, six dimensions of variation are identified through a factor analysis, on the basis of linguistic co-occurrence patterns. The resulting model of variation provides for the description of the distinctive linguistic characteristics of any spoken or written text and demonstrates the ways in which the polarization of speech and writing has been misleading, and thus enables reconciliation of the contradictory conclusions reached in previous research.
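
A minimal sketch of the multidimensional-analysis idea, assuming a handful of standardised per-text feature rates and an off-the-shelf factor analysis (this is not Biber's feature set or factor solution, and the numbers are invented):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Rows = texts, columns = per-1,000-word rates of a few illustrative features
# (say, first-person pronouns, nominalisations, passives, contractions).
X = np.array([
    [52.0,  8.0,  5.0, 30.0],   # conversation-like text
    [48.0, 10.0,  6.0, 28.0],
    [ 5.0, 40.0, 25.0,  1.0],   # academic-prose-like text
    [ 7.0, 38.0, 22.0,  2.0],
    [20.0, 20.0, 12.0, 10.0],
])

Z = StandardScaler().fit_transform(X)
fa = FactorAnalysis(n_components=2, random_state=0).fit(Z)
print(fa.components_)   # loadings: how each feature contributes to each dimension
print(fa.transform(Z))  # each text's score along the underlying dimensions
```
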
Article
Categorization of text in IR has traditionally focused on topic. As use of the Internet and e-mail increases, categorization has become a key area of research as users demand methods of prioritizing documents. This work investigates text classification by format style, i.e. "genre", and demonstrates, by complementing topic classification, that it can significantly improve retrieval of information. The paper compares use of presentation features to word features, and the combination thereof, using Naïve Bayes, C4.5 and SVM classifiers. Results show use of combined feature sets with SVM yields 92% classification accuracy in sorting seven genres.
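
A hedged sketch of combining word features with crude presentation features in an SVM classifier, in the spirit of the comparison described above (the feature functions, texts and genre labels are invented and far simpler than the paper's):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

def presentation_features(texts):
    """Crude formatting cues: length, uppercase ratio, digit ratio."""
    return np.array([[len(t),
                      sum(c.isupper() for c in t) / max(len(t), 1),
                      sum(c.isdigit() for c in t) / max(len(t), 1)]
                     for t in texts])

features = make_union(
    TfidfVectorizer(lowercase=True),              # word features
    FunctionTransformer(presentation_features),   # presentation features
)
clf = make_pipeline(features, LinearSVC())

texts = ["Dear Sir, please find attached the invoice for March.",
         "BREAKING: Team wins 3-1 in dramatic final!!!",
         "Abstract. We present a method for corpus comparison."]
labels = ["letter", "news", "paper"]
clf.fit(texts, labels)
print(clf.predict(["We describe an algorithm for text classification."]))
```
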
Article
The present study summarizes corpus-based research on linguistic characteristics from several different structural levels, in English as well as other languages, showing that register variation is inherent in natural language. It further argues that, due to the importance and systematicity of the linguistic differences among registers, diversified corpora representing a broad range of register variation are required as the basis for general language studies. First, the extent of cross-register differences is illustrated through consideration of individual grammatical and lexical features; these register differences are also important for probabilistic part-of-speech taggers and syntactic parsers, because the probabilities associated with grammatically ambiguous forms are often markedly different across registers. Then, corpus-based multidimensional analyses of English are summarized, showing that linguistic features from several structural levels function together as underlying dimensions of variation, with each dimension defining a different set of linguistic relations among registers. Finally, the paper discusses how such analyses, based on register-diversified corpora, can be used to address two current issues in computational linguistics: the automatic classification of texts into register categories and cross-linguistic comparisons of register variation.
Article
In this paper, an attempt is first made to clarify and tease apart the somewhat confusing terms genre, register, text type, domain, sublanguage, and style. The use of these terms by various linguists and literary theorists working under different traditions or orientations will be examined and a possible way of synthesising their insights will be proposed and illustrated with reference to the disparate categories used to classify texts in various existing computer corpora. With this terminological problem resolved, a personal project which involved giving each of the 4,124 British National Corpus (BNC, version 1) files a descriptive "genre" label will then be described. The result of this work, a spreadsheet/database (the "BNC Index") containing genre labels and other types of information about the BNC texts will then be described and its usefulness shown. It is envisaged that this resource will allow linguists, language teachers, and other users to easily navigate through or scan the huge BNC jungle more easily, to quickly ascertain what is there (and how much) and to make informed selections from the mass of texts available. It should also greatly facilitate genre-based research (e.g., EAP, ESP, discourse analysis, lexicogrammatical, and collocational studies) and focus everyday classroom concordancing activities by making it easy for people to restrict their searches to highly specified sub-sets of the BNC using PC-based concordancers such as WordSmith, MonoConc, or the Web-based BNCWeb.
Article
A major concern in corpus based approaches is that the applicability of the acquired knowledge may be limited by some feature of the corpus, in particular, the notion of text 'domain'. In order to examine the domain dependence of parsing, in this paper, we report 1) Comparison of structure distributions across domains; 2) Examples of domain specific structures; and 3) Parsing experiments using some domain dependent grammars. The observations using the Brown corpus demonstrate domain dependence and idiosyncrasy of syntactic structure. The parsing results show that the best accuracy is obtained using the grammar acquired from the same domain or the same class (fiction or nonfiction). We will also discuss the relationship between parsing accuracy and the size of training corpus.
Article
The probabilistic relation between verbs and their arguments plays an important role in modern statistical parsers and supertaggers, and in psychological theories of language processing. But these probabilities are computed in very different ways by the two sets of researchers. Computational linguists compute verb subcategorization probabilities from large corpora while psycholinguists compute them from psychological studies (sentence production and completion tasks). Recent studies have found differences between corpus frequencies and psycholinguistic measures. We analyze subcategorization frequencies from four different corpora: psychological sentence production data (Connine et al. 1984), written text (Brown and WSJ), and telephone conversation data (Switchboard). We find two different sources for the differences. Discourse influence is a result of how verb use is affected by different discourse types such as narrative, connected discourse, and single sentence productions. Semantic influence is a result of different corpora using different senses of verbs, which have different subcategorization frequencies. We conclude that verb sense and discourse type play an important role in the frequencies observed in different experimental and corpus based sources of verb subcategorization frequencies.
References

Frank Owen and Ronald Jones. 1977. Statistics. Polytech Publishers, Stockport, UK.

K. Hofland and S. Johansson. 1982. Word frequencies in British and American English. The Norwegian Computing Centre for the Humanities.
TyPTex: Generic features for text profiler. In Content-Based Multimedia Information Access (RIAO 2000), Paris, France.
D. Scott, N. Bouayad-Agha, R. Power, S. Schulz, R. Beck, D. Murphy, and R. Lockwood. 2001. Pills: A multilingual authoring system for patient information. In Vision of the Future and Lessons from the Past: Proceedings of the 2001 AMIA Annual Symposium, Washington, DC.