Article

Discovering planar segregations


Abstract

In this report I present an algorithm for finding planar segregations of phonemes for particular languages. This algorithm requires no domain-specific knowledge of phonology or phonetics. Despite this lack of knowledge, the implemented algorithm has identified the structurally significant segregations for thirty languages.
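The abstract does not spell out the procedure, but the citing contexts below indicate that the program discovers major segmental categories, typically consonants and vowels, from raw word lists. As a rough illustration of what knowledge-free category discovery can look like, here is a sketch of Sukhotin's classic vowel-identification algorithm in Python. This is a different and simpler method than Ellison's MDL-based one; it exploits only the tendency of the two categories to alternate within words, and the word list is invented toy data.

    from collections import defaultdict

    def sukhotin_vowels(words):
        """Knowledge-free consonant/vowel split (Sukhotin's algorithm).

        Builds a symmetric adjacency-count table over segments, then
        repeatedly promotes the segment with the highest remaining
        adjacency sum to vowel status, discounting its neighbours.
        """
        counts = defaultdict(int)
        symbols = set()
        for w in words:
            symbols.update(w)
            for a, b in zip(w, w[1:]):
                if a != b:                      # zero diagonal, as in the original
                    counts[(a, b)] += 1
                    counts[(b, a)] += 1
        sums = {s: sum(counts[(s, t)] for t in symbols) for s in symbols}
        vowels = set()
        while True:
            rest = [s for s in symbols if s not in vowels]
            if not rest:
                break
            v = max(rest, key=lambda s: sums[s])
            if sums[v] <= 0:                    # no alternation evidence left
                break
            vowels.add(v)
            for s in rest:
                if s != v:
                    sums[s] -= 2 * counts[(s, v)]
        return vowels

    words = ["banana", "kalimera", "strata", "tiketi"]  # toy data
    print(sukhotin_vowels(words))  # prints the segments classified as vowels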


... The classification program presented by Ellison (1991) determines major segmental categories, usually consonants and vowels. Given that these categories can be discovered, looking for regularities in the subsequences of these words which consist only of segments from one category does not require the introduction of new knowledge to the analysis system. ...
Article
Full-text available
It is possible to construct an unsupervised learning system for vowel harmony, which makes accessible generalisations. Furthermore, such a learning system can be constructed with little built-in knowledge, and, consequently, be applicable to data from a wide range of domains. This paper presents a learning system which fulfills these criteria, and examines the results of applying an implementation of it to data from a number of natural languages. A language exhibits vowel harmony if it imposes contextual constraints between adjacent vowels in the same word. A number of languages exhibit vowel harmony: Finnish, Hungarian, Turkish, various Mongolian languages, a number of African languages such as Yoruba and Okpe, and, it has been claimed, the Australian language Warlpiri. Some languages show a similar type of adjacency constraint on proximal consonants, or between some aspects of consonants and vowels. Guaraní, for instance, shows nasal harmony between voiced consonants and vowels ...
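The abstract stops short of the mechanics. As a minimal, hypothetical sketch of how harmony might be detected once a vowel category is available (the data and the scoring choice below are illustrative, not from the paper): measure the mutual information between successive vowels within words, since a harmony language should score noticeably above zero.

    import math
    from collections import Counter

    def vowel_adjacency_mi(words, vowels):
        """Average mutual information (bits) between successive vowels.

        A markedly positive value suggests contextual constraints between
        adjacent vowels in the same word, i.e. vowel-harmony-like behaviour.
        """
        pairs = []
        for w in words:
            vs = [c for c in w if c in vowels]
            pairs.extend(zip(vs, vs[1:]))
        if not pairs:
            return 0.0
        joint = Counter(pairs)
        left = Counter(a for a, _ in pairs)
        right = Counter(b for _, b in pairs)
        n = len(pairs)
        mi = 0.0
        for (a, b), c in joint.items():
            p_ab = c / n
            mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
        return mi

    # Toy Finnish-like data: front vowels (ä, ö, y) vs back vowels (a, o, u)
    harmonic = ["talossa", "kylässä", "pöydällä", "autossa"]
    print(vowel_adjacency_mi(harmonic, set("aouäöyei")))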
... Since the number of probability parameters in word-based models is large (O(N·V·R)), accurate ... Recently, MDL and related techniques have become popular in corpus-based natural language processing and other related fields (Ellison 1991, 1992; Cartwright and Brent 1994; Stolcke and Omohundro 1994; Brent, Murthy, and Lundberg 1995; Ristad and Thomas 1995; Brent and Cartwright 1996; Grunwald 1996). In this paper, we introduce MDL into the context of case frame pattern acquisition. ...
Article
Full-text available
A new method for automatically acquiring case frame patterns from large corpora is proposed. In particular, the problem of generalizing values of a case frame slot for a verb is viewed as that of estimating a conditional probability distribution over a partition of words, and a new generalization method based on the Minimum Description Length (MDL) principle is proposed. In order to assist with efficiency, the proposed method makes use of an existing thesaurus and restricts its attention to those partitions that are present as "cuts" in the thesaurus tree, thus reducing the generalization problem to that of estimating a "tree cut model" of the thesaurus tree. An efficient algorithm is given, which provably obtains the optimal tree cut model for the given frequency data of a case slot, in the sense of MDL. Case frame patterns obtained by the method were used to resolve PP-attachment ambiguity. Experimental results indicate that the proposed method improves upon or is at least comparable with existing methods.
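The recursion behind the tree cut search is compact enough to sketch. The following is a simplified, hypothetical rendering of the idea (the paper's actual coding scheme differs in detail): each node either becomes a single class in the cut, or defers to the best cuts of its children, whichever costs fewer bits.

    import math

    class Node:
        def __init__(self, name, children=(), freq=0):
            self.name = name
            self.children = list(children)
            self.freq = freq  # observed count at a leaf

    def subtree_stats(node):
        """Total observed frequency and leaf count under a node."""
        if not node.children:
            return node.freq, 1
        f = leaves = 0
        for c in node.children:
            cf, cl = subtree_stats(c)
            f += cf
            leaves += cl
        return f, leaves

    def find_mdl_cut(node, total):
        """Return (description length in bits, cut) for this subtree.

        Keeping the subtree as one class costs:
          data:  -f * log2((f/total) / leaves)   (class probability spread
                 uniformly over its leaves)
          model: 0.5 * log2(total) per class parameter
        Otherwise the node defers to its children's best cuts.
        Assumes total > 0.
        """
        f, leaves = subtree_stats(node)
        data_dl = -f * math.log2((f / total) / leaves) if f else 0.0
        here_dl = data_dl + 0.5 * math.log2(total)
        if not node.children:
            return here_dl, [node.name]
        child_dl, child_cut = 0.0, []
        for c in node.children:
            d, cut = find_mdl_cut(c, total)
            child_dl += d
            child_cut += cut
        if here_dl <= child_dl:
            return here_dl, [node.name]
        return child_dl, child_cut

    # Hypothetical thesaurus and counts for some verb's object slot.
    tree = Node("ENTITY", [
        Node("ANIMAL", [Node("dog", freq=3), Node("cat", freq=2)]),
        Node("ARTIFACT", [Node("car", freq=0), Node("spoon", freq=1)]),
    ])
    total, _ = subtree_stats(tree)
    print(find_mdl_cut(tree, total))  # with these counts: the [ANIMAL, ARTIFACT] cut

With these invented counts the search settles on the intermediate cut rather than the root or the individual nouns, which is the generalisation behaviour the abstract describes.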
... Recently MDL and related techniques have become popular in natural language processing and related fields; a number of learning methods based on MDL have been proposed for various applications (Ellison, 1991;Ellison, 1992;Cartwright and Brent, 1994;Stolcke and Omohundro, 1994;Brent, Murthy, and Lundberg, 1995;Ristad and Thomas, 1995;Brent and Cartwright, 1996;Grunwald, 1996). ...
Article
In this thesis, I address the problem of automatically acquiring lexical semantic knowledge, especially that of case frame patterns, from large corpus data and using the acquired knowledge in structural disambiguation. The approach I adopt has the following characteristics: (1) dividing the problem into three subproblems: case slot generalization, case dependency learning, and word clustering (thesaurus construction); (2) viewing each subproblem as a problem of statistical estimation and defining probability models for each; (3) adopting the Minimum Description Length (MDL) principle as the learning strategy; (4) employing efficient learning algorithms; and (5) viewing the disambiguation problem as a problem of statistical prediction. Major contributions of this thesis include: (1) formalization of the lexical knowledge acquisition problem, (2) development of a number of learning methods for lexical knowledge acquisition, and (3) development of a high-performance disambiguation method.
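The last point, disambiguation as statistical prediction, reduces to comparing conditional probability estimates for the competing attachment sites. A toy, hypothetical illustration (the counts, smoothing constant, and vocabulary size are invented):

    from collections import Counter

    # Hypothetical (head, PP) co-occurrence counts from a parsed corpus.
    pair_counts = Counter({("ate", "with_fork"): 8, ("pizza", "with_fork"): 0})
    head_counts = Counter({"ate": 200, "pizza": 150})

    def attach(verb, noun, pp, alpha=1.0, vocab=1000):
        """Attach the PP to whichever head gives it the higher smoothed
        probability: P(pp | verb) vs P(pp | noun)."""
        p_verb = (pair_counts[(verb, pp)] + alpha) / (head_counts[verb] + alpha * vocab)
        p_noun = (pair_counts[(noun, pp)] + alpha) / (head_counts[noun] + alpha * vocab)
        return "verb" if p_verb >= p_noun else "noun"

    print(attach("ate", "pizza", "with_fork"))  # -> verb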
Article
Full-text available
In order to acquire a lexicon, young children must segment speech into words, even though most words are unfamiliar to them. This is a non-trivial task because speech lacks any acoustic analog of the blank spaces between printed words. Two sources of information that might be useful for this task are distributional regularity and phonotactic constraints. Informally, distributional regularity refers to the intuition that sound sequences that occur frequently and in a variety of contexts are better candidates for the lexicon than those that occur rarely or in few contexts. We express that intuition formally by a class of functions called DR functions. We then put forth three hypotheses: First, that children segment using DR functions. Second, that they exploit phonotactic constraints on the possible pronunciations of words in their language. Specifically, they exploit both the requirement that every word must have a vowel and the constraints that languages impose on word-initial and word-final consonant clusters. Third, that children learn which word-boundary clusters are permitted in their language by assuming that all permissible word-boundary clusters will eventually occur at utterance boundaries. Using computational simulation, we investigate the effectiveness of these strategies for segmenting broad phonetic transcripts of child-directed English. The results show that DR functions and phonotactic constraints can be used to significantly improve segmentation. Further, the contributions of DR functions and phonotactic constraints are largely independent, so using both yields better segmentation than using either one alone. Finally, learning the permissible word-boundary clusters from utterance boundaries does not degrade segmentation performance.
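To make the two information sources concrete, here is a small, hypothetical sketch (the DR function and corpus counts are invented, and the paper's actual functions are more refined): enumerate the segmentations that satisfy the every-word-has-a-vowel constraint from the abstract, then rank them with a frequency-based DR score.

    VOWELS = set("aeiou")

    def segmentations(utterance, max_word=6):
        """All segmentations in which every word contains a vowel
        (one of the phonotactic constraints from the abstract)."""
        if not utterance:
            return [[]]
        out = []
        for i in range(1, min(max_word, len(utterance)) + 1):
            word = utterance[:i]
            if VOWELS & set(word):
                for rest in segmentations(utterance[i:], max_word):
                    out.append([word] + rest)
        return out

    def dr_score(seg, freq):
        """Toy distributional-regularity score: frequent words make a
        segmentation more plausible; each extra word costs a penalty."""
        score = 1.0
        for w in seg:
            score *= (freq.get(w, 0) + 1) / (len(seg) + 1)
        return score

    freq = {"the": 50, "dog": 20, "ate": 10}  # hypothetical corpus counts
    utt = "thedogate"
    best = max(segmentations(utt), key=lambda s: dr_score(s, freq))
    print(best)  # -> ['the', 'dog', 'ate'] with these counts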
Article
This paper reports experiments in the automatic discovery of linguistically significant regularities in text. The minimum description length principle is exploited to evaluate linguistic hypotheses with respect to a corpus and a theory of the types of regularities to be found in it. The domain of inquiry in this paper is the discovery of morphemic suffixes such as English -ing and -ly, but the technique is widely applicable to language learning problems.

1 Introduction

Many recent papers have reported work on the automatic discovery of linguistic regularities in text. Most of these exploit statistics based on information theory to measure how likely two linguistic entities are to co-occur. Co-occurrence statistics have been used to assess semantic similarity [15], PP attachment preference [16], linguistically significant collocations [23], syntactic categories [4], and syntactic rules [5]. The above work raises two interesting questions: how should statistical measurements be interpreted ...
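The evaluation at the heart of this approach can be caricatured in a few lines. A crude, hypothetical sketch (the paper's coding of hypotheses is considerably more careful): compare the cost of listing matched words verbatim against the cost of listing their stems plus the suffix once; positive savings support the suffix hypothesis.

    import math

    def code_length(strings):
        """Crude description length: total characters times per-symbol cost."""
        alphabet = set("".join(strings)) or {"#"}
        bits_per_char = math.log2(len(alphabet)) if len(alphabet) > 1 else 1.0
        return sum(len(s) + 1 for s in strings) * bits_per_char

    def suffix_gain(words, suffix):
        """Bits saved by coding 'suffix' once and storing stems for the
        words it matches; positive gain supports the suffix hypothesis."""
        matched = [w for w in words if w.endswith(suffix) and len(w) > len(suffix)]
        baseline = code_length(matched)
        stems = [w[: -len(suffix)] for w in matched]
        return baseline - (code_length(stems) + code_length([suffix]))

    words = ["walking", "talking", "jumping", "walked", "talked",
             "quickly", "badly"]  # hypothetical word list
    for sfx in ["ing", "ly", "ed"]:
        print(sfx, round(suffix_gain(words, sfx), 1))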