Article

Discovering planar segregations


Abstract

In this report I present an algorithm for finding planar segregations of phonemes for particular languages. This algorithm requires no domain-specific knowledge of phonology or phonetics. Despite this lack of knowledge, the implemented algorithm has identified the structurally significant segregations for thirty languages.
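The abstract does not spell out the procedure, but the citing contexts below indicate that the program discovers major segmental categories, typically consonants and vowels, from raw word lists. As a rough illustration of what knowledge-free category discovery can look like, here is a sketch of Sukhotin's classic vowel-identification algorithm in Python. This is a different and simpler method than Ellison's MDL-based one; it exploits only the tendency of the two categories to alternate within words, and the word list is invented toy data.

    from collections import defaultdict

    def sukhotin_vowels(words):
        """Knowledge-free consonant/vowel split (Sukhotin's algorithm).

        Builds a symmetric adjacency-count table over segments, then
        repeatedly promotes the segment with the highest remaining
        adjacency sum to vowel status, discounting its neighbours.
        """
        counts = defaultdict(int)
        symbols = set()
        for w in words:
            symbols.update(w)
            for a, b in zip(w, w[1:]):
                if a != b:                      # zero diagonal, as in the original
                    counts[(a, b)] += 1
                    counts[(b, a)] += 1
        sums = {s: sum(counts[(s, t)] for t in symbols) for s in symbols}
        vowels = set()
        while True:
            rest = [s for s in symbols if s not in vowels]
            if not rest:
                break
            v = max(rest, key=lambda s: sums[s])
            if sums[v] <= 0:                    # no alternation evidence left
                break
            vowels.add(v)
            for s in rest:
                if s != v:
                    sums[s] -= 2 * counts[(s, v)]
        return vowels

    words = ["banana", "kalimera", "strata", "tiketi"]  # toy data
    print(sukhotin_vowels(words))  # prints the segments classified as vowels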


... The classification program presented by Ellison (1991) determines major segmental categories, usually consonants and vowels. Given that these categories can be discovered, looking for regularities in the subsequences of these words which consist only of segments from one category does not require the introduction of new knowledge to the analysis system. ...
Article
Full-text available
It is possible to construct an unsupervised learning system for vowel harmony, which makes accessible generalisations. Furthermore, such a learning system can be constructed with little built-in knowledge, and, consequently, be applicable to data from a wide range of domains. This paper presents a learning system which fulfills these criteria, and examines the results of applying an implementation of it to data from a number of natural languages. A language exhibits vowel harmony if it imposes contextual constraints between adjacent vowels in the same word. A number of languages exhibit vowel harmony: Finnish, Hungarian, Turkish, various Mongolian languages, a number of African languages such as Yoruba and Okpe, and, it has been claimed, the Australian language Warlpiri. Some languages show a similar type of adjacency constraint on proximal consonants, or between some aspects of consonants and vowels. Guaraní, for instance, shows nasal harmony between voiced consonants and vowels ...
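The abstract stops short of the mechanics. As a minimal, hypothetical sketch of how harmony might be detected once a vowel category is available (the data and the scoring choice below are illustrative, not from the paper): measure the mutual information between successive vowels within words, since a harmony language should score noticeably above zero.

    import math
    from collections import Counter

    def vowel_adjacency_mi(words, vowels):
        """Average mutual information (bits) between successive vowels.

        A markedly positive value suggests contextual constraints between
        adjacent vowels in the same word, i.e. vowel-harmony-like behaviour.
        """
        pairs = []
        for w in words:
            vs = [c for c in w if c in vowels]
            pairs.extend(zip(vs, vs[1:]))
        if not pairs:
            return 0.0
        joint = Counter(pairs)
        left = Counter(a for a, _ in pairs)
        right = Counter(b for _, b in pairs)
        n = len(pairs)
        mi = 0.0
        for (a, b), c in joint.items():
            p_ab = c / n
            mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
        return mi

    # Toy Finnish-like data: front vowels (ä, ö, y) vs back vowels (a, o, u)
    harmonic = ["talossa", "kylässä", "pöydällä", "autossa"]
    print(vowel_adjacency_mi(harmonic, set("aouäöyei")))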
... Since the number of probability parameters in word-based models is large (O(N·V·R)), accurate ... Recently, MDL and related techniques have become popular in corpus-based natural language processing and other related fields (Ellison 1991, 1992; Cartwright and Brent 1994; Stolcke and Omohundro 1994; Brent, Murthy, and Lundberg 1995; Ristad and Thomas 1995; Brent and Cartwright 1996; Grunwald 1996). In this paper, we introduce MDL into the context of case frame pattern acquisition. ...
Article
Full-text available
A new method for automatically acquiring case frame patterns from large corpora is proposed. In particular, the problem of generalizing values of a case frame slot for a verb is viewed as that of estimating a conditional probability distribution over a partition of words, and a new generalization method based on the Minimum Description Length (MDL) principle is proposed. In order to assist with efficiency, the proposed method makes use of an existing thesaurus and restricts its attention to those partitions that are present as "cuts" in the thesaurus tree, thus reducing the generalization problem to that of estimating a "tree cut model" of the thesaurus tree. An efficient algorithm is given, which provably obtains the optimal tree cut model for the given frequency data of a case slot, in the sense of MDL. Case frame patterns obtained by the method were used to resolve PP-attachment ambiguity. Experimental results indicate that the proposed method improves upon or is at least comparable with existing methods.
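The recursion behind the tree cut search is compact enough to sketch. The following is a simplified, hypothetical rendering of the idea (the paper's actual coding scheme differs in detail): each node either becomes a single class in the cut, or defers to the best cuts of its children, whichever costs fewer bits.

    import math

    class Node:
        def __init__(self, name, children=(), freq=0):
            self.name = name
            self.children = list(children)
            self.freq = freq  # observed count at a leaf

    def subtree_stats(node):
        """Total observed frequency and leaf count under a node."""
        if not node.children:
            return node.freq, 1
        f = leaves = 0
        for c in node.children:
            cf, cl = subtree_stats(c)
            f += cf
            leaves += cl
        return f, leaves

    def find_mdl_cut(node, total):
        """Return (description length in bits, cut) for this subtree.

        Keeping the subtree as one class costs:
          data:  -f * log2((f/total) / leaves)   (class probability spread
                 uniformly over its leaves)
          model: 0.5 * log2(total) per class parameter
        Otherwise the node defers to its children's best cuts.
        Assumes total > 0.
        """
        f, leaves = subtree_stats(node)
        data_dl = -f * math.log2((f / total) / leaves) if f else 0.0
        here_dl = data_dl + 0.5 * math.log2(total)
        if not node.children:
            return here_dl, [node.name]
        child_dl, child_cut = 0.0, []
        for c in node.children:
            d, cut = find_mdl_cut(c, total)
            child_dl += d
            child_cut += cut
        if here_dl <= child_dl:
            return here_dl, [node.name]
        return child_dl, child_cut

    # Hypothetical thesaurus and counts for some verb's object slot.
    tree = Node("ENTITY", [
        Node("ANIMAL", [Node("dog", freq=3), Node("cat", freq=2)]),
        Node("ARTIFACT", [Node("car", freq=0), Node("spoon", freq=1)]),
    ])
    total, _ = subtree_stats(tree)
    print(find_mdl_cut(tree, total))  # with these counts: the [ANIMAL, ARTIFACT] cut

With these invented counts the search settles on the intermediate cut rather than the root or the individual nouns, which is the generalisation behaviour the abstract describes.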
... Recently MDL and related techniques have become popular in natural language processing and related fields; a number of learning methods based on MDL have been proposed for various applications (Ellison, 1991;Ellison, 1992;Cartwright and Brent, 1994;Stolcke and Omohundro, 1994;Brent, Murthy, and Lundberg, 1995;Ristad and Thomas, 1995;Brent and Cartwright, 1996;Grunwald, 1996). ...
Article
In this thesis, I address the problem of automatically acquiring lexical semantic knowledge, especially that of case frame patterns, from large corpus data and using the acquired knowledge in structural disambiguation. The approach I adopt has the following characteristics: (1) dividing the problem into three subproblems: case slot generalization, case dependency learning, and word clustering (thesaurus construction); (2) viewing each subproblem as a problem of statistical estimation and defining probability models for each; (3) adopting the Minimum Description Length (MDL) principle as the learning strategy; (4) employing efficient learning algorithms; and (5) viewing the disambiguation problem as a problem of statistical prediction. Major contributions of this thesis include: (1) formalization of the lexical knowledge acquisition problem, (2) development of a number of learning methods for lexical knowledge acquisition, and (3) development of a high-performance disambiguation method.
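The last point, disambiguation as statistical prediction, reduces to comparing conditional probability estimates for the competing attachment sites. A toy, hypothetical illustration (the counts, smoothing constant, and vocabulary size are invented):

    from collections import Counter

    # Hypothetical (head, PP) co-occurrence counts from a parsed corpus.
    pair_counts = Counter({("ate", "with_fork"): 8, ("pizza", "with_fork"): 0})
    head_counts = Counter({"ate": 200, "pizza": 150})

    def attach(verb, noun, pp, alpha=1.0, vocab=1000):
        """Attach the PP to whichever head gives it the higher smoothed
        probability: P(pp | verb) vs P(pp | noun)."""
        p_verb = (pair_counts[(verb, pp)] + alpha) / (head_counts[verb] + alpha * vocab)
        p_noun = (pair_counts[(noun, pp)] + alpha) / (head_counts[noun] + alpha * vocab)
        return "verb" if p_verb >= p_noun else "noun"

    print(attach("ate", "pizza", "with_fork"))  # -> verb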
Article
Full-text available
In order to acquire a lexicon, young children must segment speech into words, even though most words are unfamiliar to them. This is a non-trivial task because speech lacks any acoustic analog of the blank spaces between printed words. Two sources of information that might be useful for this task are distributional regularity and phonotactic constraints. Informally, distributional regularity refers to the intuition that sound sequences that occur frequently and in a variety of contexts are better candidates for the lexicon than those that occur rarely or in few contexts. We express that intuition formally by a class of functions called DR functions. We then put forth three hypotheses: First, that children segment using DR functions. Second, that they exploit phonotactic constraints on the possible pronunciations of words in their language. Specifically, they exploit both the requirement that every word must have a vowel and the constraints that languages impose on word-initial and word-final consonant clusters. Third, that children learn which word-boundary clusters are permitted in their language by assuming that all permissible word-boundary clusters will eventually occur at utterance boundaries. Using computational simulation, we investigate the effectiveness of these strategies for segmenting broad phonetic transcripts of child-directed English. The results show that DR functions and phonotactic constraints can be used to significantly improve segmentation. Further, the contributions of DR functions and phonotactic constraints are largely independent, so using both yields better segmentation than using either one alone. Finally, learning the permissible word-boundary clusters from utterance boundaries does not degrade segmentation performance.
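To make the two information sources concrete, here is a small, hypothetical sketch (the DR function and corpus counts are invented, and the paper's actual functions are more refined): enumerate the segmentations that satisfy the every-word-has-a-vowel constraint from the abstract, then rank them with a frequency-based DR score.

    VOWELS = set("aeiou")

    def segmentations(utterance, max_word=6):
        """All segmentations in which every word contains a vowel
        (one of the phonotactic constraints from the abstract)."""
        if not utterance:
            return [[]]
        out = []
        for i in range(1, min(max_word, len(utterance)) + 1):
            word = utterance[:i]
            if VOWELS & set(word):
                for rest in segmentations(utterance[i:], max_word):
                    out.append([word] + rest)
        return out

    def dr_score(seg, freq):
        """Toy distributional-regularity score: frequent words make a
        segmentation more plausible; each extra word costs a penalty."""
        score = 1.0
        for w in seg:
            score *= (freq.get(w, 0) + 1) / (len(seg) + 1)
        return score

    freq = {"the": 50, "dog": 20, "ate": 10}  # hypothetical corpus counts
    utt = "thedogate"
    best = max(segmentations(utt), key=lambda s: dr_score(s, freq))
    print(best)  # -> ['the', 'dog', 'ate'] with these counts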
Article
This paper reports experiments in the automatic discovery of linguistically significant regularities in text. The minimum description length principle is exploited to evaluate linguistic hypotheses with respect to a corpus and a theory of the types of regularities to be found in it. The domain of inquiry in this paper is the discovery of morphemic suffixes such as English -ing and -ly, but the technique is widely applicable to language learning problems.

1 Introduction

Many recent papers have reported work on the automatic discovery of linguistic regularities in text. Most of these exploit statistics based on information theory to measure how likely two linguistic entities are to co-occur. Co-occurrence statistics have been used to assess semantic similarity [15], PP attachment preference [16], linguistically significant collocations [23], syntactic categories [4], and syntactic rules [5]. The above work raises two interesting questions: how should statistical measurements be interpreted ...
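The evaluation at the heart of this approach can be caricatured in a few lines. A crude, hypothetical sketch (the paper's coding of hypotheses is considerably more careful): compare the cost of listing matched words verbatim against the cost of listing their stems plus the suffix once; positive savings support the suffix hypothesis.

    import math

    def code_length(strings):
        """Crude description length: total characters times per-symbol cost."""
        alphabet = set("".join(strings)) or {"#"}
        bits_per_char = math.log2(len(alphabet)) if len(alphabet) > 1 else 1.0
        return sum(len(s) + 1 for s in strings) * bits_per_char

    def suffix_gain(words, suffix):
        """Bits saved by coding 'suffix' once and storing stems for the
        words it matches; positive gain supports the suffix hypothesis."""
        matched = [w for w in words if w.endswith(suffix) and len(w) > len(suffix)]
        baseline = code_length(matched)
        stems = [w[: -len(suffix)] for w in matched]
        return baseline - (code_length(stems) + code_length([suffix]))

    words = ["walking", "talking", "jumping", "walked", "talked",
             "quickly", "badly"]  # hypothetical word list
    for sfx in ["ing", "ly", "ed"]:
        print(sfx, round(suffix_gain(words, sfx), 1))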