Article

A stochastic approach to sentence parsing


Abstract

A description is given of a procedure for assigning the most likely probabilities to each of the rules of a given context-free grammar. The grammar developed by S. Kuno at Harvard University was chosen as the basis and was successfully augmented with rule probabilities. A brief exposition of the method is given, together with some preliminary results obtained when it is used as a device for disambiguating the parsing of English texts drawn from a natural corpus.
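The core idea of the abstract can be illustrated with a small sketch. The toy rules and probabilities below are invented for illustration (the paper's actual grammar is Kuno's, which is not reproduced here): each context-free rule carries a probability, a parse is scored by the product of the probabilities of the rules it uses, and the highest-scoring parse is chosen when a sentence is ambiguous.

```python
from functools import reduce

# Hypothetical rule probabilities for a toy grammar; the grammar and
# probabilities used in the paper are not reproduced here.
RULE_PROB = {
    ("VP", ("V", "NP", "PP")): 0.3,   # attach the PP to the verb phrase
    ("VP", ("V", "NP")):       0.7,
    ("NP", ("NP", "PP")):      0.2,   # attach the PP to the noun phrase
    ("NP", ("Det", "N")):      0.8,
}

def parse_probability(rules_used):
    """Score a parse as the product of the probabilities of its rules."""
    return reduce(lambda p, r: p * RULE_PROB[r], rules_used, 1.0)

# Two competing analyses of an ambiguous "saw the man with a telescope" VP:
verb_attachment = [("VP", ("V", "NP", "PP")), ("NP", ("Det", "N"))]
noun_attachment = [("VP", ("V", "NP")), ("NP", ("NP", "PP")), ("NP", ("Det", "N"))]

best = max([verb_attachment, noun_attachment], key=parse_probability)
print(best, parse_probability(best))   # the verb-attachment reading wins here
```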


... A probabilistic approach to natural language processing is not new [1]. Recently, many parsers based on this line have been proposed [2][3][4][5][6][7][8][9]. Garside and Leech [2] apply the constituent-likelihood grammar of Atwell [10] to probabilistic parsing. ...
... He also argues that the probabilistic method should be controlled, otherwise it is not useful. Some papers [5][6][7][8][9] apply probabilistic context-free grammars to the parsing task. ...
... In the field of parsing techniques, many parsers along this line have been proposed. Some of them are LR-style [5][6][7][8][9]; some are chart-based [3]; some adopt constituent-likelihood grammar [2]. These approaches are more complex. ...
Article
Full-text available
This paper proposes a probabilistic partial parser, which we call a chunker. The chunker partitions the input sentence into segments. This idea is motivated by the fact that when we read a sentence, we read it chunk by chunk. We train the chunker on the Susanne Corpus, a modified and reduced version of the Brown Corpus, using an underlying bi-gram language model. The experiments are evaluated with an outside test and an inside test. The preliminary results show that the chunker achieves more than a 98% chunk correct rate and a 94% sentence correct rate in the outside test, and a 99% chunk correct rate and a 97% sentence correct rate in the inside test. The simple but effective chunker design has been shown to be promising and can be extended to complete parsing and many applications. 1. Introduction A probabilistic approach to natural language processing is not new [1]. Recently, many parsers based on this line have been proposed [2-9]. Garside and Leech [2] apply the constituent-likelihood grammar of Atwell [10] to probabilist...
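The abstract does not spell out the chunker's decision rule, so the following is only a minimal sketch under an assumption: that a chunk boundary between two adjacent part-of-speech tags is decided by a bigram-conditioned boundary probability. The probability table and threshold below are invented for illustration; a real system would estimate them from a chunk-annotated corpus such as Susanne.

```python
# Hypothetical probabilities P(boundary | left_tag, right_tag); these values
# are invented, not taken from the paper.
P_BOUNDARY = {
    ("N", "V"): 0.9,     # noun followed by verb: likely chunk boundary
    ("Det", "N"): 0.05,  # determiner followed by noun: same chunk
    ("V", "Det"): 0.8,
    ("N", "P"): 0.7,
}

def chunk(tags, threshold=0.5):
    """Partition a tag sequence into chunks at likely boundaries."""
    chunks, current = [], [tags[0]]
    for left, right in zip(tags, tags[1:]):
        if P_BOUNDARY.get((left, right), 0.5) >= threshold:
            chunks.append(current)   # close the current chunk
            current = []
        current.append(right)
    chunks.append(current)
    return chunks

print(chunk(["Det", "N", "V", "Det", "N"]))
# [['Det', 'N'], ['V'], ['Det', 'N']]
```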
... As we are working on spoken language, we try to reflect real language usage. For this purpose, a stochastic approach beyond trigrams, namely stochastic sentence parsing [7], seems most promising. Ideally, syntactic rules should be generated automatically from a large dialogue corpus and probabilities should also be automatically assigned to each node. ...
Conference Paper
Full-text available
This paper describes the syntactic rules which are applied in the Japanese speech recognition module of a speech-to-speech translation system. Japanese is considered to be a free word/phrase order language. Since syntactic rules are applied as constraints to reduce the search space in speech recognition, applying rules which take into account all possible phrase orders can have almost the same effect as using no constraints. Instead, we take into consideration the recognition weaknesses of certain syntactic categories and treat them precisely, so that a minimal number of rules can work most effectively. In this paper we first examine which syntactic categories are easily misrecognized. Second, we consult our dialogue corpus, in order to provide the rules with great generality. Based on both studies, we refine the rules. Finally, we verify the validity of the refinement through speech recognition experiments.
... Despite the fact that in many cases the independent rule assumption does not hold, several researchers have transported this idea to natural languages and achieved different degrees of success depending on the training method and scoring function. Fujisaki (1984) was one of the first to build a probabilistic CFG. He chose the unsupervised training approach (i.e., frequency statistics were drawn from both correct and incorrect parses), using the inside-outside algorithm: each grammar rule is assigned an a priori probability as an initial guess, and these probabilities are repeatedly approximated using the probabilities of the different parses of ambiguous sentences until they converge. ...
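The re-estimation loop described in that passage can be written out schematically. The sketch below is a simplified EM step over explicitly enumerated parses rather than the full inside-outside chart computation, and its data structures are assumptions for illustration only: weight each competing parse by its probability under the current model, collect fractional rule counts, and renormalize per left-hand side.

```python
from collections import defaultdict

def em_step(sentences_parses, rule_prob):
    """One simplified re-estimation step for a probabilistic CFG.

    sentences_parses: list of lists of parses; a parse is a list of rules
                      (lhs, rhs).  rule_prob: dict mapping rule -> probability.
    """
    counts = defaultdict(float)
    for parses in sentences_parses:
        # probability of each competing parse under the current model
        scores = []
        for parse in parses:
            p = 1.0
            for rule in parse:
                p *= rule_prob[rule]
            scores.append(p)
        total = sum(scores) or 1.0
        # expected (fractional) rule counts, weighted by parse probability
        for parse, score in zip(parses, scores):
            for rule in parse:
                counts[rule] += score / total
    # renormalize so that probabilities of rules sharing a lhs sum to 1
    # (rules never used in any parse keep no mass in this simplified sketch)
    lhs_totals = defaultdict(float)
    for (lhs, rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}
```

Iterating em_step until the rule probabilities stop changing mirrors the unsupervised training regime described above.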
Article
s Service, who graciously shared his powerful workstation for some heavy duty computing that would have overwhelmed walt, the departmental server; Don Lewis, of Jackson State Community College, for granting dedicated internet access to a faculty member of a "rival school." A special note of gratitude is also extended to Union University, especially John David Barham, for providing me with the sole Unix workstation on campus, for purchasing Quintus Prolog, and for setting up a dedicated telephone connection with JSCC. On the software side, I am indebted to Drs. Debbie Dahl and Marcia Linebarger for granting a license to use PUNDIT and for taking the time to answer my detailed questions; to Dr. Eric Brill for sharing the code to his bracketing program; and to Dr. Mike Berry for allowing access to his SVD implementation. Thanks are also due to my chair Dwayne Jennings, dean Dr. Jim Baggett, and vice-president Dr. Howard Newell, for giving me a reduced teaching load during several sem...
Conference Paper
CRITAC (CRITiquing using ACcumulated knowledge) is an experimental expert system for proofreading Japanese text. It detects typing errors, Kana-to-Kanji misconversions, and stylistic errors. This system combines Prolog-coded heuristic knowledge with conventional Japanese text processing techniques which involve heavy computation and access to large language databases.
Article
The recent trend in natural language processing research has been to develop systems that deal with text concerning small, well defined domains. One practical application for such systems is to process messages pertaining to some very specific task or activity [5]. The advantage of dealing with such domains is twofold: firstly, due to the narrowness of the domain, it is possible to encode most of the knowledge related to the domain and to make this knowledge accessible to the natural language processing system, which in turn can use this knowledge to disambiguate the meanings of the messages. Secondly, in such a domain, there is not a great diversity of language constructs and therefore it becomes easier to construct a grammar which will capture all the constructs which exist in this sub-language. However, some practical aspects of such domains tend to make the problem somewhat difficult. Often, the messages tend not to be absolutely grammatically correct. As a result, the grammar designed for such a system needs to be far more forgiving than one designed for the task of parsing edited English. This can result in a proliferation of parses, which in turn makes the disambiguation task more difficult. This problem is further compounded by the telegraphic nature of the discourse, since telegraphic discourse is more prone to be syntactically ambiguous.
Chapter
This paper presents a direction-free framework for handling ambiguity by treating heuristics or preferences as orderings on the set of structures under consideration. It examines various properties that the heuristics must have in order to produce coherent and efficient rankings, and presents ways of combining rankings into more complex preferences. Finally, it considers the issues that arise when heuristics are used bi-directionally by both the understanding and generation system.
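One simple way to realize "combining rankings into more complex preferences" is lexicographic combination of per-heuristic scores, where later heuristics only break ties left by earlier ones. The heuristics and candidate features below are placeholders for illustration, not the ones the chapter actually proposes.

```python
def combine(*heuristics):
    """Combine scoring heuristics lexicographically: later heuristics
    only break ties left by earlier ones."""
    return lambda analysis: tuple(h(analysis) for h in heuristics)

# Placeholder heuristics over candidate analyses (dicts with a few features).
prefer_fewer_nodes = lambda a: a["nodes"]              # smaller is better
prefer_close_attachment = lambda a: a["attach_distance"]

candidates = [
    {"name": "parse-1", "nodes": 12, "attach_distance": 3},
    {"name": "parse-2", "nodes": 12, "attach_distance": 1},
    {"name": "parse-3", "nodes": 14, "attach_distance": 1},
]
best = min(candidates, key=combine(prefer_fewer_nodes, prefer_close_attachment))
print(best["name"])   # parse-2: tied on size, wins on closer attachment
```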
Chapter
Kumauni is one of the relatively understudied regional languages of India. Here, we have attempted to develop a parsing tool for use in Kumauni language studies, with the eventual aim of developing a technique for checking the grammatical structure of sentences in the Kumauni language. For this purpose, we have taken a set of pre-existing Kumauni sentences and derived rules of grammar from them, which have been converted to a mathematical model using Earley's algorithm, suitably modified by us. The mathematical model so developed has been verified by testing it on a separate set of pre-existing Kumauni sentences. This mathematical model can be used for parsing new Kumauni sentences, thus providing researchers a new parsing tool. Keywords: Kumauni language, Context-Free Grammar, Earley's Algorithm, Natural Language Processing, Parsing
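Earley's algorithm itself, without the authors' modifications (which the abstract does not detail), can be sketched compactly. The toy grammar in the usage lines is an assumption for illustration; the sketch omits epsilon rules.

```python
def earley_recognize(tokens, grammar, start="S"):
    """Minimal Earley recognizer.

    grammar: dict mapping a non-terminal to a list of right-hand sides,
             each a tuple of symbols; symbols not in the grammar are terminals.
    Returns True if `tokens` is derivable from `start`.
    """
    GAMMA = "<GAMMA>"                              # artificial top-level symbol
    grammar = dict(grammar, **{GAMMA: [(start,)]})
    n = len(tokens)
    # chart[i] is a set of states (lhs, rhs, dot position, origin)
    chart = [set() for _ in range(n + 1)]
    chart[0].add((GAMMA, (start,), 0, 0))

    for i in range(n + 1):
        added = True
        while added:                               # iterate to a fixpoint
            added = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:             # predictor
                        for prod in grammar[sym]:
                            new = (sym, tuple(prod), 0, i)
                            if new not in chart[i]:
                                chart[i].add(new); added = True
                    elif i < n and tokens[i] == sym:   # scanner
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                              # completer
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); added = True
    return (GAMMA, (start,), 1, 0) in chart[n]

# Toy grammar over part-of-speech tags (an illustrative assumption).
grammar = {"S": [("NP", "VP")], "NP": [("det", "noun")], "VP": [("verb", "NP")]}
print(earley_recognize(["det", "noun", "verb", "det", "noun"], grammar))  # True
```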
Article
This paper relates different kinds of language modeling methods that can be applied to the linguistic decoding part of a speech recognition system with a very large vocabulary. These models are studied experimentally on a pseudophonetic input arising from French stenotypy. We propose a model which combines the advantages of a statistical modeling with information theoretic tools, and those of a grammatical approach.
Article
This paper describes an experimental expert system for proofreading Japanese text. The system is called CRITAC (CRITiquing using ACcumulated knowledge). It can detect typographical errors, Kana-to-Kanji conversion errors, and stylistic errors in Japanese text. We describe the basic concepts and features of CRITAC, including preprocessing of text, a high-level text model, Prolog-coded heuristic proofreading knowledge, and a user-friendly interface. Although CRITAC has been primarily designed for Japanese text, it appears that most of the concepts and the architecture of CRITAC can be applied to other languages as well.
Article
Development of a machine translation system (MTS) requires many tradeoffs in terms of the variety of available formalisms and control mechanisms. The tradeoffs involve issues in the generative power of the grammar, the formal linguistic power and efficiency of the parser, manipulation flexibility for knowledge bases, knowledge acquisition, the degree of expressiveness and uniformity of the system, integration of the knowledge sources, and so forth. In this paper we discuss some basic decisions which must be made in constructing a large system. Our experience with an operational English-Chinese MTS, ArchTran, is presented to illustrate decision making related to procedural tradeoffs.
Article
The past decade has witnessed substantial progress toward the goal of constructing a machine capable of understanding colloquial discourse. Central to this progress has been the development and application of mathematical methods that permit modeling the speech signal as a complex code with several coexisting levels of structure. The most successful of these are "template matching," stochastic modeling, and probabilistic parsing. The manifestation of common themes such as dynamic programming and finite-state descriptions accentuates a superficial likeness amongst the methods which is often mistaken for the deeper similarity arising from their shared Bayesian foundation. In this paper, we outline the mathematical bases of these methods, invariant metrics, hidden Markov chains, and formal grammars, respectively. We then recount and briefly interpret the results of experiments in speech recognition to which the various methods were applied. Since these mathematical principles seem to bear little resemblance to traditional linguistic characterizations of speech, the success of the experiments is occasionally attributed, even by their authors, merely to excellent engineering. We conclude by speculating that, quite to the contrary, these methods actually constitute a powerful theory of speech that can be reconciled with and elucidate conventional linguistic theories while being used to build truly competent mechanical speech recognizers.
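The "dynamic programming over hidden Markov chains" that underlies the stochastic modeling mentioned above is, at its core, the Viterbi recursion. The toy model below (two hidden phone-like states emitting two acoustic labels, with invented transition and emission tables) is an assumption purely to show the recursion, not something taken from the paper.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence under an HMM (log-space Viterbi)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
          for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({}); back.append({})
        for s in states:
            best_prev = max(states,
                            key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s][observations[t]]))
            back[t][s] = best_prev
    # trace back the best path from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Invented toy model: two hidden phones emitting acoustic labels "a" and "b".
states = ["ph1", "ph2"]
start_p = {"ph1": 0.6, "ph2": 0.4}
trans_p = {"ph1": {"ph1": 0.7, "ph2": 0.3}, "ph2": {"ph1": 0.4, "ph2": 0.6}}
emit_p  = {"ph1": {"a": 0.9, "b": 0.1}, "ph2": {"a": 0.2, "b": 0.8}}
print(viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p))
```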
Article
Full-text available
© 1995 by Rens Bod. All rights reserved. Printed in the Netherlands by Academische Pers, Amsterdam. Acknowledgements This thesis benefitted from discussions with many people. I would like to express my thanks to Martin van den Berg, Kenneth Church, Marc Dymetman, Bipin Indurkhya, Laszlo Kalman, Ronald Kaplan, Martin Kay, Steven Krauwer, Kwee Tjoe Liong, Neza van der Leeuw, David Magerman, Arie Mijnlieff, Fernando Pereira, Philip Resnik, Yves Schabes, Khalil Sima'an and Frederik Somsen. Furthermore, I wish to thank the members of the graduation committee: Renate Bartsch, Jan van Eijck, Gerard Kempen, Chris Klaassen and Anton Nijholt. I am grateful to Steven Krauwer for allowing me to work at this thesis while I was involved in the CLASK project ("Combining Linguistic and Statistical Knowledge") at Utrecht University. The fruitful discussions and positive cooperation with my colleagues Martin van den Berg and Khalil Sima'an have been of incalculable value
Article
The availability of large files of manually reviewed parse trees from the University of Pennsylvania "tree bank", along with a program for comparing system-generated parses against these "standard" parses, provides a new opportunity for evaluating different parsing strategies. We discuss some of the restructuring required to the output of our parser so that it could be meaningfully compared with these standard parses. We then describe several heuristics for improving parsing accuracy and coverage, such as closest attachment of modifiers, statistical grammars, and fitted parses, and present a quantitative evaluation of the improvements obtained with each strategy.
Article
Full-text available
Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.
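The maximum likelihood decoding formulation referred to above is the standard noisy-channel decomposition. Written out in the usual notation, with A the acoustic evidence and W a candidate word sequence:

$$\hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)} \;=\; \arg\max_{W} P(A \mid W)\,P(W),$$

where $P(A \mid W)$ is supplied by the acoustic model and $P(W)$ by the language model; $P(A)$ is constant over the candidates and can be dropped from the maximization.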
Article
Basic considerations in designing a natural data base query language system are discussed. The notion of the noun-phrase data model is elaborated, and its role in making a query system suitable for general use is stressed. An experimental query system, Yachimata, embodying the concept, is described. Published in: IBM Journal of Research and Development, Volume 22, Issue 5, pp. 533-540, September 1978. ISSN: 0018-8646. DOI: 10.1147/rd.225.0533.
Article
It has been proven by Greibach that for a given context-free grammar G, a standard-form grammar Gs can be constructed which generates the same language as is generated by G and whose rules are all of the form Z → c Y1 ··· Ym (m ≥ 0), where Z and the Yi are intermediate symbols and c a terminal symbol. Since the predictive analyzer at Harvard uses a standard-form grammar, it can accept the language of any context-free grammar G, given an equivalent standard-form grammar Gs. The structural descriptions SD(Gs, χ) assigned to a given sentence χ by the predictive analyzer, however, are usually different from the structural descriptions SD(G, χ) assigned to the same sentence by the original context-free grammar G from which Gs is derived. In Section 1, an algorithm, originally due to Abbott, is described which converts a given context-free grammar into an augmented standard-form grammar each of whose rules is in standard form, supplemented by additional information describing its derivation from the original context-free grammar. A technique for performing the SD(Gs, χ) to SD(G, χ) transformation effectively is also described. In Section 2, the augmented predictive analyzer as a parsing algorithm for arbitrary context-free languages is compared with two other parsing algorithms: a selective top-to-bottom algorithm similar to Irons' "error correcting parse algorithm" and
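A one-rule example of the standard form referred to above (the grammar is illustrative, not taken from the paper): in a standard-form grammar, every rule begins with a terminal symbol.

G:   S → A B,   A → a,   B → b        (not in standard form: the S rule begins with a non-terminal)
Gs:  S → a B,   B → b                 (an equivalent standard-form grammar, obtained by substituting A → a into the S rule)

Both grammars generate exactly the string "ab", but the structural description under Gs lacks the A node that G assigns, which is precisely the SD(Gs, χ) versus SD(G, χ) discrepancy that the augmented standard form is designed to undo.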
Conference Paper
Further results have been obtained on the recognition of continuously read sentences from a natural language corpus of laser patents. The vocabulary is limited to the 1000 most frequently occurring words in the corpus. The model of the task language has a perplexity of 24.1 words (corresponding to an entropy of 4.6 bits/word). This paper describes modifications and improvements to the system which have resulted in the lowering of the word error rate from the previously reported 33.1% to 8.9%.
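The two figures quoted are consistent with the usual definition of perplexity as the exponential of the per-word entropy:

$$\mathrm{PP} = 2^{H} = 2^{4.6} \approx 24 \ \text{words},$$

and the same relation holds for the figures quoted for the New Raleigh and CMU-AIX05 languages in the next entry ($2^{2.86} \approx 7.27$ and $2^{2.18} \approx 4.53$).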
Conference Paper
We report performance results on the recognition of continuously spoken sentences from the finite-state grammar for the "New Raleigh Language" (vocabulary: 250 words; average sentence length: 8 words; entropy: 2.86 bits/word; perplexity: 7.27 words). Sentence and word error rates of 5% and 0.6%, respectively, are achieved, using a new centisecond-level model for the acoustic processor. We also report results for the "CMU-AIX05 Language" (vocabulary: 1011 words; average sentence length: about 7 words; entropy: 2.18 bits/word; perplexity: 4.53 words), using both our earlier phone-level model and the centisecond-level model. With the phone-level acoustic-processor model, sentence and word error rates of 2% and 0.8%, respectively, are achieved. With the centisecond-level model, sentence and word error rates are 1% and 0.1%, respectively.