Article

Generation of Zero Pronouns Based on the Centering Theory and Pairwise Salience of Entities


Abstract

This paper investigates zero pronouns in Korean, focusing on the center transitions of adjacent utterances within the framework of Centering Theory. Four types of nominal entity (Epair, Einter, Eintra, and Enon) are defined from Centering Theory using the concepts of inter-, intra-, and pairwise salience. For each entity type, a case study of zero phenomena is performed by analyzing a corpus and building a pronominalization model. The study shows that the zero phenomena of entities neglected in previous Centering work can be explained via the center transition of the second previous utterance, and it provides results useful for pronominalizing such entities, such as the p2-trans rule. We improve the accuracy of the pronominalization model through optimal feature selection and show that it outperforms the accuracy reported in previous work.
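The center transitions the abstract refers to can be illustrated with the standard four-way classification from the Centering literature (CONTINUE, RETAIN, SMOOTH-SHIFT, ROUGH-SHIFT). The sketch below is a minimal illustration, not the paper's model: it assumes each utterance is already given as a salience-ranked list of entities, the function names are hypothetical, and the paper's own entity types (Epair, Einter, Eintra, Enon) and its p2-trans rule are not specified in this abstract and are therefore not implemented.

```python
from typing import List, Optional


def backward_center(prev_cf: List[str], cur_entities: List[str]) -> Optional[str]:
    """Cb(U_n): the highest-ranked entity of Cf(U_{n-1}) that is realized in U_n."""
    realized = set(cur_entities)
    for entity in prev_cf:
        if entity in realized:
            return entity
    return None


def transition(prev_cb: Optional[str], cur_cb: Optional[str], cur_cp: Optional[str]) -> str:
    """Classify the transition into U_n from Cb(U_{n-1}), Cb(U_n) and Cp(U_n)."""
    if cur_cb is None:
        return "NO-CB"  # assumption: utterances without a Cb get their own label
    same_cb = prev_cb is None or cur_cb == prev_cb
    if same_cb:
        return "CONTINUE" if cur_cb == cur_cp else "RETAIN"
    return "SMOOTH-SHIFT" if cur_cb == cur_cp else "ROUGH-SHIFT"


def classify_discourse(utterances: List[List[str]]) -> List[str]:
    """Label every transition between adjacent utterances.

    Each utterance is its Cf list, most salient entity first.
    """
    labels = []
    prev_cb = None
    for prev_cf, cur_cf in zip(utterances, utterances[1:]):
        cur_cb = backward_center(prev_cf, cur_cf)
        cur_cp = cur_cf[0] if cur_cf else None
        labels.append(transition(prev_cb, cur_cb, cur_cp))
        prev_cb = cur_cb
    return labels


# Hypothetical toy discourse; entity names are placeholders.
demo = [
    ["john", "mary"],
    ["john", "book"],
    ["mary", "john"],
    ["mary", "bill"],
    ["sue", "bill"],
    ["tom"],
]
print(classify_discourse(demo))
```

A generation model along the lines the abstract describes would use such transition labels as features when deciding whether an entity should surface as a zero pronoun; which features the authors actually use is not stated here.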


... Pronouns are function words that play a crucial part in all languages. Pronouns, both full and zero, have been studied worldwide: Spanish: Luis and D'Introno 2001; Thai: Aroonmanakun 2000; Chinese: Zhao and Ng 2007; Japanese: Kawahara and Kurohashi 2004; Korean: Roh and Lee 2006; Lao: Compton 1992; Turkish: Gürel 2002, to name a few. Both full (overt) pronouns and zero (covert) pronouns have been researched theoretically, and in different ways, across many language families. ...
Article
The aim of this study is to explore the zero pronouns in Tai Dam on the basis of a discourse approach using Systemic Functional Grammar (SFG). The objectives of this paper are as follows: (1) to explore the syntactic distribution of zero pronouns in Tai Dam, and (2) to analyze the discourse function of zero pronouns in the narrative discourses of Tai Dam. The research data used in this paper were collected from the texts of Tai Dam folktales of the people living in Petchaburi Province in central Thailand. Five folktales were analyzed for the study. The findings reveal that zero pronouns in Tai Dam can function in different ways. It has been found that zero pronouns can occur in both Theme and Rheme positions. In thematic positions, they function as unmarked topical themes. This study shows the different discourse functions of zero pronouns in terms of co-references. It has been found that there are two types of co-reference of zero pronouns in Tai dam as follows: (1) zero anaphora and (2) zero cataphora. The main discourse function of zero pronouns is to signal an active referent in narrative discourse.
... This problem is well known as the 'zero-pronoun problem'. It is common to both Korean and Japanese and is considered a very important problem in corpus annotation [9] and Korean language processing [3]. Roh and Lee adopt a linguistic theory called 'centering theory' to solve this problem [10]. While machine learning methods are common for this problem [12], they are not widely used for processing zero pronouns in Korean. ...
Article
Classifying the gender of names correctly is an important task in Korean-English machine translation. When a sentence is composed of two or more clauses and only one subject is given as a proper noun, the gender of that proper noun must be found to translate the sentence correctly, because a singular pronoun has a gender in English while it does not in Korean. More generally, this task can be expanded into the classification of general Korean names. This paper proposes a statistical method for this problem. By considering a name as just a sequence of syllables, it is possible to obtain statistics for each name from a collection of names. An evaluation of the proposed method shows an improvement in accuracy over simple lookup in the collection: the accuracy of the lookup method is 64.11%, while that of the proposed method is 81.49%. This implies that the proposed method is more plausible for the gender classification of Korean names.
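The idea of treating a name as a sequence of syllables and collecting statistics from a name list can be sketched as below. This is only one possible reading: it assumes a naive Bayes model with add-one smoothing over position-independent syllables, which the abstract does not confirm, and the six training names and their labels are hypothetical toy data. Each Hangul character is treated as one syllable.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, Tuple


def train(names: Iterable[Tuple[str, str]]):
    """Count syllables per gender label from (name, label) pairs."""
    syllable_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for name, label in names:
        label_counts[label] += 1
        for syllable in name:  # one Hangul character = one syllable
            syllable_counts[label][syllable] += 1
            vocabulary.add(syllable)
    return syllable_counts, label_counts, vocabulary


def classify(name: str, model) -> str:
    """Return the label with the highest smoothed log-probability."""
    syllable_counts, label_counts, vocabulary = model
    total_names = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_names)
        # Add-one smoothing; the extra +1 reserves mass for unseen syllables.
        denominator = sum(syllable_counts[label].values()) + len(vocabulary) + 1
        for syllable in name:
            score += math.log((syllable_counts[label][syllable] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy collection of given names (not data from the paper).
toy_names = [
    ("민준", "M"), ("준호", "M"), ("현준", "M"),
    ("서연", "F"), ("지연", "F"), ("수연", "F"),
]
model = train(toy_names)
print(classify("준서", model), classify("하연", model))
```

Unlike a plain lookup, this kind of model can still make a decision for a name that never appears in the collection, as long as some of its syllables do.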
Conference Paper
The common use of null arguments is one of the most critical issues in pro-drop languages. When translating Korean into other languages, the omitted elements should be replaced with appropriate pronouns to obtain grammatical target sentences. One of the most important issues when dealing with zero pronouns is determining their referentiality. Since, like expletive ‘it’ in English, omitted elements do not always have explicit referents, it is important to determine whether a zero pronoun is referential or not. In this paper, we focus on identifying non-referential zero pronouns. Since non-referential zero pronouns are likely to occur in similar contexts, referentiality determination is treated here as the identification of clauses containing non-referential zero pronouns. Our method outperforms the baseline systems using n-grams and bag of words, and achieves F-measures of 0.51 and 0.78.
Conference Paper
Machine translation systems still have various problems, although they have been developed continuously. In Korean-English translation systems in particular, the zero pronoun problem is important, since subjects or objects omitted in Korean must be restored in English. Various methods have been proposed to solve this problem. In this paper, we focus on the gender determination problem for Korean names as a first step toward solving the zero pronoun problem in Korean. Since this can be viewed as a binary classification problem, we adopt support vector machines, which are well known for solving binary classification. The bag-of-words model is used to represent a name with its context as a vector, and the information entropy of words is adopted for selecting features. An evaluation of the proposed method shows about 86% accuracy. This method achieves higher accuracy than a baseline that determines the gender of a name by its majority, and it additionally resolves the limitation of memory-based and statistical methods that use only names.
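Entropy-based selection of bag-of-words features can be sketched as follows. The abstract does not say which entropy criterion is used, so this sketch assumes information gain (the reduction in label entropy after splitting on a word's presence). The context words are hypothetical English stand-ins, and the support vector machine itself is not shown.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs: List[Set[str]], labels: List[str], word: str) -> float:
    """Entropy reduction obtained by splitting the documents on presence of `word`."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = (len(with_word) / n) * entropy(with_word) \
        + (len(without_word) / n) * entropy(without_word)
    return entropy(labels) - conditional


def select_features(docs: List[Set[str]], labels: List[str], k: int) -> List[str]:
    """Return the k words with the highest information gain (ties broken alphabetically)."""
    vocabulary = sorted({w for d in docs for w in d})
    ranked = sorted(vocabulary, key=lambda w: (-information_gain(docs, labels, w), w))
    return ranked[:k]


# Hypothetical contexts: each set holds the words surrounding one name.
contexts = [{"he", "said"}, {"he", "went"}, {"she", "said"}, {"she", "went"}]
genders = ["M", "M", "F", "F"]
print(select_features(contexts, genders, 2))
```

The selected words would then define the dimensions of the vector handed to the classifier; words such as "said" that appear equally with both labels carry no information and are dropped.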
Article
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and Naive-Bayes.
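A wrapper approach evaluates candidate feature subsets by running the learning algorithm itself on them. The sketch below shows greedy forward selection under stated assumptions: it swaps in a 1-nearest-neighbour classifier scored by leave-one-out accuracy for brevity (the abstract names decision trees and Naive-Bayes), the search strategy is plain hill climbing rather than any of the improved designs the abstract mentions, and the data set is a hypothetical toy example.

```python
from typing import List, Sequence, Tuple


def loo_accuracy(X: Sequence[Sequence[float]], y: Sequence[int], features: List[int]) -> float:
    """Leave-one-out accuracy of 1-nearest-neighbour using only `features`."""
    if not features:
        return 0.0
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_dist = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue  # hold out the example being classified
            dist = sum((xi[f] - xj[f]) ** 2 for f in features)
            if dist < best_dist:
                best_j, best_dist = j, dist
        if y[best_j] == y[i]:
            correct += 1
    return correct / len(X)


def forward_select(X: Sequence[Sequence[float]], y: Sequence[int]) -> Tuple[List[int], float]:
    """Greedily add the feature that most improves the wrapped estimate; stop when none does."""
    selected: List[int] = []
    best_score = 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the estimate
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X_toy = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.2], [1.1, 1.1], [1.2, 9.1]]
y_toy = [0, 0, 0, 1, 1, 1]
print(forward_select(X_toy, y_toy))
```

On this toy data the noisy feature hurts the nearest-neighbour estimate, so the search keeps only feature 0. A filter method would instead score features without consulting the classifier at all.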
Article
Centering theory is the best-known framework for theorizing about local coherence and salience; however, its claims are articulated in terms of notions which are only partially specified, such as “utterance,” “realization,” or “ranking.” A great deal of research has attempted to arrive at more detailed specifications of these parameters of the theory; as a result, the claims of centering can be instantiated in many different ways. We investigated in a systematic fashion the effect on the theory's claims of these different ways of setting the parameters. Doing this required, first of all, clarifying what the theory's claims are (one of our conclusions being that what has become known as “Constraint 1” is actually a central claim of the theory). Secondly, we had to clearly identify these parametric aspects: For example, we argue that the notion of “pronoun” used in Rule 1 should be considered a parameter. Thirdly, we had to find appropriate methods for evaluating these claims. We found that while the theory's main claim about salience and pronominalization, Rule 1—a preference for pronominalizing the backward-looking center (CB)—is verified with most instantiations, Constraint 1–a claim about (entity) coherence and CB uniqueness—is much more instantiation-dependent: It is not verified if the parameters are instantiated according to very mainstream views (“vanilla instantiation”), it holds only if indirect realization is allowed, and is violated by between 20% and 25% of utterances in our corpus even with the most favorable instantiations. We also found a trade-off between Rule 1, on the one hand, and Constraint 1 and Rule 2, on the other: Setting the parameters to minimize the violations of local coherence leads to increased violations of salience, and vice versa. Our results suggest that “entity” coherence—continuous reference to the same entities—must be supplemented at least by an account of relational coherence.
Article
Graphical presentations can be used to communicate information in relational data sets succinctly and effectively. However, novel graphical presentations that represent many attributes and relationships are often difficult to understand completely until explained. Automatically generated graphical presentations must therefore either be limited to generating simple, conventionalized graphical presentations, or risk incomprehensibility. A possible solution to this problem would be to extend automatic graphical presentation systems to generate explanatory captions in natural language, to enable users to understand the information expressed in the graphic. This paper presents a system to do so. It uses a text planner to determine the content and structure of the captions based on: (1) a representation of the structure of the graphical presentation and its mapping to the data it depicts, (2) a framework for identifying the perceptual complexity of graphical elements, and (3) the structure of the data expressed in the graphic. The output of the planner is further processed regarding issues such as ordering, aggregation, centering, generating referring expressions, and lexical choice. We discuss the architecture of our system and its strengths and limitations. Our implementation is currently limited to 2-D charts and maps, but, except for lexical information, it is completely domain independent. We illustrate our discussion with figures and generated captions about housing sales in Pittsburgh.