Conference Paper

Hybrid word sense disambiguation using language resources for transliteration of Arabic numerals in Korean.

DOI: 10.1145/1644993.1645053
Conference: Proceedings of the 2009 International Conference on Hybrid Information Technology, ICHIT 2009, Daejeon, Korea, August 27-29, 2009
Source: DBLP

ABSTRACT Arabic numerals occur frequently in informative texts, and their multiple senses and readings degrade the accuracy of TTS systems. This paper presents a hybrid word sense disambiguation method that exploits a tagged corpus and a Korean wordnet, KorLex 1.0, to convert Arabic numerals into Korean phonemes correctly and efficiently according to their senses. Individual contextual features are extracted from the tagged corpus and grouped in order to determine the sense of each Arabic numeral. Least upper bound synsets among the common hypernyms of the contextual features are obtained from the KorLex hierarchy and used as the semantic categories of those features. The semantic classes are used to train a decision tree that classifies the meaning and the reading of Arabic numerals, and to compose grapheme-to-phoneme rules for an automatic transliteration system for Arabic numerals. The proposed system outperforms the customized TTS systems by 3.9% to 20.3%.
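The least upper bound step in the abstract can be illustrated with a minimal sketch. It assumes a toy hierarchy in which every node has a single hypernym; the node names, the `HYPERNYM` table, and both function names are hypothetical and are not taken from KorLex or from the paper. A real wordnet allows multiple hypernyms per synset, so the paper's actual procedure may differ.

```python
# Toy hypernym hierarchy (child -> parent). Hypothetical data, not KorLex.
HYPERNYM = {
    "dollar": "monetary_unit",
    "won": "monetary_unit",
    "monetary_unit": "unit",
    "meter": "linear_unit",
    "kilometer": "linear_unit",
    "linear_unit": "unit",
    "unit": "entity",
}


def hypernym_path(word):
    """Return [word, parent, grandparent, ..., root]."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def least_upper_bound(words):
    """Return the most specific node that subsumes every word in `words`."""
    paths = [hypernym_path(w) for w in words]
    # Nodes that appear on every path are the common hypernyms.
    common = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    # Walking up from the first word, the first common node is the most specific.
    for node in paths[0]:
        if node in common:
            return node
    return None


if __name__ == "__main__":
    print(least_upper_bound(["dollar", "won"]))
    print(least_upper_bound(["dollar", "meter"]))
```

In this sketch, context words that share a specific ancestor collapse into one semantic category, which is the kind of grouping the abstract describes before the decision tree is trained.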

  • Source
    ABSTRACT: We propose Auto-TAN, an automatic system for transcribing Arabic numerals into Korean alphabetic letters using linguistic rules and clues. Few previous studies have discussed the problems in transcribing Arabic numerals into Korean text. We have suggested detailed NRF (number reading formula) paradigms, analyzed the structure of NUMEs (numerical expressions) and their components in a larger scope, and investigated compatibilities and selection rules among those components. Based on these linguistic features, 13 stereotyped patterns, 16 rules and 63 clues determining NRF types are formulated for Auto-TAN. The system works modularly in 5 steps. A pilot test was conducted with a test suite containing 56782 NUMEs. Encouraging results of 84.8% and 10.5% accuracy were obtained for unique transcription and multiple transcriptions, respectively.
    Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on; 11/2003
  • Source
    ABSTRACT: In addition to ordinary words and names, real text contains non-standard “words” (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary “letter-to-sound” rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text) with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types: news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques including n-gram language models, decision trees and weighted finite-state transducers to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
    Computer Speech & Language 07/2001
  • Source
    ABSTRACT: Word sense disambiguation has been recognized as a major problem in natural language processing research for over forty years. Both quantitative and qualitative methods have been tried, but much of this work has been stymied by difficulties in acquiring appropriate lexical resources. The availability of this testing and training material has enabled us to develop quantitative disambiguation methods that achieve 92% accuracy in discriminating between two very distinct senses of a noun. In the training phase, we collect a number of instances of each sense of the polysemous noun. Then in the testing phase, we are given a new instance of the noun, and are asked to assign the instance to one of the senses. We attempt to answer this question by comparing the context of the unknown instance with contexts of known instances using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval. The proposed method is probably most appropriate for those aspects of sense disambiguation that are closest to the information retrieval task. In particular, the proposed method was designed to disambiguate senses that are usually associated with different topics.
    Computers and the Humanities 11/1992; 26(5):415-439.
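
The training and testing phases described in the word sense disambiguation abstract above can be sketched as a small naive Bayes classifier over context words. This is only an illustration under stated assumptions: add-one smoothing, a bag-of-words context, the sense labels, the toy training instances, and the function names `train` and `classify` are all hypothetical and are not taken from the cited work.

```python
import math
from collections import Counter, defaultdict


def train(instances):
    """Collect counts from (sense, context_words) training instances."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in instances:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def classify(model, words):
    """Assign a new context to the sense with the highest log posterior."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior of the sense
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                # Add-one smoothing (an assumption of this sketch).
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best


if __name__ == "__main__":
    # Hypothetical contexts for two topical senses of one noun.
    data = [
        ("river", ["water", "shore", "fish"]),
        ("river", ["water", "boat"]),
        ("finance", ["money", "loan", "interest"]),
        ("finance", ["money", "deposit"]),
    ]
    model = train(data)
    print(classify(model, ["water", "fish"]))
    print(classify(model, ["money", "loan"]))
```

Because the score depends only on which words co-occur with each sense, a classifier of this kind fits senses that differ by topic, which matches the scope the abstract claims for its method.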

