Conference Proceeding

Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines.

01/2001; In: Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium, November 27-30, 2001, Hitotsubashi Memorial Hall, National Center of Sciences, Tokyo, Japan
Source: DBLP

ABSTRACT The accuracy of part-of-speech (POS) tagging for unknown words is substantially lower than that for known words. Considering the high accuracy rate of up-to-date statistical POS taggers, unknown words account for a non-negligible portion of the errors. This paper describes POS prediction for unknown words using Support Vector Machines. We achieve high accuracy in POS tag prediction using substrings and surrounding context as the features. Furthermore, we integrate this method with a practical English POS tagger, and achieve an accuracy of 97.1%, higher than conventional approaches.
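
The abstract names two feature families for unknown-word POS prediction: substrings of the word and its surrounding context. The sketch below illustrates one plausible way to build such features; it is not the authors' implementation. The function name `unknown_word_features`, the affix length limit of 4, the two-word context window, and the `<PAD>` marker are all assumptions made for illustration.

```python
from typing import Dict, List


def unknown_word_features(tokens: List[str], i: int, max_affix: int = 4) -> Dict[str, int]:
    """Build binary features for the token at position i (hypothetical helper)."""
    word = tokens[i]
    feats: Dict[str, int] = {}

    # Substring features: prefixes and suffixes of up to max_affix characters.
    # (The limit of 4 is an assumption, not a value taken from the abstract.)
    for n in range(1, min(max_affix, len(word)) + 1):
        feats[f"prefix={word[:n]}"] = 1
        feats[f"suffix={word[-n:]}"] = 1

    # Context features: two words on either side (assumed window size),
    # with a padding marker where the window runs past the sentence edge.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        neighbour = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
        feats[f"word[{offset:+d}]={neighbour}"] = 1

    return feats


sentence = "The committee approved the reorganization yesterday".split()
features = unknown_word_features(sentence, 4)  # features for "reorganization"
for name in sorted(features):
    print(name)
```

Feature dictionaries of this kind could then be converted to sparse vectors and passed to any SVM implementation, with one classifier per tag or a multi-class scheme; which of these the paper uses is not stated in the abstract.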

  • Source
    ABSTRACT: Many studies have tried to search for useful information on the Internet using meaningful terms or words. The performance of these approaches is often affected by the accuracy of unknown word extraction and POS tagging, which in turn is affected by the size of the training corpora and the characteristics of the language. This work proposes and develops a method that concentrates on tagging the POS of Chinese unknown words for the domain of our interest, based on the integration of morphological and contextual rules with a statistics-based method. Experimental results indicate that the proposed method can overcome the difficulties resulting from small corpora in oriental languages, and can accurately tag unknown words with POS in domain-specific small corpora.
    Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on; 09/2010
  • Source
    ABSTRACT: In this paper, several methods are combined to improve the accuracy of an HMM-based POS tagger for Bahasa Indonesia. The first method employs an affix tree that covers word suffixes and prefixes. The second uses the succeeding POS tag as one of the features for the HMM. The last uses an additional lexicon (from KBBI-Kateglo) to limit the candidate tags produced by the affix tree. The HMM model was built on a 15,000-token corpus. In the experiment, on a test corpus with 15% OOV words, the best accuracy was 96.50%, with 99.4% for in-vocabulary words and 80.4% for OOV (out-of-vocabulary) words. The experiment showed that the affix tree and additional lexicon are effective in increasing the POS tagger accuracy, while the use of the succeeding POS tag does not give much improvement in OOV handling.
    4th International MALINDO (Malaysian-Indonesian Language) Workshop; 01/2010
  • ABSTRACT: The maximum entropy (ME) method is a powerful supervised machine learning technique that is useful for various tasks. In this paper, we introduce new studies that successfully employ ME for natural language processing (NLP) problems including machine translation and information extraction. Specifically, we demonstrate, using simulation results, three applications of ME for NLP: estimation of categories, extraction of important features, and correction of error data items. We also evaluate the comparative performance of the proposed ME methods with other state-of-the-art approaches.
    Cognitive Computation 01/2010; 2:272-279. · 0.87 Impact Factor

