Conference Paper

Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines.

Conference: Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium, November 27-30, 2001, Hitotsubashi Memorial Hall, National Center of Sciences, Tokyo, Japan
Source: DBLP


The accuracy of part-of-speech (POS) tagging for unknown words is substantially lower than that for known words. Considering the high accuracy rate of up-to-date statistical POS taggers, unknown words account for a non-negligible portion of the errors. This paper describes POS prediction for unknown words using Support Vector Machines. We achieve high accuracy in POS tag prediction using substrings and surrounding context as the features. Furthermore, we integrate this method with a practical English POS tagger, and achieve accuracy of 97.1%, higher than conventional approaches.
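
The abstract names two feature families, substrings of the unknown word and its surrounding context. The following is a minimal sketch of how such features might be extracted before being passed to an SVM. The function name `unknown_word_features`, the affix length limit of 4, the context window of 2, and the use of neighbouring POS tags are all illustrative assumptions; the abstract does not specify the paper's actual feature set.

```python
from typing import Dict, List, Optional


def unknown_word_features(
    words: List[str],
    index: int,
    tags: Optional[List[Optional[str]]] = None,
    max_affix_len: int = 4,   # assumed limit, not taken from the paper
    window: int = 2,          # assumed context width, not taken from the paper
) -> Dict[str, int]:
    """Return binary features for the (unknown) word at words[index]."""
    word = words[index]
    features: Dict[str, int] = {}

    # Substring features: every prefix and suffix up to max_affix_len characters.
    for n in range(1, min(max_affix_len, len(word)) + 1):
        features[f"prefix={word[:n]}"] = 1
        features[f"suffix={word[-n:]}"] = 1

    # Context features: neighbouring words and, where already known, their tags.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        if 0 <= j < len(words):
            features[f"word[{offset:+d}]={words[j].lower()}"] = 1
            if tags is not None and tags[j] is not None:
                features[f"tag[{offset:+d}]={tags[j]}"] = 1
        else:
            # Mark positions that fall outside the sentence.
            features[f"word[{offset:+d}]=<pad>"] = 1

    return features


# Hypothetical example: "blorfed" stands in for a word missing from the lexicon.
sentence = ["The", "committee", "blorfed", "quickly"]
known_tags = ["DT", "NN", None, "RB"]
feats = unknown_word_features(sentence, 2, known_tags)
```

Each feature dictionary could then be turned into a sparse vector and used to train one binary SVM per POS tag, though the choice of kernel and training setup is left open here because the abstract does not describe it.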

Available from: Yuji Matsumoto, Feb 07, 2015
  • Source
    • "Scott M. Thede and Mary Harper [5] in their paper presented an approach using morphology and syntactic parsing rules in post-mortem method for determining the probable lexical classes of words. Tetsuji, Taku Kudoh and Yuji [6] proposed a POS tagging approach for unknown English words using Support Vector Machines (SVM). SVM classifiers are created for each POS tag using all words in the training set, then POS tags to unknown words predict using those classifiers. "
    ABSTRACT: Part-of-speech (POS) tagging is a vital task in Natural Language Processing (NLP) for any language. It involves analysing the construction, behaviour, and dynamics of the language, knowledge that can be used in computational linguistic analysis and automation applications. In this context, dealing with unknown words (words that do not appear in the lexicon) is also an important task, since NLP systems are used in more and more new applications. One aid in predicting the lexical categories of unknown words is the use of syntactic knowledge of the language. In this research, the distinction between open-class and closed-class words is used together with syntactic features of the language to predict the lexical categories of unknown words during tagging. An experiment is performed to investigate the ability of the approach to parse unknown words using syntactic knowledge without human intervention. The experiment shows that the performance of the tagging process is enhanced when the word class distinction is used together with syntactic rules to parse sentences containing unknown words in the Sinhala language.
  • Source
    • "In this table, $w_i$ and $t_i$ denote the lexicons and POS tag for the i-th word in a sentence respectively. The POS tags for following words are obtained from a two-pass approach proposed by Nakagawa et al. [23]. The combinations of POS tags from previous words ($t_{i-2} \cdot t_{i-1}$) and those from next words ($t_{i+1} \cdot t_{i+2}$) are adopted to reflect interaction between POS tags of surrounding words."
    ABSTRACT: All types of part-of-speech (POS) tagging errors have been equally treated by existing taggers. However, the errors are not equally important, since some errors affect the performance of subsequent natural language processing (NLP) tasks seriously while others do not. This paper aims to minimize these serious errors while retaining the overall performance of POS tagging. Two gradient loss functions are proposed to reflect the different types of errors. They are designed to assign a larger cost to serious errors and a smaller one to minor errors. Through a set of POS tagging experiments, it is shown that the classifier trained with the proposed loss functions reduces serious errors compared to state-of-the-art POS taggers. In addition, the experimental result on text chunking shows that fewer serious errors help to improve the performance of subsequent NLP tasks.
    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1; 07/2012
  • Source
    • "There are several techniques available and approved for realizing this classification task. Referred to section 3.1 SVMs can be used for such a task as applied by (Nakagawa et al., 2001). But there are several other ways for accomplishing this classification behavior like using Bayesian approaches (Goldwater and Griffiths, 2005). "
    ABSTRACT: The concept of so-called Thought Bubbles deals with the problem of finding appropriate new connections within social networks, especially Twitter. As a byproduct of exploring new users, Tweets are classified and rated and are used to generate a kind of news feed that extends the personal Twitter feed. Each user has several interests that can be classified by first evaluating their Tweets and then evaluating the user's existing contacts. By categorizing a user and their related connections, the user can be placed in an imaginary category-specific subset of users, called a Thought Bubble. Following the trace of people who are active within the same Thought Bubble should reveal interesting and helpful connections between like-minded users.
    Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies; 01/2012