Conference Paper

Poster: A Novel Approach for Parts of Speech (PoS) tagging of Pashto Language


Abstract

Pashto is a language of the Indo-European family, spoken mainly in South Asian countries, especially Pakistan and Afghanistan. Building software that translates Pashto sentences into other languages, and building Natural Language Understanding (NLU) applications for interactive Pashto software, requires a well-defined corpus and a Part-of-Speech (POS) tagging approach. A well-defined corpus is developed by scraping data from different websites. As Pashto is somewhat similar to Persian and Arabic, the dataset is created according to the guidelines written for these languages. Gated neural network architectures show better performance than the traditional machine learning techniques used in POS tagging.
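
The abstract does not say which gated architecture was used. Purely as an illustration of what "gated" means in this setting, the following sketch implements one common gated unit, a GRU cell, in plain Python and runs it over a sequence of word vectors to pick a tag per token. Everything named here (`GRUCell`, `tag_sequence`, the sizes, the tag list, the random weights) is a hypothetical choice made for this example and is not taken from the paper. The weights are untrained, so the predicted tags are arbitrary; the point is only the data flow through the update and reset gates.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng):
    # Small random weights; a real tagger would learn these from a tagged corpus.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Illustrative GRU cell (hypothetical helper, not the paper's model)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One (W, U, b) triple each for the update gate z, reset gate r,
        # and candidate state h.
        self.params = {
            name: (
                init_matrix(hidden_size, input_size, rng),
                init_matrix(hidden_size, hidden_size, rng),
                [0.0] * hidden_size,
            )
            for name in ("z", "r", "h")
        }

    def _affine(self, name, x, h):
        W, U, b = self.params[name]
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._affine("z", x, h)]  # update gate
        r = [sigmoid(v) for v in self._affine("r", x, h)]  # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in self._affine("h", x, reset_h)]
        # Blend old state and candidate; some references swap z and (1 - z).
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def tag_sequence(cell, embeddings, output_weights, tags):
    # Run the cell left to right and choose the highest-scoring tag per token.
    h = [0.0] * cell.hidden_size
    predicted = []
    for x in embeddings:
        h = cell.step(x, h)
        scores = matvec(output_weights, h)
        predicted.append(tags[scores.index(max(scores))])
    return predicted


if __name__ == "__main__":
    rng = random.Random(42)
    tags = ["NOUN", "VERB", "ADJ"]  # toy tag list, not a Pashto tagset
    cell = GRUCell(input_size=4, hidden_size=5)
    sentence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
    output_weights = init_matrix(len(tags), 5, rng)
    print(tag_sequence(cell, sentence, output_weights, tags))
```

In practice such a cell would be trained on the tagged corpus and would normally come from a deep learning library rather than being written by hand; the hand-written version is used here only to make the gating arithmetic visible.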

Article
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents. Word embedding, meanwhile, has been demonstrated to be a powerful representation for characterizing the statistical properties of natural language. In this study, we propose to use BLSTM-RNN with word embedding for the part-of-speech (POS) tagging task. When tested on the Penn Treebank WSJ test set, a state-of-the-art performance of 97.40% tagging accuracy is achieved. Without using morphological features, this approach can also achieve performance comparable with the Stanford POS tagger.
Conference Paper
In natural language processing, part-of-speech tagging plays a vital role. It is a significant prerequisite for putting a human language on the engineering track. Before developing a part-of-speech tagger, a tagset is required for that language. This paper is about the first-ever rule-based part-of-speech tagging system for the Pashto language and a tagset that helps in the development of a parser for the said language [8]. A very simple architecture is applied that gives reasonably good accuracy.
Conference Paper
While building a machine translation system, the embedded part-of-speech (POS) tagger deserves special attention. The first-ever tagset discussed here is created in accordance with the EAGLES guidelines. These guidelines were written for the languages of the European Union, but they can also be applied to the Pashto language. This paper presents the creation process of the Pashto tagset, which helps in the development of a POS tagger.
Article
The Penn Treebank, in its eight years of operation (1989-1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All are available at http://www.ldc.upenn.edu.
Article
We present a character-based model for joint segmentation and POS tagging for Chinese. The bidirectional RNN-CRF architecture for general sequence tagging is adapted and applied with novel vector representations of Chinese characters that capture rich contextual information and lower-than-character level features. The proposed model is extensively evaluated and compared with a state-of-the-art tagger on CTB5, CTB9 and UD Chinese. The experimental results indicate that our model is accurate and robust across datasets of different sizes, genres and annotation schemes. We obtain state-of-the-art performance on CTB5, achieving 94.38 F1-score for joint segmentation and POS tagging.
Article
Part-of-Speech (POS) tagging is a process that assigns each word in a sentence a suitable tag from a given set of tags. POS tagging is important in various areas of Natural Language Processing. Different methods of automating the process have been developed and employed for English and other Western languages. Some similar work, most of which uses stochastic approaches to POS tagging, has also been done for South Asian languages. We experimented with some of the widely used approaches to POS tagging on three South Asian languages, Bangla, Hindi and Telugu, using corpora of different sizes. We observed the performance of the approaches and found the performance of Brill’s transformation-based tagger to be superior to the other approaches in all of our experiments, though the use of this approach has been very limited until recently.
A corpus-based study of Pashto
  • M A Khan
  • F T Zuhra
Keras: The Python deep learning library
  • Chollet