Conference Paper

Shallow parsing in Turkish


... As a result of these two processes, the semantically and morphologically analyzed sentences were finally passed through the shallow parser. In the shallow parse, the sentences in the database were manually annotated, without regard to their hierarchical structure, under five main categories: subject, object, predicate, adverbial complement, and oblique complement [23]. ...
Conference Paper
Full-text available
This study aims to create a dependency analysis scheme tailored to the needs arising from the distinctive typology of Turkish. Starting from Universal Dependencies, rules suited to the structure of Turkish were derived, and automatic dependency analysis was performed on a tourism-domain database. Within this project, dependency analysis was carried out for 51,185 words belonging to 20,000 tourism-domain sentences. To achieve a consistent and accurate dependency analysis, data from three different sources were used: semantic annotation, morphological analysis, and shallow parsing.
... For the POS tag annotation in Bakay's work, the work of Topsakal et al. [23] was taken as the baseline. Topsakal et al. created a parser that gives sufficient information about the syntactic segments of a sentence, a rarely studied topic for Turkish because of the language's agglutinative nature. ...
Conference Paper
Full-text available
In this study, the work of Bakay et al. [1] and Yıldız et al. [2] on Turkish constituency treebanks was developed further. Compared to the previous work, the most prominent feature of this study is that every annotation and refinement step was carried out manually. In addition, the constituency treebank created in this study abides by the syntactic rules and typological features of Turkish, whereas the trees created by previous studies are merely translated and simply inverted trees that completely ignore the syntactic properties of Turkish. The methodology followed in this study resulted in a significantly more accurate representation of the Turkish language and in simpler, relatively flatter trees. This straightforward tree style reduces complexity and offers a better training dataset for learning algorithms.
... Although not all the layers are fully annotated yet, the corpus currently consists of over 9,600 sentences. The preliminary version of this dataset was previously used in NER [16], shallow parsing [17], and WSD [18] tasks. ...
Conference Paper
In this paper, we present the first multilayer annotated corpus for Turkish, which is a low-resourced agglutinative language. Our dataset consists of 9,600 sentences translated from the Penn Treebank Corpus. Annotated layers contain syntactic and semantic information including morphological disambiguation of words, named entity annotation, shallow parse, sense annotation, and semantic role label annotation.
Conference Paper
Full-text available
In this paper, we report our preliminary efforts in building an English-Turkish parallel treebank corpus for statistical machine translation. In the corpus, we manually generated parallel trees for about 5,000 sentences from Penn Treebank. English sentences in our set have a maximum of 15 tokens, including punctuation. We constrained the translated trees to the reordering of the children and the replacement of the leaf nodes with appropriate glosses. We also report the tools that we built and used in our tree translation task.
Conference Paper
Full-text available
State-of-the-art phrase chunking focuses on English and shows high accuracy with very basic word features such as the word itself and the POS tag. In case of morphologically rich languages like Turkish, basic features are not sufficient. Moreover, phrase chunking may not be appropriate and the "chunk" term should be redefined for these languages. In this paper, the first study on Turkish constituent chunking using two different methods is presented. In the first method, we directly extracted chunks from the results of the Turkish dependency parser. In the second method, we experimented with a CRF-based chunker enhanced with morphological and contextual features using the annotated sentences from the Turkish dependency treebank. The experiments showed that the CRF-based chunking augmented with extra features outperforms the baseline chunker with basic features and dependency parser-based chunker. Overall, we developed a CRF-based Turkish chunker with an F-measure of 91.95 for verb chunks and 87.50 for general chunks.
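The BIO-style chunk representation that such taggers produce can be sketched as follows. This is an illustrative decoder only (the tag set and example tokens are hypothetical, not taken from the paper):

```python
def bio_to_chunks(tokens, tags):
    """Group (token, BIO-tag) pairs into labeled chunks.

    Tags follow the common BIO scheme: B-X begins a chunk of type X,
    I-X continues it, and O marks tokens outside any chunk.
    """
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]
```

Chunk-level F-measures such as those reported above are computed by comparing the spans this kind of decoder yields for the gold and predicted tag sequences.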
Article
Full-text available
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
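The context-window idea underlying these models can be illustrated by skip-gram training-pair generation, where each word predicts its neighbors. This is a simplified sketch of the general technique, not the paper's implementation:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs for a skip-gram model.

    Each word is paired with every word within `window` positions of
    it; such pairs are what a word2vec-style model is trained on.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```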
Article
Full-text available
The identification of noun groups in text is a well researched task and serves as a pre-step for other natural language processing tasks, such as the extraction of keyphrases or technical terms. We present a first version of a noun group chunker that, given an unannotated text corpus, adapts itself to the domain at hand in an unsupervised way. Our approach is inspired by findings from cognitive linguistics, in particular the division of language into open-class elements and closed-class elements. Our system extracts noun groups using lists of closed-class elements and one linguistically inspired seed extraction rule for each open class. Supplied with raw text, the system creates an initial validation set for each open class based on the seed rules and applies a bootstrapping procedure to mutually expand the set of extraction rules and the validation sets. Possibly domain-dependent information about open-class elements, as for example provided by a part-of-speech lexicon, is not used by the system in order to ensure the domain-independence of the approach. Instead, the system adapts itself automatically to the domain of the input text by bootstrapping domain-specific validation lists. An evaluation of our system on the Wall Street Journal training corpus used for the CoNLL 2000 shared task on chunking shows that our bootstrapping approach can be successfully applied to the task of noun group chunking.
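The mutual expansion of extraction rules and validation sets described above can be sketched with a single toy seed rule (a word following a determiner is a noun candidate). The rule, the data, and all names here are illustrative, not the paper's actual rule set:

```python
def bootstrap_nouns(sentences, determiners, rounds=2):
    """Toy sketch of mutual bootstrapping for noun-group extraction.

    Seed rule: a word following a determiner is validated as a noun.
    Expansion: left-context words of validated nouns become new
    extraction patterns, which validate more nouns in the next round.
    """
    patterns = set(determiners)  # extraction patterns start from closed-class seeds
    nouns = set()
    for _ in range(rounds):
        # Apply the current patterns to validate new nouns.
        for sent in sentences:
            for prev, word in zip(sent, sent[1:]):
                if prev in patterns:
                    nouns.add(word)
        # Expand the pattern set from left contexts of validated nouns.
        for sent in sentences:
            for prev, word in zip(sent, sent[1:]):
                if word in nouns:
                    patterns.add(prev)
    return nouns
```

Starting from the determiner "the", the loop learns that "my" also precedes nouns and then validates words it precedes, which is the essence of mutually expanding rules and validation lists.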
Article
Full-text available
We present the issues that we have encountered in designing a treebank architecture for Turkish along with the rationale for the choices we have made for various representation schemes. In the resulting representation, the information encoded in the complex agglutinative word structures is represented as a sequence of inflectional groups separated by derivational boundaries. The syntactic relations are encoded as labeled dependency relations among segments of lexical items marked by derivation boundaries. Our current work involves refining a set of treebank annotation guidelines and developing a sophisticated annotation tool with an extendable plug-in architecture for morphological analysis, morphological disambiguation and syntactic annotation disambiguation.
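In this representation a derived word carries several inflectional groups (IGs), conventionally written with a `^DB` derivational-boundary marker between them. A minimal sketch of splitting such an analysis string (the exact tag names in the example are illustrative):

```python
def split_igs(analysis):
    """Split a morphological analysis string into inflectional groups.

    Inflectional groups are separated by the derivational boundary
    marker ^DB, so e.g. a noun-to-adjective derivation yields two IGs.
    """
    return analysis.split("^DB")
```

Dependency relations in the treebank then attach to individual IGs rather than to whole word forms.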
Article
Full-text available
Thesis (Ph.D.), Harvard University, 1985. Includes bibliographical references (leaves 282-287). Microfilm.
Article
Noun phrase chunking is a sub-category of shallow parsing that can be used for many natural language processing tasks. In this paper, we propose a noun phrase chunker system for Turkish texts. We use a weighted constraint dependency parser to represent the relationship between sentence components and to determine noun phrases. The dependency parser uses a set of hand-crafted rules which can combine morphological and semantic information for constraints. The rules are suitable for handling complex noun phrase structures because of their flexibility. The developed dependency parser can be easily used for shallow parsing of all phrase types by changing the employed rule set. The lack of reliable human-tagged datasets is a significant problem for natural language studies on Turkish. Therefore, we constructed a noun phrase dataset for Turkish. According to our evaluation results, our noun phrase chunker gives promising results on this dataset.
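A hand-crafted phrase rule of the kind such systems employ can be sketched over a POS-tag sequence. The single rule below (zero or more adjectives followed by one or more nouns) is purely illustrative; the paper's constraint rules additionally combine morphological and semantic information:

```python
def np_chunks(pos_tags):
    """Return (start, end) spans of noun phrases over a POS sequence.

    One illustrative rule: zero or more ADJ followed by one or more NOUN.
    """
    spans, i, n = [], 0, len(pos_tags)
    while i < n:
        start = i
        while i < n and pos_tags[i] == "ADJ":  # optional adjective run
            i += 1
        if i < n and pos_tags[i] == "NOUN":    # head noun(s) required
            while i < n and pos_tags[i] == "NOUN":
                i += 1
            spans.append((start, i))
        else:
            i = start + 1  # no noun head: not a phrase, advance one token
    return spans
```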
Conference Paper
In this paper, we report our work on chunking in Turkish. We used the data that we generated by manually translating a subset of the Penn Treebank. We exploited the already available tags in the trees to automatically identify and label chunks in their Turkish translations. We used conditional random fields (CRF) to train a model over the annotated data. We report our results on different levels of chunk resolution.
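A typical feature template for a CRF chunker of this kind pairs each token with its neighbors' surface forms and POS tags. The template below is a common illustrative choice, not the paper's exact feature set:

```python
def token_features(sent, i):
    """Build a CRF feature dict for token i of sent.

    `sent` is a list of (word, pos) pairs; features combine the current
    token with its immediate neighbors, a standard template for
    CRF-based chunking.
    """
    word, pos = sent[i]
    feats = {"word": word.lower(), "pos": pos, "suffix3": word[-3:]}
    if i > 0:
        pw, pp = sent[i - 1]
        feats.update({"-1:word": pw.lower(), "-1:pos": pp})
    else:
        feats["BOS"] = True  # sentence-initial marker
    if i < len(sent) - 1:
        nw, np_ = sent[i + 1]
        feats.update({"+1:word": nw.lower(), "+1:pos": np_})
    else:
        feats["EOS"] = True  # sentence-final marker
    return feats
```

One such feature dict per token, together with the gold chunk tags, forms a training instance for the CRF.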
The function of word order in Turkish Grammar
  • E. T. Erguvanlı