Conference Paper

Türkçe Anlamsal Söylem ve Cümle Benzerliği Analizleri İçin Veri Kümesi Oluşturma Yöntemi (A Dataset Construction Method for Turkish Semantic Discourse and Sentence Similarity Analyses)

The aim of our natural language processing study is to build a dataset for semantic discourse analysis at the paragraph-sentence level and for textual similarity measurement at the paragraph-sentence and sentence-sentence levels in Turkish. The multiple-choice questions used as input are Turkish questions that appeared in exams administered by the Measurement, Selection and Placement Center (ÖSYM) of the Republic of Turkey. Four question types in two categories were identified for the targeted approaches: (i) detecting where the flow of a paragraph is broken, (ii) finding the correct order of sentences, (iii) finding the sentence that conveys what is meant by an expression in the paragraph, and (iv) finding the sentences closest in meaning. After all stages of data collection, preparation, morphological tagging, and format conversion, the final dataset contains 434 questions for semantic discourse analysis and 539 questions for textual similarity analysis.
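The abstract describes the dataset's contents but not its storage layout. As an illustrative sketch only (the field names, toy sentences, and dict-based shape are all assumptions, not the paper's actual schema), one record for question type (ii), sentence ordering, could be modeled like this:

```python
# Hypothetical record for a sentence-ordering question (type ii).
# Field names and sentences are illustrative, not the dataset's schema.
record = {
    "id": "ordering-0001",
    "question_type": "sentence_ordering",   # one of types (i)-(iv)
    "sentences": [                           # shuffled sentences shown to the solver
        "Sabah erkenden yola çıktık.",
        "Akşam bir köyde konakladık.",
        "Öğlen nehir kenarında mola verdik.",
    ],
    "answer_order": [0, 2, 1],               # gold ordering of the indices above
}

def restore_paragraph(rec):
    """Re-assemble the paragraph according to the gold ordering."""
    return " ".join(rec["sentences"][i] for i in rec["answer_order"])
```

A solver would then be scored on whether its predicted permutation matches `answer_order`.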

References
In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors--namely, the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD)--that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.
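The setup that study found best, pointwise mutual information vectors from small co-occurrence windows compared with cosine, can be sketched in a few lines. This is a toy illustration: the `ppmi_vectors` helper and its unsmoothed PMI estimate are simplifications, not the authors' code.

```python
import math
from collections import Counter

def ppmi_vectors(tokens, window=2):
    """Positive-PMI co-occurrence vectors from a token list.

    A rough toy-scale estimate; real studies use large corpora and
    often smooth or reweight the probabilities.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pair_counts[(w, tokens[j])] += 1
    total_pairs = sum(pair_counts.values())
    total_words = len(tokens)
    vocab = sorted(word_counts)
    vectors = {}
    for w in vocab:
        row = []
        for c in vocab:
            p_wc = pair_counts[(w, c)] / total_pairs
            p_w = word_counts[w] / total_words
            p_c = word_counts[c] / total_words
            pmi = math.log(p_wc / (p_w * p_c)) if p_wc > 0 else 0.0
            row.append(max(0.0, pmi))  # positive PMI: clip negatives to zero
        vectors[w] = row
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Since PPMI entries are non-negative, cosine similarity here always falls in [0, 1].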
Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer & Dumais, 1997). The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word lexical priming data; and, as reported in 3 following articles in this issue, it accurately estimates passage coherence, learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay.
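LSA's core computation, a truncated SVD of a term-document count matrix, fits in a short sketch. The toy matrix, the Turkish example terms, and the rank choice k=2 below are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
terms = ["gemi", "deniz", "yıldız", "uzay"]
X = np.array([
    [2, 1, 0],   # "gemi" (ship)  occurs in the sea-themed documents
    [1, 2, 0],   # "deniz" (sea)
    [0, 0, 2],   # "yıldız" (star) occurs in the space-themed document
    [0, 1, 2],   # "uzay" (space)
], dtype=float)

# LSA: keep only the top-k singular directions, projecting terms into
# a low-rank "latent semantic" space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # k-dimensional term representations

def cos(a, b):
    """Cosine similarity between two latent term vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In the latent space, terms that share document contexts (such as "gemi" and "deniz" here) end up closer than terms that do not.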
Work on modeling semantics in text is progressing quickly, yet there are few existing public datasets which authors can use to measure and compare their systems. This work takes a step towards addressing this issue. We present the MSR Sentence Completion Challenge Data, which consists of 1,040 sentences, each of which has four impostor sentences, in which a single (fixed) word in the original sentence has been replaced by an impostor word with similar occurrence statistics. For each sentence the task is then to determine which of the five choices for that word is the correct one. This data was constructed from Project Gutenberg data. Seed sentences were selected from Sherlock Holmes novels, and then impostor words were suggested with the aid of a language model trained on over 500 19th century novels. The language model was used to compute 30 alternative words for a given low frequency word in a sentence, and human judges then picked the 4 best impostor words, based on a set of provided guidelines. Although the data presented here will not be changed, this is still a work in progress, and we plan to add similar datasets based on other sources. This technical report is a living document and will be updated appropriately as new datasets are constructed and new results on existing datasets (for example, using human subjects) are reported.
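The scoring side of such a completion task can be mimicked with any language model. Below is a hedged sketch using a tiny add-one-smoothed bigram model in place of the 500-novel model described above; the corpus, blank marker `"___"`, and candidate words are made up for the example.

```python
import math
from collections import Counter

def train_bigram_lm(corpus_tokens):
    """Tiny add-one-smoothed bigram LM; a toy stand-in for the much
    larger 19th-century-novel model the challenge authors used."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)

    def logprob(tokens):
        # Sum of smoothed bigram log-probabilities over the sentence.
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
            for a, b in zip(tokens, tokens[1:]))

    return logprob

def best_completion(logprob, template_tokens, candidates):
    """Fill the blank ("___") with each candidate and keep the
    highest-scoring filled sentence."""
    scored = [(logprob([w if t == "___" else t for t in template_tokens]), w)
              for w in candidates]
    return max(scored)[1]
```

The challenge's real evaluation is the same shape: score five filled sentences and pick the argmax.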
In this paper, we propose a classification-based approach to morphological disambiguation for the Turkish language. Due to the complex morphology of Turkish, a word can take an unlimited number of affixes, resulting in very large tag sets. The problem is defined as choosing one of the parses of a word, without taking the existing root word into consideration. We trained our model with well-known classifiers using the WEKA toolkit and tested it on a common test set. The best performance achieved is 95.61%, by the J48 tree classifier.
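The classification framing, score each candidate morphological parse and pick one, can be sketched without WEKA. The tiny Naive Bayes below is only a stand-in for the J48 decision tree the authors used, and the parse features and toy training data are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(instances):
    """instances: list of (feature_dict, label) pairs, label 1 meaning
    'this parse is the correct disambiguation'. Returns a scorer."""
    label_counts = Counter(lbl for _, lbl in instances)
    feat_counts = defaultdict(Counter)   # per-label (feature, value) counts
    for feats, lbl in instances:
        for f, v in feats.items():
            feat_counts[lbl][(f, v)] += 1

    def logprob(feats, lbl):
        lp = math.log(label_counts[lbl] / len(instances))
        for f, v in feats.items():
            lp += math.log((feat_counts[lbl][(f, v)] + 1)
                           / (label_counts[lbl] + 2))  # add-one smoothing
        return lp

    return logprob

def pick_parse(logprob, candidate_parses):
    """Choose the candidate parse the classifier deems most likely correct."""
    return max(candidate_parses,
               key=lambda feats: logprob(feats, 1) - logprob(feats, 0))
```

The paper's pipeline differs in classifier and features, but the disambiguation step has this shape: one classification decision over a word's candidate parses.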
S. Korkmaz, Son 52 Yıl YGS-LYS Paragraf Soruları ve Çözümleri. Akıllı Adam Yayınları.
T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.