Constructing A Turkish Corpus for Paraphrase
Identification and Semantic Similarity
Asli Eyecioglu, Bill Keller
University of Sussex
Department of Informatics
Brighton, BN19QJ, UK
A.Eyecioglu@sussex.ac.uk, billk@sussex.ac.uk
Abstract. The Paraphrase identification (PI) task has practical importance for
work in Natural Language Processing (NLP) because of the problem of linguis-
tic variation. Accurate methods should help improve performance of key NLP
applications. Paraphrase corpora are important resources in developing and
evaluating PI methods. This paper describes the construction of a paraphrase
corpus for Turkish. The corpus comprises pairs of sentences with semantic sim-
ilarity scores based on human judgments, permitting experimentation with both
PI and semantic similarity. We believe this is the first such corpus for Turkish.
The data collection and scoring methodology is described and initial PI experi-
ments with the corpus are reported. Our approach to PI is novel in using
'knowledge lean' methods (i.e. no use of manually constructed knowledge bases
or processing tools that rely on these). We have previously achieved excellent
results using such techniques on the Microsoft Research Paraphrase Corpus,
and close to state-of-the-art performance on the Twitter Paraphrase Corpus.
Keywords. Paraphrase Identification · Turkish · Corpora Construction · Knowledge-Lean · Paraphrasing · Sentential Semantic Similarity
1 Introduction
Paraphrase identification (PI) may be defined as “the task of deciding whether two
given text fragments have the same meaning” [22]. The PI task has practical im-
portance for Natural Language Processing (NLP) because of the need to deal with the
pervasive problem of linguistic variation. Accurate methods should help improve the
performance of NLP applications, including machine translation, information retrieval
and question answering, amongst others. Acquired paraphrases have been shown to
improve Statistical Machine Translation (SMT) systems [5, 24, 28], for example.
To support the development and evaluation of PI methods, the creation of paraphrase corpora plays an important part. This paper describes the construction and characteristics of a new paraphrase corpus for Turkish. The Turkish Paraphrase Corpus¹ (TuPC) comprises pairs of sentences, together with associated semantic similarity scores based on human judgements. The corpus thus supports the study of PI methods that make simple, binary judgements (i.e. paraphrase v. non-paraphrase), as well as methods capable of assigning finer-grained judgements of semantic similarity.
To our knowledge, this is the first such corpus made available for Turkish. The
corpus will be of value as a novel resource for other researchers working on PI and
semantic similarity in Turkish. It should also be of wider interest to researchers investigating paraphrase and cross-linguistic techniques. We report the results of initial PI experiments, which may serve as a baseline for other researchers. Our approach is
novel in using 'knowledge lean' techniques. By 'knowledge lean' we mean that we do
not make use of manually constructed knowledge bases or processing tools that rely
on these. As far as possible, we work with just tokenised text and statistical tools.
The work described here continues the authors' earlier investigations into the extent
to which PI techniques requiring few external resources may achieve results compa-
rable to methods employing more sophisticated NLP processing tools and knowledge
bases. We have already demonstrated that excellent results can be obtained using
combinations of features based on lexical and character n-gram overlap, together with
Support Vector Machine (SVM) classifiers. The approach has been shown to perform
well on the Microsoft Research Paraphrase Corpus (MSRPC) and recently attained
near state-of-the-art performance on the Twitter Paraphrase Corpus (TPC).
In the rest of the paper we briefly introduce current work on paraphrase corpora
and then describe the methods that were used to construct the TuPC. Our methodolo-
gy should be of interest to researchers considering building paraphrase corpora for
other languages. This may be particularly the case for languages with few existing
language processing resources. We then describe our approach to PI using
knowledge-lean techniques and report on initial experiments with the new corpus.
2 Paraphrase Corpora
The creation of paraphrase corpora in recent years has been important in promoting
research into methods for PI. Paraphrase corpora provide a basis for training and
evaluation of models of paraphrase. The MSRPC [11, 30] was constructed by initial-
ly extracting candidate sentence pairs from a collection of topically clustered news
data. A classifier was used to identify a total of 5801 paraphrase pairs and “near-miss”
non-paraphrases. These were inspected by human annotators and labeled as either
paraphrase (3900) or non-paraphrase (1901). The Plagiarism Detection Corpus (PAN)
was constructed by aligning sentences from 41,233 plagiarised documents and was made available for PI tasks by Madnani et al. [23], who also published initial experimental results. SemEval-2015 Task 1, "Paraphrase and Semantic Similarity in Twitter" [35], involves predicting whether two tweets have the same meaning. Training and test data are provided in the form of the TPC [34], which was constructed semi-randomly and annotated via Amazon Mechanical Turk by 5 annotators.
¹ TuPC can be downloaded from: http://aslieyecioglu.com/data/
Paraphrase corpora have been constructed for languages other than English. The TICC Headline Paraphrase Corpus [33] is a collection of English and Dutch headlines, while the Hebrew Paraphrase Corpus [32] implements a scoring scheme with sentence pairs labelled as paraphrase, non-paraphrase or partial paraphrase. WiCoPaCo [26] is a corpus for French which includes text normalizations and corrections; however, it was not constructed solely for paraphrasing tasks, and no scoring scheme is supplied. A Turkish Paraphrase Corpus [10] has previously been reported, but it is not widely available and does not provide negative instances or a scoring scheme. Consequently, it is less useful for research purposes.
The Microsoft Research Video Description Corpus [7] consists of multilingual sentences produced by Mechanical Turk workers, each of whom described a short video in a single sentence. The multilingual paraphrase database PPDB [18] has been constructed using pivot language techniques [2] and recently expanded to more than 20 languages. It does not currently provide sentential paraphrases, but a substantial quantity of phrasal paraphrases is available in many languages.
3 A Paraphrase Corpus for Turkish
Turkish belongs to the Turkic group of languages of western and central Asia, commonly placed within the Altaic family. It is a highly inflected and agglutinative language: affixes are used productively, to change either the meaning or the stress of a word. Turkish and English use the same Latin alphabet, except for a few letters. Words are space-separated and follow subject-object-verb order.
Our strategy for creating the TuPC combines the methodologies of two previously constructed corpora, the MSRPC [30] and the TPC [34]. We automatically extracted and paired sentences from daily news websites. Candidate pairs were then hand-annotated by inspection of their context. In contrast to the MSRPC, but like the TPC, candidate pairs were scored according to semantic similarity.
3.1 Data Collection
We implemented a simple web crawler to extract plain text and cluster it according to
a pre-selected list of sub-topics. A list of URLs was gathered from a website that links
to most daily Turkish newspapers. A further list of URLs for each newspaper was
then extracted. Widely used heading tags were identified by looking at the categories on each website. For instance, the latest news can be found under the heading "last minute" on one site, and under "latest" on another. A subset of popular headings was established to extract topic-related news. Some examples are:
[haber, gundem, guncel, sondakika, haberler, turkiye, haberhatti, …]
This step serves as an initial filter that limits the topics to "news", "latest", "sports" and so on, rather than including articles related to "travel", "fashion" and other miscellaneous texts. For each category, we gathered all news from the different news sites. We collected data in two phases, between 4th and 14th May 2015 and then between 17th and 23rd June 2015, because the news sites tend not to update all of their content each day, which can lead to duplicate data. Approximately 10k lines of text were clustered according to topic each day.
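As an illustration of the collection step, the following is a minimal sketch of crawling newspaper pages and bucketing extracted link text by heading category. It is not the exact crawler we used: the heading list is abbreviated, the URLs are assumed to be supplied, and the parsing relies on the common requests and BeautifulSoup libraries.

```python
# Illustrative sketch of the collection step: fetch pages from a list of
# newspaper URLs and bucket extracted link text by heading category.
# The heading list is a placeholder, not the exact one used for TuPC.
from collections import defaultdict

import requests
from bs4 import BeautifulSoup

HEADINGS = ["haber", "gundem", "guncel", "sondakika", "haberler", "turkiye"]

def collect(newspaper_urls):
    clusters = defaultdict(list)  # heading category -> list of text snippets
    for url in newspaper_urls:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=True):
            href = link["href"].lower()
            # Keep only links whose path mentions one of the news headings,
            # filtering out "travel", "fashion" and other miscellaneous topics.
            for heading in HEADINGS:
                if heading in href:
                    clusters[heading].append(link.get_text(strip=True))
                    break
    return clusters
```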
The resulting data were texts with HTML mark-up, which was removed to leave plain text. Duplicate lines were then removed, and a sentence segmentation tool [3] was trained on a small collection of Turkish text for use on our dataset. The tool was very successful in splitting paragraphs into sentences, although manual correction was also applied in some cases. We believe that the segmentation tool could be trained on a larger collection of Turkish text for better results. The resulting text was not adjusted for case, and numbers and named entities were not replaced with generic names (unlike in the MSRPC).
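The segmentation step can be reproduced with the trainable Punkt segmenter available in the cited toolkit [3] (NLTK). The following sketch shows the training pattern, assuming a file of raw Turkish text; the file name and example sentence are illustrative.

```python
# Minimal sketch of training the unsupervised Punkt sentence segmenter from
# the cited toolkit [3] (NLTK) on raw Turkish text.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

with open("turkish_training_text.txt", encoding="utf-8") as f:
    raw_text = f.read()

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations and abbreviations
trainer.train(raw_text, finalize=True)

segmenter = PunktSentenceTokenizer(trainer.get_params())
sentences = segmenter.tokenize("Toplantı sona erdi. Açıklama daha sonra yapıldı.")
```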
3.2 Generating Candidate Pairs
To generate candidate pairs we first removed any sentence considered too short (less
than 5 words) or overly long (greater than 40 words). This criterion is adapted from
the methodology adopted for the construction of the MSRPC. Next, all pairs of sen-
tences were considered as candidate paraphrase pairs and initially filtered according
to the following criteria, which are based on previous methods used for paraphrase
corpora construction. A candidate pair is removed if:
1. the lexical overlap between the pair is less than five words; or
2. the absolute difference in length between the pair is more than seven words.
Our text has relatively high lexical overlap, due to the presence of named entities, relatively long sentences and the prevalence of stop words. Consequently, the lexical overlap criterion adopted for the MSRPC (fewer than four words) is too stringent in our case. Our second criterion is similar to the MSRPC's word-based length difference between sentences, which is defined as 66.6%. For the TPC, tweets of length less than three were filtered out and the remaining candidate pairs were pruned if their Jaccard similarity score was less than 0.5. The filtering criteria we applied are broadly comparable to those of the MSRPC, owing to the similarity of the source data. Finally, to make the selection of candidate paraphrases easier, we retained only pairs with at least three overlapping words after removing stop words.
This filtering process suffices to exclude pairs that are unsuitable as candidates. In addition, once two sentences have been paired, both are excluded from further pairs, so that each sentence is used only once. Despite this, we obtained a great number of candidate pairs, approximately 5K per day, and so we handpicked a final set of 1002 candidate pairs. The filtering stage is sketched below.
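The following is a minimal sketch of the filter. The length bounds and thresholds follow the text above; the stop word list is assumed to be supplied externally, and the retention of pairs with at least three overlapping content words reflects our reading of the final selection step.

```python
# Sketch of the candidate pair filter described in Sect. 3.2.
def candidate_pairs(sentences, stop_words):
    used, pairs = set(), []
    for i, s1 in enumerate(sentences):
        if i in used:
            continue
        for j in range(i + 1, len(sentences)):
            if j in used:
                continue
            t1, t2 = s1.split(), sentences[j].split()
            if not (5 <= len(t1) <= 40 and 5 <= len(t2) <= 40):
                continue  # sentence too short or overly long
            if len(set(t1) & set(t2)) < 5:
                continue  # criterion 1: lexical overlap below five words
            if abs(len(t1) - len(t2)) > 7:
                continue  # criterion 2: length difference above seven words
            content = (set(t1) - stop_words) & (set(t2) - stop_words)
            if len(content) < 3:
                continue  # content-word overlap after stop word removal
            pairs.append((s1, sentences[j]))
            used.update({i, j})  # each sentence is used in at most one pair
            break
    return pairs
```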
3.3 Similarity Scoring
Our annotation method follows that used in creating the TPC. Rather than simply
label sentence pairs as paraphrase or non-paraphrase, a finer-grained, semantic simi-
larity score is assigned. This annotation scheme is richer and more general, since se-
mantic similarity scores can be converted to paraphrase judgments (but not vice-
versa). We followed the guidelines provided for the semantic similarity task [1] and
annotators scored sentence pairs directly, according to the criteria shown in Table 1.
A similar scoring scheme has been adopted for the TPC. However, in that case, scores
were generated based on counts of binary judgments made by a number of annotators.
Table 1. Sentential semantic similarity scores for candidate paraphrase pairs

5 - IDENTICAL          Completely equivalent; mean the same.
4 - CLOSE              Mostly equivalent; some small details differ.
3 - RELATED            Roughly equivalent; some important information differs/is missing.
2 - CONTEXT            Not equivalent, but share some details.
1 - SOMEWHAT RELATED   Not equivalent, but on the same topic.
0 - UNRELATED          On different topics.
Eight native speakers of Turkish were recruited as annotators. The data were split into two halves, and a different group of four annotators judged each half. We prepared a set of annotation guidelines to explain the scoring process. The guidelines introduced the similarity scores, along with an example pair and a short explanation for each score. The example pairs were selected by asking three further native speakers to score multiple samples; only pairs to which all three assigned exactly the same score were used.
In order to further clarify the task, a small preliminary exercise was completed prior to annotation. Two short videos were prepared, adapted from the Microsoft Research (MSR) Video Description Corpus [7], for which the objective had been to generate a paraphrase evaluation corpus by asking multilingual Mechanical Turk workers to describe short videos. Providing our annotators with other describers' sentences for the same videos is a heuristic approach designed to familiarize them with the task. Each annotator summarized the videos in one sentence; afterwards, we gathered all the annotators' descriptions and showed them how the interpretation or wording for the same video can differ.
The annotators noted that this small exercise gave them a better understanding of the scoring task. The video description sentences were collected and also included in the set of guidelines. After completing this preliminary exercise, sentence pairs were sent to each annotator. We did not impose a time restriction for the full task, because annotators noted that some pairs were confusing and required more time.
For the purpose of experimenting with simple PI, the assigned scores were also converted to binary labels. First, each annotator's scores were converted to binary labels by taking sentence pairs marked as identical (5), close (4) or related (3) as positive (i.e. paraphrase) and those marked context (2), somewhat related (1) or unrelated (0) as negative (i.e. non-paraphrase). The number of positive and negative decisions for each instance may then be summarised as a pair. For example, (1,3) shows that only one annotator tagged the pair as a paraphrase, while the remaining three labelled it a non-paraphrase. Table 2 shows the criteria for the binary judgement, based on the numbers of annotators' answers. Note that where there are equal numbers of positive and negative decisions, we consider a pair 'debatable'. This approach is similar to the TPC labelling method, which is also based on the agreement between annotators.
Table 2. Criteria for the binary judgement, based on the numbers of annotators' answers

Numbers of answers    Judgement
(4,0); (3,1)          Positive (1)
(0,4); (1,3)          Negative (0)
(2,2)                 Debatable
For semantic similarity, we also provide mean scores. These range between 1.75
and 3.00 for debatable pairs, whereas positive pairs are higher than 3.00 and negative
pairs are scored less than 1.75. Additionally, the criteria defined in Table 2 can be
interpreted in a range between 0 and 1 as follows: (4,0): 1.0; (3,1): 0.75; (2,2): 0.50;
(1,3): 0.25 and (0,4): 0.0.
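The conversion from raw annotator scores to a binary label and a mean similarity score can be summarised in a few lines. The following sketch encodes the rules above; the assertions use the debatable sample pair shown in Table 3.

```python
# Sketch of converting four annotators' 0-5 similarity scores into a binary
# paraphrase label and a mean similarity score, following Table 2.
def to_binary_label(scores):
    positives = sum(1 for s in scores if s >= 3)  # 5, 4, 3 count as paraphrase
    negatives = len(scores) - positives           # 2, 1, 0 as non-paraphrase
    if positives > negatives:
        return 1      # (4,0) or (3,1)
    if negatives > positives:
        return 0      # (0,4) or (1,3)
    return None       # (2,2): debatable, excluded from the PI experiments

def mean_similarity(scores):
    return sum(scores) / len(scores)

# The debatable sample pair from Table 3: two positive and two negative votes.
assert to_binary_label([4, 2, 3, 0]) is None
assert abs(mean_similarity([4, 2, 3, 0]) - 2.25) < 1e-9
```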
Table 3 presents three sample pairs from the data. Each pair is shown with the
scores of 4 different annotators and the average similarity scores. The debatable pair
has been scored (4,2,3,0) by four annotators, and the average similarity is 2.25.
Table 3. Sample pairs from TuPC

Debatable - scores (4,2,3,0), average 2.25:
  İşadamı Ethem Sancak, Aydın Doğan ve Ertuğrul Özkök ile ilgili "Bazı şeyleri açıklarsam Türkiye'de barınamazlar" dedi.
  24 TV'de konuşan İşadamı Ethem Sancak, Doğan Grubu, Aydın Doğan ve Ertuğrul Özkök'le ilgili çarpıcı açıklamalar yaptı.

Positive - scores (3,4,5,5), average 4.25:
  Çekilişte 10 rakamı isabet eden 15 kupondan 13'ünün Muğla'nın Yatağan ilçesindeki bayilere yatırıldığı ortaya çıktı.
  10 rakamı isabet eden 15 kupondan 13'ü Muğla'nın Yatağan ilçesindeki bayi ya da bayilerden yatırıldı.

Negative - scores (1,3,0,0), average 1.00:
  Toplam konut satışları içerisinde ipotekli satış payının en yüksek olduğu il ise yüzde 52,9 ile Kars oldu.
  Toplam konut satışları içinde ilk satışın payı yüzde 45,4 oldu.
3.4 Inter-annotator Agreement
Assigning similarity scores was a challenging task for the annotators. They noted that
the difficulty lay not in deciding whether or not two sentences were semantically
similar, but in determining the precise degree of similarity. To understand how con-
sistent the annotators were with one another we first applied Cohen’s Kappa [8] as a
measure of inter-annotator agreement between pairs of annotators. In general, a Kap-
pa coefficient value of 1 indicates strong agreement between annotators, whereas a
result of 0 shows an agreement by chance. A negative value (<0) indicates no agree-
ment. Interpretation of intermediate values is subject to debate [19] and we report
Landis and Koch’s [21] agreement interpretation scores. Kappa scores in our case
ranged from 0.23 (“fair agreement”) to 0.56 (“moderate agreement”).
Cohen’s Kappa can only be used for pairs of annotators and so cannot reflect the
inter-rater reliability of the full dataset. To compute agreement across all annotators,
we used Fleiss’s Kappa [17]. We aggregated the data by concatenating the two halves.
Although different annotators scored each part of the data, Fleiss’s Kappa relies only
on the number of scores given to each instance. We calculated the statistic on the full
dataset for both semantic similarity and binary paraphrase judgments. The results are
reported in Table 4.
Table 4. Fleiss's Kappa computed for the two judgement criteria

Judgement criterion                 Fleiss's Kappa
Semantic similarity scores (0-5)    0.17
Binary paraphrase judgements        0.42
The results show "moderate agreement" at the level of binary paraphrase judgements. We note that this is entirely consistent with the inter-rater reliability of about 0.40 reported amongst annotators of the TPC [34]. The results show only "slight agreement" with regard to semantic similarity, on the other hand. The lower score is to be expected, as the binary judgement is a more coarse-grained measure of semantic equivalence. Lower agreement for the finer-grained scoring confirms the annotators' sense that it is harder to be precise about the degree of similarity. Despite this, the inter-rater agreement for binary judgements demonstrates a broad consensus between annotators' scores.
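The agreement statistics can be reproduced with standard library routines. The following sketch assumes the scores are held in an (items × annotators) array and uses scikit-learn and statsmodels implementations; this is our choice of tooling for illustration, as the paper does not specify its own, and the three-row array is toy data (the Table 3 sample pairs) included only to show the expected shapes.

```python
# Sketch of the agreement computations of Sect. 3.4.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

scores = np.array([[4, 2, 3, 0],
                   [3, 4, 5, 5],
                   [1, 3, 0, 0]])  # (items x annotators); toy data

# Pairwise Cohen's Kappa, here between annotators 0 and 1.
kappa_01 = cohen_kappa_score(scores[:, 0], scores[:, 1])

# Fleiss's Kappa across all annotators: convert raw ratings to per-item
# category counts first.
counts, _ = aggregate_raters(scores)
fk = fleiss_kappa(counts)
```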
3.5 Corpus Statistics
The TuPC comprises 1002 sentence pairs labelled for both sentential semantic simi-
larity and paraphrase. After converting scores to binary labels, we obtained 563 posi-
tive, 285 negative and 154 debatable pairs. Excluding the 154 debatable pairs, TuPC
has 848 sentence pairs that can be used for the PI task. A breakdown of the TuPC
according to agreement between 4 annotators is shown in Table 5.
Table 5. TuPC data statistics

Value        Agreement    Number of pairs    Total
Positive     (4,0)        376                563
             (3,1)        187
Debatable    (2,2)        154                154
Negative     (1,3)        151                285
             (0,4)        134
There are various ways to split a dataset into train and test sets. The TPC has a relatively small test set compared to its training set (838 and 11530 sentence pairs respectively, after removing debatable pairs). The train/test percentages for the MSRPC are 70.3% and 29.7% respectively, while for PAN they are approximately 77% and 23%. For the TuPC we selected 60% (500 pairs) for training and 40% (348 pairs) for testing. The TuPC contains 339 positive and 161 negative sentence pairs in the train set, and 224 positive and 124 negative pairs in the test set. Note that the TuPC was shuffled randomly before being split into train and test sets. A naive baseline obtained by labelling every sentence pair as positive (i.e. paraphrase) yields an accuracy of 0.6639 (the overall proportion of positive pairs, 563/848) and an F-score of 0.798.
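For reference, the baseline figures follow directly from the class distribution:

```python
# The naive all-positive baseline over the 848 labelled pairs.
pos, neg = 563, 285
accuracy = pos / (pos + neg)       # 0.6639: every pair labelled positive
precision, recall = accuracy, 1.0  # all true positives are recovered
f_score = 2 * precision * recall / (precision + recall)  # 0.798
```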
4 Knowledge-Lean Paraphrase Identification
Much PI research makes use of existing NLP tools and other resources. Duclaye et al. [12] exploit the NLP tools of a question-answering system. Finch et al. [16], Mihalcea et al. [27], Fernando and Stevenson [15], Malakasiotis [25], and Das and Smith [9] employ lexical semantic similarity information based on resources such as WordNet [14]. A number of researchers have investigated whether state-of-the-art results can be obtained without the use of such tools and resources. Socher et al. [31] train recursive neural network models that learn vector space representations of single words and multi-word phrases. Blacoe and Lapata [4] use distributional methods to find the compositional meaning of phrases and sentences. Lintean and Rus [22] consider overlap of word unigrams and bigrams. Finch et al. [16] combine several MT metrics and use them as features. Madnani et al. [23] combine different machine translation quality metrics and outperform most existing methods.
The work reported in this paper is part of on-going research that aims to investigate
the extent to which knowledge-lean techniques may help to identify paraphrases. By
knowledge-lean we mean that little or no use is made of manually constructed
knowledge bases or processing tools that rely on these. As far as possible, we work
with just text. We previously presented a knowledge-lean approach to identifying Twitter paraphrase pairs, using character and word n-gram features with SVM classifiers. Our system was ranked first out of 18 submitted systems in SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter [35]. We demonstrated that better results can be obtained using fewer but more informative features. Our approach already outperforms the most sophisticated methods applied to the TPC [13], and competitive results have been obtained on the MSRPC and PAN. In the following, we report initial experiments applying these techniques to the Turkish Paraphrase Corpus.
4.1 Overlap Features
Text pre-processing is essential to many NLP applications. It may involve tokenising,
removal of punctuation, part-of-speech tagging, morphological analysis, lemmatisa-
tion, and so on. For identifying paraphrases, this may not always be appropriate. Re-
moving punctuation and stop words, or performing lemmatisation results in a loss of
information that may be critical in terms of PI. We therefore keep text pre-processing
to a minimum. The TuPC includes punctuation and spelling errors but these have not
been hand-corrected. We tokenise by splitting at white space. Lowercasing was also
performed, as for other corpora with which we have experimented.
We consider different representations of a text as a set of tokens, where a token may be either a word or a character n-gram. For the work described here we restrict attention to word and character unigrams and bigrams. The use of word n-grams in a variety of machine translation metrics [23] motivated their use in representing texts for this task. In particular, word bigrams may provide potentially useful syntactic information about a text. Character bigrams, on the other hand, allow us to capture similarity between related word forms. Overlap features are constructed using basic set operations:
Size of union (U): the size of the union of the tokens in the two texts of a candi-
date paraphrase pair.
Size of intersection (N): the number of tokens common to the texts of a candidate
paraphrase pair.
In addition, we consider text size (S). For a given pair of sentences, feature S1 is the size of the set of tokens representing the first sentence and S2 is the size of the second. Knowing the union, intersection or size of a text in isolation may not be very informative. However, for a given token type, these four features in combination provide potentially useful information about the similarity of two texts. The four features (U, N, S1 and S2) are computed for word and character unigrams and bigrams: U and N over the four token types yield eight overlap features, while S1 and S2 provide four ways of measuring the size of each text (one per token type). Each data instance is then a vector of features representing a pair of texts.
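The feature extraction reduces to elementary set operations. The following is a minimal sketch for one token type; the Turkish example strings are illustrative only.

```python
# Sketch of the overlap features: each text becomes a set of tokens, and the
# features are simple set sizes.
def char_ngrams(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def word_ngrams(text, n):
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_features(text1, text2, tokenize):
    s1, s2 = tokenize(text1), tokenize(text2)
    return [len(s1 | s2),  # U: size of union
            len(s1 & s2),  # N: size of intersection
            len(s1),       # S1: size of first text
            len(s2)]       # S2: size of second text

# C2 features (character bigrams) for a lowercased, illustrative pair.
feats = overlap_features("toplantı ankara'da yapıldı",
                         "toplantı ankara'da gerçekleşti",
                         lambda t: char_ngrams(t, 2))
```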
4.2 Method
We trained SVM classifiers, using implementations from scikit-learn [29]. We report
results obtained using Support Vector Classifier (SVC), which was adapted from
libsvm [6] by embedding different kernels. We have experimented with both linear
and Radial Basis Function (RBF) kernels. Linear kernels are known to work well with
large datasets and RBF kernels are the first choice if a small number of features is
applied [20]. Both cases apply to our experimental datasets. Both linear and RBF
classifiers are used with their default parameters.
In keeping with earlier experiments, we applied a simple scaling method to the features. This is a form of standardisation, also called the "z-score" in statistics, in which the transformed variable has a mean of 0 and a variance of 1. Each feature value x is scaled by subtracting the feature's mean μ_x and dividing by the standard deviation σ_x, giving a new feature value x' (Equation 1). Apart from this simple scaling, features are kept as they are.

x' = (x − μ_x) / σ_x        (1)
In all experiments, 10-fold cross-validation was applied. We combined the whole dataset (train and test) and calculated the feature values, which were split into 10 folds after applying the simple scaling. Each of the 10 folds was used in turn as the test set, with the remaining 9 folds as training data. Both the linear and RBF kernels were tested with character and word unigram and bigram features.
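The classification setup can be sketched with scikit-learn as follows. Note that, unlike the procedure described above (scaling computed over the combined dataset), this pipeline re-estimates the scaling within each training fold, which avoids information leaking from the test folds; results may therefore differ slightly from ours. X is assumed to be an (instances × features) array and y the binary labels.

```python
# Sketch of the classification setup: z-score scaling followed by an SVC with
# a linear or RBF kernel (default parameters), under 10-fold cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(X, y, kernel="linear"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    return scores.mean()
```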
4.3 Results
In Table 6, C1 and C2 each denote four features (U, N, S1 and S2) produced by char-
acter unigrams and bigrams, respectively. Similarly, W1 and W2 each denote four
features generated by word unigrams and bigrams. Combinations such as C1W2 rep-
resent eight features (those for C1 plus those for W2) and the notation C12 abbrevi-
ates the combination of both C1 and C2, etc.
Table 6. Character and word n-gram results on TuPC

            SVM (linear kernel)            SVM (RBF kernel)
Features    Acc.   Pre.   Rec.    F-sc.    Acc.   Pre.   Rec.   F-sc.
C1          66.4   66.4   100.0   79.8     68.8   68.7   97.3   80.5
C2          77.5   80.8   86.9    83.7     76.4   78.8   88.3   83.3
W1          73.6   75.5   89.3    81.8     72.2   75.4   86.5   80.5
W2          71.2   72.7   90.9    80.8     71.3   73.4   89.2   80.5
C1C2        76.8   80.5   86.0    83.1     75.5   77.6   88.8   82.8
W1W2        73.7   76.1   88.5    81.7     73.7   76.7   86.9   81.4
C1W1        72.6   75.1   88.1    81.0     72.4   74.7   88.6   81.0
C2W2        76.7   80.3   86.1    83.1     76.4   78.7   88.5   83.3
C1W2        71.1   73.2   89.4    80.4     71.6   72.9   91.3   81.0
C2W1        76.4   79.6   86.9    83.0     76.3   78.7   88.3   83.2
C12W12      77.5   81.0   86.5    83.6     74.5   77.0   88.1   82.1
The presence of the C2 features is observed to lead to the best results. Indeed, C2 alone produces the best result overall, with an accuracy of 77.5 and an F-score of 83.7 when combined with the linear SVM classifier. This comfortably beats the 'naïve' baseline. The C2 features also provide the best results for the RBF kernel. The results are consistent with our earlier work on the MSRPC, PAN and TPC, where the inclusion of character bigram overlap features also helped achieve the best results. We hypothesize that measuring the overlap of character bigrams provides a way of detecting similarity between related word forms, and thus performs a similar function to stemming or lemmatization. For the MSRPC, PAN and TPC, comparable 10-fold cross-validation experiments have shown that the feature combination C2W1 is robust in yielding optimal results. The combination C2W1 gives the best results for the PI task when used with either the linear (MSRPC) or RBF (PAN and TPC) kernel. For both PAN and TPC, performance is at state-of-the-art level for PI.
5 Conclusion
We have described the creation of a paraphrase corpus for Turkish. As far as we are
aware, the TuPC is unique in providing both positive and negative instances and is
currently the only paraphrase corpus for PI and semantic textual similarity available
in Turkish. It may be noted that the TuPC is relatively small compared to other para-
phrase corpora. However, the methods used to create it provide for a diverse set of
paraphrase examples.
We have detailed the methods used to gather raw data from news websites and to
score candidate pairs. These methods will be of value to others interested in creating
paraphrase corpora and could be applied to generate additional paraphrase data for the
TuPC. The main obstacle to producing data is the relatively time-consuming process
of scoring candidate pairs. In future we may investigate the possibility of developing
methods for crowdsourcing, for example through games for linguistic annotation.
A novel knowledge-lean approach to PI using character and word n-gram features
and SVM classifiers has been presented. Our approach has already been shown to
outperform more sophisticated methods applied to the TPC, and competitive results
have also been obtained for the MSRPC and PAN. The results obtained for the TuPC
cannot be compared directly to those for the other paraphrase corpora, but it is notable
that the same features (in particular, overlap of character bigrams) yield the best re-
sults. The performance of our approach on the TuPC provides other PI researchers
with a more realistic and challenging baseline than the naïve comparator of labeling
all instances as paraphrases. We are now investigating methods for determining se-
mantic similarity and intend to report on our results in another paper.
References
1. Agirre, E. et al.: SemEval-2012 Task 6: A pilot on semantic textual similarity. In: Proceedings of the 6th International Workshop on Semantic Evaluation, in conjunction with the First Joint Conference on Lexical and Computational Semantics. pp. 385–393 (2012).
2. Bannard, C., Callison-Burch, C.: Paraphrasing with Bilingual Parallel Corpora. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. pp. 597–604 (2005).
3. Bird, S. et al.: Natural Language Processing with Python. O'Reilly Media Inc. (2009).
4. Blacoe, W., Lapata, M.: A Comparison of Vector-based Representations for Semantic Composition. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '12). pp. 546–556 (2012).
5. Callison-Burch, C. et al.: Improved Statistical Machine Translation Using Paraphrases. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '06). pp. 17–24 (2006).
6. Chang, C., Lin, C.: LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2, 3, 1–27 (2011).
7. Chen, D.L., Dolan, W.B.: Collecting Highly Parallel Data for Paraphrase Evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT '11). pp. 190–200 (2011).
8. Cohen, J.: A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 1, 37–46 (1960).
9. Das, D., Smith, N.A.: Paraphrase Identification as Probabilistic Quasi-synchronous Recognition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09). pp. 468–476 (2009).
10. Demir, S. et al.: Turkish Paraphrase Corpus. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC '12). pp. 4087–4091 (2012).
11. Dolan, B. et al.: Unsupervised Construction of Large Paraphrase Corpora. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING '04). p. 350, NJ, USA (2004).
12. Duclaye, F. et al.: Using the Web as a Linguistic Resource for Learning Reformulations Automatically. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC '02). pp. 390–396, Las Palmas, Canary Islands, Spain (2002).
13. Eyecioglu, A., Keller, B.: ASOBEK: Twitter Paraphrase Identification with Simple Overlap Features and SVMs. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). pp. 64–69, Denver, Colorado (2015).
14. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press (1998).
15. Fernando, S., Stevenson, M.: A Semantic Similarity Approach to Paraphrase Detection. In: Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics. pp. 45–52 (2008).
16. Finch, A. et al.: Using Machine Translation Evaluation Techniques to Determine Sentence-level Semantic Equivalence. In: Proceedings of the Third International Workshop on Paraphrasing (IWP2005). pp. 17–24 (2005).
17. Fleiss, J.L.: Measuring Nominal Scale Agreement Among Many Raters. Psychol. Bull. 76, 378–382 (1971).
18. Ganitkevitch, J. et al.: PPDB: The Paraphrase Database. In: Proceedings of NAACL-HLT. pp. 758–764, Atlanta, Georgia (2013).
19. Gwet, K.L.: Handbook of Inter-rater Reliability. Advanced Analytics, Gaithersburg (2012).
20. Hsu, C.-W. et al.: A Practical Guide to Support Vector Classification. Technical report, Department of Computer Science, National Taiwan University (2008).
21. Landis, J.R., Koch, G.G.: The Measurement of Observer Agreement for Categorical Data. Biometrics. 33, 1, 159–174 (1977).
22. Lintean, M., Rus, V.: Dissimilarity Kernels for Paraphrase Identification. In: Proceedings of the 24th International Florida Artificial Intelligence Research Society Conference. pp. 263–268, Palm Beach, FL (2011).
23. Madnani, N. et al.: Re-examining Machine Translation Metrics for Paraphrase Identification. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT '12). pp. 182–190, PA, USA (2012).
24. Madnani, N. et al.: Using Paraphrases for Parameter Tuning in Statistical Machine Translation. In: Proceedings of the Second Workshop on Statistical Machine Translation (WMT '07), Prague, Czech Republic (2007).
25. Malakasiotis, P.: Paraphrase Recognition Using Machine Learning to Combine Similarity Measures. In: Proceedings of the ACL-IJCNLP 2009 Student Research Workshop. pp. 27–35, Suntec, Singapore (2009).
26. Max, A., Wisniewski, G.: Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC '10). pp. 3143–3148 (2010).
27. Mihalcea, R. et al.: Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In: Proceedings of the 21st National Conference on Artificial Intelligence, Volume 1. pp. 775–780. AAAI Press (2006).
28. Owczarzak, K. et al.: Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation. In: Proceedings of the Workshop on Statistical Machine Translation (StatMT '06). pp. 86–93, Stroudsburg, PA, USA (2006).
29. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., et al.: Scikit-learn: Machine Learning in Python. http://scikit-learn.org/stable/
30. Quirk, C. et al.: Monolingual Machine Translation for Paraphrase Generation. In: Proceedings of EMNLP 2004. pp. 142–149 (2004).
31. Socher, R. et al.: Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In: Advances in Neural Information Processing Systems. pp. 801–809 (2011).
32. Stanovsky, G.: A Study in Hebrew Paraphrase Identification. Ben-Gurion University of the Negev (2012).
33. Wubben, S. et al.: Creating and Using Large Monolingual Parallel Corpora for Sentential Paraphrase Generation. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC '10). pp. 4292–4299 (2010).
34. Xu, W.: Data-driven Approaches for Paraphrasing Across Language Variations. PhD thesis, New York University (2014).
35. Xu, W. et al.: SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT). In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). (2015).