R. Maluleka, I.V. Sochenkov
Query Formulation for Source Retrieval based on Named
Entities and N-grams Extraction
Abstract. This paper presents an approach for the source retrieval task using two distinct keyphrase extraction strategies,
namely n-grams from chunked text and named entities. The proposed approach was evaluated on TIRA and performed
well against other participants of PAN CLEF.
Keywords: source retrieval, named entity extraction, plagiarism detection.
Introduction
For as long as people have been creating original works, there have been imitators. Whilst imitation is said to be the sincerest form of flattery, it is a serious problem in research and academia. With the spread of internet access, it is now easier than ever to plagiarise from a plethora of sources. In addition, the availability of online translators and synonymizers has made it possible to obfuscate plagiarism with little effort. For these reasons it is simply not practical to manually detect most instances of plagiarism. In fact, plagiarism involving paraphrasing and translation in particular still presents a formidable challenge in the active research area of automatic plagiarism detection [1]. There exist several different types of plagiarism, ranging from improper citation to cut-and-paste copying of another's work. Obfuscated plagiarism is the most difficult to detect, as the plagiarist has attempted to transform the appropriated work enough to make it seem distinct from the original. The effectiveness of plagiarism detection software can be measured by the kinds of plagiarism it can identify. Such software drastically reduces the effort and time spent detecting plagiarism. Automatic plagiarism detection has been an active research area since the 1970s and has attracted increased interest in recent years, as advancements in computational speed and access to sophisticated search engines allow increasingly complex approaches to be implemented. These tools allow academics to quickly retrieve potential sources for suspicious passages of text, which can then be further analysed, making the task significantly more tractable.
This paper describes an algorithm for identifying key features of a suspicious document, building on the approaches of teams that competed in the PAN international competition on plagiarism detection [1-4]. The method closely follows the approach described by Williams, Chen, Choudhury and Giles in their 2013 paper [5], while also incorporating named entity n-grams similar to those used by Elizalde [6].
1. Related Work
The Uncovering Plagiarism, Authorship and
Social Software Misuse Lab (PAN) has been held
annually since 2007, and beginning in 2009 has
been part of the Conference and Labs of the Evalu-
ation Forum (CLEF). PAN aims to answers the
questions: given a document, is is original? who
wrote it? what are the author's traits? through ex-
perimentation on shared tasks. With the goal of
¹ The reported study was partially funded by RFBR, according to the research project No. 16-37-60048 mol_a_dk.
The approach described in this paper applies to the source retrieval sub-task of plagiarism detection. This task entails retrieving the sources of a potentially plagiarised document.
The framework of the PAN evaluation lab aims to emulate the real-life scenario of text reuse, where a plagiarist uses a web search engine to find source documents. To achieve this, the organisers created a crowd-sourced corpus of manually written documents containing instances of plagiarism. Instead of using the actual World Wide Web, authors were asked to use a static crawl of the web (the ClueWeb09²). They accessed ClueWeb09 through search engines (Indri³ and ChatNoir [7]) and could browse it as if it were the real thing. This same set-up is used for evaluation. Participants in the evaluation lab are then given API access to these search engines and a subset of documents from the plagiarism corpus on which to train their software. This allows participants to design software within a set-up that is very similar to the real-world task of retrieving sources of plagiarism by programmatically accessing a web search engine, but with the reproducibility of working in a static environment. The task is then to retrieve source documents while minimising the retrieval cost.
Participants in the PAN/CLEF evaluation lab are required to submit their software on the TIRA experimentation platform [8], allowing organisers to compare the current year's submissions to those submitted in previous years (since 2012). Thus the outcome of the PAN evaluation labs is performance data about different approaches to the shared tasks and, additionally, a collection of state-of-the-art implementations of these assorted approaches [9].
One of the best performing systems in the source retrieval task in 2013 and 2014 was that of Williams et al. [5, 10]. Even though they did not submit a new version in 2015, their software still went unmatched in the 2015 lab. Williams' approach in 2013 made use of an unsupervised ranking method to rank the results returned by a search engine by their similarity with the suspicious document. In 2014 they switched to a supervised method for ranking results; however, the F1 score increased only marginally. Elizalde's 2013 approach makes use of the novel idea of extracting named entities across a document in an attempt to match highly obfuscated plagiarism. These named entity queries are of interest because they could complement, and potentially improve, Williams' approach.

² http://lemurproject.org/clueweb09.php
³ http://lemurproject.org/indri.php
2. The Proposed Approach
The approach consists of several stages, namely: chunking, keyphrase extraction, query formulation, and download filtering.
Chunking: The suspicious document is first segmented into paragraphs of 5 sentences each. Each paragraph is pre-processed, removing all non-alphabetic characters.
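As an illustration, the following is a minimal sketch of this stage in Python, assuming NLTK's sentence tokenizer; the function name and the exact cleaning regex are ours, not taken from the published implementation.

import re
import nltk  # requires the "punkt" tokenizer models: nltk.download("punkt")

def chunk_document(text, sentences_per_chunk=5):
    """Segment a suspicious document into 5-sentence paragraphs and
    strip all non-alphabetic characters from each one."""
    sentences = nltk.sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        paragraph = " ".join(sentences[i:i + sentences_per_chunk])
        cleaned = re.sub(r"[^A-Za-z\s]", " ", paragraph)     # drop non-alphabetic characters
        chunks.append(re.sub(r"\s+", " ", cleaned).strip())  # collapse runs of whitespace
    return chunks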
Keyphrase Extraction and Query Formulation: Two different methods are employed in forming keyphrases. The first attempts to find the most important features of the entire document, while the second forms queries based on individual chunks.
Named entity queries: Named entities are identified over the whole text. They are then ranked in descending order of length, and the longest are submitted as-is as queries to the search engine. As noted by Elizalde [6], the rationale behind this is that named entities are unlikely to change even if paraphrasing has been used to obfuscate plagiarism.
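A sketch of this strategy, using NLTK's built-in named-entity chunker as a stand-in for whichever recogniser the published implementation used; the query cut-off of ten is an illustrative assumption. It requires the "punkt", "averaged_perceptron_tagger", "maxent_ne_chunker" and "words" NLTK data packages.

import nltk

def named_entity_queries(text, max_queries=10):
    """Collect named entities over the whole text, rank them in descending
    order of length, and return the longest ones verbatim as queries."""
    entities = set()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for subtree in nltk.ne_chunk(tagged).subtrees():
            if subtree.label() != "S":  # an NE chunk: PERSON, GPE, ORGANIZATION, ...
                entities.add(" ".join(token for token, _ in subtree.leaves()))
    return sorted(entities, key=len, reverse=True)[:max_queries]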
Chunk-based queries: Each sentence in each paragraph is tokenized. All stopwords are removed, and only verbs, nouns, and adjectives are retained. Queries are formed by concatenating sequences of tokens into disjoint sequential 10-grams. The first three 10-grams from each paragraph are submitted to the search engine.
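A sketch of this stage; the Penn Treebank tag prefixes and the lower-casing are our assumptions about details the description leaves open. It requires the "stopwords" NLTK data package.

import nltk
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))
CONTENT_TAGS = ("NN", "VB", "JJ")  # nouns, verbs and adjectives

def chunk_queries(paragraph, n=10, queries_per_chunk=3):
    """Form disjoint sequential 10-grams of content words from one
    paragraph and keep only the first three as queries."""
    kept = []
    for sentence in nltk.sent_tokenize(paragraph):
        for token, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if token.lower() not in STOPWORDS and tag.startswith(CONTENT_TAGS):
                kept.append(token.lower())
    # Non-overlapping (disjoint) 10-grams, in document order.
    tengrams = [" ".join(kept[i:i + n]) for i in range(0, len(kept), n)]
    return tengrams[:queries_per_chunk]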
Download Filtering: The ChatNoir search engine [7] allows one to request a snippet, of up to five hundred characters, of a specific document. Documents are only downloaded if they share at least five word 5-grams with the suspicious document.
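A sketch of the filter; the published implementation used the Shingling library, so the small shingling helper here is our own stand-in.

def word_ngrams(text, n=5):
    """Return the set of word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def should_download(snippet, suspicious_text, min_shared=5):
    """Download a result only if its snippet shares at least five
    word 5-grams with the suspicious document."""
    return len(word_ngrams(snippet) & word_ngrams(suspicious_text)) >= min_shared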
Algorithm 1 shows the algorithm for the introduced approach; a sketch of how the stages might be combined is also given after the footnotes below. The algorithm was implemented in the Python programming language, making use of the following non-standard libraries: BeautifulSoup4⁴, NLTK⁵, NumPy⁶ and Shingling⁷. The implementation is publicly available through PAN's online code repository⁸.
⁴ http://www.crummy.com/software/BeautifulSoup/bs4/
⁵ http://www.nltk.org/
⁶ http://www.numpy.org/
⁷ https://pypi.python.org/pypi/shingling
⁸ https://github.com/pan-webis-de/maluleka16
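A rough sketch of how these stages might be combined, reusing the helpers sketched above; the search_engine object and its search, snippet and download methods are placeholders standing in for a ChatNoir API client, not its real interface.

def retrieve_sources(suspicious_text, search_engine):
    """Run the full pipeline: formulate queries, filter via snippets,
    and download the surviving candidate source documents."""
    queries = named_entity_queries(suspicious_text)
    for paragraph in chunk_document(suspicious_text):
        queries.extend(chunk_queries(paragraph))
    sources = {}
    for query in queries:
        for result_id in search_engine.search(query):
            if result_id in sources:
                continue  # never download the same candidate twice
            snippet = search_engine.snippet(result_id, max_chars=500)
            if should_download(snippet, suspicious_text):
                sources[result_id] = search_engine.download(result_id)
    return sources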
3. Evaluation
There is no unified performance measure for the plagiarism detection task. An approach is therefore judged on several scores, taken as averages over a dataset [1]: the number of queries submitted; the number of web pages downloaded; the precision and recall of the downloaded web pages with respect to the actual sources of a suspicious document; the number of queries until the first actual source is found; and the number of downloads until the first actual source is downloaded. The first three measures capture the overall behaviour of a system; the last two assess the time to first result. The quality of identifying reused passages between documents is not taken into account here; however, retrieving a duplicate of a source document is considered a true positive, whereas retrieving more than one duplicate of a source document does not improve performance.
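As an illustration, the per-document precision, recall and F1 of downloaded pages can be computed as below, assuming sets of document IDs in which duplicates of a source have already been mapped to one canonical ID, as the true-positive rule above requires.

def retrieval_scores(downloaded, true_sources):
    """Precision/recall/F1 of downloaded pages against actual sources."""
    true_positives = len(downloaded & true_sources)
    precision = true_positives / len(downloaded) if downloaded else 0.0
    recall = true_positives / len(true_sources) if true_sources else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1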
Our algorithm was evaluated on TIRA against the PAN 2014 source retrieval test dataset 2, the same dataset used in the 2015 labs, allowing us to directly compare our results to those of the competition participants. Table 1 shows a detailed comparison of our approach to those of Williams and Elizalde using data from the results of the PAN 2015 source retrieval task [9]. Figure 1 shows a graphical comparison using only the F1, precision and recall measures.

From the results we can conclude that adding named entity queries does indeed improve the approach suggested by Williams. As one can see from Table 1, the presented algorithm has comparable precision, and the highest recall and overall F1 score, of the three approaches. In fact, it currently holds the top F1 score of all evaluated approaches⁹ (see Table 2).
Conclusion
This article suggests an approach to the source retrieval task using a combination of two distinct keyphrase extraction strategies, namely 10-grams from chunks and named entities. The evaluation results show that this approach achieves a good compromise between precision and recall.

The introduced approach is based on the results of work done by participants in the PAN lab shared tasks. Certainly, any number of approaches could be derived from the many approaches that have been implemented as part of the task evaluations. This paper seeks to suggest a high-performing approach, and to make a state-of-the-art implementation publicly available to aid other researchers and practitioners.
⁹ http://www.tira.io/task/source-retrieval/dataset/pan14-source-retrieval-test-dataset2-2014-05-14/
[Fig. 1. Comparison of Related Approaches – Key Measures: a bar chart of the F1, precision and recall scores (scale 0 to 0.7) for the Elizalde, Williams and Maluleka approaches.]
This software can be readily applied to real-world plagiarism detection, as modern search engines provide similar features to those used in experimentation. There is, of course, room for improvement. Rather than considering only a small number of top results returned by the search engine, which is done for the sake of expediting experimentation, we could consider many more results. Indeed, [3] argues that this might improve performance with little added cost.
References
1. Potthast, M. et al. Overview of the 5th International Competition on Plagiarism Detection. In: CLEF Conference on Multilingual and Multimodal Information Access Evaluation. CELCT. 2013, pp. 301-331.
2. Potthast, M. et al. Overview of the 4th International Competition on Plagiarism Detection. In: CLEF (Online Working Notes/Labs/Workshop). 2012.
3. Potthast, M. et al. Overview of the 6th International Competition on Plagiarism Detection. In: Working Notes Papers of the CLEF 2014 Evaluation Labs. Ed. by Cappellato, L. et al. CEUR Workshop Proceedings. CLEF and CEUR-WS.org, Sept. 2014.
4. Stamatatos, E. et al. Overview of the PAN/CLEF 2015 Evaluation Lab. In: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 6th International Conference of the CLEF Initiative (CLEF 15). Springer, Berlin Heidelberg New York. 2015.
5. Williams, K. et al. Unsupervised Ranking for Plagiarism Source Retrieval. In: Notebook for PAN at CLEF 2013 (2013).
6. Elizalde, V. Using Statistic and Semantic Analysis to Detect Plagiarism. In: CLEF (Online Working Notes/Labs/Workshop). 2013.
7. Potthast, M. et al. ChatNoir: A Search Engine for the ClueWeb09 Corpus. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2012, p. 1004.
8. Gollub, T., Stein, B., and Burrows, S. Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2012, pp. 1125-1126.
9. Hagen, M., Potthast, M., and Stein, B. Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR-WS.org, 2015.
10. Williams, K., Chen, H., and Giles, C. Supervised Ranking for Plagiarism Source Retrieval. Notebook for PAN at CLEF 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers, 15-18 September, Sheffield, UK. CEUR Workshop Proceedings, CEUR-WS.org (Sept. 2014).
Maluleka Rhulani. Postgraduate student at the Peoples' Friendship University of Russia (RUDN University). Graduated from RUDN in 2016. Research interests: intelligent methods of information retrieval and analysis, methods for detecting text reuse. E-mail: rhumaluleka@gmail.com

Sochenkov Ilya Vladimirovich. Deputy head of a laboratory at the Institute for Systems Analysis of the Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences; associate professor at the Department of Information Technologies and senior researcher at the Peoples' Friendship University of Russia (RUDN University). Graduated from RUDN in 2009. Candidate of Physical and Mathematical Sciences. Author of 50 publications. Research interests: intelligent methods of information retrieval and analysis, big data processing, content filtering, computational linguistics, pattern recognition. E-mail: isochenkov@sci.pfu.edu.ru
Table 1. Comparison of Related Approaches

Team         F1 Measure  Prec.    Rec.     Queries  Dwlds  Queries to   Dwlds to     No
                                                           1st Detect.  1st Detect.  Detect.
Elizalde13   0.15622     0.11845  0.36621  41.6     83.9   18.0         18.2         4
Williams13   0.46597     0.59656  0.46919  117.1    12.4   23.3         2.2          7
Maluleka16   0.47458     0.55403  0.52677  138.4    18.7   20.9         2.2          6
Table 2. Comparison of Top Performing Approaches

Team         F1 Measure  Prec.    Rec.     Queries  Dwlds   Queries to   Dwlds to     No
                                                            1st Detect.  1st Detect.  Detect.
Gillam13     0.05545     0.03831  0.14813  15.7     86.8    16.1         28.6         34
Haggag13     0.38303     0.67290  0.31370  41.7     5.2     13.9         1.4          12
Kong13       0.01119     0.00587  0.58559  47.9     5185.3  2.5          210.2        0
Maluleka16   0.47458     0.55403  0.52677  138.4    18.7    20.9         2.2          6