January 2022
Communications in Computer and Information Science
In this paper, we describe a method for cross-lingual plagiarism detection for a distant language pair (Russian-English). All documents in a reference collection are split into fragments of fixed size. These fragments are indexed in a special inverted index, which maps each word to a bit array. Each bit in the array indicates whether the corresponding sentence contains that word. This index is used to retrieve candidate fragments. We also use the bit arrays stored in the index to assess the lexical similarity of query and candidate sentences. Before retrieval, the top keywords of a query document are mapped from one language to the other with the help of cross-lingual word embeddings. We also train a language-agnostic sentence encoder that helps to compare sentence pairs that share few or no words. The combined similarity score of sentence pairs is used by a text alignment algorithm, which tries to find blocks of contiguous and similar sentence pairs. We introduce a dataset for evaluating this task: an automatically translated version of Paraplag (a monolingual dataset for plagiarism detection). The proposed method shows good performance on our dataset in terms of F1. We also evaluate the method on another publicly available dataset, on which it outperforms previously reported results.

Keywords: Cross-lingual plagiarism detection; Cross-lingual word embeddings; Cross-lingual sentence embeddings
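
The word-to-bit-array index described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_index` and `lexical_overlap`, the whitespace tokenization, the use of a Python integer as the bit array, and the overlap measure (fraction of distinct query words found in a sentence) are all assumptions made for illustration. The sketch covers a single fragment and omits keyword translation, the sentence encoder, and text alignment.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each word to an int used as a bit array.

    Bit i of index[word] is set when sentence i contains the word.
    Tokenization by whitespace is an illustrative assumption.
    """
    index = defaultdict(int)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.lower().split()):
            index[word] |= 1 << i
    return dict(index)


def lexical_overlap(index, query_words, num_sentences):
    """Score each indexed sentence by the share of query words it contains.

    The scoring rule is a hypothetical stand-in for the paper's
    lexical similarity, which the abstract does not specify.
    """
    query = {w.lower() for w in query_words}
    if not query:
        return [0.0] * num_sentences
    counts = [0] * num_sentences
    for word in query:
        bits = index.get(word, 0)
        i = 0
        # Walk the set bits; each one marks a sentence containing the word.
        while bits:
            if bits & 1:
                counts[i] += 1
            bits >>= 1
            i += 1
    return [c / len(query) for c in counts]


# Toy fragment of three sentences.
sentences = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock prices fell sharply",
]
index = build_index(sentences)
scores = lexical_overlap(index, ["cat", "mat"], len(sentences))
```

Here `index["cat"]` has bits 0 and 1 set, and `scores` ranks the first sentence highest because it contains both query words. Storing one bit per sentence keeps the per-word entry compact and lets sentence-level matches be read directly from the retrieval index without re-tokenizing the candidate fragment.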