Article (PDF available)

String Similarity via Greedy String Tiling and Running Karp-Rabin Matching


Abstract

This is the original report describing the Running Karp-Rabin Greedy String Tiling (RKR-GST) algorithm, which was subsequently used in the Neweyes (protein/DNA string alignment) and YAP (plagiarism detection) applications. The paper was submitted to Software: Practice and Experience, but was not accepted. I retypeset the paper in 2001.
... We chose JPlag [23] due to its simplicity, maturity, and popularity, and adapted it to run in clustering mode for our purposes. As in many other popular tools, the algorithm [32] that JPlag uses to compare code snippets cannot handle short programs. This is an especially important disadvantage for Python solutions, since they can be short even for complex tasks. ...
... JPlag [23] is a popular plagiarism detection tool that uses a greedy string tiling algorithm [32] to find the distance between solutions. During each pair-wise comparison, JPlag attempts to cover one string with substrings taken from the other as well as possible, and then applies the Dice coefficient [10] to the results to obtain the final similarity score. ...
... In addition, this approach considers code snippets as a set of string tokens, which makes it impossible to analyze the code structure and the constructs used. Another popular plagiarism detection tool is MOSS [26], a Web service that also uses an adaptation of the greedy string tiling algorithm [32]. The key difference is that the fragments of code marked as similar appear in no more than N submissions. ...
Conference Paper
Full-text available
In many MOOCs, whenever a student completes a programming task, they can see previous solutions of other students to find potentially different ways of solving the problem and to learn new coding constructs. However, a lot of MOOCs simply show the most recent solutions, disregarding their diversity or quality, and thus hindering the students' opportunity to learn. In this work, we explore this novel problem for the first time. To solve it, we adapted the existing plagiarism detection tool JPlag to Python submissions on Hyperskill, a popular MOOC platform. However, due to the tool's inner algorithm, JPlag fully processed only 46 out of 867 studied tasks. Therefore, we developed our own tool called Rhubarb. This tool first standardizes solutions that are algorithmically the same, then calculates the structure-aware edit distance between them, and then applies clustering. Finally, it selects one example from each of the largest clusters, thus ensuring their diversity. Rhubarb was able to handle all 867 tasks successfully. We compared different approaches on a set of 59 real-life tasks that both tools could process. Eight experts rated the selected solutions based on diversity, code quality, and usefulness. The default platform approach of simply selecting recent submissions received on average 3.12 out of 5, JPlag received 3.77, and Rhubarb 3.50. To ensure both quality and coverage, we created a system that combines both tools. We conclude our work by discussing the future of this new problem and the research needed to solve it better.
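The pipeline described in the abstract, pairwise edit distance followed by clustering, can be illustrated with a minimal sketch. This is a hypothetical simplification, not Rhubarb's actual implementation: it uses plain Levenshtein distance on strings rather than a structure-aware distance, and a simple distance-threshold grouping in place of a real clustering algorithm.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (one-row version)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[n]

def cluster(solutions, threshold=3):
    """Greedy grouping: each solution joins the first cluster whose
    representative (first member) is within the distance threshold."""
    clusters = []
    for sol in solutions:
        for c in clusters:
            if edit_distance(sol, c[0]) <= threshold:
                c.append(sol)
                break
        else:
            clusters.append([sol])
    return clusters
```

Selecting one representative per large cluster then amounts to taking `c[0]` from each cluster, sorted by cluster size.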
... We chose JPlag [23] due to its maturity and popularity, and adapted it to run for our purposes. As in many other popular tools, the algorithm [32] that JPlag uses to compare code snippets cannot handle short programs. This is an especially important disadvantage for Python solutions, since they can be short even for complex tasks. ...
... JPlag [23] is a popular plagiarism detection tool that uses a greedy string tiling algorithm [32] to find the distance between solutions. During each pair-wise comparison, JPlag attempts to cover one string with substrings taken from the other as well as possible, and then applies the Dice coefficient [10] to the results to obtain the final similarity score. ...
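The covering step described in the snippet above can be sketched in a few lines. The sketch below is the plain greedy string tiling loop without the Running Karp-Rabin speedup, and the function names are illustrative rather than JPlag's actual API: each round finds the longest remaining common substring, marks it as a tile, and repeats until only matches below the minimum length remain; the Dice coefficient then turns total tile coverage into a similarity score.

```python
def greedy_string_tiling(a, b, min_len=3):
    """Cover a with maximal non-overlapping substrings of b.
    Returns tiles as (start_in_a, start_in_b, length)."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    max_match = min_len + 1
    while max_match > min_len:
        max_match = min_len
        matches = []
        for i in range(len(a)):
            for j in range(len(b)):
                # extend the match while characters agree and are untiled
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_match:
                    matches, max_match = [(i, j, k)], k
                elif k == max_match:
                    matches.append((i, j, k))
        for i, j, k in matches:
            # skip matches that now overlap an already-marked tile
            if any(marked_a[i + t] or marked_b[j + t] for t in range(k)):
                continue
            for t in range(k):
                marked_a[i + t] = marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles

def gst_similarity(a, b, min_len=3):
    """Dice coefficient over the tiled coverage, in [0, 1]."""
    coverage = sum(k for _, _, k in greedy_string_tiling(a, b, min_len))
    return 2 * coverage / (len(a) + len(b))
```

The naive scan here is O(n^3); the Karp-Rabin fingerprinting described in this report is what brings the expected cost down in practice.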
... In addition, this approach considers code snippets as a set of string tokens, which makes it impossible to analyze the code structure and the constructs used. Another popular plagiarism detection tool is MOSS [26], a Web service that uses an adaptation of the greedy string tiling algorithm [32]. The key difference is that the fragments of code marked as similar appear in no more than N submissions. ...
Preprint
Full-text available
In many MOOCs, whenever a student completes a programming task, they can see previous solutions of other students to find potentially different ways of solving the problem and learn new coding constructs. However, a lot of MOOCs simply show the most recent solutions, disregarding their diversity or quality. To solve this novel problem, we adapted the existing plagiarism detection tool JPlag to Python submissions on Hyperskill, a popular MOOC platform. However, due to the tool's inner algorithm, it fully processed only 46 out of 867 studied tasks. Therefore, we developed our own tool called Rhubarb. This tool first standardizes solutions that are algorithmically the same, then calculates the structure-aware edit distance between them, and then applies clustering. Finally, it selects one example from each of the largest clusters, taking into account their code quality. Rhubarb was able to handle all 867 tasks successfully. We compared approaches on a set of 59 tasks that both tools could process. Eight experts rated the selected solutions based on diversity, code quality, and usefulness. The default platform approach of selecting recent submissions received on average 3.12 out of 5, JPlag 3.77, and Rhubarb 3.50. Since in a real MOOC it is imperative to process every task, we created a system that uses JPlag on the 5.3% of tasks it fully processes and Rhubarb on the remaining 94.7%.
... Running-Karp-Rabin Greedy-String-Tiling (RKR-GST) Similarity: It is often used in the context of detecting plagiarism by identifying maximal sequences of contiguous matching tokens (tiles) [38]. Semdiff Similarity: Semdiff is a method for detecting semantic differences between program versions. Semdiff similarity measures how code changes affect the program's semantics [17]. ...
Chapter
Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to overview the current state-of-the-art techniques, their strengths, and weaknesses. To do that, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim
Article
Code auditing ensures that the developed code adheres to standards, regulations, and copyright protection by verifying that it does not contain code from protected sources. The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing. The dataset for training these models is mainly collected from publicly available sources. This raises the issue of intellectual property infringement as developers' codes are already included in the dataset. Therefore, auditing code developed using LLMs is challenging, as it is difficult to reliably assert if an LLM used during development has been trained on specific copyrighted codes, given that we do not have access to the training datasets of these models. Given the non-disclosure of the training datasets, traditional approaches such as code clone detection are insufficient for asserting copyright infringement. To address this challenge, we propose a new approach, TraWiC, a model-agnostic and interpretable method based on membership inference for detecting code inclusion in an LLM's training dataset. We extract syntactic and semantic identifiers unique to each program to train a classifier for detecting code inclusion. In our experiments, we observe that TraWiC is capable of detecting 83.87% of codes that were used to train an LLM. In comparison, the prevalent clone detection tool NiCad is only capable of detecting 47.64%. In addition to its remarkable performance, TraWiC has low resource overhead in contrast to the pair-wise clone detection that is conducted during the auditing process of tools like the CodeWhisperer reference tracker, across thousands of code snippets.
Preprint
Full-text available
The capability of accurately determining code similarity is crucial in many tasks related to software development. For example, it might be essential to identify code duplicates for performing software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment, combining the strengths of multiple unsupervised similarity measures. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while Transformers-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option in the presence of abundant training data, in the case of specific small datasets (up to 500 samples), our ensemble achieves similar results, without prejudice to the interpretability of the resulting solution, and with a much lower associated carbon footprint due to training. The source code of this novel approach can be downloaded from https://github.com/jorge-martinez-gil/ensemble-codesim.
Article
We present randomized algorithms to solve the following string-matching problem and some of its generalizations: Given a string X of length n (the pattern) and a string Y (the text), find the first occurrence of X as a consecutive block within Y. The algorithms represent strings of length n by much shorter strings called fingerprints, and achieve their efficiency by manipulating fingerprints instead of longer strings. The algorithms require a constant number of storage locations, and essentially run in real time. They are conceptually simple and easy to implement. The method readily generalizes to higher-dimensional pattern-matching problems.
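The fingerprint idea in the abstract above can be sketched concretely. The following is a minimal Karp-Rabin matcher, with illustrative parameter choices (base 256, a large prime modulus): it compares rolling fingerprints of the pattern and each text window, and verifies candidate positions character-by-character to rule out hash collisions.

```python
def karp_rabin(pattern, text, base=256, mod=1_000_000_007):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    high = pow(base, m - 1, mod)            # weight of the window's leading char
    p_hash = t_hash = 0
    for i in range(m):                      # fingerprints of pattern and first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # on a fingerprint match, verify to guard against collisions
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:                       # roll the window one character right
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return -1
```

Each shift updates the fingerprint in constant time, which is what makes the "running" variant used by RKR-GST cheap: the hashes of all substrings of a given length can be enumerated in a single pass.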
Article
A simple algorithm is described for isolating the differences between two files. One application is the comparing of two versions of a source program or other file in order to display all differences. The algorithm isolates differences in a way that corresponds closely to our intuitive notion of difference, is easy to implement, and is computationally efficient, with time linear in the file length. For most applications the algorithm isolates differences similar to those isolated by the longest common subsequence. Another application of this algorithm merges files containing independently generated changes into a single file. The algorithm can also be used to generate efficient encodings of a file in the form of the differences between itself and a given "datum" file, permitting reconstruction of the original file from the difference and datum files.
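The longest-common-subsequence style of differencing mentioned above can be demonstrated with Python's standard difflib; this is an illustration of the general technique, not the paper's own algorithm. `SequenceMatcher` aligns the two line sequences and reports each divergent region as an opcode.

```python
import difflib

# Two hypothetical versions of a small source file, as line lists
old = ["a = 1", "b = 2", "print(a + b)"]
new = ["a = 1", "b = 3", "c = 4", "print(a + b)"]

sm = difflib.SequenceMatcher(None, old, new)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == "equal":
        continue
    # e.g. "replace ['b = 2'] -> ['b = 3', 'c = 4']"
    print(tag, old[i1:i2], "->", new[j1:j2])
```

The `equal` opcodes correspond to the common subsequence; everything else is the isolated difference, which is exactly what a merge or delta-encoding application consumes.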
Article
We present an average case analysis of the Karp-Rabin string matching algorithm. This algorithm is a probabilistic algorithm that adapts hashing techniques to string searching. We also propose an efficient implementation of this algorithm.
Article
The primary purpose of a programming language is to assist the programmer in the practice of her art. Each language is either designed for a class of problems or supports a different style of programming. In other words, a programming language turns the computer into a 'virtual machine' whose features and capabilities are unlimited. In this article, we illustrate these aspects through a language similar to Logo. Programs are developed to draw geometric pictures using this language.
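The idea of a Logo-like drawing language can be made concrete with a toy interpreter. The command set below (FORWARD, LEFT, RIGHT) and the function name are illustrative assumptions, not the language described in the article; instead of drawing, the sketch records the points the "turtle" visits.

```python
import math

def run_logo(program):
    """Interpret a tiny Logo-like program: FORWARD n, RIGHT deg, LEFT deg.
    Returns the list of points visited by the turtle."""
    x, y, heading = 0.0, 0.0, 0.0          # start at the origin, facing east
    points = [(x, y)]
    for line in program.strip().splitlines():
        cmd, arg = line.split()
        arg = float(arg)
        if cmd == "FORWARD":
            x += arg * math.cos(math.radians(heading))
            y += arg * math.sin(math.radians(heading))
            points.append((round(x, 6), round(y, 6)))
        elif cmd == "RIGHT":
            heading -= arg
        elif cmd == "LEFT":
            heading += arg
    return points

# A unit square: four sides, turning left 90 degrees after each
square = "\n".join(["FORWARD 1", "LEFT 90"] * 4)
```

Running the square program returns five points that start and end at the origin, tracing the four corners in between.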
Neweyes: A System for Comparing Biological Sequences Using the Running Karp-Rabin Greedy String-Tiling Algorithm
  • Wise, Michael J.
[Wise95] Wise, Michael J., ''Neweyes: A System for Comparing Biological Sequences Using the Running Karp-Rabin Greedy String-Tiling Algorithm'', Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, ed. Christopher Rawlings, Dominic Clark, Russ Altman et al., pp. 393-401, AAAI Press (July 16-19, 1995).
Improved Detection of Similarities in Computer Program and other Texts
  • Wise, Michael J.
[Wise96] Wise, Michael J., ''Improved Detection of Similarities in Computer Program and other Texts'', Twenty-Seventh SIGCSE Technical Symposium, Philadelphia, U.S.A., pp. 130-134 (February 15-17, 1996).