Journal of Artificial Intelligence and Capsule Networks (2020), Vol. 02, No. 03, Pages 177-184
ISSN: 2582-2012 (online)
http://irojournals.com/aicn/
DOI: https://doi.org/10.36548/jaicn.2020.3.005
Submitted: 24.05.2020 | Accepted: 14.07.2020 | Published: 21.07.2020
Plagiarism Detection in Programming Assignments
using Machine Learning
1Nishesh Awale, 2Mitesh Pandey, 3Anish Dulal, 4Bibek Timsina
Department of Electronics and Computer Engineering, Pulchowk Campus, Lalitpur, Nepal
1073bex423.nishesh@pcampus.edu.np, 2073bex417.mitesh@pcampus.edu.np, 3073bex406.anish@pcampus.edu.np, 4073bex409.bibek@pcampus.edu.np
Abstract: Plagiarism in programming assignments has been increasing these days, which affects the evaluation of students. This paper proposes a machine learning approach for plagiarism detection in programming assignments. Different features related to the source code are computed based on the similarity score of n-grams, code style similarity and dead code. Then, an xgboost model is used for training and predicting whether a pair of source codes is plagiarised or not. Many plagiarism detection techniques ignore dead code such as unused variables and functions in their prediction tasks, but the number of unused variables and functions in the source code is considered in this paper. Using our features, the model achieved an accuracy score of 94% and an average f1-score of 0.905 on the test set. We also compared the results of the xgboost model with a support vector machine (SVM) and report that the xgboost model performed better on our dataset.
Keywords: Source Code Plagiarism Detection; Xgboost; Programming Assignments Plagiarism; Source Code Features
1. Introduction
With the increase in digital content because of the Internet, information has become easily available to everyone. Sharing assignments, whether text-based or programming assignments, has become easy for students. As a result, plagiarism has been increasing in the academic sector. Plagiarism prevents fair evaluation of students and hampers their learning. Manually checking assignments for plagiarism is tedious and time consuming.
Many tools have been built for checking plagiarism in programming assignments. The plagiarism detection task can be considered a classification problem: two assignments are taken as inputs and the system determines whether the pair of assignments is plagiarised or not. Early methods for plagiarism detection in programming assignments are based on n-gram techniques which create fingerprints of the assignments to measure the similarity between them. They do not take code style similarity, such as braces, comments and whitespace similarity, into consideration. Some machine learning techniques have been proposed based on traditional learning algorithms such as k-nearest neighbors, naive Bayes, etc. Also, with the success of deep learning in computer vision and natural language processing, deep learning has been applied to source code plagiarism detection as well. One approach uses a recurrent neural network to extract features from its hidden layers and trains a classifier on those features.
Our approach computes different features from the programming assignments that are helpful for detecting plagiarism. We compute features such as the similarity score and the number of unused variables and functions from the source code. Then, we train an xgboost model on these features and compare its results with a Support Vector Machine (SVM). We believe that our features can detect plagiarism even if the assignments are highly obfuscated. With our proposed features, our model obtained 94% accuracy on the test set. The results are discussed further in Section 4.
2. Related Works
Different attempts have been made to detect plagiarism in programming assignments submitted by students. JPlag [1] is one of the earliest and most powerful works we found in the field of testing source code programs for plagiarism. It
used a language-specific parser or scanner depending on the programming language in which the source code was written. The parser pre-processes the source code into appropriate tokens, and the tokens are chosen in such a manner that they characterize the essence of the program (which is difficult for a plagiarist to change). For instance, whitespace and comments should never produce a token, since they are the most obvious points of attack. Various run-time optimizations are also discussed in the paper. MOSS (Measure of Software Similarity) [2] is also one of the most popular tools for detecting plagiarism in digital content. It is a token-based method which selects fingerprints from hashes of k-grams. It uses a winnowing approach in which a window is defined and, from each window, a minimum hash value is selected as a fingerprint. Then, the fingerprints of different documents are compared for plagiarism detection. However, it ignores the similarity of braces and comments and the number of unused variables and functions.
In [3], the authors defined different levels of plagiarism from level 0 to level 6 and used machine learning techniques to detect plagiarism. They calculated different source code metrics such as the number of words per source code line, the number of characters per source code line, the number of underscore characters used in identifiers, etc. They computed nine such features and used a naive Bayes classifier, k-nearest neighbors and the AdaBoost algorithm for classification. This approach also ignores dead code such as unused variables and functions in the source code. In [4], different high-level features such as the number of lines, characters, inputs and outputs of the programs are computed. Then, a weighted average of all features is used to compute a similarity score. Another approach in [5] detects plagiarism at the function level. After preprocessing, the source codes are split into functions and a similarity score is computed between different pairs of functions.
With the rise of deep learning, people have tried to use deep learning for source code plagiarism detection as well. In [6], a char-RNN model was pre-trained on the Linux kernel source code and the last layer of the RNN model was used to extract deep features. Then, a classifier was trained on those features to detect plagiarism. According to their results, deep learning approaches to plagiarism detection also look very promising.
In [7], the authors applied plagiarism detection to the submission records of an Online Judge System. They calculated various features such as the level of concentration of plagiarism targets, the difficulty level of the problem, the ranking of the student, etc., and irrelevant features were filtered out using information gain. They also used various performance metrics such as macro F1 score, AUC and ROC to select the learning model.
3. Proposed Work
Our approach uses machine learning techniques for plagiarism detection in programming assignments. First, we extract features from the C/C++ source code and then train an xgboost model on our dataset for plagiarism detection. Our model predicts whether a pair of programming assignments is plagiarised or not. A detailed overview of our approach is shown in Figure 1.
3.1 Dataset
We used the programming assignments submitted by students during two introductory programming courses at the University of Sarajevo, available on IEEE DataPort [8]. From the collection of C/C++ assignment submissions, we formed different combinations of plagiarised and non-plagiarised assignment pairs according to the given ground truth. The plagiarised class was labeled as 1 and the non-plagiarised class as 0. As the ratio of plagiarised to non-plagiarised assignment pairs was very small, we sampled only a portion of the non-plagiarised pairs and selected all the plagiarised pairs. As a result, we had a total of 5884 assignment pairs, of which 1262 pairs were plagiarised and 4622 were non-plagiarised.
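As a rough illustration of this pairing step, the sketch below enumerates all file pairs for one assignment, labels them from a ground-truth set and downsamples the non-plagiarised class; the function and parameter names are illustrative, not taken from the paper's code.

```python
import itertools
import random

def build_pairs(files, plagiarised_pairs, neg_ratio=4, seed=0):
    """Label every file pair and keep only a sampled portion of the negatives.

    files: list of submission file paths for one assignment
    plagiarised_pairs: set of frozensets of paths known to be plagiarised
    """
    positives, negatives = [], []
    for a, b in itertools.combinations(files, 2):
        label = 1 if frozenset((a, b)) in plagiarised_pairs else 0
        (positives if label else negatives).append((a, b, label))
    # keep every plagiarised pair, sample only a portion of the rest
    random.seed(seed)
    negatives = random.sample(negatives, min(len(negatives), neg_ratio * len(positives)))
    return positives + negatives
```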
Figure 1: Our proposed approach. First, we extract pairwise features from source code. Then, we train an xgboost model for classification.
3.2 Computing Relevant Features
After preparing the dataset, we extracted various features from the programming assignments. The features used for plagiarism detection are described below. The similarity score, code style similarity and categorical value features were computed based on the methodology proposed by Huang et al. [7], with some modifications.
3.2.1 Similarity Score
To get an idea of how similar the two given submissions are, we modify the Karp-Rabin string matching algorithm to produce a similarity score. The similarity score was calculated as follows:
1. First, all variable names, function names and strings inside print statements were converted to a fixed character, so that even if the variable names, function names or strings inside print statements differ between two assignments, they are treated as plagiarised.
2. Line breaks, tabs and extra spaces were removed, and the whole program was converted to a single string, S.
3. After converting to a single string S, the collection of k-grams of S was generated. A k-gram is a sub-string of length k. For our experiment, we used k = 20.
4. To make the comparison of k-grams faster, we hash all k-grams to integer values. The hash value of the first k-gram (c_1 ... c_k) was calculated based on a constant b as

H(c_1 \dots c_k) = (c_1 b^{k-1} + c_2 b^{k-2} + \dots + c_{k-1} b + c_k) \cdot b

To compute the hash values of consecutive k-grams, we used a rolling hash function for faster computation instead of the full formula above. We computed the consecutive hash values as

H(c_2 \dots c_{k+1}) = (H(c_1 \dots c_k) - c_1 b^{k} + c_{k+1}) \cdot b

We can see that a single string S of length n has n - k + 1 hash values.
5. After converting an assignment pair into strings S1 and S2, and then into the collections of hash values H_1 and H_2, we used the Jaccard similarity to compute a similarity score as

Similarity Score = \frac{|H_1 \cap H_2|}{|H_1 \cup H_2|}
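A minimal sketch of this fingerprinting step, assuming the source has already been normalised as in steps 1-2; the base b = 256 and the omission of a modulus are simplifying assumptions made here and are not specified in the paper.

```python
def kgram_hashes(s, k=20, b=256):
    """Hash every k-gram of s with a Karp-Rabin style rolling hash."""
    if len(s) < k:
        return set()
    h = 0
    for c in s[:k]:                       # hash of the first k-gram
        h = h * b + ord(c)
    h *= b
    hashes = {h}
    top = b ** k
    for i in range(1, len(s) - k + 1):    # roll the window one character at a time
        h = (h - ord(s[i - 1]) * top + ord(s[i + k - 1])) * b
        hashes.add(h)
    return hashes

def similarity_score(s1, s2, k=20):
    """Jaccard similarity between the two sets of k-gram hashes."""
    h1, h2 = kgram_hashes(s1, k), kgram_hashes(s2, k)
    return len(h1 & h2) / len(h1 | h2) if (h1 or h2) else 0.0
```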
3.2.2 Code Style Similarity
Different IDEs produce different styles of code, i.e. the styling of braces and whitespace differs among IDEs. So, plagiarism can also be detected on the basis of the style in which the code is written. The braces similarity, comment similarity and whitespace similarity were computed as described below:
A. Braces Similarity
We classify braces into four types. The first type is when the brace is at the far left of the line and some code is written after it. The second type is when the brace is at the far right of the line, after some code on that line. The third type indicates that the brace is in the middle of the line. The fourth type indicates that the brace is on a separate line with no code on that line. These four conditions are represented by 1, 2, 3 and 4 respectively. An example of braces notation is shown in Figure 2.
Figure 2: An example of braces notation. Braces notation of above code is {2{1}3{4}4}4.
We compute the braces notation of both programming assignments in the pair. Then, the longest common subsequence of the two braces notations is computed. If LCS is the length of the longest common subsequence and L_1, L_2 are the lengths of the braces notations of the two assignments respectively, then the braces similarity is calculated as

Braces Similarity (BS) = \frac{2 \cdot LCS}{L_1 + L_2}
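A minimal sketch of this step, assuming the braces notation strings have already been extracted; the LCS length is computed with the standard dynamic-programming recurrence.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (standard DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def braces_similarity(notation1, notation2):
    """BS = 2 * LCS / (L1 + L2), e.g. braces_similarity("2134", "214")."""
    if not notation1 or not notation2:
        return 0.0
    return 2 * lcs_length(notation1, notation2) / (len(notation1) + len(notation2))
```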
B. Comment Similarity
Different people have different styles of writing comments. Some write a comment on the same line after the code, while others use a new line. Some use multi-line comment notation, while others use single-line comment notation. Comments are classified into three types. The first represents a comment on a new line. The second is when a comment is written on the same line after the code. The third represents multi-line comment notation. They are represented by S1, S2 and M3 respectively. An example of comment notation is shown in Figure 3. For the comment similarity, we also compute the longest common subsequence between the comment notations of the pair of source codes and use the same formula as for the braces similarity.
Figure 3: An example of comment notation. Comment notation of above code is S1M3S2.
C. Spaces and Newline Similarity
In source code, whitespace can take two forms: tabs and spaces. Some programmers prefer spaces while others prefer tabs, so this practice differs from person to person and can be used as an important feature to check the similarity between two source codes. In addition, newlines are used arbitrarily by coders: some prefer wide, clear gaps while others try to keep the number of lines as small as possible. So, the use of newlines can also be an important feature for checking the similarity between two source codes. An edit distance (ED) is calculated to summarize the relative use of tabs, spaces and newlines in the two source code programs being compared:

ED = tabsDistance + spaceDistance + newlineDistance
total = totalTabs + totalSpaces + totalNewlines
SNS = 1 - \frac{ED}{total}
After computing the braces similarity (BS), the comment similarity (CS) and the spaces and newline similarity (SNS), we computed the code style similarity as follows:

Code Style Similarity = \frac{BS + CS + SNS}{3}
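A sketch of how these pieces could be combined. The paper does not spell out how tabsDistance, spaceDistance and newlineDistance are computed, so this sketch assumes they are the absolute differences in tab, space and newline counts between the two files; braces_sim and comment_sim are assumed to come from the LCS-based functions above.

```python
def code_style_similarity(src1, src2, braces_sim, comment_sim):
    """Average of braces, comment and spaces/newline similarity.

    The edit distance is approximated here by absolute differences in
    tab, space and newline counts (one possible reading of the paper).
    """
    def counts(s):
        return s.count("\t"), s.count(" "), s.count("\n")

    t1, s1, n1 = counts(src1)
    t2, s2, n2 = counts(src2)
    ed = abs(t1 - t2) + abs(s1 - s2) + abs(n1 - n2)
    total = t1 + t2 + s1 + s2 + n1 + n2
    sns = 1 - ed / total if total else 1.0
    return (braces_sim + comment_sim + sns) / 3
```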
3.2.3 Categorical Value According to Similarity Score
A categorical value is also used as a feature, in which six different classes are defined according to the difference between the similarity score (SS) and a similarity threshold value (ST). The similarity threshold value can be changed according to the nature of the problem; we used a threshold value of 0.4. The distance between the similarity score and the similarity threshold value is calculated and normalized as

Distance = \frac{|SS - ST|}{ST}
Then, we determine the category using the following algorithm.
If (SS < ST), then:
If 0.5 < Distance ≤ 1, then category = 1;
If 0.2 < Distance ≤ 0.5, then category = 2;
If Distance ≤ 0.2, then category = 3;
Else:
If Distance ≤ 0.2, then category = 4;
If 0.2 < Distance ≤ 0.5, then category = 5;
If 0.5 < Distance ≤ 1, then category = 6.
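A small sketch of this categorisation, following the rules above; note that for scores above the threshold the normalized distance can exceed 1, and this sketch simply folds such cases into category 6.

```python
def similarity_category(ss, st=0.4):
    """Map similarity score ss to a category 1-6 relative to threshold st."""
    distance = abs(ss - st) / st
    if ss < st:
        if distance > 0.5:
            return 1
        return 2 if distance > 0.2 else 3
    if distance <= 0.2:
        return 4
    return 5 if distance <= 0.5 else 6
```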
3.2.4 Number of Common Lines
The number of common lines in an assignment pair is also an indicator of plagiarism between them. The lines of each file are stored in a set so that duplicate lines within the same file are removed. Then, the lines which are common to both files are counted. Blank lines in both files are ignored while calculating this feature.
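A minimal sketch of this feature; stripping leading and trailing whitespace from each line is an assumption made here, not something the paper specifies.

```python
def common_line_count(src1, src2):
    """Count distinct non-blank lines that appear in both files."""
    lines1 = {line.strip() for line in src1.splitlines() if line.strip()}
    lines2 = {line.strip() for line in src2.splitlines() if line.strip()}
    return len(lines1 & lines2)
```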
3.2.5 Number of Unused Variables
While plagiarising source code, students may add dead code such as variables that are never used. So, we used cppcheck, a tool for static C/C++ code analysis, to count the number of unused variables in both files. The numbers of unused variables in the two files are added to give a single value which is used as a feature for the assignment pair.
3.2.6 Number of Unused Functions
Similar to unused variables, students may add functions which are never called from the main program. To detect this type of plagiarism, we used cppcheck to count the number of unused functions. As with unused variables, the numbers of unused functions in the two files are added to give a single value which is used as a feature for the assignment pair.
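A hedged sketch of how these two counts could be obtained with cppcheck. It assumes cppcheck is installed and on the PATH; the --enable and --template options and the unusedVariable/unusedFunction message ids are standard cppcheck features, but the exact invocation used by the authors is not given in the paper.

```python
import subprocess

def count_unused(path):
    """Count unused variables and unused functions reported by cppcheck for one file."""
    result = subprocess.run(
        ["cppcheck", "--enable=style,unusedFunction", "--template={id}", path],
        capture_output=True, text=True,
    )
    ids = result.stderr.splitlines()          # cppcheck writes findings to stderr
    unused_vars = ids.count("unusedVariable")
    unused_funcs = ids.count("unusedFunction")
    return unused_vars, unused_funcs
```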
3.3 Training a Classifier
After extracting the above six features from the source code pairs, a supervised learning approach was used to detect plagiarism. We trained an xgboost model on these features and the labels. Xgboost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. The whole dataset was divided into a 70% training set and a 30% test set. Then, the model was trained, and metrics such as accuracy and f1-score were used to evaluate it.
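A minimal sketch of this training and evaluation step using the xgboost and scikit-learn Python packages; the hyperparameters shown are illustrative assumptions, not the settings reported by the authors.

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def train_and_evaluate(X, y):
    """X: one row of the six features per assignment pair; y: 1 = plagiarised, 0 = not."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    model = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    return model
```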
4. Results and Discussion
We used Google Colaboratory for training our learning model. After training the xgboost model, we obtained an accuracy score of 94% on the test set. As our data was imbalanced in a ratio of nearly 1:4, a model that predicted all assignment pairs as non-plagiarised would already have achieved an accuracy of about 79% (4622 of 5884 pairs). But we can see that, using our source code metrics, the accuracy of the model is considerably higher than this baseline.
We also used the per-label f1-score as a metric to evaluate the model, which summarizes the precision and recall of our model. The per-label precision, recall and f1-score of the xgboost model are shown in Table 1.
Class Label        Precision   Recall   F1-score   Accuracy
Not plagiarised    0.95        0.97     0.96       0.94
Plagiarised        0.89        0.82     0.85
Table 1: Result of Xgboost Model
For comparison with another classifier, we also trained a support vector machine (SVM) under similar conditions. From Tables 1 and 2, we can see that the xgboost model performs better, with higher accuracy and f1-score than the SVM.
Class Label        Precision   Recall   F1-score   Accuracy
Not plagiarised    0.91        0.97     0.94       0.90
Plagiarised        0.87        0.66     0.75
Table 2: Result of SVM Model
The true positives, true negatives, false positives and false negatives of the xgboost model are also summarized in the confusion matrix shown in Figure 4. The values in the confusion matrix have been normalized by the total number of true labels of each class. We can see that 97% of the truly non-plagiarised assignment pairs were classified correctly by the model. Similarly, 81% of the truly plagiarised assignment pairs were classified correctly, while 18% of the truly plagiarised assignments were predicted as non-plagiarised by our model. This may have occurred because some relevant features of the assignment pairs were not taken into consideration.
Figure 4: Confusion matrix using Xgboost model
5. Conclusion
In this paper, we have presented a machine learning approach for plagiarism detection in programming assignments. Different features were computed for each programming assignment pair and an xgboost model was used for classification. Based on the results of our model on the test set, the machine learning approach to plagiarism detection looks very promising. Although there are various deep learning methods for text-based classification, we found that they are not particularly suitable for detecting source code plagiarism. We are still striving to improve our algorithm by incorporating compiler-based features (specific to programming languages) into our classification model.
Acknowledgement
We are very grateful to Prof. Dr. Shashidhar Ram Joshi for supervising our project and helping us in every possible way to achieve the targeted goals. We would also like to convey our deepest gratitude to Prof. Dr. Subarna Shakya for guiding us in publishing this paper. He helped us draft and critically revise the paper with regard to important intellectual content and helped us mitigate our errors.
References
[1] Prechelt, L., Malpohl, G., & Philippsen, M. (2000). JPlag: Finding plagiarisms among a set of programs. Technical Report 2000-1, Fakultät für Informatik, Universität Karlsruhe.
[2] Schleimer, S., Wilkerson, D. S., & Aiken, A. (2003). Winnowing: Local algorithms for document fingerprinting. ACM SIGMOD, 76-85.
[3] Bandara, U., & Wijayarathna, G. (2011). A machine learning based tool for source code plagiarism detection. International Journal of Machine Learning and Computing, 1(4), 337-343.
[4] Narayanan, S., & Simi, S. (2012). Source code plagiarism detection and performance analysis using fingerprint based distance measure method. 7th International Conference on Computer Science & Education, 1065-1068.
[5] Agrawal, M., & Sharma, D. K. (2016). A novel method to find out the similarity between source codes. IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics Engineering, 339-343.
[6] Yasaswi, J., Purini, S., & Jawahar, C. V. (2017). Plagiarism detection in programming assignments using deep features. 4th IAPR Asian Conference on Pattern Recognition, 652-657.
[7] Huang, Q., Fang, G., & Jiang, K. (2019). An approach of suspected code plagiarism detection based on XGBoost incremental learning. International Conference on Computer, Network, Communication and Information Systems, 88, 269-276.
[8] Ljubovic, V. (2020). Programming Homework Dataset for Plagiarism Detection. IEEE DataPort. http://dx.doi.org/10.21227/71fw-ss32
Authors Biography
The paper presented above is part of a major project done by 4th-year students of the Department of Electronics and Computer Engineering at the Institute of Engineering, Pulchowk Campus. The authors have participated in many project competitions as a team, and their areas of interest cover Artificial Intelligence, Machine Learning, Blockchain and IoT technologies. They were the winners of the 2019 Rising Student ICT Award for their project named "Raktadaan". Their projects mainly focus on solving current social problems using feasible technical solutions.
Nishesh Awale
Mitesh Pandey
Anish Dulal
Bibek Timsina