Figure 2 - uploaded by Bibek Timsina
Source publication
Plagiarism in programming assignments has been increasing, which affects the evaluation of students. This paper proposes a machine learning approach for plagiarism detection in programming assignments. Different features related to source code are computed based on the similarity score of n-grams, code style similarity, and dead code. Then, xg...
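To make the n-gram similarity feature mentioned in the abstract concrete, here is a minimal sketch. It is not the paper's implementation: the crude tokenizer, the choice of token trigrams, and the Jaccard measure are all assumptions made for illustration, and the function names are hypothetical.

```python
import re


def tokens(source: str) -> list[str]:
    # Crude lexer (assumption): identifiers and numbers are single tokens,
    # every other non-space character is a token by itself.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def ngrams(seq: list[str], n: int = 3) -> set[tuple[str, ...]]:
    # Set of all contiguous runs of n tokens.
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    # Jaccard similarity of the two n-gram sets, in the range [0, 1].
    ga, gb = ngrams(tokens(a), n), ngrams(tokens(b), n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


original = "int main() { int x = 0; return x; }"
renamed = "int main() { int y = 0; return y; }"
score = ngram_similarity(original, renamed)
```

A score like this would be one column in a feature vector, next to style and dead-code features, that a classifier is then trained on. Renaming a variable lowers the score but does not drive it to zero, which is why such features are usually combined rather than used alone.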
Citations
... These methods include the use of word embeddings in embedding-based approaches, stylistic feature-focused stylometric analysis, and linguistic analysis with n-gram frequencies [145]. Machine learning algorithms facilitate the identification of potentially plagiarised passages within a manuscript as well as changes in writing style. ...
In text analysis, identifying plagiarism is a crucial area of study that looks for copied information in a document and determines whether or not the same author writes portions of the text. With the emergence of publicly available tools for content generation based on large language models, the problem of inherent plagiarism has grown in importance across various industries. Students are increasingly committing plagiarism as a result of the availability and use of computers in the classroom and the generally extensive accessibility of electronic information found on the internet. As a result, there is a rising need for reliable and precise detection techniques to deal with this changing environment. This paper compares several plagiarism detection techniques and looks into how well different detection systems can distinguish between content created by humans and content created by Artificial Intelligence (AI). This article systematically evaluates 189 research papers published between 2019 and 2024 to provide an overview of the research on computational approaches for plagiarism detection (PD). We suggest a new technically focused structure for efforts to prevent and identify plagiarism, types of plagiarism, and computational techniques for detecting plagiarism to organize the way the research contributions are presented. We demonstrated that the field of plagiarism detection is rife with ongoing research. Significant progress has been made in the field throughout the time we reviewed in terms of automatically identifying plagiarism that is highly obscured and hence difficult to recognize. The exploration of nontextual contents, the use of machine learning, and improved semantic text analysis techniques are the key sources of these advancements. 
Based on our analysis, we concluded that the combination of several analytical methodologies for textual and nontextual content features is the most promising subject for future research contributions to further improve the detection of plagiarism.
... In the e-commerce sector, ML has enhanced [26] recommendation systems, [27] personalized marketing, and [28] inventory management. Lastly, [29] automated course suggestions, [30] intelligent tutoring systems, and [31] plagiarism detection are some contributions of ML to the education sector. Nevertheless, ML has emerged as a game changer, transforming [32]-[34] countless industries. ...
... Recently, AI-generated code detection approaches have been proposed (Hoq et al., 2024). However, most of these studies focused on higher education (Awale et al., 2020; Cheers et al., 2021a; Karnalim et al., 2021), with limited studies on plagiarism detection in K-12 education. This is particularly important as early education in programming is becoming more prevalent (Macrides et al., 2022). ...
... In addition, most plagiarism detection studies focus on popular programming languages such as Java (Hoq et al., 2024), C/C++ (Awale et al., 2020) and Python (Mitchell et al., 2023). There is a scarcity of research on plagiarism detection for pseudocode. ...
The ability of large language models (LLMs) to generate code has raised concerns in computer science education, as students may use tools like ChatGPT for programming assignments. While much research has focused on higher education, especially for languages like Java and Python, little attention has been given to K-12 settings, particularly for pseudocode. This study seeks to bridge this gap by developing explainable machine learning models for detecting pseudocode plagiarism in online programming education. A comprehensive pseudocode dataset was constructed, comprising 7,838 pseudocode submissions from 2,578 high school students enrolled in an online programming foundations course from 2020 to 2023, along with 6,300 pseudocode samples generated by three versions of ChatGPT. An ensemble model (EM) was then proposed to detect AI-generated pseudocode and was compared with six other baseline models. SHapley Additive exPlanations were used to explain how these models differentiate AI-generated pseudocode from student submissions. The results show that students’ submissions have higher similarity with GPT-3 than with the other two GPT models. The proposed model achieves an accuracy of 98.97%. The differences between AI-generated pseudocode and student submissions lie in several aspects: AI-generated pseudocode often begins with more complex verbs and features shorter sentence lengths. It frequently includes clear numerical or word-based indicators of sequence and tends to incorporate more comments throughout the code. This research provides practical insights for online programming and contributes to developing educational technologies and methods that strengthen academic integrity in such courses.
... Metrics-based Representation 8 [34], [74], [75], [78], [99], [138], [145], [146] Text-based Representation ...
... Classification Metrics Acc 24 [1], [35], [37], [39], [48], [75], [79], [80], [85], [99], [104], [108], [116], [118], [132], [137]- [140], [142], [146], [147], [149], [153] AUC 8 [37], [80], [81], [83], [124], [126], [137], [138] F1 75 ...
... [1]- [6], [9], [35]- [37], [39], [41], [43], [44], [46], [48], [49], [73], [77], [79], [82], [84]- [86], [88]- [98], [100]- [107], [109]- [117], [121], [123], [124], [127]- [132], [134]- [136], [138], [140]- [142], [144], [146], [148], [150], [152], [164], [ [6], [9], [35], [36], [39]- [41], [43], [44], [46], [48], [49], [73], [74], [76]- [79], [82]- [86], [88]- [107], [109]- [118], [121], [123]- [125], [127]- [132], [134]- [136], [138], [140]- [142], [144], [146], [148]- [150], [152], [164], [165] Rec 83 ...
Source code similarity measurement, which involves assessing the degree of difference between code segments, plays a crucial role in various aspects of the software development cycle. These include but are not limited to code quality assurance, code review processes, code plagiarism detection, security, and vulnerability analysis. Despite the increasing application of ML techniques in this domain, a comprehensive synthesis of existing methodologies remains lacking. This paper presents a systematic review of machine learning techniques applied to code similarity measurement, aiming to illuminate current methodologies and contribute valuable insights to the research community. Following a rigorous systematic review protocol, we identified and analyzed 84 primary studies along a broad spectrum of dimensions covering application type, machine learning algorithms devised, code representations used, datasets, and performance metrics, as well as performance evaluations. A deep investigation reveals that 15 applications for code similarity measurement have utilized 51 different machine learning algorithms. Additionally, the most prevalent code representation is found to be the abstract syntax tree (AST). Furthermore, the most frequently employed dataset across various code similarity research applications is BigCloneBench. Through this comprehensive analysis, the paper not only synthesizes existing research but also identifies prevailing limitations and challenges, shedding light on potential avenues for future work.
... Social incentive: In this algorithm, it is assumed that 50% of chimps will follow their normal behavior in the last step of the hunting process, and the other 50% of chimps will follow the chaotic strategies to update their continuous positions [40]. The updating method is described as (14). ...
Traditional essay scoring methods not only consume tremendous manpower and financial resources, but also the scoring results are easily affected by subjective factors. To improve the efficiency of essay scoring and reduce scoring errors, this paper proposes an automated essay scoring method based on the enhanced chimp optimization algorithm-back propagation neural network (ENChOA-BP) and K-means clustering. Firstly, this paper utilized K-means to select representative samples near cluster centers for experiments, decreasing the subjectivity influences of examiners. Then, three improvement strategies are introduced to the chimp optimization algorithm (ChOA) to improve its search capability, which is named the enhanced chimp optimization algorithm (ENChOA). In this algorithm, the good point set of initialization improves the global search ability of the ChOA algorithm. The teaching and memory strategies achieve group communication and experiential learning, enabling chimps to learn independently and approach optimal individuals. 15 benchmark functions are used to validate the superiority of the proposed algorithm by comparing it with 9 other algorithms. The experimental results indicate that ENChOA is more powerful than ChOA and other meta-heuristics algorithms. Finally, the ENChOA is used to optimize the parameters of back propagation neural network (BP), which is the ENChOA-BP model applied to essay scoring. The experimental results show that using the ENChOA-BP model for essay scoring has a correlation coefficient of up to 90% between the predicted score and the actual score.
... Furthermore, [19] proposed a ML approach to detect plagiarism in programming assignments. ...
In recent times, machine learning (ML) has become one of the most valuable fields of artificial intelligence (AI) transforming education. The application of ML in education offers promising benefits to both scientists and researchers, and this is the focus of this study. This paper reviews recent trends and advancements of ML in education, focusing on areas such as personalisation of learning, predictive analytics, plagiarism detection, intelligent tutoring systems, gamification of learning, and recommendation systems. From the literature review, we identify the current benefits and challenges of ML in education. The paper also provides insight into the applications and offers recommendations to address the challenges of ML in the field of education.
... Libraries play a crucial role in preventing plagiarism in educational and research institutions (Awasthi, 2019). All plagiarism detection software uses ML and DL approaches to match similar sources over the internet (Awale et al., 2020). It also used pattern recognition to detect plagiarism (Mausumi, 2016). ...
The paper aims to discuss ChatGPT and its role in academic libraries. It identifies the key areas where ChatGPT can be implemented in libraries to improve various services and the overall work experience of academics, along with the risk factors associated with ChatGPT. Research finds that ChatGPT can aid many library services such as collection development management, virtual reference services, digital information service and library discovery, research writing and publishing, research performance analysis, and bibliometric services. It can help libraries meet patron requirements, streamline the research process, and increase the efficiency of the library and staff. It reduces the workload of library staff to a certain extent rather than replacing human librarians, as various threats are associated with ChatGPT, such as incorrect query responses, data protection, misuse, privacy, security, inaccessibility, limited technology comprehension, and other ethical issues. The paper also discusses the limitations and challenges libraries may face while implementing the technology to deliver various services. The importance of this study lies in the fact that it acknowledges the reality of ChatGPT and its limitations and considers how libraries can prepare to cope with the technology in the future.
... In fact, SVM is widely used in text plagiarism detection and has also previously been used in code plagiarism detection. Awale et al. (2020) employed SVM and xgboost classifiers to assess C++ programming assignments for plagiarism, focusing primarily on string matching techniques, with particular focus on the location of braces and the commenting style on a variety of features including coding style and logic structure. Eppa and Murali (2022) compared SVM with other machine learning methods for detecting plagiarism in C programming, evaluating the efficacy of different features like syntax and code structure in identifying copied content. ...
... In this study we combine the use of textual or stylistic elements of the code similar to those used by Awale et al. (2020) along with syntax-based elements such as those used by Zhao et al. (2015) with the key difference that our technique utilizes customized Python-specific textual and stylistic similarity measures, whereas the studies cited previously have primarily focused on C, C++, and Java. ...
Mechanisms for plagiarism detection play a crucial role in maintaining academic integrity, acting both to penalize wrongdoing and to serve as a preemptive deterrent for bad behavior. This manuscript proposes a customized plagiarism detection algorithm tailored to detect source code plagiarism in the Python programming language. Our approach combines textual and syntactic techniques, employing a support vector machine (SVM) to effectively combine various indicators of similarity and calculate the resulting similarity scores. The algorithm was trained and tested using a sample of code submissions for 4 coding problems from each of 45 volunteers; 15 of these were original submissions while the other 30 were plagiarized samples. The submissions for two of the questions were used for training and the other two for testing, using the leave-p-out cross-validation strategy to avoid overfitting. We compare the performance of the proposed method with two widely used tools, MOSS and JPlag, and find that the proposed method results in a small but significant improvement in accuracy compared to JPlag, while significantly outperforming MOSS in flagging plagiarized samples.
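The leave-p-out split described in the abstract, with two questions held out for testing, can be enumerated as in the following sketch. The question labels Q1 to Q4 and the function name are hypothetical, and the abstract does not say whether all such splits were evaluated or only some of them.

```python
from itertools import combinations


def leave_p_out_splits(groups, p):
    # Yield (train, test) pairs in which every subset of p groups
    # is held out for testing exactly once.
    groups = list(groups)
    for test in combinations(groups, p):
        train = [g for g in groups if g not in test]
        yield train, list(test)


# Hypothetical labels for the four coding problems.
questions = ["Q1", "Q2", "Q3", "Q4"]
splits = list(leave_p_out_splits(questions, p=2))
```

Splitting by question rather than by individual submission keeps every solution to a held-out problem out of the training data, so the classifier is tested on problems it has never seen.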
... The converse is true with third-party cookies, which indicate that information is sent from one website to another, and that the first website has granted permission to gather data. Third-party cookies may thus violate your privacy [25]. Cookies can be classified by category. ...
Every time we open our computers, laptops, or mobile phones to browse the web, we visit different websites, follow various hyperlinks, look for gadgets or things to buy, or browse a classified advert. After a while, a separate website shows us a picture of the very thing we were looking for. That means we are being tracked and served tailored advertisements based on our previous interests and location, as recorded in cookie content. What complicates the situation is that the trackers may not be authorized or permitted to do that. This work aimed to study how visitors to popular Nigerian websites are tracked and how their privacy is affected. To that end, all cookies were identified by category and type. We determine whether popular Nigerian webpages comply with the ePrivacy Directive to understand whether visitors to these websites were tracked without consent. Finally, we assess which defense methods are effective against third-party tracking. This study is based on 22 popular Nigerian websites ranked by Amazon Alexa.com. For the crawl, Python code that generates web crawling activity was created and executed. The results showed that 64% of the popular websites use third-party cookies, and most of these websites track visitors without their consent.
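The first-party versus third-party distinction used in this study can be illustrated with a short sketch. This is not the study's crawler: the function names and the small hard-coded suffix set are assumptions, and a real measurement would take its suffixes from the Public Suffix List.

```python
# Hypothetical, incomplete list of multi-label public suffixes (assumption);
# a real crawler should consult the Public Suffix List instead.
MULTI_LABEL_SUFFIXES = {"com.ng", "org.ng", "gov.ng", "co.uk"}


def registrable_domain(host: str) -> str:
    # Reduce a host or cookie domain to the part a site owner registers,
    # e.g. "news.example.com.ng" -> "example.com.ng".
    labels = host.lower().strip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def is_third_party(cookie_domain: str, page_host: str) -> bool:
    # A cookie is treated as third-party when its registrable domain
    # differs from that of the page the user is visiting.
    return registrable_domain(cookie_domain) != registrable_domain(page_host)
```

Handling suffixes such as com.ng matters here because comparing only the last two labels would make every .com.ng site look like the same party.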
... If the similarity of the sentences was higher than the specified value, the sentence similarity is compared through Word2Vec, and finally, the comparison results of similar sentences were displayed. Awale, et al. [20] proposed an approach to detect plagiarism in programming assignments by using N-grams and machine learning techniques. The method involved extracting features from source code using N-grams and then training a classifier using various machine learning algorithms to identify similarities between source codes. ...
Document similarity recognition is one of the most important problems in natural language processing. This paper proposes a plagiarism comparison mechanism called JCF. Initially, the TF–IDF scheme is applied to build a bag of words as the representation of the common features of all documents. Then, the plagiarism comparison is carried out in a coarse-grained manner, which speeds up the similarity comparison. Finally, the most similar documents can then be compared in detail based on a fine-grained approach. In addition, the JCF detects plagiarism at both syntax level and semantic-like level. To prevent the distortion of similarity comparison, this paper further develops a similarity restoration approach such that the proposed JCF can obtain both advantages of quickness and accuracy. Performance studies confirm that the proposed JCF outperforms existing studies in terms of precision, recall and F1 score.
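The coarse-then-fine idea in this abstract can be sketched as follows. This is not the JCF algorithm itself: whitespace tokenization, the smoothed IDF formula, the 0.2 threshold, and the use of difflib for the fine-grained stage are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tfidf_vectors(docs):
    # Sparse TF-IDF vectors (dicts) with a smoothed IDF term.
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
         for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def coarse_candidates(query, corpus, threshold=0.2):
    # Coarse stage: keep indices of documents whose bag-of-words
    # similarity to the query reaches the threshold.
    vecs = tfidf_vectors([query] + list(corpus))
    return [i for i, v in enumerate(vecs[1:]) if cosine(vecs[0], v) >= threshold]


def fine_score(a, b):
    # Fine stage: order-sensitive word-level match ratio.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


corpus = ["the cat sat on the mat", "stock prices fell sharply today"]
query = "the cat sat on the mat"
hits = coarse_candidates(query, corpus)
scores = {i: fine_score(query, corpus[i]) for i in hits}
```

The cheap bag-of-words pass discards documents that share no vocabulary with the query, so the more expensive order-sensitive comparison runs only on the survivors.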