Conference Paper

Spam E-Mail Classification by Utilizing N-Gram Features of Hyperlink Texts

Abstract

With the advent of the Internet and the falling cost of digital communication, spam has become a key problem in several types of media (e.g., email, social media, and microblogs). In recent years, email spam in particular has become an exponentially growing threat that affects both individuals and the business world. Hence, a large number of studies have been proposed to combat spam emails. In this study, instead of the subject or body components of emails, the pure use of hyperlink texts along with a word-level n-gram indexing schema is proposed for the first time to generate features for a spam/ham email classifier. Since the length of link texts in emails does not exceed sentence level, we have limited the n-gram indexing to at most the trigram schema. Throughout the study, a novel large-scale dataset provided by COMODO Inc., covering 50,000 link texts belonging to spam and ham emails, has been used for feature extraction and performance evaluation. To generate the required vocabularies, unigram, bigram, and trigram models have been built. Next, three different machine learning methods (Support Vector Machines, SVM-Pegasos, and Naive Bayes), including one active learner, have been employed to classify each link. According to the results of the experiments, classification using the trigram-based bag-of-words representation reaches up to 98.75% accuracy, which outperforms the unigram and bigram schemas. Apart from its high accuracy, the proposed approach also preserves the privacy of customers, since it does not require any kind of analysis of the body contents of emails.
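The word-level n-gram indexing described in the abstract can be illustrated with a minimal sketch. The whitespace tokenizer, the function names `word_ngrams` and `ngram_features`, and the sample link text are assumptions made for illustration; the abstract does not specify the tokenization or whether unigrams, bigrams, and trigrams are pooled into one vocabulary.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams of `text` as tuples of lowercase tokens.

    Assumption: tokens are obtained by simple whitespace splitting.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=3):
    """Count every 1..max_n word n-gram of a single link text.

    Assumption: all n-gram orders are pooled into one bag of features.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


# Hypothetical link text, not taken from the dataset used in the study.
features = ngram_features("click here to unsubscribe")
for gram, count in features.items():
    print(" ".join(gram), count)
```

A four-word link text yields four unigrams, three bigrams, and two trigrams, which shows why short link texts leave little room for n-grams beyond the trigram level.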
... In order to work with text, we first need to process it so that it is easy to vectorize. We will use the following methods of primary text processing, described in [18]: ...
... - stemming and lemmatization: transformation of a word into its initial form or infinitive; - division of text into a set of tokens. Following the recommendations of [18], preprocessing produced sets of 1-grams and 2-grams, with which we continue to work. ...
... F-measure = 2 * (Precision * Recall) / (Precision + Recall). Fig. 2 shows the comparative results of algorithms that worked on data vectorized using the principles of PV-DM and TF-IDF [19] using the addition of 2-grams [18]. To be able to compare algorithms in standard conditions, experiments were performed without the use of 2-grams and using the following standard vectorization methods: TF-IDF (Fig. 3) and Bag of Words (BOW) (Fig. 4). ...
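The F-measure formula quoted in the snippet above can be sketched in a few lines. The function name `precision_recall_f1` and the convention of returning 0.0 when a denominator is zero are assumptions for illustration, not part of the cited work.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F-measure from confusion counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 8 spam messages caught, 2 false alarms, 2 missed.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f)
```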
Article
The growing use of email for business transactions and general communication, driven by its cost-effectiveness, has made email vulnerable to attacks, including spam. Spam emails are unsolicited messages that are very similar to each other and are sent to multiple recipients randomly. This study analyzes the Rotation Forest model and modifies it for the spam classification problem, with the aim of creating a better classifier. To improve classifier stability, the experiments were carried out on the Enron spam, Ling spam, and SpamAssassin datasets and evaluated for accuracy, F-measure, precision, and recall.
... Advancements in text representation techniques such as TF-IDF [9,10], word embeddings, and n-grams [11][12][13][14][15][16] have contributed to more robust feature representations of text data. These approaches help in capturing more nuanced information from messages, improving the classifier's ability to discern between spam and non-spam emails. ...
Article
With the advent of digital technologies as an integral part of today’s everyday life, the risk of information security breaches is increasing. Email spam, commonly known as junk email, continues to pose a significant challenge in the digital realm, inundating inboxes with unsolicited and often irrelevant messages. This relentless influx of spam not only disrupts user productivity but also raises security concerns, as it frequently serves as a vehicle for phishing attempts, malware distribution, and other cyber threats. The prevalence of spam is fueled by its low-cost dissemination and its ability to reach a wide audience, exploiting vulnerabilities in email systems. This paper marks the inception of an in-depth investigation into the viability and potential implementation of a robust spam filtering and prevention system tailored explicitly to university networks. With the escalating threat of email-based hacking attacks and the incessant deluge of spam, the need for a comprehensive and effective defense mechanism within academic institutions becomes increasingly imperative. In exploring potential solutions, this study delves into the applicability and efficacy of Bayesian filters, a class of probabilistic classifiers renowned for their aptitude in distinguishing between legitimate emails and spam messages. Bayesian filters utilize statistical algorithms to analyze email content, learning patterns and features to accurately categorize incoming emails.
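As a rough illustration of how a Bayesian filter learns from email content, the following is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing. It is not the system studied in the paper; the function names `train_nb` and `predict_nb`, the whitespace tokenization, and the toy training messages are all assumptions.

```python
import math
from collections import Counter


def train_nb(docs):
    """Fit a multinomial Naive Bayes model.

    docs: list of (tokens, label) pairs.
    Returns per-class document counts, per-class token counts, and the vocabulary.
    """
    class_docs = Counter(label for _, label in docs)
    token_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        token_counts[label].update(tokens)
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return class_docs, token_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior for `tokens`."""
    class_docs, token_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_tokens = sum(token_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log(
                    (token_counts[label][tok] + 1) / (total_tokens + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy messages, for illustration only.
training = [
    ("win free prize now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lecture notes attached".split(), "ham"),
]
model = train_nb(training)
print(predict_nb(model, "free prize".split()))
print(predict_nb(model, "lecture notes".split()))
```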
... Bozkır et al. [3] extracted the features of a set of emails using the N-gram method and then carried out spam classification with the Naive Bayes (NB) algorithm. Nazlı [4] worked on comparing machine learning-based spam filtering methods according to the F1 metric. ...
Article
Electronic mail is a kind of digital letter sent over the Internet. Any kind of file, such as documents, images, videos, and music, can be sent and received by email. It is frequently preferred because of its low cost. Email is an effective means of communication because it saves time and money. Owing to its low cost and ease of use, it is used heavily by those who want to advertise. In addition, cyber attackers can harm their victims by sending them such emails. To prevent this, models that classify spam emails with machine learning algorithms are now being designed. The aim of this study is to compare the Word2Vec and Term Frequency-Inverse Document Frequency (TF-IDF) methods, which appear frequently in the spam detection literature, on a Turkish dataset, and to increase the success rate relative to earlier studies carried out on that dataset. A review of earlier work shows that studies have generally concentrated on English datasets. To address this gap, this study compares the feature extraction methods mentioned above on a Turkish dataset and builds two different models. Different classifiers are also used in these models, with the goal of identifying the most effective method.
... algorithm to mine the operational law of users according to a series of sending behaviors. N-gram algorithm [24] is a commonly used method in fuzzy matching to evaluate the rationality of two sequences by the degree of difference between them. We can find anomalous patterns in spammer operations by determining the sequence matches of the entire sample. ...
Chapter
Spammers have existed since the birth of the Internet. They constantly pollute the social network environment, seriously degrade user experience and pose a threat to user account security. Finding spammers has become one of the most important tasks for social networking platforms. However, spammers use various methods to hide themselves from normal users, which makes it more difficult to detect spammers effectively. We propose a spammer detection method based on GraphSAGE Graph Neural Network, which distinguishes spammers from normal users based on the social attribute relationship of accounts. Even if spammers constantly change the content of their spam messages to avoid detection, they can still be identified by the different social attributes of spammers and normal users. In our method, user feature, relationship feature and behavior feature are designed and extracted to represent the social attribute relationship of users. At the same time, we have successfully and effectively utilized GraphSAGE to address the spammer detection problem. We prove the effectiveness of our method through experiments on the real-world dataset, and the results show that our performance is better than other comparison methods.
Chapter
In this study, a system will be implemented for the Kariyer.net Customer Solution Center that estimates the category of emails coming from customers and job seekers. These emails, sent to Customer Solution Center employees via an internal demand system, will be automatically assigned a category by the system before any action is taken, and guidance will be provided on the proper action to take. Technologies such as natural language processing, classification algorithms, machine learning, and user interfaces will be studied in the system.
Article
In this paper, we explore the use of text semantic analysis to improve the accuracy of spam detection. We propose a method based on two levels of semantic analysis. In the first level, we categorize emails by specific domains (e.g., Health, Education, Finance, etc.) to enable a separate conceptual view of spam in each domain. In the second level, we combine a set of manually specified and automatically extracted semantic features for spam detection in each domain. These features are meant to summarize the email content into compact topics that discriminate spam from non-spam emails efficiently. We show that the proposed method enables better spam detection than existing methods based on Bag-of-Words (BoW) and semantic content, and leads to more interpretable results.
Article
We present a comprehensive review of the most effective content-based e-mail spam filtering techniques. We focus primarily on Machine Learning-based spam filters and their variants, and report on a broad review ranging from surveying the relevant ideas, efforts, effectiveness, and the current progress. The initial exposition of the background examines the basics of e-mail spam filtering, the evolving nature of spam, spammers playing cat-and-mouse with e-mail service providers (ESPs), and the Machine Learning front in fighting spam. We conclude by measuring the impact of Machine Learning-based filters and explore the promising offshoots of latest developments.
Conference Paper
Spam, also known as Unsolicited Commercial Email (UCE), is becoming a nightmare for Internet users and providers. Machine learning techniques such as Support Vector Machines (SVM) have achieved high accuracy in filtering spam messages. However, a certain number of legitimate emails are often classified as spam (false positive errors), although errors of this kind are prohibitively expensive. In this paper we address the problem of reducing false positive errors in particular in anti-spam email filters based on the SVM. To this end, an ensemble of SVMs that combines multiple dissimilarities is proposed. The experimental results suggest that the new method outperforms classifiers based solely on a single dissimilarity, as well as a widely used combination strategy such as bagging.
Conference Paper
Many solutions have been deployed to prevent the harmful effects of spam mail. Typical methods are either keyword-based pattern matching or probabilistic methods such as the naive Bayesian method. In this paper, we propose a method for separating spam mail from normal mail using a support vector machine, which has excellent performance in binary pattern classification problems. In particular, the proposed method efficiently carries out the learning procedure with a word dictionary built from n-grams. In conclusion, we show that the proposed method is superior to the others in a performance comparison.
Conference Paper
Malicious spam is one of the major problems of the Internet nowadays. It causes financial damage to companies and poses security threats to governments and organizations. Most recent spam emails contain URLs that redirect recipients to malicious Web servers. In this paper, we propose an online machine learning-based malicious spam email detection system. Each spam email is represented by a term-weighting scheme, and the resulting feature vectors are used as the input of the classifier. Learning is performed periodically to update the classifier so that the system can adapt to spam emails whose contents change from time to time. A real data set is labeled by the SPIKE system, which was developed by NICT. Evaluation experiments show that the detection system identifies malicious spam emails efficiently and accurately.
Article
Junk email - yes, it's annoying, but it can also be overwhelming. A new study evaluates the current extent of the spamming problem and suggests there are no quick fixes to solve the situation.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Conference Paper
Unsolicited Bulk Email (UBE) has become a large problem in recent years. The number of mass mailers in existence is increasing dramatically. Automatically detecting UBE has become a vital area of current research. Many email clients (such as Outlook and Thunderbird) already have junk filters built in. Mass mailers are continually evolving and overcoming some of the junk filters, which means that the need for research in the area is ongoing. Many existing techniques seem to choose the features used for classification at random. This paper aims to address this issue by investigating the utility of over 40 features that have been used in recent literature. Information gain for these features is calculated over Ham, Spam, and Phishing corpora.
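To make the idea of ranking features by information gain concrete, here is a minimal sketch for a single binary feature. The function names `entropy` and `information_gain` and the toy rows are assumptions; the features and corpora used in the paper are not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows):
    """Information gain of a binary feature with respect to the class label.

    rows: list of (feature_present, label) pairs.
    IG = H(label) - sum over feature values v of P(v) * H(label | v).
    """
    labels = [label for _, label in rows]
    gain = entropy(labels)
    for value in (True, False):
        subset = [label for present, label in rows if present == value]
        if subset:
            gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain


# Hypothetical feature "message contains a URL" over toy labeled messages.
rows = [(True, "spam")] * 5 + [(False, "ham")] * 5
print(information_gain(rows))
```

A feature that separates the classes perfectly has a gain equal to the label entropy, while a feature that is independent of the label has a gain of zero.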
Conference Paper
We describe and analyze a simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines (SVM). Our method alternates between stochastic gradient descent steps and projection steps. We prove that the number of iterations required to obtain a solution of accuracy $\epsilon$ is $\tilde{O}(1/\epsilon)$. In contrast, previous analyses of stochastic gradient descent methods require $\Omega(1/\epsilon^2)$ iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with $1/\lambda$, where $\lambda$ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is $\tilde{O}(d/(\lambda\epsilon))$, where $d$ is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function. We demonstrate the efficiency and applicability of our approach by conducting experiments on large text classification problems, comparing our solver to existing state-of-the-art SVM solvers. For example, it takes less than 5 seconds for our solver to converge when solving a text classification problem from Reuters Corpus Volume 1 (RCV1) with 800,000 training examples.
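The alternation between stochastic gradient steps and projection steps can be sketched as follows for a linear kernel. This is a simplified single-example variant written for illustration, not the authors' implementation: the function name `pegasos`, the step size 1/(lam * t), the omission of a bias term, and the toy data are assumptions.

```python
import math
import random


def pegasos(examples, lam=0.1, n_iters=1000, seed=0):
    """Train a linear SVM with a Pegasos-style stochastic sub-gradient method.

    examples: list of (x, y) with x a list of floats and y in {-1, +1}.
    lam: regularization parameter.
    Returns the weight vector w (no bias term in this sketch).
    """
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    radius = 1.0 / math.sqrt(lam)  # projection ball radius
    for t in range(1, n_iters + 1):
        x, y = rng.choice(examples)  # one random training example
        eta = 1.0 / (lam * t)        # decreasing step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Gradient step on the regularizer: shrink the weights.
        w = [(1.0 - eta * lam) * wi for wi in w]
        # Gradient step on the hinge loss, only if the margin is violated.
        if margin < 1.0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        # Projection step: keep w inside the ball of radius 1/sqrt(lam).
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > radius:
            w = [wi * radius / norm for wi in w]
    return w


# Hypothetical, linearly separable toy data.
data = [
    ([2.0, 0.0], 1), ([1.5, 0.5], 1),
    ([-2.0, 0.0], -1), ([-1.5, -0.5], -1),
]
w = pegasos(data)
print(w)
```

Each iteration touches a single example, so the cost per step depends on the number of non-zero features rather than on the size of the training set.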