National Centre for Scientific Research (NCSR) “Demokritos”, Technical Report DEMO 2001.
A Memory-Based Approach to Anti-Spam Filtering
Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras,
Vangelis Karkaletsis, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos
Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
National Centre for Scientific Research “Demokritos”
GR-153 10 Ag. Paraskevi, Athens, Greece
e-mail: {ionandr, paliourg, vangelis, costass}@iit.demokritos.gr
Department of Informatics, University of Athens
TYPA Buildings, Panepistimiopolis, GR-157 71, Athens, Greece
e-mail: {stud0926, T.Stamatopoulos}@di.uoa.gr
Abstract
This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a
novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, also known as “spam”, floods the
mailboxes of users, causing frustration, wasting bandwidth and money, and exposing minors to unsuitable content.
Using a recently introduced publicly available corpus, a thorough investigation of the effectiveness of a memory-based
anti-spam filter is performed, including different attribute and distance weighting schemes, and studies on the effect of
the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are
identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam
filtering is practically feasible, especially when combined with additional safety nets. Compared to a previously tested
Naïve Bayes filter, the memory-based filter performs better on average, particularly when the misclassification cost for
non-spam messages is high.
Keywords: text categorization; machine learning; unsolicited commercial e-mail; spam.
1. Introduction
This paper presents a thorough empirical evaluation of memory-based learning in the context of a novel cost-sensitive
application, that of filtering unsolicited commercial e-mail.
The increasing popularity and low cost of electronic mail have enticed direct marketers to flood the mailboxes of
thousands of users with unsolicited messages. These messages are usually referred to as spam, or more formally
Unsolicited Commercial E-mail (UCE), and may advertise anything, from vacations to get-rich schemes. Spam
messages are becoming extremely annoying to users, as they clutter their mailboxes and prolong dial-up connections.
2
They also waste bandwidth, and often expose minors to unsuitable content by advertising pornographic sites. A 1998
study found that spam messages constituted approximately 10% of the incoming messages to a corporate network
(Cranor & LaMacchia, 1998). The situation seems to be worsening, and without appropriate counter-measures, spam
messages could eventually undermine the usability of e-mail.
The proposed counter-measures have been either regulatory or technical, with regulatory measures having limited
effect so far.1 Technical measures are based on anti-spam filters, which attempt to discriminate between spam and non-
spam, hereafter legitimate, messages. Typical anti-spam filters currently in the market employ blacklists of known
spammers, and handcrafted rules that block messages that contain specific words or phrases. Blacklists, however, are of
little use, as spammers often use forged addresses. Handcrafted rules are also problematic, as they need to be tuned to
each user’s incoming e-mail. This is a task requiring time and expertise, which has to be repeated periodically to
account for gradual changes in the characteristics of spam messages (Cranor & LaMacchia, 1998).
The success of machine learning techniques in text categorization (Sebastiani, 2001) has recently led researchers to
explore the applicability of learning algorithms in anti-spam filtering.2 A supervised learning algorithm is fed with a
corpus of e-mail messages that have been classified manually as spam or legitimate, and builds a classifier, which is
then used to detect incoming spam messages. Apart from collecting spam and legitimate training messages separately,
the learning process is fully automatic, and can be repeated to tailor the filter to the incoming messages of different
users, or to capture changes in the characteristics of spam e-mail. Anti-spam filtering differs from other electronic mail
and news categorization tasks (Lang, 1995; Cohen, 1996; Payne & Edwards, 1997), in that spam messages cover a very
wide spectrum of topics, and hence are much less homogeneous than other categories that have been considered.
Sahami et al. (1998) experimented with an anti-spam filter based on Naïve Bayes (Mitchell, 1997). In similar anti-
spam experiments, Pantel and Lin (1998) found Naïve Bayes to outperform Ripper (Cohen, 1996). Drucker et al. (1999)
experimented with Ripper, Rocchio’s classifier (Rocchio, 1971; Joachims, 1997), Support Vector Machines (Cristianini
and Shawe-Taylor, 2000), and boosting decision trees (Quinlan, 1993; Schapire & Singer, 2000), with results showing
that Support Vector Machines and boosting decision trees achieve very similar error rates, both outperforming
Rocchio’s classifier. A direct comparison of these previous results, however, is impossible, as they are based on
different, and not publicly available data sets. Furthermore, the reported figures can be misleading, since they are not
formulated within a cost-sensitive framework: the cost of accidentally blocking a legitimate message can be much
higher than letting a spam message pass the filter, and this cost difference must be taken into account during both
training and evaluation.3
3
To address these shortcomings we have recently introduced Ling-Spam, a publicly available collection of legitimate
and spam messages, as well as suitable cost-sensitive evaluation measures, which were used to conduct a detailed
evaluation of the Naïve Bayes filter (Androutsopoulos, et al. 2000a).4 Continuing that strand of work, this paper
presents a thorough empirical evaluation of a memory-based anti-spam filter, using the same corpus and evaluation
framework as in our previous experiments. This makes our results directly comparable, and allows other researchers to
build upon our work to evaluate other anti-spam filters, leading to a standard benchmark.
Memory-based classifiers are particularly promising for anti-spam filtering, on the grounds that spam messages
form a rather incoherent class in terms of topics. Hence, a classifier that predicts the class of a new message by recalling
similar already classified messages is likely to perform at least as well as classifiers that build a unique model for each
message class. A preliminary investigation of memory-based anti-spam filtering was presented in (Androutsopoulos, et
al. 2000b), where we experimented with a simplistic version of the k-Nearest Neighbor algorithm (Mitchell, 1997) with
promising results. Gomez Hidalgo et al. (2000) have reported similar experiments. The work that will be presented here
is much more detailed, in that it considers the effect of several extensions to the basic k-Nearest Neighbor algorithm that
have not been explored in previous anti-spam experiments, including several functions for attribute and distance
weighting, as well as the effect of the neighborhood size, the size of the training corpus, the size of the attribute set, and
different cost scenarios. In all cases, we attempt to justify our observations, and thus increase our confidence that
similar behavior is likely to appear in other cost-sensitive applications as well.
Overall, our results indicate that memory-based anti-spam filtering is practically feasible, especially when combined
with additional safety nets. Compared to the Naïve Bayes filter, the memory-based filter performs on average better,
particularly when the misclassification cost for legitimate messages is high.
The rest of this paper is organized as follows: section 2 presents our benchmark corpus; section 3 describes the
preprocessing that is applied to the messages to convert them to training or testing instances; section 4 discusses the
basic memory-based learner; section 5 introduces the cost-sensitive evaluation measures; section 6 presents our
experimental results, investigating separately the effect of each one of the extensions to the basic algorithm that we have
considered; section 7 concludes and suggests further directions.
2. Benchmark corpus
Text categorization research has benefited significantly from the existence of publicly available manually categorized
document collections, like the Reuters corpus (Lewis, 1992), that have been used as benchmarks. Producing such
corpora for anti-spam filtering is not straightforward, since user mailboxes cannot be made public without violating the
privacy of their owners and the senders of the messages. A useful and publicly available approximation of a user’s
4
mailbox, however, can be constructed by mixing spam messages with messages extracted from spam-free public
archives of mailing lists. Our benchmark corpus, Ling-Spam, is based on this approach. It is a mixture of spam
messages and messages sent via the Linguist list, a moderated mailing list about the science and profession of
linguistics.5 The corpus consists of 2893 messages:
- 2412 messages from the Linguist list, obtained by randomly downloading digests from the list's archives, breaking the digests into their messages, and removing text added by the list's server.
- 481 spam messages, received by one of the authors. Attachments, HTML tags, and duplicate spam messages received on the same day have not been included.
Spam messages constitute 16.6% of Ling-Spam, a rate close to those reported by Cranor and LaMacchia (1998), and
Sahami et al. (1998). Although the Linguist messages are more topic-specific than most users’ incoming e-mail, they
are less standardized than one might expect. For example, they contain job postings, software availability
announcements, even flame-like responses. We, therefore, believe that useful conclusions about anti-spam filtering can
be reached with Ling-Spam. With a more direct interpretation, our experiments can also be seen as a study on anti-spam
filters for open unmoderated mailing lists or newsgroups.
An alternative approach is to distribute suitably encrypted mailboxes, which will allow different representation and
learning techniques to be compared, while still maintaining privacy. We have recently started to explore that path as
well (Androutsopoulos, et al. 2000c). To the best of our knowledge, the only other publicly available collection of spam
and legitimate messages is Spambase (Gomez Hidalgo, et al. 2000).6 This is a collection of 4601 vectors, each
representing a spam or legitimate message, with spam messages being 39.4% of the total. Each vector contains the
values of 58 pre-selected attributes, including the category. Spambase is much more restrictive than Ling-Spam, since
the original texts are not available. For example, one cannot experiment with more attributes, different attribute
selection algorithms, or attributes corresponding to phrases, rather than individual words (see section 3 below).
3. Message representation and preprocessing
For the purposes of our experiments, each message of Ling-Spam is converted into a vector $\vec{x} = \langle x_1, x_2, x_3, \ldots, x_n \rangle$, where $x_1, \ldots, x_n$ are the values of attributes $X_1, \ldots, X_n$, as in the vector space model (Salton & McGill, 1983). All attributes are binary: $X_i = 1$ if some characteristic represented by $X_i$ is present in the message; otherwise $X_i = 0$.
In our experiments, attributes represent words, i.e. each attribute shows if a particular word (e.g. “adult”) occurs in the
message. It is also possible to use attributes corresponding to phrases (e.g. “be over 21”) or non-textual characteristics
(e.g. whether the message contains attachments or not) (Sahami, et al. 1998; Pantel and Lin, 1998). To avoid treating
forms of the same word as different attributes, a lemmatizer was applied to Ling-Spam, converting each word to its base
form (e.g. “was” becomes “be”).7
To reduce the high dimensionality of the instance space, attribute selection was performed. First, words occurring in
less than 4 messages were discarded, i.e. they were not considered as candidate attributes. Then, the Information Gain
(IG) of each candidate attribute $X$ with respect to the category-denoting variable $C$ was computed as follows:

$$IG(X, C) = \sum_{x \in \{0, 1\}} \sum_{c \in \{spam, legit\}} P(X = x, C = c) \cdot \log_2 \frac{P(X = x, C = c)}{P(X = x) \cdot P(C = c)} \qquad (1)$$
The attributes with the m highest IG-scores were selected, with m varying in the experiments from 50 to 700 by 50. The
probabilities are estimated from the training corpus using m-estimates (Mitchell, 1997). Yang and Pedersen (1997)
report that it is feasible to remove up to 98% of the candidate attributes using IG or other similar functions, and
preserve, or even improve, generalization accuracy.
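As a concrete illustration, the selection step above can be sketched in Python. This is a simplified sketch rather than the implementation used in the paper: probabilities are estimated with raw relative frequencies instead of the m-estimates mentioned above, and all function and variable names are hypothetical.

```python
import math
from collections import Counter

def information_gain(msgs, labels, word):
    """IG of a binary word-attribute X w.r.t. the category C (Equation 1).
    msgs: list of sets of words; labels: 'spam' or 'legit' per message.
    Note: raw frequencies are used here, not the paper's m-estimates."""
    n = len(msgs)
    joint = Counter()                        # counts of (x, c) pairs
    for words, c in zip(msgs, labels):
        joint[(1 if word in words else 0, c)] += 1
    ig = 0.0
    for x in (0, 1):
        px = sum(joint[(x, c)] for c in ("spam", "legit")) / n
        for c in ("spam", "legit"):
            pc = labels.count(c) / n
            pxc = joint[(x, c)] / n
            if pxc > 0:
                ig += pxc * math.log2(pxc / (px * pc))
    return ig

def select_attributes(msgs, labels, m):
    """Keep the m candidate words with the highest IG scores."""
    vocab = {w for words in msgs for w in words}
    return sorted(vocab, key=lambda w: information_gain(msgs, labels, w),
                  reverse=True)[:m]
```

A word that occurs in every spam message and no legitimate one gets the maximum score, while a word spread evenly across both categories scores near zero.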
4. Memory-based learning
Memory-based, or “instance-based”, methods do not construct a unique model for each category, but simply store the
training examples (Aha, et al. 1991; Wilson, 1997). Test instances are then classified by estimating their similarity to
the stored examples. In its simplest form, memory-based learning treats instances as points in a multi-dimensional space
defined by the attributes that have been selected. Classification is usually conducted through a variant of the basic k-
Nearest-Neighbor (k-NN) algorithm (Cover & Hart, 1967), which assigns to each test instance the majority class of its k
closest training instances (its k-neighborhood).
Various metrics can be used to compute the distance between two instances (Giraud-Carrier & Martinez, 1995;
Wilson & Martinez, 1997). With symbolic (nominal) attributes, as in our case, the overlap metric is a common choice.
This counts the attributes where the two instances have different values. Given two instances iniii xxxx ,,, 21
=
and jnjjj xxxx ,,, 21
= their overlap distance is:
),(),(
1
jrir
n
r
ji xxxxd
=
δ
(2)
6
where
=
δ otherwise ,1
if ,0
),( yx
yx
In our experiments, we used the memory-based learning software of TiMBL (Daelemans, et al. 2000). TiMBL
implements the basic k-NN classifier as above, except that the k-neighborhood is taken to contain all the training
instances at the k closest distances, rather than the k closest instances. If there is more than one neighbor at some of the
k closest distances, the neighborhood will contain more than k neighbors.
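The neighborhood semantics just described can be made concrete with a small sketch. This is hypothetical Python, not TiMBL's actual code; it simply mimics the "k closest distances" behavior on binary vectors.

```python
from collections import Counter

def overlap_distance(a, b):
    """Equation (2): count the attributes where two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_classify(train, test_vec, k):
    """TiMBL-style k-NN: the neighborhood contains ALL training instances
    at the k closest *distances*, so ties at a distance can yield more
    than k neighbors.  train is a list of (binary_vector, label) pairs."""
    dists = sorted((overlap_distance(v, test_vec), c) for v, c in train)
    closest = sorted({d for d, _ in dists})[:k]   # the k smallest distances
    votes = Counter(c for d, c in dists if d in closest)
    return votes.most_common(1)[0][0]
```

With k = 2, for instance, every training message at the two smallest observed distances votes, however many messages that turns out to be.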
We have also experimented with different attribute weighting and distance weighting schemes. Unlike the basic k-
NN classifier, where all the attributes are treated as equally important, attribute weighting extensions assign different
importance scores to the attributes, depending on how well they discriminate between the categories, and adjust the
distance metric accordingly (Aha, 1992; Wettschereck, et al. 1995). Distance weighting takes memory-based learning
one step further, by considering neighbors nearer to the input instance as more important, assigning greater voting
weight to them (Dudani, 1976; Bailey & Jain, 1978; Wettschereck, 1994). This can reduce the sensitivity of the
classifier to the k parameter, the neighborhood size. Attribute and distance weighting will be considered in more detail
in Section 6 below.
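As an illustration of distance weighting, the hypothetical sketch below weights each neighbor's vote by the inverse of its distance. Inverse-distance weighting is only one common choice from the literature cited above; the specific voting functions evaluated in Section 6 may differ.

```python
from collections import Counter

def distance_weighted_vote(neighbors):
    """Distance-weighted voting: each neighbor votes with weight
    1 / (d + 1), so nearer neighbors count more.  (Illustrative
    choice; not necessarily the functions tested in the paper.)
    neighbors: list of (distance, label) pairs."""
    votes = Counter()
    for d, label in neighbors:
        votes[label] += 1.0 / (d + 1.0)
    return max(votes, key=votes.get)
```

Note how a single very close neighbor can outvote several distant ones, whereas a plain majority vote would be decided by the distant majority; this is what reduces the sensitivity to k.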
5. Cost-sensitive evaluation
Mistakenly blocking a legitimate message (classifying a legitimate message as spam) is generally a more severe error than accepting a spam message (classifying a spam message as legitimate). Let $L \rightarrow S$ and $S \rightarrow L$ denote the two error types, respectively. Invoking a decision-theoretic notion of cost, we assume that $L \rightarrow S$ is $\lambda$ times more costly than $S \rightarrow L$. Let $W_L(\vec{x})$ and $W_S(\vec{x})$ be the degrees of confidence of the classifier that instance $\vec{x}$ is legitimate and spam, respectively. We classify a message as spam iff the following criterion is satisfied:

$$\vec{x} \in S \quad \text{iff} \quad \frac{W_S(\vec{x})}{W_L(\vec{x})} > \lambda \qquad (3)$$

If $W_L(\vec{x})$ and $W_S(\vec{x})$ are accurate estimates of the conditional probabilities $P(C = legit \mid \vec{X} = \vec{x})$ and $P(C = spam \mid \vec{X} = \vec{x})$, respectively, criterion (3) achieves optimal results (Duda & Hart, 1973). $W_L(\vec{x})$ and $W_S(\vec{x})$ can be scaled to the $[0, 1]$ interval, so that their sum equals 1. In this case, criterion (3) is equivalent to (4):

$$\vec{x} \in S \quad \text{iff} \quad W_S(\vec{x}) > t, \quad \text{with} \quad t = \frac{\lambda}{1 + \lambda}, \; \lambda = \frac{t}{1 - t} \qquad (4)$$
In (4), t is the classification threshold. A message exceeding this threshold is classified as spam; otherwise it is
classified as legitimate. For the basic k-NN algorithm of the previous section, a suitable measure of the confidence that
a test instance belongs to a category (legitimate or spam) is the percentage of training instances in the k-neighborhood
that belong to that category.
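Putting criterion (4) together with this confidence measure yields a one-line decision rule. The sketch below uses hypothetical names; the neighbor counts would come from the k-neighborhood described above.

```python
def classify_with_cost(spam_neighbors, total_neighbors, lam):
    """Criterion (4): classify as spam iff W_S(x) > t, with t = lam / (1 + lam).
    W_S(x) is the fraction of spam instances in the k-neighborhood."""
    t = lam / (1.0 + lam)
    w_s = spam_neighbors / total_neighbors
    return "spam" if w_s > t else "legit"
```

For example, a message whose neighborhood is 90% spam is blocked under λ = 1 (t = 0.5) but passes under λ = 999 (t = 0.999), where only near-unanimous neighborhoods are blocked.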
In their experiments, Sahami et al. (1998) set t to 0.999, which corresponds to λ = 999, i.e. blocking a legitimate
message was taken to be as bad as letting 999 spam messages pass the filter. When blocked messages are deleted
without further processing, such a high value of λ is reasonable, because most users would consider losing a legitimate
message unacceptable. Alternative usage scenarios are possible, however, where lower λ values are reasonable.
For example, rather than being deleted, a blocked message could be returned to the sender, explaining the reason of
the return and asking the sender to repost the message to another, private, unfiltered e-mail address of the recipient
(Hall, 1998). Direct marketers use lists of e-mail addresses typically harvested from web pages and newsgroup archives.
The private address would never be advertised on web pages, newsgroups, etc., making it unlikely to receive spam mail
directly. An additional safety measure could be the inclusion of a frequently changing riddle (e.g. “Include in the
subject the capital of France.”) to ensure that spam messages are not forwarded automatically to the private address by
robots that scan returned messages for new e-mail addresses. Spammers cannot afford the time to answer thousands of
riddles, and messages sent to the private address without the correct answer would be deleted automatically. In this
scenario λ=9 (t = 0.9) seems more reasonable: blocking a legitimate message is penalized mildly more than letting a
spam message pass, to account for the fact that recovering from a blocked legitimate message is more costly (counting
the sender’s extra work to repost it, not to mention his/her possible frustration) than recovering from a spam message
that passed the filter (deleting it manually).
In a third scenario, the filter could simply flag messages it suspects to be spam, without removing them from the user’s mailbox; or, it could rank incoming messages by decreasing confidence $W_L(\vec{x})$ (Drucker et al., 1999). In these cases, λ = 1 (t = 0.5) seems reasonable, since neither of the two error types is more significant than the other.
Apart from the classification threshold, cost must also be taken into account when defining the evaluation measures.
In classification tasks, accuracy (Acc) and its complementary error rate (Err = 1 – Acc) are often used. In our context:
$$Acc = \frac{N_{L \rightarrow L} + N_{S \rightarrow S}}{N_L + N_S} \qquad Err = \frac{N_{L \rightarrow S} + N_{S \rightarrow L}}{N_L + N_S}$$

where $N_{Y \rightarrow Z}$ is the number of messages in category $Y$ that the filter classified as $Z$, $N_L = N_{L \rightarrow L} + N_{L \rightarrow S}$ is the total number of legitimate messages to be classified, and $N_S = N_{S \rightarrow S} + N_{S \rightarrow L}$ the total number of spam messages.
Accuracy and error rate assign equal weights to the two error types ($L \rightarrow S$ and $S \rightarrow L$). To make these measures sensitive to cost, each legitimate message is treated, for evaluation purposes, as if it were $\lambda$ messages. That is, when a legitimate message is blocked, this counts as $\lambda$ errors; and when it passes the filter, this counts as $\lambda$ successes. This leads to the following definitions of weighted accuracy (WAcc) and weighted error rate (WErr = 1 − WAcc):

$$WAcc = \frac{\lambda \cdot N_{L \rightarrow L} + N_{S \rightarrow S}}{\lambda \cdot N_L + N_S} \qquad WErr = \frac{\lambda \cdot N_{L \rightarrow S} + N_{S \rightarrow L}}{\lambda \cdot N_L + N_S}$$
When one of the categories is more frequent than the other, as in our case and especially when λ = 9 or 999, the
values of accuracy, error rate and their weighted versions are often misleadingly high. To get a more realistic picture of
a classifier’s performance, it is common to compare its accuracy or error rate to those of a simplistic baseline approach.
We consider the case where no filter is present as our baseline: legitimate messages are (correctly) never blocked, and
spam messages (mistakenly) always pass. The weighted accuracy and weighted error rate of the baseline are:
$$WAcc^b = \frac{\lambda \cdot N_L}{\lambda \cdot N_L + N_S} \qquad WErr^b = \frac{N_S}{\lambda \cdot N_L + N_S}$$
The total cost ratio (TCR) allows the performance of a filter to be compared easily to that of the baseline:
$$TCR = \frac{WErr^b}{WErr} = \frac{N_S}{\lambda \cdot N_{L \rightarrow S} + N_{S \rightarrow L}}$$

Greater TCR values indicate better performance. For TCR < 1, not using the filter is better. If cost is proportional to wasted time, TCR expresses how much time is wasted deleting manually all spam messages when no filter is used ($N_S$), compared to the time wasted deleting manually any spam messages that passed the filter ($N_{S \rightarrow L}$) plus the time needed to recover from mistakenly blocked legitimate messages ($\lambda \cdot N_{L \rightarrow S}$).
We also present our results in terms of spam recall (SR) and spam precision (SP), defined below:
$$SR = \frac{N_{S \rightarrow S}}{N_{S \rightarrow S} + N_{S \rightarrow L}} \qquad SP = \frac{N_{S \rightarrow S}}{N_{S \rightarrow S} + N_{L \rightarrow S}}$$
SR measures the percentage of spam messages that the filter manages to block (intuitively, its effectiveness), while SP
measures the degree to which the blocked messages are indeed spam (intuitively, the filter’s safety). Despite their
intuitiveness, comparing different filter configurations using SR and SP is difficult: each filter configuration yields a
pair of SR and SP results; and without a single combining measure, like TCR that incorporates the notion of cost, it is
difficult to decide which pair is better.8
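All the measures of this section follow directly from the four outcome counts. The sketch below is illustrative Python (hypothetical names; any counts used with it are invented examples, not experimental results):

```python
def evaluate(n_ll, n_ls, n_sl, n_ss, lam):
    """Cost-sensitive measures of Section 5 from the four outcome counts.
    n_ls: legitimate blocked (L->S); n_sl: spam that passed (S->L)."""
    n_l, n_s = n_ll + n_ls, n_ss + n_sl
    wacc = (lam * n_ll + n_ss) / (lam * n_l + n_s)
    werr = (lam * n_ls + n_sl) / (lam * n_l + n_s)
    werr_b = n_s / (lam * n_l + n_s)              # baseline: no filter
    tcr = werr_b / werr if werr > 0 else float("inf")
    sr = n_ss / n_s if n_s else 0.0               # spam recall
    sp = n_ss / (n_ss + n_ls) if (n_ss + n_ls) else 0.0  # spam precision
    return wacc, werr, tcr, sr, sp
```

Note that for λ = 1, TCR reduces to the total number of spam messages over the total number of misclassified messages, matching the formula above.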
6. Experiments
We now proceed with the presentation of our experimental results, providing at the same time more information on
the attribute and distance weighting extensions to the basic k-NN classifier of Section 4 that we have considered. We
first investigated the impact of attribute weighting by considering three weighting schemes. For each of them, we
performed three sets of experiments on Ling-Spam, corresponding to the three scenarios (λ parameter) that were
described in Section 5. In each scenario, we varied the number of selected attributes from 50 to 700 by 50, each time
retaining the attributes with the highest IG scores (Section 3). Three neighborhood sizes were tested (k = 1, 2 and 10)
for each weighting scheme, scenario and dimensionality, as the initial results suggested that there was no need to try
more values of k in order to draw safe conclusions.
In the second set of experiments we investigated the effect of distance weighting. Four voting functions were
examined for each cost scenario, dimensionality, and k-neighborhood, with k now ranging from 1 to 10.
In a third set of experiments, we examined the effect of dimensionality and the size of the neighborhood, i.e., the
value of k, using the best-performing configuration in terms of attribute-weighting schemes and distance-weighting
functions.
Finally, in a fourth set of experiments, we examined the effect of the training corpus size for each cost scenario,
using the best configuration of the previous experiments.
In all the experiments, 10-fold stratified cross-validation was used to evaluate the performance of the classifier
(Kohavi, 1995). That is, Ling-Spam was partitioned into 10 parts, with each part maintaining the same ratio of
legitimate and spam messages as in the entire corpus. Each experiment was repeated 10 times, each time reserving a
different part for testing and using the remaining 9 parts for training. WAcc was averaged over the 10 iterations, and
TCR was computed as $WErr^b$ over the average $WErr$ (Section 5).
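The stratified partitioning step can be sketched as follows; this is a hypothetical helper, not the code actually used in the experiments.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Partition message indices into n_folds parts, each preserving the
    corpus ratio of legitimate to spam messages (stratification)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, c in enumerate(labels) if c == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % n_folds].append(i)   # deal each class round-robin
    return folds
```

Each fold then serves once as the test part while the other nine form the training corpus.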
6.1. Attribute weighting
The basic k-NN classifier assigns equal importance to all the attributes. In real-world applications, however,
irrelevant or redundant attributes are often included in the representation of the instances. This causes the classification
accuracy of k-NN to degrade, unless appropriate weights are assigned to the attributes, corresponding to their relevance
to the classification task. The distance metric, that computes the distance between two instances, has to be adjusted
accordingly, to incorporate the attribute weights. Equation (2) becomes:
$$d(\vec{x}_i, \vec{x}_j) = \sum_{r=1}^{n} w_r \, \delta(x_{ir}, x_{jr}) \qquad (5)$$

where $w_r$ is the weight assigned to the $r$-th attribute.
Attribute weighting schemes
Information Gain (IG) was mentioned in Section 3 as our attribute selection function. The same function can be used
for attribute weighting. An equivalent expression of (1) in terms of information theory is the following:
$$IG(X, C) = H(C) - \sum_{x \in \{0, 1\}} P(X = x) \, H(C \mid X = x) \qquad (6)$$

where $H(C)$ is the entropy of the category-denoting variable $C$. $H(C)$ measures the uncertainty on the category of a randomly selected instance, and is defined as:

$$H(C) = -\sum_{c \in \{spam, legit\}} P(C = c) \log_2 P(C = c) \qquad (7)$$
H(C|X) is defined similarly, replacing P(C) in (7) by P(C|X). H(C|X) measures the uncertainty on the category given the
value of attribute X. Equation (6) subtracts from the entropy of the category (H(C)) the expected value of the entropy
when the value of X is known, averaged over all the possible values of X. IG is therefore a measure of how much
knowing the value of X reduces the entropy of C. The larger the reduction, the more useful X is in predicting C.
IG tends to overrate attributes with a large number of values. In the extreme case, an attribute that happens to be a
key, i.e., that specifies uniquely the category in the training set (e.g. a patient’s ID), would be assigned the highest IG
score, although it cannot lead to any useful generalizations (e.g. classification of new patients). To normalize IG for
attributes with different numbers of values, Quinlan (1986) introduced Gain Ratio (GR), which is IG divided by the split
information (SI) of the attribute. The latter is simply the entropy of the attribute values; in our case:
$$SI(X) = -\sum_{x \in \{0, 1\}} P(X = x) \log_2 P(X = x)$$

Thus, GR is defined as:

$$GR(X, C) = \frac{IG(X, C)}{SI(X)} = \frac{H(C) - \sum_{x \in \{0, 1\}} P(X = x) \, H(C \mid X = x)}{SI(X)} \qquad (8)$$
The TiMBL software that we used (Section 4) supports both IG and GR for attribute weighting, as well as the standard
no-weighting scheme (i.e. equal weights for all attributes, hereafter EW).9 We experimented with all three of these
schemes, and the results are presented below.
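For binary attributes, all three quantities reduce to simple entropy arithmetic. The hypothetical sketch below computes IG and GR from P(X=1) and the class-conditional probabilities (the probability values in any usage are illustrative, not Ling-Spam statistics); it can be used to observe how a very rare attribute, with small split information, gets its score inflated by GR.

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def ig_and_gr(p_x1, p_spam_x1, p_spam_x0):
    """IG (Equation 6) and GR (Equation 8) for a binary attribute X,
    given P(X=1) and P(C=spam | X=x); P(C=spam) follows by total
    probability.  Returns the pair (IG, GR)."""
    p_spam = p_x1 * p_spam_x1 + (1 - p_x1) * p_spam_x0
    h_c = entropy([p_spam, 1 - p_spam])
    h_c_x = (p_x1 * entropy([p_spam_x1, 1 - p_spam_x1]) +
             (1 - p_x1) * entropy([p_spam_x0, 1 - p_spam_x0]))
    ig = h_c - h_c_x
    si = entropy([p_x1, 1 - p_x1])          # split information
    return ig, (ig / si if si > 0 else 0.0)
```

A balanced, perfectly discriminating attribute gets IG = GR = 1, while a word present in only 0.1% of messages has tiny IG but a GR many times larger, because its SI denominator is small.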
Results of the attribute weighting experiments
Figures 1 and 2 show the TCR of 10-NN (k = 10) for λ = 1 and λ = 999, respectively, for the three attribute
weighting schemes and varying numbers of attributes. The corresponding curves for λ = 9 are similar to those for λ = 1,
and are omitted for the sake of brevity.
For λ = 1 (as well as λ = 9), the conclusion is that 10-NN with IG outperforms 10-NN with GR, and the latter
outperforms 10-NN with EW. The same pattern was observed in the experiments that we performed with k = 1 and k =
2. Another interesting pattern is the dependence on the dimensionality, which differs significantly between the three
weighting schemes. When using IG, the performance improved continuously as we retained more attributes, while the
opposite effect was observed for GR. Finally, without attribute weighting (EW) the number of attributes does not seem
to affect the performance.
For λ = 999, the picture is quite different. The distinguishing characteristic of this scenario is the unstable behavior
of the filter, which was already noticed in our previous work with the Naïve Bayes classifier (Androutsopoulos, et al.
2000a, 2000b). The reason is that SL errors are penalized so heavily, that a single blocked legitimate message
causes the baseline to outperform the memory-based filter in terms of WAcc, and TCR to drop below 1. Given this
instability, the objective in this scenario is to select a reliable configuration, which maintains TCR above 1, even if that
configuration does not achieve the highest TCR score.
Bearing the above goal in mind, the most reliable option for 10-NN when λ = 999 is to use no attribute weighting
(EW), since it consistently attains TCR > 1. However, its spam recall (SR) does not exceed 17%, which is a rather low
performance. On the other hand, GR seems safe for over 250 attributes, increasing SR up to 37%. Finally, IG seems to
be the least reliable of the three. On the other hand, it reaches 47% SR, blocking almost half of the spam messages in this demanding scenario. Unlike 10-NN, 1-NN and 2-NN do not reach the baseline (TCR < 1) in this scenario for any of the attribute weighting schemes. The general impression formed by these results is that high dimensionality and a large k-neighborhood more reliably provide TCR scores above the baseline when λ = 999. This impression is strengthened by the results presented in the following sections.

Figure 1: TCR of 10-NN for λ = 1 and three attribute-weighting functions.
Interpretation of the results of the attribute weighting experiments
The first striking observation in Figure 1 is the poor performance of 10-NN when no attribute weighting is used
(EW). This may seem odd, as k-NN has been used successfully in many domains without attribute weighting. This
phenomenon can be explained by the fact that EW led to large numbers of instances at equal distances in the k-
neighborhood. (The reader is reminded that in our experiments the k-neighborhood comprises all the neighbors at the k
closest distances; see Section 4.) For example, 10-NN with 700 attributes and EW gave rise to k-neighborhoods of
typically 100-200 instances, while the respective number of instances for 50 attributes often exceeded 2000! Since the
vast majority of messages are legitimate, it is reasonable that within neighborhoods of hundreds or thousands of
instances, most of them will also be legitimate. This forces 10-NN to classify almost all of the messages as legitimate,
which is why its performance is very close to that of the baseline. A similar explanation can be given to the fact that for
λ = 999 (Figure 2), the behavior of 10-NN with EW is closer to the behavior of the baseline, compared to the cases where IG or GR are used.
Figure 2: TCR of 10-NN for λ = 999 and three attribute-weighting functions.
The question that arises is why there are so many instances in the k-neighborhood when EW is used. The answer lies
in the representation of the instances and the use of the overlap distance metric (Equation 2). As mentioned in Section 4,
the metric counts the number of attributes where the instances have different values. At the same time, the
representation results in sparse instance vectors, i.e., vectors with many zeros, indicating the absence of attribute-words
in the document. Thus, it is frequently the case that many messages differ in the same number of features, but not
necessarily the same features. All these messages are considered equally distant from an incoming new message. On the
other hand, IG, GR, and most attribute weighting functions avoid this problem: since every attribute weighs differently
(Equation 5), two instances are equally distant from an incoming message practically only when they are identical; and
finding two different messages at the same distance becomes further unlikely as the dimensionality increases.
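The effect of tied distances under the unweighted overlap metric can be illustrated with a small simulation. The sketch below is purely illustrative (random sparse binary messages and made-up real-valued attribute weights standing in for IG scores, not the actual Ling-Spam data): with equal weights, the symmetric-difference distance can only take a handful of integer values, so large numbers of instances fall at the k closest distances, whereas real-valued weights make ties practically disappear.

```python
import random

random.seed(0)
N_ATTRS = 50          # retained attributes (as in the smallest setting)
N_TRAIN = 2000        # hypothetical training pool

# Sparse binary instance vectors: each message contains few attribute-words,
# so we represent a message by the set of attributes with value 1.
def random_message():
    return frozenset(random.sample(range(N_ATTRS), random.randint(1, 5)))

train = [random_message() for _ in range(N_TRAIN)]
incoming = random_message()

# Unweighted overlap distance: number of attributes with different values,
# i.e. the size of the symmetric difference of the two word sets.
def overlap(a, b):
    return len(a ^ b)

# Per-attribute weights standing in for IG scores (hypothetical values).
weights = [random.random() for _ in range(N_ATTRS)]

def weighted_overlap(a, b):
    return sum(weights[i] for i in a ^ b)

unweighted = [overlap(incoming, t) for t in train]
weighted = [weighted_overlap(incoming, t) for t in train]

# With equal weights, distances collapse onto a handful of integer values,
# so the "k closest distances" neighborhood contains hundreds of instances.
print("distinct unweighted distances:", len(set(unweighted)))
print("distinct weighted distances:  ", len(set(weighted)))
```

Since each message here holds at most five attribute-words, the unweighted distance can take at most eleven distinct values over the whole pool, which is the mechanism behind the huge EW neighborhoods reported above.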
The second issue to be investigated is the superiority of IG over GR, in terms of overall higher TCR (Figures 1 and
2). GR was introduced as an improved form of IG, and it may seem strange that IG outperforms GR in our experiments
(this is particularly prominent for λ = 1). A first explanation comes from the use of the binary representation in the
instance vectors: GR corrects the bias of IG towards features with many uniformly distributed values, but this is not
helpful when using binary attributes. This explains why GR does not perform better than IG, but it does not explain why
GR performs worse than IG. The explanation is subtler. Between two attributes with the same IG, the attribute with
lower split information (SI), or entropy, is ranked higher by GR (Equation 8). Lower entropy means greater certainty on
the value of the attribute. In the context of our study, where attributes are binary and represent the presence or absence
of a word, the attributes bearing certainty are those that correspond to the most and least frequent words. In practice,
there are no words that are very frequent (e.g. words with probability over 90-95%), but there are many very rare words
(e.g. probability of less than 0.1%). Thus, GR overestimates a large number of very rare words, which intuitively are
bad attributes, exactly because they occur seldom. It should be stressed that these arguments apply when binary
representation is employed. Other representations, like tfidf (Salton & Buckley, 1988), might be more suitable to GR
than to IG.
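This bias can be checked numerically. The following sketch computes IG and GR (IG divided by the split information, as in Equation 8) for two hypothetical binary attributes, using made-up probabilities rather than corpus statistics: a frequent word that is moderately predictive of spam, and a very rare word that is a perfect spam indicator. IG ranks the frequent word far higher, while GR inverts the ranking because the rare word's split information is tiny.

```python
from math import log2

def entropy(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

def ig_and_gr(p_word, p_spam_given_word, p_spam):
    # Information gain of a binary word-attribute w.r.t. the spam/legitimate class.
    h_class = entropy([p_spam, 1 - p_spam])
    # Class distribution when the word is absent (by total probability).
    p_spam_absent = (p_spam - p_word * p_spam_given_word) / (1 - p_word)
    ig = h_class - (p_word * entropy([p_spam_given_word, 1 - p_spam_given_word])
                    + (1 - p_word) * entropy([p_spam_absent, 1 - p_spam_absent]))
    si = entropy([p_word, 1 - p_word])   # split information = attribute entropy
    return ig, ig / si

# Frequent word, moderately predictive of spam (hypothetical numbers).
ig_freq, gr_freq = ig_and_gr(p_word=0.50, p_spam_given_word=0.25, p_spam=0.17)
# Very rare word, perfectly predictive of spam.
ig_rare, gr_rare = ig_and_gr(p_word=0.001, p_spam_given_word=1.0, p_spam=0.17)

print("IG:", ig_freq, ig_rare)   # IG prefers the frequent word
print("GR:", gr_freq, gr_rare)   # GR prefers the rare word
```

The rare word contributes almost no information gain, yet its near-zero attribute entropy in the denominator lets GR rank it above a genuinely useful frequent attribute.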
The third issue to look into is the effect of dimensionality on each weighting scheme. In the case of EW, the large
neighborhoods that are formed are also responsible for the stability of the curves with different numbers of retained
attributes (Figures 1 and 2): the majority class (legitimate) prevails most of the time, and the filter’s behavior is very
similar to that of the baseline, regardless of the selected attributes. With IG, the overall marginal increase in
performance as the number of attributes increases suggests that the weights assigned to the attributes are appropriate.
Using the IG weights, the classifier seems to be taking advantage of even inferior attributes, by assigning them
appropriately lower importance. Finally, GR’s decline is again due to its bias towards rare attributes. Indeed, a closer
look at the word rankings produced by the two metrics revealed that the top ten words of IG were only ranked in the
second hundred of GR’s list, and vice versa. Given that the selection of the top-ranking attributes is always performed
using IG (Section 3), and that IG produces a good ranking, the smaller the number of selected attributes the less room
GR has to make grave misjudgments when weighting the attributes that have been selected.
The main conclusion of the discussion above is that IG is the best attribute weighting scheme among those that we
have tried. Hence, we use this scheme in the experiments of the following sections.
6.2. Distance weighting
Having chosen the attribute weighting scheme, we now examine the effect of different distance weighting methods.
These methods do not treat all the instances in a k-neighborhood as equally important, but weigh them according to
their distance from the incoming instance. The advantages of distance weighting, or else weighted voting, in memory-
based methods have been discussed extensively in the literature (Dudani 1976; MacLeod, et al. 1987). The main benefit
is that distance weighting reduces the sensitivity of k-NN to the choice of k. A k value that may be suitable for densely
populated regions may be unsuitable for sparse regions, generating in the latter case neighborhoods that contain many
irrelevant instances. Weighted voting undervalues distant neighbors, without ignoring them completely, in order to
adapt the effective size of the neighborhood to the local distribution of instances.
Distance weighting schemes
Various distance functions, or voting rules, have been proposed for distance weighting. We experimented with four
simple functions, one linear and three hyperbolic ones. The linear function that we tested is:
f(d) = d_max − d,

where d is the neighbor’s distance from the incoming instance and d_max is the maximum obtainable distance. The
maximum distance occurs when two instances differ in every attribute; hence, it is equal to the sum of all attribute
weights. The three hyperbolic functions that we tested are:

f_n(d) = 1 / d^n,  n = 1, 2, 3.

When one or more neighbors are identical to the incoming instance (i.e., d = 0), the incoming instance is classified to the
majority class of the identical instances, and all other neighbors are ignored.
The confidence level W_c(x) of the distance-weighted k-NN that the incoming instance x belongs to class c is
computed by the following formula:

W_c(x) = Σ_{i=1}^{k} f(d(x, x_i)) · δ(c, C(x_i)),

where δ(a, b) = 1 if a = b and 0 otherwise, and C(x_i) is the class of neighbor x_i. This formula simply weighs the
contribution of each neighbor by its distance from the incoming instance. As in the basic k-NN classifier, the
confidence levels for the two categories can be scaled to the [0,1] interval, so that their sum equals 1 (Section 5).
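The voting scheme above can be sketched as follows. The neighborhood below is hypothetical (distances and labels invented for illustration): a close spam minority against a more distant legitimate majority. Majority voting would return “legitimate”, while the hyperbolic rules let the two near spam neighbors dominate.

```python
# Voting rules from above: f(d) = d_max - d (linear) and f_n(d) = 1 / d**n.
def f_linear(d, d_max):
    return d_max - d

def f_hyperbolic(d, n):
    return 1.0 / d ** n

def confidence(neighbors, cls, f):
    """W_c(x): sum of the votes f(d_i) of the neighbors belonging to class cls.
    `neighbors` is a list of (distance, class) pairs; identical neighbors
    (d = 0) are assumed to have been handled separately, as described above."""
    return sum(f(d) for d, c in neighbors if c == cls)

# Hypothetical 5-neighborhood: two close spam neighbors, three distant legitimate.
neighborhood = [(0.5, "spam"), (0.8, "spam"),
                (2.0, "legitimate"), (2.2, "legitimate"), (2.5, "legitimate")]

for n in (1, 2, 3):
    w_spam = confidence(neighborhood, "spam", lambda d: f_hyperbolic(d, n))
    w_legit = confidence(neighborhood, "legitimate", lambda d: f_hyperbolic(d, n))
    # Scale to [0, 1] so the two confidence levels sum to 1 (Section 5).
    print(n, round(w_spam / (w_spam + w_legit), 3))
```

Under simple majority voting this neighborhood yields two votes for spam against three for legitimate; the stronger the devaluation of distant neighbors (larger n), the more decisively the local spam minority wins.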
All the distance weighting experiments were conducted using IG for attribute weighting, which gave the best results
in the attribute weighting experiments. Selecting a good attribute weighting scheme is particularly important when
using distance weighting, because misjudgments about the similarity of two instances, caused by ineffective attribute
weighting schemes, can be amplified by the distance weighting functions.
Results of the distance weighting experiments
Figure 3 shows the TCR of 10-NN for λ = 1. Each curve corresponds to one of the four voting rules mentioned
above, and there is an additional curve for majority voting (i.e. no distance weighting). The selected value for k (k = 10)
is the highest of those considered in our experiments, because distance weighting is usually combined with a large value
for k: one of the principal reasons for employing distance weighting is the smoothing of the classifier’s performance for
varying k, so that the optimal selection of its value is less crucial. The corresponding figure for λ = 9 is similar to that
for λ = 1, with respect to the conclusions drawn, and is therefore omitted. Finally, for λ = 999 the curves are almost
uninfluenced by the use of distance weighting and the corresponding figure is much like Figure 2.

Figure 3: TCR of 10-NN for λ = 1 and different distance weighting functions
(using IG for attribute weighting).
Figure 3 shows clearly the improvement brought by the distance weighting functions. The improvement is greater
when distant neighbors are heavily undervalued. It is notable that the relative order of the four lower curves remains
practically the same for all dimensionalities. The two higher curves, f_2(d) and f_3(d), are quite close to each other,
suggesting that f_n(d) functions for n > 3 are not likely to increase TCR further.
Interpretation of the results of the distance weighting experiments
The fact that performance improves as more importance is placed on close neighbors suggests that local
classification is preferable in this task. It is interesting to note that an important contributor to the success of the f_n(d)
weighting functions is the correct prediction of almost all incoming instances with one or more identical neighbors. In
those cases, the aforementioned functions base their classification solely on the identical neighbors, and are nearly
always right. Even when there are no identical neighbors, functions that strongly prioritize the nearest neighbors reward
the local minority class, if its members are closer to the incoming instance. The good performance of the IG attribute
weighting function helps further, by bringing the “right” instances closer to the incoming instance.
Moderate voting functions, i.e. those that do not underestimate strongly the distant neighbors, do not produce the
required effect, because the differences in the distances between the incoming instance and different neighbors are
relatively small. Thus, unless the distances between neighbors are amplified by the distance function, the dominant
factor for the classification is still the majority class, as in majority voting.
6.3. Neighborhood size and dimensionality
The results of the distance weighting experiments have indicated a clear superiority of voting rules that favor close
neighbors. The question that arises is whether the value of the k parameter, which determines the size of the
neighborhood, can still affect the performance of the classifier when such voting rules are employed. We continued our
investigation towards that direction, by examining the influence of k, in combination with the dimensionality of the
instance space. In all the experiments below, we used IG for attribute weighting and f(d) = 1/d³ for distance
weighting, following the conclusions of the previous experiments.
Results of the neighborhood size and dimensionality experiments
Figures 4, 5 and 6 show some representative curves for various k-neighborhood sizes in each one of the cost
scenarios (λ = 1, 9 and 999), respectively. Although we carried out experiments for every k from 1 to 10, we present
only some representative curves, for the sake of brevity. The discussion below, however, refers to the whole set of
results.
In Figure 4, a general trend involving the dimensionality and the performance of the classifier is observed: as the
number of retained attributes increases, so does the TCR. The relationship between the size of the neighborhood and the
performance of the classifier is less clear: TCR seems to be improving as k increases, up to k = 8, when it starts to
deteriorate slightly. However, the extensive overlap of the curves does not allow any safe conclusions to be drawn.
Figure 5 presents the two best and two worst curves for λ = 9. The behavior of the classifier is different here, in that
the best performance occurs for very small neighborhoods (k = 2, 3). As the neighborhood grows, TCR declines
gradually, but steadily. Interestingly enough, 1-NN performs the worst, which can be attributed to the fact that no
distance weighting can be used in this case.
Figure 4: TCR of k-NN for λ = 1 and different k values
(using IG for attribute weighting and f(d) = 1/d³ for distance weighting; curves shown for 2-NN, 5-NN, 8-NN, and 10-NN).
Finally, in Figure 6, we observe the same steep fluctuations of TCR as in Figure 2, indicating transitions from below
baseline performance to above, and vice versa. Exceptions are k-NN for k < 4, which are always below baseline.
Another interesting phenomenon is the fact that 4-NN achieves the highest TCR globally for 250 attributes, and yet it is
not useful, as its performance is not steadily above the baseline; in practice, it is not possible to predict accurately the
exact dimensionality of such a narrow peak. The general conclusion is that satisfactory performance is obtained reliably
using a high dimensionality and a large k-neighborhood.
Interpretation of the results of the neighborhood size and dimensionality experiments
The conclusions to be drawn here are less clear than those in the previous experiments. The optimal value of k
depends heavily on the selected scenario (λ parameter), and also correlates with the dimensionality in an unintelligible
way. The experiments showed that for a given scenario, there is no clear ranking of k-NN classifiers for various values
of k. Furthermore, the effect of the number of retained attributes does not follow a coherent pattern. For example, k1-NN
may be better than k2-NN for 50–250 retained attributes, k2-NN may outperform k1-NN for 300–400 attributes, and in a
third interval (e.g. 450–600 features) k1-NN may again be better than k2-NN. To some extent this confusing picture is
justified by the use of the distance weighting function, which reduces significantly the dependence of the classifier on
the choice of k. It should also be stressed that the difference between the best and worst performing configuration is
Figure 5: TCR of k-NN for λ = 9 and different k values
(using IG for attribute weighting and 3
1d for distance weighting).
19
much smaller in this set of experiments than in the previous two sets. The performance of the worst classifier in Figure
4 does not fall below TCR = 5, which is comparable only to the best classifier of Figure 1, where no distance weighting
was used.
Best results overall and comparison to a naïve Bayes filter
It is interesting to relate these results to the spam recall (SR) and spam precision (SP) measures, which are perhaps
more intuitive than the combined TCR measure. Table 1 summarizes the best configurations witnessed for each
scenario. In every case, IG was used for attribute weighting and f(d) = 1/d³ for distance weighting. A spam recall of
around 89% and a spam precision of over 97%, attained for λ = 1, constitute a satisfactory, if not sufficient, performance. The
gain from moving to the second scenario with λ = 9 is, on the other hand, questionable. A small increase of SP by 1.4%
is accompanied by a nearly five times greater decrease in SR. However, this might be acceptable, as almost no
legitimate messages are misclassified. In the third scenario (λ = 999), the filter’s safety (not blocking legitimate
messages) becomes the crucial issue, and therefore SP must be kept at 100% at any cost. The filter achieves an SR of
around 68%, which is notable though far from perfect. The “stable” configuration in Table 1 belongs to an interval
where SP is constantly kept at 100%.
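The relationship between SR, SP, and TCR can be made concrete with a small helper. The counts below are invented for illustration, and the TCR formula is the one assumed throughout this work (the no-filter baseline cost of passing all spam through, divided by the filter’s λ-weighted error cost; see Section 5): it shows why, for λ = 999, even a few wrongly blocked legitimate messages push TCR below the baseline, so SP must stay at 100%.

```python
def evaluate(n_ss, n_sl, n_ls, lam):
    """n_ss: spam classified as spam; n_sl: spam let through as legitimate;
    n_ls: legitimate messages wrongly blocked as spam; lam: the cost factor λ."""
    n_spam = n_ss + n_sl
    sr = n_ss / n_spam                      # spam recall
    sp = n_ss / (n_ss + n_ls)               # spam precision
    tcr = n_spam / (lam * n_ls + n_sl)      # total cost ratio vs. no-filter baseline
    return sr, sp, tcr

# Hypothetical run: 100 spam messages, 88 caught, 3 legitimate ones blocked.
sr, sp, tcr = evaluate(n_ss=88, n_sl=12, n_ls=3, lam=999)
# Same filter with perfect spam precision (no legitimate message blocked).
sr2, sp2, tcr2 = evaluate(n_ss=88, n_sl=12, n_ls=0, lam=999)
print(round(tcr, 3), round(tcr2, 3))   # the first configuration stays below the baseline (TCR < 1)
```

With λ = 999, three blocked legitimate messages are costed like thousands of missed spam messages, which is why only configurations with SP constantly at 100% are usable in the strict scenario.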
Figure 6: TCR of k-NN for λ = 999 and different k values
(using IG for attribute weighting and f(d) = 1/d³ for distance weighting).
λ             k   dimensionality   spam recall (%)   spam precision (%)   TCR
1             8   600              88.60             97.39                7.18
9             2   700              81.93             98.79                3.64
999           4   250              68.02             100                  3.12
999 (stable)  7   600              59.91             100                  2.49
The results presented in Table 1 are directly comparable to the results of our earlier work with the naïve Bayes
classifier on the same corpus. Table 2 reproduces our earlier best results, as presented in (Androutsopoulos, et al.
2000c). The comparison of the two tables shows that the memory-based approach compares favourably to the naïve
Bayes classifier. In terms of TCR, the memory-based classifier is clearly better for λ = 1, slightly worse for λ = 9 and
slightly better again for λ = 999. In the strict scenario (λ = 999), it should be noted that the performance of the naïve
Bayes classifier is very unstable, i.e., the result shown in Table 2 corresponds to a very narrow peak with respect to the
number of retained attributes. Apart from that peak, the naïve Bayes classifier never exceeded TCR = 1 for λ = 999. In
contrast, there are intervals where the memory-based classifier achieves TCR steadily above 1. Examining spam recall
and spam precision, it is clear that the memory-based classifier improves recall in all three scenarios, at a small cost of
precision for λ = 1 and λ = 9.
λ     dimensionality   spam recall (%)   spam precision (%)   TCR
1     100              82.35             99.02                5.41
9     100              77.57             99.45                3.82
999   300              63.67             100                  2.86
6.4. Corpus size
Having examined the basic parameters of the classifier, we now turn to the size of the training corpus, which was
kept fixed to 2603 messages (90% of the whole corpus) in the experiments presented above. As before, in every ten-fold
experiment the corpus was divided into ten parts, and a different part was reserved for testing at each iteration. From
each of the remaining nine parts, only x% was used for training, with x ranging from 10 to 100 in steps of 10. For every
cost scenario, the best configuration was employed (as in Table 1), with IG used for attribute weighting and
f(d) = 1/d³ for distance weighting. For the strict scenario (λ = 999), the “stable” configuration was employed.
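The data-handling protocol just described can be sketched as follows. The fold construction here is a simplification (a plain interleaved partition of an artificial stand-in corpus; the actual experiments used the pre-defined Ling-Spam parts): for each percentage x, ten train/test configurations are produced, each training on only x% of every one of the nine non-test parts.

```python
def learning_curve_splits(corpus, percentages=range(10, 101, 10), k=10):
    folds = [corpus[i::k] for i in range(k)]        # ten interleaved parts
    for x in percentages:
        for test_idx in range(k):
            test = folds[test_idx]
            train = []
            for j, part in enumerate(folds):
                if j != test_idx:
                    # keep only the first x% of each training part
                    train.extend(part[: len(part) * x // 100])
            yield x, train, test

corpus = list(range(2890))                          # artificial stand-in corpus
splits = list(learning_curve_splits(corpus))
print(len(splits))                                  # 10 percentages x 10 folds
```

At x = 100 this reduces to ordinary ten-fold cross-validation, so the right-most point of each learning curve coincides with the experiments of the previous sections.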
Table 1: Best configurations per usage scenario and the corresponding performance.
Table 2: Best results of the naïve Bayes filter.
Figure 7 presents the resulting learning curves in terms of TCR. In all three scenarios, the diagram shows a clear
trend of improvement as the size of the training corpus increases. By studying the corresponding SR and SP measures,
we observed that the increase in performance is primarily due to an increase in SR. For example, the transition from
80% to 100% of the training corpus for λ = 1 raises SR by nearly 10%, decreasing at the same time SP by only 1.6%.
The increase of TCR in Figure 7 is generally mild and smooth, with the exception of TCR’s leap from 4.5 to 6.5 for λ
= 1. There is no indication that the learning curves have approached an asymptote for any size of the training set, which
suggests that a larger corpus might give even better results. This is particularly encouraging for the strict scenario (λ =
999), where there is still a large scope for improvement. It is also notable that, when λ = 999, TCR remains above the
baseline for all the sizes of the training corpus. In contrast, the naïve Bayes filter that we examined in previous work
(Androutsopoulos, et al. 2000a) had reached TCR > 1 only for 100% of the training corpus. These findings indicate that
a memory-based spam filter may be viable in practice, even when absolute spam precision is required.

Figure 7: TCR for variable sizes of training corpus.

7. Conclusions

In this paper, we have presented the results of a thorough empirical evaluation of a memory-based anti-spam filter.
Our results suggest that this is a promising approach to the problem, attaining high levels of recall and precision.
Furthermore, the memory-based classifier compares favorably to the probabilistic classifier that we have used in our
earlier work on this problem. Due to the sensitive nature of any mail-filtering process, we have performed a cost-based
evaluation, according to three scenarios of varying strictness. The results show that the use of the memory-based
classifier can be justified, even in the strictest scenario, where the blocking of a legitimate message is practically
unacceptable.
The most important contribution of this work is the exploration of various parameters of the memory-based method,
such as attribute weighting, distance weighting, and neighborhood size. Our experiments have shown that the best
results are obtained using the Information Gain attribute weighting function, and that voting rules that strongly devalue
distant neighbors are beneficial. We have also shown that by using the right attribute- and distance-weighting functions,
the size of the neighborhood considered for classification becomes less important.
In addition to the parameters of the method, we have explored two important parameters of the problem: the
dimensionality and the training corpus size. Regarding the dimensionality of the problem, which is determined by the
number of retained attributes after the initial selection, our results show that its effect on classification performance is
positive when using a good attribute weighting scheme, i.e., the performance improves as the number of retained
attributes increases. Similarly, the performance improves as the size of the training corpus increases, which is an
indication that a larger training corpus might lead to even better results. It should be noted that our corpus is publicly
available on the Web, in an attempt to contribute towards standard benchmarks in this interesting problem.
The experiments presented here have opened a number of interesting research issues, which we are currently
examining. In the context of the memory-based classifier, we are examining non-binary representations of the
messages, by taking into account the frequency of a word within a message. Additionally, we would like to examine
other attribute weighting functions and their relationship to the chosen representation of instances. Also, weighted
voting can gain from functions that do not depend on the absolute distance from the input instance, but take into account
the local properties of the neighborhood, as shown in (Zavrel, 1997).
References
Aha, W.D., Kibler, D., and Albert, M.K. (1991). “Instance-Based Learning Algorithms”. Machine Learning, Vol. 6,
pp. 37–66.
Aha, W.D. (1992). “Tolerating Noisy, Irrelevant and Novel Attributes in Instance-Based Learning Algorithms”.
International Journal of Man-Machine Studies, vol. 36, pp. 267–287.
Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G., and Spyropoulos, C.D. (2000a). “An evaluation of
naïve Bayesian anti-spam filtering” in Proceedings of the Workshop on Machine Learning in the New Information Age,
11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9–17.
Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C.D., and Stamatopoulos, P. (2000b).
“Learning to filter spam e-mail: A comparison of a naïve Bayesian and a memory-based approach”. In Proceedings of
the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and
Practice of Knowledge Discovery in Databases (PKDD 2000), Lyon, France, pp. 1– 13.
Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., and Spyropoulos, C.D. (2000c) “An experimental comparison of
naïve Bayesian and keyword-based anti-Spam Filtering with encrypted personal e-mail messages". In Proceedings of
the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR
2000), Athens, Greece, pp. 160–167.
Bailey, T., and Jain, A.K. (1978). “A Note on Distance-Weighted k-Nearest Neighbor Rules", IEEE Transactions on
Systems, Man, and Cybernetics, 8(4):311–313.
Cohen, W.W. (1996). “Learning rules that classify e-mail” in Proceedings of the AAAI Spring Symposium on Machine
Learning in Information Access, Palo Alto, CA, pp.18 –25.
Cranor, L.F., and LaMacchia, B.A. (1998). “Spam!”, Communications of ACM, 41(8):74–83.
Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based
Learning Methods, Cambridge University Press.
Daelemans, W., Zavrel, J., van der Sloot, K., and van den Bosch, A. (2000). TiMBL: Tilburg Memory Based Learner,
version 3.0, Reference Guide. ILK, Computational Linguistics, Tilburg University. http://ilk.kub.nl/~ilk/papers.
Drucker, H. D., Wu, D., and Vapnik V. (1999). “Support Vector Machines for Spam Categorization”. IEEE
Transactions On Neural Networks, 10(5):1048–1054.
Duda, R.O., and Hart, P.E. (1973). “Bayes Decision Theory”. Chapter 2 in Pattern Classification and Scene Analysis,
pp. 10–43. John Wiley.
Dudani, A. S. (1976). “The Distance-Weighted k-Nearest Neighbor rule”. IEEE Transactions on Systems, Man and
Cybernetics, 6(4):325–327.
Giraud-Carrier, C., and Martinez, R. T. (1995). “An Efficient Metric for Heterogeneous Inductive Learning
Applications in the Attribute-Value Language”. Intelligent Systems, pp. 341–350.
Gómez Hidalgo, J.M., Maña López, M., and Puertas Sanz, E. (2000). “Combining text and heuristics for cost-sensitive
spam filtering” in Proceedings of the 4th Computational Natural Language Learning Workshop (CoNLL-2000), Lisbon,
Portugal, pp. 99–102.
Hall, R.J. (1998). “How to Avoid Unwanted Email”. Communications of ACM, 41(3):88–95.
Joachims, T. (1997). “A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization” in
Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, pp. 143–151.
Kohavi, R. (1995). “A study of cross-validation and bootstrap for accuracy estimation and model selection” in
Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), Morgan Kaufmann, pp.
1137–1143.
Lang, K. (1995). “Newsweeder: learning to filter netnews” in Proceedings of the 12th International Conference on
Machine Learning, Stanford, CA, pp. 331–339.
Lewis, D. (1992). “Feature selection and feature extraction for text categorization” in Proceedings of the DARPA
Workshop on Speech and Natural Language, pp. 212–217, Harriman, New York.
MacLeod, E. S. J., Luk, A., and Titterington, D. M. (1987). “A Re-Examination of the Distance-Weighted k-Nearest
Neighbor Classification Rule”. IEEE Transactions on Systems, Man, and Cybernetics, 17(4):689–696.
Mitchell, T.M. (1997). Machine Learning. McGraw-Hill.
Payne, T.R. and Edwards., P. (1997). “Interface Agents that Learn: An Investigation of Learning Issues in a Mail Agent
Interface”. Applied Artificial Intelligence, 11(1):1–32.
Pantel, P., and Lin, D. (1998). “SpamCop: a spam classification and organization program”. Learning for Text
Categorization – Papers from the AAAI Workshop, pp. 95–98, Madison Wisconsin. AAAI Technical Report WS-98-05.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California.
Quinlan, J.R. (1986). “Induction of Decision Trees”. Machine Learning, 1(1):81–106.
Rocchio, J. (1971). “Relevance Feedback Information Retrieval”. In Salton, G. (Ed.), The Smart Retrieval System –
Experiments in Automatic Document Processing, pp. 313–323. Prentice-Hall, Englewood Cliffs, NJ.
Sahami, M., Dumais, S., Heckerman D., and Horvitz, E. (1998). “A Bayesian approach to filtering junk e-mail”.
Learning for Text Categorization – Papers from the AAAI Workshop, pp. 55–62, Madison Wisconsin. AAAI Technical
Report WS-98-05.
Salton, G., and Buckley, C. (1988). “Term-Weighting Approaches in Automatic Text Retrieval”. Information
Processing and Management 24(5):513–523.
Salton, G., and McGill., M.J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
Sebastiani, F. (2001). Machine Learning in Automated Text Categorization. Revised version of Technical Report IEI-
B4-31-1999, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy.
Schapire, R.E., and Singer, Y. (2000). “BoosTexter: a Boosting-Based System for Text Categorization”. Machine
Learning, 39(2/3):135–168.
Wettschereck, D. (1994). A Study of Distance-Based Machine Learning Algorithms. PhD Thesis, Oregon State
University.
Wettschereck, D., Aha, W. D., and Mohri, T., (1995). A Review and Comparative Evaluation of Feature Weighting
Methods for Lazy Learning Algorithms, Technical Report AIC-95-012, Washington, D.C.: Naval Research Laboratory,
Navy Center for Applied Research in Artificial Intelligence.
Wilson, D.R. (1997). Advances in Instance-Based Learning Algorithms. PhD Thesis, Brigham Young University.
Wilson, D. R., and Martinez, R. T. (1997). “Improved Heterogeneous Distance Functions”. Journal of Artificial
Intelligence Research, 6(1):1–34.
Yang, Y., and Pedersen, J.O. (1997). “A comparative study on feature selection in text categorization” in Proceedings
of ICML-97, 14th International Conference on Machine Learning , Nashville, US, pp. 412 –420.
Zavrel, J. (1997). “An empirical re-examination of weighted voting for k-NN” in Proceedings of the 7th Belgian-Dutch
Conference on Machine Learning (BENELEARN-97), Tilburg, The Netherlands.
1 Consult http://www.cauce.org, http://spam.abuse.net, and http://www.junkemail.org for further information on UCE
and related legal issues.
2 See http://www.esi.uem.es/~jmgomez/spam/index.html for a collection of resources related to machine learning and
anti-spam filtering.
3 An on-line bibliography on cost-sensitive learning can be found at http://extractor.iit.nrc.ca/bibliographies/cost-sensitive.html.
4 Ling-Spam is publicly available from http://www.iit.demokritos.gr/~ionandr/publications.
5 The Linguist list is archived at http://listserv.linguistlist.org/archives/linguist.html.
6 Spambase was created by M. Hopkins, E. Reeber, G. Forman, and J. Suermondt. It is available from
http://www.ics.uci.edu/~mlearn/MLRepository.html.
7 We used morph, a lemmatizer included in the GATE system. See http://www.dcs.shef.ac.uk/research/groups/nlp/gate.
8 The F-measure, used in information retrieval and extraction to combine recall and precision (Riloff & Lehnert, 1994),
cannot be used here, because its weighting factor cannot be related to the cost difference of the two error types.
9 From version 3.0, TiMBL provides two additional attribute-weighting measures, namely chi-squared and shared
variance. Although we did not explore these measures thoroughly, the experiments we conducted on some randomly
selected settings showed no significant difference from IG, in agreement with (Yang & Pedersen, 1997).
... According to [6], a leading body in IT, inaccurate anti-spam solutions may be responsible for wasting more than five million working hours a year on checking that legitimate messages were not mistakenly quarantined. Recently, various machinelearning methods [7] have been used to address spam filtering including support vector machines [8], memory-based learning [9,10], rough set [11], neural networks [12], Bayesian classifiers [13][14][15][16], sparse binary polynomial hash [17], etc. Among these methods, the naïve Bayesian classifier has been widely applied as one of the most effective methods to counteract spam [18]. ...
... Sakkis et al. [9] proposed a memory-based approach to anti-spam filtering for mailing lists. In this International Journal of Computer Science and Informatics, ISSN (PRINT): 2231 -5292, Volume-4, Issue-1 approach, each message in the training examples is converted into a vector representing the values of different attributes of the message. ...
... A thorough evaluation of memory -based filtering was performed in [16], and it was found that it achieved better or comparable results to the naïve Bayesian approach. Another extensive empirical evaluation of memory -based learning in the context of anti-spam filtering is provided in [9]. It provides a thorough investigation on the effect of different parameters such as various attributes, distance weighting schemes, neighborhood size, the size of the attribute set, and the size of the training corpus. ...
Article
Unethical e-mail senders bear little or no cost for mass distribution of messages, yet normal e-mail users are forced to spend time and effort in reading undesirable messages from their mailboxes. Due to the rapid increase of electronic mail (or e-mail), several people and companies found it an easy way to distribute a massive amount of undesired messages to a tremendous number of users at a very low cost. These unwanted bulk messages or junk e-mails are called spam messages .Several machine learning approaches have been applied to this problem. In this paper, we explore a new approach based on Bayesian classification that can automatically classify e-mail messages as spam or legitimate. We study its performance for various datasets.
... Anti-spam filters can be categorized into four groups. The first group of spam filters is called content-based filters [6,7], as it distinguishes spam mails from normal mails by examining the mail content, which includes both the mail header and mail body. The second group comprises of the signature-based filters [8,9,10]. ...
... Spam filtering techniques are tools used to identify and exclude spam mails from normal ones. Many research works [6,7,8,11,12] have been paying attention to developing the techniques and tools to decrease the false positive rate which may be caused during the filtering process. Designers of antispam filters have to carefully design their filters to be optimistic because, in practical situation, it is considered to be more cost-effective to erroneously let spam e-mails get into the system rather than falsely discard a legitimate e-mail [7]. ...
... Much research [6,7,8,11,12] has focused on developing techniques and tools to decrease the false positive rate that may arise during the filtering process. Designers of anti-spam filters have to design their filters carefully to err on the side of letting spam through because, in practical situations, it is considered more cost-effective to erroneously let a spam e-mail into the system than to falsely discard a legitimate e-mail [7]. The four major groups of anti-spam filtering techniques are content-based, signature-based, computation of bounded functions, and trust and reputation. ...
Conference Paper
Unsolicited e-mail messages, or spam mails, are a common misuse of e-mail technology. They waste computer and network resources and irritate e-mail users. Much anti-spam filtering software has been deployed, and many researchers look for solutions in the machine learning field. In this paper, we employ a clustering method called Dynamic State Clustering (DSC) to adaptively cluster e-mail in a real-time environment. With its ability to learn while at work, DSC allows the spam filter to incrementally learn to distinguish spam mails from good ones. The results of our investigation show that the technique performs acceptably well in detecting spam e-mails.
... Spam filters can be categorized into four groups. The first group of spam filters is called content-based filters [5,6], as they distinguish spam mails from normal mails by examining the mail content, which includes both the mail header and mail body. The second group comprises the signature-based filters [7,8,9]. ...
... Spam filtering techniques are tools that help identify and exclude spam mails from normal mails. Much research [5,6,7,10,11] has gone into developing spam filters that decrease the false positive rate arising during the filtering process. The two major groups of spam filtering techniques are content-based and signature-based. ...
... In content-based filtering, incoming e-mails are analyzed and clustered into categories, usually legitimate or spam messages. The objectives of developing content-based spam filters are to make the filter flexible, adaptable, and autonomous [6]. Most content-based filters utilize machine learning techniques to classify the e-mail that arrives at the system. ...
Conference Paper
Spam mails are a common misuse of e-mail technology. They waste computer and network resources and irritate users. Much spam filtering software has been deployed, and many researchers in the machine learning field seek spam filtering solutions. While the techniques used in detecting spam mails have achieved high detection rates, an open issue remains: spamming behavior changes all the time. Today's spam filter researchers are developing filters that can learn to adapt to these changes. This research proposes a framework for an adaptive spam mail filter that can adapt itself to changes in spamming behavior. The focus of the research is the spam filter itself, and the ability of the filter to accept feedback from users to update its classification performance. The filter uses Dynamic State Clustering (DSC) with a moving window to predict spam mail. The design reduces the impact of noise reaching the filter, while allowing the filter to gradually and carefully adapt to changing spamming behavior.
... For fine-tuning, we use the same classification datasets as in [15] which include Amazon [27], Yelp, IMDB [26] and SST-2 [39] for sentiment classification, Offenseval [46], Jigsaw and Twitter [7] for abusive behavior detection, and Enron [30] and Ling-Spam [36] for spam detection. Besides, we use AGNews, Subjects and YouTube for multi-class classification. ...
Preprint
Full-text available
Pre-trained general-purpose language models have been a dominating component in enabling real-world natural language processing (NLP) applications. However, a pre-trained model with backdoor can be a severe threat to the applications. Most existing backdoor attacks in NLP are conducted in the fine-tuning phase by introducing malicious triggers in the targeted class, thus relying greatly on the prior knowledge of the fine-tuning task. In this paper, we propose a new approach to map the inputs containing triggers directly to a predefined output representation of the pre-trained NLP models, e.g., a predefined output representation for the classification token in BERT, instead of a target label. It can thus introduce backdoor to a wide range of downstream tasks without any prior knowledge. Additionally, in light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks in terms of both effectiveness and stealthiness. Our experiments with various types of triggers show that our method is widely applicable to different fine-tuning tasks (classification and named entity recognition) and to different models (such as BERT, XLNet, BART), which poses a severe threat. Furthermore, by collaborating with the popular online model repository Hugging Face, the threat brought by our method has been confirmed. Finally, we analyze the factors that may affect the attack performance and share insights on the causes of the success of our backdoor attack.
... The Enron dataset contains 517,413 emails from 151 users. Other commonly used spam datasets are the PU datasets [179] and Ling-Spam [180]. The SMS Spam Collection is another dataset, containing 5,574 labelled SMS messages [158]. ...
Article
Full-text available
Pervasive growth and usage of the Internet and mobile applications have expanded cyberspace. The cyberspace has become more vulnerable to automated and prolonged cyberattacks. Cyber security techniques provide enhancements in security measures to detect and react against cyberattacks. The previously used security systems are no longer sufficient because cybercriminals are smart enough to evade conventional security systems. Conventional security systems lack efficiency in detecting previously unseen and polymorphic security attacks. Machine learning (ML) techniques are playing a vital role in numerous applications of cyber security. However, despite the ongoing success, there are significant challenges in ensuring the trustworthiness of ML systems. There are incentivized malicious adversaries present in the cyberspace that are willing to game and exploit such ML vulnerabilities. This paper aims to provide a comprehensive overview of the challenges that ML techniques face in protecting cyberspace against attacks, by presenting a literature on ML techniques for cyber security including intrusion detection, spam detection, and malware detection on computer networks and mobile networks in the last decade. It also provides brief descriptions of each ML method, frequently used security datasets, essential ML tools, and evaluation metrics to evaluate a classification model. It finally discusses the challenges of using ML techniques in cyber security. This paper provides the latest extensive bibliography and the current trends of ML in cyber security.
... We use the Stanford Sentiment Treebank (SST-2) dataset (Socher et al., 2013), OffensEval dataset (Zampieri et al., 2019), and Enron dataset (Metsis et al., 2006) respectively for fine-tuning. For the domain shift setting, we use other proxy datasets for poisoning, specifically the IMDb (Maas et al., 2011), Yelp (Zhang et al., 2015), and Amazon Reviews (Blitzer et al., 2007) datasets for sentiment classification, the Jigsaw 2018 and Twitter (Founta et al., 2018) datasets for toxicity detection, and the Ling-Spam dataset (Sakkis et al., 2003) for spam detection. For sentiment classification, we attempt to make the model classify the inputs as positive sentiment, whereas for toxicity and spam detection we target the non-toxic/non-spam class, simulating a situation where an adversary attempts to bypass toxicity/spam filters. ...
Preprint
Full-text available
Emails and SMSs are the most popular tools in today's communications, and as the number of email and SMS users increases, the number of spam messages also increases. Spam is any kind of unwanted, unsolicited digital communication sent out in bulk; spam emails and SMSs cause major resource wastage by unnecessarily flooding network links. Although most spam mail originates with advertisers looking to push their products, some is much more malicious in intent, like phishing emails that aim to trick victims into giving up sensitive information such as website logins or credit card details; this type of cybercrime is known as phishing. To counter spam, much research effort has gone into building spam detectors that can filter out messages and emails as spam or ham. In this research we build a spam detector using the BERT pre-trained model that classifies emails and messages according to their context, and we trained our spam detector model on multiple corpora, including the SMS Spam Collection corpus, Enron corpus, SpamAssassin corpus, and Ling-Spam corpus; our spam detector's performance was 98.62%, 97.83%, 99.13% and 99.28% respectively. Keywords: Spam Detector, BERT, Machine Learning, NLP, Transformer, Enron Corpus, SpamAssassin Corpus, SMS Spam Detection Corpus, Ling-Spam Corpus.
Article
Despite the great advances in spam detection, spam remains a major problem that has affected the global economy enormously. Spam attacks are popularly perpetrated through different digital platforms with large electronic audiences, such as emails, microblogging websites (e.g. Twitter), social networks (e.g. Facebook), and review sites (e.g. Amazon). Different spam detection solutions have been proposed in the literature; however, Machine Learning (ML) based solutions are among the most effective. Nevertheless, most ML algorithms suffer from computational complexity problems, so some studies introduced Nature Inspired (NI) algorithms to further improve the speed and generalization performance of ML algorithms. This study presents a survey of recent ML-based and NI-based spam detection techniques to empower the research community with information suitable for designing effective spam filtering systems for emails, social networks, microblogging, and review websites. The recent success and prevalence of deep learning show that it can be used to solve spam detection problems. Moreover, the availability of large-scale spam datasets makes deep learning and big data solutions (such as Mahout) very suitable for spam detection. Few studies have explored deep learning algorithms and big data solutions for spam detection. Besides, most of the datasets used in the literature are either small or synthetically created. Therefore, future studies can consider exploring big data solutions, big datasets, and deep learning algorithms for building efficient spam detection techniques.
Article
Machine Learning (ML) algorithms, specifically supervised learning, are widely used in modern real-world applications, which utilize Computational Intelligence (CI) as their core technology, such as autonomous vehicles, assistive robots, and biometric systems. Attacks that cause misclassifications or mispredictions can lead to erroneous decisions resulting in unreliable operations. Designing robust ML with the ability to provide reliable results in the presence of such attacks has become a top priority in the field of adversarial machine learning. An essential characteristic for rapid development of robust ML is an arms race between attack and defense strategists. However, an important prerequisite for the arms race is access to a well-defined system model so that experiments can be repeated by independent researchers. This article proposes a fine-grained system-driven taxonomy to specify ML applications and adversarial system models in an unambiguous manner such that independent researchers can replicate experiments and escalate the arms race to develop more evolved and robust ML applications. The article provides taxonomies for: 1) the dataset, 2) the ML architecture, 3) the adversary’s knowledge, capability, and goal, 4) adversary’s strategy, and 5) the defense response. In addition, the relationships among these models and taxonomies are analyzed by proposing an adversarial machine learning cycle. The provided models and taxonomies are merged to form a comprehensive system-driven taxonomy, which represents the arms race between the ML applications and adversaries in recent years. The taxonomies encode best practices in the field and help evaluate and compare the contributions of research works and reveals gaps in the field.
Conference Paper
Full-text available
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment, over half a million runs of C4.5 and a Naive-Bayes algorithm, to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
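The stratified cross-validation recommended above can be sketched as a simple fold-assignment routine: examples are grouped by class and dealt round-robin into folds so that class proportions stay roughly equal. This is a generic sketch, not tied to the paper's C4.5/Naive-Bayes experiments:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each example index to one of k folds, keeping class
    proportions roughly equal across folds (stratification)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)  # round-robin within each class
    return folds

# 6 spam and 4 ham examples split into 2 folds of 3 spam + 2 ham each.
labels = ["spam"] * 6 + ["ham"] * 4
folds = stratified_folds(labels, k=2)
```

Each fold then serves once as the test set while the remaining folds form the training set.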
Conference Paper
Full-text available
Spam filtering is a text categorization task that exhibits special features making it interesting and difficult. First, the task has traditionally been performed using heuristics from the domain. Second, a cost model is required to avoid misclassification of legitimate messages. We present a comparative evaluation of several machine learning algorithms applied to spam filtering, considering the text of the messages and a set of heuristics for the task. Cost-oriented biasing and evaluation are performed.
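The cost model mentioned in this abstract is often operationalized as a weighted accuracy, in which misclassifying a legitimate message counts lambda times more than misclassifying a spam one. A minimal sketch under that common convention (the function name and the lambda=9 default are illustrative, not taken from the cited paper):

```python
def weighted_accuracy(y_true, y_pred, lam=9):
    """Cost-sensitive accuracy: each legitimate ("ham") message carries
    weight lam, each spam message weight 1, so misclassifying legitimate
    mail is penalized lam times more heavily."""
    num = den = 0.0
    for t, p in zip(y_true, y_pred):
        w = lam if t == "ham" else 1.0
        den += w
        num += w * (t == p)
    return num / den

# One ham classified correctly, one spam missed: (9*1 + 1*0) / 10 = 0.9
score = weighted_accuracy(["ham", "spam"], ["ham", "ham"])
```

With lam=1 the measure reduces to ordinary accuracy; larger lam biases the evaluation toward protecting legitimate mail.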
Article
Full-text available
Many inductive learning problems can be expressed in the classical attribute-value language. In order to learn and to generalize, learning systems often rely on some measure of similarity between their current knowledge base and new information. The attribute-value language defines a heterogeneous multi-dimensional input space, where some attributes are nominal and others linear. Defining the similarity, or proximity, of two points in such input spaces is non-trivial. We discuss two representative homogeneous metrics and show examples of why they are limited to their own domains. We then address the issues raised by the design of a heterogeneous metric for inductive learning systems. In particular, we discuss the need for normalization and the impact of don't-care values. We propose a heterogeneous metric and evaluate it empirically on a simplified version of ILA.
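A heterogeneous metric of the kind discussed above can be sketched by combining the overlap metric for nominal attributes with a range-normalized difference for numeric ones, and treating don't-care/missing values as maximally distant. This mirrors the common HEOM construction, not necessarily the exact metric the cited paper proposes:

```python
def heom(x, y, numeric_ranges):
    """Heterogeneous Euclidean-Overlap distance between attribute
    vectors x and y. numeric_ranges maps a numeric attribute's index
    to its (min, max) range for normalization; all other attributes
    are treated as nominal. None marks a don't-care/missing value,
    which contributes the maximal per-attribute distance of 1."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            d = 1.0
        elif a in numeric_ranges:
            lo, hi = numeric_ranges[a]
            d = abs(xa - ya) / (hi - lo)  # range normalization
        else:
            d = 0.0 if xa == ya else 1.0  # overlap metric
        total += d ** 2
    return total ** 0.5

# Attribute 0 is numeric with range (0, 100); attribute 1 is nominal.
d = heom((50, "red"), (75, "blue"), numeric_ranges={0: (0, 100)})
```

The range normalization keeps wide-range numeric attributes from dominating the nominal ones.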
Article
The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
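The core of the decision-tree synthesis described above is choosing, at each node, the attribute with the highest information gain. A minimal sketch of that criterion (the example data are invented):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from partitioning the examples on one
    attribute; values[i] is that attribute's value for example i."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder

# A perfectly predictive attribute yields a gain of 1 bit on balanced labels.
gain = information_gain(["sunny", "sunny", "rain", "rain"],
                        ["no", "no", "yes", "yes"])
```

An ID3-style learner would compute this gain for every candidate attribute and split on the maximizer, recursing until the partitions are pure.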
Chapter
A significant problem in many information filtering systems is the dependence on the user for the creation and maintenance of a user profile, which describes the user's interests. NewsWeeder is a netnews-filtering system that addresses this problem by letting the user rate his or her interest level for each article being read (1-5), and then learning a user profile based on these ratings. This paper describes how NewsWeeder accomplishes this task, and examines the alternative learning methods used. The results show that a learning algorithm based on the Minimum Description Length (MDL) principle was able to raise the percentage of interesting articles to be shown to users from 14% to 52% on average. Further, this performance significantly outperformed (by 21%) one of the most successful techniques in Information Retrieval (IR), term-frequency/inverse-document-frequency (tf-idf) weighting.
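The tf-idf weighting that NewsWeeder is compared against can be sketched in a few lines; whitespace tokenization and the raw-count tf with natural-log idf are simplifying assumptions, one of several standard variants:

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document tf-idf weights: term frequency in the document
    times log of inverse document frequency across the collection."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc.split())
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = ["spam filter spam", "filter design notes"]
w = tfidf(docs)
# "spam" appears only in doc 0, so it gets a positive weight there;
# "filter" appears in every doc, so its idf (and hence weight) is 0.
```

Terms occurring in every document carry no discriminating weight, which is the intended effect of the idf factor.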
A distance-weighted k-nearest neighbor rule is not necessarily better than the majority rule for small sample size if ties among classes are broken in a judicious manner. The behavior of several tie-breaking procedures is demonstrated using the bivariate distributions for three classes used by S. A. Dudani. In the infinite sample case, the majority rule is the best among all distance-weighted rules.
Article
This work focuses on algorithms which learn from examples to perform multiclass text and speech categorization tasks. Our approach is based on a new and improved family of boosting algorithms. We describe in detail an implementation, called BoosTexter, of the new boosting algorithms for text categorization tasks. We present results comparing the performance of BoosTexter and a number of other text-categorization algorithms on a variety of tasks. We conclude by describing the application of our system to automatic call-type identification from unconstrained spoken customer responses.
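The boosting family behind BoosTexter can be illustrated with a minimal binary AdaBoost over threshold "decision stumps" on one-dimensional data. This is a sketch of the generic algorithm, not the multiclass BoosTexter system itself; labels are +1/-1 and the toy data are invented:

```python
import math

def train_adaboost(xs, ys, rounds=5):
    """Binary AdaBoost: repeatedly fit the best weighted threshold
    stump, then up-weight the examples it misclassifies."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for thr in sorted(set(xs)):          # exhaustive stump search
            for sign in (1, -1):
                preds = [sign if x >= thr else -sign for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign, preds)
        err, thr, sign, preds = best
        if err >= 0.5:                       # no stump beats chance; stop
            break
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, thr, sign))
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]         # renormalize the distribution
    return ensemble

def predict_boost(ensemble, x):
    """Sign of the alpha-weighted committee vote."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

e = train_adaboost([1, 2, 3, 4], [-1, -1, 1, 1])
```

BoosTexter extends this scheme to multiclass, multi-label text by using word-presence stumps and confidence-rated predictions.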