Spam filter optimality based on signal detection theory

Singh Kuldeep
University Graduate Center
(UNIK), Norway
NTNU, Norway
HUT, Finland
kuldeep@unik.no
Jøsang Audun
University Graduate Center
(UNIK), Norway
University of Oslo, Norway
QUT, Australia
josang@unik.no
Md. Sadek Ferdous
University Graduate Center
(UNIK), Norway
NTNU, Norway
University of Tartu, Estonia
sadek@unik.no
Ravishankar Borgaonkar
University Graduate Center
(UNIK), Norway
HUT, Finland
KTH, Sweden
ravishankar@unik.no
ABSTRACT
Unsolicited bulk email, commonly known as spam, represents a significant problem on the Internet. The seriousness of the situation is reflected by the fact that approximately 97% of total e-mail traffic currently (2009) is spam. To fight this problem, various anti-spam methods have been proposed and implemented to filter out spam before it is delivered to recipients, but none of these methods is entirely satisfactory. In this paper we analyze the properties of spam filters from the viewpoint of Signal Detection Theory (SDT). The Bayesian approach of Signal Detection Theory provides a basis for determining the optimality of spam filters, i.e. whether they provide positive utility to users. In the decision making process of a spam filter, various tradeoffs are considered as a function of the costs of incorrect decisions and the benefits of correct decisions.
Categories and Subject Descriptors
D.m [Software]: Miscellaneous
General Terms
performance, security, measurement
Keywords
Spam, e-mail, filters, tradeoffs, Optimality, Signal Detection
Theory (SDT)
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIN’09, October 6–10, 2009, North Cyprus, Turkey.
Copyright 2009 ACM 978-1-60558-412-6/09/10 ...$10.00.
1. INTRODUCTION
Spam in the form of unwanted email is a huge and growing problem. The amount of spam that circulates through the Internet is increasing day by day, and affects everyone on the Internet, from network providers and Internet Service Providers (ISPs) to companies and end users. Manually deleting spam from the inbox every day is annoying and time consuming for all Internet users. In [1] it was found that approximately 97% of the total email traffic these days consists of spam. The problem gets even worse when spam is used to actively harm the recipients through attacks such as phishing and 419 scams [11, 7]. Apart from these threats, spam wastes time and money. For example, in a survey conducted in 2006 among employees of 500 large companies in the US and Finland, it was found that an employee spends on average 13 minutes of daily working time reading, deleting or replying to spam messages [12].
The increasing amount of spam has attracted the attention of Internet and security experts. As a result, many anti-spam strategies have been proposed and implemented. Current work also investigates methods to block spam completely. The main reason behind the increasing amount of spam lies in the cost imbalance between senders and recipients. Sending large amounts of spam costs very little compared to the relatively high cost of viewing and deleting a single spam message. Millions of emails can be sent per hour with just 56 kbps of bandwidth [6]. According to [14], if even one among 500,000 spam messages of direct-mail print campaigns attracts a recipient to buy the product, then the whole cost incurred in sending the 500,000 messages is covered. The recipients and the ISPs, on the other hand, have to carry significant costs. The most obvious cost is the bandwidth consumed for processing spam. In large organizations the charging for Internet connections is based on traffic, and because of spam these firms end up paying significant amounts for non-productive traffic. On the ISP side the cost comes from wasted bandwidth and CPU time.
It is important to understand, analyze and measure the effectiveness and efficiency of spam filters in order to improve their quality. In the context of spam filters, “effectiveness” means the degree to which genuine spam is detected and removed. On the other hand, “efficiency” means the degree to which genuine email messages are correctly delivered. A filter that removes most spam messages will have high effectiveness, but if it removes many genuine email messages together with spam messages it will have poor efficiency.
SDT (Signal Detection Theory) [10, 2] is a mathematical model that is suitable for analyzing the effectiveness and efficiency of spam filters. SDT provides a rational basis for decision making under conditions of uncertainty. For example, the question “Is this my dog barking, or is it just the television?” is a typical situation where SDT can be applied to guide the dog owner to the optimal action, i.e. to ignore the sound, or to go and look after the dog. Visualization techniques used in SDT can provide additional decision support in situations of uncertainty.
Section 2 briefly describes related studies on analyzing
spam filter performance. Section 3 presents the background
of SDT, Section 4 describes how SDT can be applied to spam
filter analysis, Section 5 discusses the presented technique,
and Section 6 concludes this paper.
2. RELATED WORK
In the context of spam filtering, genuine (non-spam) email messages are commonly called “ham”. Since spam filters are trying to identify spam, a message identified as spam is called a “positive”. A ham message incorrectly classified as spam therefore represents an instance of false positive (FP), and a spam message identified as ham represents a false negative (FN).
Various analyses of the performance of spam filters have been done in previous studies. The effectiveness of a spam filter is affected by the domain in which it is used. For example, the cost of a lost genuine email message incorrectly detected as spam will depend on the recipient’s (and sender’s) business area, as well as on the recipient’s (and sender’s) perception, attitude and level of frustration.
A method for analyzing spam filters was proposed by Garcia et al. in 2004 [8]. Their analysis was restricted to open source filters, and only considered content based filters, i.e. not, for example, black/white lists. Apart from computing false positive and false negative rates, [8] proposed a function for calculating a single measure of a filter’s error rate as a function of its false positive and false negative rates.
Another approach to analyzing spam filter performance is through the Precision and Recall metrics. This method was extensively used for spam filter classification in [13]. Precision is the ratio of spam messages classified as spam relative to the total number of messages classified as spam, and Recall is the ratio of spam messages classified as spam relative to the total number of spam messages. For example, if 5 out of 10 spam messages are correctly identified as spam then the Recall is 0.5. As long as no ham messages are classified as spam the Precision will be 1, but as soon as some ham messages are incorrectly classified as spam the Precision falls below 1. For spam filters, an instance of FP is normally considered more problematic than an instance of FN. Precision, which reflects a filter’s FP property, is therefore considered to be a more important measure than Recall, which reflects the filter’s FN property. The Precision value therefore needs to be higher than the Recall value, but at the same time there should be a proper balance between the two values.
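The two definitions above can be illustrated with a minimal Python sketch. The function name `precision_recall` and the handling of empty denominators are assumptions made for this illustration, not part of the cited work.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) with spam as the positive class.

    tp: spam classified as spam, fp: ham classified as spam,
    fn: spam classified as ham.
    """
    classified_as_spam = tp + fp
    total_spam = tp + fn
    # Assumed convention: with nothing classified as spam, Precision is 1.
    precision = tp / classified_as_spam if classified_as_spam else 1.0
    # Assumed convention: with no spam in the set, Recall is reported as 0.
    recall = tp / total_spam if total_spam else 0.0
    return precision, recall


# Example from the text: 5 of 10 spam messages caught, no ham misclassified.
print(precision_recall(tp=5, fp=0, fn=5))  # (1.0, 0.5)

# A single ham message wrongly classified as spam pulls Precision below 1.
print(precision_recall(tp=5, fp=1, fn=5))
```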
Another proposed method for measuring the effectiveness of spam filters is Weighted Accuracy, which uses the accuracy and error rate as measures [5]. A relative weight λ is applied both to the error types FP (False Positive) and FN (False Negative) and to the correct classification types. An instance of FP counts as λ instances of FN. An instance of TN (true negative), i.e. a correct classification of a genuine email message, counts as λ instances of TP (true positive), i.e. a correct classification of spam. This method reflects that an instance of FP is λ times more costly than an instance of FN.
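As a rough sketch of this weighting, each ham message can be treated as if it were λ messages when computing accuracy. The function name `weighted_accuracy` and the example counts are hypothetical, and the code follows only the description given here.

```python
def weighted_accuracy(tp: int, fn: int, fp: int, tn: int, lam: float = 1.0) -> float:
    """Accuracy in which every ham message counts lam times.

    A TN contributes lam correct decisions and an FP contributes
    lam errors; TP and FN each count once.
    """
    n_spam = tp + fn
    n_ham = fp + tn
    return (lam * tn + tp) / (lam * n_ham + n_spam)


# Hypothetical counts: 100 spam and 100 ham messages.
counts = dict(tp=80, fn=20, fp=2, tn=98)
print(weighted_accuracy(**counts, lam=1))  # plain accuracy
print(weighted_accuracy(**counts, lam=9))  # an FP weighted 9 times an FN
```

With a larger λ the score is dominated by how well ham is handled, which matches the intent of penalizing FPs more than FNs.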
In [4], 10-fold cross validation is used as an evaluation method to estimate how well the filter works after training. In this method the corpus is split into 10 mutually exclusive parts and the filter under test is tested against each of these parts. The final estimate is the mean of all the tests.
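A minimal sketch of this procedure follows. It assumes the usual formulation, in which each part is held out once for testing while the filter is trained on the remaining nine; that detail, the helper names, and the toy "majority" filter are assumptions for illustration and are not taken from [4].

```python
from statistics import mean


def k_fold_indices(n_items: int, k: int = 10) -> list[list[int]]:
    """Split the indices 0..n_items-1 into k mutually exclusive parts."""
    return [list(range(start, n_items, k)) for start in range(k)]


def cross_validate(items, train_and_score, k: int = 10) -> float:
    """Hold out each part once, score on it, and return the mean score."""
    scores = []
    for held_out in k_fold_indices(len(items), k):
        held = set(held_out)
        train = [x for i, x in enumerate(items) if i not in held]
        test = [items[i] for i in held_out]
        scores.append(train_and_score(train, test))
    return mean(scores)


def majority_filter_accuracy(train, test) -> float:
    """Toy filter (hypothetical): always predict the majority training label."""
    predict_spam = sum(train) > len(train) / 2
    return sum(1 for is_spam in test if is_spam == predict_spam) / len(test)


# Toy corpus of labels only: True = spam, False = ham.
labels = [True, True, False] * 10
estimate = cross_validate(labels, majority_filter_accuracy)
print(f"mean accuracy over 10 folds: {estimate:.3f}")
```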
The ROC (Receiver Operating Characteristics) curve is another method for spam filter evaluation suggested by Hidalgo in [2]. It has a discrimination threshold value which, when varied, produces the trade-off between FP and TP. From a visualization viewpoint, if the ROC curve of one filter is uniformly above that of another filter, then it can be inferred that the performance of the first filter is superior to that of the other.
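The threshold sweep behind a ROC curve can be sketched as follows. The function name `roc_points`, the rule that a message is classified as spam when its score is at least the threshold, and the example scores are all assumptions for illustration.

```python
def roc_points(scores, is_spam):
    """Return (FP rate, TP rate) pairs, one per candidate threshold.

    A message is classified as spam when its score >= threshold.
    Thresholds are taken from the observed scores, highest first.
    """
    n_spam = sum(is_spam)
    n_ham = len(is_spam) - n_spam
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_spam) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, is_spam) if not y and s >= threshold)
        points.append((fp / n_ham, tp / n_spam))
    return points


# Hypothetical filter scores for three spam and three ham messages.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
is_spam = [True, True, False, True, False, False]
for fp_rate, tp_rate in roc_points(scores, is_spam):
    print(f"FP rate={fp_rate:.2f}  TP rate={tp_rate:.2f}")
```

Lowering the threshold moves along the curve from the lower left (few FPs, few TPs) to the upper right (all messages classified as spam).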
3. SIGNAL DETECTION THEORY
This section presents a model for analyzing spam filters based on SDT (Signal Detection Theory) [10, 2, 9, 15]. SDT is based on probability theory and is an effective means to analyze ambiguous data. In the SDT framework each event is assumed to be either: 1) signal (from a known process) or 2) noise (from an unknown process). SDT provides a formal framework for setting optimal thresholds for distinguishing between signal and noise. For example, in a radar system the operator tries to determine from the display on the radar screen whether it shows a signal (aircraft) or noise (a bird or something else), and setting the optimal decision threshold is important for the success of military operations.
SDT assumes that the signal and noise distributions overlap each other and that an observed stimulus may come from either distribution. In addition, SDT assumes that the signal is added to the noise and that the decision maker tries to achieve optimal performance by balancing cost and benefit.
Fig. 1 shows the SDT model with the two distributions (signal and noise), assuming that both distributions are normal with equal standard deviations. The horizontal axis (X-axis) represents the strength of the internal response (also called hidden variable, decision variable or internal variable), which is a function of the external observed stimulus. The vertical axis (Y-axis) represents the probability of the internal response. These distributions are used in the process of deciding whether the stimulus represents signal or noise. The vertical line between the two distributions is the criterion threshold for the internal response that is used to make a decision.
In the process of decision making, any internal response with a value less than the criterion is determined to come from the noise distribution, while an internal response with a value greater than the criterion is determined to come from the signal distribution. After receiving the stimulus the decision maker has to decide whether to accept or reject it.
The overlap between the noise distribution and the signal distribution results in four possible decisions, which are shown in Fig. 2.
Figure 1: SDT model showing overlap between signal and noise distributions
False Negative (FN): Stimulus coming from the signal distribution incorrectly detected as noise.¹
True Positive (TP): Stimulus coming from the signal distribution correctly detected as signal.²
False Positive (FP): Stimulus coming from the noise distribution incorrectly detected as signal.³
True Negative (TN): Stimulus coming from the noise distribution correctly detected as noise.⁴
FP and FN are also known in statistics as Type I and Type II errors respectively. The SDT decision making method is based on the concepts of TP Rate and FP Rate. The TP Rate is the number of times a genuine signal is detected as signal divided by the total number of genuine signals. Hence, it can be calculated as follows:
\[ \text{TP Rate} = \frac{TP}{TP + FN} \tag{1} \]
The FP Rate is the number of times genuine noise is detected as signal, divided by the total number of genuine noise instances. Hence the FP Rate can be calculated using the following formula:
\[ \text{FP Rate} = \frac{FP}{FP + TN} \tag{2} \]
It can be noted that the sum of the TP and FN Rates, as well as the sum of the FP and TN Rates, is equal to 1. This can be expressed as:
\[ \text{FN Rate} = 1 - \text{TP Rate}, \qquad \text{TN Rate} = 1 - \text{FP Rate} \tag{3} \]
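Eqs. (1) to (3) can be checked on a small set of counts with the following Python sketch. The function name `sdt_rates` and the example counts are hypothetical.

```python
def sdt_rates(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Compute the four rates of Eqs. (1)-(3) from raw decision counts."""
    tp_rate = tp / (tp + fn)  # Eq. (1): hits among genuine signals
    fp_rate = fp / (fp + tn)  # Eq. (2): false alarms among genuine noise
    return {
        "TP Rate": tp_rate,
        "FN Rate": 1 - tp_rate,  # Eq. (3)
        "FP Rate": fp_rate,
        "TN Rate": 1 - fp_rate,  # Eq. (3)
    }


# Hypothetical counts: 100 signal events and 100 noise events.
rates = sdt_rates(tp=80, fn=20, fp=5, tn=95)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```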
¹ Called “Miss” in SDT terminology.
² Called “Hit” in SDT terminology.
³ Called “False Alarm” or “FA” in SDT terminology.
⁴ Called “Correct Identification” or “CI”, or “Correct Rejection” or “CR”, in SDT terminology.
Figure 2: The model of SDT showing TP, FN, FP and TN
Fig. 3 illustrates the analysis of TP and FP rates. The lower half of Fig. 3 sets the criterion at the left-most edge of the signal distribution. Statistically, this means that the TP Rate is 100%. Consider the example of a doctor who decides whether there is a tumor in the brain based on the internal response to a brain scan. If the criterion is lowered such that the TP Rate is 100%, then the FP Rate also increases, as shown in the lower half of Fig. 3. The doctor will therefore never miss a real tumor, but a negative side-effect of increasing the TP Rate is a corresponding increase in the FP Rate. If the criterion is instead raised to the right-most edge of the noise distribution, as shown in the upper half of Fig. 3, then the FP Rate becomes 0%, but at the same time the TP Rate also gets very low. This means that the doctor gets no false alarms, but will miss many real tumors. The optimal criterion value will depend on the costs of FPs and FNs.
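The effect of moving the criterion can be reproduced numerically under the equal-variance normal model described earlier. In the sketch below the means (0 for noise, 2 for signal) and the unit standard deviation are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical equal-variance model of the internal response.
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=2.0, sigma=1.0)


def rates_at(criterion: float) -> tuple[float, float]:
    """Return (TP Rate, FP Rate) when responses above the criterion
    are reported as signal."""
    tp_rate = 1.0 - signal.cdf(criterion)
    fp_rate = 1.0 - noise.cdf(criterion)
    return tp_rate, fp_rate


# A low criterion catches nearly every signal but raises many false
# alarms; a high criterion does the opposite.
for c in (-1.0, 1.0, 3.0):
    tp_rate, fp_rate = rates_at(c)
    print(f"criterion={c:+.1f}  TP Rate={tp_rate:.3f}  FP Rate={fp_rate:.3f}")
```

Because the normal distributions have unbounded tails, the rates only approach 100% and 0% in this model; the edges drawn in the figure correspond to the points where the tails become negligible.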
SDT assumes that it is practically impossible to simultaneously have a 100% TP Rate and a 0% FP Rate because of the overlap between the signal and the noise distributions. SDT offers a method for defining the criterion value which will result in optimal decision making. Thus the choice of the criterion value is important. In this paper we use SDT and Bayesian methods for analyzing spam filters with regard to their inherent criterion values.
SDT based decision making is mainly influenced by two values:
1. The Likelihood Ratio (LR), also called the Actual LR.
2. The Optimal LR (LR’), which is compared with the Actual LR to determine the optimality of the decision maker.
The Actual LR is calculated using the following formula:
\[ LR = \frac{\text{TP Rate}}{\text{FP Rate}} \tag{4} \]
Figure 3: SDT model showing the criterion at two different places: FP Rate = 0% and TP Rate = 100%
The Optimal LR value depends on the base rate probabilities of a stimulus being signal or noise, and also on the costs of incorrect and the benefits of correct detection. It is calculated by multiplying the ratio of the base rate probability of noise P(noise) to the base rate probability of signal P(signal) by a constant K that incorporates the costs of errors and the benefits of correct identifications. Note that for every stimulus, the equation P(noise) + P(signal) = 1 holds.
\[ LR' = \frac{P(\text{noise})}{P(\text{signal})} \cdot K \tag{5} \]
where the constant K is calculated as follows:
\[ K = \frac{\text{Benefits of TN} + \text{Costs of FP}}{\text{Benefits of TP} + \text{Costs of FN}} \tag{6} \]
In the process of decision making in SDT the four possible outcomes are TP, FN, FP and TN. The decision matrix of the spam detector is shown in Fig. 4.
Eq.(6) is useful in deciding whether the decision maker is behaving optimally or not. All four values in Eq.(6) can be different, and the differences can be large. For example, in the case of tsunami detection the cost of FN is very high, while in the case of spam detection the cost of FP is relatively high in comparison with the cost of FN. The Bayesian approach used in this paper for decision making considers all the costs and benefits and the various tradeoffs.
4. SIGNAL DETECTION THEORY USED FOR SPAM FILTER ANALYSIS
Spam filters are used to separate spam from ham, and they carry out this separation in different ways. For example, content based filtering [8] analyzes the body of the message, while origin based filtering [8] judges the source of the message. SDT can be used to analyze spam filters based on a single method as well as filters based on multiple methods, such as those used by email service providers like Gmail, Yahoo mail and Hotmail.
When applying SDT to spam filter analysis, we will use the terminology convention that an instance of spam is considered as signal, and an instance of ham is considered as noise. Within the SDT framework, the difficulty of distinguishing between spam and ham increases with the degree of overlap between the two distributions, as would be expected. The overlap between the spam and ham distributions results in two types of incorrect and two types of correct decisions, defined as:
1. Ham classified as ham (TN)
2. Spam classified as ham (FN)
3. Spam classified as spam (TP)
4. Ham classified as spam (FP)
The 3rd and 4th outcomes are important from the SDT point of view as they are used in the mathematical expressions. In the following, S denotes a genuine spam message, and Ŝ denotes an assumed spam message. Similarly, H denotes a genuine ham message, and Ĥ denotes an assumed ham message. The four possible outcomes of the spam filter are shown in Fig. 4, where P(Ŝ|S), P(Ĥ|S), P(Ŝ|H) and P(Ĥ|H) represent the four conditional probabilities.
Figure 4: Decision Matrix for a spam filter showing
four possible cases
All four possible cases are dependent on each other. For example, when the message really is spam (1st row), the proportions of TP and FN add up to 1 because the filter can only respond in one of two ways, either Yes or No. Likewise, when the message really is ham (2nd row), the proportions of FP and TN add up to 1. Thus all the information in the decision matrix can be obtained from TP and FP. Therefore we have
\[ P(\hat{H}|S) = 1 - P(\hat{S}|S) \tag{7} \]
\[ P(\hat{H}|H) = 1 - P(\hat{S}|H) \tag{8} \]
The conditional probabilities P(Ŝ|S) and P(Ŝ|H) represent the TP and FP rates respectively. The TP rate indicates the successful filtering of spam messages, and can therefore be used to analyze the effectiveness of the spam filter. The FP rate on the other hand shows errors, which can be used to determine the efficiency of spam filters. Efficiency can be increased by reducing the FP rate. The effectiveness of the spam filter increases as the TP rate gets closer to 1 and the efficiency increases as the FP rate gets closer to 0.
It can easily be concluded that a spam filter behaves best when the TP rate is at its maximum and the FP rate is at its minimum. In practice no automated spam filter can be both 100% effective and 100% efficient at the same time. The reason for this is that clever composition of spam messages gives them characteristics similar to those of ham messages. For automated filters that do not have the same cognitive and semantic capabilities as humans, separation between ham and spam is not always possible.
The TP rate and the FP rate of a spam filter are used to calculate the LR (Likelihood Ratio). The formula to calculate the LR is as follows:
\[ LR = \frac{\text{TP Rate}}{\text{FP Rate}} = \frac{P(\hat{S}|S)}{P(\hat{S}|H)} \tag{9} \]
After the Actual LR has been calculated it is compared with the Optimal LR (LR’). LR’ is calculated using the base rate probabilities of occurrence of spam messages in a representative set of messages. In addition, LR’ is also based on the costs associated with incorrect decisions and the benefits associated with correct decisions. With the goal of maximizing the gains and minimizing the losses, the LR’ value can be calculated as follows:
\[ LR' = \frac{P(H)}{P(S)} \cdot \frac{B_{\hat{H}|H} + C_{\hat{S}|H}}{B_{\hat{S}|S} + C_{\hat{H}|S}} \tag{10} \]
where P(H) and P(S) represent the base rate probabilities of ham and spam in the message set. The additivity P(H) + P(S) = 1 always holds.
In the above equation B_{Ĥ|H} denotes the benefit associated with TN, and B_{Ŝ|S} denotes the benefit associated with TP. Similarly, C_{Ŝ|H} denotes the cost associated with FP, and C_{Ĥ|S} denotes the cost associated with FN. Eq.(10) shows that the optimal LR’ value depends on two factors:
1. Base rates of spam and ham
2. Relative costs of errors and benefits of correct identification
In Eq.(10), if the costs of errors are the same as the benefits of correct responses, then the value of LR’ becomes equal to the ratio of the base rate probabilities of ham and spam, i.e. LR’ = P(H)/P(S).
Empirical research [13, 5, 3] has found that the base rate probability of spam affects the detection of spam. The base rate probability will therefore influence the criterion value.
The cost of FP is normally significantly higher than the cost of FN. People are normally more concerned about the loss of a ham message than about receiving a spam message. With the help of Eq.(11) different aspects of the spam filter can be evaluated and analyzed.
When comparing LR and LR’, the optimal tuning of the spam filter is reached when the following equation holds:
\[ \frac{P(\hat{S}|S)}{P(\hat{S}|H)} = \frac{P(H)}{P(S)} \cdot \frac{B_{\hat{H}|H} + C_{\hat{S}|H}}{B_{\hat{S}|S} + C_{\hat{H}|S}} \tag{11} \]
If LR is equal to LR’, it can be concluded that the spam filter is optimal for the particular user; otherwise it is not.
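The comparison of LR with LR’ can be sketched in Python as follows. Several points are assumptions made for this illustration and are not stated in the paper: costs are entered as positive magnitudes and added to the benefits (the sign convention that agrees with Eq.(17)), exact equality is relaxed to a relative tolerance because measured rates are never exact, and all numeric values and function names are hypothetical.

```python
import math


def actual_lr(tp_rate: float, fp_rate: float) -> float:
    """Actual likelihood ratio of the filter: TP rate over FP rate."""
    return tp_rate / fp_rate


def optimal_lr(p_spam: float, b_tn: float, c_fp: float,
               b_tp: float, c_fn: float) -> float:
    """Optimal likelihood ratio LR'.

    Assumption: costs are positive magnitudes added to the benefits.
    """
    p_ham = 1.0 - p_spam
    return (p_ham / p_spam) * (b_tn + c_fp) / (b_tp + c_fn)


def is_optimally_tuned(lr: float, lr_opt: float, rel_tol: float = 0.05) -> bool:
    """Treat the filter as optimal when LR is within rel_tol of LR'."""
    return math.isclose(lr, lr_opt, rel_tol=rel_tol)


# Hypothetical user: equal base rates, an FP far more costly than an FN.
lr_opt = optimal_lr(p_spam=0.5, b_tn=1, c_fp=179, b_tp=1, c_fn=1)

for tp_rate, fp_rate in [(0.90, 0.01), (0.80, 0.20)]:
    lr = actual_lr(tp_rate, fp_rate)
    print(f"LR={lr:.1f}  LR'={lr_opt:.1f}  optimal={is_optimally_tuned(lr, lr_opt)}")
```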
Eq.(11) applies to a filter equipped with just one technique to distinguish between ham and spam; a filter that satisfies it will maximize the utility for the user. When a spam filter has more than one filtering technique, which is generally the case, additional considerations must be taken into account.
All the filtering techniques are assumed to be applied in sequence. In addition, the inherent characteristics of each filtering technique are assumed to be statistically independent of each other. If the filtering techniques are not statistically independent, then the sequential set of filters is assumed to consist of just one filtering technique, and this filter would be relatively less effective. A filtering technique at one point in the chain will change the base rate probabilities for the next filtering technique in the chain. If the base rate probabilities are changed by the stimulus emanating from the 1st filtering technique, it should result in an LR of the same form as Eq.(9). This new value will be denoted as LR₁.
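\[ LR_1 = \frac{P(\hat{S}_1|S)}{P(\hat{S}_1|H)} \tag{12} \]
Therefore:
\[ \frac{P(\hat{S}_1|S)}{P(\hat{S}_1|H)} = \frac{P(H)}{P(S)} \cdot \frac{B_{\hat{H}|H} + C_{\hat{S}|H}}{B_{\hat{S}|S} + C_{\hat{H}|S}} \tag{13} \]
or equivalently
\[ \frac{P(\hat{S}_1|S)}{P(\hat{S}_1|H)} \cdot \frac{P(S)}{P(H)} = \frac{B_{\hat{H}|H} + C_{\hat{S}|H}}{B_{\hat{S}|S} + C_{\hat{H}|S}} \tag{14} \]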
In the above equation the left hand side determines the new base rate probability for the 2nd filtering technique. The base rate probability and the LR change every time an e-mail passes through a new filtering technique. LR₁ indicates the LR after the 1st filtering technique. If the filter incorporates n filtering techniques, then Eq.(13) changes to:
\[ \prod_{i=1}^{n} \frac{P(\hat{S}_i|S)}{P(\hat{S}_i|H)} = \frac{P(H)}{P(S)} \cdot \frac{B_{\hat{H}|H} + C_{\hat{S}|H}}{B_{\hat{S}|S} + C_{\hat{H}|S}} \tag{15} \]
Eq.(15) can be used for analyzing multiple technique spam filters.
5. DISCUSSION
It can be concluded from Eq.(11) that if the base rate probabilities of spam and ham are equal, then we get
\[ P(H)/P(S) = 1. \tag{16} \]
If in addition the costs and benefits are balanced, expressed by
\[ (B_{\hat{H}|H} + C_{\hat{S}|H}) = (B_{\hat{S}|S} + C_{\hat{H}|S}), \tag{17} \]
then LR’ becomes equal to 1. This means that the TP rate is equal to the FP rate, which is undesirable from the point of view of both the filter’s efficiency and its effectiveness.
Considering the scenario from the point of view of the FP rate and the FN rate, we can conclude that either the FPs can be minimized or the FNs can be minimized, but not both at the same time. Eq.(11) therefore helps in finding the optimal criterion value.
In the case of e-mail one would normally prefer receiving a spam message over losing a ham message, because the cost of an FP is significantly higher than the cost of an FN. Therefore, in order to be more efficient, spam filters should use a stricter criterion when classifying e-mails. Since a spam message represents a signal for the spam filter, a stricter filter would classify incoming messages as ham even when they have a certain likelihood of being spam. This would result in more ham messages ending up in the normal inbox, and hence in fewer FPs.
Interesting results follow from Eq.(15), which was derived for multi-technique spam filters. Assume that the benefits of the correct responses are approximately equal, so that the major difference lies between the costs associated with FN and FP (generally the main concern is with the FP and FN rates). The payoffs can then be taken as the ratio of the cost of an FP to the cost of an FN. Moreover, considering modern day spam and spam filters, assume that the base rate probability of spam is 97%, i.e. P(H)/P(S) = 3/97, that the TP and FP rates of each technique are 80% and 20% respectively, and that the payoffs on the right hand side of Eq.(15) are 1000/1. A filter then needs to incorporate three filtering techniques to satisfy the needs of the user and provide positive utility, as shown by the calculation (80/20)·(80/20)·(80/20) = 64, which is greater than (3/97)·(1000/1) ≈ 30.9, whereas two techniques would give only 16. This means that the likelihood ratio is greater, which in turn means a lower FP rate and a higher TP rate.
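The arithmetic of this example can be reproduced with a short Python sketch. It assumes, as the example does, that every technique in the chain has the same TP and FP rates and that the techniques are statistically independent; the function names and the `max_n` search limit are hypothetical.

```python
def combined_lr(tp_rate: float, fp_rate: float, n: int) -> float:
    """Left hand side of Eq.(15) for n identical, independent techniques."""
    return (tp_rate / fp_rate) ** n


def min_techniques(tp_rate: float, fp_rate: float, p_spam: float,
                   payoff_ratio: float, max_n: int = 50):
    """Smallest n whose combined LR exceeds the right hand side of Eq.(15).

    Returns None if no n up to max_n is sufficient.
    """
    target = ((1.0 - p_spam) / p_spam) * payoff_ratio
    for n in range(1, max_n + 1):
        if combined_lr(tp_rate, fp_rate, n) > target:
            return n
    return None


# Values from the example: 97% spam, TP rate 80%, FP rate 20%, payoffs 1000/1.
target = (0.03 / 0.97) * 1000
for n in (1, 2, 3):
    print(f"n={n}: combined LR={combined_lr(0.8, 0.2, n):.1f}  target={target:.1f}")
print("techniques needed:", min_techniques(0.8, 0.2, p_spam=0.97, payoff_ratio=1000))
```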
The smaller the difference between LR and LR’, the less tuning is needed for the spam filter to behave optimally for the particular user.
6. CONCLUSIONS
This paper describes the analysis of spam filters within the framework of signal detection theory.
The criterion value plays an important part in decision making. It represents the environment in which the spam filter operates as well as the user’s subjective view of the costs and benefits of false and correct filtering.
For a spam filter that is perfect, the costs and benefits of false and correct filtering are less important. The spam filter will normally make optimal filtering decisions and provide positive utility for the user.
However, if the spam filter characteristics are not close to optimal, the values that the user assigns to the costs of incorrect filtering and the benefits of correct filtering do matter for determining whether the filter behaves optimally, i.e. whether it provides positive utility. If not, the user would be better off not using the spam filter, because that would provide better utility.
7. REFERENCES
[1] Security intelligence. Technical report, Microsoft,
December 2008.
[2] H. Abdi. Signal detection theory (sdt) overview.
[3] A. R. Agustin Orfila, Javier Carbo. Decision model
analysis for spam. Information and Security: An
International Journal, 15(2):151–161, 2004.
[4] G. P. F. R. Andre Bergholz, Jeong-Ho Chang and
S. Strobel. Improved phishing detection using
model-based features. In Fifth Conference on Email
and Anti-Spam. CEAS, August 2008.
[5] I. Androutsopoulos, J. Koutsias, K. V. Ch,
G. Paliouras, and C. D. Spyropoulos. An evaluation of
naive bayesian anti-spam filtering. In Proceedings of
the workshop on Machine Learning in the New
Information Age, G. Potamias, V. Moustakis and M.
van Someren (eds.), 11th European Conference on
Machine Learning, pages 9–17, 2000.
[6] A. Cournane and R. Hunt. An analysis of the tools
used for the generation and prevention of spam.
[7] M. A. Dyrud. “I brought you a good news”: An analysis of Nigerian 419 letters. In Proc. of the 2005 Association for Business Communication Annual Convention, 2005.
[8] F. D. Garcia, J.-H. Hoepman, and J. V. Nieuwenhuizen. Spam filter analysis. In Proceedings of 19th IFIP International Information Security Conference, WCC2004-SEC, pages 395–410. Kluwer Academic Publishers, 2004.
[9] D. M. Green and J. A. Swets. Signal Detection Theory and Psychophysics. Peninsula Publishing, 1966.
[10] D. Heeger. Signal detection theory. Technical report, November 1997. Available at: http://www.cns.nyu.edu/david/handouts/sdt-advanced.pdf.
[11] A. Jøsang and S. Pope. User centric identity management. In Asia Pacific Information Technology Security Conference, AusCERT2005, Australia, pages 77–89, 2005.
[12] S. Mikko and S. Carl. Effective anti-spam strategies in
companies: An international study. In HICSS ’06:
Proceedings of the 39th Annual Hawaii International
Conference on System Sciences, page 127.3,
Washington, DC, USA, 2006. IEEE Computer Society.
[13] M. Sahami, S. Dumais, D. Heckerman, and
E. Horvitz. A bayesian approach to filtering junk
e-mail. In AAAI Workshop on Learning for Text
Categorization, July 1998.
[14] S. Shirali-Shahreza and A. Movaghar. A new
anti-spam protocol using captcha. pages 234–238,
April 2007.
[15] T. D. Wickens. Elementary Signal Detection Theory.
Oxford University Press (OUP), 2001.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Nigerian 419 letters (so called because 419 is the fraud section of the Nigerian penal code) offer fertile ground for classroom activities related to persuasion. Originally sent as paper mail, 419s now appear by the thousands via fax and email. An examination of over 100 emails received by the author reveals several persuasive techniques that make the scam successful. The paper also includes pedagogical suggestions.
Article
Full-text available
One of the security challenges in e-Government is to offer a smooth dialogue with citizens, which guarantees the availability, confidentiality and integ-rity of the information interchanged. Spam jeopardizes the survival of electronic mail as a communication means. Many approaches to tackle the problem with spam have been proposed. This paper shows the necessity of studying the real value of spam filters. Contrary to common belief, false positive rate and false negative rate do not completely reveal to what extent a junk filter is worth using. Very important parameters like the hostility of the environment (summarized by the probability of receiving spam) or the error costs associated with the filter play a decisive role.
Article
Full-text available
It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail ("spam"). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice. Comment: 9 pages
Conference Paper
Full-text available
Phishing emails are a real threat to inter- net communication and web economy. Crim- inals are trying to convince unsuspecting on- line users to reveal passwords, account num- bers, social security numbers or other per- sonal information. Filtering approaches us- ing blacklists are not completely eective as about every minute a new phishing scam is created. We investigate the statistical filter- ing of phishing emails, where a classifier is trained on characteristic features of existing emails and subsequently is able to identify new phishing emails with dierent contents. We propose advanced email features gener- ated by adaptively trained Dynamic Markov Chains and by novel latent Class-Topic Mod- els. On a publicly available test corpus clas- sifiers using these features are able to reduce the number of misclassified emails by two thirds compared to previous work. Using a recently proposed more expressive evaluation method we show that these results are statis- tically significant. In addition we successfully tested our approach on a non-public email corpus with a real-life composition.
Article
Unsolicited bulk email (a.k.a. spam) is a major problem on the Internet. To counter spam, several techniques, ranging from spam filters to mail protocol extensions like hashcash, have been proposed. In this paper we investigate the effectiveness of several spam filtering techniques and technologies. Our analysis was performed by simulating email traffic under different conditions. We show that genetic-algorithm-based spam filters perform best at the server level and naive Bayesian filters are the most appropriate for filtering at the user level.
Article
Signal Detection Theory (SDT) is used to analyze data coming from experiments where the task is to categorize ambiguous stimuli which can be generated either by a known process (called the signal) or be obtained by chance (called the noise in the SDT framework). In particular, SDT is used to analyze experiments where a binary answer (e.g., "Yes" vs. "No") needs to be provided. For example, if we need to decide whether an education program is effective or not, we can use SDT.
Article
Signal detection theory describes how an observer makes decisions about weak, uncertain, or ambiguous events or signals. It is widely applied in psychology, medicine, and other related fields. This book describes the theory, explains its mathematical basis, and shows how to separate the observer's sensitivity to a signal from his or her tendency to say "yes" or "no." Both detection of an event and discrimination between two events are treated. Chapters 1-4 describe the basic form of the signal-detection model and how to use it; Chapters 5-7 extend the model to different procedures such as identification of a signal; Chapters 8-10 expand it to other methods and distributions; and Chapter 11 describes the statistical treatment of detection data.
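The separation of sensitivity from response tendency mentioned above can be made concrete with the standard equal-variance Gaussian model, in which sensitivity is d' = z(H) - z(F) and the response criterion is c = -(z(H) + z(F)) / 2, where H is the hit rate, F is the false-alarm rate, and z is the inverse of the standard normal cumulative distribution function. The following is a minimal sketch under that assumption; the function name is hypothetical and it is not taken from the book summarized here.

```python
from statistics import NormalDist


def sensitivity_and_bias(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) under the equal-variance Gaussian SDT model.

    Both rates must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal distribution
    z_hit = z(hit_rate)
    z_fa = z(false_alarm_rate)
    d_prime = z_hit - z_fa               # sensitivity: separation of signal from noise
    criterion = -0.5 * (z_hit + z_fa)    # bias: 0 means no preference for "yes" or "no"
    return d_prime, criterion


# Hypothetical spam filter: flags 80% of spam (hits) and 20% of legitimate mail (false alarms).
d_prime, criterion = sensitivity_and_bias(hit_rate=0.80, false_alarm_rate=0.20)
```

In a spam-filtering reading, d' would describe how well the filter's score separates spam from legitimate mail, while the criterion describes how readily it says "spam" at a given threshold.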
Article
Identity management is traditionally seen from the service providers' point of view, meaning that it is an activity undertaken by the service provider to manage service user identities. Traditional identity management systems are designed to be cost effective and scalable primarily for the service providers, but not necessarily for the users, which often results in poor usability. Users are, for example, often required to memorise multiple passwords for accessing different services. This represents a minor inconvenience if users only access a few online services. However, with the rapid increase in the uptake of online services, the traditional approach to identity management is already having serious negative effects on the user experience. The industry has responded by proposing new identity management models to improve the user experience, but in our view these proposals give little relief to users at the cost of a relatively high increase in server system complexity. This paper takes a new look at identity management, and proposes solutions that are designed to be cost effective and scalable from the users' perspective, while at the same time being compatible with traditional identity management systems.
Article
This paper examines the problems caused by the spamming of e-mail and newsgroup users. Spamming is now considered a serious threat to the Internet and to the resources of both ISPs and users. In particular, this paper examines the motivation of spammers and the tools used to generate spam. Methods of protection and prevention are then discussed. The paper includes case studies of some spam generation and prevention tools, and also examines evolving spam-related laws.