Available from: psu.edu
This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. It is observed that the significance of feature selection varies greatly from classifier to classifier. In particular, we found support vector machine, AdaBoost, and maximum entropy model are top performers in this evaluation, sharing similar characteristics: not sensitive to feature selection strategy, easily scalable to very high feature dimension, and good performances across different datasets. In contrast, naive Bayes, a commonly used classifier in spam filtering, is found to be sensitive to feature selection methods on small feature set, and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters to be used in applications where legitimate mails are assigned a cost much higher than spams (such as λ = 999), so as to maintain a better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which is often ignored in previous studies. Experiments show that classifiers using features from message header alone can achieve comparable or better performance than filters utilizing body features only. This implies that message headers can be reliable and powerfully discriminative feature sources for spam filtering.
ACM Transactions on Asian Language Information Processing 12/2004; 3:243-269. DOI:10.1145/1039621.1039625