ABSTRACT: This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. We observe that the significance of feature selection varies greatly from classifier to classifier. In particular, we found that support vector machines, AdaBoost, and maximum entropy models are the top performers in this evaluation and share similar characteristics: they are insensitive to the feature selection strategy, scale easily to very high feature dimensions, and perform well across different datasets. In contrast, naive Bayes, a commonly used classifier in spam filtering, is found to be sensitive to feature selection methods on small feature sets, and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters for applications where legitimate mail is assigned a cost much higher than spam (such as λ = 999), so as to maintain better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which previous studies have often ignored. Experiments show that classifiers using features from the message header alone can achieve comparable or better performance than filters using body features only. This implies that message headers can be reliable and powerfully discriminative feature sources for spam filtering.
Article · Dec 2004 · ACM Transactions on Asian Language Information Processing
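The spam filtering abstract above refers to cost-sensitive measures and to λ = 999 without naming the measure. A minimal sketch of one common choice in the spam filtering literature, the total cost ratio (TCR), is given below. It assumes that λ weights each legitimate message misclassified as spam and that the baseline is "no filter at all"; the function name, parameter names, and counts are illustrative and are not taken from the paper.

```python
def total_cost_ratio(n_legit_as_spam, n_spam_as_legit, n_spam, lam):
    """Total cost ratio: baseline cost divided by filter cost.

    Assumed definition: TCR = N_spam / (lam * n_legit_as_spam + n_spam_as_legit).
    With no filter, every spam message gets through, so the baseline cost is N_spam.
    A value above 1 means the filter beats the no-filter baseline.
    """
    filter_cost = lam * n_legit_as_spam + n_spam_as_legit
    if filter_cost == 0:
        return float("inf")  # a perfect filter incurs no cost
    return n_spam / filter_cost


# Hypothetical counts: 500 spam messages, 50 of them missed by the filter.
tcr_no_false_positive = total_cost_ratio(0, 50, 500, lam=999)
tcr_one_false_positive = total_cost_ratio(1, 50, 500, lam=999)
print(tcr_no_false_positive, tcr_one_false_positive)
```

Under these assumptions, a single blocked legitimate message at λ = 999 costs more than all 500 spam messages combined, so the filter falls below the baseline. This is one way to read the abstract's warning about settings where false positives are penalized heavily.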
ABSTRACT: We address the problem of statistical language modeling in the context of PinYin to Chinese (PTC) conversion, a problem similar to speech recognition but without the acoustic recognition step. Input phonetic syllables were first segmented and converted into a word lattice, which was then scored within a Source-Channel framework in order to find the most probable Chinese sentence. In particular, we discuss the use of a Whole Sentence Maximum Entropy (WSME) model, an expressive framework for constructing language models with diverse features. Experiments showed that a WSME model trained with d2-ngrams and word triggers achieved a 20% reduction in perplexity and an 11.05% reduction in character conversion error over a baseline trigram.