A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters.

Journal of Machine Learning Research (Impact Factor: 2.47). 06/2008; 9:1115-1146. DOI: 10.1145/1390681.1390719
Source: DBLP


Statistical spam filters are known to be vulnerable to adversarial attacks. One of the more common adversarial attacks, known as the good word attack, thwarts spam filters by appending to spam messages sets of "good" words, which are words that are common in legitimate email but rare in spam. We present a counterattack strategy that attempts to differentiate spam from legitimate email in the input space by transforming each email into a bag of multiple segments, and subsequently applying multiple instance logistic regression on the bags. We treat each segment in the bag as an instance. An email is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We show that a classifier using our multiple instance counterattack strategy is more robust to good word attacks than its single instance counterpart and other single instance learners commonly used in the spam filtering domain.
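The bag-level decision rule described above (a bag is spam if at least one instance is spam, legitimate only if all instances are) can be sketched as follows. This is an illustrative simplification, not the paper's exact multiple instance logistic regression formulation: the feature vectors, weights, and max-combining step are assumptions for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_bag(segments, w, b, threshold=0.5):
    """Classify an email given as a bag of segment feature vectors.

    Each segment (instance) receives a spam probability from a logistic
    model; because a bag is spam if at least one instance is spam, the
    bag-level score is the maximum instance probability.
    """
    instance_probs = [sigmoid(np.dot(w, x) + b) for x in segments]
    return "spam" if max(instance_probs) > threshold else "legitimate"
```

Under this rule, appended good words can drag down the scores of some segments without rescuing the bag: one spam-like segment is enough to flag the whole email. The paper's MILR learner fits the model from bag-level labels rather than per-instance labels.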

    • "However, such words have been strategically injected into spam messages to avoid spam filters. To defend against this type of attack, each email is transformed into a bag of multiple segments and multiple instance logistic regression is then applied on the bags [10]. "
    ABSTRACT: Machine learning has become a prominent tool in various domains owing to its adaptability. However, this adaptability can be taken advantage of by an adversary to cause dysfunction of machine learning; a process known as Adversarial Learning. This paper investigates Adversarial Learning in the context of artificial neural networks. The aim is to test the hypothesis that an ensemble of neural networks trained on the same data manipulated by an adversary would be more robust than a single network. We investigate two attack types: targeted and random. We use Mahalanobis distance and covariance matrices to select targeted attacks. The experiments use both artificial and UCI datasets. The results demonstrate that an ensemble of neural networks trained on attacked data is more robust against the attack than a single network. While many papers have demonstrated that an ensemble of neural networks is more robust against noise than a single network, the significance of the current work lies in the fact that targeted attacks are not white noise.
    Fuzzy Systems (FUZZ), 2010 IEEE International Conference on; 08/2010
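The ensemble hypothesis in the abstract above can be illustrated with a plain majority-vote combiner. The function below is a generic sketch, not the cited paper's experimental setup: it only shows how label predictions from several networks would be aggregated so that a few members corrupted by an attack are outvoted.

```python
from collections import Counter

def majority_vote(model_predictions):
    """Combine per-sample label predictions from several models.

    model_predictions is a list of prediction lists, one list per model.
    Each sample's final label is the most common vote across models,
    which is what lets the ensemble tolerate a minority of models
    that were misled by manipulated training data.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]
```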
    • "It was concluded that the proposed approach is very efficient, almost three orders of magnitude faster than SVM, effective and more robust to concept drift and unbalanced data. Jorgensen, Zhou, and Inge (2008) formulated a multiple instance learning approach for dealing with good word attacks, characterized by the inclusion in a Spam message of words that are typical of legitimate messages. An example is represented by a set of instances, and is classified depending on the classification of each instance, with four methods for generating the instances being proposed. "
    ABSTRACT: In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a standard classification problem, we highlight the importance of considering specific characteristics of the problem, especially concept drift, in designing new filters. Two particularly important aspects not widely recognized in the literature are discussed: the difficulties in updating a classifier based on the bag-of-words representation and a major difference between two early naive Bayes models. Overall, we conclude that while important advancements have been made in the last years, several aspects remain to be explored, especially under more realistic evaluation settings.
    Expert Systems with Applications 09/2009; 36(7):10206-10222. DOI:10.1016/j.eswa.2009.02.037 · 2.24 Impact Factor
    • "Multi-instance learning techniques have already been applied to diverse applications including image categorization [17] [18], image retrieval [71] [84], text categorization [3] [59], web mining [86], spam detection [36], computer security [53], face detection [66] [76], computer-aided medical diagnosis [30], etc. "
    ABSTRACT: In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework where an example is described by multiple instances and associated with multiple class labels. Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects which have multiple semantic meanings. To learn from MIML examples, we propose the MimlBoost and MimlSvm algorithms based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings in the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we propose the D-MimlSvm algorithm which tackles MIML problems directly in a regularization framework. Moreover, we show that even when we do not have access to the real objects and thus cannot capture more information from real objects by using the MIML representation, MIML is still useful. We propose the InsDif and SubCod algorithms. InsDif works by transforming single-instances into the MIML representation for learning, while SubCod works by transforming single-label examples into the MIML representation for learning. Experiments show that in some tasks they are able to achieve better performance than learning the single-instances or single-label examples directly.
    Artificial Intelligence 01/2012; 176(1). DOI:10.1016/j.artint.2011.10.002 · 3.37 Impact Factor