Conference Paper

Dynamically Weighted Hidden Markov Model for Spam Deobfuscation


Abstract

Spam deobfuscation is the process of detecting obfuscated words that appear in spam emails and converting them back to the original words for correct recognition. The lexicon tree hidden Markov model (LT-HMM) was recently shown to be useful in spam deobfuscation. However, the LT-HMM suffers from a huge number of states, which is not desirable for practical applications. In this paper we present a complexity-reduced HMM, referred to as the dynamically weighted HMM (DW-HMM), where states involving the same emission probability are grouped into super-states, while the state transition probabilities of the original HMM are preserved. The DW-HMM dramatically reduces the number of states, and its state transition probabilities are determined in the decoding phase. We illustrate how we convert an LT-HMM to its associated DW-HMM. We confirm the useful behavior of the DW-HMM in the task of spam deobfuscation, showing that it significantly reduces the number of states while maintaining high accuracy.
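As a rough illustration of the state-grouping idea in this abstract, the sketch below collapses HMM states that share an emission distribution into a single super-state while keeping every original per-edge transition probability available for lookup at decoding time. It is one reading of the abstract, not the authors' implementation; all names (build_super_states, dynamic_weights) are hypothetical.

from collections import defaultdict

def build_super_states(states, emission_of, transitions):
    """states: iterable of state ids; emission_of: state id -> hashable key
    identifying its emission distribution; transitions: dict mapping
    (src, dst) -> probability in the original HMM."""
    groups = defaultdict(list)
    for s in states:
        groups[emission_of(s)].append(s)   # identical emissions collapse together
    super_of = {s: key for key, members in groups.items() for s in members}
    # Original transition probabilities are preserved; a decoder would look
    # them up per underlying edge while it runs ("dynamically weighted").
    dynamic_weights = defaultdict(dict)
    for (src, dst), p in transitions.items():
        dynamic_weights[(super_of[src], super_of[dst])][(src, dst)] = p
    return super_of, dynamic_weights

# Toy usage: q1 and q2 share an emission distribution, so they merge.
states = ["q1", "q2", "q3"]
emission_of = {"q1": "E_a", "q2": "E_a", "q3": "E_b"}.__getitem__
transitions = {("q1", "q3"): 0.4, ("q2", "q3"): 0.6}
super_of, weights = build_super_states(states, emission_of, transitions)
print(super_of)        # two super-states instead of three states
print(dict(weights))   # per-edge probabilities kept for the decoding phase

The point of the construction is that the decoder only has to enumerate super-states, recovering the exact weight of the underlying edge on demand.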


... It is then explained how a hidden Markov model can be used for spam deobfuscation, as proposed in [LN05]. Finally, the changes to the model proposed in Dynamically Weighted Hidden Markov Model for Spam Deobfuscation [LJC07] are mentioned, and their results are presented. ...
... I will now describe how I have adapted the beam search algorithm to the tree structure of my model. In both [LN05] and [LJC07], beam search is used to speed up the Viterbi decoding. This is done by setting a threshold for what is worth computing further at each step of the algorithm. ...
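The thresholding described in this excerpt can be sketched as a beam-pruned Viterbi step: at each time step, hypotheses whose log score falls more than a fixed beam width below the current best are dropped before being extended. A minimal sketch with toy numbers follows; the fixed additive beam and all names are my assumptions, not the thesis' exact pruning rule.

import math

def viterbi_beam(obs, states, log_start, log_trans, log_emit, beam=10.0):
    # frontier: state -> best log-probability of any surviving path ending there
    frontier = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        best = max(frontier.values())
        survivors = {s: lp for s, lp in frontier.items() if lp >= best - beam}
        frontier = {}
        for s2 in states:
            cands = [lp + log_trans[s1][s2] for s1, lp in survivors.items()]
            if cands:
                frontier[s2] = max(cands) + log_emit[s2][o]
    # Returns the best final state and its score (path omitted for brevity).
    return max(frontier, key=frontier.get), max(frontier.values())

LS = {"A": math.log(0.5), "B": math.log(0.5)}
LT = {"A": {"A": math.log(0.9), "B": math.log(0.1)},
      "B": {"A": math.log(0.1), "B": math.log(0.9)}}
LE = {"A": {"x": math.log(0.8), "y": math.log(0.2)},
      "B": {"x": math.log(0.2), "y": math.log(0.8)}}
print(viterbi_beam("xxyy", ["A", "B"], LS, LT, LE, beam=5.0))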
... To test the ideas described and compare with the results from [LN05] and [LJC07], I have implemented my own weighted model. In this chapter I will briefly go through some of the details of how the model is implemented. ...
Article
Abstract
Spammers try to circumvent spam filters by obfuscating offensive words. A common way to do this is to misspell the words on purpose. For example, viagra may be written as v!agr@, V-I-A-G-R-A or \/ |_A GR 2A. The task of deobfuscating such obfuscations is very important for text-based spam filters in order to filter e-mails correctly. The Lexicon Tree Hidden Markov Model (LT-HMM) has been proposed as a deobfuscation method. However, the LT-HMM runs too slowly for any practical use, due to a huge number of states. This master's thesis explores the possibilities of making the LT-HMM more useful by changing the model to be more focused on spam filtering in practice. Furthermore, a modified version of the Viterbi algorithm, which takes advantage of the tree structure of the LT-HMM, is introduced to make the deobfuscation faster. All changes have been tested in a practical setup with a big spam corpus, where the model works as a preprocessor for a real spam filter. The results show that my ideas can improve both the efficiency and the speed of the model compared to the LT-HMM. But my results also indicate that a hidden Markov model might be too complex a model for the problem. The tests show that a much simpler model can find almost as many obfuscations as the full hidden Markov model.
... Lee and Ng [74] used a method based on the Markov model for deobfuscating words in spam, which they call the Lexicon Tree Hidden Markov Model (LT-HMM) [73]. The approach integrates a dictionary (lexicon) and context information into the Markov model. ...
... The reduction in their effectiveness occurs mainly in traditional classifiers (e.g. Bayesian classifiers and neural networks), but there are works that seek to improve these techniques against this specific type of spam [67,68,69,73,74,75]. ...
Chapter
Full-text available
E-mail, one of the oldest and most widely used services on the Internet, is the tool most often used to send an indiscriminate number of unsolicited messages, known as spam. Given the wide variety of techniques used for sending spam, this type of e-mail is a problem still far from being solved. This work aims to present works and techniques related to spam detection from a new perspective. Instead of classifying the works by type of detection technique, as is usually done in the literature, each one will be organized using the technique applied in spam dissemination as the entry key. The detection techniques for each case are then addressed and considerations made about their efficiency.
... This technique chooses the hidden sequence of states (in this case insertion, substitution, deletion) that most probably explains the observed sequence of symbols of the spam term variation. The main disadvantage of these methods is the computational time required to train and use the models, despite recent optimisations alleviating this issue [12]. In this paper we take a different viewpoint and regard the masking trick as a word-to-word matching application; that is, the aim is to determine whether the disguised and the genuine spam-trigger vocables coincide. ...
... Table 3. Resulting distance matrix D for the input strings a := "viagra" and b := "v.1.@.g.r.@" (left column and top row, respectively), computed with Algorithm 1 coupled with Equation (1). The final dissimilarity score D(7,12) = 0 indicates a perfect match. ...
Article
Full-text available
Unsolicited bulk email (spam) nowadays accounts for nearly 75% of daily email traffic, a figure that speaks strongly for the need to find better protection mechanisms against its dissemination. A clever trick recently exploited by email spammers in order to circumvent textual-based filters involves obfuscating black-listed words with visually equivalent text substitutions built from non-alphabetic symbols, in such a way that the semantics of the original word are still conveyed to the human eye (e.g. masking viagra as v1@gr@ or as v-i-a-g-r-a). In this paper we discuss how a simple yet effective adaptation of a classical string-matching algorithm can meet this challenge and effectively reveal the similarity between genuine spam-trigger terms and their disguised alpha-numeric variants.
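A minimal sketch of this kind of adaptation, under my own assumptions: a Levenshtein-style distance in which visually equivalent symbols substitute for free and separator characters insert for free, so that the viagra / v.1.@.g.r.@ pair from the Table 3 snippet above scores 0. The equivalence table and separator set are illustrative choices, not the paper's exact penalty function.

VISUAL_EQ = {"a": set("@4"), "i": set("1!|"), "e": set("3"),
             "o": set("0"), "s": set("$5"), "g": set("9")}
SEPARATORS = set(".-_*| ")

def sub_cost(x, y):
    # Substituting a visually equivalent symbol for a letter is free.
    return 0 if y == x or y in VISUAL_EQ.get(x, set()) else 1

def ins_cost(y):
    # Inserting a separator symbol into the obfuscated string is free.
    return 0 if y in SEPARATORS else 1

def obfuscation_distance(a, b):
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                                  # dropping a real letter costs 1
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(b[j - 1])   # leading separators are free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                          # delete a[i-1]
                          D[i][j - 1] + ins_cost(b[j - 1]),         # insert b[j-1]
                          D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return D[n][m]

print(obfuscation_distance("viagra", "v.1.@.g.r.@"))  # 0: a perfect match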
... Spammers have employed many obfuscation methods in an attempt to defeat spam filter tokenization [75,81,131] which in turn have given rise to deobfuscation techniques that aim to recover the original tokens [104,105]. ...
Article
Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective; more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be
Chapter
The increasing volume of unsolicited bulk e-mail leads to the need for reliable stochastic spam detection methods for classifying the received sequence of e-mails. When a sequence of e-mails is received by a recipient during a time period, the spam filters have already classified them as spam or not spam. Due to the dynamic nature of spam, there might be e-mails marked as not spam that are actually real spam, and vice versa. For the sake of security, it is important to be able to detect real spam e-mails. This paper utilizes stochastic methods to refine the preliminary spam detection and to find the maximum-likelihood classification of spam e-mails. The method is based on Bayes' theorem, the hidden Markov model (HMM), and the Viterbi algorithm.
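A toy version of the pipeline this abstract describes, with made-up numbers: the hidden state is the true class of each e-mail, the observation is the filter's preliminary label, and the Viterbi algorithm recovers the most likely true sequence. The probabilities below are illustrative assumptions, not the chapter's parameters.

import math

STATES = ("spam", "ham")
START  = {"spam": 0.5, "ham": 0.5}
TRANS  = {"spam": {"spam": 0.7, "ham": 0.3},   # spam tends to arrive in bursts
          "ham":  {"spam": 0.3, "ham": 0.7}}
EMIT   = {"spam": {"spam": 0.8, "ham": 0.2},   # the filter's label is usually right
          "ham":  {"spam": 0.1, "ham": 0.9}}

def viterbi(labels):
    V = [{s: math.log(START[s]) + math.log(EMIT[s][labels[0]]) for s in STATES}]
    back = []
    for y in labels[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev, lp = max(((p, V[-1][p] + math.log(TRANS[p][s])) for p in STATES),
                           key=lambda t: t[1])
            col[s], ptr[s] = lp + math.log(EMIT[s][y]), prev
        V.append(col); back.append(ptr)
    s = max(V[-1], key=V[-1].get)
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]; path.append(s)
    return path[::-1]

# A lone "ham" label inside a burst of "spam" labels is re-read as spam:
print(viterbi(["spam", "spam", "ham", "spam", "spam"]))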
Conference Paper
Grid is an emerging technology that aims at utilizing resources efficiently and effectively. The proposed robust three-tier architecture ensures trustworthy resource selection in a grid environment. This architecture is secured by using spam-filtering mechanisms and by incorporating the spamicity of a node in the computation of its trust value. For this purpose, a network of Anomaly Detection Agents (ADA) is set up at the Regional Resource Administrator (RRA) level, which enables collaborative spam detection and filtering. The proposed solution detects and filters spam using layered filters consisting of statistical content-based and origin-based spam filters.
Article
Obscenity (the use of rude words or offensive expressions) has spread from informal verbal conversations to digital media, becoming increasingly common in user-generated comments found in Web forums, newspaper user boards, social networks, blogs, and media-sharing sites. The basic obscenity-blocking mechanism is based on verbatim comparisons against a blacklist of banned vocabulary; however, creative users circumvent these filters by obfuscating obscenity with symbol substitutions or bogus segmentations that still visually preserve the original semantics, such as writing shit as $h¡;t or s.h.i.t or, even worse, mixing them as $.h….¡.t. The number of potential obfuscated variants is combinatorial, rendering the verbatim filter impractical. Here we describe a method intended to obstruct this anomaly, inspired by sequence alignment algorithms used in genomics, coupled with a tailor-made edit penalty function. The method only requires setting up the vocabulary of plain obscenities; no further training is needed. Its complexity for screening a single obscenity is linear, both in runtime and memory, in the length of the user-generated text. We validated the method on three different experiments. The first one involves a new dataset that is also introduced in this article; it consists of a set of manually annotated real-life comments in Spanish, gathered from the news user boards of an online newspaper, containing this type of obfuscation. The second one is a publicly available dataset of comments in Portuguese from a sports Web site. In these experiments, at the obscenity level, we observed recall rates greater than 90%, whereas precision rates varied between 75% and 95%, depending on sequence length (shorter lengths yielded a higher number of false alarms). On the other hand, at the comment level, we report recall of 86%, precision of 91%, and specificity of 98%. The last experiment revealed that the method is more effective in matching this type of obfuscation than the classical Levenshtein edit distance. We conclude by discussing the prospects of the method for helping to enforce moderation rules on obscenity expressions, or as a preprocessing mechanism for sequence cleaning and/or feature extraction in more sophisticated text categorization techniques.
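The screening idea can be sketched as a semi-global alignment, one standard way to realize "alignment with free ends" from genomics: the blacklisted word may match anywhere in the comment, with look-alike substitutions and separator insertions relaxed. The penalty function below is my own simplification, not the paper's tailor-made one; for a fixed word the loops are O(len(text)) in time and, with two rolling rows, in memory, matching the linearity claim.

EQUIV = {"s": set("$5"), "h": set("#"), "i": set("¡1!"), "t": set("+7")}
NOISE = set(".…$*_ ")

def screen(word, text, max_cost=1):
    m = len(word)
    prev = [0] * (len(text) + 1)           # row 0 is free: match can start anywhere
    for i in range(1, m + 1):
        cur = [i]                          # skipping word characters costs 1 each
        for j in range(1, len(text) + 1):
            w, t = word[i - 1], text[j - 1]
            sub = 0 if t == w or t in EQUIV.get(w, set()) else 1
            ins = 0 if t in NOISE else 1   # separator noise inserts for free
            cur.append(min(prev[j] + 1, cur[j - 1] + ins, prev[j - 1] + sub))
        prev = cur
    return min(prev) <= max_cost           # free end: match may stop anywhere

print(screen("shit", "that is real $.h….¡.t honestly"))  # True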
Article
Grid aims at utilising resources efficiently through coordinated resource sharing and problem solving in a virtual organisation spanning geographical boundaries. A three-tier grid architecture, consisting of Service Providers or Resource Providers (SP/RP), brokers, and Regional Resource Administrators (RRA), is secured using a DDoS attack defense mechanism and spam-filtering mechanisms that endeavour to optimise resource utilisation. An overlay network of Anomaly Detection Agents (ADA) is set up at the RRA level, which enables deployment of collaborative defense mechanisms. The presence of these mechanisms minimises the wastage of resources, which can instead be put to effective use in grid applications.
Conference Paper
A new system for spam e-mail annotation by end-users is presented. It is based on the recursive application of hand-written annotation rules by means of an inferential engine based on Logic Programming. Annotation rules allow the user to express nuanced considerations that depend on deobfuscation, word (non-)occurrence, and the structure of the message in a straightforward, human-readable syntax. We show that a sample collection of annotation rules is effective on a relevant corpus that we assembled by collecting emails that escaped detection by the industry-standard SpamAssassin filter. The system presented here is intended as a personal tool enforcing personalized annotation rules that would not be suitable for the general e-mail traffic.
Article
This thesis is devoted to the problem of evaluating antivirus products. The end user of an antivirus product does not always know how much trust to place in the product to adequately counter the viral threat. To answer this question, it is necessary to formulate the security problem that such a product must address, and to have methodological, technical, and theoretical tools and criteria for evaluating the robustness of its security functions and the relevance of its design choices with respect to an identified viral threat. I focus my analysis of the viral threat on a few mechanisms (cryptographic mechanisms, program transformations) adapted to the white-box context, i.e. allowing a virus to protect the confidentiality of its critical data and to mask its operation in an environment completely controlled by the attacker. An unavoidable first step before organizing any line of defense is to analyze viruses in order to understand how they work and what effects they have on the system. The methods and techniques of software reverse engineering occupy a central place here, alongside cryptanalysis techniques. I have chosen to focus on dynamic information-extraction methods based on emulation of the hardware supporting the operating system. Evaluating a detection engine according to objective criteria requires a modeling effort. I study some models currently in use (formal grammars, abstract interpretation, statistics). Each of these models makes it possible to formalize certain aspects of the viral detection problem. I examine the criteria that can be defined in each of these formal frameworks and their use in parts of the evaluation of a detection engine. My work has led me to implement a methodological approach and a test platform for analyzing the robustness of the functions and mechanisms of an antivirus product. I developed a viral code analysis tool, called VxStripper, whose features are useful for carrying out several of the tests. The formal tools are used to qualify, in a white-box setting or on the basis of detailed design documents, the conformity and effectiveness of a detection engine.
Article
Full-text available
In this paper we introduce a new search algorithm that provides a simple, clean, and efficient interface between the speech and natural language components of a spoken language system. The N-Best algorithm is a time-synchronous Viterbi-style beam search algorithm that can be made to find the most likely N whole-sentence alternatives that are within a given "beam" of the most likely sentence. The algorithm can be shown to be exact under some reasonable constraints; that is, it guarantees that the answers it finds are, in fact, the most likely sentence hypotheses. The computation is linear with the length of the utterance, and faster than linear in N. When used together with a first-order statistical grammar, the correct sentence is usually within the first few sentence choices. The output of the algorithm, which is an ordered set of sentence hypotheses with acoustic and language model scores, can easily be processed by natural language knowledge sources. Thus, this method of integrating speech recognition and natural language avoids the huge expansion of the search space that would be needed to include all possible knowledge sources in a top-down search. The algorithm has also been used to generate alternative sentence hypotheses for discriminative training. Finally, the alternative sentences generated are useful for testing overgeneration of syntax and semantics.
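A toy rendition of the N-best idea (not BBN's exact time-synchronous algorithm): each state keeps up to N partial hypotheses per time step instead of the single best, so N whole-sequence alternatives survive to the end. Names and probabilities are illustrative.

import heapq, math

def nbest_viterbi(obs, states, log_start, log_trans, log_emit, N=3):
    # hyps[s] is a list of (log_prob, path) pairs, at most N per state
    hyps = {s: [(log_start[s] + log_emit[s][obs[0]], [s])] for s in states}
    for o in obs[1:]:
        nxt = {}
        for s2 in states:
            cands = [(lp + log_trans[s1][s2] + log_emit[s2][o], path + [s2])
                     for s1, pool in hyps.items() for lp, path in pool]
            nxt[s2] = heapq.nlargest(N, cands, key=lambda c: c[0])
        hyps = nxt
    finals = [c for pool in hyps.values() for c in pool]
    return heapq.nlargest(N, finals, key=lambda c: c[0])

LS = {"A": math.log(0.6), "B": math.log(0.4)}
LT = {"A": {"A": math.log(0.7), "B": math.log(0.3)},
      "B": {"A": math.log(0.4), "B": math.log(0.6)}}
LE = {"A": {"u": math.log(0.5), "v": math.log(0.5)},
      "B": {"u": math.log(0.1), "v": math.log(0.9)}}
for lp, path in nbest_viterbi("uvv", ["A", "B"], LS, LT, LE, N=3):
    print(round(lp, 2), path)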
Article
Full-text available
The authors describe a technique called progressive search which is useful for developing and implementing speech recognition systems with high computational requirements. The scheme iteratively uses more and more complex recognition schemes, where each iteration constrains the search space of the next. An algorithm called the forward-backward word-life algorithm is described. It can generate a word lattice in a progressive search that can then be used as a language model embedded in a succeeding recognition pass to reduce computational requirements. It is shown that speed-ups of more than an order of magnitude are achievable with only minor costs in accuracy.
Article
Full-text available
Considerable progress has been made in handwriting recognition technology over the last few years. Thus far, handwriting recognition systems have been limited to small and medium vocabulary applications, since most of them often rely on a lexicon during the recognition process. The capability of dealing with large lexicons, however, opens up many more applications. This article will discuss the methods and principles that have been proposed to handle large vocabularies and identify the key issues affecting their future deployment. To illustrate some of the points raised, a large vocabulary off-line handwritten word recognition system will be described.
Conference Paper
Full-text available
To circumvent spam filters, many spammers attempt to obfuscate their emails by deliberately misspelling words or introducing other errors into the text. For example viagra may be written vigra, or mortgage written m0rt gage. Even though humans have little difficulty reading obfuscated emails, most content-based filters are unable to recognize these obfuscated spam words. In this paper, we present a hidden Markov model for deobfuscating spam emails. We empirically demonstrate that our model is robust to many types of obfuscation including misspellings, incorrect segmentations (adding/removing spaces), and substitutions/insertions of non-alphabetic characters.
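One way to picture the substitution part of such a model is an emission table that spreads probability over the intended letter, its common look-alikes, and a small residue for arbitrary typos. The numbers and table below are invented for illustration, and the sketch deliberately ignores the insertion/deletion and segmentation machinery the paper also models.

import math

LOOKALIKES = {"a": "@4", "e": "3", "i": "1!|", "o": "0", "s": "$5", "t": "+7"}

def emission_prob(intended, observed,
                  p_exact=0.90, p_lookalike=0.08, p_other=0.02):
    """P(observed char | intended letter) under the toy assumptions above."""
    subs = LOOKALIKES.get(intended, "")
    if observed == intended:
        return p_exact
    if observed in subs:
        return p_lookalike / len(subs)   # share mass among the look-alikes
    return p_other / 30                  # residue spread over ~30 other symbols

# Scoring "v1agra" as an obfuscation of "viagra", character by character:
word, seen = "viagra", "v1agra"
score = sum(math.log(emission_prob(w, s)) for w, s in zip(word, seen))
print(round(score, 2))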
Article
Full-text available
The authors describe a method for the recognition of cursively handwritten words using hidden Markov models (HMMs). The modelling methodology used has previously been successfully applied to the recognition of both degraded machine-printed text and hand-printed numerals. A novel lexicon-driven level building (LDLB) algorithm is proposed, which incorporates a lexicon directly within the search procedure and maintains a list of plausible match sequences at each stage of the search, rather than decoding using only the most likely state sequence. A word recognition rate of 93.4% is achieved using a 713-word lexicon, compared to just 49.8% when the same lexicon is used to post-process the results produced by a standard level building algorithm. Various procedures are described for the normalisation of cursive script. Results are presented on a single-author database of scanned text. It is shown how very high reliability, up to near-perfect recognition, can be achieved by using a threshold to reject those word hypotheses to which the system assigns a low confidence. At 19% rejection, 99.2% of accepted words appeared in the top two choices produced by the system, and 100% of the 1645 accepted words were correctly recognised within the top eight choices.
Article
Full-text available
Most on-line cursive handwriting recognition systems use a lexical constraint to help improve recognition performance. Traditionally, the vocabulary lexicon is stored in a trie (an automaton whose underlying graph is a tree). In this paper, we propose a solution based on a more compact data structure, the directed acyclic word graph (DAWG). We show that our solution is equivalent to the traditional system. Moreover, we propose a number of heuristics to reduce the size of the DAWG and present experimental results demonstrating a significant improvement.
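For concreteness, here is the trie baseline the paper starts from, as a minimal sketch; the DAWG improvement would additionally merge equivalent suffix subtrees (e.g. the shared -ation(s) endings below), which this sketch deliberately does not attempt.

class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

lexicon = ["nation", "nations", "station", "stations"]
trie = build_trie(lexicon)
print(count_nodes(trie))  # 16 nodes; a DAWG would share the repeated suffixes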
Article
This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
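The tutorial's coin-tossing illustration translates naturally into a few lines of code. The sketch below uses invented numbers for a fair and a biased coin and computes the observation likelihood with the forward algorithm, which addresses the first of the three fundamental problems the tutorial names.

STATES = ("fair", "biased")
START  = {"fair": 0.5, "biased": 0.5}
TRANS  = {"fair":   {"fair": 0.8, "biased": 0.2},
          "biased": {"fair": 0.2, "biased": 0.8}}
EMIT   = {"fair":   {"H": 0.5, "T": 0.5},
          "biased": {"H": 0.9, "T": 0.1}}

def forward(obs):
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s2: sum(alpha[s] * TRANS[s][s2] for s in STATES) * EMIT[s2][o]
                 for s2 in STATES}
    return sum(alpha.values())   # P(observations | model)

print(forward("HHHHT"))  # long head runs are better explained by the biased coin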