Language and gender author cohort analysis of e-mail for computer forensics

Source: OAI

ABSTRACT We describe an investigation into mining e-mail text documents to attribute their authors to gender and language-background cohorts. We used an extended set of predominantly topic-independent e-mail document features, such as style markers, structural characteristics and gender-preferential language features, together with a Support Vector Machine learning algorithm. Experiments using a corpus of e-mail documents generated by a large number of authors of both genders gave promising results for both author gender and language background cohort categorisation.
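
As a rough illustration of the kind of topic-independent features the abstract mentions, the sketch below maps an e-mail body to a small feature dictionary. It is not the authors' feature set: the function name `style_features`, the function-word list, the greeting pattern and the "-ly" suffix ratio (used here as a stand-in for a gender-preferential marker) are all assumptions made for the example. The resulting values could be turned into a vector and passed to any SVM implementation.

```python
import re
import string

# Hypothetical function-word list; the paper's actual list is not given here.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "it", "for")


def style_features(email_body: str) -> dict:
    """Map an e-mail body to a few topic-independent features (illustrative only)."""
    lines = email_body.splitlines()
    words = re.findall(r"[A-Za-z']+", email_body)
    lowered = [w.lower() for w in words]
    # Guard the denominators so that empty input yields zeros, not errors.
    n_chars = max(len(email_body), 1)
    n_words = max(len(words), 1)
    n_lines = max(len(lines), 1)

    features = {
        # Style markers
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "uppercase_ratio": sum(c.isupper() for c in email_body) / n_chars,
        "punctuation_ratio": sum(c in string.punctuation for c in email_body) / n_chars,
        # Structural characteristics
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        "has_greeting": float(bool(re.match(r"\s*(hi|hello|dear)\b", email_body, re.I))),
        # Assumed stand-in for gender-preferential language: "-ly" word endings
        "ly_suffix_ratio": sum(w.endswith("ly") for w in lowered) / n_words,
    }
    # Relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = lowered.count(fw) / n_words
    return features
```

Every feature is a ratio or a flag rather than a raw count, so messages of different lengths stay comparable.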

  • ABSTRACT: This work addresses the problem of automatic annotation of clinical interview transcripts. We formulate this task as a supervised machine learning problem and propose highly scalable and efficient probabilistic classifiers based on generative latent variable models to solve it. Experimental results indicate that the proposed classifiers outperform some popular standard algorithms, such as Naive Bayes, and provide more interpretable results for clinicians and researchers.
    Knowledge Discovery and Data Mining, New York, New York; 08/2014
  • ABSTRACT: There has been tremendous growth in the information environment since the advent of the Internet and wireless networks. Just as e-mail has been the mainstay of the web in its use for personal and commercial communication, one can say that text messaging or Short Message Service (SMS) has become synonymous with communication on mobile networks. With the increased use of text messaging over the years, the amount of mobile evidence has increased as well. This has resulted in the growth of mobile forensics. A key function of digital forensics is efficient and comprehensive evidence analysis, which includes authorship attribution. Significant work on mobile forensics has focused on data acquisition from devices, and little attention has been given to the analysis of SMS. Consequently, we propose a software application called the SMS Management and Information Retrieval Kit (SMIRK). SMIRK aims to deliver a fast and efficient solution for investigators and researchers to generate reports and graphs on text messaging. It also allows investigators to analyze the authorship of SMS messages.
    Digital Forensics and Cyber Crime - First International ICST Conference, ICDF2C 2009, Albany, NY, USA, September 30-October 2, 2009, Revised Selected Papers; 01/2009
  • ABSTRACT: In this paper we report an investigation into the learning of semi-structured document categorization. We automatically discover low-level, short-range byte data structure patterns from a document data stream by extracting all byte sub-sequences within a sliding window to form an augmented (or bounded-length) string spectrum feature map and using a modified suffix trie data structure (called the coloured generalized suffix tree or CGST) to efficiently store and manipulate the feature map. Using the CGST we are able to efficiently compute the stream's bounded-length sequence spectrum kernel. We compare the performance of two classifier algorithms to categorize the data streams, namely, the SVM and Naive Bayes (NB) classifiers. Experiments have provided good classification performance results on a variety of document byte streams, particularly when using the NB classifier under certain parameter settings. Results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.
    Data Mining and Knowledge Discovery 01/2006; 13:309-334. · 2.88 Impact Factor
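
The bounded-length spectrum feature map described in the last abstract above can be sketched in a few lines. This is one reading of that description, not the paper's implementation: it assumes "bounded-length" means counting every byte substring of length 1 up to a chosen maximum, and it swaps the coloured generalized suffix tree for a plain `Counter`, which is simpler but less efficient. The names `bounded_spectrum`, `spectrum_kernel` and `max_len` are hypothetical.

```python
from collections import Counter


def bounded_spectrum(data: bytes, max_len: int) -> Counter:
    """Count every byte substring of length 1..max_len (assumed feature map)."""
    counts = Counter()
    for start in range(len(data)):
        for length in range(1, max_len + 1):
            if start + length > len(data):
                break  # the window would run past the end of the stream
            counts[data[start:start + length]] += 1
    return counts


def spectrum_kernel(a: bytes, b: bytes, max_len: int) -> int:
    """Inner product of the two substring-count maps."""
    fa = bounded_spectrum(a, max_len)
    fb = bounded_spectrum(b, max_len)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller map
    return sum(count * fb[sub] for sub, count in fa.items())
```

A fixed-length spectrum kernel would count only substrings of exactly one length; the bounded-length variant sums the contributions of all lengths up to `max_len`.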
