
E-Mail Author Identification

SMAI@IIIT-H (Monsoon 2017)

Team 15

Course Instructor

Project Mentor

Table of Contents

Overview

Classify emails from the Enron email dataset by their authorship, and use the trained classifier to identify the authors of unseen test samples.

Method

Enron Email Dataset

Available here, the dataset contains 0.5 million emails from about 150 users, who were employees of Enron.

The classifiers use the authors as classes and the emails as samples to be assigned to those classes by authorship.

Data Preparation

The number of author classes was chosen to maximise the number of emails per author while keeping the number of emails roughly balanced across the author classes.

This number was found to be 10 authors with 800-1000 emails each.
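
A minimal sketch of this selection step, assuming the corpus is available as (author, email) pairs; the helper below is illustrative and not taken from the repository:

    from collections import Counter

    def select_top_authors(pairs, n_authors=10):
        # pairs: list of (author, email_text) tuples
        counts = Counter(author for author, _ in pairs)
        # keep the most prolific authors so every class has a comparable
        # number of samples (roughly 800-1000 emails each here)
        keep = set(author for author, _ in counts.most_common(n_authors))
        return [(author, email) for author, email in pairs if author in keep]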

Cleaning

The Enron corpus contains all emails in raw form, including not only the message but also all the email metadata.

The data is cleaned to keep only the subject and body of each mail; attached forward chains, forwarded threads, and salutations are all removed.

The data is also tokenised by word, sentence and paragraph, and is case normalised.
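
A minimal sketch of the tokenisation and case-normalisation step using nltk (the removal of headers and forwarded threads is corpus-specific and omitted here; the function name is illustrative, not the repository's):

    from nltk.tokenize import sent_tokenize, word_tokenize

    def tokenise_email(body):
        # split on blank lines into paragraphs, then into sentences and words,
        # lower-casing everything for case normalisation
        paragraphs = [p for p in body.split('\n\n') if p.strip()]
        return [[word_tokenize(sent.lower()) for sent in sent_tokenize(para)]
                for para in paragraphs]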

Models

The following models have been implemented and tested:

CNN implementation

The CNN can identify groups of words and phrases commonly used by an author. It also captures localised chunks of information, which is useful for finding phrasal units within long texts. There are three layers to the CNN (a sketch follows the list):

  • First, the embedding layer generates a sequence of word-embeddings from a sequence of words
  • Second, the conv layer performs the convolution operation using 128 5x5 filters
  • Third, the dense layer is used for classification
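
A minimal Keras 2-style sketch of this architecture; the vocabulary size, embedding dimension and sequence length are assumptions, and the repository's CNN.py may differ in detail:

    from keras.models import Sequential
    from keras.layers import Embedding, Reshape, Conv2D, GlobalMaxPooling2D, Dense

    VOCAB_SIZE, EMBED_DIM, MAX_WORDS, NUM_AUTHORS = 20000, 100, 500, 10  # assumed sizes

    model = Sequential([
        # word indices -> sequence of word embeddings
        Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_WORDS),
        # add a channel axis so the 5x5 filters slide over words and embedding dims
        Reshape((MAX_WORDS, EMBED_DIM, 1)),
        Conv2D(128, (5, 5), activation='relu'),
        GlobalMaxPooling2D(),
        # dense layer performs the classification over the author classes
        Dense(NUM_AUTHORS, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])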

Bi-LSTM implementation

The Bi-LSTM is a commonly used technique for text classification.

LSTMs are a special kind of RNN which are more capable of remembering long term dependencies in a sequence. This gives more context to the classifier which helps in author identification while processing a sequence of text.

There are three layers to the model (a sketch follows the list):

  • First, the embedding layer generates a sequence of word-embeddings from a sequence of words
  • Second, the bidirectional LSTM generates email embeddings from the sequence of word embeddings
  • Third, the dense layer performs the classification
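
A minimal Keras 2-style sketch of this model, again with assumed sizes:

    from keras.models import Sequential
    from keras.layers import Embedding, Bidirectional, LSTM, Dense

    VOCAB_SIZE, EMBED_DIM, MAX_WORDS, NUM_AUTHORS = 20000, 100, 500, 10  # assumed sizes

    model = Sequential([
        # word indices -> sequence of word embeddings
        Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_WORDS),
        # the bidirectional LSTM reads the sequence in both directions
        # and produces a single email embedding
        Bidirectional(LSTM(64)),
        # dense layer performs the classification
        Dense(NUM_AUTHORS, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])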

Hierarchical Bi-LSTM implementation

LSTMs are known to work best on sequences of roughly 10-15 elements. In this implementation, however, the model can take in the entire document, increasing the usable sequence length and hence the overall context available for classification.

There are four layers to this model (a sketch follows the list):

  • First, the embedding layer generates a sequence of word-embeddings from a sequence of words
  • Second, the first bidirectional LSTM generates sentence embeddings from the sequence of word embeddings
  • Third, the second bidirectional LSTM generates email embeddings from the sequence of sentence embeddings
  • Fourth, the dense layer performs the classification
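
A minimal Keras 2-style sketch of the hierarchy, with the document shape and layer sizes as assumptions (the repository's HierLSTM.py may differ):

    from keras.models import Model
    from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, TimeDistributed

    MAX_SENTS, MAX_WORDS = 30, 50                        # assumed document shape
    VOCAB_SIZE, EMBED_DIM, NUM_AUTHORS = 20000, 100, 10  # assumed sizes

    # sentence-level encoder: a sequence of words -> one sentence embedding
    sent_in = Input(shape=(MAX_WORDS,), dtype='int32')
    sent_vec = Bidirectional(LSTM(64))(Embedding(VOCAB_SIZE, EMBED_DIM)(sent_in))
    sentence_encoder = Model(sent_in, sent_vec)

    # document-level encoder: a sequence of sentences -> one email embedding
    doc_in = Input(shape=(MAX_SENTS, MAX_WORDS), dtype='int32')
    sent_seq = TimeDistributed(sentence_encoder)(doc_in)
    doc_vec = Bidirectional(LSTM(64))(sent_seq)
    document_encoder = Model(doc_in, doc_vec)

    # dense layer performs the classification over the author classes
    out = Dense(NUM_AUTHORS, activation='softmax')(doc_vec)
    model = Model(doc_in, out)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])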

Augmented Hierarchical Bi-LSTM implementation

This model appends stylometric features to the final document embedding of the hierarchical Bi-LSTM, just before it is passed on to the dense layer. The classification is then performed on these augmented document embeddings.
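
A minimal sketch of the augmentation, continuing the hierarchical sketch above; the number of stylometric features is an assumption:

    from keras.models import Model
    from keras.layers import Input, Dense, Concatenate

    N_STYLO = 8  # stylometric features per email (assumed)

    # document_encoder is the hierarchical Bi-LSTM built in the previous sketch
    doc_in = Input(shape=(MAX_SENTS, MAX_WORDS), dtype='int32')
    stylo_in = Input(shape=(N_STYLO,))

    doc_vec = document_encoder(doc_in)
    augmented = Concatenate()([doc_vec, stylo_in])   # append stylometric features
    out = Dense(NUM_AUTHORS, activation='softmax')(augmented)

    model = Model([doc_in, stylo_in], out)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])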

Stylometry

The stylometric features extracted from the data and experimented with are the following (a sketch of the lexical features follows the list):

  • Lexical
    1. Average sentence length
    2. Average word length
    3. Total number of words
    4. Ratio of unique words to total number of words
    5. Total number of characters
  • Syntactic
    1. Total number of function words
    2. Total number of personal pronouns
    3. Total number of adjectives

Dependencies

  1. python2
  2. numpy
  3. cPickle
  4. keras
  5. tensorflow
  6. nltk
  7. MySQL and mysqldb

Project Structure

root/

    | data_preprocessing_scripts/
        - dataProcessing.py

    | extracted_features/
        - adjperemail.txt
        - avgsentlenperemail.txt
        - avgwordlenperemail.txt
        - charsperemail.txt
        - funcwordsperemail.txt
        - perpronperemail.txt
        - stylometricVector.txt
        - uniqbytotperemail.txt
        - wordsperemail.txt

    | feature_extraction_scripts/
        - adjperemail.py
        - avgsentlenperemail.py
        - avgwordlenperemail.py
        - charsperemail.py
        - funcwordsperemail.py
        - perpronperemail.py
        - stylometricVector.py
        - uniqbytotperemail.py
        - wordsperemail.py

    | models/
        - CNN.py
        - HierLSTM_withStylometry.py
        - HierLSTM.py
        - LSTM_final.py

    - README.md

References
