ycombinator/es-enron

Pre-requisite

Download dataset.tgz from here into the folder where you cloned this repository.

Preparation

The dataset.tgz file contains an archive of all Enron emails, de-duped, and parsed into JSON files. Each JSON file in the archive represents one email message.

The size of this compressed dataset is 252MB. Uncompressed into individual JSON files, the size becomes 1.3GB.

  1. Install Node.js, MySQL, and Elasticsearch. Make sure MySQL and Elasticsearch are running.

  2. Uncompress the archive.

         tar xvf dataset.tgz

  3. Load the emails into Elasticsearch.

         npm install   # if you haven't run this already
         ./load_into_es.sh

  4. Load the emails into MySQL.

         ./load_into_mysql.sh
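
For readers curious what loading JSON documents into Elasticsearch involves, here is a minimal sketch of building a request body for the Elasticsearch Bulk API, which expects newline-delimited JSON: one action line followed by one document line, with a trailing newline. This is not necessarily what `load_into_es.sh` does. The index name `enron` and the sample fields are assumptions made for illustration.

```javascript
// Build a Bulk API request body (NDJSON) that indexes each document.
// Assumption: "indexName" is whatever index you choose; this README does not name one.
function buildBulkBody(indexName, docs) {
  const lines = [];
  for (const doc of docs) {
    lines.push(JSON.stringify({ index: { _index: indexName } })); // action line
    lines.push(JSON.stringify(doc));                              // document line
  }
  // The Bulk API requires the body to end with a newline.
  return lines.join('\n') + '\n';
}

// Hypothetical documents; the real JSON files may use different field names.
const sampleDocs = [
  { subject: 'Meeting', body: 'See you at 10.' },
  { subject: 'Re: Meeting', body: 'Confirmed.' },
];
console.log(buildBulkBody('enron', sampleDocs));
```

A body like this would be sent as a POST to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header.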

Appendix

The original Enron email dataset was taken from https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz. This is an archive of all Enron emails in EML format, where each file represents one email message. Some of these messages are duplicated in multiple files.

The parse_email_files.js script will parse the original Enron email dataset into JSON files, after de-duplicating them.
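
To illustrate the de-duplication idea, here is a minimal sketch that keeps the first copy of each message and drops later copies with the same fingerprint. It is not the logic of `parse_email_files.js`, whose criterion is not described here. The fingerprint (a SHA-256 hash over sender, date, subject, and body) and the field names `from`, `date`, `subject`, and `body` are assumptions.

```javascript
const crypto = require('crypto');

// Hypothetical fingerprint: hash the fields that identify a message,
// ignoring where the copy was stored (for example, which mailbox folder).
function fingerprint(msg) {
  return crypto
    .createHash('sha256')
    .update([msg.from, msg.date, msg.subject, msg.body].join('\u0000'))
    .digest('hex');
}

// Return the messages with duplicates removed, preserving first-seen order.
function dedupe(messages) {
  const seen = new Set();
  const unique = [];
  for (const msg of messages) {
    const key = fingerprint(msg);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(msg);
    }
  }
  return unique;
}
```

Using a `Set` of fingerprints keeps memory proportional to the number of unique messages instead of their full content.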

The included dataset.tgz file is an archive of exactly these JSON files.

About

Elasticsearch demo using Enron email dataset
