GitHub - naominguyen7/Enron-Poi-Identifier

The project tries to predict which people were involved in the Enron scheme, the ‘persons of interest’. The model uses information about financial benefits and email communications to find patterns that distinguish persons of interest from everyone else. The dataset includes each individual’s financial benefits from the company (payments and stock) and their email communications. The financial information is likely to be useful for identifying the people involved, since money was the motive of the scheme. Individuals with an especially high rate of communication with persons of interest are also likely to be persons of interest themselves.

The labels are far from a 50-50 split: only 18 of the 144 people are persons of interest. There is one clear outlier with extreme financial values, which turns out to be the ‘total’ of all the data points. It is not a valid data point itself, so I removed it. I kept the individuals with extremely large financial values because they are persons of interest. In terms of features, all of them have missing data and some have a lot of it. I therefore edited feature_format.py to add an argument, replace_median, that replaces NAs with median values.
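The two cleaning steps described above can be sketched in plain Python. This is only an illustration, not the actual edit made to feature_format.py: the function name clean_dataset, the "TOTAL" key, and the "NaN" string used as the missing-value placeholder are all assumptions about how the dataset is laid out.

```python
from statistics import median


def clean_dataset(data, features, outlier_keys=("TOTAL",)):
    """Return a copy of `data` without the outlier rows and with missing
    values replaced by the per-feature median of the observed values.

    Hypothetical helper: the "TOTAL" key and the "NaN" placeholder are
    assumptions about the dataset format.
    """
    # Drop rows that are not real people (e.g. the spreadsheet total).
    cleaned = {
        name: dict(row)
        for name, row in data.items()
        if name not in outlier_keys
    }
    for feature in features:
        # Median is computed only from values that are actually present.
        observed = [
            row[feature]
            for row in cleaned.values()
            if row.get(feature, "NaN") != "NaN"
        ]
        fill = median(observed) if observed else 0
        for row in cleaned.values():
            if row.get(feature, "NaN") == "NaN":
                row[feature] = fill
    return cleaned


# Toy data standing in for the real dictionary of people.
toy = {
    "PERSON A": {"salary": 100, "bonus": "NaN"},
    "PERSON B": {"salary": "NaN", "bonus": 50},
    "PERSON C": {"salary": 300, "bonus": 70},
    "TOTAL": {"salary": 400, "bonus": 120},
}
cleaned = clean_dataset(toy, ["salary", "bonus"])
```

Removing the total row before computing the medians matters here, because otherwise the aggregate would pull the fill values upward.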

Note: Since this is a small dataset, I decided to run many grid-searched models. However, if you want to know the end of the movie, I found that the model combining principal component analysis and logistic regression works best. This is a project for my Data Science course at Udacity; the code for feature_format.py (except for a few edits by me) and tester.py was written by Udacity.
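A grid-searched pipeline of principal component analysis followed by logistic regression might look roughly like the sketch below, using scikit-learn. The data is synthetic, and the scaling step, the class_weight setting, the parameter grid, and the F1 scoring are all assumptions made for illustration; the settings actually used are in poi_id.ipynb.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the class imbalance described in this README
# (18 positives out of 144). The real features come from the Enron dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(144, 6))
y = np.zeros(144, dtype=int)
y[:18] = 1
X[y == 1] += 1.0  # shift the minority class so there is a signal to learn

pipeline = Pipeline([
    ("scale", StandardScaler()),  # assumption: features are standardized first
    ("pca", PCA()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Assumed grid; step names are joined to parameter names with "__".
param_grid = {
    "pca__n_components": [2, 4],
    "clf__C": [0.1, 1.0, 10.0],
}

# Stratified shuffled splits keep the class ratio in every train/test split.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_model = search.best_estimator_
```

Putting the scaler and PCA inside the pipeline means they are refit on each training split, so no information from the held-out split leaks into the transformation.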

poi_id.ipynb: Report with code snippets. Can be used to run the analysis and generate the results.
feature_format.py: Modified version of Udacity's file, with an added option to fill NAs with medians.
- converts data from dict form to a numpy matrix for easy computation & manipulation
- replaces missing values with 0s or with the medians of the feature
- separates the data matrix into a vector of labels and a matrix of features
my_classifier.pkl: Classifier with best performance
my_dataset.pkl
my_feature_list.pkl
tester.py: Includes a function to test the classifier's performance.
- employs stratified random shuffled splitting (since this is a small, skewed dataset)
- reports sensitivity, specificity, F1 score, and accuracy
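For reference, the four metrics listed for tester.py can be computed from the confusion-matrix counts as in the sketch below. This is not Udacity's tester.py code; the function name classification_metrics and the convention that label 1 marks a person of interest are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, F1 and accuracy for binary labels.

    Hypothetical helper; label 1 is assumed to mean 'person of interest'.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is absent or never predicted.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of actual POIs found
        "specificity": ratio(tn, tn + fp),  # share of non-POIs cleared
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Toy example: 2 POIs among 10 people, one found, one false alarm.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In the toy example the accuracy is 0.8 even though only half of the persons of interest are found, which is why accuracy alone is a weak measure on a skewed dataset like this one.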

