titanic-kaggle

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history...In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

This is an introductory Kaggle challenge where the goal is to predict which passengers survived the sinking of the Titanic based on a set of attributes of the passengers, including name, gender, age, and more.

Feature engineering

After taking an initial stab at feature engineering, I took some ideas from Megan Risdal and piplimoon. One of the fun parts of this challenge was seeing all of the creative ideas that others have come up with. To summarise what I did (a rough sketch of a few of these steps follows the list):

  • Extracted a person's title (Mr, Mrs, Miss, Col, etc.) from the person's name
  • Created a family size feature by adding up the number of siblings/spouses and parents/children on board
  • Created a family variable from people's last names and their family size - since unrelated people can share a last name, last name + family size should be a good proxy for a specific family
  • Used the ticket feature (where multiple people can share a ticket) only for cases where a ticket was shared by two or more people across the training and test sets (this results in a bit of bleeding between the training and test sets)
  • Figured out which deck a person's cabin was on from the cabin feature
  • Used one-hot encoding to create dummies for categorical features
  • Used the fancyimpute package to impute missing values using MICE
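
A minimal sketch of the title / family size / deck / one-hot steps is below. It assumes the standard Kaggle Titanic column names (Name, SibSp, Parch, Cabin, Sex, Embarked) and is an illustration of the ideas, not the exact code in project.py:

    import pandas as pd

    train = pd.read_csv('train.csv')

    # Title: the text between the comma and the period in e.g. "Braund, Mr. Owen Harris"
    train['Title'] = train['Name'].str.extract(r',\s*([^\.]+)\.', expand=False).str.strip()

    # Family size: siblings/spouses plus parents/children on board
    train['FamilySize'] = train['SibSp'] + train['Parch']

    # Family proxy: last name + family size
    train['LastName'] = train['Name'].str.split(',').str[0]
    train['FamilyID'] = train['LastName'] + '_' + train['FamilySize'].astype(str)

    # Deck: the first letter of the cabin, where a cabin is recorded
    train['Deck'] = train['Cabin'].str[0]

    # One-hot encode the categorical features
    dummies = pd.get_dummies(train[['Sex', 'Embarked', 'Title', 'Deck']], dummy_na=True)

The remaining missing values (e.g. Age) would then be filled with MICE via fancyimpute before modeling.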

Modeling

I used 5-fold grid search to choose hyperparameters and do model selection. I tried logistic regression, KNN, random forest, SVM, and gradient boosted tree models. All of them performed reasonably well (accuracy in the ~ .78 - .82 range) except for KNN. My best score on the public leaderboard, ~ .825, came from a majority voting ensemble of the four reasonably well-performing models, with the random forest model given 2 of the 5 votes.
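
The grid search plus weighted-vote idea can be sketched with scikit-learn as below. The feature matrix here is a simple stand-in for the engineered features above, and the parameter grid is illustrative rather than the one actually used:

    import pandas as pd
    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  VotingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Stand-in feature matrix so the sketch runs on the raw Kaggle training file
    train = pd.read_csv('train.csv')
    X_train = pd.get_dummies(train[['Pclass', 'Sex', 'SibSp', 'Parch']])
    y_train = train['Survived']

    # 5-fold grid search over an illustrative random forest grid
    rf_search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={'n_estimators': [200, 500], 'max_depth': [4, 6, None]},
        cv=5, scoring='accuracy')
    rf_search.fit(X_train, y_train)

    # Weighted majority vote: the random forest gets 2 of the 5 votes
    ensemble = VotingClassifier(
        estimators=[('rf', rf_search.best_estimator_),
                    ('lr', LogisticRegression(max_iter=1000)),
                    ('svm', SVC()),
                    ('gbt', GradientBoostingClassifier())],
        voting='hard', weights=[2, 1, 1, 1])
    ensemble.fit(X_train, y_train)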

To run

Requires Python 2.7; tested on Ubuntu 14.04 LTS.

python project.py --name <FILE-NAME>

Arguments:

  • --name (required): name of the .csv submission file to create
  • --findhyperparameters (optional): if omitted, the script uses pre-optimized hyperparameters; if included, grid search is run to optimize the hyperparameters, which takes ~ 1 - 1.5 hours depending on your machine
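
For example, to write the predictions to a file named submission.csv (an arbitrary file name) using the pre-optimized hyperparameters:

    python project.py --name submission.csv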