Use ML to Model Who Will Die in Game of Thrones

Project Status: Complete

Objective

This is a machine learning project that trains a model to classify Game of Thrones characters as dead or alive. We use a specific point in time, the end of book 5, after which the TV series and books diverge more frequently. The Game of Thrones TV series is based on George R. R. Martin’s “A Song of Ice and Fire”, a planned seven-book series.

Results

Of the 3 models (a Baseline Classifier, Logistic Regression and Random Forest), the Random Forest model was the most accurate. The Baseline model, which always predicts the most common class, had an accuracy score of 0.620, Logistic Regression 0.691 and Random Forest 0.759.
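To make the comparison concrete, here is a minimal sketch of scoring the three model types side by side. It runs on synthetic data from `make_classification` with a roughly 62/38 class split, not on the real character data, and the hyperparameters (`n_estimators=200`, `max_iter=1000`, the 25% test split) are my assumptions rather than the settings used in this project, so the printed scores will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the character data (NOT the real dataset):
# about 62% of rows fall in class 1 and 38% in class 0.
X, y = make_classification(
    n_samples=1946, n_features=10, n_informative=5,
    weights=[0.38, 0.62], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Assumed settings; the project's actual hyperparameters may differ.
models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The baseline is the number to beat: any model that scores near it has learned little beyond the class balance.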

When classes are imbalanced, it’s good practice to consider additional metrics beyond accuracy to measure classifier performance. Our classes aren’t too skewed: 62% of characters are alive, 38% dead, but let’s look into the out-of-bag (OOB) score, which scores each tree on the training rows left out of its bootstrap sample. The OOB score for the Random Forest model is 0.767, close to its test accuracy of 0.759, which suggests the model should hold up reasonably well on new unseen data.
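For reference, a minimal sketch of how an OOB score can be read from scikit-learn's `RandomForestClassifier`. The data is synthetic and the settings (`n_estimators=300`, the 25% test split) are assumptions for illustration, so the numbers will differ from the 0.767 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real character dataset).
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    weights=[0.38, 0.62], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not include that row.
forest = RandomForestClassifier(
    n_estimators=300, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = accuracy_score(y_test, forest.predict(X_test))
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

When the two numbers sit close together, the OOB estimate is doing its job as a cheap check on generalization without a separate validation set.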

Let’s look at the correlation between features in the dataset. In the correlation matrix below we see there aren't any strong positive or negative correlations between a character's survival and the features in our dataset. As a rule of thumb, correlations of 0.8 and up in absolute value are strong, and those under 0.5 are weak. The highest correlation occurs between a character’s popularity and the number of dead relatives they have. Only the numeric column correlations were calculated, so it's possible there's a correlation between survival and non-numeric columns such as house or culture.
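A small sketch of the numeric-only correlation step, plus one way the non-numeric columns could be brought in by one-hot encoding them first. The rows and the column names (`is_alive`, `popularity`, `num_dead_relatives`, `house`) are made up for illustration and are not the dataset's actual schema.

```python
import pandas as pd

# Tiny made-up frame; values and column names are hypothetical.
df = pd.DataFrame({
    "is_alive": [1, 0, 1, 1, 0, 1],
    "popularity": [0.9, 0.8, 0.1, 0.2, 0.7, 0.3],
    "num_dead_relatives": [5, 4, 0, 1, 3, 0],
    "house": ["Stark", "Targaryen", "Tyrell", "Tyrell", "Stark", "Lannister"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))

# Correlation of every numeric feature with survival.
survival_corr = corr["is_alive"].drop("is_alive")
print(survival_corr.round(2))

# One way to include a categorical column: one-hot encode it first.
house_dummies = pd.get_dummies(df["house"], prefix="house", dtype=int)
house_corr = pd.concat([df["is_alive"], house_dummies], axis=1).corr()["is_alive"]
print(house_corr.drop("is_alive").round(2))
```

Passing `corr` to `seaborn.heatmap` would render it as a matrix plot.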

Digging into the house variable, House Tyrell is the most intact, while House Targaryen has the fewest members alive. This matches up with the history of Westeros, which includes a major war between two Targaryens that resulted in the destruction of most of House Targaryen and nearly all their dragons.
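A per-house breakdown like this can be sketched with a pandas `groupby`. The rows below are invented for illustration and the column names `house` and `is_alive` are assumed, so the shares are not the real figures.

```python
import pandas as pd

# Invented rows for illustration only; not the real character data.
df = pd.DataFrame({
    "house": ["Tyrell", "Tyrell", "Tyrell",
              "Targaryen", "Targaryen", "Targaryen",
              "Stark", "Stark"],
    "is_alive": [1, 1, 1, 0, 0, 1, 1, 0],
})

# Share of members alive and head count per house, most intact first.
survival_by_house = (
    df.groupby("house")["is_alive"]
      .agg(share_alive="mean", members="count")
      .sort_values("share_alive", ascending=False)
)
print(survival_by_house)
```

Keeping the head count next to the share helps avoid over-reading houses with only a handful of members.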

Data

The raw data contains 1946 characters from the Westeros wiki http://awoiaf.westeros.org, compiled in CSV format by the A Song of Ice and Data team. Features in the data set include: “is alive”, “married”, “single”, “has dead brothers/sisters”, a “popularity” score (page-rank based on inbound links to the character’s wiki page), “house”, and “culture”.

Methods

Data Cleaning
Exploratory Data Analysis
Data Visualization
Machine Learning

I chose supervised learning models since the data had a labeled response for the outcome I wanted to predict. If the character was alive at the end of the 5th book, they were counted as “alive”. I dropped that column from my training data, and used it as an answer key to compare how well each machine learning model predicted. I started with a simple model (Logistic Regression) before trying more complex models like Naive Bayes, Decision Trees, and Random Forest. The workflow was: acquire data, look for insights and patterns with an exploratory data analysis, split data into training and test sets, apply machine learning models, score each model, and compare results across models.
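The "drop the label, keep it as an answer key, then split" step can be sketched as below. The rows, the column names (`is_alive`, `popularity`, `house`), and the split settings are hypothetical, and one-hot encoding is just one common way to handle categorical columns, not necessarily the one used here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows and column names, for illustration only.
df = pd.DataFrame({
    "is_alive": [1, 1, 0, 1, 0, 1, 0, 1],
    "popularity": [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.6, 0.4],
    "house": ["Stark", "Tyrell", "Targaryen", "Tyrell",
              "Stark", "Lannister", "Targaryen", "Lannister"],
})

# The label becomes the answer key; the features must not contain it.
y = df["is_alive"]
X = df.drop(columns=["is_alive"])

# Turn the categorical column into 0/1 indicator columns.
X = pd.get_dummies(X, columns=["house"], dtype=int)

# Hold out 25% of rows, keeping the alive/dead ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Each model is then fit on `X_train`, `y_train` and scored against `y_test`, which it never saw during training.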

Technologies

Python
Pandas
NumPy
Seaborn
Matplotlib
Sklearn - preprocessing
Sklearn.preprocessing - StandardScaler
Sklearn.metrics - accuracy_score, confusion_matrix, plot_confusion_matrix, classification_report
Sklearn.model_selection - cross_val_score, train_test_split
Sklearn.dummy - DummyClassifier
Sklearn.linear_model - LogisticRegression
Sklearn.ensemble - RandomForestClassifier

Limitations

There are several limitations to this kind of model.

Generalizability: One of the most helpful features for predicting death was the popularity score. If you wanted to use a model like this to win bets with friends about another fantasy series, it might not work so well unless that series also has an extensive online wiki dedicated to it.
Books vs TV series: The model is trained on book data, and the TV series differs from the books. The model does not account for characters who have died in the TV series, but are still alive in the book series. For example, Stannis was among those predicted next to die, but in the TV show he had already died.
Snapshot in Time: As the story progresses more characters will die. The model currently uses the snapshot of the end of the 5th book.
Resurrection: The model currently does not track how many times a character dies. Characters who come back from the dead by the end of the 5th book are counted as alive.
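The generalizability point above depends on knowing which features the model leaned on. As a rough sketch, a random forest's impurity-based importances can be ranked like this. The data is synthetic and the feature names are placeholders, so this shows the mechanics only and says nothing about the real ranking.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (NOT the real features).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    random_state=1,
)

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X, y)

# Impurity-based importances sum to 1; higher means the feature was
# used for more effective splits across the trees.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can favor high-cardinality features, so they are best treated as a first look rather than a final answer.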

Future Work

Iterate on the model: I used basic ML models here since I wanted a high level of explainability and insight into which features impacted predictions. It would be interesting to see results of more complex ML models & feature engineering.
Moving Window: The model could make predictions for different time periods of the story.
Increase Generalizability: It’d be interesting to try building a generalized plot predictor that performed well across many different fantasy series.
What about the animals? The source data didn’t include direwolves, but if you add pet data let me know what you find. :)
Push to Production: The model could be turned into an interactive app.
