bakerwho/pca_pycon_talk

This repository contains resources used for my talk on Dimensionality Reduction.

We use the Netflix Prize dataset to illustrate how PCA explains variance in data.

The dataset has become difficult to find, though it is available on Kaggle and through various other sources. I used the Internet Archive link below and recommend you do the same:

https://archive.org/download/nf_prize_dataset.tar

Download this .tar to your local copy of this repository and extract its contents.

This should create an nf_prize_dataset subdirectory in your repository. You will also have to extract training_set.tar within this subdirectory.
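The two extraction steps above can also be scripted. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the downloaded archive is saved as nf_prize_dataset.tar in the repository root (pass a different `archive_name` if your download is named otherwise).

```python
import tarfile
from pathlib import Path


def extract_tar(archive: Path, dest: Path) -> None:
    """Extract every member of `archive` into `dest`, creating `dest` if needed."""
    dest.mkdir(parents=True, exist_ok=True)
    # tarfile.open() auto-detects compression, so .tar and .tar.gz both work.
    with tarfile.open(archive) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members (absolute paths, etc.).
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)


def extract_netflix_dataset(
    repo_root: Path, archive_name: str = "nf_prize_dataset.tar"
) -> Path:
    """Extract the outer archive, then the nested training_set.tar.

    Assumes the outer archive unpacks into nf_prize_dataset/ and that this
    directory contains training_set.tar, as described in the README.
    """
    extract_tar(repo_root / archive_name, repo_root)
    dataset_dir = repo_root / "nf_prize_dataset"
    extract_tar(dataset_dir / "training_set.tar", dataset_dir)
    return dataset_dir
```

Calling `extract_netflix_dataset(Path("."))` from the repository root should leave you with the nf_prize_dataset subdirectory and its extracted training set.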

You can then run netflix_sparse_matrix_prep.py to prepare a sparse matrix in which rows represent users and columns represent movies. The entry in the i-th row and j-th column is the rating (1-5) given by the i-th user to the j-th movie. A 0 indicates that the user did not rate that movie (the dataset records ratings, not viewings). The script stores this sparse matrix as data.npz in the nf_prize_dataset folder you should have already extracted.
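To make the matrix layout concrete, here is a rough sketch of how such a user-by-movie matrix could be assembled. It is not the code in netflix_sparse_matrix_prep.py. It assumes each training file starts with a line like `1:` (the movie ID) followed by `customer_id,rating,date` lines, and it assigns rows to customers in first-seen order, which is an assumption of this sketch. All function names are hypothetical, and `to_sparse` needs SciPy installed.

```python
from typing import Dict, Iterable, List, Tuple


def parse_movie_file(lines: Iterable[str]) -> Tuple[int, List[Tuple[int, int]]]:
    """Parse one per-movie file into (movie_id, [(customer_id, rating), ...]).

    Assumed layout: a header line such as "1:" followed by
    "customer_id,rating,date" lines.
    """
    it = iter(lines)
    movie_id = int(next(it).strip().rstrip(":"))
    ratings = []
    for line in it:
        line = line.strip()
        if not line:
            continue
        customer_id, rating, _date = line.split(",")
        ratings.append((int(customer_id), int(rating)))
    return movie_id, ratings


def build_triplets(movies):
    """Convert parsed movies into coordinate lists (rows, cols, vals, user_index).

    Customer IDs are mapped to row indices in first-seen order (an assumption);
    the column index is movie_id - 1.
    """
    user_index: Dict[int, int] = {}
    rows, cols, vals = [], [], []
    for movie_id, ratings in movies:
        for customer_id, rating in ratings:
            row = user_index.setdefault(customer_id, len(user_index))
            rows.append(row)
            cols.append(movie_id - 1)
            vals.append(rating)
    return rows, cols, vals, user_index


def to_sparse(rows, cols, vals, n_users, n_movies):
    """Build a CSR matrix; entries that are never set stay 0 (no rating)."""
    from scipy import sparse  # imported lazily so the parsing works without SciPy

    coo = sparse.coo_matrix(
        (vals, (rows, cols)), shape=(n_users, n_movies), dtype="int8"
    )
    return coo.tocsr()
```

A matrix built this way can be written with `scipy.sparse.save_npz` and read back with `scipy.sparse.load_npz`; whether data.npz in this repository was produced with those calls is not something this sketch assumes.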

This script also parses movie_titles.csv into the easier-to-read movie_titles_2.csv. Some of the movie titles contain commas, which would otherwise break a naive comma split; the script handles these.
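One way to handle titles containing commas is to split each line on the first two commas only and then re-write the records with proper CSV quoting. The sketch below assumes each line has the form `movie_id,year,title`; the function names and the output header are hypothetical, and the actual layout of movie_titles_2.csv may differ.

```python
import csv
from typing import Iterable, Iterator, TextIO, Tuple


def parse_title_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (movie_id, year, title), keeping any commas inside the title.

    Assumes the first two fields never contain commas, so everything after
    the second comma belongs to the title. The year is kept as a string in
    case some entries are not numeric.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        movie_id, year, title = line.split(",", 2)
        yield int(movie_id), year, title


def write_quoted_csv(records: Iterable[Tuple[int, str, str]], out: TextIO) -> None:
    """Write records as standard CSV; titles with commas are quoted."""
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["movie_id", "year", "title"])  # hypothetical header
    writer.writerows(records)
```

For example, a made-up line such as `3,1997,Example, With a Comma` parses to the title `Example, With a Comma`, and the written file can then be read by any standard CSV reader. When opening the real file you may need to pass an explicit `encoding` to `open()` if the default one fails.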

Then run netflix_PCA.py to step through a set of visualisations of the PCA results on this dataset.
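As a reminder of what "explaining variance" means, here is a small NumPy sketch of PCA via the singular value decomposition of the centred data. It is an illustration on a dense toy array, not the code in netflix_PCA.py; the function name is hypothetical.

```python
import numpy as np


def pca_explained_variance(X, n_components):
    """Return (components, explained_variance_ratio) for the top components.

    Rows of X are samples and columns are features. The data is centred
    column-wise, then decomposed with an SVD; the squared singular values
    are proportional to the variance captured by each principal component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # centre each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)  # variance along each component
    ratio = variances / variances.sum()  # fraction of total variance
    return Vt[:n_components], ratio[:n_components]


if __name__ == "__main__":
    # Toy data lying almost on a line: the first component dominates.
    rng = np.random.default_rng(0)
    t = rng.normal(size=200)
    X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])
    components, ratio = pca_explained_variance(X, n_components=2)
    print(ratio)
```

Centring a large sparse ratings matrix like this would make it dense, so for the full dataset a truncated SVD that works directly on sparse input (for instance `scipy.sparse.linalg.svds`) is the more practical choice.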
