Netflix Prize Dataset

This is a small ML project that attempts to predict whether a user would like Miss Congeniality (2000) based on their ratings of the 30 most-rated movies. The data comes from real Netflix users and dates from 2009.

The project is based on the 2009 "Netflix Prize" competition: https://www.netflixprize.com/community/topic_1537.html

Tools Used

Language: Python
Sklearn (Scikit-Learn)
Seaborn
Pandas

The data

The data was obtained from the following link:

Stanford CS109: Machine Learning Datasets

You can find other interesting datasets at the same link.

Data formatting

Each row in the train and test set represents one user. Each column represents one movie. All users in the dataset rated all movies in the dataset. Each entry in this dataset is binary.

A value of 1 indicates a rating of 4 or 5 (they liked the movie).
A value of 0 indicates a rating of 1, 2 or 3 (didn't really like it).
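
The mapping from star ratings to binary labels can be sketched as follows. The published dataset is already binary, so this step is not needed to run the project; the function name `binarize_rating` and the sample ratings are hypothetical and only illustrate the rule described above.

```python
def binarize_rating(stars: int) -> int:
    """Map a 1-5 star rating to the binary label described above."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    # 4 or 5 stars -> 1 (liked), 1 to 3 stars -> 0 (didn't really like it).
    return 1 if stars >= 4 else 0


# Hypothetical raw ratings from one user, for illustration only.
raw_ratings = [5, 3, 4, 1, 2]
binary_ratings = [binarize_rating(r) for r in raw_ratings]
print(binary_ratings)  # [1, 0, 1, 0, 0]
```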

The 30 input features used:

Each column represents ratings for a particular movie.

Independence Day (1996)
The Patriot (2000)
The Day After Tomorrow (2004)
Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Pretty Woman (1990)
Forrest Gump (1994)
The Green Mile (1999)
Con Air (1997)
Twister (1996)
Sweet Home Alabama (2002)
Pearl Harbor (2001)
Armageddon (1998)
The Rock (1996)
What Women Want (2000)
Bruce Almighty (2003)
Ocean's Eleven (2001)
The Bourne Identity (2002)
The Italian Job (2003)
I Robot (2004)
American Beauty (1999)
How to Lose a Guy in 10 Days (2003)
Lethal Weapon 4 (1998)
Shrek 2 (2004)
Lost in Translation (2003)
Top Gun (1986)
Pulp Fiction (1994)
Gone in 60 Seconds (2000)
The Sixth Sense (1999)
Lord of the Rings: The Two Towers (2002)
Men of Honor (2000)

The expected output:

The variable you are predicting is the binary value for the user's rating of Miss Congeniality (2000).

Data splitting

Total number of samples in data: 41,188

Training samples: 1,720
Validation samples: 430
Test samples: 597
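
The training and validation counts above are consistent with an 80/20 split of 2,150 rows (1,720 + 430). A minimal sketch of such a split, assuming scikit-learn's `train_test_split` is used (whether main.py splits the data this way is an assumption), with row indices standing in for the real data:

```python
from sklearn.model_selection import train_test_split

SEED = 555  # same value as the SEED parameter in this README

# Stand-in for the 2,150 rows that end up as training + validation data.
row_ids = list(range(2150))

# An integer test_size is an absolute number of held-out samples.
train_ids, val_ids = train_test_split(row_ids, test_size=430, random_state=SEED)

print(len(train_ids), len(val_ids))  # 1720 430
```

Passing the same `random_state` on every run keeps the split reproducible.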

The MLP Model

We use a 3-layer MLP (multilayer perceptron) with:

30 neurons in input layer
15 neurons in hidden layer (ReLU activation function)
1 neuron in output layer

We use the Adam solver.
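
A minimal sketch of this architecture, assuming scikit-learn's `MLPClassifier` (the exact class and hyperparameters used in main.py are not shown here, so treat them as assumptions). The data below is synthetic stand-in data with a made-up target rule, not the Netflix dataset; `max_iter=500` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

SEED = 555
rng = np.random.default_rng(SEED)

# Synthetic stand-in data: 200 users x 30 binary movie ratings.
X = rng.integers(0, 2, size=(200, 30))
# Made-up target rule so the model has something to learn.
y = (X[:, :3].sum(axis=1) >= 2).astype(int)

model = MLPClassifier(
    hidden_layer_sizes=(15,),  # one hidden layer with 15 neurons
    activation="relu",         # ReLU in the hidden layer
    solver="adam",
    max_iter=500,
    random_state=SEED,
)
model.fit(X, y)

# The input size (30) is inferred from X; a binary target yields a
# single output neuron.
print([w.shape for w in model.coefs_])
print(model.out_activation_)
```

In scikit-learn the input and output layers are not configured explicitly: they follow from the shape of the features and the number of classes in the target.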

The Code

How To Run

To run this code, run main.py.

Two parameters, DEBUG and PLOTS, control whether debug console logs are printed and whether output plots are generated.

There is also a SEED parameter that seeds the pseudo-random generators in the code. Set it to a constant to ensure consistency between runs.

DEBUG = True
PLOTS = True
SEED = 555
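
As an illustration of how such parameters are typically consumed, here is a minimal sketch. The helper `debug_log` is hypothetical and not taken from main.py; only the standard-library `random` module is seeded here, whereas the real code may also pass SEED to other libraries.

```python
import random

DEBUG = True
PLOTS = True
SEED = 555


def debug_log(message: str) -> None:
    """Print a message only when DEBUG is enabled (hypothetical helper)."""
    if DEBUG:
        print(f"[DEBUG] {message}")


# Re-seeding with the same constant reproduces the same random sequence.
random.seed(SEED)
first_run = [random.random() for _ in range(3)]
random.seed(SEED)
second_run = [random.random() for _ in range(3)]

debug_log(f"identical after re-seeding: {first_run == second_run}")
```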
