Project highlights: KNN, K-means, PCA, Data visualization, Recommender system
Explore this project on my online demo in one click
Step by step analysis in jupyter-notebook
Movie recommendation is an exciting topic closely related to our daily lives. It saves us the time and energy of choosing among millions of movies. But no one could ever watch millions of movies in order to make a good recommendation. This is where data science techniques such as K Nearest Neighbors (KNN) and K-means can help. Here we will first use a KNN model and then improve it with K-means clustering. Fun!
The idea of the KNN model is quite intuitive. Imagine that you are at a movie lovers' club in NYC. You have fun and meet a bunch of friends who are eager to share their favorite movies with you. Whose recommendation should you take? The one who has the same taste as you, or the one who disagrees with your opinion on every single movie? Of course, it is sometimes important to be open to different opinions. But for a movie to relax with after a tiring day of work, I would go with the recommendation from my like-minded friends.
The KNN model follows this intuition: it asks for the opinion of the K users whose ratings are most similar to yours and averages their ratings.
The following is a simple demo of 1097 people (one is missing) and their ratings for 400 movies (movies a person hasn't seen are recorded as NaN). Based on KNN, we can recommend movies to each of them!
Explore this recommendation on my online demo
Here we use a KNN model to complete the rating matrix, so we can recommend movies to each user by sorting the predicted ratings. This ensures that the recommended movies are those highly rated by like-minded people.
Beyond overall recommendation, there are many other real-world scenarios! When I have just finished a great movie, chances are that I want more movies of the same kind. A clustering model can help with that!
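To make the matrix-completion idea concrete, here is a minimal sketch using only NumPy. It is not the code in `movie_recommend.py`: the function names `knn_fill` and `recommend`, the toy 4x4 rating matrix, and the choice of root-mean-square distance over co-rated movies are all assumptions made for illustration.

```python
import numpy as np

def knn_fill(ratings, k=2):
    """Fill each NaN with the mean rating of the k most similar users
    who have rated that movie (hypothetical helper, for illustration)."""
    ratings = np.asarray(ratings, dtype=float)
    filled = ratings.copy()
    n_users = ratings.shape[0]
    for u in range(n_users):
        # Distance from user u to every other user, over co-rated movies only.
        dists = np.full(n_users, np.inf)
        for v in range(n_users):
            if v == u:
                continue
            both = ~np.isnan(ratings[u]) & ~np.isnan(ratings[v])
            if both.any():
                diff = ratings[u, both] - ratings[v, both]
                dists[v] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)  # most similar users first
        for m in np.where(np.isnan(ratings[u]))[0]:
            # Keep the k nearest users who actually rated movie m.
            neighbors = [v for v in order
                         if np.isfinite(dists[v]) and not np.isnan(ratings[v, m])][:k]
            if neighbors:
                filled[u, m] = ratings[neighbors, m].mean()
    return filled

def recommend(ratings, filled, user, n=1):
    """Return indices of the user's unseen movies, best predicted rating first."""
    unseen = np.where(np.isnan(ratings[user]))[0]
    return unseen[np.argsort(-filled[user, unseen])][:n]

# Toy data (made up): rows are users, columns are movies, NaN = not seen.
toy = np.array([
    [5, 4, np.nan, 1],
    [5, 5, 4, 1],
    [4, 4, 5, 2],
    [1, 1, 2, 5],
])
toy_filled = knn_fill(toy, k=2)
top_for_user0 = recommend(toy, toy_filled, user=0)
```

In this toy example, user 0's two nearest neighbors are users 1 and 2, so the missing rating is predicted as the mean of their ratings for that movie. In practice, scikit-learn's `KNNImputer` offers a ready-made version of the same idea.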
In the next part, we will deploy the basic clustering model K-means to recommend movies in the same genre.
The idea of K-means is similar to KNN, with a small difference. In the KNN part we wanted the average rating from K like-minded friends. Here we want more than that: we not only want to find similar movies, but also to group them and give each group a meaningful interpretation!
In short, the Silhouette score suggests that there are three movie clusters. We then take the mean rating of each movie and look at the top 10 movies in each group.
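As a rough sketch of how a Silhouette score can be used to pick the number of clusters, the snippet below runs K-means for several candidate values of K and keeps the one with the highest score. The synthetic 2-D `features` array stands in for whatever movie features the notebook actually clusters on, and the helper name `best_k` and the candidate range 2 to 6 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for movie features: three well-separated blobs in 2-D.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
features = np.vstack(
    [c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers]
)

def best_k(X, candidates=range(2, 7)):
    """Fit K-means for each candidate K and return the K with the
    highest mean Silhouette score, plus all scores (hypothetical helper)."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

k, scores = best_k(features)

# Cluster assignment for each row (movie) under the chosen K.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
```

Once each movie has a cluster label, the top movies in a group can be listed by sorting that group's movies by mean rating.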
Explore this recommendation on my online demo
We can take into account many other factors such as starring actors, director, language, area, length, etc. We can even employ NLP methods to analyse audiences' movie reviews and construct our own features!
Our sample of 1097 users is reasonably large. But given the huge number of users in a movie company's database, we could do better simply by using a larger sample (with far more movies and more users!).
We adopt KNN and K-means because of their simplicity and good interpretability. In industrial applications there is much more to worry about, such as model performance and speed in real-time recommendation. So it is a judgment call, and we may employ more state-of-the-art (SOTA) models in that case.
User Clustering for anime recommendation by Lastnight.
├─ data
│ └─ movieReplicationSet.csv # 1097 people's rating of 400 movies.
├─ image # 400 movie posters for result visualization.
│ ├─ 10 Things I Hate About You (1999).png
│ ├─ 10000 BC (2008).png
│ ├─ 12 Monkeys (1995).png
...
│ ├─ Wing Commander (1999).png
│ ├─ X-Men (2000).png
│ ├─ X-Men 2 (2003).png
│ ├─ You're Next (2011).png
│ └─ Zoolander (2001).png
├─ readme
├─ tools # web crawler for movie posters.
│ └─ get_movie_cover.py
├─ README.md
├─ movie_recommend.py # main py file for KNN recommendation and webpage rendering.
├─ movie_recommender_analysis.ipynb # A full data analysis in jupyter-notebook
└─ requirements.txt
Explore this project on my online demo
A full data analysis is in this jupyter-notebook.
More on my github.