NLP-and-Clustering - Movie-similarity Analysis Project

Using NLP and Clustering on movie plot summaries from IMDB and Wikipedia to quantify movie similarity

Introduction

This project aims to quantify the similarity between movies based on their plot summaries available on IMDb and Wikipedia, clustering them accordingly. By employing natural language processing techniques and unsupervised learning algorithms, this project clusters movies based on their plot descriptions to identify common themes and similarities among various films. The goal is to cluster movies based on thematic similarities to better understand their relationships and groupings.

Requirements

To run the code in this project, ensure the installation of the NLTK library using the following command:

pip -q install nltk

Movies dataset - https://drive.google.com/file/d/1p2m41Oef8s29dkE-C_Yl9cj_DlVZiYYp/view?usp=sharing

Workflow Overview

Dataset Overview:
- The dataset combines plot summaries from IMDb and Wikipedia, providing a rich source of information for movie analysis.
Text Processing:
- Text data is tokenized, stemmed using the Snowball Stemmer, and transformed into numerical representations using the TF-IDF method.
Clustering with K-means:
- Utilizing the K-means algorithm, movies are grouped into clusters based on their genres, allowing for a comprehensive analysis of their interrelationships.
Similarity Calculation:
- Cosine similarity is employed to measure the closeness of movie plot summaries, providing insights into their thematic similarities.
Dendrogram Visualization:
- A dendrogram is created to visualize movie relationships, displaying hierarchical clustering results to understand the strength of similarity between various movie pairs.

Results

The project successfully identified five (5) thematic clusters among movies, allowing for a clearer understanding of the relationships between different films. The dendrogram visualization showcases how movies are related based on their plot summaries.

Usage

The provided code and analysis can be replicated in a Python environment, such as Jupyter Notebook or Google Colab, to explore movie similarities and clusters based on plot summaries.

Conclusion

This project demonstrates how text-based movie plot summaries can be processed and clustered to reveal thematic similarities among movies. The visual representations and clustering methodologies enable a better understanding of movie relationships and thematic groupings.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Movies_Clustering_NLP.ipynb		Movies_Clustering_NLP.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Movies_Clustering_NLP.ipynb

Movies_Clustering_NLP.ipynb

README.md

README.md

Repository files navigation

NLP-and-Clustering - Movie-similarity Analysis Project

Introduction

Requirements

Workflow Overview

Results

Usage

Conclusion

About

Releases

Packages

Languages

Netcodez/NLP-and-Clustering---Movie-similarity

Folders and files

Latest commit

History

Movies_Clustering_NLP.ipynb

Movies_Clustering_NLP.ipynb

README.md

README.md

Repository files navigation

NLP-and-Clustering - Movie-similarity Analysis Project

Introduction

Requirements

Workflow Overview

Results

Usage

Conclusion

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages