Skip to content

This projects aims to provide lists containing only great movies to users based only a gew filters and search parameters.

Notifications You must be signed in to change notification settings

pkhaan/AutoCuratedMovieLists

Repository files navigation

📷Auto Curated Movie Lists

This projects aims to provide lists containing only great movies to users based only a few filters and search parameters. 💿

Our Aim

  • Identify Movie Websites: We used websites with lists of "good movies". We chose lists of movies based not on user reviews or personal taste. Most lists come from the opinions of hundreds of film critics around the world and are.

  • Implement Web Scraping: We then scraped those websites using Python in order to extract the info we needed.

  • Match Data: After the web scraping process we are left with 5 lists. We want to match those lists and create a single one without duplicates or "bad" data.

  • Define TMDB API Integration and extract movie information: using TMDB's public API we want to get further information for each movie. Such information would be the release date, genres, language etc. We will use this informations to filter the movies based on the user search query.

  • Apply Filters and Parameters: Design and implement the filter and parameter functionality in your web app's user interface. Allow users to input their desired filters, such as genre, release year, rating, etc. Capture the user inputs and incorporate them into the search query for further processing.

  • On-the-Fly Integration and Processing: Combine the extracted movie titles from web scraping with the retrieved movie information from the API on the fly. Match the movie titles between the scraped data and API responses to associate the relevant movie details with each title. Perform any necessary data transformations or filtering based on the user's input filters.

  • Generate Search Results: Apply the user's selected filters and parameters to the integrated and processed data. Sort and rank the movies based on relevance or other criteria. Generate the search results, including the movie titles and associated information, such as synopsis, release date, and other details.

  • Present Results to Users: Display the search results to the users through a user-friendly interface. This can be in the form of a list, grid, or any other suitable format. Consider including pagination or infinite scrolling if there are many search results to display.

  • Continual Improvement: Regularly monitor the movie websites for changes in structure or accessibility, and adjust the web scraping scripts accordingly. Stay updated with the movie information API's documentation and ensure compatibility with any changes they introduce. Gather user feedback to improve the search functionality and refine the integration process.

Our Project Structure

  1. Scraping
  2. Data Matching
  3. API Integration
  4. On-the-Fly Integration and Processing
  5. Applying Filters and Parameters
  6. Using Flutter for web integration
For our scraping sources we used the following lists:

Using basic BeautifoulSoup structure

def scrape_movie_titles(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract movie titles from the HTML structure
    titles = soup.find_all('h2', cdlass_='movie-title')
    movie_titles = [title.text for title in titles]
    return movie_titles

Data Matching

For Data Matching we used Python's RecordLinkage library. We opted for RecordLinkage instead of a simple concatenation because the titles of the movies aren't always exactly the same across two lists. For example we have "The Godfather" in a list and "The Godfather trilogy" in another, but we only need to keep one. Record Linkage uses Levenshtein Distance to calculate the similarity between two strings. We iteratively compare all pairs of lists to find movies appearing on both lists. We check that both the titles are similar enough (based on a threshold) anf that the years (if available) are the same. The output after this step is a sinlge DataFrame containing every movie appearing at least once in some list, without any duplicates.

Using Jaccard Similarity

We implement also the matching parameter using fuzzy word token ratio and jaccard similarity. $$d(i,j) = \frac {b+c}{a+b+c} \implies 1- sim(i,j)$$ for given sets while the general rule for 2 sets is $$J(A, B) = \frac {|A \cap B|}{|A \cup B|}$$

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

csv_file1 = 'ScrapedCSVs/top_100_movies.csv'
csv_file2 = 'ScrapeLOC/movies.csv'
csv_file3 = 'ScrapeRottenTomatoes/moviesRT.csv'

# Read each CSV into separate DataFrames
df1 = pd.read_csv(csv_file1)
df2 = pd.read_csv(csv_file2)
df3 = pd.read_csv(csv_file3)
#....and many other .csvs

# Function to find fuzzy matches using Jaccard similarity
def fuzzy_match_jaccard(movie_title, titles_list):
    return process.extractOne(movie_title, titles_list, scorer=fuzz.token_sort_ratio)[0]

# Get a list of all movie titles in all three DataFrames
all_movie_titles = df1['Title'].tolist() + df2['Title'].tolist() + df3['Title'].tolist()

# Find fuzzy matches for each DataFrame
df1['Matched_Title'] = df1['Title'].apply(fuzzy_match_jaccard, args=(all_movie_titles,))
df2['Matched_Title'] = df2['Title'].apply(fuzzy_match_jaccard, args=(all_movie_titles,))
df3['Matched_Title'] = df3['Title'].apply(fuzzy_match_jaccard, args=(all_movie_titles,))

# Concatenate the DataFrames based on fuzzy matches
combined_df = pd.concat([df1, df2, df3], ignore_index=True)

# Write the combined DataFrame to a new CSV file
output_csv = 'combined_movies_jaccard.csv'
combined_df.to_csv(output_csv, index=False)

print(f"Jaccard similarity matching completed. The combined data is saved to {output_csv}.")

Flutter Application

Operating Principle

The app sends requests and receives responses from the themoviedb API.
To learn more about APIs and the Multitier architecture click here.

multitier_architecture

....in this phase we integrate via api calls our . csv and we try to macth our movie titles with the TMDB database in order to extract our list.

Dependencies

Getting Started

This application is using api of themoviedb, so before using it you have to create an api from themoviedb and generate an API and apply it to this application, follow the below step to connect api with this app.

First go to https://www.themoviedb.org/documentation/api, and follow the API Documentation, you will get the API Code.

  • go to secret/the_moviedb_api.dart
  • you will see the code like this
const String themoviedbApi = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
  • replace the all xx.. to your API, like this
const String themoviedbApi = 'your_api_token_here';

Snapshots

Intro Screen finder

Known Bugs(TODO)

  • Android Emulator doesn't work due to jvm error
  • Web view needs to be fixed as the project renders mainly to android and iOS devices
  • Correct connection with our API and the predisposed API that flutter provided.

Collaborators

About

This projects aims to provide lists containing only great movies to users based only a gew filters and search parameters.

Topics

Resources

Stars

Watchers

Forks