GitHub - transitive-bullshit/populate-movies: Populates a high quality database of movies from TMDB, IMDB, and Rotten Tomatoes.

Populates a high quality database of movies from TMDB, IMDB, and Rotten Tomatoes.

Intro
Movie Schema
Prerequisites
Steps
Stats
Database Notes
License

Intro

This project resolves a full set of movies from TMDB and IMDB in batches, storing them into a Postgres database using Prisma.

This includes difficult-to-access data like IMDB ratings, Rotten Tomatoes audience & critic scores, and YouTube trailers.

Everything is handled including data fetching, normalization, cleaning, and filtering. All scripts are idempotent, so you should be able to re-run them whenever you want to refresh your data.

Movie Schema

Movies are transformed into the following format, which combines metadata from TMDB, IMDB, Rotten Tomatoes, and FlickMetrix.

{
  "tmdbId": 680,
  "imdbId": "tt0110912",
  "title": "Pulp Fiction",
  "originalTitle": "Pulp Fiction",
  "language": "en",
  "releaseYear": 1994,
  "releaseDate": "1994-09-10",
  "genres": ["thriller", "crime", "drama"],
  "plot": "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.",
  "runtime": 154,
  "adult": false,
  "budget": "8000000",
  "revenue": "213928762",
  "homepage": "https://www.miramax.com/movie/pulp-fiction/",
  "mpaaRating": "R",
  "keywords": ["nonlinear timeline", "overdose", "drug use", "drug overdose"],
  "countriesOfOrigin": ["United States"],
  "languages": ["English", "Spanish", "French"],
  "cast": ["John Travolta", "Samuel L. Jackson", "Uma Thurman", "Bruce Willis"],
  "director": "Quentin Tarantino",
  "production": "Miramax Films",
  "awardsSummary": "Won 1 Oscar. Another 63 wins & 47 nominations.",
  "status": "released",
  "posterUrl": "https://image.tmdb.org/t/p/w780/fIE3lAGcZDV1G6XM5KmuWnNsPp1.jpg",
  "posterWidth": 2000,
  "posterHeight": 3000,
  "backdropUrl": "https://image.tmdb.org/t/p/w1280/suaEOtk1N1sgg2MTM7oZd2cfVp3.jpg",
  "backdropWidth": 2631,
  "backdropHeight": 1480,
  "trailerUrl": "https://youtube.com/watch?v=tGpTpVyI_OQ",
  "trailerYouTubeId": "tGpTpVyI_OQ",
  "imdbRating": 8.9,
  "imdbVotes": 2033927,
  "imdbType": "movie",
  "tmdbPopularity": 74.051,
  "tmdbRating": 8.491,
  "tmdbVotes": 24004,
  "metacriticRating": 94,
  "metacriticVotes": 24,
  "rtCriticRating": 93,
  "rtCriticVotes": 103,
  "rtAudienceRating": 93,
  "rtAudienceVotes": 1120247,
  "rtUrl": "https://www.rottentomatoes.com/m/pulp_fiction",
  "letterboxdScore": 86,
  "letterboxdVotes": 1102425,
  "flickMetrixScore": 90,
  "flickMetrixId": 110912,
  "foreign": false,
  "relevancyScore": 1883295.240641344,
  "imdbCustomPopularity": 2033927,
  "createdAt": "2022-11-07T03:51:42.276Z",
  "updatedAt": "2022-11-07T03:51:42.276Z"
}

In addition to basic movie metadata, most movies also include:

Ratings
- IMDB ratings
- Rotten Tomatoes critic and audience scores
- Metacritic ratings
- Letterboxd ratings
- FlickMetrix ratings
Media
- YouTube trailers
- Poster images
- Backdrop images

We also compute some useful custom fields like foreign, which is a flag for whether or not most Western audience members would recognize the movie. This includes most English movies as well as many of the most popular foreign films.

This format aims to be as compact as possible while still including the most important metadata, so for certain fields like the trailer, we choose the best possible YouTube video according to a set of heuristics and ignore any other related videos.

This may result in less metadata in your database for fallbacks and such, but the project is explicitly designed to be re-run regularly so your data remains valid and up-to-date, relying on these third-parties to deal with data drift.

Prerequisites

Store your TMDB_BEARER_TOKEN and DATABASE_URL in a local .env file.

Steps

Once we have our environment setup, we can start processing movies. Most of the processing steps break things up into batches (defaults to 32k movies per batch) in order to allow for incremental processing and easier debugging along the way.

All scripts are idempotent, so you can run these steps repeatedly and expect up-to-date results.

Before getting started, make sure you've run pnpm install to initialize the Node.js project.

1. Download Data Dumps

We'll start off by running npx tsx src/download-data-dumps.tsx, which downloads the following data dumps to seed our scripts:

The most recent daily export of TMDB movie IDs such as 10/30/2022
- This gives us a full list of TMDB movie IDs to kick things off.
An official IMDB title ratings data dump
- This file contains a large sample of IMDB movie IDs and their associated IMDB ratings + number of votes.

The output will be stored in the data/ directory and is ignored by git.

2. Populate TMDB Movies

Next, we'll run npx tsx src/populate-tmdb-movie-dump.ts which populates each of the TMDB movie IDs with its corresponding TMDB movie details and stores the results into a series of batched JSON files out/tmdb-0.json, out/tmdb-1.json, etc. (takes ~1 hour)

The result is ~655k movies in 24 batches.

3. Populate Flick Metrix Movies (Optional)

The next optional step is to download movie metadata using Flick Metrix's private API. This is really only necessary if you want Rotten Tomatoes scores. (takes ~3 minutes)

Run npx tsx src/populate-flick-metrix-movies which will fetch ~70k movies and store the results into out/flick-metrix-movies.json. This optional metadata will be used by src/process-movies.ts if it exists and will be ignored if it doesn't.

4. Process Movies

Next, we'll run npx tsx src/process-movies.ts which takes all of the previously resolved TMDB movies and transforms them to a normalized schema, adding in IMDB ratings from our partial IMDB data dump. This will output normalized JSON movies in batched JSON files out/movies-0.json, out/movies-1.json, etc. (takes ~30 seconds)

This script also filters movies which are unlikely to be relevant for most use cases:

filters adult movies
filters movies which do not have a valid IMDB id (~40%)
filters movies which do not have a valid YouTube trailer (~58%)
filters movies which are too far in the future (~1.5%)
filters movies which are too short (< 30min)
filters music videos
filters tv series and episodes
filters live concerts
optionally computes LQIP placeholder images for all movie posters and backdrops
- caches the results in redis
- this is controlled by enablePreviewImages in src/lib/config.ts
- if you have it enabled, set REDIS_HOST and REDIS_PASSWORD
- I just use a local redis instance (brew install redis)
adds additional IMDB info from any previous populate-imdb-movies cache (if out/imdb-movies.json exists)
adds additional Rotten Tomatoes info from any previous populate-rt-movies cache (if out/rt-movies.json exists)
adds additional Flick Metrix info from any previous populate-flick-metrix-movies cache (if out/flick-metrix-movies.json exists)

NOTE: this is the most important step, and you will likely find yourself running it multiple times. It is pretty quick, though, since it doesn't require any network access. It is essentially just aggregating all of the data we've downloaded in the other steps into a normalized format and performing some filtering.

The result is ~73k movies.

5. Populate Rotten Tomatoes Movies (Optional)

The next optional step is to download additional RT info for each movie, using a cheerio-based scraper. We self-impose a strict rate-limit on the scraping, so this step takes a long time to run and requires a solid internet connection with minimal interruptions. (takes ~4 hours)

Currently, the only way this script will work is if you've previously downloaded all Flick Metrix movies.

This step is important if you want up-to-date Rotten Tomatoes critic and audience scores, along with a criticsConsensus description.

If you want to proceed with this step, you can run npx tsx src/populate-rt-movies.tsx which will read in our normalized movies from out/movie-0.json, out/movie-1.json, etc, scrape each RT movie and store the results to out/rt-movies.json.

Once this step finishes, you'll need to re-run npx tsx src/process-movies.ts, which will now take into account all of the extra RT metadata that was downloaded to our local cache.

6. Populate IMDB Movies (Optional)

The next optional step is to download additional IMDB info for each movie, using a cheerio-based scraper called movier. We self-impose a strict rate-limit on the scraping, so this step takes a long time to run and requires a solid internet connection with minimal interruptions. (takes ~12 hours)

NOTE: see IMDB's personal and non-commercial licensing before proceeding. This step is optional for a reason. IMDB does have an official API, but it is extremely expensive to use (starting at $50k + usage-based billing).

If you want to proceed with this step, you'll can run npx tsx src/populate-imdb-movies.ts which will read in our normalized movies from out/movie-0.json, out/movie-1.json, etc, scrape each IMDB movie individually with movier and store the results to out/imdb-movies.json.

If you are running into rate-limit issues (likely 503 errors), then you'll need to adjust the rate-limit in src/lib/imdb.ts.

Once this step finishes, you'll need to re-run npx tsx src/process-movies.ts, which will now take into account all of the extra IMDB metadata that was downloaded to our local cache.

7. Upsert Movies into Prisma

The final step is to upsert all of the movies into your Prisma database. (takes ~30 seconds)

Make sure that you have DATABASE_URL set to a Postgres instance in your .env file and then run npx prisma db push to sync the Prisma schema with your database (as well as generating the prisma client locally in node_modules). You can alternatively use prisma db migrate and prisma generate if you prefer that workflow.

NOTE: this script defaults to upserting movies into your database, so it shouldn't be destructive, but if you'd rather clear your movies table first, you can set DROP_MOVIES=true — which will drop the movies table and perform upserts in batches. This method is much faster, but be careful with it so that you don't accidentally wipe out movie data.

NOTE if you set DRY_RUN=true, this script will not make any changes to your database. Instead, it will loop through all movies that would've been inserted and print the diffs — which can be extremely useful for making sure that your upsert will work as intended.

Now you should be ready to run npx tsx src/upsert-movies-to-db.ts which will run through out/movies-0.json, out/movies-1.json, etc and insert each movie into the database.

8. Query Movies from Prisma

You should now have a full Postgres database of movies, complete with the most important metadata from TMDB and IMDB. Huzzah!

You can run npx tsx scripts/scratch.ts to run an example Prisma query.

Stats

~750k "movies" in TMDB
~73k movies after resolving and filtering
- 8.6k documentaries
- 44k english movies
- 34k movies made in the U.S.
  - (80% of these are made exclusively in the U.S.)
- 4.3k movies made in India
  - (99% of these are made exclusively in India)
- 1.3k movies made in China
  - (40% of these are made exclusively in China)
IMDB stats
- 30k movies have at least 1k votes on IMDB
- 40k movies have an IMDB rating of at least 6
- 29k movies have an IMDB rating of at least 6.5
- 17k movies have an IMDB rating of at least 7
- 3k movies have an IMDB rating of at least 8
- 300 movies have an IMDB rating of at least 9

(all stats are approximate)

Database Notes

The resulting movie dataset is ~130MB (about half of which is Postgres indices) and should fit in most free-tier Postgres instances.

Some fields like budget and revenue use strings instead of numbers, because they can overflow Postgres integers.

If you want to use a different type of database, it should be pretty easy since the majority of the processing happens with local JSON files. Note that Prisma supports several popular databases, including MongoDB, MySQL, and SQLite.

License

Support my open source work by following me on twitter twitter

Name		Name	Last commit message	Last commit date
Latest commit History 131 Commits
.github		.github
.husky		.husky
data		data
media		media
prisma		prisma
src		src
.env.example		.env.example
.gitignore		.gitignore
.npmrc		.npmrc
.prettierignore		.prettierignore
.prettierrc.cjs		.prettierrc.cjs
license		license
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
readme.md		readme.md
tsconfig.json		tsconfig.json

License

transitive-bullshit/populate-movies

Folders and files

Latest commit

History

Repository files navigation

Intro

Movie Schema

Prerequisites

Steps

1. Download Data Dumps

2. Populate TMDB Movies

3. Populate Flick Metrix Movies (Optional)

4. Process Movies

5. Populate Rotten Tomatoes Movies (Optional)

6. Populate IMDB Movies (Optional)

7. Upsert Movies into Prisma

8. Query Movies from Prisma

Stats

Database Notes

License

About

Topics

Resources

License

Stars

Watchers

Forks

Sponsor this project

Languages