The Movie Script Database

This is a utility that collects movie scripts from several sources and builds a database of 2.5k+ movie scripts as .txt files, along with metadata for each movie.

There are four steps to the whole process:

  1. Collect scripts from various sources - scrape websites for scripts in HTML, TXT, DOC, or PDF format
  2. Collect metadata - get metadata about the scripts from TMDb and IMDb for additional processing
  3. Find duplicates from different sources - automatically group and remove duplicate scripts collected from different sources
  4. Parse scripts - convert scripts into lines containing just the character and their dialogue

Usage

The following steps MUST be run in order

Clone

Clone this repository:

git clone https://github.com/Aveek-Saha/Movie-Script-Database.git
cd Movie-Script-Database

Dependencies

First, read the installation instructions for textract, since it has system-level dependencies of its own.

Then install all the Python dependencies using pip:

pip install -r requirements.txt

Collect movie scripts

Select the sources you want to download in sources.json. To include a source, set its value to true; otherwise, set it to false (an illustrative example follows the notes below). Then run:

python get_scripts.py

This collects all the scripts from the sources listed below:

{
    "imsdb": "true",
    "screenplays": "true",
    "scriptsavant": "true",
    "dailyscript": "true",
    "awesomefilm": "true",
    "sfy": "true",
    "scriptslug": "true",
    "actorpoint": "true",
    "scriptpdf": "true"
}
  • This might take a while (4+ hours) depending on your network connection.
  • The script uses parallel processing to speed up downloads.
  • If any downloads are missing or incomplete, re-running the script downloads only the missing scripts.
  • For scripts in PDF or DOC format, the original file is stored in the temp directory.
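
For example, to skip a few sources you might edit sources.json like this (which sources are disabled here is purely illustrative):

{
    "imsdb": "true",
    "screenplays": "false",
    "scriptsavant": "true",
    "dailyscript": "true",
    "awesomefilm": "false",
    "sfy": "true",
    "scriptslug": "true",
    "actorpoint": "true",
    "scriptpdf": "false"
}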

Collect metadata

Collect metadata from TMDb and IMDb:

python get_metadata.py

You'll need an API key to use the TMDb API; you can find out more on the TMDb website. Once you have the key, store it in a file called config.py in this format:

tmdb_api_key = "<Your API key>"
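
Under the hood, the key is used for calls to the TMDb web API. Below is a minimal sketch of the kind of lookup it enables, using the standard TMDb v3 search endpoint and the requests library; this is for illustration only, and the repository's own code may make different calls:

import requests
from config import tmdb_api_key  # the config.py file described above

def search_tmdb(title):
    """Return the first TMDb search result for a movie title, or None."""
    resp = requests.get(
        "https://api.themoviedb.org/3/search/movie",
        params={"api_key": tmdb_api_key, "query": title},
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return results[0] if results else None

print(search_tmdb("Blade Runner"))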

This step also groups duplicate files under a single entry, and your final metadata will be in this format:

{
    "uniquescriptname": {
        "files": [
            {
                "name": "Duplicate 1",
                "source": "Source of the script",
                "file_name": "name-of-the-file",
                "script_url": "Original link to script",
                "size": "size of file"
            },
            {
                "name": "Duplicate 2",
                "source": "Source of the script",
                "file_name": "name-of-the-file",
                "script_url": "Original link to script",
                "size": "size of file"
            }
        ],
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        }
    }
}
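
To get a feel for this file, here is a short sketch that loads the combined metadata and prints how many duplicate files were grouped under each unique script name. The path is assumed from the directory structure shown later; file names may differ on your machine:

import json

# Path assumed from the directory structure shown later in this README
with open("scripts/metadata/meta.json", encoding="utf-8") as f:
    meta = json.load(f)

# Each entry groups one or more duplicate files under a unique script name
for name, entry in meta.items():
    sources = sorted({f["source"] for f in entry["files"]})
    print(f"{name}: {len(entry['files'])} file(s) from {sources}")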

Remove duplicates

Run:

python clean_files.py

This removes duplicate files as reliably as possible while avoiding false positives. The resulting files are stored in the scripts/filtered directory.

A new metadata file is created where only one file exists for each unique script name, in this format:

{
    "uniquescriptname": {
        "file": {
            "name": "Movie name from source",
            "source": "Source of the script",
            "file_name": "name-of-the-file",
            "script_url": "Original link to script",
            "size": "size of file"
        },
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        }
    }
}

The scripts are also cleaned to remove, as far as possible, the formatting artifacts introduced when OCR is used to read a PDF.
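
For illustration, cleanup of this kind usually amounts to a handful of text substitutions. The rules below are a sketch of typical OCR/PDF fixes, not the repository's actual cleaning logic:

import re

def tidy_ocr_text(text):
    """Illustrative cleanup of common OCR/PDF artifacts (not the repo's actual rules)."""
    text = text.replace("\r\n", "\n")              # normalise line endings
    text = re.sub(r"\n\s*\d+\s*\n", "\n", text)    # drop page numbers sitting on their own line
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # rejoin words hyphenated across line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)         # collapse long runs of blank lines
    return text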

Parse Scripts

Run:

python parse_files.py

This parses the deduplicated scripts from the previous step. The parsed scripts are written to three folders:

  • scripts/parsed/tagged: Contains scripts where each line has been tagged. The tags are
    • S = Scene
    • N = Scene description
    • C = Character
    • D = Dialogue
    • E = Dialogue metadata
    • T = Transition
    • M = Metadata
  • scripts/parsed/dialogue: Contains scripts where each line has the character name followed by their dialogue, in the format C=>D (see the sketch after this list)
  • scripts/parsed/charinfo: Contains a list of each character in the script and the number of lines they have, in the format C: number of lines
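
Because the dialogue files follow the simple C=>D convention, they are easy to post-process. A minimal sketch, assuming one character=>dialogue pair per line and a hypothetical file name:

from collections import Counter

def read_dialogue(path):
    """Yield (character, dialogue) pairs from a scripts/parsed/dialogue file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "=>" in line:
                character, dialogue = line.split("=>", 1)
                yield character.strip(), dialogue.strip()

# Recompute per-character line counts, similar to the charinfo files
counts = Counter(c for c, _ in read_dialogue("scripts/parsed/dialogue/example-movie_dialogue.txt"))
print(counts.most_common(5))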

A new metadata file is created with the following format:

{
    "uniquescriptname": {
        "file": {
            "name": "Movie name from source",
            "source": "Source of the script",
            "file_name": "name-of-the-file",
            "script_url": "Original link to script",
            "size": "size of file"
        },
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        },
        "parsed": {
            "dialogue": "name-of-the-file_dialogue.txt",
            "charinfo": "name-of-the-file_charinfo.txt",
            "tagged": "name-of-the-file_parsed.txt"
        }
    }
}
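
The parsed block makes it straightforward to go from the metadata to the parsed files. A small sketch, again assuming the meta.json name and the directory layout shown in the next section:

import json

with open("scripts/metadata/meta.json", encoding="utf-8") as f:
    meta = json.load(f)

for name, entry in meta.items():
    parsed = entry.get("parsed")
    if not parsed:
        continue  # entry has no parsed output
    title = entry.get("tmdb", {}).get("title", name)
    print(title, "->", "scripts/parsed/dialogue/" + parsed["dialogue"])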

Directory structure

After running all the steps, your folder structure should look something like this:

scripts
│
├── unprocessed // Scripts from sources
│   ├── source1
│   ├── source2
│   └── source3
│
├── temp // PDF files from sources
│   ├── source1
│   ├── source2
│   └── source3
│
├── metadata // Metadata files from sources/cleaned metadata
│   ├── source1.json
│   ├── source2.json
│   ├── source3.json
│   └── meta.json
│
├── filtered // Scripts with duplicates removed
│
└── parsed // Scripts parsed using the parser
    ├── dialogue
    ├── charinfo
    └── tagged

Sources

Metadata: TMDb and IMDb.

Scripts: the script sites enabled in sources.json (imsdb, screenplays, scriptsavant, dailyscript, awesomefilm, sfy, scriptslug, actorpoint, and scriptpdf).

Note:

Citing

If you use The Movie Script Database, please cite:

@misc{Saha_Movie_Script_Database_2021,
    author = {Saha, Aveek},
    month = {7},
    title = {{Movie Script Database}},
    url = {https://github.com/Aveek-Saha/Movie-Script-Database},
    year = {2021}
}

Credits

The script for parsing the movie scripts comes from the paper Linguistic analysis of differences in portrayal of movie characters, in: Proceedings of the Association for Computational Linguistics, Vancouver, Canada, 2017. The code can be found at https://github.com/usc-sail/mica-text-script-parser.