Movie Scripts Dataset

scriptsonscreen is a dataset of movie scripts downloaded from the https://scripts-onscreen.com/ website.

This repository contains the data and source code used to scrape and preprocess the scripts from the scriptsonscreen website. You can find the movie scripts in the scripts directory.

The scripts directory contains one subdirectory per movie, named after the movie's IMDB id. Each subdirectory contains the movie script and the processed files derived from it.
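
As a minimal sketch of how this layout could be traversed, the Python function below lists the movie subdirectories under a local copy of the scripts directory. The function name list_movie_dirs and its scripts_root argument are illustrative assumptions and are not part of the code in src. The exact format of the IMDB ids is not stated here, so the sketch does not filter on the directory name.

```python
from pathlib import Path


def list_movie_dirs(scripts_root):
    """Return {directory name: path} for each movie subdirectory.

    Hypothetical helper: it assumes only that every subdirectory of
    scripts_root is named after an IMDB id, as described above.
    Plain files directly under scripts_root are ignored.
    """
    root = Path(scripts_root)
    return {p.name: p for p in sorted(root.iterdir()) if p.is_dir()}
```

Calling list_movie_dirs("scripts") from the repository root would then give a mapping from each IMDB id to the path of its subdirectory.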

The contents of a movie subdirectory are:

script.txt contains the raw movie script.
parse-rule.txt and parse-trfr.txt contain the parsed output of the movie script. The parsed output assigns a single structural label to each script line. The labels are S (slugline), N (description), C (character), D (utterance), E (utterance expression), T (transition), M (metadata), or O (other, usually blank lines). parse-rule.txt was created by a rule-based parser, and parse-trfr.txt by a transformer-based parser. We recommend using parse-trfr.txt because it is more accurate; parse-rule.txt is provided for the sake of comparison.
imdb.json contains some basic metadata about the movie. This information has been obtained from the IMDB website. It contains the cast list, character names, title, genres, year of production, earnings, etc.
clusters.json contains the coreference clusters of the characters of the movie.
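
The per-line labels can be paired with the script text as sketched below. This is an illustration under stated assumptions, not the repository's own loader: it assumes that a parse file holds one label per line, in the same order as the lines of script.txt, and that the files are UTF-8 encoded. Neither detail is spelled out above. The names LABELS, pair_lines, lines_with_label, and load_movie are hypothetical.

```python
from pathlib import Path

# Label meanings as listed in the description of the parse files.
LABELS = {
    "S": "slugline",
    "N": "description",
    "C": "character",
    "D": "utterance",
    "E": "utterance expression",
    "T": "transition",
    "M": "metadata",
    "O": "other",
}


def pair_lines(script_text, parse_text):
    """Pair each script line with its structural label.

    Assumption: the parse text has one label per line, aligned with the
    script lines. Raises ValueError if the line counts differ or if a
    label outside LABELS appears, since either means the assumption
    does not hold for this file.
    """
    lines = script_text.splitlines()
    labels = [tag.strip() for tag in parse_text.splitlines()]
    if len(lines) != len(labels):
        raise ValueError(
            f"{len(lines)} script lines but {len(labels)} labels"
        )
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return list(zip(labels, lines))


def lines_with_label(pairs, label):
    """Return the script lines that carry the given label."""
    return [line for tag, line in pairs if tag == label]


def load_movie(movie_dir, parser="trfr"):
    """Read script.txt and parse-<parser>.txt from one movie subdirectory."""
    movie_dir = Path(movie_dir)
    # UTF-8 is an assumption; adjust if the files use another encoding.
    script = (movie_dir / "script.txt").read_text(encoding="utf-8")
    parse = (movie_dir / f"parse-{parser}.txt").read_text(encoding="utf-8")
    return pair_lines(script, parse)
```

With this in place, lines_with_label(load_movie(path), "D") would collect the utterance lines of one movie, and passing parser="rule" would read parse-rule.txt instead for comparison.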

We use the Movie Screenplay Parser to parse the scripts, and the Character Coreference Resolution models to find the coreference clusters.

Repository: usc-sail/mica-scriptsonscreen-scripts
