license |
---|
mit |
Scrape and convert movie scripts from https://imsdb.com to EPub files.
The scraping is adapted from https://huggingface.co/datasets/mattismegevand/IMSDb/tree/main.
The EPub creation is largely done via calibre's
ebook-convert
CLI-tool, so make sure to have calibre
installed and linked against your shell.
The conversion script was sketched together in roughly 30 minutes, and the source material is basically whitespace-formatted plaintext.
Hence, the quality of the resulting EPubs varies drastically ranging from perfectly enjoyable over somewhat readable to being utter junk.
First, install the (few) Python requirements with:
pip install -r requirements.txt
Then, install calibre using your preferred way.
To scrape all movie scripts from the website, run the scrape.py
script:
python scrape.py
This will create a data_html.jsonl
file within the project's root.
To convert the scraped scripts to EPub files, run the convert2epub.py
script afterwards:
python convert2epub.py
This script will create two new folders:
epub
contains the EPub files.html
contains the html files in two version (before and after preprocessing them for the conversion).
The HTML-files can be helpful, for manually sanity checking.
By executing the download_posters.py
script, available posters will be downloaded and saved in the poster
directory.
The EPub-conversion checks automatically if this directory is available and will use the posters as book cover whenever possible.