This is the second version of the comic scraper. It is a complete rewrite of the original scraper, written in TypeScript, which can be found here.
This project is mainly used in UltimateComic.
The scraper scrapes websites, extracts comics and stores them in a database.
I am still working on additional features. Here are some of the features I want to implement:
- ViewComics support
- ReadComicOnline support
- ReadComicsOnline support
- ComicExtra support
- Scrape single comic by providing name
- Scrape single comic by providing URL
- Scrape all comics from a genre
- Scrape all comics from a publisher
- Scrape all comics from an author
- Remove watermark from images
- Upload scraped images to a cloud storage
Since this is a Python project, you need to have Python installed on your system. You can download it here.
After you have installed Python, clone this repository to your computer and navigate into its directory.
The preferred way is to create a virtual environment before installing dependencies via `pip`. You can learn more about virtual environments for Python here.
While you are in the main directory of this repository, run the following command:
```shell
python -m venv venv
```
This will create a folder called `venv`. After creating the virtual environment, you need to activate it. On Windows the `venv` folder contains a folder called `Scripts` (on Linux and macOS it is called `bin`), which holds the scripts to activate and deactivate the virtual environment. Run the `activate` script for your operating system.
For Windows (Command Prompt), run the following command:
```shell
venv\Scripts\activate
```
If you are using PowerShell, run the following command:
```shell
venv\Scripts\Activate.ps1
```
For Linux and macOS, run the following command:
```shell
source venv/bin/activate
```
After activating the virtual environment, you need to install the dependencies. You can do that by running the following command:
```shell
pip install -r requirements.txt
```
This will install all the required dependencies to your virtual environment.
Before using the scraper, you need to fill in some information in the `.env` file. A sample `.env` file is provided in the repository; you can copy it and rename it to `.env`. The `.env` file contains the following variables:
- `HOST`: The host of the MySQL database.
- `USER`: The username for the MySQL database.
- `PASSWORD`: The password for the MySQL database.
- `DATABASE`: The name of the MySQL database.
- `SSL_CERT`: Mainly for a PlanetScale database. It defaults to the `cacert.pem` file located in the root directory of this repository. You can leave it empty if you are not using PlanetScale.
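For reference, a filled-in `.env` might look like the fragment below. All values here are placeholders for illustration; use the credentials of your own MySQL database.

```
HOST=localhost
USER=root
PASSWORD=change-me
DATABASE=ultimatecomic
SSL_CERT=cacert.pem
```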
If you just want to scrape comic(s) and don't want to save them to a MySQL database, you can leave the `.env` file empty and remove `AddToDatabasePipeline` from `ITEM_PIPELINES` in `settings.py`:
```diff
 ITEM_PIPELINES = {
     'ultimatescraper.pipelines.ValidateItemPipeline.ValidateItemPipeline': 100,
-    'ultimatescraper.pipelines.AddToDatabasePipeline.AddToDatabasePipeline': 200,
 }
```
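The pipelines in `ITEM_PIPELINES` run in ascending order of their priority number (100 before 200), and each one implements the same small interface. As a rough sketch of that interface, a validation pipeline might look like this — the field names are hypothetical, and a local `DropItem` stands in for `scrapy.exceptions.DropItem` so the sketch runs without Scrapy installed:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, which a real pipeline would raise."""


class ValidateItemPipeline:
    # Hypothetical required fields; the actual project may check different ones.
    REQUIRED_FIELDS = ("title", "url")

    def process_item(self, item, spider):
        # Scrapy calls process_item for every scraped item. Returning the item
        # passes it on to the next pipeline (e.g. AddToDatabasePipeline);
        # raising DropItem discards it.
        for field in self.REQUIRED_FIELDS:
            if not item.get(field):
                raise DropItem(f"missing required field: {field}")
        return item
```

Because each pipeline either returns the item or drops it, removing `AddToDatabasePipeline` from the dict simply ends the chain after validation.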
To use the scraper, you run Scrapy CLI commands. You can learn more about the Scrapy CLI here.
For example, to run ViewComics scraper, you need to run the following command:
```shell
python -m scrapy crawl ViewComics
```
This will scrape ViewComics website and save all the processed items in the database.
You can also scrape a single comic by providing the URL of the comic. For example, to scrape The Walking Dead comic, you need to run the following command:
```shell
python -m scrapy crawl ViewComics -a comic=https://viewcomics.me/comic/the-walking-dead
```
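Scrapy passes each `-a key=value` pair to the spider's constructor as a keyword argument, which is how the `comic` option above reaches the spider. A hedged sketch of how a spider might handle it — the class, listing URL, and fallback behavior are assumptions for illustration (a real spider would subclass `scrapy.Spider`); it is written as plain Python so it runs without Scrapy installed:

```python
class ViewComicsSpider:
    name = "ViewComics"
    # Hypothetical listing page used when no single comic is requested.
    DEFAULT_START_URLS = ["https://viewcomics.me/comic-list"]

    def __init__(self, comic=None, **kwargs):
        # Scrapy forwards `-a comic=<URL>` here as the `comic` keyword.
        # With it, scrape only that comic; without it, crawl the full listing.
        if comic:
            self.start_urls = [comic]
        else:
            self.start_urls = list(self.DEFAULT_START_URLS)
```

With this pattern, `crawl ViewComics` scrapes the whole site while `crawl ViewComics -a comic=<URL>` narrows the run to one comic.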