This is the second version of the comic scraper. It is a complete rewrite of the original scraper, written in TypeScript, which can be found here.
This project is mainly used in UltimateComic.
The scraper scrapes websites, extracts comics and stores them in a database.
I am still working on additional features. Here are some of the features I want to implement:
- ViewComics support
- ReadComicOnline support
- ReadComicsOnline support
- ComicExtra support
- Scrape single comic by providing name
- Scrape single comic by providing URL
- Scrape all comics from a genre
- Scrape all comics from a publisher
- Scrape all comics from an author
- Remove watermark from images
- Upload scraped images to a cloud storage
Since this is a Python project, you need to have Python installed on your system. You can download it here.
After you have installed Python, clone this repository to your computer and navigate into its directory.
The preferred way is to create a virtual environment before installing dependencies via `pip`. You can learn more about virtual environments for Python here.
While you are in the main directory of this repository, run the following command:
```shell
python -m venv venv
```
This will create a folder called `venv`. After creating the virtual environment, you need to activate it. On Windows the `venv` folder contains a folder called `Scripts` (on Linux and macOS it is called `bin`), which holds the scripts to activate and deactivate the virtual environment. Run the `activate` script for your operating system.
For Windows (Command Prompt), run the following command:
```shell
venv\Scripts\activate
```
If you are using PowerShell, run the following command:
```shell
venv\Scripts\Activate.ps1
```
For Linux and macOS, run the following command:
```shell
source venv/bin/activate
```
After activating the virtual environment, you need to install the dependencies. You can do that by running the following command:
```shell
pip install -r requirements.txt
```
This will install all the required dependencies to your virtual environment.
Before using the scraper, you need to fill in some information in the `.env` file. A sample `.env` file is provided in the repository; you can copy it and rename it to `.env`. The `.env` file contains the following variables:
- `HOST`: The host of the MySQL database.
- `USER`: The username for the MySQL database.
- `PASSWORD`: The password for the MySQL database.
- `DATABASE`: The name of the MySQL database.
- `SSL_CERT`: Mainly for a PlanetScale database. It defaults to the `cacert.pem` file located in the root directory of this repository. You can leave it empty if you are not using PlanetScale.
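For reference, a filled-in `.env` might look like the fragment below. All values here are placeholders for illustration; use the credentials of your own MySQL database.

```
HOST=localhost
USER=root
PASSWORD=change-me
DATABASE=ultimatecomic
SSL_CERT=cacert.pem
```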
If you just want to scrape comic(s) and don't want to save them to a MySQL database, you can leave the `.env` file empty and remove `AddToDatabasePipeline` from `ITEM_PIPELINES` in `settings.py`:
```diff
 ITEM_PIPELINES = {
     'ultimatescraper.pipelines.ValidateItemPipeline.ValidateItemPipeline': 100,
-    'ultimatescraper.pipelines.AddToDatabasePipeline.AddToDatabasePipeline': 200,
 }
```
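The pipelines in `ITEM_PIPELINES` run in ascending order of their priority number (100 before 200), and each one implements the same small interface. As a rough sketch of that interface, a validation pipeline might look like this — the field names are hypothetical, and a local `DropItem` stands in for `scrapy.exceptions.DropItem` so the sketch runs without Scrapy installed:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, which a real pipeline would raise."""


class ValidateItemPipeline:
    # Hypothetical required fields; the actual project may check different ones.
    REQUIRED_FIELDS = ("title", "url")

    def process_item(self, item, spider):
        # Scrapy calls process_item for every scraped item. Returning the item
        # passes it on to the next pipeline (e.g. AddToDatabasePipeline);
        # raising DropItem discards it.
        for field in self.REQUIRED_FIELDS:
            if not item.get(field):
                raise DropItem(f"missing required field: {field}")
        return item
```

Because each pipeline either returns the item or drops it, removing `AddToDatabasePipeline` from the dict simply ends the chain after validation.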
To use the scraper, you run Scrapy CLI commands. You can learn more about the Scrapy CLI here.
For example, to run ViewComics scraper, you need to run the following command:
```shell
python -m scrapy crawl ViewComics
```
This will scrape ViewComics website and save all the processed items in the database.
You can also scrape a single comic by providing the URL of the comic. For example, to scrape The Walking Dead comic, you need to run the following command:
```shell
python -m scrapy crawl ViewComics -a comic=https://viewcomics.me/comic/the-walking-dead
```
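Scrapy passes each `-a key=value` pair to the spider's constructor as a keyword argument, which is how the `comic` option above reaches the spider. A hedged sketch of how a spider might handle it — the class, listing URL, and fallback behavior are assumptions for illustration (a real spider would subclass `scrapy.Spider`); it is written as plain Python so it runs without Scrapy installed:

```python
class ViewComicsSpider:
    name = "ViewComics"
    # Hypothetical listing page used when no single comic is requested.
    DEFAULT_START_URLS = ["https://viewcomics.me/comic-list"]

    def __init__(self, comic=None, **kwargs):
        # Scrapy forwards `-a comic=<URL>` here as the `comic` keyword.
        # With it, scrape only that comic; without it, crawl the full listing.
        if comic:
            self.start_urls = [comic]
        else:
            self.start_urls = list(self.DEFAULT_START_URLS)
```

With this pattern, `crawl ViewComics` scrapes the whole site while `crawl ViewComics -a comic=<URL>` narrows the run to one comic.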