async-ptt-crawler

This project scrapes/crawls post content and comments from PTT website, and implements neural CKIP Chinese NLP tools on the scraped data asynchronously.

Documentation

1. Installation

Python version
- python >= 3.10

Clone repository

git clone git@github.com:Taiwan-Social-Media-Corpus/async-scraptt.git

Install Requirement

pip

cd async-scraptt && pip install -r requirements.txt

poetry

cd async-scraptt && poetry shell && poetry install

2. Usage

Commands

scrapy crawl ptt -a boards=BOARDS [-a fetch_all=BOOLEAN]
            [-a index_from=NUMBER -a index_to=NUMBER]
            [-a scrape_from=YEAR] [-a data_dir=PATH]

positional arguments:
-a boards=BOARDS                          ptt board name (e.g. Soft_Job)
-a index_from=NUMBER -a index_to=NUMBER   html index number from a ptt board
-a fetch_all=BOOLEAN                     scrap all posts if true
-a scrape_from=YEAR                             scrap all posts from a given year
-a data_dir=PATH                          output file path (default: ./data)

Crawl all the posts of a board:

scrapy crawl ptt -a boards=Soft_Job -a fetch_all=True

Crawl all the posts of a board from a year in the past:

scrapy crawl ptt -a boards=Soft_Job -a scrape_from=2022

Crawl the posts of a board based on html indexes:
```
scrapy crawl ptt -a boards=Soft_Job -a index_from=1722 -a index_to=1723
```
Please make sure the number of index_from is greater than index_to.
Crawl the posts of multiple boards. For example:
```
scrapy crawl ptt -a boards=Soft_Job,Gossiping -a index_from=1722 -a index_to=1723
```
Note: the comma in the argument boards cannot have spaces. It cannot be boards=Soft_Job, Baseball or boards=["Soft_Job", "Baseball"].

3. Running Applications with Docker

A Docker setup is provided for the crawler.

To run the crawler, go to the docker-compose.yml file to edit the command:

version: "3"

services:
  scraptt:
    build: .
    # define your crawler here!
    command: bash -c "poetry run scrapy crawl ptt -a boards=Soft_Job -a index_from=1500 -a index_to=1500"

    volumes:
      - "./data:/app/data"

Feel free to change the command as long as it follows the command format as stated above.

Now start the crawler:

docker compose up

Contact

If you have any suggestion or question, please do not hesitate to email us at shukai@gmail.com or lixingyang.dev@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 207 Commits
scraptt		scraptt
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
scrapy.cfg		scrapy.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scraptt

scraptt

.gitignore

.gitignore

.pre-commit-config.yaml

.pre-commit-config.yaml

Dockerfile

Dockerfile

LICENSE

LICENSE

README.md

README.md

docker-compose.yml

docker-compose.yml

poetry.lock

poetry.lock

pyproject.toml

pyproject.toml

requirements.txt

requirements.txt

scrapy.cfg

scrapy.cfg

Repository files navigation

async-ptt-crawler

Documentation

1. Installation

2. Usage

3. Running Applications with Docker

Contact

About

Releases

Packages

Contributors 2

Languages

License

Taiwan-Social-Media-Corpus/async-scraptt

Folders and files

Latest commit

History

Repository files navigation

async-ptt-crawler

Documentation

1. Installation

2. Usage

3. Running Applications with Docker

Contact

About

Topics

Resources

License

Stars

Watchers

Forks

Languages