Python crawler for news

A crawler for real-time news from Taiwanese news websites, built with Python and Scrapy.

TODO LIST

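Clean up settings and CI/CD
Deploy LINE Notify
Deploy with k8s to GKC, or is a VM enough?
Implement multithreading to run the crawlers concurrently, using a Python script
Consider implementing a Docker install shell script
Keep fixing bugs
Implement a one-off full-site crawler (for production)
Eliminate the TODOs
Write a database cleanup crawler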

TODO website

The list is taken from the Alexa Taiwan ranking.

[update 2022/3] Alexa Taiwan ranking

! Alexa has shut down; a replacement ranking source will be chosen later.

自由時報 (Liberty Times)
- [2022/12/30] Updated
東森新聞 (EBC News)
- [2022/12/30] Updated
聯合新聞網 (udn)
- [2022/12/30] Updated
今日新聞 (NOWnews)
- [2023/01/03] Updated
ettoday
- [2023/01/03] Updated
[NEW] 巴哈姆特電玩資訊站 (Bahamut)
- TODO
風傳媒 (Storm Media)
- TODO
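[Is the company still operating?] 蘋果新聞網 (Apple Daily)
- [2022/12] Not yet checked
- Requires JavaScript
- Cannot use cookies or sessions
- The overall news format is non-standard, e.g. the article timestamp
中時電子報 (China Times)
- [2023/01/03] Updated
今周刊 (Business Today)
- [2022/12] Not yet checked
- May need JavaScript
- Not real-time news
- Mostly business news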
TVBS
- [2023/01/04] Updated
商業週刊 (Business Weekly)
- [2022/12] Not yet checked
- Not real-time news
- Mostly business news
三立新聞網 (SETN)
- [2023/01/03] Updated
[NEW] 民視新聞 (FTV News)
- [2022/12] Not yet checked
中央通訊社 (CNA)
- [2023/01/04] Updated
關鍵評論網 (The News Lens)
- [2022/12] Not yet checked
- Not real-time news

Crawler step

Request the real-time news lists.
Request each news page found in the lists from the previous step.
Parse the HTML and extract the target values (item.py).
- url
- article_from
- article_type
- title
- publish_date
- authors
- tags
- text
- text_html
- images
- video
- links
Save the item into the database (pipelines.py).
- Cassandra is used by default
- [TODO][feature] Use Mongo or MySQL
Done
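
The parse and save steps above can be illustrated with a minimal, standard-library-only sketch. It is not the project's code: the real project uses Scrapy items (item.py) and a Cassandra pipeline (pipelines.py). Here `NewsItem`, `ArticleParser`, `parse_article`, and `save_item` are hypothetical names, the HTML selectors (`h1`, `p`, `img`, `a`) are assumptions about a generic article page, and an in-memory SQLite database stands in for Cassandra.

```python
import sqlite3
from dataclasses import asdict, dataclass, field
from html.parser import HTMLParser
from typing import List


@dataclass
class NewsItem:
    """Hypothetical container mirroring the fields listed above."""
    url: str
    article_from: str
    article_type: str = ""
    title: str = ""
    publish_date: str = ""
    authors: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    text: str = ""
    text_html: str = ""
    images: List[str] = field(default_factory=list)
    video: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)


class ArticleParser(HTMLParser):
    """Collects the <h1> title, <p> text, image sources and link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self.images = []
        self.links = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("h1", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def parse_article(url, source, html):
    """Step 3: parse one news page into a NewsItem."""
    parser = ArticleParser()
    parser.feed(html)
    return NewsItem(
        url=url,
        article_from=source,
        title=parser.title.strip(),
        text="\n".join(p.strip() for p in parser.paragraphs if p.strip()),
        text_html=html,
        images=parser.images,
        links=parser.links,
    )


def save_item(conn, item):
    """Step 4: store the item (SQLite stand-in for the Cassandra pipeline)."""
    row = asdict(item)
    # List fields are stored as comma-joined text to keep the schema simple.
    row = {k: ",".join(v) if isinstance(v, list) else v for k, v in row.items()}
    columns = ", ".join(row)
    placeholders = ", ".join(":" + k for k in row)
    conn.execute(f"CREATE TABLE IF NOT EXISTS news ({columns}, PRIMARY KEY (url))")
    conn.execute(
        f"INSERT OR REPLACE INTO news ({columns}) VALUES ({placeholders})", row
    )


SAMPLE_HTML = (
    "<html><body><h1>Sample headline</h1>"
    '<p>First <a href="/more">paragraph</a>.</p>'
    '<img src="/a.jpg"></body></html>'
)
item = parse_article("https://example.com/news/1", "example", SAMPLE_HTML)
conn = sqlite3.connect(":memory:")
save_item(conn, item)
```

Using the URL as the primary key means that crawling the same article twice overwrites the earlier row instead of duplicating it; whether the real pipeline behaves this way is not stated here.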

Requirement Install

Development Environment

python 3.7.6
scrapy >= 2.0.0
Cassandra 3.11.4
Developed mainly on macOS

python scrapy

    pip install scrapy
    # or
    pip3 install scrapy

Install Cassandra Database

macOS

    brew install cassandra

python extension

    pip install cassandra-driver
    # or
    pip3 install cassandra-driver

Start Cassandra

    # start in the foreground
    cassandra -f

    # start in the background

Install MySQL Database

macOS

    brew install mysql

python extension

    pip install PyMySQL
    # or
    pip3 install PyMySQL

RUN Project

Run all spiders in a local terminal

    ./run_spiders.sh
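
The contents of run_spiders.sh are not shown here. As a rough Python equivalent, and assuming the script simply launches each spider with `scrapy crawl <name>`, the sketch below runs several spiders concurrently from a thread pool. The helper names `build_command`, `run_spider`, and `run_all` are hypothetical, and `ettoday` is the only spider name taken from this README.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def build_command(spider: str) -> List[str]:
    """Return the command line that starts one spider."""
    return ["scrapy", "crawl", spider]


def run_spider(spider: str, runner: Callable = subprocess.run) -> int:
    """Run one spider to completion and return its exit code."""
    completed = runner(build_command(spider), check=False)
    return completed.returncode


def run_all(
    spiders: Sequence[str],
    max_workers: int = 4,
    runner: Callable = subprocess.run,
) -> Dict[str, int]:
    """Run the given spiders concurrently; map each name to its exit code."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda name: run_spider(name, runner), spiders))
    return dict(zip(spiders, codes))


# Example (requires scrapy on PATH, run from the project root):
#     run_all(["ettoday"])
```

Each spider runs in its own process, so the threads only wait on subprocesses; a failing spider is reported through its exit code rather than stopping the others.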

Run in Docker using docker-compose.yml

Build the Docker image

    docker build . -t crawler_news

To run the crawler without a database, modify docker/setting.py and rebuild the image.

    # run without a database (Linux shell command)
    docker run --rm -it -v `pwd`/tmp:/src/tmp -v `pwd`/log:/src/log crawler_news

To run a single crawler, modify the Dockerfile and rebuild the image.

    CMD ["/bin/bash"]
    # or specify a crawler
    CMD ["scrapy", "crawl", "ettoday"]

Run docker-compose

    # start
    docker-compose up -d

    # stop
    docker-compose down

License

SecondDim/crawler-news
