millions-crawler

This is Homework III of the NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION; the target is to build a crawler application that crawls millions of webpages.


Part of the homework:

Medium Article

Homework Scope

  1. Crawl millions of webpages
  2. Remove non-HTML pages (a filtering sketch follows this list)
  3. Performance optimization
    • How many pages can be crawled per hour
    • Total time to crawl millions of pages
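The repository itself does not show how non-HTML pages are dropped; the following is a minimal sketch of one common approach using a Scrapy downloader middleware. The class name HtmlOnlyMiddleware and its placement in middlewares.py are assumptions, not taken from the repo.

# edit middlewares.py (hypothetical example)
from scrapy.exceptions import IgnoreRequest

class HtmlOnlyMiddleware:
    """Drop any response whose Content-Type is not HTML."""

    def process_response(self, request, response, spider):
        content_type = response.headers.get("Content-Type", b"").decode("utf-8", "ignore")
        if "text/html" not in content_type:
            # PDFs, images, archives, etc. never reach the spider's parse callback
            raise IgnoreRequest(f"non-HTML response skipped: {response.url}")
        return response
# register it in settings.py under DOWNLOADER_MIDDLEWARES, like the user-agent middleware below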

Project architecture

Distributed architecture

[Figure: distributed architecture diagram]
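The references below point to scrapy-redis, which coordinates several spider processes through a shared Redis queue. A minimal sketch of the settings that enable this; the Redis URL is an assumption, not taken from the repo:

# edit settings.py (illustrative scrapy-redis configuration)
# Requests and the duplicate filter live in Redis, so every spider process
# pulls from the same frontier and no URL is crawled twice.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                  # keep the queue between runs
REDIS_URL = "redis://localhost:6379"      # assumed local Redis instance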

Each spider (a generic parse-loop sketch follows the flowcharts below)

[Figure: single-spider architecture]

Spider for 台灣 E 院 (tweh)

[Figure: tweh parse flowchart]

Spider for 問 8 健康諮詢 (w8h)

[Figure: w8h parse flowchart]

Spider for Wiki (wiki)

[Figure: wiki parse flowchart]
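All three flowcharts share the same loop: fetch a page, extract its fields, and queue newly discovered links. A minimal sketch of that loop, assuming scrapy-redis is used and with hypothetical field names and selectors (the real spiders live in the repository's spiders package):

# hypothetical spider skeleton; selectors and field names are assumptions
from scrapy_redis.spiders import RedisSpider

class ExampleSpider(RedisSpider):
    name = "example"
    redis_key = "example:start_urls"   # seed URLs are pushed into this Redis list

    def parse(self, response):
        # 1) extract whatever the site-specific flowchart calls for
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # 2) follow links so the crawl frontier keeps growing
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)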

Anti-Anti-Spider

  1. Ignore robots.txt
# edit settings.py
ROBOTSTXT_OBEY = False
  2. Use a random User-Agent
pip install fake-useragent
# edit middlewares.py
from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class FakeUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        super().__init__(user_agent)
        self.ua = UserAgent()  # build the user-agent pool once, not on every request

    def process_request(self, request, spider):
        # attach a freshly picked random User-Agent string to each outgoing request
        request.headers['User-Agent'] = self.ua.random
# edit settings.py
DOWNLOADER_MIDDLEWARES = {
    "millions_crawler.middlewares.FakeUserAgentMiddleware": 543,
}

Result

Single spider (2023/03/21)

Spider    Total Pages    Total Time (hrs)    Pages per Hour
tweh      152,958        1.3                 117,409
w8h       4,759          0.1                 32,203
wiki*     13,000,320     43                  30,240

Distributed spiders (4 spiders, 2023/03/24)

Spider    Total Pages    Total Time (hrs)    Pages per Hour
tweh      153,288        0.52                -
w8h       4,921          0.16                -
wiki*     4,731,249      43.2                109,492

How to use

  1. Create a .env file
bash create_env.sh
  2. Install Redis
sudo apt-get install redis-server
  3. Install MongoDB (a sketch of an item pipeline that writes to it follows this list)
sudo apt-get install mongodb
  4. Run Redis
redis-server
  5. Run MongoDB
sudo service mongod start
  6. Run a spider
cd millions-crawler
scrapy crawl [$spider_name] # $spider_name = tweh, w8h, wiki
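How crawled items end up in MongoDB is not shown above; the following is a minimal sketch of a Scrapy item pipeline that stores each item in MongoDB. The URI, database name, and per-spider collection layout are assumptions, not taken from the repo.

# edit pipelines.py (illustrative sketch)
import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri="mongodb://localhost:27017", mongo_db="millions_crawler"):
        self.mongo_uri = mongo_uri    # assumed local MongoDB instance
        self.mongo_db = mongo_db      # assumed database name

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one collection per spider (tweh, w8h, wiki)
        self.db[spider.name].insert_one(dict(item))
        return item
# edit settings.py
ITEM_PIPELINES = {
    "millions_crawler.pipelines.MongoPipeline": 300,
}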

Requirements

pip install -r requirements.txt

Reference

  1. GitHub | fake-useragent
  2. GitHub | scrapy
  3. 【Day 20】Anti-Anti-Crawler (反反爬蟲)
  4. Documentation of Scrapy
  5. Solving the Redis MISCONF error: Redis is configured to save RDB snapshots, but is currently not able to persist o...
  6. Ubuntu Linux Redis installation and configuration tutorial with examples
  7. How to connect to a remote Linux + MongoDB server?
  8. Scrapy-redis: the final chapter
