
Vietnamese NLP Scrapy

Codebase to crawl data from major Vietnamese websites

To Do

News websites:

Forums:

Installation

Libraries

pip install -r requirements.txt

Tech Stack

  • ElasticSearch + Kibana
  • MongoDB

Install Docker and Docker Compose. Note: edit the following line in the Compose file to set the host path mounted as the ElasticSearch data volume.

volumes:
    - ./esdata:/home/lap15363/elasticsearch/data

To start the ElasticSearch + Kibana services:

docker-compose up -d

Wait about a minute for the services to start (default host: localhost). They are exposed on the following ports:

  • ElasticSearch: 9200
  • Kibana: 5601
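
Before starting a crawl, you can check that ElasticSearch is reachable by querying its root endpoint. This is a quick sketch (not part of the repository), assuming the default port above.

import requests

# Query the ElasticSearch root endpoint on the default port and print the
# reported version; raises if the node is not up yet.
resp = requests.get("http://localhost:9200")
resp.raise_for_status()
print(resp.json()["version"]["number"])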

Setting up spiders and settings

Spider:

Spiders contain the implementation of how we crawl pages.

You can use a SitemapSpider to parse links from a site's sitemap, if one is available. See the example in crawler/spiders/thanhnien.py.
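
As a rough illustration, here is a minimal SitemapSpider sketch; the spider name, URL, and selectors are placeholders, and the real implementation is the one in crawler/spiders/thanhnien.py.

from scrapy.spiders import SitemapSpider


class NewsSitemapSpider(SitemapSpider):
    # Hypothetical spider name and sitemap URL, for illustration only.
    name = "news_sitemap_example"
    sitemap_urls = ["https://example.com/sitemap.xml"]
    # Only sitemap entries matching "/news/" are routed to parse_article.
    sitemap_rules = [("/news/", "parse_article")]

    def parse_article(self, response):
        # Selectors are placeholders; real ones depend on the site's markup.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "content": " ".join(response.css("article p::text").getall()),
        }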

Other types of spiders are described in the Scrapy documentation.

For a custom spider that builds its own requests, check out crawler/spiders/tv4u.py.
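
Below is a minimal sketch of a hand-written spider that issues its own requests; the URLs and selectors are hypothetical and only illustrate the general shape of crawler/spiders/tv4u.py.

import scrapy


class CustomSpider(scrapy.Spider):
    name = "custom_example"  # hypothetical name

    def start_requests(self):
        # Build the first requests by hand instead of relying on a sitemap.
        for page in range(1, 4):
            url = f"https://example.com/list?page={page}"  # placeholder URL
            yield scrapy.Request(url, callback=self.parse_list)

    def parse_list(self, response):
        # Follow every article link found on the listing page.
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}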

Items:

Items define the schema of the data collected on each crawl.

Refer to Items implemented in crawler/items.py
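
A minimal item schema might look like the following sketch; the field names are assumptions, and the actual classes live in crawler/items.py.

import scrapy


class ArticleItem(scrapy.Item):
    # Fields typical of a news article; names are assumptions.
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    published_at = scrapy.Field()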

Exporters:

Exporters are where we implement how items are exported.

This is where we write the logic to connect to a database and insert the crawled items.
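
For illustration, here is a sketch of an exporter that writes items to MongoDB with pymongo; the connection URI, database, and collection names are placeholders, not the repository's actual exporter.

import pymongo


class MongoExporter:
    """Sketch of an exporter that inserts crawled items into MongoDB."""

    def __init__(self, uri="mongodb://localhost:27017", db="nlp_crawler"):
        # URI, database, and collection names are placeholders.
        self.client = pymongo.MongoClient(uri)
        self.collection = self.client[db]["articles"]

    def export_item(self, item):
        # Store the item as a plain document.
        self.collection.insert_one(dict(item))

    def close(self):
        self.client.close()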

Middlewares

Middlewares are intermediaries between the spiders and the target site. Here we can add middlewares such as:

  • Random User-Agent middleware: randomly select a user agent for each request sent to the site.
  • Proxy middleware: randomly select a proxy for each request.
  • Retry middleware: decide how to retry after a failed connection.

You can also use existing libraries instead of writing these middlewares yourself (for example, the rotating_free_proxies middlewares that appear commented out in the settings below).
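
As an illustration, here is a minimal sketch of a random User-Agent downloader middleware; the agent list is a placeholder and this is not the middleware shipped in this repository.

import random


class RandomUserAgentMiddleware:
    # A tiny hard-coded pool of user agents; real lists are usually larger.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Pick a random agent for every outgoing request; returning None
        # lets the request continue through the middleware chain.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)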

Pipelines

Pipelines define how items are processed after they are crawled and what happens to them next. Here we combine Exporters and Items to build the pipelines.

Refer to ESPipeline in crawler/pipelines.py
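
For reference, this sketch shows the general shape of an item pipeline that indexes into ElasticSearch with the official client; the class and index names are hypothetical, and the actual implementation is ESPipeline in crawler/pipelines.py.

from elasticsearch import Elasticsearch


class ESPipelineSketch:
    def open_spider(self, spider):
        # Read hosts from the ELASTIC_HOSTS setting shown further below.
        self.es = Elasticsearch(hosts=spider.settings.get("ELASTIC_HOSTS"))

    def process_item(self, item, spider):
        # Index each crawled item into a per-spider index (hypothetical naming).
        self.es.index(index=f"crawl-{spider.name}", document=dict(item))
        return item

    def close_spider(self, spider):
        self.es.close()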

Settings

The settings.py file is where we configure all of the above components.

Some important configurations:

DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# MAXIMUM TIME FOR A REQUEST
DOWNLOAD_TIMEOUT = 10

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False
# Number of retry times if a request failed
RETRY_TIMES = 10

To use middlewares, assign each one an order number; setting a middleware to None disables it.

DOWNLOADER_MIDDLEWARES = {
    "crawler.middlewares.CrawlerAgentMiddleware": 100,
    # 'rotating_free_proxies.middlewares.RotatingProxyMiddleware': 200,
    # 'rotating_free_proxies.middlewares.BanDetectionMiddleware': 300,
    # "crawler.middlewares.CrawlerProxyMiddleware": 200,
    # "crawler.middlewares.CrawlerRetryMiddleware": 300,
    # 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

Pipelines:

ITEM_PIPELINES = {
   "crawler.pipelines.CrawlerPipeline": 100,
   "crawler.pipelines.ESPipeline": 200,
}

Logging:

LOG_LEVEL = 'INFO'  # DEBUG for debug mode, ERROR for only display errors
LOG_FORMAT = '%(levelname)s: %(message)s'
LOG_FILE = 'crawl.log' # Logging filename

ElasticSearch config (used in the ESExporters):

ELASTIC_HOSTS = [
    {'host': 'localhost', 'port': 9200, "scheme": "http"},
]

Scrapy

To start crawling, we use:

cd crawler/crawler
scrapy crawl <spider name> --set JOBDIR=<job name>

Example:

scrapy crawl thanhnien --set JOBDIR=thanhnien

To pause crawling, press Ctrl + C (only once); to resume, rerun the command above. Scrapy automatically saves state in the JOBDIR folder so it can pick up the URLs that have not yet been crawled.
