
CrawlerFlow

A web crawler orchestration framework that lets you create datasets from multiple web sources using YAML configurations.

Features

  • [x] Write spiders as YAML configs (a sketch follows this list).
  • [x] Create extractors to scrape data from HTML, API, and RSS sources using YAML configs.
  • [x] Define multiple extractors per spider.
  • [x] Use standard extractors to scrape common page data such as tables, paragraphs, meta tags, and JSON-LD.
  • [ ] Traverse between multiple websites.
  • [ ] Write Python extractors for advanced extraction strategies.
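As a rough illustration of the YAML-driven approach, here is what a minimal spider config might look like. The key names (`spider_name`, `start_urls`, `extractor`, `data_selectors`) are illustrative assumptions, not the project's documented schema; see the files under example-configs/ for the real format.

# Hypothetical spider config -- all key names are assumptions, not the documented schema
spider_name: github-blog-list
start_urls:
  - https://github.blog/                    # assumed: pages the spider starts from
extractor:
  extractor_type: HTMLExtractor             # assumed: one of the extractors listed below
  data_selectors:                           # assumed: CSS selectors mapped to output fields
    title: "h2.post-title::text"
    url: "h2.post-title a::attr(href)"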

Installation

pip install git+https://github.com/invana/crawlerflow#egg=crawlerflow

Usage

Scraping with CrawlerFlow

from crawlerflow.runner import Crawlerflow
from crawlerflow.utils import yaml_to_json

# Load the crawl requests, spider definition, and default extractor from YAML files
crawl_requests = yaml_to_json(open("example-configs/crawlerflow/requests/github-detail-urls.yml"))
spider_config = yaml_to_json(open("example-configs/crawlerflow/spiders/default-spider.yml"))
github_default_extractor = yaml_to_json(open("example-configs/crawlerflow/extractors/github-blog-detail.yml"))

# Register the spider with its crawl requests and default extractor, then run the crawl
flow = Crawlerflow()
flow.add_spider_with_config(crawl_requests, spider_config, default_extractor=github_default_extractor)
flow.start()
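
The extractor file referenced above (github-blog-detail.yml) is not reproduced in this README. A hypothetical sketch of what such an extractor config could contain is shown below; all key names are assumptions for illustration, not the documented schema.

# Hypothetical extractor config in the spirit of github-blog-detail.yml
# Key names below are illustrative assumptions only
extractor_name: github-blog-detail
extractor_type: HTMLExtractor               # assumed: selects the HTML extraction strategy
data_selectors:
  title: "h1.post-title::text"
  author: "a.author-name::text"
  published_at: "time::attr(datetime)"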

Scraping with WebCrawler

from crawlerflow.runner import WebCrawler
from crawlerflow.utils import yaml_to_json

# Each file defines one spider (API or HTML) as a YAML config
scraper_config_files = [
    "example-configs/webcrawler/APISpiders/api-publicapis-org.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-list.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-detail.yml"
]

crawler = WebCrawler()

# Register every spider config, then start all of them in a single crawl
for scraper_config_file in scraper_config_files:
    scraper_config = yaml_to_json(open(scraper_config_file))
    crawler.add_spider_with_config(scraper_config)
crawler.start()
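
For contrast with the HTML spiders, an API spider config such as api-publicapis-org.yml might plausibly look like the sketch below; every key name here is an assumption for illustration, not the project's documented schema.

# Hypothetical API spider config -- key names are illustrative assumptions
spider_name: api-publicapis-org
spider_type: APISpider                      # assumed: marks this as an API spider
start_urls:
  - https://api.publicapis.org/entries
data_path: entries                          # assumed: key in the JSON response holding the records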

Refer to the example-configs/ folder for more example configs.

Available Extractors

  • HTMLExtractor
  • MetaTagExtractor
  • JSONLDExtractor
  • TableContentExtractor
  • IconsExtractor
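
Standard extractors target well-known page structures (meta tags, JSON-LD, tables), so a config that uses them plausibly needs no selectors at all. The snippet below is a hypothetical illustration of attaching several of them to one spider; the `extractors` key and its shape are assumptions, not the documented schema.

# Hypothetical: attaching multiple standard extractors to one spider
# The `extractors` key and its shape are assumptions
extractors:
  - extractor_type: MetaTagExtractor        # scrapes <meta> tags
  - extractor_type: JSONLDExtractor         # scrapes JSON-LD blocks
  - extractor_type: TableContentExtractor   # scrapes <table> contents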
