
CrawlerFlow

A web crawler orchestration framework that lets you create datasets from multiple web sources using YAML configurations.

Features

  • [x] Write spiders as YAML configs (a sketch follows this list).
  • [x] Create extractors to scrape data from HTML, API, and RSS sources using YAML configs.
  • [x] Define multiple extractors per spider.
  • [x] Use standard extractors to scrape common page data such as tables, paragraphs, meta tags, and JSON-LD.
  • [ ] Traverse between multiple websites.
  • [ ] Write Python extractors for advanced extraction strategies.
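As a rough illustration of the YAML-driven approach, here is what a minimal spider config might look like. The key names (`spider_name`, `start_urls`, `extractor`, `data_selectors`) are illustrative assumptions, not the project's documented schema; see the files under example-configs/ for the real format.

# Hypothetical spider config -- all key names are assumptions, not the documented schema
spider_name: github-blog-list
start_urls:
  - https://github.blog/                    # assumed: pages the spider starts from
extractor:
  extractor_type: HTMLExtractor             # assumed: one of the extractors listed below
  data_selectors:                           # assumed: CSS selectors mapped to output fields
    title: "h2.post-title::text"
    url: "h2.post-title a::attr(href)"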

Installation

pip install git+https://github.com/invana/crawlerflow#egg=crawlerflow

Usage

Scraping with CrawlerFlow

from crawlerflow.runner import Crawlerflow
from crawlerflow.utils import yaml_to_json

# Load the crawl requests, spider definition, and default extractor from YAML files
crawl_requests = yaml_to_json(open("example-configs/crawlerflow/requests/github-detail-urls.yml"))
spider_config = yaml_to_json(open("example-configs/crawlerflow/spiders/default-spider.yml"))
github_default_extractor = yaml_to_json(open("example-configs/crawlerflow/extractors/github-blog-detail.yml"))

# Register the spider with its crawl requests and default extractor, then run the crawl
flow = Crawlerflow()
flow.add_spider_with_config(crawl_requests, spider_config, default_extractor=github_default_extractor)
flow.start()
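
The extractor file referenced above (github-blog-detail.yml) is not reproduced in this README. A hypothetical sketch of what such an extractor config could contain is shown below; all key names are assumptions for illustration, not the documented schema.

# Hypothetical extractor config in the spirit of github-blog-detail.yml
# Key names below are illustrative assumptions only
extractor_name: github-blog-detail
extractor_type: HTMLExtractor               # assumed: selects the HTML extraction strategy
data_selectors:
  title: "h1.post-title::text"
  author: "a.author-name::text"
  published_at: "time::attr(datetime)"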

Scraping with WebCrawler

from crawlerflow.runner import WebCrawler
from crawlerflow.utils import yaml_to_json

# Each file defines one spider (API or HTML) as a YAML config
scraper_config_files = [
    "example-configs/webcrawler/APISpiders/api-publicapis-org.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-list.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-detail.yml"
]

crawler = WebCrawler()

# Register every spider config, then start all of them in a single crawl
for scraper_config_file in scraper_config_files:
    scraper_config = yaml_to_json(open(scraper_config_file))
    crawler.add_spider_with_config(scraper_config)
crawler.start()
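
For contrast with the HTML spiders, an API spider config such as api-publicapis-org.yml might plausibly look like the sketch below; every key name here is an assumption for illustration, not the project's documented schema.

# Hypothetical API spider config -- key names are illustrative assumptions
spider_name: api-publicapis-org
spider_type: APISpider                      # assumed: marks this as an API spider
start_urls:
  - https://api.publicapis.org/entries
data_path: entries                          # assumed: key in the JSON response holding the records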

Refer to the example-configs/ folder for more example configs.

Available Extractors

  • HTMLExtractor
  • MetaTagExtractor
  • JSONLDExtractor
  • TableContentExtractor
  • IconsExtractor
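
Standard extractors target well-known page structures (meta tags, JSON-LD, tables), so a config that uses them plausibly needs no selectors at all. The snippet below is a hypothetical illustration of attaching several of them to one spider; the `extractors` key and its shape are assumptions, not the documented schema.

# Hypothetical: attaching multiple standard extractors to one spider
# The `extractors` key and its shape are assumptions
extractors:
  - extractor_type: MetaTagExtractor        # scrapes <meta> tags
  - extractor_type: JSONLDExtractor         # scrapes JSON-LD blocks
  - extractor_type: TableContentExtractor   # scrapes <table> contents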
