
CrawlKit


A crawler based on PhantomJS. Allows discovery of dynamic content and supports custom scrapers. For all your ajaxy crawling & scraping needs.

  • Parallel crawling/scraping via Phantom pooling.
  • Custom-defined link discovery.
  • Custom-defined runners (scrape, test, validate, etc.)
  • Follows redirects — including JavaScript and <meta> redirects, since it's based on PhantomJS.
  • Streaming
  • Resilient to PhantomJS crashes
  • Ignores page errors

Install

npm install crawlkit --save

Usage

const CrawlKit = require('crawlkit');
const anchorFinder = require('crawlkit/finders/genericAnchors');

const crawler = new CrawlKit('http://your/page');
// discover further URLs to crawl via the bundled generic anchor finder
crawler.setFinder({
    getRunnable: () => anchorFinder
});

crawler.crawl()
    .then((results) => {
        // pretty-print the crawl results
        // (JSON.stringify takes a replacer as its second argument, hence null)
        console.log(JSON.stringify(results, null, 2));
    }, (err) => console.error(err));
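
The finder above returns the bundled generic anchor runnable, but link discovery is fully customizable: getRunnable can return any function, which is then evaluated inside the crawled page. A minimal sketch, assuming the window.callPhantom(error, urls) reporting convention used by the bundled finders (check the API docs for the exact Finder contract):

crawler.setFinder({
    getRunnable: () => function findLinks() {
        // This function runs inside the crawled page (an ES5 PhantomJS
        // context), so avoid arrow functions and other ES2015 syntax here.
        var urls = Array.prototype.slice
            .call(document.querySelectorAll('a[href]'))
            .map(function (anchor) { return anchor.href; });
        // Hand the discovered URLs back to CrawlKit.
        window.callPhantom(null, urls);
    }
});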

Also, have a look at the samples.
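
Runners are added in a similar fashion, and each runner's output ends up in the crawl results under the name you register it with. A hedged sketch, assuming the addRunner(key, runner) signature and the Runner interface (getCompanionFiles/getRunnable) described in the API docs:

crawler.addRunner('title', {
    // Companion files are injected into the page before the runnable executes;
    // this runner needs none.
    getCompanionFiles: () => [],
    getRunnable: () => function extractTitle() {
        // Runs inside the crawled page (ES5 context);
        // report (error, result) back to CrawlKit.
        window.callPhantom(null, document.title);
    }
});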

API

See the API docs (published) or the docs on doclets.io (live).

Debugging

CrawlKit uses debug for logging. Set the DEBUG environment variable before starting your app: DEBUG="*" emits all logs, while a saner configuration for big pages is DEBUG="*:info,*:error,-crawlkit:pool*", which silences the chatty pool logs.
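
For example (my_crawler.js stands in for your own entry point):

DEBUG="*:info,*:error,-crawlkit:pool*" node my_crawler.js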

Contributing

Please contribute away :)

Please add tests for new functionality and adapt existing tests when behavior changes.

Commit messages need to follow the conventional changelog format so that semantic-release can derive the correct semver version. The easiest way is to install commitizen via npm install -g commitizen and commit your changes via git cz.
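
For example, a hypothetical commit message like the following would trigger a minor (feature) release:

feat(finder): support custom link discovery timeouts

while a fix(...) message would trigger a patch release.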

Available runners

Products using CrawlKit
