Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 0.5.0 Beta 2

27 Jan 01:32
66ce668
Pre-release

Add support for WACZ signing (experimental), enabled via WACZ_SIGN_URL and WACZ_SIGN_TOKEN env vars.
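
As a rough illustration of passing these variables into a container, the sketch below builds the `-e` flags for `docker run` from whichever of the two signing variables are set. The helper name `signing_env_flags` and the example values are hypothetical; only the variable names WACZ_SIGN_URL and WACZ_SIGN_TOKEN come from the note above.

```python
import os


def signing_env_flags(environ=None):
    """Return `docker run -e NAME=value` flags for the signing variables that are set.

    Hypothetical helper for illustration; not part of browsertrix-crawler.
    """
    environ = os.environ if environ is None else environ
    flags = []
    for name in ("WACZ_SIGN_URL", "WACZ_SIGN_TOKEN"):
        value = environ.get(name)
        if value:
            flags += ["-e", f"{name}={value}"]
    return flags


# Placeholder values; a real deployment would supply its own signing endpoint and token.
example_env = {
    "WACZ_SIGN_URL": "https://signer.example.com/sign",
    "WACZ_SIGN_TOKEN": "example-token",
}
print(signing_env_flags(example_env))
```

The resulting list can be spliced into a `docker run` invocation ahead of the image name. Since the feature is marked experimental, treat the variable semantics as subject to change.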

Browsertrix Crawler 0.5.0 Beta 1

23 Nov 21:01
9f541ab
Pre-release

Support for uploading WACZ to S3-compatible storage!

Browsertrix Crawler 0.5.0 Beta 0

25 Sep 17:10
Pre-release

Initial Build of 0.5.0 beta for testing!

Browsertrix Crawler 0.4.4

18 Aug 04:28

This release includes fixes to the block rules system and README improvements:

  • Page Block Rules Fix: Avoid 'request already handled' errors by not adding duplicate handlers to the same page.
  • Page Block Rules Fix: await all continue/abort() calls and catch errors.
  • Page Block Rules: Don't apply to top-level page, print warning and recommend scope rules instead.
  • Setup: Attempt to create the crawl working directory (cwd) specified via --cwd if it doesn't exist.
  • Scope Types: Rename 'none' -> 'page' (single page only) and 'page' -> 'page-spa' (page with hashtags).
  • README: Add more scope rule examples, clarify distinction between scope rules and block rules.
  • README: Update old type -> scopeType, list new scope types.
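
Because the scope type rename reuses the name 'page' for a different meaning, existing configs need both renames applied in a single step. A minimal sketch of such a migration, assuming a hypothetical helper `migrate_scope_type` (not part of the crawler):

```python
# Old scope type name -> new name, per the rename described in the notes above.
RENAMED_SCOPE_TYPES = {
    "none": "page",      # single page only
    "page": "page-spa",  # page with hashtags
}


def migrate_scope_type(old_name):
    """Map a pre-0.4.4 scope type name to its 0.4.4 equivalent.

    A single dictionary lookup is used so that 'none' becomes 'page'
    without then being remapped to 'page-spa'.
    """
    return RENAMED_SCOPE_TYPES.get(old_name, old_name)


for name in ("none", "page", "prefix"):
    print(name, "->", migrate_scope_type(name))
```

This mapping is not idempotent: running it on a config that already uses the new names would turn 'page' into 'page-spa', so it should only be applied once to configs written for earlier versions.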

Browsertrix Crawler 0.4.3

27 Jul 16:48
be1ee53

This release includes a bug fix for the 'block rules' system:

  • When considering the 'inFrameUrl' for a navigation request for an iframe, use URL of parent frame.
  • Always allow pywb proxy static scripts, ignoring block rules settings.
  • When 'debug' set in 'logging' options, log blocked requests and conditional iframe requests.

Browsertrix Crawler 0.4.2

24 Jul 03:03

This release includes the following fixes:

  • Compose/docs: Build latest image by default, update README to refer to latest image
  • Fix typo in crawler.capturePrefix that resulted in directFetchCapture() always failing (also catch any fails in direct fetch)
  • Tests: Update all tests to use test-crawls directory
  • extractLinks() just extracts links from default selectors, allows custom driver to filter results
  • loadPage() accepts a list of selector options with selector, extract, and isAttribute settings for further customization of link extraction

Released image published to Docker Hub at webrecorder/browsertrix-crawler:0.4.2

Browsertrix Crawler 0.4.1

22 Jul 21:34
f4c6b6a

This release includes a multi-platform build for amd64 and arm64 (Apple M1).
Other fixes and enhancements include:

  • BlockRules Optimizations: don't intercept requests if no blockRules
  • Profile Creation: Support extending existing profile by passing a --profile param to load on startup
  • Profile Creation: Set default window size to 1600x900, add --windowSize param for setting custom size
  • Behavior Timeouts: Add --behaviorTimeout to specify custom timeout for behaviors, in seconds (defaulting to 90 seconds)
  • Load Wait Default: Switch to 'load,networkidle2' to speed-up waiting for initial load
  • Multi-platform build: Support building for amd64 and Arm using oldwebtoday/chrome:91 images (check for google-chrome and chromium-browser automatically)
  • CI: Build a multi-platform (amd64 and arm64) image on each release

Browsertrix Crawler 0.4.1 Beta 1

22 Jul 03:08
Pre-release

[Testing Multi-platform building]
(Beta) Changes for 0.4.1

  • BlockRules Optimizations: don't intercept requests if no blockRules
  • Profile Creation: Support extending existing profile by passing a --profile param to load on startup
  • Behavior Timeouts: Add --behaviorTimeout to specify custom timeout for behaviors, in seconds (defaulting to 90 seconds)
  • Load Wait Default: Switch to 'load,networkidle2' to speed-up waiting for initial load
  • Multi-platform build: Support building for amd64 and Arm using oldwebtoday/chrome:91 images (check for google-chrome and chromium-browser automatically)
  • CI: Build amd64 and arm64 images on each release

Browsertrix Crawler 0.4.0

21 Jul 06:28

This release includes many new features, including:

  • YAML based config, specifiable via --config property or via stdin (with --config stdin)
  • Support for different scope types ('page', 'prefix', 'host', 'any', 'none') + crawl depth at crawl level
  • Per-Seed scoping, including different scope types, or depth and include/exclude rules configurable per seed in 'seeds' list via YAML config
  • Support for 'blockRules' for blocking certain URLs from being stored in WARCs, conditional blocking for iframe based on contents, and iframe URLs (see README for more details)
  • Interactive profile creation: creating profiles by interacting with an embedded browser loaded in the browser (see README for more details).
  • Screencasting: streaming the output of each window via websocket-based streaming, configurable with --screencastPort option
  • New 'window' based parallelization: Open each worker in new window in same session
  • Simplified custom driver config, default calls 'loadPage'
  • Refactor arg parsing, other auxiliary functions into separate utils files
  • Image customization: support for customizing browser image, e.g. building with Chromium instead of Chrome, support for ARM architecture builds (see README for more details).
  • Update to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), py-wacz (0.3.1)
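
To make the per-seed scoping and `--config stdin` items more concrete, the sketch below builds a small config with two seeds and serializes it as JSON, which a YAML parser will generally accept since JSON is, for practical purposes, a subset of YAML. The 'seeds' list, the scope type names and `--config stdin` come from the notes above; the key names `url`, `scopeType` and `depth`, the example URLs, and the exact `docker run` arguments are assumptions to check against the README.

```python
import json

# Two seeds, each with its own scope settings (key names are assumed, see lead-in).
config = {
    "seeds": [
        {"url": "https://example.com/docs/", "scopeType": "prefix", "depth": 2},
        {"url": "https://example.org/", "scopeType": "host"},
    ],
}

# JSON text that can be piped to the crawler in place of a YAML file.
config_text = json.dumps(config, indent=2)

# Assumed invocation: `-i` keeps stdin open so the config can be piped in.
command = [
    "docker", "run", "-i",
    "webrecorder/browsertrix-crawler",
    "crawl", "--config", "stdin",
]

print(config_text)
print(" ".join(command))
```

In practice the printed config would be piped into the printed command. A real run would also need a volume mount for crawl output, which is omitted here.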

Browsertrix Crawler 0.4.0 Beta 2

28 Jun 22:07
ef7d5e5
Pre-release

Support for per-seed scoping (#63)
YAML Config:

  • Fixes for behavior, other options to work with YAML config
  • Support passing YAML config via stdin

New Docker Image, support for customizing browser image (support for multi-arch builds)