Releases: webrecorder/browsertrix-crawler
Browsertrix Crawler 0.5.0 Beta 2
Add support for WACZ signing (experimental), enabled via WACZ_SIGN_URL and WACZ_SIGN_TOKEN env vars.
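As a minimal sketch of how the two variables might be handed to a container, the Python snippet below builds `docker run -e` arguments for them. The helper name `signing_env_args` is hypothetical, and returning nothing unless both variables are set is an assumption of this sketch, not documented crawler behavior.

```python
import os

# Environment variable names taken from the release note above.
SIGNING_VARS = ("WACZ_SIGN_URL", "WACZ_SIGN_TOKEN")


def signing_env_args(env=None):
    """Return docker `-e` arguments for the WACZ signing variables.

    Hypothetical helper: yields an empty list unless both variables are
    set (an assumption of this sketch, not documented behavior).
    """
    env = os.environ if env is None else env
    if not all(env.get(name) for name in SIGNING_VARS):
        return []
    args = []
    for name in SIGNING_VARS:
        # `-e NAME` with no value tells docker to copy NAME from the
        # caller's environment, keeping the token off the command line.
        args += ["-e", name]
    return args
```

The returned list can be spliced into a `docker run` argument list before the image name.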
Browsertrix Crawler 0.5.0 Beta 1
Support for uploading WACZ to S3-compatible storage!
Browsertrix Crawler 0.5.0 Beta 0
Initial Build of 0.5.0 beta for testing!
Browsertrix Crawler 0.4.4
This release includes fixes to the block rules system and README improvements:
- Page Block Rules Fix: 'request already handled' errors by avoiding adding duplicate handlers to same page.
- Page Block Rules Fix: await all continue/abort() calls and catch errors.
- Page Block Rules: Don't apply to top-level page, print warning and recommend scope rules instead.
- Setup: Attempt to create the crawl working directory (cwd) specified via --cwd if it doesn't exist.
- Scope Types: Rename 'none' -> 'page' (single page only) and 'page' -> 'page-spa' (page with hashtags).
- README: Add more scope rule examples, clarify distinction between scope rules and block rules.
- README: Update old type -> scopeType, list new scope types.
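Because the old name 'page' is reused with a new meaning, configs written for earlier releases need each scope type renamed exactly once. A minimal sketch of such a migration, assuming seeds are held as dictionaries with a `scopeType` key (the helper names and the `url` key in the usage example are hypothetical):

```python
# Old (pre-0.4.4) scope type name -> 0.4.4 name, per the rename above.
SCOPE_TYPE_RENAMES = {"none": "page", "page": "page-spa"}


def migrate_scope_type(old):
    """Map a pre-0.4.4 scope type to its 0.4.4 name.

    A single dictionary lookup avoids chaining 'none' -> 'page' -> 'page-spa'.
    Scope types that were not renamed pass through unchanged.
    """
    return SCOPE_TYPE_RENAMES.get(old, old)


def migrate_seeds(seeds):
    """Return a copy of a seed list with each 'scopeType' renamed once."""
    migrated = []
    for seed in seeds:
        seed = dict(seed)  # copy so the caller's data is not mutated
        if "scopeType" in seed:
            seed["scopeType"] = migrate_scope_type(seed["scopeType"])
        migrated.append(seed)
    return migrated
```

For example, `migrate_seeds([{"url": "https://example.com/", "scopeType": "none"}])` yields a seed whose `scopeType` is `"page"`. The migration should be run only once per config, since a second pass would turn 'page' into 'page-spa'.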
Browsertrix Crawler 0.4.3
This release includes a bug fix for the 'block rules' system:
- When considering the 'inFrameUrl' for a navigation request for an iframe, use URL of parent frame.
- Always allow pywb proxy static scripts, ignoring block rules settings.
- When 'debug' set in 'logging' options, log blocked requests and conditional iframe requests.
Browsertrix Crawler 0.4.2
This release includes the following fixes:
- Compose/docs: Build latest image by default, update README to refer to latest image
- Fix typo in crawler.capturePrefix that resulted in directFetchCapture() always failing (also catch any fails in direct fetch)
- Tests: Update all tests to use test-crawls directory
- extractLinks() just extracts links from default selectors, allows custom driver to filter results
- loadPage() accepts a list of selector options with selector, extract, and isAttribute settings for further customization of link extraction
Released image published to Docker Hub at webrecorder/browsertrix-crawler:0.4.2
Browsertrix Crawler 0.4.1
This release includes a multi-platform build for amd64 and arm64 (Apple M1).
Other fixes and enhancements include:
- BlockRules Optimizations: don't intercept requests if no blockRules
- Profile Creation: Support extending existing profile by passing a --profile param to load on startup
- Profile Creation: Set default window size to 1600x900, add --windowSize param for setting custom size
- Behavior Timeouts: Add --behaviorTimeout to specify custom timeout for behaviors, in seconds (defaulting to 90 seconds)
- Load Wait Default: Switch to 'load,networkidle2' to speed up waiting for initial load
- Multi-platform build: Support building for amd64 and Arm using oldwebtoday/chrome:91 images (check for google-chrome and chromium-browser automatically)
- CI: Build a multi-platform (amd64 and arm64) image on each release
Browsertrix Crawler 0.4.1 Beta 1
[Testing Multi-platform building]
(Beta) Changes for 0.4.1
BlockRules Optimizations: don't intercept requests if no blockRules
Profile Creation: Support extending existing profile by passing a --profile param to load on startup
Behavior Timeouts: Add --behaviorTimeout to specify custom timeout for behaviors, in seconds (defaulting to 90 seconds)
Load Wait Default: Switch to 'load,networkidle2' to speed up waiting for initial load
Multi-platform build: Support building for amd64 and Arm using oldwebtoday/chrome:91 images (check for google-chrome and chromium-browser automatically)
CI: Build amd64 and arm64 images on each release
Browsertrix Crawler 0.4.0
This release includes many new features, including:
- YAML based config, specifiable via --config property or via stdin (with --config stdin)
- Support for different scope types ('page', 'prefix', 'host', 'any', 'none') + crawl depth at crawl level
- Per-Seed scoping, including different scope types, or depth and include/exclude rules configurable per seed in 'seeds' list via YAML config
- Support for 'blockRules' for blocking certain URLs from being stored in WARCs, conditional blocking for iframe based on contents, and iframe URLs (see README for more details)
- Interactive profile creation: creating profiles by interacting with embedded browser loaded in the browser (see README for more details).
- Screencasting: streaming the output of each window via websocket-based streaming, configurable with --screencastPort option
- New 'window' based parallelization: Open each worker in new window in same session
- Simplified custom driver config, default calls 'loadPage'
- Refactor arg parsing, other auxiliary functions into separate utils files
- Image customization: support for customizing browser image, eg. building with Chromium instead of Chrome, support for ARM architecture builds (see README for more details).
- Update to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), py-wacz (0.3.1)
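To illustrate the YAML config and stdin features listed above, here is a minimal sketch that renders a per-seed config and pipes it to the crawler with `--config stdin`. The 'seeds' list, `scopeType`, and `--config stdin` come from these notes; the seed keys `url` and `depth`, the `crawl` entrypoint, the untagged image name, and both helper functions are assumptions made for illustration, so check the README for the exact schema.

```python
import json
import subprocess

# Assumed invocation: `-i` keeps stdin open so docker forwards the config.
CRAWL_COMMAND = [
    "docker", "run", "-i",
    "webrecorder/browsertrix-crawler",
    "crawl", "--config", "stdin",
]


def render_config(seeds):
    """Render a list of seed dicts as a small YAML document.

    Values are written with json.dumps, whose output for strings and
    numbers is also valid YAML flow-scalar syntax.
    """
    lines = ["seeds:"]
    for seed in seeds:
        for index, (key, value) in enumerate(seed.items()):
            prefix = "  - " if index == 0 else "    "
            lines.append(f"{prefix}{key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


def run_crawl(config_text):
    """Pipe the YAML text to the crawler's stdin (requires Docker)."""
    subprocess.run(CRAWL_COMMAND, input=config_text, text=True, check=True)
```

Calling `render_config([{"url": "https://example.com/", "scopeType": "prefix", "depth": 1}])` produces a three-key seed entry under `seeds:`; passing that text to `run_crawl` would start a crawl only on a machine with Docker and the image available.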
Browsertrix Crawler 0.4.0 Beta 2
Support for per-seed scoping (#63)
YAML Config:
- Fixes for behavior, other options to work with YAML config
- Support passing YAML config via stdin
New Docker Image, support for customizing browser image (support for multi-arch builds)