Releases · webrecorder/browsertrix-crawler

Page Block Rules Fix: 'request already handled' errors by avoiding adding duplicate handlers to same page.
Page Block Rules Fix: await all continue/abort() calls and catch errors.
Page Block Rules: Don't apply to top-level page, print warning and recommend scope rules instead.
Setup: Attempt to create the crawl working directory (cwd) specified via --cwd if it doesn't exist.
Scope Types: Rename 'none' -> 'page' (single page only) and 'page' -> 'page-spa' (page with hashtags).
README: Add more scope rule examples, clarify distinction between scope rules and block rules.
README: Update old type -> scopeType, list new scope types.
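The scope type rename above can trip up existing configs, because the old name 'page' is also one of the new names. A minimal sketch of a migration helper follows. The function `migrate_seed` and the dict-shaped seed are hypothetical illustrations, not part of browsertrix-crawler. Only the two renames come from the release note.

```python
# Hypothetical migration helper; not part of browsertrix-crawler itself.
# Mapping taken from the release note: 'none' -> 'page', 'page' -> 'page-spa'.
SCOPE_RENAMES = {"none": "page", "page": "page-spa"}


def migrate_seed(seed: dict) -> dict:
    """Return a copy of a seed config with the old scope type name replaced."""
    migrated = dict(seed)
    old = migrated.get("scopeType")
    # A single dictionary lookup is used on purpose: applying the two renames
    # one after another would turn 'none' into 'page' and then into 'page-spa'.
    if old in SCOPE_RENAMES:
        migrated["scopeType"] = SCOPE_RENAMES[old]
    return migrated
```

Scope types that were not renamed (for example 'prefix' or 'host') pass through unchanged.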

27 Jul 16:48

ikreymer

0.4.3

be1ee53

Browsertrix Crawler 0.4.3

This release includes bug fixes for the 'block rules' system:

When considering the 'inFrameUrl' for a navigation request for an iframe, use URL of parent frame.
Always allow pywb proxy static scripts, ignoring block rules settings.
When 'debug' is set in 'logging' options, log blocked requests and conditional iframe requests.
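To illustrate the 'inFrameUrl' fix, here is a simplified model of how a rule might be evaluated. It is a sketch under assumptions, not the crawler's actual implementation: the function names are hypothetical, rules are modelled as dicts whose 'url' and 'inFrameUrl' values are treated as regular expressions, and other rule options are left out.

```python
import re
from typing import Optional


def frame_url_for_rule(is_navigation: bool, frame_url: str,
                       parent_frame_url: Optional[str]) -> str:
    """Pick the URL that a rule's inFrameUrl pattern is tested against."""
    # For an iframe's own navigation request, the frame URL is the URL being
    # loaded, so the parent frame's URL is used instead (per the release note).
    if is_navigation and parent_frame_url is not None:
        return parent_frame_url
    return frame_url


def rule_matches(rule: dict, request_url: str, *, is_navigation: bool,
                 frame_url: str, parent_frame_url: Optional[str] = None) -> bool:
    """Simplified check: does this rule apply to the request? (hypothetical)"""
    if re.search(rule["url"], request_url) is None:
        return False
    in_frame_pattern = rule.get("inFrameUrl")
    if in_frame_pattern is None:
        # No frame restriction, so the URL match alone decides.
        return True
    context_url = frame_url_for_rule(is_navigation, frame_url, parent_frame_url)
    return re.search(in_frame_pattern, context_url) is not None
```

With a rule such as `{"url": r"ads\.example\.com", "inFrameUrl": r"news\.example\.org"}`, an iframe navigating to the ads host inside a news page only matches when the parent frame's URL is consulted.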

24 Jul 03:03

ikreymer

0.4.2

f0c5ca1

Browsertrix Crawler 0.4.2

This release includes the following fixes:

Compose/docs: Build latest image by default, update README to refer to latest image
Fix typo in crawler.capturePrefix that resulted in directFetchCapture() always failing (also catch any failures in direct fetch)
Tests: Update all tests to use test-crawls directory
extractLinks() just extracts links from default selectors, allows custom driver to filter results
loadPage() accepts a list of selector options with selector, extract, and isAttribute settings for further customization of link extraction

Released image published to Docker Hub at webrecorder/browsertrix-crawler:0.4.2

22 Jul 21:34

ikreymer

0.4.1

f4c6b6a

Browsertrix Crawler 0.4.1

This release includes a multi-platform build for amd64 and arm64 (Apple M1).
Other fixes and enhancements include:

BlockRules Optimizations: don't intercept requests if no blockRules
Profile Creation: Support extending existing profile by passing a --profile param to load on startup
Profile Creation: Set default window size to 1600x900, add --windowSize param for setting custom size
Behavior Timeouts: Add --behaviorTimeout to specify custom timeout for behaviors, in seconds (defaulting to 90 seconds)
Load Wait Default: Switch to 'load,networkidle2' to speed up waiting for initial load
Multi-platform build: Support building for amd64 and Arm using oldwebtoday/chrome:91 images (check for google-chrome and chromium-browser automatically)
CI: Build a multi-platform (amd64 and arm64) image on each release

22 Jul 03:08

ikreymer

0.4.1-beta.1

7efacec

Browsertrix Crawler 0.4.1 Beta 1 Pre-release

[Testing Multi-platform building]
(Beta) Changes for 0.4.1

BlockRules Optimizations: don't intercept requests if no blockRules
Profile Creation: Support extending existing profile by passing a --profile param to load on startup
Behavior Timeouts: Add --behaviorTimeout to specify custom timeout for behaviors, in seconds (defaulting to 90 seconds)
Load Wait Default: Switch to 'load,networkidle2' to speed up waiting for initial load
Multi-platform build: Support building for amd64 and Arm using oldwebtoday/chrome:91 images (check for google-chrome and chromium-browser automatically)
CI: Build amd64 and arm64 images on each release

21 Jul 06:28

ikreymer

0.4.0

6a65ea7

Browsertrix Crawler 0.4.0

This release includes many new features, including:

YAML based config, specifiable via --config property or via stdin (with --config stdin)
Support for different scope types ('page', 'prefix', 'host', 'any', 'none') + crawl depth at crawl level
Per-Seed scoping, including different scope types, or depth and include/exclude rules configurable per seed in 'seeds' list via YAML config
Support for 'blockRules' for blocking certain URLs from being stored in WARCs, conditional blocking for iframe based on contents, and iframe URLs (see README for more details)
Interactive profile creation: create profiles by interacting with an embedded browser displayed in the user's own browser (see README for more details).
Screencasting: streaming the output of each window via websocket-based streaming, configurable with --screencastPort option
New 'window' based parallelization: Open each worker in new window in same session
Simplified custom driver config, default calls 'loadPage'
Refactor arg parsing, other auxiliary functions into separate utils files
Image customization: support for customizing the browser image, e.g. building with Chromium instead of Chrome, and support for ARM architecture builds (see README for more details).
Update to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), py-wacz (0.3.1)
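As an illustration of the stdin-based config and per-seed scoping listed above, the sketch below builds a config document and a command line that reads it from stdin. The key names ('seeds', 'url', 'scopeType', 'depth'), the `crawl` entry point, the image name default, and the helper functions are assumptions for illustration. Only `--config stdin` is stated in the release notes. The config is emitted as JSON, which YAML parsers generally accept for simple documents like this one, so no YAML library is needed.

```python
import json
import subprocess


def build_config_text(seeds: list) -> str:
    """Serialize a crawl config; the JSON output doubles as YAML here."""
    # 'seeds' with per-seed 'scopeType' and 'depth' is an assumed layout.
    return json.dumps({"seeds": seeds}, indent=2)


def build_command(image: str = "webrecorder/browsertrix-crawler") -> list:
    """Command that reads the config from stdin (assumed Docker invocation)."""
    # '-i' keeps stdin open so the container can read the piped config.
    return ["docker", "run", "-i", image, "crawl", "--config", "stdin"]


def run_crawl(seeds: list) -> None:
    """Pipe the config into the crawler; requires Docker to be installed."""
    subprocess.run(build_command(), input=build_config_text(seeds),
                   text=True, check=True)
```

A call such as `run_crawl([{"url": "https://example.com/", "scopeType": "prefix", "depth": 1}])` would start a crawl with one seed. Each seed can carry its own scope type and depth.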

28 Jun 22:07

ikreymer

0.4.0-beta.2

ef7d5e5

Browsertrix Crawler 0.4.0 Beta 2 Pre-release

Support for per-seed scoping (#63)
YAML Config:

Fixes for behavior, other options to work with YAML config
Support passing YAML config via stdin
New Docker Image, support for customizing browser image (support for multi-arch builds)
