Dev 1.0.0 -> Main #482

Merged
merged 37 commits into from Mar 5, 2024
Commits
877d9f5
Use new browser-based archiving mechanism instead of pywb proxy (#424)
ikreymer Nov 8, 2023
af1e086
TypeScript Conversion (#425)
ikreymer Nov 9, 2023
2a49406
Add Prettier to the repo, and format all the files! (#428)
emma-sg Nov 10, 2023
783d006
follow-up to #428: update ignore files (#431)
ikreymer Nov 10, 2023
ab0f66a
Raise size limit for large HTML pages (#430)
ikreymer Nov 10, 2023
3972942
logging: don't log filtered out direct fetch attempt as error (#432)
ikreymer Nov 13, 2023
0d51e03
Fix potential for pending list never being processed (#433)
ikreymer Nov 13, 2023
456155e
more specific types additions (#434)
ikreymer Nov 13, 2023
19dac94
Add types + validation for log context options (#435)
ikreymer Nov 15, 2023
e9ed7a4
Merge 0.12.2 into dev-1.0.0
ikreymer Nov 16, 2023
3323262
WARC filename prefix + rollover size + improved 'livestream' / trunca…
ikreymer Dec 8, 2023
703835a
detect invalid custom behaviors on load: (#450)
ikreymer Dec 13, 2023
63c884f
Merge branch 'main' (0.12.3) into 1.0.0
ikreymer Jan 3, 2024
db2dbe0
bump to 1.0.0-beta.1
ikreymer Jan 3, 2024
2fc0f67
Generate urn:pageinfo:<page url> records (#458)
ikreymer Jan 15, 2024
18ffb3d
skipping resources: ensure HEAD, OPTIONS, 206, and 304 response/reque…
ikreymer Jan 17, 2024
f4ecaa8
Merge branch 'main' into dev-1.0.0
ikreymer Jan 17, 2024
298deac
add fix from 0.12.4 - puppeteer-core to 20.8.2
ikreymer Jan 17, 2024
bdffa79
Add arg to write pages to Redis (#464)
tw4l Feb 10, 2024
96f3c40
Page Resources: Include Cached Resources (#465)
ikreymer Feb 16, 2024
46eb02d
version: bump to 1.0.0-beta.3
ikreymer Feb 16, 2024
e8f2073
Update Browser Image (#466)
ikreymer Feb 18, 2024
8d2d79a
Misc Page Resource/Recorder Fixes (#467)
ikreymer Feb 18, 2024
a512e92
Include resource type + mime type in page resources list (#468)
ikreymer Feb 20, 2024
a5e9395
Set warc prefix via WARC_PREFIX env var (#470)
ikreymer Feb 21, 2024
51660cd
pageinfo: add console errors to pageinfo record, tracking in 'counts'…
ikreymer Feb 22, 2024
d36564e
typo: remove extra console.log
ikreymer Feb 23, 2024
cdd047d
warcwriter: better filehandle init on first use (#474)
ikreymer Feb 24, 2024
dd48251
Include WARC prefix for screenshots and text WARCs (#473)
ikreymer Feb 28, 2024
fba4730
new seed on redirect + error page check: (#476)
ikreymer Feb 28, 2024
c348de2
store page statusCode if not 200 (#477)
ikreymer Feb 29, 2024
184f4a2
Ensure links added via behaviors also get processed (#478)
ikreymer Feb 29, 2024
dd78457
version: bump to 1.0.0-beta.5
ikreymer Feb 29, 2024
4520e9e
Fail on status code option + requeue fix (#480)
ikreymer Mar 5, 2024
5a47cc4
warc: add Network.resourceType (https://chromedevtools.github.io/devt…
ikreymer Mar 5, 2024
63cedbc
version: bump to 1.0.0-beta.6
ikreymer Mar 5, 2024
65133c9
resourceType lowercase fix: (#483)
ikreymer Mar 5, 2024
2 changes: 2 additions & 0 deletions .eslintignore
@@ -1,2 +1,4 @@
.*
behaviors.js
behaviors/
scratch/
63 changes: 30 additions & 33 deletions .eslintrc.cjs
@@ -1,35 +1,32 @@
module.exports = {
"env": {
"browser": true,
"es2021": true,
"node": true,
"jest": true
},
"extends": "eslint:recommended",
"parserOptions": {
"ecmaVersion": 12,
"sourceType": "module"
},
"rules": {
"indent": [
"error",
2
],
"linebreak-style": [
"error",
"unix"
],
"quotes": [
"error",
"double"
],
"semi": [
"error",
"always"
],
"no-constant-condition": [
"error",
{"checkLoops": false }
]
}
env: {
browser: true,
es2021: true,
node: true,
jest: true,
},
extends: [
"eslint:recommended",
"plugin:@typescript-eslint/recommended",
"prettier",
],
parser: "@typescript-eslint/parser",
plugins: ["@typescript-eslint"],
parserOptions: {
ecmaVersion: 12,
sourceType: "module",
},
rules: {
"no-constant-condition": ["error", { checkLoops: false }],
"no-use-before-define": [
"error",
{
variables: true,
functions: false,
classes: false,
allowNamedExports: true,
},
],
},
reportUnusedDisableDirectives: true,
};
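
The `no-use-before-define` options added above can be illustrated with a small sketch. It is not taken from the crawler source: `buildGreeting`, `greeting`, and `limit` are hypothetical names, and the comments describe the documented behavior of the core ESLint rule under `functions: false` and `variables: true`.

```typescript
// Permitted with `functions: false`: function declarations are hoisted,
// so calling one above its definition is safe at runtime and not reported.
const greeting: string = buildGreeting("crawler");

function buildGreeting(name: string): string {
  return `hello ${name}`;
}

// Still reported with `variables: true`: a block-scoped variable read before
// its declaration would throw at runtime, so the rule keeps flagging it.
// console.log(limit);
// const limit = 5;

console.log(greeting);
```

Relaxing the rule only for functions and classes keeps the check that catches real ordering bugs while letting helpers sit below the code that calls them.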
51 changes: 23 additions & 28 deletions .github/workflows/ci.yaml
@@ -6,46 +6,41 @@ on:

jobs:
lint:

runs-on: ubuntu-latest

strategy:
matrix:
node-version: [18.x]

steps:
- uses: actions/checkout@v3
- name: Use Node.js ${{ matrix.node-version }}
uses: actions/setup-node@v3
with:
node-version: ${{ matrix.node-version }}
- name: install requirements
run: yarn install
- name: run linter
run: yarn lint

build:
- uses: actions/checkout@v3
- name: Use Node.js ${{ matrix.node-version }}
uses: actions/setup-node@v3
with:
node-version: ${{ matrix.node-version }}
- name: install requirements
run: yarn install
- name: run linter
run: yarn lint && yarn format

build:
runs-on: ubuntu-latest

strategy:
matrix:
node-version: [18.x]

steps:
- uses: actions/checkout@v3
- name: Use Node.js ${{ matrix.node-version }}
uses: actions/setup-node@v3
with:
node-version: ${{ matrix.node-version }}
- name: install requirements
run: yarn install
- name: build docker
run: docker-compose build
- name: run jest
run: sudo yarn test
- uses: actions/checkout@v3
- name: Use Node.js ${{ matrix.node-version }}
uses: actions/setup-node@v3
with:
node-version: ${{ matrix.node-version }}
- name: install requirements
run: yarn install
- name: build js
run: yarn run tsc
- name: build docker
run: docker-compose build
- name: run jest
run: sudo yarn test
22 changes: 7 additions & 15 deletions .github/workflows/release.yaml
@@ -8,44 +8,36 @@ jobs:
name: Build x86 and ARM Images and push to Dockerhub
runs-on: ubuntu-22.04
steps:
-
name: Check out the repo
- name: Check out the repo
uses: actions/checkout@v4

-
name: Docker image metadata
- name: Docker image metadata
id: meta
uses: docker/metadata-action@v5
with:
images: webrecorder/browsertrix-crawler
tags: |
type=semver,pattern={{version}}

-
name: Set up QEMU
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
with:
platforms: arm64

-
name: Set up Docker Buildx
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
-
name: Login to DockerHub
- name: Login to DockerHub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}
-
name: Build and push
- name: Build and push
id: docker_build
uses: docker/build-push-action@v3
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
platforms: "linux/amd64,linux/arm64"
-
name: Image digest
- name: Image digest
run: echo ${{ steps.docker_build.outputs.digest }}

2 changes: 2 additions & 0 deletions .gitignore
@@ -6,3 +6,5 @@ node_modules/
crawls/
test-crawls/
.DS_Store
dist
scratch/
2 changes: 1 addition & 1 deletion .husky/pre-commit
@@ -1,4 +1,4 @@
#!/usr/bin/env sh
. "$(dirname -- "$0")/_/husky.sh"

yarn lint
yarn lint:fix
14 changes: 7 additions & 7 deletions .pre-commit-config.yaml
@@ -1,8 +1,8 @@
repos:
- repo: local
hooks:
- id: husky-run-pre-commit
name: husky
language: system
entry: .husky/pre-commit
pass_filenames: false
- repo: local
hooks:
- id: husky-run-pre-commit
name: husky
language: system
entry: .husky/pre-commit
pass_filenames: false
4 changes: 4 additions & 0 deletions .prettierignore
@@ -0,0 +1,4 @@
dist
scratch
crawls
test-crawls
18 changes: 13 additions & 5 deletions CHANGES.md
@@ -1,11 +1,13 @@
## CHANGES

v0.8.1

- Logging and Behavior Tweaks by @ikreymer in https://github.com/webrecorder/browsertrix-crawler/pull/229
- Fix typos by @stavares843 in https://github.com/webrecorder/browsertrix-crawler/pull/232
- Add crawl log to WACZ by @ikreymer in https://github.com/webrecorder/browsertrix-crawler/pull/231

v0.8.0

- Switch to Chrome/Chromium 109
- Convert to ESM module
- Add ad blocking via request interception (#173)
@@ -25,11 +27,13 @@ v0.8.0
- update behaviors to 0.4.1, rename 'Behavior line' -> 'Behavior log' by @ikreymer in https://github.com/webrecorder/browsertrix-crawler/pull/223

v0.7.1

- Fix for warcio.js by @ikreymer in #178
- Guard against pre-existing user/group by @edsu in #176
- Fix incorrect combineWARCs property in README.md by @Georift in #180

v0.7.0

- Update to Chrome/Chromium 101 - (0.7.0 Beta 0) by @ikreymer in #144
- Add --netIdleWait, bump dependencies (0.7.0-beta.2) by @ikreymer in #145
- Update README.md by @atomotic in #147
@@ -41,7 +45,6 @@ v0.7.0
- Interrupt Handling Fixes by @ikreymer in #167
- Run in Docker as User by @edsu in #171


v0.6.0

- Add a --waitOnDone option, which has browsertrix crawler wait when finished (for use with Browsertrix Cloud)
@@ -56,8 +59,8 @@ v0.6.0
- Fixes to interrupting a single instance in a shared state crawl
- force all cookies, including session cookies, to fixed duration in days, configurable via --cookieDays


v0.5.0

- Scope: support for `scopeType: domain` to include all subdomains and ignoring 'www.' if specified in the seed.
- Profiles: support loading remote profile from URL as well as local file
- Non-HTML Pages: Load non-200 responses in browser, even if non-html, fix waiting issues with non-HTML pages (eg. PDFs)
@@ -75,8 +78,8 @@ v0.5.0
- Signing: Support for optional signing of WACZ
- Dependencies: update to latest pywb, wacz and browsertrix-behaviors packages


v0.4.4

- Page Block Rules Fix: 'request already handled' errors by avoiding adding duplicate handlers to same page.
- Page Block Rules Fix: await all continue/abort() calls and catch errors.
- Page Block Rules: Don't apply to top-level page, print warning and recommend scope rules instead.
@@ -86,18 +89,21 @@ v0.4.4
- README: Update old type -> scopeType, list new scope types.

v0.4.3

- BlockRules Fixes: When considering the 'inFrameUrl' for a navigation request for an iframe, use URL of parent frame.
- BlockRules Fixes: Always allow pywb proxy scripts.
- Logging: Improved debug logging for block rules (log blocked requests and conditional iframe requests) when 'debug' set in 'logging'

v0.4.2

- Compose/docs: Build latest image by default, update README to refer to latest image
- Fix typo in `crawler.capturePrefix` that resulted in `directFetchCapture()` always failing
- Tests: Update all tests to use `test-crawls` directory
- extractLinks() just extracts links from default selectors, allows custom driver to filter results
- loadPage() accepts a list of selector options with selector, extract, and isAttribute settings for further customization of link extraction

v0.4.1

- BlockRules Optimizations: don't intercept requests if no blockRules
- Profile Creation: Support extending existing profile by passing a --profile param to load on startup
- Profile Creation: Set default window size to 1600x900, add --windowSize param for setting custom size
@@ -107,6 +113,7 @@ v0.4.1
- CI: Build a multi-platform (amd64 and arm64) image on each release

v0.4.0

- YAML based config, specifyable via --config property or via stdin (with '--config stdin')
- Support for different scope types ('page', 'prefix', 'host', 'any', 'none') + crawl depth at crawl level
- Per-Seed scoping, including different scope types, or depth and include/exclude rules configurable per seed in 'seeds' list via YAML config
@@ -120,16 +127,17 @@ v0.4.0
- Update to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), py-wacz (0.3.1)

v0.3.2
- Added a `--urlFile` option: Allows users to specify a .txt file list of exact URLs to crawl (one URL per line).

- Added a `--urlFile` option: Allows users to specify a .txt file list of exact URLs to crawl (one URL per line).

v0.3.1

- Improved shutdown wait: Instead of waiting for 5 secs, wait until all pending requests are written to WARCs
- Bug fix: Use async APIs for combine WARC to avoid spurious issues with multiple crawls
- Behaviors Update to Behaviors to 0.2.1, with support for facebook pages


v0.3.0

- WARC Combining: `--combineWARC` and `--rolloverSize` flags for generating combined WARC at end of crawl, each WARC upto specified rolloverSize
- Profiles: Support for creating reusable browser profiles, stored as tarballs, and running crawl with a login profile (see README for more info)
- Behaviors: Switch to Browsertrix Behaviors v0.1.1 for in-page behaviors
13 changes: 8 additions & 5 deletions Dockerfile
@@ -1,4 +1,4 @@
ARG BROWSER_VERSION=1.59.120
ARG BROWSER_VERSION=1.62.165
ARG BROWSER_IMAGE_BASE=webrecorder/browsertrix-browser-base:brave-${BROWSER_VERSION}

FROM ${BROWSER_IMAGE_BASE}
@@ -20,7 +20,6 @@ ENV PROXY_HOST=localhost \
WORKDIR /app

ADD requirements.txt /app/
RUN pip install 'uwsgi==2.0.21'
RUN pip install -U setuptools; pip install -r requirements.txt

ADD package.json /app/
@@ -39,14 +38,18 @@ RUN mkdir -p /tmp/ads && cd /tmp/ads && \

RUN yarn install --network-timeout 1000000

ADD *.js /app/
ADD util/*.js /app/util/
ADD tsconfig.json /app/
ADD src /app/src

RUN yarn run tsc

ADD config/ /app/

ADD html/ /app/html/

RUN ln -s /app/main.js /usr/bin/crawl; ln -s /app/create-login-profile.js /usr/bin/create-login-profile
RUN chmod u+x /app/dist/main.js /app/dist/create-login-profile.js

RUN ln -s /app/dist/main.js /usr/bin/crawl; ln -s /app/dist/create-login-profile.js /usr/bin/create-login-profile

WORKDIR /crawls
