QA Crawl Support (Beta) #469

Merged
merged 50 commits into from Mar 23, 2024
Commits
e1e7743
convert driver to a class that supports crawlPage, setupPage and tear…
ikreymer Feb 7, 2024
2c0617c
qa work: initial support for crawling over replay!
ikreymer Feb 6, 2024
6827788
add missing await, remove console.log
ikreymer Feb 6, 2024
a00176b
refactor on new driver format
ikreymer Feb 7, 2024
7cc741a
replace driver with ReplayCrawler subclass
ikreymer Feb 7, 2024
540efeb
load WACZ page list directly (via wabac.js ZipRangeReader)
ikreymer Feb 9, 2024
db491fc
types: fix types for WARCResourceWriter / textextract / screenshots
ikreymer Feb 9, 2024
a8869f7
resources pageinfo, include redirects
ikreymer Feb 9, 2024
cefdf52
fix using date for 'ts' field in pageinfo: records to match crawler
ikreymer Feb 10, 2024
7787d8a
add qa option to parseArgs, requires --replaySource but not --seeds
ikreymer Feb 9, 2024
d833e2a
diff work: add screenshot, text, and resource comparisons!
ikreymer Feb 16, 2024
7b8ab4b
add comparison to replay pageinfo!
ikreymer Feb 17, 2024
222ef1d
typo fixes
ikreymer Feb 17, 2024
1791f16
experiment with reloading page after initial load (disabled), add dee…
ikreymer Feb 17, 2024
e15d25d
update to page info with status/mime/type
ikreymer Feb 20, 2024
59382a3
rename --replaySource -> --qaSource
ikreymer Feb 20, 2024
aca1a64
add --qaRedisKey to set redis key for pushing qa comparison data to r…
ikreymer Feb 20, 2024
bad67a0
replayserver: support serving sw.js directly, make RWP version config…
ikreymer Feb 21, 2024
3617bb6
replay: install RWP files directly into image on build, instead of lo…
ikreymer Feb 21, 2024
fb9de39
Merge branch 'dev-1.0.0' into qa-crawl-work
ikreymer Feb 29, 2024
0e0d74e
fixes for 1.0.0-beta.5 merge
ikreymer Feb 29, 2024
c987424
Merge branch 'main' into qa-crawl-work
ikreymer Mar 5, 2024
2d85f2d
Merge branch 'main' into qa-crawl-work
ikreymer Mar 7, 2024
c4231e5
misc qa work:
ikreymer Mar 8, 2024
5c42549
Merge branch 'main' into qa-crawl-work
ikreymer Mar 8, 2024
4f4f7a1
qa: consolidate comparison data into pages data added to redis
ikreymer Mar 8, 2024
5a1b2a9
tests: add qa comparison test:
ikreymer Mar 8, 2024
0a1018a
Merge branch 'main' into qa-crawl-work
ikreymer Mar 8, 2024
0abfaac
qa test: use redis://127.0.0.1:36379 for ci to match other redis usage
ikreymer Mar 8, 2024
3a9ffd8
tests: try different port for redis
ikreymer Mar 8, 2024
d7d6558
support loading multi-wacz .json files locally
ikreymer Mar 11, 2024
aa4ecd5
qa crawl init: support loading pages from json file if 'pages' key is…
ikreymer Mar 12, 2024
8d0f411
disable CORS for replaycrawler (for now) to allow loading any existin…
ikreymer Mar 12, 2024
ceffad9
cleanup
ikreymer Mar 13, 2024
251e1b3
Merge branch 'main' into qa-crawl-work
ikreymer Mar 16, 2024
e4d8388
Merge branch 'main' into qa-crawl-work
ikreymer Mar 19, 2024
cb435f6
readd parseArgs import
ikreymer Mar 19, 2024
52f80d0
cleanup, add more constants, remove commented out code
ikreymer Mar 20, 2024
aee5af5
more cleanup
ikreymer Mar 20, 2024
b18148b
tests: change ports for different tests that use redis to be unique
ikreymer Mar 20, 2024
ce2ffca
Merge branch 'main' into qa-crawl-work
ikreymer Mar 21, 2024
f6a7dab
Merge branch 'main' into qa-crawl-work
ikreymer Mar 21, 2024
387e269
tests: fix non-root user tests
ikreymer Mar 22, 2024
cc5e130
tweak test ci steps
ikreymer Mar 22, 2024
4979d86
Merge branch 'main' into qa-crawl-work
ikreymer Mar 22, 2024
c8dc60d
lint fix
ikreymer Mar 22, 2024
3c4f552
type fix
ikreymer Mar 22, 2024
ae9fdbe
don't bypass service workers for replay crawl!
ikreymer Mar 22, 2024
cdab557
additional type fixes in browser
ikreymer Mar 23, 2024
a4ef485
bump version to 1.1.0-beta.2
ikreymer Mar 23, 2024
6 changes: 3 additions & 3 deletions .github/workflows/ci.yaml
@@ -42,10 +42,10 @@ jobs:
         run: yarn run tsc
       - name: build docker
         run: docker-compose build
-      - name: run jest
+      - name: run all tests as root
         run: sudo yarn test
-      - name: run saved state test with volume owned by different user
+      - name: run saved state + qa compare test as non-root - with volume owned by current user
         run: |
           sudo rm -rf ./test-crawls
           mkdir test-crawls
-          sudo yarn test ./tests/saved-state.test.js
+          sudo yarn test ./tests/saved-state.test.js ./tests/qa_compare.test.js
10 changes: 8 additions & 2 deletions Dockerfile
@@ -48,9 +48,15 @@ ADD config/ /app/
 
 ADD html/ /app/html/
 
-RUN chmod a+x /app/dist/main.js /app/dist/create-login-profile.js
+ARG RWP_VERSION=1.8.15
+ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/ui.js /app/html/rwp/
+ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/sw.js /app/html/rwp/
 
-RUN ln -s /app/dist/main.js /usr/bin/crawl; ln -s /app/dist/create-login-profile.js /usr/bin/create-login-profile
+RUN chmod a+x /app/dist/main.js /app/dist/create-login-profile.js && chmod a+r /app/html/rwp/*
+
+RUN ln -s /app/dist/main.js /usr/bin/crawl; \
+    ln -s /app/dist/main.js /usr/bin/qa; \
+    ln -s /app/dist/create-login-profile.js /usr/bin/create-login-profile
 
 WORKDIR /crawls
 
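The Dockerfile change links both /usr/bin/crawl and /usr/bin/qa to the same main.js, and an earlier commit notes that the qa option requires a replay source (later renamed --qaSource) but not --seeds. One plausible way a single entry point could tell the two apart is by the name it was invoked under. The sketch below is only an illustration under that assumption: `CrawlMode`, `Args`, `modeFromArgv` and `validateArgs` are hypothetical names that do not come from this PR, and the invoked path is assumed to be available as a plain string (for example via Node's `process.argv[1]`).

```typescript
// Hypothetical sketch: none of these names are taken from the PR itself.
type CrawlMode = "crawl" | "qa";

interface Args {
  seeds?: string[];
  qaSource?: string;
}

// Pick the mode from the name the script was invoked as (e.g. /usr/bin/qa).
function modeFromArgv(invokedPath: string): CrawlMode {
  const name = invokedPath.split("/").pop() ?? "";
  return name === "qa" ? "qa" : "crawl";
}

// Assumed rule: qa mode needs a replay source, a regular crawl needs a seed.
// Returns an error message, or null when the arguments are acceptable.
function validateArgs(mode: CrawlMode, args: Args): string | null {
  if (mode === "qa") {
    return args.qaSource ? null : "--qaSource is required in qa mode";
  }
  return args.seeds && args.seeds.length > 0
    ? null
    : "at least one seed is required";
}
```

With this shape, `validateArgs(modeFromArgv("/usr/bin/qa"), { qaSource: "crawl.wacz" })` passes without any seeds, while the same call under /usr/bin/crawl would ask for a seed.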
39 changes: 39 additions & 0 deletions html/replay.html
@@ -0,0 +1,39 @@
+<!doctype html>
+<html>
+  <head>
+    <script src="/ui.js"></script>
+    <style>
+      html {
+        width: 100%;
+        height: 100%;
+        display: flex;
+      }
+      body {
+        width: 100%;
+        margin: 0;
+        padding: 0;
+      }
+      replay-web-page {
+        margin: 0;
+        padding: 0;
+        border: 0;
+        position: fixed;
+        width: 100vw;
+        height: 100vh;
+        top: 0;
+        left: 0;
+      }
+    </style>
+  </head>
+  <body>
+    <replay-web-page
+      embed="replayonly"
+      deepLink="true"
+      source="$SOURCE"
+      url="about:blank"
+      ts=""
+      coll="replay"
+    >
+    </replay-web-page>
+  </body>
+</html>
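The template above carries a literal `$SOURCE` placeholder in the `source` attribute, which suggests the replay server fills it in before serving the page. A minimal sketch of such a substitution follows. The function names `escapeAttr` and `renderReplayPage` are hypothetical, and the PR may well do this differently. The sketch uses `split`/`join` rather than `String.prototype.replace` so that `$` sequences in the source value are never interpreted as replacement patterns.

```typescript
// Hypothetical helper names; this is not the PR's replay server code.

// Escape characters that could break out of a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Replace every literal $SOURCE in the template with the escaped source URL.
function renderReplayPage(template: string, source: string): string {
  return template.split("$SOURCE").join(escapeAttr(source));
}
```

Escaping `&` first matters here: doing it last would re-escape the ampersands introduced by the other replacements.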
10 changes: 8 additions & 2 deletions package.json
@@ -1,6 +1,6 @@
 {
   "name": "browsertrix-crawler",
-  "version": "1.0.2",
+  "version": "1.1.0-beta.2",
   "main": "browsertrix-crawler",
   "type": "module",
   "repository": "https://github.com/webrecorder/browsertrix-crawler",
@@ -24,9 +24,12 @@
     "get-folder-size": "^4.0.0",
     "husky": "^8.0.3",
     "ioredis": "^5.3.2",
+    "js-levenshtein": "^1.1.6",
     "js-yaml": "^4.1.0",
     "minio": "^7.1.3",
     "p-queue": "^7.3.4",
+    "pixelmatch": "^5.3.0",
+    "pngjs": "^7.0.0",
     "puppeteer-core": "^20.8.2",
     "sax": "^1.3.0",
     "sharp": "^0.32.6",
@@ -37,16 +40,19 @@
     "yargs": "^17.7.2"
   },
   "devDependencies": {
+    "@types/js-levenshtein": "^1.1.3",
     "@types/js-yaml": "^4.0.8",
     "@types/node": "^20.8.7",
+    "@types/pixelmatch": "^5.2.6",
+    "@types/pngjs": "^6.0.4",
     "@types/uuid": "^9.0.6",
     "@types/ws": "^8.5.8",
     "@typescript-eslint/eslint-plugin": "^6.10.0",
     "@typescript-eslint/parser": "^6.10.0",
     "eslint": "^8.53.0",
     "eslint-config-prettier": "^9.0.0",
     "eslint-plugin-react": "^7.22.0",
-    "jest": "^29.2.1",
+    "jest": "^29.7.0",
     "md5": "^2.3.0",
     "prettier": "3.0.3",
     "typescript": "^5.2.2"
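The new js-levenshtein, pixelmatch and pngjs dependencies line up with the "screenshot, text, and resource comparisons" commit, so the comparison scores are presumably derived from an edit distance for text and a pixel diff for screenshots. The sketch below shows the general idea without those packages. It is an assumption-laden simplification: the Levenshtein distance is hand-rolled, the pixel comparison uses exact RGBA equality rather than pixelmatch's perceptual threshold, and the names `textMatchScore` and `pixelMatchScore` as well as the 0..1 scoring are illustrative, not taken from the PR.

```typescript
// Hypothetical scoring helpers; not the PR's comparison code.

// Classic dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Normalize the distance to a 0..1 score, where 1 means identical text.
function textMatchScore(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

// Fraction of pixels whose RGBA bytes are identical in two same-size buffers.
function pixelMatchScore(imgA: Uint8Array, imgB: Uint8Array): number {
  if (imgA.length !== imgB.length || imgA.length % 4 !== 0) {
    throw new Error("buffers must be RGBA data of equal size");
  }
  const total = imgA.length / 4;
  if (total === 0) return 1;
  let same = 0;
  for (let p = 0; p < imgA.length; p += 4) {
    if (
      imgA[p] === imgB[p] &&
      imgA[p + 1] === imgB[p + 1] &&
      imgA[p + 2] === imgB[p + 2] &&
      imgA[p + 3] === imgB[p + 3]
    ) {
      same++;
    }
  }
  return same / total;
}
```

Scoring both comparisons on the same 0..1 scale keeps them easy to store side by side per page. A real screenshot diff would normally tolerate small anti-aliasing differences, which is what a thresholded comparison such as pixelmatch provides.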