Add MKDocs documentation site for Browsertrix Crawler 1.0.0 #494

Merged
merged 37 commits into from Mar 16, 2024
Changes from 14 commits
Commits
8e68ee3
WIP: First pass at mkdocs
tw4l Mar 13, 2024
0f7d736
Specify src dir for linting and formatting
tw4l Mar 13, 2024
3231e72
Update docs and README
tw4l Mar 13, 2024
eed8820
Adds logos to docs
Shrinks99 Mar 13, 2024
ea70159
Remove venv from git repo
tw4l Mar 13, 2024
c3f6bb4
Fix typo
tw4l Mar 13, 2024
21fc0b6
Update link
tw4l Mar 13, 2024
b21f5b8
Add CNAME for crawler-docs.browsertrix.com
tw4l Mar 13, 2024
16a2022
Add GH Actions workflow to publish docs
tw4l Mar 13, 2024
9bdb2e8
Modify docs url to crawler-docs.webrecorder.net for now
tw4l Mar 13, 2024
46fe56a
Add mentions of collecting data from CDP
tw4l Mar 13, 2024
41948e7
Fix typo
tw4l Mar 13, 2024
ecba2f0
Remove unused icons
tw4l Mar 13, 2024
011286f
Updates colours
Shrinks99 Mar 13, 2024
3529686
Fixes em dashes, adds oxford comma
Shrinks99 Mar 13, 2024
6f9fadf
Fixes em dashes
Shrinks99 Mar 13, 2024
9c06d43
Fixes line breaks, adds em dash
Shrinks99 Mar 13, 2024
fad2bda
Apply style changes to text
tw4l Mar 14, 2024
2ca1e67
Remove insertversion js
tw4l Mar 14, 2024
0b2e91e
Clarify headless profile streaming
tw4l Mar 14, 2024
5cf3f8d
Use example admonition
tw4l Mar 14, 2024
a044a86
Remove errant backticks
tw4l Mar 14, 2024
b68c4b6
Fix unclear sentence
tw4l Mar 14, 2024
4495a24
Adds oxford comma and em dash
Shrinks99 Mar 14, 2024
3b6afaf
Removes double space
Shrinks99 Mar 14, 2024
941fd7b
Adds dark mode favicon colors
Shrinks99 Mar 14, 2024
516c966
Updates favicon with dynamic color for dark mode
Shrinks99 Mar 14, 2024
ef707df
SVG optimization
Shrinks99 Mar 14, 2024
edaa07b
add gen-cli.sh script to auto-generated the cli-options.md
ikreymer Mar 15, 2024
4cbd446
test publishing, update CNAME
ikreymer Mar 15, 2024
dc00f03
test publish
ikreymer Mar 15, 2024
d223836
fix typo
ikreymer Mar 15, 2024
3753f57
add missing docs.md
ikreymer Mar 15, 2024
8841275
fix docs link
ikreymer Mar 15, 2024
17de513
Update Crawl Scope section with better regex examples
tw4l Mar 15, 2024
916c434
Update link in README to docs to make CNAME
tw4l Mar 15, 2024
ab21d67
remove test publish branch, remove dupe CNAME
ikreymer Mar 16, 2024
21 changes: 21 additions & 0 deletions .github/workflows/docs-publish.yaml
@@ -0,0 +1,21 @@
name: docs-publish
on:
  push:
    branches:
      - main
    paths:
      - 'docs/**'

permissions:
  contents: write

jobs:
  deploy_docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: 3.x
      - run: pip install mkdocs-material
      - run: cd docs/ && mkdocs gh-deploy --force
1 change: 0 additions & 1 deletion .husky/pre-commit
@@ -1,4 +1,3 @@
#!/usr/bin/env sh
. "$(dirname -- "$0")/_/husky.sh"

yarn lint:fix
794 changes: 4 additions & 790 deletions README.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/docs/CNAME
@@ -0,0 +1 @@
crawler-docs.webrecorder.net
6 changes: 6 additions & 0 deletions docs/docs/assets/brand/browsertrix-crawler-color.svg
1 change: 1 addition & 0 deletions docs/docs/assets/brand/browsertrix-crawler-white.svg
Binary file added docs/docs/assets/fonts/Inter-Italic.var.woff2
Binary file not shown.
Binary file added docs/docs/assets/fonts/Inter.var.woff2
Binary file not shown.
Binary file added docs/docs/assets/fonts/Recursive_VF_1.084.woff2
Binary file not shown.
39 changes: 39 additions & 0 deletions docs/docs/develop/index.md
@@ -0,0 +1,39 @@
# Development

## Usage with Docker Compose

Many examples in the User Guide demonstrate running Browsertrix Crawler with `docker run`.

Docker Compose is recommended for building the image and for simple configurations. A simple Docker Compose configuration file is included in the Git repository.

For example, to build the latest image, simply run:

```sh
docker-compose build
```

Docker Compose also simplifies some config options, such as mounting the volume for the crawls.

For example, the following command starts a crawl with 2 workers and generates the CDX:

```sh
docker-compose run crawler crawl --url https://webrecorder.net/ --generateCDX --collection wr-net --workers 2
```

In this example, the crawl data is written to `./crawls/collections/wr-net` by default.

While the crawl is running, its progress is printed to the JSON-L log output. This can be disabled by passing the `--logging` option without the `stats` value.
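
As a rough sketch of the `--logging` behavior described above, the wrapper below reuses the crawl command from the earlier example and takes the logging values as its argument. The function name `run_crawl` is hypothetical, and `jserrors` in the usage note is assumed to be an accepted `--logging` value; check the crawler's `--help` output for the values your version supports.

```sh
# Hypothetical wrapper around the crawl command shown earlier.
# $1 is the value passed to --logging; leave "stats" out of it
# to stop progress statistics from being logged.
run_crawl() {
  docker-compose run crawler crawl \
    --url https://webrecorder.net/ \
    --generateCDX \
    --collection wr-net \
    --workers 2 \
    --logging "$1"
}
```

Calling `run_crawl jserrors` would then start the same crawl with a logging list that does not include `stats`.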

## Multi-Platform Build / Support for Apple Silicon (M1/M2)

Browsertrix Crawler uses a browser image which supports amd64 and arm64.

This means Browsertrix Crawler can be built natively on Apple Silicon systems using the default settings. Running `docker-compose build` on an Apple Silicon system should produce a native image suitable for development.
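
One way to confirm that a build is native is to check which architecture the resulting image targets and compare it with the host. This is only a minimal sketch: `image_arch` is a hypothetical helper, and the image name in the usage note is an assumption, so substitute the name your `docker-compose build` actually produces.

```sh
# Hypothetical helper: print the CPU architecture an image was built for,
# as reported by Docker (for example arm64 or amd64).
image_arch() {
  docker image inspect --format '{{.Architecture}}' "$1"
}
```

For instance, `image_arch webrecorder/browsertrix-crawler:latest` (image name assumed) would be expected to report `arm64` for a native Apple Silicon build, which can be compared against the output of `uname -m` on the host.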

## Modifying Browser Image

It is also possible to build Browsertrix Crawler with a different browser image. Currently, browser images using Brave Browser and Chrome/Chromium (depending on host system chip architecture) are supported via [browsertrix-browser-base](https://github.com/webrecorder/browsertrix-browser-base); however, only Brave Browser is receiving regular version updates.

The browser base image used is specified and can be changed at the top of the Dockerfile in the Browsertrix Crawler repo.

Custom browser images can be used by forking [browsertrix-browser-base](https://github.com/webrecorder/browsertrix-browser-base), locally building or publishing an image, and then modifying the Dockerfile in this repo to build from that image.
41 changes: 41 additions & 0 deletions docs/docs/index.md
@@ -0,0 +1,41 @@
---
hide:
- navigation
- toc
---

# Home

Welcome to the Browsertrix Crawler official documentation.

Browsertrix Crawler is a simplified, browser-based, high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses [Puppeteer](https://github.com/puppeteer/puppeteer) to control one or more [Brave Browser](https://brave.com/) browser windows in parallel. Data is captured through the [Chrome DevTools Protocol (CDP)](https://chromedevtools.github.io/devtools-protocol/) in the browser.


!!! note

    This documentation applies to Browsertrix Crawler versions 1.0.0 and above. Documentation for earlier versions of the crawler is available in the README file in older commits of the [Browsertrix Crawler GitHub repository](https://github.com/webrecorder/browsertrix-crawler).


## Features

Thus far, Browsertrix Crawler supports:

- Single-container, browser-based crawling with a headless/headful browser running pages in multiple windows.
- Support for custom browser behaviors using [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors), including autoscroll, video autoplay, and site-specific behaviors.
- YAML-based configuration, passed via file or via stdin.
- Seed lists and per-seed scoping rules.
- URL blocking rules to block capture of specific URLs (including by iframe URL and/or by iframe contents).
- Screencasting: Ability to watch crawling in real-time.
- Screenshotting: Ability to take thumbnails, full page screenshots, and/or screenshots of the initial page view.
- Optimized (non-browser) capture of non-HTML resources.
- Extensible Puppeteer driver script for customizing behavior per crawl or page.
- Ability to create and reuse browser profiles interactively or via automated user/password login using an embedded browser.
- Multi-platform support -- prebuilt Docker images available for Intel/AMD and Apple Silicon (M1/M2) CPUs.

## Documentation

Our docs are still under construction. If you find something missing, chances are we haven't gotten around to writing that part yet. If you find typos or something isn't clear or seems incorrect, please open an [issue](https://github.com/webrecorder/browsertrix-crawler/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc) and we'll try to make sure that your questions get answered here in the future!

## Code

Browsertrix Crawler is free and open source software, with all code available in the [main repository on GitHub](https://github.com/webrecorder/browsertrix-crawler).
34 changes: 34 additions & 0 deletions docs/docs/js/insertversion.js
@@ -0,0 +1,34 @@
const KEY = "/.__source";
let retries = 0;

function loadVersion() {
  const value = self.sessionStorage.getItem(KEY);
  if (value) {
    parseVersion(value);
  } else if (retries++ < 10) {
    setTimeout(loadVersion, 500);
  }
}

function parseVersion(string) {
  const version = JSON.parse(string).version;
  if (!version) {
    return;
  }

  const elems = document.querySelectorAll("insert-version");
  for (const elem of elems) {
    try {
      const code = elem.parentElement.nextElementSibling.querySelector("code");
      code.childNodes.forEach((node) => {
        if (node.nodeType === Node.TEXT_NODE) {
          node.nodeValue = node.nodeValue.replaceAll("VERSION", version);
        }
      });
    } catch (e) {}
  }
}

if (window.location.pathname.startsWith("/deploy/local")) {
  window.addEventListener("load", () => loadVersion());
}
4 changes: 4 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/bug-fill.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/chat-left-text-fill.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/check-circle-fill.svg
4 changes: 4 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/check-circle.svg
4 changes: 4 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/dash-circle.svg
4 changes: 4 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/exclamation-triangle.svg
4 changes: 4 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/eye.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/github.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/globe.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/info-circle-fill.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/mastodon.svg
4 changes: 4 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/mortarboard-fill.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/pencil-fill.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/pencil.svg
3 changes: 3 additions & 0 deletions docs/docs/overrides/.icons/bootstrap/question-circle-fill.svg