Single Docker container running Heritrix 3, picking up jobs from a directory.

sepastian/heritrix3-standalone-docker

This uses the heritrix-worker image by ukwa.

Usage

docker-compose up

Go to https://localhost:8443. Note that the scheme is https, not http.

Log in with admin/admin. To change the credentials, set HERITRIX_USERNAME and HERITRIX_PASSWORD in docker-compose.yml.
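To check from a terminal that the container is up and the credentials are accepted, something like the following sketch may help. It assumes the default port and admin/admin login described above, that the certificate is self-signed (hence `-k`), and that the web interface accepts digest authentication (hence `--anyauth`). The function name `heritrix_status` and the host-side `HERITRIX_URL` variable are hypothetical helpers, not part of this repository; reusing the names HERITRIX_USERNAME and HERITRIX_PASSWORD in the host shell is likewise only a convenience.

```shell
#!/bin/sh
# Assumed defaults, taken from the usage notes above; override via the environment.
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443}"
HERITRIX_USERNAME="${HERITRIX_USERNAME:-admin}"
HERITRIX_PASSWORD="${HERITRIX_PASSWORD:-admin}"

# Print the HTTP status code returned by the engine page.
#   -k         accept a self-signed certificate (assumption)
#   --anyauth  let curl negotiate the authentication scheme
heritrix_status() {
  curl -k -s -o /dev/null -w '%{http_code}\n' \
    --anyauth -u "${HERITRIX_USERNAME}:${HERITRIX_PASSWORD}" \
    --location "${HERITRIX_URL}/engine"
}

# Example call (requires the container to be running):
#   heritrix_status
```

A 401 status would suggest that the credentials do not match those configured in docker-compose.yml.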

Jobs

Create a new job my-job in Heritrix's web interface. This will create the folder ./jobs/my-job. Configure the crawl job by editing ./jobs/my-job/crawler-beans.xml.

Start the job through the web interface (build, launch, reload the page, unpause). Each crawl creates a folder named ./jobs/my-job/<YYYYMMDD>; in addition, ./jobs/my-job/latest points to the most recent of these folders.
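The same build, launch and unpause steps can probably also be scripted against Heritrix 3's REST API, which accepts an `action` form parameter posted to the job's URL. The sketch below assumes the default port and credentials from the usage notes and a job named my-job; `job_url` and `heritrix_action` are hypothetical helper names, and whether this particular image exposes the API unchanged has not been verified here.

```shell
#!/bin/sh
# Assumed defaults; override via the environment.
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443}"
HERITRIX_USERNAME="${HERITRIX_USERNAME:-admin}"
HERITRIX_PASSWORD="${HERITRIX_PASSWORD:-admin}"

# Build the REST endpoint for a job name.
job_url() {
  printf '%s/engine/job/%s\n' "$HERITRIX_URL" "$1"
}

# POST an action (build, launch, unpause, checkpoint, ...) to a job
# and print the resulting HTTP status code.
heritrix_action() {
  curl -k -s -o /dev/null -w '%{http_code}\n' \
    --anyauth -u "${HERITRIX_USERNAME}:${HERITRIX_PASSWORD}" \
    --location -d "action=$2" "$(job_url "$1")"
}

# Example sequence mirroring the web interface steps
# (requires the container to be running):
#   heritrix_action my-job build
#   heritrix_action my-job launch
#   heritrix_action my-job unpause
```

If ./jobs/my-job/latest is a symbolic link, as the description above suggests, `readlink ./jobs/my-job/latest` should show which crawl folder it currently refers to.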

Clicking checkpoint creates a gzipped WARC file under ./jobs/my-job/latest/warcs; a WARC file is also written there when the job finishes.
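To see what a crawl has produced, the WARC folder can be inspected from the host. This is a minimal sketch that assumes the files end in `.warc.gz` and that the job folder is ./jobs/my-job; `list_warcs` and `peek_warc` are hypothetical helper names.

```shell
#!/bin/sh
# List the gzipped WARC files of the most recent crawl of a job.
# The job folder defaults to ./jobs/my-job (assumption).
list_warcs() {
  job_dir="${1:-./jobs/my-job}"
  find "${job_dir}/latest/warcs" -type f -name '*.warc.gz' | sort
}

# Print the first lines of a WARC file without unpacking it to disk.
peek_warc() {
  gzip -dc "$1" | head -n 12
}

# Example calls:
#   list_warcs ./jobs/my-job
#   peek_warc "$(list_warcs ./jobs/my-job | head -n 1)"
```

The output of `peek_warc` should start with a `WARC/` version line followed by the header fields of the first record.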
