Isoxya plugin Crawler HTML

Isoxya plugin Crawler HTML provides a core run loop for the crawling engine: it parses each page as static HTML and extracts request metadata and outbound URLs. It is a plugin for the Isoxya web crawler.

https://hub.docker.com/r/tiredpixel/isoxya-plugin-crawler-html
https://github.com/tiredpixel/isoxya-plugin-crawler-html

Features

links parsed: <a href="http://example.com">link</a>
header redirects extracted: Location (HTTP status 301, 302, 303, 307, 308)
no-follow links respected: <a href="http://www.iana.org/domains/example" rel="nofollow">
base tags used for relative links: <base href="http://www.example.com/">
meta robots no-follow tags respected: <meta name="robots" content="nofollow">
header X-Robots-Tag no-follow respected: X-Robots-Tag: nofollow
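
The rules listed above can be illustrated with a small Python sketch. It is not the plugin's implementation, only a rough model of the behaviour described. The names `LinkParser`, `outbound_urls`, and `REDIRECT_STATUSES` are hypothetical. It also assumes that header names arrive lower-cased, that only the first `<base href>` counts, and that a page-level no-follow suppresses links in the body but not a `Location` redirect; the plugin may resolve these cases differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Redirect statuses named in the feature list.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def _has_nofollow(value):
    """True if a comma- or space-separated directive list contains 'nofollow'."""
    tokens = (value or "").lower().replace(",", " ").split()
    return "nofollow" in tokens


class LinkParser(HTMLParser):
    """Collects <a href> targets, the first <base href>, and meta robots nofollow."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.page_nofollow = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base_href is None and attrs.get("href"):
            self.base_href = attrs["href"]
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if _has_nofollow(attrs.get("content")):
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            # Skip links marked rel="nofollow".
            if not _has_nofollow(attrs.get("rel")):
                self.hrefs.append(attrs["href"])


def outbound_urls(page_url, status, headers, body):
    """Return absolute outbound URLs for one fetched page.

    `headers` is assumed to be a dict with lower-cased header names.
    """
    urls = []
    # Header redirects: use Location only for the listed redirect statuses.
    if status in REDIRECT_STATUSES and headers.get("location"):
        urls.append(urljoin(page_url, headers["location"]))
    # X-Robots-Tag: nofollow suppresses links found in the body (assumption).
    if _has_nofollow(headers.get("x-robots-tag")):
        return urls
    parser = LinkParser()
    parser.feed(body)
    parser.close()
    # <meta name="robots" content="nofollow"> applies to the whole page.
    if parser.page_nofollow:
        return urls
    # Relative links resolve against <base href> when present, else the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    urls.extend(urljoin(base, href) for href in parser.hrefs)
    return urls
```

For a page containing `<base href="http://www.example.com/">`, a relative link `about`, and a `rel="nofollow"` link, this sketch would return only the absolute URL of the relative link, resolved against the base tag.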

Installation

Compile and boot locally:

docker compose up

Images are also published under the latest tag (for development) and under version-specific tags (for production). Do not use the latest tag in production!

