This repository has been archived by the owner on Apr 4, 2024. It is now read-only.

qdm12/htmlspitter


HTMLSpitter

Lightweight Docker image with NodeJS server to spit out HTML from loaded JS using Puppeteer and Chrome

Medium story: HTML from the Javascript world

Image size    RAM usage
558MB         110MB+

The program is written in TypeScript for NodeJS; the source code is in the src directory.

Description

Runs a NodeJS server accepting HTTP requests with two URL parameters:

  • url which is the URL to prerender into HTML
  • wait which is the optional load event to wait for before stopping the prerendering. It can be:
    • load (wait for the load event)
    • domcontentloaded (wait for the DOMContentLoaded event)
    • networkidle0 (default, wait until there are no network connections for at least 500 ms)
    • networkidle2 (wait until there are fewer than 3 network connections for at least 500 ms)
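As a minimal sketch of how a client might assemble such a request, the helper below builds the query string from the two parameters. The function name buildRequestUrl and the base address are hypothetical, not part of HTMLSpitter, and the sketch assumes the server decodes percent-encoded query parameters in the usual way.

```typescript
// Wait events listed in the description above.
type WaitEvent = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

// Hypothetical helper (not part of HTMLSpitter): builds the request URL for a
// server assumed to listen at `base`.
function buildRequestUrl(base: string, target: string, wait?: WaitEvent): string {
  const requestUrl = new URL(base);
  // searchParams percent-encodes the target so that its own query string
  // cannot be mistaken for parameters addressed to the server.
  requestUrl.searchParams.set("url", target);
  if (wait !== undefined) {
    // Omitting `wait` leaves the choice to the server (networkidle0 by default).
    requestUrl.searchParams.set("wait", wait);
  }
  return requestUrl.toString();
}

console.log(buildRequestUrl("http://localhost:8000/", "https://github.com/qdm12/htmlspitter", "load"));
```

Encoding the target matters mostly when the page to prerender has its own query string, since an unencoded & would otherwise split it into separate parameters.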

For example:

http://localhost:8000/?url=https://github.com/qdm12/htmlspitter

  • The server scales up Chromium instances if needed
  • It limits the number of opened pages per instance to prevent one page crashing all the other pages
  • It has a 1 hour cache for loaded HTML
  • It has a queue system for requests once the maximum number of pages/chromium instances is reached
  • Only the amd64 architecture is supported for now, because the image requires Chrome-Beta, which is only built for amd64.
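The scaling and queueing behaviour listed above can be pictured with a small bookkeeping model. This is only a sketch under assumptions, not the project's actual implementation: the names admit, PoolState and Limits are hypothetical, the queue is modelled as a single global counter, and the -1 "no max" setting is left out for brevity.

```typescript
// Outcome of one incoming request in this simplified model.
type Decision =
  | { kind: "open"; instance: number }
  | { kind: "queued"; position: number }
  | { kind: "rejected" };

interface Limits { maxPages: number; maxBrowsers: number; maxQueueSize: number; }
interface PoolState { pagesPerInstance: number[]; queued: number; }

// Hypothetical admission logic: reuse an instance, else scale up, else queue.
function admit(state: PoolState, limits: Limits): Decision {
  // 1. Reuse the first instance that still has a free page slot.
  const free = state.pagesPerInstance.findIndex((open) => open < limits.maxPages);
  if (free !== -1) {
    state.pagesPerInstance = state.pagesPerInstance.map((open, i) => (i === free ? open + 1 : open));
    return { kind: "open", instance: free };
  }
  // 2. Otherwise start a new instance if the instance limit allows it.
  if (state.pagesPerInstance.length < limits.maxBrowsers) {
    state.pagesPerInstance.push(1);
    return { kind: "open", instance: state.pagesPerInstance.length - 1 };
  }
  // 3. Otherwise queue the request, or reject it when the queue is full.
  if (state.queued < limits.maxQueueSize) {
    state.queued += 1;
    return { kind: "queued", position: state.queued };
  }
  return { kind: "rejected" };
}

const limits: Limits = { maxPages: 2, maxBrowsers: 2, maxQueueSize: 1 };
const state: PoolState = { pagesPerInstance: [], queued: 0 };
const decisions: Decision[] = [];
for (let i = 0; i < 6; i++) {
  decisions.push(admit(state, limits));
}
console.log(decisions.map((d) => d.kind).join(", "));
// prints "open, open, open, open, queued, rejected"
```

With two pages per instance and two instances, the first four requests get a page, the fifth waits in the queue and the sixth is turned away.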

Usage

  1. Run the container

    docker run -it --rm --init -p 8000:8000 qmcgaw/htmlspitter

    You can also use docker-compose.yml.

Environment variables

Name            Default  Possible values              Description
MAX_PAGES       10       -1 or integer larger than 0  Max number of pages per Chromium instance at any time, -1 for no max
MAX_HITS        300      -1 or integer larger than 0  Max number of pages opened per Chromium instance during its lifetime (before relaunch), -1 for no max
MAX_AGE_UNUSED  60       -1 or integer larger than 0  Max age in seconds of inactivity before the browser is closed, -1 for no max
MAX_BROWSERS    10       -1 or integer larger than 0  Max number of Chromium instances at any time, -1 for no max
MAX_CACHE_SIZE  10       -1 or integer larger than 0  Max number of MB stored in the cache, -1 for no max
MAX_QUEUE_SIZE  100      -1 or integer larger than 0  Max size of queue of pages per Chromium instance, -1 for no max
LOG             normal   normal or json               Format to use to print logs
TIMEOUT         15000    -1 or integer larger than 0  Timeout in ms to load a page, -1 for no timeout
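To illustrate the "-1 or integer larger than 0" rule, here is a hedged sketch of how such variables could be validated. The function parseLimit and the sampleEnv object are hypothetical and stand in for whatever the program actually does with process.env; only the variable names and defaults come from the table above.

```typescript
// Hypothetical validator: accepts -1 (no maximum) or an integer larger than 0,
// and falls back to the documented default when the variable is unset.
function parseLimit(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") {
    return fallback;
  }
  const value = Number(raw);
  if (!Number.isInteger(value) || (value !== -1 && value <= 0)) {
    throw new Error(`${name} must be -1 or an integer larger than 0, got "${raw}"`);
  }
  return value;
}

// Sample environment used instead of process.env to keep the sketch self-contained.
const sampleEnv: Record<string, string | undefined> = { MAX_PAGES: "20", TIMEOUT: "-1" };

const config = {
  maxPages: parseLimit("MAX_PAGES", sampleEnv["MAX_PAGES"], 10),
  maxHits: parseLimit("MAX_HITS", sampleEnv["MAX_HITS"], 300),
  maxBrowsers: parseLimit("MAX_BROWSERS", sampleEnv["MAX_BROWSERS"], 10),
  timeout: parseLimit("TIMEOUT", sampleEnv["TIMEOUT"], 15000),
};

console.log(config);
```

Rejecting 0 and non-integers at startup, rather than silently clamping them, makes a mistyped value visible before any page is loaded.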

Troubleshooting

Chrome fails to launch

If you obtain the error:

{"error":"Error: Failed to launch chrome!\nFailed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Operation not permitted\n\n\nTROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md\n"}

Then you might need to run the container with the seccomp profile chrome.json from this repository:

wget https://raw.githubusercontent.com/qdm12/htmlspitter/master/chrome.json
docker run -it --rm --init --security-opt seccomp=$(pwd)/chrome.json -p 8000:8000 qmcgaw/htmlspitter

Details

Program

  • A built-in local memory cache holds the HTML content obtained during the last hour and is limited in total size (see MAX_CACHE_SIZE).
  • A built-in pool of Chromium instances creates and removes Chromium instances according to the server load.
  • Each Chromium instance has a limited number of pages so that if one page crashes Chromium, not all page loads are lost.
  • As Chromium caches content, each instance is destroyed and re-created once it reaches a certain number of page loads.
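The cache described in the first point can be modelled as a map with an expiry time and a size budget. The sketch below is an illustration under assumptions rather than the project's code: createHtmlCache is a hypothetical name, size is counted in characters, and evicting the oldest entries first is an assumed policy that the description above does not specify.

```typescript
interface CacheEntry { html: string; storedAt: number; }

// Hypothetical in-memory cache with a time-to-live and a total size budget.
function createHtmlCache(maxChars: number, ttlMs: number) {
  const entries = new Map<string, CacheEntry>();
  let totalChars = 0;

  function remove(url: string): void {
    const entry = entries.get(url);
    if (entry !== undefined) {
      totalChars -= entry.html.length;
      entries.delete(url);
    }
  }

  return {
    get(url: string, now: number): string | undefined {
      const entry = entries.get(url);
      if (entry === undefined) return undefined;
      if (now - entry.storedAt >= ttlMs) {
        remove(url); // expired: drop it and report a miss
        return undefined;
      }
      return entry.html;
    },
    set(url: string, html: string, now: number): void {
      remove(url); // replace any previous copy
      if (html.length > maxChars) return; // too large to cache at all
      // Assumed policy: evict oldest entries (Map keeps insertion order) until it fits.
      for (const oldest of Array.from(entries.keys())) {
        if (totalChars + html.length <= maxChars) break;
        remove(oldest);
      }
      entries.set(url, { html, storedAt: now });
      totalChars += html.length;
    },
    size(): number {
      return totalChars;
    },
  };
}

const oneHourMs = 60 * 60 * 1000;
const cache = createHtmlCache(10, oneHourMs);
cache.set("a", "12345", 0);
cache.set("b", "123456", 1000); // does not fit next to "a", so "a" is evicted
console.log(cache.get("a", 2000), cache.get("b", 2000), cache.size());
```

Passing the current time in as an argument keeps the sketch deterministic; a real server would more likely read Date.now() internally.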

Docker

  • chrome.json may be required depending on your host OS.
  • The --init flag is added to prevent zombie Chromium processes from lingering when the container stops the main NodeJS program.
  • A built-in healthcheck is implemented by running node build/healthcheck.js against a running instance.

Performance considerations

  • Chromium is written in C++ and is multithreaded, so it scales well with more CPU cores
  • The NodeJS program should not be the bottleneck because all the work is done by Chromium
  • The bottleneck will be CPU and especially RAM used by Chromium instance(s)
  • You can scale up by having multiple machines running the program, behind a load balancer

Development

  • Either use the Docker container development image with Visual Studio Code and the remote development extension
  • Or install Node and NPM on your machine, then run:

    # Install all dependencies
    npm i
    # Transpile the TypeScript code to JavaScript and run build/main.js
    npm run start

Test it with, for example:

wget -qO- "http://localhost:8000/?url=https://github.com/qdm12/htmlspitter"

You can also:

  • Run tests

    npm t
  • Run the server with hot reload (performs npm run start on each .ts change)

    npx nodemon
  • Build Docker

    docker build -t qmcgaw/htmlspitter .

    You can also choose the Google Chrome branch: beta (default), stable or unstable

    docker build -t qmcgaw/htmlspitter --build-arg GOOGLE_CHROME_BRANCH=unstable .
  • There are two environment variables you might find useful:

    • PORT to set the HTTP server listening port
    • CHROME_BIN which is either the path to the Chrome binary or the value Puppeteer-bundled

TODOs

  • Show Chrome version at start
  • Fake user agents
  • Prevent recursive calls to localhost
  • Format JSON or raw HTML
  • Limit Chromium instances in terms of RAM
  • Compression Gzip
  • Sync same URL with Redis (to avoid fetching the same URL twice)
  • Sync Cache with Postgresql or Redis depending on size
  • Limit data size in Postgresql according to time created
  • Unit testing
  • ReactJS GUI
  • Static binary in Scratch Docker image

Credits

License

This repository is under an MIT license