Browser Emulation using browserless

Problem: Some sites have messy source code and/or render the page with JavaScript, which makes them next to impossible to scrape with the Website Agent alone (described in issue #888).

Solution: Use browserless to emulate the browser and return a fully rendered DOM, with a Post Agent acting as the source for a Website Agent. The Post Agent calls the browserless API to get the rendered HTML of the site and sends it to the Website Agent, which can then scrape dynamic content from JavaScript-heavy pages.

To use browserless, deploy your own instance first. See https://github.com/browserless/chrome for installation instructions (a Docker image is available at https://hub.docker.com/r/browserless/chrome).

In Huginn, the Post Agent requests the rendered HTML for a given URL from browserless through an API call (https://docs.browserless.io/docs/content.html). Set the following values in the Post Agent:

post_url - set to browserless_url/content, where browserless_url is wherever the instance is hosted (if using Docker, this will be a URL outside of both the browserless and Huginn containers)
content_type - usually set to json
method - set to post
payload - set to {"url": "site_url"}, where site_url is the URL of the site that should be rendered
emit_events - set to true (this allows this agent to be used as the source of the HTML for the desired site)

The remaining keys can stay at their default values.
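The values above can be collected into a single set of options. The following is a minimal sketch in Python, assuming a hypothetical browserless instance at http://browserless.example:3000 and a hypothetical target site https://example.com. It only builds and prints the keys listed above; merge them into the agent's existing options rather than replacing the options wholesale, so that the remaining keys keep their defaults.

```python
import json

# Hypothetical values for illustration; replace with your own.
BROWSERLESS_URL = "http://browserless.example:3000"
SITE_URL = "https://example.com"


def post_agent_options(browserless_url: str, site_url: str) -> dict:
    """Build the Post Agent keys described above; all other keys keep their defaults."""
    return {
        "post_url": browserless_url.rstrip("/") + "/content",
        "content_type": "json",
        "method": "post",
        "payload": {"url": site_url},
        "emit_events": "true",
    }


if __name__ == "__main__":
    print(json.dumps(post_agent_options(BROWSERLESS_URL, SITE_URL), indent=2))
```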

Try a dry run to confirm that the agent returns rendered HTML.
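If the dry run fails, it can help to check the browserless instance on its own, outside Huginn. The following is a minimal sketch using only the Python standard library. The instance address and site URL are hypothetical placeholders, and the sketch assumes the instance accepts unauthenticated requests.

```python
import json
import urllib.request

# Hypothetical values for illustration; replace with your own.
BROWSERLESS_URL = "http://browserless.example:3000"
SITE_URL = "https://example.com"


def build_content_request(browserless_url: str, site_url: str) -> urllib.request.Request:
    """Build a POST request to the /content endpoint with a JSON body of {"url": ...}."""
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        browserless_url.rstrip("/") + "/content",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(browserless_url: str, site_url: str, timeout: float = 60.0) -> str:
    """Send the request and return the response body, which should be the rendered HTML."""
    request = build_content_request(browserless_url, site_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    html = fetch_rendered_html(BROWSERLESS_URL, SITE_URL)
    print(html[:500])  # Print only the beginning of the rendered page
```

If this prints markup that includes the JavaScript-generated content, the browserless side is working and any remaining problem is likely in the Post Agent configuration.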

Website Agent

Set the source for this agent to the Post Agent created previously.
Since the HTML for the URL has already been generated by the Post Agent, this Website Agent scrapes data_from_event instead of url. The value for data_from_event will likely be {{body}}. Set type to html.
The rest of the extraction configuration is the same as for scraping a "regular" site with the Website Agent, i.e. configure the CSS selectors according to the rendered HTML of the site being scraped.
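Put together, the Website Agent keys might look like the following. This is a minimal sketch in Python: the field name title and the selector h1 are hypothetical and must be adapted to the rendered HTML of the site being scraped. As with the Post Agent, merge these keys into the agent's existing options so that the other keys keep their defaults.

```python
import json


def website_agent_options(extract: dict) -> dict:
    """Build Website Agent keys that parse HTML carried in the incoming event."""
    return {
        "data_from_event": "{{body}}",  # Rendered HTML emitted by the Post Agent
        "type": "html",
        "extract": extract,
    }


if __name__ == "__main__":
    # Hypothetical extractor: the text content of the page's h1 element.
    extract = {"title": {"css": "h1", "value": "string(.)"}}
    print(json.dumps(website_agent_options(extract), indent=2))
```

Note that there is no url key here: the agent reads the HTML from the event instead of fetching the page itself.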
