
Browser Emulation using browserless

AxvryIoz edited this page Sep 19, 2022 · 6 revisions

Problem: Certain sites have ugly source code and/or render the page using JavaScript, making it next to impossible to use the Website Agent. (Described in issue #888)

Solution: Use a Post Agent as the source for a Website Agent to scrape data from these sites. browserless emulates the browser and returns a fully rendered DOM. The Post Agent calls the browserless API to get the rendered HTML of the site and passes it to the Website Agent, which can then scrape dynamic content from JavaScript-heavy pages.

To use browserless, deploy your own instance first. See https://github.com/browserless/chrome for installation instructions (a Docker image is available at https://hub.docker.com/r/browserless/chrome).

In Huginn, the Post Agent requests the rendered HTML for a given URL from browserless through an API call (https://docs.browserless.io/docs/content.html). These are the values to set in the Post Agent:

  • post_url - set to browserless_url/content (where browserless_url is wherever the instance is hosted; if using Docker, this will be a URL outside of both the browserless and Huginn containers)
  • content_type - usually set to json
  • method - set to post
  • payload - set to {"url": "site_url"}, where site_url is the URL of the site that should be rendered
  • emit_events - set to true (this allows the agent to be used as the source of the rendered HTML for the desired site)

The remaining keys can stay at their default values.
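The options listed above could be assembled as in the following sketch. It is an illustration only, not Huginn code: the helper name post_agent_options and the host http://browserless.example:3000 are placeholders, and writing emit_events as the string "true" is an assumption about how the option is usually entered. The script prints JSON that can be compared with the Post Agent's options.

```python
import json


def post_agent_options(browserless_url: str, site_url: str) -> dict:
    """Build Post Agent options that ask browserless for rendered HTML.

    browserless_url and site_url are placeholders supplied by the caller.
    """
    return {
        # browserless_url/content, with any trailing slash removed first
        "post_url": browserless_url.rstrip("/") + "/content",
        "content_type": "json",
        "method": "post",
        # The site that browserless should render
        "payload": {"url": site_url},
        # Assumption: the flag is entered as a string
        "emit_events": "true",
    }


if __name__ == "__main__":
    options = post_agent_options(
        "http://browserless.example:3000",  # hypothetical browserless host
        "https://example.com/",             # hypothetical site to render
    )
    print(json.dumps(options, indent=2))
```

Keys not shown here are left at their defaults, as described above.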

Do a dry run to confirm that the agent returns the rendered HTML.
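If the dry run does not return HTML, it can help to call browserless directly, outside Huginn, to rule out a problem with the instance itself. The sketch below sends the same kind of request the Post Agent is configured to make, using only the Python standard library. The function names, the host, and the timeout value are assumptions for illustration.

```python
import json
import urllib.request


def build_content_request(browserless_url: str, site_url: str) -> urllib.request.Request:
    """Prepare a JSON POST to browserless_url/content for site_url."""
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        browserless_url.rstrip("/") + "/content",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(browserless_url: str, site_url: str, timeout: float = 60.0) -> str:
    """Send the request and return the response body as text.

    Raises urllib.error.URLError if the instance cannot be reached.
    """
    request = build_content_request(browserless_url, site_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (hypothetical host, not executed here):
# html = fetch_rendered_html("http://browserless.example:3000", "https://example.com/")
# print(html[:200])
```

If this call returns markup but the dry run does not, the problem is more likely in the Post Agent options or in the network path between the Huginn and browserless hosts.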

Website Agent

  1. Set the source for this agent to the Post Agent created previously.
  2. Since the HTML for the URL has already been generated by the Post Agent, this Website Agent scrapes data_from_event instead of url. The value for data_from_event will likely be {{body}}. Set type to html.
  3. The rest of the extraction configuration is the same as for scraping a "regular" site with the Website Agent, i.e., configure the CSS selectors according to the rendered HTML of the site being scraped.
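
The steps above might translate into Website Agent options like the ones produced by this sketch. The helper name website_agent_options, the extractor name title, and the h1 selector are hypothetical; the real selectors depend on the rendered HTML of the site being scraped. The value entry is an XPath expression applied to each node matched by the CSS selector.

```python
import json


def website_agent_options(extract: dict) -> dict:
    """Build Website Agent options that parse HTML taken from an incoming event."""
    return {
        # Read the HTML from the Post Agent's event instead of fetching a url
        "data_from_event": "{{body}}",
        "type": "html",
        "extract": extract,
    }


if __name__ == "__main__":
    # Hypothetical extractor: the text of every <h1> in the rendered page
    example_extract = {"title": {"css": "h1", "value": "string(.)"}}
    print(json.dumps(website_agent_options(example_extract), indent=2))
```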