
Selenium basic integration #444

Open · dgoiko wants to merge 10 commits into master
Conversation

@dgoiko commented May 10, 2020

Very basic Selenium integration. This is not intended to be a full Selenium crawler like Nutch; the main goal is to provide a simple way to crawl full-JS pages without directly calling the REST APIs. If you're trying to navigate simple HTML pages with, let's say, a POST form, I'd recommend #419.

For those who don't know Selenium, it is a browser automation tool designed to let developers automatically test their websites on different browsers. Some crawlers, like Nutch, integrate Selenium in order to provide full JS rendering capabilities to the crawler. This PR implements a very naive Selenium crawl using JBrowserDriver as a headless browser. Please note that JBrowserDriver does NOT support Java versions newer than 8. HtmlUnitDriver provides Java 11 compatibility, but it is too strict with JS that is not perfectly formed (while normal browsers accept it and manage to render it). Even google.com failed to load with HtmlUnitDriver because there's a catch without brackets somewhere.
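
For reference, fetching a rendered page with JBrowserDriver through the plain WebDriver API looks roughly like this; this is only a minimal sketch of the driver usage, not the exact wiring of the fetcher in this PR:

```java
import org.openqa.selenium.WebDriver;

import com.machinepublishers.jbrowserdriver.JBrowserDriver;
import com.machinepublishers.jbrowserdriver.Settings;

public class JBrowserFetchSketch {
    public static void main(String[] args) {
        // JBrowserDriver implements the standard WebDriver interface.
        WebDriver driver = new JBrowserDriver(Settings.builder().build());
        try {
            // The page's JS is executed before getPageSource() returns.
            driver.get("https://example.com/js-heavy-page");
            String renderedHtml = driver.getPageSource();
            System.out.println("Rendered length: " + renderedHtml.length());
        } finally {
            driver.quit(); // always shut down the headless browser process
        }
    }
}
```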

Connections established through Selenium are not counted in the same pool as those opened with HttpClient, so the configured connection limits are not taken into consideration. Further commits will attempt to resolve this issue, but it is not straightforward.

Selenium requests will NOT be intercepted by the credentials interceptors. FormLogin should work, though.

You can define inclusions / exclusions in the new SeleniumCrawlConfig class to determine which URLs will be visited using Selenium and which won't. Starting with a Selenium seed is not possible at the moment (although it would be possible using the new functions created in my POST CAPABILITIES MR, which allow passing WebURLs to the addSeed methods).
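
To illustrate the intent, here is a configuration sketch; the setter names on SeleniumCrawlConfig are assumed for the example and may not match the ones actually introduced in this PR:

```java
// Hypothetical sketch: addSeleniumInclusion / addSeleniumExclusion are assumed
// names, not necessarily the PR's actual API.
SeleniumCrawlConfig config = new SeleniumCrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler4j");

// URLs matching an inclusion pattern are fetched through the headless browser;
// everything else keeps going through the regular HttpClient-based fetcher.
config.addSeleniumInclusion("https://example\\.com/app/.*");
config.addSeleniumExclusion(".*\\.(css|js|png|jpg)$");
```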

Please note that the Selenium API does NOT provide header information, so headers won't be available in the Page class. Selenium also best-guesses the content encoding, which keeps us from directly extracting the byte array. For compatibility, the output String is converted to a UTF-8 byte array; however, if Selenium did not properly detect the encoding, the charset will be messed up.
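
Concretely, the fallback described above boils down to something like the following (a sketch against the standard WebDriver API; the PR's fetcher may differ in detail):

```java
import java.nio.charset.StandardCharsets;

import org.openqa.selenium.WebDriver;

final class RenderedContent {
    // Selenium only exposes the rendered markup as a String, so the bytes stored
    // in Page are re-encoded as UTF-8 regardless of the charset the site used.
    static byte[] toContentData(WebDriver driver) {
        String renderedHtml = driver.getPageSource();
        // If Selenium guessed the original encoding wrong, non-ASCII characters
        // are already corrupted in this String and the UTF-8 bytes inherit that.
        return renderedHtml.getBytes(StandardCharsets.UTF_8);
    }
}
```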

A full Selenium integration would require modifying the crawler too deeply, but right now the active Selenium headless browser window is available through page#getFetchedResult. This is a bit inconvenient, as it forces you to perform instanceof checks in order to access it.
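
Inside a WebCrawler subclass that access pattern would look roughly like this; page#getFetchedResult comes from this PR, while SeleniumFetchResult and getDriver() are assumed names used only for the sketch:

```java
// Sketch only: "SeleniumFetchResult" and "getDriver()" are assumed names; check
// the PR for the actual classes exposed through page#getFetchedResult.
@Override
public void visit(Page page) {
    Object fetched = page.getFetchedResult();
    if (fetched instanceof SeleniumFetchResult) {
        WebDriver driver = ((SeleniumFetchResult) fetched).getDriver();
        // The headless browser window is still open here, so the rendered page
        // can be inspected or driven further (clicks, scrolling, executing JS).
    }
}
```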

I'd recommend providing this as an optional artifact, so users who will never use this feature don't need to include Selenium dependencies in their projects. The only things that need to be in the main package are the "selenium" flag for WebURLs and the extracted interfaces for Parser, Fetcher and FetchResult, which would allow using the custom ones created for Selenium.

I'll be adding more features and configurations for Selenium in further commits.

Some of the extracted interfaces in these commits are not really necessary, but I needed the Parser and PageFetcher interfaces, so I decided to start from my existing branch of separated interfaces.

Extracted interfaces from Parser and PageFetcher in order to make it easier to create totally custom classes
Made a silly change to an error in the Javadoc in order to make a commit and pass the merge checks again, now that the bug in the Java 8 checks is fixed in the repo
There was an HTTP fetch error in the Java 11 test. Commit to pass the test again
This modification would allow using a database other than Sleepycat easily.
Allows using any implementation for the DocIDServer, not only Sleepycat
Very basic selenium integration

@dgoiko (Author) commented May 10, 2020

This is just a starting point. If someone really wanted to integrate Selenium with crawler4j, studying the way Nutch actually does it would be a good starting point.

Plugin: protocol-selenium
Lib: lib-selenium
Protocol: protocol-interactive-selenium

I'll get into this as soon as I have some spare time for it. Right now, this PR is a quick and dirty solution I urgently needed to implement for a project, and I thought someone would find it useful, so I extracted it from my codebase and prepared this PR.

Selenium now sees the cookies generated by HttpClientRequest and vice-versa
All Selenium classes are now in a new package