
Selenium basic integration #444

Open · dgoiko wants to merge 10 commits into master
Conversation

@dgoiko commented May 10, 2020

Very basic Selenium integration. This is not intended to be a full Selenium crawler like Nutch; the main goal is to provide a simple way to crawl full-JS pages without directly calling the REST APIs. If you're trying to navigate simple HTML pages with, let's say, a POST form, I'd recommend #419.

For those who don't know Selenium, it is a browser automation tool designed to let developers automatically test their websites on different browsers. Some crawlers, like Nutch, integrate Selenium in order to provide full JS rendering capabilities to the crawler. This PR implements a very naive Selenium crawl using JBrowserDriver as a headless browser. Please note that JBrowserDriver does NOT support Java versions newer than 8. HtmlUnitDriver provides Java 11 compatibility, but it is too strict with JS that is not perfectly formed (while normal browsers accept it and manage to render it). Even google.com failed to load with HtmlUnitDriver because there's a catch without brackets somewhere.
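
For reference, fetching a rendered page with JBrowserDriver through the plain WebDriver API looks roughly like this; this is only a minimal sketch of the driver usage, not the exact wiring of the fetcher in this PR:

```java
import org.openqa.selenium.WebDriver;

import com.machinepublishers.jbrowserdriver.JBrowserDriver;
import com.machinepublishers.jbrowserdriver.Settings;

public class JBrowserFetchSketch {
    public static void main(String[] args) {
        // JBrowserDriver implements the standard WebDriver interface.
        WebDriver driver = new JBrowserDriver(Settings.builder().build());
        try {
            // The page's JS is executed before getPageSource() returns.
            driver.get("https://example.com/js-heavy-page");
            String renderedHtml = driver.getPageSource();
            System.out.println("Rendered length: " + renderedHtml.length());
        } finally {
            driver.quit(); // always shut down the headless browser process
        }
    }
}
```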

Connections established through Selenium are not counted in the same pool as those opened with HttpClient, so the configured connection limits are not taken into consideration. Further commits will attempt to resolve this issue, but it is not straightforward.

Selenium requests will NOT be intercepted by the credentials interceptors. FormLogin should work, though.

You can define inclusions / exclusions in the new SeleniumCrawlConfig class to determine which URLs will be visited using Selenium and which won't. Starting with a Selenium seed is not possible at the moment (although it would be possible using the new functions created in my POST CAPABILITIES MR, which allow passing WebURLs to the addSeed methods).
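
To illustrate the intent, here is a configuration sketch; the setter names on SeleniumCrawlConfig are assumed for the example and may not match the ones actually introduced in this PR:

```java
// Hypothetical sketch: addSeleniumInclusion / addSeleniumExclusion are assumed
// names, not necessarily the PR's actual API.
SeleniumCrawlConfig config = new SeleniumCrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler4j");

// URLs matching an inclusion pattern are fetched through the headless browser;
// everything else keeps going through the regular HttpClient-based fetcher.
config.addSeleniumInclusion("https://example\\.com/app/.*");
config.addSeleniumExclusion(".*\\.(css|js|png|jpg)$");
```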

Please note that the Selenium API does NOT provide header information, so headers won't be available in the Page class. Selenium also best-guesses the content encoding, which keeps us from directly extracting the byte array. For compatibility, the output String is converted to a UTF-8 byte array; however, if Selenium did not properly detect the encoding, the charset will be messed up.
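
Concretely, the fallback described above boils down to something like the following (a sketch against the standard WebDriver API; the PR's fetcher may differ in detail):

```java
import java.nio.charset.StandardCharsets;

import org.openqa.selenium.WebDriver;

final class RenderedContent {
    // Selenium only exposes the rendered markup as a String, so the bytes stored
    // in Page are re-encoded as UTF-8 regardless of the charset the site used.
    static byte[] toContentData(WebDriver driver) {
        String renderedHtml = driver.getPageSource();
        // If Selenium guessed the original encoding wrong, non-ASCII characters
        // are already corrupted in this String and the UTF-8 bytes inherit that.
        return renderedHtml.getBytes(StandardCharsets.UTF_8);
    }
}
```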

A full Selenium integration would require modifying the crawler too deeply, but right now the active Selenium headless browser window is available through page#getFetchedResult. This is a bit inconvenient, as it forces you to perform instanceof checks in order to access it.
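
Inside a WebCrawler subclass that access pattern would look roughly like this; page#getFetchedResult comes from this PR, while SeleniumFetchResult and getDriver() are assumed names used only for the sketch:

```java
// Sketch only: "SeleniumFetchResult" and "getDriver()" are assumed names; check
// the PR for the actual classes exposed through page#getFetchedResult.
@Override
public void visit(Page page) {
    Object fetched = page.getFetchedResult();
    if (fetched instanceof SeleniumFetchResult) {
        WebDriver driver = ((SeleniumFetchResult) fetched).getDriver();
        // The headless browser window is still open here, so the rendered page
        // can be inspected or driven further (clicks, scrolling, executing JS).
    }
}
```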

I'd recommend providing this as an optional artifact, so users who will never use this feature don't need to include Selenium dependencies in their projects. The only things that need to be in the main package are the "selenium" flag for WebURLs and the extracted interfaces for Parser, Fetcher and FetchResult, which would allow using the custom ones created for Selenium.

I'll be adding more features and configurations for Selenium in further commits.

Some of the extracted interfaces in these commits are not really necessary, but I needed the Parser and PageFetcher interfaces, so I decided to start from my existing branch of separated interfaces.

Extracted interfaces from Parser and PageFetcher in order to make it easier to create totally custom classes
Made a silly change to an error in the Javadoc in order to make a commit and pass the merge checks again, now that the bug in the Java 8 checks is fixed in the repo
There was an HTTP fetch error in the Java 11 test. Commit to pass the test again
This modification would allow using a database other than Sleepycat easily.
Allows using any implementation for the DocIDServer, not only Sleepycat
Very basic selenium integration

@dgoiko (Author) commented May 10, 2020

This is just a starting point. If someone really wanted to integrate Selenium with crawler4j, studying the way Nutch actually does it would be a good starting point.

Plugin: protocol-selenium
Lib: lib-selenium
Protocol: protocol-interactive-selenium

I'll get into this as soon as I have some spare time for it. Right now, this PR is a quick and dirty solution I urgently needed to implement for a project, and I thought someone would find it useful, so I extracted it from my codebase and prepared this PR.

Selenium now sees the cookies generated by HttpClientRequest and vice-versa
All Selenium classes are now in a new package