Retry on ContentFetchError #437

Open

wants to merge 4 commits into base: master
Conversation

@dgoiko commented on Feb 3, 2020

Allows fetching a WebURL again when the fetch fails. Features:

  • Stores the number of retry attempts with the WebURL in order to limit them.
  • Maximum number of retries is configurable through ConfigCrawler.
  • Deprecates onContentFetchError(Page) and adds onContentFetchError(Page, Throwable) to pass information about the fetch error.
  • onContentFetchErrorNotFinal lets subclasses abort the reschedule based on the exception or the Page (see the sketch after this comment).
  • Provides a protected method so subclasses can re-schedule URLs.
  • Handles HttpHostConnectException and UnknownHostException individually.

closes #99
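
A minimal sketch of how a subclass might use these hooks, assuming this PR's branch: onContentFetchError(Page, Throwable) is the new callback named above, while the class name RetryingCrawler and the exact signature of onContentFetchErrorNotFinal (shown here returning a boolean that allows or aborts the reschedule) are assumptions, not confirmed API.

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

import java.net.UnknownHostException;

public class RetryingCrawler extends WebCrawler {

    // New hook from this PR: also receives the Throwable that caused the fetch error.
    @Override
    protected void onContentFetchError(Page page, Throwable t) {
        logger.warn("Content fetch failed for {}: {}", page.getWebURL().getURL(), t.toString());
    }

    // Assumed shape of the "not final" hook described above: returning false
    // aborts the reschedule for this error; the exact signature is an assumption.
    protected boolean onContentFetchErrorNotFinal(Page page, Throwable t) {
        // Skip retries for hosts that cannot be resolved at all.
        return !(t instanceof UnknownHostException);
    }
}
```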

@dgoiko changed the title from Retry on error to Retry on ContentFetchError on Feb 3, 2020
HttpHostConnectException and UnknownHostException are handled individually, as they're relevant to the crawling task.

The crawler can be configured to retry on HttpHostConnectException. This is disabled by default, even if maxRetries > 0.
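
A configuration sketch under two assumptions: the configuration class is crawler4j's CrawlConfig (referred to as ConfigCrawler in the comment above), and the setter names setMaxRetries and setRetryOnHostConnectException are hypothetical placeholders for whatever this PR actually exposes.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class RetryConfigSketch {
    public static void main(String[] args) {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");

        // Hypothetical setter names: the PR makes these options configurable,
        // but the exact method names below are assumptions, not confirmed API.
        config.setMaxRetries(3);                      // assumed: cap retry attempts per WebURL
        config.setRetryOnHostConnectException(true);  // assumed: opt in; off by default even when maxRetries > 0
    }
}
```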
Development

Successfully merging this pull request may close these issues.

When a read timeout occurs, crawler4j doesn't try to visit the webpage once again