No DocIDs will be created if maxPagesToFetch is reached (most times). #430

Open
wants to merge 2 commits into
base: master
from

dgoiko
@dgoiko dgoiko commented Jan 24, 2020

Fixes #413

This is not a thread-safe solution: one thread may fill the DocIDServer while another thread is still looping. However, the amount of memory wasted will be reduced.

The remaining problem with this hotfix:
Suppose there is only 1 slot left before maxPagesToFetch is reached.

  1. Thread A creates a new docID (docA). The DocIDServer sees that there is still room for pages, so it creates it.
  2. Thread B creates a new docID (docB). Nothing has been scheduled yet, so the DocIDServer still sees room and creates it as well.
  3. Thread C schedules docA. It is accepted.
  4. Thread D schedules docB. It is rejected, but its docID has already been created.
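
The four steps above can be replayed deterministically in a minimal sketch. It assumes the hotfix decides whether there is room by comparing the number of already scheduled pages against maxPagesToFetch. `DocIdRaceSketch`, `newDocId`, `schedule` and `replay` are hypothetical names used only for illustration, not crawler4j's actual classes or signatures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical model; these are not crawler4j's real classes or signatures.
class DocIdRaceSketch {
    final int maxPagesToFetch;
    final Map<String, Integer> docIds = new LinkedHashMap<>(); // stand-in for the DocIDServer
    int scheduledPages = 0;                                    // stand-in for the Frontier's counter

    DocIdRaceSketch(int maxPagesToFetch) {
        this.maxPagesToFetch = maxPagesToFetch;
    }

    // Assumed hotfix behaviour: refuse new docIDs once the scheduled count has hit the limit.
    int newDocId(String url) {
        if (scheduledPages >= maxPagesToFetch) {
            return -1;
        }
        int id = docIds.size() + 1;
        docIds.put(url, id);
        return id;
    }

    // Scheduling is accepted only while there is still room.
    boolean schedule(String url) {
        if (scheduledPages >= maxPagesToFetch) {
            return false;
        }
        scheduledPages++;
        return true;
    }

    // Replays the interleaving from the list above on a single thread.
    static DocIdRaceSketch replay() {
        DocIdRaceSketch s = new DocIdRaceSketch(1); // only 1 slot left
        s.newDocId("docA");  // step 1: room available, docID created
        s.newDocId("docB");  // step 2: nothing scheduled yet, docID created too
        s.schedule("docA");  // step 3: accepted
        s.schedule("docB");  // step 4: rejected, but docB's docID already exists
        return s;
    }

    public static void main(String[] args) {
        DocIdRaceSketch s = replay();
        System.out.println("docIDs created: " + s.docIds.size());   // prints "docIDs created: 2"
        System.out.println("pages scheduled: " + s.scheduledPages); // prints "pages scheduled: 1"
    }
}
```

The check and the later scheduling are separate steps, so any docID created between them can outlive the limit. The hotfix narrows that window but does not close it.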

There should be a method to remove docB from the DocIDServer when it gets rejected, which means making the Frontier aware of the DocIDServer. However, since CrawlController has setFrontier and setDocIdServer methods, this wouldn't be safe either: someone playing around with multiple DocIDServers and Frontiers could cause unpredictable situations.

Since that isn't a perfect solution either, I decided to keep the hotfix as simple as possible. However, if you think it would be OK, I can quickly add another commit that allows the Frontier to remove unscheduled URLs from the DocIDServer, with the caveat stated above.
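
A rough sketch of what that follow-up could look like, under the assumption that the Frontier receives a reference to the DocIDServer at construction time. `CleanupSketch`, `DocIds`, `Frontier`, `create`, `remove` and `replay` are hypothetical stand-ins, not crawler4j's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; not crawler4j's real classes or signatures.
class CleanupSketch {

    static class DocIds {
        private final Map<String, Integer> ids = new HashMap<>();
        private int lastId = 0;

        synchronized int create(String url) {
            ids.put(url, ++lastId);
            return lastId;
        }

        // Assumed new method: forget a URL that was never scheduled.
        synchronized void remove(String url) {
            ids.remove(url);
        }

        synchronized boolean contains(String url) {
            return ids.containsKey(url);
        }

        synchronized int size() {
            return ids.size();
        }
    }

    static class Frontier {
        private final DocIds docIds; // fixed at construction; see the caveat below
        private final int maxPagesToFetch;
        private int scheduled = 0;

        Frontier(DocIds docIds, int maxPagesToFetch) {
            this.docIds = docIds;
            this.maxPagesToFetch = maxPagesToFetch;
        }

        synchronized boolean schedule(String url) {
            if (scheduled >= maxPagesToFetch) {
                docIds.remove(url); // rejected: release the docID created earlier
                return false;
            }
            scheduled++;
            return true;
        }
    }

    // Same interleaving as in the description, with only 1 slot left.
    static DocIds replay() {
        DocIds docIds = new DocIds();
        Frontier frontier = new Frontier(docIds, 1);
        docIds.create("docA");
        docIds.create("docB");
        frontier.schedule("docA"); // accepted
        frontier.schedule("docB"); // rejected, docB is removed again
        return docIds;
    }

    public static void main(String[] args) {
        System.out.println(replay().size()); // prints 1
    }
}
```

The sketch only works while the Frontier and the DocIDServer it was given stay paired. If either one is swapped through setFrontier or setDocIdServer, the Frontier would clean up the wrong server, which is the caveat mentioned above.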

@dgoiko dgoiko changed the title No DocIDs will be created if maxPagesToFetch No DocIDs will be created if maxPagesToFetch is reached. Jan 24, 2020
@dgoiko dgoiko changed the title No DocIDs will be created if maxPagesToFetch is reached. No DocIDs will be created if maxPagesToFetch is reached (most times). Jan 24, 2020
Successfully merging this pull request may close these issues.

DocIDServer contains more than configured maxPagesToFetch Url count