Seed redirect causes one less URL to be crawled due to limit #508

Closed
ikreymer opened this issue Mar 24, 2024 · 0 comments · Fixed by #517
Labels
bug Something isn't working

A follow-up to #475: the extra seeds added for redirects are currently not subtracted from the limit, so one fewer page is crawled if one of the seeds redirects. E.g., with 1 seed and a limit of 10, if that seed redirects, only 9 pages in total will be crawled.

The solution is to subtract the number of 'extra seeds' from the size of the seen list when computing the limit.
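
To make the off-by-one concrete, here is a minimal sketch of the limit check before and after the proposed change. Everything in it is an assumption made for illustration: `CrawlCounts`, `isAtLimitNaive`, `isAtLimit`, and `simulate` are hypothetical names, not the crawler's actual state API, and the simulation simply pretends that new links are always available.

```typescript
// Hypothetical model of the crawl counters; not the crawler's real types.
interface CrawlCounts {
  seenCount: number; // size of the seen list, including redirect seeds
  extraSeeds: number; // seeds added because an original seed redirected
}

type LimitCheck = (counts: CrawlCounts, pageLimit: number) => boolean;

// Current behavior as described above: extra seeds count against the limit.
const isAtLimitNaive: LimitCheck = (counts, pageLimit) =>
  counts.seenCount >= pageLimit;

// Proposed behavior: subtract the extra seeds before comparing to the limit.
const isAtLimit: LimitCheck = (counts, pageLimit) =>
  counts.seenCount - counts.extraSeeds >= pageLimit;

// Toy simulation with one seed: keep queueing URLs until the check says stop,
// then report how many distinct pages would actually be crawled.
function simulate(
  check: LimitCheck,
  pageLimit: number,
  seedRedirects: boolean,
): number {
  const counts: CrawlCounts = { seenCount: 1, extraSeeds: 0 };
  if (seedRedirects) {
    // The redirect target is recorded as an extra seed in the seen list,
    // but the original seed and its target yield only one crawled page.
    counts.seenCount += 1;
    counts.extraSeeds += 1;
  }
  while (!check(counts, pageLimit)) {
    counts.seenCount += 1; // pretend another link was discovered and queued
  }
  return counts.seenCount - counts.extraSeeds;
}

console.log(simulate(isAtLimitNaive, 10, true)); // 9
console.log(simulate(isAtLimit, 10, true)); // 10
```

With a non-redirecting seed both checks behave the same, since `extraSeeds` is 0, so the adjustment only changes crawls where a seed redirects.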

ikreymer added the bug label Mar 24, 2024
ikreymer self-assigned this Mar 24, 2024
ikreymer added a commit that referenced this issue Mar 24, 2024
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is done

fixes #508
(for 1.0.3 release)
ikreymer added a commit that referenced this issue Mar 26, 2024

- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is done

fixes #508
(for 1.0.3 release)
ikreymer added a commit that referenced this issue Mar 26, 2024
sitemap improvements: gz support + application/xml + extraHops fix #511
- follow up to #496
- support parsing sitemap urls that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap content-types (add test for both)
- ignore extraHops for sitemap found URLs by setting to past extraHops limit (otherwise, all sitemap URLs would be treated as links from seed page)

fixes redirected seed (from #476) being counted against page limit: #509
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is done

fixes #508