fixes redirected seed (from #475) being counted against page limit: #509

Merged
ikreymer merged 2 commits into 1.0.3-release on Mar 26, 2024

ikreymer (Member)

  • subtract extraSeeds when computing limit
  • don't include redirect seeds in seen list when serializing
  • tests: adjust saved-state-test to also check total pages when crawl is done

fixes #508
(for 1.0.3 release)
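
For readers skimming this thread, the first two bullets can be sketched roughly as follows. This is only an illustration, not the code in `src/util/state.ts`: the `CrawlStateSketch` shape and the helper names `pagesCounted`, `isAtLimit`, and `serializeSeen` are hypothetical, and it assumes a redirected seed is stored both in the seen list and in `extraSeeds`.

```typescript
// Hypothetical, simplified crawl state; the real state object is more involved.
interface CrawlStateSketch {
  seen: Set<string>; // every URL queued so far, including redirected seeds
  extraSeeds: string[]; // seeds added because an original seed redirected
}

// A redirected seed appears in `seen` but should not use up a page slot,
// so it is subtracted before comparing against the limit.
function pagesCounted(state: CrawlStateSketch): number {
  return state.seen.size - state.extraSeeds.length;
}

// Assumption of this sketch: a limit of 0 or less means "no limit".
function isAtLimit(state: CrawlStateSketch, pageLimit: number): boolean {
  return pageLimit > 0 && pagesCounted(state) >= pageLimit;
}

// When serializing, leave redirect seeds out of the seen list so they are
// not counted a second time after the state is restored.
function serializeSeen(state: CrawlStateSketch): string[] {
  const extra = new Set(state.extraSeeds);
  return Array.from(state.seen).filter((url) => !extra.has(url));
}
```

With one seed that redirects plus one linked page, `seen` holds three URLs, but only two count toward the limit under these assumptions.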

ikreymer requested a review from tw4l on March 24, 2024, 17:58

tw4l (Contributor) left a comment

These changes look good! The exclusions test failure seems to be an issue with the 1.0.3-release branch rather than with these changes; it shows up in #511 as well. We should fix that before release.

src/util/state.ts (outdated review thread, resolved)
ikreymer merged commit bf5cbb0 into 1.0.3-release on Mar 26, 2024
2 of 4 checks passed
ikreymer added a commit that referenced this pull request Mar 26, 2024
sitemap improvements: gz support + application/xml + extraHops fix #511
- follow up to #496
- support parsing sitemap urls that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add test for both)
- ignore extraHops for sitemap found URLs by setting to past extraHops
limit (otherwise, all sitemap URLs would be treated as links from seed
page)
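
A rough sketch of the three sitemap checks listed above, for illustration only. The helper names (`isSitemapContentType`, `needsGunzip`, `sitemapUrlExtraHops`, `withinExtraHops`) are hypothetical and not taken from the crawler; ignoring content-type parameters and case, and comparing hops with `<=`, are assumptions of this sketch.

```typescript
// Content types accepted as sitemaps, per the commit message.
const SITEMAP_CONTENT_TYPES = ["application/xml", "text/xml"];

// Assumption: parameters such as "; charset=utf-8" and letter case are ignored.
function isSitemapContentType(header: string | null): boolean {
  if (!header) {
    return false;
  }
  const semi = header.indexOf(";");
  const mime = (semi === -1 ? header : header.slice(0, semi)).trim().toLowerCase();
  return SITEMAP_CONTENT_TYPES.indexOf(mime) !== -1;
}

// Sitemap URLs whose path ends in .gz need gzip decompression before parsing.
function needsGunzip(url: string): boolean {
  return new URL(url).pathname.endsWith(".gz");
}

// URLs found in a sitemap get a hop count one past the limit, so they are
// not treated as links discovered from the seed page.
function sitemapUrlExtraHops(extraHopsLimit: number): number {
  return extraHopsLimit + 1;
}

// Assumption: a URL is within the extra-hops allowance while hops <= limit.
function withinExtraHops(hops: number, extraHopsLimit: number): boolean {
  return hops <= extraHopsLimit;
}
```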

fixes redirected seed (from #476) being counted against page limit: #509
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is
done

fixes #508
ikreymer added a commit to webrecorder/browsertrix that referenced this pull request Apr 4, 2024
… a redirect

following changes in webrecorder/browsertrix-crawler#475, webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed
to the seen list. To account for this, it needs to be subtracted to get the actual page count.
ikreymer added a commit to webrecorder/browsertrix that referenced this pull request Apr 4, 2024
… a redirect (#1649)

Following changes in webrecorder/browsertrix-crawler#475,
webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed
to the seen list. To account for this, it needs to be subtracted to get
the total page count.