Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better tracking of failed requests + logging context exclude #485

Merged
merged 5 commits into from Mar 7, 2024

Conversation

ikreymer
Copy link
Member

@ikreymer ikreymer commented Mar 7, 2024

  • add --logExcludeContext for log contexts that should be excluded (while --logContext specifies which are to be included)
  • enable 'recorderNetwork' logging for debugging CDP network
  • create default log context exclude list (containing: screencast, recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with status code 0

  • cleanup cdp handler methods
  • intercept requestWillBeSent to track requests that started (but may not complete)
  • fix shouldSkip() still working if no url is provided (eg. check only headers)
  • set status to 0 for async fetch failures
  • remove responseServedFromCache interception, as response data generally not available then, and responseReceived is still called
  • pageinfo: include page requests that failed with status code 0, also include 'error' status if available.
  • ensure page is closed on failure
  • ensure pageinfo still written even if nothing else is crawled for a page
  • track cached responses, add to debug logging (can also add to pageinfo later if needed)

tests: add pageinfo test for crawling invalid URL, which should still result in pageinfo record with status code 0

bump to 1.0.0-beta.7

- add --logExcludeContext for log contexts that should be excluded
- enable 'recorderNetwork' logging
- create default log context exclude list (containing: screencast, recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: track failed requests
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started, but may not complete
- fix shouldSkip() still working if no url is provided (eg. check only headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data generally not available then, and responseReceive is still called
- pageinfo: include page requests that failed with status code 0
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a page

tests: add pageinfo test for crawling invalid URL, which should still result in pageinfo record with status code 0

bump to 1.0.0-beta.7
@ikreymer ikreymer requested a review from tw4l March 7, 2024 05:08
cache status: track if a resource is loaded from browser cache, include in debug logging
(though not yet in pageinfo, to revisit later)
Copy link
Contributor

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Tested and working well. The pages.jsonl file produced pre-WACZ generation includes all of the pages now. Once we switch to js-wacz and use the existing pages files, we should be all good!

@tw4l tw4l merged commit 9f18a49 into main Mar 7, 2024
4 checks passed
@tw4l tw4l deleted the track-failed-requests-and-logging branch March 7, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants