warc: add Network.resourceType (https://chromedevtools.github.io/devt… #481

Merged
ikreymer merged 3 commits into dev-1.0.0 from add-resourceType on Mar 5, 2024

ikreymer (Member) commented Mar 4, 2024

Add the resourceType value from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType as a WARC-Resource-Type header. Fixes #451

tw4l (Contributor) left a comment

Tested, looks good

ikreymer (Member, Author) commented Mar 4, 2024

The only question for me is whether https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType is standard enough.

There is the extension-focused https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType, which is slightly more standardized (there are still differences between Chrome and Firefox), but it's not a one-to-one mapping with the CDP resourceType that we have access to.
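
To make the "not a one-to-one mapping" point concrete, here is a rough TypeScript sketch. The table is hand-written for illustration and is an assumption, not something taken from this PR: it covers only a subset of the CDP values, and the names `cdpToWebRequest`, `ambiguousCdpTypes`, and `sharedWebRequestTypes` are hypothetical.

```typescript
// Illustrative, hand-written comparison (an assumption, not part of this PR).
// Keys: a subset of CDP Network.ResourceType values.
// Values: webRequest.ResourceType values each one could correspond to.
const cdpToWebRequest: Record<string, string[]> = {
  Document: ["main_frame", "sub_frame"], // CDP does not separate top-level and framed documents
  Stylesheet: ["stylesheet"],
  Image: ["image"],
  Script: ["script"],
  XHR: ["xmlhttprequest"],
  Fetch: ["xmlhttprequest"], // webRequest has no separate value for fetch()
  Other: ["other"],
};

// CDP types that could correspond to more than one webRequest type.
function ambiguousCdpTypes(map: Record<string, string[]>): string[] {
  return Object.entries(map)
    .filter(([, targets]) => targets.length > 1)
    .map(([cdpType]) => cdpType);
}

// webRequest types that more than one CDP type maps onto.
function sharedWebRequestTypes(map: Record<string, string[]>): string[] {
  const counts: Record<string, number> = {};
  for (const targets of Object.values(map)) {
    for (const target of targets) {
      counts[target] = (counts[target] ?? 0) + 1;
    }
  }
  return Object.entries(counts)
    .filter(([, n]) => n > 1)
    .map(([name]) => name);
}

console.log(ambiguousCdpTypes(cdpToWebRequest));
console.log(sharedWebRequestTypes(cdpToWebRequest));
```

In this table, `Document` is ambiguous in one direction and `xmlhttprequest` is shared in the other, so a conversion either way would lose information.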

tw4l (Contributor) commented Mar 4, 2024

> Only question for me is if: https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType is standard enough.
>
> There is the extension-focused: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType which is slightly more standardized (still differences between Chrome and Firefox) but it's not a one-to-one mapping with the CDP resourceType that we have access to

It's a good question, but if the CDP resourceType is what we have access to and we have a reference we can point to, I think that's probably sufficient at this stage, especially since we are only crawling with Chromium-based browsers. That holds unless there's a very clear standard to map to cleanly, and I don't know that there is one in this case.

ikreymer (Member, Author) commented Mar 4, 2024

This is probably fine for now, but I also opened an issue in the WARC spec repo about possibly standardizing how this is approached: iipc/warc-specifications#96

ikreymer (Member, Author) commented Mar 5, 2024

Per discussion, it appears that Puppeteer and Playwright both use all-lowercase resourceType values. To match their convention, this also sets the WARC-Resource-Type header to all lowercase and updates the pageinfo record to store the lowercase version as well.
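
For readers who want to see the lowercase convention in code, a minimal TypeScript sketch follows. It is not the crawler's actual implementation: `addResourceTypeHeader` and `RESOURCE_TYPE_HEADER` are hypothetical names, and the headers are modeled as a plain object purely for illustration.

```typescript
// Hypothetical helper, not the crawler's actual code.
// Header name as used in this PR.
const RESOURCE_TYPE_HEADER = "WARC-Resource-Type";

// Returns a copy of the WARC headers with the CDP resource type added in
// lowercase (e.g. "XHR" -> "xhr"), matching the Puppeteer / Playwright
// convention. If no resource type is known, the headers are returned unchanged.
function addResourceTypeHeader(
  warcHeaders: Record<string, string>,
  cdpResourceType?: string,
): Record<string, string> {
  if (!cdpResourceType) {
    return warcHeaders;
  }
  return {
    ...warcHeaders,
    [RESOURCE_TYPE_HEADER]: cdpResourceType.toLowerCase(),
  };
}

const headers = addResourceTypeHeader({ "WARC-Type": "response" }, "XHR");
console.log(headers[RESOURCE_TYPE_HEADER]); // prints "xhr"
```

Normalizing once, at the point where the header is written, means later checks can compare against a single lowercase spelling rather than handling both `"XHR"` and `"xhr"`.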

@ikreymer ikreymer requested a review from tw4l March 5, 2024 01:32
@ikreymer ikreymer merged commit 5a47cc4 into dev-1.0.0 Mar 5, 2024
4 checks passed
@ikreymer ikreymer deleted the add-resourceType branch March 5, 2024 02:11
ikreymer added a commit that referenced this pull request Mar 5, 2024
follow up to #481, check reqresp.resourceType with lowercase value
ikreymer added a commit that referenced this pull request Mar 5, 2024
follow up to #481, check reqresp.resourceType with lowercase value
just set message based on resourceType value