WARC-Resource-Type field possibilities (feedback wanted) #96

Open
ikreymer opened this issue Mar 4, 2024 · 7 comments

@ikreymer
Member

ikreymer commented Mar 4, 2024

Browsers have different ways of reporting the 'resource type' for any resource that's being fetched. When using browser-based crawling, it is often easy to access this 'resource type' and store it in a custom WARC header.

It is possible to introduce a WARC-Resource-Type header to store this type. Unfortunately, there isn't a single standard of 'resource types' and various browser APIs expose different variations on this.

If a resource type is written to a WARC header, is there a way to make it future proof to support different vocabularies?

Some possibilities include:

  • Chrome Debug Protocol (CDP) resource type
    this is easiest for Chromium-browser based crawling as these fields are directly accessible, but is not especially well standardized and could change anytime.

  • Fetch Request.destination - this is a well-standardized vocabulary, but it is not a one-to-one mapping and may not be accessible for non-Fetch data.

  • Extension API webRequest.resourceType - better standardized and supported by all the major browsers with some differences for browser extensions. Not quite one-to-one with CDP types.

One approach to make this more future proof might be to prefix the resourceType with a namespace based on where the data is coming from and which vocabulary is used.

For example, a CDP value would be written as cdp:Document or cdp:Image, a webRequest value as webRequest:sub_frame or webRequest:image, and a Request.destination value as destination:image or destination:document, etc.

This allows for expanding into other vocabularies in the future, but may be harder to parse.
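To give a sense of the parsing cost, here is a minimal Python sketch of reading a prefixed value. The function name `parse_resource_type`, the `ResourceType` tuple and the set of accepted prefixes are hypothetical and not part of any spec. The sketch also assumes that a value without a recognized prefix is kept as-is with no vocabulary attached.

```python
from typing import NamedTuple, Optional

# Hypothetical prefixes, taken from the examples in this proposal.
KNOWN_VOCABULARIES = {"cdp", "webRequest", "destination"}


class ResourceType(NamedTuple):
    vocabulary: Optional[str]  # None when the value carries no known prefix
    value: str


def parse_resource_type(header_value: str) -> ResourceType:
    """Split 'vocabulary:value' into its parts.

    Values without a recognized prefix are returned unchanged with
    vocabulary=None (an assumption for illustration only).
    """
    text = header_value.strip()
    prefix, sep, rest = text.partition(":")
    if sep and rest and prefix in KNOWN_VOCABULARIES:
        return ResourceType(prefix, rest)
    return ResourceType(None, text)
```

A reader that only understands one vocabulary can then check `vocabulary` and skip anything else.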

Alternatively, there could be a fixed vocabulary that is allowed that is a common subset of at least 2 of the above, which might be:
document, image, media, script, stylesheet, font, ping, websocket, fetch and a catch-all other.

(In this case, we should specify what the more specific values are recorded as, eg. main_frame / sub_frame would be recorded as document)
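As a rough sketch of what such a specification could look like, the Python table below folds a few source values into the fixed vocabulary. Only the main_frame / sub_frame to document rule comes from this proposal. Every other mapping (for example XHR to fetch, or audio and video to media) is an assumption for illustration, the table is deliberately incomplete, and the names `TO_COMMON` and `to_common` are hypothetical.

```python
# The fixed vocabulary suggested above.
COMMON_VOCABULARY = (
    "document", "image", "media", "script", "stylesheet",
    "font", "ping", "websocket", "fetch", "other",
)

# Illustrative, incomplete mapping from each source vocabulary.
# Apart from main_frame/sub_frame -> document, these choices are assumptions.
TO_COMMON = {
    "cdp": {
        "Document": "document", "Image": "image", "Media": "media",
        "Script": "script", "Stylesheet": "stylesheet", "Font": "font",
        "Ping": "ping", "WebSocket": "websocket",
        "Fetch": "fetch", "XHR": "fetch",
    },
    "webRequest": {
        "main_frame": "document", "sub_frame": "document",
        "image": "image", "media": "media", "script": "script",
        "stylesheet": "stylesheet", "font": "font", "ping": "ping",
        "websocket": "websocket", "xmlhttprequest": "fetch",
    },
    "destination": {
        "document": "document", "iframe": "document", "frame": "document",
        "image": "image", "audio": "media", "video": "media",
        "script": "script", "style": "stylesheet", "font": "font",
    },
}


def to_common(vocabulary: str, value: str) -> str:
    """Map a source-specific value to the fixed vocabulary, else 'other'."""
    return TO_COMMON.get(vocabulary, {}).get(value, "other")
```

Note that the mapping is lossy: once main_frame and sub_frame are both written as document, the distinction cannot be recovered from the header alone.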

Other thoughts / suggestions welcome!

@ikreymer
Member Author

ikreymer commented Mar 4, 2024

I should note our initial implementation just stores the Chrome CDP value, eg. WARC-Resource-Type: Document, WARC-Resource-Type: Image, etc... w/o a prefix, as that was the easiest to try. We could also just keep that, but wanted to see if there were any thoughts on the above proposals. Other tools that work directly with Chrome Debug Protocol, such as Brozzler or the Chrome Extractor for Heritrix, would actually have the same vocabulary as well, so may not be an immediate concern.
Mostly a question of other tools / future proofing to support vocabulary not coming from CDP, if such a header were to be standardized.

@tw4l

tw4l commented Mar 4, 2024

@ato
Member

ato commented Mar 5, 2024

Do we have any use cases in mind for this field when reading the WARC?

I guess one might be listing all the top-level crawled documents. This can't be done accurately by Content-Type alone as XHR/Fetch requests can have text/html responses.

The main_frame/sub_frame distinction also seems interesting for that use case. It's not in the CDP resource type but if we map to one of the other vocabularies presumably it could be determined from the frameId?

I guess the hopsFromSeed metadata field could be used for listing top-level crawled documents but it's coarse grained and doesn't make distinctions between different kinds of embedded content.

It's also possible for an image to have a text/html Content-Type and still display correctly due to MIME sniffing. So similarly if you wanted to do something with all the images in a crawl, Content-Type alone is insufficient.
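A minimal Python sketch of these use cases, assuming the WARC headers have already been parsed into plain dictionaries by some reader (not shown) and that the unprefixed CDP values are stored on response records. The function name `uris_with_resource_type`, the `http_content_type` key and the sample records are all hypothetical.

```python
from typing import Dict, Iterable, Iterator


def uris_with_resource_type(
    records: Iterable[Dict[str, str]], resource_type: str
) -> Iterator[str]:
    """Yield target URIs of response records with the given resource type."""
    for headers in records:
        if headers.get("WARC-Type") != "response":
            continue
        if headers.get("WARC-Resource-Type") == resource_type:
            yield headers["WARC-Target-URI"]


# Hypothetical records: every response claims text/html, so Content-Type
# alone cannot separate the page from the fetch response or the image.
records = [
    {"WARC-Type": "response", "WARC-Target-URI": "https://example.com/",
     "WARC-Resource-Type": "Document", "http_content_type": "text/html"},
    {"WARC-Type": "response", "WARC-Target-URI": "https://example.com/api/fragment",
     "WARC-Resource-Type": "Fetch", "http_content_type": "text/html"},
    {"WARC-Type": "request", "WARC-Target-URI": "https://example.com/",
     "WARC-Resource-Type": "Document"},
    {"WARC-Type": "response", "WARC-Target-URI": "https://example.com/logo",
     "WARC-Resource-Type": "Image", "http_content_type": "text/html"},
]

documents = list(uris_with_resource_type(records, "Document"))
images = list(uris_with_resource_type(records, "Image"))
```

With CDP values, "Document" would presumably also match sub-frame documents, so this filter alone does not isolate top-level pages.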

@tw4l

tw4l commented Mar 5, 2024

We've added this to our WARCs in response to a user-submitted issue: webrecorder/browsertrix-crawler#451, with the primary use case being differentiating between resources fetched by JavaScript (via fetch, xhr) versus resources loaded directly from the HTML.

@edsu
Contributor

edsu commented Mar 5, 2024

This is probably off topic for this issue, but it came up recently in the context of using mailbagit that it would be useful to know if a record is for a seed URL. Or is there another common way of doing that? The motivation here is to be able to pick out URLs from the WARC data to serve as entry points during replay.

@ato
Member

ato commented Mar 6, 2024

> it would be useful to know if a record is for a seed URL. Or is there another common way of doing that?

For WARCs created by Heritrix, a metadata record without the via and hopsFromSeed fields is indicative of a seed. If the crawler doesn't populate those fields, though, I don't think there's a reliable way to tell from a WARC file alone. Requests without a Referer header might also be indicative for some crawlers, but not for ones that obey Referrer-Policy: no-referrer.

WACZ defines an accompanying pages.jsonl file for entry points.
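The Heritrix heuristic above could be sketched in Python roughly as follows. This assumes the payload of the metadata record has already been extracted as text in "name: value" lines. The helper names `parse_warc_fields` and `looks_like_seed` are hypothetical, continuation lines are simply skipped, and a crawler that never writes via or hopsFromSeed would make every record look like a seed.

```python
from typing import Dict, List


def parse_warc_fields(payload: str) -> Dict[str, List[str]]:
    """Parse 'name: value' lines into a dict of lists (simplified)."""
    fields: Dict[str, List[str]] = {}
    for line in payload.splitlines():
        # Skip blank lines and continuation lines (leading whitespace).
        if not line or line[0] in " \t":
            continue
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields.setdefault(name.strip(), []).append(value.strip())
    return fields


def looks_like_seed(metadata_payload: str) -> bool:
    """True when neither via nor hopsFromSeed is present in the payload."""
    fields = parse_warc_fields(metadata_payload)
    return "via" not in fields and "hopsFromSeed" not in fields
```

For example, a payload containing `via: https://example.com/` and `hopsFromSeed: L` would be treated as a non-seed, while a payload with neither field would be treated as a seed.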
