Extending the HTTPArchive Data #112

Open
1 of 3 tasks
nrllh opened this issue Mar 9, 2024 · 4 comments

Comments

@nrllh

nrllh commented Mar 9, 2024

I'm curious about the possibility of enhancing our current dataset by incorporating additional data points.

  • Cookies: We're currently missing the cookies in the cookie jar and only have those sent via HTTP headers. @pmeenan mentioned it should be manageable to extend the existing data to include them.
  • DNS queries: Chrome seems quite limited here. There is an API, but it's mainly available to extensions on dev channel releases. I'm uncertain about our ability to collect DNS resolution data. Any thoughts?
  • JavaScript calls: There seems to be an API (see timelineStack) for JS call stacks. I think it's pretty heavy, but I still wonder whether we could include it in our data.

These types of data are often crucial for web measurement studies. Do you think it's feasible to enrich our data with these?

@rviscomi
Member

@pmeenan @tunetheweb any thoughts on the DNS question?

As for JS calls, we've tried to capture that info for specific functions of interest with the observers.js script, but it was breaking some pages' functionality. I never got the time to thoroughly debug it. My hunch is that we'll need to rewrite it to use Proxy instead of Object.defineProperty. Not sure if timelineStack does what we need. Do you have an example of its output?

@pmeenan
Member

pmeenan commented Apr 10, 2024

What DNS resolution information are you looking for? We have the basic timings from Chrome itself, but we also collect DNS information for the origin of the main page (CNAMEs, the authoritative DNS server list, and PTR records). They are stored in the page-level JSON as base_page_ip_ptr, base_page_cname, and base_page_dns_server.

We have full access to the netlog, so we can capture anything DNS-related that Chrome does, but it's important to remember that it is measured in a lab environment, so we only have visibility into the DNS path that we are using.
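
For anyone who wants to look at those three fields, here is a minimal Python sketch. It assumes the page-level JSON has already been parsed into a dict and that the keys appear exactly as named above. The helper name, the sample record, and its values are made up for illustration, and the real value types may differ.

```python
import json

# Field names as given in the comment above; their exact spelling and
# nesting in a real payload is an assumption.
DNS_FIELDS = ("base_page_ip_ptr", "base_page_cname", "base_page_dns_server")


def base_page_dns_info(page):
    """Return the base-page DNS fields, using None for any that are absent."""
    return {field: page.get(field) for field in DNS_FIELDS}


# Hypothetical record, for illustration only (not real HTTP Archive data).
page = json.loads(
    '{"url": "https://example.com/", "base_page_cname": "example.cdn.invalid"}'
)
info = base_page_dns_info(page)
print(info)
```

Missing fields come back as None, so pages without a CNAME record can still be processed uniformly.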

@nrllh
Author

nrllh commented Apr 11, 2024

@rviscomi Unfortunately, I don't have a specific example; I found that API in the documentation.

Ideally, we should have this information at the request level. Please review this subset, where we have DNS-related information for each request. We can also limit the scope to the origin level, since the results for foo.com/bar1 and foo.com/bar2 are the same, but not for bar.foo.com. This helps in understanding various techniques used in the ecosystem. For example, some players exploit CNAME records to circumvent ad blockers and to sync cookies.
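
To illustrate the scoping idea, here is a rough Python sketch that groups request URLs by hostname, since DNS resolution depends on the hostname and not on the path. The function name and the sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_urls_by_hostname(urls):
    """Map each hostname to the request URLs that share it."""
    groups = defaultdict(list)
    for url in urls:
        hostname = urlsplit(url).hostname
        if hostname:  # skip URLs without a host, e.g. data: URLs
            groups[hostname].append(url)
    return dict(groups)


# Hypothetical request URLs mirroring the example in the comment above.
requests = [
    "https://foo.com/bar1",
    "https://foo.com/bar2",
    "https://bar.foo.com/baz",
]
by_host = group_urls_by_hostname(requests)
for host, host_urls in sorted(by_host.items()):
    print(host, len(host_urls))
```

With this grouping, one DNS record per hostname would be enough to annotate every request to that hostname.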

@max-ostapenko
Contributor

@pmeenan In the netlog, I see a DNS overview in the HOST_RESOLVER tasks.
For example, the script request to consent.cookiebot.com contains:

{
  "aliases": [
    "consent.cookiebot.com",
    "consent.cookiebot.com-v2.edgekey.net",
    "e110990.dsca.akamaiedge.net"
  ],
  "canonical_names": [
    "e110990.dsca.akamaiedge.net"
  ],
  "endpoint_metadatas": [],
  "expiration": "13358745242163846",
  "host_ports": [],
  "hostname_results": [],
  "ip_endpoints": [
    {
      "endpoint_address": "88.221.221.75",
      "endpoint_port": 0
    },
    {
      "endpoint_address": "88.221.221.147",
      "endpoint_port": 0
    }
  ],
  "text_records": []
}

canonical_names is the best fit for privacy analysis. It would be nice to have it in the request data (or at least aggregated in the page data).
Please point me to where this extension could be added, if that code is public.
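
As a rough sketch of what such an extraction could look like, the Python below summarizes one resolver result shaped like the JSON above. The function names are hypothetical, and comparing the last two labels of a hostname is only a naive stand-in for a proper registrable-domain lookup against the Public Suffix List. A cross-site CNAME is not proof of cloaking either, since CDNs produce the same pattern.

```python
import json

# Resolver result copied from the netlog excerpt above (trimmed to used keys).
SAMPLE = """
{
  "aliases": [
    "consent.cookiebot.com",
    "consent.cookiebot.com-v2.edgekey.net",
    "e110990.dsca.akamaiedge.net"
  ],
  "canonical_names": ["e110990.dsca.akamaiedge.net"],
  "ip_endpoints": [
    {"endpoint_address": "88.221.221.75", "endpoint_port": 0},
    {"endpoint_address": "88.221.221.147", "endpoint_port": 0}
  ]
}
"""


def naive_site(hostname):
    # ASSUMPTION: the last two labels approximate the registrable domain.
    # This is wrong for suffixes such as co.uk; use the Public Suffix List.
    return ".".join(hostname.lower().rstrip(".").split(".")[-2:])


def summarize_resolution(hostname, result):
    """Collect canonical names and IPs, and flag CNAMEs that leave the site."""
    canonical = result.get("canonical_names", [])
    ips = [ep["endpoint_address"] for ep in result.get("ip_endpoints", [])]
    cross_site = any(naive_site(name) != naive_site(hostname) for name in canonical)
    return {
        "hostname": hostname,
        "canonical_names": canonical,
        "ip_addresses": ips,
        "cross_site_cname": cross_site,
    }


summary = summarize_resolution("consent.cookiebot.com", json.loads(SAMPLE))
print(summary)
```

A per-request or per-page field could then hold just canonical_names and the cross-site flag, which keeps the added data small.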
