Frequently Asked Questions

Barry Pollard edited this page May 12, 2022 · 10 revisions

How can I see last year’s queries?

These are available at the bottom of each chapter. For example: https://almanac.httparchive.org/en/2021/third-parties#explore-results. There are links to the previous results sheet and to the SQL queries in our GitHub repo.

Screenshot of Resource Links section of a Web Almanac chapter

I don’t want all the data and queries - just a specific one - can I get that?

We've got you covered! Find the figure in the chapter, and in its top right-hand corner there is a three-dots menu with links to the tab holding that figure's data and to the actual SQL query.

Screenshot of the three dots menu available for each Web Almanac Figure

How can I avoid duplication with another chapter?

My advice is to ignore that for now and concentrate on what YOU want for YOUR chapter! There will inevitably be some duplication, as many chapters want to cover some of the same topics, and sometimes that's not a bad thing (as long as you're not saying the opposite thing!). Feel free to reach out to other chapters, or nosey in on their issues/doc/Slack channels (we're all about openness and transparency here!), but at this stage every chapter is still figuring out what it wants, so you don't know if you're duplicating work yet. As I say, worry about what you need for your chapter, and you can look at this again later.

Do I have to wait until after the June crawl to start writing my queries?

Absolutely not! The HTTP Archive crawls EVERY month and, while there are some Almanac-specific tables which are currently only updated once a year, most of the data is there now from last month. Explore it, get comfortable with it, try out queries. We also have a sample_data dataset that we'll shortly be updating to more recent data (currently it's from last year), and that's a lot quicker to query than the full dataset.
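As a sketch of the kind of exploratory query you can run against the sample data today (the table and column names below are assumptions based on the usual summary_pages schema — browse the httparchive.sample_data dataset in the BigQuery console to see the tables that actually exist):

```sql
-- Hedged sketch: page count and average page weight in the sample dataset.
-- `summary_pages_mobile_10k` and `bytesTotal` are assumed names; confirm
-- them against the httparchive.sample_data dataset before running.
SELECT
  COUNT(0) AS pages,
  ROUND(AVG(bytesTotal) / 1024, 1) AS avg_kilobytes
FROM
  `httparchive.sample_data.summary_pages_mobile_10k`
```

Because the sample tables are a small slice of the full crawl, queries like this cost very little to iterate on while you get comfortable with the schema.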

Is it free to query the HTTP Archive dataset?

There’s a free limit, but it’s easy to go over that. The HTTP Archive is committed to covering the costs analysts incur for Web Almanac work. To benefit from that, make sure you get added to the HTTP Archive account for the duration of this project, and make sure you are using that, and not your personal account, because we can’t reimburse costs billed elsewhere! Speak to your Section Lead about getting on that account.

Why is there a deadline of May 15th - there’s loads of time?

The main deadline (other than the actual publication date) is the 1st of June crawl. That is what we are going to use for ALL chapters. Therefore, if there’s some extra data you want to capture during that crawl, it needs to be coded, reviewed, and merged by then. For that reason we’ve set a 15th May deadline to have an outline, so we can at least have some idea of whether anything else needs to be written. If you miss that, and also don’t have at least a rough idea of the metrics you want shortly after that, then you will be unlikely to get any extra data beyond the usual stuff the HTTP Archive crawl collects every month (which is a LOT in and of itself, btw!).

What are the main sources of data we use?

  • The HTTP Archive crawl itself - this is the top 7-8 million websites, crawled by WebPageTest as both a desktop and a mobile client (using the Chrome browser), with all information about requests/responses stored. The URLs crawled are the most popular websites for desktop/mobile; most are crawled for both, but some for only one or the other, depending on whether they make the popularity cut (e.g. mobile.facebook.com is only crawled on mobile, as it is not a popular website on desktop).
  • Wappalyzer is also run to detect technologies (check if you need to add any new Wappalyzer detections before 1st June)
  • Lighthouse also runs (and on desktop too for the first time this year), and it has a wealth of audits. Review all the Lighthouse audits to see which would be useful for your chapter.
  • Custom metrics also run - custom JavaScript executed on the rendered DOM after the page has finished loading. We have a load of those written already, but this is the other key thing to update before 1st June.
  • We also cross-reference with the CrUX (Chrome User Experience Report) dataset for real-world performance data.
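As a concrete (hedged) example of using one of these sources, a query like the following counts the most-detected Wappalyzer technologies in a monthly crawl. The table date is an assumption for illustration; substitute the latest crawl available in the httparchive.technologies dataset:

```sql
-- Hedged sketch: top 10 technologies detected on mobile pages.
-- The crawl date (2022_05_01) is an assumed example; check the
-- httparchive.technologies dataset for the most recent table.
SELECT
  app,                           -- technology name from Wappalyzer
  COUNT(DISTINCT url) AS pages   -- pages on which it was detected
FROM
  `httparchive.technologies.2022_05_01_mobile`
GROUP BY
  app
ORDER BY
  pages DESC
LIMIT 10
```

Swapping `_mobile` for `_desktop` gives the desktop view, which is a quick way to compare the two crawls for your chapter.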

Can I use another dataset?

Some chapters may want to use another dataset, specific to their topic. The preference is to use HTTP Archive data wherever possible, but if that is not sufficient then other datasets can be used. One of the aims of the Web Almanac is to surface the HTTP Archive data to others. We are also VERY transparent about our dataset, and others may not be, in which case think hard about whether theirs is a good fit for our report. We’re also looking to mine new insights, rather than just report others’ work, so it may be better to refer to their data or reports and link out to them, rather than pull their dataset directly into our report. But ultimately we do want the report to be a fair representation of the “state of the web”, so if we need another dataset to make it complete, then do consider alternatives.