Track technology adoption and share #591

Open
rviscomi opened this issue May 4, 2022 · 8 comments

@rviscomi
Member

rviscomi commented May 4, 2022

Add a new report that tracks the adoption and share of detected technologies.

Reports currently fall into timeseries and histograms, so we may need a new report template that handles more custom ways to explore and visualize this data.

The primary use case for this feature is to track CMS adoption, but it would be good to build this in a way that supports any given technology category and lets users filter it down however they want.

Similar to the CWV Technology Report, it could be useful to apply dimensions to the stats, like ranking and country. @jdevalk also suggested slicing by "new" sites.
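To make the two metrics concrete, here is a minimal sketch, not the report's actual implementation. It assumes "adoption" means the number of distinct sites on which a technology is detected and "share" means that count divided by all sites with any detected technology in the same category, computed separately per dimension value. The record fields (`site`, `category`, `technology`, `rank`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def adoption_and_share(detections, category, dimension="rank"):
    """Return {(dimension value, technology): {"adoption": n, "share": f}}.

    `detections` is an iterable of dicts with the hypothetical keys
    "site", "category", "technology", plus the chosen dimension key.
    """
    sites_by_tech = defaultdict(set)      # (dimension value, technology) -> sites
    sites_in_category = defaultdict(set)  # dimension value -> sites

    for row in detections:
        if row["category"] != category:
            continue
        key = row[dimension]
        sites_by_tech[(key, row["technology"])].add(row["site"])
        sites_in_category[key].add(row["site"])

    report = {}
    for (key, tech), sites in sites_by_tech.items():
        total = len(sites_in_category[key])
        report[(key, tech)] = {
            "adoption": len(sites),        # distinct sites using the technology
            "share": len(sites) / total,   # fraction of the category's sites
        }
    return report


detections = [
    {"site": "a.com", "category": "CMS", "technology": "WordPress", "rank": "top 1M"},
    {"site": "b.com", "category": "CMS", "technology": "WordPress", "rank": "top 1M"},
    {"site": "c.com", "category": "CMS", "technology": "Drupal", "rank": "top 1M"},
    {"site": "c.com", "category": "Analytics", "technology": "ExampleStats", "rank": "top 1M"},
]
report = adoption_and_share(detections, "CMS")
```

Passing a different `dimension` key (for example a hypothetical `country` field) would slice the same stats another way.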

@tunetheweb
Member

If slicing by new sites, we probably want to avoid the long tail of sites that drop in and out of our dataset depending on traffic that month but aren’t really new, just low-traffic sites.

We could exclude any new sites in the largest (10M) rank bucket and only look at new sites in the top 1M or 100k that either haven’t appeared at all before or have only appeared in the top 10M previously.
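The filter described here could be sketched roughly as follows. This is an illustration under assumptions, not an agreed definition: it assumes each site carries a rank bucket expressed as an integer upper bound (such as 100_000, 1_000_000, or 10_000_000), and that `history` holds each site's best (smallest) bucket from earlier months. The function name and both input shapes are hypothetical.

```python
def new_to_top_sites(current, history, threshold=1_000_000):
    """Return sites that are newly inside the `threshold` rank bucket.

    current: dict of site -> rank bucket this month.
    history: dict of site -> best (smallest) rank bucket seen in earlier months.
    """
    new_sites = set()
    for site, rank in current.items():
        if rank > threshold:
            # Long-tail site: ignore it, since it may just be drifting
            # in and out of the dataset with monthly traffic.
            continue
        previous_best = history.get(site)
        # New if never seen before, or only ever seen outside the threshold.
        if previous_best is None or previous_best > threshold:
            new_sites.add(site)
    return new_sites


current = {
    "a.com": 100_000,     # never seen before, now top 100k
    "b.com": 10_000_000,  # never seen before, but only in the long tail
    "c.com": 1_000_000,   # previously only in the top 10M
    "d.com": 1_000_000,   # previously already in the top 100k
}
history = {"c.com": 10_000_000, "d.com": 100_000}
result = new_to_top_sites(current, history)
```

Lowering `threshold` to 100_000 would give the stricter "new to top 100k" variant.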

@jdevalk

jdevalk commented May 5, 2022

@tunetheweb I was actually hoping we could find a source for truly new sites; sites that are just hitting the web.

@tunetheweb
Member

Not aware of any, to be honest. We could use meta dates, but they are notoriously unreliable.

“New to top million” or similar is the best way I can think of to measure this. It would then also include sites that launched maybe a few months ago but are only now getting serious traffic/traction.

Maybe, once we figure out the algorithm to measure this, we can become that source 😁

@rviscomi
Member Author

rviscomi commented May 5, 2022

Am I oversimplifying or can we just check to see if the website had ever been in the dataset?
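As a rough sketch of that simpler check, the first month each site appears can be derived from the list of sites per crawl. The input shape (pairs of a `YYYY-MM` month string and a collection of sites) is an assumption made for illustration, not the dataset's actual schema.

```python
def first_seen(crawls):
    """Map each site to the earliest month in which it appears.

    crawls: iterable of (month, sites) pairs, with month as a "YYYY-MM" string
    so that lexical order matches chronological order.
    """
    seen = {}
    for month, sites in sorted(crawls, key=lambda crawl: crawl[0]):
        for site in sites:
            seen.setdefault(site, month)  # keep only the earliest month
    return seen


crawls = [
    ("2022-04", {"a.com", "b.com"}),
    ("2022-03", {"a.com"}),
    ("2022-05", {"a.com", "b.com", "c.com"}),
]
seen = first_seen(crawls)
# Sites whose first appearance is the latest crawl were never in the dataset before.
never_seen_before = {site for site, month in seen.items() if month == "2022-05"}
```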

@jdevalk

jdevalk commented May 5, 2022

@rviscomi ok, can I be really cheeky? I was hoping to “add” a bit to the dataset, so “on top”, not “within”. I think a certain search engine would know about some sites new to them?

@rviscomi
Member Author

rviscomi commented May 5, 2022

I think we can only assume we're able to work with the data already publicly available to us.

Beyond "have we seen this URL before" we could also look at resource freshness data like the Last-Modified header of first-party (1P) content. If this were truly a new site, we wouldn't expect to see 3-year-old content, for example. It might still take time for a new site to reach the popularity threshold to be included in CrUX and ultimately HTTP Archive, as @tunetheweb noted.
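One way the freshness heuristic might look, as a sketch only: parse the `Last-Modified` values of a site's first-party responses and flag the site as plausibly new when even its oldest resource is younger than some cutoff. The one-year cutoff and the function names are assumptions for illustration, and servers can report misleading dates, so this would be a weak signal at best.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def oldest_content_age(last_modified_headers, crawl_date):
    """Return the age of the oldest parseable Last-Modified value, or None."""
    ages = []
    for value in last_modified_headers:
        try:
            modified = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            continue  # skip malformed header values
        if modified.tzinfo is None:
            modified = modified.replace(tzinfo=timezone.utc)
        if modified <= crawl_date:  # ignore dates in the future
            ages.append(crawl_date - modified)
    return max(ages) if ages else None


def looks_new(last_modified_headers, crawl_date, max_age=timedelta(days=365)):
    """True only if there is evidence and none of it is older than max_age."""
    oldest = oldest_content_age(last_modified_headers, crawl_date)
    return oldest is not None and oldest <= max_age


crawl_date = datetime(2022, 5, 1, tzinfo=timezone.utc)
old_site = ["Wed, 21 Oct 2015 07:28:00 GMT", "Fri, 01 Apr 2022 00:00:00 GMT"]
fresh_site = ["Fri, 01 Apr 2022 00:00:00 GMT", "not a date"]
```

Here `looks_new(old_site, crawl_date)` is false because of the 2015 resource, while `fresh_site` passes. A site with no usable headers is treated as "not new" since there is no evidence either way.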

@tomvangoethem or @nrllh might also be interested in this problem from a research perspective.

@rviscomi
Member Author

rviscomi commented May 5, 2022

Perhaps worth forking the "new site" dimension from the technology adoption report for now.

@tomvangoethem

From what I understand, with the "new site" dimension you're mainly interested in sites that were created/developed recently? How about using Certificate Transparency logs for that? Should be feasible to determine when a site's first certificate was issued (or, given that domains expire and get reused: the last time that the site did not have a valid certificate for a certain period of time).

Accessing CT logs might be a bit tricky though; depending on the number of sites to test, it might be feasible using the crt.sh or censys.io APIs. Censys also provides access to their data on BigQuery for research purposes (not sure if that would fall under "publicly available to us"?). Ingesting CT logs into the HTTP Archive dataset might also be an interesting option. Perhaps there are other data sources that I don't know about?
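However the certificate data is obtained, the "last time the site did not have a valid certificate for a certain period" idea could be sketched like this. It assumes the validity periods for a domain have already been fetched as `(not_before, not_after)` date pairs, and the 90-day gap tolerance is an arbitrary illustrative value. Nothing here calls a real CT log API.

```python
from datetime import date, timedelta


def current_coverage_start(cert_validity, max_gap=timedelta(days=90)):
    """Return the start of the latest run of (nearly) continuous certificate coverage.

    cert_validity: iterable of (not_before, not_after) date pairs for one domain.
    A gap longer than max_gap between certificates is treated as the domain
    having lapsed, so coverage restarts at the next certificate.
    """
    intervals = sorted(cert_validity)
    if not intervals:
        return None

    start, covered_until = intervals[0]
    for not_before, not_after in intervals[1:]:
        if not_before - covered_until > max_gap:
            # Long uncovered period: treat what follows as a new site.
            start = not_before
            covered_until = not_after
        else:
            covered_until = max(covered_until, not_after)
    return start


certs = [
    (date(2015, 1, 1), date(2016, 1, 1)),
    (date(2015, 12, 1), date(2016, 12, 1)),   # renewal, overlaps the first
    (date(2021, 6, 1), date(2021, 9, 1)),     # issued after a multi-year gap
    (date(2021, 8, 15), date(2021, 11, 15)),  # renewal of the new run
]
site_start = current_coverage_start(certs)
```

With this input the multi-year gap makes 2021-06-01 the estimated start of the current site, rather than the 2015 certificate.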
