Investigate missing Top 1k home pages #222

Open
rviscomi opened this issue Nov 13, 2023 · 1 comment

Comments

@rviscomi
Member

For some reason HA has no data for ~90 of the top 1k sites in CrUX:

https://allegro.pl/
https://aquamanga.com/
https://auctions.yahoo.co.jp/
https://auth.uber.com/
https://betproexch.com/
https://blaze-1.com/
https://bollyflix.tax/
https://brainly.com.br/
https://brainly.in/
https://brainly.lat/
https://chance.enjoy.point.auone.jp/
https://cookpad.com/
https://detail.chiebukuro.yahoo.co.jp/
https://e-okul.meb.gov.tr/
https://filmyfly.club/
https://game.hiroba.dpoint.docomo.ne.jp/
https://gamewith.jp/
https://gdz.ru/
https://hdhub4u.markets/
https://hentailib.me/
https://holoo.fun/
https://ifilo.net/
https://indianhardtube.com/
https://login.caixa.gov.br/
https://m.autoplius.lt/
https://m.fmkorea.com/
https://m.happymh.com/
https://m.pgf-asw0zz.com/
https://m.porno365.pics/
https://m.skelbiu.lt/
https://mangalib.me/
https://mangalivre.net/
https://mnregaweb4.nic.in/
https://myaadhaar.uidai.gov.in/
https://myreadingmanga.info/
https://namu.wiki/
https://nhattruyenplus.com/
https://nhentai.net/
https://onlar.az/
https://page.auctions.yahoo.co.jp/
https://passbook.epfindia.gov.in/
https://pixbet.com/
https://pmkisan.gov.in/
https://quizlet.com/
https://schools.emaktab.uz/
https://schools.madrasati.sa/
https://scratch.mit.edu/
https://supjav.com/
https://tathya.uidai.gov.in/
https://uchi.ru/
https://v.daum.net/
https://vl2.xvideos98.pro/
https://vlxx.moe/
https://www.avto.net/
https://www.bartarinha.ir/
https://www.bestbuy.com/
https://www.betproexch.com/
https://www.cardmarket.com/
https://www.chegg.com/
https://www.cityheaven.net/
https://www.deviantart.com/
https://www.dns-shop.ru/
https://www.fiverr.com/
https://www.fmkorea.com/
https://www.hotstar.com/
https://www.idealista.com/
https://www.idealista.it/
https://www.justdial.com/
https://www.khabaronline.ir/
https://www.leboncoin.fr/
https://www.leroymerlin.fr/
https://www.makemytrip.com/
https://www.mediaexpert.pl/
https://www.milanuncios.com/
https://www.namasha.com/
https://www.nettruyenus.com/
https://www.ninisite.com/
https://www.nitrotype.com/
https://www.otvfoco.com.br/
https://www.ozon.ru/
https://www.realtor.com/
https://www.sahibinden.com/
https://www.shahrekhabar.com/
https://www.si.com/
https://www.studocu.com/
https://www.thenetnaija.net/
https://www.varzesh3.com/
https://www.wannonce.com/
https://www.wayfair.com/
https://www.winzogames.com/
https://www.zillow.com/
https://znanija.com/
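
```sql
-- HTTP Archive home pages in the top 1k rank bucket for the 2023-10-01 crawl
WITH ha AS (
  SELECT
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-10-01' AND
    rank = 1000 AND
    is_root_page
),

-- CrUX origins in the top 1k rank bucket; append '/' so they match HA page URLs
crux AS (
  SELECT
    DISTINCT CONCAT(origin, '/') AS page
  FROM
    `chrome-ux-report.materialized.metrics_summary`
  WHERE
    date = '2023-09-01' AND
    rank = 1000
)

-- Anti-join: keep CrUX pages that have no matching row in HA
SELECT
  page
FROM
  crux
LEFT OUTER JOIN
  ha
USING
  (page)
WHERE
  ha.page IS NULL
ORDER BY
  page
```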

This has been pretty consistent:

Row date top_1k
1 2023-01-01 918
2 2023-02-01 922
3 2023-03-01 910
4 2023-04-01 924
5 2023-05-01 916
6 2023-06-01 913
7 2023-07-01 908
8 2023-08-01 917
9 2023-09-01 910
10 2023-10-01 908

And here are the top 1k home pages that have consistently been missing all year (202301–202309):

https://aquamanga.com/
https://auctions.yahoo.co.jp/
https://betproexch.com/
https://brainly.in/
https://chance.enjoy.point.auone.jp/
https://detail.chiebukuro.yahoo.co.jp/
https://game.hiroba.dpoint.docomo.ne.jp/
https://login.caixa.gov.br/
https://m.fmkorea.com/
https://m.happymh.com/
https://mangalib.me/
https://mangalivre.net/
https://myreadingmanga.info/
https://namu.wiki/
https://page.auctions.yahoo.co.jp/
https://pmkisan.gov.in/
https://quizlet.com/
https://scratch.mit.edu/
https://v.daum.net/
https://www.bartarinha.ir/
https://www.bestbuy.com/
https://www.betproexch.com/
https://www.deviantart.com/
https://www.fiverr.com/
https://www.fmkorea.com/
https://www.idealista.com/
https://www.justdial.com/
https://www.khabaronline.ir/
https://www.leboncoin.fr/
https://www.leroymerlin.fr/
https://www.milanuncios.com/
https://www.namasha.com/
https://www.ninisite.com/
https://www.ozon.ru/
https://www.realtor.com/
https://www.sahibinden.com/
https://www.wannonce.com/
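
The "consistently missing" list above is the intersection of the per-month missing sets. A minimal Python sketch of that step, assuming the monthly results have already been exported from a query like the one earlier in this issue; `persistently_missing` and the `*.example` URLs are hypothetical placeholders, not real results:

```python
def persistently_missing(missing_by_month):
    """Return the pages that appear in every month's missing set, sorted."""
    monthly_sets = [set(pages) for pages in missing_by_month.values()]
    if not monthly_sets:
        return []
    return sorted(set.intersection(*monthly_sets))


# Placeholder data (assumption): month -> pages present in CrUX but absent from HA.
missing_by_month = {
    "2023-01": {"https://a.example/", "https://b.example/", "https://c.example/"},
    "2023-02": {"https://a.example/", "https://c.example/"},
    "2023-03": {"https://a.example/", "https://c.example/", "https://d.example/"},
}

print(persistently_missing(missing_by_month))
# ['https://a.example/', 'https://c.example/']
```

Pages that drop out of the intersection (missing only in some months) are more likely to be flaky tests than hard blocks, so it may be worth looking at them separately.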

Are the tests erroring out? Are they blocking us?

@tunetheweb
Member

Just trying the first one (https://allegro.pl/), it also fails in the public WebPageTest with a 403:
https://www.webpagetest.org/result/231113_AiDcFK_98G/1/details/#waterfall_view_step1

When I try with curl, it asks for JS to be enabled and depends on something using https://ct.captcha-delivery.com/c.js

So I would guess it's just blocked.
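
To repeat this check across the whole list rather than one site at a time, something like the following Python sketch might help. It is only a heuristic under stated assumptions: the `fetch` and `classify_response` helpers and the status codes treated as "blocked" are my own choices, and the only challenge marker is the `captcha-delivery.com` script seen above. A plain HTTP client is also not a browser, so results may differ from what the crawler sees.

```python
import urllib.error
import urllib.request

# Assumption: the only marker we know of is the script host seen in the curl output.
CHALLENGE_MARKERS = ("captcha-delivery.com",)


def classify_response(status, body):
    """Heuristically label a home page response as ok, blocked, challenge, or error."""
    body_lower = body.lower()
    if any(marker in body_lower for marker in CHALLENGE_MARKERS):
        return "challenge"
    if status in (401, 403, 429):  # assumption: these usually mean we were refused
        return "blocked"
    if 200 <= status < 300:
        return "ok"
    return "error"


def fetch(url, timeout=10):
    """Fetch a URL and return (status, body), including for HTTP error responses."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the error object still carries status and body.
        return err.code, err.read().decode("utf-8", "replace")


# Offline examples; call fetch(url) to get a real (status, body) pair.
captcha_body = '<script src="https://ct.captcha-delivery.com/c.js"></script>'
print(classify_response(403, captcha_body))       # challenge
print(classify_response(403, ""))                 # blocked
print(classify_response(200, "<html>ok</html>"))  # ok
```

Usage would be along the lines of `classify_response(*fetch("https://allegro.pl/"))` for each missing page, then counting the labels.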
