CommonCrawl index date range code is broken #26

Open
wumpus opened this issue Mar 27, 2022 · 5 comments

Comments

@wumpus
Member

wumpus commented Mar 27, 2022

cdxt --cc --from 2021 --to 2020 -v -v --limit 1 iter https://www.pbm.com/
INFO:cdx_toolkit.cli:set loglevel to DEBUG
DEBUG:cdx_toolkit.myrequests:getting https://index.commoncrawl.org/collinfo.json None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 200 1157
INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
INFO:cdx_toolkit:making a custom cc index list
INFO:cdx_toolkit.commoncrawl:using cc index range from https://index.commoncrawl.org/CC-MAIN-2021-04-index to https://index.commoncrawl.org/CC-MAIN-2020-50-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2021-04-index

The above date range should be empty.
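As a minimal sketch of the check that seems to be missing, an inverted range can be detected before any index is queried. The helper names `pad_from`, `pad_to`, and `range_is_empty` are hypothetical and are not taken from cdx_toolkit; the padding rule (expand `--from` downward and `--to` upward to 14 digits) is an assumption about how partial timestamps are meant to be read.

```python
def pad_from(ts):
    """Pad a partial timestamp (e.g. '2021') down to the earliest 14-digit value."""
    return ts + "00000101000000"[len(ts):]


def pad_to(ts):
    """Pad a partial timestamp up to a 14-digit upper bound.

    The result may not be a real calendar date (e.g. day 31 in February),
    but it still works as an upper bound for string comparison.
    """
    return ts + "99991231235959"[len(ts):]


def range_is_empty(from_ts, to_ts):
    """Return True when the padded start lies after the padded end."""
    return pad_from(from_ts) > pad_to(to_ts)


# Hypothetical usage mirroring the command above: --from 2021 --to 2020
print(range_is_empty("2021", "2020"))  # True
print(range_is_empty("2020", "2021"))  # False
```

With a check like this, `--from 2021 --to 2020` would return no indexes instead of falling through to CC-MAIN-2021-04.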

@Medstaar

Medstaar commented Oct 6, 2022

I've recently started using ranges and hit this issue. Is it likely to be picked up in the near future? I've also noticed that the 'closest' argument works okay for Common Crawl, where it creates a 3-month window, but it does not work for wayback.

@wumpus
Member Author

wumpus commented Oct 6, 2022

Can you give some examples? The bug I was complaining about shouldn't affect any real usage.

@Medstaar

Medstaar commented Oct 7, 2022

Sorry, I think I might have misunderstood how the ranges work. It looks like if I put from=20220101 it will use the index CC-MAIN-2021-49 (November 2021), and if I put from=20220401 it will use CC-MAIN-2022-05 (January 2022). So it actually seems to use the closest index dated before the date provided.
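The behaviour described here can be sketched as a "latest index that starts on or before the date" lookup. This is an illustration only, not cdx_toolkit's actual code: the function names are hypothetical, the index list is a small hand-picked sample, and mapping `CC-MAIN-YYYY-WW` to the Monday of that ISO week is an assumption that only approximates when a crawl really started.

```python
from bisect import bisect_right
from datetime import date, datetime


def index_start(name):
    """Approximate an index's start date from its name (assumption: ISO week)."""
    _, _, year, week = name.split("-")
    return date.fromisocalendar(int(year), int(week), 1)


def first_index_for(from_ts, names):
    """Pick the latest index whose approximate start is on or before from_ts."""
    ordered = sorted(names, key=index_start)
    starts = [index_start(n) for n in ordered]
    target = datetime.strptime(from_ts, "%Y%m%d").date()
    pos = bisect_right(starts, target)
    # Fall back to the oldest index if from_ts predates all of them.
    return ordered[max(pos - 1, 0)]


# Illustrative sample of index names, not the full collinfo.json list.
SAMPLE = ["CC-MAIN-2021-43", "CC-MAIN-2021-49", "CC-MAIN-2022-05", "CC-MAIN-2022-21"]

print(first_index_for("20220101", SAMPLE))  # CC-MAIN-2021-49
print(first_index_for("20220401", SAMPLE))  # CC-MAIN-2022-05
```

Starting from the index just before the requested date makes sense because a crawl that began earlier can still contain captures made after that date.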

For wayback, if I use closest=20221007 it seems to return URLs with a 2019 timestamp. Using from and to works fine with wayback, however.
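Since from and to do work with wayback, one possible workaround is to fetch a bounded range and order the captures by distance to the wanted date on the client side. The sketch below works on plain 14-digit timestamp strings; `sort_by_closeness` is a hypothetical helper, and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit capture timestamp such as '20221007000000'."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def sort_by_closeness(timestamps, closest):
    """Order timestamps by absolute distance to `closest`.

    `closest` must have at least 8 digits (YYYYMMDD); missing time digits
    are filled with zeros.
    """
    target = parse_ts(closest.ljust(14, "0"))
    return sorted(timestamps, key=lambda ts: abs(parse_ts(ts) - target))


# Made-up capture timestamps for illustration.
captures = ["20190301120000", "20220930080000", "20221101000000"]
print(sort_by_closeness(captures, "20221007")[0])  # 20220930080000
```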

@wumpus
Member Author

wumpus commented Oct 7, 2022

OK, so Common Crawl is doing the right thing, and the problem with closest on wayback is on the Internet Archive side, which is something I can't control.

@wumpus wumpus closed this as completed Oct 7, 2022
@wumpus wumpus reopened this Oct 7, 2022
@sgjohnson1981

sgjohnson1981 commented Mar 11, 2024

I don't know precisely what you're trying to explain, but my issue is also related to the index date ranges: I'm trying to use them programmatically with from_ts. Using it with the iter method isn't working; it doesn't return anything. Calling iter without it works, but I don't need every capture going back a year or whatever the default is.
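For anyone trying to reproduce this, a small wrapper that always passes both ends of the range may help narrow down where the empty result comes from. It is a sketch, not a confirmed fix. The `fetcher` argument is assumed to behave like `cdx_toolkit.CDXFetcher(source='cc')`, and the keyword names `from_ts`, `to`, and `limit` are assumed to match what `iter` accepts; `captures_between` and `FakeFetcher` are hypothetical names.

```python
def captures_between(fetcher, url, from_ts, to_ts, limit=10):
    """Collect up to `limit` captures of `url` between from_ts and to_ts.

    `fetcher` is any object with an iter(url, **kwargs) method; the keyword
    names below are an assumption about the cdx_toolkit API.
    """
    if from_ts > to_ts:
        # String comparison is only safe when both timestamps have the same length.
        return []
    return list(fetcher.iter(url, from_ts=from_ts, to=to_ts, limit=limit))


class FakeFetcher:
    """Stand-in used to exercise the wrapper without network access."""

    def iter(self, url, from_ts=None, to=None, limit=None):
        rows = [{"url": url, "timestamp": "20230115000000"},
                {"url": url, "timestamp": "20230601000000"}]
        kept = [r for r in rows if from_ts <= r["timestamp"][:len(from_ts)] <= to]
        return iter(kept[:limit])


print(len(captures_between(FakeFetcher(), "example.com/*", "202301", "202302")))  # 1
```

With the real library, the fake would be replaced by something like `cdx_toolkit.CDXFetcher(source='cc')`, and logging enabled at DEBUG level (as in the original report) shows which indexes were selected for the range.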


3 participants