Add subcmd to use metadata to roughly calculate the size of the local bandersnatch mirror #1305

Open
leochen12-rgb opened this issue Dec 11, 2022 · 3 comments
Labels
help wanted, invalid, question

Comments

@leochen12-rgb

I can get the official total size of PyPI from https://pypi.org/stats/, and I am currently syncing a local mirror. However, measuring the size of the local mirror directory with du or duc takes too long. Is there a quicker way to do this?

@cooperlees added the invalid and question labels Dec 11, 2022
@cooperlees
Contributor

cooperlees commented Dec 11, 2022

Howdy,

This isn't really a bandersnatch question. The slowness comes from having lots of small files on your storage backend.

The only ideas we could possibly try:

  • Read the JSON metadata in parallel, check whether each project's simple dir exists and, if so, sum up the sizes of all its package files
    • This will have many inaccuracies; for example, if you use filtering, it won't be taken into account
  • Read the JSON metadata in parallel and check whether each file exists, but I think this will be just as expensive as du (though I'm not sure of all the operations du does under the covers)
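The first idea above might look roughly like the sketch below. It is not part of bandersnatch and rests on several assumptions: that the mirror was synced with JSON metadata saved as one file per project under `web/json/`, that each file follows the PyPI JSON API layout (a `releases` mapping whose file entries carry a `size` field), and that project directories sit directly under `web/simple/` (i.e. no hash-index). The function names `project_size` and `estimate_mirror_size` are made up for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def project_size(json_file: Path, simple_root: Path) -> int:
    """Sum the metadata sizes of one project's files, or 0 if it has no simple dir."""
    # Assumption: the JSON file name matches the project's simple directory name.
    if not (simple_root / json_file.name).is_dir():
        return 0
    with json_file.open(encoding="utf-8") as fh:
        metadata = json.load(fh)
    # Assumption: PyPI JSON API layout, {"releases": {version: [{"size": ...}, ...]}}.
    return sum(
        int(entry.get("size") or 0)
        for files in metadata.get("releases", {}).values()
        for entry in files
    )


def estimate_mirror_size(mirror_root: Path, workers: int = 16) -> int:
    """Rough total in bytes, computed from metadata only (no per-file stat calls)."""
    json_root = mirror_root / "web" / "json"      # assumed layout
    simple_root = mirror_root / "web" / "simple"  # assumed layout
    json_files = [p for p in json_root.iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: project_size(p, simple_root), json_files))
```

Because this never looks at the package files themselves, the result is only an estimate: files excluded by filters or not yet downloaded are still counted.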

Another hack I've generally recommended is making a dedicated partition or volume for each part of bandersnatch's storage, e.g. putting the simple and packages directories on their own filesystems, so that df -h can give quicker insight too.

  • If you use hash-index = true you could also create a volume/file system per shard to get further insight
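If each directory does live on its own filesystem, the same numbers df reports can be read from Python with the standard library's `shutil.disk_usage`. This is a minimal sketch; the helper name `filesystem_usage` is hypothetical, and the result is only meaningful when every path is the mount point of a dedicated filesystem, because it reports usage for the whole filesystem containing the path.

```python
import shutil
from pathlib import Path


def filesystem_usage(paths):
    """Map each path to the bytes used on the filesystem that contains it."""
    # disk_usage() asks the filesystem for its totals, so it does not walk files.
    return {str(p): shutil.disk_usage(p).used for p in map(Path, paths)}
```

For example, `filesystem_usage(["/srv/pypi/web/simple", "/srv/pypi/web/packages"])` would return one figure per volume, where both paths are placeholders for wherever the volumes are mounted.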

I don't have the cycles to look into these ideas, but I would take a PR to add docs or a bandersnatch du-like command that works out the sizes quicker, if possible. But I feel we'd need to use a lower-level language than Python to get true speed here. Will leave this open in case someone smarter comes along with better ideas.

@leochen12-rgb
Author

Thank you for your reply. I look forward to a du command being added to bandersnatch.

@cooperlees
Contributor

Awesome. Yeah, I’ll be surprised if it’s much faster, and it will be hard to get accurate numbers without checking whether the files exist, which is the expensive part. It might surprise us and be much quicker than du …
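To make the cost concrete, here is a rough sketch of the existence-checking variant. It assumes each file entry in the JSON metadata has a `url` whose path (`/packages/...`) mirrors the file's location under the mirror's `web` directory; that mapping and the function name `existing_size` are assumptions, not confirmed bandersnatch behaviour. Every file costs one `os.stat` call, which is the expensive part mentioned above.

```python
import os
from urllib.parse import urlparse


def existing_size(web_root, file_entries):
    """Sum on-disk sizes of the listed files that are actually present."""
    total = 0
    for entry in file_entries:
        # Assumption: the URL path maps directly onto the local layout.
        relative = urlparse(entry["url"]).path.lstrip("/")
        try:
            total += os.stat(os.path.join(web_root, relative)).st_size
        except FileNotFoundError:
            pass  # filtered out or not downloaded yet, so not counted
    return total
```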

@cooperlees added the help wanted label Sep 22, 2023
@cooperlees changed the title from "How to count the size of pypi directory" to "Add subcmd to use metadata to roughly calculate the size of the local bandersnatch mirror" Sep 22, 2023