
High memory usage for verify --delete due to deletion occurring at end #1283

Open
kenyon opened this issue Nov 20, 2022 · 3 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments


kenyon commented Nov 20, 2022

I had a PyPI mirror that had never had a run of verify --delete, so it had grown to around 25 TB. Initially, trying to run verify --delete exhausted all of my machine's memory. It only had 8 GB of RAM, but still, the algorithm should be able to delete during the run (and therefore use a relatively constant amount of memory regardless of the number of deletions needed) rather than building a list in memory and deleting everything at the end.

I was able to get verify --delete to finish with 64 GB of RAM, but I don't know how much memory it actually needed. Now the PyPI mirror is somewhere less than 9.5 TB.

@cooperlees cooperlees added enhancement New feature or request help wanted Extra attention is needed labels Nov 21, 2022
@cooperlees (Contributor) commented Nov 21, 2022

This is known. Maybe we should document it more. 25 TB is pretty crazy; there must be a lot of nightlies or something adding up, as PyPI reports ~13.5 TB these days (https://pypi.org/stats/).

We have to keep the state of all the files found on the file system in order to work out which files to delete. That said, I would be open to adding a different way to save this state, or to trying to avoid it altogether.

Some ideas I can think of (in order of preference):

  • Check the metadata for the project as we walk the file system to see if the file is listed; if not, just unlink it there and then
    • The problem is that sometimes the package name and the release file names do not match because the name has been normalized (this should be calculable)
  • Write the file system state out to a SQLite database on the file system ...
  • Periodically write state out to a file or some other format - searching this will be expensive ...

I'm all open to other ideas, and if we agree on something, PRs are totally welcome to make this better ...
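To make the first idea concrete, here is a minimal sketch. It is not bandersnatch's actual code: the per-project directory layout and the referenced_files(name) metadata lookup are assumptions made for illustration. The normalize() helper is the real PEP 503 name normalization, which addresses the "names have been normalized" complication above.

```python
import re
from pathlib import Path
from typing import Callable

def normalize(name: str) -> str:
    # PEP 503 name normalization: lowercase, collapse runs of "-", "_", "." to "-"
    return re.sub(r"[-_.]+", "-", name).lower()

def streaming_verify_delete(
    projects_root: Path,
    referenced_files: Callable[[str], set],
    dry_run: bool = True,
) -> int:
    # Unlink unreferenced files as the walk proceeds, so memory use stays
    # roughly constant instead of accumulating one big deletion list.
    # Assumes a hypothetical per-project layout (projects_root/<name>/<file>)
    # and a hypothetical referenced_files(name) lookup backed by metadata.
    deleted = 0
    for project in projects_root.iterdir():
        if not project.is_dir():
            continue
        keep = referenced_files(normalize(project.name))
        for blob in project.iterdir():
            if blob.is_file() and blob.name not in keep:
                if not dry_run:
                    blob.unlink()
                deleted += 1
    return deleted
```

Because each project's "keep" set is dropped before the next project is walked, peak memory is bounded by the largest single project rather than the whole mirror.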


kenyon commented Nov 21, 2022

Thanks for the reply. I suppose this shouldn't be a high-priority issue if you always run verify --delete right after running mirror daily or so, since then the number of deletions should be fairly small.


cooperlees commented Nov 28, 2022

I don't think that helps with memory (I haven't looked at the code in a long time). I believe we have to map the whole file system in order to find files that are there but no longer belong to any metadata ... It's a horrible algorithm, but it was the safest way to be accurate.

With the size of the mirror (both file count and bytes) these days, I think it's time we looked into adding deleted releases to the metadata, and into whether we can slightly improve this using the yanked PEP(s). I wrote this before they existed.

The main complexity is that we follow PyPI's blob storage pathing. We don't "need" to do this; we could move to just sharding by package name, similar to simple, which would make deletes as simple as walking the "projects" blob area and deleting anything no longer referenced by metadata. There are many ways to make this better; I'm just not sure which is best. Something like changing how we store the blobs would need to go into a 7.0 release ... It's a big change.
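To illustrate the trade-off: content-addressed paths in the style PyPI uses (a blake2b-256 digest split 2/2/60, as seen in files.pythonhosted.org URLs) don't encode the owning project, while a hypothetical name-sharded layout would. Both helpers below are illustrative sketches, not bandersnatch functions.

```python
import hashlib

def pypi_blob_path(content: bytes, filename: str) -> str:
    # Content-addressed pathing in the style PyPI uses: the blake2b-256
    # digest of the file, split 2/2/60. The project a file belongs to
    # cannot be recovered from the path alone, so verify needs global state.
    digest = hashlib.blake2b(content, digest_size=32).hexdigest()
    return f"packages/{digest[:2]}/{digest[2:4]}/{digest[4:]}/{filename}"

def name_sharded_path(project: str, filename: str) -> str:
    # Hypothetical alternative: shard by project name, so verify can walk
    # one project directory at a time and compare it against that project's
    # metadata with no global state.
    return f"projects/{project}/{filename}"
```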

We would probably also need a tool to go and do the 100000000000 mv's to reorganize existing mirrors. That would not be a cheap operation.
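For reference, the SQLite idea floated earlier in the thread could be sketched like this. The table names and the split into a walk pass plus a metadata pass are assumptions for illustration, not bandersnatch's design; the point is that both sets live in SQLite rather than in Python memory, so peak RAM no longer scales with mirror size.

```python
import sqlite3

def delete_candidates(db: sqlite3.Connection, walked, referenced):
    # Store the walked-file set and the metadata-referenced set in SQLite
    # (on disk in real use) instead of in-memory sets; the anti-join then
    # yields files present on disk but absent from metadata.
    db.execute("CREATE TABLE IF NOT EXISTS walked (path TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE IF NOT EXISTS referenced (path TEXT PRIMARY KEY)")
    db.executemany("INSERT OR IGNORE INTO walked VALUES (?)", ((p,) for p in walked))
    db.executemany("INSERT OR IGNORE INTO referenced VALUES (?)", ((p,) for p in referenced))
    query = "SELECT path FROM walked WHERE path NOT IN (SELECT path FROM referenced)"
    return sorted(path for (path,) in db.execute(query))
```

In a real run the walked and referenced iterables would be generators fed incrementally, so neither set is ever fully materialized in Python.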
