Packages are not automatically deleted + delete CLI bugs #1273

Open
89ao opened this issue Nov 11, 2022 · 11 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@89ao
Contributor

89ao commented Nov 11, 2022

Could you tell me how to automatically remove packages that have been removed from PyPI?

For example: https://pypi.org/project/apicolors/

apicolors was deleted from pypi.org 4 days ago (Nov 9), but my bandersnatch server had already synced it locally, and it still exists there now (Nov 11), even though my sync interval is 30 minutes.

Here is the bander.log:

2022-11-06 10:21:15,841 bandersnatch.package: INFO Fetching metadata for package: apicolors (serial 15671340)
2022-11-06 10:21:15,966 bandersnatch.mirror: INFO Downloading: https://mirrors.tuna.tsinghua.edu.cn//packages/12/27/92bfd44c97e3ed74a028da41b3ae419d4b2c6e7233003841f2c49cafec98/apicolors-6.6.6.tar.gz
2022-11-06 10:21:17,947 bandersnatch.mirror: INFO Continuing to next candidate URL after error downloading: https://mirrors.tuna.tsinghua.edu.cn//packages/12/27/92bfd44c97e3ed74a028da41b3ae419d4b2c6e7233003841f2c49cafec98/apicolors-6.6.6.tar.gz
2022-11-06 10:21:17,948 bandersnatch.mirror: INFO Downloading: https://files.pythonhosted.org/packages/12/27/92bfd44c97e3ed74a028da41b3ae419d4b2c6e7233003841f2c49cafec98/apicolors-6.6.6.tar.gz
2022-11-06 10:21:17,980 bandersnatch.mirror: INFO Storing index page(s): apicolors - in /opt/bandersnatch/web/simple/apicolors
2022-11-07 08:51:20,898 bandersnatch.package: INFO Fetching metadata for package: apicolors (serial 15678961)
2022-11-07 08:51:20,939 bandersnatch.mirror: INFO Downloading: https://mirrors.tuna.tsinghua.edu.cn//packages/2d/0a/d4c6fa3f16b71d70ab2ca6387aee93a84c191fd9711daa812df0054c17b4/apicolors-6.6.7.tar.gz
2022-11-07 08:51:20,974 bandersnatch.mirror: INFO Continuing to next candidate URL after error downloading: https://mirrors.tuna.tsinghua.edu.cn//packages/2d/0a/d4c6fa3f16b71d70ab2ca6387aee93a84c191fd9711daa812df0054c17b4/apicolors-6.6.7.tar.gz
2022-11-07 08:51:20,975 bandersnatch.mirror: INFO Downloading: https://files.pythonhosted.org/packages/2d/0a/d4c6fa3f16b71d70ab2ca6387aee93a84c191fd9711daa812df0054c17b4/apicolors-6.6.7.tar.gz
2022-11-07 08:51:21,008 bandersnatch.mirror: INFO Storing index page(s): apicolors - in /opt/bandersnatch/web/simple/apicolors
2022-11-09 08:21:31,336 bandersnatch.package: INFO Fetching metadata for package: apicolors (serial 15704728)
2022-11-09 08:21:31,625 bandersnatch.package: INFO apicolors no longer exists on PyPI

Here is my bandersnatch.conf. I'm using bandersnatch-6.0.0 with docker-compose.

[mirror]
directory = /opt/bandersnatch
storage-backend = filesystem
master = https://pypi.org/
json = true
timeout = 300
workers = 3
hash-index = false
stop-on-error = false
delete-packages = true
compare-method = stat
log-config = /conf/bandersnatch-log.conf
download-mirror = https://mirrors.tuna.tsinghua.edu.cn/


[plugins]
enabled =
    blocklist_project
    blocklist_release
    regex_project


[blocklist]
packages =
    uselesscapitalquiz
    tf-nightly-gpu
    tf-nightly
    tensorflow-io-nightly
    tf-nightly-cpu
    pyagrum-nightly
    appium


[filter_regex]
packages =
    .+-nightly.*

@cooperlees
Contributor

cooperlees commented Nov 11, 2022

Bandersnatch does not support delete during the mirror. There is not enough metadata to know what blobs to delete. That said, I have not dug into yanking, we might have enough metadata for those - might be worth looking into.

We only have bandersnatch verify --delete, as it has to walk the file system and work out which files on the file system are no longer part of any JSON metadata ...

Without adding more metadata to PyPI we can't make this more efficient.
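
To make the cost of that walk concrete, here is a minimal sketch of the idea. It is not bandersnatch's actual verify code. It assumes a mirror layout with one JSON metadata file per project under web/json/ in PyPI's JSON format (a "releases" mapping whose file entries carry a "url") and artifacts under web/packages/. The function names referenced_blobs and orphaned_blobs are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def referenced_blobs(web_root: Path) -> set[Path]:
    """Collect every artifact path that some local JSON metadata file still lists."""
    referenced: set[Path] = set()
    for meta_file in (web_root / "json").iterdir():
        try:
            metadata = json.loads(meta_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable metadata: skip it rather than guess
        for files in metadata.get("releases", {}).values():
            for file_info in files:
                # The URL path looks like /packages/12/27/<hash>/<filename>
                rel = urlparse(file_info["url"]).path.lstrip("/")
                referenced.add(web_root / rel)
    return referenced


def orphaned_blobs(web_root: Path) -> list[Path]:
    """Walk web/packages and return the files that no metadata refers to."""
    keep = referenced_blobs(web_root)
    packages_dir = web_root / "packages"
    return sorted(
        p for p in packages_dir.rglob("*") if p.is_file() and p not in keep
    )
```

Every metadata file and every blob has to be visited once, which is why this approach is slow on a large mirror and cannot easily be made incremental.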

@cooperlees cooperlees added bug Something isn't working help wanted Extra attention is needed labels Nov 11, 2022
@89ao
Contributor Author

89ao commented Nov 11, 2022

@cooperlees Thanks, Cooper. The problem is that risky packages may appear upstream from time to time. Once PyPI deletes one, I'd like my mirror to stay consistent.
Please consider adding this feature, thanks!

@cooperlees
Contributor

cooperlees commented Nov 11, 2022

This is not an easy fix. As I said, ideally we'd need to put more metadata into Warehouse (pypi.org). If you have cycles, opening an issue on warehouse (if we don't have one) asking for better metadata to allow mirroring to delete packages would be a good start.

  • Another hacky thing we could do is change how we store blobs, so we can check during a mirror run whether any packages need to be removed ... we just copy pypi.org's layout today as it matches metadata.

bandersnatch will generate correct Simple API HTML + JSON, so the package manager (e.g. pip) won't know the deleted/yanked version exists. The artifacts/blobs are just sitting there wasting disk space. A verify running in the background could slowly reclaim space. Walking filesystems is slow tho, I get that :(

@89ao
Contributor Author

89ao commented Nov 15, 2022

Thanks @cooperlees. It's not only a disk space issue. It seems that once in a while PyPI deletes risky packages, such as "rest-framework" and "apicolors" as I mentioned. We also don't want them to remain downloadable.
Can "bandersnatch verify --delete" delete the outdated packages automatically? If not, we may need to write a shell script to do this manually.

@cooperlees
Contributor

cooperlees commented Nov 15, 2022

https://bandersnatch.readthedocs.io/en/latest/#bandersnatch-verify

Yes, running a verify with --delete will keep track and delete packages. It's not smart or incremental and needs to walk project by project to do so. All enhancements welcome.

  • But once again, the simple API we generate should (pending no bugs or missing yanked support) no longer include the deleted/yanked packages, so pip pointed at your mirror will not consider using those versions.

I would love to know how you imagine doing this via shell? It should be no easier than just enhancing bandersnatch's logic.

@89ao
Contributor Author

89ao commented Nov 16, 2022

Maybe obtain the official package list and compare it to the local list? If a package no longer exists upstream, delete it locally?
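
A minimal sketch of that comparison, under a few assumptions: the local list is read from the directory names under the mirror's web/simple/ directory, the upstream list has already been obtained somehow (for example from PyPI's project index; fetching is left out here), and names are compared after PEP 503 normalization. The helper names normalize, local_project_names and stale_projects are hypothetical and not part of bandersnatch.

```python
import re
from pathlib import Path


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def local_project_names(simple_dir: Path) -> list[str]:
    """List the project directories under a mirror's web/simple directory."""
    return [p.name for p in simple_dir.iterdir() if p.is_dir()]


def stale_projects(local_names, upstream_names) -> list[str]:
    """Return the projects present locally but absent upstream, sorted."""
    upstream = {normalize(n) for n in upstream_names}
    local = {normalize(n) for n in local_names}
    return sorted(local - upstream)
```

One caveat with this approach: if the upstream list is empty or truncated (a failed download, for instance), every local project looks stale, so the result should be sanity-checked before anything is deleted.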

@89ao
Contributor Author

89ao commented Jan 3, 2023

Just as an information sync, this situation has happened again, as described here:
https://medium.com/checkmarx-security/py-torch-a-leading-ml-framework-was-poisoned-with-malicious-dependency-e30f88242964

https://pypi.org/project/torchtriton/ shows that torchtriton has already been deleted, but bandersnatch did not delete it automatically.
So we deleted it manually. Looking forward to further updates, thanks!

@89ao
Contributor Author

89ao commented Jan 5, 2023

@cooperlees Hello cooperlees,
I recently wrote a small tool to compare the local project list with the official project list ()
and found a bunch of projects that exist locally but no longer exist upstream, for example:

...
a-plus-b
a-simple-modu
a1g0py8128
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaa-lama-ze-lo-oved
aabs7calc
aaron
aashika-calculator
abaxador-de-arquivo
abc-reader
abc0123
abcmikivideo
abenity
abhishekwebcodett
abhishekwebcodett2
abhishekwebcodett3
abilityrequest
...

The problem is that when I try to delete them manually, it fails. Take the project "aaaaaaaaaaa" for example:

[root@VM_21_104_centos /data/home/motorao/bandersnatch]# ls -al /yum/pip/web/simple/aaaaaaaaaaa/
total 18580
drwxr-xr-x 2 root root     4096 Jun  7  2022 .
drwxr-xr-x 1 root root 19009536 Jan  5 22:48 ..
-rw-r--r-- 1 root root      452 Jun  7  2022 index.html
[root@VM_21_104_centos /data/home/motorao/bandersnatch]# cat /yum/pip/web/simple/aaaaaaaaaaa/index.html
<!DOCTYPE html>
<html>
  <head>
    <title>Links for aaaaaaaaaaa</title>
  </head>
  <body>
    <h1>Links for aaaaaaaaaaa</h1>
    <a href="../../packages/6d/c1/2d60ee949b1be5382703260b0bdd4345e2711abdddc2b9e2bbb46f788ac1/aaaaaaaaaaa-1.1.1-py2.py3-none-any.whl#sha256=05ff699e6eb769bdcc489f4390a51d1056332e8d16bb0bd0ef5f15709341b88f" data-requires-python="&gt;=2">aaaaaaaaaaa-1.1.1-py2.py3-none-any.whl</a><br/>
  </body>
</html>
<!--SERIAL 14055105-->
[root@VM_21_104_centos /data/home/motorao/bandersnatch]# bandersnatch delete aaaaaaaaaaa
2023-01-05 22:50:00,019 ERROR: Unable to load entry point swift_plugin = bandersnatch_storage_plugins.swift:SwiftStorage: No module named 'keystoneauth1'
2023-01-05 22:50:00,020 ERROR: /yum/pip/web/json/aaaaaaaaaaa does not exist. Pulling from PyPI
2023-01-05 22:50:00,021 INFO: Fetching https://pypi.python.org/pypi/aaaaaaaaaaa/json
2023-01-05 22:50:00,399 ERROR: /yum/pip/web/json/aaaaaaaaaaa.new does not exist - Did not get new JSON metadata
2023-01-05 22:50:00,399 ERROR: Unable to HTTP get JSON for /yum/pip/web/json/aaaaaaaaaaa

Could you help explain why this happens?

@cooperlees
Contributor

So I don't have any plans to work on this. To do this correctly we need to store packages differently, change PyPI metadata or add another API to PyPI to let us know what to delete.

In the logs I see /yum/pip/web/json/aaaaaaaaaaa - Seems it's not adding /data/home/motorao/bandersnatch to the path? I haven't read the code but we must have a bug there.

If that's not the issue, then it's the fact that the package is deleted, and so is the JSON metadata, so we need to use local metadata only. If that's somehow been deleted too we're out of luck and need to manually delete.
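
For the manual case, when only the simple index page survives, the artifact paths can still be recovered from it. The following is a rough sketch, assuming the index uses relative links with a "#sha256=..." fragment as in the index.html shown earlier in this thread. The names _LinkCollector and artifacts_from_index are hypothetical and not part of bandersnatch, and the function only computes paths; it deletes nothing.

```python
import os
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def artifacts_from_index(index_html: str, project_dir: Path) -> list[Path]:
    """Resolve each relative link in a simple index page to a path on disk."""
    collector = _LinkCollector()
    collector.feed(index_html)
    paths = []
    for href in collector.hrefs:
        rel, _fragment = urldefrag(href)  # drop the "#sha256=..." part
        # Links are relative to the project's directory under web/simple/
        paths.append(Path(os.path.normpath(project_dir / rel)))
    return paths
```

Reviewing the returned list before removing anything seems prudent, since absolute URLs or a differently configured mirror would not match the assumption about relative links.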

  • Maybe the delete CLI needs a --no-json-update to try not to pull from pypi.org
    • Or we could just fallback to local metadata by default and just log we're doing so
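
A sketch of what those two options could look like together. This is purely illustrative and not bandersnatch's code: load_metadata, its fetch_upstream callback and the allow_update parameter (standing in for the proposed --no-json-update flag) are all hypothetical names.

```python
import json
import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def load_metadata(
    json_path: Path,
    fetch_upstream: Optional[Callable[[], dict]] = None,
    allow_update: bool = True,
) -> Optional[dict]:
    """Use local JSON metadata if present, otherwise optionally ask upstream."""
    if json_path.is_file():
        logger.info("Using local metadata at %s", json_path)
        return json.loads(json_path.read_text())
    if allow_update and fetch_upstream is not None:
        try:
            return fetch_upstream()
        except Exception as exc:  # the callback is caller-supplied, so stay broad
            logger.warning("Could not fetch upstream metadata: %s", exc)
    # Neither source is available: the caller has to delete manually
    return None
```

With allow_update=False the function never contacts upstream, which is the behavior a deleted project needs, since upstream no longer has any metadata to return.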

Fix PR with unittest covering bug/new behavior welcome!

@cooperlees cooperlees changed the title packages not deleted Packages are not automatically deleted + delete CLI bugs Jan 5, 2023
@89ao
Contributor Author

89ao commented Jan 6, 2023

Maybe the delete CLI needs a --no-json-update to try not to pull from pypi.org

Yes, indeed.
I'll learn how to make a fix PR and try it later. Thanks a lot, @cooperlees!

@cooperlees
Contributor

cooperlees commented Jan 7, 2023

Should just need a boolean around the code that calls pypi.org to pull the JSON in verify.py - I haven't read the code tho, and I have a terrible memory :)
