Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement: Filter packages with very long versions #1228

Open
craigatfetch opened this issue Sep 30, 2022 · 7 comments
Open

Enhancement: Filter packages with very long versions #1228

craigatfetch opened this issue Sep 30, 2022 · 7 comments
Labels
bug Something isn't working enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@craigatfetch
Copy link

There is currently a package "uselesscaptialquiz" that has a very long version name. So long that sync'ing the package fails on Ubuntu 20.04 because the result package file name is too long for the OS, and the mirror fails.

Request: add filtering option to ignore packages which have version names so long that the OS gives an error when sync'ing.

@cooperlees
Copy link
Contributor

What do you exactly want here? I guess you want automatic catching of the OSError (I guess it is) exception when this occurs and just print an error saying we're ignoring package X cause it's naming/versioning is to long for the storage?

For now to avoid the error. you could just deny list it: https://bandersnatch.readthedocs.io/en/latest/filtering_configuration.html#allowlist-blocklist-filtering-settings

@craigatfetch
Copy link
Author

craigatfetch commented Sep 30, 2022 via email

@cooperlees
Copy link
Contributor

I would accept a PR doing this with appropriate unit testing showing the behavior.

Agree PyPI should be more strict there. Have you search / opened an issue there? i.e. https://github.com/pypi/warehouse/issues

@cooperlees cooperlees added bug Something isn't working enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Sep 30, 2022
@craigatfetch
Copy link
Author

craigatfetch commented Sep 30, 2022 via email

@forky2
Copy link
Contributor

forky2 commented Oct 26, 2022

I've just run into this problem too. One big problem with it from a banderstatch point of view is that a new mirror will never get past serial 0 and all subsequent executions of bandersnatch mirror will be looping over the same set of todo packages to try (and fail) to get to serial X that was current at the time of the first execution.

Of course, that's the correct behaviour because bandersnatch has failed to get one of the todo packages due to an unhandled exception and I don't think any of us expected this to happen:

2022-10-26 08:41:43,150 INFO: Downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:875)
2022-10-26 08:41:43,653 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:686)
Traceback (most recent call last):
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 662, in sync_release_files
    downloaded_file = await self.download_file(
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 892, in download_file
    with self.storage_backend.rewrite(path, "wb") as f:
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/bandersnatch/src/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
    with tempfile.NamedTemporaryFile(
  File "/usr/lib/python3.8/tempfile.py", line 679, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.8/tempfile.py", line 389, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mnt/mirrors/pypi/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.7lqjgi8o'
2022-10-26 08:41:43,658 ERROR: Error syncing package: uselesscapitalquiz@14521754 (mirror.py:377)
Traceback (most recent call last):
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 130, in package_syncer
    await self.process_package(package)
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 337, in process_package
    await self.sync_release_files(package)
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 693, in sync_release_files
    raise deferred_exception  # raise the exception after trying all files
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 662, in sync_release_files
    downloaded_file = await self.download_file(
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 892, in download_file
    with self.storage_backend.rewrite(path, "wb") as f:
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/bandersnatch/src/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
    with tempfile.NamedTemporaryFile(
  File "/usr/lib/python3.8/tempfile.py", line 679, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.8/tempfile.py", line 389, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mnt/mirrors/pypi/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.7lqjgi8o'

I'm not sure how best to deal with this:

  • bandersnatch could jump through hoops to accurately reflect the troll package. It could do something like store it in a shortened name, and update the metadata. I don't like this idea because it's making us do exceptional work for conrived troll packages.
  • alternatively, bandersnatch could just not support filenames longer than X and drop the packages (treating the exception as OK, and not recording the issue as an error; perhaps just a warning on output). This isn't great as we'd probably be breaking various PEPs.
  • we could leave it as it is, and it's up to the user to troubleshoot the issue and add offending packages to the disallowlist. This isn't very user friendly however; I've probably wasted about an hour on this already. 😃
  • we could put packages that fail for unexpected reasons like this into a problem_packages file (like todo) and not record it as an error. This would allow the process to complete the loop and not view the outcome as a failure: the serial number will get updated, todo will be empty, problem_packages will reflect the fact that certain packages had issues and can be retried with bandersnatch sync. I like this the best because it correctly records the minority of packages that didn't fully sync, allows easy remedial action, and allows bandersnatch to continue operating normally without being stuck on serial 0.

@cooperlees
Copy link
Contributor

Thanks for sharing your experience.

Here are my thoughts on each of your suggestions:

  • I do not wish to implement logic like this as it will hide problems and create new ones potentially
    • As stated above, it would be nice to have PyPI not allow this naming/versioning at all
  • We could just see the filename to long exception and then just state due to it, we're skipping he package in logs
    • Long as we're not silent about it, I'm open to behavior like this ...
  • I am sorry it ate time, but the error in the logs is pretty clear ...
  • This just sounds like a separate denylist? Maybe we could just add a --no-filtering switch to sync so people can still manually try denylisted packages via the sync sub command?

@forky2
Copy link
Contributor

forky2 commented Oct 27, 2022

I'm really not annoyed about occasionally having to occasionally waste some time fixing problems with this. The amount of time I've saved with this project is immense, and I really like the project.

My suggestion of a problem_packages file was not for the user to create an exception list (no, that would just be like the blocklist); rather I was suggesting that the tool could record the fact that the package had issues.

I don't know how other people use bandersnatch, but I find tracking problems quite difficult. The logs are very verbose! If I run bandersnatch mirror in a screen terminal, then any errors that may have been encountered are long lost above the maximum scrollback. If I have an issue I've got to rerun it and pipe the logs somewhere and grep -v INFO to get rid of all the noise, and just hope that the issue appears.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants