Non latin-1 filenames are not supported #13

Open
spock opened this issue Nov 12, 2023 · 5 comments

spock commented Nov 12, 2023

Thank you for a very well-thought-out tool! I'm currently evaluating it for keeping my 400+ GB, 50k-file archive safe(r).

While doing so, I came across this exception:

Traceback (most recent call last):
  File "/home/user/.local/bin/pff", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/pff.py", line 108, in main
    return saecc_main(argv=subargs, command=fullcommand)
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 574, in main
    relfilepath_ecc = compute_ecc_hash_from_string(relfilepath, ecc_manager_intra, hasher_intra, max_block_size, resilience_rate_intra)
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 203, in compute_ecc_hash_from_string
    fpfile = BytesIO(b(string))
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/lib/_compat.py", line 36, in b
    return codecs.latin_1_encode(x)[0]
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 16-26: ordinal not in range(256)

Looking at the code, it seems that latin-1 is used as an internal encoding, which indeed cannot encode any character outside the latin-1 range:

if sys.version_info < (3,):
    def b(x):
        return x
else:
    import codecs
    def b(x):
        if isinstance(x, _str):
            return codecs.latin_1_encode(x)[0]  # <-- here
        else:
            return x

The problematic filename had Ukrainian (Cyrillic) characters, which are not part of the latin-1 encoding.
Example string: зображення.
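
The failure can be reproduced without pyFileFixity at all. The sketch below is only an illustration using the standard-library `codecs` module; `encode_latin1` is a hypothetical stand-in for the `b()` helper quoted above, not the library's actual code.

```python
import codecs


def encode_latin1(text):
    """Hypothetical stand-in for b(): encode a str to bytes via latin-1."""
    return codecs.latin_1_encode(text)[0]


name = "зображення"  # example string from this report

# Collect the characters whose code point does not fit in one latin-1 byte.
outside = [ch for ch in name if ord(ch) > 255]

try:
    encode_latin1(name)
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(len(outside), "of", len(name), "characters are outside latin-1")
print(type(error).__name__)
```

Every character of the example string falls outside the range latin-1 can represent, so the encode call fails on the whole string.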

pyFileFixity version 3.1.4 installed with pip. I'm on Python 3.10.12.

spock commented Nov 12, 2023

Replacing

return codecs.latin_1_encode(x)[0]
# codecs.latin_1_encode("зображення")

with

return codecs.utf_8_encode(x)[0]
# codecs.utf_8_encode("зображення")

will work for this string, but may well raise an exception elsewhere, wherever latin-1 is expected?
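
What happens downstream depends on how those bytes are decoded later. As a rough illustration, assuming (this has not been checked against the pyFileFixity code) that some other code path decodes with latin-1, no exception is raised, but the name silently comes back garbled:

```python
name = "зображення"

utf8_bytes = name.encode("utf-8")

# Each of these Cyrillic letters takes two bytes in UTF-8.
print(len(name), len(utf8_bytes))

# latin-1 accepts every byte value, so decoding never raises ...
garbled = utf8_bytes.decode("latin-1")
# ... but the result is mojibake rather than the original name.
print(garbled == name)

# Decoding with the same codec that was used for encoding round-trips cleanly.
restored = utf8_bytes.decode("utf-8")
print(restored == name)
```

So a mixed latin-1/UTF-8 code path is more likely to corrupt names quietly than to fail loudly, which is why encode and decode need to be switched together.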

lrq3000 commented Nov 12, 2023 via email

lrq3000 commented Nov 13, 2023

Ok, so I remember why it is in latin-1: the software encodes byte by byte, and a byte can hold only 256 distinct values (0-255), so the idea was to use latin-1 as a codec if necessary, although normally these should be treated as bytes.
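
For reference, latin-1 fits a byte-by-byte design because it maps code points 0-255 one-to-one onto single bytes. A small standard-library check of that property (an illustration only, not code from this project):

```python
all_bytes = bytes(range(256))

# Decoding with latin-1 turns byte value n into the character with code point n.
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]

# Encoding back yields the original bytes unchanged.
round_trip = text.encode("latin-1")

# Code point 256 is the first one that no longer fits in a single byte.
try:
    chr(256).encode("latin-1")
    fits = True
except UnicodeEncodeError:
    fits = False

print(code_points == list(range(256)), round_trip == all_bytes, fits)
```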

This is old code left over from the Python 2/3 compatibility era. Now that Py2 support has been dropped everywhere, I should rewrite it to be more idiomatic Py3.
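
One possible shape for such a rewrite is sketched below. It is an assumption, not the fix that was applied in the project: `to_bytes` and `to_text` are hypothetical names, and UTF-8 with the `surrogateescape` error handler is just one reasonable choice for filenames (the standard library's `os.fsencode` takes a similar approach using the filesystem encoding).

```python
def to_bytes(value, encoding="utf-8"):
    """Hypothetical helper: return bytes for str or bytes-like input."""
    if isinstance(value, str):
        # surrogateescape lets undecodable bytes that were smuggled into a
        # str (as os.listdir can produce on POSIX) survive the round trip.
        return value.encode(encoding, "surrogateescape")
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


def to_text(data, encoding="utf-8"):
    """Hypothetical inverse of to_bytes."""
    return data.decode(encoding, "surrogateescape")


name = "зображення"
encoded = to_bytes(name)

# A filename that is not valid UTF-8 still round-trips byte for byte.
raw = b"bad-\xff-name"
raw_round_trip = to_bytes(to_text(raw))

print(to_text(encoded) == name, raw_round_trip == raw)
```

Passing bytes through unchanged keeps the byte-by-byte ECC logic untouched; only the str-to-bytes boundary changes.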

Could you please share a minimal example file that reproduces this issue? A simple text file with a few random non-latin-1 characters in its name should be enough (I'll try to make one myself, but it would help if you can provide an example file too).

lrq3000 commented Nov 13, 2023

Ok I can reproduce the issue using the example filename you provided, thank you very much. I can't believe I never tested a non-latin-1 filename. I will work on it, hopefully it's not too complicated.

lrq3000 changed the title from "UnicodeEncodeError: 'latin-1' codec can't encode characters in position 16-26: ordinal not in range(256)" to "Non latin-1 filenames are not supported" on Nov 13, 2023
lrq3000 self-assigned this on Nov 13, 2023
lrq3000 added the bug label on Nov 13, 2023