AttachFile move: unicode normalization troubles #59

Open
ThomasWaldmann opened this issue Jul 13, 2020 · 5 comments

@ThomasWaldmann
Member

ThomasWaldmann commented Jul 13, 2020

Some characters can be represented in more than one way, e.g. the German umlaut ä can be:

# NFC normalization (composed):
>>> print("\xc3\xa4".decode('utf8'))
ä
# NFD normalization (decomposed):
>>> print("a\xcc\x88".decode('utf8'))
ä

So both render as ä, but they are different Unicode code point sequences:

>>> "\xc3\xa4".decode('utf8')
u'\xe4'
>>> "a\xcc\x88".decode('utf8')
u'a\u0308'
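
For readers on Python 3 (the snippets above are Python 2.7), here is a minimal sketch of the same comparison. The variable names are illustrative only; nothing here is MoinMoin code.

```python
import unicodedata

composed = "\u00e4"      # NFC: the single code point U+00E4
decomposed = "a\u0308"   # NFD: 'a' followed by U+0308 COMBINING DIAERESIS

# Both render as the same glyph, but they are different code point sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# The UTF-8 byte sequences differ too, which is what a byte-oriented
# filesystem actually compares.
print(composed.encode("utf-8"))    # b'\xc3\xa4'
print(decomposed.encode("utf-8"))  # b'a\xcc\x88'

# Normalizing both sides to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```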

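Uploading an attachment whose name contains ä in the decomposed form (a\xCC\x88) makes it impossible to rename later; the move then looks for the composed \xc3\xa4 name, which does not exist on disk: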

File "/srv/moin-1.9/MoinMoin/action/AttachFile.py", line 301, in move_attachment
     filesize = os.path.getsize(attachment_path)
File "/usr/lib/python2.7/genericpath.py", line 57, in getsize
     return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/srv/.../attachments/...\\xc3\\xa4....pdf'
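
One possible workaround, sketched in Python 3 under stated assumptions: instead of building the path from the requested name directly, list the directory and match names after normalizing both sides. `find_attachment` is a hypothetical helper, not a function in MoinMoin, and this is not how moin 1.9 resolves attachments.

```python
import os
import tempfile
import unicodedata


def find_attachment(directory, wanted_name, form="NFC"):
    """Return the path of an existing file whose name equals wanted_name
    after Unicode normalization, or None if no such file exists.

    Hypothetical helper for illustration only.
    """
    wanted = unicodedata.normalize(form, wanted_name)
    for actual_name in os.listdir(directory):
        if unicodedata.normalize(form, actual_name) == wanted:
            # Return the name exactly as stored on disk.
            return os.path.join(directory, actual_name)
    return None


# Usage: store a file under an NFD name, then look it up by its NFC name.
with tempfile.TemporaryDirectory() as attach_dir:
    open(os.path.join(attach_dir, "a\u0308.pdf"), "w").close()
    path = find_attachment(attach_dir, "\u00e4.pdf")
    print(path is not None and os.path.exists(path))
```

The trade-off is a directory scan per lookup, and two stored names that differ only in normalization would become ambiguous; the sketch simply returns the first match.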

@ThomasWaldmann
Member Author


>>> unicodedata.normalize('NFC', u'a\u0308')
u'\xe4'

@ThomasWaldmann
Member Author

See https://pypi.org/project/nfd2nfc/ for background on the usual normalization forms on Linux / macOS.

@ThomasWaldmann changed the title from "AttachFile move: umlaut coding troubes" to "AttachFile move: unicode normalization troubles" on Jul 13, 2020
@ThomasWaldmann
Member Author

moin 1.9 code:

  • does not normalize page text in any way (it is stored exactly as typed), so whether it is unnormalized, NFC or NFD is unknown.
  • does not normalize filenames, so whether a given name is unnormalized, NFC or NFD is unknown as well.

thus (moin running on Linux):

  • it works as long as you are consistent: when referring to an attachment, you must use exactly the same name (same Unicode representation, same NFC / NFD form, same encoded UTF-8 byte sequence).
  • using an NFD link to an NFC attachment name (or vice versa) will not work.
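
To illustrate the two cases above, here is a small Python 3 sketch that tells an exact match apart from a name that matches only after normalization. `classify_reference` and its return values are hypothetical names chosen for this example; nothing like it exists in moin 1.9.

```python
import unicodedata


def classify_reference(link_name, attachment_names):
    """Classify how link_name relates to the stored attachment names.

    Returns "exact" if a stored name is identical, "normalization-mismatch"
    if a stored name matches only after NFC normalization, and "missing"
    otherwise. Hypothetical helper for illustration only.
    """
    if link_name in attachment_names:
        return "exact"
    wanted = unicodedata.normalize("NFC", link_name)
    for name in attachment_names:
        if unicodedata.normalize("NFC", name) == wanted:
            return "normalization-mismatch"
    return "missing"


stored = ["a\u0308.pdf"]  # attachment uploaded with an NFD name

print(classify_reference("a\u0308.pdf", stored))  # exact
print(classify_reference("\u00e4.pdf", stored))   # normalization-mismatch
print(classify_reference("o\u0308.pdf", stored))  # missing
```

Only the "exact" case corresponds to a link that would resolve today; the "normalization-mismatch" case is the failure described in this issue.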
