Wrong maccyrillic decoding #297

Open
batyshkaLenin opened this issue Aug 11, 2022 · 11 comments

Comments

@batyshkaLenin

batyshkaLenin commented Aug 11, 2022

In this encoding, the character that follows ю is the symbol ¤. Because of this, wherever the letter "я" should appear, the symbol "€" (the last symbol in the table) is decoded instead.
[image]

@ashtuchkin
Owner

Hmm, I see the letter я at 0xDF. Could that be intentional?

@ashtuchkin
Owner

Also ¤ is the "current currency" symbol AFAIK, so I think it should be converted to Euro as expected. Let me know if it's a wrong assumption.

@batyshkaLenin
Author

The problem is that the decoding goes wrong. If you write a maccyrillic decoding test, you get ¤ instead of the letter я. The letter я is not the ¤ symbol. You are correct that ¤ is a currency symbol.

@ashtuchkin
Owner

Note that iconv-lite here uses generated data from the low-level iconv library, which is an informal standard for character encoding conversion, so I tend to trust it unless there's compelling data that it's wrong.

@ashtuchkin
Owner

Wait, what do you expect the code for this letter to be: 0xFF or 0xDF?

@batyshkaLenin
Author

The code for this letter should be 0xDF, but when decoding it is treated as 0xFF. I don't know how to prove this, except that when I enter the letter я in Numbers on macOS, it turns into ¤ after decoding, even though it should remain я.

@batyshkaLenin
Author

To check this, you could write a test for this encoding, as well as for the other Cyrillic encodings.

@ashtuchkin
Owner

Well, if you can debug-print the Buffer that is sent to the decode() method, we can check which byte corresponds to я there and potentially add a test. iconv-lite is already pretty thoroughly tested, but it uses either the iconv library or WHATWG as the "ground truth". These sources might be wrong, but that's pretty rare.
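
One way to do the debug print suggested above is to dump the raw bytes as hex right before they are decoded. This is only a sketch: `hexDump` is a hypothetical helper, not part of iconv-lite, and the sample bytes stand in for whatever Buffer the application actually passes to decode().

```javascript
// Hypothetical helper (not part of iconv-lite): format every byte of a Buffer as 0xNN.
function hexDump(buf) {
  return Array.from(buf, (b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

// Stand-in input. In real use this would be the exact Buffer passed to decode().
const input = Buffer.from([0xfe, 0xdf, 0xff]);
console.log(hexDump(input)); // prints "0xFE 0xDF 0xFF"
```

If the byte at the position of я shows up as 0xFF rather than 0xDF, the data was probably already different before it reached the decoder.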

@batyshkaLenin
Author

batyshkaLenin commented Aug 12, 2022

\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xdf must be equivalent to АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя, but it's not. Or am I misunderstanding something?

@ashtuchkin
Owner

ashtuchkin commented Aug 12, 2022

Just checked it, and it looks correct:

$ node
> iconv = require("iconv-lite")
> iconv.decode(Buffer.from("\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xdf", "binary"), "maccyrillic")
'АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя'
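
The byte layout this check relies on can be written out by hand. The sketch below covers only the 64 basic Russian letters from this thread (0x80-0x9F for А-Я, 0xE0-0xFE for а-ю, 0xDF for я). `decodeBasicCyrillic` is a hypothetical name, and this is not how iconv-lite implements the encoding, nor the full Mac Cyrillic table.

```javascript
// Illustrative only: maps just the letter ranges discussed in this thread.
// Any other byte becomes "?" (an arbitrary placeholder chosen for this sketch).
function decodeBasicCyrillic(buf) {
  let out = "";
  for (const b of buf) {
    if (b >= 0x80 && b <= 0x9f) {
      out += String.fromCharCode(0x0410 + (b - 0x80)); // А..Я
    } else if (b >= 0xe0 && b <= 0xfe) {
      out += String.fromCharCode(0x0430 + (b - 0xe0)); // а..ю
    } else if (b === 0xdf) {
      out += "я"; // я sits at 0xDF, apart from the other lowercase letters
    } else {
      out += "?";
    }
  }
  return out;
}

const bytes = Buffer.from([0x80, 0x9f, 0xe0, 0xfe, 0xdf]);
console.log(decodeBasicCyrillic(bytes)); // prints "АЯаюя"
```

The point of the sketch is that я is the one lowercase letter outside the contiguous 0xE0-0xFE run, so a byte 0xFF in the input is a different character, not я.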

@ashtuchkin
Owner

Where are you getting the wrong results?
