Question: Should input string be UTF-8 normalized? #127

dermoth opened this issue Jan 4, 2022 · 0 comments

dermoth commented Jan 4, 2022

Hi,

I came across this tool while researching my own version of a NATO speller. Naturally I got curious. I'm not really a web dev, so mine is much simpler, but one thing I noticed is that you don't seem to apply Unicode normalization to the input before converting to NATO. Normalizing would make it possible to either strip accents before conversion or pass them through as-is for NATO, and it is required to get consistent results for other alphabets that include accented characters. See my code (web page) for an example.

Testing

To test, paste the following into your browser's JS console to generate the NFC and NFD versions of accented characters (using é and Ë as examples):

'é'.normalize('NFC')
'é'.normalize('NFD')
'Ë'.normalize('NFC')
'Ë'.normalize('NFD')

Then copy/paste the output into https://cryptii.com/pipes/nato-phonetic-alphabet
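
Since the NFC and NFD results look identical on screen, it can help to print their code points before pasting. The following is a minimal sketch; `codePoints` is a hypothetical helper name, not something from either project:

```javascript
// List the code points of a string as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const eAcute = '\u00e9';     // precomposed é
const eDiaeresis = '\u00cb'; // precomposed Ë

console.log(codePoints(eAcute.normalize('NFC')));     // [ 'U+00E9' ]
console.log(codePoints(eAcute.normalize('NFD')));     // [ 'U+0065', 'U+0301' ]
console.log(codePoints(eDiaeresis.normalize('NFD'))); // [ 'U+0045', 'U+0308' ]
```

The NFD form is the base letter followed by a separate combining mark, which is what the converter ends up seeing as two characters.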

The issues

  • The decomposed (NFD) é prints as Echo ́. In my version I strip the accents from the decomposed form, which can be matched using /[\u0300-\u036f]/g.
  • The diaeresis of the decomposed Ë doesn't print at all; I see a square box instead.
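
The accent-stripping approach from the first bullet can be sketched as follows. This is only an illustration: `stripAccents` is a hypothetical name, and the regex covers just the Combining Diacritical Marks block (U+0300 to U+036F), so marks from other blocks would be left in place.

```javascript
// Decompose to NFD, then drop combining marks in U+0300..U+036F.
function stripAccents(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripAccents('\u00e9'));  // "e" (precomposed input)
console.log(stripAccents('E\u0308')); // "E" (already decomposed input)
```

Normalizing first means precomposed and decomposed input end up as the same plain letter.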

I think the second issue is caused by the way you iterate over the characters; see line 54 of my code, which is how I loop over Unicode characters outside the Basic Multilingual Plane. Indexing into a string visits each individual element of the decomposed form separately.
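
To illustrate the difference between indexing and code point iteration, here is a small sketch using only standard string behaviour. The sample characters are chosen for the example and are not taken from either code base.

```javascript
// Indexing and .length work on UTF-16 code units; Array.from and for...of
// iterate by code point.
const astral = '\u{1D400}'; // MATHEMATICAL BOLD CAPITAL A, outside the BMP
console.log(astral.length);             // 2 (a surrogate pair)
console.log(Array.from(astral).length); // 1

// A decomposed letter is still two code points, so code point iteration
// alone does not keep the base letter and its mark together.
const decomposed = 'E\u0308'; // E + COMBINING DIAERESIS
console.log(Array.from(decomposed).length);                  // 2
console.log(Array.from(decomposed.normalize('NFC')).length); // 1
```

Code point iteration takes care of surrogate pairs, while keeping a base letter with its accent still appears to need normalization (or stripping the marks) first.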

For further reading about normalisation forms: https://unicode.org/reports/tr15/

Bug #17 also needs to be taken into consideration. Normalization could be done in a standalone UTF-8 codec; otherwise the spelling alphabet codec could take it as a parameter.
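
As a rough sketch of the parameter idea, and not the project's actual codec API: `spell`, `NATO` and the `stripAccents` option are all hypothetical names made up for this example.

```javascript
// Hypothetical NATO table and speller with a normalization switch.
const NATO = {
  a: 'Alfa', b: 'Bravo', c: 'Charlie', d: 'Delta', e: 'Echo', f: 'Foxtrot',
  g: 'Golf', h: 'Hotel', i: 'India', j: 'Juliett', k: 'Kilo', l: 'Lima',
  m: 'Mike', n: 'November', o: 'Oscar', p: 'Papa', q: 'Quebec', r: 'Romeo',
  s: 'Sierra', t: 'Tango', u: 'Uniform', v: 'Victor', w: 'Whiskey',
  x: 'X-ray', y: 'Yankee', z: 'Zulu',
};

function spell(text, { stripAccents = true } = {}) {
  // Either remove accents (NFD + strip marks) or compose them (NFC) so
  // that each accented letter is passed through as a single character.
  const prepared = stripAccents
    ? text.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
    : text.normalize('NFC');
  // Array.from iterates by code point; unknown characters are copied as-is.
  return Array.from(prepared, (ch) => NATO[ch.toLowerCase()] ?? ch).join(' ');
}

console.log(spell('\u00e9'));                           // "Echo"
console.log(spell('e\u0301', { stripAccents: false })); // "é" as one character
```

With the option off, both input forms produce the same composed character, which is the consistency that other alphabets with accented letters would rely on.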
