Long lines in new ScriptExtensions.txt cause the space after # to disappear #736

Open
roozbehp opened this issue Mar 13, 2024 · 2 comments

@roozbehp
Contributor

roozbehp commented Mar 13, 2024

Instead of # Po MIDDLE DOT, the first data line in https://github.com/unicode-org/unicodetools/blob/main/unicodetools/data/ucd/dev/ScriptExtensions.txt reads #Po MIDDLE DOT. I see no reason to drop this space when the data field is long while keeping it on lines with shorter data.

@eggrobin
Member

I had noticed this. It is weird, but it is consistent with what we do elsewhere; note, in DerivedNormalizationProps.txt:

FDF1          ; NFKC_CF; 0642 0644 06D2 # Lo       ARABIC LIGATURE QALA USED AS KORANIC STOP SIGN ISOLATED FORM
FDF2          ; NFKC_CF; 0627 0644 0644 0647 #Lo   ARABIC LIGATURE ALLAH ISOLATED FORM
FDF3          ; NFKC_CF; 0627 0643 0628 0631 #Lo   ARABIC LIGATURE AKBAR ISOLATED FORM
FDF4          ; NFKC_CF; 0645 062D 0645 062F #Lo   ARABIC LIGATURE MOHAMMAD ISOLATED FORM
FDF5          ; NFKC_CF; 0635 0644 0639 0645 #Lo   ARABIC LIGATURE SALAM ISOLATED FORM
FDF6          ; NFKC_CF; 0631 0633 0648 0644 #Lo   ARABIC LIGATURE RASOUL ISOLATED FORM
FDF7          ; NFKC_CF; 0639 0644 064A 0647 #Lo   ARABIC LIGATURE ALAYHE ISOLATED FORM
FDF8          ; NFKC_CF; 0648 0633 0644 0645 #Lo   ARABIC LIGATURE WASALLAM ISOLATED FORM
FDF9          ; NFKC_CF; 0635 0644 0649 # Lo       ARABIC LIGATURE SALLA ISOLATED FORM

and it seems to be intentional; see this comment:

// old:
// 0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
// new
// 0009..000D ; White_Space #Cc [5] <control>..<control>
tabber.add((mergeRanges ? 14 : 6) + minSpacesBeforeSemicolon, Tabber.LEFT);

I have no idea what the intention is, though. The commit that added that comment is unicode-org/icu@cd418af; its message is not particularly illuminating, and neither is ICU-6106. @macchiati, do you remember what you were thinking 16 years ago?
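
For anyone who wants to see how widespread this is in a given file, here is a minimal sketch of a checker. It is not part of unicodetools: the class name HashSpacingCheck and both method names are hypothetical, and it assumes that data fields never contain a # themselves. It flags data lines where # is immediately followed by a non-space character and shows what the line would look like with the space restored.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of unicodetools.
class HashSpacingCheck {
    // Group 1: the data fields, starting with a non-'#', non-space character.
    // Then a '#' immediately followed by a non-space character (group 2).
    // Assumption: the data fields themselves never contain '#'.
    private static final Pattern TIGHT_HASH =
            Pattern.compile("^([^#\\s][^#]*)#(\\S)");

    // True if the line is a data line whose comment starts with "#X" rather than "# X".
    static boolean hasTightHash(String line) {
        return TIGHT_HASH.matcher(line).find();
    }

    // Inserts a single space after the '#'; lines that already have one are returned unchanged.
    static String addSpaceAfterHash(String line) {
        return TIGHT_HASH.matcher(line).replaceFirst("$1# $2");
    }

    public static void main(String[] args) {
        // Two of the DerivedNormalizationProps.txt lines quoted above.
        List<String> lines = List.of(
                "FDF1          ; NFKC_CF; 0642 0644 06D2 # Lo       ARABIC LIGATURE QALA USED AS KORANIC STOP SIGN ISOLATED FORM",
                "FDF2          ; NFKC_CF; 0627 0644 0644 0647 #Lo   ARABIC LIGATURE ALLAH ISOLATED FORM");
        for (String line : lines) {
            if (hasTightHash(line)) {
                System.out.println("tight:  " + line);
                System.out.println("spaced: " + addSpaceAfterHash(line));
            }
        }
    }
}
```

Note that this only restores the space; it does not realign the columns that follow the general category, which are also padded differently on the long lines.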

@markusicu
Member

Let's stop doing this.
