Check for Unicode whitespace chars when doing text parsing #2312

Open
3 tasks
ivan-mogilko opened this issue Jan 24, 2024 · 3 comments
Labels
context: unicode type: enhancement a suggestion or necessity to have something improved

Comments

@ivan-mogilko
Contributor

ivan-mogilko commented Jan 24, 2024

As part of improving the engine's Unicode support, there are at least 2 cases which should probably check for Unicode "whitespace" characters.
These are:

This is not high priority, but may be useful to do.

I am not an expert in Unicode, but I came across the following list of Unicode character categories that should probably be considered:
Separator, Line
Separator, Paragraph
Separator, Space

EDIT: Another quick reference, documentation for std::iswspace:
https://en.cppreference.com/w/cpp/string/wide/iswspace
has a list of POSIX space chars under NOTES.

Allegro 4 sources had this function for unicode "space" detection:

/* uisspace:
 *  Unicode-aware version of the ANSI isspace() function.
 */
int uisspace(int c)
{
   return ((c == ' ') || (c == '\t') || (c == '\r') ||
           (c == '\n') || (c == '\f') || (c == '\v') ||
           (c == 0x1680) || ((c >= 0x2000) && (c <= 0x200A)) ||
           (c == 0x2028) || (c == 0x202f) || (c == 0x3000));
}

It could either be expanded to cover the necessary chars, or we could write our own function for this.
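
For illustration, here is a rough sketch of what an own function based on the three separator categories listed earlier in this comment might look like. The function names are hypothetical (they are not from Allegro or the engine), and the code point list is my understanding of the Zs, Zl and Zp categories in recent Unicode versions, so it should be checked against the Unicode Character Database before use. It assumes the text has already been decoded into code points.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of Allegro or the engine.
 * Returns true if the code point belongs to one of the Unicode
 * general categories Zs (space), Zl (line) or Zp (paragraph).
 * The list is an assumption to be verified against the UCD. */
static bool is_unicode_separator(uint32_t c)
{
    return (c == 0x0020) ||                 /* Zs: SPACE */
           (c == 0x00A0) ||                 /* Zs: NO-BREAK SPACE */
           (c == 0x1680) ||                 /* Zs: OGHAM SPACE MARK */
           (c >= 0x2000 && c <= 0x200A) ||  /* Zs: EN QUAD .. HAIR SPACE */
           (c == 0x2028) ||                 /* Zl: LINE SEPARATOR */
           (c == 0x2029) ||                 /* Zp: PARAGRAPH SEPARATOR */
           (c == 0x202F) ||                 /* Zs: NARROW NO-BREAK SPACE */
           (c == 0x205F) ||                 /* Zs: MEDIUM MATHEMATICAL SPACE */
           (c == 0x3000);                   /* Zs: IDEOGRAPHIC SPACE */
}

/* Tab, line feed, vertical tab, form feed and carriage return are
 * control characters (category Cc), so they need a separate check. */
static bool is_ascii_control_space(uint32_t c)
{
    return c >= 0x0009 && c <= 0x000D;
}

/* Combined predicate: ASCII control whitespace plus the separators. */
static bool is_space_by_category(uint32_t c)
{
    return is_ascii_control_space(c) || is_unicode_separator(c);
}
```

Note that a purely category-based check also accepts the non-breaking spaces (U+00A0, U+202F), which may or may not be wanted depending on whether the caller is trimming or looking for wrap points.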

@ivan-mogilko ivan-mogilko added the type: enhancement a suggestion or necessity to have something improved label Jan 24, 2024
@ericoporto
Member

ericoporto commented May 1, 2024

I know this is low priority, but I decided to check and noticed that, according to the link, 0x202F is a no-break space (NARROW NO-BREAK SPACE), so it makes sense for it to be cleared out in a trim call (I guess that is the usage for the ini format parsing), but I guess it should not be a valid wrap point, assuming that is what no-break means here.

Maybe isspace could be copied to an isbreak that would be exactly the same except for that character.

@messengerbag

messengerbag commented May 1, 2024

A quick web search indicates that it's not standard for trim to remove non-breaking spaces (00A0, 202F, 2007)—at least in C, C++, Java, SQL or Excel. The rationale appears to be that trim removes whitespace, while "non-breaking" is understood to mean that the space is not considered whitespace. (And indeed, non-breaking spaces do not line wrap, by definition.)

It would seem desirable to stay consistent with standard libraries and conventions.
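
As a sketch of how that convention could look in code: the snippet below takes the Allegro uisspace() set quoted in the issue, removes the three non-breaking spaces named above, and uses the result for trimming. All function names are hypothetical, and the input is assumed to be an array of already decoded code points (UTF-8 decoding is out of scope here).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not from Allegro or the engine. */

/* The non-breaking spaces named in the comment above. */
static bool is_no_break_space(uint32_t c)
{
    return c == 0x00A0 || c == 0x2007 || c == 0x202F;
}

/* Allegro's uisspace() set with the non-breaking spaces taken out,
 * so one predicate can serve both trimming and line wrapping. */
static bool is_breaking_space(uint32_t c)
{
    if (is_no_break_space(c))
        return false;
    return (c == ' ') || (c == '\t') || (c == '\r') ||
           (c == '\n') || (c == '\f') || (c == '\v') ||
           (c == 0x1680) || (c >= 0x2000 && c <= 0x200A) ||
           (c == 0x2028) || (c == 0x3000);
}

/* Narrows the half-open range [*begin, *end) over an array of code
 * points so that it excludes leading and trailing breaking whitespace.
 * Non-breaking spaces at either end are kept. */
static void trim_code_points(const uint32_t *text, size_t *begin, size_t *end)
{
    while (*begin < *end && is_breaking_space(text[*begin]))
        ++*begin;
    while (*end > *begin && is_breaking_space(text[*end - 1]))
        --*end;
}
```

With this split, a trailing EM SPACE (U+2003) would be trimmed, while a leading or trailing U+00A0 or U+202F would survive, matching the behaviour described for the standard libraries.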

@ivan-mogilko
Contributor Author

ivan-mogilko commented May 1, 2024

Another quick reference, documentation for std::iswspace (widechar variant of std::isspace):
https://en.cppreference.com/w/cpp/string/wide/iswspace

has the following remark:

ISO 30112 defines POSIX space characters as Unicode characters U+0009..U+000D, U+0020, U+1680, U+180E, U+2000..U+2006, U+2008..U+200A, U+2028, U+2029, U+205F, and U+3000.

Comparing this list with Allegro's existing uisspace implementation, it does look like they have an extra char there (0x202F).
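
To make the comparison concrete, here is a minimal sketch that transcribes the ISO 30112 list from the remark above into a predicate and sets it next to Allegro's uisspace() as quoted in the issue. The names is_iso30112_space and space_sets_differ are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; ranges copied from the cppreference remark above. */
static bool is_iso30112_space(uint32_t c)
{
    return (c >= 0x0009 && c <= 0x000D) || (c == 0x0020) ||
           (c == 0x1680) || (c == 0x180E) ||
           (c >= 0x2000 && c <= 0x2006) ||
           (c >= 0x2008 && c <= 0x200A) ||
           (c == 0x2028) || (c == 0x2029) ||
           (c == 0x205F) || (c == 0x3000);
}

/* Allegro 4's uisspace() as quoted earlier in this issue. */
static int uisspace(int c)
{
    return ((c == ' ') || (c == '\t') || (c == '\r') ||
            (c == '\n') || (c == '\f') || (c == '\v') ||
            (c == 0x1680) || ((c >= 0x2000) && (c <= 0x200A)) ||
            (c == 0x2028) || (c == 0x202f) || (c == 0x3000));
}

/* Returns true when the two predicates disagree on code point c. */
static bool space_sets_differ(uint32_t c)
{
    return is_iso30112_space(c) != (uisspace((int)c) != 0);
}
```

Going by the two lists as quoted, the predicates disagree on five code points: Allegro additionally accepts U+2007 and U+202F (both non-breaking), while the ISO 30112 list additionally accepts U+180E, U+2029 and U+205F.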
