Check for Unicode whitespace chars when doing text parsing #2312

ivan-mogilko · 2024-01-24T20:02:39Z

As part of improving the engine's Unicode support, there are at least 3 places that may need to check for Unicode "whitespace" characters.
These are:

Splitting text into lines for text wrapping;
INI format parsing;
The String.Trim function, now in ags4 (see ags4: add string trim (remove trailing and leading whitespace in string) #2397).

This is not high priority, but may be useful to do.

I am not an expert in Unicode, but I came across the following list of Unicode character categories that should probably be considered:
Separator, Line
Separator, Paragraph
Separator, Space
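
For reference, these three categories (Zl, Zp and Zs in Unicode's general category scheme) are small enough to enumerate by hand. Below is a minimal sketch in C, assuming the code point assignments of recent Unicode versions. The function name `is_unicode_separator` is hypothetical and is not part of the engine or Allegro.

```c
#include <stdbool.h>

/* Hypothetical helper: true if c is in general category Zs, Zl or Zp.
 * Assumes recent Unicode data (U+180E is no longer in Zs there). */
bool is_unicode_separator(int c)
{
    return (c == 0x0020) ||                  /* SPACE (Zs) */
           (c == 0x00A0) ||                  /* NO-BREAK SPACE (Zs) */
           (c == 0x1680) ||                  /* OGHAM SPACE MARK (Zs) */
           (c >= 0x2000 && c <= 0x200A) ||   /* EN QUAD..HAIR SPACE (Zs) */
           (c == 0x2028) ||                  /* LINE SEPARATOR (Zl) */
           (c == 0x2029) ||                  /* PARAGRAPH SEPARATOR (Zp) */
           (c == 0x202F) ||                  /* NARROW NO-BREAK SPACE (Zs) */
           (c == 0x205F) ||                  /* MEDIUM MATHEMATICAL SPACE (Zs) */
           (c == 0x3000);                    /* IDEOGRAPHIC SPACE (Zs) */
}
```

Note that these categories do not cover ASCII control whitespace such as tab or line feed, and that they include the no-break spaces, so this set alone is probably not what text wrapping or trimming wants.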

EDIT: Another quick reference is the documentation for std::iswspace:
https://en.cppreference.com/w/cpp/string/wide/iswspace
It has a list of POSIX space characters under Notes.

Allegro 4 sources had this function for unicode "space" detection:

ags/libsrc/allegro/src/unicode.c

Lines 1716 to 1725 in 59e1a5a

    
    /* uisspace:
     *  Unicode-aware version of the ANSI isspace() function.
     */
    int uisspace(int c)
    {
       return ((c == ' ') || (c == '\t') || (c == '\r') ||
               (c == '\n') || (c == '\f') || (c == '\v') ||
               (c == 0x1680) || ((c >= 0x2000) && (c <= 0x200A)) ||
               (c == 0x2028) || (c == 0x202f) || (c == 0x3000));
    }

It may either be expanded to cover the necessary characters, or we could write our own function for this.

ericoporto · 2024-05-01T01:24:26Z

I know this is low priority, but I decided to check and noticed that, according to the link, 0x202f is NARROW NO-BREAK SPACE. It makes sense for it to be removed in a trim call (which I guess is also what the INI format parsing needs), but I guess it should not be a valid wrap point, assuming that is what "no-break" means here.

Maybe isspace could be copied into an isbreak function that would be exactly the same except for that character.
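
The split suggested here could look roughly like the sketch below. Both function names are hypothetical. `is_space_allegro` simply reproduces the code point list from the Allegro 4 `uisspace` quoted earlier in this thread so that the example is self-contained, and `is_break_space` is the same test minus 0x202F.

```c
#include <stdbool.h>

/* Same code point list as Allegro 4's uisspace(), copied for illustration. */
bool is_space_allegro(int c)
{
    return (c == ' ') || (c == '\t') || (c == '\r') ||
           (c == '\n') || (c == '\f') || (c == '\v') ||
           (c == 0x1680) || (c >= 0x2000 && c <= 0x200A) ||
           (c == 0x2028) || (c == 0x202F) || (c == 0x3000);
}

/* Hypothetical: a space at which a line may be wrapped.
 * Only 0x202F is excluded, following the suggestion above. */
bool is_break_space(int c)
{
    return is_space_allegro(c) && c != 0x202F;
}
```

With this arrangement a trim routine would call the first predicate and the line wrapper the second, so the two behaviours could be tuned independently.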

messengerbag · 2024-05-01T05:56:32Z

A quick web search indicates that it's not standard for trim to remove non-breaking spaces (00A0, 202F, 2007)—at least in C, C++, Java, SQL or Excel. The rationale appears to be that trim removes whitespace, while "non-breaking" is understood to mean that the space is not considered whitespace. (And indeed, non-breaking spaces do not line wrap, by definition.)

It would seem desirable to stay consistent with standard libraries and conventions.

ivan-mogilko · 2024-05-01T07:56:19Z

Another quick reference, documentation for std::iswspace (widechar variant of std::isspace):
https://en.cppreference.com/w/cpp/string/wide/iswspace

has the following remark:

ISO 30112 defines POSIX space characters as Unicode characters U+0009..U+000D, U+0020, U+1680, U+180E, U+2000..U+2006, U+2008..U+200A, U+2028, U+2029, U+205F, and U+3000.

Comparing this list with Allegro's existing uisspace implementation, it does indeed look like they have an extra character there (0x202F).
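
As an illustration of the "our own function" option, the ISO 30112 list quoted above translates almost directly into a predicate. This is only a sketch: the name `is_posix_space` is hypothetical, and it assumes the caller passes a decoded Unicode code point rather than a raw UTF-8 byte.

```c
#include <stdbool.h>

/* Hypothetical: true for the POSIX space characters listed in the
 * cppreference note on std::iswspace (ISO 30112). */
bool is_posix_space(int c)
{
    return (c >= 0x0009 && c <= 0x000D) ||   /* \t \n \v \f \r */
           (c == 0x0020) ||
           (c == 0x1680) ||
           (c == 0x180E) ||
           (c >= 0x2000 && c <= 0x2006) ||
           (c >= 0x2008 && c <= 0x200A) ||   /* skips U+2007 */
           (c == 0x2028) ||
           (c == 0x2029) ||
           (c == 0x205F) ||
           (c == 0x3000);
}
```

Compared code point by code point with the Allegro 4 list, this version drops 0x2007 and 0x202F and adds 0x180E, 0x2029 and 0x205F. None of the three non-breaking spaces (00A0, 202F, 2007) is accepted.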

ivan-mogilko added the context: unicode label Jan 24, 2024

ivan-mogilko added this to the 3.6.2 (performance update) milestone Jan 24, 2024

ivan-mogilko added the type: enhancement a suggestion or necessity to have something improved label Jan 24, 2024
