str_split not splitting correctly on Unicode character #542

alexanderbeatson · 2024-03-29T06:24:50Z

I am trying to split a Burmese Unicode string with stringr::str_split(), but it does not return the values I expect.

str_split("စမ်းသပ်မှု", "")[[1]]

it returns:

[1] "စ" "မ်" "း" "သ" "ပ်" "မှု"

If I use the built-in strsplit, strsplit("စမ်းသပ်မှု", "")[[1]], it returns a character-level split:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

I found that str_split treats the empty string "" as a regex, but what stringr::str_split() returns is neither a character-level nor a syllable-level split. A syllable-level split would be:

[1] "စမ်း" "သပ်" "မှု"

So I don't think this is actually a feature, as was the case in Issue:88.

For further study, could someone point me to where this splitting behavior comes from? I found that other services, such as Google, also use this incorrect splitting. TIA.
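
For context, the stringr documentation states that an empty pattern is equivalent to boundary("character"), which breaks text at the character (grapheme cluster) boundaries defined by ICU rather than at individual code points. A minimal sketch comparing the two calls, assuming stringr is installed; the variable names are illustrative, and no output is shown because the exact clusters can depend on the ICU version behind stringi:

```r
library(stringr)

x <- "စမ်းသပ်မှု"

# Per the stringr docs, "" is shorthand for boundary("character").
by_empty    <- str_split(x, "")[[1]]
by_boundary <- str_split(x, boundary("character"))[[1]]

# Check that both calls produce the same pieces.
print(identical(by_empty, by_boundary))

# Number of code points in each piece; values above 1 indicate a
# base letter grouped together with following combining marks.
print(nchar(by_empty, type = "chars"))
```

If this holds, the grouping seen above would come from ICU's boundary rules rather than from regex matching.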

gagolews · 2024-04-02T13:08:01Z

... and what would be the correct result?

alexanderbeatson · 2024-04-04T06:09:55Z

The correct result should be:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"
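
If a code-point-level split like the one above is what is needed, one option besides strsplit is base R's utf8ToInt() and intToUtf8() pair. A minimal sketch, assuming the input is a single valid UTF-8 string; the variable names are illustrative:

```r
x <- "စမ်းသပ်မှု"

# utf8ToInt() returns one integer per Unicode code point;
# intToUtf8(..., multiple = TRUE) converts each integer back
# into its own one-code-point string.
code_points <- intToUtf8(utf8ToInt(x), multiple = TRUE)

print(code_points)
print(length(code_points))  # 10 for this string
```

This never consults grapheme cluster rules, so combining marks such as "်" always come out as separate elements.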
