Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding #448

Open
sammo3182 opened this issue Jul 24, 2021 · 15 comments

@sammo3182

sammo3182 commented Jul 24, 2021

stri_detect_regex does not seem to recognize Chinese characters correctly when they are used as a regex pattern. I'm using the 1.4.0.9000 dev version on R 4.1.0. Here's an example:

Sys.setlocale(, "Chinese")
library(stringi)

stri_detect_fixed("昌平区", "县") # Works fine
#> [1] FALSE
stri_detect_regex("昌平区", "县") # TRUE
#> [1] TRUE
grepl("县", "昌平区") # FALSE
#> [1] FALSE

Another example:

library(dplyr)
library(rvest)
library(stringi)

link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"

tx_xi <- read_html(link_speech) %>% 
  html_nodes("p") %>%
    html_text

stri_detect_regex(tx_xi, "同志们")  #Note that these are the very first three characters of the speech

#> [1] FALSE
sessionInfo()
#> R Under development (unstable) (2021-05-17 r80314)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#>  [1] LC_COLLATE=Chinese (Simplified)_China.936 
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> system code page: 65001
#>
#> attached base packages:
#>  [1] stats     graphics  grDevices utils     datasets  methods  
#> [7] base     
#>
#> other attached packages:
#>   [1] stringi_1.7.3
#>
#> loaded via a namespace (and not attached):
#>   [1] compiler_4.2.0 tools_4.2.0    parallel_4.2.0

The issue was submitted to stringr (tidyverse/stringr#386 (comment)), but it looks like a stringi problem?

@gagolews
Owner

I cannot reproduce the above; I get:

>  library("stringi")
> stri_detect_regex("昌平区", "县")
[1] FALSE
> stri_detect_fixed("昌平区", "县")
[1] FALSE
> grepl("县", "昌平区") 
[1] FALSE
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.7.3

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0   
> 
  1. What does stri_escape_unicode() return on your platform when run on both strings (pattern, search string)? How about charToRaw()? How about utf8ToInt()?
  2. Can you try with a more recent version of the stringi package?
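
The checks asked for above can be sketched in base R alone. This is only an illustration, not output from the reporter's machine: the string is built from explicit escapes (my assumption, so that the result does not depend on the session locale), and the variable names are hypothetical.

```r
# Build the UTF-8 string from escapes so the script is locale-independent.
x_utf8 <- "\u660c\u5e73\u533a"   # the three characters of the search string

# Bytes of the UTF-8 representation.
charToRaw(x_utf8)

# Code points; this works because the bytes are valid UTF-8.
utf8ToInt(x_utf8)

# The same text as GBK (code page 936) bytes.
x_gbk_raw <- iconv(x_utf8, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
x_gbk_raw

# GBK bytes are not valid UTF-8, so utf8ToInt() yields NA for them.
x_gbk <- rawToChar(x_gbk_raw)
utf8ToInt(x_gbk)
```

If charToRaw() on a string typed at the console shows the GBK bytes rather than the UTF-8 ones, the session is handing stringi GBK data.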

@gagolews
Owner

Also, could you please show me the result of a call to stri_info(FALSE)?

@gagolews
Owner

gagolews commented Jul 24, 2021

With the latter, I get:

stri_detect_regex(tx_xi, "同志们") 
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[52] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
[69] FALSE FALSE
> tx_xi[1]
[1] "在庆祝中国共产党成立100周年大会上的讲话"

@sammo3182
Author

Marek, first, thank you so much for helping me with this!!
One reason you couldn't reproduce my result may be that you didn't switch the locale to Chinese with Sys.setlocale, as I did in the first line of the example. It's important; without it, much of the Chinese output is just returned as hex Unicode or UTF-8 escape codes. (Yihui has talked about this in many places.)

Per your questions, here is what I got:

> stri_escape_unicode("昌平区")
Error in stri_escape_unicode("昌平区") : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode("县")
Error in stri_escape_unicode("县") : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> 
> # According to the error message, I did the following
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> ?stri_enc_toutf8
> # According to the error message, I did the following
> stri_enc_toutf8("昌平区")
[1] "昌平区"
> stri_enc_toutf8("县")
[1] "县"
> 
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode(stri_enc_toutf8("县"))
Error in stri_escape_unicode(stri_enc_toutf8("县")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> 
> 
> charToRaw("昌平区")
[1] b2 fd c6 bd c7 f8
> charToRaw("县")
[1] cf d8
> 
> utf8ToInt("昌平区")
[1] NA
> utf8ToInt("县")
[1] NA

> stri_info(FALSE)
$Unicode.version
[1] "13.0"

$ICU.version
[1] "69.1"

$Locale
$Locale$Language
[1] "en"

$Locale$Country
[1] "US"

$Locale$Variant
[1] ""

$Locale$Name
[1] "en_US"


$Charset.internal
[1] "UTF-8"  "UTF-16"

$Charset.native
$Charset.native$Name.friendly
[1] "UTF-8"

$Charset.native$Name.ICU
[1] "UTF-8"

$Charset.native$Name.UTR22
[1] NA

$Charset.native$Name.IBM
[1] "ibm-1208"

$Charset.native$Name.WINDOWS
[1] "windows-65001"

$Charset.native$Name.JAVA
[1] "UTF-8"

$Charset.native$Name.IANA
[1] "UTF-8"

$Charset.native$Name.MIME
[1] "UTF-8"

$Charset.native$ASCII.subset
[1] TRUE

$Charset.native$Unicode.1to1
[1] NA

$Charset.native$CharSize.8bit
[1] FALSE

$Charset.native$CharSize.min
[1] 1

$Charset.native$CharSize.max
[1] 3


$ICU.system
[1] FALSE

$ICU.UTF8
[1] FALSE

> 

Do the last couple of lines indicate anything?

@sammo3182
Author

Sorry for the confusion. My bad for the miscoding. The problem remains, though. Try this:

library(dplyr)
library(rvest)
library(stringi)

link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"

tx_xi <- read_html(link_speech) %>%
    html_nodes("p") %>%
    html_text()

tx_xi[6]
#> [1] "同志们,朋友们:"
stri_detect_regex(tx_xi[6], "同志们")  # Note that these are the very first three characters of the speech
#> [1] FALSE

@gagolews
Owner

I think the problem is due to:

[2] LC_CTYPE=Chinese (Simplified)_China.936   
...
system code page: 65001

ICU thinks your native encoding is UTF-8, whereas it's probably GBK.

Could you give stri_enc_set("Windows-936") a try?

@sammo3182
Author

My, it works! It looks like the error is indeed attributable to ICU's encoding detection. Once Windows-936 is set, both of the above cases work well! Thank you so much, Marek, for helping me with this issue! I'm not sure whether this only affects Chinese text on a PC, but I bet many text analysts would appreciate knowing about this issue and the solution above!

@gagolews changed the title from "Problem of detecting Chinese characters" to "Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding" on Jul 26, 2021
@gagolews
Owner

Great, I changed the title of the issue so that it's more searchable.

To sum up, the solution was:

stri_enc_set("Windows-936")
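
As a rough sketch of what the setting changes (assuming stringi is installed and that the ICU build allows the default encoding to be changed; the raw GBK bytes stand in for strings typed in a code-page-936 session, and the variable names are mine): once the default encoding is Windows-936, strings that stringi marks as "native" are decoded as GBK before matching. The previous setting is restored at the end, because stri_enc_set affects the whole session.

```r
library(stringi)

# GBK (code page 936) bytes, as reported by charToRaw() in this thread.
gbk_text    <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8)))  # search string
gbk_pattern <- rawToChar(as.raw(c(0xcf, 0xd8)))                          # pattern

old_enc <- stri_enc_get()                       # remember the current default
suppressWarnings(stri_enc_set("Windows-936"))   # treat native strings as GBK

# Both arguments are now decoded from GBK before the regex is run.
hit_gbk <- stri_detect_regex(gbk_text, gbk_pattern)

# The native string converts to proper UTF-8 bytes under this setting.
text_as_utf8 <- charToRaw(stri_enc_toutf8(gbk_text))

suppressWarnings(stri_enc_set(old_enc))         # restore the previous default
```

The call has to be repeated in every new session after the package is loaded.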

@sammo3182
Author

A quick follow-up question: is there any tradeoff in changing the stringi encoding? Or is there a way to let stringi recognize Chinese characters in UTF-8 as UTF-8? The encoding converter does not seem to make any difference at all without stri_enc_set:

# stri_enc_set has not been called
stri_detect_regex(stri_conv("昌平区", to = "UTF8"), stri_conv("县", to = "UTF8")) 
#> [1] TRUE
# The correct outcome should be FALSE, since "县" isn't in "昌平区"

@gagolews
Owner

I get FALSE. I think the problem might as well be on your system side, not just stringi, but it's worth digging into it.

Can you call:

  • charToRaw(stri_conv("昌平区", to = "UTF8"))
  • charToRaw(stri_conv("县", to = "UTF8"))
  • charToRaw("昌平区")
  • charToRaw("县")
  • stri_enc_mark("昌平区")
  • stri_enc_mark("县")

Also, try iconv instead of stri_conv
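
One way to try the iconv route without relying on any guess about the native encoding is to name the source encoding explicitly. A minimal base-R sketch, assuming the input really is GBK (the raw bytes are the ones charToRaw() reports in this thread; the variable names are illustrative):

```r
# GBK bytes standing in for strings entered in a code-page-936 session.
text_gbk    <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8)))  # search string
pattern_gbk <- rawToChar(as.raw(c(0xcf, 0xd8)))                          # pattern

# Naming `from` avoids depending on what R or ICU takes "native" to mean.
text_utf8    <- iconv(text_gbk,    from = "GBK", to = "UTF-8")
pattern_utf8 <- iconv(pattern_gbk, from = "GBK", to = "UTF-8")

Encoding(text_utf8)        # the result is marked as UTF-8
charToRaw(text_utf8)       # e6 98 8c e5 b9 b3 e5 8c ba
charToRaw(pattern_utf8)    # e5 8e bf

# Fixed-string search on the converted, UTF-8-marked strings.
found <- grepl(pattern_utf8, text_utf8, fixed = TRUE)
found
```

Strings converted this way carry a UTF-8 mark, so they should no longer depend on the native-encoding setting when passed on to stringi.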

@gagolews
Owner

Also, maybe the most recent R - UCRT is worth giving a try? https://github.com/r-windows/docs/blob/master/ucrt.md

@sammo3182
Author

iconv works. The PC system is definitely a major part of the cause of this issue. Nevertheless, I guess my situation is representative of the system environment of most R users in China. In that case, either stri_enc_set or iconv would work. Of course, if stringi could offer an argument to do this automatically, that would be great, ha-ha!

Regarding UCRT, it is definitely intriguing, but it looks like it is only about writing packages? I didn't see any instructions showing how to make Windows automatically convert everything to UTF-8 at the input stage. If there is no such way, UCRT won't be that different from manually converting to UTF-8 with iconv, no?

charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] ef bf bd ef bf bd c6 bd ef bf bd ef bf bd
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] ef bf bd ef bf bd
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"

stri_detect_regex(iconv("昌平区", to = "UTF8"), "县") # supposed to be FALSE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), "县") # supposed to be TRUE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), iconv("县", to = "UTF8")) # supposed to be FALSE
#> [1] TRUE

@gagolews
Owner

Hmmm... are these really generated with stri_enc_set("Windows-936") in place? This needs to be called each time the package is loaded.

The byte sequence ef bf bd denotes the replacement character ("unknown") btw
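
To make this concrete, here is a small base-R sketch (the byte values are copied from the charToRaw output in the previous comment; the variable names are mine). Decoding those bytes as UTF-8 shows that the text has collapsed to mostly U+FFFD and the pattern to two U+FFFD, which would explain why a match is reported even though the original characters differ.

```r
# Bytes reported for the converted text and pattern (no stri_enc_set in place).
text_bytes <- as.raw(c(0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd, 0xc6, 0xbd,
                       0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd))
pattern_bytes <- as.raw(c(0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd))

# Decode as UTF-8 code points; 65533 is U+FFFD, the replacement character.
text_cp    <- utf8ToInt(rawToChar(text_bytes))
pattern_cp <- utf8ToInt(rawToChar(pattern_bytes))
text_cp
pattern_cp

# The pattern (two replacement characters) occurs in the mangled text.
mangled_match <- grepl(rawToChar(pattern_bytes), rawToChar(text_bytes),
                       fixed = TRUE, useBytes = TRUE)
mangled_match
```

The only bytes that survive are c6 bd, which happen to form a valid two-byte UTF-8 sequence on their own.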

@gagolews reopened this on Jul 26, 2021
@sammo3182
Author

Oh, I may have misled you! The above outputs were produced without calling stri_enc_set. As I asked in #448 (comment), I was looking for a solution that doesn't require calling stri_enc_set every time. Everything works fine when the encoding is set manually:

library(stringi)
stri_enc_set("Windows-936")
#> New settings: stringi_1.7.3 (en_US.GBK; ICU4C 69.1 [bundle]; Unicode 13.0)
#> Warning message:
#> In stri_info(short = TRUE) :
#>   Your native charset does not map to Unicode well. This may cause serious problems. Consider switching to UTF-8.
charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] e6 98 8c e5 b9 b3 e5 8c ba
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] e5 8e bf
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"

@gagolews
Owner

gagolews commented Jul 26, 2021

:)

Dear all, has anyone working in this locale experienced similar issues?
