Bookmark \@ref fails to work for multibyte strings #37

Open
madlogos opened this issue Aug 27, 2020 · 7 comments
Labels
bug Something isn't working

Comments

@madlogos

madlogos commented Aug 27, 2020

Suppose I have an .Rmd file like the one below:

---
title: "Untitled"
output:
  officedown::rdocx_document:
    default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Chapter1 {#ch1}

# Chapter2 {#ch2}

Refer to \@ref(ch1).

When \@ref(ch1) is surrounded by multibyte strings (e.g., Chinese characters), rendering can fail or produce incorrect output.

  • Pure multibyte + ref

    • Example: 上下\@ref(ch1)
    • Result: correct
  • Mixed multibyte/singlebyte + ref

    • Example: 上a下\@ref(ch1)
    • Result: incorrect (上a下@ref(ch1))
  • ref + multibyte

    • Example: \@ref(ch1)。

    • Result: compile failed

      Error in nchar(u, itype) : invalid multibyte string, element 1
      Calls: ... regmatches<- -> regmatches -> Map -> mapply ->

Can you please look into this issue? Thanks.

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 20180)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] officer_0.3.12 officedown_0.2.0 flextable_0.5.10
[4] ggplot2_3.3.2 tidyr_1.1.1 knitr_1.29
[7] dplyr_1.0.2 reticulate_1.16

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 prettyunits_1.1.1
[4] sysfonts_0.8.1 ps_1.3.4 utf8_1.1.4
[7] rprojroot_1.3-2 assertthat_0.2.1 digest_0.6.25
[10] R6_2.4.1 backports_1.1.9 evaluate_0.14
[13] pillar_1.4.6 gdtools_0.2.2 rlang_0.4.7
[16] curl_4.3 uuid_0.1-4 data.table_1.13.0
[19] callr_3.4.3 Matrix_1.2-18 rmarkdown_2.3
[22] desc_1.2.0 labeling_0.3 devtools_2.3.1
[25] stringr_1.4.0 munsell_0.5.0 tinytex_0.25
[28] compiler_4.0.2 xfun_0.16 pkgconfig_2.0.3
[31] systemfonts_0.2.3 base64enc_0.1-3 pkgbuild_1.1.0
[34] rvg_0.2.5 htmltools_0.5.0 tidyselect_1.1.0
[37] tibble_3.0.3 bookdown_0.20 fansi_0.4.1
[40] crayon_1.3.4 showtextdb_3.0 withr_2.2.0
[43] grid_4.0.2 jsonlite_1.7.0 gtable_0.3.0
[46] lifecycle_0.2.0 magrittr_1.5 scales_1.1.1
[49] zip_2.1.0 cli_2.0.2 stringi_1.4.6
[52] farver_2.0.3 fs_1.5.0 remotes_2.2.0
[55] testthat_2.3.2 xml2_1.3.2 ellipsis_0.3.1
[58] generics_0.0.2 vctrs_0.3.2 tools_4.0.2
[61] showtext_0.9 glue_1.4.1 purrr_0.3.4
[64] processx_3.4.3 pkgload_1.1.0 yaml_2.2.1
[67] colorspace_1.4-1 sessioninfo_1.1.1 memoise_1.1.0
[70] usethis_1.6.1

@davidgohel
Owner


---
title: "Untitled"
output:
  officedown::rdocx_document:
    default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Chapter1 {#ch1}

# Chapter2 {#ch2}

Refer to \@ref(ch1).

When \@ref(ch1) is surrounded by multibyte strings (e.g., Chinese characters), it would possibly encounter errors.

- Pure multibyte + ref: 上下\@ref(ch1)
- Mixed multibyte/singlebyte + ref: 上a下\@ref(ch1)
- ref + multibyte: \@ref(ch1)。

Your issue is related to the fact you are not working with a UTF-8 encoded file.

R, R Markdown and Windows do not work well together when the encoding is not UTF-8.

[Screenshot, 2020-08-27 at 10:53:27]

Untitled.docx

@madlogos
Author

Yes, @davidgohel, you are right. Although the .Rmd file is in UTF-8, the OS is running with GBK encoding. When I switch to bookdown::word_document2, knitr manages to compile the file, but I still get ?? where the bookmark is supposed to appear.
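
For anyone who wants to confirm which native encoding their R session is using, a minimal sketch with base R only could look like this. The message text is illustrative, and the exact locale strings printed depend on the machine.

```r
# Inspect the native encoding of the current R session (base R only).
info  <- l10n_info()                # list with MBCS, UTF-8 and Latin-1 flags
ctype <- Sys.getlocale("LC_CTYPE")  # e.g. the code page on Windows

cat("LC_CTYPE:", ctype, "\n")
cat("Multibyte locale:", info$MBCS, "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")

if (!isTRUE(info[["UTF-8"]])) {
  # Illustrative hint only: in a non-UTF-8 locale, strings read without an
  # explicit encoding are interpreted in the native encoding.
  message("The session is not running in a UTF-8 locale.")
}
```

On a system like the one in the sessionInfo() output above (LC_CTYPE set to code page 936), the UTF-8 flag would presumably be FALSE, which matches the GBK observation.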

@davidgohel
Owner

You don't need to try new output format functions.

The result I showed was produced on Windows with a French locale, but I made sure the file was encoded as UTF-8. I check this with readr::guess_encoding(); if the file is not UTF-8 encoded, I can convert it to UTF-8 with fpeek::peek_iconv().

Could you show the result of

readr::guess_encoding("your/rmd/file")
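
If readr or fpeek is not installed, the same check and conversion can be approximated with base R. This is only a sketch: `is_utf8_file()` and `convert_to_utf8()` are hypothetical helper names, not functions from any package, and the source encoding passed as `from` (for example "GBK") is something you have to know or guess yourself.

```r
# Hypothetical helper: TRUE if the file's bytes form valid UTF-8.
is_utf8_file <- function(path) {
  raw_bytes <- readBin(path, what = "raw", n = file.size(path))
  validUTF8(rawToChar(raw_bytes))
}

# Hypothetical helper: re-encode a text file from `from` to UTF-8.
# By default the file is overwritten in place; pass `out` to keep the original.
convert_to_utf8 <- function(path, from, out = path) {
  raw_bytes <- readBin(path, what = "raw", n = file.size(path))
  text <- iconv(rawToChar(raw_bytes), from = from, to = "UTF-8")
  if (is.na(text)) {
    stop("Could not convert '", path, "' from ", from, " to UTF-8")
  }
  writeBin(charToRaw(text), out)
  invisible(out)
}

# Demo on a temporary file containing "café" encoded as Latin-1.
tmp <- tempfile(fileext = ".Rmd")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9)), tmp)
before <- is_utf8_file(tmp)
convert_to_utf8(tmp, from = "latin1")
after <- is_utf8_file(tmp)
```

For a real document the call would be along the lines of `convert_to_utf8("your/rmd/file", from = "GBK")`, assuming the file really is GBK encoded.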

@madlogos
Author

madlogos commented Aug 29, 2020

The results are

no  encoding      confidence
1   UTF-8         1
2   windows-1252  0.28

@bishun945

Hi @madlogos,

I am also a Chinese user, and the multibyte problem has bothered me for a long time too. Here is my workaround:

  1. Write \@ref as usual;
  2. Save the Rmd file and read it with readr::read_lines;
  3. Find the strings that match the "\\\\@ref\\([^\\)]+\\)" pattern;
  4. Split each of them so that every "\\\\@ref\\([^\\)]+\\)" match sits on a line of its own;
  5. Save the character vector to a new Rmd file and render it with the format you like. Done!

For example, 请参考表\@ref(tab: coco)中的数据 should be split as
[line 1] 请参考表
[line 2] \@ref(tab: coco)
[line 3] 中的数据

Well, I am not sure if this is an effective solution but it works for me. 😄
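
The steps above can be sketched in base R roughly as follows. `split_ref_lines()` is a hypothetical helper name, and this is only one possible reading of the workaround: it splits every matching line, including lines inside code chunks, so it may need adjusting for real documents.

```r
# Hypothetical helper: put every \@ref(...) on a line of its own.
# Lines without a reference (including blank lines) are returned unchanged.
split_ref_lines <- function(lines) {
  pattern <- "(\\\\@ref\\([^)]+\\))"
  unlist(lapply(lines, function(line) {
    if (!grepl(pattern, line, perl = TRUE)) {
      return(line)
    }
    # Surround each match with newlines, then split on them.
    marked <- gsub(pattern, "\n\\1\n", line, perl = TRUE)
    pieces <- strsplit(marked, "\n", fixed = TRUE)[[1]]
    pieces[nzchar(pieces)]  # drop empty pieces at the start or end
  }), use.names = FALSE)
}

# The example from this comment, plus a blank line and a plain line.
input  <- c("请参考表\\@ref(tab: coco)中的数据", "", "No reference here.")
result <- split_ref_lines(input)
print(result)
```

To apply it to a file, the lines could be read with `readLines(path, encoding = "UTF-8")` and written back with `writeLines(result, new_path, useBytes = TRUE)`, where `path` and `new_path` are placeholders for your own file names.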

@madlogos
Author

@bishun945 thank you for the workaround. Good stuff.

@bishun945

@madlogos I have tried another solution: just switch your system and MS Word language to English.

davidgohel added the bug label on Feb 4, 2021