Improve documentation for string encodings #16

PolMine · 2020-07-24T13:16:59Z

To get the id for a string that contains non-ASCII characters, the function RcppCWB::cl_str2id() will not yield a result if the encoding of the incoming string differs from the encoding of the corpus. This can happen frequently if you work under a UTF-8 locale (the default on macOS and Linux) and your corpus is latin1-encoded (a common encoding of CWB corpora). See the following examples.

So this fails ...

cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = "über", registry = registry())

To make it work, you need to convert the input string to the encoding of the corpus with iconv().

cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = iconv("über", from = "UTF-8", to = "latin1"), registry = registry())
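To illustrate why the lookup depends on the encoding, here is a minimal base R sketch (no corpus needed, so it does not call RcppCWB at all). It only shows that the same word has a different byte sequence in UTF-8 and in latin1; the assumption is that the lookup compares the raw bytes of the input against the bytes stored in the corpus lexicon.

```r
# "über", written with a Unicode escape and forced to UTF-8 so the
# example behaves the same regardless of the session locale
x_utf8 <- enc2utf8("\u00fcber")

# The same word converted to latin1, as in the working call above
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# In UTF-8 the "ü" takes two bytes, in latin1 only one
bytes_utf8 <- charToRaw(x_utf8)     # c3 bc 62 65 72
bytes_latin1 <- charToRaw(x_latin1) # fc 62 65 72

print(bytes_utf8)
print(bytes_latin1)
```

Both strings print as the same word in the console, which is why the mismatch is easy to overlook.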

It might be worthwhile to let the R wrapper for the C function cl_str2id() check whether the encoding of the incoming string is identical with the encoding of the corpus. However, such a check may entail a performance loss that should be avoided.
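As a rough idea of what such a step could look like on the R side, here is a sketch of a helper that normalizes the input before it is passed on. The function name to_corpus_encoding() and its corpus_charset argument are hypothetical and not part of RcppCWB; the caller is assumed to know the charset of the corpus. Note that this sketch always converts rather than checking first, so it does not address the performance concern.

```r
# Hypothetical helper, not part of RcppCWB: recode a character vector
# to the charset of the corpus before it is used for a lookup.
to_corpus_encoding <- function(str, corpus_charset = "latin1") {
  # enc2utf8() respects the declared encoding of each element
  # (strings without a declared encoding are treated as native)
  utf8 <- enc2utf8(str)
  # Recode from UTF-8 to the corpus charset; elements that cannot be
  # represented in the target charset become NA
  iconv(utf8, from = "UTF-8", to = corpus_charset)
}

from_utf8 <- to_corpus_encoding(enc2utf8("\u00fcber"))
from_latin1 <- to_corpus_encoding(
  iconv(enc2utf8("\u00fcber"), from = "UTF-8", to = "latin1")
)

# Both inputs end up with the same latin1 bytes
print(charToRaw(from_utf8))
print(charToRaw(from_latin1))
```

The result could then be passed as the str argument of cl_str2id(), in place of the explicit iconv() call shown earlier.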

It is much more important that the documentation stresses that the encoding of the string needs to conform to the encoding of the corpus. The documentation at its present stage falls short of making this requirement clear.


ablaette added a commit that referenced this issue Jul 15, 2021

patch cl/lex.creg.c #16

9a921a6

ablaette added a commit that referenced this issue Jul 15, 2021

patch cl/lex.creg.c #16

877cc94

ablaette added a commit that referenced this issue Jul 15, 2021

patch cl/lex.creg.c #16

ca2ee08
