Improve documentation for string encodings #16

Open
PolMine opened this issue Jul 24, 2020 · 0 comments

To get the id for a string that contains non-ASCII characters, the function RcppCWB::cl_str2id() will not yield a result if the encoding of the incoming string differs from the encoding of the corpus. This can happen frequently if you work under a UTF-8 locale (the default on macOS and Linux) and your corpus is latin1-encoded (a common encoding of CWB corpora). See the following examples.

So this fails ...

cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = "über", registry = registry())

To make it work, you need to convert the input string to the encoding of the corpus using iconv().

cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = iconv("über", from = "UTF-8", to = "latin1"), registry = registry())
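Why the conversion matters can be seen at the byte level, without a corpus at hand. The following is a minimal sketch in base R only. It assumes the string is built with a \u escape so that it is UTF-8 regardless of the session locale; the variable names are illustrative.

```r
# Build "über" with a Unicode escape so the string is UTF-8 in any locale
s_utf8 <- paste0("\u00fc", "ber")

# Convert to latin1, as done for the cl_str2id() call above
s_latin1 <- iconv(s_utf8, from = "UTF-8", to = "latin1")

bytes_utf8 <- charToRaw(s_utf8)     # "ü" is stored as two bytes (c3 bc)
bytes_latin1 <- charToRaw(s_latin1) # "ü" is stored as one byte (fc)

print(bytes_utf8)
print(bytes_latin1)
print(Encoding(c(s_utf8, s_latin1)))
```

The two strings print identically but differ in their bytes, which is presumably why a lookup with UTF-8 input cannot match an entry in a latin1-encoded corpus.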

It might be worthwhile to let the R wrapper for the C function cl_str2id() check whether the encoding of the incoming string is identical to the encoding of the corpus. However, such a check may entail a performance loss that should be avoided.
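Until such a check exists, the conversion could be done on the user side. The following is only a sketch: to_corpus_encoding() is a hypothetical helper, not part of RcppCWB, and it assumes the caller already knows the corpus encoding and passes it in as corpus_encoding.

```r
# Hypothetical helper (not part of RcppCWB): re-encode strings so that their
# bytes match the encoding of the corpus. The corpus encoding is an assumption
# supplied by the caller; it is not looked up from the registry here.
to_corpus_encoding <- function(str, corpus_encoding = "latin1") {
  # enc2utf8() honours the declared encoding of each string (latin1, UTF-8 or
  # native), so the input is normalised to UTF-8 before the final conversion
  iconv(enc2utf8(str), from = "UTF-8", to = corpus_encoding)
}

s_utf8 <- paste0("\u00fc", "ber")      # "über", UTF-8 in any locale
from_utf8 <- to_corpus_encoding(s_utf8)
from_latin1 <- to_corpus_encoding(from_utf8) # already latin1: bytes unchanged
from_ascii <- to_corpus_encoding("Haus")     # pure ASCII passes through
```

The result could then be passed as the str argument of cl_str2id(). The extra conversion costs some time on every call, which is the performance concern raised above.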

It is much more important that the documentation stresses that the encoding of the string needs to conform to the encoding of the corpus. The documentation at its present stage falls short of making this requirement clear.

ablaette added a commit that referenced this issue Jul 15, 2021