To get the id for a string that contains non-ASCII characters, the function RcppCWB::cl_str2id() will not yield a result if the encoding of the incoming string differs from the encoding of the corpus. This may happen frequently if you work under a UTF-8 locale (the default on macOS and Linux) and your corpus is latin1-encoded (a common encoding of CWB corpora). See the following examples.
To make it work, you need to convert the input string to the encoding of the corpus with iconv().
cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = iconv("über", from = "UTF-8", to = "latin1"), registry = registry())
It might be worthwhile to let the R wrapper for the C function cl_str2id() check whether the encoding of the incoming string is identical with the encoding of the corpus. However, such a check may entail a performance loss, which should be avoided.
It is much more important that the documentation stresses that the encoding of the string needs to conform to the encoding of the corpus. The documentation at its present stage falls short of making this requirement clear.
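For illustration, the conversion step that the wrapper would need can be sketched in base R. This is only a minimal sketch: `to_corpus_encoding()` is a hypothetical helper and not part of RcppCWB, and the corpus encoding is assumed to be known and passed in by the caller rather than looked up from the registry.

```r
# Hypothetical helper, NOT part of RcppCWB: re-encode query strings so that
# their bytes match the encoding of the corpus before they are looked up.
to_corpus_encoding <- function(x, corpus_encoding = "latin1") {
  stopifnot(is.character(x))
  # Normalise the input to UTF-8 first, whatever its declared encoding is.
  x_utf8 <- enc2utf8(x)
  if (toupper(corpus_encoding) %in% c("UTF-8", "UTF8")) return(x_utf8)
  # Convert to the corpus encoding; iconv() returns NA where this is impossible.
  y <- iconv(x_utf8, from = "UTF-8", to = corpus_encoding)
  failed <- is.na(y) & !is.na(x)
  if (any(failed)) warning("some strings cannot be represented in ", corpus_encoding)
  y
}

# "\u00fcber" is "über", written with an escape so the script itself is ASCII.
query <- to_corpus_encoding("\u00fcber", corpus_encoding = "latin1")
charToRaw(query) # fc 62 65 72: one byte for "ü", as latin1 requires
```

The result could then be passed as the `str` argument of cl_str2id(). The conversion runs for every call, which is the performance cost mentioned above, so a wrapper might skip it when the input is pure ASCII or the corpus is UTF-8.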