Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) #415

Open
jjesusfilho opened this issue Mar 26, 2021 · 13 comments

Comments

@jjesusfilho
Copy link

jjesusfilho commented Mar 26, 2021

According to the Unicode Technical Standard #35, when case Level is set to On and strength set to primary, ICU should ignore accents but take case into account. This is not the behavior I find in stringi package. To achieve that, I have to set locale = "", which is not what I expected. Examples:

This is ok:

stringi::stri_detect_coll("Mário", "mario", strength = 1L)
[1] TRUE

This should return FALSE:

stringi::stri_detect_coll("Mário", "mario", strength = 1L,case_level = TRUE)
[1] TRUE

In order to achieve that I need to set locale to "" on stringi:

stringi::stri_detect_coll("Mário", "mario", strength = 1L, locale = "")
[1] FALSE

Here I respect case but don't add accent and get the expected result:

stringi::stri_detect_coll("Mário", "Mario", strength = 1L, locale = "")
[1] TRUE
@gagolews
Copy link
Owner

Could you please paste the output of a call to stringi::stri_info() here? Thank you.

@gagolews
Copy link
Owner

gagolews commented Mar 28, 2021

I wonder if strength=2 is what you might need:

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR")
## [1]  TRUE  TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE)
## [1]  TRUE  TRUE FALSE FALSE

@jjesusfilho
Copy link
Author

Could you please paste the output of a call to stringi::stri_info() here? Thank you.

$Unicode.version
[1] "10.0"

$ICU.version
[1] "61.1"

$Locale
$Locale$Language
[1] "pt"

$Locale$Country
[1] "BR"

$Locale$Variant
[1] ""

$Locale$Name
[1] "pt_BR"

$Charset.internal
[1] "UTF-8" "UTF-16"

$Charset.native
$Charset.native$Name.friendly
[1] "UTF-8"

$Charset.native$Name.ICU
[1] "UTF-8"

$Charset.native$Name.UTR22
[1] NA

$Charset.native$Name.IBM
[1] "ibm-1208"

$Charset.native$Name.WINDOWS
[1] "windows-65001"

$Charset.native$Name.JAVA
[1] "UTF-8"

$Charset.native$Name.IANA
[1] "UTF-8"

$Charset.native$Name.MIME
[1] "UTF-8"

$Charset.native$ASCII.subset
[1] TRUE

$Charset.native$Unicode.1to1
[1] NA

$Charset.native$CharSize.8bit
[1] FALSE

$Charset.native$CharSize.min
[1] 1

$Charset.native$CharSize.max
[1] 3

$ICU.system
[1] FALSE

$ICU.UTF8
[1] FALSE

@jjesusfilho
Copy link
Author

jjesusfilho commented Apr 2, 2021

I wonder if strength=2 is what you might need:

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR")
## [1]  TRUE  TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE)
## [1]  TRUE  TRUE FALSE FALSE

Actually, what I want is to ignore accents, but not case. According to the ICU documentation mentioned above, I should achieve that by setting strength to level 1 and case level to On. I also work with PostgreSQL and it works accordingly, but with stringi, I can only get the same result by setting locale to '"", regardless of case_level's setting.

stringi::stri_detect_coll(c("Mário","mário"), "mario", strength=1L,  locale = '')
[1] FALSE  TRUE

@gagolews
Copy link
Owner

gagolews commented Apr 7, 2021

First of all, thanks, there was a bug; locale="" should mean locale=NULL, i.e., your own locale, pt_BR.

gagolews added a commit that referenced this issue Apr 7, 2021
@gagolews
Copy link
Owner

gagolews commented Apr 7, 2021

instead, locale='' meant locale="POSIX" - that is why it worked as expected (and perhaps this is what postgresql uses, hence the correct results). I would recommend setting locale="POSIX" explicitly then.

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 1L, locale="POSIX")
## [1] FALSE  TRUE FALSE  TRUE

@jjesusfilho
Copy link
Author

jjesusfilho commented Apr 14, 2021

Yes, but I still believe that it can get a bit confusing because, in order to ignore accents but not case, in stringi, you have to set the locale, but not the case_level, when this is not the way PostgreSQL works and probably other languages, i.e, you must have the case level turned on along with the primary strength. This is also what the ICU documentation says:

"The case level is independent from the strength of comparison. It is possible to have a collator set to 
 primary strength with the case level turned on. This provides for comparison that takes into account the 
 case differences, while at the same time ignoring accents and tertiary differences other than case"

@gagolews
Copy link
Owner

Hmmm.... interestingly, a collator-based string comparison honours the above rule...

> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=2L)
[1] FALSE  TRUE FALSE FALSE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
[1] FALSE  TRUE FALSE  TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=1L)
[1] TRUE TRUE TRUE TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=2L)
[1]  TRUE  TRUE FALSE FALSE

@gagolews
Copy link
Owner

I was trying hard to figure out why usearch returns a different result below, but with no success. A bug in ICU?

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] TRUE TRUE TRUE TRUE
stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] FALSE  TRUE FALSE  TRUE

@gagolews
Copy link
Owner

gagolews commented May 4, 2021

(note to self): ICU 69.1 gives the results as above. @todo: create a minimal reproducible example outside of stringi

@gagolews gagolews added the bug label May 5, 2021
@gagolews gagolews added this to the stringi-1.7 milestone May 5, 2021
@gagolews
Copy link
Owner

[note to self]

Yes, this is reproducible outside of stringi:

/* g++ -std=c++11 icu_test_bug_ucol_caselevel.cpp -licui18n -licuuc -licudata  && ./a.out */

#include <unicode/ustring.h>
#include <unicode/rbbi.h>
#include <unicode/coll.h>
#include <unicode/ucol.h>
#include <unicode/stsearch.h>
#include <cstdio>
using namespace icu;


#define WITH_CHECK_STATUS(f) \
    status = U_ZERO_ERROR; \
    f; \
    if (U_FAILURE(status)) {printf("error %s!\n", u_errorName(status));return 1;}


int main()
{
    const char* haystacks[] = {
        "mario", "Mario", "MARIO", "MÁRIO", "Mário", "mário", "dario"
    };

    const char* needle = "mario";


    UErrorCode status;
    UCollator* col;
    UStringSearch* matcher;
    int v;

    printf("U_ICU_VERSION=%s\n", U_ICU_VERSION);

    WITH_CHECK_STATUS(col = ucol_open("pt_BR", &status))

    WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_STRENGTH, UCOL_PRIMARY, &status))
    WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_CASE_LEVEL, UCOL_ON, &status))

    UnicodeString _needle = UnicodeString::fromUTF8(needle);
    UnicodeString haystack = UnicodeString::fromUTF8("whatever");

    WITH_CHECK_STATUS(matcher = usearch_openFromCollator(
        _needle.getBuffer(),
        _needle.length(),
        haystack.getBuffer(),
        haystack.length(),
        col, NULL, &status
    ))

    for (int i=0; i<sizeof(haystacks)/sizeof(haystacks[0]); ++i) {
        printf("%s vs. %s: ", needle, haystacks[i]);
        haystack = UnicodeString::fromUTF8(haystacks[i]);

        WITH_CHECK_STATUS(
            usearch_setText(matcher, haystack.getBuffer(), haystack.length(), &status)
        )

        usearch_reset(matcher);

        WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=usearch_first(matcher, &status)))
        printf(" usearch=%d", v);

        WITH_CHECK_STATUS(v = (int)UCOL_EQUAL==ucol_strcollUTF8(col,
            haystacks[i], -1, needle, -1, &status))
        printf(" ucol=%d", v);

        RuleBasedCollator* rbc = RuleBasedCollator::rbcFromUCollator(col);
        WITH_CHECK_STATUS(StringSearch matcher2(needle, haystack, rbc, NULL, status))
        WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=matcher2.first(status)))
        printf(" StringSearch=%d", v);

        printf("\n");
    }

    ucol_close(col);
    usearch_close(matcher);

    return 0;
}

yields:

U_ICU_VERSION=67.1
mario vs. mario:  usearch=1 ucol=1 StringSearch=1
mario vs. Mario:  usearch=1 ucol=0 StringSearch=1
mario vs. MARIO:  usearch=1 ucol=0 StringSearch=1
mario vs. MÁRIO:  usearch=1 ucol=0 StringSearch=1
mario vs. Mário:  usearch=1 ucol=0 StringSearch=1
mario vs. mário:  usearch=1 ucol=1 StringSearch=1
mario vs. dario:  usearch=0 ucol=0 StringSearch=0

@gagolews
Copy link
Owner

All right, it turns out that this issue has already been reported. It is ICU-related.

https://unicode-org.atlassian.net/browse/ICU-21338

@gagolews gagolews changed the title Collation: case_level doesn't seem to work properly to ignore accents but take case into account Collation: case_level doesn't seem to work properly to ignore accents but take case into account (UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) May 14, 2021
@gagolews gagolews changed the title Collation: case_level doesn't seem to work properly to ignore accents but take case into account (UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) May 14, 2021
@jjesusfilho
Copy link
Author

Thank you very much for taking your time to work this out. I have followed your thread posts and tried to find the issue on ICU's bug page, but couldn't figure it out.

@gagolews gagolews modified the milestones: stringi-1.7, stringi-1.8 Jun 20, 2021
@gagolews gagolews removed this from the stringi-1.8 milestone Nov 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants