Lexical ordering of strings with surrogate pairs #1346

Open
dlurton opened this issue Jan 24, 2024 · 0 comments
Labels
bug Something isn't working

Description

PartiQL can order strings incorrectly when they contain characters that UTF-16 encodes as surrogate pairs.

To Reproduce

This is a somewhat contrived example, but it demonstrates the point.

@Test
fun `lexical ordering of strings with surrogate pairs`() {
    // The code point of '～' is U+FF5E.
    // The code point of '💩' is U+1F4A9.
    // Therefore, '～' should be ordered first by PartiQL.

    // However, PartiQL currently falls back on the JVM to compare strings.  The JVM lexicographically
    // compares by UTF-16 code unit instead of full code point, and '💩' is encoded as the surrogate
    // pair D83D DCA9.  Because 0xD83D < 0xFF5E, the JVM orders '💩' first.  Only BMP characters at or
    // above U+E000 expose this; lower ones sort below every surrogate under either ordering.

    // Therefore this test fails.

    assertTrue(
        DEFAULT_COMPARATOR.compare(
            ExprValue.newString("～"),
            ExprValue.newString("💩")
        ) < 0,
        "'～' should come before '💩'"
    )
}

Expected Behavior

The test in the repro case should pass.
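
For illustration, here is a minimal sketch of a code-point-based comparison next to the JVM's default `String.compareTo`. The helper `compareByCodePoint` is a hypothetical name and not part of PartiQL, and U+FF5E is just one convenient BMP character above the surrogate range (U+D800..U+DFFF); this is one possible approach, not necessarily how the fix should be implemented.

```kotlin
// Hypothetical helper (not a PartiQL API): orders strings by Unicode code point
// rather than by UTF-16 code unit.
fun compareByCodePoint(a: String, b: String): Int {
    var i = 0
    while (i < a.length && i < b.length) {
        val cpA = a.codePointAt(i)
        val cpB = b.codePointAt(i)
        if (cpA != cpB) return cpA.compareTo(cpB)
        // Equal code points occupy the same number of code units in both strings,
        // so a single index can advance through both.
        i += Character.charCount(cpA)
    }
    // One string is a prefix of the other (or they are equal); the shorter sorts first.
    return a.length.compareTo(b.length)
}

fun main() {
    val tilde = "\uFF5E"        // U+FF5E, a single UTF-16 code unit
    val poo = "\uD83D\uDCA9"    // U+1F4A9, encoded as a surrogate pair
    val alpha = "\uAB30"        // U+AB30, below the surrogate range

    println(tilde.compareTo(poo) > 0)            // true: code unit order puts U+1F4A9 first
    println(compareByCodePoint(tilde, poo) < 0)  // true: code point order puts U+FF5E first

    // Characters below U+D800 compare the same way under both orderings.
    println(alpha.compareTo(poo) < 0)            // true
    println(compareByCodePoint(alpha, poo) < 0)  // true
}
```

The two orderings only disagree when a BMP character in U+E000..U+FFFF is compared against a supplementary character, which is why such a case is easy to miss in everyday data.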

Additional Context

I can't think of anything else to add.

@dlurton dlurton added the bug Something isn't working label Jan 24, 2024