Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0). #501

Open
Bodigrim opened this issue Feb 8, 2023 · 3 comments
Open

Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0). #501

Bodigrim opened this issue Feb 8, 2023 · 3 comments
Labels
internal No API-level changes question Requires more investigation

Comments

@Bodigrim
Copy link
Contributor

Bodigrim commented Feb 8, 2023

On GHC 9.6 t_toCaseFold_char fails on '\66937', which is VITHKUQI CAPITAL LETTER FE' (U+10579) added in Unicode 14.0. The test should be guarded with #if MIN_VERSION_base(4,16,0).

@Lysxia
Copy link
Contributor

Lysxia commented Feb 8, 2023

That may be the right fix for now, but it's worth noting it's only a temporary fix. Is there a longer term solution?

The root of the problem is that GHC and text don't use the same Unicode version. That makes our tests flaky, as in this case, and this is likely to happen again for any fix that looks only at base/GHC versions. Regardless of how we fix our tests, users of mismatched versions of GHC and text may also get very surprising bugs. This shouldn't happen for distributions that keep them in sync, hence the status of text as a core library that ships with GHC has been a pretty good mitigating factor so far; that might explain why we haven't seen reports of it already. But it seems worth reducing the risk further.

Ideally, the Unicode version should be set only once, and it seems reasonable that GHC should ultimately be responsible for this. This doesn't seem overly hard, but there are some executive decisions to be made, notably regarding who should fetch CaseFolding.txt and SpecialCase.txt. (I think that should end up in GHC, but we would then have to figure out how to maintain backward compatibility with existing GHCs).

I might still be useful to enable building text with an arbitrary Unicode version, independent of GHC. Imagine an application with Unicode-heavy data that they don't want to break when they upgrade or downgrade GHC.

It's fun to look at the QuickCheck test that failed. It looks like there are ~50 counterexamples (new upper case letters in Unicode 14), and Chars are sampled among the ~150k assigned ones. That's enough that we were bound to run into it, but it's not everyday that we get to look at a QuickCheck property that fails 3% of the time (with the standard 100 iterations).

@Lysxia Lysxia added question Requires more investigation internal No API-level changes labels Feb 8, 2023
@Bodigrim
Copy link
Contributor Author

Fundamentally Data.Text.toLower and map Data.Char.toLower are different functions with different semantics, even if both were generated from the same Unicode database. So in my view it's purely internal issue how to test Data.Text.toLower: instead of combininig Data.Char.toLower and a hand-crafted list of exclusions, we could have incorporated SpecialCase.txt into test suite and match against it directly.

I'll go with a tactical fix for now, just to keep CI green.

@phadej
Copy link
Contributor

phadej commented Feb 14, 2023

instead of combininig Data.Char.toLower and a hand-crafted list of exclusions

That is btw how Data.Text.toLower was implemented. List of exclusions/overrides and then calling Data.Char.toLower on the rest.

That didn't work well (e.g. issue #277, fix #402).

So probably won't work well for tests either :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal No API-level changes question Requires more investigation
Projects
None yet
Development

No branches or pull requests

3 participants