Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0). #501

Bodigrim · 2023-02-08T00:32:20Z

On GHC 9.6 t_toCaseFold_char fails on '\66937', which is VITHKUQI CAPITAL LETTER FE' (U+10579) added in Unicode 14.0. The test should be guarded with #if MIN_VERSION_base(4,16,0).

The text was updated successfully, but these errors were encountered:

Lysxia · 2023-02-08T03:21:26Z

That may be the right fix for now, but it's worth noting it's only a temporary fix. Is there a longer term solution?

The root of the problem is that GHC and text don't use the same Unicode version. That makes our tests flaky, as in this case, and this is likely to happen again for any fix that looks only at base/GHC versions. Regardless of how we fix our tests, users of mismatched versions of GHC and text may also get very surprising bugs. This shouldn't happen for distributions that keep them in sync, hence the status of text as a core library that ships with GHC has been a pretty good mitigating factor so far; that might explain why we haven't seen reports of it already. But it seems worth reducing the risk further.

Ideally, the Unicode version should be set only once, and it seems reasonable that GHC should ultimately be responsible for this. This doesn't seem overly hard, but there are some executive decisions to be made, notably regarding who should fetch CaseFolding.txt and SpecialCase.txt. (I think that should end up in GHC, but we would then have to figure out how to maintain backward compatibility with existing GHCs).

I might still be useful to enable building text with an arbitrary Unicode version, independent of GHC. Imagine an application with Unicode-heavy data that they don't want to break when they upgrade or downgrade GHC.

It's fun to look at the QuickCheck test that failed. It looks like there are ~50 counterexamples (new upper case letters in Unicode 14), and Chars are sampled among the ~150k assigned ones. That's enough that we were bound to run into it, but it's not everyday that we get to look at a QuickCheck property that fails 3% of the time (with the standard 100 iterations).

Bodigrim · 2023-02-13T23:31:36Z

Fundamentally Data.Text.toLower and map Data.Char.toLower are different functions with different semantics, even if both were generated from the same Unicode database. So in my view it's purely internal issue how to test Data.Text.toLower: instead of combininig Data.Char.toLower and a hand-crafted list of exclusions, we could have incorporated SpecialCase.txt into test suite and match against it directly.

I'll go with a tactical fix for now, just to keep CI green.

phadej · 2023-02-14T09:05:52Z

instead of combininig Data.Char.toLower and a hand-crafted list of exclusions

That is btw how Data.Text.toLower was implemented. List of exclusions/overrides and then calling Data.Char.toLower on the rest.

That didn't work well (e.g. issue #277, fix #402).

So probably won't work well for tests either :)

Lysxia added question Requires more investigation internal No API-level changes labels Feb 8, 2023

Bodigrim mentioned this issue Feb 13, 2023

Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0) #507

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0). #501

Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0). #501

Bodigrim commented Feb 8, 2023 •

edited

Lysxia commented Feb 8, 2023

Bodigrim commented Feb 13, 2023

phadej commented Feb 14, 2023 •

edited

Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0). #501

Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0). #501

Comments

Bodigrim commented Feb 8, 2023 • edited

Lysxia commented Feb 8, 2023

Bodigrim commented Feb 13, 2023

phadej commented Feb 14, 2023 • edited

Bodigrim commented Feb 8, 2023 •

edited

phadej commented Feb 14, 2023 •

edited