Tailoring: denormalized Japanese code points in the default FCE table #52

KL-7 · 2012-07-15T10:48:33Z

It turns out that some code points appear in the default FCE table in denormalized form. Because we always normalize input code points to NFD, the denormalized entries of the FCE table are never matched. If the normalized and denormalized forms map to different collation elements, we end up with the wrong collation order.
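To illustrate the mismatch, here is a minimal Python sketch (not the library's actual code). The table, its collation element values, and the helper names `nfd_code_points`, `unreachable_keys`, and `lookup` are all made up for illustration; the only real fact used is that U+304C (が) decomposes under NFD into U+304B (か) followed by U+3099.

```python
import unicodedata

# Toy stand-in for the FCE table: keys are tuples of code points,
# values are made-up collation elements (not real FCE data).
TOY_TABLE = {
    (0x304B,): [(0x100, 0x20, 0x02)],  # KA
    (0x3099,): [(0x000, 0x30, 0x02)],  # combining voiced sound mark
    (0x304C,): [(0x101, 0x20, 0x02)],  # GA, precomposed, so not in NFD form
}


def nfd_code_points(text):
    """Return the code points of text after NFD normalization."""
    return tuple(ord(ch) for ch in unicodedata.normalize("NFD", text))


def unreachable_keys(table):
    """Keys that change under NFD, so NFD-normalized input never matches them."""
    return [
        key for key in table
        if nfd_code_points("".join(chr(cp) for cp in key)) != key
    ]


def lookup(table, text):
    """Naive per-code-point lookup after NFD (ignores contractions)."""
    elements = []
    for cp in nfd_code_points(text):
        elements.extend(table[(cp,)])
    return elements
```

With this toy table, `unreachable_keys(TOY_TABLE)` reports the precomposed entry, and `lookup(TOY_TABLE, "\u304c")` returns the elements for U+304B and U+3099 rather than the ones stored under U+304C.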

This issue affects only one test for Japanese tailoring, but it's possible that we simply don't have enough tests to reveal a bigger impact of this problem.

More details in the gist.

srl295 · 2012-08-22T18:15:10Z

@KL-7 what does 'failing in ICU' mean in this context? Have you filed a ticket?

KL-7 · 2012-08-22T22:02:57Z

@srl295, I believe that's a problem with the test data and not the implementation. I was looking for tailoring tests and was quite disappointed when I found this note saying that CLDR no longer provides conformance tests. Out of frustration I used tests from an older version of CLDR from here. I ran ICU4J (as a reference implementation) on these tests, excluded those that failed, and used the rest as a test suite for our implementation.
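Roughly, the filtering works like the Python sketch below. It assumes the test file is a list of strings in expected ascending order and that each implementation can be wrapped as a sort-key function; `filter_by_reference`, `failures`, `reference_key`, and `candidate_key` are hypothetical names, not ICU4J or TwitterCLDR APIs.

```python
def filter_by_reference(lines, reference_key):
    """Keep adjacent pairs (a, b) whose expected order a <= b the reference confirms."""
    kept = []
    for a, b in zip(lines, lines[1:]):
        if reference_key(a) <= reference_key(b):
            kept.append((a, b))
    return kept


def failures(pairs, candidate_key):
    """Return the kept pairs that the implementation under test orders wrongly."""
    return [(a, b) for a, b in pairs if candidate_key(a) > candidate_key(b)]


# Stand-ins for illustration only: case-insensitive order plays the
# reference, plain code point order plays the implementation under test.
lines = ["a", "B", "c", "A"]
kept = filter_by_reference(lines, str.lower)
broken = failures(kept, lambda s: s)
```

Pairs the reference itself rejects are dropped, so any remaining failure points at the implementation under test rather than at stale test data.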

srl295 · 2012-08-22T22:42:39Z

@KL-7 Ouch.......... several times ouch.
Okay. I am sorry about the frustration. However... that is sort of like grading a student against the wrong answer key.

One thing that could have been done, or even could still be done, would be to request generation of newer data. As I mentioned, we don't get much notice of others picking up the data, period, until they are in some sense 'done' (as with TwitterCLDR's announcement). I've never heard of anyone actually using that test data, besides CLDR's own tests.

I don't want to scare you off by repeating myself, but.. please file tickets, use the mailing list, ..

In any event, it may be better to use ICU's test cases. The file below is not comprehensive (there are others), but it is one place to start. I think this one is consumed by both C and J.

http://source.icu-project.org/repos/icu/icu/trunk/source/test/testdata/DataDrivenCollationTest.txt

I assume Ruby has some mechanism for calling into, or being called from, C or Java, so one could also consider testing by comparing results. Worst case, you could execute my usort sample and compare the output: http://source.icu-project.org/repos/icu/icuapps/trunk/usort/

KL-7 · 2012-08-22T23:03:47Z

@srl295, I think we're pretty good even now, because I'm using only tests that ICU4J passes (it's basically the results comparison that you mentioned). I found a lot of issues (hopefully, most of them) in my implementation that way. I had a hard time trying to track down ICU's test data, so I used what I had at hand.

Regarding people using this data, I know for sure that at least the ZTM project is using it: I originally found this data in their repository and only then managed to find it in CLDR's SVN. I know nothing about the project's status, though; maybe they don't need an updated version.

And about mailing lists... I don't use them a lot in general, and at the time I didn't feel brave enough to write something to the Unicode or CLDR mailing lists. But you made me believe that it's not that scary =) Next time I need help or spot an issue I won't hesitate.

srl295 · 2012-08-22T23:25:27Z

@KL-7 hm, ZTM does not seem to be active presently, but that would have been good to have their input. Glad to have made things a little less scary.

KL-7 mentioned this issue Nov 16, 2015

[WIP] Update collation #168

Open
