Tailoring: denormalized Japanese code points in the default FCE table #52

Open
KL-7 opened this issue Jul 15, 2012 · 5 comments

KL-7 commented Jul 15, 2012

It turns out that some code points occur in the default FCE table in denormalized form. Because we always normalize input code points to NFD, the denormalized entries of the FCE table are never matched and are effectively ignored. If the normalized and denormalized forms map to different collation elements, we end up with the wrong collation order.

This issue affects only one test for Japanese tailoring, but it's possible that we simply don't have enough tests to reveal a bigger impact of this problem.

More details in the gist.
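
To illustrate the failure mode described above, here is a minimal Python sketch. It is not TwitterCLDR's code: the table, its placeholder values, and the helper names `find_denormalized_keys` and `lookup` are all hypothetical, and the lookup is a naive per-code-point one rather than a real collation element lookup. It only shows why a table key that is not in NFD can never be reached once the input is normalized to NFD.

```python
import unicodedata

# Hypothetical toy table: keys are code point sequences, values are
# placeholder labels (NOT real collation elements).
toy_table = {
    "\u30AB": "CE_KA",              # KATAKANA LETTER KA
    "\u30AC": "CE_GA_PRECOMPOSED",  # KATAKANA LETTER GA (not in NFD)
    "\u3099": "CE_VOICED_MARK",     # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
}

def find_denormalized_keys(table):
    """Return the keys that change under NFD normalization."""
    return [k for k in table if unicodedata.normalize("NFD", k) != k]

def lookup(text, table):
    """Normalize to NFD first, then look up each code point (naive sketch)."""
    nfd = unicodedata.normalize("NFD", text)
    return [table.get(ch) for ch in nfd]

# U+30AC decomposes to U+30AB U+3099 under NFD, so the precomposed
# entry is skipped and the two decomposed entries are used instead.
print(find_denormalized_keys(toy_table))
print(lookup("\u30AC", toy_table))
```

A check like `find_denormalized_keys` could be run once over a loaded table to list the entries that normalized input will never hit.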

srl295 commented Aug 22, 2012

@KL-7 what does 'failing in ICU' mean in this context? Have you filed a ticket?

KL-7 commented Aug 22, 2012

@srl295, I believe that's a problem with the test data and not the implementation. I was looking for tailoring tests and was quite disappointed when I found this note saying that CLDR no longer provides conformance tests. Out of frustration I used tests from an older version of CLDR from here. I ran ICU4J (as a reference implementation) on these tests, excluded those that failed, and used the rest as a test suite for our implementation.
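
The filtering step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script actually used: it assumes the test data is a list of strings in expected collation order, and `reference_cmp` is a hypothetical stand-in for the reference implementation (here a plain code point comparison, where the real setup used ICU4J). Only adjacent pairs that the reference orders correctly are kept.

```python
def codepoint_cmp(a, b):
    """Hypothetical stand-in for a reference collator: compares by code point."""
    return (a > b) - (a < b)

def filter_pairs(ordered_lines, reference_cmp):
    """Keep adjacent pairs (a, b) for which the reference agrees that a <= b."""
    kept = []
    for a, b in zip(ordered_lines, ordered_lines[1:]):
        if reference_cmp(a, b) <= 0:
            kept.append((a, b))
    return kept

# Toy data: the pair ("b", "B") is dropped because the stand-in
# reference sorts "B" before "b".
lines = ["a", "b", "B", "c"]
print(filter_pairs(lines, codepoint_cmp))
```

The surviving pairs then serve as the test suite for the implementation under test.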

srl295 commented Aug 22, 2012

@KL-7 Ouch.......... several times ouch.
Okay. I am sorry about the frustration. However... that is sort of like grading a student against the wrong answer key.

One thing that could have been done.. or, even, still done, would be to request generation of newer data. As I mentioned, we don't get much notice of others picking up the data period until they are in some sense 'done' (as with TwitterCLDR's announcement). I've never heard of anyone actually using that test data, besides CLDR's own tests.

I don't want to scare you off by repeating myself, but.. please file tickets, use the mailing list, ..

In any event, it may be better to use ICU's test cases. The file below is not comprehensive (there are others), but it is one place to start. I think this one is consumed by both C and J.

http://source.icu-project.org/repos/icu/icu/trunk/source/test/testdata/DataDrivenCollationTest.txt

I assume Ruby has some mechanism for calling, or being called from, C or Java, so one could also consider testing by comparing results. Worst case, you could execute my usort sample and compare the output. http://source.icu-project.org/repos/icu/icuapps/trunk/usort/
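
As a rough sketch of the "compare the output" idea, assuming both the reference tool and the implementation under test produce a list of sorted lines, a small helper can report where the two first disagree. The function name `first_divergence` and the sample data are hypothetical; how the two lists are produced is left out.

```python
def first_divergence(expected, actual):
    """Return the index of the first differing line, or None if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One list is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None

reference_output = ["a", "B", "c"]  # e.g. lines sorted by a reference tool
our_output = ["a", "c", "B"]        # e.g. lines sorted by the code under test
print(first_divergence(reference_output, our_output))
```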

KL-7 commented Aug 22, 2012

@srl295, I think we're in pretty good shape even now, because I'm only using tests that ICU4J passes (it's basically the results comparison you mentioned). I found a lot of issues (hopefully most of them) in my implementation that way. I had a hard time tracking down ICU's test data, so I used what I had at hand.

Regarding people using this data, I know for sure that at least the ZTM project is using it: I originally found the data in their repository and only then managed to find it in CLDR's SVN. I know nothing about the project's status, though; maybe they don't need an updated version.

And about mailing lists... I don't use them a lot in general, and at the time I didn't feel brave enough to write something to the Unicode or CLDR mailing list. But you made me believe that it's not that scary =) Next time I need help or spot an issue I won't hesitate.

srl295 commented Aug 22, 2012

@KL-7 hm, ZTM does not seem to be active presently, but that would have been good to have their input. Glad to have made things a little less scary.
