Add test for all locales that have additional data in Collator #4167

Open
sffc opened this issue Oct 17, 2023 · 5 comments · May be fixed by #4767
Assignees
Labels
C-collator Component: Collation, normalization good first issue Good for newcomers S-tiny Size: Less than an hour (trivial fixes) T-docs-tests Type: Code change outside core library

Comments

@sffc
Member

sffc commented Oct 17, 2023

We should add a test for the 'vi' locale and any other locales that load additional data from other keys.

Follow-up to #4165 / #4166

@sffc sffc added T-docs-tests Type: Code change outside core library S-tiny Size: Less than an hour (trivial fixes) C-collator Component: Collation, normalization labels Oct 17, 2023
@sffc sffc added this to the Priority Backlog ⟨P3⟩ milestone Oct 19, 2023
@sffc sffc added the good first issue Good for newcomers label Oct 19, 2023
@ashu26jha
Contributor

In languages like Vietnamese, diacritics are important in determining the sort order (something like CollationDiacriticsV1Marker), so loading additional data may improve accuracy and might change the order.

To fix this, I could use something like this (pseudocode):

let local_provider = MultiForkByKeyProvider::new(vec![]);

Collator::try_new_with_buffer_provider(
    &local_provider, 
    LOCALE.into(), 
    CollatorOptions::new()
);

Am I going in the correct direction? Could you elaborate on this issue?

@sffc
Member Author

sffc commented Feb 26, 2024

We have some tests here already: https://github.com/unicode-org/icu4x/blob/main/components/collator/tests/tests.rs

We should add some more tests there. This issue is to focus on the locales that load non-root data. You can see which locales do this by looking in the baked data files:

https://github.com/unicode-org/icu4x/tree/main/provider/baked/collator/data/macros

For example, in collator_dia_v1.rs.data, there is: static KEYS: [&str; 2usize] = ["und", "vi"]; which means that vi is the only locale in this case that uses non-root data. Ideally you can find a pair of strings that have different ordering in und versus vi due to this difference. This helps prove that the vi data supplement is being loaded correctly.

It would be good to have at least one test per data supplement. Some of these might already be covered, but we know vi is not covered due to #4165.
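
As a rough illustration of reading those KEYS arrays, the sketch below filters out the root entry to list the locales that need a dedicated test. The helper name `non_root_locales` and the `ROOT` constant are hypothetical and are not part of ICU4X; the only input taken from the thread is the `["und", "vi"]` array quoted above.

```rust
/// Locale identifier that denotes root data in the baked KEYS arrays.
/// (Hypothetical constant for this sketch, not an ICU4X item.)
const ROOT: &str = "und";

/// Returns every entry of a KEYS array other than the root,
/// i.e. the locales that ship supplemental data for that key.
/// (Hypothetical helper for this sketch, not an ICU4X API.)
fn non_root_locales<'a>(keys: &[&'a str]) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| *k != ROOT).collect()
}
```

Called with the array from collator_dia_v1.rs.data, `non_root_locales(&["und", "vi"])` yields only `"vi"`, which matches the statement that vi is the only locale with non-root data for that key.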

@ashu26jha
Contributor

> Ideally you can find a pair of strings that have different ordering in und versus vi due to this difference.

Now I get it, test cases should look like this:

assert_eq!(collator.compare("à", "á"), Ordering::Less);       // For "vi"
assert_eq!(collator.compare("à", "á"), Ordering::Greater);    // For "und"
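
To show why the same pair can flip, here is a toy two-level comparison in plain Rust. It is not how ICU4X works internally and uses no ICU4X data: the two accent orders are assumptions chosen to agree with the assertions above (acute before grave for the root-like order, grave before acute for the vi-like order). The function names are hypothetical, and inputs are expected in decomposed form (base letter followed by a combining mark).

```rust
use std::cmp::Ordering;

// Illustrative accent orders only; assumed for this sketch, not read from ICU4X data.
// Root-like order: acute (U+0301) sorts before grave (U+0300).
const ROOT_ACCENTS: [char; 2] = ['\u{0301}', '\u{0300}'];
// vi-like order: grave (U+0300) sorts before acute (U+0301).
const VI_ACCENTS: [char; 2] = ['\u{0300}', '\u{0301}'];

/// Rank of a combining accent within `accents`, starting at 1; 0 if not listed.
fn accent_rank(c: char, accents: &[char]) -> usize {
    accents.iter().position(|&a| a == c).map_or(0, |i| i + 1)
}

/// Two-level comparison of decomposed strings: base letters decide first,
/// and accent ranks are consulted only when the base letters are equal.
fn toy_compare(a: &str, b: &str, accents: &[char]) -> Ordering {
    let is_accent = |c: &char| accents.contains(c);
    // Primary level: the string with all listed accents removed.
    let primary = |s: &str| s.chars().filter(|c| !is_accent(c)).collect::<String>();
    // Secondary level: the sequence of accent ranks, in order of appearance.
    let secondary = |s: &str| -> Vec<usize> {
        s.chars()
            .filter(|c| is_accent(c))
            .map(|c| accent_rank(c, accents))
            .collect()
    };
    primary(a)
        .cmp(&primary(b))
        .then_with(|| secondary(a).cmp(&secondary(b)))
}
```

With decomposed inputs, `toy_compare("a\u{0300}", "a\u{0301}", &VI_ACCENTS)` gives `Ordering::Less` and the same call with `&ROOT_ACCENTS` gives `Ordering::Greater`, mirroring the two assertions. A real test should still go through `Collator` so that it exercises the actual data loading.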

@sffc
Member Author

sffc commented Mar 4, 2024

@ashu26jha Do you consider this issue complete?

@ashu26jha
Contributor

ashu26jha commented Mar 4, 2024

I would prefer not to close this one, as there are a few TODOs at the end of the file that still need to be completed, for example writing tests for Tibetan, sr and sr-Latn, etc.

Since #4633 was my first PR, I wanted to keep it short and easy to review ✌🏼


2 participants