ICU-13219 add -u-dx support to BreakIterator #2702

Open
wants to merge 2 commits into
base: main

FrankYFTang
Contributor

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-13219
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@FrankYFTang
Contributor Author

@srl295 @eggrobin could you look at the unit test and see whether it fits your understanding of DX?

@srl295
Member

srl295 commented Nov 15, 2023

> @srl295 @eggrobin could you look at the unit test and see whether it fits your understanding of DX?

Suggestion: change the title to -u-dx

I will have to look at the test cases a bit more but it seems like it could work.

Did you see my pr #2676 which has a test case from a minority language?

@FrankYFTang
Contributor Author

> Did you see my pr #2676 which has a test case from a minority language?

The tricky part will not be the break behavior within a script, but at the boundary with other characters or between two scripts listed in the same -u-dx. For example, say we have -u-dx-thai-laoo and a run of text in Thai script, Lao script, and numbers without any spaces. Would there be any break in that run of text? Should it break at the boundary with the numbers, or at the spot between the Lao script and the Thai script, or not at all?

@FrankYFTang changed the title from "ICU-13219 add DX support to BreakIterator" to "ICU-13219 add -u-dx support to BreakIterator" on Nov 15, 2023
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/brkiter.cpp is different
  • icu4c/source/common/rbbi_cache.cpp is different
  • icu4c/source/common/rbbi.cpp is different
  • icu4c/source/common/unicode/rbbi.h is different
  • icu4j/main/core/src/main/java/com/ibm/icu/text/BreakIterator.java is now changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/text/BreakIteratorFactory.java is now changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is now changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/rbbi/RBBITest.java is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/brkiter.cpp is different
  • icu4c/source/common/unicode/brkiter.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/rbbi_cache.cpp is different
  • icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin
Member

eggrobin previously approved these changes Nov 20, 2023

The behaviour matches my understanding of the definition.

@FrankYFTang
Contributor Author

> Did you see my pr #2676 which has a test case from a minority language?

I tried your proposed diff of icu4c/source/test/testdata/rbbitst.txt, and below are the errors I got in my PR. Is your expectation "correct"?

=== Handling test: rbbi/RBBITest/TestExtended: ===
   rbbi {
      RBBITest {
         TestExtended {
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
------------------------------------------------ 4
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break expected, but not found.  Pos=   4  File line,col= 1538,  13
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
------------------------------------------------ 5
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break found, but not expected.  Pos=   5  File line,col= 1538,  15
         Reverse Itertion, break found, but not expected.  Pos=   5  File line,col= 1538,  15
         Reverse Iteration, break expected, but not found.  Pos=   4  File line,col= 1538,  13
         isBoundary(4) incorrect. File line,col= 1538,  13
                 Expected, Actual= true, false
         isBoundary(5) incorrect. File line,col= 1538,  15
                 Expected, Actual= false, true
         following(0) incorrect. File line,col= 1538,   8
                 Expected, Actual= 4, 5
         following(1) incorrect. File line,col= 1538,  10
                 Expected, Actual= 4, 5
         following(2) incorrect. File line,col= 1538,  11
                 Expected, Actual= 4, 5
         following(3) incorrect. File line,col= 1538,  12
                 Expected, Actual= 4, 5
         following(4) incorrect. File line,col= 1538,  13
                 Expected, Actual= 10, 5
         preceding(10) incorrect. File line,col= 1538,  20
                 Expected, Actual= 4, 5
         preceding(9) incorrect. File line,col= 1538,  19
                 Expected, Actual= 4, 5
         preceding(8) incorrect. File line,col= 1538,  18
                 Expected, Actual= 4, 5
         preceding(7) incorrect. File line,col= 1538,  17
                 Expected, Actual= 4, 5
         preceding(6) incorrect. File line,col= 1538,  16
                 Expected, Actual= 4, 5
         preceding(5) incorrect. File line,col= 1538,  15
                 Expected, Actual= 4, 0
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
------------------------------------------------ 12
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break expected, but not found.  Pos=  12  File line,col= 1538,  13
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
------------------------------------------------ 13
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break found, but not expected.  Pos=  13  File line,col= 1538,  15
         Reverse Itertion, break found, but not expected.  Pos=  13  File line,col= 1538,  15
         Reverse Iteration, break expected, but not found.  Pos=  12  File line,col= 1538,  13
         isBoundary(12) incorrect. File line,col= 1538,  13
                 Expected, Actual= true, false
         isBoundary(13) incorrect. File line,col= 1538,  15
                 Expected, Actual= false, true
         following(0) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(1) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(2) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(3) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(4) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(5) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(6) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(7) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(8) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(9) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(10) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(11) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(12) incorrect. File line,col= 1538,  13
                 Expected, Actual= 26, 13
         preceding(28) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(27) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(26) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(25) incorrect. File line,col= 1538,  19
                 Expected, Actual= 12, 13
         preceding(24) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(23) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(22) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(21) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(20) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(19) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(18) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(17) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(16) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(15) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
         preceding(14) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
         preceding(13) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
      
         } ERRORS (52) in TestExtended (38ms) 
      
   
      } ERRORS (52) in RBBITest (38ms) 
   

   } ERRORS (52) in rbbi (38ms) 


--------------------------------------
Errors in total: 52.
            TestExtended
         RBBITest
      rbbi
   
--------------------------------------

@FrankYFTang
Contributor Author

Your diff actually only adds one test case

<data>•โอํน• อะไป •จู่วาม •โล่น•</data>

for line break,
but I think it is incorrect.
Why should the line break happen before the space?
It should be

<data>•โอํน •อะไป •จู่วาม •โล่น•</data>

instead, right?

@FrankYFTang
Contributor Author

FrankYFTang commented Nov 20, 2023

Looking at https://github.com/unicode-org/icu/pull/2676/files#diff-b177067bbc1df57fc40ae7629a81e8df960899b9088555b010680a1c500943e2
Also, the line

<line>
# Should no longer break at the dictionary points - it's not Thai language
...
#<data>•โอํน• •อะไป• •จู่วาม• •โล่น• •เปี่ยร• •อะลู่วาง• •แมะ,• •ปาย• •อัน• •แบ็จ• •อะโจํน• •ซา• •เมาะ.• •อัน• •ฮะบืน• •ตะ• •เวี่ยะ• •ตะ• •งี่ยาน,• •อัน• •ฮะบืน• •อีว• •อะปายฮ.•</data>

should be

<line>
# Should no longer break at the dictionary points - it's not Thai language
...
<data>•โอํน •อะไป •จู่วาม •โล่น •เปี่ยร •อะลู่วาง •แมะ, •ปาย •อัน •แบ็จ •อะโจํน •ซา •เมาะ. •อัน •ฮะบืน •ตะ •เวี่ยะ •ตะ •งี่ยาน, •อัน •ฮะบืน •อีว •อะปายฮ.•</data>

There is no reason to have a line break before the space. A line break should only happen after the SPACE, not before the SPACE, right?

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/test/testdata/rbbitst.txt is now changed in the branch
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/rbbi/rbbitst.txt is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@FrankYFTang
Contributor Author

@srl295 I copied your test change over but modified it. Please read my modified version in this PR and see whether you agree with it. The changes are:

  1. For line break, there should be no line break before the SPACE.
  2. For word break, the status should be 200, not 0.
  3. For word break, we should break before . and , if we treat the Thai as AL.
  4. Not using dx=zyyy. That part of the spec is very bad. I filed bug https://unicode-org.atlassian.net/browse/CLDR-17247 for that. I do not think we should implement that behavior. It is clearly a spec bug from my point of view.

icu4c/source/common/rbbi.cpp (outdated, resolved)
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/rbbi.cpp is different
  • icu4c/source/test/testdata/rbbitst.txt is different
  • icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/rbbi/rbbitst.txt is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/rbbi.cpp is different
  • icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

// Ask the language object if there are any breaks. It will add them to the cache and
// leave the text pointer on the other side of its range, ready to search for the next one.
if (lbe != null) {
foundBreakCount += lbe.findBreaks(fText, rangeStart, rangeEnd, fBreaks, fPhraseBreaking);

Member

I haven't looked in detail at this, but it appears that this wouldn't catch the case where a character before rangeEnd should be excluded.

Contributor Author

so... how is that behavior specified in UTS 35 + UAX 29 + UAX 14?
Could we have a test case for that?

Member

Here's what I mean. The meaning of dx-xxxx is that none of the xxxx characters will be processed by the break iterators.

So say that 't' stands for Thai, and . stands for other characters.

ttttt......ttttttttttt......ttttttt.......

With dx-thai, break iterators must only act on the dots (non-thai)

With your code, the iterator would skip over the first ttttt and start at the first non-Thai (the first dot)

However, I see nothing in the code that would prevent the iterator from continuing at least part-way into the second group of ttttttttttt.

Contributor Author

So... notice that this part is inside the function
populateDictionary(int startPos, int endPos,...)
If you have the text "ttttt......ttttttttttt......ttttttt.......",
the first "ttttt" is at index 0-5, the second "ttttttttttt" is at index 11-22, and the third "ttttttt" is at index 28-35.
The upper caller will call this function three times:

  1. first with populateDictionary(0, 5...)
  2. second with populateDictionary(11, 22...)
  3. third with populateDictionary(28, 35...)

The code puts the breaks between 0 and 5 into fBreaks on the first call, the breaks between 11 and 22 on the second call, and the breaks between 28 and 35 on the third call,
and the upper caller will figure out how to break the "..." parts.

With my change, when we hit any t, the check if (excludedFromDictionaryBreak(c)) will be true, and therefore the loop just advances the iterator to 5 and returns.
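
As a sanity check on the ranges quoted above, here is a minimal standalone sketch (plain Java, no ICU) that finds the maximal runs of 't' in that string and prints one call per run. `DictionaryRunsSketch` and `dictionaryRuns` are hypothetical names made up for illustration, and the run lengths 5, 6, 11, 6, 7, 7 are assumed from the indices in this comment; the real caller may well be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical standalone helper; not ICU code. It only mirrors the call pattern described above.
final class DictionaryRunsSketch {
    // Returns [start, end) UTF-16 ranges of maximal runs whose code points satisfy isDictionaryChar.
    static List<int[]> dictionaryRuns(String text, IntPredicate isDictionaryChar) {
        List<int[]> runs = new ArrayList<>();
        int runStart = -1; // -1 means "not currently inside a run"
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isDictionaryChar.test(cp)) {
                if (runStart < 0) {
                    runStart = i;
                }
            } else if (runStart >= 0) {
                runs.add(new int[] {runStart, i});
                runStart = -1;
            }
            i += Character.charCount(cp);
        }
        if (runStart >= 0) {
            runs.add(new int[] {runStart, text.length()});
        }
        return runs;
    }

    public static void main(String[] args) {
        // 't' stands for a Thai (dictionary) character, '.' for anything else.
        String text = "t".repeat(5) + ".".repeat(6) + "t".repeat(11)
                + ".".repeat(6) + "t".repeat(7) + ".".repeat(7);
        for (int[] run : dictionaryRuns(text, cp -> cp == 't')) {
            System.out.println("populateDictionary(" + run[0] + ", " + run[1] + ", ...)");
        }
    }
}
```

Under those assumptions the three runs come out as [0, 5), [11, 22) and [28, 35), i.e. half-open ranges matching the three calls listed above.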

Contributor Author

Mark, I added the following to the unit test to show that it does work:

    bi->setText(UnicodeString(u"aaอออaaaaaอออ    aaaa"));

For line break, the only breaks are 1) at the beginning, 2) between " " and "a", and 3) at the end of the text.
For word break, the only breaks are 1) at the beginning, 2) between "อ" and " ", 3) between " " and "a", and 4) at the end of the text.

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/brkiter.cpp is different
  • icu4c/source/common/rbbi.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@macchiati
Member

Let me try to be clearer.

Suppose that

  1. The dictionary breakIterator will act on any characters marked A or B below, but will skip over C and D.
  2. The dx values need to cause the characters B and C to be skipped, but have no effect on characters A and D.

AAABBBCCCDDD

What should happen is that the dictionary break iterator should act on the characters AAA, and otherwise the RBBI rules will act on BBBCCCDDD.

From what I see of your code change, at the first A character, the dictionary breakIterator accepts it, and dx doesn't exclude it. So the dictionary's break iterator gets called. That seems clear.

What is not clear to me is how lbe.findBreaks knows to stop at the first B, because the break iterator internally has no access to the dx exclusion set, and there isn't any other change in your PR that would indicate some way that the iterator's results past the first B would be ignored.

@FrankYFTang
Contributor Author

FrankYFTang commented Nov 23, 2023

> Let me try to be clearer.
>
> Suppose that
>
>   1. The dictionary breakIterator will act on any characters marked A or B below, but will skip over C and D.
>   2. The dx values need to cause the characters B and C to be skipped, but have no effect on characters A and D.
>
> AAABBBCCCDDD
>
> What should happen is that the dictionary break iterator should act on the characters AAA, and otherwise the RBBI rules will act on BBBCCCDDD.
>
> From what I see of your code change, at the first A character, the dictionary breakIterator accepts it, and dx doesn't exclude it. So the dictionary's break iterator gets called. That seems clear.
>
> What is not clear to me is how lbe.findBreaks knows to stop at the first B, because the break iterator internally has no access to the dx exclusion set, and there isn't any other change in your PR that would indicate some way that the iterator's results past the first B would be ignored.

I see. OK, you are right, that is not clear. I need to call excludedFromDictionaryBreak to adjust rangeStart and rangeEnd before passing them to findBreaks. The range passed to findBreaks may need to be narrowed to different values so that it excludes these characters.
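
One way to read the fix described above is to clip the range before the engine ever sees it. The sketch below is an assumption about the shape of such a fix, not the code in this PR: `DxClipSketch`, `dictionaryRangeEnd`, and the two predicates are hypothetical stand-ins for the dictionary character set and the dx exclusion set. It uses the AAABBBCCCDDD example from the comment quoted above.

```java
import java.util.function.IntPredicate;

// Hypothetical illustration only; not the PR's implementation.
final class DxClipSketch {
    // Returns the exclusive end of the leading stretch of [start, limit) that may be
    // handed to the dictionary engine: dictionary characters that dx does not exclude.
    static int dictionaryRangeEnd(String text, int start, int limit,
                                  IntPredicate inDictionary, IntPredicate excludedByDx) {
        int i = start;
        while (i < limit) {
            int cp = text.codePointAt(i);
            if (!inDictionary.test(cp) || excludedByDx.test(cp)) {
                break; // stop at the first character the engine must not see
            }
            i += Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        IntPredicate inDictionary = cp -> cp == 'A' || cp == 'B'; // engine handles A and B
        IntPredicate excludedByDx = cp -> cp == 'B' || cp == 'C'; // dx excludes B and C
        String text = "AAABBBCCCDDD";
        int end = dictionaryRangeEnd(text, 0, text.length(), inDictionary, excludedByDx);
        System.out.println("dictionary engine sees [0, " + end + ")");
    }
}
```

With these stand-ins the engine would only be given AAA, and everything from the first B onward would be left to the rules.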

@srl295
Member

srl295 commented Nov 23, 2023

@FrankYFTang my apologies, I have not found spare time to review this recently. I will loop in @mhosken in case he's able to review.

It seems to be going in a good direction… certainly feel free to amend my test as needed (it was meant as an example, not as prescriptive) and close the other PR…

throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
}
String script = dxs.substring(i*5, i*5+4);
// Special handling of zyyy

Member

Change this. Right after the length check, see if the entire dxValues value equals (case insensitive) "-zyyy". If so, return UnicodeSet.ALL_CODE_POINTS (everything, might want a static constant).

Otherwise, there is no special zyyy handling

Member

Needs a test case also.

Contributor Author

But we need to take care of cases like "en-u-dx-thai-zyyy" or "en-u-dx-thai-hani-zyyy", etc. too, right?

Member

Yes; the CLDR ticket clarifying dx was accepted for CLDR v44.1 (you are a watcher)

Contributor Author

My point is that checking "if the entire dxValues value equals (case insensitive) "-zyyy"" is not good enough, because we may have
"en-u-dx-thai-zyyy" or "en-u-dx-thai-hani-zyyy", where the type is, e.g., "thai-hani-zyyy", not just "zyyy".

Contributor Author

Oh, I just saw what you landed in https://github.com/unicode-org/cldr/pull/3411/files and that makes sense. Sorry, ignore my previous comments.
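
For reference, a minimal sketch of the parsing suggested at the top of this thread, in plain Java without ICU. The names `DxValueSketch`, `excludesEverything`, `scriptCodes`, and `toSetPattern` are hypothetical, and it assumes the dx value arrives without a leading hyphen (as in the "abc-defgh" example elsewhere in this PR). The returned string only mirrors the [[:scx=...:]] pattern shape shown in the PR; nothing here constructs a UnicodeSet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the PR's implementation.
final class DxValueSketch {
    // True only when the whole dx value is "zyyy" (case-insensitive), i.e. exclude everything.
    static boolean excludesEverything(String dxValue) {
        return dxValue.equalsIgnoreCase("zyyy");
    }

    // Splits "thai-laoo" into ["thai", "laoo"]; rejects any piece that is not four ASCII letters.
    static List<String> scriptCodes(String dxValue) {
        List<String> codes = new ArrayList<>();
        for (String code : dxValue.split("-", -1)) {
            if (!code.matches("[A-Za-z]{4}")) {
                throw new IllegalArgumentException("Incorrect value for dx key: " + dxValue);
            }
            codes.add(code.toLowerCase(Locale.ROOT));
        }
        return codes;
    }

    // Builds a set pattern string of the same shape as the one in the PR's comment.
    static String toSetPattern(List<String> codes) {
        StringBuilder builder = new StringBuilder("[");
        for (String code : codes) {
            builder.append("[:scx=").append(code).append(":]");
        }
        return builder.append("]").toString();
    }

    public static void main(String[] args) {
        System.out.println(excludesEverything("zyyy"));           // whole value is zyyy
        System.out.println(excludesEverything("thai-zyyy"));      // zyyy is just one of several codes
        System.out.println(toSetPattern(scriptCodes("thai-laoo")));
    }
}
```

In this sketch "thai-zyyy" is not treated specially, so zyyy there would simply be one more script code in the pattern, which is how I read the suggestion above.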

// For example, if the locale is "en-u-dx-abc-defgh", dxs is "abc-defgh"
// and builder.toString() return "[[:scx=abc-:][:scx=efgh:]]" and causes
// UnicodeSet constructor to throw IllegalArgumentException
return new UnicodeSet(builder.toString());

Member

Freeze the UnicodeSet: add .freeze(). That makes it immutable and faster.

@@ -206,6 +253,9 @@ public boolean equals(Object that) {
(!fRData.fRuleSource.equals(other.fRData.fRuleSource))) {
return false;
}
if (!((fDX == null && other.fDX == null) || fDX.equals(other.fDX))) {
return false;

Member

You can use Objects.equal(s) to avoid the null check.
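
A small sketch of that suggestion, using java.util.Objects.equals from the JDK. The class `DxEqualsSketch` and its String field are hypothetical stand-ins for the iterator and its fDX set. Note that the expression in the diff would throw a NullPointerException when fDX is null and other.fDX is not, which Objects.equals also avoids.

```java
import java.util.Objects;

// Hypothetical stand-in for the iterator; fDX is null when no dx keyword was given.
final class DxEqualsSketch {
    final String fDX;

    DxEqualsSketch(String fDX) {
        this.fDX = fDX;
    }

    // Null-safe comparison: true when both are null or both are equal.
    boolean sameExclusion(DxEqualsSketch other) {
        return Objects.equals(fDX, other.fDX);
    }

    public static void main(String[] args) {
        DxEqualsSketch none = new DxEqualsSketch(null);
        DxEqualsSketch thai = new DxEqualsSketch("thai");
        System.out.println(none.sameExclusion(new DxEqualsSketch(null))); // both null
        System.out.println(none.sameExclusion(thai));                     // null vs non-null, no exception
        System.out.println(thai.sameExclusion(new DxEqualsSketch("thai")));
    }
}
```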

@macchiati
Member

macchiati commented Dec 6, 2023 via email

@FrankYFTang added the "incomplete" label (Needs work; do not approve/merge as is.) on Dec 12, 2023
@FrankYFTang
Contributor Author

Please ignore my update. I am still working on this PR. It is not ready for review. After Mark pointed out some issues, I found my design was wrong and it needs more intensive rework.

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/rbbi.cpp is different
  • icu4c/source/test/intltest/rbbitst.cpp is different
  • icu4c/source/test/intltest/rbbitst.h is different
  • icu4j/main/core/src/main/java/com/ibm/icu/impl/breakiter/CjkBreakEngine.java is now changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/rbbi/RBBITest.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@macchiati
Member

Fixes the problems I noted

@mhosken

mhosken commented Apr 15, 2024

I'm getting user complaints again on this. Can we action this? Some fix for disabling dictionary breaking has been requested since, well, I can't find out since when, because I can't get to the old bug tracker, but it came into the latest tracker in 2019.

Perhaps the best is the enemy of the good here? The only people that I know of who are affected by this are those who use minority languages, insert ZWSP for word breaks, and deal with correctly tagged text. Do we have to refine this fix for the non-use cases as well before we can fix the actual use case?

I'm sorry that my frustration is showing. But we seem to be more concerned about people who do the wrong thing than those who do the right thing (and tag correctly, by some definition). The really correct solution is that if the text is not tagged with the language of the dictionary, then no dictionary breaking should occur. I realise that that is just too much for most people and so we have special tagging. But can we please get something out for these users who are able to tag correctly?

Please shipit already.

@srl295
Member

srl295 commented Apr 15, 2024

@FrankYFTang is this going to be merged for 75?

@FrankYFTang
Contributor Author

No, the issue is more complicated than what my PR handles.

@srl295
Member

srl295 commented Apr 16, 2024

> No, the issue is more complicated than what my PR handles.

Do you have more detail?

@FrankYFTang
Contributor Author

> > No, the issue is more complicated than what my PR handles.
>
> Do you have more detail?

It requires more detailed analysis and testing for different cases than what I put into this PR. I missed some complicated combinations.

Labels
incomplete Needs work; do not approve/merge as is.
Projects
None yet
5 participants