ICU-21812 move Age/Block/Script to a separate trie #2926

Closed
markusicu wants to merge 2 commits

markusicu
Member

Move Age/Block/Script to a separate trie, because:

  • The properties vectors in uprops.icu/uprops.h are full, and making them longer is expensive.
  • We are about to overflow the bits for the Age property.
  • These properties seem correlated fairly well with each other but not with other properties.
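To illustrate the Age overflow concern, here is a minimal sketch. It assumes the Age value is packed into one byte as two 4-bit fields (major version in the high nibble, minor version in the low nibble). That layout is an assumption made for illustration and is not stated in this description; `pack_age` and `unpack_age` are hypothetical helper names.

```python
def pack_age(major: int, minor: int) -> int:
    """Pack a Unicode version into one byte: high nibble = major, low nibble = minor.

    Assumed layout for illustration only.
    """
    if not (0 <= major <= 0xF and 0 <= minor <= 0xF):
        raise ValueError(f"version {major}.{minor} does not fit into 4+4 bits")
    return (major << 4) | minor


def unpack_age(packed: int) -> tuple[int, int]:
    """Inverse of pack_age: return (major, minor)."""
    return packed >> 4, packed & 0xF


if __name__ == "__main__":
    print(hex(pack_age(15, 1)))      # Unicode 15.1 still fits
    try:
        pack_age(16, 0)              # a major version of 16 needs a fifth bit
    except ValueError as e:
        print("overflow:", e)
```

Under that assumption, the largest major version that fits is 15, so the next major version would need a wider field.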

Also

  • Move some bit-set-setting code from the emojipropsbuilder to toolutil so that it can be shared.
  • Some code cleanup, such as distinguishing end vs. pvecEnd in the corepropsbuilder.
  • Making a uprops.icu major-version change allows us to move the Script+Script_Extensions bits back together into a contiguous bit set.

Problem: Pulling these three properties out makes uprops.icu significantly larger. Based on Unicode 15.1 data:

Original:

trie size in bytes:                    46328
size in bytes of additional props trie:65584
number of additional props vectors:     2499
number of 32-bit words per vector:         3
number of 16-bit scriptExtensions:       298
data size:                            142560

(The data size includes 64 bytes for the file header and the indexes array.)
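As a sanity check, the reported data size can be reproduced by summing the parts. The sketch below assumes 4 bytes per 32-bit vector word and 2 bytes per 16-bit scriptExtensions unit; the function name `data_size` and its parameter names are made up for illustration.

```python
HEADER_AND_INDEXES = 64  # bytes, per the note above


def data_size(trie_bytes, props_trie_bytes, num_vectors, words_per_vector,
              num_scx_units, extra_trie_bytes=()):
    """Sum the components of uprops.icu as listed in the statistics."""
    return (HEADER_AND_INDEXES
            + trie_bytes
            + props_trie_bytes
            + num_vectors * words_per_vector * 4   # 32-bit words
            + num_scx_units * 2                    # 16-bit units
            + sum(extra_trie_bytes))               # any additional tries


if __name__ == "__main__":
    print(data_size(46328, 65584, 2499, 3, 298))   # 142560
```

The same sum reproduces the totals of the variants listed later when their extra trie sizes are passed as `extra_trie_bytes`, for example 196572 for the variant with a single 86748-byte ABS trie.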

Pulling Age/Block/Script into a separate trie (fast, 32-bit code point trie):

trie size in bytes:                    46328
size in bytes of additional props trie:54760
number of additional props vectors:      673
number of 32-bit words per vector:         3
size in bytes of ABS trie:             86748
number of 16-bit scriptExtensions:       298
data size:                            196572

Pulling them out into separate tries:

  • Age: fast 8-bit
  • Block: small 16-bit indexed by code point/16
  • Script/scx: fast 16-bit

trie size in bytes:                    46328
size in bytes of additional props trie:54760
number of additional props vectors:      673
number of 32-bit words per vector:         3
size in bytes of Age trie:             21596
size in bytes of Block trie:            7600
size in bytes of Script trie:          34420
number of 16-bit scriptExtensions:       298
data size:                            173440
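The "indexed by code point/16" idea for Block can be sketched as follows. Unicode blocks start and end on multiples of 16 code points, so one entry per 16 code points is enough. This sketch uses a flat list rather than a real code point trie (which would compress repeated ranges), covers only three sample blocks, and all names in it are illustrative.

```python
MAX_CP = 0x10FFFF
NO_BLOCK = 0  # table value for code points outside the sample blocks

# Three sample blocks: (name, first code point, last code point).
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Greek and Coptic", 0x0370, 0x03FF),
]


def build_block_table(blocks):
    """Build a flat table with one entry per 16 code points."""
    table = [NO_BLOCK] * ((MAX_CP >> 4) + 1)
    for block_id, (_name, start, end) in enumerate(blocks, start=1):
        # Block boundaries are aligned to 16 code points.
        assert start % 16 == 0 and end % 16 == 15
        for i in range(start >> 4, (end >> 4) + 1):
            table[i] = block_id
    return table


def block_name(table, blocks, cp):
    """Look up the block of a code point via the cp/16 index."""
    block_id = table[cp >> 4]
    return blocks[block_id - 1][0] if block_id else "No_Block"


if __name__ == "__main__":
    table = build_block_table(BLOCKS)
    print(block_name(table, BLOCKS, 0x41))    # Basic Latin
    print(block_name(table, BLOCKS, 0x3B1))   # Greek and Coptic
```

Dividing the index by 16 shrinks the index space from 0x110000 entries to 69632, which is consistent with the Block trie being the smallest of the three.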

Same, but small tries for Age & Script:

trie size in bytes:                    46328
size in bytes of additional props trie:54760
number of additional props vectors:      673
number of 32-bit words per vector:         3
size in bytes of Age trie:             17128
size in bytes of Block trie:            7600
size in bytes of Script trie:          25984
number of 16-bit scriptExtensions:       298
data size:                            160536

Pulling only Block out into a separate trie:

trie size in bytes:                    46328
size in bytes of additional props trie:62752
number of additional props vectors:     2026
number of 32-bit words per vector:         3
size in bytes of Block trie:            7600
number of 16-bit scriptExtensions:       298
data size:                            141652

Only this last version actually makes uprops.icu slightly smaller than the original (141652 vs. 142560 bytes).
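For a quick side-by-side view, the reported data sizes can be compared against the original. The numbers are taken from the statistics above; the variant labels and the `deltas` helper are just illustrative.

```python
ORIGINAL = 142560  # bytes, original uprops.icu data size

# Reported data sizes for each variant, in bytes.
VARIANTS = {
    "Age/Block/Script in one fast 32-bit trie": 196572,
    "separate tries, fast Age and Script": 173440,
    "separate tries, small Age and Script": 160536,
    "only Block in a separate trie": 141652,
}


def deltas(original, variants):
    """Return the size difference of each variant relative to the original."""
    return {name: size - original for name, size in variants.items()}


if __name__ == "__main__":
    for name, delta in deltas(ORIGINAL, VARIANTS).items():
        percent = 100 * delta / ORIGINAL
        print(f"{name}: {delta:+d} bytes ({percent:+.1f}%)")
```

The differences range from +54012 bytes for the single combined trie down to -908 bytes for the Block-only variant.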

FYI @echeran

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-21812
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable


@markusicu closed this Jun 4, 2024