Add Unicode Blocks access and iteration #4803

Open
sffc opened this issue Apr 12, 2024 Discussed in #4798 · 1 comment
Labels
C-unicode Component: Props, sets, tries good first issue Good for newcomers S-medium Size: Less than a week (larger bug fix or enhancement)

Comments

@sffc
Member

sffc commented Apr 12, 2024

We should consider adding Unicode Blocks to the icu_properties crate. It should probably include:

  1. Ability to loop over all blocks
  2. Ability to access the code points in a block
  3. Ability to look up a block by name

Concretely, I think the most practical way to implement this would be to make an open enum for the block and then treat the block like an enumerated property, including code point access and display name parsing. It should pack fairly small in an InversionMap, since the code point space is already segmented into large contiguous blocks.
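A minimal sketch of that shape, to make the three requirements concrete. Everything here is hypothetical and is not an existing icu_properties API: the `Block` newtype stands in for the open enum, `BLOCK_TABLE` is a four-row excerpt of Blocks.txt standing in for the inversion-map data, and `block_of`, `all_blocks`, and `code_points` are illustrative names only.

```rust
use std::ops::RangeInclusive;

/// Hypothetical open enum: a newtype over an integer, so that values added
/// by a later Unicode version can still be represented.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct Block(pub u16);

#[allow(non_upper_case_globals)]
impl Block {
    pub const NoBlock: Block = Block(0);
    pub const BasicLatin: Block = Block(1);
    pub const Latin1Supplement: Block = Block(2);
    pub const LatinExtendedA: Block = Block(3);
    pub const LatinExtendedB: Block = Block(4);
}

/// (first, last, block, name) rows sorted by first code point.
/// Only a small excerpt of Blocks.txt; real data would come from datagen.
const BLOCK_TABLE: &[(u32, u32, Block, &str)] = &[
    (0x0000, 0x007F, Block::BasicLatin, "Basic Latin"),
    (0x0080, 0x00FF, Block::Latin1Supplement, "Latin-1 Supplement"),
    (0x0100, 0x017F, Block::LatinExtendedA, "Latin Extended-A"),
    (0x0180, 0x024F, Block::LatinExtendedB, "Latin Extended-B"),
];

/// Look up the block of a code point by binary search over range starts.
pub fn block_of(c: char) -> Block {
    let cp = c as u32;
    // Index of the first row whose start is greater than `cp`.
    let idx = BLOCK_TABLE.partition_point(|row| row.0 <= cp);
    match idx.checked_sub(1).map(|i| BLOCK_TABLE[i]) {
        Some((_, last, block, _)) if cp <= last => block,
        _ => Block::NoBlock,
    }
}

/// Requirement 1: loop over all blocks (with their display names).
pub fn all_blocks() -> impl Iterator<Item = (Block, &'static str)> {
    BLOCK_TABLE.iter().map(|&(_, _, block, name)| (block, name))
}

/// Requirement 2: access the code points in a block, as an inclusive range.
pub fn code_points(block: Block) -> Option<RangeInclusive<u32>> {
    BLOCK_TABLE
        .iter()
        .find(|row| row.2 == block)
        .map(|&(first, last, _, _)| first..=last)
}
```

Because blocks are contiguous and non-overlapping, one sorted row per block is enough, which is why the data should stay small. Lookup by name (requirement 3) would be a parse over the same name column.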

I imagine that the most difficult part of the implementation would be the sourcing of the data in datagen.

This would be a good first issue of medium scope.

Discussed in #4798

Originally posted by faassen April 11, 2024
I dug around the source code, but I couldn't find a representation in Rust code of the Blocks.txt data. There's the unicode_blocks crate, but it is missing an important feature: I need to be able to iterate through all blocks. I need to be able to look blocks up by name, but not by the name as given, since that contains space characters. So iteration seems required so that I can do some pre-processing. This is to implement regular expressions as defined by appendix F of the XML Schema specification:

https://www.w3.org/TR/xmlschema-2/#regexs

Being able to get a CodePointInvList for a block would also be nice.

Did I miss something? Is this planned?
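The pre-processing described above could look roughly like the following sketch. It assumes the XML Schema block names are the Blocks.txt names with whitespace removed (for example `BasicLatin`, `Latin-1Supplement`); the `BLOCKS` table is a hand-copied three-row excerpt of Blocks.txt, and `xsd_name` and `build_index` are hypothetical helper names, not part of any crate.

```rust
use std::collections::HashMap;

/// (name, first, last) rows: a small hand-copied excerpt of Blocks.txt.
/// A real implementation would iterate over the full block list instead.
const BLOCKS: &[(&str, u32, u32)] = &[
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
];

/// Normalize a Blocks.txt name by dropping all whitespace.
fn xsd_name(name: &str) -> String {
    name.chars().filter(|c| !c.is_whitespace()).collect()
}

/// Iterate over every block once and index it by its normalized name.
fn build_index() -> HashMap<String, (u32, u32)> {
    BLOCKS
        .iter()
        .map(|&(name, first, last)| (xsd_name(name), (first, last)))
        .collect()
}
```

This is why iteration matters for the use case: the index keyed by the space-free name can only be built if every block can be enumerated up front.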

@Manishearth
Member

One note is that we should document that unassigned code points may still report being in a block. This is not necessarily the behavior of all APIs I have seen that do this, but it is reasonable behavior. (And it's easy to get the other behavior by combining this with Assigned/gc.)
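To illustrate the distinction with a sketch: U+0378 is a reserved (unassigned) code point that nevertheless lies inside the Greek and Coptic block (U+0370..U+03FF). The names below are hypothetical, and `KNOWN_UNASSIGNED` is only a two-entry stand-in for a real General_Category lookup, so it is not a complete list of unassigned code points in that block.

```rust
/// Greek and Coptic block range from Blocks.txt.
const GREEK_AND_COPTIC: (u32, u32) = (0x0370, 0x03FF);

/// Stand-in for a General_Category (gc=Cn) check. Only two reserved code
/// points are listed; real data would come from the property tables.
const KNOWN_UNASSIGNED: &[u32] = &[0x0378, 0x0379];

/// Block membership alone: unassigned code points still count.
fn in_block(cp: u32) -> bool {
    (GREEK_AND_COPTIC.0..=GREEK_AND_COPTIC.1).contains(&cp)
}

/// The stricter behavior: in the block and not known to be unassigned.
fn in_block_and_assigned(cp: u32) -> bool {
    in_block(cp) && !KNOWN_UNASSIGNED.contains(&cp)
}
```

Keeping the block lookup independent of assignment lets callers choose either behavior by composing the two checks.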
