Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TimeZoneIdMapper to icu_timezone #4774

Merged
merged 35 commits into from May 23, 2024
Merged

Conversation

sffc
Copy link
Member

@sffc sffc commented Apr 4, 2024

Fixes #4031
Replaces #4548

This new all-in-one type supports all of the time zone ID operations we need, with some operations being asymptotically faster than others.

Not sure if we want to deprecate the old ones or keep them around, at least IanaBcp47RoundTripMapper. It's not completely obsolete since it has different performance characteristics.

Manishearth
Manishearth previously approved these changes Apr 5, 2024
@Manishearth
Copy link
Member

I'm fine with this landing as is but in the long run I want to make sure it is incredibly obvious which of these many mappers is right for the use case, via docs and naming and deprecation.

utils/zerotrie/src/cursor.rs Outdated Show resolved Hide resolved
components/timezone/src/ids.rs Outdated Show resolved Hide resolved
components/timezone/src/ids.rs Outdated Show resolved Hide resolved
components/timezone/src/ids.rs Outdated Show resolved Hide resolved
components/timezone/src/ids.rs Outdated Show resolved Hide resolved
components/timezone/src/ids.rs Outdated Show resolved Hide resolved
components/timezone/src/ids.rs Outdated Show resolved Hide resolved
components/timezone/src/ids.rs Show resolved Hide resolved
components/timezone/src/ids.rs Outdated Show resolved Hide resolved
components/timezone/src/ids.rs Outdated Show resolved Hide resolved
sffc and others added 7 commits April 15, 2024 11:15
Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>
Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>
@sffc sffc requested a review from robertbastian April 15, 2024 18:21
@sffc
Copy link
Member Author

sffc commented Apr 15, 2024

I addressed @robertbastian's feedback except the open question about the naming of NormalizedIana.

@sffc sffc requested a review from robertbastian April 18, 2024 22:59
@sffc
Copy link
Member Author

sffc commented Apr 18, 2024

Ok I think this is mergeable but I'm still unsure about the name TimeZoneIdMapperWithFastCanonicalization

robertbastian
robertbastian previously approved these changes Apr 19, 2024
Copy link

@justingrant justingrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments and suggestions, most with a common theme: most developers won't know how many IDs there are and how long they are, so providing some indication of the scale of the data may help developers make more informed perf and scale tradeoffs.

Feel free to ignore this feedback if it's not the right level of detail for Rust docs.


/// Returns the canonical, normalized IANA ID of the given BCP-47 ID.
///
/// This function performs a slow linear search. If this is problematic, consider one of the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's helpful to add context that "slow" here means searching through a list of a few hundred short strings. For users who don't know how many IANA IDs there are, that context may be helpful.

So I'd suggest quantifying the number of BCP-47 IDs to provide context.

Suggested change
/// This function performs a slow linear search. If this is problematic, consider one of the
/// This function performs a slow linear search. Note that there are fewer than 500 BCP-47 IDs (typically growing 1-3 per year) so a linear search may be acceptable for many use cases. But if this is problematic, consider one of the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm assuming that BCP-47 IDs are only for canonical IANA names not all IANA names. I realized after I reviewed that I wasn't 100% sure about that assumption.

/// 1. [`TimeZoneIdMapperBorrowed::canonicalize_iana()`]
/// is faster if you have an IANA ID.
/// 2. [`TimeZoneIdMapperWithFastCanonicalizationBorrowed::canonical_iana_from_bcp47()`]
/// is faster, but it requires loading additional data.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, can we quantify how much additional data? Many environments would be OK with 10K but not 10MB, for example, and developers who are unfamiliar with time zones may not know the scale of the data involved.

Similarly, some context about the CPU cost of loading that data would also be helpful. Is it a difficult index to create?

///
/// There is only one canonical name, which is "America/Indiana/Indianapolis". The
/// *canonicalization* operation returns the canonical name. You should canonicalize if you
/// need to compare time zones for equality or display the name to the user. Note that the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// need to compare time zones for equality or display the name to the user. Note that the
/// need to compare time zones for equality.

"or display the name to the user" is probably bad advice. Instead, users should generally see the normalized ID that was originally provided. By avoiding canonicalizing user-provided IDs, this insulates programs from renames like Kiev=>Kyiv so that existing code and data continues to work as-is with the old ID.

There may be some cases where auto-updating programs to the latest ID is desired behavior, but from our experience in ECMAScript these cases seem to be in the minority.

Note that "user" in this case assumes a technically-savvy user, because actual end-users should see the localized name, not the ID.

TimeZoneBcp47Id, TimeZoneError,
};

/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For developers unfamiliar with the details of the IANA Time Zone Database (which is most of them!), it may be helpful to understand the scale of the timezone ID data, in order to help them understand the perf and RAM consequences of various mapping options. Here's one possible way to do it. Feel free to ignore if this kind of info is not appropriate here.

Suggested change
/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers.
/// A mapper between IANA time zone identifiers and BCP-47 time zone identifiers.
///
/// There are about 600 IANA time zone identifiers, and fewer than 500 BCP-47
/// time zone identifiers.
///
/// BCP-47 time zone identifiers are 8 ASCII characters or less and currently
/// average 5.1 characters long. Current IANA time zone identifiers are less than
/// 40 ASCII characters and average 14.2 characters long.
///
/// These lists grow very slowly; in a typical year, 2-3 new identifiers are added.

/// - "America/Indianapolis"
/// - "US/East-Indiana"
///
/// There is only one canonical name, which is "America/Indiana/Indianapolis". The

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using "name" and "identifier" interchangeably. I'd suggest using only one term. My preference would be for "identifier" but "name" (for IANA, dunno about BCP-47) is also valid.

/// There is only one canonical name, which is "America/Indiana/Indianapolis". The
/// *canonicalization* operation returns the canonical name. You should canonicalize if you
/// need to compare time zones for equality or display the name to the user. Note that the
/// canonical name can change over time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might help to provide an example.

Suggested change
/// canonical name can change over time.
/// canonical name can change over time, for example the identifier Europe/Kiev
/// was renamed to the newly-added identifier Europe/Kyiv in 2022.

/// Normalization is a data-driven operation because there are no algorithmic casing rules that
/// work for all IANA time zone identifiers.
///
/// Normalization is a cheap operation, but canonicalization might be expensive. If you need

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to quantify "expensive" here, or at least clarify that "expensive" may not actually be that expensive relative to actual expensive operations? There's only 600 IDs, after all.

@justingrant
Copy link

One more thought: the index of a time zone id is <16 bits. Are there ever cases where we'd want to expose that index?

In particular I'm asking because for canonical IDs, there's a <= 8-byte key available with the BCP-47 ID which can fit into one 64-bit register or memory location. But for non-canonicalized IANA IDs, the entire string (up to 34 characters, AFAIK) needs to be stored. Is this OK?

@sffc
Copy link
Member Author

sffc commented Apr 24, 2024

One more thought: the index of a time zone id is <16 bits. Are there ever cases where we'd want to expose that index?

In particular I'm asking because for canonical IDs, there's a <= 8-byte key available with the BCP-47 ID which can fit into one 64-bit register or memory location. But for non-canonicalized IANA IDs, the entire string (up to 34 characters, AFAIK) needs to be stored. Is this OK?

Not a bad idea, but the data model doesn't currently have the concept of an index to a non-canonical identifier. I'm not really sure how to do that. It wasn't part of the design.

@sffc
Copy link
Member Author

sffc commented Apr 24, 2024

I think I addressed all of @justingrant's suggestions.

@sffc
Copy link
Member Author

sffc commented May 17, 2024

@justingrant says I can merge this PR since I have integrated his feedback.

@sffc sffc requested a review from echeran May 17, 2024 17:46
@sffc
Copy link
Member Author

sffc commented May 17, 2024

@echeran, this has already been approved as you can see above. I merged main and ran datagen. I need another approval in order to merge.

Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rslgtm

@robertbastian robertbastian merged commit d2c4f63 into unicode-org:main May 23, 2024
30 checks passed
@sffc sffc deleted the iana-canon-2 branch May 23, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support retrieval of normalize IANA names, including aliases
5 participants