Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BaseLanguageHandling option to Datagen #4606

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

sffc
Copy link
Member

@sffc sffc commented Feb 13, 2024

See #58

I implemented the new option as BaseLanguageHandling::Retain or BaseLanguageHandling::Strip.

@sffc sffc marked this pull request as ready for review February 14, 2024 00:20
@sffc sffc requested review from jedel1043 and removed request for a team and Manishearth February 14, 2024 00:20
@@ -370,23 +385,32 @@ impl DatagenDriver {
.iter()
.try_for_each(|(locale, (payload, _duration))| {
let mut iter = fallbacker_with_config.fallback_for(locale.clone());
let mut first = true;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The effect of first is that it prevents non-base locales such as tlh-001 from being retained. I'm not sure if this is the behavior that we desire; it only comes up when an unsupported locale is requested and that locale is not a base language. I lean toward removing the check on first, which would cause tlh-001 to be retained.

Copy link
Contributor

@jedel1043 jedel1043 Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to remove first, but we should probably print a warning alerting the user that the requested locale is not supported; if that's not the current behaviour. I think that should be enough indication that using the provider as is could return incorrect results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Today I'm thinking that we probably should not include either tlh-001 or arc since there is no CLDR data for them, and therefore they are not really "supported". I'll work on that tweak.

/// Sets the base language handling in runtime fallback mode.
///
/// For more details, see [`BaseLanguageHandling`].
pub fn with_base_language_handling(self, base_language_handling: BaseLanguageHandling) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: this option is only valid for a specific fallback mode. I'd prefer modelling it with a new fallback mode

Copy link
Contributor

@jedel1043 jedel1043 Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it applies for both Runtime and RuntimeManual. It also applies for PreferredForExporter in certain cases; an user who wants to enable resolvedLocale queries but doesn't want to pick a specific fallback type. This means we would need to add 3 more fallback variants + any other variant that is added in the future that is similar to Runtime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about PreferredForExporter. With that option, the exporter can state its preferred fallback mode, but that question is completely separate from whether to include the base languages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An exporter does not state its preferred fallback mode, it states supports_built_in_fallback. Datagen translates this to an actual mode, which we are free to change: true can be RuntimeThin, and false HybridThin, as you are currently making this the default already.

Comment on lines +228 to +230
/// By default, locales that inherit directly from "und" will be retained even if their data
/// is equal to "und". To override this behavior, see [`BaseLanguageHandling`].
///
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: I'm not convinced this is a good default. This increases baked data size, but supported locale queries are not a universal use case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to discuss the default. Given that it's needed for 402 and we've seen enough examples of use cases where people may want to do these queries out of the box, and the data size increase is fairly small (no regional variants), I lean toward including these by default. It's easy for power users to get the extra 5% data size benefit if they know what they are doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly lean towards the default of Strip. I think either way we will need careful documentation of supported locale queries and how they can work. We should do that documentation and clearly document the defaults.

provider/datagen/src/lib.rs Outdated Show resolved Hide resolved
@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Feb 14, 2024
@sffc
Copy link
Member Author

sffc commented Feb 22, 2024

Current:

  • Runtime
  • RuntimeManual
  • Hybrid
  • Preresolved

Proposed to add:

  • RuntimeWithBaseLanguages
  • RuntimeManualWithBaseLanguages

  • @sffc - How to we turn off regional variants?
  • @robertbastian - Maybe something like en-XX to not include regional variants, this woulud be more granular because you could do de,en-ZZ for all de variants and no en variants
  • @sffc - I think it should be a variant, because it can also apply to regional variants like en-001. So you may want en-001-nochilds, which includes en and en-001 but not en-GB.

Deduplicated / Runtime2 {
  use_internal_fallback: bool,
  include_base_languages: bool,
}

Now:

#[non_exhaustive]
pub enum FallbackMode {
    #[default]
    DefaultForProvider,
    // same as Deduplicated { RuntimeFallbackLocation::Internal, BaseLanguageHandling::Strip }
    Runtime,
    // same as Deduplicated { RuntimeFallbackLocation::External, BaseLanguageHandling::Strip }
    RuntimeManual,
    Hybrid,
    Preresolved,
    Deduplicated(Deduplicated),
}

2.0:

#[non_exhaustive]
pub enum LocaleExpansionMode {
    Deduplicated(Deduplicated),
    Exhaustive(Exhaustive),
    Preresolved(Preresolved),
}

#[non_exhaustive]
pub enum RuntimeFallbackLocation {
    Internal,
    External,
}

#[non_exhaustive]
pub enum BaseLanguageHandling {
    #[default]
    Retain,
    Strip,
}

#[non_exhaustive]
#[derive(Default)]
pub struct Deduplicated {
  runtime_fallback_location: Option<RuntimeFallbackLocation>,
  base_language_handling: BaseLanguageHandling,
}

// resolve this against the exporter like
foo.runtime_fallback_location.unwrap_or_else(|| 
    if exporter.supports_runtime_fallback() {
        RuntimeFallbackLocation::Internal
    } else { 
        RuntimeFallbackLocation::External 
    }
)

#[non_exhaustive]
pub struct Exhaustive {}

#[non_exhaustive]
pub struct Preresolved {}

LGTM: @Manishearth @sffc @robertbastian

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Feb 22, 2024
@sffc
Copy link
Member Author

sffc commented Feb 23, 2024

@robertbastian What do you want the CLI to look like in 1.5 and in 2.0?

@robertbastian
Copy link
Member

robertbastian commented Feb 23, 2024

I guess for now we flatten the enum like runtime-manual-retain and runtime-strip?

@sffc
Copy link
Member Author

sffc commented Feb 23, 2024

If we are going to end up with a base_language_handling option in the Deduplicated enum, then maybe it makes sense that it should be its own flag on the CLI.

@robertbastian
Copy link
Member

It doesn't, because then someone will set base_language_handling for hybrid or preresolved and that doesn't make sense.

@sffc
Copy link
Member Author

sffc commented Feb 23, 2024

We have a lot of flags that are only used if some other flag is set. This would be the same.

@sffc
Copy link
Member Author

sffc commented Feb 24, 2024

@robertbastian I would like to merge this as-is because:

  1. The final 2.0 CLI option should indeed be named --base-language-handling because that is how we handle all other cases where flags are not used depending on the value of other flags. If you really want I could make --base-language-handling return an error if set with the wrong value for --fallback.
  2. In the API, we are changing it anyway in 2.0 as discussed above. The change proposed is the smallest change to the API. I don't want to implement only half of the 2.0 changes now. I would rather we, in another PR, implement LocaleExpansionMode as designed above. We could even do that in 1.5 and deprecate FallbackMode. In that PR, we could delete any unreleased APIs such as this one that are covered by LocaleExpansionMode.

@jedel1043
Copy link
Contributor

With my engine maintainer hat on, I have to agree with @sffc. Boa will require all engine embedders to use this feature, so from a docs perspective it's easier to guide our users to activate a simple flag than to explain to them which fallback variants are supported and how to activate them.

Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still consider this additional flag tech debt.

#[arg(long, value_enum, default_value_t = BaseLanguageHandling::Auto)]
#[arg(
help = "--fallback=runtime[-manual] only: Whether to retain or strip base languages. \
For details, read the Rust docs for BaseLanguageHandling."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's call the local one BaseLanguageHandlingArg, to avoid confusion when looking at the code. The one above has no docs.

Alternatively give it docs saying to look at the type in the config module.

Comment on lines +228 to +230
/// By default, locales that inherit directly from "und" will be retained even if their data
/// is equal to "und". To override this behavior, see [`BaseLanguageHandling`].
///
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly lean towards the default of Strip. I think either way we will need careful documentation of supported locale queries and how they can work. We should do that documentation and clearly document the defaults.

Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of discussion going on in #4629 which I'd like to resolve before merging anything.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Mar 7, 2024
@sffc sffc marked this pull request as draft April 10, 2024 07:53
@sffc
Copy link
Member Author

sffc commented Apr 10, 2024

Need to re-do this on top of #4710

@jedel1043
Copy link
Contributor

Probably super seeded by #4836.

@sffc
Copy link
Member Author

sffc commented Apr 25, 2024

Probably super seeded by #4836.

Things this PR does that #4836 didn't do:

  • Added to CHANGELOG.md
  • Added "tlh-001" to the options test (to test an unsupported non-base locale)
  • Made retaining base languages the default option

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Discuss at a future ICU4X-SC meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants