Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split icu_datagen into a crate for the driver and a crate for the provider #4721

Open
robertbastian opened this issue Mar 22, 2024 · 6 comments
Labels
A-data Area: Data coverage or quality C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting S-medium Size: Less than a week (larger bug fix or enhancement)
Milestone

Comments

@robertbastian
Copy link
Member

robertbastian commented Mar 22, 2024

These two components are fairly independent, and there is now a use case for the datagen driver that doesn't use the CLDR/ICU backed provider (icu_datagen_dart). It would be nice to be able to build it without pulling in all the logic and dependencies for parsing CLDR.

Idea:

  • icu_datagen - DatagenDriver and options
  • icu_provider_source - DatagenProvider
    • type could be renamed to SourceDataProvider
    • this will have most dependencies (zip, wasm, ureq etc.)
  • icu4x-datagen - The CLI, which depends on both crates
    • I think the current state where cargo install icu_datagen installs a binary called icu4x-datagen is confusing
    • This is currently pretty much a separate crate anyway, it's a binary in a library crate, and has its own dependencies (bin feature)
  • icu_datagen_dart depends on icu_datagen and icu_provider_blob, making it a lot more lightweight
  • In complicated build systems, it will be preferrable to use the icu_datagen_dart approach, i.e. have one universal blob generated from sources (long build time and long gen time, but shared), and then filter that down for use cases (short build time and short gen time).
@robertbastian robertbastian added C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) A-data Area: Data coverage or quality labels Mar 22, 2024
@robertbastian robertbastian added this to the ICU4X 2.0 milestone Mar 22, 2024
@robertbastian robertbastian added the discuss Discuss at a future ICU4X-SC meeting label Mar 22, 2024
@robertbastian
Copy link
Member Author

robertbastian commented Mar 22, 2024

Actually, data "generation" happens in the provider crate, so maybe the driver crate should be icu_export (with ExportDriver), and the provider crate icu_provider_datagen (icu_datagen if we don't want to retire the crate name, however all our crates that define providers start with icu_provider_)

@sffc
Copy link
Member

sffc commented Mar 25, 2024

The most modular setup would probably be

  1. icu_datagen_transform defines DatagenProvider
  2. icu_datagen_driver defines DatagenDriver
  3. icu_datagen defines the icu4x-datagen binary

@robertbastian
Copy link
Member Author

That's my proposal, just with different names.

@robertbastian
Copy link
Member Author

We could also move icu_provider::datagen into the driver crate.

@robertbastian
Copy link
Member Author

robertbastian commented Apr 11, 2024

Latest proposal:

  • A crate that contains DatagenDriver, icu_provider::datagen::{ExportMarker, ExportableDataProvider, ...}, the registry (make_exportable_provider!), BakedExporter
    • Desired name: icu_export
    • Also would like to rename DatagenDriver -> ExportDriver
  • A crate that contains DatagenProvider
    • Desired name: icu_provider_datagen
    • Alternative names: icu_datagen, icu_provider_source
  • A crate that contains icu4x-datagen
    • Desired name: icu4x-datagen
  • A crate that contains a CLI that uses a blob instead of DatagenProvider
    • Possible name: icu4x-reexport, icu4x-data-transform
    • We might be able to model this with Cargo features in icu4x-datagen instead

@sffc
Copy link
Member

sffc commented Apr 15, 2024

ICU4X-WG discussion:

  • @sffc - Main advantage, besides modularity, is reducing dependencies. The registry needs to depend on all of icu4x though. Can we put the registry in the metacrate?
  • @robertbastian - Maybe, but the registry helps implement ExportMarker and things that are datagen-specific.
  • @zbraniecki - If you find a crate icu_export, it's not clear that it a prover crate and not a component crate. It looks like a component called "export".
  • @robertbastian - Good point, we should name it icu_provider_export.
  • @sffc - Do we care at all about the binary name and crate name matching?
  • everyone except for @roberbastian - no opinion
  • @sffc - You can pass callbacks to other macros. This should allow us to have the registry in the metacrate. Playground

Macro structure brainstorm:

macro_rules! make_exportable {
    ([$($marker:path,)*], [$($experimental_marker,)*]) => {
        #[cfg(feature = "experimental_components")]
        icu_provider::make_exportable_provider!([$($marker,)* $($experimental_marker,)*]);
        #[cfg(not(feature = "experimental_components"))]
        icu_provider::make_exportable_provider!([$($marker,)*]);
    }
}

// uses call-site Cargo features
registry!(make_exportable);

macro_rules cb {
    ($($marker:path),*)) => {
        fn all_keys() -> ... {
            HashSet::from_iter([
                $($marker),* $<marker>::KEY.path()
            ])
        }
    }
}
  • icu crate
    • registry
    • all_stable_keys()
    • #[cfg(feature = "experimental")] all_experimental_keys()
    • #[cfg(feature = "experimental")] all_keys() (maybe)
    • key(str)
  • icu_provider
    • icu_provider::datagen::*
  • icu_datagen
    • ExportDriver (needs rayon, fallback)
    • baked_exporter module (feature-gated)
    • icu_provider_blob::export (feature-gated)
    • icu_provider_fs::export (feature-gated)
  • icu_provider_source
    • SourceDataProvider
    • depends on icu (registry) and icu_provider (for _::datagen::*) to implement icu_provider::datagen::ExportableProvider
    • depends on cpt builder (wasmer), zip, ureq, etc.
    • lots of cargo features
  • icu4x-datagen
    • Pure binary crate (never appears in a Cargo.toml file)
    • depends on icu_datagen, icu_provider_source, icu (all_stable_keys, all_experimental_keys, key)
    • duplicates all of icu_provider_source's cargo features
      • Cargo.toml: use_wasm = ["icu_provider_source?/use_wasm"]
      • maybe refuse to install if feature combination doesn't make sense
    • without icu_provider_source feature, allows blob inputs (current icu_datagen_dart)

Shane's version:

  • icu metacrate as above
  • icu_provider_transform
    • As above, no dependency on driver
  • icu_datagen
    • Driver
    • Optional dep on icu_provider_transform
    • icu4x-datagen binary

Conclusion:

  • The crate called icu_datagen will have stuff pulled out from it:
    • In 1.5: DatagenProvider goes behind a feature. The feature impacts binary behavior. We could fail to install the binary if the feature combinations are incompatible via a feature-gated compile error.
    • In 1.5: Implement the registry! macro as designed above; keep in icu_datagen
    • In 2.0: DatagenProvider gets pulled out into its own crate (names to be bikeshed) and icu_datagen does not depend on it
    • In 2.0: Move the registry! macro to icu metacrate
    • In 2.0: icu4x-datagen gets pulled out into a binary-only crate called icu4x-datagen

LGTM: @robertbastian @sffc

robertbastian added a commit that referenced this issue Apr 17, 2024
@robertbastian robertbastian removed the discuss-priority Discuss at the next ICU4X meeting label Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-data Area: Data coverage or quality C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting S-medium Size: Less than a week (larger bug fix or enhancement)
Projects
Status: No status
Development

No branches or pull requests

2 participants