add support for data installations #4474

Open: wants to merge 21 commits into base: 5.0.x

Contributor

@smoors smoors commented Mar 3, 2024

motivation

  • leverage EB to install data in a standardized way with proper versioning and checksumming
  • support adding datasets as dependencies of software
  • easily swap dataset versions with ml swap

changes

  • add cmd line option --installpath-data similar to --installpath-software
  • add cmd line option --subdir-data (default = data) similar to --subdir-software
  • add cmd line option --sourcepath-data similar to --sourcepath
  • add Easyconfig parameter data_sources similar to sources
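
A dataset easyconfig using the new parameter might look roughly like the sketch below. Easyconfigs use Python syntax. Every value here (name, version, homepage, file name, checksum placeholder) is hypothetical. The easyblock is left out on purpose because the description above does not name one. Pairing checksums with data_sources is an assumption based on the checksumming goal listed under motivation.

```python
# Hypothetical easyconfig sketch for a dataset (illustrative values only).
name = 'ExampleDataset'
version = '2024.03'

homepage = 'https://example.org/dataset'
description = "Placeholder dataset illustrating the data_sources parameter"

# assumption: a data installation compiles nothing, so the system toolchain is used
toolchain = {'name': 'system', 'version': 'system'}

# data_sources plays the role that sources plays for software
data_sources = ['%s-%s.tar.gz' % (name, version)]

# placeholder only; a real easyconfig would list the SHA256 of the tarball
checksums = ['<sha256 checksum of ExampleDataset-2024.03.tar.gz>']
```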

design

  • the main reason for a separate subdir_data is reusability: in contrast to software, data does not have to be rebuilt/reinstalled when, for example, upgrading the OS or building for a new architecture
  • the reason for a separate sourcepath_data is that datasets can be very large, so you may want to store them on a different file system or in a different location
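
How the data install prefix could be derived can be sketched as follows. This assumes --installpath-data and --subdir-data behave like their software counterparts: an explicit --installpath-data value is used as-is, otherwise --installpath is joined with the subdir. The function install_path below is an illustrative stand-alone helper, not the code in this PR.

```python
import os

# default subdirectory names; 'data' is the default stated for --subdir-data
DEFAULT_SUBDIRS = {"software": "software", "data": "data"}


def install_path(kind, installpath, installpath_override=None, subdir=None):
    """Return the install prefix for 'software' or 'data' (hypothetical helper)."""
    if kind not in DEFAULT_SUBDIRS:
        raise ValueError("unknown installation type: %s" % kind)
    # assumption: an explicit --installpath-<kind> is used without appending a subdir
    if installpath_override is not None:
        return installpath_override
    # otherwise combine the generic prefix with the per-type subdirectory
    return os.path.join(installpath, subdir or DEFAULT_SUBDIRS[kind])


# software prefix changes per OS/architecture, data prefix stays shared
software_prefix = install_path("software", "/apps/el9/zen4")
data_prefix = install_path("data", "/apps/el9/zen4",
                           installpath_override="/shared/datasets")
```

Under these assumptions, pointing the data prefix at a shared location means a dataset installed once stays usable when the software prefix changes for a new OS or architecture.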

@smoors smoors added the feature label Mar 3, 2024
@boegel boegel added this to the 4.x milestone Mar 13, 2024
Contributor

@Micket Micket left a comment

Tests need adjusting:

  • test_prefix_option
  • test_show_config

@@ -92,6 +92,7 @@
'checksums': [[], "Checksums for sources and patches", BUILD],
'configopts': ['', 'Extra options passed to configure (default already has --prefix)', BUILD],
'cuda_compute_capabilities': [[], "List of CUDA compute capabilities to build with (if supported)", BUILD],
'data_sources': [[], "List of source files for data", BUILD],
Member

Is there a need to separate data_sources from sources?

Contributor Author

data sources can be very big so i think it's good to at least have an option to separate them.

we can set data_sources equal to sources by default, would you prefer that?

Contributor Author

done in 2e57ad1 8dbf7d3

Contributor Author

hm, i see now that i misunderstood you.
the separate parameter data_sources was suggested by @boegel to make it clear they are different from software sources.
there is no real need for it, it's just cosmetic. i'm not sure if it's a good idea, happy to revert if you prefer.

Contributor Author

@smoors smoors May 19, 2024

as discussed during the EUM, we should keep data_sources to allow installing software and datasets in a single easyconfig (e.g. using components) if we want to add support for this later on. this PR does not support this, as it requires substantive changes and i'm unsure how useful it is. i've modified this PR to make it easier to implement support for it if/when desired.

Member

boegel commented May 22, 2024

@smoors Let's re-target this to 5.0.x / EasyBuild 5.0, not because it involves breaking changes, but because it can serve as a "flagship" feature of EasyBuild 5.0 (seems a bit too much to introduce in a 4.9.x bug fix release)?

@boegel boegel added the EasyBuild-5.0 EasyBuild 5.0 label May 22, 2024
@smoors smoors changed the base branch from develop to 5.0.x May 23, 2024 19:28