
Implement TaskLoader.save when instantiated with xarray/pandas objects #84

Open
tom-andersson opened this issue Oct 19, 2023 · 0 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)


Summary

Currently, the TaskLoader can only be `.save`d when it has been initialised with file paths in its context and target entries, not with in-memory xarray/pandas objects. This forces the user to save their normalised xarray/pandas data themselves, even when they may not care where that data lives. For example:

```python
data_processor = DataProcessor(...)
da_normalised = data_processor(da_raw)
data_processor.save("folder")
da_normalised.save("fpath.nc")  # We could potentially bypass this...
task_loader = TaskLoader(context="fpath.nc", target="fpath.nc")  # ...by instead initialising with raw xarray/pandas here...
task_loader.save("folder")  # ...and then `.save` would save the raw data objects alongside the TaskLoader config
```

We could instead initialise the TaskLoader in the typical way with raw xarray/pandas objects (which is more intuitive than file paths), and then have `TaskLoader.save` write those variables to disk alongside the JSON config (with the context/target file paths populated).
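A minimal sketch of what this could look like, independent of the actual deepsensor internals (the function name, config filename, and dispatch logic here are all assumptions, not the real API):

```python
import json
import os
import tempfile


def save_task_loader(folder, context, target):
    """Hypothetical sketch (not the real TaskLoader.save): write each in-memory
    context/target object to `folder` and record the resulting file paths in a
    JSON config, so the TaskLoader can later be re-instantiated from disk."""
    os.makedirs(folder, exist_ok=True)
    config = {"context": [], "target": []}
    for key, objs in [("context", context), ("target", target)]:
        for i, obj in enumerate(objs):
            if hasattr(obj, "to_netcdf"):  # xarray Dataset/DataArray
                fpath = os.path.join(folder, f"{key}_{i}.nc")
                obj.to_netcdf(fpath)
            elif hasattr(obj, "to_csv"):  # pandas DataFrame/Series
                fpath = os.path.join(folder, f"{key}_{i}.csv")
                obj.to_csv(fpath)
            else:  # already a file path: record it unchanged
                fpath = obj
            config[key].append(fpath)
    with open(os.path.join(folder, "task_loader_config.json"), "w") as f:
        json.dump(config, f)
    return config


# Demo with a minimal stand-in for a pandas object, so the sketch runs
# without any third-party dependencies:
class FakeSeries:
    def to_csv(self, fpath):
        with open(fpath, "w") as f:
            f.write("t,x\n0,1\n")


folder = tempfile.mkdtemp()
config = save_task_loader(folder, context=[FakeSeries()], target=["existing.nc"])
print(os.path.basename(config["context"][0]))  # → context_0.csv
print(config["target"][0])  # → existing.nc
```

The key design point is that entries which are already file paths pass through untouched, so the saved config is uniform regardless of how the TaskLoader was instantiated.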

This FR should only be implemented after #82 is closed. We don't want to save the same data multiple times just because it appears multiple times in the context and/or target entries, so we'll want to leverage whatever internal `TaskLoader` data structure is added to close #82.

Basic Example

If this feature were implemented, we'd be able to do:

```python
data_processor = DataProcessor(...)
da_normalised = data_processor(da_raw)
data_processor.save("folder")
task_loader = TaskLoader(context=da_normalised, target=da_normalised)
task_loader.save("folder")  # This saves the context and target data as NetCDF/CSV in `"folder"`
```

See the note above: we would not want to save two NetCDF files in this case, because the context and target entries point at the same object.
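The deduplication this implies can be sketched as follows, assuming the internal structure added for #82 exposes the data objects as a flat list (object identity is used here as a stand-in "same object" check):

```python
def unique_data_objects(entries):
    """Return the entries deduplicated by object identity, preserving order,
    so each unique context/target object is written to disk only once."""
    seen = set()
    unique = []
    for obj in entries:
        if id(obj) not in seen:
            seen.add(id(obj))
            unique.append(obj)
    return unique


# The same object passed as both context and target is only saved once:
da_normalised = {"name": "da_normalised"}  # stand-in for an xarray object
context, target = [da_normalised], [da_normalised]
to_save = unique_data_objects(context + target)
print(len(to_save))  # → 1
```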

Drawbacks

The user might not realise that `task_loader.save` writes data to disk, which is especially risky with very large NetCDF data and limited disk space. The documentation will need to be clear that this is what happens under the hood.
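Beyond documentation, one possible mitigation is a guard that warns before writing large objects (the helper name and threshold are hypothetical; `nbytes` is a real attribute on xarray objects):

```python
import warnings


def warn_if_large(obj, limit_bytes=10**9):
    """Hypothetical guard for TaskLoader.save: warn before writing a large
    data object to disk. xarray DataArray/Dataset expose `nbytes`."""
    nbytes = getattr(obj, "nbytes", 0)
    if nbytes > limit_bytes:
        warnings.warn(
            f"TaskLoader.save is about to write ~{nbytes / 1e9:.1f} GB to disk"
        )
    return nbytes


# Demo with a stand-in object reporting 2 GB of data:
class BigData:
    nbytes = 2 * 10**9


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_large(BigData())
print(len(caught))  # → 1
```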

Unresolved questions

No response

Implementation PR

No response

Reference Issues

No response
