Fix non-usage of Dataset derived_from and derivations fields #606

Open
robertbartel opened this issue May 7, 2024 · 3 comments
Labels: bug (Something isn't working), maas (MaaS Workstream)

Comments

@robertbartel
Contributor

The Dataset class has two fields - derived_from and derivations - intended to track relationships between datasets when a dataset is derived from one or more others. But these fields aren't being used at all currently. In particular, they are not used by DataDeriveUtil.

Related code should be fixed to properly update these for datasets when a dataset is derived. Code for when datasets are removed will also need to account properly for these fields, as will code related to temporary datasets. It is likely we want protections in place to prevent derived-from datasets from expiring or being easily removed.
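As a rough illustration of the bookkeeping described above, here is a minimal sketch. `DatasetRecord` and `DatasetRegistry` are hypothetical stand-ins invented for this example; they are not the actual Dataset class or DataDeriveUtil, and the real fields may hold something other than dataset names. The sketch only shows the two sides of the relationship being updated together, plus one possible removal guard.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DatasetRecord:
    """Hypothetical stand-in for Dataset; only the relationship fields are modeled."""
    name: str
    derived_from: Set[str] = field(default_factory=set)  # names of source datasets
    derivations: Set[str] = field(default_factory=set)   # names of datasets derived from this one


class DatasetRegistry:
    """Hypothetical in-memory registry that keeps both sides of the relationship in sync."""

    def __init__(self) -> None:
        self._datasets: Dict[str, DatasetRecord] = {}

    def add(self, name: str) -> DatasetRecord:
        if name in self._datasets:
            raise ValueError(f"dataset {name!r} already exists")
        self._datasets[name] = DatasetRecord(name)
        return self._datasets[name]

    def get(self, name: str) -> DatasetRecord:
        return self._datasets[name]

    def derive(self, new_name: str, *source_names: str) -> DatasetRecord:
        # Look up every source first so an unknown name fails before anything is modified.
        sources = [self._datasets[s] for s in source_names]
        derived = self.add(new_name)
        for source in sources:
            derived.derived_from.add(source.name)
            source.derivations.add(derived.name)
        return derived

    def remove(self, name: str) -> None:
        record = self._datasets[name]
        # Assumed policy: a dataset that others were derived from cannot be removed.
        if record.derivations:
            raise ValueError(f"{name!r} still has derivations: {sorted(record.derivations)}")
        # Drop the back-references held by this dataset's sources.
        for source_name in record.derived_from:
            self._datasets[source_name].derivations.discard(name)
        del self._datasets[name]


registry = DatasetRegistry()
registry.add("hydrofabric")
registry.add("realization_config")
registry.derive("bmi_init_config", "hydrofabric", "realization_config")
```

With this policy, removing "hydrofabric" raises until "bmi_init_config" has been removed. Whether removal should be blocked outright or merely made harder, and how expiring temporary datasets fit in, is a design choice this sketch does not settle.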

robertbartel added the bug and maas labels May 7, 2024
@aaraney
Member

aaraney commented May 13, 2024

What are your thoughts on requiring object versioning at the storage level to support this feature? It's standard for S3 and other object stores, including MinIO, to have some form of object versioning. If we can push this concern to a lower layer, I think it will make modeling the problem much easier.

@robertbartel
Contributor Author

> What are your thoughts on requiring object versioning at the storage level to support this feature? It's standard for S3 and other object stores, including MinIO, to have some form of object versioning. If we can push this concern to a lower layer, I think it will make modeling the problem much easier.

First, I don't follow how this would work. Say we automatically generate a BMI init config dataset: its derived_from should include both the hydrofabric dataset and the realization config. The derivation process doesn't (necessarily) produce updated versions of the original objects.

Second, I've got some concerns about coupling too tightly to object stores. Even if we could somehow use them to address this particular issue, a side effect would be needing to abstract Dataset and implement an object-store-specific subclass. And regardless, it would complicate adopting other dataset storage types in the future. There are probably ways to deal with those things, but that discussion seems like more than what belongs here.

@aaraney
Member

aaraney commented May 13, 2024

> First, I don't follow how this would work. For, say, a BMI init config dataset that we automatically generated, it should include both the hydrofabric dataset and the realization config in its derived_from. The derivation process doesn't (necessarily) produce updated versions of the original objects.

Assumptions:

  • Once a dataset is read-only, it can never be moved back to a mutable state.
  • It is valid to derive a dataset from an existing mutable dataset.

In the purest sense, if you create an object in dataset B that is derived from dataset A, the source dataset, A, should be marked (or partially marked) as read-only to codify the derived_from relationship. Otherwise, you could derive an object from dataset A and then remove or mutate the data in A such that the derivation relationship is no longer meaningful.

If object versioning is enabled, dataset B just needs to track the versions of the objects in A used to derive objects in B. Dataset A is still free to modify its contents without affecting the derived_from relationship between B and A.
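To make the version-tracking idea concrete, here is a toy sketch. `VersionedStore` and `DerivedDataset` are hypothetical names made up for this example; the store is an in-memory dict rather than a real S3 or MinIO client, and integer version ids are a simplification (real object stores typically hand back opaque version id strings).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

ObjectRef = Tuple[str, str]  # (dataset name, object key)


class VersionedStore:
    """Hypothetical toy store: every put() appends a new immutable version."""

    def __init__(self) -> None:
        self._objects: Dict[ObjectRef, List[bytes]] = {}

    def put(self, dataset: str, key: str, data: bytes) -> int:
        versions = self._objects.setdefault((dataset, key), [])
        versions.append(data)
        return len(versions) - 1  # id of the version just written

    def get(self, dataset: str, key: str, version: Optional[int] = None) -> bytes:
        versions = self._objects[(dataset, key)]
        # No version requested means "latest"; otherwise return the pinned version.
        return versions[-1] if version is None else versions[version]


@dataclass
class DerivedDataset:
    """Hypothetical dataset B that pins the exact source object versions it used."""
    name: str
    derived_from: Dict[ObjectRef, int] = field(default_factory=dict)

    def pin(self, dataset: str, key: str, version: int) -> None:
        self.derived_from[(dataset, key)] = version


store = VersionedStore()
v0 = store.put("A", "catchments.geojson", b"original")

b = DerivedDataset("B")
b.pin("A", "catchments.geojson", v0)

# Dataset A stays mutable: a later write creates a new version without touching v0.
store.put("A", "catchments.geojson", b"modified")
```

After the second write, reading the pinned version still returns the original bytes while an unversioned read returns the modified ones. Note that the object key and dataset names here are made up, and deleting old versions in A would still break the relationship unless the store is configured to retain them.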

Alternatively, it is simpler if we only allow deriving from read-only datasets. That avoids the aforementioned problem.
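A minimal sketch of that simpler rule, under the first assumption above (read-only is one-way) and dropping the second. `SimpleDataset`, `freeze`, and `derive` are hypothetical names for illustration only, not the existing Dataset API.

```python
from typing import Set


class SimpleDataset:
    """Hypothetical stand-in for Dataset with a one-way read-only flag."""

    def __init__(self, name: str, read_only: bool = False) -> None:
        self.name = name
        self._read_only = read_only
        self.derived_from: Set[str] = set()
        self.derivations: Set[str] = set()

    @property
    def is_read_only(self) -> bool:
        return self._read_only

    def freeze(self) -> None:
        """Make the dataset read-only; there is deliberately no inverse operation."""
        self._read_only = True


def derive(new_name: str, *sources: SimpleDataset) -> SimpleDataset:
    """Create a derived dataset, refusing any source that is still mutable."""
    mutable = [s.name for s in sources if not s.is_read_only]
    if mutable:
        raise ValueError(f"cannot derive from mutable dataset(s): {', '.join(mutable)}")
    derived = SimpleDataset(new_name)
    for source in sources:
        derived.derived_from.add(source.name)
        source.derivations.add(new_name)
    return derived


hydrofabric = SimpleDataset("hydrofabric")
realization = SimpleDataset("realization_config", read_only=True)

hydrofabric.freeze()  # must happen before deriving, or derive() raises ValueError
bmi_init = derive("bmi_init_config", hydrofabric, realization)
```

The trade-off is that a caller has to freeze a source before deriving from it, but in exchange no version tracking is needed and the check does not depend on the storage backend.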

2 participants