Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use user defined default dataset factory pattern over the one from the runner #3859

Merged
merged 9 commits into from May 14, 2024

Conversation

ankatiyar
Copy link
Contributor

@ankatiyar ankatiyar commented May 8, 2024

Description

Fix #3720

Development notes

TODO:

  • Update DataCatalog code
  • Update CLI commands code
  • Docs
  • Test
  • Release notes

Wanted to get opinions on the proposed solution before proceeding with the pending tasks mentioned above^

Proposed Solution

When DataCatalog is created (in from_config)

  • If the last pattern in the sorted patterns list is a catch-all pattern i.e. has specificity == 0, it is popped out and stored separately as the DataCatalog._default_pattern (This would be the user-defined default pattern, not from the runner)

Note

This assumes that there is only one catch-all pattern in the user's catalog. If there are more, the last one will be popped which wouldn't ever be used anyway.❓ [Question for reviewers]: We should probably error out if the user has more than one catch-all patterns in this case, does that make sense?

When a dataset is being fetched

  • Try to match is against the patterns in DataCatalog._dataset_patterns OR DataCatalog._default_pattern
  • Resolve config again first from _dataset_patterns OR _default_pattern

When runner.run() is called

When kedro run is executed, the runner creates a shallow copy of the catalog object by calling the DataCatalog.shallow_copy()

  • Here the runner inserts a default dataset pattern e.g. extra_dataset_patterns = {'{default}': MemoryDataset}
  • IF there is already a _default_pattern - it's not needed to do anything
  • IF there isn't a user defined catch-all pattern - add it to the DataCatalog._dataset_patterns (this will then not emit a warning about the catch-all pattern being used over the runner default dataset)

For the catalog CLI commands

Similar logic as above has been copied for the matching of datasets to patterns

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@noklam
Copy link
Contributor

noklam commented May 9, 2024

[Question for reviewers]: We should probably error out if the user has more than one catch-all patterns in this case, does that make sense?

Yes

data_catalog._dataset_patterns[matched_pattern]
data_catalog._dataset_patterns.get(matched_pattern)
or data_catalog._default_pattern.get(matched_pattern)
or {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this case possible? If matches are found doesn't it mean that one of data_catalog._dataset_patterns or data_catalog._default_pattern should have it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't be by mypy was complaining 🫡

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked through the approach suggested and it seems good to me.

Agree that we should validate if user has more than one catch-all patterns.

Left one question regarding the implementation though it's quite hard to validate it without tests. 🙂

This ticket also made me think that it would be good to have this resolution logic in one place, but that's ofc out of the scope now.

@noklam
Copy link
Contributor

noklam commented May 10, 2024

Just dumping my thought here in case I am blocking the PRs. I have some questions about the shallow_copy approach, but that was done long before this PR so I don't want to expand the discussion too much.

I think:

  • There should be only 1 default dataset, validation should happen when DataCatalog is initialised. (I think this is consistent as the shallow_copy will create a new DataCatalog, this is where I find the shallow_copy weird as it does not seem to be a reference but completely new object
  • multiple dataset pattern by Runner is fine. As long as only one of them is default datasets. For example, you could have different default for different namespaces which control by runner.

The most important question here is "should runner dataset patterns override DataCatalog"? In framework setting, the way to override patterns is catalog.yml. However, the order of how framework work is different.

  1. DataCatalog read catalog.yml first
  2. Runner override default patterns (memory dataset)

This suggests that one should only use either catalog.yml or runner. Should it fails or just ignore the dataset pattern from runner? For me, I think I will go for raising an error instead of ignoring the pattern sliently.

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@ankatiyar ankatiyar marked this pull request as ready for review May 13, 2024 08:34
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, but generally the approach looks good 👍

kedro/io/data_catalog.py Show resolved Hide resolved
]
if len(catch_all) > 1:
raise DatasetError(
f"Multiple catch-all patterns found in the catalog: {', '.join(catch_all)}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be helpful to add here how to resolve the problem by making sure only one catch-all exists in the catalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

tests/io/test_data_catalog.py Show resolved Hide resolved
ankatiyar and others added 2 commits May 14, 2024 11:49
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one minor comment, the rest LGTM! 🚀

@@ -223,19 +223,14 @@ The matches are ranked according to the following criteria:
You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataset`](/api/kedro.io.MemoryDataset) creation.

```yaml
"{a_default_dataset}":
"{default_dataset}":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot comment on the target line as it wasn't changed but we need to update the filepath as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Updated!

ankatiyar and others added 2 commits May 14, 2024 13:23
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@ankatiyar ankatiyar enabled auto-merge (squash) May 14, 2024 12:25
@ankatiyar ankatiyar merged commit 9369227 into main May 14, 2024
41 checks passed
@ankatiyar ankatiyar deleted the dataset-factories-default branch May 14, 2024 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prioritise user created catch-all dataset factory pattern over {default} from runners
4 participants