New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use user defined default dataset factory pattern over the one from the runner #3859
Conversation
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Yes |
data_catalog._dataset_patterns[matched_pattern] | ||
data_catalog._dataset_patterns.get(matched_pattern) | ||
or data_catalog._default_pattern.get(matched_pattern) | ||
or {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this case possible? If matches are found doesn't it mean that one of data_catalog._dataset_patterns
or data_catalog._default_pattern
should have it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wouldn't be by mypy was complaining 🫡
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked through the approach suggested and it seems good to me.
Agree that we should validate if user has more than one catch-all patterns.
Left one question regarding the implementation though it's quite hard to validate it without tests. 🙂
This ticket also made me think that it would be good to have this resolution logic in one place, but that's ofc out of the scope now.
Just dumping my thought here in case I am blocking the PRs. I have some questions about the I think:
The most important question here is "should runner dataset patterns override
This suggests that one should only use either |
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, but generally the approach looks good 👍
kedro/io/data_catalog.py
Outdated
] | ||
if len(catch_all) > 1: | ||
raise DatasetError( | ||
f"Multiple catch-all patterns found in the catalog: {', '.join(catch_all)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful to add here how to resolve the problem by making sure only one catch-all exists in the catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one minor comment, the rest LGTM! 🚀
@@ -223,19 +223,14 @@ The matches are ranked according to the following criteria: | |||
You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataset`](/api/kedro.io.MemoryDataset) creation. | |||
|
|||
```yaml | |||
"{a_default_dataset}": | |||
"{default_dataset}": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot comment on the target line as it wasn't changed but we need to update the filepath
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Updated!
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Description
Fix #3720
Development notes
TODO:
Wanted to get opinions on the proposed solution before proceeding with the pending tasks mentioned above^
Proposed Solution
When
DataCatalog
is created (infrom_config
)DataCatalog._default_pattern
(This would be the user-defined default pattern, not from the runner)Note
This assumes that there is only one catch-all pattern in the user's catalog. If there are more, the last one will be popped which wouldn't ever be used anyway.❓ [Question for reviewers]: We should probably error out if the user has more than one catch-all patterns in this case, does that make sense?
When a dataset is being fetched
DataCatalog._dataset_patterns
ORDataCatalog._default_pattern
_dataset_patterns
OR_default_pattern
When runner.run() is called
When
kedro run
is executed, the runner creates a shallow copy of the catalog object by calling theDataCatalog.shallow_copy()
extra_dataset_patterns = {'{default}': MemoryDataset}
_default_pattern
- it's not needed to do anythingDataCatalog._dataset_patterns
(this will then not emit a warning about the catch-all pattern being used over the runner default dataset)For the catalog CLI commands
Similar logic as above has been copied for the matching of datasets to patterns
Checklist
RELEASE.md
file