Added 'data_dict' attribute (DataDictDataset) to AbstractVersionedDat… #3737

noamgoldberg · 2024-03-25T20:46:45Z

…aset

Description

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

astrojuanlu · 2024-03-26T07:13:06Z

Hi @noamgoldberg, thanks for your PR! Could you explain the rationale behind this? What problem does it solve?

noamgoldberg · 2024-03-26T18:27:49Z

> Hi @noamgoldberg, thanks for your PR! Could you explain the rationale behind this? What problem does it solve?

I use Kedro a lot for personal projects, and I find it helpful to have a data dictionary attached to large datasets. For example, I like to create a data_dict.yml with feature descriptions, ranges, and general source information, which I can reference in Jupyter notebooks and use dynamically in code (e.g. for visualizations and reports). The DataDictDataset class itself is a rather straightforward AbstractDataset. The unique and helpful change in this PR is that an instance of DataDictDataset can be attached to other datasets inheriting from AbstractVersionedDataset (e.g. pandas.CSVDataSet). For example, this would enable the following entry in catalog.yml:

    stocks_data:
        type: pandas.CSVDataSet
        filepath: data/01_raw/stocks.csv
        data_dict:
            dataset: yaml.YAMLDataSet
            filepath: data/01_raw/data_dict.yml

This would create a dataset stocks_data with an attached data dictionary.
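To make the "used dynamically in the code" part concrete, here is a minimal sketch in plain Python of what consuming such a data dictionary could look like once it has been loaded. It does not use Kedro or the classes from this PR. The dictionary layout (`features`, `description`, `unit`, `range`), the sample rows, and the helper names `axis_label` and `out_of_range` are all assumptions made up for illustration.

```python
# Hypothetical contents of data_dict.yml after loading; the keys are illustrative only.
data_dict = {
    "source": "example export",
    "features": {
        "close": {"description": "Closing price", "unit": "USD", "range": [0, None]},
        "volume": {"description": "Shares traded", "unit": "shares", "range": [0, None]},
    },
}


def axis_label(column, data_dict):
    """Build a plot label from the data dictionary, falling back to the raw column name."""
    spec = data_dict["features"].get(column)
    if spec is None:
        return column
    return f"{spec['description']} ({spec['unit']})"


def out_of_range(rows, data_dict):
    """Return (row_index, column, value) for every value outside its documented range."""
    problems = []
    for i, row in enumerate(rows):
        for column, spec in data_dict["features"].items():
            low, high = spec["range"]  # None means "no bound on this side"
            value = row.get(column)
            if value is None:
                continue
            too_low = low is not None and value < low
            too_high = high is not None and value > high
            if too_low or too_high:
                problems.append((i, column, value))
    return problems


# Made-up sample rows standing in for the loaded stocks_data.
rows = [{"close": 101.5, "volume": 1200}, {"close": -3.0, "volume": 900}]
labels = [axis_label(c, data_dict) for c in ("close", "volume", "ticker")]
violations = out_of_range(rows, data_dict)
```

The same lookups would work regardless of how the dictionary reaches the code, whether through an attached dataset as proposed here or through some other mechanism.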

merelcht · 2024-03-28T11:41:57Z

@noamgoldberg so this data_dict basically contains metadata about the dataset?

noamgoldberg · 2024-03-28T14:45:48Z

@merelcht yes :) I mainly use it for feature definitions and basic dataset information (e.g. author, source, location/date created)

astrojuanlu · 2024-05-06T07:21:25Z

Hi @noamgoldberg, sorry it took us so long to get back to you.

IIUC, the data_dict you propose here already exists and it's called metadata. See an example here:

https://docs.kedro.org/projects/kedro-viz/en/latest/kedro-viz_visualisation.html#visualise-layers

Please confirm if that would suit your needs. Arguably we could do a better job at documenting it, most likely here: https://docs.kedro.org/en/stable/data/data_catalog.html
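As a rough illustration of that suggestion, the sketch below shows the stocks_data entry carrying the same information under a `metadata` key, written as the Python dict the YAML would parse to. The `data_dict` sub-key, its contents, and the `get_data_dict` helper are hypothetical. Kedro does not interpret arbitrary `metadata` entries itself, so user code or a plugin would have to read them, as assumed here.

```python
# Parsed form of a hypothetical catalog.yml entry that uses `metadata`
# instead of a dedicated `data_dict` attribute.
catalog_config = {
    "stocks_data": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/stocks.csv",
        "metadata": {
            # Everything under `data_dict` is made up for illustration.
            "data_dict": {
                "source": "example export",
                "features": {"close": "Closing price in USD"},
            },
        },
    },
}


def get_data_dict(catalog_config, dataset_name):
    """Return the data dictionary stored under metadata, or {} if there is none."""
    entry = catalog_config.get(dataset_name, {})
    return entry.get("metadata", {}).get("data_dict", {})


stocks_dict = get_data_dict(catalog_config, "stocks_data")
missing_dict = get_data_dict(catalog_config, "unknown_dataset")
```

One difference from the PR's approach is that the dictionary lives inline in the catalog entry rather than in a separate data_dict.yml file.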

noamgoldberg requested a review from merelcht as a code owner March 25, 2024 20:46

noamgoldberg force-pushed the data_dict branch from c36b384 to 937346d Compare March 25, 2024 20:48

noamgoldberg force-pushed the data_dict branch 4 times, most recently from cd32932 to f41a822 Compare March 29, 2024 00:39

Added 'data_dict' attribute (DataDictDataset) to AbstractVersionedDat…

8d75db6

…aset Signed-off-by: Noam Goldberg <noamgoldberg2@gmail.com>

noamgoldberg force-pushed the data_dict branch from f41a822 to 8d75db6 Compare March 29, 2024 00:47

astrojuanlu mentioned this pull request May 6, 2024

Added 'custom_args' attribute to AbstractDataset class #3761

Closed

7 tasks
