Skip to content
This repository has been archived by the owner on Oct 23, 2023. It is now read-only.

add partition_columns to StructuredDatasetType #364

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

cosmicBboy
Copy link
Contributor

Signed-off-by: Niels Bantilan niels.bantilan@gmail.com

Add partition_columns to StructuredDatasetType

Partially addresses flyteorg/flyte#3219

TL;DR

This PR adds an additional property to the StructureDatasetType protobuf definition so that metadata about which columns in the dataset (some kind of DataFrame object) are used for partitioning the dataset into chunks, for example when a pandas.DataFrame is serialized as a parquet file.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

This change is required to store additional metadata about which columns are used for partitioning. Currently this only meaningfully affects the serialization/deserialization of parquet files, but in the future we could support the partitioning of other serialization formats.

Tracking Issue

Partly addresses flyteorg/flyte#3219

Follow-up issue

NA

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
@codecov
Copy link

codecov bot commented Feb 10, 2023

Codecov Report

Merging #364 (dee449a) into master (f3724b4) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #364   +/-   ##
=======================================
  Coverage   73.71%   73.71%           
=======================================
  Files          18       18           
  Lines        1377     1377           
=======================================
  Hits         1015     1015           
  Misses        311      311           
  Partials       51       51           
Flag Coverage Δ
unittests 73.71% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Contributor

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eapolinario please confirm, did we decide this needed to be included in the serialized flyteidl type or could just be read from metadata within flytekit at runtime?

protos/flyteidl/core/types.proto Outdated Show resolved Hide resolved
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
@cosmicBboy
Copy link
Contributor Author

@eapolinario please confirm, did we decide this needed to be included in the serialized flyteidl type or could just be read from metadata within flytekit at runtime?

I may be missing something, but we need to include it in the type so that the structured dataset decoder has access to the metadata (unless we want to manually inspect the uri path for multiple directories)

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants