Support for parquet encoder and decoder #127

Open
lorenzwalthert opened this issue Jun 13, 2023 · 0 comments

Describe the feature you'd like
Support for the Parquet MIME type in the SageMaker toolkit. For example, the README of this repo shows a default_input_fn():

    def default_input_fn(self, input_data, content_type, context=None):
        """A default input_fn that can handle JSON, CSV and NPZ formats.

        Args:
            input_data: the request payload serialized in the content_type format
            content_type: the request content_type
            context (obj): the request context (default: None).

        Returns: input_data deserialized into torch.FloatTensor or torch.cuda.FloatTensor depending if cuda is available.
        """
        return decoder.decode(input_data, content_type)

Looking into decoder.decode, I see the following MIME types are supported:

_decoder_map = {
    content_types.NPY: _npy_to_numpy,
    content_types.CSV: _csv_to_numpy,
    content_types.JSON: _json_to_numpy,
    content_types.NPZ: _npz_to_sparse,
}

It should not be too hard to add Parquet here. Parquet is a data file format commonly used with large datasets, and it is already supported in other SageMaker services, for example Autopilot.

How would this feature be used? Please describe.
It would reduce storage and data I/O costs and speed up processing.

Describe alternatives you've considered

CSV is the standard, but it's a much less efficient way to store, read and write column-oriented data.

Additional context
