Support for parquet encoder and decoder #127

lorenzwalthert · 2023-06-13T11:42:59Z

Describe the feature you'd like
Support for the Parquet MIME type in the SageMaker toolkit. For example, the README of this repo shows a default_input_fn():

def default_input_fn(self, input_data, content_type, context=None):
    """A default input_fn that can handle JSON, CSV and NPZ formats.

    Args:
        input_data: the request payload serialized in the content_type format
        content_type: the request content_type
        context (obj): the request context (default: None).

    Returns: input_data deserialized into torch.FloatTensor or torch.cuda.FloatTensor depending if cuda is available.
    """
    return decoder.decode(input_data, content_type)

Looking into decoder.decode, I see the following MIME types are supported:

_decoder_map = {
    content_types.NPY: _npy_to_numpy,
    content_types.CSV: _csv_to_numpy,
    content_types.JSON: _json_to_numpy,
    content_types.NPZ: _npz_to_sparse,
}

It should not be too hard to add Parquet here. Parquet is a columnar data file format commonly used with large datasets, and it is also supported by other SageMaker services, for example Autopilot.

How would this feature be used? Please describe.
It would reduce storage and data I/O costs and speed up processing.

Describe alternatives you've considered

CSV is the standard, but it is a much less efficient way to store, read and write column-oriented data.

Additional context
