
ZSTD compression problematic #155

Open
mattpaul opened this issue Feb 9, 2024 · 5 comments
mattpaul commented Feb 9, 2024

When exporting embeddings to parquet files we currently use ZSTD compression:

_gdf.to_parquet(path=outpath, compression="ZSTD", schema_version="1.0.0")

However, ZSTD compression is not widely supported. Specifically, the version of pyarrow packaged in the AWS SDK for Pandas is not built with support for ZSTD, which yields the following error:

type(err)=<class 'pyarrow.lib.ArrowNotImplementedError'>: 
err=ArrowNotImplementedError("Support for codec 'zstd' not built")

This is proving problematic for the Clay vector service at read time.

GeoPandas docs for GeoDataFrame.to_parquet state that the following compression algorithms are available:

compression {‘snappy’, ‘gzip’, ‘brotli’, None}, default ‘snappy’
  Name of the compression to use. Use None for no compression.

@yellowcap - I'd like to request that we switch to a more widely supported compression, such as the default snappy or gzip, in order to make reading and working with Clay embeddings easier for a wider audience, including consumers of the AWS SDK for Pandas such as the Clay vector service 🤓
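If we'd rather not hard-code snappy everywhere, the export step could probe what the local pyarrow build actually supports and fall back to a widely supported codec. A minimal sketch (the `pick_compression` helper and the preference order are hypothetical, not part of the Clay codebase; `pyarrow.Codec.is_available` is the real availability check):

```python
def pick_compression(preferred=("ZSTD", "BROTLI", "GZIP"), fallback="SNAPPY"):
    """Return the first preferred codec the local pyarrow build supports.

    Falls back to snappy, which every pyarrow build ships with. If pyarrow
    is not importable at all, the fallback is returned unchanged.
    """
    try:
        import pyarrow as pa
    except ImportError:
        return fallback
    for codec in preferred:
        # pyarrow expects lowercase codec names here
        if pa.Codec.is_available(codec.lower()):
            return codec
    return fallback

print(pick_compression())
```

The export call could then pass `compression=pick_compression()` instead of a literal `"ZSTD"`, so readers with minimal pyarrow builds are never locked out.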

Please let me know if you have any questions. Thanks!

mattpaul commented:
Linking to AWS SDK for pandas for reference.

Running into problems importing geopandas directly as a dependency in requirements.txt due to package size limitations imposed on lambda functions:

UPDATE_FAILED: ReadParquetLambdaFunction (AWS::Lambda::Function)
Resource handler returned message: "Unzipped size must be smaller than 86233173 bytes (Service: Lambda, Status Code: 400, Request ID: 5f9f750b-f8fb-4c61-bfc7-02ecb4ba3a22)" (RequestToken: 05ca5f53-f912-68a3-75ab-07ef6d06a3c1, HandlerErrorCode: InvalidRequest)

hence hoping to leverage a pre-built lambda layer. Exploring alternatives...

weiji14 commented Feb 11, 2024

For context, ZSTD compression was set in #86 (comment) because it results in slightly smaller file sizes and faster read speeds (decompression). Could you please report the version of aws-sdk-pandas you are using: is it 3.5.2 or an older version? And which version of pandas is it running (check with pd.show_versions())?

Running into problems importing geopandas directly as a dependency in requirements.txt due to package size limitations imposed on lambda functions:

What's the limit for AWS Lambda? The PyArrow library used to read Parquet files is known to be quite big (see Drawbacks section under https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html, which mentions PyArrow requiring 120MB, and explicitly calls out this as a potential issue for AWS Lambda). The situation won't improve longer term though, especially for newer versions of Pandas v2.2+, so you might need to look at non-Lambda options if sticking with Pandas+PyArrow.
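To see how close a dependency pushes a bundle toward Lambda's unzipped size limit, one way is to sum an installed package's on-disk footprint before zipping it into a layer. A rough stdlib-only sketch (the helper name is hypothetical; it ignores namespace subtleties and bytecode caches):

```python
import importlib.util
import os

def package_size_bytes(name: str) -> int:
    """Approximate on-disk size of an installed top-level package."""
    spec = importlib.util.find_spec(name)
    # Builtin/single-file modules and missing packages report 0 here
    if spec is None or not spec.submodule_search_locations:
        return 0
    total = 0
    for root in spec.submodule_search_locations:
        for dirpath, _, files in os.walk(root):
            total += sum(
                os.path.getsize(os.path.join(dirpath, f)) for f in files
            )
    return total

# e.g. package_size_bytes("pyarrow") in a Lambda build environment
print(package_size_bytes("json"))
```

Running this against pyarrow in the build environment would show how much of the ~86 MB unzipped budget it consumes on its own.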

Taking a step back though, what are you actually trying to do with AWS Lambda? Are you trying to ingest the GeoParquet files into some database?

@weiji14 weiji14 self-assigned this Feb 11, 2024
mattpaul commented Feb 12, 2024

@weiji14 yes, correct. That is the architecture that was proposed here:
https://github.com/Clay-foundation/vector/discussions/3#discussioncomment-7826219

It's unfortunate that the library to simply read a file format should be so large (seems unnecessarily so) though I can appreciate the desire to work with libraries and formats commonly used for data science in interactive notebooks, etc.

I am using the latest version of the AWS SDK for pandas, 3.5.2, via the us-east-1 lambda layer ARN for Python 3.9 found here:
https://aws-sdk-pandas.readthedocs.io/en/stable/layers.html

arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python39:15

Note: I am able to open parquet files with that version of the library so long as the files have been encoded with supported compression types: gzip, snappy.

I created a handful of test cases here for the purposes of verifying which compression algorithms are supported:

s3://clay-vector-embeddings/test-cases/compression/

I took one of the v01 embeddings we originally generated with zstd compression and used the gpq command line tool to re-encode it as brotli, gzip, and snappy. You can see the results of attempting to call read_parquet on each test case here (upon a successful read, it renders the head of the dataframe as HTML for demo purposes).

@weiji14 can you tell me more / point me to more info on the binary encoding format the model is using to encode the geometry field? not sure how to decode that at the moment. thanks!

weiji14 commented Feb 13, 2024

It's unfortunate that the library to simply read a file format should be so large (seems unnecessarily so) though I can appreciate the desire to work with libraries and formats commonly used for data science in interactive notebooks, etc.

Note that PyArrow is not the only library implementation that can read Parquet files, there are others as well 😉

can you tell me more / point me to more info on the binary encoding format the model is using to encode the geometry field? not sure how to decode that at the moment. thanks!

The geometry is stored as a Well Known Binary (WKB) format as per the GeoParquet specification - https://geoparquet.org/releases/v1.0.0/. Examples of readers:
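WKB has a simple binary layout for basic geometries, so a single point can be decoded with just the standard library, no geopandas required. A minimal sketch for illustration (Point geometries only; it ignores SRID-extended and Z/M variants):

```python
import struct

def parse_wkb_point(wkb: bytes):
    """Decode a WKB Point into an (x, y) tuple."""
    # Byte 0: byte order flag (1 = little-endian, 0 = big-endian)
    endian = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4: geometry type as uint32 (1 = Point in the WKB spec)
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError(f"not a WKB Point: geometry type {geom_type}")
    # Bytes 5-20: two float64 coordinates
    x, y = struct.unpack_from(endian + "dd", wkb, 5)
    return x, y

# Build a little-endian WKB Point(2.5, -1.0) by hand for demonstration
wkb = struct.pack("<bIdd", 1, 1, 2.5, -1.0)
print(parse_wkb_point(wkb))  # (2.5, -1.0)
```

For anything beyond points, a proper WKB reader (shapely's `shapely.wkb.loads`, or a lighter-weight binding) is the safer route.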

Let me know if you need help understanding the geoparquet schema metadata parser, we can set up a meeting to have a chat.

mattpaul commented:
Yeah, I have been looking at other implementations as well. Geopandas itself is too large to import directly or via a lambda layer. I'll check out that Rust-based implementation with the Python bindings, thanks.
