
ZSTD compression problematic #155

Open
mattpaul opened this issue Feb 9, 2024 · 5 comments
mattpaul commented Feb 9, 2024

When exporting embeddings to parquet files we currently use ZSTD compression:

_gdf.to_parquet(path=outpath, compression="ZSTD", schema_version="1.0.0")

However, ZSTD compression is not widely supported. Specifically, the version of pyarrow packaged in the AWS SDK for Pandas is not built with support for ZSTD, which yields the following error:

type(err)=<class 'pyarrow.lib.ArrowNotImplementedError'>: 
err=ArrowNotImplementedError("Support for codec 'zstd' not built")

This is proving problematic for the Clay vector service at read time.

GeoPandas docs for GeoDataFrame.to_parquet state that the following compression algorithms are available:

compression {‘snappy’, ‘gzip’, ‘brotli’, None}, default ‘snappy’
  Name of the compression to use. Use None for no compression.

@yellowcap - I'd like to request that we switch to a more widely supported compression, such as the default snappy or gzip, in order to make reading and working with Clay embeddings easier for a wider audience, including consumers of the AWS SDK for Pandas such as the Clay vector service 🤓
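If we'd rather not hard-code snappy everywhere, the export step could probe what the local pyarrow build actually supports and fall back to a widely supported codec. A minimal sketch (the `pick_compression` helper and the preference order are hypothetical, not part of the Clay codebase; `pyarrow.Codec.is_available` is the real availability check):

```python
def pick_compression(preferred=("ZSTD", "BROTLI", "GZIP"), fallback="SNAPPY"):
    """Return the first preferred codec the local pyarrow build supports.

    Falls back to snappy, which every pyarrow build ships with. If pyarrow
    is not importable at all, the fallback is returned unchanged.
    """
    try:
        import pyarrow as pa
    except ImportError:
        return fallback
    for codec in preferred:
        # pyarrow expects lowercase codec names here
        if pa.Codec.is_available(codec.lower()):
            return codec
    return fallback

print(pick_compression())
```

The export call could then pass `compression=pick_compression()` instead of a literal `"ZSTD"`, so readers with minimal pyarrow builds are never locked out.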

Please let me know if you have any questions. Thanks!

mattpaul commented:
Linking to AWS SDK for pandas for reference.

Running into problems importing geopandas directly as a dependency in requirements.txt due to package size limitations imposed on lambda functions:

UPDATE_FAILED: ReadParquetLambdaFunction (AWS::Lambda::Function)
Resource handler returned message: "Unzipped size must be smaller than 86233173 bytes (Service: Lambda, Status Code: 400, Request ID: 5f9f750b-f8fb-4c61-bfc7-02ecb4ba3a22)" (RequestToken: 05ca5f53-f912-68a3-75ab-07ef6d06a3c1, HandlerErrorCode: InvalidRequest)

hence hoping to leverage a pre-built lambda layer. Exploring alternatives...

weiji14 commented Feb 11, 2024

For context, ZSTD compression was set in #86 (comment) because it results in slightly smaller file sizes and faster read speeds (decompression). Could you please report the version of aws-sdk-pandas you are using: is it 3.5.2 or an older version? And which version of pandas is it running (check with pd.show_versions())?

Running into problems importing geopandas directly as a dependency in requirements.txt due to package size limitations imposed on lambda functions:

What's the limit for AWS Lambda? The PyArrow library used to read Parquet files is known to be quite big (see Drawbacks section under https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html, which mentions PyArrow requiring 120MB, and explicitly calls out this as a potential issue for AWS Lambda). The situation won't improve longer term though, especially for newer versions of Pandas v2.2+, so you might need to look at non-Lambda options if sticking with Pandas+PyArrow.
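To see how close a dependency pushes a bundle toward Lambda's unzipped size limit, one way is to sum an installed package's on-disk footprint before zipping it into a layer. A rough stdlib-only sketch (the helper name is hypothetical; it ignores namespace subtleties and bytecode caches):

```python
import importlib.util
import os

def package_size_bytes(name: str) -> int:
    """Approximate on-disk size of an installed top-level package."""
    spec = importlib.util.find_spec(name)
    # Builtin/single-file modules and missing packages report 0 here
    if spec is None or not spec.submodule_search_locations:
        return 0
    total = 0
    for root in spec.submodule_search_locations:
        for dirpath, _, files in os.walk(root):
            total += sum(
                os.path.getsize(os.path.join(dirpath, f)) for f in files
            )
    return total

# e.g. package_size_bytes("pyarrow") in a Lambda build environment
print(package_size_bytes("json"))
```

Running this against pyarrow in the build environment would show how much of the ~86 MB unzipped budget it consumes on its own.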

Taking a step back though, what are you actually trying to do with AWS Lambda? Are you trying to ingest the GeoParquet files into some database?

@weiji14 weiji14 self-assigned this Feb 11, 2024
mattpaul commented Feb 12, 2024

@weiji14 yes, correct. That is the architecture that was proposed here:
https://github.com/Clay-foundation/vector/discussions/3#discussioncomment-7826219

It's unfortunate that the library to simply read a file format should be so large (seems unnecessarily so) though I can appreciate the desire to work with libraries and formats commonly used for data science in interactive notebooks, etc.

I am using the latest version of the AWS SDK for pandas, 3.5.2, via the us-east-1 lambda layer ARN for Python 3.9 found here:
https://aws-sdk-pandas.readthedocs.io/en/stable/layers.html

arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python39:15

Note: I am able to open parquet files with that version of the library so long as the files have been encoded with supported compression types: gzip, snappy.

I created a handful of test cases here for the purposes of verifying which compression algorithms are supported:

s3://clay-vector-embeddings/test-cases/compression/

I took one of the v01 embeddings we originally generated with zstd compression and used the gpq command line tool to re-encode it as brotli, gzip, and snappy. You can see the results of attempting to call read_parquet on each test case here (upon a successful read, it renders the head of the dataframe as HTML for demo purposes).

@weiji14 can you tell me more / point me to more info on the binary encoding format the model is using to encode the geometry field? not sure how to decode that at the moment. thanks!

weiji14 commented Feb 13, 2024

It's unfortunate that the library to simply read a file format should be so large (seems unnecessarily so) though I can appreciate the desire to work with libraries and formats commonly used for data science in interactive notebooks, etc.

Note that PyArrow is not the only library implementation that can read Parquet files, there are others as well 😉

can you tell me more / point me to more info on the binary encoding format the model is using to encode the geometry field? not sure how to decode that at the moment. thanks!

The geometry is stored as a Well Known Binary (WKB) format as per the GeoParquet specification - https://geoparquet.org/releases/v1.0.0/. Examples of readers:
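WKB has a simple binary layout for basic geometries, so a single point can be decoded with just the standard library, no geopandas required. A minimal sketch for illustration (Point geometries only; it ignores SRID-extended and Z/M variants):

```python
import struct

def parse_wkb_point(wkb: bytes):
    """Decode a WKB Point into an (x, y) tuple."""
    # Byte 0: byte order flag (1 = little-endian, 0 = big-endian)
    endian = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4: geometry type as uint32 (1 = Point in the WKB spec)
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError(f"not a WKB Point: geometry type {geom_type}")
    # Bytes 5-20: two float64 coordinates
    x, y = struct.unpack_from(endian + "dd", wkb, 5)
    return x, y

# Build a little-endian WKB Point(2.5, -1.0) by hand for demonstration
wkb = struct.pack("<bIdd", 1, 1, 2.5, -1.0)
print(parse_wkb_point(wkb))  # (2.5, -1.0)
```

For anything beyond points, a proper WKB reader (shapely's `shapely.wkb.loads`, or a lighter-weight binding) is the safer route.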

Let me know if you need help understanding the geoparquet schema metadata parser, we can set up a meeting to have a chat.

mattpaul commented:
Yeah, I have been looking at other implementations as well. Geopandas itself is too large to import directly or via a lambda layer. I'll check out that Rust-based implementation with the Python bindings, thanks.
