Validate_schema keyword not supported yet #758
Comments
Which version of pyarrow are you using?
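For anyone answering the version question, here is a minimal sketch that reports the installed versions using only the standard library. The helper names `installed_version` and `version_report` are hypothetical, and the package list (pyarrow, petastorm, s3fs) is an assumption based on the libraries mentioned in this thread.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def version_report(packages=("pyarrow", "petastorm", "s3fs")) -> dict:
    """Map each package name to its installed version string."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Paste this output into the issue so maintainers can match versions.
    for name, version in version_report().items():
        print(f"{name}: {version}")
```

Reading the version from package metadata avoids importing pyarrow itself, so the check still works when the import is what fails.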
Hi, I'm having the same problem when using
Please try using
Hi @selitvin, I'm getting the same error when trying to write a petastorm dataset to cloud storage ( I am using
from petastorm.fs_utils import FilesystemResolver
from petastorm.etl.dataset_metadata import materialize_dataset
dataset_url = "gs://bucket/dataset-name"
resolver = FilesystemResolver(dataset_url)
with materialize_dataset(spark, dataset_url, schema, filesystem_factory=resolver.filesystem_factory()):
    # spark logic here
    # essentially the same as the example in the docs
I'm not sure whether your suggestion isn't working for me because I'm writing rather than reading, because of the time that has passed since you originally posted, or because I'm doing something stupid. Do you know how I can resolve this?
I'm also getting the same error despite using the filesystem resolver (which results in an s3fs filesystem).
Output:
Trace:
Here is how I fixed it. I am using S3 (MinIO) as the cache directory for petastorm. The objective was to create a tf dataset from the petastorm cache data in MinIO. Since petastorm uses pyarrow underneath, there have been dependency issues.
from petastorm.fs_utils import FilesystemResolver
path_or_paths='s3a://bucket/key' # don't mention the parquet file
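The "don't mention the parquet file" advice can be sketched as a small helper that trims a trailing file name from the URL before it is handed to petastorm. This is only an illustration: `dataset_dir_url` is a hypothetical name, not part of petastorm, and it assumes that any path ending in `.parquet` is a single file rather than a directory that happens to carry that suffix.

```python
import posixpath
from urllib.parse import urlparse, urlunparse


def dataset_dir_url(url: str) -> str:
    """Return `url` with a trailing *.parquet file name removed, if present.

    Assumption: a path ending in '.parquet' names a single file, so its
    parent directory is the dataset root.
    """
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    if path.lower().endswith(".parquet"):
        # Keep only the directory that contains the parquet file.
        path = posixpath.dirname(path)
    return urlunparse(parsed._replace(path=path))


if __name__ == "__main__":
    print(dataset_dir_url("s3a://bucket/key/part-00000.parquet"))
    print(dataset_dir_url("s3a://bucket/key"))
```

If your dataset directory itself is named with a `.parquet` suffix, skip this helper and pass the directory URL unchanged.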
Hi, I'm using petastorm to feed TensorFlow models launched with Spark on an EMR cluster. The code is the basic example for reading parquet files on S3:
It throws the following error:
How can this issue be solved? Thanks