Load parquet files #2416

Open
Gabriel-p opened this issue Jun 5, 2023 · 2 comments

Comments

@Gabriel-p

Is your feature request related to a problem? Please describe it:
Parquet files written by pandas cannot be loaded.

Describe the solution you'd like:
Load parquet files

@Carifio24
Member

Carifio24 commented Jun 5, 2023

I agree that the ability to read Parquet files would be nice. It's probably worth investigating whether using something like pyarrow directly gives any performance gains over pandas.read_parquet. In the meantime, if you're interested in a very minimal example of a Parquet data loader, you can add the snippet below (which requires pyarrow) to your glue config file. It should allow you to load at least basic Parquet files:

from glue.config import data_factory
from glue.core.data_factories.helpers import has_extension
from glue.core.data_factories.pandas import panda_process

from pandas import read_parquet

# Register a loader that glue offers for files ending in ".parquet".
@data_factory(label="Parquet file", identifier=has_extension("parquet"))
def pandas_read_parquet(path, engine="pyarrow", **kwargs):
    # Extra keyword arguments are accepted but not forwarded to pandas.
    df = read_parquet(path, engine=engine)
    # Convert the DataFrame into a glue data object.
    return panda_process(df)

@Gabriel-p
Author

Thank you! It worked perfectly. I just removed the engine specification, since my files open just fine with whatever pandas does by default.
