Load parquet files #2416

Gabriel-p · 2023-06-05T18:02:33Z

Is your feature request related to a problem? Please describe it:
Parquet files written by pandas cannot be loaded.

Describe the solution you'd like:
Add support for loading Parquet files.

Carifio24 · 2023-06-05T18:51:54Z

I agree that the ability to read Parquet files would be nice. It's probably worth investigating whether using pyarrow directly offers any performance gain over pandas.read_parquet. In the meantime, if you'd like a very minimal example of a Parquet data loader, you can add the snippet below (which requires pyarrow) to your glue config file; it should let you load at least basic Parquet files:

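from glue.config import data_factory
from glue.core.data_factories.helpers import has_extension
from glue.core.data_factories.pandas import panda_process

from pandas import read_parquet

# Register a loader that glue offers for files ending in ".parquet"
@data_factory(label="Parquet file", identifier=has_extension("parquet"))
def pandas_read_parquet(path, engine="pyarrow", **kwargs):
    # Read the file into a DataFrame, forwarding any extra keyword arguments
    df = read_parquet(path, engine=engine, **kwargs)
    # Convert the DataFrame into a glue Data object
    return panda_process(df)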
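
The snippet above picks out Parquet files by extension only. As a rough sketch of a content-based alternative: unencrypted Parquet files begin and end with the 4-byte marker `PAR1`, so a check along the following lines could recognize them regardless of file name. The name `looks_like_parquet` is hypothetical, and the `(path, **kwargs)` signature is an assumption about how glue calls identifier functions, not something confirmed in this thread.

```python
import os

# Marker found at the start and end of unencrypted Parquet files.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path, **kwargs):
    """Return True if the file starts and ends with the Parquet marker.

    Hypothetical helper; **kwargs is accepted only so the function can be
    called with extra keyword arguments (an assumption about the caller).
    """
    try:
        # A valid file holds at least: 4-byte marker, 4-byte footer
        # length, and the trailing 4-byte marker.
        if os.path.getsize(path) < 12:
            return False
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    except OSError:
        # Missing or unreadable files are simply not a match.
        return False
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

If that signature assumption holds, passing `identifier=looks_like_parquet` to `data_factory` in place of `has_extension("parquet")` might work, though this is untested against glue itself.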

Gabriel-p · 2023-06-05T19:02:02Z

Thank you! It worked perfectly. I just removed the engine specification, since my files open fine with the pandas default.

Gabriel-p added the enhancement label Jun 5, 2023
