
ENH: read parquet files in chunks using to_parquet and chunksize #55973

Open
1 of 3 tasks
match-gabeflores opened this issue Nov 15, 2023 · 8 comments
Assignees
Labels
Arrow pyarrow functionality Enhancement IO Parquet parquet, feather

Comments

@match-gabeflores
Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Similar to how read_csv has a chunksize parameter, can the read_parquet function have one as well?

It seems possible using pyarrow via iter_batches:
https://stackoverflow.com/questions/59098785/is-it-possible-to-read-parquet-files-in-chunks

Is this something feasible within pandas?

Feature Description

Add a new chunksize parameter to read_parquet.

Alternative Solutions

Use pyarrow's ParquetFile.iter_batches directly.

Additional Context

No response

@match-gabeflores match-gabeflores added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 15, 2023
@RahulDubey391

Hi @match-gabeflores , I would like to have a look into this issue. Can you please assign it to me?

@match-gabeflores
Author

match-gabeflores commented Nov 27, 2023

Thanks, go for it! Unfortunately, I don't have access to assign it.

@lithomas1 lithomas1 added IO Parquet parquet, feather Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 11, 2024
@Meadiocre

Hello @match-gabeflores,

My project team is looking for pandas enhancement features for our semester-long grad school project. We saw this task and would like to contribute if possible! We also noticed that @RahulDubey391 mentioned a few months ago that he wanted to work on this feature; however, if no one is currently working on it, we would like to pick it up.

@match-gabeflores
Author

Go for it, @Meadiocre !

I don't have access to assign, I think that's just a formality anyway. @lithomas1

@HkrFlores

HkrFlores commented Feb 4, 2024

take

@HkrFlores

Hello @match-gabeflores,
I am working with @Meadiocre; please assign it to me.
Thanks!

@HkrFlores

take

@HkrFlores

Hello @match-gabeflores
I just want to make sure I am not deviating from what has been asked.
I got read_parquet() with chunksize to "work" in the sense that values are returned according to the selected chunksize, but the data still comes back as a single DataFrame; looking at the csv implementation, that would only show the full table (or at least the last batch). Right now I am working on having read_parquet return the data through a TextFileReader-style iterator, the same way csv does.
Looking at the engine setup for TextFileReader, I noticed none were set up for Parquet, so I created a new engine type (in _typing), but I can't find information about those engines. I assume python and pyarrow are OK? Is there any other engine to use for this?

Thanks!
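For reference, the read_csv behaviour the comment above is trying to mirror: with chunksize set, read_csv returns a TextFileReader (an iterator of DataFrames, also usable as a context manager) rather than a single table. A minimal sketch, using a throwaway `example.csv`:

```python
import pandas as pd

# Write a small example file so the sketch is self-contained.
pd.DataFrame({"x": range(6)}).to_csv("example.csv", index=False)

# chunksize=2 yields three DataFrames of 2 rows each, lazily.
with pd.read_csv("example.csv", chunksize=2) as reader:
    for chunk in reader:
        print(type(chunk).__name__, len(chunk))  # DataFrame 2, three times
```

A read_parquet chunksize would presumably return an analogous iterator object rather than a concatenated DataFrame.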

Development

Successfully merging a pull request may close this issue.

5 participants