
awswrangler.s3.read_parquet() chunked=True 1 dataframe per result #2086

Open
matthewyakubiw opened this issue Mar 2, 2023 · 11 comments · Fixed by #2087

@matthewyakubiw

Hi there,

I have a question regarding the chunked=True option in awswrangler.s3.read_parquet().
I'm looking to load Parquet files from S3 in the most memory-efficient way possible. Our data has a differing number of rows per Parquet file, but the same number of columns (11). I'd like the results from read_parquet() to be separated into one pandas DataFrame per Parquet file, i.e. if the filter_query matches 10 Parquet files, I receive 10 pandas DataFrames in return. Passing a fixed chunk size would work if the number of rows were the same every time, but with our data the number of rows differs from file to file, so hard-coding the chunk size isn't feasible.

The documentation says:

If chunked=True, a new DataFrame will be returned for each file in your path/dataset.

However, it also seems to be choosing an arbitrary chunk size (in my case, chunks of 65536 rows).
Is there something I'm missing here with regard to this? Thanks very much for your help!

@matthewyakubiw matthewyakubiw added the question Further information is requested label Mar 2, 2023
@LeonLuttenberger
Contributor

Hey,

Our documentation appears to be incorrect there. If chunked=True, each chunk will contain rows from only one file, but the maximum size of a chunk is 65,536 rows. If a single file exceeds that, the iterator will return more than one chunk for that file.

To give an example, let's say you have two files in your dataset. The first file has 70,000 rows, and the second has 50,000. If you pass chunked=True, the iterator will return 3 data frames:

  1. 65,536 rows from the first file
  2. 4,464 rows from the first file
  3. 50,000 rows from the second file
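
To make that concrete, here is a minimal sketch of iterating over the result (the bucket path is hypothetical); every yielded DataFrame comes from a single file and has at most 65,536 rows:

import awswrangler as wr

# chunked=True returns an iterator of DataFrames instead of one large DataFrame
dfs = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",  # hypothetical path
    dataset=True,
    chunked=True,
)

for i, df in enumerate(dfs):
    print(f"chunk {i}: {len(df)} rows")  # 65,536 / 4,464 / 50,000 in the example above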

Alternatively, if you explicitly pass a chunk size, rows from different files will be spliced together to fill each chunk to exactly that size. Therefore, if you pass chunked=60_000 in the previous example, the iterator will return 2 data frames:

  1. 60,000 rows from the first file
  2. 10,000 rows from the first file and 50,000 rows from the second file
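
The same sketch with an explicit chunk size (path hypothetical again); chunks are filled to exactly 60,000 rows, splicing across file boundaries where needed:

import awswrangler as wr

dfs = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",  # hypothetical path
    dataset=True,
    chunked=60_000,  # rows from adjacent files are combined to reach this size
)

for i, df in enumerate(dfs):
    print(f"chunk {i}: {len(df)} rows")  # 60,000 / 60,000 in the example above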

I will update the documentation to better reflect the actual behavior. Let me know if this answers your question.

@matthewyakubiw
Author

Thank you for the response! That definitely makes sense.

Is there any way to chunk so that each chunk's size equals the number of rows in that particular Parquet file? i.e. for the data I have, there are roughly ~1067300 rows per Parquet file (but this will vary by a few hundred to a few thousand).

If not, maybe you could recommend a solution that might be helpful for my use case. I appreciate your time and help!

@LeonLuttenberger
Contributor

Hey,

If I understand correctly, you want exactly one chunk for each file? And each of those chunks would have 1067300 rows?

If so, you could iterate through the files you have in your dataset by using wr.s3.list_objects, and then call wr.s3.read_parquet for each file separately without specifying chunked=True.
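
Roughly something like this (the bucket path is hypothetical):

import awswrangler as wr

# list the individual Parquet objects under the dataset prefix
files = [
    f
    for f in wr.s3.list_objects("s3://my-bucket/my-dataset/")
    if f.endswith(".parquet")
]

# one DataFrame per file, each with exactly as many rows as that file contains
dfs = [wr.s3.read_parquet(path=f) for f in files]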

Let me know if this works for you,
Leon

@matthewyakubiw
Author

If I understand correctly, you want exactly one chunk for each file? And each of those chunks would have 1067300 rows?

Roughly that many rows; the exact number will vary (the data is measurements of optical hardware, which has an inconsistent number of entries per file), but yeah, one chunk for each file would be ideal.

I will take a look at your recommendation! Thank you so much for taking the time.

@matthewyakubiw
Author

matthewyakubiw commented Mar 3, 2023

Sorry @LeonLuttenberger one more question:
Could you explain in more detail how chunked=True is more memory efficient?
For example: without chunked=True, loading 12 Parquet files uses 11.5GB of memory (each file is roughly 8MB in S3).
With chunked=True it uses roughly 600MB. I'd like to be able to utilize this feature to save memory so I can run fitting on larger quantities of this data without having to split it up.

@LeonLuttenberger
Contributor

Hey,

For example: without chunked=True, loading 12 Parquet files uses 11.5GB of memory (each file is roughly 8MB in S3).
With chunked=True it uses roughly 600MB. I'd like to be able to utilize this feature to save memory so I can run fitting on larger quantities of this data without having to split it up.

Did you mean you were assigning chunked=int in one of these two cases? Both cases say chunked=True.

11.5 GB for 12 Parquet files where each file is 8MB seems excessive. Does each have ~1067300 rows in it? Can I ask how many columns are in each, and what you used to measure the memory? I'd like to try to replicate these results to better understand what's happening.

The reason chunked=True should be more memory efficient is that when it reaches the end of a file, it simply returns whatever is left before moving on to the next file. In contrast, chunked=int will reach the end of one file and then begin gathering data from the next file, attaching it to the already loaded data. The DataFrame then needs to be copied before being returned, which increases memory usage.

The main exception to the rule that chunked=True is more memory efficient is for datasets that either have lots of rows or large values in each row (e.g. long strings). In that case, chunked=True will still try to batch ~65,000 rows into a single data frame, which may exceed your desired memory usage.
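
To sketch the memory-friendly pattern (path hypothetical): process each chunk as the iterator yields it, so only one chunk needs to be resident at a time:

import awswrangler as wr

total_rows = 0
for df in wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",  # hypothetical path
    dataset=True,
    chunked=True,
):
    total_rows += len(df)  # stand-in for whatever per-chunk processing you need
    # df is rebound on the next iteration, so at most one chunk stays in memory

print(total_rows)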

@matthewyakubiw
Author

Did you mean you were assigning chunked=int in one of these two cases? Both cases say chunked=True.

The first example was without any chunked argument (the option was absent from the call), and the second example was with chunked=True.

11.5 GB for 12 Parquet files where each file is 8MB seems excessive. Does each have ~1067300 rows in it? Can I ask how many columns are in each, and what you used to measure the memory? I'd like to try to replicate these results to better understand what's happening.

Yeah, so each has roughly that many rows, with 11 columns (always 11 columns for every single Parquet file). I use JupyterLab to run the notebooks, and after I ran the following code block, my RAM usage for the kernel was at ~11.5GB:

df = [
    wr.s3.read_parquet(
        path=self.s3_parquet_url,
        dataset=True,
        partition_filter=lambda x: self.partition_filter_wrapper(x, query),
        use_threads=True,
    )
    for query in queries
]

where queries is a list of partition filters. The reason I use a list of partition filters specifying individual partitions, vs. one large filter, is so that each of the Parquet files comes back as its own DataFrame, as opposed to having to further process the data afterwards to separate it.
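
For context on the callable contract: awswrangler's partition_filter receives a dict mapping partition column names to their string values and returns a bool indicating whether to keep that partition. A minimal hypothetical example (the device_id column and value are invented):

def keep_device(partitions: dict) -> bool:
    # push-down filter: True keeps the partition, False skips it
    return partitions.get("device_id") == "A123"  # hypothetical column and value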

Thanks!

@matthewyakubiw
Author

Here's an example of what the data looks like (with fake values): [screenshot of a sample DataFrame]

@LeonLuttenberger
Contributor

Hey,

Which version of awswrangler are you on? The reason I ask is that we recently fixed a memory issue; the fix is available as of version 2.20.0.

If you're already on 2.20.0, can you please try running the same code with use_threads=False? My thinking is that when you specify a chunk size, the function foregoes any parallelization, and it's the parallelization that normally results in higher memory usage.
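
For example, something along these lines (the path is hypothetical):

df = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",  # hypothetical path
    dataset=True,
    use_threads=False,  # disable parallel reads to compare memory usage
)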

Cheers,
Leon

@matthewyakubiw
Author

Hey,
So I upgraded to 2.20.0 and tried running with use_threads equal to both False and True, and the same amount of memory was used:
[screenshot of memory usage]
Can I do anything else to provide you with more context that might help?

@matthewyakubiw
Author

Is there any chance this is an M1 thing?

@jaidisido jaidisido added this to In progress in AWS SDK for pandas roadmap Mar 9, 2023