Do not load full source into RAM on write_to_deltalake #2255

aersam · 2024-03-06T14:52:06Z

Description

In python/lib.rs, the first thing that happens on write_to_deltalake is that the batches are collected into a Vec. This loads all RecordBatches into RAM, no? That does not seem like a good thing to me. I think the main reason is that write.rs tries to get the schema from the batches, but the schema is already known in Python anyway, so why not pass it directly?
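To illustrate the concern, here is a minimal sketch in plain Python. It is not the delta-rs code: `Schema`, `Batch`, `make_batches`, `write_collected` and `write_streaming` are all hypothetical stand-ins, and the "writer" only counts rows. It contrasts collecting every batch to infer the schema with passing the schema in and consuming the batches one at a time.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-ins for an Arrow schema and a RecordBatch.
Schema = Tuple[str, ...]
Batch = List[dict]


def make_batches(n_batches: int, rows_per_batch: int) -> Iterator[Batch]:
    """Lazily yield batches, similar in spirit to a batch reader."""
    for b in range(n_batches):
        yield [{"id": b * rows_per_batch + r} for r in range(rows_per_batch)]


def write_collected(batches: Iterable[Batch]) -> Tuple[Schema, int, int]:
    """Collect everything first, then infer the schema from the data."""
    all_batches = list(batches)               # the whole source is in memory now
    schema = tuple(all_batches[0][0].keys())  # fails on an empty source
    rows = sum(len(batch) for batch in all_batches)
    return schema, rows, len(all_batches)     # last value: batches held at once


def write_streaming(schema: Schema, batches: Iterable[Batch]) -> Tuple[Schema, int, int]:
    """Take the schema as an argument and consume one batch at a time."""
    rows = 0
    peak = 0
    for batch in batches:
        peak = max(peak, 1)  # only the current batch is held by this loop
        rows += len(batch)   # a real writer would flush the batch here
    return schema, rows, peak
```

Both functions see the same rows, but only the first one has to hold every batch at the same time. Partitioned writes make the streaming variant harder in practice, which this sketch does not attempt to model.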

Use Case
I don't want to waste resources ;)

Related Issue(s)

ion-elgreco · 2024-03-06T14:54:25Z

@aersam correct, that's not an efficient way to do it :) Will already mentioned an improvement for this, which I've logged in #1984. No one is working on it yet, so feel free to pick it up :D

aersam · 2024-03-06T14:57:15Z

I can pick it up, but I'd rather do it in the write.rs operation.

ion-elgreco · 2024-03-06T15:06:40Z

@aersam that's fine!

aersam · 2024-03-06T20:04:16Z

Ok, I see partitioning makes this quite complicated 🙂 And DataFusion's MemoryExec is not helpful here, so this might take some time.

aersam · 2024-03-07T15:12:39Z

I'll just implement it using chunks. This is not perfect, but it should work and is less invasive than rewriting the whole partitioning.
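A rough sketch of what writing in chunks could look like, again in plain Python and not taken from the eventual PR: `chunked`, `write_in_chunks` and the `write_chunk` callback are hypothetical names, and the chunk size is an arbitrary parameter. Only one chunk of batches is materialized at a time, so peak memory is bounded by the chunk size rather than by the size of the source.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))  # pulls at most `size` items
        if not chunk:
            return
        yield chunk


def write_in_chunks(
    batches: Iterable[T],
    chunk_size: int,
    write_chunk: Callable[[List[T]], None],
) -> int:
    """Hand the batches to `write_chunk` one chunk at a time; return the count."""
    written = 0
    for chunk in chunked(batches, chunk_size):
        write_chunk(chunk)  # hypothetical hook where the actual write would go
        written += len(chunk)
    return written
```

The trade-off matches the comment above: each chunk is still partitioned and written on its own, so this may produce more, smaller files than a single pass over the full data would.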

aersam added the enhancement label Mar 6, 2024

This was referenced Mar 8, 2024

feat: no longer load full table into ram in write #2265

Closed

feat: no longer load full table into ram in write by using concurrent write #2289

Open
