feat: no longer load full table into ram in write by using concurrent write #2289

aersam · 2024-03-15T14:25:50Z

Description

This is a follow-up to #2265.

It additionally uses streams/channels to write concurrently, at the cost of higher memory consumption. The default keeps only one record batch in RAM, so concurrent writing is opt-in.

I tested this with a local file and the write went from 700s to 200s with 10 concurrent streams. Of course memory consumption goes up, but given that we currently load the whole table into RAM, it's OK :)

This adds a dependency on async-channel, as I need a multi-consumer channel.
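
To illustrate the idea only (this is not the Rust code in this PR), here is a minimal Python sketch of a bounded queue drained by several consumers. It uses `asyncio.Queue` as a stand-in for a multi-consumer channel; `write_concurrently`, `n_consumers` and `capacity` are hypothetical names, and the actual file write is replaced by a placeholder.

```python
import asyncio


async def produce(batches, queue, n_consumers):
    # put() waits while the queue is full, which caps how many
    # batches are held in memory at the same time.
    for batch in batches:
        await queue.put(batch)
    # One sentinel per consumer signals the end of the stream.
    for _ in range(n_consumers):
        await queue.put(None)


async def consume(queue, written):
    # Each consumer stands in for one concurrent write stream.
    while True:
        batch = await queue.get()
        if batch is None:
            break
        await asyncio.sleep(0)  # placeholder for the real file write
        written.append(batch)


async def write_concurrently(batches, n_consumers=1, capacity=1):
    queue = asyncio.Queue(maxsize=capacity)
    written = []
    consumers = [
        asyncio.create_task(consume(queue, written)) for _ in range(n_consumers)
    ]
    await produce(batches, queue, n_consumers)
    await asyncio.gather(*consumers)
    return written


def run_demo():
    # 20 fake "batches", 4 consumers, at most 4 batches buffered.
    return asyncio.run(write_concurrently(list(range(20)), n_consumers=4, capacity=4))
```

With `n_consumers=1` and `capacity=1` the sketch buffers a single batch at a time. Raising either value trades memory for throughput, which is the trade-off described above.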

Related Issue(s)

Fixes #2255


aersam · 2024-03-21T11:54:14Z

> Let's get the other one in first

I hope it's ok to prioritize this one from my side to not have to keep both branches up-to-date

ion-elgreco · 2024-03-21T11:54:48Z

> Let's get the other one in first
>
> I hope it's ok to prioritize this one from my side to not have to keep both branches up-to-date

That's ok!


ion-elgreco · 2024-03-28T22:11:24Z

@aersam I think we should max out the concurrent streams for Python users.

In most use cases we pass a RecordBatchReader whose record batches are already in memory before the reader is constructed; in that case you won't see any memory difference. It also wouldn't differ from the prior behavior, since the reader was always collected.

I also have one suggestion on the Python side: I think it's better if we simplify it and just provide a parameter called parallelize, which is set to True by default. If users want to control the number of concurrent streams, they should set an environment variable that we then parse in python_lib; if it's not set and parallelize = True, we take the maximum possible number of streams.

aersam · 2024-03-29T07:19:45Z

How about parallelize:bool|int on python side? 🙂

ion-elgreco · 2024-03-29T07:27:02Z

@aersam that also works! :)
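
A minimal sketch of how such a `bool | int` argument could be resolved to a stream count, combining both suggestions above. It is an assumption for illustration, not the code that landed: the environment variable name `DELTALAKE_WRITE_STREAMS`, the helper `resolve_streams`, and the use of `os.cpu_count()` as the "max possible" value are all hypothetical.

```python
import os
from typing import Mapping, Optional, Union

# Hypothetical environment variable name, used only for this illustration.
ENV_VAR = "DELTALAKE_WRITE_STREAMS"


def resolve_streams(
    parallel: Union[bool, int] = True,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Turn a bool-or-int `parallel` argument into a number of write streams."""
    env = os.environ if env is None else env
    # bool is a subclass of int, so it has to be checked first.
    if isinstance(parallel, bool):
        if not parallel:
            return 1  # sequential write, one batch at a time
        raw = env.get(ENV_VAR)
        if raw is not None:
            return max(1, int(raw))
        # Assumption: treat the CPU count as the "max possible" stream count.
        return os.cpu_count() or 1
    if parallel < 1:
        raise ValueError("parallel must be a positive integer or a bool")
    return parallel
```

An explicit integer wins over the environment variable here, so `resolve_streams(8)` yields 8 streams regardless of the environment, while `resolve_streams(False)` always falls back to a single stream.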


aersam · 2024-04-15T08:44:55Z

I finally had the time to update this branch with the new parallel parameter in python. Hope it's looking good now!

ion-elgreco · 2024-04-15T09:03:08Z

@aersam btw, do you have any profiling numbers on the speed-ups/memory trade-offs when parallel is True? It would be nice to share those in the release notes later on.

aersam · 2024-04-15T10:45:02Z

I only did some manual tests on my own data, but I could probably write a benchmark in Python, using duckdb or polars as the source. Would it make sense to add this to the code somehow?

ion-elgreco · 2024-04-15T14:37:19Z

@aersam here you could add it, and even maybe reuse some of the benchmarks there: https://github.com/delta-io/delta-rs/tree/main/crates/benchmarks

aersam · 2024-04-17T19:15:07Z

I did some very basic benchmarking, but the results were not as I hoped :) While RAM consumption is significantly lower, the speed is not good enough yet. I think maybe the channel must be bigger, I'll do some more testing

I did my test quick and dirty using python, I can share the code if you want. Basically it's this:

import duckdb
from deltalake.writer import write_deltalake
from uuid import uuid4

# Get your 42.parquet here:
# https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-big-data-age.html
with duckdb.connect() as con:
    # Select 300 million rows, adding a random column so the data is not trivially compressible.
    con.execute("select b, random() as a from read_parquet('42.parquet') limit 300000000")
    # fetch_record_batch() returns a pyarrow RecordBatchReader.
    reader = con.fetch_record_batch()
    # Write to a fresh table directory on every run.
    write_deltalake(f"_test/{uuid4()}", reader, schema=reader.schema, mode="overwrite", engine="rust")
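
To put numbers on such a run, a small helper along these lines could wrap the write call. This is only a sketch under assumptions: `measure` is a hypothetical name, the `resource` module is Unix-only, and `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource
import time


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds, peak RSS of this process)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the high-water mark for the whole process lifetime,
    # not just for this call.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss
```

Because the peak value never goes down within a process, each variant (for example, different stream counts) would need to run in a fresh process for the memory numbers to be comparable.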

aersam · 2024-04-19T05:31:59Z

Pretty sure the non-async write causes issues. But object_store 0.10 will change a lot there, so maybe better to wait for that

ion-elgreco · 2024-04-19T06:42:02Z

> Pretty sure the non-async write causes issues. But object_store 0.10 will change a lot there, so maybe better to wait for that

Yes, let's see how effective these changes are with the new upload trait.

aersam and others added 25 commits March 7, 2024 21:15

close to compiling

565f43d

still learning :)

3a52bb7

some compile errors

30a5463

another bug fix

cde4207

clippy feedback

6743373

test compilation

577442b

wip on tests

4b276a7

Merge branch 'main' of https://github.com/aersam/delta-rs into write-…

9d022cb

…iter2

Merge branch 'main' of https://github.com/aersam/delta-rs into write-…

d1352fa

…iter2

cleanup

d4d82ce

wip on fixes

385c935

more fixes

023df09

more fixes

0397a0c

fmt

c83f947

adjust test

f131eb1

use into()

a3d5585

we need GIL, no?

965968c

clippy, your so right

83d398f

revert 965968c and 965968c

98bf7ec

Merge branch 'main' into write-iter2

44cd5b9

Merge branch 'main' of https://github.com/aersam/delta-rs into write-…

5ae3599

…iter2

fmt

28eba65

Merge branch 'write-iter2' of https://github.com/aersam/delta-rs into…

c66762a

… write-iter2

Merge branch 'main' of https://github.com/aersam/delta-rs into write-…

6e742a9

…iter2

use tasks for writing

cf375b9

aersam requested review from MrPowers, wjones127, fvaleye, roeap and ion-elgreco as code owners March 15, 2024 14:25

aersam added 4 commits March 20, 2024 07:02

Merge branch 'main' of https://github.com/aersam/delta-rs into write-…

551ab2a

…iter3

Feedback from Review

2143d68

should now work

ebb9420

fix test

97ac37e

aersam requested a review from rtyler March 20, 2024 07:32

aersam changed the title ~~feat: no longer load full table into ram in write using concurrent write~~ feat: no longer load full table into ram in write by using concurrent write Mar 20, 2024

aersam changed the title ~~feat: no longer load full table into ram in write by using concurrent write~~ feat: no longer load full table into ram in write by using concurrent write Mar 20, 2024

Merge branch 'main' of https://github.com/aersam/delta-rs into write-…

380d4cd

…iter3

aersam mentioned this pull request Mar 21, 2024

feat: no longer load full table into ram in write #2265

Closed

aersam added 2 commits March 25, 2024 08:51

Merge branch 'main' of https://github.com/aersam/delta-rs into write-…

a051bf7

…iter3

test fixews

4655742

aersam mentioned this pull request Apr 10, 2024

Schema evolution mergeSchema support #1386

Closed

aersam added 4 commits April 15, 2024 10:16

Merge branch 'main' of https://github.com/aersam/delta-rs into write-…

704d86b

…iter3

fmt

2b135f9

use parallel as arg

e5f12fb

ruff

6242179

aersam added 2 commits April 15, 2024 11:15

parallel

68ea7ed

remove fancy union syntax

0bbf3b5
