
Implement append only task writer. #74

Open
Tracked by #10641
liurenjie1024 opened this issue Jul 5, 2023 · 7 comments

Comments

@liurenjie1024
Contributor

An append-only task writer accepts an optional partitioner and a file appender factory as arguments. When it receives records, it dispatches each record to a different file writer according to the partition key (generated by the partitioner) and writes it. When it finishes, it returns the generated data file structs.

Note that this will be the API used directly by compute engines such as RisingWave and Ballista. We can refer to the following implementation as an example:

https://github.com/apache/iceberg/blob/e340ad5be04e902398c576f431810c3dfa4fe717/core/src/main/java/org/apache/iceberg/io/PartitionedFanoutWriter.java#L28
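To make the fanout dispatch concrete, here is a minimal std-only sketch of the idea in Rust. `Record`, `DataFile`, and `FileWriter` are hypothetical stand-ins for types the crate will define; a real writer would append to a Parquet file rather than a `Vec`.

```rust
use std::collections::HashMap;

// Hypothetical record type; the real crate's record type will differ.
#[derive(Clone)]
struct Record {
    partition_value: String,
    payload: String,
}

// Placeholder for the metadata struct returned when a file writer is closed.
#[derive(Debug, PartialEq)]
struct DataFile {
    partition: String,
    record_count: usize,
}

// One in-memory "file writer" per partition; a real one would write Parquet.
struct FileWriter {
    partition: String,
    records: Vec<Record>,
}

impl FileWriter {
    fn write(&mut self, record: Record) {
        self.records.push(record);
    }
    fn close(self) -> DataFile {
        DataFile { partition: self.partition, record_count: self.records.len() }
    }
}

// Append-only task writer: dispatches each record to the writer for its
// partition key, creating writers lazily, like PartitionedFanoutWriter.
struct AppendOnlyTaskWriter {
    writers: HashMap<String, FileWriter>,
}

impl AppendOnlyTaskWriter {
    fn new() -> Self {
        Self { writers: HashMap::new() }
    }

    fn write(&mut self, record: Record) {
        let key = record.partition_value.clone();
        self.writers
            .entry(key.clone())
            .or_insert_with(|| FileWriter { partition: key, records: Vec::new() })
            .write(record);
    }

    // On close, every per-partition writer is closed and its DataFile returned.
    fn close(self) -> Vec<DataFile> {
        self.writers.into_values().map(FileWriter::close).collect()
    }
}

fn main() {
    let mut writer = AppendOnlyTaskWriter::new();
    writer.write(Record { partition_value: "2023-07-05".into(), payload: "a".into() });
    writer.write(Record { partition_value: "2023-07-05".into(), payload: "b".into() });
    writer.write(Record { partition_value: "2023-07-06".into(), payload: "c".into() });

    let mut files = writer.close();
    files.sort_by(|a, b| a.partition.cmp(&b.partition));
    assert_eq!(files.len(), 2);
    assert_eq!(files[0].record_count, 2);
}
```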

@ZENOTME
Contributor

ZENOTME commented Jul 6, 2023

How would this API be used by compute engines? Something like:

// Create a writer to write data files
let mut task_writer = table.data_writer();
task_writer.write(record);
let data_files = task_writer.close();

// Apply these data files to the table
let tx = table.transaction();
tx.apply(data_files);
tx.commit();

@Xuanwo
Contributor

Xuanwo commented Jul 6, 2023

How would this API be used by compute engines?

Let's prioritize making it functional for now and refine the API later.

@liurenjie1024
Contributor Author

How would this API be used by compute engines? Something like:

Let's use RisingWave's new coordinated sink as an example:

  1. Each sink will contain a task writer.
  2. When it needs to commit, it calls the task writer's commit method to get data files. The data files will be serialized and passed to the sink coordinator.
  3. The sink coordinator will use iceberg table APIs to create a new snapshot and perform the commit.
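The three steps above can be sketched as follows. This is a hedged, std-only illustration: `Sink`, `SinkCoordinator`, and `DataFile` are hypothetical names, and the "serialization" is a toy string encoding standing in for whatever wire format RisingWave actually uses.

```rust
// Hypothetical data-file metadata produced by each sink's task writer.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    path: String,
    record_count: usize,
}

// Steps 1-2: each sink owns a task writer; on commit it closes the writer,
// obtains data files, and serializes them for the coordinator.
struct Sink {
    pending: Vec<DataFile>,
}

impl Sink {
    fn commit(&mut self) -> Vec<String> {
        self.pending
            .drain(..)
            .map(|f| format!("{}:{}", f.path, f.record_count))
            .collect()
    }
}

// Step 3: the coordinator gathers serialized data files from all sinks and
// performs a single table commit (one new snapshot).
struct SinkCoordinator {
    committed: Vec<DataFile>,
}

impl SinkCoordinator {
    fn commit(&mut self, serialized: Vec<String>) {
        for s in serialized {
            let (path, count) = s.split_once(':').expect("well-formed entry");
            self.committed.push(DataFile {
                path: path.to_string(),
                record_count: count.parse().expect("numeric count"),
            });
        }
        // A real coordinator would now use the table's transaction API to
        // append these files and create a new snapshot.
    }
}

fn main() {
    let mut sink = Sink {
        pending: vec![DataFile { path: "data/part-0.parquet".into(), record_count: 100 }],
    };
    let mut coordinator = SinkCoordinator { committed: Vec::new() };

    let serialized = sink.commit();
    coordinator.commit(serialized);

    assert_eq!(coordinator.committed.len(), 1);
    assert_eq!(coordinator.committed[0].record_count, 100);
}
```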

@ZENOTME
Contributor

ZENOTME commented Jul 10, 2023

This issue can be closed now.

@liurenjie1024
Contributor Author

We need to add support for partition spec. cc @ZENOTME

@ZENOTME
Contributor

ZENOTME commented Jul 13, 2023

I realize that in the future we will need to add a position delete file writer, and the user could use it like the following (using the different writers separately):

// Create writers for data files and delete files
let mut append_writer = table.append_writer();
append_writer.write(record);
let mut delete_writer = table.delete_writer();
delete_writer.write(delete);

let append_data_files = append_writer.close();
let delete_data_files = delete_writer.close();

// Apply the appended data files to the table
let tx = table.transaction();
tx.apply(append_data_files);
tx.commit();

// Apply the delete files to the table
let tx = table.transaction();
tx.apply(delete_data_files);
tx.commit();

So maybe naming the interface append_writer() would be better?

@liurenjie1024
Contributor Author

In my original design, the task writer should provide two methods:

insert_record
update_record

The internal implementation of update_record needs to maintain a map from record id to file position; this way we can keep users from using the low-level API of the file writer.
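A minimal sketch of that design, under stated assumptions: `TaskWriter`, the `u64` record id, and the `Vec<String>` row store are all hypothetical stand-ins. An update is modeled as a position delete of the old row plus a re-insert, with the id-to-position map hiding the file-writer details from the caller.

```rust
use std::collections::HashMap;

// Hypothetical task writer keeping an id -> file-position map so that
// update_record can tombstone the old row without exposing file writers.
struct TaskWriter {
    rows: Vec<String>,              // stand-in for the data file's rows
    positions: HashMap<u64, usize>, // record id -> file position
    deletes: Vec<usize>,            // positions tombstoned by updates
}

impl TaskWriter {
    fn new() -> Self {
        Self { rows: Vec::new(), positions: HashMap::new(), deletes: Vec::new() }
    }

    fn insert_record(&mut self, id: u64, payload: &str) {
        let pos = self.rows.len();
        self.rows.push(payload.to_string());
        self.positions.insert(id, pos);
    }

    // update = position delete of the old row + insert of the new row.
    fn update_record(&mut self, id: u64, payload: &str) {
        if let Some(old_pos) = self.positions.get(&id).copied() {
            self.deletes.push(old_pos);
        }
        self.insert_record(id, payload);
    }
}

fn main() {
    let mut w = TaskWriter::new();
    w.insert_record(1, "v1");
    w.update_record(1, "v2");

    assert_eq!(w.deletes, vec![0]); // old position tombstoned
    assert_eq!(w.positions[&1], 1); // id 1 now points at the new row
    assert_eq!(w.rows, vec!["v1", "v2"]);
}
```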
