
Delta table input connector #1743

Merged

merged 2 commits into main from delta-input on May 15, 2024

Conversation

@ryzhyk (Contributor) commented on May 11, 2024

This adapter supports ingesting a delta table in three modes:

  • follow - incrementally ingest the transaction log, starting from a specified (or the latest) version.

  • snapshot - read table snapshot, optionally sorted by a timestamp column and optionally filtered by a user-provided predicate, e.g., specifying a time range.

  • snapshot-and-follow - combines the above two modes.
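
For orientation, here is a hypothetical serde-style sketch of how these modes and options could be modeled. The field and variant names are illustrative only (`timestamp_column` and `snapshot_filter` do appear in the review below), not the connector's actual configuration schema:

```rust
use serde::Deserialize;

/// Illustrative ingest modes; names follow the PR description, not the crate.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
enum DeltaTableIngestMode {
    /// Incrementally follow the transaction log from a specified (or the latest) version.
    Follow,
    /// Read a snapshot of the table and stop.
    Snapshot,
    /// Read a snapshot, then follow the transaction log.
    SnapshotAndFollow,
}

/// Illustrative connector configuration.
#[derive(Debug, Deserialize)]
struct DeltaTableInputConfig {
    /// Table location, e.g., an S3 URI or a local path.
    uri: String,
    mode: DeltaTableIngestMode,
    /// Optional column used to order the snapshot by event time.
    timestamp_column: Option<String>,
    /// Optional predicate applied to the snapshot, e.g.,
    /// `ts BETWEEN '2005-01-01 00:00:00' AND '2010-12-31 23:59:59'`.
    snapshot_filter: Option<String>,
    /// Transaction-log version to start following from; defaults to the latest.
    version: Option<i64>,
}
```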

The connector was tested with the local FS and with S3, including a managed delta table created by Databricks in S3.

There are a couple of todos left. The biggest one is that sorting by timestamp can be very expensive. A better solution is to implement something like this: https://docs.databricks.com/en/structured-streaming/delta-lake.html#process-initial-snapshot-without-data-being-dropped.

The second issue is that in follow mode we currently read one file at a time from S3. We should look into using datafusion's `ListingTable`, which (hopefully) parallelizes the process.
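
As a point of reference, the following is a minimal, self-contained datafusion sketch of scanning a directory of Parquet files through `ListingTable`. It deliberately ignores the Delta transaction log, the path, partition count, and file extension are placeholders, and API details vary across datafusion versions:

```rust
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Placeholder path; for S3, an object store would have to be registered
    // with the session context first.
    let table_url = ListingTableUrl::parse("/path/to/table/")?;

    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_file_extension(".parquet")
        // Split the scan across multiple partitions instead of reading
        // files one at a time.
        .with_target_partitions(8);

    let config = ListingTableConfig::new(table_url)
        .with_listing_options(options)
        .infer_schema(&ctx.state())
        .await?;

    let table = ListingTable::try_new(config)?;
    let df = ctx.read_table(Arc::new(table))?;
    df.show_limit(10).await?;
    Ok(())
}
```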

Is this a user-visible change (yes/no): yes

@ryzhyk requested review from blp and snkas on May 11, 2024 22:21
@ryzhyk force-pushed the delta-input branch 2 times, most recently from d4d2866 to e913e1c on May 11, 2024 23:58
@blp (Member) left a comment

I looked at all of this and I did a little bit of followup reading in the deltalake crate, but I didn't study any of it enough to spot bugs in the details. If you want me to study it in detail, let me know and I'll put in the time to do that.

crates/adapters/src/integrated/delta_table/input.rs (outdated, resolved)
///
/// This option can be used to specify the range of event times to include in the snapshot,
/// e.g.: `ts BETWEEN '2005-01-01 00:00:00' AND '2010-12-31 23:59:59'`.
snapshot_filter: Option<String>,
A collaborator commented:

have to teach the compiler to push this down

crates/pipeline-types/src/transport/delta_table.rs (outdated, resolved)
let mut receiver_clone = receiver.clone();
select! {
    _ = Self::worker_task_inner(endpoint.clone(), input_stream, receiver, init_status_sender) => {
        debug!("delta_table {}: worker task terminated",
A collaborator commented:

I only see logging, but no other actions here


if let Some(timestamp_column) = &self.config.timestamp_column {
    // Parse expression only to validate it.
    let mut parser = Parser::new(&GenericDialect)
A collaborator commented:

yet another SQL dialect

@ryzhyk (author) replied:

yes :(
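
For context, a minimal, self-contained example of what validation-only parsing with sqlparser's GenericDialect looks like (this is not the adapter's actual code; the function name and sample inputs are made up):

```rust
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

/// Parse a user-supplied SQL expression without evaluating it, purely to
/// reject malformed input early.
fn validate_sql_expr(expr: &str) -> Result<(), ParserError> {
    Parser::new(&GenericDialect)
        .try_with_sql(expr)?
        .parse_expr()
        .map(|_| ())
}

fn main() {
    assert!(validate_sql_expr("ts BETWEEN '2005-01-01' AND '2010-12-31'").is_ok());
    assert!(validate_sql_expr("ts BETWEEN").is_err());
}
```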

{
    let start = Instant::now();
    let json_file = NamedTempFile::new().unwrap();
    println!(
A collaborator commented:

info?

crates/adapters/src/integrated/delta_table/test.rs (2 outdated review threads, resolved)
// to process `buffer_size` records.
let buffer_timeout_ms = 100;

println!("delta_table_output_test: preparing input file");
A collaborator commented:

I guess this is allowed in test code?

crates/adapters/src/integrated/delta_table/test.rs (2 outdated review threads, resolved)
@snkas (Contributor) left a comment

Cool to see the Delta table input counterpart to the existing output, great work :) It would be useful to have a demo to showcase its usage (e.g., how to specify format).

    endpoint_config,
    endpoint.is_fault_tolerant(),
);
let reader = match endpoint {
A contributor commented:

Regarding a connector vs. an integrated connector, is the following understanding correct?

  • Usually a connector is decoupled from the format. The connector inherently has some "unit" in which its data is separated, be it one unit (e.g., a single file as a whole) or many units (e.g., Kafka messages). This "unit" of data is treated as a binary blob that the format turns into tuples the DBSP circuit can ingest.
  • An integrated connector has a fixed format.

Would it not be simpler to have a format variant called Integrated or None, and to throw an error when the transport is integrated but the format variant is something else?

@ryzhyk (author) replied:

> Regarding a connector vs. an integrated connector, is the following understanding correct?
>
> * Usually a connector is decoupled from the format. The connector inherently has some "unit" in which its data is separated, be it one unit (e.g., a single file as a whole) or many units (e.g., Kafka messages). This "unit" of data is treated as a binary blob that the format turns into tuples the DBSP circuit can ingest.
>
> * An integrated connector has a fixed format.

Yep, this is correct.

> Would it not be simpler to have a format variant called Integrated or None, and to throw an error when the transport is integrated but the format variant is something else?

We could do that, but I don't think it would simplify the code. We need two branches to build two different input pipelines: the "normal" one consisting of transport -> input_probe -> format parser, and an integrated one that has only one component (the integrated adapter). Maybe I didn't understand the suggestion. Can you clarify what you had in mind?
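
To make the two branches concrete, here is a simplified sketch with hypothetical type names (none of them come from the actual crate); it also rejects a user-supplied format on an integrated endpoint rather than silently ignoring it, which is discussed further below:

```rust
// Hypothetical, simplified stand-ins for the real adapter traits.
trait Transport {}
trait Parser {}
trait IntegratedAdapter {}

enum Endpoint {
    Regular(Box<dyn Transport>),
    Integrated(Box<dyn IntegratedAdapter>),
}

enum Reader {
    // transport -> input probe -> format parser
    Regular {
        transport: Box<dyn Transport>,
        parser: Box<dyn Parser>,
    },
    // the integrated adapter is the only component
    Integrated {
        adapter: Box<dyn IntegratedAdapter>,
    },
}

fn build_reader(endpoint: Endpoint, format: Option<Box<dyn Parser>>) -> Result<Reader, String> {
    match endpoint {
        Endpoint::Regular(transport) => {
            let parser = format.ok_or_else(|| "a format is required".to_string())?;
            Ok(Reader::Regular { transport, parser })
        }
        Endpoint::Integrated(adapter) => {
            // Reject a format instead of silently ignoring it.
            if format.is_some() {
                return Err("integrated endpoints do not accept a format".to_string());
            }
            Ok(Reader::Integrated { adapter })
        }
    }
}
```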

The contributor replied:

I am not as intimately familiar with this particular part of the code base; conceptually, I imagined the connector construction to be a function that takes as parameters (1) a transport configuration and (2) a format configuration, and returns a built connector. What happens internally in terms of validation is up to the implementation of the connector transport (though it will probably make use of similar helper functions). If an invalid (transport, format) combination is given, it returns an error.

I guess it does not matter, as the API does not differ between implementations; whichever works best in the current code base sounds good. What happens currently if a format is provided even though the transport does not support it?

@ryzhyk (author) replied:

Ok, I see what you mean now. The advantage of the current approach is that non-integrated connectors don't even need to know that data formats exist. But I think we could still implement your suggestion by having a special integrated connector implementation act as a wrapper around "regular" connectors and do the check. That would be cleaner, but I probably won't do it in this PR.

> What happens currently if a format is provided even though the transport does not support it?

Good point. It will currently be ignored; I'll fix this.

This is a small cleanup of the controller API.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
This adapter supports ingesting a delta table in three modes:

- follow - incrementally ingest transaction log starting from a
  specified (or the latest) version.

- snapshot - read table snapshot, optionally sorted by a timestamp
  column and optionally filtered by a user-provided predicate, e.g.,
  specifying a time range.

- snapshot-and-follow - combines the above two modes.

The connector was tested with local FS and with S3, including with a
managed delta table created by databricks in S3.

There are a couple of todos left.  The biggest one is that sorting by
timestamp can be very expensive.  A better solution is to implement
something like this:
https://docs.databricks.com/en/structured-streaming/delta-lake.html#process-initial-snapshot-without-data-being-dropped

The second issue is that in follow mode we currently read one file at a
time from S3.  We should look into using datafusion's `ListingTable`,
which (hopefully) parallelizes the process.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
@ryzhyk merged commit ccefc5e into main on May 15, 2024
5 checks passed
@ryzhyk deleted the delta-input branch on May 15, 2024 18:37