
[SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval #46651

Closed (wants to merge 8 commits)

Conversation

chaoqin-li1123 (Contributor)

What changes were proposed in this pull request?

Fix the Python streaming data source timeout issue for large trigger intervals.
For the Python streaming source, keep the long-running worker architecture but set the socket timeout to infinity to avoid timeout errors.
For the Python streaming sink, since a StreamingWrite is also created per microbatch on the Scala side, a long-running worker cannot be attached to a StreamingWrite instance. Therefore we abandon the long-running worker architecture: the worker simply calls commit() or abort() and exits, and we let Spark reuse workers for us.
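The sink-side flow described above can be sketched as follows. This is a minimal illustration under assumed names, not the PR's actual code: `commit_or_abort` and `FakeWriter` are hypothetical stand-ins for the real writer plumbing in `pyspark.sql.worker`.

```python
from typing import List

# Minimal sketch (hypothetical names) of the per-microbatch sink worker
# flow: the short-lived worker performs exactly one commit() or abort()
# and then exits, so Spark's ordinary worker-reuse machinery handles
# pooling instead of a long-running worker holding the connection open.

def commit_or_abort(writer, commit_messages: List[object],
                    batch_id: int, should_abort: bool) -> None:
    if should_abort:
        writer.abort(commit_messages, batch_id)
    else:
        writer.commit(commit_messages, batch_id)


class FakeWriter:
    """Records which lifecycle calls the worker made."""

    def __init__(self) -> None:
        self.calls = []

    def commit(self, messages, batch_id):
        self.calls.append(("commit", batch_id))

    def abort(self, messages, batch_id):
        self.calls.append(("abort", batch_id))


writer = FakeWriter()
commit_or_abort(writer, [], batch_id=7, should_abort=False)
commit_or_abort(writer, [], batch_id=8, should_abort=True)
print(writer.calls)  # → [('commit', 7), ('abort', 8)]
```

Each invocation corresponds to one microbatch; the worker process would exit after the single call rather than looping.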

Why are the changes needed?

Currently we run a long-running Python worker process for the Python streaming source and sink to perform planning, commit, and abort on the driver side. Testing indicates that the current implementation causes connection timeout errors when the streaming query has a large trigger interval.
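The failure mode can be reproduced in isolation with plain sockets. This standalone sketch only illustrates the mechanism and is not Spark code: a finite timeout on a socket that sits idle longer than the timeout (as the worker socket does between widely spaced microbatches) raises `socket.timeout`, while `settimeout(None)` makes reads block indefinitely.

```python
import socket

# A socket pair where the "JVM side" (server) stays silent, standing in
# for a streaming query whose trigger interval exceeds the socket timeout.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

client = socket.socket()
client.connect(("127.0.0.1", port))
conn, _ = server.accept()

client.settimeout(0.1)  # finite timeout, like the old worker socket
try:
    client.recv(1)       # peer sends nothing during the "trigger interval"
    timed_out = False
except socket.timeout:
    timed_out = True

client.settimeout(None)  # the fix: block indefinitely between microbatches

for s in (client, conn, server):
    s.close()

print(timed_out)  # → True
```

With `settimeout(None)` the same `recv` call would simply block until the peer writes, which is the behavior the long-running source worker needs.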

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added an integration test.

Was this patch authored or co-authored using generative AI tooling?

@HyukjinKwon (Member) left a comment:

Seems fine but cc @HeartSaVioR

@HeartSaVioR (Contributor) left a comment:

Looks OK - I don't follow the change of using PythonPlannerRunner, but @HyukjinKwon reviewed this so I defer to him. Left a few comments for better testing.

Btw, it looks like we keep the source runner as a long-lived one but make the sink runner short-lived. Is that for simplicity? I understand they are different; I just wanted to know it was thoughtfully decided.

.trigger(ProcessingTimeTrigger(20 * 1000))
.start(outputDir.getAbsolutePath)
eventually(timeout(waitTimeout * 5)) {
inputData.addData(1 to 3)
Contributor:

Do we intentionally add 1 to 3 into the source every 15 ms (the default)? You can just call addData a few times before eventually, as MemoryStream would produce the data for a single call as a single microbatch (2 calls = 2 batches).

Contributor Author:

A batch in MemoryStream actually doesn't correspond to a microbatch, but I figured out a way to do something similar.

.start(outputDir.getAbsolutePath)
eventually(timeout(waitTimeout * 5)) {
inputData.addData(1 to 3)
assert(q.lastProgress.batchId >= 2)
Contributor:

Since we have control over the input data once you apply my suggestion, why not check the output as well?

Contributor Author:

Change applied.

writer.abort(commit_messages, batch_id) # type: ignore[arg-type]
else:
writer.commit(commit_messages, batch_id) # type: ignore[arg-type]
# Send a status code back to JVM.
Contributor:

nit: indentation

Contributor Author:

Fixed.

@HeartSaVioR (Contributor):

> For the Python streaming sink, since a StreamingWrite is also created per microbatch on the Scala side, a long-running worker cannot be attached to a StreamingWrite instance. Therefore we abandon the long-running worker architecture: the worker simply calls commit() or abort() and exits, and we let Spark reuse workers for us.

Ah OK, it's unable to be reused anyway. Then makes sense.

@HeartSaVioR (Contributor) left a comment:

+1 pending CI

Main method for committing or aborting a data source streaming write operation.

This process is invoked from the `PythonStreamingSinkCommitRunner.runInPython`
method in the StreamingWrite implementation of the PythonTableProvider. It is
Contributor:

We don't have a PythonTableProvider. Do you mean PythonTable or PythonDataSourceV2?

Contributor Author:

Fixed.

@@ -39,78 +35,22 @@ import org.apache.spark.sql.types.StructType
* from the socket, then commit or abort a microbatch.
*/
class PythonStreamingSinkCommitRunner(
Contributor:

After this change, is the PythonStreamingSinkCommitRunner the same as the batch one now?

Contributor Author:

It is similar, except that the streaming commit runner also takes the batch id as a parameter and throws a different type of exception.

@@ -210,7 +210,8 @@ def main(infile: IO, outfile: IO) -> None:
# Read information about how to connect back to the JVM from the environment.
java_port = int(os.environ["PYTHON_WORKER_FACTORY_PORT"])
auth_secret = os.environ["PYTHON_WORKER_FACTORY_SECRET"]
(sock_file, _) = local_connect_and_auth(java_port, auth_secret)
(sock_file, sock) = local_connect_and_auth(java_port, auth_secret)
sock.settimeout(None)
Contributor:

nit: can we add a short comment here on why we need to set this timeout?

Contributor Author:

Comment added.

@HeartSaVioR (Contributor):

Let's do post-review if there are remaining comments. Looks like the change is right and unavoidable.

@HeartSaVioR (Contributor):

Thanks! Merging to master.
