[SPARK-48314][SS] Don't double cache files for FileStreamSource using Trigger.AvailableNow #46627

Kimahriman · 2024-05-17T00:35:54Z

What changes were proposed in this pull request?

Files don't need to be cached for reuse in FileStreamSource when using Trigger.AvailableNow because all files are already cached for the lifetime of the query in allFilesForTriggerAvailableNow.

Why are the changes needed?

As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with #45362 as a PR to address it), the hard-coded cap of 10k cached files can cause problems when maxFilesPerTrigger is greater than 10k. It causes every other batch to contain only 10k files, which can greatly limit the throughput of a new stream trying to catch up.
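The alternating batch sizes can be reproduced with a small standalone model. This is a simplified sketch, not the actual FileStreamSource code: the object name `AvailableNowCacheSketch`, the function `batchSizes`, and the `cacheUnread` flag are hypothetical names made up for illustration, and the model ignores every other detail of the real source. It assumes only that a listing returns all remaining files, that a batch takes up to maxFilesPerTrigger of them, and that at most 10k leftover files are kept for the next batch.

```scala
object AvailableNowCacheSketch {
  // Cap on leftover files kept between batches (the 10k limit described above).
  val MaxCachedUnseenFiles = 10000

  /** Returns the size of each batch produced while draining `totalFiles` files.
    * Hypothetical helper for illustration; not part of Spark.
    */
  def batchSizes(totalFiles: Int, maxFilesPerTrigger: Int, cacheUnread: Boolean): List[Int] = {
    require(maxFilesPerTrigger > 0, "maxFilesPerTrigger must be positive")
    var remaining = totalFiles // files not yet assigned to any batch
    var cached = 0             // leftover files held over from the previous batch
    val sizes = List.newBuilder[Int]
    while (remaining > 0) {
      // A non-empty cache is served on its own; otherwise a fresh listing sees everything.
      val available = if (cached > 0) cached else remaining
      val batch = math.min(available, maxFilesPerTrigger)
      val leftover = available - batch
      // Keep at most the capped number of leftover files, or none when caching is skipped.
      cached = if (cacheUnread) math.min(leftover, MaxCachedUnseenFiles) else 0
      remaining -= batch
      sizes += batch
    }
    sizes.result()
  }

  def main(args: Array[String]): Unit = {
    // With the leftover cache: prints List(50000, 10000, 50000, 10000)
    println(batchSizes(120000, 50000, cacheUnread = true))
    // Without the leftover cache: prints List(50000, 50000, 20000)
    println(batchSizes(120000, 50000, cacheUnread = false))
  }
}
```

In this model, draining 120k files with maxFilesPerTrigger set to 50k takes four batches when leftovers are cached and three when they are not, which matches the throughput concern described above.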

Does this PR introduce any user-facing change?

With Trigger.AvailableNow and maxFilesPerTrigger greater than 10k, every other streaming batch is no longer limited to 10k files.
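For context, the affected configuration looks roughly like the sketch below. The paths, the application name, the text/parquet formats, and the value 20000 are placeholders chosen for illustration, not taken from this PR; the example needs Spark on the classpath to run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object AvailableNowBackfill {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("available-now-backfill") // placeholder name
      .master("local[*]")
      .getOrCreate()

    // File source with a per-batch limit above the 10k cache cap.
    val input = spark.readStream
      .format("text")
      .option("maxFilesPerTrigger", 20000L)
      .load("/data/incoming") // hypothetical input directory

    // Trigger.AvailableNow processes everything available at start, in
    // multiple batches that respect maxFilesPerTrigger, and then stops.
    val query = input.writeStream
      .format("parquet")
      .option("checkpointLocation", "/data/checkpoints/backfill") // hypothetical
      .trigger(Trigger.AvailableNow())
      .start("/data/output") // hypothetical output directory

    query.awaitTermination()
    spark.stop()
  }
}
```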

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No

Kimahriman · 2024-05-17T00:36:21Z

@HeartSaVioR since you added the file caching originally back in the day

HeartSaVioR

Looks good overall. Left a few comments for better testing.

HeartSaVioR · 2024-05-20T01:54:24Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala

+          f
+        }
+
+        source.latestOffset(FileStreamSourceOffset(-1L), ReadLimit.maxFiles(5))


Shall we check the result, just for completeness' sake?

Updated to check the files returned, and used the new setting from #45362 to verify the correct number of files when not caching.

HeartSaVioR · 2024-05-20T01:54:31Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala

+
+        // Reading again leverages the files already tracked in allFilesForTriggerAvailableNow,
+        // so no more listings need to happen
+        source.latestOffset(FileStreamSourceOffset(-1L), ReadLimit.maxFiles(5))


Same as above

HeartSaVioR

+1 pending CI.

HeartSaVioR · 2024-05-22T01:57:33Z

Thanks! Merging to master.

Kimahriman added 3 commits May 15, 2024 07:11

Don't cache files for availableNow trigger

c907d94

Add test

90d8ec5

Cleanup

f4c425b

github-actions bot added SQL STRUCTURED STREAMING labels May 17, 2024

Kimahriman changed the title ~~[SPARK-48314] Don't double cache files for FileStreamSource using Trigger.AvailableNow~~ [SPARK-48314][SQL] Don't double cache files for FileStreamSource using Trigger.AvailableNow May 17, 2024

HeartSaVioR reviewed May 20, 2024

HeartSaVioR changed the title ~~[SPARK-48314][SQL] Don't double cache files for FileStreamSource using Trigger.AvailableNow~~ [SPARK-48314][SS] Don't double cache files for FileStreamSource using Trigger.AvailableNow May 20, 2024

Kimahriman added 4 commits May 21, 2024 10:13

Merge branch 'master' into available-now-no-cache

38a7859

Check files in batch and use new setting to test skipping cache

8879afe

Remove extra newline

a7e0198

Remove comment to go along with undo-ing visibility change

4c33fef

HeartSaVioR approved these changes May 22, 2024

HeartSaVioR closed this in e702b32 May 22, 2024
