Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48314][SS] Don't double cache files for FileStreamSource using Trigger.AvailableNow #46627

Closed
wants to merge 7 commits into from

Conversation

Kimahriman
Copy link
Contributor

What changes were proposed in this pull request?

Files don't need to be cached for reuse in FileStreamSource when using Trigger.AvailableNow because all files are already cached for the lifetime of the query in allFilesForTriggerAvailableNow.

Why are the changes needed?

As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with a PR to address #45362), the hard coded cap of 10k files being cached can cause problems when using a maxFilesPerTrigger > 10k. It causes every other batch to be 10k files, which can greatly limit the throughput of a new streaming trying to catch up.

Does this PR introduce any user-facing change?

Every other streaming batch won't be 10k files if using Trigger.AvailableNow and maxFilesPerTrigger greater than 10k.

How was this patch tested?

New UT

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman
Copy link
Contributor Author

@HeartSaVioR since you added the file caching originally back in the day

@Kimahriman Kimahriman changed the title [SPARK-48314] Don't double cache files for FileStreamSource using Trigger.AvailableNow [SPARK-48314][SQL] Don't double cache files for FileStreamSource using Trigger.AvailableNow May 17, 2024
Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in overall. Left a few comments for better testing.

f
}

source.latestOffset(FileStreamSourceOffset(-1L), ReadLimit.maxFiles(5))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we check the result just for completeness sake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to check files returned, and used the new setting from #45362 to verify correct number of files based on not caching


// Reading again leverages the files already tracked in allFilesForTriggerAvailableNow,
// so no more listings need to happen
source.latestOffset(FileStreamSourceOffset(-1L), ReadLimit.maxFiles(5))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@HeartSaVioR HeartSaVioR changed the title [SPARK-48314][SQL] Don't double cache files for FileStreamSource using Trigger.AvailableNow [SPARK-48314][SS] Don't double cache files for FileStreamSource using Trigger.AvailableNow May 20, 2024
Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 pending CI.

@HeartSaVioR
Copy link
Contributor

Thanks! Merging to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants