Add more MultiFilereader features/hooks #11984

Merged: 7 commits into duckdb:main on May 14, 2024

samansmink
Contributor

This PR adds some missing links for extending the MultiFileReader.

Firstly, I added a field, case_insensitive_map_t<Value> MultiFileReaderOptions::custom_options, for passing custom options.
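To illustrate how a case-insensitive option map like custom_options might be consumed, here is a minimal, self-contained sketch. It does not use the DuckDB headers: SketchValue, SketchOptionMap, SketchReaderOptions and GetOptionOr are hypothetical stand-ins invented for this example, and the real Value type is richer than a string.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins, NOT the DuckDB types. They only mimic the role of
// case_insensitive_map_t<Value> so the example compiles on its own.
static std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

struct CaseInsensitiveHash {
    std::size_t operator()(const std::string &s) const {
        return std::hash<std::string>()(ToLower(s));
    }
};

struct CaseInsensitiveEq {
    bool operator()(const std::string &a, const std::string &b) const {
        return ToLower(a) == ToLower(b);
    }
};

using SketchValue = std::string; // assumption: a plain string instead of a typed Value
using SketchOptionMap =
    std::unordered_map<std::string, SketchValue, CaseInsensitiveHash, CaseInsensitiveEq>;

struct SketchReaderOptions {
    // Plays the role of MultiFileReaderOptions::custom_options.
    SketchOptionMap custom_options;
};

// Hypothetical helper: look up a custom option, falling back to a default.
std::string GetOptionOr(const SketchReaderOptions &opts, const std::string &name,
                        const std::string &fallback) {
    auto it = opts.custom_options.find(name);
    return it == opts.custom_options.end() ? fallback : it->second;
}
```

With this sketch, an option stored under "My_Option" is found again when looked up as "my_option", which is the behaviour an extension would rely on when reading its own options back.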

Secondly, I added the concept of a MultiFileReaderGlobalState. This is a state that should generally be created in the InitGlobal of a table function using the MultiFileReader. The global state allows the MultiFileReader to store state that is created while already knowing what columns are in the projection.

A crucial part of the MultiFileReaderGlobalState is the extra_columns parameter. The MultiFileReader sets this parameter to indicate that the scan will produce more columns than are actually projected. These columns are for internal use by the MultiFileReader during the FinalizeChunk step. This is crucial for the upcoming delta extension to be able to properly apply deletion vectors. To apply a deletion vector, we need to know which rows from the file are actually selected, which means the file_row_number column needs to be available in the FinalizeChunk step. However, this column should not be returned by the actual scan. The solution is very similar to what we currently do for filter pruning, where columns that are only used for pushed-down filters are removed during the scan.
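The idea can be sketched roughly as follows. This is a simplified model, not the DuckDB implementation: SketchChunk, SketchGlobalState and this FinalizeChunk signature are hypothetical, the extra columns are assumed to be appended after the projected ones, and the first extra column is assumed to hold file_row_number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical column-major chunk: columns[c][r] is row r of column c.
struct SketchChunk {
    std::vector<std::vector<int64_t>> columns;
};

// Hypothetical stand-in for the global state described above.
struct SketchGlobalState {
    // Columns the scan produces for internal use only (assumption: stored by name here).
    std::vector<std::string> extra_columns;
    // Assumption: the deletion vector is modelled as a set of deleted file row numbers.
    std::set<int64_t> deleted_rows;
};

// Drops deleted rows using the file_row_number extra column, then strips the
// extra columns so that only the projected columns are returned.
SketchChunk FinalizeChunk(const SketchGlobalState &state, const SketchChunk &scanned) {
    if (state.extra_columns.empty()) {
        return scanned; // nothing internal to consume or remove
    }
    const std::size_t projected = scanned.columns.size() - state.extra_columns.size();
    const std::vector<int64_t> &row_numbers = scanned.columns[projected];

    SketchChunk result;
    result.columns.resize(projected);
    for (std::size_t r = 0; r < row_numbers.size(); r++) {
        if (state.deleted_rows.count(row_numbers[r]) > 0) {
            continue; // row is marked deleted in the deletion vector
        }
        for (std::size_t c = 0; c < projected; c++) {
            result.columns[c].push_back(scanned.columns[c][r]);
        }
    }
    return result;
}
```

In this model the scan returns one more column than the query asked for, and the finalize step both consumes and removes it, which mirrors how filter-only columns are pruned.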

@Mytherin I've managed to push most of the complexity into the delta extension for now to keep this PR simple. Eventually we may want to pull the logic for populating the extra_columns up into the default MultiFileReader, though.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 13, 2024 11:43
@samansmink samansmink marked this pull request as ready for review May 13, 2024 11:44
@Mytherin Mytherin merged commit 3ed5d83 into duckdb:main May 14, 2024
41 checks passed
@Mytherin
Collaborator

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 14, 2024
Merge pull request duckdb/duckdb#11984 from samansmink/more-multifile-extensibility