Unclear behaviour of SparkColumnsToIndexSelector when DataFrame is empty #295

Open
osopardo1 opened this issue Mar 27, 2024 · 1 comment

Comments

@osopardo1
Member

What went wrong?

When auto indexing is enabled, we call SparkColumnsToIndexSelector to choose the best columns for grouping the data.

This selection is based on statistics and correlations of the data itself, but if no data is provided, the current default behavior is to select the first N columns of the schema.
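
To make the current behavior concrete, here is a minimal sketch in plain Python rather than the real Spark code. Every name in it (`select_columns`, `rank_by_distinct_count`, `num_columns`) is hypothetical, and the distinct-count ranking is only a stand-in for the actual statistics and correlation logic.

```python
from typing import Callable, List, Sequence

Row = Sequence[object]


def rank_by_distinct_count(schema: Sequence[str], rows: Sequence[Row]) -> List[str]:
    """Toy ranking (an assumption): columns with more distinct values come first."""
    counts = {name: len({row[i] for row in rows}) for i, name in enumerate(schema)}
    # sorted() is stable, so ties keep their schema order.
    return sorted(schema, key=lambda name: counts[name], reverse=True)


def select_columns(
    schema: Sequence[str],
    rows: Sequence[Row],
    num_columns: int,
    rank_columns: Callable[[Sequence[str], Sequence[Row]], List[str]] = rank_by_distinct_count,
) -> List[str]:
    """Hypothetical stand-in for the selector, not the real API."""
    if not rows:
        # No data means no statistics, so fall back to the first N schema columns.
        return list(schema[:num_columns])
    return rank_columns(schema, rows)[:num_columns]


schema = ["country", "id", "price"]
rows = [("ES", 1, 10.0), ("ES", 2, 10.0), ("FR", 3, 12.5)]

with_data = select_columns(schema, rows, num_columns=1)
without_data = select_columns(schema, [], num_columns=1)
```

With data the sketch picks `id`, while on an empty input it picks `country` simply because it is first in the schema, so the fallback result depends on column order and not on the data.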

We should decide whether that default makes sense and define the minimum number of columns to index.

@osopardo1 osopardo1 added the bug Something isn't working label Mar 27, 2024
@osopardo1 osopardo1 self-assigned this Mar 27, 2024
@osopardo1
Member Author

After some discussion, we agreed that, if the DataFrame is empty, it makes little sense to use AutoIndexing right away. The code should wait until some data is written before activating the feature.
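
A rough sketch of what "wait until some data is written" could look like, again in plain Python with hypothetical names (`DeferredAutoIndexer`, `on_write`). It assumes the columns are chosen once, on the first non-empty write, and then kept; that detail is not settled in this issue.

```python
from typing import Callable, List, Optional, Sequence

Row = Sequence[object]
Selector = Callable[[Sequence[str], Sequence[Row]], List[str]]


class DeferredAutoIndexer:
    """Hypothetical wrapper: auto indexing stays inactive until data arrives."""

    def __init__(self, num_columns: int, select_from_data: Selector) -> None:
        self.num_columns = num_columns
        self.select_from_data = select_from_data
        self.columns: Optional[List[str]] = None  # None means "not activated yet"

    def on_write(self, schema: Sequence[str], rows: Sequence[Row]) -> Optional[List[str]]:
        # Activate only on the first write that actually carries rows.
        if self.columns is None and rows:
            self.columns = self.select_from_data(schema, rows)[: self.num_columns]
        return self.columns


# Placeholder selector (an assumption): prefer the last schema columns.
indexer = DeferredAutoIndexer(2, lambda schema, rows: list(reversed(schema)))

first = indexer.on_write(["a", "b", "c"], [])            # empty write, still deferred
second = indexer.on_write(["a", "b", "c"], [(1, 2, 3)])  # first real data, activates
third = indexer.on_write(["a", "b", "c"], [])            # keeps the earlier choice
```

Returning `None` on an empty write makes the deferred state explicit, so the caller never silently falls back to the first N schema columns.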
