Unclear behaviour of SparkColumnsToIndexSelector when DataFrame is empty #295

osopardo1 · 2024-03-27T10:30:58Z

What went wrong?

When auto indexing is enabled, we call SparkColumnsToIndexSelector to choose the best columns for grouping the data.

This selection is based on statistics and correlations of the data itself, but if no data is provided, the current default behavior is to select the first N columns of the schema.

We should decide whether that makes sense and define the minimum number of columns to index.


osopardo1 · 2024-04-09T06:09:51Z

After some discussion, we agreed that, if the DataFrame is empty, it makes little sense to use AutoIndexing right away. The code should wait until some data is written before activating the feature.
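To make the agreed behaviour concrete, here is a minimal sketch in plain Python. It is not the qbeast-spark implementation: `first_n_columns`, `choose_columns_to_index`, and the `stats_selector` callback are hypothetical names used only for illustration, and the row count is passed in directly rather than read from a DataFrame.

```python
from typing import Callable, List, Optional, Sequence


def first_n_columns(schema_columns: Sequence[str], n: int) -> List[str]:
    """Fallback described in the issue: take the first N columns of the schema."""
    return list(schema_columns[:n])


def choose_columns_to_index(
    schema_columns: Sequence[str],
    row_count: int,
    stats_selector: Callable[[Sequence[str]], List[str]],
) -> Optional[List[str]]:
    """Return the columns to index, or None to defer auto indexing.

    `stats_selector` is a hypothetical stand-in for the statistics- and
    correlation-based selection; it is only called when there is data.
    """
    if row_count == 0:
        # Empty input: there are no statistics or correlations to compute,
        # so defer the decision until some data has been written.
        return None
    return stats_selector(schema_columns)
```

With this shape the caller has to handle the `None` case explicitly (for example, by writing without auto indexing and retrying on a later write), instead of silently falling back to `first_n_columns`.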

osopardo1 added the bug label Mar 27, 2024

osopardo1 mentioned this issue Mar 27, 2024

Issue 292: Merge main-1.0.0 into main #284

Merged

osopardo1 self-assigned this Mar 27, 2024
