[Feature]: When importing Parquet files, you can ignore some built-in index columns. #33197

zhuwenxing · 2024-05-20T12:12:30Z

Is there an existing issue for this?

I have searched the existing issues

Is your feature request related to a problem? Please describe.

For Parquet files generated by pandas, in addition to user-defined columns, some index columns are also generated. When there are extra columns, the import process will treat them as data columns as well.

2024-05-20 11:51:58.473 | INFO     | __main__:prepare_data:70 - The task 449885606406374279 failed, reason: the field: __index_level_0__ is not in schema, if it's a dynamic field, please reformat data by bulk_writer: importing data failed

Describe the solution you'd like.

Ignore the index columns that pandas adds automatically (such as __index_level_0__) during import.

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response


xiaofan-luan · 2024-05-20T12:58:50Z

Why does this column exist? Did we enable some special config?
How do we know this is not a special column?

zhuwenxing · 2024-05-21T02:26:58Z

The __index_level_0__ column appears in Parquet files when saving a Pandas DataFrame with an index. This column is created to store the index information, ensuring that the index can be accurately restored when the file is read back into a DataFrame.

Purpose

The main purpose of __index_level_0__ is to preserve the DataFrame's index. This is crucial for maintaining data integrity and consistency, especially when working with datasets where the index carries significant meaning.

How It Appears

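Default Behavior: When saving a DataFrame to a Parquet file, pandas (with the default pyarrow engine) writes the index as an extra column unless the index is reset or dropped before saving. A default RangeIndex is stored as metadata only, and a named index is written under its own name; an unnamed, non-default index is written as __index_level_0__.

import pandas as pd

# Create a DataFrame with an unnamed, non-default index
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['x', 'y', 'z'])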

# Save to a Parquet file
df.to_parquet('example.parquet')

Multi-level Index: For DataFrames with multi-level indices, each unnamed level of the index is stored as a separate column, named __index_level_0__, __index_level_1__, etc. Named levels are stored under their own names.
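
To make the naming pattern concrete, here is a minimal sketch that separates column names matching the __index_level_N__ convention from the rest. The helper name `split_index_columns` is hypothetical, and a name match is only a heuristic: nothing prevents a user from creating a real column with such a name.

```python
import re

# Naming convention used for unnamed index levels (assumed from the text above).
INDEX_LEVEL_PATTERN = re.compile(r"^__index_level_\d+__$")


def split_index_columns(column_names):
    """Split names into (data columns, suspected pandas index columns)."""
    data, index = [], []
    for name in column_names:
        if INDEX_LEVEL_PATTERN.match(name):
            index.append(name)
        else:
            data.append(name)
    return data, index


data_cols, index_cols = split_index_columns(
    ["A", "B", "__index_level_0__", "__index_level_1__"]
)
print(data_cols)   # ['A', 'B']
print(index_cols)  # ['__index_level_0__', '__index_level_1__']
```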

How to Avoid or Handle It

If you want to avoid having the __index_level_0__ column in your Parquet file, you can either reset the index before saving or handle it appropriately when reading the file.

Resetting the Index

import pandas as pd

# Create a DataFrame with an index
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}).set_index('A')

# Reset the index
df.reset_index(inplace=True)

# Save to a Parquet file
df.to_parquet('example.parquet')

Ignoring the Index on Read

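If the Parquet file already includes the __index_level_0__ column, pandas restores it as the DataFrame index when reading the file. pd.read_parquet has no index_col parameter, so discard the restored index after reading:

import pandas as pd

# Read the Parquet file, then drop the restored index
df = pd.read_parquet('example.parquet').reset_index(drop=True)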

By understanding the purpose and appearance of __index_level_0__, you can better manage and handle your data files, ensuring accurate data restoration when necessary.

bigsheeper · 2024-05-21T02:36:15Z

How can we determine that this is an index column and not a column mistakenly provided by the user? @zhuwenxing

zhuwenxing · 2024-05-21T02:39:39Z

Suppose we have a Parquet file with columns a, b, c, and d, and we want to import a collection with columns a, b, and c. Can we make this import successful? @bigsheeper

bigsheeper · 2024-05-21T03:01:18Z

> Suppose we have a Parquet file with columns a, b, c, and d, and we want to import a collection with columns a, b, and c. Can we make this import successful? @bigsheeper

I don't think that's feasible. Parquet import does not support importing dynamic field data. Currently Milvus raises a message/hint like "column d is not in schema, if it's a dynamic field, please reformat data by bulk_writer".

If we "make this import successful", when the user enables dynamic field, they might assume that column d has been imported, but in reality, data is being ignored.
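
One possible middle ground, sketched here purely as an illustration and not as how Milvus implements import: pandas records which columns it wrote as index columns in the Parquet file metadata, under the `pandas` key, in a JSON field named `index_columns`. With pyarrow, that blob should be available as `pyarrow.parquet.read_schema(path).metadata[b"pandas"]`. The functions `pandas_index_columns` and `columns_to_import` below are hypothetical names. They skip only the columns listed in that metadata and still reject any other column that is not in the schema, so a stray column d keeps failing loudly.

```python
import json


def pandas_index_columns(pandas_metadata):
    """Return index column names recorded in the pandas metadata blob.

    Dict entries describe a RangeIndex stored as metadata only, so they
    do not correspond to a physical column and are left out.
    """
    if not pandas_metadata:
        return set()
    meta = json.loads(pandas_metadata)
    return {c for c in meta.get("index_columns", []) if isinstance(c, str)}


def columns_to_import(file_columns, schema_fields, pandas_metadata=None):
    """Hypothetical check: skip pandas index columns, reject other extras."""
    index_cols = pandas_index_columns(pandas_metadata)
    kept = []
    for name in file_columns:
        if name in schema_fields:
            kept.append(name)
        elif name in index_cols:
            continue  # written by pandas as an index, not as user data
        else:
            raise ValueError(f"the field: {name} is not in schema")
    return kept


# Hand-written stand-in for the metadata pandas would store in the file
sample_meta = json.dumps({"index_columns": ["__index_level_0__"]}).encode()

kept = columns_to_import(
    ["a", "b", "c", "__index_level_0__"], {"a", "b", "c"}, sample_meta
)
print(kept)  # ['a', 'b', 'c']
```

Note that under this sketch a named index that is missing from the schema would also be skipped silently, so a real implementation would need to decide whether that is acceptable.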

xiaofan-luan · 2024-05-21T12:21:53Z

Agreed.
I think there is some misconfiguration in pandas. We don't need any "built-in index columns", and there should be a config option to disable them.

zhuwenxing added the kind/feature label (Issues related to feature request from users) May 20, 2024

zhuwenxing assigned xiaofan-luan May 20, 2024
