[Feature]: When importing Parquet files, you can ignore some built-in index columns. #33197

Open
1 task done
zhuwenxing opened this issue May 20, 2024 · 6 comments
Labels
kind/feature Issues related to feature request from users

Comments

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

Parquet files written by pandas can contain index columns (for example __index_level_0__) in addition to the user-defined columns. The import process treats these extra columns as data columns and fails because they are not in the collection schema:

2024-05-20 11:51:58.473 | INFO     | __main__:prepare_data:70 - The task 449885606406374279 failed, reason: the field: __index_level_0__ is not in schema, if it's a dynamic field, please reformat data by bulk_writer: importing data failed

Describe the solution you'd like.

Ignore built-in index columns such as __index_level_0__ when importing Parquet files.

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

@zhuwenxing zhuwenxing added the kind/feature Issues related to feature request from users label May 20, 2024
@xiaofan-luan
Contributor

Why does this column exist? Did we enable some special configs?
How do we know this is not a special column?

@zhuwenxing
Contributor Author

The __index_level_0__ column appears in Parquet files when pandas saves a DataFrame whose index is unnamed and is not the default RangeIndex. The column stores the index values so that the index can be restored exactly when the file is read back into a DataFrame.

Purpose

The main purpose of __index_level_0__ is to preserve the DataFrame's index. This is crucial for maintaining data integrity and consistency, especially when working with datasets where the index carries significant meaning.

How It Appears

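  1. Default Behavior: When saving a DataFrame to a Parquet file, pandas writes the index as a column unless the index is excluded with index=False. With the default pyarrow engine, a default RangeIndex is the exception: it is stored only as metadata. An unnamed index is written as __index_level_0__; a named index is written under its own name.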

    import pandas as pd
    
    # Create a DataFrame with an unnamed, non-default index
    df = pd.DataFrame(
        {'A': [1, 2, 3], 'B': [4, 5, 6]},
        index=[10, 20, 30],
    )
    
    # Save to a Parquet file; the unnamed index is written as __index_level_0__
    df.to_parquet('example.parquet')
  2. Multi-level Index: For DataFrames with multi-level indices, each unnamed level of the index is stored as a separate column, named __index_level_0__, __index_level_1__, etc.
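
As a minimal sketch of how such columns could be recognized by name alone, the naming pattern above can be matched with a regular expression. The helper name is_pandas_index_column is hypothetical; it is not part of pandas, pyarrow, or Milvus.

```python
import re

# Auto-generated names for unnamed index levels:
# __index_level_0__, __index_level_1__, ...
_INDEX_LEVEL_RE = re.compile(r"__index_level_\d+__")


def is_pandas_index_column(name: str) -> bool:
    """Return True if `name` looks like an auto-generated pandas index column.

    Hypothetical helper for illustration only.
    """
    return _INDEX_LEVEL_RE.fullmatch(name) is not None


columns = ["__index_level_0__", "A", "B", "__index_level_1__", "index_level"]
index_like = [c for c in columns if is_pandas_index_column(c)]
print(index_like)
```

A name match alone cannot prove that the column came from pandas, since a user could give a real data column the same name.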

How to Avoid or Handle It

If you want to avoid having the __index_level_0__ column in your Parquet file, you can either drop the index before saving or discard it after reading the file.

Resetting the Index

import pandas as pd

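# Create a DataFrame with an unnamed, non-default index
df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    index=[10, 20, 30],
)

# Discard the index; drop=True avoids adding it back as a column named "index"
df = df.reset_index(drop=True)

# Save to a Parquet file; index=False also keeps any index out of the file
df.to_parquet('example.parquet', index=False)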

Ignoring the Index on Read

If the Parquet file already includes the __index_level_0__ column, pandas restores it as the index on read, and you can then discard it:

import pandas as pd

# Read the Parquet file, then replace the restored index with a default RangeIndex
df = pd.read_parquet('example.parquet').reset_index(drop=True)

By understanding the purpose and appearance of __index_level_0__, you can better manage and handle your data files, ensuring accurate data restoration when necessary.
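
Besides the column name, a file written through pandas with the pyarrow engine also records which fields are index columns, in a JSON document stored under the "pandas" key of the Parquet schema metadata. The sketch below parses such a document with the standard library only. The sample metadata is hand-written and abridged, and the function name index_field_names is hypothetical.

```python
import json
from typing import List

# Hand-written, abridged sample of the JSON stored under the "pandas" key
# of the Parquet schema metadata (real files contain more fields).
SAMPLE_PANDAS_METADATA = json.dumps({
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "A", "field_name": "A", "pandas_type": "int64"},
        {"name": "B", "field_name": "B", "pandas_type": "int64"},
        {"name": None, "field_name": "__index_level_0__", "pandas_type": "int64"},
    ],
})


def index_field_names(pandas_metadata_json: str) -> List[str]:
    """Return the Parquet field names that pandas recorded as index columns.

    Hypothetical helper for illustration only.
    """
    meta = json.loads(pandas_metadata_json)
    names = []
    for entry in meta.get("index_columns", []):
        # A string entry names a column stored in the file. A dict entry
        # (for example kind "range") describes a RangeIndex that is kept
        # only in metadata, so there is no column to skip.
        if isinstance(entry, str):
            names.append(entry)
    return names


print(index_field_names(SAMPLE_PANDAS_METADATA))
```

Files produced by other writers may carry no such metadata, so this check only helps when the "pandas" key is present.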

@bigsheeper
Contributor

How can we determine that this is an index column and not a column mistakenly provided by the user? @zhuwenxing

@zhuwenxing
Contributor Author

Suppose we have a Parquet file with columns a, b, c, and d, and we want to import it into a collection with columns a, b, and c. Can we make this import successful? @bigsheeper

@bigsheeper
Contributor

bigsheeper commented May 21, 2024

Suppose we have a Parquet file with columns a, b, c, and d, and we want to import it into a collection with columns a, b, and c. Can we make this import successful? @bigsheeper

I don't think that's feasible. Parquet import does not support importing dynamic field data. Currently, Milvus returns a hint such as "column d is not in schema, if it's a dynamic field, please reformat data by bulk_writer".

If we "make this import successful", a user who has enabled the dynamic field might assume that column d was imported when, in reality, its data was silently ignored.
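
To illustrate the concern, here is a minimal sketch in plain Python (not Milvus code; the function name unexpected_columns and its ignore parameter are hypothetical). It contrasts a strict check that reports every unexpected column with an explicit, opt-in ignore list, so that no column is dropped without the user asking for it.

```python
from typing import AbstractSet, Iterable, List


def unexpected_columns(file_columns: Iterable[str],
                       schema_fields: AbstractSet[str],
                       ignore: AbstractSet[str] = frozenset()) -> List[str]:
    """Return file columns that are neither in the schema nor explicitly ignored.

    Hypothetical helper for illustration only.
    """
    return [c for c in file_columns
            if c not in schema_fields and c not in ignore]


schema = {"a", "b", "c"}

# Strict: column d is reported, so the caller can fail with a clear error.
strict_result = unexpected_columns(["a", "b", "c", "d"], schema)
print(strict_result)

# Opt-in: the caller names the column to skip, so nothing is dropped silently.
opt_in_result = unexpected_columns(
    ["a", "b", "c", "__index_level_0__"], schema,
    ignore={"__index_level_0__"},
)
print(opt_in_result)
```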

@xiaofan-luan
Copy link
Contributor

Agreed.
I think this is a misconfiguration on the pandas side. We don't need any "built-in index columns", and there should be a pandas option to disable them.

3 participants