
awswrangler.athena.to_iceberg does not support synchronous/parallel Lambda instances #2651

Open
B161851 opened this issue Jan 31, 2024 · 9 comments
Labels
bug Something isn't working

Comments

B161851 commented Jan 31, 2024

Describe the bug

wr.athena.to_iceberg(
    df=df,
    database='test_database',
    table='my_table2',
    table_location='s3://bucket-testing1/my_table2/',
    temp_path='s3://bucket-testing1/temp_path/',
    keep_files=True
)

For parallel writes, if keep_files=True the runs produce duplicate rows. I tried appending a nanosecond timestamp to the temporary path so each run is unique, but then I get "ICEBERG_COMMIT_ERROR".
If keep_files=False, parallel ingestion into the Iceberg table fails with "HIVE_CANNOT_OPEN_SPLIT NoSuchKey", and we observed that in this case the library deletes the entire temp_path from S3, which causes that error.

Writing to the Iceberg table from Lambda with wrangler therefore does not work. How can we overcome these issues when writing to an Iceberg table in parallel from Lambda using awswrangler?
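ICEBERG_COMMIT_ERROR typically signals an optimistic-concurrency conflict between committers, so one common mitigation (not a fix of the underlying behaviour) is to retry the write with jittered backoff. A minimal sketch; the `write_with_retry` helper is hypothetical, not part of awswrangler:

```python
import random
import time


def write_with_retry(write_fn, max_attempts=5, base_delay=1.0):
    """Call write_fn, retrying with jittered exponential backoff when
    the error message indicates an Iceberg commit conflict. Any other
    exception is re-raised immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception as exc:
            # Only retry the optimistic-concurrency conflict, and give
            # up once the attempt budget is exhausted.
            if "ICEBERG_COMMIT_ERROR" not in str(exc) or attempt == max_attempts:
                raise
            # Random jitter de-synchronises concurrent Lambda writers.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))


# Usage sketch, wrapping the call from the snippet above:
# write_with_retry(lambda: wr.athena.to_iceberg(df=df, database='test_database', ...))
```

This does not remove the conflict; it only spreads the commits out so fewer of them collide.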

How to Reproduce

wr.athena.to_iceberg(
    df=df,
    database='test_database',
    table='my_table2',
    table_location='s3://bucket-testing1/my_table2/',
    temp_path='s3://bucket-testing1/temp_path/',
    keep_files=False
)

We observed that with keep_files=False the library removes the entire temp_path from S3, which results in "HIVE_CANNOT_OPEN_SPLIT NoSuchKey". If the library removed only the specific Parquet files it wrote, instead of deleting the entire temp_path, I think this error could be avoided.

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Win

Python version

3.8

AWS SDK for pandas version

12

Additional context

No response

@B161851 B161851 added the bug Something isn't working label Jan 31, 2024
@B161851 B161851 closed this as completed Jan 31, 2024
@B161851 B161851 reopened this Jan 31, 2024
@kukushking
Contributor

Hi @B161851, if you are inserting concurrently, you need to make sure temp_path is unique and empty for each run. Also, when you got ICEBERG_COMMIT_ERROR, did the table you were inserting into already exist? It might be a race condition caused by multiple runs trying to create the table. Checking.
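A unique, empty temp_path per run can be generated with a UUID. A minimal sketch; the helper name and the "athena_staging" prefix are illustrative, not awswrangler conventions:

```python
import uuid


def unique_temp_path(bucket: str) -> str:
    # A fresh staging prefix per invocation means concurrent writers
    # never share (or delete) each other's temporary Parquet files.
    return f"s3://{bucket}/athena_staging/{uuid.uuid4().hex}/"


# Usage sketch:
# wr.athena.to_iceberg(df=df, ..., temp_path=unique_temp_path("bucket-testing1"))
```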


Salatich commented Feb 13, 2024

@kukushking hi, I'm facing the same issue, even with just two concurrent writers (Lambdas). The table exists. I'm trying to perform an upsert (MERGE INTO) operation. In my case the upserts even hit different partitions (different parts of the table), so I don't think it's a race condition.

tmp_table_name = f"my_table_{uuid.uuid4()}".replace("-", "_")
tmp_path = f"s3://my_bucket/{tmp_table_name}"
wr.athena.to_iceberg(
    df=processed_df,
    database='my_database',
    table="my_table",
    table_location="s3://my_bucket/my_table",
    temp_path=tmp_path,
    partition_cols=["col1", "col2"],
    merge_cols=["col1", "col2", "col3"],
    keep_files=False
)

ICEBERG_COMMIT_ERROR: Failed to commit Iceberg update to table

Contributor

vibe commented Feb 20, 2024

Just wanted to bump this issue as well.

Our particular use case is an upsert very similar to @Salatich's comment above. The infrastructure is an MSK trigger on a Lambda.

We have had to lock Lambda concurrency to 1 to avoid the ICEBERG_COMMIT_ERROR errors.
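Locking a function to a single concurrent execution can be done with reserved concurrency. A minimal sketch using the boto3 Lambda API; the helper name and function name are illustrative, and the client is injected so the sketch stays testable without AWS credentials:

```python
def reserve_single_concurrency(lambda_client, function_name):
    # Cap the function at one concurrent execution so at most one
    # Iceberg commit is in flight at a time. In real use, pass a
    # client created with boto3.client("lambda").
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=1,
    )
```

The trade-off is throughput: with concurrency 1, an MSK trigger processes batches strictly serially.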


Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

@Salatich

bump

@ChanTheDataExplorer

bump. addressing this feature will be very helpful

Contributor

kukushking commented May 13, 2024

All, this looks like a service-side issue. Please raise a support request.

@ChanTheDataExplorer @Salatich @vibe is it also HIVE_CANNOT_OPEN_SPLIT NoSuchKey, or a different exception code? This error code can correspond to several different root causes. Any additional info would be appreciated, e.g.: what is the size of your dataset? What is the key causing the issue, and what does the corresponding data frame look like? Does this reproduce consistently?

@ChanTheDataExplorer

On my side it is just ICEBERG_COMMIT_ERROR.

@peterklingelhofer

We're seeing a lot of ICEBERG_COMMIT_ERROR errors using the latest awswrangler with Athena and Glue when attempting parallel writes. Changing partition sizes so that no two writes ever merge into the same partition does not alleviate the problem. Unfortunately, the documentation on this error is fairly vague (https://repost.aws/knowledge-center/athena-iceberg-table-error).


7 participants