S3 Storage structureFormat parquet issue #16094

Open
mykola-yesypchuk-inflection opened this issue Apr 30, 2024 · 2 comments

@mykola-yesypchuk-inflection

Affected module
Ingestion Framework

Describe the bug
S3 storage metadata ingestion fails because of a _SUCCESS file in the folder of a dataPath entry.

To Reproduce
openmetadata.json

{
    "entries": [
        {
            "dataPath": "data/sp_entity",
            "structureFormat": "parquet",
            "isPartitioned": false
        }
    ]
}

Airflow logs

[2024-04-30, 13:21:32 UTC] {metadata.py:429} INFO - Looking for metadata template file at - s3://test-bucket/openmetadata.json
[2024-04-30, 13:21:33 UTC] {metadata.py:246} INFO - Extracting metadata from path data/sp_entity and generating structured container
[2024-04-30, 13:21:33 UTC] {metadata.py:365} INFO - File data/sp_entity/part-00000-764565f7-45f7-416c-a7aa-8932bc1ebf83-c000.snappy.parquet was picked to infer data structure from.
[2024-04-30, 13:21:40 UTC] {metadata.py:143} INFO - Extracting metadata from path data/sp_entity and generating structured container
[2024-04-30, 13:21:41 UTC] {metadata.py:365} INFO - File data/sp_entity/_SUCCESS was picked to infer data structure from.
[2024-04-30, 13:21:42 UTC] {datalake_utils.py:69} ERROR - Error fetching file [test-bucket/data/sp_entity/_SUCCESS] using [S3Config] due to: [Error reading dataframe due to [Could not open Parquet input source 's3://test-bucket/data/sp_entity/_SUCCESS': Parquet file size is 0 bytes]]
[2024-04-30, 13:21:42 UTC] {status.py:76} WARNING - Wild error while creating Container from bucket details - 'NoneType' object has no attribute 'columns'
[2024-04-30, 13:21:42 UTC] {taskinstance.py:1937} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 192, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 209, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/openmetadata_managed_apis/workflows/ingestion/common.py", line 209, in metadata_ingestion_workflow
    workflow.raise_from_status()
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/workflow_status_mixin.py", line 125, in raise_from_status
    raise err
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/workflow_status_mixin.py", line 122, in raise_from_status
    self.raise_from_status_internal(raise_warnings)
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/ingestion.py", line 149, in raise_from_status_internal
    raise WorkflowExecutionError(
metadata.config.common.WorkflowExecutionError: S3 reported errors: S3 Summary: [1 Records, [0 Updated Records, 0 Warnings, 1 Errors, 91 Filtered]

Expected behavior
Ignore the _SUCCESS file (or handle it in some other way) so that the job runs without an exception.
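
One way the expected behavior could look, as a minimal sketch and not OpenMetadata's actual implementation: skip marker files such as _SUCCESS and any zero-byte object before picking a file to infer the structure from. The names `is_candidate` and `pick_sample_file` are hypothetical, and the input is assumed to be a list of dicts shaped like the `Contents` entries of an S3 ListObjectsV2 response (`Key`, `Size`).

```python
from typing import Iterable, Mapping, Optional

# Assumption: names starting with "_" or "." are marker/hidden files
# (for example _SUCCESS written by Spark/Hadoop jobs), not data files.
IGNORED_NAME_PREFIXES = ("_", ".")


def is_candidate(key: str, size: int, extension: str = ".parquet") -> bool:
    """Return True if the object looks like a readable data file."""
    name = key.rsplit("/", 1)[-1]
    if not name or name.startswith(IGNORED_NAME_PREFIXES):
        return False  # folder placeholder or marker file
    if size == 0:
        return False  # an empty object cannot be a valid Parquet file
    return name.endswith(extension)


def pick_sample_file(
    objects: Iterable[Mapping], extension: str = ".parquet"
) -> Optional[str]:
    """Return the key of the first usable data file, or None if there is none."""
    for obj in objects:
        if is_candidate(obj["Key"], obj.get("Size", 0), extension):
            return obj["Key"]
    return None


if __name__ == "__main__":
    listing = [
        {"Key": "data/sp_entity/_SUCCESS", "Size": 0},
        {"Key": "data/sp_entity/part-00000.snappy.parquet", "Size": 1024},
    ]
    print(pick_sample_file(listing))
```

Returning `None` when nothing qualifies lets the caller skip the container with a warning instead of failing on a `NoneType` access.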

Version:

@mykola-yesypchuk-inflection

I also see that the S3 container shows the wrong stats. They appear to be the stats for the whole bucket, not for the table container itself.
OpenMetadata code:

    number_of_objects=self._fetch_metric(
        bucket_name=bucket_name, metric=S3Metric.NUMBER_OF_OBJECTS
    ),
    size=self._fetch_metric(
        bucket_name=bucket_name, metric=S3Metric.BUCKET_SIZE_BYTES
    ),
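
For comparison, per-container stats could be derived from the objects under the entry's dataPath rather than from bucket-level metrics. This is only a sketch under assumptions: `prefix_stats` is a hypothetical helper, not part of OpenMetadata, and it expects dicts shaped like the `Contents` entries of an S3 ListObjectsV2 response.

```python
from typing import Iterable, Mapping, Tuple


def prefix_stats(objects: Iterable[Mapping], data_path: str) -> Tuple[int, int]:
    """Count objects and sum their sizes (bytes) under a single dataPath."""
    # Trailing slash so "data/sp_entity" does not also match "data/sp_entity_v2/...".
    prefix = data_path.rstrip("/") + "/"
    count = 0
    total_bytes = 0
    for obj in objects:
        if obj["Key"].startswith(prefix):
            count += 1
            total_bytes += obj.get("Size", 0)
    return count, total_bytes


if __name__ == "__main__":
    listing = [
        {"Key": "data/sp_entity/_SUCCESS", "Size": 0},
        {"Key": "data/sp_entity/part-00000.snappy.parquet", "Size": 1024},
        {"Key": "data/other/part-00000.snappy.parquet", "Size": 500},
    ]
    number_of_objects, size = prefix_stats(listing, "data/sp_entity")
    print(number_of_objects, size)
```

Whether marker files such as _SUCCESS should count toward the object total is a design choice; this sketch counts every object under the prefix.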

Screenshots: [three images]

@mykola-yesypchuk-inflection

Are there any updates on this?
