
[Bug] adapter response returns incorrect data_scanned_in_bytes when an incremental model is running #585

jvyoralek opened this issue Feb 22, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jvyoralek

Is this a new bug in dbt-athena?

  • I believe this is a new bug in dbt-athena
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

We have defined an incremental model, model1, in dbt. When the model population starts, two queries are issued to AWS Athena:

  1. create a model1__dbt_tmp table and populate it with data
    dbt-query1

  2. insert data from model1__dbt_tmp into model1
    dbt_query2

The cost of this incremental run should be the sum of data_scanned_in_bytes from step 1 and step 2: 3.95 kB + 0.79 kB = 4.74 kB.

#353 added new functionality to return data_scanned_in_bytes, but for an incremental build it returns only the value from step 2 (810 bytes).

Excerpt from the run_results.json file:

      "execution_time": 15.616735935211182,
      "adapter_response": {
        "_message": "OK 10",
        "code": "OK",
        "rows_affected": 10,
        "data_scanned_in_bytes": 810
      },

Expected Behavior

The cost of this incremental run should be the sum of data_scanned_in_bytes across all queries executed on AWS Athena.
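
For illustration, here is a minimal sketch (not part of the original report) of how the expected total could be verified outside of dbt with boto3, assuming the query execution IDs of both statements are known; the IDs and region below are placeholders:

    import boto3

    # Placeholder query execution IDs for the two statements of the
    # incremental run (CTAS into model1__dbt_tmp, then the insert).
    QUERY_IDS = [
        "<query-execution-id-of-step-1>",
        "<query-execution-id-of-step-2>",
    ]

    # The region is an assumption; use whichever region the run targets.
    athena = boto3.client("athena", region_name="eu-west-1")

    total_scanned = 0
    for query_id in QUERY_IDS:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        total_scanned += execution["QueryExecution"]["Statistics"]["DataScannedInBytes"]

    # For the run above this would sum steps 1 and 2 (~4.74 kB).
    print(f"data_scanned_in_bytes across all steps: {total_scanned}")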

Steps To Reproduce

No response

Environment

- OS: macOS 14.2.1
- Python: 3.11.7
- dbt: 1.7.4
- dbt-athena-community: 1.7.1

Additional Context

dbt command output

Running with dbt=1.7.4
Registered adapter: athena=1.7.1
Found 225 models, 1961 tests, 2 seeds, 160 sources, 0 exposures, 0 metrics, 918 macros, 0 groups, 0 semantic models

Concurrency: 8 threads (target='dev')

1 of 1 START sql incremental model hd_dev_playground.model1 .................... [RUN]
1 of 1 OK created sql incremental model hd_dev_playground.model1 ............... [OK 10 in 14.26s]

Finished running 1 incremental model in 0 hours 0 minutes and 20.22 seconds (20.22s).

Completed successfully

Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

dbt model

{{
    config(
        table_type="iceberg",
        format="parquet",
        is_external=false,
        materialized="incremental",
        incremental_strategy="append",
        schema="playground",
        table_properties={"optimize_rewrite_delete_file_threshold": "2"},
        tags=["playground"],
    )
}}
select partition, kafka_offset, time_created, transaction_id, message_type
from {{ source("internal_source", "account_info_json_partitioned") }}
where "year" = 2024 and "month" = 2 and "day" = 21
limit 10

run_results.json - full file

jvyoralek added the bug label on Feb 22, 2024

nicor88 commented Feb 23, 2024

This is a known issue, and it's particularly tricky because the pure implementation of what was done in #353 doesn't work anymore, due to how we handle the partition limitation introduced by #360.

Specifically, for tables with more than 100 partitions, as you noticed, there will be a CTAS plus many batch inserts. Accumulating the statistics of every operation in a run and then returning the final sum is quite an effort, therefore when we implemented #375 we preferred simplicity over accuracy.
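
For reference, a rough sketch of what such an accumulation could look like; the class and method names here are hypothetical illustrations, not dbt-athena's actual internals:

    # Hypothetical sketch of the accumulation idea; these names are
    # illustrative and do not reflect dbt-athena's actual internals.
    class ScanStatsAccumulator:
        def __init__(self) -> None:
            self._bytes_per_query: list[int] = []

        def record_query(self, data_scanned_in_bytes: int) -> None:
            # Called once per Athena statement executed during a model
            # build: the CTAS, every batch insert, the final insert, ...
            self._bytes_per_query.append(data_scanned_in_bytes)

        @property
        def total_bytes(self) -> int:
            # What the adapter response would report instead of only
            # the last statement's statistic.
            return sum(self._bytes_per_query)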

Could you please clarify why such a feature would be relevant for you? What use cases do you have?

@jvyoralek (Author)

The use case involves monitoring the AWS Athena cost of model population. We have a bunch of models defined in dbt, using AWS Athena for storage. These models can be populated automatically or manually from Dagster, which is the UI for model orchestration.

We conceived an idea to incorporate the AWS Athena cost (the number of bytes scanned during model population) into the model metadata within Dagster for each run. This addition could help us identify models that are problematic from a cost perspective.

Example of how the metadata could look in Dagster:

{ 
  "unique_id": "model.project.model1",
  "invocation_id": "c8814bf2-e82a-412b-95b3-8df55b7b0bf1",
  "exucution_type": "incremental",
  "execution_duration_seconds": 1708,
  "rows_affected": 313,
  "total_data_scanned_mb": 122942,
  "total_spent_usd": 0.59
}

... populated from the dbt run_results.json file and Dagster internal variables.
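
For context, a minimal sketch of how such metadata could be derived from run_results.json, assuming the standard dbt artifact layout; the $5-per-TB figure is Athena's on-demand list price and may vary by region or pricing model:

    import json

    USD_PER_TB_SCANNED = 5.0  # Athena on-demand list price; may vary

    with open("target/run_results.json") as f:
        run_results = json.load(f)

    for result in run_results["results"]:
        adapter_response = result["adapter_response"]
        scanned = adapter_response.get("data_scanned_in_bytes", 0)
        print({
            "unique_id": result["unique_id"],
            "rows_affected": adapter_response.get("rows_affected"),
            "total_data_scanned_mb": round(scanned / 1000**2, 3),
            "total_spent_usd": round(scanned / 1000**4 * USD_PER_TB_SCANNED, 4),
        })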

However, in the case of incremental models, this approach is problematic in the current version: it returns only the last part of the population, which can be just a small portion of the real 'price'.

Does it make sense?


nicor88 commented Feb 23, 2024

Thanks, that makes sense. It would be neat to have what you requested; I'm not sure how much effort the change requires. As we are community-based, we rely a lot on OSS contributions, so feel free to take a spin at it, and we can guide/review what you propose.
