Background
By default we do not add load_id and dlt_id to arrow tables. This must be configured explicitly and happens in the normalizer.
As a consequence, we need to decompress and rewrite parquet files, which consumes a lot of resources.
In this ticket we move this behavior to the extract phase. This goes against the general architecture, but I do not see any other way to do it without rewriting files.
We also unify the behavior by making the relational normalizer follow ItemsNormalizerConfiguration.
Implementation
We split this ticket into several PRs.
PR 1.
add load_id in the extract phase
make sure we do not clash with normalize, which also adds load_id (can we remove it from there?)
we (probably) do not need the logic that adds the columns when writing a file; we can just add them to the existing table
ItemsNormalizerConfiguration must be taken into account. This is probably a breaking change because we need to move it from normalize to extract, so old settings will stop working. Or maybe you'll find a clever solution here :)
PR 2.
add dlt_id in the extract phase
when adding _dlt_id we must follow table settings and generate _dlt_id according to hints (e.g. for SCD2, look at how relational.py generates different hashes). We also have a fast method to generate content hashes: add_row_hash_to_table
observe "bring your own hash": if there is a column with the unique hint, do not add a (random) _dlt_id. If we have an SCD2-type hash (please see the SCD2 documentation on how to add it), we also skip it
PR 3.
Unify relational and pyarrow behavior:
in relational.py, do not generate _load_id and _dlt_id if not switched on (ItemsNormalizerConfiguration)
observe "bring your own hash" in relational.py. Please note that we currently use _dlt_id if it is present in the data; it would be way better to detect the unique column in the schema and force its use. Same for SCD2
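Detecting the unique column from the schema, rather than relying on a _dlt_id field being present in the data, could look like this. The table-schema shape is a simplified version of dlt's column-hints dictionary (illustrative, not the exact internal structure):

```python
def find_unique_column(table_schema: dict):
    # Scan column hints for a "unique" flag; the schema shape mirrors
    # dlt's columns dict in simplified form.
    for name, hints in table_schema.get("columns", {}).items():
        if hints.get("unique"):
            return name
    return None

schema = {
    "columns": {
        "customer_key": {"data_type": "text", "unique": True},
        "amount": {"data_type": "double"},
    }
}
unique_col = find_unique_column(schema)
```

The relational normalizer would then force that column as the row identity (and analogously pick up an SCD2 row-hash column from the merge hints) instead of trusting whatever _dlt_id happens to appear in the data.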
PR 4.
Unify behavior 2
in _compute_table (extract / pyarrow), when we add new columns we should also infer hints as for any other new column. Currently, schema settings are ignored
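The hint inference PR 4 asks for might work roughly as below: when _compute_table registers a newly added column, it runs the name through the schema's hint patterns instead of skipping them. The settings shape and patterns here are assumptions in the spirit of dlt's default hints, not the real internals:

```python
import re

# Simplified stand-in for schema settings: regex patterns mapped to
# hints, loosely modeled on dlt's default hint settings (assumed).
SETTINGS = {
    "default_hints": {
        "unique": [re.compile(r"^_dlt_id$")],
        "not_null": [re.compile(r"^_dlt_(id|load_id)$")],
    }
}

def infer_hints(column_name: str, settings: dict) -> dict:
    # Apply every hint whose pattern matches the new column's name,
    # exactly as would happen for any other newly seen column.
    hints = {}
    for hint, patterns in settings.get("default_hints", {}).items():
        if any(p.search(column_name) for p in patterns):
            hints[hint] = True
    return hints

dlt_id_hints = infer_hints("_dlt_id", SETTINGS)
```

With this in place, a column added by extract gets the same unique/not_null treatment as a column arriving in the source data, so schema settings are no longer bypassed.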