[WIP][Discuss][CHIP] Data Quality Enforcement Pre-writing. #695

Open
cristianfr opened this issue Feb 27, 2024 · 0 comments

Problem Statement

Fixing a data quality issue can take a long time. The best way to shorten the turnaround is to detect the issue as early as possible, and the earliest point at which it can be caught is just before the data frame is written as a table. Failing at that point also stops the pipeline, so bad data does not propagate to downstream consumers of the table.

This is a common pattern in data engineering. It is sometimes called Stage (write to a temporary table), Check (execute data quality checks), Exchange (write to the production table); Iceberg refers to it as write-audit-publish. The idea is the same: run some checks before marking the data as production.
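As a rough illustration of the Stage, Check, Exchange flow described above, here is a minimal sketch in plain Python. It is not Chronon code: the in-memory `tables` dict stands in for a table catalog, and `stage_check_exchange`, `CheckFailed`, and the `__staging` suffix are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


class CheckFailed(Exception):
    """Raised when a staged batch fails one or more data quality checks."""


def stage_check_exchange(
    tables: Dict[str, List[Row]],  # stand-in for a table catalog (assumption)
    target: str,
    rows: List[Row],
    checks: Dict[str, Check],
) -> None:
    staging = f"{target}__staging"  # hypothetical naming convention
    # Stage: write the batch to a temporary table.
    tables[staging] = list(rows)
    try:
        # Check: run every check against the staged data.
        failed = [name for name, check in checks.items()
                  if not check(tables[staging])]
        if failed:
            raise CheckFailed(f"{target}: failed checks {failed}")
        # Exchange: only now does the batch become the production table.
        tables[target] = tables[staging]
    finally:
        # The staging table is dropped whether or not the checks passed.
        tables.pop(staging, None)


tables = {"events": [{"user": "a"}]}
checks = {"non_empty": lambda staged: len(staged) > 0}
stage_check_exchange(tables, "events", [{"user": "b"}], checks)
```

If any check fails, the exception stops the caller before the production table is touched, which is the property the pattern is meant to provide.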

Chronon has a tableUtils module that takes care of writing the data and even collects some stats to make these writes more efficient. The idea would be to define a schema for the verifications we may want to run on the data before writing, to minimize the time to detect, for example: an excessive null rate (commonly associated with bad timestamps or missing input data), missing data (easily detected by a drop in row count), bot activity (new heavy hitters), bad timestamps that reflect past activity, or new values in categorical data.
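To make the kinds of verification listed above concrete, here is one possible sketch of each metric over rows represented as plain dicts. All function names and thresholds are assumptions made for illustration; none of them exist in Chronon, and a real implementation would presumably compute these on the data frame itself.

```python
from collections import Counter
from typing import Dict, List, Set

Row = Dict[str, object]


def null_rate(rows: List[Row], column: str) -> float:
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def row_count_drop(current: int, previous: int) -> float:
    """Relative drop in row count versus a previous batch (0.0 if no drop)."""
    if previous <= 0:
        return 0.0
    return max(0.0, (previous - current) / previous)


def heavy_hitters(rows: List[Row], column: str, max_share: float) -> Dict[object, float]:
    """Values of `column` whose share of all rows exceeds `max_share`."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    total = len(rows)
    return {k: c / total for k, c in counts.items() if c / total > max_share}


def new_categories(rows: List[Row], column: str, known: Set[object]) -> Set[object]:
    """Values of a categorical `column` not present in the `known` set."""
    return {r[column] for r in rows if r.get(column) is not None} - known


def stale_timestamp_rate(rows: List[Row], column: str, min_ts: int) -> float:
    """Fraction of rows whose timestamp is earlier than `min_ts`."""
    if not rows:
        return 0.0
    stale = sum(1 for r in rows
                if r.get(column) is not None and r[column] < min_ts)
    return stale / len(rows)


rows = [
    {"user": "u1", "country": "US", "ts": 1000},
    {"user": "u1", "country": "US", "ts": 1001},
    {"user": "u1", "country": "BR", "ts": 10},
    {"user": None, "country": "XX", "ts": 1002},
]
print(null_rate(rows, "user"))                       # 0.25
print(heavy_hitters(rows, "user", 0.5))              # {'u1': 0.75}
print(new_categories(rows, "country", {"US", "BR"}))  # {'XX'}
print(stale_timestamp_rate(rows, "ts", 1000))        # 0.25
```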

Requirements

[ ] Schema for data quality check definitions
[ ] Expand or migrate DataFrame Stats to take on this new responsibility before data is written.
[ ] Extra checks may have performance implications, so they should run only when requested in the configuration, keeping the trade-off between performance and data reliability acceptable.
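One possible shape for such a schema, offered only as a starting point for discussion: each check names a metric, an optional column, and a threshold, and can be switched off in configuration. `CheckSpec`, `metrics_needed`, and `evaluate` are hypothetical names, not an existing or proposed Chronon API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

MetricKey = Tuple[str, Optional[str]]  # (metric name, column or None)


@dataclass(frozen=True)
class CheckSpec:
    """Hypothetical definition of one pre-write data quality check."""
    name: str               # label reported on failure
    metric: str             # e.g. "null_rate", "row_count_drop"
    column: Optional[str]   # None for table-level metrics
    max_value: float        # the check fails when the metric exceeds this
    enabled: bool = True    # lets configuration opt out of costly checks


def metrics_needed(specs: List[CheckSpec]) -> Set[MetricKey]:
    """Only metrics requested by enabled checks need to be computed."""
    return {(s.metric, s.column) for s in specs if s.enabled}


def evaluate(specs: List[CheckSpec], metrics: Dict[MetricKey, float]) -> List[str]:
    """Return one message per enabled check whose threshold is exceeded."""
    failures = []
    for spec in specs:
        if not spec.enabled:
            continue
        value = metrics[(spec.metric, spec.column)]
        if value > spec.max_value:
            failures.append(f"{spec.name}: {value:.3f} > {spec.max_value:.3f}")
    return failures


specs = [
    CheckSpec("user_nulls", "null_rate", "user", 0.05),
    CheckSpec("row_drop", "row_count_drop", None, 0.5),
    CheckSpec("ts_nulls", "null_rate", "ts", 0.0, enabled=False),
]
metrics = {("null_rate", "user"): 0.25, ("row_count_drop", None): 0.1}
print(evaluate(specs, metrics))  # ['user_nulls: 0.250 > 0.050']
```

Deriving the set of metrics from the enabled checks is one way to keep the extra cost proportional to what the configuration asks for.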

Verification

  • New behavior can be unit tested.

Approach

  • TBD

User API (when required)

  • TBD