
Store batch-partitioned feature and label history #857

Open
ecsalomon opened this issue Aug 9, 2021 · 2 comments

Comments

@ecsalomon (Contributor) commented Aug 9, 2021

Our decision not to persist features limits the flexibility and reproducibility of the system. Triage is designed for batch processing, which means we could follow functional data engineering principles and store batch-partitioned feature and label data in partitioned Postgres tables, Redshift, or HDFS. Constructing matrices on the fly at evaluation time, without rebuilding features and labels, would make flexible re-testing of models on different label time periods much easier, and it would also make Rayid's preferred solution for #378 much easier to implement.

Connecting this to #368, if we versioned features on the hash of query logic, aggregation function, aggregation time period, imputation method, etc., we would be able to track how changes in feature definitions between experiments shifted the distributions of features as well as monitor how feature distributions for the same feature definitions change over time (and throw warnings or errors if, e.g., variance on a feature dropped dramatically between batches). Currently, from_obj logic changes are hidden because they affect the experiment hash but not the feature names.
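A sketch of how such a version hash could work. The field names below are illustrative, not triage's actual config keys; the point is that any change to the definition, including the from_obj, produces a new version:

```python
import hashlib
import json

def feature_version(definition: dict) -> str:
    """Hash a canonical serialization of a feature definition so that any
    change to query logic, aggregation, time period, or imputation yields
    a new version identifier."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

v1 = feature_version({
    "from_obj": "events",
    "aggregation": "count",
    "interval": "1y",
    "imputation": "zero",
})
v2 = feature_version({
    "from_obj": "events_cleaned",  # from_obj logic changed
    "aggregation": "count",
    "interval": "1y",
    "imputation": "zero",
})
assert v1 != v2  # the from_obj change now surfaces in the feature version
```

Because the hash covers the whole definition, a from_obj change would no longer be hidden behind an unchanged feature name.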

There are some complications to this approach based on how the group and triage typically operate. Data are received and processed in batches from partners, but the definition of a batch in triage is more closely tied to the experiment and experiment run. Storing every experiment or experiment run as a new batch is likely overly redundant: if you change the label definition, you don't really need a new batch for all of the features, but if you rerun the same experiment on new source data, you do. We could consider the hash of the experiment components (e.g., the label definition) in a batch definition, but triage has no good way of knowing which batch the source data correspond to, so it would have no good basis for deciding when to create a new batch for the same configuration.

A couple of alternatives for this:

  • Triage gains some way of reading the batch version of the source data and smartly updates its batches when cohort, label, etc. definitions change (only ever adding to existing batches); we currently do this for at least one project with the record linkage timestamp user_metadata key, but making that generalizable to different methods of versioning source batches is harder
  • A triage "batch" incorporates everything but the learner grid, including the run time, and we accept that there will be some redundant batches

In either case, batch_id is added as metadata to the experiment_runs table, and a batch_metadata table is introduced, potentially subsuming some of the concepts from experiment_runs and/or experiments.
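A minimal sketch of what the proposed batch_metadata table and the batch_id link on experiment_runs could look like. Table and column names are assumptions for illustration, and sqlite3 is used only to keep the example self-contained; in triage this would live in the Postgres results schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batch_metadata (
    batch_id        TEXT PRIMARY KEY,  -- hash of config + source batch version
    source_version  TEXT,              -- e.g., record linkage timestamp
    config_hash     TEXT,              -- cohort/label/feature definitions
    created_at      TEXT
);
CREATE TABLE experiment_runs (
    run_id    INTEGER PRIMARY KEY,
    batch_id  TEXT REFERENCES batch_metadata (batch_id)
);
""")
conn.execute(
    "INSERT INTO batch_metadata VALUES (?, ?, ?, ?)",
    ("abc123", "2021-08-01", "f00d", "2021-08-09"),
)
conn.execute("INSERT INTO experiment_runs (batch_id) VALUES (?)", ("abc123",))

# Every run is now traceable to the source data batch it was built from.
row = conn.execute(
    """SELECT b.source_version FROM experiment_runs r
       JOIN batch_metadata b USING (batch_id)"""
).fetchone()
```

With this link in place, re-testing a model on a different label period becomes a query against the stored batch rather than a rebuild.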

@ecsalomon (Contributor, Author) commented:
What happens to the `replace` flag under this paradigm? `replace` indicates that there was an upstream error in the batch process (e.g., an error in cleaning, or PII leakage) and that the entire batch (features, labels) and all of its dependencies (models, evaluations) should be replaced. This is the only time that data should be dropped or updated.
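The `replace` semantics described above amount to a cascading delete from the batch down through everything built on it, which foreign keys can express directly. A sketch under the assumed schema from this thread (table names are illustrative; sqlite3 again stands in for Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite requires opting in
conn.executescript("""
CREATE TABLE batches (batch_id TEXT PRIMARY KEY);
CREATE TABLE features (
    id INTEGER PRIMARY KEY,
    batch_id TEXT REFERENCES batches ON DELETE CASCADE
);
CREATE TABLE models (
    id INTEGER PRIMARY KEY,
    batch_id TEXT REFERENCES batches ON DELETE CASCADE
);
INSERT INTO batches VALUES ('bad_batch');
INSERT INTO features (batch_id) VALUES ('bad_batch');
INSERT INTO models (batch_id) VALUES ('bad_batch');
""")

# replace: the upstream batch was bad, so drop it and everything built on it
conn.execute("DELETE FROM batches WHERE batch_id = 'bad_batch'")
remaining = conn.execute("SELECT count(*) FROM features").fetchone()[0]
assert remaining == 0  # dependent features were dropped with the batch
```

Scoping the delete to one batch_id is what keeps `replace` from clobbering unaffected batches.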

@hunterowens (Member) commented Aug 9, 2021 via email
