
Store batch-partitioned feature and label history #857

Open
ecsalomon opened this issue Aug 9, 2021 · 2 comments

Comments

@ecsalomon (Contributor) commented Aug 9, 2021

Our decision not to persist features limits the flexibility and reproducibility of the system. Triage is designed for batch processing, which means we could follow functional data engineering principles and store batch-partitioned feature and label data in partitioned Postgres tables, Redshift, or HDFS. Constructing matrices on the fly at evaluation time, without rebuilding features and labels, would make flexible re-testing of models on different label time periods much easier, and it would also make Rayid's preferred solution for #378 much easier to implement.

Connecting this to #368, if we versioned features on the hash of query logic, aggregation function, aggregation time period, imputation method, etc., we would be able to track how changes in feature definitions between experiments shifted the distributions of features as well as monitor how feature distributions for the same feature definitions change over time (and throw warnings or errors if, e.g., variance on a feature dropped dramatically between batches). Currently, from_obj logic changes are hidden because they affect the experiment hash but not the feature names.
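A sketch of how such a version hash could work. The field names below are illustrative, not triage's actual config keys; the point is that any change to the definition, including the from_obj, produces a new version:

```python
import hashlib
import json

def feature_version(definition: dict) -> str:
    """Hash a canonical serialization of a feature definition so that any
    change to query logic, aggregation, time period, or imputation yields
    a new version identifier."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

v1 = feature_version({
    "from_obj": "events",
    "aggregation": "count",
    "interval": "1y",
    "imputation": "zero",
})
v2 = feature_version({
    "from_obj": "events_cleaned",  # from_obj logic changed
    "aggregation": "count",
    "interval": "1y",
    "imputation": "zero",
})
assert v1 != v2  # the from_obj change now surfaces in the feature version
```

Because the hash covers the whole definition, a from_obj change would no longer be hidden behind an unchanged feature name.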

There are some complications to this approach based on how the group and triage typically operate. Data are received and processed in batches from partners, but the definition of a batch in triage is more closely tied to the experiment and experiment run. Storing every experiment or experiment run as a new batch is likely overly redundant: if you change the label definition, you don't really need a new batch for all of the features, but if you rerun the same experiment on new source data, you do. We could consider the hash of the experiment components (e.g., the label definition) in a batch definition, but triage has no good way of knowing which batch the source data correspond to, so it would have no good basis for deciding when to create a new batch for the same configuration.

A couple of alternatives for this:

  • Triage gains some way of reading the batch version of the source data and smartly updates its batches when cohort, label, etc. definitions change (only ever adding to existing batches); we currently do this for at least one project with the record linkage timestamp user_metadata key, but making that generalizable to different methods of versioning source batches is harder
  • A triage "batch" incorporates everything but the learner grid, including the run time, and we accept that there will be some redundant batches

In either case, batch_id is added as metadata to the experiment_runs table, and a batch_metadata table is introduced, potentially subsuming some of the concepts from experiment_runs and/or experiments.
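A minimal sketch of what the proposed batch_metadata table and the batch_id link on experiment_runs could look like. Table and column names are assumptions for illustration, and sqlite3 is used only to keep the example self-contained; in triage this would live in the Postgres results schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batch_metadata (
    batch_id        TEXT PRIMARY KEY,  -- hash of config + source batch version
    source_version  TEXT,              -- e.g., record linkage timestamp
    config_hash     TEXT,              -- cohort/label/feature definitions
    created_at      TEXT
);
CREATE TABLE experiment_runs (
    run_id    INTEGER PRIMARY KEY,
    batch_id  TEXT REFERENCES batch_metadata (batch_id)
);
""")
conn.execute(
    "INSERT INTO batch_metadata VALUES (?, ?, ?, ?)",
    ("abc123", "2021-08-01", "f00d", "2021-08-09"),
)
conn.execute("INSERT INTO experiment_runs (batch_id) VALUES (?)", ("abc123",))

# Every run is now traceable to the source data batch it was built from.
row = conn.execute(
    """SELECT b.source_version FROM experiment_runs r
       JOIN batch_metadata b USING (batch_id)"""
).fetchone()
```

With this link in place, re-testing a model on a different label period becomes a query against the stored batch rather than a rebuild.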

@ecsalomon (Contributor, Author) commented:
What happens to the `replace` flag under this paradigm? `replace` indicates that there was an upstream error in the batch process (e.g., an error in cleaning, or PII leakage) and that the entire batch (features, labels) and all of its dependencies (models, evaluations) should be replaced. This is the only time that data should be dropped or updated.
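The `replace` semantics described above amount to a cascading delete from the batch down through everything built on it, which foreign keys can express directly. A sketch under the assumed schema from this thread (table names are illustrative; sqlite3 again stands in for Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite requires opting in
conn.executescript("""
CREATE TABLE batches (batch_id TEXT PRIMARY KEY);
CREATE TABLE features (
    id INTEGER PRIMARY KEY,
    batch_id TEXT REFERENCES batches ON DELETE CASCADE
);
CREATE TABLE models (
    id INTEGER PRIMARY KEY,
    batch_id TEXT REFERENCES batches ON DELETE CASCADE
);
INSERT INTO batches VALUES ('bad_batch');
INSERT INTO features (batch_id) VALUES ('bad_batch');
INSERT INTO models (batch_id) VALUES ('bad_batch');
""")

# replace: the upstream batch was bad, so drop it and everything built on it
conn.execute("DELETE FROM batches WHERE batch_id = 'bad_batch'")
remaining = conn.execute("SELECT count(*) FROM features").fetchone()[0]
assert remaining == 0  # dependent features were dropped with the batch
```

Scoping the delete to one batch_id is what keeps `replace` from clobbering unaffected batches.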

@hunterowens (Member) commented Aug 9, 2021 via email
