Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate the matrix_type and is_test concepts #855

Open
ecsalomon opened this issue Aug 7, 2021 · 0 comments
Open

Eliminate the matrix_type and is_test concepts #855

ecsalomon opened this issue Aug 7, 2021 · 0 comments

Comments

@ecsalomon
Copy link
Contributor

ecsalomon commented Aug 7, 2021

Pulling this out into an issue. Context from my earlier comment on the retrain-and-predict PR: We decided on using matrix_type in the first version of triage to handle different label behavior in train and test in the inspections use case. It never made much conceptual sense, and introducing new code paths has re-exposed the weaknesses of that choice and is committing us to further weirdness.

I don't know if you all discussed this on Tuesday, but here are a couple of possible solutions for Triage v 5.0.0. The first is probably easier to implement, but I think it is the less good solution:

  • Eliminate both the matrix_type and is_test properties of matrices
  • Include missing labels as missing in all matrices
  • Replace include_missing_labels_in_train_as with two keys: impute_missing_labels_at_train and impute_missing_labels_at_test. These can be True or False. If True, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
  • The Architect writes matrices without label imputation, allowing them to be reused for any purpose
  • Catwalk does the label imputation on the fly when it is training or testing and always writes the labels to the database. This has high storage costs, but the reason we had to decide on the label imputation at matrix-building time initially was because we had not yet conceived of the train_results schema. Now that Triage can write train predictions, it can write train labels to the database, and the matrix_type concept is no longer needed. We could reduce the storage cost a bit by storing the labels separately from the predictions in the train and test results schemas (model is something like (matrix_uuid, labels_imputed, imputation_method, entity_id, as_of_date, label)) and joining predictions to labels through matrix_uuid and imputation behavior, but that join is likely complex (has to go through the experiments table?). We coulllllld squeeze one extra bit of metadata out by adding a flag for whether that label was imputed, but I don't have any ideas for where we would use that information, and it can always be reconstructed from the matrix itself.
  • The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
  • Separate matrix storage, rather than separate matrix types, is introduced for Production and Development.

An alternative:

  • Eliminate both the matrix_type and is_test properties of matrices
  • Replace include_missing_labels_in_train_as with two keys: impute_missing_labels_at_train and impute_missing_labels_at_test. These can be True or False. If True, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
  • Instead of storing labels in the train/test matrices, we begin storing the design matrices and the labels separately. Train labels and test labels have separate stores, and the imputation behavior is a key in the label metadata. This reduces (but does not eliminate as in the first proposal) duplication of storage in the EWS case.
  • We introduce new metadata tables for Labels and experiment-label pairs, and add a labels_id column to many of the places where we also have a matrix_id column (models, predictions, evaluations, etc.). This increases db storage a little but not as much as having to store all the train labels in the db.
  • Include all entities in the cohort in all design matrices
  • The replace pathway checks the imputation flag on the labels to determine whether the Architect needs to create new labels
  • Design matrices are combined with the correct labels on reading, and rows are dropped if there is no label and the imputation key is False
  • The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
  • You can still skip writing train results to conserve db storage
  • Production/predict-forward/retrain-and-predict uses design matrices but not labels

Originally posted by @ecsalomon in #631 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant