Should we support not updating the RCF model in the TRCF process method? #299

Open
ylwu-amzn opened this issue Feb 12, 2022 · 1 comment
Comments

@ylwu-amzn
Contributor

The TRCF (ThresholdedRandomCutForest) process method always updates the internal RandomCutForest (code: ThresholdedRandomCutForest, Preprocessor).

Per my understanding, TRCF is built for time series data, so it looks reasonable to always update the current model with the latest data so it adapts to the latest pattern. For example, with steadily increasing sales, the high daily sales flagged as anomalies last season may no longer count as anomalies once the RCF model is updated with this season's (on average higher) daily sales. But for other use cases, such as a stable error-rate metric with no trend, updating RCF with anomalous points seems to introduce noise. For example, a user trains an RCF model well on normal data and then wants to keep the model immutable, detecting anomalies without updating the internal RCF model. Based on this understanding, should we support not updating the RCF model in the TRCF process method?
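
To illustrate the two patterns I have in mind, here is a rough sketch using the base RandomCutForest API (builder, getAnomalyScore, update) as shown in the repository README; the configuration values and the training-data helper are just placeholders:

```java
import com.amazon.randomcutforest.RandomCutForest;

public class ScoreOnlyExample {
    public static void main(String[] args) {
        // Placeholder configuration; real values depend on the use case.
        RandomCutForest forest = RandomCutForest.builder()
                .dimensions(1)
                .numberOfTrees(50)
                .sampleSize(256)
                .build();

        // Learning phase: feed normal data so the forest models it.
        for (double[] point : normalTrainingData()) {
            forest.update(point);
        }

        // Score-only phase: compute a score but do NOT feed the point back,
        // so anomalies cannot shift the model.
        double[] candidate = new double[] { 42.0 };
        double score = forest.getAnomalyScore(candidate);
        // forest.update(candidate);  // intentionally skipped to keep the model fixed
        System.out.println("score = " + score);
    }

    // Hypothetical stand-in for the user's normal training data.
    private static double[][] normalTrainingData() {
        return new double[][] { { 1.0 }, { 1.1 }, { 0.9 }, { 1.05 } };
    }
}
```

With TRCF there is no equivalent split: process(point, timestamp) both scores and updates in one call, which is what prompts this question.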

Note: I'm not an ML expert and don't understand RCF in depth. Sorry if this is too naive and wastes your time.

@sudiptoguha
Contributor

"Note: I'm not ML expert, even not understand the RCF in deep. Sorry if this is too naive and waste your time reading."

No worries. It is a legitimate question. There are two viewpoints of any piece of software here:

  1. As its advertised value: in this case continuous ML, where models are updated continually. Such a pipeline is not easy to implement correctly, and that is the value.
  2. As desired by a specific user: in this case, stopping the continuously learning model on command. This action may go against the intended development of the software, and ingrained assumptions may be violated.

In the case of Apache 2.0 open source, there is an easy choice: the repository should conform to (1), so that the collective does not have to worry about unintended artifacts or incorrect use. Individual consuming projects, which know what they are modifying, can and should modify the code to their liking.

In the case of RCFs (and any method that can be used on time series), the issue is not just stopping updates but also restarting them at some later time; one follows the other as night follows day. Such start/stop flexibility can break the premise behind RCF and can be confusing. Even worse would be a case where such start/stop appears to work, but the real reason for it is masked behind RCF.
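
To make the start/stop concern concrete, here is a purely hypothetical sketch of the kind of gate a consuming project might bolt onto a fork; nothing like this exists in the library, and the package/type names here are from memory:

```java
import com.amazon.randomcutforest.parkservices.AnomalyDescriptor;
import com.amazon.randomcutforest.parkservices.ThresholdedRandomCutForest;

// Hypothetical wrapper in a consuming project's fork; NOT part of the library.
public class GatedTRCF {
    private final ThresholdedRandomCutForest trcf;
    private boolean updatesEnabled = true;

    public GatedTRCF(ThresholdedRandomCutForest trcf) {
        this.trcf = trcf;
    }

    public void setUpdatesEnabled(boolean enabled) {
        // Restarting is the hard part: while updates are off, the
        // preprocessor's shingle/time-decay state stops tracking the stream,
        // so scores after re-enabling may no longer mean what they did.
        this.updatesEnabled = enabled;
    }

    public AnomalyDescriptor process(double[] point, long timestamp) {
        if (updatesEnabled) {
            return trcf.process(point, timestamp);
        }
        // There is no public score-only path through process(); exposing one
        // is exactly the kind of ingrained assumption that would be violated.
        throw new UnsupportedOperationException("score-only path not implemented");
    }
}
```

The point of the sketch is that the gate itself is trivial; the hard and error-prone part is what happens to the internal state between stop and restart.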

I would recommend revisiting this later, once the benefit (if any) of TRCF is established; then we can discuss more nuanced options.
