Should we support not updating the RCF model in the TRCF process method? #299

Open
ylwu-amzn opened this issue Feb 12, 2022 · 1 comment
Comments

@ylwu-amzn
Contributor

The TRCF (ThresholdedRandomCutForest) process method always updates the internal RandomCutForest (code: ThresholdedRandomCutForest, Preprocessor).

Per my understanding, TRCF is built for time series data, so it looks reasonable to always update the current model with the latest data so it adapts to the latest pattern. For example, with steadily increasing sales, the high daily sales flagged as anomalies last season may no longer count as anomalies once the RCF model is updated with this season's (on average higher) daily sales. But for other use cases, such as a stable error-rate metric with no trend, updating RCF with anomalous points seems to introduce noise. For example, a user trains an RCF model well on normal data and then wants to keep the model immutable, detecting anomalies without updating the internal RCF model. Based on this understanding, should we support not updating the RCF model in the TRCF process method?
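
To illustrate the two patterns I have in mind, here is a rough sketch using the base RandomCutForest API (builder, getAnomalyScore, update) as shown in the repository README; the configuration values and the training-data helper are just placeholders:

```java
import com.amazon.randomcutforest.RandomCutForest;

public class ScoreOnlyExample {
    public static void main(String[] args) {
        // Placeholder configuration; real values depend on the use case.
        RandomCutForest forest = RandomCutForest.builder()
                .dimensions(1)
                .numberOfTrees(50)
                .sampleSize(256)
                .build();

        // Learning phase: feed normal data so the forest models it.
        for (double[] point : normalTrainingData()) {
            forest.update(point);
        }

        // Score-only phase: compute a score but do NOT feed the point back,
        // so anomalies cannot shift the model.
        double[] candidate = new double[] { 42.0 };
        double score = forest.getAnomalyScore(candidate);
        // forest.update(candidate);  // intentionally skipped to keep the model fixed
        System.out.println("score = " + score);
    }

    // Hypothetical stand-in for the user's normal training data.
    private static double[][] normalTrainingData() {
        return new double[][] { { 1.0 }, { 1.1 }, { 0.9 }, { 1.05 } };
    }
}
```

With TRCF there is no equivalent split: process(point, timestamp) both scores and updates in one call, which is what prompts this question.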

Note: I'm not an ML expert and don't understand RCF in depth. Sorry if this is too naive and wastes your time.

@sudiptoguha
Contributor

"Note: I'm not ML expert, even not understand the RCF in deep. Sorry if this is too naive and waste your time reading."

No worries. It is a legitimate question. There are two viewpoints of any piece of software here:

  1. As its advertised value: in this case continuous ML, where models are updated continually. Such a pipeline is not easy to implement correctly, and that is the value.
  2. As desired by a specific user: in this case, stopping the continuously learning model on command. This action may go against the intended development of the software, and ingrained assumptions may be violated.

In the case of Apache 2.0 open source, there is an easy choice: the repository should conform to (1), so that the collective does not have to worry about unintended artifacts or incorrect use. Individual consuming projects, which know what they are modifying, can and should modify the code to their liking.

In the case of RCFs (and any method that can be used on time series), the issue is not just stopping updates but also restarting them at some later time; one follows the other as night follows day. Such start/stop flexibility can break the premise behind RCF and can be confusing. Even worse would be a case where such start/stop appears to work, but the real reason for it is masked behind RCF.
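
To make the start/stop concern concrete, here is a purely hypothetical sketch of the kind of gate a consuming project might bolt onto a fork; nothing like this exists in the library, and the package/type names here are from memory:

```java
import com.amazon.randomcutforest.parkservices.AnomalyDescriptor;
import com.amazon.randomcutforest.parkservices.ThresholdedRandomCutForest;

// Hypothetical wrapper in a consuming project's fork; NOT part of the library.
public class GatedTRCF {
    private final ThresholdedRandomCutForest trcf;
    private boolean updatesEnabled = true;

    public GatedTRCF(ThresholdedRandomCutForest trcf) {
        this.trcf = trcf;
    }

    public void setUpdatesEnabled(boolean enabled) {
        // Restarting is the hard part: while updates are off, the
        // preprocessor's shingle/time-decay state stops tracking the stream,
        // so scores after re-enabling may no longer mean what they did.
        this.updatesEnabled = enabled;
    }

    public AnomalyDescriptor process(double[] point, long timestamp) {
        if (updatesEnabled) {
            return trcf.process(point, timestamp);
        }
        // There is no public score-only path through process(); exposing one
        // is exactly the kind of ingrained assumption that would be violated.
        throw new UnsupportedOperationException("score-only path not implemented");
    }
}
```

The point of the sketch is that the gate itself is trivial; the hard and error-prone part is what happens to the internal state between stop and restart.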

I would recommend revisiting this later, once the benefit (if any) of TRCF is established; then we can discuss more nuanced options.
