The problem with the feature importances of the TS forest not matching what is described in the paper is known; see the second half of the discussion here: I believe @JonathanBechtel wanted to open an issue and/or look into it, but I can't find a link at the moment. Perhaps @JonathanBechtel can give more details here.
That's a bit of a subjective choice. Personally, I think HC2 is a little overrated and possibly the result of grad student descent on a well-known set of benchmark data. I would first try tuned feature extraction plus sklearn classifier variations (start simple, then iterate; a minimal sketch follows this paragraph), then tuned kernel- or distance-based classifiers, and you can also try some common DL-based classifiers. Either way, I recommend you ensure you have a robust benchmarking setup that allows you to check runtime and statistical performance, with an extra set held out for final validation if you play around with classifier choice, parameters, and pipeline architecture a lot.
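A minimal sketch of the "start simple, then iterate" idea, assuming `X_train`/`y_train`/`X_test`/`y_test` placeholders in an sktime panel format; estimator names reflect recent sktime versions and may differ in yours:

```python
# Sketch: a summary-feature + sklearn baseline, plus a kernel-based alternative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sktime.classification.feature_based import SummaryClassifier
from sktime.classification.kernel_based import RocketClassifier

candidates = {
    # summary statistics per series, fed to a random forest
    "summary+RF": SummaryClassifier(
        estimator=RandomForestClassifier(n_estimators=200)
    ),
    # ROCKET random-kernel features with the default ridge classifier
    "rocket": RocketClassifier(num_kernels=2000),
}

for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```

For a distance-based candidate, `KNeighborsTimeSeriesClassifier` from `sktime.classification.distance_based` is a similar drop-in.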
Hi team,
Thank you for making such a comprehensive package for time series that covers classification tasks as well. Below is a short summary of the problem, the data preparation used, and a few findings from using the TS Forest Classifier and HIVE-COTE 2.
Summary: Time-stamped data with sensor information (40+ columns/features), along with a binary flag where 1 = anomalous and 0 = non-anomalous behavior. There are close to 562,575 total records in the train data and 93,764 groupings based on the preparation in point 2 below.
Data Preparation: I am creating a rolling-window-based multi-index pandas DataFrame (source: https://stackoverflow.com/questions/75325171/dataframe-to-multiindex-for-sktime-format, first/accepted answer; a sketch of the windowing follows).
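A minimal sketch of that windowing, assuming a flat DataFrame `df` sorted by time, a hypothetical list `sensor_cols` of feature columns, and a binary `flag` column; the window length and labeling rule are placeholders:

```python
import numpy as np
import pandas as pd

window = 12            # placeholder window length
frames, labels = [], []

# Non-overlapping windows; the linked answer also covers overlapping ones.
for i, start in enumerate(range(0, len(df) - window + 1, window)):
    chunk = df.iloc[start:start + window]
    idx = pd.MultiIndex.from_product(
        [[i], range(window)], names=["instance", "time"]
    )
    frames.append(chunk[sensor_cols].set_axis(idx))
    labels.append(int(chunk["flag"].max()))   # anomalous if any record is

X = pd.concat(frames)   # sktime "pd-multiindex" panel: (instance, time) index
y = np.asarray(labels)
```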
Findings from TS Forest Classifier: I implemented it with the following parameter settings: TimeSeriesForestClassifier(min_interval=12, n_estimators=200, n_jobs=2, random_state=42), training on data up to a certain split date and computing accuracy_score on the test data (sketch below). Even though the accuracy of the model is around 95%, I noticed that the feature importance array has length 18, while I have 40+ columns. Is there a way to identify which columns were used, and why do the lengths not match?
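For concreteness, a minimal sketch of that setup, where `X_train`/`y_train`/`X_test`/`y_test` are placeholders for the windows before and after the chosen split date; the printed values just restate the observations reported above:

```python
from sklearn.metrics import accuracy_score
from sktime.classification.interval_based import TimeSeriesForestClassifier

clf = TimeSeriesForestClassifier(
    min_interval=12, n_estimators=200, n_jobs=2, random_state=42
)
clf.fit(X_train, y_train)             # windows on or before the split date
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))      # ~0.95 in my runs
print(len(clf.feature_importances_))       # 18, despite 40+ input columns
```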
Findings from HIVE-COTE 2: I tried to run it on the train data, but it seems to be too slow. Is it recommended to use this algorithm, given that it's one of the state-of-the-art methods?
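One hedged sketch for taming the runtime: sktime's HIVECOTEV2 exposes a time contract via time_limit_in_minutes, though exact parameter support depends on the installed version, and the value below is an illustrative placeholder:

```python
# Contracted HIVE-COTE 2 run: cap the training time of each component.
from sktime.classification.hybrid import HIVECOTEV2

hc2 = HIVECOTEV2(time_limit_in_minutes=60, n_jobs=2, random_state=42)
hc2.fit(X_train, y_train)
```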
I would appreciate a response from the dev team and/or the user community in general.
Thank you for your time.
Regards.