The problem with the feature importances of the TS forest not matching what is described in the paper is known; see the second half of the discussion here: I believe @JonathanBechtel wanted to open an issue and/or look into it, but I can't find a link at the moment. Perhaps @JonathanBechtel can give more details here.
That's a bit of a subjective choice. Personally, I think HC2 is a little overrated and possibly the result of grad student descent on a well-known set of benchmark data. I would first try tuned feature extraction plus sklearn classifier variations (start simple, then iterate; a minimal sketch follows this paragraph), then tuned kernel- or distance-based classifiers, and you can also try some common DL-based classifiers. Either way, I recommend you ensure you have a robust benchmarking setup that allows you to check runtime and statistical performance, with an extra set held out for final validation if you play around with classifier choice, parameters, and pipeline architecture a lot.
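A minimal sketch of the "start simple, then iterate" idea, assuming `X_train`/`y_train`/`X_test`/`y_test` placeholders in an sktime panel format; estimator names reflect recent sktime versions and may differ in yours:

```python
# Sketch: a summary-feature + sklearn baseline, plus a kernel-based alternative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sktime.classification.feature_based import SummaryClassifier
from sktime.classification.kernel_based import RocketClassifier

candidates = {
    # summary statistics per series, fed to a random forest
    "summary+RF": SummaryClassifier(
        estimator=RandomForestClassifier(n_estimators=200)
    ),
    # ROCKET random-kernel features with the default ridge classifier
    "rocket": RocketClassifier(num_kernels=2000),
}

for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```

For a distance-based candidate, `KNeighborsTimeSeriesClassifier` from `sktime.classification.distance_based` is a similar drop-in.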
Hi team,
Thank you for making such a comprehensive package for time series that covers classification tasks as well. Below is a short summary of the problem, the data preparation used, and a few findings from using the TS Forest Classifier and HIVE-COTE 2.
Summary: Time-stamped data with sensor information (40+ columns/features), along with a binary flag where 1 = anomalous and 0 = non-anomalous behavior. There are close to 562,575 total records in the train data and 93,764 groupings based on the preparation in point 2 below.
Data Preparation: I am creating a rolling-window-based multi-index pandas DataFrame (source: https://stackoverflow.com/questions/75325171/dataframe-to-multiindex-for-sktime-format, first/accepted answer; a sketch of the windowing follows).
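A minimal sketch of that windowing, assuming a flat DataFrame `df` sorted by time, a hypothetical list `sensor_cols` of feature columns, and a binary `flag` column; the window length and labeling rule are placeholders:

```python
import numpy as np
import pandas as pd

window = 12            # placeholder window length
frames, labels = [], []

# Non-overlapping windows; the linked answer also covers overlapping ones.
for i, start in enumerate(range(0, len(df) - window + 1, window)):
    chunk = df.iloc[start:start + window]
    idx = pd.MultiIndex.from_product(
        [[i], range(window)], names=["instance", "time"]
    )
    frames.append(chunk[sensor_cols].set_axis(idx))
    labels.append(int(chunk["flag"].max()))   # anomalous if any record is

X = pd.concat(frames)   # sktime "pd-multiindex" panel: (instance, time) index
y = np.asarray(labels)
```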
Findings from TS Forest Classifier: I implemented it with the following parameter settings: TimeSeriesForestClassifier(min_interval=12, n_estimators=200, n_jobs=2, random_state=42), training on data up to a certain split date and computing accuracy_score on the test data (sketch below). Even though the accuracy of the model is around 95%, I noticed that the feature importance array has length 18, while I have 40+ columns. Is there a way to identify which columns were used, and why do the lengths not match?
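For concreteness, a minimal sketch of that setup, where `X_train`/`y_train`/`X_test`/`y_test` are placeholders for the windows before and after the chosen split date; the printed values just restate the observations reported above:

```python
from sklearn.metrics import accuracy_score
from sktime.classification.interval_based import TimeSeriesForestClassifier

clf = TimeSeriesForestClassifier(
    min_interval=12, n_estimators=200, n_jobs=2, random_state=42
)
clf.fit(X_train, y_train)             # windows on or before the split date
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))      # ~0.95 in my runs
print(len(clf.feature_importances_))       # 18, despite 40+ input columns
```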
Findings from HIVE-COTE 2: I tried to run it on the train data, but it seems to be too slow. Is it recommended to use this algorithm, given that it's one of the state-of-the-art methods?
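One hedged sketch for taming the runtime: sktime's HIVECOTEV2 exposes a time contract via time_limit_in_minutes, though exact parameter support depends on the installed version, and the value below is an illustrative placeholder:

```python
# Contracted HIVE-COTE 2 run: cap the training time of each component.
from sktime.classification.hybrid import HIVECOTEV2

hc2 = HIVECOTEV2(time_limit_in_minutes=60, n_jobs=2, random_state=42)
hc2.fit(X_train, y_train)
```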
I would appreciate a response from the dev team and/or the user community in general.
Thank you for your time.
Regards.