
Default window #298

Open
bendruitt opened this issue Jun 6, 2022 · 0 comments

Comments


bendruitt commented Jun 6, 2022

Unless I am very mistaken, there is a big problem with using this library for training a model to be used in production.

For example, if you use the ta.add_all_ta_features(...) function to create your ML dataset, you end up with some indicators calculated over their default window and others calculated over ALL the data available.

Here is an example with two dataframes drawn from exactly the same data: one is 1000 rows long and the other is 1001 rows long. Below are the results of (df_last_1000.iloc[100:] - df_last_1001.iloc[100:]).sum(). As you can see, indicators that use ALL the data in their calculations differ between the two, while those with a defined window do not:

open                         0.000000e+00
high                         0.000000e+00
low                          0.000000e+00
close                        0.000000e+00
volume                       0.000000e+00
volume_adi                   2.009133e+05
volume_obv                  -8.036533e+05
volume_cmf                   0.000000e+00
volume_fi                   -1.443620e-03
volume_em                    0.000000e+00
volume_sma_em                0.000000e+00
volume_vpt                   0.000000e+00
volume_vwap                  0.000000e+00
volume_mfi                   0.000000e+00
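For reference, here is a minimal sketch of how the comparison above can be reproduced. The source frame df and the lowercase column names are assumptions about the data layout, not part of the library:

    import ta

    # Assumed setup: df is an OHLCV dataframe with a datetime index and
    # lowercase column names; adjust the column arguments to your data.
    df_last_1001 = df.iloc[-1001:]
    df_last_1000 = df.iloc[-1000:]

    feats_1001 = ta.add_all_ta_features(
        df_last_1001.copy(), open="open", high="high", low="low",
        close="close", volume="volume")
    feats_1000 = ta.add_all_ta_features(
        df_last_1000.copy(), open="open", high="high", low="low",
        close="close", volume="volume")

    # pandas aligns the subtraction on the shared datetime index, so only
    # overlapping rows are compared. Cumulative indicators such as
    # volume_adi and volume_obv differ on every shared row; windowed
    # indicators do not.
    print((feats_1000.iloc[100:] - feats_1001.iloc[100:]).sum())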

The most important implication is that when running a model trained on data generated by ta.add_all_ta_features(...), the dataframe used to generate a feature vector in production would need to be EXACTLY the same length as the dataset used to train the model. If you trained the model on a significant span of financial data, this constraint becomes impractical in production due to the calculation expense.

The workaround is, of course, to calculate the indicators for each row by iterating over your dataset and applying ta.add_all_ta_features(...) to a fixed number of preceding rows. However, this option should be part of the library. For example:

ta.add_all_ta_features(df, max_window=100, ...)

This would increase the expense of the operation, but it would ensure that you know exactly how much data is needed to calculate a suitable vector for your production application.
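As a stopgap, here is a rough sketch of the manual workaround. The helper name and the window size are my own invention, not library API:

    import pandas as pd
    import ta

    def add_ta_features_windowed(df, max_window=100):
        # Hypothetical helper, not part of the ta library: compute the
        # full feature set for each row using only the preceding
        # max_window rows, so the output no longer depends on the total
        # dataset length.
        rows = []
        for end in range(max_window, len(df) + 1):
            window = df.iloc[end - max_window:end].copy()
            feats = ta.add_all_ta_features(
                window, open="open", high="high", low="low",
                close="close", volume="volume")
            rows.append(feats.iloc[-1])  # keep only the newest row
        return pd.DataFrame(rows)

This is obviously O(n × max_window), which is exactly why the option belongs inside the library, where it could be implemented efficiently.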

The second gotcha: given that a proportion of the indicators calculated by ta.add_all_ta_features(...) have no defined window, it is very difficult to generate a suitable pair of train / test sets. If the function is applied BEFORE the split, information leaks across the train / test sets. If it is applied AFTER the split, the values generated by ta.add_all_ta_features(...) depend on how long the training set is compared with the test set. You could, of course, make the train / test sets the same length, but that is yet another constraint introduced by this issue.
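With a fixed maximum window, this split problem disappears as well; a sketch using the hypothetical helper above:

    # Each feature row now depends only on its trailing max_window rows,
    # so a plain chronological split leaks nothing in either direction.
    features = add_ta_features_windowed(df, max_window=100)
    split = int(len(features) * 0.8)
    train, test = features.iloc[:split], features.iloc[split:]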

Perhaps I am missing something? If so, clarification would be much appreciated!
