you are preprocessing test dataset before implementing the model #13

Open
mg64ve opened this issue Jan 13, 2019 · 15 comments

Comments

@mg64ve

mg64ve commented Jan 13, 2019

This means you are assuming knowledge of the future!
That would never work in practice.
Regards.

@maxbeyer1

Not to speak for the developer, but you are aware that this is how you train an AI model, right? You can preprocess the dataset and then use the trained model when you run it on a live sample.

@mg64ve
Author

mg64ve commented Feb 13, 2019

In general, if you preprocess the whole dataset, then split it into train/test, train the model, and check the results on the test part, you are making a mistake: preprocessing the combined train/test data assumes knowledge of the future. I thought that was the case here, but I am no longer sure. I need to check the code again and I don't have time right now.
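
To make the concern concrete, here is a minimal sketch (not taken from the repository) comparing normalization statistics fit on the full series with statistics fit on the training part only. The toy prices, the split point, and the helper names `fit_mean_std` and `standardize` are assumptions made up for this illustration.

```python
import statistics


def fit_mean_std(values):
    """Return (mean, population standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Apply already-fitted parameters; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]


# Toy series; the last 3 points play the role of the unseen "future".
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 30.0, 32.0, 34.0]
split = 7
train, test = prices[:split], prices[split:]

# Leaky: parameters estimated on train + test together.
leaky_mean, leaky_std = fit_mean_std(prices)

# Causal: parameters estimated on the training part only.
train_mean, train_std = fit_mean_std(train)

leaky_train = standardize(train, leaky_mean, leaky_std)
clean_train = standardize(train, train_mean, train_std)
clean_test = standardize(test, train_mean, train_std)

print("mean fit on everything:", leaky_mean)
print("mean fit on train only:", train_mean)
```

In the leaky variant the training inputs are shifted by a mean that already reflects the later jump in price, which is information a live system would not have had at training time.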

@timothyyu

timothyyu commented Feb 19, 2019

@mg64ve I am looking into this exact issue in implementing the WSAE-LSTM model, which uses the wavelet transform to denoise data (Bao et al., 2017):
https://github.com/timothyyu/wsae-lstm

My implementation is a work in progress and still far from complete, but my understanding so far is that you cannot apply the wavelet transform to the entire dataset in one pass. You can, however, arrange the data continuously in a clearly defined train-validate-test split, which appears to mostly sidestep this issue.

From Bao et al. (2017) defining the train-validate-test split arrangement for continuous training (Fig 7):
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180944#pone-0180944-g007
image
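
The rolling arrangement described above can be sketched as a generator of index ranges. This is an illustrative sketch only: the function name `walk_forward_splits` and the window lengths used below are assumptions, not values taken from Bao et al. (2017) or from the wsae-lstm code.

```python
def walk_forward_splits(n_samples, train_len, val_len, test_len):
    """Yield (train, validate, test) index ranges that roll forward in time.

    Each window is shifted by test_len, so test ranges never overlap, and
    within a split the train and validate ranges precede the test range.
    """
    start = 0
    window = train_len + val_len + test_len
    while start + window <= n_samples:
        train_end = start + train_len
        val_end = train_end + val_len
        yield (
            range(start, train_end),
            range(train_end, val_end),
            range(val_end, val_end + test_len),
        )
        start += test_len


# Hypothetical sizes, chosen only to keep the printout short.
splits = list(walk_forward_splits(n_samples=20, train_len=8, val_len=2, test_len=2))
for train, val, test in splits:
    print(list(train), list(val), list(test))
```

Any preprocessing (scaling, denoising) would then be applied per split rather than once over all `n_samples`.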

@mg64ve
Author

mg64ve commented Feb 20, 2019

@timothyyu absolutely right! You should apply the wavelet transform, and any other kind of preprocessing, separately to the train and test datasets.
I am also working on this topic and I recommend the following article:

Recurrent Neural Networks for Financial Time-Series Modelling / Gavin Tsang; Jingjing Deng; Xianghua Xie

It has some interesting concepts.
Cheers

@timothyyu

Here's an example of applying the wavelet transform to the first two train-validate-test splits of the CSI 300 index data:
image

@mg64ve
Author

mg64ve commented Mar 1, 2019

> Here's an example of applying the wavelet transform to the first two train-validate-test splits of the CSI 300 index data:
> image

OK @timothyyu, does this example come from your code?

@timothyyu

timothyyu commented Mar 2, 2019

@mg64ve yes, this is from my own code. I have an updated implementation of the above (scaling is done on the train set, and then applied to the validate and test set per period/interval, and then the wavelet transform is applied to each train-validate-test split individually):
image
https://github.com/timothyyu/wsae-lstm/blob/master/wsae_lstm/visualize.py

@timothyyu

@mg64ve here is an updated version of the above that clearly illustrates the train-validate-test split, with the effect of scaling and scaling + denoising being visualized:

Implemented as of v0.1.2 / b715d88
https://github.com/timothyyu/wsae-lstm/releases/tag/v0.1.2
image

@mg64ve
Author

mg64ve commented Mar 3, 2019

Hi @timothyyu thanks for your reply, let me check the code.
One more question: how do you apply scaling?
The same concept applies to scaling: the validate and test datasets should be scaled using only parameters that were obtained without seeing them.

@timothyyu

timothyyu commented Mar 4, 2019

Scaling is done with RobustScaler on the train set, and then the same parameters used to scale the train set are applied to the validate and test sets.

ddi_scaled[index_name][intervals from 1-24][1-train,2-validate,3-test]

```python
import copy

import pandas as pd
from sklearn import preprocessing


def scale_periods(dict_dataframes):
    # Deep copy so the input dictionary is left untouched.
    ddi_scaled = copy.deepcopy(dict_dataframes)

    for index_name in ddi_scaled:
        scaler = preprocessing.RobustScaler(with_centering=True)

        for interval in ddi_scaled[index_name]:
            splits = ddi_scaled[index_name][interval]

            # Fit on the train split only (refit for every interval).
            X_train = splits[1]
            X_train_scaled = scaler.fit_transform(X_train)
            splits[1] = pd.DataFrame(X_train_scaled, columns=list(X_train.columns))

            # Validate (2) and test (3) reuse the train parameters; they are never fit.
            for split in (2, 3):
                X = splits[split]
                X_scaled = scaler.transform(X)
                splits[split] = pd.DataFrame(X_scaled, columns=list(X.columns))

    return ddi_scaled
```

@mg64ve
Author

mg64ve commented Mar 8, 2019

Hi @timothyyu, I had a look at autoencoder.py and model.py. You basically don't use embedding.
So you denoise the whole test dataset and then use a value from the denoised test dataset to predict the next step.
This can't happen in real life, because even within the test dataset we only know the past.
That's why I think embedding is more useful: at each instant t you process only the interval [t-N, t].
What do you think about it?
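
The window-based idea above can be sketched as follows. This is a minimal illustration, not the repository's method: a plain moving average stands in for the wavelet denoiser, and `causal_smooth` and the window length `n` are hypothetical names chosen for this example. The point is only that the output at time t is computed from [t-N, t] and never from later samples.

```python
def causal_smooth(series, n):
    """Smooth each point using only the window [t-n, t] (no future samples)."""
    smoothed = []
    for t in range(len(series)):
        window = series[max(0, t - n): t + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


series = [1.0, 2.0, 3.0, 4.0, 100.0]
full = causal_smooth(series, n=2)

# Recomputing on a truncated series gives the same values for the shared
# prefix, so the later spike (100.0) never influences earlier outputs.
prefix = causal_smooth(series[:4], n=2)
print(full[:4] == prefix)
```

A denoiser applied once to the whole test set does not have this property in general, because each denoised point can depend on samples that come after it.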

@hqsds

hqsds commented Apr 16, 2019

Hi, thank you for sharing your work, it's interesting. I am looking at the code, but I always get some errors when generating results for stocks (it works well for FX rates). I would like to compare my results with yours for AAPL. Could you also present predicted log return vs. historical log return for AAPL for the most recent three years, or one year if possible? Thank you very much!

@az13js

az13js commented Jun 6, 2019

Why stock_price = np.exp(np.reshape(prediction, (1,)))*stock_data_test[i] ?
File: model.py

stock_price = np.exp(np.reshape(prediction, (1,)))*stock_data_test[i]

@mg64ve
Author

mg64ve commented Jun 10, 2019

@az13js I believe it is because log returns are used during preprocessing, so np.exp turns the predicted log return back into a price relative to stock_data_test[i].
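
A short sketch of why the exponential appears, assuming the model's target is the one-step log return r_t = log(p_{t+1} / p_t). That is an assumption about the preprocessing, as suggested above, and has not been checked against model.py. The prices are made up.

```python
import math

prices = [100.0, 102.0, 101.0]

# Log return between consecutive prices: r_t = log(p_{t+1} / p_t).
log_returns = [math.log(nxt / cur) for cur, nxt in zip(prices, prices[1:])]

# Inverting the transform: p_{t+1} = exp(r_t) * p_t, which has the same
# form as np.exp(prediction) * stock_data_test[i].
recovered = [math.exp(r) * p for r, p in zip(log_returns, prices)]

print(recovered)
```

With a perfect prediction of the log return, multiplying its exponential by the current price recovers the next price up to floating-point rounding.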

@leewi9

leewi9 commented Jun 10, 2020

yeah, you should first split, then preprocess
