
Timeseries Cross Validation #200

Open · jmrichardson opened this issue Sep 27, 2019 · 6 comments

Labels: Enhancement (New feature or request)

Comments

@jmrichardson

Hi,

So far loving this package! Question, I am using time series data and would like to use a more sophisticated cross validation than TimeSeriesSplit offered by sklearn. Specifically, I am interested in using the following CV which has a similar API to sklearn:

https://github.com/sam31415/timeseriescv

Here is a snip of my code:

env = Environment(
    train_dataset=X_train,
    test_dataset=X_test,
    target_column='bin',
    results_path='HyperparameterHunterAssets',  # Where your result files will go
    metrics=['roc_auc_score'],  # Callables, or strings referring to `sklearn.metrics`
    cv_type=PurgedWalkForwardCV,
    cv_params=dict(n_splits=10, pred_times=pd.Series(times.index), eval_times=['t1']),
    verbose=1,
    # cv_type=TimeSeriesSplit,  # Class, or string in `sklearn.model_selection`
    # cv_params=dict(n_splits=5)
)

experiment = CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        objective="reg:squarederror", max_depth=3, n_estimators=100, subsample=0.5
    ),
)

Here is the error output:

<11:23:49> Cross-Experiment Key:   'VfQ6_-2CMEXKfgeAQJXwJmO2KTTdPhBamZ4m9VqJaF4='
<11:23:49> Validated Environment:  'VfQ6_-2CMEXKfgeAQJXwJmO2KTTdPhBamZ4m9VqJaF4='
<11:23:49> Initialized Experiment: '4c491538-1ec9-49b6-8dca-380994441846'
<11:23:49> Hyperparameter Key:     'DD6sYbmG4UVOUoHZRxbuJUcovYaWVcZbXcP4dVGQacI='
<11:23:49> Uncaught exception!   TypeError: __init__() got an unexpected keyword argument 'pred_times'
Traceback (most recent call last):
  File "D:\Anaconda3\envs\alpha\lib\code.py", line 91, in runcode
    exec(code, self.locals)
  File "<input>", line 18, in <module>
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiment_core.py", line 165, in __call__
    return super().__call__(*args, **kwargs)
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 752, in __init__
    target_metric=target_metric,
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 598, in __init__
    target_metric=target_metric,
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 304, in __init__
    self.preparation_workflow()
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 357, in preparation_workflow
    self._additional_preparation_steps()
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 603, in _additional_preparation_steps
    self._initialize_folds()
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 768, in _initialize_folds
    self.folds = cv_type(**self.cv_params)
TypeError: __init__() got an unexpected keyword argument 'pred_times'

It looks as though HH doesn't like the "pred_times" and "eval_times" arguments required by PurgedWalkForwardCV. Any way to allow the arguments to be passed?

Thanks for your help!

@HunterMcGushion
Owner

Thanks for opening this, and thank you for the example code and traceback!

It looks like the issue stems from the fact that cv_params is used to initialize the cv_type class. There's currently no method of providing extra arguments to the split method of cv_type, which is where PurgedWalkForwardCV seems to expect pred_times and eval_times.
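To make the mismatch concrete, here's a minimal sketch (based on the `self.folds = cv_type(**self.cv_params)` line in your traceback, plus the `split` usage you'd normally write; `X_train`, `pred_times`, and `eval_times` stand in for your actual data):

from timeseriescv.cross_validation import PurgedWalkForwardCV

# HH builds the folds object from `cv_params` alone, roughly `cv_type(**cv_params)`,
# so the extra keywords land in __init__, which rejects them:
#   PurgedWalkForwardCV(n_splits=10, pred_times=pred_times, eval_times=eval_times)  # TypeError

# They actually belong to `split`, which HH currently calls with no extra kwargs:
folds = PurgedWalkForwardCV(n_splits=10)
for train_idx, test_idx in folds.split(X_train, pred_times=pred_times, eval_times=eval_times):
    pass  # no hook in HH for supplying `pred_times`/`eval_times` here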

I'd love to add support for this! Can you provide/recommend a toy dataset that looks like the one you're trying to use, so I can build some regression tests?

At the risk of making myself sound like a dummy, I don't think I've worked on a problem that uses both pred_times and eval_times. Are these generally provided as additional input columns in the same DataFrame as the rest of the input data? Or do they get some special treatment?

@HunterMcGushion added the Enhancement (New feature or request) label on Sep 27, 2019
@jmrichardson
Author

At the risk of making myself sound like a dummy

Lol! That is not possible!

I don't think I've worked on a problem that uses both pred_times and eval_times. Are these generally provided as additional input columns in the same DataFrame as the rest of the input data? Or do they get some special treatment?

They are not features in the DataFrame. They are distinct pandas Series of timestamps marking when an equity trade is made. In finance, we typically want to use a walk-forward analysis where we remove training samples whose eval times are posterior to the validation prediction times. These samples are purged based on the pred_times and eval_times arguments required by PurgedWalkForwardCV. In my case, I have a training set and two series of timestamps for pred_times and eval_times whose date indexes all match.
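To sketch the purge rule (illustrative only, not timeseriescv's actual implementation; pred_times and eval_times are the pandas Series described above, and train_idx/test_idx are positional indices from one split):

def purge_train_indices(train_idx, test_idx, pred_times, eval_times):
    # Drop training samples whose label evaluation time falls on or after the
    # validation fold's earliest prediction time (i.e., labels that overlap it)
    test_start = pred_times.iloc[test_idx].min()
    return [i for i in train_idx if eval_times.iloc[i] < test_start]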

I'd love to add support for this! Can you provide/recommend a toy dataset that looks like the one you're trying to use, so I can build some regression tests?

I created 3 pickle files for the data set (X_train, eval_times, pred_times):

https://github.com/jmrichardson/data

Here's a sample of how to make the splits:

from timeseriescv.cross_validation import PurgedWalkForwardCV

# X_train, pred_times, and eval_times are the three pickled objects linked above
cv = PurgedWalkForwardCV(n_splits=5)

count = 0
for train_set, test_set in cv.split(X_train, pred_times=pred_times, eval_times=eval_times, split_by_time=False):
    count += 1
    print(count)

Thank you so much for your help :)

@HunterMcGushion
Owner

Sorry about the delay! The TL;DR version of my findings is that I don’t think HH can support time-series CV right now. Here’s the long version:

I was able to throw together a quick-and-dirty subclass of PurgedWalkForwardCV that got past the error you posted. I'll include the code for it below, along with your snippet, slightly modified to use the new subclass. It got past the TypeError and successfully made predictions and evaluations for the first few folds, until it started expecting data for all n_splits when there was none. As I mentioned, I'm no expert on time-series forecasting, so I only just realized that OOF predictions won't actually be generated for all of the training data because of how time-series CV schemes work. In the end, the Experiment tried to keep going past the third fold, on to a fourth and fifth because n_splits=5; however, there are actually only 3 splits (as your second snippet shows).
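You can see the 3-vs-5 discrepancy without running a full Experiment by just exhausting the split generator (same placeholder names as in your snippet):

from timeseriescv.cross_validation import PurgedWalkForwardCV

cv = PurgedWalkForwardCV(n_splits=5)
actual = sum(1 for _ in cv.split(X_train, pred_times=pred_times, eval_times=eval_times))
print(actual)  # 3, even though n_splits=5 -- which is exactly what trips up HH's fold loop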

So I don’t think we can get this working properly just by using some combination of custom cv_type classes and lambda_callbacks. I believe a new Experiment class would need to be added to specifically deal with time-series data. Because the Experiment class is built out of very modular callback classes that it dynamically inherits at instantiation, I think we could get by with just some new predictors/evaluators callbacks that are updated to make predictions and evaluations for the appropriate number of splits for a time-series problem. If you’re interested in contributing, I’d love to work together to add support for this. At the moment, I’m still not familiar enough with time-series problems to be able to get the job done myself. Either way, I’d love to see this added soon!

My quick and dirty PurgedWalkForwardCV subclass code follows, along with the output.

from hyperparameter_hunter import Environment, CVExperiment
from timeseriescv.cross_validation import PurgedWalkForwardCV
from xgboost import XGBClassifier
import pandas as pd

class UglyPurgedWalkForwardCV(PurgedWalkForwardCV):
    def __init__(self, pred_times=None, eval_times=None, split_by_time=False, **kwargs):
        """Override initialization to receive the three extra kwargs expected by 
        :meth:`split`. Mangle the attribute names to avoid any possible 
        collisions with the original attributes of :class:`PurgedWalkForwardCV`"""
        self.__pred_times = pred_times
        self.__eval_times = eval_times
        self.__split_by_time = split_by_time
        super().__init__(**kwargs)

    def split(self, X, y=None, **kwargs):
        """Override `split` to look more like SKLearn's CV classes, and fetch the 
        mangled attributes set on initialization, rather than expecting them here"""
        return super().split(
            X,
            y,
            pred_times=self.__pred_times,
            eval_times=self.__eval_times,
            split_by_time=self.__split_by_time
        )

if __name__ == "__main__":
    # Paths to the pickled X_train / pred_times / eval_times files linked above
    data_df = pd.read_pickle(train_data_path)
    p_times = pd.read_pickle(pred_times_path)
    e_times = pd.read_pickle(eval_times_path)

    env = Environment(
        train_dataset=data_df,
        target_column="bin",
        results_path="HyperparameterHunterAssets",
        metrics=["roc_auc_score"],
        cv_type=UglyPurgedWalkForwardCV,
        cv_params=dict(n_splits=5, pred_times=p_times, eval_times=e_times),
    )

    exp = CVExperiment(XGBClassifier)

Output/error traceback:

Cross-Experiment Key:   'pncMgwGMRAZMDReR0Sd8_6PF3817sHiugD2LI1SYXGI='
<18:28:05> Initialized Experiment: 'ee1aed75-4d3c-4d0d-8d51-9e0aa448ee97'
<18:28:05> Hyperparameter Key:     'o-wi1kDtaizmgwFvBNdNJDF7X_OJBQPoj0iG3Gne-SM='
<18:28:05>
<18:28:05> R0-f0-r-  |  OOF(roc_auc_score=0.43833)  |  Time: 0.04179 s
<18:28:05> R0-f1-r-  |  OOF(roc_auc_score=0.50000)  |  Time: 0.04892 s
<18:28:05> R0-f2-r-  |  OOF(roc_auc_score=0.50000)  |  Time: 0.05401 s
<18:28:05> Uncaught exception!   RuntimeError: generator raised StopIteration
Traceback (most recent call last):
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 792, in <genexpr>
    yield (next(indices) for _ in range(cv_params["n_splits"]))
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "time_series_cv_example.py", line 82, in <module>
    execute()
  File "time_series_cv_example.py", line 78, in execute
    exp = CVExperiment(XGBClassifier)
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiment_core.py", line 165, in __call__
    return super().__call__(*args, **kwargs)
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 749, in __init__
    target_metric=target_metric,
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 595, in __init__
    target_metric=target_metric,
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 303, in __init__
    self.experiment_workflow()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 335, in experiment_workflow
    self.execute()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 607, in execute
    self.cross_validation_workflow()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 623, in cross_validation_workflow
    for self._fold, (self.train_index, self.validation_index) in enumerate(rep_indices):
RuntimeError: generator raised StopIteration

I'd love to hear your thoughts on adding support for time-series problems! Sorry it's not working at the moment, though!

@jmrichardson
Author

Thank you so much for your effort to get this to work. I should be able to work with sklearn's TimeSeriesSplit instead, so it's not a huge deal. I will definitely look into how I can help support timeseriescv. I am working on a project at the moment that is taking a considerable amount of time, so I'm not sure I can look into it for a few weeks. I will definitely keep you posted on how it goes.

On a side note, this package is really nice. I really appreciate you sharing it with the community! It will definitely be part of my toolbox going forward!

@HunterMcGushion
Owner

Does SKLearn's TimeSeriesSplit work with HH? If so, I should probably add an example using it... Thanks a lot for your support!
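If it does, the setup would presumably just mirror the commented-out lines in your first snippet (untested sketch, same dataset as before):

from hyperparameter_hunter import Environment, CVExperiment
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBClassifier

env = Environment(
    train_dataset=X_train,
    target_column="bin",
    results_path="HyperparameterHunterAssets",
    metrics=["roc_auc_score"],
    cv_type=TimeSeriesSplit,     # class, or the string "TimeSeriesSplit"
    cv_params=dict(n_splits=5),  # no split-time kwargs needed, unlike PurgedWalkForwardCV
)

exp = CVExperiment(XGBClassifier)

Since TimeSeriesSplit yields exactly n_splits (train, test) pairs, it shouldn't hit the StopIteration above, though whether the OOF bookkeeping behaves is exactly what an example would need to confirm.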

@jmrichardson
Author

Sorry for the delay, I have been traveling and will be for the next couple of weeks. I recall that it accepted the parameters (i.e., it didn't need the event times). However, I don't recall whether I completely tested sklearn's TimeSeriesSplit. I will give it a shot soon and report back. Thanks again for your help on this.
