
Timeseries Cross Validation #200

Open · jmrichardson opened this issue Sep 27, 2019 · 6 comments

Labels: Enhancement (New feature or request)

Comments

@jmrichardson

Hi,

So far loving this package! Question, I am using time series data and would like to use a more sophisticated cross validation than TimeSeriesSplit offered by sklearn. Specifically, I am interested in using the following CV which has a similar API to sklearn:

https://github.com/sam31415/timeseriescv

Here is a snip of my code:

env = Environment(
    train_dataset=X_train,
    test_dataset=X_test,
    target_column='bin',
    results_path='HyperparameterHunterAssets',  # Where your result files will go
    metrics=['roc_auc_score'],  # Callables, or strings referring to `sklearn.metrics`
    cv_type=PurgedWalkForwardCV,
    cv_params=dict(n_splits=10, pred_times=pd.Series(times.index), eval_times=['t1']),
    verbose=1,
    # cv_type=TimeSeriesSplit,  # Class, or string in `sklearn.model_selection`
    # cv_params=dict(n_splits=5)
)

experiment = CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        objective="reg:squarederror", max_depth=3, n_estimators=100, subsample=0.5
    ),
)

Here is the error output:

<11:23:49> Cross-Experiment Key:   'VfQ6_-2CMEXKfgeAQJXwJmO2KTTdPhBamZ4m9VqJaF4='
<11:23:49> Validated Environment:  'VfQ6_-2CMEXKfgeAQJXwJmO2KTTdPhBamZ4m9VqJaF4='
<11:23:49> Initialized Experiment: '4c491538-1ec9-49b6-8dca-380994441846'
<11:23:49> Hyperparameter Key:     'DD6sYbmG4UVOUoHZRxbuJUcovYaWVcZbXcP4dVGQacI='
<11:23:49> Uncaught exception!   TypeError: __init__() got an unexpected keyword argument 'pred_times'
Traceback (most recent call last):
  File "D:\Anaconda3\envs\alpha\lib\code.py", line 91, in runcode
    exec(code, self.locals)
  File "<input>", line 18, in <module>
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiment_core.py", line 165, in __call__
    return super().__call__(*args, **kwargs)
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 752, in __init__
    target_metric=target_metric,
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 598, in __init__
    target_metric=target_metric,
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 304, in __init__
    self.preparation_workflow()
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 357, in preparation_workflow
    self._additional_preparation_steps()
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 603, in _additional_preparation_steps
    self._initialize_folds()
  File "D:\Anaconda3\envs\alpha\lib\site-packages\hyperparameter_hunter\experiments.py", line 768, in _initialize_folds
    self.folds = cv_type(**self.cv_params)
TypeError: __init__() got an unexpected keyword argument 'pred_times'

It looks as though HH doesn't like the "pred_times" and "eval_times" arguments required by PurgedWalkForwardCV. Any way to allow the arguments to be passed?

Thanks for your help!

@HunterMcGushion
Owner

Thanks for opening this, and thank you for the example code and traceback!

It looks like the issue stems from the fact that cv_params is used to initialize the cv_type class. There's currently no method of providing extra arguments to the split method of cv_type, which is where PurgedWalkForwardCV seems to expect pred_times and eval_times.
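To make the mismatch concrete, here's a minimal sketch (based on the `self.folds = cv_type(**self.cv_params)` line in your traceback, plus the `split` usage you'd normally write; `X_train`, `pred_times`, and `eval_times` stand in for your actual data):

from timeseriescv.cross_validation import PurgedWalkForwardCV

# HH builds the folds object from `cv_params` alone, roughly `cv_type(**cv_params)`,
# so the extra keywords land in __init__, which rejects them:
#   PurgedWalkForwardCV(n_splits=10, pred_times=pred_times, eval_times=eval_times)  # TypeError

# They actually belong to `split`, which HH currently calls with no extra kwargs:
folds = PurgedWalkForwardCV(n_splits=10)
for train_idx, test_idx in folds.split(X_train, pred_times=pred_times, eval_times=eval_times):
    pass  # no hook in HH for supplying `pred_times`/`eval_times` here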

I'd love to add support for this! Can you provide/recommend a toy dataset that looks like the one you're trying to use, so I can build some regression tests?

At the risk of making myself sound like a dummy, I don't think I've worked on a problem that uses both pred_times and eval_times. Are these generally provided as additional input columns in the same DataFrame as the rest of the input data? Or do they get some special treatment?

@HunterMcGushion added the Enhancement (New feature or request) label on Sep 27, 2019
@jmrichardson
Author

At the risk of making myself sound like a dummy

Lol! That is not possible!

I don't think I've worked on a problem that uses both pred_times and eval_times. Are these generally provided as additional input columns in the same DataFrame as the rest of the input data? Or do they get some special treatment?

They are not features in the DataFrame. They are distinct pandas Series of timestamps marking when an equity trade is made. In finance, we typically want to use a walk-forward analysis where we remove training samples whose eval times are posterior to the validation prediction times. These samples are purged based on the pred_times and eval_times arguments required by PurgedWalkForwardCV. In my case, I have a training set and two series of timestamps for pred_times and eval_times whose date indexes all match.
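To sketch the purge rule (illustrative only, not timeseriescv's actual implementation; pred_times and eval_times are the pandas Series described above, and train_idx/test_idx are positional indices from one split):

def purge_train_indices(train_idx, test_idx, pred_times, eval_times):
    # Drop training samples whose label evaluation time falls on or after the
    # validation fold's earliest prediction time (i.e., labels that overlap it)
    test_start = pred_times.iloc[test_idx].min()
    return [i for i in train_idx if eval_times.iloc[i] < test_start]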

I'd love to add support for this! Can you provide/recommend a toy dataset that looks like the one you're trying to use, so I can build some regression tests?

I created 3 pickle files for the data set (X_train, eval_times, pred_times):

https://github.com/jmrichardson/data

Here's a sample of how to make the splits:

from timeseriescv.cross_validation import PurgedWalkForwardCV

# X_train, pred_times, and eval_times are the three pickled objects linked above
cv = PurgedWalkForwardCV(n_splits=5)

count = 0
for train_set, test_set in cv.split(X_train, pred_times=pred_times, eval_times=eval_times, split_by_time=False):
    count += 1
    print(count)

Thank you so much for your help :)

@HunterMcGushion
Owner

Sorry about the delay! The TL;DR version of my findings is that I don’t think HH can support time-series CV right now. Here’s the long version:

I was able to throw together a quick-and-dirty subclass of PurgedWalkForwardCV that got past the error you posted. I'll include the code for it below, along with your snippet, slightly modified to use the new subclass. It got past the TypeError and successfully made predictions and evaluations for the first few folds, until it started expecting data for all n_splits when there was none. As I mentioned, I'm no expert on time-series forecasting, so I only just realized that OOF predictions won't actually be generated for all of the training data because of how time-series CV schemes work. In the end, the Experiment tried to keep going past the third fold, on to a fourth and fifth because n_splits=5; however, there are actually only 3 splits (as your second snippet shows).
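You can see the 3-vs-5 discrepancy without running a full Experiment by just exhausting the split generator (same placeholder names as in your snippet):

from timeseriescv.cross_validation import PurgedWalkForwardCV

cv = PurgedWalkForwardCV(n_splits=5)
actual = sum(1 for _ in cv.split(X_train, pred_times=pred_times, eval_times=eval_times))
print(actual)  # 3, even though n_splits=5 -- which is exactly what trips up HH's fold loop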

So I don’t think we can get this working properly just by using some combination of custom cv_type classes and lambda_callbacks. I believe a new Experiment class would need to be added to specifically deal with time-series data. Because the Experiment class is built out of very modular callback classes that it dynamically inherits at instantiation, I think we could get by with just some new predictors/evaluators callbacks that are updated to make predictions and evaluations for the appropriate number of splits for a time-series problem. If you’re interested in contributing, I’d love to work together to add support for this. At the moment, I’m still not familiar enough with time-series problems to be able to get the job done myself. Either way, I’d love to see this added soon!

My quick and dirty PurgedWalkForwardCV subclass code follows, along with the output.

from hyperparameter_hunter import Environment, CVExperiment
from timeseriescv.cross_validation import PurgedWalkForwardCV
from xgboost import XGBClassifier
import pandas as pd

class UglyPurgedWalkForwardCV(PurgedWalkForwardCV):
    def __init__(self, pred_times=None, eval_times=None, split_by_time=False, **kwargs):
        """Override initialization to receive the three extra kwargs expected by 
        :meth:`split`. Mangle the attribute names to avoid any possible 
        collisions with the original attributes of :class:`PurgedWalkForwardCV`"""
        self.__pred_times = pred_times
        self.__eval_times = eval_times
        self.__split_by_time = split_by_time
        super().__init__(**kwargs)

    def split(self, X, y=None, **kwargs):
        """Override `split` to look more like SKLearn's CV classes, and fetch the 
        mangled attributes set on initialization, rather than expecting them here"""
        return super().split(
            X,
            y,
            pred_times=self.__pred_times,
            eval_times=self.__eval_times,
            split_by_time=self.__split_by_time
        )

if __name__ == "__main__":
    # Paths to the pickled X_train / pred_times / eval_times files linked above
    data_df = pd.read_pickle(train_data_path)
    p_times = pd.read_pickle(pred_times_path)
    e_times = pd.read_pickle(eval_times_path)

    env = Environment(
        train_dataset=data_df,
        target_column="bin",
        results_path="HyperparameterHunterAssets",
        metrics=["roc_auc_score"],
        cv_type=UglyPurgedWalkForwardCV,
        cv_params=dict(n_splits=5, pred_times=p_times, eval_times=e_times),
    )

    exp = CVExperiment(XGBClassifier)

Output/error traceback:

Cross-Experiment Key:   'pncMgwGMRAZMDReR0Sd8_6PF3817sHiugD2LI1SYXGI='
<18:28:05> Initialized Experiment: 'ee1aed75-4d3c-4d0d-8d51-9e0aa448ee97'
<18:28:05> Hyperparameter Key:     'o-wi1kDtaizmgwFvBNdNJDF7X_OJBQPoj0iG3Gne-SM='
<18:28:05>
<18:28:05> R0-f0-r-  |  OOF(roc_auc_score=0.43833)  |  Time: 0.04179 s
<18:28:05> R0-f1-r-  |  OOF(roc_auc_score=0.50000)  |  Time: 0.04892 s
<18:28:05> R0-f2-r-  |  OOF(roc_auc_score=0.50000)  |  Time: 0.05401 s
<18:28:05> Uncaught exception!   RuntimeError: generator raised StopIteration
Traceback (most recent call last):
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 792, in <genexpr>
    yield (next(indices) for _ in range(cv_params["n_splits"]))
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "time_series_cv_example.py", line 82, in <module>
    execute()
  File "time_series_cv_example.py", line 78, in execute
    exp = CVExperiment(XGBClassifier)
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiment_core.py", line 165, in __call__
    return super().__call__(*args, **kwargs)
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 749, in __init__
    target_metric=target_metric,
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 595, in __init__
    target_metric=target_metric,
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 303, in __init__
    self.experiment_workflow()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 335, in experiment_workflow
    self.execute()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 607, in execute
    self.cross_validation_workflow()
  File "/Users/Hunter/hyperparameter_hunter/hyperparameter_hunter/experiments.py", line 623, in cross_validation_workflow
    for self._fold, (self.train_index, self.validation_index) in enumerate(rep_indices):
RuntimeError: generator raised StopIteration

I'd love to hear your thoughts on adding support for time-series problems! Sorry it's not working at the moment, though!

@jmrichardson
Author

Thank you so much for your effort to get this to work. I should be able to work with sklearn's TimeSeriesSplit instead, so it's not a huge deal. I will definitely look into how I can help support timeseriescv. I am working on a project at the moment that is taking a considerable amount of time, so I'm not sure I can look into it for a few weeks. I will definitely keep you posted on how it goes.

On a side note, this package is really nice. I really appreciate you sharing it with the community! It will definitely be part of my toolbox going forward!

@HunterMcGushion
Owner

Does SKLearn's TimeSeriesSplit work with HH? If so, I should probably add an example using it... Thanks a lot for your support!
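If it does, the setup would presumably just mirror the commented-out lines in your first snippet (untested sketch, same dataset as before):

from hyperparameter_hunter import Environment, CVExperiment
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBClassifier

env = Environment(
    train_dataset=X_train,
    target_column="bin",
    results_path="HyperparameterHunterAssets",
    metrics=["roc_auc_score"],
    cv_type=TimeSeriesSplit,     # class, or the string "TimeSeriesSplit"
    cv_params=dict(n_splits=5),  # no split-time kwargs needed, unlike PurgedWalkForwardCV
)

exp = CVExperiment(XGBClassifier)

Since TimeSeriesSplit yields exactly n_splits (train, test) pairs, it shouldn't hit the StopIteration above, though whether the OOF bookkeeping behaves is exactly what an example would need to confirm.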

@jmrichardson
Author

Sorry for the delay, I have been traveling and will be for the next couple of weeks. I recall that it accepted the parameters (i.e., it didn't need the event times). However, I don't recall whether I completely tested sklearn's TimeSeriesSplit. I will give it a shot soon and report back. Thanks again for your help on this.
