MultipleTimeSeriesCV, why does it iterate backwards in time? #314

carstenf · 2024-01-17T11:49:31Z

It looks like the splits are the wrong way around and should be reversed.
The first split is computed first, so the second split should not use information from that first calculation.
At the moment it looks like the other way around; please see the plot.

I made a small piece of code for plotting:

`
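import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumes `MultipleTimeSeriesCV` is already imported or defined, and that `data`
# is a DataFrame with a (symbol, date) MultiIndex and a 'close' column.
# Initialize the MultipleTimeSeriesCV with your parameters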
cv = MultipleTimeSeriesCV(n_splits=5, train_period_length=300, test_period_length=300, lookahead=50)

# Define the colors for different sections of the plot
train_color = 'blue'  # In-sample data color
test_color = 'orange'    # Out-of-sample data color
lookahead_color = 'green'  # Lookahead period color

# Create subplots
fig, axs = plt.subplots(cv.n_splits, figsize=(10, 15), sharex=True)

# Iterate over each split
for i, (train_index, test_index) in enumerate(cv.split(data)):
    train = data.iloc[train_index]
    test = data.iloc[test_index]

    # Plot training set
    axs[i].plot(train.loc['AAPL'].index, train.loc['AAPL', 'close'], color=train_color, label='Training Set')
    
    # Plot testing set
    axs[i].plot(test.loc['AAPL'].index, test.loc['AAPL', 'close'], color=test_color, label='Testing Set')

    # Highlight the lookahead period
    if len(train) > 0 and len(test) > 0:
        lookahead_start_date = train.loc['AAPL'].index[-1]
        lookahead_end_date = test.loc['AAPL'].index[0]
        axs[i].axvspan(lookahead_start_date, lookahead_end_date, color=lookahead_color, alpha=0.3, label='Lookahead Period')

    axs[i].legend(loc='best')
    
    # Formatting the plot
    axs[i].set_title(f'Split {i+1}')
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Close Price')

plt.tight_layout()
plt.show()

`
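
As a complement to the plot, the split order can also be checked numerically. The following is a minimal sketch, not part of the original code: `summarize_splits` and `runs_forward` are hypothetical helper names, and the date level of the MultiIndex is assumed to be called `'date'`.

```python
import pandas as pd


def summarize_splits(cv, data, date_level='date'):
    """Return one row per split with the first and last train/test dates.

    `cv` is any object with a `split(data)` method yielding positional
    (train_idx, test_idx) arrays; `date_level` is the assumed name of the
    date level in the MultiIndex of `data`.
    """
    columns = ['split', 'train_start', 'train_end', 'test_start', 'test_end']
    dates = data.index.get_level_values(date_level)
    rows = []
    for i, (train_idx, test_idx) in enumerate(cv.split(data), start=1):
        rows.append({
            'split': i,
            'train_start': dates[train_idx].min(),
            'train_end': dates[train_idx].max(),
            'test_start': dates[test_idx].min(),
            'test_end': dates[test_idx].max(),
        })
    return pd.DataFrame(rows, columns=columns).set_index('split')


def runs_forward(summary):
    """True if every split's test window starts later than the previous one."""
    starts = summary['test_start']
    return bool(starts.is_monotonic_increasing and starts.is_unique)
```

With the `cv` and `data` objects from the plotting code, `summarize_splits(cv, data)` gives a table of date ranges per split, and `runs_forward(...)` on that table returns `False` when the splits step backwards in time.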

carstenf · 2024-01-17T12:54:46Z

I think I found a solution, though it is not fully tested:

import numpy as np


class MultipleTimeSeriesCV:
    """Generates tuples of train_idx, test_idx pairs.

    Assumes the MultiIndex contains levels 'symbol' and 'date';
    purges overlapping outcomes.
    """

    def __init__(self,
                 n_splits=3,
                 train_period_length=126,
                 test_period_length=21,
                 lookahead=None,
                 date_idx='date',
                 shuffle=False):
        self.n_splits = n_splits
        self.lookahead = lookahead
        self.test_length = test_period_length
        self.train_length = train_period_length
        self.shuffle = shuffle
        self.date_idx = date_idx

    def split(self, X, y=None, groups=None):
        unique_dates = X.index.get_level_values(self.date_idx).unique()
        days = sorted(unique_dates)  # Ascending order
        split_idx = []
        for i in range(self.n_splits):
            # Each split starts one test period later than the previous one
            train_start_idx = i * self.test_length
            train_end_idx = train_start_idx + self.train_length
            test_start_idx = train_end_idx + (self.lookahead or 0)
            test_end_idx = test_start_idx + self.test_length

            # Stop once the test window would run past the available days
            # (days[test_end_idx] serves as an exclusive upper bound below)
            if test_end_idx >= len(days):
                break

            split_idx.append((train_start_idx, train_end_idx, test_start_idx, test_end_idx))

        dates = X.reset_index()[[self.date_idx]]

        for train_start, train_end, test_start, test_end in split_idx:
            # Half-open date ranges [start, end) on the ascending list of days
            train_idx = dates[(dates[self.date_idx] >= days[train_start]) &
                              (dates[self.date_idx] < days[train_end])].index.to_numpy()
            test_idx = dates[(dates[self.date_idx] >= days[test_start]) &
                             (dates[self.date_idx] < days[test_end])].index.to_numpy()

            if self.shuffle:
                # Convert to ndarray first: np.random.permutation returns an
                # ndarray, which has no .to_numpy() method
                train_idx = np.random.permutation(train_idx)

            yield train_idx, test_idx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
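
To make the index arithmetic of the proposed `split` easier to check by hand, here is a minimal sketch that reproduces only the positional windows. `forward_windows` is a hypothetical helper, not part of the class, and the small numbers are chosen purely for illustration.

```python
def forward_windows(n_days, n_splits, train_length, test_length, lookahead=0):
    """Positional (train_start, train_end, test_start, test_end) per split.

    Mirrors the arithmetic of the proposed `split` method: every range is
    half-open, [start, end), over days sorted in ascending order.
    """
    windows = []
    for i in range(n_splits):
        train_start = i * test_length
        train_end = train_start + train_length
        test_start = train_end + lookahead
        test_end = test_start + test_length
        if test_end >= n_days:  # same stopping rule as in the class above
            break
        windows.append((train_start, train_end, test_start, test_end))
    return windows


windows = forward_windows(n_days=20, n_splits=3, train_length=6,
                          test_length=3, lookahead=1)
for w in windows:
    print(w)
# (0, 6, 7, 10)
# (3, 9, 10, 13)
# (6, 12, 13, 16)
```

Two things stand out in this sketch. The windows are anchored at the earliest date, so in the example days 16 to 19 are never used; covering the most recent data would need a larger `n_splits` or offsets computed from the end. Also, when the data is too short the loop stops early, so fewer splits may be yielded than `get_n_splits` reports.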

The new result:
