MultipleTimeSeriesCV, why does it iterate backwards in time? #314

Open
carstenf opened this issue Jan 17, 2024 · 1 comment

Comments

carstenf commented Jan 17, 2024

It looks like the splits are generated in the wrong order and should be reversed.
The first split is used for computation first; the second split should then not use information from the first calculation.
Currently it appears to be the other way around; please see the plot.

I made a small piece of code for plotting:

`
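# Assumes MultipleTimeSeriesCV is already defined and `data` is a DataFrame
# with a ('symbol', 'date') MultiIndex and a 'close' column
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd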

# Initialize the MultipleTimeSeriesCV with your parameters
cv = MultipleTimeSeriesCV(n_splits=5, train_period_length=300, test_period_length=300, lookahead=50)

# Define the colors for different sections of the plot
train_color = 'blue'  # In-sample data color
test_color = 'orange'    # Out-of-sample data color
lookahead_color = 'green'  # Lookahead period color

# Create subplots
fig, axs = plt.subplots(cv.n_splits, figsize=(10, 15), sharex=True)

# Iterate over each split
for i, (train_index, test_index) in enumerate(cv.split(data)):
    train = data.iloc[train_index]
    test = data.iloc[test_index]

    # Plot training set
    axs[i].plot(train.loc['AAPL'].index, train.loc['AAPL', 'close'], color=train_color, label='Training Set')
    
    # Plot testing set
    axs[i].plot(test.loc['AAPL'].index, test.loc['AAPL', 'close'], color=test_color, label='Testing Set')

    # Highlight the lookahead period
    if len(train) > 0 and len(test) > 0:
        lookahead_start_date = train.loc['AAPL'].index[-1]
        lookahead_end_date = test.loc['AAPL'].index[0]
        axs[i].axvspan(lookahead_start_date, lookahead_end_date, color=lookahead_color, alpha=0.3, label='Lookahead Period')

    axs[i].legend(loc='best')
    
    # Formatting the plot
    axs[i].set_title(f'Split {i+1}')
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Close Price')

plt.tight_layout()
plt.show()

`

image
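
As a complement to the plot, the ordering of the splits can also be checked numerically. The following is only a minimal sketch: `summarize_splits` and `walks_forward` are hypothetical helper names, not part of the original code, and the toy splits are hand-written for illustration.

```python
from datetime import date, timedelta


def summarize_splits(splits, dates):
    """Return (train_start, train_end, test_start, test_end) dates per split.

    `splits` is any iterable of (train_idx, test_idx) positional indices;
    `dates` maps each row position to its date.
    """
    summary = []
    for train_idx, test_idx in splits:
        train_dates = [dates[i] for i in train_idx]
        test_dates = [dates[i] for i in test_idx]
        summary.append((min(train_dates), max(train_dates),
                        min(test_dates), max(test_dates)))
    return summary


def walks_forward(summary):
    """True if every test window starts later than the previous one."""
    test_starts = [row[2] for row in summary]
    return all(a < b for a, b in zip(test_starts, test_starts[1:]))


# Toy data (hypothetical): ten consecutive days and two hand-written splits.
days = [date(2024, 1, 1) + timedelta(days=k) for k in range(10)]
backward = [(range(4, 8), range(8, 10)), (range(2, 6), range(6, 8))]
forward = list(reversed(backward))

print(walks_forward(summarize_splits(backward, days)))  # False
print(walks_forward(summarize_splits(forward, days)))   # True
```

With the real data, something like `summarize_splits(cv.split(data), data.index.get_level_values('date'))` should give the same kind of summary, assuming the date level is named 'date'.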

carstenf commented Jan 17, 2024

I think I found a solution, but it is not fully tested:

import numpy as np


class MultipleTimeSeriesCV:
    """Generates tuples of train_idx, test_idx pairs.

    Assumes the MultiIndex contains levels 'symbol' and 'date';
    purges overlapping outcomes.
    """

    def __init__(self,
                 n_splits=3,
                 train_period_length=126,
                 test_period_length=21,
                 lookahead=None,
                 date_idx='date',
                 shuffle=False):
        self.n_splits = n_splits
        self.lookahead = lookahead
        self.test_length = test_period_length
        self.train_length = train_period_length
        self.shuffle = shuffle
        self.date_idx = date_idx

    def split(self, X, y=None, groups=None):
        unique_dates = X.index.get_level_values(self.date_idx).unique()
        days = sorted(unique_dates)  # ascending order
        split_idx = []
        for i in range(self.n_splits):
            # Each split moves the training window forward by one test period
            train_start_idx = i * self.test_length
            train_end_idx = train_start_idx + self.train_length
            test_start_idx = train_end_idx + (self.lookahead or 0)
            test_end_idx = test_start_idx + self.test_length

            # Stop once the test window would run past the available days
            if test_end_idx >= len(days):
                break

            split_idx.append((train_start_idx, train_end_idx,
                              test_start_idx, test_end_idx))

        dates = X.reset_index()[[self.date_idx]]

        for train_start, train_end, test_start, test_end in split_idx:
            # Select rows whose date falls in the half-open [start, end) window
            train_idx = dates[(dates[self.date_idx] >= days[train_start]) &
                              (dates[self.date_idx] < days[train_end])].index.to_numpy()
            test_idx = dates[(dates[self.date_idx] >= days[test_start]) &
                             (dates[self.date_idx] < days[test_end])].index.to_numpy()

            if self.shuffle:
                # Shuffle after converting: np.random.permutation returns an
                # ndarray, which has no .to_numpy() method
                train_idx = np.random.permutation(train_idx)

            yield train_idx, test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        # Upper bound: split() yields fewer pairs if the data is too short
        return self.n_splits

The new result:
image
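
To make the index arithmetic of the proposed `split` easier to follow, here is a small sketch that reproduces only the window positions, without pandas. The function name `walk_forward_windows` and the figure of 2000 trading days are assumptions for illustration; the other parameters are the ones used in the plotting code above.

```python
def walk_forward_windows(n_days, n_splits, train_length, test_length, lookahead=0):
    """Return half-open (train_start, train_end, test_start, test_end)
    positions into a list of days sorted in ascending order."""
    windows = []
    for i in range(n_splits):
        train_start = i * test_length          # step forward by one test period
        train_end = train_start + train_length
        test_start = train_end + lookahead     # gap between train and test
        test_end = test_start + test_length
        if test_end >= n_days:                 # same stopping rule as the class above
            break
        windows.append((train_start, train_end, test_start, test_end))
    return windows


# n_days=2000 is an assumed value, not taken from the issue.
for window in walk_forward_windows(n_days=2000, n_splits=5,
                                   train_length=300, test_length=300,
                                   lookahead=50):
    print(window)
# (0, 300, 350, 650)
# (300, 600, 650, 950)
# (600, 900, 950, 1250)
# (900, 1200, 1250, 1550)
# (1200, 1500, 1550, 1850)
```

Each test window starts later than the previous one, so the splits walk forward in time. Note that with fewer days available, the loop stops early and fewer than `n_splits` windows are produced.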
