Data size impacting tune_test_forecast() and find_optimal_transformation() #68

Open
raedbsili1991 opened this issue Aug 4, 2023 · 3 comments
@raedbsili1991

Hello,

In an attempt to deploy automatic forecasting on different datasets (the goal is to automatically find the optimal model for each input time series), I noticed that the two functions tune_test_forecast() and find_optimal_transformation() encounter a shape error:

ValueError: Found array with 0 sample(s) (shape=(0, 45)) while a minimum of 1 is required by MinMaxScaler.

Or maybe it has to do with add_ar_terms?

In the case of the following dataset, I noticed that the problem occurs mainly in f.auto_forecast().

However, find_statistical_transformation() works fine.

Dataset attached.

CODE:

```python
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.ensemble import StackingRegressor, AdaBoostRegressor
from scalecast.Forecaster import Forecaster
from scalecast.auxmodels import auto_arima

# df, custom_metric_3, plot_test_export_summaries, number_months_validation,
# and number_months_test are defined elsewhere in the notebook

forecast_months_horizon = 18  # select number of months to be forecasted in the future
performance_metric = "mae"
data = df

f = Forecaster(
    y = data['Monthly_Ordered quantity _ basic U_M'],  # required
    current_dates = data['first_day_of_month'],        # required
    future_dates = forecast_months_horizon,
    cis = False,  # choose whether or not to evaluate confidence intervals for all models
    metrics = ['mae','r2','rmse','mape'],  # the metrics to evaluate when testing/tuning models
)

f.add_metric(custom_metric_3)

f.set_validation_metric(performance_metric)
f.set_validation_length(int(len(f.y)*.2) + number_months_validation*0)
f.set_test_length(int(len(f.y)*.25) + number_months_test*0)

def forecaster_0(f):

    f.add_sklearn_estimator(StackingRegressor, called='stacking')
    f.add_sklearn_estimator(AdaBoostRegressor, called='adaboost')
    f.add_covid19_regressor()

    f.add_metric(custom_metric_3)
    f.set_validation_metric(performance_metric)

    models = ('lasso','gbt','ridge','elasticnet')
    for m in tqdm(models):

        f.drop_all_Xvars()
        f.set_estimator(m)
        f.auto_Xvar_select(estimator=m, irr_cycles=[12], cross_validate=True)
        #f.determine_best_series_length(estimator=m, monitor='ValidationMetricValue', cross_validate=True, dynamic_tuning=18)
        f.tune()  # by default, will pull the grid with the same name as the estimator (mlr will pull the mlr grid, etc.)
        f.cross_validate(k=5, verbose=True, dynamic_tuning=True)
        f.auto_forecast(call_me = m + '_0')
        f.restore_series_length()

    auto_arima(f, m=12)  # saves a model called auto_arima

def forecaster_1(f):

    #f.eval_cis()  # tell the object to build confidence intervals for all models
    for i in range(11):
        f.add_ar_terms(i)

    f.add_AR_terms((2,12))
    f.add_time_trend()

    f.add_seasonal_regressors('month','quarter','week','dayofyear', raw=False, sincos=True)
    f.add_seasonal_regressors('dayofweek','is_leap_year','week', raw=False, dummy=True, drop_first=True)
    f.add_seasonal_regressors('year')

    #f.add_sklearn_estimator(StackingRegressor, called='stacking')
    #f.add_sklearn_estimator(AdaBoostRegressor, called='adaboost')

    models = ('lasso', 'xgboost')
    # f.tune_test_forecast(models, dynamic_testing=True,
    #                      cross_validate=True, summary_stats=True, dynamic_tuning=True, verbose=True)
    #f.tune_test_forecast(models, suffix="_1")
    for m in tqdm(models):
        f.set_estimator(m)
        f.tune(dynamic_tuning=True)
        f.cross_validate(k=2, dynamic_tuning=True, verbose=True)
        f.auto_forecast()

    auto_arima(f, m=12)  # saves a model called auto_arima
    f.add_covid19_regressor()

def Plot_Analysis(f):

    print("Plotting AutoCorrelation & Seasonal Decomposition Graph")

    f.plot_acf()
    plt.title("ACF")

    f.plot_pacf()
    plt.title("PACF")

    f.seasonal_decompose().plot()
    plt.title("Seasonal Decompose")
    plt.show()

def Plot_Forecasts(f):

    f.plot_fitted(order_by='TestSetMAE')  # plot fitted values of all models ordered by test-set MAE
    plt.title('fitted results', size=16)
    plt.show()
    df_models = plot_test_export_summaries(f)

    f.plot(order_by='TestSetMAE')
    plt.title('Forecasting results', size=16)
    plt.show()

#transformer, reverter = find_statistical_transformation(f)

forecaster_1(f)
Plot_Forecasts(f)
```

df_A0430151.xlsx

@mikekeith52 mikekeith52 self-assigned this Aug 5, 2023
@mikekeith52
Owner

I'm not sure why find_optimal_transformation() wouldn't work, but for tune_test_forecast() I think you need to modify the cross-validation process, since the dataset is only 41 observations. Calling f.add_AR_terms((2,12)) takes off an additional 24 observations, and using a test length of 25% is another 11. That leaves 6 total observations for a cross-validation process of 5 folds, which is the default. That means the last fold has 1 observation to train on, which isn't enough. So in tune_test_forecast(), try using k=2 and let me know if that works. Using a smaller test size and adding fewer AR terms could also be a good idea. Remember that managing the series' length when adding AR terms and splitting the data for testing/validation is an important part of the modeling process.
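For illustration, the arithmetic works out roughly like this (just a sketch using the numbers from this thread; the exact split sizes depend on your settings):

```python
# Back-of-the-envelope arithmetic behind the MinMaxScaler error
# (illustrative numbers from this thread; adjust for your own series).
n_obs = 41        # monthly observations in the attached dataset
ar_loss = 24      # f.add_AR_terms((2,12)) drops the first 24 observations
test_length = 11  # ~25% of the series held out for testing

remaining = n_obs - ar_loss - test_length
print(remaining)            # 6 observations left for tuning/validation

k = 5                       # default number of cross-validation folds
print(remaining // k)       # ~1 observation to train on in the last fold -> not enough
```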

@raedbsili1991
Author

Thank you.
Indeed, f.add_AR_terms((2,12)) shortens the series.
tune_test_forecast() with k=2 worked once I removed f.add_AR_terms((2,12)).
However, auto_arima failed.
Is there a rule that defines the maximum number of AR terms to add as a function of the series length?
Unfortunately, determine_best_series_length() followed by auto_Xvar_select() gave worse performance.

I think I'd better keep the data at its weekly (non-periodic) frequency rather than aggregating it by month; that would give a larger dataset.
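For reference, the adjusted setup that ran without the shape error looks roughly like this (a sketch based on the loop in my original code, with the (2,12) AR terms removed and k=2 folds):

```python
# Sketch of the setup that worked: shorter lag structure and k=2 cross validation.
for i in range(11):
    f.add_ar_terms(i)   # f.add_AR_terms((2,12)) removed -- it consumed too much of the short series

for m in ('lasso', 'xgboost'):
    f.set_estimator(m)
    f.tune(dynamic_tuning=True)
    f.cross_validate(k=2, dynamic_tuning=True, verbose=True)
    f.auto_forecast()
```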

@mikekeith52
Owner

All else equal, a bigger dataset usually leads to better model performance, so switching to a weekly frequency may alleviate many of these problems. There is no rule that determines how many lags to add or how large a test set to use. Just keep in mind that every lag you add takes an observation off the beginning of the series: the 24th lag shortens the series by 24 observations. Larger test sizes also reduce the number of training observations. All of these considerations need to be balanced when evaluating forecasting models.
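As a rough illustration of that trade-off (a helper sketch with illustrative numbers, not a scalecast function; the weekly count is approximate):

```python
# Observations left to train on after consuming lags and holding out a test set.
def usable_train_obs(series_length, max_lag, test_length):
    return series_length - max_lag - test_length

print(usable_train_obs(41, 24, 11))   # monthly series from this thread -> 6
print(usable_train_obs(41, 12, 8))    # fewer lags, smaller test set -> 21
print(usable_train_obs(178, 24, 18))  # roughly the same span at a weekly frequency -> 136
```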
