[EPIC, aimodel, sim] Explore SARIMAX modeling #1006

trentmc · 2024-05-08T03:11:41Z

Motivation

We currently use a formulation that looks directly at past values ("autoregressive" AR).

We could generalize this to..

handle expected model prediction errors: "moving average" MA --> overall ARMA
handle trends: "integrated" I --> ARIMA
handle "seasonality" S --> SARIMA
handle "exogenous regressors" aka "external variables" X --> SARIMAX

This could unlock big gains in accuracy while retaining simple models; and plays well with more complex models too.

TODOs

Work through blog posts

Do "blog posts to work through" list below

Analyze for myself on crypto data, and build into sim_engine plots

ARIMA-style feed analysis: for {BoxCox=T/F, diff=0,1,2}, plot seasonality and ACF/PACF #1009
New app for ARIMA-style feed analysis: show plots from ^. Use "main prototype" in ^. #1032
Analyze model prediction sanity, via diagnostic plots of prediction residuals #1010
New app for analyzing model prediction sanity, using plots from ^

Then, implement & explore:

Choose a SARIMAX implementation: NIXTLA vs pmdarima. How: play with NIXTLA more
Implement in sim_engine
Implement in predictoor_agent
Extend for inputs beyond than traditional SARIMAX intent, either as part of exogenous variables "X" or my own formulation:
- For "AR" inputs: include non-close values of the feed to predict, and other feeds (x2, x3, x4, ..). I should be able to specify a lower max order for the non-prediction feeds
- For "I" inputs: include ""
- For "S" inputs: include ""
- For "MA" inputs: use real past prediction errors, vs an MLE estimator to get them

Completed: Blog posts to work through

Work through these blog posts:

Potentially useful tools

NIXTLA StatsForecast library to auto-find SARIMAX params.

20x faster than pmdarima, 4x faster than statsmodels
Py models of AutoARIMA, AutoETS, AutoCES, MSTL and Theta in Python. Has GARCH and ARCH; exponential smoothing
probabilistic forecasting and confidence intervals, via conformal prediction
support for exogenous Variables and static covariates

NIXTLA libraries for time-series

StatsForecast (Github) - lightning fast forecasting with statistical and econometric models
MLForecast (Github) - scalable ML for time series forecasting
NeuralForecast (Github) - scalable and user friendly forecasting algorithms for time series data
Hierarchical Forecast (Github) - probabilistic hierarchical forecasting with statistical and econometric models
TS features - calculate various features from time series data.

pmdarima library to auto-find SARIMAX params

auto_arima() to choose ARIMA p, d, and q automatically. And in fact works for SARIMA and SARIMAX parameters too. Example
all the other key tools & utilities for time-series around SARIMAX: differencing, autocorrelation, partial autocorrelation, plots

statsmodels library

statsmodels.tsa.acf() - autocorrelation function. Example use: Qingkai
statsmodels.graphics.tsaplots.plot_acf() - plot autocorrelation function. Example use in projectpro.io
statsmodels.tsa.stattools.adfuller() - gives probability that a time-series is non-stationary (via Augmented Dickey-Fully test). Example use in projectpro.io
statsmodels.tsa.arima.model.ARIMA() link

tsfresh library

auto-calculates 25+ time series characteristics, ie "features". Plays well with scikit-learn pipelines
with methods to evaluate the explaining power and importance of such features for regression or classification tasks
4 core devs and 20+ other contributors

pandas library

df.value.diff() - first-order differencing: x(t-1) - x(t-2). Example use in projectpro.io
df.value.diff().diff() - second-order differencing: (x[t-1] - x[t-2]) - (x[t-2] - x[t-3]). Example use "".
df.shift() - shift all df rows up or down, or cols up or down. Tutorial

Manually find ARIMA (p, d, q): Blog

d (for I): use statsmodels.tsa.stattools.adfuller() on df.values.diff(), then 2 diffs, etc until result < 0.5. Then p = smallest # diffs that gave result < 0.5, minus 1 = min # differencings to make the data stationary
p (for AR) = how much lag is needed = based on most significant lag in the partial autocorrelation plot. Find the lag that gives max partial autocorrelation on first-order differencing; then on second; then p = max of those.
q (for MA) = like for p but look at autocorrelation plots.

Smoothing eg here

Single exponential smoothing - like a LPF. Fails if trend or seasonality
Double exponential smoothing - smooths both the series and the smoothed series. Works if trend, fails if seasonality
Triple exponential smoothing - works for trend & seasonality

About ARMA ... SARIMAX

An AR formulation uses AR inputs to the model:

AR (Autoregressive) inputs: values of close etc at previous candles = y[t-1], y[t-2], etc.
- To be precise, AR can use each input feed, not just the feed to predict. So it's really:
- For x1: x1[t-1], x1[t-2], ..., x1[t-p]
- For x2: x2[t-1], x2[y-2], ..., x2[t-p]
- ... To xn: xn[t-1], xn[t-2], ..., xn[t-p]
- Where n = # input feeds and p = AR model order = # candles to look back at in AR

An ARMA formulation uses AR and MA inputs to the model:

AR inputs: as above
MA (Moving Average) inputs: model's prediction errors (and directions) from previous predictions = differences between predicted and actual values at previous candles = y^[t-1] - y[t-1], y^[t-2] - y[t-2], ... y^[t-d].
Where d = MA model order = # previous predictions to look backwards for in MA = # lagged forecast errors in the prediction equation
MA only works for feed to predict. It doesn't work for the other feeds. (Whereas AR and I do.)

Another way to think about the moving average model is that it corrects future forecasts based on errors made on recent forecasts. Ref

An ARIMA formulation uses AR, I, and MA inputs to the model:

AR inputs: as above
MA inputs: as above
I (Integrated) inputs: differences between pairs of values of close etc at previous candles = y[t-1] - y[t-2], y[t-2] - y[t-3], etc. To be precise, I can use each input feed, not just the feed to predict. So, it's really:
- For x1: x1[t-1] - x1[t-2], x1[t-2] - x1[t-3], ... x1[t-q] - x1[t-q-1]
- For x2: x2[t-1] - x2[t-2], x2[t-2] - x2[t-3], ..., x2[t-q] - x2[t-q-1]
- ... To xn: xn[t-1] - xn[t-2], xn[t-2] - xn[t-3], ..., xn[t-q] - xn[t-q-1]
- Where n = # input feeds and q = I model order = # candles to look back at for differencing = # nonseasonal differences needed for stationarity
- Above was for a 1st-order differencing: x1[t-1] - x1[t-2] etc. This accounts for a linear "trend" in the data.
- We can also have 2nd-order differencing: (x1[t-1] - x1[t-2]) - (x1[t-2] - x[t-3]). This accounts for a quadratic "trend" in the data.

ARIMA "order" is a 3-tuple (p,d,q) for order of (AR,I,MA). By example:

A model with (only) two AR terms would be specified as an ARIMA of order (p,d,q) = (2,0,0).
A MA(2) model would be specified as an ARIMA of order (p,d,q) = (0,0,2).
A model with one AR term, a first difference, and one MA term would have order (p,d,q) = (1,1,1).

SARIMA models. If a "season" S is every 12 iterations, then a term x1[t-1] - x1[t-12-1] would capture it.

SARIMAX models. "Exogenous regressors" aka "external variables" X are variables besides the target feed. Eg if we're predicting Binance BTC/USDT c, X variables could include Binance BTC/USDT ohlv, Binance ETH/USDT ohlcv, Kraken BTC/USDT ohlcv, etc. These are important to include for Predictoor; eg every non-BTC coin has known dependency on BTC price.

Appendix: Resources

Penn State U "Applied Time Series Analysis"

Great tutorials of AR, ARMA, ARIMA, SARIMA, more. With py code. https://online.stat.psu.edu/stat510/

Neptune. ai ARIMA/SARIMA Real World Time Series Forecasting Guide

https://neptune.ai/blog/arima-sarima-real-world-time-series-forecasting-guide

pmdarima: ARIMA estimators for Python

https://alkaline-ml.com/pmdarima/index.html

"pmdarima brings R’s beloved auto.arima to Python, making an even stronger case for why you don’t need R for data science. pmdarima is 100% Python + Cython and does not leverage any R code, but is implemented in a powerful, yet easy-to-use set of functions & classes that will be familiar to scikit-learn users."

This neptune.ai blog post compares ARIMA
(optimized with pmdarima) to Prophet and LSTM. ARIMA wins.

Ethan Rosenthal Blog

Off-the-cuff ARIMA; accounting for (non) stationarity; how to use via skits; more. Great article! https://www.ethanrosenthal.com/2018/03/22/time-series-for-scikit-learn-people-part2/

Auto ARIMA

Auto-discovers p, d, q

Blog: https://towardsdatascience.com/time-series-forecasting-using-auto-arima-in-python-bb83e49210cd

Code: https://github.com/SushmithaPulagam/TimeSeries_Auto-ARIMA

The text was updated successfully, but these errors were encountered:

trentmc · 2024-05-09T07:51:31Z

The article towardsdatascience.com: ARIMA .. SARIMAX link had GREAT code. Here's all the code in one convenient place. To show any given plot, change if False to if True.

from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from statsmodels.tsa.stattools import adfuller
import pmdarima as pm

# load data
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")

#string to date format
df['Month'] = pd.to_datetime(df['Month'],infer_datetime_format=True)
df = df.set_index(['Month'])
df.head(5)

if False: # plot raw time series
    plt.figure(figsize=(15,7))
    plt.title("Number of Airline Passengers by Date")
    plt.xlabel('Date')
    plt.ylabel('Passengers')
    plt.plot(df)
    plt.show()

    
#Determine rolling statistics
df["rolling_avg"] = df["#Passengers"].rolling(window=12).mean() #window size 12 denotes 12 months, giving rolling mean at yearly level
df["rolling_std"] = df["#Passengers"].rolling(window=12).std()

if False: # plot raw time series + rolling statistics
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#379BDB', label='Original')
    plt.plot(df["rolling_avg"], color='#D22A0D', label='Rolling Mean')
    plt.plot(df["rolling_std"], color='#142039', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

#Augmented Dickey–Fuller test. It's stationary if p < 0.05
dftest = adfuller(df['#Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value

if False: # print ADF results
    print('Results of Dickey Fuller Test:')
    print(dfoutput)

#Standard ARIMA Model
ARIMA_model = pm.auto_arima(
    df['#Passengers'], 
    start_p=1, 
    start_q=1,
    test='adf', # use adftest to find optimal 'd'
    max_p=3, max_q=3, # maximum p and q
    m=1, # frequency of series (if m==1, seasonal is set to FALSE automatically)
    d=None,# let model determine 'd'
    seasonal=False, # No Seasonality for standard ARIMA
    trace=False, #logs 
    error_action='warn', #shows errors ('ignore' silences these)
    suppress_warnings=True,
    stepwise=True)

# plot diagnostics: [good fit if this cond'n is met]
# 1. standardized residual [There are no obvious patterns in the residuals, with values having a mean of zero and having a uniform variance.]
# 2. histogram + KDE estimate [The KDE curve should be very similar to the normal distribution, labeled as N(0,1) in the plot]
# 3. normal Q-Q [Most of the data points should lie on the straight line]
# 4. ACF plot, ie correlogram [95% of correlations for lag greater than zero should not be significant. The grey area is the confidence band, and if values fall outside of this then they are statistically significant. In our case, there are a few values outside of this area, and therefore we may need to add more predictors to make the model more accurate]

if False: #plot diagnostics on SARIMA model
    ARIMA_model.plot_diagnostics(figsize=(15,12))
    plt.show()

# use the model to forecast airline passenger counts over the next 24 months
def forecast(ARIMA_model, periods=24):
    # Forecast
    n_periods = periods
    fitted, confint = ARIMA_model.predict(n_periods=n_periods, return_conf_int=True)
    index_of_fc = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS')

    # make series for plotting purpose
    fitted_series = pd.Series(fitted, index=index_of_fc)
    lower_series = pd.Series(confint[:, 0], index=index_of_fc)
    upper_series = pd.Series(confint[:, 1], index=index_of_fc)

    # Plot
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#1f76b4')
    plt.plot(fitted_series, color='darkgreen')
    plt.fill_between(lower_series.index, 
                    lower_series, 
                    upper_series, 
                    color='k', alpha=.15)

    plt.title("ARIMA/SARIMA - Forecast of Airline Passengers")
    plt.show()

if False: # plot ARIMA forecast
    forecast(ARIMA_model)

# Seasonal - fit stepwise auto-ARIMA
SARIMA_model = pm.auto_arima(
    df["#Passengers"], start_p=1, start_q=1,
    test='adf',
    max_p=3, max_q=3, 
    m=12, #12 is the frequncy of the cycle
    start_P=0, 
    seasonal=True, #set to seasonal
    d=None, 
    D=1, #order of the seasonal differencing
    trace=False,
    error_action='ignore',  
    suppress_warnings=True, 
    stepwise=True)

if False: #plot diagnostics on SARIMA model
    SARIMA_model.plot_diagnostics(figsize=(15,12))
    plt.show()

if False: # plot SARIMA forecast
    forecast(SARIMA_model)


#adding exogenous variable
df['month_index'] = df.index.month

# SARIMAX Model
SARIMAX_model = pm.auto_arima(
    df[['#Passengers']], exogenous=df[['month_index']],
    start_p=1, start_q=1,
    test='adf',
    max_p=3, max_q=3, m=12,
    start_P=0, seasonal=True,
    d=None, D=1, 
    trace=False,
    error_action='ignore',  
    suppress_warnings=True, 
    stepwise=True)

def sarimax_forecast(SARIMAX_model, periods=24):
    # Forecast
    n_periods = periods

    forecast_df = pd.DataFrame(
        {"month_index":pd.date_range(df.index[-1], periods = n_periods, freq='MS').month},
        index = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS'))

    fitted, confint = SARIMAX_model.predict(
        n_periods=n_periods, 
        return_conf_int=True,
        exogenous=forecast_df[['month_index']],
    )
    index_of_fc = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS')

    # make series for plotting purpose
    fitted_series = pd.Series(fitted, index=index_of_fc)
    lower_series = pd.Series(confint[:, 0], index=index_of_fc)
    upper_series = pd.Series(confint[:, 1], index=index_of_fc)

    # Plot
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#1f76b4')
    plt.plot(fitted_series, color='darkgreen')
    plt.fill_between(lower_series.index, 
                    lower_series, 
                    upper_series, 
                    color='k', alpha=.15)

    plt.title("SARIMAX - Forecast of Airline Passengers")
    plt.show()

if False: # plot SARIMAX forecast
    sarimax_forecast(SARIMAX_model, periods=24)

trentmc added the Type: Enhancement New feature or request label May 8, 2024

trentmc changed the title ~~[Aimodel, sim] Explore ARMA modeling~~ [Aimodel, sim] Explore ARMA & ARIMA modeling May 8, 2024

trentmc self-assigned this May 8, 2024

trentmc changed the title ~~[Aimodel, sim] Explore ARMA & ARIMA modeling~~ [Aimodel, sim] Explore ARMA / ARIMA / SARIMA modeling May 8, 2024

trentmc changed the title ~~[Aimodel, sim] Explore ARMA / ARIMA / SARIMA modeling~~ [Aimodel, sim] Explore ARMA / ARIMA / SARIMA / SARIMAX modeling May 9, 2024

trentmc changed the title ~~[Aimodel, sim] Explore ARMA / ARIMA / SARIMA / SARIMAX modeling~~ [Aimodel, sim] Explore ARMA .. SARIMAX modeling May 9, 2024

trentmc changed the title ~~[Aimodel, sim] Explore ARMA .. SARIMAX modeling~~ [EPIC, aimodel, sim] Explore ARMA .. SARIMAX modeling May 9, 2024

trentmc changed the title ~~[EPIC, aimodel, sim] Explore ARMA .. SARIMAX modeling~~ [EPIC, aimodel, sim] Explore SARIMAX modeling May 9, 2024

This was referenced May 11, 2024

[Sim plots] Too many plots at once; fix via >1 page selectable via sidebar #1021

Open

New app for ARIMA-style feed analysis: show plots from #1009 #1032

Open

trentmc added the Epic label May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[EPIC, aimodel, sim] Explore SARIMAX modeling #1006

[EPIC, aimodel, sim] Explore SARIMAX modeling #1006

trentmc commented May 8, 2024 •

edited

trentmc commented May 9, 2024

[EPIC, aimodel, sim] Explore SARIMAX modeling #1006

[EPIC, aimodel, sim] Explore SARIMAX modeling #1006

Comments

trentmc commented May 8, 2024 • edited

Motivation

TODOs

Completed: Blog posts to work through

Potentially useful tools

About ARMA ... SARIMAX

Appendix: Resources

Penn State U "Applied Time Series Analysis"

Neptune. ai ARIMA/SARIMA Real World Time Series Forecasting Guide

pmdarima: ARIMA estimators for Python

Ethan Rosenthal Blog

Auto ARIMA

trentmc commented May 9, 2024

trentmc commented May 8, 2024 •

edited