Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC, aimodel, sim] Explore SARIMAX modeling #1006

Open
13 of 21 tasks
trentmc opened this issue May 8, 2024 · 1 comment
Open
13 of 21 tasks

[EPIC, aimodel, sim] Explore SARIMAX modeling #1006

trentmc opened this issue May 8, 2024 · 1 comment
Assignees
Labels
Epic Type: Enhancement New feature or request

Comments

@trentmc
Copy link
Member

trentmc commented May 8, 2024

Motivation

We currently use a formulation that looks directly at past values ("autoregressive" AR).

We could generalize this to..

  • handle expected model prediction errors: "moving average" MA --> overall ARMA
  • handle trends: "integrated" I --> ARIMA
  • handle "seasonality" S --> SARIMA
  • handle "exogenous regressors" aka "external variables" X --> SARIMAX

This could unlock big gains in accuracy while retaining simple models; and plays well with more complex models too.

TODOs

Work through blog posts

  • Do "blog posts to work through" list below

Analyze for myself on crypto data, and build into sim_engine plots

  • ARIMA-style feed analysis: for {BoxCox=T/F, diff=0,1,2}, plot seasonality and ACF/PACF #1009
  • New app for ARIMA-style feed analysis: show plots from ^. Use "main prototype" in ^. #1032
  • Analyze model prediction sanity, via diagnostic plots of prediction residuals #1010
  • New app for analyzing model prediction sanity, using plots from ^

Then, implement & explore:

  • Choose a SARIMAX implementation: NIXTLA vs pmdarima. How: play with NIXTLA more
  • Implement in sim_engine
  • Implement in predictoor_agent
  • Extend for inputs beyond than traditional SARIMAX intent, either as part of exogenous variables "X" or my own formulation:
    • For "AR" inputs: include non-close values of the feed to predict, and other feeds (x2, x3, x4, ..). I should be able to specify a lower max order for the non-prediction feeds
    • For "I" inputs: include ""
    • For "S" inputs: include ""
    • For "MA" inputs: use real past prediction errors, vs an MLE estimator to get them

Completed: Blog posts to work through

Work through these blog posts:

  • qingkaikong.blogspot.com: autocorrelation tutorial. Link
  • projectpro.io: build ARIMA model in py. Link
  • www.nbshare.io: model sunspots with AR, ARIMA, SARIMA. Link. Perrin data. Replace ARIMA import with: from statsmodels.tsa.arima.model.ARIMA import ARIMA
  • neptune.ai: time series forecasting: data, analysis, and practice. On exponential smoothing. Link
  • 🔥 neptune.ai: how to select a model for time series pred'n. Link. Good code showing how to plot a seasonal decomposition
  • neptune.ai: ARIMA & SARIMA. Link
  • neptune.ai: ARIMA vs Prophet vs LSTM blog. Link, only doing ARIMA stuff via pmdarima.
  • www.ethanrosenthal.com: time series for scikit-learn people part 2. Link, refreshingly candid
  • 🔥🔥🔥 towardsdatascience.com: ARIMA .. SARIMAX, incl auto-find params link. GREAT code, copied to 1st comment below
  • zerotomastery.io: auto SARIMAX blog. Link
  • Amit Kumar: XLM price prediction with ARIMA. Pdf, Google scholar contextUseful filtering tricks, more.
  • 🔥🔥🔥 Jahnvi Sikligar: Master's Thesis py notebook. GREAT code

Potentially useful tools

NIXTLA StatsForecast library to auto-find SARIMAX params.

  • 20x faster than pmdarima, 4x faster than statsmodels
  • Py models of AutoARIMA, AutoETS, AutoCES, MSTL and Theta in Python. Has GARCH and ARCH; exponential smoothing
  • probabilistic forecasting and confidence intervals, via conformal prediction
  • support for exogenous Variables and static covariates

NIXTLA libraries for time-series

pmdarima library to auto-find SARIMAX params

  • auto_arima() to choose ARIMA p, d, and q automatically. And in fact works for SARIMA and SARIMAX parameters too. Example
  • all the other key tools & utilities for time-series around SARIMAX: differencing, autocorrelation, partial autocorrelation, plots

statsmodels library

tsfresh library

  • auto-calculates 25+ time series characteristics, ie "features". Plays well with scikit-learn pipelines
  • with methods to evaluate the explaining power and importance of such features for regression or classification tasks
  • 4 core devs and 20+ other contributors

pandas library

  • df.value.diff() - first-order differencing: x(t-1) - x(t-2). Example use in projectpro.io
  • df.value.diff().diff() - second-order differencing: (x[t-1] - x[t-2]) - (x[t-2] - x[t-3]). Example use "".
  • df.shift() - shift all df rows up or down, or cols up or down. Tutorial

Manually find ARIMA (p, d, q): Blog

  • d (for I): use statsmodels.tsa.stattools.adfuller() on df.values.diff(), then 2 diffs, etc until result < 0.5. Then p = smallest # diffs that gave result < 0.5, minus 1 = min # differencings to make the data stationary
  • p (for AR) = how much lag is needed = based on most significant lag in the partial autocorrelation plot. Find the lag that gives max partial autocorrelation on first-order differencing; then on second; then p = max of those.
  • q (for MA) = like for p but look at autocorrelation plots.

Smoothing eg here

About ARMA ... SARIMAX

An AR formulation uses AR inputs to the model:

  • AR (Autoregressive) inputs: values of close etc at previous candles = y[t-1], y[t-2], etc.
    • To be precise, AR can use each input feed, not just the feed to predict. So it's really:
    • For x1: x1[t-1], x1[t-2], ..., x1[t-p]
    • For x2: x2[t-1], x2[y-2], ..., x2[t-p]
    • ... To xn: xn[t-1], xn[t-2], ..., xn[t-p]
    • Where n = # input feeds and p = AR model order = # candles to look back at in AR

An ARMA formulation uses AR and MA inputs to the model:

  • AR inputs: as above
  • MA (Moving Average) inputs: model's prediction errors (and directions) from previous predictions = differences between predicted and actual values at previous candles = y^[t-1] - y[t-1], y^[t-2] - y[t-2], ... y^[t-d].
  • Where d = MA model order = # previous predictions to look backwards for in MA = # lagged forecast errors in the prediction equation
  • MA only works for feed to predict. It doesn't work for the other feeds. (Whereas AR and I do.)

Another way to think about the moving average model is that it corrects future forecasts based on errors made on recent forecasts. Ref

An ARIMA formulation uses AR, I, and MA inputs to the model:

  • AR inputs: as above
  • MA inputs: as above
  • I (Integrated) inputs: differences between pairs of values of close etc at previous candles = y[t-1] - y[t-2], y[t-2] - y[t-3], etc. To be precise, I can use each input feed, not just the feed to predict. So, it's really:
    • For x1: x1[t-1] - x1[t-2], x1[t-2] - x1[t-3], ... x1[t-q] - x1[t-q-1]
    • For x2: x2[t-1] - x2[t-2], x2[t-2] - x2[t-3], ..., x2[t-q] - x2[t-q-1]
    • ... To xn: xn[t-1] - xn[t-2], xn[t-2] - xn[t-3], ..., xn[t-q] - xn[t-q-1]
    • Where n = # input feeds and q = I model order = # candles to look back at for differencing = # nonseasonal differences needed for stationarity
    • Above was for a 1st-order differencing: x1[t-1] - x1[t-2] etc. This accounts for a linear "trend" in the data.
    • We can also have 2nd-order differencing: (x1[t-1] - x1[t-2]) - (x1[t-2] - x[t-3]). This accounts for a quadratic "trend" in the data.

ARIMA "order" is a 3-tuple (p,d,q) for order of (AR,I,MA). By example:

  • A model with (only) two AR terms would be specified as an ARIMA of order (p,d,q) = (2,0,0).
  • A MA(2) model would be specified as an ARIMA of order (p,d,q) = (0,0,2).
  • A model with one AR term, a first difference, and one MA term would have order (p,d,q) = (1,1,1).

SARIMA models. If a "season" S is every 12 iterations, then a term x1[t-1] - x1[t-12-1] would capture it.

SARIMAX models. "Exogenous regressors" aka "external variables" X are variables besides the target feed. Eg if we're predicting Binance BTC/USDT c, X variables could include Binance BTC/USDT ohlv, Binance ETH/USDT ohlcv, Kraken BTC/USDT ohlcv, etc. These are important to include for Predictoor; eg every non-BTC coin has known dependency on BTC price.

Appendix: Resources

Penn State U "Applied Time Series Analysis"

Great tutorials of AR, ARMA, ARIMA, SARIMA, more. With py code. https://online.stat.psu.edu/stat510/

Neptune. ai ARIMA/SARIMA Real World Time Series Forecasting Guide

https://neptune.ai/blog/arima-sarima-real-world-time-series-forecasting-guide

pmdarima: ARIMA estimators for Python

https://alkaline-ml.com/pmdarima/index.html

"pmdarima brings R’s beloved auto.arima to Python, making an even stronger case for why you don’t need R for data science. pmdarima is 100% Python + Cython and does not leverage any R code, but is implemented in a powerful, yet easy-to-use set of functions & classes that will be familiar to scikit-learn users."

This neptune.ai blog post compares ARIMA
(optimized with pmdarima) to Prophet and LSTM. ARIMA wins.

Ethan Rosenthal Blog

Off-the-cuff ARIMA; accounting for (non) stationarity; how to use via skits; more. Great article! https://www.ethanrosenthal.com/2018/03/22/time-series-for-scikit-learn-people-part2/

Auto ARIMA

Auto-discovers p, d, q

Blog: https://towardsdatascience.com/time-series-forecasting-using-auto-arima-in-python-bb83e49210cd

Code: https://github.com/SushmithaPulagam/TimeSeries_Auto-ARIMA

@trentmc trentmc added the Type: Enhancement New feature or request label May 8, 2024
@trentmc trentmc changed the title [Aimodel, sim] Explore ARMA modeling [Aimodel, sim] Explore ARMA & ARIMA modeling May 8, 2024
@trentmc trentmc self-assigned this May 8, 2024
@trentmc trentmc changed the title [Aimodel, sim] Explore ARMA & ARIMA modeling [Aimodel, sim] Explore ARMA / ARIMA / SARIMA modeling May 8, 2024
@trentmc trentmc changed the title [Aimodel, sim] Explore ARMA / ARIMA / SARIMA modeling [Aimodel, sim] Explore ARMA / ARIMA / SARIMA / SARIMAX modeling May 9, 2024
@trentmc trentmc changed the title [Aimodel, sim] Explore ARMA / ARIMA / SARIMA / SARIMAX modeling [Aimodel, sim] Explore ARMA .. SARIMAX modeling May 9, 2024
@trentmc
Copy link
Member Author

trentmc commented May 9, 2024

The article towardsdatascience.com: ARIMA .. SARIMAX link had GREAT code. Here's all the code in one convenient place. To show any given plot, change if False to if True.

from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from statsmodels.tsa.stattools import adfuller
import pmdarima as pm

# load data
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")

#string to date format
df['Month'] = pd.to_datetime(df['Month'],infer_datetime_format=True)
df = df.set_index(['Month'])
df.head(5)

if False: # plot raw time series
    plt.figure(figsize=(15,7))
    plt.title("Number of Airline Passengers by Date")
    plt.xlabel('Date')
    plt.ylabel('Passengers')
    plt.plot(df)
    plt.show()

    
#Determine rolling statistics
df["rolling_avg"] = df["#Passengers"].rolling(window=12).mean() #window size 12 denotes 12 months, giving rolling mean at yearly level
df["rolling_std"] = df["#Passengers"].rolling(window=12).std()

if False: # plot raw time series + rolling statistics
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#379BDB', label='Original')
    plt.plot(df["rolling_avg"], color='#D22A0D', label='Rolling Mean')
    plt.plot(df["rolling_std"], color='#142039', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

#Augmented Dickey–Fuller test. It's stationary if p < 0.05
dftest = adfuller(df['#Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value

if False: # print ADF results
    print('Results of Dickey Fuller Test:')
    print(dfoutput)

#Standard ARIMA Model
ARIMA_model = pm.auto_arima(
    df['#Passengers'], 
    start_p=1, 
    start_q=1,
    test='adf', # use adftest to find optimal 'd'
    max_p=3, max_q=3, # maximum p and q
    m=1, # frequency of series (if m==1, seasonal is set to FALSE automatically)
    d=None,# let model determine 'd'
    seasonal=False, # No Seasonality for standard ARIMA
    trace=False, #logs 
    error_action='warn', #shows errors ('ignore' silences these)
    suppress_warnings=True,
    stepwise=True)

# plot diagnostics: [good fit if this cond'n is met]
# 1. standardized residual [There are no obvious patterns in the residuals, with values having a mean of zero and having a uniform variance.]
# 2. histogram + KDE estimate [The KDE curve should be very similar to the normal distribution, labeled as N(0,1) in the plot]
# 3. normal Q-Q [Most of the data points should lie on the straight line]
# 4. ACF plot, ie correlogram [95% of correlations for lag greater than zero should not be significant. The grey area is the confidence band, and if values fall outside of this then they are statistically significant. In our case, there are a few values outside of this area, and therefore we may need to add more predictors to make the model more accurate]

if False: #plot diagnostics on SARIMA model
    ARIMA_model.plot_diagnostics(figsize=(15,12))
    plt.show()

# use the model to forecast airline passenger counts over the next 24 months
def forecast(ARIMA_model, periods=24):
    # Forecast
    n_periods = periods
    fitted, confint = ARIMA_model.predict(n_periods=n_periods, return_conf_int=True)
    index_of_fc = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS')

    # make series for plotting purpose
    fitted_series = pd.Series(fitted, index=index_of_fc)
    lower_series = pd.Series(confint[:, 0], index=index_of_fc)
    upper_series = pd.Series(confint[:, 1], index=index_of_fc)

    # Plot
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#1f76b4')
    plt.plot(fitted_series, color='darkgreen')
    plt.fill_between(lower_series.index, 
                    lower_series, 
                    upper_series, 
                    color='k', alpha=.15)

    plt.title("ARIMA/SARIMA - Forecast of Airline Passengers")
    plt.show()

if False: # plot ARIMA forecast
    forecast(ARIMA_model)

# Seasonal - fit stepwise auto-ARIMA
SARIMA_model = pm.auto_arima(
    df["#Passengers"], start_p=1, start_q=1,
    test='adf',
    max_p=3, max_q=3, 
    m=12, #12 is the frequncy of the cycle
    start_P=0, 
    seasonal=True, #set to seasonal
    d=None, 
    D=1, #order of the seasonal differencing
    trace=False,
    error_action='ignore',  
    suppress_warnings=True, 
    stepwise=True)

if False: #plot diagnostics on SARIMA model
    SARIMA_model.plot_diagnostics(figsize=(15,12))
    plt.show()

if False: # plot SARIMA forecast
    forecast(SARIMA_model)


#adding exogenous variable
df['month_index'] = df.index.month

# SARIMAX Model
SARIMAX_model = pm.auto_arima(
    df[['#Passengers']], exogenous=df[['month_index']],
    start_p=1, start_q=1,
    test='adf',
    max_p=3, max_q=3, m=12,
    start_P=0, seasonal=True,
    d=None, D=1, 
    trace=False,
    error_action='ignore',  
    suppress_warnings=True, 
    stepwise=True)

def sarimax_forecast(SARIMAX_model, periods=24):
    # Forecast
    n_periods = periods

    forecast_df = pd.DataFrame(
        {"month_index":pd.date_range(df.index[-1], periods = n_periods, freq='MS').month},
        index = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS'))

    fitted, confint = SARIMAX_model.predict(
        n_periods=n_periods, 
        return_conf_int=True,
        exogenous=forecast_df[['month_index']],
    )
    index_of_fc = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS')

    # make series for plotting purpose
    fitted_series = pd.Series(fitted, index=index_of_fc)
    lower_series = pd.Series(confint[:, 0], index=index_of_fc)
    upper_series = pd.Series(confint[:, 1], index=index_of_fc)

    # Plot
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#1f76b4')
    plt.plot(fitted_series, color='darkgreen')
    plt.fill_between(lower_series.index, 
                    lower_series, 
                    upper_series, 
                    color='k', alpha=.15)

    plt.title("SARIMAX - Forecast of Airline Passengers")
    plt.show()

if False: # plot SARIMAX forecast
    sarimax_forecast(SARIMAX_model, periods=24)

@trentmc trentmc changed the title [Aimodel, sim] Explore ARMA .. SARIMAX modeling [EPIC, aimodel, sim] Explore ARMA .. SARIMAX modeling May 9, 2024
@trentmc trentmc changed the title [EPIC, aimodel, sim] Explore ARMA .. SARIMAX modeling [EPIC, aimodel, sim] Explore SARIMAX modeling May 9, 2024
@trentmc trentmc added the Epic label May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Epic Type: Enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant