-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EPIC, aimodel, sim] Explore SARIMAX modeling #1006
Labels
Comments
trentmc
changed the title
[Aimodel, sim] Explore ARMA modeling
[Aimodel, sim] Explore ARMA & ARIMA modeling
May 8, 2024
trentmc
changed the title
[Aimodel, sim] Explore ARMA & ARIMA modeling
[Aimodel, sim] Explore ARMA / ARIMA / SARIMA modeling
May 8, 2024
trentmc
changed the title
[Aimodel, sim] Explore ARMA / ARIMA / SARIMA modeling
[Aimodel, sim] Explore ARMA / ARIMA / SARIMA / SARIMAX modeling
May 9, 2024
trentmc
changed the title
[Aimodel, sim] Explore ARMA / ARIMA / SARIMA / SARIMAX modeling
[Aimodel, sim] Explore ARMA .. SARIMAX modeling
May 9, 2024
The article from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from statsmodels.tsa.stattools import adfuller
import pmdarima as pm
# load data
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")
#string to date format
df['Month'] = pd.to_datetime(df['Month'],infer_datetime_format=True)
df = df.set_index(['Month'])
df.head(5)
if False: # plot raw time series
plt.figure(figsize=(15,7))
plt.title("Number of Airline Passengers by Date")
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.plot(df)
plt.show()
#Determine rolling statistics
df["rolling_avg"] = df["#Passengers"].rolling(window=12).mean() #window size 12 denotes 12 months, giving rolling mean at yearly level
df["rolling_std"] = df["#Passengers"].rolling(window=12).std()
if False: # plot raw time series + rolling statistics
plt.figure(figsize=(15,7))
plt.plot(df["#Passengers"], color='#379BDB', label='Original')
plt.plot(df["rolling_avg"], color='#D22A0D', label='Rolling Mean')
plt.plot(df["rolling_std"], color='#142039', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)
#Augmented Dickey–Fuller test. It's stationary if p < 0.05
dftest = adfuller(df['#Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
if False: # print ADF results
print('Results of Dickey Fuller Test:')
print(dfoutput)
#Standard ARIMA Model
ARIMA_model = pm.auto_arima(
df['#Passengers'],
start_p=1,
start_q=1,
test='adf', # use adftest to find optimal 'd'
max_p=3, max_q=3, # maximum p and q
m=1, # frequency of series (if m==1, seasonal is set to FALSE automatically)
d=None,# let model determine 'd'
seasonal=False, # No Seasonality for standard ARIMA
trace=False, #logs
error_action='warn', #shows errors ('ignore' silences these)
suppress_warnings=True,
stepwise=True)
# plot diagnostics: [good fit if this cond'n is met]
# 1. standardized residual [There are no obvious patterns in the residuals, with values having a mean of zero and having a uniform variance.]
# 2. histogram + KDE estimate [The KDE curve should be very similar to the normal distribution, labeled as N(0,1) in the plot]
# 3. normal Q-Q [Most of the data points should lie on the straight line]
# 4. ACF plot, ie correlogram [95% of correlations for lag greater than zero should not be significant. The grey area is the confidence band, and if values fall outside of this then they are statistically significant. In our case, there are a few values outside of this area, and therefore we may need to add more predictors to make the model more accurate]
if False: #plot diagnostics on SARIMA model
ARIMA_model.plot_diagnostics(figsize=(15,12))
plt.show()
# use the model to forecast airline passenger counts over the next 24 months
def forecast(ARIMA_model, periods=24):
# Forecast
n_periods = periods
fitted, confint = ARIMA_model.predict(n_periods=n_periods, return_conf_int=True)
index_of_fc = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS')
# make series for plotting purpose
fitted_series = pd.Series(fitted, index=index_of_fc)
lower_series = pd.Series(confint[:, 0], index=index_of_fc)
upper_series = pd.Series(confint[:, 1], index=index_of_fc)
# Plot
plt.figure(figsize=(15,7))
plt.plot(df["#Passengers"], color='#1f76b4')
plt.plot(fitted_series, color='darkgreen')
plt.fill_between(lower_series.index,
lower_series,
upper_series,
color='k', alpha=.15)
plt.title("ARIMA/SARIMA - Forecast of Airline Passengers")
plt.show()
if False: # plot ARIMA forecast
forecast(ARIMA_model)
# Seasonal - fit stepwise auto-ARIMA
SARIMA_model = pm.auto_arima(
df["#Passengers"], start_p=1, start_q=1,
test='adf',
max_p=3, max_q=3,
m=12, #12 is the frequncy of the cycle
start_P=0,
seasonal=True, #set to seasonal
d=None,
D=1, #order of the seasonal differencing
trace=False,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
if False: #plot diagnostics on SARIMA model
SARIMA_model.plot_diagnostics(figsize=(15,12))
plt.show()
if False: # plot SARIMA forecast
forecast(SARIMA_model)
#adding exogenous variable
df['month_index'] = df.index.month
# SARIMAX Model
SARIMAX_model = pm.auto_arima(
df[['#Passengers']], exogenous=df[['month_index']],
start_p=1, start_q=1,
test='adf',
max_p=3, max_q=3, m=12,
start_P=0, seasonal=True,
d=None, D=1,
trace=False,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
def sarimax_forecast(SARIMAX_model, periods=24):
# Forecast
n_periods = periods
forecast_df = pd.DataFrame(
{"month_index":pd.date_range(df.index[-1], periods = n_periods, freq='MS').month},
index = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS'))
fitted, confint = SARIMAX_model.predict(
n_periods=n_periods,
return_conf_int=True,
exogenous=forecast_df[['month_index']],
)
index_of_fc = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS')
# make series for plotting purpose
fitted_series = pd.Series(fitted, index=index_of_fc)
lower_series = pd.Series(confint[:, 0], index=index_of_fc)
upper_series = pd.Series(confint[:, 1], index=index_of_fc)
# Plot
plt.figure(figsize=(15,7))
plt.plot(df["#Passengers"], color='#1f76b4')
plt.plot(fitted_series, color='darkgreen')
plt.fill_between(lower_series.index,
lower_series,
upper_series,
color='k', alpha=.15)
plt.title("SARIMAX - Forecast of Airline Passengers")
plt.show()
if False: # plot SARIMAX forecast
sarimax_forecast(SARIMAX_model, periods=24) |
trentmc
changed the title
[Aimodel, sim] Explore ARMA .. SARIMAX modeling
[EPIC, aimodel, sim] Explore ARMA .. SARIMAX modeling
May 9, 2024
This was referenced May 9, 2024
Closed
trentmc
changed the title
[EPIC, aimodel, sim] Explore ARMA .. SARIMAX modeling
[EPIC, aimodel, sim] Explore SARIMAX modeling
May 9, 2024
This was referenced May 11, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Motivation
We currently use a formulation that looks directly at past values ("autoregressive" AR).
We could generalize this to..
This could unlock big gains in accuracy while retaining simple models; and plays well with more complex models too.
TODOs
Work through blog posts
Analyze for myself on crypto data, and build into sim_engine plots
Then, implement & explore:
Completed: Blog posts to work through
Work through these blog posts:
qingkaikong.blogspot.com
: autocorrelation tutorial. Linkprojectpro.io
: build ARIMA model in py. Linkwww.nbshare.io
: model sunspots with AR, ARIMA, SARIMA. Link. Perrin data. ReplaceARIMA
import with:from statsmodels.tsa.arima.model.ARIMA import ARIMA
neptune.ai
: time series forecasting: data, analysis, and practice. On exponential smoothing. Linkneptune.ai
: how to select a model for time series pred'n. Link. Good code showing how to plot a seasonal decompositionneptune.ai
: ARIMA & SARIMA. Linkneptune.ai
: ARIMA vs Prophet vs LSTM blog. Link, only doing ARIMA stuff via pmdarima.www.ethanrosenthal.com
: time series for scikit-learn people part 2. Link, refreshingly candidtowardsdatascience.com
: ARIMA .. SARIMAX, incl auto-find params link. GREAT code, copied to 1st comment belowzerotomastery.io
: auto SARIMAX blog. LinkAmit Kumar
: XLM price prediction with ARIMA. Pdf, Google scholar contextUseful filtering tricks, more.Jahnvi Sikligar
: Master's Thesis py notebook. GREAT codePotentially useful tools
NIXTLA StatsForecast library to auto-find SARIMAX params.
NIXTLA libraries for time-series
pmdarima library to auto-find SARIMAX params
auto_arima()
to choose ARIMA p, d, and q automatically. And in fact works for SARIMA and SARIMAX parameters too. Examplestatsmodels library
statsmodels.tsa.acf()
- autocorrelation function. Example use: Qingkaistatsmodels.graphics.tsaplots.plot_acf()
- plot autocorrelation function. Example use in projectpro.iostatsmodels.tsa.stattools.adfuller()
- gives probability that a time-series is non-stationary (via Augmented Dickey-Fully test). Example use in projectpro.iostatsmodels.tsa.arima.model.ARIMA()
linktsfresh library
pandas library
df.value.diff()
- first-order differencing: x(t-1) - x(t-2). Example use in projectpro.iodf.value.diff().diff()
- second-order differencing:(x[t-1] - x[t-2]) - (x[t-2] - x[t-3])
. Example use "".df.shift()
- shift all df rows up or down, or cols up or down. TutorialManually find ARIMA (p, d, q): Blog
statsmodels.tsa.stattools.adfuller()
on df.values.diff(), then 2 diffs, etc until result < 0.5. Then p = smallest # diffs that gave result < 0.5, minus 1 = min # differencings to make the data stationarySmoothing eg here
About ARMA ... SARIMAX
An AR formulation uses AR inputs to the model:
close
etc at previous candles =y[t-1], y[t-2]
, etc.x1
:x1[t-1], x1[t-2], ..., x1[t-p]
x2
:x2[t-1], x2[y-2], ..., x2[t-p]
xn
:xn[t-1], xn[t-2], ..., xn[t-p]
n
= # input feeds andp
= AR model order = # candles to look back at in ARAn ARMA formulation uses AR and MA inputs to the model:
y^[t-1] - y[t-1], y^[t-2] - y[t-2], ... y^[t-d]
.d
= MA model order = # previous predictions to look backwards for in MA = # lagged forecast errors in the prediction equationAnother way to think about the moving average model is that it corrects future forecasts based on errors made on recent forecasts. Ref
An ARIMA formulation uses AR, I, and MA inputs to the model:
close
etc at previous candles = y[t-1] - y[t-2], y[t-2] - y[t-3], etc. To be precise,I
can use each input feed, not just the feed to predict. So, it's really:x1
:x1[t-1] - x1[t-2], x1[t-2] - x1[t-3], ... x1[t-q] - x1[t-q-1]
x2
:x2[t-1] - x2[t-2], x2[t-2] - x2[t-3], ..., x2[t-q] - x2[t-q-1]
xn
:xn[t-1] - xn[t-2], xn[t-2] - xn[t-3], ..., xn[t-q] - xn[t-q-1]
n
= # input feeds andq
= I model order = # candles to look back at for differencing = # nonseasonal differences needed for stationarityx1[t-1] - x1[t-2]
etc. This accounts for a linear "trend" in the data.(x1[t-1] - x1[t-2]) - (x1[t-2] - x[t-3])
. This accounts for a quadratic "trend" in the data.ARIMA "order" is a 3-tuple (p,d,q) for order of (AR,I,MA). By example:
SARIMA models. If a "season" S is every 12 iterations, then a term
x1[t-1] - x1[t-12-1]
would capture it.SARIMAX models. "Exogenous regressors" aka "external variables" X are variables besides the target feed. Eg if we're predicting Binance BTC/USDT c, X variables could include Binance BTC/USDT ohlv, Binance ETH/USDT ohlcv, Kraken BTC/USDT ohlcv, etc. These are important to include for Predictoor; eg every non-BTC coin has known dependency on BTC price.
Appendix: Resources
Penn State U "Applied Time Series Analysis"
Great tutorials of AR, ARMA, ARIMA, SARIMA, more. With py code. https://online.stat.psu.edu/stat510/
Neptune. ai ARIMA/SARIMA Real World Time Series Forecasting Guide
https://neptune.ai/blog/arima-sarima-real-world-time-series-forecasting-guide
pmdarima: ARIMA estimators for Python
https://alkaline-ml.com/pmdarima/index.html
"pmdarima brings R’s beloved auto.arima to Python, making an even stronger case for why you don’t need R for data science. pmdarima is 100% Python + Cython and does not leverage any R code, but is implemented in a powerful, yet easy-to-use set of functions & classes that will be familiar to scikit-learn users."
This neptune.ai blog post compares ARIMA
(optimized with pmdarima) to Prophet and LSTM. ARIMA wins.
Ethan Rosenthal Blog
Off-the-cuff ARIMA; accounting for (non) stationarity; how to use via
skits
; more. Great article! https://www.ethanrosenthal.com/2018/03/22/time-series-for-scikit-learn-people-part2/Auto ARIMA
Auto-discovers p, d, q
Blog: https://towardsdatascience.com/time-series-forecasting-using-auto-arima-in-python-bb83e49210cd
Code: https://github.com/SushmithaPulagam/TimeSeries_Auto-ARIMA
The text was updated successfully, but these errors were encountered: