Adding regressors #188

Open
drew6050 opened this issue Jun 21, 2023 · 7 comments

Comments

@drew6050

Hello. I love this library.

I’m trying to add additional regressors that I don’t know future values for. The documentation example is for wide data when the future is known. But, the last paragraph says:

“Additional regressors can be passed through as additional time series to forecast as part of df_long. Some models here can utilize the additional information they provide to help improve forecast quality. To prevent forecast accuracy for considering these additional series too heavily, input series weights that lower or remove their forecast accuracy from consideration.”

Does this mean that just by having additional columns in the df_long, like below, they will be considered as additional regressors on the value_col?

long = True
model = model.fit(
    df_long,
    date_col='datetime' if long else None,
    value_col='value' if long else None,
    id_col='series_id' if long else None,
)
prediction = model.predict()
forecasts_df = prediction.forecast

I see a couple of others have asked similar questions, but I still don’t see an answer to this. Also, a simple example in the documentation of how to implement this paragraph would help so much!

@winedarksea
Owner

@drew6050 There are a couple of ways to do this, and it is definitely confusing.

  1. If you are using long style data, you are close but not quite right. The regressors share the same columns as everything else: their values go in the value_col, and they are differentiated by having their own series_id in the series_id column. For wide style data, each regressor is just another column. This sort of input only helps with certain multivariate models.
  2. You can lag your data to push it into the future, then use it as a future_regressor. The "future" value is then not about the future but provides a reference to the past (at a relevant seasonal interval lag if possible). I have a tool that can help with that, see create_regressor used as an example here:
    regr_train, regr_fcst = create_regressor(
    Note that df here is your covariates, in a wide (not long) style. This works with the models that can use a future_regressor; it doesn't always help, but sometimes does well.
    1 and 2 can be used together, and both are in the production example (with wide style data). It would be worth running that for reference.
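
To make option 2 concrete, here is a minimal sketch of the lagging idea in plain Python, so it runs without AutoTS or pandas. The helper name `lag_regressor` and the sample numbers are hypothetical, and this is not how `create_regressor` is implemented. It only illustrates why a covariate with unknown future values can still fill a future regressor once it is shifted by the forecast length.

```python
def lag_regressor(values, forecast_length):
    """Shift a covariate forward in time by `forecast_length` steps.

    Returns (train, future):
      train  - same length as `values`; position t holds values[t - forecast_length],
               with None in the first `forecast_length` warm-up positions.
      future - the last `forecast_length` observed values, which line up with
               the forecast horizon once the series is shifted.
    """
    h = forecast_length
    if not 0 < h <= len(values):
        raise ValueError("forecast_length must be between 1 and len(values)")
    train = [None] * h + list(values[:-h])
    future = list(values[-h:])
    return train, future


# Hypothetical covariate observed over 7 periods, forecasting 3 periods ahead.
covariate = [5, 6, 7, 8, 9, 10, 11]
regr_train, regr_fcst = lag_regressor(covariate, forecast_length=3)
print(regr_train)  # [None, None, None, 5, 6, 7, 8]
print(regr_fcst)   # [9, 10, 11]
```

The warm-up positions holding `None` would need to be dropped or filled before fitting. Choosing a lag that matches a seasonal interval, as suggested above, keeps the shifted values more relevant to the period being forecast.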

@galenseilis

If it is confusing, then that highlights the need for more documentation! Digging through reported issues on GitHub to find a link to an example script that comes without detailed explanations is not a nice experience.

@winedarksea
Owner

That is valid, my documentation is not the best.
My primary goal is to provide the most accurate forecasts.
I have found that even with good documentation, most people either fail to read it or don't understand it anyway, so it is really hard to justify putting the time in.
Also, my target audience is data scientists, who are usually smart enough to figure these things out if they want to.

Another useful example:
regressor_search.py.txt

@nagydavid

Hi @winedarksea, could you please upload the C:/Users/Colin/Downloads/as_export_small.csv file?

@winedarksea
Owner

@nagydavid
Here is an adjustment to use the built-in sample datasets instead:

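from autots import load_daily
from autots.tools.shaping import long_to_wide  # missing from the original snippet

# load the bundled daily sample data in long format, then pivot it to wide
df_long = load_daily(long=True)
df = long_to_wide(
    df_long,
    date_col='datetime',  # name of datetime column
    value_col='value',  # name of target values column
    id_col='series_id',  # name of ID column, should be composite of levels, if multiple
)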

@CRJFisher

Hello @winedarksea, and thank you for the clarifications thus far. To aid our understanding, I'd like to restate the information provided, incorporating some example data tables for both wide and long formats, including regressor series.

For wide format data, each series (including regressors) is represented as a separate column, e.g.

| Date | Series_A | Series_B | Regressor_A | Regressor_B |
|---|---|---|---|---|
| 2023-01-01 | 10 | 20 | 5 | 3 |
| 2023-01-02 | 12 | 22 | 6 | 4 |

In long format data, the regressor series are appended as rows, e.g.

| Date | Value | Series_ID |
|---|---|---|
| 2023-01-01 | 10 | Series_A |
| 2023-01-02 | 12 | Series_A |
| 2023-01-01 | 20 | Series_B |
| 2023-01-02 | 22 | Series_B |
| 2023-01-01 | 5 | Regressor_A |
| 2023-01-02 | 6 | Regressor_A |
| 2023-01-01 | 3 | Regressor_B |
| 2023-01-02 | 4 | Regressor_B |

By integrating regressors as described, we enable (some) models to correlate the original series with the regressor series, thus potentially enhancing forecast accuracy.

Question 1: To confirm, the primary utility of adding these regressor series is to allow models to exploit correlations between the original series and the regressor series - is this correct?

Question 2: Regarding the application of regressors for specific series-to-extra information relationships (e.g., linking "product-a" directly to "brand-b"), could you elaborate on whether this is possible within AutoTS? This would involve adding regressors that are not just additional time series but carry categorical or entity-specific information relevant to the primary time series.

Understanding that documentation and detailed explanations are time-consuming to produce and may not always be fully appreciated, it's discussions like these that often provide invaluable insights and learning opportunities for practitioners in the field. Thanks again for your efforts in developing AutoTS and supporting its community.

@winedarksea
Owner

yes, your data view there is correct on long vs wide.
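
As a rough illustration of that long vs wide layout, the sketch below converts the wide example into long rows using plain Python, then builds a weights mapping that removes the regressor series from accuracy scoring, as the documentation paragraph quoted at the top of the thread describes. The helper `wide_to_long`, the `Regressor_` naming convention, and the choice of 0 and 1 as weights are assumptions for illustration only. How such a mapping is passed to AutoTS should be checked against the library documentation.

```python
def wide_to_long(wide_rows, date_key="Date"):
    """Turn wide rows (one column per series) into long rows (one row per observation)."""
    long_rows = []
    for row in wide_rows:
        for series_id, value in row.items():
            if series_id == date_key:
                continue  # the date is carried on every long row, not treated as a series
            long_rows.append(
                {"datetime": row[date_key], "series_id": series_id, "value": value}
            )
    return long_rows


wide = [
    {"Date": "2023-01-01", "Series_A": 10, "Series_B": 20, "Regressor_A": 5, "Regressor_B": 3},
    {"Date": "2023-01-02", "Series_A": 12, "Series_B": 22, "Regressor_A": 6, "Regressor_B": 4},
]
long_rows = wide_to_long(wide)

# Assumed convention: regressor series get weight 0 so their forecast accuracy
# is ignored, while the real target series keep weight 1.
series_ids = sorted({row["series_id"] for row in long_rows})
weights = {sid: (0 if sid.startswith("Regressor_") else 1) for sid in series_ids}
print(weights)
```

The regressor rows sit in the same datetime, series_id, and value columns as the target series. Only the series_id distinguishes them.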

for Question 1: yes, the general idea behind adding regressors is adding external information that can help explain the behavior of the series. Sometimes the model is able to extract insights from general market data or other high level indicators, but generally regressors are only significantly valuable when they provide clear, direct insight into business drivers. An example I know is where knowing the number of school children that will be out of school on holiday, by distance from the business, adds a lot of predictive power to a business driven by kids and families visiting.

for Question 2: regressor_per_series and static_categorical (which becomes static_regressor and categorical_groups) are only available through the lower-level API, and only for some models: Cassandra (regressor per series, categorical groups), MultivariateRegression (regressor per series and static regressor), WindowRegression (static regressor) and NeuralForecast (static regressor and regressor per series). See regressor_search.py posted above.

But for the high-level AutoTS model search, you can combine all your regressors into a single df, pass it as a future_regressor, and let the model work out which regressor helps which series.
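
A minimal sketch of the "single df" idea, again in plain Python so it runs on its own: several regressors, each keyed by date, are aligned on the dates they share so that every row carries all regressor values for one timestamp. The function `combine_regressors` and the regressor names are hypothetical. In practice this would more likely be a wide pandas DataFrame indexed by date.

```python
def combine_regressors(**regressors):
    """Align several {date: value} regressors on the dates they all share.

    Returns (dates, rows) where rows[i] maps regressor name -> value on dates[i].
    Dates are ISO strings, so sorting them as text sorts them chronologically.
    """
    if not regressors:
        raise ValueError("at least one regressor is required")
    shared = set.intersection(*(set(series) for series in regressors.values()))
    dates = sorted(shared)
    rows = [{name: series[d] for name, series in regressors.items()} for d in dates]
    return dates, rows


# Hypothetical regressors with only partly overlapping date ranges.
school_holiday = {"2023-01-01": 1, "2023-01-02": 0, "2023-01-03": 0}
promo = {"2023-01-02": 1, "2023-01-03": 0, "2023-01-04": 1}

dates, rows = combine_regressors(school_holiday=school_holiday, promo=promo)
print(dates)  # ['2023-01-02', '2023-01-03']
print(rows)   # [{'school_holiday': 0, 'promo': 1}, {'school_holiday': 0, 'promo': 0}]
```

Keeping only the shared dates avoids gaps in the combined table. Whether to drop or fill non-overlapping dates is a judgment call that depends on the data.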

Generally regressors don't help as much as people hope. Focus on adding a few quality features that you know impact that business rather than just trying to feed in as much data as possible. There isn't enough history and there is way too much noise in most time series to find the deep hidden patterns people sometimes hope exist in massive regressor sets.
