ElasticNet more than doubles memory consumption #27886

carlosg-m · 2023-12-02T02:07:11Z

carlosg-m
Dec 2, 2023

Basically if I have a dataset that takes 1gb of ram after calling ElasticNet I'm going to get peak memory consumption of around ~2200mb.

I've set copy_X to false and passed the X argument as a Fortran-contiguous numpy array using np.asfortranarray to no avail.

Any tips to save memory and avoid dataset duplication would be much appreciated.

glemaitre · 2023-12-02T14:55:01Z

glemaitre
Dec 2, 2023
Maintainer

fit_intercept=True will trigger a copy since you need to center the data. So you're data is already centered you can avoid the copy by passing fit_intercept=False. I also see that we expect y to be F-aligned array.

0 replies

carlosg-m · 2023-12-05T11:25:37Z

carlosg-m
Dec 5, 2023
Author

@glemaitre, thank you!

To give a little bit of context we're using Spark (Python and Pyspark) on Databricks platform to forecast energy consumption reported by medium voltage Substations.

Each load diagram has 15 minute frequency and 2 channels (active power and reactive power), forecasts are at least 3 days ahead, weather forecasts are included as exogenous variables.

A specialized model model is applied to each load diagram. In total there are around 70k MV Substations which produces ~140K time series to forecast on a daily basis with model retraining using the best hyperparameters determined during crossvalidation (walk forward optimization similar to Prophet's cross_validation).

To this day our best model is a simple Linear Regression using lags. As a proof of concept it was implemented using Statsmodels Autoreg. However their implementation is not the best in terms of speed or memory consumption, and has no regularization.

There are a lot of situations in this dataset where there is beneficial to have some kind of regularization:

to deal with multicollinearity caused by autocorrelation in each load diagram
to deal with multicollinearity caused by crosscorrelation between each load diagrams and exogenous variables like temperature or solar radiation (we could benefit for some kind of prewhitening or feature selection, but it is very hard to implement an efficient method that works for all series)
to deal with irrelevant features
to deal with model complexity, overfitting and noisy forecasts

Sklearn's ElasticNet model was a clear candidate to alleviate all of the issues described. There was the need to implement the autoregressive loop that extends the forecasts to the future, and this had to be done in Numpy because Sklearn's predict turned out to be slow when called in a loop to obtain single predictions.

Since we are running this setup in a cluster with ~384 cores and ~1536 GB of ram and each model will have allocated a single core and a limited amount of memory, each second and mb of memory spared is precious. The process was unstable although running, and due to high memory usage and garbage collection issues some workers would die.

With your tip it was possible to speedup and stabilize the process. Inference of the Elastic takes around 36 minutes.

Examples below.

def shift_series(pd_ts, lags):
    '''
    criar lags de série temporal com padding
    '''
    # padding
    xtrain = np.concatenate([np.repeat(np.array(np.nan).astype(np.float32), lags-1), pd_ts.values], axis=0)
    # janela deslizante
    xtrain = np.lib.stride_tricks.sliding_window_view(xtrain, window_shape=(lags,))
    # converter df pandas
    xtrain = pd.DataFrame(xtrain, index=pd_ts.index)
    # retornar
    return xtrain

class Model:

    def __init__(self, model):
                
        # Imputer
        self.imputer_mean_ = None

        # Standard Scaler
        self.scaler_mean_ = None
        self.scaler_std_ = None
        self.epsilon_ = np.finfo(np.float32).eps    

        # Linear Regression
        self.model_ = model
        self.model_coef_ = None
        self.model_intercept_ = None
        #self.model_features_ = None
        
    def fit(self, x, y):

        ### imputation
        self.imputer_mean_ = np.nan_to_num(np.nanmean(x, axis=0, keepdims=True), nan=0.0)
        x = np.where(np.isnan(x), self.imputer_mean_, x)
        
        ### z-score standardize X 
        self.scaler_mean_ = np.mean(x, axis=0, keepdims=True)
        self.scaler_std_ = np.std(x, axis=0, keepdims=True)
        x = x - self.scaler_mean_
        x = np.divide(x, self.scaler_std_, out=np.zeros_like(x, dtype=np.float32), where=(self.scaler_std_ > self.epsilon_))

        ### center y so intercept equals zero (disable intercept fit)
        # https://github.com/scikit-learn/scikit-learn/discussions/27886
        self.model_intercept_ = np.mean(y, axis=0, keepdims=True)
        y = y - self.model_intercept_

        ### model
        # skleran: To avoid unnecessary memory duplication the X argument of the fit method 
        #          should be directly passed as a Fortran-contiguous numpy array.
        x = np.asfortranarray(x)
        y = np.asfortranarray(y)
        # fit model
        x = self.model_.fit(x, y)
        # save model coefs for inference
        self.model_coef_ = x.coef_.ravel()
        # save model intercept for inference
        #self.model_intercept_ = x.intercept_.ravel()
        # save selected features
        #self.model_features_ = SelectFromModel(x, prefit=True).get_support()

        return self
    
    def predict(self, x):
        
        ### imputation
        x = np.where(np.isnan(x), self.imputer_mean_, x)
        ### scaling
        x = x - self.scaler_mean_
        x = np.divide(x, self.scaler_std_, out=np.zeros_like(x, dtype=np.float32), where=(self.scaler_std_ > self.epsilon_))
        ### model
        x = np.dot(x, self.model_coef_) + self.model_intercept_
        
        return x

def sklearn_autoreg(pd_ts, lags, model, forecast_horizon, df_exog=None, lags_exog=None):
    # Implementa autoregressão (AutoReg) em sklearn de forma eficiente.
    # Permite aceder ao modelo treinado.

    # gerar range de datas a prever no futuro
    xtrain_index = pd_ts.index
    xpred_index = pd.date_range(start=pd_ts.index[-1], periods=forecast_horizon+1, freq=pd_ts.index.freq, inclusive='right')
    full_index = np.concatenate((xtrain_index, xpred_index), axis=0)

    # alvo é série original, com lag 0
    # gurdar mascara com valores não nulos
    ytrain = pd_ts.values
    ytrain_mask = ~np.isnan(ytrain)

    # padding janela com média da série, evita nulos ou cortar inicio do dataset
    xtrain = shift_series(pd_ts.shift(1), lags=lags) 
    # criar lags de exógenas, concatenar a dataset
    if (df_exog is not None):
        df_exog = df_exog.reindex(full_index)
        df_exog = pd.concat([shift_series(df_exog[col], lags=lags_exog) for col in df_exog.columns], axis=1)
        xtrain = xtrain.join(df_exog, lsuffix='lags', rsuffix='exog')
    # converter xtrain para numpy
    xtrain = xtrain.values

    # instanciar modelo regressão 
    model = Model(model=model)

    # ignorar valores nulos no alvo
    xtrain = xtrain[ytrain_mask]
    ytrain = ytrain[ytrain_mask]
    # treinar modelo
    model.fit(xtrain, ytrain)
    # apagar dataset de treino para poupar memória
    del xtrain, ytrain, ytrain_mask
    # mostrar variaveis escolhidas (coeficientes diferentes de zero)
    #print('non-zero coefs:', np.sum(model.model_features_))
    
    # constantes para controlar divergencia de autoregressão
    clip_lower = np.finfo(np.float32).min
    clip_upper = np.finfo(np.float32).max
    # inicializar loop de autoregressão com última instancia do dataset
    xpred = pd_ts.values[-lags:]
    # percorrer loop, obter previsão, retroalimentar até percorrer horizonte de previsão  
    for xpred_date in xpred_index:
        # da lista de valores de input do modelo selecionar apenas os necessários para inferir
        xpred_append = xpred[-lags:].reshape(1,-1)
        # adicionar exógenas
        if (df_exog is not None):
            xexog_append = df_exog.loc[[xpred_date]].values
            xpred_append = np.concatenate((xpred_append, xexog_append), axis=1)
        # prever próximo valor
        xpred_append = model.predict(xpred_append)
        # manter valor previsto entre intervalo inferior e superior para evitar divergencia 
        xpred_append = np.clip(xpred_append, clip_lower, clip_upper)
        # concatenar previsão ao array de valores de input do modelo
        xpred = np.concatenate([xpred, xpred_append], axis=0)
    # da janela de lags, obter apenas previsões de acordo com o horizonte de previsão
    xpred = xpred[-forecast_horizon:]

    # converter previsões de treino em série com respetivo indice
    xpred = pd.Series(xpred, index=xpred_index)
    
    # retornar previsões
    return xpred

1 reply

glemaitre Dec 5, 2023
Maintainer

There was the need to implement the autoregressive loop that extends the forecasts to the future, and this had to be done in Numpy because Sklearn's predict turned out to be slow when called in a loop to obtain single predictions.

This something known. We have some overhead for input checking when it comes to latency. If it is in production and only inference matter, one way to go around is to use something like ONNX (and more precisely sklearn-onnx) to reduce the latency.

With your tip it was possible to speedup and stabilize the process. Inference of the Elastic takes around 36 minutes.

Nice ;)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ElasticNet more than doubles memory consumption #27886

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

ElasticNet more than doubles memory consumption #27886

carlosg-m Dec 2, 2023

Replies: 2 comments · 1 reply

glemaitre Dec 2, 2023 Maintainer

carlosg-m Dec 5, 2023 Author

glemaitre Dec 5, 2023 Maintainer

carlosg-m
Dec 2, 2023

Replies: 2 comments 1 reply

glemaitre
Dec 2, 2023
Maintainer

carlosg-m
Dec 5, 2023
Author

glemaitre Dec 5, 2023
Maintainer