@glemaitre, thank you! To give a bit of context: we're using Spark (Python and PySpark) on the Databricks platform to forecast the energy consumption reported by medium voltage (MV) substations. Each load diagram has a 15-minute frequency and 2 channels (active power and reactive power), forecasts are at least 3 days ahead, and weather forecasts are included as exogenous variables. A specialized model is applied to each load diagram. In total there are around 70k MV substations, which produces ~140k time series to forecast on a daily basis, with model retraining using the best hyperparameters determined during cross-validation (walk-forward optimization, similar to Prophet's). To this day our best model is a simple linear regression using lags; as a proof of concept it was implemented using Statsmodels. There are a lot of situations in this dataset where it is beneficial to have some kind of regularization.
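A minimal sketch of that kind of lag-based linear model with walk-forward evaluation (the synthetic series, shapes, and the one-day `n_lags=96` window are illustrative assumptions, not the production code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

def make_lag_features(y, n_lags):
    # Row t of X holds y[t:t+n_lags]; the target is the next value y[t+n_lags].
    X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]

# Toy 15-minute load series with a daily cycle (96 samples/day, 30 days);
# a real series would be one channel of one substation's load diagram.
rng = np.random.default_rng(0)
t = np.arange(96 * 30)
y = np.sin(2 * np.pi * t / 96) + 0.1 * rng.standard_normal(t.size)

X, target = make_lag_features(y, n_lags=96)  # one day of lags

# Walk-forward evaluation: each fold trains on the past, tests on the next day.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5, test_size=96).split(X):
    model = LinearRegression().fit(X[train_idx], target[train_idx])
    print(f"fold R^2: {model.score(X[test_idx], target[test_idx]):.3f}")
```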
Sklearn's ElasticNet fits that need. Since we are running this setup on a cluster with ~384 cores and ~1536 GB of RAM, and each model gets a single core and a limited amount of memory, every second and MB of memory spared is precious. The process ran but was unstable: due to high memory usage and garbage collection issues, some workers would die. With your tip it was possible to speed up and stabilize the process; inference with the ElasticNet now takes around 36 minutes. Examples below.
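A minimal sketch of that memory-friendly fit (the shapes and hyperparameters here are illustrative assumptions; in the real job X would be the per-substation feature matrix):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
# Build X directly in Fortran (column-major) order as float64: that is the
# layout the coordinate descent solver expects, so no internal copy is made.
X = np.asfortranarray(rng.standard_normal((500_000, 96)))
y = rng.standard_normal(500_000)

# copy_X=False tells scikit-learn it may work on X in place instead of
# duplicating it; combined with the Fortran layout above, fit avoids the
# extra full-size copy of the training data.
model = ElasticNet(alpha=0.1, l1_ratio=0.5, copy_X=False)
model.fit(X, y)
preds = model.predict(X[-96:])  # e.g. score the most recent day
```

Note that with `copy_X=False` the estimator may overwrite `X` in place (e.g. centering for the intercept), so it should only be used when the array is not needed unmodified afterwards.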
Basically, if I have a dataset that takes 1 GB of RAM, after calling ElasticNet's `fit` I get a peak memory consumption of around ~2200 MB. I've set `copy_X` to False and passed the X argument as a Fortran-contiguous NumPy array using `np.asfortranarray`, to no avail. Any tips to save memory and avoid dataset duplication would be much appreciated.
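One way to measure that peak is the `memory_profiler` package, which samples the process RSS while `fit` runs (the array shapes below are illustrative assumptions, sized to roughly match the 1 GB case):

```python
import numpy as np
from memory_profiler import memory_usage  # pip install memory-profiler
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
# ~0.77 GB of float64 data: 1_000_000 rows x 96 columns, Fortran-ordered.
X = np.asfortranarray(rng.standard_normal((1_000_000, 96)))
y = rng.standard_normal(1_000_000)

model = ElasticNet(alpha=0.1, copy_X=False, max_iter=10)  # few iterations, just to keep the run short

# memory_usage runs fit in-process and samples RSS; max_usage=True reports the
# peak, so any hidden full-size copy of X shows up as roughly 2x the data size.
peak = memory_usage((model.fit, (X, y)), max_usage=True)
print("peak RSS during fit (MiB):", peak)
```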