Describe the bug
TransLink was trying to determine the minimum and recommended CPU and memory requirements for our current ABM model by using multiprocessing with chunking turned on in ActivitySim. Our testing was done on Azure VMs. We ran into out-of-memory issues on high-core-count machines, even though the VMs had plenty of memory.
To Reproduce
We are running our model on v1.2.1. Our input data contains 1.1 million households and 2.88 million persons, with ~1700 TAZs; our model uses a 1-zone system with 48 half-hour time windows. We performed 10 runs with different combinations of CPU and memory configurations; our results are below:
| run_machine_label* | chunk_training | num_process | chunk_size | crash point | Note |
| --- | --- | --- | --- | --- | --- |
| output_FAILED_4cpu_128GB | training | 4 | 100GB | mem error in trip_destination | |
| output_FAILED_48cpu_384GB | training | 16 | 200GB | mem error in workplace_location | fails for both numpy version 1.24 and 1.25 |
| output_FAILED_64cpu_512GB | training | 16 | 200GB | mem error in workplace_location | |
| output_FAILED_64cpu_512GB_Prod | production | 16 | 200GB | mem error in workplace_location | used chunk_cache from "output_SUCCESS_16cpu_256GB" |
| output_SUCCESS_16cpu_256GB | training | 16 | 200GB | NA | run time of 255.8 mins |
| output_SUCCESS_16cpu_512GB | training | 16 | 200GB | NA | run time of 129.6 mins |
| output_SUCCESS_16cpu_512GB_Prod | production | 16 | 200GB | NA | run time of 129.5 mins (used chunk_cache from "output_SUCCESS_16cpu_512GB") |
| output_SUCCESS_32cpu_512GB | training | 16 | 200GB | NA | run time of 203.2 mins |
| output_SUCCESS_32cpu_512GB_Prod | production | 16 | 200GB | NA | run time of 86.9 mins (used chunk_cache from "output_SUCCESS_32cpu_512GB") |
| output_SUCCESS_no_mp_32cpu_512GB_25pct | disabled | NA | NA | NA | run time of 222.1 mins (25% sample with no multiprocessing) |

*Note that the CPU and memory in the run machine label refer to the size of the VM, not the chunking configuration; the num_process and chunk_size columns contain our chunking configuration.
Steps to reproduce the behavior:
1. Change the VM size to the desired CPU and memory. We made sure the model was the only program running on the VM.
2. Update settings.yaml to have the chunk training mode, number of processes, and chunk size indicated above. Sharrow is off for our model runs. Note that we restricted the number of processes (num_process in the settings.yaml for config_mp) on higher-core-count machines to avoid high multiprocessing overhead.
3. Start the model run with the full sample.
4. Memory errors occur when the machine core count is high (for example, 48 or 64 CPUs).
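For reference, the chunking-related portion of our settings.yaml looks roughly like the sketch below. The key names follow ActivitySim's configuration conventions, but exact names can differ between versions, so treat this as illustrative rather than a verbatim copy of our file:

```yaml
# Sketch of the chunking configuration for the 16-process runs
# (key names are assumptions; check the ActivitySim docs for your version)
multiprocess: True
num_processes: 16
chunk_training_mode: training   # or "production" / "disabled"
chunk_size: 200_000_000_000     # 200 GB, expressed in bytes
```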
Expected behavior
We should be able to do chunk training on higher-core-count machines as long as we have enough memory. We should also be able to take the chunk_cache.csv produced on a lower-core-count, lower-memory machine and run successfully on higher-spec machines. Neither seems to be the case.
Because chunk training and chunk production runs fail unexpectedly, we cannot determine a minimum and recommended spec, given the wide range of VMs and servers our partners and stakeholders could use to run our model.
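To make the "enough memory" claim concrete, here is a back-of-envelope calculation (a hypothetical helper, not ActivitySim's actual internal accounting): if chunk_size acts as a total budget shared across workers, each of our 16 processes should only need to stay within 200GB / 16 = 12.5 GB, which a 512 GB VM should supply many times over.

```python
# Hypothetical back-of-envelope check (not ActivitySim's internal accounting):
# if chunk_size caps total chunked memory and is split across workers, each
# process should stay within roughly chunk_size / num_processes.
def per_process_chunk_gb(chunk_size_gb: float, num_processes: int) -> float:
    return chunk_size_gb / num_processes

# Our failed 64cpu_512GB run: chunk_size of 200 GB over 16 processes.
print(per_process_chunk_gb(200, 16))  # → 12.5 GB per worker
```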
Screenshots
We have full logs for these test runs. Please reach out to me on Teams or by email; I'm happy to send them to anyone interested in looking into this.
Here is the memory profile for the no-multiprocessing run with the 25% sample; keep in mind that we ran our model with ActivitySim v1.2.1.
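For anyone digesting the logs, this is a minimal sketch of how we scan a memory log for the peak step. The column names and the sample rows here are illustrative assumptions, not the exact schema or contents of ActivitySim's memory profile output:

```python
import csv
import io

# Hypothetical excerpt of a memory-use log; "time", "rss_gb", and "event"
# are assumed column names, not ActivitySim's actual log layout.
log = io.StringIO("""time,rss_gb,event
0,10.2,initialize
60,95.7,workplace_location
120,88.4,trip_destination
""")

rows = list(csv.DictReader(log))
peak = max(rows, key=lambda r: float(r["rss_gb"]))
print(f"peak RSS {peak['rss_gb']} GB during {peak['event']}")
```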
Additional context
I had a discussion with @dhensle about this issue, and it looks like he is facing some issues with chunk training taking a long time with SANDAG's model. This issue could also be somewhat related to previous issues #543, #683, and #733.
We are working on trying the version of ActivitySim on the main branch. Will keep you posted on any memory profile changes there.
I attempted a run with TransLink's model on our own 24-CPU, 250 GB RAM machine using the current ActivitySim main branch and got the following results:
- Successful chunk training run with 15% sample size
- Failed chunk production run with 100% sample; crashed with an out-of-memory error in workplace location choice

We should not have a case where running in chunk production mode causes an out-of-memory error, especially when using a chunk_cache.csv that was created on the same machine!