
Large GPU memory consumption at the beginning of training #20

Open
zqh0253 opened this issue Jun 16, 2022 · 1 comment

Comments


zqh0253 commented Jun 16, 2022

Hi, thanks for the great work!

I ran the code on 8 A100 GPUs and found that GPU memory consumption is extremely large during the first several ticks.
Here is the output log:

tick 0     kimg 0.2      time 1m 31s       sec/tick 21.8    sec/kimg 113.36  maintenance 69.7   cpumem 4.70   gpumem 67.21  augment 0.000
Evaluating metrics for 3sky_timelapse_256_stylegan-v_random3_max32_3-4468dd1 ...
{"results": {"fvd2048_16f": 992.2131880075198}, "metric": "fvd2048_16f", "total_time": 80.59011363983154, "total_time_str": "1m 21s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1655261186.965401}
{"results": {"fvd2048_128f": 1764.0538105755193}, "metric": "fvd2048_128f", "total_time": 230.15506172180176, "total_time_str": "3m 50s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1655261417.1899228}
{"results": {"fvd2048_128f_subsample8f": 1241.4737946211158}, "metric": "fvd2048_128f_subsample8f", "total_time": 54.82384514808655, "total_time_str": "55s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1655261472.0662923}
{"results": {"fid50k_full": 381.6109859044359}, "metric": "fid50k_full", "total_time": 83.23335003852844, "total_time_str": "1m 23s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1655261555.3598747}
tick 1     kimg 5.4      time 11m 28s      sec/tick 23.0    sec/kimg 4.43    maintenance 573.9  cpumem 12.42  gpumem 69.14  augment 0.000
tick 2     kimg 10.6     time 11m 50s      sec/tick 21.4    sec/kimg 4.13    maintenance 0.0    cpumem 12.42  gpumem 10.26  augment 0.000
tick 3     kimg 15.7     time 12m 11s      sec/tick 21.6    sec/kimg 4.17    maintenance 0.0    cpumem 12.42  gpumem 10.26  augment 0.000
tick 4     kimg 20.9     time 12m 33s      sec/tick 21.4    sec/kimg 4.12    maintenance 0.0    cpumem 12.42  gpumem 10.26  augment 0.003
tick 5     kimg 26.1     time 12m 55s      sec/tick 21.7    sec/kimg 4.18    maintenance 0.0    cpumem 12.42  gpumem 10.29  augment 0.010
tick 6     kimg 31.3     time 13m 16s      sec/tick 21.9    sec/kimg 4.22    maintenance 0.0    cpumem 12.42  gpumem 10.29  augment 0.026
tick 7     kimg 36.5     time 13m 39s      sec/tick 22.4    sec/kimg 4.32    maintenance 0.0    cpumem 12.42  gpumem 10.33  augment 0.038
tick 8     kimg 41.7     time 14m 00s      sec/tick 21.6    sec/kimg 4.16    maintenance 0.1    cpumem 12.42  gpumem 10.33  augment 0.036
tick 9     kimg 46.8     time 14m 23s      sec/tick 22.3    sec/kimg 4.30    maintenance 0.1    cpumem 12.42  gpumem 10.32  augment 0.038
tick 10    kimg 52.0     time 14m 44s      sec/tick 21.4    sec/kimg 4.13    maintenance 0.0    cpumem 12.42  gpumem 10.33  augment 0.028

As you can see, the gpumem column in the first two ticks is abnormally high (around 67-69 GB, versus about 10 GB from tick 2 onward). Do you have any idea what causes this?
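For anyone trying to narrow this down, here is a hedged diagnostic sketch (not from the thread itself, just an assumption that the training loop is PyTorch, as in StyleGAN-V): a small helper around `torch.cuda`'s allocator statistics that reports the peak memory per tick, so you can see exactly which tick the spike occurs in.

```python
# Diagnostic sketch: report peak GPU memory per tick via torch.cuda's
# allocator statistics. The helper name and call site are hypothetical.
try:
    import torch
except ImportError:  # degrade gracefully on machines without PyTorch
    torch = None


def log_peak_gpu_mem(tag: str) -> float:
    """Print and return the peak GPU memory allocated so far, in GiB."""
    if torch is None or not torch.cuda.is_available():
        return 0.0  # no PyTorch or no GPU: nothing to measure
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"{tag}: peak gpumem {peak_gib:.2f} GiB")
    # Reset the high-water mark so the next call measures a fresh window.
    torch.cuda.reset_peak_memory_stats()
    return peak_gib


# Example: call once at the end of every tick in the training loop,
# e.g. log_peak_gpu_mem(f"tick {cur_tick}")
```

Calling this at the end of each tick (and once right after model construction) would distinguish a one-off spike during the very first forward/backward pass from memory held across ticks.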


1702609 commented Feb 12, 2023

I am experiencing the same problem: a single forward pass during training consumes 46 GB of VRAM, while inference takes less than 8 GB. What is the solution to this?
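One likely contributor to the training-vs-inference gap (a hedged sketch of general PyTorch behavior, not a confirmed diagnosis of this repo): a forward pass in training mode records the autograd graph and keeps intermediate activations alive for backward, whereas a forward under `torch.no_grad()` (typical for inference) does not, which alone can account for a several-fold memory difference.

```python
# Sketch: a training-style forward retains the autograd graph (and hence
# the activations needed for backward); a no_grad forward does not.
import torch

x = torch.randn(4, 8, requires_grad=True)

y = (x * 2).sum()            # training-style forward: graph is recorded
assert y.grad_fn is not None  # backward graph (and activations) kept alive

with torch.no_grad():        # inference-style forward
    z = (x * 2).sum()
assert z.grad_fn is None      # no graph recorded -> far less memory held
```

This does not explain the spike being limited to the first two ticks, but it is why comparing inference memory to training memory directly can be misleading.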
