
Large GPU memory consumption at the beginning of training #20

Open
zqh0253 opened this issue Jun 16, 2022 · 1 comment

Comments


zqh0253 commented Jun 16, 2022

Hi, thanks for the great work!

I ran the code on 8 A100 GPUs and found that GPU memory consumption is extremely large during the first several ticks.
Here is the output log:

tick 0     kimg 0.2      time 1m 31s       sec/tick 21.8    sec/kimg 113.36  maintenance 69.7   cpumem 4.70   gpumem 67.21  augment 0.000
Evaluating metrics for 3sky_timelapse_256_stylegan-v_random3_max32_3-4468dd1 ...
{"results": {"fvd2048_16f": 992.2131880075198}, "metric": "fvd2048_16f", "total_time": 80.59011363983154, "total_time_str": "1m 21s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1655261186.965401}
{"results": {"fvd2048_128f": 1764.0538105755193}, "metric": "fvd2048_128f", "total_time": 230.15506172180176, "total_time_str": "3m 50s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1655261417.1899228}
{"results": {"fvd2048_128f_subsample8f": 1241.4737946211158}, "metric": "fvd2048_128f_subsample8f", "total_time": 54.82384514808655, "total_time_str": "55s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1655261472.0662923}
{"results": {"fid50k_full": 381.6109859044359}, "metric": "fid50k_full", "total_time": 83.23335003852844, "total_time_str": "1m 23s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1655261555.3598747}
tick 1     kimg 5.4      time 11m 28s      sec/tick 23.0    sec/kimg 4.43    maintenance 573.9  cpumem 12.42  gpumem 69.14  augment 0.000
tick 2     kimg 10.6     time 11m 50s      sec/tick 21.4    sec/kimg 4.13    maintenance 0.0    cpumem 12.42  gpumem 10.26  augment 0.000
tick 3     kimg 15.7     time 12m 11s      sec/tick 21.6    sec/kimg 4.17    maintenance 0.0    cpumem 12.42  gpumem 10.26  augment 0.000
tick 4     kimg 20.9     time 12m 33s      sec/tick 21.4    sec/kimg 4.12    maintenance 0.0    cpumem 12.42  gpumem 10.26  augment 0.003
tick 5     kimg 26.1     time 12m 55s      sec/tick 21.7    sec/kimg 4.18    maintenance 0.0    cpumem 12.42  gpumem 10.29  augment 0.010
tick 6     kimg 31.3     time 13m 16s      sec/tick 21.9    sec/kimg 4.22    maintenance 0.0    cpumem 12.42  gpumem 10.29  augment 0.026
tick 7     kimg 36.5     time 13m 39s      sec/tick 22.4    sec/kimg 4.32    maintenance 0.0    cpumem 12.42  gpumem 10.33  augment 0.038
tick 8     kimg 41.7     time 14m 00s      sec/tick 21.6    sec/kimg 4.16    maintenance 0.1    cpumem 12.42  gpumem 10.33  augment 0.036
tick 9     kimg 46.8     time 14m 23s      sec/tick 22.3    sec/kimg 4.30    maintenance 0.1    cpumem 12.42  gpumem 10.32  augment 0.038
tick 10    kimg 52.0     time 14m 44s      sec/tick 21.4    sec/kimg 4.13    maintenance 0.0    cpumem 12.42  gpumem 10.33  augment 0.028

As you can see, the gpumem column in the first two ticks is abnormally high (around 67-69 GB, versus about 10 GB from tick 2 onward). Do you have any idea what causes this?
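For anyone trying to narrow this down, here is a hedged diagnostic sketch (not from the thread itself, just an assumption that the training loop is PyTorch, as in StyleGAN-V): a small helper around `torch.cuda`'s allocator statistics that reports the peak memory per tick, so you can see exactly which tick the spike occurs in.

```python
# Diagnostic sketch: report peak GPU memory per tick via torch.cuda's
# allocator statistics. The helper name and call site are hypothetical.
try:
    import torch
except ImportError:  # degrade gracefully on machines without PyTorch
    torch = None


def log_peak_gpu_mem(tag: str) -> float:
    """Print and return the peak GPU memory allocated so far, in GiB."""
    if torch is None or not torch.cuda.is_available():
        return 0.0  # no PyTorch or no GPU: nothing to measure
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"{tag}: peak gpumem {peak_gib:.2f} GiB")
    # Reset the high-water mark so the next call measures a fresh window.
    torch.cuda.reset_peak_memory_stats()
    return peak_gib


# Example: call once at the end of every tick in the training loop,
# e.g. log_peak_gpu_mem(f"tick {cur_tick}")
```

Calling this at the end of each tick (and once right after model construction) would distinguish a one-off spike during the very first forward/backward pass from memory held across ticks.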


1702609 commented Feb 12, 2023

I am experiencing the same problem: a single forward pass during training consumes 46 GB of VRAM, while inference takes less than 8 GB. What is the solution to this?
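One likely contributor to the training-vs-inference gap (a hedged sketch of general PyTorch behavior, not a confirmed diagnosis of this repo): a forward pass in training mode records the autograd graph and keeps intermediate activations alive for backward, whereas a forward under `torch.no_grad()` (typical for inference) does not, which alone can account for a several-fold memory difference.

```python
# Sketch: a training-style forward retains the autograd graph (and hence
# the activations needed for backward); a no_grad forward does not.
import torch

x = torch.randn(4, 8, requires_grad=True)

y = (x * 2).sum()            # training-style forward: graph is recorded
assert y.grad_fn is not None  # backward graph (and activations) kept alive

with torch.no_grad():        # inference-style forward
    z = (x * 2).sum()
assert z.grad_fn is None      # no graph recorded -> far less memory held
```

This does not explain the spike being limited to the first two ticks, but it is why comparing inference memory to training memory directly can be misleading.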
