context_gpu.cu causing memory issues #232
Comments
I had the same kind of issue (except in my case it was not random). I solved it by lowering the amount of memory required, by modifying the config.yaml. Hope it helps.
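For reference, here is a minimal sketch of that kind of edit (it assumes PyYAML is installed and that the file uses Detectron-style keys such as TRAIN.IMS_PER_BATCH and TRAIN.MAX_SIZE; the exact keys and values in your config.yaml may differ):

```python
# Minimal sketch: lower the memory footprint of training by editing config.yaml.
# Assumes the config root is a mapping with Detectron-style keys
# (TRAIN.IMS_PER_BATCH, TRAIN.MAX_SIZE); adjust key names and values to your file.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

train = cfg.setdefault("TRAIN", {})
train["IMS_PER_BATCH"] = 1   # fewer images per GPU -> lower peak memory
train["MAX_SIZE"] = 1000     # cap the longest image side (hypothetical value)

with open("config.yaml", "w") as f:
    yaml.safe_dump(cfg, f, default_flow_style=False)
```

Fewer images per batch and a smaller maximum image size both reduce peak GPU memory, at the cost of the training trade-off discussed in the next comment.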
@francoto thanks for the input. Indeed, reducing the batch size could mitigate this problem, but it will also impact model performance. Also, the batch size easily fits inside the GPU memory; yet at some random point during training (usually after ~16k iterations) the memory usage suddenly increases and training crashes, which is strange behavior. I haven't yet had time to look into what triggers the context_gpu code to fire up, though.
The problem occurs for me when I run training:

Found Detectron ops lib: /home/intern/usr/local/lib/libcaffe2_detectron_ops_gpu.so
...
INFO train.py: 131: Building model: generalized_rcnn

I will add my environment info later.
Same problem here, but I have 100 GB of free space... Is there specific memory used by my GPU? Should I get a better GPU?
This problem occurs very randomly for me. The network (in this case RetinaNet) is training just fine, when at a random number of iterations context_gpu.cu fires up and seems to eat up the GPU memory, such that training is halted with an out-of-memory error. We're using Ubuntu 16.04 with Pascal GPUs. It happens on several machines, with different numbers of GPUs (1-4), and when training different network architectures.
Any thoughts?
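To narrow down when the spike happens, one option is to log GPU memory use alongside training. Below is a minimal sketch (not part of Detectron; it assumes nvidia-smi is on the PATH, and gpu_mem_log.csv is an arbitrary file name) that appends per-GPU memory readings once a second:

```python
# Minimal sketch: poll nvidia-smi once a second and append per-GPU memory use
# to a CSV file, so the sudden jump can be lined up with the iteration count
# in the training log. Assumes nvidia-smi is on the PATH.
import subprocess
import time

LOG_PATH = "gpu_mem_log.csv"  # arbitrary output file name

while True:
    readings = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]
    ).decode().strip()
    with open(LOG_PATH, "a") as f:
        for line in readings.splitlines():
            f.write("{:.1f}, {}\n".format(time.time(), line))
    time.sleep(1)
```

Running this alongside training should show whether memory climbs gradually or jumps all at once when context_gpu.cu reports the failure.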