OOM error with custom dataset - Python systematically crashes after a couple of epochs #559
Comments
PaDiM isn't really "trained": it extracts image features at training time and stores them. Once the features of every training image have been extracted, the test-set image features are compared against the stored training features at test time. Your accuracy will not rise with more epochs! Try a different algorithm (e.g. PatchCore), a different extraction backbone (e.g. Wide ResNet-50), or better training data.
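This is why extra epochs add nothing for these models: features are collected once and simply stored. A minimal, hypothetical sketch of the memory-bank idea in plain PyTorch (a simplification for illustration, not anomalib's PaDiM or PatchCore implementation):

```python
import torch
import torchvision

# Hypothetical sketch of a memory-bank anomaly detector (not anomalib's code).
# A frozen, pretrained backbone extracts features; "training" only fills the bank.
backbone = torchvision.models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()   # keep the pooled feature vector
backbone.eval()

@torch.no_grad()
def build_memory_bank(train_loader):
    feats = [backbone(images) for images, _ in train_loader]
    return torch.cat(feats)         # every pass over the data yields the same bank

@torch.no_grad()
def anomaly_score(image, memory_bank):
    f = backbone(image.unsqueeze(0))
    # distance to the nearest stored "good" feature is the anomaly score
    return torch.cdist(f, memory_bank).min().item()
```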
@alexriedel1 Thank you for your reply! I was not aware of this. In any case, I tried PatchCore with wide_resnet_50 and also with resnet18, and still got the same result: the process gets killed before the first epoch even finishes. I tried this on a machine with a GPU and on another without a GPU (but with 96 GB of RAM) and still got the same result. On the server without a GPU, no warning or error message is displayed; the process simply gets killed at around 43% of the first epoch (I assume that is the point at which some maximum memory threshold is reached). However, when running this on the machine WITH a GPU, there is an interesting warning that pops up right before the process gets killed. As you can see in the log I am attaching below (the one for the machine with GPU, using PatchCore and a resnet18 backbone), there is a mention of a CUDA OOM runtime error, as well as of an environment variable named "PYTORCH_CUDA_ALLOC_CONF" and another named "max_split_size_mb". Searching for "PYTORCH_CUDA_ALLOC_CONF" online I found a couple of places (pytorch/pytorch#16417) where they mentioned this could be solved by either:
Any ideas?? Thank you very much!!
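For reference, PYTORCH_CUDA_ALLOC_CONF has to be set before the first CUDA allocation; a minimal sketch, with 128 MB as an arbitrary illustrative split size rather than a recommendation from this thread:

```python
import os

# Must be set before the first CUDA allocation; 128 MB is an arbitrary
# illustrative value, not a tuned recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the allocator config
print(torch.cuda.is_available())
```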
The error is probably a GPU OOM in both cases, and there is not much you can do about it besides increasing your GPU VRAM or reducing the training set size. The first is a bit more difficult, so you should start by reducing the training set size. Also try decreasing the image size to (256, 256).
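A rough sketch of both ideas in plain PyTorch/torchvision (in anomalib these would normally be set through the YAML config; the paths and subset size below are placeholders for illustration only):

```python
import torch
from torchvision import transforms
from torchvision.datasets import ImageFolder

# Resize images to 256x256 and train on a random subset of the training images.
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
full_set = ImageFolder("path/to/train", transform=transform)

keep = torch.randperm(len(full_set))[: len(full_set) // 2]  # keep half the images
train_subset = torch.utils.data.Subset(full_set, keep.tolist())
loader = torch.utils.data.DataLoader(train_subset, batch_size=8, num_workers=4)
```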
@manuelblancovalentin, as @alexriedel1 pointed out, PaDiM and PatchCore are not memory efficient. If you get an OOM even within a single epoch, you could try @alexriedel1's suggestions, or alternatively try to train the DRAEM+SSPCAB model. The authors claim SOTA results here on video anomaly detection, which would be more suitable for your use case.
I'll convert this to a discussion, feel free to continue from there.
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Context
I am required to train a model to detect anomalies in images coming from a video stream from CCTV cameras. I have already built the dataset in the same format as the MVTec dataset ("good" images in one folder, "anomalies" images in a second one, and the "ground truth" segmentation masks in a third one). I created my own custom YAML file, which looks like this (I intentionally removed the paths, please ignore those lines):
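As a rough, hypothetical illustration of the MVTec-style folder layout described above (plain Python, not the YAML config itself; directory names are placeholders):

```python
from pathlib import Path

# Placeholder paths mirroring the MVTec-style layout described above.
root = Path("datasets/cctv")
for sub in ("good", "anomalies", "ground_truth"):
    files = list((root / sub).glob("*"))
    print(f"{sub}: {len(files)} files")
```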
Describe the bug
When trying to use the previous configuration file to train a PaDiM model, the trainer starts but consistently crashes after only one, two, or three epochs (depending on the batch size and the input image size); see the screenshot below.
As mentioned, I tried different batch sizes (as low as 1), numbers of epochs, and input image sizes; these are some of the tests I ran:
I have tested this in three different environments with the same result: an 80-core Xeon CPU machine with 96 GB of memory and no GPU; an AWS g5.xlarge instance with 16 GB of RAM and a 24 GB GPU (NVIDIA A10G); and Google Colab. In all of them the code just crashes after a couple of epochs. If I monitor the RAM/GPU usage, I can see that the process is killed once a certain maximum usage is reached.
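For reference, one way to log peak GPU memory from Python while debugging this kind of crash (a generic PyTorch snippet, not part of anomalib):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Print current and peak CUDA memory in GiB (no-op without a GPU)."""
    if not torch.cuda.is_available():
        return
    gib = 1024 ** 3
    print(f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB")

# e.g. call log_gpu_memory("after batch") inside the training loop
```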
In summary: meaningful results only start when I train the model with an input size > 256 and for more than 1 epoch. With an input image size of 100 px, I can train for only 10 epochs before it crashes. So effectively, I cannot train the model to the accuracy I would expect.
Expected behavior
Screenshots
Hardware and Software Configuration
My conda env config:
And pip freeze (inside the conda environment used):
Additional comments
Could you please help me figure out how to train my model for as many epochs as I require to get my accuracy to a decent level, without the program crashing? Thank you!!!