
OOM error with custom dataset - python systematically crashes after a couple of epochs #559

Closed
manuelblancovalentin opened this issue Sep 14, 2022 · 5 comments

Comments

@manuelblancovalentin

Context
I am required to train a model to detect anomalies in images coming from a video stream from CCTV cameras. I have already built the dataset in the same format as the MVTec dataset ("good" images in a separate folder, "anomalies" images in a different one, and the "ground truth" segmentation masks in a third one). I created my own custom yaml file, which looks like this (I intentionally removed the paths, please ignore those lines):

dataset:
  name: brooks
  format: folder
  path: <removed_in_purpose>
  normal_dir: <removed_in_purpose> # name of the folder containing normal images.
  abnormal_dir: <removed_in_purpose> # name of the folder containing abnormal images.
  normal_test_dir: null # name of the folder containing normal test images.
  task: segmentation # classification or segmentation
  mask: <removed_in_purpose> #optional
  extensions: null
  split_ratio: 0.1 # ratio of the normal images that will be used to create a test split
  image_size: [512,512] #[256,256] #[115, 194] #[1149, 1940]
  train_batch_size: 1
  test_batch_size: 1
  num_workers: 4
  transform_config:
    train: null
    val: null
  create_validation_set: true
  tiling:
    apply: false
    tile_size: null
    stride: null
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16

model:
  name: padim
  backbone: resnet18
  pre_trained: true
  layers:
    - layer1
    - layer2
    - layer3
  normalization_method: min_max # options: [none, min_max, cdf]

metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
  threshold:
    image_default: 3
    pixel_default: 3
    adaptive: true

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 42
  path: <removed_in_purpose>

logging:
  logger: [] # options: [tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

#optimization:
#  openvino:
#    apply: false

# PL Trainer Args. Don't add extra parameter here.
trainer:
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  accumulate_grad_batches: 1
  amp_backend: native
  auto_lr_find: false
  auto_scale_batch_size: false
  auto_select_gpus: false
  benchmark: false
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  default_root_dir: null
  detect_anomaly: false
  deterministic: false
  devices: 1
  enable_checkpointing: true
  enable_model_summary: true
  enable_progress_bar: true
  fast_dev_run: false
  gpus: null # Set automatically
  gradient_clip_val: 0
  ipus: null
  limit_predict_batches: 1.0
  limit_test_batches: 1.0
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  log_every_n_steps: 50
  max_epochs: 4
  max_steps: -1
  max_time: null
  min_epochs: null
  min_steps: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
  num_nodes: 1
  num_processes: null
  num_sanity_val_steps: 0
  overfit_batches: 0.0
  plugins: null
  precision: 32
  profiler: null
  reload_dataloaders_every_n_epochs: 0
  replace_sampler_ddp: true
  sync_batchnorm: false
  tpu_cores: null
  track_grad_norm: -1
  val_check_interval: 1.0 # Don't validate before extracting features.

Describe the bug
When trying to use the previous configuration file to train a PaDiM network, the trainer starts but crashes consistently after only one, two, or three epochs (depending on the batch size and the input image size) - see the screenshot below.

As mentioned, I tried different batch sizes (as low as 1), numbers of epochs, and image input sizes; these are some of the tests I ran:

| Test | Image input size | Max epochs before crashing | Max batch size | Completed (✅) or crashed (❌)? | Comments |
|------|------------------|----------------------------|----------------|--------------------------------|----------|
| Test 0 | [100,100] | 10 | 8 |  | Accuracy too low!! |
| Test 1 | [256,256] | 4 | 1 |  |  |
| Test 1 | [200,200] | 4 | 1 |  | Accuracy too low!! |
| Test 2 | [256,256] | 1 | 1 |  | Accuracy too low!! |

I have tested this in three different environments, with the same results: an 80-core Xeon CPU with 96GB of memory and no GPU; an AWS g5.xlarge instance with 16GB of RAM and a 24GB GPU (NVIDIA A10G); and Google Colab. In all of them I get mostly the same result: the code just crashes after a couple of epochs. If I monitor the RAM/GPU usage, I can see that the process is killed once a certain maximum usage is reached.

In summary: the only meaningfully good results start when I train the model with an input size > 256 and for more than one epoch. For an image input size of 100px, I can train it for only 10 epochs before it crashes. So effectively, I cannot train the model to achieve the accuracy I would expect.

Expected behavior

  • I would expect to be able to train the model for as many epochs as I need, and for pytorch (or anomalib itself) to handle the tensors and training in a way that doesn't blow up the memory.

Screenshots

[screenshot of the crash]

Hardware and Software Configuration

My conda env config:

# packages in environment at /home/manuelbv/anaconda3/envs/anomalib_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
anomalib                  0.3.5                    pypi_0    pypi
bcrypt                    4.0.0                    pypi_0    pypi
ca-certificates           2022.07.19           h06a4308_0  
certifi                   2022.6.15        py38h06a4308_0  
cffi                      1.15.1                   pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
cryptography              38.0.1                   pypi_0    pypi
idna                      3.4                      pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
monotonic                 1.6                      pypi_0    pypi
ncurses                   6.3                  h5eee18b_3  
numpy                     1.23.3                   pypi_0    pypi
oauthlib                  3.2.1                    pypi_0    pypi
openssl                   1.1.1q               h7f8727e_0  
pandas                    1.4.4                    pypi_0    pypi
paramiko                  2.11.0                   pypi_0    pypi
pillow                    9.2.0                    pypi_0    pypi
pip                       22.1.2           py38h06a4308_0  
protobuf                  3.19.5                   pypi_0    pypi
psutil                    5.9.2                    pypi_0    pypi
pycparser                 2.21                     pypi_0    pypi
pynacl                    1.5.0                    pypi_0    pypi
python                    3.8.13               h12debd9_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytz                      2022.2.1                 pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
readline                  8.1.2                h7f8727e_1  
requests                  2.28.1                   pypi_0    pypi
scipy                     1.9.1                    pypi_0    pypi
setproctitle              1.3.2                    pypi_0    pypi
setuptools                63.4.1           py38h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.39.2               h5082296_0  
tk                        8.6.12               h1ccaba5_0  
tqdm                      4.64.1                   pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0  
xz                        5.2.5                h7f8727e_1  
zlib                      1.2.12               h5eee18b_3

And pip freeze (inside the conda environment used):

absl-py==1.2.0
aiohttp==3.8.1
aiosignal==1.2.0
albumentations==1.2.1
analytics-python==1.4.0
-e git+https://github.com/openvinotoolkit/anomalib.git@a0e040d445a4f4f4e772cbad4e4630036d82bdc0#egg=anomalib
antlr4-python3-runtime==4.9.3
anyio==3.6.1
async-timeout==4.0.2
attrs==22.1.0
backoff==1.10.0
bcrypt==4.0.0
cachetools==5.2.0
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
cryptography==38.0.1
cycler==0.11.0
docker-pycreds==0.4.0
docstring-parser==0.15
einops==0.4.1
fastapi==0.83.0
ffmpy==0.3.0
fonttools==4.37.1
frozenlist==1.3.1
fsspec==2022.8.2
gitdb==4.0.9
GitPython==3.1.27
google-auth==2.11.0
google-auth-oauthlib==0.4.6
gradio==3.3
grpcio==1.48.1
h11==0.12.0
httpcore==0.15.0
httpx==0.23.0
idna==3.4
imageio==2.21.3
imgaug==0.4.0
importlib-metadata==4.12.0
Jinja2==3.1.2
joblib==1.1.0
jsonargparse==4.13.3
kiwisolver==1.4.4
kornia==0.6.7
linkify-it-py==1.0.3
Markdown==3.4.1
markdown-it-py==2.1.0
MarkupSafe==2.1.1
matplotlib==3.5.3
mdit-py-plugins==0.3.0
mdurl==0.1.2
monotonic==1.6
multidict==6.0.2
networkx==2.8.6
numpy==1.23.3
oauthlib==3.2.1
omegaconf==2.2.3
opencv-python==4.6.0.66
opencv-python-headless==4.6.0.66
orjson==3.8.0
packaging==21.3
pandas==1.4.4
paramiko==2.11.0
pathtools==0.1.2
Pillow==9.2.0
Pmw==2.0.1
promise==2.3
protobuf==3.19.5
psutil==5.9.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pycryptodome==3.15.0
pydantic==1.10.2
pyDeprecate==0.3.2
pydub==0.25.1
PyNaCl==1.5.0
pyparsing==3.0.9
python-dateutil==2.8.2
python-gdsii==0.2.1
python-multipart==0.0.5
pytorch-lightning==1.6.5
pytz==2022.2.1
PyWavelets==1.3.0
PyYAML==6.0
qudida==0.0.4
requests==2.28.1
requests-oauthlib==1.3.1
rfc3986==1.5.0
rsa==4.9
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.6
scikit-image==0.19.3
scikit-learn==1.1.2
scipy==1.9.1
sentry-sdk==1.9.8
setproctitle==1.3.2
Shapely==1.8.4
shortuuid==1.0.9
shyaml==0.6.2
six==1.16.0
smmap==5.0.0
sniffio==1.3.0
starlette==0.19.1
tensorboard==2.10.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tifffile==2022.8.12
timm==0.5.4
torch==1.11.0
torchmetrics==0.9.1
torchtext==0.12.0
torchvision==0.12.0
tqdm==4.64.1
typing_extensions==4.3.0
uc-micro-py==1.0.1
urllib3==1.26.12
uvicorn==0.18.3
vext==0.7.6
wandb==0.12.17
websockets==10.3
Werkzeug==2.2.2
yarl==1.8.1
zipp==3.8.1

Additional comments
Could you please help me figure out how to train my model for as many epochs as I require to get my accuracy to a decent level, without the program crashing? Thank you!!!

@alexriedel1
Contributor

alexriedel1 commented Sep 15, 2022

PaDiM isn't "trained" in the usual sense: at training time it extracts image features and stores them. Once the features of every training image have been collected, the test-set image features are compared against the stored training features at test time.
So you don't have to "train" PaDiM for more than one epoch (except perhaps when you use random image augmentations).

Your accuracy will not rise with more epochs! Try different algorithms (e.g. PatchCore), other extraction backbones (e.g. Wide ResNet-50), or better training data.
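
To make the single-epoch point concrete, here is a rough, hypothetical sketch of the PaDiM-style feature collection described above (this is not anomalib's actual code; train_loader is a placeholder for your dataloader):

# Minimal sketch, assuming a frozen torchvision ResNet-18 backbone: features are
# collected in ONE pass over the training data and summarised; no weights are
# updated, so extra epochs only re-extract the same features.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet18(pretrained=True).eval()
extractor = create_feature_extractor(backbone, return_nodes=["layer1", "layer2", "layer3"])

embeddings = []
with torch.no_grad():                      # no gradients, no optimizer: nothing is "learned" across epochs
    for batch in train_loader:             # placeholder DataLoader yielding image tensors
        feats = extractor(batch)
        target_size = feats["layer1"].shape[-2:]
        # upsample the deeper layers to a common spatial size and stack the channels
        emb = torch.cat(
            [F.interpolate(f, size=target_size, mode="nearest") for f in feats.values()],
            dim=1,
        )
        embeddings.append(emb)

embedding = torch.cat(embeddings)          # (N, C, H, W): all training features, kept in memory
mean = embedding.mean(dim=0)               # per-patch statistics used later for scoring
# (PaDiM additionally estimates a per-patch covariance for the Mahalanobis distance.)

Because the whole feature bank is kept in memory, the footprint grows with the number of training images and with the feature-map resolution, which is why large images or large datasets can exhaust RAM/VRAM even though nothing is being optimized.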

@manuelblancovalentin
Author

manuelblancovalentin commented Sep 15, 2022

@alexriedel1 Thank you for your reply! I was not aware of this.

In any case, I tried patchcore with wide_resnet_50 and also with resnet18, and still got the same result: the process gets killed before the first epoch even finishes. I tried this on a machine with a GPU and on another one without a GPU (but with 96GB of RAM) and still got the same result.

On the server without a GPU, no warning or error message is displayed. The process simply gets killed at around 43% of the first epoch (I assume that's the point at which some maximum memory threshold is reached).

However, when running this on the machine WITH a GPU, there is an interesting warning that pops up right before the process gets killed. As you can see in the log I am attaching below (from the machine with GPU, using patchcore with a resnet18 backbone), there is a mention of a CUDA OOM runtime error, as well as of an env variable named "PYTORCH_CUDA_ALLOC_CONF" and another variable named "max_split_size_mb".

Searching for "PYTORCH_CUDA_ALLOC_CONF" on the internet, I found a couple of places (pytorch/pytorch#16417) where it was mentioned that this could be solved by one of the following (a short sketch of the empty_cache/max_split_size_mb idea follows this list):

  • Adding "torch.cuda.empty_cache()" either at the beginning of the code or after every validation iteration.
  • Reducing the batch size (my batch size is already 1, so I cannot reduce this)
  • Downgrading to torch 1.9.1 or 1.8.1 (which I am a bit worried about doing, because I believe this would affect anomalib). My current torch version is 1.12.1
  • Avoiding importing tensorflow alongside torch inside the code.
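
For reference, a minimal sketch (not a guaranteed fix) of how the torch.cuda.empty_cache() suggestion and the max_split_size_mb setting from the error message could be applied; the value 128 is only an example:

import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
# e.g. at the very top of tools/train.py or in the shell before launching it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value, tune as needed

import torch

# ...later, e.g. after every validation loop, release unused memory held by
# PyTorch's caching allocator to reduce fragmentation pressure:
if torch.cuda.is_available():
    torch.cuda.empty_cache()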

Any ideas?? Thank you very much!!

(dlcuda116) ubuntu@ip-172-31-5-105:~/projects/ForeignObjectsDetection/anomalib$ python tools/train.py --config /home/ubuntu/projects/ForeignObjectsDetection/custom_patchcore_config.yaml --model patchcore
2022-09-15 14:34:22,415 - pytorch_lightning.utilities.seed - INFO - Global seed set to 42
2022-09-15 14:34:22,418 - anomalib.data - INFO - Loading the datamodule
2022-09-15 14:34:22,419 - anomalib.models - INFO - Loading the model.
2022-09-15 14:34:22,440 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmp_1ylxwv0
2022-09-15 14:34:22,440 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmp_1ylxwv0/_remote_module_non_scriptable.py
2022-09-15 14:34:22,465 - anomalib.models.components.base.anomaly_module - INFO - Initializing PatchcoreLightning model.
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called `full_state_update` that has
                not been set for this class (AdaptiveThreshold). The property determines if `update` by
                default needs access to the full metric state. If this is not the case, significant speedups can be
                achieved and we recommend setting this to `False`.
                We provide an checking function
                `from torchmetrics.utilities import check_forward_no_full_state`
                that can be used to check if the `full_state_update=True` (old and potential slower behaviour,
                default for now) or if `full_state_update=False` can be used safely.

  warnings.warn(*args, **kwargs)
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `PrecisionRecallCurve` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called `full_state_update` that has
                not been set for this class (AnomalyScoreDistribution). The property determines if `update` by
                default needs access to the full metric state. If this is not the case, significant speedups can be
                achieved and we recommend setting this to `False`.
                We provide an checking function
                `from torchmetrics.utilities import check_forward_no_full_state`
                that can be used to check if the `full_state_update=True` (old and potential slower behaviour,
                default for now) or if `full_state_update=False` can be used safely.

  warnings.warn(*args, **kwargs)
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called `full_state_update` that has
                not been set for this class (MinMax). The property determines if `update` by
                default needs access to the full metric state. If this is not the case, significant speedups can be
                achieved and we recommend setting this to `False`.
                We provide an checking function
                `from torchmetrics.utilities import check_forward_no_full_state`
                that can be used to check if the `full_state_update=True` (old and potential slower behaviour,
                default for now) or if `full_state_update=False` can be used safely.

  warnings.warn(*args, **kwargs)
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
2022-09-15 14:34:22,865 - anomalib.utils.loggers - INFO - Loading the experiment logger(s)
2022-09-15 14:34:22,865 - anomalib.utils.callbacks - INFO - Loading the callbacks
2022-09-15 14:34:22,907 - pytorch_lightning.utilities.rank_zero - INFO - GPU available: True, used: True
2022-09-15 14:34:22,907 - pytorch_lightning.utilities.rank_zero - INFO - TPU available: False, using: 0 TPU cores
2022-09-15 14:34:22,907 - pytorch_lightning.utilities.rank_zero - INFO - IPU available: False, using: 0 IPUs
2022-09-15 14:34:22,907 - pytorch_lightning.utilities.rank_zero - INFO - HPU available: False, using: 0 HPUs
2022-09-15 14:34:22,907 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
2022-09-15 14:34:22,907 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
2022-09-15 14:34:22,907 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
2022-09-15 14:34:22,907 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_predict_batches=1.0)` was configured so 100% of the batches will be used..
2022-09-15 14:34:22,908 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
2022-09-15 14:34:22,908 - anomalib - INFO - Training the model.
2022-09-15 14:34:22,915 - anomalib.data.folder - INFO - Setting up train, validation, test and prediction datasets.
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `ROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:611: UserWarning: Checkpoint directory /home/ubuntu/projects/ForeignObjectsDetection/results/patchcore/brooks/weights exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
2022-09-15 14:34:26,932 - pytorch_lightning.accelerators.gpu - INFO - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py:183: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
  rank_zero_warn(
2022-09-15 14:34:26,935 - pytorch_lightning.callbacks.model_summary - INFO -
  | Name                  | Type                     | Params
-------------------------------------------------------------------
0 | image_threshold       | AdaptiveThreshold        | 0
1 | pixel_threshold       | AdaptiveThreshold        | 0
2 | training_distribution | AnomalyScoreDistribution | 0
3 | min_max               | MinMax                   | 0
4 | model                 | PatchcoreModel           | 11.7 M
5 | image_metrics         | AnomalibMetricCollection | 0
6 | pixel_metrics         | AnomalibMetricCollection | 0
-------------------------------------------------------------------
11.7 M    Trainable params
0         Non-trainable params
11.7 M    Total params
46.758    Total estimated model params size (MB)
Epoch 0:   0%|                                                                                                                                                                   | 0/7268 [00:00<?, ?it/s]/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py:137: UserWarning: `training_step` returned `None`. If this was on purpose, ignore this warning...
  self.warning_cache.warn("`training_step` returned `None`. If this was on purpose, ignore this warning...")
Epoch 0:  29%|████████████████████████████████████████▉                                                                                                    | 2108/7268 [00:20<00:50, 101.72it/s, loss=nan]Traceback (most recent call last):
  File "/home/ubuntu/projects/ForeignObjectsDetection/anomalib/tools/train.py", line 71, in <module>
    train()
  File "/home/ubuntu/projects/ForeignObjectsDetection/anomalib/tools/train.py", line 60, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
    result = self._run_optimization(
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1646, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 403, in step
    closure()
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 140, in _wrap_closure
    closure_result = closure()
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
    step_output = self._step_fn()
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 427, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 333, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/ubuntu/projects/ForeignObjectsDetection/anomalib/anomalib/models/patchcore/lightning_model.py", line 77, in training_step
    embedding = self.model(batch["image"])
  File "/home/ubuntu/anaconda3/envs/dlcuda116/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/projects/ForeignObjectsDetection/anomalib/anomalib/models/patchcore/torch_model.py", line 70, in forward
    embedding = self.generate_embedding(features)
  File "/home/ubuntu/projects/ForeignObjectsDetection/anomalib/anomalib/models/patchcore/torch_model.py", line 104, in generate_embedding
    embeddings = torch.cat((embeddings, layer_embedding), 1)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 22.20 GiB total capacity; 14.47 GiB already allocated; 6.06 MiB free; 20.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0:  29%|██▉       | 2108/7268 [00:21<00:51, 99.64it/s, loss=nan]

@alexriedel1
Contributor

alexriedel1 commented Sep 16, 2022

The error is probably an out-of-memory condition in both cases, and there's not much you can do about it besides increasing your GPU VRAM or reducing the training set size. The first is a bit more difficult, so you should start by reducing the training set size.
limit_train_batches: 1.0 in your config file defines the fraction of training data that is used for training (100%). You can decrease the value and see at which point the remaining samples fit into your GPU memory.

Also try decreasing the image size to (256,256).
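
As a hypothetical illustration (paths and values are examples only), both changes can be applied to the config with OmegaConf, which is already installed in the environment listed above:

from omegaconf import OmegaConf

# Load the existing config, shrink the training subset and the input size, and save a copy.
config = OmegaConf.load("custom_patchcore_config.yaml")      # placeholder path
config.trainer.limit_train_batches = 0.5                      # use only 50% of the training batches
config.dataset.image_size = [256, 256]                        # smaller inputs -> smaller feature maps
OmegaConf.save(config, "custom_patchcore_config_small.yaml")  # pass this file to tools/train.py --config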

@samet-akcay
Contributor

@manuelblancovalentin, as @alexriedel1 pointed out, padim and patchcore are not memory efficient. If you get OOM even within a single epoch, you could try @alexriedel1's suggestion, or alternatively try training the DRAEM+SSPCAB model. The authors claim SOTA results here on video anomaly detection, which would be more suitable for your use case.

@samet-akcay
Contributor

I'll convert this to a discussion, feel free to continue from there.

@openvinotoolkit openvinotoolkit locked and limited conversation to collaborators Sep 23, 2022
@samet-akcay samet-akcay converted this issue into discussion #581 Sep 23, 2022
