Replies: 5 comments
-
Hi, could you share minimal reproducible code with us?
-
I suspect …
-
Unfortunately this is the shortest version I was able to reproduce in a reasonable time:
-
Thanks and sorry for the late response. I'm not sure, but pytorch/vision#539 seems related, so maybe not using …
-
I think this issue looks more like a question than a bug in Optuna, so let me convert this issue to a discussion.
-
Expected behavior
Expected pruning of runs to pass cleanly, with each run fully stopped before moving on to the next.
Environment
Error messages, stack traces, or logs
Steps to reproduce
Additional context (optional)
Results seem to be similar to those found on the PyTorch forums: https://discuss.pytorch.org/t/too-many-open-files-caused-by-persistent-workers-and-pin-memory/193372, though in my case the error is not dependent on the `pin_memory` or `persistent_workers` variables. Watching the memory usage on our compute node indicates a similar behavior: more and more files remain unclosed after pruning, until eventually the Python environment throws an error saying too many files are open.
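The descriptor growth described above can be observed directly from inside the process. The sketch below is a standalone illustration (not tied to Optuna or PyTorch): it counts this process's open file descriptors via the Linux-specific `/proc/self/fd` listing and shows how unclosed handles, like the pipes a worker process leaves behind, accumulate until the limit is hit:

```python
import os

def open_fd_count():
    # Count this process's open file descriptors (Linux-specific: /proc/self/fd).
    return len(os.listdir("/proc/self/fd"))

baseline = open_fd_count()

# Simulate handles that are never closed, e.g. worker pipes left behind
# after a pruned trial tears down its DataLoader incompletely.
leaked = [os.pipe() for _ in range(5)]  # each pipe holds two descriptors

after = open_fd_count()
print(after - baseline)  # grows by 10 (two descriptors per pipe)

# Closing the handles releases the descriptors again.
for r, w in leaked:
    os.close(r)
    os.close(w)
```

Watching this count across trials (or `ls /proc/<pid>/fd | wc -l` from a shell) makes it easy to confirm whether pruned trials are leaking descriptors.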
This does not fail upon the first pruned run; the error only begins to appear after several runs have been pruned. Normally a run takes on the order of 8 seconds per epoch, but after some time several runs appear to have stacked up, with individual epochs taking on the order of 5 minutes to complete:
```
================================================
------------ Hyper-parameter Tuning ------------
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
0 | loss_fn | BCEWithLogitsLoss | 0
1 | inc | Down | 2.8 K
2 | down1 | Down | 14.0 K
3 | down2 | Down | 55.7 K
4 | down3 | Down | 221 K
5 | down4 | Down | 886 K
6 | up1 | Up | 574 K
7 | up2 | Up | 143 K
8 | up3 | Up | 36.1 K
9 | up4 | Up | 9.1 K
10 | outc | OutConv | 17
11 | accuracy | BinaryAccuracy | 0
1.9 M Trainable params
0 Non-trainable params
1.9 M Total params
7.776 Total estimated model params size (MB)
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [07:58<00:00, 0.06it/s]
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [08:36<00:00, 0.05it/s]
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [08:16<00:00, 0.22it/s]
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 432/432 [11:24<00:00, 0.63it/s]
Epoch 4: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 216/216 [09:45<00:00, 0.37it/s]
Epoch 4: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 432/432 [09:37<00:00, 0.75it/s]
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 432/432 [00:09<00:00, 47.91it/s]
[I 2024-03-18 10:30:30,473] Trial 24 pruned. Trial was pruned at epoch 1.██████████████████████████████████| 48/48 [00:00<00:00, 133.81it/s]
```
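One mitigation pattern worth trying is to release the loaders explicitly in a `try`/`finally` so cleanup runs whether the trial finishes or is pruned. The sketch below shows the shape of that pattern; it uses a stand-in `TrialPruned` exception and stand-in loader objects (hypothetical, so it runs without Optuna or PyTorch installed), but the same structure applies to a real `objective` that raises `optuna.TrialPruned`:

```python
import gc

class TrialPruned(Exception):
    """Stand-in for optuna.TrialPruned so this sketch runs without Optuna."""

cleanup_count = 0  # instrumentation: how many trials released their loaders

def objective(trial_id):
    global cleanup_count
    loaders = {"train": object(), "val": object()}  # stand-ins for DataLoaders
    try:
        for epoch in range(3):
            # A pruner would decide here based on intermediate metrics;
            # for illustration, prune every even trial after its first epoch.
            if trial_id % 2 == 0 and epoch == 1:
                raise TrialPruned()
        return 0.0
    finally:
        # Runs on both normal return and pruning: drop the loader references
        # so worker processes and their pipe descriptors can be reclaimed,
        # then force a collection before the next trial opens new files.
        loaders.clear()
        gc.collect()
        cleanup_count += 1

for t in range(4):
    try:
        objective(t)
    except TrialPruned:
        pass  # a real study records the trial as pruned and moves on

print(cleanup_count)  # → 4: every trial cleaned up, pruned or not
```

If the leak persists even with explicit cleanup, lowering `num_workers` or disabling `persistent_workers` for tuning runs narrows down whether the worker processes themselves are the source.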