Exception during training #324

sixtyfive · 2022-02-11T08:16:31Z

Latest `binary_datasets` commit, same data that always used to run for hundreds of epochs without problems:

$ ketos --version
ketos, version 3.0.8.dev100
$ ketos train -d cuda:0 -f binary -o rec --workers 18 -N 10000 -q dumb -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001 -w 0 -B 1 --augment --no-normalize-whitespace data.bin
...
stage 104/10000  [####################################]  2169/2169                                                                                                
validating  [####################################]  241/241          val_accuracy: 0.96538                                                                        
stage 105/10000  [#########---------------------------]  595/2169  00:04:47
Exception in thread Thread-212:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_filename
    storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: falseINTERNAL ASSERT FAILED at "../aten/src/ATen/MapAllocator.cpp":263, please report a bug to PyTorch. unable to open shared memory object </torch_1287290_140> in read-write mode

stage 105/10000  [#########---------------------------]  596/2169  00:04:47
Traceback (most recent call last):
  File "/home/escriptorium/kraken/env/bin/ketos", line 8, in <module>
    sys.exit(cli())
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 1053, in main                                                              
    rv = self.invoke(ctx)                                                                                                                                         
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 1659, in invoke                                                                                                    
    return _process_result(sub_ctx.command.invoke(sub_ctx))                                                                                                                                               
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 1395, in invoke                                                                                                 
    return ctx.invoke(self.callback, **ctx.params)                                                 
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 754, in invoke                                                                                                  
    return __callback(*args, **kwargs)                                                             
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func                                                                                           
    return f(get_current_context(), *args, **kwargs)                                               
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/kraken/ketos.py", line 595, in train                                                                                                 
    trainer.fit(model)                                                                             
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/kraken/lib/train.py", line 108, in fit                                                                                               
    super().fit(*args, **kwargs)                                                                   
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit                                                                              
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path                                                                                                                   
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt                                                       
    return trainer_fn(*args, **kwargs)                                                             
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl                                                                        
    self._run(model, ckpt_path=ckpt_path)                                                          
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run                                                                            
    self._dispatch()                                                                               
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch                                                                       
    self.training_type_plugin.start_training(self)                                                 
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training                                        
    self._results = trainer.run_stage()                                                            
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage                                                                       
    return self._run_train()                                                                       
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train                                                                      
    self.fit_loop.run()                                                                            
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run                                                                                   
    self.advance(*args, **kwargs)                                                                  
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance                                                                           
    self.epoch_loop.run(data_fetcher)                                                              
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run                                                                                   
    self.advance(*args, **kwargs)                                                                  
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 156, in advance                                                          
    batch_idx, (batch, self.batch_progress.is_last_batch) = next(self._dataloader_iter)                                                                                                                
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 203, in __next__                                                                      
    return self.fetching_function()                                                                
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 270, in fetching_function                                                             
    self._fetch_next_batch()                                                                       
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 300, in _fetch_next_batch                                                             
    batch = next(self.dataloader_iter)                                                             
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 550, in __next__                                                                      
    return self.request_next_batch(self.loader_iters)                                              
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 562, in request_next_batch                                                            
    return apply_to_collection(loader_iters, Iterator, next)                                       
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 95, in apply_to_collection                                                          
    return function(data, *args, **kwargs)                                                         
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__                                                                               
    data = self._next_data()                                                                       
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data                                                                            
    idx, data = self._get_data()                                                                   
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1147, in _get_data                                                                             
    raise RuntimeError('Pin memory thread exited unexpectedly')                                                                                                                                        
RuntimeError: Pin memory thread exited unexpectedly                                                
$


mittagessen · 2022-02-11T10:22:24Z

On 22/02/11 12:16AM, J. R. Schmid wrote:
> Latest `binary_datasets` commit, same command and data that always used to go hundreds of epochs no problem:

As the error message says, "please report a bug to PyTorch". This is a PyTorch-internal crash that we don't really have any influence on. Does it occur reliably? If so, an immediate workaround would be to disable workers in your call, as the crash happens somewhere in the shared memory code.
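For illustration only: a small sketch of what "disable workers" could look like for the reported call, assuming that passing 0 to `--workers` is how workers are switched off. The helper name `with_workers` is hypothetical, and the command below is a shortened form of the one in the report (the network spec and several options are left out).

```python
import shlex


def with_workers(command: str, workers: int) -> str:
    """Return `command` with the value that follows --workers replaced."""
    argv = shlex.split(command)
    if "--workers" not in argv:
        raise ValueError("command has no --workers option")
    idx = argv.index("--workers")
    if idx + 1 >= len(argv):
        raise ValueError("--workers is missing its value")
    argv[idx + 1] = str(workers)
    return shlex.join(argv)


# Shortened version of the reported invocation (assumption: the omitted
# options do not matter for this illustration).
original = "ketos train -d cuda:0 -f binary -o rec --workers 18 -N 10000 data.bin"
print(with_workers(original, 0))
```

Everything except the worker count stays the same, so a run that still crashes would point away from the worker/shared-memory path.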

mittagessen · 2022-02-11T10:28:44Z

There's this issue on the main repository: pytorch/pytorch#1355. Possible causes are indeed running into shared memory limits or, if you are using augmentation, the OpenCV OpenMP issue mentioned in there as well. Kraken sets OpenMP threads to 0 when using the GPU, but the last time I looked, the only way to do that reliably system-wide was through the environment variable.
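A rough sketch of how both suspects could be checked from Python, under stated assumptions: the comment does not name the environment variable, so the standard OpenMP variable `OMP_NUM_THREADS` is assumed here, and the value `1` is an illustrative choice. `/dev/shm` is assumed to be the shared-memory mount (typical on Linux); `shm_usage` is a hypothetical helper, not part of Kraken or PyTorch.

```python
import os
import shutil

# Assumption: OpenMP runtimes usually read this variable once at start-up, so
# it is set before torch, OpenCV or NumPy would be imported in this process.
os.environ["OMP_NUM_THREADS"] = "1"  # illustrative value, not from the thread


def shm_usage(path="/dev/shm"):
    """Return (used_fraction, free_bytes) for the mount, or None if unavailable."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return None
    return usage.used / usage.total, usage.free


result = shm_usage()
if result is None:
    print("no shared-memory mount found at /dev/shm")
else:
    used_fraction, free_bytes = result
    print(f"/dev/shm: {used_fraction:.1%} used, {free_bytes / 2**20:.0f} MiB free")
```

A mount that is close to full while training runs would be consistent with the shared-memory-limit explanation; plenty of free space makes the OpenMP interaction the more likely candidate.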

sixtyfive · 2022-02-11T11:00:58Z

> As the error message says "please report a bug to PyTorch"

My bad, I didn't see this. I'm currently running it again with the original --workers 18 (btw, in some places in Kraken it's still --threads, not --workers). If it happens again I'll try with 0, then report back here and close the issue.

mittagessen closed this as completed Jul 4, 2022
