Exception during training #324

Closed
sixtyfive opened this issue Feb 11, 2022 · 3 comments


sixtyfive commented Feb 11, 2022

Latest binary_datasets commit, same data that always used to go hundreds of epochs no problem:

$ ketos --version
ketos, version 3.0.8.dev100
$ ketos train -d cuda:0 -f binary -o rec --workers 18 -N 10000 -q dumb -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001 -w 0 -B 1 --augment --no-normalize-whitespace data.bin
...
stage 104/10000  [####################################]  2169/2169                                                                                                
validating  [####################################]  241/241          val_accuracy: 0.96538                                                                        
stage 105/10000  [#########---------------------------]  595/2169  00:04:47
Exception in thread Thread-212:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_filename
    storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: falseINTERNAL ASSERT FAILED at "../aten/src/ATen/MapAllocator.cpp":263, please report a bug to PyTorch. unable to open shared memory object </torch_1287290_140> in read-write mode

stage 105/10000  [#########---------------------------]  596/2169  00:04:47
Traceback (most recent call last):
  File "/home/escriptorium/kraken/env/bin/ketos", line 8, in <module>
    sys.exit(cli())
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 1053, in main                                                              
    rv = self.invoke(ctx)                                                                                                                                         
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 1659, in invoke                                                                                                    
    return _process_result(sub_ctx.command.invoke(sub_ctx))                                                                                                                                               
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 1395, in invoke                                                                                                 
    return ctx.invoke(self.callback, **ctx.params)                                                 
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/core.py", line 754, in invoke                                                                                                  
    return __callback(*args, **kwargs)                                                             
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func                                                                                           
    return f(get_current_context(), *args, **kwargs)                                               
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/kraken/ketos.py", line 595, in train                                                                                                 
    trainer.fit(model)                                                                             
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/kraken/lib/train.py", line 108, in fit                                                                                               
    super().fit(*args, **kwargs)                                                                   
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit                                                                              
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path                                                                                                                   
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt                                                       
    return trainer_fn(*args, **kwargs)                                                             
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl                                                                        
    self._run(model, ckpt_path=ckpt_path)                                                          
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run                                                                            
    self._dispatch()                                                                               
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch                                                                       
    self.training_type_plugin.start_training(self)                                                 
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training                                        
    self._results = trainer.run_stage()                                                            
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage                                                                       
    return self._run_train()                                                                       
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train                                                                      
    self.fit_loop.run()                                                                            
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run                                                                                   
    self.advance(*args, **kwargs)                                                                  
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance                                                                           
    self.epoch_loop.run(data_fetcher)                                                              
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run                                                                                   
    self.advance(*args, **kwargs)                                                                  
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 156, in advance                                                          
    batch_idx, (batch, self.batch_progress.is_last_batch) = next(self._dataloader_iter)                                                                                                                
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 203, in __next__                                                                      
    return self.fetching_function()                                                                
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 270, in fetching_function                                                             
    self._fetch_next_batch()                                                                       
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 300, in _fetch_next_batch                                                             
    batch = next(self.dataloader_iter)                                                             
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 550, in __next__                                                                      
    return self.request_next_batch(self.loader_iters)                                              
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 562, in request_next_batch                                                            
    return apply_to_collection(loader_iters, Iterator, next)                                       
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 95, in apply_to_collection                                                          
    return function(data, *args, **kwargs)                                                         
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__                                                                               
    data = self._next_data()                                                                       
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data                                                                            
    idx, data = self._get_data()                                                                   
  File "/home/escriptorium/kraken/env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1147, in _get_data                                                                             
    raise RuntimeError('Pin memory thread exited unexpectedly')                                                                                                                                        
RuntimeError: Pin memory thread exited unexpectedly                                                
$ 

mittagessen commented Feb 11, 2022 via email


There's this issue on the main PyTorch repository: pytorch/pytorch#1355. Possible causes are running into shared memory limits or, if augmentation is enabled, the OpenCV OpenMP issue that is also mentioned there. Kraken sets OpenMP threads to 0 when using the GPU, but the last time I looked the only way to do that reliably system-wide was through the environment variable.
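For anyone who wants to check both suspects before rerunning, here is a rough sketch. It assumes a Linux host where shared memory is mounted at `/dev/shm`, and it assumes `OMP_NUM_THREADS` is the environment variable meant above, with `"1"` as the value to pin it to. The helper names `shm_usage` and `single_threaded_env` are made up for illustration and are not part of Kraken or PyTorch.

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the shared-memory mount.

    Returns None if the path does not exist (e.g. on non-Linux systems).
    The default path is an assumption about the host, not something the
    traceback confirms.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def single_threaded_env(base=None):
    """Copy an environment mapping and pin OpenMP to a single thread.

    OMP_NUM_THREADS and the value "1" are assumptions; adjust them if a
    different variable or value is intended.
    """
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = "1"
    return env


if __name__ == "__main__":
    usage = shm_usage()
    if usage is None:
        print("no /dev/shm mount found")
    else:
        total, used, free = usage
        print(f"/dev/shm: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
    # The returned mapping could be passed as env= to subprocess.run()
    # when launching the training command.
    print(single_threaded_env()["OMP_NUM_THREADS"])
```

If free space in `/dev/shm` is close to zero while the data loader workers are running, the shared memory limit is the more likely culprit. If not, launching the training process with the pinned environment is a cheap way to test the OpenMP theory.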


sixtyfive commented Feb 11, 2022

As the error message says "please report a bug to PyTorch"

My bad, I didn't see this. Currently running again with the original --workers 18 (btw, in some places in Kraken it's still --threads, not --workers). If it happens again I'll try with 0, then report back here and close the issue.
