Multiprocess backend does not work well with nested multiprocessing #1060

Open
bouthilx opened this issue Jan 10, 2023 · 0 comments

When training models with PyTorch, multi-worker data loaders are generally necessary for efficient data loading. The current multiprocess executor in Oríon does not support starting processes inside its parallel workers (which are subprocesses spawned using Python's multiprocessing module): the workers are daemonic, and Python does not allow daemonic processes to have children. This is very constraining and should be fixed.
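
The limitation can probably be reproduced without Oríon or PyTorch. The sketch below is a minimal stand-in, not the executor's actual code: a task running in a standard `multiprocessing` pool worker tries to start its own child process, much like a data loader with several workers would. The names `start_nested_process` and `run_in_pool` are hypothetical, and the example assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp

# Assumption: POSIX system with the "fork" start method available.
CTX = mp.get_context("fork")


def _noop():
    """Target of the nested process; does nothing."""


def start_nested_process(_):
    """Stand-in for a trial that starts its own worker process."""
    try:
        proc = CTX.Process(target=_noop)
        proc.start()  # fails: the pool worker running this is daemonic
        proc.join()
        return "ok"
    except AssertionError as exc:
        return f"AssertionError: {exc}"


def run_in_pool():
    """Run the stand-in trial inside a regular (daemonic) pool worker."""
    with CTX.Pool(processes=1) as pool:
        return pool.apply(start_nested_process, (None,))


if __name__ == "__main__":
    print(run_in_pool())
```

The pool worker reports the same `AssertionError` as the stack trace from the user.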

Example of stack trace reported:

Traceback (most recent call last):
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 25, in _couldpickle_exec
    result = function(*args, **kwargs)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/client/runner.py", line 122, in _optimize
    return fct(**unflatten(kwargs))
  File "main.py", line 112, in run
    signal = task.run()
  File "/home/mila/s/schmidtv/ocp-project/ocp-drlab/ocpmodels/tasks/task.py", line 50, in run
    return self.trainer.train(
  File "/home/mila/s/schmidtv/ocp-project/ocp-drlab/ocpmodels/trainers/single_trainer.py", line 224, in train
    train_loader_iter = iter(self.loaders["train"])
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 444, in __iter__
    return self._get_iterator()
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
    w.start()
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 227, in async_get
    results.append(AsyncResult(future, future.get()))
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 54, in get
    r = self.future.get(timeout)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AssertionError: daemonic processes are not allowed to have children
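
One possible direction, shown only as a sketch and not as the fix adopted in Oríon, is a pool whose workers are never marked daemonic, so that they may start children of their own. This is a commonly used recipe that overrides the `daemon` property of the worker process class. The names `NonDaemonicProcess`, `NonDaemonicContext`, `NestablePool`, `trial` and `run_trials` are hypothetical, and the example again assumes the "fork" start method.

```python
import multiprocessing as mp
import multiprocessing.pool

# Assumption: POSIX system with the "fork" start method available.
_FORK = mp.get_context("fork")


class NonDaemonicProcess(_FORK.Process):
    """Worker process that ignores attempts to mark it daemonic."""

    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        # Pool sets daemon=True on its workers; silently ignore it.
        pass


class NonDaemonicContext(type(_FORK)):
    """Same as the fork context, but creating non-daemonic processes."""

    Process = NonDaemonicProcess


class NestablePool(multiprocessing.pool.Pool):
    """Pool whose workers are allowed to start child processes."""

    def __init__(self, *args, **kwargs):
        kwargs["context"] = NonDaemonicContext()
        super().__init__(*args, **kwargs)


def _square(x):
    return x * x


def trial(n):
    """Stand-in for a trial that needs its own worker processes."""
    with _FORK.Pool(processes=2) as inner:
        return sum(inner.map(_square, range(n)))


def run_trials(sizes):
    """Run each trial in a non-daemonic worker of the outer pool."""
    with NestablePool(processes=2) as pool:
        return pool.map(trial, sizes)


if __name__ == "__main__":
    print(run_trials([3, 4]))
```

A trade-off to keep in mind: non-daemonic workers are not killed automatically when the parent exits, so the pool has to be closed or terminated explicitly (the `with` block above does this).
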
@bouthilx bouthilx added bug Indicates an unexpected problem or unintended behavior medium The bug breaks a feature but it can still be used or causes a confusing user experience labels Jan 10, 2023
@Delaunay Delaunay self-assigned this Jan 16, 2023