
OSError: [Errno 98] Address already in use #1170

Open

karl-koschutnig opened this issue Jan 3, 2024 · 3 comments

@karl-koschutnig

What happened?

Hi. I am unsure if this is the right place to ask my question.
I want to use MRIQC with Nextflow. The idea is that Nextflow runs MRIQC for all subjects in parallel (or at least for a subsample of the subjects). I use an Apptainer container to start MRIQC, and everything works fine until more than one CPU is involved (and more than one is kind of the whole idea).
So I am not sure whether the problem lies with the Apptainer setup, the Nextflow setup, or with MRIQC itself. I tend to think it is a Python (3.9) problem, which is why I am posting it here.

What command did you use?

This is the command (run through Nextflow):
mriqc /bids /out participant -w /tmp --resource-monitor --no-sub --nprocs 1 --omp-nthreads 1 -m bold --participant-label sub-122BPAF172043

So, just one process and one thread.

What version of the software are you running?

23.1.0

How are you running this software?

Other

Is your data BIDS valid?

Yes

Are you reusing any previously computed results?

No

Please copy and paste any relevant log output.

ERROR ~ Error executing process > 'mriqc (27)'

Caused by:
  Process `mriqc (27)` terminated with an error exit status (1)

Command executed:

  mriqc /bids /out participant     -w /tmp --resource-monitor --no-sub     --nprocs 1 --omp-nthreads 1 -m bold --participant-label sub-122BPAF172043

Command exit status:
  1

Command output:
  (empty)

Command error:
  Process SyncManager-2:
  Traceback (most recent call last):
    File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
      self.run()
    File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
      self._target(*self._args, **self._kwargs)
    File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
      server = cls._Server(registry, address, authkey, serializer)
    File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
      self.listener = Listener(address=address, backlog=16)
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
      self._listener = SocketListener(address, family, backlog)
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
      self._socket.bind(address)
  OSError: [Errno 98] Address already in use
  Traceback (most recent call last):
    File "/opt/conda/bin/mriqc", line 8, in <module>
      sys.exit(main())
    File "/opt/conda/lib/python3.9/site-packages/mriqc/cli/run.py", line 104, in main
      with Manager() as mgr:
    File "/opt/conda/lib/python3.9/multiprocessing/context.py", line 57, in Manager
      m.start()
    File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 558, in start
      self._address = reader.recv()
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 255, in recv
      buf = self._recv_bytes()
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
      buf = self._recv(4)
    File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
      raise EOFError
  EOFError

Additional information / screenshots

No response

@vferat commented Jan 22, 2024

Hey,

I have the same problem working with Apptainer on my university's HPC. I have an sbatch script that processes several subjects in parallel:

OSError: [Errno 98] Address already in use
Traceback (most recent call last):
  File "/opt/conda/bin/mriqc", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/mriqc/cli/run.py", line 104, in main
    with Manager() as mgr:
  File "/opt/conda/lib/python3.9/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 558, in start
    self._address = reader.recv()
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
Process SyncManager-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/opt/conda/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/opt/conda/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
    self._socket.bind(address)

It may be that nipype's multiprocessing plugin is raising errors because several mriqc runs are attempting to use the same port to communicate with subprocesses. Ideally, this issue would be addressed in the nipype plugin by allowing the multiprocessing plugin to choose a port that is not already in use.
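
A minimal reproduction of that failure mode, assuming two listeners bind the same fixed TCP address (the port number here is arbitrary):

from multiprocessing.connection import Listener

# Two binds on the same fixed address: the second one raises EADDRINUSE.
first = Listener(address=("127.0.0.1", 50000))
try:
    second = Listener(address=("127.0.0.1", 50000))
except OSError as err:
    print(err)  # [Errno 98] Address already in use
finally:
    first.close()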

As a workaround, I am currently trying the --net --network none options with the apptainer run command, so that each container operates in its own network namespace, preventing conflicts between jobs. It appears to be working for now, and I will update you once the jobs have completed.
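
For reference, the resulting invocation would look roughly like this (the image name and bind mounts are placeholders for my setup):

apptainer run --net --network none -B /data/bids:/bids -B /data/out:/out mriqc.sif mriqc /bids /out participant --no-sub --nprocs 1 --omp-nthreads 1 -m bold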

@effigies (Member) commented

This has nothing to do with nipype; it is Python's multiprocessing managers: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Manager

I've never seen this issue before, and it looks like Manager() doesn't take any arguments. This doesn't look like an easy thing for us to fix, apart from no longer building the workflow in an external process.

@vferat commented Jan 22, 2024

Yes, of course; I didn't mean that the problem comes from nipype per se, but it seems possible to tell multiprocessing.Manager() to automatically choose a free port, as described in this Stack Overflow thread.
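
For example, a sketch along those lines (not what mriqc currently does; the names here are illustrative): requesting port 0 lets the OS pick a free port, and the server process reports back the address it actually bound, which is what the "self._address = reader.recv()" line in the traceback receives.

from multiprocessing.managers import SyncManager

if __name__ == "__main__":
    # Port 0 asks the OS for any free TCP port, avoiding fixed-address clashes.
    mgr = SyncManager(address=("127.0.0.1", 0), authkey=b"example")
    mgr.start()
    print(mgr.address)  # the (host, port) the server process actually bound
    try:
        shared = mgr.dict()  # proxies behave like multiprocessing.Manager()'s
        shared["ok"] = True
    finally:
        mgr.shutdown()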

But I understand this is hard to test and might introduce other issues.

So far, the --net --network none workaround seems to work, but it might create issues for TemplateFlow when fetching templates. (In my case, the templates are already on disk.)
