
Encountering MPI_ERR_COUNT: invalid count argument on multiple nodes #12442

Closed
intelligi123 opened this issue Mar 28, 2024 · 8 comments

@intelligi123

intelligi123 commented Mar 28, 2024

Background information

I am running a qiskit circuit (qiskit is a quantum SDK that provides distributed computing support via Open MPI) on two nodes via MPI.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI was built from source; I followed this link.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 23.10
  • Computer hardware: Dell Inc. Precision Tower 5810 (with NVIDIA GeForce GTX 1660 Super)
  • Network type: LAN

Details of the problem

I am having memory issues, as circuits in qiskit take a huge amount of RAM to simulate qubits. I am not using MPI functions directly in my code; they are built into qiskit, which I built with Open MPI support. Each node has around 16G of RAM, and I am running my program on two similar nodes, combining memory resources of 32G (not adding GPUs for now). If my circuit requires less memory, one node is sufficient, but when the memory requirement grows beyond 16G (say 16.3G), I use the second node, and that is where the problem arises. What I expect is that in distributed mode the computation will be divided and each node will use 16.3/2 G of RAM, which should not create any problems. But looking at the memory stats of both nodes, I found that the memory usage by python3 even exceeds 90%, and this error is thrown:

dell@dell-5810:~$ mpirun -np 4 python3 ghz.py

[dell-5810:06108] Read -1, expected 536870912, errno = 3
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node dell-5810 exited on signal 9 (Killed).

There is another error I get when using a different circuit, one that does not even exceed the memory limit but still generates an error:

mpirun -np 2 -machinefile machinefile.txt --mca orte_base_help_aggregate 0 python3 bz-problem.py 
number of qubits 30
number of qubits 30
[dell-Precision-Tower-5810:03749] *** An error occurred in MPI_Irecv
[dell-Precision-Tower-5810:03749] *** reported by process [1039859713,1]
[dell-Precision-Tower-5810:03749] *** on communicator MPI_COMM_WORLD
[dell-Precision-Tower-5810:03749] *** MPI_ERR_COUNT: invalid count argument
[dell-Precision-Tower-5810:03749] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell-Precision-Tower-5810:03749] ***    and potentially your MPI job)
[dell-5810:04423] *** An error occurred in MPI_Irecv
[dell-5810:04423] *** reported by process [1039859713,0]
[dell-5810:04423] *** on communicator MPI_COMM_WORLD
[dell-5810:04423] *** MPI_ERR_COUNT: invalid count argument
[dell-5810:04423] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell-5810:04423] ***    and potentially your MPI job)
[dell-5810:04414] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198

Both of these circuits use 16384M (16G) of RAM.

Also, I would like to know how I can see Open MPI logs or a more detailed error, since in the first case I am unable to find anything specific.

Here is the second program, which throws the PMIX ERROR (it contains no MPI functions):

from qiskit import *
from qiskit_aer import *
from qiskit.quantum_info import Statevector
from qiskit_algorithms.utils import algorithm_globals
algorithm_globals.random_seed = 1000

secretnumber='11111010010110100111110001000'
qubits=len(secretnumber)+1

print ("number of qubits "+str(qubits))
# Bernstein-Vazirani circuit: one qubit per secret bit plus one ancilla qubit
circuit = QuantumCircuit(len(secretnumber) + 1, len(secretnumber))

circuit.h(range(len(secretnumber)))
circuit.x(len(secretnumber))
circuit.h(len(secretnumber))

# Oracle: CNOT from every input qubit whose secret bit is '1' onto the ancilla
for ii, yesno in enumerate(reversed(secretnumber)):
    if yesno == '1':
        circuit.cx(ii, len(secretnumber))

circuit.h(range(len(secretnumber)))
circuit.measure(range(len(secretnumber)), range(len(secretnumber)))

# Statevector simulation with blocking enabled so the state can be split
# into chunks and distributed across MPI processes
simulator = AerSimulator(method='statevector', device='CPU',
                         seed_simulator=algorithm_globals.random_seed,
                         blocking_enable=True, blocking_qubits=qubits - 2)
result = simulator.run(circuit, shots=1).result()
print(result)
@jsquyres
Member

Is the issue still present in Open MPI v4.1.6?

@intelligi123
Author

Thanks for the response. Since I have to build everything from source, this will take some time. I will get back to you with an update.

@bosilca
Member

bosilca commented Apr 1, 2024

This is an error thrown by the MPI validation layer, and it indicates that you might have called MPI_Irecv with a negative count. This can happen if you try to call it with a size_t instead of an int. Keep in mind that with the current MPI API the count is limited to INT_MAX.
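
A minimal Python sketch (illustration only, not from this thread; the 2**31 element count is hypothetical) of how a count above INT_MAX wraps to a negative value once it is truncated to the 32-bit C int used for MPI_Irecv's count argument:

import ctypes

# INT_MAX for a 32-bit signed int, the largest count MPI_Irecv can accept
INT_MAX = 2**31 - 1

# hypothetical element count, e.g. one large statevector chunk
count = 2**31

print(count > INT_MAX)            # True
print(ctypes.c_int(count).value)  # -2147483648, which MPI sees as a negative count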

@intelligi123
Author

intelligi123 commented Apr 3, 2024

Thank you, I will report this to the qiskit-aer developers, as all MPI calls are managed by qiskit itself. Could upgrading to the newest version of Open MPI change things in my case?

One thing I would like to clarify: if my program uses 16G of RAM, shouldn't dividing it across two nodes halve the RAM used on each node? What I actually see is that both nodes consume 16G of RAM, as if two independent programs were running rather than the work being distributed among processes to compute the final result.

@jsquyres
Member

jsquyres commented Apr 3, 2024

I suggested upgrading to Open MPI 4.1.6 because it is currently the latest release in the 4.1.x series, and should be just bug fixes compared to the v4.1.1 you used. Hence, if there's a bug in Open MPI causing your issue, that's a safe way to see if that bug has been fixed.

That being said, there is also Open MPI v5.0.2 these days -- it should have all the latest bug fixes from the v4.1.x series, but also new functionality. You may or may not care about this.

And that being said, if @bosilca is right and the error message is actually an application error -- i.e., Open MPI is just correctly reporting a programming error in the application itself -- then upgrading to a newer version of Open MPI won't help at all.

However, we do advise using the latest release of a given Open MPI series -- e.g., v4.1.6 -- as a best practice. There are just a bunch of bugs fixed compared to v4.1.1.

I don't really know how qiskit-aer works or what it does, so I can't answer your question about its RAM usage. The general idea of parallel computing, however, is that if you have an application that takes N amount of memory with 1 process, you should be able to split it into 2 processes, each of which should take roughly N/2 amount of memory. If that is not happening in your case, you might need to check with the qiskit-aer developers. You might also want to ensure that you are launching the application across 2 nodes properly (i.e., that it's launching as 1 x 2-node MPI job, not 2 x 1-node MPI jobs).
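
As a quick check, here is a minimal mpi4py sketch (assuming mpi4py is installed; check_ranks.py is just a hypothetical file name) that prints each rank and the node it runs on:

# check_ranks.py -- print which node each MPI rank is running on
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")

Running it with something like mpirun -np 2 -machinefile machinefile.txt python3 check_ranks.py should print two different hostnames if the job is really spread across both nodes.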

@intelligi123
Author

OK, thank you so much. I will look into it and update the status on this post.


It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions bot added the Stale label Apr 19, 2024

github-actions bot commented May 3, 2024

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions bot closed this as not planned May 3, 2024