Encountering "MPI_ERR_COUNT: invalid count argument" when creating GHZ states on multiple nodes #2077

Open
intelligi123 opened this issue Feb 29, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@intelligi123

intelligi123 commented Feb 29, 2024

Information

  • Qiskit Aer version: 0.14.0
  • Python version: 3.11.6
  • Operating system: Ubuntu 23.10

What is the current behavior?

I am running code that creates a GHZ state on 30 qubits with the statevector simulator. On a single node it fails with an insufficient-memory error:

qiskit.exceptions.QiskitError: 'ERROR: [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M , ERROR: Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M'
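The 16384M figure in this message can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes each amplitude is stored as a double-precision complex number (16 bytes), which is consistent with the number in the error; the helper names are made up for illustration and are not part of Qiskit Aer.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128 (double-precision) amplitudes


def statevector_mb(n_qubits):
    """Memory in MiB needed for a dense statevector of n_qubits."""
    return BYTES_PER_AMPLITUDE * 2**n_qubits // 2**20


def max_qubits(memory_mb):
    """Largest qubit count whose dense statevector fits in memory_mb MiB."""
    amplitudes = memory_mb * 2**20 // BYTES_PER_AMPLITUDE
    return amplitudes.bit_length() - 1


# 30 qubits versus the 15903M limit reported in the error message.
print("30 qubits need", statevector_mb(30), "MiB")
print("15903 MiB holds at most", max_qubits(15903), "qubits")
```

Under that assumption, a single node with this memory limit tops out one qubit short of the 30 requested.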
I then added a second node and ran the script across two nodes, but it failed with the MPI error shown below.

command:

mpirun -np 2 -machinefile machinefile.txt python3 ghz.py

Error:

[dell-Precision-Tower-5810:24773] *** An error occurred in MPI_Irecv
[dell-Precision-Tower-5810:24773] *** reported by process [3164471297,0]
[dell-Precision-Tower-5810:24773] *** on communicator MPI_COMM_WORLD
[dell-Precision-Tower-5810:24773] *** MPI_ERR_COUNT: invalid count argument
[dell-Precision-Tower-5810:24773] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell-Precision-Tower-5810:24773] ***    and potentially your MPI job)
[dell-5810:03630] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[dell-Precision-Tower-5810:24768] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dell-Precision-Tower-5810:24768] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Here is the code

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

def create_ghz_circuit(n_qubits):
    circuit = QuantumCircuit(n_qubits)
    circuit.h(0)
    for qubit in range(n_qubits - 1):
        circuit.cx(qubit, qubit + 1)
    return circuit

n_qubits=30
simulator = AerSimulator(method='statevector',device='CPU',blocking_enable=True, blocking_qubits=n_qubits-2)
circuit = create_ghz_circuit(n_qubits)
print(circuit.num_qubits)
circuit.measure_all()
job = simulator.run(circuit)
result = job.result()

Steps to reproduce the problem

Running the script above with the mpirun command shown earlier produces the error.

What is the expected behavior?

The insufficient-memory issue should be resolved, and the code should be able to simulate the GHZ state.

Suggested solutions

The error is raised in MPI's MPI_Irecv call, and "MPI_ERR_COUNT: invalid count argument" suggests that there is some mismatch in the type of the count argument.

@intelligi123 intelligi123 added the bug Something isn't working label Feb 29, 2024
@doichanj
Collaborator

doichanj commented Mar 6, 2024

Could you try running with fewer qubits on 2 nodes, and also with fewer qubits on a single node with multiple processes?

@intelligi123
Author

I selected 28 qubits. The code is the same except that I added algorithm_globals.random_seed = 1000 and set device='GPU':

Here is the code:

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

from qiskit_algorithms.utils import algorithm_globals
algorithm_globals.random_seed = 1000

def create_ghz_circuit(n_qubits):
    circuit = QuantumCircuit(n_qubits)
    circuit.h(0)
    for qubit in range(n_qubits - 1):
        circuit.cx(qubit, qubit + 1)
    return circuit

n_qubits=28
simulator = AerSimulator(method='statevector',seed_simulator = algorithm_globals.random_seed, device='GPU',blocking_enable=True, blocking_qubits=n_qubits-2)
circuit = create_ghz_circuit(n_qubits)
print(circuit.num_qubits)
circuit.measure_all()
job = simulator.run(circuit)
result = job.result()
print(result)

For the case of two nodes, I ran the command below and got the full result from each process:

mpirun -np 2 -machinefile machinefile.txt python3 ghz.py

Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='a7e6782f-e971-4fbc-9503-1395c1bcec4f', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 190.778112021, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'sample_measure_time': 0.051840722, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'max_gpu_memory_mb': 5933, 'method': 'statevector', 'device': 'GPU', 'num_qubits': 28, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'target_gpus': [0], 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000371272, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 26, 'enabled': True}}, time_taken=190.778112021)], date=2024-03-13T10:09:14.735699, status=COMPLETED, header=None, metadata={'time_taken_execute': 190.816238386, 'mpi_rank': 0, 'time_taken_parameter_binding': 5.5836e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 
'max_gpu_memory_mb': 5933, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=190.94678616523743)
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='0e0b0850-a5ef-404d-9dd4-bb2546c3cf68', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-158', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 190.769095649, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'sample_measure_time': 0.062222302, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'max_gpu_memory_mb': 5933, 'method': 'statevector', 'device': 'GPU', 'num_qubits': 28, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'target_gpus': [0], 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000387979, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 26, 'enabled': True}}, time_taken=190.769095649)], date=2024-03-13T10:09:14.723119, status=COMPLETED, header=None, metadata={'time_taken_execute': 190.806562321, 'mpi_rank': 1, 'time_taken_parameter_binding': 4.7389e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 
'max_gpu_memory_mb': 5933, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=193.54268836975098)

Queries:
Here I expected the simulator to share resources and distribute the statevector across the two memory spaces. From the results, however, it looks like two independent circuits are running, one on each node, which is not what I want.
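One way to look at this question is through the top-level `metadata` dictionary printed at the end of each `Result` above, which carries `mpi_rank`, `num_mpi_processes` and `num_processes_per_experiments`. The sketch below only summarizes those fields from a plain dictionary; the helper name is hypothetical, and treating `num_processes_per_experiments > 1` as a sign that one experiment is spread over several processes is an interpretation of the field name, not something confirmed in this thread.

```python
def describe_mpi_layout(metadata):
    """Summarize the MPI-related fields of a Result's top-level metadata dict."""
    per_experiment = metadata.get("num_processes_per_experiments", 1)
    return {
        "rank": metadata.get("mpi_rank"),
        "total_processes": metadata.get("num_mpi_processes", 1),
        "processes_per_experiment": per_experiment,
        # Assumption: more than one process per experiment means the
        # experiment is shared between processes rather than duplicated.
        "looks_distributed": per_experiment > 1,
    }


# Values copied from the two-node output above (rank 0 and rank 1).
rank0 = {"mpi_rank": 0, "num_mpi_processes": 2, "num_processes_per_experiments": 2}
rank1 = {"mpi_rank": 1, "num_mpi_processes": 2, "num_processes_per_experiments": 2}

for meta in (rank0, rank1):
    print(describe_mpi_layout(meta))
```

Every rank prints its own copy of the `Result`, so identical counts on both ranks do not by themselves show that two independent simulations ran.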

For multiple processes on a single node:
When I ran the code above, it generated this error:

std::bad_alloc: cudaErrorMemoryAllocation: out of memory

It worked fine when I selected CPU as the device:

Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='17b7879c-e5b3-4fbf-bb1e-5ef2addb93c7', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 39.796454583, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'device': 'CPU', 'num_qubits': 28, 'sample_measure_time': 0.490546031, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000383349, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'block_bits': 26, 'enabled': True}}, time_taken=39.796454583)], date=2024-03-13T10:11:57.032453, status=COMPLETED, header=None, metadata={'time_taken_execute': 39.965566354, 'mpi_rank': 0, 'time_taken_parameter_binding': 4.7416e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, 
time_taken=39.966766595840454)
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='c33d9971-88e0-44d6-ade9-219e08795d3e', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 39.79647343, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'device': 'CPU', 'num_qubits': 28, 'sample_measure_time': 0.472537557, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.00035926, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'block_bits': 26, 'enabled': True}}, time_taken=39.79647343)], date=2024-03-13T10:11:57.034494, status=COMPLETED, header=None, metadata={'time_taken_execute': 39.96762756, 'mpi_rank': 1, 'time_taken_parameter_binding': 4.3155e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, 
time_taken=39.96878981590271)

I then increased the number of qubits to 31 with the device set to CPU and ran on two nodes. It generated this error:

Simulation failed and returned the following error message:
ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='620aee11-405f-486d-8c1c-1dfae26aeb32', success=False, results=[ExperimentResult(shots=0, success=False, meas_level=2, data=ExperimentResultData(), status=ERROR: Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M, circ_id=0, seed_simulator=0, metadata={'batched_shots_optimization': False, 'measure_sampling': False, 'max_memory_mb': 15903, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'num_clbits': 31, 'num_qubits': 31, 'device': 'CPU', 'input_qubit_map': [[30, 30], [29, 29], [12, 12], [11, 11], [10, 10], [9, 9], [8, 8], [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28]], 'method': 'statevector', 'required_memory_mb': 32768}, time_taken=0.0)], date=2024-03-13T10:21:28.585262, status=ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M, header=None, metadata={'time_taken_execute': 0.011740267, 'mpi_rank': 0, 'time_taken_parameter_binding': 5.0978e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=0.023772716522216797)
Simulation failed and returned the following error message:
ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='6fc632cc-f5ba-4373-977f-d8dd20980c6b', success=False, results=[ExperimentResult(shots=0, success=False, meas_level=2, data=ExperimentResultData(), status=ERROR: Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M, circ_id=0, seed_simulator=0, metadata={'batched_shots_optimization': False, 'measure_sampling': False, 'max_memory_mb': 15903, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'num_clbits': 31, 'num_qubits': 31, 'device': 'CPU', 'input_qubit_map': [[30, 30], [29, 29], [12, 12], [11, 11], [10, 10], [9, 9], [8, 8], [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28]], 'method': 'statevector', 'required_memory_mb': 32768}, time_taken=0.0)], date=2024-03-13T10:21:28.535773, status=ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M, header=None, metadata={'time_taken_execute': 0.013288266, 'mpi_rank': 1, 'time_taken_parameter_binding': 5.1933e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=0.031948089599609375)

Queries:

Here the required memory is 16384M, and the two nodes together provide 15903 + 15903 = 31806M, which would be sufficient for the circuit if it shared resources. But because it is running as two independent circuits, it generates the error.
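A possible alternative reading of the log above, offered as an interpretation rather than confirmed behavior: the metadata reports `'required_memory_mb': 32768`, which matches a full 31-qubit statevector at 16 bytes per amplitude, while the 16384M in the error message matches half of that, that is, the share of one of two processes. The sketch below works through that arithmetic; the function name and the even split between processes are assumptions.

```python
MAX_MEMORY_MB = 15903  # 'max_memory_mb' reported by each process in the log


def per_process_mb(n_qubits, n_processes, bytes_per_amplitude=16):
    """MiB per process if a dense statevector is split evenly (an assumption)."""
    total_bytes = bytes_per_amplitude * 2**n_qubits
    return total_bytes / n_processes / 2**20


for n_qubits in (30, 31):
    need = per_process_mb(n_qubits, n_processes=2)
    verdict = "fits" if need <= MAX_MEMORY_MB else "does not fit"
    print(f"{n_qubits} qubits on 2 processes: {need:.0f} MiB each, {verdict}")
```

On this reading, 31 qubits would exceed the per-node limit even when the statevector is split in two, whereas 30 qubits would need about 8G per node.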

A similar error is generated when I run with device=GPU, only now it comes from CUDA:

std::bad_alloc: cudaErrorMemoryAllocation: out of memory

So the main problem is that my circuit is not running with the statevector distributed and resources shared across nodes. How can I achieve this?

@intelligi123
Author

intelligi123 commented Apr 3, 2024

Hi @doichanj, is there any update on this issue?

By the way, I asked this question on the Open MPI issue tracker, and according to their response this is some sort of type error:

size_t instead of an int to call MPI_Irecv.
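To illustrate why a `size_t` versus `int` mismatch could matter: the `count` parameter of the classic `MPI_Irecv` binding is a C `int`, so it cannot represent values above 2^31 - 1. The sketch below checks whether one chunk of the statevector would fit in such a count. It assumes, without having verified it against the Aer source, that a message carries one chunk of 2^blocking_qubits amplitudes and that the count is expressed in bytes at 16 bytes per amplitude.

```python
INT_MAX = 2**31 - 1  # largest value a 32-bit signed C int can hold


def chunk_message_bytes(blocking_qubits, bytes_per_amplitude=16):
    """Bytes in one chunk of 2**blocking_qubits amplitudes (assumed message size)."""
    return bytes_per_amplitude * 2**blocking_qubits


def fits_in_mpi_count(count):
    """True if count is representable as a non-negative C int."""
    return 0 <= count <= INT_MAX


# The two configurations from this thread use blocking_qubits = n_qubits - 2.
for n_qubits in (28, 30):
    blocking = n_qubits - 2
    size = chunk_message_bytes(blocking)
    print(f"n_qubits={n_qubits}, blocking_qubits={blocking}, "
          f"bytes={size}, fits in int: {fits_in_mpi_count(size)}")
```

Under these assumptions the 28-qubit run stays within range and the 30-qubit run does not, which would match the observed behavior. A smaller `blocking_qubits` value might keep each message within range, though that has not been tested here.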

Can you please suggest what I can do to resolve this, or do I need to wait for a patch?

I just want to make one thing clear. If my circuit needs a total of 16G of RAM, will running two MPI processes on two nodes (one on each) divide the required memory, so that each node uses 8G? In my case both nodes are using 16G of RAM, as if two independent processes (statevectors) were running rather than one distributed statevector.
