Encountering "MPI_ERR_COUNT: invalid count argument" when creating GHZ states on multiple nodes #2077

intelligi123 · 2024-02-29T05:49:24Z

Information

Qiskit Aer version: 0.14.0
Python version: 3.11.6
Operating system: Ubuntu 23.10

What is the current behavior?

I am running code that creates a GHZ state on 30 qubits with the statevector simulator, and it fails with an insufficient memory error:

qiskit.exceptions.QiskitError: 'ERROR: [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M , ERROR: Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M'
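For reference, the 16384M figure in this message is what a dense 30-qubit statevector would need if each amplitude is stored as a double-precision complex number (16 bytes). That storage size is an assumption, and `statevector_mib` is a hypothetical helper written only for this check:

```python
def statevector_mib(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory in MiB for a dense statevector of n_qubits.

    Assumes one double-precision complex number (16 bytes) per amplitude.
    """
    return (2 ** n_qubits) * bytes_per_amplitude // (1024 ** 2)


# Qubit counts that appear in this thread.
for n in (28, 30, 31):
    print(f"{n} qubits: {statevector_mib(n)} MiB")
```

Under that assumption, 30 qubits gives exactly the 16384M in the error message, slightly more than the 15903M reported as available on one node.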
I added a second node and ran the script on two nodes, but it failed with the MPI error shown below.

Command:

mpirun -np 2 -machinefile machinefile.txt python3 ghz.py

Error:

[dell-Precision-Tower-5810:24773] *** An error occurred in MPI_Irecv
[dell-Precision-Tower-5810:24773] *** reported by process [3164471297,0]
[dell-Precision-Tower-5810:24773] *** on communicator MPI_COMM_WORLD
[dell-Precision-Tower-5810:24773] *** MPI_ERR_COUNT: invalid count argument
[dell-Precision-Tower-5810:24773] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dell-Precision-Tower-5810:24773] ***    and potentially your MPI job)
[dell-5810:03630] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[dell-Precision-Tower-5810:24768] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dell-Precision-Tower-5810:24768] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Here is the code:

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

def create_ghz_circuit(n_qubits):
    circuit = QuantumCircuit(n_qubits)
    circuit.h(0)
    for qubit in range(n_qubits - 1):
        circuit.cx(qubit, qubit + 1)
    return circuit

n_qubits = 30
simulator = AerSimulator(method='statevector', device='CPU', blocking_enable=True, blocking_qubits=n_qubits - 2)
circuit = create_ghz_circuit(n_qubits)
print(circuit.num_qubits)
circuit.measure_all()
job = simulator.run(circuit)
result = job.result()

Steps to reproduce the problem

Running the code with mpirun produces the error.

What is the expected behavior?

The insufficient memory issue should be resolved, and the code should be able to simulate the GHZ state.

Suggested solutions

The error is raised in MPI's MPI_Irecv call, and "MPI_ERR_COUNT: invalid count argument" suggests that there is some mismatch in the type of an argument passed to it.


doichanj · 2024-03-06T08:24:27Z

Could you try running with fewer qubits on 2 nodes, and also with fewer qubits on a single node with multiple processes?

intelligi123 · 2024-03-13T05:28:52Z

I selected 28 qubits. The code is the same, except that I added algorithm_globals.random_seed = 1000.

Here is the code:

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

from qiskit_algorithms.utils import algorithm_globals
algorithm_globals.random_seed = 1000

def create_ghz_circuit(n_qubits):
    circuit = QuantumCircuit(n_qubits)
    circuit.h(0)
    for qubit in range(n_qubits - 1):
        circuit.cx(qubit, qubit + 1)
    return circuit

n_qubits = 28
simulator = AerSimulator(method='statevector', seed_simulator=algorithm_globals.random_seed, device='GPU', blocking_enable=True, blocking_qubits=n_qubits - 2)
circuit = create_ghz_circuit(n_qubits)
print(circuit.num_qubits)
circuit.measure_all()
job = simulator.run(circuit)
result = job.result()
print(result)

For the case of two nodes:
I got the full result variable printed as follows:

mpirun -np 2 -machinefile machinefile.txt python3 ghz.py

Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='a7e6782f-e971-4fbc-9503-1395c1bcec4f', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 190.778112021, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'sample_measure_time': 0.051840722, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'max_gpu_memory_mb': 5933, 'method': 'statevector', 'device': 'GPU', 'num_qubits': 28, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'target_gpus': [0], 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000371272, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 26, 'enabled': True}}, time_taken=190.778112021)], date=2024-03-13T10:09:14.735699, status=COMPLETED, header=None, metadata={'time_taken_execute': 190.816238386, 'mpi_rank': 0, 'time_taken_parameter_binding': 5.5836e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 
'max_gpu_memory_mb': 5933, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=190.94678616523743)
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='0e0b0850-a5ef-404d-9dd4-bb2546c3cf68', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-158', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 190.769095649, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'sample_measure_time': 0.062222302, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'max_gpu_memory_mb': 5933, 'method': 'statevector', 'device': 'GPU', 'num_qubits': 28, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'target_gpus': [0], 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000387979, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'chunk_parallel_gpus': 1, 'block_bits': 26, 'enabled': True}}, time_taken=190.769095649)], date=2024-03-13T10:09:14.723119, status=COMPLETED, header=None, metadata={'time_taken_execute': 190.806562321, 'mpi_rank': 1, 'time_taken_parameter_binding': 4.7389e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 
'max_gpu_memory_mb': 5933, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=193.54268836975098)

Queries:
Here I expected the simulator to share resources and distribute the statevector across the two memory spaces, but from the results it looks like two independent circuits are running, one on each node, which is not what I want.
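As a side check on the output above, the hex keys in `counts` can be decoded to confirm that both ranks report only the two outcomes expected from a GHZ state. This is a minimal sketch; `is_ghz_counts` is a hypothetical helper, and the counts are copied from the results pasted above:

```python
def is_ghz_counts(counts: dict[str, int], n_qubits: int) -> bool:
    """Return True if every observed outcome is all-zeros or all-ones."""
    all_ones = (1 << n_qubits) - 1
    # Keys look like '0x0' or '0xfffffff'; int(..., 16) accepts the 0x prefix.
    outcomes = {int(key, 16) for key in counts}
    return outcomes <= {0, all_ones}


# Counts reported by both MPI ranks in the 28-qubit run.
counts = {'0x0': 530, '0xfffffff': 494}
print(is_ghz_counts(counts, 28))
print(sum(counts.values()))
```

The check says nothing about how the statevector was laid out in memory; it only confirms that the sampled outcomes are consistent with a 28-qubit GHZ state and that the shots add up to 1024.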

For multiple processes on a single node:
When I ran the above code, it generated this error:

std::bad_alloc: cudaErrorMemoryAllocation: out of memory

It worked fine when I selected CPU as the device:

Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='17b7879c-e5b3-4fbf-bb1e-5ef2addb93c7', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 39.796454583, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'device': 'CPU', 'num_qubits': 28, 'sample_measure_time': 0.490546031, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.000383349, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'block_bits': 26, 'enabled': True}}, time_taken=39.796454583)], date=2024-03-13T10:11:57.032453, status=COMPLETED, header=None, metadata={'time_taken_execute': 39.965566354, 'mpi_rank': 0, 'time_taken_parameter_binding': 4.7416e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, 
time_taken=39.966766595840454)
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='c33d9971-88e0-44d6-ade9-219e08795d3e', success=True, results=[ExperimentResult(shots=1024, success=True, meas_level=2, data=ExperimentResultData(counts={'0x0': 530, '0xfffffff': 494}), header=QobjExperimentHeader(creg_sizes=[['meas', 28]], global_phase=0.0, memory_slots=28, n_qubits=28, name='circuit-164', qreg_sizes=[['q', 28]], metadata={}), status=DONE, seed_simulator=1000, metadata={'time_taken': 39.79647343, 'num_bind_params': 1, 'parallel_state_update': 2, 'parallel_shots': 1, 'required_memory_mb': 4096, 'input_qubit_map': [[27, 27], [26, 26], [25, 25], [24, 24], [23, 23], [22, 22], [21, 21], [20, 20], [19, 19], [18, 18], [17, 17], [16, 16], [15, 15], [14, 14], [13, 13], [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11], [12, 12]], 'method': 'statevector', 'device': 'CPU', 'num_qubits': 28, 'sample_measure_time': 0.472537557, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'num_clbits': 28, 'remapped_qubits': False, 'runtime_parameter_bind': False, 'max_memory_mb': 15903, 'noise': 'ideal', 'measure_sampling': True, 'batched_shots_optimization': False, 'fusion': {'applied': True, 'time_taken': 0.00035926, 'cost_factor': 1.8, 'parallelization': 1, 'max_fused_qubits': 5, 'method': 'unitary', 'threshold': 14, 'enabled': True}, 'cacheblocking': {'max_multiple_chunk_swaps': 11, 'multiple_chunk_swaps_buffer_qubits': 15, 'multiple_chunk_swaps_enable': True, 'block_bits': 26, 'enabled': True}}, time_taken=39.79647343)], date=2024-03-13T10:11:57.034494, status=COMPLETED, header=None, metadata={'time_taken_execute': 39.96762756, 'mpi_rank': 1, 'time_taken_parameter_binding': 4.3155e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, 
time_taken=39.96878981590271)

I then increased the number of qubits to 31 with the device set to CPU and ran on two nodes, which generated this error:

Simulation failed and returned the following error message:
ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='620aee11-405f-486d-8c1c-1dfae26aeb32', success=False, results=[ExperimentResult(shots=0, success=False, meas_level=2, data=ExperimentResultData(), status=ERROR: Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M, circ_id=0, seed_simulator=0, metadata={'batched_shots_optimization': False, 'measure_sampling': False, 'max_memory_mb': 15903, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'num_clbits': 31, 'num_qubits': 31, 'device': 'CPU', 'input_qubit_map': [[30, 30], [29, 29], [12, 12], [11, 11], [10, 10], [9, 9], [8, 8], [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28]], 'method': 'statevector', 'required_memory_mb': 32768}, time_taken=0.0)], date=2024-03-13T10:21:28.585262, status=ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-164 using the statevector simulator. Required memory: 16384M, max memory: 15903M, header=None, metadata={'time_taken_execute': 0.011740267, 'mpi_rank': 0, 'time_taken_parameter_binding': 5.0978e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=0.023772716522216797)
Simulation failed and returned the following error message:
ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M
Result(backend_name='aer_simulator', backend_version='0.14.0', qobj_id='', job_id='6fc632cc-f5ba-4373-977f-d8dd20980c6b', success=False, results=[ExperimentResult(shots=0, success=False, meas_level=2, data=ExperimentResultData(), status=ERROR: Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M, circ_id=0, seed_simulator=0, metadata={'batched_shots_optimization': False, 'measure_sampling': False, 'max_memory_mb': 15903, 'remapped_qubits': False, 'active_input_qubits': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'num_clbits': 31, 'num_qubits': 31, 'device': 'CPU', 'input_qubit_map': [[30, 30], [29, 29], [12, 12], [11, 11], [10, 10], [9, 9], [8, 8], [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28]], 'method': 'statevector', 'required_memory_mb': 32768}, time_taken=0.0)], date=2024-03-13T10:21:28.535773, status=ERROR:  [Experiment 0] Insufficient memory to run circuit circuit-158 using the statevector simulator. Required memory: 16384M, max memory: 15903M, header=None, metadata={'time_taken_execute': 0.013288266, 'mpi_rank': 1, 'time_taken_parameter_binding': 5.1933e-05, 'num_mpi_processes': 2, 'num_processes_per_experiments': 2, 'omp_enabled': True, 'max_gpu_memory_mb': 0, 'max_memory_mb': 15903, 'parallel_experiments': 1}, time_taken=0.031948089599609375)

Queries:

Here the required memory is 16384M, and the two nodes together provide 15903 + 15903 = 31806M, which would be sufficient for the circuit if resources were shared. Because it runs as two independent circuits, it generates the error.

A similar error is generated when I run with device=GPU, only now it comes from CUDA:

std::bad_alloc: cudaErrorMemoryAllocation: out of memory

So the main problem is that my circuit does not run by distributing the statevector and sharing resources. How can I achieve this?
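One possible reading of the metadata pasted above, offered as an assumption rather than a statement about Aer's internals: if `required_memory_mb` is the size of the whole statevector and it is split evenly across `num_processes_per_experiments`, then the per-process share can be compared with `max_memory_mb`. `per_process_mib` is a hypothetical helper; the numbers are copied from the 31-qubit result:

```python
def per_process_mib(experiment_meta: dict, result_meta: dict) -> int:
    """Even split of the total statevector memory across MPI processes.

    Assumes 'required_memory_mb' is the total for the whole circuit.
    """
    total = experiment_meta['required_memory_mb']
    processes = result_meta['num_processes_per_experiments']
    return total // processes


# Values taken from the failing 31-qubit, two-node CPU run.
experiment_meta = {'required_memory_mb': 32768, 'max_memory_mb': 15903}
result_meta = {'num_processes_per_experiments': 2}

share = per_process_mib(experiment_meta, result_meta)
print(share)
print(share <= experiment_meta['max_memory_mb'])
```

Under this reading, 32768 / 2 = 16384M per process, which equals the "Required memory" figure in the error message and is just above the 15903M limit of each node.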

intelligi123 · 2024-04-03T05:03:52Z

Hi @doichanj, is there any update on this issue?

By the way, I asked this question on the Open MPI issue tracker, and according to their response this is some sort of type error:

size_t instead of an int to call MPI_Irecv.
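The C binding of MPI_Irecv takes its count as an `int`, so a count above 2^31 - 1 cannot be passed. The sketch below assumes, without confirmation from this thread, that one chunk of 2^blocking_qubits amplitudes (16 bytes each) is transferred in a single call with the count given in bytes. `chunk_bytes` and `fits_in_c_int` are hypothetical helpers:

```python
INT_MAX = 2 ** 31 - 1  # largest value of a 32-bit signed C int


def chunk_bytes(blocking_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Size in bytes of one chunk of 2**blocking_qubits amplitudes."""
    return (2 ** blocking_qubits) * bytes_per_amplitude


def fits_in_c_int(count: int) -> bool:
    """True if count can be passed as a non-negative 32-bit signed int."""
    return 0 <= count <= INT_MAX


# Both scripts in this thread set blocking_qubits = n_qubits - 2.
for n_qubits in (28, 30):
    size = chunk_bytes(n_qubits - 2)
    print(n_qubits, size, fits_in_c_int(size))
```

With these assumptions the 28-qubit run (1 GiB chunks) stays within the `int` range, while the 30-qubit run (4 GiB chunks) does not, which would be consistent with the runs reported above. A smaller `blocking_qubits` value may be worth trying as an experiment.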

Can you please suggest what I can do to resolve this, or do I need to wait for a patch?

Just to make one thing clear: if my circuit takes a total of 16G of RAM, will running two MPI processes on two nodes (one on each) divide the required resources (8G on each node) or not? In my case both nodes are using 16G of RAM, as if two independent processes (statevectors) were running rather than one distributed statevector.

intelligi123 added the bug label Feb 29, 2024

doichanj mentioned this issue May 1, 2024

Improve stability of MPI parallel simulations #2115

Open
