
OutOfMemoryError running 16 atoms system scf on 4 * DCU node #4124

Open
9 tasks
ZLI-afk opened this issue May 8, 2024 · 8 comments
Assignees
Labels
GPU & DCU & HPC (GPU, DCU and HPC related issues)

Comments

@ZLI-afk

ZLI-afk commented May 8, 2024


As described in the title. The 16-atom task with kspacing=0.05 Bohr^-1 is attached:
relax_task.zip

Image: registry.dp.tech/dptech/abacus:v3.6.0
node type: 4 * DCU_16g
command: OMP_NUM_THREADS=1 mpirun -np 4 abacus_pw > log

Error msg:

dflow.python.python_op_template.TransientError: abacus failed; err msg:

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            e08r4n18
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              e08r4n18
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   e08r4n18
  Local device: mlx5_0
--------------------------------------------------------------------------
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[e08r4n18:12439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[38950,1],0]
  Exit code:    2
--------------------------------------------------------------------------
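The fatal lines are the four hipErrorOutOfMemory messages; the Open MPI / OpenFabrics warnings above them appear unrelated to the crash. A quick way to pull them out of the mixed output (a suggestion on my part, assuming stderr is redirected into the same log file as stdout):

OMP_NUM_THREADS=1 mpirun -np 4 abacus_pw > log 2>&1
grep -n "out of memory" log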

Task list for Issue attackers (only for developers)

  • Reproduce the performance issue on a similar system or environment.
  • Identify the specific section of the code causing the performance issue.
  • Investigate the issue and determine the root cause.
  • Research best practices and potential solutions for the identified performance issue.
  • Implement the chosen solution to address the performance issue.
  • Test the implemented solution to ensure it improves performance without introducing new issues.
  • Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
  • Review and incorporate any relevant feedback from users or developers.
  • Merge the improved solution into the main codebase and notify the issue reporter.
@ZLI-afk ZLI-afk added the Performance (issues related to failures when running ABACUS) label May 8, 2024
@mohanchen mohanchen added the GPU & DCU & HPC (GPU, DCU and HPC related issues) label and removed the Performance (issues related to failures when running ABACUS) label May 8, 2024
@WHUweiqingzhou WHUweiqingzhou assigned dyzheng and denghuilu and unassigned dyzheng May 9, 2024
@dyzheng
Collaborator

dyzheng commented May 9, 2024

@Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.

@ZLI-afk
Author

ZLI-afk commented May 9, 2024

The same 32-atom task with kspacing=0.08 Bohr^-1 runs on a c64_m64_cpu machine on Bohrium without a memory error. What is the difference? (CPU task ID: 12062725; DCU task ID: 12062239)
Please see corresponding scf.log for details:
running_scf_c64_m64_cpu.log
running_scf_4_DCU.log

@ZLI-afk ZLI-afk changed the title from "OutOfMemoryError running 16 atoms system scf on 4 * DCU node" to "OutOfMemoryError running 32 atoms system scf on 4 * DCU node" May 9, 2024
@Religious-J

@Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.

OK, I analyzed the memory cost for this test case on CPU:

Also running on a c64_m64_cpu machine on Bohrium.
command: OMP_NUM_THREADS=1 mpirun -np 32 abacus
This is the memory allocation recorded by the ModuleBase::Memory::record method:

NAME--------------------------|--MEMORY(MB)------
                         total     39155.9037
                        Psi_PW     37558.5117
                  PW_B_K::gcar       485.6704
                   PW_B_K::gk2       161.8901
                   Force::vkb1       118.3359
           Stress::dbecp_noevc       118.3359
                  Stress::vkb1       118.3359
                      VNL::vkb        59.1680
                  Force::dbecp        48.9375
             wavefunc::wfcatom        47.6631
                 DiagSub::hpsi        47.6631
                 DiagSub::spsi        47.6631
              DiagSub::evctemp        47.6631
       XC_Functional::gradcorr        29.4496
          Broyden_Mixing::F&DF        28.7967
            Nonlocal<PW>::becp        16.3125
              Nonlocal<PW>::ps        16.3125
                   Force::becp        16.3125
                  Stress::becp        16.3125
                 Stress::dbecp        16.3125
                     FFT::grid        15.0000
       XC_Functional::aux&gaux        10.6996
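A rough back-of-the-envelope check based on this table (my own estimate, assuming the ~37.5 GB of Psi_PW reported above is the aggregate over all MPI ranks and is split roughly evenly between them):

37558.5 MB / 4 ranks ≈ 9390 MB ≈ 9.2 GB of wavefunction storage per rank

On a DCU_16g card that leaves little headroom once FFT grids, projectors and work arrays are added, which would be consistent with the 16 GB cards failing while the 24 GB and 32 GB cards reported below succeed.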

@pxlxingliang
Collaborator

I tried to run this example on Bohrium with "4 * NVIDIA GPU_16g", and it also hits the out-of-memory error:

 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory

@ZLI-afk ZLI-afk changed the title from "OutOfMemoryError running 32 atoms system scf on 4 * DCU node" to "OutOfMemoryError running 16 atoms system scf on 4 * DCU node" May 10, 2024
@pxlxingliang
Collaborator

pxlxingliang commented May 13, 2024

I used Bohrium 4 * NVIDIA GPU_24g to run this example, and the calculation succeeded.
This indicates that 4 * 24 GB of GPU memory is enough.

I also tried two nodes on the Sugon DCU cluster, but it still raises the OOM error.
The slurm script is:

#!/bin/bash
#SBATCH --job-name=ABACUS_GPU
#SBATCH --partition=kshdnormal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=dcu:4   # number of DCUs per node
#SBATCH -o %j.out
#SBATCH -e %j.out
#SBATCH --exclusive

abacus_path=/public/home/abacus/abacus-develop
abacus=${abacus_path}/build-dcu-dtk22/abacus_pw

module purge
module load compiler/rocm/dtk-22.10
module load compiler/devtoolset/7.3.1
module load compiler/cmake/3.23.3
module load mpi/hpcx/2.6.0/gcc-7.3.1

OMP_NUM_THREADS=1 mpirun -np 8 $abacus > out.log

I also tried another command, OMP_NUM_THREADS=1 mpirun -np 32 $abacus > out.log, and it hits the OOM error.
Using 4 nodes also hits the OOM error.

It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.

@denghuilu Is this reasonable?

@pxlxingliang
Collaborator

Using Bohrium '4 * DCU_32g', this example runs successfully.

@ZLI-afk
Author

ZLI-afk commented May 17, 2024

Could you please help check whether the following Pb task has the OOM problem on 4 * DCU_32g with the new image registry.dp.tech/dptech/abacus:3.6.3-less-memory:
Pb_32fcc_oom.zip

@denghuilu
Member

I used Bohrium 4 * NVIDIA GPU_24g to run this example, and the calculation succeeded. This indicates that 4 * 24 GB of GPU memory is enough.

I also tried two nodes on the Sugon DCU cluster, but it still raises the OOM error. The slurm script is:

#!/bin/bash
#SBATCH --job-name=ABACUS_GPU
#SBATCH --partition=kshdnormal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=dcu:4   # number of DCUs per node
#SBATCH -o %j.out
#SBATCH -e %j.out
#SBATCH --exclusive

abacus_path=/public/home/abacus/abacus-develop
abacus=${abacus_path}/build-dcu-dtk22/abacus_pw

module purge
module load compiler/rocm/dtk-22.10
module load compiler/devtoolset/7.3.1
module load compiler/cmake/3.23.3
module load mpi/hpcx/2.6.0/gcc-7.3.1

OMP_NUM_THREADS=1 mpirun -np 8 $abacus > out.log

I also tried another command, OMP_NUM_THREADS=1 mpirun -np 32 $abacus > out.log, and it hits the OOM error; using 4 nodes also hits the OOM error.

It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.

@denghuilu Is this reasonable?

We need to check whether all 8 DCUs were actually used when two nodes were requested.
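One way to verify this (a sketch on my part, not something run in this thread; check_devices.sh is a hypothetical name) is to wrap the ABACUS launch so each rank reports the device it sees and the per-card VRAM usage:

#!/bin/bash
# check_devices.sh - hypothetical diagnostic wrapper: print which DCU each MPI rank
# would use, then exec the real binary. OMPI_COMM_WORLD_RANK is set by Open MPI at launch;
# HIP_VISIBLE_DEVICES is only set if the scheduler or runtime assigns devices per rank.
echo "host=$(hostname) rank=${OMPI_COMM_WORLD_RANK:-?} HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES:-unset}"
rocm-smi --showmeminfo vram   # per-card VRAM usage on this node (ROCm utility)
exec "$@"

Usage: OMP_NUM_THREADS=1 mpirun -np 8 ./check_devices.sh $abacus > out.log. If HIP_VISIBLE_DEVICES is unset (or identical) for every rank and rocm-smi shows only one card filling up, the ranks are piling onto a single DCU instead of spreading over the 8 cards.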

Labels
GPU & DCU & HPC (GPU, DCU and HPC related issues)
Projects
None yet
Development

No branches or pull requests

6 participants