
OutOfMemoryError running 16 atoms system scf on 4 * DCU node #4124

Open
9 tasks
ZLI-afk opened this issue May 8, 2024 · 8 comments
Assignees
Labels
GPU & DCU & HPC (GPU, DCU and HPC related issues)

Comments

@ZLI-afk

ZLI-afk commented May 8, 2024


As described in the title. The 16-atom task with kspacing=0.05 Bohr^-1 is attached:
relax_task.zip

Image: registry.dp.tech/dptech/abacus:v3.6.0
node type: 4 * DCU_16g
command: OMP_NUM_THREADS=1 mpirun -np 4 abacus_pw > log

Error msg:

dflow.python.python_op_template.TransientError: abacus failed; err msg:

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            e08r4n18
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              e08r4n18
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   e08r4n18
  Local device: mlx5_0
--------------------------------------------------------------------------
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[e08r4n18:12439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[38950,1],0]
  Exit code:    2
--------------------------------------------------------------------------
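The fatal lines are the four hipErrorOutOfMemory messages; the Open MPI / OpenFabrics warnings above them appear unrelated to the crash. A quick way to pull them out of the mixed output (a suggestion on my part, assuming stderr is redirected into the same log file as stdout):

OMP_NUM_THREADS=1 mpirun -np 4 abacus_pw > log 2>&1
grep -n "out of memory" log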

Task list for Issue attackers (only for developers)

  • Reproduce the performance issue on a similar system or environment.
  • Identify the specific section of the code causing the performance issue.
  • Investigate the issue and determine the root cause.
  • Research best practices and potential solutions for the identified performance issue.
  • Implement the chosen solution to address the performance issue.
  • Test the implemented solution to ensure it improves performance without introducing new issues.
  • Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
  • Review and incorporate any relevant feedback from users or developers.
  • Merge the improved solution into the main codebase and notify the issue reporter.
@ZLI-afk ZLI-afk added the Performance (issues related to failures when running ABACUS) label May 8, 2024
@mohanchen mohanchen added the GPU & DCU & HPC (GPU, DCU and HPC related issues) label and removed the Performance (issues related to failures when running ABACUS) label May 8, 2024
@WHUweiqingzhou WHUweiqingzhou assigned dyzheng and denghuilu and unassigned dyzheng May 9, 2024
@dyzheng
Collaborator

dyzheng commented May 9, 2024

@Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.

@ZLI-afk
Author

ZLI-afk commented May 9, 2024

The same 32-atom task with kspacing=0.08 Bohr^-1 runs on a c64_m64_cpu machine on Bohrium without a memory error. What is the difference? (CPU task ID: 12062725; DCU task ID: 12062239)
Please see corresponding scf.log for details:
running_scf_c64_m64_cpu.log
running_scf_4_DCU.log

@ZLI-afk ZLI-afk changed the title from "OutOfMemoryError running 16 atoms system scf on 4 * DCU node" to "OutOfMemoryError running 32 atoms system scf on 4 * DCU node" May 9, 2024
@Religious-J

@Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.

OK, I analyzed the memory cost for this test case on CPU:

Also running on a c64_m64_cpu machine on Bohrium.
command: OMP_NUM_THREADS=1 mpirun -np 32 abacus
This is the memory allocation recorded by the ModuleBase::Memory::record method:

NAME--------------------------|--MEMORY(MB)------
                         total     39155.9037
                        Psi_PW     37558.5117
                  PW_B_K::gcar       485.6704
                   PW_B_K::gk2       161.8901
                   Force::vkb1       118.3359
           Stress::dbecp_noevc       118.3359
                  Stress::vkb1       118.3359
                      VNL::vkb        59.1680
                  Force::dbecp        48.9375
             wavefunc::wfcatom        47.6631
                 DiagSub::hpsi        47.6631
                 DiagSub::spsi        47.6631
              DiagSub::evctemp        47.6631
       XC_Functional::gradcorr        29.4496
          Broyden_Mixing::F&DF        28.7967
            Nonlocal<PW>::becp        16.3125
              Nonlocal<PW>::ps        16.3125
                   Force::becp        16.3125
                  Stress::becp        16.3125
                 Stress::dbecp        16.3125
                     FFT::grid        15.0000
       XC_Functional::aux&gaux        10.6996
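A rough back-of-the-envelope check based on this table (my own estimate, assuming the ~37.5 GB of Psi_PW reported above is the aggregate over all MPI ranks and is split roughly evenly between them):

37558.5 MB / 4 ranks ≈ 9390 MB ≈ 9.2 GB of wavefunction storage per rank

On a DCU_16g card that leaves little headroom once FFT grids, projectors and work arrays are added, which would be consistent with the 16 GB cards failing while the 24 GB and 32 GB cards reported below succeed.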

@pxlxingliang
Collaborator

I tried to run this example on Bohrium with "4 * NVIDIA GPU_16g", and it also hits the out-of-memory error:

 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory

@ZLI-afk ZLI-afk changed the title from "OutOfMemoryError running 32 atoms system scf on 4 * DCU node" to "OutOfMemoryError running 16 atoms system scf on 4 * DCU node" May 10, 2024
@pxlxingliang
Collaborator

pxlxingliang commented May 13, 2024

I used Bohrium 4 * NVIDIA GPU_24g to run this example, and the calculation succeeded.
This indicates that 4 * 24 GB of GPU memory is enough.

I also tried two nodes on the Sugon DCU cluster, but it still raises the OOM error.
The slurm script is:

#!/bin/bash
#SBATCH --job-name=ABACUS_GPU
#SBATCH --partition=kshdnormal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=dcu:4   # number of DCUs per node
#SBATCH -o %j.out
#SBATCH -e %j.out
#SBATCH --exclusive

abacus_path=/public/home/abacus/abacus-develop
abacus=${abacus_path}/build-dcu-dtk22/abacus_pw

module purge
module load compiler/rocm/dtk-22.10
module load compiler/devtoolset/7.3.1
module load compiler/cmake/3.23.3
module load mpi/hpcx/2.6.0/gcc-7.3.1

OMP_NUM_THREADS=1 mpirun -np 8 $abacus > out.log

I also tried another command, OMP_NUM_THREADS=1 mpirun -np 32 $abacus > out.log, and it hits the OOM error.
Using 4 nodes also hits the OOM error.

It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.

@denghuilu Is this reasonable?

@pxlxingliang
Collaborator

Using Bohrium '4 * DCU_32g', this example runs successfully.

@ZLI-afk
Author

ZLI-afk commented May 17, 2024

Could you please help check whether the following Pb task has the OOM problem on 4 * DCU_32g with the new image registry.dp.tech/dptech/abacus:3.6.3-less-memory:
Pb_32fcc_oom.zip

@denghuilu
Member

I used Bohrium 4 * NVIDIA GPU_24g to run this example, and the calculation succeeded. This indicates that 4 * 24 GB of GPU memory is enough.

I also tried two nodes on the Sugon DCU cluster, but it still raises the OOM error. The slurm script is:

#!/bin/bash
#SBATCH --job-name=ABACUS_GPU
#SBATCH --partition=kshdnormal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=dcu:4   # number of DCUs per node
#SBATCH -o %j.out
#SBATCH -e %j.out
#SBATCH --exclusive

abacus_path=/public/home/abacus/abacus-develop
abacus=${abacus_path}/build-dcu-dtk22/abacus_pw

module purge
module load compiler/rocm/dtk-22.10
module load compiler/devtoolset/7.3.1
module load compiler/cmake/3.23.3
module load mpi/hpcx/2.6.0/gcc-7.3.1

OMP_NUM_THREADS=1 mpirun -np 8 $abacus > out.log

I also tried another command, OMP_NUM_THREADS=1 mpirun -np 32 $abacus > out.log, and it hits the OOM error; using 4 nodes also hits the OOM error.

It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.

@denghuilu Is this reasonable?

We need to check whether all 8 DCUs were actually used when two nodes were requested.
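One way to verify this (a sketch on my part, not something run in this thread; check_devices.sh is a hypothetical name) is to wrap the ABACUS launch so each rank reports the device it sees and the per-card VRAM usage:

#!/bin/bash
# check_devices.sh - hypothetical diagnostic wrapper: print which DCU each MPI rank
# would use, then exec the real binary. OMPI_COMM_WORLD_RANK is set by Open MPI at launch;
# HIP_VISIBLE_DEVICES is only set if the scheduler or runtime assigns devices per rank.
echo "host=$(hostname) rank=${OMPI_COMM_WORLD_RANK:-?} HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES:-unset}"
rocm-smi --showmeminfo vram   # per-card VRAM usage on this node (ROCm utility)
exec "$@"

Usage: OMP_NUM_THREADS=1 mpirun -np 8 ./check_devices.sh $abacus > out.log. If HIP_VISIBLE_DEVICES is unset (or identical) for every rank and rocm-smi shows only one card filling up, the ranks are piling onto a single DCU instead of spreading over the 8 cards.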

Labels
GPU & DCU & HPC (GPU, DCU and HPC related issues)
Projects
None yet
Development

No branches or pull requests

6 participants