Plots show excessive amounts of resources #187

Open
hjjvandam opened this issue Dec 21, 2023 · 2 comments

Comments

@hjjvandam

I am running some workflows on Crusher. The stage with the largest number of tasks runs 64 of them, each using 1 CPU core. The performance analysis plots suggest, however, that around 1000 cores were reserved for this workflow. With 64 CPU cores and 4 GPUs per node, you only get this if the node allocation corresponds to 1 GPU per task, i.e. reserving 16 nodes for 64 single-core tasks. I hope the code isn't actually doing that and that only the plotting is off.
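
For illustration (my addition, a back-of-the-envelope check rather than anything from the report), the node math on Crusher works out like this:

# Crusher node layout: 64 CPU cores and 4 GPUs per node.
cores_per_node = 64
gpus_per_node  = 4
tasks          = 64

# If the allocation were sized as one GPU per task:
nodes_reserved = tasks // gpus_per_node           # 16 nodes
cores_reserved = nodes_reserved * cores_per_node  # 1024 cores, ~1k as plotted
print(nodes_reserved, cores_reserved)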

The performance data is stored at

/lustre/orion/world-shared/chm136/re.session.login2.hjjvd.019706.0000

I have copied the performance plots into the same directory.

The versions of the RADICAL-Cybertools packages are:

(pydeepdrivemd) [hjjvd@login2.crusher test]$ pip list | grep radical
radical.analytics            1.43.0
radical.entk                 1.43.0
radical.gtod                 1.43.0
radical.pilot                1.43.0
radical.saga                 1.43.0
radical.utils                1.44.0

The code I am running lives at

git@github.com:hjjvandam/DeepDriveMD-pipeline.git

in the branch feature/nwchem. The job I am running is specified in https://github.com/hjjvandam/DeepDriveMD-pipeline/blob/feature/nwchem/test/bba/molecular_dynamics_workflow_nwchem_test/config.yaml. Please let me know if you need any further information.

@andre-merzky
Member

Hi Hub,

when running that config file, I see the following resource description being used in this line:

{'access_schema': 'local',
 'cpus': 1024,
 'gpus': 64,
 'project': 'CHM136_crusher',
 'queue': 'batch',
 'resource': 'ornl.crusher',
 'walltime': 180}

so that does seem to indicate that 1k cores are being allocated. Unfortunately, the plotting is correct and the resource allocation is faulty.
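
As a sketch only (my addition, not something from this thread): using the same resource-description keys as in the dict above, a request sized for 64 single-core tasks would look more like the following; whether any GPUs are needed depends on what the tasks actually use.

# Hypothetical resource description sized for 64 single-core tasks on Crusher.
# The keys mirror the dict shown above; the values are an assumption about
# what this workflow actually needs.
resource_desc = {
    'access_schema': 'local',
    'cpus'         : 64,   # one core per task instead of 1024
    'gpus'         : 0,    # raise only if the tasks themselves use GPUs
    'project'      : 'CHM136_crusher',
    'queue'        : 'batch',
    'resource'     : 'ornl.crusher',
    'walltime'     : 180,
}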

@hjjvandam
Author

hjjvandam commented Dec 22, 2023 via email
